README

Presentation

irspdf is a simple text-based information retrieval system for PDF documents.

Text is extracted from the PDF files with pdfplumber.
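As a rough illustration of this extraction step, here is a minimal sketch that uses pdfplumber's `open`, `pages` and `extract_text`. The function names `join_pages` and `extract_pdf_text` are hypothetical and are not part of irspdf; how irspdf itself calls pdfplumber may differ.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages where nothing was extracted."""
    # extract_text() can yield None or an empty string for pages
    # without a text layer (e.g. scanned images), so filter those out.
    return "\n".join(text for text in page_texts if text)


def extract_pdf_text(pdf_path: str) -> str:
    """Return the text of every page of one PDF as a single string."""
    # Hypothetical helper, not irspdf's API. pdfplumber is a third-party
    # package; it is imported here so join_pages works without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # The generator is consumed by join_pages while the file is open.
        return join_pages(page.extract_text() for page in pdf.pages)
```

Calling `extract_pdf_text("some_file.pdf")` on a text-based PDF should return its content page by page; scanned PDFs without a text layer would likely yield an empty string.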

Standard text preprocessing for information retrieval is applied to the extracted text.

The ranking function used is BM25.
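To make the ranking step concrete, here is a minimal sketch of BM25 scoring in plain Python. It is not irspdf's implementation: the function names, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the smoothed IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumed preprocessing: lowercase and split on whitespace only.
    return text.lower().split()


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(doc) for doc in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))

    query_terms = tokenize(query)
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

For example, `bm25_scores("cat", ["the cat sat on the mat", "dogs bark loudly", "a cat and another cat"])` gives the second document a score of zero and ranks the third document above the first, since it is shorter and mentions the query term twice.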

Install with pip:

pip install irspdf

Alternatively, install from source:

git clone https://github.com/Jibril-Frej/irspdf.git
cd irspdf && python setup.py install

To build a collection from a folder of PDF files:

from irspdf import build
build(folder_path, collection_path)

folder_path : path of the folder that contains all the PDF files to include in the collection.

collection_path : file where the collection will be saved.

To query a saved collection:

from irspdf import query
query(collection_path)

collection_path : file where the collection is saved.

To add new PDF files to an existing collection:

from irspdf import update
update(folder_path, collection_path)

folder_path : path of the folder that contains all the PDF files to add to the collection.

collection_path : file where the original collection is saved.