Submodules

irspdf.build module

irspdf.build.build(folder_path, pkl_path)[source]

Builds and saves a collection

Parameters
  • folder_path (str) – folder containing all pdf files used to build the collection

  • pkl_path (str) – pkl file where the collection will be saved
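
As an illustration of what saving to a pkl file can look like, the sketch below round-trips a stand-in object through Python's pickle module. It assumes that "pkl" refers to the pickle format; the dictionary and the file name are hypothetical and are not part of irspdf.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a built collection (not the real IRCollection).
collection = {"vocabulary": {"bm25": 3}, "num_docs": 2}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "collection.pkl")  # hypothetical file name

    # Save the object to disk in binary mode.
    with open(pkl_path, "wb") as fh:
        pickle.dump(collection, fh)

    # Load it back, as a later query or update step would need to do.
    with open(pkl_path, "rb") as fh:
        restored = pickle.load(fh)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.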

irspdf.query module

irspdf.query.query(collection_path)[source]

Reads the collection and prints the documents ranked by relevance with respect to the query

Parameters

collection_path (str) – Path of the collection file

irspdf.update module

irspdf.update.update(folder_path, collection_path)[source]

Updates a collection with the pdf files in a folder and saves it

Parameters
  • folder_path (str) – folder containing all pdf files used to update the collection

  • collection_path (str) – Path of the collection file

Classes

irspdf.ir_collection module

class irspdf.ir_collection.IRCollection(path=None)[source]

Bases: object

Builds a text IR collection from a set of pdf files.

Parameters
  • max_length (int) – max number of char in a valid word

  • vocabulary (collections.Counter) – contains all the words in the collection

  • inverted_index (dict) – inverted index of the collection

  • doc_length (collections.Counter) – contains the length of every document

  • avg_doc_length (float) – average length of documents in the collection

  • min_freq (int) – min number of occurrences for a word to be in the vocabulary

  • idf (dict) – inverse document frequency of all words

  • stops (set) – set of stopwords to be deleted from the vocabulary

  • num_docs (int) – total number of documents in the collection

BM25(query, k1=1.2, b=0.75, k=1000, display=True)[source]

Computes the BM25 score of all the documents with respect to a query

Parameters
  • query – the query as a string

  • k1 – BM25 parameter; must be a positive real value

  • b – BM25 parameter; must be in [0,1]

  • k – max number of documents to return

  • display – if set to True, prints the top-k documents with their scores

Returns

A counter of the documents and their BM25 score

Return type

collections.Counter
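
The ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the irspdf implementation: the function name bm25_rank, the whitespace tokenisation, the word -> {document: frequency} layout of the inverted index, and the idf smoothing variant are all assumptions.

```python
import math
from collections import Counter


def bm25_rank(query, inverted_index, doc_length, k1=1.2, b=0.75, k=1000):
    """Rank documents for a whitespace-tokenised query (illustrative only)."""
    num_docs = len(doc_length)
    avg_doc_length = sum(doc_length.values()) / num_docs
    scores = Counter()
    for word in query.lower().split():
        postings = inverted_index.get(word, {})
        if not postings:
            continue  # words absent from the collection contribute nothing
        # Smoothed idf: one common variant, assumed here.
        n = len(postings)
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
        for doc, freq in postings.items():
            # Length normalisation: b=0 disables it, b=1 applies it fully.
            norm = 1 - b + b * doc_length[doc] / avg_doc_length
            scores[doc] += idf * freq * (k1 + 1) / (freq + k1 * norm)
    # Keep only the k best-scoring documents.
    return Counter(dict(scores.most_common(k)))


# Toy data: assumed layout word -> {document name: term frequency}.
toy_index = {
    "bm25": {"a.pdf": 2, "b.pdf": 1},
    "ranking": {"a.pdf": 1},
}
toy_lengths = Counter({"a.pdf": 3, "b.pdf": 1, "c.pdf": 4})
ranking = bm25_rank("BM25 ranking", toy_index, toy_lengths)
```

Documents that contain none of the query words never enter the counter, so the result can hold fewer than k entries.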

build_collection(path)[source]

Builds the collection from the pdf files in the folder path

Parameters

path (str) – folder containing all pdf files used to build the collection

compute_docs_lengths()[source]

Computes the length of every document from the inverted index
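
One way to derive document lengths from an inverted index is to sum the term frequencies of each document across all posting lists. The sketch below assumes the index maps each word to a {document: frequency} dictionary; that layout and the function name docs_lengths are assumptions, not confirmed details of irspdf.

```python
from collections import Counter


def docs_lengths(inverted_index):
    """Sum term frequencies per document across all posting lists."""
    lengths = Counter()
    for postings in inverted_index.values():
        lengths.update(postings)  # Counter.update adds counts from a mapping
    return lengths


# Toy data: assumed layout word -> {document name: term frequency}.
toy_index = {"bm25": {"a.pdf": 2, "b.pdf": 1}, "ranking": {"a.pdf": 1}}
lengths = docs_lengths(toy_index)
avg_doc_length = sum(lengths.values()) / len(lengths)
```

Lengths computed this way count only the words still present in the index, so words removed earlier (for example stopwords) do not contribute.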

compute_idfs()[source]

Computes the idf of all words in the vocabulary

get_idf(word)[source]

Computes the smoothed idf of a single word

Parameters

word (int) – index of a word in the inverted_index

Returns

idf of word

Return type

float
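
The exact smoothing used by get_idf is not stated here. As a hedged illustration, the sketch below uses one widely used smoothed form, log(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number of documents containing the word. The function name and its count-based arguments are assumptions; the documented method takes a word index instead.

```python
import math


def smoothed_idf(num_docs, doc_freq):
    """Smoothed idf that stays positive even for very common words."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


rare = smoothed_idf(num_docs=10, doc_freq=1)     # word found in 1 of 10 documents
common = smoothed_idf(num_docs=10, doc_freq=9)   # word found in 9 of 10 documents
everywhere = smoothed_idf(num_docs=10, doc_freq=10)
```

The added 1 inside the logarithm keeps the value above zero, so a word that appears in every document still cannot lower a document's score.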

index_words()[source]

Replaces the words in the vocabulary and the inverted index with integer ids
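
A minimal sketch of such a word-to-integer mapping is shown below. The function signature, the sorted id assignment and the returned word_to_id table are assumptions made for illustration; the documented method takes no arguments and works on the collection's own attributes.

```python
from collections import Counter


def index_words(vocabulary, inverted_index):
    """Return a word -> id table plus re-keyed copies of both structures."""
    # Sorting makes the id assignment deterministic (an assumption).
    word_to_id = {word: i for i, word in enumerate(sorted(vocabulary))}
    indexed_vocabulary = Counter(
        {word_to_id[word]: count for word, count in vocabulary.items()}
    )
    indexed_index = {
        word_to_id[word]: postings
        for word, postings in inverted_index.items()
        if word in word_to_id  # skip words that are not in the vocabulary
    }
    return word_to_id, indexed_vocabulary, indexed_index


vocab = Counter({"ranking": 1, "bm25": 3})
index = {"bm25": {"a.pdf": 2, "b.pdf": 1}, "ranking": {"a.pdf": 1}}
word_to_id, indexed_vocab, indexed_index = index_words(vocab, index)
```

Integer keys are smaller than strings to store, which can reduce the size of the saved collection.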

read_all_pdfs(path)[source]

Extracts the text from all the pdf files in path

Parameters

path (str) – folder containing the pdf files

read_pdf(path, docname)[source]

Reads a single pdf file, builds a document from it and updates the vocabulary and the inverted index

Parameters
  • path (str) – pdf file location

  • docname (str) – name that will be given to the document

remove_low_freq()[source]

Deletes from the vocabulary the words that occur fewer than min_freq times
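
The filtering step can be sketched as below. The function signature is hypothetical, and pruning the inverted index alongside the vocabulary is an assumption; the description above only mentions the vocabulary.

```python
from collections import Counter


def remove_low_freq(vocabulary, inverted_index, min_freq):
    """Keep only the words whose total count reaches min_freq."""
    kept = Counter(
        {word: count for word, count in vocabulary.items() if count >= min_freq}
    )
    # Assumption: posting lists of dropped words are removed as well.
    pruned_index = {
        word: postings for word, postings in inverted_index.items() if word in kept
    }
    return kept, pruned_index


vocab = Counter({"bm25": 3, "ranking": 1})
index = {"bm25": {"a.pdf": 2, "b.pdf": 1}, "ranking": {"a.pdf": 1}}
kept_vocab, pruned_index = remove_low_freq(vocab, index, min_freq=2)
```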

score_BM25(word_id, doc, freq, k1, b)[source]

Computes the BM25 score of a term in a document

Parameters
  • word_id (int) – id of the word in the inverted index

  • doc (str) – document name

  • freq (int) – frequency of the word in the document

  • k1 (float) – BM25 parameter; must be a positive real value

  • b (float) – BM25 parameter; must be in [0,1]

Returns

BM25 score of the word in the document

Return type

float
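
For reference, the standard per-term BM25 contribution is idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_doc_len)). The sketch below implements that textbook formula. Its signature is hypothetical: it takes the idf and the lengths as plain arguments, whereas the documented method looks them up from a word id and a document name.

```python
import math


def term_score(idf, freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document."""
    # Length normalisation: longer-than-average documents are penalised.
    norm = 1 - b + b * doc_len / avg_doc_len
    return idf * freq * (k1 + 1) / (freq + k1 * norm)


average_doc = term_score(idf=2.0, freq=1, doc_len=10, avg_doc_len=10)
long_doc = term_score(idf=2.0, freq=1, doc_len=40, avg_doc_len=10)
repeated = term_score(idf=2.0, freq=5, doc_len=10, avg_doc_len=10)
```

The score grows with the term frequency but saturates below idf * (k1 + 1), which is why k1 is often described as the saturation parameter.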

update(collection)[source]

Updates the IRCollection with documents from a new IRCollection

WARNING: The documents in the new IRCollection must be different from the documents in the original IRCollection

Parameters

collection (irspdf.IRCollection) – IRCollection object containing the documents to add to the collection
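
The warning above can be made concrete with a small sketch of such a merge. It is an illustration under stated assumptions, not the irspdf implementation: the helper names, the word -> {document: frequency} index layout and the choice to raise an error on overlapping documents are all assumptions.

```python
from collections import Counter


def documents(inverted_index):
    """Return the set of document names referenced by an inverted index."""
    return {doc for postings in inverted_index.values() for doc in postings}


def merge(vocabulary, inverted_index, new_vocabulary, new_inverted_index):
    """Merge two collections whose document sets must be disjoint."""
    overlap = documents(inverted_index) & documents(new_inverted_index)
    if overlap:
        # Merging the same document twice would double-count its words.
        raise ValueError(f"documents present in both collections: {sorted(overlap)}")
    merged_vocabulary = vocabulary + new_vocabulary  # Counter addition sums counts
    merged_index = {word: dict(postings) for word, postings in inverted_index.items()}
    for word, postings in new_inverted_index.items():
        merged_index.setdefault(word, {}).update(postings)
    return merged_vocabulary, merged_index


old_vocab = Counter({"bm25": 3})
old_index = {"bm25": {"a.pdf": 2, "b.pdf": 1}}
new_vocab = Counter({"bm25": 1, "okapi": 2})
new_index = {"bm25": {"c.pdf": 1}, "okapi": {"c.pdf": 2}}
merged_vocab, merged_index = merge(old_vocab, old_index, new_vocab, new_index)
```

After a merge, derived values such as document lengths, the average document length and the idf table would need to be recomputed, since they depend on the full set of documents.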