Submodules
irspdf.build module
irspdf.query module
irspdf.update module
Classes
irspdf.ir_collection module
- class irspdf.ir_collection.IRCollection(path=None)[source]
Bases: object
Builds a text IR collection from a set of pdf files.
- Parameters
max_length (int) – maximum number of characters in a valid word
vocabulary (collections.Counter) – contains all the words in the collection
inverted_index (dict) – inverted index of the collection
doc_length (collections.Counter) – length of each document in the collection
avg_doc_length (float) – average length of documents in the collection
min_freq (int) – minimum number of occurrences for a word to be in the vocabulary
idf (dict) – inverse document frequency of all words
stops (set) – set of stopwords to be removed from the vocabulary
num_docs (int) – total number of documents in the collection
- BM25(query, k1=1.2, b=0.75, k=1000, display=True)[source]
Computes the BM25 score of all the documents with respect to a query
- Parameters
query – the query as a string
k1 – BM25 parameter; must be a positive real value
b – BM25 parameter; must be in [0,1]
k – maximum number of documents to return
display – if True, prints the top-k documents with their scores
- Returns
A counter of the documents and their BM25 score
- Return type
collections.Counter
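The ranking step described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `bm25_rank`, the layout of the index (word mapped to a `{document name: frequency}` dict), and the pre-tokenised query are all assumptions made for the example.

```python
from collections import Counter

def bm25_rank(query_terms, inverted_index, idf, doc_length, k1=1.2, b=0.75, k=1000):
    """Sum per-term BM25 scores per document and keep the k best (hypothetical helper)."""
    avg_len = sum(doc_length.values()) / len(doc_length)
    scores = Counter()
    for term in query_terms:
        # Assumed postings layout: {document name: term frequency}.
        for doc, freq in inverted_index.get(term, {}).items():
            # Length normalisation: b=0 ignores document length, b=1 applies it fully.
            norm = 1 - b + b * doc_length[doc] / avg_len
            scores[doc] += idf[term] * freq * (k1 + 1) / (freq + k1 * norm)
    # Keep only the k highest-scoring documents, still as a Counter.
    return Counter(dict(scores.most_common(k)))

# Toy data; the idf values are made up for illustration.
index = {"tax": {"a.pdf": 3, "b.pdf": 1}, "form": {"a.pdf": 1, "c.pdf": 2}}
idf = {"tax": 0.47, "form": 0.47}
lengths = Counter({"a.pdf": 100, "b.pdf": 50, "c.pdf": 150})
ranking = bm25_rank(["tax", "form"], index, idf, lengths, k=2)
```

Returning a `Counter` keeps `most_common()` available to the caller, which matches the documented return type.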
- build_collection(path)[source]
Builds the collection from the pdf files in the folder path
- Parameters
path (str) – folder containing all pdf files used to build the collection
- get_idf(word)[source]
Computes the smoothed idf of a single word
- Parameters
word (int) – index of a word in the inverted_index
- Returns
idf of word
- Return type
float
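The exact smoothing used by get_idf is not stated here. As a sketch under that caveat, one widely used smoothed variant adds 0.5 to the counts and 1 inside the logarithm so that the result stays positive even for words that appear in every document. The function name and arguments below are hypothetical.

```python
import math

def smoothed_idf(doc_freq, num_docs):
    """One common smoothed idf (an assumption, not necessarily the library's formula)."""
    # doc_freq: number of documents that contain the word.
    # num_docs: total number of documents in the collection.
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

rare = smoothed_idf(1, 100)         # word found in a single document
common = smoothed_idf(50, 100)      # word found in half of the documents
everywhere = smoothed_idf(100, 100) # word found in every document
```

Rare words get the largest weight, and the weight shrinks toward zero without becoming negative as a word becomes more common.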
- read_all_pdfs(path)[source]
Extracts the text from all the pdf files in path
- Parameters
path (str) – folder containing the pdf files
- read_pdf(path, docname)[source]
Reads a single pdf file, builds a document from it and updates the vocabulary and the inverted index
- Parameters
path (str) – pdf file location
docname (str) – name that will be given to the document
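The bookkeeping that follows text extraction can be sketched as below. PDF parsing itself is left out, and the tokeniser, the function name `add_document`, and the index layout (word mapped to `{document name: frequency}`) are assumptions for illustration rather than the library's actual internals.

```python
import re
from collections import Counter

def add_document(text, docname, vocabulary, inverted_index, doc_length,
                 stops=frozenset(), max_length=20):
    """Update vocabulary, inverted index and document length for one document."""
    # Assumed tokenisation: lowercase alphabetic runs, minus stopwords and overlong words.
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in stops and len(t) <= max_length]
    counts = Counter(tokens)
    vocabulary.update(counts)           # collection-wide word counts
    doc_length[docname] = len(tokens)   # length in kept tokens
    for word, freq in counts.items():
        inverted_index.setdefault(word, {})[docname] = freq

vocabulary, inverted_index, doc_length = Counter(), {}, Counter()
add_document("The tax form and the tax table", "a.pdf",
             vocabulary, inverted_index, doc_length, stops={"the", "and"})
```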
- remove_low_freq()[source]
Deletes from the vocabulary the words that occur fewer than min_freq times
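A minimal sketch of this pruning step, written as a standalone function. Dropping the same words from the inverted index is an assumption added here to keep the two structures consistent; the description above only mentions the vocabulary.

```python
from collections import Counter

def prune_low_freq(vocabulary, inverted_index, min_freq):
    """Remove words seen fewer than min_freq times (hypothetical standalone version)."""
    # Collect first, then delete, to avoid mutating the Counter while iterating.
    rare = [word for word, count in vocabulary.items() if count < min_freq]
    for word in rare:
        del vocabulary[word]
        inverted_index.pop(word, None)  # assumption: index is pruned as well
    return rare

vocab = Counter({"tax": 5, "form": 2, "typo": 1})
index = {"tax": {"a.pdf": 5}, "form": {"a.pdf": 2}, "typo": {"a.pdf": 1}}
removed = prune_low_freq(vocab, index, min_freq=2)
```

A word that occurs exactly min_freq times is kept, since only words below the threshold are removed.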
- score_BM25(word_id, doc, freq, k1, b)[source]
Computes the BM25 score of a term in a document
- Parameters
word_id (int) – id of the word in the inverted index
doc (str) – document name
freq (int) – frequency of the word in the document
k1 (float) – BM25 parameter; must be a positive real value
b (float) – BM25 parameter; must be in [0,1]
- Returns
BM25 score of the word in the document
- Return type
float
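For reference, the standard Okapi BM25 contribution of a term t to a document d is shown below. It is given as the textbook formula, on the assumption that score_BM25 follows it; the implementation may differ in detail. Here f_{t,d} corresponds to freq, |d| to the entry of doc_length for doc, and avgdl to avg_doc_length.

```latex
\mathrm{score}(t, d) = \mathrm{idf}(t)\,
  \frac{f_{t,d}\,(k_1 + 1)}
       {f_{t,d} + k_1\left(1 - b + b\,\dfrac{|d|}{\mathrm{avgdl}}\right)}
```

With b = 0 document length is ignored, and with b = 1 it is fully normalised. Larger k1 values let repeated occurrences of a term keep adding to the score for longer before it saturates.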
- update(collection)[source]
Updates the IRCollection with documents from a new IRCollection
WARNING: The documents in the new IRCollection must be different from the documents in the original IRCollection
- Parameters
collection (irspdf.IRCollection) – IRCollection object containing the documents to add to this collection
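The warning above matters because merging counts for a document that is already present would double-count it. The sketch below illustrates a merge over plain data structures. The function name, the argument layout, and the explicit overlap check are assumptions for illustration and are not taken from the library.

```python
from collections import Counter

def merge_collections(vocab_a, index_a, lengths_a, vocab_b, index_b, lengths_b):
    """Merge collection B into A in place and return the new average document length."""
    overlap = set(lengths_a) & set(lengths_b)
    if overlap:
        # Assumed safeguard: refuse to merge documents that are already indexed.
        raise ValueError(f"documents already in collection: {sorted(overlap)}")
    vocab_a.update(vocab_b)      # Counter.update adds the counts
    lengths_a.update(lengths_b)  # disjoint keys, so this only adds new entries
    for word, postings in index_b.items():
        index_a.setdefault(word, {}).update(postings)
    # Statistics that depend on the whole collection must be recomputed after a merge.
    return sum(lengths_a.values()) / len(lengths_a) if lengths_a else 0.0

vocab_a, index_a, lengths_a = Counter({"tax": 2}), {"tax": {"a.pdf": 2}}, Counter({"a.pdf": 4})
vocab_b = Counter({"tax": 1, "form": 1})
index_b = {"tax": {"b.pdf": 1}, "form": {"b.pdf": 1}}
lengths_b = Counter({"b.pdf": 2})
avg_len = merge_collections(vocab_a, index_a, lengths_a, vocab_b, index_b, lengths_b)
```

Values such as the idf of each word also depend on the total number of documents, so they would need to be refreshed after the merge in the same way as the average length.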