Introduction to Natural Language Processing — TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a combination of two metrics: term frequency (TF) and inverse document frequency (IDF). It is used when we are working with a collection of documents, and it rests on the idea that rare words tell us more about the content of a document than words that appear many times throughout all the documents.
A problem with scoring raw word frequency is that highly frequent words come to dominate the document representation, yet they may carry less “informational content” for the model than rarer but perhaps domain-specific words. One approach is to rescale word frequencies by how often the words appear across all documents, so that the scores of frequent words that are also frequent across the whole corpus are penalized.
TF and IDF are calculated with the following formulas:

TF(t, d) = (number of times term t appears in d) / (total number of terms in d)
IDF(t) = log(N / df(t))
TF-IDF(t, d) = TF(t, d) × IDF(t)

where d refers to a document, t to a term, N is the total number of documents, and df(t) is the number of documents containing term t. TF-IDF scores are word frequency scores that try to highlight the more interesting words: they are high for terms that are frequent in a given document but rare across the rest of the corpus, so they have the effect of highlighting words that are distinctive in that document.
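To make the formulas concrete, here is a minimal sketch that computes TF-IDF by hand for a toy three-document corpus (the documents, tokenization, and function names are illustrative, not a library API):

import math

# Toy corpus: three short "documents" (illustrative only)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)  # total number of documents

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    # Inverse document frequency: log of N over df(t),
    # the number of documents containing the term
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(N / df)

def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

# "the" appears in two of the three documents, so its IDF drags the
# score down; "cat" appears in only one, so it scores higher there.
print(tf_idf("the", tokenized[0]))  # common word -> low score (~0.14)
print(tf_idf("cat", tokenized[0]))  # distinctive word -> higher score (~0.18)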
This can be implemented in Python with scikit-learn. The TfidfVectorizer tokenizes documents, learns the vocabulary and the inverse document frequency weightings, and can then encode new documents. Alternatively, if you already have a fitted CountVectorizer, a TfidfTransformer can be used to calculate just the inverse document frequencies and encode documents from the counts.
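As a sketch of the TfidfVectorizer workflow (the example documents are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Tokenize, learn the vocabulary and the IDF weights from the corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(vectorizer.idf_)                     # learned IDF weights
print(X.shape)                             # (3 documents, vocabulary size)

# Encode a new document using the already-learned vocabulary and IDF
new_vec = vectorizer.transform(["the cat and the dog"])
print(new_vec.toarray())

Note that scikit-learn applies smoothing and normalization by default, so the numbers will differ slightly from the plain formulas above.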
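And a sketch of the alternative path, reusing the counts from a fitted CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# First learn raw term counts with CountVectorizer
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# TfidfTransformer then learns only the IDF weights from those counts
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)

# New documents are encoded by counting first, then reweighting
new_counts = count_vectorizer.transform(["the cat and the dog"])
print(transformer.transform(new_counts).toarray())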