TFIDF (*term frequency-inverse document frequency*) is a machine-learning technique used to quantify the importance of terms in a text corpus. This is useful not only for preparing training data by removing uninformative terms, but also in information retrieval, for finding the documents most relevant to a given query.

This is achieved by multiplying the **Term frequency** and the **Inverse document frequency**, which are defined as follows:

#### Term Frequency

**Term frequency** measures the number of times a given term appears in a single document. Most of the time it is computed by simply counting the occurrences of each term individually, though other schemes, such as boolean frequencies (1 if the term occurs, 0 otherwise), may also be used.
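The raw-count variant described above can be sketched in a few lines of Python (the function name and tokenization by whitespace are illustrative choices, not a fixed standard):

```python
from collections import Counter

def term_frequency(document):
    """Raw term frequency: count how often each term occurs in one document."""
    tokens = document.lower().split()  # naive whitespace tokenization
    return Counter(tokens)

tf = term_frequency("the cat sat on the mat")
# tf["the"] == 2, tf["cat"] == 1
```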

#### Inverse Document Frequency

**IDF** attempts to measure the importance of a given term across a collection of documents. If a term appears in a large number of documents, it likely carries little information.

IDF is calculated by dividing the total number of documents by the number of documents in which a given term appears, and then log-scaling the result. A term therefore only gets a high IDF value if it appears in a small number of documents.
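A minimal sketch of this calculation, assuming the term appears in at least one document (the unsmoothed form; real libraries often add smoothing to avoid division by zero):

```python
import math

def inverse_document_frequency(term, documents):
    """IDF: log of (total documents / documents containing the term)."""
    n_containing = sum(1 for doc in documents if term in doc.lower().split())
    # assumes n_containing > 0; production code would smooth this
    return math.log(len(documents) / n_containing)

docs = ["the cat sat", "the dog ran", "a cat and a dog"]
idf_the = inverse_document_frequency("the", docs)  # "the" is in 2 of 3 docs
```

Here `idf_the` equals `log(3/2)`; a term present in every document would score `log(1) = 0`, reflecting that it carries no discriminating information.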

#### Conclusion

TFIDF is a way to vectorize a text corpus by multiplying the **TF** and **IDF** of every single term. For a given document, a term's TFIDF score is high if the term appears frequently in that document and also does not appear in a large number of other documents.
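Putting the two parts together, a self-contained sketch of this vectorization might look as follows (raw counts and unsmoothed IDF are assumptions; libraries such as scikit-learn use smoothed, normalized variants):

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute a TF-IDF score for every term in every document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

vecs = tfidf_vectors(["the cat sat", "the dog ran", "a cat and a dog"])
# "the" appears in 2 of 3 documents, so its score in the first
# document is 1 * log(3/2)
```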

More information about Natural Language Processing can be found in our article series NLP Insights.