Bag of Words and TF-IDF

Once your text is cleaned and tokenised, you need to turn it into numbers a model can learn from. Bag of Words and TF-IDF are the two most widely used classical approaches. They are fast, interpretable, and — when combined with Logistic Regression — competitive with neural models on many real-world text classification tasks.

Learning Objectives

Explain how Bag of Words encodes text as count vectors
Build and inspect a document-term matrix with CountVectorizer
Explain the TF-IDF formula and why it usually works better than raw counts
Use TfidfVectorizer with meaningful parameters (max_features, ngram_range, sublinear_tf)
Understand n-grams and why bigrams improve sentiment classification
Handle sparse matrices correctly

Bag of Words: Ignore Order, Count Occurrences

Bag of Words (BoW) represents each document as a vector of word counts. The "bag" metaphor is intentional — word order is thrown away. The vocabulary is built from all words across all documents, and each document is described by how many times each vocabulary word appears in it.

Vocabulary: ["acting", "film", "good", "great", "movie", "not", "terrible"]

Document 1: "great movie great acting"
Vector:      [1, 0, 0, 2, 1, 0, 0]

Document 2: "not good not great terrible film"
Vector:      [0, 1, 1, 1, 0, 2, 1]
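The encoding above can be reproduced in a few lines of plain Python. This is a minimal sketch for intuition, not how CountVectorizer is implemented: the helper name bow_vector is made up for illustration, and the vocabulary is built from just the two example documents, so it contains only words that occur in them.

```python
from collections import Counter

documents = [
    "great movie great acting",
    "not good not great terrible film",
]

# Vocabulary: every distinct word across all documents, sorted alphabetically
vocabulary = sorted({word for doc in documents for word in doc.split()})

def bow_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in doc (hypothetical helper)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for doc, vec in zip(documents, vectors):
    print(f"{doc!r} -> {vec}")
```

Because each vector position is tied to one vocabulary word, shuffling the words inside a document leaves its vector unchanged, which is exactly the information the "bag" discards.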

Building BoW with CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "this movie is great",
    "this movie is not great it is terrible",
    "great acting great story great film",
    "terrible waste of time and money",
    "not bad actually pretty good",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# X is a sparse matrix — convert to dense for inspection
vocab = vectorizer.get_feature_names_out()
df_bow = pd.DataFrame(X.toarray(), columns=vocab)
print(df_bow)

# Output (selected columns; CountVectorizer sorts features alphabetically):
#    acting  actually  and  bad  film  good  great  is  it  money  movie  not  ...
# 0       0         0    0    0     0     0      1   1   0      0      1    0  ...
# 1       0         0    0    0     0     0      1   2   1      0      1    1  ...
# 2       1         0    0    0     1     0      3   0   0      0      0    0  ...
# 3       0         0    1    0     0     0      0   0   0      1      0    0  ...
# 4       0         1    0    1     0     1      0   0   0      0      0    1  ...

Key parameters

# max_features: keep only the top N most frequent tokens
# min_df: ignore tokens appearing in fewer than N documents (or fraction)
# max_df: ignore tokens appearing in more than N documents (too common = noise)
# stop_words: built-in English stopword list

vectorizer = CountVectorizer(
    max_features=10_000,   # cap vocabulary size
    min_df=2,              # token must appear in at least 2 documents
    max_df=0.95,           # ignore tokens in >95% of documents
    stop_words="english",  # remove common English stopwords
)

Warning

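CountVectorizer(stop_words="english") uses sklearn's built-in stopword list, which removes negation words such as "not" and "no". NLTK's English stopword list removes "not" as well. For sentiment tasks this can erase the difference between "good" and "not good", so inspect your stopword list and take negations out of it before applying it.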

The Problem with Raw Counts

Raw counts have a bias: words that appear frequently everywhere (like "the", "is", "said") get high counts in every document and dominate the feature space, even though they carry no discriminative power.

Consider two documents:

- Doc A: 100 words total, "movie" appears 3 times → frequency = 3%
- Doc B: 10 words total, "movie" appears 3 times → frequency = 30%

Raw counts treat both the same. TF-IDF fixes both problems: it normalises by document length and penalises words that appear across many documents.
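The arithmetic behind the two-document comparison is shown below as a small sketch. The word "filler" is a placeholder standing in for every other word in each document.

```python
# Two documents of different length that mention "movie" equally often
doc_a = ["movie"] * 3 + ["filler"] * 97   # 100 words
doc_b = ["movie"] * 3 + ["filler"] * 7    # 10 words

relative_freq = {}
for name, doc in [("Doc A", doc_a), ("Doc B", doc_b)]:
    raw_count = doc.count("movie")
    relative_freq[name] = raw_count / len(doc)
    print(f"{name}: raw count = {raw_count}, "
          f"relative frequency = {relative_freq[name]:.0%}")
```

The raw count is 3 in both cases, while the relative frequency differs by a factor of ten, which is the signal that length normalisation recovers.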

TF-IDF: Reward the Distinctive, Penalise the Common

TF-IDF stands for Term Frequency – Inverse Document Frequency. The score for a word in a document has two components:

Term Frequency (TF): How often does this word appear in this document?

TF(word, document) = count of word in document / total words in document

Inverse Document Frequency (IDF): How rare is this word across all documents?

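IDF(word) = log( N / (1 + df(word)) )

Where:
  N = total number of documents
  df(word) = number of documents containing the word
  +1 in the denominator avoids division by zero for a word that appears in no document

sklearn's TfidfVectorizer uses a slightly different smoothed form by default (smooth_idf=True):

IDF(word) = ln( (1 + N) / (1 + df(word)) ) + 1

It adds 1 to both the numerator and the denominator, and another 1 outside the log, so that no term ends up with a weight of zero.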

TF-IDF score:

TF-IDF(word, document) = TF(word, document) × IDF(word)

A word that appears often in one document but rarely elsewhere gets a high TF-IDF score. A word that appears in every document gets an IDF close to zero — it is effectively ignored.

# Manual TF-IDF calculation to build intuition
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Vocabulary of interest
words_of_interest = ["cat", "dog", "the", "sat"]

N = len(docs)

for word in words_of_interest:
    # Document frequency: how many docs contain this word
    df = sum(1 for doc in docs if word in doc.split())
    # sklearn's smoothed IDF: add 1 to numerator and denominator, then 1 outside the log
    idf = math.log((1 + N) / (1 + df)) + 1

    # TF in document 0
    doc0_words = docs[0].split()
    tf = doc0_words.count(word) / len(doc0_words)
    tfidf = tf * idf

    print(f"{word:5s}: df={df}, IDF={idf:.3f}, TF(doc0)={tf:.3f}, TF-IDF(doc0)={tfidf:.3f}")

# Output:
# cat  : df=2, IDF=1.288, TF(doc0)=0.167, TF-IDF(doc0)=0.215
# dog  : df=2, IDF=1.288, TF(doc0)=0.000, TF-IDF(doc0)=0.000
# the  : df=3, IDF=1.000, TF(doc0)=0.333, TF-IDF(doc0)=0.333
# sat  : df=2, IDF=1.288, TF(doc0)=0.167, TF-IDF(doc0)=0.215
# Note: 'the' still has the highest score because its TF is twice as high, but its IDF
# is lower than that of 'cat', correctly reflecting that it is less distinctive.

Using TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

reviews = [
    "absolutely loved this product best purchase ever",
    "not worth the money totally disappointed",
    "amazing quality fast shipping will buy again",
    "broke after two days terrible quality",
    "great value excellent customer service",
    "not what i expected poor quality control",
]

tfidf = TfidfVectorizer(
    max_features=20,       # keep top 20 features
    ngram_range=(1, 2),    # unigrams and bigrams
    sublinear_tf=True,     # replace tf with 1 + log(tf); dampens very high counts
    min_df=1,
)

X = tfidf.fit_transform(reviews)
vocab = tfidf.get_feature_names_out()

# Inspect the matrix
df_tfidf = pd.DataFrame(X.toarray().round(3), columns=vocab)
print(df_tfidf.to_string())
# Each row is a document; each column is a feature.
# Higher values = more distinctive/important in that document.
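In a real project the vectorizer is fitted on training text only and then reused on new text. The sketch below illustrates that split with a few made-up reviews; the second new review consists of invented tokens so that the effect of out-of-vocabulary words is visible.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_reviews = [
    "absolutely loved this product best purchase ever",
    "not worth the money totally disappointed",
    "amazing quality fast shipping will buy again",
]
new_reviews = [
    "loved the quality",   # all three words were seen during fit
    "zzz qqq xxx",         # invented tokens the vectorizer has never seen
]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_reviews)   # learns vocabulary and IDF weights
X_new = tfidf.transform(new_reviews)           # reuses them; do not refit here

# Both matrices share the same columns, so a model trained on X_train can score X_new
print(X_train.shape, X_new.shape)

# Number of non-zero features per new review (row lengths of the CSR matrix)
nnz_per_review = np.diff(X_new.indptr)
print("Non-zeros per new review:", nnz_per_review)
```

Words that were not in the training vocabulary are silently ignored, so a review made up entirely of unseen words becomes an all-zero row.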

Tip

sublinear_tf=True is a good default for TfidfVectorizer in classification tasks. It replaces the raw count tf with 1 + log(tf), which prevents documents with many repetitions of a word from dominating the feature space. The improvement is usually modest (around 1–3 F1 points) and costs nothing extra at training time, but confirm it on your own validation data.

N-Grams: Capturing Phrases

Unigrams (single words) miss the fact that "not good" is very different from "good". N-grams capture sequences of N consecutive words.

from sklearn.feature_extraction.text import CountVectorizer

text = ["the movie was not good at all", "the movie was good fun for all"]

# Unigrams only (default)
uni = CountVectorizer(ngram_range=(1, 1))
uni.fit(text)
print("Unigrams:", list(uni.vocabulary_.keys()))
# Output: Unigrams: ['the', 'movie', 'was', 'not', 'good', 'at', 'all', 'fun', 'for']

# Bigrams only
bi = CountVectorizer(ngram_range=(2, 2))
bi.fit(text)
print("Bigrams:", list(bi.vocabulary_.keys()))
# Output: Bigrams: ['the movie', 'movie was', 'was not', 'not good', 'good at',
#                    'at all', 'was good', 'good fun', 'fun for', 'for all']

# Unigrams + Bigrams (most common in practice)
uni_bi = CountVectorizer(ngram_range=(1, 2))
uni_bi.fit(text)
print("Feature count:", len(uni_bi.vocabulary_))
# Output: Feature count: 19

Info

Adding bigrams typically improves sentiment classification accuracy because phrases like "not bad", "very good", "not worth", and "highly recommend" carry sentiment that the individual words do not. Adding trigrams rarely helps further and increases vocabulary size significantly.

Warning

N-gram vocabularies grow fast. Unigrams on a 50,000-document corpus might yield 80,000 features. Adding bigrams can push that to 500,000+. Always use max_features to cap vocabulary size, or min_df to cut rare n-grams. Sparse matrices handle this gracefully, but your model training time scales with feature count.
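The growth is visible even on a toy corpus. The sketch below counts features for three n-gram ranges and then applies the two remedies mentioned above; the four sentences and the thresholds are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the movie was not good at all",
    "the movie was good fun for all",
    "not worth the money at all",
    "the acting was not bad",
]

# Vocabulary size for unigrams, unigrams + bigrams, and up to trigrams
sizes = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range).fit(corpus)
    sizes[ngram_range] = len(vec.vocabulary_)
    print(f"ngram_range={ngram_range}: {sizes[ngram_range]} features")

# Remedy 1: hard cap on the number of features (most frequent ones are kept)
capped = CountVectorizer(ngram_range=(1, 2), max_features=10).fit(corpus)
print("max_features=10:", len(capped.vocabulary_), "features")

# Remedy 2: drop n-grams that occur in fewer than 2 documents
pruned = CountVectorizer(ngram_range=(1, 2), min_df=2).fit(corpus)
print("min_df=2:", sorted(pruned.vocabulary_))
```

Note the trade-off with min_df: on a corpus this small it also removes informative one-off bigrams such as "not good", so the threshold should be tuned against validation performance rather than set blindly.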

Sparse Matrices: What They Are and Why They Matter

The document-term matrix from a realistic corpus is enormous and almost entirely zeros. A 10,000-document corpus with a 50,000-word vocabulary would have 500 million cells — but a typical document uses perhaps 200 unique words, so 99.6% of the matrix is zero.

Scipy sparse matrices store only the non-zero values and their positions, making this tractable:

from sklearn.feature_extraction.text import TfidfVectorizer

# Simulated small corpus
corpus = [f"document {i} about topic {i % 5} with some words" for i in range(1000)]

# max_features is an upper bound; this corpus has fewer than 5,000 distinct tokens
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(corpus)

print(f"Shape: {X.shape}")
print(f"Dense size: {X.shape[0] * X.shape[1]:,} cells")
print(f"Non-zero elements: {X.nnz:,}")                    # only these are stored
print(f"Sparsity: {1 - X.nnz / (X.shape[0] * X.shape[1]):.1%}")

# Output:
# Shape: (1000, 996)
# Dense size: 996,000 cells
# Non-zero elements: 6,990
# Sparsity: 99.3%
# (The default tokenizer drops single-character tokens, so the digits 0-9 never
# become features: 6 words + the numbers 10-999 = 996 columns.)

# Most sklearn models accept sparse matrices directly. Do NOT call .toarray()
# on large matrices; it will exhaust your RAM.
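The sketch below shows both points on a toy dataset: how much memory the sparse representation actually stores, and that a model can be trained on it without densifying. The four texts and their labels are made up for illustration, and LogisticRegression is used only as an example of an estimator that accepts sparse input.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "great movie loved it",
    "terrible film waste of time",
    "loved the acting great story",
    "boring plot terrible acting",
]
labels = [1, 0, 1, 0]   # toy sentiment labels: 1 = positive, 0 = negative

X = TfidfVectorizer().fit_transform(texts)
print("Is sparse:", sparse.issparse(X))

# Bytes stored by the CSR matrix (values + column indices + row pointers)
# versus the equivalent dense array of the same dtype
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize
print(f"sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")

# The sparse matrix goes straight into the model; no .toarray() needed
clf = LogisticRegression().fit(X, labels)
predictions = clf.predict(X)
print("Predictions on the training texts:", predictions)
```

The saving is small here because the matrix is tiny; it grows with vocabulary size, since the dense cost scales with rows × columns while the sparse cost scales with the number of non-zero entries.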

Success

TF-IDF is not an old or inferior technique. On small and medium-sized labelled datasets (up to roughly 50,000 examples), a well-tuned TF-IDF + Logistic Regression pipeline is often competitive with fine-tuned BERT, at a fraction of the training time and inference cost. Always establish a TF-IDF baseline before reaching for transformers.

What's Next

You've covered the Bag-of-Words document-term matrix, TF-IDF weighting and the IDF formula, n-gram ranges for capturing phrase-level signal, the memory-efficiency of scipy sparse matrices, and the TF-IDF baseline argument against premature adoption of transformers. Next up: 04-sentiment-classification — where you'll combine preprocessing, TF-IDF, and logistic regression into a complete end-to-end text classification pipeline, inspect feature coefficients to understand what the model learned, tune the classification threshold for precision/recall tradeoffs, and benchmark against Multinomial Naive Bayes.

Optional Deep Dive

Read Salton and Buckley (1988), "Term-Weighting Approaches in Automatic Text Retrieval" (Information Processing and Management). It is a classic comparison of term-weighting schemes and covers the kinds of TF, IDF, and normalisation choices that sklearn's TfidfVectorizer exposes through parameters such as sublinear_tf, smooth_idf, and norm. The IDF idea itself goes back to Spärck Jones (1972).
