Sentiment Classification

Sentiment classification is the "hello world" of NLP — but doing it well requires understanding pipelines, avoiding data leakage, interpreting model weights, and making principled decisions about the precision-recall trade-off. This note walks through a complete, production-patterned classifier from raw text to evaluated predictions.

Learning Objectives

Build an end-to-end text classification pipeline with TfidfVectorizer and LogisticRegression
Avoid the most common form of NLP data leakage (fitting the vectoriser on the full dataset)
Inspect model coefficients to understand which words drive positive and negative predictions
Compare Logistic Regression against Naive Bayes and understand the trade-offs
Adjust the classification threshold to shift the precision-recall balance

Why Pipelines Prevent Data Leakage

The most common mistake when building a text classifier is fitting the vectoriser on the full dataset before splitting into train and test sets.

# WRONG — leaks test vocabulary into training
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(all_texts)   # learns vocabulary from ALL data
# Then split X into train/test — but the vectoriser already saw test words

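Because the vectoriser is fitted on every document, its vocabulary and IDF weights reflect the test set too. Test-only words become feature columns, and the document frequencies behind every IDF weight include test documents, so the matrix the model trains on already carries information about the test data. This is a subtle but real form of data leakage.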
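To make the leak concrete, here is a minimal sketch that fits one vectoriser on all texts and another on the training texts only, then compares what each has learned. The four toy sentences and the names `leaky` and `clean` are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (hypothetical): the last text plays the role of the test set.
train_texts = ["great product", "terrible product", "great value"]
test_texts = ["awful value"]

leaky = TfidfVectorizer().fit(train_texts + test_texts)   # sees the test text
clean = TfidfVectorizer().fit(train_texts)                # sees training text only

# Words the leaky vectoriser knows only because it saw the test set
test_only_words = set(leaky.vocabulary_) - set(clean.vocabulary_)
print(sorted(test_only_words))  # ['awful']

# The IDF weight of a shared word shifts too, because the document counts changed
idf_leaky = leaky.idf_[leaky.vocabulary_["value"]]
idf_clean = clean.idf_[clean.vocabulary_["value"]]
print(f"IDF of 'value': leaky={idf_leaky:.3f}, clean={idf_clean:.3f}")
```

On a large corpus the shift in each weight is small, but it is still information from the test set reaching the training features.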

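# CORRECT: vectoriser sees only training data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression()),
])

# X_train and X_test are raw text, split BEFORE any vectorisation
pipe.fit(X_train, y_train)    # TF-IDF is fitted on X_train only
pipe.predict(X_test)          # TF-IDF transforms X_test using the training vocabulary and IDF weights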

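Pipeline.fit() calls fit_transform() on every step except the last, and fit() on the final estimator. Pipeline.predict() then calls transform() on those earlier steps and predict() on the final estimator, so test text is only ever transformed, never fitted. You get the correct behaviour with no extra code.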
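As a rough sketch of what the pipeline does on your behalf, the block below writes out the fit and predict steps by hand and checks that they agree with the pipeline. The toy texts, labels, and the name `demo_pipe` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration
train_texts = ["great product", "love it", "terrible product", "hate it"]
train_labels = [1, 1, 0, 0]
test_texts = ["great, love it", "terrible, hate it"]

# Manual version: fit_transform on train, transform (never fit) on test
vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)
clf = LogisticRegression().fit(X_tr, train_labels)
manual_pred = clf.predict(vec.transform(test_texts))

# Pipeline version: the same calls happen inside fit() and predict()
demo_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression()),
])
demo_pipe.fit(train_texts, train_labels)
pipe_pred = demo_pipe.predict(test_texts)

print(np.array_equal(manual_pred, pipe_pred))
```

The manual version is easy to get wrong (for example by calling fit_transform on the test texts), which is the main argument for letting the pipeline own these calls.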

Building a Realistic Dataset

Four examples are not enough to evaluate a classifier. Here is a synthetic but realistic dataset of product reviews:

import pandas as pd
import numpy as np

positive_reviews = [
    "This product exceeded all my expectations. Highly recommend.",
    "Excellent quality, fast shipping, and great customer service.",
    "I've been using this for three months and it still works perfectly.",
    "Solid build quality. Worth every penny.",
    "Amazing product. Bought two more for my family.",
    "Works exactly as described. Very happy with my purchase.",
    "Surprisingly good for the price. Will definitely buy again.",
    "Five stars. Does exactly what it says on the tin.",
    "Best purchase I've made all year. Cannot recommend enough.",
    "Outstanding quality and arrived ahead of schedule.",
    "Great product, very easy to use, works perfectly.",
    "Very satisfied. The quality is much better than expected.",
    "Brilliant. Exactly what I needed. Fast delivery too.",
    "Good value for money. Would recommend to anyone.",
    "Lovely product, well packaged, no issues at all.",
]

negative_reviews = [
    "Broke after two weeks. Complete waste of money.",
    "Not what was advertised. Very disappointed with the quality.",
    "Terrible product. Stopped working after three days.",
    "Would not recommend. Poor quality and slow delivery.",
    "This is not worth the price. Extremely disappointed.",
    "Arrived damaged and customer service was unhelpful.",
    "Do not buy this. It broke the first time I used it.",
    "Cheap material, poor construction. A total disappointment.",
    "Took six weeks to arrive and was completely useless.",
    "One star. This is the worst product I have ever bought.",
    "Nothing worked out of the box. Had to return immediately.",
    "Packaging was fine but the product itself is garbage.",
    "Misleading description. This is not what I ordered.",
    "Very poor quality. Fell apart within a week.",
    "Avoid this seller. Product is a cheap knockoff.",
]

df = pd.DataFrame({
    "review": positive_reviews + negative_reviews,
    "sentiment": [1] * len(positive_reviews) + [0] * len(negative_reviews),
})

print(df["sentiment"].value_counts())
# Output:
# 1    15
# 0    15

Training the Classifier

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

X = df["review"]
y = df["sentiment"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),     # unigrams and bigrams
        max_features=5000,
        sublinear_tf=True,      # log-scale term frequency
        stop_words=None,        # do NOT remove stopwords — 'not' matters for sentiment
    )),
    ("model", LogisticRegression(
        C=1.0,                  # inverse regularisation strength (smaller C = stronger regularisation)
        max_iter=1000,
        random_state=42,
    )),
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))

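# Illustrative output from a 15-review test set. With the 30 reviews above and
# test_size=0.25, the test set holds only 8 reviews (4 per class), so your
# support counts and scores will differ:
#               precision    recall  f1-score   support
#     Negative       0.88      0.88      0.88         8
#     Positive       0.86      0.86      0.86         7
#     accuracy                           0.87        15

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Rows are true classes, columns are predicted classes (0 = Negative, 1 = Positive).
# Illustrative output matching the report above:
# [[7 1]
#  [1 6]]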
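With so few reviews, a single train/test split gives a noisy estimate. One common remedy is stratified k-fold cross-validation, sketched below. Because the whole pipeline is passed to cross_val_score, the TF-IDF step is refitted inside each fold and no fold sees the others' vocabulary. The twelve stand-in reviews, the name `cv_pipe`, and the choice of three folds are assumptions for this example, and in practice you would pass df["review"] and df["sentiment"] instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Small stand-in dataset (hypothetical)
texts = [
    "works perfectly, highly recommend", "excellent quality, very happy",
    "great value, would buy again", "love it, works great",
    "very happy with the quality", "excellent, highly recommend",
    "broke after a week, very disappointed", "poor quality, waste of money",
    "terrible, would not recommend", "stopped working, very poor",
    "waste of money, broke quickly", "very disappointed, terrible quality",
]
labels = [1] * 6 + [0] * 6

cv_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])

# Stratification keeps the class balance the same in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(cv_pipe, texts, labels, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy:   {scores.mean():.2f} (std {scores.std():.2f})")
```

The spread across folds is as informative as the mean: a large standard deviation is a sign that the dataset is too small for any single split to be trusted.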

Interpreting Model Coefficients

Logistic Regression's coefficients tell you exactly which features pushed the model toward positive or negative sentiment. This is one of the key advantages of classical models over black-box neural networks.

import numpy as np

# Extract the vectoriser and model from the pipeline
vectoriser = pipe.named_steps["tfidf"]
model = pipe.named_steps["model"]

# Get feature names and their coefficients
feature_names = vectoriser.get_feature_names_out()
coefficients = model.coef_[0]   # shape: (n_features,)

# Pair each feature with its coefficient
feature_coef = list(zip(feature_names, coefficients))

# Sort by coefficient value
feature_coef.sort(key=lambda x: x[1])

print("Top 10 NEGATIVE features (push toward class 0):")
for feature, coef in feature_coef[:10]:
    print(f"  {feature:30s}  {coef:.3f}")

print("\nTop 10 POSITIVE features (push toward class 1):")
for feature, coef in feature_coef[-10:][::-1]:
    print(f"  {feature:30s}  {coef:.3f}")

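# Output (illustrative; exact features and values depend on the split and will
# differ on this small dataset. With ngram_range=(1, 2) only lowercase unigrams
# and bigrams can appear):
# Top 10 NEGATIVE features (push toward class 0):
#   waste of                       -2.341
#   broke after                    -2.187
#   do not                         -1.956
#   terrible                       -1.823
#   very disappointed              -1.741
#   ...
#
# Top 10 POSITIVE features (push toward class 1):
#   highly recommend               2.418
#   excellent quality              2.203
#   cannot recommend               1.987
#   exceeded all                   1.876
#   ...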

Tip

Inspecting top features is one of the most valuable debugging steps for a text classifier. If you see nonsensical features at the top (like usernames, product codes, or date formats), they are probably leaking label information from your training data. Fix the preprocessing, not the model.

Predicting New Reviews

new_reviews = [
    "This is not what I expected but it works fine.",
    "Absolutely brilliant. My kids love it.",
    "Stopped working after a week. Very frustrating.",
    "Not bad for the price. Would consider buying again.",
]

predictions = pipe.predict(new_reviews)
probabilities = pipe.predict_proba(new_reviews)

for review, pred, probs in zip(new_reviews, predictions, probabilities):
    label = "Positive" if pred == 1 else "Negative"
    confidence = max(probs)
    print(f"[{label} | {confidence:.0%}] {review}")

# Output (illustrative; labels and confidences depend on the fitted model):
# [Negative | 73%] This is not what I expected but it works fine.
# [Positive | 91%] Absolutely brilliant. My kids love it.
# [Negative | 88%] Stopped working after a week. Very frustrating.
# [Positive | 62%] Not bad for the price. Would consider buying again.
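One detail worth checking before slicing predict_proba: its columns follow the order of the classifier's classes_ attribute, so the positive class is not guaranteed to sit in column 1 for every labelling scheme. A minimal sketch, using a tiny invented dataset and a hypothetical `proba_pipe`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up dataset, for illustration only
texts = ["great product", "love it", "terrible product", "hate it"]
labels = [1, 1, 0, 0]

proba_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression()),
]).fit(texts, labels)

print(proba_pipe.classes_)   # column order of predict_proba

probs = proba_pipe.predict_proba(["great, love it"])[0]
pos_col = list(proba_pipe.classes_).index(1)   # look the column up instead of assuming it
print(f"P(positive) = {probs[pos_col]:.2f}, confidence = {probs.max():.2f}")
```

With 0/1 labels the lookup returns column 1, but the same code keeps working if the labels are later changed to strings such as "neg" and "pos".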

Naive Bayes: A Strong Baseline for Text

Multinomial Naive Bayes is fast, requires very little data to train, and often performs surprisingly well on text classification. It is the right model to compare against before committing to Logistic Regression.

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

# Note: MultinomialNB requires non-negative features.
# TF-IDF values are always non-negative, so this works, although they are
# fractional weights rather than the raw counts the multinomial model assumes.
# sublinear_tf=False is a modelling choice here, not a requirement: 1 + log(tf)
# is still non-negative, so MultinomialNB accepts it either way.

nb_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=5000,
        sublinear_tf=False,    # term frequency stays linear in the raw count
        stop_words=None,
    )),
    ("model", MultinomialNB(alpha=1.0)),  # alpha=1.0 is Laplace (add-one) smoothing
])

nb_pipe.fit(X_train, y_train)
y_pred_nb = nb_pipe.predict(X_test)

print("Naive Bayes:")
print(classification_report(y_test, y_pred_nb, target_names=["Negative", "Positive"]))

Comparing the two models

Property	Logistic Regression	Multinomial Naive Bayes
Works well with	Larger datasets (>1k examples)	Small datasets (<500 examples)
Training speed	Slower	Very fast
Interpretability	Coefficients	Feature log-probabilities
Handles negation	Better	Worse
Default performance	Usually higher	Surprisingly competitive
Assumes feature independence	No	Yes (and usually wrong — but still works)

Info

Naive Bayes is "naive" because it assumes that all features (words) are conditionally independent of one another given the class, which is clearly false in language: "New" and "York" co-occur far more often than independence would predict. The violation mostly distorts the predicted probabilities, which tend to be pushed toward 0 or 1. Classification only needs the correct class to receive the highest score, and that ranking often survives the distortion, which is why Naive Bayes stays competitive on many text tasks.

Adjusting the Classification Threshold

By default, predict() uses a threshold of 0.5 — if the predicted probability of the positive class exceeds 50%, the model predicts positive. Changing this threshold shifts the precision-recall trade-off.

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Get probability scores for the positive class
y_scores = pipe.predict_proba(X_test)[:, 1]

# Evaluate at different thresholds
print(f"{'Threshold':>10} {'Precision':>10} {'Recall':>10} {'F1':>10}")
for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    y_pred_thresh = (y_scores >= threshold).astype(int)
    precision = precision_score(y_test, y_pred_thresh, zero_division=0)
    recall = recall_score(y_test, y_pred_thresh, zero_division=0)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    print(f"{threshold:>10.1f} {precision:>10.3f} {recall:>10.3f} {f1:>10.3f}")

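# Output (illustrative, from a test set with 7 positive reviews; on the 8-review
# split above there are 4 positives, so recall can only be a multiple of 0.25):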
# Threshold  Precision     Recall         F1
#       0.3      0.700      1.000      0.824
#       0.4      0.750      0.857      0.800
#       0.5      0.857      0.857      0.857
#       0.6      1.000      0.714      0.833
#       0.7      1.000      0.571      0.727

Warning

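The default threshold of 0.5 treats false positives and false negatives as equally costly; it says nothing about business value. If you are building a content moderation classifier in which "toxic" is the positive class, missing a toxic review (false negative) is usually far more costly than over-flagging a neutral one (false positive). In that case, lower the threshold to increase recall at the cost of precision. Always define the cost asymmetry before choosing a threshold, and choose the threshold on validation data rather than on the test set.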
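Once the cost asymmetry is written down, picking a threshold becomes a small search. The sketch below assumes a false negative costs five times as much as a false positive. The cost values, labels, and scores are all invented for illustration, and in practice the scores would come from predict_proba on a validation split.

```python
import numpy as np

# Hypothetical validation labels and positive-class scores
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.55, 0.6, 0.35, 0.4, 0.2, 0.1])

COST_FN = 5.0   # assumed cost of missing a positive
COST_FP = 1.0   # assumed cost of a false alarm

def total_cost(threshold):
    """Total cost of the errors made when predicting positive at or above the threshold."""
    pred = (scores >= threshold).astype(int)
    fn = np.sum((pred == 0) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    return COST_FN * fn + COST_FP * fp

thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
costs = [total_cost(t) for t in thresholds]

for t, c in zip(thresholds, costs):
    print(f"threshold={t:.1f}  cost={c:.1f}")

best = thresholds[int(np.argmin(costs))]
print(f"Lowest-cost threshold: {best}")  # 0.3
```

Because misses are expensive in this setup, the search settles on the lowest threshold. Swapping the two cost values would push it toward the high-precision end instead.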

Success

A complete, working sentiment classifier in under 30 lines of code: preprocess, TF-IDF, train, evaluate, interpret. This is your baseline, and every more complex model you build should be compared against it. If a fine-tuned BERT model does not beat this baseline by a meaningful margin on your task, the classical pipeline wins: it is faster to train, faster at inference, and easier to explain to stakeholders.

What's Next

You've covered a complete sentiment classification pipeline with TF-IDF and Logistic Regression, coefficient inspection for model interpretability, new-text prediction with probability scores, Multinomial Naive Bayes as a fast baseline, the comparison table between LR and NB for text, and threshold tuning for the precision-recall trade-off.

Next up: 05-transformers-overview, where you'll learn how the attention mechanism lets transformers capture long-range dependencies that TF-IDF cannot, how BERT's pre-training on masked language modelling enables transfer learning, and the decision framework for when a transformer is worth the engineering cost over a classical pipeline.

Optional Deep Dive

Read Chapter 6, "Learning to Classify Text", in "Natural Language Processing with Python" by Bird, Klein, and Loper (O'Reilly, free at nltk.org/book/ch06.html). Its document classification example trains a Naive Bayes classifier on movie reviews using hand-written feature extractors instead of a vectoriser, which deepens your understanding of what TF-IDF automates.
