Text Preprocessing¶

Raw text is not model-ready. It contains case variations, punctuation, URLs, HTML tags, stopwords, misspellings, and dozens of other sources of noise that inflate your vocabulary and dilute your signal. Preprocessing is where you make deliberate choices about what information to keep, and those choices can affect model performance as much as, or more than, hyperparameter tuning.

Learning Objectives¶

Explain why each preprocessing step exists, not just how to apply it
Build a reusable preprocessing function using Python and regex
Understand when to skip stopword removal (negations, sentiment tasks)
Distinguish stemming from lemmatisation and choose appropriately
Wrap preprocessing into a scikit-learn Pipeline-compatible transformer

Why Preprocessing Matters¶

Consider these two sentences:

"This product is AMAZING!!!   :)"
"this product is amazing"

To a human: identical sentiment. To a bag-of-words model that splits on whitespace without preprocessing, the two sentences share only two tokens ("product" and "is"). "This" and "this" are different tokens, as are "AMAZING!!!" and "amazing", and ":)" is yet another vocabulary entry. Every unnecessary variation in your vocabulary costs you a column in your feature matrix and dilutes the model's ability to learn the underlying pattern.
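The effect on the vocabulary can be checked directly. The sketch below uses plain whitespace splitting as a stand-in for a bag-of-words vectoriser (an assumption made for illustration; real vectorisers apply their own token pattern).

```python
# Whitespace tokenisation stands in for a bag-of-words vectoriser here.
raw_a = "This product is AMAZING!!!   :)"
raw_b = "this product is amazing"

tokens_a = set(raw_a.split())
tokens_b = set(raw_b.split())

shared = tokens_a & tokens_b        # tokens both sentences have in common
vocabulary = tokens_a | tokens_b    # every distinct token across both

print(sorted(shared))
# Output: ['is', 'product']
print(len(vocabulary))
# Output: 7
```

Two sentences with the same meaning produce seven vocabulary entries, and only two of them overlap.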

Info

Preprocessing is a form of feature engineering for text. You are manually encoding assumptions about what information is and is not relevant to your task. A preprocessing choice that helps a spam filter (remove all URLs) might destroy signal in a web-content classifier (URLs carry domain information).
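As one illustration of a task-dependent choice, a web-content classifier could keep the domain of each URL as a token instead of deleting the URL. This is a minimal sketch; the function name `url_to_domain_token` and the `url_` prefix are hypothetical choices, not part of any library.

```python
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")


def url_to_domain_token(text: str) -> str:
    """Replace each URL with a single token derived from its domain."""
    def _replace(match: re.Match) -> str:
        domain = urlparse(match.group(0)).netloc.lower()
        # Turn "example.com" into "example_com" so it stays one token
        return "url_" + re.sub(r"[^a-z0-9]", "_", domain)

    return URL_PATTERN.sub(_replace, text)


print(url_to_domain_token("Read https://example.com/deals?id=3 now"))
# Output: Read url_example_com now
```

A spam filter would simply delete the match instead; the point is that the decision belongs to the task, not to the regex.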

Step 1: Lowercasing¶

The most common and most universally beneficial step. "Python", "PYTHON", and "python" should be the same token in almost every task.

text = "This MOVIE was Amazing but the Ending was terrible."
lowered = text.lower()
print(lowered)
# Output: this movie was amazing but the ending was terrible.
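For text that is not purely English, Python's built-in `str.casefold()` is a more aggressive alternative to `str.lower()`. The sketch below compares the two on a small hand-picked word list (the German examples are an illustrative assumption, not taken from the text above).

```python
words = ["Python", "PYTHON", "python", "Straße", "STRASSE"]

lowered = {w.lower() for w in words}
folded = {w.casefold() for w in words}

print(sorted(lowered))
# Output: ['python', 'strasse', 'straße']
print(sorted(folded))
# Output: ['python', 'strasse']
```

`lower()` leaves "ß" untouched, so the two spellings stay separate tokens, while `casefold()` maps both to the same form.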

Warning

Do not lowercase when case carries genuine meaning. Named entity recognition (NER) relies heavily on capitalisation — "apple" (fruit) vs "Apple" (company). If you are building a NER system, skip lowercasing or handle it carefully.

Step 2: Removing Noise with Regex¶

Real-world text contains URLs, email addresses, HTML tags, and special characters that rarely contribute signal to classification tasks.

import re

def remove_noise(text: str) -> str:
    """Remove URLs, HTML tags, and non-alphabetic characters."""
    # Remove URLs
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
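    # Remove HTML tags (replace with a space so "one<br>two" does not become "onetwo")
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove anything that is not an ASCII letter or whitespace
    # (this also drops digits, emojis, and accented letters)
    text = re.sub(r"[^a-zA-Z\s]", "", text)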
    # Collapse multiple whitespace into one
    text = re.sub(r"\s+", " ", text).strip()
    return text

raw = "Check out https://example.com — <b>best deal</b> ever!!! #sale"
print(remove_noise(raw))
# Output: Check out best deal ever sale

Tip

Keep emojis when building sentiment models for social media data. Emojis like "😍" and "💔" carry strong sentiment signals. You will need to either convert them to text tokens (":heart_eyes:") or treat them as special tokens rather than stripping them.
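One way to convert emojis to text tokens is a lookup table applied before any character stripping. The mapping below is a tiny hand-picked example and the token names are hypothetical; a real project would need much wider coverage.

```python
# Hypothetical, hand-picked mapping used only for illustration.
EMOJI_TOKENS = {
    "😍": " emoji_heart_eyes ",
    "💔": " emoji_broken_heart ",
    "😊": " emoji_smile ",
}


def emojis_to_tokens(text: str) -> str:
    """Replace known emojis with word-like tokens, then tidy whitespace."""
    for emoji_char, token in EMOJI_TOKENS.items():
        text = text.replace(emoji_char, token)
    return " ".join(text.split())


print(emojis_to_tokens("Best purchase ever😊"))
# Output: Best purchase ever emoji_smile
```

Note that these tokens contain underscores, so a letters-only regex such as `[^a-zA-Z\s]` would mangle them; either allow "_" in that pattern or pick tokens made of letters only.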

Step 3: Tokenisation¶

Tokenisation splits a string into a list of meaningful units (tokens). The simplest approach is whitespace splitting, but dedicated tokenisers handle edge cases better.

# Naive: whitespace split
text = "I'm not going, but she'll be there."
naive_tokens = text.split()
print(naive_tokens)
# Output: ["I'm", 'not', 'going,', 'but', "she'll", 'be', 'there.']
# Problem: punctuation is attached to words

# Better: NLTK word tokeniser
import nltk
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens)
# Output: ['I', "'m", 'not', 'going', ',', 'but', 'she', "'ll", 'be', 'there', '.']
# Better: contractions are split, punctuation is separated
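When a dependency-free middle ground is enough, a single regular expression can separate punctuation while keeping contractions whole. This is a rough sketch, not a replacement for a trained tokeniser; `TOKEN_PATTERN` and `regex_tokenize` are names invented here.

```python
import re

# A word (optionally followed by an apostrophe part), or one punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def regex_tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens using a regex."""
    return TOKEN_PATTERN.findall(text)


print(regex_tokenize("I'm not going, but she'll be there."))
# Output: ["I'm", 'not', 'going', ',', 'but', "she'll", 'be', 'there', '.']
```

Unlike `word_tokenize`, this keeps "I'm" and "she'll" as single tokens, which matters later if your stopword list contains whole contractions.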

Sentence tokenisation¶

When you need to work at the sentence level (for summarisation, or to preserve sentence structure before word-level processing):

from nltk.tokenize import sent_tokenize

paragraph = "Dr. Smith graduated in 2019. She now works at St. Mary's hospital. It's a great place."
sentences = sent_tokenize(paragraph)
for s in sentences:
    print(s)
# Output:
# Dr. Smith graduated in 2019.
# She now works at St. Mary's hospital.
# It's a great place.
# Note: sent_tokenize correctly handles "Dr." and "St." without splitting on them
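For contrast, here is what a naive rule produces on the same paragraph. The sketch splits on whitespace that follows ".", "!", or "?", which is an assumption about what a first attempt might look like.

```python
import re

paragraph = "Dr. Smith graduated in 2019. She now works at St. Mary's hospital. It's a great place."

# Naive rule: a sentence ends at ., ! or ? followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", paragraph)
for s in naive_sentences:
    print(s)
# Output:
# Dr.
# Smith graduated in 2019.
# She now works at St.
# Mary's hospital.
# It's a great place.
```

The abbreviations "Dr." and "St." are treated as sentence endings, giving five fragments instead of three sentences.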

Step 4: Stopword Removal¶

Stopwords are extremely frequent words — "the", "is", "at", "which" — that rarely carry discriminative information for classification. Removing them reduces vocabulary size and speeds up training.

import nltk
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

tokens = ["this", "movie", "is", "not", "amazing", "at", "all"]
filtered = [token for token in tokens if token not in stop_words]
print(filtered)
# Output: ['movie', 'amazing']
# Problem: 'not' was removed, flipping the meaning!

Warning

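Removing stopwords blindly is one of the most common mistakes in NLP. The word "not" is a stopword in NLTK's default list. Removing it turns "not amazing" into "amazing", a complete meaning reversal. For sentiment analysis, always check whether negation words ("not", "never", "no", "isn't", "wasn't") are in your stopword list and take them out of that list so they survive filtering. Also check that the entries match what your tokeniser emits: word_tokenize splits "isn't" into "is" and "n't", so a stopword entry "isn't" only ever matches text that was split on whitespace.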

# Safe stopword removal for sentiment tasks
negative_words = {"not", "no", "never", "neither", "nor", "nothing",
                  "nowhere", "nobody", "isn't", "wasn't", "won't",
                  "wouldn't", "can't", "couldn't", "shouldn't", "doesn't"}

safe_stopwords = stop_words - negative_words

tokens = ["this", "movie", "is", "not", "amazing", "at", "all"]
filtered = [token for token in tokens if token not in safe_stopwords]
print(filtered)
# Output: ['movie', 'not', 'amazing']
# 'not' is preserved — meaning is intact
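A protected-negation set only helps if its entries look like the tokens your tokeniser produces. The sketch below hard-codes the tokens `word_tokenize` emits for "this isn't good" and uses a tiny illustrative stopword set (an assumption; it is not NLTK's list) to show the mismatch.

```python
# Tokens as word_tokenize emits them for "this isn't good"
tokens = ["this", "is", "n't", "good"]

# Tiny illustrative stopword set (an assumption, not NLTK's full list)
stop_words = {"this", "is", "isn't", "n't"}

# Protecting only the whole contraction changes nothing,
# because "isn't" never appears as a token
stops_protect_full = stop_words - {"isn't"}
kept_full = [t for t in tokens if t not in stops_protect_full]

# Protecting the fragment the tokeniser actually produces keeps the negation
stops_protect_fragment = stop_words - {"isn't", "n't"}
kept_fragment = [t for t in tokens if t not in stops_protect_fragment]

print(kept_full)
# Output: ['good']
print(kept_fragment)
# Output: ["n't", 'good']
```

Printing a few filtered examples from your own data is the quickest way to confirm that negations really survive.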

Step 5: Stemming vs Lemmatisation¶

Both techniques reduce words to a common base form to group variants together. They differ in method and output quality.

Stemming¶

Stemming chops word endings using rules — fast but crude. The result is not always a real word.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "runs", "runner", "easily", "fairly", "studies", "studying"]
stemmed = [stemmer.stem(w) for w in words]
print(list(zip(words, stemmed)))
# Output:
# [('running', 'run'), ('runs', 'run'), ('runner', 'runner'),
#  ('easily', 'easili'), ('fairly', 'fairli'),
#  ('studies', 'studi'), ('studying', 'studi')]
# Note: 'easily' → 'easili' is not a real word, but it works for grouping
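To make "chops word endings using rules" concrete, here is a deliberately tiny suffix stripper. It is a toy written for this illustration and is not the Porter algorithm, which applies many more rules in several phases.

```python
# Toy rule set: suffixes are checked in this order
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["running", "runs", "studies", "studying", "bus"]
print([toy_stem(w) for w in words])
# Output: ['runn', 'run', 'studi', 'study', 'bus']
```

The toy version fails to group "running" with "runs" or "studies" with "studying", which is exactly the kind of case the extra rules in a real stemmer exist to handle.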

Lemmatisation¶

Lemmatisation uses a vocabulary and morphological analysis to return the actual base word (lemma). Slower, but produces real words and handles irregular forms correctly.

import nltk
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "runner", "better", "studies", "studying"]

# lemmatize() assumes pos="n" (noun) unless told otherwise,
# so specify the part of speech for accuracy
# 'v' = verb, 'n' = noun, 'a' = adjective, 'r' = adverb
lemmas_verb = [lemmatizer.lemmatize(w, pos="v") for w in words]
print("As verbs:", lemmas_verb)
# Output: As verbs: ['run', 'run', 'runner', 'better', 'study', 'study']

lemmas_adj = [lemmatizer.lemmatize("better", pos="a")]
print("'better' as adjective:", lemmas_adj)
# Output: 'better' as adjective: ['good']
# Lemmatisation knows 'better' is the comparative of 'good'
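Because the right `pos` value differs per token, a common approach is to derive it from a part-of-speech tagger. The helper below is a minimal sketch that maps Penn Treebank style tags onto the single-letter codes used above; the function name `penn_to_wordnet` is hypothetical, and falling back to noun is an assumption.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank POS tag to a WordNet POS letter."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, ...: verbs
        return "v"
    if tag.startswith("R"):   # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # everything else is treated as a noun


print([penn_to_wordnet(t) for t in ["JJR", "VBG", "RB", "NNS"]])
# Output: ['a', 'v', 'r', 'n']
```

The tags themselves could come from `nltk.pos_tag`, which returns (token, tag) pairs and needs its own tagger model download. Each pair can then be passed to `lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))`.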

When to use which¶

Scenario	Recommendation
Speed matters, large dataset, any topic	Stemming
Accuracy matters, smaller dataset	Lemmatisation
Production sentiment classifier	Lemmatisation, or neither if an experiment shows no gain
Search index	Stemming (better recall, users expect broad matches)
When you have no preprocessing budget at all	Neither; TfidfVectorizer on lowercased text is a reasonable baseline (`sublinear_tf=True` dampens repeated terms, it does not merge word variants)

Info

In practice, when you use TfidfVectorizer with a reasonable vocabulary size, the performance difference between stemming and lemmatisation is often small. TF-IDF does not merge inflected forms, but with enough training data a linear model can learn similar weights for each variant on its own. Run a quick experiment on your data before committing to the extra complexity of lemmatisation.

Putting It Together: A Reusable Preprocessing Pipeline¶

import re
import nltk
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Stopwords with negations preserved
STOP_WORDS = set(stopwords.words("english")) - {
    "not", "no", "never", "nor", "nothing", "nobody",
    "isn't", "wasn't", "won't", "wouldn't", "can't",
    "couldn't", "shouldn't", "doesn't", "didn't",
}


def preprocess_text(text: str) -> str:
    """
    Full preprocessing pipeline for sentiment classification.
    Returns a cleaned, tokenised, lemmatised string.
    """
    # 1. Lowercase
    text = text.lower()
    # 2. Remove URLs and HTML
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"<[^>]+>", "", text)
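    # 3. Remove punctuation and digits. Apostrophes go too, so "isn't"
    #    becomes "isnt", which matches no stopword and is therefore kept.
    text = re.sub(r"[^a-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    # 4. Tokenise
    tokens = word_tokenize(text)
    # 5. Remove stopwords (negations preserved) and single-character tokens
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 1]
    # 6. Lemmatise, treating every token as a verb (a simplification:
    #    adjectives such as "amazing" come out as "amaze")
    tokens = [lemmatizer.lemmatize(t, pos="v") for t in tokens]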
    return " ".join(tokens)


# Test it
reviews = [
    "I absolutely LOVED this product!!! Best purchase ever 😊",
    "This was NOT worth the money. Totally disappointed.",
    "Check out https://deals.com — <b>Amazing</b> quality for the price.",
]

for review in reviews:
    print(preprocess_text(review))

# Output:
# absolutely love product best purchase ever
# not worth money totally disappoint
# check amaze quality price
# ('check' is not a stopword, and pos="v" turns 'amazing' into 'amaze')

Scikit-learn Compatible Transformer¶

When you want preprocessing inside a scikit-learn Pipeline, wrap it in a FunctionTransformer:

from sklearn.preprocessing import FunctionTransformer

def preprocess_series(texts):
    """Apply preprocess_text to an array of strings."""
    return [preprocess_text(t) for t in texts]

# Creates a drop-in sklearn transformer
text_cleaner = FunctionTransformer(preprocess_series)

sample_texts = [
    "Absolutely terrible product. Would NOT recommend.",
    "BEST movie I've seen all year!!!",
]

cleaned = text_cleaner.transform(sample_texts)
for original, clean in zip(sample_texts, cleaned):
    print(f"Before: {original}")
    print(f"After:  {clean}\n")

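# Output:
# Before: Absolutely terrible product. Would NOT recommend.
# After:  absolutely terrible product would not recommend
#
# Before: BEST movie I've seen all year!!!
# After:  best movie ive see year
# ('would' is not in NLTK's English stopword list; "I've" loses its
#  apostrophe and survives as 'ive')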
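The transformer becomes useful once it sits in front of a vectoriser and a classifier. The sketch below shows that wiring with a simplified stand-in cleaner so it runs without NLTK downloads; `simple_clean`, the step names, and the four training sentences are assumptions made for illustration.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def simple_clean(texts):
    """Stand-in for preprocess_series: lowercase and keep letters only."""
    return [re.sub(r"[^a-z\s]", " ", t.lower()) for t in texts]


pipeline = Pipeline([
    ("clean", FunctionTransformer(simple_clean)),  # list of str -> list of str
    ("tfidf", TfidfVectorizer()),                  # list of str -> sparse matrix
    ("clf", LogisticRegression()),                 # sparse matrix -> labels
])

# Tiny made-up training set: 1 = positive, 0 = negative
train_texts = [
    "Loved it, great product!",
    "Terrible, would not recommend.",
    "Great quality and great price",
    "Awful product, terrible support",
]
train_labels = [1, 0, 1, 0]

pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(["Great product, loved it"])
print(predictions)
```

Keeping the cleaning step inside the Pipeline means the same preprocessing is applied at training and prediction time, and it is included automatically in cross-validation.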

Success

The goal of preprocessing is controlled vocabulary reduction. Every step should reduce the number of distinct tokens your vectoriser sees, while preserving the tokens that carry predictive signal. If a preprocessing step does not measurably help your downstream model, it is complexity you do not need.
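A quick way to see whether a step reduces the vocabulary is to count distinct tokens before and after it. This is a minimal sketch using whitespace tokens and a three-sentence made-up corpus; `vocabulary_size` and `basic_clean` are hypothetical helper names.

```python
import re


def vocabulary_size(texts, preprocess=lambda t: t):
    """Count distinct whitespace-separated tokens after preprocessing."""
    vocab = set()
    for text in texts:
        vocab.update(preprocess(text).split())
    return len(vocab)


def basic_clean(text):
    """Lowercase, replace non-letters with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())


corpus = [
    "This product is AMAZING!!!",
    "this product is amazing",
    "Amazing product, amazing price.",
]

print(vocabulary_size(corpus))
# Output: 9
print(vocabulary_size(corpus, basic_clean))
# Output: 5
```

Vocabulary size is only a proxy. The decisive measurement is still the downstream model's validation score with and without the step.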

What's Next¶

You've covered the full text preprocessing pipeline — lowercasing, URL and HTML removal, tokenisation, stop word filtering (with negation preservation), stemming vs lemmatisation, and wrapping the pipeline into a sklearn FunctionTransformer. Next up: 03-bow-tfidf — where you'll convert preprocessed text into numeric features using CountVectorizer and TfidfVectorizer, understand how IDF downweights common words, add bigrams for phrase-level signal, and handle the sparse matrix format that makes large document-term matrices tractable.

Optional Deep Dive

Read the NLTK documentation on corpora and lexical resources at https://www.nltk.org/book/ch02.html — it covers WordNet's synsets, hypernym hierarchies, and morphological derivations that underpin WordNetLemmatizer, giving you the linguistic database context that explains why lemmatisation is more accurate than rule-based stemming.
