🧹 02 — Text Preprocessing¶
Common preprocessing steps:
- lowercase
- remove extra whitespace
- remove punctuation when it carries no signal for the task
- remove stop words
- tokenize
- lemmatize/stem
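The steps above can be chained into one small function. The following is a minimal sketch using only the standard library. The stop-word list and the `naive_stem` helper are illustrative assumptions, not a real stemmer; in practice a library such as NLTK or spaCy would normally handle stop words and lemmatization.

```python
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def naive_stem(token: str) -> str:
    """Hypothetical, very rough suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list:
    text = text.lower()                                                # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = text.split()              # tokenize; split() also discards extra whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]                # remove stop words
    return [naive_stem(t) for t in tokens]                             # crude stemming


print(preprocess("The cats   are chasing the mice, and jumping!"))
# ['cat', 'chas', 'mice', 'jump']
```

The output shows why crude stemming is only a sketch: "chasing" becomes "chas" rather than "chase", and "mice" is not mapped to "mouse", which a lemmatizer would handle.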
Simple Cleaning¶
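A minimal sketch of a cleaning function, assuming the goal is lowercase text with punctuation removed and whitespace collapsed. The name `clean_text` is an illustrative choice, not something defined elsewhere in this guide.

```python
import re


def clean_text(text: str) -> str:
    text = text.lower()
    # Drop every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_text("  Hello,   World!  This is   NLP. "))
# hello world this is nlp
```

Note that `\w` also matches digits and underscores, so those are kept.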
With pandas:
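The same cleaning can be applied to a whole column with the vectorized `.str` accessor. This is a sketch under assumptions: the DataFrame, its `text` column, and the new `clean` column are made up for illustration.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"text": ["  Hello,   World! ", "Pandas  makes THIS easy."]})

df["clean"] = (
    df["text"]
    .str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)  # remove punctuation
    .str.replace(r"\s+", " ", regex=True)     # collapse whitespace
    .str.strip()
)

print(df["clean"].tolist())
# ['hello world', 'pandas makes this easy']
```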
What Not to Remove Blindly¶
Do not remove these by default:
- negation words like "not"
- punctuation in sentiment tasks
- casing if it carries meaning (for example "US" vs. "us")
- emojis if they matter
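Preprocessing depends on the task: a step that helps topic classification, such as stop-word removal, can discard the signal a sentiment model needs.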
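To illustrate the negation point, here is a small sketch showing how blind stop-word removal can flip the apparent sentiment of a sentence. The stop-word set, the negation set, and the `remove_stop_words` helper are assumptions made up for this example.

```python
# Illustrative stop-word list that, like many real ones, includes "not".
STOP_WORDS = {"the", "is", "this", "a", "not", "no"}
NEGATIONS = {"not", "no", "never"}


def remove_stop_words(tokens, keep=frozenset()):
    """Drop stop words, except any token listed in `keep`."""
    return [t for t in tokens if t not in STOP_WORDS or t in keep]


tokens = "this movie is not good".split()

blind = remove_stop_words(tokens)
careful = remove_stop_words(tokens, keep=NEGATIONS)

print(blind)    # ['movie', 'good']
print(careful)  # ['movie', 'not', 'good']
```

The blind version leaves "movie good", which reads as positive, while keeping negation words preserves the original meaning.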
Next¶
➡️ 03-bow-tfidf