Python Cheat Sheet¶
A practitioner's reference for data science Python. Each entry explains when to use the pattern, then shows a runnable snippet. Scan it, don't memorize it — return here when you're in the middle of real work.
1. Data Types & Variables¶
Numeric types: int and float¶
Use int for counts, indices, and discrete values. Use float when you need decimals or when a calculation might produce one. In Python 3, / always returns a float, even for two ints; use // when you want floor division.
rows = 1000 # int — row count
learning_rate = 0.01 # float — model hyperparameter
print(type(rows)) # <class 'int'>
print(type(learning_rate)) # <class 'float'>
print(7 / 2) # 3.5 — true division
print(7 // 2) # 3 — floor division
print(7 % 2) # 1 — remainder
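Two edge cases are worth seeing next to the basics above: floor division rounds toward negative infinity rather than toward zero, and floats carry binary rounding error. A minimal sketch; the variable names are illustrative only.

```python
import math

# Floor division rounds down (toward negative infinity), not toward zero
neg_floor = -7 // 2        # -4
neg_trunc = int(-7 / 2)    # -3, int() truncates toward zero

# Floats are binary approximations, so compare with a tolerance
total = 0.1 + 0.2
exact_match = total == 0.3              # False
close_match = math.isclose(total, 0.3)  # True

print(neg_floor, neg_trunc)      # -4 -3
print(exact_match, close_match)  # False True
```

If you need truncation toward zero for negative numbers, int(a / b) is one option, at the cost of going through a float.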
Strings and booleans¶
str is immutable. bool is a subclass of int — True == 1 and False == 0, which matters when summing boolean masks in NumPy/pandas.
label = "churn"
is_valid = True
print(type(label)) # <class 'str'>
print(int(is_valid)) # 1
print(True + True) # 2 — useful for counting matches in a list
None — the absence of a value¶
None is not zero, not an empty string, not False. It signals that a value is missing or not yet assigned. Use is None for checks, not == None.
result = None
if result is None:
    print("no result yet")  # no result yet
# Common pattern: function returns None on failure
def safe_divide(a, b):
    if b == 0:
        return None
    return a / b
print(safe_divide(10, 0)) # None
print(safe_divide(10, 2)) # 5.0
type() and isinstance()¶
Use type() for exact type checks. Use isinstance() when subclasses should also pass — prefer it in production code and when working with numeric types where bool is a subclass of int.
x = 42
print(type(x)) # <class 'int'>
print(type(x) == int) # True — exact match only
print(isinstance(x, int)) # True
print(isinstance(True, int)) # True — bool IS an int
print(isinstance(True, bool)) # True
print(isinstance(x, (int, float)))# True — check multiple types at once
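Because bool passes an isinstance(x, int) check, numeric validation sometimes needs to exclude it explicitly. A minimal sketch of one way to do that; is_real_number is a hypothetical helper name, not a standard function.

```python
def is_real_number(value):
    """Return True for int or float values, but not for bool."""
    # bool is a subclass of int, so it must be rejected before the numeric check
    if isinstance(value, bool):
        return False
    return isinstance(value, (int, float))

print(is_real_number(42))     # True
print(is_real_number(3.14))   # True
print(is_real_number(True))   # False
print(is_real_number("42"))   # False
```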
Type conversion¶
Explicit conversion is safer than implicit. Know where it fails so you can wrap it in error handling.
print(int("42")) # 42
print(float("3.14")) # 3.14
print(str(100)) # '100'
print(bool(0)) # False
print(bool("")) # False — empty string is falsy
print(bool("hello")) # True
# int("3.14") raises ValueError — convert to float first
print(int(float("3.14"))) # 3
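The failure cases above can be wrapped in a small helper. This is a sketch that assumes returning a default value is acceptable for your data; to_float is a hypothetical name.

```python
def to_float(text, default=None):
    """Convert text to float, returning default when conversion fails."""
    try:
        return float(text)
    except (ValueError, TypeError):
        # ValueError: unparseable string such as "n/a"
        # TypeError: unsupported input type such as None
        return default

print(to_float("3.14"))     # 3.14
print(to_float("n/a"))      # None
print(to_float(None, 0.0))  # 0.0
print(to_float(" 42 "))     # 42.0
```

Catching only ValueError and TypeError keeps unrelated bugs visible instead of swallowing them.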
2. String Operations¶
f-strings — the default way to format strings¶
f-strings are faster and more readable than .format() or %. Use them everywhere. Expressions inside {} are evaluated at runtime.
name = "Nikhil"
accuracy = 0.9342
print(f"Model trained by {name}") # Model trained by Nikhil
print(f"Accuracy: {accuracy:.2%}") # Accuracy: 93.42%
print(f"Pi approx: {22/7:.4f}") # Pi approx: 3.1429
print(f"{'left':<10}|{'right':>10}")  # 'left      |     right' (each field padded to width 10)
print(f"Debug: {name=}") # Debug: name='Nikhil' (Python 3.8+)
.split() and .join()¶
.split() breaks a string into a list. .join() reassembles a list into a string. They are inverses of each other and come up constantly in text preprocessing.
sentence = "age,income,churn_label"
columns = sentence.split(",")
print(columns) # ['age', 'income', 'churn_label']
cleaned = "_".join(columns)
print(cleaned) # age_income_churn_label
# Split on whitespace (default) strips extra spaces
text = " hello world "
print(text.split()) # ['hello', 'world']
.strip(), .lstrip(), .rstrip()¶
Strip whitespace (or specified characters) from string edges. Essential when reading messy CSV data where columns may have extra spaces.
raw = " churn "
print(raw.strip()) # 'churn'
print(raw.lstrip()) # 'churn '
print(raw.rstrip()) # ' churn'
# Strip specific characters
path = "///data/file.csv///"
print(path.strip("/")) # 'data/file.csv'
.replace() and case methods¶
header = "Customer ID"
snake = header.lower().replace(" ", "_")
print(snake) # customer_id
print("hello".upper()) # HELLO
print("HELLO".lower()) # hello
print("hello world".title())# Hello World
print(" hi ".strip()) # hi
# Check string content
print("abc123".isdigit()) # False
print("123".isdigit()) # True
print("abc".isalpha()) # True
String slicing¶
Strings are sequences — the same slicing rules apply to lists. s[start:stop:step].
s = "data_science"
print(s[0]) # d — first character
print(s[-1]) # e — last character
print(s[0:4]) # data — characters 0,1,2,3
print(s[5:]) # science — from index 5 to end
print(s[:4]) # data — up to (not including) index 4
print(s[::-1]) # ecneics_atad — reversed
print(s[::2]) # dt_cec — every second character
Checking membership and string methods¶
text = "Random Forest is an ensemble method"
print("ensemble" in text) # True
print(text.startswith("Random")) # True
print(text.endswith("method")) # True
print(text.count("e")) # 5
print(text.find("Forest")) # 7 — index of first match, -1 if not found
print(text.replace("Random", "Gradient"))
# Gradient Forest is an ensemble method
3. Lists¶
Creation and indexing¶
Lists are ordered, mutable, and allow duplicates. They are your default sequential container in Python.
scores = [0.91, 0.87, 0.93, 0.78, 0.95]
print(scores[0]) # 0.91 — first element
print(scores[-1]) # 0.95 — last element
print(scores[1:3]) # [0.87, 0.93]
# Lists can hold mixed types (usually avoid this in data work)
row = [1, "Alice", 29, True]
Common list methods¶
features = ["age", "income"]
features.append("churn") # add to end
print(features) # ['age', 'income', 'churn']
features.insert(1, "gender") # insert at index
print(features) # ['age', 'gender', 'income', 'churn']
features.remove("gender") # remove first occurrence by value
print(features) # ['age', 'income', 'churn']
popped = features.pop() # remove and return last
print(popped) # churn
features.extend(["region", "plan"])# add multiple items
print(features) # ['age', 'income', 'region', 'plan']
print(len(features)) # 4
print(features.index("income")) # 1
print(features.count("age")) # 1
Sorting¶
vals = [3, 1, 4, 1, 5, 9, 2, 6]
vals.sort() # in-place, modifies the list
print(vals) # [1, 1, 2, 3, 4, 5, 6, 9]
vals.sort(reverse=True)
print(vals) # [9, 6, 5, 4, 3, 2, 1, 1]
# sorted() returns a new list — use when you need to keep the original
original = [3, 1, 4, 1, 5]
ranked = sorted(original)
print(original) # [3, 1, 4, 1, 5] — unchanged
print(ranked) # [1, 1, 3, 4, 5]
# Sort by key
models = [("RandomForest", 0.91), ("XGBoost", 0.93), ("LogReg", 0.87)]
models.sort(key=lambda m: m[1], reverse=True)
print(models) # [('XGBoost', 0.93), ('RandomForest', 0.91), ('LogReg', 0.87)]
List comprehensions¶
The most Pythonic way to build a list from another iterable. More readable and faster than a for loop with .append().
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
squares = [n**2 for n in numbers]
print(squares) # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
evens = [n for n in numbers if n % 2 == 0]
print(evens) # [2, 4, 6, 8, 10]
# Transformation + filter in one line
normalized = [round((n - 1) / 9, 2) for n in numbers if n > 3]
print(normalized) # [0.33, 0.44, 0.56, 0.67, 0.78, 0.89, 1.0]
Flattening and copying¶
nested = [[1, 2], [3, 4], [5, 6]]
# Flatten a list of lists
flat = [x for sublist in nested for x in sublist]
print(flat) # [1, 2, 3, 4, 5, 6]
# Shallow copy — avoid mutating the original
original = [1, 2, 3]
copy1 = original[:] # slice copy
copy2 = original.copy() # explicit copy
copy3 = list(original) # constructor copy
copy1.append(99)
print(original) # [1, 2, 3] — unaffected
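The three copies above are shallow: they duplicate the outer list only. With nested lists the inner lists are still shared, which copy.deepcopy from the standard library avoids. A small sketch with made-up data:

```python
import copy

nested = [[1, 2], [3, 4]]

shallow = nested.copy()        # new outer list, same inner lists
deep = copy.deepcopy(nested)   # new outer list and new inner lists

shallow[0].append(99)          # mutates the inner list shared with nested
deep[1].append(-1)             # touches only the deep copy

print(nested)   # [[1, 2, 99], [3, 4]]
print(shallow)  # [[1, 2, 99], [3, 4]]
print(deep)     # [[1, 2], [3, 4, -1]]
```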
4. Dictionaries¶
Creation and basic access¶
Dictionaries are key-value stores. In Python 3.7+ they preserve insertion order. Use them to represent a single record, a mapping of labels to values, or a lookup table.
model_scores = {
"logistic_regression": 0.87,
"random_forest": 0.91,
"xgboost": 0.93,
}
print(model_scores["xgboost"]) # 0.93
print(len(model_scores)) # 3
print("svm" in model_scores) # False — checks keys
print(list(model_scores.keys())) # ['logistic_regression', 'random_forest', 'xgboost']
print(list(model_scores.values())) # [0.87, 0.91, 0.93]
.get() — safe access¶
Use .get() instead of [] when the key might not exist. It returns None (or a default you specify) rather than raising a KeyError.
scores = {"rf": 0.91, "xgb": 0.93}
# print(scores["svm"]) # KeyError: 'svm' (uncommenting this line crashes the script)
print(scores.get("svm")) # None — safe
print(scores.get("svm", 0.0)) # 0.0 — custom default
.items(), .update(), .pop()¶
config = {"lr": 0.01, "epochs": 100, "batch_size": 32}
# Iterate key-value pairs
for param, value in config.items():
    print(f"{param}: {value}")
# Update (merge or overwrite)
config.update({"epochs": 200, "dropout": 0.3})
print(config)
# {'lr': 0.01, 'epochs': 200, 'batch_size': 32, 'dropout': 0.3}
# Remove a key and get its value
removed = config.pop("dropout")
print(removed) # 0.3
Dict comprehensions¶
Same idea as list comprehensions. Great for transforming or inverting mappings.
raw = {"Age": 29, "Income": 75000, "Churn": 1}
# Lowercase all keys
cleaned = {k.lower(): v for k, v in raw.items()}
print(cleaned) # {'age': 29, 'income': 75000, 'churn': 1}
# Invert a mapping
label_map = {0: "no_churn", 1: "churn"}
inverted = {v: k for k, v in label_map.items()}
print(inverted) # {'no_churn': 0, 'churn': 1}
# Filter by value
high_scores = {model: score for model, score in
{"rf": 0.91, "lr": 0.78, "xgb": 0.93}.items()
if score > 0.85}
print(high_scores) # {'rf': 0.91, 'xgb': 0.93}
defaultdict — skip the key-exists check¶
defaultdict from collections automatically creates a default value when you access a missing key. Eliminates boilerplate if key not in d: d[key] = [] patterns.
from collections import defaultdict
# Group records by category without checking if key exists
records = [("churned", 1), ("retained", 2), ("churned", 3), ("retained", 4)]
grouped = defaultdict(list)
for label, val in records:
    grouped[label].append(val)
print(dict(grouped))  # {'churned': [1, 3], 'retained': [2, 4]}
# Count occurrences
word_count = defaultdict(int)
for word in ["apple", "banana", "apple", "cherry", "banana", "apple"]:
    word_count[word] += 1
print(dict(word_count)) # {'apple': 3, 'banana': 2, 'cherry': 1}
Counter — frequency counts in one line¶
from collections import Counter
labels = ["cat", "dog", "cat", "bird", "dog", "cat"]
counts = Counter(labels)
print(counts) # Counter({'cat': 3, 'dog': 2, 'bird': 1})
print(counts.most_common(2)) # [('cat', 3), ('dog', 2)]
print(counts["cat"]) # 3
print(counts["fish"]) # 0 — no KeyError for missing keys
5. Tuples & Sets¶
Tuples — immutable sequences¶
Use tuples for data that should not change: coordinates, RGB values, function return pairs, dictionary keys. They are slightly cheaper to create and store than lists, and they signal intent ("this data is fixed").
point = (3.5, 7.2) # x, y coordinate
rgb = (255, 128, 0)
print(point[0]) # 3.5
# point[0] = 4.0 # TypeError — tuples are immutable
# Tuple unpacking — very common in Python
x, y = point
print(x, y) # 3.5 7.2
# Multiple return values use tuples under the hood
def min_max(values):
    return min(values), max(values)  # returns a tuple
low, high = min_max([3, 1, 4, 1, 5, 9])
print(low, high) # 1 9
Sets — unordered, unique elements¶
Use sets when you need to eliminate duplicates or perform membership tests on large collections (O(1) lookup vs O(n) for lists). Also use them for set algebra: union, intersection, difference.
a = {1, 2, 3, 4, 5}
b = {4, 5, 6, 7, 8}
print(a | b) # {1, 2, 3, 4, 5, 6, 7, 8} — union
print(a & b) # {4, 5} — intersection
print(a - b) # {1, 2, 3} — difference (in a but not b)
print(a ^ b) # {1, 2, 3, 6, 7, 8} — symmetric difference
# Deduplicate a list
raw = [1, 2, 2, 3, 3, 3, 4]
unique = list(set(raw))
print(unique) # [1, 2, 3, 4] — order not guaranteed
# Fast membership test
valid_categories = {"electronics", "clothing", "food"}
item = "clothing"
print(item in valid_categories) # True — O(1) lookup
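Deduplicating with list(set(...)) loses the original order. When order matters, one common alternative is dict.fromkeys, which relies on dicts preserving insertion order (Python 3.7+). A minimal sketch with made-up data:

```python
raw = ["b", "a", "b", "c", "a"]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_ordered = list(dict.fromkeys(raw))
print(unique_ordered)  # ['b', 'a', 'c']
```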
When to use which¶
# Use a tuple when:
# - data is fixed (coordinates, config pairs, dict keys)
# - returning multiple values from a function
# - you want to signal "this shouldn't change"
config = ("localhost", 5432) # (host, port) — a natural tuple
# Use a set when:
# - you need unique values
# - you need fast membership testing
# - you need set algebra (union, intersection, etc.)
selected_features = {"age", "income", "region"}
required_features = {"age", "credit_score"}
missing = required_features - selected_features
print(missing) # {'credit_score'}
6. Control Flow¶
if / elif / else¶
score = 0.74
if score >= 0.90:
    grade = "excellent"
elif score >= 0.80:
    grade = "good"
elif score >= 0.70:
    grade = "acceptable"
else:
    grade = "needs improvement"
print(grade) # acceptable
Ternary expression¶
For simple conditional assignments, the ternary form is more readable than a full if/else block.
x = 15
label = "odd" if x % 2 != 0 else "even"
print(label) # odd
# Useful in list comprehensions
scores = [0.91, 0.65, 0.88, 0.72, 0.55]
grades = ["pass" if s >= 0.70 else "fail" for s in scores]
print(grades) # ['pass', 'fail', 'pass', 'pass', 'fail']
for loops¶
features = ["age", "income", "region"]
for feature in features:
    print(feature.upper())
# AGE, INCOME, REGION (one per line)
# Range-based loops
for i in range(5):
    print(i, end=" ")  # 0 1 2 3 4
# range(start, stop, step)
for i in range(0, 10, 2):
    print(i, end=" ")  # 0 2 4 6 8
while, break, continue¶
# while — use when you don't know the iteration count in advance
attempts = 0
max_attempts = 5
while attempts < max_attempts:
    attempts += 1
    if attempts == 3:
        continue  # skip the rest of this iteration
    if attempts == 4:
        break  # exit the loop entirely
    print(f"attempt {attempts}")
# attempt 1
# attempt 2
# for-else and while-else: else runs only if loop completed without break
for n in [2, 3, 5, 7]:
    if n % 2 == 0 and n != 2:
        print("found even")
        break
else:
    print("no non-two even found")  # this runs
Truthiness — what evaluates to False¶
Knowing what Python considers falsy saves many unnecessary == None or == [] checks.
# All of these are falsy:
falsy_values = [False, 0, 0.0, "", [], {}, set(), None, (), 0j]
for v in falsy_values:
    if not v:
        print(f"{repr(v):12} is falsy")
# Practical use: check if a list has elements
data = []
if not data:
    print("no data loaded")  # no data loaded
results = [0.91, 0.87]
if results:
    print(f"best: {max(results)}")  # best: 0.91
7. Functions¶
def, default arguments, return¶
Default arguments make functions flexible without requiring every caller to pass every argument. Put defaults at the end of the parameter list.
def evaluate_model(y_true, y_pred, threshold=0.5, verbose=False):
    predictions = [1 if p >= threshold else 0 for p in y_pred]
    correct = sum(t == p for t, p in zip(y_true, predictions))
    accuracy = correct / len(y_true)
    if verbose:
        print(f"Correct: {correct}/{len(y_true)}")
    return accuracy
y_true = [1, 0, 1, 1, 0]
y_pred = [0.9, 0.2, 0.8, 0.4, 0.1]
print(evaluate_model(y_true, y_pred))  # 0.8
print(evaluate_model(y_true, y_pred, threshold=0.35))  # 1.0 (0.4 now counts as a positive)
evaluate_model(y_true, y_pred, verbose=True)
# Correct: 4/5
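One caveat with default arguments: a default value is evaluated once, when the function is defined, so a mutable default such as a list is shared across calls. The sketch below uses two hypothetical functions, add_tag_buggy and add_tag, to contrast the pitfall with the usual None-sentinel idiom.

```python
def add_tag_buggy(tag, tags=[]):
    # The default list is created once and shared across calls
    tags.append(tag)
    return tags

def add_tag(tag, tags=None):
    # Use None as a sentinel and build a fresh list per call
    if tags is None:
        tags = []
    tags.append(tag)
    return tags

first = add_tag_buggy("a")
second = add_tag_buggy("b")
print(second)           # ['a', 'b']
print(first is second)  # True

print(add_tag("a"))  # ['a']
print(add_tag("b"))  # ['b']
```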
*args and **kwargs¶
*args collects positional arguments into a tuple. **kwargs collects keyword arguments into a dict. Use them to write flexible utility functions or wrappers.
def log(*args, **kwargs):
    # args is a tuple, kwargs is a dict
    prefix = kwargs.get("prefix", "INFO")
    message = " ".join(str(a) for a in args)
    print(f"[{prefix}] {message}")
log("Training complete", "epoch=10")
# [INFO] Training complete epoch=10
log("Accuracy", 0.93, prefix="RESULT")
# [RESULT] Accuracy 0.93
lambda — anonymous functions¶
Use lambda for short, one-off functions, especially as the key= argument to sorted() or as the function passed to map() and filter(). If the logic is more than one expression, write a proper def.
square = lambda x: x ** 2
print(square(5)) # 25
# Most common use: as a key function
models = [("rf", 0.91), ("lr", 0.78), ("xgb", 0.93)]
ranked = sorted(models, key=lambda m: m[1], reverse=True)
print(ranked) # [('xgb', 0.93), ('rf', 0.91), ('lr', 0.78)]
# With pandas (preview — covered in pandas cheat sheet)
# df.sort_values(key=lambda col: col.str.lower())
map() and filter()¶
map() applies a function to every element. filter() selects elements where the function returns True. Both return lazy iterators — wrap in list() to get the result immediately.
values = [1, 4, 9, 16, 25]
roots = list(map(lambda x: x ** 0.5, values))
print(roots) # [1.0, 2.0, 3.0, 4.0, 5.0]
# filter keeps elements where function returns True
scores = [0.91, 0.65, 0.88, 0.72, 0.55]
passing = list(filter(lambda s: s >= 0.70, scores))
print(passing) # [0.91, 0.88, 0.72]
# List comprehensions are usually more readable than map/filter
roots_lc = [x ** 0.5 for x in values]
passing_lc = [s for s in scores if s >= 0.70]
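"Lazy iterator" has a practical consequence: a map or filter object can be consumed only once. A short sketch with illustrative values:

```python
values = [1, 4, 9]
doubled = map(lambda x: x * 2, values)

first_pass = list(doubled)   # consumes the iterator
second_pass = list(doubled)  # nothing left to yield

print(first_pass)   # [2, 8, 18]
print(second_pass)  # []
```

If you need the results more than once, store them in a list first.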
Variable scope — LEGB¶
Python looks up names in this order: Local, Enclosing, Global, Built-in.
threshold = 0.5 # global
def classify(score):
    threshold = 0.7  # local: shadows the global
    return "high" if score >= threshold else "low"
print(classify(0.8))  # high
print(threshold)  # 0.5 (global unchanged)
# Use global keyword to modify a global (usually a code smell; prefer return values)
counter = 0
def increment():
    global counter
    counter += 1
increment()
print(counter) # 1
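The "Enclosing" level of LEGB applies to nested functions. The sketch below assumes a hypothetical make_counter factory; nonlocal lets the inner function rebind a variable from the enclosing function, so each counter keeps its own state without a global.

```python
def make_counter():
    count = 0  # lives in the enclosing scope of increment()

    def increment():
        nonlocal count  # rebind the enclosing variable, not a new local
        count += 1
        return count

    return increment

counter_a = make_counter()
counter_b = make_counter()
print(counter_a())  # 1
print(counter_a())  # 2
print(counter_b())  # 1
```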
8. File I/O¶
Reading a text file¶
Always use a with block. It guarantees the file is closed even if an exception occurs.
# Write a sample file first
with open("sample.txt", "w") as f:
    f.write("line one\nline two\nline three\n")
# Read entire file as one string
with open("sample.txt", "r") as f:
    content = f.read()
print(content)
# line one
# line two
# line three
# Read line by line (memory-efficient for large files)
with open("sample.txt", "r") as f:
    for line in f:
        print(line.strip())  # .strip() removes the trailing newline
Writing and appending¶
rows = ["Alice,29,1", "Bob,34,0", "Carol,27,1"]
# "w" overwrites the file; "a" appends to it
with open("output.csv", "w") as f:
    f.write("name,age,churn\n")
    for row in rows:
        f.write(row + "\n")
# Append a new row without overwriting
with open("output.csv", "a") as f:
    f.write("Dave,41,0\n")
CSV with the csv module¶
Use the csv module instead of manual string splitting — it handles quoted fields, commas inside values, and different delimiters correctly.
import csv
# Write CSV
data = [
["name", "age", "score"],
["Alice", 29, 0.91],
["Bob", 34, 0.87],
]
with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(data)
# Read CSV as dicts: column names become keys
with open("data.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(dict(row))
# {'name': 'Alice', 'age': '29', 'score': '0.91'}
# {'name': 'Bob', 'age': '34', 'score': '0.87'}
# Note: all values are strings — cast as needed
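Since csv.DictReader yields every value as a string, a per-column converter mapping is one way to do the casting. This is a sketch: the converters dict is a hypothetical schema, and io.StringIO stands in for a real file so the example has no side effects.

```python
import csv
import io

# In-memory stand-in for a CSV file
raw_csv = "name,age,score\nAlice,29,0.91\nBob,34,0.87\n"

# Hypothetical schema: column name -> conversion function
converters = {"name": str, "age": int, "score": float}

typed_rows = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    typed_rows.append({col: converters[col](val) for col, val in row.items()})

print(typed_rows[0])  # {'name': 'Alice', 'age': 29, 'score': 0.91}
```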
Working with file paths¶
import os
from pathlib import Path # preferred in Python 3.4+
# pathlib is more readable and cross-platform than os.path
data_dir = Path("data")
csv_file = data_dir / "customers.csv" # / operator builds paths
print(csv_file.name) # customers.csv
print(csv_file.stem) # customers
print(csv_file.suffix) # .csv
print(csv_file.parent) # data
# Check existence before reading
if csv_file.exists():
    with open(csv_file) as f:
        pass
# List all CSV files in a directory
for f in Path(".").glob("*.csv"):
    print(f)
9. Error Handling¶
try / except / finally¶
Wrap code that can legitimately fail. Catch specific exceptions — catching bare Exception hides bugs.
def load_value(data, key):
    try:
        value = float(data[key])
        return value
    except KeyError:
        print(f"Key '{key}' not found in data")
        return None
    except ValueError:
        print(f"Cannot convert '{data[key]}' to float")
        return None
    finally:
        # Always runs: use for cleanup (closing files, etc.)
        print("load_value finished")
record = {"age": "29", "income": "not_a_number"}
print(load_value(record, "age")) # 29.0
print(load_value(record, "income")) # ValueError message, then None
print(load_value(record, "name")) # KeyError message, then None
Common exception types¶
# Know these so your except clauses are specific:
# ValueError — right type, wrong value
int("abc") # ValueError: invalid literal for int()
# TypeError — wrong type for the operation
"5" + 5 # TypeError: can only concatenate str (not "int") to str
# KeyError — dict access with missing key
{}["x"] # KeyError: 'x'
# IndexError — list/tuple access out of range
[][0] # IndexError: list index out of range
# AttributeError — attribute doesn't exist on the object
None.split() # AttributeError: 'NoneType' object has no attribute 'split'
# FileNotFoundError — file doesn't exist
open("ghost.csv") # FileNotFoundError: [Errno 2] No such file or directory
# ZeroDivisionError
1 / 0 # ZeroDivisionError: division by zero
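Each statement above raises as soon as it executes, so the list cannot be run top to bottom. To observe the exceptions without crashing, one option is a small harness like the sketch below. The failing_calls mapping is purely illustrative, and the broad except Exception is deliberate here because the only goal is to report which exception type was raised.

```python
# Each lambda triggers one of the exceptions listed above
failing_calls = {
    "ValueError": lambda: int("abc"),
    "TypeError": lambda: "5" + 5,
    "KeyError": lambda: {}["x"],
    "IndexError": lambda: [][0],
    "AttributeError": lambda: None.split(),
    "ZeroDivisionError": lambda: 1 / 0,
}

observed = {}
for expected_name, call in failing_calls.items():
    try:
        call()
    except Exception as exc:  # broad catch is deliberate: this is a demo harness
        observed[expected_name] = type(exc).__name__
        print(f"{type(exc).__name__}: {exc}")
```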
Raising exceptions¶
Raise exceptions when your function receives input that violates its contract. This is better than silently returning wrong results.
def train_test_split_check(data, test_size):
    if not 0 < test_size < 1:
        raise ValueError(f"test_size must be between 0 and 1, got {test_size}")
    if len(data) < 2:
        raise ValueError("Need at least 2 samples to split")
    split_idx = int(len(data) * (1 - test_size))
    return data[:split_idx], data[split_idx:]
try:
    train, test = train_test_split_check([1, 2, 3, 4, 5], test_size=1.5)
except ValueError as e:
    print(f"Error: {e}")
# Error: test_size must be between 0 and 1, got 1.5
assert — for development checks¶
Use assert to verify assumptions during development. Do NOT use it for input validation in production — assertions can be disabled with -O flag.
def normalize(values):
    assert len(values) > 0, "values cannot be empty"
    min_val = min(values)
    max_val = max(values)
    assert min_val != max_val, "all values are identical, cannot normalize"
    return [(v - min_val) / (max_val - min_val) for v in values]
print(normalize([1, 2, 3, 4, 5])) # [0.0, 0.25, 0.5, 0.75, 1.0]
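For code that must validate input even when Python runs with -O, the same checks can be written as explicit raises. A sketch of one possible production-style variant; normalize_checked is a hypothetical name.

```python
def normalize_checked(values):
    """Min-max scale values to [0, 1], raising on invalid input."""
    if len(values) == 0:
        raise ValueError("values cannot be empty")
    min_val, max_val = min(values), max(values)
    if min_val == max_val:
        raise ValueError("all values are identical, cannot normalize")
    return [(v - min_val) / (max_val - min_val) for v in values]

print(normalize_checked([1, 2, 3, 4, 5]))  # [0.0, 0.25, 0.5, 0.75, 1.0]

try:
    normalize_checked([7, 7, 7])
except ValueError as e:
    print(f"Error: {e}")  # Error: all values are identical, cannot normalize
```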
10. Itertools & Useful Builtins¶
enumerate — loop with index¶
Use enumerate instead of manually tracking an index variable with i += 1.
features = ["age", "income", "region", "plan_type"]
for i, feature in enumerate(features):
    print(f"{i}: {feature}")
# 0: age
# 1: income
# 2: region
# 3: plan_type
# Start from a different index
for i, feature in enumerate(features, start=1):
    print(f"{i}. {feature}")
# 1. age 2. income 3. region 4. plan_type
zip — pair up iterables¶
Combine two or more iterables element-by-element. Stops at the shortest iterable.
models = ["LogReg", "RandomForest", "XGBoost"]
scores = [0.87, 0.91, 0.93]
times = [0.1, 2.3, 1.8]
for model, score, time in zip(models, scores, times):
    print(f"{model}: acc={score:.2f}, time={time}s")
# LogReg: acc=0.87, time=0.1s
# RandomForest: acc=0.91, time=2.3s
# XGBoost: acc=0.93, time=1.8s
# Unzip — transpose a list of tuples
pairs = [("a", 1), ("b", 2), ("c", 3)]
keys, vals = zip(*pairs)
print(list(keys)) # ['a', 'b', 'c']
print(list(vals)) # [1, 2, 3]
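Because zip stops at the shortest iterable, mismatched lengths silently drop items. When every element must be kept, itertools.zip_longest pads the shorter inputs instead. A sketch with a deliberately incomplete scores list:

```python
from itertools import zip_longest

models = ["LogReg", "RandomForest", "XGBoost"]
scores = [0.87, 0.91]  # one score missing

truncated = list(zip(models, scores))
padded = list(zip_longest(models, scores, fillvalue=None))

print(truncated)  # [('LogReg', 0.87), ('RandomForest', 0.91)]
print(padded)     # [('LogReg', 0.87), ('RandomForest', 0.91), ('XGBoost', None)]
```

On Python 3.10+, zip(models, scores, strict=True) raises a ValueError when the lengths differ, which is often the safer choice for parallel data.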
sorted with key, any, all¶
data = [("Alice", 29, 0.91), ("Bob", 34, 0.78), ("Carol", 27, 0.95)]
# Sort by score descending
by_score = sorted(data, key=lambda row: row[2], reverse=True)
print(by_score[0]) # ('Carol', 27, 0.95)
# any — True if at least one element is truthy
scores = [0.65, 0.72, 0.88]
print(any(s > 0.85 for s in scores)) # True
print(any(s > 0.90 for s in scores)) # False
# all — True only if every element is truthy
print(all(s > 0.60 for s in scores)) # True
print(all(s > 0.70 for s in scores)) # False
itertools.chain — flatten iterables¶
from itertools import chain
week1 = ["python", "numpy", "pandas"]
week2 = ["sklearn", "matplotlib", "statsmodels"]
all_topics = list(chain(week1, week2))
print(all_topics)
# ['python', 'numpy', 'pandas', 'sklearn', 'matplotlib', 'statsmodels']
# chain.from_iterable flattens a single iterable of iterables
batches = [[1, 2], [3, 4], [5, 6]]
flat = list(chain.from_iterable(batches))
print(flat) # [1, 2, 3, 4, 5, 6]
itertools.product — cartesian product¶
Use product instead of nested for loops when you want every combination of two or more iterables. Common in hyperparameter grid search.
from itertools import product
learning_rates = [0.01, 0.1]
max_depths = [3, 5, 7]
grid = list(product(learning_rates, max_depths))
print(grid)
# [(0.01, 3), (0.01, 5), (0.01, 7), (0.1, 3), (0.1, 5), (0.1, 7)]
for lr, depth in grid:
    print(f"lr={lr}, depth={depth}")
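With more than two hyperparameters, positional tuples become hard to read. One possible variation, sketched below, keeps the parameter names by expanding a dict of candidate values into a list of config dicts. The param_grid contents are illustrative, not taken from any particular library.

```python
from itertools import product

# Hypothetical search space: parameter name -> candidate values
param_grid = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
}

names = list(param_grid)
configs = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

print(len(configs))  # 6
print(configs[0])    # {'learning_rate': 0.01, 'max_depth': 3}
print(configs[-1])   # {'learning_rate': 0.1, 'max_depth': 7}
```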
itertools.combinations and permutations¶
from itertools import combinations, permutations
features = ["age", "income", "region"]
# All pairs of features (order doesn't matter)
for pair in combinations(features, 2):
    print(pair)
# ('age', 'income')
# ('age', 'region')
# ('income', 'region')
# All orderings of 2 features (order matters)
for perm in permutations(features, 2):
    print(perm)
# ('age', 'income'), ('age', 'region'), ('income', 'age'), ...
11. OOP Basics¶
class and __init__¶
A class bundles data (attributes) and behavior (methods). __init__ runs automatically when you create an instance. self refers to the instance itself — always the first parameter of instance methods.
class ModelEvaluator:
    def __init__(self, model_name, threshold=0.5):
        self.model_name = model_name
        self.threshold = threshold
        self.results = []
    def evaluate(self, y_true, y_pred):
        predictions = [1 if p >= self.threshold else 0 for p in y_pred]
        accuracy = sum(t == p for t, p in zip(y_true, predictions)) / len(y_true)
        self.results.append(accuracy)
        return accuracy
    def best_score(self):
        if not self.results:
            return None
        return max(self.results)
    def __repr__(self):
        # Controls what you see when you print the object
        return f"ModelEvaluator(model='{self.model_name}', threshold={self.threshold})"
evaluator = ModelEvaluator("XGBoost", threshold=0.4)
print(evaluator) # ModelEvaluator(model='XGBoost', threshold=0.4)
y_true = [1, 0, 1, 1, 0]
y_pred = [0.9, 0.2, 0.8, 0.45, 0.1]
print(evaluator.evaluate(y_true, y_pred)) # 1.0
Inheritance¶
Subclasses inherit all methods from the parent. Use super() to call the parent's __init__ so you don't duplicate setup code.
class BaseModel:
    def __init__(self, name):
        self.name = name
        self.is_trained = False
    def fit(self, X, y):
        self.is_trained = True
        print(f"{self.name} trained on {len(X)} samples")
    def predict(self, X):
        raise NotImplementedError("Subclasses must implement predict()")
class ThresholdClassifier(BaseModel):
    def __init__(self, name, threshold=0.5):
        super().__init__(name)  # call parent __init__
        self.threshold = threshold
    def predict(self, X):
        if not self.is_trained:
            raise RuntimeError("Call fit() before predict()")
        return [1 if x >= self.threshold else 0 for x in X]
clf = ThresholdClassifier("MyClassifier", threshold=0.6)
clf.fit([0.9, 0.2, 0.8], [1, 0, 1]) # MyClassifier trained on 3 samples
print(clf.predict([0.7, 0.4, 0.9])) # [1, 0, 1]
@property — controlled attribute access¶
Use @property to make a method look like an attribute. Lets you add validation or computation without changing the calling code.
class Dataset:
    def __init__(self, name, rows, cols):
        self.name = name
        self._rows = rows  # _ prefix signals "internal use"
        self._cols = cols
    @property
    def shape(self):
        return (self._rows, self._cols)
    @property
    def size(self):
        return self._rows * self._cols
    @property
    def rows(self):
        return self._rows
    @rows.setter
    def rows(self, value):
        if value < 0:
            raise ValueError("rows cannot be negative")
        self._rows = value
ds = Dataset("customers", 10000, 15)
print(ds.shape) # (10000, 15)
print(ds.size) # 150000
ds.rows = 12000
print(ds.shape) # (12000, 15)
@classmethod and @staticmethod¶
@classmethod receives the class itself as the first argument — use it for alternative constructors. @staticmethod is a plain function attached to the class for organizational purposes.
class Scaler:
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std
    @classmethod
    def from_data(cls, data):
        # Alternative constructor: compute parameters from data
        n = len(data)
        mean = sum(data) / n
        std = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
        return cls(mean, std)
    @staticmethod
    def validate(data):
        return all(isinstance(x, (int, float)) for x in data)
    def transform(self, x):
        return (x - self.mean) / self.std
data = [2.0, 4.0, 6.0, 8.0, 10.0]
scaler = Scaler.from_data(data) # classmethod — no need to pre-compute
print(f"mean={scaler.mean}, std={scaler.std}") # mean=6.0, std=2.83...
print(scaler.transform(8.0)) # ~0.707
print(Scaler.validate(data)) # True
print(Scaler.validate([1, "a", 3])) # False
12. Comprehensions¶
Side-by-side comparison of all four forms¶
Understanding all four comprehension types lets you pick the right tool for each situation.
numbers = [1, 2, 3, 4, 5]
# List comprehension — returns a list, eager
squares_list = [n**2 for n in numbers]
print(squares_list) # [1, 4, 9, 16, 25]
print(type(squares_list)) # <class 'list'>
# Dict comprehension — returns a dict
squares_dict = {n: n**2 for n in numbers}
print(squares_dict) # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
print(type(squares_dict)) # <class 'dict'>
# Set comprehension — returns a set (unique, unordered)
remainders = {n % 3 for n in numbers}
print(remainders) # {0, 1, 2}
print(type(remainders)) # <class 'set'>
# Generator expression — returns a lazy iterator, NOT a list
squares_gen = (n**2 for n in numbers)
print(squares_gen) # <generator object ...>
print(next(squares_gen)) # 1 — consumed one at a time
print(list(squares_gen)) # [4, 9, 16, 25] — rest of the values
List comprehension — with condition¶
data = [1, -2, 3, -4, 5, -6]
positives = [x for x in data if x > 0]
print(positives) # [1, 3, 5]
# if/else inside the expression (ternary) — note position differs from filter
clamped = [x if x > 0 else 0 for x in data]
print(clamped) # [1, 0, 3, 0, 5, 0]
# Nested comprehension — flatten a 2D list
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flat = [val for row in matrix for val in row]
print(flat) # [1, 2, 3, 4, 5, 6, 7, 8, 9]
Dict comprehension — real-world patterns¶
# Invert a label-to-index mapping
class_labels = {"cat": 0, "dog": 1, "bird": 2}
index_to_label = {v: k for k, v in class_labels.items()}
print(index_to_label) # {0: 'cat', 1: 'dog', 2: 'bird'}
# Build a lookup from two lists
features = ["age", "income", "region"]
dtypes = ["int", "float", "str"]
schema = {f: d for f, d in zip(features, dtypes)}
print(schema) # {'age': 'int', 'income': 'float', 'region': 'str'}
# Filter dict by value
scores = {"rf": 0.91, "lr": 0.78, "svm": 0.65, "xgb": 0.93}
top_models = {k: v for k, v in scores.items() if v >= 0.85}
print(top_models) # {'rf': 0.91, 'xgb': 0.93}
Generator expressions — when to use them¶
Use generator expressions when you don't need all the results at once — they compute values on demand and use far less memory than lists for large datasets.
import sys
large_range = range(1_000_000)
# List uses memory for all 1M items
list_mem = sys.getsizeof([n**2 for n in large_range])
# Generator uses constant memory regardless of size
gen_mem = sys.getsizeof((n**2 for n in large_range))
print(f"List: {list_mem:,} bytes")  # e.g. List: 8,697,464 bytes (exact size varies by Python version)
print(f"Generator: {gen_mem:,} bytes")  # e.g. Generator: 104 bytes (small, independent of the range size)
# Pass generator directly to functions that accept iterables
total = sum(n**2 for n in range(1000)) # no list created at all
print(total) # 332833500
has_large = any(n > 999 for n in range(10000)) # stops at first match
print(has_large) # True
Set comprehension — deduplication with transformation¶
# Extract unique domain names from a list of emails
emails = [
"alice@gmail.com",
"bob@company.com",
"carol@gmail.com",
"dave@university.edu",
"eve@company.com",
]
domains = {email.split("@")[1] for email in emails}
print(domains) # {'gmail.com', 'company.com', 'university.edu'}
# Unique first letters of feature names
features = ["age", "annual_income", "balance", "account_age", "balance_ratio"]
initials = {f[0] for f in features}
print(initials) # {'a', 'b'}
Tip
The most important habit: reach for a comprehension when you find yourself writing result = [] followed by a for loop with .append(). If the logic fits in one expression, the comprehension is almost always clearer and faster.
Warning
Nested comprehensions beyond two levels become unreadable. If you need three levels of nesting, write it as explicit loops or break it into helper functions.
Success
Key patterns to internalize: enumerate over manual indexing, zip for parallel iteration, .get() over [] for safe dict access, with open() for all file operations, and isinstance() over type() == for type checks.