A rigorous, project-driven curriculum focused on the most important and most overlooked part of data science — getting data right before a model ever runs. By Week 6, you will think like a data scientist, work with real messy datasets, and have a fully prepared feature pipeline ready for machine learning.
Highlights
Project 1 — The Churn Paradox
A subscription software company is losing customers. Draft a Technical Scoping Proposition defining: the target variable (accounting for inactivity vs. explicit cancellation), five features including one proxy variable representing an abstract concept, and the constraint explaining why a model trained on 2025 data could produce false positives when applied to 2026 data after a UI change. Written deliverable — no code required.
Project 2 — The LaTeX Optimizer
Build a "Living Document" in Jupyter that calculates Root Mean Squared Error (RMSE) from scratch: no libraries, pure Python logic. Create a dedicated Conda environment named ds_logic, export its environment.yml file, write the RMSE formula in LaTeX notation inside the notebook, and implement calculate_rmse(actual, predicted) using a list comprehension. Take the square root with the exponentiation operator (** 0.5) only; math.sqrt() is forbidden.
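As a minimal sketch of what the notebook might contain (not a reference solution), the formula to typeset is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ are the actual values, $\hat{y}_i$ the predicted values, and $n$ the number of observations. The function below follows the stated constraints (list comprehension, ** 0.5, no imports). The input checks are an added assumption, not part of the brief.

```python
def calculate_rmse(actual, predicted):
    """Root Mean Squared Error using only built-in Python (no imports)."""
    # Assumption: mismatched or empty inputs should fail loudly.
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("inputs must not be empty")

    # Squared error for each (actual, predicted) pair.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

    # Mean of the squared errors, then the square root via ** 0.5.
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return mean_squared_error ** 0.5
```

For example, calculate_rmse([3, 3], [0, 0]) has squared errors 9 and 9, a mean of 9, and therefore an RMSE of 3.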
Project 3 — The Memory Gatekeeper
Build a smart data loader, loader.py, for a large real-world dataset (NYC Taxi Data). The script must: peek at the first 100 rows to detect data types, optimize memory by mapping float64 to float32 and int64 to int32 (roughly halving the memory used by those columns), and keep only rows where Trip_Distance is greater than 0, all during the ingestion phase, not after. The project demonstrates production-aware data engineering from day one.
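The three requirements could be sketched with pandas roughly as follows. This is one possible shape for loader.py, not the expected solution. The function names infer_dtypes and load_trips, the chunk size, and the choice of pandas itself are assumptions. The sketch also assumes that the first 100 rows are representative, which may not hold if a later row contains a missing value or an integer too large for int32.

```python
import pandas as pd

# Downcast map from the project brief.
DOWNCAST = {"float64": "float32", "int64": "int32"}


def infer_dtypes(path, sample_rows=100):
    """Peek at the first rows and return a dtype map for downcastable columns."""
    sample = pd.read_csv(path, nrows=sample_rows)
    return {
        col: DOWNCAST[str(dtype)]
        for col, dtype in sample.dtypes.items()
        if str(dtype) in DOWNCAST
    }


def load_trips(path, chunksize=100_000, distance_col="Trip_Distance"):
    """Read the file in chunks, downcasting and filtering during ingestion."""
    dtypes = infer_dtypes(path)
    chunks = pd.read_csv(path, dtype=dtypes, chunksize=chunksize)
    # Each chunk is filtered before it is kept, so rows with a
    # non-positive distance never accumulate in memory.
    kept = [chunk[chunk[distance_col] > 0] for chunk in chunks]
    return pd.concat(kept, ignore_index=True)
```

Calling df.memory_usage(deep=True).sum() on the result, and on a plain pd.read_csv of the same file, is a simple way to check how much memory the downcast actually saves.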
Project 4 — The Imputation Integrity Test
Clean a dataset of medical records where Heart Rate values are either missing or biologically impossible (recorded as 0). Apply three rules: treat 0 as NaN, impute missing values with the group average by Age Group (not the global average), and drop any patient record with more than three missing vital signs. Output a comparison table showing the Heart Rate mean and standard deviation before and after imputation. If the standard deviation shifts by more than 10%, the imputation logic is considered too aggressive and fails.
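A rough sketch of the three rules and the 10% check, using pandas, might look like this. The function names, the default column names, and the vital_cols argument are hypothetical. Two further assumptions are made: records are dropped before imputation so that imputed values cannot hide missing vitals, and "before" means the valid Heart Rate readings with 0 already treated as NaN.

```python
import numpy as np
import pandas as pd


def clean_vitals(df, vital_cols, hr_col="Heart Rate",
                 group_col="Age Group", max_missing=3):
    """Apply the three cleaning rules and return a new DataFrame."""
    out = df.copy()
    # Rule 1: a heart rate of 0 is biologically impossible, so treat it as missing.
    out[hr_col] = out[hr_col].replace(0, np.nan)
    # Rule 3 (assumed to run before imputation): drop records with
    # more than `max_missing` missing vital signs.
    out = out[out[vital_cols].isna().sum(axis=1) <= max_missing].copy()
    # Rule 2: fill remaining gaps with the mean of the patient's age group.
    group_mean = out.groupby(group_col)[hr_col].transform("mean")
    out[hr_col] = out[hr_col].fillna(group_mean)
    return out


def compare_stats(before, after, tolerance=0.10):
    """Return a before/after table and whether the std shift is within tolerance."""
    table = pd.DataFrame(
        {"before": [before.mean(), before.std()],
         "after": [after.mean(), after.std()]},
        index=["mean", "std"],
    )
    # Relative change in standard deviation (assumes a non-zero std before).
    shift = abs(table.loc["std", "after"] - table.loc["std", "before"])
    shift /= table.loc["std", "before"]
    return table, bool(shift <= tolerance)
```

Group-mean imputation tends to shrink the standard deviation because every filled value sits exactly at its group's center, which is why the check is worth running.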
Project 5 — The EDA Detective: Simpson's Paradox
A university is accused of gender bias in admissions. Overall data shows men admitted at a higher rate, but data broken down by department shows the opposite. Using a provided admissions.csv, calculate the global admission rate by gender, then deconstruct it by department. Visualize why the global average is misleading. Write a three-sentence logical proof identifying and explaining the lurking variable. Simpson's paradox is a real statistical phenomenon that has appeared in landmark legal cases.
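The two calculations could be sketched as below. The layout of admissions.csv is not specified here, so the sketch assumes one row per applicant with hypothetical columns gender, department, and admitted (1 or 0). The function name admission_rates is also an assumption.

```python
import pandas as pd


def admission_rates(df):
    """Return overall rates, per-department rates, and applicant counts."""
    # Global admission rate by gender: the mean of a 0/1 column is a rate.
    overall = df.groupby("gender")["admitted"].mean()
    # The same rate, split by department (rows) and gender (columns).
    by_dept = (
        df.groupby(["department", "gender"])["admitted"]
        .mean()
        .unstack("gender")
    )
    # Applicant counts show where each gender applied, which is
    # where the lurking variable becomes visible.
    applicants = (
        df.groupby(["department", "gender"])
        .size()
        .unstack("gender", fill_value=0)
    )
    return overall, by_dept, applicants
```

If one gender applies mostly to departments with low admission rates, its global rate can be lower even when its rate is higher in every single department. Plotting by_dept next to applicants is one way to make that visible.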
Project 6 — The Vectorizer (Capstone)
Build a transformation function preprocess_pipeline(raw_df) that converts a messy real-world dataset into a clean NumPy array ready for a machine learning model. The pipeline must handle four transformations: cyclical encoding of timestamps (converting Hour into sine and cosine features so 23:00 and 01:00 are logically close), target encoding of city names (replacing each city with the average price for that city), Min-Max scaling of all numeric values to a range of [0, 1], and a final validation check ensuring the output contains no strings and no nulls. This is the most technically demanding project of the course.
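The four steps might fit together as in the sketch below. It assumes hypothetical column names timestamp, city, and price, and it excludes the target price from the output array. It also computes city averages on the full input for brevity. In practice those averages would come from the training split only, to avoid leaking the target.

```python
import numpy as np
import pandas as pd


def preprocess_pipeline(raw_df):
    """Turn a raw DataFrame into a scaled, fully numeric NumPy array."""
    df = raw_df.copy()

    # 1. Cyclical encoding: map the hour onto a circle so that
    #    23:00 and 01:00 end up close together.
    hour = pd.to_datetime(df["timestamp"]).dt.hour
    angle = 2 * np.pi * hour / 24
    df["hour_sin"] = np.sin(angle)
    df["hour_cos"] = np.cos(angle)

    # 2. Target encoding: replace each city with its average price.
    df["city_encoded"] = df.groupby("city")["price"].transform("mean")

    # Drop the raw columns and the target (assumption: price is the label).
    features = df.drop(columns=["timestamp", "city", "price"])

    # 4a. Validation: no string (or other non-numeric) columns may remain.
    non_numeric = features.select_dtypes(exclude="number").columns
    if len(non_numeric) > 0:
        raise ValueError(f"non-numeric columns remain: {list(non_numeric)}")

    # 3. Min-Max scaling to [0, 1]; constant columns map to 0.
    col_min = features.min()
    col_range = (features.max() - col_min).replace(0, 1)
    scaled = (features - col_min) / col_range

    # 4b. Validation: no nulls in the final array.
    result = scaled.to_numpy(dtype="float64")
    if np.isnan(result).any():
        raise ValueError("output contains null values")
    return result
```

The sine and cosine pair is needed together: either one alone maps two different hours to the same value.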
| Week | Project | Output |
|---|---|---|
| Week 1 | The Churn Paradox | Technical Scoping Document |
| Week 2 | The LaTeX Optimizer | Jupyter Notebook · Conda Environment |
| Week 3 | The Memory Gatekeeper | Production-Ready Data Loader Script |
| Week 4 | The Imputation Integrity Test | Validated Cleaning Pipeline |
| Week 5 | Simpson's Paradox EDA | Statistical Analysis · Visualizations |
| Week 6 | The Vectorizer (Capstone) | Full Feature Engineering Pipeline |
You finish the course knowing how to take any raw dataset — no matter how broken — and prepare it for machine learning with production-aware, mathematically sound logic.
Every week opens with a real business or scientific problem before any Python is written. This is the Harvard CS109 model — the code only matters because the question came first.
The Churn Paradox, The Memory Gatekeeper, and The Imputation Integrity Test are not textbook exercises. They are the kinds of problems data scientists and data engineers encounter in their first three months on the job.
Data cleaning is commonly estimated at 80% of real data science work, yet it gets less than 10% of the attention in most courses. Week 4 is dedicated entirely to it, with a project that will actually break if the logic is wrong.
This course ends exactly where machine learning begins. Students who complete it arrive at a modeling course with a skill set that most people entering those courses are still missing.
Methods inspired by Harvard CS109 Data Science and MIT 6.036 Machine Learning