Data Science

₦45,000

Pre-enrolling for the next cohort, which starts 29 June 2026 (tentative)

Pre-enrolment for the next cohort is active. Registration is open, but your cohort begins on 29 June 2026 (tentative). You'll receive access when classes begin.

Curriculum

A rigorous, project-driven curriculum focused on the most important and most overlooked part of data science — getting data right before a model ever runs. By Week 6, you will think like a data scientist, work with real messy datasets, and have a fully prepared feature pipeline ready for machine learning.

Highlights

No prior data science experience required — basic Python familiarity helpful
6 portfolio projects, all documented and shareable
Python · Pandas · NumPy · Jupyter · SQL · EDA · Feature Engineering
3 classes per week · 40–60 minutes per class
Methods inspired by Harvard CS109 Data Science and MIT 6.036 Machine Learning

Week 1: What You'll Learn

What Data Science is vs. Machine Learning — and why the distinction matters
The modern data science skill set and common ML algorithms
The 6-step Data Science Lifecycle: from Scoping to Insights
Introduction to Data Preparation and Exploratory Data Analysis (EDA)
How to read a data problem like a consultant, not a coder
Course structure, project framework, and resource setup

Project

Project 1 — The Churn Paradox

A subscription software company is losing customers. Draft a Technical Scoping Proposition defining: the target variable (accounting for inactivity vs. explicit cancellation), five features including one proxy variable representing an abstract concept, and the constraint explaining why a model trained on 2025 data could produce false positives when applied to 2026 data after a UI change. Written deliverable — no code required.

Week 2: What You'll Learn

End-user perspective and problem brainstorming before opening a dataset
Supervised vs. unsupervised learning — choosing the right approach for the right problem
Identifying data sources, understanding data structures, and defining model features
Installing Anaconda and Jupyter Notebook
Notebook mastery: Edit vs. Command Mode, Code cells vs. Markdown cells
Conda environments: creating, managing, and exporting reproducible setups

Project

Project 2 — The LaTeX Optimizer

Build a "Living Document" in Jupyter that calculates Root Mean Squared Error (RMSE) from scratch, using pure Python logic and no libraries. Create a dedicated Conda environment named ds_logic, export its environment.yml file, write the RMSE formula in LaTeX notation inside the notebook, and implement calculate_rmse(actual, predicted) using a list comprehension. Take the square root with the exponentiation operator (** 0.5) only; math.sqrt() is forbidden.
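A minimal sketch of what such a function might look like, assuming both inputs are equal-length lists of numbers. In LaTeX notation the formula is \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}. The sample values and the input checks are illustrative additions, not part of the project brief.

```python
def calculate_rmse(actual, predicted):
    """Root Mean Squared Error using only built-in Python.

    RMSE = sqrt( (1/n) * sum_i (actual_i - predicted_i)^2 )
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")

    # List comprehension: one squared error per (actual, predicted) pair.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    mean_squared_error = sum(squared_errors) / len(squared_errors)

    # Square root via the exponentiation operator, not math.sqrt().
    return mean_squared_error ** 0.5


# Hypothetical sample values: squared errors are 4, 0, 4, so MSE = 8/3.
rmse = calculate_rmse([3, 5, 7], [1, 5, 9])
print(rmse)
```

Raising an error on mismatched lengths is one possible design choice; without it, zip() would silently ignore the extra values in the longer list.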

Week 3: What You'll Learn

Structured vs. unstructured data — what each requires
The Pandas DataFrame architecture: how data is stored and accessed
Reading flat files: CSV, TXT, and Excel into Pandas
Connecting to SQL databases directly from Python
Quick exploration techniques: shape, dtypes, nulls, and value counts at a glance
What to look for in the first five minutes with any new dataset
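The quick exploration techniques listed above can be sketched as follows. This is an illustration only: the in-memory CSV, its column names, and its values are hypothetical stand-ins for a real file.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real file path.
csv_text = """city,price,rooms
Lagos,120.5,3
Abuja,,2
Lagos,98.0,
Kano,75.25,1
"""

df = pd.read_csv(io.StringIO(csv_text))

shape = df.shape                          # (rows, columns)
dtypes = df.dtypes                        # inferred type of each column
null_counts = df.isna().sum()             # missing values per column
city_counts = df["city"].value_counts()   # frequency of each category

print(shape)
print(dtypes)
print(null_counts)
print(city_counts)
```

Note that a numeric column containing a missing value is read as a float column, which is one of the things a first look at dtypes tends to reveal.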

Project

Project 3 — The Memory Gatekeeper

Build a smart data loader, loader.py, for a large real-world dataset (NYC Taxi Data). The script must peek at the first 100 rows to detect data types, optimize memory by mapping float64 to float32 and int64 to int32 (roughly halving the memory used by those columns), and keep only rows where Trip_Distance is greater than 0, all during the ingestion phase rather than after it. The project demonstrates production-aware data engineering from day one.
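One possible shape for such a loader is sketched below. It is not the course's reference solution. The function name load_optimized, the source_factory parameter, and the sample columns other than Trip_Distance are assumptions made for illustration. The sketch also assumes the dtypes seen in the first 100 rows hold for the whole file; an integer column that later contains missing values would make the int32 cast fail.

```python
import io

import pandas as pd

# Assumption: these downcasts are safe for the data at hand. int32 overflows
# above roughly 2.1 billion and float32 keeps only about 7 significant digits.
DOWNCAST = {"float64": "float32", "int64": "int32"}


def load_optimized(source_factory, filter_col="Trip_Distance",
                   peek_rows=100, chunksize=50_000):
    """Load a CSV with smaller dtypes, filtering rows while reading.

    source_factory is a zero-argument callable returning a file path or a
    fresh file-like object, because the source is read twice.
    """
    # Step 1: peek at the first rows to infer column dtypes.
    sample = pd.read_csv(source_factory(), nrows=peek_rows)
    dtype_map = {col: DOWNCAST.get(str(dtype), dtype)
                 for col, dtype in sample.dtypes.items()}

    # Step 2: read in chunks with the smaller dtypes and keep only valid
    # trips, so rejected rows never accumulate in memory.
    chunks = [chunk[chunk[filter_col] > 0]
              for chunk in pd.read_csv(source_factory(), dtype=dtype_map,
                                       chunksize=chunksize)]
    return pd.concat(chunks, ignore_index=True)


# Hypothetical three-row sample standing in for the real taxi file.
csv_text = ("Trip_Distance,Passenger_Count,Fare\n"
            "1.5,1,7.0\n"
            "0.0,2,3.5\n"
            "3.2,1,12.25\n")
trips = load_optimized(lambda: io.StringIO(csv_text))
print(trips.dtypes)
```

Reading in chunks is what keeps the filter inside the ingestion phase: each chunk is filtered before the next one is read, rather than loading everything and filtering afterwards.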

Week 4: What You'll Learn

Converting numeric and DateTime column types correctly
Detecting, removing, and imputing missing data
Mapping values, fixing typos, and applying logical condition-based updates
Handling duplicate records without dropping valid data
Statistical outlier detection: box plots, histograms, and standard deviation
Creating new numeric, text, and DateTime columns through feature engineering
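As an illustration of the statistical outlier detection mentioned above, the sketch below applies two common rules to a hypothetical series of readings: the 1.5 × IQR rule that box plots are based on, and a standard-deviation rule. The thresholds (1.5 and 2) are conventional choices, not values prescribed by the course.

```python
import pandas as pd

# Hypothetical readings with one obviously extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 95], name="reading")

# Box-plot rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Standard-deviation rule: flag points more than 2 standard deviations
# away from the mean.
z_scores = (values - values.mean()) / values.std()
std_outliers = values[z_scores.abs() > 2]

print(iqr_outliers.tolist())
print(std_outliers.tolist())
```

The extreme value inflates the mean and standard deviation it is being compared against, which is why the standard-deviation rule tends to be less sensitive than the IQR rule on small samples.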

Project

Project 4 — The Imputation Integrity Test

Clean a dataset of medical records where Heart Rate values are either missing or biologically impossible (recorded as 0). Apply three rules: treat 0 as NaN, impute missing values with the average for the patient's Age Group rather than the global average, and drop any patient record with more than three missing vital signs. Output a comparison table showing the mean and standard deviation before and after. If the standard deviation shifts by more than 10%, the imputation logic is considered too aggressive and fails.
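The first two rules and the 10% check might be sketched as below. The column names and values are hypothetical. The sketch assumes that the "before" statistics are measured on the valid readings only, after zeros are marked as missing, since the brief does not specify this. The third rule is omitted because the sample has a single vital sign.

```python
import numpy as np
import pandas as pd

# Hypothetical records; column names are assumptions for illustration.
records = pd.DataFrame({
    "age_group":  ["18-40", "18-40", "18-40", "41-65", "41-65", "41-65"],
    "heart_rate": [70, 0, 74, 80, np.nan, 84],
})

cleaned = records.copy()

# Rule 1: a heart rate of 0 is biologically impossible, so treat it as missing.
cleaned["heart_rate"] = cleaned["heart_rate"].replace(0, np.nan)

# Assumption: "before" is measured on the valid readings only.
before = cleaned["heart_rate"].agg(["mean", "std"])

# Rule 2: fill each gap with the mean of the patient's own age group.
group_mean = cleaned.groupby("age_group")["heart_rate"].transform("mean")
cleaned["heart_rate"] = cleaned["heart_rate"].fillna(group_mean)

after = cleaned["heart_rate"].agg(["mean", "std"])

# Comparison table plus the relative shift in standard deviation.
comparison = pd.DataFrame({"before": before, "after": after})
std_shift = abs(after["std"] - before["std"]) / before["std"]

print(comparison)
print(f"std shift: {std_shift:.1%}")
```

Mean imputation always pulls the standard deviation down, because every filled value sits exactly at a group center. That is the effect the 10% threshold is designed to catch.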

Week 5: What You'll Learn

Advanced filtering, sorting, and groupby aggregations in Pandas
Statistical distributions: normal vs. skewed, and why the difference matters for modeling
Correlation analysis and scatter plots: identifying relationships between variables
Pair plots for multi-variable analysis across an entire dataset
Spotting misleading aggregates and hidden variables in data
Building a complete EDA narrative: question, exploration, finding, implication

Project

Project 5 — The EDA Detective: Simpson's Paradox

A university is accused of gender bias in admissions. The overall data shows men admitted at a higher rate, but the same data broken down by department shows the opposite. Using a provided admissions.csv, calculate the global admission rate by gender, then break it down by department. Visualize why the global average is misleading, and write a three-sentence logical proof identifying and explaining the lurking variable. Simpson's paradox is a real statistical phenomenon that has appeared in landmark legal cases.
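The reversal can be reproduced with a small made-up table. The counts below are hypothetical and chosen only to make the effect visible; the layout of the course's admissions.csv may differ.

```python
import pandas as pd

# Hypothetical aggregated counts, invented for illustration.
admissions = pd.DataFrame({
    "department": ["A", "A", "B", "B"],
    "gender":     ["men", "women", "men", "women"],
    "applied":    [800, 100, 200, 900],
    "admitted":   [480, 70, 40, 225],
})

# Global admission rate by gender: sum the counts first, then divide.
totals = admissions.groupby("gender")[["applied", "admitted"]].sum()
global_rate = totals["admitted"] / totals["applied"]

# Admission rate by department and gender.
by_dept = admissions.assign(rate=admissions["admitted"] / admissions["applied"])
dept_rate = by_dept.pivot(index="department", columns="gender", values="rate")

print(global_rate)   # men are ahead overall
print(dept_rate)     # women are ahead in every department
```

In this made-up data most women applied to department B, which admits far fewer applicants of either gender. Department choice is the lurking variable that drags the overall rate for women down.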

Week 6: What You'll Learn

Appending and joining datasets: inner, outer, left, and right joins
Creating the single clean table a model actually needs
One-hot encoding: converting categorical variables to numbers
Feature scaling and transformations: when and why to normalize
Proxy variables: representing abstract concepts that cannot be directly measured
Model evaluation preparation: train/test split logic before modeling begins
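A short sketch of three of the steps above: a left join, one-hot encoding, and a simple train/test split. The tables, column names, and the 75% split are hypothetical. The split here uses pandas only; in practice scikit-learn's train_test_split is a common choice.

```python
import pandas as pd

# Hypothetical tables; names and values are for illustration only.
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4],
                          "plan": ["basic", "pro", "basic", "team"]})
usage = pd.DataFrame({"customer_id": [1, 2, 3, 5],
                      "logins": [12, 40, 3, 7]})

# Left join: keep every customer, even those with no usage row.
table = customers.merge(usage, on="customer_id", how="left")

# One-hot encode the categorical column into one indicator column per plan.
table = pd.get_dummies(table, columns=["plan"], dtype=int)

# Train/test split: shuffle rows with a fixed seed, then cut at 75%.
shuffled = table.sample(frac=1.0, random_state=0)
cutoff = int(len(shuffled) * 0.75)
train, test = shuffled.iloc[:cutoff], shuffled.iloc[cutoff:]

print(table)
```

Customer 4 has no usage record, so the left join leaves logins missing for that row. An inner join would have dropped the customer instead, which is the kind of difference worth checking before a table reaches a model.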

Capstone Project — The Vectorizer

Build a transformation function preprocess_pipeline(raw_df) that converts a messy real-world dataset into a clean NumPy array ready for a machine learning model. The pipeline must handle four transformations: cyclical encoding of timestamps (converting Hour into sine and cosine features so 23:00 and 01:00 are logically close), target encoding of city names (replacing each city with the average price for that city), Min-Max scaling of all numeric values to a range of [0, 1], and a final validation check ensuring the output contains no strings and no nulls. This is the most technically demanding project of the course.

End-of-Course Outcome

6 Projects — A Complete Data Preparation Portfolio

Week	Project	Output
Week 1	The Churn Paradox	Technical Scoping Document
Week 2	The LaTeX Optimizer	Jupyter Notebook · Conda Environment
Week 3	The Memory Gatekeeper	Production-Ready Data Loader Script
Week 4	The Imputation Integrity Test	Validated Cleaning Pipeline
Week 5	The EDA Detective: Simpson's Paradox	Statistical Analysis · Visualizations
Week 6	The Vectorizer (Capstone)	Full Feature Engineering Pipeline

You finish the course knowing how to take any raw dataset — no matter how broken — and prepare it for machine learning with production-aware, mathematically sound logic.

Why This Curriculum Works

Problem First, Code Second

Every week opens with a real business or scientific problem before any Python is written. This approach is inspired by Harvard CS109: the code only matters because the question came first.

Projects That Reflect Real Work

The Churn Paradox, The Memory Gatekeeper, and The Imputation Integrity Test are not textbook exercises. They are the kinds of problems data scientists and data engineers encounter in their first three months on the job.

The Hard Part Gets Its Own Week

Data cleaning is often said to make up as much as 80% of real data science work, yet it gets only a small share of the attention in most courses. Week 4 is dedicated entirely to it, with a project that will actually break if the logic is wrong.

Modeling Readiness Is the Capstone

This course ends exactly where machine learning begins. Students who complete it arrive at a modeling course with a skill set that most people entering those courses are still missing.

Price

₦45,000