DSI Curriculum

Our aim is to create a single introductory data science course for students from all grade levels in high school. We want to teach students the very foundations of data science: the logic and the execution. The logic involves teaching fundamental mathematical and statistical concepts, and the execution involves teaching them basic programming to apply these concepts. By the end of the course, most students should be able to understand and appreciate what data science is and how analysis works.

While it would be ideal for all students to have computer access during the course, we can’t guarantee this. As such, each lesson will consist of a worksheet that students can complete on paper, as well as a Jupyter notebook that will allow them to apply the concepts they learned in Python, often on real-world data. This means that the notebook content must be auxiliary to the worksheet content; all of the core content needs to be expressed in the worksheets.

Lesson 0: Setup

Software used with our curriculum

Installation Guide

Lesson 1: Why Data Science

Definitions of data and data science, why we should care about the subject, and applications of the field

Worksheet

Examples of data analysis and visualization using Python on real-world datasets

Notebook

Lesson 2: Introduction to Statistics and Python (Part 1)

Basic statistical quantities such as mean, median, and standard deviation; distinction between sample and population

Worksheet

Variables, data types, and conditional statements in Python

Notebook
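
As a rough illustration of the quantities this lesson covers, the sketch below computes them with Python's standard `statistics` module. The `scores` list is made-up data, not taken from the course materials. The two standard deviation calls show the sample versus population distinction: `pstdev` divides by n, while `stdev` divides by n - 1.

```python
import statistics

# Hypothetical quiz scores, made up for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

# Treating the scores as the entire population: divide by n.
population_sd = statistics.pstdev(scores)

# Treating the scores as a sample from a larger population: divide by n - 1.
sample_sd = statistics.stdev(scores)

print("mean:", mean_score)
print("median:", median_score)
print("population SD:", population_sd)
print("sample SD:", round(sample_sd, 3))
```

The sample standard deviation comes out slightly larger than the population one, because dividing by n - 1 compensates for estimating the mean from the same data.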

Lesson 3: Introduction to Statistics and Python (Part 2)

Concepts of standard units and percentiles

Worksheet

Loops, functions, and arrays (using the NumPy library)

Notebook
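
The two concepts in this lesson can be sketched with a loop-free pair of functions. This is a minimal example using only the standard library; the function names and the `scores` data are hypothetical, and the percentile uses the nearest-rank definition, which is only one of several definitions in common use.

```python
import math
import statistics


def standard_units(values):
    """Convert each value to the number of SDs it lies above or below the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; assumes the values are not all equal
    return [(v - mean) / sd for v in values]


def percentile(p, values):
    """Nearest-rank percentile: the smallest value at least as large as p% of the values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical quiz scores, made up for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

z = standard_units(scores)
print("standard units:", z)
print("50th percentile:", percentile(50, scores))
```

A score of 9 here is 2 standard units, meaning it sits two standard deviations above the mean of 5.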

Lesson 4: Collecting Data and Tables

Distinction between observational studies and controlled experiments; collecting data and analyzing survey responses

Worksheet

Creating and manipulating tables using Python's datascience library (developed at UC Berkeley)

Notebook

Lesson 5: Introduction to Probability and Python Application Problems

Calculating the probability of simple and compound events, concept of conditional probability

Worksheet

Using Python with the NumPy and datascience libraries to solve statistics problems with real-world data

Notebook
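
One way to make compound and conditional probability concrete is to count equally likely outcomes. The sketch below does this for two fair dice; the `probability` helper and the event functions are hypothetical names chosen for illustration, not part of the course notebooks.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))


def probability(event, given=lambda outcome: True):
    """P(event | given), found by counting equally likely outcomes."""
    conditioned = [o for o in outcomes if given(o)]
    favorable = [o for o in conditioned if event(o)]
    return Fraction(len(favorable), len(conditioned))


def sum_is_8(outcome):
    return outcome[0] + outcome[1] == 8


def first_is_even(outcome):
    return outcome[0] % 2 == 0


# Simple event: the two dice add up to 8.
p_sum_8 = probability(sum_is_8)

# Compound event: the sum is 8 AND the first die is even.
p_both = probability(lambda o: sum_is_8(o) and first_is_even(o))

# Conditional probability: the sum is 8, GIVEN that the first die is even.
p_sum_8_given_even = probability(sum_is_8, given=first_is_even)

print(p_sum_8, p_both, p_sum_8_given_even)
```

The counts agree with the usual rule that P(A given B) equals P(A and B) divided by P(B).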

Lesson 6: Probability – Digging Deeper

Concept of distributions (with emphasis on the normal distribution), relation between empirical and probability distributions

Worksheet

Creating normal distributions and comparing the empirical and probability distributions for a simulated experiment

Notebook
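
The relation between an empirical and a probability distribution can be sketched with a simulated die. This is an assumed example, and the course notebook may use a different experiment. The function name, the seed, and the number of rolls are all arbitrary choices made for illustration.

```python
import random
from collections import Counter


def empirical_distribution(num_rolls, seed=0):
    """Simulate rolls of a fair die and return the observed proportion of each face."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    rolls = [rng.randint(1, 6) for _ in range(num_rolls)]
    counts = Counter(rolls)
    return {face: counts[face] / num_rolls for face in range(1, 7)}


# The probability distribution: every face of a fair die has chance 1/6.
probability_distribution = {face: 1 / 6 for face in range(1, 7)}

empirical = empirical_distribution(10_000)

for face in range(1, 7):
    print(face, round(empirical[face], 3), round(probability_distribution[face], 3))
```

With more rolls, the empirical proportions tend to move closer to 1/6, which is the idea the lesson builds on.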

Lesson 7: Data Visualization

Importance of visualizing data, different types of graphs, misrepresentation of data

Worksheet

Using the Matplotlib library to generate graphs of real-world data in Python

Notebook

Lesson 8: Finding Trends with Correlation

Calculating the correlation coefficient for two sets of data points, types of correlation, preview of regression

Worksheet

Finding correlation coefficients to describe scatterplots

Notebook
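
A minimal sketch of the correlation coefficient, assuming the common definition of r as the average product of the two variables in standard units. The function names and the `hours` and `scores` data are hypothetical; only the standard library is used.

```python
import statistics


def standard_units(values):
    """Convert each value to the number of SDs it lies above or below the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; assumes the values are not all equal
    return [(v - mean) / sd for v in values]


def correlation(xs, ys):
    """Correlation coefficient r: the mean of the products of x and y in standard units."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    products = [zx * zy for zx, zy in zip(standard_units(xs), standard_units(ys))]
    return statistics.mean(products)


# Hypothetical data: hours studied and test scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 60, 57, 70, 74, 81]

r = correlation(hours, scores)
print("r =", round(r, 3))
```

Values of r near 1 or -1 indicate points that cluster tightly around a line, while values near 0 indicate little linear association.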

Lesson 9: Linear Regression

When to use linear regression, how it works, calculating the equation of a regression line to make predictions

Worksheet

Using Python to find regression lines to fit datasets, visualizing why regression works

Notebook
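
The regression line can be sketched from the correlation coefficient, assuming the usual least-squares formulas: the slope is r times (SD of y / SD of x), and the intercept is (mean of y) minus slope times (mean of x). The function names and data below are hypothetical and use only the standard library.

```python
import statistics


def regression_line(xs, ys):
    """Return (slope, intercept) of the least-squares line for predicting y from x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)

    # Correlation coefficient: mean of the products of x and y in standard units.
    r = statistics.mean(
        [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)]
    )

    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Use the regression line to predict y for a new x."""
    return slope * x + intercept


# Hypothetical data: hours studied and test scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 60, 57, 70, 74, 81]

slope, intercept = regression_line(hours, scores)
print("predicted score for 7 hours:", round(predict(7, slope, intercept), 1))
```

Predictions are most trustworthy for x values inside the range of the original data; extending the line far beyond that range is a guess.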

Lesson 10: Case Study

A series of technical questions assessing knowledge of course concepts, using data selected from a survey of engineering graduates across India

Worksheet

Questions similar to the worksheet's, but with an emphasis on using Python to write functions and manipulate tables to gain insights about the entire dataset

Notebook

Lesson 11: Project

Using course concepts to draw conclusions about crimes against women and literacy in Indian states

Worksheet

Same goal as the worksheet, but using Python tools to answer more open-ended questions

Notebook

Lesson 12: Data Ethics

Discussion of real-world examples of the prevalence of data, as well as the inherent moral and ethical implications of using (and misusing) these data

Worksheet