# Description

### See also: syllabus

# Short description

Drew Conway's data science Venn diagram illustrates the technical skills a data scientist needs.

This course is designed to take people who are solidly in the green and move them into the green/pink intersection. Actually, rather than hacking skills, we would prefer to teach software development fundamentals (in the spirit of Software Carpentry) along with some important data tools. This will be done using real data, and without sacrificing statistical rigor (thus emphasizing the *science* in data science). In order to make time for all this, we will reduce the number of algorithms that we introduce. These last two points differentiate this course from a traditional course in machine learning.

## Logistics

- Number: STAT 4249
- Class times: MW 6:10 - 7:25
- Lead instructor: Ian Langmore
- Instructor: Daniel Krasner
- Instructor: Chang She

## This class is...

- a way to learn/write basic algorithms for statistical inference and predictive analytics
- a chance to apply algorithms to real data sets and gain data science intuition
- a way to learn solid programming skills (beginning through intermediate)
  - Python
  - Linux
  - GitHub
  - Collaborative development
  - Object-oriented design

- for people with an understanding of statistics at the first-year graduate level or beyond
- for people with some programming experience (you should have written for loops before)

## This class is not...

- a way to learn *advanced* machine learning methods

# Long description

The explosion of available data, coinciding with the continued evolution of statistical and computational methods, has resulted in a new breed of specialist. These data scientists use rigorous statistical methods to find meaning in data. Minimizing a loss function is not enough: business and societal decisions hinge on the interpretation of these insights. The world of scientific computation is rapidly evolving. Quick-and-dirty scripts are not enough: a maintainable code base and a collaborative development environment allow projects to productionalize and scale. A data scientist must wear many hats; we present two of them here.

Maintainable coding techniques will be taught using test-driven development, version control, and collaboration. The code will resemble that found in the scikit-learn and statsmodels packages. Students finish the class having created a library on GitHub and gained an understanding of several core statistical/machine-learning algorithms.

Case studies give students the opportunity to use their own software on real-world data sets. Here they develop intuition for extracting meaning from data. Students finish the class with a website/blog/portfolio, and experience with the translation:

Real world --> data --> scientist --> collaborators/coworkers --> policy-decision/data-product

## Lecture structure

1. An algorithm is presented. Students are randomly assigned to groups and together write a productionalizable implementation.
2. The class is presented with a data-driven business/scientific problem that a company/institution has, which they must solve (using the algorithm from step 1).

- Each step takes one week.
- Step 1 demands that a GitHub repo be created. The repo is maintained with the imaginary goal of being later productionalized for a client. This problem is very clearly defined. The goals here are to learn algorithms and scientific computing skills in a collaborative environment.
- Step 2 demands the creation of a presentation and a written report. One group is randomly chosen to present their pitch/solution to the class. The problem will not necessarily be clearly defined. Students must find where they can add value, then convince us that they can. Students use software developed in step 1, along with other packages.
- The data for step 2 will come from NYC start-ups and non-profits.
- We will use the book *Python for Data Analysis*.

## Prerequisites

- First year graduate stats (4105, 4107, 4315 or permission of instructor)
- Some proficiency in programming
  - You should have written for loops before
  - You should have written scripts at least 100 lines long
  - If you barely meet these requirements, prepare to work **very hard**

Computer:

- A Mac or Linux machine is great
- If you have Windows, we will assist you in setting up a Linux dual-boot or virtual machine.
- An 8GB machine will help you tremendously. A 2GB machine will cause headaches. Spend the $60 and upgrade... you want to analyze data, right?