This post is no longer current

Please see the course description

Course basics

This class is...

  • for people with an understanding of statistics at the first-year graduate level or beyond
  • a way to learn/write basic algorithms for statistical inference and predictive analytics
  • a chance to apply algorithms to real data sets and gain data science intuition
  • a way to learn solid programming skills (beginning through intermediate)
    • Python
    • Linux
    • GitHub
    • Collaborative development
    • Object-oriented design

This class is not...

  • for people who don't know any stats or linear algebra
  • for people who have never programmed before
  • an overview of advanced methods in machine learning

Full description

The explosion of available data, coinciding with the continued evolution of statistical and computational methods, has resulted in a new breed of specialist. These data scientists use rigorous statistical methods to find meaning in data. Minimizing a loss function is not enough: business and societal decisions hinge on the interpretation of these insights. The world of scientific computation is rapidly evolving. Quick-and-dirty scripts are not enough: a maintainable code base and a collaborative development environment allow projects to move into production and to scale. A data scientist must wear many hats; we present two of them here.

Maintainable coding techniques will be taught using test-driven development, version control, and collaboration. Code will be of the type found in the scikit-learn and statsmodels packages. Students finish the class having created a library on GitHub and gained an understanding of several core statistical/machine-learning algorithms.
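
For a rough flavor of that style (a hypothetical sketch, not actual course material), a library entry might pair a tiny scikit-learn-style estimator with a unit test written first, test-driven style; the class and test names below are made up:

    import numpy as np


    class MeanRegressor(object):
        """Toy estimator (hypothetical): predicts the mean of the training y."""

        def fit(self, X, y):
            # Follow the scikit-learn convention: learn parameters, return self.
            self.mean_ = float(np.mean(y))
            return self

        def predict(self, X):
            return np.full(len(X), self.mean_)


    def test_mean_regressor():
        # The kind of unit test written first under test-driven development.
        X = np.arange(6).reshape(3, 2)
        y = np.array([1.0, 2.0, 3.0])
        model = MeanRegressor().fit(X, y)
        assert np.allclose(model.predict(X), 2.0)


    if __name__ == "__main__":
        test_mean_regressor()
        print("all tests pass")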

Case studies give students the opportunity to use their own software on real-world data sets. Here they develop intuition for extracting meaning from data. Students finish the class with a website/blog/portfolio and experience with the translation:

Real world --> data --> scientist --> collaborators/coworkers --> policy-decision/data-product

Lecture structure

  1. An algorithm is presented. Students are randomly assigned to groups and together write a production-ready implementation.
  2. The class is presented with a data-driven business/scientific problem from a company/institution, which they must solve using the algorithm from step 1.
    • Each step takes one week.
    • Step 1 demands that a GitHub repo be created. The repo is maintained with the imaginary goal of later being put into production for a client. This problem is very clearly defined. The goals here are to learn algorithms and scientific computing skills in a collaborative environment.
    • Step 2 demands the creation of a presentation and a written report. One group is randomly chosen to present their pitch/solution to the class. The problem will not necessarily be clearly defined. Students must find where they can add value, then convince us that they can. Students use software developed in step 1, along with other packages.
    • The data for step 2 will come from NYC start-ups and non-profits.
    • We will use the book Python for Data Analysis.
    • We may use the book Machine Learning in Action; if so, we will require modifications of the algorithms presented there.

Prerequisites

  • Stats (4109 or 4105+4107) or equivalent
  • Some proficiency in programming
  • Computer:
    • A mac or Linux is fine.
    • If you have Windows, we will assist you in setting up a Linux dual-boot or virtual machine. An 8GB machine will help you tremendously; a 2GB machine will cause headaches. Spend the $60 and upgrade... you want to analyze data, right?

Lectures/algorithms/HW

  1. Introduction
    • Course introduction
    • Software setup workshops
      • If you have Linux, then we will do a quick check of your system
      • If you have a Mac, we will transform it into a real Mac
      • If you have a Windows machine, we will set up Linux with either a virtual machine or dual-boot.
  2. Programming introduction
  3. Data tools
    • Git/GitHub introduction
    • Teams build a suite of data tools
      • Cleaning filters
      • Subsampling
      • SQL scripts
  4. Exploratory data analysis
    • Pandas
    • NumPy, SciPy, Matplotlib
    • Build an EDA suite
  5. Linear regression
    • The singular value decomposition (SVD)
    • Maximum likelihood
    • Regularization and Bayesian estimators
    • Memory hierarchy, stability, and why you never explicitly invert a medium or large matrix (a short sketch follows this list)
    • Teams build a linear regression module
    • Teams work on case study (topic TBD)
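
To make concrete the "never explicitly invert" point from the linear regression unit, here is a minimal sketch using synthetic data and standard NumPy calls (not course code), comparing the SVD and a linear solve with the explicit normal-equations inverse:

    import numpy as np

    rng = np.random.RandomState(0)
    n, p = 200, 5
    X = rng.randn(n, p)
    beta_true = np.arange(1.0, p + 1.0)
    y = X.dot(beta_true) + 0.1 * rng.randn(n)

    # Stable: thin SVD, X = U diag(s) V^T, so beta = V diag(1/s) U^T y.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    beta_svd = Vt.T.dot(U.T.dot(y) / s)

    # Normal equations with an explicit inverse: equivalent in exact
    # arithmetic, but slower and less stable when X is ill-conditioned.
    beta_inv = np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))

    # A linear solve (or np.linalg.lstsq) avoids forming any inverse at all.
    beta_solve = np.linalg.solve(X.T.dot(X), X.T.dot(y))

    print(np.allclose(beta_svd, beta_inv), np.allclose(beta_svd, beta_solve))

On a well-conditioned problem like this one all three estimates agree; for ill-conditioned X, the SVD of X itself is the most robust route, and even the normal-equations solve is preferable to forming an explicit inverse.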

Other algorithms presented will follow the same structure as "Linear regression" above, and could include:

  • Logistic regression/classification
  • K-nearest neighbors
  • Kernel density estimation
  • Decision trees, random forests
  • Monte Carlo simulation
  • Recommendation systems

Possible additional topics

  • Web scraping
  • Typing and compilers, possibly taught using Cython



Published

15 November 2012
