Columbia Applied Data Science

Notes on higher performance python

2013-06-04T00:00:00-07:00

The newest version of the lecture notes includes a section on high(er) performance Python.

Sections include

Memory hierarchy
Parallelism
Profiling
Standard Python rules of thumb
For loops versus BLAS
Stream processing of text
Multiprocessing

profiling and performance basics

2013-05-12T00:00:00-07:00

Note

The newest version of the lecture notes includes a section on high(er) performance Python. Read that rather than this short post.

Original post

Python code can be very slow or very fast. For loops are slower than list comprehensions, which are in turn much slower than numpy calls or built-in Python functions. The latter two rely on optimized Fortran and C libraries. So the first rule of thumb is: whenever you find yourself writing a for loop or list comprehension, check whether there is a built-in Python or numpy function that does the same thing.
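As a rough illustration of this rule of thumb, here is a small sketch that does the same work with a for loop, a list comprehension, and a built-in, using only the standard library. The function names are made up for this example. Timings depend on your machine and Python version, so none are quoted here; run it and see.

```python
import timeit


def squares_loop(n):
    """Build a list of squares with an explicit for loop."""
    out = []
    for i in range(n):
        out.append(i * i)
    return out


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


def total_loop(values):
    """Add up a list with an explicit for loop."""
    total = 0
    for v in values:
        total += v
    return total


def total_builtin(values):
    """Add up a list with the built-in sum."""
    return sum(values)


if __name__ == "__main__":
    values = squares_comprehension(100000)
    statements = ("squares_loop(100000)", "squares_comprehension(100000)",
                  "total_loop(values)", "total_builtin(values)")
    for stmt in statements:
        # Time 20 repetitions of each statement in this module's namespace
        seconds = timeit.timeit(stmt, globals=globals(), number=20)
        print("%-32s %.4f s" % (stmt, seconds))
```

Each pair returns identical results, so any difference you see is purely a matter of speed.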

The above rule is always worth following, since built-in functions also lead to simpler code (remember the importance of simplicity). However, there are often other optimizations that lead to slightly harder to read or more complicated code. To address this, first consider the following quote, credited to Donald Knuth, "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." The solution is to profile your code first, to determine where the slow spots are, and then only re-write the parts that are slowing your code down.

To profile scientific code, you need a line-by-line readout of the time taken in different function calls in your code. This can be had by using the line_profiler in conjunction with the kernprof script. To install, simply type

pip install line_profiler

Then, at the top of a function that you want to profile (you can only profile functions), put @profile. For example:

@profile
def myfun(x):
    y = 2 * x
    return y

Then you need some way to call myfun from the command line. This could be for example a script run_myfun.py, which could be as simple as:

from mymodule import myfun

myfun(10)

Then, at the command line, type

kernprof.py -l run_myfun.py

If you are using anaconda and have installed line_profiler, then kernprof.py will be in your PATH, and the above line will work (newer releases of line_profiler install the script as kernprof, without the .py extension). The above line produces the file run_myfun.py.lprof, which holds the profiler output. You need to use the module line_profiler to read it. To do this, type

python -m line_profiler run_myfun.py.lprof  |  less

You should see a line-by-line breakdown of time taken to run your code.
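One practical wrinkle: the name profile only exists while kernprof is running your script, so a module decorated with @profile raises a NameError under plain python. A common workaround, sketched here, is to define a do-nothing fallback at the top of the module. The module contents mirror the myfun example above; the fallback itself is my addition, not part of line_profiler.

```python
# mymodule.py
try:
    profile  # provided by kernprof when the script runs under it
except NameError:
    def profile(func):
        """Fallback used under plain python: return the function unchanged."""
        return func


@profile
def myfun(x):
    y = 2 * x
    return y
```

With this in place the same file can be imported normally, run in unit tests, or profiled with kernprof without editing it in between.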

Announcements: May 6

2013-04-30T00:00:00-07:00

A new version of the lecture notes has been posted.

Homework 08: Stackoverflow questions

2013-04-29T00:00:00-07:00

Homework 08: Stackoverflow

Due: May 13, in class presentation during the final exam slot (7:10 - 11pm). Also, a pdf of your slides must be emailed to applied.data.science@gmail.com before 7:00pm.

You will use logistic regression to predict whether a Stackoverflow question will be closed or not. This assignment is similar to this Kaggle competition.

Guidelines

General

You must use logistic regression to give the probability that a Stackoverflow question will be closed or not.
For modeling, you can use any existing Python (not R) packages such as statsmodels or sklearn. You can use any Unix utility.
You must attempt to use both the numeric data (e.g. ReputationAtPostCreation) as well as the text data Title and BodyMarkdown.
For the test data, you must report:
- The ROC AUC (Area Under the Curve)
- How well your predicted average closed rate matches reality for users in the bottom/middle/top third in terms of reputation
You must build a classifier that uses your logistic regression model. Pick a cutoff that makes sense for this problem and explain why you chose it.
You must also build a classifier that uses the exact same variables as your logistic classifier, but uses some other technique such as a random forest, SVM, or nearest neighbors. You must compare this classifier with the logistic classifier and explain which worked better and why.
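To make the ROC AUC and cutoff requirements concrete, here is a minimal pure-Python sketch. It uses the fact that the AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one (ties count half). The function names and the toy labels/scores are invented for illustration. On real data you would more likely use a library routine such as sklearn.metrics.roc_auc_score, since the pairwise loop here scales poorly.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def classify(scores, cutoff):
    """Turn predicted probabilities into 0/1 labels using a cutoff."""
    return [1 if s >= cutoff else 0 for s in scores]


# Toy data: 1 means the question was closed
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(labels, scores))
print(classify(scores, cutoff=0.5))
```

Note that the AUC does not depend on the cutoff, while the classifier's errors do. That is why the cutoff needs its own justification.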

Presentation

20 minutes presentation, 10 minutes questions. At least two group members must talk.
Your slides must be in pdf format
Email your slides to applied.data.science@gmail.com before 7pm on the day of the final. Put your group name in the title of the pdf.
Your presentation should describe why you chose to keep/create/throw-away variables
Your presentation should describe how you evaluated your model, the results of the evaluation, and why this evaluation was or was not sufficient
Your presentation should not describe your data-munging. EDA should be described only insomuch as it relates to the above tasks.

Data

Your training data should be the file named train taken from this website.
Your test data should be current data that you obtain by using the Stackoverflow API. Note that you can get lots and lots of variables using the API. Only get the ones that are also in the training set. There are no hard requirements on the amount of test data you must obtain. You must explain why you used the number of samples that you did, and why it makes the "prediction vs. reality" test statistically significant. You are permitted to use your test set as a cross validation set (usually this is not good practice, but (i) there is no way to stop you from doing this, and (ii) this will give you experience with out-of-time errors).

About the starting repo

This is meant to give you a decent starting point. You can modify it as you wish.

Dependencies

You will probably use code from previous homeworks, especially cut.py and subsample.py.

Directories

data

Don't version data.
To avoid excess sharing of processed data (which changes often), it is preferable to share raw data and the scripts and notebooks that transform raw into processed.
Contents of any raw folder should never be modified or deleted. This way, your script will create the same output as everyone else's script.
Shell scripts and notebooks should assume the existence of the local folders data/raw and data/processed. They already exist in the repo.

Notebooks

For ipython notebooks. Put your name in the notebook name to avoid redundancy.

src

Source code.

tests

Unit and integration tests. Add these if you want.

scripts

Shell scripts.

Final exam

2013-04-29T00:00:00-07:00

We estimate that the May 6 final exam will have:

1 git question
2-3 unix (incl. regex)
2 nltk
2 dataflow (read/write/IO/stdin/stdout/API)
2 numpy/pandas
1 linear
2-3 logistic
1 naive bayes
1 decision trees/random forest
2 ROC/R2/PseudoR2
1 nonlinear optimization

Homework 07: Hints

2013-04-28T00:00:00-07:00

Exercise 6.3.1. Here I'm looking for you to say how mislabeled data can be re-phrased as an error in your model. There are probably many correct answers.
Exercise 6.5.2. The key point (that I did not explicitly state...sorry!) is that the data is trained with normal, non-truncated linear regression.
Exercise 6.5.3. In part 1, assume the epsilon are iid.

Starting projects that involve lots of file I/O

2013-04-26T00:00:00-07:00

Often times I work on a project where the general goal is:

Read lots of files from disk
Modify and extract information from the files
Write results to disk

Steps 1 and 3 provide the interface (in this case the plumbing that interfaces with the OS and the disk) and step 2 is the implementation (the logic that you want to implement). The point of this post is that interface should be separated from implementation. The reason is that interface and implementation tend to change at different times. To illustrate, imagine you write the following script:

infilename = 'data/smallfile.csv'
outfilename = 'data/my_outfile.csv'

modify_and_write(infilename, outfilename)

This would work fine during an initial development phase where you want to test your script on one single file. It however doesn't work if you want to modify many files. You could change this with:

indir = 'data/'
outfilename = 'data/my_outfile.csv'

for infilename in get_filenames(indir):
    modify_and_write(infilename, outfilename, append=True)

This solves the first problem, but suppose you want to read from standard in and/or write to standard out (this would be helpful since then you could tie modify_and_write together with other utilities)? To do this, you could pass open file objects rather than file names.

# Use with hardcoded files in a script

import sys

indir = 'data/'
outfilename = 'data/my_outfile.csv'

# Open the output once so each input file does not overwrite the last
with open(outfilename, 'w') as g:
    for infilename in get_filenames(indir):
        with open(infilename, 'r') as f:
            modify_and_write(f, g)

# Use with stdin/stdout as part of a larger program.

modify_and_write(sys.stdin, sys.stdout)

Here we have pushed the file opening part of the interface away from the modification part. This allows us to tie modify_and_write together with other programs or use it by itself. An example can be found here.
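A side benefit of passing open file objects is that anything file-like will do, which makes the function easy to try out without touching the disk. Here is a minimal sketch. The body of modify_and_write (upper-casing each line) is a placeholder I made up, since the post does not say what the modification is.

```python
import io


def modify_and_write(infile, outfile):
    """Read lines from an open file object, transform them, write them out."""
    for line in infile:
        outfile.write(line.upper())  # placeholder transformation


# In-memory streams stand in for real files
source = io.StringIO("a,b\n1,2\n")
sink = io.StringIO()
modify_and_write(source, sink)
print(sink.getvalue(), end="")

# The same function works unchanged in a pipeline:
#     modify_and_write(sys.stdin, sys.stdout)
```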

This sort of setup is good if you know ahead of time that you will be reading/writing from files or stdin/stdout only. Although this is a good way to tie programs together, unix pipelines can be restrictive. Suppose all files are small. Then it is possible to read them in all at once. In this case you can write:

# Simple script to write as you develop modify_lines

infilename = 'data/smallfile.csv'
outfilename = 'data/my_outfile.csv'

with open(infilename, 'r') as f:
    lines = f.read()
    newlines = modify_lines(lines)
    with open(outfilename, 'w') as g:
        g.write(newlines)

Above, modify_lines takes in the bare minimum that it needs in order to modify the lines in the file. This is the string returned by f.read(). Later, when we decide exactly how modify_lines will be used, we can build the interface. If that interface changes over time, that is fine, because the implementation (modify_lines) doesn't need to change. For example, we can decide to read lines from stdin, or a file, or another function.
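To show how the interface can change while the implementation stays put, here is a sketch with one modify_lines and two thin wrappers around it. The transformation (stripping trailing whitespace) and the wrapper names run_on_file and run_on_stream are placeholders I chose for illustration.

```python
import os
import tempfile


def modify_lines(text):
    """Implementation: string in, string out. Strips trailing whitespace."""
    return "\n".join(line.rstrip() for line in text.splitlines()) + "\n"


def run_on_file(infilename, outfilename):
    """Interface 1: named files on disk."""
    with open(infilename, "r") as f:
        text = f.read()
    with open(outfilename, "w") as g:
        g.write(modify_lines(text))


def run_on_stream(instream, outstream):
    """Interface 2: open streams, e.g. sys.stdin and sys.stdout."""
    outstream.write(modify_lines(instream.read()))


# Try the file interface in a throwaway directory
tmpdir = tempfile.mkdtemp()
inpath = os.path.join(tmpdir, "smallfile.csv")
outpath = os.path.join(tmpdir, "my_outfile.csv")
with open(inpath, "w") as f:
    f.write("a,b  \n1,2\t\n")
run_on_file(inpath, outpath)
with open(outpath) as g:
    result = g.read()
print(repr(result))
```

Adding a third interface later (say, reading from another function) means writing another small wrapper, not touching modify_lines.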

Announcements: April 29

2013-04-24T00:00:00-07:00

A couple corrections were made on April 24 at 8pm to exercise 6.5.3.
Hints were given for HW 7
See this estimate of material on the final exam (May 6)
Homework 8 has been posted. Teams assignments will be mailed out soon.

Announcements: April 24

2013-04-22T00:00:00-07:00

The written traditional final exam will take place on the last day of normal class, May 6. It will be worth the same as a homework. Details to follow.
During the final exam slot, May 13, 7:10 - 11:00 pm, you will be doing presentations as part of homework 8. Details to follow
A new version of the lecture notes has been posted. A small addition to the end of section 6.5 was made.

Homework 07

2013-04-21T00:00:00-07:00

Due: Monday April 29, hand in write-up in class as a written or printed piece of paper.

What it is: Do every exercise in the logistic regression chapter of the lecture notes. This is not group work. Every person must turn in their own solutions. You are allowed to work with others.

Announcements: April 22

2013-04-18T00:00:00-07:00

Some updates were made to the lecture notes. In particular, the theorem on L1 variable selection was re-worded and the proof fixed.
Homework 7 has been posted.

Announcements April 17

2013-04-17T00:00:00-07:00

The logistic regression lecture notes have been posted.

Crash Course on APIs, pandas/statsmodels timeseries API

2013-04-11T00:00:00-07:00

The slides for the Crash Course on APIs, the Github API example notebook, and the pandas timeseries API notebook all live in the public repository git@github.com:columbia-applied-data-science/lecture_timeseries.git

The slides are in PDF format, and you can see the static versions of the notebooks here:

Pandas Timeseries:

http://nbviewer.ipython.org/urls/raw.github.com/columbia-applied-data-science/lecture_timeseries/master/time%2520series%2520python.ipynb

Github API:

http://nbviewer.ipython.org/urls/raw.github.com/columbia-applied-data-science/lecture_timeseries/master/Github%2520API.ipynb

You can find the GitHub API reference here: http://developer.github.com/v3/

Note that Web APIs live independently of the programming language, and your Python code just needs to construct the right URL that encodes all the search parameters. In addition, for many web APIs, you must go through an authentication step in order to obtain certain types of data.
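As a small illustration of "constructing the right URL", here is a sketch that builds a GitHub repository search URL with the standard library (Python 3). The helper name build_search_url and the particular query are my own choices. The endpoint and parameter names come from GitHub's public search API and may change, so check the API reference before relying on them. No request is actually sent here.

```python
from urllib.parse import urlencode


def build_search_url(query, sort="stars", per_page=5):
    """Encode the search parameters into a GitHub repository search URL."""
    base = "https://api.github.com/search/repositories"
    params = {"q": query, "sort": sort, "per_page": per_page}
    # urlencode escapes spaces, colons, etc. so the URL is valid
    return base + "?" + urlencode(params)


url = build_search_url("pandas language:python")
print(url)
```

Any HTTP client, in any language, can then fetch that URL. That is the sense in which the API is independent of Python.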

Notes on your GSS models

2013-03-22T00:00:00-07:00

Here are some general comments that applied to many people's homework.

The best presentations told a "story." They told the steps you used, as well as the results and why the results were good or bad.
If the direct inversion method fails on a small data set such as this, then there is often a problem with your data. E.g. you have linearly dependent (i.e. redundant) variables. The best approach is to figure out why there are issues and then fix them.
The pandas function get_dummies can be used to get indicators for categories, e.g. is_married.
I didn't see anyone building new variables from (nonlinear) combinations of more than one old variable. That would have been nice.
Many of the NaN values were follow-up questions that could have been used to build new variables. For example, if someone answers yes to "have you ever been a smoker", then they also get to answer the follow up, "have you ever tried to quit smoking." This could be used to create two new variables, smoking_tried_quitting and smoking_never_tried_quitting. Or, you could figure that people who never tried to quit were more severe smokers, and therefore they get a 2, people who tried quitting get a 1, and people who never smoked get a 0. This way you create one single new variable with three levels.
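The three-level smoking variable from the last comment could be built along these lines. This is only a sketch: the function name and the use of True/False/None for yes/no/missing are assumptions for illustration, not the actual GSS coding.

```python
def smoking_level(ever_smoked, tried_quitting):
    """Combine a question and its follow-up into one variable.

    0 = never smoked, 1 = smoked and tried quitting, 2 = smoked and never tried.
    None stands for a missing answer.
    """
    if ever_smoked is None:
        return None  # the first question itself is missing
    if not ever_smoked:
        return 0  # follow-up was never asked, so its NaN is informative
    if tried_quitting is None:
        return None  # a smoker who skipped the follow-up: genuinely missing
    return 1 if tried_quitting else 2


answers = [(False, None), (True, True), (True, False), (True, None)]
print([smoking_level(ever, tried) for ever, tried in answers])
```

The point is that a NaN in the follow-up column means different things depending on the answer to the first question.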

Announcements March 13

2013-03-12T00:00:00-07:00

There was a bug in the cross validator module from HW 3. In the docstring of cross_validator._get_xy_traincv() I had switched the usage of the "cv set" and the "training set." The correct docstring reads:

Returns slices of X and Y used for training and cv.  The cv
set should be e.g.:  X[istart: istop, :], and the training set should
be everything else.

The updated unit tests are here

Announcements March 11

2013-03-06T00:00:00-08:00

For HW 4, email a pdf presentation to the TA, and be prepared to present in class. Do both of these on March 13. Nothing else is due at any time.
Suggestions for your presentation:

List the variables your models use, along with the values of the corresponding coefficients. Do the coefficient signs make sense?
Show your error and how you evaluated error. Remember that you must train/cross-validate using 2006 data, and then test (i.e. measure your error) using the 2010 data.

I noticed that there are quite a few variables that are actually the same as income (up to some constant). If you notice this, then it's ok to point it out. However, please spend your time building models that don't use them.
Tuesday's office hours have moved to Wednesday, 11-12.

IDSE Symposium Call for Student Volunteers

2013-03-01T00:00:00-08:00

Columbia's Institute for Data Science and Engineering has an upcoming symposium. It is an "invite only" event and doing some grunt work may be your only way in. See below:

See this google doc for a breakdown of the shifts and tasks available – volunteers may sign up for any/all available shifts (business attire advised).

Homework 04

2013-02-28T00:00:00-08:00

Homework 4: Linear Regression with GSS Data

You will use your linear regression module from last week to analyze the General Social Survey (GSS) data. This is a yearly social science survey that "takes the pulse of America."

Presentation: Wednesday March 13. Email a copy to the TA and be prepared to present in class.

Data directory layouts

You will have to transform your data by cleaning/cutting out certain columns. This can lead to a mess of different data files. Here are some suggestions.

You have the following layout by default:

notebooks/
scripts/
src/

data/
  raw/
  processed/

Remember not to commit any data to the repository.
I like to keep the data in raw as completely untouched copies from websites, or common transformations of them (e.g. the csv files that result from the Get data step). The key point is that once data goes into raw I never change it.
The processed directory is for the altered versions of the raw directory.
There is a scripts directory that can be used to store scripts (Python or Bash) that transform data from the type in raw to the type in processed.
I also use ipython notebooks (stored in notebooks/) to transform data from raw to processed. I commit these scripts and notebooks to the repo and other people can run them to get copies of my processed data.
For longer projects I create snapshots of the processed directory that have a timestamp on their name. E.g. processed-2013-02-11.

Project deliverables

Use the 2006 data to train models, test those models on 2010 data.
Predict income06 as a function of other variables. For simplicity, this should be one single model that includes everyone...in other words, don't segment your data. Note that income06 is missing in about 15% of responses. You don't have to predict income06 for these people. This model should work, even in the presence of missing data (with the exception of missing income06)! So, you should probably fill the missing values with something.
Find one other relation to predict. Make sure it is appropriate for linear regression. Segment or do whatever you want. The model can work for the whole population or subpopulations.
Make a 15 minute slide-show presentation documenting your work. This is what you turn in. Two randomly chosen groups will present this in class on Wednesday March 13. The intended audience is the class...so present at the appropriate technical level.

Getting started

Useful links

The 2008 GSS Codebook will be useful for variable definitions.
The GSS User's Guide shows you how to search for variable descriptions using the website. The website is very very very slow.

Basic workflow

Get data
Inspect data
Clean data
Explore relationships (EDA)
Fit model
Inspect results
Repeat 2-6

Get data

Download the 2006 and 2010 datasets from this site. Get the individual years, which are under "Download Individual Year Data Sets"
Convert these STATA datasets into Pandas DataFrames using these instructions
Store them as csv files using (assuming the DataFrame is named df):

df.to_csv('filename', index=False)

Inspect data

Inspect the csv files with less and see what they look like. Remember Ctrl-f, Ctrl-b to move forward and backward.
Use head to create a file (probably located in /tmp/) containing the first 100 lines. Look at this file in excel (or libreoffice in ubuntu). You may get an error about too many columns...that's ok, just look at what you can!

Clean data

Start your ipython notebook with ipython notebook --pylab inline.
Create a new notebook named cleaning-your-name. This will be used for cleaning data.
See notebooks/HW4_cleaning_EDA for an example.
Read in the 2006 and 2010 datasets into DataFrames named df2006, df2010.
We will probably have little use for columns that are mostly NaN. Use df.count().order() to figure out which columns have lots of missing values. Chop off these columns by creating a boolean mask that will be true if a column has enough good entries and then using df.ix[:, mask].
We are only interested in variables that are in both datasets, so use pandas reindex to modify and align the columns like so

col = df2006.columns.intersection(df2010.columns)

df2006 = df2006.reindex(columns=col)

df2010 = df2010.reindex(columns=col)
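For reference, here is a sketch of the masking and alignment steps on two tiny made-up DataFrames. It is written against a recent pandas, where sort_values() and .loc take the place of order() and .ix (an assumption about your installed version). The column names and the 50% threshold are invented for the example.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for the real survey data
df2006 = pd.DataFrame({
    "age": [30, 41, 25, 52],
    "income06": [10.0, np.nan, 12.0, 9.0],
    "rare06": [np.nan, np.nan, np.nan, 1.0],
})
df2010 = pd.DataFrame({"age": [33, 47], "income06": [11.0, 8.0], "new10": [1, 2]})

# Number of non-missing entries per column, fewest first
print(df2006.count().sort_values())

# Keep columns where at least half the entries are present
mask = df2006.count() >= 0.5 * len(df2006)
df2006 = df2006.loc[:, mask]

# Keep only the columns common to both years
col = df2006.columns.intersection(df2010.columns)
df2006 = df2006.reindex(columns=col)
df2010 = df2010.reindex(columns=col)

print(list(df2006.columns), list(df2010.columns))
```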

EDA

See notebooks/HW4_cleaning_EDA or go to this link for an example.

Build your model

See notebooks/HW4_regression_example or go to this link for an example.

You will have to add variables and see if it improves your fit. Make sure your variables make sense intuitively. Do EDA and read about the data to gain intuition.

Announcements March 4

2013-02-28T00:00:00-08:00

The unit test TestLinearReg.test_solve_pinv_4 had an issue. It used 0 as a cutoff. On some machines, svd does not return a zero singular value, instead it returns a small nonzero value (due to roundoff error). Please use the revised unittests available here.
Notebooks: the EDA/cleaning notebook and regression notebook are now available.
For those who are interested there is an intermediate git workshop happening on campus (thanks to Michael Discenza for letting us know). Here are the details:

Intermediate Git Workshop
When: Tue, March 12, 9pm – 10pm
Where: Hamilton 603
Calendar: Application Development Initiative
Created by: znewman01@gmail.com
Description:

Know how to use Git but don't know anything about its internals? Want to learn how to rebase, cherry-pick, and fix merge conflicts like a champ? Use git effectively for collaboration and development using the techniques in this workshop.

Presupposes only a basic knowledge of Git/VCS.

Don't forget there will be presentations next week March 12th. Two teams will be randomly chosen to present for 15 minutes.
Here is a nice list of regular expression wild cards, what they do and where they are supported.

Announcements Feb 27

2013-02-26T00:00:00-08:00

I made a change in the lecture notes that ended up affecting the homework numbering. I undid that change and now the old numbering is back.

Learning Numpy and Pandas

2013-02-21T00:00:00-08:00

The numpy notebook and pandas notebook that Chang used in class are now available.

I also recommend the following for more Numpy information:

Chapter 4 from Wes's book
If you're a MATLAB user, then this is useful. Note: Don't use the matrix class. Just use normal Numpy ndarrays.
The official docs are also useful.

Announcements: Feb 25

2013-02-21T00:00:00-08:00

See this post for tutorials/docs/notebooks to help you learn Numpy.
My previous announcement said that all exercises from the linear algebra chapter are due. I changed my mind and only some are due.
Homework 3 is now due March 4th at 6pm.

A new version of the notes with some changes has been posted. Some of these changes affect the homework. See below:

To avoid confusion with standard deviation, I changed the symbol for singular values to lambda
Problem 6.12.1 has a typo. It should read, "we will have at least one singular value sigma_k = 0 for k <= K", not "k<K."
Exercise 6.14.1 had errors in the w estimate. This same exercise had some confusing wording regarding the error model. See the updated lecture notes for an improvement.
Exercise 6.6.1 will be easier to answer after you read remark 6.11. In other words, this question can be rigorously answered by using the SVD solution to the least squares problem.
In homework_03/src/simulator.py, in the function gaussian_samples(), there was a comment that should not be there. The comment was

Start with an identity then populate off diagonal entries

Homework 03

2013-02-19T00:00:00-08:00

homework_03

Due Mar 4th, 6pm

Code: To receive full credit all unit tests must pass and one copy of the exercises must be completed.

Exercises: 6.4.2, 6.4.3, 6.6.1, 6.7, 6.9.1, 6.12.1, 6.12.2, 6.14.1, 6.19.1

To start

Clone the repo into a local directory named homework_03. Do not use the original repo name. Replace X below with your team name.

git clone https://github.com/columbia-applied-data-science/homework_03_team_X.git \
  homework_03

See demo.py and the tests to get an idea of how things work.

Numerical techniques

See this section in the lecture notes for information about pseudo inverses.

5-fold cross validation

You will make a 5-fold cross validation module. This is used as a way to pick out your regularization parameter delta. Our 5-fold cross validation is:

For every delta:

Divide the data up into 5 equal chunks
Pick out the first chunk as a cross-validation set, and group the other 4 together as training data.
Fit the model using the training data, then measure the squared error |Xw - Y|^2 on both the training data and the cross-validation set.
Repeat 5 times, each time using a different chunk as the cross validation set.
Average the training and cross-validation errors across the 5 folds.

Compare the average cross-validation errors and use this to choose delta. Note that the training error should not be used to choose delta. It is there to serve as a reality check and to diagnose the degree of over/under fitting.
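The procedure above can be sketched in pure Python as follows. To keep the example short it uses a one-variable model with no intercept, and it assumes delta enters as an L2 penalty, so the fit is w = sum(x*y) / (sum(x*x) + delta). That is a stand-in I chose for illustration, not the homework's linear_reg module. All function names and the toy data are made up, and errors are reported per data point so that training and cv errors are comparable.

```python
def fit_ridge_1d(x, y, delta):
    """Penalized least squares for a single feature, no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + delta)


def mean_squared_error(x, y, w):
    """Squared error |xw - y|^2 divided by the number of points."""
    return sum((w * a - b) ** 2 for a, b in zip(x, y)) / len(x)


def cross_val(x, y, delta, n_folds=5):
    """Return (average training error, average cv error) for one delta."""
    fold = len(x) // n_folds  # leftover points, if any, always stay in training
    train_errs, cv_errs = [], []
    for k in range(n_folds):
        istart, istop = k * fold, (k + 1) * fold
        # The cv set is the slice [istart:istop]; training is everything else
        x_cv, y_cv = x[istart:istop], y[istart:istop]
        x_tr, y_tr = x[:istart] + x[istop:], y[:istart] + y[istop:]
        w = fit_ridge_1d(x_tr, y_tr, delta)
        train_errs.append(mean_squared_error(x_tr, y_tr, w))
        cv_errs.append(mean_squared_error(x_cv, y_cv, w))
    return sum(train_errs) / n_folds, sum(cv_errs) / n_folds


# Toy data: y is roughly 2 * x with a small alternating perturbation
x = [i / 10.0 for i in range(1, 21)]
y = [2.0 * a + (0.1 if i % 2 else -0.1) for i, a in enumerate(x)]

deltas = [0.0, 0.1, 1.0, 10.0, 100.0]
results = {d: cross_val(x, y, d) for d in deltas}
best_delta = min(deltas, key=lambda d: results[d][1])  # choose by cv error only

for d in deltas:
    print("delta=%6.1f  train=%.4f  cv=%.4f" % ((d,) + results[d]))
print("best delta:", best_delta)
```

Only the cv column is used to pick delta. The train column is printed next to it as the reality check described above.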

Caution!

These routines are very picky about array shape. Some functions, e.g. np.dot, can return arrays whose shape is (N,) (a tuple with only one element). In that case, you will often have to reshape this into a proper two dimensional array. The docstring for linear_reg.fit() tells you when to do this.
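Here is a quick sketch of the shape issue, using a made-up 3 x 2 array. Multiplying a two dimensional array by a one dimensional vector gives a one dimensional result, and reshape(-1, 1) turns it into a proper column.

```python
import numpy as np

X = np.arange(6.0).reshape(3, 2)  # shape (3, 2)
w = np.array([1.0, 2.0])          # shape (2,)

y_flat = np.dot(X, w)             # shape (3,), one dimensional
y_col = y_flat.reshape(-1, 1)     # shape (3, 1), two dimensional column

print(y_flat.shape, y_col.shape)
```

The -1 tells numpy to infer that dimension from the array's size, so the same line works for any N.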

Two functions, linear_reg.fit() and cross_validator.cross_val() can handle pandas objects as their input. The others may or may not. However, these are the only public methods in their modules, so this is ok.

sed oddities

2013-02-18T00:00:00-08:00

The instructions for HW 02 told you to use a sed command:

sed -e 's/|Open/&\n/g'

This should put a newline after every occurrence of the string |Open. Some older versions of sed don't work this way however. In these versions, instead of a newline, the letter n will be inserted. If this happens to you, change your command to:

sed -e 's/|Open/&\
/g'

Note that I have actually hit the Enter key on my keyboard, which put a newline into the script. This should work on all versions of sed. You can test this by writing:

echo -e 'abcabcabc' | sed 's/c/&\n/g'
echo -e 'abcabcabc' | sed 's/c/&\
/g'

and seeing which of the two works. Both scripts are trying to insert a newline after every c.
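If you are unsure what the correct output should look like, you can cross-check against Python's re module, which performs the same substitution. This is just a sanity check I am adding, not part of the homework. In the replacement string, \g<0> plays the role of sed's & (the whole match).

```python
import re

text = "abcabcabc"
# Replace every c with itself followed by a newline
result = re.sub(r"c", r"\g<0>\n", text)
print(repr(result))
```

A working sed command should print the same three lines, abc on each.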

Converting datasets from STATA

2013-02-15T00:00:00-08:00

It's very important in practical data science to know how to convert datasets into the right format and structure. Being able to import data from language-/tool- specific formats like Stata is something that's very useful, especially for a lot of social science data. Fortunately, that's already available as part of the StatsModels library in Python.

Here's how you do it:

In the terminal, execute the command: pip install -U statsmodels. This should upgrade you to 0.5.0+. If you already have the latest version of statsmodels, you can skip this step.

In ipython:

import statsmodels.iolib.foreign as smio
from pandas import DataFrame
arr = smio.genfromdta('~/path/to/stata/data.dta')
frame = DataFrame.from_records(arr)

The genfromdta function in statsmodels.iolib.foreign converts a dta file to a NumPy record array (special numpy array type). The last line above shows how to convert the record array into a pandas DataFrame so the data can live happily ever after.
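If genfromdta is not available in your statsmodels version, pandas itself can read Stata files through pandas.read_stata. The sketch below assumes a reasonably recent pandas. Since I can't ship a .dta file with this post, it first writes a tiny made-up DataFrame to a temporary .dta file and then reads it back.

```python
import os
import tempfile

import pandas as pd

# Write a small made-up dataset to a temporary Stata file
original = pd.DataFrame({"age": [30, 41, 25], "income": [10.5, 20.0, 7.25]})
path = os.path.join(tempfile.mkdtemp(), "data.dta")
original.to_stata(path, write_index=False)

# Read the Stata file straight into a DataFrame
frame = pd.read_stata(path)
print(frame.head())
```

For your own data, only the read_stata line is needed, pointed at the downloaded file.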

Announcements: Feb 18

2013-02-13T00:00:00-08:00

The linear regression notes are now posted as part of the lecture notes.
Every exercise from the linear regression chapter will be due as part of the next homework (due Feb 27). Every team hands in one written solution set to these exercises.
Here are some resources for learning Python
- Software Carpentry
- Codecademy
See this post about possible problems using the prescribed sed command on a mac.
Feb 18 lecture is on numpy/pandas
Feb 20 lecture is on cross-validation (necessary for HW 03)

Homework 02

2013-02-10T00:00:00-08:00

Homework 2 has been handed out via email and github notifications. It is due Feb 18.

The README.md handed out with the hw

This homework will have you write shell scripts that use unix utilities and python utilities that you build. This is done in the name of analyzing (an altered version of) the SF 311 Dataset. This altered version is available here

Due: Monday Feb 18, 6pm.

To receive full credit, you must commit and push code that passes all unit tests, and shell scripts that give the correct output.

Setup

Clone the repo and save it in a local directory called homework_02 by typing

git clone https://github.com/columbia-applied-data-science/homework_02_team_XX.git \
homework_02

Utilities

Note: To use the python utilities, your PYTHONPATH must be modified. In your ~/.bashrc (or ~/.bash_profile on macs), put

export PYTHONPATH=path-to-directory-above-homework_02:$PYTHONPATH

Then source it with source ~/.bashrc or open a new terminal.

To see how the utilities should work:

Create a comma delimited file with a header and run the utilities on it. Set a breakpoint and step through, reading the comments and code fragments provided. You can view the documentation for each utility by typing python utilityname -h.
Go to tests/ and view the unit tests in tests/testutils.py.
Look at the comments in the utilities. These are only hints. Any utility that passes tests is acceptable.

body

Note: This utility will not be tested, it is just given to you.

In your .bashrc, put

body() {
    IFS= read -r header
    printf '%s\n' "$header"
    "$@"
}
export -f body

then source the bashrc.

This allows you to run a command on the body of a file, skipping the header (but still printing the header). For example,

cat filewithheader | body sort -k1,1

will sort filewithheader, using the first field, but leave the header at the top of the file.

cut.py

Acts like the unix cut utility, except...

Takes field names rather than numbers
Uses the python csv module for more automatic handling of stuff like quoted delimiters
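To show the idea behind these two points (not the homework solution, whose interface is defined by the provided tests), here is a sketch of selecting columns by name with the csv module. The function name cut_fields and the sample data are made up.

```python
import csv
import io


def cut_fields(infile, outfile, fields, delimiter=","):
    """Write only the named fields of a delimited file with a header."""
    reader = csv.DictReader(infile, delimiter=delimiter)
    writer = csv.DictWriter(outfile, fieldnames=fields, delimiter=delimiter,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # fields not listed are dropped


# The quoted comma in "Smith, J" is handled by the csv module
source = io.StringIO('name,city,age\n"Smith, J",NYC,30\nLee,SF,25\n')
sink = io.StringIO()
cut_fields(source, sink, ["age", "name"])
print(sink.getvalue(), end="")
```

Plain unix cut would split "Smith, J" into two fields here, which is the reason for going through the csv module.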

reformat.py

Reformats stuff like delimiters and capitalization

common.py

Common files for all utilities

averager.py

Gets the average of different groups of a sorted file
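The reason the file must be sorted is that the groups can then be processed one at a time as a stream, without holding the whole file in memory. Here is a sketch of that idea using itertools.groupby. Again, this is an illustration with made-up names and data, not the utility's required interface.

```python
import csv
import io
from itertools import groupby


def average_by_group(infile, key_field, value_field):
    """Yield (group, average) pairs from a file already sorted by key_field."""
    reader = csv.DictReader(infile)
    # groupby only merges adjacent rows, hence the sorted-input requirement
    for key, rows in groupby(reader, key=lambda row: row[key_field]):
        values = [float(row[value_field]) for row in rows]
        yield key, sum(values) / len(values)


data = io.StringIO("category,timeopen\nA,10\nA,20\nB,5\n")
print(list(average_by_group(data, "category", "timeopen")))
```

On unsorted input the same category would show up as several separate groups, each with its own average.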

timeopen.py

Reads a SF 311 case file, appends a 'timeopen' column giving the time (in minutes) a case was open.

subsample.py

Subsamples in the space of rows.

Shell Scripts

These are simple shell scripts. They simply define variables and pipe together some commands. The input file is written into the script. The script writes to stdout and stderr. An example of a script like this (that counts words) would be:

DATA=../data

cat $DATA/infile.csv \
  | sort \
  | uniq -c \
  > outfile.csv

Use the hints inside of these shell scripts to complete them. "Complete" means that they reproduce the sample input/output inside data/. For example,

cd scripts
./count_categories.sh > /tmp/stdout 2> /tmp/stderr
diff /tmp/stderr ../data/count_categories_stderr
diff /tmp/stdout ../data/count_categories_stdout

will produce two files, /tmp/stdout and /tmp/stderr and then compare them to the files in data. If everything is working, then diff should print nothing.

count_categories.sh

Count the number of tickets in each category

count_categories_open_closed.sh

Count the number of tickets in each category that are Open or Closed

compute_averages.sh

Compute the average time tickets in different categories remain open.

For closed tickets, compute the average time it was open before being closed.
For open tickets, compute the time it has been left open.

Unit Tests

To run tests, cd to tests/ and do

python -m unittest -v testutils

Once you are done, you will get notification that all tests passed.

Midterm

2013-02-06T00:00:00-08:00

We will have an in-class midterm Feb 25. It will be worth approximately the same as one homework.

The exam will be written (NO COMPUTERS ALLOWED!!!!) and will be designed to test:

Your understanding of basic unix/python/git skills - if you have been doing the homework and following lecture you should have no issues with any of the questions.
Linear regression theory. It will be similar to the linear regression lecture notes and the exercises.

Announcements: Feb 06

2013-02-05T00:00:00-08:00

The next homework assignment will be handed out (via emails) tomorrow. We will discuss this today, along with a short discussion of linear regression.
A visualization of the multiple levels of Git that I talked about Monday is available here
A visual reference to Git that goes into multiple commands is available here
The midterm date has been set to Feb 25.

debugging

2013-02-04T00:00:00-08:00

Description

A debugger is a program that allows you to follow your code as it runs. You run your code line-by-line and see exactly what is going on. This is useful for fixing bugs. It is also useful for understanding what is going on with code.

Installation

Install pdb++ using

pip install pdbpp

Trying it out

Create a file called test.py that looks like:

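import pdb


# Define the function before the loop that calls it.
def modify_number(num):
    return 3 * num


# Execution pauses here and the debugger prompt appears.
pdb.set_trace()
numbers = range(5)

for num in numbers:
    newnumber = modify_number(num)
    print(newnumber)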

Then, from the command line, type python test.py. Python will then start interpreting this file (as it always does). When it gets to the line pdb.set_trace() the debugger will "hook" (stop execution of your program and display the position you are at). You should see a syntax-highlighted snapshot of your code. Type sticky and you will see a display of all your code. Type next or n to go to the next line. Type step or s to step into the function modify_number (do this when you are on that line). At any point you can print out the contents of a variable by typing the name of the variable. You can quit with q (unless you have a variable named q, in which case use !!q). To see a full display of commands type help. Also, check out this website.

Customization

Finally, you can customize pdb++ by creating a .pdbrc.py file in your home directory. Mine looks like:

import readline
import pdb

class Config(pdb.DefaultConfig):

    stdin_paste = 'epaste'
    sticky_by_default = True

    def __init__(self):
        readline.parse_and_bind('set convert-meta on')
        readline.parse_and_bind('Meta-/: complete')

    def setup(self, pdb):
        Pdb = pdb.__class__
        Pdb.do_l = Pdb.do_longlist
        Pdb.do_st = Pdb.do_sticky

Material from Software Carpentry Bootcamp

2013-02-02T00:00:00-08:00

Some material from the workshops is available here. More will be added to this post as it becomes available.

Editors

2013-02-02T00:00:00-08:00

For simplicity and multi-platform compatibility we have been asking you to use "nano" when editing files or code in terminal. As you have probably noticed this is not a great editor and, of course, there are many better options. Here are some editors we like (note: when coding in python, spaces and indentation are important, because this is how the interpreter delimits logical statements, so you need to make sure that when you press tab this is seen as some number of spaces; otherwise, you will end up with a mess of python indentation errors for yourself and everyone you collaborate with):

For mac:

Sublime which you can download here. Once you install it, click on the editor icon and do the following:

Open Preferences, under the Sublime Text 2 tab, and select settings-default. This should open up a bunch of code in your Sublime editor window. Search for 'translate_tabs_to_spaces' and change 'false' to 'true', then search for 'tab_size' and make sure that is set to 4. Save and that's it. You can see some details about these settings here.
MacVim, whose page lists several download options; choose the one appropriate for your mac. Note: this is a more powerful editor, but you should be familiar with its basic use. You don't actually need to download MacVim and can just use Vim, which you can invoke in terminal (type: vim or vim filename), but the standalone editor is nice. To edit settings for vim/MacVim:
1. Open ~/.vimrc (you can do this by typing: vim ~/.vimrc in terminal)
2. Paste in the following three lines:
set tabstop=4
set shiftwidth=4
set expandtab
3. Save

For Ubuntu:

GEdit download the latest version. Then go to Preferences, click on the editor tab and check the boxes "Insert spaces instead of tabs" and "Enable automatic indentation;" also, set the tab width to 4.
vim is very powerful but has a steep learning curve. You can install it with:

sudo apt-get install vim-gnome

Then modify your .vimrc as shown in the MacVim instructions.
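Once your editor is configured, you may want to confirm that a file no longer contains tab indentation. A minimal sketch of such a check follows; the function name lines_with_tab_indent is made up for this example.

```python
def lines_with_tab_indent(text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace is everything before the first non-blank character.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad.append(number)
    return bad


sample = "def f():\n\treturn 1\n    # spaces are fine\n"
print(lines_with_tab_indent(sample))  # prints [2]
```

To check a real file, read its contents with open(filename).read() and pass the result to the function.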

Announcements: Feb 04

2013-02-02T00:00:00-08:00

We will start posting announcements rather than sending email for every little thing. You are responsible for checking these.
Material from the Software Carpentry bootcamps will be posted here
You should have received emails from the TA and GitHub regarding homework 1.5. If you didn't, please send your name, github username, uni, and email to the TA at zss2101@columbia.edu
See this post about editors. You are expected to install a decent text editor.
We posted about debugging with the pdb++ debugger.
In homework 1.5, you should modify the top line of test/testscripts.py to reference the name of your particular repo. In other words, if your repo is homework_1p5_team_1, then change homework_1p5 to homework_1p5_team_1.

Fixing your VM

2013-01-31T00:00:00-08:00

The most important thing

Turn on your VM and open a terminal
In the terminal, type sudo apt-get install gnome-session-fallback
- This will install a new graphics manager for your desktop
- Click Y when asked
Log out (or restart)
When you log in, there will be a "gear shaped" icon near your login name. Click it and select GNOME Classic (No Effects)

Memory

By default, the memory allocated for the VM is only 512 MB. This is too little. Make sure Windows and your VM each have a decent amount of memory allocated.

If Windows has less than 3GB, it isn't happy
If Ubuntu has less than 3GB, it isn't happy
32 Bit Ubuntu uses less memory

Guest Additions

If guest additions is not installed, then your display will be very small. Install it.

Lecture notes

2013-01-29T00:00:00-08:00

I added a link to the lecture notes on the home page of this website. I also posted the unix notes (they are chapter 1).

head tail

2013-01-29T00:00:00-08:00

In class, someone asked me how to extract the second row of a file. I said that a Python script would be the simplest way. How wrong I was! I can't believe I missed it, given the topic of yesterday's lecture, but there is a very simple way to do this using head, tail, and a pipe |.

Suppose data.csv looks like:

name,score
ian,100000
daniel,1
mike-tyson,10

Then head works like:

$ head -n 3 data.csv
name,score
ian,100000
daniel,1

In other words, if you give it the option -n 3 then it returns the first three lines of the file. tail works like head, but gives the last n lines. Now...try using a pipe to tie them together and demonstrate how to extract the second line. Post your answer as a comment on this site.
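To make the behavior of the two commands concrete, here is a rough Python sketch of each one on its own. The function names head and tail are chosen for this example only, and the sketch leaves the pipe exercise to you.

```python
from collections import deque
from itertools import islice


def head(lines, n):
    """Return the first n lines, like head -n."""
    return list(islice(lines, n))


def tail(lines, n):
    """Return the last n lines, like tail -n."""
    # A deque with maxlen keeps only the most recent n items.
    return list(deque(lines, maxlen=n))


data = ["name,score", "ian,100000", "daniel,1", "mike-tyson,10"]
print(head(data, 3))  # prints ['name,score', 'ian,100000', 'daniel,1']
print(tail(data, 1))  # prints ['mike-tyson,10']
```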

Office Hours

2013-01-29T00:00:00-08:00

A short post about office hours... We will be hosting office hours online via google hangouts. The procedure to "attend" office hours is the following:

Sign up for a google/google+ account
During an allotted time send a message to applied.data.science@gmail.com via Gchat and request to be added to office hours.
The instructor will then invite you to the google hangout. Note: there is a limited number of users who can be in a hangout at the same time, so like in "real" office hours you might have to wait to gain access. The instructor will notify you if this is the case.

The office hour times are listed below:

Online Office Hours

Via Google+ Hang Outs: applied.data.science@gmail.com

Sunday 1:00 - 2:30
Tuesday 9:30am - 11:00am
Thursday 5:30 - 7:00

There is also an in-person office hour with our TA Zach Shahn zss2101@columbia.edu:

Friday 1:00 - 3:00, stat dept. lounge, 10th floor

Lecture Notes

2013-01-28T00:00:00-08:00

We are writing notes that are meant to complement our lectures. The first installment, the preface, is now available here.

Since the lecture notes will be published along with homework, look for posts about them in the "Homework XX" categories.

Auditing the Course

2013-01-26T00:00:00-08:00

If you are thinking of auditing the course: we will allow auditors as long as there are seats in the class. Please set up your computer and create a github account as described on this site. Then email the github username to applied.data.science@gmail.com so that you can have access to the repositories. Note: you will not be able to submit homework or be graded on the code you write. Also, try to attend the Software Carpentry workshops.

Software Setup (Homework 01) Questions

2013-01-24T00:00:00-08:00

Be sure to also read the latest-and-greatest solution to VM problems here

There were many common questions that came up during the setup process. For those of you who have yet to set things up, hopefully this post will be able to save you some time and headaches:

I'm on OSX, how do I install Xcode?

Go to the App Store and install it. Once it's installed, launch it and go to 'Xcode -> Preferences -> Downloads' and make sure to install 'command line tools'

The Ubuntu VM password doesn't work.

The password on the USB sticks was incorrect. It should be "reverse".

Where is this "Terminal" thing?

OSX: Terminal should be in Applications -> Utilities (or Other).
Ubuntu: Click on the top left ubuntu icon. This is your "start menu". Type in the word "terminal", and click on the search result.
Hint: you can right click on the icon in both OSes and lock Terminal to the launcher.

OK, I downloaded Anaconda, what do I do?

Now that you know where Terminal is, we can install Anaconda.
1. Open up a Terminal instance.
2. Type cd ~/Downloads and hit Enter.
3. Type bash Anaco and hit Tab. At this point the entire file name for the Anaconda file you downloaded should appear. Now hit Enter.
4. Follow the instructions and install using default options. If you run into a problem where a single letter y scrolls down infinitely, do not panic. Just hit Ctrl-C to interrupt the process and start again by pressing Up-arrow (this is the previous command) then Enter.

OK, Anaconda finished installing. Am I done?

Almost. You need to configure your environment.

On OSX:

Type cd into the Terminal and hit Enter.
Type nano .bash_profile into the Terminal and hit Enter. This brings up a text editor window in the Terminal.
In Nano, type export PATH=$HOME/anaconda/bin:$PATH
Hit Ctrl+X, then Y, then Enter. Now you should be back in the Terminal
Type source .bash_profile and hit Enter
Now type which python and hit Enter. If the printed output includes the anaconda installation directory then you're all set.

On Ubuntu:

Type cd into the Terminal and hit Enter.
Type nano .bashrc into the Terminal and hit Enter. This brings up a text editor window in the Terminal.
In Nano, hit Enter then Up-arrow, this creates a blank line at the top of the file.
On the blank line, type export PATH=$HOME/anaconda/bin:$PATH
Hit Ctrl+X, then Y, then Enter. Now you should be back in the Terminal
Type source .bashrc and hit Enter
Now type which python and hit Enter. If the printed output includes the anaconda installation directory then you're all set.

My VM is very very slow...

Try the 32 bit version

I can't get a VM to work...what should I do?

Try this VM
Try a dual-boot setup

Software Carpentry Bootcamps

2013-01-23T00:00:00-08:00

Unless you are already proficient at unix/git/python/unit-tests, this course will be VERY difficult.

To help ease you into things, we have organized some software carpentry bootcamps. Attendance is highly encouraged, as without a level of comfort with the material presented the course cannot be completed. Sign up here

Dates: Attend either Jan 30, 31 OR Feb 1,2 (slots may fill up)

Times: 9am - 4:30pm

Location: 414 and 750 Shapiro CEPSR (Interschool lab). See the signup sheet.

Class waitlist

2013-01-23T00:00:00-08:00

There is a waitlist for this course. It is here.

We cannot admit more than the maximum (100) people. So this is the only way to get added.

DevFest

2013-01-21T00:00:00-08:00

Here's a great chance to work on developing some new products. They're looking for hackers or people with data knowhow.

DevFest is a week-long development festival during which students build applications, experiment with new technologies, and compete for awesome prizes. DevFest will kick off with a pitchfest and team formation on Saturday, February 2, followed by workshops and hacking time. The week will continue with a technical workshop and hacker office hours every night. DevFest will finish strong with an all-night hackathon from Friday, February 8th - Saturday 9th, after which the apps will be demoed to a panel of judges and prizes will be awarded. For more details see http://adicu.com/devfest

New course: Computational Social Science

2013-01-19T00:00:00-08:00

I just received word of an exciting new course offering through the Applied Math Department. The course, Computational Social Science is being taught by Sharad Goel, Jake Hofman, and Sergei Vassilvitskii. I have heard Jake lecture before and in addition to being technically very strong, he is quite a good speaker.

Bicoastal Datafest: Analyzing money's influence in politics

2013-01-06T00:00:00-08:00

Meet journalists, scientists, engineers, data experts, and developers for a cross-disciplinary and bicoastal weekend of brainstorming, data-diving, story telling and civic action, not to mention prizes, food, and fellowship.

For more information, see http://www.bdatafest.computationalreporting.com/

merry christmas linux laptop

2012-12-21T00:00:00-08:00

Here's an idea for those of you stuck with a crappy Windows machine. For Christmas, get a new Linux laptop from System76.

I got the 14 inch Lemur with 16GB memory, 4 cores, 512 GB solid state drive and all the accessories...it set my new employer back $1400 (less than half what a comparable Macbook would cost). A cheaper, but still acceptable machine for this class (remember to get at least 8GB memory!) can be put together for $700.

The advantage of Linux is that it is the absolute easiest OS for installing hacking/programming/scientific-computing software. The latest and greatest stuff is often made first for Linux, then ported to OSX. The downside is that it can be difficult to get all the hardware working correctly (especially on a laptop). That's the reason to buy from a vendor who pre-installs Linux. They guarantee hardware compatibility. Note that some things, like YouTube, will still run into glitches every now and then. The glitches get fixed...but it's not just plug and play like a Macbook.

Here are some other places you can get pre-installed Ubuntu Linux.

ZaReason is similar to System76. Their UltraLap 430 almost won me over with its small size and weight.
Dell's Project Sputnik campaign has put together a high end, super thin, super light machine.
Thinkmate's HPX series workstations are complete overkill and unnecessary for this class, but fun to look at.

Update

After owning the computer for two months, I have this report:

I LOVE it
For scientific computing, it works better "right out of the box" than my co-worker's Macs
The keyboard/trackpad isn't that good...but I use an external keyboard/mouse
After installing some extra "32-bit enabling" libraries, Skype works perfectly
YouTube works perfectly
Amazon's streaming movies don't work

winter break fun

2012-12-20T00:00:00-08:00

Here are some things you can do to get a jumpstart on the class.

Everyone

Motivate yourself to use unix
- Hole Hawg story (this is what converted me)
- The unix philosophy
Set up your computer for this course. Instructions here. WARNING! This may be difficult. If you are stuck, then get help from a friend (I can't help now), or wait until class starts.

Beginner

Software Carpentry Lessons
- Python
- The unix shell
Python introductory course
Unix shell tutorial

Intermediate

Git
- Clone the first homework repo and try to understand it
- The first homework will ask you to modify this utility to do other useful things
Software engineering
Regular expressions

Homework 01: Computer Setup for Applied Data Science Course

2012-12-20T00:00:00-08:00

Note: People not in this Course, but who are participating in the software carpentry bootcamp, should instead follow these instructions

Before First Class

Bring your computer to class so we can help you set things up

You should download the following before coming to the first class on Wednesday, January 23rd, 2013:

A version of Anaconda appropriate for your machine
If you have Windows, download and unzip either the 32 or 64 bit VM image (see this explanation about 32 vs. 64 bit).
- If you download the versions with guest-additions pre-installed you can save yourself a little bit of work
If you have a mac, download Xcode
- First try getting it from the app store
- If this doesn't work (due to an older OSX), you have to register as a developer

Software to install

Overview

Python distribution
- Anaconda or EPD
- For Linux users
- For Mac users
- For Windows users
Editor
- vim-gnome or macvim
Version control
- git
Additional libraries
- pdbpp
- pep8
- line_profiler

Motivation

Installing software and setting up your system for this class can be quite easy, or very very difficult. This is based on your OS, existing environment, and random chance. During the first week of class, we will have dates/times dedicated to helping you set up your system. After these dates, you are almost on your own. Although you can find instructions on the Internet, they often don't work exactly as stated.

Supported Configurations

To the best of our ability, we will support Ubuntu Linux and Mac OSX operating systems along with the Anaconda Python distribution (Anaconda handles all of your Python package needs).
This class will require use of Linux utilities. Standard Microsoft Windows will not work.

Installing Supported Python Configurations

If you have Linux

Install Anaconda CE.
Hints
- Try to install in your HOME directory (default) so you don't need sudo
- Don't invoke the installer shell using sudo if installing into HOME directory
- Remember to configure your environment
- Remember to read the documentation
Install Git version control, and additional packages including VIM and other Python packages

If you have a Mac

You need to first install xcode,
- xcode can be installed by going to the App Store. After installing xcode, go to the top left of your screen and click XCode -> Preferences -> Downloads, find "command line tools" and click install.
- xcode is a 1GB+ download so you will not have time to download it in class on Wednesday
- Note that the current version of xcode is only supported by OSX 10.7.4+ so we highly recommend you upgrade your operating system. If for whatever reason you absolutely cannot upgrade your os, you need to register for a free Apple developer account and download the appropriate version of xcode
For 64 bit OSX, install Anaconda CE
- By default this is installed in your home directory. Unless you know what you're doing, don't change it.
For 32 bit OSX, install EPD academic
Remember to read the documentation
Remember to configure your environment
Install Git version control, and additional packages including VIM and other Python packages

If you have Windows

We will set you up with a Linux virtual machine. You can then follow the Linux instructions

Use a 32 bit Ubuntu Linux VM if you have 4-6GB of memory
Use a 64 bit Ubuntu Linux VM if you have > 6GB of memory
Download and unzip either the 32 or 64 bit VM image.
- If you download the versions with guest-additions pre-installed you can save yourself a little bit of work
Download VirtualBox
Run the installer
Open VirtualBox Manager
Click "New" to create new virtual machine
Select "Linux" for Type and "Ubuntu" or "Ubuntu (64-bit)" for Version
Next, allocate memory for your VM. If you have X total GB of RAM, and you allocate Y to your VM, then Windows has X - Y left over for itself. You must balance the needs of both Ubuntu and Windows. Here are some hints.
- 64 bit Windows is unhappy with less than 3 GB
- 64 bit Ubuntu is unhappy with less than 3 GB
Next, select "Use an existing virtual hard drive file" and select the VDI file you downloaded
Once the VM has been created, select it and click "Start"
If the VM image you downloaded already has guest-additions installed then you can skip this step. Otherwise once the setup is complete you need to install guest additions.
The default keyboard layout is Italian. To change this
- Click System Settings
- Keyboard Layout
- Hit the "+" button
- Select a new layout from the list

Now change the window manager IMPORTANT!!

Turn on your VM and open a terminal
In the terminal, type sudo apt-get install gnome-session-fallback
- This will install a new graphics manager for your desktop
- Click Y when asked
Log out (or restart)
When you log in, there will be a "gear shaped" icon near your login name. Click it and select GNOME Classic (No Effects)

Extra Help: Configuring environment variables

Modify your shell configuration file, henceforth referred to as your bashrc file.
- Mac OSX: From your home directory (i.e., ~/) open either .bash_profile or .bash_aliases (create one if neither exists), add export PATH=$HOME/anaconda/bin:$PATH
- Linux: From your home directory (i.e., ~/) open .bashrc (create it if it does not exist), add "export PATH=/path/to/python:$PATH"
Refresh your terminal by typing source ~/.bashrc or just opening a new terminal.

Verify Things

Open a terminal and start IPython with: ipython --pylab
- Verify numpy with import numpy
- To check pandas and matplotlib, from IPython, type
```
from pandas import Series;
Series(randn(10)).plot()
```
Verify the notebook:
- ipython notebook --pylab=inline should pop up a browser window and show the notebook dashboard
Verify your PATH setting:
- which python should show the directory in which you installed Anaconda/EPD as the first entry
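The checks above can also be scripted. The following is a sketch under a few assumptions: it targets Python 3.3 or later (shutil.which does not exist in older versions), and the function names check_imports, first_on_path, and report are made up for this example.

```python
import importlib
import shutil


def check_imports(names):
    """Map each module name to True if it imports, False otherwise."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


def first_on_path(program):
    """Return the full path of the first matching executable on PATH, or None."""
    return shutil.which(program)


def report(modules=("numpy", "pandas", "matplotlib")):
    """Print one status line per module, then the python found first on PATH."""
    for name, ok in sorted(check_imports(modules).items()):
        print(name, "OK" if ok else "MISSING")
    print("python resolves to:", first_on_path("python"))
```

Calling report() from an interpreter prints the status of each library and the location of the python executable; that location should sit inside the directory where you installed Anaconda/EPD.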

Install additional software

Ubuntu users should use "apt-get" command to install software packages. The syntax is "sudo apt-get install ..."
VIM
- Linux: "sudo apt-get install vim vim-gnome"
- Mac: download Macvim and follow installation instructions
Other (easier/weaker) editors
- Linux: "sudo apt-get install gedit-plugins"
- Mac: Download and install sublime text
Python libraries
```
pip install pdbpp line_profiler pep8
```
- If you don't have pip, install it first using easy_install pip

Set up version control

Sign-up for free Github account
- Send username, email address, and uni to Zach Shahn zss2101@columbia.edu
Install git
- Mac: download from mac.github.com
- Linux: type sudo apt-get install git

A few words about what this class intends to teach you

2012-12-20T00:00:00-08:00

What this class is not

This class is not a traditional statistics course, although much of the material will be rooted in statistical analysis. This class is not a computer science course, although you will program a lot and hopefully become better at doing so in different environments. And, this class is not a machine learning course, although ML techniques will be foundational material for the lectures. You will not have clean data sets to start with all the time; you will get data sets as we have seen them working in data science space for the last few years. The data sets might be messy and unstructured, and it might not always be clear how to extract the relevant signals for the problem at hand. However, this is part of the fun and we hope you will agree by the time May rolls around.

What this class is

This class is an introduction to the collection of techniques we have found indispensable when working in the data science space. There will be significant emphasis on understanding the relevant statistics and proper application thereof. It will teach you to write good code and use collaborative tools to do so, because after all if you intend to build things for other people to use there is no other option. We will talk about staple machine learning algorithms and techniques, going into some depth about the background mathematics, but will always revert back to implementing those techniques into python libraries to be used for subsequent data analysis. Sometimes you will have to find, get, process and clean data before taking initial steps in any kind of statistical modeling. In short, you will have a taste of the day-to-day in the data science world, and walk away with the foundational knowledge and toolkit that will allow you to build solutions in this relatively new and exciting area.

Course. Data Science and Technology Entrepreneurship

2012-11-30T00:00:00-08:00

Chris Wiggins informed me of a course that may be of interest to some of you. See this page for updated information including the syllabus. A brief description is below.

Data Science And Technology Entrepreneurship

"Offered jointly between Columbia Business School and Computer Science Department"

This course will pair up MBA students from Columbia Business School with Master’s/PhD students from Computer Science department to form teams of two (or more) who will be guided through an entrepreneurial experience of building a technology startup. The course will be very hands on! The course will also have a team of 12 Industry Advisors/Mentors (CEOs, CTOs and VC Partners of various firms) who will engage with students to help them convert their idea into a sustainable technology business.

Data Science is an emerging interdisciplinary field across statistics, computer science and business. The course will not only focus on theoretical aspects of data sciences but also on applying them in building products and improving business processes. Student teams (composed of CS/Engineering and Business students) will use data driven methods to test feasibility of the idea/innovation, build the product, develop customers, study sales channels and try to raise capital during the span of 4 months. Industry mentors will critique the student teams and their ideas through various stages of the startup implementation addressing such questions related to feasibility, market attractiveness, customer acquisition, metrics, launch strategy and more. The students will be able to interact with CEOs for business mentorship, CTOs for technical mentorship and VC firm partners for advice on the capital raising process.

Course Proposal

2012-11-15T00:00:00-08:00

This post is no longer current

Please see the course description

Course basics

Number: STAT 4249
Class times: MW 6:10 - 7:25
Lead instructor: Ian Langmore, ianlangmore@gmail.com

This class is...

for people with an understanding of statistics at the first-year graduate level or beyond
a way to learn/write basic algorithms for statistical inference and predictive analytics
a chance to apply algorithms to real data sets and gain data science intuition
a way to learn solid programming skills (beginning through intermediate)
- Python
- Linux
- Github
- Collaborative development
- Object Oriented Design

This class is not...

for people who don't know any stats or linear algebra
for people who have never programmed before
an overview of advanced methods in machine learning

Full description

The explosion of available data coinciding with the continued evolution of statistical and computational methods has resulted in a new breed of specialist. These data scientists use rigorous statistical methods to find meaning in data. Minimizing a loss function is not enough: Business and societal decisions hinge on the interpretation of these insights. The world of scientific computation is rapidly evolving. Quick-and-dirty scripts are not enough: A maintainable code base and collaborative development environment allow projects to productionalize and scale. A data scientist must wear many caps; we present two of them here.

Maintainable coding techniques will be taught using test-driven-development, version control, and collaboration. Code will be of the type found in the scikit-learn and statsmodels packages. Students finish the class having created a library on GitHub, and an understanding of several core statistical/machine-learning algorithms.

Case studies give students the opportunity to use their own software on real world data sets. Here they develop intuition for extracting meaning from data. Students finish the class with a website/blog/portfolio, and experience with the translation:

Real world --> data --> scientist --> collaborators/coworkers --> policy-decision/data-product

Lecture structure

An algorithm is presented. Students are randomly assigned to groups and together write a productionalizable implementation.
The class is presented with a data-driven business/scientific problem that a company/institution has, and they must solve (using the algorithm from 1).
- Each step takes one week.
- Step 1 demands that a GitHub repo be created. The repo is maintained with the imaginary goal of being later productionalized for a client. This problem is very clearly defined. The goals here are to learn algorithms and scientific computing skills in a collaborative environment.
- Step 2 demands the creation of a presentation and a written report. One group is randomly chosen to present their pitch/solution to the class. The problem will not necessarily be clearly defined. Students must find where they can add value, then convince us that they can. Students use software developed in step 1, along with other packages.
- The data for step 2 will come from NYC start-ups and non-profits.
- Will use the book python for data analysis.
- May use the book machine learning in action. If so, we will require modifications of the algorithms presented there.

Prerequisites

Stats (4109 or 4105+4107) or equivalent
Some proficiency in programming
Computer:
- A mac or Linux is fine.
- If you have Windows, we will assist you in setting up a Linux dual-boot or virtual machine.
- An 8GB machine will help you tremendously. A 2GB machine will cause headaches. Spend the $60 and upgrade...you want to analyze data, right?

Lectures/algorithms/HW

Introduction
- Course introduction
- Software setup workshops
  - If you have Linux, then we will do a quick check of your system
  - If you have a Mac, we will transform it into a real mac
  - If you have a Windows machine, we will set up Linux with either a virtual machine or dual-boot.
Programming introduction
- Python introduction
- Unix introduction
- Software carpentry workshops
Data Tools
- Git/Github introduction
- Teams build a suite of data tools
  - Cleaning filters
  - Subsampling
  - SQL scripts
Exploratory data analysis
- Pandas
- Numpy, scipy, matplotlib
- Build an EDA suite
Linear regression
- The singular value decomposition (SVD)
- Maximum likelihood
- Regularization and Bayesian estimators
- Memory hierarchy, stability, and why you never explicitly invert a medium or large matrix
- Teams build a linear regression module
- Teams work on case study (topic TBD)

Other algorithms presented will follow the same structure as "Linear regression" above, and could include:

Logistic regression/classification
K-nearest neighbors
Kernel density estimation
Decision trees, random forests
Monte Carlo simulation
Recommendation systems

Possible additional topics

Web scraping
Typing and compilers. Could be taught by using Cython.
