Homework 04
Homework 4: Linear Regression with GSS Data
You will use your linear regression module from last week to analyze the General Social Survey (GSS) data. This is a yearly social science survey that "takes the pulse of America."
Presentation: Wednesday, March 13. Email a copy to the TA and be prepared to present in class.
Data directory layouts
You will have to transform your data by cleaning/cutting out certain columns. This can lead to a mess of different data files. Here are some suggestions.
You have the following layout by default:
notebooks/
scripts/
src/
data/
    raw/
    processed/
- Remember not to commit any data to the repository.
- I like to keep the data in `raw` as completely untouched copies from websites, or common transformations of them (e.g. the csv files that result from the "Get data" step below). The key point is that once data goes into `raw`, I never change it.
- The `processed` directory is for the altered versions of the `raw` directory.
- There is a `scripts` directory that can be used to store scripts (Python or Bash) that transform data from the type in `raw` to the type in `processed`.
- I also use ipython notebooks (stored in `notebooks/`) to transform data from `raw` to `processed`. I commit these scripts and notebooks to the repo and other people can run them to get copies of my processed data.
- For longer projects I create snapshots of the `processed` directory that have a timestamp in their name, e.g. `processed-2013-02-11`.
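The timestamped snapshot described in the last point could be automated with a small script in `scripts/`. This is only a sketch: the function name `snapshot_processed` and its arguments are made up for illustration, and it assumes the `data/processed` layout shown above.

```python
import shutil
from datetime import date
from pathlib import Path


def snapshot_processed(data_dir, today=None):
    """Copy data_dir/processed to data_dir/processed-YYYY-MM-DD and return the new path.

    `snapshot_processed` is a hypothetical helper, not part of the course repo.
    """
    data_dir = Path(data_dir)
    today = today or date.today()
    source = data_dir / "processed"
    target = data_dir / f"processed-{today.isoformat()}"
    # copytree raises FileExistsError if the target exists,
    # so an earlier snapshot from the same day is never overwritten.
    shutil.copytree(source, target)
    return target
```

Calling `snapshot_processed("data")` from the repository root would create a dated copy next to `processed`. Nothing here touches `raw`, in keeping with the rule that raw data is never changed.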
Project deliverables
- Use the 2006 data to train models, then test those models on the 2010 data.
- Predict `income06` as a function of other variables. For simplicity, this should be one single model that includes everyone...in other words, don't segment your data. Note that `income06` is missing in about 15% of responses. You don't have to predict `income06` for these people. This model should work even in the presence of missing data (with the exception of missing `income06`)! So, you should probably fill the missing values with something.
- Find one other relation to predict. Make sure it is appropriate for linear regression. Segment or do whatever you want. The model can work for the whole population or subpopulations.
- Make a 15 minute slide-show presentation documenting your work. This is what you turn in. Two randomly chosen groups will present this in class on Wednesday March 13. The intended audience is the class...so present at the appropriate technical level.
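One possible way to handle the missing values is sketched below. It is an assumption, not the required approach: rows with a missing target are dropped, missing predictors are filled with the training-set mean, and NumPy's `lstsq` stands in for your own regression module. The helper names and the toy `educ` column are illustrative only.

```python
import numpy as np
import pandas as pd


def prepare(train, test, target, features):
    """Drop rows missing the target, then fill missing features with training-set means."""
    train = train.dropna(subset=[target])
    test = test.dropna(subset=[target])
    fill = train[features].mean()  # computed on training data only, reused for the test set
    X_train = train[features].fillna(fill).to_numpy(dtype=float)
    X_test = test[features].fillna(fill).to_numpy(dtype=float)
    y_train = train[target].to_numpy(dtype=float)
    y_test = test[target].to_numpy(dtype=float)
    return X_train, y_train, X_test, y_test


def fit_ols(X, y):
    """Least-squares coefficients; the intercept is the first entry."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply intercept plus slopes to a feature matrix."""
    return coef[0] + X @ coef[1:]


def r_squared(y, y_hat):
    """Fraction of variance in y explained by y_hat."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


# Toy stand-ins for the 2006 and 2010 frames (made-up numbers, not GSS data).
df2006 = pd.DataFrame({"educ": [1.0, 2.0, 3.0, 4.0], "income06": [5.0, 8.0, 11.0, np.nan]})
df2010 = pd.DataFrame({"educ": [np.nan, 5.0], "income06": [8.0, 17.0]})

X_train, y_train, X_test, y_test = prepare(df2006, df2010, "income06", ["educ"])
coef = fit_ols(X_train, y_train)
print("test R^2:", r_squared(y_test, predict(coef, X_test)))
```

Computing the fill values from the 2006 data alone keeps information from the 2010 test set out of the model. Mean filling is only one choice; a median or a separate "missing" indicator column may suit some variables better.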
Getting started
Useful links
- The 2008 GSS Codebook will be useful for variable definitions.
- The GSS User's Guide shows you how to search for variable descriptions using the website. The website is very very very slow.
Basic workflow
1. Get data
2. Inspect data
3. Clean data
4. Explore relationships (EDA)
5. Fit model
6. Inspect results
7. Repeat 2-6
Get data
- Download the 2006 and 2010 datasets from this site. Get the individual years, which are under "Download Individual Year Data Sets".
- Convert these STATA datasets into Pandas DataFrames using these instructions. Store them as csv files using (assuming the DataFrame is named `df`): `df.to_csv('filename', index=False)`
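If the linked instructions are unavailable, the conversion can be sketched with `pandas.read_stata`. The helper name `stata_to_csv` and the file names in the usage note are hypothetical, and turning off `convert_categoricals` is a choice made here, not something the assignment specifies.

```python
import pandas as pd


def stata_to_csv(dta_path, csv_path):
    """Read a Stata .dta file, write it as csv, and return the DataFrame.

    Hypothetical helper for illustration only.
    """
    # convert_categoricals=False keeps the numeric codes rather than the value labels.
    df = pd.read_stata(dta_path, convert_categoricals=False)
    df.to_csv(csv_path, index=False)
    return df
```

For example, `stata_to_csv("data/raw/gss2006.dta", "data/raw/gss2006.csv")` would leave both the original file and its csv version in `raw`. Keeping numeric codes makes the columns directly usable in a regression, but you then need the codebook to interpret them.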
Inspect data
- Inspect the csv files with `less` and see what they look like. Remember `Ctrl-f`, `Ctrl-b` to move forward and backward.
- Use `head` to create a file (probably located in `/tmp/`) containing the first 100 lines. Look at this file in excel (or `libreoffice` in ubuntu). You may get an error about too many columns...that's ok, just look at what you can!
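If you prefer to stay inside Python, the effect of `head` can be reproduced with the standard library. This is just an equivalent sketch; `write_head` is a made-up name and the paths you pass in are up to you.

```python
from itertools import islice


def write_head(src, dst, n=100):
    """Copy the first n lines of src to dst and return how many lines were written."""
    written = 0
    # newline="" passes line endings through unchanged
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        for line in islice(fin, n):
            fout.write(line)
            written += 1
    return written
```

Because `islice` stops after `n` lines, the full csv is never loaded into memory, which matters for wide survey files.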
Clean data
- Start your ipython notebook with `ipython notebook --pylab inline`.
- Create a new notebook named `cleaning-your-name`. This will be used for cleaning data.
- See `notebooks/HW4_cleaning_EDA` for an example.
- Read in the 2006 and 2010 datasets into DataFrames named `df2006`, `df2010`.
- We will probably have little use for columns that are mostly NaN. Use `df.count().order()` to figure out which columns have lots of missing values. Chop off these columns by creating a boolean mask that will be true if a column has enough good entries and then using `df.ix[:, mask]`. We are only interested in variables that are in both datasets, so use pandas reindex to modify and align the columns like so:
col = df2006.columns.intersection(df2010.columns)
df2006 = df2006.reindex(columns=col)
df2010 = df2010.reindex(columns=col)
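The boolean-mask step can be sketched as follows. Note that `Series.order()` and `.ix` have since been removed from pandas, so this version uses `sort_values()` and `.loc`, which do the same job. The 50% threshold and the name `drop_sparse_columns` are assumptions for illustration; pick whatever cutoff suits your data.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, min_fraction=0.5):
    """Keep only columns where at least min_fraction of the rows are non-missing."""
    counts = df.count()  # non-NaN entries per column
    mask = counts >= min_fraction * len(df)
    return df.loc[:, mask]


# Toy frame (made-up values): `rare` is missing in three of four rows.
df = pd.DataFrame({
    "age": [30, 41, 52, 63],
    "rare": [np.nan, np.nan, np.nan, 1.0],
})
print(df.count().sort_values())  # columns with the fewest good entries come first
cleaned = drop_sparse_columns(df, min_fraction=0.5)
print(list(cleaned.columns))
```

Applying the same function to `df2006` and `df2010` before taking the column intersection means a variable survives only if it is reasonably complete in both years.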
EDA
See `notebooks/HW4_cleaning_EDA` or go to this link for an example.
Build your model
See `notebooks/HW4_regression_example` or go to this link for an example.
You will have to add variables and see if it improves your fit. Make sure your variables make sense intuitively. Do EDA and read about the data to gain intuition.
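One way to judge whether an added variable improves the fit is to compare R^2 on held-out data with and without it. The sketch below uses synthetic data and NumPy's `lstsq` as a stand-in for your own regression module; `holdout_r2` is a hypothetical helper, and for the homework the 2006 data would play the role of the training set and 2010 the test set.

```python
import numpy as np


def holdout_r2(X_train, y_train, X_test, y_test):
    """Fit least squares with an intercept on the training set; return R^2 on the test set."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    resid = y_test - A_test @ coef
    return 1.0 - resid @ resid / np.sum((y_test - y_test.mean()) ** 2)


# Synthetic data: y depends on both columns, so adding the second should help.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=400)
train, test = slice(0, 200), slice(200, 400)

r2_one = holdout_r2(X[train, :1], y[train], X[test, :1], y[test])
r2_two = holdout_r2(X[train], y[train], X[test], y[test])
print(f"one variable: {r2_one:.3f}, two variables: {r2_two:.3f}")
```

Training-set R^2 never decreases when a variable is added, so the held-out comparison is the more honest check. A variable that raises the score should still make intuitive sense before you keep it.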