Homework 4: Linear Regression with GSS Data
You will use your linear regression module from last week to analyze the General Social Survey (GSS) data. This is a yearly social science survey that "takes the pulse of America."
Presentation: Wednesday, March 13. Email a copy to the TA and be prepared to present in class.
Data directory layouts
You will have to transform your data by cleaning/cutting out certain columns. This can lead to a mess of different data files. Here are some suggestions.
You have the following layout by default:
notebooks/
scripts/
src/
data/
    raw/
    processed/
- Remember not to commit any data to the repository.
- I like to keep the data in the raw directory as completely untouched copies from websites, or common transformations of them (e.g. the csv files that result from "Getting the data" above). The key point is that once data goes into raw I never change it.
- The processed directory is for the altered versions of the data in raw.
- There is a scripts directory that can be used to store scripts (Python or Bash) that transform data from the type in raw to the type in processed.
- I also use ipython notebooks (stored in notebooks/) to transform data from raw to processed. I commit these scripts and notebooks to the repo and other people can run them to get copies of my processed data.
- For longer projects I create snapshots of the processed directory that have a timestamp in their name.
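A timestamped snapshot of the processed directory could be made with a few lines of Python. This is only a sketch: the function name snapshot_processed and the processed-YYYYMMDD-HHMMSS naming scheme are assumptions for illustration, not a convention the course prescribes.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot_processed(data_dir, now=None):
    """Copy data_dir/processed to a timestamped sibling directory.

    Returns the path of the new snapshot directory.
    """
    data_dir = Path(data_dir)
    now = now or datetime.now()
    # Hypothetical naming scheme: processed-YYYYMMDD-HHMMSS
    target = data_dir / ("processed-" + now.strftime("%Y%m%d-%H%M%S"))
    # copytree fails if the target already exists, so old snapshots are never overwritten
    shutil.copytree(data_dir / "processed", target)
    return target
```

Calling snapshot_processed("data") from the repository root would copy data/processed into a new directory next to it. Like the rest of the data, snapshots should stay out of the repository.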
- Use the 2006 data to train models, and test those models on the 2010 data.
- Predict income06 as a function of other variables. For simplicity, this should be one single model that includes everyone. In other words, don't segment your data. Note that income06 is missing in about 15% of responses. You don't have to predict income06 for these people. This model should work even in the presence of missing data (with the exception of missing income06)! So, you should probably fill the missing values with something.
- Find one other relation to predict. Make sure it is appropriate for linear regression. Segment or do whatever you want. The model can work for the whole population or subpopulations.
- Make a 15-minute slide-show presentation documenting your work. This is what you turn in. Two randomly chosen groups will present this in class on Wednesday, March 13. The intended audience is the class, so present at the appropriate technical level.
- The 2008 GSS Codebook will be useful for variable definitions.
- The GSS User's Guide shows you how to search for variable descriptions using the website. The website is very, very, very slow.
1. Get data
2. Inspect data
3. Clean data
4. Explore relationships (EDA)
5. Fit model
6. Inspect results
7. Repeat steps 2-6
- Download the 2006 and 2010 datasets from this site. Get the individual years, which are under "Download Individual Year Data Sets".
- Convert these Stata datasets into pandas DataFrames using these instructions.
Store them as csv files using the DataFrame's to_csv method.
- Inspect the csv files with less and see what they look like. Remember Ctrl-f and Ctrl-b to move forward and backward.
- Use head to create a file (probably located in /tmp/) containing the first 100 lines. Look at this file in Excel (or libreoffice in Ubuntu). You may get an error about too many columns. That's ok, just look at what you can!
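The conversion and the "first 100 lines" steps above can be sketched in Python as follows. This is a minimal sketch, not the linked instructions: the helper names stata_to_csv and write_head are made up for illustration, and passing convert_categoricals=False (which keeps the numeric codes instead of the text labels) is an assumption that you may want to change.

```python
import pandas as pd


def stata_to_csv(dta_path, csv_path):
    """Read a Stata .dta file and store it as a csv file."""
    # Assumption: keep numeric codes rather than converting to labelled categories.
    df = pd.read_stata(dta_path, convert_categoricals=False)
    df.to_csv(csv_path, index=False)
    return df


def write_head(csv_path, out_path, n_lines=100):
    """Copy the first n_lines lines of a file, like `head -n 100 file > out`."""
    with open(csv_path) as src, open(out_path, "w") as dst:
        for i, line in enumerate(src):
            if i >= n_lines:
                break
            dst.write(line)
```

You would call stata_to_csv once per year, reading from and writing to whatever file names you chose under data/raw.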
- Start your ipython notebook with ipython notebook --pylab inline.
- Create a new notebook named cleaning-your-name. This will be used for cleaning data.
- See notebooks/HW4_cleaning_EDA for an example.
- Read in the 2006 and 2010 datasets into DataFrames named df2006 and df2010.
- We will probably have little use for columns that are mostly NaN. Use df.count().sort_values() (df.count().order() in very old pandas versions) to figure out which columns have lots of missing values. Chop off these columns by creating a boolean mask that is true if a column has enough good entries, and then using that mask to select the columns to keep.
We are only interested in variables that are in both datasets, so use pandas reindex to modify and align the columns like so:
col = df2006.columns.intersection(df2010.columns)
df2006 = df2006.reindex(columns=col)
df2010 = df2010.reindex(columns=col)
See notebooks/HW4_cleaning_EDA or go to this link for an example.
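The cleaning steps can be sketched as two small helpers. This is one possible approach, not the example notebook's code: the function names, the min_fraction threshold of 0.5, and filling gaps with column medians are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, min_fraction=0.5):
    """Keep only columns where at least min_fraction of the rows are non-missing."""
    counts = df.count()                      # non-NaN entries per column
    mask = counts >= min_fraction * len(df)  # boolean mask, True means keep
    return df.loc[:, mask]


def fill_missing(df, target="income06"):
    """Drop rows with a missing target, then fill other numeric gaps with medians."""
    out = df.dropna(subset=[target])
    # Assumption: the column median is a reasonable stand-in for a missing value.
    medians = out.median(numeric_only=True)
    return out.fillna(medians)
```

If you fill the 2010 data too, consider reusing the medians computed from 2006 so that the test set does not influence the fill values.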
Build your model
See notebooks/HW4_regression_example or go to this link for an example.
You will have to add variables and see if they improve your fit. Make sure your variables make sense intuitively. Do EDA and read about the data to gain intuition.
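One way to check whether an added variable helps is to fit on the 2006 data and compare R² on the 2010 data before and after adding it. The sketch below uses NumPy's least-squares solver as a stand-in for your own linear regression module from last week; the function names fit_ols, predict, and r_squared are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for y ~ [1, X]; the intercept comes first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients to new predictors."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef


def r_squared(y, y_hat):
    """Fraction of the variance in y explained by the predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot
```

Here X would be the array of predictor columns from the cleaned 2006 DataFrame and y its income06 column. Score the fitted coefficients with r_squared on the matching 2010 arrays, since a variable that only improves the fit on the training year is not much of an improvement.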