Notes on your GSS models

Here are some general comments that applied to many people's homework.

The best presentations told a "story": they laid out the steps you took, the results you got, and why those results were good or bad.
If the direct inversion method fails on a data set as small as this one, there is often a problem with your data, e.g. you have linearly dependent (i.e. redundant) variables. The best approach is to figure out what is causing the failure and then fix it.
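One way to check for redundant variables is to compare the rank of the design matrix with its number of columns. The sketch below uses a small made-up matrix X (not the GSS data) whose third column is the sum of the first two.

```python
import numpy as np

# Hypothetical design matrix: column 3 = column 1 + column 2,
# so the columns are linearly dependent.
X = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
    [2.0, 1.0, 3.0],
])

rank = np.linalg.matrix_rank(X)
n_cols = X.shape[1]

# If the rank is smaller than the number of columns, X^T X is singular
# and direct inversion cannot be expected to work.
has_redundant_columns = rank < n_cols
print(rank, n_cols, has_redundant_columns)
```

If the check comes back True, look for the column (or group of columns) that can be written in terms of the others and drop or recode it.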
The pandas function get_dummies can be used to get indicators for categories, e.g. is_married.
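A minimal sketch of get_dummies, assuming a made-up column called marital (your column names and categories will differ):

```python
import pandas as pd

# Hypothetical data; "marital" is an illustrative column name.
df = pd.DataFrame({"marital": ["married", "single", "divorced", "married"]})

# One indicator column per category: is_divorced, is_married, is_single.
dummies = pd.get_dummies(df["marital"], prefix="is")

# With an intercept in the model, the full set of indicators is linearly
# dependent (they sum to 1 in every row). drop_first=True drops one of them.
reduced = pd.get_dummies(df["marital"], prefix="is", drop_first=True)

print(dummies.columns.tolist())
print(reduced.columns.tolist())
```

Note that keeping every indicator alongside a constant term is one common source of the redundant variables mentioned above.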
I didn't see anyone building new variables from (nonlinear) combinations of more than one old variable. That would have been nice.
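For instance, a product or a ratio of two existing columns is a nonlinear combination. The column names age and educ below are assumptions chosen for illustration, and the numbers are made up.

```python
import pandas as pd

# Hypothetical data; "age" and "educ" are illustrative column names.
df = pd.DataFrame({"age": [25, 40, 60], "educ": [12, 16, 10]})

# Interaction term: product of two old variables.
df["age_x_educ"] = df["age"] * df["educ"]

# Ratio: years of education per year of age.
df["educ_per_age"] = df["educ"] / df["age"]

print(df)
```

Whether such a variable helps is something to test by refitting the model, not something to assume.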
Many of the NaN values came from follow-up questions, which could have been used to build new variables. For example, if someone answers yes to "have you ever been a smoker", then they also get to answer the follow-up, "have you ever tried to quit smoking." Someone who never smoked is not asked the follow-up, so their answer shows up as NaN. This could be used to create two new variables, smoking_tried_quitting and smoking_never_tried_quitting. Or, you could figure that people who never tried to quit were more severe smokers, and therefore they get a 2, people who tried quitting get a 1, and people who never smoked get a 0. This way you create one single new variable with three levels.
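Here is a rough sketch of both encodings. The input column names ever_smoked and tried_quit and the yes/no coding are assumptions; the actual GSS columns are coded differently, so adapt accordingly.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the follow-up is NaN for people who never smoked.
df = pd.DataFrame({
    "ever_smoked": ["yes", "yes", "no", "yes"],
    "tried_quit": ["yes", "no", np.nan, "no"],
})

smoked = df["ever_smoked"] == "yes"

# Comparisons against NaN are False, so never-smokers get 0 in both columns.
df["smoking_tried_quitting"] = (smoked & (df["tried_quit"] == "yes")).astype(int)
df["smoking_never_tried_quitting"] = (smoked & (df["tried_quit"] == "no")).astype(int)

# Single three-level variable: 0 = never smoked, 1 = tried quitting,
# 2 = smoked and never tried to quit.
df["smoking_severity"] = (
    df["smoking_tried_quitting"] + 2 * df["smoking_never_tried_quitting"]
)

print(df)
```

A smoker who skipped the follow-up would get 0 under this sketch, the same as a never-smoker, so check for that case in your data before relying on it.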

22 March 2013