Here are some general comments that applied to many people's homework.

  • The best presentations told a "story." They told the steps you used, as well as the results and why the results were good or bad.
  • If the direct inversion method fails on a small data set such as this, there is often a problem with your data. For example, you may have linearly dependent (i.e. redundant) variables. The best approach is to figure out why the inversion fails and then fix the underlying issue.
  • The pandas function get_dummies can be used to get indicators for categories, e.g. is_married.
  • I didn't see anyone building new variables from (nonlinear) combinations of more than one old variable. That would have been nice.
  • Many of the NaN values came from follow-up questions, and these could have been used to build new variables. For example, if someone answers yes to "have you ever been a smoker", then they also get to answer the follow-up, "have you ever tried to quit smoking". This could be used to create two new variables, smoking_tried_quitting and smoking_never_tried_quitting. Or, you could figure that people who never tried to quit were more severe smokers, so they get a 2, people who tried quitting get a 1, and people who never smoked get a 0. This way you create a single new variable with three levels.
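
One way to check for the linearly dependent variables mentioned above is to compare the rank of the design matrix with its number of columns. The sketch below uses a made-up matrix X (not the homework data) whose third column is the sum of the first two; the variable names are only illustrative.

```python
import numpy as np

# Hypothetical design matrix: the third column equals the sum of the
# first two, so the columns are linearly dependent.
X = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
    [2.0, 1.0, 3.0],
])

rank = np.linalg.matrix_rank(X)
n_cols = X.shape[1]

# If the rank is smaller than the number of columns, at least one
# variable is redundant and X^T X cannot be inverted reliably.
is_rank_deficient = rank < n_cols
print(rank, n_cols, is_rank_deficient)
```

If the check reports a rank deficiency, dropping or recombining the redundant columns is usually a better fix than switching to a solver that hides the problem.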
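
As a rough illustration of get_dummies and of building a variable from a nonlinear combination of two old ones, here is a minimal sketch. The columns marital_status, age, and bmi are invented for the example, and the product age_x_bmi is just one possible combination.

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "marital_status": ["married", "single", "married", "divorced"],
    "age": [30, 40, 50, 60],
    "bmi": [22.0, 27.5, 31.0, 25.0],
})

# One indicator column per category. drop_first=True omits one level so
# the indicators do not add up to a constant column, which would
# introduce a linear dependence with an intercept term.
dummies = pd.get_dummies(df["marital_status"], prefix="is", drop_first=True)
df = pd.concat([df, dummies], axis=1)

# A nonlinear combination of two old variables: their product.
df["age_x_bmi"] = df["age"] * df["bmi"]
print(df)
```

Keeping all indicator levels is fine for models without an intercept; with an intercept, dropping one level avoids the redundancy problem described earlier in the list.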
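
The smoking example in the last bullet could be coded along the following lines. This is only a sketch: the column names ever_smoked and tried_to_quit and the "yes"/"no" coding are assumptions, and the real survey fields will differ.

```python
import numpy as np
import pandas as pd

# Hypothetical survey columns. NaN in tried_to_quit means the follow-up
# question was not asked because the respondent never smoked.
df = pd.DataFrame({
    "ever_smoked": ["yes", "no", "yes", "no"],
    "tried_to_quit": ["yes", np.nan, "no", np.nan],
})

smoked = df["ever_smoked"] == "yes"
tried = df["tried_to_quit"] == "yes"  # NaN compares as False

# Option 1: two indicator variables.
df["smoking_tried_quitting"] = (smoked & tried).astype(int)
df["smoking_never_tried_quitting"] = (smoked & ~tried).astype(int)

# Option 2: one variable with three levels
# (2 = smoked and never tried to quit, 1 = tried to quit, 0 = never smoked).
# Note: a smoker whose follow-up answer is missing would also get a 2 here.
df["smoking_severity"] = np.select(
    [smoked & ~tried, smoked & tried], [2, 1], default=0
)
print(df)
```

The single three-level variable imposes an ordering (and equal spacing) on the groups, while the two indicators let the model fit each group separately.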



Published

22 March 2013