Homework 08: Stackoverflow

Due: May 13, as an in-class presentation during the final exam slot (7:10-11pm). A pdf of your slides must also be emailed to applied.data.science@gmail.com before 7:00pm.

You will use logistic regression to predict whether a Stackoverflow question will be closed. This assignment is similar to this Kaggle competition.


Guidelines


General

  • You must use logistic regression to estimate the probability that a Stackoverflow question will be closed.
  • For modeling, you can use any existing Python (not R) packages such as statsmodels or sklearn. You can use any Unix utility.
  • You must attempt to use both the numeric data (e.g. ReputationAtPostCreation) and the text fields Title and BodyMarkdown; see the first sketch after this list.
  • For the test data, you must report:
    • The ROC AUC (Area Under the Curve)
    • How well your predicted average closed rate matches the actual closed rate for users in the bottom/middle/top third by reputation
  • You must build a classifier that uses your logistic regression model. Pick a cutoff that makes sense for this problem and explain why you chose it.
  • You must also build a classifier that uses the exact same variables as your logistic classifier, but uses some other technique such as a random forest, an SVM, or nearest neighbors. You must compare this classifier with the logistic classifier and explain which worked better and why; the second sketch after this list touches on both classifiers.
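A minimal sketch of one way to satisfy the modeling bullets, combining numeric and text features into a single sparse design matrix for sklearn's logistic regression. The column names and label values follow the Kaggle data description, and the file path follows the repo layout described below; treat all of them as assumptions to verify against your own data.

    import pandas as pd
    import scipy.sparse as sp
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    train = pd.read_csv('data/raw/train.csv')  # assumed location of the training file

    # Binary target: 1 if the question was closed (any status other than 'open').
    y = (train['OpenStatus'] != 'open').astype(int)

    # Numeric features; fill missing values with 0 for simplicity.
    numeric_cols = ['ReputationAtPostCreation', 'OwnerUndeletedAnswerCountAtPostTime']
    X_num = sp.csr_matrix(train[numeric_cols].fillna(0).values.astype(float))

    # Text features: separate TF-IDF vectorizers for Title and BodyMarkdown,
    # capped to keep the matrices manageable.
    vec_title = TfidfVectorizer(max_features=10000)
    vec_body = TfidfVectorizer(max_features=50000)
    X_title = vec_title.fit_transform(train['Title'].fillna(''))
    X_body = vec_body.fit_transform(train['BodyMarkdown'].fillna(''))

    # Stack everything into one sparse matrix and fit.
    X = sp.hstack([X_num, X_title, X_body]).tocsr()
    model = LogisticRegression()
    model.fit(X, y)
    probs = model.predict_proba(X)[:, 1]  # estimated P(closed) per question

One design note: raw reputation counts live on a very different scale than TF-IDF weights, so consider log-transforming or standardizing the numeric columns before stacking.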
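And a sketch of the required evaluation, assuming X_test, y_test, and a reputation Series rep_test (with a default integer index) have been built from your API-collected test data using the same vectorizers (transform, not fit_transform):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    probs = model.predict_proba(X_test)[:, 1]
    print('logistic ROC AUC: %.3f' % roc_auc_score(y_test, probs))

    # Predicted vs. actual closed rate by reputation third. Ranking first
    # breaks the many ties at low reputation so qcut gets three clean bins.
    thirds = pd.qcut(rep_test.rank(method='first'), 3,
                     labels=['bottom', 'middle', 'top'])
    report = pd.DataFrame({'predicted': probs, 'actual': y_test, 'third': thirds})
    print(report.groupby('third')[['predicted', 'actual']].mean())

    # Classifier: closed questions are rare, so the default 0.5 cutoff would
    # almost never fire; a lower cutoff trades precision for recall. The value
    # here is purely illustrative; you must justify your own choice.
    cutoff = 0.2
    predictions = (probs >= cutoff).astype(int)

    # Same variables, different technique: a random forest for comparison.
    forest = RandomForestClassifier(n_estimators=100)
    forest.fit(X, y)
    forest_probs = forest.predict_proba(X_test)[:, 1]
    print('forest ROC AUC: %.3f' % roc_auc_score(y_test, forest_probs))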

Presentation

  • 20-minute presentation, 10 minutes of questions. At least two group members must speak.
  • Your slides must be in pdf format
  • Email your slides to applied.data.science@gmail.com before 7pm on the day of the final. Put your group name in the title of the pdf.
  • Your presentation should describe why you chose to keep/create/throw-away variables
  • Your presentation should describe how you evaluated your model, the results of the evaluation, and why this evaluation was or was not sufficient
  • Your presentation should not describe your data-munging. EDA should be described only insofar as it relates to the above tasks.

Data

  • Your training data should be the file named train taken from this website.
  • Your test data should be current data that you obtain using the Stackoverflow API; a sketch follows this list. Note that you can get lots and lots of variables from the API. Only get the ones that are also in the training set.
    • There are no hard requirements on the amount of test data you must obtain. You must explain why you used the number of samples that you did, and why it makes the "prediction vs. reality" test statistically significant.
    • You are permitted to use your test set as a cross-validation set. This is usually not good practice, but (i) there is no way to stop you from doing it, and (ii) it will give you experience with out-of-time errors.
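A sketch of pulling recent questions with the requests library, against version 2.3 of the Stack Exchange API (the endpoint, parameters, and field names below are from the public API docs; verify them, since the API evolves):

    import time
    import requests

    rows = []
    for page in range(1, 11):  # ~1,000 questions; justify your own sample size
        resp = requests.get(
            'https://api.stackexchange.com/2.3/questions',
            params={'site': 'stackoverflow', 'page': page, 'pagesize': 100,
                    'order': 'desc', 'sort': 'creation',
                    'filter': 'withbody'})  # built-in filter that adds the body
        data = resp.json()
        for q in data.get('items', []):
            rows.append({
                'Title': q.get('title', ''),
                'BodyMarkdown': q.get('body', ''),
                'ReputationAtPostCreation': q.get('owner', {}).get('reputation', 0),
                'closed': int('closed_date' in q)})
        if not data.get('has_more'):
            break
        time.sleep(1)  # stay under the API's throttling limits

Two caveats worth flagging in your presentation: the withbody filter returns the rendered HTML body rather than the original markdown (a custom filter can request body_markdown instead), and owner reputation from the API is the current value, not the value at post creation, so both are only approximations of the training columns. For the sample-size argument, the standard error of an observed closed rate p over n questions is about sqrt(p(1-p)/n); pick n large enough that this error is small relative to the differences you want to detect between reputation thirds.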

About the starting repo


This is meant to give you a decent starting point. You can modify it as you wish.

Dependencies

You will probably use code from previous homeworks, especially cut.py and subsample.py.

Directories

data

  • Don't version data.
  • To avoid excess sharing of processed data (which changes often), it is preferable to share raw data plus the scripts and notebooks that transform raw into processed (see the example after this list).
  • Contents of any raw folder should never be modified or deleted. This way, your script will create the same output as everyone else's script.
  • Shell scripts and notebooks should assume the existence of the local folders data/raw and data/processed. They already exist in the repo.
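For example, a transform script under this layout reads from data/raw and writes to data/processed, never the reverse (the file names here are assumptions):

    import pandas as pd

    # Read the untouched raw file, write a derived file; never modify data/raw.
    train = pd.read_csv('data/raw/train.csv')
    train['TitleLength'] = train['Title'].fillna('').str.len()  # example derived feature
    train.to_csv('data/processed/train_with_features.csv', index=False)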

Notebooks

For IPython notebooks. Put your name in the notebook filename to avoid name collisions within your group.

src

Source code.

tests

Unit and integration tests. Add these if you want.

scripts

Shell scripts.



Published

29 April 2013