Converting datasets from STATA

In practical data science, it is important to know how to convert datasets into the right format and structure. Being able to import data from language- or tool-specific formats such as Stata's .dta files is especially useful, since a lot of social science data comes in that format. Fortunately, the StatsModels library in Python already supports this.

Here's how you do it:

In the terminal, run the command: pip install -U statsmodels. This should upgrade you to version 0.5.0 or later. If you already have the latest version of statsmodels, you can skip this step.
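If you are not sure which version is installed, a quick check from Python can tell you whether the upgrade is needed. This is only a minimal sketch: the helper name installed_version is made up for illustration, and it assumes Python 3.8 or later, where importlib.metadata is part of the standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent.

    Hypothetical helper for illustration; not part of statsmodels.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if statsmodels is installed, otherwise None.
print(installed_version("statsmodels"))
```

If this prints None, or a version below 0.5.0, run the pip command above.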

In IPython:

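import os.path
import statsmodels.iolib.foreign as smio
from pandas import DataFrame
# "~" is not expanded automatically when the file is opened, so expand it first
path = os.path.expanduser('~/path/to/stata/data.dta')
arr = smio.genfromdta(path)  # read the .dta file into a NumPy record array
frame = DataFrame.from_records(arr)  # convert the record array to a DataFrame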

The genfromdta function in statsmodels.iolib.foreign reads a .dta file into a NumPy record array (a NumPy array type whose columns are named fields). The last line above shows how to convert the record array into a pandas DataFrame so the data can live happily ever after.
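To see what the record-array-to-DataFrame step does without needing a .dta file, here is a small sketch that builds a structured NumPy array by hand and passes it to DataFrame.from_records. The field names and values (id, income, region) are invented for illustration and only stand in for whatever genfromdta would return for a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the array returned by genfromdta:
# each row is a record, and each field has a name and a dtype.
arr = np.array(
    [(1, 23.5, "north"), (2, 31.0, "south")],
    dtype=[("id", "i4"), ("income", "f8"), ("region", "U5")],
)

# from_records uses the field names as the DataFrame's column names.
frame = pd.DataFrame.from_records(arr)

print(frame.columns.tolist())
print(frame.shape)
```

Each named field becomes a column, so the variable names stored in the Stata file should carry over to the DataFrame in the same way.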


Published

15 February 2013
