Knowing how to convert datasets into the right format and structure is an important part of practical data science. Being able to import data from language- or tool-specific formats such as Stata's is very useful, especially for social science data. Fortunately, the StatsModels library for Python already supports this.

Here's how you do it:

  1. In the terminal, run: pip install -U statsmodels. This should upgrade you to version 0.5.0 or later. If you already have the latest version of statsmodels, you can skip this step.

  2. In IPython:

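    import os
    import statsmodels.iolib.foreign as smio
    from pandas import DataFrame
    # Python does not expand "~" when opening a file, so resolve it explicitly
    path = os.path.expanduser('~/path/to/stata/data.dta')
    arr = smio.genfromdta(path)          # NumPy record array
    frame = DataFrame.from_records(arr)  # pandas DataFrame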

The genfromdta function in statsmodels.iolib.foreign reads a .dta file into a NumPy record array (an array whose columns are named fields, each with its own type). The last line above shows how to convert the record array into a pandas DataFrame so the data can live happily ever after.
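
To see what that last conversion step does without needing a .dta file on hand, here is a minimal sketch. The array below is a hand-built, hypothetical stand-in for what genfromdta returns; the field names (id, income, region) and the values are made up for illustration and do not come from any real dataset.

```python
import numpy as np
from pandas import DataFrame

# Hypothetical stand-in for the array genfromdta would return:
# a NumPy array with named fields, each with its own dtype,
# much like the variables in a Stata dataset.
arr = np.array(
    [(1, 23.5, "north"), (2, 41.0, "south"), (3, 35.25, "east")],
    dtype=[("id", "i4"), ("income", "f8"), ("region", "U5")],
)

# from_records turns each named field into a DataFrame column.
frame = DataFrame.from_records(arr)

print(frame.columns.tolist())  # ['id', 'income', 'region']
print(frame)
```

Each named field becomes a column, so once the data is in a DataFrame the usual pandas tools (selection, grouping, merging) apply directly.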



15 February 2013

