LowClass Python: Style Guide for Data Scientists
This style guide is meant for advanced-beginner to advanced-intermediate developers of scientific code in Python; in other words, non-professional programmers such as data scientists. The term LowClass Python hints at reducing the use of object-oriented design. It is an attempt to be as witty as Tom Anderson when he coined the term C++-- (C plus-plus, minus-minus). C++ is a very rich language with many features, which allows a variety of abstract design patterns that can result in confusing and hard-to-maintain code. In his essay, Anderson encourages limiting your use of the C++ language to a smaller set of features.
Similar to C++, Python allows a variety of coding styles and abstractions, and some limits can help. The right limits depend on who is writing the code and for whom. Our perspective is that of a data scientist who is writing a model to be used by others as part of a larger system. An example would be a feature extractor that reads streaming text data (through stdin/stdout) and counts the number of times different words appear. The data scientist should be able to prototype the model using a smaller collection of text files (stored on their laptop). Once the model is working, it should be easy for the scientist to:
- Work with other data scientists to integrate the feature extractor into a much larger prototype. Here the challenge is fitting together and maintaining code written by many different people.
- Work with engineers to productionalize this model. Here the challenge is that the code must be scaled and maintained by people who have not worked on and may not understand the underlying algorithm.
Disclaimer: We're not saying the patterns found below are the only way to write LowClass Python. We have, however, found them effective. Comments and discussion are welcome!
Basics
Separation of source code from scripts
It is ok to write quick and dirty scripts to get some job done (e.g. replacing words in a file). These should be kept in a separate directory from the source code, which needs to follow the other conventions in this guide. Keep in mind that as scripts grow, you will be expected to make them adhere to this guide.
Initial reading
- Read and follow Code like a pythonista.
- Find a tutorial on Object Oriented Design (OOD) in Python and make sure you understand how to create a class and understand attributes and methods. Make sure you've worked through one example of class inheritance.
Some rules
The standard basic style guide is PEP8. Read this from the beginning up to the Version Bookkeeping section. Here we present some additions, clarifications, and modifications. The reason for consistency in style is that it makes it easier for others to read and grep your code.
- Use 4 spaces per indentation level. This is absolutely essential since code that uses different indentation cannot be used together! Tabs cannot be used. This does not mean you have to hit the space bar four times. Every decent text editor has a setting that allows you to map the tab key to spaces.
- Limit lines to some pre-determined number of characters. 79 is the accepted standard and to contribute to many projects you will need to adhere to this.
- Try to limit functions to 50 lines (not including doc strings). If a function is longer than 50 lines you have probably made it do too many things. Something close to 20 lines should be your average.
- Don't use abbreviations unless the word will be used many, many times. If you're worried about having to type long names, then learn to use tab complete. The only short names that are acceptable are common math names such as x, y, X, Y, t, and counters such as i, j, k (used ONLY for the normal purposes). Note that standard Python convention is to never use single letters as variable names, and to only use lowercase letters and underscores, so we're breaking from Python programming convention here in the name of conforming to math/stats/ML convention.
- Try to be consistent with names across multiple functions. In other words, don't use the full name in one location and an abbreviation elsewhere (see the sketch below).
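For instance, a minimal sketch of consistent naming (the functions and names here are our own illustration):

```python
def count_words(text):
    """Return a dict mapping each word in text to its count."""
    word_counts = {}
    for word in text.split():
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts


def print_word_counts(word_counts):
    # "word_counts" is spelled out the same way here as in count_words;
    # avoid switching to e.g. "wc" or "counts" in some functions.
    for word, count in sorted(word_counts.items()):
        print('%s: %d' % (word, count))
```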
Simplicity
Simple is better!
While designing and writing code, the goal of simplicity should be foremost in your mind. Your code must be understood, used, and maintained by many different people, over long time periods. Adding additional complexity to your code forces these people to do unnecessary work. This is rude and causes mistakes. This principle is sometimes called KISS.
- A very simple and useful pattern is to write functions that take well-defined inputs and produce well-defined outputs. You tie these together with code that reads like a book. For example:
```python
def extract_feature_counts(data_string):
    """
    Some docstring here...
    """
    cleaned_data_string = _clean(data_string)
    word_counts = _count_words(cleaned_data_string)
    return word_counts


def _clean(data_string):
    # Some code here.
    return cleaned_data_string


def _count_words(data_string):
    # Some code here.
    return word_counts
```
Notice that:

- The only public function is `extract_feature_counts` (the other functions start with an underscore). This means that a user should look first at the function `extract_feature_counts`.
- Upon looking at `extract_feature_counts`, the user can easily see that the extraction consists of two steps, cleaning and counting words.
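To make this concrete, here is one hypothetical way the two helpers could be filled in (the cleaning and counting logic is our own illustration, not a prescribed implementation):

```python
import re
from collections import Counter


def extract_feature_counts(data_string):
    """
    Count the number of times each word appears in data_string.
    """
    cleaned_data_string = _clean(data_string)
    word_counts = _count_words(cleaned_data_string)
    return word_counts


def _clean(data_string):
    # Lowercase, then strip everything except letters and whitespace.
    return re.sub(r'[^a-z\s]', '', data_string.lower())


def _count_words(data_string):
    # Counter is a dict subclass mapping each word to its count.
    return Counter(data_string.split())
```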
Beware of feature creep
Feature creep is a phenomenon where additional functionality added to your code becomes excessive. A feature could be an additional method in a class, or an optional keyword argument in a function. Instead of helping, this practice complicates matters for current and future developers. Here are some guidelines.
- Only create features that are needed now, not at some potential time in the future. This is known as the principle "You aren't gonna need it" (YAGNI).
- Sometimes features are added with the intention of making calling the code easier. Before doing this, stop and consider the fact that this feature will have to be maintained by you and others for a long time over many different code changes. For example:
```python
class MyModel(object):
    """
    Implements a classification model.
    """
    def trim_variables(self):
        trimmed_variable_names = self.variables
        # Some code to eliminate variables from trimmed_variable_names.
        self.variables = trimmed_variable_names

    def fit(self, X, Y):
        # Some code to fit coefficients to self.variables.
        pass

    def trim_and_fit(self, X, Y):
        self.trim_variables()
        self.fit(X, Y)
```
Is the method `trim_and_fit` really needed? Sure, it will save you from making two separate calls (one to `trim_variables` and one to `fit`), but it will have to be maintained by someone. If the API to either `trim_variables` or `fit` changes, then so will the API to `trim_and_fit`. If the preferred steps for fitting and trimming change, then this method may become obsolete. A user of this class needs to understand the functionality of all three methods now. Note that we are not saying this convenience method should never be written. It's just that the convenience must be weighed against the work that you are asking others to do when they attempt to use and maintain your code.
- Be especially careful with features that are part of a large system (e.g. options in a function that is part of a data processing pipeline). Every time someone ties one program to another, they will have to consider what state every feature will be in. This causes a headache.
Explicit is better than implicit
- Give variables, methods, and attributes descriptive names. For example:
```python
age_factor_list = []
for person in person_list:
    age_years = todays_date - person.birthdate
    age_days = age_years * 365
    age_factor = np.sqrt(age_days) * age_multiplier
    age_factor_list.append(age_factor)
```
Notice that we used the full name of every variable. Writing out `age_factor` makes it more memorable. We can grep the code and find all places where `age_factor` appears. Notice we also use `age_factor_list` rather than `age_factors`: `age_factor` and `age_factors` are barely distinguishable, and if both are variable names they will often be inadvertently swapped.
Being explicit while setting attributes
Consider the following fragment.
```python
class MyModel(object):

    def __init__(self, training_data):
        self.X, self.Y = self._get_XY(training_data)
        # Some more code here.

    def _get_XY(self, training_data):
        # Some code here.
        return X, Y

    def fit(self):
        # Some code here...this code ONLY fits the model and sets
        # attributes DIRECTLY related to fitting.
        self.coefficients = coefficients
        self.model_has_been_fit = True
```
The reason for returning X and Y (rather than setting them inside a method called e.g. `self._set_XY`) is twofold:

- It pushes attribute setting up to the highest level (in `__init__`). This allows you to read `__init__` and see as much useful information as possible (in particular, that the attribute names `X` and `Y` will be used).
- It encourages a functional style of programming, minimizing side effects.
In the `fit` method, notice that:

- We ONLY fit the model and set attributes directly related to fitting.
- Since attribute setting is the "output" of this method, it is put at the end (just as a return statement in a function would be put at the end).
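For contrast, here is the discouraged alternative, where attribute setting is buried inside a helper (hypothetical sketch):

```python
class MyModel(object):

    def __init__(self, training_data):
        # Discouraged: a reader of __init__ cannot see that the attributes
        # X and Y are created, without digging into _set_XY.
        self._set_XY(training_data)

    def _set_XY(self, training_data):
        # Attribute setting as a side effect, far from __init__.
        self.X = training_data[:, :-1]
        self.Y = training_data[:, -1]
```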
Limit use of optional (keyword) arguments
- Options should be optional. The code should work fine without them.
- Options can lead to feature creep.
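For example, a sketch of an option that is truly optional (the function is our own illustration):

```python
def weighted_mean(values, weights=None):
    """
    Return the mean of values, optionally weighted by weights.

    The code works fine if weights is never passed.
    """
    if weights is None:
        return float(sum(values)) / len(values)
    total = sum(v * w for v, w in zip(values, weights))
    return total / float(sum(weights))
```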
Separate interface from implementation
Interface is:

- I/O
  - Opening/closing of files
  - Reading/writing of files
- APIs
  - The functions that a module/class user calls to "do stuff."

Implementation is stuff like math and modification of data. For example:
```python
import csv
from StringIO import StringIO

import numpy as np


def subsample(infile, outfile, subsample_rate=0.01, delimiter=',', seed=None):
    """
    Subsample infile and write to outfile.

    Parameters
    ----------
    infile : csv file open in read mode, or String
        File should be delimited text and have a header
    outfile : File open in write mode, or String
        Output is written here
    subsample_rate : Real number in the interval [0, 1]
        Keep this fraction of rows/key-values
    delimiter : Single character string
        The delimiter of infile.  Also used for outfile.
    seed : Integer
        If given, use this to seed the random number generator.
    """
    # Seed the random number generator for deterministic results.
    if seed:
        np.random.seed(seed)

    # Open files (if necessary).
    infile, infile_was_path = _openfile_wrap(infile, 'r')
    outfile, outfile_was_path = _openfile_wrap(outfile, 'w')

    # Get the csv reader and writer.  Use these to read/write the files.
    reader = csv.DictReader(infile, delimiter=delimiter)
    writer = csv.DictWriter(
        outfile, delimiter=delimiter, fieldnames=reader.fieldnames)
    writer.writeheader()

    # Do the actual subsampling.
    _read_select_write(reader, writer.writerow, subsample_rate)

    # Close files (if necessary).
    if infile_was_path:
        infile.close()
    if outfile_was_path:
        outfile.close()


def _read_select_write(reader, writemethod, subsample_rate):
    """
    Iterate through reader and use writemethod to write a selection of rows.

    Parameters
    ----------
    reader : Iterable
        An iterable over the "input file".
    writemethod : Function
        writemethod(reader.next()) should write the contents of reader.next()
    subsample_rate : Real number in [0, 1]
        This fraction of iterates will be written.
    """
    assert 0 <= subsample_rate <= 1
    for row in reader:
        if subsample_rate > np.random.rand():
            writemethod(row)


def _openfile_wrap(filename, mode):
    """
    If filename is a string, returns an opened version of filename.
    If filename is a file buffer, then passthrough.

    Parameters
    ----------
    filename : String or file buffer
    mode : String
        mode to open the file in

    Returns
    -------
    opened_file : Opened file buffer
    was_path : Boolean
        If True, then filename was a string (and thus was opened here, and so
        the calling function should close it).
    """
    if isinstance(filename, str):
        was_path = True
        opened_file = open(filename, mode)
    elif isinstance(filename, file) or isinstance(filename, StringIO):
        was_path = False
        opened_file = filename
    else:
        raise Exception(
            "Did not know how to handle %s, type = %s"
            % (filename, type(filename)))

    return opened_file, was_path
```
Above, the interface is `subsample`. It handles the opening/closing of the infile and outfile. The infile should be a csv file. It then uses Python's csv module to form an iterator over infile (the reader), as well as a method (writemethod) for writing the contents returned by the reader. This reader and writemethod are passed to the implementation, `_read_select_write`. This separation achieves the following:

- If the interface changed (e.g. we decide to work with files without headers, or read/write to/from stdin/stdout or a database), we would only change the function `subsample`. We would not have to change the implementation.
- The implementation might change if e.g. we optimize it for performance or have to use it in an environment where numpy is not available.
You can view a version of subsample that includes a command line interface here.
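As a usage sketch (assuming the code above is saved as subsample.py; the file names are hypothetical):

```python
from StringIO import StringIO

import subsample

# Paths work: the interface opens and closes the files for you.
subsample.subsample('data.csv', 'sample.csv', subsample_rate=0.1, seed=1234)

# Open buffers work too, which is what makes StringIO-based testing easy.
infile = StringIO('name,age\nian,22\ndaniel,33\nchang,44\n')
outfile = StringIO()
subsample.subsample(infile, outfile, subsample_rate=0.5, seed=1234)
print(outfile.getvalue())
```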
Interface design
- Minimize entry points.
- Try to make most methods/functions of a class/module non-public (the name starts with an underscore). These non-public methods/functions should not be used by code outside of this module. This makes it easier for a new user to understand. It also allows you to change the non-public methods/functions without messing up other modules that may depend on this module.
- Minimize the number of ways that attributes can be set.
- Make sure your interface accepts open files (created with e.g. `open(filename, 'r')`). This allows them to be tested using `StringIO`. See the example below.
- It is ok to accept open files or paths. In this case, the interface should (using a wrapper function) determine whether it was passed an open file or a path and convert to an open file. For example:
```python
def openfile_wrap(filename, mode):
    """
    If filename is a string, returns an opened version of filename.
    If filename is a file buffer, then passthrough.

    Parameters
    ----------
    filename : String or file buffer
    mode : String
        mode to open the file in

    Returns
    -------
    opened_file : Opened file buffer
    was_path : Boolean
        If True, then filename was a string (and thus was opened here, and so
        you better remember to close it elsewhere)

    Examples
    --------
    >>> infile, was_path = openfile_wrap(infilename, 'r')
    >>> myfunction(infile, ...)
    >>> if was_path:
    ...     infile.close()
    """
    if isinstance(filename, str):
        was_path = True
        opened_file = open(filename, mode)
    elif isinstance(filename, file) or isinstance(filename, StringIO):
        was_path = False
        opened_file = filename
    else:
        raise Exception("Could not work with %s" % filename)

    return opened_file, was_path
```
Be careful with objects
Object Oriented Design (OOD) is useful in many cases and is a fundamental part of Python and most medium to large projects built in Python. For those reasons, it must be learned and should often be used by data scientists. However, the additional abstraction makes it easy to get into trouble. Our view is that a non-professional programmer should take care when working with objects.
- The main purpose of OOD should be to bundle together data with methods that use, make available, and (with caution) modify the data. For example:
```python
class MyModel(object):

    def __init__(self, X, Y, ...):
        self.X, self.Y = X, Y
        self.model_has_been_fit = False
        # Some code.

    def fit(self):
        # Some code that uses self.X, self.Y.
        self.coefficients = coefficients
        self.model_has_been_fit = True

    def predict(self):
        # Some code to predict Yhat given self.X.
        return Yhat

    def plot(self):
        # Some code.
        pass

    def get_data_stats(self):
        # Some code.
        return statistical_summary_of_data
```
- The `fit` method is an example of a method that uses the data. It should (and does) set the coefficients and an attribute to tell you that the model has been fit. A common mishap is to also transform the X, Y data within `fit`. For example, we could rescale self.X and self.Y. This would be a side effect and should be avoided.
- The `plot` method makes the data available (as a plot). This is a convenience method since it probably just wraps some matplotlib code. So long as this method has no side effects, it is fine to include. The `get_data_stats` method is similar.
There is another issue with `MyModel` as written: consider the fact that we have bundled together the training data with the object, and only through this training data do we set `self.coefficients`. Only using these coefficients can we predict (using `self.predict()`). What if the user wanted to store the (lightweight) coefficients in a file, and then use them at some point in the future to make predictions on new data? What if a user wanted to use a subset of the training data to fit (e.g. in cross validation)? Neither of these could be done! This sort of difficulty can be seen in the statsmodels package, which was meant to be used to analyze one non-changing piece of data, rather than predict on multiple new pieces of data.

To avoid these problems, we suggest the following revision, where we no longer bundle the main data with the model. This is more in line with the scikit-learn philosophy.
```python
class MyImprovedModel(object):

    def __init__(self, ...):
        # Some code.  Notice that we don't initialize the model with any data.
        self.model_has_been_fit = False

    def fit(self, X, Y):
        # Some code.
        self.coefficients = coefficients
        self.model_has_been_fit = True

    def predict(self, X, coefficients=None):
        # Some code to predict Yhat given X.
        # If coefficients is None, use self.coefficients.
        return Yhat
```
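A hypothetical usage sketch of what this revision makes possible (assuming `MyImprovedModel` has been filled in; the data and file names are our own illustration):

```python
import numpy as np

X = np.random.rand(100, 3)
Y = np.random.rand(100)
X_new = np.random.rand(10, 3)

model = MyImprovedModel()

# Fit on a subset of the training data (e.g. one fold of cross validation).
fold = np.arange(len(X)) % 2 == 0
model.fit(X[fold], Y[fold])

# Store the lightweight coefficients for later...
np.savetxt('coefficients.txt', model.coefficients)

# ...and, much later, predict on new data without refitting.
coefficients = np.loadtxt('coefficients.txt')
Yhat = model.predict(X_new, coefficients=coefficients)
```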
Side effects
A side effect in programming occurs when a function/method is called and unintended data or state is modified. This can produce difficult to detect bugs. For example,
- The attributes of an object determine its state and hence its behavior. If an attribute is modified, then the behavior of the object changes. For example, the `predict` method of `MyModel` depends on `self.coefficients`. If `self.coefficients` is modified, then `self.predict()` has, in effect, changed. An alternative to storing `self.coefficients` would be to return coefficients and make the user store them. This may be more or less complicated, but please be aware of the trade-offs.
- You can cut down on side effects by giving methods very explicit names and making sure the functionality coincides with the name. For example, the method `get_data_stats` returns (i.e. gets) some statistics related to the data and nothing more. If someone wanted to also modify an attribute `self.data_stats`, they would be "forced" to do this in a separate call (e.g. `self.data_stats = self._get_data_stats()`).
- Try googling side effects in object oriented design and you can enjoy reading some other rants about side effects.
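A minimal sketch of the kind of hidden side effect to watch for (hypothetical code):

```python
import numpy as np


class LeakyModel(object):
    """Illustrates a hidden side effect.  Do NOT imitate fit()."""

    def __init__(self, X):
        self.X = np.asarray(X, dtype=float)

    def fit(self):
        # Hidden side effect: fit() silently rescales self.X.  Calling
        # fit() twice gives different answers, and every other method
        # that reads self.X now sees modified data.
        self.X = (self.X - self.X.mean()) / self.X.std()
        self.coefficients = self.X.mean()

    def fit_without_side_effects(self):
        # Better: rescale a local copy and leave self.X alone.
        X_scaled = (self.X - self.X.mean()) / self.X.std()
        self.coefficients = X_scaled.mean()
```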
Multiple inheritance
Multiple inheritance is a feature of some object-oriented computer programming languages in which a class can inherit characteristics and features from more than one superclass. This results in significantly more complicated code and should not be used in LowClass Python.
Some more guidelines
- Push attribute modification to the highest level possible. Complex math should be in functions that return a value (that can then be used to set an attribute). This allows developers to see when and how an attribute is being modified without digging into the math.
- Since attributes determine state and they can end up being modified in unexpected ways, please limit the number of attributes you create.
- Don't write uber methods. An uber method is a method that does lots of things at once. For example, you could have a method that initializes a few attributes, modifies self.X and self.Y, and returns Z. This may make it easy to perform some common task with one simple method call. However, (i) if one of these tasks needs to be modified, you may destroy the ability to perform the other tasks, (ii) it is difficult to understand the code in these methods, and (iii) good luck writing unit tests! See the sketch below.
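For example, a hypothetical uber method and one way to break it apart:

```python
class UberExample(object):

    def uber_process(self):
        # Does three unrelated things at once: sets an attribute, mutates
        # self.X, and returns a value.  Hard to modify, hard to test.
        self.n_rows = len(self.X)
        self.X = self.X / self.X.max()
        return self.X.sum()


class BetterExample(object):

    def process(self):
        # Small helper functions return values; attribute setting stays
        # here at the top level, where it is easy to see.
        self.n_rows = len(self.X)
        X_normalized = _normalize(self.X)
        return _total(X_normalized)


def _normalize(X):
    return X / X.max()


def _total(X):
    return X.sum()
```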
Miscellaneous
Here we go over some miscellaneous odds and ends that are consistent with LowClass Python.
CSV files
Any text file delimited by a character (e.g. comma, pipe, tab, etc...) is known as a csv file. For example, let `data.csv` be:

```
name|age
ian|22
daniel|33
chang|44
```
- A convenient object for reading/writing csv files without headers is the `csv.reader`. For example, consider the following code that iterates through a file, modifying each row:
```python
import csv

reader = csv.reader(infile, delimiter='|')
writer = csv.writer(outfile, delimiter='|')

for row in reader:
    newrow = _modify_row(row)
    writer.writerow(newrow)
```
Each `row` is a list of strings. So the first row would be `['name', 'age']`, the second would be `['ian', '22']`, and so on.
- If your csv file has a header, the preferred object is the `csv.DictReader`.
```python
import csv

reader = csv.DictReader(infile, delimiter='|')
writer = csv.DictWriter(outfile, delimiter='|', fieldnames=reader.fieldnames)
writer.writeheader()
# and so on...
```
With a `DictReader`, each row is a dictionary with keys given by the header and values given by the row in the file. So the first data row would be `{'name': 'ian', 'age': '22'}`, and so on.

This enforces column names. As your code grows, you will add/subtract columns. The `DictReader` allows you to continue using old code even as the number of columns changes.
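A self-contained illustration (StringIO stands in for real files; incrementing the age column is our own example):

```python
import csv
from StringIO import StringIO

infile = StringIO('name|age\nian|22\ndaniel|33\nchang|44\n')
outfile = StringIO()

reader = csv.DictReader(infile, delimiter='|')
writer = csv.DictWriter(outfile, delimiter='|', fieldnames=reader.fieldnames)
writer.writeheader()

for row in reader:
    # Columns are accessed by name, so this keeps working if columns are
    # added to or reordered in the input file.
    row['age'] = str(int(row['age']) + 1)
    writer.writerow(row)

print(outfile.getvalue())
```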
Tests
Unit tests test one small thing.
- They should be built using Python's `unittest` module or the `nose` package.
- If a unit test relies on some outside data file, it is probably too big.
- Always test the implementation with unit tests. Since the implementation should be separated from I/O decisions, it should work with file buffers. This can look like:
```python
from StringIO import StringIO
import unittest

import subsample  # The module under test (defined above).


class TestSubsample(unittest.TestCase):
    """
    Tests the subsampler.
    """
    def setUp(self):
        self.outfile = StringIO()
        self.commafile = StringIO(
            'name,age,weight\nian,1,11\ndaniel,2,22\nchang,3,33')
        self.seed = 1234

    def test_r0p0_comma(self):
        subsample.subsample(
            self.commafile, self.outfile, subsample_rate=0.0, seed=self.seed)
        result = self.outfile.getvalue()
        benchmark = 'name,age,weight\r\n'
        self.assertEqual(result, benchmark)

    def tearDown(self):
        self.outfile.close()
```
Integration tests test functionality of the entire module or pipeline at once.
- The interface may be tested as part of integration tests (that can use outside files).
- Integration tests can generate some output data and compare this data to benchmark data. This comparison will tell you if the output has changed.
- This can be done with a Python script or a shell script. If you have a script that runs your production pipeline, it is convenient to use this same script to run your tests.
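A sketch of such an integration test in Python (the file paths and the use of subsample as the pipeline are hypothetical):

```python
import filecmp

import subsample


def run_integration_test():
    # Run the pipeline on fixed input with a fixed seed...
    subsample.subsample(
        'test_data/input.csv', 'test_data/output.csv',
        subsample_rate=0.1, seed=1234)
    # ...then compare the output to stored benchmark data.
    if filecmp.cmp('test_data/output.csv', 'test_data/benchmark.csv',
                   shallow=False):
        print('PASS')
    else:
        print('FAIL: output differs from benchmark')


if __name__ == '__main__':
    run_integration_test()
```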
Debugging
Debuggers can be used for finding bugs, or just for understanding how a piece of code works. See this post.
Configuration files
- Config files are nearly impossible to keep consistent with multiple users, so most of your code should work without them.
- YAML is very very nice.
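For example, a minimal sketch of loading a YAML config (assumes the PyYAML package; the file name and keys are hypothetical):

```python
import yaml  # PyYAML

# config.yml might contain:
#   subsample_rate: 0.01
#   delimiter: ','
#   seed: 1234

with open('config.yml') as f:
    config = yaml.safe_load(f)

# Fall back to defaults, so the code also works without a config entry.
subsample_rate = config.get('subsample_rate', 0.01)
```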
Pandas
- The basic Pandas API is still changing. When possible, production code should use numpy or standard Python. If you use Pandas in production code, try to use simple functionality that has been around for some time.
- For indexing, use the `loc` and `iloc` methods of DataFrame and Series. `ix` should never appear in production code since it carries multiple meanings and does many things automagically.
- Pandas supports a "long one-liner" style of data modification. This is useful for quick prototyping, but not in production code. Therefore:
```python
# This is ok in IPython.
age_stats = frame.rename(columns={'ID': 'doc_id'}).set_index('doc_id').groupby('age').apply(get_age_stats)

# This is preferred in production code.
frame = frame.rename(columns={'ID': 'doc_id'})
frame = frame.set_index('doc_id')
age_stats = frame.groupby('age').apply(get_age_stats)
```
Performance
- Read the high(er) performance Python chapter in the applied data science lecture notes.
Documentation
- All public methods/functions must contain a docstring conforming to the numpy standard. This is for consistency and also allows automatic generation of html docs. To see an example, start `ipython`, import numpy, and type `numpy.arange?`.
- In-code comments should explain the algorithm. If you have to think a few times about how something works, then comment it!
- Example scripts and tests can help explain functionality.
- External documentation is not recommended. External docs are never updated, so in addition to creating the illusion of proper documentation they can be misleading.
Packaging
Once you have made a module, `mymodule.py`, you need some way for other code to import it. We recommend:
- Organize modules into project directories. These could be e.g. your repos.
- Place these project directories in `$HOME/lib/`.
- Inside every project directory, and every subfolder that you wish to be able to import, place a blank file `__init__.py`. So your directory/file layout could look like:
```
$HOME/lib/
    first_repo/
        __init__.py
        module_1.py
        module_2.py
    second_repo/
        __init__.py
        src/
            __init__.py
            module_3.py
            module_4.py
```
- Add `$HOME/lib` to your `PYTHONPATH` (e.g. by adding the line `export PYTHONPATH="$HOME/lib:$PYTHONPATH"` to your shell startup file).
Now, you can do:

```python
import first_repo
from first_repo import module_1
import second_repo
from second_repo.src import module_3
```