This style guide is meant for advanced-beginner to advanced-intermediate developers of scientific code in Python. In other words, non-professional programmers such as data scientists. The term LowClass Python hints at reducing the use of object-oriented design. It is an attempt to be as witty as Tom Anderson when he coined the term C++-- (C plus-plus, minus-minus). You see, C++ is a very rich language with many features. This allows a variety of abstract design patterns that can result in confusing and hard-to-maintain code. In that essay, Professor Anderson encourages limiting your use of the C++ language to a smaller set of features.

Similar to C++, Python allows a variety of coding styles and abstractions, and some limits can help. The limits depend on who is writing the code and for whom. Our perspective is that of a data scientist who is writing a model to be used by others as part of a larger system. An example would be a feature extractor that reads streaming (through stdin/stdout) text data and counts the number of times different words appear. The data scientist should be able to prototype the model using a smaller collection of text files (stored on a laptop). Once the model is working, it should be easy for the scientist to:

  1. Work with other data scientists to integrate the feature extractor into a much larger prototype. Here the challenge is fitting together and maintaining code written by many different people.
  2. Work with engineers to productionalize this model. Here the challenge is that the code must be scaled and maintained by people who have not worked on and may not understand the underlying algorithm.

Disclaimer: We're not saying the patterns found below are the only way to write LowClass Python. We have, however, found them effective. Comments and discussion are welcome!


Basics

Separation of source code from scripts

It is ok to write quick and dirty scripts to get some job done (e.g. replacing words in a file). These should be kept in a separate directory from the source code, which needs to follow the other conventions in this guide. Keep in mind that as scripts grow, you will be expected to make them adhere to this guide.

Initial reading

  • Read and follow Code like a pythonista.
  • Find a tutorial on Object Oriented Design (OOD) in Python and make sure you understand how to create a class and understand attributes and methods. Make sure you've worked through one example of class inheritance.

Some rules

The standard basic style-guide is PEP8. Read this from beginning until the Version Bookkeeping section. Here we present some additions/clarifications/modifications. The reason for consistency in style is that it makes it easier for others to read and grep your code.

  • Use 4 spaces per indentation level. This is absolutely essential since code that uses different indentation cannot be used together! Tabs cannot be used. This does not mean you have to hit the space bar four times. Every decent text editor has a setting that allows you to map the tab key to spaces.
  • Limit lines to some pre-determined number of characters. 79 is the accepted standard and to contribute to many projects you will need to adhere to this.
  • Try to limit functions to 50 lines (not including doc strings). If a function is longer than 50 lines you have probably made it do too many things. Something close to 20 lines should be your average.
  • Don't use abbreviations unless the word will be used many many times. If you're worried about having to type long names, then learn to use tab complete. The only short names that are acceptable are common math names such as x, y, X, Y, t, and counters such as i, j, k (used ONLY for the normal purposes). Note that it is standard convention to never use single letters as variable names, and to only use lower case letters and underscores...so we're breaking from Python programming convention here in the name of conforming to math/stats/ML convention.
  • Try to be consistent with names across multiple functions. In other words, don't use the full name in one location and an abbreviation elsewhere.

Simplicity

Simple is better!

While designing and writing code, the goal of simplicity should be foremost in your mind. Your code must be understood, used, and maintained by many different people, over long time periods. Adding additional complexity to your code forces these people to do unnecessary work. This is rude and causes mistakes. This principle is sometimes called KISS.

  • A very simple useful pattern is to write functions that take well-defined inputs and produce well-defined output. You tie these together with code that reads like a book. For example,
def extract_feature_counts(data_string):
    """
    Some docstring here...
    """
    cleaned_data_string = _clean(data_string)
    word_counts = _count_words(cleaned_data_string)

    return word_counts


def _clean(data_string):
    # Some code here.
    return cleaned_data_string


def _count_words(data_string):
    # Some code here.
    return word_counts

Notice that:

  • The only public function is extract_feature_counts (the other functions start with an underscore). This means that a user should look first at the function extract_feature_counts.
  • Upon looking at extract_feature_counts, the user can easily see that the extraction consists of two steps, cleaning and counting words.

Beware of feature creep

Feature creep is a phenomenon where additional functionality added to your code becomes excessive. A feature could be an additional method in a class, or an optional keyword argument in a function. Instead of helping, this practice complicates matters for current and future developers. Here are some guidelines.

  • Only create features that are needed now...not at some potential time in the future. This is known as the "You aren't gonna need it" (YAGNI) principle.
  • Sometimes features are added with the intention of making calling the code easier. Before doing this, stop and consider the fact that this feature will have to be maintained by you and others for a long time over many different code changes. For example:
class MyModel(object):
    """
    Implements a classification model.
    """
    def trim_variables(self):
        trimmed_variable_names = self.variables
        # some code to eliminate variables from trimmed_variable_names.
        self.variables = trimmed_variable_names

    def fit(self, X, Y):
        # some code to fit coefficients to self.variables.
        pass

    def trim_and_fit(self, X, Y):
        self.trim_variables()
        self.fit(X, Y)

Is the method trim_and_fit really needed? Sure, it will save you from making two separate calls (one to trim_variables and one to fit), but it will have to be maintained by someone. If the API to either trim_variables or fit changes, then so will the API to trim_and_fit. If the preferred steps for fitting and trimming change, then this method may become extinct. A user of this class needs to understand the functionality of all three methods now. Note that we are not saying this convenience method should not be written. It's just that the convenience must be weighed against the work that you are asking others to do when they attempt to use and maintain your code.

  • Be especially careful with features that are part of a large system (e.g. options in a function that is part of a data processing pipeline). Every time someone ties one program to another, they will have to consider what state every feature will be in. This causes a headache.

Explicit is better than implicit

  • Give variables, methods, and attributes descriptive names. For example:
for person in person_list:
    age_years = todays_date - person.birthdate
    age_days = age_years * 365
    age_factor = np.sqrt(age_days) * age_multiplier
    age_factor_list.append(age_factor)

Notice that we used the full name of every variable. Writing out age_factor makes it more memorable. We can grep the code and find all places where age_factor appears. Notice we also use age_factor_list rather than age_factors. age_factor and age_factors are barely distinguishable, and if both are variable names they will often be inadvertently swapped.

Being explicit while setting attributes

Consider the following fragment.

class MyModel(object):
    def __init__(self, training_data):
        self.X, self.Y = self._get_XY(training_data)
        # Some more code here.

    def _get_XY(self, training_data):
        # Some code here.
        return X, Y

    def fit(self):
        # Some code here...this code ONLY fits the model and sets attributes
        # DIRECTLY related to fitting.
        self.coefficients = coefficients
        self.model_has_been_fit = True

The reason for returning X and Y (rather than setting them inside a method called e.g. self._set_XY) is twofold:

  1. It pushes attribute setting up to the highest level (in __init__). This allows you to read __init__ and see as much useful information as possible (in particular, that the attribute names X and Y will be used).
  2. It encourages a functional style of programming, minimizing side effects.

In the fit method, notice that

  • We ONLY fit the model and set attributes directly related to fitting.
  • Since attribute setting is the "output" of this method, it is put at the end (just as a return call from a function would be put at the end).
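To make the fragment concrete, here is a small runnable sketch. The data layout (a list of (x, y) pairs), the model (a least-squares line through the origin), the helper _fit_slope, and the initialization of coefficients and model_has_been_fit in __init__ are all assumptions made for illustration; they are not part of the fragment above.

```python
class MyModel(object):
    def __init__(self, training_data):
        # Every attribute the class uses is visible here.
        self.X, self.Y = self._get_XY(training_data)
        self.coefficients = None
        self.model_has_been_fit = False

    def _get_XY(self, training_data):
        # Assumed data layout: a list of (x, y) pairs.
        X = [x for x, _ in training_data]
        Y = [y for _, y in training_data]
        return X, Y

    def fit(self):
        # Assumed model: a line through the origin, y = slope * x.
        coefficients = _fit_slope(self.X, self.Y)

        # Attribute setting is the "output", so it comes last.
        self.coefficients = coefficients
        self.model_has_been_fit = True


def _fit_slope(X, Y):
    # Pure function: no attributes are read or written here.
    numerator = sum(x * y for x, y in zip(X, Y))
    denominator = float(sum(x * x for x in X))
    return numerator / denominator


model = MyModel([(1, 2), (2, 4), (3, 6)])
model.fit()
```

Reading __init__ and the last two lines of fit is enough to know every attribute the object will ever carry.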

Limit use of optional (keyword) arguments

  • Options should be optional. The code should work fine without them.
  • Options can lead to feature creep.

Separate interface from implementation

Interface is:

I/O

  • Opening/closing of files
  • Reading/writing of files

APIs

  • The functions that a module/class user calls to "do stuff."

Implementation is stuff like math and modification of data.

For example:

import csv

import numpy as np


def subsample(infile, outfile, subsample_rate=0.01, delimiter=',', seed=None):
    """
    Subsample infile and write to outfile.

    Parameters
    ----------
    infile : csv file open in read mode, or String
        File should be delimited text and have a header
    outfile : File open in write mode, or String
        Output is written here
    subsample_rate : Real number in the interval [0, 1]
        Keep this fraction of rows/key-values
    delimiter : Single character string
        The delimiter of infile.  Also used for outfile.
    seed : Integer
        If given, use this to seed the random number generator.
    """
    # Seed the random number generator for deterministic results.
    if seed is not None:
        np.random.seed(seed)

    # Open files (if necessary).
    infile, infile_was_path = _openfile_wrap(infile, 'r')
    outfile, outfile_was_path = _openfile_wrap(outfile, 'w')

    # Get the csv reader and writer.  Use these to read/write the files.
    reader = csv.DictReader(infile, delimiter=delimiter)
    writer = csv.DictWriter(
        outfile, delimiter=delimiter, fieldnames=reader.fieldnames)
    writer.writeheader()

    # Do the actual subsampling.
    _read_select_write(reader, writer.writerow, subsample_rate)

    # Close files (if necessary).
    if infile_was_path:
        infile.close()
    if outfile_was_path:
        outfile.close()


def _read_select_write(reader, writemethod, subsample_rate):
    """
    Iterate through reader and use writemethod to write a selection of rows.

    Parameters
    ----------
    reader : Iterable
        An iterable over the "input file".
    writemethod : Function
        writemethod(row) should write one row yielded by reader
    subsample_rate : Real number in [0, 1]
        This fraction of iterates will be written.
    """
    assert 0 <= subsample_rate <= 1

    for row in reader:
        if subsample_rate > np.random.rand():
            writemethod(row)


def _openfile_wrap(filename, mode):
    """
    If filename is a string, returns an opened version of filename.
    If filename is a file buffer, then passthrough.

    Parameters
    ----------
    filename : String or file buffer
    mode : String
        mode to open the file in

    Returns
    -------
    opened_file : Opened file buffer
    was_path : Boolean
        If True, then filename was a string (and thus was opened here, and so
        the calling function should close it).
    """
    if isinstance(filename, str):
        was_path = True
        opened_file = open(filename, mode)
    elif hasattr(filename, 'read') or hasattr(filename, 'write'):
        # Any object with a read or write method is treated as an open file.
        was_path = False
        opened_file = filename
    else:
        raise Exception(
            "Did not know how to handle %s, type = %s"
            % (filename, type(filename)))

    return opened_file, was_path

Above, the interface is subsample. It handles the opening/closing of the infile and outfile. The infile should be a csv file. It then uses Python's csv module to form an iterator over infile (the reader), as well as a method (writemethod) for writing the contents returned by the reader. This reader and writemethod are passed to the implementation, _read_select_write. This separation achieves the following:

  • If the interface changed (e.g. we decide to work with files without headers, or read/write to/from stdin/stdout or a database), we would only change the function subsample. We would not have to change the implementation.
  • The implementation might change if e.g. we optimize it for performance or have to use this in an environment where numpy is not available.
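One consequence of this separation is that the implementation can be exercised with plain in-memory objects and no files at all. The sketch below assumes an environment without numpy, so it swaps np.random.rand for the standard library's random.random; the list of dictionaries standing in for the reader and the use of list.append as the writemethod are illustrative choices.

```python
import random


def _read_select_write(reader, writemethod, subsample_rate):
    """Pass a random fraction of the items in reader to writemethod."""
    assert 0 <= subsample_rate <= 1

    for row in reader:
        # random.random() returns a float in [0.0, 1.0).
        if subsample_rate > random.random():
            writemethod(row)


# A list stands in for the csv reader, list.append for the writer.
rows = [{'name': 'ian'}, {'name': 'daniel'}, {'name': 'chang'}]

kept_all = []
_read_select_write(rows, kept_all.append, subsample_rate=1.0)

kept_none = []
_read_select_write(rows, kept_none.append, subsample_rate=0.0)
```

The interface function subsample would not need to change at all for this swap, since it only hands a reader and a write method to the implementation.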

You can view a version of subsample that includes a command line interface here.

Interface design

  • Minimize entry points.
    • Try to make most methods/functions of a class/module non-public (the name starts with an underscore). These non-public methods/functions should not be used by code outside of this module. This makes it easier for a new user to understand. It also allows you to change the non-public methods/functions without messing up other modules that may depend on this module.
    • Minimize the number of ways that attributes can be set.
  • Make sure your interface accepts open files (created with e.g. open(filename, 'r')). This allows them to be tested using StringIO. See the example below.
  • It is ok to accept open files or paths. In this case, the interface should (using a wrapper function) determine whether it was passed an open file or path and convert to an open file. For example:
def openfile_wrap(filename, mode):
    """
    If filename is a string, returns an opened version of filename.
    If filename is a file buffer, then passthrough.

    Parameters
    ----------
    filename : String or file buffer
    mode : String
        mode to open the file in

    Returns
    -------
    opened_file : Opened file buffer
    was_path : Boolean
        If True, then filename was a string (and thus was opened here, and so
        you better remember to close it elsewhere)

    Examples
    --------
    >>> infile, was_path = openfile_wrap(infilename, 'r')
    >>> myfunction(infile,...)
    >>> if was_path:
    ...     infile.close()
    """
    if isinstance(filename, str):
        was_path = True
        opened_file = open(filename, mode)
    elif hasattr(filename, 'read') or hasattr(filename, 'write'):
        # Any object with a read or write method is treated as an open file.
        was_path = False
        opened_file = filename
    else:
        raise Exception("Could not work with %s" % filename)

    return opened_file, was_path
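An alternative design, sketched here under the assumption of Python 3, wraps the same decision in a context manager so that the caller cannot forget the was_path bookkeeping. The name opened is hypothetical and is not part of the code above.

```python
import contextlib
import io


@contextlib.contextmanager
def opened(path_or_file, mode):
    """
    Yield an open file. Close it on exit only if it was opened here.
    """
    if isinstance(path_or_file, str):
        # We opened it, so the with statement closes it for us.
        with open(path_or_file, mode) as opened_file:
            yield opened_file
    else:
        # Already an open file-like object: the caller owns it.
        yield path_or_file


buffer = io.StringIO('name,age\nian,22\n')
with opened(buffer, 'r') as infile:
    first_line = infile.readline()
```

A buffer passed in by the caller is still open after the with block, while a path would have been opened and closed inside it. Whether this is simpler than returning was_path is a judgment call; the trade-off is one more concept (context managers) for the reader to know.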

Be careful with objects

Object Oriented Design (OOD) is useful in many cases and is a fundamental part of Python and most medium to large projects built in Python. For those reasons, it must be learned and should often be used by data scientists. However, the additional abstraction makes it easy to get into trouble. Our view is that a non-professional programmer should take care when working with objects.

  • The main purpose of OOD should be to bundle together data with methods that use, make available, and (with caution) modify the data. For example:
class MyModel(object):
    def __init__(self, X, Y,...):
        self.X, self.Y = X, Y
        self.model_has_been_fit = False
        # Some code.

    def fit(self):
        # Some code that uses self.X, self.Y.
        self.coefficients = coefficients
        self.model_has_been_fit = True

    def predict(self):
        # Some code to predict Yhat given self.X
        return Yhat

    def plot(self):
        # Some code.

    def get_data_stats(self):
        # Some code.
        return statistical_summary_of_data
  • The fit method is an example of a method that uses the data. It should (and does) set the coefficients and an attribute to tell you that the model has been fit. A common mishap is to also transform the X, Y data within fit. For example, we could rescale self.X and self.Y. This would be a side effect and should be avoided.
  • The plot method makes the data available (as a plot). This is a convenience method since it probably just wraps some matplotlib code. So long as this method has no side effect, it is fine to include. The get_data_stats method is similar.

There is another issue with MyModel as written: Consider the fact that we have bundled together the training data with the object, and only through this training data do we set self.coefficients. Only using these coefficients can we predict (using self.predict()). What if the user wanted to store the (lightweight) coefficients in a file, and then use them at some point in the future to make predictions on new data? What if a user wanted to use a subset of the training data to fit (in e.g. cross validation)? Neither of these could be done! This sort of difficulty can be seen in the statsmodels package, which was meant to be used to analyze one non-changing piece of data, rather than predict on multiple new pieces of data.

To avoid these problems, we suggest the following revision where we no longer bundle the main data with the model. This is more in line with the scikit-learn philosophy.

class MyImprovedModel(object):
    def __init__(self,...):
        # Some code.  Notice that we don't initialize the model with any data.
        self.model_has_been_fit = False

    def fit(self, X, Y):
        # Some code.
        self.coefficients = coefficients
        self.model_has_been_fit = True

    def predict(self, X, coefficients=None):
        # Some code to predict Yhat given X.
        # If coefficients is None, use self.coefficients.
        return Yhat
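A runnable sketch of this revision is given below. The model (a least-squares line with an intercept and a slope), the helper _least_squares_line, and the representation of coefficients as an (intercept, slope) tuple are assumptions chosen to keep the example free of dependencies.

```python
class MyImprovedModel(object):
    def __init__(self):
        # No data is stored on the model.
        self.coefficients = None
        self.model_has_been_fit = False

    def fit(self, X, Y):
        coefficients = _least_squares_line(X, Y)
        self.coefficients = coefficients
        self.model_has_been_fit = True

    def predict(self, X, coefficients=None):
        # If coefficients is None, use self.coefficients.
        if coefficients is None:
            coefficients = self.coefficients
        intercept, slope = coefficients
        Yhat = [intercept + slope * x for x in X]
        return Yhat


def _least_squares_line(X, Y):
    """Return (intercept, slope) of the least-squares line."""
    number_of_points = float(len(X))
    mean_x = sum(X) / number_of_points
    mean_y = sum(Y) / number_of_points
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
    denominator = sum((x - mean_x) ** 2 for x in X)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Fit on one data set, predict on new data.
model = MyImprovedModel()
model.fit([0, 1, 2], [1, 3, 5])
new_predictions = model.predict([10])

# Coefficients saved earlier can be reused without refitting.
stored_coefficients = (1.0, 2.0)
stored_predictions = MyImprovedModel().predict([3, 4], stored_coefficients)
```

Both of the use cases that failed with MyModel now work: fitting on any subset is just a call to fit with that subset, and stored coefficients can be passed straight to predict.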

Side effects

A side effect in programming occurs when a function/method is called and unintended data or state is modified. This can produce difficult to detect bugs. For example,

  • The attributes of an object determine its state and hence its behavior. If an attribute is modified, then the behavior of the object changes. For example, the predict method of MyModel depends on self.coefficients. If self.coefficients are modified, then self.predict() has, in effect, changed. An alternative to storing self.coefficients would be to return coefficients and make the user store them. This may be more or less complicated, but please be aware of the trade offs.
  • You can cut down on side effects by giving methods very explicit names and making sure the functionality coincides with the name. For example, the method get_data_stats returns (i.e. gets) some statistics related to the data and nothing more. If someone wanted to also set an attribute self.data_stats, they would be "forced" to do this in a separate call (e.g. self.data_stats = self.get_data_stats()).
  • Try googling side effects in object oriented design and you can enjoy reading some other rants about side effects.
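The difference is easiest to see on a small example. The two functions below are hypothetical; both rescale a list of numbers, but only the first one changes data that belongs to the caller.

```python
def rescale_in_place(values, factor):
    # Side effect: the caller's list is modified.
    for index in range(len(values)):
        values[index] = values[index] * factor


def rescaled(values, factor):
    # No side effect: a new list is returned, the input is untouched.
    return [value * factor for value in values]


heights = [1.0, 2.0]

rescaled_heights = rescaled(heights, 10)
heights_after_pure_call = list(heights)

rescale_in_place(heights, 10)
```

After the call to rescaled, heights is unchanged. After the call to rescale_in_place, every other piece of code holding a reference to heights sees different numbers, whether it expected to or not.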

Multiple inheritance

Multiple inheritance is a feature of some object-oriented computer programming languages in which a class can inherit characteristics and features from more than one superclass. This results in significantly more complicated code and should not be used in LowClass Python.

Some more guidelines

  • Push attribute modification to the highest level possible. Complex math should be in functions that return a value (that can then be used to set an attribute). This allows developers to see when and how an attribute is being modified without digging into the math.
  • Since attributes determine state and they can end up being modified in unexpected ways, please limit the number of attributes you create.
  • Don't write uber methods. An uber method is a method that does lots of things at once. For example, you could have a method that initializes a few attributes, modifies self.X and self.Y, and returns Z. This may make it easy to perform some common task with one simple method call. However, (i) if one of these tasks needs to be modified, you may destroy the ability to perform the other tasks, (ii) it is difficult to understand the code in these methods, and (iii) good luck writing unit tests!

Miscellaneous

Here we go over some miscellaneous odds and ends that are consistent with LowClass Python.

CSV files

Any text file delimited by a character (e.g. comma, pipe, tab, etc...) is known as a csv file. For example, let data.csv be:

name|age
ian|22
daniel|33
chang|44
  • A convenient object for reading/writing csv files without headers is the csv.reader. For example, consider the following code that iterates through a file, modifying each row.
import csv

reader = csv.reader(infile, delimiter='|')
writer = csv.writer(outfile, delimiter='|')

for row in reader:
    newrow = _modify_row(row)
    writer.writerow(newrow)

Each row is a list of strings. So the first row would be ['name', 'age'], the second would be ['ian', '22'] and so on.

  • If your csv file has a header, the preferred object is the csv.DictReader.
import csv

reader = csv.DictReader(infile, delimiter='|')
writer = csv.DictWriter(outfile, delimiter='|', fieldnames=reader.fieldnames)
writer.writeheader()

# and so on...

With a DictReader, each row is a dictionary with keys/values given by the header and the row in the file. So the first row would be {'name': 'ian', 'age': '22'} and so on. This enforces column names. As your code grows, you will add/subtract columns. The DictReader allows you to continue using old code even as the number of columns changes.
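Here is a complete, runnable version of the DictReader pattern applied to the data.csv contents shown earlier. It assumes Python 3 (io.StringIO stands in for the open files), and the row modification (adding one to each age) is made up for illustration.

```python
import csv
import io

# In-memory stand-ins for open files.
infile = io.StringIO('name|age\nian|22\ndaniel|33\nchang|44\n')
outfile = io.StringIO()

reader = csv.DictReader(infile, delimiter='|')
writer = csv.DictWriter(
    outfile, delimiter='|', fieldnames=reader.fieldnames)
writer.writeheader()

for row in reader:
    # Columns are addressed by name, so adding a column to the file
    # later would not break this line.
    row['age'] = str(int(row['age']) + 1)
    writer.writerow(row)

result = outfile.getvalue()
```

Note that all values are read as strings, so the age has to be converted before doing arithmetic on it.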

Tests

Unit tests test one small thing.

  • They should be built using either Python's unittest module or the nose package.
  • If a unit test relies on some outside data file, it is probably too big.
  • Always test the implementation with unit tests. Since the implementation should be separated from IO decisions, it should work with file buffers. This can look like:
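# Python 2 import. On Python 3 use: from io import StringIO
from StringIO import StringIO
import unittest

# The module under test, assumed to be saved as subsample.py.
import subsample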

class TestSubsample(unittest.TestCase):
    """
    Tests the subsampler
    """
    def setUp(self):
        self.outfile = StringIO()
        self.commafile = StringIO(
            'name,age,weight\nian,1,11\ndaniel,2,22\nchang,3,33')
        self.seed = 1234

    def test_r0p0_comma(self):
        subsample.subsample(
            self.commafile, self.outfile, subsample_rate=0.0, seed=self.seed)
        result = self.outfile.getvalue()
        benchmark = 'name,age,weight\r\n'
        self.assertEqual(result, benchmark)

    def tearDown(self):
        self.outfile.close()

Integration tests test functionality of the entire module or pipeline at once.

  • The interface may be tested as part of integration tests (that can use outside files).
  • Integration tests can generate some output data and compare this data to benchmark data. This comparison will tell you if the output has changed.
  • This can be done with a Python script or a shell script. If you have a script that runs your production pipeline, it is convenient to use this same script to run your tests.
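A benchmark comparison of this kind can be sketched in a few lines of Python. Everything here is hypothetical: run_pipeline is a stand-in for whatever your real pipeline does (it just uppercases each line), and output_matches_benchmark is one possible way to do the comparison, using the standard library's filecmp.

```python
import filecmp
import os
import tempfile


def run_pipeline(infile_path, outfile_path):
    # Hypothetical stand-in for the real pipeline: uppercase each line.
    with open(infile_path) as infile:
        with open(outfile_path, 'w') as outfile:
            for line in infile:
                outfile.write(line.upper())


def output_matches_benchmark(infile_path, benchmark_path):
    """
    Run the pipeline on infile_path and compare with the benchmark.

    Returns True if the output is byte-for-byte equal to the file at
    benchmark_path.
    """
    handle, outfile_path = tempfile.mkstemp()
    os.close(handle)
    try:
        run_pipeline(infile_path, outfile_path)
        # shallow=False forces a comparison of the file contents.
        return filecmp.cmp(outfile_path, benchmark_path, shallow=False)
    finally:
        os.remove(outfile_path)
```

When the comparison fails, the benchmark has either caught a bug or the intended behavior has changed, in which case the benchmark file is regenerated and reviewed by hand.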

Debugging

Debuggers can be used for finding bugs, or just for understanding how a piece of code works. See this post.

Configuration files

  • Config files are nearly impossible to keep consistent with multiple users, so most of your code should work without them.
  • YAML is very very nice.

Pandas

  • The basic Pandas API is still changing. When possible, production code should use numpy or standard Python. If you use Pandas in production code, try to use simple functionality that has been around for some time.
  • For indexing, use the loc and iloc indexers of DataFrame and Series. ix should never appear in production code since it carries multiple meanings and does many things automagically.
  • Pandas supports a "long one-liner" style of data modification. This is useful for quick prototyping, but not in production code. Therefore:
# This is ok in IPython.
age_stats = frame.rename(columns={'ID': 'doc_id'}).set_index('doc_id').groupby('age').apply(get_age_stats)

# This is preferred in production code.
frame = frame.rename(columns={'ID': 'doc_id'})
frame = frame.set_index('doc_id')
age_stats = frame.groupby('age').apply(get_age_stats)

Performance

  • Read the high(er) performance Python chapter in the applied data science lecture notes.

Documentation

  • All public methods/functions must contain a docstring conforming to the numpy standard. This is for consistency and also allows automatic generation of html docs. To see an example, start ipython, import numpy, and type numpy.arange?.
  • In-code comments should explain the algorithm. If you have to think a few times about how something works, then comment it!
  • Example scripts and tests can help explain functionality.
  • External documentation is not recommended. External docs are never updated, so in addition to creating the illusion of proper documentation they can be misleading.
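For reference, a small function documented in the numpy style might look like the following. The function itself (count_long_words) is made up for illustration; the section names and layout follow the same conventions used in the docstrings earlier in this guide.

```python
def count_long_words(word_list, minimum_length=5):
    """
    Count the words that have at least minimum_length characters.

    Parameters
    ----------
    word_list : List of strings
        The words to examine.
    minimum_length : Integer
        Words shorter than this are not counted.

    Returns
    -------
    long_word_count : Integer
        Number of words in word_list with len(word) >= minimum_length.

    Examples
    --------
    >>> count_long_words(['a', 'python', 'guide'])
    2
    """
    long_word_count = sum(
        1 for word in word_list if len(word) >= minimum_length)
    return long_word_count
```

The Examples section doubles as a doctest, so the documentation can be checked automatically.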

Packaging

Once you have made a module, mymodule.py, you need some way for other code to import this. We recommend:

  • Organize modules into project directories. These could be e.g. your repos.
  • Place these project directories in $HOME/lib/
  • Inside every project directory, and every subfolder that you wish to be able to import, place a blank file __init__.py. So your directory/file layout could look like:
$HOME/lib/

    first_repo/
       __init__.py
       module_1.py
       module_2.py

    second_repo/
        __init__.py
        src/
            __init__.py
            module_3.py
            module_4.py
  • Add $HOME/lib to your PYTHONPATH

Now, you can do:

import first_repo
from first_repo import module_1

import second_repo
from second_repo.src import module_3