Starting projects that involve lots of file I/O

I often work on projects where the general goal is:

1. Read lots of files from disk
2. Modify and extract information from the files
3. Write results to disk

Steps 1 and 3 are the interface (in this case, the plumbing that talks to the OS and the disk), and step 2 is the implementation (the logic that you actually want to write). The point of this post is that the interface should be kept separate from the implementation, because the two tend to change at different times. To illustrate, imagine you write the following script:

infilename = 'data/smallfile.csv'
outfilename = 'data/my_outfile.csv'

modify_and_write(infilename, outfilename)

This works fine during the initial development phase, when you want to test your script on a single file. However, it doesn't work if you want to modify many files. You could change it to:

indir = 'data/'
outfilename = 'data/my_outfile.csv'

for infilename in get_filenames(indir):
    modify_and_write(infilename, outfilename, append=True)
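The post never defines get_filenames or modify_and_write, so here is a minimal sketch of what they might look like at this stage. Both bodies are assumptions made for illustration, and the upper-casing is only a placeholder for the real modification logic:

```python
import os


def get_filenames(indir):
    """Return the sorted paths of the regular files directly inside indir."""
    return sorted(
        os.path.join(indir, name)
        for name in os.listdir(indir)
        if os.path.isfile(os.path.join(indir, name))
    )


def modify_and_write(infilename, outfilename, append=False):
    """Read infilename, transform each line, and write it to outfilename."""
    # 'a' adds to the end of an existing file, 'w' truncates it first.
    mode = 'a' if append else 'w'
    with open(infilename, 'r') as f, open(outfilename, mode) as g:
        for line in f:
            # Placeholder transformation (hypothetical): upper-case the line.
            g.write(line.upper())
```

Notice that modify_and_write already has to know about file names and append modes, which have nothing to do with the modification itself. Also, with the output file living inside data/, a second run of a helper like this would pick up the previous output as input, so a separate output directory is probably safer.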

This solves the first problem, but suppose you want to read from standard input and/or write to standard output (this would be helpful, since you could then tie modify_and_write together with other utilities). To do this, you could pass open file objects rather than file names.

import sys

# Use with hardcoded files in a script

indir = 'data/'
outfilename = 'data/my_outfile.csv'

for infilename in get_filenames(indir):
    with open(infilename, 'r') as f:
        # 'a' appends, so each input file adds to the output
        # instead of overwriting what the previous one wrote
        with open(outfilename, 'a') as g:
            modify_and_write(f, g)

# Use with stdin/stdout as part of a larger program.

modify_and_write(sys.stdin, sys.stdout)

Here we have pushed the file-opening part of the interface away from the modification part. This allows us to tie modify_and_write together with other programs or use it by itself. An example can be found here.
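To make this concrete, here is a rough sketch of a modify_and_write that accepts open file objects. The body is an assumption (the upper-casing stands in for the real logic); the point is only that the function no longer cares where its input comes from or where its output goes:

```python
import io


def modify_and_write(infile, outfile):
    """Copy infile to outfile, transforming one line at a time."""
    for line in infile:
        # Placeholder transformation (hypothetical): upper-case the line.
        outfile.write(line.upper())


# Anything that behaves like a file works, including in-memory buffers,
# which makes the function easy to try out without touching the disk.
source = io.StringIO('a,1\nb,2\n')
sink = io.StringIO()
modify_and_write(source, sink)
print(sink.getvalue(), end='')
```

The same function should work unchanged with sys.stdin and sys.stdout, or with files opened in a with block, since all of them can be iterated line by line and written to.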

This sort of setup is good if you know ahead of time that you will only be reading from and writing to files or stdin/stdout. Although pipes are a good way to tie programs together, Unix pipelines can be restrictive. Suppose all the files are small enough to be read into memory at once. In that case you can write:

# Simple script to write as you develop modify_lines

infilename = 'data/smallfile.csv'
outfilename = 'data/my_outfile.csv'

with open(infilename, 'r') as f:
    lines = f.read()
    newlines = modify_lines(lines)
    with open(outfilename, 'w') as g:
        g.write(newlines)

Above, modify_lines takes in the bare minimum that it needs in order to modify the lines in the file: the string returned by f.read(). Later, when we decide exactly how modify_lines will be used, we can build the interface. If that interface changes over time, that is fine, because the implementation (modify_lines) doesn't need to change. For example, we can decide to read lines from stdin, or a file, or another function.
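As a sketch of how that might look, the implementation below is a plain string-to-string function, with two thin interfaces built on top of it. The names run_on_streams and run_on_paths are hypothetical, and the upper-casing is again just a placeholder:

```python
def modify_lines(text):
    """Implementation: a string goes in, a string comes out. No I/O."""
    # Placeholder transformation (hypothetical): upper-case the text.
    return text.upper()


def run_on_streams(infile, outfile):
    """Interface for open file objects such as sys.stdin and sys.stdout."""
    outfile.write(modify_lines(infile.read()))


def run_on_paths(infilename, outfilename):
    """Interface for file names on disk."""
    with open(infilename, 'r') as f, open(outfilename, 'w') as g:
        run_on_streams(f, g)


# The implementation can be exercised directly, with no files involved.
print(modify_lines('a,1\nb,2\n'), end='')
```

Adding a third interface later (say, one that takes its input from another function) would mean writing another small wrapper, while modify_lines stays exactly as it is.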


Published

26 April 2013
