Starting projects that involve lots of file I/O
Often I work on a project where the general goal is:
- Read lots of files from disk
- Modify and extract information from the files
- Write results to disk
The first and third steps provide the interface (in this case, the plumbing that talks to the OS and the disk), and the second step is the implementation (the logic you actually want to write). The point of this post is that the interface should be separated from the implementation, because the two tend to change at different times. To illustrate, imagine you write the following script:
infilename = 'data/smallfile.csv'
outfilename = 'data/my_outfile.csv'
modify_and_write(infilename, outfilename)
This would work fine during an initial development phase, where you want to test your script on a single file. It doesn't work, however, if you want to modify many files. You could change this to:
indir = 'data/'
outfilename = 'data/my_outfile.csv'
for infilename in get_filenames(indir):
    modify_and_write(infilename, outfilename, append=True)
This solves the first problem, but suppose you want to read from standard input and/or write to standard output (this would be helpful, since you could then tie modify_and_write together with other utilities)? To do this, you could pass open file objects rather than file names.
# Use with hardcoded files in a script
indir = 'data/'
outfilename = 'data/my_outfile.csv'
with open(outfilename, 'w') as g:
    for infilename in get_filenames(indir):
        with open(infilename, 'r') as f:
            modify_and_write(f, g)
# Use with stdin/stdout as part of a larger program.
modify_and_write(sys.stdin, sys.stdout)
Here we have pushed the file-opening part of the interface away from the modification part. This allows us to tie modify_and_write together with other programs or use it by itself. An example can be found here.
This sort of setup is good if you know ahead of time that you will only ever read/write files or stdin/stdout. Although this is a good way to tie programs together, Unix pipelines can be restrictive. Suppose all of the files are small. Then it is possible to read each one in all at once. In this case you can write:
# Simple script to write as you develop modify_lines
infilename = 'data/smallfile.csv'
outfilename = 'data/my_outfile.csv'
with open(infilename, 'r') as f:
    lines = f.read()
newlines = modify_lines(lines)
with open(outfilename, 'w') as g:
    g.write(newlines)
Above, modify_lines takes in the bare minimum it needs in order to modify the lines in the file: the string returned by f.read(). Later, when we decide exactly how modify_lines will be used, we can build the interface. If that interface changes over time, that is fine, because the implementation (modify_lines) doesn't need to change. For example, we can decide to read lines from stdin, or a file, or another function.
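In other words, modify_lines can be a pure string-to-string function (uppercasing is once more a hypothetical stand-in), which any of those interfaces can then call without the function changing:

```python
def modify_lines(lines):
    # Pure string -> string transformation; knows nothing about files,
    # stdin, or where the text came from.
    return lines.upper()  # hypothetical stand-in for the real logic

# The same implementation wired to different interfaces:
#   stdin/stdout:      sys.stdout.write(modify_lines(sys.stdin.read()))
#   a file:            modify_lines(open(infilename).read())
#   another function:  modify_lines(some_other_function())
```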