Homework 2 has been handed out via email and github notifications. It is due Feb 18.

The README.md handed out with the homework follows.

This homework will have you write shell scripts that use Unix utilities and Python utilities that you build. This is done in the name of analyzing (an altered version of) the SF 311 Dataset. This altered version is available here.

Due: Monday Feb 18, 6pm.

To receive full credit, you must commit and push code that passes all unit tests, and shell scripts that give the correct output.

Setup

Clone the repo and save it in a local directory called homework_02 by typing

git clone https://github.com/columbia-applied-data-science/homework_02_team_XX.git \
homework_02


Utilities

Note: To use the Python utilities, your PYTHONPATH must be modified. In your ~/.bashrc (or ~/.bash_profile on Macs), put

"$@"
}
export -f body

then source the bashrc. This allows you to run a command on the body of a file, skipping the header (but still printing the header). For example,

cat filewithheader | body sort -k1,1

will sort filewithheader, using the first field, but leave the header at the top of the file.

cut.py

Acts like the Unix cut utility, except...

• Takes field names rather than numbers
• Uses the Python csv module for more automatic handling of stuff like quoted delimiters

reformat.py

Reformats stuff like delimiters and capitalization

common.py

Common files for all utilities

averager.py

Gets the average of different groups of a sorted file

timeopen.py

Reads a SF 311 case file, appends a 'timeopen' column giving the time (in minutes) a case was open.

subsample.py

Subsamples in the space of rows.

Shell Scripts

These are simple shell scripts. They simply define variables and pipe together some commands. The input file is written into the script. The script writes to stdout and stderr. An example of a script like this (that counts words) would be:

DATA=../data
cat $DATA/infile.csv \
| sort \
| uniq -c \
> outfile.csv
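
The body helper described under Utilities fits naturally into a pipeline of this kind. The following is a minimal sketch, assuming the common read/printf idiom for such a helper; the definition handed out with the homework may differ, and the sample data and column names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not necessarily the handed-out version):
# print the first input line (the header) unchanged, then run the given
# command on the remaining lines.
body() {
    IFS= read -r header
    printf '%s\n' "$header"
    "$@"
}

# Hypothetical sample data standing in for a real input file.
sample=$(mktemp)
printf 'category,status\nNoise,Open\nGraffiti,Closed\nAbandoned Vehicle,Open\n' > "$sample"

# Sort the rows on the first comma-separated field; the header stays on top.
sorted=$(body sort -t, -k1,1 < "$sample")
printf '%s\n' "$sorted"

rm -f "$sample"
```

Because read consumes only the header line, the command passed to body sees just the data rows.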


Use the hints inside of these shell scripts to complete them. "Complete" means that they reproduce the sample input/output inside data/. For example,

cd scripts
./count_categories.sh > /tmp/stdout 2> /tmp/stderr
diff /tmp/stderr ../data/count_categories_stderr
diff /tmp/stdout ../data/count_categories_stdout


will produce two files, /tmp/stdout and /tmp/stderr and then compare them to the files in data. If everything is working, then diff should print nothing.
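
The same comparison can be wrapped in a small function so that each script is checked the same way. This is only a sketch: check_output is a hypothetical helper, not part of the handed-out repo, and it assumes the expected files follow the naming shown above.

```shell
#!/usr/bin/env bash
# check_output is a hypothetical helper, not part of the homework repo.
# Usage: check_output <expected_stdout> <expected_stderr> <command> [args...]
# Runs the command, captures stdout and stderr separately, and diffs each
# against the expected file. Returns 0 only if both match.
check_output() {
    local expected_out=$1 expected_err=$2
    shift 2
    local out err status=0
    out=$(mktemp)
    err=$(mktemp)
    # The command's own exit status is ignored; only its output is compared.
    "$@" > "$out" 2> "$err" || true
    diff "$out" "$expected_out" || status=1
    diff "$err" "$expected_err" || status=1
    rm -f "$out" "$err"
    return "$status"
}
```

From within scripts/, a call might look like check_output ../data/count_categories_stdout ../data/count_categories_stderr ./count_categories.sh. Any differences are printed by diff, and the return value can be tested with && or if.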

count_categories.sh

Count the number of tickets in each category

count_categories_open_closed.sh

Count the number of tickets in each category that are Open or Closed

compute_averages.sh

Compute the average time tickets in different categories remain open.

• For closed tickets, compute the average time they were open before being closed.
• For open tickets, compute the time they have been left open.
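
The idea of averaging over groups of a sorted file can be illustrated with a short awk pipeline. This is a sketch under stated assumptions, not averager.py itself: group_average is a hypothetical name, the input is assumed to be plain "key,value" lines already sorted by key, and quoted delimiters are not handled.

```shell
#!/usr/bin/env bash
# group_average is a hypothetical stand-in that illustrates the technique;
# it is not the averager.py utility from the homework.
# Reads "key,value" lines that are already sorted by key and prints one
# "key,mean" line per group. Sorted input lets each group be finished
# as soon as the key changes, so only running totals are kept in memory.
group_average() {
    awk -F, '
        NR > 1 && $1 != key {
            printf "%s,%.2f\n", key, total / count
            total = 0
            count = 0
        }
        { key = $1; total += $2; count++ }
        END { if (count > 0) printf "%s,%.2f\n", key, total / count }
    '
}

# Hypothetical sample input: category and minutes open.
printf 'Graffiti,10\nGraffiti,30\nNoise,5\n' | group_average
```

For the sample input this prints Graffiti,20.00 and Noise,5.00. Splitting on a bare comma is the main limitation; fields containing quoted commas need a real CSV parser such as the one cut.py relies on.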

Unit Tests

To run tests, cd to tests/ and do

python -m unittest -v testutils


Once you are done, you will get a notification that all tests passed.