Scripting for the Masses

or, how to process lots of data efficiently

This is directed at neuroscientists who need to process data from a large number of subjects. An extended example of a batch-processing script is developed below. It involves typical software for analysis of brain image data, but the basic approach uses the Python scripting language and is quite general.

For a multi-subject experiment, typically the same analysis programs are run on every data set for every subject. This is an ideal application for a simple script program that can feed multiple data sets into the same analysis program in series. In order to take full advantage of the miracle of modern computers, writing a batch script is required.

This tutorial builds a typical and useful batch-script. After you work through it, you will be able to build your own analysis program by simply copying the parts you want, and changing a couple of lines to feed your data into the program(s) of your choice. I have tried to shift the work load from "computer programming" to "cut-n-paste".

This example uses FSL's "fast" program, which segments an MRI image into 3 tissue components (gray matter, white matter, CSF), and then attempts to fix the inhomogeneity artifact, so pixels with the same tissue type have similar values throughout the brain.

Repetitive Analysis

The simplest way to run a series of similar analyses is to type each command at the shell prompt, hit "Return", wait for it to finish, then type the next command:

> fast -or /fullpath/fn_1.img
> fast -or /fullpath/fn_2.img
> fast -or /fullpath/fn_3.img

Usually the only change you need to make is a different input filename. There are some drawbacks to this, however:

You have to be physically at your keyboard.
You have to wait for one process to finish before starting the next. This can be frustrating because each process usually takes several minutes, so it is easy to get distracted.
Operator error due to typos and thinkos.
You need to make sure the program created the correct output file(s).
You should examine the result to make sure it is acceptable.

A big improvement in efficiency is to type the commands into a text editor, save it as an executable text-file, and run it from the command line. The most common way to do this is as a simple bash-file. For this to work, the first line of the file must contain instructions telling what environment should be used to execute the commands. For the bash shell environment, place something like this text on the first line:

#!/bin/bash

(The actual text will depend on your computer setup.)

The bash environment is quite useful and is found on most unix, linux, and Macintosh computers. You can work your way up to rather elaborate bash-scripts, but sooner or later (probably sooner) you will reach the limits of the bash syntax. So, I recommend you start out with a more complete programming language, Python. The rest of this example uses Python.

Example 1:

To run the analysis using "fast" on all three files, create a file like this:

#!/bin/python
import os

res = os.system('fast -or /fullpath/fn_1.img')
res = os.system('fast -or /fullpath/fn_2.img')
res = os.system('fast -or /fullpath/fn_3.img')

Let's assume you named the file "go_fast.py"

The first line instructs the computer to run the text file as a Python script. In our lab, John Ollinger has come up with a clever way to link a Python executable program with the Python interpreter.

The second line imports the "os" library. Most of the useful parts of Python are contained in libraries which are only loaded when requested, in order to keep things as light-weight as possible. The "os" library contains subroutines that interact with the Operating System.

The last 3 lines call "fast", once per input file. In python, subroutines or members of a library are denoted by the syntax "library.subroutine". In this case, the string (text between parentheses) is fed to the operating system, exactly the same as when you type the command at the keyboard. This means you can type the name of the text-file (script) on the command line, hit Return, and walk away while all three files get processed. Back when smoking was allowed indoors, this would be a perfect time for a 10-minute cigarette break. These days, its a coffee.

Since we are scientists, or at least wanna-be tinkerers, we naturally ask, "How can we make this better?" Some problems with this script are:

You have to edit the file every time you want to analyze a new set of files.
You still need to check the results, both in terms of succesful program completion (file creation), and the quality of the results.
If there is a problem with data input, the script will fail. This gets annoying when the script fails 5 minutes after you leave on a Friday evening, then come back to the lab Sunday morning expecting to go over your fabulous results...

Example 2: Adding a Loop

One thing we can take advantage of is that the the same thing happens to every input file. We can thus create a "loop", which will run analysis for each file sequentially. This approach has two advantages:

The code is more compact and easily understood;
The number of files can vary.

We will create a list (also called an array) containing the filenames. The "for-loop" sequentially extracts each filename from the list, creates a string (command) that feeds the filename into "fast", and executes the command. The pink text shows what changed, compared to the previous example.

#!/bin/python
import os

fn_list=['/fullpath/fn_1.img', '/fullpath/fn_2.img', '/fullpath/fn_3.img']

for fn in fn_list:

com = 'fast -or '+fn
print com
res = os.system(com)

Our program can be called like this from the command line:

> go_fast.py

In python, loops and most other control structures (if, while, etc.) are denoted by indentation. You should refer to a Python textbook if you need more details, but if you just follow the examples here your program will work out just fine. The main point is that the indentations are not just good style, they are an integral part of the python syntax.

Constructing the command as a named variable string is better than constructing it inside the os.system call (as in Example 1) because you can print the exact command to the screen, so you can check if it is correct.

Example 3: Passing filenames to the program:

A handy feature is to pass the names of the files you want to process to the Python script. This way you don't need to edit the program for every new file. There are several ways to pass the filename, but my favorite way is to use the Python library, "optparse".

#!/bin/python
import os
from optparse import OptionParser

### Parse parameters, store into variables: ###
optparser = OptionParser()
(options, args) = optparser.parse_args()

fn_list=args
for fn in fn_list:

com = 'fast -or '+fn
print com
res = os.system(com)

The new command line program call looks like:

> go_fast.py /fullpath/fn_1.img /fullpath/fn_2.img /fullpath/fn_3.img

Each of the files will be passed as an argument, and placed into the list "args" as a string. Then, to make it obvious what is happening, we put the filename-strings into another list, "fn_list".

Example 4: Passing optional flags to the program:

It is easy to add options using the optparse library. Useful optional flags include:

verbose: print out extra informational messages
rollcall: check for the presence of all of the input files before calling the main program
overwrite: it is OK to overwrite (delete) existing output files
clean: delete unneeded temporary (intermediate) files created as a byproduct

#!/bin/python
import os
from optparse import OptionParser

### Parse parameters, store into variables: ###
optparser = OptionParser()
optparser.add_option("-v", "--verbose", action="store_true", dest="verbose", default=False, help="Print informational messages")

optparser.add_option("-r", "--rollcall", action="store_true", dest="rollcall", default=False, help="Perform rollcall to see if all needed files are present.")

(options, args) = optparser.parse_args()

# get shorthand parameters from "options": #
verbose = options.verbose
rollcall = options.rollcall

fn_list=args
for fn in fn_list:

com = 'fast -or '+fn
if rollcall:

print com

else:

if verbose: print com
res = os.system(com)

The "add_option" function from optparser adds options for "verbose" and "rollcall". Each variable is set up with a default value, in this case, "False" for both variables. Using the "rollcall" option prints out the command that would be issued, but does not actually run it. This lets you check the syntax of your command, and if the basic loop structure works as expected. Using the verbose command lets you be selective about which messages get printed. When you are first developing a script, it is a good idea to print out lots of messages, but after you get the kinks worked out,you only need a message if something goes wrong.

Example 4.5: Printing messages:

For a trouble-shooting message to be really helpful, you need to know both the problem, and the program that sent the message.

#!/bin/python
prog_str = ' (go_fast.py)'
import os
from optparse import OptionParser

# get shorthand parameters from "options": #
verbose = options.verbose
rollcall = options.rollcall

fn_list=args

if verbose: print ' '
if (verbose or rollcall): print '-------- '+prog_str+' --------'
for fn in fn_list:

com = 'fast -or '+fn
if rollcall:

print com

else:

if verbose: print com
res = os.system(com)

Notice the second line, "prog_str = ' (go_fast.py)'". This is a convenient way to print out where a message is coming from. If you have one script that calls another script, it can become difficult to figure out which one is printing a message, which in turn makes trouble-shooting more confusing. Putting the name of the program into a variable has two advantages:

it makes it easy to cut and paste code between programs, since you are always printing the code contained in a standard variable, "prog_str"
putting it at the top makes it easy to remember

Example 5: Rollcall to check input files:

The rollcall feature can be even more useful than the example above- it can also be used to check file input and output features before you set the entire script off and running. This can alert you to problems that might crash the script. Some things we should look at:

Input files:

Do all of the input files exist?
Can they be read?
By me?
Are they the correct image format?

Output files:

Do the output directories exist?
Can I write to them?
Do the output files already exist?
If so, can I overwrite (delete) them?

Dependent software:

Does the software I need exist on this system?

In the following example, file input/output checks are implemented whether or not "rollcall" is desired. The main difference is that for "rollcall", the "fast" program is not called, so all of the files needed for every loop iteration can be checked rapidly. It is still a good idea to check files when running the actual analysis, since:

things can change between rollcall and analysis (e.g. disk filled up)
you might forget to run rollcall
you have better control over how to handle problems that might arise
the same mechanism is used for "rollcall" as for running the actual program, so the code is simpler and you also know the results will be same

#!/bin/python
prog_str = ' (go_fast.py)'
import os
from optparse import OptionParser

# get shorthand parameters from "options": #
verbose = options.verbose
rollcall = options.rollcall

fn_list=args

if verbose: print ' '
if (verbose or rollcall): print '-------- '+prog_str+' --------'
for fn in fn_list:

com = 'fast -or '+fn
if verbose or rollcall:

print com

### Rollcall ###

# make sure user can read input files: #
if not os.access(fn, os.R_OK):

print 'Warning- you cannot read the input file: '+prog_str
print ' '+fn
sys.exit()

# Check if output directory exists, if contents can be overwritten:
# output directory will be same is input file directory:
dir_out = os.path.dirname(fn)
# make sure "dir_out" ends in a directory separator character:
sc = os.sep
last_char = dir_out[-1]
if (last_char != sc): dir_out=dir_out+sc
# see if output directory exists:
if not os.access(dir_out, os.F_OK):

print 'Warning- the output directory does not exist: '+prog_str
print ' '+dir_out
sys.exit()

# see if user can write to output direcotry:
if not os.access(dir_out, os.W_OK):

print 'Warning- you do not have permission to write to output directory: '+prog_str
print ' '+dir_out
sys.exit

# Run the command, if not rollcall:
if !rollcall:

res = os.system(com)

The mechanics of using the rollcall was changed a bit- now the command is printed initially if either verbose or rollcall is desired. Then, there are several checks for data integrity. Finally, if everything is OK, the command is run.

Example 5.5: Moving Rollcall out of the main script:

So by this time you are thinking, "Holy cow- my simple program has been taken over by rollcall!". Well yes, but there is a good way to simplify this. We still need to run all of these data checks (and a few more...), but we can move them out of the main body of the script, and into a "subroutine". This has several benefits:

your program is a lot more compact and easily understood
the program is easier to change and maintain
the rollcall can be invoked several times if needed
it is easy to use the rollcall approach in a new script if you modularize it

The following example shows how to move rollcall into a subroutine. Typically, the new rollcall subroutine is stored in the same file as the main script, but it comes first in the file. This way, the Python interpreter will find the subroutine and compile it before it is used in the main script. The main thing to notice is that the new subroutine returns a value: 1 if everything is OK, or 0 if there is a problem with the data. The main script uses this information to decide what to do (quit or continue).

A second thing is that in the previous example (5), a problem with a file would cause the script to halt, so only one problem at a time could be found. It is usually more convenient to detect all of the problems and report them all at the same time, since frequently a similar problem will be encountered with numerous files and it is easier to fix all the problems at once. The example below (5.5) detects and prints out problems for all of the files in one go.

#!/bin/python
import os

def rollcall(fn, verbose=0)

prog_str = ' (rollcall.go_fast.py)'

# assume file is OK:
file_OK=1
# make sure user can read input files: #
if not os.access(fn, os.R_OK):

print 'Warning- you cannot read the input file: '+prog_str
print ' '+fn
file_OK=0

# Check if output directory exists, if contents can be overwritten:
# output directory will be same is input file directory:
dir_out = os.path.dirname(fn)
# make sure "dir_out" ends in a directory separator character:
sc = os.sep
last_char = dir_out[-1]
if (last_char != sc): dir_out=dir_out+sc
# see if output directory exists:
if not os.access(dir_out, os.F_OK):

print 'Warning- the output directory does not exist: '+prog_str
print ' '+dir_out
file_OK=0

# see if user can write to output direcotry:
if not os.access(dir_out, os.W_OK):

print 'Warning- you do not have permission to write to output directory: '+prog_str
print ' '+dir_out
file_OK=0

return file_OK

prog_str = ' (go_fast.py)'

from optparse import OptionParser

# get shorthand parameters from "options": #
verbose = options.verbose
rollcall = options.rollcall

fn_list=args

if verbose: print ' '
if (verbose or rollcall): print '-------- '+prog_str+' --------'
for fn in fn_list:

com = 'fast -or '+fn
if verbose or rollcall:

print com

### Rollcall ###
rollcall_check = rollcall(fn, verbose=verbose)

# if rollcall fails and script is running, halt here:
if !rollcall and (rollcall_check == 0):

print 'Error- problem with file(s) for input file:'+prog_str
print ' '+fn
sys.exit()

else:

res = os.system(com)

In this example, rollcall is run every time, whether or not the rollcall flag is set. This makes it easy to spot problems. Notice how the main body of the script has become relatively compact again. In subsequent examples, the rollcall subroutine will not be reprinted.

A few new Python elements are introduced here:

The variable "verbose" is passed to the rollcall subroutine, so the child process will behave similarly as the parent (caller) process.
The subroutine returns a value (1 or 0), rather than using sys.exit to stop.
If a problem is encountered with rollcall prior toattempting to run the command, the script halts (gracefully) via sys.exit(). We could choose to continue on to process the next file, but this is usually considered bad form. It is important to spot bad data, so the program halts. If it continues, you might not notice the problem.

Example 6: Check results:

The final step is to check the output of the program. The following example shows how to construct the expected output filename and check for its existence. If the program fails, usually it will not create an output file. Sometimes, this check can be misleading, e.g. if a previous analysis was performed and an output file already exists. This is where it becomes important to check for the existence of the output both before and after calling the analysis program (fast).

#!/bin/python
prog_str = ' (go_fast.py)'
import os
from optparse import OptionParser
import utils_tro
import image_manip

# get shorthand parameters from "options": #
verbose = options.verbose
rollcall = options.rollcall

fn_list=args

if verbose: print ' '
if (verbose or rollcall): print '-------- '+prog_str+' --------'
for fn in fn_list:

com = 'fast -or '+fn
if verbose or rollcall:

print com

### Rollcall ###
rollcall_check = rollcall(fn, verbose=verbose)

# if rollcall fails and script is running, halt here:
if !rollcall and (rollcall_check == 0):

print 'Error- problem with file(s) for input file:'+prog_str
print ' '+fn
sys.exit()

else:

res = os.system(com)

# see if it worked (check for .img ofile):
fn_res = fn[:-4]+'_restore.img'
if not utils_tro.is_file(fn_res):

print 'Error- FAST failed for input file:'+prog_str
print ' '+fn
sys.exit(1)

# rename output file:
fn_new_in = fn[:-4]+'_R.img'
fn_new = image_manip.rename_anlz(fn_res, fn_new_in, overwrite=1)
if verbose: print 'Restored file: '+fn_new

print 'Finished.+prog_str

The latest addition checks for the existence of the output file. It only checks if not performing rollcall. The code assumes a particular naming convention by "fast", namely that it adds "_restore" to the basename of the input file.

A couple of new libraries have been imported, image_manip and utils_tro. These provide support for image analysis- contact Terry Oakes if you have questions about them.

The rename step can be helpful for managing the output filenames. It is useful to have meaningful filenames, so you can tell at a glance what sort of processing has been performed. However, this usually comes at the price of cumbersomely long filenames. A handy approach is to designate a single letter for each process stage, and to append this to the base filename. Renaming the output files lets you maintain a uniform naming convention, even if you switch from one software tool to another.

The last statement lets you know when the program is completed. This is handy, since sometimes a program will hang. If you don't get any output for a long time, suspect that it has hung. If there is no message letting you know the program has completed, you can wait a very long time...

Additional Rollcall checks:

Here are a few other things you can check, in order to iron out problems before you try to run a long analysis script:

Is external software available (e.g. FSL)?
Does the desired output file already exist?
If so, do I have permission to overwrite (delete) it?

Some items can be added to the rollcall subroutine, while other checks are best added to the main body of your script.

Helpful Code Snippets:

1) Rollcall items:

You can mix and match the following snippets in a rollcall program. These take virtually no time to run, and generally the more comprehensive the checking procedure is, the easier it is to avoid problems. Make the error message as detailed and specific as possible- this will help a lot in tracking down problems quickly.

### Make sure user can read input file: #
if not os.access(fn, os.R_OK):

print 'Warning- you cannot read the file: '+prog_str
print ' '+fn
return 0

### Make sure user can write to the output directory:
# output directory should be same as input file:
dir_out = os.path.dirname(fn) + os.sep
# see if the output directory exists:
if not os.access(dir_out, os.F_OK):

print 'Warning- the output directory does not exist: '+prog_str
print ' '+dir_out
return 0

# see if user has permission has write-permission for output directory:
if not os.access(dir_out, os.W_OK):

print 'Warning- you do not have permission to write to output directory: '+prog_str
print ' '+dir_out
return 0

### Create an output directory if it does not already exist:
flag=1
if not os.access(dir_out, os.W_OK):

msg = 'Warning- the output directory does not exist cannot be written to: '+prog_str
# See if it exists:
if os.access(dir_out, os.F_OK):

msg = 'Warning- the existing output directory cannot be written to: '+prog_str
flag=0

else:

# Try to create the output directory:
os.makedirs(dir_out)
msg = 'Creating new output directory: '+prog_str
# Make sure it exists:
if not os.access(dir_out, os.W_OK):

msg = 'Warning- failed to create new output directory: '+prog_str
flag=0

print msg
print ' '+dir_out

return flag

# If user wants to avoid overwriting existing files,
# see if there are any files in the output directory:
if (overwrite == 0):

if os.access(dir_out, os.W_OK):

# see if there is anything in the directory:
filelist=os.listdir(dir_out)
if (len(filelist) > 0):

print 'Warning- existing files in:'+prog_str
print ' '+dir_out
retuen 0

else:

print 'Error- cannot write to output directory.'+prog_str
return 0

2) Check to see if a file exists, in a very robust way:

# Make sure file exists:
if (utils_tro.is_file(fn) == 0):

print 'Error- input file does not exist:'+prog_str
print ' '+fn
sys.exit(1)

This is useful because the standard Python approach, e.g.

if not os.access(fn, os.R_OK):

requires a string to be input as '/fullpath/filename.ext'. Anything else (number, list, structure, etc.) will cause a crash. The library function "utils_tro.isfile()" performs a number of checks to make sure that when the existence of the file is eventually examined, a crash is unlikely.

3) Check for existence of required software tools.

When outside software (FSL, AIR, SPM, etc.) is called by your script, you should make sure the software is available, and also that the correct version is installed on your system. This example shows how to check for a program in the "AIR" coregistration suite, with a specific directory hard-coded in:

#---- Use the correct version of AIR: -----
dir_air '/apps/AIR/AIR5.2.5/16_bit_type2_linux/'
align = dir_air+'alignlinear'
# make sure a valid installation of AIR has been specified:
if not os.access(align, os.R_OK):

print 'Warning- an invalid AIR program location was specified: '+prog_str
print ' '+align
return 0

The following example checks for the existence of FSL software, using a more generic approach. The program "image_manip.get_path" maintains a lookup table of directory locations associated with specific software tools. As long as this centralized list is up-to-date, you don't need to worry about hardcoding a specific software version in your code. (The same approach can be used for AIR and several other software tools- see the program source code for more details.)

# make sure the correct version of flirt is accesible: #
dir_FSL = image_manip.get_path('fsl')
flirt = os.path.join(dir_FSL, 'flirt')

if not os.access(flirt, os.F_OK):
print 'Warning- the FSL flirt program is unavailable: '+prog_str
print ' '+flirt
return ['0']

4) Read lines from a text file:

# open the text file, read each line into a separate entry in a list: #
if verbose: print 'Opening text input: '+fn_txt
txt = open(fn_txt, 'r')
linelist = txt.readlines()
txt.close()
# see if the file had useful information:
if (len(linelist) < 2):

print 'Warning- text file does not contain useful information:'+prog_str
print ' '+fn_txt
sys.exit(1)

if verbose: print 'Text-file lines: '+str(len(linelist))

5) Find all files in a directory with pattern "T1*.img":

base_dir = '/study/data/mri/'
for fn in os.listdir(base_dir):

#find all files with pattern "T1*.img"
if (fn[:3] == 'T1_') and (fn[-4:] == '.img'):

# print filename:
print 'T1 filename: '+fn

6) Create a log file.

For complex or lengthy analysis, it is very useful to record the actions performed and any program messages (good or bad) in a text-format log-file. The following snippet shows how to create a log-file using "utils_tro.logfile". This function consolidates tasks for creating and writing to a log-file, making it convenient to include this useful feature in your code.

This snippet assumes that a variable flag, "log" is used in the larger program to indicate whether an optional log-file is desired.

# Create logfile: #
if log:

# make sure user can write to desired log-file directory:
fn_log_dir = os.path.dirname(fn_txt)
if not os.access(fn_log_dir, os.W_OK):

print 'Warning- you are unable to write to the directory: '+prog_str
print ' '+fn_txt
print ' No top-level log-file will be created.'
fn_log='none'

else:

# make logfile name:
fn_log = fn_txt+'_log_'+timestamp+'.txt'
# create new logfile:
fn_log_new=utils_tro.logfile(fn=fn_log, create=1, title='Logfile for batch process of rhesus PET, MRI registration to template')
# verify existence of new logfile:
if (utils_tro.is_file(fn_log_new) == 0):

print 'Error- could not create logfile:'+prog_str
print ' '+fn_log
sys.exit(1)

fn_log=utils_tro.logfile(fn=fn_log, w=[prog_str])
if rollcall:

fn_log=utils_tro.logfile(fn=fn_log, w=['===','ROLLCALL ONLY- NO PROCESSING PERFORMED.','==='])

else:

fn_log='none'

Writing subsequent information to the logfile is easy:

w=['Error- subject '+str(cntr_sub), 'Mandatory files does not exist:'+prog_str, fn]
fn_log=utils_tro.logfile(fn=fn_log, w=w)

The words that get written, "w", can be either a simple string or a list (array) of strings. See the comments in the program "utils_tro.logfile" for output formatting options. This subroutine handles all file openings and closings, so the impact on your program is minimal.

7) Delete temporary or intermediate files created during analysis:

Several large files can be created during the course of running data analysis. Usually the intermediate files are not needed, as they can be easily recreated from the input data again, if needed. Frequently, the intermediate files take up more disk space than the input and putput files combined! It is good practice to delete these intermediate files whenever possible.

The approach I use is to maintain a list of intermediate files as I work through an analysis path, and then to delete the files at the end of each loop. Typically the default action is to delete the files, but sometimes I want to examine them, e.g. if there was a problem with the results. By deleting all the the files from the list at the end, it is easy to choose whether to delete them or not.

A variable flag, "clean" is used to indicate whether deletion is desired.

### Remove temporary files if desired: ###
if clean:

if verbose:

print ' '
print 'Cleaning (deleting) intermediate files:'+prog_str

# make sure the input files are not on the temp / delete list: #
for ff in temp_files:

if (ff == fn_in):
temp_files.remove(ff)

# delete files from the checked list:
res = image_manip.delete_files(temp_files, verbose=verbose)

The program "image_manip.delete_files" handles non-existing files gracefully, and also checks for the existence of paired files. For example, an ANALYZE-7.5 file has a .hdr and a .img pair, but you only need to pass one of these two filenames. Both files (if they exist) will be deleted.

Program examples:

Here are a few examples of Python scripts you can mine for code snippets. They are roughly presented in order of complexity (least to most):

(I'm still working on this part...)

Questions or comments?

Contact Terry Oakes: troakes- at - wisc.edu