Description: NCBI BLAST sequence analysis tool
Only this development version is currently available.
The runtime environment sets the following environment variables, used throughout the examples below: NCBI_DIR (the BLAST installation directory), BLASTDB (a temporary directory where databases can be unpacked) and BLAST_NUM_CPUS (the number of threads available to blastall).
In addition, a command called prepare_db is available for unpacking gzipped database files into BLASTDB.
See the example below for usage.
It is not necessary to have the whole NCBI toolkit available; only the files in the subdirectories $NCBI_DIR/bin and $NCBI_DIR/data are required.
Here are two use cases for the runtime environment.
In the first example, we pull the database from a grid-enabled storage element as an input file. For moderate database sizes this is quite efficient, and the cluster frontend may be able to cache the database so that only one transfer actually takes place. Here we run a query against the UniProt/Swiss-Prot database taken from NDGF's BioGrid database storage.
Download the example files here.
The job description file blast.xrsl:
&
(executable=runblast.sh)
(jobname=blast_test_swissprot)
(stdout=std.out)
(stderr=std.err)
(gmlog=gridlog)
(cputime=60)
(memory=1000)
(disk=300)
(runtimeenvironment>=APPS/BIO/BLAST-2.2.18)
(inputfiles=
  ("input_sequence" "test10.fasta")
  ("uniprot_sprot.tar.gz" "srm://srm.ndgf.org/biogrid/db/uniprot/UniProt12.8/uniprot_sprot.blastdb.tar.gz")
)
The job script is very simple:
#!/bin/sh
echo "Hello BLAST!"

# match the database name in the job description
dbname="uniprot_sprot"

# extract the database to a temporary directory that has been
# selected by the runtime environment script
prepare_db $dbname.tar.gz

echo "Running with $BLAST_NUM_CPUS threads"
blastall -a $BLAST_NUM_CPUS -p blastp -d $dbname.fasta -i input_sequence
exitcode=$?

echo "Bye BLAST!"
exit $exitcode
Here the file test10.fasta from the current directory is used as input for the query. By using the renaming features of xRSL, the job script can be reused unchanged for different input files. The exit code from blastall is used as the exit code of the script; this way ARC knows whether the job succeeded or failed.
The second example is a copy of the first, with the database transferred from a local directory instead. Modifying the job description and obtaining a suitable database are left as an exercise to the reader.
The job description file blast.xrsl:
&
(executable=runblast.sh)
(jobname=blast_test2)
(stdout=std.out)
(stderr=std.err)
(gmlog=gridlog)
(cputime=60)
(memory=1000)
(disk=300)
(runtimeenvironment>=APPS/BIO/BLAST-2.2.18)
(inputfiles=
  ("input_sequence" "test10.fasta")
  ("database.tar.gz" "my_sequence_db.tar.gz")
)
The database name parameter on the blastall line of the job script has to be changed to match the user's database name, just as when running locally.
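For illustration, the changed part of the job script might look like the fragment below. The name of the database files inside the archive (here my_sequence_db) is an assumption; use whatever name your BLAST-formatted files actually carry.

```shell
# Fragment of the job script adapted to the second job description:
# the archive is staged in as database.tar.gz, while the database
# files inside it are assumed to be named my_sequence_db.*
prepare_db database.tar.gz
blastall -a $BLAST_NUM_CPUS -p blastp -d my_sequence_db.fasta -i input_sequence
```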
Source code and installation instructions for the NCBI BLAST software itself can be found on the NCBI FTP site. There may also be a prebuilt package for your OS.
Here is an example of installing BLAST (version 2.2.18, the latest at the time of writing).
Get the package:
$ wget ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
$ tar xvfz ncbi.tar.gz
Check the version:
$ ncbi/bin/blastall | grep blastall
blastall 2.2.18 arguments:
Copy the binaries and data directories to a shared file system (here /home/opt/) under the proper folder for the version:
$ mkdir -p /home/opt/ncbi/2.2.18
$ cp -r ncbi/bin ncbi/data /home/opt/ncbi/2.2.18/.
Download the runtime environment script templates (SGE version or PBS version) and the prepare_db script.
Modify the scripts as needed and save the main script in your ARC runtime directory as APPS/BIO/BLAST-2.2.18. Make sure the prepare_db script is on the PATH for grid jobs using the RE, for example by placing it in the BLAST installation's bin directory.
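If you want to understand what prepare_db has to do before adapting the template, a minimal sketch of its logic is shown below, written as a shell function for illustration (a standalone script would do the same with its "$1" argument). It assumes BLASTDB has already been exported by the runtime environment script; the template you downloaded is the reference implementation.

```shell
# Minimal sketch of a prepare_db implementation (an assumption of the
# interface, not the shipped script): unpack a gzipped database archive
# into the directory named by $BLASTDB so blastall can find it.
prepare_db() {
    archive="$1"
    if [ -z "$archive" ] || [ ! -f "$archive" ]; then
        echo "usage: prepare_db <database.tar.gz>" >&2
        return 1
    fi
    # create the scratch database directory and extract the archive there
    mkdir -p "$BLASTDB"
    tar xzf "$archive" -C "$BLASTDB"
}
```

Since blastall looks up databases via the BLASTDB environment variable, the job script can then refer to the database by bare name, as in the examples above.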
As long as the interface requirements are satisfied, the implementation does not really matter, and some adaptation is needed anyway to accommodate differences between cluster environments (batch queue systems, temporary directory locations, etc.).
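To make the interface concrete, here is a sketch of the environment-setup part of an RE script. ARC sources runtime environment scripts at several stages (with an argument of 0, 1 or 2); only the setup done just before job execution is shown, and all paths and the default thread count are site-specific assumptions that must match your installation.

```shell
# Sketch of the "before job execution" setup in an ARC RE script.
# The installation path matches the example above; adapt to your site.
NCBI_DIR=/home/opt/ncbi/2.2.18
export NCBI_DIR

# make blastall and prepare_db visible to the job script
PATH=$NCBI_DIR/bin:$PATH
export PATH

# writable scratch directory where prepare_db unpacks databases
BLASTDB=${TMPDIR:-/tmp}/blastdb.$$
export BLASTDB

# thread count handed to "blastall -a"; set this from the slots your
# batch system grants the job (a conservative default of 1 is used here)
BLAST_NUM_CPUS=${BLAST_NUM_CPUS:-1}
export BLAST_NUM_CPUS
```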
Contact firstname.lastname@example.org with any grid-use-specific questions. For sequence analysis questions, contact your local BLAST guru.