Description: NCBI BLAST sequence analysis tool
Only this development version is currently available.
The runtime environment sets the following environment variables, used throughout the examples below: NCBI_DIR (the BLAST installation directory), BLASTDB (a temporary directory where databases can be unpacked) and BLAST_NUM_CPUS (the number of threads available to blastall).
In addition, a command called prepare_db is available for unpacking gzipped database files into BLASTDB.
See the example below for usage.
It is not necessary to have the whole NCBI toolkit available; only the files in the subdirectories $NCBI_DIR/bin and $NCBI_DIR/data are required.
Here are two use cases for the runtime environment.
In the first example, we pull the database from a grid-enabled storage element as an input file. For moderate database sizes this is quite efficient, and the cluster frontend may be able to cache the database so that only one transfer actually takes place. Here we run a query against the UniProt/Swiss-Prot database taken from NDGF's BioGrid database storage.
Download the example files here.
The job description file blast.xrsl:
&
(executable=runblast.sh)
(jobname=blast_test_swissprot)
(stdout=std.out)
(stderr=std.err)
(gmlog=gridlog)
(cputime=60)
(memory=1000)
(disk=300)
(runtimeenvironment>=APPS/BIO/BLAST-2.2.18)
(inputfiles=
  ("input_sequence" "test10.fasta")
  ("uniprot_sprot.tar.gz" "srm://srm.ndgf.org/biogrid/db/uniprot/UniProt12.8/uniprot_sprot.blastdb.tar.gz")
)
The job script is very simple:
#!/bin/sh
echo "Hello BLAST!"

# match the database name in the job description
dbname="uniprot_sprot"

# extract the database to a temporary directory that has been
# selected by the runtime environment script
prepare_db $dbname.tar.gz

echo "Running with $BLAST_NUM_CPUS threads"
blastall -a $BLAST_NUM_CPUS -p blastp -d $dbname.fasta -i input_sequence
exitcode=$?

echo "Bye BLAST!"
exit $exitcode
Here the file test10.fasta from the current directory is used as input for the query. By using the renaming features of xRSL, the job script can be reused unchanged for different input files. The exit code from blastall is used as the exit code of the script; this way ARC knows whether the job succeeded or failed.
The second example is a copy of the first, with the database transferred from a local directory instead. Modifying the job description and obtaining a suitable database are left as an exercise to the reader.
The job description file blast.xrsl:
&
(executable=runblast.sh)
(jobname=blast_test2)
(stdout=std.out)
(stderr=std.err)
(gmlog=gridlog)
(cputime=60)
(memory=1000)
(disk=300)
(runtimeenvironment>=APPS/BIO/BLAST-2.2.18)
(inputfiles=
  ("input_sequence" "test10.fasta")
  ("database.tar.gz" "my_sequence_db.tar.gz")
)
The database name parameter on the blastall line of the job script has to be changed to match the user's database name, just as when running locally.
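For illustration, the changed part of the job script might look like the fragment below. The name of the database files inside the archive (here my_sequence_db) is an assumption; use whatever name your BLAST-formatted files actually carry.

```shell
# Fragment of the job script adapted to the second job description:
# the archive is staged in as database.tar.gz, while the database
# files inside it are assumed to be named my_sequence_db.*
prepare_db database.tar.gz
blastall -a $BLAST_NUM_CPUS -p blastp -d my_sequence_db.fasta -i input_sequence
```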
Source code and installation instructions for the NCBI BLAST software itself can be found on the NCBI FTP site. There may also be a prebuilt package for your OS.
Here is an example of installing BLAST (version 2.2.18, the latest at the time of writing).
Get the package:
$ wget ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz
$ tar xvfz ncbi.tar.gz
Check the version:
$ ncbi/bin/blastall | grep blastall
blastall 2.2.18 arguments:
Copy the binaries and data directories to a shared file system (here /home/opt/) under the proper folder for the version:
$ mkdir -p /home/opt/ncbi/2.2.18
$ cp -r ncbi/bin ncbi/data /home/opt/ncbi/2.2.18/.
Download the runtime environment script templates (SGE version or PBS version) and the prepare_db script.
Modify the scripts as needed and save the main script in your ARC runtime directory as APPS/BIO/BLAST-2.2.18. Make sure the prepare_db script is on the PATH for grid jobs using the RE, for example by placing it in the BLAST installation's bin directory.
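If you want to understand what prepare_db has to do before adapting the template, a minimal sketch of its logic is shown below, written as a shell function for illustration (a standalone script would do the same with its "$1" argument). It assumes BLASTDB has already been exported by the runtime environment script; the template you downloaded is the reference implementation.

```shell
# Minimal sketch of a prepare_db implementation (an assumption of the
# interface, not the shipped script): unpack a gzipped database archive
# into the directory named by $BLASTDB so blastall can find it.
prepare_db() {
    archive="$1"
    if [ -z "$archive" ] || [ ! -f "$archive" ]; then
        echo "usage: prepare_db <database.tar.gz>" >&2
        return 1
    fi
    # create the scratch database directory and extract the archive there
    mkdir -p "$BLASTDB"
    tar xzf "$archive" -C "$BLASTDB"
}
```

Since blastall looks up databases via the BLASTDB environment variable, the job script can then refer to the database by bare name, as in the examples above.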
As long as the interface requirements are satisfied, the implementation does not really matter, and some adaptation is needed anyway to accommodate differences between cluster environments (batch queue systems, temporary directory locations, etc.).
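To make the interface concrete, here is a sketch of the environment-setup part of an RE script. ARC sources runtime environment scripts at several stages (with an argument of 0, 1 or 2); only the setup done just before job execution is shown, and all paths and the default thread count are site-specific assumptions that must match your installation.

```shell
# Sketch of the "before job execution" setup in an ARC RE script.
# The installation path matches the example above; adapt to your site.
NCBI_DIR=/home/opt/ncbi/2.2.18
export NCBI_DIR

# make blastall and prepare_db visible to the job script
PATH=$NCBI_DIR/bin:$PATH
export PATH

# writable scratch directory where prepare_db unpacks databases
BLASTDB=${TMPDIR:-/tmp}/blastdb.$$
export BLASTDB

# thread count handed to "blastall -a"; set this from the slots your
# batch system grants the job (a conservative default of 1 is used here)
BLAST_NUM_CPUS=${BLAST_NUM_CPUS:-1}
export BLAST_NUM_CPUS
```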
Contact firstname.lastname@example.org with any grid-use-specific questions. For sequence analysis questions, contact your local BLAST guru.