README.md 7.42 KB
Newer Older
Adrien Leger's avatar
Adrien Leger committed
1
2
# pyBlast 0.1

Adrien Leger's avatar
Adrien Leger committed
3
See [GitHub page]( http://a-slide.github.io/pyBlast)
Adrien Leger's avatar
Adrien Leger committed
4

Adrien Leger's avatar
Adrien Leger committed
5
**Simple and lightweight Python 2.7 wrapper module for NCBI BLAST+**
Adrien Leger's avatar
Adrien Leger committed
6
7
8

**Creation : 2015/05/18**

Adrien Leger's avatar
Adrien Leger committed
9
**Last update : 2015/05/22**
Adrien Leger's avatar
Adrien Leger committed
10

Adrien Leger's avatar
Adrien Leger committed
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
## BlastHit

Python object representing a hit found by blastn. The object contains the following public fields:

* id: Auto incremented unique identifier [INT]
* q_id: Query sequence name [STR]
* s_id: Subject sequence name [STR]
* identity: % of identity in the hit [FLOAT 0:100]
* length: length of the hit [INT >=0]
* mis: Number of mismatch in the hit [INT >=0]
* gap: Number of gap in the hit [INT >=0]
* q_start: Hit start position of the query sequence [INT >=0]
* q_end: Hit end position of the query sequence [INT >=0]
* s_start: Hit start position of the subject sequence [INT >=0]
* s_end: Hit end position of the subject sequence [INT >=0]
* evalue: E value of the alignment [FLOAT >=0]
* bscore: Bit score of the alignment[FLOAT >=0]
* q_seq: Sequence of the query aligned on the subject sequence [STR]
* q_orient: Orientation of the query sequence [+ or -]
* s_orient: Orientation of the subject sequence [+ or -]

The validity of numeric value is checked upon instantiation. Invalid values will raise assertion errors.

## Blastn

This class contain the wrapper for Blastn and require the installation of ncbi Blast+ 2.2.28+.

### Setup Blastn object: Create subject database

Upon instantiation, a database is created from the user-provided subject sequence. Database files are created in a temporary directory.
The following parameters can be customized at Blastn objects instantiation
* ref_path: Path to the reference fasta file (not gzipped). Mandatory
* makeblastdb_exec: Path of the makeblastdb executable. Default = "makeblastdb"
* makeblastdb_opt: makeblastdb command line options as a string. Default = ""
* dbtype: Molecule type of target db ('nucl', 'prot'). Default = "nucl"
* input_type: type of the data specified in input_file ('asn1_bin', 'asn1_txt', 'blastdb', 'fasta'). Default = "fasta"

To ensure a proper database files deletion at the end of the execution it is possible to call the object using the `with` statement.
Alternatively you can call the `rm_db` method at the end of the Blastn usage.

**Code**
```
with Blastn(ref_path="./subject.fa") as blastn:
    print (blastn)
```
**Output**
```
CREATE DATABASE: makeblastdb  -dbtype nucl -input_type fasta -in subject.fa -out /tmp/tmpihszgZ/subject

MAKEBLASTDB CLASS	Parameters list
	db_basename	subject
	db_dir	/tmp/tmpihszgZ
	db_path	/tmp/tmpihszgZ/subject
	dbtype	nucl
	input_type	fasta
	makeblastdb_exec	makeblastdb
	makeblastdb_opt	
	ref_path	subject.fa

Cleaning up blast DB files for "subject"
```
Adrien Leger's avatar
Adrien Leger committed
72

Adrien Leger's avatar
Adrien Leger committed
73
### Calling Blastn object: Perform Blastn and return a list of hits
Adrien Leger's avatar
Adrien Leger committed
74

Adrien Leger's avatar
Adrien Leger committed
75
76
77
The Blastn object can then be directly called, as many time as desired, with query fasta files, that can contain several sequences.
The method will automatically call blastn in a multiprocessing fashion, using as many threads as possible.
The following parameters can be customized at Blastn objects calling:
Adrien Leger's avatar
Adrien Leger committed
78

Adrien Leger's avatar
Adrien Leger committed
79
80
81
82
83
84
* query_path: Path to a fasta file containing the query sequences (not gzipped). Mandatory
* blast_exec: Path of the blast executable. By Default blastn will be used. Default = "blastn"
* blastn_opt: Blastn command line options as a string. Default = ""
* task: Type of blast to be performed ('blastn' 'blastn-short' 'dc-megablast' 'megablast' 'rmblastn'). Default = "dc-megablast"
* evalue: E Value cuttoff to retain alignments. Default = 1
* best_query_hit: find and return only the best hit per query. Default = False
Adrien Leger's avatar
Adrien Leger committed
85

Adrien Leger's avatar
Adrien Leger committed
86
87
A list containing 1 BlastHit object for each query hit found in the subject will be returned, except if not hit were found in which situation 'None' will be returned. 
If the best_query_hit flag was set to True, Only the best hit per query sequence from the query file will be returned.
Adrien Leger's avatar
Adrien Leger committed
88

Adrien Leger's avatar
Adrien Leger committed
89
90
91
92
93
94
95
96
97
98
**Code**
```
with Blastn(ref_path="./subject.fa") as blastn:
    hit_list = blastn(query_path="./query.fa")
    for hit in hit_list:
        print (hit)
```
**Output**
```
CREATE DATABASE: makeblastdb  -dbtype nucl -input_type fasta -in ./subject.fa -out /tmp/tmp1ZBlfT/subject
Adrien Leger's avatar
Adrien Leger committed
99

Adrien Leger's avatar
Adrien Leger committed
100
MAKE BLAST: blastn  -num_threads 4 -task dc-megablast -evalue 1 -outfmt "6 std qseq" -dust no -query ./query.fa -db /tmp/tmp1ZBlfT/subject
Adrien Leger's avatar
Adrien Leger committed
101

Adrien Leger's avatar
Adrien Leger committed
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
	2 hits found
HIT 0	Query	query1:0-48(+)
	Subject	subject:19-67(+)
	Lenght : 48	Identity : 100.0%	Evalue : 2e-23	Bit score : 87.8
	Aligned query seq : GCATGCTCGATCAGTAGCTCTCAGTACGCATACGCTAGCATCACGACT

HIT 1	Query	query2:0-48(+)
	Subject	subject:89-137(+)
	Lenght : 48	Identity : 100.0%	Evalue : 2e-23	Bit score : 87.8
	Aligned query seq : CGCATCGACTCGATCTGATCAGCTCACAGTCAGCATCAGCTACGATCA

Cleaning up blast DB files for "subject"
```


## Testing pyBlast module

The module can be easily tested thanks to pytest

* Install pytest with pip `pip instal pytest`
Adrien Leger's avatar
Adrien Leger committed
122
* Run test with py.test-2.7  -v
Adrien Leger's avatar
Adrien Leger committed
123

Adrien Leger's avatar
Adrien Leger committed
124
Example of output if successful. Please note than some tests might fail due to the random sampling of DNA sequences, and uncertainties of Blastn algorithm.
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
```
========================================== test session starts ===========================================
platform linux2 -- Python 2.7.5 -- py-1.4.27 -- pytest-2.7.0 -- /usr/bin/python
rootdir: /home/adrien/Programming/Python/pyBlast, inifile: 
collected 21 items 

test_pyBlast.py::test_BlastHit[4.16866907958-57-98-69-88-12-100-43-1.40452897105-47.3666242716] PASSED
test_pyBlast.py::test_BlastHit[-1-7-10-20-73-54-25-45-98.7921480151-45.2397166228] xfail
test_pyBlast.py::test_BlastHit[8.92741377413--1-100-36-34-33-14-71-18.8547135761-97.6604693294] xfail
test_pyBlast.py::test_BlastHit[10.5987790458-46--1-45-78-81-86-86-73.8740266727-56.887410005] xfail
test_pyBlast.py::test_BlastHit[66.8213911219-62-48--1-91-10-60-20-88.7850139735-81.7901609219] xfail
test_pyBlast.py::test_BlastHit[86.6626174287-29-83-34--1-53-57-68-17.9799756069-7.83036609495] xfail
test_pyBlast.py::test_BlastHit[5.23985331666-43-85-33-7--1-14-3-74.2130782704-88.9289495285] xfail
test_pyBlast.py::test_BlastHit[75.6935977321-8-78-68-10-39--1-74-44.1447867052-22.5203082483] xfail
test_pyBlast.py::test_BlastHit[39.8692596061-60-5-49-77-9-31--1-2.59963139531-46.3133849683] xfail
test_pyBlast.py::test_BlastHit[15.7192632366-24-92-1-64-82-83-90--1-75.5540618409] xfail
test_pyBlast.py::test_BlastHit[18.6627439886-34-57-60-5-45-26-40-77.7840842678--1] xfail
test_pyBlast.py::test_Blastn[blastn-Queries from Subject] PASSED
test_pyBlast.py::test_Blastn[blastn-Random queries] xfail
test_pyBlast.py::test_Blastn[blastn-short-Queries from Subject] PASSED
test_pyBlast.py::test_Blastn[blastn-short-Random queries] xfail
test_pyBlast.py::test_Blastn[dc-megablast-Queries from Subject] PASSED
test_pyBlast.py::test_Blastn[dc-megablast-Random queries] xfail
test_pyBlast.py::test_Blastn[megablast-Queries from Subject] PASSED
test_pyBlast.py::test_Blastn[megablast-Random queries] xfail
test_pyBlast.py::test_Blastn[rmblastn-Queries from Subject] PASSED
test_pyBlast.py::test_Blastn[rmblastn-Random queries] xfail

================================== 6 passed, 15 xfailed in 5.91 seconds ==================================
```

Adrien Leger's avatar
Adrien Leger committed
156
## Dependencies
Adrien Leger's avatar
Adrien Leger committed
157
158
159

* [ncbi Blast+ 2.2.28+](http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download)
* [python package pytest](http://pytest.org/latest/): `pip instal pytest`
Adrien Leger's avatar
Adrien Leger committed
160

Adrien Leger's avatar
Adrien Leger committed
161
## Authors and Contact
Adrien Leger's avatar
Adrien Leger committed
162

Adrien Leger's avatar
Adrien Leger committed
163
164
165
166
Adrien Leger - 2015

* <adrien.leger@gmail.com> - <adrien.leger@inserm.fr> - <adrien.leger@univ-nantes.fr>
* [Github](https://github.com/a-slide)
Adrien Leger's avatar
Adrien Leger committed
167
* [Atlantic Gene Therapies - INSERM 1089](http://www.atlantic-gene-therapies.fr/)