Commit c393f2b0 authored by adrien

Merge pull request #3 from a-slide/dev

Dev
parents bcecb23c 1648e1e9
......@@ -56,6 +56,5 @@ docs/_build/
target/
# User specific paths
/test/local_dataset/
/test/test/
/src/Quade
/test/local*
# Quade
# Quade 0.3
**Fastq file demultiplexer, handling double indexing, molecular indexing and filtering based on index quality**
**Fastq files demultiplexer, handling double indexing, molecular indexing and filtering based on index quality (Pure Python2.7)**
[see GitHub Page](http://a-slide.github.io/Quade)
**Creation : 2015/01/07**
**Last update : 2015/03/30**
**Last update : 2015/03/31**
## Motivation
......@@ -35,11 +35,10 @@ Specific features:
## Dependencies
The program was developed under Linux Mint 17 and require a python 2.7 environment.
The following dependencies are required for proper program execution:
The program was developed under Linux Mint 17 and was not tested with other OS.
In addition to python2.7 the following dependencies are required for proper program execution:
* python package [numpy](http://www.numpy.org/) 1.7.1+
* python package [HTSeq](http://www-huber.embl.de/users/anders/HTSeq/doc/index.html#) 0.6.1p1+
If you have pip already installed, enter the following line to install packages: ```sudo pip install numpy HTSeq```
......@@ -49,7 +48,7 @@ If you have pip already installed, enter the following line to install packages:
* Enter the src folder of the program folder and make the main script executable ```sudo chmod u+x Quade.py```
* Finally, Add Quade.py to your PATH
* Finally, add Quade.py to your PATH
## Usage
......@@ -64,12 +63,12 @@ In the folder where fastq files will be created
-i Generate an example configuration file and exit [Facultative]
An example configuration file can be generated by running the program with the option -i
The configuration of options is described directly in the configuration file.
The possible options are extensively described in the configuration file.
The program can be tested from the test folder with the dataset provided and the default configuration file.
```
cd ./test/result
Quade.py -i
Quade.py -c Conf_file.txt
Quade.py -c Quade_conf_file.txt
```
......
#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-
"""
@package Conf_file
@brief Contain the template of the empty configuration file
@package Quade
@brief Contain the template of the empty configuration file for Quade
@copyright [GNU General Public License v2](http://www.gnu.org/licenses/gpl-2.0.html)
@author Adrien Leger - 2014
* <adrien.leger@gmail.com>
......@@ -15,18 +14,18 @@
def write_example_conf():
with open ("Conf_file.txt", 'wb') as fp:
with open ("Quade_conf_file.txt", 'wb') as fp:
fp.write ("""
###################################################################################################
# QUADE CONFIGURATION FILE #
###################################################################################################
# Values can be customized with user values, but the file template must remain unchanged,
# otherwise the program will not be able to load default values.
#
# - Values identified with '**' in the descriptor are not recommended to be modified
###################################################################################################
[quality]
# The quality encoding of your sequence have to be Illumina 1.8+ Phred+33. The program do not
# The quality encoding of your sequence have to be Illumina 1.8+ Phred+33. The program does not
# manage the other encoding scales
# Minimal quality for one base of the index to consider a read pair valid. 0 if no filtering
......
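The `[quality]` section above requires Phred+33 encoding and a minimal quality per index base. A minimal sketch of such a filter follows. The function names `min_phred33` and `index_passes` are hypothetical, and the use of `>=` for the threshold comparison is an assumption, since the actual filtering code is not part of this diff.

```python
def min_phred33(qual_string):
    """Return the lowest Phred score in a Phred+33 encoded quality string."""
    # In Phred+33, each ASCII character encodes (score + 33)
    return min(ord(char) - 33 for char in qual_string)

def index_passes(qual_string, minimal_qual):
    """Tell whether every base of an index read reaches minimal_qual.

    Assumption: a threshold of 0 disables filtering, as the configuration
    comment suggests, and a base exactly at the threshold is accepted.
    """
    if minimal_qual == 0:
        return True
    return min_phred33(qual_string) >= minimal_qual

if __name__ == "__main__":
    # 'I' encodes 40, ':' encodes 25 and '5' encodes 20
    print(index_passes("II:I", 25))
    print(index_passes("II5I", 25))
```

With the default `minimal_qual : 25`, a single base below 25 in the index would be enough to send the read pair to the "fail" output under these assumptions.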
#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-
"""
@package Quade
@brief Contain a class to model a fastq sequence and an iterator function to read fastq files
@copyright [GNU General Public License v2](http://www.gnu.org/licenses/gpl-2.0.html)
@author Adrien Leger - 2014
* <adrien.leger@gmail.com>
* <adrien.leger@inserm.fr>
* <adrien.leger@univ-nantes.fr>
* [Github](https://github.com/a-slide)
* [Atlantic Gene Therapies - INSERM 1089] (http://www.atlantic-gene-therapies.fr/)
"""
# Standard library imports
from gzip import open as gopen
# Third party imports
# Third party imports
import numpy as np
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
......@@ -66,16 +77,19 @@ class FastqSeq (object):
descr = self.descr+other.descr)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# FUNCTIONS
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
def is_gz(fp):
""" Indicate if a file is gziped """
return fp[-2:].lower() == "gz"
def FastqReader (fastq_file):
""" Simple fastq reader returning a generator over a fastq file """
try:
# Open the file depending of the compression status
if fastq_file[-2:].lower() == "gz":
fastq = gopen(fastq_file, "rb")
else:
fastq = open(fastq_file, "rb")
fastq = gopen(fastq_file, "rb") if is_gz(fastq_file) else open(fastq_file, "rb")
i=0
# Iterate on the file until the end
......
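The hunk above factors the compression test into `is_gz` and uses it to pick the right opener. Since the rest of `FastqReader` is elided here, the following is only a sketch of how a gzip-aware fastq generator could look. The name `fastq_records` and the `(name, seq, qual)` tuple are assumptions made for illustration; the real reader returns `FastqSeq` objects whose constructor is not shown in this diff.

```python
from gzip import open as gopen

def is_gz(path):
    """Indicate if a file name ends with 'gz' (same test as in the diff)."""
    return path[-2:].lower() == "gz"

def fastq_records(path):
    """Yield (name, seq, qual) tuples from a plain or gzipped fastq file.

    Assumption: records are strictly 4 lines long (no wrapped sequences).
    """
    handle = gopen(path, "rb") if is_gz(path) else open(path, "rb")
    try:
        while True:
            name = handle.readline().decode("ascii").rstrip()
            if not name:
                break  # end of file (or trailing blank line)
            seq = handle.readline().decode("ascii").rstrip()
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().decode("ascii").rstrip()
            yield name[1:], seq, qual  # drop the leading '@'
    finally:
        handle.close()
```

Reading in binary mode and decoding each line keeps the sketch usable on both Python 2.7 and Python 3.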
#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-
"""
@package FastqWriter
@brief Helper class for Sample
@package Quade
@brief Contain a class that Handle fastq writing
@copyright [GNU General Public License v2](http://www.gnu.org/licenses/gpl-2.0.html)
@author Adrien Leger - 2014
* <adrien.leger@gmail.com>
......@@ -35,13 +34,13 @@ class FastqWriter (object):
self.R2_buffer = ""
# Fundamental class functions str and repr
def __repr__(self):
def __str__(self):
msg = "FASTQ_WRITER CLASS\n"
for key, value in self.__dict__.items():
msg+="\t\t{}\t{}\n".format(key, value)
return (msg)
def __str__(self):
def __repr__(self):
return "<Instance of {} from {} >\n".format(self.__class__.__name__, self.__module__)
#~~~~~~~PUBLIC METHODS~~~~~~~#
......
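This commit swaps `__str__` and `__repr__` in `FastqWriter`, `Quade` and `Sample`, so that `print` shows the verbose field dump while containers and the interactive prompt show the short instance tag. The effect can be sketched with a hypothetical `Demo` class (not part of Quade) that follows the same pattern:

```python
class Demo(object):
    """Hypothetical class mirroring the __str__/__repr__ layout after the commit."""

    def __init__(self):
        self.name = "S1"

    def __str__(self):
        # Verbose, human-readable dump used by print() and str.format()
        msg = "DEMO CLASS\n"
        for key, value in sorted(self.__dict__.items()):
            msg += "\t{}\t{}\n".format(key, value)
        return msg

    def __repr__(self):
        # Short tag used by repr(), containers and the interactive prompt
        return "<Instance of {} from {} >\n".format(
            self.__class__.__name__, self.__module__)

if __name__ == "__main__":
    demo = Demo()
    print(demo)    # calls __str__
    print([demo])  # list formatting calls __repr__ on each element
```

Before the swap, printing a list of such objects would have dumped every field of every element, which is likely the motivation for the change, although the commit message does not say so.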
......@@ -42,7 +42,7 @@ class Quade(object):
#~~~~~~~CLASS FIELDS~~~~~~~#
VERSION = "Quade 0.2"
VERSION = "Quade 0.3"
USAGE = "Usage: %prog -c Conf.txt [-i -h]"
#~~~~~~~CLASS METHODS~~~~~~~#
......@@ -148,7 +148,7 @@ class Quade(object):
print ("One of the value in the configuration file is not correct\n" + E.message)
sys.exit(1)
def __repr__(self):
def __str__(self):
msg = "QUADE CLASS\n\tParameters list\n"
# list all values in object dict in alphabetical order
keylist = [key for key in self.__dict__.keys()]
......@@ -157,10 +157,9 @@ class Quade(object):
msg+="\t{}\t{}\n".format(key, self.__dict__[key])
return (msg)
def __str__(self):
def __repr__(self):
return "<Instance of {} from {} >\n".format(self.__class__.__name__, self.__module__)
#~~~~~~~PUBLIC METHODS~~~~~~~#
def __call__(self):
......@@ -194,7 +193,7 @@ class Quade(object):
# Iterate over fastq chunks for sequence and index reads
for n, (R1, R2, I1, I2) in enumerate (zip (self.seq_R1, self.seq_R2, self.index_R1, self.index_R2)):
print("Start parsing chunk {}/{}".format(n+1, len(self.seq_R1)))
print("START PARSING CHUNK {}/{}".format(n+1, len(self.seq_R1)))
# Init FastqReader generators
R1_gen = FastqReader(R1)
......@@ -218,7 +217,7 @@ class Quade(object):
Sample.FINDER (read1,read2, index, molecular)
except StopIteration as E:
print("\tEnd of chunk {}/{}".format(n+1))
print("\tEnd of chunk {}".format(n+1))
def simple_index_parser (self):
......@@ -232,7 +231,7 @@ class Quade(object):
R2_gen = FastqReader(R2)
I1_gen = FastqReader(I1)
# Iterate over read in fastq files until it is exhaust
# Iterate over reads in fastq files until exhaustion
try:
while True:
read1 = R1_gen.next()
......
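Both parsers above advance several `FastqReader` generators in lockstep and rely on `StopIteration` to detect the end of a chunk. A minimal sketch of that pattern is given below. The function name `parse_chunk` and the returned list are assumptions for illustration (the real code hands each read to `Sample.FINDER`), and the built-in `next()` is used instead of the `.next()` method so the sketch also runs on Python 3.

```python
def parse_chunk(r1_gen, r2_gen, i1_gen):
    """Collect (read1, read2, index) triples until one generator is exhausted."""
    triples = []
    try:
        while True:
            read1 = next(r1_gen)
            read2 = next(r2_gen)
            index = next(i1_gen)
            triples.append((read1, read2, index))
    except StopIteration:
        # Raised by the first exhausted generator; reads already pulled from
        # the other generators in this round are discarded
        pass
    return triples
```

If the files of a chunk do not contain the same number of reads, this pattern stops silently at the shortest one, so a length check may be worth adding.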
#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-
"""
@package Sample
@brief Helper class for Quade
@package Quade
@brief Helper class for Quade to represent Samples
@copyright [GNU General Public License v2](http://www.gnu.org/licenses/gpl-2.0.html)
@author Adrien Leger - 2014
* <adrien.leger@gmail.com>
......@@ -158,13 +157,13 @@ class Sample(object):
return (self.pass_qual+self.fail_qual)
# Fundamental class functions str and repr
def __repr__(self):
def __str__(self):
msg = "SAMPLE CLASS\n"
for key, value in self.__dict__.items():
msg+="\t{}\t{}\n".format(key, value)
return (msg)
def __str__(self):
def __repr__(self):
return "<Instance of {} from {} >\n".format(self.__class__.__name__, self.__module__)
#~~~~~~~PRIVATE METHODS~~~~~~~#
......
###################################################################################################
# QUADE CONFIGURATION FILE #
###################################################################################################
# Values can be customized with user values, but the file template must remain unchanged,
# otherwise the program will not be able to load default values.
# - Values identified with '**' in the descriptor are not recommended to be modified
###################################################################################################
[quality]
# The quality encoding of your sequences has to be Illumina 1.8+ Phred+33. The program does not
# handle other encoding scales
# Minimal quality for one base of the index to consider a read pair valid. 0 if no filtering
# required. (INTEGER)
minimal_qual : 25
###################################################################################################
[fastq]
# Path to the fastq files containing non-demultiplexed sequences. Since fastq files can be split into
# several chunks of data, a list of fastq files can be given for each category, in the same order in
# all categories. seq_R1 and seq_R2 are for the fastq files containing the insert sequencing reads,
# index_R1 is for the first index read and index_R2 for the second index read (required for double
# indexing only). Usually for simple indexing: seq_R1 = R1, seq_R2 = R3 and index_R1 = R2. For
# double indexing: seq_R1 = R1, seq_R2 = R4, index_R1 = R2 and index_R2 = R3
seq_R1 : ../dataset/C1_R1.fastq.gz ../dataset/C2_R1.fastq.gz ../dataset/C3_R1.fastq.gz
seq_R2 : ../dataset/C1_R4.fastq.gz ../dataset/C2_R4.fastq.gz ../dataset/C3_R4.fastq.gz
index_R1 : ../dataset/C1_R2.fastq.gz ../dataset/C2_R2.fastq.gz ../dataset/C3_R2.fastq.gz
index_R2 : ../dataset/C1_R3.fastq.gz ../dataset/C2_R3.fastq.gz ../dataset/C3_R3.fastq.gz
###################################################################################################
[index]
# With the exception of index 1, which is mandatory, indicate with a boolean whether each index is
# to be used. index2 is used only if samples are double indexed, molecular1 if molecular barcoding
# was performed and molecular2 in case of double molecular indexing (BOOLEAN)
index2 : True
molecular1 : True
molecular2 : True
# Indicate the start and end positions of each index and molecular barcode within the index read
# using 1-based coordinates. If an index is not in use (see the previous section), its positions
# are not required (INTEGERS)
index1_start : 1
index1_end : 4
index2_start : 1
index2_end : 4
molecular1_start : 4
molecular1_end : 6
molecular2_start : 4
molecular2_end : 6
###################################################################################################
[output]
# Write fastq files for each sample containing reads passing the quality filter **
write_pass : True
# Write fastq files for each sample containing reads failing the quality filter
write_fail : True
# Write fastq files containing reads whose sample is undetermined
write_undetermined : True
###################################################################################################
# SAMPLE DEFINITIONS
# It is possible to include as many independent samples as required by duplicating an entire sample
# section. All sample names and indexes have to be unique and organized as follows:
# [sampleX] = Sample identifier section, where X is the sample id number starting from 1 for the
# first sample and incrementing by 1 for each additional sample
# name = Unique identifier that will be used to prefix the read files (STRING)
# index1_seq = Index 1 DNA sequences associated with the sample (STRING)
# index2_seq = Similar to index 1, only in case of double indexing (STRING)
[sample1]
name : S1
index1_seq : ACAG
index2_seq : ACAG
[sample2]
name : S2
index1_seq : CTTG
index2_seq : CTTG
\ No newline at end of file
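The configuration file above uses `key : value` pairs grouped in `[section]` blocks, a layout the standard library config parser can read. The sketch below shows one way the index coordinates and sample sections could be used together. The helper names (`load_conf`, `extract`, `sample_table`, `assign`) are hypothetical, and treating the 1-based start and end positions as both inclusive is an assumption, chosen because positions 1 to 4 then match the 4-base index sequences in the example. How Quade itself loads the file is not shown in this diff.

```python
import io

try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import ConfigParser  # Python 2.7

# Trimmed-down copy of the example file, kept inline so the sketch is self-contained
EXAMPLE_CONF = u"""
[index]
index2 : True
index1_start : 1
index1_end : 4
index2_start : 1
index2_end : 4

[sample1]
name : S1
index1_seq : ACAG
index2_seq : ACAG

[sample2]
name : S2
index1_seq : CTTG
index2_seq : CTTG
"""

def load_conf(text):
    """Parse configuration text into a ConfigParser object."""
    parser = ConfigParser()
    # read_file() is the Python 3 name, readfp() the Python 2.7 one
    read = getattr(parser, "read_file", None) or parser.readfp
    read(io.StringIO(text))
    return parser

def extract(seq, start, end):
    """Slice seq with 1-based coordinates, assuming both ends are inclusive."""
    return seq[start - 1:end]

def sample_table(parser):
    """Map (index1_seq, index2_seq) to the sample name of each [sampleX] section."""
    table = {}
    for section in parser.sections():
        if section.startswith("sample"):
            key = (parser.get(section, "index1_seq"),
                   parser.get(section, "index2_seq"))
            table[key] = parser.get(section, "name")
    return table

def assign(parser, index_read1, index_read2):
    """Return the sample name for a pair of index reads, or None if undetermined."""
    idx1 = extract(index_read1,
                   parser.getint("index", "index1_start"),
                   parser.getint("index", "index1_end"))
    idx2 = extract(index_read2,
                   parser.getint("index", "index2_start"),
                   parser.getint("index", "index2_end"))
    return sample_table(parser).get((idx1, idx2))
```

A pair of index reads whose extracted sequences match no sample returns `None` here, which would correspond to the reads written when `write_undetermined` is enabled.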
Program Quade 0.2 Date 2015-03-31 12:38:32.372406
Program Quade 0.3 Date 2015-03-31 20:07:12.177268
Total pair 299
Pair pass quality 52
......