# Quality check and decontamination #
Low-quality sequences, artefacts, sequencing adaptors, duplicates, contaminants and other issues may compromise downstream analysis. Pre-processing metagenomic data is therefore necessary. The QC modules apply a quality filter and a contaminant filter to metagenomic reads.
## Quality filtering ##
Read quality assessment and filtering are performed using [fastp](https://github.com/OpenGene/fastp), which supports paired-end and single-end data. fastp, as its name suggests, is fast and produces one report per sample. fastp reports are supported by [multiqc](https://multiqc.info).

fastp filtering produces one filtered file per input file (i.e. `<samples>_R1.filtered.fq` and `<samples>_R2.filtered.fq`, or `<samples>_SE.filtered.fq`). For paired-end data, a third file containing newly unpaired reads is also produced.

Editable QC parameters are listed below:
- trimming_R1_front : 0 #[int] number of bases to trim from the front of R1 or SE reads
- trimming_R1_tail : 0 #[int] number of bases to trim from the tail of R1 or SE reads
- trimming_R2_front : 0 #[int] number of bases to trim from the front of R2 reads
- trimming_R2_tail : 0 #[int] number of bases to trim from the tail of R2 reads
- polyG_min_len : 10 #[int] minimum number of bases to consider for polyG trimming
- min_phred_score : 30 #[int] minimum Phred quality for a base to count as qualified
- unqualified_bases : 40 #[0-100] maximum percentage of unqualified bases allowed per read
- max_N : 5 #[int] maximum number of N bases per read
- average_Phred : 0 #[int] minimum average read quality (0 disables this filter)
- minimum_reads_length : 15 #[int] minimum read length

For more details about the QC parameters, please see [fastp's documentation](https://github.com/OpenGene/fastp).
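
To make the role of each QC parameter concrete, here is a minimal sketch of how the config keys could be translated into a fastp command line. The fastp options are real, but the mapping between the config keys and those options is an assumption based on the parameter descriptions, and the input file names (`<sample>_R1.fq`, `<sample>_R2.fq`, `<sample>_SE.fq`) and report names are hypothetical. It is not the pipeline's actual code.

```python
import shlex

# Assumed mapping from the config keys to fastp options.
# The flags exist in fastp; pairing them with these keys is a guess.
FASTP_FLAGS = {
    "trimming_R1_front": "--trim_front1",
    "trimming_R1_tail": "--trim_tail1",
    "trimming_R2_front": "--trim_front2",
    "trimming_R2_tail": "--trim_tail2",
    "polyG_min_len": "--poly_g_min_len",
    "min_phred_score": "--qualified_quality_phred",
    "unqualified_bases": "--unqualified_percent_limit",
    "max_N": "--n_base_limit",
    "average_Phred": "--average_qual",
    "minimum_reads_length": "--length_required",
}

# Values taken from the parameter list in this section.
DEFAULTS = {
    "trimming_R1_front": 0,
    "trimming_R1_tail": 0,
    "trimming_R2_front": 0,
    "trimming_R2_tail": 0,
    "polyG_min_len": 10,
    "min_phred_score": 30,
    "unqualified_bases": 40,
    "max_N": 5,
    "average_Phred": 0,
    "minimum_reads_length": 15,
}


def build_fastp_command(sample, params, paired=True):
    """Return the fastp command (as a list of arguments) for one sample."""
    if paired:
        # Reads that lose their mate during filtering go to a single third file.
        unpaired = f"{sample}_unpaired.filtered.fq"
        cmd = [
            "fastp",
            "--in1", f"{sample}_R1.fq",   # hypothetical input name
            "--in2", f"{sample}_R2.fq",   # hypothetical input name
            "--out1", f"{sample}_R1.filtered.fq",
            "--out2", f"{sample}_R2.filtered.fq",
            "--unpaired1", unpaired,
            "--unpaired2", unpaired,
        ]
    else:
        cmd = [
            "fastp",
            "--in1", f"{sample}_SE.fq",   # hypothetical input name
            "--out1", f"{sample}_SE.filtered.fq",
        ]
    for key, flag in FASTP_FLAGS.items():
        if not paired and "_R2_" in key:
            continue  # R2 trimming options are meaningless for single-end data
        cmd += [flag, str(params[key])]
    # One report per sample (hypothetical report names).
    cmd += ["--json", f"{sample}.fastp.json", "--html", f"{sample}.fastp.html"]
    return cmd


print(shlex.join(build_fastp_command("sampleA", DEFAULTS)))
```

The function only builds the argument list; passing it to `subprocess.run` would execute fastp, provided fastp is installed and the input files exist.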
## Contaminant removal ##
Contamination during a sequencing experiment is unavoidable, so sequences that are not supposed to be there must be removed. The common approach is to map reads against contaminant targets and keep the reads that do not map. For this task, we use [fastQscreen](https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/_build/html/index.html) with the --nohits flag to extract reads that do not map to any reference. fastQscreen provides ready-made references (Human, Mouse, Adapter, Arabidopsis, Drosophila, Ecoli, Lambda, Mitochondria, PhiX, Rat, rRNA and vectors), but you can add your own targets in the config file. If you prefer to use only your own references, you can skip downloading the fastQscreen references by setting dl_genomes (in the config file) to False.

fastQscreen processes one fastq file at a time. Therefore, for paired-end data, fastQscreen is run three times (`<samples>_R1.filtered.fq`, `<samples>_R2.filtered.fq`, `<samples>_unpaired.filtered.fq`). This may leave reads in R1 or R2 whose mate has been removed, so an additional step is required to extract those reads from R1 and R2 and add them to the unpaired file. This step is performed using the [repair utility](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/repair-guide/) from the BBTools suite.
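
The re-pairing step can be illustrated with a small, purely conceptual sketch. It is not how the BBTools repair utility is implemented, and the function names (`read_id`, `parse_fastq`, `repair`) are hypothetical. It loads everything in memory, so it only suits small examples. It assumes that mates share the same read name, optionally followed by `/1` or `/2`.

```python
def read_id(header):
    """Return the read name without the leading '@' and any /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def parse_fastq(lines):
    """Return a list of (header, sequence, quality) from 4-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    records = []
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        records.append((header, seq, qual))
    return records


def repair(r1_records, r2_records):
    """Split two independently filtered files into still-paired reads and singletons."""
    r2_by_id = {read_id(rec[0]): rec for rec in r2_records}
    shared = {read_id(rec[0]) for rec in r1_records} & set(r2_by_id)

    # Keep pairs in R1 order so that both output files stay synchronised.
    paired_r1 = [rec for rec in r1_records if read_id(rec[0]) in shared]
    paired_r2 = [r2_by_id[read_id(rec[0])] for rec in paired_r1]

    # Reads whose mate was removed would be appended to the unpaired file.
    singletons = [rec for rec in r1_records + r2_records
                  if read_id(rec[0]) not in shared]
    return paired_r1, paired_r2, singletons


# Toy example: read "b" lost its R2 mate and read "d" lost its R1 mate.
r1 = parse_fastq(["@a/1", "ACGT", "+", "IIII",
                  "@b/1", "ACGT", "+", "IIII",
                  "@c/1", "ACGT", "+", "IIII"])
r2 = parse_fastq(["@a/2", "TTGA", "+", "IIII",
                  "@c/2", "TTGA", "+", "IIII",
                  "@d/2", "TTGA", "+", "IIII"])
paired_r1, paired_r2, singletons = repair(r1, r2)
print([rec[0] for rec in paired_r1], [rec[0] for rec in singletons])
```

In the toy example, reads `a` and `c` remain paired, while `b` and `d` end up as singletons that would be added to the unpaired file.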