---
title: Genes collection
---

Magneto can build a genes collection directly from the assembly. This removes the need for binning and provides preliminary taxonomic and functional information, although it is less accurate than the information obtained with `genomes_collection`.

`magneto run genes_collection --config target=single_assembly **snakemake.args`

This step is only performed on single-assembly data. However, even if you choose co-assembly, a single-assembly can still be run on your data to obtain the genes collection. If you do not want to build the genes collection, with either single-assembly or co-assembly, set `genes_collection` to `False` in the config file.
### CDS search and clustering ###

Once the assembly is complete, the first step of the genes collection is to search for CDS (coding DNA sequences) in the contigs of each sample, using [prodigal](https://github.com/hyattpd/Prodigal). The CDS from all samples are then concatenated and clustered at 95% with [mmseqs2](https://github.com/soedinglab/MMseqs2) to remove redundancy.
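
The idea behind redundancy removal can be pictured with a toy greedy clustering. This is only an illustrative sketch: it is not the algorithm mmseqs2 uses, the `identity` and `greedy_cluster` helpers and the sample names are hypothetical, and identity is computed position by position without any alignment.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs: dict, threshold: float = 0.95) -> dict:
    """Attach each sequence to the first representative it matches at >= threshold."""
    clusters = {}
    # Longest sequences first, so each representative is the longest member.
    for name in sorted(seqs, key=lambda k: len(seqs[k]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative is close enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Hypothetical CDS pooled from two samples (40 bases each).
base = "ACGT" * 10
cds = {
    "sampleA_1": base,
    "sampleB_1": base[:5] + "T" + base[6:],  # one substitution: 39/40 = 97.5% identity
    "sampleB_2": "TTGA" * 10,                # unrelated sequence
}

clusters = greedy_cluster(cds, threshold=0.95)
print(clusters)
```

Here `sampleB_1` joins the cluster represented by `sampleA_1`, while `sampleB_2` stays on its own, so only two representative genes remain.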
### Genes abundance ###

The abundance of the clustered genes is then calculated. The reads from each sample are mapped onto the genes using [bowtie2](https://github.com/BenLangmead/bowtie2) and [samtools](https://github.com/samtools/samtools), and the abundance is computed from this mapping with [coverM](https://github.com/wwood/CoverM).
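
As a rough illustration of how an abundance value can be derived from a mapping, the sketch below computes a simplified mean coverage per gene (aligned bases divided by gene length). This is an assumption made for the example: coverM offers several calculation methods, and this page does not state which one Magneto uses. The function name and the input data are hypothetical.

```python
from collections import defaultdict


def mean_coverage(alignments, gene_lengths):
    """Simplified mean coverage: total aligned bases / gene length.

    alignments   -- iterable of (gene_id, aligned_bases) pairs, one per mapped read
    gene_lengths -- dict mapping gene_id to gene length in bases
    """
    aligned_bases = defaultdict(int)
    for gene, n_bases in alignments:
        aligned_bases[gene] += n_bases
    # Genes without any mapped read get a coverage of 0.0.
    return {gene: aligned_bases[gene] / length for gene, length in gene_lengths.items()}


# Hypothetical data: 150-base reads mapped onto three clustered genes.
gene_lengths = {"gene_1": 900, "gene_2": 300, "gene_3": 600}
alignments = [("gene_1", 150)] * 6 + [("gene_2", 150)] * 4

coverage = mean_coverage(alignments, gene_lengths)
print(coverage)
```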
### Genes functional annotation ###

The genes detected with prodigal are then functionally annotated with [eggNOG-mapper](https://github.com/eggnogdb/eggnog-mapper). The tool compares the input sequences with the eggNOG database and assigns an orthologous group to each sequence, which provides its functional annotation.
### Genes taxonomic annotation ###

After the clustered genes have been translated into proteins using [seqkit](https://github.com/shenwei356/seqkit), taxonomic annotation is performed using mmseqs2.
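
To make the translation step concrete, here is a minimal sketch of translating a CDS into a protein with the standard genetic code. It only illustrates the principle: in the pipeline this is done by seqkit, and the `translate` helper below is hypothetical. Codons that are incomplete or contain ambiguous bases are handled in a simplistic way.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a DNA coding sequence, reading frame 1, stop codons shown as '*'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing incomplete codon
    # Codons with ambiguous bases (e.g. N) are translated as 'X'.
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


protein = translate("ATGAAATAA")
print(protein)
```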
### Output ###

```
genes_collection/tables/
├── coverm_genes_abundance
├── genes_functions.tsv
├── genes_length.tsv
├── genes_taxo.UniRef50.classified.tsv
└── genes_taxo.UniRef50.tsv
```
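
A quick way to confirm that a run produced everything listed above is to check for each entry. The sketch below assumes the tables directory is reachable as `genes_collection/tables` from the current working directory, which depends on how your output directory is configured; the `missing_outputs` helper is hypothetical and not part of Magneto.

```python
from pathlib import Path

# Entries listed in the output tree above.
EXPECTED_OUTPUTS = [
    "coverm_genes_abundance",
    "genes_functions.tsv",
    "genes_length.tsv",
    "genes_taxo.UniRef50.classified.tsv",
    "genes_taxo.UniRef50.tsv",
]


def missing_outputs(tables_dir):
    """Return the expected entries that are absent from tables_dir."""
    tables_dir = Path(tables_dir)
    return [name for name in EXPECTED_OUTPUTS if not (tables_dir / name).exists()]


# Assumed location; adjust to your own output directory.
missing = missing_outputs("genes_collection/tables")
print("Missing outputs:", missing)
```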
[Previous - Assembly (Module)](Modules/assembly)