... | @@ -32,17 +32,16 @@ For example, if there are two projects in my samplesheet, one specified "hg19" a |
... | @@ -32,17 +32,16 @@ For example, if there are two projects in my samplesheet, one specified "hg19" a |
|
|
|
|
|
By default, the fasta file is built in multiple steps by the snakemake pipeline:
|
|
By default, the fasta file is built in multiple steps by the snakemake pipeline:
|
|
1. downloading the transcript sequences: `<species>_refMrna.fa`
|
|
1. downloading the transcript sequences: `<species>_refMrna.fa`
|
|
2. stripping polyA tails from the sequences: `<species>_polyAstrip.fa`
|
|
2. downloading the mitochondrial sequence: `<species>_chrM.fa`
|
|
3. downloading the mitochondrial sequence: `<species>_chrM.fa`
|
|
3. merging transcript sequences, chrM and ERCC spike in sequences: `<species>_ERCC_chrm.fa`
|
|
4. stripping polyA from chrM sequence: `<species>_chrM_polyAstrip.fa`
|
|
4. stripping polyA tails from the sequences: `<species>_ERCC_chrm_polyAstrip.fa`
|
|
5. merging transcript sequences polyA stripped, chrM polyA stripped and ERCC spike in sequences: `<species>_ERCC_chrm_polyAstrip.fa`
|
|
|
|
|
|
|
|
> **Note:**
|
|
> **Note:**
|
|
|
|
|
|
> - The ERCC spike-in sequences are available in "SCRIPTS/ERCC92_polyAstrip.fa".
|
|
> - The ERCC spike-in sequences are available in "SCRIPTS/ERCC92_polyAstrip.fa".
|
|
> - These ERCC sequences are not used in the GenoBiRD protocol but must still be added for legacy reasons.
|
|
> - These ERCC sequences are not used in the GenoBiRD protocol but must still be added for legacy reasons.
|
|
|
|
|
|
These steps are performed by snakemake. If you're building the fasta manually, all you need is a file named `<species>_ERCC_chrm_polyAstrip.fa`. Snakemake will not try to perform the previous steps (downloading, merging) if this file already exists.
|
|
These steps are performed by snakemake. If you're building the fasta manually, all you need is a file named `<species>_ERCC_chrm.fa`. Snakemake will not try to perform the previous steps (downloading, merging) if this file already exists.
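
For a manual build, the merge step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the pipeline's actual snakemake rule: the function names `merge_fasta` and `build_reference_fasta` are hypothetical, and the transcript and chrM files are assumed to sit in the same folder under the names listed in the steps above.

```python
from pathlib import Path


def merge_fasta(inputs, output):
    """Concatenate FASTA files into `output`, one after the other.

    A newline is added after any input that does not end with one,
    so that the next header line always starts on its own line.
    """
    output = Path(output)
    with output.open("w") as out:
        for path in inputs:
            last = "\n"
            with open(path) as handle:
                for line in handle:
                    out.write(line)
                    last = line
            if not last.endswith("\n"):
                out.write("\n")
    return output


def build_reference_fasta(ref_dir, species, ercc_fasta):
    """Build `<species>_ERCC_chrm.fa` unless it already exists (hypothetical helper)."""
    ref_dir = Path(ref_dir)
    target = ref_dir / f"{species}_ERCC_chrm.fa"
    if target.exists():
        # Same idea as described above: an existing file is reused as-is.
        return target
    parts = [
        ref_dir / f"{species}_refMrna.fa",  # transcript sequences
        ref_dir / f"{species}_chrM.fa",     # mitochondrial sequence
        Path(ercc_fasta),                   # ERCC spike-in sequences
    ]
    return merge_fasta(parts, target)
```

A call could look like `build_reference_fasta("REF/hg38", "hg38", "SCRIPTS/ERCC92_polyAstrip.fa")`, where the `REF/hg38` folder is an assumed location.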
|
|
|
|
|
|
Since sequences contained in the fastq files are aligned with bwa and not a "RNAseq aligner" such as STAR, you need a reference transcriptome and not a reference genome. A reference transcriptome contains the sequences of the transcripts (cDNA).
|
|
Since sequences contained in the fastq files are aligned with bwa and not a "RNAseq aligner" such as STAR, you need a reference transcriptome and not a reference genome. A reference transcriptome contains the sequences of the transcripts (cDNA).
|
|
|
|
|
... | @@ -76,13 +75,66 @@ This example shows how the first two transcripts in the fasta file example belon |
... | @@ -76,13 +75,66 @@ This example shows how the first two transcripts in the fasta file example belon |
|
|
|
|
|
The snakemake pipeline has a rule to transform the downloaded refGene file into this sym2ref file. If you are building your reference manually, then you have to build this file yourself and name it `<species>_sym2ref.dat`.
|
|
The snakemake pipeline has a rule to transform the downloaded refGene file into this sym2ref file. If you are building your reference manually, then you have to build this file yourself and name it `<species>_sym2ref.dat`.
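
If you build this file by hand from a UCSC `refGene.txt` table, the grouping can be sketched as below. This is only an illustration, not the pipeline's own rule: the function names are hypothetical, and the output layout (gene symbol, a tab, then comma-separated transcript names) is an assumption that should be checked against the sym2ref example shown earlier on this page. The sketch relies on the UCSC refGene column order, where the transcript name is the 2nd column and the gene symbol (`name2`) is the 13th.

```python
from collections import defaultdict


def refgene_to_sym2ref(refgene_lines):
    """Group transcript names by gene symbol from refGene rows.

    Rows with fewer than 13 tab-separated fields are skipped.
    """
    symbols = defaultdict(list)
    for line in refgene_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue
        transcript, symbol = fields[1], fields[12]
        # Keep each transcript once per gene, in order of first appearance.
        if transcript not in symbols[symbol]:
            symbols[symbol].append(transcript)
    return dict(symbols)


def format_sym2ref(symbols):
    """Render one line per gene symbol (assumed layout: symbol<TAB>tx1,tx2,...)."""
    return "".join(
        f"{symbol}\t{','.join(transcripts)}\n"
        for symbol, transcripts in sorted(symbols.items())
    )
```

The resulting text can then be written to `<species>_sym2ref.dat`, for example with `open("hg38_sym2ref.dat", "w")`, where the file location is up to you.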
|
|
|
|
|
|
|
|
# Download an alternative reference
|
|
|
|
|
|
|
|
If you're not satisfied with the refseq sequences for any reason, you can use the script `SCRIPTS/make_ref.py` to build a reference from **gencode** or **ensembl** cDNA sequences.
|
|
|
|
|
|
|
|
The help can be displayed with:
|
|
|
|
|
|
|
|
```
$ python SCRIPTS/make_ref.py -h
usage: make_ref.py [-h] [-p {refseq,gencode,ensembl}] [-s {human,mouse,rat}]
                   [-r DIR] [-n REFNAME] [-u PROXY]

optional arguments:
  -h, --help            show this help message and exit
  -p {refseq,gencode,ensembl}, --provenance {refseq,gencode,ensembl}
                        Provenance of the reference transcriptome to be
                        downloaded (default : refseq).
  -s {human,mouse,rat}, --species {human,mouse,rat}
                        Species of the reference transcriptome to be
                        downloaded (default: human). If you need another
                        species, you're going to have to build your reference
                        manually. Have a look at this page:
                        https://gitlab.univ-
                        nantes.fr/bird_pipeline_registry/srp-
                        pipeline/-/wikis/usage/reference
  -r DIR, --reference-dir DIR
                        Directory where the new reference will be built
                        (default : REF).
  -n REFNAME, --name REFNAME
                        Name of the new reference (default to standard last
                        build name according to chosen species. Ex:
                        hg38,mm10,rn6). This name should be the same specified
                        in the samplesheet describing the samples.
  -u PROXY, --use-proxy PROXY
                        Use univ-nantes proxy
```
|
|
|
|
|
|
|
|
Examples:
|
|
|
|
|
|
|
|
If you need the mouse reference from gencode:
|
|
|
|
`python SCRIPTS/make_ref.py -p gencode -s mouse`
|
|
|
|
|
|
|
|
This command will produce the files
|
|
|
|
- `mm10_ERCC_chrm.fa`
|
|
|
|
- `mm10_sym2ref.dat`
|
|
|
|
|
|
|
|
in the folder `REF/gencode/mm10`.
|
|
|
|
When using the script `SCRIPTS/make_srp_config.py` to make the configuration file, you will have to pass `REF/gencode` to the `-r` option if you specified `mm10` in the samplesheet.
|
|
|
|
|
|
|
|
> **Note:**
|
|
|
|
|
|
|
|
> - **Gencode** only provides human and mouse references.
|
|
|
|
> - When you specify a species, the script downloads the latest build for it.
|
|
|
|
|
|
# Conclusion
|
|
# Conclusion
|
|
|
|
|
|
If you want the pipeline to automatically build your reference, you should only specify a genomic assembly that exists in refseq in your samplesheet. The pipeline has been tested with "hg19, hg38, mm10". For "rn6", the chrM is missing in the default path, so it has to be downloaded by hand.
|
|
If you want the pipeline to automatically build your reference, you should only specify a genomic assembly that exists in refseq in your samplesheet. The pipeline has been tested with "hg19, hg38, mm10". For "rn6", the chrM is missing in the default path, so it has to be downloaded by hand.
|
|
|
|
|
|
If you want to manually build your reference with your own transcript sequences, then you have to build these two files:
|
|
If you want to manually build your reference with your own transcript sequences, then you have to build these two files:
|
|
- `<species>_ERCC_chrm_polyAstrip.fa`
|
|
- `<species>_ERCC_chrm.fa` : merging of mRNA sequences, ERCC and chrM fasta files.
|
|
- `<species>_sym2ref.dat`
|
|
- `<species>_sym2ref.dat` : gene symbol to transcript names file.
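
Before launching the pipeline on a manually built reference, a quick check that both files are in place can save a failed run. The snippet below is a minimal sketch: the helper name `missing_reference_files` is hypothetical, and it assumes the `<species>_ERCC_chrm.fa` naming and the `<reference dir>/<name>/` folder layout seen in the `REF/gencode/mm10` example.

```python
from pathlib import Path


def missing_reference_files(ref_dir, name):
    """Return the expected reference files that are absent for `name`.

    Assumed layout: <ref_dir>/<name>/<name>_ERCC_chrm.fa
                    <ref_dir>/<name>/<name>_sym2ref.dat
    """
    folder = Path(ref_dir) / name
    expected = [
        folder / f"{name}_ERCC_chrm.fa",
        folder / f"{name}_sym2ref.dat",
    ]
    return [path for path in expected if not path.is_file()]
```

For example, `missing_reference_files("REF/gencode", "mm10")` returns an empty list when both files exist, and the list of missing paths otherwise.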
|
|
|
|
|
|
<div align="right">
|
|
<div align="right">
|
|
|
|
|
... | | ... | |