|
|
[*Back to home*](../Home.md)
|
|
|
*Back to: [The inputs](inputs.md)*
|
|
|
|
|
|
# Building a reference for the SRP pipeline
|
|
|
|
|
|
The pipeline needs one (or more) reference in order to align the sequences contained in the fastq files and to count the reads by features (usually genes).
|
|
|
|
|
|
By default, the pipeline uses a transcriptomic reference from [refseq](https://www.ncbi.nlm.nih.gov/refseq/about/).
|
|
|
|
|
|
All reference transcriptomes specified in column "species" (hg19, hg38, rn6, mm10 ...) will be downloaded if you specify a reference folder that doesn't contain these refs. Three files are needed to build the reference:
|
|
|
- refMrna containing all the refseq sequences of the transcripts
|
|
|
- chrM genomic sequence
|
|
|
- refGene containing all the refseq annotations
|
|
|
|
|
|
Thoses references are downloaded from the UCSC ftp server. URLs are programmatically built so the files located here should exist:
|
|
|
- `http://hgdownload.cse.ucsc.edu/goldenPath/<species>/chromosomes/chrM.fa.gz`
|
|
|
- `http://hgdownload.cse.ucsc.edu/goldenPath/<species>/database/refGene.txt.gz`
|
|
|
- `http://hgdownload.cse.ucsc.edu/goldenPath/<species>/bigZips/refMrna.fa.gz`
|
|
|
|
|
|
with \<species\> being the genome specified in the samplesheet.
|
|
|
|
|
|
**If these files don't exist, you're going to have to build the reference by yourself.**
|
|
|
|
|
|
A correct reference must contain a **fasta file** containing transcript sequences and an **annotation file** (that we call "_sym2ref.dat") linking the name of the sequences in the fasta file to a gene name . Examples of this files are provided in the "TESTDATA/REFERENCES/hg19" folder.
|
|
|
|
|
|
## The fasta file
|
|
|
|
|
|
By default, the fasta file is build in multiple steps by the snakemake pipeline:
|
|
|
1. downloading the transcript sequences: `<species>_refMrna.fa`
|
|
|
2. stripping polyA tails from the sequences: `<species>_polyAstrip.fa`
|
|
|
3. downloading the mitochondrial sequence: `<species>_chrM.fa`
|
|
|
4. stripping polyA from chrM sequence: `<species>_chrM_polyAstrip.fa`
|
|
|
5. merging transcript sequences polyA stripped, chrM polyA stripped and ERCC spike in sequences: `<species>_ERCC_chrm_polyAstrip.fa`
|
|
|
|
|
|
> **Note:**
|
|
|
|
|
|
> - The sequence of the ERCC spike in are available in "SCRIPTS/ERCC92_polyAstrip.fa"
|
|
|
> - These ERCC sequences are unused on the GenoBiRD protocole but must still be added for legacy reasons.
|
|
|
|
|
|
These steps are performed by snakemake. If you're building the fasta manually, all you need is a file named `<species>_ERCC_chrm_polyAstrip.fa`. Snakemake will not try to perform the previous steps (downloading, merging) if this file already exists.
|
|
|
|
|
|
Since sequences contained in the fastq files are aligned with bwa and not a "RNAseq aligner" such as STAR, you need a reference transcriptome and not a reference genome. A reference transcriptome contains the sequences of the transcripts (cDNA).
|
|
|
|
|
|
example:
|
|
|
|
|
|
```txt
|
|
|
>NR_152111 1
|
|
|
GAGGAGAGAGCAGAGTATACCGCAGACATCATTTCTACTACAGTGGCGGAGCCGTACAGG
|
|
|
ACCTGTTTCACTGCAGGGGGATCCAAAACAAGCCCCGTGGAGCAGCAGCCAGA.......
|
|
|
>NR_152112 1
|
|
|
GAGGAGAGAGCAGAGTATACCGCAGACATCATTTCTACTACAGTGGCGGAGCCGTACAGG
|
|
|
ACCTGTTTCACTGCAGGGGGATCCAAAACAAGCCCCGTGGAGCAGCAGCCAGA.......
|
|
|
>NR_137427 1
|
|
|
AGTTCTCACGCTAGGTCTCCTCCGGGCGCTTCCCTAGCCCGTTCGCCGCCTGAGAGGGAC
|
|
|
GCTGTTCCGCCGCGTGGAAGCTTCGAGTCTCGACTCCACTGTTGACCCCTAGA.......
|
|
|
```
|
|
|
|
|
|
## The annotation file
|
|
|
|
|
|
By default, this file is built by looking at the links between transcript names and gene names from the file `<species>_refGene.txt`.
|
|
|
It is a two column file listing all the transcripts associated with their corresponding genes. The first column is the gene names that will later on in the pipeline be used as the features of the expression matrix. The second column is a comma separated list of all the transcripts of the genes.
|
|
|
|
|
|
example:
|
|
|
|
|
|
```txt
|
|
|
ATXN1 NM_000332,NM_001128164,NM_001357857,NR_152111,NR_152112,NR_152113,NR_152114
|
|
|
KLRG1 NM_001329099,NM_001329101,NM_001329102,NM_001329103,NM_005810,NR_137426,NR_137427,NR_137428
|
|
|
```
|
|
|
|
|
|
This example shows how the first two transcripts in the fasta file example belong to the same gene.
|
|
|
|
|
|
The snakemake has a rule to transform the downloaded refGene into this sym2ref file. If you are manually making your reference, than you have to build this file and call it `<species>_sym2ref.dat`.
|
|
|
|
|
|
# Conclusion
|
|
|
|
|
|
If you want the pipeline to automatically build you reference, you should only specify a genomic assembly that exists in refseq in your samplesheet. The pipeline has been tested with "hg19, hg38, mm10". For "rn6", the chrM is missing in the default path, so it has to be downloaded by hand.
|
|
|
|
|
|
If you want to manually build your reference with you own transcript sequences then you have to build the two files:
|
|
|
- `<species>_ERCC_chrm_polyAstrip.fa`
|
|
|
- `<species>_sym2ref.dat`
|
|
|
|
|
|
<div style="text-align: right">
|
|
|
<i>Back to: <a href="inputs.md">The inputs</a></i>
|
|
|
</div> |