... | ... | @@ -11,163 +11,38 @@ If you have not created your environment, see [the requirments](usage/requiremen |
|
|
conda activate rna
|
|
|
```
|
|
|
|
|
|
## Creating the project.json file
|
|
|
|
|
|
All fastq files must be located under a single root folder. Within it, they can be organized into sub-folders.
|
|
|
The **project.json** configuration file can be built automatically with a simple command:
|
|
|
|
|
|
```
|
|
|
find </path/to/fastq/files> -type f -name "*.fastq.gz" | java -jar scripts/illuminadir.jar -J | python -m json.tool > project.json
|
|
|
```
|
|
|
|
|
|
> **Note:**
|
|
|
|
|
|
> - Replace \</path/to/fastq/files\> with the full path to the folder containing the fastq files. Be aware that every `*.fastq.gz` file found under this folder, including sub-folders, will be picked up.
|
|
|
> - Check that the generated JSON file looks correct, especially the sample names extracted from the fastq filenames. These names must match the ones specified in the [samplesheet](usage/inputs#samplesheet).
|
|
|
|
|
|
## Creating the config.json file
|
|
|
|
|
|
The config.json file can be created with the help of the python script in "scripts/make_rna_config.py".
|
|
|
The config.json file can be created with the help of the python script in `scripts/make_rna_config.py`.
|
|
|
|
|
|
You can display the script's help with:
|
|
|
|
|
|
```
|
|
|
$ python scripts/make_rna_config.py -h
|
|
|
|
|
|
usage: make_rna_config.py [-h] -s FILE [-w DIR] [-r NAME] -c FILE -t
|
|
|
{unstranded,first-strand,second-strand} [-l N]
|
|
|
[-n PROJECTNAME] -p FILE [--minLogFC N]
|
|
|
|
|
|
optional arguments:
|
|
|
-h, --help show this help message and exit
|
|
|
-s FILE, --samplesheet FILE
|
|
|
Tab delimited file with no header describing samples.
|
|
|
Columns must be: "name condition". Only characters
|
|
|
"A-Z","0-9","-" and "_" allowed. Both columns are
|
|
|
mandatory. (REQUIRED)
|
|
|
-w DIR, --workdir DIR
|
|
|
Analysis working directory. Default: current directory
|
|
|
-r NAME, --reference-name NAME
|
|
|
Reference name. This name must match a key in the
|
|
|
CONFIG/references.json file. If not used, you will
|
|
|
have to write the reference object yourself in the
|
|
|
config.json file
|
|
|
-c FILE, --comparisons FILE
|
|
|
Tab delimited file with no headers indicating which
|
|
|
conditions to compare during differential expression
|
|
|
analysis. Columns must be "condition1 condition2".
|
|
|
(REQUIRED)
|
|
|
-t {unstranded,first-strand,second-strand}, --librarytype {unstranded,first-strand,second-strand}
|
|
|
Library type. If you have no idea what this is, please
|
|
|
see "https://chipster.csc.fi/manual/library-type-
|
|
|
summary.html"
|
|
|
-l N, --readlength N Length of the reads.
|
|
|
-n PROJECTNAME, --projectname PROJECTNAME
|
|
|
Project name which will appear in html report.
|
|
|
-p FILE, --project FILE
|
|
|
project.json file generated by illuminadir.jar.
|
|
|
(REQUIRED)
|
|
|
--minLogFC N Minimum log Fold-Change threshold for differentially
|
|
|
expessed gene. (Default 0.58 (1.5 FC))
|
|
|
```
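The `--minLogFC` default shown in the help text (0.58, described as a 1.5 fold change) is consistent with a base-2 logarithm. The sketch below only illustrates that conversion; the base-2 interpretation is an assumption drawn from the numbers in the help text, and the function names are hypothetical, not part of the pipeline.

```python
import math


def fold_change_to_log2(fold_change: float) -> float:
    """Convert a linear fold change into a log2 fold change."""
    # Assumption: the pipeline's log fold change is base 2,
    # which is what "0.58 (1.5 FC)" in the help text suggests.
    return math.log2(fold_change)


def log2_to_fold_change(log_fc: float) -> float:
    """Convert a log2 fold change back into a linear fold change."""
    return 2.0 ** log_fc


if __name__ == "__main__":
    print(round(fold_change_to_log2(1.5), 2))  # prints 0.58
    print(round(fold_change_to_log2(2.0), 2))  # prints 1.0
```

Under this assumption, a value such as `--minLogFC 1` would correspond to requiring at least a two-fold change.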
|
|
|
|
|
|
> **Note:**
|
|
|
|
|
|
> - The `-c` argument is optional. It triggers the secondary analysis steps. See *[the input files](usage/inputs#compFile)* for more details on the comparisons file.
|
|
|
> - The `-r` argument expects a key in the CONFIG/references.json file. If you use it, make sure it corresponds to a key for an already defined reference on your system. If you don't use it, you will have to fill in the blanks in the generated configuration file:
|
|
|
> - define the path to the fasta reference genome (only works with Ensembl genomes)
|
|
|
> - define the path to the gtf annotation file of the reference genome
|
|
|
> - define a path where the aligner STAR will create the index of the reference genome
|
|
|
> - define the accession for biomart in order to map ENSGs to gene symbols
|
|
|
> - define a description, which will be used in the final report file
|
|
|
> - The `-t` argument is mandatory. It defines whether your library is unstranded or stranded; see https://chipster.csc.fi/manual/library-type-summary.html for details.
|
|
|
> - The `-l` argument defines the length of your reads (determined by the number of cycles of the sequencing run). It is needed to build an appropriate index for the reference genome.
|
|
|
> - The `-p` argument is mandatory and must be the path to the **project.json** file created previously. The script matches the sample names defined in the [samplesheet](usage/inputs#samplesheet) against the names defined in the **project.json** file.
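Because the script matches sample names against **project.json**, it can help to check the samplesheet before running it. The following is a minimal sketch based only on the rules stated in the help text (tab-delimited, no header, two mandatory columns, restricted character set). The function name is hypothetical, and accepting lowercase letters is an assumption made because the example sample names in this page contain them.

```python
import re

# Assumption: letters are accepted in both cases, digits, "-" and "_".
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")


def check_samplesheet(lines):
    """Return a list of human-readable problems found in samplesheet lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(
                f"line {number}: expected 2 tab-separated columns, found {len(fields)}"
            )
            continue
        for label, value in zip(("name", "condition"), fields):
            if not ALLOWED.fullmatch(value):
                problems.append(f"line {number}: invalid {label} {value!r}")
    return problems


if __name__ == "__main__":
    sheet = ["80_CT_1_chr22\tCT\n", "96_TREATED_1_chr22\tTREATED\n"]
    for problem in check_samplesheet(sheet):
        print(problem)
```

To check a real file, the same function can be called on an open file handle, for example `check_samplesheet(open("samplesheet.tsv"))`, where the filename is only a placeholder.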
|
|
|
|
|
|
The program writes the config file to stdout. Run the command once to check that everything looks right, then run it again with the output redirected to a file.
|
|
|
If you need to learn more on how to create this file, see the [inputs page](usage/inputs).
|
|
|
|
|
|
```
|
|
|
python SCRIPTS/make_srp_config.py -s <my_samplesheet> -r <key_of_references.json> -w <path_to_workdir> -t <library_type> -l <read_length> -n <project_name> -p <path_to_project.json> > config.json
|
|
|
python SCRIPTS/make_srp_config.py -s <samplesheet> -d <species> -w <output folder> -t <library_type> -l <read_length> -n <project_name> > config.json
|
|
|
```
|
|
|
|
|
|
If you want secondary analysis to be performed, use option `-c` to specify the comparisons.
|
|
|
|
|
|
```
|
|
|
python SCRIPTS/make_srp_config.py -s <my_samplesheet> -r <key_of_references.json> -w <path_to_workdir> -t <library_type> -l <read_length> -n <project_name> -p <path_to_project.json> -c <comparisons_file> > config.json
|
|
|
python SCRIPTS/make_srp_config.py -s <samplesheet> -d <species> -w <output folder> -t <library_type> -l <read_length> -n <project_name> -c <comparisons file> > config.json
|
|
|
```
|
|
|
|
|
|
In either case, check the generated configuration file to make sure everything looks correct.
|
|
|
```
|
|
|
$ cat config.json
|
|
|
|
|
|
{
|
|
|
"project-name":"test-project",
|
|
|
"outdir":"Results",
|
|
|
"prinseq-meanquality":"30",
|
|
|
"cutadapt-forward":"AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
|
|
|
"cutadapt-reverse":"AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT",
|
|
|
"library-type":"first-strand",
|
|
|
"read-length":"100",
|
|
|
"align-cpu":"1",
|
|
|
"reference":
|
|
|
{
|
|
|
"name":"Ensembl_GRCh37",
|
|
|
"description":"Homo sapiens Ensembl GRCh37",
|
|
|
"STARindexDir":"CONFIG/genome",
|
|
|
"fasta":"CONFIG/genome/human_g1k_v37.chr22.fasta",
|
|
|
"gtf":"CONFIG/genome/chr22.gff",
|
|
|
"biomart":"37,hsapiens_gene_ensembl"
|
|
|
},
|
|
|
"samplesCondition": [
|
|
|
{
|
|
|
"name": "80_CT_1_chr22",
|
|
|
"condition": "CT"
|
|
|
},
|
|
|
{
|
|
|
"name": "81_CT_2_chr22",
|
|
|
"condition": "CT"
|
|
|
},
|
|
|
{
|
|
|
"name": "82_CT_3_chr22",
|
|
|
"condition": "CT"
|
|
|
},
|
|
|
{
|
|
|
"name": "83_CT_4_chr22",
|
|
|
"condition": "CT"
|
|
|
},
|
|
|
{
|
|
|
"name": "96_TREATED_1_chr22",
|
|
|
"condition": "TREATED"
|
|
|
},
|
|
|
{
|
|
|
"name": "97_TREATED_2_chr22",
|
|
|
"condition": "TREATED"
|
|
|
},
|
|
|
{
|
|
|
"name": "98_TREATED_3_chr22",
|
|
|
"condition": "TREATED"
|
|
|
},
|
|
|
{
|
|
|
"name": "99_TREATED_4_chr22",
|
|
|
"condition": "TREATED"
|
|
|
}
|
|
|
],
|
|
|
"comparisons":
|
|
|
{
|
|
|
"TREATED__vs__CT": {
|
|
|
"minLogFC": 0.58,
|
|
|
"condition1": "TREATED",
|
|
|
"condition2": "CT"
|
|
|
}
|
|
|
}
|
|
|
}
|
|
|
```
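One check that is easy to automate is that every condition named under `comparisons` is actually assigned to at least one sample in `samplesCondition`. The sketch below assumes the key names shown in the example output above; the function name is hypothetical and the pipeline itself may perform different or additional checks.

```python
import json


def undefined_conditions(config: dict) -> set:
    """Return conditions used in comparisons but assigned to no sample."""
    defined = {sample["condition"] for sample in config.get("samplesCondition", [])}
    used = set()
    for comparison in config.get("comparisons", {}).values():
        used.add(comparison["condition1"])
        used.add(comparison["condition2"])
    return used - defined


if __name__ == "__main__":
    # Trimmed-down example with the same structure as the config shown above,
    # deliberately missing any TREATED sample.
    example = json.loads("""
    {
      "samplesCondition": [
        {"name": "80_CT_1_chr22", "condition": "CT"}
      ],
      "comparisons": {
        "TREATED__vs__CT": {"minLogFC": 0.58, "condition1": "TREATED", "condition2": "CT"}
      }
    }
    """)
    print(sorted(undefined_conditions(example)))  # prints ['TREATED']
```

For a real run, the dictionary can be loaded from the generated file with `json.load(open("config.json"))`; an empty result means every compared condition has at least one sample.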
|
|
|
|
|
|
## Launch the snakemake pipeline.
|
|
|
|
|
|
Test the launch with a dry run:
|
|
|
```
|
|
|
snakemake --config proj="project.json" conf="config.json" -rpn
|
|
|
snakemake --config conf="config.json" -rpn
|
|
|
```
|
|
|
where:
|
|
|
- `--config` injects the configuration file into the Snakefile
|
... | ... | @@ -179,7 +54,7 @@ If you see the rules and commands that will be run, everything's fine. |
|
|
|
|
|
Launch the run on a personal computer (**only for test data**):
|
|
|
```
|
|
|
snakemake --config proj="project.json" conf="config.json" -rp -j 2
|
|
|
snakemake --config conf="CONFIG/config.json" -rp -j 2
|
|
|
```
|
|
|
|
|
|
> **Note:**
|
... | ... | @@ -192,13 +67,13 @@ snakemake --config proj="project.json" conf="config.json" -rp -j 2 |
|
|
If you want to launch the pipeline on a cluster, you have to specify a script to encapsulate the jobs for snakemake.
|
|
|
Example for SGE:
|
|
|
```
|
|
|
snakemake --config proj="project.json" conf="config.json" --cluster "qsub -e ./logs/ -o ./logs/" -j 30 --jobscript sge.sh --latency-wait 100 -rp --resources parallel_star=3
|
|
|
snakemake --config conf="config.json" --cluster "qsub -e ./logs/ -o ./logs/" -j 30 --jobscript sge.sh --latency-wait 100 -rp --resources parallel_star=5
|
|
|
```
|
|
|
|
|
|
> **Note:**
|
|
|
|
|
|
> - The path to the log output files must **exist** (`$ mkdir ./logs`).
|
|
|
> - `--resources` will limit the parallel alignments to 3 in order not to consume all the memory. Increase this number carefully if your cluster contains a lot of RAM.
|
|
|
> - `--resources` will limit the parallel alignments in order not to consume all the memory. Increase this number carefully if your cluster contains a lot of RAM.
|
|
|
|
|
|
where `SCRIPTS/sge.sh` is a wrapper for the SGE jobs:
|
|
|
|
... | ... | |