|
|
|
|
|
# Running the pipeline
|
|
|
|
|
|
## Creating the configuration file
|
|
|
## Activating the rna conda environment
|
|
|
|
|
|
The first task before running the pipeline is to create a **configuration file** in JSON format. This can be done with the help of the Python script `scripts/make_rna_config.py`, assuming you have activated your conda virtual environment with either `conda activate rna` or `source activate rna`.
|
|
|
If you have not created your environment, see [the requirements](usage/requirements) page.
|
|
|
|
|
|
```
conda activate rna
```
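Before going further, you can quickly confirm that the environment is active. This is only a sanity check, not part of the pipeline itself; it assumes `snakemake` is installed in the rna environment:

```shell
# Sanity check (not part of the pipeline): once the rna environment is
# active, snakemake should resolve on the PATH.
if command -v snakemake > /dev/null 2>&1; then
    snakemake --version
else
    echo "snakemake not found -- activate the rna environment first"
fi
```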
|
|
|
|
|
|
## Creating the project.json file
|
|
|
|
|
|
Fastq files must be located under one root folder; they may be organized into sub-folders.
|
|
|
The **project.json** configuration file can be built automatically with a simple command:
|
|
|
|
|
|
```
find </path/to/fastq/files> -type f -name "*.fastq.gz" | java -jar scripts/illuminadir.jar -J | python -m json.tool > project.json
```
|
|
|
|
|
|
> **Note:**
> - Replace `</path/to/fastq/files>` with the full path to the folder containing the fastq files. Be aware that every file matching "*.fastq.gz" under this folder will be picked up.
> - Check that the generated json file looks good, in particular the sample names extracted from the fastq filenames. These names will have to correspond to the ones specified in the [samplesheet](usage/inputs#samplesheet).
|
|
|
|
|
|
## Creating the config.json file
|
|
|
|
|
|
The config.json file can be created with the help of the Python script `scripts/make_rna_config.py`.
|
|
|
|
|
|
You can display the script's help with:
|
|
|
|
|
|
```
$ python scripts/make_rna_config.py -h

usage: make_rna_config.py [-h] -s FILE [-w DIR] [-r NAME] -c FILE -t
                          {unstranded,first-strand,second-strand} [-l N]
                          [-n PROJECTNAME] -p FILE [--minLogFC N]

optional arguments:
  -h, --help            show this help message and exit
  -s FILE, --samplesheet FILE
                        Tab delimited file with no header describing samples.
                        Columns must be: "name condition". Only characters
                        "A-Z","0-9","-" and "_" allowed. Both columns are
                        mandatory. (REQUIRED)
  -w DIR, --workdir DIR
                        Analysis working directory. Default: current directory
  -r NAME, --reference-name NAME
                        Reference name. This name must match a key in the
                        CONFIG/references.json file. If not used, you will
                        have to write the reference object yourself in the
                        config.json file
  -c FILE, --comparisons FILE
                        Tab delimited file with no headers indicating which
                        conditions to compare during differential expression
                        analysis. Columns must be "condition1 condition2".
                        (REQUIRED)
  -t {unstranded,first-strand,second-strand}, --librarytype {unstranded,first-strand,second-strand}
                        Library type. If you have no idea what this is, please
                        see "https://chipster.csc.fi/manual/library-type-
                        summary.html"
  -l N, --readlength N  Length of the reads.
  -n PROJECTNAME, --projectname PROJECTNAME
                        Project name which will appear in html report.
  -p FILE, --project FILE
                        project.json file generated by illuminadir.jar.
                        (REQUIRED)
  --minLogFC N          Minimum log Fold-Change threshold for differentially
                        expressed gene. (Default 0.58 (1.5 FC))
```
|
|
|
|
|
|
> **Note:**
> - The `-r` argument expects a key in the CONFIG/references.json file. If you use it, make sure it corresponds to a reference already defined on your system. If you don't use it, you will have to fill in the blanks in the generated configuration file:
>     - define the path to the fasta reference genome (only works with Ensembl genomes)
>     - define the path to the gtf annotation file of the reference genome
>     - define a path where the aligner STAR will create the index of the reference genome
>     - define the accession for biomart in order to map ENSGs to gene symbols
>     - define a description, which will be used in the final report file.
> - The `-c` argument specifies the comparisons file, which lists the conditions to compare during differential expression analysis. See *[the input files](usage/inputs#compFile)* for more explanations on the comparisons file.
> - The `-t` argument is mandatory. It defines whether your library is unstranded or stranded; see <https://chipster.csc.fi/manual/library-type-summary.html>.
> - The `-l` argument defines the length of your reads (according to the number of cycles of the sequencing run). It is mandatory in order to build an appropriate index for the reference genome.
> - The `-p` argument is mandatory and should be the path of your **project.json** file created previously. The script will match the sample names defined in the [samplesheet](usage/inputs#samplesheet) against the names defined in the **project.json** file.
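To make the two tab-delimited inputs concrete, here is a minimal sketch that writes a samplesheet (columns "name condition") and a comparisons file (columns "condition1 condition2"); the sample and condition names are purely illustrative:

```shell
# Samplesheet: columns "name condition", tab separated, no header.
# Names must match the sample names extracted from the fastq filenames.
printf '96_TREATED_1_chr22\tTREATED\n' > samplesheet.tsv
printf '80_CT_1_chr22\tCT\n' >> samplesheet.tsv

# Comparisons file: columns "condition1 condition2", tab separated, no header.
printf 'TREATED\tCT\n' > comparisons.tsv
```

Using `printf` guarantees real tab characters, which some editors silently convert to spaces.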
|
|
|
|
|
|
The program outputs the configuration file on stdout. Run the command a first time to check that everything looks right, then run it a second time and redirect the output to a file.
|
|
|
|
|
|
```
python scripts/make_rna_config.py -s <my_samplesheet> -r <key_of_references.json> -w <path_to_workdir> -t <library_type> -l <read_length> -n <project_name> -p <path_to_project.json> -c <comparisons_file> > config.json
```
|
|
|
|
|
|
|
|
|
|
|
|
In any case, check the generated configuration file to make sure everything looks right:
|
|
|
```
$ cat config.json

{
    "project-name": "test-project",
    "outdir": "Results",
    "prinseq-meanquality": "30",
    "cutadapt-forward": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
    "cutadapt-reverse": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT",
    "library-type": "first-strand",
    "read-length": "100",
    "align-cpu": "1",
    "reference":
    {
        "name": "Ensembl_GRCh37",
        "description": "Homo sapiens Ensembl GRCh37",
        "STARindexDir": "CONFIG/genome",
        "fasta": "CONFIG/genome/human_g1k_v37.chr22.fasta",
        "gtf": "CONFIG/genome/chr22.gff",
        "biomart": "37,hsapiens_gene_ensembl"
    },
    "samplesCondition": [
        {
            "name": "80_CT_1_chr22",
            "condition": "CT"
        },
        {
            "name": "81_CT_2_chr22",
            "condition": "CT"
        },
        {
            "name": "82_CT_3_chr22",
            "condition": "CT"
        },
        {
            "name": "83_CT_4_chr22",
            "condition": "CT"
        },
        {
            "name": "96_TREATED_1_chr22",
            "condition": "TREATED"
        },
        {
            "name": "97_TREATED_2_chr22",
            "condition": "TREATED"
        },
        {
            "name": "98_TREATED_3_chr22",
            "condition": "TREATED"
        },
        {
            "name": "99_TREATED_4_chr22",
            "condition": "TREATED"
        }
    ],
    "comparisons":
    {
        "TREATED__vs__CT": {
            "minLogFC": 0.58,
            "condition1": "TREATED",
            "condition2": "CT"
        }
    }
}
```
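Beyond eyeballing the file, you can at least confirm it parses as valid JSON with Python's standard `json.tool` module (the same one used earlier for project.json). A small helper sketch; substitute the path to your actual config.json:

```shell
# check_json FILE: report whether FILE parses as valid JSON.
check_json () {
    if python3 -m json.tool "$1" > /dev/null 2>&1; then
        echo "valid JSON: $1"
    else
        echo "parse error (or missing file): $1"
    fi
}

check_json config.json
```

This only validates syntax; it will not catch a wrong path or a missing key, so still read the file.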
|
|
|
|
|
|
## Launching the snakemake pipeline
|
|
|
|
|
|
Test the launch with a dry run:
|
|
|
```
snakemake --config proj="project.json" conf="config.json" -rpn
```
|
|
|
where:

- `--config` injects the configuration files into the snakefile
|
|
|
|
|
|
If you see the rules and commands that will be run, everything's fine.
|
|
|
|
|
|
Launch the run on a personal computer (**only for test data**):
|
|
|
```
snakemake --config proj="project.json" conf="config.json" -rp -j 2
```
|
|
|
|
|
|
> **Note:**
> - You can specify the number of parallel jobs with `-j <N>`.
> - :warning: Beware that even if you don't ask for multiple jobs, two scripts in the pipeline are still parallelized, which means you can crash the computer. The pipeline has been built to run on an HPC.
> - :warning: The alignment step will consume a lot of memory. Do not run it on real data with a real reference genome.
|
|
|
|
|
|
### Running on a cluster
|
|
|
|
|
|
If you want to launch the pipeline on a cluster, you have to provide a job script that wraps the jobs submitted by snakemake.

Example for SGE:
|
|
|
```
snakemake --config proj="project.json" conf="config.json" --cluster "qsub -e ./logs/ -o ./logs/" -j 30 --jobscript SCRIPTS/sge.sh --latency-wait 100 -rp --resources parallel_star=3
```
|
|
|
|
|
|
> **Note:**
> - The path to the log output files must **exist** (`$ mkdir ./logs`).
> - `--resources` limits the number of parallel alignments to 3 in order not to consume all the memory. Increase this number carefully if your cluster has a lot of RAM.
|
|
|
|
|
|
where `SCRIPTS/sge.sh` is a wrapper for the SGE jobs:
|
|
|
|