... | ... | @@ -3,7 +3,7 @@ |
|
|
# Re-analyzing data
|
|
|
|
|
|
**Snakemake** is able to re-analyze data based on already generated results.
|
|
|
If you have been provided a zip or tar archive with analyzed data, you can re-analyze it without having the original [input file](usage/inputs).
|
|
|
If you have been provided a zip or tar archive with analyzed data, you can re-analyze it without having the original/raw [input file](usage/inputs).
|
|
|
|
|
|
## The configuration file
|
|
|
|
... | ... | @@ -12,15 +12,15 @@ In order to create the configuration file needed to run the snakemake pipeline, |
|
|
### The samplesheet (and comparisons file)
|
|
|
|
|
|
You can either
|
|
|
- **create the samplesheet from scratch** making sure the name of the samples correspond to the ones found in the `CUTADAPT` folder of your previously analyzed data and that the project (column 4) is the name of the folder of this data.
|
|
|
- **create the samplesheet from scratch**, making sure that the sample names correspond to the ones found in the `FASTQ` folder (`CUTADAPT` folder if the data was provided before mid-2022) of your previously analyzed data, and that the project (column 4) is the name of the folder containing this data.
|
|
|
- **generate the samplesheet from a previous configuration file** by using the script `SCRIPTS/config2inputs.py`.
|
|
|
- **use the samplesheet provided in the "INPUT_FILES" folder** of your previously analyzed data.
|
|
|
- **use the samplesheet provided in the `INPUT_FILES` folder** of your previously analyzed data.
|
|
|
|
|
|
### Creating the configuration file
|
|
|
|
|
|
|
|
|
Without the original raw fastq files (not demultiplexed files), you need to use the `-a` option of the `SCRIPTS/make_srp_config.py` script.
|
|
|
You also need to make sure you define the output directory with the `-w` argument as the folder **containing** your previously analyzed folder.
|
|
|
For example, if the directory structure is like:
|
|
|
Without the original raw fastq files (undemultiplexed files), you need to use the `-a` option of the `SCRIPTS/make_srp_config.py` script.
|
|
|
You also need to make sure you define the output directory with the `-w` argument as the folder **containing** your previously analyzed folder (i.e. the parent directory).
|
|
|
For example, if the directory structure looks like:
|
|
|
|
|
|
```sh
|
|
|
📦MYPROJECT # main output folder specified with '-w' argument
|
... | ... | @@ -36,22 +36,44 @@ For example, if the directory structure is like: |
|
|
┃ ┣ 📂REPORT # necessary files for report (js, css, etc...)
|
|
|
┃ ┗ 📜report.html #### MAIN REPORT FOR PROJECT
|
|
|
```
|
|
|
then you must specify `MYPROJECT` with the `-w` option and `NTS-XXX` in the 4th column of your samplesheet. You may have multiple project folder.
|
|
|
- then you must specify the path of the `MYPROJECT` folder with the `-w` option.
|
|
|
- `NTS-XXX` must be the value in the 4th column for the samples described in your samplesheet.
|
|
|
|
|
|
Example:
|
|
|
You may have multiple project folders (`NTS-XXX_Y`, `NTS-XXX_Z`, etc.) under the parent directory `MYPROJECT`. In that case, if you need to analyze them in one run, the 4th column of the samplesheet will also contain multiple projects.
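
The layout requirement above can be checked before the configuration file is generated. The following is a minimal sketch, not part of the pipeline: it assumes a tab-separated samplesheet without a header line, with the project name in the 4th column, and the function name `check_project_dirs` is hypothetical.

```sh
# Hypothetical helper (not part of the pipeline): verify that every project
# named in column 4 of the samplesheet exists as a folder under the parent
# directory that will be passed to `-w`.
# Assumption: the samplesheet is tab-separated and has no header line.
check_project_dirs() {
    samplesheet=$1
    workdir=$2
    status=0
    # Collect the distinct project names from the 4th column.
    for project in $(awk -F'\t' 'NF >= 4 { print $4 }' "$samplesheet" | sort -u); do
        if [ ! -d "$workdir/$project" ]; then
            echo "missing project folder: $workdir/$project" >&2
            status=1
        fi
    done
    return $status
}
```

It could be called as `check_project_dirs <my_samplesheet> <path_to_workdir>`. A non-zero exit status means at least one project folder was not found under the parent directory.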
|
|
|
|
|
|
Use the script `SCRIPTS/make_srp_config.py` to create your configuration file. Example:

```sh
python SCRIPTS/make_srp_config.py -s <my_samplesheet> -r <path_to_reference_folder> -w <path_to_workdir> -c <comparisons_file> -a > config.json
```
|
|
|
|
|
|
## Running the pipeline
|
|
|
|
|
|
### Testing the configuration file
|
|
|
Test your configuration with a dry run:
|
|
|
|
|
|
```sh
snakemake -nrp --config conf="config.json"
```

If everything is fine, the pipeline **SHOULD NOT** run the `split_fastq` rule as it should find the already created `XXX.fastq.gz` in the `FASTQ` directory of your previously analyzed data. If this is not the case, have a look at the reasons why snakemake wants to create these files again by looking at the output of the dry run. My guess is that you did not specify the parent directory of the NTS-XXX folder with the `-w` argument of the `make_srp_config.py` script.

```sh
snakemake --config conf="config.json" --use-conda -rp -n
```
|
|
|
If everything is fine, the pipeline **SHOULD NOT** run the `split_fastq` rule, as it should find the already created `XXX.fastq.gz` in the `FASTQ` directory of your previously analyzed data. If the `split_fastq` rule appears in the job listing, look at the dry-run output to see why snakemake wants to create these files again. The most likely cause is that the parent directory of the `NTS-XXX` folder was not specified correctly with the `-w` argument of the `make_srp_config.py` script.
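
The check described above can be automated by searching the dry-run output for the rule name. This is a minimal sketch under stated assumptions: the helper name `check_no_split_fastq` is hypothetical, and the usage line redirects standard error into the pipe so that the output is captured whichever stream snakemake writes it to.

```sh
# Hypothetical helper (not part of the pipeline): read dry-run output on
# standard input and fail if the `split_fastq` rule is scheduled.
check_no_split_fastq() {
    # grep reads all of standard input; its exit status is 0 on a match.
    if grep 'split_fastq' >/dev/null; then
        echo "split_fastq is scheduled: check the -w argument" >&2
        return 1
    fi
    echo "split_fastq is not scheduled"
}

# Usage (assumes snakemake and config.json are available):
# snakemake --config conf="config.json" --use-conda -rp -n 2>&1 | check_no_split_fastq
```

The function only looks for the rule name in the text, so it does not replace reading the reasons printed by the dry run when the rule does appear.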
|
|
|
|
|
|
You can now launch the pipeline in cluster mode according to the main page of the project:
|
|
|
https://gitlab.univ-nantes.fr/bird_pipeline_registry/srp-pipeline
|
|
|
### Updating timestamps of the files
|
|
|
Since you probably downloaded your data, the timestamps of the files will not reflect the order in which they were created. Snakemake can touch the files to put them back in order, which avoids re-executing all the rules.
|
|
|
```
|
|
|
snakemake --config conf="config.json" --use-conda -rp --touch -j 5
|
|
|
```
|
|
|
|
|
|
### Running in cluster mode
|
|
|
You can now launch the pipeline in cluster mode.
|
|
|
|
|
|
```
|
|
|
snakemake --config conf="config.json" --cluster "qsub -e ./logs/ -o ./logs/" -j 33 --jobscript SCRIPTS/sge.sh --latency-wait 100 --use-conda -rp
|
|
|
```
|
|
|
> **Note:**
|
|
|
> - You can specify the number of jobs with `-j <N>`.
|
|
|
> - :warning: Beware that even if you don't specify multiple jobs, two scripts in the pipeline are still parallelized.
|
|
|
> - The directory for the log output files must **exist** (`$ mkdir ./logs`).
|
|
|
> - If your cluster uses a job scheduler other than SGE, take a look at [this page](https://snakemake.readthedocs.io/en/stable/executing/cluster.html).
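
The log directory requirement from the note can be handled just before launching. This is a small sketch rather than part of the pipeline: the function name `ensure_log_dir` is hypothetical, and `./logs` is taken from the `qsub -e ./logs/ -o ./logs/` options shown above.

```sh
# Hypothetical helper (not part of the pipeline): create the directory that
# receives the qsub log files if it does not exist yet.
ensure_log_dir() {
    log_dir=${1:-./logs}
    # -p makes the call succeed when the directory is already present.
    mkdir -p "$log_dir"
}
```

It can be chained in front of the cluster command, for example `ensure_log_dir ./logs && snakemake ...`, so the run does not start when the directory cannot be created.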
|
|
|
|
|
|
#### Re-analyzing with sample splitting into multiple projects
|
|
|
|
... | ... | |