longrnaseq ========== .. image:: /_static/images/longrnaseq.png :alt: Just keep smiling Introduction ------------ **longrnaseq** is a bioinformatics pipeline that processes long-read RNA sequencing data. The pipeline performs quality control, alignment, classification, contamination detection, and transcript quantification for long-read RNA-seq data from multiple samples. The pipeline includes the following main steps: 1. Read QC (`FastQC `_) 2. Present QC for samples (`MultiQC `_) 3. Genome alignment (`Minimap2 `_) 4. Contamination detection (`Centrifuge `_) 5. Comparion of samples by transcript classification (`SQANTI-reads `_) 6. Transcript quantification (`Oarfish `_) and gene-level summarization Dependencies ------------ An environment with nextflow (>=24.04.2) and Singularity installed. **Note:** If you want to run SQANTI-reads quality control, you will also need to: - Install all `SQANTI3 dependencies `_ in the same environment as nextflow/nf-core environment (sorry there is not functional container for nextflow at the moment..) *Important*: for converting output to html poppler also need to be installed: ``conda install poppler`` - Clone the `SQANTI3 git repository `_ and provide the directory as input. v ==5.5.4 For running Centrifuge, you also need to create a `Centrifuge database `_. Both of these can be skipped with ``--skip_sqanti`` and ``--skip_centrifuge`` Usage ----- 1. Clone the repository of the pipeline ``git clone https://github.com/nadjano/longrnaseq.git`` 2. Prepare a samplesheet with your input data that looks as follows: ``samplesheet.csv``: .. code-block:: text sample,fastq_1 SAMPLE1,sample1.fastq.gz SAMPLE2,sample2.fastq.gz Each row represents a sample with one fastq file. Running the Pipeline -------------------- Required Parameters ~~~~~~~~~~~~~~~~~~~ The pipeline requires the following mandatory parameters: - ``--input``: Path to samplesheet CSV file - ``--outdir``: Output directory path - ``--fasta``: Path to reference genome FASTA file - ``--gtf``: Path to GTF annotation file (for BAMBU to get the right output with gene_id!) - ``--centrifuge_db``: Path to Centrifuge database - ``--sqanti_dir``: Path to SQANTI3 directory - ``--technology``: ONT or PacBio, sets minimap2 parameters for read mapping Note about gtf file ^^^^^^^^^^^^^^^^^^^ gtf-version 3 should include features: gene, transcript, exon, CDS Profile Support ~~~~~~~~~~~~~~~ Currently, only the ``singularity`` profile is supported. Use ``-profile singularity`` in your command. Example Command ~~~~~~~~~~~~~~~ .. code-block:: bash nextflow run main.nf -resume -profile singularity \ --input assets/samplesheet.csv \ --outdir results \ --fasta /path/to/genome.fa \ --gtf /path/to/annotation.gtf \ --centrifuge_db /path/to/centrifuge_db \ --sqanti_dir /path/to/sqanti3 \ --technology ONT/PacBio \ Optional Parameters ~~~~~~~~~~~~~~~~~~~ - ``--skip_deseq2_qc``: Skip deseq2, when only one sample is present deseq2 will fail [default: false] - ``--skip_sqanti``: Skip sqanit and sqanti reads [default: false] - ``--skip_centrifuge``: Skip centrigure [default: false] - ``-bg``: Run pipeline in background - ``-resume``: Resume previous run from where it left off - ``--downsample_rate``: fraction between 0-1 for downsampling before running SQANTI3 to reduce runtime and for vizualization to have smaller files [default: 0.05] - ``--large_genome``: In case minimap2 fails druing genome indexing, this can be due to large genomes and long chromosomes. [default: false] Pipeline output --------------- The main output is a MultiQC.html and oarfish transcript and gene counts. An example MultiQC report can be found `here <_static/multiqc_report.html>`_ Running on HPC -------------- For running the pipeline on a HPC (e.g SLURM) you need to add some configuartion to the nextflow.config file e.g:: process.executor = 'slurm' process.clusterOptions = '--qos=short' # if you have to submit to a specific queue Test Run -------- A test dataset is available for testing and demonstration purposes. This dataset contains a phased genome assembly and annotation for chromosome 1 across all haplotypes of the tetraploid potato cultivar Atlantic. * long-read RNA-seq fastq files: Download from SRA the samples: SRR14993893 and SRR14993894. * genome and annotation files: `fasta `_, `gtf `_ First add samples to sample sheet, download the annotation files and then run the pipeline like this: .. code-block:: bash nextflow run main.nf -profile singularity \ --input assets/samplesheet.csv \ --outdir output_test \ --fasta test_data/ATL_v3.asm.with_chloroplast_and_mito.fa \ --gtf test_data/unitato2Atl.with_chloroplast_and_mito.no_scaffold.agat.gtf \ --technology ONT --downsample_rate 0.99 --skip_centrifuge --skip_sqanti -resume This should finish in less than one hour (running with 30 cpu) including pulling of singularity images. Contributions and Support ------------------------- If you would like to contribute to this pipeline, please get in touch nadja.franziska.nolte[at]nib.si Citations --------- .. TODO nf-core: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file. .. If you use nf-core/plantlongrnaseq for your analysis, please cite it using the following doi: `10.5281/zenodo.XXXXXX `_ .. TODO nf-core: Add bibliography of tools and data used in your pipeline An extensive list of references for the tools used by the pipeline can be found in the `CITATIONS.md `_ file. You can cite the ``nf-core`` publication as follows: **The nf-core framework for community-curated bioinformatics pipelines.** Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. *Nat Biotechnol.* 2020 Feb 13. doi: `10.1038/s41587-020-0439-x `_.