Syntelog identification in hexaploid wheat
==========================================

This tutorial runs the syntelogfinder pipeline on an example dataset: long-read RNA-seq data from the wheat cultivar AK58.

Part 1: Preparing the Phased Reference Genome
**********************************************************

The reference genome and its annotation are available from:

* **fasta**: https://download.cncb.ac.cn/gwh/Plants/Triticum_aestivum_1_GWHANRF00000000/GWHANRF00000000.genome.fasta.gz
* **gff**: https://download.cncb.ac.cn/gwh/Plants/Triticum_aestivum_1_GWHANRF00000000/GWHANRF00000000.gff.gz

The chromosome names in these files are accession IDs and hard to read, so we rename them to encode the chromosome number and the subgenome:

* e.g. GWHANRF00000001 --> chr1_A
* e.g. GWHANRF00000002 --> chr1_B
* e.g. GWHANRF00000003 --> chr1_D

Now we are ready to run the syntelog finder pipeline.

Part 2: Running the Syntelog Finder Pipeline
**********************************************************

1. Install nextflow and conda.
2. Prepare the parameter file ``params/wheatAK58.json``:

   .. code-block:: json

      {
          "reference_fasta": "/scratch/nadjafn/LR_DESIREE_PAPER/ANALYSIS/wheat_example/genome/GWHANRF00000000.renamed.fasta",
          "reference_gff": "/scratch/nadjafn/LR_DESIREE_PAPER/ANALYSIS/wheat_example/genome/GWHANRF00000000.renamed.gff",
          "ploidy": 3,
          "outdir": "/DKED/scratch/nadjafn/potato-allelic-orthogroups/output_wheat"
      }

3. Run the pipeline:

   .. code-block:: bash

      nextflow run main.nf -resume -params-file params/wheatAK58.json -c conf/nextflow.config -profile conda -bg

Why ploidy 3? Wheat is hexaploid, but its genome consists of three subgenomes (A, B and D), and the reference contains one copy of each. We therefore treat the three subgenomes as three haplotypes.

Results
-------

The main output we are interested in is the ``syntelogfinder/output_wheat/03_GENESPACE`` directory, which contains the following files:

* ``GWHANRF00000000.renamed_genespace.pie_chart.svg``

  .. figure:: /_static/images/tutorial/GWHANRF00000000.renamed_genespace_pie_chart.svg
     :width: 50%
     :align: center
     :alt: Syntelog categories pie chart

* ``GWHANRF00000000.renamed_genespace_combined_barplots.svg``

  .. figure:: /_static/images/tutorial/GWHANRF00000000.renamed_genespace_combined_barplots.svg
     :width: 100%
     :align: center
     :alt: Syntelog categories combined bar plots

The exon lengths differ widely between the genes in the ``1hapA_1hapB_1hapD_s`` synteny category, but they are more similar within each haplotype, with most of them having the same length.

.. figure:: /_static/images/tutorial/wheat_example_different_UTRlengths.png
   :width: 70%
   :align: center
   :alt: Different UTR lengths

To avoid biasing read mapping toward the haplotype with the longest transcript (when the transcript on the other haplotypes is shorter), we modify the GFF3 file to "chop" off the UTRs so that more transcripts have the same length.

.. _longrnaseq-section:

Part 3: Long-Read RNA-Seq Analysis
**********************************************************

Prepare the ``assets/samplesheet_AK58.csv`` file:

.. code-block:: text

   sample,fastq_1
   SRR33004955,fastq/SRR33004955.fastq
   SRR33004956,fastq/SRR33004956.fastq
   SRR33004957,fastq/SRR33004957.fastq
   SRR33004958,fastq/SRR33004958.fastq

The wheat genome is very large, so we need the ``--large_genome`` option to select the right mapping options.

.. code-block:: bash

   nextflow run main.nf -resume -profile singularity \
       --input assets/samplesheet_AK58.csv \
       --outdir output_wheat_AK58 \
       --fasta genome/GWHANRF00000000.renamed.fasta \
       --gtf GWHANRF00000000.renamed.cds2exon.gtf \
       --centrifuge_db centrifuge/dbs_v2018/ \
       --sqanti_dir sqanti3/release_sqanti3 \
       --sqanti_test -bg \
       --technology PacBio \
       --large_genome
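
The chromosome renaming from Part 1 is described only by example. The following is a minimal Python sketch of one way to do it, not the method used to produce the ``renamed`` files in this tutorial. It assumes that the accessions are numbered consecutively in the order 1A, 1B, 1D, 2A, ... (extrapolated from the three examples; check this against the FASTA headers), and the function names ``build_rename_map``, ``rename_fasta`` and ``rename_gff`` are hypothetical helpers introduced for illustration.

.. code-block:: python

   def build_rename_map(prefix="GWHANRF", n_groups=7, subgenomes=("A", "B", "D")):
       """Map accession IDs to names of the form chr<group>_<subgenome>.

       ASSUMPTION: accessions are numbered consecutively in the order
       1A, 1B, 1D, 2A, ...; verify this against the actual FASTA headers.
       """
       mapping = {}
       index = 1
       for group in range(1, n_groups + 1):
           for sub in subgenomes:
               # Accession = prefix + zero-padded 8-digit counter.
               mapping[f"{prefix}{index:08d}"] = f"chr{group}_{sub}"
               index += 1
       return mapping


   def rename_fasta(lines, mapping):
       """Yield FASTA lines with each header replaced by the new name.

       Only the first whitespace-delimited token of a header is used as
       the sequence ID; the rest of the header description is dropped.
       IDs missing from the mapping are kept unchanged.
       """
       for line in lines:
           if line.startswith(">"):
               fields = line[1:].split()
               seq_id = fields[0] if fields else ""
               line = ">" + mapping.get(seq_id, seq_id) + "\n"
           yield line


   def rename_gff(lines, mapping):
       """Yield GFF lines with column 1 (seqid) renamed.

       Comment and directive lines (starting with '#') pass through
       unchanged, so directives such as ##sequence-region are not renamed.
       """
       for line in lines:
           if not line.startswith("#") and "\t" in line:
               seqid, rest = line.split("\t", 1)
               line = mapping.get(seqid, seqid) + "\t" + rest
           yield line

Both functions take any iterable of lines, so they can be fed from ``gzip.open(path, "rt")`` on the downloaded files and written out line by line, which avoids loading the whole genome into memory. The same mapping must be applied to the FASTA and the GFF so that the sequence names stay consistent between the two files.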