Syntelog identification in hexaploid wheat

This is an example of running the syntelogfinder pipeline on an example. We will use a wheat long-read RNA-seq dataset from the cultivar AK58.

Part 1: Preparing the Phased Reference Genome

The chromosome names are not so nice, so we will rename them:

e.g. GWHANRF00000001 –> chr1_A
e.g. GWHANRF00000002 –> chr1_B
e.g. GWHANRF00000003 –> chr1_C

Now we are ready to run the syntelog finder pipeline.

Part 2: Running the Syntelog Finder Pipeline

Install nextflow and conda
Prepare the params.config file

params/wheatAK58.json

{
  "reference_fasta": "/scratch/nadjafn/LR_DESIREE_PAPER/ANALYSIS/wheat_example/genome/GWHANRF00000000.renamed.fasta",
  "reference_gff": "/scratch/nadjafn/LR_DESIREE_PAPER/ANALYSIS/wheat_example/genome/GWHANRF00000000.renamed.gff",
  "ploidy": 3,
  "outdir": "/DKED/scratch/nadjafn/potato-allelic-orthogroups/output_wheat"
}

nextflow run main.nf -resume -params-file params/wheatAK58.json -c conf/nextflow.config -profile conda -bg

Why ploidy 3? It is a hexaploid species but we only have A, B and D subgenomes.

Results

The main output we are interested in is the syntelogfinder/output_wheat/03_GENESPACE directory, which contains these three files:

GWHANRF00000000.renamed_genespace.pie_chart.svg

GWHANRF00000000.renamed_genespace_combined_barplots.svg

We can see here that the exon lengths are very different between the genes in the 1hapA_1hapB_1hapD_s synteny category, but the exon lengths are more similar within each haplotype, with most of them having the same lengths.

So to avoid any bias in read mapping to the longest haplotype (if on the other haplotypes the transcript is too short) we will modify the gff3 file to “chop” the UTRs off that more transcripts have the same length.

Part 3: Long-Read RNA-Seq Analysis

Prepare the assets/sample.csv file: .. code-block:: text

sample,fastq_1 SRR33004955,fastq/SRR33004955.fastq SRR33004956,fastq/SRR33004956.fastq SRR33004957,fastq/SRR33004957.fastq SRR33004958,fastq/SRR33004958.fastq

The wheat is very large so we need to use the option --large_genome to choose the right mapping options.

nextflow run main.nf -resume -profile singularity \
                    --input assets/samplesheet_AK58.csv \
                    --outdir output_wheat_AK58 \
                    --fasta genome/GWHANRF00000000.renamed.fasta \
                    --gtf  GWHANRF00000000.renamed.cds2exon.gtf \
                    --centrifuge_db centrifuge/dbs_v2018/ \
                    --sqanti_dir sqanti3/release_sqanti3 \
                    --sqanti_test -bg --technology PacBio --large_genome