syntelogfinder

Nextflow pipeline that groups orthologous and syntelogous genes across the haplotypes of phased polyploid assemblies, based on GENESPACE results.

Requirements

nextflow
conda

The following packages are not available from bioconda/pip, so they need to be installed manually when running with -profile conda (this is not necessary with singularity):

MCScanX (follow instructions here and provide the path to the installation via --mcscanx_path)
GENESPACE (instructions), installed inside the conda environment genespace-env (syntelogfinder/modules/local/genespace/genespace_run/environment.yml)

Minimal input

parameter file (params.json)
genome fasta of the phased reference (chromosome names like this: >chr[_]01_1, >chr[_]01_2, where the _suffix is the haplotype)
gff/gtf with CDS corresponding to the reference (same chromosome names!)
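
Before starting a run, the chromosome naming convention can be checked with a short script. The sketch below is not part of the pipeline. It assumes that ">chr[_]01_1" means "chr", an optional underscore, a chromosome number, an underscore, and a haplotype number, and that every chromosome should have exactly as many haplotypes as the ploidy value. The function name check_fasta_headers is hypothetical.

```python
import re
from collections import defaultdict

# Assumed reading of ">chr[_]01_1": "chr", optional underscore,
# chromosome number, underscore, haplotype number.
CHROM_PATTERN = re.compile(r"^chr_?(\d+)_(\d+)$")


def check_fasta_headers(lines, ploidy):
    """Return (bad_names, incomplete) for an iterable of FASTA lines.

    bad_names:  sequence names that do not match the assumed convention.
    incomplete: chromosomes whose number of haplotypes differs from ploidy.
    """
    bad_names = []
    haplotypes = defaultdict(set)
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].split()
        if not parts:
            continue  # empty header, nothing to check
        name = parts[0]  # the sequence ID is the first token after ">"
        match = CHROM_PATTERN.match(name)
        if match is None:
            bad_names.append(name)
        else:
            haplotypes[match.group(1)].add(match.group(2))
    incomplete = {
        chrom: sorted(haps)
        for chrom, haps in haplotypes.items()
        if len(haps) != ploidy
    }
    return bad_names, incomplete


example = [">chr01_1", "ACGT", ">chr01_2", "ACGT", ">chr_02_1", "ACGT", ">scaffold_7", "ACGT"]
bad, incomplete = check_fasta_headers(example, ploidy=2)
print("names not following the convention:", bad)
print("chromosomes with unexpected haplotype count:", incomplete)
```

For a real file, pass an open file handle instead of the example list, for instance check_fasta_headers(open("genome.fa"), ploidy=3). The same check on the first column of the gff/gtf would confirm that both files use identical chromosome names.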

The gff file should look like the example at https://agat.readthedocs.io/en/latest/gff_to_gtf.html#the-gff-file-to-convert and contain the following features:

gene
mRNA
exon
CDS

Or a gtf file with the following features:

gene
mRNA/transcript
exon
CDS

Mandatory Attributes

gene_id - must be present on ALL lines
transcript_id - required for transcript, exon, CDS features
Parent - links child features to parent
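
The gene_id and transcript_id rules can be checked line by line before running the pipeline. The following is a minimal sketch for GTF-style attributes (key "value";) only. It is not the validation the pipeline itself performs, the function name check_gtf_line is hypothetical, and treating mRNA the same as transcript is an assumption based on the feature list above.

```python
import re

# GTF attributes look like: gene_id "G1"; transcript_id "G1.1";
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')

# Features that need a transcript_id; including mRNA is an assumption.
NEEDS_TRANSCRIPT_ID = {"transcript", "mRNA", "exon", "CDS"}


def check_gtf_line(line):
    """Return a list of problems found on one GTF line (empty if none)."""
    if line.startswith("#") or not line.strip():
        return []  # comments and blank lines carry no attributes
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return ["expected 9 tab-separated columns"]
    feature = fields[2]
    attributes = dict(ATTRIBUTE.findall(fields[8]))
    problems = []
    if "gene_id" not in attributes:
        problems.append("missing gene_id")
    if feature in NEEDS_TRANSCRIPT_ID and "transcript_id" not in attributes:
        problems.append("missing transcript_id")
    return problems


line = 'chr01_1\tsrc\tCDS\t1\t100\t.\t+\t0\tgene_id "G1";'
print(check_gtf_line(line))
```

Looping over a file and printing only the lines that return a non-empty list gives a quick report of records that need fixing.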

The params.json should look like this:

{
    "reference_fasta": "genome.fa",
    "reference_gff": "annotation.gff",
    "ploidy": 3,
    "outdir": "output_path"
}
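
A parameter file can be sanity-checked before launching Nextflow. The sketch below assumes that the four keys shown above are all required and that ploidy must be a positive integer. Both are inferred from the example rather than taken from the pipeline's own schema, and check_params is a hypothetical helper.

```python
import json

# Keys taken from the example params.json above; treating all four as
# required is an assumption.
REQUIRED_KEYS = ("reference_fasta", "reference_gff", "ploidy", "outdir")


def check_params(params):
    """Return a list of problems in a params dict (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in params]
    if "ploidy" in params:
        ploidy = params["ploidy"]
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(ploidy, bool) or not isinstance(ploidy, int) or ploidy < 1:
            problems.append("ploidy should be a positive integer")
    return problems


text = """
{
    "reference_fasta": "genome.fa",
    "reference_gff": "annotation.gff",
    "ploidy": 3,
    "outdir": "output_path"
}
"""
print(check_params(json.loads(text)))
```

To check a file on disk, replace json.loads(text) with json.load(open("params/params.json")).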

Usage

Run like this (after cloning the repository):

nextflow run main.nf -params-file params/params.json \
                     -profile singularity \
                     -resume

or with conda:

nextflow run main.nf -params-file params/params.json \
                     -profile conda \
                     --mcscanx_path [path to MCScanX installation] \
                     -resume

Test data

A test dataset is available for testing and demonstration purposes. This dataset contains a phased genome assembly and annotation for chromosome 1 across all haplotypes of the tetraploid potato cultivar Atlantic.

fasta
gtf

The params.json should look like this:

{
    "reference_fasta": "ATL_v3.asm.chr01_all_haplotypes.fa",
    "reference_gff": "ATL_unitato_liftoff.chr01_all_haplotypes.gtf",
    "ploidy": 4,
    "outdir": "output_path"
}

Running Syntelogfinder on test data

After downloading the fasta and gtf files and preparing the parameter file, the pipeline can be run like this:

git clone https://github.com/NIB-SI/syntelogfinder.git --branch v1.0.0
cd syntelogfinder
conda create -n nextflow -c bioconda nextflow
conda activate nextflow
nextflow run main.nf \
  -params-file params/params_test.json \
  -profile singularity \
  --run_blast \
  -resume

Expected runtime: 10 minutes (if all singularity images are already pulled)

Tutorial

Output

Sample Output

The pipeline generates a tab-separated file with the following columns:

| Column | Description |
|---|---|
| `gene_id` | Gene identifier |
| `transcript_id` | Transcript identifier |
| `Synt_id` | Synteny group identifier |
| `synteny_category` | Summary of syntenic gene distribution across haplotypes |
| `syntenic_genes` | Comma-separated list of all syntenic genes |
| `haplotype` | Haplotype assignment |
| `CDS_length_category` | CDS length classification (if applicable) |
| `CDS_haplotype_with_longest_annotation` | Haplotype with the longest CDS annotation (if applicable) |
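
For downstream analysis, the table can be read with Python's standard csv module. The sketch below groups gene identifiers by Synt_id. It assumes a tab-separated file with a header row that uses the column names listed above. The function name genes_by_synteny_group is hypothetical and the inline rows are shortened for illustration.

```python
import csv
import io
from collections import defaultdict


def genes_by_synteny_group(handle):
    """Map each Synt_id to the list of gene_id values assigned to it."""
    reader = csv.DictReader(handle, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["Synt_id"]].append(row["gene_id"])
    return dict(groups)


# Small inline table standing in for the pipeline's output file.
table = (
    "gene_id\ttranscript_id\tSynt_id\thaplotype\n"
    "TraesAK58CH7A01G122800\tTraesAK58CH7A01G122800.1\tSynt_id_0\thapA\n"
    "TraesAK58CH1A01G005100\tTraesAK58CH1A01G005100.1\tSynt_id_1\thapA\n"
    "TraesAK58CH3A01G490400\tTraesAK58CH3A01G490400.1\tSynt_id_1\thapA\n"
)
for synt_id, genes in genes_by_synteny_group(io.StringIO(table)).items():
    print(synt_id, genes)
```

With a real result file, pass open(path, newline="") instead of the io.StringIO object.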

Example Output

gene_id     transcript_id   Synt_id synteny_category        syntenic_genes  haplotype       CDS_length_category     CDS_haplotype_with_longest_annotation
TraesAK58CH7A01G122800      TraesAK58CH7A01G122800.1        Synt_id_0       1hapA_3hapB_1hapD_no_s  TraesAK58CH7A01G122800.1,TraesAK58CH1B01G017800.1,TraesAK58CH4B01G024800.1,TraesAK58CH2B01G118200.1,TraesAK58CH2D01G119400.1    hapA
TraesAK58CH1A01G005100      TraesAK58CH1A01G005100.1        Synt_id_1       2hapA_1hapB_2hapD_no_s  TraesAK58CH1A01G005100.1,TraesAK58CH3A01G490400.1,TraesAK58CH1B01G017500.1,TraesAK58CH1D01G000500.1,TraesAK58CH7D01G525700.1    hapA
TraesAK58CH3A01G490400      TraesAK58CH3A01G490400.1        Synt_id_1       2hapA_1hapB_2hapD_no_s  TraesAK58CH1A01G005100.1,TraesAK58CH3A01G490400.1,TraesAK58CH1B01G017500.1,TraesAK58CH1D01G000500.1,TraesAK58CH7D01G525700.1    hapA
TraesAK58CH3A01G236000      TraesAK58CH3A01G236000.1        Synt_id_2       1hapA_1hapB_2hapD_no_s  TraesAK58CH3A01G236000.1,TraesAK58CH1B01G017200.1,TraesAK58CH1D01G000900.1,TraesAK58CH5D01G521100.1     hapA
TraesAK58CH1A01G006000      TraesAK58CH1A01G006000.1        Synt_id_3       3hapA_0hapB_1hapD_no_s  TraesAK58CH1A01G006000.1,TraesAK58CH3A01G436000.1,TraesAK58CH5A01G002300.1,<NA>,TraesAK58CH1D01G365400.1        hapA
TraesAK58CH3A01G436000      TraesAK58CH3A01G436000.1        Synt_id_3       3hapA_0hapB_1hapD_no_s  TraesAK58CH1A01G006000.1,TraesAK58CH3A01G436000.1,TraesAK58CH5A01G002300.1,<NA>,TraesAK58CH1D01G365400.1        hapA

Key Features

Each gene is assigned to a synteny group (Synt_id)
The synteny_category shows the distribution pattern (e.g., 2hapA_1hapB_2hapD_no_s means 2 genes in hapA, 1 in hapB, 2 in hapD, with no specific pattern)
Missing syntenic genes are indicated with <NA>
All syntenic gene members are listed in the syntenic_genes column
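
The synteny_category string can be split into per-haplotype counts for filtering or plotting. The sketch below infers the format from the single example above: leading tokens of the form <count><haplotype> joined by underscores, followed by a free-text suffix such as no_s. Other category layouts may exist and would need adjustments, and parse_synteny_category is a hypothetical helper.

```python
import re

# A count token such as "2hapA": a number followed by a haplotype label.
COUNT_TOKEN = re.compile(r"^(\d+)(hap\w+)$")


def parse_synteny_category(category):
    """Split a category into ({haplotype: count}, remaining_suffix)."""
    counts = {}
    suffix = []
    for token in category.split("_"):
        match = COUNT_TOKEN.match(token)
        # Once the suffix has started, every later token belongs to it.
        if match and not suffix:
            counts[match.group(2)] = int(match.group(1))
        else:
            suffix.append(token)
    return counts, "_".join(suffix)


counts, suffix = parse_synteny_category("2hapA_1hapB_2hapD_no_s")
print(counts)  # {'hapA': 2, 'hapB': 1, 'hapD': 2}
print(suffix)  # no_s
```

A group with exactly one gene per haplotype can then be selected with all(n == 1 for n in counts.values()).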

Plots

1. Pie Chart (Genespace_pie_chart.svg)

Shows the distribution of syntelog categories.

2. Combined Bar Plots (Genespace_combined_barplots.svg)

Displays detailed statistics for each syntelog category. Exon lengths should be very similar between gene pairs since we used lifted annotations.

Troubleshooting

If the GENESPACE process is interrupted, rerunning with the -resume flag will fail. To keep the cached results of the other processes, delete the GENESPACE work directory before resuming.