Tiley Lab - Plant Evolution

Polyploidy across populations, species, and communities

Software

Analysis Pipelines

Phasing target enrichment data from polyploids (PATÉ)

This is the Nextflow rewrite of PATÉ. The pipeline returns phased haplotype sequences for polyploids, with the caveat that subgenome divergence must be low enough for homeologous variation to be captured as biallelic variants.

PATÉ on GitHub

Polyploid Genotyping Pipeline (PGP)

This Nextflow pipeline simplifies the genotyping process with GATK using reasonable defaults and hard filtering. There is an option to pass the set of high-confidence variants on to subsequent polyploid genotyping models when the allelic dosage of polyploids is important.

PGP on GitHub

Packages

Polyploid Population Genomics ToolKit (PPGTK)

PPGTK is under active development and gaining features as the lab makes more of our work programmatic and reproducible. One component that is already stable and of near-term interest is the ploidy inference method.

PPGTK on GitHub

Subgenome phasing with Multi-Labeled RF distances (RF-Phase)

RF-Phase is meant for automating analyses of subgenome dominance in gene family evolution. You start with a species network and known ploidies, and use objective criteria from the MUL-RF distance to phase subgenomes without chromosome-level assemblies.

RF-Phase on GitHub

Older Products

These older links are kept here for the near term so that they remain findable. All have either been subsumed into ongoing lab projects with a more modern software design philosophy or been rendered obsolete by better options out there.

Phasing target enrichment data from polyploids

PATÉ is a pipeline for phasing sequence data from polyploids.

The pipeline gives users analysis-ready output and hopefully avoids a number of bioinformatic headaches for biologists.

Mixture models for detecting whole-genome duplications

Some mixture models in R might be useful for detecting ancient whole-genome duplications from genomic or transcriptomic data. The models have been recreated in some other packages that are more computationally efficient, but revisiting the base R code might be helpful in some cases.

I never made this into a proper R package on CRAN because I do not think the code is that novel - it is mostly repackaging pre-existing algorithms to implement some slightly different models. The code has no dependencies, so it is easy enough to run with a source call. The models can be used to analyze any data, for that matter; they do not have to be limited to the task of detecting whole-genome duplications.
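For readers unfamiliar with the general idea, here is a minimal sketch of fitting a two-component Gaussian mixture by expectation-maximization. It is not the lab's R code and does not reproduce its models. It is written in Python with the standard library only, and it assumes the input is something like log-transformed synonymous divergence (Ks) values between paralog pairs, a common input for this kind of analysis that is not stated above. The function names and the simulated data are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_mixture(values, n_iter=200):
    """Fit a two-component univariate Gaussian mixture with plain EM.

    Returns (weights, means, sds), each a list of length 2.
    Illustrative only: fixed iteration count, no convergence check,
    no model selection across numbers of components.
    """
    n = len(values)
    ordered = sorted(values)
    # Initialize means at the lower and upper quartiles, shared spread.
    means = [ordered[n // 4], ordered[(3 * n) // 4]]
    overall_mean = sum(values) / n
    sd0 = math.sqrt(sum((v - overall_mean) ** 2 for v in values) / n)
    sds = [max(sd0, 1e-6), max(sd0, 1e-6)]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for v in values:
            dens = [weights[k] * normal_pdf(v, means[k], sds[k]) for k in range(2)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])

        # M-step: update weights, means, and standard deviations.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)

    return weights, means, sds


if __name__ == "__main__":
    # Simulated stand-in for log(Ks): a smaller "duplication peak" plus background.
    random.seed(1)
    data = [random.gauss(-1.0, 0.3) for _ in range(300)]
    data += [random.gauss(0.5, 0.4) for _ in range(700)]
    print(fit_two_component_mixture(data))
```

In practice one would compare fits with different numbers of components (for example by BIC) before interpreting any peak as a duplication event; that step is left out here to keep the sketch short.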

Testing phylogenetic hypotheses of ancient whole-genome duplications

A simulation-based test was implemented for the placement of ancient whole-genome duplications on phylogenies based on summary statistics from reconciled gene trees.

More interesting models may exist these days, but as far as I am aware, our approach is the best option for large-scale phylogenomic studies because it is fast. It is probably best used as a data exploration method, followed by testing a candidate set of hypotheses with some of the more rigorous models.
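The general logic of a simulation-based test can be sketched as follows: compute a summary statistic from the observed data, simulate the same statistic many times under a null model, and ask how often the simulated value is at least as extreme. The Python below is only a generic illustration of that logic, not the lab's implementation. The null simulator is a deliberately simple placeholder (each gene tree independently shows a duplication on the focal branch with some background probability), and all function and parameter names are hypothetical.

```python
import random


def monte_carlo_p_value(observed, simulated):
    """Upper-tail empirical p-value with the usual +1 correction."""
    exceed = sum(1 for s in simulated if s >= observed)
    return (exceed + 1) / (len(simulated) + 1)


def simulate_null_counts(n_gene_trees, background_prob, n_replicates, rng):
    """Placeholder null model (an assumption for illustration only).

    Each replicate counts how many of n_gene_trees show a duplication
    on the focal branch when duplications occur independently with
    probability background_prob.
    """
    return [
        sum(1 for _ in range(n_gene_trees) if rng.random() < background_prob)
        for _ in range(n_replicates)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    null_counts = simulate_null_counts(500, 0.05, 1000, rng)
    # Hypothetical observed count of gene trees with a duplication on the branch.
    print(monte_carlo_p_value(60, null_counts))
```

A realistic null would come from simulating gene trees under a birth-death process on the species tree rather than from independent draws; the placeholder only stands in for that step.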

Simulations under the multispecies coalescent with introgression

It took a while to find a satisfying approach for simulating data that allows for genealogical discordance under the multispecies coalescent (MSC) with introgression. BPP does this very well, and the scripts here might be helpful for others. I found BPP to be much more intuitive than ms, but ms or fastsimcoal2 might be more appropriate for simulating under an infinite-sites model for population genomics.