Software

Phasing target enrichment data from polyploids

PATÉ is a pipeline for phasing sequence data from polyploids.

The pipeline gives users analysis-ready output and hopefully avoids a number of bioinformatic headaches for biologists. I am currently re-writing the better parts of PATÉ as a Nextflow pipeline to help things play more nicely with diverse compute environments.

Mixture models for detecting whole-genome duplications

Some mixture models in R might be useful for detecting ancient whole-genome duplications from genomic or transcriptomic data. The models have been recreated in some other packages that are more computationally efficient, but revisiting the base R code might be helpful in some cases.

I never made this into a proper R package on CRAN because I do not think the code is that novel: it mostly repackages pre-existing algorithms to implement some slightly different models. The code has no dependencies, so it is easy enough to run with a source call. The models can be used to analyze any data; they are not limited to detecting whole-genome duplications.
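As a rough illustration of the general idea, and not the R code described above, here is a minimal Python sketch that fits one- and two-component Gaussian mixtures to one-dimensional data by expectation-maximization and compares them with BIC. The function names, the quantile-based initialization, the choice of Gaussian components, and the toy data are all assumptions made for the example.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_mixture(data, k, n_iter=200, min_sd=1e-3):
    """Fit a k-component 1D Gaussian mixture by EM (hypothetical helper).

    Returns a dict with the weights, means, sds, log-likelihood, and BIC.
    """
    n = len(data)
    ordered = sorted(data)
    # Assumed initialization: means at evenly spaced quantiles, shared starting sd.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    sds = [max(overall_sd / k, min_sd)] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sds[j] = max(math.sqrt(var), min_sd)

    loglik = 0.0
    for x in data:
        dens = sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))
        loglik += math.log(dens or 1e-300)
    n_params = 3 * k - 1  # k means, k sds, k - 1 free weights
    bic = n_params * math.log(n) - 2.0 * loglik
    return {"weights": weights, "means": means, "sds": sds,
            "loglik": loglik, "bic": bic}


# Toy data (made up for illustration): two well-separated clusters.
random.seed(1)
data = ([random.gauss(0.0, 0.5) for _ in range(200)]
        + [random.gauss(5.0, 0.5) for _ in range(200)])

one = fit_mixture(data, k=1)
two = fit_mixture(data, k=2)
print("BIC, 1 component: ", round(one["bic"], 1))
print("BIC, 2 components:", round(two["bic"], 1))
```

A lower BIC for the two-component fit suggests the extra component is supported by the data. Real analyses usually need more care than this, for example with the choice of component distribution, multiple starting values, and convergence checks.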

Testing phylogenetic hypotheses of ancient whole-genome duplications

We implemented a simulation-based test for the placement of ancient whole-genome duplications on phylogenies, based on summary statistics from reconciled gene trees.

Models exist that might be more interesting these days, but as far as I am aware, our approach is the best option for large-scale phylogenomic studies because it is fast. It should probably be used as a data exploration method, with a candidate set of hypotheses then tested using some of the more rigorous models.

Simulations under the multispecies coalescent with introgression

It took a while to find a satisfying approach for simulating data that allows for genealogical discordance under the multispecies coalescent (MSC) with introgression. BPP does this very well, and here are some scripts that might be helpful for others. I found BPP to be much more intuitive than ms, but ms or fastsimcoal2 might be more appropriate for simulating under an infinite-sites model for population genomics.