Tutorial 1: Setting Up and Running PHRAPL
The purpose of this tutorial is to provide a brief introduction for using PHRAPL
. It will walk you through the necessary steps required for installing the package, preparing a dataset and model set using R, running an analysis, and summarizing results, all using a toy dataset. There are many options that one can persue when running a PHRAPL
analysis, which could get a bit confusing for someone using the program for the first time. For this reason, we have highlighted with orange asterisks those blocks of code that constitute the minimum steps that must be taken to run a PHRAPL analysis using the toy dataset.
- Installing PHRAPL
- PHRAPL help
- Importing your dataset into R
- Subsampling trees
- Calculating degeneracy weights for subsampled trees
- Generating models
- Running a PHRAPL grid search
- Compiling, summarizing, and visualizing results
- Build your own model
Installing PHRAPL
System requirements
This tutorial assumes you are working in a Mac/Linux environment. We have yet to properly test the method using a PC, although we suspect a few (surmountable) issues will arise when the sufficiently adventurous try this out. For more detailed information about any of the tasks discussed below, take a look at the help files within PHRAPL (for a list of these, type library(help=phrapl)
once the package is installed).
Install requirements
Note that most of PHRAPL
is written in R. However, the part of the code that matches observed trees to trees expected under a given model is currently written in Perl in order to speed up the method. If you are running PHRAPL
on a Mac or Linux machine, Perl has most likely already been installed (to make sure, within R, you can type system("perl -v")
, which will give you the version installed, if it exists). If Perl is not installed on your machine, you can download it from the Perl website (http://www.perl.org/get.html
).
One final note. It has come to our attention that one of the R packages that PHRAPL imports, diagram
, includes an error in it’s DESCRPTION file such that devtools
can’t install it. Thus, if you are thrown an error upon trying to install PHRAPL using devtools
(something along the lines of Invalid comparison operator in dependency: >=
, try installing diagram
first without devtools
, by typing install.packages("diagram")
. Then, try installing PHRAPL again using the devtools
command.
Also, for Mac users, if you experience issues with `rgl` properly installing, another PHRAPL dependency, you should try updating your version of XQuartz (`https://www.xquartz.org/`), then reinstalling `rgl` separately in R (`install.packages("rgl")`). Then install PHRAPL.
Install Github version
PHRAPL
can currently be found on Github (https://github.com/bomeara/phrapl
), although you don’t actually need to navigate Github to install the program. You can simply install PHRAPL in R using the R package devtools
. To do this, first install devtools
from CRAN by typing
install.packages("devtools")
Then load the devtools
library (i.e., library(devtools)
) and type
devtools::install_github("bomeara/phrapl")
which grabs PHRAPL
from Github and installs it.
If you find an error during the installation, it is probably related to P2C2M, go to the CRAN repository and download the *.gz folder. You can type the following in your terminal:
wget https://cran.r-project.org/src/contrib/Archive/P2C2M/P2C2M_0.7.6.tar.gz
R CMD INSTALL P2C2M_0.7.6.tar.gz
PHRAPL help
Now PHRAPL is ready to run. Type library(help=phrapl)
to get a list of functions with documentation. To open a help file for a particular function, type ?function_name
.
Importing your dataset into R
Three R objects must be initially specified to run a PHRAPL
analysis:
- A set of trees in newick format
- A table that assigns tips of the trees to populations or species
- A vector specifying the number of tips to subsample per population
1. Importing trees
If you are beginning with sequence data, note that PHRAPL
includes a function for inferring gene trees from sequence data by calling up RAxML (type ?RunRaxml
for more information on using this function). If your sequence data is in nexus format, there is also a function for converting your data to phylip format, which is the required format for running RAxML (type ?RunSeqConverter
). This function calls up a Perl script written by Olaf R.P. Bininda-Emonds. For the purposes of this tutorial, we will assume that you already have a set of trees in newick format sitting in a text file called “trees.tre”.
PHRAPL
includes a toy dataset that we will use during this tutorial. This dataset consists of 10 trees, each with 61 tips: 20 tips from each of three populations (“A”, “B”, and “C”), plus one outgroup individual. To bring this dataset into R, first load the ape
library:
library(ape)
library(phrapl)
Then read the trees into R and create a multiPhylo tree object called trees
:
trees<-read.tree(paste(path.package("phrapl"),"/extdata/trees.tre",sep=""))
Note that multiPhylo objects behave as R lists. Thus to access the first tree within the object, type
trees[[1]]
To plot this tree, type
plot.phylo(trees[[1]])
2. Importing an assignment table
PHRAPL
assumes that the assignment of tips to populations or species is known. Thus these assignments must be specified upfront in the form of a table.
This population assignment table must consist of two columns: the first column lists the individuals in the dataset, whose names must match those at the tips of the trees. Note that not all the individuals listed in the table must exist on every tree (i.e., missing data/unique tip names for each tree are fine). The second column should provide the population or species name to which each individual is assigned (e.g., “A”, “B”, “C”). If there is an outgroup taxon, it MUST be listed as the last population in the table and the first letter in the population name should also come last alphanumerically (e.g., "Z.outgroup"
). Also note that PHRAPL assigns population indexes (i.e., 1, 2, 3, etc.) to taxa/populations alphanumerically, such that population 1 corresponds to the population name in your assignment table that comes first in the alphabet, and so forth. This is important to remember when interpreting parameter nomenclature in the output.A header row of some sort must also be included.
You can of course create this table directly in R as a data.frame. If you’ve created the assignment table as a tab-delimited text file (e.g., “cladeAssignments.txt”), you can import it into R as a data.frame by using the function read.table
. Do not forget to set header=TRUE
and stringsAsFactors=FALSE
.
To import an assignment table for our toy dataset, type
***************************************************************** assignFile<-read.table(paste(path.package("phrapl"),"/extdata/cladeAssignments.txt",sep=""),
header=TRUE,stringsAsFactors=FALSE)
assignFile
3. Specifying the number of tips to subsample (popAssignments)
Rather than fitting datasets to models using all the tree tips simultaneously, PHRAPL
uses iterative subsamples of tips, allowing for the method to work in a manageable tree space. Thus, the number of tips to subsample per population must be specified by the user.
A popVector
is a vector, whose length is equal to the number of populations in the dataset and whose values are equal to the number of subsampled tips. So, for example, if you are subsampling 3 tips from a dataset that contains 3 populations, then popVector = c(3,3,3)
. If subsampling 4 tips from 2 populations, popVector = c(4,4)
.
Because PHRAPL
possesses the ability to analyze a dataset under a series of different subsampling regimes, popAssignments
, which is simply a list of popVectors
, is the object that users actually specify (although popAssignments
will typically be a list of one popVector
). So, if the desired popVector = c(3,3,3)
, then popAssignments
should be defined by typing
popAssignments<-list(c(3,3,3))
which is the value we will use for our toy dataset.
We recommend generally subsampling 3 or 4 tips per population. However, subsampling 2 tips may be necessary with more than 5 populations while subsampling 5 tips may work fine with 3 or fewer populations. As a rule of thumb, we recommend subsampling in a way that keeps the sum of popAssignments
at 16 or less. With greater than 16 tips, the number of possible trees becomes so large that PHRAPL
has a difficult time simulating enough trees under a given model to find any matches to the empirical trees, rendering log-likelihood (lnL) estimation challenging.
Subsampling trees
As discussed above, approximate likelihood calculation as currently implemented in PHRAPL
requires that trees not be too large (~16 tips or fewer) such that there exists a measurable chance of observing the empirical tree within a distribution of simulated trees. Since most phylogeographic datasets are considerably larger than this, replicate subsamples of trees must typically be analyzed. Trees are subsampled (randomly, with replacement) using the function PrepSubsampling
, which requires the following arguments:
assignmentsGlobal
(the population assignments table)observedTrees
(the original trees)popAssignments
(the number of tips subsampled per population)subsamplesPerGene
(the number of replicate subsamples to take per locus)outgroup
(whether an outgroup is present in the dataset, i.e.,TRUE
orFALSE
)outgroupPrune
(whether an outgroup, if present, should be excluded from the subsampled trees).
To subsample our toy dataset using 10 replicates and 3 tips per population, do the following:
***************************************************************** assignmentsGlobal<-assignFile
observedTrees<-trees
popAssignments<-list(c(3,3,3))
subsamplesPerGene<-10
outgroup=TRUE
outgroupPrune=TRUE
observedTrees<-PrepSubsampling(assignmentsGlobal=assignmentsGlobal,observedTrees=observedTrees,
popAssignments=popAssignments,subsamplesPerGene=subsamplesPerGene,outgroup=outgroup,
outgroupPrune=outgroupPrune)
observedTrees
plot(observedTrees[[1]][1])
Note that this function produces a list of subsampled tree sets to match the format of popAssignments
(i.e., one set per popVector
). Thus, if you have a single popVector
, the PrepSubsampling
function will output a list with a single set of trees. This set of subsampled trees first contains all the subsample iterations for the first locus, then all the iterations for the second locus, and so forth. Thus, for example, for our observedTrees
we just generated, we can access the first subsample iteration of the second locus by typing observedTrees[[1]][[11]]
.
Note also that the more tips that exist in the original tree, the larger the number of subsample iterations that must be taken in order to capture a representative sample of the tree. One rule of thumb is to take AT LEAST as many subsample iterations that would be required to sample each tip once if subsampling were being done without replacement. So, for our current dataset, with three tips being taken per population, this would require that subsamplesPerGene
be at least 7 (3 subsampled tips x 7 iterations > 20 total tips for a given population in the toy tree). However, increasing the number of subsamples to 2, 3, or more times this number will likely reduce the error in the likelihood estimate for a given tree. We suggest taking as many subsample iterations as is logistically feasible, although keeping in mind that there will be a point of diminishing returns that should be weighed against other specifications of the analysis (e.g., the number of models to run, the number of parameters to specify, the number of trees to simulate, etc.).
Finally, keep in mind that PHRAPL requires that trees be rooted, but not that the outgroup be included in the analysis. The purpose of including an outgroup taxon is merely to ensure that the subsampled trees are properly rooted. Thus, typically if an outgroup is present, outgroupPrune
should be set to TRUE. If an outgroup is not available, one could midpoint root the subsampled trees. One way to do this in R is to use the midpoint
function within the phangorn
package. For example,
library(phangorn)
observedTreesMidpoint<-lapply(observedTrees[[1]],midpoint)
class(observedTreesMidpoint)<-"multiPhylo"
observedTreesMidpoint<-list(observedTreesMidpoint)
Calculating degeneracy weights for subsampled trees
As a way of reducing the tree space that PHRAPL faces when calculating the probability of a gene topology, all tip labels within populations are essentially ignored when assessing matches between simulated and observed trees. If the only differences between two trees consist of intra-population discrepancies, then these trees are considered to be “matches”. To adjust lnLs in a way that accounts for this degeneracy of intra-population tip labels, topological weights are calculated based on the proportion of times that the intra-population permuting of labels across a tree results in the same topology. These weights can be calculated during the PHRAPL search analysis - to do this, when running GridSearch
(described below), set subsampleWeights.df = NULL
and doWeights = TRUE
. However, when there are more than a few tips in the subsampled trees, this can be time consuming, and thus it is helpful to calculate these upfront. To get these weights beforehand, you can use the function GetPermutationWeightsAcrossSubsamples
, which takes as input popAssignments
and the set of subsampled trees. For our toy dataset, type
subsampleWeights.df<-GetPermutationWeightsAcrossSubsamples(popAssignments=popAssignments,
observedTrees=observedTrees)
As with PrepSubsampling
, this function produces a list of weights tables with the same length as popAssignments
. Within a weights table, each weight corresponds to a subsampled tree.
Generating models
The last thing we need prior to analyzing the dataset is a set of models to test. This set of models can be constructed “one at a time”, if you have specific a priori histories to compare, or a set of “all possible models” can be generated automatically given a specified list of criteria. Or of course you can combine these approaches. This section first describes the structure of models in PHRAPL and then outlines how to go about generating a model set using these two approaches.
One advantage of PHRAPL is that it can generate and analyze a large set of models relatively quickly. A model set (migrationArray
) consisting of all possible models, given certain constraints such as the number of free parameters, K, available overall (set by maxK
), or available for a particular parameter (set by maxN0K
, maxGrowthK
, and maxMigrationK
), can be generated using the GenerateMigrationIndividuals
function. This generates a model set that includes all possible combinations of parameter indices, given the constraints.
For example, a list of models for 3 populations with
- An overall maximum K of 3 (
maxK = 3
) - A maximum number of migration parameters of 1 (
maxMigrationK = 1
) - No variation in population size among populations (
maxN0K = 1
) - Only fully resolved trees (
forceTree = TRUE
) - Migration set to be symmetrical between populations (
forceSymmetricalMigration = TRUE
)
can be generated for use with our toy dataset as follows:
***************************************************************** popVector<-popAssignments[[1]]
maxK<-3
maxMigrationK=1
maxN0K=1
maxGrowthK=0
forceTree=TRUE
forceSymmetricalMigration=TRUE
migrationArray<-GenerateMigrationIndividuals(popVector=popVector,maxK=maxK,
maxMigrationK=maxMigrationK,maxN0K=maxN0K,maxGrowthK=maxGrowthK,
forceTree=forceTree,forceSymmetricalMigration=forceSymmetricalMigration)
This yields a migrationArray
containing 48 models. If desired, one can also specify a particular set of fixed parameter indices (using the collapseList
, n0multiplierList
, growthList
, and migrationList
arguments) such that only a subset of parameters will be varied in the outputted migrationArray
. For more on this, see the R documentation (?GenerateMigrationIndividuals
).
Running a PHRAPL grid search
A brief summary of the GridSearch
Once you have a model set in hand and the data have been subsampled and weights have been calculated, you are finally ready to calculate AIC values and parameter estimates for your dataset under the models using the GridSearch
function. This function is called GridSearch
because it analyzes each model across a grid of parameter values as a way to estimate those values that yield the best fit to the data. Numerical optimization searches are also available within PHRAPL
(type ?GridSearch
for more on these), but we don’t recommend using them due to their being more computationally onerous without imparting a compensatory improvement in performance. The parameter values contained in the grid can be specified by the user using the arguments collapseStarts
, n0Starts
, growthStarts
, and migrationStarts
, or default values can be used (to see these default values, type ?GridSearch
). When choosing grid values, keep in mind that the grid size, and thus number of simulations that PHRAPL
must do, increases rapidly with each added parameter value, and thus there will be tradeoffs if you want to explore a large grid. Also, a model with many parameters will need to be analyzed using fewer specified values per parameter compared to a model with only one or two parameters in order to yield a parameter grid of the same size.
The main inputs for GridSearch
are:
migrationArray
(the model set)observedTrees
(the set of subsampled trees)subsampleWeights.df
(the topology weights)popAssignments
(the number of tips subsampled per population)nTrees
(the number of trees to simulate for each point in a model parameter grid)subsamplesPerGene
(the number of subsample iterations)
A few things to note:
nTrees
should be set to 10,000 at a minimum and should be as high as logistically feasible (computational time increases linearly with increasingnTrees
). IncreasingnTrees
can increase the accuracy and consistency of approximate lnLs.If you only want to run a subset of the models within
migrationArray
, you can specify this usingmodelRange
(e.g.,modelRange = 1:5
will run only the first five models). This is useful if you wish to break up analyses across a computer cluster. If a single model will take a long time to run, you can also setcheckpointFile = TRUE
, which will cause results to be printed to a checkpoint file throughout the analysis, such that, if the analysis needs to be terminated prior to completion, one can resume the analysis later from the most recently analyzed grid point.If the dataset being analyzed contains loci that differ in their prevalence in a population (e.g., mix of diploid/haploid or organellar/nuclear loci), the relative scaling of effective population size should be inputted using
popScaling
. This argument takes a vector of scaling values, one value for each locus in the dataset (e.g., where a diploid nuclear locus = 1, X-linked locus = 0.75, mtDNA/chloroplast locus = 0.25, etc.).
To run a grid search on our toy dataset, let’s just analyze the first three models using nTrees = 1000
(so it runs quickly):
modelRange<-1:3
nTrees<-1000
result<-GridSearch(migrationArray=migrationArray,
modelRange=modelRange,popAssignments=popAssignments,
observedTrees=observedTrees,
subsampleWeights.df=subsampleWeights.df,
subsamplesPerGene=subsamplesPerGene,
nTrees=nTrees)
GridSearch
outputs a list containing:
- Parameter grids (with AICs) for each model (itself a list)
- A results table that gives AICs, lnLs, Ks (
params.K
), parameters (params.vector
), and parameter estimates for each model Grid points with anAIC = NA
signal parameter combinations that did not yield enough matches between observed and expected trees to calculate a lnL. Parameter estimates equal toNA
occur when those parameters are not included in the model. Parameter estimates are obtained by model averaging across the grid. Parameters for t, n, g, and M are given for each time slice in the model, starting at the tips and moving toward the root (t1
= coalescence time at time 1,m2
= ancestral migration rates at time 2, etc.). For t, the coalescing populations follow the underscore and are separated by a period (e.g.,t2_1-2.3
gives the coalescence time for ancestral populations 1-2 and 3 at time 2); for m, migrants move from the population before the period to the population after the period (e.g.,m1_2.1
gives the rate of migration at the tips, from population 2 into population 1).
Saving GridSearch results
Save the GridSearch
output results, you will want to access these reults later. If you are analyzing models in separate runs, these can be compilated later. IMPORTANT NOTE: The results saved from a GridSearch should literally be named "result", for that is the object name that the next function discussed (ConcatenateResults
) looks for when summarizing GridSearch
results from across different runs.
save(list="result",file="phraplOutput_models1-3.rda")
Compiling, summarizing, and visualizing results
Concatenate results from multiple runs and find the ‘best’ model
The ConcatenateResults
function can compile results from across separate runs into a single table. When concatenating results, you can either specify the names of the files containing the results to be concatenated (by giving a vector of names to the argument rdaFiles
) or the directory in which these files are stored (using rdaFilesPath
). So, for our toy dataset, type
## Concatenate results, calculate AIC and model weights.
totalResults<-ConcatenateResults(rdaFiles="phraplOutput_models1-3.rda",
migrationArray=migrationArray, outFile = "totalResults.txt")
totalResults
This table can be outputted to a file by specifying an output filename using the outFile
argument.
models | AIC | lnL | rank | dAIC | wAIC | params.K | params.vector | t1_1.2 | t2_1-2.3 | m1_1.2 | m1_1.3 | m1_2.1 | m1_3.1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 104.681808558051 | -50.340904279025 | 1 | 0 | 0.747493733212393 | 2 | collapse_1 collapse_2 | 0.802010307787278 | 2.41206310088882 | NA | NA | NA | NA |
2 | 107.427516309546 | -50.713758154773 | 2 | 2.746 | 0.189374383642894 | 3 | collapse_1 collapse_2 migration_1 | 1.10932831804684 | 2.89120505586268 | 0.12375682767371 | NA | 0.12375682767371 | NA |
3 | 109.625209365656 | -51.812604682828 | 3 | 4.943 | 0.0631318831447129 | 3 | collapse_1 collapse_2 migration_1 | 0.776721051811846 | 7.39406167578743 | NA | 0.112313076507491 | NA | 0.112313076507491 |
If you analyzed all models in a single run (not recommended for large datasets), this table is already within the output result
of the GridSearch
.
## If you ran all models in a single analyses (not recommended for large datasets).
result$overall.results
Calculate model averages
You can calculate model averaged parameter values across all models using the modelAverages
function:
## Calculate model averages across all models
modelAverages<-CalculateModelAverages(totalResults)
The default is to model average parameters across all models (assuming that the parameter = 0 when it is not specified in a model). Alternatively, one can model average across only those models in which a given parameter is explicitly incorporated by setting averageAcrossAllModels = FALSE
.
t1_1.2 | t2_1-2.3 | m1_1.2 | m1_1.3 | m1_2.1 | m1_3.1 |
---|---|---|---|---|---|
0.858611908209486 | 2.81732326514707 | 0.0234363729623087 | 0.00709053602169414 | 0.0234363729623087 | 0.00709053602169414 |
Plot the ‘best’ model
Upon inspecting the results, you may feel inclined to create a 3-D image of the best model, e.g.:
bestModel<-totalResults[1,"models"] ## extract best modelID from totalResults
PlotModel(migrationIndividual=migrationArray[[bestModel]],taxonNames=c("A","B","C"))
A statuette of this model can then be 3D printed using 14 carat gold, which could be useful as a project souvenir. Just follow this link.
Build your own model
Generating a single* a priori model
What if, instead of or in addition to generating a list of possible models, you would like to create a specific model for which you have some a priori reason to consider alone, or to add to a larger migrationArray
as generated above? To do this, you can use the function MakingMigrationIndividualsOneAtATime
, whose purpose is to create a single a priori migrationIndividual
for a specified coalescence, n0multiplier, growth, and migration history. The four corresponding arguments for this function are collapseList
, n0multiplierMapList
, growthList
and migrationList
.
collapseList
is a list of coalescence history vectors, one vector for each coalescent event in the tree (i.e., for each column in the collapseMatrix
. So,
collapseList = list(c(1,1,0),c(2,NA,2))
means that there are two coalescence events: in the first event, populations 1 and 2 coalesce while population 3 does not; in the second event, ancestral population 1-2 coalesces with population 3.
n0multiplierList
and growthList
are given in the same format as collapseList
, and the available parameters for this must match the splitting history depicted in the collapseList
. Remember that the n0multiplier
parameter is not a parameter for population size, but is rather a population size scalar. So, a model in which population sizes are equal across populations (contemporary and ancestral) would be given as
n0multiplierList = list(c(1,1,1),c(1,NA,1))
If you would like to model one growth rate for tip populations and another growth rate for ancestral populations, this would be given as
growthList = list(c(1,1,1),c(2,NA,2))
Finally, migrationList
is a list of migration matrices for the model. There will be one migration matrix for the present generation, plus any historical matrices that apply (there will be as many matrices as there are collapse events). So, in the three population scenario, if there is symmetrical migration between populations 1 and 2 and no historical migration, migrationList will be:
migrationList<-list(
t(array(c(
NA, 1, 0,
1, NA, 0,
0, 0, NA),
dim=c(3,3))),
t(array(c(
NA, NA, 0,
NA, NA, NA,
0, NA, NA),
dim=c(3,3))))
Note that in R, arrays are constructed by reading in values from column 1 first then from column 2, etc. However, it is more intuitive to construct migration matrices by rows (i.e., first listing values for row 1, then row 2, etc.). Thus, here, arrays are typed in by rows and then transposed (using t
). Also, spacing and hard returns can be used to visualize these values in the form of matrices.
So, let’s say you want to add one more model to our 48-model migrationArray
. Assume that this desired model is the same as model #1 (i.e., migrationArray[[1]]
, where ((pop1, pop2) pop3))), but where asymmetrical migration rates between sister populations are established at the tips: one rate for pop1 –> pop2 and another rate for pop2 –> pop1. This migrationList
would be given as
migration_1<-t(array(c(
NA, 1, 0,
2, NA, 0,
0, 0, NA),
dim=c(3,3)))
migration_2<-t(array(c(
NA, NA, 0,
NA, NA, NA,
0, NA, NA),
dim=c(3,3)))
migrationList<-list(migration_1,migration_2)
So, to produce our desired model, we use the function GenerateMigrationIndividualsOneAtATime
:
migrationIndividual<-GenerateMigrationIndividualsOneAtATime(collapseList=collapseList,
n0multiplierList=n0multiplierList,growthList=growthList,migrationList=migrationList)
Then, we can add this model to the migrationArray
:
length(migrationArray)
migrationArray<-c(migrationArray,list(migrationIndividual))
length(migrationArray)
and view it:
migrationArray[[49]]
Note that a collapseList
must always be specified. However, the remaining three parameter matrices are optional, and if they are not specified, null matrices will be automatically produced in which all n0multipliers are set to one, and all growth and migration parameters are set to zero.
Thus, the same migrationIndividual
as above can be produced by simply typing
migrationIndividual<-GenerateMigrationIndividualsOneAtATime(collapseList=collapseList,
migrationList=migrationList)
##