In this vignette, we’ll walk through how we simulated a birth-death branching process to validate our growth rate methods. The simulation procedure is a direct implementation of results by Amaury Lambert in a recent paper. We’ll show here that these results allow for extremely fast generation of a sampled tree of size n from a birth-death process of time T with fixed birth and death rates. The key point here is that we don’t have to simulate this using computationally intensive algorithms (gillespie etc.); the math provides a shortcut which allows us to generate the results on a single core in under a second for a tree with 100 tips or about 8 seconds for a tree with 1000 tips! In R!
The applications of this fast generation of sampled trees are countless. For us, these results allowed us to simulate thousands of trees to check our methods for growth rate estimation from phylogenetic tree reconstruction, which is detailed in our paper. We provide an intro to simulation here, with full reproduction of our simulation results in the article linked here.
First, we’ll have to load the packages we want to use. We’ll be
plotting the trees using the ape
package function
ape::plot.phylo()
along with some other ape
functions. If you have
cloneRate
installed, you already have the ape package
. We’ll
also be using ggplot2
to make
our plots, which can be installed from CRAN.
# Load and attach our package cloneRate
library(cloneRate)
# Load and attach ape, which will be installed if you've installed cloneRate
library(ape)
# Install ggplot2 from CRAN if necessary, then load and attach it with library()
library(ggplot2)
We’ll also set the color palette which we’ll use for most of the plotting. The palette is avilable here.
In this section, we’ll simulate ultrametric and mutation-based trees. Ultrametric just means that the tips are all the same distance from the root. In our specific case, the ultrametric trees will have edge lengths in units of time, so ultrametric means that the tips are sampled at the same time. Mutation-based trees are those with edge lengths in units of mutations. Mutation-based trees will typically not be ultrametric, because there will be some fluctuations in the number of mutations acquired.
We’ll be generating trees of class phylo
, which is a
fairly straightforward encoding of phylogenetic trees in R. The ape
documentation describing the class phylo
can be found here.
Let’s start with ultrametric trees. First, let’s simulate a single
tree and have a look at the class phylo
. We’ll use the
function simUltra()
from our package to do this, which
simulates an ultrametric tree. We’ll set a birth rate a
, a
death rate b
, an age for the tree cloneAge
,
and the number of sampled tips n
. Note that we use
cloneAge
instead of T for the time because
T
in R means TRUE
.
# Generate the tree
tree <- simUltra(a = 1, b = 0, cloneAge = 20, n = 100, addStem = TRUE)
# We see that the tree is of class phylo
class(tree)
#> [1] "phylo"
# Preview the tree
print(tree)
#>
#> Phylogenetic tree with 100 tips and 100 internal nodes.
#>
#> Tip labels:
#> t44, t10, t68, t58, t96, t5, ...
#>
#> Rooted; includes branch lengths.
Again, the best place to get more info about the class
phylo
is here.
Our functions for simulating trees will include metadata. Let’s take a
look at the metadata included in the tree.
We see that the columns are:
r
, a
, b
: The growth rate
params. r
is the net growth rate, a
is birth
rate and b
is death rate, so r=a-b
cloneAge
: The time that passes, in the same units as
r
, a
, and b
.n
: The number of samples from the birth-death process,
producing a tree with n
tips.runtime_seconds
: The elapsed time to generate the
tree.addStem
: Tells us whether the tree has a root edge
preceding the first split. We’ll show the “stem” in plots below.Now let’s plot the tree.
plot.phylo(tree, direction = "downwards", show.tip.label = FALSE)
axisPhylo(side = 2, backward = FALSE, las = 1)
title(main = "Simulated tree", ylab = "Time")
In the tree we plotted above, we see that the “stem” is the edge extending from time 0 to the first split. We call these splits coalescence events and they’ll be very important for our methods of growth rate estimation.
Also, we see that the tree is ultrametric, meaning that they all are sampled at the same time, in this case 20 units of time after the birth-death process began. As a default, we’ll talk about units of time in years, but that is arbitrary. Just make sure the units of the growth rate (per year) are the same as the time units (years).
Let’s apply our growth rate estimates to this tree, seeing if our
estimates are close to the value of r
that we used to
generate the tree. We have two functions for estimating the growth rate
of an ultrametric tree, and we’ll compare their performance later
on:
internalLengths()
: This function sums the edge lengths
of the internal edges of the tree (excluding the stem). We call this sum
Li. The
growth rate is calculated as r = Li/n
where n is the number of
sampled tips.maxLikelihood()
: This function uses a maximum
likelihood estimation of the distribution of coalescence times, using
the information about when the tree splits. The distribution of
coalescence times is approximated by a standard logistic distribution
scaled by 1/r. Maximizing the
likelihood of the observed coalescence times for the r term gives us our estimate.Both methods are detailed in our paper.
We know that our actual growth rate is 1, so we can evaluate how our
estimates are doing. One estimate doesn’t really tell us much though.
Let’s try 100. To do this, we just set the nTrees
param
equal to 100 in our simUltra()
function. You can
parallelize this process if you have the parallel
package
installed by simply setting nCores
, but we’ll leave this at
1 for now. By printing the elapsed time, we see that with only one core
it takes about a minute.
ptm <- proc.time()
tree.list <- simUltra(a = 1, b = 0, cloneAge = 20, n = 100, nTrees = 100, nCores = 1, addStem = TRUE)
print(proc.time()[["elapsed"]] - ptm[["elapsed"]])
#> [1] 47.467
Let’s apply our methods to these trees. We can input either a single
phylo
object, or a list of phylo
objects to
our growth rate functions. Either way, it’ll return a
data.frame
which each row corresponding to an estimate for
each tree.
We gave each method 100 trees to estimate, let’s plot these estimates. Remember that all the input trees had a growth rate of 1, so we want these estimates to be close to 1. We’ll use ggplot’s density plot for this:
# Combine for ggplot formatting
resultsCombined <- rbind(resultsLengths, resultsMaxLike)
# Plot, adding a vertical line at r=1 because that's the true growth rate
ggplot(resultsCombined) +
geom_density(aes(x = estimate, color = method), linewidth = 1.5) +
geom_vline(xintercept = exampleUltraTrees[[1]]$metadata$r) +
theme_bw() +
theme(
axis.text.y = element_blank(), axis.ticks.y = element_blank(),
legend.title = element_blank()
) +
xlab("Net growth rate estimate (r)") +
ylab("Density") +
scale_color_manual(labels = c("Internal lengths", "Max. likelihood"), values = c("black", "#009E73"))
Not bad, but we’ll want to test this more formally over a large range of parameter values. More on that later in this vignette. For now, let’s shift our attention to mutation trees.
There are two ways to generate mutation trees:
Either way, the process of generating mutation trees consists of:
nu
multiplied by the edge length of the time-based tree.Let’s take the ultrametric trees we just generated and convert them
to mutation-based trees. For this, we provide the function
ultra2mut()
:
# Set mutation rate equal to 10 muts/year for all trees
mutTree.list <- ultra2mut(tree.list, nu = 10)
Alternatively, we can generate a new set of mutation trees using the
function simMut()
. We won’t run this because it will take
another few minutes, and we already have the mutation-based trees
generated above.
# Set params for ultra tree + a mutation rate
mutTree.list2 <- simMut(a = 1, b = 0, cloneAge = 20, n = 100, nTrees = 100, nu = 100, nCores = 1)
If we want to estimate the growth rate from a mutation tree, we use
the sharedMuts()
function, which works very similarly to
the internalLengths()
function. However, instead of
counting the internal edge lengths in units of time, the
sharedMuts()
function counts the internal/shared mutations
and uses the mutation rate to scale to time. So, if Mi is the number
of internal or shared mutations, ν (nu
) is the mutation
rate, and n is the number of
samples, the growth rate estimate is r = Mi/(νn).
Let’s apply the sharedMuts()
function to the trees that
we converted to mutation-based. Remember that these mutation-based trees
are generated from the ultrametric trees we applied our functions
maxLikelihood()
and internalLengths()
to.
Let’s plot the estimates all together. First, we’ll combine the
data.frames produced. The sharedMuts()
function outputs an
additional column corresponding to the mutation rate nu
.
Because of this, using rbind on the full data.frames will give an error.
We only want to look at the estimates for now, so we’ll just combine the
necessary columns from each data.frame.
# Combine the columns with the estimates
colsUse <- c("lowerBound", "estimate", "upperBound", "method")
resultsAll <- rbind(resultsShared[, colsUse], resultsCombined[, colsUse])
# Plot, adding a vertical line at r=1 because that's the true growth rate
ggplot(resultsAll) +
geom_density(aes(x = estimate, color = method), linewidth = 1.5) +
geom_vline(xintercept = exampleUltraTrees[[1]]$metadata$r) +
theme_bw() +
theme(
axis.text.y = element_blank(), axis.ticks.y = element_blank(),
legend.title = element_blank()
) +
xlab("Net growth rate estimate (r)") +
ylab("Density") +
scale_color_manual(labels = c("Internal lengths", "Max. likelihood", "Shared Muts."), values = c(colorPal[1], colorPal[4], colorPal[6]))
It looks like we’re on the right track. From this, it’s hard to tell which is best. Maximum likelihood seems a bit biased on the high side, but the estimates seem to fall in a more narrow range. Let’s do a more quantitative analysis now that we have the hang of it.
In this section, we’ll use our ability to generate trees in order to test our methods. We’ll focus mainly on the ultrametric trees, noting that the mutation-based approach based on shared mutations is analogous to the internal lengths method. We’ll see how well our methods perform, where they perform well, and where they fail.
So far, we’ve looked at the same parameters. Now, we’ll see what happens when we change some important parameters about the birth-death process.
One of the most interesting results to explore further is how the
accuracy of the estimates depends on the number of samples. For this,
we’ll look at a few different n
parameter values. To keep
things from running for too long, we can re-use our trees from
n=100
and add 100 more trees at different n
values. For this, let’s show how to run simUltra()
with a
vector input for n
. First, we’ll generate the vector, which
we’ll call n_vec
. Let’s explore a small n
value, 10, and an intermediate n
value, 30.
For now, we’ll keep the growth rate a - b
equal to 1,
but again we’ll let the actual birth and death rates vary.
In total, our n_vec
is of length 200. So we’ll want to
make sure a_vec
and b_vec
are also of length
200. And we’ll have to set nTrees = 200
. If we generate
multiple trees, we either set a single value for the parameters, or a
vector of length nTrees
. This applies to both
simUltra()
and simMut()
. Let’s generate some
more trees. Note: this will also take a minute or two.
Now, let’s apply our estimates and combine the data.frames:
# Apply our estimates for ultrametric trees
n10_n30_maxLike <- maxLikelihood(n10_n30_trees)
n10_n30_lengths <- internalLengths(n10_n30_trees)
# Combine the estimates
n10_n30_both <- rbind(n10_n30_maxLike, n10_n30_lengths)
Finally, we can add our resultsCombined
data.frame from
before, which has the results of applying maxLikelihood()
and internalLengths()
to the set of 100 ultrametric trees
at n = 100
.
Let’s make a plot for each of these, showing the performance. We’ll
use ggplot2::facet_wrap()
in order to show the same plot at
various n
values. Although we have a different number of
trees from the n = 100
case, the density plot will
normalize this.
ggplot(results_vary_n) +
geom_density(aes(x = estimate, color = method), linewidth = 1.5) +
geom_vline(xintercept = exampleUltraTrees[[1]]$metadata$r) +
theme_bw() +
theme(
axis.text.y = element_blank(), axis.ticks.y = element_blank(),
legend.title = element_blank()
) +
xlim(0, 3) +
xlab("Net growth rate estimate (r)") +
ylab("Density") +
scale_color_manual(
labels = c("Internal lengths", "Max. likelihood"),
values = c("black", "#009E73")
) +
facet_wrap(~ factor(paste0("n = ", n), levels = c("n = 10", "n = 30", "n = 100")),
ncol = 1, strip.position = "bottom", scales = "free", dir = "v"
)
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_density()`).
As you can see, both estimates are highly dependent on the number of
samples. If you’d like to see the same density values on the y-axis for
all three n
values, set scales = "fixed"
in
the facet_wrap()
function.
If you’re interested, you can apply the same quantitative measures to this data as we did before in Quantitative comparisons. We won’t repeat that here, but we show this in Figure 2 of our paper. This information is also available in the Supplementary tables, which are also available as .csv files.
As we noted before, we can simulate many trees at once using
simUltra()
and simMut()
. Here, we’ll use
simUltra()
to generate 100 trees at various growth rates.
We do a similar analysis in Figures 3 and 4 of our
paper, noting that our methods struggle with small growth rates.
Let’s see if we come to the same conclusion here.
First, we have to address something that might otherwise lead to
confusion; varying r
is the same as varying
cloneAge
. When you think about it, it makes sense. If I
simulate a population with a net growth rate of 1 per year for 20 years
it should look the same as a growth rate of 0.5 per year for 40 years.
We’ll show this, but first consider the fact that the units are
meaningless, so long as they’re consistent. So a growth rate of 1/12 and
a clone age of 20*12 is exactly the same as a growth rate of 1 and 20 if
the units change from a years to months.
This all might be a bit confusing, so let’s plot it. First, we’ll simulate a tree with a growth rate of 2 for 30 years and compare it to a tree with a growth rate of 0.5 and the same 30 years. These trees should look different. Finally, we’ll add a tree that has a growth rate of 0.5, but run it for 120 years. If what I said above is true, we should see that the last tree looks like the first tree, because 0.5 for 120 should be the same as 2 for 30. Let’s find out:
# First tree, r = a - b = 2
tree1 <- simUltra(a = 2.5, b = .5, cloneAge = 30, n = 50)
# Second tree, with r = a - b = 0.5
tree2 <- simUltra(a = 1, b = .5, cloneAge = 30, n = 50)
# Third tree, with r = 0.5 but cloneAge = 120
tree3 <- simUltra(a = 1, b = .5, cloneAge = 120, n = 50)
Now, let’s plot using par()
to show all three in one
plot
oldpar <- graphics::par(mfrow = c(1, 3))
ape::plot.phylo(tree1, direction = "downwards", show.tip.label = F, main = "r = 2, cloneAge = 30")
ape::plot.phylo(tree2, direction = "downwards", show.tip.label = F, main = "r = 0.5, cloneAge = 30")
ape::plot.phylo(tree3, direction = "downwards", show.tip.label = F, main = "r = 0.5, cloneAge = 120")
While the trees on the left and the right aren’t identical due to the stochastic nature of the birth-death process, they are similar, and certainly different from the tree in the middle.
Now we can arbitrarily choose whether to vary r
or
cloneAge
. For consistency, we’ll vary r
, as we
do in Figures 3 and 4 of our
paper. Here, we have two options:
r
values, showing which are good and
which are problematicr
values and then try to decipher
which estimates are good and which aren’tOption 1 is essentially the same as we did in Varying n, but with r
instead of
n
, so repeating that here wouldn’t be as useful. Option 2
is also more realistic…we can pretend we don’t know the ground truth and
show how we decide if our methods are relevant before comparing to the
ground truth.
Let’s simulate 1000 trees with 50 samples n
and various
growth rates r
, ranging from 0.1 to 1 per year, run for 40
years. This will take a while, so we won’t actually evaluate this as
part of this vignette. However, we do a full reproduction of our work
from our
recent paper in an
article, “Reproduce simulation results” which can be found on our package website.
# Uniform ditribution of r used to generate b_vec
r_vec <- stats::runif(n = 1000, min = 0.1, max = 1)
a_vec <- stats::runif(n = 1000, min = 1, max = 3)
b_vec <- a_vec - r_vec
# Input to simUltra()
vary_r_trees <- simUltra(a = a_vec, b = b_vec, cloneAge = 40, n = 50, nTrees = length(a_vec))
If we were to run the code above, we could analyze it the same as we
have done, with our internalLengths()
and
maxLikelihood()
functions. However, we’d also have to apply
our diagnostic, which tells us which clones are good candidates for our
method, and which are not. If we run a clone that’s not a good
candidate, we’ll get a warning. Let’s generate a tree with a very low
growth rate and plot it.
# Let's call it slowClone
slowClone <- simUltra(a = .15, b = .05, n = 50, cloneAge = 40)
# Plot the tree
ape::plot.phylo(slowClone, direction = "downwards", show.tip.label = F)
We see that this tree has long internal branches and short external branches. This indicates that the process has not been “supercritical” enough (i.e. the growth rate and time are too low). Quantitatively, this means the ratio of external to internal edge lengths is low. Let’s try to apply our methods:
# Apply our estimates
maxLikelihood(slowClone)
#> Warning in maxLikelihood(slowClone): External to internal lengths ratio is less than or equal to 3,
#> which means max. likelihood method may not be applicable. Consider
#> using birthDeathMCMC() function, which avoids this issue.
#> lowerBound estimate upperBound cloneAgeEstimate sumInternalLengths
#> 1 0.1554107 0.2023034 0.2491961 42.55997 242.2894
#> sumExternalLengths extIntRatio n alpha runtime_s method
#> 1 317.3994 1.310001 50 0.05 0.004 maxLike
internalLengths(slowClone)
#> Warning in internalLengths(slowClone): External to internal lengths ratio is less than or equal to 3,
#> which means internal lengths method may not be applicable. Consider
#> using birthDeathMCMC() function, which avoids this issue.
#> lowerBound estimate upperBound cloneAgeEstimate sumInternalLengths
#> 1 0.1491645 0.2063648 0.2635652 42.46268 242.2894
#> sumExternalLengths extIntRatio n alpha runtime_s method
#> 1 317.3994 1.310001 50 0.05 0.002 lengths
Our estimates are off by quite a bit (the ground truth estimate is 0.1), and we see a warning. The warning tells us that the external to internal edge length ratio is below 3, so this clone is likely not a good candidate for our methods. For details on how we came up with 3 as a cutoff see Figure 4 of our paper or the reproduced analysis in the article, “Reproduce simulation results” from our package website. The real advantage to this diagnostic is that we can look at a tree and calculate this ratio to tell us right away if our methods are applicable. Again, if you’re interested in the quantitative side of this, you can apply the metrics from the Quantitative comparisons section, which we do in the longer form analysis in the article mentioned above.
The method for simulating the trees is a direct result of the work of Amaury Lambert in the following paper. The mathematical methods for estimating growth rates build in large part from the same work, linked below:
And here’s a final link to our paper for more of the details of the methods and data analysis.
There aren’t too many colors in this vignette, but we tried to use colorblind friendly colors, specifically pulling colors from a palette designed by Bang Wong and available here.