In this vignette, we’ll walk through our data analysis corresponding to our paper analyzing single-cell Whole Genome Sequencing (scWGS) data from four previously published papers. As part of this package, we have included everything needed to reproduce this analysis, which consists of reconstructed phylogenetic trees and de-identified metadata about the donors who were sequenced. We also include some longitudinal bulk Whole Genome Sequencing (bulk WGS) data which we use for validation.
First, we’ll have to load the packages we want to use. We’ll be
plotting the trees using the ape
package function
ape::plot.phylo()
along with some other ape
functions. If you have
cloneRate
installed, you already have the ape package
. We’ll
also be using ggplot2
to make
our plots, as well as survival
, ggsurvfit
,
and car
for some
downstream analysis. All of these can be installed from CRAN.
# Load and attach our package cloneRate
library(cloneRate)
# Load and attach ape
library(ape)
# Install from CRAN if necessary, then load and attach with library()
library(ggplot2)
library(survival)
library(ggsurvfit)
library(car)
We’ll also set the colors which we’ll use for most of the plotting. The palette is available here
First, let’s examine realCloneData
, which contains most
of the necessary data to reproduce our analysis.
summary(cloneRate::realCloneData)
#> Length Class Mode
#> fullTrees 30 -none- list
#> cloneTrees 42 -none- list
There are two lists contained in this dataset, one which contains the
full reconstructed tree for each individual at each timepoint (named
fullTrees
), and one which contains specific clones from the
full trees (named cloneTrees
). We require that a clone must
have at least n=10
tips (sampled cells) to be included in
cloneTrees
. We discuss this requirement in our simulation vignette. For now, let’s
split these into two different variables for clarity.
fullTrees.list <- cloneRate::realCloneData$fullTrees
cloneTrees.list <- cloneRate::realCloneData$cloneTrees
Let’s start by looking at the full tree from a single individual,
“PD9478”. Because some individuals in this study are sequenced at
multiple timepoints, we refer to the ID (“PD9478”) and timepoint (“_1”)
to get a specific tree from fullTrees.list
. In this case,
the only timepoint is timepoint 1, so let’s take a look at the full tree
“PD9478_1”. This data comes from the work of Williams et al.
We see that the tree is of class phylo
(http://ape-package.ird.fr/misc/FormatTreeR_24Oct2012.pdf),
which is a fairly straightforward encoding of phylogenetic trees in
R.
Let’s plot the tree:
# Plot the full tree from individual PD9478 at timepoint 1
PD9478 <- fullTrees.list$PD9478_1
# Plot, then add scale and title
plot.phylo(PD9478,
direction = "downwards",
show.tip.label = FALSE, edge.width = 2,
edge.color = c(rep("black", 2), "red", "#0070C0", rep("black", (200)))
)
axisPhylo(side = 2, backward = FALSE, las = 1)
title(main = "PD9478 full tree", ylab = "Time (years)")
If you went through the simulation vignette, you’ll notice that this tree looks quite a bit different than the ones we simulated. In the above tree, there is a clear example of a somatic clonal expansion. We know from the authors who produced this data (see Williams et al. Fig. 3 PD9478), that all the sampled cells within this clone have a mutation in the JAK2 gene (which occurs along the red edge) and the DNMT3A gene (which occurs along the blue edge). Both of these mutations are known to be important in clonal expansions in blood. Let’s look more closely at the clone and see what we can learn from the tree.
# Load the clone tree from our cloneTrees.list
PD9478_subClone <- cloneTrees.list[["PD9478_1_clone1"]]
# Plot, then add scale and title
plot.phylo(PD9478_subClone,
direction = "downwards",
show.tip.label = FALSE, edge.width = 2,
)
axisPhylo(side = 2, backward = FALSE, las = 1)
title(main = "PD9478 JAK2 & DNMT3A clone tree", ylab = "Time (years)")
Now this tree of the clonal expansion looks a lot more like the trees we simulate! This suggests that the growth rate of each cell within this clone is roughly the same as other cells within the clone. We can now apply our methods to the clone tree shown above!
We see that the above tree is ultrametric and we know that it’s
time-based, in units of years, so we should apply our
internalLengths()
and maxLikelihood()
functions.
# Get maximum likelihood and internal lengths estimates
print(round(maxLikelihood(PD9478_subClone)$estimate, 3))
#> [1] 0.516
print(round(internalLengths(PD9478_subClone)$estimate, 3))
#> [1] 0.601
Unfortunately, we don’t have a ground truth to compare to when working with real data. However, this individual clone has longitudinal sequencing data, which we can use to estimate an orthogonal growth rate. We’ll come back to this later (see Validation).
First, let’s apply our analysis to all of the clone trees. If we give
a list of phylo
objects, our growth rate functions will estimate the growth rate of all
of them separately, returning a data.frame with each row telling us the
result from each tree.
# Apply each of our estimates
resultsLengths <- internalLengths(cloneTrees.list)
#> Warning in FUN(X[[i]], ...): Tree is not binary. Birth-death branching trees should be binary,
#> but tree resonstruction from data may lead to 3+ descendants from
#> a single parent node. Proceed with caution! Input tree has
#> 3 nodes directly descending from a single
#> parent node. A binary tree would only have 2 descendant nodes
#> from each parent node. Tree throwing warning is PD5847_1_clone1
#> Warning in FUN(X[[i]], ...): Tree is not binary. Birth-death branching trees should be binary,
#> but tree resonstruction from data may lead to 3+ descendants from
#> a single parent node. Proceed with caution! Input tree has
#> 3 nodes directly descending from a single
#> parent node. A binary tree would only have 2 descendant nodes
#> from each parent node. Tree throwing warning is PD4781_1_clone1
Oof! Looks like we have a warning…
It says that two of the trees we input “PD5847_1_clone1” and “PD4781_1_clone1” are not binary branching trees. This means that one node has more than 2 direct descendants. In this case, it’s not a big deal. We know that these trees are resconstructed from mutation data, so this could just mean that we picked up two divisions which happened without mutations occurring in between. Let’s plot one of them so we can see where this happens
# Plot, highlighting the 3 descendants of node 76. Then add scale and title
plot.phylo(cloneTrees.list$PD5847_1_clone1,
edge.width = 2,
direction = "downwards", show.tip.label = FALSE,
edge.color = c(
rep(1, 4), "darkorange", rep(1, 34),
"darkorange", rep(1, 6), "darkorange", rep(1, 1000)
)
)
axisPhylo(side = 2, backward = FALSE, las = 1)
title(main = "PD5847 clone 1 tree (multi-branching)", ylab = "Time (years)")
We can see the three descendants from a single node (node 76), which
is identified by the edges in orange. We see that there are a lot of
places where there are multiple splits (coalescence events) in quick
succession, but this is the only place where 1 node appears to split
into 3. It’s okay; we know mutations are an imperfect way to keep track
of time. If we saw more of these events where one node splits into 3, or
cases where one node splits into 4+, we might be more concerned. In this
case, we can move on. Our functions for growth rate estimation are
equipped to handle trees like this, but they throw warnings so we know
what’s happening. Let’s continue, applying our
maxLikelihood()
function to the trees.
resultsMaxLike <- suppressWarnings(maxLikelihood(cloneTrees.list))
# Preview the output
print(head(resultsMaxLike[, c(1:3)]))
#> lowerBound estimate upperBound
#> PD34493_clone1 0.09733005 0.1687401 0.2401502
#> PD34493_clone3 0.17679235 0.2315774 0.2863625
#> PD34493_clone4 0.09425558 0.1956755 0.2970954
#> PD41305_clone1 0.17395097 0.3301699 0.4863887
#> PD41305_clone2 0.15904836 0.2757405 0.3924327
#> PD41305_clone3 0.32745600 0.5435162 0.7595764
# Print correlation coefficient of the estimates from the two methods
print(stats::cor.test(resultsLengths$estimate, resultsMaxLike$estimate)$estimate)
#> cor
#> 0.9829718
Not bad! With a correlation coefficient around 0.98, the estimates from each method agree very well. If you want to compare the estimates from the birth-death MCMC as well, we’ll include an example of how to run this, but won’t include the results here as this tends to take longer to run. We note the very high correlation (R2 > 0.99) between the birth-death MCMC and the maximum likelihood estimate:
# Longer running code. Not run as part of vignette.
resultsBDMCMC <- suppressWarnings(birthDeathMCMC(cloneTrees.list))
# Print correlation coefficient of the estimates from the two methods
print(stats::cor.test(resultsBDMCMC$estimate, resultsMaxLike$estimate)$estimate)
# > cor
# > 0.9964975
Each of the functions (maxLikelihood()
,
internalLengths()
and birthDeathMCMC()
)
returns a data.frame with the same columns, and we’ll walk through what
the columns mean here:
# Check that column names are the same
stopifnot(all(colnames(resultsMaxLike) == colnames(resultsLengths)))
# Print column names
colnames(resultsMaxLike)
#> [1] "lowerBound" "estimate" "upperBound"
#> [4] "cloneAgeEstimate" "sumInternalLengths" "sumExternalLengths"
#> [7] "extIntRatio" "n" "alpha"
#> [10] "runtime_s" "method" "cloneName_result"
lowerBound
, estimate
,
upperBound
: These are the growth rate estimate and
confidence intervals. The default is 95% confidence intervals, but this
can be adjusted by setting the alpha
parameter in the
function input.cloneAgeEstimate
is the estimate of the time from clone
initiation to sampling. The earliest coalescence time in the clone tree
gives us a rough estimate of the clone age, and then we add 1/r,which is 1 divided by the growth
rate estimate, to account for the expected time from initiation to the
first surviving division (first coalescence). We’ll discuss this more
later.sumInternalLengths
, sumExternalLengths
,
extintRatio
: These are the sum of the internal edge
lengths, external edge lengths, and the ratio between the two. We use
this ratio to say whether enough time has passed to use our method (see simulation vignette).
Fortunately, all of our real data has a ratio > 3, so we can apply
our methods to all of the clone trees.n
is the number of tips in the clone tree. In this
case, this corresponds to the number of cells sequenced from the
clone.alpha
is the confidence level. The default is 0.05, and
we use 1-alpha
(default 95%) confidence intervals for the
upper and lower bounds.runtime_s
is the runtime of the method in seconds.method
is the method used, which helps if we want to
combine data.frames from different methods for plotting.cloneName_result
is a unique name for the clone
corresponding to the individual ID and clone number. For individuals
from the Williams
paper, where some individuals were sampled at multiple timepoints,
the ID will be followed by “_1”, “_2”, or “_3” to indicate the
timepoint.Now that we have growth rates, let’s make some sense of them. If we look inside cloneTrees.list, we realize that we also have metadata. Let’s print two examples, one from an individual with a Myeloproliferative Neoplasm (MPN), and one without:
# Show the metadata of an individual without MPN
print(cloneTrees.list$PD34493_clone1$metadata)
#> ID age cloneName_meta cloneDriver diagnosis.age malnorm sourcePaper
#> 1 PD34493 83 PD34493_clone1 SF3B1:k666n_DelY NA Normal fabre
# And the metadata of an individual with MPN
print(cloneTrees.list$PD9478_1_clone1$metadata)
#> ID age cloneName_meta cloneDriver
#> 1 PD9478_1 68.75291 PD9478_1_clone1 JAK2:p.F537_K539delinsL_AND_DNMT3A:p.Y908*
#> diagnosis.age malnorm sourcePaper
#> 1 53 Malignant williams
For a biological and/or clinical analysis, the most important of these are:
age
which tells us the individual’s age at the time of
sampling. We know that clonal hematopoiesis is associated with
aging.cloneDriver
which tells us the clonal hematopoiesis
driver(s) which likely leads to the expansion.diagnosis.age
which tells us when the individual was
diagnosed with Myeloproliferative Neoplasm (MPN). If the individual
doesn’t have MPN, diagnosis.age
is NA
.malnorm
which tells us whether the individual has been
diagnosed with a Myeloproliferative Neoplasm (MPN). This information is
already conveyed in diagnosis.age
, but it’s sometimes
easier to work with in this format.Let’s combine all of the cloneTrees.list
metadata into
one data.frame, so we can analyze it together with the results of
running our functions.
# Combine all metadata into a single data.frame
metadataAll <- do.call(rbind, lapply(cloneTrees.list, function(x) {
return(x$metadata)
}))
Let’s join the data.frames containing the metadata and the results.
We can use cbind()
because we know the order of the
metadata is the same as the order of the results data.frame. We can
check this by comparing the cloneName_result
and
cloneName_meta
columns, which should be (and are) the
same.
# Combine metadata with estimates using cbind. Check if cloneNames match
resultsLengthsMeta <- cbind(resultsLengths, metadataAll)
stopifnot(resultsLengthsMeta$cloneName_result == resultsLengthsMeta$cloneName_meta)
resultsMaxLikeMeta <- cbind(resultsMaxLike, metadataAll)
stopifnot(resultsLengthsMeta$cloneName_result == resultsLengthsMeta$cloneName_meta)
# Because max. likelihood performs slightly better, use that going forward
results <- resultsMaxLikeMeta
We’ll go forward with our analysis using the results data.frame,
which contains the estimates from the maxLikelihood()
function and the metadata. In the simulation vignette, we show that the
maximum likelihood estimate performs better than the
internalLengths()
estimate.
Here, we recreate the analysis in Figure 5E-G of our work. These are just a few of many analyses that can be done with this data, so don’t be limited by what we do.
If we split the cohort by MPN and normal, we might expect differences in the growth rates. Because the number of sampled cells, on average, is higher in the healthy individuals due to study design differences, we’re likely to pick up more clones from the healthy individuals. To avoid this bias as much as possible, we’ll compare the mean growth rate between individuals with and without MPN.
First, we want to avoid double counting the clones which appear twice. There are three such clones from the Williams paper:
# If patient ID and clone number are the same, even if timepoint
# differs, then we have a duplicate.
# Make a new column for patient, removing anything after "_" from cloneName_result
results$patient <- gsub("_.*", "", results$cloneName_result)
results$cloneNumber <- gsub(".*_clone", "", results$cloneName_result)
# Combining patient ID (without timepoint) and clone number
# will give us a unique ID for the clone regardless of sampling time
results$uniqueCloneID <- paste0(results$patient, "_", results$cloneNumber)
# Find which clone IDs appear twice
tmp <- table(results$uniqueCloneID)
repeatsVec <- names(tmp)[tmp == 2]
# Remove the duplicate with fewer number of cells sequenced, n
rowsRemove <- c()
for (cloneID in repeatsVec) {
duplicateRows <- which(results$uniqueCloneID == cloneID)
removeIndex <- duplicateRows[which.min(results$n[duplicateRows])]
rowsRemove <- c(rowsRemove, removeIndex)
}
uniqueResults <- results[!c(1:nrow(results)) %in% rowsRemove, ]
# Check that each unique clone now only appears once (no duplicates)
stopifnot(all(table(uniqueResults$uniqueCloneID) == 1))
For now, our uniqueResults
dataframe contains the
results with only one clone from each duplicated pair. Later on, when we
want to validate our estimates, we’ll make use of the fact that we have
data from different timepoints but the same clone, showing that our
methods make consistent estimates regardless of which sampling time we
use. More on that in the Validation
section.
Now, let’s take the mean for each individual and re-create Fig. 5E.
# Get the unique individuals, and initialize vectors
uniqueIndividuals <- unique(uniqueResults$patient)
individualMeans <- c()
malNorm <- c()
# Fill vectors with mean growth rate and MPN status of each individual
for (ind in uniqueIndividuals) {
individualMeans <- c(individualMeans, mean(uniqueResults$estimate[uniqueResults$patient == ind]))
malNorm <- c(malNorm, uniqueResults$malnorm[uniqueResults$patient == ind][1])
}
# Combine results into a data.frame
mal_vs_norm.df <- data.frame(
"Patient" = uniqueIndividuals,
"meanEstimate" = individualMeans,
"malNorm" = malNorm
)
# Run a non-parametric Mann-whitney test to see if they're significantly different
nonparamTest <- wilcox.test(meanEstimate ~ malNorm, data = mal_vs_norm.df)
print(nonparamTest)
#>
#> Wilcoxon rank sum exact test
#>
#> data: meanEstimate by malNorm
#> W = 64, p-value = 0.02853
#> alternative hypothesis: true location shift is not equal to 0
# Set factor ordering for plot and plot using ggplot (Fig. 5E)
mal_vs_norm.df$malNorm <- factor(mal_vs_norm.df$malNorm, levels = c("Normal", "Malignant"))
ggplot(mal_vs_norm.df, aes(x = malNorm, y = meanEstimate)) +
geom_label(label = paste0("p = ", round(nonparamTest$p.value, 3)), x = 1.5, y = 1.2) +
geom_boxplot(width = 0.3, aes(color = malNorm), outlier.shape = NA) +
geom_jitter(aes(x = malNorm, y = meanEstimate, color = malNorm), width = .1, size = 2) +
scale_color_manual(values = c("black", "red"), labels = c("Normal", "MPN")) +
theme_bw() +
ylab("Mean net growth rate (r)") +
xlab("") +
theme(
legend.position = "none",
axis.ticks.x = element_blank(), axis.title.x = element_blank()
)
Taking our results from unique clones, uniqueResults
,
let’s compare growth rates from clones with either an unknown driver or
a single driver, to those with multiple drivers. For this, we’ll notice
a handy convention in the metadata: drivers are separated with “AND”, so
any clone with “AND” in the driver column, has multiple drivers:
# Set the multi driver variable according to whether multiple drivers are annotated
uniqueResults$multiDriver <- "Single or unknown"
uniqueResults$multiDriver[grepl("AND", uniqueResults$cloneDriver)] <- "Multiple"
# Run a non-parametric test to see if the differences are significant
nonParamTest <- wilcox.test(estimate ~ multiDriver, data = uniqueResults)
print(nonParamTest)
#>
#> Wilcoxon rank sum exact test
#>
#> data: estimate by multiDriver
#> W = 276, p-value = 1.554e-06
#> alternative hypothesis: true location shift is not equal to 0
# Set factor levels to control plot order
uniqueResults$multiDriver <- factor(uniqueResults$multiDriver, levels = c("Single or unknown", "Multiple"))
# Plot (Fig. 5F)
ggplot(uniqueResults, aes(x = multiDriver, y = estimate)) +
geom_boxplot(width = 0.3, aes(color = multiDriver), outlier.shape = NA) +
geom_jitter(aes(x = multiDriver, y = estimate, color = multiDriver), width = .1) +
scale_color_manual(values = c(colorPal[4], colorPal[8])) +
theme_bw() +
geom_label(label = paste0("p=", round(nonParamTest$p.value, 7)), x = 1.5, y = 2) +
ylab("Net growth rate (r)") +
xlab("") +
ggtitle("Driver mutations") +
theme(
legend.position = "none", axis.ticks.x = element_blank(),
axis.title.x = element_blank(), plot.title = element_text(hjust = .5)
)
We discussed before that our results data.frame (the output of
internalLengths()
or maxLikelihood()
) came
with a clone age estimate, which was 1/r + the time from sampling back to
the first coalescence. If the birth rate of a clone is a and the death rate is b, then the net growth rate is r = a − b. The
first coalescence represents the first surviving division because, if
neither of the offspring of a division die out, they’ll account for, on
average, half the cells in the clone. This means that even if we sample
only 10 cells, we’re very likely to sample at least one descendant from
each of the two offspring, meaning that first surviving division should
be the earliest coalescence, relative to the time of sampling. How do we
know when the first surviving division occurs? Well, we know:
With that out of the way, let’s see if the growth rate tells us anything about the time from clone initiation to diagnosis. For this, we want to use the max. growth rate from each patient, as that’s the one which we’d assume to be most likely to drive MPN initiation. We don’t understand MPN very well though, so an alternative could be to use the most recently emerged clone prior to diagnosis. We’ll do both here, showing that they actually give the same results, probably because any clone which arises later in life must be fit enough to out-compete other clones which have already expanded.
# Get the clone IDs of the fittest clone in each patient
cloneIDs_max <- sapply(unique(uniqueResults$patient), function(x) {
patient.df <- uniqueResults[uniqueResults$patient == x, ]
cloneID <- patient.df$cloneName_result[which.max(patient.df$estimate)]
cloneID
})
# Get the clone IDs of the youngest clone in each patient
cloneIDs_youngest <- sapply(unique(uniqueResults$patient), function(x) {
patient.df <- uniqueResults[uniqueResults$patient == x, ]
cloneID <- patient.df$cloneName_result[which.min(patient.df$cloneAgeEstimate)]
cloneID
})
# See how much overlap there is between the most recent clone and the highest growth rate
table(cloneIDs_max == cloneIDs_youngest)
#>
#> FALSE TRUE
#> 4 15
We see that in 15/19 cases, the youngest clone (with >= 10 tips) is also the most fit. First, let’s use the most fit clones and stratify by the mean, reproducing our result in Figure 5G.
# Subset the results to only have the fittest clone for each patient
maxResults <- uniqueResults[uniqueResults$cloneName_result %in% cloneIDs_max, ]
# Subset to the MPN patients
malMax <- maxResults[maxResults$malnorm == "Malignant", ]
# Determine time from clone initiation to diagnosis
malMax$latency_to_dx <- malMax$cloneAgeEstimate - (malMax$age - malMax$diagnosis.age)
# Stratify the population by the mean growth rate, adding a column "aboveMean"
malMax$aboveMean <- malMax$estimate > mean(malMax$estimate)
# Use the survival package to see if differences are significant
malMax$status <- 1
survivalTest <- survival::survdiff(survival::Surv(time = latency_to_dx, event = status) ~ aboveMean, data = malMax)
print(survivalTest$pvalue)
#> [1] 0.002628379
# Plot survival curves
survfit2(Surv(time = latency_to_dx, event = status) ~ aboveMean, data = malMax) %>%
ggsurvfit() +
labs(
y = "Probability diagnosis-free",
x = "Time from clone initiation (yrs.)"
) + scale_color_manual(values = c(colorPal[7], colorPal[3])) +
theme(legend.position = "none") +
geom_label(label = paste0("p = ", round(survivalTest$pvalue, 4)), x = 45, y = .85, color = "black", fill = "white") +
geom_label(label = "r < mean", x = 40, y = .375, color = colorPal[7], fill = "white") +
geom_label(label = "r > mean", x = 12, y = .35, color = colorPal[3], fill = "white")
#> Warning in geom_label(label = paste0("p = ", round(survivalTest$pvalue, : All aesthetics have length 1, but the data has 15 rows.
#> ℹ Please consider using `annotate()` or provide this layer with data containing
#> a single row.
#> Warning in geom_label(label = "r < mean", x = 40, y = 0.375, color = colorPal[7], : All aesthetics have length 1, but the data has 15 rows.
#> ℹ Please consider using `annotate()` or provide this layer with data containing
#> a single row.
#> Warning in geom_label(label = "r > mean", x = 12, y = 0.35, color = colorPal[3], : All aesthetics have length 1, but the data has 15 rows.
#> ℹ Please consider using `annotate()` or provide this layer with data containing
#> a single row.
Because we don’t fully understand clonal hematopoiesis and MPN, it’s reasonable to wonder whether using the fittest clone is the best way to compare time to diagnosis. As an alternative, let’s compare the youngest clones, seeing if our growth rate estimates still stratify the time to diagnosis.
youngestResults <- uniqueResults[uniqueResults$cloneName_result %in% cloneIDs_youngest, ]
# Subset to the MPN patients
malYoungest <- youngestResults[youngestResults$malnorm == "Malignant", ]
# Check how much overlap there is between youngest clone and fittest clone in the MPN dataset
table(malYoungest$cloneName_result == malMax$cloneName_result)
#>
#> TRUE
#> 13
As it turns out, the youngest clone is the fittest clone in all 13 individuals diagnosed with MPN. We saw before that this was the case in 15 of the 19 individuals when including the healthy individuals.
Remember when we said that there were three cases of duplicate clones? Well, if our estimates are worth anything, they must be consistent. This means that, in these three clones, we should expect that the estimates are similar at both timepoints. Let’s check this:
# Like before, we can find which clone IDs appear twice
tmp <- table(results$uniqueCloneID)
repeatsVec <- names(tmp)[tmp == 2]
# Subset the results to include only repeated clones
repeatResults <- results[results$uniqueCloneID %in% repeatsVec, ]
# Plot the repeated clones
ggplot(repeatResults) +
geom_pointrange(data = repeatResults, aes(
x = cloneName_result, y = estimate, ymin = lowerBound,
ymax = upperBound, color = uniqueCloneID
)) +
scale_color_manual(values = colorPal[c(1, 6, 7)]) +
theme_bw() +
ylab("Net growth rate estimate (r)") +
theme(axis.text.x = element_blank(), axis.title.x = element_blank(), legend.title = element_blank())
Not too bad! It’s good to be consistent, but it’s better to be consistent and accurate. For that, we need a completely orthogonal way to measure clonal growth rates, which comes from longitudinal data. One of the clones from the Williams paper and two of the clones from the Fabre paper have matching longitudinal data. We have formatted this nicely and it comes with the package!
Let’s now fit a logistic curve to the longitudinal data and estimate the growth rate for one of these clones, from individual PD9478, which is the tree we showed at the beginning of this vignette. Here, we are recreating some of the analysis from our Figure 6.
# Get data from PD9478
PD9478_long <- longitudinalData[longitudinalData$Sample.ID == "PD9478", ]
# Get a rough estimate for starting params
startingParams <- suppressWarnings(coef(lm(logit(VAF / .5) ~ Age, data = PD9478_long)))
# Fit to a three parameter logistic curve
fit <- nls(VAF ~ K / (1 + exp(-(phi2 + r * Age))),
start = list(K = 0.2, phi2 = min(c(-0.000001, startingParams[1])), r = max(c(0.00001, startingParams[2]))),
data = PD9478_long, trace = FALSE, algorithm = "port", lower = c(0, -500, 0.00001), upper = c(0.5, 0, 5)
)
# Set output equal to summary of longitudinal models
output <- summary(fit)
# Assign fitted params
r <- output$coefficients["r", 1]
t_m <- -output$coefficients["phi2", 1] / r # Get midpoint time by dividing by -1/r
K <- output$coefficients["K", 1]
# Get CI for growth rate of longitudinal/logistic model
stdError <- output$coefficients["r", "Std. Error"]
lb <- r - 1.96 * stdError
ub <- r + 1.96 * stdError
# Prepare a df for plotting the fit line along with the data
x <- c((min(PD9478_long$Age) - 10):(max(PD9478_long$Age) + 10)) # construct a range of x values bounded by the data
y <- K / (1 + exp(-(x - t_m) * r)) # curve VAF (3 param model)
predict.df <- data.frame("x" = x, "y" = y)
# Set color for longitudinal fit info
fitColor <- colorPal[6]
# Plot
ggplot(PD9478_long, aes(x = Age, y = VAF)) +
theme_bw() +
coord_cartesian(xlim = c(min(x), max(x)), ylim = c(-0.01, 0.52), expand = 0) +
labs(x = "Person Age (yr)", y = "Variant Allele Frequency (VAF)") +
ggtitle(gsub("_", " & ", paste0(PD9478_long$Sample.ID[1], " ", PD9478_long$Gene[1]))) +
geom_line(data = predict.df, aes(x = x, y = y), color = fitColor, linewidth = 1, show.legend = TRUE) +
geom_point(color = "#808080", shape = 18, size = 1.5) +
geom_vline(xintercept = t_m, linetype = "dotted", color = fitColor, linewidth = .6) +
geom_hline(yintercept = K, linetype = "dotted", color = fitColor, linewidth = .6)
Now, let’s see how the estimate compares with our maximum likelihood estimate:
# Get the single cell estimate from PD9478 clone 1
scPD9478 <- results[results$cloneName_result == "PD9478_1_clone1", ]
# Combine orthogonal estimates into one df for plotting
combine.df <- data.frame(
"Clone" = "PD9478_1_clone1",
"r" = c(r, scPD9478$estimate),
"lowerBound" = c(lb, scPD9478$lowerBound),
"upperBound" = c(ub, scPD9478$upperBound),
"method" = c("longitudinal", "max. likelihood")
)
# Plot
ggplot(combine.df) +
geom_pointrange(aes(x = method, y = r, ymin = lowerBound, ymax = upperBound)) +
theme_bw() +
theme(axis.title.x = element_blank()) +
ylab("Net growth rate (r)")
All in all, not too bad! This is the result shown in Figure 6 A&D. But now you have the tools to reproduce our analysis relating to the real data aspect of our paper. Hopefully you were able to do this in a matter of hours and minutes, which is much better than the months it took me!
Any real world data we include is produced and annotated in these fascinating papers, which have advanced the capabilities of single-cell Whole Genome Sequencing. They also provide excellent analysis in addition to what we provide here and in our work.
The mathematical methods for estimating growth rates build in large part from the work of Amaury Lambert in:
And here’s a final link to our paper for more of the details of the methods and data analysis.
If you enjoyed the color palette for the plotting, we tried to use colorblind friendly colors, specifically pulling colors from a palette designed by Bang Wong and available here.