Simulate biallelic data from a simple statistical model. Inputs include the complexity of infection (COI), population-level minor allele frequencies (PLMAF), and some parameters dictating skew and error distributions. Outputs include the phased haplotypes and the unphased read count and coverage data.
Usage
sim_biallelic(
coi,
plmaf = runif(10, 0, 0.5),
coverage = 200,
alpha = 1,
overdispersion = 0,
relatedness = 0,
epsilon = 0
)
Arguments
- coi
Complexity of infection.
- plmaf
Vector of population-level minor allele frequencies at each locus.
- coverage
Coverage at each locus. If a single value is supplied then the same coverage is applied over all loci.
- alpha
Shape parameter of the symmetric Dirichlet prior on strain proportions.
- overdispersion
The extent to which counts are over-dispersed relative to the binomial distribution. Counts are Beta-Binomially distributed, with the beta distribution having shape parameters \(\frac{p}{overdispersion}\) and \(\frac{1-p}{overdispersion}\).
- relatedness
The probability that a strain in mixed infections is related to another. The implementation is similar to relatedness as defined in THE REAL McCOIL simulations (doi:10.1371/journal.pcbi.1005348 ): "... simulated relatedness (r) among lineages within the same host by sampling alleles either from an existing lineage within the same host (with probability r) or from the population (with probability (1-r))."
- epsilon
The probability of a single read being miscalled as the other allele. This error is applied in both directions.
Value
An object of class sim
. Contains a list of
tibbles:
parameters
contains each parameter and the value used to simulate data.strain_proportions
contains the proportion of each strain.phased_haplotypes
contains the phased haplotype for each strain at each locus.data
contains the following columns:plmaf
: The population-level minor allele frequency.coverage
: The coverage at each locus.counts
: The count at each locus.wsaf
: The within-sample minor allele frequency.
Details
Simulated data are drawn from a simple statistical model:
Strain proportions are drawn from a symmetric Dirichlet distribution with shape parameter
alpha
.Phased haplotypes are drawn at every locus, one for each
coi
. The allele at each locus is drawn from a Bernoulli distribution with probability given by theplmaf
.The "true" within-sample allele frequency at every locus is obtained by multiplying haplotypes by their strain proportions, and summing over haplotypes. Errors are introduced through the equation \[wsmaf_{error} = wsmaf(1-e) + (1-wsmaf)e\] where \(wsmaf\) is the WSMAF without error and \(e\) is the error parameter
epsilon
.Final read counts are drawn from a beta-binomial distribution with expectation \(w_{error}\). The raw number of draws is given by the
coverage
, and the skew of the distribution is given by theoverdispersion
parameter. If theoverdispersion
is equal to zero, then the distribution is binomial, rather than beta-binomial.
See also
Other simulated data functions:
plot-simulation
,
process_sim()
Examples
sim_biallelic(coi = 5)
#> $parameters
#> # A tibble: 5 × 2
#> parameter value
#> <chr> <dbl>
#> 1 coi 5
#> 2 alpha 1
#> 3 overdispersion 0
#> 4 relatedness 0
#> 5 epsilon 0
#>
#> $strain_proportions
#> # A tibble: 5 × 1
#> proportion
#> <dbl>
#> 1 0.229
#> 2 0.170
#> 3 0.233
#> 4 0.156
#> 5 0.212
#>
#> $phased_haplotypes
#> # A tibble: 5 × 10
#> locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9
#> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 0 0 0 1 1 0 0 0
#> 2 0 0 0 0 0 0 0 0 0
#> 3 0 0 1 0 0 0 0 0 0
#> 4 1 0 0 1 0 1 1 0 0
#> 5 0 0 0 0 1 0 0 0 0
#> # ℹ 1 more variable: locus_10 <int>
#>
#> $data
#> # A tibble: 10 × 4
#> plmaf coverage counts wsmaf
#> <dbl> <dbl> <int> <dbl>
#> 1 0.157 200 68 0.34
#> 2 0.218 200 0 0
#> 3 0.200 200 42 0.21
#> 4 0.411 200 23 0.115
#> 5 0.242 200 90 0.45
#> 6 0.214 200 64 0.32
#> 7 0.0873 200 35 0.175
#> 8 0.0618 200 0 0
#> 9 0.114 200 0 0
#> 10 0.454 200 87 0.435
#>
#> attr(,"class")
#> [1] "sim"