Skip to contents

Simulate biallelic data from a simple statistical model. Inputs include the complexity of infection (COI), population-level minor allele frequencies (PLMAF), and some parameters dictating skew and error distributions. Outputs include the phased haplotypes and the unphased read count and coverage data.

Usage

sim_biallelic(
  coi,
  plmaf = runif(10, 0, 0.5),
  coverage = 200,
  alpha = 1,
  overdispersion = 0,
  relatedness = 0,
  epsilon = 0
)

Arguments

coi

Complexity of infection.

plmaf

Vector of population-level minor allele frequencies at each locus.

coverage

Coverage at each locus. If a single value is supplied then the same coverage is applied over all loci.

alpha

Shape parameter of the symmetric Dirichlet prior on strain proportions.

overdispersion

The extent to which counts are over-dispersed relative to the binomial distribution. Counts are Beta-Binomially distributed, with the beta distribution having shape parameters \(\frac{p}{overdispersion}\) and \(\frac{1-p}{overdispersion}\).

relatedness

The probability that a strain in mixed infections is related to another. The implementation is similar to relatedness as defined in THE REAL McCOIL simulations (doi:10.1371/journal.pcbi.1005348 ): "... simulated relatedness (r) among lineages within the same host by sampling alleles either from an existing lineage within the same host (with probability r) or from the population (with probability (1-r))."

epsilon

The probability of a single read being miscalled as the other allele. This error is applied in both directions.

Value

An object of class sim. Contains a list of tibbles:

  • parameters contains each parameter and the value used to simulate data.

  • strain_proportions contains the proportion of each strain.

  • phased_haplotypes contains the phased haplotype for each strain at each locus.

  • data contains the following columns:

    • plmaf: The population-level minor allele frequency.

    • coverage: The coverage at each locus.

    • counts: The count at each locus.

    • wsaf: The within-sample minor allele frequency.

Details

Simulated data are drawn from a simple statistical model:

  1. Strain proportions are drawn from a symmetric Dirichlet distribution with shape parameter alpha.

  2. Phased haplotypes are drawn at every locus, one for each coi. The allele at each locus is drawn from a Bernoulli distribution with probability given by the plmaf.

  3. The "true" within-sample allele frequency at every locus is obtained by multiplying haplotypes by their strain proportions, and summing over haplotypes. Errors are introduced through the equation \[wsmaf_{error} = wsmaf(1-e) + (1-wsmaf)e\] where \(wsmaf\) is the WSMAF without error and \(e\) is the error parameter epsilon.

  4. Final read counts are drawn from a beta-binomial distribution with expectation \(w_{error}\). The raw number of draws is given by the coverage, and the skew of the distribution is given by the overdispersion parameter. If the overdispersion is equal to zero, then the distribution is binomial, rather than beta-binomial.

See also

Other simulated data functions: plot-simulation, process_sim()

Examples

sim_biallelic(coi = 5)
#> $parameters
#> # A tibble: 5 × 2
#>   parameter      value
#>   <chr>          <dbl>
#> 1 coi                5
#> 2 alpha              1
#> 3 overdispersion     0
#> 4 relatedness        0
#> 5 epsilon            0
#> 
#> $strain_proportions
#> # A tibble: 5 × 1
#>   proportion
#>        <dbl>
#> 1      0.229
#> 2      0.170
#> 3      0.233
#> 4      0.156
#> 5      0.212
#> 
#> $phased_haplotypes
#> # A tibble: 5 × 10
#>   locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9
#>     <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>
#> 1       1       0       0       0       1       1       0       0       0
#> 2       0       0       0       0       0       0       0       0       0
#> 3       0       0       1       0       0       0       0       0       0
#> 4       1       0       0       1       0       1       1       0       0
#> 5       0       0       0       0       1       0       0       0       0
#> # ℹ 1 more variable: locus_10 <int>
#> 
#> $data
#> # A tibble: 10 × 4
#>     plmaf coverage counts wsmaf
#>     <dbl>    <dbl>  <int> <dbl>
#>  1 0.157       200     68 0.34 
#>  2 0.218       200      0 0    
#>  3 0.200       200     42 0.21 
#>  4 0.411       200     23 0.115
#>  5 0.242       200     90 0.45 
#>  6 0.214       200     64 0.32 
#>  7 0.0873      200     35 0.175
#>  8 0.0618      200      0 0    
#>  9 0.114       200      0 0    
#> 10 0.454       200     87 0.435
#> 
#> attr(,"class")
#> [1] "sim"