File Sizes
In genomic sequencing, files are often several gigabytes in size and contain millions of data points. Reading such files on a local machine, such as your laptop, can take an excruciatingly long time.
While there are programs that can handle large amounts of data, a simple solution is to process your data in chunks. For instance, instead of looking at ten chromosomes simultaneously, it may be simpler to focus on two or three at a time.
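One way to process a file in chunks is sketched below in base R. The helper read_matching_rows() and the chromosome column are hypothetical names introduced for illustration and are not part of miplicorn. The sketch assumes a simple comma-separated file with a header row and no line breaks inside fields.

```r
# Hypothetical helper (not part of miplicorn): read a CSV a few rows at a
# time and keep only the rows whose `column` value is in `keep`.
read_matching_rows <- function(path, column, keep, chunk_size = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # The first line holds the column names; scan() strips any quotes.
  header <- scan(text = readLines(con, n = 1), what = "", sep = ",",
                 quiet = TRUE)

  kept <- list()
  repeat {
    # Read the next block of lines; an empty result means end of file.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break

    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      stringsAsFactors = FALSE)

    # Only the matching rows of each chunk are retained in memory.
    kept[[length(kept) + 1]] <- chunk[chunk[[column]] %in% keep, , drop = FALSE]
  }

  # Combine the retained rows (NULL if the file had no data rows).
  do.call(rbind, kept)
}

# Small made-up file: ten chromosomes with two rows each.
demo_file <- tempfile(fileext = ".csv")
demo <- data.frame(
  chromosome = rep(paste0("chr", 1:10), each = 2),
  coverage = 1:20
)
write.csv(demo, demo_file, row.names = FALSE)

# Focus on two chromosomes at a time instead of all ten.
subset_tbl <- read_matching_rows(demo_file, "chromosome", c("chr1", "chr2"),
                                 chunk_size = 3)
```

Because only the matching rows of each chunk are kept, peak memory use depends on the chunk size and the size of the subset rather than on the size of the whole file. The same call can be repeated with the next group of chromosomes.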
Filters
The entire read_tbl_*() family of functions provides the ability to filter data so that it loads and runs faster. The filter is applied before objects are loaded into R. Data can be filtered using any of the information present in the metadata, and you may filter on multiple conditions at once.
cov_file <- miplicorn_example("coverage_AA_table.csv")
read_tbl_coverage(cov_file)
#> # A tibble: 6,344 × 8
#> sample gene_id gene mutation_name exonic_func aa_change targeted coverage
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 608
#> 2 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 20
#> 3 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 158
#> 4 D10-JJJ-5 PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 2
#> 5 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 1
#> 6 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 129
#> 7 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 0
#> 8 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 0
#> 9 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 90
#> 10 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 175
#> # … with 6,334 more rows
read_tbl_coverage(cov_file, gene == "atp6")
#> # A tibble: 260 × 8
#> sample gene_id gene mutation_name exonic_func aa_change targeted coverage
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 608
#> 2 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 20
#> 3 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 158
#> 4 D10-JJJ-5 PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 2
#> 5 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 1
#> 6 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 129
#> 7 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 0
#> 8 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 0
#> 9 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 90
#> 10 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 175
#> # … with 250 more rows
read_tbl_coverage(cov_file, gene == "atp6", targeted == "Yes")
#> # A tibble: 156 × 8
#> sample gene_id gene mutation_name exonic_func aa_change targeted coverage
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 608
#> 2 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 20
#> 3 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 158
#> 4 D10-JJJ-5 PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 2
#> 5 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 1
#> 6 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 129
#> 7 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 0
#> 8 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 0
#> 9 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 90
#> 10 D10-JJJ-… PF3D7_… atp6 atp6-Ala623G… missense_v… Ala623Glu Yes 175
#> # … with 146 more rows