How can I add metadata to my data?
To add metadata to your data, you must read the metadata file into R as an object and then combine the result with your data.
Metadata is often saved as a delimited file, most commonly a CSV (comma-separated values) or a TSV (tab-separated values) file, or as an Excel file. Several packages can read such files, but we recommend {vroom} for delimited files and {readxl} for Excel files.
metadata <- vroom::vroom("path/to/file")
# OR
metadata <- readxl::read_excel("path/to/file")
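If the delimiter or column types are not guessed the way you expect, you can state them explicitly. The following is a minimal sketch, not part of the original workflow: the temporary file and its contents (ID, District) are hypothetical stand-ins for your own metadata file.

```r
# Write a small, hypothetical tab-separated metadata file to a temporary path
path <- tempfile(fileext = ".tsv")
writeLines(
  c("ID\tDistrict", "LA-05-37\tDistrict A", "KO-05-62\tDistrict B"),
  path
)

# State the delimiter explicitly and read the ID column as character
metadata <- vroom::vroom(
  path,
  delim = "\t",
  col_types = vroom::cols(ID = vroom::col_character())
)

# For an Excel workbook with several sheets, `sheet` selects one (not run):
# metadata <- readxl::read_excel("path/to/file", sheet = 1)
```

Reading the identifier column as character keeps sample names intact even when they look numeric.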
After reading in your metadata, you can add the metadata to your data frame using one of {dplyr}’s mutating joins.
dplyr::left_join(data, metadata, by = c("sample" = "ID"))
#> # A tibble: 10 × 10
#> sample gene_id gene mutation_name exonic_func aa_change targeted coverage
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 LA-05-37 PF3D7_… crt crt-Cys72Ser missense_v… Cys72Ser Yes 1716
#> 2 KO-05-62 PF3D7_… PF3D… PF3D7_113340… missense_v… Glu187Lys No 246
#> 3 HO-05-13 PF3D7_… PF3D… PF3D7-145120… synonymous… Asn71Asn Yes 1
#> 4 TO-05-43 PF3D7_… PF3D… PF3D7_113340… missense_v… Glu121Lys No 40
#> 5 KO-05-45 PF3D7_… PF3D… PF3D7_113340… missense_v… Glu405Lys No 0
#> 6 KS-05-95 PF3D7_… crt crt-Ala220Ser missense_v… Ala220Ser Yes 1
#> 7 AR-05-25 PF3D7_… dhps dhps-Ile431V… missense_v… Ile431Val Yes 274
#> 8 HO-05-52 PF3D7_… PF3D… PF3D7_113340… missense_v… Ser283Leu No 736
#> 9 KN-05-54 PF3D7_… pph pph-Asn1189_… disruptive… Asn1189_… No 0
#> 10 LA-05-100 PF3D7_… dhfr… dhfr-ts-Cys5… missense_v… Cys59Arg Yes 17
#> # … with 2 more variables: District <chr>, Facility <chr>
Note that you may need to change the by argument so that it names the sample column in your data (here, sample) and the matching column in your metadata (here, ID).
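To see how the by argument behaves, and to find samples that did not pick up any metadata, you could try a sketch like the one below. The two small tibbles are hypothetical stand-ins for your data and metadata. The anti-join is one possible check, not a required step.

```r
library(dplyr)

# Hypothetical toy data standing in for `data` and `metadata`
data <- tibble(
  sample = c("LA-05-37", "KO-05-62", "HO-05-13"),
  coverage = c(1716, 246, 1)
)
metadata <- tibble(
  ID = c("LA-05-37", "KO-05-62"),
  District = c("District A", "District B")
)

# Keep every row of `data`; samples without metadata get NA in District
joined <- left_join(data, metadata, by = c("sample" = "ID"))

# Rows of `data` whose sample name has no match in `metadata`
unmatched <- anti_join(data, metadata, by = c("sample" = "ID"))
```

If unmatched contains rows, the sample names in the two objects probably differ in some way.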
My sample names do not match
If your sample names do not match, first make sure that you are reading in the correct files! If the files are correct, the sample names in your data and metadata may simply be formatted slightly differently. For example, consider a dataset with the following sample names:
#> [1] "LA-05-37-ug-sur-2020-1" "KO-05-62-ug-sur-2020-1"
#> [3] "HO-05-13-ug-sur-2020-1" "TO-05-43-ug-sur-2020-1"
#> [5] "KO-05-45-ug-sur-2020-1" "KS-05-95-ug-sur-2020-1"
#> [7] "AR-05-25-ug-sur-2020-1" "HO-05-52-ug-sur-2020-1"
#> [9] "KN-05-54-ug-sur-2020-1" "LA-05-100-ug-sur-2020-1"
To add your metadata to this dataset, you must first perform some string manipulation so that the sample names in the dataset and the metadata align. To do so, you can use the {stringr} package.
data_formatted <- data_misformatted %>%
  dplyr::mutate(sample = stringr::str_remove(sample, "-ug.*"))
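It can help to test the pattern on a few names before relying on it. The sketch below assumes a hypothetical vector of metadata IDs and checks that every shortened name appears in it.

```r
library(stringr)

# Two of the long sample names from the dataset above
long_names <- c("LA-05-37-ug-sur-2020-1", "LA-05-100-ug-sur-2020-1")

# Remove everything from the first "-ug" to the end of the string
short_names <- str_remove(long_names, "-ug.*")

# Hypothetical metadata IDs to compare against
metadata_ids <- c("LA-05-37", "LA-05-100", "KO-05-62")

# TRUE only if every shortened name is present in the metadata
all(short_names %in% metadata_ids)
#> [1] TRUE
```

Because the pattern "-ug.*" matches from the first occurrence of "-ug", a sample name that itself contained "-ug" would be cut short, so a quick check like this is worthwhile.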
You can then use {dplyr} as before:
dplyr::left_join(data_formatted, metadata, by = c("sample" = "ID"))
#> # A tibble: 10 × 10
#> sample gene_id gene mutation_name exonic_func aa_change targeted coverage
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 LA-05-37 PF3D7_… crt crt-Cys72Ser missense_v… Cys72Ser Yes 1716
#> 2 KO-05-62 PF3D7_… PF3D… PF3D7_113340… missense_v… Glu187Lys No 246
#> 3 HO-05-13 PF3D7_… PF3D… PF3D7-145120… synonymous… Asn71Asn Yes 1
#> 4 TO-05-43 PF3D7_… PF3D… PF3D7_113340… missense_v… Glu121Lys No 40
#> 5 KO-05-45 PF3D7_… PF3D… PF3D7_113340… missense_v… Glu405Lys No 0
#> 6 KS-05-95 PF3D7_… crt crt-Ala220Ser missense_v… Ala220Ser Yes 1
#> 7 AR-05-25 PF3D7_… dhps dhps-Ile431V… missense_v… Ile431Val Yes 274
#> 8 HO-05-52 PF3D7_… PF3D… PF3D7_113340… missense_v… Ser283Leu No 736
#> 9 KN-05-54 PF3D7_… pph pph-Asn1189_… disruptive… Asn1189_… No 0
#> 10 LA-05-100 PF3D7_… dhfr… dhfr-ts-Cys5… missense_v… Cys59Arg Yes 17
#> # … with 2 more variables: District <chr>, Facility <chr>
How can I combine multiple probe sets?
Before combining multiple probe sets, make sure that you plan to conduct the same analysis on each probe set. If you do not, you may want to reconsider combining them. Assuming each probe set contains the same type of data, you can simply bind the rows using {dplyr}.
dplyr::bind_rows(
  list(probe_1 = probe_1, probe_2 = probe_2, probe_3 = probe_3),
  .id = "probe_set"
)
#> # A tibble: 30 × 9
#> probe_set sample chrom pos ref alt mutation_name targeted coverage
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 probe_1 AR-05-54 chr13 1629329 AATAA… A chr13:162932… No 0
#> 2 probe_1 AR-05-73 chr13 1719135 A T chr13:171913… No 23
#> 3 probe_1 KS-05-35 chr8 495991 C T chr8:495991:… No 0
#> 4 probe_1 KN-05-43 chr13 1701344 TAT A chr13:170134… No 10
#> 5 probe_1 MU-05-69 chr13 1713479 C T chr13:171347… No 3
#> 6 probe_1 JI-05-82 chr13 1706967 C A chr13:170696… No 0
#> 7 probe_1 JI-05-28 chr13 1706000 A G chr13:170600… No 10
#> 8 probe_1 AG-05-28 chr13 1706978 GATTA… TATT… chr13:170697… No 0
#> 9 probe_1 AG-05-82 chr4 848676 T A chr4:848676:… No 0
#> 10 probe_1 KS-05-63 chr4 736960 C T chr4:736960:… No 0
#> # … with 20 more rows
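Because bind_rows() fills columns that are missing from one input with NA rather than raising an error, you may want to confirm that the probe sets share the same columns before binding. The following is a minimal sketch using small, hypothetical probe sets. The column check is an optional safeguard, not part of the original workflow.

```r
library(dplyr)

# Hypothetical toy probe sets with identical columns
probe_1 <- tibble(sample = c("AR-05-54", "AR-05-73"), coverage = c(0, 23))
probe_2 <- tibble(sample = c("KS-05-35"), coverage = c(10))

probe_sets <- list(probe_1 = probe_1, probe_2 = probe_2)

# Check that every probe set has the same column names as the first one
same_columns <- all(vapply(
  probe_sets,
  function(x) identical(names(x), names(probe_sets[[1]])),
  logical(1)
))
stopifnot(same_columns)

# The names of the list become the values of the probe_set column
combined <- bind_rows(probe_sets, .id = "probe_set")
```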