
How can I add metadata to my data?

To add metadata to your data, read the metadata file into R as an object and then join the result to your data.

Metadata is often saved as a delimited file, most commonly a CSV (comma-separated values) or a TSV (tab-separated values) file, or as an Excel file. Several packages can read such files; we recommend {vroom} for delimited files and {readxl} for Excel files.

metadata <- vroom::vroom("path/to/file")
# OR
metadata <- readxl::read_excel("path/to/file")
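As a minimal sketch of the delimited-file case, the snippet below writes a small temporary CSV and reads it back with {vroom}. The file contents and column values are hypothetical and stand in for your own metadata. Stating the delimiter and column types explicitly is optional, but it can prevent sample IDs from being guessed as another type.

```r
# Write a small hypothetical metadata file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "ID,District,Facility",
    "LA-05-37,District A,Facility 1",
    "KO-05-62,District B,Facility 2"
  ),
  path
)

# Read it back, naming the delimiter and reading every column as character
metadata <- vroom::vroom(
  path,
  delim = ",",
  col_types = vroom::cols(.default = vroom::col_character())
)
```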

After reading in your metadata, you can add the metadata to your data frame using one of {dplyr}’s mutating joins.

dplyr::left_join(data, metadata, by = c("sample" = "ID"))
#> # A tibble: 10 × 10
#>    sample    gene_id gene  mutation_name exonic_func aa_change targeted coverage
#>    <chr>     <chr>   <chr> <chr>         <chr>       <chr>     <chr>       <dbl>
#>  1 LA-05-37  PF3D7_… crt   crt-Cys72Ser  missense_v… Cys72Ser  Yes          1716
#>  2 KO-05-62  PF3D7_… PF3D… PF3D7_113340… missense_v… Glu187Lys No            246
#>  3 HO-05-13  PF3D7_… PF3D… PF3D7-145120… synonymous… Asn71Asn  Yes             1
#>  4 TO-05-43  PF3D7_… PF3D… PF3D7_113340… missense_v… Glu121Lys No             40
#>  5 KO-05-45  PF3D7_… PF3D… PF3D7_113340… missense_v… Glu405Lys No              0
#>  6 KS-05-95  PF3D7_… crt   crt-Ala220Ser missense_v… Ala220Ser Yes             1
#>  7 AR-05-25  PF3D7_… dhps  dhps-Ile431V… missense_v… Ile431Val Yes           274
#>  8 HO-05-52  PF3D7_… PF3D… PF3D7_113340… missense_v… Ser283Leu No            736
#>  9 KN-05-54  PF3D7_… pph   pph-Asn1189_… disruptive… Asn1189_… No              0
#> 10 LA-05-100 PF3D7_… dhfr… dhfr-ts-Cys5… missense_v… Cys59Arg  Yes            17
#> # … with 2 more variables: District <chr>, Facility <chr>

Note that you may need to change the by argument so that it names the sample column in your data (here, sample) and the matching identifier column in your metadata (here, ID).
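One way to check a join before relying on it is to look for samples that have no metadata. The sketch below uses small hypothetical tibbles (the values are made up for illustration) and {dplyr}'s anti_join() to list the samples in the data that find no match in the metadata.

```r
# Hypothetical toy inputs; replace these with your own data and metadata
data <- tibble::tibble(
  sample = c("LA-05-37", "KO-05-62", "HO-05-13"),
  coverage = c(1716, 246, 1)
)
metadata <- tibble::tibble(
  ID = c("LA-05-37", "KO-05-62"),
  District = c("District A", "District B")
)

# A left join keeps every row of `data`; unmatched samples get NA metadata
joined <- dplyr::left_join(data, metadata, by = c("sample" = "ID"))

# An anti join returns only the rows of `data` without a match in `metadata`
unmatched <- dplyr::anti_join(data, metadata, by = c("sample" = "ID"))
unmatched$sample
#> [1] "HO-05-13"
```

If the anti join returns rows, the corresponding metadata columns in the joined table will be NA, which usually points to a naming mismatch or a missing metadata entry.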

My sample names do not match

If your sample names do not match, first make sure that you are reading in the correct files! Even with the correct files, the sample names in your data and metadata may differ slightly. For example, consider a dataset with the following sample names:

#>  [1] "LA-05-37-ug-sur-2020-1"  "KO-05-62-ug-sur-2020-1" 
#>  [3] "HO-05-13-ug-sur-2020-1"  "TO-05-43-ug-sur-2020-1" 
#>  [5] "KO-05-45-ug-sur-2020-1"  "KS-05-95-ug-sur-2020-1" 
#>  [7] "AR-05-25-ug-sur-2020-1"  "HO-05-52-ug-sur-2020-1" 
#>  [9] "KN-05-54-ug-sur-2020-1"  "LA-05-100-ug-sur-2020-1"

To add your metadata to this dataset, you must first manipulate the strings so that the sample names in the dataset and the metadata align. To do so, you can use the {stringr} package.

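library(magrittr) # provides the %>% pipe
# Remove "-ug" and everything after it so the names match the metadata IDs
data_formatted <- data_misformatted %>%
  dplyr::mutate(sample = stringr::str_remove(sample, "-ug.*"))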

You can then use {dplyr} as before:

dplyr::left_join(data_formatted, metadata, by = c("sample" = "ID"))
#> # A tibble: 10 × 10
#>    sample    gene_id gene  mutation_name exonic_func aa_change targeted coverage
#>    <chr>     <chr>   <chr> <chr>         <chr>       <chr>     <chr>       <dbl>
#>  1 LA-05-37  PF3D7_… crt   crt-Cys72Ser  missense_v… Cys72Ser  Yes          1716
#>  2 KO-05-62  PF3D7_… PF3D… PF3D7_113340… missense_v… Glu187Lys No            246
#>  3 HO-05-13  PF3D7_… PF3D… PF3D7-145120… synonymous… Asn71Asn  Yes             1
#>  4 TO-05-43  PF3D7_… PF3D… PF3D7_113340… missense_v… Glu121Lys No             40
#>  5 KO-05-45  PF3D7_… PF3D… PF3D7_113340… missense_v… Glu405Lys No              0
#>  6 KS-05-95  PF3D7_… crt   crt-Ala220Ser missense_v… Ala220Ser Yes             1
#>  7 AR-05-25  PF3D7_… dhps  dhps-Ile431V… missense_v… Ile431Val Yes           274
#>  8 HO-05-52  PF3D7_… PF3D… PF3D7_113340… missense_v… Ser283Leu No            736
#>  9 KN-05-54  PF3D7_… pph   pph-Asn1189_… disruptive… Asn1189_… No              0
#> 10 LA-05-100 PF3D7_… dhfr… dhfr-ts-Cys5… missense_v… Cys59Arg  Yes            17
#> # … with 2 more variables: District <chr>, Facility <chr>
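Before joining, it can help to test the pattern on a few names and confirm that every cleaned name appears in the metadata. The following is a rough sketch: the vector of metadata IDs is a hypothetical stand-in for your metadata's ID column.

```r
# Two of the sample names shown above
sample_names <- c("LA-05-37-ug-sur-2020-1", "LA-05-100-ug-sur-2020-1")

# Hypothetical IDs as they would appear in the metadata
metadata_ids <- c("LA-05-37", "LA-05-100", "KO-05-62")

# Remove the first "-ug" and everything that follows it
cleaned <- stringr::str_remove(sample_names, "-ug.*")

# TRUE only if every cleaned name has a counterpart in the metadata
all_matched <- all(cleaned %in% metadata_ids)

# Names that still fail to match, if any
still_unmatched <- setdiff(cleaned, metadata_ids)
```

An empty still_unmatched vector suggests the pattern is doing what you expect; otherwise the remaining names show which cases need a different pattern.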

How can I combine multiple probe sets?

Before combining multiple probe sets, make sure that you plan to conduct the same analysis for each probe set. If you do not, you may want to reconsider combining them. Assuming you want to combine the same type of data for each probe set, you can simply bind the rows using {dplyr}.

dplyr::bind_rows(
  list(probe_1 = probe_1, probe_2 = probe_2, probe_3 = probe_3),
  .id = "probe_set"
)
#> # A tibble: 30 × 9
#>    probe_set sample   chrom pos     ref    alt   mutation_name targeted coverage
#>    <chr>     <chr>    <chr> <chr>   <chr>  <chr> <chr>         <chr>       <dbl>
#>  1 probe_1   AR-05-54 chr13 1629329 AATAA… A     chr13:162932… No              0
#>  2 probe_1   AR-05-73 chr13 1719135 A      T     chr13:171913… No             23
#>  3 probe_1   KS-05-35 chr8  495991  C      T     chr8:495991:… No              0
#>  4 probe_1   KN-05-43 chr13 1701344 TAT    A     chr13:170134… No             10
#>  5 probe_1   MU-05-69 chr13 1713479 C      T     chr13:171347… No              3
#>  6 probe_1   JI-05-82 chr13 1706967 C      A     chr13:170696… No              0
#>  7 probe_1   JI-05-28 chr13 1706000 A      G     chr13:170600… No             10
#>  8 probe_1   AG-05-28 chr13 1706978 GATTA… TATT… chr13:170697… No              0
#>  9 probe_1   AG-05-82 chr4  848676  T      A     chr4:848676:… No              0
#> 10 probe_1   KS-05-63 chr4  736960  C      T     chr4:736960:… No              0
#> # … with 20 more rows
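The sketch below repeats the same idea on small hypothetical probe sets (the values are made up) and adds two optional checks: that the probe sets share the same columns before binding, and how many rows each probe set contributed afterwards.

```r
# Hypothetical toy probe sets; replace these with your own tables
probe_1 <- tibble::tibble(
  sample = c("AR-05-54", "AR-05-73"),
  coverage = c(0, 23)
)
probe_2 <- tibble::tibble(
  sample = "KS-05-35",
  coverage = 0
)

# Stop early if the probe sets do not share the same columns
stopifnot(identical(names(probe_1), names(probe_2)))

# Bind the rows; the list names are stored in the new probe_set column
combined <- dplyr::bind_rows(
  list(probe_1 = probe_1, probe_2 = probe_2),
  .id = "probe_set"
)

# Count the rows contributed by each probe set
rows_per_probe <- dplyr::count(combined, probe_set)
```

bind_rows() fills columns that are missing from one input with NA rather than raising an error, so the column check is a cheap way to catch probe sets that do not hold the same type of data.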