High-throughput ‘omics’ studies such as metabolomics, lipidomics, protomics and microbiome usually share a very similar data structure when they are applied in a specific population to answer specific questions. A study with n samples and m features usually contains three portions of the data. A numeric \(m \times n\) matrix that each number represents the concentration/abundance of the \(j^{th}\) feature in \(i^{th}\) sample, a \(n \times k\) table of the sample information, and a \(m \times l\) table of the feature information. In this package and in the rest of this vignette, the three parts of the dataset are referred as conc_table, sample_table, and feature_data.
Feature is a generalized term. It can be referred to metabolites in a metabolomics study, lipids in a lipidomcis study, proteins/peptides in a proteomics study, and OTUs, genera, or phylums in a microbiome study. The sample_table usually contains the study design information, such as treatment group, dosage, age, and gender. While the feature_data contains the additional information of each feature. In a metabolomics/lipidomics study, it may contain the molecule weight, m/z value, ion mode, annotation(metabolite name), and chemical CID such as PubChem CID. In a proteomics study, it may contain the protein name, sequence, gene name, database ID such as emsembl ID. In a microbiome study, the feature_table contains the taxonomy labels in each level from kingdom to species.
The Metabase package is inspired by the bioconductor package Biobase and phyloseq. The former is the infrastructural package of bioconductor while the latter is a package designated for microbiome data analysis. The Metabase package takes the data structure design of the two package, with an emphasis on the data analysis, manipulation, hypothesis testing and visualization.
Metabase currently has four main data container classes designed for experiments with specific purposes, MetabolomicsSet, LipidomicsSet, ProteomicsSet, MicrobiomeSet, and a generic class MultxSet. The MultxSet can be used as a container for general experiment data such as dietary data, clinical values, anthropometric data, or ELISA data. All the classes above inherits from a virtual class called mSet. The mSet is a template for other classes so it is not directly callable. All mSet methods can be applied to each of its child classes. However, a method defined in any of the child classes cannot be applied to its siblings. The mSet class and its child classes must include four data slots: conc_table, sample_table, feature_data, and experiment_data. The conc_table slot must be a conc_table-class, the sample_table slot must be a sample_table-class, the feature_data slot must be a feature_data-class and the experiment_data slot must be a experiment_data-class.
The conc_table-class inherits from the native R class matrix-class with some constrains. A conc_table-class object must have samples in the columns and features in rows (i.e. each column being a subject and each row being a metabolite). Row and column names are required. The function conc_table is the constructor.
The sample data files used in this vignette come with the package. If you would like to explore the sample data on your own, you can either copy them to your own directory, or download them using the links (right click and choose ‘save link as’):
[lipid_conc_table.csv] [lipid_sample_table.csv] [lipid_feature_data.csv]
library(Metabase)
path = file.path(system.file("extdata", package = "Metabase"), "lipid_conc_table.csv")
conc = read.csv(path, row.names = 1, header = TRUE)
conc = as.matrix(conc) # conc_table must be a numeric matrix
is.numeric(conc)
#> [1] TRUE
conc_table = conc_table(conc)
conc_table
#> >>>>>> Concentration Table <<<<<<
#> conc_table() [ 170 features and 40 samples ]
#>
#> Showing the first 8 samples and 6 features...
#>
#> MD101A MD101B MD101C MD101D MD102A
#> Feature025 0.069313029 0.056075935 0.066063603 0.060227513 0.08757163
#> Feature026 0.011920550 0.009409271 0.009554416 0.009472654 0.00940928
#> Feature027 0.022202024 0.020325450 0.023517676 0.019909233 0.02439995
#> Feature028 0.069511705 0.070753449 0.092010172 0.071973128 0.10247062
#> Feature030 0.007313754 0.014001186 0.013560363 0.003070282 0.02009070
#> Feature031 0.015347708 0.015045342 0.020075784 0.015482315 0.02059498
#> Feature032 0.020923048 0.023624034 0.025750876 0.021087365 0.03024493
#> Feature033 0.014043897 0.005315704 0.013594897 0.013673466 0.02171813
#> MD102B
#> Feature025 0.055060821
#> Feature026 0.010141534
#> Feature027 0.018349653
#> Feature028 0.061691368
#> Feature030 0.013083172
#> Feature031 0.017104202
#> Feature032 0.005396957
#> Feature033 0.013688106
#>
#> ... 34 more samples
#> ... 162 more features
The sample_table-class inherits from the native R class data.frame-class with some constrains. A sample_table-class object must have samples in the rows and sample variables in the columns. Row and column names are required. When a sample_table-class object is used to construct a mSet-class child class object, the number of rows must be equal to the number of columns of the conc_table-class object, and the row names of the sample_table must be equal to the column names of the conc_table. The function sample_table is the constructor.
The number of feature names and of sample names are required to be equal. This is a way of keeping samples and features consistant during further data manipulations including data transformation and summarization.
path = file.path(system.file("extdata", package = "Metabase"), "lipid_sample_table.csv")
samp = read.csv(path, row.names = 1, header = TRUE)
is.data.frame(samp)
#> [1] TRUE
sample_table = sample_table(samp)
show(sample_table)
#> >>>>>> Sample Table <<<<<<
#> sample_table() [ 40 samples and 8 sample variables ]
#>
#> Showing the first 6 samples and 8 sample variables...
#>
#> Sample.. Species Organ Treatment Timepoint Subject
#> MD101A 1 Human Plasma EXP Pre 101
#> MD101B 2 Human Plasma EXP Post 101
#> MD101C 3 Human Plasma CTL Pre 101
#> MD101D 4 Human Plasma CTL Post 101
#> MD102A 5 Human Plasma EXP Pre 102
#> MD102B 6 Human Plasma EXP Post 102
#> MD102C 7 Human Plasma CTL Pre 102
#> MD102D 8 Human Plasma CTL Post 102
#>
#> ... 32 more samples
#> ... 2 more sample variables
The feature_data-class inherits from the native R class data.frame-class with some constrains. A feature_data-class object must have features in the rows and feature variables in the columns. Row and column names are required. When a feature_data-class object is used to construct a mSet-class child class object, the number of rows must be equal to the number of rows of the conc_table-class object, and the row names of the sample_table must be equal to the row names of the conc_table. The function feature_data is the constructor.
path = file.path(system.file("extdata", package = "Metabase"), "lipid_feature_data.csv")
feat = read.csv(path, row.names = 1, header = TRUE)
is.data.frame(feat)
#> [1] TRUE
feature_data = feature_data(feat)
show(feature_data)
#> >>>>>> Feature Data <<<<<<
#> feature_data() [ 170 features and 14 feature variables ]
#>
#> Showing the first 8 samples and 6 features...
#>
#> Identifier Annotation
#> Feature025 5.78_572.48_5.76_596.53 Ceramide (d34:1)
#> Feature026 7.18_628.55 Ceramide (d38:1)
#> Feature027 8.18_670.59 Ceramide (d41:1)
#> Feature028 8.50_684.61_8.50_708.65 Ceramide (d42:1)
#> Feature030 7.91_682.59 Ceramide (d42:2) B
#> Feature031 7.18_818.63 GlcCer (d40:1)
#> Feature032 7.83_846.66 GlcCer (d42:1)
#> Feature033 7.16_844.65 GlcCer (d42:2)
#> InChIKey Species count ESI.mode
#> Feature025 YDNKGFDKKRUKPY-TURZORIXSA-N [M+Cl]-_[M+Hac-H]- 40 (-) ES)
#> Feature026 XWBWIAOWSABHFI-NUKVNZTCSA-N [M+Cl]- 40 (-) ES)
#> Feature027 NAJHAHQNQCNWOP-PUYNVXOJSA-N? [M+Cl]- 40 (-) ES)
#> Feature028 ZJVVOYPTFQEGPH-AUTSUKAISA-N [M+Cl]- 40 (-) ES)
#> Feature030 VJSBNBBOSZJDKB-KPEYJIHVSA-N [M+Cl]- 38 (-) ES)
#> Feature031 YIGARKIIFOHVPF-YAEABVQUSA-N [M+Cl]- 40 (-) ES)
#> Feature032 POQRWMRXUOPCLD-AAAFHMTMSA-N [M+Cl]- 40 (-) ES)
#> Feature033 WBOZIXHPUPAOIA-RBELZSLISA-N [M+Cl]- 40 (-) ES)
#>
#> ... 8 more features
#> ... 162 more feature variables
The experiment_data-class holds any additional experiment data.
The name of each mSet data container class is its own constructor. The conc_table slot is required, while the sample_table, feature_data and experiment_data are optional. Notice if the sample names of conc_table and sample_table, or the feature names of conc_table and feature_data don’t match, the construction will fail.
mset = MetabolomicsSet(
conc_table = conc_table,
sample_table = sample_table,
feature_data = feature_data
)
mset
#> >>>>>> Metabolomics Experiment <<<<<<
#> conc_table(): [ 40 samples and 170 features ]
#> sample_table(): [ 40 samples and 8 sample variables ]
#> feature_data(): [ 170 features and 14 feature variables ]
Slot can be assessed by using the slot name as a function, using the $ function, or using the [[ function. The slots can also be re-assigned using <- or =.
conc_table(mset)
#> >>>>>> Concentration Table <<<<<<
#> conc_table() [ 170 features and 40 samples ]
#>
#> Showing the first 8 samples and 6 features...
#>
#> MD101A MD101B MD101C MD101D MD102A
#> Feature025 0.069313029 0.056075935 0.066063603 0.060227513 0.08757163
#> Feature026 0.011920550 0.009409271 0.009554416 0.009472654 0.00940928
#> Feature027 0.022202024 0.020325450 0.023517676 0.019909233 0.02439995
#> Feature028 0.069511705 0.070753449 0.092010172 0.071973128 0.10247062
#> Feature030 0.007313754 0.014001186 0.013560363 0.003070282 0.02009070
#> Feature031 0.015347708 0.015045342 0.020075784 0.015482315 0.02059498
#> Feature032 0.020923048 0.023624034 0.025750876 0.021087365 0.03024493
#> Feature033 0.014043897 0.005315704 0.013594897 0.013673466 0.02171813
#> MD102B
#> Feature025 0.055060821
#> Feature026 0.010141534
#> Feature027 0.018349653
#> Feature028 0.061691368
#> Feature030 0.013083172
#> Feature031 0.017104202
#> Feature032 0.005396957
#> Feature033 0.013688106
#>
#> ... 34 more samples
#> ... 162 more features
# sample_table(mset)
# feature_data(mset)
show(mset$sample_table)
#> >>>>>> Sample Table <<<<<<
#> sample_table() [ 40 samples and 8 sample variables ]
#>
#> Showing the first 6 samples and 8 sample variables...
#>
#> Sample.. Species Organ Treatment Timepoint Subject
#> MD101A 1 Human Plasma EXP Pre 101
#> MD101B 2 Human Plasma EXP Post 101
#> MD101C 3 Human Plasma CTL Pre 101
#> MD101D 4 Human Plasma CTL Post 101
#> MD102A 5 Human Plasma EXP Pre 102
#> MD102B 6 Human Plasma EXP Post 102
#> MD102C 7 Human Plasma CTL Pre 102
#> MD102D 8 Human Plasma CTL Post 102
#>
#> ... 32 more samples
#> ... 2 more sample variables
# mset$conc_table
# mset$feature_data
show(mset[['feature_data']])
#> >>>>>> Feature Data <<<<<<
#> feature_data() [ 170 features and 14 feature variables ]
#>
#> Showing the first 8 samples and 6 features...
#>
#> Identifier Annotation
#> Feature025 5.78_572.48_5.76_596.53 Ceramide (d34:1)
#> Feature026 7.18_628.55 Ceramide (d38:1)
#> Feature027 8.18_670.59 Ceramide (d41:1)
#> Feature028 8.50_684.61_8.50_708.65 Ceramide (d42:1)
#> Feature030 7.91_682.59 Ceramide (d42:2) B
#> Feature031 7.18_818.63 GlcCer (d40:1)
#> Feature032 7.83_846.66 GlcCer (d42:1)
#> Feature033 7.16_844.65 GlcCer (d42:2)
#> InChIKey Species count ESI.mode
#> Feature025 YDNKGFDKKRUKPY-TURZORIXSA-N [M+Cl]-_[M+Hac-H]- 40 (-) ES)
#> Feature026 XWBWIAOWSABHFI-NUKVNZTCSA-N [M+Cl]- 40 (-) ES)
#> Feature027 NAJHAHQNQCNWOP-PUYNVXOJSA-N? [M+Cl]- 40 (-) ES)
#> Feature028 ZJVVOYPTFQEGPH-AUTSUKAISA-N [M+Cl]- 40 (-) ES)
#> Feature030 VJSBNBBOSZJDKB-KPEYJIHVSA-N [M+Cl]- 38 (-) ES)
#> Feature031 YIGARKIIFOHVPF-YAEABVQUSA-N [M+Cl]- 40 (-) ES)
#> Feature032 POQRWMRXUOPCLD-AAAFHMTMSA-N [M+Cl]- 40 (-) ES)
#> Feature033 WBOZIXHPUPAOIA-RBELZSLISA-N [M+Cl]- 40 (-) ES)
#>
#> ... 8 more features
#> ... 162 more feature variables
# mset[['conc_table']]
# mset[['sample_data']]
The function nsamples and nfeatures are used to get the number of samples and features of any mSet class data container.
nsamples(mset)
#> [1] 40
nfeatures(mset)
#> [1] 170
Function sampleNames and featureNames are used to get the sample names and feature names of any mSet class data container.
head(sampleNames(mset), 10)
#> [1] "MD101A" "MD101B" "MD101C" "MD101D" "MD102A" "MD102B" "MD102C" "MD102D"
#> [9] "MD103A" "MD103B"
head(featureNames(mset), 10)
#> [1] "Feature025" "Feature026" "Feature027" "Feature028" "Feature030"
#> [6] "Feature031" "Feature032" "Feature033" "Feature040" "Feature043"
Function subset_samples and subset_features are used to subset the mSet class data container by specifying either the indices, sampleNames/featureNames, or a boolean (TRUE or FALSE) value.
mset2 = subset_samples(mset, 1:10)
mset2
#> >>>>>> Metabolomics Experiment <<<<<<
#> conc_table(): [ 10 samples and 170 features ]
#> sample_table(): [ 10 samples and 8 sample variables ]
#> feature_data(): [ 170 features and 14 feature variables ]
mset2 = subset_samples(mset, c("MD101A", "MD101B", "MD102A", "MD102B"))
mset2
#> >>>>>> Metabolomics Experiment <<<<<<
#> conc_table(): [ 4 samples and 170 features ]
#> sample_table(): [ 4 samples and 8 sample variables ]
#> feature_data(): [ 170 features and 14 feature variables ]
mset2 = subset_samples(mset, mset$sample_table$Timepoint == "Pre")
mset2
#> >>>>>> Metabolomics Experiment <<<<<<
#> conc_table(): [ 20 samples and 170 features ]
#> sample_table(): [ 20 samples and 8 sample variables ]
#> feature_data(): [ 170 features and 14 feature variables ]
Function transform_by_sample and transform_by_feature are used to transform the conc_table slot into different margins. The former apply each sample at a time, while the latter apply each feature at a time. The second argument of the two functions is required for returning an altered array with the same dimensions as the original one. They are similar to the apply function when the MARGIN parameter is set to 1 (for transform_by_feature) and 2 (for transform_by_sample).
mset2 = transform_by_sample(mset, scale)
library(ggplot2); library(dplyr)
cowplot::plot_grid(
data.frame(mean = colMeans(mset2$conc_table)) %>%
ggplot(aes(x = mean)) +
geom_histogram(bins = 20),
data.frame(mean = rowMeans(mset2$conc_table)) %>%
ggplot(aes(x = mean)) +
geom_histogram(bins = 30),
ncol = 2
)
The figure above shows that by using the scale function. Each sample is scaled to 0 using transform_by_sample. Each feature is scaled to 0 using transform_by_feature.
The summarize_features function will aggregate the conc_table according to a given feature variable. The example below will add all features with the sample lipid class together. The second argument must appear in the column names of the feature_data slot.
unique(mset$feature_data$class)
#> [1] Cer PC PE SM CE Cholesterol
#> [7] DG LPC TG
#> Levels: CE Cer Cholesterol DG LPC PC PE SM TG
mset2 = summarize_features(mset, "class")
mset2
#> >>>>>> Metabolomics Experiment <<<<<<
#> conc_table(): [ 40 samples and 9 features ]
#> sample_table(): [ 40 samples and 8 sample variables ]
Metabase provides a quick solution for linear model testing using the limma package.
design = model.matrix(
~ Treatment * Timepoint + Subject,
data = as(mset$sample_table, "data.frame")
)
colnames(design)
#> [1] "(Intercept)" "TreatmentEXP"
#> [3] "TimepointPre" "Subject"
#> [5] "TreatmentEXP:TimepointPre"
lm = mSet_limma(mset, design, coef = "TreatmentEXP:TimepointPre",
p.value = "TreatmentEXP:TimepointPre")
head(lm)
#> baseMean logFC stat pvalue padj
#> Feature025 0.10973033 0.0070850511 0.39690010 0.6937808 0.9492560
#> Feature026 0.01642961 -0.0005516065 -0.04258577 0.9662667 0.9836248
#> Feature027 0.03803690 -0.0062658529 -0.42615263 0.6725301 0.9492560
#> Feature028 0.13833935 -0.0121808332 -0.47519415 0.6375148 0.9492560
#> Feature030 0.02536260 -0.0097179260 -0.64201934 0.5249219 0.9492560
#> Feature031 0.02598174 -0.0037432849 -0.28388067 0.7781257 0.9492560
Metabase provides some quick visualization functions. Metabase’s plot functions are mostly built on top of the ggmetaplots package. Installation from github is required.
devtools::install_github("zhuchcn/ggmetaplots")
plot_boxplot(mset, x = "Timepoint", feature = "Feature025")
#> Loading required namespace: ggmetaplots
plot_boxplot(mset, x = "Timepoint", feature = "Feature025", cols = "Treatment")
mset$sample_table$Subject = factor(mset$sample_table$Subject)
plot_boxplot(mset, x = "Timepoint", feature = "Feature025", cols = "Treatment",
color = "Subject")
plot_boxplot(mset, x = "Timepoint", feature = "Feature025", cols = "Treatment",
color = "Subject", line = "Subject")
plot_median_hist(mset)
plot_normality_boxplot(subset_features(mset, 1:30))
plot_normality_ridges(subset_features(mset, 1:30))
#> Loading required namespace: ggridges
#> Picking joint bandwidth of 0.356