High-throughput ‘omics’ studies such as metabolomics, lipidomics, protomics and microbiome usually share a very similar data structure when they are applied in a specific population to answer specific questions. A study with n samples and m features usually contains three portions of the data. A numeric \(m \times n\) matrix that each number represents the concentration/abundance of the \(j^{th}\) feature in \(i^{th}\) sample, a \(n \times k\) table of the sample information, and a \(m \times l\) table of the feature information. In this package and in the rest of this vignette, the three parts of the dataset are referred as conc_table
, sample_table
, and feature_data
.
Feature is a generalized term. It can be referred to metabolites in a metabolomics study, lipids in a lipidomcis study, proteins/peptides in a proteomics study, and OTUs, genera, or phylums in a microbiome study. The sample_table
usually contains the study design information, such as treatment group, dosage, age, and gender. While the feature_data
contains the additional information of each feature. In a metabolomics/lipidomics study, it may contain the molecule weight, m/z value, ion mode, annotation(metabolite name), and chemical CID such as PubChem CID. In a proteomics study, it may contain the protein name, sequence, gene name, database ID such as emsembl ID. In a microbiome study, the feature_table
contains the taxonomy labels in each level from kingdom to species.
The Metabase
package is inspired by the bioconductor package Biobase and phyloseq. The former is the infrastructural package of bioconductor while the latter is a package designated for microbiome data analysis. The Metabase package takes the data structure design of the two package, with an emphasis on the data analysis, manipulation, hypothesis testing and visualization.
Metabase currently has four main data container classes designed for experiments with specific purposes, MetabolomicsSet, LipidomicsSet, ProteomicsSet, MicrobiomeSet, and a generic class MultxSet. The MultxSet can be used as a container for general experiment data such as dietary data, clinical values, anthropometric data, or ELISA data. All the classes above inherits from a virtual class called mSet. The mSet is a template for other classes so it is not directly callable. All mSet methods can be applied to each of its child classes. However, a method defined in any of the child classes cannot be applied to its siblings. The mSet class and its child classes must include four data slots: conc_table
, sample_table
, feature_data
, and experiment_data
. The conc_table
slot must be a conc_table-class
, the sample_table
slot must be a sample_table-class
, the feature_data
slot must be a feature_data-class
and the experiment_data
slot must be a experiment_data-class
.
The conc_table-class
inherits from the native R class matrix-class
with some constrains. A conc_table-class
object must have samples in the columns and features in rows (i.e. each column being a subject and each row being a metabolite). Row and column names are required. The function conc_table
is the constructor.
The sample data files used in this vignette come with the package. If you would like to explore the sample data on your own, you can either copy them to your own directory, or download them using the links (right click and choose ‘save link as’):
[lipid_conc_table.csv] [lipid_sample_table.csv] [lipid_feature_data.csv]
library(Metabase)
path = file.path(system.file("extdata", package = "Metabase"), "lipid_conc_table.csv")
conc = read.csv(path, row.names = 1, header = TRUE)
conc = as.matrix(conc) # conc_table must be a numeric matrix
is.numeric(conc)
#> [1] TRUE
conc_table = conc_table(conc)
conc_table
#> >>>>>> Concentration Table <<<<<<
#> conc_table() [ 170 features and 40 samples ]
#>
#> Showing the first 8 samples and 6 features...
#>
#> MD101A MD101B MD101C MD101D MD102A
#> Feature025 0.069313029 0.056075935 0.066063603 0.060227513 0.08757163
#> Feature026 0.011920550 0.009409271 0.009554416 0.009472654 0.00940928
#> Feature027 0.022202024 0.020325450 0.023517676 0.019909233 0.02439995
#> Feature028 0.069511705 0.070753449 0.092010172 0.071973128 0.10247062
#> Feature030 0.007313754 0.014001186 0.013560363 0.003070282 0.02009070
#> Feature031 0.015347708 0.015045342 0.020075784 0.015482315 0.02059498
#> Feature032 0.020923048 0.023624034 0.025750876 0.021087365 0.03024493
#> Feature033 0.014043897 0.005315704 0.013594897 0.013673466 0.02171813
#> MD102B
#> Feature025 0.055060821
#> Feature026 0.010141534
#> Feature027 0.018349653
#> Feature028 0.061691368
#> Feature030 0.013083172
#> Feature031 0.017104202
#> Feature032 0.005396957
#> Feature033 0.013688106
#>
#> ... 34 more samples
#> ... 162 more features
The sample_table-class
inherits from the native R class data.frame-class
with some constrains. A sample_table-class
object must have samples in the rows and sample variables in the columns. Row and column names are required. When a sample_table-class
object is used to construct a mSet-class
child class object, the number of rows must be equal to the number of columns of the conc_table-class
object, and the row names of the sample_table
must be equal to the column names of the conc_table
. The function sample_table
is the constructor.
The number of feature names and of sample names are required to be equal. This is a way of keeping samples and features consistant during further data manipulations including data transformation and summarization.
path = file.path(system.file("extdata", package = "Metabase"), "lipid_sample_table.csv")
samp = read.csv(path, row.names = 1, header = TRUE)
is.data.frame(samp)
#> [1] TRUE
sample_table = sample_table(samp)
show(sample_table)
#> >>>>>> Sample Table <<<<<<
#> sample_table() [ 40 samples and 8 sample variables ]
#>
#> Showing the first 6 samples and 8 sample variables...
#>
#> Sample.. Species Organ Treatment Timepoint Subject
#> MD101A 1 Human Plasma EXP Pre 101
#> MD101B 2 Human Plasma EXP Post 101
#> MD101C 3 Human Plasma CTL Pre 101
#> MD101D 4 Human Plasma CTL Post 101
#> MD102A 5 Human Plasma EXP Pre 102
#> MD102B 6 Human Plasma EXP Post 102
#> MD102C 7 Human Plasma CTL Pre 102
#> MD102D 8 Human Plasma CTL Post 102
#>
#> ... 32 more samples
#> ... 2 more sample variables
The feature_data-class
inherits from the native R class data.frame-class
with some constrains. A feature_data-class
object must have features in the rows and feature variables in the columns. Row and column names are required. When a feature_data-class
object is used to construct a mSet-class
child class object, the number of rows must be equal to the number of rows of the conc_table-class
object, and the row names of the sample_table
must be equal to the row names of the conc_table
. The function feature_data
is the constructor.
path = file.path(system.file("extdata", package = "Metabase"), "lipid_feature_data.csv")
feat = read.csv(path, row.names = 1, header = TRUE)
is.data.frame(feat)
#> [1] TRUE
feature_data = feature_data(feat)
show(feature_data)
#> >>>>>> Feature Data <<<<<<
#> feature_data() [ 170 features and 14 feature variables ]
#>
#> Showing the first 8 samples and 6 features...
#>
#> Identifier Annotation
#> Feature025 5.78_572.48_5.76_596.53 Ceramide (d34:1)
#> Feature026 7.18_628.55 Ceramide (d38:1)
#> Feature027 8.18_670.59 Ceramide (d41:1)
#> Feature028 8.50_684.61_8.50_708.65 Ceramide (d42:1)
#> Feature030 7.91_682.59 Ceramide (d42:2) B
#> Feature031 7.18_818.63 GlcCer (d40:1)
#> Feature032 7.83_846.66 GlcCer (d42:1)
#> Feature033 7.16_844.65 GlcCer (d42:2)
#> InChIKey Species count ESI.mode
#> Feature025 YDNKGFDKKRUKPY-TURZORIXSA-N [M+Cl]-_[M+Hac-H]- 40 (-) ES)
#> Feature026 XWBWIAOWSABHFI-NUKVNZTCSA-N [M+Cl]- 40 (-) ES)
#> Feature027 NAJHAHQNQCNWOP-PUYNVXOJSA-N? [M+Cl]- 40 (-) ES)
#> Feature028 ZJVVOYPTFQEGPH-AUTSUKAISA-N [M+Cl]- 40 (-) ES)
#> Feature030 VJSBNBBOSZJDKB-KPEYJIHVSA-N [M+Cl]- 38 (-) ES)
#> Feature031 YIGARKIIFOHVPF-YAEABVQUSA-N [M+Cl]- 40 (-) ES)
#> Feature032 POQRWMRXUOPCLD-AAAFHMTMSA-N [M+Cl]- 40 (-) ES)
#> Feature033 WBOZIXHPUPAOIA-RBELZSLISA-N [M+Cl]- 40 (-) ES)
#>
#> ... 8 more features
#> ... 162 more feature variables
The experiment_data-class
holds any additional experiment data.
The name of each mSet data container class is its own constructor. The conc_table
slot is required, while the sample_table
, feature_data
and experiment_data
are optional. Notice if the sample names of conc_table
and sample_table
, or the feature names of conc_table
and feature_data
don’t match, the construction will fail.
mset = MetabolomicsSet(
conc_table = conc_table,
sample_table = sample_table,
feature_data = feature_data
)
mset
#> >>>>>> Metabolomics Experiment <<<<<<
#> conc_table(): [ 40 samples and 170 features ]
#> sample_table(): [ 40 samples and 8 sample variables ]
#> feature_data(): [ 170 features and 14 feature variables ]
Slot can be assessed by using the slot name as a function, using the $
function, or using the [[
function. The slots can also be re-assigned using <-
or =
.
conc_table(mset)
#> >>>>>> Concentration Table <<<<<<
#> conc_table() [ 170 features and 40 samples ]
#>
#> Showing the first 8 samples and 6 features...
#>
#> MD101A MD101B MD101C MD101D MD102A
#> Feature025 0.069313029 0.056075935 0.066063603 0.060227513 0.08757163
#> Feature026 0.011920550 0.009409271 0.009554416 0.009472654 0.00940928
#> Feature027 0.022202024 0.020325450 0.023517676 0.019909233 0.02439995
#> Feature028 0.069511705 0.070753449 0.092010172 0.071973128 0.10247062
#> Feature030 0.007313754 0.014001186 0.013560363 0.003070282 0.02009070
#> Feature031 0.015347708 0.015045342 0.020075784 0.015482315 0.02059498
#> Feature032 0.020923048 0.023624034 0.025750876 0.021087365 0.03024493
#> Feature033 0.014043897 0.005315704 0.013594897 0.013673466 0.02171813
#> MD102B
#> Feature025 0.055060821
#> Feature026 0.010141534
#> Feature027 0.018349653
#> Feature028 0.061691368
#> Feature030 0.013083172
#> Feature031 0.017104202
#> Feature032 0.005396957
#> Feature033 0.013688106
#>
#> ... 34 more samples
#> ... 162 more features
# sample_table(mset)
# feature_data(mset)
show(mset$sample_table)
#> >>>>>> Sample Table <<<<<<
#> sample_table() [ 40 samples and 8 sample variables ]
#>
#> Showing the first 6 samples and 8 sample variables...
#>
#> Sample.. Species Organ Treatment Timepoint Subject
#> MD101A 1 Human Plasma EXP Pre 101
#> MD101B 2 Human Plasma EXP Post 101
#> MD101C 3 Human Plasma CTL Pre 101
#> MD101D 4 Human Plasma CTL Post 101
#> MD102A 5 Human Plasma EXP Pre 102
#> MD102B 6 Human Plasma EXP Post 102
#> MD102C 7 Human Plasma CTL Pre 102
#> MD102D 8 Human Plasma CTL Post 102
#>
#> ... 32 more samples
#> ... 2 more sample variables
# mset$conc_table
# mset$feature_data
show(mset[['feature_data']])
#> >>>>>> Feature Data <<<<<<
#> feature_data() [ 170 features and 14 feature variables ]
#>
#> Showing the first 8 samples and 6 features...
#>
#> Identifier Annotation
#> Feature025 5.78_572.48_5.76_596.53 Ceramide (d34:1)
#> Feature026 7.18_628.55 Ceramide (d38:1)
#> Feature027 8.18_670.59 Ceramide (d41:1)
#> Feature028 8.50_684.61_8.50_708.65 Ceramide (d42:1)
#> Feature030 7.91_682.59 Ceramide (d42:2) B
#> Feature031 7.18_818.63 GlcCer (d40:1)
#> Feature032 7.83_846.66 GlcCer (d42:1)
#> Feature033 7.16_844.65 GlcCer (d42:2)
#> InChIKey Species count ESI.mode
#> Feature025 YDNKGFDKKRUKPY-TURZORIXSA-N [M+Cl]-_[M+Hac-H]- 40 (-) ES)
#> Feature026 XWBWIAOWSABHFI-NUKVNZTCSA-N [M+Cl]- 40 (-) ES)
#> Feature027 NAJHAHQNQCNWOP-PUYNVXOJSA-N? [M+Cl]- 40 (-) ES)
#> Feature028 ZJVVOYPTFQEGPH-AUTSUKAISA-N [M+Cl]- 40 (-) ES)
#> Feature030 VJSBNBBOSZJDKB-KPEYJIHVSA-N [M+Cl]- 38 (-) ES)
#> Feature031 YIGARKIIFOHVPF-YAEABVQUSA-N [M+Cl]- 40 (-) ES)
#> Feature032 POQRWMRXUOPCLD-AAAFHMTMSA-N [M+Cl]- 40 (-) ES)
#> Feature033 WBOZIXHPUPAOIA-RBELZSLISA-N [M+Cl]- 40 (-) ES)
#>
#> ... 8 more features
#> ... 162 more feature variables
# mset[['conc_table']]
# mset[['sample_data']]
The function nsamples
and nfeatures
are used to get the number of samples and features of any mSet class data container.
nsamples(mset)
#> [1] 40
nfeatures(mset)
#> [1] 170
Function sampleNames
and featureNames
are used to get the sample names and feature names of any mSet class data container.
head(sampleNames(mset), 10)
#> [1] "MD101A" "MD101B" "MD101C" "MD101D" "MD102A" "MD102B" "MD102C" "MD102D"
#> [9] "MD103A" "MD103B"
head(featureNames(mset), 10)
#> [1] "Feature025" "Feature026" "Feature027" "Feature028" "Feature030"
#> [6] "Feature031" "Feature032" "Feature033" "Feature040" "Feature043"
Function subset_samples
and subset_features
are used to subset the mSet class data container by specifying either the indices, sampleNames/featureNames, or a boolean (TRUE
or FALSE
) value.
mset2 = subset_samples(mset, 1:10)
mset2
#> >>>>>> Metabolomics Experiment <<<<<<
#> conc_table(): [ 10 samples and 170 features ]
#> sample_table(): [ 10 samples and 8 sample variables ]
#> feature_data(): [ 170 features and 14 feature variables ]
mset2 = subset_samples(mset, c("MD101A", "MD101B", "MD102A", "MD102B"))
mset2
#> >>>>>> Metabolomics Experiment <<<<<<
#> conc_table(): [ 4 samples and 170 features ]
#> sample_table(): [ 4 samples and 8 sample variables ]
#> feature_data(): [ 170 features and 14 feature variables ]
mset2 = subset_samples(mset, mset$sample_table$Timepoint == "Pre")
mset2
#> >>>>>> Metabolomics Experiment <<<<<<
#> conc_table(): [ 20 samples and 170 features ]
#> sample_table(): [ 20 samples and 8 sample variables ]
#> feature_data(): [ 170 features and 14 feature variables ]
Function transform_by_sample
and transform_by_feature
are used to transform the conc_table
slot into different margins. The former apply each sample at a time, while the latter apply each feature at a time. The second argument of the two functions is required for returning an altered array with the same dimensions as the original one. They are similar to the apply
function when the MARGIN
parameter is set to 1 (for transform_by_feature
) and 2 (for transform_by_sample
).
mset2 = transform_by_sample(mset, scale)
library(ggplot2); library(dplyr)
cowplot::plot_grid(
data.frame(mean = colMeans(mset2$conc_table)) %>%
ggplot(aes(x = mean)) +
geom_histogram(bins = 20),
data.frame(mean = rowMeans(mset2$conc_table)) %>%
ggplot(aes(x = mean)) +
geom_histogram(bins = 30),
ncol = 2
)
The figure above shows that by using the scale
function. Each sample is scaled to 0 using transform_by_sample
. Each feature is scaled to 0 using transform_by_feature
.
The summarize_features
function will aggregate the conc_table
according to a given feature variable. The example below will add all features with the sample lipid class together. The second argument must appear in the column names of the feature_data
slot.
unique(mset$feature_data$class)
#> [1] Cer PC PE SM CE Cholesterol
#> [7] DG LPC TG
#> Levels: CE Cer Cholesterol DG LPC PC PE SM TG
mset2 = summarize_features(mset, "class")
mset2
#> >>>>>> Metabolomics Experiment <<<<<<
#> conc_table(): [ 40 samples and 9 features ]
#> sample_table(): [ 40 samples and 8 sample variables ]
Metabase provides a quick solution for linear model testing using the limma package.
design = model.matrix(
~ Treatment * Timepoint + Subject,
data = as(mset$sample_table, "data.frame")
)
colnames(design)
#> [1] "(Intercept)" "TreatmentEXP"
#> [3] "TimepointPre" "Subject"
#> [5] "TreatmentEXP:TimepointPre"
lm = mSet_limma(mset, design, coef = "TreatmentEXP:TimepointPre",
p.value = "TreatmentEXP:TimepointPre")
head(lm)
#> baseMean logFC stat pvalue padj
#> Feature025 0.10973033 0.0070850511 0.39690010 0.6937808 0.9492560
#> Feature026 0.01642961 -0.0005516065 -0.04258577 0.9662667 0.9836248
#> Feature027 0.03803690 -0.0062658529 -0.42615263 0.6725301 0.9492560
#> Feature028 0.13833935 -0.0121808332 -0.47519415 0.6375148 0.9492560
#> Feature030 0.02536260 -0.0097179260 -0.64201934 0.5249219 0.9492560
#> Feature031 0.02598174 -0.0037432849 -0.28388067 0.7781257 0.9492560
Metabase provides some quick visualization functions. Metabase’s plot functions are mostly built on top of the ggmetaplots
package. Installation from github is required.
devtools::install_github("zhuchcn/ggmetaplots")
plot_boxplot(mset, x = "Timepoint", feature = "Feature025")
#> Loading required namespace: ggmetaplots
plot_boxplot(mset, x = "Timepoint", feature = "Feature025", cols = "Treatment")
mset$sample_table$Subject = factor(mset$sample_table$Subject)
plot_boxplot(mset, x = "Timepoint", feature = "Feature025", cols = "Treatment",
color = "Subject")
plot_boxplot(mset, x = "Timepoint", feature = "Feature025", cols = "Treatment",
color = "Subject", line = "Subject")
plot_median_hist(mset)
plot_normality_boxplot(subset_features(mset, 1:30))
plot_normality_ridges(subset_features(mset, 1:30))
#> Loading required namespace: ggridges
#> Picking joint bandwidth of 0.356