1 Data structure of high through-put experiments

High-throughput ‘omics’ studies such as metabolomics, lipidomics, protomics and microbiome usually share a very similar data structure when they are applied in a specific population to answer specific questions. A study with n samples and m features usually contains three portions of the data. A numeric $m \times n$ matrix that each number represents the concentration/abundance of the $j^{th}$ feature in $i^{th}$ sample, a $n \times k$ table of the sample information, and a $m \times l$ table of the feature information. In this package and in the rest of this vignette, the three parts of the dataset are referred as conc_table, sample_table, and feature_data.

Feature is a generalized term. It can be referred to metabolites in a metabolomics study, lipids in a lipidomcis study, proteins/peptides in a proteomics study, and OTUs, genera, or phylums in a microbiome study. The sample_table usually contains the study design information, such as treatment group, dosage, age, and gender. While the feature_data contains the additional information of each feature. In a metabolomics/lipidomics study, it may contain the molecule weight, m/z value, ion mode, annotation(metabolite name), and chemical CID such as PubChem CID. In a proteomics study, it may contain the protein name, sequence, gene name, database ID such as emsembl ID. In a microbiome study, the feature_table contains the taxonomy labels in each level from kingdom to species.

The Metabase package is inspired by the bioconductor package Biobase and phyloseq. The former is the infrastructural package of bioconductor while the latter is a package designated for microbiome data analysis. The Metabase package takes the data structure design of the two package, with an emphasis on the data analysis, manipulation, hypothesis testing and visualization.

2 Classes

2.1 mSet-class and its children

Metabase currently has four main data container classes designed for experiments with specific purposes, MetabolomicsSet, LipidomicsSet, ProteomicsSet, MicrobiomeSet, and a generic class MultxSet. The MultxSet can be used as a container for general experiment data such as dietary data, clinical values, anthropometric data, or ELISA data. All the classes above inherits from a virtual class called mSet. The mSet is a template for other classes so it is not directly callable. All mSet methods can be applied to each of its child classes. However, a method defined in any of the child classes cannot be applied to its siblings. The mSet class and its child classes must include four data slots: conc_table, sample_table, feature_data, and experiment_data. The conc_table slot must be a conc_table-class, the sample_table slot must be a sample_table-class, the feature_data slot must be a feature_data-class and the experiment_data slot must be a experiment_data-class.

2.2 conc_table-class

The conc_table-class inherits from the native R class matrix-class with some constrains. A conc_table-class object must have samples in the columns and features in rows (i.e. each column being a subject and each row being a metabolite). Row and column names are required. The function conc_table is the constructor.

The sample data files used in this vignette come with the package. If you would like to explore the sample data on your own, you can either copy them to your own directory, or download them using the links (right click and choose ‘save link as’):

[lipid_conc_table.csv] [lipid_sample_table.csv] [lipid_feature_data.csv]

library(Metabase)
path = file.path(system.file("extdata", package = "Metabase"), "lipid_conc_table.csv")
conc = read.csv(path, row.names = 1, header = TRUE)
conc = as.matrix(conc)    # conc_table must be a numeric matrix
is.numeric(conc)
#> [1] TRUE

conc_table = conc_table(conc)
conc_table
#>                  >>>>>> Concentration Table <<<<<<
#> conc_table()                       [ 170 features and  40 samples ]
#> 
#> Showing the first 8 samples and 6 features...
#> 
#>                 MD101A      MD101B      MD101C      MD101D     MD102A
#> Feature025 0.069313029 0.056075935 0.066063603 0.060227513 0.08757163
#> Feature026 0.011920550 0.009409271 0.009554416 0.009472654 0.00940928
#> Feature027 0.022202024 0.020325450 0.023517676 0.019909233 0.02439995
#> Feature028 0.069511705 0.070753449 0.092010172 0.071973128 0.10247062
#> Feature030 0.007313754 0.014001186 0.013560363 0.003070282 0.02009070
#> Feature031 0.015347708 0.015045342 0.020075784 0.015482315 0.02059498
#> Feature032 0.020923048 0.023624034 0.025750876 0.021087365 0.03024493
#> Feature033 0.014043897 0.005315704 0.013594897 0.013673466 0.02171813
#>                 MD102B
#> Feature025 0.055060821
#> Feature026 0.010141534
#> Feature027 0.018349653
#> Feature028 0.061691368
#> Feature030 0.013083172
#> Feature031 0.017104202
#> Feature032 0.005396957
#> Feature033 0.013688106
#> 
#> ...  34 more samples
#> ... 162 more features

2.3 sample_table-class

The sample_table-class inherits from the native R class data.frame-class with some constrains. A sample_table-class object must have samples in the rows and sample variables in the columns. Row and column names are required. When a sample_table-class object is used to construct a mSet-class child class object, the number of rows must be equal to the number of columns of the conc_table-class object, and the row names of the sample_table must be equal to the column names of the conc_table. The function sample_table is the constructor.

The number of feature names and of sample names are required to be equal. This is a way of keeping samples and features consistant during further data manipulations including data transformation and summarization.

path = file.path(system.file("extdata", package = "Metabase"), "lipid_sample_table.csv")
samp = read.csv(path, row.names = 1, header = TRUE)
is.data.frame(samp)
#> [1] TRUE

sample_table = sample_table(samp)
show(sample_table)
#>                    >>>>>> Sample Table <<<<<<
#> sample_table()              [  40 samples and   8 sample variables ]
#> 
#> Showing the first 6 samples and 8 sample variables...
#> 
#>        Sample.. Species  Organ Treatment Timepoint Subject
#> MD101A        1   Human Plasma       EXP       Pre     101
#> MD101B        2   Human Plasma       EXP      Post     101
#> MD101C        3   Human Plasma       CTL       Pre     101
#> MD101D        4   Human Plasma       CTL      Post     101
#> MD102A        5   Human Plasma       EXP       Pre     102
#> MD102B        6   Human Plasma       EXP      Post     102
#> MD102C        7   Human Plasma       CTL       Pre     102
#> MD102D        8   Human Plasma       CTL      Post     102
#> 
#> ...  32 more samples
#> ...   2 more sample variables

2.4 feature_data-class

The feature_data-class inherits from the native R class data.frame-class with some constrains. A feature_data-class object must have features in the rows and feature variables in the columns. Row and column names are required. When a feature_data-class object is used to construct a mSet-class child class object, the number of rows must be equal to the number of rows of the conc_table-class object, and the row names of the sample_table must be equal to the row names of the conc_table. The function feature_data is the constructor.

path = file.path(system.file("extdata", package = "Metabase"), "lipid_feature_data.csv")
feat = read.csv(path, row.names = 1, header = TRUE)
is.data.frame(feat)
#> [1] TRUE

feature_data = feature_data(feat)
show(feature_data)
#>                       >>>>>> Feature Data <<<<<<
#> feature_data()          [ 170 features and  14 feature variables ]
#> 
#> Showing the first 8 samples and 6 features...
#> 
#>                         Identifier         Annotation
#> Feature025 5.78_572.48_5.76_596.53   Ceramide (d34:1)
#> Feature026             7.18_628.55   Ceramide (d38:1)
#> Feature027             8.18_670.59   Ceramide (d41:1)
#> Feature028 8.50_684.61_8.50_708.65   Ceramide (d42:1)
#> Feature030             7.91_682.59 Ceramide (d42:2) B
#> Feature031             7.18_818.63     GlcCer (d40:1)
#> Feature032             7.83_846.66     GlcCer (d42:1)
#> Feature033             7.16_844.65     GlcCer (d42:2)
#>                                InChIKey            Species count ESI.mode
#> Feature025  YDNKGFDKKRUKPY-TURZORIXSA-N [M+Cl]-_[M+Hac-H]-    40  (-) ES)
#> Feature026  XWBWIAOWSABHFI-NUKVNZTCSA-N            [M+Cl]-    40  (-) ES)
#> Feature027 NAJHAHQNQCNWOP-PUYNVXOJSA-N?            [M+Cl]-    40  (-) ES)
#> Feature028  ZJVVOYPTFQEGPH-AUTSUKAISA-N            [M+Cl]-    40  (-) ES)
#> Feature030  VJSBNBBOSZJDKB-KPEYJIHVSA-N            [M+Cl]-    38  (-) ES)
#> Feature031  YIGARKIIFOHVPF-YAEABVQUSA-N            [M+Cl]-    40  (-) ES)
#> Feature032  POQRWMRXUOPCLD-AAAFHMTMSA-N            [M+Cl]-    40  (-) ES)
#> Feature033  WBOZIXHPUPAOIA-RBELZSLISA-N            [M+Cl]-    40  (-) ES)
#> 
#> ...   8 more features
#> ... 162 more feature variables

2.5 experiment_data-class

The experiment_data-class holds any additional experiment data.

3 Data manipulation capability

3.1 Constructor

The name of each mSet data container class is its own constructor. The conc_table slot is required, while the sample_table, feature_data and experiment_data are optional. Notice if the sample names of conc_table and sample_table, or the feature names of conc_table and feature_data don’t match, the construction will fail.

mset = MetabolomicsSet(
    conc_table = conc_table,
    sample_table = sample_table,
    feature_data = feature_data
)
mset
#>              >>>>>> Metabolomics Experiment <<<<<<
#> conc_table():           [  40 samples  and 170 features          ]
#> sample_table():         [  40 samples  and   8 sample  variables ]
#> feature_data():         [ 170 features and  14 feature variables ]

3.2 Slot assessors

Slot can be assessed by using the slot name as a function, using the $ function, or using the [[ function. The slots can also be re-assigned using <- or =.

conc_table(mset)
#>                  >>>>>> Concentration Table <<<<<<
#> conc_table()                       [ 170 features and  40 samples ]
#> 
#> Showing the first 8 samples and 6 features...
#> 
#>                 MD101A      MD101B      MD101C      MD101D     MD102A
#> Feature025 0.069313029 0.056075935 0.066063603 0.060227513 0.08757163
#> Feature026 0.011920550 0.009409271 0.009554416 0.009472654 0.00940928
#> Feature027 0.022202024 0.020325450 0.023517676 0.019909233 0.02439995
#> Feature028 0.069511705 0.070753449 0.092010172 0.071973128 0.10247062
#> Feature030 0.007313754 0.014001186 0.013560363 0.003070282 0.02009070
#> Feature031 0.015347708 0.015045342 0.020075784 0.015482315 0.02059498
#> Feature032 0.020923048 0.023624034 0.025750876 0.021087365 0.03024493
#> Feature033 0.014043897 0.005315704 0.013594897 0.013673466 0.02171813
#>                 MD102B
#> Feature025 0.055060821
#> Feature026 0.010141534
#> Feature027 0.018349653
#> Feature028 0.061691368
#> Feature030 0.013083172
#> Feature031 0.017104202
#> Feature032 0.005396957
#> Feature033 0.013688106
#> 
#> ...  34 more samples
#> ... 162 more features
# sample_table(mset)
# feature_data(mset)

show(mset$sample_table)
#>                    >>>>>> Sample Table <<<<<<
#> sample_table()              [  40 samples and   8 sample variables ]
#> 
#> Showing the first 6 samples and 8 sample variables...
#> 
#>        Sample.. Species  Organ Treatment Timepoint Subject
#> MD101A        1   Human Plasma       EXP       Pre     101
#> MD101B        2   Human Plasma       EXP      Post     101
#> MD101C        3   Human Plasma       CTL       Pre     101
#> MD101D        4   Human Plasma       CTL      Post     101
#> MD102A        5   Human Plasma       EXP       Pre     102
#> MD102B        6   Human Plasma       EXP      Post     102
#> MD102C        7   Human Plasma       CTL       Pre     102
#> MD102D        8   Human Plasma       CTL      Post     102
#> 
#> ...  32 more samples
#> ...   2 more sample variables
# mset$conc_table
# mset$feature_data

show(mset[['feature_data']])
#>                       >>>>>> Feature Data <<<<<<
#> feature_data()          [ 170 features and  14 feature variables ]
#> 
#> Showing the first 8 samples and 6 features...
#> 
#>                         Identifier         Annotation
#> Feature025 5.78_572.48_5.76_596.53   Ceramide (d34:1)
#> Feature026             7.18_628.55   Ceramide (d38:1)
#> Feature027             8.18_670.59   Ceramide (d41:1)
#> Feature028 8.50_684.61_8.50_708.65   Ceramide (d42:1)
#> Feature030             7.91_682.59 Ceramide (d42:2) B
#> Feature031             7.18_818.63     GlcCer (d40:1)
#> Feature032             7.83_846.66     GlcCer (d42:1)
#> Feature033             7.16_844.65     GlcCer (d42:2)
#>                                InChIKey            Species count ESI.mode
#> Feature025  YDNKGFDKKRUKPY-TURZORIXSA-N [M+Cl]-_[M+Hac-H]-    40  (-) ES)
#> Feature026  XWBWIAOWSABHFI-NUKVNZTCSA-N            [M+Cl]-    40  (-) ES)
#> Feature027 NAJHAHQNQCNWOP-PUYNVXOJSA-N?            [M+Cl]-    40  (-) ES)
#> Feature028  ZJVVOYPTFQEGPH-AUTSUKAISA-N            [M+Cl]-    40  (-) ES)
#> Feature030  VJSBNBBOSZJDKB-KPEYJIHVSA-N            [M+Cl]-    38  (-) ES)
#> Feature031  YIGARKIIFOHVPF-YAEABVQUSA-N            [M+Cl]-    40  (-) ES)
#> Feature032  POQRWMRXUOPCLD-AAAFHMTMSA-N            [M+Cl]-    40  (-) ES)
#> Feature033  WBOZIXHPUPAOIA-RBELZSLISA-N            [M+Cl]-    40  (-) ES)
#> 
#> ...   8 more features
#> ... 162 more feature variables
# mset[['conc_table']]
# mset[['sample_data']]

3.3 Data manipulation

The function nsamples and nfeatures are used to get the number of samples and features of any mSet class data container.

nsamples(mset)
#> [1] 40

nfeatures(mset)
#> [1] 170

Function sampleNames and featureNames are used to get the sample names and feature names of any mSet class data container.

head(sampleNames(mset), 10)
#>  [1] "MD101A" "MD101B" "MD101C" "MD101D" "MD102A" "MD102B" "MD102C" "MD102D"
#>  [9] "MD103A" "MD103B"

head(featureNames(mset), 10)
#>  [1] "Feature025" "Feature026" "Feature027" "Feature028" "Feature030"
#>  [6] "Feature031" "Feature032" "Feature033" "Feature040" "Feature043"

Function subset_samples and subset_features are used to subset the mSet class data container by specifying either the indices, sampleNames/featureNames, or a boolean (TRUE or FALSE) value.

mset2 = subset_samples(mset, 1:10)
mset2
#>              >>>>>> Metabolomics Experiment <<<<<<
#> conc_table():           [  10 samples  and 170 features          ]
#> sample_table():         [  10 samples  and   8 sample  variables ]
#> feature_data():         [ 170 features and  14 feature variables ]

mset2 = subset_samples(mset, c("MD101A", "MD101B", "MD102A", "MD102B"))
mset2
#>              >>>>>> Metabolomics Experiment <<<<<<
#> conc_table():           [   4 samples  and 170 features          ]
#> sample_table():         [   4 samples  and   8 sample  variables ]
#> feature_data():         [ 170 features and  14 feature variables ]

mset2 = subset_samples(mset, mset$sample_table$Timepoint == "Pre")
mset2
#>              >>>>>> Metabolomics Experiment <<<<<<
#> conc_table():           [  20 samples  and 170 features          ]
#> sample_table():         [  20 samples  and   8 sample  variables ]
#> feature_data():         [ 170 features and  14 feature variables ]

Function transform_by_sample and transform_by_feature are used to transform the conc_table slot into different margins. The former apply each sample at a time, while the latter apply each feature at a time. The second argument of the two functions is required for returning an altered array with the same dimensions as the original one. They are similar to the apply function when the MARGIN parameter is set to 1 (for transform_by_feature) and 2 (for transform_by_sample).

mset2 = transform_by_sample(mset, scale)

library(ggplot2); library(dplyr)
cowplot::plot_grid(
    data.frame(mean = colMeans(mset2$conc_table)) %>%
        ggplot(aes(x = mean)) +
        geom_histogram(bins = 20),
    data.frame(mean = rowMeans(mset2$conc_table)) %>%
        ggplot(aes(x = mean)) +
        geom_histogram(bins = 30),
    ncol = 2
)

The figure above shows that by using the scale function. Each sample is scaled to 0 using transform_by_sample. Each feature is scaled to 0 using transform_by_feature.

The summarize_features function will aggregate the conc_table according to a given feature variable. The example below will add all features with the sample lipid class together. The second argument must appear in the column names of the feature_data slot.

unique(mset$feature_data$class)
#> [1] Cer         PC          PE          SM          CE          Cholesterol
#> [7] DG          LPC         TG         
#> Levels: CE Cer Cholesterol DG LPC PC PE SM TG

mset2 = summarize_features(mset, "class")
mset2
#>              >>>>>> Metabolomics Experiment <<<<<<
#> conc_table():           [  40 samples  and   9 features          ]
#> sample_table():         [  40 samples  and   8 sample  variables ]

4 Linear Model

Metabase provides a quick solution for linear model testing using the limma package.

design = model.matrix(
    ~ Treatment * Timepoint + Subject,
    data = as(mset$sample_table, "data.frame")
)
colnames(design)
#> [1] "(Intercept)"               "TreatmentEXP"             
#> [3] "TimepointPre"              "Subject"                  
#> [5] "TreatmentEXP:TimepointPre"
lm = mSet_limma(mset, design, coef = "TreatmentEXP:TimepointPre", 
                p.value = "TreatmentEXP:TimepointPre")
head(lm)
#>              baseMean         logFC        stat    pvalue      padj
#> Feature025 0.10973033  0.0070850511  0.39690010 0.6937808 0.9492560
#> Feature026 0.01642961 -0.0005516065 -0.04258577 0.9662667 0.9836248
#> Feature027 0.03803690 -0.0062658529 -0.42615263 0.6725301 0.9492560
#> Feature028 0.13833935 -0.0121808332 -0.47519415 0.6375148 0.9492560
#> Feature030 0.02536260 -0.0097179260 -0.64201934 0.5249219 0.9492560
#> Feature031 0.02598174 -0.0037432849 -0.28388067 0.7781257 0.9492560

5 Visualization

Metabase provides some quick visualization functions. Metabase’s plot functions are mostly built on top of the ggmetaplots package. Installation from github is required.

devtools::install_github("zhuchcn/ggmetaplots")

5.1 Box plot

plot_boxplot(mset, x = "Timepoint", feature = "Feature025")
#> Loading required namespace: ggmetaplots

plot_boxplot(mset, x = "Timepoint", feature = "Feature025", cols = "Treatment")

mset$sample_table$Subject = factor(mset$sample_table$Subject)
plot_boxplot(mset, x = "Timepoint", feature = "Feature025", cols = "Treatment",
             color = "Subject")

plot_boxplot(mset, x = "Timepoint", feature = "Feature025", cols = "Treatment",
             color = "Subject", line = "Subject")

5.2 Median Histogram

plot_median_hist(mset)

5.3 Normality boxplot

plot_normality_boxplot(subset_features(mset, 1:30))

5.4 Normality ridges plot

plot_normality_ridges(subset_features(mset, 1:30))
#> Loading required namespace: ggridges
#> Picking joint bandwidth of 0.356

Introduction to Metabase

2019-03-26

Contents