Contents

0.1 Introduction

Metabolomics are powerful experiment approaches that analyzes all, or a specific class, of small molecule compounds in a biological specimen. Lipidomics is a special case of Metabolomics that only focus on the lipids. Metabolomics and lipidomics are now being adopted many different type of studies with different purposes, as a tool to answer different scientific questions. Metabolomics and lipidomics often generates big data with feature numbers very from hundreds to thousands. The package Metabase aims to help researchers handle, analyze and visualize metabolomics and lipidomics data. This vignettes is a workflow of how import and process the lipidomics data reported by the West Coast Metabolomics Center, CA.

0.2 Install Metabase

The Metabase package can be installed from github

devtools::install_github("zhuchcn/Metabase")

0.3 Data

The data used in this vignette is a lipidomics experiment from a cross-over nutritional intervention study. 10 subjects were given 2 dietary treatment, and blood samples were taken before and after each treatment. The sample data comes in the package. If you would like to explore the sample data on your own, you can either copy them to your own directory, or download them using the link (right click and choose save link as):

[Download sample data] [Download standards]

0.4 Data import

0.4.1 Read data

Data import can be done using the function import_wcmc_excel_raw().

Arguments: + file: File path to the excel spreadsheet. + sheet: The sheet name with data. + conc_range: The cell range for concentration data. + sample_range: The cell range for sample data. + feature_range: The cell range for feature data. + InChIKey: The name of the column with InChI Key. The InChI Key is neccesary for later process and analysis.

library(Metabase)
file = file.path(system.file("extdata", package = "Metabase"), "CSH-QTOF_lipidomics.xlsx")
mset = import_wcmc_excel(
    file            = file, 
    sheet           = "Submit",
    conc_range      = "I10:BA613",
    sample_range    = "H1:BA9",
    feature_range   = "A9:H613",
    InChIKey        = "InChI Key",
    experiment_type = "Lipidomics"
)
mset
##              >>>>>> Metabolomics Experiment <<<<<<
## conc_table():           [  45 samples  and 604 features          ]
## sample_table():         [  45 samples  and   8 sample  variables ]
## feature_data():         [ 604 features and   8 feature variables ]
## experiment_data():      [   9 experiment attributes              ]
cat("Number of features:", nfeatures(mset),
    "\nNumber of samples: ", nsamples(mset))
## Number of features: 604 
## Number of samples:  45

0.4.2 Remove features

The dataset has 604 features. However 364 of them are unknow.

cat("Annotated:", sum(!is.na(feature_data(mset)$InChIKey)),
    "\nUnknown:  ", sum(is.na(feature_data(mset)$InChIKey)))
## Annotated: 240 
## Unknown:   364

The unknown features can be removed using the subset_features function

mset = subset_features(mset, !is.na(feature_data(mset)$InChIKey))

0.4.3 Collapse QC samples

The first 5 samples are quality control samples.

head(sample_table(mset), 8)
##           Sample # Species  Organ Treatment Timepoint Subject
## Biorec001       QC   Human Plasma      <NA>      <NA>    <NA>
## Biorec002       QC   Human Plasma      <NA>      <NA>    <NA>
## Biorec003       QC   Human Plasma      <NA>      <NA>    <NA>
## Biorec004       QC   Human Plasma      <NA>      <NA>    <NA>
## Biorec005       QC   Human Plasma      <NA>      <NA>    <NA>
## MD101A           1   Human Plasma       EXP       Pre     101
## MD101B           2   Human Plasma       EXP      Post     101
## MD101C           3   Human Plasma       CTL       Pre     101
##                                 File ID          RT
## Biorec001  Biorec001_posCSH_preZhu001.d Peak Height
## Biorec002 Biorec002_posCSH_postZhu010.d Peak Height
## Biorec003 Biorec003_posCSH_postZhu020.d Peak Height
## Biorec004 Biorec004_posCSH_postZhu030.d Peak Height
## Biorec005 Biorec005_posCSH_postZhu040.d Peak Height
## MD101A    Zhu027_posCSH_FFS-107-A_001.d Peak Height
## MD101B    Zhu008_posCSH_FFS-107-B_002.d Peak Height
## MD101C    Zhu040_posCSH_FFS-107-C_003.d Peak Height

To call the collapse_QC function, the mean, standard deviation, and coefficient of variance for each sample are be calculated and stored in the feature_data slot. And then the QC samples are be remove.

mset = collapse_QC(mset, qc_names = paste0("Biorec00", 1:5))
mset
##              >>>>>> Metabolomics Experiment <<<<<<
## conc_table():           [  40 samples  and 240 features          ]
## sample_table():         [  40 samples  and   8 sample  variables ]
## feature_data():         [ 240 features and  11 feature variables ]
## experiment_data():      [   9 experiment attributes              ]

0.4.4 Filling NAs

To see the missing values in the dataset, we can call the plot_hist_NA function. In this dataset, there are 3 features have 2 missing values, and 1 with 12.

plot_hist_NA(mset)

We can remove features with too many missing values using subset_features. The call below will keep all features that have less than 5 missing values.

mset = subset_features(
    mset, apply(conc_table(mset), 1, function(x) sum(is.na(x)) < 5) )

Then we can fill up the rest NAs using half of the lowest value of this feature in all samples.

mset = transform_by_feature(
    mset, function(x) ifelse(is.na(x), min(x, na.rm = TRUE)/2, x)
)

0.4.5 Calculate concentration using internal standards

The west coast metabolomics center’s lipidomics assay spikes 15 complex lipid internal standards into each sample during analysis. Each of the 15 complex lipid internal standards represents a lipid class. To see the spike information, load the following csv file.

file = file.path(system.file("extdata", package = "Metabase"), "wcmc_lipidomics_standards.csv")
internal_standards = read.csv(file)
head(internal_standards)
##                               standard                    InChIKey
## 1 22:1 Cholesterol Ester (Add to MTBE) SQHUGNAFKZZXOT-JWTURFAQSA-N
## 2                       PE (17:0/17:0) YSFFAUPDXKTJMR-DIPNUNPCSA-N
## 3                       PG (17:0/17:0) ZBVHXVKEMAIWQQ-QPPIDDCLSA-N
## 4                        PC (17:0/0:0) SRRQPVVYXBTRQK-XMMPIXPASA-N
## 5                      C17 Sphingosine RBEJCQPPFCKTRZ-LHMZYYNSSA-N
## 6                         C17 Ceramide ICWGMOFDULMCFL-QKSCFGQVSA-N
##         class   spike_amt spike_unit
## 1          CE 16.36363636         ug
## 2          PE  0.36818182         ug
## 3          PG  1.47272727         ug
## 4         LPC  0.24545454         ug
## 5 Sphingosine  0.05454545         ug
## 6         Cer  0.12272727         ug

Before the features can be calibrated to the internal standards, we need to first assign the lipid calss to each annotated feature. The function assign_lipid_class takes the annotation name and assign a lipid class to it. We also need to specify the ESI mode.

feature_data(mset)$class = assign_lipid_class(feature_data(mset)$Annotation)
feature_data(mset)$ESI = ifelse(grepl("\\+$", feature_data(mset)$Species),
                                "pos", "neg")

Next, the internal standard data can be addded to the experiment_data slot of the data object.

experiment_data(mset)$institute = "West Coast Metabolomics Center"
experiment_data(mset)$sample_volumn_ul = 20
experiment_data(mset)$internal_standards = internal_standards
experiment_data(mset)
##           >>>>>> Lipidomics Experiment Data <<<<<<
## 
## Institute: West Coast Metabolomics Center
## Experiment Type: lipidomics
## Sample Volumn Ul: 20
## Internal Standards: a 16 by 5 data frame

Then the features can be calibrated.

mset = calibrate_lipidomics_wcmc(mset, cid = "InChIKey", 
                                 class = "class", ESI = "ESI")
mset
##              >>>>>> Metabolomics Experiment <<<<<<
## conc_table():           [  40 samples  and 215 features          ]
## sample_table():         [  40 samples  and   8 sample  variables ]
## feature_data():         [ 215 features and  13 feature variables ]
## experiment_data():      [   9 experiment attributes              ]

0.4.6 Filter by CV

The last problem is, some features are detected under both positive and negative ESI mode. This could be problomatic for example for some multi-variable analysis. We can only keep whichever with a lower CV.

mset = filter_by_cv(mset, cv = "qc_cv", cid = "InChIKey")
mset
##              >>>>>> Metabolomics Experiment <<<<<<
## conc_table():           [  40 samples  and 170 features          ]
## sample_table():         [  40 samples  and   8 sample  variables ]
## feature_data():         [ 170 features and  14 feature variables ]
## experiment_data():      [   9 experiment attributes              ]

The data now is cleaned up and is ready for any statistic analysis and visualization.