Metabolomics are powerful experiment approaches that analyzes all, or a specific class, of small molecule compounds in a biological specimen. Lipidomics is a special case of Metabolomics that only focus on the lipids. Metabolomics and lipidomics are now being adopted many different type of studies with different purposes, as a tool to answer different scientific questions. Metabolomics and lipidomics often generates big data with feature numbers very from hundreds to thousands. The package Metabase aims to help researchers handle, analyze and visualize metabolomics and lipidomics data. This vignettes is a workflow of how import and process the lipidomics data reported by the West Coast Metabolomics Center, CA.
The Metabase package can be installed from github
devtools::install_github("zhuchcn/Metabase")
The data used in this vignette is a lipidomics experiment from a cross-over nutritional intervention study. 10 subjects were given 2 dietary treatment, and blood samples were taken before and after each treatment. The sample data comes in the package. If you would like to explore the sample data on your own, you can either copy them to your own directory, or download them using the link (right click and choose save link as):
Data import can be done using the function import_wcmc_excel_raw().
Arguments: + file: File path to the excel spreadsheet. + sheet: The sheet name with data. + conc_range: The cell range for concentration data. + sample_range: The cell range for sample data. + feature_range: The cell range for feature data. + InChIKey: The name of the column with InChI Key. The InChI Key is neccesary for later process and analysis.
library(Metabase)
file = file.path(system.file("extdata", package = "Metabase"), "CSH-QTOF_lipidomics.xlsx")
mset = import_wcmc_excel(
file = file,
sheet = "Submit",
conc_range = "I10:BA613",
sample_range = "H1:BA9",
feature_range = "A9:H613",
InChIKey = "InChI Key",
experiment_type = "Lipidomics"
)
mset
## >>>>>> Metabolomics Experiment <<<<<<
## conc_table(): [ 45 samples and 604 features ]
## sample_table(): [ 45 samples and 8 sample variables ]
## feature_data(): [ 604 features and 8 feature variables ]
## experiment_data(): [ 9 experiment attributes ]
cat("Number of features:", nfeatures(mset),
"\nNumber of samples: ", nsamples(mset))
## Number of features: 604
## Number of samples: 45
The dataset has 604 features. However 364 of them are unknow.
cat("Annotated:", sum(!is.na(feature_data(mset)$InChIKey)),
"\nUnknown: ", sum(is.na(feature_data(mset)$InChIKey)))
## Annotated: 240
## Unknown: 364
The unknown features can be removed using the subset_features function
mset = subset_features(mset, !is.na(feature_data(mset)$InChIKey))
The first 5 samples are quality control samples.
head(sample_table(mset), 8)
## Sample # Species Organ Treatment Timepoint Subject
## Biorec001 QC Human Plasma <NA> <NA> <NA>
## Biorec002 QC Human Plasma <NA> <NA> <NA>
## Biorec003 QC Human Plasma <NA> <NA> <NA>
## Biorec004 QC Human Plasma <NA> <NA> <NA>
## Biorec005 QC Human Plasma <NA> <NA> <NA>
## MD101A 1 Human Plasma EXP Pre 101
## MD101B 2 Human Plasma EXP Post 101
## MD101C 3 Human Plasma CTL Pre 101
## File ID RT
## Biorec001 Biorec001_posCSH_preZhu001.d Peak Height
## Biorec002 Biorec002_posCSH_postZhu010.d Peak Height
## Biorec003 Biorec003_posCSH_postZhu020.d Peak Height
## Biorec004 Biorec004_posCSH_postZhu030.d Peak Height
## Biorec005 Biorec005_posCSH_postZhu040.d Peak Height
## MD101A Zhu027_posCSH_FFS-107-A_001.d Peak Height
## MD101B Zhu008_posCSH_FFS-107-B_002.d Peak Height
## MD101C Zhu040_posCSH_FFS-107-C_003.d Peak Height
To call the collapse_QC function, the mean, standard deviation, and coefficient of variance for each sample are be calculated and stored in the feature_data slot. And then the QC samples are be remove.
mset = collapse_QC(mset, qc_names = paste0("Biorec00", 1:5))
mset
## >>>>>> Metabolomics Experiment <<<<<<
## conc_table(): [ 40 samples and 240 features ]
## sample_table(): [ 40 samples and 8 sample variables ]
## feature_data(): [ 240 features and 11 feature variables ]
## experiment_data(): [ 9 experiment attributes ]
To see the missing values in the dataset, we can call the plot_hist_NA function. In this dataset, there are 3 features have 2 missing values, and 1 with 12.
plot_hist_NA(mset)
We can remove features with too many missing values using subset_features. The call below will keep all features that have less than 5 missing values.
mset = subset_features(
mset, apply(conc_table(mset), 1, function(x) sum(is.na(x)) < 5) )
Then we can fill up the rest NAs using half of the lowest value of this feature in all samples.
mset = transform_by_feature(
mset, function(x) ifelse(is.na(x), min(x, na.rm = TRUE)/2, x)
)
The west coast metabolomics center’s lipidomics assay spikes 15 complex lipid internal standards into each sample during analysis. Each of the 15 complex lipid internal standards represents a lipid class. To see the spike information, load the following csv file.
file = file.path(system.file("extdata", package = "Metabase"), "wcmc_lipidomics_standards.csv")
internal_standards = read.csv(file)
head(internal_standards)
## standard InChIKey
## 1 22:1 Cholesterol Ester (Add to MTBE) SQHUGNAFKZZXOT-JWTURFAQSA-N
## 2 PE (17:0/17:0) YSFFAUPDXKTJMR-DIPNUNPCSA-N
## 3 PG (17:0/17:0) ZBVHXVKEMAIWQQ-QPPIDDCLSA-N
## 4 PC (17:0/0:0) SRRQPVVYXBTRQK-XMMPIXPASA-N
## 5 C17 Sphingosine RBEJCQPPFCKTRZ-LHMZYYNSSA-N
## 6 C17 Ceramide ICWGMOFDULMCFL-QKSCFGQVSA-N
## class spike_amt spike_unit
## 1 CE 16.36363636 ug
## 2 PE 0.36818182 ug
## 3 PG 1.47272727 ug
## 4 LPC 0.24545454 ug
## 5 Sphingosine 0.05454545 ug
## 6 Cer 0.12272727 ug
Before the features can be calibrated to the internal standards, we need to first assign the lipid calss to each annotated feature. The function assign_lipid_class takes the annotation name and assign a lipid class to it. We also need to specify the ESI mode.
feature_data(mset)$class = assign_lipid_class(feature_data(mset)$Annotation)
feature_data(mset)$ESI = ifelse(grepl("\\+$", feature_data(mset)$Species),
"pos", "neg")
Next, the internal standard data can be addded to the experiment_data slot of the data object.
experiment_data(mset)$institute = "West Coast Metabolomics Center"
experiment_data(mset)$sample_volumn_ul = 20
experiment_data(mset)$internal_standards = internal_standards
experiment_data(mset)
## >>>>>> Lipidomics Experiment Data <<<<<<
##
## Institute: West Coast Metabolomics Center
## Experiment Type: lipidomics
## Sample Volumn Ul: 20
## Internal Standards: a 16 by 5 data frame
Then the features can be calibrated.
mset = calibrate_lipidomics_wcmc(mset, cid = "InChIKey",
class = "class", ESI = "ESI")
mset
## >>>>>> Metabolomics Experiment <<<<<<
## conc_table(): [ 40 samples and 215 features ]
## sample_table(): [ 40 samples and 8 sample variables ]
## feature_data(): [ 215 features and 13 feature variables ]
## experiment_data(): [ 9 experiment attributes ]
The last problem is, some features are detected under both positive and negative ESI mode. This could be problomatic for example for some multi-variable analysis. We can only keep whichever with a lower CV.
mset = filter_by_cv(mset, cv = "qc_cv", cid = "InChIKey")
mset
## >>>>>> Metabolomics Experiment <<<<<<
## conc_table(): [ 40 samples and 170 features ]
## sample_table(): [ 40 samples and 8 sample variables ]
## feature_data(): [ 170 features and 14 feature variables ]
## experiment_data(): [ 9 experiment attributes ]
The data now is cleaned up and is ready for any statistic analysis and visualization.