Introduction to High-Throughput Experiment Data Analysis

07/25/2018

High-throughput experiments

A high-throughput experiment (HTE) uses approaches such as sequencing technology or mass spectrometry-based analytical chemistry methods to study the quality or quantity of a large number of features, such as nucleic acid sequences, proteins, or small molecules, in biological or environmental specimens or in study subjects.

Biology and Data Science

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms.

Figure: Bioinformatics Venn diagram

High-Throughput Experiment Data

The starting point of a high-throughput experiment is usually raw instrument data, such as sequencing reads or mass spectra.

FASTQ files for microbiome, metagenomics, RNA-seq, etc.

mzML/mzXML MS files for proteomics (and maybe metabolomics)
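
To make the FASTQ format concrete, here is a minimal sketch of a reader. It assumes the common unwrapped layout of four lines per read (header starting with "@", sequence, separator starting with "+", quality string). The names FastqRecord and read_fastq and the two example reads are made up for illustration and do not come from any tool named in these slides.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: identifier, base calls, and per-base quality string."""
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads, used only for illustration.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
for record in records:
    print(record.read_id, len(record.sequence))
```

Real pipelines would normally rely on an established parser and handle compressed files; the sketch only shows what a single record contains.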

Data Processing and Analysis

Data processing and analysis should be standardized, clear, and reproducible.

Data processing for high-throughput experiments depends on the assay or experiment, not on the study. The data processing procedure does not change with the study design.

Proteomics: X!Tandem

16S-seq: fastq-multx, cutadapt/trimmomatic/fastx-toolkit, DADA2

After data processing generates the tabular data, the data analysis step is study-dependent. Different statistical models and visualization approaches can be used based on the study design.

Basic Principles of Data Science

Principle of tidy data

Reproducibility

Why R

Version Control (git)

Data structure of high-throughput experiment data

Tidy Data

Each measured variable should be in one column.
Each different observation of that variable should be in a different row.
There should be one table for each "kind" of variable.
If you have multiple tables, they should include a column in the table that allows them to be linked.
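
The four rules above can be sketched with a small made-up example. The code below assumes a "messy" table with one row per lipid and one column per sample, reshapes it so that each row holds a single observation, and then links it to a separate sample table through a shared "sample" column. All feature names, sample names, values, and the helper functions to_tidy and link are hypothetical.

```python
# Hypothetical "messy" table: one row per feature, one column per sample.
messy = [
    {"feature": "PC 34:1", "sample_A": 12.5, "sample_B": 14.0},
    {"feature": "TG 52:2", "sample_A": 30.1, "sample_B": 28.7},
]

# A second table for a different "kind" of variable: sample metadata.
samples = [
    {"sample": "sample_A", "treatment": "control"},
    {"sample": "sample_B", "treatment": "diet"},
]


def to_tidy(rows, id_column):
    """Turn one-column-per-sample rows into one row per observation."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy_rows.append(
                {id_column: row[id_column], "sample": column, "intensity": value}
            )
    return tidy_rows


def link(tidy_rows, sample_rows, key="sample"):
    """Attach sample metadata to each observation via the shared key column."""
    lookup = {row[key]: row for row in sample_rows}
    return [{**row, **lookup[row[key]]} for row in tidy_rows]


tidy = to_tidy(messy, "feature")
linked = link(tidy, samples)
for row in linked:
    print(row)
```

After reshaping, "intensity" is a single column, each row is one feature-sample observation, and the "sample" column is what allows the two tables to be joined.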

Messy Data Example

Figure: messy lipidomics data table (lipidome-messy)

Reproducibility

The raw data should be stored safely, and every analysis step should be documented.

Reproducibility allows you to validate your analysis and to collaborate with other people.

Using a programming language is the best way to make your analysis reproducible, for example:
- SAS
- R
- Python
- Julia
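
As one possible illustration of a scripted, documented analysis step, the sketch below records a checksum of the raw file and the random seed next to the result, so that a rerun can be compared with the original. The file name, the values in it, and the functions sha256_of and run_analysis are all made up for the example.

```python
import hashlib
import json
import random
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of the raw file, to detect accidental modification."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_analysis(raw_path: Path, seed: int) -> dict:
    """Toy analysis step that reports its input checksum and seed."""
    rng = random.Random(seed)  # fixed seed makes the subsample repeatable
    values = [float(token) for token in raw_path.read_text().split()]
    subsample = sorted(rng.sample(values, k=3))
    return {
        "raw_file": raw_path.name,
        "raw_sha256": sha256_of(raw_path),
        "seed": seed,
        "subsample_mean": sum(subsample) / len(subsample),
    }


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "intensities.txt"  # hypothetical raw data file
    raw.write_text("1.0\n2.0\n3.0\n4.0\n5.0\n")
    first = run_analysis(raw, seed=42)
    second = run_analysis(raw, seed=42)

print(json.dumps(first, indent=2))
```

Running the same script twice on the same raw file with the same seed gives the same record, which is the property the slide asks for.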

Why R

R is a programming language designed for statistics and data analysis. R's built-in packages support almost all basic statistical tests (linear models, t-tests, correlation, etc.).

R has a tremendous number of additional packages on CRAN that extend its functionality to almost all fields (ecology, chemistry, engineering, etc.).

The Bioconductor project also provides many packages designed for bioinformatics, contributed by scientists all over the world.

Very popular in the academic community.

Most R packages have detailed documentation.

Ability to generate scientific reports and presentations in different formats, including HTML, PDF, slides, and Word (docx).

Why R cont.

R is also a good choice for a career!

Version control

Git: a distributed version control system
GitHub: a web-based hosting service for Git
- Pros:
  - Can be used to serve static HTML pages
  - Multiple people can work on the same project
- Cons:
  - Public to the world unless you pay
  - Single file size limit

Metabase

Metabase is an R package that provides a solution to store, handle, analyze, and visualize data from quantitative experiments such as metabolomics and proteomics.

Currently only metabolomics data is well supported, but the package is intended to support any quantitative experiment data, including metabolomics, proteomics, glycomics, nutrient data, anthropometric and clinical data, and biochemical assays.

GitHub repo: github.com/zhuchcn/Metabase

High-Throughput Experiment Data Structure

Almost all high-throughput experiment data consist of 3 tables:

A table with information about each feature (pink box)

A table with information about each sample (blue box)

A purely numeric table of the intensity/concentration/abundance of each feature in each sample (green box)
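
The three-table layout can be sketched as a small container that keeps the tables consistent with each other. This is a hypothetical illustration only: the class name ExperimentSet, its fields, and the example values are assumptions made for this sketch and are not the Metabase API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExperimentSet:
    """Hypothetical three-table container (not the Metabase API)."""
    feature_data: Dict[str, Dict[str, str]]  # feature ID -> feature annotations
    sample_data: Dict[str, Dict[str, str]]   # sample ID -> sample metadata
    values: Dict[str, Dict[str, float]]      # feature ID -> {sample ID: value}

    def __post_init__(self) -> None:
        # The numeric table must cover exactly the annotated features and samples.
        if set(self.values) != set(self.feature_data):
            raise ValueError("features in values and feature_data do not match")
        for feature_id, row in self.values.items():
            if set(row) != set(self.sample_data):
                raise ValueError(
                    f"samples for {feature_id!r} do not match sample_data"
                )

    def subset_samples(self, sample_ids: List[str]) -> "ExperimentSet":
        """Subset the sample table and the numeric table together."""
        return ExperimentSet(
            feature_data=self.feature_data,
            sample_data={s: self.sample_data[s] for s in sample_ids},
            values={
                f: {s: row[s] for s in sample_ids}
                for f, row in self.values.items()
            },
        )


# Made-up example data.
experiment = ExperimentSet(
    feature_data={"PC 34:1": {"class": "PC"}, "TG 52:2": {"class": "TG"}},
    sample_data={"S1": {"treatment": "control"}, "S2": {"treatment": "diet"}},
    values={
        "PC 34:1": {"S1": 12.5, "S2": 14.0},
        "TG 52:2": {"S1": 30.1, "S2": 28.7},
    },
)
controls = experiment.subset_samples(["S1"])
print(controls.values)
```

Keeping the three tables in one object means a subset operation cannot leave the sample table and the numeric table out of sync, which is the main reason to bundle them.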

HTE Data Structure

Metabase Design

Metabase uses an object-oriented design.

Classes Roadmap

Only MetabolomicsSet, LipidomicsSet, and MultiSet are currently available. ProteomicsSet and GlycomicsSet are not yet implemented.

Documentation