07/25/2018

High-throughput experiments

A high-throughput experiment (HTE) uses approaches such as sequencing technology and mass spectrometry-based analytical chemistry to measure the quality or quantity of a large number of features, such as nucleic acid sequences, proteins, or small molecules, in biological or environmental specimens or in study subjects.

Biology and Data Science

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms.

Bioinformatics venn

High-Throughput Experiment Data

  • The starting point of a high-throughput experiment is usually raw sequencing or mass spectrometry data.
  • FASTQ files for microbiome, metagenomics, RNA-seq, etc.
  • mzML/mzXML MS files for proteomics (and possibly metabolomics)
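
For orientation, a FASTQ record spans four lines: an identifier starting with `@`, the sequence, a `+` separator, and a quality string with the same length as the sequence. The sketch below reads such records with the Python standard library only. The names `FastqRecord` and `read_fastq` and the two example reads are made up for illustration; real pipelines would rely on dedicated tools rather than this code.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or an empty line where a header should be)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads, for illustration only.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nGGCA\n+\nII#I\n"
)
records = list(read_fastq(example))
```

Checking the record layout while reading means a truncated or corrupted file fails early instead of silently producing wrong counts downstream.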

Data Processing and Analysis

  • Should be standardized, clear and reproducible.
  • Data processing for high-throughput experiments depends on the assay or experiment, not on the study. The data processing procedure doesn't change with the study design.
  • Proteomics: X!Tandem
  • 16S-seq: fastq-multx, cutadapt/trimmomatic/fastx-toolkit, DADA2
  • Once data processing has produced tabular data, the data analysis step is study dependent. Different statistical models and visualization approaches can be used based on the study design.

Basic Principles of Data Science

  • Principles of tidy data
  • Reproducibility
  • Why R
  • Version Control (git)
  • Data structure of high-throughput experiment data

Tidy Data

  1. Each measured variable should be in one column.
  2. Each different observation of that variable should be in a different row.
  3. There should be one table for each "kind" of variable.
  4. If you have multiple tables, they should include a column in the table that allows them to be linked.
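
The four rules can be illustrated with a small reshaping sketch. It assumes a made-up "wide" table with one column per sample, written in plain Python; the feature names, sample names, values, and the helper `to_tidy` are all hypothetical.

```python
# A made-up "messy" (wide) table: one column per sample.
wide_rows = [
    {"feature": "PC 34:1", "sample_A": 10.5, "sample_B": 12.0},
    {"feature": "TG 52:2", "sample_A": 3.2, "sample_B": 2.9},
]


def to_tidy(rows, id_column="feature"):
    """Reshape wide rows into one row per (feature, sample) observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the ID column is carried along, not reshaped
            tidy.append(
                {id_column: row[id_column], "sample": column, "intensity": value}
            )
    return tidy


# Rules 1 and 2: "intensity" is one column, each observation is one row.
tidy_rows = to_tidy(wide_rows)

# Rule 3: sample metadata is a different "kind" of variable, so it gets
# its own table. Rule 4: the shared "sample" key links the two tables.
sample_info = {
    "sample_A": {"treatment": "control"},
    "sample_B": {"treatment": "treated"},
}
joined = [{**obs, **sample_info[obs["sample"]]} for obs in tidy_rows]
```

In the tidy form, adding a new sample adds rows rather than columns, so code that filters or groups by sample does not need to change.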

Messy Data Example

lipidome-messy

Reproducibility

  • The raw data should be stored well and every analysis step should be documented.

  • Reproducibility allows you to validate your analysis and collaborate with other people.
  • Using a programming language is the best way to make your analysis reproducible.

    • SAS
    • R
    • Python
    • Julia

Why R

  • R is a programming language designed for statistics and data analysis. R's native packages support almost all basic statistical tests (linear models, t-tests, correlation, etc.).
  • R has a tremendous number of additional packages on CRAN that extend its functionality to almost all fields (ecology, chemistry, engineering, etc.).
  • The Bioconductor project also hosts many packages designed for bioinformatics, contributed by scientists all over the world.
  • Very popular in the academic community.
  • Almost all R packages have very detailed documentation.
  • R can generate scientific reports and presentations in different formats, including HTML, PDF, slides, and Word (docx).

Why R cont.

R is also a good choice for a career!

Version control

  • Git: a distributed version control system
  • GitHub: a web-based hosting service for Git
    • Pros:
      • Can be used to serve static HTML pages
      • Multiple people can work on the same project
    • Cons:
      • Public to the world unless you pay
      • Limit on the size of a single file

Metabase

  • Metabase is an R package that provides a solution to store, handle, analyze, and visualize data from quantitative experiments such as metabolomics and proteomics.
  • Currently only metabolomics data is well supported, but the package is intended to support any quantitative experiment data, including metabolomics, proteomics, glycomics, nutrient data, anthropometric and clinical data, and biochemical assays.
  • GitHub repo: github.com/zhuchcn/Metabase

High-Throughput Experiment Data Structure

Almost all high-throughput experiment data consist of three tables:

  • A table with information about each feature (pink box)
  • A table with information about each sample (blue box)
  • A purely numeric table of the intensity/concentration/abundance of each feature in each sample (green box)
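
The three-table layout can be sketched as a single container that keeps the tables consistent with one another. This is a hypothetical Python illustration only: the class `ExperimentSet`, its field names, and the example values are assumptions made for this sketch and are not the Metabase API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExperimentSet:
    """Hypothetical three-table container (not the Metabase API)."""

    feature_table: Dict[str, Dict[str, str]]   # feature ID -> annotations
    sample_table: Dict[str, Dict[str, str]]    # sample ID -> metadata
    value_table: Dict[str, Dict[str, float]]   # feature ID -> {sample ID: value}

    def __post_init__(self) -> None:
        # The numeric table must cover exactly the annotated features...
        if set(self.value_table) != set(self.feature_table):
            raise ValueError("feature IDs differ between tables")
        # ...and every feature must have a value for exactly the known samples.
        samples = set(self.sample_table)
        for feature, values in self.value_table.items():
            if set(values) != samples:
                raise ValueError(f"sample IDs differ for feature {feature!r}")

    def subset_samples(self, keep: List[str]) -> "ExperimentSet":
        """Return a new set restricted to the given sample IDs."""
        return ExperimentSet(
            feature_table=self.feature_table,
            sample_table={s: self.sample_table[s] for s in keep},
            value_table={
                f: {s: values[s] for s in keep}
                for f, values in self.value_table.items()
            },
        )


# Made-up example data, for illustration only.
dataset = ExperimentSet(
    feature_table={"PC 34:1": {"class": "PC"}, "TG 52:2": {"class": "TG"}},
    sample_table={"S1": {"treatment": "control"}, "S2": {"treatment": "treated"}},
    value_table={
        "PC 34:1": {"S1": 10.5, "S2": 12.0},
        "TG 52:2": {"S1": 3.2, "S2": 2.9},
    },
)
controls = dataset.subset_samples(["S1"])
```

Keeping the three tables in one object means that subsetting samples or features updates all of them together, so the annotation tables cannot drift out of step with the numeric table.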

HTE Data Structure

Metabase Design

  • Metabase uses an object-oriented design.

Classes Roadmap

Only MetabolomicsSet, LipidomicsSet, and MultiSet are available now. ProteomicsSet and GlycomicsSet are not yet implemented.


Documentation