09/25/2018

Science and Art

  • The tricky part of data science is that even experienced data analysts sometimes cannot explain why they take a particular action.
  • Knowledge of data analysis is a science, but applying data science skills to analyze data has artistic components.
  • Donald Knuth, "Computer Programming as an Art" (1974):
    • Science is knowledge which we understand so well that we can teach it to a computer.
    • Everything else is art.

5 Core Activities of Data Analysis

  1. Stating the question
  2. Exploratory data analysis
  3. Model building
  4. Interpretation
  5. Communication
  • Data analysis appears to follow a linear, one-step-after-the-other process. In practice, what you learn at each step tells you whether (and how) to refine and redo the step you just performed, or whether (and how) to proceed to the next step.

Epicycles of Analysis

  1. Setting Expectations,
  2. Collecting information (data), comparing the data to your expectations, and if the expectations don’t match,
  3. Revising your expectations or fixing the data so your data and your expectations match.

Epicycles of Analysis (2)

[Figure: epicycles of analysis]

Goal of Exploratory Data Analysis

  1. To determine if there are any problems with your dataset.
  2. To determine whether the question you are asking can be answered by the data that you have.
  3. To develop a sketch of the answer to your question.

Case Study: FFS HDL Lipidome

A reminder of the fast food study design

Lipidomics

  • Lipidomics is a subset of metabolomics that aims to identify and quantify all lipid species in a biological or environmental sample.

  • The West Coast Metabolomics Center's untargeted lipidomics platform can now detect and quantify over 300 lipid species in 14 lipid classes.

  • Because HDL is a lipid-protein complex, which lipid species are present in HDL, and in what quantity, affects HDL function.

  • We use lipidomics extensively to study the lipid composition of HDL.

Case Study: FFS HDL Lipidome (2)

  • Check the dimensions (number of rows and number of columns) of the data:
dim(edata)
## [1] 604  45

Case Study: FFS HDL Lipidome (3)

  • Question 1: Why does the data have 45 samples rather than 40?
  • Check the top and tail of the sample data:
           Sample #  Species  Organ                     Treatment  Timepoint  File ID                        RT
Biorec001  QC        Human    Plasma                    NA         NA         Biorec001_posCSH_preZhu001.d   Peak Height
Biorec002  QC        Human    Plasma                    NA         NA         Biorec002_posCSH_postZhu010.d  Peak Height
Biorec003  QC        Human    Plasma                    NA         NA         Biorec003_posCSH_postZhu020.d  Peak Height
Biorec004  QC        Human    Plasma                    NA         NA         Biorec004_posCSH_postZhu030.d  Peak Height
Biorec005  QC        Human    Plasma                    NA         NA         Biorec005_posCSH_postZhu040.d  Peak Height
FFS-107-A  1         Human    HDL isolated from Plasma  FF         Pre        Zhu027_posCSH_FFS-107-A_001.d  Peak Height

Case Study: FFS HDL Lipidome (4)

  • Remove QCs and check dimensions again
## [1] 604  40
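
The slides do not show the code for this step, so the following is only a minimal base R sketch. Only `edata` is named in the original; the sample table `pdata`, its `sample_number` column (standing in for the "Sample #" column) and the toy values are assumptions made for illustration.

```r
# Toy stand-ins for the real data (assumed shapes and names).
set.seed(1)
sample_ids <- c("Biorec001", "Biorec002", "FFS-107-A", "FFS-107-B", "FFS-108-A")
pdata <- data.frame(
  sample_number = c("QC", "QC", "1", "2", "3"),  # hypothetical "Sample #" column
  row.names = sample_ids,
  stringsAsFactors = FALSE
)
edata <- matrix(runif(4 * 5), nrow = 4,
                dimnames = list(paste0("Feature00", 1:4), sample_ids))

# Count the QC injections: they account for the extra columns.
is_qc <- pdata$sample_number == "QC"
sum(is_qc)

# Keep only the study samples, in both the expression and the sample table.
edata_noqc <- edata[, !is_qc, drop = FALSE]
pdata_noqc <- pdata[!is_qc, , drop = FALSE]

# Check the dimensions again.
dim(edata_noqc)
```

Filtering the sample table together with the expression matrix keeps the two aligned for later steps.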

Case Study: FFS HDL Lipidome (5)

  • Question 2: Are the 604 features all successfully annotated?
  • Look at the top of the feature data
            Identifier                 Annotation             InChIKey                     Species           count  ESI mode  m/z                 RT     qc_mean   qc_sd       qc_cv
Feature001  11.86_724.70_11.86_729.66  1_CE (22:1) iSTD       SQHUGNAFKZZXOT-JWTURFAQSA-N  [M+NH4]+_[M+Na]+  40     (+ ES)    724.7003_729.6556   11.86  596534.4  33472.5834  17.821582
Feature002  4.93_376.40                1_Cholesterol d7 iSTD  HVYWMOMLDIMFJA-IFAPJKRJSA-N  [M-H2O+H]+        40     (+ ES)    376.39550000000003  4.93   7510.8    324.9349    23.114783
Feature003  0.79_341.28                1_CUDA (pos) iSTD      HPTJABJPZMULFH-UHFFFAOYSA-N  [M+H]+            40     (+ ES)    341.28030000000001  0.79   74103.6   5860.4870   12.644615
Feature004  1.88_510.36                1_LPC (17:0) iSTD      SRRQPVVYXBTRQK-XMMPIXPASA-N  [M+H]+            40     (+ ES)    510.35629999999998  1.88   210018.6  20991.8872  10.004751
Feature005  1.38_466.29                1_LPE (17:1) iSTD      LNJNONCNASQZOB-HEDKFQSOSA-N  [M+H]+            40     (+ ES)    466.29309999999998  1.38   51342.6   15231.6438  3.370785
Feature006  3.56_636.46                1_PC (25:0) iSTD       FCTBVSCBBWKZML-WJOKGBTCSA-N  [M+H]+            40     (+ ES)    636.46019999999999  3.56   7933.0    689.8340    11.499868

Case Study: FFS HDL Lipidome (6)

  • Look at the tail of the feature data
            Identifier   Annotation  InChIKey  Species  count  ESI mode  m/z                 RT    qc_mean    qc_sd       qc_cv
Feature599  3.24_672.53  NA          NA        [M-H]-   13     -1        672.52620000000002  3.24  3958.500   731.85552   5.408855
Feature600  0.74_230.19  NA          NA        [M+H]+   12     -1        230.1902            0.74  3644.667   2296.89994  1.586776
Feature601  3.33_573.47  NA          NA        [M+H]+   12     -1        573.47090000000003  3.33  2241.500   88.38835    25.359678
Feature602  5.06_807.56  NA          NA        [M+H]+   11     -1        807.56119999999999  5.06  NaN        NA          NaN
Feature603  9.56_391.29  NA          NA        [M+H]+   11     -1        391.28640000000001  9.56  21476.500  236.88077   90.663754
Feature604  7.19_478.39  NA          NA        [M+H]+   10     -1        478.38749999999999  7.19  11793.500  946.81598   12.455958

Case Study: FFS HDL Lipidome (7)

  • We can then plot the number of rows (features) with and without an annotation.
  • Many features have no annotation, so we remove them.
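
As a rough illustration of this step, the base R sketch below counts annotated and unannotated features and then drops the unannotated ones. The feature table name `fdata` and its toy contents are assumptions; it also assumes that missing annotations are stored as `NA` rather than as the text "NA".

```r
# Toy feature table (hypothetical name and values).
fdata <- data.frame(
  Annotation = c("1_CE (22:1) iSTD", "1_Cholesterol d7 iSTD", NA, NA, NA),
  row.names = paste0("Feature00", 1:5),
  stringsAsFactors = FALSE
)

# Flag rows that carry an annotation and count both groups.
annotated <- !is.na(fdata$Annotation)
counts <- table(ifelse(annotated, "annotated", "not annotated"))
counts

# Count plot of annotated vs. unannotated features.
barplot(counts, ylab = "Number of features")

# Keep only the annotated features.
fdata_annot <- fdata[annotated, , drop = FALSE]
nrow(fdata_annot)
```

The same logical vector `annotated` can be used to subset the rows of the expression matrix so that both tables stay in step.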

Case Study: FFS HDL Lipidome (8)

  • We have now removed all rows without an annotation.
  • Question 3: Are there any missing values?

3 types of "Missing Values"

  • There are three types of "missing values":
    • The data are missing (for example, a subject's dietary record is missing at one time point)
    • The value failed to be sampled (count data such as 16S-seq)
    • Zero (quantitative methods)
  • Our solution:
    1. Remove features with too many zeros
    2. Fill the NAs with a reasonably small value
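
The two-step solution can be sketched in base R as follows. The slides do not say how many zeros are "too many" or which small value is used, so the 50% cutoff (`max_zero_fraction`) and the half-minimum fill are assumptions chosen only for illustration, as is the toy matrix.

```r
# Toy intensity matrix: features in rows, samples in columns (made-up values).
edata <- rbind(
  Feature001 = c(120, 95, 0, 110, 130, 101),
  Feature002 = c(0, 0, 0, 0, 12, 0),
  Feature003 = c(40, NA, 55, 38, NA, 47)
)
colnames(edata) <- paste0("S", 1:6)

# Step 1: remove features with too many zeros (assumed cutoff: more than 50%).
max_zero_fraction <- 0.5
zero_fraction <- rowMeans(edata == 0, na.rm = TRUE)
edata_kept <- edata[zero_fraction <= max_zero_fraction, , drop = FALSE]

# Step 2: fill each NA with half of that feature's smallest positive value
# (one common choice of "reasonably small value", assumed here).
fill_small <- function(x) {
  x[is.na(x)] <- min(x[x > 0], na.rm = TRUE) / 2
  x
}
edata_filled <- t(apply(edata_kept, 1, fill_small))
edata_filled
```

Filling per feature rather than with one global constant keeps the imputed value on the same scale as the feature's observed intensities.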

Case Study: FFS HDL Lipidome (9)

  • Does each row have a unique feature annotation?
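
One way to check this in base R is sketched below. The table name `fdata`, the `ESI` column, and the lipid names are made up for illustration. The example assumes that one possible source of duplicates is the same lipid being detected in both ESI modes, which the slides do not state.

```r
# Toy feature table (hypothetical names and values).
fdata <- data.frame(
  Annotation = c("PC 34:1", "PC 34:1", "TG 52:2", "SM 34:1"),
  ESI = c("Pos", "Neg", "Pos", "Pos"),
  row.names = paste0("Feature00", 1:4),
  stringsAsFactors = FALSE
)

# TRUE if at least one annotation occurs more than once.
has_duplicates <- anyDuplicated(fdata$Annotation) > 0
has_duplicates

# Which annotations are repeated, and how often?
annotation_counts <- table(fdata$Annotation)
repeated <- annotation_counts[annotation_counts > 1]
repeated

# Number of features per ESI mode.
table(fdata$ESI)
```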

Case Study: FFS HDL Lipidome (10)

## ESI
## Neg Pos 
##  55  57

Case Study: FFS HDL Lipidome (11)

  • Now we can run a hypothesis test on each feature using a linear model, then make a histogram of the p values of the 170 features.
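
The slides do not give the model formula, so the sketch below only illustrates the general pattern: fit one linear model per feature with `lm()`, extract the p value of the term of interest, and plot a histogram. The simulated data, the two-level `timepoint` factor, and the built-in effect on the first 20 features are all assumptions.

```r
set.seed(42)

# Simulated stand-in: 170 features, 20 samples (10 Pre, 10 Post).
n_features <- 170
timepoint <- factor(rep(c("Pre", "Post"), each = 10), levels = c("Pre", "Post"))
edata <- matrix(rnorm(n_features * length(timepoint)), nrow = n_features,
                dimnames = list(paste0("Feature", seq_len(n_features)), NULL))

# Give the first 20 features a real Pre/Post difference.
edata[1:20, timepoint == "Post"] <- edata[1:20, timepoint == "Post"] + 2

# Fit y ~ timepoint for every feature and keep the p value of the Post effect.
p_values <- apply(edata, 1, function(y) {
  fit <- lm(y ~ timepoint)
  summary(fit)$coefficients["timepointPost", "Pr(>|t|)"]
})

# Histogram of p values across all features.
hist(p_values, breaks = 20, xlab = "p value", main = "p values across features")
```

In a histogram like this, a peak near zero on top of a roughly flat background is what one would expect when some features truly differ and the rest do not.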

Case Study: FFS HDL Lipidome (12)

  • If you ran the same statistical analysis on 10000 random variables, what would the distribution of p values look like?
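
One way to explore this question is by simulation. The sketch below assumes two groups of 10 values drawn from the same normal distribution and a Welch t-test for each of the 10000 variables; neither choice comes from the slides. Because no true difference exists under these assumptions, the p values should be spread roughly evenly between 0 and 1, with about 5% of them falling below 0.05 by chance alone.

```r
set.seed(2018)

# 10000 tests in which both groups come from the same distribution.
n_tests <- 10000
p_null <- replicate(n_tests, t.test(rnorm(10), rnorm(10))$p.value)

# The histogram should look approximately flat.
hist(p_null, breaks = 20, xlab = "p value", main = "p values under the null")

# Fraction of "significant" results at the 0.05 level, all false positives here.
false_positive_rate <- mean(p_null < 0.05)
false_positive_rate
```

Comparing a real p-value histogram against this flat reference is a quick check on whether the data contain more signal than chance would produce.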

Case Study: FFS HDL Lipidome (13)

  • We can plot some of the features that have a significant p value. One person always seems to differ from the other subjects.

Case Study: FFS HDL Lipidome (14)

  • After fixing the mislabelled subject 116:

Case Study: FFS HDL Lipidome (15)

Review

5 core activities of data analysis

  1. Stating the question
  2. Exploratory data analysis
  3. Model building
  4. Interpretation
  5. Communication

Epicycles of analysis

  1. Setting Expectations,
  2. Collecting information (data), comparing the data to your expectations, and if the expectations don’t match,
  3. Revising your expectations or fixing the data so your data and your expectations match.

Reference