The Art of Data Analysis

09/25/2018

Science and Art

The tricky part of data science is that even experienced data analysts sometimes can't explain why they take a particular action.

The knowledge behind data analysis is science, but applying data science skills to a real dataset has artistic components.

Computer Programming as an Art, Donald Knuth, 1974:
- Science is knowledge which we understand so well that we can teach it to a computer.
- Everything else is art.

5 Core Activities of Data Analysis

Stating the question
Exploratory data analysis
Model building
Interpret
Communicate

Data analysis appears to follow a linear, "one-step-after-the-other" process. In the real world, however, what you learn at each step tells you whether (and how) to refine and redo the step you just performed, or whether (and how) to proceed to the next step.

Epicycles of Analysis

Setting Expectations,
Collecting information (data), comparing the data to your expectations, and if the expectations don’t match,
Revising your expectations or fixing the data so your data and your expectations match.

Epicycles of Analysis (2)

[Figure: the epicycles of analysis]

Goal of Exploratory Data Analysis

To determine if there are any problems with your dataset.
To determine whether the question you are asking can be answered by the data that you have.
To develop a sketch of the answer to your question.

Case Study: FFS HDL Lipidome

A reminder of the fast food study design

Lipidomics

Lipidomics is a subset of metabolomics that aims to identify and quantify all lipid species in a biological or environmental sample.
The West Coast Metabolomics Center's untargeted lipidomics platform can now detect and quantify over 300 lipid species in 14 lipid classes.
Because HDL is a lipid-protein complex, which lipid species are present in HDL, and in what quantity, affects HDL function.
We use lipidomics extensively to study the lipid composition of HDL.

Case Study: FFS HDL Lipidome (2)

Check the dimensions (number of rows and number of columns) of the data:

dim(edata)

## [1] 604  45

Case Study: FFS HDL Lipidome (3)

Question 1: Why are there 45 samples rather than 40?
Check the top and tail of the sample data.

	Sample #	Species	Organ	Treatment	Timepoint	File ID	RT
Biorec001	QC	Human	Plasma	NA	NA	Biorec001_posCSH_preZhu001.d	Peak Height
Biorec002	QC	Human	Plasma	NA	NA	Biorec002_posCSH_postZhu010.d	Peak Height
Biorec003	QC	Human	Plasma	NA	NA	Biorec003_posCSH_postZhu020.d	Peak Height
Biorec004	QC	Human	Plasma	NA	NA	Biorec004_posCSH_postZhu030.d	Peak Height
Biorec005	QC	Human	Plasma	NA	NA	Biorec005_posCSH_postZhu040.d	Peak Height
FFS-107-A	1	Human	HDL isolated from Plasma	FF	Pre	Zhu027_posCSH_FFS-107-A_001.d	Peak Height

Case Study: FFS HDL Lipidome (4)

Remove QCs and check dimensions again

## [1] 604  40
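The QC-removal step can be sketched as follows. This is a minimal sketch on simulated data, not the study data: it assumes the sample metadata live in a data frame called `pdata` (a hypothetical name) whose rows line up with the columns of `edata`, and whose `Sample #` column marks the pooled QC injections as "QC".

```r
# Toy stand-ins for the real objects; pdata and its column name are assumptions.
set.seed(1)
sample_ids <- c(paste0("Biorec00", 1:5), paste0("S", 1:40))
edata <- matrix(rexp(604 * 45), nrow = 604,
                dimnames = list(sprintf("Feature%03d", 1:604), sample_ids))
pdata <- data.frame(`Sample #` = c(rep("QC", 5), as.character(1:40)),
                    row.names = sample_ids, check.names = FALSE)

# Make sure metadata rows and data columns are in the same order before subsetting.
stopifnot(identical(rownames(pdata), colnames(edata)))

# Drop the QC columns from the data and the matching rows from the metadata.
is_qc <- pdata[["Sample #"]] == "QC"
edata_noqc <- edata[, !is_qc, drop = FALSE]
pdata_noqc <- pdata[!is_qc, , drop = FALSE]

dim(edata_noqc)
```

Subsetting the data matrix and the metadata with the same logical vector keeps the two objects aligned for every later step.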

Case Study: FFS HDL Lipidome (5)

Question 2: Are the 604 features all successfully annotated?
Look at the top of the feature data

	Identifier	Annotation	InChIKey	Species	count	ESI mode	m/z	RT	qc_mean	qc_sd	qc_cv
Feature001	11.86_724.70_11.86_729.66	1_CE (22:1) iSTD	SQHUGNAFKZZXOT-JWTURFAQSA-N	[M+NH4]+_[M+Na]+	40	(+ ES)	724.7003_729.6556	11.86	596534.4	33472.5834	17.821582
Feature002	4.93_376.40	1_Cholesterol d7 iSTD	HVYWMOMLDIMFJA-IFAPJKRJSA-N	[M-H2O+H]+	40	(+ ES)	376.39550000000003	4.93	7510.8	324.9349	23.114783
Feature003	0.79_341.28	1_CUDA (pos) iSTD	HPTJABJPZMULFH-UHFFFAOYSA-N	[M+H]+	40	(+ ES)	341.28030000000001	0.79	74103.6	5860.4870	12.644615
Feature004	1.88_510.36	1_LPC (17:0) iSTD	SRRQPVVYXBTRQK-XMMPIXPASA-N	[M+H]+	40	(+ ES)	510.35629999999998	1.88	210018.6	20991.8872	10.004751
Feature005	1.38_466.29	1_LPE (17:1) iSTD	LNJNONCNASQZOB-HEDKFQSOSA-N	[M+H]+	40	(+ ES)	466.29309999999998	1.38	51342.6	15231.6438	3.370785
Feature006	3.56_636.46	1_PC (25:0) iSTD	FCTBVSCBBWKZML-WJOKGBTCSA-N	[M+H]+	40	(+ ES)	636.46019999999999	3.56	7933.0	689.8340	11.499868

Case Study: FFS HDL Lipidome (6)

Look at the tail of the feature data

	Identifier	Annotation	InChIKey	Species	count	ESI mode	m/z	RT	qc_mean	qc_sd	qc_cv
Feature599	3.24_672.53	NA	NA	[M-H]-	13	-1	672.52620000000002	3.24	3958.500	731.85552	5.408855
Feature600	0.74_230.19	NA	NA	[M+H]+	12	-1	230.1902	0.74	3644.667	2296.89994	1.586776
Feature601	3.33_573.47	NA	NA	[M+H]+	12	-1	573.47090000000003	3.33	2241.500	88.38835	25.359678
Feature602	5.06_807.56	NA	NA	[M+H]+	11	-1	807.56119999999999	5.06	NaN	NA	NaN
Feature603	9.56_391.29	NA	NA	[M+H]+	11	-1	391.28640000000001	9.56	21476.500	236.88077	90.663754
Feature604	7.19_478.39	NA	NA	[M+H]+	10	-1	478.38749999999999	7.19	11793.500	946.81598	12.455958

Case Study: FFS HDL Lipidome (7)

We can then make a count plot of rows (features) with an annotation vs. without one.
Many features have no annotation. Since they cannot be interpreted as known lipids, we remove them.
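One way to count and drop the unannotated features is sketched below. The feature table name `fdata` and the toy values are assumptions for illustration; only the `Annotation` column name comes from the table shown earlier.

```r
# Toy feature table; fdata is a hypothetical name for the feature metadata.
fdata <- data.frame(
  Annotation = c("1_CE (22:1) iSTD", "1_LPC (17:0) iSTD", NA, NA, NA),
  row.names = sprintf("Feature%03d", 1:5),
  stringsAsFactors = FALSE
)
edata <- matrix(1:20, nrow = 5,
                dimnames = list(rownames(fdata), paste0("S", 1:4)))

# Count features with and without an annotation and plot the counts.
is_annotated <- !is.na(fdata$Annotation)
annotation_counts <- table(ifelse(is_annotated, "annotated", "not annotated"))
barplot(annotation_counts, ylab = "Number of features")

# Keep only annotated features, in both the feature table and the data matrix.
fdata_annot <- fdata[is_annotated, , drop = FALSE]
edata_annot <- edata[is_annotated, , drop = FALSE]
```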

Case Study: FFS HDL Lipidome (8)

We have now removed all the rows without an annotation.
Question 3: Are there any missing values?

3 types of "Missing Values"

There are three types of "missing values":
- The data is missing (dietary record missing for a subject at one time point)
- Failed to be sampled (count data such as 16S-seq)
- Zero (quantitative method)
Our solution:
- Remove features with too many zeros
- Fill the NAs with a reasonably small value
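The two-step solution above can be sketched as follows. The 50% cutoff and the "half of the smallest observed value" fill are illustrative assumptions, not the settings used in the study, and the matrix is toy data.

```r
# Toy intensity matrix (features in rows, samples in columns).
edata <- rbind(
  f1 = c(10, 12, 0, 11),
  f2 = c(0, 0, 0, 5),
  f3 = c(NA, 8, 9, 7)
)

# Treat zeros from the quantitative method as missing values.
edata[!is.na(edata) & edata == 0] <- NA

# Drop features missing in more than max_missing of the samples (assumed cutoff).
max_missing <- 0.5
keep <- rowMeans(is.na(edata)) <= max_missing
edata_kept <- edata[keep, , drop = FALSE]

# Fill the remaining NAs with half of that feature's smallest observed value.
edata_filled <- t(apply(edata_kept, 1, function(x) {
  x[is.na(x)] <- min(x, na.rm = TRUE) / 2
  x
}))

edata_filled
```

Filling per feature, rather than with one global constant, keeps the imputed value on the same scale as that feature's measured intensities.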

Case Study: FFS HDL Lipidome (9)

Does each row have a unique feature annotation?
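A quick way to answer this question in base R is sketched below. The `annotation` vector and its lipid names are made up for illustration; in practice it would be the annotation column of the feature table.

```r
# Toy annotations; the vector name and values are assumptions.
annotation <- c("PC 34:1", "PC 34:1", "TG 52:2", "SM d36:1")

# TRUE only if no annotation appears more than once.
all_unique <- anyDuplicated(annotation) == 0
all_unique

# List the annotations that occur on more than one row, with their counts.
annotation_counts <- table(annotation)
duplicated_annotations <- annotation_counts[annotation_counts > 1]
duplicated_annotations
```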

Case Study: FFS HDL Lipidome (10)

## ESI
## Neg Pos 
##  55  57

Case Study: FFS HDL Lipidome (11)

Now we can run a hypothesis test on each feature using a linear model, and then make a histogram of the p values of the 170 features.
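A per-feature linear model and p-value histogram might look like the sketch below. It uses simulated data and a hypothetical two-level factor `group`; the model actually used for the fast food study is not shown in these slides and would likely need more terms (for example treatment and subject).

```r
set.seed(42)
n_features <- 170
n_samples <- 40

# Hypothetical two-level design; the real study design may need more terms.
group <- factor(rep(c("Pre", "Post"), each = n_samples / 2),
                levels = c("Pre", "Post"))
edata <- matrix(rnorm(n_features * n_samples), nrow = n_features)

# Fit one linear model per feature and pull out the p value for the group term.
p_values <- apply(edata, 1, function(y) {
  fit <- lm(y ~ group)
  summary(fit)$coefficients["groupPost", "Pr(>|t|)"]
})

hist(p_values, breaks = seq(0, 1, by = 0.05),
     xlab = "P value", main = "P values across features")
```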

Case Study: FFS HDL Lipidome (12)

If you had 10,000 purely random variables and ran the same statistical test on each, what would the distribution of p values look like?
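The question can be explored with a small simulation. In this sketch both groups are drawn from the same normal distribution, so every null hypothesis is true; the group size of 10 and the use of a t test are arbitrary choices for illustration.

```r
set.seed(123)
n_tests <- 10000
n_per_group <- 10

# Both groups come from the same distribution, so there is no real effect.
null_p <- replicate(n_tests,
                    t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value)

# Under the null, p values are roughly uniform, so the histogram looks flat.
hist(null_p, breaks = seq(0, 1, by = 0.05),
     xlab = "P value", main = "P values under the null")

# Fraction of tests below 0.05 by chance alone.
mean(null_p < 0.05)
```

A flat histogram is the baseline to compare against: a real signal shows up as an excess of small p values on top of that flat background, and roughly 5% of null tests fall below 0.05 even when nothing is going on.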

Case Study: FFS HDL Lipidome (13)

We can plot some of the features that have a significant p value. One person seems to be consistently different from the other subjects.

Case Study: FFS HDL Lipidome (14)

After we fix the mislabelled 116

Case Study: FFS HDL Lipidome (15)

Review

5 core activities of data analysis

Stating the question
Exploratory data analysis
Model building
Interpret
Communicate

Epicycles of analysis

Setting Expectations,
Collecting information (data), comparing the data to your expectations, and if the expectations don’t match,
Revising your expectations or fixing the data so your data and your expectations match.

Reference

The Art of Data Science, by Roger D. Peng and Elizabeth Matsui, available free on Leanpub
