Example: Differentially methylated regions – GSE69852

Claims in the paper: - Nearly half (42%) of the CpGs in human liver show a significant difference in methylation comparing fetal and adult samples. - 69% of the significant sites differed in their mean methylation beta value by ≤0.2.

Note on phenotype: The original meta data had ages of samples in different units (e.g. 55yr, 21wk) so we manually converted these to a single measurement (converted_age, float, years) so that linear regression could run. Alternatively, you could treat the two groups (55-56 years vs 20-22 weeks) as a phenotype with two groups (1 and 0) and apply logistic regression.

[1]:
import methylize as m
import pandas as pd
from pathlib import Path
df = pd.read_pickle(Path('data/GSE69852_beta_values.pkl'))
meta = pd.read_pickle(Path('data/GSE69852_GPL13534_meta_data.pkl'))
[2]:
res = m.diff_meth_pos(df.sample(100000), meta['converted_age'])
m.manhattan_plot(res, '450k')
bed = m.diff_meth_regions(res, '450k', prefix='data', plot=True)
INFO:methylize.diff_meth_pos:Converted your beta values into M-values; (6, 100000)
8 NaNs dropped
../_images/docs_example-GSE69852-DMP-DMR_2_3.png
INFO:methylprep.files.manifests:Reading manifest file: HumanMethylation450k_15017482_v3.csv
INFO:methylprep.files.manifests:Reading manifest file: HumanMethylation450k_15017482_v3.csv
Calculating ACF out to: 367
with 14 lags: [1, 31, 61, 91, 121, 151, 181, 211, 241, 271, 301, 331, 361, 391]
4873754 bases used as coverage for sidak correction
INFO:methylize.diff_meth_regions:wrote: data.regions-p.bed.gz, (regions with corrected-p < 0.05: 0)
../_images/docs_example-GSE69852-DMP-DMR_2_5.png
INFO:methylize.genome_browser:Loaded 24371 CpG regions from data_regions.csv.
INFO:methylize.genome_browser:Using cached `refGene`: /Users/mmaxmeister/methylize/methylize/data/refGene.pkl with (135634) genes
Mapping genes: 100%|██████████| 135634/135634 [00:59<00:00, 2277.89it/s]
INFO:methylize.genome_browser:Wrote data_regions_genes.csv

One would interpret the DMR plot to indicate that there were no significantly different methylated regions. Note, this example only ran 100000 of the 485,512 probes.

All data, logistic regression

[3]:
result = m.diff_meth_pos(df, meta['description'])
m.manhattan_plot(result, '450k')
bed_files = m.diff_meth_regions(result, '450k', prefix='data', plot=True)
INFO:methylize.diff_meth_pos:Converted your beta values into M-values; (6, 485512)
INFO:methylize.diff_meth_pos:Logistic regression: Phenotype (Normal Adult liver) was assigned to 0 and (Normal Fetal Liver) was assigned to 1.
218806 probes failed the logistic regression analysis due to perfect separation and could not be included in the final results.
13209 probes failed the logistic regression analysis due to LinearAlgebra error and could not be included in the final results.
148 probes failed the logistic regression analysis due to other unexplained reasons and could not be included in the final results.
Linear Algebra and Perfect Separation errors occur when the dataset is (a) too small to observe
events with low probabilities, or has (b) too many covariates in the model, leading to individual cell
sizes that are too small. Singular Matrix Errors occur when there is no variance in the predictors
(probe values and covariates).
27 NaNs dropped
../_images/docs_example-GSE69852-DMP-DMR_5_3.png
INFO:methylprep.files.manifests:Reading manifest file: HumanMethylation450k_15017482_v3.csv
INFO:methylprep.files.manifests:Reading manifest file: HumanMethylation450k_15017482_v3.csv
Calculating ACF out to: 50
with 3  lags: [1, 31, 61]
11192348 bases used as coverage for sidak correction
INFO:methylize.diff_meth_regions:wrote: data.regions-p.bed.gz, (regions with corrected-p < 0.05: 16)
../_images/docs_example-GSE69852-DMP-DMR_5_5.png
INFO:methylize.genome_browser:Loaded 17 CpG regions from data_regions.csv.
INFO:methylize.genome_browser:Using cached `refGene`: /Users/mmaxmeister/methylize/methylize/data/refGene.pkl with (135634) genes
Mapping genes: 100%|██████████| 135634/135634 [00:56<00:00, 2381.53it/s]
INFO:methylize.genome_browser:Wrote data_regions_genes.csv
[4]:
bed_files
[4]:
['data.args.txt',
 'data.acf.txt',
 'data.fdr.bed.gz',
 'data.slk.bed.gz',
 'data.regions-p.bed.gz',
 'data_regions.csv',
 'data_stats.csv',
 'data_regions_genes.csv',
 'data_dmp_stats.bed']
[5]:
bed = pd.read_csv('data.regions-p.bed.gz', sep='\t')
bed
[5]:
#chrom start end min_p n_probes z_p z_sidak_p
0 10 4034204 4034254 0.000000e+00 1 1.000000e-07 2.214000e-02
1 10 31031409 31031459 0.000000e+00 1 1.000000e-07 2.214000e-02
2 10 75210411 75210461 2.351000e-12 1 1.392000e-16 2.485000e-11
3 12 43840121 43840171 0.000000e+00 1 1.000000e-07 2.214000e-02
4 12 124453780 124453830 0.000000e+00 1 1.000000e-07 2.214000e-02
5 14 37592783 37592833 2.598000e-12 1 1.641000e-16 2.485000e-11
6 15 78810879 78810929 0.000000e+00 1 1.000000e-07 2.214000e-02
7 17 29005861 29005911 6.351000e-24 1 3.510000e-28 7.857000e-23
8 2 200911124 200911174 0.000000e+00 1 1.000000e-07 2.214000e-02
9 20 1470868 1470918 0.000000e+00 1 1.000000e-07 2.214000e-02
10 20 58445879 58445929 0.000000e+00 1 1.000000e-07 2.214000e-02
11 3 141066318 141066368 3.702000e-29 1 1.900000e-33 4.253000e-28
12 6 35150826 35150876 0.000000e+00 1 1.000000e-07 2.214000e-02
13 7 24854884 24854934 0.000000e+00 1 1.000000e-07 2.214000e-02
14 8 144335305 144335355 0.000000e+00 1 1.000000e-07 2.214000e-02
15 9 133804066 133804116 0.000000e+00 1 1.000000e-07 2.214000e-02