Module contents

methylize.diff_meth_pos(meth_data, pheno_data, regression_method='linear', q_cutoff=1, shrink_var=False, **kwargs)
This function searches for individual differentially methylated positions/probes (DMPs) by regressing the methylation M-value for each sample at a given genomic location against the phenotype data for those samples.
Phenotypes can be provided as a list of string-based or integer binary data, or as numeric continuous data.
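As a rough sketch of preparing the two inputs (this is not methylize's own code, and the helper names beta_to_m and encode_binary_phenotypes are hypothetical): the first helper applies the standard logit transform M = log2(beta / (1 - beta)) in case your data are beta values, and the second mimics the order-of-appearance rule for two-word string phenotypes given under pheno_data.

```python
import math

def beta_to_m(beta):
    """Convert a beta value (0 < beta < 1) to an M-value: log2(beta / (1 - beta))."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return math.log2(beta / (1.0 - beta))

def encode_binary_phenotypes(labels):
    """Map the first label seen to 0 and the second label seen to 1."""
    codes = {}
    for label in labels:
        if label not in codes:
            if len(codes) == 2:
                raise ValueError("more than two unique phenotype labels")
            codes[label] = len(codes)
    return [codes[label] for label in labels]

# Illustrative values only.
m_values = [beta_to_m(b) for b in (0.2, 0.5, 0.8)]
pheno = encode_binary_phenotypes(["control", "cancer", "control", "cancer"])
print(m_values)
print(pheno)
```

Because the encoding depends on which label appears first, reordering the samples can swap which group is coded as 1 and therefore flip the sign of the regression coefficient.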
 meth_data:
 A pandas dataframe of methylation M-values, where each column corresponds to a CpG site probe and each row corresponds to a sample.
 pheno_data:
 A list or one-dimensional numpy array of phenotypes for each sample row in meth_data. Methylprep creates a sample_sheet_meta_data.pkl file containing the phenotype data for this input; load it and specify which column to use as the pheno_data. Binary phenotypes can be presented as a list/array of zeroes and ones or as a list/array of strings made up of two unique words (e.g. “control” and “cancer”). The first string in pheno_data will be converted to zeroes, and the second string encountered will be converted to ones for the logistic regression analysis. Use numbers for phenotypes if running linear regression.
 column:
 if pheno_data is a DataFrame, column=’label’ will select one series to be used as the phenotype data.
 covariates: default []
 If pheno_data is a DataFrame, specify a list of series by column name to be used as the covariate data in the regression model. (Not yet implemented.)
 regression_method: (logistic | linear)
 Either the string “logistic” or the string “linear”, depending on the phenotype data available. Default: “linear”. Phenotypes with only two options (e.g. “control” and “cancer”) can be analyzed with a logistic regression. Continuous numeric phenotypes (e.g. age) are required to run a linear regression analysis.
 q_cutoff:
 Select a cutoff value to return only those DMPs that meet a particular significance threshold. Reported q-values are p-values corrected according to the model’s false discovery rate (FDR). Default: 1, which returns all DMPs regardless of significance.
 export:
 Default: False.
 If True or ‘csv’, saves a csv file with data.
 If ‘pkl’, saves a pickle file of the results as a dataframe.
 Use q_cutoff to limit what gets saved to only significant results.
 By default, q_cutoff == 1, which means everything is saved/reported/exported.
 filename:
 Specify a filename for the exported file. If not specified, the filename will be DMP_<number of probes in file>_<number of samples processed>_<current_date>.<pkl|csv>
 shrink_var:
 If True, variance shrinkage will be employed, squeezing variance using Bayes posterior means. Variance shrinkage is recommended when analyzing small datasets (n < 10). (NOT IMPLEMENTED YET)
 max_workers:
 (=INT) By default, probe processing is parallelized using all available cores. During testing, or when running in a virtual environment like CircleCI, Docker, or Lambda, the number of available cores can be fewer than the system’s reported CPU cores, and processing breaks. Use this to limit the available cores to some arbitrary number for testing or containerized usage.
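The idea behind max_workers can be sketched as follows. This is an illustration only, not the library's implementation: resolve_workers and process_probe are hypothetical names, and a thread pool stands in for whatever parallel backend methylize actually uses.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def resolve_workers(max_workers=None):
    """Return the number of workers to use, capped by the reported CPU count."""
    reported = os.cpu_count() or 1  # may exceed what a container really allows
    if max_workers is None:
        return reported
    return max(1, min(int(max_workers), reported))

def process_probe(values):
    """Placeholder per-probe computation: the mean of the probe's values."""
    return sum(values) / len(values)

probes = [[1.0, 2.0, 3.0], [0.5, 0.5], [4.0]]
# Assumption: a thread pool is used here purely for illustration.
with ThreadPoolExecutor(max_workers=resolve_workers(2)) as pool:
    results = list(pool.map(process_probe, probes))
print(results)
```

With methylize itself, the documented way to apply such a cap is to pass the max_workers keyword to diff_meth_pos, for example max_workers=2 in a container.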
 Returns:
A pandas dataframe of regression statistics with a row for each probe analyzed and columns listing the individual probe’s regression statistics of:
 regression coefficient
 lower limit of the coefficient’s 95% confidence interval
 upper limit of the coefficient’s 95% confidence interval
 standard error
 p-value (phenotype group A vs B; a smaller value means stronger evidence of a difference at this probe/location)
 q-value (p-values corrected for multiple testing using the Benjamini-Hochberg FDR method)
 FDR_QValue: p-value, adjusted for multiple comparisons
The rows are sorted by q-value in ascending order to list the most significant probes first. If q_cutoff is specified, only probes with q-values less than the cutoff will be returned in the dataframe.
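For readers unfamiliar with the Benjamini-Hochberg correction mentioned above, here is a minimal pure-Python sketch of the standard procedure. It is not methylize's code; the function name benjamini_hochberg and the probe names are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that a smaller p-value never
    # receives a larger q-value than the p-value ranked just above it.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues

# Illustrative p-values only.
pvals = {"probe_a": 0.01, "probe_b": 0.04, "probe_c": 0.03, "probe_d": 0.005}
qvals = dict(zip(pvals, benjamini_hochberg(list(pvals.values()))))
# Sort probes by q-value, most significant first.
ranked = sorted(qvals.items(), key=lambda item: item[1])
for probe, q in ranked:
    print(probe, round(q, 4))
```

Each q-value is the p-value scaled by n / rank, then made monotone, which is why several probes can share the same q-value.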
 If Progress Bar Missing:
 If you don’t see a progress bar in your JupyterLab notebook, try this:
 conda install -c conda-forge nodejs
 jupyter labextension install @jupyter-widgets/jupyterlab-manager

methylize.volcano_plot(stats_results, **kwargs)
This function draws a volcano plot from the pandas DataFrame output of diff_meth_pos, optionally saving it as a png image. The DataFrame has a row for every successfully tested probe and columns with different regression statistics as follows:
 regression coefficient
 lower limit of the coefficient’s 95% confidence interval
 upper limit of the coefficient’s 95% confidence interval
 standard error
 p-value
 q-value (p-values corrected for multiple testing using the Benjamini-Hochberg FDR method)
 stats_results (required):
 A pandas DataFrame output by the function diff_meth_pos.
 cutoff:
 Default: 0.05 (alpha level). The significance level that will be used to highlight the most significant adjusted p-values (FDR Q-values) on the plot.
 beta_coefficient_cutoff:
 Default: no cutoff. Format: a list or tuple with two numbers for (min, max). If specified in kwargs, regression coefficients within this range are excluded from being “significant”, and dotted vertical lines are drawn on the chart.
 visualization kwargs:
 palette – color pattern for plot – default is [blue, red, grey]
 other palettes: [‘default’, ‘Gray’, ‘Pastel1’, ‘Pastel2’, ‘Paired’, ‘Accent’, ‘Dark2’, ‘Set1’, ‘Set2’, ‘Set3’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘Gray2’, ‘Gray3’]
 width – figure width – default is 16
 height – figure height – default is 8
 fontsize – figure font size – default 16
 dotsize – figure dot size on chart – default 30
 border – plot border – default is OFF
 data_type_label – (e.g. Beta Values, M Values) – default is ‘Beta’
 save:
 If specified, exports the image in png format. By default, the function only displays a plot.
 filename:
 specify an export filename. default is volcano_<current_date>.png.
 Returns:
Displays a plot, but does not directly return an object. The data is color coded and displayed as follows:
 the negative log of adjusted p-values is plotted on the y-axis
 the regression coefficient beta value is plotted on the x-axis
 the significance cutoff level appears as a horizontal gray dashed line
 non-significant points appear in light gray
 significant points with positive correlations (hypermethylated probes) appear in red
 significant points with negative correlations (hypomethylated probes) appear in blue
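The color coding above, together with the cutoff and beta_coefficient_cutoff options, can be sketched as a small classifier. This is a sketch under stated assumptions, not the library's plotting code: classify_point is a hypothetical helper, the logarithm is assumed to be base 10, and the comparisons at the boundaries (strict for cutoff, inclusive for the coefficient range) are guesses.

```python
import math

def classify_point(coefficient, qvalue, cutoff=0.05, beta_coefficient_cutoff=None):
    """Return (x, y, color) for one probe on a volcano plot."""
    significant = qvalue < cutoff  # assumption: strict comparison
    if significant and beta_coefficient_cutoff is not None:
        low, high = beta_coefficient_cutoff
        # Coefficients inside the (min, max) range are not counted as significant.
        if low <= coefficient <= high:
            significant = False
    if not significant:
        color = "lightgray"
    elif coefficient > 0:
        color = "red"    # hypermethylated
    else:
        color = "blue"   # hypomethylated
    # Assumption: base-10 logarithm for the y-axis.
    return coefficient, -math.log10(qvalue), color

# Illustrative (coefficient, q-value) pairs only.
points = [
    classify_point(0.8, 0.001),
    classify_point(-0.6, 0.01),
    classify_point(0.3, 0.2),
    classify_point(0.1, 0.001, beta_coefficient_cutoff=(-0.2, 0.2)),
]
colors = [color for _, _, color in points]
print(colors)
```

The last point shows the effect of beta_coefficient_cutoff: its q-value passes the significance cutoff, but its small coefficient falls inside the excluded range, so it is drawn as non-significant.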

methylize.manhattan_plot(stats_results, **kwargs)
In EWAS Manhattan plots, epigenomic probe locations are displayed along the X-axis, with the negative logarithm of the association P-value for each single nucleotide polymorphism (SNP) displayed on the Y-axis, meaning that each dot on the Manhattan plot signifies a SNP. Because the strongest associations have the smallest P-values (e.g., 10^-15), their negative logarithms will be the greatest (e.g., 15).
 genomic coordinates along chromosomes vs epigenetic probe locations along chromosomes
 p-values are for the probe value associations, using linear or logistic regression, between phenotype A and B.
Hints of hidden heritability in GWAS. Nature 2010. (https://www.ncbi.nlm.nih.gov/pubmed/20581876)
 stats_results:
 A pandas DataFrame containing the stats_results from the linear/logistic regression run on m_values or beta_values and a pair of sample phenotypes. The DataFrame must contain a “PValue” column. The default output of diff_meth_pos() will work.
 save:
 If specified, exports the image in png format. By default, the function only displays a plot.
 filename:
 specify an export filename. default is volcano_<current_date>.png.
 verbose (True/False) – prints verbose messages; default is True if omitted.
 width – figure width – default is 16
 height – figure height – default is 8
 fontsize – figure font size – default 16
 border – plot border – default is OFF
 palette – specify one of a dozen options for colors of chromosome regions on plot: [‘default’, ‘Gray’, ‘Pastel1’, ‘Pastel2’, ‘Paired’, ‘Accent’, ‘Dark2’, ‘Set1’, ‘Set2’, ‘Set3’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘Gray2’, ‘Gray3’]
 cutoff – threshold p-value for where to draw a line on the plot (default: 5x10^-8 on plot, or p<=0.05)
 specify a number, such as 0.05.
 labelprefix – how to refer to chromosomes. By default, chromosome labels are prefixed with ‘CHR’, like CHR1 .. CHR22, X, and Y.
 Pass in ‘’ to remove the prefix, or rename with ‘c’ like: c01 … c22.
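To make the y-axis transform and the cutoff line concrete, here is a small sketch. It assumes a base-10 logarithm and a threshold of 5x10^-8; the helper neg_log10 and the probe names are hypothetical and not part of methylize.

```python
import math

def neg_log10(pvalue):
    """Y-axis value for a probe: the negative base-10 log of its p-value."""
    return -math.log10(pvalue)

# Illustrative p-values only.
pvalues = {"probe_1": 1e-15, "probe_2": 3e-9, "probe_3": 0.02}
cutoff = 5e-8                      # assumed threshold p-value for the line
line_height = neg_log10(cutoff)    # where the line sits on the y-axis
above_line = [name for name, p in pvalues.items() if neg_log10(p) > line_height]
print(round(line_height, 2), above_line)
```

A smaller cutoff raises the line, so fewer probes clear it; passing cutoff=0.05 instead would place the line at about 1.3 on the same axis.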