methylize package
methylize.diff_meth_pos module

methylize.diff_meth_pos.diff_meth_pos(meth_data, pheno_data, regression_method='linear', q_cutoff=1, shrink_var=False, **kwargs)
This function searches for individual differentially methylated positions/probes (DMPs) by regressing the methylation M-value for each sample at a given genomic location against the phenotype data for those samples.
Phenotypes can be provided as a list of string-based or integer binary data or as numeric continuous data.
 meth_data:
 A pandas DataFrame of methylation M-values in which each column corresponds to a CpG site probe and each row corresponds to a sample.
 pheno_data:
 A list or one-dimensional numpy array of phenotypes for each sample row in meth_data. Methylprep creates a sample_sheet_meta_data.pkl file containing the phenotype data for this input; load it and specify which column to use as pheno_data. Binary phenotypes can be presented as a list/array of zeroes and ones or as a list/array of strings made up of two unique words (e.g. “control” and “cancer”). The first string in pheno_data will be converted to zeroes, and the second string encountered will be converted to ones for the logistic regression analysis. Use numbers for phenotypes if running linear regression.
 column:
 If pheno_data is a DataFrame, column=’label’ selects one series to be used as the phenotype data.
 covariates: default []
 If pheno_data is a DataFrame, specify a list of series by column name to be used as the covariate data in the regression model. (NOT IMPLEMENTED YET)
 regression_method: (logistic | linear)
 Either the string “logistic” or the string “linear”, depending on the phenotype data available. Default: “linear”. Phenotypes with only two options (e.g. “control” and “cancer”) can be analyzed with a logistic regression. Continuous numeric phenotypes (e.g. age) are required to run a linear regression analysis.
 q_cutoff:
 Select a cutoff value to return only those DMPs that meet a particular significance threshold. Reported q-values are p-values corrected according to the model’s false discovery rate (FDR). Default: 1, which returns all DMPs regardless of significance.
 export:
 Default: False.
 If True or ‘csv’, saves a csv file of the results.
 If ‘pkl’, saves a pickle file of the results as a DataFrame.
 Use q_cutoff to limit what gets saved to only significant results. By default, q_cutoff == 1, which means everything is saved/reported/exported.
 filename:
 Specify a filename for the exported file. If not specified, the filename will be DMP_<number of probes in file>_<number of samples processed>_<current_date>.<pkl|csv>
 shrink_var:
 If True, variance shrinkage will be employed, squeezing variance using Bayes posterior means. Variance shrinkage is recommended when analyzing small datasets (n < 10). (NOT IMPLEMENTED YET)
 max_workers:
 (=INT) By default, probe processing is parallelized across all available cores. During testing, or when running in a virtual environment such as CircleCI, Docker, or Lambda, the number of available cores can be fewer than the system’s reported CPU cores, which causes processing to fail. Use this to limit the number of cores to some arbitrary number for testing or containerized usage.
 Returns:
A pandas DataFrame of regression statistics with a row for each probe analyzed and columns listing the individual probe’s regression statistics of:
 regression coefficient
 lower limit of the coefficient’s 95% confidence interval
 upper limit of the coefficient’s 95% confidence interval
 standard error
 p-value (phenotype group A vs. B: likelihood that the difference is significant for this probe/location)
 q-value (p-values corrected for multiple testing using the Benjamini-Hochberg FDR method)
 FDR_QValue: p-value, adjusted for multiple comparisons
The rows are sorted by q-value in ascending order to list the most significant probes first. If q_cutoff is specified, only probes with q-values less than the cutoff will be returned in the DataFrame.
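The correction, sorting, and cutoff steps described above can be sketched as follows. This is a minimal illustration of the standard Benjamini-Hochberg procedure written with numpy and pandas, not the library's own code; the helper name benjamini_hochberg and the probe IDs are hypothetical, while the “PValue” and “FDR_QValue” column names follow the names used in this documentation.

```python
import numpy as np
import pandas as pd


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)  # indices that sort the p-values ascending
    # Scale each sorted p-value by n / rank.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity: a q-value may not exceed any q-value at a larger rank.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(ranked, 0.0, 1.0)  # restore the original probe order
    return q


# Hypothetical per-probe p-values; the probe IDs are placeholders.
stats = pd.DataFrame(
    {"PValue": [0.001, 0.04, 0.03, 0.5]},
    index=["probe_a", "probe_b", "probe_c", "probe_d"],
)
stats["FDR_QValue"] = benjamini_hochberg(stats["PValue"])

# Keep only probes below the cutoff, most significant first.
q_cutoff = 0.05
significant = stats[stats["FDR_QValue"] < q_cutoff].sort_values("FDR_QValue")
```

With q_cutoff left at 1, every probe passes the filter, which matches the documented default of returning all DMPs.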
 If Progress Bar Missing:
 If you don’t see a progress bar in your JupyterLab notebook, try this:
 conda install -c conda-forge nodejs
 jupyter labextension install @jupyter-widgets/jupyterlab-manager
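The conversion of binary string phenotypes described under pheno_data can be sketched as below. This is an illustrative stand-in, assuming labels are mapped in order of first appearance as the text describes; the function name encode_binary_phenotypes is hypothetical and is not part of methylize.

```python
import numpy as np


def encode_binary_phenotypes(pheno_data):
    """Map two unique labels to 0 and 1 in order of first appearance."""
    labels = []
    for value in pheno_data:
        if value not in labels:
            labels.append(value)
    if len(labels) != 2:
        raise ValueError(
            f"expected exactly 2 unique phenotypes, found {len(labels)}"
        )
    # First label encountered becomes 0, second becomes 1.
    mapping = {labels[0]: 0, labels[1]: 1}
    return np.array([mapping[value] for value in pheno_data]), mapping


encoded, mapping = encode_binary_phenotypes(
    ["control", "cancer", "cancer", "control"]
)
```

Because the encoding depends on which label appears first, reordering the samples can flip the sign of the resulting regression coefficients.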

methylize.diff_meth_pos.is_interactive()
Determines whether the code is being run within a Jupyter notebook or as a script.

methylize.diff_meth_pos.linear_DMP_regression(probe_data, phenotypes)
This function performs a linear regression on a single probe’s worth of methylation data (in the form of M-values). It is called by detect_DMPs.
 probe_data:
 A pandas Series for a single probe with a methylation M-value for each sample in the analysis. The Series name corresponds to the probe ID, and the Series is extracted from the meth_data DataFrame through a parallelized loop in detect_DMPs.
 phenotypes:
 A numpy array of numeric phenotypes with one phenotype per sample (so it must be the same length as probe_data). This is the same object as the pheno_data input to detect_DMPs after it has been checked for data type and converted to the numpy array pheno_data_array.
 Returns:
A pandas Series of regression statistics for the single probe analyzed. The entries of the Series are as follows:
 regression coefficient
 lower limit of the coefficient’s 95% confidence interval
 upper limit of the coefficient’s 95% confidence interval
 standard error
 p-value
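The five statistics listed above can be computed for one probe as in the following sketch. It uses scipy.stats.linregress (a simple regression with an intercept) as a stand-in, since this documentation does not state which fitting routine the library uses; the function name linear_probe_stats and the Series labels are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats


def linear_probe_stats(probe_data, phenotypes, alpha=0.05):
    """Regress one probe's M-values on a numeric phenotype."""
    x = np.asarray(phenotypes, dtype=float)
    y = probe_data.to_numpy(dtype=float)
    fit = stats.linregress(x, y)
    # Two-sided t critical value with n - 2 degrees of freedom
    # (one each for the slope and the intercept).
    t_crit = stats.t.ppf(1 - alpha / 2, df=len(x) - 2)
    return pd.Series(
        {
            "Coefficient": fit.slope,
            "95%CI_lower": fit.slope - t_crit * fit.stderr,
            "95%CI_upper": fit.slope + t_crit * fit.stderr,
            "StandardError": fit.stderr,
            "PValue": fit.pvalue,
        },
        name=probe_data.name,  # keep the probe ID as the Series name
    )


# Hypothetical data: six samples with ages and one probe's M-values.
ages = np.array([20, 30, 40, 50, 60, 70])
m_values = pd.Series([-1.1, -0.8, -0.65, -0.3, -0.15, 0.2], name="probe_a")
row = linear_probe_stats(m_values, ages)
```

Returning a Series named after the probe lets the per-probe results be stacked into one DataFrame with probe IDs as the index.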

methylize.diff_meth_pos.logistic_DMP_regression(probe_data, phenotypes)
Runs parallelized. This function performs a logistic regression on a single probe’s worth of methylation data (in the form of M-values). It is called by detect_DMPs.
 probe_data:
 A pandas Series for a single probe with a methylation M-value for each sample in the analysis. The Series name corresponds to the probe ID, and the Series is extracted from the meth_data DataFrame through a parallelized loop in detect_DMPs.
 phenotypes:
 A numpy array of binary phenotypes with one phenotype per sample (so it must be the same length as probe_data). This is the same object as the pheno_data input to detect_DMPs after it has been checked for data type and converted to the numpy array pheno_data_binary.
 Returns:
A pandas Series of regression statistics for the single probe analyzed. The entries of the Series are as follows:
 regression coefficient
 lower limit of the coefficient’s 95% confidence interval
 upper limit of the coefficient’s 95% confidence interval
 standard error
 p-value
If the logistic regression was unsuccessful in fitting to the data due to a Perfect Separation Error (as may be the case with small sample sizes) or a Linear Algebra Error, the exception will be caught and the probe_stats_row output will contain dummy values to flag the error. Perfect Separation Errors are coded with the value -999 and Linear Algebra Errors are coded with the value -995. These rows are processed and removed in the next step of detect_DMPs to prevent them from interfering with the final analysis and p-value correction, and a list of the unsuccessful probes is printed to alert the user to the issues.
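The parallelized per-probe loop mentioned above, with a cap on the number of workers, can be sketched as follows. This is an assumption-laden illustration using the standard library's ThreadPoolExecutor rather than whatever mechanism the library uses internally; process_probes and probe_summary are hypothetical names, and probe_summary is only a placeholder for a real per-probe regression.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def probe_summary(probe_data):
    """Placeholder for a per-probe regression; returns a Series named after the probe."""
    return pd.Series(
        {"mean": probe_data.mean(), "std": probe_data.std()},
        name=probe_data.name,
    )


def process_probes(meth_data, func, max_workers=None):
    """Apply func to each probe column using at most max_workers threads."""
    columns = [meth_data[probe] for probe in meth_data.columns]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = list(pool.map(func, columns))  # map preserves column order
    # Each returned Series becomes one row, indexed by its probe ID.
    return pd.DataFrame(rows)


# Hypothetical input: rows are samples, columns are probes.
meth_data = pd.DataFrame(
    {"probe_a": [0.5, 1.0, 1.5], "probe_b": [1.0, 2.0, 3.0]},
    index=["sample_1", "sample_2", "sample_3"],
)
result = process_probes(meth_data, probe_summary, max_workers=2)
```

Passing a small max_workers value is the kind of limit that helps in containers where fewer cores are usable than the system reports.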

methylize.diff_meth_pos.manhattan_plot(stats_results, **kwargs)
In EWAS Manhattan plots, epigenomic probe locations are displayed along the X-axis, with the negative logarithm of the association P-value for each single nucleotide polymorphism (SNP) displayed on the Y-axis, meaning that each dot on the Manhattan plot signifies a SNP. Because the strongest associations have the smallest P-values (e.g., 10^-15), their negative logarithms will be the greatest (e.g., 15).
 genomic coordinates along chromosomes vs epigenetic probe locations along chromosomes
 p-values are for the probe value associations, using linear or logistic regression, between phenotype A and B.
Hints of hidden heritability in GWAS. Nature 2010. (https://www.ncbi.nlm.nih.gov/pubmed/20581876)
 stats_results:
 A pandas DataFrame containing the stats_results from the linear/logistic regression run on m_values or beta_values and a pair of sample phenotypes. The DataFrame must contain a “PValue” column. The default output of diff_meth_pos() will work.
 save:
 Specify that the plot be exported as an image in png format. By default, the function only displays a plot.
 filename:
 Specify an export filename. Default is volcano_<current_date>.png.
 verbose (True/False) – default is True (verbose messages) if omitted.
 width – figure width – default is 16
 height – figure height – default is 8
 fontsize – figure font size – default 16
 border – plot border – default is OFF
 palette – specify one of the following options for colors of chromosome regions on plot: [‘default’, ‘Gray’, ‘Pastel1’, ‘Pastel2’, ‘Paired’, ‘Accent’, ‘Dark2’, ‘Set1’, ‘Set2’, ‘Set3’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘Gray2’, ‘Gray3’]
 cutoff – threshold p-value at which to draw a line on the plot (default: 5x10^-8 on plot, or p<=0.05)
 specify a number, such as 0.05.
 labelprefix – how to refer to chromosomes. By default, chromosomes are labeled with the prefix ‘CHR’, like CHR1 .. CHR22, X, and Y.
 Pass in ‘’ to remove the prefix, or rename with ‘c’ like: c01 … c22.
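The Y-axis transformation and the cutoff line described above amount to a negative base-10 logarithm, as in this short sketch. Only the “PValue” column name comes from this documentation; the probe IDs, the minus_log10_p column, and the variable names are hypothetical, and no plotting is done here.

```python
import numpy as np
import pandas as pd

# Hypothetical results table with the required "PValue" column.
stats_results = pd.DataFrame(
    {"PValue": [1e-15, 1e-3, 0.2, 0.5]},
    index=["probe_a", "probe_b", "probe_c", "probe_d"],
)

# Height of each dot: the negative base-10 log of its p-value.
stats_results["minus_log10_p"] = -np.log10(stats_results["PValue"])

# A cutoff p-value becomes a horizontal line at the same transformed height.
cutoff = 0.05
cutoff_height = -np.log10(cutoff)

# Probes plotted above the line are those with p-values below the cutoff.
above_line = stats_results[stats_results["minus_log10_p"] > cutoff_height]
```

A stricter cutoff (a smaller p-value) moves the line higher, so fewer probes rise above it.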

methylize.diff_meth_pos.volcano_plot(stats_results, **kwargs)
This function plots the pandas DataFrame output of detect_DMPs as a volcano plot. The DataFrame has a row for every successfully tested probe and columns with different regression statistics as follows:
 regression coefficient
 lower limit of the coefficient’s 95% confidence interval
 upper limit of the coefficient’s 95% confidence interval
 standard error
 p-value
 q-value (p-values corrected for multiple testing using the Benjamini-Hochberg FDR method)
 stats_results (required):
 A pandas DataFrame output by the function detect_DMPs.
 cutoff:
 Default: 0.05 alpha level. The significance level that will be used to highlight the most significant adjusted p-values (FDR Q-values) on the plot.
 beta_coefficient_cutoff:
 Default: no cutoff. Format: a list or tuple with two numbers for (min, max). If specified in kwargs, will exclude values within this range of regression coefficients from being “significant” and put dotted vertical lines on the chart.
 visualization kwargs:
 palette – color pattern for plot – default is [blue, red, grey]
 other palettes: [‘default’, ‘Gray’, ‘Pastel1’, ‘Pastel2’, ‘Paired’, ‘Accent’, ‘Dark2’, ‘Set1’, ‘Set2’, ‘Set3’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘Gray2’, ‘Gray3’]
 width – figure width – default is 16
 height – figure height – default is 8
 fontsize – figure font size – default 16
 dotsize – figure dot size on chart – default 30
 border – plot border – default is OFF
 data_type_label – (e.g. Beta Values, M Values) – default is ‘Beta’
 save:
 Specify that the plot be exported as an image in png format. By default, the function only displays a plot.
 filename:
 Specify an export filename. Default is volcano_<current_date>.png.
 Returns:
Displays a plot, but does not directly return an object. The data is color coded and displayed as follows:
 the negative log of adjusted p-values is plotted on the y-axis
 the regression coefficient beta value is plotted on the x-axis
 the significance cutoff level appears as a horizontal gray dashed line
 non-significant points appear in light gray
 significant points with positive correlations (hypermethylated probes) appear in red
 significant points with negative correlations (hypomethylated probes) appear in blue
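The color-coding rules above, together with the cutoff and beta_coefficient_cutoff options, can be sketched as a classification step that runs before any plotting. This is an illustration only: classify_probes is a hypothetical helper, “Coefficient” is an assumed column name, and “FDR_QValue” follows the column name used in this documentation.

```python
import numpy as np
import pandas as pd


def classify_probes(stats_results, cutoff=0.05, beta_coefficient_cutoff=None):
    """Compute x, y, and color for each probe following the rules listed above."""
    q = stats_results["FDR_QValue"]
    coef = stats_results["Coefficient"]  # assumed column name
    significant = q < cutoff
    if beta_coefficient_cutoff is not None:
        low, high = beta_coefficient_cutoff
        # Coefficients inside the (min, max) range cannot be "significant".
        significant = significant & ((coef < low) | (coef > high))
    # Red for positive (hypermethylated), blue for negative (hypomethylated).
    colors = np.where(significant, np.where(coef > 0, "red", "blue"), "lightgray")
    return pd.DataFrame(
        {"x": coef, "y": -np.log10(q), "color": colors},
        index=stats_results.index,
    )


# Hypothetical regression results for four probes.
results = pd.DataFrame(
    {
        "Coefficient": [0.8, -0.6, 0.1, 0.9],
        "FDR_QValue": [0.001, 0.01, 0.01, 0.3],
    },
    index=["probe_a", "probe_b", "probe_c", "probe_d"],
)
plain = classify_probes(results)
banded = classify_probes(results, beta_coefficient_cutoff=(-0.5, 0.5))
```

In this sketch, probe_c passes the q-value cutoff but has a small coefficient, so it is colored only when no coefficient range is supplied.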
methylize.helpers module

methylize.helpers.load_color_schemes()

methylize.helpers.load_probe_chr_map()
Runs inside manhattan_plot and is only needed there, but it is useful to load once if the function is called multiple times.

methylize.helpers.map_to_genome(df, rgset)
Maps dataframe to genome locations.
 df: dataframe
 Dataframe containing methylation, unmethylation, M or Beta values for each sample at each site
 rgset: rg channel set instance
 RG channel set instance related to provided df
 Returns:
 df: dataframe
 Dataframe containing the original values with the addition of genomic locations for each site

methylize.helpers.readManifest(array)
DEPRECATED VERSION – Returns Illumina manifest for array type.
 array: str
 String specifying the type of Illumina Methylation Array
 Returns:
 manifest: dataframe
 Dataframe containing Illumina Human Methylation Array manifest