API Reference¶
methylize.diff_meth_pos (meth_data, pheno_data) |
This function searches for individual differentially methylated positions/probes (DMPs) by regressing the methylation M-value for each sample at a given genomic location against the phenotype data for those samples. |
methylize.diff_meth_regions (stats, …) |
Calculates and annotates diffentially methylated regions (DMR) using the combined-pvalues pipeline and returns list of output files. |
methylize.fetch_genes ([dmr_regions_file, …]) |
find genes that are adjacent to significantly different CpG regions provided. |
methylize.manhattan_plot (stats_results, …) |
In EWAS Manhattan plots, epigenomic probe locations are displayed along the X-axis, with the negative logarithm of the association P-value for each single nucleotide polymorphism (SNP) displayed on the Y-axis, meaning that each dot on the Manhattan plot signifies a SNP. |
methylize.volcano_plot (stats_results, **kwargs) |
This function writes the pandas DataFrame output of diff_meth_pos() to a CSV file named by the user. |
methylize.helpers.to_BED (stats, …[, save, …]) |
Converts & exports manifest and probe p-value dataframe to BED format. |
differentially methylated positions¶
-
methylize.diff_meth_pos.
diff_meth_pos
(meth_data, pheno_data, regression_method=None, impute='delete', **kwargs)[source]¶ This function searches for individual differentially methylated positions/probes (DMPs) by regressing the methylation M-value for each sample at a given genomic location against the phenotype data for those samples.
Phenotypes can be provided as a list of string-based or integer binary data or as numeric continuous data.
Input Parameters:
- meth_data:
- A pandas dataframe of methylation array data (as M-values) where each column corresponds to a CpG site probe and each row corresponds to a sample. IF a dataframe of beta-values is supplied instead, this function will detect this and convert to M-values before proceeding.
- pheno_data:
A list or one dimensional numpy array of phenotypes for each sample row in meth_data. Binary phenotypes can be presented as a list/array of zeroes and ones or as a list/array of strings made up of two unique words (i.e. “control” and “cancer”).
- linear regression requires a measurement for the phenotype
- note: methylprep creates a sample_sheet_meta_data.pkl file containing the phenotype
data for this input. You just need to load it and specify which column to be used as the pheno_data. - If pheno_data is a pandas DataFrame, you must specify a column and may optionally specify covariates from the other columns in the DataFrame. Numpy matrices are NOT supported.
- column:
- If pheno_data is a DataFrame, column=’label’ will select one series to be used as the phenotype data.
- covariates: (default None) use to define a more complex model for logistic regression.
- if pheno_data is not a DataFrame, covariates will be ignored.
- if pheno_data is a DataFrame, and column is specified, and covariates is None, it will ignore other columns in pheno_data.
- if covariates is a list of strings, only these column names will be used as the covariate data from pheno_data.
- If covariates is set to True, then all other columns will be treated as covariates in logistic regression.
- regression_method: (logistic | linear)
- Either the string “logistic” or the string “linear”
depending on the phenotype data available. - Phenotypes with only two options (e.g. “control” and “cancer”) can be analyzed with a logistic regression - Continuous numeric phenotypes (e.g. age) are required to run a linear regression analysis. - Default: auto-determine, using the data types found in pheno_data. If it is numeric with more than 2 different labels, use “linear.”
- impute:
Because methylprep output contains missing values by default, this function requires user to either delete or impute missing values.
- Default: ‘delete’ probes if ANY samples have missing data for that probe.
- True or ‘auto’: if <30 samples, deletes rows; if >=30 samples, uses average.
- False: don’t impute and throw an error if NaNs present
- ‘average’ - use the average of probe values in this batch
- ‘delete’ - drop probes if NaNs are present in any sample
- ‘fast’ - use adjacent sample probe value instead of average (much faster but less precise)
- alpha: float
- Default is 0.05 for all tests where it applies.
- fwer: float
- Set the familywise error rate (FWER). Default is 0.05, meaning that we expect 5% of all significant differences to be false positives. 0.1 (10%) is a more conservative convention for FWER.
- q_cutoff:
- Select a cutoff value (< 1.0) to return filtered results.
Only those DMPs that meet a particular significance threshold (e.g. 0.1) will be retained. Reported q-values are p-values corrected according to the model’s false discovery rate (FDR). - Default: 1 – returns all DMPs regardless of significance.
- export:
- default: False
- if True or ‘csv’, saves a csv file with data
- if ‘pkl’, saves a pickle file of the results as a dataframe.
- Use q_cutoff to limit what gets saved to only significant results.
By default, q_cutoff == 1 and this means everything is saved/reported/exported.
- filename:
- specify a filename for the exported file.
By default, if not specified, filename will be DMP_<number of probes in file>_<number of samples processed>_<current_date>.<pkl|csv>
- max_workers:
- (=INT) By default, this will parallelize probe processing, using all available cores. During testing, or when running in a virtual environment like circleci or docker or lambda, the number of available cores is fewer than the system’s reported CPU cores, and it breaks. Use this to limit the available cores to some arbitrary number for testing or containerized-usage.
- debug:
- Default: False – True turns on extra messages and runs the ‘solver’ in serial mode, disabling parallel processing of probes. This is slower but provides more detail for debugging code.
- solver:
You can force it to use a different implementation of regression, for debugging. Options include:
- ‘statsmodels_OLS’
- ‘linregress’ # from scipy
- [logit_DMP() is the default, based on statsmodels.api.Logit()]
- ‘statsmodels_GLM’ # for logistic regression
Returns:
A pandas dataframe of regression statistics with a row for each probe analyzed and columns listing the individual probe’s regression statistics of:
- regression coefficient
- lower limit of the coefficient’s 95% confidence interval
- upper limit of the coefficient’s 95% confidence interval
- standard error
- p-value (phenotype group A vs B - likelihood that the difference is significant for this probe/location)
- FDR_QValue: p-values corrected for multiple comparisons using the Benjamini-Hochberg FDR method
The rows are sorted by q-value in ascending order to list the most significant probes first. If q_cutoff is specified, only probes with significant q-values less than the cutoff will be returned in the dataframe.
Note
- If you get a RuntimeError: main thread is not in main loop error running
- diff_meth_pos, add debug=True to your function inputs. This error occurs in some command line environments when running this function repeatedly, and the parallel processing steps do not all close. Using debug mode forces it to analyze the probes serially, not parallel. It is a little slower but will always run without error under these conditions.
- If Progress Bar Missing:
- if you don’t see a progress bar in your jupyterlab notebook, try this: - conda install -c conda-forge nodejs - jupyter labextension install @jupyter-widgets/jupyterlab-manager
Todo
shrink_var feature: variance shrinkage / squeeze variance using Bayes posterior means. Variance shrinkage is recommended when analyzing small datasets (n < 10). Paper ref: http://www.uvm.edu/~rsingle/JournalClub/papers/Smyth-SAGMB-2004_eBayes+microarray.pdf [NOT IMPLEMENTED YET]
-
methylize.diff_meth_pos.
legacy_OLS
(probe_data, phenotypes, alpha=0.05)[source]¶ to use this, specify “statsmodels_OLS” in kwargs to diff_meth_pos() – this method gives the same result as the scipy.linregress method when tested in version 1.0.0
-
methylize.diff_meth_pos.
linear_DMP_regression
(probe_data, phenotypes, alpha=0.05)[source]¶ This function performs a linear regression on a single probe’s worth of methylation data (in the form of beta or M-values). It is called by the diff_meth_pos().
Inputs and Parameters:
- probe_data:
- A pandas Series for a single probe with a methylation M-value/beta-value for each sample in the analysis. The Series name corresponds to the probe ID, and the Series is extracted from the meth_data DataFrame through a parallellized loop in diff_meth_pos().
- phenotypes:
- A numpy array of numeric phenotypes with one phenotype per sample (so it must be the same length as probe_data). This is the same object as the pheno_data input to diff_meth_pos() after it has been checked for data type and converted to the numpy array pheno_data_array.
Returns:
A pandas Series of regression statistics for the single probe analyzed. The columns of regression statistics are as follows:
- regression coefficient (linregress pearson’s ‘r’)
- lower limit of the coefficient’s 95% confidence interval
- upper limit of the coefficient’s 95% confidence interval
- standard error
- p-value
-
methylize.diff_meth_pos.
logistic_DMP_regression
(probe_data, phenotypes, covariate_data=None, debug=False)[source]¶ - Runs parallelized.
- TESTED, and gives almost the same values as logit_DMP() (2022-02-25)
- This function performs a logistic regression on a single probe’s worth of methylation
data (in the form of beta/M-values). It is called by the diff_meth_pos().
Inputs and Parameters:
- probe_data:
- A pandas Series for a single probe with a methylation M-value or beta_value for each sample in the analysis. The Series name corresponds to the probe ID, and the Series is extracted from the meth_data DataFrame through a parallellized loop in diff_meth_pos().
- phenotypes:
- A numpy array of binary phenotypes with one phenotype per sample (so it must be the same length as probe_data). This is the same object as the pheno_data input to diff_meth_pos() after it has been checked for data type and converted to the numpy array pheno_data_binary.
Returns:
A pandas Series of regression statistics for the single probe analyzed. The columns of regression statistics are as follows:
- regression coefficient
- lower limit of the coefficient’s 95% confidence interval
- upper limit of the coefficient’s 95% confidence interval
- standard error
- p-value
If the logistic regression was unsuccessful in fitting to the data due to a Perfect Separation Error (as may be the case with small sample sizes) or a Linear Algebra Error, the exception will be caught and the probe_stats_row output will contain dummy values to flag the error. Perfect Separation Errors are coded with the value -999 and Linear Algebra Errors are coded with value -995. These rows are processed and removed in the next step of diff_meth_pos() to prevent them from interfering with the final analysis and p-value correction while printing a list of the unsuccessful probes to alert the user to the issues.
-
methylize.diff_meth_pos.
logit_DMP
(probe_data, phenotypes, covariate_data=None, debug=False)[source]¶ DEFAULT method, because tested and works. uses statsmodels.api.Logit pass in a Series for probe data without the constant added
fold_change is log2( (pheno1.mean / pheno0.mean) )
each covariate will be normalized (to 0…1 range) before running DMP.
-
methylize.diff_meth_pos.
manhattan_plot
(stats_results, array_type, **kwargs)[source]¶ In EWAS Manhattan plots, epigenomic probe locations are displayed along the X-axis, with the negative logarithm of the association P-value for each single nucleotide polymorphism (SNP) displayed on the Y-axis, meaning that each dot on the Manhattan plot signifies a SNP. Because the strongest associations have the smallest P-values (e.g., 10−15), their negative logarithms will be the greatest (e.g., 15).
- genomic coordinates along chromosomes vs epigenetic probe locations along chromosomes
- p-values are for the probe value associations, using linear or logistic regression,
between phenotype A and B.
Hints of hidden heritability in GWAS. Nature 2010. (https://www.ncbi.nlm.nih.gov/pubmed/20581876)- stats_results:
- a pandas DataFrame containing the stats_results from the linear/logistic regression run on m_values or beta_values and a pair of sample phenotypes. The DataFrame must contain A “PValue” column. the default output of diff_meth_pos() will work.
- array_type:
- specify the type of array [450k, epic, epic+, mouse, 27k], so that probes can be mapped to chromosomes.
- save:
- specify that it export an image in png format. By default, the function only displays a plot.
- filename:
- specify an export filename. The default is f”manhattan_<stats>_<timestamp>.png”.
- verbose (True/False) - default is True, verbose messages, if omitted.
- fwer (default 0.1) familywise error rate (fwer) is used to set p-value threshold and the FDR threshold line.
- genome_build – NEW or OLD. Default is NEWest genome_build.
- label-prefix – how to refer to chromosomes. By default, it shows numbers like 1 … 22, and X, Y.
- pass in ‘CHR-’ to add a prefix to plot labels, or rename with ‘c’ like: c01 … c22.
There are some preset “override” options for threshold lines on plots:
- explore: (default False) include all FOUR significance threshold lines on plot
- (suggestive, significant, bonferroni, and false discovery rate). Useful for data exploration.
- no_thresholds: (default False) set to True to hide all FOUR threshold lines from plot.
- plain: (default False) hide all lines AND don’t label significant probes.
- statsmode: (default False) show the FDR and Bonferroni thresholds, hide the suggestive and genomic significant lines.
These allow you to toggle lines/labels on or off:
- fdr: (default False) draw a threshold line on plot corresponding to the adjusted false discovery rate = 0.05
- bonferroni: (default False) draw the Bonferroni threshold, correcting for multiple comparisons.
- suggestive: (default 1e-5) draw the consensus “suggestive significance” theshold at p < 1e-5,
- or set to False to hide.
- significant: (default 5e-8) draw the consensus “genomic significance” theshold at p < 5e-8,
- or set to False to hide.
- plot_cutoff_label (default True) label to each dotted line on the plot
- label_sig_probes (default True) labels significant probes showing the greatest difference between groups.
Chart options:
- ymax – default: 50. Set to avoid plotting extremely high -10log(p) values.
- width – figure width – default is 16
- height – figure height – default is 8
- fontsize – figure font size – default 16
- border – plot border – default is OFF
- palette – specify one of a dozen options for colors of chromosome regions on plot: [‘default’, ‘Gray’, ‘Pastel1’, ‘Pastel2’, ‘Paired’, ‘Accent’, ‘Dark2’, ‘Set1’, ‘Set2’, ‘Set3’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘Gray2’, ‘Gray3’]
-
methylize.diff_meth_pos.
probe_corr_plot
(stats, group='sig', colorby='pval')[source]¶ - group=’sig’ is default (using PValue < 0.05)
- group=’chromosome’ also kinda works.
- colorby= pval or FDR; what to use to color the significant probes, if group=’sig’
-
methylize.diff_meth_pos.
volcano_plot
(stats_results, **kwargs)[source]¶ This function writes the pandas DataFrame output of diff_meth_pos() to a CSV file named by the user. The DataFrame has a row for every successfully tested probe and columns with different regression statistics as follows:
- regression coefficient
- lower limit of the coefficient’s 95% confidence interval
- upper limit of the coefficient’s 95% confidence interval
- standard error
- p-value
- FDR q-value (p-values corrected for multiple testing using the Benjamini-Hochberg FDR method)
Inputs and Parameters:
- stats_results (required):
- A pandas DataFrame output by the function diff_meth_pos().
- ‘alpha’:
- Default: 0.05, The significance level that will be used to highlight the most significant p-values (or adjusted FDR Q-values) on the plot.
- ‘cutoff’:
- the beta-coefficient cutoff | Default: None format: a list or tuple with two numbers for (min, max) or ‘auto’. If specified in kwargs, will exclude values within this range of regression coefficients OR fold-change range from being “significant” and put dotted vertical lines on chart. ‘auto’ will select a beta coefficient range that excludes 95% of results from appearing significant.
- ‘adjust’:
- (default False) – if this will adjust the p-value cutoff line for false discovery rate (Benjamini-Hochberg). Use ‘fwer’ to set the target rate.
- ‘fwer’:
- family-wise error rate (default is 0.1) – specify a probability [0 to 1.0] for false discovery rate
- ‘data_type_label’:
- What to put on X-axis. Either ‘Fold Change’ (default) or ‘Regression Coefficient’.
- visualization kwargs:
- palette – color pattern for plot – default is [blue, red, grey]
- other palettes: [‘default’, ‘Gray’, ‘Pastel1’, ‘Pastel2’, ‘Paired’, ‘Accent’, ‘Dark2’, ‘Set1’, ‘Set2’, ‘Set3’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘Gray2’, ‘Gray3’]
- width – figure width – default is 16
- height – figure height – default is 8
- fontsize – figure font size – default 16
- dotsize – figure dot size on chart – default 30
- border – plot border – default is OFF
- data_type_label – (e.g. Beta Values, M Values) – default is ‘Beta’
- plot_cutoff_label – add label to dotted line on plot – default False
- save:
- specify that it export an image in png format. By default, the function only displays a plot.
- filename:
- specify an export filename. default is volcano_<current_date>.png.
Returns:
Displays a plot, but does not directly return an object. The data is color coded and displayed as follows:
- the negative log of adjusted p-values is plotted on the y-axis
- the regression coefficient beta value is plotted on the x-axis
- the significance cutoff level appears as a horizontal gray dashed line
- non-significant points appear in light gray
- significant points with positive correlations (hypermethylated probes) appear in red
- significant points with negative correlations (hypomethylated probes) appear in blue
differentially methylated regions¶
-
methylize.diff_meth_regions.
diff_meth_regions
(stats, manifest_or_array_type, **kwargs)[source]¶ Calculates and annotates diffentially methylated regions (DMR) using the combined-pvalues pipeline and returns list of output files.
comb-p is a command-line tool and a python library that manipulates BED files of possibly irregularly spaced P-values and
- calculates auto-correlation,
- combines adjacent P-values,
- performs false discovery adjustment,
- finds regions of enrichment (i.e. series of adjacent low P-values) and
- assigns significance to those regions.
ref: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3496335/
Input Parameters:
- stats: dataframe
- dataframe output from diff_meth_pos()
- manifest_or_array_type: class instance or string
- pass in the manifest, or the name of the array
- filename:
- filename to prepend to the .BED file created.
creates a <filename>.bed output from diff_meth_pos and to_BED. genome_build: default ‘NEW’
by default, it uses the NEWer genome build. Each manifest contains two genome builds, marked “NEW” and “OLD”. To use the OLD build, se this to “OLD”.Computational Parameters:
- dist: int
- maximum distance from each probe to scan for adjacent peaks (80)
- acf-dist: int
- window-size for smoothing. Default is 1/3 of peak-dist (dist kwarg),
- step: int
- step size for bins in the ACF calculation (50)
- threshold: foat
- Extend regions after seeding if pvalue exceeds this value (default: same as seed)
- seed: float
- minimum size of a genomic region (0.05)
- no_fdr: bool
- don’t use FDR
- genome_control: bool
- correct input pvalues for genome control (inflation factor). This reduces the confounding effects of population stratitication in EWAS data.
Display/output Paramters:
- verbose: bool – default False
- Display additional processing information on screen.
- plot: bool – default False
- False will suppress the manhattan plot step.
- prefix: str
- prefix that gets appended to all output files (e.g. ‘dmr’ becomes ‘dmr_regions.csv’)
- region_filter_p:
- max adjusted region-level p-value in final output.
- region_filter_n:
- req at least this many probes in a region
- annotate: bool
- annotate with fetch_genes() function that uses UCSC refGene database to add “nearby” genes to differentially methylated regions in the output CSV. If you want to fine-tune the reference database, the tolerance of what “nearby” means, and other parameters, set this to false and call methylize.fetch_genes as a separate step on the ‘…_regions.csv’ output file.
- tissue: str
- if specified, adds additional columns to the annotation output with the expression levels for identified genes in any/all tissue(s) that match the keyword. (e.g. if your methylation samples are whole blood, specify tissue=blood) For all 54 tissues, use tissue=all
Returns:
- list
- A list of files created.
Fetch Genes¶
find genes that are adjacent to significantly different CpG regions provided.
Summary:
fetch_genes() annotates the DMR region output file, using the UCSC Genome Browser database as a reference as to what genes are nearby. This is an exploratory tool, as there are many versions of the human genome that map genes to slightly different locations.
fetch_genes() is an EXPLORATORY tool and makes a number of simplicifications:
- the DMR regions file saves one CpG probe name and location, even though clusters of probes may map to that nearby area.
- it measures the distance from the start position of the one representative probe per region to any nearby genes, using the `tol`erance parameter as the cutoff. Tolerance is the max number of base pairs of separation between the probe sequence start and the gene sequence start for it to be considered as a match.
- The default `tol`erance is 250, but that is arbitrary. Increase it to expand the search area, or decrease it to be more conservative. Remember that Illumina CpG probe sequences are 50 base pairs long, so 100 is nearly overlapping. 300 or 500 would also be reasonable.
- “Adjacent” in the linear sequence may not necessarily mean that the CpG island is FUNCTIONALLY coupled to the regulatory or coding region of the nearby protein. DNA superstructure can position regulatory elements near to a coding region that are far upstream or downstream from the mapped position, and there is no easy way to identify “adjacent” in this sense.
- Changing the `tol`erance, or the reference database will result major differences in the output, and thus one’s interpretation of the same data.
- Before interpreting these “associations” you should also consider filtering candidate genes by specific cell types where they are expressed. You should know the tissue from which your samples originated. And filter candidate genes to exclude those that are only expressed in your tissue during development, if your samples are from adults, and vice versa.
Arguments:
- dmr_regions_file:
- pass in the output file DataFrame or FILEPATH from DMR function. Omit if you specify the sql kwarg instead.
- ref: default is refGene
- use one of possible_tables for lookup: - ‘refGene’ – 88,819 genes – default table used in comb-b and cruzdb packages. - ‘knownGene’ – 232,184 genes – pseudo genes too (the “WHere TranscriptType == ‘coding_protein’” clause would work, but these fields are missing from the data returned.) - ‘ncbiRefSeq’ – 173,733 genes – this table won’t have gene descriptions, because it cannot be joined with the ‘kgXref’ (no shared key). Additionally, ‘gtexGeneV8’ is used for tissue-expression levels. Pseudogenes are ommited using the “WHERE score > 0” clause in the SQL.
- tol: default 250
- +/- this many base pairs consistutes a gene “related” to a CpG region provided.
- tissue: str
- if specified, adds additional columns to output with the expression levels for identified genes in any/all tissue(s) that match the keyword. (e.g. if your methylation samples are whole blood, specify tissue=blood) For all 54 tissues, use tissue=all
- genome_build: (None, NEW, OLD)
- Only the default human genome build, hg38, is currently supported. Even though many other builds are available in the UCSC database, most tables do not join together in the same way.
- use_cached:
- If True, the first time it downloads a dataset from UCSC Genome Browser, it will save to disk and use that local copy thereafter. To force it to use the online copy, set to False.
- no_sync:
- methylize ships with a copy of the relevant UCSC gene browser tables, and will auto-update these every month. If you want to run this function without accessing this database, you can avoid updating using the no_sync=True kwarg.
- host, user, password, db:
- Internal database connections for UCSC server. You would only need to mess with these of the server domain changes from the current hardcoded value {HOST}. Necessary for tables to be updated and for tissue annotation.
- sql:
- a DEBUG mode that bypasses the function and directly queries the database for any information the user wants. Be sure to specify the complete SQL statement, including the ref-table (e.g. refGene or ncbiRefSeq).
Note
This method flushes cache periodically. After 30 days, it deletes cached reference gene tables and re-downloads.
Manhattan Plot¶
In EWAS Manhattan plots, epigenomic probe locations are displayed along the X-axis, with the negative logarithm of the association P-value for each single nucleotide polymorphism (SNP) displayed on the Y-axis, meaning that each dot on the Manhattan plot signifies a SNP. Because the strongest associations have the smallest P-values (e.g., 10−15), their negative logarithms will be the greatest (e.g., 15).
GWAS vs EWAS¶
- genomic coordinates along chromosomes vs epigenetic probe locations along chromosomes
- p-values are for the probe value associations, using linear or logistic regression,
between phenotype A and B.
Ref¶
Hints of hidden heritability in GWAS. Nature 2010. (https://www.ncbi.nlm.nih.gov/pubmed/20581876)
Required Inputs¶
- stats_results:
- a pandas DataFrame containing the stats_results from the linear/logistic regression run on m_values or beta_values and a pair of sample phenotypes. The DataFrame must contain A “PValue” column. the default output of diff_meth_pos() will work.
- array_type:
- specify the type of array [450k, epic, epic+, mouse, 27k], so that probes can be mapped to chromosomes.
output kwargs¶
- save:
- specify that it export an image in png format. By default, the function only displays a plot.
- filename:
- specify an export filename. The default is f”manhattan_<stats>_<timestamp>.png”.
visualization kwargs¶
- verbose (True/False) - default is True, verbose messages, if omitted.
- fwer (default 0.1) familywise error rate (fwer) is used to set p-value threshold and the FDR threshold line.
- genome_build – NEW or OLD. Default is NEWest genome_build.
- label-prefix – how to refer to chromosomes. By default, it shows numbers like 1 … 22, and X, Y.
- pass in ‘CHR-’ to add a prefix to plot labels, or rename with ‘c’ like: c01 … c22.
There are some preset “override” options for threshold lines on plots:
- explore: (default False) include all FOUR significance threshold lines on plot
- (suggestive, significant, bonferroni, and false discovery rate). Useful for data exploration.
- no_thresholds: (default False) set to True to hide all FOUR threshold lines from plot.
- plain: (default False) hide all lines AND don’t label significant probes.
- statsmode: (default False) show the FDR and Bonferroni thresholds, hide the suggestive and genomic significant lines.
These allow you to toggle lines/labels on or off:
- fdr: (default False) draw a threshold line on plot corresponding to the adjusted false discovery rate = 0.05
- bonferroni: (default False) draw the Bonferroni threshold, correcting for multiple comparisons.
- suggestive: (default 1e-5) draw the consensus “suggestive significance” theshold at p < 1e-5,
- or set to False to hide.
- significant: (default 5e-8) draw the consensus “genomic significance” theshold at p < 5e-8,
- or set to False to hide.
- plot_cutoff_label (default True) label to each dotted line on the plot
- label_sig_probes (default True) labels significant probes showing the greatest difference between groups.
Chart options:
- ymax – default: 50. Set to avoid plotting extremely high -10log(p) values.
- width – figure width – default is 16
- height – figure height – default is 8
- fontsize – figure font size – default 16
- border – plot border – default is OFF
- palette – specify one of a dozen options for colors of chromosome regions on plot: [‘default’, ‘Gray’, ‘Pastel1’, ‘Pastel2’, ‘Paired’, ‘Accent’, ‘Dark2’, ‘Set1’, ‘Set2’, ‘Set3’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘Gray2’, ‘Gray3’]
Volcano Plot¶
This function writes the pandas DataFrame output of diff_meth_pos() to a CSV file named by the user. The DataFrame has a row for every successfully tested probe and columns with different regression statistics as follows:
- regression coefficient
- lower limit of the coefficient’s 95% confidence interval
- upper limit of the coefficient’s 95% confidence interval
- standard error
- p-value
- FDR q-value (p-values corrected for multiple testing using the Benjamini-Hochberg FDR method)
Inputs and Parameters:
- stats_results (required):
- A pandas DataFrame output by the function diff_meth_pos().
- ‘alpha’:
- Default: 0.05, The significance level that will be used to highlight the most significant p-values (or adjusted FDR Q-values) on the plot.
- ‘cutoff’:
- the beta-coefficient cutoff | Default: None format: a list or tuple with two numbers for (min, max) or ‘auto’. If specified in kwargs, will exclude values within this range of regression coefficients OR fold-change range from being “significant” and put dotted vertical lines on chart. ‘auto’ will select a beta coefficient range that excludes 95% of results from appearing significant.
- ‘adjust’:
- (default False) – if this will adjust the p-value cutoff line for false discovery rate (Benjamini-Hochberg). Use ‘fwer’ to set the target rate.
- ‘fwer’:
- family-wise error rate (default is 0.1) – specify a probability [0 to 1.0] for false discovery rate
- ‘data_type_label’:
- What to put on X-axis. Either ‘Fold Change’ (default) or ‘Regression Coefficient’.
- visualization kwargs:
- palette – color pattern for plot – default is [blue, red, grey]
- other palettes: [‘default’, ‘Gray’, ‘Pastel1’, ‘Pastel2’, ‘Paired’, ‘Accent’, ‘Dark2’, ‘Set1’, ‘Set2’, ‘Set3’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’, ‘Gray2’, ‘Gray3’]
- width – figure width – default is 16
- height – figure height – default is 8
- fontsize – figure font size – default 16
- dotsize – figure dot size on chart – default 30
- border – plot border – default is OFF
- data_type_label – (e.g. Beta Values, M Values) – default is ‘Beta’
- plot_cutoff_label – add label to dotted line on plot – default False
- save:
- specify that it export an image in png format. By default, the function only displays a plot.
- filename:
- specify an export filename. default is volcano_<current_date>.png.
Returns:
Displays a plot, but does not directly return an object. The data is color coded and displayed as follows:
- the negative log of adjusted p-values is plotted on the y-axis
- the regression coefficient beta value is plotted on the x-axis
- the significance cutoff level appears as a horizontal gray dashed line
- non-significant points appear in light gray
- significant points with positive correlations (hypermethylated probes) appear in red
- significant points with negative correlations (hypomethylated probes) appear in blue
To BED¶
Converts & exports manifest and probe p-value dataframe to BED format. * https://en.wikipedia.org/wiki/BED_(file_format)
- BED format: [ chromosome number | start position | end position | p-values]
Where p-values are the output from diff_meth_pos() comparing probes across two or more groups of samples for genomic differences in methylation.
This output is required for combined-pvalues library to read and annotate manhattan plots with the nearest Gene(s) for each significant CpG cluster.
- manifest_or_array_type:
- either pass in a Manifest instance from methylprep, or a string that defines which manifest to load. One of {‘27k’, ‘450k’, ‘epic’, ‘epic+’, ‘mouse’}.
- genome_build:
- pass in ‘OLD’ to use the older genome build for each respective manifest array type.
note: if manifest has probes that aren’t mapped to genome, they are omitted in BED file.
TODO: incorporate STRAND and OLD_STRAND in calculations.
returns a BED formatted dataframe if save is False, or the saved filename if save is True.