Input IDs
Select Organism
Result
Input IDs
Select Organism
Convert to
Result
Input probe IDs
Convert to
Result
Input IDs
Basic Parameters
Advanced Parameters
Result
Upload gene list
Basic Parameters
Advanced Parameters
Result
Upload ORA Result
Basic Parameters
Advanced Parameters
Plot
Upload GSEA Result
Basic Parameters
Advanced Parameters
Plot
Select Group Number
Parameters
Plot
Upload DEG Result
Basic Parameters
Advanced Parameters
Plot
- Input IDs [separators `\n`, `;`, `,`, and spaces are accepted]
- Select organism (`common name - latin name`)

- Input IDs [separators `\n`, `;`, `,`, and spaces are accepted]
- Select organism (`common name - latin name`)
- Convert to one or more conversion types (`symbol`/`entrez`/`ensembl`/`uniprot`)

For now, only human probe IDs are supported.

- Input probe IDs [separators `\n`, `;`, `,`, and spaces are accepted]
- Convert to one or more conversion types (`symbol`/`entrez`/`ensembl`/`uniprot`)

- Input IDs [separators `\n`, `;`, `,`, and spaces are accepted]
- Organism (`common name - latin name`)
- Simplify GO terms: if turned on, redundancy-reduced GO terms are returned (the user can choose one of five semantic similarity methods)
- P-value & Padj-value cutoff: statistical significance cutoff applied to both the P-value and the adjusted P-value (e.g. with a cutoff of 0.05, the result keeps terms with p-value < 0.05 and p.adj < 0.05)
- Q-value cutoff: statistical significance cutoff for the Q-value (also called FDR adjusted P-value)
- P-value adjust method: the P-value adjustment method
- Minimal geneset size: gene sets with fewer genes than this size are filtered out
- Maximal geneset size: gene sets with more genes than this size are filtered out
- Download link: click to save the result as an Excel file
- View result: dynamic data frame with clickable links
Attribute | Source | Link |
---|---|---|
GO_ID | QuickGO | https://www.ebi.ac.uk/QuickGO/annotations |
KEGG_ID | Kyoto Encyclopedia of Genes and Genomes | https://www.genome.jp/kegg/ |
MeSH_ID | Medical Subject Headings | https://www.ncbi.nlm.nih.gov/mesh |
WikiPathways_ID | WikiPathways | https://wikipathways.org/ |
MSigDB_ID | MSigDB | https://www.gsea-msigdb.org/gsea/msigdb/cards |
Reactome_ID | Reactome Pathway Database | https://reactome.org/ |
DO_ID | Disease Ontology | https://disease-ontology.org/ |
NCG_ID | Network of Cancer Genes database | http://ncg.kcl.ac.uk |
DisGeNET_ID | DisGeNET database | https://www.disgenet.org/ |
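The input rule described earlier for pasted IDs (newlines, semicolons, commas, and spaces all act as separators) can be sketched as follows. This is only an illustration: the function name `parse_ids` is hypothetical, and dropping repeated IDs is an assumption made here, not documented behavior of the app.

```python
import re


def parse_ids(raw: str) -> list[str]:
    """Split pasted IDs on newlines, semicolons, commas, or spaces."""
    tokens = re.split(r"[\s;,]+", raw.strip())
    seen = set()
    ids = []
    for tok in tokens:
        # Skip empty tokens and (by assumption) repeated IDs
        if tok and tok not in seen:
            seen.add(tok)
            ids.append(tok)
    return ids


print(parse_ids("TP53, BRCA1;EGFR\nMYC  TP53"))
```

Mixed separators in one input are handled by a single regular expression, so users do not have to clean their list before pasting it.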
Take GO (Gene Ontology) enrichment analysis as an example:
- `GO ID`: formatted as `organism_ontology_ID`
- `EnrichedNum`: number of enriched IDs, that is, the `k` in `GeneRatio`
- `InputNum`: number of input IDs
- `GeneRatio` = `k/n`
  - `n` = size of the overlap between the input gene IDs and all members of the gene set collection (e.g. the KEGG pathway collection), counting unique IDs only. In other words, it is the size of the list of genes of interest.
  - `k` = size of the overlap between the input gene IDs and the specific gene set (e.g. `hsa04110: Cell cycle`), counting unique IDs only. In other words, it is the number of genes within that list which are annotated to the gene set.
- `BgRatio` = `M/N`
  - `N` = number of unique genes in the gene set collection (e.g. the KEGG pathway collection). In other words, it is the total number of genes in the background distribution (universal genes).
  - `M` = size of the gene set (e.g. the size of `hsa04110: Cell cycle`). In other words, it is the number of genes within that distribution that are annotated (either directly or indirectly) to the node of interest.
- `Fold Enrichment` = `GeneRatio` / `BgRatio`
- `Rich Factor` = number of genes enriched in the specific term / number of all genes in the specific term (that is, `k/M`)
- `geneID`: input gene/protein ID
- `geneID_symbol`: symbol converted from the input ID
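The ratio definitions above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `ora_ratios` and the example counts are made up, and it covers the ratios but not the significance test itself.

```python
from fractions import Fraction


def ora_ratios(k: int, n: int, M: int, N: int) -> dict:
    """Compute the ORA ratio columns from the four counts k, n, M, N."""
    gene_ratio = Fraction(k, n)   # share of the input list annotated to the term
    bg_ratio = Fraction(M, N)     # share of the background annotated to the term
    return {
        "GeneRatio": f"{k}/{n}",
        "BgRatio": f"{M}/{N}",
        "FoldEnrichment": float(gene_ratio / bg_ratio),
        "RichFactor": float(Fraction(k, M)),
    }


# Hypothetical counts: 10 of 100 input genes hit a 50-gene term,
# with 1000 genes in the background.
stats = ora_ratios(k=10, n=100, M=50, N=1000)
print(stats)
```

With these counts the term covers 10% of the input list but only 5% of the background, so the fold enrichment is 2, and 10 of the term's 50 genes were found, giving a rich factor of 0.2.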
Why do we need both p.adjust and qvalue?
Both adjusted p-values and q-values account for multiple testing, but they are computed in different ways.
The adjusted p-value is a p-value corrected for the number of tests performed; a result is called significant when its adjusted p-value falls below the chosen cutoff. What it controls depends on the correction method. Bonferroni, Holm, Hochberg, and Hommel control the family-wise error rate (FWER), which is the probability of making at least one false positive among all tests performed. Benjamini-Hochberg (BH) and Benjamini-Yekutieli (BY) control the false discovery rate (FDR), which is the expected proportion of false positives among all significant results.
The `p.adjust` function in R adjusts p-values for multiple comparisons using one of these methods.
The q-value of a test is the minimum FDR at which that test would be called significant. It is obtained by a different procedure, such as the Storey-Tibshirani procedure, applied to the p-values.
The `qvalue` package in R provides a method for estimating q-values from p-values, which can be used to identify significant features in high-dimensional data such as gene expression microarrays. The `qvalue` function estimates the proportion of true null hypotheses among all tests and uses this estimate to calculate q-values.
If you use a different P-value adjust method, the value of p.adjust may change, but the value of qvalue will remain the same.
In practice, the q-value is often preferred over the adjusted p-value because it is a direct estimate of the FDR: if all results with q-value below 0.05 are called significant, about 5% of them are expected to be false positives. An adjusted p-value from an FWER method is more conservative and does not describe that proportion.
However, it is still important to report both the adjusted p-value and the q-value because they serve different purposes. The adjusted p-value is useful for determining whether a result is statistically significant under the chosen correction method, while the q-value is useful for estimating the FDR and selecting a threshold for significance.
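To make the difference concrete, here is a small pure-Python sketch of two adjustment methods and a simplified q-value. It is not the code used by R: `bonferroni` and `benjamini_hochberg` follow the standard textbook definitions, while `storey_qvalues` is a deliberately simplified stand-in that estimates the proportion of true nulls from a single fixed `lam` value (an assumption made here; the `qvalue` package uses a more elaborate estimate). All function names and the example p-values are hypothetical.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """BH step-up adjustment, returned in the original order."""
    m = len(pvals)
    # Walk from the largest p-value to the smallest, keeping a running minimum
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, i in enumerate(order):
        rank = m - pos  # 1-based rank in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def storey_qvalues(pvals, lam=0.5):
    """Simplified q-values: pi0 from one fixed lambda, then pi0 * BH."""
    m = len(pvals)
    # pi0 estimates the proportion of true null hypotheses among all tests
    pi0 = min(1.0, sum(p > lam for p in pvals) / (m * (1.0 - lam)))
    if pi0 <= 0:
        raise ValueError("pi0 estimate is zero; try a different lambda")
    return [pi0 * q for q in benjamini_hochberg(pvals)]


pvals = [0.001, 0.01, 0.02, 0.2, 0.4, 0.6, 0.7, 0.9]
for name, func in [("bonferroni", bonferroni),
                   ("BH", benjamini_hochberg),
                   ("qvalue", storey_qvalues)]:
    print(name, [round(v, 4) for v in func(pvals)])
```

In this sketch the q-value equals the BH-adjusted p-value scaled by the estimated proportion of true nulls, so it is never larger than the BH value. Switching between `bonferroni` and `benjamini_hochberg` changes the adjusted p-values but leaves the q-values untouched, which mirrors the behavior described above.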
Introduction of methods (according to the R function `stats::p.adjust`):
- Benjamini & Hochberg (`"BH"`) or False discovery rate (`"FDR"`): Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289–300. doi:10.1111/j.2517-6161.1995.tb02031.x. https://www.jstor.org/stable/2346101.
- Holm (`"holm"`): Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70. https://www.jstor.org/stable/4615733.
- Hochberg (`"hochberg"`): Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800–803. doi:10.2307/2336325.
- Hommel (`"hommel"`): Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75, 383–386. doi:10.2307/2336190.
- Benjamini & Yekutieli (`"BY"`): Benjamini, Y., and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29, 1165–1188. doi:10.1214/aos/1013699998.
- No adjustment (`"none"`)

Little naming story about FDR and BH:
Originally, `p.adjust()` didn’t have any FDR methods. Then Benjamini and Hochberg’s method was added. At that time, it was the only FDR method, so it was just called “fdr”. Later on, Benjamini & Yekutieli’s method was added as well, so new names were needed. The new method was called “BY” and the old method was renamed to “BH”. The old name “fdr” was kept for backward compatibility, which is how the same method ended up with two names.
Upload file (`.txt`/`.csv`/`.xlsx`)
NOTICE: The file should have two columns:
- First column: gene IDs (Gene Symbol or Alias, Entrez, Ensembl)
- Second column: logFC values, sorted in decreasing order
Example: click to load the example file
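A two-column ranked list in the layout described above could be prepared like this. The sketch is only an illustration: the function name `write_ranked_list`, the tab delimiter, and the absence of a header row are assumptions, since the source does not state them.

```python
import csv
import io


def write_ranked_list(gene_logfc: dict[str, float]) -> str:
    """Return tab-separated 'gene<TAB>logFC' lines, largest logFC first."""
    ranked = sorted(gene_logfc.items(), key=lambda kv: kv[1], reverse=True)
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for gene, logfc in ranked:
        writer.writerow([gene, logfc])
    return buf.getvalue()


# Hypothetical logFC values for three genes
text = write_ranked_list({"MYC": 2.1, "TP53": -1.4, "EGFR": 0.3})
print(text)
```

Sorting before writing matters because GSEA walks the list from the most up-regulated gene to the most down-regulated one.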
- Organism (`common name - latin name`)
- Simplify GO terms: if turned on, redundancy-reduced GO terms are returned (the user can choose one of five semantic similarity methods)
- P-value & Padj-value cutoff: statistical significance cutoff applied to both the P-value and the adjusted P-value (e.g. with a cutoff of 0.05, the result keeps terms with p-value < 0.05 and p.adj < 0.05)
- Q-value cutoff: statistical significance cutoff for the Q-value (also called FDR adjusted P-value)
- P-value adjust method: the P-value adjustment method
- Minimal geneset size: gene sets with fewer genes than this size are filtered out
- Maximal geneset size: gene sets with more genes than this size are filtered out
- Download link: click to save the result as an Excel file
- View result: dynamic data frame with clickable links
Attribute | Source | Link |
---|---|---|
GO_ID | QuickGO | https://www.ebi.ac.uk/QuickGO/annotations |
KEGG_ID | Kyoto Encyclopedia of Genes and Genomes | https://www.genome.jp/kegg/ |
MeSH_ID | Medical Subject Headings | https://www.ncbi.nlm.nih.gov/mesh |
WikiPathways_ID | WikiPathways | https://wikipathways.org/ |
MSigDB_ID | MSigDB | https://www.gsea-msigdb.org/gsea/msigdb/cards |
Reactome_ID | Reactome Pathway Database | https://reactome.org/ |
DO_ID | Disease Ontology | https://disease-ontology.org/ |
NCG_ID | Network of Cancer Genes database | http://ncg.kcl.ac.uk |
DisGeNET_ID | DisGeNET database | https://www.disgenet.org/ |
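The cutoff and gene set size parameters described earlier combine into a simple filter. The sketch below only illustrates that logic on result rows that already exist: the function name `filter_terms`, the row keys, and the default values are hypothetical, and the app may apply the size limits before testing rather than afterwards.

```python
def filter_terms(rows, p_cutoff=0.05, q_cutoff=0.05, min_size=10, max_size=500):
    """Keep rows that pass the size limits and all three significance cutoffs."""
    kept = []
    for row in rows:
        # Gene sets smaller than min_size or larger than max_size are dropped
        if not (min_size <= row["set_size"] <= max_size):
            continue
        # One cutoff is shared by the raw and the adjusted p-value
        passes_p = row["pvalue"] < p_cutoff and row["p_adjust"] < p_cutoff
        if passes_p and row["qvalue"] < q_cutoff:
            kept.append(row)
    return kept


# Hypothetical result rows
rows = [
    {"id": "term_a", "set_size": 120, "pvalue": 0.001, "p_adjust": 0.01, "qvalue": 0.008},
    {"id": "term_b", "set_size": 5, "pvalue": 0.001, "p_adjust": 0.01, "qvalue": 0.008},
    {"id": "term_c", "set_size": 80, "pvalue": 0.02, "p_adjust": 0.20, "qvalue": 0.15},
]
print([row["id"] for row in filter_terms(rows)])
```

Here `term_b` is removed for being too small and `term_c` fails the adjusted p-value cutoff even though its raw p-value is below 0.05.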
Take MSigDB GSEA as an example:
For more information, please visit the GSEA manual.
- `Description`: gene set name (click the link to view more details)
- `InputSize`: size of the input gene list, counting unique genes only
- `EnrichedSize`: number of enriched IDs
- `setSize`: number of genes in the gene set that have gene-level statistic values. For example, if a pathway gene set contains 58 genes (e.g. HALLMARK_MYC_TARGETS_V2) but the input gene list contains only 54 of them, the result shows setSize = 54.
- `enrichmentScore`: also called `ES`, same as in the Broad GSEA implementation. It reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes.
- `NES`: the normalized enrichment score is the primary statistic for examining gene set enrichment results. By normalizing the enrichment score, GSEA accounts for differences in gene set size and in correlations between gene sets and the expression dataset. Therefore, NES can be used to compare analysis results across gene sets.
- `rank`: the position in the ranked list at which the maximum enrichment score occurred. If a gene set achieves its maximum enrichment score near the top or bottom of the ranked list, the rank at max is either very small or very large.
- `leading_edge`: includes three statistics
  - `tags`: the percentage of gene hits before (for positive ES) or after (for negative ES) the peak in the running enrichment score, which indicates the percentage of genes contributing to the enrichment score.
  - `list`: the percentage of genes in the ranked gene list before (for positive ES) or after (for negative ES) the peak in the running enrichment score. This gives an indication of where in the list the enrichment score is attained.
  - `signal`: if the gene set is entirely within the first N positions in the list, then the signal strength is maximal or 100%. If the gene set is spread throughout the list, then the signal strength decreases towards 0%.
- `geneID` and `geneID_symbol`: the same as in the ORA result
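The relationship between `enrichmentScore`, `rank`, and the three `leading_edge` statistics can be sketched with a weighted running sum. This is a simplified illustration, not the implementation behind the app: the function name `running_enrichment` and the toy data are hypothetical, the weighting exponent of 1 and the signal formula `tags * (1 - list) * N / (N - Nh)` follow the commonly cited Broad GSEA definitions, and the exact handling of the peak position for negative scores is an assumption. NES is not computed because it requires permutations.

```python
def running_enrichment(ranked, gene_set, weight=1.0):
    """ranked: list of (gene, score) pairs sorted by decreasing score."""
    n_total = len(ranked)
    hits = [gene in gene_set for gene, _ in ranked]
    n_hits = sum(hits)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit_total = sum(abs(s) ** weight for (_, s), h in zip(ranked, hits) if h)
    if hit_total == 0:
        raise ValueError("all hit scores are zero")
    miss_step = 1.0 / (n_total - n_hits)

    # Step up at each hit (weighted by its score) and down at each miss
    running, current = [], 0.0
    for (_, score), is_hit in zip(ranked, hits):
        if is_hit:
            current += abs(score) ** weight / hit_total
        else:
            current -= miss_step
        running.append(current)

    # ES is the running-sum value with the largest absolute deviation from zero
    peak = max(range(n_total), key=lambda i: abs(running[i]))
    es = running[peak]
    if es >= 0:
        edge_hits = sum(hits[: peak + 1])          # hits up to the peak
        list_frac = (peak + 1) / n_total
    else:
        edge_hits = sum(hits[peak + 1:])           # hits after the peak
        list_frac = (n_total - peak - 1) / n_total
    tags = edge_hits / n_hits
    signal = tags * (1 - list_frac) * n_total / (n_total - n_hits)
    return {"ES": es, "rank": peak + 1, "tags": tags,
            "list": list_frac, "signal": signal}


# Toy ranked list (hypothetical genes and scores)
ranked = [("A", 5), ("B", 4), ("C", 3), ("D", 2),
          ("E", 1), ("F", -1), ("G", -2), ("H", -3)]
print(running_enrichment(ranked, {"A", "B", "D"}))
```

In this toy case the running sum peaks after the second gene, so `rank` is 2, two of the three set members form the leading edge, and only a quarter of the ranked list is needed to reach the peak.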
Upload file (`.txt`/`.csv`/`.xlsx`) [click the upload button or drag & drop the file]
TIP: Users can first download the ORA result from GeneEnrich-ORA, select the pathways of interest, and then upload the file to Plot-ORA.
A newly uploaded file automatically overwrites the old one.
Example: click to load the example file (with zoom-in button)
Plot type: the GO category supports 13 types and the others support 9 types
NOTICE: Each plot type has its own parameters, including axis, legend, color, text, and border.
All parameters are set to default values. Once the file is uploaded, a draft plot is made. Users can then change the parameters they are interested in to improve the visualization.
Plot button: start plotting
Clear button: clear the input and all other results
- Remove figure legend: if the slider bar is grey, the plot keeps its legend
- Figure body text size: sets the text size for axis labels, axis titles, and text inside the plot. If set to zero, the text is removed.
- Figure legend text size: sets the text size for the legend labels and title. If set to zero, the text is removed.
- Figure border thickness: if set to zero, the figure border is removed.
- Wrap text longer than: text longer than this cutoff is wrapped.

Upload file (`.txt`/`.csv`/`.xlsx`) [click the upload button or drag & drop the file]
TIP: Users can first download the GSEA result from GeneEnrich, select the pathways of interest, and then upload the file to Plot.
A newly uploaded file automatically overwrites the old one.
Example: click to load the example file (with zoom-in button)
Plot type: five types
NOTICE: Each plot type has its own parameters, including axis, legend, color, text, and border.
All parameters have a default value. Once the file is uploaded, a draft plot is made. Users can then change the advanced parameters to improve the visualization.
Plot button: start plotting
Clear button: clear the input and all other results
- Remove figure legend: if the slider bar is grey, the plot keeps its legend
- Figure body text size: sets the text size for axis labels, axis titles, and text inside the plot. If set to zero, the text is removed.
- Figure legend text size: sets the text size for the legend labels and title. If set to zero, the text is removed.
- Figure border thickness: if set to zero, the figure border is removed.

Plot type: two types
Show/hide more arguments: click to show more specific arguments (e.g. color, text size, border size) and click again to hide them
- Figure alpha degree: adjusts the background transparency
- Figure body text size: sets the text size for axis labels, axis titles, and text inside the plot. If set to zero, the text is removed.
- Figure legend text size: sets the text size for the legend labels and title. If set to zero, the text is removed.
- Figure border thickness: if set to zero, the figure border is removed.
Plot button: start plotting
Clear button: clear the input and all other results
Identifying DEGs (Differentially Expressed Genes) is a common step in gene expression analysis (e.g. RNA-Seq, microarray). Here, genekitr provides an easy way to visualize up- and down-regulated genes with a volcano plot.
Upload file (`.txt`/`.csv`/`.xlsx`) [click the upload button or drag & drop the file]
NOTICE: The DEG data can be produced with popular tools such as DESeq2, limma, or edgeR, but each tool has its own column naming rules.
Before uploading the file, users need to make sure it contains the columns below:
- `gene` column: gene IDs
- `logFC` column: log2(Fold Change) values
- `pvalue` OR `p.adjust` column: statistical test values
A newly uploaded file automatically overwrites the old one.
Example: click to load the example file (with zoom-in button)
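Because each tool names its columns differently, a small renaming step is usually needed before upload. The sketch below shows one way to check a header and map tool-specific names onto `logFC`, `pvalue`, and `p.adjust`. The function name `normalize_deg_columns` is hypothetical, the alias table lists the default column names commonly produced by DESeq2, limma, and edgeR, and the gene IDs are assumed to already sit in a column called `gene` (these tools typically store them as row names).

```python
# Tool-specific column names mapped to the names expected for upload
COLUMN_ALIASES = {
    "logFC": {"logFC", "log2FoldChange"},
    "pvalue": {"pvalue", "P.Value", "PValue"},
    "p.adjust": {"p.adjust", "padj", "adj.P.Val", "FDR"},
}


def normalize_deg_columns(header):
    """Return {original_name: expected_name}; raise if a required column is missing."""
    renamed = {}
    for col in header:
        for target, aliases in COLUMN_ALIASES.items():
            if col in aliases:
                renamed[col] = target
    found = set(renamed.values())
    if "gene" not in header:
        raise ValueError("missing 'gene' column")
    if "logFC" not in found:
        raise ValueError("missing 'logFC' column")
    if not ({"pvalue", "p.adjust"} & found):
        raise ValueError("need a 'pvalue' or 'p.adjust' column")
    return renamed


# Header as it might look after exporting a DESeq2 result with a gene column
print(normalize_deg_columns(["gene", "baseMean", "log2FoldChange", "pvalue", "padj"]))
```

Columns that match no alias (such as `baseMean`) are simply left out of the mapping and can stay in the file unchanged.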
SELECT_NONE)
- Remove figure legend: if the slider bar is green, the legend is removed from the plot.
- Statistical threshold: a gene is statistically significant if its value is lower than the threshold (default is 0.05).
- log2(Fold Change) threshold: log2(fold change) is the log-ratio of a gene’s or a transcript’s expression values between the treatment group and the control group. If log2(FC) > 0, the gene is more highly expressed in the treatment group, and vice versa (default is 1, that is, a two-fold change).
- Figure body text size: sets the text size for axis labels, axis titles, and text inside the plot. If set to zero, the text is removed.
- Figure legend text size: sets the text size for the legend labels and title. If set to zero, the text is removed.
- Figure border thickness: if set to zero, the figure border is removed.
For more details, please refer to this site.
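The two volcano plot thresholds described above combine into a three-way label for each gene. The sketch below is a minimal illustration of that rule: the function name `classify_gene` and the labels are hypothetical, and whether a value exactly on the fold-change threshold counts as regulated is an assumption.

```python
def classify_gene(logfc, stat, stat_cutoff=0.05, logfc_cutoff=1.0):
    """Label a gene as 'up', 'down', or 'not significant'."""
    significant = stat < stat_cutoff
    if significant and logfc >= logfc_cutoff:
        return "up"
    if significant and logfc <= -logfc_cutoff:
        return "down"
    return "not significant"


# A log2 fold change of 1 corresponds to a two-fold change: 2 ** 1 == 2
genes = {"MYC": (2.3, 0.001), "TP53": (-1.8, 0.01), "EGFR": (0.4, 0.001), "KRAS": (3.0, 0.2)}
for name, (logfc, stat) in genes.items():
    print(name, classify_gene(logfc, stat))
```

A gene needs to pass both thresholds to be colored as up- or down-regulated, so a large fold change with a weak statistical value (like the hypothetical `KRAS` row) stays in the "not significant" group.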
IDConvert
GenEnrich
ORA and GSEA
Please feel free to contact us at jieandze1314@gmail.com if you have any questions about Genekitr.
Feedback is welcome via GitHub.