Content from Introduction
Last updated on 2026-03-31
Overview
Questions
- What are the main types of functional enrichment analysis approaches, and how do they differ?
- When should you choose one enrichment strategy over another for RNA-seq data?
Objectives
- Understand the conceptual differences between over-representation analysis (ORA) and functional class scoring (FCS)
- Learn how enrichment tools (e.g. clusterProfiler, fgsea, RegEnrich and STRINGdb) implement these approaches using pathway and gene-set databases
Introduction
After differential gene expression analysis there is often an extensive list of genes to interpret, and it is not feasible to go through the biological function of each gene one at a time. A common downstream procedure is functional enrichment analysis (or gene set testing), which aims to determine which pathways or gene networks the differentially expressed genes are implicated in. There are many gene set testing methods available, and it is useful to try several of them.
The purpose of this tutorial is to demonstrate how to perform functional enrichment analysis/gene set testing using various tools/packages in R. We will use data from the Nature Cell Biology paper, EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival (https://www.ncbi.nlm.nih.gov/pubmed/25730472). This study examined the expression profiles of basal and luminal cells in the mammary gland of virgin, pregnant and lactating mice.
Load required libraries
We begin by loading the required packages. Please load the following libraries:
R
library(edgeR)
library(goseq)
library(fgsea)
library(EGSEA)
library(clusterProfiler)
library(org.Mm.eg.db)
library(ggplot2)
library(enrichplot)
library(pathview)
library(impute)
library(preprocessCore)
library(RegEnrich)
Inspect Datasets
We will use several files for this workshop:
- Results from differential expression analysis: debasal and deluminal, with genes in rows and logFC/p-values in columns
- Sample information file: factordata – gives details of sample ID and groups
- Gene lengths file: seqdata
- Filtered counts file: filteredcounts – genes in rows and counts for each sample in columns, lowly expressed genes removed
- Hallmark gene set file for mouse from MSigDB, loaded in .RData format – Mm.H
Let’s inspect the files:
R
debasal <- read.csv("data/limma-voom_basalpregnant-basallactate", header = TRUE, sep = "\t")
deluminal <- read.csv("data/limma-voom_luminalpregnant-luminallactate", header = TRUE, sep = "\t")
factordata <- read.table("data/factordata", header = TRUE, sep = "\t")
# View the first 6 rows of the dataset (the default for head())
head(debasal)
OUTPUT
  ENTREZID        SYMBOL                                     GENENAME     logFC
1    24117          Wif1                      Wnt inhibitory factor 1  1.819943
2   381290        Atp2b4 ATPase, Ca++ transporting, plasma membrane 4 -2.143885
3   226101          Myof                                    myoferlin -2.329744
4    16012        Igfbp6 insulin-like growth factor binding protein 6 -2.896115
5   231830       Micall2                                 MICAL-like 2  2.253400
6    78896 1500015O10Rik                   RIKEN cDNA 1500015O10 gene  2.807548
   AveExpr         t      P.Value    adj.P.Val        B
1 2.975545  19.85403 5.722034e-11 5.366685e-07 15.55490
2 3.944066 -19.07173 9.406224e-11 5.366685e-07 15.09463
3 6.223525 -18.30281 1.562524e-10 5.366685e-07 14.55585
4 1.978449 -18.21558 1.657202e-10 5.366685e-07 14.13954
5 4.760597  18.00994 1.905713e-10 5.366685e-07 14.33472
6 3.036519  18.60321 2.037466e-10 5.366685e-07 14.35640
You can also view the entire file in a different tab using View():
R
View(debasal)
Challenge
How many columns are there in the debasal and deluminal objects?
What are the different types of samples in this analysis? Hint: Look at the factordata file.
Summary
Commonly used analyses following differential gene expression (DGE):
Over-representation analysis (ORA): Tests whether the list of differentially expressed genes contains more genes from a specific pathway or gene set than expected by chance
Functional class scoring (FCS): Evaluates coordinated shifts in expression within each gene set, using a ranking of all genes rather than a significance cut-off
Protein-protein interactions (PPI): Maps the functional connections between proteins to reveal network structure or pathways involved
Content from Gene Ontology testing with clusterProfiler
Last updated on 2026-03-31
Overview
Questions
- What are the different types of GO terms (BP, MF, CC)?
- How do we perform ORA using the enrichGO() function?
- How can we run GSEA-style functional class scoring with the gseGO() function?
Objectives
- Apply GO-based enrichment methods using clusterProfiler
- Perform both ORA and GSEA using the GO terms database
- Build confidence in navigating GO resources and interpreting enriched terms
Introduction
The Gene Ontology (GO) project is a major bioinformatics initiative
that standardises how we describe gene functions across species,
organising them into three categories: Biological Process, Molecular
Function and Cellular Component. clusterProfiler is an R
package that allows us to test whether these GO terms are associated
with our RNA-seq results and gain insight into the pathways or functions
represented in our data. This section demonstrates how to perform both
over-representation analysis (ORA) and
functional class scoring (FCS) with the GO database,
depending on whether you are working with a list of significant genes or
full ranked expression data.
Over-Representation Analysis (ORA)
ORA tests whether specific GO terms are annotated to more genes in a list of significant genes than would be expected by chance. The input is a vector of gene IDs that pass your differential expression cut-off. ORA can be run separately for downregulated and upregulated genes to reveal which GO terms are enriched in each direction.
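To make the idea of over-representation concrete, here is a minimal base R sketch of the kind of test that sits behind ORA: a one-sided hypergeometric (equivalently, Fisher's exact) test for a single GO term. All counts are made-up illustrative numbers, not values from the workshop data, and the variable names are our own.

```r
# Toy numbers (illustrative assumptions, not taken from the workshop data)
universe_size <- 10000  # genes tested in the experiment (background)
term_size     <- 200    # background genes annotated to the GO term
de_size       <- 500    # genes passing the significance cut-off
overlap       <- 25     # significant genes annotated to the GO term

# Overlap expected if the significant genes were drawn at random
expected <- de_size * term_size / universe_size  # 10 genes

# P(overlap >= 25) under the hypergeometric distribution
p_hyper <- phyper(overlap - 1, term_size, universe_size - term_size,
                  de_size, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on a 2x2 table
# (rows: significant / not significant; columns: in term / not in term)
tab <- matrix(c(overlap, term_size - overlap,
                de_size - overlap,
                universe_size - term_size - de_size + overlap),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(expected = expected, p_hyper = p_hyper, p_fisher = p_fisher)
```

The two p-values agree, which is why ORA is often described interchangeably as a hypergeometric test or a Fisher's exact test. The choice of background (here `universe_size`) strongly affects the result.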
We first flag the genes in the debasal dataset with an
adjusted p-value below 0.01 and store their Entrez IDs in an object
called gene. We then run the enrichGO() function
using this gene list, specifying the organism database
org.Mm.eg.db, the identifier type ENTREZID and
the GO category of interest CC (for cellular component).
The function is run with a p-value cut-off of 0.01 and a q-value cut-off of 0.05, using
Benjamini-Hochberg correction. We use the function head()
to check the first few lines of output.
R
debasal$Status <- debasal$adj.P.Val < 0.01
gene <- debasal$ENTREZID[debasal$Status]
ego <- enrichGO(gene = gene,
                OrgDb = org.Mm.eg.db,
                keyType = 'ENTREZID',
                ont = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff = 0.01,
                qvalueCutoff = 0.05,
                readable = TRUE)
R
head(ego)
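The Benjamini-Hochberg adjustment requested with pAdjustMethod = "BH" can be illustrated on a handful of made-up p-values. The sketch below computes the adjustment by hand and compares it with base R's `p.adjust()`. The p-values are hypothetical and do not come from the enrichment results.

```r
# Toy p-values for five hypothetical GO terms
p <- c(0.001, 0.008, 0.039, 0.041, 0.20)
m <- length(p)
o <- order(p)

# BH by hand: scale the i-th smallest p-value by m / i, then make the
# adjusted values non-decreasing by taking running minima from the largest rank
scaled <- p[o] * m / seq_len(m)
bh_manual <- numeric(m)
bh_manual[o] <- pmin(1, rev(cummin(rev(scaled))))

# Built-in equivalent
bh_builtin <- p.adjust(p, method = "BH")

cbind(p, bh_manual, bh_builtin)
```

Terms whose adjusted value falls below the chosen cut-off are the ones reported as enriched.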
We can then use the dotplot() function to visualise the
results in the form of a dot plot. From the resulting plot, we can see that
the cellular component terms spindle, membrane microdomain and ribosome
are among the top enriched terms.
R
dotplot(ego)
Challenge
Challenge! Can you identify enriched GO biological process terms in the
deluminal dataset? Are the enriched terms similar to those found for debasal?
Gene Set Enrichment Analysis (GSEA)
We can also perform GSEA using the GO database. GSEA is a type of functional class scoring method that evaluates whether genes belonging to a GO term tend to appear at the top or bottom of a ranked gene list, rather than relying on a cut-off (e.g. adj.P.Val < 0.01). The input is a continuous ranking metric (e.g. log2FC) for all genes. This allows the detection of subtle but coordinated shifts in GO terms for both downregulated and upregulated pathways.
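The "top or bottom of a ranked list" idea can be sketched with a running-sum enrichment score in base R. This is a simplified illustration of the weighted running sum used in GSEA-style methods, not the code that gseGO() runs internally. The gene names, statistics and gene set are hypothetical, and `running_es()` is our own helper.

```r
# Toy ranked list: 10 hypothetical genes with made-up logFC values
stats <- c(g1 = 3.0, g2 = 2.5, g3 = 2.0, g4 = 1.0, g5 = 0.5,
           g6 = -0.5, g7 = -1.0, g8 = -2.0, g9 = -2.5, g10 = -3.0)
gene_set <- c("g1", "g2", "g4")  # hypothetical gene set

running_es <- function(stats, gene_set, exponent = 1) {
  stats  <- sort(stats, decreasing = TRUE)
  in_set <- names(stats) %in% gene_set
  # Walk down the list: step up at genes in the set (weighted by |statistic|),
  # step down by a constant amount at genes outside the set
  hit  <- abs(stats)^exponent * in_set
  miss <- as.numeric(!in_set)
  cumsum(hit / sum(hit)) - cumsum(miss / sum(miss))
}

running <- running_es(stats, gene_set)

# The enrichment score is the largest deviation of the running sum from zero
es <- running[which.max(abs(running))]
es
```

A set concentrated at the top of the list gives a large positive score and a set concentrated at the bottom gives a large negative one. Significance is then assessed by comparing the score with scores obtained under permutation.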
We begin by creating a ranked gene list for GSEA by extracting the
logFC values from the debasal dataset and naming them with the corresponding
ENTREZID. We then sort this vector in decreasing order so
that the upregulated genes appear at the top of the list and the
downregulated genes at the bottom. Using this ranked gene list, we run
gseGO() to perform GSEA on GO CC terms,
specifying the organism database, gene ID type, gene set size limits and
p-value cut-off for enrichment.
R
debasal_genelist <- debasal$logFC
names(debasal_genelist) <- debasal$ENTREZID
debasal_genelist <- sort(debasal_genelist, decreasing = TRUE)
ego3 <- gseGO(geneList = debasal_genelist,
              OrgDb = org.Mm.eg.db,
              keyType = 'ENTREZID',
              ont = "CC",
              minGSSize = 100,
              maxGSSize = 500,
              pvalueCutoff = 0.05,
              verbose = FALSE)
R
head(ego3)
R
dotplot(ego3)
We can also use the gseaplot() function to visualise the
GSEA result for a specific gene set. In this example, we select the
top-ranked enriched GO term (geneSetID = 1). The resulting plot
displays how genes contributing to the enrichment of this GO term are
distributed in the ranked gene list.
R
gseaplot(ego3, by = "all", title = ego3$Description[1], geneSetID = 1)
- GO terms are divided into Biological Process (BP), Molecular Function (MF) and Cellular Component (CC), which can be analysed separately or together depending on the biological question.
- The enrichGO() and gseGO() functions in clusterProfiler allow users to perform ORA and GSEA using the GO database directly.
- GO testing results highlight gene sets or pathways that are overrepresented in your dataset, allowing interpretation of downregulated or upregulated genes.
Content from KEGG enrichment analysis with clusterProfiler
Last updated on 2026-03-31
Overview
Questions
- How can we perform pathway analysis using KEGG?
- What insights can KEGG enrichment provide about differentially expressed genes?
Objectives
- Learn how to run KEGG over-representation and GSEA-style analysis in R.
- Understand how to interpret pathway-level results.
- Generate and visualise KEGG pathway figures.
Introduction
The KEGG (Kyoto Encyclopedia of Genes and Genomes) database links
genes to curated biological pathways, offering a powerful foundation for
understanding cellular functions at a systems level and making
meaningful biological interpretations. clusterProfiler
allows us to access KEGG and apply both ORA (using
enrichKEGG function) and GSEA (using gseKEGG
function) to extract pathway-level insights from our RNA-seq data.
KEGG analysis
Before running enrichment, we need to confirm the correct KEGG
organism code for mouse (mmu). You can verify by
searching:
R
kegg_organism <- "mmu"
search_kegg_organism(kegg_organism, by='kegg_code')
Over-representation analysis with enrichKEGG
To run ORA using the KEGG database, we need to specify the gene list,
KEGG organism code and p-value cut-off. In this example, we take the first
500 genes from the ranked gene list debasal_genelist. Because that list is
sorted by decreasing logFC, these are the 500 genes with the largest positive logFC.
We specify the organism code mmu (stored in kegg_organism)
and use 0.05 as the p-value cut-off.
We can use the head() function to briefly inspect the
results of enrichKEGG().
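Which genes end up in an ORA input depends on how the ranked vector is indexed. The toy sketch below (hypothetical gene names and logFC values, not workshop data) contrasts taking the first elements of a vector sorted by decreasing logFC with two alternatives one might consider.

```r
# Made-up logFC values for six hypothetical genes
logfc <- c(a = 2.1, b = -3.4, c = 0.2, d = 1.5, e = -0.1, f = -2.8)
ranked <- sort(logfc, decreasing = TRUE)

# First elements of a decreasing sort: largest positive logFC only
top_up <- names(ranked)[1:2]

# Last elements: most strongly down-regulated genes
top_down <- names(rev(ranked))[1:2]

# Largest absolute change, regardless of direction
top_abs <- names(sort(abs(logfc), decreasing = TRUE))[1:2]

list(top_up = top_up, top_down = top_down, top_abs = top_abs)
```

In this toy case the two genes with the largest absolute change are both down-regulated, so they would be missed entirely by a list built from the first elements of the decreasing sort.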
R
kk <- enrichKEGG(gene = names(debasal_genelist)[1:500],
                 organism = kegg_organism,
                 pvalueCutoff = 0.05)
R
head(kk)
GSEA-style KEGG enrichment with gseKEGG
Similar to the previous enrichment analysis with the GO database, we can also
perform a GSEA-style enrichment using the KEGG database. To do so, we
use gseKEGG() and supply the entire ranked gene list
(debasal_genelist) rather than a subset chosen with an arbitrary cut-off. In this
example, we test KEGG pathways containing between 3 and 800 genes using 10,000
permutations and NCBI Gene IDs. Results are filtered using a p-value
cut-off of 0.05. Note that pAdjustMethod = "none" means no multiple-testing
correction is applied to these p-values.
R
kk2 <- gseKEGG(geneList = debasal_genelist,
               organism = kegg_organism,
               nPerm = 10000,
               minGSSize = 3,
               maxGSSize = 800,
               pvalueCutoff = 0.05,
               pAdjustMethod = "none",
               keyType = "ncbi-geneid")
Visualising enriched pathways
Dotplot
Before we look at individual pathways in detail, we can visualise the
overall enrichment results using dotplot().
This dotplot summarises which KEGG pathways are enriched, how many genes
contribute to each pathway, and how significant each one is.
R
dotplot(kk2, showCategory = 10, title = "Enriched Pathways" , split=".sign") + facet_grid(.~.sign)
Similarity-based network plots
Next, we can explore how the enriched pathways relate to one
another.
The enrichment map groups pathways that share many genes, helping us see
broader biological themes rather than isolated pathways. In this case,
pairwise_termsim() function calculates the similarity
between enriched KEGG pathways and produces a similarity matrix that
quantifies their relationship. The emapplot() function generates an
enrichment map using the similarity matrix produced, visualising the
enriched pathways as a network with nodes representing pathways and
edges reflecting their similarity.
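To give a feel for what a pathway similarity matrix contains, the sketch below computes a Jaccard index (shared genes divided by all genes in either pathway) between hypothetical pathways in base R. We assume a Jaccard-type overlap measure purely for illustration; consult the pairwise_termsim() documentation for the method it actually applies. The pathway names and gene members are made up.

```r
# Hypothetical pathways and their member genes
pathways <- list(
  pathA = c("g1", "g2", "g3", "g4"),
  pathB = c("g3", "g4", "g5"),
  pathC = c("g6", "g7")
)

# Jaccard index: size of the intersection over size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise similarity matrix (pathways x pathways)
sim <- sapply(pathways, function(a) sapply(pathways, function(b) jaccard(a, b)))
sim
```

In a network view, pathA and pathB would be joined by an edge because they share genes, while pathC would sit on its own.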
R
kk3 <- pairwise_termsim(kk2)
R
emapplot(kk3)
We can also use cnetplot() to understand which genes
drive these enriched pathways. This plot links genes to pathways they
belong to and highlights genes that appear in multiple pathways.
R
cnetplot(kk3, categorySize="pvalue")
Ridge plot
We can also inspect the distribution of enrichment scores across
pathways with ridgeplot(). This shows how strongly and
broadly each pathway is enriched across the ranked gene list using
overlapping density curves.
R
ridgeplot(kk3) + labs(x = "enrichment distribution")
R
head(kk3)
The first rows show the top-ranked pathways. The ID of the top pathway can be taken from the ID column:
R
# ID of the top-ranked pathway (first row of the result table)
kk3@result$ID[1]
KEGG Pathway Diagram
Finally, we can visualise gene expression changes directly onto a
KEGG pathway diagram. pathview highlights which components of the pathway are up-
or down-regulated in your enrichment analysis.
R
# Produce the native KEGG plot (PNG)
mmu_pathway <- pathview(gene.data=debasal_genelist, pathway.id=kk3@result$ID[1], species = kegg_organism)
This produces the following files in your working directory:
mmu05171.xml mmu05171.pathview.png mmu05171.png

- KEGG pathway analysis helps link DEGs to functional biological pathways.
- Both ORA (enrichKEGG) and GSEA-style (gseKEGG) methods provide complementary insights.
- pathview enables visual interpretation of pathway-level expression changes.
Content from Gene set enrichment analysis with fgsea
Last updated on 2026-03-31
Overview
Questions
- What is Gene Set Enrichment Analysis (GSEA) and when should I use it?
- How does fgsea perform fast, ranked-list GSEA?
- How do I interpret enrichment scores, p-values, and leading-edge genes?
- How does fgsea differ from the GSEA functions in clusterProfiler?
Objectives
- Prepare a ranked gene list suitable for GSEA.
- Run the ‘fgsea’ algorithm on Hallmark or other gene sets.
- Identify enriched pathways and distinguish between up- and down-regulated sets.
- Use ‘plotEnrichment()’ and ‘plotGseaTable()’ to visualise and interpret results.
- Understand the conceptual differences between ‘fgsea’ and ‘clusterProfiler::gseGO/gseKEGG’
What is GSEA (in practice)?
Unlike over-representation analysis (ORA), which tests a
subset of significant genes,
Gene Set Enrichment Analysis (GSEA) uses all
genes, ranked by a statistic such as:
- t-statistics
- log fold changes
- Wald statistics
This helps detect coordinated but subtle shifts across entire pathways that might be missed by threshold-based methods.
The fgsea package implements a fast,
permutation-efficient version of the original Broad Institute GSEA
algorithm, allowing thousands of pathways to be tested quickly.
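The permutation idea can be sketched in a few lines of base R: compute an enrichment score for the observed gene set, then compare it with scores for random gene sets of the same size. This is a deliberately naive illustration under our own assumptions, not the optimised algorithm that fgsea implements. All data are simulated and the helper `es_of()` is hypothetical.

```r
set.seed(1)

# Simulated ranking statistics for 200 hypothetical genes, sorted decreasingly
stats <- sort(setNames(rnorm(200), paste0("g", 1:200)), decreasing = TRUE)

# Hypothetical gene set concentrated at the top of the ranking
gene_set <- names(stats)[c(1:10, 50, 120)]

# Enrichment score: largest deviation of the weighted running sum from zero
es_of <- function(stats, in_set) {
  hit  <- abs(stats) * in_set
  miss <- as.numeric(!in_set)
  running <- cumsum(hit / sum(hit)) - cumsum(miss / sum(miss))
  unname(running[which.max(abs(running))])
}

obs <- es_of(stats, names(stats) %in% gene_set)

# Null distribution: random gene sets of the same size
n_perm <- 1000
null_es <- replicate(n_perm, {
  idx <- sample(length(stats), length(gene_set))
  es_of(stats, seq_along(stats) %in% idx)
})

# Empirical p-value for a positive score, relative to positive null scores
p_emp <- (sum(null_es >= obs) + 1) / (sum(null_es >= 0) + 1)
c(ES = obs, p = p_emp)
```

Running thousands of such permutations for thousands of gene sets is slow when done naively, which is the cost a fast implementation aims to avoid.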
In this part of the workshop, we will:
- Create a ranked list of genes from the debasal dataset
- Run fgsea() using the mouse Hallmark gene sets (Mm.H)
- Explore the top enriched pathways
- Visualise both multiple pathways and a single pathway in detail
Gene Set Enrichment Analysis with fgsea
Let’s perform Gene Set Enrichment Analysis using the
fgsea package.
R
# Prepare ranked list of genes
# Subset the columns we need (ENTREZID + t-statistic)
# and sort genes by t-statistic (decreasing = FALSE → most negative → most positive)
rankedgenes_df <- debasal[order(debasal$t, decreasing = FALSE), c("ENTREZID", "t")]
# Create the numeric vector of t-statistics
rankedgenes <- rankedgenes_df$t
# Name each t-statistic value with the corresponding Entrez ID
# fgsea() requires a *named* numeric vector:
# - values = ranking metric (t-statistics)
# - names = gene identifiers (Entrez IDs)
names(rankedgenes) <- rankedgenes_df$ENTREZID
# Perform fgsea
# pathways = Mm.H (Hallmark gene sets loaded earlier)
# stats = ranked gene list (t-statistics)
# minSize = minimum number of genes required per pathway
fgseaRes <- fgsea(pathways = Mm.H, stats = rankedgenes, minSize = 15)
R
# Extract top enriched pathways
# Up-regulated pathways (ES > 0), ordered by smallest p-value
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
R
# Down-regulated pathways (ES < 0), ordered by smallest p-value
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
R
# Combine: first up-regulated, then reversed down-regulated
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
R
# Plot a table of enrichment results
plotGseaTable(Mm.H[topPathways], rankedgenes, fgseaRes,
              gseaParam=0.5)
R
# Plot the enrichment curve for the top pathway
# Visualise a single pathway: running enrichment score vs. ranked genes.
plotEnrichment(Mm.H[[topPathwaysUp[1]]], rankedgenes) + labs(title = topPathwaysUp[1])
Apply fgsea to the deluminal contrast
Repeat the GSEA analysis using the deluminal dataset
instead of debasal.
- Create a ranked gene list using the t statistic from deluminal.
- Run fgsea() with the same Hallmark gene sets (Mm.H).
- Identify the top 5 enriched pathways.
- Are they different from the debasal results? What biological differences might explain this?
R
# Create a ranked gene list for the deluminal contrast
rankedgenes_df_del <- deluminal[order(deluminal$t, decreasing = FALSE),
                                c("ENTREZID", "t")]
rankedgenes_del <- rankedgenes_df_del$t
names(rankedgenes_del) <- rankedgenes_df_del$ENTREZID
# Run fgsea
fgseaRes_del <- fgsea(pathways = Mm.H,
                      stats = rankedgenes_del,
                      minSize = 15)
# View the top 5 pathways
fgseaRes_del[order(pval)][1:5, ]
Differences between fgseaRes (from debasal) and fgseaRes_del are expected and likely reflect biological differences between the two contrasts (e.g., different cell types or experimental conditions).
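One way to compare two sets of GSEA results side by side is to merge them by pathway and check direction and significance in each contrast. The sketch below uses two small hypothetical tables as stand-ins for fgseaRes and fgseaRes_del. The pathway names and values are made up, and only the pathway, NES and padj columns are assumed.

```r
# Hypothetical stand-ins for two fgsea result tables
res_basal <- data.frame(
  pathway = c("PATHWAY_A", "PATHWAY_B", "PATHWAY_C"),
  NES     = c(2.1, -1.8, 0.4),
  padj    = c(0.001, 0.01, 0.6)
)
res_luminal <- data.frame(
  pathway = c("PATHWAY_A", "PATHWAY_B", "PATHWAY_C"),
  NES     = c(1.9, 1.5, -0.2),
  padj    = c(0.002, 0.03, 0.7)
)

# Merge by pathway so each row holds both contrasts
cmp <- merge(res_basal, res_luminal, by = "pathway",
             suffixes = c("_basal", "_luminal"))

# Same direction of enrichment in both contrasts?
cmp$same_direction <- sign(cmp$NES_basal) == sign(cmp$NES_luminal)

# Significant in both contrasts at an adjusted p-value of 0.05?
cmp$sig_both <- cmp$padj_basal < 0.05 & cmp$padj_luminal < 0.05

cmp
```

Pathways that are significant in both contrasts but enriched in opposite directions (PATHWAY_B in this toy table) are often the most interesting candidates for a cell-type-specific response.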
- GSEA evaluates enrichment across a ranked list of all genes, not just a subset of significant ones.
- The fgsea package provides a fast implementation of GSEA suitable for large RNA-seq datasets.
- A positive NES indicates enrichment among up-regulated genes, while a negative NES indicates enrichment among down-regulated genes.
- plotGseaTable() and plotEnrichment() help visualise how pathways behave across the ranked gene list.
- Compared with clusterProfiler's GSEA functions, fgsea focuses on speed and flexibility, while clusterProfiler provides tighter integration with specific databases (e.g., GO, KEGG) and additional plotting helpers.
Content from Analysis with RegEnrich
Last updated on 2026-03-31
Overview
Questions
- How can we use RegEnrich to identify key transcriptional regulators from RNA-seq data?
- What inputs does RegEnrich need (expression matrix, metadata, list of regulators)?
- Why do we need mouse-specific transcription factor (TF) information instead of the built-in human TFs?
Objectives
- Understand the overall purpose of RegEnrich in identifying key regulators (e.g. TFs).
- Load a mouse transcription factor list suitable for use with RegEnrich.
- Prepare an expression matrix, design matrix, and contrast for a RegenrichSet object.
- Run the main RegEnrich pipeline and inspect the resulting ranked regulators.
Analysis with RegEnrich
RegEnrich is used to identify potential key
regulators (e.g. transcription factors) that may be driving the
gene expression changes observed in your RNA-seq experiment.
At a high level, the workflow looks like this:
- Expression data: log-transformed expression matrix (genes × samples).
- Differential expression: identify genes that differ between groups (e.g. limma).
- Network construction: build a regulator–target network (e.g. co-expression).
- Enrichment testing: test whether targets of a regulator are enriched among DE genes.
- Ranking: combine evidence to give each regulator a score and rank.
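The network and enrichment steps of this workflow can be illustrated with a toy base R example. Here a plain Pearson correlation threshold stands in for network construction (RegEnrich's own network methods are more involved), and Fisher's exact test checks whether a regulator's targets are over-represented among DE genes. All data are simulated, and the gene names, threshold and DE list are assumptions made for illustration.

```r
set.seed(42)
n_genes <- 200
n_samples <- 12

# Simulated log-expression matrix (genes x samples)
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes),
                               paste0("s", 1:n_samples)))

# Hypothetical regulator with a clear expression gradient across samples
regulator <- "gene1"
expr[regulator, ] <- seq(-2, 2, length.out = n_samples)

# Make genes 2-21 follow the regulator (plus a little noise)
noise <- matrix(rnorm(20 * n_samples, sd = 0.3), nrow = 20)
expr[2:21, ] <- sweep(noise, 2, expr[regulator, ], "+")

# "Network": targets are genes strongly correlated with the regulator
r <- cor(t(expr), expr[regulator, ])[, 1]
targets <- setdiff(names(r)[abs(r) > 0.8], regulator)

# Suppose differential expression flagged genes 1-30 (hypothetical)
de_genes <- paste0("gene", 1:30)

# Enrichment: are the regulator's targets over-represented among DE genes?
all_genes <- setdiff(rownames(expr), regulator)
tab <- table(
  target = factor(all_genes %in% targets, levels = c(FALSE, TRUE)),
  de     = factor(all_genes %in% de_genes, levels = c(FALSE, TRUE))
)
p_enrich <- fisher.test(tab, alternative = "greater")$p.value
p_enrich
```

Repeating this for every candidate regulator and combining the enrichment evidence with each regulator's own differential expression gives the kind of ranking described in the last step.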
Before we set up RegEnrich properly, we will explore the
default TF list that comes with the package and see why it is not
appropriate for this mouse dataset.
Spot the problem: built-in TFs vs mouse data
- Load the built-in transcription factor list:
R
data(TFs)
- Inspect the TFs object:
  - What kinds of identifiers are used (e.g. gene symbols, Entrez IDs)?
  - Which species do these transcription factors belong to?
- Based on what you see:
  - Why might using TFs be a problem for our mouse expression dataset?
  - What could go wrong in the analysis if we use human TFs with mouse RNA-seq data?
Using a mouse TF list from TcoF-DB
The TFs included in the package are human-only, so for mouse data we must provide our own list of mouse transcription factors.
For this workshop, we will use mouse TFs from TcoF-DB. You can directly download the file that we will be using from this link.
The code below shows how to:
- Load a mouse TF list from a CSV file.
- Prepare an expression matrix for RegEnrich.
- Create a RegenrichSet object.
- Run the main RegEnrich pipeline and inspect the results.
R
# Load mouse transcription factors (must include a "GeneID" column)
mouseTFs <- read.csv('data/BrowseTF_TcoF-DB.csv')
# Prepare expression matrix: genes x samples
logcounts <- filteredcounts[,4:15]
rownames(logcounts) <- filteredcounts$ENTREZID
# Convert to log CPM for RegEnrich
logcounts <- cpm(logcounts,log=TRUE)
# Define design (uses CellTypeStatus metadata) and example contrast
design = model.matrix(~ factordata$CellTypeStatus)
contrast = c(-1, 1, 0, 0, 0, 0)
# Initialise a RegenrichSet object
object = RegenrichSet(expr = logcounts,
                      colData = factordata,
                      reg = unique(mouseTFs$GeneID), # regulators
                      method = "limma", # differential expression analysis method
                      design = design, # design model matrix
                      contrast = contrast, # contrast
                      networkConstruction = "COEN", # network inference method
                      enrichTest = "FET") # enrichment analysis method
print(object)
The regenrich_diffExpr() step can take a while. We have
already run this step for you and you can download the object data
directly using this link.
R
# Perform RegEnrich analysis
set.seed(123)
# This step takes a while
object = regenrich_diffExpr(object) %>%
  regenrich_network() %>%
  regenrich_enrich() %>%
  regenrich_rankScore()
# Obtain results (ranked regulators)
res = results_score(object)
print(res)
# Visualise regulator-target expression for selected regulator
plotRegTarExpr(object, reg = "71371")
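RegEnrich uses a design matrix and contrast in a similar
way to limma: they define which groups you want to compare.
We create a design matrix from a factor in our sample metadata:
design <- model.matrix(~ factordata$CellTypeStatus)
A contrast vector has one entry per column of the design matrix and specifies
how to combine the fitted coefficients into a comparison. How it should be read
therefore depends on what the columns represent; check them with colnames(design).
When the design has one column per group (no intercept,
e.g. model.matrix(~ 0 + factordata$CellTypeStatus)), each coefficient is a group mean,
and a contrast like:
contrast <- c(-1, 1, 0, 0, 0, 0)
means:
- Compare group 2 vs group 1
- i.e. “group 2 MINUS group 1”
- All other groups are ignored (set to 0)
When the design includes an intercept, as with the formula ~ factordata$CellTypeStatus
shown here, the first column is the mean of the reference (first) level and each remaining
column is already the difference between one group and that reference. In that
parameterisation, “group 2 MINUS group 1” is c(0, 1, 0, 0, 0, 0).
The exact mapping of positions in the contrast to group names depends
on the order of the factor levels in
factordata$CellTypeStatus.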
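How a contrast vector is read depends on the columns of the design matrix, which a small base R example can make concrete. The sketch below fits one toy gene with three hypothetical groups under two parameterisations (with and without an intercept) and shows which contrast recovers "group B minus group A" in each. The group labels and expression values are made up.

```r
# Three hypothetical groups with two samples each, and toy expression values
group <- factor(rep(c("A", "B", "C"), each = 2))
y <- c(1.0, 1.2, 3.0, 3.2, 5.0, 5.2)  # group means: 1.1, 3.1, 5.1

# Parameterisation 1: one column per group (no intercept)
design_means <- model.matrix(~ 0 + group)
coef_means <- lm.fit(design_means, y)$coefficients

# Parameterisation 2: intercept (group A) plus differences from group A
design_ref <- model.matrix(~ group)
coef_ref <- lm.fit(design_ref, y)$coefficients

# "B minus A" under each parameterisation
b_minus_a_means <- sum(c(-1, 1, 0) * coef_means)  # mean(B) - mean(A)
b_minus_a_ref   <- sum(c(0, 1, 0) * coef_ref)     # the groupB coefficient

# The vector c(-1, 1, 0) applied to the intercept design is NOT B minus A:
# it subtracts the intercept (mean of A) from the B-vs-A difference
mismatch <- sum(c(-1, 1, 0) * coef_ref)

c(means = b_minus_a_means, ref = b_minus_a_ref, mismatch = mismatch)
```

Printing colnames() of the design matrix before writing a contrast is a quick way to confirm which of the two situations applies.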
Test your understanding: contrasts
Look at the factor levels in factordata$CellTypeStatus:
R
# factor() ensures levels are returned even if the column was read as character
levels(factor(factordata$CellTypeStatus))
- How many groups are there?
- Which group is used as the baseline (reference) in the design matrix?
- Write a contrast that compares Luminal pregnant vs Basal pregnant.
- In words, what biological question does that contrast represent?
The number of groups equals the number of unique levels of factordata$CellTypeStatus.
In a design with an intercept, the baseline group is the first level of the factor.
If the factor levels are ordered like:
[1] “Basal pregnant” “Basal lactate” “Luminal pregnant” “Luminal lactate” “Stem” “Other”
and the design matrix has one column per group (no intercept), then the corresponding contrast to compare Luminal pregnant vs Basal pregnant is:
contrast <- c(-1, 0, 1, 0, 0, 0)
This means:
- 1 → Luminal pregnant
- -1 → Basal pregnant
- 0 → all other groups ignored
The biological question this is answering is:
“Which transcriptional regulators differ between Luminal pregnant and Basal pregnant samples?”
That is, regulators that functionally distinguish these two cell states.
Inspecting and interpreting RegEnrich results
The results_score(object) call returns a table of
regulators with associated statistics. Typical columns summarise:
- The regulator identifier (e.g. Entrez ID or gene symbol)
- Evidence from differential expression and/or network structure
- A combined score used to rank regulators (higher often = more influential)
A simple way to start exploring is to look at the top regulators and their expression patterns across conditions:
- Are top-ranked regulators differentially expressed between groups?
- Do their predicted targets show coordinated expression changes?
- Does the expression of a regulator and its targets match your biological expectations?
Interpreting regulator results
Using the output table res:
- Identify the top 3 regulators by whatever ranking column is provided (e.g. rankScore).
- For one of these regulators, check its expression across samples using plotRegTarExpr().
- Does this pattern support the idea that this regulator is involved in the contrast you specified?
- How might you follow this up experimentally?
- RegEnrich helps identify potential regulatory drivers (e.g. TFs) behind observed gene expression changes.
- The package’s built-in TF dataset (data(TFs)) is human-specific and not suitable for mouse RNA-seq analysis.
- For mouse data, a mouse-specific TF list (e.g. from TcoF-DB) must be supplied via the reg argument.
- A RegenrichSet object requires: an expression matrix, sample metadata, a regulator list, and a design/contrast specification.
Content from Interaction networks with StringDB
Last updated on 2026-03-31
Overview
Questions
- How can we use
STRINGdbto visualise protein–protein interaction networks for our DE genes?
- How do we map our gene identifiers to the IDs used by STRING?
- What information does STRING functional enrichment add beyond standard GO/KEGG analysis?
Objectives
- Load and initialise the STRINGdb object for mouse.
- Map a set of differentially expressed genes to STRING identifiers.
- Visualise a protein–protein interaction network for top DE genes.
- Retrieve and inspect STRING functional enrichment results.
Interaction networks with StringDB
So far, we have focused on pathway-level enrichment. Another useful way to interpret RNA-seq results is to look at protein–protein interaction (PPI) networks: Are our differentially expressed genes part of the same complexes or signalling modules?
The STRINGdb package provides an interface to the STRING database, which aggregates known and predicted PPIs from multiple sources (experiments, databases, text-mining, etc.).
In this lesson we will:
- Initialise a STRINGdb object for mouse.
- Map our top differentially expressed genes to STRING IDs.
- Plot an interaction network.
- Retrieve functional enrichment results from STRING.
R
# Load the package, then initialise STRINGdb for mouse (taxonomy ID: 10090)
library(STRINGdb)
string_db <- STRINGdb$new(
  version = "12",
  species = 10090,
  score_threshold = 400,
  input_directory = ""
)
R
# Prepare data: select top 200 DE genes (by adjusted P value)
top200 <- debasal[order(debasal$adj.P.Val), ][1:200, ]
top200_mapped <- string_db$map(top200, "ENTREZID", removeUnmappedRows = TRUE)
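The selection step can be tried on a small made-up table before running it on real results. The sketch below assumes a data frame with `ENTREZID` and `adj.P.Val` columns, as `debasal` appears to have; `de_toy`, `n_top` and the values are hypothetical. Using `head()` instead of `[1:200, ]` avoids padding the result with `NA` rows when fewer genes are available than requested.

```r
# Toy differential expression table (hypothetical IDs and P values)
de_toy <- data.frame(
  ENTREZID  = c("11111", "22222", NA, "44444", "55555"),
  adj.P.Val = c(0.20, 0.001, 0.0005, 0.03, 0.004),
  stringsAsFactors = FALSE
)

# Drop genes without an Entrez ID, since they cannot be mapped by that column
de_toy <- de_toy[!is.na(de_toy$ENTREZID), ]

# Order by adjusted P value and keep at most n_top genes
n_top <- 3
top_toy <- head(de_toy[order(de_toy$adj.P.Val), ], n_top)
print(top_toy)
```

After mapping real data, comparing `nrow()` before and after the `map()` call shows how many genes were lost because they had no STRING identifier.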
R
# Plot the protein interaction network
string_db$plot_network(top200_mapped$STRING_id)
R
# Get functional enrichment (GO, KEGG, Reactome)
enrichment <- string_db$get_enrichment(top200_mapped$STRING_id)
R
head(enrichment)
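Once an enrichment table is available, a common next step is to keep only the significant terms and see which annotation sources they come from. The sketch below uses a made-up table, `enrich_toy`. The column names (`category`, `description`, `number_of_genes`, `fdr`) and the category labels are assumptions about the shape of the STRING output, so confirm them with `colnames(enrichment)` before adapting the code.

```r
# Toy enrichment table (hypothetical terms and values)
enrich_toy <- data.frame(
  category        = c("Process", "KEGG", "Process", "RCTM", "KEGG"),
  description     = c("Term A", "Pathway B", "Term C", "Pathway D", "Pathway E"),
  number_of_genes = c(12, 8, 5, 9, 3),
  fdr             = c(0.0002, 0.03, 0.20, 0.004, 0.07),
  stringsAsFactors = FALSE
)

# Keep terms passing a 5% false discovery rate cut-off
sig <- enrich_toy[enrich_toy$fdr < 0.05, ]

# Sort the significant terms from most to least significant
sig <- sig[order(sig$fdr), ]

# Count significant terms per annotation source
n_by_category <- table(sig$category)
print(sig)
print(n_by_category)
```

The 5% cut-off is only a convention; a stricter or looser threshold may suit your analysis better.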
There are many ways to explore your data with the STRING database that cannot be covered in one tutorial. You can learn more by reading the vignette and by inspecting the functions available in the STRINGdb package, which you can list by running:
R
STRINGdb$methods()
Read more about STRING:
- Szklarczyk, Damian et al. “The STRING database in 2025: protein networks with directionality of regulation.” Nucleic acids research vol. 53,D1 (2025): D730-D737. doi:10.1093/nar/gkae1113
- STRINGdb links your genes to protein–protein interaction networks from the STRING database.
- Mapping from gene IDs (e.g. ENTREZ) to STRING IDs is a crucial first step.
- Network visualisation can reveal modules of interconnected DE genes that may not be obvious from lists or tables.
- STRING provides its own functional enrichment, which can complement results from clusterProfiler and fgsea.
Content from Conclusion
Last updated on 2026-03-31
Overview
Questions
- What have we learned about functional enrichment and pathway analysis?
- How do different methods complement one another when interpreting RNA-seq results?
Objectives
- Summarise the key concepts introduced across the lesson series.
- Understand how different gene set and network tools fit together in a typical analysis workflow.
- Recognise when and why to choose each enrichment method.
Conclusion
In this tutorial, we have explored several complementary approaches for interpreting RNA-seq results beyond differential expression alone. Using these R packages, we can gain insight into the biological processes and pathways associated with the observed differential expression.
Specifically, we worked through:
- Over-representation analysis (ORA) with clusterProfiler
- Gene set enrichment analysis (GSEA) using fgsea
- Regulatory network analysis with RegEnrich
- Protein–protein interaction networks via STRINGdb
Although each tool uses different assumptions and statistical frameworks, they all aim to answer a similar biological question:
Which biological processes, pathways, or regulators help explain the gene expression changes we observe?
By applying multiple methods, you can cross-validate findings and gain a more complete picture of the molecular biology underlying your condition of interest.
You should now feel comfortable:
- preparing gene lists or ranked gene sets
- running several types of enrichment analyses
- visualising pathway-level patterns
- integrating results from complementary tools
- exploring interaction networks and regulatory drivers
These approaches form a core part of transcriptomic interpretation and are widely used in modern functional genomics.
- Enrichment methods help translate gene-level changes into biological
meaning.
- Different tools (ORA, GSEA, network-based methods) answer different
but complementary questions.
- Combining methods provides stronger and more interpretable
biological insights.
- Functional enrichment is an essential component of any RNA-seq analysis workflow.