All in One View

Last updated on 2026-05-26

Overview

Questions

What are the main types of functional enrichment analysis approaches, and how do they differ?
When should you choose one enrichment strategy over another for RNA-seq data?

Objectives

Understand the conceptual differences between over-representation analysis (ORA) and functional class scoring (FCS)
Learn how enrichment tools (e.g. clusterProfiler, fgsea, RegEnrich and STRINGdb) implement these approaches using pathway and gene-set databases

Introduction

Sometimes, there is an extensive list of genes to interpret after differential gene expression analysis, and it is not feasible to go through the biological function of each gene one at a time. A common downstream procedure is functional enrichment analysis (or gene set testing), which aims to determine which pathways or gene networks the differentially expressed genes are implicated in. There are many gene set testing methods available, and it is useful to try several of them.

The purpose of this tutorial is to demonstrate how to perform functional enrichment analysis/gene set testing using various tools/packages in R. We will use data from the Nature Cell Biology paper, EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival (https://www.ncbi.nlm.nih.gov/pubmed/25730472). This study examined the expression profiles of basal and luminal cells in the mammary gland of virgin, pregnant and lactating mice.

Load and read required libraries

We begin by loading the required packages. Please load the following libraries:

R

library(edgeR)
library(goseq)
library(fgsea)
library(EGSEA)
library(clusterProfiler)
library(org.Mm.eg.db)
library(ggplot2)
library(enrichplot)
library(pathview)
library(impute)
library(preprocessCore)
library(RegEnrich)

Inspect Datasets

We will use several files for this workshop:

Results from differential expression analysis debasal and deluminal with genes in rows and logFC/p-values in columns
Sample information file factordata – gives details of sample ID and groups
Gene lengths file seqdata
Filtered counts file filteredcounts – genes in rows and counts for each sample in columns, lowly expressed genes removed
Hallmarks gene set file for mouse from MSigDB loaded in .RData format – Mm.H

Let’s inspect the files:

R

debasal <- read.csv("data/limma-voom_basalpregnant-basallactate", header = TRUE, sep = "\t")
deluminal <- read.csv("data/limma-voom_luminalpregnant-luminallactate", header = TRUE, sep = "\t")
factordata <- read.table("data/factordata", header = TRUE, sep = "\t")

# View the first 6 rows of the dataset (the default for head())
head(debasal)

OUTPUT

  ENTREZID        SYMBOL                                     GENENAME     logFC
1    24117          Wif1                      Wnt inhibitory factor 1  1.819943
2   381290        Atp2b4 ATPase, Ca++ transporting, plasma membrane 4 -2.143885
3   226101          Myof                                    myoferlin -2.329744
4    16012        Igfbp6 insulin-like growth factor binding protein 6 -2.896115
5   231830       Micall2                                 MICAL-like 2  2.253400
6    78896 1500015O10Rik                   RIKEN cDNA 1500015O10 gene  2.807548
   AveExpr         t      P.Value    adj.P.Val        B
1 2.975545  19.85403 5.722034e-11 5.366685e-07 15.55490
2 3.944066 -19.07173 9.406224e-11 5.366685e-07 15.09463
3 6.223525 -18.30281 1.562524e-10 5.366685e-07 14.55585
4 1.978449 -18.21558 1.657202e-10 5.366685e-07 14.13954
5 4.760597  18.00994 1.905713e-10 5.366685e-07 14.33472
6 3.036519  18.60321 2.037466e-10 5.366685e-07 14.35640

You can also view the entire file in a different tab using View():

R

View(debasal)

Discussion

Challenge

How many columns are there in debasal and deluminal objects?
What are the different types of samples in this analysis? Hint: Look at factordata file.

Summary

Key Points

Commonly used analyses following differential gene expression (DGE) analysis:

Over-representation analysis (ORA): Tests whether the list of differentially expressed genes contains more genes from a specific pathway or gene set than expected by chance
Functional class scoring (FCS): Evaluates coordinated shifts in expression across all genes, summarised for each gene set
Protein-protein interactions (PPI): Maps the functional connections between proteins to reveal network structure or pathways involved

Content from Gene Ontology testing with clusterProfiler

Overview

Questions

What are the different types of GO terms (BP, MF, CC)?
How do we perform ORA using enrichGO() function?
How can we run GSEA-style functional class scoring with gseGO() function?

Objectives

Apply GO-based enrichment methods using clusterProfiler
Perform both ORA and GSEA using the GO terms database
Build confidence in navigating GO resources and interpreting enriched terms

Introduction

The Gene Ontology (GO) project is a major bioinformatics initiative that standardises how we describe gene functions across species, organising them into three categories: Biological Process, Molecular Function and Cellular Component. clusterProfiler is an R package that allows us to test whether these GO terms are associated with our RNA-seq results and gain insight into the pathways or functions represented in our data. This section demonstrates how to perform both over-representation analysis (ORA) and functional class scoring (FCS) with GO database, depending on whether you are working with a list of significant genes or full ranked expression data.

Over-Representation Analysis (ORA)

ORA tests whether specific GO terms are over-represented in a list of significant genes, compared with what would be expected by chance. The input is a vector of gene IDs that pass your differential expression cut-off. ORA can be run separately for downregulated and upregulated genes to reveal which GO terms are enriched in each direction.
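
ORA is commonly framed as a hypergeometric test, which is equivalent to a one-sided Fisher's exact test. The sketch below uses made-up counts (not values from debasal) and base R only, with variable names of our own choosing. It illustrates the idea and the Benjamini-Hochberg adjustment; it is not a description of what enrichGO() does internally.

```r
# Toy ORA: is one gene set over-represented among the significant genes?
# All numbers are made up for illustration.
universe_size <- 10000  # genes tested in the experiment (background)
set_size      <- 200    # genes annotated to the GO term
n_significant <- 500    # genes passing the differential expression cut-off
overlap       <- 25     # significant genes that are also in the GO term

# Overlap expected if the significant genes were drawn at random
expected <- n_significant * set_size / universe_size

# One-sided hypergeometric p-value: P(overlap >= 25)
p_hyper <- phyper(overlap - 1, set_size, universe_size - set_size,
                  n_significant, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on a 2x2 table
# (matrix() fills column by column)
tab <- matrix(c(overlap,            n_significant - overlap,
                set_size - overlap, universe_size - set_size - n_significant + overlap),
              nrow = 2,
              dimnames = list(c("in set", "not in set"),
                              c("significant", "not significant")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Benjamini-Hochberg adjustment across several (made-up) gene-set p-values
p_raw <- c(0.001, 0.01, 0.02, 0.04, 0.5)
p_adj <- p.adjust(p_raw, method = "BH")
```

Here the expected overlap is 10 genes, so observing 25 gives a small p-value. In a real analysis one p-value is computed per GO term, which is why a multiple-testing correction is applied afterwards.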

We first flag the genes in the debasal dataset with an adjusted p-value below 0.01 and store their Entrez IDs in an object called gene. We then run the enrichGO() function on this gene list, specifying the organism database org.Mm.eg.db, the identifier type ENTREZID and the GO category of interest, CC (cellular component). P-values are adjusted with the Benjamini-Hochberg method, and we use cut-offs of 0.01 for the p-value and 0.05 for the q-value. We use the head() function to check the first few lines of the output.

R

debasal$Status <- debasal$adj.P.Val < 0.01
gene <- debasal$ENTREZID[debasal$Status]

ego <- enrichGO(gene = gene,
                OrgDb = org.Mm.eg.db,
                keyType = 'ENTREZID',
                ont = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff = 0.01,
                qvalueCutoff = 0.05,
                readable = TRUE)

R

head(ego)

We can then use the dotplot() function to visualise the results as a dot plot. In the resulting plot, the cellular component terms spindle, membrane microdomain and ribosome are among the top enriched terms.

R

dotplot(ego)

Discussion

Challenge

Challenge! Can you identify the enriched GO biological process (BP) terms in the deluminal dataset? Are the enriched terms similar to those found for debasal?

Gene Set Enrichment Analysis (GSEA)

We can also perform GSEA using the GO database. GSEA is a type of functional class scoring method that evaluates whether genes belonging to a GO term tend to appear at the top or bottom of a ranked gene list, rather than relying on a cut-off (e.g. adj.P.Val < 0.01). The input is a continuous ranking metric (e.g. log2FC) for all genes. This allows the detection of subtle but coordinated shifts in GO terms among both downregulated and upregulated genes.
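
To make the idea of genes "appearing at the top or bottom of a ranked list" concrete, here is a minimal sketch of a GSEA-style running enrichment score in base R. The gene names, values and the running_es() helper are hypothetical. The sketch leaves out the permutation-based p-value and the normalisation that real implementations add, so treat it as an illustration rather than the algorithm used by gseGO().

```r
# Toy ranked list: 10 genes with made-up logFC values
stats <- c(g1 = 3.0, g2 = 2.5, g3 = 2.0, g4 = 1.0, g5 = 0.5,
           g6 = -0.5, g7 = -1.0, g8 = -2.0, g9 = -2.5, g10 = -3.0)
gene_set <- c("g1", "g2", "g4")   # hypothetical gene set

running_es <- function(stats, gene_set, exponent = 1) {
  stats  <- sort(stats, decreasing = TRUE)
  in_set <- names(stats) %in% gene_set
  # Genes in the set step the score up in proportion to |stat|^exponent
  hit_w <- ifelse(in_set, abs(stats)^exponent, 0)
  up    <- hit_w / sum(hit_w)
  # Genes outside the set step the score down by an equal amount each
  down  <- ifelse(in_set, 0, 1 / sum(!in_set))
  setNames(cumsum(up - down), names(stats))
}

rs <- running_es(stats, gene_set)

# Enrichment score: the largest deviation of the running score from zero
es <- rs[which.max(abs(rs))]
```

Because all three genes in the toy set sit near the top of the list, the running score peaks early (at g4) and the enrichment score is positive. A set concentrated at the bottom of the list would give a negative score.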

We begin by creating a ranked gene list for GSEA by extracting the logFC values from debasal dataset and its corresponding ENTREZID. We then sort this vector in a decreasing order so that the upregulated genes appear at the top of the list and the downregulated genes at the bottom. Using this ranked gene list, we run gseGO() to perform GSEA on GO terms CC, by specifying the organism database, gene ID type, gene set limits and p-value cut-off for enrichment.

R

debasal_genelist <- debasal$logFC
names(debasal_genelist) <- debasal$ENTREZID
debasal_genelist <- sort(debasal_genelist, decreasing = TRUE)

ego3 <- gseGO(geneList     = debasal_genelist,
              OrgDb        = org.Mm.eg.db,
              keyType      = 'ENTREZID',
              ont          = "CC",
              minGSSize    = 100,
              maxGSSize    = 500,
              pvalueCutoff = 0.05,
              verbose      = FALSE)

R

head(ego3)

R

dotplot(ego3)

We can also use the gseaplot() function to visualise the GSEA result for a specific gene set. In this example, we select the top-ranked enriched GO term (geneSetID = 1). The resulting plot displays how genes contributing to the enrichment of this GO term are distributed in the ranked gene list.

R

gseaplot(ego3, by = "all", title = ego3$Description[1], geneSetID = 1)

Key Points

GO terms are divided into Biological Process (BP), Molecular Function (MF) and Cellular Component (CC), which can be analysed separately or together depending on the biological question.
The enrichGO() and gseGO() functions in clusterProfiler allow users to perform ORA and GSEA using the GO database directly.
GO testing results highlight gene sets or pathways that are overrepresented in your dataset, allowing interpretation of downregulated or upregulated genes.

Content from KEGG enrichment analysis with clusterProfiler

Overview

Questions

How can we perform pathway analysis using KEGG?
What insights can KEGG enrichment provide about differentially expressed genes?

Objectives

Learn how to run KEGG over-representation and GSEA-style analysis in R.
Understand how to interpret pathway-level results.
Generate and visualise KEGG pathway figures.

Introduction

The KEGG (Kyoto Encyclopedia of Genes and Genomes) database links genes to curated biological pathways, offering a powerful foundation for understanding cellular functions at a systems level and making meaningful biological interpretations. clusterProfiler allows us to access KEGG and apply both ORA (using enrichKEGG function) and GSEA (using gseKEGG function) to extract pathway-level insights from our RNA-seq data.

KEGG analysis

Before running enrichment, we need to confirm the correct KEGG organism code for mouse (mmu). You can verify by searching:

R

kegg_organism <- "mmu"

search_kegg_organism(kegg_organism, by='kegg_code')

Over-representation analysis with `enrichKEGG`

To run ORA using the KEGG database, we need to specify the gene list, the KEGG organism code and a p-value cut-off. In this example, we take the first 500 genes from the ranked gene list debasal_genelist (because the list is sorted in decreasing order, these are the genes with the largest positive logFC), specify the organism code mmu (stored in kegg_organism) and use 0.05 as the p-value cut-off.
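
Taking the first entries of a list sorted in decreasing order selects only the most up-regulated genes. The short sketch below, using a made-up named vector in place of debasal_genelist and n = 2 in place of 500, shows how the selection changes if you instead want the most down-regulated genes or the largest changes in either direction. The object names are hypothetical.

```r
# Toy ranked list (made-up logFC values), sorted in decreasing order
genelist <- sort(c(a = 3.1, b = -3.4, c = 0.2, d = 1.7, e = -0.9, f = -2.8),
                 decreasing = TRUE)
n <- 2  # stand-in for the 500 used in the text

# Largest positive logFC (what [1:n] on the sorted list returns)
top_up <- names(genelist)[seq_len(n)]

# Most negative logFC
top_down <- names(rev(genelist))[seq_len(n)]

# Largest absolute logFC, regardless of direction
top_abs <- names(sort(abs(genelist), decreasing = TRUE))[seq_len(n)]
```

Which selection is appropriate depends on the biological question. A significance cut-off, as used for the GO analysis earlier, is another common way to define the ORA input.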

We can use head() function to briefly inspect the results of enrichKEGG.

R

kk <- enrichKEGG(gene         = names(debasal_genelist)[1:500],
                 organism     = kegg_organism,
                 pvalueCutoff = 0.05)

R

head(kk)

GSEA-style KEGG enrichment with `gseKEGG`

As with the GO enrichment above, we can also perform a GSEA-style enrichment using the KEGG database. To do so, we use the gseKEGG() function and supply the entire ranked gene list (debasal_genelist) rather than a subset chosen with an arbitrary cut-off. In this example, we test KEGG pathways containing between 3 and 800 genes, using 10,000 permutations and NCBI Gene IDs. Results are filtered using a p-value cut-off of 0.05; because pAdjustMethod is set to "none", this cut-off is applied to p-values that have not been corrected for multiple testing.

R

kk2 <- gseKEGG(geneList     = debasal_genelist,
               organism     = kegg_organism,
               nPerm        = 10000,
               minGSSize    = 3,
               maxGSSize    = 800,
               pvalueCutoff = 0.05,
               pAdjustMethod = "none",
               keyType       = "ncbi-geneid")

Visualising enriched pathways

Dotplot

Before we look at individual pathways in detail, we can visualise the overall enrichment results using dotplot().
This dotplot summarises which KEGG pathways are enriched, how many genes contribute to each pathway, and how significant each one is.

R

dotplot(kk2, showCategory = 10, title = "Enriched Pathways" , split=".sign") + facet_grid(.~.sign)

Similarity-based network plots

Next, we can explore how the enriched pathways relate to one another.
The enrichment map groups pathways that share many genes, helping us see broader biological themes rather than isolated pathways. In this case, the pairwise_termsim() function calculates the similarity between enriched KEGG pathways and produces a similarity matrix that quantifies their relationship. The emapplot() function then generates an enrichment map from this similarity matrix, visualising the enriched pathways as a network with nodes representing pathways and edges reflecting their similarity.
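
To give a feel for what a pathway similarity matrix contains, the sketch below computes the Jaccard index (shared genes divided by total genes) between three made-up pathways in base R. Jaccard similarity is one common choice, and we understand it to be the default in pairwise_termsim(), though that is worth confirming in the enrichplot documentation. The pathway names, gene IDs and the jaccard() helper are hypothetical.

```r
# Hypothetical pathways and their member genes (made-up IDs)
pathways <- list(
  pathway_A = c("g1", "g2", "g3", "g4"),
  pathway_B = c("g3", "g4", "g5"),
  pathway_C = c("g6", "g7")
)

# Jaccard index: size of the intersection divided by size of the union
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

# Pairwise similarity matrix: 1 on the diagonal, 0 for pathways sharing no genes
sim <- outer(names(pathways), names(pathways),
             Vectorize(function(i, j) jaccard(pathways[[i]], pathways[[j]])))
dimnames(sim) <- list(names(pathways), names(pathways))
```

In this toy example pathway_A and pathway_B share two of five genes (similarity 0.4) and would be joined by an edge in an enrichment map, while pathway_C shares no genes with either and would appear as a separate node.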

R

kk3 <- pairwise_termsim(kk2)

R

emapplot(kk3)

We can also use cnetplot() to understand which genes drive these enriched pathways. This plot links genes to pathways they belong to and highlights genes that appear in multiple pathways.

R

cnetplot(kk3, categorySize="pvalue")

Ridge plot

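We can also use ridgeplot() to inspect, for each enriched pathway, the distribution of the ranking metric (here logFC) among the core enriched genes of that pathway. The overlapping density curves show whether the genes driving each pathway sit mainly at the up-regulated or the down-regulated end of the ranked gene list.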

R

ridgeplot(kk3) + labs(x = "enrichment distribution")

R

head(kk3)

The first rows show the top pathways. The ID column holds the KEGG pathway identifier, so you can extract the ID of the top pathway as follows:

R

# Extract the ID of the top-ranked pathway from the result table
kk3@result$ID[1]

KEGG Pathway Diagram

Finally, we can visualise gene expression changes directly onto a KEGG pathway diagram.
pathview highlights which components of the pathway are up- or down-regulated in your enrichment analysis.

R

# Produce the native KEGG plot (PNG)
mmu_pathway <- pathview(gene.data=debasal_genelist, pathway.id=kk3@result$ID[1], species = kegg_organism)

This will produce files like the following in your working directory:

mmu05171.xml mmu05171.pathview.png mmu05171.png

Key Points

KEGG pathway analysis helps link DEGs to functional biological pathways.
Both ORA (enrichKEGG) and GSEA-style (gseKEGG) methods provide complementary insights.
pathview enables visual interpretation of pathway-level expression changes.

Content from Gene set enrichment analysis with fgsea

Overview

Questions

What is Gene Set Enrichment Analysis (GSEA) and when should I use it?
How does fgsea perform fast, ranked-list GSEA?
How do I interpret enrichment scores, p-values, and leading-edge genes?
How does fgsea differ from the GSEA functions in clusterProfiler?

Objectives

Prepare a ranked gene list suitable for GSEA.
Run the ‘fgsea’ algorithm on Hallmark or other gene sets.
Identify enriched pathways and distinguish between up- and down-regulated sets.
Use ‘plotEnrichment()’ and ‘plotGseaTable()’ to visualise and interpret results.
Understand the conceptual differences between ‘fgsea’ and ‘clusterProfiler::gseGO/gseKEGG’

What is GSEA (in practice)?

Unlike over-representation analysis (ORA), which tests a subset of significant genes,
Gene Set Enrichment Analysis (GSEA) uses all genes, ranked by a statistic such as:

t-statistics
log fold changes
Wald statistics

This helps detect coordinated but subtle shifts across entire pathways that might be missed by threshold-based methods.
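
The sketch below shows one way to turn a differential expression table into a ranked, named vector using the statistics listed above. The table is made up (with column names following the limma output shown earlier), and the signed -log10 p-value is included only as an example of an alternative metric that some analysts use. The duplicate-ID handling is our own assumption about what real tables may need, not a step taken from this workshop.

```r
# Toy differential expression table (made-up values)
de <- data.frame(
  ENTREZID = c("101", "102", "103", "103", "104"),
  logFC    = c(2.0, -1.5, 0.3, 0.4, -0.1),
  t        = c(8.0, -6.0, 1.2, 1.5, -0.4),
  P.Value  = c(1e-6, 1e-4, 0.25, 0.15, 0.70)
)

# An alternative ranking metric: -log10(p-value) carrying the sign of logFC
de$signed_p <- sign(de$logFC) * -log10(de$P.Value)

# Keep one row per gene ID (here: the row with the largest |t|)
de <- de[order(abs(de$t), decreasing = TRUE), ]
de <- de[!duplicated(de$ENTREZID), ]

# Named vector sorted from most positive to most negative statistic
ranked <- sort(setNames(de$t, de$ENTREZID), decreasing = TRUE)
```

Whichever metric you choose, the result should be a numeric vector with unique gene identifiers as names, since that is the shape GSEA functions generally expect.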

The fgsea package implements a fast, permutation-efficient version of the original Broad Institute GSEA algorithm, allowing thousands of pathways to be tested quickly.

In this part of the workshop, we will:

Create a ranked list of genes from the debasal dataset
Run fgsea() using the mouse Hallmark gene sets (Mm.H).
Explore the top enriched pathways
Visualise both multiple pathways and a single pathway in detail

Gene Set Enrichment Analysis with `fgsea`

Let’s perform Gene Set Enrichment Analysis using the fgsea package.

R

# Prepare ranked list of genes
# Subset the columns we need (ENTREZID + t-statistic)
# and sort genes by t-statistic (decreasing = FALSE → most negative → most positive)
rankedgenes_df <- debasal[order(debasal$t, decreasing = FALSE), c("ENTREZID", "t")]

# Create the numeric vector of t-statistics
rankedgenes <- rankedgenes_df$t
# Name each t-statistic value with the corresponding Entrez ID
# fgsea() requires a *named* numeric vector:
#   - values = ranking metric (t-statistics)
#   - names  = gene identifiers (Entrez IDs)
names(rankedgenes) <- rankedgenes_df$ENTREZID

# Perform fgsea
# pathways = Mm.H  (Hallmark gene sets loaded earlier)
# stats     = ranked gene list (t-statistics)
# minSize   = minimum number of genes required per pathway
fgseaRes <- fgsea(pathways = Mm.H, stats = rankedgenes, minSize = 15)

R

# Extract top enriched pathways
# Up-regulated pathways (ES > 0), ordered by smallest p-value
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]

R

# Down-regulated pathways (ES < 0), ordered by smallest p-value
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]

R

# Combine: first up-regulated, then reversed down-regulated
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))

R

# Plot a table of enrichment results
plotGseaTable(Mm.H[topPathways], rankedgenes, fgseaRes, 
              gseaParam=0.5)

R

# Plot the enrichment curve for the top pathway
# Visualise a single pathway: running enrichment score vs. ranked genes.
plotEnrichment(Mm.H[[topPathwaysUp[1]]], rankedgenes) + labs(title = topPathwaysUp[1])

Challenge

Apply fgsea to the deluminal contrast

Repeat the GSEA analysis using the deluminal dataset instead of debasal.

Create a ranked gene list using the t statistic from deluminal.
Run fgsea() with the same Hallmark gene sets (Mm.H).
Identify the top 5 enriched pathways.
Are they different from the debasal results? What biological differences might explain this?

Show me the solution

R

# Create a ranked gene list for the deluminal contrast
rankedgenes_df_del <- deluminal[order(deluminal$t, decreasing = FALSE),
                                c("ENTREZID", "t")]
rankedgenes_del <- rankedgenes_df_del$t
names(rankedgenes_del) <- rankedgenes_df_del$ENTREZID

# Run fgsea
fgseaRes_del <- fgsea(pathways = Mm.H,
                      stats    = rankedgenes_del,
                      minSize  = 15)

# View the top 5 pathways
fgseaRes_del[order(pval)][1:5, ]

Differences between fgseaRes (from debasal) and fgseaRes_del are expected and likely reflect biological differences between the two contrasts: debasal compares pregnant and lactating basal cells, whereas deluminal compares pregnant and lactating luminal cells.
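
One simple way to compare two sets of results is to merge them by pathway name and check significance and direction side by side. The sketch below uses small made-up data frames as stand-ins for fgseaRes and fgseaRes_del; the pathway names and values are hypothetical, while pathway, NES and padj are columns that fgsea results contain.

```r
# Made-up stand-ins for the two fgsea result tables
res_basal <- data.frame(
  pathway = c("HALLMARK_A", "HALLMARK_B", "HALLMARK_C"),
  NES     = c(2.1, -1.8, 0.9),
  padj    = c(0.001, 0.010, 0.400)
)
res_luminal <- data.frame(
  pathway = c("HALLMARK_A", "HALLMARK_B", "HALLMARK_C"),
  NES     = c(1.9, 1.6, -1.0),
  padj    = c(0.002, 0.030, 0.350)
)

# One row per pathway, with columns from each contrast side by side
both <- merge(res_basal, res_luminal, by = "pathway",
              suffixes = c(".basal", ".luminal"))

# Significant in both contrasts, and whether the direction agrees
both$sig_both       <- both$padj.basal < 0.05 & both$padj.luminal < 0.05
both$same_direction <- sign(both$NES.basal) == sign(both$NES.luminal)

both[both$sig_both, ]
```

With the real objects you would first keep only the columns of interest, for example as.data.frame(fgseaRes)[, c("pathway", "NES", "padj")], before merging.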

Key Points

GSEA evaluates enrichment across a ranked list of all genes, not just a subset of significant ones.
The fgsea package provides a fast implementation of GSEA suitable for large RNA-seq datasets.
A positive NES indicates enrichment among up-regulated genes, while a negative NES indicates enrichment among down-regulated genes.
plotGseaTable() and plotEnrichment() help visualise how pathways behave across the ranked gene list.
Compared with clusterProfiler's GSEA functions, fgsea focuses on speed and flexibility, while clusterProfiler provides tighter integration with specific databases (e.g., GO, KEGG) and additional plotting helpers.

Content from Analysis with RegEnrich

Overview

Questions

How can we use RegEnrich to identify key transcriptional regulators from RNA-seq data?
What inputs does RegEnrich need (expression matrix, metadata, list of regulators)?
Why do we need mouse-specific transcription factor (TF) information instead of the built-in human TFs?

Objectives

Understand the overall purpose of RegEnrich in identifying key regulators (e.g. TFs).
Load a mouse transcription factor list suitable for use with RegEnrich.
Prepare an expression matrix, design matrix, and contrast for a RegenrichSet object.
Run the main RegEnrich pipeline and inspect the resulting ranked regulators.

Analysis with `RegEnrich`

RegEnrich is used to identify potential key regulators (e.g. transcription factors) that may be driving the gene expression changes observed in your RNA-seq experiment.

At a high level, the workflow looks like this:

Expression data: log-transformed expression matrix (genes × samples).
Differential expression: identify genes that differ between groups (e.g. limma).
Network construction: build a regulator–target network (e.g. co-expression).
Enrichment testing: test whether targets of a regulator are enriched among DE genes.
Ranking: combine evidence to give each regulator a score and rank.
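
The network and enrichment steps above can be sketched with a deliberately simple example. Here a gene is called a putative target when its expression is strongly correlated with the regulator, and we then count how many targets are also differentially expressed. All values, gene names and the 0.8 threshold are made up, and this is not the network method RegEnrich itself uses; it only illustrates the logic of the workflow.

```r
# Made-up log-expression values for one regulator and four genes across 6 samples
regulator <- c(1, 2, 3, 4, 5, 6)
expr <- rbind(
  geneA = c(3, 5, 7, 9, 11, 13),    # rises with the regulator
  geneB = c(6, 5, 4, 3, 2, 1),      # falls as the regulator rises
  geneC = c(1, -1, 1, -1, 1, -1),   # unrelated pattern
  geneD = c(2, 2, 3, 2, 2, 3)       # only weakly related
)

# Network step: correlation of each gene with the regulator
r <- apply(expr, 1, cor, y = regulator)
targets <- names(r)[abs(r) > 0.8]

# Enrichment step: how many putative targets are also DE genes?
de_genes  <- c("geneA", "geneB", "geneD")   # hypothetical DE result
n_overlap <- length(intersect(targets, de_genes))
```

In a real analysis this overlap would be tested formally (for example with Fisher's exact test) for every regulator, and the evidence combined into a ranking score.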

Before we set up RegEnrich properly, we will explore the default TF list that comes with the package and see why it is not appropriate for this mouse dataset.

Discussion

Spot the problem: built-in TFs vs mouse data

Load the built-in transcription factor list:
R
```
data(TFs)
```
Inspect the TFs object:

What kinds of identifiers are used (e.g. gene symbols, Entrez IDs)?
Which species do these transcription factors belong to?

Based on what you see:

Why might using TFs be a problem for our mouse expression dataset?
What could go wrong in the analysis if we use human TFs with mouse RNA-seq data?

Using a mouse TF list from TcoF-DB

The TFs included in the package are human-only, so for mouse data we must provide our own list of mouse transcription factors.

For this workshop, we will use mouse TFs from TcoF-DB. You can directly download the file that we will be using from this link.

The code below shows how to:

Load a mouse TF list from a CSV file.
Prepare an expression matrix for RegEnrich.
Create a RegenrichSet object.
Run the main RegEnrich pipeline and inspect the results.

R

# Load mouse transcription factors (must include a "GeneID" column)
mouseTFs <- read.csv('data/BrowseTF_TcoF-DB.csv')

# Prepare expression matrix: genes x samples
logcounts <- filteredcounts[,4:15]
rownames(logcounts) <- filteredcounts$ENTREZID

# Convert to log CPM for RegEnrich
logcounts <- cpm(logcounts,log=TRUE)

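# Define design (uses CellTypeStatus metadata) and example contrast
# The positions in the contrast refer to the columns of the design matrix (see colnames(design))
design = model.matrix(~ factordata$CellTypeStatus)
contrast = c(-1, 1, 0, 0, 0, 0)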

# Initialise a RegenrichSet object
object = RegenrichSet(expr = logcounts,
                      colData = factordata,
                      reg = unique(mouseTFs$GeneID), # regulators
                      method = "limma", # differential expression analysis method
                      design = design, # design model matrix
                      contrast = contrast, # contrast
                      networkConstruction = "COEN", # network inference method
                      enrichTest = "FET") # enrichment analysis method

print(object)
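
To show what the log CPM conversion used above does in principle, the sketch below computes counts per million by hand on a tiny made-up count matrix. It uses a simple +1 offset before taking logs, whereas edgeR's cpm(log = TRUE) uses a prior count instead, so the values will not match edgeR exactly. The object names are hypothetical.

```r
# Made-up raw counts: 3 genes x 2 samples (matrix() fills column by column)
counts <- matrix(c(10, 30, 60,
                   0, 50, 150),
                 nrow = 3,
                 dimnames = list(c("gene1", "gene2", "gene3"), c("s1", "s2")))

# Total counts per sample (library size)
lib_size <- colSums(counts)

# Counts per million: scale each sample's counts by its library size
cpm_manual <- t(t(counts) / lib_size) * 1e6

# Log2 transform with a simple offset to avoid log2(0)
logcpm_manual <- log2(cpm_manual + 1)
```

Scaling by library size makes samples with different sequencing depths comparable, and the log transform brings highly and lowly expressed genes onto a similar scale.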

Caution

The regenrich_diffExpr step can take a while. We have already run this step for you and you can download the object data directly using this link.

R

# Perform RegEnrich analysis
set.seed(123)

# This step takes a while
object = regenrich_diffExpr(object) %>%
  regenrich_network() %>%
  regenrich_enrich() %>%
  regenrich_rankScore()


# Obtain results (ranked regulators)
res = results_score(object)
print(res)

# Visualise regulator-target expression for selected regulator
plotRegTarExpr(object, reg = "71371")

Understanding design matrices and contrasts

RegEnrich uses a design matrix and contrast in a similar way to limma: they define which groups you want to compare.

We create a design matrix from a factor in our sample metadata:

design <- model.matrix(~ factordata$CellTypeStatus)

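Because this formula includes an intercept, the first column of the design matrix represents the baseline group (the first factor level), and each remaining column represents the difference between one other group and that baseline. A contrast vector then specifies how to combine these columns to define a comparison.

For example, a contrast like:

contrast <- c(-1, 1, 0, 0, 0, 0)

is intended to mean:

Compare group 2 vs group 1
i.e. “group 2 MINUS group 1”
All other groups are ignored (set to 0)

Note that this reading applies when each design column is a group mean, as in a design built without an intercept (for example model.matrix(~ 0 + factordata$CellTypeStatus)). With the intercept design shown above, the second coefficient on its own already represents “group 2 minus group 1”, so it is worth checking colnames(design) before choosing a contrast.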

The exact mapping of positions in the contrast to group names depends on the order of the factor levels in factordata$CellTypeStatus.
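
How a contrast vector is read also depends on how the design matrix is parameterised. The toy example below, using three made-up groups with known means and base R only, fits the same data with and without an intercept and shows which contrast recovers "group B minus group A" in each case. The object names are hypothetical.

```r
# Toy data: three groups with known means 10, 12 and 15 (two samples each, no noise)
group <- factor(rep(c("A", "B", "C"), each = 2))
y     <- c(10, 10, 12, 12, 15, 15)

# Design with an intercept: columns are (Intercept), groupB, groupC
design_int <- model.matrix(~ group)
coef_int   <- lm.fit(design_int, y)$coefficients

# Design without an intercept: one column per group mean
design_means <- model.matrix(~ 0 + group)
coef_means   <- lm.fit(design_means, y)$coefficients

# "Group B minus group A" under each parameterisation
diff_int   <- sum(c(0, 1, 0)  * coef_int)     # second coefficient on its own
diff_means <- sum(c(-1, 1, 0) * coef_means)   # difference of two group means

# Applying c(-1, 1, 0) to the intercept design does not give a group difference
mismatch <- sum(c(-1, 1, 0) * coef_int)
```

Both diff_int and diff_means equal 2, the true difference between groups B and A, whereas mismatch equals -8. Printing colnames(design) before writing a contrast is a quick way to see which parameterisation you have.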

Challenge

Test your understanding: contrasts

Look at the factor levels in factordata$CellTypeStatus:

R

# factor() is needed if CellTypeStatus was read in as a character column
levels(factor(factordata$CellTypeStatus))

How many groups are there?
Which group is used as the baseline (reference) in the design matrix?
Write a contrast that compares Luminal pregnant vs Basal pregnant.
In words, what biological question does that contrast represent?

Show me the solution

The number of groups equals the number of unique levels returned by levels(factordata$CellTypeStatus)

The baseline group is the first level of the factor.

If the factor levels are ordered like:

[1] “Basal pregnant” “Basal lactate” “Luminal pregnant” “Luminal lactate” “Stem” “Other”

Then, assuming a design matrix with one column per group in that order, the corresponding contrast to compare Luminal pregnant vs Basal pregnant is:

contrast <- c(-1, 0, 1, 0, 0, 0)

This means:

1 → Luminal pregnant

-1 → Basal pregnant

0 → all other groups ignored

The biological question this is answering is:

“Which transcriptional regulators differ between Luminal pregnant and Basal pregnant samples?”

That is, regulators that functionally distinguish these two cell states.

Inspecting and interpreting RegEnrich results

The results_score(object) call returns a table of regulators with associated statistics. Typical columns summarise:

The regulator identifier (e.g. Entrez ID or gene symbol)
Evidence from differential expression and/or network structure
A combined score used to rank regulators (higher often = more influential)

A simple way to start exploring is to look at the top regulators and their expression patterns across conditions:

Are top-ranked regulators differentially expressed between groups?
Do their predicted targets show coordinated expression changes?
Does the expression of a regulator and its targets match your biological expectations?

Discussion

Interpreting regulator results

Using the output table res:

Identify the top 3 regulators by whatever ranking column is provided (e.g. rankScore).
For one of these regulators, check its expression across samples using plotRegTarExpr().
Does this pattern support the idea that this regulator is involved in the contrast you specified?
How might you follow this up experimentally?

Key Points

RegEnrich helps identify potential regulatory drivers (e.g. TFs) behind observed gene expression changes.
The package’s built-in TF dataset (data(TFs)) is human-specific and not suitable for mouse RNA-seq analysis.
For mouse data, a mouse-specific TF list (e.g. from TcoF-DB) must be supplied via the reg argument.
A RegenrichSet object requires: an expression matrix, sample metadata, a regulator list, and a design/contrast specification.

Content from Interaction networks with StringDB

Overview

Questions

How can we use STRINGdb to visualise protein–protein interaction networks for our DE genes?
How do we map our gene identifiers to the IDs used by STRING?
What information does STRING functional enrichment add beyond standard GO/KEGG analysis?

Objectives

Load and initialise the STRINGdb object for mouse.
Map a set of differentially expressed genes to STRING identifiers.
Visualise a protein–protein interaction network for top DE genes.
Retrieve and inspect STRING functional enrichment results.

Interaction networks with `STRINGdb`

So far, we have focused on pathway-level enrichment. Another useful way to interpret RNA-seq results is to look at protein–protein interaction (PPI) networks: Are our differentially expressed genes part of the same complexes or signalling modules?

The STRINGdb package provides an interface to the STRING database, which aggregates known and predicted PPIs from multiple sources (experiments, databases, text-mining, etc.).

In this lesson we will:

Initialise a STRINGdb object for mouse.
Map our top differentially expressed genes to STRING IDs.
Plot an interaction network.
Retrieve functional enrichment results from STRING.

R

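# Load the package, then initialise STRINGdb for mouse (NCBI taxonomy ID: 10090).
# score_threshold sets the minimum combined interaction score (0 to 1000 scale) to keep.
library(STRINGdb)
string_db <- STRINGdb$new(version = "12", species = 10090, score_threshold = 400, input_directory = "")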
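
To get a feel for what score_threshold does, the sketch below applies several cut-offs to a small mock edge table. The protein names and scores are invented, and the sketch assumes that edges scoring at or above the threshold are kept. The cut-offs 150, 400, 700 and 900 correspond to the low, medium, high and highest confidence levels used on the STRING website.

```r
# Mock interaction table; protein names and scores are invented for illustration.
edges <- data.frame(
  from = c("P1", "P1", "P2", "P3", "P4"),
  to   = c("P2", "P3", "P3", "P4", "P5"),
  combined_score = c(950, 420, 180, 700, 399),
  stringsAsFactors = FALSE
)

# Keep edges at or above the medium-confidence cut-off used in this lesson.
score_threshold <- 400
kept <- edges[edges$combined_score >= score_threshold, ]

# Count how many edges survive at each confidence level.
cutoffs <- c(low = 150, medium = 400, high = 700, highest = 900)
n_edges <- sapply(cutoffs, function(th) sum(edges$combined_score >= th))

print(kept)
print(n_edges)
```

A higher threshold gives a sparser network with fewer false-positive edges, while a lower threshold keeps more speculative interactions.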

R

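# Prepare data: select the top 200 DE genes (smallest adjusted P values).
# head() avoids creating NA rows if fewer than 200 genes are available.
top200 <- head(debasal[order(debasal$adj.P.Val), ], 200)
# Map Entrez IDs to STRING identifiers, dropping genes that cannot be mapped
top200_mapped <- string_db$map(top200, "ENTREZID", removeUnmappedRows = TRUE)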
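
Because unmapped rows are dropped, it is worth checking how many genes survive the mapping step. The sketch below imitates that step offline with mock data: the Entrez IDs, P values and STRING-style identifiers are all invented, and the lookup table is a hypothetical stand-in for what the STRING mapping would return.

```r
# Mock DE table; IDs and adjusted P values are invented for illustration.
de <- data.frame(
  ENTREZID  = as.character(1001:1010),
  adj.P.Val = c(0.20, 0.001, 0.03, 0.5, 0.0004, 0.9, 0.01, 0.07, 0.6, 0.002),
  stringsAsFactors = FALSE
)

# Select the top genes by adjusted P value (5 here instead of 200).
n_top <- 5
top <- head(de[order(de$adj.P.Val), ], n_top)

# Hypothetical lookup table standing in for the STRING identifier mapping.
lookup <- data.frame(
  ENTREZID  = c("1002", "1005", "1010", "1003"),
  STRING_id = paste0("10090.MOCK", 1:4),
  stringsAsFactors = FALSE
)

# An inner join drops genes without a mapped identifier.
mapped <- merge(top, lookup, by = "ENTREZID")

# Report which genes were lost and the overall mapping rate.
unmapped <- setdiff(top$ENTREZID, mapped$ENTREZID)
mapping_rate <- nrow(mapped) / nrow(top)
print(unmapped)
print(mapping_rate)
```

A low mapping rate usually points to an identifier problem (for example the wrong ID type or species) rather than to missing biology.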

R

# Plot the protein interaction network
string_db$plot_network(top200_mapped$STRING_id)

R

# Get functional enrichment (GO, KEGG, Reactome)
enrichment <- string_db$get_enrichment(top200_mapped$STRING_id)

R

head(enrichment)
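
The enrichment table mixes terms from several sources, so a common next step is to filter by significance and summarise by source. The sketch below does this on a mock table. The column names category, term, number_of_genes and fdr are assumptions about the layout of the returned table, and every term and value is an invented placeholder.

```r
# Mock enrichment table; column names are assumed and all values are invented.
enrichment_mock <- data.frame(
  category = c("Process", "KEGG", "Process", "RCTM", "KEGG"),
  term     = c("TERM_1", "TERM_2", "TERM_3", "TERM_4", "TERM_5"),
  number_of_genes = c(12, 8, 3, 5, 2),
  fdr      = c(0.001, 0.02, 0.30, 0.04, 0.60),
  stringsAsFactors = FALSE
)

# Keep terms passing a 5% false discovery rate, most significant first.
sig <- enrichment_mock[enrichment_mock$fdr < 0.05, ]
sig <- sig[order(sig$fdr), ]

# Count significant terms per annotation source.
per_category <- table(sig$category)

print(sig)
print(per_category)
```

Check the real column names with colnames(enrichment) before adapting this, since they may differ between package versions.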

The STRING database offers many more ways to explore your data than a single tutorial can cover. To learn more, read the package vignette, and list the methods available in the STRINGdb package by running:

R

STRINGdb$methods()

Overview

Questions

What have we learned about functional enrichment and pathway analysis?
How do different methods complement one another when interpreting RNA-seq results?

Objectives

Summarise the key concepts introduced across the lesson series.
Understand how different gene set and network tools fit together in a typical analysis workflow.
Recognise when and why to choose each enrichment method.

Conclusion

In this tutorial, we have explored several complementary approaches for interpreting RNA-seq results beyond differential expression alone. Using these R packages, we can gain insight into the biological processes and pathways associated with the differential expression we observe.

Specifically, we worked through:

Over-representation analysis (ORA) with clusterProfiler
Gene set enrichment analysis (GSEA) using fgsea
Regulatory network analysis with RegEnrich
Protein–protein interaction networks via STRINGdb

Although each tool uses different assumptions and statistical frameworks, they all aim to answer a similar biological question:

Which biological processes, pathways, or regulators help explain the gene expression changes we observe?

By applying multiple methods, you can cross-validate findings and gain a more complete picture of the molecular biology underlying your condition of interest.
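
One simple way to compare findings across methods is to tabulate which pathways each method reports as significant. The sketch below assumes you have already extracted a vector of significant pathway identifiers from each analysis. The identifiers here are invented placeholders, and requiring support from at least two methods is an arbitrary choice made for illustration.

```r
# Invented pathway identifiers standing in for significant results per method.
results <- list(
  ORA    = c("PATH_A", "PATH_B", "PATH_C", "PATH_D"),
  GSEA   = c("PATH_B", "PATH_D", "PATH_E"),
  STRING = c("PATH_B", "PATH_C", "PATH_E")
)

# Build a pathway-by-method membership matrix (TRUE = reported as significant).
all_paths <- sort(unique(unlist(results)))
membership <- sapply(results, function(x) all_paths %in% x)
rownames(membership) <- all_paths

# Number of methods supporting each pathway.
support <- rowSums(membership)

# Pathways reported by at least two methods (threshold chosen for illustration).
consensus <- names(support)[support >= 2]

print(membership)
print(consensus)
```

Pathways supported by several methods are reasonable candidates to prioritise, although agreement between methods that share the same annotation database is not fully independent evidence.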

You should now feel comfortable:

preparing gene lists or ranked gene sets
running several types of enrichment analyses
visualising pathway-level patterns
integrating results from complementary tools
exploring interaction networks and regulatory drivers

These approaches form a core part of transcriptomic interpretation and are widely used in modern functional genomics.

Key Points

Enrichment methods help translate gene-level changes into biological meaning.
Different tools (ORA, GSEA, network-based methods) answer different but complementary questions.
Combining methods provides stronger and more interpretable biological insights.
Functional enrichment is an essential component of any RNA-seq analysis workflow.