---
name: bulk-rna-seq-differential-expression-with-omicverse
title: Bulk RNA-seq differential expression with omicverse
description: "Bulk RNA-seq DEG pipeline: gene ID mapping, DESeq2 normalization, statistical testing, volcano plots, and pathway enrichment in OmicVerse."
---
# Bulk RNA-seq differential expression with omicverse
## Overview
Follow this skill to run the end-to-end differential expression (DEG) workflow showcased in [`t_deg.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_deg.ipynb). It assumes the user provides a raw gene-level count matrix (e.g., from featureCounts) and wants to analyse bulk RNA-seq cohorts inside omicverse.
## Instructions
1. **Set up the session**
- Import `omicverse as ov`, `scanpy as sc`, and `matplotlib.pyplot as plt`.
- Call `ov.plot_set()` so downstream plots adopt omicverse styling.
2. **Prepare ID mapping assets**
- When gene IDs must be converted to gene symbols, instruct the user to download mapping pairs via `ov.utils.download_geneid_annotation_pair()` and store them under `genesets/`.
- Mention the available prebuilt genomes (T2T-CHM13, GRCh38, GRCh37, GRCm39, danRer7, danRer11) and that users can generate their own mapping from GTF files if needed.
3. **Load the raw counts**
- Read tab-delimited featureCounts output with `ov.pd.read_csv(..., sep='\t', header=1, index_col=0)`.
- Strip trailing `.bam` segments from column names using list comprehension so sample IDs are clean.
4. **Map gene identifiers**
- Run `ov.bulk.Matrix_ID_mapping(counts_df, 'genesets/pair_<GENOME>.tsv')` to replace `gene_id` entries with gene symbols.
5. **Initialise the DEG object**
- Create `dds = ov.bulk.pyDEG(mapped_counts)`.
- Handle duplicate gene symbols with `dds.drop_duplicates_index()` to keep the highest expressed version.
6. **Normalise and estimate size factors**
- Execute `dds.normalize()` to calculate DESeq2 size factors, correcting for library size and batch differences.
7. **Run differential testing**
- Collect treatment and control replicate labels into lists.
- Call `dds.deg_analysis(treatment_groups, control_groups, method='ttest')` for the default Welch t-test.
- Offer optional alternatives: `method='edgepy'` for edgeR-like tests and `method='limma'` for limma-style modelling.
8. **Filter and threshold results**
- Note that lowly expressed genes are retained by default; filter using `dds.result.loc[dds.result['log2(BaseMean)'] > 1]` when needed.
- Set dynamic fold-change and significance cutoffs via `dds.foldchange_set(fc_threshold=-1, pval_threshold=0.05, logp_max=6)` (`fc_threshold=-1` auto-selects based on log2FC distribution).
9. **Visualise differential expression**
- Produce volcano plots with `dds.plot_volcano(title=..., figsize=..., plot_genes=... or plot_genes_num=...)` to highlight key genes.
- Generate per-gene boxplots using `dds.plot_boxplot(genes=[...], treatment_groups=..., control_groups=..., figsize=..., legend_bbox=...)`; adjust y-axis tick labels if required.
10. **Perform pathway enrichment (optional)**
- Download curated pathway libraries through `ov.utils.download_pathway_database()`.
- Load genesets with `ov.utils.geneset_prepare(<path>, organism='Mouse'|'Human'|...)`.
- Build the DEG gene list from `dds.result.loc[dds.result['sig'] != 'normal'].index`.
- Run enrichment with `ov.bulk.geneset_enrichment(gene_list=deg_genes, pathways_dict=..., pvalue_type='auto', organism=...)`. Encourage users without internet access to provide a `background` gene list.
- Visualise single-library results via `ov.bulk.geneset_plot(...)` and combine multiple ontologies using `ov.bulk.geneset_plot_multi(enr_dict, colors_dict, num=...)`.
11. **Document outputs**
- Suggest exporting `dds.result` and enrichment tables to CSV for downstream reporting.
- Encourage users to save figures generated by matplotlib (`plt.savefig(...)`) when running outside notebooks.
12. **Defensive validation**
```python
# Before DEG: verify treatment/control groups exist as column names
all_cols = set(dds.result.columns) if hasattr(dds, 'result') else set(counts_df.columns)
for g in treatment_groups + control_groups:
assert g in all_cols, f"Sample '{g}' not found in count matrix columns"
# Verify groups don't overlap
assert not set(treatment_groups) & set(control_groups), "Treatment and control groups must not overlap"
```
13. **Troubleshooting tips**
- Ensure sample labels in `treatment_groups`/`control_groups` exactly match column names post-cleanup.
- Verify required packages (`omicverse`, `pyComplexHeatmap`, `gseapy`) are installed for enrichment visualisations.
- Remind users that internet access is required the first time they download gene mappings or pathway databases.
## Examples
- "I have a featureCounts matrix for mouse tumour samples—normalize it with DESeq2, run t-test DEG, and highlight the top 8 genes in a volcano plot."
- "Use omicverse to compute edgeR-style differential expression between treated and control replicates, then run GO enrichment on significant genes."
- "Guide me through converting Ensembl IDs to symbols, performing limma DEG, and plotting boxplots for Krtap9-5 and Lef1."
## References
- Detailed walkthrough notebook: [`t_deg.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_deg.ipynb)
- Sample count matrix for testing: [`sample/counts.txt`](../../sample/counts.txt)
- Quick copy/paste commands: [`reference.md`](reference.md)