Overviewยถ
delnx is a Python package for differential expression analysis of single-cell RNA-seq data. It provides a unified interface to several ways of performing differential expression analysis with (generalized) linear models. Most methods are implemented in JAX for GPU-accelerated DE testing, with statsmodels as a fallback for binomial GLMs. To get you started, hereโs a basic example of how to use delnx for differential expression analysis:
import delnx as dx
import scanpy as sc
# Load example data
adata = sc.read_h5ad("data/GLI3_KO_45d.h5ad")
adata.layers["counts"] = adata.X.copy() # Store raw counts in a separate layer
adata.obs["GLI3_KO"] = adata.obs["GLI3_KO"].astype(str) # Ensure string type for design matrix
# Some basic preprocessing
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
print(adata)
AnnData object with n_obs ร n_vars = 22410 ร 18653
obs: 'organoid', 'GLI3_KO', 'cell_type'
uns: 'log1p'
obsm: 'pca', 'umap'
layers: 'counts'
# Run differential expression analysis between conditions (here knockout vs. control)
de_results = dx.tl.de(
adata,
method="lr", # DE method: "lr" for logistic regression
condition_key="GLI3_KO", # Condition key for DE analysis
reference="True", # Reference level
contrast="GLI3_KO[T.False]", # Test knockout vs control
)
print(de_results)
INFO Inferred data type: lognorm
INFO Running DE for 16199 features
feature log2fc coef stat pval padj
0 SHISA3 -0.478362 -0.331575 3.397817e+02 1.000000e-50 7.534419e-49
1 FZD5 -0.441296 -0.305883 3.975457e+02 1.000000e-50 7.534419e-49
2 CRYBA1 -0.410040 -0.284218 2.466093e+02 1.000000e-50 7.534419e-49
3 NKX6-1 -0.409135 -0.283591 3.141194e+02 1.000000e-50 7.534419e-49
4 SFTA3 -0.353181 -0.244807 1.449701e+03 1.000000e-50 7.534419e-49
... ... ... ... ... ... ...
16194 ECM1 -0.000050 -0.000035 5.194038e-07 9.994250e-01 9.996718e-01
16195 KLHL11 -0.000031 -0.000021 3.173746e-07 9.995505e-01 9.997201e-01
16196 MED14 -0.000008 -0.000006 2.555619e-07 9.995967e-01 9.997201e-01
16197 PYGB 0.000002 0.000001 2.436492e-08 9.998755e-01 9.999372e-01
16198 TRIM58 0.000004 0.000002 2.388738e-09 9.999610e-01 9.999610e-01
[16199 rows x 6 columns]
What you get back from de() is a DataFrame with the results of the differential expression analysis. The rows are genes and the columns are the results of the DE testing: log2fc, coef, stat, pval, and padj.
Groupingยถ
Often, we want to split the dataset into groups, to perform testing e.g. within each cell type. delnx provides the grouped() wrapper for this. It runs any DE function separately for each group and re-corrects p-values across all groups.
# Run differential expression analysis within groups
de_results = dx.tl.grouped(
dx.tl.de,
adata,
group_key="cell_type", # Group by cell type
method="lr", # DE method: "lr" for logistic regression
condition_key="GLI3_KO", # Condition key for DE analysis
reference="True", # Reference level
contrast="GLI3_KO[T.False]", # Test knockout vs control
)
print(de_results)
INFO Running DE for group: mesen_npc
INFO Inferred data type: lognorm
INFO Running DE for 15443 features
INFO Running DE for group: mesen_ex
INFO Inferred data type: lognorm
INFO Running DE for 15320 features
INFO Running DE for group: ctx_npc
INFO Inferred data type: lognorm
INFO Running DE for 14080 features
INFO Running DE for group: ge_in
INFO Inferred data type: lognorm
INFO Running DE for 14462 features
INFO Running DE for group: ge_npc
INFO Inferred data type: lognorm
INFO Running DE for 14906 features
INFO Running DE for group: ctx_ip
INFO Inferred data type: lognorm
INFO Running DE for 12147 features
INFO Running DE for group: ctx_ex
INFO Inferred data type: lognorm
INFO Running DE for 13129 features
feature log2fc coef stat pval padj \
0 RGS20 -0.438849 -0.304187 1.061743e+02 4.705923e-23 8.590426e-21
1 QPCTL -1.223741 -0.848233 9.882085e+01 1.126694e-21 1.822624e-19
2 ASF1B -0.927407 -0.642829 9.369446e+01 1.052890e-20 1.594351e-18
3 PON2 -0.299645 -0.207698 9.349043e+01 1.151250e-20 1.740644e-18
4 POMC -0.722494 -0.500794 8.947009e+01 6.728782e-20 9.481960e-18
... ... ... ... ... ... ...
99482 OSBPL8 0.000012 0.000008 8.110453e-07 9.992815e-01 9.996337e-01
99483 EGFL8 0.000100 0.000069 5.362082e-07 9.994158e-01 9.997055e-01
99484 PTPN7 -0.000056 -0.000039 2.285110e-07 9.996186e-01 9.997794e-01
99485 DEAF1 -0.000008 -0.000006 1.727580e-07 9.996684e-01 9.998091e-01
99486 SLFN13 0.000030 0.000021 1.247965e-08 9.999109e-01 9.999209e-01
group
0 ctx_ex
1 ctx_ex
2 ctx_ex
3 ctx_ex
4 ctx_ex
... ...
99482 mesen_npc
99483 mesen_npc
99484 mesen_npc
99485 mesen_npc
99486 mesen_npc
[99487 rows x 7 columns]
Now the result has one additional column, group, which indicates the group the gene was tested in.
Pseudo-bulkingยถ
It is often advisable to not test on the single-cell level, but to aggregate the data to a (pseudo-)bulk level first. This better accounts for variation between actual biological replicates. delnx provides a thin wrapper around the decoupler function to do this:
adata_pb = dx.pp.pseudobulk(
adata,
sample_key="organoid", # Sample key for pseudobulk aggregation (the biological replicate)
group_key="cell_type", # Group key for pseudobulk aggregation
n_pseudoreps=2, # Optionally, the data can be split into multiple pseudoreplicates. This can be useful if the number of actual biological replicates is low.
layer="counts", # Layer to use for pseudobulk aggregation, e.g. "counts" or None for adata.X
mode="sum",
)
print(adata_pb)
AnnData object with n_obs ร n_vars = 81 ร 16199
obs: 'psbulk_replicate', 'cell_type', 'organoid', 'GLI3_KO', 'psbulk_cells', 'psbulk_counts'
layers: 'psbulk_props'
Available Methodsยถ
Method |
Function |
Data type |
Test |
Best for |
|---|---|---|---|---|
ANOVA |
|
Log-normalized / scaled |
F-test |
Fast screening, continuous data |
Logistic regression |
|
Log-normalized / scaled |
LR test |
Covariate control, classification-style DE |
Binomial |
|
Binary (0/1) |
LR test |
Proportion-based DE (e.g. ATAC-seq) |
Negative binomial |
|
Raw counts |
Quasi-likelihood F |
Pseudobulk, perturbation screens |
AUROC ranking |
|
Any |
Mann-Whitney U |
Fast multi-group markers |
LR vs binomial: Both use likelihood ratio tests but model different things. LR predicts the condition label from gene expression (classification-style); binomial models binary expression as a function of condition (GLM-style).
Design matrix: condition_key vs formulaยถ
There are two ways to specify the model design. The condition_key shorthand builds
a simple one-factor design automatically. The formula parameter gives full control
via R-style formulas (patsy syntax), which is useful for adding covariates or interactions.
dx.tl.de() (anova, lr, binomial)ยถ
# condition_key shorthand โ simple two-group test
results = dx.tl.de(adata, condition_key="GLI3_KO", reference="True",
contrast="GLI3_KO[T.False]", method="lr")
# formula โ same test, explicit formula
results = dx.tl.de(adata, formula="~ GLI3_KO", reference="True",
contrast="GLI3_KO[T.False]", method="lr")
# formula โ with batch covariate
results = dx.tl.de(adata, formula="~ GLI3_KO + organoid",
contrast="GLI3_KO[T.False]", reference="True", method="anova")
# formula โ continuous covariate (no reference needed)
results = dx.tl.de(adata, formula="~ age + sex", contrast="age", method="anova")
dx.tl.nb_fit() + dx.tl.nb_test() (negative binomial)ยถ
# condition_key shorthand
fit = dx.tl.nb_fit(adata_pb, condition_key="GLI3_KO", reference="True")
results = dx.tl.nb_test(adata_pb, fit, contrast="GLI3_KO[T.False]")
# formula โ with batch covariate
fit = dx.tl.nb_fit(adata_pb, formula="~ GLI3_KO + organoid", reference="True")
results = dx.tl.nb_test(adata_pb, fit, contrast="GLI3_KO[T.False]")
# formula โ interaction term
fit = dx.tl.nb_fit(adata_pb, formula="~ GLI3_KO * cell_type", reference="True")
dx.tl.rank_de() (AUROC markers)ยถ
# Only supports condition_key (no formula)
results = dx.tl.rank_de(adata, condition_key="cell_type")