Overview¶

delnx is a Python package for differential expression analysis of single-cell RNA-seq data. It provides a unified interface to several ways of performing differential expression analysis with (generalized) linear models. Most methods are implemented in JAX for GPU-accelerated DE testing, with statsmodels as a fallback for binomial GLMs. To get you started, here’s a basic example of how to use delnx for differential expression analysis:

import delnx as dx
import scanpy as sc


# Load example data
adata = sc.read_h5ad("data/GLI3_KO_45d.h5ad")
adata.layers["counts"] = adata.X.copy()  # Store raw counts in a separate layer
adata.obs["GLI3_KO"] = adata.obs["GLI3_KO"].astype(str)  # Ensure string type for design matrix

# Some basic preprocessing
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

print(adata)

AnnData object with n_obs × n_vars = 22410 × 18653
    obs: 'organoid', 'GLI3_KO', 'cell_type'
    uns: 'log1p'
    obsm: 'pca', 'umap'
    layers: 'counts'

# Run differential expression analysis between conditions (here knockout vs. control)
de_results = dx.tl.de(
    adata,
    method="lr",  # DE method: "lr" for logistic regression
    condition_key="GLI3_KO",  # Condition key for DE analysis
    reference="True",  # Reference level
    contrast="GLI3_KO[T.False]",  # Test knockout vs control
)

print(de_results)

INFO     Inferred data type: lognorm                                                                               
INFO     Running DE for 16199 features
      feature    log2fc      coef          stat          pval          padj
0      SHISA3 -0.478362 -0.331575  3.397817e+02  1.000000e-50  7.534419e-49
1        FZD5 -0.441296 -0.305883  3.975457e+02  1.000000e-50  7.534419e-49
2      CRYBA1 -0.410040 -0.284218  2.466093e+02  1.000000e-50  7.534419e-49
3      NKX6-1 -0.409135 -0.283591  3.141194e+02  1.000000e-50  7.534419e-49
4       SFTA3 -0.353181 -0.244807  1.449701e+03  1.000000e-50  7.534419e-49
...       ...       ...       ...           ...           ...           ...
16194    ECM1 -0.000050 -0.000035  5.194038e-07  9.994250e-01  9.996718e-01
16195  KLHL11 -0.000031 -0.000021  3.173746e-07  9.995505e-01  9.997201e-01
16196   MED14 -0.000008 -0.000006  2.555619e-07  9.995967e-01  9.997201e-01
16197    PYGB  0.000002  0.000001  2.436492e-08  9.998755e-01  9.999372e-01
16198  TRIM58  0.000004  0.000002  2.388738e-09  9.999610e-01  9.999610e-01

[16199 rows x 6 columns]

What you get back from de() is a DataFrame with the results of the differential expression analysis. The rows are genes and the columns are the results of the DE testing: log2fc, coef, stat, pval, and padj.

Grouping¶

Often, we want to split the dataset into groups, to perform testing e.g. within each cell type. delnx provides the grouped() wrapper for this. It runs any DE function separately for each group and re-corrects p-values across all groups.

# Run differential expression analysis within groups
de_results = dx.tl.grouped(
    dx.tl.de,
    adata,
    group_key="cell_type",  # Group by cell type
    method="lr",  # DE method: "lr" for logistic regression
    condition_key="GLI3_KO",  # Condition key for DE analysis
    reference="True",  # Reference level
    contrast="GLI3_KO[T.False]",  # Test knockout vs control
)

print(de_results)

INFO     Running DE for group: mesen_npc                                                                           
INFO     Inferred data type: lognorm                                                                               
INFO     Running DE for 15443 features
INFO     Running DE for group: mesen_ex
INFO     Inferred data type: lognorm                                                                               
INFO     Running DE for 15320 features
INFO     Running DE for group: ctx_npc                                                                             
INFO     Inferred data type: lognorm                                                                               
INFO     Running DE for 14080 features
INFO     Running DE for group: ge_in                                                                               
INFO     Inferred data type: lognorm
INFO     Running DE for 14462 features
INFO     Running DE for group: ge_npc
INFO     Inferred data type: lognorm                                                                               
INFO     Running DE for 14906 features
INFO     Running DE for group: ctx_ip                                                                              
INFO     Inferred data type: lognorm                                                                               
INFO     Running DE for 12147 features
INFO     Running DE for group: ctx_ex                                                                              
INFO     Inferred data type: lognorm                                                                               
INFO     Running DE for 13129 features
      feature    log2fc      coef          stat          pval          padj  \
0       RGS20 -0.438849 -0.304187  1.061743e+02  4.705923e-23  8.590426e-21   
1       QPCTL -1.223741 -0.848233  9.882085e+01  1.126694e-21  1.822624e-19   
2       ASF1B -0.927407 -0.642829  9.369446e+01  1.052890e-20  1.594351e-18   
3        PON2 -0.299645 -0.207698  9.349043e+01  1.151250e-20  1.740644e-18   
4        POMC -0.722494 -0.500794  8.947009e+01  6.728782e-20  9.481960e-18   
...       ...       ...       ...           ...           ...           ...   
99482  OSBPL8  0.000012  0.000008  8.110453e-07  9.992815e-01  9.996337e-01   
99483   EGFL8  0.000100  0.000069  5.362082e-07  9.994158e-01  9.997055e-01   
99484   PTPN7 -0.000056 -0.000039  2.285110e-07  9.996186e-01  9.997794e-01   
99485   DEAF1 -0.000008 -0.000006  1.727580e-07  9.996684e-01  9.998091e-01   
99486  SLFN13  0.000030  0.000021  1.247965e-08  9.999109e-01  9.999209e-01   

           group  
0         ctx_ex  
1         ctx_ex  
2         ctx_ex  
3         ctx_ex  
4         ctx_ex  
...          ...  
99482  mesen_npc  
99483  mesen_npc  
99484  mesen_npc  
99485  mesen_npc  
99486  mesen_npc  

[99487 rows x 7 columns]

Now the result has one additional column, group, which indicates the group the gene was tested in.

Pseudo-bulking¶

It is often advisable to not test on the single-cell level, but to aggregate the data to a (pseudo-)bulk level first. This better accounts for variation between actual biological replicates. delnx provides a thin wrapper around the decoupler function to do this:

adata_pb = dx.pp.pseudobulk(
    adata,
    sample_key="organoid",  # Sample key for pseudobulk aggregation (the biological replicate)
    group_key="cell_type",  # Group key for pseudobulk aggregation
    n_pseudoreps=2,  # Optionally, the data can be split into multiple pseudoreplicates. This can be useful if the number of actual biological replicates is low.
    layer="counts",  # Layer to use for pseudobulk aggregation, e.g. "counts" or None for adata.X
    mode="sum",
)

print(adata_pb)

AnnData object with n_obs × n_vars = 81 × 16199
    obs: 'psbulk_replicate', 'cell_type', 'organoid', 'GLI3_KO', 'psbulk_cells', 'psbulk_counts'
    layers: 'psbulk_props'

Available Methods¶

Method	Function	Data type	Test	Best for
ANOVA	`de(method="anova")`	Log-normalized / scaled	F-test	Fast screening, continuous data
Logistic regression	`de(method="lr")`	Log-normalized / scaled	LR test	Covariate control, classification-style DE
Binomial	`de(method="binomial")`	Binary (0/1)	LR test	Proportion-based DE (e.g. ATAC-seq)
Negative binomial	`nb_fit()` + `nb_test()`	Raw counts	Quasi-likelihood F	Pseudobulk, perturbation screens
AUROC ranking	`rank_de()`	Any	Mann-Whitney U	Fast multi-group markers

LR vs binomial: Both use likelihood ratio tests but model different things. LR predicts the condition label from gene expression (classification-style); binomial models binary expression as a function of condition (GLM-style).

Design matrix: `condition_key` vs `formula`¶

There are two ways to specify the model design. The condition_key shorthand builds a simple one-factor design automatically. The formula parameter gives full control via R-style formulas (patsy syntax), which is useful for adding covariates or interactions.

`dx.tl.de()` (anova, lr, binomial)¶

# condition_key shorthand — simple two-group test
results = dx.tl.de(adata, condition_key="GLI3_KO", reference="True",
                   contrast="GLI3_KO[T.False]", method="lr")

# formula — same test, explicit formula
results = dx.tl.de(adata, formula="~ GLI3_KO", reference="True",
                   contrast="GLI3_KO[T.False]", method="lr")

# formula — with batch covariate
results = dx.tl.de(adata, formula="~ GLI3_KO + organoid",
                   contrast="GLI3_KO[T.False]", reference="True", method="anova")

# formula — continuous covariate (no reference needed)
results = dx.tl.de(adata, formula="~ age + sex", contrast="age", method="anova")

`dx.tl.nb_fit()` + `dx.tl.nb_test()` (negative binomial)¶

# condition_key shorthand
fit = dx.tl.nb_fit(adata_pb, condition_key="GLI3_KO", reference="True")
results = dx.tl.nb_test(adata_pb, fit, contrast="GLI3_KO[T.False]")

# formula — with batch covariate
fit = dx.tl.nb_fit(adata_pb, formula="~ GLI3_KO + organoid", reference="True")
results = dx.tl.nb_test(adata_pb, fit, contrast="GLI3_KO[T.False]")

# formula — interaction term
fit = dx.tl.nb_fit(adata_pb, formula="~ GLI3_KO * cell_type", reference="True")

`dx.tl.rank_de()` (AUROC markers)¶

# Only supports condition_key (no formula)
results = dx.tl.rank_de(adata, condition_key="cell_type")

Overview¶

Grouping¶

Pseudo-bulking¶

Available Methods¶

Design matrix: condition_key vs formula¶

dx.tl.de() (anova, lr, binomial)¶

dx.tl.nb_fit() + dx.tl.nb_test() (negative binomial)¶

dx.tl.rank_de() (AUROC markers)¶

Design matrix: `condition_key` vs `formula`¶

`dx.tl.de()` (anova, lr, binomial)¶

`dx.tl.nb_fit()` + `dx.tl.nb_test()` (negative binomial)¶

`dx.tl.rank_de()` (AUROC markers)¶