
Faster filtering of sparse matrix #3465

Open
wants to merge 4 commits into base: main
Conversation

@kaushalprasadhial (Contributor) commented Feb 13, 2025

This PR makes filtering of sparse matrices faster. We have tested it on an AWS c7i.24xlarge instance.

# code to test
import time
import numpy as np

import pandas as pd

import scanpy as sc
from sklearn.cluster import KMeans

import os
import wget

import warnings

starting_time = time.time()
print(os.getpid())
warnings.filterwarnings('ignore', 'Expected ')
warnings.simplefilter('ignore')
input_file = "./1M_brain_cells_10X.sparse.h5ad"

if not os.path.exists(input_file):
    print('Downloading import file...')
    wget.download('https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad',input_file)


# marker genes
MITO_GENE_PREFIX = "mt-" # Prefix for mitochondrial genes to regress out
markers = ["Stmn2", "Hes1", "Olig1"] # Marker genes for visualization

# filtering cells
min_genes_per_cell = 200 # Filter out cells with fewer genes than this expressed
max_genes_per_cell = 6000 # Filter out cells with more genes than this expressed

# filtering genes
min_cells_per_gene = 1 # Filter out genes expressed in fewer cells than this
n_top_genes = 4000 # Number of highly variable genes to retain

# PCA
n_components = 50 # Number of principal components to compute

# t-SNE
tsne_n_pcs = 20 # Number of principal components to use for t-SNE

# k-means
k = 35 # Number of clusters for k-means

# Gene ranking

ranking_n_top_genes = 50 # Number of differential genes to compute for each cluster

# Number of parallel jobs
sc.settings.n_jobs = os.cpu_count()

start=time.time()
tr=time.time()
adata = sc.read(input_file)
adata.var_names_make_unique()
adata.shape
print("Total read time : %s" % (time.time()-tr))




# To reduce the number of cells:
USE_FIRST_N_CELLS = 1300000
adata = adata[0:USE_FIRST_N_CELLS]
adata.shape

tfcmin=time.time()
sc.pp.filter_cells(adata, min_genes=min_genes_per_cell)
print("Total filter min_genes_per_cell : %s" % (time.time()-tfcmin))
tfcmax=time.time()
sc.pp.filter_cells(adata, max_genes=max_genes_per_cell)
print("Total filter max_genes_per_cell : %s" % (time.time()-tfcmax))

tfg=time.time()
sc.pp.filter_genes(adata, min_cells=min_cells_per_gene)
print("Total filter genes : %s" % (time.time()-tfg))

tnt=time.time()
sc.pp.normalize_total(adata, target_sum=1e4)
print("Total filter normalize_total : %s" % (time.time()-tnt))
print("Total filter and normalize time : %s" % (time.time()-tr))

tr=time.time()
sc.pp.log1p(adata)
print("Total log time : %s" % (time.time()-tr))

# Select highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor = "cell_ranger")
print("after ")
# Retain marker gene expression
for marker in markers:
    adata.obs[marker + "_raw"] = adata.X[:, adata.var.index == marker].toarray().ravel()
print("after for loop")
# Filter matrix to only variable genes
adata = adata[:, adata.var.highly_variable]

ts=time.time()
#Regress out confounding factors (number of counts, mitochondrial gene expression)
mito_genes = adata.var_names.str.startswith(MITO_GENE_PREFIX)
print("mito_genes")
n_counts = np.array(adata.X.sum(axis=1))
print("n_counts")
adata.obs['percent_mito'] = np.array(np.sum(adata[:, mito_genes].X, axis=1)) / n_counts
adata.obs['n_counts'] = n_counts
print("adata.obs['n_counts']")

sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
print("Total regress out time : %s" % (time.time()-ts))

#scale

ts=time.time()
sc.pp.scale(adata)
print("Total scale time : %s" % (time.time()-ts))
print("total time taken", (time.time()-starting_time))
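For context on why this step is a good optimization target: counting expressed genes per cell on CSR data reduces to differencing the index pointer array, with no dense scan needed. The sketch below is only an illustration of that idea using a hand-built CSR layout, not the PR's actual implementation:

```python
import numpy as np

# Hand-built CSR representation of a tiny cells x genes matrix:
# cell 0 expresses 3 genes, cell 1 expresses 1 gene, cell 2 expresses 0 genes.
data    = np.array([5.0, 2.0, 1.0, 7.0])
indices = np.array([0, 2, 3, 1])
indptr  = np.array([0, 3, 4, 4])

# Genes expressed per cell = nonzeros per row = consecutive indptr differences.
genes_per_cell = np.diff(indptr)
print(genes_per_cell)  # [3 1 0]

# Keep cells with at least min_genes expressed, as sc.pp.filter_cells does.
min_genes = 1
keep = genes_per_cell >= min_genes
print(keep)  # [ True  True False]
```

Because `indptr` has one entry per row boundary, this costs O(n_cells) regardless of how many nonzeros the matrix holds.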

Original output
Total filter min_genes_per_cell : 175.78095293045044
Total filter max_genes_per_cell : 66.684401512146
Total filter genes : 60.77386426925659

Output with our code
Total filter min_genes_per_cell : 131.49908232688904
Total filter max_genes_per_cell : 24.586800575256348
Total filter genes : 32.83677649497986

So in total the filtering steps are around 37% faster for sparse matrices.
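The aggregate figure can be checked directly from the three timings reported above:

```python
# Timings (seconds) copied from the outputs above.
orig = [175.78095293045044, 66.684401512146, 60.77386426925659]
ours = [131.49908232688904, 24.586800575256348, 32.83677649497986]

speedup = 1 - sum(ours) / sum(orig)
print(f"{speedup:.1%}")  # 37.7%
```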

We have one failing test case, which fails on both the main branch and our branch:

    def test_sam():
        adata_ref = pbmc3k()
        ix = np.random.choice(adata_ref.shape[0], size=200, replace=False)
        adata = adata_ref[ix, :].copy()
        sc.pp.normalize_total(adata, target_sum=10000)
        sc.pp.log1p(adata)
        sce.tl.sam(adata, inplace=True)
        uns_keys = list(adata.uns.keys())
        obsm_keys = list(adata.obsm.keys())
>       assert all(["sam" in uns_keys, "X_umap" in obsm_keys, "neighbors" in uns_keys])
E       assert False
E        +  where False = all([True, True, False])

Please tell me how I can fix this failing check.

  • Closes #
  • Tests included or not required because:
  • [x] Release notes not necessary because: This PR does filtering of cells and genes faster for sparse matrix data only


codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 68.00000% with 16 lines in your changes missing coverage. Please review.

Project coverage is 75.33%. Comparing base (3075537) to head (14405f9).

Files with missing lines Patch % Lines
src/scanpy/preprocessing/_simple.py 68.00% 16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3465      +/-   ##
==========================================
- Coverage   75.44%   75.33%   -0.12%     
==========================================
  Files         113      113              
  Lines       13250    13311      +61     
==========================================
+ Hits         9997    10028      +31     
- Misses       3253     3283      +30     
Files with missing lines Coverage Δ
src/scanpy/preprocessing/_simple.py 84.90% <68.00%> (-5.25%) ⬇️

@kaushalprasadhial kaushalprasadhial marked this pull request as ready for review February 14, 2025 05:52