
Faster filtering of sparse matrix #3465

Open
wants to merge 4 commits into base: main
Conversation

@kaushalprasadhial (Contributor) commented Feb 13, 2025

This PR makes filtering of sparse matrices faster. We have tested it on an AWS c7i.24xlarge instance.

# code to test
import time
import numpy as np

import pandas as pd

import scanpy as sc
from sklearn.cluster import KMeans

import os
import wget

import warnings

starting_time = time.time()
print(os.getpid())
warnings.filterwarnings('ignore', 'Expected ')
warnings.simplefilter('ignore')
input_file = "./1M_brain_cells_10X.sparse.h5ad"

if not os.path.exists(input_file):
    print('Downloading import file...')
    wget.download('https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad',input_file)


# marker genes
MITO_GENE_PREFIX = "mt-" # Prefix for mitochondrial genes to regress out
markers = ["Stmn2", "Hes1", "Olig1"] # Marker genes for visualization

# filtering cells
min_genes_per_cell = 200 # Filter out cells with fewer genes than this expressed
max_genes_per_cell = 6000 # Filter out cells with more genes than this expressed

# filtering genes
min_cells_per_gene = 1 # Filter out genes expressed in fewer cells than this
n_top_genes = 4000 # Number of highly variable genes to retain

# PCA
n_components = 50 # Number of principal components to compute

# t-SNE
tsne_n_pcs = 20 # Number of principal components to use for t-SNE

# k-means
k = 35 # Number of clusters for k-means

# Gene ranking

ranking_n_top_genes = 50 # Number of differential genes to compute for each cluster

# Number of parallel jobs
sc.settings.n_jobs = os.cpu_count()

start=time.time()
tr=time.time()
adata = sc.read(input_file)
adata.var_names_make_unique()
adata.shape
print("Total read time : %s" % (time.time()-tr))




# To reduce the number of cells:
USE_FIRST_N_CELLS = 1300000
adata = adata[0:USE_FIRST_N_CELLS]
adata.shape

tfcmin=time.time()
sc.pp.filter_cells(adata, min_genes=min_genes_per_cell)
print("Total filter min_genes_per_cell : %s" % (time.time()-tfcmin))
tfcmax=time.time()
sc.pp.filter_cells(adata, max_genes=max_genes_per_cell)
print("Total filter max_genes_per_cell : %s" % (time.time()-tfcmax))

tfg=time.time()
sc.pp.filter_genes(adata, min_cells=min_cells_per_gene)
print("Total filter genes : %s" % (time.time()-tfg))

tnt=time.time()
sc.pp.normalize_total(adata, target_sum=1e4)
print("Total filter normalize_total : %s" % (time.time()-tnt))
print("Total filter and normalize time : %s" % (time.time()-tr))

tr=time.time()
sc.pp.log1p(adata)
print("Total log time : %s" % (time.time()-tr))

# Select highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor = "cell_ranger")
print("after ")
# Retain marker gene expression
for marker in markers:
    adata.obs[marker + "_raw"] = adata.X[:, adata.var.index == marker].toarray().ravel()
print("after for loop")
# Filter matrix to only variable genes
adata = adata[:, adata.var.highly_variable]

ts=time.time()
#Regress out confounding factors (number of counts, mitochondrial gene expression)
mito_genes = adata.var_names.str.startswith(MITO_GENE_PREFIX)
print("mito_genes")
n_counts = np.array(adata.X.sum(axis=1))
print("n_counts")
adata.obs['percent_mito'] = np.array(np.sum(adata[:, mito_genes].X, axis=1)) / n_counts
adata.obs['n_counts'] = n_counts
print("adata.obs['n_counts']")

sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
print("Total regress out time : %s" % (time.time()-ts))

#scale

ts=time.time()
sc.pp.scale(adata)
print("Total scale time : %s" % (time.time()-ts))
print("total time taken", (time.time()-starting_time))
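For context on why this step is a good optimization target: counting expressed genes per cell on CSR data reduces to differencing the index pointer array, with no dense scan needed. The sketch below is only an illustration of that idea using a hand-built CSR layout, not the PR's actual implementation:

```python
import numpy as np

# Hand-built CSR representation of a tiny cells x genes matrix:
# cell 0 expresses 3 genes, cell 1 expresses 1 gene, cell 2 expresses 0 genes.
data    = np.array([5.0, 2.0, 1.0, 7.0])
indices = np.array([0, 2, 3, 1])
indptr  = np.array([0, 3, 4, 4])

# Genes expressed per cell = nonzeros per row = consecutive indptr differences.
genes_per_cell = np.diff(indptr)
print(genes_per_cell)  # [3 1 0]

# Keep cells with at least min_genes expressed, as sc.pp.filter_cells does.
min_genes = 1
keep = genes_per_cell >= min_genes
print(keep)  # [ True  True False]
```

Because `indptr` has one entry per row boundary, this costs O(n_cells) regardless of how many nonzeros the matrix holds.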

Original output
Total filter min_genes_per_cell : 175.78095293045044
Total filter max_genes_per_cell : 66.684401512146
Total filter genes : 60.77386426925659

Output with our code
Total filter min_genes_per_cell : 131.49908232688904
Total filter max_genes_per_cell : 24.586800575256348
Total filter genes : 32.83677649497986

So in total the filtering steps are around 37% faster for sparse matrices.
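The aggregate figure can be checked directly from the three timings reported above:

```python
# Timings (seconds) copied from the outputs above.
orig = [175.78095293045044, 66.684401512146, 60.77386426925659]
ours = [131.49908232688904, 24.586800575256348, 32.83677649497986]

speedup = 1 - sum(ours) / sum(orig)
print(f"{speedup:.1%}")  # 37.7%
```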

We have one failing test case, which fails on both the main branch and our branch:

    def test_sam():
        adata_ref = pbmc3k()
        ix = np.random.choice(adata_ref.shape[0], size=200, replace=False)
        adata = adata_ref[ix, :].copy()
        sc.pp.normalize_total(adata, target_sum=10000)
        sc.pp.log1p(adata)
        sce.tl.sam(adata, inplace=True)
        uns_keys = list(adata.uns.keys())
        obsm_keys = list(adata.obsm.keys())
>       assert all(["sam" in uns_keys, "X_umap" in obsm_keys, "neighbors" in uns_keys])
E       assert False
E        +  where False = all([True, True, False])

Please tell me how I can fix this failing check.

  • Closes #
  • Tests included or not required because:
  • [x] Release notes not necessary because: This PR does filtering of cells and genes faster for sparse matrix data only


codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 68.00000% with 16 lines in your changes missing coverage. Please review.

Project coverage is 75.33%. Comparing base (3075537) to head (14405f9).

Files with missing lines Patch % Lines
src/scanpy/preprocessing/_simple.py 68.00% 16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3465      +/-   ##
==========================================
- Coverage   75.44%   75.33%   -0.12%     
==========================================
  Files         113      113              
  Lines       13250    13311      +61     
==========================================
+ Hits         9997    10028      +31     
- Misses       3253     3283      +30     
Files with missing lines Coverage Δ
src/scanpy/preprocessing/_simple.py 84.90% <68.00%> (-5.25%) ⬇️

@kaushalprasadhial kaushalprasadhial marked this pull request as ready for review February 14, 2025 05:52