
~ 35.63% faster filtering #3330

Draft
wants to merge 118 commits into main

Conversation

kaushalprasadhial
Contributor

@kaushalprasadhial kaushalprasadhial commented Oct 29, 2024

We are submitting this PR to speed up filtering.

| | Time |
|---|---|
| Original | 290.59 |
| Updated | 187.03 |
| Speedup | 35.63% |
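
For reference, the percentage in the table can be reproduced from the two timings. This is a minimal sketch, assuming the times are wall-clock seconds (the benchmark script uses `time.time()`); the variable names are illustrative and not part of the PR.

```python
# Timings copied from the table above; units are assumed to be seconds.
original = 290.59
updated = 187.03

# Relative reduction in run time, as a percentage of the original run.
reduction_pct = (original - updated) / original * 100

# Equivalent ratio: how many times faster the updated run is.
speedup_factor = original / updated

print(f"time reduction: {reduction_pct:.2f}%")
print(f"speedup factor: {speedup_factor:.2f}x")
```

Read this way, the reported figure is a reduction in run time of about 35.6%, which corresponds to the updated code running roughly 1.55x as fast.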

The experiment was run on an AWS r7i.24xlarge instance with the following script:

import time
import numpy as np

import pandas as pd

import scanpy as sc
from sklearn.cluster import KMeans

import os
import wget

import warnings


warnings.filterwarnings('ignore', 'Expected ')
warnings.simplefilter('ignore')
input_file = "./1M_brain_cells_10X.sparse.h5ad"

if not os.path.exists(input_file):
    print('Downloading input file...')
    wget.download('https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/1M_brain_cells_10X.sparse.h5ad',input_file)


# marker genes
MITO_GENE_PREFIX = "mt-" # Prefix for mitochondrial genes to regress out
markers = ["Stmn2", "Hes1", "Olig1"] # Marker genes for visualization

# filtering cells
min_genes_per_cell = 200 # Filter out cells with fewer genes than this expressed
max_genes_per_cell = 6000 # Filter out cells with more genes than this expressed

# filtering genes
min_cells_per_gene = 1 # Filter out genes expressed in fewer cells than this
n_top_genes = 4000 # Number of highly variable genes to retain

# PCA
n_components = 50 # Number of principal components to compute

# t-SNE
tsne_n_pcs = 20 # Number of principal components to use for t-SNE

# k-means
k = 35 # Number of clusters for k-means

# Gene ranking

ranking_n_top_genes = 50 # Number of differential genes to compute for each cluster

# Number of parallel jobs
sc.settings.n_jobs = os.cpu_count()

start=time.time()
adata = sc.read(input_file)
adata.var_names_make_unique()
adata.shape

tr=time.time()
# To reduce the number of cells:
USE_FIRST_N_CELLS = 1300000
adata = adata[0:USE_FIRST_N_CELLS]
adata.shape

sc.pp.filter_cells(adata, min_genes=min_genes_per_cell)
print("Total filter cell : %s" % (time.time()-tr))
sc.pp.filter_cells(adata, max_genes=max_genes_per_cell)
tfg=time.time()
sc.pp.filter_genes(adata, min_cells=min_cells_per_gene)
print("Total filter genes : %s" % (time.time()-tfg))
tnt=time.time()
sc.pp.normalize_total(adata, target_sum=1e4)
print("Total normalize_total : %s" % (time.time()-tnt))
print("Total filter and normalize time : %s" % (time.time()-tr))
  • Closes #
  • Tests included or not required because:
  • Release notes not necessary because:
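
To make clear what the benchmarked calls compute, here is a minimal pure-Python sketch of the per-cell and per-gene counts behind `filter_cells` and `filter_genes`, written against the raw CSR arrays (`indptr`, `indices`, `data`). It is not the implementation in this PR or in scanpy, which work on SciPy sparse matrices; the helper names, the toy matrix, and the toy thresholds are all hypothetical.

```python
# Toy CSR matrix: 4 cells (rows) x 5 genes (columns).
# indptr[i]:indptr[i + 1] delimits the stored entries of row i.
indptr = [0, 3, 3, 5, 9]
indices = [0, 2, 4, 1, 2, 0, 1, 2, 3]
data = [1.0, 2.0, 1.0, 3.0, 1.0, 1.0, 1.0, 4.0, 2.0]
n_genes = 5


def genes_per_cell(indptr, data):
    """Count the entries greater than zero in each row (cell)."""
    return [
        sum(1 for value in data[start:stop] if value > 0)
        for start, stop in zip(indptr[:-1], indptr[1:])
    ]


def cells_per_gene(indices, data, n_genes):
    """Count the entries greater than zero in each column (gene)."""
    counts = [0] * n_genes
    for column, value in zip(indices, data):
        if value > 0:
            counts[column] += 1
    return counts


# Toy thresholds standing in for min_genes_per_cell / max_genes_per_cell.
min_genes, max_genes = 2, 3
per_cell = genes_per_cell(indptr, data)
keep_cells = [min_genes <= n <= max_genes for n in per_cell]

# Toy threshold standing in for min_cells_per_gene.
min_cells = 2
per_gene = cells_per_gene(indices, data, n_genes)
keep_genes = [n >= min_cells for n in per_gene]

print(per_cell, keep_cells)
print(per_gene, keep_genes)
```

Both counts need only a single pass over the stored entries, which is why this step lends itself to parallelisation over rows or chunks of the matrix. Note that the benchmark script applies the gene filter after the cell filter, whereas this sketch computes both counts on the unfiltered toy matrix for brevity.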


codecov bot commented Oct 29, 2024

Codecov Report

Attention: Patch coverage is 22.44898% with 38 lines in your changes missing coverage. Please review.

Project coverage is 72.25%. Comparing base (a70582e) to head (25e7cd4).
Report is 103 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/scanpy/preprocessing/_simple.py | 22.44% | 38 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3330      +/-   ##
==========================================
- Coverage   76.27%   72.25%   -4.03%     
==========================================
  Files         117      111       -6     
  Lines       12795    12639     -156     
==========================================
- Hits         9760     9132     -628     
- Misses       3035     3507     +472     
| Files with missing lines | Coverage Δ |
|---|---|
| src/scanpy/preprocessing/_simple.py | 61.67% <22.44%> (ø) |

... and 65 files with indirect coverage changes
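
As a reading aid, the figures in the report above can be cross-checked with a little arithmetic. This is a sketch that assumes patch coverage is covered lines divided by covered plus missing lines, and project coverage is hits divided by total lines; the variable names are illustrative.

```python
# Figures copied from the Codecov report above.
missing_patch_lines = 38
patch_coverage_pct = 22.44898

# Solve covered / (covered + missing) = patch coverage for the patch size.
total_patch_lines = round(missing_patch_lines / (1 - patch_coverage_pct / 100))
covered_patch_lines = total_patch_lines - missing_patch_lines

# Project coverage on the head commit, from the Hits and Lines rows.
head_hits = 9132
head_lines = 12639
head_coverage_pct = head_hits / head_lines * 100

print(f"patch: {covered_patch_lines} of {total_patch_lines} lines covered")
print(f"head project coverage: {head_coverage_pct:.2f}%")
```

Under those assumptions, the patch touches 49 measurable lines, of which 11 are exercised by the test suite.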

kaushalprasadhial and others added 28 commits November 15, 2024 13:43
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* (chore): add preparation-of-release documentation

* (chore): add versioning url

* Update release.md

* unify quotes

* fmt

* mention UI elements

* clearer

---------

Co-authored-by: Philipp A <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ilan Gold <[email protected]>
* Unpin numpy 2

* float64 and harmonize metrics code

* Skip tests for old skmisc

* Fix parallel tests

* fix numpy 2 reprs

* add relnote

* (fix): release notes version

---------

Co-authored-by: Ilan Gold <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix layer use_raw

* add test

* adds release note

* adds default
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Philipp A. <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…cverse#3196)

* (fix): resolve axis ordering

* (chore): release note

* (chore): add test

* (fix): make testing faster and more robust

* (fix): light refactor

* (fix): clean up comments

* (chore): add more comments

* (fix): remove test

* Update src/scanpy/plotting/_stacked_violin.py

* (chore): add test to cement behavior

* (fix): tolerance that is sensitive enough for `main` difference

---------

Co-authored-by: Philipp A. <[email protected]>
flying-sheep and others added 29 commits February 4, 2025 15:11
… package (scverse#3362)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Phil Schaf <[email protected]>
…erse#3393)

* (fix): bound sklearn because of dask-ml on the release candidate

* (chore): release note

* (fix): `mod` in note

* (fix): release notes number
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* typo and grammar fixes in docstrings only

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert dendogram docstring

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>