KnowEnG's Samples Clustering Pipeline

This is the Samples Clustering Pipeline of the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence.

This pipeline clusters the columns of a given spreadsheet, whose columns correspond to sample labels and whose rows correspond to gene labels.

There are four clustering methods that one can choose from:

Options	Method	Parameters
Clustering	nmf	nmf
Consensus Clustering	bootstrapping with nmf	cc_nmf
Clustering with network regularization	network-based nmf	net_nmf
Consensus Clustering with network regularization	bootstrapping with network-based nmf	cc_net_nmf

Note: all of the clustering methods mentioned above use the non-negative matrix factorization (nmf) as the main clustering algorithm.
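As an illustration of how nmf can cluster spreadsheet columns, here is a minimal sketch using the standard multiplicative updates and a toy 4 genes x 6 samples matrix. The function name `nmf_cluster`, the toy data, and the rule "assign each sample to the row of H with the largest value" are assumptions for illustration; the pipeline's own implementation may differ.

```python
import numpy as np

def nmf_cluster(X, k, n_iter=500, seed=0, eps=1e-9):
    """Factor X (genes x samples) as W @ H and label each sample by its largest H entry."""
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_samples)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the Frobenius-norm objective ||X - W H||
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    labels = H.argmax(axis=0)  # one cluster label per sample (column)
    return W, H, labels

# Toy spreadsheet (assumed data): samples 0-2 and samples 3-5 form two blocks.
X = np.array([
    [5.0, 5.0, 5.0, 0.0, 0.0, 0.0],
    [4.0, 4.0, 4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0, 3.0, 3.0],
    [0.0, 0.0, 0.0, 6.0, 6.0, 6.0],
])
W, H, labels = nmf_cluster(X, k=2)
print(labels)
```

The cluster numbering itself is arbitrary; only the grouping of samples is meaningful.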

If a phenotype data file is included, this pipeline also evaluates the clustering result.

There are two evaluation methods:

Method	Trait Type
one-way ANOVA (f_oneway)	Continuous
one-way chi square test (chisquare)	Categorical

How to run this pipeline with Our data

1. Clone the Samples_Clustering_Pipeline Repo

git clone https://github.com/KnowEnG-Research/Samples_Clustering_Pipeline.git

2. Install the following (Ubuntu or Linux)

apt-get install -y python3-pip
apt-get install -y libfreetype6-dev libxft-dev
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran

pip3 install pyyaml
pip3 install knpackage
pip3 install scipy==0.18.0
pip3 install numpy==1.11.1
pip3 install pandas==0.18.1
pip3 install matplotlib==1.4.2
pip3 install scikit-learn==0.17.1

3. Change directory to Samples_Clustering_Pipeline

cd Samples_Clustering_Pipeline

4. Change directory to test

cd test

5. Create a local directory "run_dir" and place all the run files in it

make env_setup

6. Use one of the following "make" commands to select and run a clustering option:

Command	Option
make run_nmf	Clustering
make run_net_nmf	Clustering with network regularization
make run_cc_nmf_serial	Consensus Clustering
make run_cc_nmf_parallel_shared	Consensus Clustering
make run_cc_net_nmf_serial	Consensus Clustering with network regularization
make run_cc_net_nmf_parallel_shared	Consensus Clustering with network regularization

How to run this pipeline with Your data

Follow steps 1-3 above then do the following:

* Create your run directory

mkdir run_directory

* Change directory to the run_directory

cd run_directory

* Create your results directory

mkdir results_directory

* Create run_parameters file (YAML Format)

Look for examples of run_parameters files in Samples_Clustering_Pipeline/data/run_files, for example zTEMPLATE_cc_net_nmf.yml.

* Modify run_parameters file (YAML Format)

Set processing_method to serial or parallel, depending on your machine.

processing_method: serial

Set the data file targets to the files you want to run, and adjust the parameters as appropriate for your data.

* Run the Samples Clustering Pipeline:

Update the PYTHONPATH environment variable

export PYTHONPATH='../src':$PYTHONPATH

Run

python3 ../src/samples_clustering.py -run_directory ./run_dir -run_file zTEMPLATE_cc_net_nmf.yml
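For the run_parameters file used in the steps above, the following sketch renders a small set of keys as plain `key: value` YAML lines. The keys and example values come from the parameter table in this README; the `/path/to/` directories and the helper name `render_run_parameters` are placeholders, and a real run file will need the remaining keys from the template.

```python
def render_run_parameters(params):
    """Render a flat dict as 'key: value' lines, which is valid YAML for simple scalars."""
    return "".join(f"{key}: {value}\n" for key, value in params.items())

run_parameters = {
    "method": "cc_net_nmf",
    # The directories below are hypothetical placeholders.
    "gg_network_name_full_path": "/path/to/STRING_experimental_gene_gene.edge",
    "spreadsheet_name_full_path": "/path/to/ProGENI_rwr20_STExp_GDSC_500.rname.gxc.tsv",
    "phenotype_data_full_path": "/path/to/UCEC_phenotype.txt",
    "results_directory": "./results_directory",
    "number_of_clusters": 3,
    "number_of_bootstraps": 4,
    # Kept as text so the value is written exactly as in the table.
    "rwr_convergence_tolerence": "1.0e-8",
    "processing_method": "serial",
}

run_file_text = render_run_parameters(run_parameters)
print(run_file_text)
```

Saving `run_file_text` to a .yml file in the run directory gives a starting point that can be compared against zTEMPLATE_cc_net_nmf.yml.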

Description of "run_parameters" file

Key	Value	Comments
method	nmf, cc_nmf, net_nmf or cc_net_nmf	Choose clustering method
gg_network_name_full_path	directory+gg_network_name	Path and file name of the 4 col network file
spreadsheet_name_full_path	directory+spreadsheet_name	Path and file name of user supplied gene sets
phenotype_data_full_path	directory+phenotype_data_name	Path and file name of user supplied phenotype data
threshold	10	Cluster evaluation: cutoff for treating a trait as categorical vs continuous
results_directory	directory	Directory to save the output files
tmp_directory	directory	Directory to save the intermediate files
rwr_max_iterations	100	Maximum number of iterations without convergence in random walk with restart
rwr_convergence_tolerence	1.0e-8	Frobenius norm tolerance of spreadsheet vector in random walk
rwr_restart_probability	0.7	alpha in `V_(n+1) = alpha * N * Vn + (1-alpha) * Vo`
rows_sampling_fraction	0.8	Select 80% of spreadsheet rows
cols_sampling_fraction	0.8	Select 80% of spreadsheet columns
number_of_bootstraps	4	Number of random samplings
number_of_clusters	3	Estimated number of clusters
nmf_conv_check_freq	50	Check convergence at given frequency
nmf_max_invariance	200	Maximum number of invariance
nmf_max_iterations	10000	Maximum number of iterations
nmf_penalty_parameter	1400	Penalty parameter
top_number_of_genes	100	Number of top genes selected
processing_method	serial or parallel or distribute	Choose processing method
parallelism	number of cores to use in parallel processing	Set number of cores for speed or memory

gg_network_name = STRING_experimental_gene_gene.edge
spreadsheet_name = ProGENI_rwr20_STExp_GDSC_500.rname.gxc.tsv
phenotype_data_name = UCEC_phenotype.txt
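To show how the three rwr_* parameters interact, here is a minimal sketch of the iteration `V_(n+1) = alpha * N * Vn + (1-alpha) * Vo` from the table, stopping when the Frobenius norm of the change drops below the tolerance or the iteration limit is reached. The function name, the toy 3-node network, and the column normalization of N are assumptions for illustration, not the pipeline's code.

```python
import numpy as np

def random_walk_with_restart(N, V0, alpha=0.7, max_iterations=100, tolerance=1.0e-8):
    """Iterate V <- alpha * N @ V + (1 - alpha) * V0 until the update is below tolerance."""
    V = V0.copy()
    for iteration in range(1, max_iterations + 1):
        V_next = alpha * (N @ V) + (1.0 - alpha) * V0
        # np.linalg.norm on a 2-D array defaults to the Frobenius norm
        if np.linalg.norm(V_next - V) < tolerance:
            return V_next, iteration
        V = V_next
    return V, max_iterations

# Toy network (assumed): path graph 0 - 1 - 2, columns normalized to sum to 1.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
N = A / A.sum(axis=0)

# Toy spreadsheet (assumed): 3 genes x 2 samples, each column sums to 1.
V0 = np.array([[1.0, 0.0],
               [0.0, 0.0],
               [0.0, 1.0]])

V, n_iter = random_walk_with_restart(N, V0)
print(n_iter)
print(V)
```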

Description of Output files saved in results directory

All four methods save the per-row (per-gene) variances of the genes by samples heatmap with name genes_variance_{method}_{timestamp}_viz.tsv.

	variance
gene 1	float
...	...
gene m	float

All four methods save the genes by samples heatmap with name genes_by_samples_heatmp_{method}_{timestamp}_viz.tsv.

	sample 1	...	sample n
gene 1	float	...	float
...	...	...	...
gene m	float	...	float

All four methods save the samples by samples heatmap with name consensus_matrix_{method}_{timestamp}_viz.tsv.

	sample 1	...	sample n
sample 1	float	...	float
...	...	...	...
sample n	float	...	float
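One common way to build a samples by samples consensus matrix from bootstrap runs is to divide, for each pair of samples, the number of runs in which they landed in the same cluster by the number of runs in which both were sampled. The sketch below follows that definition; the function name, the toy runs, and the exact normalization are assumptions, so the pipeline's matrix may be computed differently.

```python
import numpy as np

def consensus_matrix(n_samples, bootstrap_runs):
    """bootstrap_runs: list of (sampled_column_indices, cluster_labels) pairs, one per bootstrap."""
    together = np.zeros((n_samples, n_samples))
    sampled = np.zeros((n_samples, n_samples))
    for columns, labels in bootstrap_runs:
        columns = np.asarray(columns)  # indices must be unique within one run
        labels = np.asarray(labels)
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(columns, columns)] += same
        sampled[np.ix_(columns, columns)] += 1.0
    # Pairs that were never sampled together keep a consensus value of 0.
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)

# Toy input (assumed): 4 samples, 3 bootstraps that each drew 3 of the 4 columns.
runs = [
    ([0, 1, 2], [0, 0, 1]),
    ([1, 2, 3], [0, 1, 1]),
    ([0, 1, 3], [1, 1, 0]),
]
consensus = consensus_matrix(4, runs)
print(consensus)
```

Cluster label numbers do not need to agree between runs, because only co-membership within a run is counted.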

All four methods save the sample to cluster map with name samples_labeled_by_cluster_{method}_{timestamp}_viz.tsv.

	cluster
sample 1	int
...	...
sample n	int
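A file in this layout can be summarized with the Python standard library. The sketch below counts samples per cluster; it assumes a tab-separated file with one header row as shown above, and the sample names and the helper name `cluster_sizes` are made up for the example.

```python
import csv
import io
from collections import Counter

def cluster_sizes(tsv_file):
    """Count samples per cluster from a two-column TSV (sample label, cluster id)."""
    reader = csv.reader(tsv_file, delimiter="\t")
    next(reader)  # skip the header row (assumed to be present)
    return Counter(int(row[1]) for row in reader if row)

# Toy file content (assumed); a real run would open the *_viz.tsv file instead.
example = io.StringIO("\tcluster\nsample_1\t0\nsample_2\t1\nsample_3\t0\n")
sizes = cluster_sizes(example)
print(sizes)
```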

All four methods save the gene scores by cluster with name genes_averages_by_cluster_{method}_{timestamp}_viz.tsv.

	cluster 1	...	cluster k
gene 1	float	...	float
...	...	...	...
gene m	float	...	float

All four methods save a spreadsheet marking the top ranked genes per cluster with name top_genes_by_cluster_{method}_{timestamp}_download.tsv.

	cluster 1	...	cluster k
gene 1	1/0	...	1/0
...	...	...	...
gene m	1/0	...	1/0
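A 1/0 table like this can be derived from a genes by clusters score matrix by marking the highest scoring genes in each column. The sketch below assumes that genes are ranked by their average score within each cluster and that `top_n` plays the role of top_number_of_genes; both are assumptions, since the ranking rule is not described here.

```python
import numpy as np

def top_gene_indicator(averages, top_n):
    """Return a 1/0 matrix marking the top_n highest scoring genes in each cluster column."""
    order = np.argsort(-averages, axis=0)[:top_n]  # row indices of the top genes per column
    indicator = np.zeros(averages.shape, dtype=int)
    np.put_along_axis(indicator, order, 1, axis=0)
    return indicator

# Toy scores (assumed): 4 genes x 2 clusters.
averages = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.4],
    [0.1, 0.7],
])
indicator = top_gene_indicator(averages, top_n=2)
print(indicator)
```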

All methods save three silhouette scores (overall, per cluster, and per sample) with name silhouette_{method}_{timestamp}_viz.tsv.

1. silhouette overall score file: | number of clusters | silhouette score |
2. silhouette per cluster score file: | ith cluster | corresponding silhouette score |
3. silhouette per sample score file: | ith sample | corresponding silhouette score |

All four methods save the sample to cluster map together with the phenotype values with name phenotypes_labeled_by_cluster_{method}_{timestamp}_viz.tsv.

sample id	cluster	phenotype 1	...	phenotype k
sample 1	int	mixed type	...	mixed type
...	...	...	...	...
sample n	int	mixed type	...	mixed type

The clustering evaluation output file has the name clustering_evaluation_result_{timestamp}.tsv.

	Measure	Trait_length_after_dropna	Sample_number_after_dropna	chi/fval	pval
sample 1	f_oneway	int(more than threshold)	int	float	float
...	...	...	...	...	...
sample m	chisquare	int(less than threshold)	int	float	float
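The Measure column above suggests that the threshold parameter decides which test is applied to each trait. The sketch below shows one way such a rule could work: drop missing values, count the distinct values, and pick f_oneway when the count exceeds the threshold, otherwise chisquare. Counting distinct values, the handling of a count equal to the threshold, and the function name are all assumptions.

```python
def choose_evaluation_test(trait_values, threshold=10):
    """Pick a test name for one phenotype column after dropping missing values (None)."""
    observed = [value for value in trait_values if value is not None]
    n_unique = len(set(observed))
    # Assumption: more distinct values than the threshold means a continuous trait.
    measure = "f_oneway" if n_unique > threshold else "chisquare"
    return measure, n_unique, len(observed)

# Toy phenotype columns (assumed data).
ages = [34, 41, None, 52, 29, 63, 47, 58, 38, 45, 50, 61, 33]
stages = ["I", "II", "II", None, "III", "I"]

age_result = choose_evaluation_test(ages)
stage_result = choose_evaluation_test(stages)
print(age_result)
print(stage_result)
```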

References:

Hofree, M., Shen, J. P., Carter, H., Gross, A. & Ideker, T. Network-based stratification of tumor mutations. Nat. Methods 10, 1108–1115 (2013).
