
Commit

Fixed directories, and more updates
komodovaran committed Feb 3, 2020
1 parent ea44f83 commit 4d97540
Showing 16 changed files with 404 additions and 252 deletions.
89 changes: 63 additions & 26 deletions README.md
@@ -1,18 +1,29 @@
### Setup (tested on linux only!)
1. Install conda and a conda environment ("what?" "how?" - Google it!)
2. Install Tensorflow with `conda install tensorflow-gpu=2.0.0`. This **must** be installed as the first package. The contents
here are only tested with version 2.0, but it should work on later ones as well. If done correctly, check the script at
`/checks/test_tensorflow_gpu_is_working.py`
3. Install the rest of the conda requirements with

````conda install -f -y -q --name py37 -c conda-forge --file conda_requirements.txt````
3. Install everything else with `pip install -r requirements.txt`
4. If Tensorflow is installed correctly, run `checks/test_tensorflow_gpu_is_working`. If the device is correctly set up,
1. Install conda and create a conda environment. Conda installation
instructions for Linux, as well as how to create an environment, can be found
on the conda website.
2. Install Tensorflow with `conda install tensorflow-gpu=2.0.0`. This **must**
be installed as the first package. The contents here are only tested with
version 2.0, but it should work on later ones as well. If done correctly,
check the script at `/checks/test_tensorflow_gpu_is_working.py`
3. Install the rest of the conda requirements with
````
conda install -f -y -q --name py37 -c conda-forge --file conda_requirements.txt
````
4. Install everything else with `pip install -r requirements.txt`
5. If Tensorflow is installed correctly, run
`checks/test_tensorflow_gpu_is_working`. If the device is correctly set up,
Tensorflow is working and you're good to go!
6. Conda and pip don't always play well together, and this can break some of
the package installations. If for some reason a package was not installed, run a
script until you hit a `ModuleNotFoundError: No module named 'name_of_package'`
error, then install the missing module with `pip install name_of_package`.
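
Complementing step 5, the snippet below is a minimal, hypothetical sanity check
that Tensorflow can see the GPU. It is not the bundled
`checks/test_tensorflow_gpu_is_working.py` script, whose contents may differ.

```python
# Hypothetical quick check that Tensorflow 2.0 can see a GPU.
# This is a stand-in, not the bundled check script.
import tensorflow as tf

print("Tensorflow version:", tf.__version__)

# The `experimental` namespace is used because this API lived there in TF 2.0.
gpus = tf.config.experimental.list_physical_devices("GPU")
print("GPUs visible to Tensorflow:", gpus if gpus else "none found")
```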

### Interactive scripts
Large parts run interactively in the Python-package [Streamlit](www.streamlit.io). If a script has `st_` in front of the
name, it must be run interactively through Streamlit. To launch these from the terminal, write `streamlit run st_myscript.py`
Large parts run interactively in the Python package
[Streamlit](www.streamlit.io). If a script has `st_` in front of its name, it
must be run interactively through Streamlit (otherwise it doesn't produce any
visible output). To launch these from the terminal, write
`streamlit run st_myscript.py`.
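
As a rough illustration (not one of the repository's actual scripts), an
`st_`-prefixed script is just a normal Python file that calls Streamlit's API.
The filename `st_example.py` and the plotted data below are made up.

```python
# st_example.py - hypothetical illustration of an st_-prefixed script.
# Launch with: streamlit run st_example.py
import numpy as np
import streamlit as st

st.title("Example interactive viewer")

# Streamlit re-runs the whole script whenever a widget value changes.
n_points = st.slider("Number of points", min_value=10, max_value=1000, value=100)

data = np.random.randn(n_points, 2)
st.line_chart(data)
```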

### Naming convention

@@ -22,30 +33,56 @@ name, it must be run interactively through Streamlit. To launch these from the t

* name: filename (`model_005.h5`)

### Data format
Every dataset is preprocessed so that eventually you'll have a `hdf` file and a corresponding `npz` file. All
computations are done on the `npz` file because it's much faster, and compatible with tensorflow. However, group order
must be preserved according to the parent `hdf` dataframe.
* To access something one directory up, write `../` in front of the directory
name. Two directories up is `../../`, and so on.

A `npz` has just a single group index, whereas a dataframe may have both an `id` and `sub_id` if it's combined
from multiple sources. In that case, the `id` will correspond to the `npz` index, and `sub_id` will be the actual
group index in the sub-dataset. Group order is only preserved if dataframe groups are sorted by `['file', 'particle']`,
or for combined dataframes `['source', 'file', 'particle']`. To combine dataframes, the inputs are stacked in loaded order
(which must therefore also be sorted!). All of this is done automatically, if the right sequence of steps is taken.
* All paths in use are defined in `lib/globals.py`, so they can be conveniently
changed once here, rather than everywhere in the code.
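
As a hypothetical sketch of what such a module can look like (the constant
names below are illustrative, not the actual contents of `lib/globals.py`):

```python
# lib/globals.py - illustrative sketch only; the real constant names may differ.
from pathlib import Path

# Defined once here so every script resolves the same locations.
DATA_DIR = Path("data/preprocessed")
RESULTS_DIR = Path("results")
CLUSTER_INDICES_DIR = RESULTS_DIR / "cluster_indices"
MODELS_DIR = Path("models")
```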

### Data format
Every dataset is preprocessed so that eventually you'll have a `hdf` file and a
corresponding `npz` file. All computations are done on the `npz` file because
it's much faster, and compatible with tensorflow. However, group order must be
preserved according to the parent `hdf` dataframe.

A `npz` has just a single group index (i.e. '512' means trace id 512; remember,
Python counts from 0!), whereas a dataframe may have both an `id` and a `sub_id`
if it's combined from multiple sources. In that case, the `id` corresponds to
the `npz` index (i.e. the order of appearance), and `sub_id` is the actual group
index in the sub-dataset (currently not used). Group order is only preserved if
dataframe groups are sorted by `['file', 'particle']`, or by
`['source', 'file', 'particle']` for combined dataframes. To combine dataframes,
the inputs are stacked in loaded order (which must therefore also be sorted!).
All of this is done automatically if the right sequence of steps is taken.
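
A minimal sketch of how this ordering contract can be honoured when exporting a
dataframe to `npz` (the column name `intensity` is an assumption; this is an
illustration, not the repository's actual preprocessing code):

```python
# Illustrative only: export per-trace arrays to .npz in the same order as the
# parent dataframe's groups. The column name "intensity" is an assumption.
import numpy as np
import pandas as pd

def df_to_npz(df: pd.DataFrame, out_path: str) -> None:
    # Sorting by the grouping keys is what keeps the npz positional index
    # aligned with the parent dataframe's group order.
    df = df.sort_values(["file", "particle"])
    traces = [
        group["intensity"].to_numpy()
        for _, group in df.groupby(["file", "particle"], sort=False)
    ]
    # Arrays are stored positionally (arr_0, arr_1, ...), so index i in the
    # npz corresponds to group i of the sorted dataframe.
    np.savez(out_path, *traces)
```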


### Scripts to run, step by step
1. `get_cme_tracks.py` to convert from CME `.mat` files to a dataframe.
2. `prepare_data.py` to filter out too short data (set it low initially to be safe), traces that would be cut by the
tracking start/end, and
2. `prepare_data.py` to filter out data that is too short (set the threshold
low initially if you want to be safe; the model works almost equally well
regardless of the minimum length), as well as traces that would be cut by the
tracking start/end (i.e. if something starts at frame 0 of the video, it's
removed, because you can't be sure whether the actual event started at
"frame -10"). A minimal sketch of this kind of filtering follows this list.
3. `train_autoencoder.py` to train a model on the data.
4. `st_predict.py` to predict and plot the data. Initially, a UMAP model is trained.
5. `st_eval.py` once clustering is done and you want to explore the data.
4. `st_predict.py` to predict and plot the data. Initially, a UMAP model is
trained. This takes a while. It might even time out your Streamlit session, but
don't touch anything and it'll be ready eventually.
5. Every cluster is saved as a combination of model + data names, and will be
output to `results/cluster_indices/`. This contains the indices of every trace
(see above on how indexing works), and which cluster they belong to. Note that
every change in the analysis **OVERWRITES** the automatically created file
containing cluster indices. If you have reached a point where you want to save
them, go to `results/cluster_indices/` and rename the file so you're sure it
won't be overwritten.
6. `st_eval.py` once clustering is done and you want to explore the data. It
currently has limited functionality: it only supports looking at one or more
specific datasets/clusters.
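
As referenced in step 2, here is a minimal sketch of that kind of filtering
(assumed column names `file`, `particle`, and `frame`; this is not the actual
`prepare_data.py`):

```python
# Illustrative filter, not the real prepare_data.py: drop traces that are too
# short or that are already present at frame 0 (i.e. cut by the tracking start).
# An analogous check can be added for traces still present at the final frame.
import pandas as pd

def filter_traces(df: pd.DataFrame, min_len: int = 20) -> pd.DataFrame:
    kept = []
    for _, group in df.groupby(["file", "particle"], sort=False):
        if len(group) < min_len:
            continue  # too short to be useful
        if group["frame"].min() == 0:
            continue  # starts at the first video frame, so its true start is unknown
        kept.append(group)
    return pd.concat(kept).sort_values(["file", "particle"])
```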

### Things to avoid
In order to preserve group ordering, the original dataframes must be run through
`prepare_data.py` if they need to be filtered in some way. **DO NOT** run a combine dataframe through a filter,
because this messes up the internal group ordering that was first established when creating the combined dataframe.
`prepare_data.py` if they need to be filtered in some way. **DO NOT** run a
combined dataframe through a filter, because this messes up the internal group
ordering that was first established when creating the combined dataframe.

### Troubleshooting
#### Packages are missing
41 changes: 29 additions & 12 deletions best_clusters.py
@@ -8,13 +8,13 @@
from sklearn.mixture import GaussianMixture
from sklearn.cluster import MiniBatchKMeans
from tqdm import tqdm
from lib.utils import get_index

import lib.globals
import lib.globals
from lib.utils import get_index


def _plot_kmeans_scores(X, min, max, step):
def _plot_kmeans_score(X, min, max, step):
"""
Calculates scores for multiple values of kmeans
Args:
@@ -25,9 +25,8 @@ def _plot_kmeans_scores(X, min, max, step):
"""
rng = list(range(min, max, step))

def process(n):
def process_gaussian(n):
clf = GaussianMixture(n_components = n, random_state = 42)
# clf = MiniBatchKMeans(n_clusters=n, random_state=42)
labels = clf.fit_predict(X)

s = silhouette_score(X, labels)
@@ -36,20 +35,35 @@ def process(n):

return s, c, b

def process_kmeans(n):
clf = MiniBatchKMeans(n_clusters=n, random_state=42)
labels = clf.fit_predict(X)

s = silhouette_score(X, labels)
c = calinski_harabasz_score(X, labels)
return s, c


n_jobs = len(rng)
results = Parallel(n_jobs=n_jobs)(delayed(process)(i) for i in tqdm(rng))
results = np.column_stack(results).T
# Score both clustering methods across the same range of cluster counts
results_kmeans = Parallel(n_jobs=n_jobs)(delayed(process_kmeans)(i) for i in tqdm(rng))
results_kmeans = np.column_stack(results_kmeans).T
results_gaussian = Parallel(n_jobs=n_jobs)(delayed(process_gaussian)(i) for i in tqdm(rng))
results_gaussian = np.column_stack(results_gaussian).T

fig, ax = plt.subplots(nrows=3, ncols = 2)

fig, ax = plt.subplots(nrows=3)
ax[0].plot(rng, results[:, 0], "o-", color="blue", label="Silhouette score")
ax[1].plot(rng, results[:, 1], "o-", color="orange", label="CH score")
ax[2].plot(rng, results[:, 2], "o-", color="red", label="BIC")
ax[0, 0].set_title("K-means")
ax[0, 0].plot(rng, results_kmeans[:, 0], "o-", color="blue", label="Silhouette score")
ax[1, 0].plot(rng, results_kmeans[:, 1], "o-", color="orange", label="CH score")

ax[0, 1].set_title("Gaussian Mixture")
ax[0, 1].plot(rng, results_gaussian[:, 0], "o-", color="blue", label="Silhouette score")
ax[1, 1].plot(rng, results_gaussian[:, 1], "o-", color="orange", label="CH score")
ax[2, 1].plot(rng, results_gaussian[:, 2], "o-", color="red", label="BIC")

for a in ax.ravel():
a.legend(loc="upper right")

plt.tight_layout()
plt.savefig("plots/best_k.pdf")
plt.savefig("plots/best_clusters.pdf")
plt.show()


@@ -61,11 +75,14 @@ def main(encodings_name):

X, encodings = f["X_true"], f["features"]

X = X[0:1000]
encodings = encodings[0:1000]

arr_lens = np.array([len(xi) for xi in X])
(len_above_idx,) = np.where(arr_lens >= 30)
X, encodings, = get_index((X, encodings), index=len_above_idx)

_plot_kmeans_scores(encodings, min=2, max=100, step=3)
_plot_kmeans_score(encodings, min=2, max=100, step=3)


if __name__ == "__main__":
Empty file added data/preprocessed/.gitkeep
Empty file.
Empty file added ksparse/__init__.py
Empty file.
91 changes: 50 additions & 41 deletions other/sparse_gam.py
@@ -28,7 +28,7 @@

x_train, x_val = train_test_split(X)

input_shape = x_train.shape[1:]
shape = x_train.shape[1:]

# Define autoencoder parameters
EPOCHS = 10
@@ -43,52 +43,61 @@
INIT_SPARSITY, END_SPARSITY, EPOCHS
)

# k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(x)

# Build the Autoencoder Model
inputs = Input(shape=input_shape, name="encoder_input")
x = inputs

for filters in LAYER_FILTERS:
x = Conv2D(
filters=filters,
kernel_size=KERNEL_SIZE,
strides=2,
activation="relu",
padding="same",
data_format="channels_last",
)(x)
i = Input(shape)
x = Flatten()(i)
h = Dense(LATENT_DIM, activation = 'sigmoid')(x)

# Generate the latent vector
x = Flatten()(x)
x = Dense(LATENT_DIM, name="latent_vector")(x)
k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(x)
k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(h)

# Decoder
x = Dense(input_shape[0] * input_shape[1] * input_shape[2])(k_sparse)
x = Reshape(input_shape)(x)
x = Dense(x_train.shape[1] * x_train.shape[2] * x_train.shape[3])(k_sparse)
x = Reshape(shape)(x)
o = Activation("sigmoid")(x)

for filters in LAYER_FILTERS[::-1]:
x = Conv2DTranspose(
filters=filters,
kernel_size=KERNEL_SIZE,
strides=1,
activation="relu",
padding="same",
data_format="channels_last",
)(x)

x = Conv2DTranspose(
filters=input_shape[-1],
kernel_size=KERNEL_SIZE,
padding="same",
data_format="channels_last",
)(x)
# k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(x)

outputs = Activation(None, name="decoder_output")(x)
# Build the Autoencoder Model
# i = Input(shape=input_shape, name= "encoder_input")
# # for filters in LAYER_FILTERS:
# # x = Conv2D(
# # filters=filters,
# # kernel_size=KERNEL_SIZE,
# # strides=2,
# # activation="relu",
# # padding="same",
# # data_format="channels_last",
# # )(x)
#
# # Generate the latent vector
# x = Flatten()(i)
# x = Dense(LATENT_DIM, activation = "sigmoid")(x)
# k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(x)
#
#
# # Decoder
# x = Dense(int(np.product(input_shape)), activation = "relu")(x)
# x = Reshape(input_shape)(x)
#
# outputs = Dense(x_train.shape[1], activation = 'sigmoid')(k_sparse)

# for filters in LAYER_FILTERS[::-1]:
# x = Conv2DTranspose(
# filters=filters,
# kernel_size=KERNEL_SIZE,
# strides=1,
# activation="relu",
# padding="same",
# data_format="channels_last",
# )(x)
#
# x = Conv2DTranspose(
# filters=input_shape[-1],
# kernel_size=KERNEL_SIZE,
# padding="same",
# data_format="channels_last",
# )(x)

# Autoencoder = Encoder + Decoder
autoencoder = Model(inputs, outputs, name="autoencoder")
autoencoder = Model(i, o, name= "autoencoder")
autoencoder.summary()

autoencoder.compile(loss="mse", optimizer="adam")
9 changes: 2 additions & 7 deletions other/sparse_mnist.py
@@ -16,11 +16,9 @@
x_train, y_train = shuffle(x_train, y_train, random_state = 1)
x_test, y_test = shuffle(x_test, y_test, random_state = 1)


def process(x):
return x.reshape(x.shape[0], x.shape[1] ** 2) / 255


x_train = process(x_train)
x_test = process(x_test)

@@ -35,11 +33,8 @@ def process(x):

# Build the k-sparse autoencoder
inputs = Input((x_train.shape[1],))
x = Dense(embedding_size, activation = "relu")(inputs)
x = Dense(embedding_size, activation = "relu")(x)
k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(x)
x = Dense(embedding_size, activation = "relu")(k_sparse)

h = Dense(embedding_size, activation = 'sigmoid')(inputs)
k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(h)
outputs = Dense(x_train.shape[1], activation = 'sigmoid')(k_sparse)

ae1 = Model(inputs, outputs)
52 changes: 52 additions & 0 deletions other/sparse_mnist_new.py
@@ -0,0 +1,52 @@
import ksparse.utils.mnist.mnist_helper as mh
from ksparse.layers.linear_layer import LinearLayer
from ksparse.layers.sparse_layer import SparseLayer
from ksparse.nets.fcnn import *
from ksparse.utils.activations import *
from ksparse.utils.cost_functions import *
from tensorflow.keras.layers import Dense

import numpy as np

img_size = 28
num_hidden = 100
k = 70
learning_rate = 0.01
epochs = 10000
batch_size = 256
print_epochs = 1000
num_test_examples = 10

helper = mh.mnist_helper()
train_lbl, train_img, test_lbl, test_img = helper.get_data()

x_data = train_img.reshape(-1, img_size * img_size) / np.float32(256)
test_data = test_img.reshape(-1, img_size * img_size) / np.float32(256)

layers = [
# LinearLayer(name="input", n_in=x_data.shape[1], n_out=num_hidden, activation=sigmoid_function),
Dense(units = 32, activation = "relu", input_shape = (x_data.shape[1],)),
SparseLayer(
name="hidden 1",
n_in=x_data.shape[1],
n_out=num_hidden,
activation=sigmoid_function,
num_k_sparse=k,
),
LinearLayer(
name="output",
n_in=num_hidden,
n_out=x_data.shape[1],
activation=sigmoid_function,
),
]

nn = FCNeuralNet(layers=layers, cost_func=subtract_err)
nn.print_network()

nn.train(
x_data,
x_data,
learning_rate=learning_rate,
epochs=epochs,
batch_size=batch_size,
print_epochs=print_epochs,
)
File renamed without changes.
File renamed without changes.

