
Commit

Fixed directories, and more updates
komodovaran committed Feb 3, 2020
1 parent ea44f83 commit 4d97540
Showing 16 changed files with 404 additions and 252 deletions.
89 changes: 63 additions & 26 deletions README.md
@@ -1,18 +1,29 @@
### Setup (tested on linux only!)
1. Install conda and a conda environment ("what?" "how?" - Google it!)
2. Install Tensorflow with `conda install tensorflow-gpu=2.0.0`. This **must** be installed as the first package. The contents
here are only tested with version 2.0, but it should work on later ones as well. If done correctly, check the script at
`/checks/test_tensorflow_gpu_is_working.py`
3. Install the rest of the conda requirements with

````conda install -f -y -q --name py37 -c conda-forge --file conda_requirements.txt````
3. Install everything else with `pip install -r requirements.txt`
4. If Tensorflow is installed correctly, run `checks/test_tensorflow_gpu_is_working`. If the device is correctly set up,
1. Install conda and create a conda environment. Conda installation
instructions for Linux, as well as how to create an environment, can be found
on the conda website.
2. Install Tensorflow with `conda install tensorflow-gpu=2.0.0`. This **must**
be installed as the first package. The contents here are only tested with
version 2.0, but it should work on later ones as well. If done correctly,
check the script at `/checks/test_tensorflow_gpu_is_working.py`
3. Install the rest of the conda requirements with
````
conda install -f -y -q --name py37 -c conda-forge --file conda_requirements.txt
````
4. Install everything else with `pip install -r requirements.txt`
5. If Tensorflow is installed correctly, run
`checks/test_tensorflow_gpu_is_working`. If the device is correctly set up,
Tensorflow is working and you're good to go!
6. Conda and pip don't always play well together, and this can break some of
the package installations. If for some reason a package was not installed, run a
script until you hit a `ModuleNotFoundError: No module named 'name_of_package'`
error, then install the missing module with `pip install name_of_package`.
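
Complementing step 5, the snippet below is a minimal, hypothetical sanity check
that Tensorflow can see the GPU. It is not the bundled
`checks/test_tensorflow_gpu_is_working.py` script, whose contents may differ.

```python
# Hypothetical quick check that Tensorflow 2.0 can see a GPU.
# This is a stand-in, not the bundled check script.
import tensorflow as tf

print("Tensorflow version:", tf.__version__)

# The `experimental` namespace is used because this API lived there in TF 2.0.
gpus = tf.config.experimental.list_physical_devices("GPU")
print("GPUs visible to Tensorflow:", gpus if gpus else "none found")
```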

### Interactive scripts
Large parts run interactively in the Python-package [Streamlit](www.streamlit.io). If a script has `st_` in front of the
name, it must be run interactively through Streamlit. To launch these from the terminal, write `streamlit run st_myscript.py`
Large parts run interactively in the Python package
[Streamlit](www.streamlit.io). If a script has `st_` in front of its name, it
must be run interactively through Streamlit (otherwise it doesn't produce any
visible output). To launch these from the terminal, write
`streamlit run st_myscript.py`.
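
As a rough illustration (not one of the repository's actual scripts), an
`st_`-prefixed script is just a normal Python file that calls Streamlit's API.
The filename `st_example.py` and the plotted data below are made up.

```python
# st_example.py - hypothetical illustration of an st_-prefixed script.
# Launch with: streamlit run st_example.py
import numpy as np
import streamlit as st

st.title("Example interactive viewer")

# Streamlit re-runs the whole script whenever a widget value changes.
n_points = st.slider("Number of points", min_value=10, max_value=1000, value=100)

data = np.random.randn(n_points, 2)
st.line_chart(data)
```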

### Naming convention

@@ -22,30 +33,56 @@ name, it must be run interactively through Streamlit. To launch these from the t

* name: filename (`model_005.h5`)

### Data format
Every dataset is preprocessed so that eventually you'll have a `hdf` file and a corresponding `npz` file. All
computations are done on the `npz` file because it's much faster, and compatible with tensorflow. However, group order
must be preserved according to the parent `hdf` dataframe.
* To access something one directory up, write `../` in front of the directory
name. Two directories up is `../../`, and so on.

A `npz` has just a single group index, whereas a dataframe may have both an `id` and `sub_id` if it's combined
from multiple sources. In that case, the `id` will correspond to the `npz` index, and `sub_id` will be the actual
group index in the sub-dataset. Group order is only preserved if dataframe groups are sorted by `['file', 'particle']`,
or for combined dataframes `['source', 'file', 'particle']`. To combine dataframes, the inputs are stacked in loaded order
(which must therefore also be sorted!). All of this is done automatically, if the right sequence of steps is taken.
* All paths in use are defined in `lib/globals.py`, so they can be conveniently
changed once here, rather than everywhere in the code.
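
As a hypothetical sketch of what such a module can look like (the constant
names below are illustrative, not the actual contents of `lib/globals.py`):

```python
# lib/globals.py - illustrative sketch only; the real constant names may differ.
from pathlib import Path

# Defined once here so every script resolves the same locations.
DATA_DIR = Path("data/preprocessed")
RESULTS_DIR = Path("results")
CLUSTER_INDICES_DIR = RESULTS_DIR / "cluster_indices"
MODELS_DIR = Path("models")
```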

### Data format
Every dataset is preprocessed so that eventually you'll have a `hdf` file and a
corresponding `npz` file. All computations are done on the `npz` file because
it's much faster, and compatible with tensorflow. However, group order must be
preserved according to the parent `hdf` dataframe.

A `npz` has just a single group index (i.e. '512' means trace id 512; remember,
Python counts from 0!), whereas a dataframe may have both an `id` and a `sub_id`
if it's combined from multiple sources. In that case, the `id` corresponds to
the `npz` index (i.e. the order of appearance), and `sub_id` is the actual group
index in the sub-dataset (currently not used). Group order is only preserved if
dataframe groups are sorted by `['file', 'particle']`, or by
`['source', 'file', 'particle']` for combined dataframes. To combine dataframes,
the inputs are stacked in loaded order (which must therefore also be sorted!).
All of this is done automatically if the right sequence of steps is taken.
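
A minimal sketch of how this ordering contract can be honoured when exporting a
dataframe to `npz` (the column name `intensity` is an assumption; this is an
illustration, not the repository's actual preprocessing code):

```python
# Illustrative only: export per-trace arrays to .npz in the same order as the
# parent dataframe's groups. The column name "intensity" is an assumption.
import numpy as np
import pandas as pd

def df_to_npz(df: pd.DataFrame, out_path: str) -> None:
    # Sorting by the grouping keys is what keeps the npz positional index
    # aligned with the parent dataframe's group order.
    df = df.sort_values(["file", "particle"])
    traces = [
        group["intensity"].to_numpy()
        for _, group in df.groupby(["file", "particle"], sort=False)
    ]
    # Arrays are stored positionally (arr_0, arr_1, ...), so index i in the
    # npz corresponds to group i of the sorted dataframe.
    np.savez(out_path, *traces)
```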


### Scripts to run, step by step
1. `get_cme_tracks.py` to convert from CME `.mat` files to a dataframe.
2. `prepare_data.py` to filter out too short data (set it low initially to be safe), traces that would be cut by the
tracking start/end, and
2. `prepare_data.py` to filter out data that is too short (set the threshold
low initially if you want to be safe; the model works almost equally well
regardless of the minimum length), as well as traces that would be cut by the
tracking start/end (i.e. if something starts at frame 0 of the video, it's
removed, because you can't be sure whether the actual event started at
"frame -10"). A minimal sketch of this kind of filtering follows this list.
3. `train_autoencoder.py` to train a model on the data.
4. `st_predict.py` to predict and plot the data. Initially, a UMAP model is trained.
5. `st_eval.py` once clustering is done and you want to explore the data.
4. `st_predict.py` to predict and plot the data. Initially, a UMAP model is
trained. This takes a while. It might even time out your Streamlit session, but
don't touch anything and it'll be ready eventually.
5. Every cluster is saved as a combination of model + data names, and will be
output to `results/cluster_indices/`. This contains the indices of every trace
(see above on how indexing works), and which cluster they belong to. Note that
every change in the analysis **OVERWRITES** the automatically created file
containing cluster indices. If you have reached a point where you want to save
them, go to `results/cluster_indices/` and rename the file so you're sure it
won't be overwritten.
6. `st_eval.py` once clustering is done and you want to explore the data. It
currently has limited functionality: it only supports looking at one or more
specific datasets/clusters.
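
As referenced in step 2, here is a minimal sketch of that kind of filtering
(assumed column names `file`, `particle`, and `frame`; this is not the actual
`prepare_data.py`):

```python
# Illustrative filter, not the real prepare_data.py: drop traces that are too
# short or that are already present at frame 0 (i.e. cut by the tracking start).
# An analogous check can be added for traces still present at the final frame.
import pandas as pd

def filter_traces(df: pd.DataFrame, min_len: int = 20) -> pd.DataFrame:
    kept = []
    for _, group in df.groupby(["file", "particle"], sort=False):
        if len(group) < min_len:
            continue  # too short to be useful
        if group["frame"].min() == 0:
            continue  # starts at the first video frame, so its true start is unknown
        kept.append(group)
    return pd.concat(kept).sort_values(["file", "particle"])
```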

### Things to avoid
In order to preserve group ordering, the original dataframes must be run through
`prepare_data.py` if they need to be filtered in some way. **DO NOT** run a combine dataframe through a filter,
because this messes up the internal group ordering that was first established when creating the combined dataframe.
`prepare_data.py` if they need to be filtered in some way. **DO NOT** run a
combined dataframe through a filter, because this messes up the internal group
ordering that was first established when creating the combined dataframe.

### Troubleshooting
#### Packages are missing
41 changes: 29 additions & 12 deletions best_clusters.py
@@ -8,13 +8,13 @@
from sklearn.mixture import GaussianMixture
from sklearn.cluster import MiniBatchKMeans
from tqdm import tqdm
from lib.utils import get_index

import lib.globals
import lib.globals
from lib.utils import get_index


def _plot_kmeans_scores(X, min, max, step):
def _plot_kmeans_score(X, min, max, step):
"""
Calculates scores for multiple values of kmeans
Args:
@@ -25,9 +25,8 @@ def _plot_kmeans_scores(X, min, max, step):
"""
rng = list(range(min, max, step))

def process(n):
def process_gaussian(n):
clf = GaussianMixture(n_components = n, random_state = 42)
# clf = MiniBatchKMeans(n_clusters=n, random_state=42)
labels = clf.fit_predict(X)

s = silhouette_score(X, labels)
@@ -36,20 +35,35 @@ def process(n):

return s, c, b

def process_kmeans(n):
clf = MiniBatchKMeans(n_clusters=n, random_state=42)
labels = clf.fit_predict(X)

s = silhouette_score(X, labels)
c = calinski_harabasz_score(X, labels)
return s, c


n_jobs = len(rng)
results = Parallel(n_jobs=n_jobs)(delayed(process)(i) for i in tqdm(rng))
results = np.column_stack(results).T
# Score both clustering methods across the same range of cluster counts
results_kmeans = Parallel(n_jobs=n_jobs)(delayed(process_kmeans)(i) for i in tqdm(rng))
results_kmeans = np.column_stack(results_kmeans).T
results_gaussian = Parallel(n_jobs=n_jobs)(delayed(process_gaussian)(i) for i in tqdm(rng))
results_gaussian = np.column_stack(results_gaussian).T

fig, ax = plt.subplots(nrows=3, ncols = 2)

fig, ax = plt.subplots(nrows=3)
ax[0].plot(rng, results[:, 0], "o-", color="blue", label="Silhouette score")
ax[1].plot(rng, results[:, 1], "o-", color="orange", label="CH score")
ax[2].plot(rng, results[:, 2], "o-", color="red", label="BIC")
ax[0, 0].set_title("K-means")
ax[0, 0].plot(rng, results_kmeans[:, 0], "o-", color="blue", label="Silhouette score")
ax[1, 0].plot(rng, results_kmeans[:, 1], "o-", color="orange", label="CH score")

ax[0, 1].set_title("Gaussian Mixture")
ax[0, 1].plot(rng, results_gaussian[:, 0], "o-", color="blue", label="Silhouette score")
ax[1, 1].plot(rng, results_gaussian[:, 1], "o-", color="orange", label="CH score")
ax[2, 1].plot(rng, results_gaussian[:, 2], "o-", color="red", label="BIC")

for a in ax.ravel():
a.legend(loc="upper right")

plt.tight_layout()
plt.savefig("plots/best_k.pdf")
plt.savefig("plots/best_clusters.pdf")
plt.show()


@@ -61,11 +75,14 @@ def main(encodings_name):

X, encodings = f["X_true"], f["features"]

X = X[0:1000]
encodings = encodings[0:1000]

arr_lens = np.array([len(xi) for xi in X])
(len_above_idx,) = np.where(arr_lens >= 30)
X, encodings, = get_index((X, encodings), index=len_above_idx)

_plot_kmeans_scores(encodings, min=2, max=100, step=3)
_plot_kmeans_score(encodings, min=2, max=100, step=3)


if __name__ == "__main__":
Empty file added data/preprocessed/.gitkeep
Empty file.
Empty file added ksparse/__init__.py
Empty file.
91 changes: 50 additions & 41 deletions other/sparse_gam.py
@@ -28,7 +28,7 @@

x_train, x_val = train_test_split(X)

input_shape = x_train.shape[1:]
shape = x_train.shape[1:]

# Define autoencoder parameters
EPOCHS = 10
@@ -43,52 +43,61 @@
INIT_SPARSITY, END_SPARSITY, EPOCHS
)

# k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(x)

# Build the Autoencoder Model
inputs = Input(shape=input_shape, name="encoder_input")
x = inputs

for filters in LAYER_FILTERS:
x = Conv2D(
filters=filters,
kernel_size=KERNEL_SIZE,
strides=2,
activation="relu",
padding="same",
data_format="channels_last",
)(x)
i = Input(shape)
x = Flatten()(i)
h = Dense(LATENT_DIM, activation = 'sigmoid')(x)

# Generate the latent vector
x = Flatten()(x)
x = Dense(LATENT_DIM, name="latent_vector")(x)
k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(x)
k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(h)

# Decoder
x = Dense(input_shape[0] * input_shape[1] * input_shape[2])(k_sparse)
x = Reshape(input_shape)(x)
x = Dense(x_train.shape[1] * x_train.shape[2] * x_train.shape[3])(k_sparse)
x = Reshape(shape)(x)
o = Activation("sigmoid")(x)

for filters in LAYER_FILTERS[::-1]:
x = Conv2DTranspose(
filters=filters,
kernel_size=KERNEL_SIZE,
strides=1,
activation="relu",
padding="same",
data_format="channels_last",
)(x)

x = Conv2DTranspose(
filters=input_shape[-1],
kernel_size=KERNEL_SIZE,
padding="same",
data_format="channels_last",
)(x)
# k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(x)

outputs = Activation(None, name="decoder_output")(x)
# Build the Autoencoder Model
# i = Input(shape=input_shape, name= "encoder_input")
# # for filters in LAYER_FILTERS:
# # x = Conv2D(
# # filters=filters,
# # kernel_size=KERNEL_SIZE,
# # strides=2,
# # activation="relu",
# # padding="same",
# # data_format="channels_last",
# # )(x)
#
# # Generate the latent vector
# x = Flatten()(i)
# x = Dense(LATENT_DIM, activation = "sigmoid")(x)
# k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(x)
#
#
# # Decoder
# x = Dense(int(np.product(input_shape)), activation = "relu")(x)
# x = Reshape(input_shape)(x)
#
# outputs = Dense(x_train.shape[1], activation = 'sigmoid')(k_sparse)

# for filters in LAYER_FILTERS[::-1]:
# x = Conv2DTranspose(
# filters=filters,
# kernel_size=KERNEL_SIZE,
# strides=1,
# activation="relu",
# padding="same",
# data_format="channels_last",
# )(x)
#
# x = Conv2DTranspose(
# filters=input_shape[-1],
# kernel_size=KERNEL_SIZE,
# padding="same",
# data_format="channels_last",
# )(x)

# Autoencoder = Encoder + Decoder
autoencoder = Model(inputs, outputs, name="autoencoder")
autoencoder = Model(i, o, name= "autoencoder")
autoencoder.summary()

autoencoder.compile(loss="mse", optimizer="adam")
9 changes: 2 additions & 7 deletions other/sparse_mnist.py
@@ -16,11 +16,9 @@
x_train, y_train = shuffle(x_train, y_train, random_state = 1)
x_test, y_test = shuffle(x_test, y_test, random_state = 1)


def process(x):
return x.reshape(x.shape[0], x.shape[1] ** 2) / 255


x_train = process(x_train)
x_test = process(x_test)

@@ -35,11 +33,8 @@ def process(x):

# Build the k-sparse autoencoder
inputs = Input((x_train.shape[1],))
x = Dense(embedding_size, activation = "relu")(inputs)
x = Dense(embedding_size, activation = "relu")(x)
k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(x)
x = Dense(embedding_size, activation = "relu")(k_sparse)

h = Dense(embedding_size, activation = 'sigmoid')(inputs)
k_sparse = KSparse(sparsity_levels = sparsity_levels, name = 'KSparse')(h)
outputs = Dense(x_train.shape[1], activation = 'sigmoid')(k_sparse)

ae1 = Model(inputs, outputs)
52 changes: 52 additions & 0 deletions other/sparse_mnist_new.py
@@ -0,0 +1,52 @@
import ksparse.utils.mnist.mnist_helper as mh
from ksparse.layers.linear_layer import LinearLayer
from ksparse.layers.sparse_layer import SparseLayer
from ksparse.nets.fcnn import *
from ksparse.utils.activations import *
from ksparse.utils.cost_functions import *
from tensorflow.keras.layers import Dense

import numpy as np

img_size = 28
num_hidden = 100
k = 70
learning_rate = 0.01
epochs = 10000
batch_size = 256
print_epochs = 1000
num_test_examples = 10

helper = mh.mnist_helper()
train_lbl, train_img, test_lbl, test_img = helper.get_data()

x_data = train_img.reshape(-1, img_size * img_size) / np.float32(256)
test_data = test_img.reshape(-1, img_size * img_size) / np.float32(256)

layers = [
# LinearLayer(name="input", n_in=x_data.shape[1], n_out=num_hidden, activation=sigmoid_function),
Dense(units = 32, activation = "relu", input_shape = (x_data.shape[1],)),
SparseLayer(
name="hidden 1",
n_in=x_data.shape[1],
n_out=num_hidden,
activation=sigmoid_function,
num_k_sparse=k,
),
LinearLayer(
name="output",
n_in=num_hidden,
n_out=x_data.shape[1],
activation=sigmoid_function,
),
]

nn = FCNeuralNet(layers=layers, cost_func=subtract_err)
nn.print_network()

nn.train(
x_data,
x_data,
learning_rate=learning_rate,
epochs=epochs,
batch_size=batch_size,
print_epochs=print_epochs,
)
File renamed without changes.
File renamed without changes.

