This repository contains implementations of LSTM-based autoencoders and bidirectional LSTM VAEs designed to encode variable-length time series into latent spaces. These latent representations can be used for tasks such as clustering, anomaly detection, and feature extraction. The models leverage TensorFlow and TensorFlow Probability for robust and scalable neural network architectures.
Note that this code has never been published nor tested extensively after its initial release, and is provided as-is.
The LSTM VAE is used to categorize data with minimal prior knowledge about the data.
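For orientation, here is a minimal sketch of the idea (not this repo's exact architecture; the layer sizes, masking value, and variable names are assumptions): a bidirectional LSTM encoder that maps variable-length, zero-padded sequences to the parameters of a latent Gaussian.

```python
# Illustrative sketch only -- layer sizes and names are assumptions,
# not the architecture used in this repo's training scripts.
import tensorflow as tf

n_features = 2   # channels per time step (assumed)
latent_dim = 8   # size of the latent space (assumed)

inputs = tf.keras.Input(shape=(None, n_features))          # variable-length sequences
masked = tf.keras.layers.Masking(mask_value=0.0)(inputs)    # ignore zero-padding
summary = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64))(masked)                        # fixed-size summary vector
z_mean = tf.keras.layers.Dense(latent_dim)(summary)          # Gaussian mean
z_log_var = tf.keras.layers.Dense(latent_dim)(summary)       # Gaussian log-variance
encoder = tf.keras.Model(inputs, [z_mean, z_log_var])
```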
- Install conda and create a conda environment. Conda installation instructions for Linux, as well as how to create an environment, can be found on the conda website.
- Install TensorFlow with `conda install tensorflow-gpu=2.0.0`. This must be the first package installed. Everything here is only tested with version 2.0, but it should work on later versions as well. To verify the installation, use the script at `checks/test_tensorflow_gpu_is_working.py`.
- Install the rest of the conda requirements with `conda install -f -y -q --name py37 -c conda-forge --file conda_requirements.txt`.
- Install everything else with `pip install -r requirements.txt`.
- If TensorFlow is installed correctly, run `checks/test_tensorflow_gpu_is_working.py`. If the device is correctly set up, TensorFlow is working and you're good to go!
- Conda and pip don't talk to each other, and this breaks some of the package installations. If for some reason a package was not installed, try running a script until you hit a `ModuleNotFoundError: No module named 'name_of_package'` error, and install the missing module with `pip install name_of_package`.
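If you prefer a quick inline check instead of the bundled test script, something like the following (a minimal sketch, independent of the repo's own script) lists the GPUs visible to TensorFlow 2.x:

```python
# Minimal GPU sanity check -- not the repo's test script
import tensorflow as tf

# Prints at least one GPU device if TensorFlow can see the card
print(tf.config.experimental.list_physical_devices("GPU"))
```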
Large parts run interactively in the Python package Streamlit. If a script has `st_` in front of its name, it must be run interactively through Streamlit (or else it doesn't produce any visible output). To launch these from the terminal, run `streamlit run st_myscript.py`.
- `dir`: relative directory (`models/mymodel/`)
- `path`: relative path to a file (`models/mymodel/model_005.h5`)
- `name`: filename (`model_005.h5`)
- To access something one directory up, write `../` in front of the directory name. Two directories up is `../../`, and so on.
- All paths in use are defined in `lib/globals.py`, so they can be conveniently changed once there, rather than everywhere in the code.
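As a hypothetical illustration (the actual constant names in `lib/globals.py` may differ), the idea is simply to keep every path as a module-level constant that the other scripts import instead of hard-coding paths:

```python
# lib/globals.py -- hypothetical example; the real constant names may differ
MODELS_DIR = "models/"
RESULTS_DIR = "results/"
CLUSTER_IDX_DIR = "results/cluster_indices/"
DATA_DIR = "data/"
```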
Every dataset is preprocessed so that eventually you'll have an `hdf` file and a corresponding `npz` file. All computations are done on the `npz` file because it's much faster and compatible with TensorFlow. However, group order must be preserved according to the parent `hdf` dataframe.

An `npz` file has just a single group index (i.e. '512' means trace id 512; remember, Python counts from 0!), whereas a dataframe may have both an `id` and a `sub_id` if it's combined from multiple sources. In that case, the `id` will correspond to the `npz` index (i.e. the order of appearance), and `sub_id` will be the actual group index in the sub-dataset (which is currently not used). Group order is only preserved if dataframe groups are sorted by `['file', 'particle']`, or for combined dataframes `['source', 'file', 'particle']`. To combine dataframes, the inputs are stacked in loaded order (which must therefore also be sorted!). All of this is done automatically if the right sequence of steps is taken.
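A minimal sketch of this indexing convention (the file names and the npz key are assumptions; the grouping columns follow the description above):

```python
# Sketch only: file names and the npz key are assumptions; the grouping
# columns and index convention follow the description above.
import numpy as np
import pandas as pd

df = pd.read_hdf("data/mydata.h5")                               # parent dataframe
arrays = np.load("data/mydata.npz", allow_pickle=True)["data"]   # assumed key

# Group order must match the npz order, so sort the groups the same way
groups = list(df.groupby(["file", "particle"], sort=True))

idx = 512                   # trace id 512 (Python counts from 0!)
trace_df = groups[idx][1]   # that trace's rows in the parent dataframe
trace_arr = arrays[idx]     # the corresponding npz entry
```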
- `get_cme_tracks.py` to convert from `CME.mat` files to a dataframe.
- `prepare_data.py` to filter out data that is too short (set the minimum length low initially if you want to be safe; the model will work almost equally well regardless of the minimum length). If desired, it can also remove tracks that would be cut off by the tracking start/end (i.e. if something starts at frame 0 of the video, it's removed, because you can't be sure whether the actual event started at "frame -10"). This can also be disabled if not desirable/applicable for the data at hand.
- `train_autoencoder.py` to train a model on the data.
- `st_predict.py` to predict and plot the data. Initially, a UMAP model is trained. This takes a while. It might even time out your Streamlit session, but don't touch anything and it'll be ready eventually.
- Every cluster is saved as a combination of model + data names, and will be output to `results/cluster_indices/`. This contains the indices of every trace (see above on how indexing works) and which cluster each trace belongs to (a hypothetical example of reading such a file is sketched after this list). Note that every change in the analysis OVERWRITES the automatically created file containing cluster indices. If you have reached a point where you want to save them, go to `results/cluster_indices/` and rename the file so you're sure it won't be overwritten.
- `st_eval.py` once clustering is done and you want to explore the data. It currently doesn't have much functionality beyond looking at one or more specific datasets/clusters.
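The exact on-disk format of the cluster index files isn't documented here; purely as a hypothetical illustration, if a file stored trace indices and cluster labels as parallel arrays in an `.npz`, selecting one cluster might look like this:

```python
# Hypothetical illustration only -- the file name, format, and keys are
# assumptions, not the repo's documented output.
import numpy as np

f = np.load("results/cluster_indices/mymodel_mydata.npz")  # assumed file/format
trace_idx = f["idx"]        # assumed key: npz trace indices
labels = f["cluster"]       # assumed key: cluster label per trace

cluster_0 = trace_idx[labels == 0]  # all traces assigned to cluster 0
```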
In order to preserve group ordering, the original dataframes must be run through `prepare_data.py` if they need to be filtered in some way. DO NOT run a combined dataframe through a filter, because this messes up the internal group ordering that was first established when creating the combined dataframe.
If any script complains about a missing package that I may have overlooked, it can be installed with `pip install packagename`.
Streamlit was never designed for super heavy computations. The underlying calculations are as fast as possible but due to the way Streamlit is set up, it appears to be slow. Rest assured, after you put in the parameters, Streamlit will get there eventually. Just don't touch anything until it's done, because the script will re-run whenever any parameters are changed.
To stop a running Streamlit session:

- Hit `Ctrl+Z` in the terminal.
- If the above doesn't work, run `killall streamlit`.
If after using Streamlit you get an error like `tensorflow/core/kernels/cudnn_rnn_ops.cc:1624] Check failed: stream->parent()->GetRnnAlgorithms(&algorithms)`, it means that a TensorFlow GPU session is still active from the Streamlit session. To fix this, open a terminal, run `nvidia-smi`, and search for the `pid` of the Streamlit process. Then run `kill -9 pid`, where `pid` is the number found above.
TensorFlow by default creates directories with incorrect permissions for PyCharm. To fix this and make them deletable from PyCharm, navigate to the base directory and run `sudo chmod -R 777 models`.
Open an issue and I'll see what I can do...