dual stage normalization and scaling #7
This is directly related to #5. A dual-stage approach is probably required due to the nature of the dataset (complex hierarchical time series with numerical features), so a simple transformation is not going to work, and/or will make the 2nd stage of the model more inaccurate.
Copied from #5 (dataset scaling/normalization before wavelet transform):
My main issues/concerns are the following:
Thus: more research is needed on scaling/normalization in the context of time series data for machine learning. In terms of code/practical implementation, I will most likely code multiple options for different training runs (with different scaling/normalization options), and then compare.
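The "multiple scaling options per training run" idea can be sketched as fitting each candidate scaler on the train split only and reusing those parameters on the other splits. A minimal sketch - the function names and the synthetic series are illustrative, not from the repository:

```python
import numpy as np

def minmax_scale(train, other):
    """Scale to [0, 1] using only the train split's min/max (no look-ahead)."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    return (train - lo) / (hi - lo), (other - lo) / (hi - lo)

def zscore_scale(train, other):
    """Standardize using only the train split's mean/std."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sd, (other - mu) / sd

rng = np.random.default_rng(0)
series = rng.normal(100, 5, size=500).cumsum()  # synthetic price-like series
train, test = series[:400], series[400:]

# Compare scaling options across runs by swapping the function, not the data.
for name, fn in [("minmax", minmax_scale), ("zscore", zscore_scale)]:
    tr, te = fn(train, test)
    print(name, round(float(tr.min()), 3), round(float(tr.max()), 3))
```

The key design point is that each scaler exposes the same (train, other) interface, so different training runs differ only in which function is plugged in.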
See commit 8073c42: scaled data and scaled denoised data are now saved in the data/interim folder; PDF output of the train-validate-test split, scaled + denoised, is in the reports folder.
The train-validate-test split is showing some questionable output; I'll look into it when I get a chance. There is the possibility it's a matplotlib/PDF render output issue. Update: yup, this definitely should not be happening - values shouldn't be negative.
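For what it's worth, negative values are not necessarily a render bug: if a min-max scaler is fit only on the train split (to avoid look-ahead), any validate/test value that falls below the train minimum scales to a negative number. A minimal illustration with made-up numbers, not the project's data:

```python
import numpy as np

train = np.array([10.0, 12.0, 15.0, 20.0])
test  = np.array([9.0, 16.0, 22.0])   # 9.0 dips below the train minimum

# Fit min/max on train only, then apply to test.
lo, hi = train.min(), train.max()
scaled_test = (test - lo) / (hi - lo)
print(scaled_test)  # first value is (9 - 10) / 10 = -0.1, i.e. negative
```

Values above the train maximum similarly scale to numbers greater than 1, so out-of-range output on the validate/test splits is expected behavior for this scaling scheme.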
Hi @timothyyu, I can't quite understand the structure of your repository.
@mg64ve Generally, I will test or refine my implementation in a Jupyter Notebook first.
@timothyyu thanks. I see there are many notebooks in the archive directory. What is the reason for that? |
Archived notebooks are not "current" with the latest commit - usually anything in the archived directory has already been implemented in Python. I know it's not exactly ideal, but this particular type of workflow allows me to rapidly prototype and develop while leveraging the data exploration and visualization tools provided by Anaconda + Jupyter Notebook, and then refine that exploration/visualization/prototype into something that can be reproduced by anyone who clones or forks the repository. Jupyter notebooks are great for visual analysis and exploration, but terrible for reproducible results, consistency, and development.
Ok @timothyyu, but in the active notebooks, you read data from ../data/interim/cdii_tvt_split.pickle |
Every step of the process is saved in the data/interim folder.
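As a sketch of that save/load pattern between processing steps - the actual contents and layout of cdii_tvt_split.pickle are not shown in this thread, so the dict structure below is an assumption:

```python
import os
import pickle
import tempfile

# Hypothetical intermediate artifact: each processing step writes its output
# under an interim folder so the next step (or a notebook) can pick it up.
splits = {"train": [1.0, 2.0], "validate": [3.0], "test": [4.0]}

path = os.path.join(tempfile.gettempdir(), "tvt_split_example.pickle")
with open(path, "wb") as f:
    pickle.dump(splits, f)

# A notebook or downstream script reloads the exact same object.
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(loaded["train"])
```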
Functions that are used to clean and split the dataset, and the function used to generate the report output, are in the repository's source files.
Ok @timothyyu, but what you call raw data is an .xls with several indicators. Where do you get this data from? To my understanding, raw data is OHLC + volume.
The raw data is the dataset made available with the source paper. The source journal article, "A deep learning framework for financial time series using stacked autoencoders and long-short term memory" (Bao et al., 2017), describes that their raw data is not just OHLC + volume, but an assortment of technical indicators and macroeconomic variables added to the OHLC + volume data:
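As an illustration of the kind of technical indicator the paper layers on top of OHLC + volume, here is a minimal MACD sketch (standard 12/26-period EMAs on a synthetic close series; this is not the paper's exact code or parameter set):

```python
import numpy as np

def ema(x, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = np.empty(len(x), dtype=float)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]
    return out

rng = np.random.default_rng(1)
close = 100 + rng.normal(0, 1, 300).cumsum()  # synthetic close prices

# MACD line: fast EMA minus slow EMA, one indicator per time step.
macd = ema(close, 12) - ema(close, 26)
print(float(macd[-1]))
```

Each indicator computed this way becomes one more feature column alongside OHLC + volume, which is exactly why the scaling discussion above matters: the indicator columns can live on very different numeric scales than the price columns.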
Ok @timothyyu I did not know they made this data available. |
@timothyyu got similar results processing data with R. |
Interesting - scaling the indicators separately from the OHLC is something I'm going to look into once I'm further along constructing the rest of the model. Additionally, I'm almost sure values from the wavelet transform have to be saved from the train sets to apply to the validate and test sets, but there are some limitations/issues regarding that (see #6 (comment)). Ideally, I'd like to see if this kind of hybrid model is viable before applying it to a streaming series.
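The "save wavelet parameters from train" concern can be illustrated with a simplified single-level Haar transform and soft thresholding - a sketch only, not the transform used in the paper or the repository - where the denoising threshold is estimated from the train split and then reused unchanged on the test split:

```python
import numpy as np

def haar_dwt(x):
    """Single-level Haar transform: approximation and detail coefficients."""
    x = x[: len(x) // 2 * 2]          # drop a trailing odd sample if present
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def haar_idwt(a, d):
    """Inverse of haar_dwt (exact reconstruction)."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def denoise(x, thresh):
    """Soft-threshold the detail coefficients with a fixed threshold."""
    a, d = haar_dwt(x)
    d = np.sign(d) * np.maximum(np.abs(d) - thresh, 0.0)
    return haar_idwt(a, d)

rng = np.random.default_rng(2)
train = np.sin(np.linspace(0, 8, 256)) + rng.normal(0, 0.1, 256)
test  = np.sin(np.linspace(8, 12, 128)) + rng.normal(0, 0.1, 128)

# Estimate the threshold from the TRAIN detail coefficients only
# (median-absolute-deviation noise estimate, universal threshold)...
_, d_train = haar_dwt(train)
sigma = np.median(np.abs(d_train)) / 0.6745
thresh = sigma * np.sqrt(2 * np.log(len(train)))

# ...then reuse that saved threshold on validate/test, instead of
# re-estimating it with future data (which would be look-ahead bias).
test_denoised = denoise(test, thresh)
```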
@timothyyu I don't think we need to be concerned about the reverse process. |
That is something I am still looking into - how the output from the SAE layers into the LSTM section is handled. I am not yet at this stage in replicating the results of the paper, so I can't fully answer your question at this time.
@timothyyu I don't know if that is really necessary. If you look at Gavin Tsang's document, he "normalised based only upon the minimum/maximum values of their corresponding training set in order to eliminate any prior knowledge of overall scale as would occur in real-time prediction". If you do this, you should evaluate whether the prediction is greater/lower than the previous value.
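Evaluating whether the prediction is greater/lower than the previous value amounts to measuring directional accuracy. A minimal sketch - the function name and the numbers are illustrative, not from either repository:

```python
import numpy as np

def directional_accuracy(prev, actual, predicted):
    """Fraction of steps where the predicted move (up/down relative to the
    previous value) matches the direction of the actual move."""
    return float(np.mean(np.sign(predicted - prev) == np.sign(actual - prev)))

prev      = np.array([10.0, 11.0, 10.5, 10.8])   # last observed values
actual    = np.array([11.0, 10.5, 10.8, 11.2])   # realized next values
predicted = np.array([10.6, 11.2, 11.0, 11.1])   # model outputs

# 3 of the 4 predicted directions match the realized directions -> 0.75
print(directional_accuracy(prev, actual, predicted))
```

This metric is scale-free, so it stays meaningful even when the normalization deliberately discards knowledge of the overall price level.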
see comment on #9: |
Relevant removed post/comment from r/algotrading that references this paper and echoes what I've found so far in attempting to replicate the model:
I agree - many papers do not consider many aspects, or they contain look-ahead bias. https://ieeexplore.ieee.org/document/8280883 But I don't have access.
@mg64ve here's the paper; I haven't had a chance to go through it yet, but I'll be including it under
Careful attention is required for proper scaling/normalization of the Panel B and Panel C indicators in relation to the OHLC data (Panel A). When visualizing the train-validate-test split with matplotlib, some of the index types show two or more lines, which shouldn't be possible - or at least that is what I thought was the case (I was very, very wrong):
It turns out the Panel C indicators and Panel B indicators for the other index types are so far out of range that they flatten the other features in terms of visualization/plotting:
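For visualization only (separate from the no-look-ahead scaling used for modeling), rescaling each feature independently to [0, 1] before plotting keeps a ~1e2-scale price series and a ~1e9-scale indicator both visible on the same axis. A sketch with synthetic data standing in for the Panel A/B/C columns:

```python
import numpy as np

rng = np.random.default_rng(3)
features = {
    "close":  100 + rng.normal(0, 1, 200).cumsum(),  # Panel A scale: ~1e2
    "volume": 1e9 + rng.normal(0, 1e8, 200),         # Panel B/C scale: ~1e9
}

# On a shared y-axis the ~1e9 feature flattens the ~1e2 feature into a line
# at the bottom of the plot; per-feature min-max rescaling avoids that.
plot_ready = {
    name: (vals - vals.min()) / (vals.max() - vals.min())
    for name, vals in features.items()
}
for name, vals in plot_ready.items():
    print(name, float(vals.min()), float(vals.max()))
```

Each `plot_ready` series spans exactly [0, 1], so every feature gets the full vertical extent of the figure regardless of its original magnitude.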