dual stage normalization and scaling #7
This is directly related to #5. A dual-stage approach is probably required due to the nature of the dataset (complex hierarchical time series with numerical features), so a simple transformation is not going to work, and/or will make the 2nd stage of the model more inaccurate.
Copied from #5 (dataset scaling/normalization before wavelet transform):
My main issues/concerns are the following:
Thus: more research is needed on scaling/normalization in the context of time series data for machine learning. In terms of code/practical implementation, I will most likely code multiple options for different training runs (with different scaling/normalization options), and then compare.
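The "multiple scaling options per training run" idea can be sketched as fitting each candidate scaler on the train split only and reusing those parameters on the other splits. A minimal sketch - the function names and the synthetic series are illustrative, not from the repository:

```python
import numpy as np

def minmax_scale(train, other):
    """Scale to [0, 1] using only the train split's min/max (no look-ahead)."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    return (train - lo) / (hi - lo), (other - lo) / (hi - lo)

def zscore_scale(train, other):
    """Standardize using only the train split's mean/std."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sd, (other - mu) / sd

rng = np.random.default_rng(0)
series = rng.normal(100, 5, size=500).cumsum()  # synthetic price-like series
train, test = series[:400], series[400:]

# Compare scaling options across runs by swapping the function, not the data.
for name, fn in [("minmax", minmax_scale), ("zscore", zscore_scale)]:
    tr, te = fn(train, test)
    print(name, round(float(tr.min()), 3), round(float(tr.max()), 3))
```

The key design point is that each scaler exposes the same (train, other) interface, so different training runs differ only in which function is plugged in.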
See commit 8073c42: scaled data and scaled denoised data are now saved in the data/interim folder; PDF output of the train-validate-test split, scaled + denoised, is in the reports folder.
The train-validate-test split is showing some questionable output; I'll look into it when I get a chance. There is the possibility it's a matplotlib/PDF render output issue. Update: yup, this definitely should not be happening - values shouldn't be negative.
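For what it's worth, negative values are not necessarily a render bug: if a min-max scaler is fit only on the train split (to avoid look-ahead), any validate/test value that falls below the train minimum scales to a negative number. A minimal illustration with made-up numbers, not the project's data:

```python
import numpy as np

train = np.array([10.0, 12.0, 15.0, 20.0])
test  = np.array([9.0, 16.0, 22.0])   # 9.0 dips below the train minimum

# Fit min/max on train only, then apply to test.
lo, hi = train.min(), train.max()
scaled_test = (test - lo) / (hi - lo)
print(scaled_test)  # first value is (9 - 10) / 10 = -0.1, i.e. negative
```

Values above the train maximum similarly scale to numbers greater than 1, so out-of-range output on the validate/test splits is expected behavior for this scaling scheme.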
Hi @timothyyu, I can't quite understand the structure of your repository.
@mg64ve Generally, I will test or refine my implementation in a Jupyter Notebook first.
@timothyyu thanks. I see there are many notebooks in the archive directory. What is the reason for that? |
Archived notebooks are not "current" with the latest commit - usually anything in the archived directory has already been implemented in Python. I know it's not exactly ideal, but this particular type of workflow allows me to rapidly prototype and develop while leveraging the data exploration and visualization tools provided by Anaconda + Jupyter Notebook, and then refine that exploration/visualization/prototype into something that can be reproduced by anyone who clones or forks the repository. Jupyter notebooks are great for visual analysis and exploration, but terrible for reproducible results, consistency, and development.
Ok @timothyyu, but in the active notebooks, you read data from ../data/interim/cdii_tvt_split.pickle |
Every step of the process is saved in the data/interim folder.
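As a sketch of that save/load pattern between processing steps - the actual contents and layout of cdii_tvt_split.pickle are not shown in this thread, so the dict structure below is an assumption:

```python
import os
import pickle
import tempfile

# Hypothetical intermediate artifact: each processing step writes its output
# under an interim folder so the next step (or a notebook) can pick it up.
splits = {"train": [1.0, 2.0], "validate": [3.0], "test": [4.0]}

path = os.path.join(tempfile.gettempdir(), "tvt_split_example.pickle")
with open(path, "wb") as f:
    pickle.dump(splits, f)

# A notebook or downstream script reloads the exact same object.
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(loaded["train"])
```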
Functions that are used to clean and split the dataset, and the function used to generate the report output, are in the repository's source files.
Ok @timothyyu, but what you call raw data is an .xls with several indicators. Where do you get this data from? To my understanding, raw data is OHLC + volume.
The raw data is the dataset made available with the source paper. The source journal article, "A deep learning framework for financial time series using stacked autoencoders and long-short term memory" (Bao et al., 2017), describes that their raw data is not just OHLC + volume, but an assortment of technical indicators and macroeconomic variables added to the OHLC + volume data:
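As an illustration of the kind of technical indicator the paper layers on top of OHLC + volume, here is a minimal MACD sketch (standard 12/26-period EMAs on a synthetic close series; this is not the paper's exact code or parameter set):

```python
import numpy as np

def ema(x, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = np.empty(len(x), dtype=float)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]
    return out

rng = np.random.default_rng(1)
close = 100 + rng.normal(0, 1, 300).cumsum()  # synthetic close prices

# MACD line: fast EMA minus slow EMA, one indicator per time step.
macd = ema(close, 12) - ema(close, 26)
print(float(macd[-1]))
```

Each indicator computed this way becomes one more feature column alongside OHLC + volume, which is exactly why the scaling discussion above matters: the indicator columns can live on very different numeric scales than the price columns.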
Ok @timothyyu I did not know they made this data available. |
@timothyyu got similar results processing data with R. |
Interesting - scaling the indicators separately from the OHLC is something I'm going to look into once I'm further along constructing the rest of the model. Additionally, I'm almost sure values from the wavelet transform have to be saved from the train sets to apply to the validate and test sets, but there are some limitations/issues regarding that (see #6 (comment)). Ideally, I'd like to see if this kind of hybrid model is viable before applying it to a streaming series.
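The "save wavelet parameters from train" concern can be illustrated with a simplified single-level Haar transform and soft thresholding - a sketch only, not the transform used in the paper or the repository - where the denoising threshold is estimated from the train split and then reused unchanged on the test split:

```python
import numpy as np

def haar_dwt(x):
    """Single-level Haar transform: approximation and detail coefficients."""
    x = x[: len(x) // 2 * 2]          # drop a trailing odd sample if present
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def haar_idwt(a, d):
    """Inverse of haar_dwt (exact reconstruction)."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def denoise(x, thresh):
    """Soft-threshold the detail coefficients with a fixed threshold."""
    a, d = haar_dwt(x)
    d = np.sign(d) * np.maximum(np.abs(d) - thresh, 0.0)
    return haar_idwt(a, d)

rng = np.random.default_rng(2)
train = np.sin(np.linspace(0, 8, 256)) + rng.normal(0, 0.1, 256)
test  = np.sin(np.linspace(8, 12, 128)) + rng.normal(0, 0.1, 128)

# Estimate the threshold from the TRAIN detail coefficients only
# (median-absolute-deviation noise estimate, universal threshold)...
_, d_train = haar_dwt(train)
sigma = np.median(np.abs(d_train)) / 0.6745
thresh = sigma * np.sqrt(2 * np.log(len(train)))

# ...then reuse that saved threshold on validate/test, instead of
# re-estimating it with future data (which would be look-ahead bias).
test_denoised = denoise(test, thresh)
```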
@timothyyu I don't think we need to be concerned about the reverse process. |
That is something I am still looking into - how the output from the SAE layers into the LSTM section is handled. I am not yet at this stage in replicating the results of the paper, so I can't fully answer your question at this time.
@timothyyu I don't know if that is really necessary. If you look at Gavin Tsang's document, he "normalised based only upon the minimum/maximum values of their corresponding training set in order to eliminate any prior knowledge of overall scale as would occur in real-time prediction". If you do this, you should evaluate whether the prediction is greater/lower than the previous value.
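Evaluating whether the prediction is greater/lower than the previous value amounts to measuring directional accuracy. A minimal sketch - the function name and the numbers are illustrative, not from either repository:

```python
import numpy as np

def directional_accuracy(prev, actual, predicted):
    """Fraction of steps where the predicted move (up/down relative to the
    previous value) matches the direction of the actual move."""
    return float(np.mean(np.sign(predicted - prev) == np.sign(actual - prev)))

prev      = np.array([10.0, 11.0, 10.5, 10.8])   # last observed values
actual    = np.array([11.0, 10.5, 10.8, 11.2])   # realized next values
predicted = np.array([10.6, 11.2, 11.0, 11.1])   # model outputs

# 3 of the 4 predicted directions match the realized directions -> 0.75
print(directional_accuracy(prev, actual, predicted))
```

This metric is scale-free, so it stays meaningful even when the normalization deliberately discards knowledge of the overall price level.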
see comment on #9: |
Relevant removed post/comment from r/algotrading that references this paper and echoes what I've found so far in attempting to replicate the model:
I agree - many papers do not consider many aspects, or they contain look-ahead bias. https://ieeexplore.ieee.org/document/8280883 But I don't have access.
@mg64ve here's the paper; I haven't had a chance to go through it yet, but I'll be including it under
Careful attention is required for proper scaling/normalization of the Panel B and Panel C indicators in relation to the OHLC data (Panel A). When visualizing the train-validate-test split with matplotlib, some of the index types show two or more lines, which shouldn't be possible - or at least that is what I thought was the case (I was very, very wrong):
It turns out the Panel C indicators and Panel B indicators for the other index types are so far out of range that they flatten the other features in terms of visualization/plotting:
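For visualization only (separate from the no-look-ahead scaling used for modeling), rescaling each feature independently to [0, 1] before plotting keeps a ~1e2-scale price series and a ~1e9-scale indicator both visible on the same axis. A sketch with synthetic data standing in for the Panel A/B/C columns:

```python
import numpy as np

rng = np.random.default_rng(3)
features = {
    "close":  100 + rng.normal(0, 1, 200).cumsum(),  # Panel A scale: ~1e2
    "volume": 1e9 + rng.normal(0, 1e8, 200),         # Panel B/C scale: ~1e9
}

# On a shared y-axis the ~1e9 feature flattens the ~1e2 feature into a line
# at the bottom of the plot; per-feature min-max rescaling avoids that.
plot_ready = {
    name: (vals - vals.min()) / (vals.max() - vals.min())
    for name, vals in features.items()
}
for name, vals in plot_ready.items():
    print(name, float(vals.min()), float(vals.max()))
```

Each `plot_ready` series spans exactly [0, 1], so every feature gets the full vertical extent of the figure regardless of its original magnitude.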