Dataset questions #8

khundman · 2018-08-30T14:11:33Z

Answers to questions received via email:

Q1. The anomaly_sequences column in labeled_anomalies.csv gives the start and end indices of true anomalies in the stream. However, I don’t know whether the indices in your files begin at 0 or 1. For example, for [[6000,8127]] on channel id “D-2”, does the start index “6000” refer to row 6000 (counting from 1) or row 6001 (counting from 0) of the file “test/D-2.txt”?

The indices begin at 0.
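
A minimal sketch of what the zero-based convention means when turning a labeled range into per-row labels. It assumes that both the start and the end index are inclusive, which is consistent with the answer to Q2 (the last valid end index is num_values - 1). The helper name `anomaly_mask` and the toy stream length are illustrative and not part of the repository.

```python
import ast

def anomaly_mask(anomaly_sequences: str, num_values: int) -> list[bool]:
    """Return a per-row boolean mask from a string such as "[[6000,8127]]".

    Hypothetical helper: assumes zero-based indices with an inclusive end.
    """
    mask = [False] * num_values
    for start, end in ast.literal_eval(anomaly_sequences):
        for i in range(start, end + 1):  # +1 because the end index is inclusive
            mask[i] = True
    return mask

# Toy example: a 10-row stream with one anomaly covering rows 3..5 (zero-based).
mask = anomaly_mask("[[3, 5]]", num_values=10)
print([i for i, flagged in enumerate(mask) if flagged])  # [3, 4, 5]
```

Under this convention, a start index of 6000 points at the 6001st row of the file when rows are counted from 1.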

Q2. Comparing the anomaly_sequences and num_values columns in labeled_anomalies.csv, I found that some end indices are larger than num_values: A-8.txt, A-9.txt, D-9.txt, F-2.txt. Is this a mistake?

This was an error and has been cleaned up. The anomalies go to the end of the sequence and the end of the range should equal num_values - 1.
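
If you want to confirm this on your own copy of the file, a check along the following lines may help. It is only a sketch: the column names `anomaly_sequences` and `num_values` come from the question above, `chan_id` is an assumed name for the channel column, and the sample rows are made up for illustration rather than taken from labeled_anomalies.csv.

```python
import ast
import csv
import io

def out_of_range_rows(csv_text: str) -> list[tuple[str, int, int]]:
    """List (channel, end_index, num_values) where end_index > num_values - 1."""
    problems = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        num_values = int(row["num_values"])
        for _start, end in ast.literal_eval(row["anomaly_sequences"]):
            if end > num_values - 1:
                problems.append((row["chan_id"], end, num_values))
    return problems

# Made-up rows for illustration; these are not values from the real file.
SAMPLE = '''chan_id,anomaly_sequences,num_values
X-1,"[[100, 199]]",200
X-2,"[[150, 250]]",250
'''

print(out_of_range_rows(SAMPLE))  # [('X-2', 250, 250)]
```

An empty list means every labeled range ends at or before the last row of its stream.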

Q3. In both your test and train files, most of the values are 0. Could you give more background on the data to explain why?

The “Raw experiment data” section of the readme explains this: “Model input data also includes one-hot encoded information about commands that were sent or received by specific spacecraft modules in a given time window. No identifying information related to the timing or nature of commands is included in the data.” So you see lots of zeroes where commands weren’t sent to or received by a specific spacecraft module in a time window. At most timesteps, for most of the spacecraft submodules, there is no command activity. The first dimension holds the prior telemetry values for that channel (the -1.000s in the example you screenshotted) and will be primarily nonzero.
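
To see this split in a loaded array, something like the sketch below can be used. It assumes each timestep is a row whose first entry is the telemetry value and whose remaining entries are the one-hot command indicators, as described above. The function name `zero_fraction` and the sample rows are made up for illustration.

```python
def zero_fraction(rows: list[list[float]]) -> tuple[float, float]:
    """Return (zero fraction in column 0, zero fraction in all other columns)."""
    telemetry = [row[0] for row in rows]
    commands = [value for row in rows for value in row[1:]]
    telemetry_zeros = sum(v == 0 for v in telemetry) / len(telemetry)
    command_zeros = sum(v == 0 for v in commands) / len(commands)
    return telemetry_zeros, command_zeros

# Made-up timesteps: column 0 is telemetry, columns 1..3 are command indicators.
rows = [
    [-1.0, 0.0, 0.0, 0.0],
    [-1.0, 0.0, 1.0, 0.0],
    [-0.5, 0.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 0.0],
]
telemetry_zeros, command_zeros = zero_fraction(rows)
print(f"{telemetry_zeros:.2f} {command_zeros:.2f}")  # 0.00 0.92
```

A high zero fraction in the command columns alongside a low one in the first column matches the layout described in the readme.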

Q4. What is the time interval between the adjacent rows?

For the anomalies from the SMAP spacecraft, values are aggregated into 1-minute buckets. For MSL, the time bucket size is variable, as data rates are inconsistent and no interpolation between values was performed to fill missing buckets. This is one factor in the poorer performance seen for MSL anomalies and something we will be addressing in future iterations.
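
For SMAP channels, the 1-minute bucket size allows a rough conversion from row indices to elapsed time. The sketch below assumes every bucket in the range is present, which is not confirmed here, and it does not apply to MSL, where the bucket size varies. `smap_span` is a hypothetical helper name.

```python
from datetime import timedelta

SMAP_BUCKET = timedelta(minutes=1)  # bucket size stated for SMAP channels

def smap_span(start: int, end: int) -> timedelta:
    """Approximate duration covered by rows start..end (inclusive, zero-based).

    Assumes every one-minute bucket in the range is present; do not use this
    for MSL, where the bucket size varies.
    """
    if end < start:
        raise ValueError("end must be >= start")
    return (end - start + 1) * SMAP_BUCKET

print(smap_span(0, 59))  # 1:00:00
```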

Q5. I found that the anomaly for channel id P-2 is described twice, with different values (rows 19 and 53); however, there is no description of an anomaly for channel id T-10.

P-2 is the same channel with two anomalies occurring at different points in time, which is why you see two separate anomalies for that channel. These are entirely separate events and data that happen to occur for the same channel at different points in time. The full ranges of values are non-overlapping and the fact that the anomalous sequences have overlapping indices is coincidental.

T-10 didn’t have enough values to include, so it was intentionally removed from the dataset; in the interest of time, we didn’t rename all the channels.


jules-samaran · 2019-11-14T02:11:31Z

Thank you for making your analysis and data available. I'd like to use the dataset to assess multivariate time series anomaly detection algorithms, using the telemetry values of each channel as one dimension (so the dimension of my time series would be the number of channels). However, I noticed that the channels don't all have the same number of values (timesteps). Were the values for different channels not collected over the same time period? If not, is there any way to use the data from all channels jointly as a multivariate time series?

khundman · 2019-11-22T21:45:24Z

@jules-samaran it is not possible to "stack" the telemetry values as you are suggesting. Values for each channel come from different, independent time windows. Also, for MSL, data for a given channel often arrives at irregular intervals and no imputation of timesteps in between values was performed.

anand-gy · 2023-01-30T06:52:56Z

I am using my own raw data. I have put the train and test data in the train and test folders, respectively. Currently I am using only a single channel, with the telemetry value in column 0. However, when I run it I get the errors "No such file or directory:'data\train\T_test.npy" and "No such file or directory:'data\test\T_test.npy".
Kindly guide me. I am new to AI/ML.

khundman added the "question" label (Further information is requested) on Aug 30, 2018

khundman pushed a commit that referenced this issue Aug 30, 2018

Fixes #7 and #8

5e87347

khundman mentioned this issue Mar 22, 2020

Does the data only have categorical inputs? #35

Closed

khundman mentioned this issue Nov 21, 2020

Duplicate P-2 and non-exist T-10 #48

Closed

peerschuett mentioned this issue Sep 22, 2023

Question about MSL and SMAP dataset d-ailin/GDN#82

Closed
