Dataset questions #8
Thank you for making your analysis and data available. I'd like to use the dataset to assess the performance of multivariate time series anomaly detection algorithms, using the telemetry values of each channel as one dimension (so the dimension of my time series would be the number of channels). However, I noticed that the channels don't all have the same number of values (time steps). Were the values for different channels not collected over the same time sequence? If not, is there any way to use the data from each channel jointly as a multivariate time series?
@jules-samaran it is not possible to "stack" the telemetry values as you are suggesting. Values for each channel come from different, independent time windows. Also, for MSL, data for a given channel often arrives at irregular intervals and no imputation of timesteps in between values was performed.
I am using my own raw data. I have put the train and test data in the train and test folders respectively. Currently I am using only a single channel, with the telemetry value in column 0. However, when I run it I get the errors "No such file or directory:'data\train\T_test.npy'" and "No such file or directory:'data\test\T_test.npy'".
Answers to questions received via email:
Q1. The `anomaly_sequences` column in `labeled_anomalies.csv` gives the start and end indices of the true anomalies in the stream. However, I don’t know whether the indices in your files begin at `0` or `1`. For example, for [[6000,8127]] for channel id “D-2”, does the start index “6000” mean row “6000” (counting from “1”) or row “6001” (counting from “0”) of the file “test/D-2.txt”?
The indices begin at 0.
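As a minimal sketch of how 0-based indices could be turned into per-row labels, the snippet below parses an `anomaly_sequences` string and marks the affected rows. It assumes the end index is inclusive, and the helper name `anomaly_mask` and the `num_values` of 8640 used in the example are hypothetical, chosen only for illustration.

```python
import ast


def anomaly_mask(anomaly_sequences, num_values):
    """Return a list of booleans, True where a row is labeled anomalous.

    Assumes 0-based indices and an inclusive end index (an assumption,
    not something the code in the repository is known to do).
    """
    mask = [False] * num_values
    # The column stores a Python-style list of [start, end] pairs as text.
    for start, end in ast.literal_eval(anomaly_sequences):
        if not 0 <= start <= end < num_values:
            raise ValueError(f"range [{start}, {end}] outside 0..{num_values - 1}")
        for i in range(start, end + 1):
            mask[i] = True
    return mask


# Hypothetical example: the D-2 range from the question, with a made-up length.
mask = anomaly_mask("[[6000, 8127]]", 8640)
print(sum(mask))  # number of rows labeled anomalous
```

With 0-based indexing, `mask[6000]` corresponds to the 6001st line of the text file when lines are counted from 1.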
Q2. For the `anomaly_sequences` and `num_values` columns in `labeled_anomalies.csv`, I found that some end indices are larger than `num_values`: A-8.txt, A-9.txt, D-9.txt, F-2.txt. Is this a mistake?
This was an error and has been cleaned up. The anomalies go to the end of the sequence and the end of the range should equal `num_values` - 1.
Q3. In both your `test` and `train` files, I found that most values are `0`. I would like more background on the data to understand why most of the values are `0`.
The “Raw experiment data” section of the readme explains this: “Model input data also includes one-hot encoded information about commands that were sent or received by specific spacecraft modules in a given time window. No identifying information related to the timing or nature of commands is included in the data.” So you see lots of zeroes where commands weren’t sent to or received by a specific spacecraft module in a time window. At most timesteps for most of the spacecraft submodules, there is no command activity. The first dimension is the prior telemetry values for that channel (the -1.000s in the example you screenshotted) and will be primarily nonzero.
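The layout described above (telemetry in the first column, one-hot command flags in the rest) can be illustrated with a small sketch. The sample rows and the helper names `split_row` and `zero_fraction` are hypothetical; real files would hold many more columns, and with NumPy arrays the same split would be `arr[:, 0]` and `arr[:, 1:]`.

```python
def split_row(row):
    """Split one input row into (telemetry value, command flags)."""
    return row[0], row[1:]


def zero_fraction(rows):
    """Fraction of entries equal to zero across all rows."""
    total = sum(len(r) for r in rows)
    zeros = sum(1 for r in rows for v in r if v == 0)
    return zeros / total


# Hypothetical sample: column 0 is telemetry, the rest are command flags.
rows = [
    [-1.0, 0.0, 0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0, 0.0, 0.0],
    [-0.5, 0.0, 0.0, 0.0, 0.0],
]

telemetry = [split_row(r)[0] for r in rows]
commands = [split_row(r)[1] for r in rows]

print(telemetry)
print(zero_fraction(commands))
```

In this made-up sample nearly all command entries are zero while the telemetry column is not, which mirrors why the files look mostly empty at first glance.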
Q4. What is the time interval between the adjacent rows?
For the anomalies from the SMAP spacecraft, values are aggregated into 1 minute buckets. For MSL, the time bucket size is variable as data rates are inconsistent and no interpolation between values was performed to fill missing buckets. This is one factor in the poorer performance seen for MSL anomalies and something we will be addressing in future iterations.
Q5. I found that the anomaly for channel id `P-2` is described twice, and differently (in row 19 and row 53); however, there is no description of any anomaly for channel id `T-10`.
P-2 is the same channel with two anomalies occurring at different points in time, which is why you see two separate anomalies for that channel. These are entirely separate events and data that happen to occur for the same channel at different points in time. The full ranges of values are non-overlapping and the fact that the anomalous sequences have overlapping indices is coincidental.
T-10 didn’t have enough values to include so it was removed intentionally from the dataset and in the interest of time we didn’t rename all the channels.