Add benchmarking dataset with labelled anomalies for scoring performance of detector algorithms #12

Open
halvgaard opened this issue Jan 20, 2021 · 12 comments
Labels
help wanted Extra attention is needed

Comments

@halvgaard

halvgaard commented Jan 20, 2021

Do you know of any (open source) datasets at DHI that have labelled anomalies that we can use for testing? @ecomodeller @laurafroelich @akfDHI

halvgaard added the "help wanted" label Jan 21, 2021
@halvgaard
Author

@ecomodeller I found some datasets with labelled anomalies here: https://github.com/numenta/NAB
There are very few labels, but I guess that is to be expected with anomalies.
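With so few labelled points, plain accuracy says little, since a detector that never flags anything would still be right almost everywhere. A minimal sketch of how a detector could be scored against labels using precision, recall and F1 follows. The function name `score_detector` and the example labels are made up for illustration; they are not taken from the anomalydetection package or from NAB.

```python
def score_detector(true_labels, predicted_labels):
    """Compare predicted anomaly flags with ground-truth labels, point by point.

    Both arguments are equally long sequences of truthy/falsy values,
    where a truthy value means "this data point is an anomaly".
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # anomalies that were found
    fp = sum(1 for t, p in pairs if not t and p)    # false alarms
    fn = sum(1 for t, p in pairs if t and not p)    # anomalies that were missed

    # Guard against division by zero when nothing is flagged or labelled.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: two true anomalies, one found, one missed, one false alarm.
true_labels = [0, 0, 1, 0, 0, 0, 1, 0]
predicted_labels = [0, 1, 1, 0, 0, 0, 0, 0]

scores = score_detector(true_labels, predicted_labels)
print(scores)
```

Point-wise scoring is the simplest choice. Benchmarks such as NAB use their own windowed scoring scheme, which this sketch does not attempt to reproduce.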

@laurafroelich
Contributor

@Rhadhi Have you checked the license for that repo? It seems to be quite strict and copyleft, so as far as I can tell, if we want to use material from the numenta/NAB repo we need to change our license to the same one (AGPL-3.0). What do you think? If I am right, making our repo AGPL would imply that anyone using our repo would also have to license their work under AGPL... maybe not what we want?

@ecomodeller
Member

I don't know of any open datasets at DHI that we can use. We have to ask around and see if someone has an annotated dataset they are willing to share. There is a lot of data, but not much of it is labelled, and probably even less is public, unfortunately.

@halvgaard
Author

I will try to ask around on DHI Yammer for labelled datasets with anomalies. @ecomodeller Do you have labels for the DMI dataset we have in the repo? Otherwise I will try to label the obvious ones with the algorithms, e.g.
[image: anomaly 1]

@halvgaard
Author

halvgaard commented Jan 28, 2021

@laurafroelich @ecomodeller @akfDHI How do you like this message for posting on Yammer?

We are trying to establish best practices and automated ways of identifying anomalies/outliers in time series data.
Please let us know if you:

  • have a dataset that needs to be cleaned automatically
  • have algorithms for detecting outliers lying around in your head or in actual code
  • have a dataset, ideally publicly available, with labelled anomalies, i.e. an exact indication of which data points are actually anomalies.

Currently we are working on algorithms based on everything from simple range checks to machine learning models. Check out, and potentially contribute to, our open source anomaly detection Python package on DHI's GitHub here: https://github.com/DHI/anomalydetection

@laurafroelich
Contributor

Sounds good to me :)

@ecomodeller
Member

Can we make an interactive application to assist the labelling process?

  1. Upload data
  2. Automatic labelling of obvious outliers with a simple detector
  3. Manually add / remove labels by clicking on the chart
  4. Save the labelled time series in a reusable format, e.g. CSV
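The non-interactive parts of these steps could be sketched roughly as below: a simple range check proposes labels, one label is overridden by hand (standing in for a click on the chart), and the result is written as CSV. The function names, the column names `time`, `value`, `is_anomaly`, the thresholds and the data are all assumptions made for illustration, not the package's API or an agreed file format.

```python
import csv
import io


def label_out_of_range(values, low, high):
    """Return one flag per value: True where the value lies outside [low, high]."""
    return [not (low <= v <= high) for v in values]


def write_labelled_csv(timestamps, values, labels, fileobj):
    """Write the series with a 0/1 anomaly column to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["time", "value", "is_anomaly"])  # hypothetical column names
    for t, v, flag in zip(timestamps, values, labels):
        writer.writerow([t, v, int(flag)])


# Made-up example series with one obvious spike.
timestamps = ["2021-01-01T00:00", "2021-01-01T01:00",
              "2021-01-01T02:00", "2021-01-01T03:00"]
values = [1.2, 1.3, 25.0, 1.1]

# Step 2: automatic labelling with a simple range detector.
labels = label_out_of_range(values, low=0.0, high=10.0)

# Step 3: a manual correction, here marking the last point as anomalous.
labels[3] = True

# Step 4: save in a reusable format (an in-memory buffer stands in for a file).
buffer = io.StringIO()
write_labelled_csv(timestamps, values, labels, buffer)
print(buffer.getvalue())
```

Storing the label as a separate 0/1 column keeps the original values untouched, so the same file can later serve as ground truth when comparing detectors.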

@akfDHI
Collaborator

akfDHI commented Jan 29, 2021 via email

@halvgaard
Author

halvgaard commented Jan 29, 2021

@ecomodeller There is one open source tool here: https://trainset.geocene.com/

@halvgaard
Author

@ecomodeller Is this relevant: http://www.marineinsitu.eu/dashboard ?

@halvgaard
Author

We got a labelled dataset from an actual DHI case based on groundwater measurements. Unfortunately, the dataset cannot be published on GitHub.

@ecomodeller
Member

Please note that we now have an interactive application for labelling outliers and training a detector.
