Missing Dataset #20

Open
Suman-ksolves opened this issue Jul 7, 2022 · 5 comments

@Suman-ksolves

Suman-ksolves commented Jul 7, 2022

Dear developers,

The Experiments package contains some real applications of the algorithms.
However, I could not find the input files these experiment programs use in your GitHub repository, because almost all of the paths point to your own computer, like this:

C:/Users/gagli/Desktop/outf.txt
C:\Users\gagli\Desktop\gt.csv
C:\Users\gagli\Downloads\syntheticDatasets\syntheticDatasets\10Kprofiles.json
C:\Users\gagli\Downloads\syntheticDatasets\syntheticDatasets\10KIdDuplicates.json

Without these files, I have to guess their structure, so I cannot understand the programs correctly.

Could you please upload these files to GitHub? That would help me a lot. Thank you very much!

@Gaglia88
Contributor

Gaglia88 commented Jul 8, 2022

Hi,
you are right, the paths refer to my laptop. I am sorry about that.

You can find the datasets here: https://github.com/Gaglia88/sparker/tree/master/python/datasets

If you need a specific file let me know.
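
For anyone adapting the experiments, one way to avoid machine-specific paths is to resolve every dataset file against a configurable base directory. The sketch below is only an illustration: the class name `DatasetPaths` and the `datasets.dir` system property are hypothetical and not part of sparker, and the relative path in `main` simply mirrors the layout of the linked datasets folder.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class DatasetPaths {
    // Hypothetical property name, chosen for this sketch; sparker does not define it.
    static final String BASE_DIR_PROPERTY = "datasets.dir";

    /** Returns the directory given with -Ddatasets.dir=..., or "datasets" if unset. */
    static Path baseDir() {
        return Paths.get(System.getProperty(BASE_DIR_PROPERTY, "datasets"));
    }

    /** Resolves a dataset file relative to the base directory. */
    static Path resolve(String first, String... more) {
        return baseDir().resolve(Paths.get(first, more)).normalize();
    }

    public static void main(String[] args) {
        // Example: the ground truth of the cddb dirty dataset.
        System.out.println(resolve("dirty", "cddb", "cddb_groundtruth.csv"));
    }
}
```

Running it with `-Ddatasets.dir=<path to the cloned datasets folder>` prints the location of the file on the local machine, so the same code works on any computer.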

Regards,
Luca

@Suman-ksolves
Author

Hi,
Yes, I need the following files:
C:/Users/gagli/Desktop/outf.txt
C:\Users\gagli\Desktop\gt.csv

I need them to run EntityClusteringTests and Progressive. Could you please help me with this?
I would be very grateful.

Regards,
Suman

@Gaglia88
Contributor

Gaglia88 commented Jul 11, 2022

Hi,
the "gt.csv" file is the ground truth of the dataset used in the experiment. Each of the dirty datasets includes one (e.g. https://github.com/Gaglia88/sparker/blob/master/python/datasets/dirty/cddb/cddb_groundtruth.csv).

"outf.txt" is the output of an entity matching function applied to the pairs of profiles retained after meta-blocking.
I do not have that file anymore, but each line had the form entity1id,entity2id,score.

An example could be:
1,2,0.5
1,3,0.4
2,3,0.8
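
To make the format concrete, here is a minimal sketch of how such lines could be read back in Java. It assumes integer entity ids and a decimal score, as in the example above; the names `ScoredPairs` and `ScoredPair` are hypothetical and do not come from sparker or JedAI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ScoredPairs {
    /** One line of the file: two entity ids and a similarity score (hypothetical type). */
    static final class ScoredPair {
        final int entity1Id;
        final int entity2Id;
        final double score;

        ScoredPair(int entity1Id, int entity2Id, double score) {
            this.entity1Id = entity1Id;
            this.entity2Id = entity2Id;
            this.score = score;
        }
    }

    /** Parses a line such as "1,2,0.5"; whitespace around the fields is ignored. */
    static ScoredPair parseLine(String line) {
        String[] parts = line.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 comma-separated fields: " + line);
        }
        return new ScoredPair(
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim()),
                Double.parseDouble(parts[2].trim()));
    }

    /** Parses every non-blank line, keeping the original order. */
    static List<ScoredPair> parseAll(List<String> lines) {
        List<ScoredPair> pairs = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                pairs.add(parseLine(line));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The three example lines from this comment.
        for (ScoredPair p : parseAll(Arrays.asList("1,2,0.5", "1,3,0.4", "2,3,0.8"))) {
            System.out.println(p.entity1Id + " - " + p.entity2Id + " : " + p.score);
        }
    }
}
```

For a real file, the lines could be loaded with `java.nio.file.Files.readAllLines` and passed to `parseAll`.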

You can generate the "outf.txt" file from https://github.com/scify/JedAIToolkit/blob/master/src/test/java/org/scify/jedai/entityclustering/TestAllMethods.java by writing the content of the simPairs variable in the following way:

    // Requires java.io.BufferedWriter, java.io.FileWriter and java.io.IOException.
    // try-with-resources closes the writer even if a write fails.
    try (BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter("<output_path>"))) {
        PairIterator its = simPairs.getPairIterator();
        while (its.hasNext()) {
            Comparison cmp = its.next();
            // One line per retained pair: entity1id,entity2id,score
            String out = cmp.getEntityId1() + "," + cmp.getEntityId2() + "," + cmp.getUtilityMeasure() + "\n";
            bufferedWriter.write(out);
        }
    } catch (IOException e) {
        // Report the failure instead of silently ignoring it.
        e.printStackTrace();
    }

I hope this will help.

Regards,
Luca

@Suman-ksolves
Author

Hi,
Thanks, Luca, for the datasets. However, when I try to run the Progressive program from Experiments, it requires
textFile("C:/Users/gagli/Desktop/matches.txt") and csv("C:\Users\gagli\Desktop\gt.csv"). Without these files I cannot run the program. Could you please help me with this?

Regards,
Suman

@Gaglia88
Contributor

Hi,
sorry, I forgot to answer you.
As I wrote before, "gt.csv" is one of the ground truth files; which one depends on the dataset you are using for the experiment.
The dirty datasets in the repo I linked contain many of them, so you can see how they are structured.

Regarding the "matches.txt" file, I do not remember how it was created.
Anyway, to run the experiment, you can remove lines 212 to 228, and it should work.
