Missing Dataset #20

Suman-ksolves · 2022-07-07T03:57:00Z

Dear developers,

In the Experiments package, there are some real applications of algorithms.
However, I could not find the input files these programs use anywhere in your GitHub repository, because almost all of the paths point to your own computer, like this:

C:/Users/gagli/Desktop/outf.txt,
C:\Users\gagli\Desktop\gt.csv,
C:\Users\gagli\Downloads\syntheticDatasets\syntheticDatasets\10Kprofiles.json,
C:\Users\gagli\Downloads\syntheticDatasets\syntheticDatasets\10KIdDuplicates.json,
C:\Users\gagli\Desktop\gt.csv

Without these files, I have to guess at their structure, so I cannot fully understand the programs.

Could you please upload these files to GitHub? That would help me a lot. Thank you very much!

Gaglia88 · 2022-07-08T14:17:34Z

Hi,
you are right, the paths refer to my laptop, I am sorry about that.

You can find the datasets here https://github.com/Gaglia88/sparker/tree/master/python/datasets

If you need a specific file let me know.

Regards,
Luca

Suman-ksolves · 2022-07-11T04:53:51Z

Hi,
Yes, I need the following files:
C:/Users/gagli/Desktop/outf.txt,
C:\Users\gagli\Desktop\gt.csv,
C:\Users\gagli\Desktop\gt.csv,

I need them to run EntityClusteringTests and Progressive. Could you please help me with this?
I would be very grateful.

Regards,
Suman

Gaglia88 · 2022-07-11T08:38:00Z

Hi,
the "gt.csv" file is the ground truth for the dataset being used; you can find one in any of the dirty datasets (e.g. https://github.com/Gaglia88/sparker/blob/master/python/datasets/dirty/cddb/cddb_groundtruth.csv).

Regarding "outf.txt", it is the output of an entity matching function applied to the pairs of profiles retained after meta-blocking.
I do not have that file anymore, but each line had the form entity1id, entity2id, score.

An example could be:
1,2,0.5
1,3,0.4
2,3,0.8
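
For reference, here is a minimal sketch of how lines in this form could be read back into memory. It is not code from sparker or JedAI: the class `SimilarityPairReader`, the holder `ScoredPair` and the methods `parseLine` and `parseAll` are hypothetical names chosen for illustration. It assumes integer entity ids and exactly three comma-separated fields per line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of sparker or JedAI.
class SimilarityPairReader {

    // One line of the file: entity1id,entity2id,score (ids assumed to be integers).
    public static final class ScoredPair {
        public final int entityId1;
        public final int entityId2;
        public final double score;

        public ScoredPair(int entityId1, int entityId2, double score) {
            this.entityId1 = entityId1;
            this.entityId2 = entityId2;
            this.score = score;
        }

        @Override
        public String toString() {
            return entityId1 + "," + entityId2 + "," + score;
        }
    }

    // Parses a single "entity1id,entity2id,score" line; spaces around fields are tolerated.
    public static ScoredPair parseLine(String line) {
        String[] parts = line.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 comma-separated fields: " + line);
        }
        return new ScoredPair(
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim()),
                Double.parseDouble(parts[2].trim()));
    }

    // Parses every non-empty line, keeping the input order.
    public static List<ScoredPair> parseAll(List<String> lines) {
        List<ScoredPair> pairs = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines, e.g. a trailing newline at the end of the file
            }
            pairs.add(parseLine(line));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The three example lines from above.
        List<String> lines = Arrays.asList("1,2,0.5", "1,3,0.4", "2,3,0.8");
        for (ScoredPair pair : parseAll(lines)) {
            System.out.println(pair);
        }
    }
}
```

To read a real file, the lines could be loaded with `java.nio.file.Files.readAllLines` and passed to `parseAll`.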

If you look at this file, https://github.com/scify/JedAIToolkit/blob/master/src/test/java/org/scify/jedai/entityclustering/TestAllMethods.java, you can generate "outf.txt" by writing out the content of the simPairs variable in the following way:

    // Needs java.io.BufferedWriter, java.io.FileWriter and java.io.IOException.
    // try-with-resources closes the writer even if a write fails.
    try (BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter("<output_path>"))) {
        PairIterator its = simPairs.getPairIterator();
        while (its.hasNext()) {
            Comparison cmp = its.next();
            // One line per pair: entity1id,entity2id,score
            String out = cmp.getEntityId1() + "," + cmp.getEntityId2() + "," + cmp.getUtilityMeasure() + "\n";
            bufferedWriter.write(out);
        }
    } catch (IOException e) {
        // Report the failure instead of silently ignoring it.
        e.printStackTrace();
    }

I hope this will help.

Regards,
Luca

Suman-ksolves · 2022-07-12T08:03:33Z

Hi,
Thanks, Luca, for the datasets. However, when I try to run the Progressive program from Experiments, it requires
textFile("C:/Users/gagli/Desktop/matches.txt") and csv("C:\Users\gagli\Desktop\gt.csv"). Without these files I am not able to run the program. Could you please help me with this?

Regards,
Suman

Gaglia88 · 2022-08-11T07:29:32Z

Hi,
sorry, I forgot to answer you.
As I wrote before, "gt.csv" is one of the ground truth files; which one depends on the dataset you are using for the experiment.
If you look at the dirty datasets in the repo I linked, you can find many of them and see how they are structured.

Regarding the "matches.txt" file, I do not remember how it was created.
Anyway, to run the experiment, you can remove lines 212 to 228, and it should work.
