Problems running low-dimensional synthetic tables #21

yoid2000 · 2024-01-06T12:47:20Z

I am trying to run sdnist with low-dimensional synthetic tables (synthetic tables with very few columns). I hit the following error with all 1-column tables, and with some 2-column tables. Examples of synthetic tables that fail can be found here:

https://mpi-sws.org/~francis/texas_AGEP.csv

https://mpi-sws.org/~francis/texas_AGEP.DEAR.csv

An example of a synthetic table that does not fail is:

https://mpi-sws.org/~francis/texas_AGEP.DEAR.PUMA.csv

I'd be willing to try to debug this myself, but I'm hoping one of you can quickly understand and fix the problem. In any case, I'm not really sure what this code in percentile_rank_synthetic() is trying to do.

Note also that I'm running this off of a version of SDNist to which I've made some modifications. However, this error kicks in before any of my code is reached, and it works just fine on some synthetic tables but not others, so I'm guessing that the problem isn't caused by me. (The changes are for the purpose of getting SDNist to work in settings where multiple synthetic tables are produced, not just one.)

The error message is this:

| SDNist: Deidentified Data Report Tool
|-- Creating Evaluation Report for Deidentified Data at path: deids\texas\texas_AGEP.csv
|---- Loading Datasets
C:\paul\GitHub\SDNist-cross\sdnist\report\dataset\binning.py:124: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  for dbin, g in d.groupby(by='binned_density'):
|------ Features (1): ['AGEP']
|------ Deidentified Data Records Count: 9277
|------ Target Data Records Count: 9276
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\paul\GitHub\SDNist-cross\sdnist\report\__main__.py", line 160, in <module>
    run(**input_cnf)
  File "C:\paul\GitHub\SDNist-cross\sdnist\report\__main__.py", line 41, in run
    dataset = Dataset(synthetic_filepath, log, dataset_name, data_root, download)
  File "<string>", line 9, in __init__
  File "C:\paul\GitHub\SDNist-cross\sdnist\report\dataset\__init__.py", line 214, in __post_init__
    self.d_synthetic_data = percentile_rank_synthetic(self.c_synthetic_data,
  File "C:\paul\GitHub\SDNist-cross\sdnist\report\dataset\binning.py", line 53, in percentile_rank_synthetic
    s.loc[nna_mask, f] = final_st
  File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py", line 885, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py", line 1895, in _setitem_with_indexer
    self._setitem_single_block(indexer, value, name)
  File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py", line 2132, in _setitem_single_block
    value = self._align_frame(indexer, value)._values
  File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py", line 2417, in _align_frame
    raise ValueError("Incompatible indexer with DataFrame")
ValueError: Incompatible indexer with DataFrame

yoid2000 · 2024-02-01T14:47:36Z

I found a fix to the problem.

The crashing line of code is this:

SDNist/sdnist/report/dataset/binning.py, line 53 in b06d361:

s.loc[nna_mask, f] = final_st

There is some kind of shape issue that I don't understand, but I found that if I change that line to this:

s.loc[nna_mask, f] = final_st[f]

then the problem disappears. I think the fix works because final_st[f] selects exactly one column of final_st (as a Series), which matches the single column being assigned on the left-hand side, so the shape problem goes away.
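For anyone who wants to try the pattern outside SDNist, here is a minimal sketch on toy data. The names s, nna_mask, final_st and the column AGEP mirror the traceback, but the values and the helper assign_column are made up for illustration and are not SDNist code. The first assignment is wrapped in try/except because whether it raises may depend on the pandas version and on how the frame is laid out internally.

```python
import pandas as pd


def assign_column(target: pd.DataFrame, mask: pd.Series, column: str, values) -> None:
    """Hypothetical helper: write `values` into target.loc[mask, column].

    Accepts either a Series or a DataFrame that contains `column`.
    """
    if isinstance(values, pd.DataFrame):
        # Pick the column explicitly so the right-hand side is a 1-D Series.
        values = values[column]
    target.loc[mask, column] = values


# Toy one-column table with one missing value (illustrative data only).
s = pd.DataFrame({"AGEP": [10.0, float("nan"), 30.0, 40.0]})
nna_mask = s["AGEP"].notna()

# Replacement values for the non-missing rows, held in a one-column DataFrame.
final_st = pd.DataFrame({"AGEP": [1.0, 3.0, 4.0]}, index=s.index[nna_mask])

# Assigning the whole DataFrame to a single column label is the pattern from
# the traceback. It may raise ValueError, so it is tried on a copy.
try:
    attempt = s.copy()
    attempt.loc[nna_mask, "AGEP"] = final_st
    print("DataFrame assignment did not raise on this pandas version")
except ValueError as exc:
    print(f"DataFrame assignment raised: {exc}")

# Assigning the selected column (a Series) aligns on the index.
assign_column(s, nna_mask, "AGEP", final_st)
print(s)
```

The helper only narrows the right-hand side to one column before assigning; it does not change which rows are written, since the Series is still aligned to the masked rows by index.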

kbtriangulum · 2024-02-02T13:41:31Z

Some metrics don't work well with very low-dimensional data, so we are working on a fix that will probably skip the metrics that are not meaningful with few features. We expect to release it soon.
Thanks for looking into the issue.
