Problems running low-dimensional synthetic tables #21

Open
yoid2000 opened this issue Jan 6, 2024 · 2 comments

yoid2000 commented Jan 6, 2024

I am trying to run SDNist on low-dimensional synthetic tables (synthetic tables with very few columns). I hit the error shown below with all 1-column tables and with some 2-column tables. Examples of synthetic tables that fail can be found here:

https://mpi-sws.org/~francis/texas_AGEP.csv

https://mpi-sws.org/~francis/texas_AGEP.DEAR.csv

An example of a synthetic table that does not fail is:

https://mpi-sws.org/~francis/texas_AGEP.DEAR.PUMA.csv

I'd be willing to try to debug this myself, but I'm hoping one of you can quickly understand and fix the problem. In any case, I'm not really sure what the code in percentile_rank_synthetic() is trying to do.

Note also that I'm running this off of a version of SDNist to which I've made some modifications. However, this error kicks in before any of my code is reached, and it works just fine on some synthetic tables but not others, so I'm guessing that the problem isn't caused by me. (The changes are for the purpose of getting SDNist to work in settings where multiple synthetic tables are produced, not just one.)

The error message is this:

| SDNist: Deidentified Data Report Tool
|-- Creating Evaluation Report for Deidentified Data at path: deids\texas\texas_AGEP.csv
|---- Loading Datasets
C:\paul\GitHub\SDNist-cross\sdnist\report\dataset\binning.py:124: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  for dbin, g in d.groupby(by='binned_density'):
|------ Features (1): ['AGEP']
|------ Deidentified Data Records Count: 9277
|------ Target Data Records Count: 9276
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\paul\GitHub\SDNist-cross\sdnist\report\__main__.py", line 160, in <module>
    run(**input_cnf)
  File "C:\paul\GitHub\SDNist-cross\sdnist\report\__main__.py", line 41, in run
    dataset = Dataset(synthetic_filepath, log, dataset_name, data_root, download)
  File "<string>", line 9, in __init__
  File "C:\paul\GitHub\SDNist-cross\sdnist\report\dataset\__init__.py", line 214, in __post_init__
    self.d_synthetic_data = percentile_rank_synthetic(self.c_synthetic_data,
  File "C:\paul\GitHub\SDNist-cross\sdnist\report\dataset\binning.py", line 53, in percentile_rank_synthetic
    s.loc[nna_mask, f] = final_st
  File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py", line 885, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py", line 1895, in _setitem_with_indexer
    self._setitem_single_block(indexer, value, name)
  File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py", line 2132, in _setitem_single_block
    value = self._align_frame(indexer, value)._values
  File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py", line 2417, in _align_frame
    raise ValueError("Incompatible indexer with DataFrame")
ValueError: Incompatible indexer with DataFrame

yoid2000 commented Feb 1, 2024

I found a fix to the problem.

The crashing line of code is this:

s.loc[nna_mask, f] = final_st

There is some kind of shape issue that I don't understand, but I found that if I change that line to this:

s.loc[nna_mask, f] = final_st[f]

then the problem disappears. I think the fix works because final_st[f] selects just the one column f from final_st, so the right-hand side matches the single column being assigned on the left, and the shape problem goes away.
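
For anyone who wants to try the fixed form of the assignment outside SDNist, here is a minimal sketch. The data values and the way `nna_mask` and `final_st` are built are assumptions made up for illustration; only the names `s`, `nna_mask`, `f` and `final_st` come from the line in binning.py. The traceback shows the DataFrame right-hand side going through pandas' `_align_frame`, whereas `final_st[f]` is a Series when `final_st` is a DataFrame with a uniquely named column `f`, and `.loc` aligns a Series on the row index.

```python
import numpy as np
import pandas as pd

f = "AGEP"  # feature name taken from the report above

# Assumed stand-in for the synthetic table: one numeric column, one missing value.
s = pd.DataFrame({f: [30.0, np.nan, 45.0, 61.0]})
nna_mask = s[f].notna()  # True on rows that hold a real value

# Assumed stand-in for final_st: a one-column DataFrame on the non-missing rows.
final_st = pd.DataFrame(
    {f: [10.0, 50.0, 90.0]},
    index=s.index[nna_mask.to_numpy()],
)

# Selecting the column gives a Series, which .loc aligns on the row index.
s.loc[nna_mask, f] = final_st[f]

print(s)
```

The row with the missing value is left untouched, and the other rows take the values from `final_st[f]` that share their index labels.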

@kbtriangulum
Collaborator

Some metrics don't work well with very low-dimensional data. We are working on a fix that will probably skip the metrics that are not meaningful with few features, and we are in the process of releasing it.
Thanks for looking into the issue.
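
As a rough illustration of what "skip the metrics that are not meaningful with few features" could look like, here is a small sketch. It is not SDNist code: `MetricSpec`, `split_metrics`, the metric names and the minimum-feature thresholds are all hypothetical, chosen only to show the idea of gating each metric on the number of columns in the table.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MetricSpec:
    name: str          # hypothetical metric label, not an SDNist identifier
    min_features: int  # assumed fewest columns for the metric to make sense


def split_metrics(specs: List[MetricSpec],
                  features: List[str]) -> Tuple[List[str], List[str]]:
    """Return (metrics to run, metrics to skip) for the given feature list."""
    n = len(features)
    to_run = [m.name for m in specs if n >= m.min_features]
    to_skip = [m.name for m in specs if n < m.min_features]
    return to_run, to_skip


# Hypothetical metric list with assumed thresholds.
specs = [
    MetricSpec("single_column_summary", 1),
    MetricSpec("pairwise_correlation", 2),
    MetricSpec("three_way_marginal", 3),
]

to_run, to_skip = split_metrics(specs, ["AGEP"])
print("run:", to_run)
print("skip:", to_skip)
```

With a 1-column table like texas_AGEP.csv, only the metric with a threshold of 1 would run in this sketch, and the report could list the skipped ones instead of crashing.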
