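I am trying to run sdnist with low-dimensional synthetic tables (synthetic tables with very few columns). I hit the following error with all 1-column tables, and with some 2-column tables. Examples of synthetic tables that fail can be found here:
https://mpi-sws.org/~francis/texas_AGEP.csv
https://mpi-sws.org/~francis/texas_AGEP.DEAR.csv
An example of a synthetic table that does not fail is:
https://mpi-sws.org/~francis/texas_AGEP.DEAR.PUMA.csv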
I'd be willing to try to debug this myself, but I'm hoping one of you can quickly understand and fix the problem. I'm not really sure what the code in percentile_rank_synthetic() is trying to do in any case.
Note also that I'm running this off of a version of SDNist to which I've made some modifications. However, this error kicks in before any of my code is reached, and it works just fine on some synthetic tables but not others, so I'm guessing that the problem isn't caused by me. (The changes are for the purpose of getting SDNist to work in settings where multiple synthetic tables are produced, not just one.)
The error message is this:
| SDNist: Deidentified Data Report Tool
|-- Creating Evaluation Report for Deidentified Data at path: deids\texas\texas_AGEP.csv
|---- Loading Datasets
C:\paul\GitHub\SDNist-cross\sdnist\report\dataset\binning.py:124: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  for dbin, g in d.groupby(by='binned_density'):
|------ Features (1): ['AGEP']
|------ Deidentified Data Records Count: 9277
|------ Target Data Records Count: 9276
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\paul\GitHub\SDNist-cross\sdnist\report\__main__.py", line 160, in <module>
    run(**input_cnf)
  File "C:\paul\GitHub\SDNist-cross\sdnist\report\__main__.py", line 41, in run
    dataset = Dataset(synthetic_filepath, log, dataset_name, data_root, download)
  File "<string>", line 9, in __init__
  File "C:\paul\GitHub\SDNist-cross\sdnist\report\dataset\__init__.py", line 214, in __post_init__
    self.d_synthetic_data = percentile_rank_synthetic(self.c_synthetic_data,
  File "C:\paul\GitHub\SDNist-cross\sdnist\report\dataset\binning.py", line 53, in percentile_rank_synthetic
    s.loc[nna_mask, f] = final_st
  File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py", line 885, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py", line 1895, in _setitem_with_indexer
    self._setitem_single_block(indexer, value, name)
  File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py", line 2132, in _setitem_single_block
    value = self._align_frame(indexer, value)._values
  File "C:\Users\local_francis\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexing.py", line 2417, in _align_frame
    raise ValueError("Incompatible indexer with DataFrame")
ValueError: Incompatible indexer with DataFrame
There is some kind of shape issue that I don't understand, but I found that if I change that line of code to this:
s.loc[nna_mask, f] = final_st[f]
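then the problem disappears. I think the fix works because final_st[f] selects the single column as a Series rather than passing the whole DataFrame (the traceback shows pandas failing in _align_frame, which it only reaches for a DataFrame value), so the shape problem goes away.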
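For anyone who wants to poke at this outside SDNist, here is a minimal sketch of the assignment pattern from the traceback. The toy data, the -1 sentinel, and the helper try_assign are all made up for illustration; only the names s, f, nna_mask and final_st come from the traceback and the workaround above. The traceback goes through _setitem_single_block, which may be why only tables whose columns all share one dtype fail, but I haven't confirmed that against the real tables. Whether the DataFrame form raises may depend on the pandas version, so the sketch reports the outcome rather than assuming it.

```python
import pandas as pd

# Toy stand-in for the synthetic table: one integer column, so pandas
# keeps the whole frame in a single block.
s = pd.DataFrame({"AGEP": [10, -1, 35]})
f = "AGEP"
nna_mask = s[f] >= 0  # hypothetical "not missing" mask (-1 marks missing)

# Toy stand-in for final_st: a one-column DataFrame of bin ranks,
# indexed like the masked rows of s.
final_st = pd.DataFrame({f: [1, 3]}, index=s.index[nna_mask])


def try_assign(frame, value):
    """Assign value into the masked rows of column f on a copy of frame.

    Returns (result, None) on success or (None, message) on ValueError.
    """
    out = frame.copy()
    try:
        out.loc[nna_mask, f] = value
    except ValueError as err:
        return None, str(err)
    return out, None


# Original form: a DataFrame on the right-hand side.
_, frame_err = try_assign(s, final_st)
print("DataFrame value:", frame_err or "no error on this pandas version")

# Workaround from this thread: select the column, which yields a Series
# that pandas aligns on the index of the masked rows.
fixed, series_err = try_assign(s, final_st[f])
print(fixed)
```

The Series form is the one I'd rely on, since assigning a Series into a boolean-masked single column is an ordinary, index-aligned pandas operation.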
Some metrics don't work well with very low-dimensional data, so we are working on a fix that will probably skip the metrics that are not meaningful with few features. We are in the process of releasing the fix.
Thanks for looking into the issue.