
Cleanup dataframes #1360

Merged

Conversation

nkanazawa1989 (Collaborator) commented Jan 18, 2024

Summary

This PR updates the implementation of ScatterTable and AnalysisResultTable based on comments from @itoko.

Details and comments

The current pattern heavily uses inheritance, Table(DataFrame, MixIn), but this causes several problems: the Qiskit Experiments class directly depends on a third-party library, resulting in Sphinx directive mismatches and poor robustness of the API. Instead of inheritance, these classes are refactored with composition and delegation, namely

class Table:
    def __init__(self):
        self._data = DataFrame(...)

This pattern is also common in other software libraries that use dataframes. Since this PR removes unreleased public classes, it should be merged before the release.

Although this updates many files, the changes are just delegation of data-handling logic to the class itself, which simplifies the implementation of classes that operate on the container objects. The new pattern also allows stricter dtype management with the dataframe.
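As a rough illustration of the composition-and-delegation pattern described above (class, column, and method names here are invented for the sketch, not the actual qiskit-experiments API), the table can own a DataFrame and expose only the operations it wants to support, which also lets it enforce its own dtypes:

```python
import pandas as pd

# Composition instead of inheritance: the table owns a DataFrame and
# delegates to it, keeping dtype policy under the class's control.
class Table:
    COLUMNS = ["xval", "yval"]
    DTYPES = {"xval": "float64", "yval": "float64"}

    def __init__(self):
        self._data = pd.DataFrame(columns=self.COLUMNS).astype(self.DTYPES)

    def add_row(self, xval, yval):
        # delegation: mutation goes through the wrapped frame
        self._data.loc[len(self._data)] = [xval, yval]

    @property
    def dataframe(self):
        # hand out a copy so callers cannot bypass the dtype policy
        return self._data.copy()

t = Table()
t.add_row(1.0, 0.5)
print(t.dataframe.dtypes["xval"])  # float64
```

Because the class no longer inherits from DataFrame, its public surface is decoupled from pandas release changes.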

nkanazawa1989 added this to the Release 0.6 milestone Jan 18, 2024
wshanks (Collaborator) commented Jan 22, 2024

This is good timing, because the ScatterTable inheritance model just broke with the release of pandas 2.2. See, for example, the test failures on #1363.

nkanazawa1989 marked this pull request as ready for review January 31, 2024 18:22
@@ -181,6 +181,7 @@ def __eq__(self, other):
self.neg_coef_o1 == other.neg_coef_o1,
self.neg_coef_o2 == other.neg_coef_o2,
self.neg_coef_o3 == other.neg_coef_o3,
self.offset == other.offset,
Collaborator Author:

This is a followup of #1236. It could be split into a separate PR.

Collaborator:

One of the nice things about dataclasses 🙂

coruscating added a commit to coruscating/qiskit-experiments that referenced this pull request Feb 1, 2024
coruscating (Collaborator) left a comment:

auto_save is not working right now. I get errors like {"errors":["experiment_uuid is required","type must be a string","chisq is not a decimal"]} after this:

expdata = exp.run(backend)
expdata.auto_save = True

Then if I run save() again on its own, everything seems to save correctly.

return instance

@property
def dataframe(self):
Collaborator:

Can you add __repr__() so that the dataframe is shown when users try to print the table directly? I think that's a bit more intuitive than accessing table.dataframe.

Collaborator Author:

One concern with returning the HTML representation is that a user might believe the object is a bare pandas DataFrame and blame it for missing methods that pandas provides. Of course, the HTML representation is very helpful, and I'm not against adding it.

Collaborator:

That's true, maybe there can be a header before the dataframe content to distinguish the table from a dataframe.
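A minimal sketch of that idea (the class name and header format are hypothetical, not the implementation in this PR): delegate __repr__ to the wrapped dataframe but prefix it with the class name, so the printed object is not mistaken for a bare pandas DataFrame:

```python
import pandas as pd

# Hypothetical sketch: repr shows the dataframe content with a
# distinguishing header line naming the wrapper class.
class ScatterTableSketch:
    def __init__(self, data):
        self._data = pd.DataFrame(data)

    def __repr__(self):
        return f"{type(self).__name__}:\n{self._data!r}"

print(ScatterTableSketch({"xval": [1.0], "yval": [0.5]}))
```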

nkanazawa1989 (Collaborator, Author) commented Feb 2, 2024

Thanks @coruscating for catching the bug.

The experiment ID is not recorded in the analysis result table (because the experiment name and ID are somewhat redundant, and only the human-friendly name is stored), while the service AnalysisResult object created from the table requires the ID to look up where to save the data. Likely this was introduced in the previous PR. Fixed in 901bb28

This was not correct. I had just forgotten to specify columns="all" when filtering the analysis result table. By default it returns a subset of columns that doesn't contain the experiment_id. The correct fix is 8dc6c4f (force-pushed)
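The bug class described here can be sketched generically (the column names and the filtered_view helper are invented for illustration, not the AnalysisResultTable API): a default view that hides bookkeeping columns silently drops experiment_id unless columns="all" is requested:

```python
import pandas as pd

# Invented helper mimicking the described behavior: the default view
# hides bookkeeping columns such as experiment_id.
DEFAULT_COLUMNS = ["name", "value", "chisq"]

def filtered_view(df, columns="default"):
    if columns == "all":
        return df
    return df[DEFAULT_COLUMNS]

df = pd.DataFrame(
    {"name": ["T1"], "value": [1.2e-4], "chisq": [0.9], "experiment_id": ["abc123"]}
)
print(list(filtered_view(df).columns))  # ['name', 'value', 'chisq']
print("experiment_id" in filtered_view(df, columns="all").columns)  # True
```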

itoko (Contributor) left a comment:

@nkanazawa1989, thank you for your hard work on this cleanup. I've mainly reviewed the public interface of the ScatterTable class newly introduced during this series of refactoring. Overall, it looks good to me on that point.

Just one major comment from me. To convey the following points clearly to users, I think more introductory explanation is necessary at the beginning of the ScatterTable class (and/or in docs/tutorials/curve_analysis.rst). For example, we may need to explain the important tuple (kind, category, analysis) as a unique key; we may need to mention what values are allowed for category and what each of them represents; and so on.

I now understand the ScatterTable object to be a single source of truth for the data used in all plots. I think that's good for users who work with complex Analysis classes. On the other hand, it may contain multiple data sets. That requires users to fully understand the three concepts kind, category, and analysis, because this triplet must be a unique key to retrieve one sequence of data points -- (x, y, y_err) arrays -- from the multiple data sets. I think that's why the versatile filter method and the get_? methods all take these as options. I personally like the get_? methods, since most experiments are simple and a ScatterTable object is likely to have a single key (kind, category, analysis) for all rows. They avoid having to call filter functions everywhere.

I've left minor comments inline; none of them should prevent merging this PR.

qiskit_experiments/curve_analysis/scatter_table.py
nkanazawa1989 (Collaborator, Author):

Thanks @itoko, I added more documentation in 346d23a. I added it directly to the ScatterTable class docstring for visibility, and added a cross-reference link to the tutorial.

most of experiments would be simple and ScatterTable object is likely to have a single key (kind, category and analysis) for all rows

This is not true, because even when the model and analysis are trivial, most analysis subclasses produce two categories, raw and formatted; e.g. in RB, raw corresponds to P1 of each randomization seed and formatted is P1 of the sample average over them. I hesitate to add dedicated shortcut methods like raw_xvals and formatted_xvals, because the naming depends on the analysis; e.g. StarkRamseyXYAmpScanAnalysis has four categories: "raw", "ramsey_xy", "freq", "fitted".

coruscating (Collaborator) left a comment:

Thanks @nkanazawa1989, I like the lazy add-rows approach, and overall the code looks fine to me. Since it's difficult to test the ScatterTable output from experiments in this PR, perhaps we can merge this first and handle any potential followups from adding unit tests in #1342 in another PR, if needed.

qiskit_experiments/curve_analysis/scatter_table.py
docs/tutorials/curve_analysis.rst
wshanks (Collaborator) left a comment:

I read through all the changes and left comments. There are many, but I think they are easy to address, and most are just related to documentation.

The thing I am most uncertain about with ScatterTable is how much it is a user-facing class versus an internal class. For a user just running experiments, I think it is not exposed, but for a user writing CurveAnalysis subclasses it is? So does it only make sense to document it in the context of curve analysis? There is a little documentation now but not too much (the table is shown in the curve analysis workflow section, and the columns are described in the ScatterTable docstring). One concern I have is making the roles of name, data_uid, category, and model (not a column, but a term used in the same context) clear.

docs/tutorials/curve_analysis.rst
qiskit_experiments/curve_analysis/scatter_table.py
if key in self._data.index:
return [key]
# This key is name of entry
loc = self._data["name"] == key
Collaborator:

So the key can match an ID or a name. That feels a little overloaded though they shouldn't collide.

Collaborator Author:

Yeah unfortunately this is existing behavior :(

yerr_arr_fit = unp.std_devs(uval_arr_fit)
else:
yerr_arr_fit = np.zeros_like(xval_arr_fit)
for xval, yval, yerr in zip(xval_arr_fit, yval_arr_fit, yerr_arr_fit):
Collaborator:

It still surprises me that it is better to iterate over numpy arrays point by point and append the values to lists to add to a new dataframe, rather than just adding the numpy arrays to a new dataframe directly.

Collaborator Author:

Handling of the empty column is expensive because it requires careful handling of missing values. Without this, the shots column may be accidentally typecast to float, because numpy doesn't support nullable integers. This means we would first need to create a 2D object-dtype ndarray, populate the values, and then convert it into a dataframe. Since the current _lazy_add_rows buffer assumes a row-wise data list, arrays need to be converted into this form internally.
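The dtype pitfall mentioned here is easy to reproduce with plain pandas: numpy has no nullable integer, so a missing value silently coerces an integer column to float unless pandas' extension dtype is used:

```python
import numpy as np
import pandas as pd

# numpy has no nullable integer, so a missing value silently
# coerces an integer column (e.g. "shots") to float:
df_naive = pd.DataFrame({"shots": [100, 200, np.nan]})
print(df_naive["shots"].dtype)  # float64

# pandas' extension dtype keeps integers alongside missing values:
df_strict = pd.DataFrame({"shots": pd.array([100, 200, None], dtype="Int64")})
print(df_strict["shots"].dtype)  # Int64
```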

qiskit_experiments/framework/experiment_data.py
def x(self) -> np.ndarray:
    """X values."""
    # For backward compatibility with CurveData.x
    return self._data.xval.to_numpy(dtype=float, na_value=np.nan)
Collaborator:

Is this pretty efficient? We might look at if it is worth caching these properties where it would be tempting to use them multiple times in the same function.

nkanazawa1989 (Collaborator, Author) commented Feb 6, 2024:

IIRC, using lru_cache on a class method may cause a memory leak, and I guess I would need to implement custom logic (i.e. a cache that clears the data when _lazy_add_rows changes).

According to the performance analysis of T1Analysis, the most time-consuming logic after 03aac67 is still iter_groups, and the most expensive part is the data extraction from the dataframe. This means implementing the cache wouldn't drastically improve performance.

We may want to replace pandas with polars in the next release.
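The custom-cache idea mentioned above might look like this minimal, hypothetical sketch (CachedTable is invented for illustration): the cache dictionary is simply cleared whenever the underlying rows change, avoiding lru_cache on instance methods and the memory-retention issue it can bring:

```python
# Hypothetical sketch of a cache invalidated on mutation, as an
# alternative to lru_cache on instance methods.
class CachedTable:
    def __init__(self):
        self._rows = []
        self._cache = {}

    def add_row(self, xval, yval):
        self._rows.append((xval, yval))
        self._cache.clear()  # any change invalidates cached column arrays

    @property
    def x(self):
        if "x" not in self._cache:
            self._cache["x"] = [r[0] for r in self._rows]
        return self._cache["x"]

t = CachedTable()
t.add_row(1.0, 0.5)
print(t.x)  # [1.0]
t.add_row(2.0, 0.25)
print(t.x)  # [1.0, 2.0] -- cache was rebuilt after the mutation
```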

Collaborator:

Yes, I was going to comment on iter_groups that we should look at using polars there. Even just casting to polars for that one method and back to pandas might help (I mean, without refactoring the rest of the code to use polars).

Collaborator Author:

Another option would be to stop over-engineering. We currently allow users to run an experiment on existing experiment data to get a more statistically confident result, but I don't think this is a heavily used feature, and we could remove it. In principle we run a very expensive operation for something of little use: if we have 100 xvals, we need to iterate over them and create a new table of size 1 for each to compute the average... Even if we switch to polars, there would still be the overhead of object creation.

Collaborator:

Yes, I agree that that feature has never seemed that useful to me. I think we could implement some support for post-processing instead, like a combine_results function that could take multiple experiment results, warn if the options are not all the same, and then run the analysis with the combined job data.

We didn't have this ability to extend the data of a completed experiment in the code we used before qiskit-experiments, and I think it was a feature request for something to be included when starting over from scratch. I like the experiment being less mutable, and extending the experiment being like "copy-on-write", where a new experiment ID is generated for the combination of multiple experiments if you need that.
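A hedged sketch of the combine_results idea discussed above (ExperimentData here is a minimal stand-in, not the qiskit-experiments class): combine job data from repeated runs into a new record with a fresh experiment ID, warning when options differ:

```python
import uuid
import warnings
from dataclasses import dataclass, field

# Minimal stand-in for illustration only, not the qiskit-experiments class.
@dataclass
class ExperimentData:
    options: dict
    data: list = field(default_factory=list)
    experiment_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def combine_results(*results):
    """Merge job data from repeated runs into a new record ("copy-on-write")."""
    first = results[0]
    for other in results[1:]:
        if other.options != first.options:
            warnings.warn("experiment options differ; combining anyway")
    # the combination gets a fresh experiment ID rather than mutating `first`
    return ExperimentData(
        options=first.options,
        data=[d for r in results for d in r.data],
    )

a = ExperimentData(options={"shots": 1024}, data=[0.1, 0.2])
b = ExperimentData(options={"shots": 1024}, data=[0.3])
c = combine_results(a, b)
print(c.data)  # [0.1, 0.2, 0.3]
```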

Collaborator:

I still think polars could possibly help here, though, because I expect its groupby to be more efficient than iterating through values in Python. With the pyarrow backend (I think not yet the default in pandas), it should be possible for pandas and polars to work on the same in-memory representation of the data, so it should be efficient to go back and forth.

Collaborator Author:

I like your suggestion. IIRC, when we started this project we had the QV experiment in mind: we needed to repeat until the standard error converged to within 1 sigma. The rest of the experiments don't require such a mechanism. Combining results is something useful :)

Collaborator Author:

Sorry, this feature was already removed by #463. This means we can now just drop this formatter and raise a warning if we encounter duplicated xvals -- but likely in the next release.

Collaborator:

Okay, I opened #1394 to try to capture what we said here but maybe its description is out of date.

wshanks (Collaborator) commented Feb 5, 2024

Also, I read all the code changes, but one thing that is hard with a PR like this is reviewing all the code that didn't change. I tried checking the documentation and release notes, and I think they still make sense with the changes that were made.

and data points can also take metadata triplet (`data_uid`, `category`, `analysis`)
to distinguish the subset.

* The `data_uid` is an integer key representing the class of the data.
itoko (Contributor) commented Feb 6, 2024:

Thank you for adding this info in the doc, @nkanazawa1989. It helps me a lot to understand the concepts needed as a user of the ScatterTable class. I now understand that data_uid identifies a "series" of data to fit a model, and that name is another, human-friendly identifier that likewise indicates a series of data to fit a model. I've also found that the term series is already used in the Curve Analysis tutorial. Given that, series_id and series_name sound better to me. What do you think? @nkanazawa1989 @wshanks

Contributor:

series_index instead of series_id? Looking at the example in the tutorial, data_uid can take a common value across different category values. In that sense it's not an "ID" (to me, ID sounds like a single unique identifier); it's more like an "index" defined per (category, analysis) data set.

Collaborator:

I like this suggestion. Perhaps series_idx to be shorter?

Collaborator:

One case to consider regarding the naming is StarkRamseyXYAmpScanAnalysis. For that experiment, X and Y vs. amp data are recorded. Then the X and Y data are converted into phase vs. amp data, which is fit. So there are two raw series that become the formatted series that is fit. Maybe series_ still makes sense, since one series is passed to the model. It is the fit models, though, that have a one-to-one mapping to the data_uid values.

itoko (Contributor) commented Feb 6, 2024:

Hmm, I'm coming around to Will's comment:

One concern I have is making the roles of name, data_uid, category, and model (not a column, but a term used in the same context) clear.

Maybe all we need is model_ with fixed categories (or "stage"s?) (as for StarkRamseyXYAmpScanAnalysis, we could distinguish the "ramsey_xy" and "freq" data sets using model_name instead of category).

Collaborator Author:

We cannot define a fit model for ramsey_xy, because its P1 envelope is unpredictable. In this sense, "ramsey_xy" is intermediate data used to produce the frequency curves, which do have fit models (the Ramsey data is used for visualization, and we cannot discard it -- when the fit doesn't go well, experimentalists can get some insight from the raw Ramsey signal). This is why I cannot introduce fixed categories -- that would require the Ramsey analysis to define a ScatterTable subclass with its own categories.

Collaborator Author:

Anyway, I also like series. This term was originally used in curve analysis to indicate that data belongs to a particular fit model.

nkanazawa1989 (Collaborator, Author) commented Feb 6, 2024:

Done in ac5bdd8. Thanks for the suggestion.

Collaborator:

So I think the broadest way to think about the columns is that series labels a single curve of some kind (ScatterTable is under qiskit_experiments.curve_analysis, so it is safe to assume all the data belong to curves), category is a label for a group of curves/series that all belong to the same stage of processing (raw/formatted/fitted), and analysis is a namespace separating these groups of curves from different subexperiments.

@coruscating Is it important to get the release out today? I think I would like to clarify these terms a bit in the docs, but that could be done in a follow up if it is important to release today. I don't have any objections to the code (as long as further discussion doesn't make us want to change any names again).
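The column semantics summarized above can be illustrated with a toy frame (all values invented, and the column names follow the terms used in this discussion rather than the final API): the triplet (series, category, analysis) selects exactly one curve:

```python
import pandas as pd

# series labels one curve, category labels the processing stage,
# and analysis namespaces subexperiments; together they pick one curve.
df = pd.DataFrame(
    {
        "series": [0, 0, 1, 1],
        "category": ["raw", "formatted", "raw", "formatted"],
        "analysis": ["T1Analysis"] * 4,
        "xval": [1.0, 1.0, 2.0, 2.0],
        "yval": [0.90, 0.91, 0.80, 0.82],
    }
)
curve = df[
    (df["series"] == 0)
    & (df["category"] == "formatted")
    & (df["analysis"] == "T1Analysis")
]
print(curve["yval"].to_list())  # [0.91]
```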

docs/tutorials/curve_analysis.rst
wshanks (Collaborator) left a comment:

The code looks good. I will follow up about the documentation for ScatterTable.

nkanazawa1989 added this pull request to the merge queue Feb 6, 2024
Merged via the queue into qiskit-community:main with commit 36c06ea Feb 6, 2024
11 checks passed