Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange Behavior with Copy #1400

Open
yoavkatz opened this issue Nov 28, 2024 · 8 comments
Open

Strange Behavior with Copy #1400

yoavkatz opened this issue Nov 28, 2024 · 8 comments
Assignees

Comments

@yoavkatz
Copy link
Member

yoavkatz commented Nov 28, 2024

If we copy a string to a nested field in another string field (which is meaningless) , we get this odd behavior that the target string field becomes a dictionary, with the copied value.

def test_copy_string_to_string_error(self):
        inputs = [
            {"source": "hello", "task_data" :3}
        ]

        targets = [{"source": "hello","task_data": {'source': 'hello'}}]

        check_operator(
            operator=Copy(field="source", to_field="task_data/source"),  <-- this should fail because task_data does not have a 'source field' . Instead it works.
            inputs=inputs,
            targets=targets,
            tester=self,
        )
@yoavkatz yoavkatz changed the title StrangeBehavior with Copy Strange Behavior with Copy Nov 28, 2024
@dafnapension
Copy link
Collaborator

dafnapension commented Nov 28, 2024

Hi @yoavkatz , it is not odd, it is as planned.. Pending the flag not_exist_ok which is set to True, by default, for dict_set, which, in turn, participates in the last stage of Copy. It means that if dict_set needs to set something, and not all the way (as specified by the query) to where that something needs to be set exists, or exists but not the type needed as implied by the query, then dict_set instantiates all the non-existing parts of the way, overwriting existing-but-not-matching parts (if needed). [in this example, the 3 that sat quietly in task_data was decisively overwritten, turned into a dictionary, as implied by the query].
If not_exist_ok is False, then dict_set yells:

from unitxt.dict_utils import dict_set

dic = {"source": "hello", "task_data" :3}
try:
    dict_set(dic, "task_data/source", "hello", not_exist_ok=False) 
    print(dic)
except Exception as e:
    print(e)  # prints No path in dic {'source': 'hello', 'task_data': 3} matches query task_data/source
    
dic = {"source": "hello", "task_data" :3}

try:
    dict_set(dic, "task_data/source", "hello", not_exist_ok=True) 
    print(dic) # prints  {'source': 'hello', 'task_data': {'source': 'hello'}}
except Exception as e:
    print(e)  

@dafnapension
Copy link
Collaborator

dafnapension commented Nov 28, 2024

Here is how dict_set participates now in the last stage of Copy, actually: the last stage of any InstanceFieldOperator:

            dict_set(
                instance,
                to_field,
                new_value,
                not_exist_ok=True,   # << note here
            )

We may want to change it to use the flag not_exist_ok of InstanceFieldOperator.
The same flag is also used for getting the value in the first stage of InstanceFieldOperator
So either we use the same flag for both missions, or add another flag and separate.
What would you suggest?

@yoavkatz
Copy link
Member Author

Thanks for the explanation. I think the most natural behavior, which does not require additional flags, is that if that you can always assign to a non exist field, but only at the last level.

Assume:

{ "source": "my string", "task_data" : {"format" : "my string", "options" : ["a","b","c"] }

The we can assign to a "new_field", or to "source" or to "task_data", or to "task_data/format", or to "task_data/options" or to "task_data/options/0". But we can not assign to "source/0" or "task_data/format/a".

Building such implicit data structures is non expected (we also don't know if we need to build a dictionary or a list when assigning to non existent field : a/0/field".

@dafnapension
Copy link
Collaborator

dafnapension commented Nov 29, 2024

Thanks, @yoavkatz , thats a neat idea. It also solves the problem I started to cook around (silently for now.. :-)): what to do with 0 like in your example, a 0 that can be both an index into a list but also a field name, as used in rendered-to-json tables, for example. I was cooking around the already currently existing flag of dict_utils: flag allow_int_index=True (set by default to True, giving priority to lists over dictionary).
With your idea, we can remove this flag altogether, simplify the code, and operate by the 'conditions in the field' rather than by a predetermined value of a flag.

A question: It is still OK to replace any existing value, of any complexity, and sitting in a field of any depth in dic, by a new value, of any complexity too. Right? E.g., In your example above, it is OK to assign a new value, of any complexity (dict of lists of dicts, or just a simple str) to task_data/options, replacing the whole list currently assigned to it. Right?

I will push a PR soon.

@dafnapension
Copy link
Collaborator

@yoavkatz , I see some points in existing code where Copy does trust the whole path will be instantiated, if missing, not just the last leg.
See e.g. from answer_correctness.json:

        {
            "__type__": "copy",
            "field_to_field": {
                "task_data/reference_answers": "references"
            }
        }

What if at this stage the instance does not have a field named task_data?

@yoavkatz
Copy link
Member Author

Hi. I think on this case task_data/reference_answer is the source. If it does not exist , it should fail. If the target reference field does not exist, it should be created (that's fube because it's not nested).

@dafnapension
Copy link
Collaborator

Ahh, yes, my mistake. In my example the nested field is the source, not the target. Sorry.

@dafnapension
Copy link
Collaborator

Hi @yoavkatz ,
In reference to existing code that does expect dict_set to generate, when needed, more than just a missing key in a dictionary. Here is a better example to make my point:

dict_set(instance, "recipe_metadata/template", template)
In ApplyTemplate.process().

I will change it here to a two step operation, but there might be other places like this in users' code..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants