Folia file parsing error #192

Open
savary opened this issue Oct 8, 2024 · 7 comments

savary commented Oct 8, 2024

Several of our FLAT users have an issue with their working files. They work on annotations, and sometimes when they come back to the same file, FLAT cannot open it again and displays an error message:
FoliA-exception

I downloaded the file for which the message was displayed. It is the XenophonAnabasis4REANNOTATION.folia.xml file. I checked that the reference in line 209 (XenophonAnabasis4REANNOTATION.text.1.S.326.W.8) does exist in the file, so I do not understand the problem, especially because the file was not manipulated outside FLAT.

I also tried to download this file and convert it to another format (.cupt for PARSEME) using the FoLiA libraries, and the same error message occurred for this file.

Several other files are also affected by the same problem, for instance: XenophonAnabasis3GSclean.

@proycon proycon self-assigned this Oct 9, 2024
@proycon proycon added the bug label Oct 9, 2024
Owner

proycon commented Oct 9, 2024

I can't seem to access that FoLiA XML file; https://gitlab.com/parseme/annotations/ gives a 404. Perhaps it is a private repository that I (https://gitlab.com/proycon) can't access?

I suspect that the word is referenced before it appears. I have seen something like that in the past, and it would indeed be a bug in FLAT or the folia library: the entity layer is not being inserted at the proper place.

Owner

proycon commented Oct 9, 2024

Possibly related issue:

Author

savary commented Oct 9, 2024

I added you (as proycon) to the parseme project on GitLab. All repos in this project will be available to you. The XenophonAnabasis4REANNOTATION.folia.xml file should be accessible now.

Author

savary commented Oct 9, 2024

I looked into the file and I think that one of the problems is that an annotation (containing 2-3 tokens) spans tokens from two different sentences.

Owner

proycon commented Oct 11, 2024

Yes, the first entity layer in that file, for sentence XenophonAnabasis4REANNOTATION.text.1.s.1, has two correct MWEs, but also one that references words from sentences 326 and 327, which are only defined later in the file. This causes the error.
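The situation described above (a `wref` that points at a word defined only later in the file) can be checked with a short script. This is a minimal sketch using only the Python standard library, not the folia library's own validation; the function name `find_forward_references` and the sample document are made up for illustration, and it assumes word ids are declared with `xml:id` on `w` elements in the FoLiA namespace.

```python
import io
import xml.etree.ElementTree as ET

FOLIA = "{http://ilk.uvt.nl/folia}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature document: the entity layer of sentence 1
# references a word that is only defined in sentence 2.
SAMPLE_FORWARD = b"""<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">
<text xml:id="doc.text.1">
<s xml:id="doc.text.1.s.1">
<w xml:id="doc.text.1.s.1.w.1"><t>a</t></w>
<entities>
<entity class="mwe"><wref id="doc.text.1.s.1.w.1"/></entity>
<entity class="mwe"><wref id="doc.text.1.s.2.w.1"/></entity>
</entities>
</s>
<s xml:id="doc.text.1.s.2">
<w xml:id="doc.text.1.s.2.w.1"><t>b</t></w>
</s>
</text>
</FoLiA>"""


def find_forward_references(source):
    """Return (wref_id, defined_later) for each wref seen before its word.

    `source` is a filename or a binary file-like object containing FoLiA XML.
    """
    seen = set()      # word ids encountered so far, in document order
    unresolved = []   # wref ids whose target had not been seen yet
    for _event, elem in ET.iterparse(source, events=("start",)):
        if elem.tag == FOLIA + "w":
            seen.add(elem.get(XML_ID))
        elif elem.tag == FOLIA + "wref" and elem.get("id") not in seen:
            unresolved.append(elem.get("id"))
    # After the full pass, `seen` holds every word id, so we can tell
    # "defined later" apart from "never defined".
    return [(ref, ref in seen) for ref in unresolved]


if __name__ == "__main__":
    for ref, defined_later in find_forward_references(io.BytesIO(SAMPLE_FORWARD)):
        print(ref, "defined later" if defined_later else "never defined")
```

Passing a real filename instead of the `io.BytesIO` wrapper should work the same way, since `iterparse` accepts both.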

The bug is of course somewhere in FLAT's underlying libraries, as it should never have written the entity in that layer for that sentence. I don't suppose you remember the exact steps that reproduce such a mis-annotation?

I'll first expand an existing fix (foliavalidator --fixinvalidreferences) to allow fixing such malformed documents, although the fix will consist of simply removing the incorrect references.
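To illustrate what "removing the incorrect references" could mean in practice, here is a rough sketch with the Python standard library. It is not the foliavalidator implementation; the function name `drop_out_of_sentence_entities` and the sample are hypothetical, and it assumes that an entity layer sits directly under its sentence and that any entity referencing a word outside that sentence should be dropped as a whole.

```python
import xml.etree.ElementTree as ET

FOLIA = "{http://ilk.uvt.nl/folia}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature document with one valid and one invalid entity.
SAMPLE_FIX = b"""<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">
<text xml:id="doc.text.1">
<s xml:id="doc.s.1">
<w xml:id="doc.s.1.w.1"><t>a</t></w>
<entities>
<entity class="mwe"><wref id="doc.s.1.w.1"/></entity>
<entity class="mwe"><wref id="doc.s.2.w.1"/></entity>
</entities>
</s>
<s xml:id="doc.s.2">
<w xml:id="doc.s.2.w.1"><t>b</t></w>
</s>
</text>
</FoLiA>"""


def drop_out_of_sentence_entities(root):
    """Remove entities that reference words outside their own sentence.

    Modifies the tree in place and returns the number of removed entities.
    """
    removed = 0
    # Materialise the sentence list first so the tree can be modified safely.
    for sentence in list(root.iter(FOLIA + "s")):
        local_words = {w.get(XML_ID) for w in sentence.iter(FOLIA + "w")}
        for layer in sentence.findall(FOLIA + "entities"):
            for entity in list(layer.findall(FOLIA + "entity")):
                refs = [r.get("id") for r in entity.findall(FOLIA + "wref")]
                if any(ref not in local_words for ref in refs):
                    layer.remove(entity)
                    removed += 1
    return removed


if __name__ == "__main__":
    tree_root = ET.fromstring(SAMPLE_FIX)
    print("removed entities:", drop_out_of_sentence_entities(tree_root))
```

Note that this discards the annotation rather than repairing it, which matches the workaround nature of the fix described above.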

proycon added a commit to proycon/foliapy that referenced this issue Oct 11, 2024
Owner

proycon commented Oct 11, 2024

I published a new version of the FLAT container image on Docker Hub that contains this fix/workaround. The actual root cause remains to be found and fixed though.

Author

savary commented Oct 12, 2024

> The bug is of course somewhere in FLAT's underlying libraries, as it should never have written the entity in that layer for that sentence. I don't suppose you remember the exact steps that reproduce such a mis-annotation?

No, I asked the annotator but she could not remember the precise steps.

I ran some tests myself, though, and noticed that it is possible to select two tokens in two different sentences and group them into one annotation. Once created, such a cross-sentence annotation can no longer be deleted. I also noticed that in the file I sent you, and in other files with the same parsing error, this precise situation occurred: an annotation covered two tokens from two different sentences. Could that be an issue?
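For anyone who wants to check their own files for this pattern, a small script along these lines could list the annotations that cover tokens from more than one sentence. It is only a sketch with the Python standard library, not part of FLAT or the folia library; the function name `cross_sentence_entities` and the sample are invented for illustration, and it assumes entities reference their tokens through direct `wref` children.

```python
import io
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature document: one entity groups a token from
# sentence 1 with a token from sentence 2.
SAMPLE_CROSS = b"""<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">
<text xml:id="doc.text.1">
<s xml:id="doc.s.1">
<w xml:id="doc.s.1.w.1"><t>a</t></w>
<entities>
<entity class="mwe"><wref id="doc.s.1.w.1"/></entity>
<entity class="mwe"><wref id="doc.s.1.w.1"/><wref id="doc.s.2.w.1"/></entity>
</entities>
</s>
<s xml:id="doc.s.2">
<w xml:id="doc.s.2.w.1"><t>b</t></w>
</s>
</text>
</FoLiA>"""


def cross_sentence_entities(source):
    """Return (wref_ids, sentence_ids) for entities spanning several sentences.

    A reference that resolves to no known word is reported with the
    placeholder sentence id "None".
    """
    root = ET.parse(source).getroot()
    # Map every word id to the id of the sentence that contains it.
    sentence_of = {}
    for sentence in root.iterfind(".//f:s", NS):
        for word in sentence.iterfind(".//f:w", NS):
            sentence_of[word.get(XML_ID)] = sentence.get(XML_ID)
    found = []
    for entity in root.iterfind(".//f:entity", NS):
        refs = [ref.get("id") for ref in entity.iterfind("f:wref", NS)]
        sentences = {str(sentence_of.get(ref)) for ref in refs}
        if len(sentences) > 1:
            found.append((refs, sorted(sentences)))
    return found


if __name__ == "__main__":
    for refs, sentences in cross_sentence_entities(io.BytesIO(SAMPLE_CROSS)):
        print("entity over", refs, "spans sentences", sentences)
```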
