Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interactive parsing with Earley #1493

Open
tanvisharma opened this issue Nov 29, 2024 · 3 comments
Open

Interactive parsing with Earley #1493

tanvisharma opened this issue Nov 29, 2024 · 3 comments

Comments

@tanvisharma
Copy link

Suggestion
I am trying to implement interactive parsing with lark using the earley parser because llar cannot handle all the cases represented through my grammar. Could you please help me in efficiently incorporating it in the lark repository?

My Methodology
As a first implementation (or proof of concept) the interactive parsing option is exposed through parse with an extra argument. The idea is to return a list of regex patterns for the next expected token given an input stream that partially matches the grammar. To accomplish this functionality, I focused on keeping track of the to_scan list coming from the predict_and_complete function of xearley class. For the case when to_scan does not result in valid solutions, it returns the list of expected_terminals instead of throwing an UnexpectedEOF error (when the interactive parser argument is true). While this case works for many cases, it does not take care of corner cases such as delayed_matches which may have resulted in a correct match after some tokens. Therefore, I also add the delayed_matches to to_scan in case of partial input stream.

Feedback
Could you please provide feedback on the given approach? Am I missing any possible corner cases with this approach? I understand it is an ad-hoc strategy and probably could be made efficient with dynamic programming. I would be more than happy to share the code and expand on it based on your suggestions/feedback.

Thank you. Looking forward to your response!

@erezsh
Copy link
Member

erezsh commented Jan 1, 2025

The difficulty with Earley is that it explores many paths at the same time, many of which will never reach the final tree.

So, if you try to follow it step-by-step and report on what it's doing, you will face the issue that paths will be blended in together. So to_scan will contain tokens not only for the correct path, but also for invalid paths, and also for valid paths that will end up being de-prioritized.

Unfortunately, I can't think of a simple solution to this. Although in theory it should be possible to keep track of each "line of thinking", and provide a more accurate context, similar to what the LALR interactive parser does.

I hope that helps!

@tanvisharma
Copy link
Author

Thank you for your reply!

I did not understand why paths getting blended in together would cause an issue in future. Could you please explain with some example?

Also, how can we categorize a path as valid or invalid without knowing the full stream? I am of the opinion that any interactive parse should output all possible paths for a partial stream. What do you say?

Thank you for your patience.

@erezsh
Copy link
Member

erezsh commented Jan 3, 2025

how can we categorize a path as valid or invalid without knowing the full stream?

You can't. You can only know at the end of the parse.

That's why a step-by-step parsing has to assume all paths are valid until they are disqualified by the parser.

why paths getting blended in together would cause an issue in future

It's an issue if you want to be able to predict the next valid tokens, for example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants