Include token index on nodes when parsing AST #2795

matthew-canestraro · 2022-09-28T16:29:32Z

matthew-canestraro
Sep 28, 2022

Currently, when parsing a string into an AST, there is no way to trace an individual node back to its token's original location in the string. This makes it difficult to lint the string or perform other semantics-aware manipulations of it. At the same time, turning an AST back into a string using toString() discards all whitespace formatting, so it cannot be used as an alternative.

The parser already tracks the current string index of each token as it parses, so adding that value to nodes as a tokenIndex seems like a relatively small change that would resolve this problem. I am open to any alternatives that would help map an AST back to the source string.

I am happy to create a pull request for this, but want to make sure I'm tackling it in a way others will agree to.

(POC pull request opened here: #2796)

josdejong · 2022-10-03T11:37:35Z

josdejong
Oct 3, 2022
Maintainer

Thanks, it would be nice indeed to have the parsed nodes contain more information of the original parsed expression. So basically we want to create a source map.

Before implementing a solution, I think we should think through how we want to solve this exactly. Some initial thoughts:

- Adding an extra argument tokenIndex to the constructors of all Nodes will make it harder to use them on your own, like in a transform where you replace nodes with new types of Nodes. So probably, this information should remain optional.
- I haven't thought this through, but I have the feeling we would want to have more information, and that the information we want may differ per Node class. For example, for a ParenthesisNode we want the start and the end, and for a FunctionNode we probably want the start of the function name, the position of the opening bracket, and the position of the closing bracket. Just let me know if I'm overthinking this 😁

@mattvague do you have any opinion in this regard with your experience in parsing, transforming, and highlighting of expressions?

3 replies

mattvague Oct 4, 2022

> Adding an extra argument tokenIndex to the constructors of all Nodes will make it harder to use them on your own, like in a transform where you replace nodes with new types of Nodes. So probably, this information should remain optional.

Agreed, if we add it to the constructor it should definitely be optional. I did the same thing in my implementation, but I now wonder whether it would be better to have an addSource(startIndex, endIndex, expression) function on the base Node class, so that we have a uniform interface and don't force every subtype to add this source argument to its constructor.

> I haven't thought this through, but I have the feeling we would want to have more information, and that the information we want may differ per Node class. For example for a ParenthesisNode we want the start and the end, and for a FunctionNode we probably want the start of the function name, position of the opening bracket and of the closing bracket. Just let me know if I'm overthinking this

I think we would definitely want to keep track of at least the start and end index for all nodes, and probably the original expression too. I think everything else could be calculated from that, though, no? For example, as long as you know the start position, end position, and length of the function name, you could calculate where the opening and closing brackets are.
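
A minimal sketch of the two ideas in this reply: an optional addSource on a base class, and deriving bracket positions from the start index, end index, and name length. The classes below are simplified stand-ins rather than the mathjs implementations, and the `source` field and `bracketPositions` helper are hypothetical. The calculation also assumes that endIndex is exclusive and that the opening bracket directly follows the function name.

```javascript
// Hypothetical base class: source info is attached after construction,
// so subclass constructors stay unchanged.
class Node {
  addSource (startIndex, endIndex, expression) {
    this.source = { startIndex, endIndex, expression }
    return this // allow chaining
  }
}

// Simplified stand-in for a function call node.
class FunctionNode extends Node {
  constructor (name, args) {
    super()
    this.name = name
    this.args = args
  }
}

// Derive bracket positions from start, end, and name length.
// Assumes endIndex is exclusive and "(" directly follows the name.
function bracketPositions (node) {
  const { startIndex, endIndex } = node.source
  return {
    open: startIndex + node.name.length,
    close: endIndex - 1
  }
}

const expression = '2 * sqrt(16)'
const fn = new FunctionNode('sqrt', []).addSource(4, 12, expression)
const { open, close } = bracketPositions(fn)
console.log(expression[open], expression[close]) // prints "( )"
```

If whitespace were allowed between the name and the bracket, the opening position would have to be searched for instead of computed, which is one argument for storing it explicitly.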

matthew-canestraro Oct 4, 2022
Author

I like the addSource idea, will work on an implementation later this week! Off the top of my head, startIndex and endIndex seem sufficient, but once I'm in the code, it'll be easier to pick out any nodes that would benefit from more information.

josdejong Oct 5, 2022
Maintainer

Yeah, good idea about the addSource, to keep this meta information separate from the constructors. It makes sense to keep startIndex and endIndex and not tailor this differently for each and every node (just keep it simple).

mattvague · 2022-10-04T17:45:14Z

mattvague
Oct 4, 2022

@matthew-canestraro Just FYI I already have a partially complete implementation of this here if you need inspiration or would like to build on that instead of doing it all yourself

2 replies

matthew-canestraro Oct 4, 2022
Author

Thanks, this will be a helpful reference!

mattvague Oct 4, 2022

@matthew-canestraro No problem, feel free to @ me in your PR if you need to discuss anything. If / when you do complete this, let's make sure to finish off my ASTExplorer extension for mathjs: https://github.com/fkling/astexplorer/blob/master/website/src/parsers/mathjs/mathjs.js#L35-L37

matthew-canestraro · 2023-02-07T22:14:50Z

matthew-canestraro
Feb 7, 2023
Author

Hi! Sorry for the long absence on this one, I had to shift priorities for a while and finally found time to get back to this.

I just pushed changes to add a mapping to all nodes. The SourceMapping is a simple object containing the starting index and source string for the particular node. E.g., in the expression abc = foo, the symbol foo would receive the mapping:

{ index: 6, source: "foo" }

There was no need for a start and end index because the end can always be found using index + source.length.
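
As a small illustration of that point, the end index can be derived on demand. This is only a sketch using the mapping shape described above; the `endIndex` helper is hypothetical.

```javascript
// Mapping shape from the comment above: { index, source }.
const expression = 'abc = foo'
const mapping = { index: 6, source: 'foo' }

// The end index (exclusive) follows from the start and the token text.
function endIndex (mapping) {
  return mapping.index + mapping.source.length
}

const end = endIndex(mapping)
const recovered = expression.slice(mapping.index, end)
console.log(end, recovered) // prints "9 foo"
```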

The mapping for most nodes points to their identifying symbol (e.g., = for assignment or +, -, etc. for operators).

For nodes with multiple identifying symbols, I included everything between the first and last symbol. E.g., for the conditional 1 < 2 ? "hello" : "goodbye", the source would be:

{ index: 6, source: '? "hello" :' }

This seems awkward at first glance, but it lets the user quickly get the index of ? and :, which are unique to this node. The condition and the if-true and if-false values can be obtained through other nodes. (It also happened to be much easier to add without reworking the code 😇)

There are a few quirky edge cases, like BlockNode, which always receives { index: 0, source: state.expression }, and implicit multiplication, which has an index where the * would be but an empty string as its source. In general, though, I tried to keep the mappings as simple, useful, and consistent as possible.

I'm about to begin writing tests (which will make it easier to see how each node maps), but let me know if you have any major concerns with this direction!

3 replies

josdejong Feb 8, 2023
Maintainer

Thanks for getting back to this!

Feedback:

1. The name mapping makes sense on one hand: it is a source mapping. On the other hand, it may be confused with mapping over nodes, as in iterating over them using the function map. How about naming it source, withSource, etc.? It could look like:
   ```
   .source = { index: 6, text: 'foo' }
   ```
2. It sounds logical indeed to only include index instead of a start and end.
3. I think we should think through what we should put in the source or text. It should at least be consistent, otherwise it is hard to use.
   - Having sometimes just the operator like + and in other cases the operator and part of the argument like ? "hello" : feels inconsistent to me.
   - One idea could be to have a list with sources, like:
     ```
     sources = [
       { index: 6, text: '?'},
       { index: 16, text: ':'},
     ]
     ```
   - Another idea could be to include the full operator and its arguments, like 2+3 and 1 < 2 ? "hello" : "goodbye". I think that will not be very useful in practice though. But that depends on (4).
4. To select the right approach for (3), it would help me to have a clearer idea of how the source information will be used. Do we want to be able to construct the original expression from the AST, for example? Do you have some use cases?

matthew-canestraro Feb 8, 2023
Author

Ah I hadn't considered the confusion with mapping, but that's a very good point! I'll make the change

I actually like your sources array idea a lot better than my original strategy 😁 It handles simple and complex nodes equally well without too much added complexity

My personal use case is to find the index of symbols in the AST so I can modify them in the original source. E.g., if a user renames an object, we update all code references to the object to use the new name. I can also see this feature being used to provide IDE features like syntax highlighting, in-place error messages, etc.
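
The rename use case could be sketched roughly as follows. This assumes the { index, text } source shape proposed earlier in the thread; the `renameInSource` helper and the hand-written list of sources are hypothetical and stand in for what would be collected from the symbol nodes of a parsed tree.

```javascript
// Replace every occurrence described by a { index, text } source with newText.
// Edits are applied from the last index to the first so that earlier
// indices stay valid while the string changes length.
function renameInSource (expression, sources, newText) {
  const ordered = [...sources].sort((a, b) => b.index - a.index)
  let result = expression
  for (const { index, text } of ordered) {
    result = result.slice(0, index) + newText + result.slice(index + text.length)
  }
  return result
}

// Hypothetical sources collected from the symbol nodes named "foo".
const expression = 'foo  +  2 * foo'
const fooSources = [
  { index: 0, text: 'foo' },
  { index: 12, text: 'foo' }
]
const renamed = renameInSource(expression, fooSources, 'total')
console.log(renamed) // prints "total  +  2 * total"
```

Because only the recorded spans are rewritten, the original whitespace survives, which is what toString() on the AST cannot offer.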

> Do we want to be able to construct the original expression from the AST for example?

I think this would be more trouble than it's worth. If the user needs access to the original expression, it would be much easier to make state.expression available on the top node of the tree, or something along those lines. So for this feature, I would assume that the user has the original source string and wants to use these source mappings to modify it in some way

I like the source array approach because the user can access any individual token without doing any manual parsing. If they need the exact index of the ? or :, they can look at conditionalNode.sources. But if they need the index of the condition, they can also find that on conditionalNode.condition.sources
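
To make the access pattern concrete, here is a sketch with a hand-written stand-in for the parsed conditional. The `sources` property and its contents are assumptions based on the proposal, not output from a real parse, and `matchesSource` is a hypothetical helper.

```javascript
const expression = '1 < 2 ? "hello" : "goodbye"'

// Hand-written stand-in for a parsed ConditionalNode; the `sources`
// property and its contents are assumptions based on the proposal.
const conditionalNode = {
  type: 'ConditionalNode',
  sources: [
    { index: 6, text: '?' },
    { index: 16, text: ':' }
  ],
  condition: {
    type: 'OperatorNode',
    sources: [{ index: 2, text: '<' }]
  }
}

// Check that a source entry really points at its text in the expression.
function matchesSource (expression, { index, text }) {
  return expression.slice(index, index + text.length) === text
}

const all = [...conditionalNode.sources, ...conditionalNode.condition.sources]
console.log(all.every(s => matchesSource(expression, s))) // prints "true"
```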

If we try to include too much in a single source, e.g. { index: 0, text: '1 < 2 ? "hello" : "goodbye"' }, then a user who wants more granularity will be forced to parse the conditional expression themselves.

Maybe @mattvague has some insight on how ASTExplorer would want to use this feature

josdejong Feb 10, 2023
Maintainer

Thanks for explaining your use case. I think indeed a fine-grained array with sources makes sense 👍

matthew-canestraro · 2023-03-06T19:47:40Z

matthew-canestraro
Mar 6, 2023
Author

Hi @josdejong, I've moved on to fixing tests and realized that the new sources behavior is undefined for math.resolve()

What should the sources be for a resolved constant node? I.e., for math.resolve("1 + x", { x: 2 }), what should the source be for the node ConstantNode { value: 2 }?

Viable options I could see:

{ index: 4, text: "x"} // source maps back to the variable which was resolved

{ index: 4, text: "2" } // source acts as if value was in the source string

{ index: 4, text: "" } // no text because this node is implicit and not actually in the source

[] // no source at all, easy way out

3 replies

josdejong Mar 9, 2023
Maintainer

That is a nice one. We have two separate steps here:

1. Parse 1 + x. That gives a source like { index: 4, text: "x"}
2. Transform the expression tree into a parsed 1 + 2.

Since a part of the expression is replaced, the replaced part has no source to point to. It makes most sense to me to have no source information then (your last option). I think that is better than a made up source or a source that doesn't actually correspond with what is in the ConstantNode. Does that make sense?
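
A rough sketch of that behavior, using plain objects instead of real mathjs nodes. The `resolveSketch` function is a hypothetical simplification and not the actual math.resolve implementation; it only shows replaced nodes ending up with an empty sources array while untouched nodes keep theirs.

```javascript
// Hand-written stand-in for the parsed tree of "1 + x"; `sources` is the
// proposed property, not something taken from a real parse.
const tree = {
  type: 'OperatorNode',
  sources: [{ index: 2, text: '+' }],
  args: [
    { type: 'ConstantNode', value: 1, sources: [{ index: 0, text: '1' }] },
    { type: 'SymbolNode', name: 'x', sources: [{ index: 4, text: 'x' }] }
  ]
}

// Simplified resolve: symbols found in scope become constants without
// any source; untouched nodes keep theirs.
function resolveSketch (node, scope) {
  if (node.type === 'SymbolNode' && node.name in scope) {
    return { type: 'ConstantNode', value: scope[node.name], sources: [] }
  }
  if (node.args) {
    return { ...node, args: node.args.map(arg => resolveSketch(arg, scope)) }
  }
  return node
}

const resolved = resolveSketch(tree, { x: 2 })
console.log(resolved.args[1].sources.length) // prints "0"
console.log(resolved.args[0].sources[0].text) // prints "1"
```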

matthew-canestraro Mar 15, 2023
Author

Makes sense! I've updated the resolve behavior, added lots of unit tests for sources and moved my PR (#2796) out of draft

I'm hitting a strange error in the CI for node tests where it's expecting a much older version of MathJS. E.g., it says it expects version 11.1.0, but my branch is on 11.7.0, which I believe to be the latest.

Is there something I need to do to configure local node tests before running npm run test:node?

josdejong Mar 17, 2023
Maintainer

I cannot reproduce the issue with versions. It may help to run npm run build-and-test once.

The npm run build-and-test script is the only one that still has some failing tests now, can you have a look at those?
