
Conversation

@prosperritty (Contributor)

Add implementation of token_at_offset method for CST Node along with TokenAtOffset return type.

This method can be used in language server development.

@google-cla (bot) commented Aug 8, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

pub enum TokenAtOffset<M> {
    None,
    Single(Token<M>),
    Between(Token<M>, Token<M>),
}
Member

I wonder why this API can return Single or Between. In the case of this rule:

rule test {condition:true}

Why does token_at_offset(4) return that it is between the rule keyword and the space? At that offset we are exactly at the space, right?

I would like to hear a bit more about the use case for this API in order to understand the problem that it is trying to solve.

Contributor Author

> I wonder why this API can return Single or Between.

> Why does token_at_offset(4) return that it is between the rule keyword and the space? At that offset we are exactly at the space, right?

This API can return Single and Between so the user can choose which direction to take the token from, using TokenAtOffset methods like right_biased() or left_biased(). For a Single value both methods return the same token, while for a Between value they return the two neighboring tokens. In this API the offset is treated a little differently, in my understanding: not as an absolute position in the text, but as a position between characters. I prepared a small diagram which I will attach to this comment; I hope it helps to understand the concept.
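
For illustration, a minimal sketch of how a caller might consume the returned TokenAtOffset (it uses the rule from your comment; imports and crate paths are omitted as in the other snippets in this thread, and it is only a sketch of the semantics):

let mut root = Parser::new(b"rule test {condition:true}")
    .try_into_cst()
    .unwrap()
    .root();

// Offset 4 is the boundary between the `rule` keyword (0..4) and the space (4..5).
match root.token_at_offset(4) {
    // The offset falls strictly inside a single token.
    TokenAtOffset::Single(_token) => {}
    // The offset sits on a boundary: here `_left` is the `rule` keyword and
    // `_right` is the following space; left_biased()/right_biased() are
    // shorthands for picking one of the two.
    TokenAtOffset::Between(_left, _right) => {}
    // The offset lies outside this node.
    TokenAtOffset::None => {}
}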

> I would like to hear a bit more about the use case for this API in order to understand the problem that it is trying to solve.

As I mentioned before, this method will be very useful in implementing a language server, and that is the problem I am trying to solve. It can replace parts of the code that currently search for a token at a given position manually.

More about TokenAtOffset and token_at_offset() can be found in the rowan documentation.

Member

Ok, now I understand the API semantics; it is basically exposing an existing rowan API. But to be honest, I find this rowan API a bit convoluted and hard to grasp. That's why I would like to see examples of situations in which this API is used in the language server.

In which language server features do you need this API?

Now suppose that you have an API that, given an offset, returns the token located at that offset. For instance, in the rule rule test { condition: true }, the tokens are:

  • rule keyword (offsets: [0-3])
  • space (offset: 4)
  • test identifier (offsets: [5-8])
  • space (offset: 9)
  • { (offset: 10)
  • space (offset: 11)

etc.

token_at_offset would return a Token<T>, the token that contains the offset. For instance:

token_at_offset(0) -> rule
token_at_offset(1) -> rule
token_at_offset(2) -> rule
token_at_offset(3) -> rule
token_at_offset(4) -> space
token_at_offset(5) -> test
token_at_offset(6) -> test
token_at_offset(7) -> test
token_at_offset(8) -> test
token_at_offset(9) -> space
token_at_offset(10) -> {

etc.

Would an API like that work for you? What I would like to understand is whether a more complex API is really required, and why.

Contributor Author

> In which language server features do you need this API?

This method is mainly used in features where it is necessary to process a position from a client request, for example the current position of the user's cursor in the text, in language server features such as Hover or Go to definition.
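
For illustration, a hedged sketch of the offset computation such a handler performs before calling token_at_offset. The conversion helper below is purely illustrative and ignores the LSP position-encoding negotiation (positions are UTF-16 based by default); only token_at_offset itself comes from this PR:

// Convert an LSP (line, character) position into a byte offset into `src`.
// Naive version: sum the lengths of the preceding lines, then add `character`.
fn position_to_offset(src: &str, line: u32, character: u32) -> Option<u32> {
    let line_start: usize = src
        .split_inclusive('\n') // keep the '\n' so the lengths add up to real offsets
        .take(line as usize)
        .map(str::len)
        .sum();
    let offset = line_start + character as usize;
    (offset <= src.len()).then_some(offset as u32)
}

// In a Hover or Go-to-definition handler (CST plumbing elided):
//     let offset = position_to_offset(src, pos.line, pos.character)?;
//     let token = root_node.token_at_offset(offset);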

> What I would like to understand is whether a more complex API is really required, and why.

I don't think a more complex API will be needed in this case. I thought exposing an existing rowan API, like I did, was common practice in the YARA-X project.

> Would an API like that work for you?

Sure! But I think it should return an Option<Token<T>> instead of Token<T>. Would this code be acceptable to you? After all, it should work as you described in your example.

pub fn token_at_offset(&self, offset: u32) -> Option<Token<M>> {
    self.inner.token_at_offset(offset.into()).right_biased().map(Token::new)
}
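
For reference, with that signature the calls from your example would behave like this (setup as in the other snippets in this thread; results per the offsets listed above):

let mut root = Parser::new(b"rule test { condition: true }")
    .try_into_cst()
    .unwrap()
    .root();

let _tok = root.token_at_offset(4);    // -> Some(space after `rule`)
let _tok = root.token_at_offset(10);   // -> Some(`{`)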

Member

Yes, I would be ok with an API that returns Option<Token<M>>; what I don't like too much is all that right_biased and left_biased stuff. I don't get why it was necessary in the rowan API; I find it excessively complex. Maybe there's a reason for it, but if we can solve our problems with a simpler API, I prefer the simpler one.

Contributor Author

I've simplified the API and changed the corresponding doctest in the last commit.

Thanks.

Member

While looking at this in detail I found something that I don't like in the API design. The token_at_offset function panics if the offset is outside the boundaries of the current node. For instance:

let mut root_node = Parser::new(b"rule test {condition:true}")
    .try_into_cst()
    .unwrap()
    .root();

root_node.token_at_offset(27);   // This panics because the valid range is 0..26.

The root node covers all the source code, so passing an offset that is lower than the source code length is enough to avoid a panic. However, this gets tricky when we call token_at_offset from some other node that is not the root. For instance:

let mut root_node = Parser::new(b"rule test {condition:true}")
    .try_into_cst()
    .unwrap()
    .root();

let rule_decl = root_node.first_child().unwrap();
let condition_blk = rule_decl.first_child().unwrap();

condition_blk.token_at_offset(0);   // This panics because the condition block is in the range 11..25 and 0 is outside that range

I believe that this is a very fragile API that forces the user to be extremely cautious with the offsets they pass to token_at_offset. All the burden is passed to the API user.

One alternative is making the offset relative to the node itself. For the root node the offsets would match absolute source code offsets, but for inner nodes the offsets passed to token_at_offset would be relative to the offset where the node starts. In condition_blk.token_at_offset(0) the returned token would be the condition keyword, because it is the token located at offset 0, counted from the start of the condition block.

Another alternative is keeping the absolute offsets, but returning an error or None instead of panicking.

Contributor Author

> The token_at_offset function panics if the offset is outside the boundaries of the current node.

Yes, I forgot to mention that. Here is the code that panics in such scenarios:

assert!(
    range.start() <= offset && offset <= range.end(),
    "Bad offset: range {:?} offset {:?}",
    range,
    offset
);

So yes, it always panics when the offset goes beyond the span.

> I believe that this is a very fragile API that forces the user to be extremely cautious with the offsets they pass to token_at_offset. All the burden is passed to the API user.

> Another alternative is keeping the absolute offsets, but returning an error or None instead of panicking.

I like the alternative of returning None more, and it will work for me as well.

Do you think it is necessary to make the offset relative? In my understanding of this API, an absolute offset seems more logical.

Member

Yes, I think that absolute offsets are easier to understand; I don't think there is a real use case for relative offsets.

Contributor Author

I added the same offset check as in the rowan library.

If you would like to do the check differently, let me know.
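
For reference, a sketch of what that check might look like, assuming it mirrors rowan's bounds condition but returns None instead of panicking (this is not necessarily the exact merged code):

pub fn token_at_offset(&self, offset: u32) -> Option<Token<M>> {
    let range = self.inner.text_range();
    // Same bounds condition as rowan's assert above, but bail out with None
    // instead of panicking when the offset falls outside this node.
    if offset < u32::from(range.start()) || offset > u32::from(range.end()) {
        return None;
    }
    self.inner.token_at_offset(offset.into()).right_biased().map(Token::new)
}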

@plusvic merged commit c83ca69 into VirusTotal:main on Aug 29, 2025. 15 checks passed.
nmbarel pushed a commit to nmbarel/yara-x that referenced this pull request Sep 15, 2025
Add implementation of `token_at_offset` method for CST `Node`. This method returns the token at a given source code offset.