
Conversation

@prosperritty (Contributor)

Add implementation of token_at_offset method for CST Node along with TokenAtOffset return type.

This method can be used in language server development.

@google-cla (bot) commented Aug 8, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

pub enum TokenAtOffset<M> {
    None,
    Single(Token<M>),
    Between(Token<M>, Token<M>),
}
Member

I wonder why this API can return Single or Between. In the case of this rule:

rule test {condition:true}

Why does token_at_offset(4) return that it is between the rule keyword and the space? At that offset we are exactly at the space, right?

I would like to hear a bit more about the use case for this API in order to understand the problem that it is trying to solve.

Contributor Author

> I wonder why this API can return Single or Between.

> Why does token_at_offset(4) return that it is between the rule keyword and the space? At that offset we are exactly at the space, right?

This API can return Single and Between so the user can choose which direction to take the token from, using TokenAtOffset methods like right_biased() or left_biased(). For a Single value both methods return the same token, while for a Between value they return the two neighboring tokens. In this API the offset is treated a little differently, in my understanding: not as an absolute position in the text, but as a position between characters. I prepared a small diagram which I will attach to this comment; I hope it helps to understand the concept.
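
For illustration, a minimal sketch of how a caller might consume the returned TokenAtOffset (it uses the rule from your comment; imports and crate paths are omitted as in the other snippets in this thread, and it is only a sketch of the semantics):

let mut root = Parser::new(b"rule test {condition:true}")
    .try_into_cst()
    .unwrap()
    .root();

// Offset 4 is the boundary between the `rule` keyword (0..4) and the space (4..5).
match root.token_at_offset(4) {
    // The offset falls strictly inside a single token.
    TokenAtOffset::Single(_token) => {}
    // The offset sits on a boundary: here `_left` is the `rule` keyword and
    // `_right` is the following space; left_biased()/right_biased() are
    // shorthands for picking one of the two.
    TokenAtOffset::Between(_left, _right) => {}
    // The offset lies outside this node.
    TokenAtOffset::None => {}
}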

> I would like to hear a bit more about the use case for this API in order to understand the problem that it is trying to solve.

As I mentioned before, this method will be very useful in implementing a language server, and that is the problem I am trying to solve. It can replace parts of the code that currently search for a token at a given position manually.

More about TokenAtOffset and token_at_offset() can be found in the rowan documentation.

Member

Ok, now I understand the API semantics; it is basically exposing an existing rowan API. But to be honest, I find this rowan API a bit convoluted and hard to grasp. That's why I would like to see examples of situations in which this API is used in the language server.

In which language server features do you need this API?

Now suppose that you have an API that, given an offset, returns the token located at that offset. For instance, in the rule rule test { condition: true }, the tokens are:

  • rule keyword (offsets: [0-3])
  • space (offset: 4)
  • test identifier (offsets: [5-8])
  • space (offset: 9)
  • { (offset: 10)
  • space (offset: 11)

etc.

token_at_offset would return a Token<T>, the token that contains the offset. For instance:

token_at_offset(0) -> rule
token_at_offset(1) -> rule
token_at_offset(2) -> rule
token_at_offset(3) -> rule
token_at_offset(4) -> space
token_at_offset(5) -> test
token_at_offset(6) -> test
token_at_offset(7) -> test
token_at_offset(8) -> test
token_at_offset(9) -> space
token_at_offset(10) -> {

etc.

Would an API like that work for you? What I would like to understand is whether a more complex API is really required, and why.

Contributor Author

> In which language server features do you need this API?

This method is mainly used in features where it is necessary to process a position from a client request, for example the current position of the user's cursor in the text, in language server features such as Hover or Go to definition.
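
For illustration, a hedged sketch of the offset computation such a handler performs before calling token_at_offset. The conversion helper below is purely illustrative and ignores the LSP position-encoding negotiation (positions are UTF-16 based by default); only token_at_offset itself comes from this PR:

// Convert an LSP (line, character) position into a byte offset into `src`.
// Naive version: sum the lengths of the preceding lines, then add `character`.
fn position_to_offset(src: &str, line: u32, character: u32) -> Option<u32> {
    let line_start: usize = src
        .split_inclusive('\n') // keep the '\n' so the lengths add up to real offsets
        .take(line as usize)
        .map(str::len)
        .sum();
    let offset = line_start + character as usize;
    (offset <= src.len()).then_some(offset as u32)
}

// In a Hover or Go-to-definition handler (CST plumbing elided):
//     let offset = position_to_offset(src, pos.line, pos.character)?;
//     let token = root_node.token_at_offset(offset);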

> What I would like to understand is whether a more complex API is really required, and why.

I don't think a more complex API will be needed in this case. I thought exposing an existing rowan API, like I did, was common practice in the YARA-X project.

> Would an API like that work for you?

Sure! But I think it should return an Option<Token<T>> instead of Token<T>. Would this code be acceptable to you? After all, it should work as you described in your example.

pub fn token_at_offset(&self, offset: u32) -> Option<Token<M>> {
    self.inner.token_at_offset(offset.into()).right_biased().map(Token::new)
}
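
For reference, with that signature the calls from your example would behave like this (setup as in the other snippets in this thread; results per the offsets listed above):

let mut root = Parser::new(b"rule test { condition: true }")
    .try_into_cst()
    .unwrap()
    .root();

let _tok = root.token_at_offset(4);    // -> Some(space after `rule`)
let _tok = root.token_at_offset(10);   // -> Some(`{`)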

Member

Yes, I would be ok with an API that returns Option<Token<M>>; what I don't like too much is all that right_biased and left_biased stuff. I don't get why it was necessary in the rowan API; I find it excessively complex. Maybe there's a reason for it, but if we can solve our problems with a simpler API, I prefer the simpler one.

Contributor Author

I've simplified the API and changed the corresponding doctest in the last commit.

Thanks.

Member

While looking at this in detail I found something that I don't like in the API design. The token_at_offset function panics if the offset is outside the boundaries of the current node. For instance:

let mut root_node = Parser::new(b"rule test {condition:true}")
    .try_into_cst()
    .unwrap()
    .root();

root_node.token_at_offset(27);   // This panics because the valid range is 0..26.

The root node covers all the source code, so passing an offset that is lower than the source code length is enough to avoid a panic. However, this gets tricky when we call token_at_offset from some other node that is not the root. For instance:

let mut root_node = Parser::new(b"rule test {condition:true}")
    .try_into_cst()
    .unwrap()
    .root();

let rule_decl = root_node.first_child().unwrap();
let condition_blk = rule_decl.first_child().unwrap();

condition_blk.token_at_offset(0);   // This panics because the condition block is in the range 11..25 and 0 is outside that range

I believe that this is a very fragile API that forces the user to be extremely cautious with the offsets they pass to token_at_offset. All the burden is passed to the API user.

One alternative is making the offset relative to the node itself. For the root node the offsets would match absolute source code offsets, but for inner nodes the offsets passed to token_at_offset would be relative to the offset where the node starts. In condition_blk.token_at_offset(0) the returned token would be the condition keyword, because it is the token located at offset 0, counted from the start of the condition block.

Another alternative is keeping the absolute offsets, but returning an error or None instead of panicking.

Contributor Author

> The token_at_offset function panics if the offset is outside the boundaries of the current node.

Yes, I forgot to mention that. Here is the code that panics in such scenarios:

assert!(
    range.start() <= offset && offset <= range.end(),
    "Bad offset: range {:?} offset {:?}",
    range,
    offset
);

So yes, it always panics when the offset goes beyond the span.

> I believe that this is a very fragile API that forces the user to be extremely cautious with the offsets they pass to token_at_offset. All the burden is passed to the API user.

> Another alternative is keeping the absolute offsets, but returning an error or None instead of panicking.

I like the alternative of returning None more, and it will work for me as well.

Do you think it is necessary to make the offset relative? In my understanding of this API, an absolute offset seems more logical.

Member

Yes, I think that absolute offsets are easier to understand; I don't think there is a real use case for relative offsets.

Contributor Author

I added the same offset check as in the rowan library.

If you would like to do the check differently, let me know.
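
For reference, a sketch of what that check might look like, assuming it mirrors rowan's bounds condition but returns None instead of panicking (this is not necessarily the exact merged code):

pub fn token_at_offset(&self, offset: u32) -> Option<Token<M>> {
    let range = self.inner.text_range();
    // Same bounds condition as rowan's assert above, but bail out with None
    // instead of panicking when the offset falls outside this node.
    if offset < u32::from(range.start()) || offset > u32::from(range.end()) {
        return None;
    }
    self.inner.token_at_offset(offset.into()).right_biased().map(Token::new)
}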

@plusvic merged commit c83ca69 into VirusTotal:main on Aug 29, 2025. 15 checks passed.
nmbarel pushed a commit to nmbarel/yara-x that referenced this pull request Sep 15, 2025
Add implementation of `token_at_offset` method for CST `Node`. This method returns the token at a given source code offset.