- 
                Notifications
    You must be signed in to change notification settings 
- Fork 87
          feat: implement token_at_offset method for Node.
          #426
        
          New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. | 
        
          
                parser/src/cst/mod.rs
              
                Outdated
          
        
      | pub enum TokenAtOffset<M> { | ||
| None, | ||
| Single(Token<M>), | ||
| Between(Token<M>, Token<M>), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why this API can return Single or Between.  In the case of this rule:
rule test {condition:true}
Why token_at_offset(4) returns that it is between the rule keyword and the space? At offset we are exactly at the space, right?
I would like to hear a bit more about the use case for this API in order to understand the problem that it is trying to solve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why this API can return Single or Between.
Why token_at_offset(4) returns that it is between the rule keyword and the space? At offset we are exactly at the space, right?
This API can return Single and Between, so the user can choose from which direction to take the token by using TokenAtOffset methods like right_biased() or left_biased(). The Single value will return the same token in both methods, when the Between value can return nearby tokens using these methods. In this API, offset is treated a little differently in my understanding. It is not treated as an absolute position in the text, but as a position between characters. I prepared a small diagram which I will attach to this comment. I hope it will help to understand the concept.
I would like to hear a bit more about the use case for this API in order to understand the problem that it is trying to solve.
As I mentioned before this method will be very useful in implementing a language server and this is the problem I am trying to solve. This method can replace some parts of the code that also search for a token at given position manually.
More about TokenAtOffset can be found here.
More about token_at_offset() can be found here.

There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, now I understand the API semantics, it is basically exposing an existing rowan API. But to be honest I find this rowan API a bit convoluted and hard to grasp. That's why I would like to see examples of situations in which this API is used in the language server.
In which language server features do you need this API?
Now suppose that the you have an API that given an offset returns the token located at that offset. For instance, in the rule rule test { condition: true } , the tokens are:
- rulekeyword (offsets: [0-3])
- space (offset: 4)
- testidentifier (offsets: [5-8])
- space (offset: 9)
- {(offset: 10)
- space (offset: 11)
etc.
token_at_offset will return a Token<T> type,  with the token that contains the offset, for instance:
token_at_offset(0) -> rule
token_at_offset(1) -> rule
token_at_offset(2) -> rule
token_at_offset(3) -> rule
token_at_offset(4) -> space
token_at_offset(5) -> test
token_at_offset(6) -> test
token_at_offset(7) -> test
token_at_offset(8) -> test
token_at_offset(9) -> space
token_at_offset(10) -> {
etc.
Does an API like that would work for you? What I would like to understand is if a more complex API is really required and why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which language server features do you need this API?
This method is mainly used in features where it is necessary to process the position from the client request. For example, the current position of the user's cursor in the text for language server features such as Hover or Go to definition.
What I would like to understand is if a more complex API is really required and why.
I don't think a more complex API will be needed in this case. I thought exposing an existing rowan API like I did was a common practice for YARA-X project.
Does an API like that would work for you?
Sure! But I think it should return a Option<Token<T>> type instead of Token<T>. Would this code be acceptable to you? After all, such code should work as you described in your example.
pub fn token_at_offset(&self, offset: u32) -> Option<Token<M>> {
    self.inner.token_at_offset(offset.into()).right_biased().map(Token::new)
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would be ok with an API that returns Option<Token<M>>, what I don't like too much is all that right_biased and left_biased stuff. I don't get why it was necessary in the rowan API, but I find it excessively complex. Maybe there's a reason for it, but if we can solve our problems with a simpler API, I prefer the simpler one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've simplified the API and changed corresponding Doctest in the last commit.
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While looking at this in detail I found something that I don't like in the API design. The token_at_offset function panics if the offset is outside the boundaries of the current node. For instance:
let mut root_node = Parser::new(b"rule test {condition:true}")
    .try_into_cst()
    .unwrap()
    .root();
root_node.token_at_offset(27);   // This panics because the valid range is 0..26.The root node covers all the source code, so passing an offset that is lower than the source code length is enough to avoid a panic. However, this gets tricky when we call token_at_offset from some other node that is not the root. For instance:
let mut root_node = Parser::new(b"rule test {condition:true}")
    .try_into_cst()
    .unwrap()
    .root();
let rule_decl = root_node.first_child().unwrap();
let condition_blk = rule_decl.first_child().unwrap();
condition_blk.token_at_offset(0);   // This panics because the condition block is in the range 11..25 and 0 is outside that rangeI believe that this is a very fragile API that forces the user to be extremely cautious with the offsets they pass to token_at_offset. All the burden is passed to the API user.
One alternative is making the offset relative to the node itself. For the root node the offsets will match absolute source code offsets, but for inner nodes the offsets passed to token_at_offset  will be relative to offset where the node starts. In condition_blk.token_at_offset(0) the returned token will be the condition keyword, because it is the token located at offset 0, counted from the start of the start of the condition block.
Another alternative is keeping the absolute offsets, but returning an error or None instead of panicking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
token_at_offsetfunction panics if the offset is outside the boundaries of the current node.
Yes, I forgot to mention that. Here is the code that panics in such scenarios:
assert!(
    range.start() <= offset && offset <= range.end(),
    "Bad offset: range {:?} offset {:?}",
    range,
    offset
);So yes, it always panics when the offset goes beyond the span.
I believe that this is a very fragile API that forces the user to be extremely cautious with the offsets they pass to
token_at_offset. All the burden is passed to the API user.
Another alternative is keeping the absolute offsets, but returning an error or
Noneinstead of panicking.
I like this alternative with returning None more and it will work for me as well.
Do you think it is necessary to make the offset relative? In my understanding of this API, absolute offset seems more logical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that absolute offsets are easier to understand, I don't think that is a real use-case for relative-offsets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the same check for offset as in rowan library.
If you would like to do the check differently, let me know.
Also changed corresponding Doctest.
b4748f2    to
    810b7a8      
    Compare
  
    Add implementation of `token_at_offset` method for CST `Node`. This method returns the token at a given source code offset.
Add implementation of
token_at_offsetmethod for CSTNodealong withTokenAtOffsetreturn type.This method can be used in language server development.