Sigma42
Contributor

@Sigma42 Sigma42 commented Oct 1, 2024

This fixes #40 by using ByteOrder.nativeOrder().

I changed InputEncoding.valueOf to accept StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE and StandardCharsets.UTF_16 but ignore the endianness (to stay backwards compatible, e.g. on a big-endian machine).
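
For readers following along, here is a minimal sketch of the idea described above. It is not the code from this PR: the class NativeUtf16Sketch, the nested Encoding enum, and the fromCharset and nativeUtf16 helpers are hypothetical names made up for illustration. Only ByteOrder.nativeOrder() and the StandardCharsets constants are real JDK APIs. The sketch assumes the UTF-16 variant should always encode in the platform's native byte order, and that all three UTF-16 charsets map to that one variant.

```java
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NativeUtf16Sketch {
    // Hypothetical helper: pick the BOM-less UTF-16 charset that matches
    // the byte order of the machine the JVM is running on.
    static Charset nativeUtf16() {
        return ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN
                ? StandardCharsets.UTF_16LE
                : StandardCharsets.UTF_16BE;
    }

    // Hypothetical stand-in for an encoding enum; not the library's actual type.
    enum Encoding {
        UTF_8(StandardCharsets.UTF_8),
        UTF_16(nativeUtf16());

        private final Charset charset;

        Encoding(Charset charset) {
            this.charset = charset;
        }

        Charset charset() {
            return charset;
        }

        // Accept any of the three UTF-16 charsets and ignore their endianness.
        static Encoding fromCharset(Charset cs) {
            if (cs.equals(StandardCharsets.UTF_8)) {
                return UTF_8;
            }
            if (cs.equals(StandardCharsets.UTF_16)
                    || cs.equals(StandardCharsets.UTF_16BE)
                    || cs.equals(StandardCharsets.UTF_16LE)) {
                return UTF_16;
            }
            throw new IllegalArgumentException("Unsupported charset: " + cs);
        }
    }

    public static void main(String[] args) {
        Encoding enc = Encoding.fromCharset(StandardCharsets.UTF_16BE);
        byte[] bytes = "a".getBytes(enc.charset());
        // The charset printed here depends on the machine's byte order.
        System.out.println(enc + " -> " + enc.charset() + ", " + bytes.length + " bytes");
    }
}
```

Because UTF_16LE and UTF_16BE never emit a byte-order mark, the encoded buffer contains only code units, which is what a parser that ignores the BOM would expect.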

I also changed the parseUtf8 and parseUtf16 tests to verify that their small code examples are actually parsed correctly.

Member

@ObserverOfTime ObserverOfTime left a comment

Isn't StandardCharsets.UTF_16 enough on its own?

@Sigma42
Contributor Author

Sigma42 commented Oct 1, 2024

Not as a value for the enum variant InputEncoding.UTF_16, because String.getBytes(StandardCharsets.UTF_16) may return any byte order; it just includes a byte-order mark at the start. (In my testing with OpenJDK on Linux x86 it still returned a big-endian encoded string, just with the added BOM.) And tree-sitter ignores the BOM.
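
A small sketch for reproducing the observation above, using only JDK classes. Utf16BomDemo and its hex helper are hypothetical names for illustration. The JDK's Charset documentation states that, when encoding, the UTF-16 charset uses big-endian byte order and writes a big-endian byte-order mark, which matches the result reported here. The UTF_16BE and UTF_16LE charsets write no BOM.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    // Hypothetical helper: render bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to avoid sign extension of negative byte values.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // BOM (FE FF) followed by big-endian code units.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 61
        // No BOM, big-endian code units.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16BE))); // 00 61
        // No BOM, little-endian code units.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16LE))); // 61 00
    }
}
```

On a little-endian machine, a consumer that skips the BOM and reads the remaining bytes in native order would therefore misread the UTF_16 output, which is why the byte order has to be chosen explicitly.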

But for InputEncoding.valueOf it might make more sense to only allow StandardCharsets.UTF_16.
I only kept StandardCharsets.UTF_16BE so that it would not be a breaking change, but I guess it was probably never used anyway because it would only have worked on big-endian machines.

@ObserverOfTime
Member

See tree-sitter/tree-sitter#3740

@ObserverOfTime ObserverOfTime merged commit 6ce00b5 into tree-sitter:master Oct 7, 2024
1 check failed

Development

Successfully merging this pull request may close these issues.

UTF-16 parsing may use wrong byte order

2 participants