
Conversation

ObserverOfTime (Member)

No description provided.

@Marcono1234 (Contributor) left a comment:

Given that this is a Draft, I assume you are aware that there are issues with the current implementation. Hopefully my comments are useful nonetheless; if not, feel free to ignore them.

If you want to support only UTF-32 and no other charsets, then a handwritten decoder might be easiest. The more general question is probably how common use cases with other charsets are and, if they exist, which charsets are actually used.
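For reference, a handwritten UTF-32 decoder really is small. The following is only a rough sketch and not code from this PR; the ByteBuffer/int[] parameters, the little-endian assumption, and the convention of returning 0 on failure are made up for illustration and would have to be adapted to whatever jtreesitter's DecodeFunction actually expects:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Utf32Decode {

    /**
     * Reads one UTF-32LE code point from the current position of {@code input}.
     * Returns the number of bytes consumed (always 4 on success), writing the
     * code point to {@code out[0]}; returns 0 if there is not enough input or
     * the value is not a valid Unicode code point.
     */
    static int decodeUtf32Le(ByteBuffer input, int[] out) {
        if (input.remaining() < 4) {
            return 0;  // not enough bytes left for a complete UTF-32 unit
        }
        int value = input.order(ByteOrder.LITTLE_ENDIAN).getInt(input.position());
        // Reject values outside U+0000..U+10FFFF as well as the surrogate range U+D800..U+DFFF.
        if (!Character.isValidCodePoint(value)
                || (value >= Character.MIN_SURROGATE && value <= Character.MAX_SURROGATE)) {
            return 0;
        }
        out[0] = value;
        return 4;  // UTF-32 is fixed width: every code point occupies exactly 4 bytes
    }
}
```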

Otherwise, if you want to support arbitrary charsets, it probably becomes more difficult.
In general I think custom charset support could work by using CharsetDecoder#decode(ByteBuffer,CharBuffer,boolean) with an output buffer sized so that the decoder can decode at most one code point, and then measuring how many bytes were consumed from the input buffer.

However, this becomes quite convoluted, because to be completely correct you would probably have to (see the sketch after this list):

  • first try to decode into an output buffer which fits 1 char
  • if that fails with coderResult.isOverflow() without producing any output, try again with a buffer which fits 2 chars (a surrogate pair)
  • maybe account for charsets which behave weirdly, e.g. ones that
    • report isOverflow() but already wrote one char
    • write 2 chars which are not actually a surrogate pair
    • write 1 char which is a lone surrogate
    • write a prefix (similar to a BOM) or suffix
  • perform the full decoding operation as described by the CharsetDecoder docs
    That is, once there is no further input, call decode(..., ..., true) and eventually flush. At the very least, if tree-sitter or jtreesitter cannot handle additional decoded data at the end but the decoder produced some, throw an exception to avoid silently incorrect behavior.
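To make the list above concrete, here is a rough sketch of such a decode step, not a finished implementation: the method name, the int[] out parameter, and the 0-on-failure convention are invented for illustration, and the final decode(..., ..., true)/flush() step from the last bullet is omitted.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

final class OneCodePointDecoder {

    /**
     * Tries to decode a single code point from {@code input}. On success the code
     * point is stored in {@code out[0]} and the number of consumed bytes is
     * returned; otherwise 0 is returned and the input position is restored.
     */
    static int decodeOne(CharsetDecoder decoder, ByteBuffer input, int[] out) {
        decoder.reset()
               .onMalformedInput(CodingErrorAction.REPORT)
               .onUnmappableCharacter(CodingErrorAction.REPORT);

        int start = input.position();

        // First attempt: an output buffer which fits only 1 char.
        CharBuffer chars = CharBuffer.allocate(1);
        CoderResult result = decoder.decode(input, chars, false);

        if (result.isOverflow() && chars.position() == 0) {
            // The code point needs 2 chars (a surrogate pair); retry with a larger buffer.
            input.position(start);
            decoder.reset();
            chars = CharBuffer.allocate(2);
            decoder.decode(input, chars, false);
        }

        if (chars.position() == 0) {
            // Malformed/unmappable input, or not enough bytes for a complete code point.
            input.position(start);
            return 0;
        }

        chars.flip();
        out[0] = Character.codePointAt(chars, 0);
        return input.position() - start;  // bytes consumed for this one code point
    }
}
```

This also deliberately ignores the "weird charset" cases listed above, such as a decoder that writes 2 chars which are not actually a surrogate pair.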

It is also not completely clear to me how tree-sitter's DecodeFunction is supposed to behave:

  • Does it provide all bytes, or could the bytes be incomplete (e.g. when using a custom reading function)? In that case, what should the decode function return when there are too few bytes to decode a code point?
  • Are there guarantees for how long the DecodeFunction is called? Until it returns 0? Or is it not even called anymore when length becomes 0 (regardless of whether the decode function returned 0 in the past)?

(string, length, code_point) -> {
    if (length == 0) return 0;
    var buffer = string.asSlice(0, length).asByteBuffer();
    var decoded = encoding.charset().decode(buffer);
@Marcono1234 (Contributor) commented on Sep 11, 2025:

Calling decode(buffer) here decodes all bytes, even though just the first code point is needed. This is probably rather inefficient.

Also, as mentioned in its documentation, Charset#decode uses CodingErrorAction.REPLACE, which might not be desired? (On the other hand, I am not sure how the Java FFM API behaves when an upcall throws an exception; maybe it exits the JVM, see Linker#upcallStub.)
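To make the REPLACE vs. REPORT difference concrete, here is a small stand-alone example. UTF-8 is used only because a malformed byte sequence is easy to construct for it; this is not meant as code for the PR:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeErrorDemo {
    public static void main(String[] args) {
        byte[] malformed = {(byte) 0xFF, 'a'};  // 0xFF is never valid in UTF-8

        // Charset#decode silently replaces malformed input with U+FFFD.
        var replaced = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(malformed));
        System.out.println(replaced);  // prints the replacement character followed by "a"

        // A decoder configured with REPORT throws instead, so the caller can react
        // (keeping in mind that the exception must not escape the FFM upcall).
        var strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(malformed));
        } catch (CharacterCodingException e) {
            System.out.println("malformed input reported: " + e);
        }
    }
}
```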

if (length == 0) return 0;
var buffer = string.asSlice(0, length).asByteBuffer();
var decoded = encoding.charset().decode(buffer);
code_point.set(C_INT, 0, decoded.charAt(0));
@Marcono1234 (Contributor) commented:

charAt(0) only retrieves a single 16-bit Java char. For supplementary code points (>= U+10000) that will only be the high surrogate.
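A quick stand-alone illustration of the difference (the charset and the code point are arbitrary examples):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1F600 is a supplementary code point: one code point, but two Java chars.
        byte[] utf8 = new String(Character.toChars(0x1F600)).getBytes(StandardCharsets.UTF_8);
        var decoded = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8));

        // charAt(0) yields only the high surrogate ...
        System.out.printf("charAt(0):      U+%04X%n", (int) decoded.charAt(0));
        // ... while codePointAt(0) combines the surrogate pair into the full code point.
        System.out.printf("codePointAt(0): U+%04X%n", Character.codePointAt(decoded, 0));
    }
}
```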

var buffer = string.asSlice(0, length).asByteBuffer();
var decoded = encoding.charset().decode(buffer);
code_point.set(C_INT, 0, decoded.charAt(0));
return (int)encoder.maxBytesPerChar();
@Marcono1234 (Contributor) commented:

maxBytesPerChar() only works if the charset uses a fixed number of bytes per code point. It would not work for a charset like UTF-8 (if that didn't have built-in support in tree-sitter), where a code point uses a variable number of bytes.
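A small stand-alone demonstration, using UTF-8 purely as an example of a variable-width charset: the number of bytes consumed has to be measured per code point (for example via the input buffer's position), because a fixed value like maxBytesPerChar() cannot be correct for every code point:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class ConsumedBytesDemo {
    public static void main(String[] args) {
        // The encoder's maxBytesPerChar() is a fixed upper bound; it says nothing
        // about how many bytes the *current* code point occupies.
        System.out.println("maxBytesPerChar: " + StandardCharsets.UTF_8.newEncoder().maxBytesPerChar());

        var decoder = StandardCharsets.UTF_8.newDecoder();
        for (String s : new String[] {"a", "é", "€"}) {
            ByteBuffer in = ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
            decoder.reset().decode(in, CharBuffer.allocate(2), true);
            // The bytes actually consumed can be read off the input buffer's position.
            System.out.println("'" + s + "' consumed " + in.position() + " byte(s)");
        }
    }
}
```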

@DisplayName("parse(custom)")
void parseCustom() {
    parser.setLanguage(language);
    var encoding = InputEncoding.valueOf(StandardCharsets.UTF_32);
@Marcono1234 (Contributor) commented:

Related to https://github.com/tree-sitter/java-tree-sitter/pull/136/files#r2341779397, this test only works because UTF-32 uses a fixed number of bytes per code point.

And I am also not completely sure whether this works properly, since for these surrogate pairs probably only the high surrogate is actually read.
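A stand-alone check of both observations, without involving jtreesitter at all (Charset.forName("UTF-32") is used here only to keep the snippet independent of the JDK version; the test itself uses StandardCharsets.UTF_32):

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class Utf32WidthDemo {
    public static void main(String[] args) {
        Charset utf32 = Charset.forName("UTF-32");

        // U+1F600 is a single code point, but a surrogate pair (2 chars) in a Java String.
        String supplementary = new String(Character.toChars(0x1F600));

        byte[] encoded = supplementary.getBytes(utf32);
        System.out.println("UTF-32 bytes for one code point: " + encoded.length);  // 4, even for supplementary code points

        var decoded = utf32.decode(ByteBuffer.wrap(encoded));
        System.out.println("decoded UTF-16 chars: " + decoded.length());  // 2: the surrogate pair
        System.out.printf("codePointAt(0): U+%X%n", Character.codePointAt(decoded, 0));  // 1F600
    }
}
```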
