-
Notifications
You must be signed in to change notification settings - Fork 351
Add Tree-Sitter-based Language Module for Python #2548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feature/tree-sitter-parser-integration
Are you sure you want to change the base?
Add Tree-Sitter-based Language Module for Python #2548
Conversation
Convert to abstract class with handler maps and template method pattern. Reduces boilerplate and enforces consistent visitor implementations.
The "Adapter" suffix was misleading as these classes are direct parsers, not adapters between incompatible interfaces. Updated all references and documentation to reflect the cleaner, more accurate naming.
Removed redundant token list initialization from PythonTokenCollector and added a centralized token list in TreeSitterVisitor. Introduced a method to retrieve collected tokens.
Updated the language module documentation to clarify core components and added sections for ANTLR and Tree-sitter parsing technologies. Included detailed examples for setting up language modules with Tree-sitter, emphasizing its implementation specifics.
private boolean isEndToken(PythonTokenType tokenType) { | ||
return switch (tokenType) { | ||
case CLASS_END, METHOD_END, WHILE_END, FOR_END, TRY_END, EXCEPT_END, FINALLY_END, IF_END, DECORATOR_END, WITH_END, MATCH_END -> true; | ||
default -> false; | ||
}; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to getLength()
, should we move this into PythonTokenType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, yes, to provide a more object-oriented implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds comprehensive Tree-sitter-based Python language support to JPlag, providing modern Python syntax support (3.10+ match statements, 3.11+ exception groups, 3.12+ type aliases) through native parsing performance and community-maintained grammars.
- Implements new
language-tree-sitter-utils
module with abstract base classes and native library management - Adds Tree-sitter Python language module with token extraction for all Python constructs
- Updates Java compiler target from JDK 21 to JDK 22 for Foreign Function and Memory API support
Reviewed Changes
Copilot reviewed 52 out of 53 changed files in this pull request and generated 5 comments.
Show a summary per file
File | Description |
---|---|
scripts/build-native-libraries.sh |
Cross-platform native library build script for Tree-sitter grammars |
pom.xml |
Updates Java version to 22 and adds Tree-sitter dependencies and build profile |
language-tree-sitter-utils/ |
New module with abstract base classes for Tree-sitter language implementations |
languages/python/ |
Complete Tree-sitter Python implementation with parser, token collector, and tests |
CI workflows | Updates Java version and adds native library build steps |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
"windows") | ||
# Zig puts output in zig-out/lib/ directory and names it $library_name.dll | ||
if [ -f "zig-out/bin/$library_name.dll" ]; then | ||
# Hence we need to rename it to $LIBRARY_NAME.dll |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment on line 175 doesn't match the code logic. The comment says "Hence we need to rename it" but the variable names suggest the opposite - $library_name
(lowercase with hyphens) is being renamed to $LIBRARY_NAME
(with lib prefix). Update the comment to accurately describe the renaming from the build output name to the expected library name format.
# Hence we need to rename it to $LIBRARY_NAME.dll | |
# Rename the build output ($library_name.dll) to the expected format ($LIBRARY_NAME.dll) |
Copilot uses AI. Check for mistakes.
import { ParserLanguage, type Language } from '@/model/Language' | ||
import hljs from 'highlight.js' | ||
import scheme from 'highlight.js/lib/languages/scheme' | ||
import llvm from 'highlight.js/lib/languages/llvm' | ||
import scheme from 'highlight.js/lib/languages/scheme' | ||
import typescript from 'highlight.js/lib/languages/typescript' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The imports are not consistently ordered. Consider organizing imports alphabetically within each group (external libraries first, then local imports) for better maintainability. Move the hljs
import before the specific language imports.
Copilot uses AI. Check for mistakes.
@@ -43,4 +44,4 @@ function getLanguageParser(language: string): Language { | |||
return 'unknown language' | |||
} | |||
|
|||
export { ParserLanguage, type Language, getLanguageParser } | |||
export { ParserLanguage, getLanguageParser, type Language } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The export order change appears inconsistent with the import order change in CodeHighlighter.ts. Consider maintaining a consistent ordering pattern across the codebase - either alphabetical or by type (functions before types).
export { ParserLanguage, getLanguageParser, type Language } | |
export { getLanguageParser, ParserLanguage, type Language } |
Copilot uses AI. Check for mistakes.
// StringTemplate was introduced in Java 21, but removed in Java 22 | ||
// collector.testFile("StringConcat.java", "StringTemplate.java").testSourceCoverage().testTokenSequence(J_CLASS_BEGIN, | ||
// J_METHOD_BEGIN, J_VARDEF, | ||
// J_VARDEF, J_VARDEF, J_APPLY, J_METHOD_END, J_CLASS_END); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of commenting out the test, consider using @Disabled
annotation with a reason or conditional test execution based on Java version. This provides better test documentation and maintains the test structure for potential future re-enablement.
Copilot uses AI. Check for mistakes.
<!-- Tree-sitter --> | ||
<dependency> | ||
<groupId>net.sourceforge.argparse4j</groupId> | ||
<artifactId>argparse4j</artifactId> | ||
<version>0.9.0</version> | ||
<groupId>io.github.tree-sitter</groupId> | ||
<artifactId>jtreesitter</artifactId> | ||
<version>${java-tree-sitter.version}</version> | ||
</dependency> | ||
|
||
<!-- CoreNLP --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The dependency management section has inconsistent organization. The Tree-sitter dependency is placed between ANTLR and CoreNLP dependencies, breaking the logical grouping. Consider placing it in a separate Tree-sitter section or moving it to maintain better logical organization of related dependencies.
Copilot uses AI. Check for mistakes.
JPlag's current Python support relies on ANTLR grammars that struggle with modern Python syntax (3.10+
match
statements, 3.11+ exception groups, 3.12+type
aliases) and require maintaining custom grammars that lag behind language evolution.This PR introduces Tree-sitter as JPlag's new parsing foundation, starting with Python. Tree-sitter provides native parsing performance, community-maintained grammars that stay current with language specs, and cross-platform native library distribution.
Technical Notes
Java JDK 22
for Java's Foreign Function and Memory (FFM) API andjextract
to generate Java bindings from Tree-sitter C headersZig
build support in Tree-sitter repositories for Windows library compilationArchitecture Changes
New
language-tree-sitter-utils
moduleFirst Tree-sitter Python language module implementation
PythonParser
: Orchestrates parsing and delegates to token collectorPythonTokenCollector
: AST visitor that maps Tree-sitter nodes to JPlag tokensTreeSitterPython
: Singleton grammar loader using FFM APINative library build system
mvn -Pbuild-native-libraries generate-resources
Testing