holtvogt (Contributor) commented Aug 7, 2025

JPlag's current Python support relies on ANTLR grammars that struggle with modern Python syntax (3.10+ match statements, 3.11+ exception groups, 3.12+ type aliases) and require maintaining custom grammars that lag behind language evolution.

This PR introduces Tree-sitter as JPlag's new parsing foundation, starting with Python. Tree-sitter provides native parsing performance, community-maintained grammars that stay current with language specs, and cross-platform native library distribution.

Technical Notes

  • Requires JDK 22 for the Foreign Function and Memory (FFM) API, plus jextract to generate Java bindings from the Tree-sitter C headers
  • Depends on Zig build support in Tree-sitter repositories for Windows library compilation

Architecture Changes

New language-tree-sitter-utils module

  • Uses Java's FFM API bindings for Tree-sitter native libraries
  • Introduces an abstract base class for future Tree-sitter language implementations
  • Enables cross-platform native library loading

First Tree-sitter Python language module implementation

  • PythonParser: Orchestrates parsing and delegates to token collector
  • PythonTokenCollector: AST visitor that maps Tree-sitter nodes to JPlag tokens
  • TreeSitterPython: Singleton grammar loader using FFM API
  • Handler-based token extraction with support for nested constructs
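
The handler-based extraction described above could look roughly like the following. This is a minimal sketch, not the PR's code: SketchNode stands in for a Tree-sitter node, and SketchVisitor, SketchPythonCollector, and the string token names are hypothetical. The node type names are assumed to match the Tree-sitter Python grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Tree-sitter node: a type name plus child nodes.
record SketchNode(String type, List<SketchNode> children) {
    static SketchNode of(String type, SketchNode... children) {
        return new SketchNode(type, List.of(children));
    }
}

// Template method: visit() fixes the traversal order, while the handler maps
// supplied by subclasses decide which tokens are emitted for which node types.
abstract class SketchVisitor {
    private final List<String> tokens = new ArrayList<>();

    protected abstract Map<String, String> beginTokens();

    protected abstract Map<String, String> endTokens();

    final void visit(SketchNode node) {
        String begin = beginTokens().get(node.type());
        if (begin != null) {
            tokens.add(begin);
        }
        // Children are visited between the begin and end token, so nesting is preserved.
        for (SketchNode child : node.children()) {
            visit(child);
        }
        String end = endTokens().get(node.type());
        if (end != null) {
            tokens.add(end);
        }
    }

    final List<String> collectedTokens() {
        return List.copyOf(tokens);
    }
}

// Hypothetical Python collector: only the two handler maps are language-specific.
class SketchPythonCollector extends SketchVisitor {
    private static final Map<String, String> BEGIN = Map.of(
            "class_definition", "CLASS_BEGIN",
            "function_definition", "METHOD_BEGIN",
            "assignment", "ASSIGN");
    private static final Map<String, String> END = Map.of(
            "class_definition", "CLASS_END",
            "function_definition", "METHOD_END");

    @Override
    protected Map<String, String> beginTokens() {
        return BEGIN;
    }

    @Override
    protected Map<String, String> endTokens() {
        return END;
    }
}
```

Keeping the traversal in the base class means a new language module would only need to supply its own maps, which is one plausible reading of "reduces boilerplate" in the commit notes.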

Native library build system

  • GitHub Actions workflow for cross-platform compilation (Linux/macOS/Windows)
  • Local native library compilation via mvn -Pbuild-native-libraries generate-resources
  • Supports versioning of native Tree-sitter libraries to enable language modules to run on different Tree-sitter releases

Testing

# Build native libraries locally
mvn -Pbuild-native-libraries generate-resources

# Test new Python module
mvn clean test-compile -pl :python -am
mvn test -pl :python

# Test CLI integration
mvn -Pwith-report-viewer clean package assembly:single
java -jar cli/target/jplag-*-jar-with-dependencies.jar -l python [files]

holtvogt and others added 30 commits May 21, 2025 22:46
  • Convert to abstract class with handler maps and template method pattern. Reduces boilerplate and enforces consistent visitor implementations.
  • The "Adapter" suffix was misleading as these classes are direct parsers, not adapters between incompatible interfaces. Updated all references and documentation to reflect the cleaner, more accurate naming.
  • Removed redundant token list initialization from PythonTokenCollector and added a centralized token list in TreeSitterVisitor. Introduced a method to retrieve collected tokens.
  • Updated the language module documentation to clarify core components and added sections for ANTLR and Tree-sitter parsing technologies. Included detailed examples for setting up language modules with Tree-sitter, emphasizing its implementation specifics.
@holtvogt marked this pull request as ready for review September 2, 2025 21:03
Comment on lines +151 to +156
    private boolean isEndToken(PythonTokenType tokenType) {
        return switch (tokenType) {
            case CLASS_END, METHOD_END, WHILE_END, FOR_END, TRY_END, EXCEPT_END, FINALLY_END, IF_END, DECORATOR_END, WITH_END, MATCH_END -> true;
            default -> false;
        };
    }
Contributor Author

Similar to getLength(), should we move this into PythonTokenType?

Member

IMO, yes, to provide a more object-oriented implementation.
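
One way to follow that suggestion, sketched under assumptions: the `_END` constants mirror names from the switch above, while the enum name, the `_BEGIN` and `ASSIGN` constants, the constructor flag, and the `isEndToken()` accessor are hypothetical and not taken from the PR.

```java
// Hypothetical, trimmed-down token type enum in which each constant
// knows whether it closes a block, so callers no longer need a switch.
enum PythonTokenTypeSketch {
    CLASS_BEGIN(false),
    CLASS_END(true),
    METHOD_BEGIN(false),
    METHOD_END(true),
    IF_BEGIN(false),
    IF_END(true),
    ASSIGN(false);

    private final boolean endToken;

    PythonTokenTypeSketch(boolean endToken) {
        this.endToken = endToken;
    }

    /** Returns true if this token type marks the end of a nested construct. */
    boolean isEndToken() {
        return endToken;
    }
}
```

With this shape, a call such as `isEndToken(tokenType)` in the collector would become `tokenType.isEndToken()`, and adding a new end token only touches the enum.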

@tsaglam added the enhancement, major, and language labels Sep 4, 2025
@robinmaisch requested a review from a team September 4, 2025 13:52
@tsaglam requested a review from Copilot September 15, 2025 06:55
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds Tree-sitter-based Python language support to JPlag. It covers modern Python syntax (3.10+ match statements, 3.11+ exception groups, 3.12+ type aliases) by parsing through native libraries and community-maintained grammars.

  • Implements new language-tree-sitter-utils module with abstract base classes and native library management
  • Adds Tree-sitter Python language module with token extraction for all Python constructs
  • Updates Java compiler target from JDK 21 to JDK 22 for Foreign Function and Memory API support

Reviewed Changes

Copilot reviewed 52 out of 53 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| scripts/build-native-libraries.sh | Cross-platform native library build script for Tree-sitter grammars |
| pom.xml | Updates Java version to 22 and adds Tree-sitter dependencies and build profile |
| language-tree-sitter-utils/ | New module with abstract base classes for Tree-sitter language implementations |
| languages/python/ | Complete Tree-sitter Python implementation with parser, token collector, and tests |
| CI workflows | Updates Java version and adds native library build steps |

"windows")
    # Zig puts output in zig-out/lib/ directory and names it $library_name.dll
    if [ -f "zig-out/bin/$library_name.dll" ]; then
        # Hence we need to rename it to $LIBRARY_NAME.dll
Copilot AI Sep 15, 2025

The comment on line 175 doesn't match the code logic. The comment says "Hence we need to rename it" but the variable names suggest the opposite - $library_name (lowercase with hyphens) is being renamed to $LIBRARY_NAME (with lib prefix). Update the comment to accurately describe the renaming from the build output name to the expected library name format.

Suggested change
- # Hence we need to rename it to $LIBRARY_NAME.dll
+ # Rename the build output ($library_name.dll) to the expected format ($LIBRARY_NAME.dll)

Comment on lines +1 to 5
+ import { ParserLanguage, type Language } from '@/model/Language'
  import hljs from 'highlight.js'
- import scheme from 'highlight.js/lib/languages/scheme'
  import llvm from 'highlight.js/lib/languages/llvm'
+ import scheme from 'highlight.js/lib/languages/scheme'
  import typescript from 'highlight.js/lib/languages/typescript'
Copilot AI Sep 15, 2025

[nitpick] The imports are not consistently ordered. Consider organizing imports alphabetically within each group (external libraries first, then local imports) for better maintainability. Move the hljs import before the specific language imports.

@@ -43,4 +44,4 @@ function getLanguageParser(language: string): Language {
    return 'unknown language'
  }

- export { ParserLanguage, type Language, getLanguageParser }
+ export { ParserLanguage, getLanguageParser, type Language }
Copilot AI Sep 15, 2025

[nitpick] The export order change appears inconsistent with the import order change in CodeHighlighter.ts. Consider maintaining a consistent ordering pattern across the codebase - either alphabetical or by type (functions before types).

Suggested change
- export { ParserLanguage, getLanguageParser, type Language }
+ export { getLanguageParser, ParserLanguage, type Language }

Comment on lines +74 to +77
// StringTemplate was introduced in Java 21, but removed in Java 22
// collector.testFile("StringConcat.java", "StringTemplate.java").testSourceCoverage().testTokenSequence(J_CLASS_BEGIN,
// J_METHOD_BEGIN, J_VARDEF,
// J_VARDEF, J_VARDEF, J_APPLY, J_METHOD_END, J_CLASS_END);
Copilot AI Sep 15, 2025

Instead of commenting out the test, consider using @Disabled annotation with a reason or conditional test execution based on Java version. This provides better test documentation and maintains the test structure for potential future re-enablement.
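
In JUnit 5 this is typically expressed with `@Disabled("reason")` or a JRE condition such as `@EnabledForJreRange`; whether either fits JPlag's test setup is not shown in this PR. The version check such a condition relies on can be sketched with the standard library alone. The `JavaFeatureGate` class and its method names below are hypothetical.

```java
// Hypothetical helper: decides whether a version-specific test should run,
// based on the feature release (e.g. 21, 22) of a JVM.
final class JavaFeatureGate {
    private JavaFeatureGate() {
    }

    /** Returns true if featureRelease lies in [minInclusive, maxInclusive]. */
    static boolean inRange(int featureRelease, int minInclusive, int maxInclusive) {
        return featureRelease >= minInclusive && featureRelease <= maxInclusive;
    }

    /** Applies the same range check to the JVM this code is running on. */
    static boolean currentJvmInRange(int minInclusive, int maxInclusive) {
        return inRange(Runtime.version().feature(), minInclusive, maxInclusive);
    }
}
```

A test could consult `JavaFeatureGate.currentJvmInRange(...)` in an assumption and skip itself with a stated reason, which keeps the test visible in reports instead of hiding it in a comment.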

Comment on lines +121 to 128
+ <!-- Tree-sitter -->
  <dependency>
-     <groupId>net.sourceforge.argparse4j</groupId>
-     <artifactId>argparse4j</artifactId>
-     <version>0.9.0</version>
+     <groupId>io.github.tree-sitter</groupId>
+     <artifactId>jtreesitter</artifactId>
+     <version>${java-tree-sitter.version}</version>
  </dependency>

  <!-- CoreNLP -->
Copilot AI Sep 15, 2025

[nitpick] The dependency management section has inconsistent organization. The Tree-sitter dependency is placed between ANTLR and CoreNLP dependencies, breaking the logical grouping. Consider placing it in a separate Tree-sitter section or moving it to maintain better logical organization of related dependencies.

Labels
  • enhancement: Issue/PR that involves features, improvements and other changes
  • language: PR / Issue deals (partly) with new and/or existing languages for JPlag
  • major: Major issue/feature/contribution/change