Commit e296a53

Add documentation for Tree-sitter-based language modules
Updated the language module documentation to clarify core components and added sections for ANTLR and Tree-sitter parsing technologies. Included detailed examples for setting up language modules with Tree-sitter, emphasizing its implementation specifics.
1 parent 77c2bfd commit e296a53

1 file changed

+152
-3
lines changed

docs/4.-Adding-New-Languages.md

Lines changed: 152 additions & 3 deletions
@@ -63,7 +63,7 @@ If a hard-coded (as opposed to dynamically generated) parser library is availabl

 # Language Module Structure

-A language module consists of these parts:
+A language module consists of these core components:

 | Component/Class | Superclass | Function | How to get there |
 |-----------------------------------------|---------------------------|--------------------------------------------------|-------------------------------------------------------------|
@@ -73,8 +73,14 @@ A language module consists of these parts:
 | TokenType class | `de.jplag.TokenType` | contains the language-specific token types | **implement new** |
 | | | |
 | Lexer and Parser | - | transform code into AST | depends on technology |
-| ParserAdapter class | `de.jplag.AbstractParser` | sets up Parser and calls Traverser | depends on technology |
-| Traverser/<br>TraverserListener classes | - | creates tokens traversing the AST | depends on technology |
+| ParserAdapter class | varies by technology | sets up Parser and calls Traverser | depends on technology |
+| Traverser/<br>TraverserListener classes | varies by technology | creates tokens traversing the AST | depends on technology |
+
+## Parser Technology Options
+
+JPlag supports two main parsing technologies for implementing language modules:
+
+### ANTLR-based Language Modules

 For example, if ANTLR is used, the setup is as follows:

@@ -95,6 +101,28 @@ As the table shows, much of a language module can be reused, especially when usi
 - It should still be rather easy to implement the ParserAdapter from the library documentation.
 - Instead of using a listener pattern, the library may require you to do the token extraction in a _Visitor subclass_. In that case, there is only one method call per element, called e.g. `traverseClassDeclaration`. The advantage of this version is that the traversal of the subtrees can be controlled freely. See the Scala language module for an example.

+
+### Tree-sitter-based Language Modules
+
+Tree-sitter provides an alternative parsing approach with native performance and community-maintained grammars. For Tree-sitter-based modules, the setup is as follows:
+
+| Tree-sitter specific parts/files | Superclass/Interface | Function | How to get there |
+|-----------------------------------|---------------------|----------|------------------|
+| Native grammar libraries | - | Platform-specific parsing libraries | Built from Tree-sitter grammar repositories |
+| Tree-sitter Language class | `de.jplag.treesitter.TreeSitterLanguage` | Loads native grammar | **implement new** |
+| Parser class | `de.jplag.treesitter.AbstractTreeSitterParser` | Sets up parser and calls visitor | copy with small adjustments |
+| TokenCollector class | `de.jplag.treesitter.TreeSitterVisitor` | Extracts tokens from AST using visitor pattern | **implement new** |
+
+As with ANTLR modules, much can be reused. The parts to implement specifically for each Tree-sitter language module are:
+- the Tree-sitter Language class (grammar loader)
+- the TokenCollector (visitor implementation), and
+- the TokenTypes.
+
+**Note** about Tree-sitter advantages:
+- Native parsing performance with robust error recovery
+- Community-maintained grammars that stay current with language evolution
+- Cross-platform native library distribution
+
 ## Setting up a new language module with ANTLR

 JPlag provides a small framework to make it easier to implement language modules with ANTLR
@@ -218,6 +246,127 @@ Additional features for rules:
 2. Semantics - Can be passed by using withSemantics after the map call (see CPP language module for examples)
 3. Delegate - To have more precise control over the token position and length a delegated visitor can be used (see Go language module for examples)

+## Setting up a new language module with Tree-sitter
+
+JPlag provides a comprehensive framework for implementing Tree-sitter-based language modules with built-in handler infrastructure and native library management.
+
+### Create the Tree-sitter Language class
+
+Create a class that extends `TreeSitterLanguage` to load your language's native grammar:
+
+```java
+public class TreeSitterYourLanguage extends TreeSitterLanguage {
+    private static final TreeSitterYourLanguage INSTANCE = new TreeSitterYourLanguage();
+    private static final String SYMBOL_NAME = "tree_sitter_your_language";
+
+    private TreeSitterYourLanguage() {
+        // Private constructor for singleton pattern
+    }
+
+    public static MemorySegment language() {
+        return INSTANCE.call();
+    }
+
+    @Override
+    protected NativeLibraryType libraryType() {
+        return NativeLibraryType.TREE_SITTER_YOUR_LANGUAGE; // Add to enum first
+    }
+
+    @Override
+    protected String symbolName() {
+        return SYMBOL_NAME;
+    }
+}
+```
+
+### Create the Parser class
+
+Implement a parser that extends `AbstractTreeSitterParser`:
+
+```java
+public class YourLanguageParser extends AbstractTreeSitterParser {
+    @Override
+    protected MemorySegment getLanguageMemorySegment() {
+        return TreeSitterYourLanguage.language();
+    }
+
+    @Override
+    protected List<Token> extractTokens(File file, Node rootNode) {
+        YourLanguageTokenCollector collector = new YourLanguageTokenCollector(file);
+        collector.traverse(rootNode);
+        List<Token> tokens = collector.getTokens();
+        tokens.add(Token.fileEnd(file));
+        return tokens;
+    }
+}
+```
+
+### Implement the TokenCollector
+
+The TokenCollector extends `TreeSitterVisitor` and uses a handler-based approach:
+
+```java
+public class YourLanguageTokenCollector extends TreeSitterVisitor {
+    private final File file;
+
+    public YourLanguageTokenCollector(File file) {
+        this.file = file;
+    }
+
+    @Override
+    protected void initializeHandlers() {
+        // Map Tree-sitter node types to token creation handlers
+        enterHandlers.put("class_definition", node -> addToken(YourLanguageTokenType.CLASS_BEGIN, node));
+        enterHandlers.put("function_definition", node -> addToken(YourLanguageTokenType.METHOD_BEGIN, node));
+
+        // Add exit handlers for balanced constructs
+        exitHandlers.put("class_definition", node -> addToken(YourLanguageTokenType.CLASS_END, node));
+        exitHandlers.put("function_definition", node -> addToken(YourLanguageTokenType.METHOD_END, node));
+    }
+
+    private void addToken(YourLanguageTokenType tokenType, Node node) {
+        int line = node.getStartPoint().row() + 1;
+        int column = node.getStartPoint().column() + 1;
+        int length = tokenType.getLength();
+
+        tokens.add(new Token(tokenType, file, line, column, line, column + length, length));
+    }
+}
+```
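
The `addToken` helper above turns a Tree-sitter start point into a token position. Tree-sitter reports rows and columns zero-based, which is why the helper adds 1 to both. The following is a minimal sketch of that conversion in isolation. `SketchPoint`, `SketchSpan`, and `PositionSketch` are hypothetical stand-ins introduced only for illustration; they are not JPlag or Tree-sitter binding classes, and the end position assumes (as the helper above does) that the token stays on its start line.

```java
// Hypothetical stand-in for a Tree-sitter point (zero-based row and column).
record SketchPoint(int row, int column) {}

// Hypothetical stand-in for the position data passed to a token (one-based).
record SketchSpan(int startLine, int startColumn, int endLine, int endColumn, int length) {}

final class PositionSketch {
    // Converts a zero-based start point and a fixed token length into a
    // one-based span that ends on the same line it starts on.
    static SketchSpan toSpan(SketchPoint start, int length) {
        int line = start.row() + 1;
        int column = start.column() + 1;
        return new SketchSpan(line, column, line, column + length, length);
    }

    public static void main(String[] args) {
        // A node starting at row 0, column 4 with a fixed token length of 5.
        System.out.println(toSpan(new SketchPoint(0, 4), 5));
    }
}
```

Because the length comes from the token type rather than from the node, the end column is only an approximation of where the construct really ends in the source.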
+
+### Implement the token type enum
+
+This is similar to ANTLR modules, but includes length information for performance:
+
+```java
+public enum YourLanguageTokenType implements TokenType {
+    CLASS_BEGIN("CLASS{", 5), // class keyword
+    CLASS_END("}CLASS", 5), // end of class
+    METHOD_BEGIN("METHOD{", 3), // def/function keyword
+    METHOD_END("}METHOD", 3), // end of method
+    ; // Add more tokens above as needed; the semicolon ends the constant list
+
+    private final String description;
+    private final int length;
+
+    YourLanguageTokenType(String description, int length) {
+        this.description = description;
+        this.length = length;
+    }
+
+    @Override
+    public String getDescription() {
+        return description;
+    }
+
+    public int getLength() {
+        return length;
+    }
+}
+```
+
+The handler-based approach eliminates boilerplate code and provides a clean, declarative way to map Tree-sitter node types to JPlag tokens. The abstract base class handles traversal and token collection automatically.
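
To make the handler-based approach concrete, here is a self-contained sketch of how a base class could drive enter and exit handlers during a depth-first traversal. It is an illustration under stated assumptions, not the actual `TreeSitterVisitor` implementation: `SketchNode`, `SketchVisitor`, `SketchCollector`, and `HandlerSketch` are hypothetical names, the node is a plain record instead of a native Tree-sitter node, and tokens are plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for a syntax tree node: a type name plus children.
record SketchNode(String type, List<SketchNode> children) {
    static SketchNode of(String type, SketchNode... children) {
        return new SketchNode(type, List.of(children));
    }
}

// Hypothetical base class: owns the handler maps and the traversal.
abstract class SketchVisitor {
    protected final Map<String, Consumer<SketchNode>> enterHandlers = new HashMap<>();
    protected final Map<String, Consumer<SketchNode>> exitHandlers = new HashMap<>();
    protected final List<String> tokens = new ArrayList<>();
    private boolean initialized;

    protected abstract void initializeHandlers();

    public final void traverse(SketchNode root) {
        if (!initialized) { // register handlers once, before the first traversal
            initializeHandlers();
            initialized = true;
        }
        visit(root);
    }

    // Depth-first walk: enter handler, then children, then exit handler.
    private void visit(SketchNode node) {
        Consumer<SketchNode> enter = enterHandlers.get(node.type());
        if (enter != null) {
            enter.accept(node);
        }
        for (SketchNode child : node.children()) {
            visit(child);
        }
        Consumer<SketchNode> exit = exitHandlers.get(node.type());
        if (exit != null) {
            exit.accept(node);
        }
    }

    public List<String> getTokens() {
        return tokens;
    }
}

// A language-specific collector only declares the node-type-to-token mapping.
final class SketchCollector extends SketchVisitor {
    @Override
    protected void initializeHandlers() {
        enterHandlers.put("class_definition", node -> tokens.add("CLASS{"));
        exitHandlers.put("class_definition", node -> tokens.add("}CLASS"));
        enterHandlers.put("function_definition", node -> tokens.add("METHOD{"));
        exitHandlers.put("function_definition", node -> tokens.add("}METHOD"));
    }
}

final class HandlerSketch {
    static List<String> collect(SketchNode root) {
        SketchCollector collector = new SketchCollector();
        collector.traverse(root);
        return collector.getTokens();
    }

    public static void main(String[] args) {
        SketchNode tree = SketchNode.of("module",
                SketchNode.of("class_definition",
                        SketchNode.of("function_definition", SketchNode.of("block"))));
        System.out.println(collect(tree)); // [CLASS{, METHOD{, }METHOD, }CLASS]
    }
}
```

Node types without a registered handler (here `module` and `block`) produce no token but are still traversed, so nested constructs are found without any per-node visit method.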
+
 ## Basic procedure outline

 ```mermaid
