Writing the Grammar
Writing a grammar requires creativity. Infinitely many context-free grammars (CFGs) can describe any given language. To produce a good Tree-sitter parser, you need a grammar with two important properties:
Intuitive structure - Direct correspondence between grammar symbols and recognizable language constructs
Close adherence to LR(1) - Efficient parsing with minimal conflicts
Tree-sitter produces a concrete syntax tree where each node corresponds to a grammar symbol. Your grammar structure directly affects the tree structure.
Starting Your Grammar
The First Few Rules
Find a formal specification for your language. As you read through the context-free grammar, you’ll discover a complex graph of relationships. Start by creating structure for basic groups:
Declarations - Top-level constructs
Definitions - Function, class, variable definitions
Statements - Executable code
Expressions - Values and operations
Types - Type annotations
Patterns - Pattern matching constructs
Breadth-First Approach
For a language like Go, start with a skeleton:
export default grammar({
  name: 'go',
  rules: {
    source_file: $ => repeat($._definition),
    _definition: $ => choice(
      $.function_definition
      // TODO: other kinds of definitions
    ),
    function_definition: $ => seq(
      'func',
      $.identifier,
      $.parameter_list,
      $._type,
      $.block
    ),
    parameter_list: $ => seq(
      '(',
      // TODO: parameters
      ')'
    ),
    _type: $ => choice(
      'bool'
      // TODO: other kinds of types
    ),
    block: $ => seq(
      '{',
      repeat($._statement),
      '}'
    ),
    _statement: $ => choice(
      $.return_statement
      // TODO: other kinds of statements
    ),
    return_statement: $ => seq(
      'return',
      $.expression,
      ';'
    ),
    expression: $ => choice(
      $.identifier,
      $.number
      // TODO: other kinds of expressions
    ),
    identifier: $ => /[a-z]+/,
    number: $ => /\d+/
  }
});
Create the skeleton
Define basic structure touching on major groups of rules.
Choose a sublanguage
Pick one area (types, expressions, statements) to develop first.
Flesh out rules
Add rules one-by-one for that sublanguage.
Test frequently
Use tree-sitter parse to verify your progress with real code.
Add tests
Write tests in test/corpus/ for each rule you add.
The first rule in the rules object is the start rule (typically source_file).
Structuring Rules Well
Avoid Language Spec Structure
Language specifications often have deeply nested rules that don’t translate well to syntax trees.
Consider a simple JavaScript return statement. Even when the returned expression is a single identifier, the ECMAScript spec derives it through 20+ levels of indirection:
ReturnStatement -> 'return' Expression
Expression -> AssignmentExpression
AssignmentExpression -> ConditionalExpression
ConditionalExpression -> LogicalORExpression
LogicalORExpression -> LogicalANDExpression
LogicalANDExpression -> BitwiseORExpression
BitwiseORExpression -> BitwiseXORExpression
BitwiseXORExpression -> BitwiseANDExpression
BitwiseANDExpression -> EqualityExpression
EqualityExpression -> RelationalExpression
RelationalExpression -> ShiftExpression
ShiftExpression -> AdditiveExpression
AdditiveExpression -> MultiplicativeExpression
MultiplicativeExpression -> ExponentiationExpression
ExponentiationExpression -> UnaryExpression
UnaryExpression -> UpdateExpression
UpdateExpression -> LeftHandSideExpression
LeftHandSideExpression -> NewExpression
NewExpression -> MemberExpression
MemberExpression -> PrimaryExpression
PrimaryExpression -> IdentifierReference
Don’t create a 20-level deep tree for a simple expression. Use precedence instead.
Flatten with Precedence
Create a flatter structure using prec:
rules: {
  expression: $ => choice(
    $.identifier,
    $.number,
    $.unary_expression,
    $.binary_expression,
    // ...
  ),
  // Unary operators get a higher precedence than every binary operator
  unary_expression: $ => prec(3, choice(
    seq('-', $.expression),
    seq('!', $.expression),
    seq('typeof', $.expression)
  )),
  binary_expression: $ => choice(
    prec.left(2, seq($.expression, '*', $.expression)),
    prec.left(1, seq($.expression, '+', $.expression))
  )
}
Using Precedence
Resolving Conflicts
When Tree-sitter encounters conflicts, it provides helpful error messages:
Error: Unresolved conflict for symbol sequence:
'-' _expression • '*' …
Possible interpretations:
1: '-' (binary_expression _expression • '*' _expression)
2: (unary_expression '-' _expression) • '*' …
Possible resolutions:
1: Specify a higher precedence in `binary_expression`
2: Specify a higher precedence in `unary_expression`
3: Specify a left or right associativity in `unary_expression`
4: Add a conflict for these rules: `binary_expression` `unary_expression`
The • character shows exactly where during parsing the conflict occurs.
Applying Precedence
For -a * b, we want unary - to bind tighter than binary *, so unary_expression gets a higher precedence than any binary operator:
unary_expression: $ => prec(3, choice(
  seq('-', $.expression),
  seq('!', $.expression)
))
Using Associativity
Left vs Right
For a * b * c, we need to choose between:
(a * b) * c - left associative
a * (b * c) - right associative
Most operators are left associative:
binary_expression: $ => choice(
  // Multiplication and division (precedence 2, left)
  prec.left(2, seq($.expression, '*', $.expression)),
  prec.left(2, seq($.expression, '/', $.expression)),
  // Addition and subtraction (precedence 1, left)
  prec.left(1, seq($.expression, '+', $.expression)),
  prec.left(1, seq($.expression, '-', $.expression))
)
Assignment is typically right associative:
assignment: $ => prec.right(seq(
  $.identifier,
  '=',
  choice($.expression, $.assignment)
))
Result: a = b = c parses as a = (b = c)
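As a rough illustration of how these precedence and associativity values shape the resulting tree, here is a small precedence-climbing parser in plain JavaScript. It is not how Tree-sitter works internally (Tree-sitter resolves these choices in its LR parse table when the parser is generated). The operator table, the level 0 for `=`, and the level 3 for unary `-` are assumptions chosen for this sketch.

```javascript
// Operator table mirroring the prec.left / prec.right values discussed above.
// A Map avoids accidental hits on Object.prototype for names like "constructor".
const BINARY = new Map([
  ['=', { prec: 0, assoc: 'right' }],
  ['+', { prec: 1, assoc: 'left' }],
  ['-', { prec: 1, assoc: 'left' }],
  ['*', { prec: 2, assoc: 'left' }],
  ['/', { prec: 2, assoc: 'left' }],
]);
const UNARY_PREC = 3; // assumption: unary '-' sits above every binary operator

// Split the input into identifiers, numbers, and single-character operators.
function tokenize(src) {
  return src.match(/[A-Za-z_]\w*|\d+|[-+*\/=()]/g) ?? [];
}

// Parse into a fully parenthesized string so the grouping is explicit.
function parse(src) {
  const tokens = tokenize(src);
  let pos = 0;

  function parsePrimary() {
    const tok = tokens[pos++];
    if (tok === undefined) throw new Error('unexpected end of input');
    if (tok === '-') return `(-${parseExpression(UNARY_PREC)})`;
    if (tok === '(') {
      const inner = parseExpression(0);
      if (tokens[pos++] !== ')') throw new Error("expected ')'");
      return inner;
    }
    return tok;
  }

  function parseExpression(minPrec) {
    let left = parsePrimary();
    while (pos < tokens.length) {
      const op = BINARY.get(tokens[pos]);
      if (!op || op.prec < minPrec) break;
      const symbol = tokens[pos++];
      // Left-associative: the right operand must bind strictly tighter.
      // Right-associative: the right operand may reuse the same level.
      const nextMin = op.assoc === 'left' ? op.prec + 1 : op.prec;
      left = `(${left} ${symbol} ${parseExpression(nextMin)})`;
    }
    return left;
  }

  const result = parseExpression(0);
  if (pos !== tokens.length) throw new Error(`unexpected token ${tokens[pos]}`);
  return result;
}

console.log(parse('a * b * c')); // ((a * b) * c)
console.log(parse('a = b = c')); // (a = (b = c))
console.log(parse('a + b * c')); // (a + (b * c))
console.log(parse('-a * b'));    // ((-a) * b)
```

Changing an entry's `assoc` or `prec` in the table changes the grouping in the same way that changing `prec.left`, `prec.right`, or the numeric level would in the grammar.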
Using Conflicts
Intentional Ambiguity
Some constructs are legitimately ambiguous. In JavaScript, [x, y] could be:
An array literal: let a = [x, y]
A destructuring pattern: let [x, y] = arr
export default grammar({
  name: 'javascript',
  conflicts: $ => [
    [$.array, $.array_pattern]
  ],
  rules: {
    expression: $ => choice(
      $.identifier,
      $.array,
      $.pattern
    ),
    array: $ => seq(
      '[',
      optional(seq(
        $.expression,
        repeat(seq(',', $.expression))
      )),
      ']'
    ),
    array_pattern: $ => seq(
      '[',
      optional(seq(
        $.pattern,
        repeat(seq(',', $.pattern))
      )),
      ']'
    ),
    pattern: $ => choice(
      $.identifier,
      $.array_pattern
    )
  }
});
Only use conflicts when you have genuine ambiguity that should be resolved at runtime. Tree-sitter will use GLR parsing to explore both possibilities.
Dynamic Precedence
Use prec.dynamic to prefer one interpretation:
array: $ => prec.dynamic(1, seq(
  '[',
  optional(seq($.expression, repeat(seq(',', $.expression)))),
  ']'
))
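To make the selection rule concrete, here is a small model in plain JavaScript. Per the Tree-sitter documentation, when a genuine ambiguity remains at runtime, the interpretation with the highest total dynamic precedence is chosen. The node shape and the `dynamicPrec` property are hypothetical names for this sketch, not part of any Tree-sitter API.

```javascript
// Sum the dynamic precedence of a node and all of its descendants.
function totalDynamicPrecedence(node) {
  const own = node.dynamicPrec ?? 0;
  const children = node.children ?? [];
  return children.reduce((sum, child) => sum + totalDynamicPrecedence(child), own);
}

// Keep the interpretation with the highest total.
// Ties keep the earlier candidate, which is an arbitrary choice of this sketch.
function pickInterpretation(candidates) {
  return candidates.reduce((best, candidate) =>
    totalDynamicPrecedence(candidate) > totalDynamicPrecedence(best) ? candidate : best);
}

// Two interpretations of the same text "[x, y]".
const asPattern = {
  type: 'array_pattern',
  children: [{ type: 'identifier' }, { type: 'identifier' }],
};
const asArray = {
  type: 'array',
  dynamicPrec: 1, // mirrors prec.dynamic(1, ...) on the array rule
  children: [{ type: 'identifier' }, { type: 'identifier' }],
};

console.log(pickInterpretation([asPattern, asArray]).type); // array
```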
Hiding Rules
Underscore Prefix
Rules starting with _ are hidden from the syntax tree:
rules: {
  source_file: $ => repeat($._statement),
  // Hidden - won't appear in tree
  _statement: $ => choice(
    $.expression_statement,
    $.if_statement,
    $.return_statement
  ),
  // Visible rules
  if_statement: $ => seq('if', $.expression, $.block),
  return_statement: $ => seq('return', $.expression)
}
Hide rules that always wrap a single child to reduce tree depth and noise.
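The effect on the tree can be modeled with a few lines of plain JavaScript. This is only a conceptual sketch: it treats hiding as splicing a hidden node's children into its parent, and the node objects and function names are hypothetical, not Tree-sitter's runtime representation.

```javascript
// Splice the children of hidden nodes (names starting with "_") into their parent.
function hideRules(node) {
  const children = (node.children ?? []).flatMap(child => {
    const visible = hideRules(child);
    return child.type.startsWith('_') ? visible.children : [visible];
  });
  return { type: node.type, children };
}

// Render a tree as an S-expression, similar in spirit to `tree-sitter parse` output.
function toSExpression(node) {
  const children = node.children ?? [];
  if (children.length === 0) return `(${node.type})`;
  return `(${node.type} ${children.map(toSExpression).join(' ')})`;
}

// Tree as the grammar derives it, including the hidden _statement wrapper.
const derived = {
  type: 'source_file',
  children: [{
    type: '_statement',
    children: [{
      type: 'return_statement',
      children: [{ type: 'expression', children: [{ type: 'identifier' }] }],
    }],
  }],
};

console.log(toSExpression(derived));
// (source_file (_statement (return_statement (expression (identifier)))))
console.log(toSExpression(hideRules(derived)));
// (source_file (return_statement (expression (identifier))))
```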
Using Fields
Named Children
Fields let you access children by name instead of index:
function_definition: $ => seq(
  'func',
  field('name', $.identifier),
  field('parameters', $.parameter_list),
  field('return_type', optional($._type)),
  field('body', $.block)
),
binary_expression: $ => prec.left(seq(
  field('left', $.expression),
  field('operator', choice('+', '-', '*', '/')),
  field('right', $.expression)
))
Benefits:
Code is more readable
Resilient to grammar changes
Self-documenting structure
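The resilience benefit is easiest to see by comparing index-based and field-based access. The sketch below uses plain objects as stand-ins for syntax nodes, and `childByFieldName` is a local helper written for this example (loosely modeled on the C API's `ts_node_child_by_field_name`), not a binding call.

```javascript
// Stand-in for a function_definition node with a return type.
const withReturnType = {
  type: 'function_definition',
  children: [
    { type: 'func' },
    { type: 'identifier', field: 'name' },
    { type: 'parameter_list', field: 'parameters' },
    { type: 'bool', field: 'return_type' },
    { type: 'block', field: 'body' },
  ],
};

// Same construct without the optional return type: later children shift left.
const withoutReturnType = {
  type: 'function_definition',
  children: [
    { type: 'func' },
    { type: 'identifier', field: 'name' },
    { type: 'parameter_list', field: 'parameters' },
    { type: 'block', field: 'body' },
  ],
};

// Hypothetical helper: find the first child tagged with the given field name.
function childByFieldName(node, name) {
  return node.children.find(child => child.field === name) ?? null;
}

// Index 3 means different things depending on the optional child...
console.log(withReturnType.children[3].type);    // bool
console.log(withoutReturnType.children[3].type); // block

// ...while the field lookup is stable.
console.log(childByFieldName(withReturnType, 'body').type);    // block
console.log(childByFieldName(withoutReturnType, 'body').type); // block
console.log(childByFieldName(withoutReturnType, 'return_type')); // null
```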
Using Extras
Extras are tokens, such as whitespace and comments, that can appear anywhere in the language:
export default grammar({
  name: 'mylang',
  extras: $ => [
    /\s/, // Whitespace
    $.comment
  ],
  rules: {
    comment: $ => token(choice(
      seq('//', /.*/),
      seq('/*', /[^*]*\*+([^/*][^*]*\*+)*/, '/')
    ))
  }
});
When adding complex patterns to extras, associate them with a rule instead of inlining. This dramatically reduces parser size.
Good - reference a rule from extras (the bad alternative is to inline the complex pattern in the extras array):
extras: $ => [
  /\s/,
  $.comment // Reference to rule
],
rules: {
  comment: $ => token(/* complex pattern */)
}
Tree-sitter simplifies \s to [ \t\n\r] as a performance optimization.
Using Supertypes
Abstract Categories
Supertypes represent abstract categories without creating visible nodes:
export default grammar({
  name: 'javascript',
  supertypes: $ => [
    $._expression,
    $._statement,
    $._declaration
  ],
  rules: {
    _expression: $ => choice(
      $.identifier,
      $.unary_expression,
      $.binary_expression,
      $.call_expression,
      $.member_expression
    )
  }
});
Effect: _expression nodes don’t appear in the tree, but can be used in queries.
Standard Rule Names
Follow these conventions for consistency:
source_file - Root node representing an entire source file
expression - Choice between different expression types
statement - Choice between different statement types
block - Parent node for block scopes
type - Type annotations (int, char, void, etc.)
identifier - Variable/function names (often the word token)
comment - Comments (often in extras)
Lexical Analysis
Tree-sitter’s parsing is divided into two phases: parsing and lexing.
Conflicting Tokens
Grammars often have multiple tokens that can match the same characters. Tree-sitter resolves these conflicts using the following criteria, applied in order:
Context-aware lexing
The lexer only tries to recognize tokens that are valid at the current position.
Lexical precedence
token(prec(N, ...)) gives explicit precedence values. Higher precedence wins.
Match length
Prefer the token matching the longest sequence of characters.
Match specificity
Prefer a String ('if') over a RegExp (/[a-z]+/) for the same match.
Rule order
Prefer the token that appears earlier in the grammar.
External scanners have priority over all these rules. See External Scanners for details.
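The criteria above can be modeled as a filter followed by a multi-key sort. The sketch below is a simplified model in plain JavaScript, not Tree-sitter's generated lexer; the token table, the `hex` and `lower_word` tokens, and the function names are all hypothetical.

```javascript
// Candidate tokens in grammar order. `pattern` is a string literal or a RegExp.
const TOKENS = [
  { name: 'if', pattern: 'if', prec: 0 },
  { name: 'identifier', pattern: /[a-z]+/, prec: 0 },
  { name: 'lower_word', pattern: /[a-z]+/, prec: 0 },
  { name: 'hex', pattern: /[0-9a-f]+/, prec: 1 }, // like token(prec(1, ...))
];

// Length of the match at the start of the input, or 0 if there is none.
function matchLength(pattern, input) {
  if (typeof pattern === 'string') return input.startsWith(pattern) ? pattern.length : 0;
  const anchored = new RegExp(pattern.source, 'y'); // sticky: match only at index 0
  const match = anchored.exec(input);
  return match ? match[0].length : 0;
}

function chooseToken(tokens, validNames, input) {
  const candidates = tokens
    .map((tok, order) => ({ ...tok, order, length: matchLength(tok.pattern, input) }))
    // 1. Context-aware lexing: only tokens valid in the current parse state.
    .filter(tok => validNames.includes(tok.name) && tok.length > 0);
  const isString = tok => Number(typeof tok.pattern === 'string');
  candidates.sort((a, b) =>
    (b.prec - a.prec) ||               // 2. lexical precedence
    (b.length - a.length) ||           // 3. match length
    (isString(b) - isString(a)) ||     // 4. string over regex
    (a.order - b.order));              // 5. rule order
  return candidates[0]?.name ?? null;
}

console.log(chooseToken(TOKENS, ['identifier'], 'if (x)'));           // identifier
console.log(chooseToken(TOKENS, ['identifier', 'hex'], 'cafe'));      // hex
console.log(chooseToken(TOKENS, ['if', 'identifier'], 'iffy'));       // identifier
console.log(chooseToken(TOKENS, ['if', 'identifier'], 'if (x)'));     // if
console.log(chooseToken(TOKENS, ['identifier', 'lower_word'], 'abc')); // identifier
```

Each call exercises one criterion in turn: context, precedence, length, specificity, and rule order.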
Lexical vs Parse Precedence
Don’t confuse these two:
Parse precedence - prec(N, rule) - Which rule to use for a sequence of tokens
Lexical precedence - token(prec(N, ...)) - Which token to recognize at a position
Parse Precedence
// Choose between interpretations of token sequences
unary_expression: $ => prec(2, seq('-', $.expression))
Keywords
The Word Token
Many languages have keywords (if, for, return) and a general identifier token. Without special handling, instanceofSomething would incorrectly tokenize as instanceof + Something.
Specify a word token to fix this:
export default grammar({
  name: 'javascript',
  word: $ => $.identifier,
  rules: {
    // Uppercase letters are included so that instanceofSomething is a valid identifier
    identifier: $ => /[a-zA-Z_]+/,
    binary_expression: $ => prec.left(1, seq(
      $.expression,
      'instanceof', // Keyword
      $.expression
    )),
    unary_expression: $ => prec.left(2, seq(
      'typeof', // Keyword
      $.expression
    ))
  }
});
How It Works
Keyword extraction
Tree-sitter identifies the keywords: the string tokens whose text is also matched by the word token.
Two-step matching
When parsing, Tree-sitter first matches the word token, then checks if it’s a keyword.
Better errors
instanceofSomething correctly tokenizes as one identifier, so the parser can report better errors.
Keyword extraction also improves performance by generating a smaller, simpler lexing function.
The word token must be unique and not reused by another rule. If needed, use an alias instead.
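The two-step matching described above can be sketched in plain JavaScript. This is a simplified model rather than Tree-sitter's generated lexer; the function names, the keyword set, and the word pattern are assumptions made for the example.

```javascript
const KEYWORDS = new Set(['instanceof', 'typeof']);
const WORD = /^[A-Za-z_]+/; // stands in for the grammar's word token

// Without a word token: in a state where only the keyword is expected,
// the lexer accepts the keyword as a bare prefix of the input.
function lexWithoutWord(input, expectedKeyword) {
  return input.startsWith(expectedKeyword)
    ? { type: expectedKeyword, text: expectedKeyword }
    : null;
}

// With a word token: match the whole word first, then check whether it is a keyword.
function lexWithWord(input) {
  const match = input.match(WORD);
  if (!match) return null;
  const text = match[0];
  return { type: KEYWORDS.has(text) ? text : 'identifier', text };
}

console.log(lexWithoutWord('instanceofSomething', 'instanceof'));
// { type: 'instanceof', text: 'instanceof' }  (leaves "Something" behind)
console.log(lexWithWord('instanceofSomething'));
// { type: 'identifier', text: 'instanceofSomething' }
console.log(lexWithWord('instanceof x'));
// { type: 'instanceof', text: 'instanceof' }
```

Because the second lexer returns a single identifier for instanceofSomething, a parser that expected the keyword at that position can report an error on the whole word instead of silently splitting it.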
Complete Example
Here’s a complete grammar demonstrating these concepts:
export default grammar({
  name: 'mini_lang',
  extras: $ => [/\s/, $.comment],
  word: $ => $.identifier,
  supertypes: $ => [$._expression, $._statement],
  rules: {
    source_file: $ => repeat($._statement),
    _statement: $ => choice(
      $.expression_statement,
      $.if_statement,
      $.return_statement
    ),
    expression_statement: $ => seq($._expression, ';'),
    if_statement: $ => seq(
      'if',
      field('condition', $._expression),
      field('consequence', $.block),
      optional(seq('else', field('alternative', $.block)))
    ),
    return_statement: $ => seq(
      'return',
      optional($._expression),
      ';'
    ),
    block: $ => seq('{', repeat($._statement), '}'),
    // Referenced everywhere as $._expression so it matches the supertype declared above
    _expression: $ => choice(
      $.identifier,
      $.number,
      $.binary_expression,
      $.call_expression
    ),
    binary_expression: $ => choice(
      prec.left(2, seq(
        field('left', $._expression),
        field('operator', choice('*', '/')),
        field('right', $._expression)
      )),
      prec.left(1, seq(
        field('left', $._expression),
        field('operator', choice('+', '-')),
        field('right', $._expression)
      ))
    ),
    call_expression: $ => seq(
      field('function', $.identifier),
      '(',
      optional(seq(
        $._expression,
        repeat(seq(',', $._expression))
      )),
      ')'
    ),
    identifier: $ => /[a-zA-Z_][a-zA-Z0-9_]*/,
    number: $ => /\d+/,
    comment: $ => token(choice(
      seq('//', /.*/),
      seq('/*', /[^*]*\*+([^/*][^*]*\*+)*/, '/')
    ))
  }
});
Next Steps
Now that you understand grammar writing: