Implementation Details

Tree-sitter consists of two main components: a C library (libtree-sitter) for runtime parsing, and a Rust-based CLI for parser generation.

Architecture Overview

The C Library (libtree-sitter)

The runtime library is written in plain C for maximum portability and embeddability.

Core API

The public interface is defined in tree_sitter/api.h:

// Core types
typedef struct TSParser TSParser;
typedef struct TSTree TSTree;
typedef struct TSNode TSNode;
typedef struct TSLanguage TSLanguage;

// Parser lifecycle
TSParser *ts_parser_new(void);
void ts_parser_delete(TSParser *);
bool ts_parser_set_language(TSParser *, const TSLanguage *);

// Parsing
TSTree *ts_parser_parse_string(
    TSParser *self,
    const TSTree *old_tree,
    const char *string,
    uint32_t length
);

// Tree operations
TSNode ts_tree_root_node(const TSTree *);
void ts_tree_delete(TSTree *);
TSTree *ts_tree_copy(const TSTree *);

Key Components

The library source is organized into focused modules:

parser.c - Main parsing algorithm and state management
lexer.c - Lexical analysis and token generation
node.c - Syntax tree node operations
language.c - Language definition and version management
get_changed_ranges.c - Incremental parsing support
alloc.c - Memory allocation with custom allocators

Memory Management

Tree-sitter uses reference counting for tree sharing:

// Cheap copy - just increments refcount
TSTree *tree_copy = ts_tree_copy(tree);

// Safe to use on different threads
// Both trees can be queried/modified independently
use_tree_on_thread_1(tree);
use_tree_on_thread_2(tree_copy);

// Each thread deletes its own copy
ts_tree_delete(tree);
ts_tree_delete(tree_copy);

Individual TSTree instances are NOT thread-safe. Always copy a tree before using it on multiple threads simultaneously.

Incremental Parsing

Tree-sitter achieves efficiency through incremental parsing:

typedef struct {
    uint32_t start_byte;
    uint32_t old_end_byte;
    uint32_t new_end_byte;
    TSPoint start_point;
    TSPoint old_end_point;
    TSPoint new_end_point;
} TSInputEdit;

// 1. Edit the tree to adjust node positions
ts_tree_edit(old_tree, &edit);

// 2. Reparse with the old tree
TSTree *new_tree = ts_parser_parse_string(
    parser,
    old_tree,  // Provides context for reuse
    new_source,
    new_length
);

The parser identifies unchanged regions and reuses subtrees, making edits extremely fast.

The CLI (tree-sitter)

The CLI is written in Rust and available via:

Parser Generation Pipeline

The tree-sitter generate command transforms grammars through several stages:

1. Grammar Parsing

Implemented in parse_grammar.rs:

// grammar.js
module.exports = grammar({
  name: 'javascript',
  rules: {
    program: $ => repeat($._statement),
    _statement: $ => choice(
      $.expression_statement,
      $.if_statement,
      // ...
    ),
  },
});

The CLI shells out to Node.js to evaluate the grammar and convert it to JSON.

Grammar format is formally specified in grammar.schema.json.

2. Grammar Rules

Grammars are composed of rule types defined in rules.rs:

pub enum Rule {
    String(String),           // Literal text
    Pattern(String, Flags),   // Regex pattern  
    Symbol(String),           // Reference to another rule
    Seq(Vec<Rule>),          // Sequence of rules
    Choice(Vec<Rule>),       // Alternative rules
    Repeat(Box<Rule>),       // Zero or more
    Repeat1(Box<Rule>),      // One or more
    Metadata(Metadata),      // Annotations
    Blank,                   // Empty rule
}

3. Grammar Preparation

The prepare_grammar module transforms grammars: Transformations include:

Inlining - Expand inline rules
Precedence extraction - Identify operator precedence
Associativity handling - Process left/right associativity
Token extraction - Separate tokens from syntax rules
Conflict resolution - Handle ambiguities

Output: Two grammars:

Syntax grammar - How non-terminals combine
Lexical grammar - How terminals (tokens) are formed

4. Parse Table Generation

The CLI generates LR parsing tables:

// Simplified representation
struct ParseTable {
    states: Vec<ParseState>,
    symbols: Vec<Symbol>,
}

struct ParseState {
    actions: HashMap<Symbol, ParseAction>,
    goto_table: HashMap<Symbol, StateId>,
}

enum ParseAction {
    Shift(StateId),
    Reduce(RuleId),
    Accept,
}

These tables drive the LR parser in the generated C code.

5. Code Generation

The final step emits parser.c:

// Generated parser structure
static const TSParseActionEntry ts_parse_actions[] = {
  [0] = {.entry = {.count = 1, .reusable = false}},
  // Thousands of action entries...
};

static const TSStateId ts_primary_state_ids[] = {
  [0] = 0,
  // State mappings...
};

// External scanner integration
extern const TSLanguage *tree_sitter_javascript(void) {
  static const TSLanguage language = {
    .version = LANGUAGE_VERSION,
    .symbol_count = SYMBOL_COUNT,
    .parse_table = ts_parse_actions,
    // ...
  };
  return &language;
}

Grammar DSL Deep Dive

The grammar DSL provides several powerful constructs:

Fields

function_declaration: $ => seq(
  'function',
  field('name', $.identifier),
  field('parameters', $.parameter_list),
  field('body', $.block)
)

Fields enable precise node queries:

(function_declaration
  name: (identifier) @function.name
  body: (block) @function.body)

Precedence

binary_expression: $ => choice(
  prec.left(1, seq($.expression, '+', $.expression)),
  prec.left(2, seq($.expression, '*', $.expression)),
)

Higher numbers = higher precedence.

Conflicts

When the grammar is ambiguous:

conflicts: $ => [
  [$.primary_expression, $.assignment_expression],
]

This tells the parser that these conflicts are expected and acceptable.

External Scanner

For context-sensitive lexing:

externals: $ => [
  $.heredoc_body,
  $.string_content,
]

Implemented in scanner.c or scanner.cc.

Query System Architecture

Queries use a separate compilation process:

Query Compilation

use tree_sitter::Query;

let query = Query::new(
    &language,
    "(function_definition name: (identifier) @name)",
)?;

Internally:

Parse query S-expression
Validate against language symbols
Compile to pattern-matching bytecode
Index by capture names

Query Execution

let mut cursor = QueryCursor::new();
let matches = cursor.matches(&query, tree.root_node(), source);

for m in matches {
    for capture in m.captures {
        println!(
            "{}: {}",
            query.capture_names()[capture.index as usize],
            capture.node.utf8_text(source)?
        );
    }
}

Predicate System

Predicates are evaluated at runtime:

((identifier) @constant
 (#match? @constant "^[A-Z_]+$"))

Built-in predicates:

#eq?, #not-eq? - String equality
#match?, #not-match? - Regex matching
#any-of?, #not-any-of? - Set membership
#is?, #is-not? - Property checks
#set! - Set properties

Language ABI Versioning

Parsers declare their ABI version:

const TSLanguage *tree_sitter_javascript(void) {
  static const TSLanguage language = {
    .version = TREE_SITTER_LANGUAGE_VERSION,
    // ...
  };
  return &language;
}

The runtime checks compatibility:

uint32_t version = ts_language_version(language);
if (version < TREE_SITTER_MIN_COMPATIBLE_LANGUAGE_VERSION ||
    version > TREE_SITTER_LANGUAGE_VERSION) {
  // Incompatible!
}

Recompile parsers when updating Tree-sitter to avoid ABI mismatches.

Error Recovery

Tree-sitter performs automatic error recovery:

ERROR Node Insertion

When parsing fails, an ERROR node is inserted:

(program
  (function_declaration
    name: (identifier)
    (ERROR  ; Missing parameter list
      ")")
    body: (block)))

MISSING Node Insertion

For expected but absent tokens:

(if_statement
  condition: (parenthesized_expression
    "("
    (MISSING identifier)  ; Expected condition
    ")"))

Recovery Strategy

The parser:

Detects an unexpected token
Inserts ERROR node
Skips tokens until finding a recovery point
Resumes parsing

This ensures you always get a complete tree, even for invalid code.

Performance Characteristics

Time Complexity

Full parse: O(n) where n = source length
Incremental parse: O(e log n) where e = edit size
Query execution: O(n × p) where p = pattern complexity

Space Complexity

Syntax tree: O(n) nodes
Parse stack: O(d) where d = max nesting depth
Shared subtrees: Zero extra cost due to reference counting

Optimization Techniques

Subtree reuse - Unchanged regions share nodes
Lazy node materialization - Nodes created on access
Arena allocation - Batch allocations reduce overhead
Stack-based parsing - No heap allocation during parse

Performance

Learn optimization techniques

Creating Parsers

Build your own parser

Get Started

Using Parsers

Queries

Creating Parsers

Advanced Topics

Implementation Details

Architecture Overview

The C Library (libtree-sitter)

Core API

Key Components

Memory Management

Incremental Parsing

The CLI (tree-sitter)

Parser Generation Pipeline

1. Grammar Parsing

2. Grammar Rules

3. Grammar Preparation

4. Parse Table Generation

5. Code Generation

Grammar DSL Deep Dive

Fields

Precedence

Conflicts

External Scanner

Query System Architecture

Query Compilation

Query Execution

Predicate System

Language ABI Versioning

Error Recovery

ERROR Node Insertion

MISSING Node Insertion

Recovery Strategy

Performance Characteristics

Time Complexity

Space Complexity

Optimization Techniques

Performance

Creating Parsers

​Architecture Overview

​The C Library (libtree-sitter)

​Core API

​Key Components

​Memory Management

​Incremental Parsing

​The CLI (tree-sitter)

​Parser Generation Pipeline

​1. Grammar Parsing

​2. Grammar Rules

​3. Grammar Preparation

​4. Parse Table Generation

​5. Code Generation

​Grammar DSL Deep Dive

​Fields

​Precedence

​Conflicts

​External Scanner

​Query System Architecture

​Query Compilation

​Query Execution

​Predicate System

​Language ABI Versioning

​Error Recovery

​ERROR Node Insertion

​MISSING Node Insertion

​Recovery Strategy

​Performance Characteristics

​Time Complexity

​Space Complexity

​Optimization Techniques

​Related Resources

Performance

Creating Parsers

Architecture Overview

The C Library (libtree-sitter)

Core API

Key Components

Memory Management

Incremental Parsing

The CLI (tree-sitter)

Parser Generation Pipeline

1. Grammar Parsing

2. Grammar Rules

3. Grammar Preparation

4. Parse Table Generation

5. Code Generation

Grammar DSL Deep Dive

Fields

Precedence

Conflicts

External Scanner

Query System Architecture

Query Compilation

Query Execution

Predicate System

Language ABI Versioning

Error Recovery

ERROR Node Insertion

MISSING Node Insertion

Recovery Strategy

Performance Characteristics

Time Complexity

Space Complexity

Optimization Techniques

Related Resources