
Introduction

What is Sipha?

Sipha is a flexible, incremental parsing library for Rust with support for multiple parsing algorithms. It is designed from the ground up for incremental parsing: the ability to efficiently re-parse only the changed portions of your code, which makes it ideal for interactive applications like IDEs, editors, and language servers.

Unlike traditional parsers that re-parse entire files on every change, Sipha can:

  • Reuse unchanged subtrees from previous parses
  • Reparse only affected regions instead of the entire file
  • Maintain parse caches for efficient updates
  • Dramatically improve performance in interactive editing scenarios

Architecture Overview

graph TB
    Input[Source Text] --> Lexer[Lexer]
    Lexer --> Tokens[Tokens]
    Tokens --> Parser[Parser Backend]
    Parser --> Tree[Syntax Tree]
    Tree --> Incremental[Incremental Parser]
    Incremental --> Cache[Parse Cache]
    Incremental --> Output[Updated Tree]
    
    subgraph Backends
        LL[LL Parser]
        LR[LR Parser]
        GLR[GLR Parser]
    end
    
    Parser -.-> Backends
    
    style Input fill:#e1f5ff,color:#000000
    style Tree fill:#c8e6c9,color:#000000
    style Output fill:#c8e6c9,color:#000000
    style Cache fill:#fff9c4,color:#000000

Project Status

Note

Sipha 0.5.0 provides a stable foundation for incremental parsing. The core API is stable, and we continue to add features and improvements based on user feedback. We welcome contributions!

Key Features

Note

For a detailed feature comparison with other parsing libraries, see the README in the repository.

Sipha provides a comprehensive parsing solution with these key capabilities:

Incremental Parsing (Primary Focus)

Sipha’s standout feature is its incremental parsing capability. When you edit code, Sipha can:

  • Reuse unchanged subtrees from previous parses
  • Reparse only affected regions instead of the entire file
  • Maintain parse caches for efficient updates
  • Dramatically improve performance in interactive editing scenarios

This makes Sipha perfect for building language servers, IDEs, and other tools that need to parse code as users type.

Learn more in the Incremental Parsing section.

Additional Features

  • Multiple parsing backends: Choose from LL(k), LR, and GLR (via feature flags) - see Parsing Backends
  • Immutable syntax trees: Green/red tree representation for efficient manipulation - see Syntax Trees
  • Error recovery: Configurable error recovery strategies for robust parsing - see Error Handling
  • Flexible grammar definition: Builder API for defining your grammar - see Grammars
  • Unicode support: Full Unicode support for identifiers and text (optional)
  • Rich diagnostics: Beautiful error messages with miette integration (optional)
  • Tree traversal: Visitor patterns and query APIs for working with syntax trees

Use Cases

Sipha is particularly well-suited for:

  • Language Servers: Building LSP implementations that need fast, incremental parsing
  • IDEs and Editors: Providing real-time syntax analysis and error checking
  • Code Analysis Tools: Tools that need to parse and analyze code efficiently
  • Interactive Development Tools: Any tool that needs to parse code as users edit it
  • Complex Language Parsing: Using GLR backend for languages with inherent ambiguities (like C++)

Comparison with Alternatives

| Feature             | Sipha       | pest | nom | lalrpop |
|---------------------|-------------|------|-----|---------|
| Incremental parsing | Yes         | No   | No  | No      |
| Multiple backends   | Yes         | No   | No  | No      |
| Syntax trees        | Yes         | No   | No  | No      |
| Error recovery      | Yes         | No   | No  | Yes     |
| Grammar DSL         | Builder API | Yes  | No  | Yes     |
| Zero-copy           | Partial     | Yes  | Yes | –       |

When to Use Sipha

Use Sipha when:

  • You're building language servers or IDEs
  • You need incremental parsing for interactive editing
  • You want flexibility in choosing parsing algorithms
  • You need rich syntax tree manipulation
  • You're parsing ambiguous grammars (use the GLR backend)

When to Consider Alternatives

Consider alternatives when:

  • You need a simple one-off parser (pest or nom might be simpler)
  • You need maximum throughput for batch parsing (nom might be faster)
  • You prefer a declarative grammar DSL (pest or lalrpop)

What’s Next?

Ready to get started? Head to Getting Started to install Sipha and build your first parser!

For a deeper dive into specific topics, continue to the chapters that follow.

Getting Started

This guide will help you get up and running with Sipha. We’ll cover installation, a quick start example, and an overview of the core concepts.

Installation

Add Sipha to your Cargo.toml:

[dependencies]
sipha = "0.5.0"

Or with specific features:

[dependencies]
sipha = { version = "0.5.0", features = ["diagnostics", "unicode", "backend-ll"] }

Available Features

  • backend-ll: Enable LL(k) parser backend (default)
  • backend-lr: Enable LR parser backend
  • backend-glr: Enable GLR parser backend (requires backend-lr)
  • diagnostics: Enable rich error diagnostics with miette
  • unicode: Enable full Unicode support for identifiers
  • visitor: Enable syntax tree visitor patterns
  • query: Enable XPath-like query API for syntax trees
  • tree-utils: Enable tree diffing and validation utilities

Quick Start

Let’s build a simple arithmetic expression parser step by step. This example will help you understand the core concepts.

Step 1: Define Your Syntax Kinds

First, define the tokens and non-terminals your parser will use. Sipha uses a unified SyntaxKind trait for both terminals (tokens) and non-terminals (grammar rules):

use sipha::syntax::SyntaxKind;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
    // Terminals (produced by lexer)
    Number,
    Plus,
    Minus,
    Multiply,
    Divide,
    LParen,
    RParen,
    Whitespace,
    Eof,
    // Non-terminals (produced by parser)
    Expr,
    Term,
    Factor,
}

impl SyntaxKind for ArithSyntaxKind {
    fn is_terminal(self) -> bool {
        !matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor)
    }
    
    fn is_trivia(self) -> bool {
        matches!(self, ArithSyntaxKind::Whitespace)
    }
}

Step 2: Build a Lexer

Create a lexer to tokenize your input text:

use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
    Number, Plus, Minus, Multiply, Divide, LParen, RParen, Whitespace, Eof,
    Expr, Term, Factor,
}
impl SyntaxKind for ArithSyntaxKind {
    fn is_terminal(self) -> bool { !matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor) }
    fn is_trivia(self) -> bool { matches!(self, ArithSyntaxKind::Whitespace) }
}
use sipha::lexer::{LexerBuilder, Pattern, CharSet};

let lexer = LexerBuilder::new()
    .token(ArithSyntaxKind::Number, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::digits())),
        min: 1,
        max: None,
    })
    .token(ArithSyntaxKind::Plus, Pattern::Literal("+".into()))
    .token(ArithSyntaxKind::Minus, Pattern::Literal("-".into()))
    .token(ArithSyntaxKind::Multiply, Pattern::Literal("*".into()))
    .token(ArithSyntaxKind::Divide, Pattern::Literal("/".into()))
    .token(ArithSyntaxKind::LParen, Pattern::Literal("(".into()))
    .token(ArithSyntaxKind::RParen, Pattern::Literal(")".into()))
    .token(ArithSyntaxKind::Whitespace, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
        min: 1,
        max: None,
    })
    .trivia(ArithSyntaxKind::Whitespace)
    .build(ArithSyntaxKind::Eof, ArithSyntaxKind::Number)
    .expect("Failed to build lexer");

Step 3: Tokenize Input

// Using the lexer built in Step 2:
let input = "42 + 10";
let tokens = lexer.tokenize(input)
    .expect("Failed to tokenize input");

Step 4: Define Non-Terminals and Build Grammar

// Continuing from the ArithSyntaxKind definition in Step 1:
use sipha::grammar::{GrammarBuilder, NonTerminal, Expr};
use sipha::lexer::Token as LexerToken;
use sipha::syntax::{TextRange, TextSize};
use std::convert::TryFrom;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ArithNonTerminal {
    Expr,
    Term,
    Factor,
}

impl NonTerminal for ArithNonTerminal {
    fn name(&self) -> &str {
        match self {
            ArithNonTerminal::Expr => "Expr",
            ArithNonTerminal::Term => "Term",
            ArithNonTerminal::Factor => "Factor",
        }
    }
}

// Helper function to create tokens with proper ranges
fn create_token(kind: ArithSyntaxKind, text: &str, offset: u32) -> LexerToken<ArithSyntaxKind> {
    let len = TextSize::from(u32::try_from(text.len()).unwrap_or(0));
    LexerToken::new(kind, text, TextRange::at(TextSize::from(offset), len))
}

// Build your grammar rules
let grammar = GrammarBuilder::new()
    .entry_point(ArithNonTerminal::Expr)
    // Simple grammar: Expr -> Number
    .rule(ArithNonTerminal::Expr, Expr::token(create_token(
        ArithSyntaxKind::Number,
        "42",
        0
    )))
    .build()
    .expect("Failed to build grammar");

Step 5: Parse!

// Continuing from Steps 1–4: ArithSyntaxKind, the grammar, the lexer, and `tokens` are already in scope.
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;

let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)
    .expect("Failed to create parser");

let result = parser.parse(&tokens, ArithNonTerminal::Expr);
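
Borrowing the pattern that appears later in Parsing Basics, you can check result.errors and wrap the green root in a SyntaxNode to inspect the tree:

use sipha::syntax::SyntaxNode;

if result.errors.is_empty() {
    let root = SyntaxNode::new_root(result.root.clone());
    println!("Parsed successfully, root kind: {:?}", root.kind());
} else {
    for error in &result.errors {
        eprintln!("Parse error: {:?}", error);
    }
}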

For a complete working example, see the Basic Arithmetic Example or check out examples/basic_arithmetic.rs in the repository.

Core Concepts Overview

Before diving deeper, here’s a quick overview of Sipha’s core concepts:

Syntax Kinds

Sipha uses a unified SyntaxKind trait for both terminals (tokens) and non-terminals (grammar rules). This design simplifies the API and allows for flexible grammar definitions.

See Syntax Kinds for more details.

Lexers

Lexers convert raw text into tokens. Sipha provides a flexible lexer builder API that supports:

  • Pattern matching (literals, character classes, repetitions)
  • Trivia handling (whitespace, comments)
  • DFA-based tokenization for performance

See Lexers for more details.

Grammars

Grammars define the structure of your language. Sipha uses a builder API to define grammar rules declaratively.

See Grammars for more details.

Parsing Backends

Sipha supports multiple parsing algorithms:

  • LL(k): Top-down predictive parsing
  • LR: Bottom-up shift-reduce parsing
  • GLR: Generalized LR for ambiguous grammars

See Parsing Backends for more details.

Syntax Trees

Sipha uses an immutable green/red tree representation:

  • Green trees: Compact, shared representation
  • Red trees: Convenient API for traversal

See Syntax Trees for more details.

Incremental Parsing

Sipha’s key feature is incremental parsing—the ability to efficiently re-parse only changed regions. This is essential for interactive applications.

See Incremental Parsing for more details.

Next Steps

Now that you have Sipha installed and understand the basics, you can:

  1. Read about Core Concepts in detail
  2. Explore Incremental Parsing to understand Sipha’s key feature
  3. Check out Examples to see real-world usage
  4. Learn about Parsing Backends to choose the right algorithm

Syntax Kinds

Sipha uses a unified SyntaxKind trait for both terminals (tokens) and non-terminals (grammar rules). This design simplifies the API and allows for flexible grammar definitions.

Overview

In traditional parser generators, terminals and non-terminals are separate types. Sipha unifies them into a single SyntaxKind enum, which simplifies the API and makes it easier to work with syntax trees.

The SyntaxKind Trait

The SyntaxKind trait is the foundation of Sipha’s type system:

use std::hash::Hash;
use std::fmt::Debug;

pub trait SyntaxKind: Copy + Clone + PartialEq + Eq + Hash + Debug {
    fn is_terminal(self) -> bool;
    fn is_trivia(self) -> bool;
}

Required Methods

  • is_terminal(self) -> bool: Returns true if this kind represents a terminal (token), false if it’s a non-terminal (grammar rule).
  • is_trivia(self) -> bool: Returns true if this kind represents trivia (whitespace, comments) that should be ignored during parsing.

Defining Your Syntax Kinds

You define syntax kinds as an enum that implements the SyntaxKind trait:

use sipha::syntax::SyntaxKind;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind {
    // Terminals (produced by lexer)
    Number,
    Plus,
    Minus,
    LParen,
    RParen,
    Whitespace,
    Eof,
    // Non-terminals (produced by parser)
    Expr,
    Term,
    Factor,
}

impl SyntaxKind for MySyntaxKind {
    fn is_terminal(self) -> bool {
        !matches!(self, 
            MySyntaxKind::Expr | 
            MySyntaxKind::Term | 
            MySyntaxKind::Factor
        )
    }
    
    fn is_trivia(self) -> bool {
        matches!(self, MySyntaxKind::Whitespace)
    }
}

Terminals vs Non-Terminals

Terminals

Terminals are produced by the lexer and represent actual tokens in the input text:

  • Keywords (e.g., if, while, return)
  • Operators (e.g., +, -, *, /)
  • Identifiers (e.g., variable names)
  • Literals (e.g., numbers, strings)
  • Punctuation (e.g., (, ), {, })
  • Trivia (e.g., whitespace, comments)

Non-Terminals

Non-terminals are produced by the parser and represent grammar rules:

  • Expression nodes (e.g., Expr, Term, Factor)
  • Statement nodes (e.g., Stmt, IfStmt, WhileStmt)
  • Declaration nodes (e.g., Decl, FnDecl, VarDecl)

Trivia

Trivia are tokens that don’t affect the parse tree structure but are preserved for formatting and error messages:

  • Whitespace
  • Comments
  • Line breaks

Mark trivia kinds with is_trivia() returning true. The parser will automatically skip trivia during parsing, but they’ll still be available in the syntax tree.

Best Practices

  1. Use descriptive names: Choose names that clearly indicate what the kind represents.
  2. Group related kinds: Keep terminals and non-terminals organized logically.
  3. Mark trivia correctly: Ensure is_trivia() returns true for all trivia kinds.
  4. Use exhaustive matching: When implementing is_terminal(), use exhaustive patterns to catch all cases, as sketched below.
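
For point 4, an alternative to the !matches! form above is to classify every variant explicitly, so that adding a new kind fails to compile until it is categorized. A sketch using the MySyntaxKind enum defined earlier in this chapter:

impl SyntaxKind for MySyntaxKind {
    fn is_terminal(self) -> bool {
        // Exhaustive match: a newly added variant will not compile until it
        // is explicitly classified as a terminal or a non-terminal.
        match self {
            MySyntaxKind::Number
            | MySyntaxKind::Plus
            | MySyntaxKind::Minus
            | MySyntaxKind::LParen
            | MySyntaxKind::RParen
            | MySyntaxKind::Whitespace
            | MySyntaxKind::Eof => true,
            MySyntaxKind::Expr | MySyntaxKind::Term | MySyntaxKind::Factor => false,
        }
    }

    fn is_trivia(self) -> bool {
        matches!(self, MySyntaxKind::Whitespace)
    }
}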

Example: Complete Syntax Kind Definition

use sipha::syntax::SyntaxKind;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum CalculatorSyntaxKind {
    // Terminals
    Number,
    Plus,
    Minus,
    Multiply,
    Divide,
    LParen,
    RParen,
    Whitespace,
    Eof,
    // Non-terminals
    Expr,
    Term,
    Factor,
}

impl SyntaxKind for CalculatorSyntaxKind {
    fn is_terminal(self) -> bool {
        !matches!(
            self,
            CalculatorSyntaxKind::Expr
                | CalculatorSyntaxKind::Term
                | CalculatorSyntaxKind::Factor
        )
    }
    
    fn is_trivia(self) -> bool {
        matches!(self, CalculatorSyntaxKind::Whitespace)
    }
}

Next Steps

Now that you understand syntax kinds, you can:

  • Learn about Lexers to see how terminals are produced
  • Explore Grammars to see how non-terminals are used
  • Check out Syntax Trees to see how kinds are used in trees

Lexers

Lexers convert raw text into tokens. Sipha provides a flexible lexer builder API that supports pattern matching, trivia handling, and DFA-based tokenization for performance.

Overview

The lexer is the first stage of parsing. It takes source text and converts it into a stream of tokens that the parser can consume. Sipha’s lexer uses a DFA (Deterministic Finite Automaton) for efficient tokenization.

Building a Lexer

Use LexerBuilder to construct a lexer:

use sipha::lexer::{LexerBuilder, Pattern, CharSet};

let lexer = LexerBuilder::new()
    .token(MySyntaxKind::Number, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::digits())),
        min: 1,
        max: None,
    })
    .token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
    .token(MySyntaxKind::Whitespace, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
        min: 1,
        max: None,
    })
    .trivia(MySyntaxKind::Whitespace)
    .build(MySyntaxKind::Eof, MySyntaxKind::Number)
    .expect("Failed to build lexer");

Patterns

Sipha supports several pattern types for matching tokens:

Literal Patterns

Match exact strings:

.token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
.token(MySyntaxKind::LParen, Pattern::Literal("(".into()))

Character Class Patterns

Match character ranges:

use sipha::lexer::CharSet;

// Match digits [0-9]
.token(MySyntaxKind::Number, Pattern::CharClass(CharSet::digits()))

// Match letters [a-zA-Z]
.token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::new(vec!['a'..='z', 'A'..='Z'])))

// Match whitespace
.token(MySyntaxKind::Whitespace, Pattern::CharClass(CharSet::whitespace()))

Repeat Patterns

Match repeating patterns:

// Match one or more digits
.token(MySyntaxKind::Number, Pattern::Repeat {
    pattern: Box::new(Pattern::CharClass(CharSet::digits())),
    min: 1,
    max: None, // No maximum
})

// Match zero or more whitespace
.token(MySyntaxKind::Whitespace, Pattern::Repeat {
    pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
    min: 0,
    max: None,
})

// Match exactly 3 digits
.token(MySyntaxKind::ThreeDigits, Pattern::Repeat {
    pattern: Box::new(Pattern::CharClass(CharSet::digits())),
    min: 3,
    max: Some(3),
})

Regex Patterns

For complex patterns, use regex:

// Match floating point numbers
.token(MySyntaxKind::Float, Pattern::Regex(r"\d+\.\d+".into()))

// Match email addresses
.token(MySyntaxKind::Email, Pattern::Regex(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}".into()))

Any Pattern

Match any single character:

.token(MySyntaxKind::AnyChar, Pattern::Any)

Keywords

Keywords are reserved words that should be matched as literals rather than identifiers:

let lexer = LexerBuilder::new()
    .token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::new(vec!['a'..='z', 'A'..='Z'])))
    .keyword("if", MySyntaxKind::If)
    .keyword("else", MySyntaxKind::Else)
    .keyword("while", MySyntaxKind::While)
    .build(MySyntaxKind::Eof, MySyntaxKind::Ident)
    .expect("Failed to build lexer");

Keywords are matched with higher priority than general patterns, so "if" will match as a keyword rather than an identifier.

Trivia

Trivia are tokens that don’t affect parsing but are preserved in the syntax tree (e.g., whitespace, comments):

let lexer = LexerBuilder::new()
    .token(MySyntaxKind::Whitespace, Pattern::CharClass(CharSet::whitespace()))
    .token(MySyntaxKind::Comment, Pattern::Regex(r"//.*".into()))
    .trivia(MySyntaxKind::Whitespace)
    .trivia(MySyntaxKind::Comment)
    .build(MySyntaxKind::Eof, MySyntaxKind::Number)
    .expect("Failed to build lexer");

The parser will automatically skip trivia during parsing, but they’ll still be available in the syntax tree for formatting and IDE features.

Custom Matchers

For complex tokenization logic, use custom matchers:

let lexer = LexerBuilder::new()
    .custom_token(MySyntaxKind::String, |text, pos| {
        // Custom string matching logic
        if text[pos..].starts_with('"') {
            // Find closing quote
            let mut end = pos + 1;
            while end < text.len() && text.as_bytes()[end] != b'"' {
                if text.as_bytes()[end] == b'\\' {
                    end += 2; // Skip escape sequence
                } else {
                    end += 1;
                }
            }
            if end < text.len() {
                Some((end - pos + 1, TokenValue::String(text[pos+1..end].to_string())))
            } else {
                None // Unterminated string
            }
        } else {
            None
        }
    })
    .build(MySyntaxKind::Eof, MySyntaxKind::Number)
    .expect("Failed to build lexer");

Custom matchers return Option<(usize, TokenValue)> where:

  • usize is the number of characters matched
  • TokenValue is the token’s value (if any)

Tokenization

Once the lexer is built, tokenize input:

let input = "42 + 10";
let tokens = lexer.tokenize(input)
    .expect("Failed to tokenize input");

for token in &tokens {
    println!("{:?}: {}", token.kind, token.text);
}

DFA Construction

Sipha’s lexer uses a DFA (Deterministic Finite Automaton) for efficient tokenization. The DFA is built from NFA (Nondeterministic Finite Automaton) patterns, which are then converted to a DFA for O(n) tokenization performance.

The DFA construction process:

  1. Pattern to NFA: Each pattern is converted to an NFA
  2. NFA to DFA: The NFA is converted to a DFA using subset construction
  3. Optimization: The DFA is optimized for size and performance

Priority and Ambiguity

When multiple patterns could match the same input, Sipha uses priority to resolve ambiguity:

  • Keywords have the highest priority (lowest number)
  • Patterns are prioritized by order of definition (earlier = higher priority)
  • Longest match wins when priorities are equal
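
For example, using the Float and Number kinds from the pattern examples above, defining the more specific Float pattern first gives it higher priority, so an input like "3.14" lexes as a single Float token rather than a Number followed by an unexpected "." (a sketch following the priority rules just described):

use sipha::lexer::{LexerBuilder, Pattern, CharSet};

let lexer = LexerBuilder::new()
    // Defined first, so Float outranks Number when both could match.
    .token(MySyntaxKind::Float, Pattern::Regex(r"\d+\.\d+".into()))
    .token(MySyntaxKind::Number, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::digits())),
        min: 1,
        max: None,
    })
    .build(MySyntaxKind::Eof, MySyntaxKind::Number)
    .expect("Failed to build lexer");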

Error Handling

The lexer returns LexerError for invalid input:

match lexer.tokenize(input) {
    Ok(tokens) => {
        // Process tokens
    }
    Err(error) => {
        match error.kind {
            LexerErrorKind::UnexpectedChar { .. } => {
                // Handle unexpected character
            }
            LexerErrorKind::InvalidNumber { reason } => {
                // Handle invalid number format
            }
            // ... other error kinds
            _ => {}
        }
    }
}

Best Practices

  1. Order patterns carefully: More specific patterns should come before general ones
  2. Use keywords for reserved words: This ensures proper matching priority
  3. Mark trivia correctly: Ensure all trivia kinds are marked with .trivia()
  4. Use character classes efficiently: Prefer character classes over regex when possible
  5. Test edge cases: Test with empty input, very long input, and invalid input

Example: Complete Lexer

use sipha::lexer::{LexerBuilder, Pattern, CharSet};

let lexer = LexerBuilder::new()
    // Numbers
    .token(MySyntaxKind::Number, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::digits())),
        min: 1,
        max: None,
    })
    // Operators
    .token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
    .token(MySyntaxKind::Minus, Pattern::Literal("-".into()))
    .token(MySyntaxKind::Multiply, Pattern::Literal("*".into()))
    .token(MySyntaxKind::Divide, Pattern::Literal("/".into()))
    // Parentheses
    .token(MySyntaxKind::LParen, Pattern::Literal("(".into()))
    .token(MySyntaxKind::RParen, Pattern::Literal(")".into()))
    // Whitespace (trivia)
    .token(MySyntaxKind::Whitespace, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
        min: 1,
        max: None,
    })
    .trivia(MySyntaxKind::Whitespace)
    .build(MySyntaxKind::Eof, MySyntaxKind::Number)
    .expect("Failed to build lexer");
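
To exercise best practice 5 from the list above, you can drive this lexer with a few edge-case inputs and confirm that every outcome is an Ok or Err rather than a panic (a small sketch; which inputs actually fail depends on your token definitions):

let long_input = "9".repeat(10_000);
for input in ["", "   ", "42 + (10 * 3)", "42 @ 10", long_input.as_str()] {
    match lexer.tokenize(input) {
        Ok(tokens) => println!("{input:?}: {} tokens", tokens.len()),
        Err(error) => println!("{input:?}: lexer error: {error:?}"),
    }
}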

Next Steps

With a working lexer, continue to Grammars to define the structure of your language.

Grammars

Grammars define the structure of your language. Sipha uses a builder API to define grammar rules declaratively.

Overview

A grammar in Sipha consists of:

  • Non-terminals: Grammar rules (e.g., Expr, Stmt, Decl)
  • Production rules: How non-terminals are expanded (e.g., Expr -> Term + Expr)
  • Entry point: The starting non-terminal for parsing

GrammarBuilder API

Use GrammarBuilder to construct grammars:

use sipha::grammar::{GrammarBuilder, Expr, NonTerminal, Token};

let grammar = GrammarBuilder::new()
    .entry_point(MyNonTerminal::Expr)
    .rule(MyNonTerminal::Expr, Expr::token(some_token))
    .build()
    .expect("Failed to build grammar");

Grammar Expressions

Grammar rules use Expr to define right-hand sides:

Token Expressions

Match a specific token:

use sipha::grammar::Expr;
use sipha::lexer::Token;
use sipha::syntax::{TextRange, TextSize};

let token = Token::new(
    MySyntaxKind::Number,
    "42",
    TextRange::at(TextSize::zero(), TextSize::from(2))
);
.rule(MyNonTerminal::Factor, Expr::token(token))

Sequence Expressions

Match a sequence of expressions:

.rule(MyNonTerminal::Expr, Expr::seq(vec![
    Expr::non_terminal(MyNonTerminal::Term),
    Expr::token(plus_token),
    Expr::non_terminal(MyNonTerminal::Expr),
]))

Choice Expressions

Match one of several alternatives:

.rule(MyNonTerminal::Expr, Expr::choice(vec![
    Expr::non_terminal(MyNonTerminal::Term),
    Expr::non_terminal(MyNonTerminal::Expr),
]))

Optional Expressions

Match an optional expression:

.rule(MyNonTerminal::Stmt, Expr::seq(vec![
    Expr::non_terminal(MyNonTerminal::If),
    Expr::optional(Box::new(Expr::non_terminal(MyNonTerminal::Else))),
]))

Repeat Expressions

Match zero or more repetitions:

// Zero or more
.rule(MyNonTerminal::Args, Expr::repeat(
    Box::new(Expr::non_terminal(MyNonTerminal::Arg)),
    0,
    None,
))

// One or more
.rule(MyNonTerminal::Stmts, Expr::repeat(
    Box::new(Expr::non_terminal(MyNonTerminal::Stmt)),
    1,
    None,
))

Empty Expression

Match nothing (epsilon):

.rule(MyNonTerminal::Optional, Expr::Empty)

Complete Example: Arithmetic Grammar

use sipha::grammar::{GrammarBuilder, Expr, NonTerminal};
use sipha::lexer::Token as LexerToken;
use sipha::syntax::{TextRange, TextSize};

// Helper function to create tokens with proper ranges
fn create_token(kind: ArithSyntaxKind, text: &str, offset: u32) -> LexerToken<ArithSyntaxKind> {
    let len = TextSize::from(u32::try_from(text.len()).unwrap_or(0));
    LexerToken::new(kind, text, TextRange::at(TextSize::from(offset), len))
}

let grammar = GrammarBuilder::new()
    .entry_point(ArithNonTerminal::Expr)
    
    // Expr -> Term | Expr + Term | Expr - Term
    .rule(ArithNonTerminal::Expr, Expr::choice(vec![
        Expr::seq(vec![
            Expr::non_terminal(ArithNonTerminal::Expr),
            Expr::token(create_token(ArithSyntaxKind::Plus, "+", 0)),
            Expr::non_terminal(ArithNonTerminal::Term),
        ]),
        Expr::seq(vec![
            Expr::non_terminal(ArithNonTerminal::Expr),
            Expr::token(create_token(ArithSyntaxKind::Minus, "-", 0)),
            Expr::non_terminal(ArithNonTerminal::Term),
        ]),
        Expr::non_terminal(ArithNonTerminal::Term),
    ]))
    
    // Term -> Factor | Term * Factor | Term / Factor
    .rule(ArithNonTerminal::Term, Expr::choice(vec![
        Expr::seq(vec![
            Expr::non_terminal(ArithNonTerminal::Term),
            Expr::token(create_token(ArithSyntaxKind::Multiply, "*", 0)),
            Expr::non_terminal(ArithNonTerminal::Factor),
        ]),
        Expr::seq(vec![
            Expr::non_terminal(ArithNonTerminal::Term),
            Expr::token(create_token(ArithSyntaxKind::Divide, "/", 0)),
            Expr::non_terminal(ArithNonTerminal::Factor),
        ]),
        Expr::non_terminal(ArithNonTerminal::Factor),
    ]))
    
    // Factor -> Number | ( Expr )
    .rule(ArithNonTerminal::Factor, Expr::choice(vec![
        Expr::token(create_token(ArithSyntaxKind::Number, "42", 0)),
        Expr::seq(vec![
            Expr::token(create_token(ArithSyntaxKind::LParen, "(", 0)),
            Expr::non_terminal(ArithNonTerminal::Expr),
            Expr::token(create_token(ArithSyntaxKind::RParen, ")", 0)),
        ]),
    ]))
    
    .build()
    .expect("Failed to build grammar");

Grammar Validation

Sipha validates grammars before use:

let grammar = GrammarBuilder::new()
    .entry_point(MyNonTerminal::Expr)
    .rule(MyNonTerminal::Expr, Expr::token(some_token))
    .build()
    .expect("Failed to build grammar");

// Validate grammar for a specific backend
let errors = LlParser::validate(&grammar);
if !errors.is_empty() {
    for error in errors {
        eprintln!("Grammar error: {:?}", error);
    }
}

Common validation errors:

  • Left recursion: Some backends don’t support left recursion (see the rewrite sketch below)
  • Ambiguity: Some backends require unambiguous grammars
  • Missing rules: Referenced non-terminals must be defined
  • Unreachable rules: Rules that can’t be reached from the entry point
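
For example, if a backend reports left recursion in the arithmetic grammar shown earlier, the Expr rule can be rewritten without left recursion by making the repetition explicit. A sketch reusing the create_token helper and the builder calls from this chapter:

// Expr -> Term (("+" | "-") Term)*   (no left recursion)
.rule(ArithNonTerminal::Expr, Expr::seq(vec![
    Expr::non_terminal(ArithNonTerminal::Term),
    Expr::repeat(
        Box::new(Expr::seq(vec![
            Expr::choice(vec![
                Expr::token(create_token(ArithSyntaxKind::Plus, "+", 0)),
                Expr::token(create_token(ArithSyntaxKind::Minus, "-", 0)),
            ]),
            Expr::non_terminal(ArithNonTerminal::Term),
        ])),
        0,
        None,
    ),
]))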

Backend Hints

Provide hints to backends about how to handle rules:

use sipha::grammar::hint::PrecedenceHint;

.rule(MyNonTerminal::Expr, Expr::choice(vec![
    // Higher precedence
    Expr::seq(vec![
        Expr::non_terminal(MyNonTerminal::Term),
        Expr::token(multiply_token),
        Expr::non_terminal(MyNonTerminal::Term),
    ]).with_hint(PrecedenceHint::new(2)),
    // Lower precedence
    Expr::seq(vec![
        Expr::non_terminal(MyNonTerminal::Expr),
        Expr::token(plus_token),
        Expr::non_terminal(MyNonTerminal::Term),
    ]).with_hint(PrecedenceHint::new(1)),
]))

Grammar Analysis

Sipha provides tools for analyzing grammars:

  • Left recursion detection: Find rules with left recursion
  • Reachability analysis: Find unreachable rules
  • First/Follow sets: Compute FIRST and FOLLOW sets for LL parsing
  • Conflict detection: Find shift/reduce and reduce/reduce conflicts

See Grammar Analysis for more details.
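
The FIRST-set computation mentioned above is a classic fixed-point algorithm. The following self-contained sketch is deliberately independent of Sipha's types (this book does not show the analysis API) and illustrates the idea on a tiny right-recursive grammar:

use std::collections::{HashMap, HashSet};

// Terminals and non-terminals are plain strings here; a symbol appearing on a
// left-hand side counts as a non-terminal, and an empty right-hand side means
// epsilon (written as "" in the resulting sets).
fn first_sets(rules: &[(&str, Vec<&str>)]) -> HashMap<String, HashSet<String>> {
    let nonterminals: HashSet<&str> = rules.iter().map(|(lhs, _)| *lhs).collect();
    let mut first: HashMap<String, HashSet<String>> = HashMap::new();

    // Iterate to a fixed point: keep folding in symbols until nothing changes.
    let mut changed = true;
    while changed {
        changed = false;
        for (lhs, rhs) in rules {
            let mut derives_epsilon = true;
            let mut additions: HashSet<String> = HashSet::new();
            for symbol in rhs {
                if nonterminals.contains(symbol) {
                    let sym_first = first.get(*symbol).cloned().unwrap_or_default();
                    let has_epsilon = sym_first.contains("");
                    additions.extend(sym_first.into_iter().filter(|s| !s.is_empty()));
                    if !has_epsilon {
                        derives_epsilon = false;
                        break;
                    }
                } else {
                    additions.insert((*symbol).to_string());
                    derives_epsilon = false;
                    break;
                }
            }
            if derives_epsilon {
                additions.insert(String::new());
            }
            let entry = first.entry((*lhs).to_string()).or_default();
            let before = entry.len();
            entry.extend(additions);
            if entry.len() != before {
                changed = true;
            }
        }
    }
    first
}

// Expr -> Term ExprRest; ExprRest -> "+" Term ExprRest | epsilon; Term -> "num"
let rules = vec![
    ("Expr", vec!["Term", "ExprRest"]),
    ("ExprRest", vec!["+", "Term", "ExprRest"]),
    ("ExprRest", vec![]),
    ("Term", vec!["num"]),
];
for (nt, set) in first_sets(&rules) {
    println!("FIRST({nt}) = {set:?}");
}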

Best Practices

  1. Start simple: Begin with a simple grammar and add complexity gradually
  2. Use meaningful names: Choose clear names for non-terminals
  3. Factor common patterns: Extract common subexpressions into separate rules
  4. Handle precedence explicitly: Use precedence hints or grammar structure
  5. Test thoroughly: Test with various inputs, including edge cases
  6. Validate early: Validate grammars before using them in parsers

Next Steps

Continue to Parsing Basics to see how a parser puts your grammar to work.

See Also

  • Cheat Sheet - Quick reference for grammar expressions
  • Examples - Complete grammar examples
  • Glossary - Definitions of grammar-related terms

Parsing Basics

This chapter explains how parsing works in Sipha, from creating a parser to understanding parse results.

Overview

Parsing is the process of converting a stream of tokens into a syntax tree according to a grammar. Sipha provides multiple parsing backends, each with different characteristics and capabilities.

Creating a Parser

First, create a parser from a grammar:

use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;

let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)
    .expect("Failed to create parser");

Different backends have different configuration options:

  • LL parser: LlConfig with lookahead settings
  • LR parser: LrConfig with LALR vs canonical LR options
  • GLR parser: GlrConfig with ambiguity handling options

Parsing Input

Parse a stream of tokens:

let result = parser.parse(&tokens, MyNonTerminal::Expr);

The parse method takes:

  • Tokens: A slice of tokens from the lexer
  • Entry point: The non-terminal to start parsing from

Parse Results

The parse method returns a ParseResult:

pub struct ParseResult<T, N> {
    pub root: GreenNode<T::Kind>,
    pub errors: Vec<ParseError<T, N>>,
    pub warnings: Vec<ParseWarning<T, N>>,
    pub metrics: ParseMetrics,
    #[cfg(feature = "backend-glr")]
    pub forest: Option<ParseForest<T::Kind>>,
}

Root Node

The root field contains the syntax tree root. Even if there are errors, the parser attempts to produce a tree:

let root = SyntaxNode::new_root(result.root.clone());
println!("Root kind: {:?}", root.kind());

Errors

The errors field contains parsing errors:

if !result.errors.is_empty() {
    for error in &result.errors {
        eprintln!("Parse error at {:?}: {}", error.span, error.message);
    }
}

Common error types:

  • Unexpected token: Expected one token but found another
  • Unexpected EOF: Reached end of input unexpectedly
  • Missing token: Expected a token that wasn’t found
  • Invalid grammar: Grammar-related errors

Warnings

The warnings field contains non-fatal issues:

for warning in &result.warnings {
    eprintln!("Warning: {}", warning.message);
}

Metrics

The metrics field contains parsing statistics:

println!("Tokens consumed: {}", result.metrics.tokens_consumed);
println!("Nodes created: {}", result.metrics.nodes_created);
println!("Parse time: {:?}", result.metrics.parse_time);

Parse Forest (GLR)

When using the GLR backend, the forest field may contain multiple parse trees for ambiguous input:

#[cfg(feature = "backend-glr")]
if let Some(forest) = result.forest {
    // Handle ambiguity
    let disambiguated = forest.disambiguate(|alternatives| {
        // Choose one alternative
        alternatives.first().cloned()
    });
}

Error Recovery

Sipha parsers attempt to recover from errors and continue parsing:

  1. Synchronization tokens: Skip to known recovery points
  2. Delimited recovery: Skip to matching delimiters
  3. Best-effort parsing: Continue parsing despite errors

See Error Handling for more details.

Working with Syntax Trees

After parsing, work with the syntax tree:

use sipha::syntax::SyntaxNode;

let root = SyntaxNode::new_root(result.root.clone());

// Traverse children
for child in root.children() {
    println!("Child: {:?}", child.kind());
}

// Find specific nodes
if let Some(expr) = root.descendants().find(|n| n.kind() == MySyntaxKind::Expr) {
    println!("Found expression");
}

See Syntax Trees for more details.

Incremental Parsing

For interactive applications, use incremental parsing:

use sipha::incremental::{IncrementalParser, TextEdit};

let mut incremental_parser = IncrementalParser::new(parser);

// Initial parse
let result1 = incremental_parser.parse_incremental(
    &tokens,
    None,
    &[],
    MyNonTerminal::Expr,
    Some(&grammar),
);

// After an edit
let edits = vec![TextEdit::replace(range, "new text".into())];
let result2 = incremental_parser.parse_incremental(
    &new_tokens,
    Some(&result1.root),
    &edits,
    MyNonTerminal::Expr,
    Some(&grammar),
);

See Incremental Parsing for more details.

Backend Selection

Choose a backend based on your needs:

  • LL(k): Good for most grammars, supports left-recursion elimination
  • LR: Efficient for many grammar types, good error recovery
  • GLR: Handles ambiguous grammars, ideal for complex languages

See Choosing a Backend for guidance.

Complete Example

use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
use sipha::syntax::SyntaxNode;

// Create parser
let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)
    .expect("Failed to create parser");

// Parse
let result = parser.parse(&tokens, MyNonTerminal::Expr);

// Check for errors
if !result.errors.is_empty() {
    eprintln!("Parsing failed with {} errors", result.errors.len());
    for error in &result.errors {
        eprintln!("  {:?}", error);
    }
}

// Work with syntax tree
let root = SyntaxNode::new_root(result.root.clone());
println!("Parse successful! Root: {:?}", root.kind());
println!("Created {} nodes in {:?}", 
    result.metrics.nodes_created, 
    result.metrics.parse_time);

Next Steps

Continue to the Incremental Parsing chapters that follow to dig into Sipha's core feature.

Incremental Parsing Overview

Incremental parsing is Sipha’s core strength. It allows you to efficiently update your parse tree when code changes, rather than re-parsing everything from scratch.

Why Incremental Parsing?

In interactive applications like IDEs, users edit code frequently. Traditional parsers re-parse the entire file on every change, which can be slow for large files. Incremental parsing:

  • Reuses unchanged subtrees from previous parses
  • Only re-parses affected regions instead of the entire file
  • Maintains parse caches for fast updates
  • Scales to large codebases efficiently

This makes Sipha perfect for building language servers, IDEs, and other tools that need to parse code as users type.

Benefits

Performance

Incremental parsing provides significant performance improvements:

  • Fast updates: Only re-parse changed regions
  • Memory efficient: Shared tree representation
  • Scalable: Handles large files efficiently
  • IDE-ready: Designed for interactive editing scenarios

User Experience

For interactive applications, incremental parsing means:

  • Responsive editing: No lag when typing
  • Real-time feedback: Errors and diagnostics update instantly
  • Smooth experience: Large files don’t slow down the editor

How It Works (High Level)

  1. Initial parse: Parse the entire file and cache results
  2. Edit detection: Identify which regions changed
  3. Node reuse: Find unchanged subtrees from previous parse
  4. Incremental re-parse: Only re-parse affected regions
  5. Tree reconstruction: Combine reused nodes with new parse results

Current Status

Incremental parsing is fully implemented with complete node reuse and cache management:

  • Node reuse: Unchanged subtrees are automatically identified and reused
  • Cache population: Parse results are cached for future reuse
  • Affected range computation: Only affected regions are re-parsed
  • Smart invalidation: Cache entries are invalidated based on edit locations

The parser automatically integrates reusable nodes from previous parses, providing significant performance improvements for interactive editing scenarios.

When to Use Incremental Parsing

Use incremental parsing when:

  • Building language servers or IDEs
  • Creating interactive code editors
  • Building tools that need real-time syntax analysis
  • Parsing large files that change frequently
  • Providing live error checking and diagnostics

For batch parsing (one-time parsing of files), regular parsing is sufficient and may be faster due to less overhead.

Next Steps

The following chapters explain how incremental parsing works, how to use it, and how it is implemented.

How Incremental Parsing Works

This chapter explains the internal mechanisms of incremental parsing in Sipha.

Overview

Incremental parsing works by:

  1. Tracking edits: Identifying which parts of the text changed
  2. Computing affected regions: Determining minimal regions to re-parse
  3. Finding reusable nodes: Identifying unchanged subtrees from previous parse
  4. Re-parsing selectively: Only parsing affected regions
  5. Reconstructing the tree: Combining reused nodes with new parse results

Incremental Parsing Flow

flowchart TD
    Start[Text Edit] --> Track[Track Edit]
    Track --> Compute[Compute Affected Region]
    Compute --> Find[Find Reusable Nodes]
    Find --> Check{Reusable<br/>Nodes Found?}
    Check -->|Yes| Reuse[Reuse Unchanged Subtrees]
    Check -->|No| ParseAll[Parse Affected Region]
    Reuse --> ParseGaps[Parse Gaps Only]
    ParseGaps --> Combine[Combine Results]
    ParseAll --> Combine
    Combine --> Update[Update Parse Cache]
    Update --> Reconstruct[Reconstruct Tree]
    Reconstruct --> End[New Syntax Tree]
    
    style Start fill:#e1f5ff,color:#000000
    style Reuse fill:#c8e6c9,color:#000000
    style ParseGaps fill:#fff9c4,color:#000000
    style Combine fill:#ffccbc,color:#000000
    style End fill:#c8e6c9,color:#000000

Edit Tracking

When text is edited, Sipha tracks the changes using TextEdit:

use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};

let edit = TextEdit::replace(
    TextRange::new(TextSize::from(10), TextSize::from(20)),
    "new text".into(),
);

Edits can be:

  • Replacements: Replace a range with new text
  • Insertions: Insert text at a position
  • Deletions: Delete a range

Affected Region Computation

Sipha computes the minimal region that needs to be re-parsed:

let affected = AffectedRegion::from_edits(&edits, Some(total_len));

The affected region:

  • Union: The smallest range covering all edits
  • Expanded: The union plus padding for context (default 16 characters)

The expansion ensures that context-dependent parsing (like operator precedence) works correctly.
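
As a rough worked example (a sketch: the export path of AffectedRegion and the type of the total-length argument are assumptions based on the call above):

use sipha::incremental::{AffectedRegion, TextEdit};
use sipha::syntax::{TextRange, TextSize};

// Replace bytes 100..105 of a 1000-byte file.
let edits = vec![TextEdit::replace(
    TextRange::new(TextSize::from(100), TextSize::from(105)),
    "hello".into(),
)];
let affected = AffectedRegion::from_edits(&edits, Some(TextSize::from(1_000)));

// With the default 16-character padding, the expanded range covers roughly
// bytes 84..121 instead of just the edited 100..105.
println!("expanded range: {:?}", affected.expanded());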

Node Reuse

Sipha identifies nodes from the previous parse that can be reused:

let reusable = find_reusable_nodes(old_tree, affected.expanded(), budget);

A node is reusable if:

  • It doesn’t intersect with the affected region
  • It matches the expected syntax kind (if specified)
  • It’s within the reuse budget

Reuse Budget

The reuse budget limits how many nodes to consider for reuse:

use sipha::incremental::ReuseBudget;

// Fixed budget
let budget = ReuseBudget::Fixed(1000);

// Heuristic budget (default)
let budget = ReuseBudget::Heuristic {
    max_depth: 20,
    max_nodes: 1000,
};

The heuristic budget considers:

  • Tree depth: Deeper trees allow more candidates
  • File size: Larger files allow more candidates
  • Affected region size: Smaller affected regions allow more candidates

Incremental Session

An IncrementalSession captures everything needed for incremental parsing:

use sipha::incremental::IncrementalSession;

let session = IncrementalSession::new(old_tree, &edits);

The session provides:

  • Old tree: Previous parse tree
  • Edits: Text changes
  • Affected region: Computed affected range
  • Reusable nodes: List of nodes that can be reused
  • Cache access: Optional persistent parse cache

Parse Cache

Sipha maintains a persistent parse cache for cross-session node reuse:

let session = IncrementalSession::with_cache(old_tree, &edits, &cache);

The cache:

  • Keys: Rule name, start position, version
  • Values: Cached parse nodes
  • Versioning: Invalidates old entries based on version
  • Eviction: Removes old versions automatically

Cache population happens automatically when using parse_incremental with a grammar:

let result = parser.parse_incremental(
    &tokens,
    old_tree,
    &edits,
    entry_point,
    Some(&grammar), // Cache is populated automatically
);

Re-parsing Process

When re-parsing incrementally:

  1. Find reusable nodes: Query the session for reusable nodes at each position
  2. Parse gaps: Re-parse only the regions not covered by reusable nodes
  3. Combine results: Merge reused nodes with newly parsed nodes
  4. Update cache: Store new parse results in the cache

Tree Reconstruction

The final tree is reconstructed by:

  1. Starting from root: Begin with the old tree root
  2. Replacing affected subtrees: Replace subtrees in affected regions
  3. Inserting new nodes: Add newly parsed nodes
  4. Preserving structure: Maintain tree structure outside affected regions

Performance Characteristics

Incremental parsing performance depends on:

  • Edit size: Smaller edits = better performance
  • Tree structure: Deeper trees = more reuse opportunities
  • Cache hit rate: Higher hit rate = better performance
  • Affected region size: Smaller regions = faster re-parsing

Typical performance improvements:

  • Small edits (1-10 characters): 10-100x faster
  • Medium edits (10-100 characters): 5-20x faster
  • Large edits (100+ characters): 2-5x faster

Limitations

Incremental parsing has some limitations:

  • Context sensitivity: Some parsing decisions depend on distant context
  • Error recovery: Error recovery may require re-parsing larger regions
  • Ambiguity: Ambiguous grammars may require full re-parse
  • Cache overhead: Cache management has some overhead

Next Steps

Continue to Usage to apply these mechanisms in your own code.

Incremental Parsing Usage

This chapter shows how to use incremental parsing in your applications.

Basic Usage

Initial Parse

Start with an initial parse:

use sipha::incremental::IncrementalParser;
use sipha::backend::ll::{LlParser, LlConfig};

let config = LlConfig::default();
let parser = LlParser::new(&grammar, config)
    .expect("Failed to create parser");

let mut incremental_parser = IncrementalParser::new(parser);

// Initial parse
let result1 = incremental_parser.parse_incremental(
    &tokens,
    None,        // No old tree
    &[],         // No edits
    MyNonTerminal::Expr,
    Some(&grammar), // Populate cache
);

After an Edit

When text is edited, re-parse incrementally:

use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};

// User changed "42" to "100" at position 0
let edits = vec![TextEdit::replace(
    TextRange::new(TextSize::from(0), TextSize::from(2)),
    "100".into(),
)];

// Re-tokenize (or update tokens incrementally)
let new_tokens = lexer.tokenize(&new_text)
    .expect("Failed to tokenize");

// Incremental re-parse
let result2 = incremental_parser.parse_incremental(
    &new_tokens,
    Some(&result1.root), // Previous tree
    &edits,              // Text edits
    MyNonTerminal::Expr,
    Some(&grammar),      // Update cache
);

Text Edits

Creating Edits

use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};

// Replace a range
let edit1 = TextEdit::replace(
    TextRange::new(TextSize::from(10), TextSize::from(20)),
    "new text".into(),
);

// Insert at a position
let edit2 = TextEdit::insert(
    TextSize::from(5),
    "inserted".into(),
);

// Delete a range
let edit3 = TextEdit::delete(
    TextRange::new(TextSize::from(0), TextSize::from(5)),
);

Multiple Edits

Handle multiple edits at once:

let edits = vec![
    TextEdit::replace(range1, "text1".into()),
    TextEdit::replace(range2, "text2".into()),
];

let result = incremental_parser.parse_incremental(
    &tokens,
    Some(&old_tree),
    &edits,
    entry_point,
    Some(&grammar),
);

Reuse Budget

Control how many nodes to consider for reuse:

use sipha::incremental::{IncrementalSession, ReuseBudget};

// Fixed budget
let budget = ReuseBudget::Fixed(500);
let session = IncrementalSession::new_with_budget(
    Some(&old_tree),
    &edits,
    budget,
);

// Heuristic budget (default)
let budget = ReuseBudget::Heuristic {
    max_depth: 20,
    max_nodes: 1000,
};

Cache Management

Automatic Cache Population

The cache is populated automatically when you provide a grammar:

let result = incremental_parser.parse_incremental(
    &tokens,
    old_tree,
    &edits,
    entry_point,
    Some(&grammar), // Cache populated automatically
);

Manual Cache Access

Access the cache directly:

let session = IncrementalSession::with_cache(
    Some(&old_tree),
    &edits,
    &incremental_parser.cache, // Access cache
);

Backend Integration

Incremental parsing works with all backends that support it:

// LL parser
let ll_parser = LlParser::new(&grammar, LlConfig::default())?;
let mut incremental = IncrementalParser::new(ll_parser);

// GLR parser (requires the backend-glr feature)
let glr_parser = GlrParser::new(&grammar, GlrConfig::default())?;
let mut incremental = IncrementalParser::new(glr_parser);

Error Handling

Handle errors during incremental parsing:

let result = incremental_parser.parse_incremental(
    &tokens,
    Some(&old_tree),
    &edits,
    entry_point,
    Some(&grammar),
);

if !result.errors.is_empty() {
    // Handle errors
    for error in &result.errors {
        eprintln!("Error: {:?}", error);
    }
}

Best Practices

  1. Always provide grammar: This enables cache population
  2. Batch edits: Process multiple edits together when possible
  3. Reuse old tree: Always pass the previous tree for best performance
  4. Monitor metrics: Check ParseMetrics to understand performance
  5. Handle errors gracefully: Incremental parsing may produce errors

Example: Language Server

Here’s a complete example for a language server:

use sipha::incremental::{IncrementalParser, TextEdit};
use sipha::syntax::{TextRange, TextSize};

struct LanguageServer {
    parser: IncrementalParser<LlParser, MyToken, MyNonTerminal>,
    grammar: Grammar<MyToken, MyNonTerminal>,
    lexer: CompiledLexer<MySyntaxKind>,
    current_tree: Option<GreenNode<MySyntaxKind>>,
}

impl LanguageServer {
    fn new() -> Self {
        let grammar = build_grammar();
        let parser = LlParser::new(&grammar, Default::default()).unwrap();
        let lexer = build_lexer();
        
        Self {
            parser: IncrementalParser::new(parser),
            grammar,
            lexer,
            current_tree: None,
        }
    }
    
    fn on_text_changed(&mut self, edits: Vec<TextEdit>, new_text: &str) {
        // Tokenize new text
        let tokens = self.lexer.tokenize(new_text).unwrap();
        
        // Incremental parse
        let result = self.parser.parse_incremental(
            &tokens,
            self.current_tree.as_ref(),
            &edits,
            MyNonTerminal::Expr,
            Some(&self.grammar),
        );
        
        // Update current tree
        self.current_tree = Some(result.root.clone());
        
        // Report errors
        if !result.errors.is_empty() {
            self.report_errors(&result.errors);
        }
    }
}

Next Steps

See Implementation Details for a closer look at node reuse and cache management.

Incremental Parsing Implementation Details

This chapter covers advanced implementation details for incremental parsing.

Architecture

Incremental parsing consists of several components:

IncrementalParser

The IncrementalParser wraps a backend parser and manages the parse cache:

pub struct IncrementalParser<P, T, N> {
    parser: P,
    cache: ParseCache<T::Kind>,
    _phantom: PhantomData<(T, N)>,
}

IncrementalSession

The IncrementalSession provides context for incremental parsing:

pub struct IncrementalSession<'a, K: SyntaxKind> {
    old_tree: Option<&'a GreenNode<K>>,
    edits: &'a [TextEdit],
    affected: AffectedRegion,
    reusable: Vec<ReuseCandidate<K>>,
    reusable_by_pos: HashMap<TextSize, Vec<usize>>,
    cache: Option<&'a ParseCache<K>>,
}

ParseCache

The ParseCache stores parse results for reuse:

pub struct ParseCache<K: SyntaxKind> {
    version: usize,
    nodes: HashMap<CacheKey, Arc<GreenNode<K>>>,
    interner: Rodeo,
}

Node Reuse Algorithm

The node reuse algorithm works as follows:

  1. Compute affected region: Determine minimal region to re-parse
  2. Expand region: Add padding for context
  3. Find reusable nodes: Traverse old tree, collecting nodes outside affected region
  4. Budget filtering: Limit candidates based on budget
  5. Index by position: Build position index for fast lookup

Finding Reusable Nodes

fn find_reusable_nodes<K>(
    root: &GreenNode<K>,
    affected: TextRange,
    budget: usize,
) -> Vec<ReuseCandidate<K>> {
    let mut reusable = Vec::new();
    visit_green_spans(root, |span| {
        if reusable.len() >= budget {
            return;
        }
        
        if span.depth == 0 || span.range.intersect(affected).is_some() {
            return;
        }
        
        if let Some(node_arc) = span.node_arc.clone() {
            reusable.push(ReuseCandidate {
                node: node_arc,
                range: span.range,
                depth: span.depth,
            });
        }
    });
    
    reusable
}

Cache Key Structure

Cache keys consist of:

  • Rule name: Interned string for the non-terminal
  • Start position: Text position where parsing started
  • Version: Cache version for invalidation

struct CacheKey {
    rule: Spur,        // Interned rule name
    start: TextSize,   // Start position
    version: usize,     // Cache version
}

Version Management

The cache uses versioning for invalidation:

  1. Increment version: On each parse, increment cache version
  2. Evict old entries: Remove entries from old versions
  3. Keep recent versions: Maintain last N versions (default: 2)

fn evict_old_entries(&mut self, max_versions: usize) {
    let min_version = self.version.saturating_sub(max_versions);
    self.nodes.retain(|key, _| key.version >= min_version);
}

Backend Integration

Backends implement incremental parsing via parse_with_session:

fn parse_with_session(
    &mut self,
    input: &[T],
    entry: N,
    session: &IncrementalSession<'_, T::Kind>,
) -> ParseResult<T, N> {
    // Use session to find reusable nodes
    // Re-parse only affected regions
    // Combine reused and new nodes
}

Performance Optimizations

Position Indexing

Reusable nodes are indexed by position for O(1) lookup:

let mut reusable_by_pos = HashMap::new();
for (idx, candidate) in reusable.iter().enumerate() {
    reusable_by_pos
        .entry(candidate.range.start())
        .or_insert_with(Vec::new)
        .push(idx);
}

Budget Calculation

The heuristic budget considers multiple factors:

fn calculate(&self, tree_depth: usize, file_size: TextSize, affected_size: TextSize) -> usize {
    // `max_depth` and `max_nodes` are the limits carried by the Heuristic budget variant.

    // Depth factor: deeper trees = more candidates
    let depth_factor = (tree_depth.min(max_depth) * 10).max(50);

    // File size factor: larger files = more candidates
    let size_factor = ((u32::from(file_size) / 100) as usize).min(max_nodes / 2);

    // Affected region factor: smaller regions = more candidates
    let affected_ratio = f64::from(u32::from(affected_size)) / f64::from(u32::from(file_size));
    let affected_factor = ((1.0 - affected_ratio) * 200.0) as usize;

    (depth_factor + size_factor + affected_factor)
        .min(max_nodes)
        .max(50)
}

Limitations and Trade-offs

Context Sensitivity

Some parsing decisions depend on distant context, requiring larger re-parse regions:

  • Operator precedence: May need parent context
  • Type inference: May need sibling context
  • Scoping: May need ancestor context

Error Recovery

Error recovery may require re-parsing larger regions:

  • Synchronization: May skip to distant recovery points
  • Best-effort: May need full re-parse for complex errors

Cache Overhead

Cache management has overhead:

  • Memory: Cache uses memory for stored nodes
  • Lookup: Cache lookups have some overhead
  • Eviction: Periodic eviction has CPU cost

Future Improvements

Potential improvements:

  • Incremental lexing: Extend incremental capabilities to lexer
  • Smarter invalidation: More precise cache invalidation
  • Parallel parsing: Parse multiple regions in parallel
  • Predictive reuse: Predict which nodes will be reusable

Next Steps

  • See Usage for API examples
  • Check Examples for complete examples

Parsing Backends Overview

Sipha supports multiple parsing algorithms via feature flags. Each backend has different characteristics and is suited for different use cases.

Available Backends

LL(k) Parser (backend-ll)

Top-down predictive parsing with configurable lookahead:

  • Supports left-recursion elimination
  • Configurable error recovery
  • Incremental parsing support
  • Good for most grammars

LR Parser (backend-lr)

Bottom-up shift-reduce parsing:

  • Efficient for many grammar types
  • Good error recovery
  • Supports LALR and canonical LR
  • Table-based parsing

GLR Parser (backend-glr)

Generalized LR parsing for ambiguous grammars:

  • Handles non-deterministic and ambiguous grammars
  • Parse forest representation for ambiguity tracking
  • Configurable disambiguation strategies
  • Incremental parsing support
  • Ideal for complex languages like C++

PEG Parser (backend-peg)

Parsing Expression Grammar with ordered choice and memoization:

  • Ordered choice semantics (first match wins)
  • Packrat parsing with memoization for O(n) performance
  • Backtracking with configurable depth limits
  • Incremental parsing support
  • Ideal for precedence-based languages and interactive tools

ParserBackend Trait

All backends implement the ParserBackend trait:

pub trait ParserBackend<T, N>: Sized + Send
where
    T: Token,
    N: NonTerminal,
{
    type Config: Default + Clone;
    type Error: std::error::Error + Send + Sync + 'static;
    type State: Send + Sync;

    fn new(grammar: &Grammar<T, N>, config: Self::Config) -> Result<Self, Self::Error>;
    fn parse(&mut self, input: &[T], entry: N) -> ParseResult<T, N>;
    fn parse_incremental(...) -> ParseResult<T, N>;
    fn parse_with_session(...) -> ParseResult<T, N>;
    fn validate(grammar: &Grammar<T, N>) -> Vec<GrammarError<T, N>>;
    fn capabilities() -> BackendCapabilities;
    fn state(&self) -> &Self::State;
}

Backend Capabilities

Each backend reports its capabilities:

pub struct BackendCapabilities {
    pub name: &'static str,
    pub algorithm: Algorithm,
    pub supports_left_recursion: bool,
    pub supports_ambiguity: bool,
    pub supports_incremental: bool,
    pub supports_error_recovery: bool,
    pub max_lookahead: Option<usize>,
}
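
A tool can inspect these capabilities before committing to a backend. The helper below is a minimal sketch built only on the trait shown above; the specific capabilities it demands are illustrative:

fn check_backend<B, T, N>() -> Result<(), String>
where
    B: ParserBackend<T, N>,
    T: Token,
    N: NonTerminal,
{
    // Query the backend's self-reported capabilities.
    let caps = B::capabilities();
    if caps.supports_incremental && caps.supports_error_recovery {
        Ok(())
    } else {
        Err(format!("backend {} lacks required capabilities", caps.name))
    }
}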

Choosing a Backend

flowchart TD
    Start[Choose Backend] --> Ambiguous{Grammar<br/>Ambiguous?}
    Ambiguous -->|Yes| GLR[Use GLR Parser]
    Ambiguous -->|No| UseCase{Primary<br/>Use Case?}
    
    UseCase -->|Expression Parser<br/>Precedence-based| PEG1[Use PEG Parser]
    UseCase -->|Interactive Tools<br/>IDEs/Editors| Interactive{Need<br/>Memoization?}
    UseCase -->|Batch Processing| LR[Use LR Parser]
    UseCase -->|Complex Language| GLR2[Use GLR Parser]
    UseCase -->|Simple Grammar| LL[Use LL Parser]
    
    Interactive -->|Yes| PEG2[Use PEG Parser]
    Interactive -->|No| LL2[Use LL Parser]
    
    GLR --> End[Selected Backend]
    GLR2 --> End
    LR --> End
    LL --> End
    LL2 --> End
    PEG1 --> End
    PEG2 --> End
    
    style GLR fill:#ffccbc,color:#000000
    style GLR2 fill:#ffccbc,color:#000000
    style LR fill:#fff9c4,color:#000000
    style LL fill:#c8e6c9,color:#000000
    style LL2 fill:#c8e6c9,color:#000000
    style PEG1 fill:#e1bee7,color:#000000
    style PEG2 fill:#e1bee7,color:#000000
    style End fill:#e1f5ff,color:#000000

See Choosing a Backend for detailed guidance on selecting the right backend for your use case.

Next Steps

See Also

LL(k) Parser

The LL(k) parser is a top-down predictive parser with configurable lookahead.

Overview

LL(k) parsing works by:

  1. Predicting: Using lookahead to predict which production to use
  2. Matching: Matching tokens against predicted productions
  3. Recursing: Recursively parsing subexpressions

Configuration

Configure the LL parser with LlConfig:

use sipha::backend::ll::{LlParser, LlConfig};

let config = LlConfig {
    lookahead: 1,              // k value for LL(k)
    left_recursion: true,      // Enable left-recursion elimination
    error_recovery: true,      // Enable error recovery
    ..Default::default()
};

let mut parser = LlParser::new(&grammar, config)
    .expect("Failed to create parser");

Lookahead

The lookahead parameter controls how many tokens to look ahead:

  • LL(1): One token of lookahead (most common)
  • LL(k): k tokens of lookahead, for grammars where a single token is not enough to choose a production

Higher lookahead values:

  • Distinguish between more similar-looking alternatives
  • Require larger parse tables
  • May be slower

Left Recursion

Left recursion elimination allows parsing left-recursive grammars:

// Without elimination: Expr -> Expr + Term (left-recursive)
// With elimination: Automatically transformed
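
Conceptually, the standard elimination rewrites the rule into a right-recursive form with a helper rule (the rule names below are illustrative, not the exact output of the transformation):

// Original (left-recursive):
//   Expr  -> Expr + Term | Term
//
// After elimination:
//   Expr  -> Term Expr'
//   Expr' -> + Term Expr' | ε
//
// "1 + 2 + 3" still parses, but the parser no longer recurses on Expr first.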

Error Recovery

Error recovery strategies:

  • Synchronization tokens: Skip to known recovery points
  • Delimited recovery: Skip to matching delimiters
  • Best-effort parsing: Continue parsing despite errors

Usage

use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;

let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)
    .expect("Failed to create parser");

let result = parser.parse(&tokens, MyNonTerminal::Expr);

Incremental Parsing

The LL parser supports incremental parsing:

use sipha::incremental::IncrementalParser;

let mut incremental = IncrementalParser::new(parser);
let result = incremental.parse_incremental(
    &tokens,
    old_tree,
    &edits,
    entry_point,
    Some(&grammar),
);

Grammar Requirements

LL parsers require:

  • No left recursion (unless elimination is enabled)
  • No ambiguity: alternatives must be distinguishable within the configured k tokens of lookahead
  • FIRST/FOLLOW sets: Computed automatically

Parse Table

The LL parser builds a parse table:

  • Rows: Non-terminals
  • Columns: Terminals (FIRST sets)
  • Entries: Production rules to apply
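
Conceptually, an LL(1) table maps a (non-terminal, lookahead terminal) pair to the production to apply. The sketch below illustrates the idea; it is not Sipha's internal representation:

use std::collections::HashMap;

// Index of a production alternative within the grammar (illustrative).
type Production = usize;

struct Ll1Table<N, T> {
    entries: HashMap<(N, T), Production>,
}

impl<N, T> Ll1Table<N, T>
where
    N: Eq + std::hash::Hash,
    T: Eq + std::hash::Hash,
{
    // Predict which production to apply for `nt` given the next token.
    fn predict(&self, nt: N, lookahead: T) -> Option<Production> {
        self.entries.get(&(nt, lookahead)).copied()
    }
}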

Performance

LL parsing performance:

  • Time: O(n) where n is input length
  • Space: O(grammar size) for parse table
  • Incremental: Fast for small edits

When to Use

Use LL parser when:

  • Grammar is LL(k) compatible
  • Need predictable performance
  • Want simple error messages
  • Building interactive tools

Limitations

LL parser limitations:

  • Left recursion: Requires elimination (may change parse tree)
  • Ambiguity: Requires higher lookahead
  • Left associativity: May require grammar transformation

Next Steps

LR Parser

The LR parser is a bottom-up shift-reduce parser that builds parse trees from leaves to root.

Overview

LR parsing works by:

  1. Shifting: Moving tokens onto the stack
  2. Reducing: Replacing matched rules with non-terminals
  3. Building: Constructing the parse tree bottom-up

Configuration

Configure the LR parser with LrConfig:

use sipha::backend::lr::{LrParser, LrConfig};

let config = LrConfig {
    use_lalr: true,        // Use LALR instead of canonical LR
    error_recovery: true,  // Enable error recovery
    ..Default::default()
};

let mut parser = LrParser::new(&grammar, config)
    .expect("Failed to create parser");

LALR vs Canonical LR

  • LALR: Smaller tables, faster, handles most grammars
  • Canonical LR: Larger tables, more precise, handles edge cases

For most grammars, LALR is sufficient and faster.

Usage

use sipha::backend::lr::{LrParser, LrConfig};
use sipha::backend::ParserBackend;

let config = LrConfig::default();
let mut parser = LrParser::new(&grammar, config)
    .expect("Failed to create parser");

let result = parser.parse(&tokens, MyNonTerminal::Expr);

Parse Table

The LR parser builds a parse table:

  • States: Parser states (items)
  • Actions: Shift, reduce, accept, error
  • Gotos: State transitions for non-terminals

Table Construction

  1. Build items: Create LR items from grammar
  2. Build states: Group items into states
  3. Compute actions: Determine shift/reduce actions
  4. Resolve conflicts: Handle shift/reduce and reduce/reduce conflicts
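
Conceptually, each table cell holds one of the following actions (a simplified sketch, not Sipha's internal types):

// States are numbered; productions are indexed into the grammar.
enum Action {
    Shift(usize),   // consume the token and move to the given state
    Reduce(usize),  // apply the given production to the top of the stack
    Accept,         // input fully parsed
    Error,          // no valid action: report an error or recover
}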

Grammar Requirements

LR parsers require:

  • No ambiguity: Must be unambiguous (or use GLR)
  • No conflicts: No shift/reduce or reduce/reduce conflicts

Error Recovery

LR parsers support error recovery:

  • Synchronization: Skip to known recovery points
  • Delimited recovery: Skip to matching delimiters
  • Best-effort: Continue parsing despite errors

Performance

LR parsing performance:

  • Time: O(n) where n is input length
  • Space: O(grammar size) for parse table
  • Table size: LALR tables are smaller than canonical LR

When to Use

Use LR parser when:

  • Grammar is LR compatible
  • Need efficient parsing
  • Want good error recovery
  • Building batch parsers

Limitations

LR parser limitations:

  • Ambiguity: Cannot handle ambiguous grammars (use GLR)
  • Table size: Large grammars may have large tables
  • Left recursion: Must be handled in grammar

Next Steps

GLR Parser

The GLR (Generalized LR) parser extends LR parsing to handle ambiguous grammars by maintaining multiple parser stacks and forking on conflicts.

Overview

GLR parsing works by:

  1. Building LR table: Uses LR(1) state machine
  2. Maintaining stacks: Keeps multiple parser stacks
  3. Forking on conflicts: Creates new stacks for ambiguous choices
  4. Merging stacks: Merges stacks when they converge
  5. Returning forest: Returns parse forest for ambiguous results

When to Use GLR

Use the GLR backend when:

  • Your grammar has inherent ambiguities (e.g., C++ template syntax)
  • You need to handle multiple valid parse trees
  • You want to disambiguate at parse time using precedence/associativity rules
  • Parsing complex languages with inherent ambiguities

Configuration

Configure the GLR parser with GlrConfig:

use sipha::backend::glr::{GlrParser, GlrConfig};

let config = GlrConfig {
    max_stacks: 100,           // Maximum number of stacks
    disambiguation: None,      // Disambiguation strategy
    ..Default::default()
};

let mut parser = GlrParser::new(&grammar, config)
    .expect("Failed to create GLR parser");

Usage

use sipha::backend::glr::{GlrParser, GlrConfig};
use sipha::backend::ParserBackend;

let config = GlrConfig::default();
let mut parser = GlrParser::new(&grammar, config)
    .expect("Failed to create GLR parser");

let result = parser.parse(&tokens, entry_point);

Handling Ambiguity

GLR parsers can produce parse forests when multiple valid parse trees exist:

if let Some(forest) = result.forest {
    // Multiple parse trees exist - disambiguate
    let disambiguated = forest.disambiguate(|alternatives| {
        // Custom disambiguation logic
        alternatives.first().cloned()
    });
}

Disambiguation

Sipha provides several disambiguation strategies:

Precedence-based

Resolve conflicts using operator precedence:

use sipha::backend::glr::disambiguate_by_precedence;

let disambiguated = forest.disambiguate(disambiguate_by_precedence(|op| {
    match op {
        MySyntaxKind::Multiply | MySyntaxKind::Divide => 2,
        MySyntaxKind::Plus | MySyntaxKind::Minus => 1,
        _ => 0,
    }
}));

Associativity-based

Resolve conflicts using operator associativity:

use sipha::backend::glr::disambiguate_by_associativity;

let disambiguated = forest.disambiguate(disambiguate_by_associativity(|op| {
    match op {
        MySyntaxKind::Minus => Associativity::Left,
        MySyntaxKind::Power => Associativity::Right,
        _ => Associativity::None,
    }
}));

Custom Strategies

Implement your own disambiguation logic:

let disambiguated = forest.disambiguate(|alternatives| {
    // Choose based on custom criteria
    alternatives
        .iter()
        .min_by_key(|tree| compute_complexity(tree))
        .cloned()
});

Parse Forest

A parse forest represents multiple parse trees:

pub struct ParseForest<K: SyntaxKind> {
    roots: Vec<Arc<GreenNode<K>>>,
}

The forest can be:

  • Queried: Find all parse trees
  • Disambiguated: Choose one tree
  • Traversed: Visit all alternatives
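
For example, the alternatives can be inspected before deciding how to disambiguate (a minimal sketch using the roots field shown above):

if let Some(forest) = result.forest {
    println!("{} alternative parse trees", forest.roots.len());
    for (i, root) in forest.roots.iter().enumerate() {
        // Each root is one complete alternative parse of the input;
        // wrapping it as a red tree makes it easy to inspect.
        let red = SyntaxNode::new_root(root.clone());
        println!("alternative #{} root kind: {:?}", i, red.kind());
    }
}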

Incremental Parsing

The GLR parser supports incremental parsing:

use sipha::incremental::IncrementalParser;

let mut incremental = IncrementalParser::new(parser);
let result = incremental.parse_incremental(
    &tokens,
    old_tree,
    &edits,
    entry_point,
    Some(&grammar),
);

Performance

GLR parsing performance:

  • Time: O(n³) in worst case (ambiguous grammars)
  • Space: O(n²) for parse forest
  • Stacks: Number of stacks can grow with ambiguity

For unambiguous grammars, GLR performs similarly to LR.

Grammar Requirements

GLR parsers can handle:

  • Ambiguous grammars: Multiple valid parse trees
  • Non-deterministic grammars: Grammars with conflicts
  • Complex languages: Languages like C++ with inherent ambiguities

Limitations

GLR parser limitations:

  • Performance: Slower for highly ambiguous grammars
  • Memory: Parse forests can be large
  • Complexity: More complex than LL or LR

Example: Ambiguous Expression

// Grammar: Expr -> Expr + Expr | Expr * Expr | Number
// Input: "1 + 2 * 3"
// Ambiguous: Could be (1 + 2) * 3 or 1 + (2 * 3)

let result = glr_parser.parse(&tokens, MyNonTerminal::Expr);

if let Some(forest) = result.forest {
    // Two parse trees exist
    assert_eq!(forest.roots.len(), 2);
    
    // Disambiguate using precedence
    let disambiguated = forest.disambiguate(disambiguate_by_precedence(|op| {
        match op {
            MySyntaxKind::Multiply => 2,
            MySyntaxKind::Plus => 1,
            _ => 0,
        }
    }));
    
    // Result: 1 + (2 * 3) = 7
}

Next Steps

PEG Parser

The PEG (Parsing Expression Grammar) parser uses ordered choice and backtracking with memoization (packrat parsing) for efficient parsing.

Overview

PEG parsing works by:

  1. Ordered Choice: Try alternatives in order, first match wins
  2. Backtracking: If an alternative fails, backtrack and try the next
  3. Memoization: Cache parse results for O(n) performance (packrat parsing)
  4. Greedy Matching: Always match as much as possible

Understanding PEG Grammars

PEG (Parsing Expression Grammar) is a type of formal grammar that describes a language in terms of parsing rules rather than generation rules. Unlike traditional CFG (Context-Free Grammar) parsers, PEG parsers use ordered choice semantics, making them deterministic and unambiguous.

Key Concepts

Ordered Choice (A | B)

In PEG, the choice operator | is ordered and greedy:

  • Alternatives are tried left to right
  • The first matching alternative wins
  • Once an alternative matches, no other alternatives are tried
  • This makes PEG grammars unambiguous by definition

Example:

// Grammar: Expr -> Number | Number
// In CFG: this is ambiguous (both alternatives can match the same input)
// In PEG: the first alternative wins whenever it matches; the second is
//         only tried if the first fails

Expr::Choice(vec![
    Expr::token(Number("42")),  // Tried first; wins on input "42"
    Expr::token(Number("99")),  // Only tried if the first alternative fails
])

Greedy Repetition (A*, A+)

PEG repetition is greedy - it matches as much as possible:

  • A* matches zero or more As (as many as possible)
  • A+ matches one or more As (as many as possible)
  • Repetition always tries to match more before trying less

Example:

// Grammar: Expr -> Number*
// Input: "123"
// Matches: All three digits (greedy)

Expr::Repeat {
    expr: Box::new(Expr::token(Number("1"))),
    min: 0,
    max: None,
}

Backtracking

When an alternative fails, the parser backtracks to try the next alternative:

  • Position is restored to where the alternative started
  • Previous matches are discarded
  • Next alternative is tried from the same position

Example:

// Grammar: Expr -> "ab" | "ac"
// Input: "ac"
// Process:
//   1. Try "ab" - matches "a", fails on "b"
//   2. Backtrack to start
//   3. Try "ac" - matches successfully

Key Differences from CFG

PEG parsers differ from traditional CFG parsers in fundamental ways:

| Feature | CFG | PEG |
|---|---|---|
| Choice | Unordered (all alternatives considered) | Ordered (first match wins) |
| Ambiguity | Can be ambiguous | Always unambiguous |
| Repetition | May be non-greedy | Always greedy |
| Left Recursion | Natural | Requires special handling |
| Determinism | May be non-deterministic | Always deterministic |

Ordered Choice Example

// CFG: Expr -> Number | Number
//      This is ambiguous - both alternatives match the same input
//      Parser must use disambiguation rules

// PEG: Expr -> Number | Number  
//      This is unambiguous - first alternative always wins
//      Second alternative is never tried if first matches

Expr::Choice(vec![
    Expr::token(Number("1")),  // Tried first
    Expr::token(Number("2")), // Only tried if first fails
])

Precedence Through Ordering

PEG makes operator precedence natural through ordering:

// Lower precedence operators tried first
// Higher precedence operators tried later
// This naturally gives * higher precedence than +

Expr::Choice(vec![
    // Addition (lower precedence, tried first)
    Expr::Seq(vec![
        Expr::Rule(Expr),
        Expr::token(Plus),
        Expr::Rule(Term),
    ]),
    // Base case (tried last)
    Expr::Rule(Term),
])

Term::Choice(vec![
    // Multiplication (higher precedence, tried first)
    Expr::Seq(vec![
        Expr::Rule(Term),
        Expr::token(Multiply),
        Expr::Rule(Factor),
    ]),
    // Base case (tried last)
    Expr::Rule(Factor),
])

Input: 1 + 2 * 3

Parse process:

  1. Try Expr + Term - matches 1 +, then tries to match Term
  2. Term tries Term * Factor - matches 2 * 3
  3. Result: 1 + (2 * 3) ✓ (correct precedence)

Configuration

Configure the PEG parser with PegConfig:

use sipha::backend::peg::{PegParser, PegConfig};

let config = PegConfig {
    enable_memoization: true,    // Enable packrat parsing
    max_memo_size: 10000,        // Maximum cache size
    error_recovery: true,         // Enable error recovery
    max_errors: 100,             // Maximum errors before giving up
    max_backtrack_depth: 1000,   // Limit backtracking depth
};

let mut parser = PegParser::new(&grammar, config)
    .expect("Failed to create parser");

Memoization (Packrat Parsing)

When enable_memoization is true, the parser caches parse results:

  • O(n) Performance: Linear time parsing for many grammars
  • Memory Trade-off: Uses more memory for cache
  • Cache Size: Controlled by max_memo_size

Without memoization:

  • Exponential Worst Case: Can be slow for some grammars
  • Less Memory: No cache overhead

Backtracking Depth

The max_backtrack_depth parameter limits backtracking:

  • Prevents Infinite Loops: Stops infinite backtracking
  • Performance: Lower values may fail on valid inputs
  • Default: 1000 (usually sufficient)

Error Recovery

Error recovery strategies:

  • Best-effort parsing: Continue parsing despite errors
  • Error limits: Stop after max_errors errors
  • Recovery points: Use RecoveryPoint expressions in grammar

Usage

use sipha::backend::peg::{PegParser, PegConfig};
use sipha::backend::ParserBackend;

let config = PegConfig::default();
let mut parser = PegParser::new(&grammar, config)
    .expect("Failed to create parser");

let result = parser.parse(&tokens, MyNonTerminal::Expr);

Writing PEG Grammars

Basic Patterns

Sequences

Sequences match expressions in order:

// Match: "a" followed by "b"
Expr::Seq(vec![
    Expr::token(Token::A),
    Expr::token(Token::B),
])

Optional Elements

Optional expressions match zero or one occurrence:

// Match: "a" optionally followed by "b"
Expr::Seq(vec![
    Expr::token(Token::A),
    Expr::Opt(Box::new(Expr::token(Token::B))),
])

Repetition

Repetition matches zero or more (or one or more) occurrences:

// Match: One or more "a"s
Expr::Repeat {
    expr: Box::new(Expr::token(Token::A)),
    min: 1,
    max: None,
}

// Match: Zero to three "a"s
Expr::Repeat {
    expr: Box::new(Expr::token(Token::A)),
    min: 0,
    max: Some(3),
}

Separated Lists

Separated lists match items separated by a delimiter:

// Match: "1, 2, 3" or "1, 2, 3,"
Expr::Separated {
    item: Box::new(Expr::token(Number)),
    separator: Box::new(Expr::token(Comma)),
    min: 1,
    trailing: TrailingSeparator::Allow,
}

Common Patterns

Operator Precedence

Express precedence by ordering alternatives (lower precedence first):

// Expression with precedence: + < * < ( )
Expr::Choice(vec![
    // Addition (lowest precedence, tried first)
    Expr::Seq(vec![
        Expr::Rule(Expr),
        Expr::token(Plus),
        Expr::Rule(Term),
    ]),
    // Subtraction
    Expr::Seq(vec![
        Expr::Rule(Expr),
        Expr::token(Minus),
        Expr::Rule(Term),
    ]),
    // Base case (tried last)
    Expr::Rule(Term),
])

Term::Choice(vec![
    // Multiplication (higher precedence)
    Expr::Seq(vec![
        Expr::Rule(Term),
        Expr::token(Multiply),
        Expr::Rule(Factor),
    ]),
    // Base case
    Expr::Rule(Factor),
])

Factor::Choice(vec![
    // Parentheses (highest precedence)
    Expr::Delimited {
        open: Box::new(Expr::token(LParen)),
        content: Box::new(Expr::Rule(Expr)),
        close: Box::new(Expr::token(RParen)),
        recover: false,
    },
    // Numbers
    Expr::token(Number),
])

Left Associativity

Left associativity is naturally expressed with a left-recursive rule:

// Left-associative: 1 - 2 - 3 = (1 - 2) - 3
Expr::Choice(vec![
    Expr::Seq(vec![
        Expr::Rule(Expr),  // Left-recursive
        Expr::token(Minus),
        Expr::Rule(Term),
    ]),
    Expr::Rule(Term),
])

Note: Left recursion is detected and handled, but may require special consideration.

Right Associativity

Right associativity requires right-recursive rules:

// Right-associative: 2^3^4 = 2^(3^4)
Expr::Choice(vec![
    Expr::Seq(vec![
        Expr::Rule(Term),
        Expr::token(Power),
        Expr::Rule(Expr),  // Right-recursive
    ]),
    Expr::Rule(Term),
])

Optional Trailing Elements

Handle optional trailing separators:

// Match: "a, b, c" or "a, b, c,"
Expr::Separated {
    item: Box::new(Expr::token(Ident)),
    separator: Box::new(Expr::token(Comma)),
    min: 1,
    trailing: TrailingSeparator::Allow,  // or Forbid, or Require
}

Advanced Patterns

Lookahead

Positive lookahead checks if expression matches without consuming:

// Match "a" only if followed by "b" (but don't consume "b")
Expr::Seq(vec![
    Expr::token(Token::A),
    Expr::Lookahead(Box::new(Expr::token(Token::B))),
])

Negative lookahead checks if expression doesn’t match:

// Match "a" only if NOT followed by "b"
Expr::Seq(vec![
    Expr::token(Token::A),
    Expr::NotLookahead(Box::new(Expr::token(Token::B))),
])

Error Recovery

Use recovery points to skip to known tokens on error:

// On error, skip to semicolon or brace
Expr::RecoveryPoint {
    expr: Box::new(Expr::Rule(Statement)),
    sync_tokens: vec![Semicolon, RBrace].into(),
}

Delimited Recovery

Recover from missing closing delimiters:

// Recover from missing closing parenthesis
Expr::Delimited {
    open: Box::new(Expr::token(LParen)),
    content: Box::new(Expr::Rule(Expr)),
    close: Box::new(Expr::token(RParen)),
    recover: true,  // Skip tokens until closing delimiter found
}

Incremental Parsing

The PEG parser supports incremental parsing with memoization:

use sipha::incremental::IncrementalParser;

let mut incremental = IncrementalParser::new(parser);
let result = incremental.parse_incremental(
    &tokens,
    old_tree,
    &edits,
    entry_point,
    Some(&grammar),
);

The memoization cache works with incremental parsing:

  • Node Reuse: Cached nodes are reused when possible
  • Cache Invalidation: Affected regions invalidate cache entries
  • Performance: Significant speedup for small edits

Grammar Requirements and Best Practices

What PEG Handles Well

PEG parsers excel at:

  • Precedence-based Languages: Operator precedence is natural through ordering
  • Expression Parsers: Arithmetic, logical, and comparison expressions
  • Structured Data: JSON, XML, configuration files
  • Interactive Tools: Fast parsing with memoization for IDEs and editors
  • Deterministic Parsing: Always produces one parse tree (no ambiguity)

Left Recursion

PEG parsers detect left recursion and handle it, but with limitations:

// Left-recursive rule
Expr -> Expr + Term | Term

// PEG detects this and adds a warning
// The parser will fail on left-recursive rules to prevent infinite loops

Best Practice: For left-recursive grammars, consider:

  • Using right-recursive alternatives where possible
  • Restructuring to avoid left recursion
  • Using iterative parsing for specific cases

Example - Avoiding Left Recursion:

// Instead of: Expr -> Expr + Term | Term
// Use iterative approach with repetition:

Expr::Seq(vec![
    Expr::Rule(Term),
    Expr::Repeat {
        expr: Box::new(Expr::Seq(vec![
            Expr::token(Plus),
            Expr::Rule(Term),
        ])),
        min: 0,
        max: None,
    },
])

Ambiguity Resolution

PEG grammars are always unambiguous because ordered choice resolves ambiguity:

// CFG: Expr -> Number | Number  (ambiguous)
// PEG: Expr -> Number | Number   (unambiguous - first wins)

// If you need different behavior, reorder:
Expr::Choice(vec![
    Expr::token(Number("42")),  // Specific case first
    Expr::token(Number("n")),   // General case second
])

Best Practice: Order alternatives from most specific to least specific.

Common Pitfalls

1. Order Matters

// Wrong: General case first
Expr::Choice(vec![
    Expr::token(Number),        // Matches everything
    Expr::token(Number("42")),  // Never reached
])

// Correct: Specific case first
Expr::Choice(vec![
    Expr::token(Number("42")),  // Specific case first
    Expr::token(Number),        // General case second
])

2. Greedy Repetition

// This matches as many as possible
Expr::Repeat {
    expr: Box::new(Expr::token(Number)),
    min: 0,
    max: None,
}
// Input: "123" matches all three digits

3. Backtracking Cost

Excessive backtracking can be slow. Factor out common prefixes (or guard alternatives with lookahead) so that failed alternatives do not re-match the same tokens:

// Expensive: Tries "ab", backtracks, tries "ac"
Expr::Choice(vec![
    Expr::Seq(vec![Expr::token(A), Expr::token(B)]),
    Expr::Seq(vec![Expr::token(A), Expr::token(C)]),
])

// Better: factor out the shared "A" prefix so no backtracking is needed
Expr::Seq(vec![
    Expr::token(A),
    Expr::Choice(vec![
        Expr::token(B),
        Expr::token(C),
    ]),
])

Memoization Details

Packrat parsing caches results at each position:

  • Cache Key: (rule, position)
  • Cache Value: Parse result (success with node, or failure)
  • Cache Size: Bounded by max_memo_size
  • Eviction: Old entries evicted when cache is full
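
A conceptual sketch of such a table (not Sipha's internal type); each (rule, position) key remembers either a completed node or a known failure:

use std::collections::HashMap;

enum MemoEntry<Node> {
    Success { node: Node, end: usize }, // parsed node and the position after it
    Failure,                            // this rule is already known to fail here
}

struct MemoTable<Rule, Node> {
    entries: HashMap<(Rule, usize), MemoEntry<Node>>,
}

impl<Rule: Eq + std::hash::Hash, Node> MemoTable<Rule, Node> {
    // Reuse a previous result instead of re-running the rule at this position.
    fn lookup(&self, rule: Rule, pos: usize) -> Option<&MemoEntry<Node>> {
        self.entries.get(&(rule, pos))
    }
}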

Benefits:

  • Linear Time: O(n) for many grammars
  • Node Reuse: Cached nodes can be reused
  • Incremental: Works well with incremental parsing

Ordered Choice Details

Ordered choice (A | B) semantics:

  1. Try A
  2. If A succeeds, return success (never try B)
  3. If A fails, backtrack and try B
  4. If both fail, return failure

This differs from CFG where both alternatives are considered.

Performance

PEG parsing performance:

  • With Memoization: O(n) time, O(n) space (packrat parsing)
  • Without Memoization: O(n) to exponential time, O(1) space
  • Incremental: Fast for small edits (benefits from memoization)
  • Memory: Cache size depends on grammar and input

When to Use PEG

Ideal Use Cases

1. Expression Parsers

PEG excels at parsing expressions with operator precedence:

// Natural expression parsing
// Input: "1 + 2 * 3"
// Result: 1 + (2 * 3)  (correct precedence)

// Grammar is intuitive - lower precedence tried first
Expr -> Expr + Term | Term
Term -> Term * Factor | Factor

Why PEG: Precedence is expressed naturally through ordering, no need for precedence tables.

2. Interactive Tools (IDEs, Editors)

PEG with memoization provides fast incremental parsing:

  • Fast re-parsing: Memoization caches results
  • Incremental updates: Only affected regions re-parsed
  • Responsive: Good performance for real-time editing

Why PEG: Memoization makes repeated parsing very fast.

3. Structured Data Formats

PEG handles delimited and separated structures well:

// JSON-like structures
Object -> "{" (Pair ("," Pair)*)? "}"
Array -> "[" (Value ("," Value)*)? "]"

Why PEG: Delimited and separated expressions are first-class.

4. Deterministic Parsing

When you need guaranteed unambiguous parsing:

  • No ambiguity: Ordered choice ensures one parse tree
  • Predictable: Same input always produces same output
  • No disambiguation needed: Parser makes choices automatically

Why PEG: Unambiguous by design.

When NOT to Use PEG

1. Truly Ambiguous Grammars

If you need to track or explore ambiguity:

  • Use GLR instead: GLR produces parse forests for ambiguous inputs
  • PEG hides ambiguity: Only one parse tree is produced

2. Very Large Files Without Memoization

Without memoization, some PEG grammars can be exponential:

  • Enable memoization: Use enable_memoization: true
  • Or use LR: LR parsers are consistently O(n) without memoization

3. Simple LL(1) Grammars

For simple, unambiguous grammars, LL(k) may be simpler:

  • LL is simpler: Table-based, easier to understand
  • PEG is more flexible: But adds complexity for simple cases

Limitations

PEG parser limitations:

  • Exponential Without Memoization: Can be slow for some grammars
  • Memory Usage: Memoization requires significant memory
  • Left Recursion: May require special handling
  • No Ambiguity Tracking: Ambiguity resolved by ordering (may hide issues)

Comparison with Other Backends

PEG vs LL Parser

| Aspect | PEG | LL(k) |
|---|---|---|
| Approach | Top-down with backtracking | Top-down predictive |
| Choice | Ordered (first match wins) | Based on lookahead |
| Left Recursion | Detected, may need handling | Can eliminate |
| Ambiguity | Resolved by ordering | Requires lookahead > 1 |
| Performance | O(n) with memoization | O(n) always |
| Memory | O(n) with memoization | O(grammar size) |
| Best For | Precedence, interactive tools | Simple grammars, IDEs |

Choose PEG when: You need ordered choice semantics or memoization benefits.

Choose LL when: Your grammar is LL(k) compatible and you want simplicity.

PEG vs LR Parser

| Aspect | PEG | LR |
|---|---|---|
| Approach | Top-down with backtracking | Bottom-up shift-reduce |
| Choice | Ordered | Based on parse table |
| Left Recursion | Detected, may need handling | Natural |
| Backtracking | Yes (with limits) | No |
| Performance | O(n) with memoization | O(n) always |
| Error Recovery | Good | Excellent |
| Best For | Precedence, expressions | Batch parsing, languages |

Choose PEG when: You need ordered choice or top-down semantics.

Choose LR when: You need excellent error recovery or batch processing.

PEG vs GLR Parser

| Aspect | PEG | GLR |
|---|---|---|
| Approach | Top-down deterministic | Bottom-up with ambiguity |
| Choice | Ordered (unambiguous) | All alternatives (ambiguous) |
| Ambiguity | Resolved by ordering | Tracked in parse forest |
| Performance | O(n) with memoization | O(n) to O(n³) |
| Best For | Deterministic parsing | Ambiguous grammars |

Choose PEG when: You want guaranteed unambiguous parsing.

Choose GLR when: Your grammar has inherent ambiguities you need to explore.

Practical Examples

Example 1: Arithmetic Expressions

// Grammar for: 1 + 2 * 3
let grammar = GrammarBuilder::new()
    .entry_point(Expr)
    .rule(Expr, Expr::Choice(vec![
        // Addition (lowest precedence, tried first)
        Expr::Seq(vec![
            Expr::Rule(Expr),
            Expr::token(Plus),
            Expr::Rule(Term),
        ]),
        // Base case
        Expr::Rule(Term),
    ]))
    .rule(Term, Expr::Choice(vec![
        // Multiplication (higher precedence)
        Expr::Seq(vec![
            Expr::Rule(Term),
            Expr::token(Multiply),
            Expr::Rule(Factor),
        ]),
        // Base case
        Expr::Rule(Factor),
    ]))
    .rule(Factor, Expr::Choice(vec![
        // Parentheses
        Expr::Delimited {
            open: Box::new(Expr::token(LParen)),
            content: Box::new(Expr::Rule(Expr)),
            close: Box::new(Expr::token(RParen)),
            recover: false,
        },
        // Numbers
        Expr::token(Number),
    ]))
    .build()?;

Example 2: JSON-like Lists

// Grammar for: [1, 2, 3] or [1, 2, 3,]
let grammar = GrammarBuilder::new()
    .entry_point(Array)
    .rule(Array, Expr::Delimited {
        open: Box::new(Expr::token(LBracket)),
        content: Box::new(Expr::Separated {
            item: Box::new(Expr::Rule(Value)),
            separator: Box::new(Expr::token(Comma)),
            min: 0,
            trailing: TrailingSeparator::Allow,
        }),
        close: Box::new(Expr::token(RBracket)),
        recover: true,
    })
    .build()?;

Example 3: Error Recovery

// Grammar with error recovery
let grammar = GrammarBuilder::new()
    .entry_point(Program)
    .rule(Program, Expr::Repeat {
        expr: Box::new(Expr::RecoveryPoint {
            expr: Box::new(Expr::Rule(Statement)),
            sync_tokens: vec![Semicolon, RBrace].into(),
        }),
        min: 0,
        max: None,
    })
    .build()?;

Next Steps

Choosing a Backend

This guide helps you choose the right parsing backend for your use case.

Quick Comparison

| Feature | LL(k) | LR | GLR | PEG |
|---|---|---|---|---|
| Grammar compatibility | LL(k) | LR | Any | Most |
| Ambiguity handling | No | No | Yes | No (ordered choice) |
| Left recursion | With elimination | Yes | Yes | Yes |
| Error recovery | Good | Excellent | Good | Good |
| Incremental parsing | Yes | Yes | Yes | Yes |
| Performance | Fast | Fast | Slower (ambiguous) | Fast (with memo) |
| Memory usage | Small | Medium | Medium | Medium (with memo) |
| Complexity | Low | Medium | High | Medium |

Decision Tree

flowchart TD
    Start[Choose Backend] --> Ambiguous{Grammar<br/>Ambiguous?}
    Ambiguous -->|Yes| GLR[Use GLR Parser]
    Ambiguous -->|No| UseCase{Primary<br/>Use Case?}
    
    UseCase -->|Expression Parser<br/>Precedence-based| PEG1[Use PEG Parser]
    UseCase -->|Interactive Tools<br/>IDEs/Editors| Interactive{Need<br/>Memoization?}
    UseCase -->|Batch Processing| LR[Use LR Parser]
    UseCase -->|Complex Language| GLR2[Use GLR Parser]
    UseCase -->|Simple Grammar| LL[Use LL Parser]
    
    Interactive -->|Yes| PEG2[Use PEG Parser]
    Interactive -->|No| LL2[Use LL Parser]
    
    GLR --> End[Selected Backend]
    GLR2 --> End
    LR --> End
    LL --> End
    LL2 --> End
    PEG1 --> End
    PEG2 --> End
    
    style GLR fill:#ffccbc,color:#000000
    style GLR2 fill:#ffccbc,color:#000000
    style LR fill:#fff9c4,color:#000000
    style LL fill:#c8e6c9,color:#000000
    style LL2 fill:#c8e6c9,color:#000000
    style PEG1 fill:#e1bee7,color:#000000
    style PEG2 fill:#e1bee7,color:#000000
    style End fill:#e1f5ff,color:#000000

Step-by-Step Decision Process

1. Is your grammar ambiguous?

  • Yes → Use GLR
  • No → Continue to step 2

2. What’s your primary use case?

  • Expression parser / Precedence-based → Use PEG (natural ordered choice)
  • Interactive tools (IDEs, editors) → Continue to step 3
  • Batch parsing → Use LR (efficient, good error recovery)
  • Complex language (C++, etc.) → Use GLR (handles ambiguities)
  • Simple grammar → Use LL(k) (simplest)

3. For Interactive Tools: Need memoization?

  • Yes → Use PEG (memoization provides fast re-parsing)
  • No → Use LL(k) (simpler, table-based)

Detailed Scenarios

Scenario 1: Simple Expression Parser

Grammar: Arithmetic expressions with precedence (e.g., 1 + 2 * 3)

Recommendation: PEG or LL(k) with left-recursion elimination

Reasoning:

  • Grammar is unambiguous
  • PEG’s ordered choice makes precedence natural - lower precedence operators tried first
  • Left recursion can be eliminated (LL) or handled (PEG)
  • Simple and fast
  • Good for interactive tools

PEG Grammar Example:

// Precedence: + < * < ( )
// Lower precedence tried first in PEG
Expr -> Expr + Term | Term      // + tried first (lower precedence)
Term -> Term * Factor | Factor  // * tried later (higher precedence)
Factor -> ( Expr ) | Number     // () tried last (highest precedence)

Why PEG works well: The ordering naturally expresses precedence without needing precedence tables.

Scenario 2: Language Server

Grammar: Full programming language

Recommendation: LL(k), LR, or PEG

Reasoning:

  • Need incremental parsing (all support it)
  • LL(k) is simpler for error messages
  • LR has better error recovery
  • PEG has memoization for fast parsing
  • Choose based on grammar characteristics

Scenario 3: C++ Parser

Grammar: C++ with template syntax ambiguities

Recommendation: GLR

Reasoning:

  • Grammar has inherent ambiguities
  • Need to handle multiple parse trees
  • Can disambiguate using precedence/associativity

Scenario 4: Batch Parser

Grammar: Any unambiguous grammar

Recommendation: LR

Reasoning:

  • Efficient for batch processing
  • Good error recovery
  • Handles left recursion naturally

Performance Considerations

LL(k)

  • Fast: O(n) parsing time
  • Small tables: Efficient memory usage
  • Good for: Interactive tools, small to medium files

LR

  • Fast: O(n) parsing time
  • Medium tables: Reasonable memory usage
  • Good for: Batch processing, large files

GLR

  • Variable: O(n) to O(n³) depending on ambiguity
  • Medium to large tables: More memory for forests
  • Good for: Ambiguous grammars, complex languages

PEG

  • With memoization: O(n) parsing time, O(n) space
  • Without memoization: O(n) to exponential time, O(1) space
  • Medium memory: Cache size depends on grammar
  • Good for: Precedence-based languages, interactive tools

Grammar Compatibility

LL(k) Compatible

  • No left recursion (or can be eliminated)
  • No ambiguity (or k > 1)
  • Can compute FIRST/FOLLOW sets

LR Compatible

  • No ambiguity
  • No shift/reduce conflicts
  • No reduce/reduce conflicts

GLR Compatible

  • Any grammar
  • Handles all conflicts
  • Handles all ambiguities

PEG Compatible

  • Most grammars: Very flexible
  • Left recursion: Detected and handled (may require restructuring)
  • Ordered choice: Resolves ambiguity automatically
  • Best for: Precedence-based expressions, structured data, interactive tools

Writing PEG Grammars:

  • Order alternatives from most specific to least specific
  • Lower precedence operators should be tried first
  • Use repetition for lists and sequences
  • Consider memoization for performance

Example - Precedence:

// Correct: Lower precedence first
Expr -> Expr + Term | Term      // + tried first
Term -> Term * Factor | Factor  // * tried later (higher precedence)

// This naturally gives * higher precedence than +

Migration Path

From LL to LR

If you need better error recovery or left recursion:

  1. Update grammar (if needed)
  2. Change parser type
  3. Update configuration
  4. Test thoroughly

From LR to GLR

If you encounter ambiguities:

  1. Keep same grammar
  2. Change parser type
  3. Add disambiguation logic
  4. Handle parse forests

From LL/LR to PEG

If you want ordered choice semantics or memoization:

  1. Reorder alternatives: Most specific first, least specific last
  2. Adjust precedence: Lower precedence operators tried first
  3. Enable memoization: Set enable_memoization: true for performance
  4. Handle left recursion: Restructure if needed (PEG detects it)
  5. Test thoroughly - ordered choice may change parse results

Example Migration:

// LL/LR: Precedence via grammar structure or tables
// PEG: Precedence via ordering

// Before (LL/LR style - may need precedence hints):
Expr -> Term | Expr + Term
Term -> Factor | Term * Factor

// After (PEG style - ordering expresses precedence):
Expr -> Expr + Term | Term      // + tried first (lower precedence)
Term -> Term * Factor | Factor  // * tried later (higher precedence)

Recommendations by Use Case

Interactive Tools (IDEs, Editors)

Primary: LL(k) or PEG

Alternative: LR if grammar requires it

Reasoning:

  • Both LL(k) and PEG are fast and support incremental parsing
  • PEG’s memoization can provide significant speedup for repeated parsing
  • PEG’s ordered choice makes precedence natural for expression languages
  • LL(k) is simpler for straightforward grammars

PEG Advantages for IDEs:

  • Memoization caches parse results, making re-parsing very fast
  • Incremental parsing benefits from cached nodes
  • Ordered choice makes grammar writing intuitive

Batch Processing

Primary: LR

Alternative: LL(k) if grammar is simpler

Complex Languages

Primary: GLR

Alternative: Transform grammar to be unambiguous, or use PEG if ambiguity can be resolved by ordering

PEG for Complex Languages:

  • Use PEG if ambiguities can be resolved by ordering alternatives
  • Use GLR if you need to explore or track multiple parse trees
  • PEG is deterministic - always produces one parse tree

Learning/Prototyping

Primary: LL(k) or PEG

Reasoning:

  • LL(k): Simplest to understand and use
  • PEG: Intuitive grammar writing (ordered choice is natural), good for learning parsing concepts

PEG for Learning:

  • Ordered choice is easy to understand
  • Precedence is expressed naturally
  • No need for precedence tables or complex grammar analysis

Next Steps

Green/Red Trees

Sipha uses an immutable green/red tree representation for syntax trees. This design provides efficient memory usage and fast tree operations.

Overview

The green/red tree architecture separates:

  • Green trees: Compact, shared representation stored in an arena
  • Red trees: Convenient API for traversing and querying the tree

This separation provides:

  • Efficient memory usage: Shared subtrees reduce memory footprint
  • Fast traversal: Optimized for common tree operations
  • Immutability: Trees are immutable, enabling safe sharing

Green Trees

Green trees are the storage layer:

  • Compact: Minimal memory overhead
  • Shared: Subtrees can be shared across multiple trees
  • Arena-allocated: Stored in an arena for efficient allocation
  • Immutable: Once created, cannot be modified

GreenNode

A GreenNode represents a node in the green tree:

pub struct GreenNode<K: SyntaxKind> {
    kind: K,
    children: Vec<SyntaxElement<K>>,
    // ... internal fields
}

Green nodes contain:

  • Kind: The syntax kind (terminal or non-terminal)
  • Children: Child elements (nodes or tokens)
  • Metadata: Internal metadata for tree operations

Red Trees

Red trees are the API layer:

  • Convenient: Easy-to-use API for traversal
  • Lazy: Created on-demand from green trees
  • Navigable: Rich navigation API (parent, siblings, etc.)
  • Queryable: Supports queries and visitors

SyntaxNode

A SyntaxNode represents a node in the red tree:

let root = SyntaxNode::new_root(green_root.clone());

// Navigate
for child in root.children() {
    println!("{:?}", child.kind());
}

// Find nodes
if let Some(expr) = root.descendants().find(|n| n.kind() == MySyntaxKind::Expr) {
    println!("Found expression");
}

Tree Structure

Elements

Tree elements can be:

  • Nodes: Non-terminal nodes (have children)
  • Tokens: Terminal nodes (leaf nodes)
pub enum SyntaxElement<K: SyntaxKind> {
    Node(SyntaxNode),
    Token(SyntaxToken),
}
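
For example, a traversal can branch on the element type (a sketch consistent with the children() usage shown in the Tree Manipulation chapter):

for element in root.children() {
    match element {
        SyntaxElement::Node(node) => println!("node: {:?}", node.kind()),
        SyntaxElement::Token(token) => println!("token: {}", token.text()),
    }
}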

Hierarchy

Trees form a hierarchy:

Root (NonTerminal)
├── Child1 (NonTerminal)
│   ├── Token1 (Terminal)
│   └── Token2 (Terminal)
└── Child2 (NonTerminal)
    └── Token3 (Terminal)

Tree Structure Visualization

graph TD
    subgraph Green["Green Tree (Storage)"]
        G1[GreenNode: Expr]
        G2[GreenNode: Term]
        G3[GreenToken: Number]
        G4[GreenToken: Plus]
        G5[GreenNode: Term]
        G6[GreenToken: Number]
        
        G1 --> G2
        G1 --> G4
        G1 --> G5
        G2 --> G3
        G5 --> G6
    end
    
    subgraph Red["Red Tree (API)"]
        R1[SyntaxNode: Expr]
        R2[SyntaxNode: Term]
        R3[SyntaxToken: Number]
        R4[SyntaxToken: Plus]
        R5[SyntaxNode: Term]
        R6[SyntaxToken: Number]
        
        R1 --> R2
        R1 --> R4
        R1 --> R5
        R2 --> R3
        R5 --> R6
    end
    
    G1 -.->|"new_root()"| R1
    
    style G1 fill:#c8e6c9,color:#000000
    style G2 fill:#c8e6c9,color:#000000
    style G3 fill:#ffccbc,color:#000000
    style G4 fill:#ffccbc,color:#000000
    style G5 fill:#c8e6c9,color:#000000
    style G6 fill:#ffccbc,color:#000000
    
    style R1 fill:#e1f5ff,color:#000000
    style R2 fill:#e1f5ff,color:#000000
    style R3 fill:#fff9c4,color:#000000
    style R4 fill:#fff9c4,color:#000000
    style R5 fill:#e1f5ff,color:#000000
    style R6 fill:#fff9c4,color:#000000

Memory Efficiency

Green/red trees are memory-efficient:

  1. Sharing: Unchanged subtrees are shared across edits
  2. Arena allocation: Nodes are allocated in an arena
  3. Compact representation: Minimal overhead per node
  4. Incremental updates: Only changed regions are recreated
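
Sharing follows from reference counting: green nodes are handed around behind Arc (as in the reuse candidates and GLR parse forest shown earlier), so reusing a subtree is just cloning a pointer. A minimal sketch:

use std::sync::Arc;

// `green_node` is any previously built GreenNode<MySyntaxKind>.
let shared: Arc<GreenNode<MySyntaxKind>> = Arc::new(green_node);
let reused = Arc::clone(&shared);        // no deep copy, just a new reference
assert!(Arc::ptr_eq(&shared, &reused));  // both point at the same allocation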

Design Rationale

The green/red tree design is inspired by:

  • rust-analyzer: Uses similar architecture
  • rowan: Provides the green/red tree foundation
  • IDE requirements: Optimized for interactive editing

Benefits:

  • Fast incremental parsing: Reuse green nodes
  • Efficient memory: Share unchanged subtrees
  • Safe sharing: Immutability enables safe sharing
  • Rich API: Red trees provide convenient navigation

Conversion

Convert between green and red trees:

// Green to Red
let red_root = SyntaxNode::new_root(green_node.clone());

// Red to Green
let green_node = red_node.green();

Next Steps

See Also

Working with Syntax Trees

This chapter shows how to traverse, query, and work with syntax trees in Sipha.

Creating a Red Tree

Create a red tree from a parse result:

use sipha::syntax::SyntaxNode;

let root = SyntaxNode::new_root(result.root.clone());

Traversal

Children

Iterate over direct children:

for child in root.children() {
    println!("Child: {:?}", child.kind());
}

Descendants

Iterate over all descendants (depth-first):

for descendant in root.descendants() {
    println!("Descendant: {:?}", descendant.kind());
}

Ancestors

Iterate over ancestors:

if let Some(node) = root.descendants().find(|n| n.kind() == MySyntaxKind::Expr) {
    for ancestor in node.ancestors() {
        println!("Ancestor: {:?}", ancestor.kind());
    }
}

Siblings

Access siblings:

if let Some(node) = root.descendants().find(|n| n.kind() == MySyntaxKind::Expr) {
    if let Some(next) = node.next_sibling() {
        println!("Next sibling: {:?}", next.kind());
    }
    if let Some(prev) = node.prev_sibling() {
        println!("Previous sibling: {:?}", prev.kind());
    }
}

Querying

Find by Kind

Find nodes by syntax kind:

let exprs: Vec<_> = root.descendants()
    .filter(|n| n.kind() == MySyntaxKind::Expr)
    .collect();

Find by Position

Find nodes at a specific position:

use sipha::syntax::TextSize;

let pos = TextSize::from(10);
if let Some(node) = root.token_at_offset(pos) {
    println!("Token at position {}: {:?}", pos, node.kind());
}

Find Parent

Find the parent of a node:

if let Some(parent) = node.parent() {
    println!("Parent: {:?}", parent.kind());
}

Text Access

Node Text

Get the text covered by a node:

let text = node.text();
println!("Node text: {}", text);

Token Text

Get the text of a token:

if let Some(token) = node.first_token() {
    println!("Token text: {}", token.text());
}

Text Range

Get the text range of a node:

let range = node.text_range();
println!("Range: {:?}", range);

Visitors

Use visitors to traverse trees:

#[cfg(feature = "visitor")]
use sipha::syntax::{SyntaxVisitor, SyntaxWalker};

struct MyVisitor;

impl SyntaxVisitor for MyVisitor {
    fn visit_node(&mut self, node: &SyntaxNode) {
        println!("Visiting: {:?}", node.kind());
    }
    
    fn visit_token(&mut self, token: &SyntaxToken) {
        println!("Token: {:?}", token.kind());
    }
}

let mut visitor = MyVisitor;
root.walk(&mut visitor);

Queries

Use XPath-like queries (requires query feature):

#[cfg(feature = "query")]
use sipha::syntax::{QueryBuilder, XPathQuery};

let query = QueryBuilder::new()
    .descendant(MySyntaxKind::Expr)
    .child(MySyntaxKind::Number)
    .build();

let results: Vec<_> = query.matches(&root).collect();

Pattern Matching

Match tree patterns:

// Match specific structure
if let Some(expr) = root.descendants().find(|n| {
    n.kind() == MySyntaxKind::Expr &&
    n.children().any(|c| c.kind() == MySyntaxKind::Number)
}) {
    println!("Found expression with number");
}

Collecting Information

Count Nodes

Count nodes by kind:

let expr_count = root.descendants()
    .filter(|n| n.kind() == MySyntaxKind::Expr)
    .count();

Collect Text

Collect text from nodes:

let all_text: String = root.descendants()
    .filter_map(|n| n.first_token())
    .map(|t| t.text().to_string())
    .collect();

Best Practices

  1. Use iterators: Prefer iterator methods over manual traversal
  2. Cache results: Cache frequently accessed nodes
  3. Filter early: Filter nodes as early as possible
  4. Use visitors: Use visitors for complex traversals
  5. Leverage queries: Use queries for pattern matching

Next Steps

Tree Manipulation

This chapter shows how to build and modify syntax trees in Sipha.

Building Trees

GreenNodeBuilder

Use GreenNodeBuilder to build green trees:

use sipha::syntax::{GreenNodeBuilder, SyntaxKind};

let mut builder = GreenNodeBuilder::new();

// Start the root node
builder.start_node(MySyntaxKind::Root);

// Start a child node
builder.start_node(MySyntaxKind::Expr);

// Add tokens
builder.token(MySyntaxKind::Number, "42").unwrap();
builder.token(MySyntaxKind::Plus, "+").unwrap();
builder.token(MySyntaxKind::Number, "10").unwrap();

// Finish the child node, then the root
builder.finish_node().unwrap();
builder.finish_node().unwrap();

// Build the final green tree
let root = builder.finish().unwrap();

Nested Structures

Build nested structures:

let mut builder = GreenNodeBuilder::new();

builder.start_node(MySyntaxKind::Expr);

// Nested expression
builder.start_node(MySyntaxKind::Term);
builder.token(MySyntaxKind::Number, "1").unwrap();
builder.finish_node().unwrap();

builder.token(MySyntaxKind::Plus, "+").unwrap();

// Another nested expression
builder.start_node(MySyntaxKind::Term);
builder.token(MySyntaxKind::Number, "2").unwrap();
builder.finish_node().unwrap();

builder.finish_node().unwrap();
let root = builder.finish().unwrap();

Modifying Trees

Trees are immutable, so modifications create new trees:

Replacing Nodes

Replace a node by rebuilding:

fn replace_node(old_node: &SyntaxNode, new_kind: MySyntaxKind) -> GreenNode<MySyntaxKind> {
    let mut builder = GreenNodeBuilder::new();
    builder.start_node(new_kind);
    
    // Copy children
    for child in old_node.children() {
        match child {
            SyntaxElement::Node(n) => {
                // Recursively replace
                let new_child = replace_node(&n, new_kind);
                // Add new child (simplified - actual API may differ)
            }
            SyntaxElement::Token(t) => {
                builder.token(t.kind(), t.text()).unwrap();
            }
        }
    }
    
    builder.finish_node().unwrap();
    builder.finish().unwrap()
}

Inserting Nodes

Insert nodes by rebuilding:

fn insert_child(parent: &SyntaxNode, new_child: GreenNode<MySyntaxKind>, index: usize) -> GreenNode<MySyntaxKind> {
    let mut builder = GreenNodeBuilder::new();
    builder.start_node(parent.kind());
    
    // Copy children before index
    for (i, child) in parent.children().enumerate() {
        if i == index {
            // Insert new child
            // (simplified - actual API may differ)
        }
        // Copy existing child
    }
    
    builder.finish_node().unwrap();
    builder.finish().unwrap()
}

Tree Transformation

Transform trees using visitors:

struct Transformer;

impl SyntaxVisitor for Transformer {
    fn visit_node(&mut self, node: &SyntaxNode) {
        // Transform node
        if node.kind() == MySyntaxKind::OldKind {
            // Replace with new kind
        }
    }
}

Incremental Updates

For incremental updates, reuse unchanged nodes:

use sipha::incremental::IncrementalParser;

// Initial parse
let result1 = parser.parse_incremental(&tokens, None, &[], entry, Some(&grammar));

// After edit, reuse unchanged nodes
let result2 = parser.parse_incremental(
    &new_tokens,
    Some(&result1.root),
    &edits,
    entry,
    Some(&grammar),
);

Best Practices

  1. Use builders: Use GreenNodeBuilder for building trees
  2. Reuse nodes: Reuse unchanged nodes in incremental updates
  3. Immutable operations: All tree operations create new trees
  4. Cache results: Cache frequently accessed nodes
  5. Batch modifications: Batch multiple modifications together

Next Steps

Error Handling Overview

Sipha provides comprehensive error handling with configurable error recovery strategies. This section covers error types, recovery mechanisms, and best practices for handling errors in your parsers.

Overview

Sipha’s error handling system is designed to:

  • Provide detailed error information with precise source locations
  • Recover from errors gracefully to continue parsing
  • Support rich diagnostics with context and suggestions (via diagnostics feature)
  • Distinguish between errors and warnings for different severity levels

Error Handling Flow

flowchart TD
    Start[Parse Input] --> Parse[Parser Processes Tokens]
    Parse --> Error{Error<br/>Encountered?}
    Error -->|No| Success[Parse Success]
    Error -->|Yes| Recover{Recovery<br/>Enabled?}
    Recover -->|Yes| Attempt[Attempt Recovery]
    Recover -->|No| Fail[Report Error]
    Attempt --> Success2[Continue Parsing]
    Attempt --> Fail2[Recovery Failed]
    Success --> Result[Return ParseResult]
    Success2 --> Result
    Fail --> Result
    Fail2 --> Result
    
    style Success fill:#c8e6c9,color:#000000
    style Success2 fill:#c8e6c9,color:#000000
    style Fail fill:#ffccbc,color:#000000
    style Fail2 fill:#ffccbc,color:#000000
    style Result fill:#e1f5ff,color:#000000

Key Concepts

Errors vs Warnings

  • Errors: Prevent successful parsing or indicate syntax violations
  • Warnings: Non-fatal issues that don’t prevent parsing

Error Recovery

Sipha parsers can automatically recover from errors to continue parsing, which is essential for:

  • IDEs: Providing real-time feedback as users type
  • Language Servers: Maintaining parse state despite errors
  • Interactive Tools: Allowing partial parsing results

Rich Diagnostics

With the diagnostics feature enabled, errors include:

  • Source code snippets
  • Highlighted error locations
  • Suggestions and hints
  • Related code locations

What’s Covered

This section covers:

Quick Start

The most basic error handling:

use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;

let result = parser.parse(&tokens, entry_point);

// Always check for errors
if !result.errors.is_empty() {
    for error in &result.errors {
        eprintln!("Error: {}", error);
    }
    return Err("Parsing failed".into());
}

// Check for warnings
for warning in &result.warnings {
    eprintln!("Warning: {}", warning.message);
}

Next Steps

Error Types

Sipha defines several error types to represent different kinds of problems that can occur during parsing and lexing.

ParseError

ParseError represents errors that occur during parsing. It’s an enum with four variants:

use sipha::error::ParseError;
use sipha::syntax::{TextRange, TextSize};

// Unexpected token error
let error = ParseError::UnexpectedToken {
    span: TextRange::new(TextSize::from(10), TextSize::from(15)),
    expected: vec!["identifier".to_string(), "number".to_string()],
};

// Invalid syntax error
let error = ParseError::InvalidSyntax {
    span: TextRange::new(TextSize::from(0), TextSize::from(5)),
    message: "Invalid expression syntax".to_string(),
};

// Unexpected end of file
let error = ParseError::UnexpectedEof {
    span: TextRange::new(TextSize::from(100), TextSize::from(100)),
    expected: vec!["}".to_string(), ";".to_string()],
};

// Ambiguity error (GLR parser)
let error = ParseError::Ambiguity {
    span: TextRange::new(TextSize::from(20), TextSize::from(25)),
    alternatives: vec!["expression1".to_string(), "expression2".to_string()],
};

ParseError Variants

UnexpectedToken

Occurs when the parser encounters a token that doesn’t match any expected production.

  • Most common parse error
  • Contains a list of expected token types
  • Parser can often recover by skipping the unexpected token

Example:

// Input: "let x = 42 +"
// Error: UnexpectedToken with expected: ["number", "identifier", "("]

InvalidSyntax

Occurs when the input violates grammar rules in a way that can’t be attributed to a single unexpected token.

  • Contains a descriptive error message
  • Used for complex syntax violations
  • May require more context to understand

Example:

// Input: "function() { return }"
// Error: InvalidSyntax with message: "Missing return value"

UnexpectedEof

Occurs when the parser reaches the end of input but expects more tokens.

  • Contains a list of expected tokens
  • Common when delimiters are missing (e.g., unclosed parentheses)
  • Indicates incomplete input

Example:

// Input: "if (condition { ... }"
// Error: UnexpectedEof with expected: [")"]

Ambiguity (GLR backend only)

Occurs when the GLR parser finds multiple valid parse trees.

  • Contains a list of alternative interpretations
  • Requires disambiguation to resolve
  • Only occurs with GLR parser backend

Example:

// Input: "1 + 2 * 3" (ambiguous without precedence)
// Error: Ambiguity with alternatives: ["(1 + 2) * 3", "1 + (2 * 3)"]

LexerError

LexerError represents errors that occur during tokenization (lexing):

use sipha::error::{LexerError, LexerErrorKind};
use sipha::syntax::{TextRange, TextSize};

let error = LexerError::new(
    TextRange::new(TextSize::from(5), TextSize::from(6)),
    LexerErrorKind::UnexpectedChar { char: '!' },
);

LexerErrorKind Variants

UnexpectedChar

An invalid character was encountered that doesn’t match any token pattern.

  • Contains the unexpected character
  • Usually indicates a typo or unsupported character

Example:

// Input: "let x = 42#"
// Error: UnexpectedChar { char: '#' }

UnterminatedString

A string literal was started but never closed.

  • Common when quotes are mismatched
  • Indicates incomplete string literal

Example:

// Input: "let s = \"hello world"
// Error: UnterminatedString

InvalidEscape

An invalid escape sequence was found in a string.

  • Contains the invalid escape sequence
  • Common with malformed escape sequences

Example:

// Input: "let s = \"hello\\zworld\""
// Error: InvalidEscape { escape: "\\z" }

InvalidNumber

A number literal has invalid format.

  • Contains a reason explaining the issue
  • Common with malformed numeric literals

Example:

// Input: "let x = 123abc"
// Error: InvalidNumber { reason: "Invalid digit 'a' in number" }

UnexpectedEof

End of file was reached unexpectedly during tokenization.

  • Usually indicates incomplete input
  • Common when input is cut off mid-token

Example:

// Input: "let x = 4" (cut off)
// Error: UnexpectedEof

ParseWarning

ParseWarning represents non-fatal issues that don’t prevent parsing:

use sipha::error::{ParseWarning, Severity};
use sipha::syntax::{TextRange, TextSize};

// Create warnings with different severity levels
let warning = ParseWarning::warning(
    TextRange::new(TextSize::from(0), TextSize::from(10)),
    "Deprecated syntax used".to_string(),
);

let info = ParseWarning::info(
    TextRange::new(TextSize::from(10), TextSize::from(20)),
    "Consider using newer syntax".to_string(),
);

let hint = ParseWarning::hint(
    TextRange::new(TextSize::from(20), TextSize::from(30)),
    "This can be simplified".to_string(),
);

Severity Levels

Warning

Important issues that should be addressed.

  • Indicates potential problems
  • Should be shown to users
  • May indicate deprecated or problematic code

Info

Informational messages.

  • Provides context or suggestions
  • Less urgent than warnings
  • Can be shown optionally

Hint

Suggestions for improvement.

  • Optimization opportunities
  • Style suggestions
  • Can be shown in IDE tooltips
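
A minimal sketch of filtering warnings by severity, assuming the result value returned by parse in the examples above:

use sipha::error::Severity;

// Keep only full warnings; drop info messages and hints
let important: Vec<_> = result
    .warnings
    .iter()
    .filter(|w| matches!(w.severity, Severity::Warning))
    .collect();

for warning in important {
    eprintln!("Warning: {} at {:?}", warning.message, warning.span);
}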

Converting Between Error Types

Lexer errors can be converted to parse errors:

use sipha::error::{LexerError, ParseError};

let lexer_error: LexerError = /* ... */;
let parse_error: ParseError = lexer_error.into();

This is useful when you want to handle all errors uniformly.

Error Methods

All error types provide useful methods:

ParseError Methods

use sipha::error::ParseError;

// Get the error location
let span = error.span();

// Format expected tokens as a readable string
let expected_str = error.format_expected();
// Returns: "identifier or number" or "identifier, number, or string"

// Get "did you mean" suggestions
if let Some(suggestion) = error.did_you_mean("identifer") {
    println!("{}", suggestion); // "Did you mean 'identifier'?"
}

// Get context around the error
if let Some((before, error_span, after)) = error.get_context(source_code) {
    println!("Context: ...{}[{}]{}...", before, error_span, after);
}

// Format error with full context
let formatted = error.format_with_context(source_code, Some("actual_token"));
println!("{}", formatted);

LexerError Methods

use sipha::error::LexerError;

// Get the error location
let span = error.span();

// Get the error kind
let kind = error.kind();

Next Steps

Working with Errors

This chapter covers how to check, report, and extract information from errors in your parsers.

Working with Parse Results

After parsing, check the ParseResult for errors and warnings:

use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
use sipha::syntax::SyntaxNode;
use sipha::error::Severity;

let result = parser.parse(&tokens, entry_point);

// Check for errors
if !result.errors.is_empty() {
    eprintln!("Parsing failed with {} errors:", result.errors.len());
    for error in &result.errors {
        eprintln!("  Error: {}", error);
        eprintln!("    Location: {:?}", error.span());
    }
}

// Check for warnings
if !result.warnings.is_empty() {
    eprintln!("Found {} warnings:", result.warnings.len());
    for warning in &result.warnings {
        eprintln!("  [{}] {} at {:?}", 
            match warning.severity {
                Severity::Warning => "WARNING",
                Severity::Info => "INFO",
                Severity::Hint => "HINT",
            },
            warning.message,
            warning.span
        );
    }
}

// Even with errors, a tree may be available
let root = SyntaxNode::new_root(result.root.clone());

Error Information

Getting Error Details

ParseError provides several methods for extracting information:

use sipha::error::ParseError;

// Get the error location
let span = error.span();

// Format expected tokens as a readable string
let expected_str = error.format_expected();
// Returns: "identifier or number" or "identifier, number, or string"

// Get "did you mean" suggestions
if let Some(suggestion) = error.did_you_mean("identifer") {
    println!("{}", suggestion); // "Did you mean 'identifier'?"
}

// Get context around the error
if let Some((before, error_span, after)) = error.get_context(source_code) {
    println!("Context: ...{}[{}]{}...", before, error_span, after);
}

// Format error with full context
let formatted = error.format_with_context(source_code, Some("actual_token"));
println!("{}", formatted);

Example: Comprehensive Error Reporting

use sipha::error::ParseError;

fn report_parse_errors(errors: &[ParseError], source: &str) {
    for (i, error) in errors.iter().enumerate() {
        println!("\nError {}: {}", i + 1, error);
        println!("  Location: {:?}", error.span());
        
        // Show context
        if let Some((before, error_span, after)) = error.get_context(source) {
            println!("  Context: ...{}[{}]{}...", before, error_span, after);
        }
        
        // Show expected tokens
        match error {
            ParseError::UnexpectedToken { expected, .. } 
            | ParseError::UnexpectedEof { expected, .. } => {
                if !expected.is_empty() {
                    println!("  Expected: {}", error.format_expected());
                }
            }
            ParseError::InvalidSyntax { message, .. } => {
                println!("  Details: {}", message);
            }
            ParseError::Ambiguity { alternatives, .. } => {
                println!("  Alternatives: {}", alternatives.join(", "));
            }
        }
    }
}

Handling Lexer Errors

Lexer errors occur during tokenization and can be handled separately:

use sipha::lexer::LexerBuilder;
use sipha::error::{LexerError, LexerErrorKind};

let lexer = LexerBuilder::new()
    // ... configure lexer
    .build(eof_kind, default_kind)?;

match lexer.tokenize(input) {
    Ok(tokens) => {
        // Tokenization succeeded
    }
    Err(errors) => {
        // Handle lexer errors
        for error in errors {
            eprintln!("Lexer error at {:?}: {}", error.span(), error.kind());
            
            match error.kind() {
                LexerErrorKind::UnexpectedChar { char } => {
                    eprintln!("  Unexpected character: '{}'", char);
                }
                LexerErrorKind::UnterminatedString => {
                    eprintln!("  String literal not closed");
                }
                LexerErrorKind::InvalidEscape { escape } => {
                    eprintln!("  Invalid escape sequence: {}", escape);
                }
                LexerErrorKind::InvalidNumber { reason } => {
                    eprintln!("  Invalid number: {}", reason);
                }
                LexerErrorKind::UnexpectedEof => {
                    eprintln!("  Unexpected end of file");
                }
            }
        }
    }
}

Lexer errors can also be converted to parse errors:

use sipha::error::{LexerError, ParseError};

let lexer_error: LexerError = /* ... */;
let parse_error: ParseError = lexer_error.into();

Error Metrics

ParseResult includes metrics about the parsing process:

let result = parser.parse(&tokens, entry_point);

println!("Parsing metrics:");
println!("  Tokens consumed: {}", result.metrics.tokens_consumed);
println!("  Nodes created: {}", result.metrics.nodes_created);
println!("  Errors recovered: {}", result.metrics.errors_recovered);
println!("  Cache hits: {}", result.metrics.cache_hits);
println!("  Parse time: {:?}", result.metrics.parse_time);

These metrics are useful for:

  • Performance analysis: Understanding parse performance
  • Debugging: Identifying problematic grammar rules
  • Optimization: Finding bottlenecks in parsing
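
For instance, a rough throughput estimate (this assumes parse_time is a std::time::Duration, as the debug output above suggests):

let secs = result.metrics.parse_time.as_secs_f64();
if secs > 0.0 {
    let tokens_per_sec = result.metrics.tokens_consumed as f64 / secs;
    println!("Throughput: {:.0} tokens/sec", tokens_per_sec);
}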

Error Handling in Incremental Parsing

Incremental parsing maintains error information across edits:

use sipha::incremental::IncrementalParser;

let mut incremental = IncrementalParser::new(parser);

// Initial parse
let result1 = incremental.parse_incremental(/* ... */);

// After edit, errors are updated for affected regions
let result2 = incremental.parse_incremental(/* ... */);

// Errors from previous parse are preserved if not affected by edit

Next Steps

Error Recovery

Sipha parsers attempt to recover from errors and continue parsing. This is essential for providing good user experience in IDEs and language servers.

Overview

Error recovery allows parsers to:

  • Continue parsing despite errors
  • Provide partial results even when input has errors
  • Maintain parse state for interactive applications
  • Report multiple errors in a single pass

Automatic Recovery

By default, parsers use automatic error recovery:

use sipha::backend::ll::{LlParser, LlConfig};

// Default config includes error recovery
let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)?;

// Parser will attempt to recover from errors
let result = parser.parse(&tokens, entry_point);

// Check how many errors were recovered from
println!("Errors recovered: {}", result.metrics.errors_recovered);

Even when errors occur, the parser will:

  1. Report the error
  2. Attempt to recover
  3. Continue parsing
  4. Return a parse tree (possibly incomplete)
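
For example, a minimal sketch of using such a partial result, reusing the result value from the snippet above:

use sipha::syntax::SyntaxNode;

// Report what went wrong...
for error in &result.errors {
    eprintln!("Error: {}", error);
}

// ...but still construct and inspect the (possibly incomplete) tree
let root = SyntaxNode::new_root(result.root.clone());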

Recovery Strategies

Parsers use several strategies to recover:

1. Token Skipping

Skip unexpected tokens and continue parsing.

Example:

// Input: "let x = 42 + + 10"
// Parser skips the second "+" and continues
// Error reported, but parsing continues

2. Token Insertion

Insert missing tokens (e.g., semicolons, closing delimiters).

Example:

// Input: "let x = 42 let y = 10"
// Parser may insert a semicolon after "42"
// Error reported, but both statements parsed

3. Rule Skipping

Skip to next statement or block when current rule fails.

Example:

// Input: "let x = invalid; let y = 10;"
// Parser skips the first statement and continues with the second
// Error reported for first statement, second statement parsed successfully

4. Best Effort

Continue parsing despite errors, producing the best possible result.

Example:

// Input with multiple errors
// Parser attempts to parse as much as possible
// All errors reported, partial tree returned

The specific strategy depends on the parser backend and configuration.

Configuring Recovery

Different backends support different recovery options:

use sipha::backend::ll::{LlParser, LlConfig};

let config = LlConfig {
    error_recovery: true,  // Enable error recovery
    max_recovery_attempts: 10,  // Limit recovery attempts
    ..Default::default()
};

let mut parser = LlParser::new(&grammar, config)?;

Recovery Options

error_recovery (bool)

  • Enable or disable error recovery
  • Default: true
  • When disabled, parser stops at first error

max_recovery_attempts (usize)

  • Maximum number of recovery attempts per parse
  • Default: varies by backend
  • Prevents infinite recovery loops

When to Use Recovery

Use Recovery For:

  • IDEs and Language Servers: Need to parse incomplete or erroneous code
  • Interactive Tools: Users are actively editing code
  • Error Reporting: Want to find all errors in one pass
  • Partial Results: Need parse tree even with errors

Disable Recovery For:

  • Strict Validation: Want to fail fast on errors
  • Batch Processing: Errors should stop processing
  • Testing: Want to catch all errors immediately
  • Performance: Recovery has overhead

Recovery Examples

Example 1: Missing Delimiter

// Input: "if (condition { ... }"
// Parser behavior:
// 1. Reports UnexpectedEof error expecting ")"
// 2. May insert ")" and continue
// 3. Parses the block successfully
// Result: Error reported, but parse tree includes the block

Example 2: Unexpected Token

// Input: "let x = 42 +"
// Parser behavior:
// 1. Reports UnexpectedToken error
// 2. Skips the "+" or inserts placeholder
// 3. Continues parsing
// Result: Error reported, variable declaration partially parsed

Example 3: Invalid Syntax

// Input: "function() { return }"
// Parser behavior:
// 1. Reports InvalidSyntax error
// 2. Marks statement as incomplete
// 3. Continues parsing rest of function
// Result: Error reported, function body partially parsed

Recovery Quality

Recovery quality depends on:

  • Grammar structure: Well-structured grammars recover better
  • Error location: Errors at statement boundaries recover better
  • Parser backend: Different backends have different recovery strategies
  • Configuration: Recovery settings affect behavior

Limitations

Error recovery has some limitations:

  • May produce incorrect trees: Recovered trees may not match intended structure
  • Can mask errors: Multiple errors may be reduced to fewer reported errors
  • Performance overhead: Recovery adds processing time
  • Not always possible: Some errors can’t be recovered from

Best Practices

  1. Always check errors: Even with recovery, check result.errors
  2. Validate recovered trees: Don’t trust recovered parse trees completely
  3. Limit recovery attempts: Use max_recovery_attempts to prevent loops
  4. Consider context: Recovery works better with more context
  5. Test recovery: Test your grammar with various error scenarios

Next Steps

Rich Diagnostics

When the diagnostics feature is enabled, errors integrate with miette for beautiful error reporting with source code snippets, highlighted locations, and helpful suggestions.

Enabling Diagnostics

Add the diagnostics feature to your Cargo.toml:

[dependencies]
sipha = { version = "0.5.0", features = ["diagnostics"] }

This enables:

  • Rich error formatting with source code snippets
  • Highlighted error locations
  • Suggestions and hints
  • Integration with editor tooling

Using Diagnostics

Errors automatically implement the Diagnostic trait from miette:

#[cfg(feature = "diagnostics")]
use sipha::error::ParseError;

// Errors automatically implement Diagnostic trait
let error: ParseError = /* ... */;

// Print with rich formatting
eprintln!("{:#}", error);

// Errors include:
// - Source code snippets
// - Highlighted error locations
// - Suggestions and hints
// - Related code locations

Example Output

With diagnostics enabled, errors are displayed with rich formatting:

Error: Unexpected token
  ┌─ example.txt:10:15
  │
10 │     let x = 42 +
  │               ^ Unexpected token
  │
  = Expected: identifier, number, or operator
  = Help: Did you mean to add an expression after '+'?

Diagnostic Features

Source Code Snippets

Diagnostics automatically include source code around the error:

let error: ParseError = /* ... */;
eprintln!("{:#}", error);
// Shows:
//   ┌─ file.rs:42:10
//   │
//42 │     let x = invalid
//   │            ^^^^^^^^ Error here

Highlighted Locations

Error locations are highlighted in the source code:

// The error span is automatically highlighted
// Multiple related errors can be shown together

Suggestions

Diagnostics can include suggestions:

// "Did you mean" suggestions
// Fix suggestions
// Helpful hints

Multiple related errors can be shown together:

// Errors at different locations
// Related errors in the same context
// Error chains

Integration with Editors

Diagnostics integrate with editor tooling:

  • Language Servers: Errors are reported via LSP
  • IDE Integration: Errors shown in editor
  • Terminal Output: Beautiful terminal output
  • JSON Output: Structured error data
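
For example, a hypothetical helper that reduces an error to the span and message an editor layer typically needs (assuming span() returns a TextRange, as the examples above suggest); mapping spans to line and column positions is left to your LSP or editor integration:

use sipha::error::ParseError;
use sipha::syntax::TextRange;

// Hypothetical helper: extract the pieces an editor diagnostic needs
fn to_editor_diagnostic(error: &ParseError) -> (TextRange, String) {
    (error.span(), error.to_string())
}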

Custom Diagnostic Messages

You can customize diagnostic messages:

use sipha::error::ParseError;

// Create errors with custom messages
let error = ParseError::invalid_syntax_with_suggestion(
    span,
    "Invalid expression",
    "Try adding parentheses around the expression"
);

Diagnostic Severity

Diagnostics support different severity levels:

  • Error: Fatal errors that prevent parsing
  • Warning: Non-fatal issues (via ParseWarning)
  • Info: Informational messages
  • Hint: Suggestions for improvement

Example: Full Diagnostic Output

#[cfg(feature = "diagnostics")]
use sipha::error::ParseError;
use miette::Diagnostic;

fn report_with_diagnostics(error: &ParseError, source: &str) {
    // Print with full diagnostic formatting
    eprintln!("{:#}", error);
    
    // Or use miette's reporting features
    // miette::set_hook(/* ... */);
}

Performance Considerations

Diagnostics have minimal performance impact:

  • Compile-time: Feature is gated, no overhead when disabled
  • Runtime: Small overhead for formatting
  • Memory: Source code snippets are included in errors

When to Use Diagnostics

Use diagnostics for:

  • IDEs and Language Servers: Better user experience
  • Development Tools: Clearer error messages
  • Documentation: Better error examples
  • Debugging: Easier to understand errors

Next Steps

Best Practices

This chapter provides guidelines and common patterns for error handling in Sipha parsers.

Always Check for Errors

Always check for errors, even if you expect success:

let result = parser.parse(&tokens, entry_point);

// Always check errors, even if you expect success
if !result.errors.is_empty() {
    // Handle errors appropriately
    return Err(format!("Parsing failed: {} errors", result.errors.len()));
}

Why: Errors can occur even in valid-looking input due to:

  • Grammar ambiguities
  • Edge cases
  • Recovery attempts
  • Incomplete input

Provide Context in Error Messages

When reporting errors to users, include:

  • Source location: Line and column numbers
  • Context: Surrounding code
  • Suggestions: What was expected or how to fix it

use sipha::error::ParseError;

fn report_error(error: &ParseError, source: &str) {
    println!("Error: {}", error);
    println!("Location: {:?}", error.span());
    
    // Show context
    if let Some((before, error_span, after)) = error.get_context(source) {
        println!("Context: ...{}[{}]{}...", before, error_span, after);
    }
    
    // Show suggestions
    if let Some(suggestion) = error.did_you_mean("actual_token") {
        println!("Suggestion: {}", suggestion);
    }
}

Handle Warnings Appropriately

Warnings don’t prevent parsing, but should be addressed:

use sipha::error::Severity;

// Warnings don't prevent parsing, but should be addressed
for warning in &result.warnings {
    match warning.severity {
        Severity::Warning => {
            // Log or report warnings
            eprintln!("Warning: {}", warning.message);
        }
        Severity::Info => {
            // Optionally show info messages
            // Useful for development tools
        }
        Severity::Hint => {
            // Hints can be shown in IDE tooltips
            // Less urgent, more suggestions
        }
    }
}

Use Error Recovery Wisely

Error recovery is great for IDEs, but consider disabling it for:

  • Strict validation: When you want to fail fast
  • Batch processing: When errors should stop processing
  • Testing: When you want to catch all errors

use sipha::backend::ll::{LlParser, LlConfig};

// For strict validation
let strict_config = LlConfig {
    error_recovery: false,  // Disable recovery
    ..Default::default()
};

// For IDE/language server
let ide_config = LlConfig {
    error_recovery: true,  // Enable recovery
    max_recovery_attempts: 100,  // Allow many attempts
    ..Default::default()
};

Leverage Diagnostics

If building an IDE or language server, enable the diagnostics feature for:

  • Better error messages
  • Source code snippets
  • Helpful suggestions
  • Integration with editor tooling

[dependencies]
sipha = { version = "0.5.0", features = ["diagnostics"] }

Common Error Patterns

Pattern 1: Missing Delimiter

// Input: "if (condition { ... }"
// Error: UnexpectedEof with expected: [")"]

// Handling:
// 1. Report the error with context
// 2. Show what delimiter is missing
// 3. Suggest where to add it

Pattern 2: Unexpected Token

// Input: "let x = 42 +"
// Error: UnexpectedToken with expected: [number, identifier, "("]

// Handling:
// 1. Show the unexpected token
// 2. List expected alternatives
// 3. Provide "did you mean" suggestions

Pattern 3: Invalid Syntax

// Input: "function() { return }"
// Error: InvalidSyntax with message: "Missing return value"

// Handling:
// 1. Explain the syntax violation
// 2. Show the problematic code
// 3. Suggest how to fix it

Error Handling Checklist

When implementing error handling:

  • Check result.errors after every parse
  • Check result.warnings for non-fatal issues
  • Provide context in error messages
  • Use appropriate severity levels for warnings
  • Enable diagnostics for better error reporting
  • Test error recovery behavior
  • Handle lexer errors separately if needed
  • Consider error metrics for debugging

Performance Considerations

Error handling has minimal performance impact:

  • Error checking: Negligible overhead
  • Error recovery: Small overhead, but enables partial results
  • Diagnostics: Small overhead for formatting
  • Metrics: Minimal overhead for tracking

Testing Error Handling

Test your error handling:

#[test]
fn test_error_handling() {
    // Assuming parser, invalid_tokens, entry_point, and config are defined as in earlier examples
    let result = parser.parse(&invalid_tokens, entry_point);
    
    // Verify errors are reported
    assert!(!result.errors.is_empty());
    
    // Verify error details
    let error = &result.errors[0];
    assert!(error.span().start() > TextSize::zero());
    
    // Verify recovery (if enabled)
    if config.error_recovery {
        assert!(result.metrics.errors_recovered > 0);
    }
}

Error Handling in Production

For production code:

  1. Log errors: Use proper logging framework
  2. Monitor metrics: Track error rates and recovery success
  3. User-friendly messages: Translate technical errors to user messages
  4. Graceful degradation: Continue operation when possible
  5. Error reporting: Collect error data for improvement
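
For instance, a sketch of the first two points using the log crate (an assumption; any logging framework works):

use log::{error, info, warn};

let result = parser.parse(&tokens, entry_point);

for e in &result.errors {
    error!("parse error at {:?}: {}", e.span(), e);
}
for w in &result.warnings {
    warn!("{} at {:?}", w.message, w.span);
}
info!(
    "consumed {} tokens, recovered from {} errors in {:?}",
    result.metrics.tokens_consumed,
    result.metrics.errors_recovered,
    result.metrics.parse_time
);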

Next Steps

Performance

This chapter covers performance optimization tips and benchmarking for Sipha.

General Tips

Use Incremental Parsing

For interactive applications, always use incremental parsing:

use sipha::incremental::IncrementalParser;

let mut incremental = IncrementalParser::new(parser);
// Much faster for small edits

Choose the Right Backend

  • LL(k): Fast, good for most grammars
  • LR: Fast, good for batch parsing
  • GLR: Slower, only use for ambiguous grammars

Optimize Grammar

  • Factor common patterns: Extract common subexpressions
  • Avoid deep nesting: Flatten deeply nested structures
  • Use appropriate lookahead: Don’t use more lookahead than needed

Lexer Performance

Pattern Ordering

Order patterns from most specific to least specific:

// Good: Specific patterns first
.token(MySyntaxKind::Keyword, Pattern::Literal("if".into()))
.token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::letters()))

// Bad: General patterns first
.token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::letters()))
.token(MySyntaxKind::Keyword, Pattern::Literal("if".into()))

Use Character Classes

Prefer character classes over regex when possible:

// Good: Character class
.token(MySyntaxKind::Number, Pattern::CharClass(CharSet::digits()))

// Slower: Regex
.token(MySyntaxKind::Number, Pattern::Regex(r"\d+".into()))

Keyword Trie

Keywords are matched using a trie for O(m) performance where m is keyword length.

Parser Performance

Cache Population

Always provide grammar for cache population:

// Good: Cache populated
parser.parse_incremental(&tokens, old_tree, &edits, entry, Some(&grammar))

// Bad: No cache
parser.parse_incremental(&tokens, old_tree, &edits, entry, None)

Reuse Budget

Adjust reuse budget based on use case:

// For large files
let budget = ReuseBudget::Heuristic {
    max_depth: 30,
    max_nodes: 2000,
};

// For small files
let budget = ReuseBudget::Fixed(100);

Memory Optimization

Arena Allocation

Green nodes are arena-allocated for efficient memory usage.

Node Sharing

Incremental parsing shares unchanged nodes, reducing memory usage.

Cache Eviction

Cache automatically evicts old versions:

// Keep last 2 versions (default)
cache.evict_old_entries(2);

Benchmarking

Measure Parse Time

let start = std::time::Instant::now();
let result = parser.parse(&tokens, entry);
let duration = start.elapsed();

println!("Parse time: {:?}", duration);
println!("Metrics: {:?}", result.metrics);

Compare Backends

use std::time::Instant;

// LL parser (assuming ll_parser and lr_parser are already constructed)
let ll_start = Instant::now();
ll_parser.parse(&tokens, entry);
let ll_time = ll_start.elapsed();

// LR parser
let lr_start = Instant::now();
lr_parser.parse(&tokens, entry);
let lr_time = lr_start.elapsed();

Profiling

Use a Profiler

Profile your parser to find bottlenecks:

# With perf
perf record --call-graph=dwarf cargo run --release
perf report

# With valgrind
valgrind --tool=callgrind cargo run --release

Common Bottlenecks

  • Lexer: Pattern matching, especially regex
  • Parser: Table lookups, especially for large grammars
  • Tree construction: Node allocation and linking

Best Practices

  1. Profile first: Profile before optimizing
  2. Measure changes: Benchmark before and after changes
  3. Use incremental parsing: Always use for interactive apps
  4. Optimize grammar: Factor and simplify grammar
  5. Choose right backend: Use appropriate backend for use case

Next Steps

Custom Patterns

This chapter shows how to extend lexer patterns with custom matching logic.

Custom Matchers

For complex tokenization logic, use custom matchers:

let lexer = LexerBuilder::new()
    .custom_token(MySyntaxKind::String, |text, pos| {
        // Custom string matching
        if text[pos..].starts_with('"') {
            let mut end = pos + 1;
            while end < text.len() && text.as_bytes()[end] != b'"' {
                if text.as_bytes()[end] == b'\\' {
                    end += 2; // Skip escape sequence
                } else {
                    end += 1;
                }
            }
            if end < text.len() {
                let value = text[pos+1..end].to_string();
                Some((end - pos + 1, TokenValue::String(value)))
            } else {
                None // Unterminated string
            }
        } else {
            None
        }
    })
    .build(MySyntaxKind::Eof, MySyntaxKind::Ident)
    .expect("Failed to build lexer");

Custom Matcher Signature

Custom matchers have this signature:

Fn(&str, usize) -> Option<(usize, TokenValue)>

Where:

  • Input: &str is the text, usize is the starting position
  • Output: Option<(usize, TokenValue)> where:
    • usize is the number of characters matched
    • TokenValue is the token’s value (if any)
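
A minimal sketch of this signature in use; the token kind MySyntaxKind::At is hypothetical:

.custom_token(MySyntaxKind::At, |text, pos| {
    // Match a single '@' character and attach no value to the token
    if text[pos..].starts_with('@') {
        Some((1, TokenValue::None))
    } else {
        None
    }
})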

TokenValue

Token values can be:

pub enum TokenValue {
    Integer(i64),
    Float(f64),
    String(String),
    None,
}

Examples

String Literal

.custom_token(MySyntaxKind::String, |text, pos| {
    if !text[pos..].starts_with('"') {
        return None;
    }
    
    let mut end = pos + 1;
    while end < text.len() {
        match text.as_bytes()[end] {
            b'"' => {
                let value = text[pos+1..end].to_string();
                return Some((end - pos + 1, TokenValue::String(value)));
            }
            b'\\' if end + 1 < text.len() => {
                end += 2; // Skip escape sequence
            }
            _ => end += 1,
        }
    }
    
    None // Unterminated string
})

Floating Point Number

.custom_token(MySyntaxKind::Float, |text, pos| {
    let mut end = pos;
    let mut has_dot = false;
    
    // Optional sign
    if end < text.len() && (text.as_bytes()[end] == b'+' || text.as_bytes()[end] == b'-') {
        end += 1;
    }
    
    // Digits before dot
    while end < text.len() && text.as_bytes()[end].is_ascii_digit() {
        end += 1;
    }
    
    // Dot
    if end < text.len() && text.as_bytes()[end] == b'.' {
        has_dot = true;
        end += 1;
    }
    
    // Digits after dot
    while end < text.len() && text.as_bytes()[end].is_ascii_digit() {
        end += 1;
    }
    
    if has_dot && end > pos {
        let value = text[pos..end].parse::<f64>().ok()?;
        Some((end - pos, TokenValue::Float(value)))
    } else {
        None
    }
})

Multiline Comment

.custom_token(MySyntaxKind::Comment, |text, pos| {
    if !text[pos..].starts_with("/*") {
        return None;
    }
    
    let mut end = pos + 2;
    while end + 1 < text.len() {
        if text.as_bytes()[end] == b'*' && text.as_bytes()[end + 1] == b'/' {
            let value = text[pos+2..end].to_string();
            return Some((end - pos + 2, TokenValue::String(value)));
        }
        end += 1;
    }
    
    None // Unterminated comment
})

Best Practices

  1. Return early: Return None quickly if pattern doesn’t match
  2. Handle edge cases: Handle EOF, invalid sequences, etc.
  3. Escape sequences: Properly handle escape sequences
  4. Performance: Keep matching logic efficient
  5. Error handling: Return None for invalid input

Combining with Regular Patterns

Custom matchers work alongside regular patterns:

let lexer = LexerBuilder::new()
    // Regular patterns
    .token(MySyntaxKind::Number, Pattern::CharClass(CharSet::digits()))
    .token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
    
    // Custom matcher
    .custom_token(MySyntaxKind::String, |text, pos| {
        // Custom logic
    })
    
    .build(MySyntaxKind::Eof, MySyntaxKind::Ident)
    .expect("Failed to build lexer");

Priority

Custom matchers have the same priority as regular patterns based on definition order.

Next Steps

  • See Lexers for basic pattern usage
  • Check Examples for real-world patterns

Grammar Analysis

Sipha provides tools for analyzing and optimizing grammars.

Grammar Validation

Validate grammars before use:

use sipha::backend::ll::LlParser;

let errors = LlParser::validate(&grammar);
if !errors.is_empty() {
    for error in errors {
        eprintln!("Grammar error: {:?}", error);
    }
}

Left Recursion Detection

Detect left-recursive rules:

use sipha::grammar::analysis::detect_left_recursion;

let left_recursive = detect_left_recursion(&grammar);
for rule in left_recursive {
    eprintln!("Left-recursive rule: {:?}", rule);
}

Reachability Analysis

Find unreachable rules:

use sipha::grammar::analysis::find_unreachable_rules;

let unreachable = find_unreachable_rules(&grammar);
for rule in unreachable {
    eprintln!("Unreachable rule: {:?}", rule);
}

FIRST and FOLLOW Sets

Compute FIRST and FOLLOW sets for LL parsing:

use sipha::grammar::analysis::compute_first_sets;
use sipha::grammar::analysis::compute_follow_sets;

let first_sets = compute_first_sets(&grammar);
let follow_sets = compute_follow_sets(&grammar);

Conflict Detection

Detect shift/reduce and reduce/reduce conflicts:

use sipha::backend::lr::LrParser;

let conflicts = LrParser::detect_conflicts(&grammar);
for conflict in conflicts {
    eprintln!("Conflict: {:?}", conflict);
}

Grammar Optimization

Factor Common Patterns

Extract common subexpressions:

// Before: Duplicated pattern
.rule(MyNonTerminal::Expr, Expr::choice(vec![
    Expr::seq(vec![Expr::non_terminal(MyNonTerminal::Term), Expr::token(plus_token)]),
    Expr::seq(vec![Expr::non_terminal(MyNonTerminal::Term), Expr::token(minus_token)]),
]))

// After: Factored
.rule(MyNonTerminal::Expr, Expr::seq(vec![
    Expr::non_terminal(MyNonTerminal::Term),
    Expr::choice(vec![
        Expr::token(plus_token),
        Expr::token(minus_token),
    ]),
]))

Eliminate Left Recursion

Transform left-recursive rules:

// Before: Left-recursive
.rule(MyNonTerminal::Expr, Expr::choice(vec![
    Expr::seq(vec![
        Expr::non_terminal(MyNonTerminal::Expr),
        Expr::token(plus_token),
        Expr::non_terminal(MyNonTerminal::Term),
    ]),
    Expr::non_terminal(MyNonTerminal::Term),
]))

// After: Right-recursive
.rule(MyNonTerminal::Expr, Expr::seq(vec![
    Expr::non_terminal(MyNonTerminal::Term),
    Expr::opt(Box::new(Expr::seq(vec![
        Expr::token(plus_token),
        Expr::non_terminal(MyNonTerminal::Expr),
    ]))),
]))

Grammar Visualization

Visualize grammar structure (planned feature):

// Future API
let dot = grammar.to_dot();
std::fs::write("grammar.dot", dot).unwrap();

Best Practices

  1. Validate early: Validate grammars before use
  2. Check for conflicts: Detect and resolve conflicts
  3. Optimize structure: Factor and simplify grammar
  4. Test thoroughly: Test with various inputs
  5. Analyze performance: Profile grammar performance

Next Steps

Unicode Support

Sipha provides full Unicode support for identifiers and text when the unicode feature is enabled.

Enabling Unicode

Enable Unicode support in Cargo.toml:

[dependencies]
sipha = { version = "0.5.0", features = ["unicode"] }

Unicode Character Classes

Use Unicode character classes in patterns:

use sipha::lexer::{CharSet, Pattern};

// Unicode letters
let unicode_letters = CharSet::unicode_letters();

// Unicode digits
let unicode_digits = CharSet::unicode_digits();

// Characters that can start a Unicode identifier
let unicode_ident = CharSet::unicode_ident_start();

Unicode Identifiers

Match Unicode identifiers:

let lexer = LexerBuilder::new()
    .token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::unicode_ident_start()))
    .build(MySyntaxKind::Eof, MySyntaxKind::Ident)
    .expect("Failed to build lexer");

Unicode Normalization

Sipha handles Unicode normalization automatically:

  • NFC: Normalized Form Canonical Composition
  • NFD: Normalized Form Canonical Decomposition

Examples

Japanese Identifiers

let lexer = LexerBuilder::new()
    .token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::unicode_letters()))
    .build(MySyntaxKind::Eof, MySyntaxKind::Ident)
    .expect("Failed to build lexer");

// Can tokenize: 変数名, 関数名, etc.

Emoji in Strings

// Emoji are handled correctly in string literals
let input = r#""Hello 👋 World 🌍""#;
let tokens = lexer.tokenize(input).unwrap();

Performance

Unicode support has minimal performance impact:

  • Character classes: Efficient Unicode property lookups
  • Normalization: Cached normalization results
  • Identifiers: Fast Unicode property checks

Best Practices

  1. Enable when needed: Only enable unicode feature if needed
  2. Use appropriate classes: Use Unicode character classes for identifiers
  3. Handle normalization: Be aware of Unicode normalization
  4. Test with Unicode: Test with various Unicode inputs

Next Steps

Adding a Backend

This chapter shows how to implement a new parsing backend for Sipha.

ParserBackend Trait

All backends implement the ParserBackend trait:

pub trait ParserBackend<T, N>: Sized + Send
where
    T: Token,
    N: NonTerminal,
{
    type Config: Default + Clone;
    type Error: std::error::Error + Send + Sync + 'static;
    type State: Send + Sync;

    fn new(grammar: &Grammar<T, N>, config: Self::Config) -> Result<Self, Self::Error>;
    fn parse(&mut self, input: &[T], entry: N) -> ParseResult<T, N>;
    fn parse_with_session(...) -> ParseResult<T, N>;
    fn validate(grammar: &Grammar<T, N>) -> Vec<GrammarError<T, N>>;
    fn capabilities() -> BackendCapabilities;
    fn state(&self) -> &Self::State;
}

Implementation Steps

1. Define Backend Type

pub struct MyParser<T, N>
where
    T: Token,
    N: NonTerminal,
{
    grammar: Grammar<T, N>,
    config: MyConfig,
    state: MyState,
}

2. Implement ParserBackend

impl<T, N> ParserBackend<T, N> for MyParser<T, N>
where
    T: Token,
    N: NonTerminal,
{
    type Config = MyConfig;
    type Error = MyError;
    type State = MyState;

    fn new(grammar: &Grammar<T, N>, config: Self::Config) -> Result<Self, Self::Error> {
        // Initialize parser
        Ok(Self {
            grammar: grammar.clone(),
            config,
            state: MyState::new(),
        })
    }

    fn parse(&mut self, input: &[T], entry: N) -> ParseResult<T, N> {
        // Implement parsing logic
        // ...
    }

    fn parse_with_session(
        &mut self,
        input: &[T],
        entry: N,
        session: &IncrementalSession<'_, T::Kind>,
    ) -> ParseResult<T, N> {
        // Implement incremental parsing
        // ...
    }

    fn validate(grammar: &Grammar<T, N>) -> Vec<GrammarError<T, N>> {
        // Validate grammar compatibility
        // ...
    }

    fn capabilities() -> BackendCapabilities {
        BackendCapabilities {
            name: "MyParser",
            algorithm: Algorithm::PEG, // Or other
            supports_left_recursion: true,
            supports_ambiguity: false,
            supports_incremental: true,
            supports_error_recovery: true,
            max_lookahead: None,
        }
    }

    fn state(&self) -> &Self::State {
        &self.state
    }
}

3. Implement Parsing Logic

Implement the core parsing algorithm:

fn parse(&mut self, input: &[T], entry: N) -> ParseResult<T, N> {
    let mut builder = GreenNodeBuilder::new();
    let mut errors = Vec::new();
    let mut tokens = input.iter();
    
    // Parse according to algorithm
    // ...
    
    ParseResult {
        root: builder.finish().unwrap(),
        errors,
        warnings: Vec::new(),
        metrics: ParseMetrics::default(),
        #[cfg(feature = "backend-glr")]
        forest: None,
        _phantom: PhantomData,
    }
}

4. Implement Incremental Parsing

Support incremental parsing:

fn parse_with_session(
    &mut self,
    input: &[T],
    entry: N,
    session: &IncrementalSession<'_, T::Kind>,
) -> ParseResult<T, N> {
    // Find reusable nodes
    // Re-parse only affected regions
    // Combine reused and new nodes
    // ...
}

5. Implement Validation

Validate grammar compatibility:

fn validate(grammar: &Grammar<T, N>) -> Vec<GrammarError<T, N>> {
    let mut errors = Vec::new();
    
    // Check for left recursion
    // Check for ambiguity
    // Check for other requirements
    // ...
    
    errors
}

Example: PEG Parser

Here’s a simplified example of a PEG parser:

pub struct PegParser<T, N> {
    grammar: Grammar<T, N>,
    config: PegConfig,
    state: PegState,
}

impl<T, N> ParserBackend<T, N> for PegParser<T, N>
where
    T: Token,
    N: NonTerminal,
{
    type Config = PegConfig;
    type Error = PegError;
    type State = PegState;

    fn new(grammar: &Grammar<T, N>, config: Self::Config) -> Result<Self, Self::Error> {
        Ok(Self {
            grammar: grammar.clone(),
            config,
            state: PegState::new(),
        })
    }

    fn parse(&mut self, input: &[T], entry: N) -> ParseResult<T, N> {
        // PEG parsing logic
        // ...
    }

    // ... other methods
}

Best Practices

  1. Follow existing patterns: Look at LL/LR/GLR implementations
  2. Handle errors gracefully: Return errors, don’t panic
  3. Support incremental: Implement incremental parsing if possible
  4. Validate grammar: Check grammar compatibility
  5. Document behavior: Document algorithm-specific behavior

Next Steps

Custom Syntax Kinds

This chapter covers advanced usage of syntax kinds.

Advanced SyntaxKind Implementation

Additional Methods

The SyntaxKind trait provides optional methods:

impl SyntaxKind for MySyntaxKind {
    fn is_terminal(self) -> bool {
        // ...
    }
    
    fn is_trivia(self) -> bool {
        // ...
    }
    
    // Optional: Mark keywords
    fn is_keyword(self) -> bool {
        matches!(self, 
            MySyntaxKind::If | 
            MySyntaxKind::While | 
            MySyntaxKind::For
        )
    }
    
    // Optional: Mark literals
    fn is_literal(self) -> bool {
        matches!(self, 
            MySyntaxKind::Number | 
            MySyntaxKind::String | 
            MySyntaxKind::Boolean
        )
    }
}

Syntax Kind Organization

Grouping Kinds

Organize kinds logically:

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind {
    // Keywords
    If, While, For, Return,
    
    // Operators
    Plus, Minus, Multiply, Divide,
    
    // Punctuation
    LParen, RParen, LBrace, RBrace,
    
    // Literals
    Number, String, Boolean,
    
    // Identifiers
    Ident,
    
    // Trivia
    Whitespace, Comment,
    
    // Non-terminals
    Expr, Stmt, Block,
    
    // Special
    Eof, Error,
}

Using Macros

Use macros to reduce boilerplate:

macro_rules! define_syntax_kinds {
    (
        keywords: [$($keyword:ident),*],
        operators: [$($op:ident),*],
        // ...
    ) => {
        #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
        enum MySyntaxKind {
            $($keyword,)*
            $($op,)*
            // ...
        }
        
        impl SyntaxKind for MySyntaxKind {
            fn is_keyword(self) -> bool {
                matches!(self, $(MySyntaxKind::$keyword)|*)
            }
            // ...
        }
    };
}

Syntax Kind Conversion

Converting to/from Strings

impl MySyntaxKind {
    pub fn as_str(self) -> &'static str {
        match self {
            MySyntaxKind::If => "if",
            MySyntaxKind::While => "while",
            // ...
        }
    }
    
    pub fn from_str(s: &str) -> Option<Self> {
        match s {
            "if" => Some(MySyntaxKind::If),
            "while" => Some(MySyntaxKind::While),
            // ...
            _ => None,
        }
    }
}

Best Practices

  1. Use descriptive names: Choose clear, descriptive names
  2. Group logically: Organize kinds by category
  3. Document purpose: Document what each kind represents
  4. Use exhaustiveness: Use exhaustive matching in implementations
  5. Consider extensions: Design for future extensions

Next Steps

Extending Lexers

This chapter shows how to extend lexers with custom tokenization logic.

Custom Tokenization

Custom Matchers

Use custom matchers for complex patterns:

let lexer = LexerBuilder::new()
    .custom_token(MySyntaxKind::String, |text, pos| {
        // Custom string matching
        // ...
    })
    .build(MySyntaxKind::Eof, MySyntaxKind::Ident)
    .expect("Failed to build lexer");

See Custom Patterns for details.

Extending Pattern Types

Adding New Pattern Types

To add new pattern types, extend the Pattern enum:

pub enum Pattern {
    Literal(CompactString),
    CharClass(CharSet),
    Repeat { pattern: Box<Pattern>, min: usize, max: Option<usize> },
    Regex(CompactString),
    Any,
    // Add new pattern types
    Custom(CustomPattern),
}

Implementing Pattern Matching

Implement pattern matching logic:

impl Pattern {
    fn matches(&self, text: &str, pos: usize) -> Option<usize> {
        match self {
            Pattern::Custom(custom) => custom.matches(text, pos),
            // ... other patterns
        }
    }
}

Lexer Hooks

Pre-processing

Add pre-processing hooks:

pub struct LexerBuilder<K: SyntaxKind> {
    // ...
    preprocessors: Vec<Box<dyn Fn(&str) -> String>>,
}

impl<K: SyntaxKind> LexerBuilder<K> {
    pub fn preprocess(mut self, f: impl Fn(&str) -> String + 'static) -> Self {
        self.preprocessors.push(Box::new(f));
        self
    }
}

Post-processing

Add post-processing hooks:

impl<K: SyntaxKind> LexerBuilder<K> {
    pub fn postprocess(mut self, f: impl Fn(&mut Vec<Token<K>>) + 'static) -> Self {
        self.postprocessors.push(Box::new(f));
        self
    }
}

Incremental Lexing

Token Updates

Update tokens incrementally:

pub struct IncrementalLexer<K: SyntaxKind> {
    lexer: CompiledLexer<K>,
    old_tokens: Vec<Token<K>>,
}

impl<K: SyntaxKind> IncrementalLexer<K> {
    pub fn tokenize_incremental(
        &self,
        text: &str,
        edits: &[TextEdit],
    ) -> Vec<Token<K>> {
        // Re-tokenize only affected regions
        // Reuse unchanged tokens
        // ...
    }
}

Best Practices

  1. Use custom matchers: For complex patterns, use custom matchers
  2. Extend patterns carefully: Only add new pattern types if needed
  3. Test thoroughly: Test with various inputs
  4. Document behavior: Document custom behavior
  5. Consider performance: Keep custom logic efficient

Next Steps

Contributing

Thank you for your interest in contributing to Sipha! This document provides guidelines and instructions for contributing.

Code of Conduct

This project adheres to the Rust Code of Conduct. By participating, you are expected to uphold this code.

Getting Started

Prerequisites

  • Rust toolchain (stable, beta, or nightly)
  • cargo (comes with Rust)
  • rustfmt and clippy (install with rustup component add rustfmt clippy)

Development Setup

  1. Fork the repository and clone your fork:

    git clone https://github.com/yourusername/sipha.git
    cd sipha
    
  2. Build the project:

    cargo build
    
  3. Run the test suite:

    cargo test
    
  4. Run clippy:

    cargo clippy --all-targets --all-features --workspace -- -W clippy::pedantic -W clippy::nursery -W clippy::cargo -D warnings
    
  5. Check formatting:

    cargo fmt --all -- --check
    

Development Workflow

  1. Create a branch for your changes:

    git checkout -b feature/your-feature-name
    # or
    git checkout -b fix/your-bug-fix
    
  2. Make your changes following the coding standards below.

  3. Write or update tests for your changes.

  4. Ensure all tests pass:

    cargo test --all-features
    
  5. Run clippy and fix any warnings:

    cargo clippy --all-targets --all-features --workspace -- -W clippy::pedantic -W clippy::nursery -W clippy::cargo -D warnings
    
  6. Format your code:

    cargo fmt --all
    
  7. Commit your changes with a clear, descriptive commit message:

    git commit -m "Add feature: description of what you added"
    
  8. Push to your fork:

    git push origin feature/your-feature-name
    
  9. Open a Pull Request on GitHub with a clear description of your changes.

Coding Standards

Code Style

  • Follow Rust’s standard formatting conventions. Run cargo fmt before committing.
  • Use cargo clippy to catch common issues and follow Rust best practices.
  • We use pedantic clippy lints - some are allowed in clippy.toml and lib.rs, but try to follow them when possible.

Documentation

  • Document all public APIs with doc comments.
  • Include examples in doc comments where helpful.
  • Update the README if you add new features or change existing behavior.

Testing

  • Write tests for new features and bug fixes.
  • Aim for good test coverage, especially for critical paths.
  • Include both unit tests and integration tests where appropriate.
  • Test with different feature combinations when applicable.

Error Handling

  • Prefer Result types over panicking in library code.
  • Provide clear, actionable error messages.
  • Use appropriate error types from the error module.

Commit Messages

  • Use clear, descriptive commit messages.
  • Start with a verb in imperative mood (e.g., “Add”, “Fix”, “Update”).
  • Reference issue numbers when applicable (e.g., “Fix #123: description”).

Example:

Add incremental parsing cache invalidation

Implements cache invalidation for incremental parsing when text edits
occur. This improves performance by only re-parsing affected regions.

Fixes #42

Pull Request Process

  1. Ensure your PR is up to date with the target branch (usually trunk).

  2. Write a clear description of what your PR does and why.

  3. Reference related issues if applicable.

  4. Ensure CI passes - all tests, clippy checks, and formatting checks must pass.

  5. Request review from maintainers.

  6. Address feedback promptly and update your PR as needed.

Feature Requests

If you have an idea for a new feature:

  1. Check existing issues to see if it’s already been discussed.
  2. Open an issue describing the feature, its use case, and potential implementation approach.
  3. Wait for maintainer feedback before implementing.

Bug Reports

When reporting bugs, please include:

  • A clear description of the bug
  • Steps to reproduce
  • Expected behavior
  • Actual behavior
  • Your environment (Rust version, OS, etc.)
  • Minimal code example if possible

Fuzz Testing

Sipha includes fuzz testing infrastructure to ensure robustness and correctness. Fuzz tests help find edge cases and potential panics in the parser.

Prerequisites

Install cargo-fuzz:

cargo install cargo-fuzz

Running Fuzz Tests

Navigate to the fuzz directory and run the fuzz targets:

cd fuzz
cargo fuzz run parser_fuzz
cargo fuzz run incremental_fuzz

Fuzz Targets

  • parser_fuzz: Tests the parser with random byte streams to ensure no panics occur.
  • incremental_fuzz: Differential fuzzing that compares incremental parsing results with full re-parsing to ensure correctness.

Adding New Fuzz Targets

To add a new fuzz target:

  1. Create a new file in fuzz/fuzz_targets/ with your fuzz target code
  2. Add a [[bin]] entry in fuzz/Cargo.toml pointing to your target
  3. Run cargo fuzz list to verify the target is registered

Questions?

Feel free to open an issue for questions or discussions. We’re happy to help!

License

By contributing to Sipha, you agree that your contributions will be licensed under the MIT License.

Design Principles

This chapter explains the core design principles behind Sipha.

Incremental-First Design

Sipha is designed from the ground up for incremental parsing:

  • Node reuse: Unchanged subtrees are automatically reused
  • Cache management: Parse results are cached for future reuse
  • Affected regions: Only affected regions are re-parsed
  • Performance: Optimized for interactive editing scenarios

Immutability

Sipha uses immutable data structures:

  • Green trees: Immutable, shareable representation
  • Safe sharing: Immutability enables safe sharing across threads
  • Functional style: Operations create new trees rather than modifying existing ones

Modularity

Sipha is modular and extensible:

  • Backend system: Multiple parsing backends via trait
  • Feature flags: Optional features via feature flags
  • Pluggable components: Lexers, parsers, and trees are pluggable

Performance

Performance is a key consideration:

  • Efficient algorithms: Use efficient parsing algorithms
  • Memory efficiency: Shared trees reduce memory usage
  • Fast operations: Optimized for common operations
  • Incremental updates: Fast updates for interactive applications

Type Safety

Sipha leverages Rust’s type system:

  • Generic types: Generic over token and non-terminal types
  • Trait bounds: Clear trait bounds for extensibility
  • Type safety: Compile-time guarantees for correctness

Error Handling

Sipha provides comprehensive error handling:

  • Result types: Use Result for fallible operations
  • Error recovery: Configurable error recovery strategies
  • Rich diagnostics: Detailed error messages with context

API Design

Sipha’s API is designed for:

  • Ease of use: Simple, intuitive API
  • Flexibility: Support for various use cases
  • Extensibility: Easy to extend and customize
  • Documentation: Well-documented with examples

Next Steps

Module Structure

This chapter describes the organization of Sipha’s codebase.

Core Modules

syntax

Syntax tree types and operations:

  • green.rs: Green tree nodes
  • red.rs: Red tree nodes
  • kind.rs: Syntax kind trait
  • builder.rs: Tree builder
  • visitor.rs: Tree visitors
  • query.rs: Tree queries

lexer

Lexical analysis:

  • builder.rs: Lexer builder
  • dfa.rs: DFA construction and tokenization
  • token.rs: Token types
  • mod.rs: Module exports

grammar

Grammar definition and validation:

  • builder.rs: Grammar builder
  • expr.rs: Grammar expressions
  • validate.rs: Grammar validation
  • analysis.rs: Grammar analysis
  • docs.rs: Grammar documentation

backend

Parser backends:

  • mod.rs: Backend trait and common code
  • ll/: LL(k) parser implementation
  • lr/: LR parser implementation
  • glr/: GLR parser implementation
  • common.rs: Common backend utilities

incremental

Incremental parsing support:

  • mod.rs: Incremental parser and session

error

Error types and diagnostics:

  • mod.rs: Error types and diagnostics

parser

Parser traits and interfaces:

  • mod.rs: Parser traits

Workspace Structure

sipha/
├── crates/
│   ├── sipha/          # Main library
│   ├── sipha_derive/   # Procedural macros
│   └── sipha_lsp/      # LSP support
├── fuzz/               # Fuzz testing
└── doc/                # Documentation (this book)

Module Dependencies

syntax
  ↑
  ├── lexer
  │     ↓
  │   grammar
  │     ↓
  │   backend
  │     ↓
  │   incremental
  │
  └── error

Next Steps

Data Structures

This chapter describes the key data structures in Sipha.

Green Trees

GreenNode

Immutable, shareable tree node:

use sipha::syntax::{SyntaxKind, TextSize};

pub struct GreenNode<K: SyntaxKind> {
    kind: K,
    text_len: TextSize,
    children: GreenChildren<K>,
}

GreenToken

Immutable token:

use sipha::syntax::SyntaxKind;
use compact_str::CompactString;

pub struct GreenToken<K: SyntaxKind> {
    kind: K,
    text: CompactString,
}

GreenElement

Tree element (node or token):

use std::sync::Arc;
use sipha::syntax::SyntaxKind;

pub enum GreenElement<K: SyntaxKind> {
    Node(Arc<GreenNode<K>>),
    Token(GreenToken<K>),
}

Red Trees

SyntaxNode

Red tree node for traversal:

use std::sync::Arc;

pub struct SyntaxNode {
    green: Arc<GreenNode>,
    parent: Option<Arc<SyntaxNode>>,
    index: usize,
}

SyntaxToken

Red tree token:

use std::sync::Arc;

pub struct SyntaxToken {
    green: GreenToken,
    parent: Option<Arc<SyntaxNode>>,
    index: usize,
}

Grammar

Grammar

Grammar definition:

use std::collections::HashMap;
use lasso::Rodeo;

pub struct Grammar<T, N> {
    rules: HashMap<N, Rule<T, N>>,
    entry_point: N,
    interner: Rodeo,
}

Rule

Production rule:

pub struct Rule<T, N> {
    pub lhs: N,
    pub rhs: Expr<T, N>,
    pub metadata: RuleMetadata,
}

Expr

Grammar expression:

pub enum Expr<T, N> {
    Token(T),
    Rule(N),
    Seq(Vec<Expr<T, N>>),
    Choice(Vec<Expr<T, N>>),
    // ... more variants
}

Incremental Parsing

IncrementalParser

Incremental parser wrapper:

pub struct IncrementalParser<P, T, N> {
    parser: P,
    cache: ParseCache<T::Kind>,
}

IncrementalSession

Incremental parsing session:

pub struct IncrementalSession<'a, K> {
    old_tree: Option<&'a GreenNode<K>>,
    edits: &'a [TextEdit],
    affected: AffectedRegion,
    reusable: Vec<ReuseCandidate<K>>,
    cache: Option<&'a ParseCache<K>>,
}

ParseCache

Parse result cache:

use std::collections::HashMap;
use std::sync::Arc;
use lasso::Rodeo;

pub struct ParseCache<K> {
    version: usize,
    nodes: HashMap<CacheKey, Arc<GreenNode<K>>>,
    interner: Rodeo,
}

Lexer

CompiledLexer

Compiled lexer:

use std::collections::{HashMap, HashSet};
use compact_str::CompactString;

pub struct CompiledLexer<K> {
    rules: Vec<CompiledRule<K>>,
    dfa: Option<Dfa>,
    keywords: HashMap<CompactString, K>,
    trivia_kinds: HashSet<K>,
}

Token

Lexer token:

use compact_str::CompactString;
use sipha::syntax::TextRange;

pub struct Token<K> {
    pub kind: K,
    pub text: CompactString,
    pub range: TextRange,
}

Error Types

ParseError

Parsing error:

use sipha::syntax::TextRange;

pub enum ParseError {
    UnexpectedToken { span: TextRange, expected: Vec<String> },
    InvalidSyntax { span: TextRange, message: String },
    UnexpectedEof { span: TextRange, expected: Vec<String> },
    Ambiguity { span: TextRange, alternatives: Vec<String> },
}

LexerError

Lexer error:

use sipha::syntax::TextRange;

pub struct LexerError {
    pub span: TextRange,
    pub kind: LexerErrorKind,
}

Next Steps

Performance Considerations

This chapter covers performance considerations in Sipha’s design.

Memory Efficiency

Green Tree Sharing

Green trees are designed for sharing:

  • Arc-based: Nodes are reference-counted with Arc (see the sketch below)
  • Shared subtrees: Unchanged subtrees are shared across edits
  • Arena allocation: Nodes can be arena-allocated for efficiency
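
For illustration, here is a minimal sketch of what Arc-based sharing means in practice: cloning a green subtree only bumps a reference count, so the same node can sit in both the old and the new tree. The old_subtree value stands for a node taken from a previous parse.

use std::sync::Arc;
use sipha::syntax::{GreenNode, SyntaxKind};

// Sketch only: cloning the Arc shares the node; the tree data itself is never copied.
fn reuse_subtree<K: SyntaxKind>(old_subtree: &Arc<GreenNode<K>>) -> Arc<GreenNode<K>> {
    Arc::clone(old_subtree)
}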

Compact Representation

Green trees use compact representations:

  • Small nodes: Minimal overhead per node
  • Compact strings: Use CompactString for short tokens
  • Efficient children: Optimized child storage

Parsing Performance

DFA-Based Lexing

The lexer uses a DFA for O(n) tokenization:

  • Compiled DFA: DFA is compiled once
  • Fast matching: O(n) tokenization time
  • Efficient patterns: Character classes are efficient

Table-Based Parsing

Parsers use tables for fast lookups:

  • Parse tables: Pre-computed parse tables
  • Fast lookups: O(1) table lookups
  • Compact tables: Optimized table representation

Incremental Performance

Node Reuse

Incremental parsing reuses nodes:

  • Budget-based: Limit reuse candidates
  • Position indexing: Fast node lookup
  • Smart invalidation: Efficient cache invalidation

Cache Management

The parse cache is optimized:

  • Version-based: Efficient version management
  • Automatic eviction: Old versions are evicted
  • Interned strings: Rule names are interned

Optimization Tips

Grammar Optimization

  • Factor patterns: Extract common subexpressions (see the sketch below)
  • Minimize nesting: Flatten deeply nested structures
  • Use appropriate lookahead: Don’t request more lookahead than the grammar needs
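
As an example of factoring, two alternatives that share a prefix can be rewritten so the prefix is parsed once. This is a sketch only: the if_token, lparen, rparen, and else_token values and the MyNonTerminal variants are placeholders, not part of any grammar defined in this book.

use sipha::grammar::Expr;

// Before: the `if ( Expr ) Block` prefix is duplicated across both alternatives.
let unfactored = Expr::choice(vec![
    Expr::seq(vec![
        Expr::token(if_token.clone()),
        Expr::token(lparen.clone()),
        Expr::non_terminal(MyNonTerminal::Expr),
        Expr::token(rparen.clone()),
        Expr::non_terminal(MyNonTerminal::Block),
    ]),
    Expr::seq(vec![
        Expr::token(if_token.clone()),
        Expr::token(lparen.clone()),
        Expr::non_terminal(MyNonTerminal::Expr),
        Expr::token(rparen.clone()),
        Expr::non_terminal(MyNonTerminal::Block),
        Expr::token(else_token.clone()),
        Expr::non_terminal(MyNonTerminal::Block),
    ]),
]);

// After: the shared prefix appears once; only the optional else-part remains a choice point.
let factored = Expr::seq(vec![
    Expr::token(if_token),
    Expr::token(lparen),
    Expr::non_terminal(MyNonTerminal::Expr),
    Expr::token(rparen),
    Expr::non_terminal(MyNonTerminal::Block),
    Expr::optional(Box::new(Expr::seq(vec![
        Expr::token(else_token),
        Expr::non_terminal(MyNonTerminal::Block),
    ]))),
]);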

Lexer Optimization

  • Order patterns: List the most specific patterns first (see the sketch below)
  • Use character classes: Prefer them over regex patterns
  • Keyword trie: Register keywords so they are matched with an efficient trie
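
A sketch of these tips with the LexerBuilder API. The MyKind type and its variants are hypothetical; the point is the ordering: the more specific "==" literal is registered before "=", and keywords go through keyword() instead of competing with the identifier pattern.

use sipha::lexer::{LexerBuilder, Pattern, CharSet};

let lexer = LexerBuilder::new()
    // More specific literal first, so "==" is never split into two "=" tokens
    .token(MyKind::EqEq, Pattern::Literal("==".into()))
    .token(MyKind::Eq, Pattern::Literal("=".into()))
    // Keywords use the keyword table instead of the identifier pattern
    .keyword("let", MyKind::LetKw)
    .token(MyKind::Ident, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::letters())),
        min: 1,
        max: None,
    })
    .token(MyKind::Whitespace, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
        min: 1,
        max: None,
    })
    .trivia(MyKind::Whitespace)
    .build(MyKind::Eof, MyKind::Ident)
    .expect("failed to build lexer");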

Parser Optimization

  • Choose the right backend: Match the backend to your grammar (see Choosing a Backend)
  • Enable the cache: Always pass the grammar to parse_incremental so the cache is populated
  • Adjust the budget: Tune the reuse budget for your use case

Benchmarking

Measuring Performance

use std::time::Instant;

// Assuming parser, tokens, and entry are already defined
let start = Instant::now();
let result = parser.parse(&tokens, entry);
let duration = start.elapsed();
println!("parsed with {} error(s) in {duration:?}", result.errors.len());
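
To see how much a given edit benefits from incrementality in your own grammar, time a full re-parse against an incremental re-parse of the same token stream. This is a sketch: parser (mutable), grammar, new_tokens, old_tree, edits, and entry are assumed to exist, as in the incremental parsing example.

use std::time::Instant;
use sipha::incremental::IncrementalParser;

// Full re-parse of the edited file
let start = Instant::now();
let full = parser.parse(&new_tokens, entry.clone());
println!("full re-parse: {:?} ({} errors)", start.elapsed(), full.errors.len());

// Incremental re-parse reusing the previous tree and the parse cache
let mut incremental = IncrementalParser::new(parser);
let start = Instant::now();
let incr = incremental.parse_incremental(
    &new_tokens,
    Some(&old_tree),
    &edits,
    entry,
    Some(&grammar),
);
println!("incremental re-parse: {:?} ({} errors)", start.elapsed(), incr.errors.len());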

Profiling

Use profilers to find bottlenecks:

  • perf: Linux profiling
  • valgrind: Memory and call profiling
  • cargo-flamegraph: Flamegraph generation

Best Practices

  1. Profile first: Profile before optimizing
  2. Measure changes: Benchmark before and after
  3. Use incremental: Always use for interactive apps
  4. Optimize grammar: Factor and simplify
  5. Choose the right backend: Match the backend to your grammar’s needs

Next Steps

Basic Arithmetic Example

This example walks through building a simple arithmetic expression parser step by step.

Overview

We’ll build a parser for arithmetic expressions like 42 + 10 * 3 with proper operator precedence.

Step 1: Define Syntax Kinds

use sipha::syntax::SyntaxKind;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
    // Terminals
    Number,
    Plus,
    Minus,
    Multiply,
    Divide,
    LParen,
    RParen,
    Whitespace,
    Eof,
    // Non-terminals
    Expr,
    Term,
    Factor,
}

impl SyntaxKind for ArithSyntaxKind {
    fn is_terminal(self) -> bool {
        !matches!(self, 
            ArithSyntaxKind::Expr | 
            ArithSyntaxKind::Term | 
            ArithSyntaxKind::Factor
        )
    }
    
    fn is_trivia(self) -> bool {
        matches!(self, ArithSyntaxKind::Whitespace)
    }
}

Step 2: Build Lexer

use sipha::syntax::SyntaxKind;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
    Number, Plus, Minus, Multiply, Divide, LParen, RParen, Whitespace, Eof,
    Expr, Term, Factor,
}
impl SyntaxKind for ArithSyntaxKind {
    fn is_terminal(self) -> bool { !matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor) }
    fn is_trivia(self) -> bool { matches!(self, ArithSyntaxKind::Whitespace) }
}
use sipha::lexer::{LexerBuilder, Pattern, CharSet};

let lexer = LexerBuilder::new()
    .token(ArithSyntaxKind::Number, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::digits())),
        min: 1,
        max: None,
    })
    .token(ArithSyntaxKind::Plus, Pattern::Literal("+".into()))
    .token(ArithSyntaxKind::Minus, Pattern::Literal("-".into()))
    .token(ArithSyntaxKind::Multiply, Pattern::Literal("*".into()))
    .token(ArithSyntaxKind::Divide, Pattern::Literal("/".into()))
    .token(ArithSyntaxKind::LParen, Pattern::Literal("(".into()))
    .token(ArithSyntaxKind::RParen, Pattern::Literal(")".into()))
    .token(ArithSyntaxKind::Whitespace, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
        min: 1,
        max: None,
    })
    .trivia(ArithSyntaxKind::Whitespace)
    .build(ArithSyntaxKind::Eof, ArithSyntaxKind::Number)
    .expect("Failed to build lexer");

Step 3: Define Grammar

use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
    Number, Plus, Minus, Multiply, Divide, LParen, RParen, Whitespace, Eof,
    Expr, Term, Factor,
}
impl SyntaxKind for ArithSyntaxKind {
    fn is_terminal(self) -> bool { !matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor) }
    fn is_trivia(self) -> bool { matches!(self, ArithSyntaxKind::Whitespace) }
}
use sipha::grammar::{GrammarBuilder, NonTerminal, Expr};
use sipha::lexer::Token as LexerToken;
use sipha::syntax::{TextRange, TextSize};
use std::convert::TryFrom;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ArithNonTerminal {
    Expr,
    Term,
    Factor,
}

impl NonTerminal for ArithNonTerminal {
    fn name(&self) -> &str {
        match self {
            ArithNonTerminal::Expr => "Expr",
            ArithNonTerminal::Term => "Term",
            ArithNonTerminal::Factor => "Factor",
        }
    }
}

// Helper function to create tokens with proper ranges
fn create_token(kind: ArithSyntaxKind, text: &str, offset: u32) -> LexerToken<ArithSyntaxKind> {
    let len = TextSize::from(u32::try_from(text.len()).unwrap_or(0));
    LexerToken::new(kind, text, TextRange::at(TextSize::from(offset), len))
}

let grammar = GrammarBuilder::new()
    .entry_point(ArithNonTerminal::Expr)
    
    // Expr -> Term | Expr + Term | Expr - Term
    .rule(ArithNonTerminal::Expr, Expr::choice(vec![
        Expr::seq(vec![
            Expr::non_terminal(ArithNonTerminal::Expr),
            Expr::token(create_token(ArithSyntaxKind::Plus, "+", 0)),
            Expr::non_terminal(ArithNonTerminal::Term),
        ]),
        Expr::seq(vec![
            Expr::non_terminal(ArithNonTerminal::Expr),
            Expr::token(create_token(ArithSyntaxKind::Minus, "-", 0)),
            Expr::non_terminal(ArithNonTerminal::Term),
        ]),
        Expr::non_terminal(ArithNonTerminal::Term),
    ]))
    
    // Term -> Factor | Term * Factor | Term / Factor
    .rule(ArithNonTerminal::Term, Expr::choice(vec![
        Expr::seq(vec![
            Expr::non_terminal(ArithNonTerminal::Term),
            Expr::token(create_token(ArithSyntaxKind::Multiply, "*", 0)),
            Expr::non_terminal(ArithNonTerminal::Factor),
        ]),
        Expr::seq(vec![
            Expr::non_terminal(ArithNonTerminal::Term),
            Expr::token(create_token(ArithSyntaxKind::Divide, "/", 0)),
            Expr::non_terminal(ArithNonTerminal::Factor),
        ]),
        Expr::non_terminal(ArithNonTerminal::Factor),
    ]))
    
    // Factor -> Number | ( Expr )
    .rule(ArithNonTerminal::Factor, Expr::choice(vec![
        Expr::token(create_token(ArithSyntaxKind::Number, "42", 0)),
        Expr::seq(vec![
            Expr::token(create_token(ArithSyntaxKind::LParen, "(", 0)),
            Expr::non_terminal(ArithNonTerminal::Expr),
            Expr::token(create_token(ArithSyntaxKind::RParen, ")", 0)),
        ]),
    ]))
    
    .build()
    .expect("Failed to build grammar");

Step 4: Parse

use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
    Number, Plus, Minus, Multiply, Divide, LParen, RParen, Whitespace, Eof,
    Expr, Term, Factor,
}
impl SyntaxKind for ArithSyntaxKind {
    fn is_terminal(self) -> bool { !matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor) }
    fn is_trivia(self) -> bool { matches!(self, ArithSyntaxKind::Whitespace) }
}
use sipha::lexer::{LexerBuilder, Pattern, CharSet};
let lexer = LexerBuilder::new()
    .token(ArithSyntaxKind::Number, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::digits())),
        min: 1, max: None,
    })
    .token(ArithSyntaxKind::Plus, Pattern::Literal("+".into()))
    .token(ArithSyntaxKind::Minus, Pattern::Literal("-".into()))
    .token(ArithSyntaxKind::Multiply, Pattern::Literal("*".into()))
    .token(ArithSyntaxKind::Divide, Pattern::Literal("/".into()))
    .token(ArithSyntaxKind::LParen, Pattern::Literal("(".into()))
    .token(ArithSyntaxKind::RParen, Pattern::Literal(")".into()))
    .token(ArithSyntaxKind::Whitespace, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
        min: 1, max: None,
    })
    .trivia(ArithSyntaxKind::Whitespace)
    .build(ArithSyntaxKind::Eof, ArithSyntaxKind::Number)
    .expect("Failed to build lexer");
use sipha::grammar::{GrammarBuilder, NonTerminal, Expr};
use sipha::lexer::Token as LexerToken;
use sipha::syntax::{TextRange, TextSize};
use std::convert::TryFrom;
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ArithNonTerminal { Expr, Term, Factor, }
impl NonTerminal for ArithNonTerminal {
    fn name(&self) -> &str { match self { ArithNonTerminal::Expr => "Expr", ArithNonTerminal::Term => "Term", ArithNonTerminal::Factor => "Factor", } }
}
fn create_token(kind: ArithSyntaxKind, text: &str, offset: u32) -> LexerToken<ArithSyntaxKind> {
    let len = TextSize::from(u32::try_from(text.len()).unwrap_or(0));
    LexerToken::new(kind, text, TextRange::at(TextSize::from(offset), len))
}
let grammar = GrammarBuilder::new()
    .entry_point(ArithNonTerminal::Expr)
    .rule(ArithNonTerminal::Expr, Expr::choice(vec![
        Expr::seq(vec![Expr::non_terminal(ArithNonTerminal::Expr), Expr::token(create_token(ArithSyntaxKind::Plus, "+", 0)), Expr::non_terminal(ArithNonTerminal::Term)]),
        Expr::seq(vec![Expr::non_terminal(ArithNonTerminal::Expr), Expr::token(create_token(ArithSyntaxKind::Minus, "-", 0)), Expr::non_terminal(ArithNonTerminal::Term)]),
        Expr::non_terminal(ArithNonTerminal::Term),
    ]))
    .rule(ArithNonTerminal::Term, Expr::choice(vec![
        Expr::seq(vec![Expr::non_terminal(ArithNonTerminal::Term), Expr::token(create_token(ArithSyntaxKind::Multiply, "*", 0)), Expr::non_terminal(ArithNonTerminal::Factor)]),
        Expr::seq(vec![Expr::non_terminal(ArithNonTerminal::Term), Expr::token(create_token(ArithSyntaxKind::Divide, "/", 0)), Expr::non_terminal(ArithNonTerminal::Factor)]),
        Expr::non_terminal(ArithNonTerminal::Factor),
    ]))
    .rule(ArithNonTerminal::Factor, Expr::choice(vec![
        Expr::token(create_token(ArithSyntaxKind::Number, "42", 0)),
        Expr::seq(vec![Expr::token(create_token(ArithSyntaxKind::LParen, "(", 0)), Expr::non_terminal(ArithNonTerminal::Expr), Expr::token(create_token(ArithSyntaxKind::RParen, ")", 0))]),
    ]))
    .build().expect("Failed to build grammar");
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
use sipha::syntax::SyntaxNode;

let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)
    .expect("Failed to create parser");

let input = "42 + 10 * 3";
let tokens = lexer.tokenize(input)
    .expect("Failed to tokenize");

let result = parser.parse(&tokens, ArithNonTerminal::Expr);

if !result.errors.is_empty() {
    eprintln!("Errors: {:?}", result.errors);
} else {
    let root = SyntaxNode::new_root(result.root.clone());
    println!("Parse successful!");
    println!("Root: {:?}", root.kind());
}

Complete Example

See examples/basic_arithmetic.rs for the complete working example.

Next Steps

Incremental Parsing Example

This example demonstrates incremental parsing with Sipha.

Overview

We’ll build a simple language server that uses incremental parsing to efficiently update the parse tree as the user edits code.

Setup

use sipha::incremental::{IncrementalParser, TextEdit};
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::syntax::{TextRange, TextSize, GreenNode};
use sipha::grammar::Grammar;
use sipha::lexer::CompiledLexer;

// These types would be defined in your actual code
// type MyToken = /* ... */;
// type MyNonTerminal = /* ... */;
// type MySyntaxKind = /* ... */;

struct LanguageServer {
    parser: IncrementalParser<LlParser, MyToken, MyNonTerminal>,
    grammar: Grammar<MyToken, MyNonTerminal>,
    lexer: CompiledLexer<MySyntaxKind>,
    current_tree: Option<GreenNode<MySyntaxKind>>,
}

Initial Parse

use sipha::incremental::{IncrementalParser, TextEdit};
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::syntax::{TextRange, TextSize, GreenNode};
use sipha::grammar::Grammar;
use sipha::lexer::CompiledLexer;
struct LanguageServer {
    parser: IncrementalParser<LlParser, MyToken, MyNonTerminal>,
    grammar: Grammar<MyToken, MyNonTerminal>,
    lexer: CompiledLexer<MySyntaxKind>,
    current_tree: Option<GreenNode<MySyntaxKind>>,
}
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
fn build_grammar() -> Grammar<MyToken, MyNonTerminal> { todo!() }
fn build_lexer() -> CompiledLexer<MySyntaxKind> { todo!() }
impl LanguageServer {
    fn new() -> Self {
        let grammar = build_grammar();
        let parser = LlParser::new(&grammar, Default::default()).unwrap();
        let lexer = build_lexer();
        
        Self {
            parser: IncrementalParser::new(parser),
            grammar,
            lexer,
            current_tree: None,
        }
    }
    
    fn initial_parse(&mut self, text: &str) {
        let tokens = self.lexer.tokenize(text).unwrap();
        
        let result = self.parser.parse_incremental(
            &tokens,
            None,              // No old tree
            &[],               // No edits
            MyNonTerminal::Expr,
            Some(&self.grammar), // Populate cache
        );
        
        self.current_tree = Some(result.root.clone());
    }
}

Handling Edits

use sipha::incremental::{IncrementalParser, TextEdit};
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::syntax::{TextRange, TextSize, GreenNode};
use sipha::grammar::Grammar;
use sipha::lexer::CompiledLexer;
struct LanguageServer {
    parser: IncrementalParser<LlParser, MyToken, MyNonTerminal>,
    grammar: Grammar<MyToken, MyNonTerminal>,
    lexer: CompiledLexer<MySyntaxKind>,
    current_tree: Option<GreenNode<MySyntaxKind>>,
}
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
impl LanguageServer {
    fn report_errors(&self, _errors: &[()]) {}
}
impl LanguageServer {
    fn on_text_changed(&mut self, edits: Vec<TextEdit>, new_text: &str) {
        // Re-tokenize (or update tokens incrementally)
        let tokens = self.lexer.tokenize(new_text).unwrap();
        
        // Incremental parse
        let result = self.parser.parse_incremental(
            &tokens,
            self.current_tree.as_ref(), // Previous tree
            &edits,                     // Text edits
            MyNonTerminal::Expr,
            Some(&self.grammar),        // Update cache
        );
        
        // Update current tree
        self.current_tree = Some(result.root.clone());
        
        // Report errors
        if !result.errors.is_empty() {
            self.report_errors(&result.errors);
        }
    }
}

Creating Edits

use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};
struct LanguageServer;
impl LanguageServer {
    fn on_text_changed(&mut self, _edits: Vec<TextEdit>, _new_text: &str) {}
}
let mut server = LanguageServer;
// User changed "42" to "100" at position 0
let edit = TextEdit::replace(
    TextRange::new(TextSize::from(0), TextSize::from(2)),
    "100".into(),
);

server.on_text_changed(vec![edit], "100 + 10");
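
Insertions and deletions are expressed with the same TextEdit type; here is a short sketch using the other two constructors (the offsets are arbitrary):

use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};

// User typed " * 2" at offset 8
let insert = TextEdit::insert(TextSize::from(8), " * 2".into());

// User deleted the three characters in the range 4..7
let delete = TextEdit::delete(TextRange::new(TextSize::from(4), TextSize::from(7)));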

Performance

Incremental parsing provides significant performance improvements:

  • Small edits: 10-100x faster
  • Medium edits: 5-20x faster
  • Large edits: 2-5x faster

Complete Example

See examples/incremental_parsing.rs for the complete working example.

Next Steps

GLR Parsing Example

This example demonstrates using the GLR parser to handle ambiguous grammars.

Overview

We’ll parse an ambiguous expression grammar and disambiguate using precedence rules.

Ambiguous Grammar

// Grammar: Expr -> Expr + Expr | Expr * Expr | Number
// Input: "1 + 2 * 3"
// Ambiguous: Could be (1 + 2) * 3 or 1 + (2 * 3)

Setup

use sipha::grammar::Grammar;
type MyToken = ();
type MyNonTerminal = ();
let grammar: Grammar<MyToken, MyNonTerminal> = todo!();
use sipha::backend::glr::{GlrParser, GlrConfig};
use sipha::backend::glr::disambiguate_by_precedence;

let config = GlrConfig::default();
let mut parser = GlrParser::new(&grammar, config)
    .expect("Failed to create GLR parser");

Parsing

use sipha::backend::glr::{GlrParser, GlrConfig};
use sipha::backend::glr::disambiguate_by_precedence;
use sipha::syntax::SyntaxNode;
use sipha::lexer::Token;
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
struct Grammar;
let grammar = Grammar;
let config = GlrConfig::default();
let mut parser = GlrParser::new(&grammar, config).expect("Failed to create GLR parser");
let tokens: Vec<Token<MySyntaxKind>> = vec![];
let result = parser.parse(&tokens, MyNonTerminal::Expr);

// Check for ambiguity
if let Some(forest) = result.forest {
    println!("Ambiguous: {} parse trees", forest.roots.len());
    
    // Disambiguate using precedence
    let disambiguated = forest.disambiguate(disambiguate_by_precedence(|op| {
        match op {
            MySyntaxKind::Multiply | MySyntaxKind::Divide => 2,
            MySyntaxKind::Plus | MySyntaxKind::Minus => 1,
            _ => 0,
        }
    }));
    
    // Use disambiguated tree
    let root = SyntaxNode::new_root(disambiguated);
} else {
    // Unambiguous - use result directly
    let root = SyntaxNode::new_root(result.root);
}

Disambiguation Strategies

Precedence

use sipha::backend::glr::disambiguate_by_precedence;
type MySyntaxKind = ();
struct Forest;
impl Forest {
    fn disambiguate<F>(&self, _f: F) -> () { () }
}
let forest = Forest;
let disambiguated = forest.disambiguate(disambiguate_by_precedence(|op| {
    match op {
        MySyntaxKind::Multiply => 2,
        MySyntaxKind::Plus => 1,
        _ => 0,
    }
}));

Associativity

use sipha::backend::glr::disambiguate_by_associativity;
use sipha::backend::glr::Associativity;
type MySyntaxKind = ();
struct Forest;
impl Forest {
    fn disambiguate<F>(&self, _f: F) -> () { () }
}
let forest = Forest;
let disambiguated = forest.disambiguate(disambiguate_by_associativity(|op| {
    match op {
        MySyntaxKind::Minus => Associativity::Left,
        MySyntaxKind::Power => Associativity::Right,
        _ => Associativity::None,
    }
}));

Custom

struct Forest;
impl Forest {
    fn disambiguate<F>(&self, _f: F) -> () { () }
}
let forest = Forest;
fn compute_complexity(_tree: &()) -> usize { 0 }
let disambiguated = forest.disambiguate(|alternatives| {
    // Choose based on custom criteria
    alternatives
        .iter()
        .min_by_key(|tree| compute_complexity(tree))
        .cloned()
});

Complete Example

See examples/glr_parsing.rs for the complete working example.

Next Steps

PEG Parsing Example

This example demonstrates using the PEG parser with ordered choice and memoization.

Overview

We’ll parse an expression grammar using PEG’s ordered choice semantics and demonstrate the benefits of memoization (packrat parsing).

Ordered Choice Semantics

PEG uses ordered choice where the first matching alternative wins:

// Grammar: Expr -> Expr + Term | Term
// In PEG: First alternative tried first
// If "Expr + Term" matches, "Term" is never tried

This makes precedence handling natural: lower-precedence operators are tried first.

Setup

use sipha::backend::peg::{PegParser, PegConfig};
use sipha::backend::ParserBackend;

let config = PegConfig {
    enable_memoization: true,    // Enable packrat parsing
    max_memo_size: 10000,        // Cache size limit
    error_recovery: true,
    max_errors: 100,
    max_backtrack_depth: 1000,
};

let mut parser = PegParser::new(&grammar, config)
    .expect("Failed to create PEG parser");

Basic Parsing

use sipha::backend::peg::{PegParser, PegConfig};
use sipha::backend::ParserBackend;
type MyToken = ();
type MyNonTerminal = ();
struct Grammar;
let grammar = Grammar;
let config = PegConfig::default();
let mut parser = PegParser::new(&grammar, config).expect("Failed to create PEG parser");
let tokens: Vec<MyToken> = vec![];
let result = parser.parse(&tokens, MyNonTerminal::Expr);

if result.errors.is_empty() {
    println!("Parse successful!");
    println!("Tokens consumed: {}", result.metrics.tokens_consumed);
    println!("Nodes created: {}", result.metrics.nodes_created);
} else {
    for error in &result.errors {
        eprintln!("Error: {error}");
    }
}

Memoization (Packrat Parsing)

Memoization caches parse results for O(n) performance:

use sipha::backend::peg::{PegParser, PegConfig};
type MyToken = ();
type MyNonTerminal = ();
struct Grammar;
let grammar = Grammar;
let config_with_memo = PegConfig {
    enable_memoization: true,
    ..Default::default()
};

let config_without_memo = PegConfig {
    enable_memoization: false,
    ..Default::default()
};

// With memoization: O(n) time, O(n) space
let parser_with = PegParser::new(&grammar, config_with_memo)?;

// Without memoization: May be exponential for some grammars
let parser_without = PegParser::new(&grammar, config_without_memo)?;

Incremental Parsing

PEG works well with incremental parsing thanks to memoization:

use sipha::backend::peg::{PegParser, PegConfig};
use sipha::incremental::{IncrementalParser, TextEdit};
use sipha::syntax::{TextRange, TextSize};
type MyToken = ();
type MyNonTerminal = ();
struct Grammar;
let grammar = Grammar;
let parser = PegParser::new(&grammar, PegConfig::default()).expect("parser");
// Placeholder token streams for this sketch: before and after the edit
let tokens: Vec<MyToken> = vec![];
let new_tokens: Vec<MyToken> = vec![];

let mut incremental = IncrementalParser::new(parser);

// Initial parse
let result1 = incremental.parse_incremental(
    &tokens,
    None,
    &[],
    MyNonTerminal::Expr,
    Some(&grammar),
);

// After edit - benefits from memoization cache
let edits = vec![TextEdit::replace(
    TextRange::new(TextSize::from(0), TextSize::from(1)),
    "new_text".into(),
)];
let result2 = incremental.parse_incremental(
    &new_tokens,
    Some(&result1.root),
    &edits,
    MyNonTerminal::Expr,
    Some(&grammar),
);

Ordered Choice for Precedence

PEG’s ordered choice makes operator precedence natural:

// Lower precedence tried first
Expr -> Expr + Term | Term

// Higher precedence tried later  
Term -> Term * Factor | Factor

// This naturally gives * higher precedence than +
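
A sketch of the same layering written with the GrammarBuilder API. The plus_token, star_token, and number_token values and the MyNonTerminal variants (Expr, Term, Factor) are placeholders, as in the other examples; the ordering of the choice alternatives is what encodes the precedence.

use sipha::grammar::{GrammarBuilder, Expr};

let grammar = GrammarBuilder::new()
    .entry_point(MyNonTerminal::Expr)
    // Expr -> Expr + Term | Term   (lower precedence, tried first)
    .rule(MyNonTerminal::Expr, Expr::choice(vec![
        Expr::seq(vec![
            Expr::non_terminal(MyNonTerminal::Expr),
            Expr::token(plus_token),
            Expr::non_terminal(MyNonTerminal::Term),
        ]),
        Expr::non_terminal(MyNonTerminal::Term),
    ]))
    // Term -> Term * Factor | Factor   (higher precedence)
    .rule(MyNonTerminal::Term, Expr::choice(vec![
        Expr::seq(vec![
            Expr::non_terminal(MyNonTerminal::Term),
            Expr::token(star_token),
            Expr::non_terminal(MyNonTerminal::Factor),
        ]),
        Expr::non_terminal(MyNonTerminal::Factor),
    ]))
    // Factor -> Number
    .rule(MyNonTerminal::Factor, Expr::token(number_token))
    .build()
    .expect("failed to build grammar");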

Performance Comparison

The example demonstrates performance with and without memoization:

  • With memoization: Linear time O(n)
  • Without memoization: May be exponential for some grammars
  • Incremental: Fast for small edits (benefits from cache)

Complete Example

See examples/peg_parsing.rs for the complete working example.

Run it with:

cargo run --example peg_parsing --features backend-peg

Expected Output

The example demonstrates:

  1. Building a lexer and grammar
  2. Parsing with ordered choice
  3. Memoization benefits
  4. Incremental parsing performance
  5. Performance comparison with/without memoization

Next Steps

Real-World Patterns

This chapter covers common patterns and idioms for using Sipha in real-world applications.

Language Server Pattern

Structure

struct LanguageServer {
    parser: IncrementalParser<LlParser, Token, NonTerminal>,
    grammar: Grammar<Token, NonTerminal>,
    lexer: CompiledLexer<SyntaxKind>,
    documents: HashMap<Uri, Document>,
}

struct Document {
    text: String,
    tree: Option<GreenNode<SyntaxKind>>,
    version: i32,
}

Handling Changes

impl LanguageServer {
    fn did_change(&mut self, uri: Uri, edits: Vec<TextEdit>, version: i32) {
        let doc = self.documents.get_mut(&uri).unwrap();
        doc.text = apply_edits(&doc.text, &edits);
        doc.version = version;
        
        let tokens = self.lexer.tokenize(&doc.text).unwrap();
        let result = self.parser.parse_incremental(
            &tokens,
            doc.tree.as_ref(),
            &edits,
            MyNonTerminal::Root,
            Some(&self.grammar),
        );
        
        doc.tree = Some(result.root.clone());
        
        // Publish diagnostics
        self.publish_diagnostics(uri, &result.errors);
    }
}

Error Reporting Pattern

Collecting Errors

fn collect_errors(result: &ParseResult) -> Vec<Diagnostic> {
    result.errors
        .iter()
        .map(|error| Diagnostic {
            range: error.span,
            severity: DiagnosticSeverity::Error,
            message: error.message.clone(),
            // ...
        })
        .collect()
}

Reporting to User

fn report_errors(errors: &[Diagnostic]) {
    for error in errors {
        eprintln!("Error at {}: {}", error.range, error.message);
    }
}

Tree Traversal Pattern

Finding Nodes

fn find_nodes_by_kind(root: &SyntaxNode, kind: SyntaxKind) -> Vec<SyntaxNode> {
    root.descendants()
        .filter(|n| n.kind() == kind)
        .collect()
}

Collecting Information

fn collect_identifiers(root: &SyntaxNode) -> Vec<String> {
    root.descendants()
        .filter(|n| n.kind() == SyntaxKind::Ident)
        .filter_map(|n| n.first_token())
        .map(|t| t.text().to_string())
        .collect()
}

Grammar Extension Pattern

Adding Rules

fn extend_grammar(base: GrammarBuilder) -> GrammarBuilder {
    base
        .rule(MyNonTerminal::NewRule, Expr::token(new_token))
        .rule(MyNonTerminal::AnotherRule, Expr::choice(vec![
            Expr::rule(MyNonTerminal::NewRule),
            Expr::rule(MyNonTerminal::ExistingRule),
        ]))
}

Testing Pattern

Test Helpers

fn parse_test(input: &str) -> ParseResult {
    let lexer = build_test_lexer();
    let grammar = build_test_grammar();
    let mut parser = LlParser::new(&grammar, Default::default()).unwrap();
    
    let tokens = lexer.tokenize(input).unwrap();
    parser.parse(&tokens, MyNonTerminal::Expr)
}

#[test]
fn test_simple_expression() {
    let result = parse_test("1 + 2");
    assert!(result.errors.is_empty());
    // ...
}

Best Practices

  1. Use incremental parsing: Always use for interactive apps
  2. Cache grammar: Reuse grammar across parses
  3. Handle errors gracefully: Don’t panic on errors
  4. Test thoroughly: Test with various inputs
  5. Profile performance: Profile and optimize hot paths

Next Steps

API Overview

This chapter provides an overview of Sipha’s key types and traits.

Core Traits

SyntaxKind

Trait for syntax kind identifiers:

use std::fmt::Debug;
use std::hash::Hash;

pub trait SyntaxKind: Copy + Clone + PartialEq + Eq + Hash + Debug {
    fn is_terminal(self) -> bool;
    fn is_trivia(self) -> bool;
}

Token

Trait for tokens:

use std::hash::Hash;
use sipha::syntax::{SyntaxKind, TextSize};
use compact_str::CompactString;

pub trait Token: Clone + PartialEq + Eq + Hash {
    type Kind: SyntaxKind;
    fn kind(&self) -> Self::Kind;
    fn text_len(&self) -> TextSize;
    fn text(&self) -> CompactString;
}

NonTerminal

Trait for non-terminals:

use std::hash::Hash;

pub trait NonTerminal: Clone + PartialEq + Eq + Hash {
    fn name(&self) -> &str;
}

ParserBackend

Trait for parser backends:

use sipha::grammar::Grammar;

pub trait ParserBackend<T, N>: Sized + Send
where
    T: Token,
    N: NonTerminal,
{
    type Config: Default + Clone;
    type Error: std::error::Error + Send + Sync + 'static;
    type State: Send + Sync;
    
    fn new(grammar: &Grammar<T, N>, config: Self::Config) -> Result<Self, Self::Error>;
    fn parse(&mut self, input: &[T], entry: N) -> ParseResult<T, N>;
    // ... more methods
}

Key Types

Grammar

Grammar definition:

use std::marker::PhantomData;

pub struct Grammar<T, N> {
    _phantom: PhantomData<(T, N)>,
}

ParseResult

Parse result:

use sipha::syntax::GreenNode;

pub struct ParseResult<T, N> {
    pub root: GreenNode<T::Kind>,
    pub errors: Vec<ParseError<T, N>>,
    pub warnings: Vec<ParseWarning<T, N>>,
    pub metrics: ParseMetrics,
    #[cfg(feature = "backend-glr")]
    pub forest: Option<ParseForest<T::Kind>>,
}

GreenNode

Green tree node:

use sipha::syntax::SyntaxKind;

pub struct GreenNode<K: SyntaxKind> {
    _phantom: std::marker::PhantomData<K>,
}

SyntaxNode

Red tree node:

pub struct SyntaxNode {
    // ...
}

Builders

GrammarBuilder

Build grammars:

pub struct GrammarBuilder<T, N> {
    // ...
}

impl<T, N> GrammarBuilder<T, N> {
    pub fn new() -> Self;
    pub fn entry_point(mut self, entry: N) -> Self;
    pub fn rule(mut self, lhs: N, rhs: Expr<T, N>) -> Self;
    pub fn build(self) -> Result<Grammar<T, N>, GrammarError>;
}

LexerBuilder

Build lexers:

use sipha::syntax::SyntaxKind;
use sipha::lexer::{LexerBuilder, Pattern, CompiledLexer};
type LexerError = ();
type K = ();
pub struct LexerBuilder<K: SyntaxKind> {
    _phantom: std::marker::PhantomData<K>,
}

impl<K: SyntaxKind> LexerBuilder<K> {
    pub fn new() -> Self { todo!() }
    pub fn token(self, kind: K, pattern: Pattern) -> Self { todo!() }
    pub fn keyword(self, text: &str, kind: K) -> Self { todo!() }
    pub fn trivia(self, kind: K) -> Self { todo!() }
    pub fn build(self, eof_kind: K, ident_kind: K) -> Result<CompiledLexer<K>, LexerError> { todo!() }
}

GreenNodeBuilder

Build green trees:

use sipha::syntax::{GreenNode, SyntaxKind};
type K = ();
type BuilderError = ();
struct GreenNodeBuilder;
pub struct GreenNodeBuilder {
    _phantom: std::marker::PhantomData<()>,
}

impl GreenNodeBuilder {
    pub fn new() -> Self { todo!() }
    pub fn start_node(&mut self, kind: K) { todo!() }
    pub fn token(&mut self, kind: K, text: &str) -> Result<(), BuilderError> { todo!() }
    pub fn finish_node(&mut self) -> Result<(), BuilderError> { todo!() }
    pub fn finish(self) -> Result<GreenNode<K>, BuilderError> { todo!() }
}

Incremental Parsing

IncrementalParser

Incremental parser wrapper:

use sipha::syntax::{GreenNode, SyntaxKind};
use sipha::grammar::Grammar;
use sipha::incremental::TextEdit;
type P = ();
type T = ();
type N = ();
type ParseResult<T, N> = ();
pub struct IncrementalParser<P, T, N> {
    _phantom: std::marker::PhantomData<(P, T, N)>,
}

impl<P, T, N> IncrementalParser<P, T, N> {
    pub fn new(parser: P) -> Self { todo!() }
    pub fn parse_incremental(
        &mut self,
        input: &[T],
        old_tree: Option<&GreenNode<T::Kind>>,
        edits: &[TextEdit],
        entry: N,
        grammar: Option<&Grammar<T, N>>,
    ) -> ParseResult<T, N> { todo!() }
}

TextEdit

Text edit:

use sipha::syntax::{TextRange, TextSize};

pub struct TextEdit {
    pub range: TextRange,
    pub new_text: String,
}

impl TextEdit {
    pub fn replace(range: TextRange, new_text: impl Into<String>) -> Self { todo!() }
    pub fn insert(pos: TextSize, text: impl Into<String>) -> Self { todo!() }
    pub fn delete(range: TextRange) -> Self { todo!() }
}

Error Types

ParseError

Parsing error:

use sipha::syntax::TextRange;

pub struct ParseError<T, N> {
    pub span: TextRange,
    pub message: String,
    pub expected: Vec<T>,
    pub found: Option<T>,
}

LexerError

Lexer error:

use sipha::syntax::TextRange;

pub struct LexerError {
    pub span: TextRange,
    pub kind: LexerErrorKind,
}

Next Steps

Feature Flags

Sipha uses feature flags to enable optional functionality.

Backend Features

backend-ll

Enable LL(k) parser backend (default).

[dependencies]
sipha = { version = "0.5.0", features = ["backend-ll"] }

backend-lr

Enable LR parser backend.

[dependencies]
sipha = { version = "0.5.0", features = ["backend-lr"] }

backend-glr

Enable GLR parser backend (requires backend-lr).

[dependencies]
sipha = { version = "0.5.0", features = ["backend-lr", "backend-glr"] }

Optional Features

diagnostics

Enable rich error diagnostics with miette.

[dependencies]
sipha = { version = "0.5.0", features = ["diagnostics"] }

unicode

Enable full Unicode support for identifiers.

[dependencies]
sipha = { version = "0.5.0", features = ["unicode"] }

visitor

Enable syntax tree visitor patterns.

[dependencies]
sipha = { version = "0.5.0", features = ["visitor"] }

query

Enable XPath-like query API for syntax trees.

[dependencies]
sipha = { version = "0.5.0", features = ["query"] }

tree-utils

Enable tree diffing and validation utilities.

[dependencies]
sipha = { version = "0.5.0", features = ["tree-utils"] }

Feature Combinations

Minimal

[dependencies]
sipha = "0.5.0"  # Only backend-ll

Full

[dependencies]
sipha = { 
    version = "0.5.0", 
    features = [
        "backend-ll",
        "backend-lr",
        "backend-glr",
        "diagnostics",
        "unicode",
        "visitor",
        "query",
        "tree-utils",
    ]
}

Language Server

[dependencies]
sipha = { 
    version = "0.5.0", 
    features = [
        "backend-ll",
        "diagnostics",
        "visitor",
        "query",
    ]
}

Default Features

By default, only backend-ll is enabled.

Next Steps

FAQ

Frequently asked questions about Sipha.

General

What is Sipha?

Sipha is a flexible, incremental parsing library for Rust with support for multiple parsing algorithms.

Why use Sipha?

Sipha is designed for interactive applications like IDEs and language servers that need fast, incremental parsing.

Is Sipha production-ready?

As of 0.5.0 the core API is stable, though the library is still under active development and features continue to be added.

Parsing

Which backend should I use?

See Choosing a Backend for guidance.

Does Sipha support left recursion?

  • LL parser: With left-recursion elimination
  • LR parser: Yes, naturally
  • GLR parser: Yes, naturally

How do I handle ambiguous grammars?

Use the GLR backend. See GLR Parser for details.

Incremental Parsing

When should I use incremental parsing?

Use incremental parsing for interactive applications like IDEs and language servers.

How much faster is incremental parsing?

  • Small edits: 10-100x faster
  • Medium edits: 5-20x faster
  • Large edits: 2-5x faster

Do I need to provide the grammar for incremental parsing?

Providing the grammar is optional, but doing so enables cache population, which improves performance.

Performance

How do I optimize performance?

See Performance for optimization tips.

Is Sipha fast?

Sipha is competitive with other Rust parsing libraries for batch parsing, and significantly faster for incremental parsing.

Grammar

How do I define a grammar?

Use GrammarBuilder. See Grammars for details.

Can I use regex in patterns?

Yes, use Pattern::Regex. See Lexers for details.

Error Handling

How do I handle errors?

Check result.errors after parsing. See Error Handling for details.

Does Sipha recover from errors?

Yes, Sipha supports configurable error recovery strategies.

Contributing

How do I contribute?

See Contributing for guidelines.

Where do I report bugs?

Open an issue on GitHub.

Next Steps

Glossary

This glossary defines key terms used throughout the Sipha documentation.

A

Affected Region
The minimal range of text that needs to be re-parsed after an edit. Sipha computes this automatically by expanding the edit range to include necessary context.

Ambiguous Grammar
A grammar that can produce multiple valid parse trees for the same input. Sipha’s GLR backend can handle ambiguous grammars by producing a parse forest.

Arena Allocation
A memory allocation strategy where objects are allocated in a contiguous region (arena) for better cache locality and faster deallocation. Green trees use arena allocation.

B

Backend
A parsing algorithm implementation. Sipha supports multiple backends: LL(k), LR, and GLR, each with different characteristics.

C

Cache (Parse Cache)
A persistent storage for parse results that enables cross-session node reuse. The cache stores parse nodes keyed by rule name, position, and version.

Choice Expression
A grammar expression that matches one of several alternatives. Represented as Expr::choice(vec![...]).

D

DFA (Deterministic Finite Automaton)
A state machine used by Sipha’s lexer for efficient tokenization. DFAs provide O(n) tokenization time.

Disambiguation
The process of selecting a single parse tree from multiple valid alternatives in an ambiguous grammar. GLR parsers support various disambiguation strategies.

E

Entry Point
The starting non-terminal for parsing. Defined when building a grammar using GrammarBuilder::entry_point().

Error Recovery
The ability of a parser to continue parsing after encountering an error, rather than stopping immediately. Sipha supports configurable recovery strategies.

Expression (Grammar Expression)
A component of a grammar rule’s right-hand side. Can be a token, non-terminal, sequence, choice, optional, repeat, or empty.

F

First Set
The set of terminals that can appear as the first token in a derivation from a non-terminal. Used by LL parsers for prediction.

Follow Set
The set of terminals that can appear immediately after a non-terminal in a derivation. Used by LL parsers for error recovery.

Forest (Parse Forest)
A data structure representing multiple parse trees for ambiguous grammars. GLR parsers produce parse forests when ambiguity is detected.

G

GLR (Generalized LR)
A parsing algorithm that extends LR parsing to handle ambiguous and non-deterministic grammars by maintaining multiple parser stacks.

Grammar
A formal description of a language’s syntax, consisting of non-terminals, production rules, and an entry point.

Green Tree
The storage layer of Sipha’s syntax tree representation. Green trees are compact, immutable, and can be shared across multiple red trees.

I

Incremental Parsing
The ability to efficiently re-parse only changed portions of code, reusing unchanged subtrees from previous parses. Sipha’s core feature.

Immutable
Data structures that cannot be modified after creation. Sipha’s syntax trees are immutable, enabling safe sharing and efficient updates.

L

Left Recursion
A grammar rule where a non-terminal appears as the first symbol in its own production. Some backends require left-recursion elimination.

Lexer
The component that converts raw text into tokens. Also called a tokenizer or scanner.

LL(k) Parser
A top-down predictive parser that uses k tokens of lookahead to make parsing decisions.

LR Parser
A bottom-up shift-reduce parser that builds parse trees from the leaves up. Supports LALR and canonical LR variants.

N

Non-Terminal
A grammar symbol that can be expanded into other symbols. Represented as enum variants implementing the NonTerminal trait.

Node Reuse
The process of identifying and reusing unchanged subtrees from previous parses during incremental parsing.

P

Parse Cache
See Cache (Parse Cache).

Parser Backend
See Backend.

Production Rule
A grammar rule that defines how a non-terminal can be expanded. Written as NonTerminal -> Expression.

R

Red Tree
The API layer of Sipha’s syntax tree representation. Red trees provide convenient navigation and query APIs, created on-demand from green trees.

Reuse Budget
A limit on how many nodes to consider for reuse during incremental parsing. Can be fixed or heuristic-based.

S

Sequence Expression
A grammar expression that matches a sequence of expressions in order. Represented as Expr::seq(vec![...]).

Syntax Kind
A unified type representing both terminals (tokens) and non-terminals. All syntax kinds implement the SyntaxKind trait.

Syntax Tree
A tree representation of the syntactic structure of parsed code. Sipha uses green/red trees for efficient representation.

T

Terminal
A grammar symbol that cannot be expanded further. In Sipha, terminals are tokens produced by the lexer.

Token
A unit of lexical analysis, representing a sequence of characters with a specific meaning (e.g., identifier, keyword, operator).

Trivia
Tokens that are syntactically insignificant, such as whitespace and comments. Trivia is automatically skipped during parsing.

U

Unambiguous Grammar
A grammar that produces exactly one parse tree for each valid input. Most parsers require unambiguous grammars, except GLR.

V

Version (Cache Version)
A number used to invalidate old cache entries. When the grammar or lexer changes, the version increments, invalidating old cache entries.

Cheat Sheet

Quick reference for common Sipha patterns and APIs.

Syntax Kinds

use sipha::syntax::SyntaxKind;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind {
    // Terminals
    Number, Plus, Minus, Eof,
    // Non-terminals
    Expr, Term,
}

impl SyntaxKind for MySyntaxKind {
    fn is_terminal(self) -> bool {
        !matches!(self, Self::Expr | Self::Term)
    }
    
    fn is_trivia(self) -> bool {
        false // No trivia in this example
    }
}

Lexer Builder

use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind { Number, Plus, Minus, Eof, Whitespace, Expr, Term, }
impl SyntaxKind for MySyntaxKind {
    fn is_terminal(self) -> bool { !matches!(self, Self::Expr | Self::Term) }
    fn is_trivia(self) -> bool { matches!(self, Self::Whitespace) }
}
use sipha::lexer::{LexerBuilder, Pattern, CharSet};

let lexer = LexerBuilder::new()
    .token(MySyntaxKind::Number, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::digits())),
        min: 1,
        max: None,
    })
    .token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
    .trivia(MySyntaxKind::Whitespace)
    .build(MySyntaxKind::Eof, MySyntaxKind::Number)?;

Grammar Builder

use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind { Number, Plus, Minus, Eof, Whitespace, Expr, Term, }
impl SyntaxKind for MySyntaxKind {
    fn is_terminal(self) -> bool { !matches!(self, Self::Expr | Self::Term) }
    fn is_trivia(self) -> bool { matches!(self, Self::Whitespace) }
}
use sipha::grammar::{GrammarBuilder, Expr, NonTerminal};
use sipha::lexer::Token as LexerToken;
use sipha::syntax::{TextRange, TextSize};
use std::convert::TryFrom;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum MyNonTerminal { Expr, }
impl NonTerminal for MyNonTerminal {
    fn name(&self) -> &str { match self { MyNonTerminal::Expr => "Expr", } }
}

// Helper to create tokens
fn token(kind: MySyntaxKind, text: &str, offset: u32) -> LexerToken<MySyntaxKind> {
    let len = TextSize::from(u32::try_from(text.len()).unwrap_or(0));
    LexerToken::new(kind, text, TextRange::at(TextSize::from(offset), len))
}

let grammar = GrammarBuilder::new()
    .entry_point(MyNonTerminal::Expr)
    .rule(MyNonTerminal::Expr, Expr::token(token(MySyntaxKind::Number, "1", 0)))
    .build()?;

Grammar Expressions

| Expression   | Syntax                                 | Description                        |
|--------------|----------------------------------------|------------------------------------|
| Token        | Expr::token(token)                     | Match a specific token             |
| Non-terminal | Expr::non_terminal(NT::Expr)           | Match a non-terminal               |
| Sequence     | Expr::seq(vec![...])                   | Match expressions in order         |
| Choice       | Expr::choice(vec![...])                | Match one of several alternatives  |
| Optional     | Expr::optional(Box::new(...))          | Match zero or one occurrence       |
| Repeat       | Expr::repeat(Box::new(...), min, max)  | Match repeated occurrences         |
| Empty        | Expr::Empty                            | Match nothing (epsilon)            |

Parser Usage

use sipha::grammar::{GrammarBuilder, Grammar, Expr, NonTerminal};
use sipha::lexer::Token;
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
let grammar: Grammar<MyToken, MyNonTerminal> = todo!();
let tokens: Vec<Token<MySyntaxKind>> = vec![];
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
use sipha::syntax::SyntaxNode;

let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)?;
let result = parser.parse(&tokens, MyNonTerminal::Expr);

if !result.errors.is_empty() {
    // Handle errors
}
let root = SyntaxNode::new_root(result.root.clone());

Incremental Parsing

use sipha::backend::ll::LlParser;
use sipha::grammar::Grammar;
use sipha::lexer::Token;
use sipha::syntax::GreenNode;
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
let parser: LlParser = todo!();
let grammar: Grammar<MyToken, MyNonTerminal> = todo!();
let tokens: Vec<Token<MySyntaxKind>> = vec![];
let old_tree: Option<&GreenNode<MySyntaxKind>> = None;
let entry_point: MyNonTerminal = todo!();
use sipha::incremental::{IncrementalParser, TextEdit};
use sipha::syntax::{TextRange, TextSize};

let mut incremental = IncrementalParser::new(parser);
let edits = vec![TextEdit::replace(
    TextRange::new(TextSize::from(0), TextSize::from(2)),
    "new".into(),
)];
let result = incremental.parse_incremental(
    &tokens,
    old_tree,
    &edits,
    entry_point,
    Some(&grammar),
);

Syntax Tree Navigation

use sipha::syntax::{SyntaxNode, GreenNode, SyntaxKind};
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind { Expr, }
impl SyntaxKind for MySyntaxKind {
    fn is_terminal(self) -> bool { false }
    fn is_trivia(self) -> bool { false }
}
let green_node: GreenNode<MySyntaxKind> = todo!();
use sipha::syntax::SyntaxNode;

let root = SyntaxNode::new_root(green_node);

// Children
for child in root.children() {
    println!("{:?}", child.kind());
}

// Descendants
for node in root.descendants() {
    if node.kind() == MySyntaxKind::Expr {
        // Found expression
    }
}

// Parent
if let Some(parent) = root.parent() {
    println!("Parent: {:?}", parent.kind());
}

// Siblings
for sibling in root.next_sibling().into_iter() {
    println!("Sibling: {:?}", sibling.kind());
}

Error Handling

use sipha::backend::ParserBackend;
type MyToken = ();
type MyNonTerminal = ();
struct Parser;
impl ParserBackend<MyToken, MyNonTerminal> for Parser {
    type Config = ();
    type Error = ();
    type State = ();
    fn new(_: &(), _: ()) -> Result<Self, ()> { Ok(Parser) }
    fn parse(&mut self, _: &[MyToken], _: MyNonTerminal) -> sipha::backend::ParseResult<MyToken, MyNonTerminal> { todo!() }
}
let mut parser = Parser;
let tokens: Vec<MyToken> = vec![];
let result = parser.parse(&tokens, MyNonTerminal::Expr);
// Check for errors
if !result.errors.is_empty() {
    for error in &result.errors {
        eprintln!("Error: {} at {:?}", error.message, error.span);
    }
}

// Check for warnings
for warning in &result.warnings {
    eprintln!("Warning: {}", warning.message);
}

Pattern Matching

| Pattern    | Syntax                                | Description             |
|------------|---------------------------------------|-------------------------|
| Literal    | Pattern::Literal("+".into())          | Match an exact string   |
| Char Class | Pattern::CharClass(CharSet::digits()) | Match a character class |
| Repeat     | Pattern::Repeat { pattern, min, max } | Match a repetition      |
| Regex      | Pattern::Regex(regex)                 | Match a regex pattern   |

Common Character Sets

use sipha::lexer::CharSet;

let _ = CharSet::digits();        // [0-9]
let _ = CharSet::whitespace();    // \s
let _ = CharSet::letters();       // [a-zA-Z]
let _ = CharSet::alphanumeric();  // [a-zA-Z0-9]
let _ = CharSet::hex_digits();    // [0-9a-fA-F]

Backend Selection

| Use Case            | Backend | Reason                           |
|---------------------|---------|----------------------------------|
| Simple grammar      | LL      | Easy to use, good error messages |
| Left recursion      | LR      | Natural support                  |
| Ambiguous grammar   | GLR     | Handles ambiguity                |
| Maximum performance | LR      | Efficient table-based parsing    |

Feature Flags

[dependencies]
sipha = { version = "0.5.0", features = [
    "backend-ll",      # LL(k) parser (default)
    "backend-lr",      # LR parser
    "backend-glr",     # GLR parser (requires backend-lr)
    "diagnostics",     # Rich error diagnostics
    "unicode",         # Unicode support
    "visitor",         # Visitor patterns
    "query",           # XPath-like queries
    "tree-utils",      # Tree utilities
] }

Common Patterns

Optional Semicolon

use sipha::grammar::{GrammarBuilder, Expr, NonTerminal};
use sipha::lexer::Token;
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
let semicolon_token: Token<MySyntaxKind> = todo!();
let mut builder = GrammarBuilder::new();
builder.rule(MyNonTerminal::Stmt, Expr::seq(vec![
    Expr::non_terminal(MyNonTerminal::Expr),
    Expr::optional(Box::new(Expr::token(semicolon_token))),
]))

Delimited List

use sipha::grammar::{GrammarBuilder, Expr, NonTerminal};
use sipha::lexer::Token;
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
let open_token: Token<MySyntaxKind> = todo!();
let close_token: Token<MySyntaxKind> = todo!();
let mut builder = GrammarBuilder::new();
builder.rule(MyNonTerminal::List, Expr::seq(vec![
    Expr::token(open_token),
    Expr::repeat(
        Box::new(Expr::non_terminal(MyNonTerminal::Item)),
        0,
        None,
    ),
    Expr::token(close_token),
]))

Operator Precedence

use sipha::grammar::{GrammarBuilder, Expr, NonTerminal};
use sipha::lexer::Token;
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
let multiply_token: Token<MySyntaxKind> = todo!();
let plus_token: Token<MySyntaxKind> = todo!();
let mut builder = GrammarBuilder::new();
// Higher precedence first
builder.rule(MyNonTerminal::Expr, Expr::choice(vec![
    Expr::seq(vec![  // * has higher precedence
        Expr::non_terminal(MyNonTerminal::Term),
        Expr::token(multiply_token),
        Expr::non_terminal(MyNonTerminal::Term),
    ]),
    Expr::seq(vec![  // + has lower precedence
        Expr::non_terminal(MyNonTerminal::Expr),
        Expr::token(plus_token),
        Expr::non_terminal(MyNonTerminal::Term),
    ]),
    Expr::non_terminal(MyNonTerminal::Term),
]))

See Also