Introduction
What is Sipha?
Sipha is a flexible, incremental parsing library for Rust with support for multiple parsing algorithms. It is designed from the ground up for incremental parsing: efficiently re-parsing only the changed portions of your code, which makes it ideal for interactive applications such as IDEs, editors, and language servers.
Unlike traditional parsers that re-parse entire files on every change, Sipha can:
- Reuse unchanged subtrees from previous parses
- Reparse only affected regions instead of the entire file
- Maintain parse caches for efficient updates
- Dramatically improve performance in interactive editing scenarios
Architecture Overview
graph TB
Input[Source Text] --> Lexer[Lexer]
Lexer --> Tokens[Tokens]
Tokens --> Parser[Parser Backend]
Parser --> Tree[Syntax Tree]
Tree --> Incremental[Incremental Parser]
Incremental --> Cache[Parse Cache]
Incremental --> Output[Updated Tree]
subgraph Backends
LL[LL Parser]
LR[LR Parser]
GLR[GLR Parser]
end
Parser -.-> Backends
style Input fill:#e1f5ff,color:#000000
style Tree fill:#c8e6c9,color:#000000
style Output fill:#c8e6c9,color:#000000
style Cache fill:#fff9c4,color:#000000
Project Status
Note
Sipha 0.5.0 provides a stable foundation for incremental parsing. The core API is stable, and we continue to add features and improvements based on user feedback. We welcome contributions!
Key Features
Note
For a detailed feature comparison with other parsing libraries, see the README in the repository.
Sipha provides a comprehensive parsing solution with these key capabilities:
Incremental Parsing (Primary Focus)
Sipha’s standout feature is its incremental parsing capability. When you edit code, Sipha can:
- Reuse unchanged subtrees from previous parses
- Reparse only affected regions instead of the entire file
- Maintain parse caches for efficient updates
- Dramatically improve performance in interactive editing scenarios
This makes Sipha perfect for building language servers, IDEs, and other tools that need to parse code as users type.
Learn more in the Incremental Parsing section.
Additional Features
- Multiple parsing backends: Choose from LL(k), LR, and GLR (via feature flags) - see Parsing Backends
- Immutable syntax trees: Green/red tree representation for efficient manipulation - see Syntax Trees
- Error recovery: Configurable error recovery strategies for robust parsing - see Error Handling
- Flexible grammar definition: Builder API for defining your grammar - see Grammars
- Unicode support: Full Unicode support for identifiers and text (optional)
- Rich diagnostics: Beautiful error messages with miette integration (optional)
- Tree traversal: Visitor patterns and query APIs for working with syntax trees
Use Cases
Sipha is particularly well-suited for:
- Language Servers: Building LSP implementations that need fast, incremental parsing
- IDEs and Editors: Providing real-time syntax analysis and error checking
- Code Analysis Tools: Tools that need to parse and analyze code efficiently
- Interactive Development Tools: Any tool that needs to parse code as users edit it
- Complex Language Parsing: Using GLR backend for languages with inherent ambiguities (like C++)
Comparison with Alternatives
| Feature | Sipha | pest | nom | lalrpop |
|---|---|---|---|---|
| Incremental parsing | ✅ | ❌ | ❌ | ❌ |
| Multiple backends | ✅ | ❌ | ❌ | ❌ |
| Syntax trees | ✅ | ✅ | ❌ | ✅ |
| Error recovery | ✅ | ✅ | ✅ | ✅ |
| Grammar DSL | ❌ | ✅ | ❌ | ✅ |
| Zero-copy | Partial | ✅ | ✅ | ❌ |
When to Use Sipha
Use Sipha when:
- You are building language servers or IDEs
- You need incremental parsing for interactive editing
- You want flexibility in choosing parsing algorithms
- You need rich syntax tree manipulation
- You are parsing ambiguous grammars (use the GLR backend)
When to Consider Alternatives
Consider alternatives when:
- You need a simple one-off parser (pest or nom might be simpler)
- You need maximum performance for batch parsing (nom might be faster)
- You prefer a declarative grammar DSL (pest or lalrpop)
What’s Next?
Ready to get started? Head to Getting Started to install Sipha and build your first parser!
For a deeper dive into specific topics:
- Learn about Core Concepts to understand Sipha’s architecture
- Explore Incremental Parsing to understand Sipha’s key feature
- Check out Examples to see Sipha in action
- Read the Architecture section to understand design decisions
Getting Started
This guide will help you get up and running with Sipha. We’ll cover installation, a quick start example, and an overview of the core concepts.
Installation
Add Sipha to your Cargo.toml:
[dependencies]
sipha = "0.5.0"
Or with specific features:
[dependencies]
sipha = { version = "0.5.0", features = ["diagnostics", "unicode", "backend-ll"] }
Available Features
- backend-ll: Enable LL(k) parser backend (default)
- backend-lr: Enable LR parser backend
- backend-glr: Enable GLR parser backend (requires backend-lr)
- diagnostics: Enable rich error diagnostics with miette
- unicode: Enable full Unicode support for identifiers
- visitor: Enable syntax tree visitor patterns
- query: Enable XPath-like query API for syntax trees
- tree-utils: Enable tree diffing and validation utilities
Quick Start
Let’s build a simple arithmetic expression parser step by step. This example will help you understand the core concepts.
Step 1: Define Your Syntax Kinds
First, define the tokens and non-terminals your parser will use. Sipha uses a unified SyntaxKind trait for both terminals (tokens) and non-terminals (grammar rules):
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
// Terminals (produced by lexer)
Number,
Plus,
Minus,
Multiply,
Divide,
LParen,
RParen,
Whitespace,
Eof,
// Non-terminals (produced by parser)
Expr,
Term,
Factor,
}
impl SyntaxKind for ArithSyntaxKind {
fn is_terminal(self) -> bool {
!matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor)
}
fn is_trivia(self) -> bool {
matches!(self, ArithSyntaxKind::Whitespace)
}
}
Step 2: Build a Lexer
Create a lexer to tokenize your input text:
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
Number, Plus, Minus, Multiply, Divide, LParen, RParen, Whitespace, Eof,
Expr, Term, Factor,
}
impl SyntaxKind for ArithSyntaxKind {
fn is_terminal(self) -> bool { !matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor) }
fn is_trivia(self) -> bool { matches!(self, ArithSyntaxKind::Whitespace) }
}
use sipha::lexer::{LexerBuilder, Pattern, CharSet};
let lexer = LexerBuilder::new()
.token(ArithSyntaxKind::Number, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::digits())),
min: 1,
max: None,
})
.token(ArithSyntaxKind::Plus, Pattern::Literal("+".into()))
.token(ArithSyntaxKind::Minus, Pattern::Literal("-".into()))
.token(ArithSyntaxKind::Multiply, Pattern::Literal("*".into()))
.token(ArithSyntaxKind::Divide, Pattern::Literal("/".into()))
.token(ArithSyntaxKind::LParen, Pattern::Literal("(".into()))
.token(ArithSyntaxKind::RParen, Pattern::Literal(")".into()))
.token(ArithSyntaxKind::Whitespace, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
min: 1,
max: None,
})
.trivia(ArithSyntaxKind::Whitespace)
.build(ArithSyntaxKind::Eof, ArithSyntaxKind::Number)
.expect("Failed to build lexer");
Step 3: Tokenize Input
use sipha::lexer::{LexerBuilder, Pattern, CharSet};
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
Number, Plus, Minus, Multiply, Divide, LParen, RParen, Whitespace, Eof,
Expr, Term, Factor,
}
impl SyntaxKind for ArithSyntaxKind {
fn is_terminal(self) -> bool { !matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor) }
fn is_trivia(self) -> bool { matches!(self, ArithSyntaxKind::Whitespace) }
}
let lexer = LexerBuilder::new()
.token(ArithSyntaxKind::Number, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::digits())),
min: 1, max: None,
})
.token(ArithSyntaxKind::Plus, Pattern::Literal("+".into()))
.token(ArithSyntaxKind::Minus, Pattern::Literal("-".into()))
.token(ArithSyntaxKind::Multiply, Pattern::Literal("*".into()))
.token(ArithSyntaxKind::Divide, Pattern::Literal("/".into()))
.token(ArithSyntaxKind::LParen, Pattern::Literal("(".into()))
.token(ArithSyntaxKind::RParen, Pattern::Literal(")".into()))
.token(ArithSyntaxKind::Whitespace, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
min: 1, max: None,
})
.trivia(ArithSyntaxKind::Whitespace)
.build(ArithSyntaxKind::Eof, ArithSyntaxKind::Number)
.expect("Failed to build lexer");
let input = "42 + 10";
let tokens = lexer.tokenize(input)
.expect("Failed to tokenize input");
Step 4: Define Non-Terminals and Build Grammar
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
Number, Plus, Minus, Multiply, Divide, LParen, RParen, Whitespace, Eof,
Expr, Term, Factor,
}
impl SyntaxKind for ArithSyntaxKind {
fn is_terminal(self) -> bool { !matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor) }
fn is_trivia(self) -> bool { matches!(self, ArithSyntaxKind::Whitespace) }
}
use sipha::grammar::{GrammarBuilder, NonTerminal, Expr};
use sipha::lexer::Token as LexerToken;
use sipha::syntax::{TextRange, TextSize};
use std::convert::TryFrom;
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ArithNonTerminal {
Expr,
Term,
Factor,
}
impl NonTerminal for ArithNonTerminal {
fn name(&self) -> &str {
match self {
ArithNonTerminal::Expr => "Expr",
ArithNonTerminal::Term => "Term",
ArithNonTerminal::Factor => "Factor",
}
}
}
// Helper function to create tokens with proper ranges
fn create_token(kind: ArithSyntaxKind, text: &str, offset: u32) -> LexerToken<ArithSyntaxKind> {
let len = TextSize::from(u32::try_from(text.len()).unwrap_or(0));
LexerToken::new(kind, text, TextRange::at(TextSize::from(offset), len))
}
// Build your grammar rules
let grammar = GrammarBuilder::new()
.entry_point(ArithNonTerminal::Expr)
// Simple grammar: Expr -> Number
.rule(ArithNonTerminal::Expr, Expr::token(create_token(
ArithSyntaxKind::Number,
"42",
0
)))
.build()
.expect("Failed to build grammar");
Step 5: Parse!
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
Number, Plus, Minus, Multiply, Divide, LParen, RParen, Whitespace, Eof,
Expr, Term, Factor,
}
impl SyntaxKind for ArithSyntaxKind {
fn is_terminal(self) -> bool { !matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor) }
fn is_trivia(self) -> bool { matches!(self, ArithSyntaxKind::Whitespace) }
}
use sipha::grammar::{GrammarBuilder, NonTerminal, Expr};
use sipha::lexer::Token as LexerToken;
use sipha::syntax::{TextRange, TextSize};
use std::convert::TryFrom;
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ArithNonTerminal { Expr, Term, Factor, }
impl NonTerminal for ArithNonTerminal {
fn name(&self) -> &str { match self { ArithNonTerminal::Expr => "Expr", ArithNonTerminal::Term => "Term", ArithNonTerminal::Factor => "Factor", } }
}
fn create_token(kind: ArithSyntaxKind, text: &str, offset: u32) -> LexerToken<ArithSyntaxKind> {
let len = TextSize::from(u32::try_from(text.len()).unwrap_or(0));
LexerToken::new(kind, text, TextRange::at(TextSize::from(offset), len))
}
let grammar = GrammarBuilder::new()
.entry_point(ArithNonTerminal::Expr)
.rule(ArithNonTerminal::Expr, Expr::token(create_token(ArithSyntaxKind::Number, "42", 0)))
.build().expect("Failed to build grammar");
use sipha::lexer::{LexerBuilder, Pattern, CharSet};
let lexer = LexerBuilder::new()
.token(ArithSyntaxKind::Number, Pattern::Repeat { pattern: Box::new(Pattern::CharClass(CharSet::digits())), min: 1, max: None })
.token(ArithSyntaxKind::Plus, Pattern::Literal("+".into()))
.token(ArithSyntaxKind::Minus, Pattern::Literal("-".into()))
.token(ArithSyntaxKind::Multiply, Pattern::Literal("*".into()))
.token(ArithSyntaxKind::Divide, Pattern::Literal("/".into()))
.token(ArithSyntaxKind::LParen, Pattern::Literal("(".into()))
.token(ArithSyntaxKind::RParen, Pattern::Literal(")".into()))
.token(ArithSyntaxKind::Whitespace, Pattern::Repeat { pattern: Box::new(Pattern::CharClass(CharSet::whitespace())), min: 1, max: None })
.trivia(ArithSyntaxKind::Whitespace)
.build(ArithSyntaxKind::Eof, ArithSyntaxKind::Number).expect("Failed to build lexer");
let input = "42 + 10";
let tokens = lexer.tokenize(input).expect("Failed to tokenize input");
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)
.expect("Failed to create parser");
let result = parser.parse(&tokens, ArithNonTerminal::Expr);
For a complete working example, see the Basic Arithmetic Example or check out examples/basic_arithmetic.rs in the repository.
Core Concepts Overview
Before diving deeper, here’s a quick overview of Sipha’s core concepts:
Syntax Kinds
Sipha uses a unified SyntaxKind trait for both terminals (tokens) and non-terminals (grammar rules). This design simplifies the API and allows for flexible grammar definitions.
See Syntax Kinds for more details.
Lexers
Lexers convert raw text into tokens. Sipha provides a flexible lexer builder API that supports:
- Pattern matching (literals, character classes, repetitions)
- Trivia handling (whitespace, comments)
- DFA-based tokenization for performance
See Lexers for more details.
Grammars
Grammars define the structure of your language. Sipha uses a builder API to define grammar rules declaratively.
See Grammars for more details.
Parsing Backends
Sipha supports multiple parsing algorithms:
- LL(k): Top-down predictive parsing
- LR: Bottom-up shift-reduce parsing
- GLR: Generalized LR for ambiguous grammars
See Parsing Backends for more details.
Syntax Trees
Sipha uses an immutable green/red tree representation:
- Green trees: Compact, shared representation
- Red trees: Convenient API for traversal
See Syntax Trees for more details.
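As a quick taste (a minimal sketch using the API covered later in Parsing Basics; result is the ParseResult returned by a backend), the red tree is built on demand from the green root:
use sipha::syntax::SyntaxNode;
// result.root is the green tree produced by a parse
let root = SyntaxNode::new_root(result.root.clone());
for child in root.children() {
    println!("{:?}", child.kind());
}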
Incremental Parsing
Sipha’s key feature is incremental parsing—the ability to efficiently re-parse only changed regions. This is essential for interactive applications.
See Incremental Parsing for more details.
Next Steps
Now that you have Sipha installed and understand the basics, you can:
- Read about Core Concepts in detail
- Explore Incremental Parsing to understand Sipha’s key feature
- Check out Examples to see real-world usage
- Learn about Parsing Backends to choose the right algorithm
Syntax Kinds
Sipha uses a unified SyntaxKind trait for both terminals (tokens) and non-terminals (grammar rules). This design simplifies the API and allows for flexible grammar definitions.
Overview
In traditional parser generators, terminals and non-terminals are separate types. Sipha unifies them into a single SyntaxKind enum, which simplifies the API and makes it easier to work with syntax trees.
The SyntaxKind Trait
The SyntaxKind trait is the foundation of Sipha’s type system:
use std::hash::Hash;
use std::fmt::Debug;
pub trait SyntaxKind: Copy + Clone + PartialEq + Eq + Hash + Debug {
fn is_terminal(self) -> bool;
fn is_trivia(self) -> bool;
}
Required Methods
- is_terminal(self) -> bool: Returns true if this kind represents a terminal (token), false if it's a non-terminal (grammar rule).
- is_trivia(self) -> bool: Returns true if this kind represents trivia (whitespace, comments) that should be ignored during parsing.
Defining Your Syntax Kinds
You define syntax kinds as an enum that implements the SyntaxKind trait:
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind {
// Terminals (produced by lexer)
Number,
Plus,
Minus,
LParen,
RParen,
Whitespace,
Eof,
// Non-terminals (produced by parser)
Expr,
Term,
Factor,
}
impl SyntaxKind for MySyntaxKind {
fn is_terminal(self) -> bool {
!matches!(self,
MySyntaxKind::Expr |
MySyntaxKind::Term |
MySyntaxKind::Factor
)
}
fn is_trivia(self) -> bool {
matches!(self, MySyntaxKind::Whitespace)
}
}
Terminals vs Non-Terminals
Terminals
Terminals are produced by the lexer and represent actual tokens in the input text:
- Keywords (e.g., if, while, return)
- Operators (e.g., +, -, *, /)
- Identifiers (e.g., variable names)
- Literals (e.g., numbers, strings)
- Punctuation (e.g., (, ), {, })
- Trivia (e.g., whitespace, comments)
Non-Terminals
Non-terminals are produced by the parser and represent grammar rules:
- Expression nodes (e.g., Expr, Term, Factor)
- Statement nodes (e.g., Stmt, IfStmt, WhileStmt)
- Declaration nodes (e.g., Decl, FnDecl, VarDecl)
Trivia
Trivia are tokens that don’t affect the parse tree structure but are preserved for formatting and error messages:
- Whitespace
- Comments
- Line breaks
Mark trivia kinds with is_trivia() returning true. The parser will automatically skip trivia during parsing, but they’ll still be available in the syntax tree.
Best Practices
- Use descriptive names: Choose names that clearly indicate what the kind represents.
- Group related kinds: Keep terminals and non-terminals organized logically.
- Mark trivia correctly: Ensure is_trivia() returns true for all trivia kinds.
- Use exhaustive matching: When implementing is_terminal(), use exhaustive patterns to catch all cases.
Example: Complete Syntax Kind Definition
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum CalculatorSyntaxKind {
// Terminals
Number,
Plus,
Minus,
Multiply,
Divide,
LParen,
RParen,
Whitespace,
Eof,
// Non-terminals
Expr,
Term,
Factor,
}
impl SyntaxKind for CalculatorSyntaxKind {
fn is_terminal(self) -> bool {
!matches!(
self,
CalculatorSyntaxKind::Expr
| CalculatorSyntaxKind::Term
| CalculatorSyntaxKind::Factor
)
}
fn is_trivia(self) -> bool {
matches!(self, CalculatorSyntaxKind::Whitespace)
}
}
Next Steps
Now that you understand syntax kinds, you can:
- Learn about Lexers to see how terminals are produced
- Explore Grammars to see how non-terminals are used
- Check out Syntax Trees to see how kinds are used in trees
Lexers
Lexers convert raw text into tokens. Sipha provides a flexible lexer builder API that supports pattern matching, trivia handling, and DFA-based tokenization for performance.
Overview
The lexer is the first stage of parsing. It takes source text and converts it into a stream of tokens that the parser can consume. Sipha’s lexer uses a DFA (Deterministic Finite Automaton) for efficient tokenization.
Building a Lexer
Use LexerBuilder to construct a lexer:
use sipha::lexer::{LexerBuilder, Pattern, CharSet};
let lexer = LexerBuilder::new()
.token(MySyntaxKind::Number, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::digits())),
min: 1,
max: None,
})
.token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
.token(MySyntaxKind::Whitespace, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
min: 1,
max: None,
})
.trivia(MySyntaxKind::Whitespace)
.build(MySyntaxKind::Eof, MySyntaxKind::Number)
.expect("Failed to build lexer");
Patterns
Sipha supports several pattern types for matching tokens:
Literal Patterns
Match exact strings:
.token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
.token(MySyntaxKind::LParen, Pattern::Literal("(".into()))
Character Class Patterns
Match character ranges:
use sipha::lexer::CharSet;
// Match digits [0-9]
.token(MySyntaxKind::Number, Pattern::CharClass(CharSet::digits()))
// Match letters [a-zA-Z]
.token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::new(vec!['a'..='z', 'A'..='Z'])))
// Match whitespace
.token(MySyntaxKind::Whitespace, Pattern::CharClass(CharSet::whitespace()))
Repeat Patterns
Match repeating patterns:
// Match one or more digits
.token(MySyntaxKind::Number, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::digits())),
min: 1,
max: None, // No maximum
})
// Match zero or more whitespace
.token(MySyntaxKind::Whitespace, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
min: 0,
max: None,
})
// Match exactly 3 digits
.token(MySyntaxKind::ThreeDigits, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::digits())),
min: 3,
max: Some(3),
})
Regex Patterns
For complex patterns, use regex:
// Match floating point numbers
.token(MySyntaxKind::Float, Pattern::Regex(r"\d+\.\d+".into()))
// Match email addresses
.token(MySyntaxKind::Email, Pattern::Regex(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}".into()))
Any Pattern
Match any single character:
.token(MySyntaxKind::AnyChar, Pattern::Any)
Keywords
Keywords are reserved words that should be matched as literals rather than identifiers:
let lexer = LexerBuilder::new()
.token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::new(vec!['a'..='z', 'A'..='Z'])))
.keyword("if", MySyntaxKind::If)
.keyword("else", MySyntaxKind::Else)
.keyword("while", MySyntaxKind::While)
.build(MySyntaxKind::Eof, MySyntaxKind::Ident)
.expect("Failed to build lexer");
Keywords are matched with higher priority than general patterns, so "if" will match as a keyword rather than an identifier.
Trivia
Trivia are tokens that don’t affect parsing but are preserved in the syntax tree (e.g., whitespace, comments):
let lexer = LexerBuilder::new()
.token(MySyntaxKind::Whitespace, Pattern::CharClass(CharSet::whitespace()))
.token(MySyntaxKind::Comment, Pattern::Regex(r"//.*".into()))
.trivia(MySyntaxKind::Whitespace)
.trivia(MySyntaxKind::Comment)
.build(MySyntaxKind::Eof, MySyntaxKind::Number)
.expect("Failed to build lexer");
The parser will automatically skip trivia during parsing, but they’ll still be available in the syntax tree for formatting and IDE features.
Custom Matchers
For complex tokenization logic, use custom matchers:
let lexer = LexerBuilder::new()
.custom_token(MySyntaxKind::String, |text, pos| {
// Custom string matching logic
if text[pos..].starts_with('"') {
// Find closing quote
let mut end = pos + 1;
while end < text.len() && text.as_bytes()[end] != b'"' {
if text.as_bytes()[end] == b'\\' {
end += 2; // Skip escape sequence
} else {
end += 1;
}
}
if end < text.len() {
Some((end - pos + 1, TokenValue::String(text[pos+1..end].to_string())))
} else {
None // Unterminated string
}
} else {
None
}
})
.build(MySyntaxKind::Eof, MySyntaxKind::Number)
.expect("Failed to build lexer");
Custom matchers return Option<(usize, TokenValue)> where:
- usize is the number of characters matched
- TokenValue is the token's value (if any)
Tokenization
Once the lexer is built, tokenize input:
let input = "42 + 10";
let tokens = lexer.tokenize(input)
.expect("Failed to tokenize input");
for token in &tokens {
println!("{:?}: {}", token.kind, token.text);
}
DFA Construction
Sipha’s lexer uses a DFA (Deterministic Finite Automaton) for efficient tokenization. Token patterns are first compiled to an NFA (Nondeterministic Finite Automaton), and the combined NFA is then converted to a DFA, giving O(n) tokenization performance.
The DFA construction process:
- Pattern to NFA: Each pattern is converted to an NFA
- NFA to DFA: The NFA is converted to a DFA using subset construction
- Optimization: The DFA is optimized for size and performance
Priority and Ambiguity
When multiple patterns could match the same input, Sipha uses priority to resolve ambiguity:
- Keywords have the highest priority (lowest number)
- Patterns are prioritized by order of definition (earlier = higher priority)
- Longest match wins when priorities are equal
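For example (a sketch reusing the builder calls from the Keywords section; the Ident, If, and Eof kinds are assumed variants of your enum), the input "if" lexes as the keyword, while "iffy" lexes as an identifier because the longer match wins:
use sipha::lexer::{LexerBuilder, Pattern, CharSet};
let lexer = LexerBuilder::new()
    // General identifier pattern: one or more letters
    .token(MySyntaxKind::Ident, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::new(vec!['a'..='z', 'A'..='Z']))),
        min: 1,
        max: None,
    })
    // An exact keyword match takes priority over the identifier pattern
    .keyword("if", MySyntaxKind::If)
    .build(MySyntaxKind::Eof, MySyntaxKind::Ident)
    .expect("Failed to build lexer");
// "if"   -> MySyntaxKind::If    (keyword priority)
// "iffy" -> MySyntaxKind::Ident (longest match)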
Error Handling
The lexer returns LexerError for invalid input:
match lexer.tokenize(input) {
Ok(tokens) => {
// Process tokens
}
Err(error) => {
match error.kind {
LexerErrorKind::UnexpectedChar { .. } => {
// Handle unexpected character
}
LexerErrorKind::InvalidNumber { reason } => {
// Handle invalid number format
}
// ... other error kinds
}
}
}
Best Practices
- Order patterns carefully: More specific patterns should come before general ones
- Use keywords for reserved words: This ensures proper matching priority
- Mark trivia correctly: Ensure all trivia kinds are marked with .trivia()
- Use character classes efficiently: Prefer character classes over regex when possible
- Test edge cases: Test with empty input, very long input, and invalid input
Example: Complete Lexer
use sipha::lexer::{LexerBuilder, Pattern, CharSet};
let lexer = LexerBuilder::new()
// Numbers
.token(MySyntaxKind::Number, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::digits())),
min: 1,
max: None,
})
// Operators
.token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
.token(MySyntaxKind::Minus, Pattern::Literal("-".into()))
.token(MySyntaxKind::Multiply, Pattern::Literal("*".into()))
.token(MySyntaxKind::Divide, Pattern::Literal("/".into()))
// Parentheses
.token(MySyntaxKind::LParen, Pattern::Literal("(".into()))
.token(MySyntaxKind::RParen, Pattern::Literal(")".into()))
// Whitespace (trivia)
.token(MySyntaxKind::Whitespace, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
min: 1,
max: None,
})
.trivia(MySyntaxKind::Whitespace)
.build(MySyntaxKind::Eof, MySyntaxKind::Number)
.expect("Failed to build lexer");
Next Steps
- Learn about Grammars to see how tokens are used in parsing
- Explore Syntax Trees to see how tokens appear in trees
- Check out Custom Patterns for advanced lexer usage
Grammars
Grammars define the structure of your language. Sipha uses a builder API to define grammar rules declaratively.
Overview
A grammar in Sipha consists of:
- Non-terminals: Grammar rules (e.g., Expr, Stmt, Decl)
- Production rules: How non-terminals are expanded (e.g., Expr -> Term + Expr)
- Entry point: The starting non-terminal for parsing
GrammarBuilder API
Use GrammarBuilder to construct grammars:
use sipha::grammar::{GrammarBuilder, Expr, NonTerminal, Token};
let grammar = GrammarBuilder::new()
.entry_point(MyNonTerminal::Expr)
.rule(MyNonTerminal::Expr, Expr::token(some_token))
.build()
.expect("Failed to build grammar");
Grammar Expressions
Grammar rules use Expr to define right-hand sides:
Token Expressions
Match a specific token:
use sipha::grammar::Expr;
use sipha::lexer::Token;
use sipha::syntax::{TextRange, TextSize};
let token = Token::new(
MySyntaxKind::Number,
"42",
TextRange::at(TextSize::zero(), TextSize::from(2))
);
.rule(MyNonTerminal::Factor, Expr::token(token))
Sequence Expressions
Match a sequence of expressions:
.rule(MyNonTerminal::Expr, Expr::seq(vec![
Expr::non_terminal(MyNonTerminal::Term),
Expr::token(plus_token),
Expr::non_terminal(MyNonTerminal::Expr),
]))
Choice Expressions
Match one of several alternatives:
.rule(MyNonTerminal::Expr, Expr::choice(vec![
Expr::non_terminal(MyNonTerminal::Term),
Expr::non_terminal(MyNonTerminal::Expr),
]))
Optional Expressions
Match an optional expression:
.rule(MyNonTerminal::Stmt, Expr::seq(vec![
Expr::non_terminal(MyNonTerminal::If),
Expr::optional(Box::new(Expr::non_terminal(MyNonTerminal::Else))),
]))
Repeat Expressions
Match zero or more repetitions:
// Zero or more
.rule(MyNonTerminal::Args, Expr::repeat(
Box::new(Expr::non_terminal(MyNonTerminal::Arg)),
0,
None,
))
// One or more
.rule(MyNonTerminal::Stmts, Expr::repeat(
Box::new(Expr::non_terminal(MyNonTerminal::Stmt)),
1,
None,
))
Empty Expression
Match nothing (epsilon):
.rule(MyNonTerminal::Optional, Expr::Empty)
Complete Example: Arithmetic Grammar
use sipha::grammar::{GrammarBuilder, Expr, NonTerminal};
use sipha::lexer::Token as LexerToken;
use sipha::syntax::{TextRange, TextSize};
// Helper function to create tokens with proper ranges
fn create_token(kind: ArithSyntaxKind, text: &str, offset: u32) -> LexerToken<ArithSyntaxKind> {
let len = TextSize::from(u32::try_from(text.len()).unwrap_or(0));
LexerToken::new(kind, text, TextRange::at(TextSize::from(offset), len))
}
let grammar = GrammarBuilder::new()
.entry_point(ArithNonTerminal::Expr)
// Expr -> Term | Expr + Term | Expr - Term
.rule(ArithNonTerminal::Expr, Expr::choice(vec![
Expr::seq(vec![
Expr::non_terminal(ArithNonTerminal::Expr),
Expr::token(create_token(ArithSyntaxKind::Plus, "+", 0)),
Expr::non_terminal(ArithNonTerminal::Term),
]),
Expr::seq(vec![
Expr::non_terminal(ArithNonTerminal::Expr),
Expr::token(create_token(ArithSyntaxKind::Minus, "-", 0)),
Expr::non_terminal(ArithNonTerminal::Term),
]),
Expr::non_terminal(ArithNonTerminal::Term),
]))
// Term -> Factor | Term * Factor | Term / Factor
.rule(ArithNonTerminal::Term, Expr::choice(vec![
Expr::seq(vec![
Expr::non_terminal(ArithNonTerminal::Term),
Expr::token(create_token(ArithSyntaxKind::Multiply, "*", 0)),
Expr::non_terminal(ArithNonTerminal::Factor),
]),
Expr::seq(vec![
Expr::non_terminal(ArithNonTerminal::Term),
Expr::token(create_token(ArithSyntaxKind::Divide, "/", 0)),
Expr::non_terminal(ArithNonTerminal::Factor),
]),
Expr::non_terminal(ArithNonTerminal::Factor),
]))
// Factor -> Number | ( Expr )
.rule(ArithNonTerminal::Factor, Expr::choice(vec![
Expr::token(create_token(ArithSyntaxKind::Number, "42", 0)),
Expr::seq(vec![
Expr::token(create_token(ArithSyntaxKind::LParen, "(", 0)),
Expr::non_terminal(ArithNonTerminal::Expr),
Expr::token(create_token(ArithSyntaxKind::RParen, ")", 0)),
]),
]))
.build()
.expect("Failed to build grammar");
Grammar Validation
Sipha validates grammars before use:
let grammar = GrammarBuilder::new()
.entry_point(MyNonTerminal::Expr)
.rule(MyNonTerminal::Expr, Expr::token(some_token))
.build()
.expect("Failed to build grammar");
// Validate grammar for a specific backend
let errors = LlParser::validate(&grammar);
if !errors.is_empty() {
for error in errors {
eprintln!("Grammar error: {:?}", error);
}
}
Common validation errors:
- Left recursion: Some backends don’t support left recursion
- Ambiguity: Some backends require unambiguous grammars
- Missing rules: Referenced non-terminals must be defined
- Unreachable rules: Rules that can’t be reached from the entry point
Backend Hints
Provide hints to backends about how to handle rules:
use sipha::grammar::hint::PrecedenceHint;
.rule(MyNonTerminal::Expr, Expr::choice(vec![
// Higher precedence
Expr::seq(vec![
Expr::non_terminal(MyNonTerminal::Term),
Expr::token(multiply_token),
Expr::non_terminal(MyNonTerminal::Term),
]).with_hint(PrecedenceHint::new(2)),
// Lower precedence
Expr::seq(vec![
Expr::non_terminal(MyNonTerminal::Expr),
Expr::token(plus_token),
Expr::non_terminal(MyNonTerminal::Term),
]).with_hint(PrecedenceHint::new(1)),
]))
Grammar Analysis
Sipha provides tools for analyzing grammars:
- Left recursion detection: Find rules with left recursion
- Reachability analysis: Find unreachable rules
- First/Follow sets: Compute FIRST and FOLLOW sets for LL parsing
- Conflict detection: Find shift/reduce and reduce/reduce conflicts
See Grammar Analysis for more details.
Best Practices
- Start simple: Begin with a simple grammar and add complexity gradually
- Use meaningful names: Choose clear names for non-terminals
- Factor common patterns: Extract common subexpressions into separate rules
- Handle precedence explicitly: Use precedence hints or grammar structure
- Test thoroughly: Test with various inputs, including edge cases
- Validate early: Validate grammars before using them in parsers
Next Steps
- Learn about Parsing Basics to see how grammars are used
- Explore Parsing Backends to understand backend-specific requirements
- Check out Grammar Analysis for advanced grammar tools
See Also
- Cheat Sheet - Quick reference for grammar expressions
- Examples - Complete grammar examples
- Glossary - Definitions of grammar-related terms
Parsing Basics
This chapter explains how parsing works in Sipha, from creating a parser to understanding parse results.
Overview
Parsing is the process of converting a stream of tokens into a syntax tree according to a grammar. Sipha provides multiple parsing backends, each with different characteristics and capabilities.
Creating a Parser
First, create a parser from a grammar:
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)
.expect("Failed to create parser");
Different backends have different configuration options:
- LL parser: LlConfig with lookahead settings
- LR parser: LrConfig with LALR vs canonical LR options
- GLR parser: GlrConfig with ambiguity handling options
Parsing Input
Parse a stream of tokens:
let result = parser.parse(&tokens, MyNonTerminal::Expr);
The parse method takes:
- Tokens: A slice of tokens from the lexer
- Entry point: The non-terminal to start parsing from
Parse Results
The parse method returns a ParseResult:
pub struct ParseResult<T, N> {
pub root: GreenNode<T::Kind>,
pub errors: Vec<ParseError<T, N>>,
pub warnings: Vec<ParseWarning<T, N>>,
pub metrics: ParseMetrics,
#[cfg(feature = "backend-glr")]
pub forest: Option<ParseForest<T::Kind>>,
}
Root Node
The root field contains the syntax tree root. Even if there are errors, the parser attempts to produce a tree:
let root = SyntaxNode::new_root(result.root.clone());
println!("Root kind: {:?}", root.kind());
Errors
The errors field contains parsing errors:
if !result.errors.is_empty() {
for error in &result.errors {
eprintln!("Parse error at {:?}: {}", error.span, error.message);
}
}
Common error types:
- Unexpected token: Expected one token but found another
- Unexpected EOF: Reached end of input unexpectedly
- Missing token: Expected a token that wasn’t found
- Invalid grammar: Grammar-related errors
Warnings
The warnings field contains non-fatal issues:
for warning in &result.warnings {
eprintln!("Warning: {}", warning.message);
}
Metrics
The metrics field contains parsing statistics:
println!("Tokens consumed: {}", result.metrics.tokens_consumed);
println!("Nodes created: {}", result.metrics.nodes_created);
println!("Parse time: {:?}", result.metrics.parse_time);
Parse Forest (GLR)
When using the GLR backend, the forest field may contain multiple parse trees for ambiguous input:
#[cfg(feature = "backend-glr")]
if let Some(forest) = result.forest {
// Handle ambiguity
let disambiguated = forest.disambiguate(|alternatives| {
// Choose one alternative
alternatives.first().cloned()
});
}
Error Recovery
Sipha parsers attempt to recover from errors and continue parsing:
- Synchronization tokens: Skip to known recovery points
- Delimited recovery: Skip to matching delimiters
- Best-effort parsing: Continue parsing despite errors
See Error Handling for more details.
Working with Syntax Trees
After parsing, work with the syntax tree:
use sipha::syntax::SyntaxNode;
let root = SyntaxNode::new_root(result.root.clone());
// Traverse children
for child in root.children() {
println!("Child: {:?}", child.kind());
}
// Find specific nodes
if let Some(expr) = root.descendants().find(|n| n.kind() == MySyntaxKind::Expr) {
println!("Found expression");
}
See Syntax Trees for more details.
Incremental Parsing
For interactive applications, use incremental parsing:
use sipha::incremental::{IncrementalParser, TextEdit};
let mut incremental_parser = IncrementalParser::new(parser);
// Initial parse
let result1 = incremental_parser.parse_incremental(
&tokens,
None,
&[],
MyNonTerminal::Expr,
Some(&grammar),
);
// After an edit
let edits = vec![TextEdit::replace(range, "new text".into())];
let result2 = incremental_parser.parse_incremental(
&new_tokens,
Some(&result1.root),
&edits,
MyNonTerminal::Expr,
Some(&grammar),
);
See Incremental Parsing for more details.
Backend Selection
Choose a backend based on your needs:
- LL(k): Good for most grammars, supports left-recursion elimination
- LR: Efficient for many grammar types, good error recovery
- GLR: Handles ambiguous grammars, ideal for complex languages
See Choosing a Backend for guidance.
Complete Example
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
use sipha::syntax::SyntaxNode;
// Create parser
let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)
.expect("Failed to create parser");
// Parse
let result = parser.parse(&tokens, MyNonTerminal::Expr);
// Check for errors
if !result.errors.is_empty() {
eprintln!("Parsing failed with {} errors", result.errors.len());
for error in &result.errors {
eprintln!(" {:?}", error);
}
}
// Work with syntax tree
let root = SyntaxNode::new_root(result.root.clone());
println!("Parse successful! Root: {:?}", root.kind());
println!("Created {} nodes in {:?}",
result.metrics.nodes_created,
result.metrics.parse_time);
Next Steps
- Learn about Incremental Parsing for interactive applications
- Explore Parsing Backends to understand different algorithms
- Check out Syntax Trees for tree manipulation
See Also
- Error Handling - Comprehensive error handling guide
- Cheat Sheet - Quick reference for common patterns
- Examples - Working code examples
Incremental Parsing Overview
Incremental parsing is Sipha’s core strength. It allows you to efficiently update your parse tree when code changes, rather than re-parsing everything from scratch.
Why Incremental Parsing?
In interactive applications like IDEs, users edit code frequently. Traditional parsers re-parse the entire file on every change, which can be slow for large files. Incremental parsing:
- Reuses unchanged subtrees from previous parses
- Only re-parses affected regions instead of the entire file
- Maintains parse caches for fast updates
- Scales to large codebases efficiently
This makes Sipha perfect for building language servers, IDEs, and other tools that need to parse code as users type.
Benefits
Performance
Incremental parsing provides significant performance improvements:
- Fast updates: Only re-parse changed regions
- Memory efficient: Shared tree representation
- Scalable: Handles large files efficiently
- IDE-ready: Designed for interactive editing scenarios
User Experience
For interactive applications, incremental parsing means:
- Responsive editing: No lag when typing
- Real-time feedback: Errors and diagnostics update instantly
- Smooth experience: Large files don’t slow down the editor
How It Works (High Level)
- Initial parse: Parse the entire file and cache results
- Edit detection: Identify which regions changed
- Node reuse: Find unchanged subtrees from previous parse
- Incremental re-parse: Only re-parse affected regions
- Tree reconstruction: Combine reused nodes with new parse results
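In code, one edit cycle looks roughly like this (a condensed sketch of the API covered in the Usage chapter; incremental_parser, lexer, tokens, grammar, and MyNonTerminal are the placeholders from the Quick Start):
use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};
// Step 1: initial parse (no old tree, no edits); results are cached
let first = incremental_parser.parse_incremental(
    &tokens, None, &[], MyNonTerminal::Expr, Some(&grammar));
// Steps 2-5: after an edit, pass the old tree and the edit so only the
// affected region is re-parsed and unchanged subtrees are reused
let edits = vec![TextEdit::replace(
    TextRange::new(TextSize::from(0), TextSize::from(2)),
    "100".into(),
)];
let new_tokens = lexer.tokenize("100 + 10").expect("Failed to tokenize");
let second = incremental_parser.parse_incremental(
    &new_tokens, Some(&first.root), &edits, MyNonTerminal::Expr, Some(&grammar));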
Current Status
Incremental parsing is fully implemented with complete node reuse and cache management:
- ✅ Node reuse: Unchanged subtrees are automatically identified and reused
- ✅ Cache population: Parse results are cached for future reuse
- ✅ Affected range computation: Only affected regions are re-parsed
- ✅ Smart invalidation: Cache entries are invalidated based on edit locations
The parser automatically integrates reusable nodes from previous parses, providing significant performance improvements for interactive editing scenarios.
When to Use Incremental Parsing
Use incremental parsing when:
- Building language servers or IDEs
- Creating interactive code editors
- Building tools that need real-time syntax analysis
- Parsing large files that change frequently
- Providing live error checking and diagnostics
For batch parsing (one-time parsing of files), regular parsing is sufficient and may be faster due to less overhead.
Next Steps
- Learn How It Works for detailed implementation
- See Usage for API examples
- Read Implementation Details for advanced topics
How Incremental Parsing Works
This chapter explains the internal mechanisms of incremental parsing in Sipha.
Overview
Incremental parsing works by:
- Tracking edits: Identifying which parts of the text changed
- Computing affected regions: Determining minimal regions to re-parse
- Finding reusable nodes: Identifying unchanged subtrees from previous parse
- Re-parsing selectively: Only parsing affected regions
- Reconstructing the tree: Combining reused nodes with new parse results
Incremental Parsing Flow
flowchart TD
Start[Text Edit] --> Track[Track Edit]
Track --> Compute[Compute Affected Region]
Compute --> Find[Find Reusable Nodes]
Find --> Check{Reusable<br/>Nodes Found?}
Check -->|Yes| Reuse[Reuse Unchanged Subtrees]
Check -->|No| ParseAll[Parse Affected Region]
Reuse --> ParseGaps[Parse Gaps Only]
ParseGaps --> Combine[Combine Results]
ParseAll --> Combine
Combine --> Update[Update Parse Cache]
Update --> Reconstruct[Reconstruct Tree]
Reconstruct --> End[New Syntax Tree]
style Start fill:#e1f5ff,color:#000000
style Reuse fill:#c8e6c9,color:#000000
style ParseGaps fill:#fff9c4,color:#000000
style Combine fill:#ffccbc,color:#000000
style End fill:#c8e6c9,color:#000000
Edit Tracking
When text is edited, Sipha tracks the changes using TextEdit:
use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};
let edit = TextEdit::replace(
TextRange::new(TextSize::from(10), TextSize::from(20)),
"new text".into(),
);
Edits can be:
- Replacements: Replace a range with new text
- Insertions: Insert text at a position
- Deletions: Delete a range
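Each kind has a corresponding constructor (the same calls appear in the Usage chapter):
use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};
// Insert "x" at offset 5
let insert = TextEdit::insert(TextSize::from(5), "x".into());
// Delete the first five characters
let delete = TextEdit::delete(TextRange::new(TextSize::from(0), TextSize::from(5)));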
Affected Region Computation
Sipha computes the minimal region that needs to be re-parsed:
let affected = AffectedRegion::from_edits(&edits, Some(total_len));
The affected region:
- Union: The smallest range covering all edits
- Expanded: The union plus padding for context (default 16 characters)
The expansion ensures that context-dependent parsing (like operator precedence) works correctly.
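Concretely (a sketch; the import path and the use of TextSize for the total length are assumptions based on the snippets above), a single edit at offsets 100..105 in a 500-character file produces a union of 100..105 and an expanded range padded by roughly 16 characters on each side:
use sipha::incremental::{AffectedRegion, TextEdit};
use sipha::syntax::{TextRange, TextSize};
let edits = vec![TextEdit::replace(
    TextRange::new(TextSize::from(100), TextSize::from(105)),
    "hello".into(),
)];
let affected = AffectedRegion::from_edits(&edits, Some(TextSize::from(500)));
// affected.expanded() is the range that find_reusable_nodes compares node ranges against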
Node Reuse
Sipha identifies nodes from the previous parse that can be reused:
let reusable = find_reusable_nodes(old_tree, affected.expanded(), budget);
A node is reusable if:
- It doesn’t intersect with the affected region
- It matches the expected syntax kind (if specified)
- It’s within the reuse budget
Reuse Budget
The reuse budget limits how many nodes to consider for reuse:
use sipha::incremental::ReuseBudget;
// Fixed budget
let budget = ReuseBudget::Fixed(1000);
// Heuristic budget (default)
let budget = ReuseBudget::Heuristic {
max_depth: 20,
max_nodes: 1000,
};
The heuristic budget considers:
- Tree depth: Deeper trees allow more candidates
- File size: Larger files allow more candidates
- Affected region size: Smaller affected regions allow more candidates
Incremental Session
An IncrementalSession captures everything needed for incremental parsing:
use sipha::incremental::IncrementalSession;
let session = IncrementalSession::new(old_tree, &edits);
The session provides:
- Old tree: Previous parse tree
- Edits: Text changes
- Affected region: Computed affected range
- Reusable nodes: List of nodes that can be reused
- Cache access: Optional persistent parse cache
Parse Cache
Sipha maintains a persistent parse cache for cross-session node reuse:
let session = IncrementalSession::with_cache(old_tree, &edits, &cache);
The cache:
- Keys: Rule name, start position, version
- Values: Cached parse nodes
- Versioning: Invalidates old entries based on version
- Eviction: Removes old versions automatically
Cache population happens automatically when using parse_incremental with a grammar:
let result = parser.parse_incremental(
&tokens,
old_tree,
&edits,
entry_point,
Some(&grammar), // Cache is populated automatically
);
Re-parsing Process
When re-parsing incrementally:
- Find reusable nodes: Query the session for reusable nodes at each position
- Parse gaps: Re-parse only the regions not covered by reusable nodes
- Combine results: Merge reused nodes with newly parsed nodes
- Update cache: Store new parse results in the cache
Tree Reconstruction
The final tree is reconstructed by:
- Starting from root: Begin with the old tree root
- Replacing affected subtrees: Replace subtrees in affected regions
- Inserting new nodes: Add newly parsed nodes
- Preserving structure: Maintain tree structure outside affected regions
Performance Characteristics
Incremental parsing performance depends on:
- Edit size: Smaller edits = better performance
- Tree structure: Deeper trees = more reuse opportunities
- Cache hit rate: Higher hit rate = better performance
- Affected region size: Smaller regions = faster re-parsing
Typical performance improvements:
- Small edits (1-10 characters): 10-100x faster
- Medium edits (10-100 characters): 5-20x faster
- Large edits (100+ characters): 2-5x faster
Limitations
Incremental parsing has some limitations:
- Context sensitivity: Some parsing decisions depend on distant context
- Error recovery: Error recovery may require re-parsing larger regions
- Ambiguity: Ambiguous grammars may require full re-parse
- Cache overhead: Cache management has some overhead
Next Steps
- See Usage for API examples
- Read Implementation Details for advanced topics
Incremental Parsing Usage
This chapter shows how to use incremental parsing in your applications.
Basic Usage
Initial Parse
Start with an initial parse:
use sipha::incremental::IncrementalParser;
use sipha::backend::ll::{LlParser, LlConfig};
let config = LlConfig::default();
let parser = LlParser::new(&grammar, config)
.expect("Failed to create parser");
let mut incremental_parser = IncrementalParser::new(parser);
// Initial parse
let result1 = incremental_parser.parse_incremental(
&tokens,
None, // No old tree
&[], // No edits
MyNonTerminal::Expr,
Some(&grammar), // Populate cache
);
After an Edit
When text is edited, re-parse incrementally:
use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};
// User changed "42" to "100" at position 0
let edits = vec![TextEdit::replace(
TextRange::new(TextSize::from(0), TextSize::from(2)),
"100".into(),
)];
// Re-tokenize (or update tokens incrementally)
let new_tokens = lexer.tokenize(&new_text)
.expect("Failed to tokenize");
// Incremental re-parse
let result2 = incremental_parser.parse_incremental(
&new_tokens,
Some(&result1.root), // Previous tree
&edits, // Text edits
MyNonTerminal::Expr,
Some(&grammar), // Update cache
);
Text Edits
Creating Edits
use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};
// Replace a range
let edit1 = TextEdit::replace(
TextRange::new(TextSize::from(10), TextSize::from(20)),
"new text".into(),
);
// Insert at a position
let edit2 = TextEdit::insert(
TextSize::from(5),
"inserted".into(),
);
// Delete a range
let edit3 = TextEdit::delete(
TextRange::new(TextSize::from(0), TextSize::from(5)),
);
Multiple Edits
Handle multiple edits at once:
let edits = vec![
TextEdit::replace(range1, "text1".into()),
TextEdit::replace(range2, "text2".into()),
];
let result = incremental_parser.parse_incremental(
&tokens,
Some(&old_tree),
&edits,
entry_point,
Some(&grammar),
);
Reuse Budget
Control how many nodes to consider for reuse:
use sipha::incremental::{IncrementalSession, ReuseBudget};
// Fixed budget
let budget = ReuseBudget::Fixed(500);
let session = IncrementalSession::new_with_budget(
Some(&old_tree),
&edits,
budget,
);
// Heuristic budget (default)
let budget = ReuseBudget::Heuristic {
max_depth: 20,
max_nodes: 1000,
};
Cache Management
Automatic Cache Population
The cache is populated automatically when you provide a grammar:
let result = incremental_parser.parse_incremental(
&tokens,
old_tree,
&edits,
entry_point,
Some(&grammar), // Cache populated automatically
);
Manual Cache Access
Access the cache directly:
let session = IncrementalSession::with_cache(
Some(&old_tree),
&edits,
&incremental_parser.cache, // Access cache
);
Backend Integration
Incremental parsing works with all backends that support it:
// LL parser
let parser = LlParser::new(&grammar, config)?;
let mut incremental = IncrementalParser::new(parser);
// GLR parser
let parser = GlrParser::new(&grammar, config)?;
let mut incremental = IncrementalParser::new(parser);
Error Handling
Handle errors during incremental parsing:
let result = incremental_parser.parse_incremental(
&tokens,
Some(&old_tree),
&edits,
entry_point,
Some(&grammar),
);
if !result.errors.is_empty() {
// Handle errors
for error in &result.errors {
eprintln!("Error: {:?}", error);
}
}
Best Practices
- Always provide grammar: This enables cache population
- Batch edits: Process multiple edits together when possible
- Reuse old tree: Always pass the previous tree for best performance
- Monitor metrics: Check ParseMetrics to understand performance
- Handle errors gracefully: Incremental parsing may produce errors
Example: Language Server
Here’s a complete example for a language server:
use sipha::incremental::{IncrementalParser, TextEdit};
use sipha::syntax::{TextRange, TextSize};
struct LanguageServer {
parser: IncrementalParser<LlParser, MyToken, MyNonTerminal>,
grammar: Grammar<MyToken, MyNonTerminal>,
lexer: CompiledLexer<MySyntaxKind>,
current_tree: Option<GreenNode<MySyntaxKind>>,
}
impl LanguageServer {
fn new() -> Self {
let grammar = build_grammar();
let parser = LlParser::new(&grammar, Default::default()).unwrap();
let lexer = build_lexer();
Self {
parser: IncrementalParser::new(parser),
grammar,
lexer,
current_tree: None,
}
}
fn on_text_changed(&mut self, edits: Vec<TextEdit>, new_text: &str) {
// Tokenize new text
let tokens = self.lexer.tokenize(new_text).unwrap();
// Incremental parse
let result = self.parser.parse_incremental(
&tokens,
self.current_tree.as_ref(),
&edits,
MyNonTerminal::Expr,
Some(&self.grammar),
);
// Update current tree
self.current_tree = Some(result.root.clone());
// Report errors
if !result.errors.is_empty() {
self.report_errors(&result.errors);
}
}
}
Next Steps
- Read Implementation Details for advanced topics
- See Examples for complete examples
Incremental Parsing Implementation Details
This chapter covers advanced implementation details for incremental parsing.
Architecture
Incremental parsing consists of several components:
IncrementalParser
The IncrementalParser wraps a backend parser and manages the parse cache:
pub struct IncrementalParser<P, T, N> {
parser: P,
cache: ParseCache<T::Kind>,
_phantom: PhantomData<(T, N)>,
}
IncrementalSession
The IncrementalSession provides context for incremental parsing:
pub struct IncrementalSession<'a, K: SyntaxKind> {
old_tree: Option<&'a GreenNode<K>>,
edits: &'a [TextEdit],
affected: AffectedRegion,
reusable: Vec<ReuseCandidate<K>>,
reusable_by_pos: HashMap<TextSize, Vec<usize>>,
cache: Option<&'a ParseCache<K>>,
}
ParseCache
The ParseCache stores parse results for reuse:
pub struct ParseCache<K: SyntaxKind> {
version: usize,
nodes: HashMap<CacheKey, Arc<GreenNode<K>>>,
interner: Rodeo,
}
Node Reuse Algorithm
The node reuse algorithm works as follows:
- Compute affected region: Determine minimal region to re-parse
- Expand region: Add padding for context
- Find reusable nodes: Traverse old tree, collecting nodes outside affected region
- Budget filtering: Limit candidates based on budget
- Index by position: Build position index for fast lookup
Finding Reusable Nodes
fn find_reusable_nodes<K>(
root: &GreenNode<K>,
affected: TextRange,
budget: usize,
) -> Vec<ReuseCandidate<K>> {
let mut reusable = Vec::new();
visit_green_spans(root, |span| {
if reusable.len() >= budget {
return;
}
if span.depth == 0 || span.range.intersect(affected).is_some() {
return;
}
if let Some(node_arc) = span.node_arc.clone() {
reusable.push(ReuseCandidate {
node: node_arc,
range: span.range,
depth: span.depth,
});
}
});
reusable
}
Cache Key Structure
Cache keys consist of:
- Rule name: Interned string for the non-terminal
- Start position: Text position where parsing started
- Version: Cache version for invalidation
struct CacheKey {
rule: Spur, // Interned rule name
start: TextSize, // Start position
version: usize, // Cache version
}
Version Management
The cache uses versioning for invalidation:
- Increment version: On each parse, increment cache version
- Evict old entries: Remove entries from old versions
- Keep recent versions: Maintain last N versions (default: 2)
fn evict_old_entries(&mut self, max_versions: usize) {
let min_version = self.version.saturating_sub(max_versions);
self.nodes.retain(|key, _| key.version >= min_version);
}
Backend Integration
Backends implement incremental parsing via parse_with_session:
fn parse_with_session(
&mut self,
input: &[T],
entry: N,
session: &IncrementalSession<'_, T::Kind>,
) -> ParseResult<T, N> {
// Use session to find reusable nodes
// Re-parse only affected regions
// Combine reused and new nodes
}
Performance Optimizations
Position Indexing
Reusable nodes are indexed by position for O(1) lookup:
let mut reusable_by_pos = HashMap::new();
for (idx, candidate) in reusable.iter().enumerate() {
reusable_by_pos
.entry(candidate.range.start())
.or_insert_with(Vec::new)
.push(idx);
}
Budget Calculation
The heuristic budget considers multiple factors:
fn calculate(&self, tree_depth: usize, file_size: TextSize, affected_size: TextSize) -> usize {
    // Simplified excerpt: max_depth and max_nodes are the Heuristic variant's fields
    let file_size = u32::from(file_size) as usize;
    let affected_size = u32::from(affected_size) as usize;
    // Depth factor: deeper trees = more candidates
    let depth_factor = (tree_depth.min(max_depth) * 10).max(50);
    // File size factor: larger files = more candidates
    let size_factor = (file_size / 100).min(max_nodes / 2);
    // Affected region factor: smaller regions = more candidates
    let affected_ratio = affected_size as f64 / file_size as f64;
    let affected_factor = ((1.0 - affected_ratio) * 200.0) as usize;
    (depth_factor + size_factor + affected_factor)
        .min(max_nodes)
        .max(50)
}
Limitations and Trade-offs
Context Sensitivity
Some parsing decisions depend on distant context, requiring larger re-parse regions:
- Operator precedence: May need parent context
- Type inference: May need sibling context
- Scoping: May need ancestor context
Error Recovery
Error recovery may require re-parsing larger regions:
- Synchronization: May skip to distant recovery points
- Best-effort: May need full re-parse for complex errors
Cache Overhead
Cache management has overhead:
- Memory: Cache uses memory for stored nodes
- Lookup: Cache lookups have some overhead
- Eviction: Periodic eviction has CPU cost
Future Improvements
Potential improvements:
- Incremental lexing: Extend incremental capabilities to lexer
- Smarter invalidation: More precise cache invalidation
- Parallel parsing: Parse multiple regions in parallel
- Predictive reuse: Predict which nodes will be reusable
Next Steps
Parsing Backends Overview
Sipha supports multiple parsing algorithms via feature flags. Each backend has different characteristics and is suited for different use cases.
Available Backends
LL(k) Parser (backend-ll)
Top-down predictive parsing with configurable lookahead:
- Supports left-recursion elimination
- Configurable error recovery
- Incremental parsing support
- Good for most grammars
LR Parser (backend-lr)
Bottom-up shift-reduce parsing:
- Efficient for many grammar types
- Good error recovery
- Supports LALR and canonical LR
- Table-based parsing
GLR Parser (backend-glr)
Generalized LR parsing for ambiguous grammars:
- Handles non-deterministic and ambiguous grammars
- Parse forest representation for ambiguity tracking
- Configurable disambiguation strategies
- Incremental parsing support
- Ideal for complex languages like C++
PEG Parser (backend-peg)
Parsing Expression Grammar with ordered choice and memoization:
- Ordered choice semantics (first match wins)
- Packrat parsing with memoization for O(n) performance
- Backtracking with configurable depth limits
- Incremental parsing support
- Ideal for precedence-based languages and interactive tools
ParserBackend Trait
All backends implement the ParserBackend trait:
pub trait ParserBackend<T, N>: Sized + Send
where
T: Token,
N: NonTerminal,
{
type Config: Default + Clone;
type Error: std::error::Error + Send + Sync + 'static;
type State: Send + Sync;
fn new(grammar: &Grammar<T, N>, config: Self::Config) -> Result<Self, Self::Error>;
fn parse(&mut self, input: &[T], entry: N) -> ParseResult<T, N>;
fn parse_incremental(...) -> ParseResult<T, N>;
fn parse_with_session(...) -> ParseResult<T, N>;
fn validate(grammar: &Grammar<T, N>) -> Vec<GrammarError<T, N>>;
fn capabilities() -> BackendCapabilities;
fn state(&self) -> &Self::State;
}
Backend Capabilities
Each backend reports its capabilities:
pub struct BackendCapabilities {
pub name: &'static str,
pub algorithm: Algorithm,
pub supports_left_recursion: bool,
pub supports_ambiguity: bool,
pub supports_incremental: bool,
pub supports_error_recovery: bool,
pub max_lookahead: Option<usize>,
}
Choosing a Backend
flowchart TD
Start[Choose Backend] --> Ambiguous{Grammar<br/>Ambiguous?}
Ambiguous -->|Yes| GLR[Use GLR Parser]
Ambiguous -->|No| UseCase{Primary<br/>Use Case?}
UseCase -->|Expression Parser<br/>Precedence-based| PEG1[Use PEG Parser]
UseCase -->|Interactive Tools<br/>IDEs/Editors| Interactive{Need<br/>Memoization?}
UseCase -->|Batch Processing| LR[Use LR Parser]
UseCase -->|Complex Language| GLR2[Use GLR Parser]
UseCase -->|Simple Grammar| LL[Use LL Parser]
Interactive -->|Yes| PEG2[Use PEG Parser]
Interactive -->|No| LL2[Use LL Parser]
GLR --> End[Selected Backend]
GLR2 --> End
LR --> End
LL --> End
LL2 --> End
PEG1 --> End
PEG2 --> End
style GLR fill:#ffccbc,color:#000000
style GLR2 fill:#ffccbc,color:#000000
style LR fill:#fff9c4,color:#000000
style LL fill:#c8e6c9,color:#000000
style LL2 fill:#c8e6c9,color:#000000
style PEG1 fill:#e1bee7,color:#000000
style PEG2 fill:#e1bee7,color:#000000
style End fill:#e1f5ff,color:#000000
See Choosing a Backend for detailed guidance on selecting the right backend for your use case.
Next Steps
- Learn about LL Parser
- Explore LR Parser
- Check out GLR Parser
- Discover PEG Parser
See Also
- Choosing a Backend - Detailed guidance on backend selection
- Cheat Sheet - Quick reference for backend usage
- Glossary - Definitions of parsing-related terms
LL(k) Parser
The LL(k) parser is a top-down predictive parser with configurable lookahead.
Overview
LL(k) parsing works by:
- Predicting: Using lookahead to predict which production to use
- Matching: Matching tokens against predicted productions
- Recursing: Recursively parsing subexpressions
Configuration
Configure the LL parser with LlConfig:
use sipha::backend::ll::{LlParser, LlConfig};
let config = LlConfig {
lookahead: 1, // k value for LL(k)
left_recursion: true, // Enable left-recursion elimination
error_recovery: true, // Enable error recovery
..Default::default()
};
let mut parser = LlParser::new(&grammar, config)
.expect("Failed to create parser");
Lookahead
The lookahead parameter controls how many tokens to look ahead:
- LL(1): One token of lookahead (most common)
- LL(k): k tokens of lookahead (for grammars that need more context to choose a production)
Higher lookahead values:
- Resolve choices that LL(1) cannot distinguish
- Require larger parse tables
- May be slower
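If LL(1) cannot choose between two productions, raising the lookahead is a configuration change only. A minimal sketch, reusing the LlConfig shown above:
let config = LlConfig {
    lookahead: 2, // LL(2): use two tokens of lookahead
    ..Default::default()
};
let mut parser = LlParser::new(&grammar, config)
    .expect("Failed to create parser");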
Left Recursion
Left recursion elimination allows parsing left-recursive grammars:
// Without elimination: Expr -> Expr + Term (left-recursive)
// With elimination: Automatically transformed
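As an illustration, the standard textbook rewrite (shown here as notation, not Sipha output) turns a left-recursive rule into a right-recursive pair:
// Original (left-recursive):
//   Expr  -> Expr + Term | Term
// After elimination (right-recursive):
//   Expr  -> Term Expr'
//   Expr' -> + Term Expr' | ε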
Error Recovery
Error recovery strategies:
- Synchronization tokens: Skip to known recovery points
- Delimited recovery: Skip to matching delimiters
- Best-effort parsing: Continue parsing despite errors
Usage
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)
.expect("Failed to create parser");
let result = parser.parse(&tokens, MyNonTerminal::Expr);
Incremental Parsing
The LL parser supports incremental parsing:
use sipha::incremental::IncrementalParser;
let mut incremental = IncrementalParser::new(parser);
let result = incremental.parse_incremental(
&tokens,
old_tree,
&edits,
entry_point,
Some(&grammar),
);
Grammar Requirements
LL parsers require:
- No left recursion (unless elimination is enabled)
- No ambiguity (unless lookahead > 1)
- FIRST/FOLLOW sets: Computed automatically
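These requirements can be checked up front with the validate hook on the ParserBackend trait. The sketch below assumes validate can be called on the LL backend with your grammar, and it only reports how many issues were found:
use sipha::backend::ll::LlParser;
use sipha::backend::ParserBackend;

// A hedged sketch: validate the grammar against the LL backend before
// constructing a parser.
let issues = LlParser::validate(&grammar);
if !issues.is_empty() {
    eprintln!("grammar is not LL-compatible: {} issue(s) reported", issues.len());
}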
Parse Table
The LL parser builds a parse table:
- Rows: Non-terminals
- Columns: Terminals (FIRST sets)
- Entries: Production rules to apply
Performance
LL parsing performance:
- Time: O(n) where n is input length
- Space: O(grammar size) for parse table
- Incremental: Fast for small edits
When to Use
Use LL parser when:
- Grammar is LL(k) compatible
- Need predictable performance
- Want simple error messages
- Building interactive tools
Limitations
LL parser limitations:
- Left recursion: Requires elimination (may change parse tree)
- Ambiguity: Requires higher lookahead
- Left associativity: May require grammar transformation
Next Steps
- See Choosing a Backend for comparison
- Check Examples for usage
LR Parser
The LR parser is a bottom-up shift-reduce parser that builds parse trees from leaves to root.
Overview
LR parsing works by:
- Shifting: Moving tokens onto the stack
- Reducing: Replacing matched rules with non-terminals
- Building: Constructing the parse tree bottom-up
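As a worked illustration (independent of any Sipha API), these are the shift-reduce steps for the input 1 + 2 with the rules Expr -> Expr + Term | Term and Term -> Number:
// Stack              Remaining input    Action
// (empty)            1 + 2              shift 1
// 1                  + 2                reduce Term -> Number, then Expr -> Term
// Expr               + 2                shift +
// Expr +             2                  shift 2
// Expr + 2           (end)              reduce Term -> Number
// Expr + Term        (end)              reduce Expr -> Expr + Term
// Expr               (end)              accept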
Configuration
Configure the LR parser with LrConfig:
use sipha::backend::lr::{LrParser, LrConfig};
let config = LrConfig {
use_lalr: true, // Use LALR instead of canonical LR
error_recovery: true, // Enable error recovery
..Default::default()
};
let mut parser = LrParser::new(&grammar, config)
.expect("Failed to create parser");
LALR vs Canonical LR
- LALR: Smaller tables, faster, handles most grammars
- Canonical LR: Larger tables, more precise, handles edge cases
For most grammars, LALR is sufficient and faster.
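Switching is a one-field change; a minimal sketch reusing the LrConfig shown above:
// Fall back to canonical LR(1) tables if LALR state merging introduces
// conflicts for your grammar.
let config = LrConfig {
    use_lalr: false,
    ..Default::default()
};
let mut parser = LrParser::new(&grammar, config)
    .expect("Failed to create parser");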
Usage
use sipha::backend::lr::{LrParser, LrConfig};
use sipha::backend::ParserBackend;
let config = LrConfig::default();
let mut parser = LrParser::new(&grammar, config)
.expect("Failed to create parser");
let result = parser.parse(&tokens, MyNonTerminal::Expr);
Parse Table
The LR parser builds a parse table:
- States: Parser states (items)
- Actions: Shift, reduce, accept, error
- Gotos: State transitions for non-terminals
Table Construction
- Build items: Create LR items from grammar
- Build states: Group items into states
- Compute actions: Determine shift/reduce actions
- Resolve conflicts: Handle shift/reduce and reduce/reduce conflicts
Grammar Requirements
LR parsers require:
- No ambiguity: Must be unambiguous (or use GLR)
- No conflicts: No shift/reduce or reduce/reduce conflicts
Error Recovery
LR parsers support error recovery:
- Synchronization: Skip to known recovery points
- Delimited recovery: Skip to matching delimiters
- Best-effort: Continue parsing despite errors
Performance
LR parsing performance:
- Time: O(n) where n is input length
- Space: O(grammar size) for parse table
- Table size: LALR tables are smaller than canonical LR
When to Use
Use LR parser when:
- Grammar is LR compatible
- Need efficient parsing
- Want good error recovery
- Building batch parsers
Limitations
LR parser limitations:
- Ambiguity: Cannot handle ambiguous grammars (use GLR)
- Table size: Large grammars may have large tables
- Left recursion: Must be handled in grammar
Next Steps
- See GLR Parser for ambiguous grammars
- Check Choosing a Backend for comparison
GLR Parser
The GLR (Generalized LR) parser extends LR parsing to handle ambiguous grammars by maintaining multiple parser stacks and forking on conflicts.
Overview
GLR parsing works by:
- Building LR table: Uses LR(1) state machine
- Maintaining stacks: Keeps multiple parser stacks
- Forking on conflicts: Creates new stacks for ambiguous choices
- Merging stacks: Merges stacks when they converge
- Returning forest: Returns parse forest for ambiguous results
When to Use GLR
Use the GLR backend when:
- Your grammar has inherent ambiguities (e.g., C++ template syntax)
- You need to handle multiple valid parse trees
- You want to disambiguate at parse time using precedence/associativity rules
- Parsing complex languages with inherent ambiguities
Configuration
Configure the GLR parser with GlrConfig:
use sipha::backend::glr::{GlrParser, GlrConfig};
let config = GlrConfig {
max_stacks: 100, // Maximum number of stacks
disambiguation: None, // Disambiguation strategy
..Default::default()
};
let mut parser = GlrParser::new(&grammar, config)
.expect("Failed to create GLR parser");
Usage
use sipha::backend::glr::{GlrParser, GlrConfig};
use sipha::backend::ParserBackend;
let config = GlrConfig::default();
let mut parser = GlrParser::new(&grammar, config)
.expect("Failed to create GLR parser");
let result = parser.parse(&tokens, entry_point);
Handling Ambiguity
GLR parsers can produce parse forests when multiple valid parse trees exist:
if let Some(forest) = result.forest {
// Multiple parse trees exist - disambiguate
let disambiguated = forest.disambiguate(|alternatives| {
// Custom disambiguation logic
alternatives.first().cloned()
});
}
Disambiguation
Sipha provides several disambiguation strategies:
Precedence-based
Resolve conflicts using operator precedence:
use sipha::backend::glr::disambiguate_by_precedence;
let disambiguated = forest.disambiguate(disambiguate_by_precedence(|op| {
match op {
MySyntaxKind::Multiply | MySyntaxKind::Divide => 2,
MySyntaxKind::Plus | MySyntaxKind::Minus => 1,
_ => 0,
}
}));
Associativity-based
Resolve conflicts using operator associativity:
use sipha::backend::glr::disambiguate_by_associativity;
let disambiguated = forest.disambiguate(disambiguate_by_associativity(|op| {
match op {
MySyntaxKind::Minus => Associativity::Left,
MySyntaxKind::Power => Associativity::Right,
_ => Associativity::None,
}
}));
Custom Strategies
Implement your own disambiguation logic:
let disambiguated = forest.disambiguate(|alternatives| {
// Choose based on custom criteria
alternatives
.iter()
.min_by_key(|tree| compute_complexity(tree))
.cloned()
});
Parse Forest
A parse forest represents multiple parse trees:
pub struct ParseForest<K: SyntaxKind> {
roots: Vec<Arc<GreenNode<K>>>,
}
The forest can be:
- Queried: Find all parse trees
- Disambiguated: Choose one tree
- Traversed: Visit all alternatives
Incremental Parsing
The GLR parser supports incremental parsing:
use sipha::incremental::IncrementalParser;
let mut incremental = IncrementalParser::new(parser);
let result = incremental.parse_incremental(
&tokens,
old_tree,
&edits,
entry_point,
Some(&grammar),
);
Performance
GLR parsing performance:
- Time: O(n³) in worst case (ambiguous grammars)
- Space: O(n²) for parse forest
- Stacks: Number of stacks can grow with ambiguity
For unambiguous grammars, GLR performs similarly to LR.
Grammar Requirements
GLR parsers can handle:
- Ambiguous grammars: Multiple valid parse trees
- Non-deterministic grammars: Grammars with conflicts
- Complex languages: Languages like C++ with inherent ambiguities
Limitations
GLR parser limitations:
- Performance: Slower for highly ambiguous grammars
- Memory: Parse forests can be large
- Complexity: More complex than LL or LR
Example: Ambiguous Expression
// Grammar: Expr -> Expr + Expr | Expr * Expr | Number
// Input: "1 + 2 * 3"
// Ambiguous: Could be (1 + 2) * 3 or 1 + (2 * 3)
let result = glr_parser.parse(&tokens, MyNonTerminal::Expr);
if let Some(forest) = result.forest {
// Two parse trees exist
assert_eq!(forest.roots.len(), 2);
// Disambiguate using precedence
let disambiguated = forest.disambiguate(disambiguate_by_precedence(|op| {
match op {
MySyntaxKind::Multiply => 2,
MySyntaxKind::Plus => 1,
_ => 0,
}
}));
// Result: 1 + (2 * 3) = 7
}
Next Steps
- See Choosing a Backend for comparison
- Check Examples for complete examples
PEG Parser
The PEG (Parsing Expression Grammar) parser uses ordered choice and backtracking with memoization (packrat parsing) for efficient parsing.
Overview
PEG parsing works by:
- Ordered Choice: Try alternatives in order, first match wins
- Backtracking: If an alternative fails, backtrack and try the next
- Memoization: Cache parse results for O(n) performance (packrat parsing)
- Greedy Matching: Always match as much as possible
Understanding PEG Grammars
PEG (Parsing Expression Grammar) is a type of formal grammar that describes a language in terms of parsing rules rather than generation rules. Unlike traditional CFG (Context-Free Grammar) parsers, PEG parsers use ordered choice semantics, making them deterministic and unambiguous.
Key Concepts
Ordered Choice (A | B)
In PEG, the choice operator | is ordered and greedy:
- Alternatives are tried left to right
- The first matching alternative wins
- Once an alternative matches, no other alternatives are tried
- This makes PEG grammars unambiguous by definition
Example:
// Grammar: Expr -> Number | Number
// In CFG: This is ambiguous (both alternatives match "42")
// In PEG: First alternative always wins, second is never tried
Expr::Choice(vec![
Expr::token(Number("42")), // This matches first
Expr::token(Number("99")), // Never tried, even if input is "99"
])
Greedy Repetition (A*, A+)
PEG repetition is greedy - it matches as much as possible:
- A* matches zero or more As (as many as possible)
- A+ matches one or more As (as many as possible)
- Repetition always tries to match more before trying less
Example:
// Grammar: Expr -> Number*
// Input: "123"
// Matches: All three digits (greedy)
Expr::Repeat {
expr: Box::new(Expr::token(Number("1"))),
min: 0,
max: None,
}
Backtracking
When an alternative fails, the parser backtracks to try the next alternative:
- Position is restored to where the alternative started
- Previous matches are discarded
- Next alternative is tried from the same position
Example:
// Grammar: Expr -> "ab" | "ac"
// Input: "ac"
// Process:
// 1. Try "ab" - matches "a", fails on "b"
// 2. Backtrack to start
// 3. Try "ac" - matches successfully
Key Differences from CFG
PEG parsers differ from traditional CFG parsers in fundamental ways:
| Feature | CFG | PEG |
|---|---|---|
| Choice | Unordered (all alternatives considered) | Ordered (first match wins) |
| Ambiguity | Can be ambiguous | Always unambiguous |
| Repetition | May be non-greedy | Always greedy |
| Left Recursion | Natural | Requires special handling |
| Determinism | May be non-deterministic | Always deterministic |
Ordered Choice Example
// CFG: Expr -> Number | Number
// This is ambiguous - both alternatives match the same input
// Parser must use disambiguation rules
// PEG: Expr -> Number | Number
// This is unambiguous - first alternative always wins
// Second alternative is never tried if first matches
Expr::Choice(vec![
Expr::token(Number("1")), // Tried first
Expr::token(Number("2")), // Only tried if first fails
])
Precedence Through Ordering
PEG makes operator precedence natural through ordering:
// Lower precedence operators tried first
// Higher precedence operators tried later
// This naturally gives * higher precedence than +
Expr::Choice(vec![
// Addition (lower precedence, tried first)
Expr::Seq(vec![
Expr::Rule(Expr),
Expr::token(Plus),
Expr::Rule(Term),
]),
// Base case (tried last)
Expr::Rule(Term),
])
Term::Choice(vec![
// Multiplication (higher precedence, tried first)
Expr::Seq(vec![
Expr::Rule(Term),
Expr::token(Multiply),
Expr::Rule(Factor),
]),
// Base case (tried last)
Expr::Rule(Factor),
])
Input: 1 + 2 * 3
Parse process:
- Try Expr + Term: matches 1 +, then tries to match Term
- Term tries Term * Factor: matches 2 * 3
- Result: 1 + (2 * 3) ✓ (correct precedence)
Configuration
Configure the PEG parser with PegConfig:
use sipha::backend::peg::{PegParser, PegConfig};
let config = PegConfig {
enable_memoization: true, // Enable packrat parsing
max_memo_size: 10000, // Maximum cache size
error_recovery: true, // Enable error recovery
max_errors: 100, // Maximum errors before giving up
max_backtrack_depth: 1000, // Limit backtracking depth
};
let mut parser = PegParser::new(&grammar, config)
.expect("Failed to create parser");
Memoization (Packrat Parsing)
When enable_memoization is true, the parser caches parse results:
- O(n) Performance: Linear time parsing for many grammars
- Memory Trade-off: Uses more memory for cache
- Cache Size: Controlled by max_memo_size
Without memoization:
- Exponential Worst Case: Can be slow for some grammars
- Less Memory: No cache overhead
Backtracking Depth
The max_backtrack_depth parameter limits backtracking:
- Prevents Infinite Loops: Stops infinite backtracking
- Performance: Lower values may fail on valid inputs
- Default: 1000 (usually sufficient)
Error Recovery
Error recovery strategies:
- Best-effort parsing: Continue parsing despite errors
- Error limits: Stop after max_errors errors
- Recovery points: Use RecoveryPoint expressions in the grammar
Usage
use sipha::backend::peg::{PegParser, PegConfig};
use sipha::backend::ParserBackend;
let config = PegConfig::default();
let mut parser = PegParser::new(&grammar, config)
.expect("Failed to create parser");
let result = parser.parse(&tokens, MyNonTerminal::Expr);
Writing PEG Grammars
Basic Patterns
Sequences
Sequences match expressions in order:
// Match: "a" followed by "b"
Expr::Seq(vec![
Expr::token(Token::A),
Expr::token(Token::B),
])
Optional Elements
Optional expressions match zero or one occurrence:
// Match: "a" optionally followed by "b"
Expr::Seq(vec![
Expr::token(Token::A),
Expr::Opt(Box::new(Expr::token(Token::B))),
])
Repetition
Repetition matches zero or more (or one or more) occurrences:
// Match: One or more "a"s
Expr::Repeat {
expr: Box::new(Expr::token(Token::A)),
min: 1,
max: None,
}
// Match: Zero to three "a"s
Expr::Repeat {
expr: Box::new(Expr::token(Token::A)),
min: 0,
max: Some(3),
}
Separated Lists
Separated lists match items separated by a delimiter:
// Match: "1, 2, 3" or "1, 2, 3,"
Expr::Separated {
item: Box::new(Expr::token(Number)),
separator: Box::new(Expr::token(Comma)),
min: 1,
trailing: TrailingSeparator::Allow,
}
Common Patterns
Operator Precedence
Express precedence by ordering alternatives (lower precedence first):
// Expression with precedence: + < * < ( )
Expr::Choice(vec![
// Addition (lowest precedence, tried first)
Expr::Seq(vec![
Expr::Rule(Expr),
Expr::token(Plus),
Expr::Rule(Term),
]),
// Subtraction
Expr::Seq(vec![
Expr::Rule(Expr),
Expr::token(Minus),
Expr::Rule(Term),
]),
// Base case (tried last)
Expr::Rule(Term),
])
Term::Choice(vec![
// Multiplication (higher precedence)
Expr::Seq(vec![
Expr::Rule(Term),
Expr::token(Multiply),
Expr::Rule(Factor),
]),
// Base case
Expr::Rule(Factor),
])
Factor::Choice(vec![
// Parentheses (highest precedence)
Expr::Delimited {
open: Box::new(Expr::token(LParen)),
content: Box::new(Expr::Rule(Expr)),
close: Box::new(Expr::token(RParen)),
recover: false,
},
// Numbers
Expr::token(Number),
])
Left Associativity
Left associativity is expressed with left-recursive rules (see the note on left recursion below):
// Left-associative: 1 - 2 - 3 = (1 - 2) - 3
Expr::Choice(vec![
Expr::Seq(vec![
Expr::Rule(Expr), // Left-recursive
Expr::token(Minus),
Expr::Rule(Term),
]),
Expr::Rule(Term),
])
Note: Left recursion is detected and handled, but may require special consideration.
Right Associativity
Right associativity requires right-recursive rules:
// Right-associative: 2^3^4 = 2^(3^4)
Expr::Choice(vec![
Expr::Seq(vec![
Expr::Rule(Term),
Expr::token(Power),
Expr::Rule(Expr), // Right-recursive
]),
Expr::Rule(Term),
])
Optional Trailing Elements
Handle optional trailing separators:
// Match: "a, b, c" or "a, b, c,"
Expr::Separated {
item: Box::new(Expr::token(Ident)),
separator: Box::new(Expr::token(Comma)),
min: 1,
trailing: TrailingSeparator::Allow, // or Forbid, or Require
}
Advanced Patterns
Lookahead
Positive lookahead checks if expression matches without consuming:
// Match "a" only if followed by "b" (but don't consume "b")
Expr::Seq(vec![
Expr::token(Token::A),
Expr::Lookahead(Box::new(Expr::token(Token::B))),
])
Negative lookahead checks if expression doesn’t match:
// Match "a" only if NOT followed by "b"
Expr::Seq(vec![
Expr::token(Token::A),
Expr::NotLookahead(Box::new(Expr::token(Token::B))),
])
Error Recovery
Use recovery points to skip to known tokens on error:
// On error, skip to semicolon or brace
Expr::RecoveryPoint {
expr: Box::new(Expr::Rule(Statement)),
sync_tokens: vec![Semicolon, RBrace].into(),
}
Delimited Recovery
Recover from missing closing delimiters:
// Recover from missing closing parenthesis
Expr::Delimited {
open: Box::new(Expr::token(LParen)),
content: Box::new(Expr::Rule(Expr)),
close: Box::new(Expr::token(RParen)),
recover: true, // Skip tokens until closing delimiter found
}
Incremental Parsing
The PEG parser supports incremental parsing with memoization:
use sipha::incremental::IncrementalParser;
let mut incremental = IncrementalParser::new(parser);
let result = incremental.parse_incremental(
&tokens,
old_tree,
&edits,
entry_point,
Some(&grammar),
);
The memoization cache works with incremental parsing:
- Node Reuse: Cached nodes are reused when possible
- Cache Invalidation: Affected regions invalidate cache entries
- Performance: Significant speedup for small edits
Grammar Requirements and Best Practices
What PEG Handles Well
PEG parsers excel at:
- Precedence-based Languages: Operator precedence is natural through ordering
- Expression Parsers: Arithmetic, logical, and comparison expressions
- Structured Data: JSON, XML, configuration files
- Interactive Tools: Fast parsing with memoization for IDEs and editors
- Deterministic Parsing: Always produces one parse tree (no ambiguity)
Left Recursion
PEG parsers detect left recursion, but support for it is limited:
// Left-recursive rule
Expr -> Expr + Term | Term
// PEG detects this and adds a warning
// The parser will fail on left-recursive rules to prevent infinite loops
Best Practice: For left-recursive grammars, consider:
- Using right-recursive alternatives where possible
- Restructuring to avoid left recursion
- Using iterative parsing for specific cases
Example - Avoiding Left Recursion:
// Instead of: Expr -> Expr + Term | Term
// Use iterative approach with repetition:
Expr::Seq(vec![
Expr::Rule(Term),
Expr::Repeat {
expr: Box::new(Expr::Seq(vec![
Expr::token(Plus),
Expr::Rule(Term),
])),
min: 0,
max: None,
},
])
Ambiguity Resolution
PEG grammars are always unambiguous because ordered choice resolves ambiguity:
// CFG: Expr -> Number | Number (ambiguous)
// PEG: Expr -> Number | Number (unambiguous - first wins)
// If you need different behavior, reorder:
Expr::Choice(vec![
Expr::token(Number("42")), // Specific case first
Expr::token(Number("n")), // General case second
])
Best Practice: Order alternatives from most specific to least specific.
Common Pitfalls
1. Order Matters
// Wrong: General case first
Expr::Choice(vec![
Expr::token(Number), // Matches everything
Expr::token(Number("42")), // Never reached
])
// Correct: Specific case first
Expr::Choice(vec![
Expr::token(Number("42")), // Specific case first
Expr::token(Number), // General case second
])
2. Greedy Repetition
// This matches as many as possible
Expr::Repeat {
expr: Box::new(Expr::token(Number)),
min: 0,
max: None,
}
// Input: "123" matches all three digits
3. Backtracking Cost
Excessive backtracking can be slow. Use lookahead to avoid unnecessary backtracking:
// Expensive: Tries "ab", backtracks, tries "ac"
Expr::Choice(vec![
Expr::Seq(vec![Expr::token(A), Expr::token(B)]),
Expr::Seq(vec![Expr::token(A), Expr::token(C)]),
])
// Better: Use lookahead to avoid backtracking
Expr::Seq(vec![
Expr::token(A),
Expr::Choice(vec![
Expr::Lookahead(Box::new(Expr::token(B))),
Expr::token(C),
]),
])
Memoization Details
Packrat parsing caches results at each position:
- Cache Key: (rule, position)
- Cache Value: Parse result (success with node, or failure)
- Cache Size: Bounded by max_memo_size
- Eviction: Old entries evicted when the cache is full
Benefits:
- Linear Time: O(n) for many grammars
- Node Reuse: Cached nodes can be reused
- Incremental: Works well with incremental parsing
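To make the cache shape concrete, here is a conceptual sketch of a packrat memo table. It is not Sipha's internal representation, only an illustration of the (rule, position) keying described above:
use std::collections::HashMap;

// Conceptual entry: either the rule matched and parsing may resume at `end`,
// or the rule is known to fail at this position.
enum MemoEntry {
    Success { end: usize },
    Failure,
}

// Keyed by (rule id, input position): repeated attempts to parse the same
// rule at the same position become a single table lookup.
type MemoTable = HashMap<(u32, usize), MemoEntry>;

fn lookup(memo: &MemoTable, rule: u32, pos: usize) -> Option<&MemoEntry> {
    memo.get(&(rule, pos))
}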
Ordered Choice Details
Ordered choice (A | B) semantics:
- Try A
- If A succeeds, return success (never try B)
- If A fails, backtrack and try B
- If both fail, return failure
This differs from CFG where both alternatives are considered.
Performance
PEG parsing performance:
- With Memoization: O(n) time, O(n) space (packrat parsing)
- Without Memoization: O(n) to exponential time, O(1) space
- Incremental: Fast for small edits (benefits from memoization)
- Memory: Cache size depends on grammar and input
When to Use PEG
Ideal Use Cases
1. Expression Parsers
PEG excels at parsing expressions with operator precedence:
// Natural expression parsing
// Input: "1 + 2 * 3"
// Result: 1 + (2 * 3) (correct precedence)
// Grammar is intuitive - lower precedence tried first
Expr -> Expr + Term | Term
Term -> Term * Factor | Factor
Why PEG: Precedence is expressed naturally through ordering, no need for precedence tables.
2. Interactive Tools (IDEs, Editors)
PEG with memoization provides fast incremental parsing:
- Fast re-parsing: Memoization caches results
- Incremental updates: Only affected regions re-parsed
- Responsive: Good performance for real-time editing
Why PEG: Memoization makes repeated parsing very fast.
3. Structured Data Formats
PEG handles delimited and separated structures well:
// JSON-like structures
Object -> "{" (Pair ("," Pair)*)? "}"
Array -> "[" (Value ("," Value)*)? "]"
Why PEG: Delimited and separated expressions are first-class.
4. Deterministic Parsing
When you need guaranteed unambiguous parsing:
- No ambiguity: Ordered choice ensures one parse tree
- Predictable: Same input always produces same output
- No disambiguation needed: Parser makes choices automatically
Why PEG: Unambiguous by design.
When NOT to Use PEG
1. Truly Ambiguous Grammars
If you need to track or explore ambiguity:
- Use GLR instead: GLR produces parse forests for ambiguous inputs
- PEG hides ambiguity: Only one parse tree is produced
2. Very Large Files Without Memoization
Without memoization, some PEG grammars can be exponential:
- Enable memoization: Use enable_memoization: true
- Or use LR: LR parsers are consistently O(n) without memoization
3. Simple LL(1) Grammars
For simple, unambiguous grammars, LL(k) may be simpler:
- LL is simpler: Table-based, easier to understand
- PEG is more flexible: But adds complexity for simple cases
Limitations
PEG parser limitations:
- Exponential Without Memoization: Can be slow for some grammars
- Memory Usage: Memoization requires significant memory
- Left Recursion: May require special handling
- No Ambiguity Tracking: Ambiguity resolved by ordering (may hide issues)
Comparison with Other Backends
PEG vs LL Parser
| Aspect | PEG | LL(k) |
|---|---|---|
| Approach | Top-down with backtracking | Top-down predictive |
| Choice | Ordered (first match wins) | Based on lookahead |
| Left Recursion | Detected, may need handling | Can eliminate |
| Ambiguity | Resolved by ordering | Requires lookahead > 1 |
| Performance | O(n) with memoization | O(n) always |
| Memory | O(n) with memoization | O(grammar size) |
| Best For | Precedence, interactive tools | Simple grammars, IDEs |
Choose PEG when: You need ordered choice semantics or memoization benefits.
Choose LL when: Your grammar is LL(k) compatible and you want simplicity.
PEG vs LR Parser
| Aspect | PEG | LR |
|---|---|---|
| Approach | Top-down with backtracking | Bottom-up shift-reduce |
| Choice | Ordered | Based on parse table |
| Left Recursion | Detected, may need handling | Natural |
| Backtracking | Yes (with limits) | No |
| Performance | O(n) with memoization | O(n) always |
| Error Recovery | Good | Excellent |
| Best For | Precedence, expressions | Batch parsing, languages |
Choose PEG when: You need ordered choice or top-down semantics.
Choose LR when: You need excellent error recovery or batch processing.
PEG vs GLR Parser
| Aspect | PEG | GLR |
|---|---|---|
| Approach | Top-down deterministic | Bottom-up with ambiguity |
| Choice | Ordered (unambiguous) | All alternatives (ambiguous) |
| Ambiguity | Resolved by ordering | Tracked in parse forest |
| Performance | O(n) with memoization | O(n) to O(n³) |
| Best For | Deterministic parsing | Ambiguous grammars |
Choose PEG when: You want guaranteed unambiguous parsing.
Choose GLR when: Your grammar has inherent ambiguities you need to explore.
Practical Examples
Example 1: Arithmetic Expressions
// Grammar for: 1 + 2 * 3
let grammar = GrammarBuilder::new()
.entry_point(Expr)
.rule(Expr, Expr::Choice(vec![
// Addition (lowest precedence, tried first)
Expr::Seq(vec![
Expr::Rule(Expr),
Expr::token(Plus),
Expr::Rule(Term),
]),
// Base case
Expr::Rule(Term),
]))
.rule(Term, Expr::Choice(vec![
// Multiplication (higher precedence)
Expr::Seq(vec![
Expr::Rule(Term),
Expr::token(Multiply),
Expr::Rule(Factor),
]),
// Base case
Expr::Rule(Factor),
]))
.rule(Factor, Expr::Choice(vec![
// Parentheses
Expr::Delimited {
open: Box::new(Expr::token(LParen)),
content: Box::new(Expr::Rule(Expr)),
close: Box::new(Expr::token(RParen)),
recover: false,
},
// Numbers
Expr::token(Number),
]))
.build()?;
Example 2: JSON-like Lists
// Grammar for: [1, 2, 3] or [1, 2, 3,]
let grammar = GrammarBuilder::new()
.entry_point(Array)
.rule(Array, Expr::Delimited {
open: Box::new(Expr::token(LBracket)),
content: Box::new(Expr::Separated {
item: Box::new(Expr::Rule(Value)),
separator: Box::new(Expr::token(Comma)),
min: 0,
trailing: TrailingSeparator::Allow,
}),
close: Box::new(Expr::token(RBracket)),
recover: true,
})
.build()?;
Example 3: Error Recovery
// Grammar with error recovery
let grammar = GrammarBuilder::new()
.entry_point(Program)
.rule(Program, Expr::Repeat {
expr: Box::new(Expr::RecoveryPoint {
expr: Box::new(Expr::Rule(Statement)),
sync_tokens: vec![Semicolon, RBrace].into(),
}),
min: 0,
max: None,
})
.build()?;
Next Steps
- See Choosing a Backend for comparison
- Check Examples for usage
- Learn about Incremental Parsing
Choosing a Backend
This guide helps you choose the right parsing backend for your use case.
Quick Comparison
| Feature | LL(k) | LR | GLR | PEG |
|---|---|---|---|---|
| Grammar compatibility | LL(k) | LR | Any | Most |
| Ambiguity handling | No | No | Yes | No (ordered choice) |
| Left recursion | With elimination | Yes | Yes | Yes |
| Error recovery | Good | Excellent | Good | Good |
| Incremental parsing | Yes | Yes | Yes | Yes |
| Performance | Fast | Fast | Slower (ambiguous) | Fast (with memo) |
| Memory usage | Small | Medium | Medium | Medium (with memo) |
| Complexity | Low | Medium | High | Medium |
Decision Tree
flowchart TD
Start[Choose Backend] --> Ambiguous{Grammar<br/>Ambiguous?}
Ambiguous -->|Yes| GLR[Use GLR Parser]
Ambiguous -->|No| UseCase{Primary<br/>Use Case?}
UseCase -->|Expression Parser<br/>Precedence-based| PEG1[Use PEG Parser]
UseCase -->|Interactive Tools<br/>IDEs/Editors| Interactive{Need<br/>Memoization?}
UseCase -->|Batch Processing| LR[Use LR Parser]
UseCase -->|Complex Language| GLR2[Use GLR Parser]
UseCase -->|Simple Grammar| LL[Use LL Parser]
Interactive -->|Yes| PEG2[Use PEG Parser]
Interactive -->|No| LL2[Use LL Parser]
GLR --> End[Selected Backend]
GLR2 --> End
LR --> End
LL --> End
LL2 --> End
PEG1 --> End
PEG2 --> End
style GLR fill:#ffccbc,color:#000000
style GLR2 fill:#ffccbc,color:#000000
style LR fill:#fff9c4,color:#000000
style LL fill:#c8e6c9,color:#000000
style LL2 fill:#c8e6c9,color:#000000
style PEG1 fill:#e1bee7,color:#000000
style PEG2 fill:#e1bee7,color:#000000
style End fill:#e1f5ff,color:#000000
Step-by-Step Decision Process
1. Is your grammar ambiguous?
- Yes → Use GLR
- No → Continue to step 2
2. What’s your primary use case?
- Expression parser / Precedence-based → Use PEG (natural ordered choice)
- Interactive tools (IDEs, editors) → Continue to step 3
- Batch parsing → Use LR (efficient, good error recovery)
- Complex language (C++, etc.) → Use GLR (handles ambiguities)
- Simple grammar → Use LL(k) (simplest)
3. For Interactive Tools: Need memoization?
- Yes → Use PEG (memoization provides fast re-parsing)
- No → Use LL(k) (simpler, table-based)
Detailed Scenarios
Scenario 1: Simple Expression Parser
Grammar: Arithmetic expressions with precedence (e.g., 1 + 2 * 3)
Recommendation: PEG or LL(k) with left-recursion elimination
Reasoning:
- Grammar is unambiguous
- PEG’s ordered choice makes precedence natural - lower precedence operators tried first
- Left recursion can be eliminated (LL) or handled (PEG)
- Simple and fast
- Good for interactive tools
PEG Grammar Example:
// Precedence: + < * < ( )
// Lower precedence tried first in PEG
Expr -> Expr + Term | Term // + tried first (lower precedence)
Term -> Term * Factor | Factor // * tried later (higher precedence)
Factor -> ( Expr ) | Number // () tried last (highest precedence)
Why PEG works well: The ordering naturally expresses precedence without needing precedence tables.
Scenario 2: Language Server
Grammar: Full programming language
Recommendation: LL(k), LR, or PEG
Reasoning:
- Need incremental parsing (all support it)
- LL(k) is simpler for error messages
- LR has better error recovery
- PEG has memoization for fast parsing
- Choose based on grammar characteristics
Scenario 3: C++ Parser
Grammar: C++ with template syntax ambiguities
Recommendation: GLR
Reasoning:
- Grammar has inherent ambiguities
- Need to handle multiple parse trees
- Can disambiguate using precedence/associativity
Scenario 4: Batch Parser
Grammar: Any unambiguous grammar
Recommendation: LR
Reasoning:
- Efficient for batch processing
- Good error recovery
- Handles left recursion naturally
Performance Considerations
LL(k)
- Fast: O(n) parsing time
- Small tables: Efficient memory usage
- Good for: Interactive tools, small to medium files
LR
- Fast: O(n) parsing time
- Medium tables: Reasonable memory usage
- Good for: Batch processing, large files
GLR
- Variable: O(n) to O(n³) depending on ambiguity
- Medium to large tables: More memory for forests
- Good for: Ambiguous grammars, complex languages
PEG
- With memoization: O(n) parsing time, O(n) space
- Without memoization: O(n) to exponential time, O(1) space
- Medium memory: Cache size depends on grammar
- Good for: Precedence-based languages, interactive tools
Grammar Compatibility
LL(k) Compatible
- No left recursion (or can be eliminated)
- No ambiguity (or k > 1)
- Can compute FIRST/FOLLOW sets
LR Compatible
- No ambiguity
- No shift/reduce conflicts
- No reduce/reduce conflicts
GLR Compatible
- Any grammar
- Handles all conflicts
- Handles all ambiguities
PEG Compatible
- Most grammars: Very flexible
- Left recursion: Detected and handled (may require restructuring)
- Ordered choice: Resolves ambiguity automatically
- Best for: Precedence-based expressions, structured data, interactive tools
Writing PEG Grammars:
- Order alternatives from most specific to least specific
- Lower precedence operators should be tried first
- Use repetition for lists and sequences
- Consider memoization for performance
Example - Precedence:
// Correct: Lower precedence first
Expr -> Expr + Term | Term // + tried first
Term -> Term * Factor | Factor // * tried later (higher precedence)
// This naturally gives * higher precedence than +
Migration Path
From LL to LR
If you need better error recovery or left recursion:
- Update grammar (if needed)
- Change parser type
- Update configuration
- Test thoroughly
From LR to GLR
If you encounter ambiguities:
- Keep same grammar
- Change parser type
- Add disambiguation logic
- Handle parse forests
From LL/LR to PEG
If you want ordered choice semantics or memoization:
- Reorder alternatives: Most specific first, least specific last
- Adjust precedence: Lower precedence operators tried first
- Enable memoization: Set enable_memoization: true for performance
- Handle left recursion: Restructure if needed (PEG detects it)
- Test thoroughly - ordered choice may change parse results
Example Migration:
// LL/LR: Precedence via grammar structure or tables
// PEG: Precedence via ordering
// Before (LL/LR style - may need precedence hints):
Expr -> Term | Expr + Term
Term -> Factor | Term * Factor
// After (PEG style - ordering expresses precedence):
Expr -> Expr + Term | Term // + tried first (lower precedence)
Term -> Term * Factor | Factor // * tried later (higher precedence)
Recommendations by Use Case
Interactive Tools (IDEs, Editors)
Primary: LL(k) or PEG. Alternative: LR if the grammar requires it.
Reasoning:
- Both LL(k) and PEG are fast and support incremental parsing
- PEG’s memoization can provide significant speedup for repeated parsing
- PEG’s ordered choice makes precedence natural for expression languages
- LL(k) is simpler for straightforward grammars
PEG Advantages for IDEs:
- Memoization caches parse results, making re-parsing very fast
- Incremental parsing benefits from cached nodes
- Ordered choice makes grammar writing intuitive
Batch Processing
Primary: LR. Alternative: LL(k) if the grammar is simpler.
Complex Languages
Primary: GLR. Alternative: Transform the grammar to be unambiguous, or use PEG if the ambiguity can be resolved by ordering.
PEG for Complex Languages:
- Use PEG if ambiguities can be resolved by ordering alternatives
- Use GLR if you need to explore or track multiple parse trees
- PEG is deterministic - always produces one parse tree
Learning/Prototyping
Primary: LL(k) or PEG
Reasoning:
- LL(k): Simplest to understand and use
- PEG: Intuitive grammar writing (ordered choice is natural), good for learning parsing concepts
PEG for Learning:
- Ordered choice is easy to understand
- Precedence is expressed naturally
- No need for precedence tables or complex grammar analysis
Next Steps
- Read the backend-specific documentation: LL Parser, LR Parser, GLR Parser, PEG Parser
- Check Examples for real-world usage
Green/Red Trees
Sipha uses an immutable green/red tree representation for syntax trees. This design provides efficient memory usage and fast tree operations.
Overview
The green/red tree architecture separates:
- Green trees: Compact, shared representation stored in an arena
- Red trees: Convenient API for traversing and querying the tree
This separation provides:
- Efficient memory usage: Shared subtrees reduce memory footprint
- Fast traversal: Optimized for common tree operations
- Immutability: Trees are immutable, enabling safe sharing
Green Trees
Green trees are the storage layer:
- Compact: Minimal memory overhead
- Shared: Subtrees can be shared across multiple trees
- Arena-allocated: Stored in an arena for efficient allocation
- Immutable: Once created, cannot be modified
GreenNode
A GreenNode represents a node in the green tree:
pub struct GreenNode<K: SyntaxKind> {
kind: K,
children: Vec<SyntaxElement<K>>,
// ... internal fields
}
Green nodes contain:
- Kind: The syntax kind (terminal or non-terminal)
- Children: Child elements (nodes or tokens)
- Metadata: Internal metadata for tree operations
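A minimal sketch of building a green node and wrapping it for navigation (MySyntaxKind is a placeholder user-defined kind; the builder API is covered in Tree Manipulation):
use sipha::syntax::{GreenNodeBuilder, SyntaxNode};

// Build a tiny green tree: an Expr node containing a single Number token.
let mut builder = GreenNodeBuilder::new();
builder.start_node(MySyntaxKind::Expr);
builder.token(MySyntaxKind::Number, "42").unwrap();
builder.finish_node().unwrap();
let green = builder.finish().unwrap();
// Wrap the green root in a red SyntaxNode to navigate it.
let root = SyntaxNode::new_root(green);
assert!(root.kind() == MySyntaxKind::Expr);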
Red Trees
Red trees are the API layer:
- Convenient: Easy-to-use API for traversal
- Lazy: Created on-demand from green trees
- Navigable: Rich navigation API (parent, siblings, etc.)
- Queryable: Supports queries and visitors
SyntaxNode
A SyntaxNode represents a node in the red tree:
let root = SyntaxNode::new_root(green_root.clone());
// Navigate
for child in root.children() {
println!("{:?}", child.kind());
}
// Find nodes
if let Some(expr) = root.descendants().find(|n| n.kind() == MySyntaxKind::Expr) {
println!("Found expression");
}
Tree Structure
Elements
Tree elements can be:
- Nodes: Non-terminal nodes (have children)
- Tokens: Terminal nodes (leaf nodes)
pub enum SyntaxElement<K: SyntaxKind> {
Node(SyntaxNode),
Token(SyntaxToken),
}
Hierarchy
Trees form a hierarchy:
Root (NonTerminal)
├── Child1 (NonTerminal)
│ ├── Token1 (Terminal)
│ └── Token2 (Terminal)
└── Child2 (NonTerminal)
└── Token3 (Terminal)
Tree Structure Visualization
graph TD
subgraph Green["Green Tree (Storage)"]
G1[GreenNode: Expr]
G2[GreenNode: Term]
G3[GreenToken: Number]
G4[GreenToken: Plus]
G5[GreenNode: Term]
G6[GreenToken: Number]
G1 --> G2
G1 --> G4
G1 --> G5
G2 --> G3
G5 --> G6
end
subgraph Red["Red Tree (API)"]
R1[SyntaxNode: Expr]
R2[SyntaxNode: Term]
R3[SyntaxToken: Number]
R4[SyntaxToken: Plus]
R5[SyntaxNode: Term]
R6[SyntaxToken: Number]
R1 --> R2
R1 --> R4
R1 --> R5
R2 --> R3
R5 --> R6
end
G1 -.->|"new_root()"| R1
style G1 fill:#c8e6c9,color:#000000
style G2 fill:#c8e6c9,color:#000000
style G3 fill:#ffccbc,color:#000000
style G4 fill:#ffccbc,color:#000000
style G5 fill:#c8e6c9,color:#000000
style G6 fill:#ffccbc,color:#000000
style R1 fill:#e1f5ff,color:#000000
style R2 fill:#e1f5ff,color:#000000
style R3 fill:#fff9c4,color:#000000
style R4 fill:#fff9c4,color:#000000
style R5 fill:#e1f5ff,color:#000000
style R6 fill:#fff9c4,color:#000000
Memory Efficiency
Green/red trees are memory-efficient:
- Sharing: Unchanged subtrees are shared across edits
- Arena allocation: Nodes are allocated in an arena
- Compact representation: Minimal overhead per node
- Incremental updates: Only changed regions are recreated
Design Rationale
The green/red tree design is inspired by:
- rust-analyzer: Uses similar architecture
- rowan: Provides the green/red tree foundation
- IDE requirements: Optimized for interactive editing
Benefits:
- Fast incremental parsing: Reuse green nodes
- Efficient memory: Share unchanged subtrees
- Safe sharing: Immutability enables safe sharing
- Rich API: Red trees provide convenient navigation
Conversion
Convert between green and red trees:
// Green to Red
let red_root = SyntaxNode::new_root(green_node.clone());
// Red to Green
let green_node = red_node.green();
Next Steps
- Learn Working with Trees for traversal and queries
- See Tree Manipulation for building and modifying trees
See Also
- Incremental Parsing - How green/red trees enable incremental parsing
- Glossary - Definitions of tree-related terms
- Architecture - Deep dive into data structures
Working with Syntax Trees
This chapter shows how to traverse, query, and work with syntax trees in Sipha.
Creating a Red Tree
Create a red tree from a parse result:
use sipha::syntax::SyntaxNode;
let root = SyntaxNode::new_root(result.root.clone());
Traversal
Children
Iterate over direct children:
for child in root.children() {
println!("Child: {:?}", child.kind());
}
Descendants
Iterate over all descendants (depth-first):
for descendant in root.descendants() {
println!("Descendant: {:?}", descendant.kind());
}
Ancestors
Iterate over ancestors:
if let Some(node) = root.descendants().find(|n| n.kind() == MySyntaxKind::Expr) {
for ancestor in node.ancestors() {
println!("Ancestor: {:?}", ancestor.kind());
}
}
Siblings
Access siblings:
if let Some(node) = root.descendants().find(|n| n.kind() == MySyntaxKind::Expr) {
if let Some(next) = node.next_sibling() {
println!("Next sibling: {:?}", next.kind());
}
if let Some(prev) = node.prev_sibling() {
println!("Previous sibling: {:?}", prev.kind());
}
}
Querying
Find by Kind
Find nodes by syntax kind:
let exprs: Vec<_> = root.descendants()
.filter(|n| n.kind() == MySyntaxKind::Expr)
.collect();
Find by Position
Find nodes at a specific position:
use sipha::syntax::TextSize;
let pos = TextSize::from(10);
if let Some(node) = root.token_at_offset(pos) {
println!("Token at position {}: {:?}", pos, node.kind());
}
Find Parent
Find the parent of a node:
if let Some(parent) = node.parent() {
println!("Parent: {:?}", parent.kind());
}
Text Access
Node Text
Get the text covered by a node:
let text = node.text();
println!("Node text: {}", text);
Token Text
Get the text of a token:
if let Some(token) = node.first_token() {
println!("Token text: {}", token.text());
}
Text Range
Get the text range of a node:
let range = node.text_range();
println!("Range: {:?}", range);
Visitors
Use visitors to traverse trees:
#[cfg(feature = "visitor")]
use sipha::syntax::{SyntaxVisitor, SyntaxWalker};
struct MyVisitor;
impl SyntaxVisitor for MyVisitor {
fn visit_node(&mut self, node: &SyntaxNode) {
println!("Visiting: {:?}", node.kind());
}
fn visit_token(&mut self, token: &SyntaxToken) {
println!("Token: {:?}", token.kind());
}
}
let mut visitor = MyVisitor;
root.walk(&mut visitor);
Queries
Use XPath-like queries (requires query feature):
#[cfg(feature = "query")]
use sipha::syntax::{QueryBuilder, XPathQuery};
let query = QueryBuilder::new()
.descendant(MySyntaxKind::Expr)
.child(MySyntaxKind::Number)
.build();
let results: Vec<_> = query.matches(&root).collect();
Pattern Matching
Match tree patterns:
// Match specific structure
if let Some(expr) = root.descendants().find(|n| {
n.kind() == MySyntaxKind::Expr &&
n.children().any(|c| c.kind() == MySyntaxKind::Number)
}) {
println!("Found expression with number");
}
Collecting Information
Count Nodes
Count nodes by kind:
let expr_count = root.descendants()
.filter(|n| n.kind() == MySyntaxKind::Expr)
.count();
Collect Text
Collect text from nodes:
let all_text: String = root.descendants()
.filter_map(|n| n.first_token())
.map(|t| t.text().to_string())
.collect();
Best Practices
- Use iterators: Prefer iterator methods over manual traversal
- Cache results: Cache frequently accessed nodes
- Filter early: Filter nodes as early as possible
- Use visitors: Use visitors for complex traversals
- Leverage queries: Use queries for pattern matching
Next Steps
- See Tree Manipulation for building and modifying trees
- Check Examples for real-world usage
Tree Manipulation
This chapter shows how to build and modify syntax trees in Sipha.
Building Trees
GreenNodeBuilder
Use GreenNodeBuilder to build green trees:
use sipha::syntax::{GreenNodeBuilder, SyntaxKind};
let mut builder = GreenNodeBuilder::new();
// Start a node
builder.start_node(MySyntaxKind::Expr);
// Add tokens
builder.token(MySyntaxKind::Number, "42").unwrap();
builder.token(MySyntaxKind::Plus, "+").unwrap();
builder.token(MySyntaxKind::Number, "10").unwrap();
// Finish the node
let expr_node = builder.finish_node().unwrap();
// Build root
builder.start_node(MySyntaxKind::Root);
builder.finish_node().unwrap();
let root = builder.finish().unwrap();
Nested Structures
Build nested structures:
let mut builder = GreenNodeBuilder::new();
builder.start_node(MySyntaxKind::Expr);
// Nested expression
builder.start_node(MySyntaxKind::Term);
builder.token(MySyntaxKind::Number, "1").unwrap();
builder.finish_node().unwrap();
builder.token(MySyntaxKind::Plus, "+").unwrap();
// Another nested expression
builder.start_node(MySyntaxKind::Term);
builder.token(MySyntaxKind::Number, "2").unwrap();
builder.finish_node().unwrap();
builder.finish_node().unwrap();
let root = builder.finish().unwrap();
Modifying Trees
Trees are immutable, so modifications create new trees:
Replacing Nodes
Replace a node by rebuilding:
fn replace_node(old_node: &SyntaxNode, new_kind: MySyntaxKind) -> GreenNode<MySyntaxKind> {
let mut builder = GreenNodeBuilder::new();
builder.start_node(new_kind);
// Copy children
for child in old_node.children() {
match child {
SyntaxElement::Node(n) => {
// Recursively replace
let new_child = replace_node(&n, new_kind);
// Add new child (simplified - actual API may differ)
}
SyntaxElement::Token(t) => {
builder.token(t.kind(), t.text()).unwrap();
}
}
}
builder.finish_node().unwrap();
builder.finish().unwrap()
}
Inserting Nodes
Insert nodes by rebuilding:
fn insert_child(parent: &SyntaxNode, new_child: GreenNode<MySyntaxKind>, index: usize) -> GreenNode<MySyntaxKind> {
let mut builder = GreenNodeBuilder::new();
builder.start_node(parent.kind());
// Copy children before index
for (i, child) in parent.children().enumerate() {
if i == index {
// Insert new child
// (simplified - actual API may differ)
}
// Copy existing child
}
builder.finish_node().unwrap();
builder.finish().unwrap()
}
Tree Transformation
Transform trees using visitors:
struct Transformer;
impl SyntaxVisitor for Transformer {
fn visit_node(&mut self, node: &SyntaxNode) {
// Transform node
if node.kind() == MySyntaxKind::OldKind {
// Replace with new kind
}
}
}
Incremental Updates
For incremental updates, reuse unchanged nodes:
use sipha::incremental::IncrementalParser;
// Initial parse
let result1 = parser.parse_incremental(&tokens, None, &[], entry, Some(&grammar));
// After edit, reuse unchanged nodes
let result2 = parser.parse_incremental(
&new_tokens,
Some(&result1.root),
&edits,
entry,
Some(&grammar),
);
Best Practices
- Use builders: Use
GreenNodeBuilderfor building trees - Reuse nodes: Reuse unchanged nodes in incremental updates
- Immutable operations: All tree operations create new trees
- Cache results: Cache frequently accessed nodes
- Batch modifications: Batch multiple modifications together
Next Steps
- See Working with Trees for traversal
- Check Incremental Parsing for efficient updates
Error Handling Overview
Sipha provides comprehensive error handling with configurable error recovery strategies. This section covers error types, recovery mechanisms, and best practices for handling errors in your parsers.
Overview
Sipha’s error handling system is designed to:
- Provide detailed error information with precise source locations
- Recover from errors gracefully to continue parsing
- Support rich diagnostics with context and suggestions (via the diagnostics feature)
- Distinguish between errors and warnings for different severity levels
Error Handling Flow
flowchart TD
Start[Parse Input] --> Parse[Parser Processes Tokens]
Parse --> Error{Error<br/>Encountered?}
Error -->|No| Success[Parse Success]
Error -->|Yes| Recover{Recovery<br/>Enabled?}
Recover -->|Yes| Attempt[Attempt Recovery]
Recover -->|No| Fail[Report Error]
Attempt --> Success2[Continue Parsing]
Attempt --> Fail2[Recovery Failed]
Success --> Result[Return ParseResult]
Success2 --> Result
Fail --> Result
Fail2 --> Result
style Success fill:#c8e6c9,color:#000000
style Success2 fill:#c8e6c9,color:#000000
style Fail fill:#ffccbc,color:#000000
style Fail2 fill:#ffccbc,color:#000000
style Result fill:#e1f5ff,color:#000000
Key Concepts
Errors vs Warnings
- Errors: Prevent successful parsing or indicate syntax violations
- Warnings: Non-fatal issues that don’t prevent parsing
Error Recovery
Sipha parsers can automatically recover from errors to continue parsing, which is essential for:
- IDEs: Providing real-time feedback as users type
- Language Servers: Maintaining parse state despite errors
- Interactive Tools: Allowing partial parsing results
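A minimal sketch of this workflow, reusing the LL configuration from the backends chapter (tokens, grammar, and entry_point are assumed to be set up as elsewhere in this guide):
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
use sipha::syntax::SyntaxNode;

// With recovery enabled, a best-effort tree is produced even when errors
// occur, so an editor can show both the tree and the diagnostics.
let config = LlConfig {
    error_recovery: true,
    ..Default::default()
};
let mut parser = LlParser::new(&grammar, config)
    .expect("Failed to create parser");
let result = parser.parse(&tokens, entry_point);
let tree = SyntaxNode::new_root(result.root.clone());
let diagnostics = &result.errors;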
Rich Diagnostics
With the diagnostics feature enabled, errors include:
- Source code snippets
- Highlighted error locations
- Suggestions and hints
- Related code locations
What’s Covered
This section covers:
- Error Types: Understanding ParseError, LexerError, and ParseWarning
- Working with Errors: Checking, reporting, and extracting error information
- Error Recovery: Configuring and understanding error recovery strategies
- Rich Diagnostics: Using miette for beautiful error reporting
- Best Practices: Guidelines and common patterns
Quick Start
The most basic error handling:
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
let result = parser.parse(&tokens, entry_point);
// Always check for errors
if !result.errors.is_empty() {
for error in &result.errors {
eprintln!("Error: {}", error);
}
return Err("Parsing failed".into());
}
// Check for warnings
for warning in &result.warnings {
eprintln!("Warning: {}", warning.message);
}
Next Steps
- Learn about Error Types to understand what errors you might encounter
- See Working with Errors for practical error handling patterns
- Explore Error Recovery to configure recovery strategies
Error Types
Sipha defines several error types to represent different kinds of problems that can occur during parsing and lexing.
ParseError
ParseError represents errors that occur during parsing. It’s an enum with four variants:
use sipha::error::ParseError;
use sipha::syntax::{TextRange, TextSize};
// Unexpected token error
let error = ParseError::UnexpectedToken {
span: TextRange::new(TextSize::from(10), TextSize::from(15)),
expected: vec!["identifier".to_string(), "number".to_string()],
};
// Invalid syntax error
let error = ParseError::InvalidSyntax {
span: TextRange::new(TextSize::from(0), TextSize::from(5)),
message: "Invalid expression syntax".to_string(),
};
// Unexpected end of file
let error = ParseError::UnexpectedEof {
span: TextRange::new(TextSize::from(100), TextSize::from(100)),
expected: vec!["}".to_string(), ";".to_string()],
};
// Ambiguity error (GLR parser)
let error = ParseError::Ambiguity {
span: TextRange::new(TextSize::from(20), TextSize::from(25)),
alternatives: vec!["expression1".to_string(), "expression2".to_string()],
};
ParseError Variants
UnexpectedToken
Occurs when the parser encounters a token that doesn’t match any expected production.
- Most common parse error
- Contains a list of expected token types
- Parser can often recover by skipping the unexpected token
Example:
// Input: "let x = 42 +"
// Error: UnexpectedToken with expected: ["number", "identifier", "("]
InvalidSyntax
Occurs when the input violates grammar rules in a way that can’t be attributed to a single unexpected token.
- Contains a descriptive error message
- Used for complex syntax violations
- May require more context to understand
Example:
// Input: "function() { return }"
// Error: InvalidSyntax with message: "Missing return value"
UnexpectedEof
Occurs when the parser reaches the end of input but expects more tokens.
- Contains a list of expected tokens
- Common when delimiters are missing (e.g., unclosed parentheses)
- Indicates incomplete input
Example:
// Input: "if (condition { ... }"
// Error: UnexpectedEof with expected: [")"]
Ambiguity (GLR backend only)
Occurs when the GLR parser finds multiple valid parse trees.
- Contains a list of alternative interpretations
- Requires disambiguation to resolve
- Only occurs with GLR parser backend
Example:
// Input: "1 + 2 * 3" (ambiguous without precedence)
// Error: Ambiguity with alternatives: ["(1 + 2) * 3", "1 + (2 * 3)"]
LexerError
LexerError represents errors that occur during tokenization (lexing):
use sipha::error::{LexerError, LexerErrorKind};
use sipha::syntax::{TextRange, TextSize};
let error = LexerError::new(
TextRange::new(TextSize::from(5), TextSize::from(6)),
LexerErrorKind::UnexpectedChar { char: '!' },
);
LexerErrorKind Variants
UnexpectedChar
An invalid character was encountered that doesn’t match any token pattern.
- Contains the unexpected character
- Usually indicates a typo or unsupported character
Example:
// Input: "let x = 42#"
// Error: UnexpectedChar { char: '#' }
UnterminatedString
A string literal was started but never closed.
- Common when quotes are mismatched
- Indicates incomplete string literal
Example:
// Input: "let s = \"hello world"
// Error: UnterminatedString
InvalidEscape
An invalid escape sequence was found in a string.
- Contains the invalid escape sequence
- Common with malformed escape sequences
Example:
// Input: "let s = \"hello\\zworld\""
// Error: InvalidEscape { escape: "\\z" }
InvalidNumber
A number literal has invalid format.
- Contains a reason explaining the issue
- Common with malformed numeric literals
Example:
// Input: "let x = 123abc"
// Error: InvalidNumber { reason: "Invalid digit 'a' in number" }
UnexpectedEof
End of file was reached unexpectedly during tokenization.
- Usually indicates incomplete input
- Common when input is cut off mid-token
Example:
// Input: "let x = 4" (cut off)
// Error: UnexpectedEof
ParseWarning
ParseWarning represents non-fatal issues that don’t prevent parsing:
use sipha::error::{ParseWarning, Severity};
use sipha::syntax::{TextRange, TextSize};
// Create warnings with different severity levels
let warning = ParseWarning::warning(
TextRange::new(TextSize::from(0), TextSize::from(10)),
"Deprecated syntax used".to_string(),
);
let info = ParseWarning::info(
TextRange::new(TextSize::from(10), TextSize::from(20)),
"Consider using newer syntax".to_string(),
);
let hint = ParseWarning::hint(
TextRange::new(TextSize::from(20), TextSize::from(30)),
"This can be simplified".to_string(),
);
Severity Levels
Warning
Important issues that should be addressed.
- Indicates potential problems
- Should be shown to users
- May indicate deprecated or problematic code
Info
Informational messages.
- Provides context or suggestions
- Less urgent than warnings
- Can be shown optionally
Hint
Suggestions for improvement.
- Optimization opportunities
- Style suggestions
- Can be shown in IDE tooltips
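A small sketch of acting on these levels, using the warnings collection shown in the Quick Start (which levels you surface is a policy choice):
use sipha::error::Severity;

// Report Warning-level items prominently; keep Info and Hint for optional output.
for warning in &result.warnings {
    match warning.severity {
        Severity::Warning => eprintln!("warning: {} at {:?}", warning.message, warning.span),
        Severity::Info | Severity::Hint => println!("note: {}", warning.message),
    }
}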
Converting Between Error Types
Lexer errors can be converted to parse errors:
use sipha::error::{LexerError, ParseError};
let lexer_error: LexerError = /* ... */;
let parse_error: ParseError = lexer_error.into();
This is useful when you want to handle all errors uniformly.
Error Methods
All error types provide useful methods:
ParseError Methods
use sipha::error::ParseError;
// Get the error location
let span = error.span();
// Format expected tokens as a readable string
let expected_str = error.format_expected();
// Returns: "identifier or number" or "identifier, number, or string"
// Get "did you mean" suggestions
if let Some(suggestion) = error.did_you_mean("identifer") {
println!("{}", suggestion); // "Did you mean 'identifier'?"
}
// Get context around the error
if let Some((before, error_span, after)) = error.get_context(source_code) {
println!("Context: ...{}[{}]{}...", before, error_span, after);
}
// Format error with full context
let formatted = error.format_with_context(source_code, Some("actual_token"));
println!("{}", formatted);
LexerError Methods
use sipha::error::LexerError;
// Get the error location
let span = error.span();
// Get the error kind
let kind = error.kind();
Next Steps
- Learn about Working with Errors to see how to use these error types
- See Error Recovery to understand how errors are handled
- Check Best Practices for guidelines on error handling
Working with Errors
This chapter covers how to check, report, and extract information from errors in your parsers.
Working with Parse Results
After parsing, check the ParseResult for errors and warnings:
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
use sipha::syntax::SyntaxNode;
use sipha::error::Severity;
let result = parser.parse(&tokens, entry_point);
// Check for errors
if !result.errors.is_empty() {
eprintln!("Parsing failed with {} errors:", result.errors.len());
for error in &result.errors {
eprintln!(" Error: {}", error);
eprintln!(" Location: {:?}", error.span());
}
}
// Check for warnings
if !result.warnings.is_empty() {
eprintln!("Found {} warnings:", result.warnings.len());
for warning in &result.warnings {
eprintln!(" [{}] {} at {:?}",
match warning.severity {
Severity::Warning => "WARNING",
Severity::Info => "INFO",
Severity::Hint => "HINT",
},
warning.message,
warning.span
);
}
}
// Even with errors, a tree may be available
let root = SyntaxNode::new_root(result.root.clone());
Error Information
Getting Error Details
ParseError provides several methods for extracting information:
use sipha::error::ParseError;
// Get the error location
let span = error.span();
// Format expected tokens as a readable string
let expected_str = error.format_expected();
// Returns: "identifier or number" or "identifier, number, or string"
// Get "did you mean" suggestions
if let Some(suggestion) = error.did_you_mean("identifer") {
println!("{}", suggestion); // "Did you mean 'identifier'?"
}
// Get context around the error
if let Some((before, error_span, after)) = error.get_context(source_code) {
println!("Context: ...{}[{}]{}...", before, error_span, after);
}
// Format error with full context
let formatted = error.format_with_context(source_code, Some("actual_token"));
println!("{}", formatted);
Example: Comprehensive Error Reporting
use sipha::error::ParseError;
fn report_parse_errors(errors: &[ParseError], source: &str) {
for (i, error) in errors.iter().enumerate() {
println!("\nError {}: {}", i + 1, error);
println!(" Location: {:?}", error.span());
// Show context
if let Some((before, error_span, after)) = error.get_context(source) {
println!(" Context: ...{}[{}]{}...", before, error_span, after);
}
// Show expected tokens
match error {
ParseError::UnexpectedToken { expected, .. }
| ParseError::UnexpectedEof { expected, .. } => {
if !expected.is_empty() {
println!(" Expected: {}", error.format_expected());
}
}
ParseError::InvalidSyntax { message, .. } => {
println!(" Details: {}", message);
}
ParseError::Ambiguity { alternatives, .. } => {
println!(" Alternatives: {}", alternatives.join(", "));
}
}
}
}
Handling Lexer Errors
Lexer errors occur during tokenization and can be handled separately:
use sipha::lexer::LexerBuilder;
use sipha::error::{LexerError, LexerErrorKind};
let lexer = LexerBuilder::new()
// ... configure lexer
.build(eof_kind, default_kind)?;
match lexer.tokenize(input) {
Ok(tokens) => {
// Tokenization succeeded
}
Err(errors) => {
// Handle lexer errors
for error in errors {
eprintln!("Lexer error at {:?}: {}", error.span(), error.kind());
match error.kind() {
LexerErrorKind::UnexpectedChar { char } => {
eprintln!(" Unexpected character: '{}'", char);
}
LexerErrorKind::UnterminatedString => {
eprintln!(" String literal not closed");
}
LexerErrorKind::InvalidEscape { escape } => {
eprintln!(" Invalid escape sequence: {}", escape);
}
LexerErrorKind::InvalidNumber { reason } => {
eprintln!(" Invalid number: {}", reason);
}
LexerErrorKind::UnexpectedEof => {
eprintln!(" Unexpected end of file");
}
}
}
}
}
Lexer errors can also be converted to parse errors:
use sipha::error::{LexerError, ParseError};
let lexer_error: LexerError = /* ... */;
let parse_error: ParseError = lexer_error.into();
Error Metrics
ParseResult includes metrics about the parsing process:
let result = parser.parse(&tokens, entry_point);
println!("Parsing metrics:");
println!(" Tokens consumed: {}", result.metrics.tokens_consumed);
println!(" Nodes created: {}", result.metrics.nodes_created);
println!(" Errors recovered: {}", result.metrics.errors_recovered);
println!(" Cache hits: {}", result.metrics.cache_hits);
println!(" Parse time: {:?}", result.metrics.parse_time);
These metrics are useful for:
- Performance analysis: Understanding parse performance
- Debugging: Identifying problematic grammar rules
- Optimization: Finding bottlenecks in parsing
Error Handling in Incremental Parsing
Incremental parsing maintains error information across edits:
use sipha::incremental::IncrementalParser;
let mut incremental = IncrementalParser::new(parser);
// Initial parse
let result1 = incremental.parse_incremental(/* ... */);
// After edit, errors are updated for affected regions
let result2 = incremental.parse_incremental(/* ... */);
// Errors from previous parse are preserved if not affected by edit
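A slightly more concrete sketch of the same flow, assuming the five-argument parse_incremental form shown in the Performance chapter and that the previous tree is passed as an Option (argument types may differ in your version; the edit construction is elided):
use sipha::incremental::{IncrementalParser, TextEdit};
let mut incremental = IncrementalParser::new(parser);
// Initial parse: no previous tree, no edits
let result1 = incremental.parse_incremental(&tokens, None, &[], entry_point, Some(&grammar));
// After an edit: re-tokenize the changed source and describe the change
let edits: Vec<TextEdit> = /* built from the text change */;
let result2 = incremental.parse_incremental(
    &new_tokens,
    Some(&result1.root),
    &edits,
    entry_point,
    Some(&grammar),
);
// Errors in untouched regions are carried over from the previous parse
eprintln!("errors after edit: {}", result2.errors.len());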
Next Steps
- Learn about Error Recovery to understand how errors are handled automatically
- See Rich Diagnostics for beautiful error reporting
- Check Best Practices for guidelines on error handling
Error Recovery
Sipha parsers attempt to recover from errors and continue parsing. This is essential for providing good user experience in IDEs and language servers.
Overview
Error recovery allows parsers to:
- Continue parsing despite errors
- Provide partial results even when input has errors
- Maintain parse state for interactive applications
- Report multiple errors in a single pass
Automatic Recovery
By default, parsers use automatic error recovery:
use sipha::backend::ll::{LlParser, LlConfig};
// Default config includes error recovery
let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)?;
// Parser will attempt to recover from errors
let result = parser.parse(&tokens, entry_point);
// Check how many errors were recovered from
println!("Errors recovered: {}", result.metrics.errors_recovered);
Even when errors occur, the parser will:
- Report the error
- Attempt to recover
- Continue parsing
- Return a parse tree (possibly incomplete)
Recovery Strategies
Parsers use several strategies to recover:
1. Token Skipping
Skip unexpected tokens and continue parsing.
Example:
// Input: "let x = 42 + + 10"
// Parser skips the second "+" and continues
// Error reported, but parsing continues
2. Token Insertion
Insert missing tokens (e.g., semicolons, closing delimiters).
Example:
// Input: "let x = 42 let y = 10"
// Parser may insert a semicolon after "42"
// Error reported, but both statements parsed
3. Rule Skipping
Skip to next statement or block when current rule fails.
Example:
// Input: "let x = invalid; let y = 10;"
// Parser skips the first statement and continues with the second
// Error reported for first statement, second statement parsed successfully
4. Best Effort
Continue parsing despite errors, producing the best possible result.
Example:
// Input with multiple errors
// Parser attempts to parse as much as possible
// All errors reported, partial tree returned
The specific strategy depends on the parser backend and configuration.
Configuring Recovery
Different backends support different recovery options:
use sipha::backend::ll::{LlParser, LlConfig};
let config = LlConfig {
error_recovery: true, // Enable error recovery
max_recovery_attempts: 10, // Limit recovery attempts
..Default::default()
};
let mut parser = LlParser::new(&grammar, config)?;
Recovery Options
error_recovery (bool)
- Enable or disable error recovery
- Default: true
- When disabled, parser stops at first error
max_recovery_attempts (usize)
- Maximum number of recovery attempts per parse
- Default: varies by backend
- Prevents infinite recovery loops
When to Use Recovery
Use Recovery For:
- IDEs and Language Servers: Need to parse incomplete or erroneous code
- Interactive Tools: Users are actively editing code
- Error Reporting: Want to find all errors in one pass
- Partial Results: Need parse tree even with errors
Disable Recovery For:
- Strict Validation: Want to fail fast on errors
- Batch Processing: Errors should stop processing
- Testing: Want to catch all errors immediately
- Performance: Recovery has overhead
Recovery Examples
Example 1: Missing Delimiter
// Input: "if (condition { ... }"
// Parser behavior:
// 1. Reports UnexpectedEof error expecting ")"
// 2. May insert ")" and continue
// 3. Parses the block successfully
// Result: Error reported, but parse tree includes the block
Example 2: Unexpected Token
// Input: "let x = 42 +"
// Parser behavior:
// 1. Reports UnexpectedToken error
// 2. Skips the "+" or inserts placeholder
// 3. Continues parsing
// Result: Error reported, variable declaration partially parsed
Example 3: Invalid Syntax
// Input: "function() { return }"
// Parser behavior:
// 1. Reports InvalidSyntax error
// 2. Marks statement as incomplete
// 3. Continues parsing rest of function
// Result: Error reported, function body partially parsed
Recovery Quality
Recovery quality depends on:
- Grammar structure: Well-structured grammars recover better
- Error location: Errors at statement boundaries recover better
- Parser backend: Different backends have different recovery strategies
- Configuration: Recovery settings affect behavior
Limitations
Error recovery has some limitations:
- May produce incorrect trees: Recovered trees may not match intended structure
- Can mask errors: Multiple errors may be reduced to fewer reported errors
- Performance overhead: Recovery adds processing time
- Not always possible: Some errors can’t be recovered from
Best Practices
- Always check errors: Even with recovery, check result.errors
- Validate recovered trees: Don’t trust recovered parse trees completely
- Limit recovery attempts: Use max_recovery_attempts to prevent loops
- Consider context: Recovery works better with more context
- Test recovery: Test your grammar with various error scenarios
Next Steps
- See Working with Errors for how to handle recovered errors
- Learn about Rich Diagnostics for better error reporting
- Check Best Practices for guidelines on error recovery
Rich Diagnostics
When the diagnostics feature is enabled, errors integrate with miette for beautiful error reporting with source code snippets, highlighted locations, and helpful suggestions.
Enabling Diagnostics
Add the diagnostics feature to your Cargo.toml:
[dependencies]
sipha = { version = "0.5.0", features = ["diagnostics"] }
This enables:
- Rich error formatting with source code snippets
- Highlighted error locations
- Suggestions and hints
- Integration with editor tooling
Using Diagnostics
Errors automatically implement the Diagnostic trait from miette:
#[cfg(feature = "diagnostics")]
use sipha::error::ParseError;
// Errors automatically implement Diagnostic trait
let error: ParseError = /* ... */;
// Print with rich formatting
eprintln!("{:#}", error);
// Errors include:
// - Source code snippets
// - Highlighted error locations
// - Suggestions and hints
// - Related code locations
Example Output
With diagnostics enabled, errors are displayed with rich formatting:
Error: Unexpected token
┌─ example.txt:10:15
│
10 │ let x = 42 +
│ ^ Unexpected token
│
= Expected: identifier, number, or operator
= Help: Did you mean to add an expression after '+'?
Diagnostic Features
Source Code Snippets
Diagnostics automatically include source code around the error:
let error: ParseError = /* ... */;
eprintln!("{:#}", error);
// Shows:
// ┌─ file.rs:42:10
// │
// 42 │ let x = invalid
// │ ^^^^^^^^ Error here
Highlighted Locations
Error locations are highlighted in the source code:
// The error span is automatically highlighted
// Multiple related errors can be shown together
Suggestions
Diagnostics can include suggestions:
// "Did you mean" suggestions
// Fix suggestions
// Helpful hints
Related Locations
Multiple related errors can be shown together:
// Errors at different locations
// Related errors in the same context
// Error chains
Integration with Editors
Diagnostics integrate with editor tooling:
- Language Servers: Errors are reported via LSP
- IDE Integration: Errors shown in editor
- Terminal Output: Beautiful terminal output
- JSON Output: Structured error data
Custom Diagnostic Messages
You can customize diagnostic messages:
use sipha::error::ParseError;
// Create errors with custom messages
let error = ParseError::invalid_syntax_with_suggestion(
span,
"Invalid expression",
"Try adding parentheses around the expression"
);
Diagnostic Severity
Diagnostics support different severity levels:
- Error: Fatal errors that prevent parsing
- Warning: Non-fatal issues (via ParseWarning)
- Info: Informational messages
- Hint: Suggestions for improvement
Example: Full Diagnostic Output
#[cfg(feature = "diagnostics")]
use sipha::error::ParseError;
use miette::Diagnostic;
fn report_with_diagnostics(error: &ParseError, source: &str) {
// Print with full diagnostic formatting
eprintln!("{:#}", error);
// Or use miette's reporting features
// miette::set_hook(/* ... */);
}
Performance Considerations
Diagnostics have minimal performance impact:
- Compile-time: Feature is gated, no overhead when disabled
- Runtime: Small overhead for formatting
- Memory: Source code snippets are included in errors
When to Use Diagnostics
Use diagnostics for:
- IDEs and Language Servers: Better user experience
- Development Tools: Clearer error messages
- Documentation: Better error examples
- Debugging: Easier to understand errors
Next Steps
- See Working with Errors for basic error handling
- Learn about Error Recovery for recovery strategies
- Check Best Practices for guidelines
Best Practices
This chapter provides guidelines and common patterns for error handling in Sipha parsers.
Always Check for Errors
Always check for errors, even if you expect success:
let result = parser.parse(&tokens, entry_point);
// Always check errors, even if you expect success
if !result.errors.is_empty() {
// Handle errors appropriately
return Err(format!("Parsing failed: {} errors", result.errors.len()));
}
Why: Errors can occur even in valid-looking input due to:
- Grammar ambiguities
- Edge cases
- Recovery attempts
- Incomplete input
Provide Context in Error Messages
When reporting errors to users, include:
- Source location: Line and column numbers
- Context: Surrounding code
- Suggestions: What was expected or how to fix it
use sipha::error::ParseError;
fn report_error(error: &ParseError, source: &str) {
println!("Error: {}", error);
println!("Location: {:?}", error.span());
// Show context
if let Some((before, error_span, after)) = error.get_context(source) {
println!("Context: ...{}[{}]{}...", before, error_span, after);
}
// Show suggestions
if let Some(suggestion) = error.did_you_mean("actual_token") {
println!("Suggestion: {}", suggestion);
}
}
Handle Warnings Appropriately
Warnings don’t prevent parsing, but should be addressed:
use sipha::error::Severity;
// Warnings don't prevent parsing, but should be addressed
for warning in &result.warnings {
match warning.severity {
Severity::Warning => {
// Log or report warnings
eprintln!("Warning: {}", warning.message);
}
Severity::Info => {
// Optionally show info messages
// Useful for development tools
}
Severity::Hint => {
// Hints can be shown in IDE tooltips
// Less urgent, more suggestions
}
}
}
Use Error Recovery Wisely
Error recovery is great for IDEs, but consider disabling it for:
- Strict validation: When you want to fail fast
- Batch processing: When errors should stop processing
- Testing: When you want to catch all errors
use sipha::backend::ll::{LlParser, LlConfig};
// For strict validation
let strict_config = LlConfig {
error_recovery: false, // Disable recovery
..Default::default()
};
// For IDE/language server
let ide_config = LlConfig {
error_recovery: true, // Enable recovery
max_recovery_attempts: 100, // Allow many attempts
..Default::default()
};
Leverage Diagnostics
If building an IDE or language server, enable the diagnostics feature for:
- Better error messages
- Source code snippets
- Helpful suggestions
- Integration with editor tooling
[dependencies]
sipha = { version = "0.5.0", features = ["diagnostics"] }
Common Error Patterns
Pattern 1: Missing Delimiter
// Input: "if (condition { ... }"
// Error: UnexpectedEof with expected: [")"]
// Handling:
// 1. Report the error with context
// 2. Show what delimiter is missing
// 3. Suggest where to add it
Pattern 2: Unexpected Token
// Input: "let x = 42 +"
// Error: UnexpectedToken with expected: [number, identifier, "("]
// Handling:
// 1. Show the unexpected token
// 2. List expected alternatives
// 3. Provide "did you mean" suggestions
Pattern 3: Invalid Syntax
// Input: "function() { return }"
// Error: InvalidSyntax with message: "Missing return value"
// Handling:
// 1. Explain the syntax violation
// 2. Show the problematic code
// 3. Suggest how to fix it
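Putting these patterns together, a minimal sketch of a helper that turns a ParseError into a user-facing message using the methods shown earlier (the wording is illustrative, not a fixed format):
use sipha::error::ParseError;
fn user_message(error: &ParseError, source: &str) -> String {
    let mut msg = format!("{} at {:?}", error, error.span());
    // List what the parser expected, if anything
    let expected = error.format_expected();
    if !expected.is_empty() {
        msg.push_str(&format!("\n  expected: {}", expected));
    }
    // Show a little surrounding source for orientation
    if let Some((before, span, after)) = error.get_context(source) {
        msg.push_str(&format!("\n  near: ...{}[{}]{}...", before, span, after));
    }
    msg
}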
Error Handling Checklist
When implementing error handling:
- Check result.errors after every parse
- Check result.warnings for non-fatal issues
- Provide context in error messages
- Use appropriate severity levels for warnings
- Enable diagnostics for better error reporting
- Test error recovery behavior
- Handle lexer errors separately if needed
- Consider error metrics for debugging
Performance Considerations
Error handling has minimal performance impact:
- Error checking: Negligible overhead
- Error recovery: Small overhead, but enables partial results
- Diagnostics: Small overhead for formatting
- Metrics: Minimal overhead for tracking
Testing Error Handling
Test your error handling:
#[test]
fn test_error_handling() {
let result = parser.parse(&invalid_tokens, entry_point);
// Verify errors are reported
assert!(!result.errors.is_empty());
// Verify error details
let error = &result.errors[0];
assert!(error.span().start() > TextSize::zero());
// Verify recovery (if enabled)
if config.error_recovery {
assert!(result.metrics.errors_recovered > 0);
}
}
Error Handling in Production
For production code:
- Log errors: Use a proper logging framework (see the sketch after this list)
- Monitor metrics: Track error rates and recovery success
- User-friendly messages: Translate technical errors to user messages
- Graceful degradation: Continue operation when possible
- Error reporting: Collect error data for improvement
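As a sketch of the first two points, assuming the log crate (or any comparable framework; it is not a Sipha dependency) and the metrics fields shown earlier:
use sipha::backend::ParserBackend;
// Log errors and track recovery metrics after each parse
let result = parser.parse(&tokens, entry_point);
for error in &result.errors {
    log::error!("parse error at {:?}: {}", error.span(), error);
}
log::info!(
    "parsed {} tokens, recovered from {} errors in {:?}",
    result.metrics.tokens_consumed,
    result.metrics.errors_recovered,
    result.metrics.parse_time,
);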
Next Steps
- Review Error Types to understand error variants
- See Working with Errors for practical patterns
- Learn about Error Recovery for recovery strategies
- Explore Rich Diagnostics for better error reporting
Performance
This chapter covers performance optimization tips and benchmarking for Sipha.
General Tips
Use Incremental Parsing
For interactive applications, always use incremental parsing:
let mut incremental = IncrementalParser::new(parser);
// Much faster for small edits
Choose the Right Backend
- LL(k): Fast, good for most grammars
- LR: Fast, good for batch parsing
- GLR: Slower, only use for ambiguous grammars
Optimize Grammar
- Factor common patterns: Extract common subexpressions
- Avoid deep nesting: Flatten deeply nested structures
- Use appropriate lookahead: Don’t use more lookahead than needed
Lexer Performance
Pattern Ordering
Order patterns from most specific to least specific:
// Good: Specific patterns first
.token(MySyntaxKind::Keyword, Pattern::Literal("if".into()))
.token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::letters()))
// Bad: General patterns first
.token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::letters()))
.token(MySyntaxKind::Keyword, Pattern::Literal("if".into()))
Use Character Classes
Prefer character classes over regex when possible:
// Good: Character class
.token(MySyntaxKind::Number, Pattern::CharClass(CharSet::digits()))
// Slower: Regex
.token(MySyntaxKind::Number, Pattern::Regex(r"\d+".into()))
Keyword Trie
Keywords are matched using a trie for O(m) performance where m is keyword length.
Parser Performance
Cache Population
Always provide grammar for cache population:
// Good: Cache populated
parser.parse_incremental(&tokens, old_tree, &edits, entry, Some(&grammar))
// Bad: No cache
parser.parse_incremental(&tokens, old_tree, &edits, entry, None)
Reuse Budget
Adjust reuse budget based on use case:
// For large files
let budget = ReuseBudget::Heuristic {
max_depth: 30,
max_nodes: 2000,
};
// For small files
let budget = ReuseBudget::Fixed(100);
Memory Optimization
Arena Allocation
Green nodes are arena-allocated for efficient memory usage.
Node Sharing
Incremental parsing shares unchanged nodes, reducing memory usage.
Cache Eviction
Cache automatically evicts old versions:
// Keep last 2 versions (default)
cache.evict_old_entries(2);
Benchmarking
Measure Parse Time
let start = std::time::Instant::now();
let result = parser.parse(&tokens, entry);
let duration = start.elapsed();
println!("Parse time: {:?}", duration);
println!("Metrics: {:?}", result.metrics);
Compare Backends
// LL parser
let ll_start = Instant::now();
ll_parser.parse(&tokens, entry);
let ll_time = ll_start.elapsed();
// LR parser
let lr_start = Instant::now();
lr_parser.parse(&tokens, entry);
let lr_time = lr_start.elapsed();
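For repeatable numbers, a benchmark harness helps. A minimal sketch using the criterion crate (an assumed dev-dependency, not part of Sipha; setup_parser is a hypothetical helper):
use criterion::{criterion_group, criterion_main, Criterion};
fn bench_full_parse(c: &mut Criterion) {
    // Build the lexer, grammar, parser, and token stream once, outside the measured loop
    let (mut parser, tokens, entry) = setup_parser();
    c.bench_function("full parse", |b| {
        b.iter(|| parser.parse(&tokens, entry.clone()))
    });
}
criterion_group!(benches, bench_full_parse);
criterion_main!(benches);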
Profiling
Use a Profiler
Profile your parser to find bottlenecks:
# With perf
perf record --call-graph=dwarf cargo run --release
perf report
# With valgrind
valgrind --tool=callgrind cargo run --release
Common Bottlenecks
- Lexer: Pattern matching, especially regex
- Parser: Table lookups, especially for large grammars
- Tree construction: Node allocation and linking
Best Practices
- Profile first: Profile before optimizing
- Measure changes: Benchmark before and after changes
- Use incremental parsing: Always use for interactive apps
- Optimize grammar: Factor and simplify grammar
- Choose right backend: Use appropriate backend for use case
Next Steps
- See Grammar Analysis for grammar optimization
- Check Architecture for design considerations
Custom Patterns
This chapter shows how to extend lexer patterns with custom matching logic.
Custom Matchers
For complex tokenization logic, use custom matchers:
let lexer = LexerBuilder::new()
.custom_token(MySyntaxKind::String, |text, pos| {
// Custom string matching
if text[pos..].starts_with('"') {
let mut end = pos + 1;
while end < text.len() && text.as_bytes()[end] != b'"' {
if text.as_bytes()[end] == b'\\' {
end += 2; // Skip escape sequence
} else {
end += 1;
}
}
if end < text.len() {
let value = text[pos+1..end].to_string();
Some((end - pos + 1, TokenValue::String(value)))
} else {
None // Unterminated string
}
} else {
None
}
})
.build(MySyntaxKind::Eof, MySyntaxKind::Ident)
.expect("Failed to build lexer");
Custom Matcher Signature
Custom matchers have this signature:
Fn(&str, usize) -> Option<(usize, TokenValue)>
Where:
- Input: &str is the text, usize is the starting position
- Output: Option<(usize, TokenValue)>, where:
  - usize is the number of characters matched
  - TokenValue is the token’s value (if any)
TokenValue
Token values can be:
pub enum TokenValue {
Integer(i64),
Float(f64),
String(String),
None,
}
Examples
String Literal
.custom_token(MySyntaxKind::String, |text, pos| {
if !text[pos..].starts_with('"') {
return None;
}
let mut end = pos + 1;
while end < text.len() {
match text.as_bytes()[end] {
b'"' => {
let value = text[pos+1..end].to_string();
return Some((end - pos + 1, TokenValue::String(value)));
}
b'\\' if end + 1 < text.len() => {
end += 2; // Skip escape sequence
}
_ => end += 1,
}
}
None // Unterminated string
})
Floating Point Number
.custom_token(MySyntaxKind::Float, |text, pos| {
let mut end = pos;
let mut has_dot = false;
// Optional sign
if end < text.len() && (text.as_bytes()[end] == b'+' || text.as_bytes()[end] == b'-') {
end += 1;
}
// Digits before dot
while end < text.len() && text.as_bytes()[end].is_ascii_digit() {
end += 1;
}
// Dot
if end < text.len() && text.as_bytes()[end] == b'.' {
has_dot = true;
end += 1;
}
// Digits after dot
while end < text.len() && text.as_bytes()[end].is_ascii_digit() {
end += 1;
}
if has_dot && end > pos {
let value = text[pos..end].parse::<f64>().ok()?;
Some((end - pos, TokenValue::Float(value)))
} else {
None
}
})
Multiline Comment
.custom_token(MySyntaxKind::Comment, |text, pos| {
if !text[pos..].starts_with("/*") {
return None;
}
let mut end = pos + 2;
while end + 1 < text.len() {
if text.as_bytes()[end] == b'*' && text.as_bytes()[end + 1] == b'/' {
let value = text[pos+2..end].to_string();
return Some((end - pos + 2, TokenValue::String(value)));
}
end += 1;
}
None // Unterminated comment
})
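Hexadecimal Integer
Another sketch, not taken from the built-in examples: a matcher that produces TokenValue::Integer for 0x-prefixed literals (MySyntaxKind::Number is a placeholder kind):
.custom_token(MySyntaxKind::Number, |text, pos| {
    // Require a 0x / 0X prefix
    let rest = &text[pos..];
    if !rest.starts_with("0x") && !rest.starts_with("0X") {
        return None;
    }
    // Count the hex digits after the prefix
    let digits = rest[2..].bytes().take_while(|b| b.is_ascii_hexdigit()).count();
    if digits == 0 {
        return None;
    }
    // Parse the digits; overflow simply fails the match
    let value = i64::from_str_radix(&rest[2..2 + digits], 16).ok()?;
    Some((2 + digits, TokenValue::Integer(value)))
})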
Best Practices
- Return early: Return None quickly if the pattern doesn’t match
- Handle edge cases: Handle EOF, invalid sequences, etc.
- Escape sequences: Properly handle escape sequences
- Performance: Keep matching logic efficient
- Error handling: Return None for invalid input
Combining with Regular Patterns
Custom matchers work alongside regular patterns:
let lexer = LexerBuilder::new()
// Regular patterns
.token(MySyntaxKind::Number, Pattern::CharClass(CharSet::digits()))
.token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
// Custom matcher
.custom_token(MySyntaxKind::String, |text, pos| {
// Custom logic; must return Option<(usize, TokenValue)>
None
})
.build(MySyntaxKind::Eof, MySyntaxKind::Ident)
.expect("Failed to build lexer");
Priority
Custom matchers have the same priority as regular patterns based on definition order.
Next Steps
Grammar Analysis
Sipha provides tools for analyzing and optimizing grammars.
Grammar Validation
Validate grammars before use:
use sipha::backend::ll::LlParser;
let errors = LlParser::validate(&grammar);
if !errors.is_empty() {
for error in errors {
eprintln!("Grammar error: {:?}", error);
}
}
Left Recursion Detection
Detect left-recursive rules:
use sipha::grammar::analysis::detect_left_recursion;
let left_recursive = detect_left_recursion(&grammar);
for rule in left_recursive {
eprintln!("Left-recursive rule: {:?}", rule);
}
Reachability Analysis
Find unreachable rules:
use sipha::grammar::analysis::find_unreachable_rules;
let unreachable = find_unreachable_rules(&grammar);
for rule in unreachable {
eprintln!("Unreachable rule: {:?}", rule);
}
FIRST and FOLLOW Sets
Compute FIRST and FOLLOW sets for LL parsing:
use sipha::grammar::analysis::compute_first_sets;
use sipha::grammar::analysis::compute_follow_sets;
let first_sets = compute_first_sets(&grammar);
let follow_sets = compute_follow_sets(&grammar);
Conflict Detection
Detect shift/reduce and reduce/reduce conflicts:
use sipha::backend::lr::LrParser;
let conflicts = LrParser::detect_conflicts(&grammar);
for conflict in conflicts {
eprintln!("Conflict: {:?}", conflict);
}
Grammar Optimization
Factor Common Patterns
Extract common subexpressions:
// Before: Duplicated pattern
.rule(MyNonTerminal::Expr, Expr::choice(vec![
Expr::seq(vec![Expr::non_terminal(MyNonTerminal::Term), Expr::token(plus_token)]),
Expr::seq(vec![Expr::non_terminal(MyNonTerminal::Term), Expr::token(minus_token)]),
]))
// After: Factored
.rule(MyNonTerminal::Expr, Expr::seq(vec![
Expr::non_terminal(MyNonTerminal::Term),
Expr::choice(vec![
Expr::token(plus_token),
Expr::token(minus_token),
]),
]))
Eliminate Left Recursion
Transform left-recursive rules:
// Before: Left-recursive
.rule(MyNonTerminal::Expr, Expr::choice(vec![
Expr::seq(vec![
Expr::non_terminal(MyNonTerminal::Expr),
Expr::token(plus_token),
Expr::non_terminal(MyNonTerminal::Term),
]),
Expr::non_terminal(MyNonTerminal::Term),
]))
// After: Right-recursive
.rule(MyNonTerminal::Expr, Expr::seq(vec![
Expr::non_terminal(MyNonTerminal::Term),
Expr::opt(Box::new(Expr::seq(vec![
Expr::token(plus_token),
Expr::non_terminal(MyNonTerminal::Expr),
]))),
]))
Grammar Visualization
Visualize grammar structure (planned feature):
// Future API
let dot = grammar.to_dot();
std::fs::write("grammar.dot", dot).unwrap();
Best Practices
- Validate early: Validate grammars before use
- Check for conflicts: Detect and resolve conflicts
- Optimize structure: Factor and simplify grammar
- Test thoroughly: Test with various inputs
- Analyze performance: Profile grammar performance
Next Steps
- See Grammars for grammar definition
- Check Performance for optimization tips
Unicode Support
Sipha provides full Unicode support for identifiers and text when the unicode feature is enabled.
Enabling Unicode
Enable Unicode support in Cargo.toml:
[dependencies]
sipha = { version = "0.5.0", features = ["unicode"] }
Unicode Character Classes
Use Unicode character classes in patterns:
use sipha::lexer::{CharSet, Pattern};
// Unicode letters
let unicode_letters = CharSet::unicode_letters();
// Unicode digits
let unicode_digits = CharSet::unicode_digits();
// Unicode identifiers (letters, digits, and connectors)
let unicode_ident = CharSet::unicode_ident_start();
Unicode Identifiers
Match Unicode identifiers:
let lexer = LexerBuilder::new()
.token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::unicode_ident_start()))
.build(MySyntaxKind::Eof, MySyntaxKind::Ident)
.expect("Failed to build lexer");
Unicode Normalization
Sipha handles Unicode normalization automatically:
- NFC: Normalized Form Canonical Composition
- NFD: Normalized Form Canonical Decomposition
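To illustrate what normalization means, here is a small sketch using the unicode-normalization crate (an assumption for demonstration only; it is not necessarily what Sipha uses internally):
use unicode_normalization::UnicodeNormalization;
// "é" written two ways: precomposed vs. base letter plus combining accent
let composed = "\u{00E9}";        // é as a single code point
let decomposed = "e\u{0301}";     // e followed by a combining acute accent
assert_ne!(composed, decomposed); // different byte sequences
// After NFC normalization both spellings compare equal
let a: String = composed.nfc().collect();
let b: String = decomposed.nfc().collect();
assert_eq!(a, b);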
Examples
Japanese Identifiers
let lexer = LexerBuilder::new()
.token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::unicode_letters()))
.build(MySyntaxKind::Eof, MySyntaxKind::Ident)
.expect("Failed to build lexer");
// Can tokenize: 変数名, 関数名, etc.
Emoji in Strings
// Emoji are handled correctly in string literals
let input = r#""Hello 👋 World 🌍""#;
let tokens = lexer.tokenize(input).unwrap();
Performance
Unicode support has minimal performance impact:
- Character classes: Efficient Unicode property lookups
- Normalization: Cached normalization results
- Identifiers: Fast Unicode property checks
Best Practices
- Enable when needed: Only enable the unicode feature if needed
- Use appropriate classes: Use Unicode character classes for identifiers
- Handle normalization: Be aware of Unicode normalization
- Test with Unicode: Test with various Unicode inputs
Next Steps
Adding a Backend
This chapter shows how to implement a new parsing backend for Sipha.
ParserBackend Trait
All backends implement the ParserBackend trait:
pub trait ParserBackend<T, N>: Sized + Send
where
T: Token,
N: NonTerminal,
{
type Config: Default + Clone;
type Error: std::error::Error + Send + Sync + 'static;
type State: Send + Sync;
fn new(grammar: &Grammar<T, N>, config: Self::Config) -> Result<Self, Self::Error>;
fn parse(&mut self, input: &[T], entry: N) -> ParseResult<T, N>;
fn parse_with_session(...) -> ParseResult<T, N>;
fn validate(grammar: &Grammar<T, N>) -> Vec<GrammarError<T, N>>;
fn capabilities() -> BackendCapabilities;
fn state(&self) -> &Self::State;
}
Implementation Steps
1. Define Backend Type
pub struct MyParser<T, N>
where
T: Token,
N: NonTerminal,
{
grammar: Grammar<T, N>,
config: MyConfig,
state: MyState,
}
2. Implement ParserBackend
impl<T, N> ParserBackend<T, N> for MyParser<T, N>
where
T: Token,
N: NonTerminal,
{
type Config = MyConfig;
type Error = MyError;
type State = MyState;
fn new(grammar: &Grammar<T, N>, config: Self::Config) -> Result<Self, Self::Error> {
// Initialize parser
Ok(Self {
grammar: grammar.clone(),
config,
state: MyState::new(),
})
}
fn parse(&mut self, input: &[T], entry: N) -> ParseResult<T, N> {
// Implement parsing logic
// ...
}
fn parse_with_session(
&mut self,
input: &[T],
entry: N,
session: &IncrementalSession<'_, T::Kind>,
) -> ParseResult<T, N> {
// Implement incremental parsing
// ...
}
fn validate(grammar: &Grammar<T, N>) -> Vec<GrammarError<T, N>> {
// Validate grammar compatibility
// ...
}
fn capabilities() -> BackendCapabilities {
BackendCapabilities {
name: "MyParser",
algorithm: Algorithm::PEG, // Or other
supports_left_recursion: true,
supports_ambiguity: false,
supports_incremental: true,
supports_error_recovery: true,
max_lookahead: None,
}
}
fn state(&self) -> &Self::State {
&self.state
}
}
3. Implement Parsing Logic
Implement the core parsing algorithm:
fn parse(&mut self, input: &[T], entry: N) -> ParseResult<T, N> {
let mut builder = GreenNodeBuilder::new();
let mut errors = Vec::new();
let mut tokens = input.iter();
// Parse according to algorithm
// ...
ParseResult {
root: builder.finish().unwrap(),
errors,
warnings: Vec::new(),
metrics: ParseMetrics::default(),
#[cfg(feature = "backend-glr")]
forest: None,
_phantom: PhantomData,
}
}
4. Implement Incremental Parsing
Support incremental parsing:
fn parse_with_session(
&mut self,
input: &[T],
entry: N,
session: &IncrementalSession<'_, T::Kind>,
) -> ParseResult<T, N> {
// Find reusable nodes
// Re-parse only affected regions
// Combine reused and new nodes
// ...
}
5. Implement Validation
Validate grammar compatibility:
fn validate(grammar: &Grammar<T, N>) -> Vec<GrammarError<T, N>> {
let mut errors = Vec::new();
// Check for left recursion
// Check for ambiguity
// Check for other requirements
// ...
errors
}
Example: PEG Parser
Here’s a simplified example of a PEG parser:
pub struct PegParser<T, N> {
grammar: Grammar<T, N>,
config: PegConfig,
state: PegState,
}
impl<T, N> ParserBackend<T, N> for PegParser<T, N>
where
T: Token,
N: NonTerminal,
{
type Config = PegConfig;
type Error = PegError;
type State = PegState;
fn new(grammar: &Grammar<T, N>, config: Self::Config) -> Result<Self, Self::Error> {
Ok(Self {
grammar: grammar.clone(),
config,
state: PegState::new(),
})
}
fn parse(&mut self, input: &[T], entry: N) -> ParseResult<T, N> {
// PEG parsing logic
// ...
}
// ... other methods
}
Best Practices
- Follow existing patterns: Look at LL/LR/GLR implementations
- Handle errors gracefully: Return errors, don’t panic
- Support incremental: Implement incremental parsing if possible
- Validate grammar: Check grammar compatibility
- Document behavior: Document algorithm-specific behavior
Next Steps
- See the existing backends (LL, LR, GLR) for reference
- Check Contributing for development workflow
Custom Syntax Kinds
This chapter covers advanced usage of syntax kinds.
Advanced SyntaxKind Implementation
Additional Methods
The SyntaxKind trait provides optional methods:
impl SyntaxKind for MySyntaxKind {
fn is_terminal(self) -> bool {
// ...
}
fn is_trivia(self) -> bool {
// ...
}
// Optional: Mark keywords
fn is_keyword(self) -> bool {
matches!(self,
MySyntaxKind::If |
MySyntaxKind::While |
MySyntaxKind::For
)
}
// Optional: Mark literals
fn is_literal(self) -> bool {
matches!(self,
MySyntaxKind::Number |
MySyntaxKind::String |
MySyntaxKind::Boolean
)
}
}
Syntax Kind Organization
Grouping Kinds
Organize kinds logically:
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind {
// Keywords
If, While, For, Return,
// Operators
Plus, Minus, Multiply, Divide,
// Punctuation
LParen, RParen, LBrace, RBrace,
// Literals
Number, String, Boolean,
// Identifiers
Ident,
// Trivia
Whitespace, Comment,
// Non-terminals
Expr, Stmt, Block,
// Special
Eof, Error,
}
Using Macros
Use macros to reduce boilerplate:
macro_rules! define_syntax_kinds {
(
keywords: [$($keyword:ident),*],
operators: [$($op:ident),*],
// ...
) => {
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind {
$($keyword,)*
$($op,)*
// ...
}
impl SyntaxKind for MySyntaxKind {
fn is_keyword(self) -> bool {
matches!(self, $(MySyntaxKind::$keyword)|*)
}
// ...
}
};
}
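For example, a hypothetical invocation of the macro above (the categories and kind names are placeholders):
define_syntax_kinds! {
    keywords: [If, While, For, Return],
    operators: [Plus, Minus, Multiply, Divide],
    // ... remaining categories
}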
Syntax Kind Conversion
Converting to/from Strings
impl MySyntaxKind {
pub fn as_str(self) -> &'static str {
match self {
MySyntaxKind::If => "if",
MySyntaxKind::While => "while",
// ...
}
}
pub fn from_str(s: &str) -> Option<Self> {
match s {
"if" => Some(MySyntaxKind::If),
"while" => Some(MySyntaxKind::While),
// ...
_ => None,
}
}
}
Best Practices
- Use descriptive names: Choose clear, descriptive names
- Group logically: Organize kinds by category
- Document purpose: Document what each kind represents
- Use exhaustiveness: Use exhaustive matching in implementations
- Consider extensions: Design for future extensions
Next Steps
- See Syntax Kinds for basics
- Check Examples for real-world usage
Extending Lexers
This chapter shows how to extend lexers with custom tokenization logic.
Custom Tokenization
Custom Matchers
Use custom matchers for complex patterns:
let lexer = LexerBuilder::new()
.custom_token(MySyntaxKind::String, |text, pos| {
// Custom string matching
// ...
})
.build(MySyntaxKind::Eof, MySyntaxKind::Ident)
.expect("Failed to build lexer");
See Custom Patterns for details.
Extending Pattern Types
Adding New Pattern Types
To add new pattern types, extend the Pattern enum:
pub enum Pattern {
Literal(CompactString),
CharClass(CharSet),
Repeat { pattern: Box<Pattern>, min: usize, max: Option<usize> },
Regex(CompactString),
Any,
// Add new pattern types
Custom(CustomPattern),
}
Implementing Pattern Matching
Implement pattern matching logic:
impl Pattern {
fn matches(&self, text: &str, pos: usize) -> Option<usize> {
match self {
Pattern::Custom(custom) => custom.matches(text, pos),
// ... other patterns
}
}
}
Lexer Hooks
Pre-processing
Add pre-processing hooks:
pub struct LexerBuilder<K: SyntaxKind> {
// ...
preprocessors: Vec<Box<dyn Fn(&str) -> String>>,
}
impl<K: SyntaxKind> LexerBuilder<K> {
pub fn preprocess(mut self, f: impl Fn(&str) -> String + 'static) -> Self {
self.preprocessors.push(Box::new(f));
self
}
}
Post-processing
Add post-processing hooks:
impl<K: SyntaxKind> LexerBuilder<K> {
pub fn postprocess(mut self, f: impl Fn(&mut Vec<Token<K>>) + 'static) -> Self {
self.postprocessors.push(Box::new(f));
self
}
}
Incremental Lexing
Token Updates
Update tokens incrementally:
pub struct IncrementalLexer<K: SyntaxKind> {
lexer: CompiledLexer<K>,
old_tokens: Vec<Token<K>>,
}
impl<K: SyntaxKind> IncrementalLexer<K> {
pub fn tokenize_incremental(
&self,
text: &str,
edits: &[TextEdit],
) -> Vec<Token<K>> {
// Re-tokenize only affected regions
// Reuse unchanged tokens
// ...
}
}
Best Practices
- Use custom matchers: For complex patterns, use custom matchers
- Extend patterns carefully: Only add new pattern types if needed
- Test thoroughly: Test with various inputs
- Document behavior: Document custom behavior
- Consider performance: Keep custom logic efficient
Next Steps
- See Lexers for basic usage
- Check Custom Patterns for examples
Contributing
Thank you for your interest in contributing to Sipha! This document provides guidelines and instructions for contributing.
Code of Conduct
This project adheres to the Rust Code of Conduct. By participating, you are expected to uphold this code.
Getting Started
Prerequisites
- Rust toolchain (stable, beta, or nightly)
- cargo (comes with Rust)
- rustfmt and clippy (install with rustup component add rustfmt clippy)
Development Setup
1. Fork the repository and clone your fork:
   git clone https://github.com/yourusername/sipha.git
   cd sipha
2. Build the project:
   cargo build
3. Run the test suite:
   cargo test
4. Run clippy:
   cargo clippy --all-targets --all-features --workspace -- -W clippy::pedantic -W clippy::nursery -W clippy::cargo -D warnings
5. Check formatting:
   cargo fmt --all -- --check
Development Workflow
1. Create a branch for your changes:
   git checkout -b feature/your-feature-name
   # or
   git checkout -b fix/your-bug-fix
2. Make your changes following the coding standards below.
3. Write or update tests for your changes.
4. Ensure all tests pass:
   cargo test --all-features
5. Run clippy and fix any warnings:
   cargo clippy --all-targets --all-features --workspace -- -W clippy::pedantic -W clippy::nursery -W clippy::cargo -D warnings
6. Format your code:
   cargo fmt --all
7. Commit your changes with a clear, descriptive commit message:
   git commit -m "Add feature: description of what you added"
8. Push to your fork:
   git push origin feature/your-feature-name
9. Open a Pull Request on GitHub with a clear description of your changes.
Coding Standards
Code Style
- Follow Rust’s standard formatting conventions. Run cargo fmt before committing.
- Use cargo clippy to catch common issues and follow Rust best practices.
- We use pedantic clippy lints - some are allowed in clippy.toml and lib.rs, but try to follow them when possible.
Documentation
- Document all public APIs with doc comments.
- Include examples in doc comments where helpful.
- Update the README if you add new features or change existing behavior.
Testing
- Write tests for new features and bug fixes.
- Aim for good test coverage, especially for critical paths.
- Include both unit tests and integration tests where appropriate.
- Test with different feature combinations when applicable.
Error Handling
- Prefer Result types over panicking in library code.
- Provide clear, actionable error messages.
- Use appropriate error types from the error module.
Commit Messages
- Use clear, descriptive commit messages.
- Start with a verb in imperative mood (e.g., “Add”, “Fix”, “Update”).
- Reference issue numbers when applicable (e.g., “Fix #123: description”).
Example:
Add incremental parsing cache invalidation
Implements cache invalidation for incremental parsing when text edits
occur. This improves performance by only re-parsing affected regions.
Fixes #42
Pull Request Process
1. Ensure your PR is up to date with the target branch (usually trunk).
2. Write a clear description of what your PR does and why.
3. Reference related issues if applicable.
4. Ensure CI passes - all tests, clippy checks, and formatting checks must pass.
5. Request review from maintainers.
6. Address feedback promptly and update your PR as needed.
Feature Requests
If you have an idea for a new feature:
- Check existing issues to see if it’s already been discussed.
- Open an issue describing the feature, its use case, and potential implementation approach.
- Wait for maintainer feedback before implementing.
Bug Reports
When reporting bugs, please include:
- A clear description of the bug
- Steps to reproduce
- Expected behavior
- Actual behavior
- Your environment (Rust version, OS, etc.)
- Minimal code example if possible
Fuzz Testing
Sipha includes fuzz testing infrastructure to ensure robustness and correctness. Fuzz tests help find edge cases and potential panics in the parser.
Prerequisites
Install cargo-fuzz:
cargo install cargo-fuzz
Running Fuzz Tests
Navigate to the fuzz directory and run the fuzz targets:
cd fuzz
cargo fuzz run parser_fuzz
cargo fuzz run incremental_fuzz
Fuzz Targets
- parser_fuzz: Tests the parser with random byte streams to ensure no panics occur.
- incremental_fuzz: Differential fuzzing that compares incremental parsing results with full re-parsing to ensure correctness.
Adding New Fuzz Targets
To add a new fuzz target:
1. Create a new file in fuzz/fuzz_targets/ with your fuzz target code
2. Add a [[bin]] entry in fuzz/Cargo.toml pointing to your target
3. Run cargo fuzz list to verify the target is registered
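A new target is just a small libfuzzer_sys entry point. A minimal sketch (the file name and the build_lexer helper are hypothetical; the point is to exercise the library without panicking):
// fuzz/fuzz_targets/lexer_fuzz.rs
#![no_main]
use libfuzzer_sys::fuzz_target;
fuzz_target!(|data: &[u8]| {
    // Only feed valid UTF-8 to the lexer; the goal is "no panics", not success
    if let Ok(text) = std::str::from_utf8(data) {
        let lexer = build_lexer(); // hypothetical helper that builds a CompiledLexer
        let _ = lexer.tokenize(text);
    }
});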
Questions?
Feel free to open an issue for questions or discussions. We’re happy to help!
License
By contributing to Sipha, you agree that your contributions will be licensed under the MIT License.
Design Principles
This chapter explains the core design principles behind Sipha.
Incremental-First Design
Sipha is designed from the ground up for incremental parsing:
- Node reuse: Unchanged subtrees are automatically reused
- Cache management: Parse results are cached for future reuse
- Affected regions: Only affected regions are re-parsed
- Performance: Optimized for interactive editing scenarios
Immutability
Sipha uses immutable data structures:
- Green trees: Immutable, shareable representation
- Safe sharing: Immutability enables safe sharing across threads
- Functional style: Operations create new trees rather than modifying existing ones
Modularity
Sipha is modular and extensible:
- Backend system: Multiple parsing backends via trait
- Feature flags: Optional features via feature flags
- Pluggable components: Lexers, parsers, and trees are pluggable
Performance
Performance is a key consideration:
- Efficient algorithms: Use efficient parsing algorithms
- Memory efficiency: Shared trees reduce memory usage
- Fast operations: Optimized for common operations
- Incremental updates: Fast updates for interactive applications
Type Safety
Sipha leverages Rust’s type system:
- Generic types: Generic over token and non-terminal types
- Trait bounds: Clear trait bounds for extensibility
- Type safety: Compile-time guarantees for correctness
Error Handling
Sipha provides comprehensive error handling:
- Result types: Use Result for fallible operations
- Error recovery: Configurable error recovery strategies
- Rich diagnostics: Detailed error messages with context
API Design
Sipha’s API is designed for:
- Ease of use: Simple, intuitive API
- Flexibility: Support for various use cases
- Extensibility: Easy to extend and customize
- Documentation: Well-documented with examples
Next Steps
- See Module Structure for codebase organization
- Check Data Structures for key types
Module Structure
This chapter describes the organization of Sipha’s codebase.
Core Modules
syntax
Syntax tree types and operations:
- green.rs: Green tree nodes
- red.rs: Red tree nodes
- kind.rs: Syntax kind trait
- builder.rs: Tree builder
- visitor.rs: Tree visitors
- query.rs: Tree queries
lexer
Lexical analysis:
- builder.rs: Lexer builder
- dfa.rs: DFA construction and tokenization
- token.rs: Token types
- mod.rs: Module exports
grammar
Grammar definition and validation:
- builder.rs: Grammar builder
- expr.rs: Grammar expressions
- validate.rs: Grammar validation
- analysis.rs: Grammar analysis
- docs.rs: Grammar documentation
backend
Parser backends:
- mod.rs: Backend trait and common code
- ll/: LL(k) parser implementation
- lr/: LR parser implementation
- glr/: GLR parser implementation
- common.rs: Common backend utilities
incremental
Incremental parsing support:
- mod.rs: Incremental parser and session
error
Error types and diagnostics:
- mod.rs: Error types and diagnostics
parser
Parser traits and interfaces:
- mod.rs: Parser traits
Workspace Structure
sipha/
├── crates/
│ ├── sipha/ # Main library
│ ├── sipha_derive/ # Procedural macros
│ └── sipha_lsp/ # LSP support
├── fuzz/ # Fuzz testing
└── doc/ # Documentation (this book)
Module Dependencies
syntax
↑
├── lexer
│ ↓
│ grammar
│ ↓
│ backend
│ ↓
│ incremental
│
└── error
Next Steps
- See Data Structures for key types
- Check Design Principles for design rationale
Data Structures
This chapter describes the key data structures in Sipha.
Green Trees
GreenNode
Immutable, shareable tree node:
use sipha::syntax::{SyntaxKind, TextSize};
pub struct GreenNode<K: SyntaxKind> {
kind: K,
text_len: TextSize,
children: GreenChildren<K>,
}
GreenToken
Immutable token:
use sipha::syntax::SyntaxKind;
use compact_str::CompactString;
pub struct GreenToken<K: SyntaxKind> {
kind: K,
text: CompactString,
}
GreenElement
Tree element (node or token):
use std::sync::Arc;
use sipha::syntax::SyntaxKind;
pub enum GreenElement<K: SyntaxKind> {
Node(Arc<GreenNode<K>>),
Token(GreenToken<K>),
}
Red Trees
SyntaxNode
Red tree node for traversal:
use std::sync::Arc;
pub struct SyntaxNode {
green: Arc<GreenNode>,
parent: Option<Arc<SyntaxNode>>,
index: usize,
}
SyntaxToken
Red tree token:
use std::sync::Arc;
pub struct SyntaxToken {
green: GreenToken,
parent: Option<Arc<SyntaxNode>>,
index: usize,
}
Grammar
Grammar
Grammar definition:
use std::collections::HashMap;
use lasso::Rodeo;
pub struct Grammar<T, N> {
rules: HashMap<N, Rule<T, N>>,
entry_point: N,
interner: Rodeo,
}
Rule
Production rule:
pub struct Rule<T, N> {
pub lhs: N,
pub rhs: Expr<T, N>,
pub metadata: RuleMetadata,
}
Expr
Grammar expression:
pub enum Expr<T, N> {
Token(T),
Rule(N),
Seq(Vec<Expr<T, N>>),
Choice(Vec<Expr<T, N>>),
// ... more variants
}
Incremental Parsing
IncrementalParser
Incremental parser wrapper:
pub struct IncrementalParser<P, T, N> {
parser: P,
cache: ParseCache<T::Kind>,
}
IncrementalSession
Incremental parsing session:
pub struct IncrementalSession<'a, K> {
old_tree: Option<&'a GreenNode<K>>,
edits: &'a [TextEdit],
affected: AffectedRegion,
reusable: Vec<ReuseCandidate<K>>,
cache: Option<&'a ParseCache<K>>,
}
ParseCache
Parse result cache:
use std::collections::HashMap;
use std::sync::Arc;
use lasso::Rodeo;
pub struct ParseCache<K> {
version: usize,
nodes: HashMap<CacheKey, Arc<GreenNode<K>>>,
interner: Rodeo,
}
Lexer
CompiledLexer
Compiled lexer:
use std::collections::{HashMap, HashSet};
use compact_str::CompactString;
pub struct CompiledLexer<K> {
rules: Vec<CompiledRule<K>>,
dfa: Option<Dfa>,
keywords: HashMap<CompactString, K>,
trivia_kinds: HashSet<K>,
}
Token
Lexer token:
use compact_str::CompactString;
use sipha::syntax::TextRange;
pub struct Token<K> {
pub kind: K,
pub text: CompactString,
pub range: TextRange,
}
Error Types
ParseError
Parsing error:
use sipha::syntax::TextRange;
pub struct ParseError<T, N> {
pub span: TextRange,
pub message: String,
pub expected: Vec<T>,
pub found: Option<T>,
}
LexerError
Lexer error:
use sipha::syntax::TextRange;
pub struct LexerError {
pub span: TextRange,
pub kind: LexerErrorKind,
}
Next Steps
- See Performance Considerations for optimization
- Check Module Structure for organization
Performance Considerations
This chapter covers performance considerations in Sipha’s design.
Memory Efficiency
Green Tree Sharing
Green trees are designed for sharing:
- Arc-based: Nodes use Arc for reference counting
- Shared subtrees: Unchanged subtrees are shared across edits
- Arena allocation: Nodes can be arena-allocated for efficiency
Compact Representation
Green trees use compact representations:
- Small nodes: Minimal overhead per node
- Compact strings: Use CompactString for short tokens
- Efficient children: Optimized child storage
Parsing Performance
DFA-Based Lexing
Lexer uses DFA for O(n) tokenization:
- Compiled DFA: DFA is compiled once
- Fast matching: O(n) tokenization time
- Efficient patterns: Character classes are efficient
Table-Based Parsing
Parsers use tables for fast lookups:
- Parse tables: Pre-computed parse tables
- Fast lookups: O(1) table lookups
- Compact tables: Optimized table representation
Incremental Performance
Node Reuse
Incremental parsing reuses nodes:
- Budget-based: Limit reuse candidates
- Position indexing: Fast node lookup
- Smart invalidation: Efficient cache invalidation
Cache Management
Parse cache is optimized:
- Version-based: Efficient version management
- Automatic eviction: Old versions are evicted
- Interned strings: Rule names are interned
Optimization Tips
Grammar Optimization
- Factor patterns: Extract common subexpressions
- Minimize nesting: Flatten deeply nested structures
- Use appropriate lookahead: Don’t use more than needed
Lexer Optimization
- Order patterns: Most specific patterns first
- Use character classes: Prefer over regex
- Keyword trie: Keywords use efficient trie
Parser Optimization
- Choose right backend: Use appropriate backend
- Enable cache: Always provide grammar for cache
- Adjust budget: Tune reuse budget for use case
Benchmarking
Measuring Performance
use std::time::Instant;
// Assuming parser, tokens, and entry are already defined
let start = Instant::now();
let result = parser.parse(&tokens, entry);
let duration = start.elapsed();
Profiling
Use profilers to find bottlenecks:
- perf: Linux profiling
- valgrind: Memory and call profiling
- cargo-flamegraph: Flamegraph generation
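cargo-flamegraph is often the quickest way to get a visual profile; for example:
# Install once (provides the `cargo flamegraph` subcommand)
cargo install flamegraph
# Profiles a release build by default and writes flamegraph.svg
cargo flamegraph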
Best Practices
- Profile first: Profile before optimizing
- Measure changes: Benchmark before and after
- Use incremental: Always use for interactive apps
- Optimize grammar: Factor and simplify
- Choose right backend: Use appropriate backend
Next Steps
- See Performance for optimization tips
- Check Design Principles for design rationale
Basic Arithmetic Example
This example walks through building a simple arithmetic expression parser step by step.
Overview
We’ll build a parser for arithmetic expressions like 42 + 10 * 3 with proper operator precedence.
Step 1: Define Syntax Kinds
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
// Terminals
Number,
Plus,
Minus,
Multiply,
Divide,
LParen,
RParen,
Whitespace,
Eof,
// Non-terminals
Expr,
Term,
Factor,
}
impl SyntaxKind for ArithSyntaxKind {
fn is_terminal(self) -> bool {
!matches!(self,
ArithSyntaxKind::Expr |
ArithSyntaxKind::Term |
ArithSyntaxKind::Factor
)
}
fn is_trivia(self) -> bool {
matches!(self, ArithSyntaxKind::Whitespace)
}
}
Step 2: Build Lexer
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
Number, Plus, Minus, Multiply, Divide, LParen, RParen, Whitespace, Eof,
Expr, Term, Factor,
}
impl SyntaxKind for ArithSyntaxKind {
fn is_terminal(self) -> bool { !matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor) }
fn is_trivia(self) -> bool { matches!(self, ArithSyntaxKind::Whitespace) }
}
use sipha::lexer::{LexerBuilder, Pattern, CharSet};
let lexer = LexerBuilder::new()
.token(ArithSyntaxKind::Number, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::digits())),
min: 1,
max: None,
})
.token(ArithSyntaxKind::Plus, Pattern::Literal("+".into()))
.token(ArithSyntaxKind::Minus, Pattern::Literal("-".into()))
.token(ArithSyntaxKind::Multiply, Pattern::Literal("*".into()))
.token(ArithSyntaxKind::Divide, Pattern::Literal("/".into()))
.token(ArithSyntaxKind::LParen, Pattern::Literal("(".into()))
.token(ArithSyntaxKind::RParen, Pattern::Literal(")".into()))
.token(ArithSyntaxKind::Whitespace, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
min: 1,
max: None,
})
.trivia(ArithSyntaxKind::Whitespace)
.build(ArithSyntaxKind::Eof, ArithSyntaxKind::Number)
.expect("Failed to build lexer");
Step 3: Define Grammar
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
Number, Plus, Minus, Multiply, Divide, LParen, RParen, Whitespace, Eof,
Expr, Term, Factor,
}
impl SyntaxKind for ArithSyntaxKind {
fn is_terminal(self) -> bool { !matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor) }
fn is_trivia(self) -> bool { matches!(self, ArithSyntaxKind::Whitespace) }
}
use sipha::grammar::{GrammarBuilder, NonTerminal, Expr};
use sipha::lexer::Token as LexerToken;
use sipha::syntax::{TextRange, TextSize};
use std::convert::TryFrom;
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ArithNonTerminal {
Expr,
Term,
Factor,
}
impl NonTerminal for ArithNonTerminal {
fn name(&self) -> &str {
match self {
ArithNonTerminal::Expr => "Expr",
ArithNonTerminal::Term => "Term",
ArithNonTerminal::Factor => "Factor",
}
}
}
// Helper function to create tokens with proper ranges
fn create_token(kind: ArithSyntaxKind, text: &str, offset: u32) -> LexerToken<ArithSyntaxKind> {
let len = TextSize::from(u32::try_from(text.len()).unwrap_or(0));
LexerToken::new(kind, text, TextRange::at(TextSize::from(offset), len))
}
let grammar = GrammarBuilder::new()
.entry_point(ArithNonTerminal::Expr)
// Expr -> Term | Expr + Term | Expr - Term
.rule(ArithNonTerminal::Expr, Expr::choice(vec![
Expr::seq(vec![
Expr::non_terminal(ArithNonTerminal::Expr),
Expr::token(create_token(ArithSyntaxKind::Plus, "+", 0)),
Expr::non_terminal(ArithNonTerminal::Term),
]),
Expr::seq(vec![
Expr::non_terminal(ArithNonTerminal::Expr),
Expr::token(create_token(ArithSyntaxKind::Minus, "-", 0)),
Expr::non_terminal(ArithNonTerminal::Term),
]),
Expr::non_terminal(ArithNonTerminal::Term),
]))
// Term -> Factor | Term * Factor | Term / Factor
.rule(ArithNonTerminal::Term, Expr::choice(vec![
Expr::seq(vec![
Expr::non_terminal(ArithNonTerminal::Term),
Expr::token(create_token(ArithSyntaxKind::Multiply, "*", 0)),
Expr::non_terminal(ArithNonTerminal::Factor),
]),
Expr::seq(vec![
Expr::non_terminal(ArithNonTerminal::Term),
Expr::token(create_token(ArithSyntaxKind::Divide, "/", 0)),
Expr::non_terminal(ArithNonTerminal::Factor),
]),
Expr::non_terminal(ArithNonTerminal::Factor),
]))
// Factor -> Number | ( Expr )
.rule(ArithNonTerminal::Factor, Expr::choice(vec![
Expr::token(create_token(ArithSyntaxKind::Number, "42", 0)),
Expr::seq(vec![
Expr::token(create_token(ArithSyntaxKind::LParen, "(", 0)),
Expr::non_terminal(ArithNonTerminal::Expr),
Expr::token(create_token(ArithSyntaxKind::RParen, ")", 0)),
]),
]))
.build()
.expect("Failed to build grammar");
Step 4: Parse
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind {
Number, Plus, Minus, Multiply, Divide, LParen, RParen, Whitespace, Eof,
Expr, Term, Factor,
}
impl SyntaxKind for ArithSyntaxKind {
fn is_terminal(self) -> bool { !matches!(self, ArithSyntaxKind::Expr | ArithSyntaxKind::Term | ArithSyntaxKind::Factor) }
fn is_trivia(self) -> bool { matches!(self, ArithSyntaxKind::Whitespace) }
}
use sipha::lexer::{LexerBuilder, Pattern, CharSet};
let lexer = LexerBuilder::new()
.token(ArithSyntaxKind::Number, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::digits())),
min: 1, max: None,
})
.token(ArithSyntaxKind::Plus, Pattern::Literal("+".into()))
.token(ArithSyntaxKind::Minus, Pattern::Literal("-".into()))
.token(ArithSyntaxKind::Multiply, Pattern::Literal("*".into()))
.token(ArithSyntaxKind::Divide, Pattern::Literal("/".into()))
.token(ArithSyntaxKind::LParen, Pattern::Literal("(".into()))
.token(ArithSyntaxKind::RParen, Pattern::Literal(")".into()))
.token(ArithSyntaxKind::Whitespace, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
min: 1, max: None,
})
.trivia(ArithSyntaxKind::Whitespace)
.build(ArithSyntaxKind::Eof, ArithSyntaxKind::Number)
.expect("Failed to build lexer");
use sipha::grammar::{GrammarBuilder, NonTerminal, Expr};
use sipha::lexer::Token as LexerToken;
use sipha::syntax::{TextRange, TextSize};
use std::convert::TryFrom;
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ArithNonTerminal { Expr, Term, Factor, }
impl NonTerminal for ArithNonTerminal {
fn name(&self) -> &str { match self { ArithNonTerminal::Expr => "Expr", ArithNonTerminal::Term => "Term", ArithNonTerminal::Factor => "Factor", } }
}
fn create_token(kind: ArithSyntaxKind, text: &str, offset: u32) -> LexerToken<ArithSyntaxKind> {
let len = TextSize::from(u32::try_from(text.len()).unwrap_or(0));
LexerToken::new(kind, text, TextRange::at(TextSize::from(offset), len))
}
let grammar = GrammarBuilder::new()
.entry_point(ArithNonTerminal::Expr)
.rule(ArithNonTerminal::Expr, Expr::choice(vec![
Expr::seq(vec![Expr::non_terminal(ArithNonTerminal::Expr), Expr::token(create_token(ArithSyntaxKind::Plus, "+", 0)), Expr::non_terminal(ArithNonTerminal::Term)]),
Expr::seq(vec![Expr::non_terminal(ArithNonTerminal::Expr), Expr::token(create_token(ArithSyntaxKind::Minus, "-", 0)), Expr::non_terminal(ArithNonTerminal::Term)]),
Expr::non_terminal(ArithNonTerminal::Term),
]))
.rule(ArithNonTerminal::Term, Expr::choice(vec![
Expr::seq(vec![Expr::non_terminal(ArithNonTerminal::Term), Expr::token(create_token(ArithSyntaxKind::Multiply, "*", 0)), Expr::non_terminal(ArithNonTerminal::Factor)]),
Expr::seq(vec![Expr::non_terminal(ArithNonTerminal::Term), Expr::token(create_token(ArithSyntaxKind::Divide, "/", 0)), Expr::non_terminal(ArithNonTerminal::Factor)]),
Expr::non_terminal(ArithNonTerminal::Factor),
]))
.rule(ArithNonTerminal::Factor, Expr::choice(vec![
Expr::token(create_token(ArithSyntaxKind::Number, "42", 0)),
Expr::seq(vec![Expr::token(create_token(ArithSyntaxKind::LParen, "(", 0)), Expr::non_terminal(ArithNonTerminal::Expr), Expr::token(create_token(ArithSyntaxKind::RParen, ")", 0))]),
]))
.build().expect("Failed to build grammar");
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
use sipha::syntax::SyntaxNode;
let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)
.expect("Failed to create parser");
let input = "42 + 10 * 3";
let tokens = lexer.tokenize(input)
.expect("Failed to tokenize");
let result = parser.parse(&tokens, ArithNonTerminal::Expr);
if !result.errors.is_empty() {
eprintln!("Errors: {:?}", result.errors);
} else {
let root = SyntaxNode::new_root(result.root.clone());
println!("Parse successful!");
println!("Root: {:?}", root.kind());
}
Complete Example
See examples/basic_arithmetic.rs for the complete working example.
Next Steps
- Try Incremental Parsing Example
- Check GLR Example for ambiguous grammars
Incremental Parsing Example
This example demonstrates incremental parsing with Sipha.
Overview
We’ll build a simple language server that uses incremental parsing to efficiently update the parse tree as the user edits code.
Setup
use sipha::incremental::{IncrementalParser, TextEdit};
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::syntax::{TextRange, TextSize, GreenNode};
use sipha::grammar::Grammar;
use sipha::lexer::CompiledLexer;
// These types would be defined in your actual code
// type MyToken = /* ... */;
// type MyNonTerminal = /* ... */;
// type MySyntaxKind = /* ... */;
struct LanguageServer {
parser: IncrementalParser<LlParser, MyToken, MyNonTerminal>,
grammar: Grammar<MyToken, MyNonTerminal>,
lexer: CompiledLexer<MySyntaxKind>,
current_tree: Option<GreenNode<MySyntaxKind>>,
}
Initial Parse
use sipha::incremental::{IncrementalParser, TextEdit};
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::syntax::{TextRange, TextSize, GreenNode};
use sipha::grammar::Grammar;
use sipha::lexer::CompiledLexer;
struct LanguageServer {
parser: IncrementalParser<LlParser, MyToken, MyNonTerminal>,
grammar: Grammar<MyToken, MyNonTerminal>,
lexer: CompiledLexer<MySyntaxKind>,
current_tree: Option<GreenNode<MySyntaxKind>>,
}
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
fn build_grammar() -> Grammar<MyToken, MyNonTerminal> { todo!() }
fn build_lexer() -> CompiledLexer<MySyntaxKind> { todo!() }
impl LanguageServer {
fn new() -> Self {
let grammar = build_grammar();
let parser = LlParser::new(&grammar, Default::default()).unwrap();
let lexer = build_lexer();
Self {
parser: IncrementalParser::new(parser),
grammar,
lexer,
current_tree: None,
}
}
fn initial_parse(&mut self, text: &str) {
let tokens = self.lexer.tokenize(text).unwrap();
let result = self.parser.parse_incremental(
&tokens,
None, // No old tree
&[], // No edits
MyNonTerminal::Expr,
Some(&self.grammar), // Populate cache
);
self.current_tree = Some(result.root.clone());
}
}
Handling Edits
use sipha::incremental::{IncrementalParser, TextEdit};
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::syntax::{TextRange, TextSize, GreenNode};
use sipha::grammar::Grammar;
use sipha::lexer::CompiledLexer;
struct LanguageServer {
parser: IncrementalParser<LlParser, MyToken, MyNonTerminal>,
grammar: Grammar<MyToken, MyNonTerminal>,
lexer: CompiledLexer<MySyntaxKind>,
current_tree: Option<GreenNode<MySyntaxKind>>,
}
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
impl LanguageServer {
fn report_errors(&self, _errors: &[()]) {}
}
impl LanguageServer {
fn on_text_changed(&mut self, edits: Vec<TextEdit>, new_text: &str) {
// Re-tokenize (or update tokens incrementally)
let tokens = self.lexer.tokenize(new_text).unwrap();
// Incremental parse
let result = self.parser.parse_incremental(
&tokens,
self.current_tree.as_ref(), // Previous tree
&edits, // Text edits
MyNonTerminal::Expr,
Some(&self.grammar), // Update cache
);
// Update current tree
self.current_tree = Some(result.root.clone());
// Report errors
if !result.errors.is_empty() {
self.report_errors(&result.errors);
}
}
}
Creating Edits
use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};
struct LanguageServer;
impl LanguageServer {
fn on_text_changed(&mut self, _edits: Vec<TextEdit>, _new_text: &str) {}
}
let mut server = LanguageServer;
// User changed "42" to "100" at position 0
let edit = TextEdit::replace(
TextRange::new(TextSize::from(0), TextSize::from(2)),
"100".into(),
);
server.on_text_changed(vec![edit], "100 + 10");
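TextEdit also provides insert and delete constructors (see TextEdit under API Overview). Two more sketches, using offsets into the edited text "100 + 10" (the offsets are just illustrative):
use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};
// User typed " * 2" at the end of "100 + 10" (offset 8)
let insert = TextEdit::insert(TextSize::from(8), " * 2".into());
// User deleted " + 10" (offsets 3..8)
let delete = TextEdit::delete(TextRange::new(TextSize::from(3), TextSize::from(8)));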
Performance
Incremental parsing provides significant performance improvements:
- Small edits: 10-100x faster
- Medium edits: 5-20x faster
- Large edits: 2-5x faster
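To get a feel for these numbers with your own grammar, you can time a full parse against an incremental re-parse after a one-character edit. This is only a sketch: build_lexer, build_grammar, and MyNonTerminal::Root stand in for your own definitions.
use std::time::Instant;
use sipha::backend::ll::LlParser;
use sipha::incremental::{IncrementalParser, TextEdit};
use sipha::syntax::{TextRange, TextSize};
let lexer = build_lexer();     // your lexer
let grammar = build_grammar(); // your grammar
let parser = LlParser::new(&grammar, Default::default()).unwrap();
let mut incremental = IncrementalParser::new(parser);
// A reasonably large input so the difference is visible
let text = "1 + 2 * 3\n".repeat(10_000);
let tokens = lexer.tokenize(&text).unwrap();
// Full parse (also populates the cache)
let start = Instant::now();
let first = incremental.parse_incremental(&tokens, None, &[], MyNonTerminal::Root, Some(&grammar));
println!("full parse: {:?}", start.elapsed());
// One-character edit: change the first "1" to "9"
let edit = TextEdit::replace(TextRange::new(TextSize::from(0), TextSize::from(1)), "9".into());
let edited = format!("9{}", &text[1..]);
let edited_tokens = lexer.tokenize(&edited).unwrap();
let start = Instant::now();
let _second = incremental.parse_incremental(
    &edited_tokens,
    Some(&first.root),
    &[edit],
    MyNonTerminal::Root,
    Some(&grammar),
);
println!("incremental re-parse: {:?}", start.elapsed());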
Complete Example
See examples/incremental_parsing.rs for the complete working example.
Next Steps
- See Incremental Parsing for detailed documentation
- Check Real-World Patterns for more examples
GLR Parsing Example
This example demonstrates using the GLR parser to handle ambiguous grammars.
Overview
We’ll parse an ambiguous expression grammar and disambiguate using precedence rules.
Ambiguous Grammar
// Grammar: Expr -> Expr + Expr | Expr * Expr | Number
// Input: "1 + 2 * 3"
// Ambiguous: Could be (1 + 2) * 3 or 1 + (2 * 3)
Setup
use sipha::grammar::Grammar;
type MyToken = ();
type MyNonTerminal = ();
let grammar: Grammar<MyToken, MyNonTerminal> = todo!();
use sipha::backend::glr::{GlrParser, GlrConfig};
use sipha::backend::glr::disambiguate_by_precedence;
let config = GlrConfig::default();
let mut parser = GlrParser::new(&grammar, config)
.expect("Failed to create GLR parser");
Parsing
use sipha::backend::glr::{GlrParser, GlrConfig};
use sipha::backend::glr::disambiguate_by_precedence;
use sipha::syntax::SyntaxNode;
use sipha::lexer::Token;
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
struct Grammar;
let grammar = Grammar;
let config = GlrConfig::default();
let mut parser = GlrParser::new(&grammar, config).expect("Failed to create GLR parser");
let tokens: Vec<Token<MySyntaxKind>> = vec![];
let result = parser.parse(&tokens, MyNonTerminal::Expr);
// Check for ambiguity
if let Some(forest) = result.forest {
println!("Ambiguous: {} parse trees", forest.roots.len());
// Disambiguate using precedence
let disambiguated = forest.disambiguate(disambiguate_by_precedence(|op| {
match op {
MySyntaxKind::Multiply | MySyntaxKind::Divide => 2,
MySyntaxKind::Plus | MySyntaxKind::Minus => 1,
_ => 0,
}
}));
// Use disambiguated tree
let root = SyntaxNode::new_root(disambiguated);
} else {
// Unambiguous - use result directly
let root = SyntaxNode::new_root(result.root);
}
Disambiguation Strategies
Precedence
use sipha::backend::glr::disambiguate_by_precedence;
type MySyntaxKind = ();
struct Forest;
impl Forest {
fn disambiguate<F>(&self, _f: F) -> () { () }
}
let forest = Forest;
let disambiguated = forest.disambiguate(disambiguate_by_precedence(|op| {
match op {
MySyntaxKind::Multiply => 2,
MySyntaxKind::Plus => 1,
_ => 0,
}
}));
Associativity
use sipha::backend::glr::disambiguate_by_associativity;
use sipha::backend::glr::Associativity;
type MySyntaxKind = ();
struct Forest;
impl Forest {
fn disambiguate<F>(&self, _f: F) -> () { () }
}
let forest = Forest;
let disambiguated = forest.disambiguate(disambiguate_by_associativity(|op| {
match op {
MySyntaxKind::Minus => Associativity::Left,
MySyntaxKind::Power => Associativity::Right,
_ => Associativity::None,
}
}));
Custom
struct Forest;
impl Forest {
fn disambiguate<F>(&self, _f: F) -> () { () }
}
let forest = Forest;
fn compute_complexity(_tree: &()) -> usize { 0 }
let disambiguated = forest.disambiguate(|alternatives| {
// Choose based on custom criteria
alternatives
.iter()
.min_by_key(|tree| compute_complexity(tree))
.cloned()
});
Complete Example
See examples/glr_parsing.rs for the complete working example.
Next Steps
- See GLR Parser for detailed documentation
- Check Real-World Patterns for more examples
PEG Parsing Example
This example demonstrates using the PEG parser with ordered choice and memoization.
Overview
We’ll parse an expression grammar using PEG’s ordered choice semantics and demonstrate the benefits of memoization (packrat parsing).
Ordered Choice Semantics
PEG uses ordered choice where the first matching alternative wins:
// Grammar: Expr -> Expr + Term | Term
// In PEG: First alternative tried first
// If "Expr + Term" matches, "Term" is never tried
This makes precedence natural: lower-precedence operators are tried first.
Setup
use sipha::backend::peg::{PegParser, PegConfig};
use sipha::backend::ParserBackend;
use sipha::grammar::Grammar;
type MyToken = ();
type MyNonTerminal = ();
let grammar: Grammar<MyToken, MyNonTerminal> = todo!();
let config = PegConfig {
enable_memoization: true, // Enable packrat parsing
max_memo_size: 10000, // Cache size limit
error_recovery: true,
max_errors: 100,
max_backtrack_depth: 1000,
};
let mut parser = PegParser::new(&grammar, config)
.expect("Failed to create PEG parser");
Basic Parsing
use sipha::backend::peg::{PegParser, PegConfig};
use sipha::backend::ParserBackend;
type MyToken = ();
type MyNonTerminal = ();
struct Grammar;
let grammar = Grammar;
let config = PegConfig::default();
let mut parser = PegParser::new(&grammar, config).expect("Failed to create PEG parser");
let tokens: Vec<MyToken> = vec![];
let result = parser.parse(&tokens, MyNonTerminal::Expr);
if result.errors.is_empty() {
println!("Parse successful!");
println!("Tokens consumed: {}", result.metrics.tokens_consumed);
println!("Nodes created: {}", result.metrics.nodes_created);
} else {
for error in &result.errors {
eprintln!("Error: {error}");
}
}
Memoization (Packrat Parsing)
Memoization caches parse results for O(n) performance:
use sipha::backend::peg::{PegParser, PegConfig};
type MyToken = ();
type MyNonTerminal = ();
struct Grammar;
let grammar = Grammar;
let config_with_memo = PegConfig {
enable_memoization: true,
..Default::default()
};
let config_without_memo = PegConfig {
enable_memoization: false,
..Default::default()
};
// With memoization: O(n) time, O(n) space
let parser_with = PegParser::new(&grammar, config_with_memo)?;
// Without memoization: May be exponential for some grammars
let parser_without = PegParser::new(&grammar, config_without_memo)?;
Incremental Parsing
PEG works well with incremental parsing thanks to memoization:
use sipha::incremental::IncrementalParser;
use sipha::backend::peg::{PegParser, PegConfig};
type MyToken = ();
type MyNonTerminal = ();
struct Grammar;
let grammar = Grammar;
let parser = PegParser::new(&grammar, PegConfig::default()).expect("parser");
use sipha::incremental::TextEdit;
use sipha::syntax::{TextRange, TextSize};
let mut incremental = IncrementalParser::new(parser);
// Token streams for the original and edited input (stubbed here)
let tokens: Vec<MyToken> = vec![];
let new_tokens: Vec<MyToken> = vec![];
// Initial parse
let result1 = incremental.parse_incremental(
&tokens,
None,
&[],
MyNonTerminal::Expr,
Some(&grammar),
);
// After edit - benefits from memoization cache
let edits = vec![TextEdit::replace(
TextRange::new(TextSize::from(0), TextSize::from(1)),
"new_text".into(),
)];
let result2 = incremental.parse_incremental(
&new_tokens,
Some(&result1.root),
&edits,
MyNonTerminal::Expr,
Some(&grammar),
);
Ordered Choice for Precedence
PEG’s ordered choice makes operator precedence natural:
// Lower precedence tried first
Expr -> Expr + Term | Term
// Higher precedence tried later
Term -> Term * Factor | Factor
// This naturally gives * higher precedence than +
Performance Comparison
The example demonstrates performance with and without memoization:
- With memoization: Linear time O(n)
- Without memoization: May be exponential for some grammars
- Incremental: Fast for small edits (benefits from cache)
Complete Example
See examples/peg_parsing.rs for the complete working example.
Run it with:
cargo run --example peg_parsing --features backend-peg
Expected Output
The example demonstrates:
- Building a lexer and grammar
- Parsing with ordered choice
- Memoization benefits
- Incremental parsing performance
- Performance comparison with/without memoization
Next Steps
- See PEG Parser for detailed documentation
- Check Choosing a Backend to see when to use PEG
- Explore Incremental Parsing for more details
Real-World Patterns
This chapter covers common patterns and idioms for using Sipha in real-world applications.
Language Server Pattern
Structure
struct LanguageServer {
parser: IncrementalParser<LlParser, Token, NonTerminal>,
grammar: Grammar<Token, NonTerminal>,
lexer: CompiledLexer<SyntaxKind>,
documents: HashMap<Uri, Document>,
}
struct Document {
text: String,
tree: Option<GreenNode<SyntaxKind>>,
version: i32,
}
Handling Changes
impl LanguageServer {
fn did_change(&mut self, uri: Uri, edits: Vec<TextEdit>, version: i32) {
let doc = self.documents.get_mut(&uri).unwrap();
doc.text = apply_edits(&doc.text, &edits);
doc.version = version;
let tokens = self.lexer.tokenize(&doc.text).unwrap();
let result = self.parser.parse_incremental(
&tokens,
doc.tree.as_ref(),
&edits,
MyNonTerminal::Root,
Some(&self.grammar),
);
doc.tree = Some(result.root.clone());
// Publish diagnostics
self.publish_diagnostics(uri, &result.errors);
}
}
Error Reporting Pattern
Collecting Errors
fn collect_errors(result: &ParseResult) -> Vec<Diagnostic> {
result.errors
.iter()
.map(|error| Diagnostic {
range: error.span,
severity: DiagnosticSeverity::Error,
message: error.message.clone(),
// ...
})
.collect()
}
Reporting to User
fn report_errors(errors: &[Diagnostic]) {
for error in errors {
eprintln!("Error at {}: {}", error.range, error.message);
}
}
Tree Traversal Pattern
Finding Nodes
fn find_nodes_by_kind(root: &SyntaxNode, kind: SyntaxKind) -> Vec<SyntaxNode> {
root.descendants()
.filter(|n| n.kind() == kind)
.collect()
}
Collecting Information
fn collect_identifiers(root: &SyntaxNode) -> Vec<String> {
root.descendants()
.filter(|n| n.kind() == SyntaxKind::Ident)
.filter_map(|n| n.first_token())
.map(|t| t.text().to_string())
.collect()
}
Grammar Extension Pattern
Adding Rules
fn extend_grammar(base: GrammarBuilder) -> GrammarBuilder {
    base
        .rule(MyNonTerminal::NewRule, Expr::token(new_token))
        .rule(MyNonTerminal::AnotherRule, Expr::choice(vec![
            Expr::non_terminal(MyNonTerminal::NewRule),
            Expr::non_terminal(MyNonTerminal::ExistingRule),
        ]))
}
Testing Pattern
Test Helpers
fn parse_test(input: &str) -> ParseResult {
let lexer = build_test_lexer();
let grammar = build_test_grammar();
let mut parser = LlParser::new(&grammar, Default::default()).unwrap();
let tokens = lexer.tokenize(input).unwrap();
parser.parse(&tokens, MyNonTerminal::Expr)
}
#[test]
fn test_simple_expression() {
let result = parse_test("1 + 2");
assert!(result.errors.is_empty());
// ...
}
Best Practices
- Use incremental parsing: Always use it for interactive applications
- Cache grammar: Reuse grammar across parses
- Handle errors gracefully: Don’t panic on errors
- Test thoroughly: Test with various inputs
- Profile performance: Profile and optimize hot paths
Next Steps
- See Examples for complete examples
- Check Architecture for design patterns
API Overview
This chapter provides an overview of Sipha’s key types and traits.
Core Traits
SyntaxKind
Trait for syntax kind identifiers:
use std::fmt::Debug;
use std::hash::Hash;
pub trait SyntaxKind: Copy + Clone + PartialEq + Eq + Hash + Debug {
fn is_terminal(self) -> bool;
fn is_trivia(self) -> bool;
}
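For example, a trimmed version of the arithmetic kinds from Getting Started implements it like this:
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ArithSyntaxKind { Number, Plus, Whitespace, Eof, Expr }
impl SyntaxKind for ArithSyntaxKind {
    // Only the Expr node is a non-terminal here
    fn is_terminal(self) -> bool { !matches!(self, ArithSyntaxKind::Expr) }
    // Whitespace is trivia and is skipped during parsing
    fn is_trivia(self) -> bool { matches!(self, ArithSyntaxKind::Whitespace) }
}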
Token
Trait for tokens:
use std::hash::Hash;
use sipha::syntax::{SyntaxKind, TextSize};
use compact_str::CompactString;
pub trait Token: Clone + PartialEq + Eq + Hash {
type Kind: SyntaxKind;
fn kind(&self) -> Self::Kind;
fn text_len(&self) -> TextSize;
fn text(&self) -> CompactString;
}
NonTerminal
Trait for non-terminals:
use std::hash::Hash;
pub trait NonTerminal: Clone + PartialEq + Eq + Hash {
fn name(&self) -> &str;
}
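A minimal implementation mirrors the Getting Started example:
use sipha::grammar::NonTerminal;
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ArithNonTerminal { Expr, Term, Factor }
impl NonTerminal for ArithNonTerminal {
    fn name(&self) -> &str {
        match self {
            ArithNonTerminal::Expr => "Expr",
            ArithNonTerminal::Term => "Term",
            ArithNonTerminal::Factor => "Factor",
        }
    }
}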
ParserBackend
Trait for parser backends:
use sipha::grammar::Grammar;
pub trait ParserBackend<T, N>: Sized + Send
where
T: Token,
N: NonTerminal,
{
type Config: Default + Clone;
type Error: std::error::Error + Send + Sync + 'static;
type State: Send + Sync;
fn new(grammar: &Grammar<T, N>, config: Self::Config) -> Result<Self, Self::Error>;
fn parse(&mut self, input: &[T], entry: N) -> ParseResult<T, N>;
// ... more methods
}
Key Types
Grammar
Grammar definition:
use std::marker::PhantomData;
pub struct Grammar<T, N> {
_phantom: PhantomData<(T, N)>,
}
ParseResult
Parse result:
use sipha::syntax::GreenNode;
pub struct ParseResult<T, N> {
pub root: GreenNode<T::Kind>,
pub errors: Vec<ParseError<T, N>>,
pub warnings: Vec<ParseWarning<T, N>>,
pub metrics: ParseMetrics,
#[cfg(feature = "backend-glr")]
pub forest: Option<ParseForest<T::Kind>>,
}
GreenNode
Green tree node:
use sipha::syntax::SyntaxKind;
pub struct GreenNode<K: SyntaxKind> {
_phantom: std::marker::PhantomData<K>,
}
SyntaxNode
Red tree node:
pub struct SyntaxNode {
// ...
}
Builders
GrammarBuilder
Build grammars:
pub struct GrammarBuilder<T, N> {
// ...
}
impl<T, N> GrammarBuilder<T, N> {
pub fn new() -> Self;
pub fn entry_point(mut self, entry: N) -> Self;
pub fn rule(mut self, lhs: N, rhs: Expr<T, N>) -> Self;
pub fn build(self) -> Result<Grammar<T, N>, GrammarError>;
}
LexerBuilder
Build lexers:
use sipha::syntax::SyntaxKind;
use sipha::lexer::{Pattern, CompiledLexer};
type LexerError = ();
type K = ();
pub struct LexerBuilder<K: SyntaxKind> {
_phantom: std::marker::PhantomData<K>,
}
impl<K: SyntaxKind> LexerBuilder<K> {
pub fn new() -> Self { todo!() }
pub fn token(self, kind: K, pattern: Pattern) -> Self { todo!() }
pub fn keyword(self, text: &str, kind: K) -> Self { todo!() }
pub fn trivia(self, kind: K) -> Self { todo!() }
pub fn build(self, eof_kind: K, ident_kind: K) -> Result<CompiledLexer<K>, LexerError> { todo!() }
}
GreenNodeBuilder
Build green trees:
use sipha::syntax::{GreenNode, SyntaxKind};
type K = ();
type BuilderError = ();
pub struct GreenNodeBuilder {
_phantom: std::marker::PhantomData<()>,
}
impl GreenNodeBuilder {
pub fn new() -> Self { todo!() }
pub fn start_node(&mut self, kind: K) { todo!() }
pub fn token(&mut self, kind: K, text: &str) -> Result<(), BuilderError> { todo!() }
pub fn finish_node(&mut self) -> Result<(), BuilderError> { todo!() }
pub fn finish(self) -> Result<GreenNode<K>, BuilderError> { todo!() }
}
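Usage follows the usual start/token/finish pattern. A rough sketch, reusing the ArithSyntaxKind kinds from Getting Started (import path and error handling are elided here):
let mut builder = GreenNodeBuilder::new();
builder.start_node(ArithSyntaxKind::Expr);
builder.token(ArithSyntaxKind::Number, "42")?;
builder.finish_node()?;
let green = builder.finish()?;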
Incremental Parsing
IncrementalParser
Incremental parser wrapper:
use sipha::syntax::{GreenNode, SyntaxKind};
use sipha::grammar::Grammar;
use sipha::incremental::TextEdit;
type P = ();
type T = ();
type N = ();
type ParseResult<T, N> = ();
pub struct IncrementalParser<P, T, N> {
_phantom: std::marker::PhantomData<(P, T, N)>,
}
impl<P, T, N> IncrementalParser<P, T, N> {
pub fn new(parser: P) -> Self { todo!() }
pub fn parse_incremental(
&mut self,
input: &[T],
old_tree: Option<&GreenNode<T::Kind>>,
edits: &[TextEdit],
entry: N,
grammar: Option<&Grammar<T, N>>,
) -> ParseResult<T, N> { todo!() }
}
TextEdit
Text edit:
use sipha::syntax::{TextRange, TextSize};
pub struct TextEdit {
pub range: TextRange,
pub new_text: String,
}
impl TextEdit {
pub fn replace(range: TextRange, new_text: impl Into<String>) -> Self { todo!() }
pub fn insert(pos: TextSize, text: impl Into<String>) -> Self { todo!() }
pub fn delete(range: TextRange) -> Self { todo!() }
}
Error Types
ParseError
Parsing error:
use sipha::syntax::TextRange;
pub struct ParseError<T, N> {
pub span: TextRange,
pub message: String,
pub expected: Vec<T>,
pub found: Option<T>,
}
LexerError
Lexer error:
use sipha::syntax::TextRange;
pub struct LexerError {
pub span: TextRange,
pub kind: LexerErrorKind,
}
Next Steps
- See Feature Flags for available features
- Check FAQ for common questions
Feature Flags
Sipha uses feature flags to enable optional functionality.
Backend Features
backend-ll
Enable LL(k) parser backend (default).
[dependencies]
sipha = { version = "0.5.0", features = ["backend-ll"] }
backend-lr
Enable LR parser backend.
[dependencies]
sipha = { version = "0.5.0", features = ["backend-lr"] }
backend-glr
Enable GLR parser backend (requires backend-lr).
[dependencies]
sipha = { version = "0.5.0", features = ["backend-lr", "backend-glr"] }
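backend-peg
Enable the PEG parser backend with packrat memoization (used by the PEG Parsing Example).
[dependencies]
sipha = { version = "0.5.0", features = ["backend-peg"] }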
Optional Features
diagnostics
Enable rich error diagnostics with miette.
[dependencies]
sipha = { version = "0.5.0", features = ["diagnostics"] }
unicode
Enable full Unicode support for identifiers.
[dependencies]
sipha = { version = "0.5.0", features = ["unicode"] }
visitor
Enable syntax tree visitor patterns.
[dependencies]
sipha = { version = "0.5.0", features = ["visitor"] }
query
Enable XPath-like query API for syntax trees.
[dependencies]
sipha = { version = "0.5.0", features = ["query"] }
tree-utils
Enable tree diffing and validation utilities.
[dependencies]
sipha = { version = "0.5.0", features = ["tree-utils"] }
Feature Combinations
Minimal
[dependencies]
sipha = "0.5.0" # Only backend-ll
Full
[dependencies]
sipha = { version = "0.5.0", features = [
    "backend-ll",
    "backend-lr",
    "backend-glr",
    "diagnostics",
    "unicode",
    "visitor",
    "query",
    "tree-utils",
] }
Language Server
[dependencies]
sipha = { version = "0.5.0", features = [
    "backend-ll",
    "diagnostics",
    "visitor",
    "query",
] }
Default Features
By default, only backend-ll is enabled.
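To use a non-default backend without also compiling the default one, disable default features in the usual Cargo way (adjust the feature list to your needs):
[dependencies]
sipha = { version = "0.5.0", default-features = false, features = ["backend-lr"] }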
Next Steps
- See API Overview for API reference
- Check FAQ for common questions
FAQ
Frequently asked questions about Sipha.
General
What is Sipha?
Sipha is a flexible, incremental parsing library for Rust with support for multiple parsing algorithms.
Why use Sipha?
Sipha is designed for interactive applications like IDEs and language servers that need fast, incremental parsing.
Is Sipha production-ready?
Sipha 0.5.0 provides a stable core API and is under active development; features and improvements continue to be added based on user feedback.
Parsing
Which backend should I use?
See Choosing a Backend for guidance.
Does Sipha support left recursion?
- LL parser: With left-recursion elimination
- LR parser: Yes, naturally
- GLR parser: Yes, naturally
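For example, left-recursion elimination rewrites Expr -> Expr + Term | Term into the equivalent iterative form Expr -> Term (Plus Term)*, which an LL parser can handle.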
How do I handle ambiguous grammars?
Use the GLR backend. See GLR Parser for details.
Incremental Parsing
When should I use incremental parsing?
Use incremental parsing for interactive applications like IDEs and language servers.
How much faster is incremental parsing?
- Small edits: 10-100x faster
- Medium edits: 5-20x faster
- Large edits: 2-5x faster
Do I need to provide the grammar for incremental parsing?
The grammar argument is optional, but providing it enables cache population, which improves incremental performance.
Performance
How do I optimize performance?
See Performance for optimization tips.
Is Sipha fast?
For batch parsing, Sipha is competitive with other Rust parsing libraries; for interactive editing, incremental re-parses are significantly faster than parsing from scratch.
Grammar
How do I define a grammar?
Use GrammarBuilder. See Grammars for details.
Can I use regex in patterns?
Yes, use Pattern::Regex. See Lexers for details.
Error Handling
How do I handle errors?
Check result.errors after parsing. See Error Handling for details.
Does Sipha recover from errors?
Yes, Sipha supports configurable error recovery strategies.
Contributing
How do I contribute?
See Contributing for guidelines.
Where do I report bugs?
Open an issue on GitHub.
Next Steps
- See Getting Started to get started
- Check Examples for examples
Glossary
This glossary defines key terms used throughout the Sipha documentation.
A
Affected Region
The minimal range of text that needs to be re-parsed after an edit. Sipha computes this automatically by expanding the edit range to include necessary context.
Ambiguous Grammar
A grammar that can produce multiple valid parse trees for the same input. Sipha’s GLR backend can handle ambiguous grammars by producing a parse forest.
Arena Allocation
A memory allocation strategy where objects are allocated in a contiguous region (arena) for better cache locality and faster deallocation. Green trees use arena allocation.
B
Backend
A parsing algorithm implementation. Sipha supports multiple backends: LL(k), LR, and GLR, each with different characteristics.
C
Cache (Parse Cache)
A persistent storage for parse results that enables cross-session node reuse. The cache stores parse nodes keyed by rule name, position, and version.
Choice Expression
A grammar expression that matches one of several alternatives. Represented as Expr::choice(vec![...]).
D
DFA (Deterministic Finite Automaton)
A state machine used by Sipha’s lexer for efficient tokenization. DFAs provide O(n) tokenization time.
Disambiguation
The process of selecting a single parse tree from multiple valid alternatives in an ambiguous grammar. GLR parsers support various disambiguation strategies.
E
Entry Point
The starting non-terminal for parsing. Defined when building a grammar using GrammarBuilder::entry_point().
Error Recovery
The ability of a parser to continue parsing after encountering an error, rather than stopping immediately. Sipha supports configurable recovery strategies.
Expression (Grammar Expression)
A component of a grammar rule’s right-hand side. Can be a token, non-terminal, sequence, choice, optional, repeat, or empty.
F
First Set
The set of terminals that can appear as the first token in a derivation from a non-terminal. Used by LL parsers for prediction.
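For example, in the arithmetic grammar from Getting Started, FIRST(Factor) = { Number, LParen }, and FIRST(Term) and FIRST(Expr) are the same set because every derivation ultimately starts with a Factor.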
Follow Set
The set of terminals that can appear immediately after a non-terminal in a derivation. Used by LL parsers for error recovery.
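In the same grammar, FOLLOW(Expr) = { Plus, Minus, RParen, Eof }.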
Forest (Parse Forest)
A data structure representing multiple parse trees for ambiguous grammars. GLR parsers produce parse forests when ambiguity is detected.
G
GLR (Generalized LR)
A parsing algorithm that extends LR parsing to handle ambiguous and non-deterministic grammars by maintaining multiple parser stacks.
Grammar
A formal description of a language’s syntax, consisting of non-terminals, production rules, and an entry point.
Green Tree
The storage layer of Sipha’s syntax tree representation. Green trees are compact, immutable, and can be shared across multiple red trees.
I
Incremental Parsing
The ability to efficiently re-parse only changed portions of code, reusing unchanged subtrees from previous parses. Sipha’s core feature.
Immutable
Data structures that cannot be modified after creation. Sipha’s syntax trees are immutable, enabling safe sharing and efficient updates.
L
Left Recursion
A grammar rule where a non-terminal appears as the first symbol in its own production. Some backends require left-recursion elimination.
Lexer
The component that converts raw text into tokens. Also called a tokenizer or scanner.
LL(k) Parser
A top-down predictive parser that uses k tokens of lookahead to make parsing decisions.
LR Parser
A bottom-up shift-reduce parser that builds parse trees from the leaves up. Supports LALR and canonical LR variants.
N
Non-Terminal
A grammar symbol that can be expanded into other symbols. Represented as enum variants implementing the NonTerminal trait.
Node Reuse
The process of identifying and reusing unchanged subtrees from previous parses during incremental parsing.
P
Parse Cache
See Cache (Parse Cache).
Parser Backend
See Backend.
Production Rule
A grammar rule that defines how a non-terminal can be expanded. Written as NonTerminal -> Expression.
R
Red Tree
The API layer of Sipha’s syntax tree representation. Red trees provide convenient navigation and query APIs, created on-demand from green trees.
Reuse Budget
A limit on how many nodes to consider for reuse during incremental parsing. Can be fixed or heuristic-based.
S
Sequence Expression
A grammar expression that matches a sequence of expressions in order. Represented as Expr::seq(vec![...]).
Syntax Kind
A unified type representing both terminals (tokens) and non-terminals. All syntax kinds implement the SyntaxKind trait.
Syntax Tree
A tree representation of the syntactic structure of parsed code. Sipha uses green/red trees for efficient representation.
T
Terminal
A grammar symbol that cannot be expanded further. In Sipha, terminals are tokens produced by the lexer.
Token
A unit of lexical analysis, representing a sequence of characters with a specific meaning (e.g., identifier, keyword, operator).
Trivia
Tokens that are syntactically insignificant, such as whitespace and comments. Trivia is automatically skipped during parsing.
U
Unambiguous Grammar
A grammar that produces exactly one parse tree for each valid input. Most parsing backends require unambiguous grammars; GLR does not.
V
Version (Cache Version)
A number used to invalidate old cache entries. When the grammar or lexer changes, the version increments, invalidating old cache entries.
Cheat Sheet
Quick reference for common Sipha patterns and APIs.
Syntax Kinds
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind {
// Terminals
Number, Plus, Minus, Eof,
// Non-terminals
Expr, Term,
}
impl SyntaxKind for MySyntaxKind {
fn is_terminal(self) -> bool {
!matches!(self, Self::Expr | Self::Term)
}
fn is_trivia(self) -> bool {
false // No trivia in this example
}
}
Lexer Builder
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind { Number, Plus, Minus, Eof, Whitespace, Expr, Term, }
impl SyntaxKind for MySyntaxKind {
fn is_terminal(self) -> bool { !matches!(self, Self::Expr | Self::Term) }
fn is_trivia(self) -> bool { matches!(self, Self::Whitespace) }
}
use sipha::lexer::{LexerBuilder, Pattern, CharSet};
let lexer = LexerBuilder::new()
.token(MySyntaxKind::Number, Pattern::Repeat {
pattern: Box::new(Pattern::CharClass(CharSet::digits())),
min: 1,
max: None,
})
.token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
.trivia(MySyntaxKind::Whitespace)
.build(MySyntaxKind::Eof, MySyntaxKind::Number)?;
Grammar Builder
use sipha::syntax::SyntaxKind;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind { Number, Plus, Minus, Eof, Whitespace, Expr, Term, }
impl SyntaxKind for MySyntaxKind {
fn is_terminal(self) -> bool { !matches!(self, Self::Expr | Self::Term) }
fn is_trivia(self) -> bool { matches!(self, Self::Whitespace) }
}
use sipha::grammar::{GrammarBuilder, Expr, NonTerminal};
use sipha::lexer::Token as LexerToken;
use sipha::syntax::{TextRange, TextSize};
use std::convert::TryFrom;
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum MyNonTerminal { Expr, }
impl NonTerminal for MyNonTerminal {
fn name(&self) -> &str { match self { MyNonTerminal::Expr => "Expr", } }
}
// Helper to create tokens
fn token(kind: MySyntaxKind, text: &str, offset: u32) -> LexerToken<MySyntaxKind> {
let len = TextSize::from(u32::try_from(text.len()).unwrap_or(0));
LexerToken::new(kind, text, TextRange::at(TextSize::from(offset), len))
}
let grammar = GrammarBuilder::new()
.entry_point(MyNonTerminal::Expr)
.rule(MyNonTerminal::Expr, Expr::token(token(MySyntaxKind::Number, "1", 0)))
.build()?;
Grammar Expressions
| Expression | Syntax | Description |
|---|---|---|
| Token | Expr::token(token) | Match a specific token |
| Non-terminal | Expr::non_terminal(NT::Expr) | Match a non-terminal |
| Sequence | Expr::seq(vec![...]) | Match expressions in order |
| Choice | Expr::choice(vec![...]) | Match one of several alternatives |
| Optional | Expr::optional(Box::new(...)) | Match zero or one occurrence |
| Repeat | Expr::repeat(Box::new(...), min, max) | Match repeated occurrences |
| Empty | Expr::Empty | Match nothing (epsilon) |
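These constructors compose freely. As a sketch, a parenthesized, comma-separated argument list could be written as a single expression and then registered with .rule(MyNonTerminal::List, list_rule); the token values and the List/Arg non-terminals are placeholders:
// List -> "(" (Arg ("," Arg)*)? ")"
let list_rule = Expr::seq(vec![
    Expr::token(lparen_token),
    Expr::optional(Box::new(Expr::seq(vec![
        Expr::non_terminal(MyNonTerminal::Arg),
        Expr::repeat(
            Box::new(Expr::seq(vec![
                Expr::token(comma_token),
                Expr::non_terminal(MyNonTerminal::Arg),
            ])),
            0,
            None,
        ),
    ]))),
    Expr::token(rparen_token),
]);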
Parser Usage
use sipha::grammar::{GrammarBuilder, Grammar, Expr, NonTerminal};
use sipha::lexer::Token;
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
let grammar: Grammar<MyToken, MyNonTerminal> = todo!();
let tokens: Vec<Token<MySyntaxKind>> = vec![];
use sipha::backend::ll::{LlParser, LlConfig};
use sipha::backend::ParserBackend;
use sipha::syntax::SyntaxNode;
let config = LlConfig::default();
let mut parser = LlParser::new(&grammar, config)?;
let result = parser.parse(&tokens, MyNonTerminal::Expr);
if !result.errors.is_empty() {
// Handle errors
}
let root = SyntaxNode::new_root(result.root.clone());
Incremental Parsing
use sipha::backend::ll::LlParser;
use sipha::grammar::Grammar;
use sipha::lexer::Token;
use sipha::syntax::GreenNode;
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
let parser: LlParser = todo!();
let grammar: Grammar<MyToken, MyNonTerminal> = todo!();
let tokens: Vec<Token<MySyntaxKind>> = vec![];
let old_tree: Option<&GreenNode<MySyntaxKind>> = None;
let entry_point: MyNonTerminal = todo!();
use sipha::incremental::{IncrementalParser, TextEdit};
use sipha::syntax::{TextRange, TextSize};
let mut incremental = IncrementalParser::new(parser);
let edits = vec![TextEdit::replace(
TextRange::new(TextSize::from(0), TextSize::from(2)),
"new".into(),
)];
let result = incremental.parse_incremental(
&tokens,
old_tree,
&edits,
entry_point,
Some(&grammar),
);
Syntax Tree Navigation
use sipha::syntax::{SyntaxNode, GreenNode, SyntaxKind};
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MySyntaxKind { Expr, }
impl SyntaxKind for MySyntaxKind {
fn is_terminal(self) -> bool { false }
fn is_trivia(self) -> bool { false }
}
let green_node: GreenNode<MySyntaxKind> = todo!();
use sipha::syntax::SyntaxNode;
let root = SyntaxNode::new_root(green_node);
// Children
for child in root.children() {
println!("{:?}", child.kind());
}
// Descendants
for node in root.descendants() {
if node.kind() == MySyntaxKind::Expr {
// Found expression
}
}
// Parent
if let Some(parent) = root.parent() {
println!("Parent: {:?}", parent.kind());
}
// Next sibling
if let Some(sibling) = root.next_sibling() {
    println!("Sibling: {:?}", sibling.kind());
}
Error Handling
use sipha::backend::ParserBackend;
type MyToken = ();
type MyNonTerminal = ();
struct Parser;
impl ParserBackend<MyToken, MyNonTerminal> for Parser {
type Config = ();
type Error = ();
type State = ();
fn new(_: &(), _: ()) -> Result<Self, ()> { Ok(Parser) }
fn parse(&mut self, _: &[MyToken], _: MyNonTerminal) -> sipha::backend::ParseResult<MyToken, MyNonTerminal> { todo!() }
}
let mut parser = Parser;
let tokens: Vec<MyToken> = vec![];
let result = parser.parse(&tokens, MyNonTerminal::Expr);
// Check for errors
if !result.errors.is_empty() {
for error in &result.errors {
eprintln!("Error: {} at {:?}", error.message, error.span);
}
}
// Check for warnings
for warning in &result.warnings {
eprintln!("Warning: {}", warning.message);
}
Pattern Matching
| Pattern | Syntax | Example |
|---|---|---|
| Literal | Pattern::Literal("+".into()) | Match exact string |
| Char Class | Pattern::CharClass(CharSet::digits()) | Match character class |
| Repeat | Pattern::Repeat { pattern, min, max } | Match repetition |
| Regex | Pattern::Regex(regex) | Match regex pattern |
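For example, an identifier-like token can be built from the patterns above (the binding name is just illustrative):
use sipha::lexer::{Pattern, CharSet};
// One or more alphanumeric characters
let ident_pattern = Pattern::Repeat {
    pattern: Box::new(Pattern::CharClass(CharSet::alphanumeric())),
    min: 1,
    max: None,
};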
Common Character Sets
use sipha::lexer::CharSet;
let _ = CharSet::digits(); // [0-9]
let _ = CharSet::whitespace(); // \s
let _ = CharSet::letters(); // [a-zA-Z]
let _ = CharSet::alphanumeric(); // [a-zA-Z0-9]
let _ = CharSet::hex_digits(); // [0-9a-fA-F]
Backend Selection
| Use Case | Backend | Reason |
|---|---|---|
| Simple grammar | LL | Easy to use, good error messages |
| Left recursion | LR | Natural support |
| Ambiguous grammar | GLR | Handles ambiguity |
| Maximum performance | LR | Efficient table-based parsing |
Feature Flags
[dependencies]
sipha = { version = "0.5.0", features = [
"backend-ll", # LL(k) parser (default)
"backend-lr", # LR parser
"backend-glr", # GLR parser (requires backend-lr)
"diagnostics", # Rich error diagnostics
"unicode", # Unicode support
"visitor", # Visitor patterns
"query", # XPath-like queries
"tree-utils", # Tree utilities
] }
Common Patterns
Optional Semicolon
use sipha::grammar::{GrammarBuilder, Expr, NonTerminal};
use sipha::lexer::Token;
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
let semicolon_token: Token<MySyntaxKind> = todo!();
let mut builder = GrammarBuilder::new();
builder.rule(MyNonTerminal::Stmt, Expr::seq(vec![
Expr::non_terminal(MyNonTerminal::Expr),
Expr::optional(Box::new(Expr::token(semicolon_token))),
]))
Delimited List
use sipha::grammar::{GrammarBuilder, Expr, NonTerminal};
use sipha::lexer::Token;
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
let open_token: Token<MySyntaxKind> = todo!();
let close_token: Token<MySyntaxKind> = todo!();
let mut builder = GrammarBuilder::new();
builder.rule(MyNonTerminal::List, Expr::seq(vec![
Expr::token(open_token),
Expr::repeat(
Box::new(Expr::non_terminal(MyNonTerminal::Item)),
0,
None,
),
Expr::token(close_token),
]))
Operator Precedence
use sipha::grammar::{GrammarBuilder, Expr, NonTerminal};
use sipha::lexer::Token;
type MyToken = ();
type MyNonTerminal = ();
type MySyntaxKind = ();
let multiply_token: Token<MySyntaxKind> = todo!();
let plus_token: Token<MySyntaxKind> = todo!();
let mut builder = GrammarBuilder::new();
// Higher precedence first
builder.rule(MyNonTerminal::Expr, Expr::choice(vec![
Expr::seq(vec![ // * has higher precedence
Expr::non_terminal(MyNonTerminal::Term),
Expr::token(multiply_token),
Expr::non_terminal(MyNonTerminal::Term),
]),
Expr::seq(vec![ // + has lower precedence
Expr::non_terminal(MyNonTerminal::Expr),
Expr::token(plus_token),
Expr::non_terminal(MyNonTerminal::Term),
]),
Expr::non_terminal(MyNonTerminal::Term),
]))
See Also
- Getting Started - Step-by-step tutorial
- API Documentation - Full API reference
- Examples - Complete working examples