Lexers
Lexers convert raw text into tokens. Sipha provides a flexible lexer builder API that supports pattern matching, trivia handling, and DFA-based tokenization for performance.
Overview
The lexer is the first stage of parsing. It takes source text and converts it into a stream of tokens that the parser can consume. Sipha’s lexer uses a DFA (Deterministic Finite Automaton) for efficient tokenization.
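For example, the arithmetic lexer built in the next section turns the text "42 + 10" into Number, Whitespace, Plus, Whitespace, and Number tokens; the parser consumes the significant tokens and skips the whitespace trivia.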
Building a Lexer
Use LexerBuilder to construct a lexer:
use sipha::lexer::{LexerBuilder, Pattern, CharSet};
let lexer = LexerBuilder::new()
    .token(MySyntaxKind::Number, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::digits())),
        min: 1,
        max: None,
    })
    .token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
    .token(MySyntaxKind::Whitespace, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
        min: 1,
        max: None,
    })
    .trivia(MySyntaxKind::Whitespace)
    .build(MySyntaxKind::Eof, MySyntaxKind::Number)
    .expect("Failed to build lexer");
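All of the examples on this page assume a user-defined MySyntaxKind enum. A minimal sketch of what that might look like (hypothetical; the exact derives and trait impls depend on Sipha's syntax-kind requirements):

// Hypothetical sketch: just the variants used on this page
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MySyntaxKind {
    Number,
    Plus,
    Whitespace,
    Eof,
    // ... other kinds used in the examples below
}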
Patterns
Sipha supports several pattern types for matching tokens:
Literal Patterns
Match exact strings:
.token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
.token(MySyntaxKind::LParen, Pattern::Literal("(".into()))
Character Class Patterns
Match character ranges:
use sipha::lexer::CharSet;
// Match digits [0-9]
.token(MySyntaxKind::Number, Pattern::CharClass(CharSet::digits()))
// Match letters [a-zA-Z]
.token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::new(vec!['a'..='z', 'A'..='Z'])))
// Match whitespace
.token(MySyntaxKind::Whitespace, Pattern::CharClass(CharSet::whitespace()))
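Ranges can also be combined into a single class. A sketch of a set covering letters, digits, and underscore, assuming CharSet::new accepts any list of inclusive ranges, including single-character ones:

// One class for all identifier characters; '_'..='_' is a
// single-character range
let ident_chars = CharSet::new(vec!['a'..='z', 'A'..='Z', '0'..='9', '_'..='_']);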
Repeat Patterns
Match repeating patterns:
// Match one or more digits
.token(MySyntaxKind::Number, Pattern::Repeat {
    pattern: Box::new(Pattern::CharClass(CharSet::digits())),
    min: 1,
    max: None, // No maximum
})

// Match zero or more whitespace
.token(MySyntaxKind::Whitespace, Pattern::Repeat {
    pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
    min: 0,
    max: None,
})

// Match exactly 3 digits
.token(MySyntaxKind::ThreeDigits, Pattern::Repeat {
    pattern: Box::new(Pattern::CharClass(CharSet::digits())),
    min: 3,
    max: Some(3),
})
Regex Patterns
For complex patterns, use regex:
// Match floating point numbers
.token(MySyntaxKind::Float, Pattern::Regex(r"\d+\.\d+".into()))
// Match email addresses
.token(MySyntaxKind::Email, Pattern::Regex(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}".into()))
Any Pattern
Match any single character:
.token(MySyntaxKind::AnyChar, Pattern::Any)
Keywords
Keywords are reserved words that should be matched as literals rather than identifiers:
let lexer = LexerBuilder::new()
    .token(MySyntaxKind::Ident, Pattern::CharClass(CharSet::new(vec!['a'..='z', 'A'..='Z'])))
    .keyword("if", MySyntaxKind::If)
    .keyword("else", MySyntaxKind::Else)
    .keyword("while", MySyntaxKind::While)
    .build(MySyntaxKind::Eof, MySyntaxKind::Ident)
    .expect("Failed to build lexer");
Keywords are matched with higher priority than general patterns, so "if" will match as a keyword rather than an identifier.
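A quick sketch of the effect, assuming the usual maximal-munch behavior (an identifier that merely starts with a keyword still lexes as one identifier):

// "if" lexes as the If keyword: the keyword and the Ident pattern
// match the same length, and keywords have the higher priority
let kw_tokens = lexer.tokenize("if").expect("Failed to tokenize input");

// "iffy" lexes as a single Ident token, assuming the longest match
// is taken before keyword priority applies
let ident_tokens = lexer.tokenize("iffy").expect("Failed to tokenize input");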
Trivia
Trivia are tokens that don’t affect parsing but are preserved in the syntax tree (e.g., whitespace, comments):
let lexer = LexerBuilder::new()
    .token(MySyntaxKind::Whitespace, Pattern::CharClass(CharSet::whitespace()))
    .token(MySyntaxKind::Comment, Pattern::Regex(r"//.*".into()))
    .trivia(MySyntaxKind::Whitespace)
    .trivia(MySyntaxKind::Comment)
    .build(MySyntaxKind::Eof, MySyntaxKind::Number)
    .expect("Failed to build lexer");
The parser will automatically skip trivia during parsing, but they’ll still be available in the syntax tree for formatting and IDE features.
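For example, assuming the lexer above also defines the Number and Plus tokens from earlier, and that tokenize emits trivia tokens inline (the parser, not the lexer, skips them):

let tokens = lexer.tokenize("1 + 2 // sum").expect("Failed to tokenize input");

// Trivia is still in the stream; a consumer that only wants
// significant tokens can filter by kind
let significant: Vec<_> = tokens
    .iter()
    .filter(|t| t.kind != MySyntaxKind::Whitespace && t.kind != MySyntaxKind::Comment)
    .collect();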
Custom Matchers
For complex tokenization logic, use custom matchers:
let lexer = LexerBuilder::new()
    .custom_token(MySyntaxKind::String, |text, pos| {
        // Custom string matching logic
        if text[pos..].starts_with('"') {
            // Find the closing quote, stepping over escape sequences
            let mut end = pos + 1;
            while end < text.len() && text.as_bytes()[end] != b'"' {
                if text.as_bytes()[end] == b'\\' {
                    end += 2; // Skip escape sequence
                } else {
                    end += 1;
                }
            }
            if end < text.len() {
                // The matched length includes both quotes; the value excludes them
                Some((end - pos + 1, TokenValue::String(text[pos + 1..end].to_string())))
            } else {
                None // Unterminated string
            }
        } else {
            None
        }
    })
    .build(MySyntaxKind::Eof, MySyntaxKind::Number)
    .expect("Failed to build lexer");
Custom matchers return Option<(usize, TokenValue)> where:
- usize is the number of characters matched
- TokenValue is the token's value (if any)
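Custom matchers are also the escape hatch for token shapes no pattern or regex can express. A sketch of a matcher for nested /* ... */ block comments, assuming the same closure signature as above (BlockComment is a hypothetical kind):

.custom_token(MySyntaxKind::BlockComment, |text, pos| {
    let bytes = text.as_bytes();
    if !text[pos..].starts_with("/*") {
        return None;
    }
    let mut depth = 1; // nesting level of open comments
    let mut end = pos + 2;
    while end + 1 < text.len() && depth > 0 {
        if bytes[end] == b'/' && bytes[end + 1] == b'*' {
            depth += 1;
            end += 2;
        } else if bytes[end] == b'*' && bytes[end + 1] == b'/' {
            depth -= 1;
            end += 2;
        } else {
            end += 1;
        }
    }
    if depth == 0 {
        // Length covers the whole comment, delimiters included
        Some((end - pos, TokenValue::String(text[pos..end].to_string())))
    } else {
        None // Unterminated comment
    }
})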
Tokenization
Once the lexer is built, tokenize input:
let input = "42 + 10";
let tokens = lexer.tokenize(input)
    .expect("Failed to tokenize input");

for token in &tokens {
    println!("{:?}: {}", token.kind, token.text);
}
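For the input above, the loop prints something along these lines (assuming trivia tokens appear in the lexer's output; the exact Debug formatting of kinds may differ):

Number: 42
Whitespace:
Plus: +
Whitespace:
Number: 10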
DFA Construction
Sipha’s lexer uses a DFA (Deterministic Finite Automaton) for efficient tokenization. Each pattern is first compiled into an NFA (Nondeterministic Finite Automaton); the combined NFA is then converted into a DFA, giving O(n) tokenization performance. The toy scanner after the construction steps below illustrates the result.
The DFA construction process:
- Pattern to NFA: Each pattern is converted to an NFA
- NFA to DFA: The NFA is converted to a DFA using subset construction
- Optimization: The DFA is optimized for size and performance
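Here is that toy: a hand-rolled DFA, purely illustrative and unrelated to Sipha's generated tables, that accepts one or more ASCII digits while tracking the longest match, one transition per byte:

// Illustration only: a DFA for "one or more digits". Each byte costs
// a single transition, which is why DFA lexing is O(n).
fn run_dfa(input: &[u8]) -> Option<usize> {
    const START: u8 = 0;
    const ACCEPT: u8 = 1; // accepting: saw at least one digit
    const REJECT: u8 = 2;
    let mut state = START;
    let mut longest = None;
    for (i, &b) in input.iter().enumerate() {
        state = match (state, b) {
            (START | ACCEPT, b'0'..=b'9') => ACCEPT,
            _ => REJECT,
        };
        if state == REJECT {
            break;
        }
        longest = Some(i + 1); // longest match seen so far
    }
    // run_dfa(b"42 + 10") == Some(2): the DFA stops at the space
    longest
}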
Priority and Ambiguity
When multiple patterns could match the same input, Sipha uses priority to resolve ambiguity:
- Keywords have the highest priority (lowest number)
- Patterns are prioritized by order of definition (earlier = higher priority)
- Longest match wins when priorities are equal
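A sketch of the second rule in action, using only patterns shown earlier on this page:

let lexer = LexerBuilder::new()
    // Defined first, so it wins when both candidates match the same length
    .token(MySyntaxKind::Number, Pattern::CharClass(CharSet::digits()))
    // Fallback that matches any single character, digits included
    .token(MySyntaxKind::AnyChar, Pattern::Any)
    .build(MySyntaxKind::Eof, MySyntaxKind::Number)
    .expect("Failed to build lexer");

// On "5", both patterns match one character; Number was defined
// earlier, so its higher priority breaks the tie.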
Error Handling
The lexer returns LexerError for invalid input:
match lexer.tokenize(input) {
    Ok(tokens) => {
        // Process tokens
    }
    Err(error) => {
        match error.kind {
            LexerErrorKind::UnexpectedChar { .. } => {
                // Handle unexpected character
            }
            LexerErrorKind::InvalidNumber { reason } => {
                // Handle invalid number format
            }
            // ... other error kinds
        }
    }
}
Best Practices
- Order patterns carefully: More specific patterns should come before general ones
- Use keywords for reserved words: This ensures proper matching priority
- Mark trivia correctly: Ensure all trivia kinds are marked with .trivia()
- Use character classes efficiently: Prefer character classes over regex when possible
- Test edge cases: Test with empty input, very long input, and invalid input
Example: Complete Lexer
use sipha::lexer::{LexerBuilder, Pattern, CharSet};

let lexer = LexerBuilder::new()
    // Numbers
    .token(MySyntaxKind::Number, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::digits())),
        min: 1,
        max: None,
    })
    // Operators
    .token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
    .token(MySyntaxKind::Minus, Pattern::Literal("-".into()))
    .token(MySyntaxKind::Multiply, Pattern::Literal("*".into()))
    .token(MySyntaxKind::Divide, Pattern::Literal("/".into()))
    // Parentheses
    .token(MySyntaxKind::LParen, Pattern::Literal("(".into()))
    .token(MySyntaxKind::RParen, Pattern::Literal(")".into()))
    // Whitespace (trivia)
    .token(MySyntaxKind::Whitespace, Pattern::Repeat {
        pattern: Box::new(Pattern::CharClass(CharSet::whitespace())),
        min: 1,
        max: None,
    })
    .trivia(MySyntaxKind::Whitespace)
    .build(MySyntaxKind::Eof, MySyntaxKind::Number)
    .expect("Failed to build lexer");
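Exercising it with the tokenize API from earlier:

let tokens = lexer.tokenize("(1 + 2) * 3").expect("Failed to tokenize input");
for token in &tokens {
    println!("{:?}: {}", token.kind, token.text);
}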
Next Steps
- Learn about Grammars to see how tokens are used in parsing
- Explore Syntax Trees to see how tokens appear in trees
- Check out Custom Patterns for advanced lexer usage