Custom Patterns
This chapter shows how to extend lexer patterns with custom matching logic.
Custom Matchers
For complex tokenization logic, use custom matchers:
let lexer = LexerBuilder::new()
    .custom_token(MySyntaxKind::String, |text, pos| {
        // Custom string matching
        if text[pos..].starts_with('"') {
            let mut end = pos + 1;
            // Scan for the closing quote, skipping escaped characters
            while end < text.len() && text.as_bytes()[end] != b'"' {
                if text.as_bytes()[end] == b'\\' {
                    end += 2; // Skip escape sequence
                } else {
                    end += 1;
                }
            }
            if end < text.len() {
                // `end` now points at the closing quote
                let value = text[pos + 1..end].to_string();
                // Length includes both quotes
                Some((end - pos + 1, TokenValue::String(value)))
            } else {
                None // Unterminated string
            }
        } else {
            None
        }
    })
    .build(MySyntaxKind::Eof, MySyntaxKind::Ident)
    .expect("Failed to build lexer");
Custom Matcher Signature
Custom matchers have this signature:
Fn(&str, usize) -> Option<(usize, TokenValue)>
Where:
- Input: &str is the source text and usize is the starting position
- Output: Option<(usize, TokenValue)>, where usize is the length of the match (in bytes) and TokenValue is the token's value (if any)
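The signature is easy to satisfy for single-character tokens. The following sketch matches one @ sign; MySyntaxKind::At is a hypothetical kind used only for illustration:

.custom_token(MySyntaxKind::At, |text, pos| {
    // Match exactly one '@' at the current position
    if text[pos..].starts_with('@') {
        Some((1, TokenValue::None)) // one byte matched, no attached value
    } else {
        None
    }
})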
TokenValue
Token values can be:
pub enum TokenValue {
    Integer(i64),
    Float(f64),
    String(String),
    None,
}
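For example, a matcher for decimal integers can attach the parsed number to its token via TokenValue::Integer. This is a minimal sketch, reusing the MySyntaxKind::Number kind from the examples below:

.custom_token(MySyntaxKind::Number, |text, pos| {
    // Find the end of the digit run starting at `pos`
    let end = text[pos..]
        .find(|c: char| !c.is_ascii_digit())
        .map(|off| pos + off)
        .unwrap_or(text.len());
    if end > pos {
        // `.ok()?` turns integer overflow into a failed match
        let value = text[pos..end].parse::<i64>().ok()?;
        Some((end - pos, TokenValue::Integer(value)))
    } else {
        None
    }
})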
Examples
String Literal
.custom_token(MySyntaxKind::String, |text, pos| {
    if !text[pos..].starts_with('"') {
        return None;
    }
    let mut end = pos + 1;
    while end < text.len() {
        match text.as_bytes()[end] {
            b'"' => {
                let value = text[pos + 1..end].to_string();
                return Some((end - pos + 1, TokenValue::String(value)));
            }
            b'\\' if end + 1 < text.len() => {
                end += 2; // Skip escape sequence
            }
            _ => end += 1,
        }
    }
    None // Unterminated string
})
Floating Point Number
.custom_token(MySyntaxKind::Float, |text, pos| {
    let mut end = pos;
    let mut has_dot = false;
    // Optional sign
    if end < text.len() && (text.as_bytes()[end] == b'+' || text.as_bytes()[end] == b'-') {
        end += 1;
    }
    // Digits before the dot
    while end < text.len() && text.as_bytes()[end].is_ascii_digit() {
        end += 1;
    }
    // Decimal dot
    if end < text.len() && text.as_bytes()[end] == b'.' {
        has_dot = true;
        end += 1;
    }
    // Digits after the dot
    while end < text.len() && text.as_bytes()[end].is_ascii_digit() {
        end += 1;
    }
    if has_dot && end > pos {
        // `.ok()?` rejects degenerate input such as a lone "." or "+."
        let value = text[pos..end].parse::<f64>().ok()?;
        Some((end - pos, TokenValue::Float(value)))
    } else {
        None
    }
})
Multiline Comment
.custom_token(MySyntaxKind::Comment, |text, pos| {
    if !text[pos..].starts_with("/*") {
        return None;
    }
    let mut end = pos + 2;
    while end + 1 < text.len() {
        if text.as_bytes()[end] == b'*' && text.as_bytes()[end + 1] == b'/' {
            // Token value is the comment body, without the delimiters
            let value = text[pos + 2..end].to_string();
            return Some((end - pos + 2, TokenValue::String(value)));
        }
        end += 1;
    }
    None // Unterminated comment
})
Best Practices
- Return early: Return None quickly if the pattern doesn't match
- Handle edge cases: Handle EOF, invalid sequences, etc.
- Escape sequences: Properly handle escape sequences
- Performance: Keep matching logic efficient
- Error handling: Return None for invalid input
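Taken together, these guidelines suggest a shape like the following sketch, a matcher for hexadecimal literals (MySyntaxKind::HexNumber is a hypothetical kind):

.custom_token(MySyntaxKind::HexNumber, |text, pos| {
    // Return early if the prefix doesn't match
    if !text[pos..].starts_with("0x") {
        return None;
    }
    let mut end = pos + 2;
    // Bounds-check every index to handle EOF safely
    while end < text.len() && text.as_bytes()[end].is_ascii_hexdigit() {
        end += 1;
    }
    if end == pos + 2 {
        return None; // "0x" with no digits is invalid input
    }
    // `.ok()?` rejects values that overflow i64
    let value = i64::from_str_radix(&text[pos + 2..end], 16).ok()?;
    Some((end - pos, TokenValue::Integer(value)))
})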
Combining with Regular Patterns
Custom matchers work alongside regular patterns:
let lexer = LexerBuilder::new()
    // Regular patterns
    .token(MySyntaxKind::Number, Pattern::CharClass(CharSet::digits()))
    .token(MySyntaxKind::Plus, Pattern::Literal("+".into()))
    // Custom matcher
    .custom_token(MySyntaxKind::String, |text, pos| {
        // Custom logic goes here, e.g. the string matcher shown above
        None
    })
    .build(MySyntaxKind::Eof, MySyntaxKind::Ident)
    .expect("Failed to build lexer");
Priority
Custom matchers share the same priority scheme as regular patterns: tokens are tried in definition order.
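For example, assuming earlier definitions take precedence as stated above, registering a float matcher before the plain digit pattern lets "3.14" lex as a single Float token rather than a Number, a dot, and another Number. Here float_matcher stands for the closure from the floating-point example:

let lexer = LexerBuilder::new()
    // Defined first, so tried first
    .custom_token(MySyntaxKind::Float, float_matcher)
    // Only wins when the float matcher returns None
    .token(MySyntaxKind::Number, Pattern::CharClass(CharSet::digits()))
    .build(MySyntaxKind::Eof, MySyntaxKind::Ident)
    .expect("Failed to build lexer");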