2.1. Scanner (Lexical Layer)
The scanner (Scanner.lean, 594 lines)
converts a UTF-8 input string into a stream of YamlToken values.
It implements the spec's lexical-layer (L) productions and maintains
several pieces of state:
-
Indentation stack — tracks block-level nesting via column positions
-
Flow level counter — distinguishes block context from flow context (
[,],{,}) -
Simple key state — manages the YAML "simple key" lifecycle where a plain scalar may retroactively become a mapping key when
:is encountered -
Anchor map — records anchor names for alias resolution
-
Position tracking — offset, line, and column for every token
The scanner returns Except ScanError (Array YamlToken), where
ScanError carries position information and a human-readable message.
2.1.1. Token Types
The YamlToken inductive defines the full set of lexical elements:
-
Stream markers:
streamStart,streamEnd -
Document markers:
documentStart,documentEnd -
Structure indicators:
blockSequenceStart,blockMappingStart,blockEnd,flowSequenceStart,flowSequenceEnd,flowMappingStart,flowMappingEnd,key,value,blockEntry,flowEntry -
Content tokens:
scalar(with style tag: plain, single-quoted, double-quoted, literal, folded),anchor,alias,tag -
Placeholder tokens: used as reservation slots for simple key backpatching (filtered before output)
2.1.2. Append-Only Design
A critical design choice is the append-only token stream.
When the scanner encounters a potential simple key, it pushes two
placeholder tokens (for the future key and blockMappingStart
indicators) and records their indices.
If the key is later confirmed (by encountering :), these
placeholders are overwritten in-place via setIfInBounds.
If not confirmed, they remain as placeholders and are filtered
out before the token stream is returned.
This avoids insertAt operations that would shift array indices
and invalidate saved positions — a property that is essential for
the formal proof that the scanner makes monotonic progress.