L4YAML

2.1. Scanner (Lexical Layer)🔗

The scanner (Scanner.lean, 594 lines) converts a UTF-8 input string into a stream of YamlToken values. It implements the spec's lexical-layer (L) productions and maintains several pieces of state:

  • Indentation stack — tracks block-level nesting via column positions

  • Flow level counter — distinguishes block context from flow context ([, ], {, })

  • Simple key state — manages the YAML "simple key" lifecycle where a plain scalar may retroactively become a mapping key when : is encountered

  • Anchor map — records anchor names for alias resolution

  • Position tracking — offset, line, and column for every token

The scanner returns Except ScanError (Array YamlToken), where ScanError carries position information and a human-readable message.

2.1.1. Token Types🔗

The YamlToken inductive defines the full set of lexical elements:

  • Stream markers: streamStart, streamEnd

  • Document markers: documentStart, documentEnd

  • Structure indicators: blockSequenceStart, blockMappingStart, blockEnd, flowSequenceStart, flowSequenceEnd, flowMappingStart, flowMappingEnd, key, value, blockEntry, flowEntry

  • Content tokens: scalar (with style tag: plain, single-quoted, double-quoted, literal, folded), anchor, alias, tag

  • Placeholder tokens: used as reservation slots for simple key backpatching (filtered before output)

2.1.2. Append-Only Design🔗

A critical design choice is the append-only token stream. When the scanner encounters a potential simple key, it pushes two placeholder tokens (for the future key and blockMappingStart indicators) and records their indices. If the key is later confirmed (by encountering :), these placeholders are overwritten in-place via setIfInBounds. If not confirmed, they remain as placeholders and are filtered out before the token stream is returned.

This avoids insertAt operations that would shift array indices and invalidate saved positions — a property that is essential for the formal proof that the scanner makes monotonic progress.