Files
calctext/_bmad-output/implementation-artifacts/1-1-lexer-and-tokenizer.md
C. Cassel baee33b2f1 feat(epic): implement 1-1-lexer-and-tokenizer
Story: _bmad-output/implementation-artifacts/1-1-lexer-and-tokenizer.md
2026-03-16 22:46:15 -04:00

10 KiB

epic, story, title, status
epic story title status
1 1.1 Lexer & Tokenizer review

Epic 1 — Core Calculation Engine (Rust Crate)

Goal: Build calcpad-engine as a standalone Rust crate that powers all platforms. This is the foundation.

Story 1.1: Lexer & Tokenizer

As a CalcPad engine consumer, I want input lines tokenized into a well-defined token stream, So that the parser can build an AST from structured, unambiguous tokens rather than raw text.

Acceptance Criteria:

Given an input line containing an integer such as 42 When the lexer tokenizes the input Then it produces a single Number token with value 42 And no heap allocations occur for this simple expression

Given an input line containing a decimal number such as 3.14 When the lexer tokenizes the input Then it produces a single Number token with value 3.14

Given an input line containing a negative number such as -7 When the lexer tokenizes the input Then it produces tokens representing the negation operator and the number 7

Given an input line containing scientific notation such as 6.022e23 When the lexer tokenizes the input Then it produces a single Number token with value 6.022e23

Given an input line containing SI scale suffixes such as 5k, 2.5M, or 1B When the lexer tokenizes the input Then it produces Number tokens with values 5000, 2500000, and 1000000000 respectively

Given an input line containing currency symbols such as $20, €15, £10, ¥500, or R$100 When the lexer tokenizes the input Then it produces CurrencySymbol tokens paired with their Number tokens And multi-character symbols like R$ are recognized as a single token

Given an input line containing unit suffixes such as 5kg, 200g, or 3.5m When the lexer tokenizes the input Then it produces Number tokens followed by Unit tokens

Given an input line containing arithmetic operators +, -, *, /, ^, % When the lexer tokenizes the input Then it produces the corresponding Operator tokens

Given an input line containing natural language operators such as plus, minus, times, or divided by When the lexer tokenizes the input Then it produces the same Operator tokens as their symbolic equivalents And divided by is recognized as a single two-word operator

Given an input line containing a variable assignment such as x = 10 When the lexer tokenizes the input Then it produces an Identifier token, an Assign token, and a Number token

Given an input line containing a comment such as // this is a note When the lexer tokenizes the input Then it produces a Comment token containing the comment text And the comment token is preserved for display but excluded from evaluation

Given an input line containing plain text with no calculable expression When the lexer tokenizes the input Then it produces a Text token representing the entire line

Given an input line containing mixed content such as $20 in euro - 5% discount When the lexer tokenizes the input Then it produces tokens for the currency value, the conversion keyword, the currency target, the operator, the percentage, and the keyword And each token includes its byte span (start, end) within the input


Tasks/Subtasks

  • Task 1: Set up Rust crate and define Token types

    • 1.1: Initialize calcpad-engine Rust crate with Cargo.toml
    • 1.2: Define Span struct (start, end byte offsets)
    • 1.3: Define TokenKind enum (Number, Operator, Identifier, Assign, CurrencySymbol, Unit, Comment, Text, Keyword, Percent, LParen, RParen)
    • 1.4: Define Token struct with kind, span, and value representation
    • 1.5: Write unit tests for Token construction and Span ranges
  • Task 2: Implement core Lexer struct and scanning infrastructure

    • 2.1: Create Lexer struct holding input &str, cursor position, and token output
    • 2.2: Implement character peek, advance, and whitespace-skipping helpers
    • 2.3: Implement tokenize() method that dispatches to specific scanners based on current char
    • 2.4: Write tests verifying empty input, whitespace-only input
  • Task 3: Tokenize numbers (integers, decimals, scientific notation, SI suffixes)

    • 3.1: Implement integer scanning (sequence of digits)
    • 3.2: Implement decimal scanning (digits, dot, digits)
    • 3.3: Implement scientific notation scanning (e/E followed by optional +/- and digits)
    • 3.4: Implement SI scale suffix detection (k, M, B, T) and multiply value accordingly
    • 3.5: Write tests for integers, decimals, scientific notation, and SI suffixes (42, 3.14, 6.022e23, 5k, 2.5M, 1B)
  • Task 4: Tokenize operators (symbolic and natural language)

    • 4.1: Implement single-character operator scanning (+, -, *, /, ^, %)
    • 4.2: Implement parentheses scanning ( and )
    • 4.3: Implement natural language operator recognition (plus, minus, times, divided by, of)
    • 4.4: Handle divided by as a two-word operator
    • 4.5: Write tests for all symbolic operators and natural language equivalents
  • Task 5: Tokenize identifiers, assignments, currency symbols, and units

    • 5.1: Implement identifier scanning (alphabetic sequences)
    • 5.2: Implement = assignment operator scanning
    • 5.3: Implement currency symbol scanning (, €, £, ¥, and multi-char R)
    • 5.4: Implement unit suffix detection after numbers (kg, g, m, lb, etc.)
    • 5.5: Implement keyword detection (in, to, as, of, discount, off)
    • 5.6: Write tests for identifiers, assignments, currency symbols, units, keywords
  • Task 6: Tokenize comments and plain text fallback

    • 6.1: Implement // comment scanning (rest of line becomes Comment token)
    • 6.2: Implement plain text detection (lines with no calculable tokens become Text)
    • 6.3: Write tests for comment scanning and plain text fallback
  • Task 7: Integration tests for mixed content and span correctness

    • 7.1: Write integration test for x = 10 → [Identifier, Assign, Number]
    • 7.2: Write integration test for $20 in euro - 5% discount → full token stream
    • 7.3: Write integration test verifying byte spans on all tokens
    • 7.4: Write integration test for -7 → [Operator(Minus), Number(7)]
    • 7.5: Write integration test for edge cases (multiple spaces, trailing whitespace, empty input)
    • 7.6: Verify no heap allocations for simple expressions (use stack-based token collection)

Dev Notes

Architecture:

  • calcpad-engine is a standalone Rust library crate (lib.rs)
  • The lexer operates on a single line of input at a time (line-oriented)
  • Tokens must include byte spans for syntax highlighting and error reporting
  • Design for zero/minimal heap allocation on simple expressions
  • SI suffixes (k, M, B, T) are resolved at lex time — the Number token stores the scaled value
  • Natural language operators map to the same Operator variants as symbolic ones
  • The - in -7 is tokenized as a separate Minus operator, not part of the number (parser handles unary minus)
  • Currency symbols precede numbers; unit suffixes follow numbers
  • Multi-character currency symbols (R$) must be handled with lookahead
  • divided by requires two-word lookahead
  • Plain text detection is a fallback — if the lexer cannot produce any calculable tokens, the entire line becomes a Text token

Coding Standards:

  • Use #[derive(Debug, Clone, PartialEq)] on all public types
  • Use f64 for number representation (arbitrary precision comes in Story 1.4)
  • Minimize allocations — use &str slices into the input where possible
  • All public API must be documented with doc comments

Dev Agent Record

Implementation Plan:

  • Created calcpad-engine Rust library crate with two modules: token (types) and lexer (scanning)
  • Token types: Span, Operator enum, TokenKind enum (12 variants), Token struct
  • Lexer uses a byte-offset cursor scanning approach with peek/advance helpers
  • Unit suffixes after numbers use a "pending token" mechanism (Lexer stores pending Unit token emitted on next iteration)
  • SI suffixes (k, M, B, T) distinguished from unit suffixes by checking if followed by more alpha chars
  • Natural language operators and keywords matched via case-insensitive word-boundary matching
  • divided by handled with two-word lookahead
  • Plain text fallback: if no calculable tokens found, entire line becomes Text token
  • Red-green-refactor cycle followed: wrote failing tests first, then implementation, then cleanup

Debug Log:

  • Fixed empty/whitespace input returning Text instead of empty vec (trim check was placed after comment check)
  • Fixed clippy warning: op.clone() on Copy type Operator → replaced with *op
  • Fixed unit suffix tokens not being emitted: pending token was set but loop exited before checking it; restructured loop to check pending tokens at top of each iteration

Completion Notes:

  • All 39 tests pass (5 token tests + 34 lexer tests)
  • Zero clippy warnings
  • All 13 acceptance criteria satisfied with dedicated tests
  • Token types are well-documented with doc comments
  • Public API: tokenize(input: &str) -> Vec<Token> convenience function + Lexer::new().tokenize()

File List

  • Cargo.toml (new) — Rust crate manifest for calcpad-engine
  • src/lib.rs (new) — Crate root, re-exports public API
  • src/token.rs (new) — Token types: Span, Operator, TokenKind, Token with unit tests
  • src/lexer.rs (new) — Lexer implementation with 34 unit/integration tests

Change Log

  • 2026-03-16: Story 1.1 implemented — full lexer/tokenizer for CalcPad engine with 39 tests, all passing. Covers integers, decimals, scientific notation, SI suffixes, currency symbols, unit suffixes, operators (symbolic + natural language), identifiers, assignments, comments, plain text, keywords, percentages, parentheses, and mixed expressions with byte-span tracking.

Status

Current: review