calctext/_bmad-output/implementation-artifacts/1-1-lexer-and-tokenizer.md
2026-03-16 19:54:53 -04:00

epic: 1
story: 1.1
title: Lexer & Tokenizer
status: draft

Epic 1 — Core Calculation Engine (Rust Crate)

Goal: Build calcpad-engine as a standalone Rust crate that powers all platforms. This is the foundation.

Story 1.1: Lexer & Tokenizer

As a CalcPad engine consumer,
I want input lines tokenized into a well-defined token stream,
so that the parser can build an AST from structured, unambiguous tokens rather than raw text.
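
One possible shape for that token stream is sketched below. The variant names Number, CurrencySymbol, Unit, Operator, Identifier, Assign, Comment, and Text come from the acceptance criteria; everything else (the `Op` enum, the `f64` payload, the `start`/`end` fields, and the `is_evaluable` helper) is an assumption made for illustration, not the crate's actual API.

```rust
/// Hypothetical operator set, one variant per symbol in the criteria.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Add,
    Subtract,
    Multiply,
    Divide,
    Power,
    Percent,
}

/// Token kinds named in the acceptance criteria. Text payloads borrow from
/// the input line, so building a token does not copy any strings.
#[derive(Debug, Clone, PartialEq)]
enum TokenKind<'a> {
    Number(f64), // assumption: the story does not fix a numeric type
    CurrencySymbol(&'a str),
    Unit(&'a str),
    Operator(Op),
    Identifier(&'a str),
    Assign,
    Comment(&'a str),
    Text(&'a str),
}

/// A token plus its byte offsets within the input line.
#[derive(Debug, Clone, PartialEq)]
struct Token<'a> {
    kind: TokenKind<'a>,
    start: usize,
    end: usize,
}

impl<'a> Token<'a> {
    /// Comments are kept for display but skipped by evaluation. Treating
    /// Text the same way is an assumption of this sketch.
    fn is_evaluable(&self) -> bool {
        !matches!(self.kind, TokenKind::Comment(_) | TokenKind::Text(_))
    }
}
```

The criteria also mention keyword tokens (for example the conversion keyword in the mixed-content case) without naming a token type for them, so the sketch leaves them out.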

Acceptance Criteria:

Given an input line containing an integer such as 42
When the lexer tokenizes the input
Then it produces a single Number token with value 42
And no heap allocations occur for this simple expression

Given an input line containing a decimal number such as 3.14
When the lexer tokenizes the input
Then it produces a single Number token with value 3.14

Given an input line containing a negative number such as -7
When the lexer tokenizes the input
Then it produces tokens representing the negation operator and the number 7

Given an input line containing scientific notation such as 6.022e23
When the lexer tokenizes the input
Then it produces a single Number token with value 6.022e23
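
The four numeric criteria above could be covered by one scanner, sketched here. The function names `scan_number` and `lex_number` and the `f64` value type are assumptions made for the example. A leading minus sign is deliberately not consumed, so -7 falls through to the operator rule and yields a negation token followed by the number 7.

```rust
/// Hypothetical helper: byte length of the numeric literal at the start of
/// `input`, or None when `input` does not begin with an ASCII digit.
fn scan_number(input: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    let mut i = 0;
    // Integer part: one or more digits.
    while i < bytes.len() && bytes[i].is_ascii_digit() {
        i += 1;
    }
    if i == 0 {
        return None;
    }
    // Optional fraction: '.' followed by at least one digit.
    if i + 1 < bytes.len() && bytes[i] == b'.' && bytes[i + 1].is_ascii_digit() {
        i += 2;
        while i < bytes.len() && bytes[i].is_ascii_digit() {
            i += 1;
        }
    }
    // Optional exponent: 'e' or 'E', optional sign, at least one digit.
    if i < bytes.len() && (bytes[i] == b'e' || bytes[i] == b'E') {
        let mut j = i + 1;
        if j < bytes.len() && (bytes[j] == b'+' || bytes[j] == b'-') {
            j += 1;
        }
        if j < bytes.len() && bytes[j].is_ascii_digit() {
            while j < bytes.len() && bytes[j].is_ascii_digit() {
                j += 1;
            }
            i = j; // commit the exponent only when it is well formed
        }
    }
    Some(i)
}

/// Scans a literal and parses it, returning (value, bytes consumed).
fn lex_number(input: &str) -> Option<(f64, usize)> {
    let len = scan_number(input)?;
    let value = input[..len].parse::<f64>().ok()?;
    Some((value, len))
}
```

Both helpers borrow from the input and return plain values, so nothing on this path needs the heap, which is in the spirit of the no-allocation criterion for 42. Whether `f64` is precise enough for the engine is a separate decision that the story leaves open.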

Given an input line containing scale suffixes such as 5k, 2.5M, or 1B (k for thousand, M for million, B for billion)
When the lexer tokenizes the input
Then it produces Number tokens with values 5000, 2500000, and 1000000000 respectively
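
A minimal sketch of the suffix handling follows. The names `scale_multiplier` and `lex_scaled` are hypothetical, and so is the disambiguation rule: the suffix only counts when no further letter follows it, so that 5kg is left for the unit rule instead of being read as 5000 followed by g. The story itself does not say how that overlap is resolved.

```rust
/// Hypothetical mapping from a scale suffix to its multiplier.
fn scale_multiplier(suffix: char) -> Option<f64> {
    match suffix {
        'k' => Some(1e3),
        'M' => Some(1e6),
        'B' => Some(1e9),
        _ => None,
    }
}

/// Applies a trailing scale suffix to a literal such as "2.5M".
/// Returns None when the text after the digits is not a lone scale suffix.
fn lex_scaled(input: &str) -> Option<f64> {
    // End of the leading run of digits and decimal points.
    let digits_end = input
        .find(|c: char| !(c.is_ascii_digit() || c == '.'))
        .unwrap_or(input.len());
    let base: f64 = input[..digits_end].parse().ok()?;
    let mut rest = input[digits_end..].chars();
    match rest.next() {
        None => Some(base), // plain number, no suffix
        Some(c) => {
            let mult = scale_multiplier(c)?;
            // Reject when more letters follow (e.g. the "g" in "kg").
            if rest.next().map_or(false, |n| n.is_alphabetic()) {
                return None;
            }
            Some(base * mult)
        }
    }
}
```

With this rule, `lex_scaled("5k")` gives 5000 while `lex_scaled("5kg")` gives None, leaving the input for a later rule to tokenize as a number and a unit.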

Given an input line containing currency symbols such as $20, €15, £10, ¥500, or R$100
When the lexer tokenizes the input
Then it produces CurrencySymbol tokens paired with their Number tokens
And multi-character symbols like R$ are recognized as a single token
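
Recognizing R$ as one token is a longest-match problem: if $ were tried first, R would be left over as stray text. One way to get this, sketched under the assumption of a small fixed symbol table (`CURRENCY_SYMBOLS` and `lex_currency` are illustrative names), is to order the table so longer symbols come first.

```rust
/// Symbols from the acceptance criterion, longest first so that "R$" is
/// tried before "$". The ordering is the point of this sketch.
static CURRENCY_SYMBOLS: [&str; 5] = ["R$", "$", "€", "£", "¥"];

/// Returns the currency symbol at the start of `input` together with the
/// remaining text, or None when no known symbol leads the input.
fn lex_currency(input: &str) -> Option<(&'static str, &str)> {
    CURRENCY_SYMBOLS
        .iter()
        .find(|sym| input.starts_with(**sym))
        // sym.len() is a byte length, so the slice stays on a char boundary
        // even for multi-byte symbols such as "€".
        .map(|sym| (*sym, &input[sym.len()..]))
}
```

For example, `lex_currency("R$100")` yields the symbol `"R$"` and the remainder `"100"`, which the number rule can then consume.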

Given an input line containing unit suffixes such as 5kg, 200g, or 3.5m
When the lexer tokenizes the input
Then it produces Number tokens followed by Unit tokens
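
Once the number has been consumed, the unit can be looked up from the text that follows it. The sketch below assumes a hypothetical table `UNITS` holding only the three units named in the criterion, and a hypothetical function `lex_unit` that is called on the remainder of the line right after a Number token.

```rust
/// Hypothetical unit table with only the units named in the criterion.
static UNITS: [&str; 3] = ["kg", "g", "m"];

/// Looks for a known unit at the start of `rest` (the text immediately
/// after a number). Returns the unit name and the bytes consumed.
fn lex_unit(rest: &str) -> Option<(&'static str, usize)> {
    // The candidate is the whole leading run of ASCII letters, so "meters"
    // is compared as one word and does not match "m".
    let end = rest
        .find(|c: char| !c.is_ascii_alphabetic())
        .unwrap_or(rest.len());
    let candidate = &rest[..end];
    UNITS
        .iter()
        .find(|u| **u == candidate)
        .map(|u| (*u, end))
}
```

Comparing the whole run of letters, instead of testing prefixes, avoids splitting an unrelated word such as "meters" into a unit plus leftover text.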

Given an input line containing arithmetic operators +, -, *, /, ^, %
When the lexer tokenizes the input
Then it produces the corresponding Operator tokens

Given an input line containing natural language operators such as plus, minus, times, or divided by
When the lexer tokenizes the input
Then it produces the same Operator tokens as their symbolic equivalents
And divided by is recognized as a single two-word operator
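
Both operator criteria can share one lookup table, as in the sketch below. The `Operator` variant names and the function `lex_operator` are assumptions, as is the decision to require exactly one space inside "divided by". Word operators are only accepted at a word boundary, so that "plush" is not read as plus followed by h.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operator {
    Add,
    Subtract,
    Multiply,
    Divide,
    Power,
    Percent,
}

/// Returns the operator at the start of `input` and the bytes consumed.
fn lex_operator(input: &str) -> Option<(Operator, usize)> {
    // The two-word phrase is listed first so it is consumed as one token.
    const TABLE: [(&str, Operator); 10] = [
        ("divided by", Operator::Divide),
        ("plus", Operator::Add),
        ("minus", Operator::Subtract),
        ("times", Operator::Multiply),
        ("+", Operator::Add),
        ("-", Operator::Subtract),
        ("*", Operator::Multiply),
        ("/", Operator::Divide),
        ("^", Operator::Power),
        ("%", Operator::Percent),
    ];
    for &(text, op) in TABLE.iter() {
        if let Some(rest) = input.strip_prefix(text) {
            let is_word = text.starts_with(|c: char| c.is_ascii_alphabetic());
            let runs_on = rest.starts_with(|c: char| c.is_alphanumeric());
            // A word operator must end at a word boundary.
            if is_word && runs_on {
                continue;
            }
            return Some((op, text.len()));
        }
    }
    None
}
```

Because the word and symbol forms map to the same `Operator` value, the parser never needs to know which spelling the user typed.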

Given an input line containing a variable assignment such as x = 10
When the lexer tokenizes the input
Then it produces an Identifier token, an Assign token, and a Number token

Given an input line containing a comment such as // this is a note
When the lexer tokenizes the input
Then it produces a Comment token containing the comment text
And the comment token is preserved for display but excluded from evaluation
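
A minimal way to separate the comment from the rest of the line is sketched here. The function name `split_comment` is hypothetical, and two behaviors are assumptions the story does not settle: that a comment may trail an expression on the same line, and that the stored comment text omits the `//` marker and surrounding whitespace.

```rust
/// Splits a line at the first "//" into (expression part, comment text).
/// The comment text excludes the marker and is trimmed of whitespace.
fn split_comment(line: &str) -> (&str, Option<&str>) {
    match line.find("//") {
        Some(pos) => (&line[..pos], Some(line[pos + 2..].trim())),
        None => (line, None),
    }
}
```

The evaluator would then see only the first element, while the display layer can still render the second. One known limitation of this sketch is that any `//` counts as a comment start, including one inside a URL.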

Given an input line containing plain text with no calculable expression
When the lexer tokenizes the input
Then it produces a Text token representing the entire line

Given an input line containing mixed content such as $20 in euro - 5% discount
When the lexer tokenizes the input
Then it produces tokens for the currency value ($20), the conversion keyword (in), the currency target (euro), the operator (-), the percentage (5%), and the trailing keyword (discount)
And each token includes its byte span (start, end) within the input
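
The span requirement can be illustrated without a full lexer. The sketch below only splits on whitespace and records byte offsets; it does not classify the chunks, and the names `Span` and `chunk_spans` are hypothetical. It also assumes half-open spans (`end` is one past the last byte), which the story does not specify.

```rust
/// Byte offsets into the input line; `end` is exclusive (an assumption).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

/// Splits `input` on whitespace and records each chunk's byte span.
fn chunk_spans(input: &str) -> Vec<(Span, &str)> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    // char_indices yields byte offsets, so multi-byte characters such as
    // "€" advance the offset by more than one.
    for (i, c) in input.char_indices() {
        match (c.is_whitespace(), start) {
            (false, None) => start = Some(i),
            (true, Some(s)) => {
                out.push((Span { start: s, end: i }, &input[s..i]));
                start = None;
            }
            _ => {}
        }
    }
    // Flush the final chunk when the line does not end in whitespace.
    if let Some(s) = start {
        out.push((Span { start: s, end: input.len() }, &input[s..]));
    }
    out
}
```

For the example line this gives six chunks, with `$20` at bytes 0 to 3 and `5%` at bytes 14 to 16. Spans are byte based, so in `€15 in usd` the first chunk covers bytes 0 to 5 even though it is three characters long. The `Vec` here allocates, so this is not the allocation-free path described for simple expressions.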