10 KiB
epic, story, title, status
| epic | story | title | status |
|---|---|---|---|
| 1 | 1.1 | Lexer & Tokenizer | review |
Epic 1 — Core Calculation Engine (Rust Crate)
Goal: Build calcpad-engine as a standalone Rust crate that powers all platforms. This is the foundation.
Story 1.1: Lexer & Tokenizer
As a CalcPad engine consumer, I want input lines tokenized into a well-defined token stream, So that the parser can build an AST from structured, unambiguous tokens rather than raw text.
Acceptance Criteria:
Given an input line containing an integer such as 42
When the lexer tokenizes the input
Then it produces a single Number token with value 42
And no heap allocations occur for this simple expression
Given an input line containing a decimal number such as 3.14
When the lexer tokenizes the input
Then it produces a single Number token with value 3.14
Given an input line containing a negative number such as -7
When the lexer tokenizes the input
Then it produces tokens representing the negation operator and the number 7
Given an input line containing scientific notation such as 6.022e23
When the lexer tokenizes the input
Then it produces a single Number token with value 6.022e23
Given an input line containing SI scale suffixes such as 5k, 2.5M, or 1B
When the lexer tokenizes the input
Then it produces Number tokens with values 5000, 2500000, and 1000000000 respectively
Given an input line containing currency symbols such as $20, €15, £10, ¥500, or R$100
When the lexer tokenizes the input
Then it produces CurrencySymbol tokens paired with their Number tokens
And multi-character symbols like R$ are recognized as a single token
Given an input line containing unit suffixes such as 5kg, 200g, or 3.5m
When the lexer tokenizes the input
Then it produces Number tokens followed by Unit tokens
Given an input line containing arithmetic operators +, -, *, /, ^, %
When the lexer tokenizes the input
Then it produces the corresponding Operator tokens
Given an input line containing natural language operators such as plus, minus, times, or divided by
When the lexer tokenizes the input
Then it produces the same Operator tokens as their symbolic equivalents
And divided by is recognized as a single two-word operator
Given an input line containing a variable assignment such as x = 10
When the lexer tokenizes the input
Then it produces an Identifier token, an Assign token, and a Number token
Given an input line containing a comment such as // this is a note
When the lexer tokenizes the input
Then it produces a Comment token containing the comment text
And the comment token is preserved for display but excluded from evaluation
Given an input line containing plain text with no calculable expression
When the lexer tokenizes the input
Then it produces a Text token representing the entire line
Given an input line containing mixed content such as $20 in euro - 5% discount
When the lexer tokenizes the input
Then it produces tokens for the currency value, the conversion keyword, the currency target, the operator, the percentage, and the keyword
And each token includes its byte span (start, end) within the input
Tasks/Subtasks
-
Task 1: Set up Rust crate and define Token types
- 1.1: Initialize
calcpad-engineRust crate withCargo.toml - 1.2: Define
Spanstruct (start, end byte offsets) - 1.3: Define
TokenKindenum (Number, Operator, Identifier, Assign, CurrencySymbol, Unit, Comment, Text, Keyword, Percent, LParen, RParen) - 1.4: Define
Tokenstruct withkind,span, and value representation - 1.5: Write unit tests for Token construction and Span ranges
- 1.1: Initialize
-
Task 2: Implement core Lexer struct and scanning infrastructure
- 2.1: Create
Lexerstruct holding input&str, cursor position, and token output - 2.2: Implement character peek, advance, and whitespace-skipping helpers
- 2.3: Implement
tokenize()method that dispatches to specific scanners based on current char - 2.4: Write tests verifying empty input, whitespace-only input
- 2.1: Create
-
Task 3: Tokenize numbers (integers, decimals, scientific notation, SI suffixes)
- 3.1: Implement integer scanning (sequence of digits)
- 3.2: Implement decimal scanning (digits, dot, digits)
- 3.3: Implement scientific notation scanning (e/E followed by optional +/- and digits)
- 3.4: Implement SI scale suffix detection (k, M, B, T) and multiply value accordingly
- 3.5: Write tests for integers, decimals, scientific notation, and SI suffixes (42, 3.14, 6.022e23, 5k, 2.5M, 1B)
-
Task 4: Tokenize operators (symbolic and natural language)
- 4.1: Implement single-character operator scanning (+, -, *, /, ^, %)
- 4.2: Implement parentheses scanning ( and )
- 4.3: Implement natural language operator recognition (plus, minus, times, divided by, of)
- 4.4: Handle
divided byas a two-word operator - 4.5: Write tests for all symbolic operators and natural language equivalents
-
Task 5: Tokenize identifiers, assignments, currency symbols, and units
- 5.1: Implement identifier scanning (alphabetic sequences)
- 5.2: Implement
=assignment operator scanning - 5.3: Implement currency symbol scanning (
, €, £, ¥, and multi-char R) - 5.4: Implement unit suffix detection after numbers (kg, g, m, lb, etc.)
- 5.5: Implement keyword detection (in, to, as, of, discount, off)
- 5.6: Write tests for identifiers, assignments, currency symbols, units, keywords
-
Task 6: Tokenize comments and plain text fallback
- 6.1: Implement
//comment scanning (rest of line becomes Comment token) - 6.2: Implement plain text detection (lines with no calculable tokens become Text)
- 6.3: Write tests for comment scanning and plain text fallback
- 6.1: Implement
-
Task 7: Integration tests for mixed content and span correctness
- 7.1: Write integration test for
x = 10→ [Identifier, Assign, Number] - 7.2: Write integration test for
$20 in euro - 5% discount→ full token stream - 7.3: Write integration test verifying byte spans on all tokens
- 7.4: Write integration test for
-7→ [Operator(Minus), Number(7)] - 7.5: Write integration test for edge cases (multiple spaces, trailing whitespace, empty input)
- 7.6: Verify no heap allocations for simple expressions (use stack-based token collection)
- 7.1: Write integration test for
Dev Notes
Architecture:
calcpad-engineis a standalone Rust library crate (lib.rs)- The lexer operates on a single line of input at a time (line-oriented)
- Tokens must include byte spans for syntax highlighting and error reporting
- Design for zero/minimal heap allocation on simple expressions
- SI suffixes (k, M, B, T) are resolved at lex time — the Number token stores the scaled value
- Natural language operators map to the same
Operatorvariants as symbolic ones - The
-in-7is tokenized as a separate Minus operator, not part of the number (parser handles unary minus) - Currency symbols precede numbers; unit suffixes follow numbers
- Multi-character currency symbols (R$) must be handled with lookahead
divided byrequires two-word lookahead- Plain text detection is a fallback — if the lexer cannot produce any calculable tokens, the entire line becomes a Text token
Coding Standards:
- Use
#[derive(Debug, Clone, PartialEq)]on all public types - Use
f64for number representation (arbitrary precision comes in Story 1.4) - Minimize allocations — use
&strslices into the input where possible - All public API must be documented with doc comments
Dev Agent Record
Implementation Plan:
- Created
calcpad-engineRust library crate with two modules:token(types) andlexer(scanning) - Token types:
Span,Operatorenum,TokenKindenum (12 variants),Tokenstruct - Lexer uses a byte-offset cursor scanning approach with peek/advance helpers
- Unit suffixes after numbers use a "pending token" mechanism (Lexer stores pending Unit token emitted on next iteration)
- SI suffixes (k, M, B, T) distinguished from unit suffixes by checking if followed by more alpha chars
- Natural language operators and keywords matched via case-insensitive word-boundary matching
divided byhandled with two-word lookahead- Plain text fallback: if no calculable tokens found, entire line becomes Text token
- Red-green-refactor cycle followed: wrote failing tests first, then implementation, then cleanup
Debug Log:
- Fixed empty/whitespace input returning Text instead of empty vec (trim check was placed after comment check)
- Fixed clippy warning:
op.clone()on Copy typeOperator→ replaced with*op - Fixed unit suffix tokens not being emitted: pending token was set but loop exited before checking it; restructured loop to check pending tokens at top of each iteration
Completion Notes:
- All 39 tests pass (5 token tests + 34 lexer tests)
- Zero clippy warnings
- All 13 acceptance criteria satisfied with dedicated tests
- Token types are well-documented with doc comments
- Public API:
tokenize(input: &str) -> Vec<Token>convenience function +Lexer::new().tokenize()
File List
Cargo.toml(new) — Rust crate manifest for calcpad-enginesrc/lib.rs(new) — Crate root, re-exports public APIsrc/token.rs(new) — Token types: Span, Operator, TokenKind, Token with unit testssrc/lexer.rs(new) — Lexer implementation with 34 unit/integration tests
Change Log
- 2026-03-16: Story 1.1 implemented — full lexer/tokenizer for CalcPad engine with 39 tests, all passing. Covers integers, decimals, scientific notation, SI suffixes, currency symbols, unit suffixes, operators (symbolic + natural language), identifiers, assignments, comments, plain text, keywords, percentages, parentheses, and mixed expressions with byte-span tracking.
Status
Current: review