Commit a7d078f

committed
Merge branch '005-utf8-rebuild-warnings'
2 parents a353c60 + 89c3268 commit a7d078f

11 files changed

Lines changed: 1566 additions & 7 deletions

.github/copilot-instructions.md

Lines changed: 3 additions & 1 deletion
@@ -7,6 +7,8 @@ Auto-generated from all feature plans. Last updated: 2025-11-09
 - N/A (static site generation, no database for this feature) (002-footnote-noscript-fallback)
 - Perl 5.40+ (per Constitution Principle VII) + Template::Toolkit 3.102+, Text::Markdown::Blog (via Template::Plugin::Blogdown) (004-collapsible-sections)
 - N/A (static site generation, no runtime database for this feature) (004-collapsible-sections)
+- Perl 5.40+ + Template::Toolkit 3.102+, HTML::TokeParser::Simple, Text::Markdown::Blog (005-utf8-rebuild-warnings)
+- SQLite (build-time data only, in `db/ovid.db`) (005-utf8-rebuild-warnings)
 
 - Perl 5.40+ + Devel::Cover, Test::Most, Type::Tiny, Getopt::Long, SQLite (001-test-coverage-improvement)
 
@@ -70,9 +72,9 @@ Perl 5.40+: Follow standard conventions from constitution.md
 - All tasks must pass entire test suite before completion
 
 ## Recent Changes
+- 005-utf8-rebuild-warnings: Added Perl 5.40+ + Template::Toolkit 3.102+, HTML::TokeParser::Simple, Text::Markdown::Blog
 - 004-collapsible-sections: Added Perl 5.40+ (per Constitution Principle VII) + Template::Toolkit 3.102+, Text::Markdown::Blog (via Template::Plugin::Blogdown)
 - 004-collapsible-sections: Added Perl 5.40+ (per Constitution Principle VII) + Template::Toolkit 3.102+, Text::Markdown::Blog (via Template::Plugin::Blogdown)
-- 002-footnote-noscript-fallback: Added Perl 5.40+ (project standard per constitution) + Template Toolkit 2.x, Dialog.js (client-side for JS mode)
 
 
 <!-- MANUAL ADDITIONS START -->

bin/rebuild

Lines changed: 4 additions & 4 deletions
@@ -21,8 +21,7 @@ my $site = Ovid::Site->new(
 );
 $site->build;
 
-# Skip tests when building a single file
-system( 'prove', '-l', 't' ) unless $options{notest} || defined $options{file};
+system( 'prove', '-l', 't' ) unless $options{notest};
 exit 0;
 
 __END__
@@ -55,10 +54,11 @@ Only build this file.
 
 =item B<--release>
 
-Builds the site for release. This means that the site will be built, along with the search engine.
+Builds the site for release. This means that the site will be built, along
+with the search engine.
 
 =item B<--notest>
 
-Skips running the tests. This is useful if you are just trying to build the site quickly and don't care about the tests.
+Skips running the tests. Not recommended.
 
 =back

lib/Ovid/Site.pm

Lines changed: 4 additions & 2 deletions
@@ -70,7 +70,7 @@ package Ovid::Site {
     sub build ($self) {
         say STDERR "Preprocessing files ...";
         if ( $self->file ) {
-            return $self->build_single_file;
+            return $self->_build_single_file;
         }
         $self->_assert_tt_config;
         $self->_set_files('root');
@@ -84,7 +84,7 @@ package Ovid::Site {
         $self->_build_tinysearch if $self->release;
     }
 
-    sub build_single_file ($self) {
+    sub _build_single_file ($self) {
         printf STDERR "Rebuilding single file: %s\n", $self->file;
 
         $self->_clean_tmp_directory;
@@ -135,6 +135,8 @@ package Ovid::Site {
             '--src=tmp',
             '--dest=.',
             '--lib=include',
+            '--binmode' => 'utf8', # encoding of output file (same as _run_ttree)
+            '--encoding' => 'utf8', # encoding of input files (same as _run_ttree)
             $relative_file,
         );
 

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# Specification Quality Checklist: Fix UTF-8 Warnings in Single File Rebuild

**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2025-11-18
**Feature**: [spec.md](../spec.md)

## Content Quality

- [x] No implementation details (languages, frameworks, APIs)
- [x] Focused on user value and business needs
- [x] Written for non-technical stakeholders
- [x] All mandatory sections completed

## Requirement Completeness

- [x] No [NEEDS CLARIFICATION] markers remain
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified

## Feature Readiness

- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification

## Notes

All checklist items pass validation. The specification is complete and ready for planning phase.

**Validation Details**:
- Technical Context section provides necessary background without prescribing implementation
- All success criteria are measurable and technology-agnostic (e.g., "zero UTF-8-related warnings", "UTF-8 characters appear correctly")
- Edge cases address boundary conditions appropriately
- Constraints section clearly identifies limitations without implementation details
- No [NEEDS CLARIFICATION] markers present - the issue is well-understood from the error messages provided
Lines changed: 278 additions & 0 deletions
@@ -0,0 +1,278 @@
# Data Model: Fix UTF-8 Warnings in Single File Rebuild

**Feature**: 005-utf8-rebuild-warnings
**Date**: 2025-11-18
**Status**: Complete

## Overview

This feature involves fixing UTF-8 encoding handling rather than creating new data structures. The entities documented here represent the existing data flow and states that the fix must preserve and correct.

## Entities

### Template File

**Description**: Source template file containing UTF-8 encoded content.

**Attributes**:
- `filename` (string): Absolute path to template file (e.g., `root/blog/article.tt2markdown`)
- `raw_contents` (decoded string): File contents read with `:encoding(UTF-8)` layer, producing Perl character strings with UTF8 flag set
- `type` (enum): Template type - `'article'` or `'blog'`
- `title` (string): Article/blog title extracted from template metadata
- `slug` (string): URL-friendly identifier
- `tags` (array of strings): Topic tags for categorization

**Encoding State**:
- **Input**: UTF-8 bytes on disk
- **After `slurp()`**: Decoded character string (UTF8 flag set, contains Unicode code points)
- **Expected by HTML::TokeParser::Simple**: Requires `utf8_mode(1)` when processing decoded strings

**Validation Rules**:
- File must be valid UTF-8 (per FR-006)
- File must exist and be readable
- Must contain required template metadata (title, type, slug)

**State Transitions**:
1. **On Disk** → `slurp()` → **Decoded Character String**
2. **Decoded String** → `HTML::TokeParser::Simple->new()` + `utf8_mode(1)` → **Parsed HTML Tokens**
3. **Parsed Tokens** → preprocessing → **Preprocessed Content**
4. **Preprocessed Content** → Template Toolkit → **Generated HTML**
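
The transitions above can be sketched end to end with core modules only. This is a minimal illustration, not the project's code: `slurp_utf8`, `splat_utf8`, `slurp_raw`, and `round_trip` are hypothetical names, and the parsing, preprocessing, and Template Toolkit steps are replaced by a trivial placeholder that wraps the text in `<p>` tags.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the project's slurp(): read through the
# :encoding(UTF-8) layer so the result is a decoded character string.
sub slurp_utf8 {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
    local $/;
    return scalar <$fh>;
}

# Hypothetical counterpart: encode characters to UTF-8 bytes on write.
sub splat_utf8 {
    my ( $path, $chars ) = @_;
    open my $fh, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
    print {$fh} $chars;
    close $fh or die "Cannot close $path: $!";
}

# Read a file back as raw bytes, with no decoding at all.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot read $path: $!";
    local $/;
    return scalar <$fh>;
}

# Walk one byte string through: on disk -> characters -> placeholder
# processing -> bytes on disk. Returns (character count, output bytes).
sub round_trip {
    my ($bytes) = @_;

    my ( $in_fh, $in_path ) = tempfile( UNLINK => 1 );
    binmode $in_fh;
    print {$in_fh} $bytes;
    close $in_fh or die "Cannot close $in_path: $!";

    my $chars    = slurp_utf8($in_path);    # bytes on disk -> characters
    my $rendered = "<p>$chars</p>";         # placeholder for parse/preprocess/render

    my ( $out_fh, $out_path ) = tempfile( UNLINK => 1 );
    close $out_fh or die "Cannot close $out_path: $!";
    splat_utf8( $out_path, $rendered );     # characters -> bytes on disk

    return ( length($chars), slurp_raw($out_path) );
}

my ( $char_count, $out_bytes ) = round_trip("caf\xC3\xA9");
printf "%d characters in, %d bytes out\n", $char_count, length($out_bytes);
```

The point of the sketch is that decoding happens once on input and encoding once on output; everything in between works on characters.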

### Preprocessed Content

**Description**: Intermediate representation of template after preprocessing but before Template Toolkit rendering.

**Attributes**:
- `content` (decoded string): Template content with TOC, code blocks wrapped, and other macros expanded
- `toc_links` (array of strings): Generated table of contents HTML links
- `tags` (array of strings): Extracted tags from `{{TAGS}}` macro
- `encoding_state` (internal): Must remain as decoded character string throughout preprocessing

**Processing Steps**:
1. Parse markdown headers to build TOC
2. Wrap code blocks in Template Toolkit directives
3. Expand `{{TOC}}` macro
4. Extract and remove `{{TAGS}}` macro
5. Add anchors to headings for TOC links

**Encoding Requirements**:
- Input: Decoded character string from `slurp()`
- HTML parser: Must use `utf8_mode(1)` to handle decoded input
- Output: Decoded character string (preserves UTF-8 characters)
- Passed to Template Toolkit which handles encoding via `--binmode utf8`

### Generated HTML

**Description**: Final HTML output file after Template Toolkit processing.

**Attributes**:
- `filename` (string): Output path (e.g., `articles/article-name.html`)
- `content` (UTF-8 bytes): HTML with proper UTF-8 encoding
- `charset` (metadata): Must include `<meta charset="utf-8">` tag

**Encoding Requirements**:
- Template Toolkit writes UTF-8 bytes to disk via `--binmode utf8` configuration
- Browser interprets as UTF-8 via charset declaration
- All UTF-8 characters from source template must render correctly

**Validation**:
- Must be valid HTML5
- UTF-8 characters must display correctly in browsers
- No encoding corruption (mojibake)

## Data Flow Diagram

```
Template File (UTF-8 bytes on disk)
    ↓
slurp() [with :encoding(UTF-8) layer]
    ↓
Decoded Character String (UTF8 flag set, Unicode code points)
    ↓
HTML::TokeParser::Simple->new(string => $content)
    ↓
** FIX: parser->utf8_mode(1) ** ← Tells parser input is decoded
    ↓
Parse HTML (decode entities to Unicode)
    ↓
Preprocessed Content (decoded character string)
    ↓
Template Toolkit (--binmode utf8, --encoding utf8)
    ↓
Generated HTML (UTF-8 bytes on disk)
    ↓
Browser (interprets via charset=utf-8)
    ↓
Displayed Page (correct UTF-8 rendering)
```

## Encoding States

### State 1: Byte String (UTF8 flag NOT set)

**Characteristics**:
- Raw bytes from file system
- May contain UTF-8 encoded data, but Perl doesn't know this
- Each byte is 0-255
- HTML::Parser's **default expectation** (utf8_mode OFF)

**Example**:
```perl
open my $fh, '<:raw', $file or die "Cannot open $file: $!";  # No encoding layer
my $bytes = do { local $/; <$fh> };
# $bytes: UTF-8 bytes but UTF8 flag not set
# café stored as: "caf\xC3\xA9" (the é occupies two bytes, \xC3\xA9)
```

### State 2: Character String (UTF8 flag set) ← **Our Case**

**Characteristics**:
- Decoded Unicode code points
- Result of `:encoding(UTF-8)` layer or `decode_utf8()`
- Characters can be > 255
- HTML::Parser requires **utf8_mode(1)** for this state

**Example**:
```perl
open my $fh, '<:encoding(UTF-8)', $file or die "Cannot open $file: $!";  # Encoding layer
my $chars = do { local $/; <$fh> };
# $chars: Decoded Unicode characters with UTF8 flag
# café stored as Unicode code points: "caf\x{E9}"
```
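
The two states can be compared side by side without touching the file system. The following is a small sketch using only the core `Encode` module; `describe` is a hypothetical helper introduced for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: summarise a string's length and UTF8 flag.
sub describe {
    my ($s) = @_;
    return sprintf 'length=%d flag=%s', length($s),
        utf8::is_utf8($s) ? 'on' : 'off';
}

my $bytes = "caf\xC3\xA9";                # State 1: five UTF-8 bytes
my $chars = decode( 'UTF-8', $bytes );    # State 2: four decoded characters

print 'bytes: ', describe($bytes), "\n";  # bytes: length=5 flag=off
print 'chars: ', describe($chars), "\n";  # chars: length=4 flag=on

# Encoding the characters again yields the original bytes.
print "round trip ok\n" if encode( 'UTF-8', $chars ) eq $bytes;
```

The UTF8 flag is an internal detail of Perl's string storage; the length difference (5 versus 4) is the more reliable sign of which state a string is in.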

## Relationships

### Template File → Preprocessed Content (1:1)

Each template file is preprocessed exactly once per build, producing one preprocessed content representation.

**Processing Function**: `Ovid::Template::File::rewrite()`

**Encoding Constraint**: Must preserve UTF-8 character encoding throughout preprocessing.

### Preprocessed Content → Generated HTML (1:1)

Each preprocessed template is rendered by Template Toolkit to produce one HTML file.

**Processing Function**: Template Toolkit (`ttree`)

**Encoding Constraint**: `--binmode utf8` ensures output is UTF-8 bytes.

### Multiple Files → Tag Map (N:1)

Multiple templates can reference the same tag, aggregated in tag map for tag index pages.

**Not affected by this fix**: Tag handling occurs after UTF-8 is correctly processed.

## Constraints

### Encoding Constraints

1. **Input Constraint**: Template files MUST be valid UTF-8 (enforced by `slurp()` with `:encoding(UTF-8)`)
2. **Processing Constraint**: Decoded strings MUST use `utf8_mode(1)` when passed to HTML::TokeParser::Simple
3. **Output Constraint**: Template Toolkit MUST use `--binmode utf8` for correct output encoding
4. **Validation Constraint**: Invalid UTF-8 in input files MUST cause immediate failure with clear error (FR-006)

### Performance Constraints

Per ASM-006 and constraints section: Performance is explicitly ignored for single-file rebuilds. Correctness takes priority.

### Backward Compatibility Constraints

- Existing templates MUST continue to work without modification (CON-004)
- Full site rebuilds MUST not be affected (User Story 2)
- Generated HTML MUST be identical to pre-fix output (byte-for-byte where UTF-8 is already correct)

## Edge Cases

### Invalid UTF-8 Sequences

**Condition**: Template file contains invalid UTF-8 byte sequences.

**Expected Behavior**: `slurp()` with `:encoding(UTF-8)` layer will die with clear error message indicating file and problem (FR-006).

**Handling**: No special code needed; Perl's encoding layer handles this automatically.
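
Whether the `:encoding(UTF-8)` layer alone turns malformed input into a fatal error may depend on warning settings and on `$PerlIO::encoding::fallback`, so it is worth confirming against the project's `slurp()`. As a hedged sketch of what explicit strictness could look like, the hypothetical helper `decode_or_die` below decodes raw bytes with `Encode::FB_CROAK` and reports the file name on failure. It is an illustration of FR-006, not the project's implementation.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode bytes as strict UTF-8, or die with a
# message that names the offending file.
sub decode_or_die {
    my ( $bytes, $filename ) = @_;
    my $chars = eval {
        decode( 'UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    die "Invalid UTF-8 in $filename: $@" unless defined $chars;
    return $chars;
}

# Valid input decodes to four characters.
my $ok = decode_or_die( "caf\xC3\xA9", 'good.tt2markdown' );
print "decoded ", length($ok), " characters\n";

# A lone \xE9 byte is Latin-1, not UTF-8, so strict decoding fails.
my $bad = eval { decode_or_die( "caf\xE9", 'bad.tt2markdown' ) };
print "rejected: $@" unless defined $bad;
```

`LEAVE_SRC` keeps `decode` from modifying the input buffer when a check flag is set.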

### Mixed Byte/Character Strings

**Condition**: Code path accidentally creates mixed encoded/decoded strings.

**Expected Behavior**: HTML::Parser will warn or croak depending on `utf8_mode` setting.

**Prevention**: Ensure all file I/O uses consistent encoding layers. All `slurp()` calls use `:encoding(UTF-8)`.
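
One common way such a mix-up shows itself is double encoding: bytes that are already UTF-8 get treated as characters and encoded a second time. The sketch below reproduces that with the core `Encode` module; `encode_n_times` is a hypothetical helper used only to make the mistake explicit.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: apply UTF-8 encoding $n times in a row.
sub encode_n_times {
    my ( $string, $n ) = @_;
    $string = encode( 'UTF-8', $string ) for 1 .. $n;
    return $string;
}

my $chars = decode( 'UTF-8', "caf\xC3\xA9" );    # four decoded characters

# Correct: encode exactly once, at the output boundary.
my $once = encode_n_times( $chars, 1 );

# Mistake: the bytes \xC3 and \xA9 are re-read as the characters
# U+00C3 and U+00A9 and each is encoded again.
my $twice = encode_n_times( $chars, 2 );

printf "encoded once:  %d bytes\n", length($once);     # 5 bytes
printf "encoded twice: %d bytes\n", length($twice);    # 7 bytes
```

Viewed as UTF-8, the double-encoded result displays as `cafÃ©`, the classic mojibake pattern this section aims to prevent.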

### HTML Entities in UTF-8 Content

**Condition**: Template contains both UTF-8 characters (é) and HTML entities (&eacute;).

**Expected Behavior**: With `utf8_mode(1)`:
- UTF-8 character `é` → remains `é` (U+00E9)
- HTML entity `&eacute;` → decoded to `é` (U+00E9)
- Both produce identical output (correct)

**Without Fix**: Entity decoding in context of decoded UTF-8 causes corruption.

### Empty or Binary Files

**Condition**: Template file is empty or contains binary data.

**Expected Behavior**:
- Empty file: Processes successfully (no content to encode)
- Binary file: `:encoding(UTF-8)` layer fails with clear error (desired per edge cases in spec)

## Implementation Impact

### Modified Files

1. **`lib/Ovid/Template/File.pm`**
   - `_preprocess_macros()`: Add `$p->utf8_mode(1)` after line 204
   - Encoding state: Input is decoded string from `$self->_code`

2. **`lib/Ovid/Site.pm`**
   - HTML processing in sitemap generation: Add `$parser->utf8_mode(1)` after line 644
   - Encoding state: Input is decoded string from `slurp($file)`

3. **`lib/Text/Markdown/Blog.pm`**
   - External link processing: Add `$parser->utf8_mode(1)` after line 85
   - Encoding state: Input is decoded string passed to plugin

### No Changes Required

- **`lib/Less/Script.pm`**: `slurp()` and `splat()` already use `:encoding(UTF-8)` correctly
- **Template Toolkit configuration**: `--binmode utf8` already set
- **Template files**: No modifications needed to existing templates

## Testing Data Model

### Test Fixture: UTF-8 Template

**File**: `t/fixtures/utf8_test_template.tt2markdown`

**Required Content**:
- UTF-8 characters: Smart quotes "", em dash —, café, résumé
- HTML entities: `&eacute;`, `&mdash;`, `&hearts;`
- Markdown headers for TOC generation
- Code blocks to test `is_in_code` state handling

**Expected Output**:
- All UTF-8 characters render correctly in HTML
- All entities decode correctly to Unicode
- No encoding warnings in STDERR
- HTML includes `<meta charset="utf-8">`

### Test Assertions

**Encoding State Checks**:
- Input string has UTF8 flag set: `utf8::is_utf8($string)`
- Parser uses utf8_mode: Verified by absence of warnings
- Output HTML is valid UTF-8: Can be read back with `:encoding(UTF-8)`

**Functional Checks**:
- Generated HTML contains expected characters
- No "Parsing of undecoded UTF-8" warnings (FR-001)
- No "Wide character in print" warnings (FR-002)
- UTF-8 characters identical between single and full rebuilds (FR-004)
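
Checks of the form "no such warning was emitted" can be made by capturing warnings with a local `$SIG{__WARN__}` handler. The sketch below shows the idea for the "Wide character in print" case only, using core modules; `warnings_from` and `print_with_layer` are hypothetical helpers, and a real test would exercise the site build rather than a bare `print`.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Run a code ref and return every warning it emitted.
sub warnings_from {
    my ($code) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    $code->();
    return @warnings;
}

# Print a decoded string to a fresh temp file using the given layer.
sub print_with_layer {
    my ( $layer, $chars ) = @_;
    my ( $fh, $path ) = tempfile( UNLINK => 1 );
    binmode $fh, $layer or die "binmode failed: $!";
    print {$fh} $chars;
    close $fh or die "Cannot close $path: $!";
}

my $chars = "em dash: \x{2014}";    # contains a code point above 0xFF

my @raw  = warnings_from( sub { print_with_layer( ':raw',             $chars ) } );
my @utf8 = warnings_from( sub { print_with_layer( ':encoding(UTF-8)', $chars ) } );

my $raw_wide  = grep { /Wide character in print/ } @raw;
my $utf8_wide = grep { /Wide character in print/ } @utf8;

print "raw layer:   $raw_wide wide-character warning(s)\n";
print "utf-8 layer: $utf8_wide wide-character warning(s)\n";
```

In a test file, the same captured list could be handed to an assertion that fails if any entry matches the FR-001 or FR-002 warning text.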

## Summary

This data model documents the encoding states and transformations that occur during template processing. The fix ensures proper encoding layer handling at the HTML parsing stage by enabling `utf8_mode(1)` when the parser receives decoded character strings. This aligns the parser's expectations with the actual encoding state of the input data, eliminating warnings and preventing potential corruption.
