Follow-ups from #118: prime row split cache and check only last field for unterminated quotes #120

@stevehansen

Follow-ups from #118

One perf-related item surfaced during the code review for #118 and was deferred because it touches row-class internals that were scoped out of that PR. Filing it here so it isn't lost.

Eager split-then-discard in the multiline continuation loop

In Csv/CsvReader.Engine.cs::Enumerate (and EnumerateAsync), the multiline continuation loop calls options.Splitter.Split(line, options) to check for unterminated quotes. The final rawSplit is computed against the fully joined line but then discarded, so the yielded row's lazy RawSplitLine cache re-Splits the same string when the consumer first reads a field. That's one redundant Split per multiline row.

Pre-refactor ReadImpl did roughly the same redundant work (it constructed a new ReadLine per loop iteration and let RawSplitLine cache-warm via the loop condition), so this is not a strict regression, but it's a known optimization point.

Fix sketch. Add an internal PrimeRawSplit(IList<MemoryText>) method (or expose the rawSplitLine field as internal) on the row classes; have the engine call it after factory.Create(...) to seed the cache. This touches row-class internals that were intentionally out of scope for #118.
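
A rough illustration of the priming idea, not the actual row or engine code. CountingSplitter, SketchRow and SketchEngine are hypothetical stand-ins, fields are plain strings rather than MemoryText, and the splitter takes no options argument. Only the PrimeRawSplit name and the lazy RawSplitLine cache come from the description above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the splitter; counts calls so the saving is visible.
public sealed class CountingSplitter
{
    public int Calls { get; private set; }

    public IList<string> Split(string line)
    {
        Calls++;
        return line.Split(',');
    }
}

// Hypothetical stand-in for a row class with a lazy raw-split cache.
public sealed class SketchRow
{
    private readonly string line;
    private readonly CountingSplitter splitter;
    private IList<string> rawSplitLine = Array.Empty<string>();
    private bool isSplit;

    public SketchRow(string line, CountingSplitter splitter)
    {
        this.line = line;
        this.splitter = splitter;
    }

    // Seeds the cache with a split the engine has already computed.
    internal void PrimeRawSplit(IList<string> rawSplit)
    {
        rawSplitLine = rawSplit;
        isSplit = true;
    }

    // Splits on first access unless the cache was primed.
    public IList<string> RawSplitLine
    {
        get
        {
            if (!isSplit)
            {
                rawSplitLine = splitter.Split(line);
                isSplit = true;
            }
            return rawSplitLine;
        }
    }

    public string this[int index] => RawSplitLine[index];
}

// Hypothetical stand-in for the tail end of the multiline continuation loop.
public static class SketchEngine
{
    public static SketchRow CreateRow(string joinedLine, CountingSplitter splitter, bool prime)
    {
        // The engine already splits the joined line for the unterminated-quote check.
        IList<string> rawSplit = splitter.Split(joinedLine);

        var row = new SketchRow(joinedLine, splitter);
        if (prime)
        {
            // Hand the result to the row instead of discarding it.
            row.PrimeRawSplit(rawSplit);
        }
        return row;
    }
}
```

In this sketch, reading a field from an unprimed row triggers a second Split on the same string, while a primed row reuses the engine's result, so the splitter runs once per row.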

Resolved

The second deferred item — "only check the last field for unterminated quotes in multiline detection" — was applied in PR #121 commit 84e2c96 after Gemini independently flagged it during review. No follow-up needed for that part.
