Skip to content

Execution optimizations and improvements#179

Merged
yushangdi merged 10 commits into
meta-pytorch:mainfrom
anubhavchaturvedi:optimizations
Apr 8, 2026
Merged

Execution optimizations and improvements#179
yushangdi merged 10 commits into
meta-pytorch:mainfrom
anubhavchaturvedi:optimizations

Conversation

@anubhavchaturvedi
Copy link
Copy Markdown
Contributor

@anubhavchaturvedi anubhavchaturvedi commented Apr 1, 2026

Performance and usability improvements for tlparse, yielding ~39% faster parsing and ~32% less memory on large logs.

Performance:

  • Eliminate double JSON parse per log line (parse Envelope only once)
  • Static regex compilation via Lazy statics
  • fs::copy for raw.log instead of loading entire file into memory
  • Pre-allocate buffers, reuse String across loop iterations, cache year computation
  • Avoid cloning Vec in compilation metrics path

Features:

  • Transparent gzip input support (.gz files decompress automatically)

Infrastructure:

  • Parse benchmark (benches/parse_benchmark.rs) measuring wall time and peak RSS
  • 3 new integration tests for gzip support

PR is split into smaller commits, each having appropriate description of changes.

Here are the benchmark results before vs after on a 327MB file.

Screenshot 2026-04-01 at 4 53 19 PM

Adds a standalone benchmark (benches/parse_benchmark.rs) that measures:
- Wall time statistics (mean/median/min/max) across configurable iterations
- Peak RSS via getrusage (cold-run and post-warmup measurements)
- Input file line count for context

Usage: TLPARSE_BENCH_INPUT=/path/to/log cargo bench --bench parse_benchmark

No production code changes. Dev-dependencies added: libc (RSS), tempfile (output dirs).

- Removed hardcoded machine-specific path; requires explicit input
- Added cold-run RSS measurement with documentation of ru_maxrss limitations
- Streaming line count instead of loading entire file into memory
- Write errors surfaced via expect() instead of silently swallowed
… year caching

Four localized optimizations with zero API changes:

1. Pre-allocate HTML string in anchor_source (parsers.rs)
   - Remove intermediate Vec<&str> from lines().collect(), iterate directly
   - Pre-allocate output with String::with_capacity(text.len() * 2 + 500)

2. Pre-allocate shortraw_content buffer (lib.rs)
   - Use String::with_capacity(file_size / 8) (~12.5% of input size)
   - Avoids ~30 reallocations for large logs

3. Reuse payload String across parse loop iterations (lib.rs)
   - Hoist payload_buf before loop, clear() each iteration
   - Retains allocated capacity, avoiding millions of small allocations

4. Compute year once before parse loop (lib.rs)
   - Move chrono::Utc::now().year() before format_timestamp closure
   - Eliminates one clock_gettime syscall per log line

Note: syntect lazy-init (SyntaxSet/ThemeSet) was already present in
the codebase via OnceLock, no change needed.
…lone

Three optimizations targeting the hottest paths:

1. Static regex compilation + CompileId helpers (types.rs)
   - Move RE_EVAL_WITH_KEY and RE_SEED_NSPID to module-level Lazy statics
   - Add normalize_attempt() for None->Some(0) migration
   - Add collapse_attempt() for unconditional attempt reset to 0
     (used in compilation_metrics and metrics_index lookups)

2. Eliminate double JSON parse per log line (lib.rs) — HIGHEST IMPACT
   - Parse each line as Envelope only once (was: Value + Envelope)
   - Shortraw (raw.jsonl) output now built by parsing as Value separately,
     inserting glog metadata, and re-serializing with sorted keys
   - Substring-based key-conflict detection as early bail-out before parse
   - Net effect: ~50% reduction in JSON parsing for the main loop

3. Avoid Vec<OutputFile> clone in CompilationMetrics (lib.rs, parsers.rs)
   - Two-phase borrow pattern: immutable slice borrow for parse, then
     mutable access for result processing
   - Changed CompilationMetricsParser.output_files from &Vec to &[OutputFile]
   - Eliminates clone of entire output file list per metrics entry

Output is byte-for-byte identical to baseline across all test logs.
Instead of loading the full input log into a String and passing it through
ParseOutput, the CLI now copies raw.log directly via std::fs::copy().

For a 500MB log, this saves ~500MB+ of heap allocation (String + UTF-8
validated copy). fs::copy uses kernel-level zero-copy (sendfile/copy_file_range).

Changes:
- lib.rs: Removed fs::read_to_string(path) and raw.log ParseOutput entry
- cli.rs: Added fs::copy(log_path, output_dir.join("raw.log")) after
  writing all ParseOutput entries

Note: raw.log is not listed in the non-breaking contract as a guaranteed
ParseOutput entry. Library callers using parse_path() directly will no
longer find raw.log in the returned Vec and should copy the input file
themselves if needed.
Detect .gz extension on input files and transparently decompress using
flate2::read::GzDecoder. This is purely additive — existing .log files
work identically.

Changes:
- lib.rs: Wrap file reader in GzDecoder when path ends in .gz, using
  Box<dyn io::Read> for unified handling
- cli.rs: Copy as raw.log.gz (not raw.log) for gzip inputs
- cli.rs: Accept .log.gz files in --all-ranks-html rank log discovery
  (tries .log.gz suffix before .log)
- Cargo.toml: Add flate2 = "1.0" dependency

Tests: 3 new integration tests covering library-level gzip parsing,
CLI raw.log.gz copying, and all-ranks .log.gz discovery.

Verified: gzip output is byte-for-byte identical to uncompressed
baseline for all test logs.
@meta-cla meta-cla Bot added the cla signed label Apr 1, 2026
Comment thread Cargo.toml Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also update the version here?

@yushangdi
Copy link
Copy Markdown
Contributor

can you summarize the optimziations you made in this PR?

@yushangdi yushangdi requested a review from aorenste April 2, 2026 00:50
Performance improvements in this release:
- ~39% faster parsing (median) on large logs
- ~32% less memory usage
- Transparent gzip input support (.gz files)
- fs::copy for raw.log (avoids loading entire file into memory)
@anubhavchaturvedi
Copy link
Copy Markdown
Contributor Author

Updated the summary, each commit also has the description of changes within it.

Copy link
Copy Markdown
Contributor

@aorenste aorenste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference - this would be a lot easier to review if it were split into separate PRs, not just separate commits on a single PR (github comments are based on the PR, not the commit). You can use ghstack to make a stack of related PRs easier to handle.

The main reason I'm requesting changes is for the version bump.

Comment thread Cargo.toml
[package]
name = "tlparse"
version = "0.4.8"
version = "0.4.9"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the raw.log BC breaking change we should probably bump the minor rev, not just the patch. ("0.5.0")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate what the breaking change you are observing? I tried to make sure these changes do not make need a minor version bump. I do have a lot more improvements planned (streamed parallel processing, use of rayon, refactoring the code to make it more modular, etc) that would have broken some API contracts so avoided as part of this change. Would like to take them on, but they seem more like a major version change to me. I strongly believe we should be able to parse large files even faster.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was the ParseOutput change - but I see you pointed out that it's not really part of the public contract (even though it's in a public function). If this was not a 0.x.x or if there were more uses of this as a library I'd push back but fair enough.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like after the current changes, raw.log will stop being produced. Ideally, we still want raw.log to be produced for BC reasons (among other data collection reasons).

Can we still produce raw.log?

if that will make the perf go down significantly, I'd propose to bump the major version number as @aorenste suggested. We can fix this internally by directly copy the input file as the output raw.log like in D99742022

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The raw.log file will be generated if the tlparse was created using the uncompressed raw.log file. It will not be generated if the users used the compressed file to begin with. This should be the expectation since currently tlparse does not support compressed file parsing. So if a user is using it, they already are well integrated where in a later version when we remove the uncompressed file support, their integration won't break

https://github.com/meta-pytorch/tlparse/pull/179/changes#diff-b2812f19576dd53d0c35b107a322f58a00fce6977f4b5976e1961853982af3ccR173

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the raw.log is moved to the cli layer? but the users could be calling tlparse from parse_path instead of the cli. In this case, raw.log is not produced anymore.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yushangdi for catching that. Fixed it and now we generate both files. I was testing primarily from the cli so missed this. Added tests to avoid it in future as well.

Comment thread src/lib.rs Outdated
Comment thread Cargo.toml Outdated
Comment thread src/lib.rs
Comment thread benches/parse_benchmark.rs
Comment thread benches/parse_benchmark.rs
Comment thread benches/parse_benchmark.rs
…hing

Move the key conflict detection in write_to_shortraw to after JSON parsing,
so it checks actual object keys rather than searching for patterns in string
values (which could cause false positives). Also remove duplicate flate2
dev-dependency.
Copy link
Copy Markdown
Contributor Author

@anubhavchaturvedi anubhavchaturvedi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not aware of ghstack but that will be useful for future PRs. Thanks for that 😄

Comment thread benches/parse_benchmark.rs
Comment thread benches/parse_benchmark.rs
Comment thread src/lib.rs
Comment thread Cargo.toml
[package]
name = "tlparse"
version = "0.4.8"
version = "0.4.9"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate what the breaking change you are observing? I tried to make sure these changes do not make need a minor version bump. I do have a lot more improvements planned (streamed parallel processing, use of rayon, refactoring the code to make it more modular, etc) that would have broken some API contracts so avoided as part of this change. Would like to take them on, but they seem more like a major version change to me. I strongly believe we should be able to parse large files even faster.

Comment thread src/lib.rs Outdated
When the input is already gzipped, copy it as raw.log.gz (unchanged).
When the input is plain text, copy it as raw.log and also write a
gzip-compressed raw.log.gz so downstream consumers always have a
compressed variant available.
Comment thread Cargo.toml
[package]
name = "tlparse"
version = "0.4.8"
version = "0.4.9"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like after the current changes, raw.log will stop being produced. Ideally, we still want raw.log to be produced for BC reasons (among other data collection reasons).

Can we still produce raw.log?

if that will make the perf go down significantly, I'd propose to bump the major version number as @aorenste suggested. We can fix this internally by directly copy the input file as the output raw.log like in D99742022

The fs::copy optimization moved raw.log production to the CLI layer,
which broke library users calling parse_path directly. Re-read the
input file at the end of parse_path so raw.log is always included in
ParseOutput (decompressing gzip inputs as needed).

Add comprehensive tests covering both library and CLI usage:
- Library tests for raw.log in ParseOutput, runtime analysis,
  chromium events, multi-rank landing, plain_text and custom_header
- CLI tests for --strict, --export, --inductor-provenance,
  --plain-text, --custom-header-html, --overwrite flags
- CLI/library output parity test
- Re-export OpRuntime for library consumers
@yushangdi yushangdi merged commit 70828e0 into meta-pytorch:main Apr 8, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants