So unroll is not "worse" in the meaningful sense. Its absolute latency is
larger because the workload is massively larger, while its orchestration cost
per submitted task is much lower.

## Fresh TensorMap Profiling

This section is the required profiling checkpoint for the current branch.
Any optimization that touches the manual-scope runtime hot path or the manual
paged-attention orchestration should refresh both this section and the
benchmark table above with new real-device data.

### Method

- local profiling-only rebuild with:
  - `PTO2_ORCH_PROFILING=1`
  - `PTO2_TENSORMAP_PROFILING=1`
- platform: `a2a3`
- device: `9`
- PTO-ISA commit: `d96c8784`
- rounds: `30`
- mode: `--skip-golden`
- AUTO runner:
  - `python examples/a2a3/tensormap_and_ringbuffer/paged_attention/test_paged_attention.py -p a2a3 -d 9 -n 30 --case <Case> --skip-golden`
- manual runner:
  - `python examples/scripts/run_example.py -k examples/a2a3/tensormap_and_ringbuffer/paged_attention_manual_scope/kernels -g examples/a2a3/tensormap_and_ringbuffer/paged_attention_manual_scope/golden.py -p a2a3 -d 9 -c d96c8784 -n 30 --case <Case> --skip-golden`
- parsing:
  - per-round device-log `=== Orchestrator Profiling ===` blocks
  - per-round device-log `=== TensorMap Lookup Stats ===` blocks
  - trimmed average for time fields, mean for lookup / insert counts (trim
    policy sketched below)
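
The trim policy is not pinned down above, so here is a minimal aggregation
sketch. It assumes "trimmed average" means dropping the single fastest and
slowest round before averaging; `trimmed_avg` and `mean` are hypothetical
helper names, not part of the actual parsing harness:

```cpp
// Hypothetical aggregation helpers, NOT the actual parsing harness. Assumes
// per-round samples were already extracted from the device-log blocks and
// that "trimmed average" drops the single min and max round (an assumption).
#include <algorithm>
#include <numeric>
#include <vector>

// Trimmed average for time fields: sort, drop one sample at each extreme,
// then average the rest.
double trimmed_avg(std::vector<double> samples) {
    if (samples.empty()) return 0.0;
    if (samples.size() < 3) {
        return std::accumulate(samples.begin(), samples.end(), 0.0) /
               static_cast<double>(samples.size());
    }
    std::sort(samples.begin(), samples.end());
    return std::accumulate(samples.begin() + 1, samples.end() - 1, 0.0) /
           static_cast<double>(samples.size() - 2);
}

// Plain mean for lookup / insert counts.
double mean(const std::vector<double>& samples) {
    if (samples.empty()) return 0.0;
    return std::accumulate(samples.begin(), samples.end(), 0.0) /
           static_cast<double>(samples.size());
}
```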

### Results

| Case | Mode | Tasks | `lookup+dep` Trim (us) | `tensormap_ins` Trim (us) | TensorMap Lookups Avg | TensorMap Inserts Avg | Profiled Submit Trim (us) | Full Orch Trim (us) |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| `Case1` | AUTO | 13 | 4.132 | 1.842 | 40.0 | 12.0 | 16.422 | 194.508 |
| `Case1` | MANUAL | 13 | 1.944 | 1.414 | 16.0 | 3.0 | 14.458 | 259.318 |
| `Case2` | AUTO | 33 | 6.320 | 2.638 | 105.0 | 32.0 | 24.368 | 210.274 |
| `Case2` | MANUAL | 33 | 2.598 | 1.728 | 41.0 | 8.0 | 21.560 | 285.182 |
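
As a cross-check, the relative deltas quoted in the next subsection follow
directly from the table rows. This snippet is illustrative only; `delta_pct`
is a hypothetical helper, not part of the harness:

```cpp
// Recomputes the MANUAL-vs-AUTO deltas quoted below from the table above.
// Negative means the manual path is lower / cheaper.
#include <cstdio>

static double delta_pct(double auto_v, double manual_v) {
    return (manual_v - auto_v) / auto_v * 100.0;
}

int main() {
    std::printf("Case1 lookups:    %+.1f%%\n", delta_pct(40.0, 16.0));       // -60.0%
    std::printf("Case1 inserts:    %+.1f%%\n", delta_pct(12.0, 3.0));        // -75.0%
    std::printf("Case1 lookup+dep: %+.1f%%\n", delta_pct(4.132, 1.944));     // -53.0%
    std::printf("Case2 lookup+dep: %+.1f%%\n", delta_pct(6.320, 2.598));     // -58.9%
    std::printf("Case1 submit:     %+.1f%%\n", delta_pct(16.422, 14.458));   // -12.0%
    std::printf("Case2 submit:     %+.1f%%\n", delta_pct(24.368, 21.560));   // -11.5%
    std::printf("Case1 full orch:  %+.1f%%\n", delta_pct(194.508, 259.318)); // +33.3%
    std::printf("Case2 full orch:  %+.1f%%\n", delta_pct(210.274, 285.182)); // +35.6%
    return 0;
}
```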

### What The Numbers Prove

- The manual-local TensorMap bypass is working.
  - `Case1`: lookups dropped from `40.0` to `16.0` (`-60.0%`), inserts
    dropped from `12.0` to `3.0` (`-75.0%`), and `lookup+dep` time dropped
    from `4.132us` to `1.944us` (`-53.0%`).
  - `Case2`: lookups dropped from `105.0` to `41.0` (`-61.0%`), inserts
    dropped from `32.0` to `8.0` (`-75.0%`), and `lookup+dep` time dropped
    from `6.320us` to `2.598us` (`-58.9%`).
  - The manual path still shows non-zero TensorMap traffic because boundary
    tensors still use TensorMap in v0. That is expected.
- The remaining non-unroll regression is no longer explained by TensorMap.
  - The profiled submit buckets are lower in manual mode
    (`-12.0%` in `Case1`, `-11.5%` in `Case2`).
  - But full orchestration time is still much higher
    (`+33.3%` in `Case1`, `+35.6%` in `Case2`).

### Why Full Orch Is Still Worse

The gap has moved out of the TensorMap buckets.

The current profiling points to two remaining likely hot regions, both
sketched after this list:

1. Orchestration-side explicit-dep construction.
   The manual paged-attention orchestration adds many `Arg.add_dep(...)`
   calls and threads task ids explicitly through the loop body.
2. Runtime explicit-dep validation and dedupe before the first profiled phase.
   `pto2_submit_mixed_task()` validates every explicit dep against the current
   scope and deduplicates it before the first `alloc/sync/lookup/insert` lap is
   recorded.
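
To make those two shapes concrete, here is a hedged sketch. `Arg.add_dep(...)`
and `pto2_submit_mixed_task()` are real names from the code pointers below,
but the `TaskId` / `Dep` / `Scope` types, the loop shape, and the quadratic
dedupe are illustrative assumptions, not the actual implementation:

```cpp
// Illustrative sketch only: Arg::add_dep and the validate/dedupe step are
// real concepts in this runtime, but every type and loop here is hypothetical.
#include <cstdint>
#include <vector>

using TaskId = std::uint32_t;
struct Dep { TaskId producer; };

struct Arg {
    std::vector<Dep> deps;
    // Hot region 1: orchestration-side explicit-dep construction. Each call
    // is a cheap append, but the manual loop body issues many of them and
    // must thread producer task ids through every iteration.
    void add_dep(TaskId producer) { deps.push_back(Dep{producer}); }
};

struct Scope {
    std::vector<TaskId> live_tasks;  // tasks currently valid in this scope
    bool contains(TaskId id) const {
        for (TaskId t : live_tasks) if (t == id) return true;
        return false;
    }
};

// Hot region 2: runtime-side validation + dedupe. In this sketch every
// explicit dep is checked against the scope and against the deps accepted so
// far, i.e. O(deps * scope) + O(deps^2) work, all spent before the first
// profiled alloc/sync/lookup/insert lap is recorded.
void validate_and_dedupe(const Scope& scope, std::vector<Dep>& deps) {
    std::vector<Dep> accepted;
    for (const Dep& d : deps) {
        if (!scope.contains(d.producer)) continue;  // drop out-of-scope deps
        bool seen = false;
        for (const Dep& a : accepted) {
            if (a.producer == d.producer) { seen = true; break; }
        }
        if (!seen) accepted.push_back(d);
    }
    deps.swap(accepted);
}
```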

So the next optimization target is no longer "remove more TensorMap work".
It is "make explicit-dep construction and validation cheaper".

### Code Pointers For The Current Design

- Current manual-scope-local classification:
  - [pto_orchestrator.cpp](/data/uvxiao/pto-runtime/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp#L143)
- Manual-local tensors bypass TensorMap lookup / insert here (a hedged sketch
  of that decision follows this list):
  - [pto_orchestrator.cpp](/data/uvxiao/pto-runtime/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp#L742)
- Explicit deps are validated and deduplicated here:
  - [pto_orchestrator.cpp](/data/uvxiao/pto-runtime/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp#L627)
- `Arg.add_dep(...)` storage is a simple append here:
  - [pto_types.h](/data/uvxiao/pto-runtime/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_types.h#L228)
- The manual paged-attention example that exercises this path is here:
  - [paged_attention_orch.cpp](/data/uvxiao/pto-runtime/examples/a2a3/tensormap_and_ringbuffer/paged_attention_manual_scope/kernels/orchestration/paged_attention_orch.cpp#L131)
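
For orientation, here is a minimal sketch of the bypass decision behind the
second pointer above; every name in it is hypothetical except the documented
idea that manual-scope-local tensors skip TensorMap while boundary tensors do
not:

```cpp
// Hedged sketch of the classification-driven bypass; the real logic lives at
// the pto_orchestrator.cpp links above, and these names are stand-ins.
struct Tensor {
    bool manual_scope_local;  // set by the classification pass linked above
    // ... real tensor state elided ...
};

enum class DepSource { ExplicitDeps, TensorMap };

DepSource resolve_dependency_source(const Tensor& t) {
    if (t.manual_scope_local) {
        // Bypass: dependencies come only from Arg.add_dep(...) edges, so no
        // TensorMap lookup or insert is issued for this tensor.
        return DepSource::ExplicitDeps;
    }
    // Boundary tensors still go through TensorMap in v0, which is why the
    // MANUAL rows in the table above show non-zero lookup / insert counts.
    return DepSource::TensorMap;
}
```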

## Current Implementation Gap

The runtime and manual paged-attention examples now match the core v0 alignment

The remaining work is performance-focused:

- reduce non-unroll manual orchestration cost without reintroducing TensorMap
  fallback for manual-local tensors
- focus the next optimization round on explicit-dep construction / validation,
  not on TensorMap lookup / insert removal
- keep the unroll gains while tightening the small-workload path

## Historical Optimization Notes