[AIMIGRAPHX-1017] Remove Q/DQ for Attention Ops#4900
Conversation
|
PR currently in draft as I am looking for some feedback on the premise of this change before I go and work on/clean up the actual implementation. |
There was a problem hiding this comment.
Pull request overview
This PR updates the FP8 quantization pipeline to avoid inserting Q/DQ pairs into dot -> softmax -> dot attention subgraphs so those regions can be fused later (improving performance and reducing scratch usage), and factors the attention-pattern matcher into a reusable header.
Changes:
- Extracts a reusable
match::dot_softmax_dotmatcher and uses it in GPU prefusion matching. - Adds a
skip_instructionsmechanism tocapture_arguments_passand wires it through FP8 quantization. - Detects attention regions in
quantize_fp8()and skips capture/QDQ insertion for those instructions.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/targets/gpu/prefuse_ops.cpp | Switches attention prefusion matching to the new shared dot_softmax_dot matcher. |
| src/quantize_8bits.cpp | Skips inserting capture ops (and therefore Q/DQ) for a provided set of instructions. |
| src/quantization.cpp | Detects attention regions and passes them into the capture pass to skip Q/DQ insertion. |
| src/include/migraphx/quantize_8bits.hpp | Extends capture_arguments_pass API to carry a skip set. |
| src/include/migraphx/match/dot_softmax_dot.hpp | Introduces a reusable matcher for undecomposed attention (dot -> softmax -> dot). |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #4900 +/- ##
===========================================
+ Coverage 92.66% 92.67% +0.01%
===========================================
Files 588 588
Lines 30397 30433 +36
===========================================
+ Hits 28165 28201 +36
Misses 2232 2232
🚀 New features to boost your workflow:
|
TedThemistokleous
left a comment
There was a problem hiding this comment.
Some questions more about the matcher predicate for the no input args case
The rest makes sense to me though - You're moving things to a separate file to reuse for the quant step. Just not sure if the match::any() is needed as its to broad instead of having something a bit more specific.
If we're assuming we have gemm->softmax->gemm already setup here then we can write something specific. let me know if this is incorrect though and you're using any() as a way to just grab everything
|
Changed to instead remove the Q/DQ pairs inside |
| if(compute_type != deq->get_shape().type()) | ||
| dot = m.insert_instruction( | ||
| deq, make_op("convert", {{"target_type", deq->get_shape().type()}}), dot); | ||
| m.replace_instruction(deq, dot); |
There was a problem hiding this comment.
Do the replace instruction outside this function in the apply(). I don't think its safe to modify the graph and write this as a "can_dequantize_gemm" since I this reads like you could potentially do a replace for gemm1 or gemm2 without doing the full modification for the other.
That way we do the replace at the end of apply() and use this as way to check if you can infact dequantize the instruction.
If that's the intent here that's fine, but its better to put anything that modifies the graph in the apply() in the case you have early returns like here.
Maybe a better way is to create a
"can_dequentize_gemm" splitting out the top check at 158 into a separate call, and then leave this but switch it to void() so you're apply the change after all checks are completed.
Motivation
Removes Q/DQ pairs for attention patterns so they can be fused.
Technical Details
Before:
After:
Changelog Category
Add a
CHANGELOG.mdentry for any option other thanNot Applicable