Support FA2 flash_attn_with_kvcache for XPU continuous batching by YangKai0616 · Pull Request #46028 · huggingface/transformers

YangKai0616 · 2026-05-18T10:38:02Z

What does this PR do?

XPU now supports flash_attn_with_kvcache in kernels-community/flash-attn2. This PR enables the corresponding path in continuous batching (CB). Results from the transformers pipeline test:

| label                  |   samples |   avg_in |   max_new |   time (s) |   tokens |   tok/s |   mem (GB) |
|------------------------|-----------|----------|-----------|------------|----------|---------|------------|
| fast_decode_off_varlen |         2 |      512 |      1024 |      59.91 |     2048 |   34.18 |       9.09 |
| fast_decode_on_kvcache |         2 |      512 |      1024 |      44.9  |     2048 |   45.61 |       9.09 |
| fast_decode_off_varlen |         8 |      512 |      1024 |      60.76 |     8192 |  134.82 |       9.09 |
| fast_decode_on_kvcache |         8 |      512 |      1024 |      35.11 |     8192 |  233.33 |       9.09 |
| fast_decode_off_varlen |         8 |     2048 |       512 |      31.04 |     4096 |  131.94 |       9.09 |
| fast_decode_on_kvcache |         8 |     2048 |       512 |      22.39 |     4096 |  182.92 |       9.09 |

fast_decode_on_kvcache speedup over fast_decode_off_varlen:

samples=2, avg_in=512, max_new=1024: 1.33x
samples=8, avg_in=512, max_new=1024: 1.73x
samples=8, avg_in=2048, max_new=512: 1.39x

Model: Qwen/Qwen3-0.6B.
Device: B60.
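
For reference, the speedup figures can be re-derived from the reported wall-clock times. The snippet below is a small sketch, not part of the PR: the `RUNS` dictionary and the helper name `speedups` are made up for illustration, and the numbers are copied from the results above. It assumes speedup is defined as the varlen time divided by the kvcache time for the same configuration.

```python
# Rows copied from the benchmark results: (samples, avg_in, max_new, time_s, tokens).
# RUNS and speedups() are illustrative names, not identifiers from the PR.
RUNS = {
    "fast_decode_off_varlen": [
        (2, 512, 1024, 59.91, 2048),
        (8, 512, 1024, 60.76, 8192),
        (8, 2048, 512, 31.04, 4096),
    ],
    "fast_decode_on_kvcache": [
        (2, 512, 1024, 44.90, 2048),
        (8, 512, 1024, 35.11, 8192),
        (8, 2048, 512, 22.39, 4096),
    ],
}


def speedups(runs=RUNS):
    """Return the kvcache-over-varlen speedup per configuration, rounded to 2 places."""
    off_rows = runs["fast_decode_off_varlen"]
    on_rows = runs["fast_decode_on_kvcache"]
    result = []
    for off, on in zip(off_rows, on_rows):
        # Both rows must describe the same workload before comparing times.
        assert off[:3] == on[:3] and off[4] == on[4]
        # Assumption: speedup = baseline time / optimized time.
        result.append(round(off[3] / on[3], 2))
    return result


if __name__ == "__main__":
    for (samples, avg_in, max_new, _, _), s in zip(RUNS["fast_decode_off_varlen"], speedups()):
        print(f"samples={samples} avg_in={avg_in} max_new={max_new}: {s}x")
```

In every row the generated token count equals samples × max_new, so the tok/s column is generated tokens divided by total time. Recomputing it from the rounded times can differ from the reported value in the last digit.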

Hi @ArthurZucker , pls help review, thanks!

YangKai0616 · 2026-05-20T07:07:31Z

Hi @vasqu , could you help review this PR related to attention, thank you!

YangKai0616 · 2026-05-25T09:33:29Z

Friendly ping @IlyasMoutawwakil .

vasqu

Careful approval, just a nit to maybe combine the branches into one but cc @remi-or if you could take a look (but should be safe)

HuggingFaceDocBuilderDev · 2026-05-26T08:37:27Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

IlyasMoutawwakil · 2026-05-26T09:50:17Z

+        # NOTE: For CUDA, block table should be available with FA2 and FA3, but there seems to be an issue with FA2 atm
+        cuda_available = torch.cuda.is_available()
+        fa_cuda = is_flash_attention_requested(config, version=3) and cuda_available


did you check if this is still an issue with fa2 ?

Tested flash_attention_2 on A100 and it works normally. However, paged|kernels-community/flash-attn2 fails at the kvcache kernel shape check. The current PR can first ensure XPU access to the CB pipeline.

Hmm, that's interesting. What version of FA2 did you use? Usually the kernels version shouldn't diverge from upstream

But yea anyways, this feels out of scope
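
The gating under discussion can be pictured with a small standalone sketch. This is not the code in `initialization.py`: the `Backend` dataclass and `supports_block_table` are hypothetical names. The FA3-on-CUDA condition mirrors the `fa_cuda` line in the diff, while the FA2-on-XPU condition is an assumption about what this PR enables.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Backend:
    """Hypothetical description of the runtime; not a transformers class."""

    device_type: str  # e.g. "cuda", "xpu", "cpu"
    fa_version: Optional[int]  # 2, 3, or None if flash attention is not requested


def supports_block_table(backend: Backend) -> bool:
    """Decide whether the kvcache (block table) decode path may be used.

    Assumptions, based on the thread rather than the merged code:
    - CUDA: only FA3 is enabled, since FA2 was reported to have an issue.
    - XPU: FA2 is enabled, which is what this PR is about.
    """
    fa3_cuda = backend.device_type == "cuda" and backend.fa_version == 3
    fa2_xpu = backend.device_type == "xpu" and backend.fa_version == 2
    return fa3_cuda or fa2_xpu


if __name__ == "__main__":
    for b in (Backend("cuda", 2), Backend("cuda", 3), Backend("xpu", 2), Backend("cpu", None)):
        print(b, "->", supports_block_table(b))
```

Under these assumptions, FA2 on CUDA keeps using the varlen path until the upstream issue is resolved.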

> Hmm, that's interesting. What version of FA2 did you use? Usually the kernels version shouldn't diverge from upstream

flash_attn 2.8.3

Could you also check the beta versions? I think they indirectly included some changes as well to FA2

Can we open a separate issue to track this? I would merge this PR after that, and we can track the FA2 problem separately.

Summary:

Test script is in repo https://github.com/YangKai0616/transformers/tree/tmp-fa-test, test command: pytest -ra tests/generation/test_continuous_batching.py::ContinuousBatchingWithAcceleratorTest -k test_flash_attn2_with_kvcache_parity.

Test Results:

flash_attn 2.8.3: passed
flash_attn 2.8.4 (pip install git+https://github.com/Dao-AILab/flash-attention.git@main): inconsistent output text
kernels-community/flash-attn2: triggered internal out CHECK_SHAPE error

Tested with kernels 0.14.1. The failure should be unrelated to kernels; it's an internal issue with FA.
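
A parity check of the kind the test name suggests boils down to comparing the token sequences produced by the two decode paths and reporting where they first differ. The helper below is an illustrative sketch only: `first_divergence` is a hypothetical name and this is not the test from the linked branch.

```python
from typing import Optional, Sequence


def first_divergence(reference: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if the sequences match.

    `reference` would come from the varlen path and `candidate` from the kvcache
    path (an assumption about how such a parity test is set up).
    """
    for i, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return i
    # Same common prefix but different lengths: they diverge where the shorter one ends.
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


if __name__ == "__main__":
    print(first_divergence([1, 2, 3], [1, 2, 3]))
    print(first_divergence([1, 2, 3], [1, 9, 3]))
```

Reporting the first diverging index, rather than only pass/fail, helps tell an early numerical mismatch from a late one caused by accumulated drift.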

Thanks for surfacing this. huggingface/kernels-community#877 -- let's wait for this one to get merged and reconvene.

Sure, can we merge the current PR for XPU first? Thanks!

Opened an issue over here: huggingface/kernels-community#894. I think we will try to sync with the latest FA2 release instead of the main branch. As you also showed, main is not stable.

Merging this PR, thanks for the discussion! Very valuable imo

vasqu · 2026-05-26T13:45:15Z

Only waiting for Remi, but LGTM overall!

remi-or · 2026-05-26T15:01:43Z

Thanks for the ping, LGTM!

Support FA2 flash_attn_with_kvcache for XPU continuous batching

0ac2283

vasqu approved these changes May 25, 2026

Comment thread src/transformers/generation/continuous_batching/initialization.py

YangKai0616 added 4 commits May 26, 2026 06:50

Update according to the comments

1439bd2

simplify the code

2c3cd05

Merge branch 'main' into FA2kvcache-XPU

a87a19c

Code quality

629e8a7

IlyasMoutawwakil reviewed May 26, 2026

vasqu requested a review from remi-or May 26, 2026 13:44

Merge branch 'main' into FA2kvcache-XPU

7e38eef

vasqu mentioned this pull request May 27, 2026

[FA2] Sync status and issues with FA2 on main huggingface/kernels-community#894

Open

vasqu enabled auto-merge May 27, 2026 15:59

vasqu added this pull request to the merge queue May 27, 2026

Merged via the queue into huggingface:main with commit a65bf6c May 27, 2026
30 checks passed

stevhliu mentioned this pull request Jun 1, 2026

[docs] xpu continuous batching #46334

Merged


6 participants