Skip to content

fix: build request body up-front so http2 retry works on GOAWAY#275

Open
olegkrutikov wants to merge 1 commit into
go-telegram:mainfrom
olegkrutikov:fix-http2-goaway-retry
Open

fix: build request body up-front so http2 retry works on GOAWAY#275
olegkrutikov wants to merge 1 commit into
go-telegram:mainfrom
olegkrutikov:fix-http2-goaway-retry

Conversation

@olegkrutikov
Copy link
Copy Markdown

Problem

rawRequest streams the multipart request body through an io.Pipe. Pipe readers aren't replayable, so http.NewRequestWithContext cannot derive Request.GetBody, and http2.Transport therefore has no way to retry a POST when Telegram's server sends a GOAWAY frame mid-flight.

GOAWAY is a routine part of HTTP/2 connection draining — Telegram emits it during normal load-balancer reassignment / server rotation. Whenever it arrives between the time client.Do writes the request line and headers and the time it would receive the response, every in-flight POST on that connection fails with:

http2: Transport: cannot retry err [http2: Transport received Server's graceful shutdown GOAWAY]
after Request.Body was written; define Request.GetBody to avoid this error

Long-poll getUpdates is a GET (no body), so it isn't affected — net/http retries it transparently. The result is a bot that keeps receiving updates but silently fails every sendMessage / editMessageText / answerCallbackQuery for as long as the bad connection is reused. In one production deployment this manifested as the bot going completely mute for ~36 hours after Telegram rotated a server behind it; the journal showed the GOAWAY error fanning out across every outbound call simultaneously while getUpdates continued uninterrupted.

This affects every user of the library who reaches Telegram over HTTP/2 (i.e. nearly everyone — Go's default transport negotiates h2 via ALPN). The current workaround is to disable HTTP/2 client-side by setting Transport.TLSNextProto to an empty map, which forces HTTP/1.1 keep-alive and avoids the GOAWAY-retry trap entirely.

Fix

Build the multipart body into a bytes.Buffer and pass bytes.NewReader(buf.Bytes()) to NewRequestWithContext. net/http then auto-populates Request.GetBody and Request.ContentLength for *bytes.Reader, and http2.Transport retries transparently on GOAWAY.

var bodyBuf bytes.Buffer
form := multipart.NewWriter(&bodyBuf)

if params != nil && !reflect.ValueOf(params).IsNil() {
    if _, err := buildRequestForm(form, params); err != nil { return … }
    if err := form.Close(); err != nil { return … }
}

req, _ := http.NewRequestWithContext(ctx, http.MethodPost, u, bytes.NewReader(bodyBuf.Bytes()))

The goroutine that wrote into the pipe is gone, and so is the error-handling that closed it on partial failures. Both became unnecessary once the body is materialised before client.Do is called.

Trade-off

The entire request body is held in memory until the request is sent. For the vast majority of methods (sendMessage, editMessageText, callback answers, …) the body is a few KB. For file uploads the body is bounded by Telegram's bot-API per-request limit (~50 MB) and only held transiently — comparable to the OS socket buffer the streaming version would queue anyway, and small enough to be a non-issue on any modern host. If a future user needs zero-copy streaming for high-volume large uploads, the seam to reintroduce a streaming path is clear (branch on whether params contains an upload), but I'd argue the current behaviour was a strict bug for everyone else and shouldn't be preserved by default.

Test

Test_rawRequest_setsGetBody builds a sendMessage request, captures the resulting *http.Request, and asserts:

  • req.GetBody != nil (precondition for h2 retry).
  • req.ContentLength > 0 and matches the body length.
  • Two consecutive req.GetBody() calls each return bytes identical to req.Body.
  • The body contains the expected multipart fields (name="chat_id", name="text", the literal text value).

That covers the precise contract http2.Transport relies on; the actual retry-on-GOAWAY logic is exercised by net/http's own tests and doesn't need to be re-tested here.

go test ./..., go test -race, and go vet ./... all clean.

Reproduction

// Minimal repro — runs against a real Telegram bot.
// Without this PR, every send fails after the first GOAWAY.
ctx := context.Background()
b, _ := bot.New(token)
for i := 0; i < 1000; i++ {
    _, err := b.SendMessage(ctx, &bot.SendMessageParams{ChatID: chatID, Text: "ping"})
    if err != nil {
        log.Printf("send %d: %v", i, err)
    }
    time.Sleep(time.Second)
}

Run for long enough that Telegram cycles a connection (typically minutes to hours depending on which datacenter you hit). On main you'll eventually see the cannot retry err [http2: ... GOAWAY] error and every subsequent send on that connection fails. With this patch, client.Do retries silently and the loop keeps going.

rawRequest streamed the multipart body through an io.Pipe, which is
not replayable. http.NewRequestWithContext therefore couldn't derive
Request.GetBody, and http2.Transport had no way to retry a POST when
Telegram's server sent a GOAWAY frame mid-flight (a routine part of
HTTP/2 connection draining). Every POST issued on a draining
connection failed with:

  http2: Transport: cannot retry err [http2: Transport received
  Server's graceful shutdown GOAWAY] after Request.Body was written;
  define Request.GetBody to avoid this error

Long-poll getUpdates (a GET, no body) was unaffected, so a bot would
keep receiving updates while silently failing every sendMessage /
editMessageText / answerCallback for the duration of the bad
connection. In one production deployment this manifested as the bot
going completely mute for ~36 hours after Telegram rotated a server
behind it.

Build the body into a bytes.Buffer and pass bytes.NewReader to
NewRequestWithContext; net/http then auto-populates GetBody and
ContentLength, and http2.Transport retries transparently on GOAWAY.

Trade-off: the entire request body is held in memory until the
request is sent. Most methods are a few KB; file uploads are bounded
by Telegram's per-request limit (~50 MB) and only held transiently.

Tests: Test_rawRequest_setsGetBody asserts GetBody is non-nil,
ContentLength is correct, and two GetBody calls each return bytes
identical to the original body.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant