Tip
For information about how to run the HTTP server, see our documentation on deploying models.
When you run a Docker image built by Cog, it serves an HTTP API for making predictions.
The server supports both synchronous and asynchronous prediction creation:
- Synchronous: The server waits until the prediction is completed and responds with the result.
- Asynchronous: The server immediately returns a response and processes the prediction in the background.
The client can create a prediction asynchronously
by setting the Prefer: respond-async header in their request
or by requesting a streamed response with Accept: text/event-stream.
With Prefer: respond-async,
the server responds immediately after starting the prediction
with 202 Accepted status and a prediction object in status starting.
With Accept: text/event-stream,
the server responds with 200 OK and keeps the response open
as a server-sent event stream.
Note
For JSON responses, the only supported way to receive updates on the status of predictions started asynchronously is using webhooks. Polling for prediction status is not currently supported.
You can also use certain server endpoints to create predictions idempotently, such that if a client calls this endpoint more than once with the same ID (for example, due to a network interruption) while the prediction is still running, no new prediction is created. Instead, the client receives the response type requested by the retry: JSON for regular requests or a server-sent event stream for streaming requests.
Here's a summary of the prediction creation endpoints:
| Endpoint | Header | Behavior |
|---|---|---|
POST /predictions |
- | Synchronous, non-idempotent |
POST /predictions |
Prefer: respond-async |
Asynchronous, non-idempotent |
POST /predictions |
Accept: text/event-stream |
Streaming, non-idempotent |
PUT /predictions/<prediction_id> |
- | Synchronous, idempotent |
PUT /predictions/<prediction_id> |
Prefer: respond-async |
Asynchronous, idempotent |
PUT /predictions/<prediction_id> |
Accept: text/event-stream |
Streaming, idempotent |
Choose the endpoint that best fits your needs:
- Use synchronous endpoints when you want to wait for the prediction result.
- Use asynchronous endpoints when you want to start a prediction and receive updates via webhooks.
- Use streaming endpoints when you want to receive prediction lifecycle events over the HTTP response as they happen.
- Use idempotent endpoints when you need to safely retry requests without creating duplicate predictions.
To produce streamed prediction events,
the model must return an iterator and opt in to SSE streaming
with the streaming decorator.
from typing import Iterator
from cog import BaseRunner, Input, streaming
class Runner(BaseRunner):
@streaming
def run(self, prompt: str = Input(description="Prompt")) -> Iterator[str]:
for token in generate_tokens(prompt):
yield tokenThe decorator can also be written as @cog.streaming
or, if imported directly from cog, @streaming.
The parenthesized forms @cog.streaming() and @streaming() are also accepted.
Without the decorator,
iterator outputs still work in normal JSON responses,
but requests with Accept: text/event-stream return 406 Not Acceptable.
To consume a streamed prediction,
send the prediction request with Accept: text/event-stream:
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Accept: text/event-stream
{
"input": {"prompt": "Write a haiku about onions"}
}The server starts the prediction asynchronously
and keeps the HTTP response open as a server-sent event stream.
Each event has an event name and JSON data payload:
event: start
data: {"id":"abc123","status":"processing"}
event: output
data: {"chunk":"Onions","index":0}
event: output
data: {"chunk":" bloom","index":1}
event: completed
data: {"id":"abc123","status":"succeeded","output":["Onions"," bloom"],"metrics":{"predict_time":0.42}}
Prediction streams can emit these event types:
start: The prediction started processing.output: The model yielded an output chunk. The payload includeschunkandindex.log: The model wrote tostdoutorstderr. The payload includessourceanddata.metric: The model recorded a custom metric. The payload includesname,value, andmode.completed: The prediction reached a terminal state. The payload is the final prediction object, withstatusset tosucceeded,failed, orcanceled.
For command-line clients, use a client that prints the response as data arrives:
curl -N \
-H 'Accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{"input":{"prompt":"Write a haiku about onions"}}' \
http://localhost:5000/predictionsFor browser clients,
use fetch() or another client that supports request bodies.
The browser EventSource API only supports GET requests,
so it cannot create a prediction with POST /predictions or
PUT /predictions/<prediction_id>.
const response = await fetch("/predictions", {
method: "POST",
headers: {
"Content-Type": "application/json",
Accept: "text/event-stream",
},
body: JSON.stringify({ input: { prompt: "Write a haiku about onions" } }),
});
const reader = response.body.pipeThrough(new TextDecoderStream()).getReader();
while (true) {
const { value, done } = await reader.read();
if (done) break;
console.log(value);
}Use PUT /predictions/<prediction_id> when the client needs safe retries
or wants to reconnect to an in-flight prediction by ID:
PUT /predictions/wjx3whax6rf4vphkegkhcvpv6a HTTP/1.1
Content-Type: application/json; charset=utf-8
Accept: text/event-stream
{
"input": {"prompt": "Write a haiku about onions"}
}If the prediction is still running,
the server returns a stream for the existing prediction
instead of creating a duplicate prediction.
If earlier events have been dropped from the replay buffer,
the stream emits an error event and closes.
The replay buffer keeps the most recent 1024 events by default.
Set COG_STREAM_HISTORY_CAPACITY to change this limit,
or set it to 0 to disable replay history while keeping live streaming enabled.
Training endpoints do not support SSE streaming;
requests to /trainings with Accept: text/event-stream
return 406 Not Acceptable.
You can provide a webhook parameter in the client request body
when creating a prediction.
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async
{
"input": {"prompt": "A picture of an onion with sunglasses"},
"webhook": "https://example.com/webhook/prediction"
}The server makes requests to the provided URL with the current state of the prediction object in the request body at the following times.
start: Once, when the prediction starts (statusisstarting).output: Each time a run function generates an output (either once usingreturnor multiple times usingyield)logs: Each time the run function writes tostdoutcompleted: Once, when the prediction reaches a terminal state (statusissucceeded,canceled, orfailed)
Webhook requests for start and completed event types
are sent immediately.
Webhook requests for output and logs event types
are sent at most once every 500ms.
This interval is not configurable.
By default, the server sends requests for all event types.
Clients can specify which events trigger webhook requests
with the webhook_events_filter parameter in the prediction request body.
For example,
the following request specifies that webhooks are sent by the server
only at the start and end of the prediction:
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async
{
"input": {"prompt": "A picture of an onion with sunglasses"},
"webhook": "https://example.com/webhook/prediction",
"webhook_events_filter": ["start", "completed"]
}Endpoints for creating and canceling a prediction idempotently
accept a prediction_id parameter in their path.
By default, the server runs one prediction at a time,
but this can be increased with the concurrency.max setting.
When all prediction slots are in use, the server returns 409 Conflict.
The client should ensure prediction slots are available
before creating a new prediction with a different ID.
Clients are responsible for providing unique prediction IDs.
We recommend generating a UUIDv4 or UUIDv7,
base32-encoding that value,
and removing padding characters (==).
This produces a random identifier that is 26 ASCII characters long.
>> from uuid import uuid4
>> from base64 import b32encode
>> b32encode(uuid4().bytes).decode('utf-8').lower().rstrip('=')
'wjx3whax6rf4vphkegkhcvpv6a'A model's run function can produce file output by yielding or returning
a cog.Path or cog.File value.
By default, files are returned as a base64-encoded data URL.
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
{
"input": {"prompt": "A picture of an onion with sunglasses"},
}HTTP/1.1 200 OK
Content-Type: application/json
{
"status": "succeeded",
"output": "data:image/png;base64,..."
}When creating a prediction synchronously,
the client can configure a base URL to upload output files to instead
by setting the output_file_prefix parameter in the request body:
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
{
"input": {"prompt": "A picture of an onion with sunglasses"},
"output_file_prefix": "https://example.com/upload",
}When the model produces a file output, the server sends the following request to upload the file to the configured URL:
PUT /upload HTTP/1.1
Host: example.com
Content-Type: multipart/form-data
--boundary
Content-Disposition: form-data; name="file"; filename="image.png"
Content-Type: image/png
<binary data>
--boundary--If the upload succeeds, the server responds with output:
HTTP/1.1 200 OK
Content-Type: application/json
{
"status": "succeeded",
"output": "http://example.com/upload/image.png"
}If the upload fails, the server responds with an error.
Important
File uploads for predictions created asynchronously
require --upload-url to be specified when starting the HTTP server.
Returns a discovery document listing available API endpoints, the OpenAPI schema URL, and version information.
GET / HTTP/1.1HTTP/1.1 200 OK
Content-Type: application/json
{
"cog_version": "0.17.0",
"docs_url": "/docs",
"openapi_url": "/openapi.json",
"shutdown_url": "/shutdown",
"healthcheck_url": "/health-check",
"predictions_url": "/predictions",
"predictions_idempotent_url": "/predictions/{prediction_id}",
"predictions_cancel_url": "/predictions/{prediction_id}/cancel"
}If training is configured, the response also includes
trainings_url, trainings_idempotent_url, and trainings_cancel_url fields.
Returns the current health status of the model container.
This endpoint always responds with 200 OK —
check the status field in the response body to determine readiness.
The response body is a JSON object with the following fields:
status: One of the following values:STARTING: The model'ssetup()method is still running.READY: The model is ready to accept predictions.BUSY: The model is ready but all prediction slots are in use.SETUP_FAILED: The model'ssetup()method raised an exception.DEFUNCT: The model encountered an unrecoverable error.UNHEALTHY: The model is ready but a user-definedhealthcheck()method returnedFalse.
setup: Setup phase details (included once setup has started):started_at: ISO 8601 timestamp of when setup began.completed_at: ISO 8601 timestamp of when setup finished (if complete).status: One ofstarting,succeeded, orfailed.logs: Output captured during setup.
version: Runtime version information:coglet: Coglet version.cog: Cog Python SDK version (if available).python: Python version (if available).
user_healthcheck_error: Error message from a user-definedhealthcheck()method (if applicable).
GET /health-check HTTP/1.1HTTP/1.1 200 OK
Content-Type: application/json
{
"status": "READY",
"setup": {
"started_at": "2025-01-01T00:00:00.000000+00:00",
"completed_at": "2025-01-01T00:00:05.000000+00:00",
"status": "succeeded",
"logs": ""
},
"version": {
"coglet": "0.17.0",
"cog": "0.14.0",
"python": "3.13.0"
}
}The OpenAPI specification of the API, which is derived from the input and output types specified in your model's Predictor and Training objects.
Makes a single prediction.
The request body is a JSON object with the following fields:
input: A JSON object with the same keys as the arguments to therun()function. AnyFileorPathinputs are passed as URLs.
The response body is a JSON object with the following fields:
status: Eithersucceededorfailed.output: The return value of therun()function.error: Ifstatusisfailed, the error message.metrics: An object containing prediction metrics. Always includespredict_time(elapsed seconds). May also include custom metrics recorded by the model usingself.record_metric().
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
{
"input": {
"image": "https://example.com/image.jpg",
"text": "Hello world!"
}
}HTTP/1.1 200 OK
Content-Type: application/json
{
"status": "succeeded",
"output": "data:image/png;base64,...",
"metrics": {
"predict_time": 4.52
}
}If the client sets the Prefer: respond-async header in their request,
the server responds immediately after starting the prediction
with 202 Accepted status and a prediction object in status processing.
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async
{
"input": {"prompt": "A picture of an onion with sunglasses"}
}HTTP/1.1 202 Accepted
Content-Type: application/json
{
"status": "starting",
}If the client sets the Accept: text/event-stream header,
the server starts the prediction asynchronously and responds with a
server-sent event stream.
See Streaming predictions with server-sent events.
Make a single prediction.
This is the idempotent version of the POST /predictions endpoint.
PUT /predictions/wjx3whax6rf4vphkegkhcvpv6a HTTP/1.1
Content-Type: application/json; charset=utf-8
{
"input": {"prompt": "A picture of an onion with sunglasses"}
}HTTP/1.1 200 OK
Content-Type: application/json
{
"status": "succeeded",
"output": "data:image/png;base64,..."
}If the client sets the Prefer: respond-async header in their request,
the server responds immediately after starting the prediction
with 202 Accepted status and a prediction object in status processing.
PUT /predictions/wjx3whax6rf4vphkegkhcvpv6a HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async
{
"input": {"prompt": "A picture of an onion with sunglasses"}
}HTTP/1.1 202 Accepted
Content-Type: application/json
{
"id": "wjx3whax6rf4vphkegkhcvpv6a",
"status": "starting"
}If the client sets the Accept: text/event-stream header,
the server starts the prediction asynchronously and responds with a
server-sent event stream.
If a prediction with the same ID is already running,
the server returns a stream for the existing prediction.
See Streaming predictions with server-sent events.
A client can cancel an asynchronous prediction by making a
POST /predictions/<prediction_id>/cancel request
using the prediction id provided when the prediction was created.
For example, if the client creates a prediction by sending the request:
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async
{
"id": "abcd1234",
"input": {"prompt": "A picture of an onion with sunglasses"},
}The client can cancel the prediction by sending the request:
POST /predictions/abcd1234/cancel HTTP/1.1A prediction cannot be canceled if it's
created synchronously, without the Prefer: respond-async header,
or created without a provided id.
If a prediction exists with the provided id,
the server responds with status 200 OK.
Otherwise, the server responds with status 404 Not Found.
When a prediction is canceled,
Cog raises CancelationException
in sync predictors (or asyncio.CancelledError in async predictors).
This exception may be caught by the model to perform necessary cleanup.
The cleanup should be brief, ideally completing within a few seconds.
After cleanup, the exception must be re-raised using a bare raise statement.
Failure to re-raise the exception may result in the termination of the container.
from cog import BaseRunner, CancelationException, Input, Path
class Runner(BaseRunner):
def run(self, image: Path = Input(description="Image to process")) -> Path:
try:
return self.process(image)
except CancelationException:
self.cleanup()
raise # always re-raise