replicate · anish-sahoo · Jun 29, 2026 · Jun 30, 2026
@@ -6,15 +6,16 @@ The Cog CLI is a Go binary that provides commands for the full model lifecycle:
 
 ## Commands Overview
 
-| Command     | Job To Be Done                        |
-| ----------- | ------------------------------------- |
-| `cog init`  | Bootstrap a new model project         |
-| `cog build` | Create a container image              |
-| `cog run`   | Run a prediction in a container       |
-| `cog exec`  | Run arbitrary commands in a container |
-| `cog serve` | Start HTTP server in a container      |
-| `cog push`  | Deploy to Replicate                   |
-| `cog login` | Authenticate with Replicate           |
+| Command          | Job To Be Done                        |
+| ---------------- | ------------------------------------- |
+| `cog init`       | Bootstrap a new model project         |
+| `cog build`      | Create a container image              |
+| `cog run`        | Run a prediction in a container       |
+| `cog exec`       | Run arbitrary commands in a container |
+| `cog serve`      | Start HTTP server in a container      |
+| `cog playground` | Browser UI to talk to a running model |
+| `cog push`       | Deploy to Replicate                   |
+| `cog login`      | Authenticate with Replicate           |
 
 ## Development Commands
 
@@ -98,6 +99,44 @@ Builds the image (if needed) and starts a container running the [Container Runti
 
 **Code**: `pkg/cli/serve.go`
 
+---
+
+### cog playground
+
+**Job**: Open a browser UI for talking to a running model.
+
+```bash
+cog serve -p 8393      # terminal 1: start the model API
+cog playground         # terminal 2: opens the UI in your browser
+```
+
+Unlike the other commands, `cog playground` doesn't build an image or run model code. It starts a small Go web server that serves a schema-driven browser UI -- a Postman-like tool for Cog models -- and reverse-proxies requests to a _separate_ running model API, typically one started with `cog serve`. The UI reflects the model's [Schema](./02-schema.md) from `/openapi.json` and lets you run sync, streaming (SSE), and async predictions with either a generated form or raw JSON input.
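
To illustrate what "reflecting the Schema" involves, the Python sketch below walks an OpenAPI document and picks one widget per input. It assumes Cog's usual schema layout: inputs under `components.schemas.Input`, display order in `x-order`, files as `format: uri`, secrets as `format: password` or `x-cog-secret`, and choices via `enum`/`allOf`. Those key names, the helpers `widget_for` and `form_fields`, and the raw-JSON fallback are illustrative assumptions, not the playground's actual JS or Go code.

```python
from typing import Any


# Hypothetical helper: choose a form widget for one input property.
# Assumes Cog-style OpenAPI conventions; the real UI may differ.
def widget_for(prop: dict[str, Any]) -> str:
    if "enum" in prop or "allOf" in prop:  # choices are often an enum $ref wrapped in allOf
        return "enum"
    kind, fmt = prop.get("type"), prop.get("format")
    if kind == "string" and fmt == "uri":
        return "file"  # accepts an upload (data URI) or a URL
    if kind == "string" and (fmt == "password" or prop.get("x-cog-secret")):
        return "secret"
    if kind == "array":
        return "list"
    simple = {"string": "text", "integer": "number", "number": "number", "boolean": "boolean"}
    return simple.get(kind, "json")  # unknown shapes fall back to raw JSON editing


# Hypothetical helper: list (name, widget, required) in display order.
def form_fields(openapi: dict[str, Any]) -> list[tuple[str, str, bool]]:
    schema = openapi["components"]["schemas"]["Input"]
    required = set(schema.get("required", []))
    props = schema.get("properties", {})
    ordered = sorted(props.items(), key=lambda item: item[1].get("x-order", 0))
    return [(name, widget_for(prop), name in required) for name, prop in ordered]
```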
+
+The browser only ever talks to the playground's own origin, which forwards to the target API chosen at runtime (via an `X-Cog-Target` header). Proxying sidesteps CORS -- the model API sets none -- and keeps SSE streaming intact. Async predictions have no GET-by-id endpoint, so the server also hosts a webhook sink and relays delivered events back to the browser over its own SSE stream.
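
The routing decision can be sketched as a pure function: strip the `/proxy` prefix, read `X-Cog-Target`, and fall back to the `--target` default. This is a Python illustration under assumptions (the name `resolve_upstream`, the http/https-only check, and the query handling are hypothetical); the real server is Go and would hand the resulting URL to a reverse proxy that forwards the request and flushes the response as it arrives, which is what lets SSE frames through.

```python
from urllib.parse import urlsplit, urlunsplit

PROXY_PREFIX = "/proxy"                    # route prefix shown in the diagram
DEFAULT_TARGET = "http://localhost:8393"   # mirrors the --target default


def resolve_upstream(path: str, headers: dict[str, str],
                     default_target: str = DEFAULT_TARGET) -> str:
    """Hypothetical: map a request on the playground origin to the model API URL."""
    if not path.startswith(PROXY_PREFIX + "/"):
        raise ValueError(f"not a proxied path: {path!r}")
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    target = lowered.get("x-cog-target") or default_target
    parts = urlsplit(target)
    # Only forward to plain HTTP(S) APIs; reject file:, javascript:, etc.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"unsupported target: {target!r}")
    rest_path, _, query = path[len(PROXY_PREFIX):].partition("?")
    return urlunsplit((parts.scheme, parts.netloc,
                       parts.path.rstrip("/") + rest_path, query, ""))
```

Because the target travels in a header rather than in the playground's configuration, switching models from the UI needs no server restart.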
+
+```mermaid
+flowchart LR
+    Browser["Browser UI"]
+
+    subgraph Playground["cog playground (host)"]
+        Static["Static UI assets<br/>(go:embed)"]
+        Proxy["Reverse proxy<br/>/proxy/*"]
+        Sink["Webhook sink + relay<br/>/webhook/{token} → /events"]
+    end
+
+    Model["Target model API<br/>(e.g. cog serve)"]
+
+    Browser -->|"load UI"| Static
+    Browser -->|"schema, predictions, SSE"| Proxy
+    Proxy -->|"X-Cog-Target"| Model
+    Model -->|"webhook (async)"| Sink
+    Sink -->|"SSE events"| Browser
+```
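
Because there is no GET-by-id endpoint to poll, the sink and relay amount to a fan-out: each webhook the model POSTs becomes one SSE frame on every open browser stream. A minimal Python sketch of that idea follows; the class and method names, the per-browser `queue.Queue`, and the token check are illustrative assumptions, not the Go implementation in `pkg/cli/playground.go`.

```python
import json
import queue
import secrets
import threading


class WebhookRelay:
    """Hypothetical in-memory relay: webhook POST bodies in, SSE frames out."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._subscribers: list[queue.Queue[str]] = []
        # Unguessable path segment for /webhook/{token}.
        self.token = secrets.token_urlsafe(16)

    def subscribe(self) -> queue.Queue[str]:
        """Called when a browser opens /events; each browser gets its own queue."""
        q: queue.Queue[str] = queue.Queue()
        with self._lock:
            self._subscribers.append(q)
        return q

    def unsubscribe(self, q: queue.Queue[str]) -> None:
        with self._lock:
            self._subscribers.remove(q)

    def deliver(self, token: str, body: bytes) -> bool:
        """Handle a webhook POST; returns False (reject) for an unknown token."""
        if not secrets.compare_digest(token.encode(), self.token.encode()):
            return False
        event = json.loads(body)
        # One SSE frame per webhook: a single data: line followed by a blank line.
        frame = "data: " + json.dumps(event, separators=(",", ":")) + "\n\n"
        with self._lock:
            for q in self._subscribers:
                q.put(frame)
        return True
```

An `/events` handler would then block on `q.get()`, write each frame to the response, and flush after every write.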
+
+The UI is plain HTML/JS (no build step); JSON is edited and displayed with a vendored Ace editor. Assets are compiled into the binary with `go:embed`.
+
+**Code**: `pkg/cli/playground.go` (server, reverse proxy, webhook sink); `pkg/cli/playground/` (embedded UI assets)
+
 ## Build Commands
 
 ### cog build
@@ -219,9 +258,11 @@ pkg/cli/
 ├── predict.go      # prediction execution and legacy cog predict
 ├── exec.go         # cog exec
 ├── serve.go        # cog serve
+├── playground.go   # cog playground (UI server, reverse proxy, webhook sink)
 ├── push.go         # cog push
 ├── login.go        # cog login
-└── init.go         # cog init
+├── init.go         # cog init
+└── playground/     # embedded playground UI assets (go:embed)
 ```
 
 Commands delegate to packages under `pkg/`:

@@ -175,6 +175,49 @@ cog login [flags]
       --token-stdin   Pass login token on stdin instead of opening a browser. You can find your Replicate login token at https://replicate.com/auth/token
 ```
 
+## `cog playground`
+
+Open a browser playground for talking to a running model.
+
+Starts a local web server that serves a schema-driven UI (a Postman-like tool
+for Cog models). Point it at any running Cog HTTP API -- for example one started
+with 'cog serve' -- and the playground reflects that model's inputs and outputs
+from its OpenAPI schema in real time.
+
+Requests are reverse-proxied through this server, so the target API does not
+need to set CORS headers. The server also hosts a webhook sink so async
+predictions can be observed in the browser.
+
+Async/webhook testing against a containerized model requires the webhook URL to
+be reachable from inside the container. On Docker Desktop the default
+'host.docker.internal' works once the server listens on a reachable interface
+(e.g. --host 0.0.0.0).
+
+```
+cog playground [flags]
+```
+
+**Examples**
+
+```
+  # Start a model API in one terminal
+  cog serve -p 8393
+
+  # Open the playground pointing at it
+  cog playground --target http://localhost:8393
+```
+
+**Options**
+
+```
+  -h, --help                  help for playground
+      --host string           Address to bind (use 0.0.0.0 to receive webhooks from containers) (default "127.0.0.1")
+      --no-open               Do not open the browser automatically
+  -p, --port int              Port to listen on (0 picks a free port)
+      --target string         Default target model API URL (default "http://localhost:8393")
+      --webhook-host string   Hostname the model uses to reach this server for webhooks (default "host.docker.internal")
+```
+
 ## `cog push`
 
 Build a Docker image from cog.yaml and push it to a container registry.

@@ -0,0 +1,65 @@
+# Playground
+
+`cog playground` opens a browser UI for talking to a running Cog model — a Postman-like tool that reflects your model's inputs and outputs from its OpenAPI schema and lets you run predictions interactively.
+
+It doesn't build an image or run your model. Point it at a model API you're already running — typically [`cog serve`](cli.md#cog-serve) — and it proxies requests to that API.
+
+## Quick start
+
+Start your model's HTTP server in one terminal:
+
+```sh
+cog serve -p 8393
+```
+
+Open the playground in another:
+
+```sh
+cog playground --target http://localhost:8393
+```
+
+This serves the UI on a local port and opens it in your browser. You can change the target API from the UI at any time.
+
+## What you can do
+
+- **Schema-driven form.** Inputs render from the model's `/openapi.json` as the appropriate widgets (text, number, boolean, enum, list, file, secret). Optional fields without a default start unchecked, so they're omitted from the request until you enable them.
+- **Form or JSON.** Toggle between the generated form and a JSON editor; the two stay in sync.
+- **Files by upload or URL.** File inputs accept an uploaded file (sent as a data URI) or a URL, with an inline preview for images, audio, and video.
+- **Sync, streaming, or async.** Run modes appear based on what the model supports — streaming (SSE) when the predictor uses `@cog.streaming`, and async via webhooks.
+- **Rendered or raw output.** View the rendered result (media, text, JSON) or switch to **Raw** to see exactly what arrived over the wire. A Copy button grabs the whole payload.
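
To make the form behaviours above concrete: an uploaded file is sent as a base64 data URI, and unchecked optional fields are simply left out of the `input` object. The helpers `to_data_uri` and `build_input` below are hypothetical Python illustrations, and guessing the MIME type from the filename is an assumption; a URL input would just be passed through as a string.

```python
import base64
import mimetypes
from typing import Any


def to_data_uri(filename: str, data: bytes) -> str:
    """Encode an uploaded file as a base64 data URI, guessing the MIME type from its name."""
    mime, _ = mimetypes.guess_type(filename)
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


def build_input(fields: dict[str, tuple[bool, Any]]) -> dict[str, Any]:
    """Hypothetical: keep only enabled fields, so unchecked optional inputs
    are left out of the request entirely."""
    return {name: value for name, (enabled, value) in fields.items() if enabled}
```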
+
+## Options
+
+| Flag             | Description                                                                                    |
+| ---------------- | ---------------------------------------------------------------------------------------------- |
+| `--target`       | Default model API URL (also changeable in the UI). Defaults to `http://localhost:8393`.        |
+| `-p, --port`     | Port to listen on. `0` (default) picks a free port.                                            |
+| `--host`         | Address to bind. Use `0.0.0.0` to receive webhooks from a containerized model.                 |
+| `--webhook-host` | Hostname the model uses to reach the playground for webhooks (default `host.docker.internal`). |
+| `--no-open`      | Don't open the browser automatically.                                                          |
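
The `-p 0` default relies on standard OS behaviour: binding a socket to port `0` makes the kernel assign a free ephemeral port, which the server then reads back from the bound socket. The Python sketch below shows only that mechanism; `pick_free_port` is a hypothetical name.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Bind to port 0 so the OS picks a free port, then report which one it chose."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

This sketch closes the socket before returning, so another process could grab the port in between. A real server avoids that race by keeping the listener it bound and reading the port from it.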
+
+## CORS and webhooks
+
+Requests are reverse-proxied through the playground, so the model API doesn't need to send any CORS headers.
+
+[Async predictions](http.md#webhooks) are observed via webhooks (there's no status-polling endpoint), so the playground hosts a webhook sink and relays events to the browser. For this to work against a model running in a container, the playground must be reachable from inside the container:
+
+```sh
+cog playground --host 0.0.0.0 --webhook-host host.docker.internal
+```
+
+> [!NOTE]
+> Sync and streaming predictions work without any of this — the webhook setup is only needed for async runs.
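
For reference, here is roughly what an async request carries, assuming the shape of Cog's HTTP API: a `webhook` URL and an optional `webhook_events_filter` in the body, plus a `Prefer: respond-async` header. The webhook URL combines `--webhook-host` with the playground's port and the `/webhook/{token}` route. `build_async_request`, the port, and the token are hypothetical.

```python
import json


def build_async_request(inputs: dict, webhook_host: str, port: int,
                        token: str) -> tuple[dict[str, str], bytes]:
    """Hypothetical: headers and JSON body for an async prediction that reports via webhook."""
    webhook_url = f"http://{webhook_host}:{port}/webhook/{token}"
    body = {
        "input": inputs,
        "webhook": webhook_url,
        # Ask for every event type so the browser sees progress, not just completion.
        "webhook_events_filter": ["start", "output", "logs", "completed"],
    }
    headers = {"Content-Type": "application/json", "Prefer": "respond-async"}
    return headers, json.dumps(body).encode("utf-8")
```

Sent as `POST /predictions` through the proxy, the immediate response only acknowledges the prediction. Results arrive later at the webhook, which is why the container must be able to reach `host.docker.internal` on the playground's port.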
+
+## Remote models
+
+If your model runs on another machine, forward its port over SSH and point the playground at the local end of the tunnel. This example assumes the remote model listens on port 5000:
+
+```sh
+ssh -L 8393:localhost:5000 user@remote
+cog playground --target http://localhost:8393
+```
+
+Sync and streaming work over the tunnel. For async/webhooks, run the playground next to the model on the remote and forward only the UI port instead.
+
+See the [CLI reference](cli.md#cog-playground) for the full list of flags.
@@ -9,7 +9,7 @@ This example shows how a Cog runner can yield text chunks as a model generates t
 From this directory:
 
 ```sh
-cog predict -i prompt="Write a short haiku about databases"
+cog run -i prompt="Write a short haiku about databases"
 ```
 
 This returns the final accumulated output after the prediction completes.
@@ -46,6 +46,6 @@ data: {"id":"streaming-demo","status":"succeeded",...}
 
 ## How it works
 
-`predict.py` defines `run() -> Iterator[str]`. Each `yield` becomes one streamed output chunk. The example uses Hugging Face `TextIteratorStreamer` to receive generated text from `model.generate()` while generation is still running.
+`run.py` defines `run() -> Iterator[str]`. Each `yield` becomes one streamed output chunk. The example uses Hugging Face `TextIteratorStreamer` to receive generated text from `model.generate()` while generation is still running.
 
 The normal prediction response still contains the accumulated output for compatibility. Requesting `Accept: text/event-stream` is useful when clients want to display tokens as they arrive.
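
A client consuming the stream only needs to split on blank lines and decode each `data:` payload, like the `data: {"id":"streaming-demo",...}` lines in the example output. The sketch below is a minimal Python parser under that assumption. It handles single- or multi-line `data:` fields and comment lines, and ignores `event:`, `id:`, and `retry:`. It is not a full implementation of the SSE specification, and `iter_sse_events` is a hypothetical name.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield each SSE event's data payload, parsed as JSON."""
    data: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the event; dispatch whatever data was collected.
            if data:
                yield json.loads("\n".join(data))
                data = []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)
        # event:, id: and retry: fields are ignored in this sketch.
    # Per the SSE spec, a trailing event without a terminating blank line is discarded.
```

With `urllib.request`, send the prediction with an `Accept: text/event-stream` header and pass `(line.decode() for line in response)` to the parser.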
@@ -1,3 +1,3 @@
-torch==2.12.0
+torch==2.8.0
 transformers==5.0.0rc3
 accelerate==1.6.0
@@ -12,12 +12,12 @@
 class Runner(BaseRunner):
     def setup(self) -> None:
         self.device = "cuda" if torch.cuda.is_available() else "cpu"
-        dtype = torch.float16 if self.device == "cuda" else torch.float32
+        dtype = torch.bfloat16 if self.device == "cuda" else torch.float32
 
         self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
         self.model = AutoModelForCausalLM.from_pretrained(
             MODEL_NAME,
-            torch_dtype=dtype,
+            dtype=dtype,
         ).to(self.device)
         self.model.eval()
 

@@ -11,6 +11,7 @@ nav:
   - Prediction API: python.md
   - Training API: training.md
   - HTTP API: http.md
+  - Playground: playground.md
   - CLI: cli.md
   - Environment variables: environment.md
   - Private registry: private-package-registry.md