Google Veo 3.1 API: generate video from text (with code)

Call Google Veo 3.1 through one OpenAI-compatible API. Create a video task, poll or use a webhook for the result, handle duration and resolution.

You have a prompt and you want a video file back. Google Veo 3.1 produces short, high-fidelity clips from a text description, with sound, coherent motion and believable lighting. The friction is rarely the model itself. It is the plumbing: a separate account, a separate billing relationship, an async job format that differs from every chat API you already use, and no clean way to share a key and a budget with the rest of your stack.

This post shows how to call Veo 3.1 through AI Generate, an OpenAI-compatible aggregator that routes your request to the upstream provider and returns the result. You get one API key, one credit pool and one bill across video, image, music and chat. We are not the model author and we do not claim to be the cheapest route to it. We are upfront that there is a margin on top of upstream cost. What you buy is one surface instead of five.

What does Veo 3.1 actually do?

Veo 3.1 is a text-to-video and image-to-video model. You give it a prompt, optionally with a reference image and an aspect ratio, and it generates a short clip. Typical strengths:

Natural scenes with realistic lighting and depth: people, nature, architecture, product shots.
Camera language in the prompt: pans, dolly moves, slow zooms, tracking shots.
Generated audio that matches the scene, rather than a silent clip you score later.

Generation is asynchronous. You do not hold a connection open while frames render. You create a task, get an id back, and collect the result a little later by polling or via a webhook.

How do I call Veo 3.1 from the API?

Media generation uses the jobs surface. You POST a model slug and an input object, and you get a taskId immediately. Authentication is a Bearer token that starts with sk-aig-, which you create on the register page.

1. Create the task

curl https://aimarcusimage.eu/api/v1/jobs/createTask \
  -H "Authorization: Bearer sk-aig-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/veo-3-1",
    "input": {
      "prompt": "A golden retriever running through autumn leaves at golden hour, slow dolly follow, cinematic",
      "aspect_ratio": "16:9",
      "duration": 8
    }
  }'

The response carries the task identifier:

{ "taskId": "f0147a2e78670ecbce46020219f931a1" }

2. Poll for the result

Read the task with recordInfo until it reports completion, then pull the video URL out of the payload:

import time
import requests

BASE = "https://aimarcusimage.eu/api/v1"
HEAD = {"Authorization": "Bearer sk-aig-..."}

def create_video(prompt):
    # Submit the job; the response carries the task id, not the video.
    r = requests.post(f"{BASE}/jobs/createTask", headers=HEAD, timeout=30, json={
        "model": "google/veo-3-1",
        "input": {"prompt": prompt, "aspect_ratio": "16:9", "duration": 8},
    })
    r.raise_for_status()
    return r.json()["taskId"]

def wait_for_video(task_id, every=5, timeout=600):
    # Poll recordInfo until the task succeeds, fails, or the deadline passes.
    # monotonic() is unaffected by wall-clock adjustments during a long wait.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        r = requests.get(f"{BASE}/jobs/recordInfo", headers=HEAD,
                         params={"taskId": task_id}, timeout=30)
        r.raise_for_status()
        data = r.json()
        # Field names are shaped by the upstream provider, so check both spellings.
        state = data.get("state") or data.get("status")
        if state in ("success", "completed"):
            return data
        if state in ("failed", "error"):
            raise RuntimeError(data)
        time.sleep(every)
    raise TimeoutError(task_id)

tid = create_video("Aerial shot over a misty pine forest at sunrise, slow forward push")
result = wait_for_video(tid)
print(result)

The exact field names inside the completed record are shaped by the upstream provider, so read defensively: check for a completion state, then look for the video URL (commonly under a results or output array). Store the whole record the first time you see it so you can map the fields once and move on.
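Reading defensively could look like the sketch below. The key names it probes (results, output, resultUrls, url, video_url) and the optional data wrapper are guesses for illustration, not a documented schema, so adjust them once you have inspected a real completed record.

```python
def find_video_urls(record):
    """Collect URL-looking strings from places a completed record might keep them.

    Every key name here is an assumption for illustration, not a documented schema.
    """
    # Some APIs wrap the payload in a "data" envelope; unwrap it if present (assumed).
    inner = record.get("data")
    body = inner if isinstance(inner, dict) else record

    urls = []
    for key in ("results", "output", "resultUrls"):  # hypothetical container keys
        value = body.get(key)
        # Normalise a single string or object into a one-item list.
        if isinstance(value, (str, dict)):
            value = [value]
        for item in value or []:
            if isinstance(item, str):
                candidate = item
            elif isinstance(item, dict):
                candidate = item.get("url") or item.get("video_url")
            else:
                candidate = None
            # Keep only values that look like an HTTP(S) link.
            if isinstance(candidate, str) and candidate.startswith(("http://", "https://")):
                urls.append(candidate)
    return urls
```

An empty list means none of the guessed locations matched, which is the cue to look at the stored record and update the key list.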

Handling async, duration and resolution

A few things save you grief in production:

Treat it as a job queue, not a request. Persist the taskId against the user action that triggered it. Retries become trivial and you get a clean audit trail. If a poll times out, resume polling the same id rather than submitting a duplicate task.
Duration and aspect ratio are inputs, not post-processing. Set duration and aspect_ratio in the request. Longer clips cost more and take longer to render, so request the length you actually need.
Resolution. Where the model exposes a resolution or quality option, pass it through in input. Start at the lower tier while you iterate on the prompt, then raise it once the prompt is locked.
Poll on a sane interval. Every 5 seconds is plenty for video. Polling reads do not count against the 20 requests / 10 seconds rate limit, so you will not throttle yourself by checking.
Rehost the output. Download the returned file to your own storage or CDN rather than hot-linking the provider URL, so you control availability and lifetime.
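The job-queue advice can be sketched with a small local table. This is one possible shape, using SQLite from the Python standard library; the table name, column names and helper functions are all hypothetical, and a production service would more likely use its existing database.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a small job table keyed by taskId."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS video_jobs ("
        " task_id TEXT PRIMARY KEY,"
        " user_action TEXT NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'pending',"
        " video_url TEXT)"
    )
    return db

def record_task(db, task_id, user_action):
    # INSERT OR IGNORE makes a repeated write for the same taskId a no-op.
    db.execute(
        "INSERT OR IGNORE INTO video_jobs (task_id, user_action) VALUES (?, ?)",
        (task_id, user_action),
    )
    db.commit()

def finish_task(db, task_id, state, video_url=None):
    # Store the final state and, on success, the rehosted file location.
    db.execute(
        "UPDATE video_jobs SET state = ?, video_url = ? WHERE task_id = ?",
        (state, video_url, task_id),
    )
    db.commit()

def pending_tasks(db):
    # Ids that still need a poll, for example after a restart or a poll timeout.
    rows = db.execute(
        "SELECT task_id FROM video_jobs WHERE state = 'pending' ORDER BY task_id"
    )
    return [row[0] for row in rows]
```

Call record_task right after createTask returns, and finish_task once the file has been copied to your own storage. After a crash or a timed-out poll, pending_tasks gives you the ids to resume instead of submitting new renders.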

Webhook instead of polling

Polling is fine for a script. For a service, a webhook is cleaner: pass a callback URL with the task and receive a push when the render is done, instead of holding a poll loop open.

{
  "model": "google/veo-3-1",
  "input": { "prompt": "City street at night, neon reflections, slow tracking shot", "duration": 8 },
  "callBackUrl": "https://your-app.example/webhooks/veo"
}

Verify the signature on the incoming request, look up the job by its taskId, fetch and store the file. Keep a slow poll as a fallback in case a callback is ever missed.
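The callback steps might look like the sketch below. The signing scheme is not described here, so the code assumes an HMAC-SHA256 hex digest over the raw request body with a shared secret, and a taskId field in the JSON payload. Treat both as placeholders and follow the documentation for the real header name and algorithm.

```python
import hashlib
import hmac
import json

def verify_signature(raw_body: bytes, signature_hex: str, secret: bytes) -> bool:
    """Check an HMAC-SHA256 hex signature over the raw body (assumed scheme)."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so it does not leak how much matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())

def handle_callback(raw_body: bytes, signature_hex: str, secret: bytes, known_task_ids):
    """Return the taskId to process, or None if the callback should be rejected."""
    if not verify_signature(raw_body, signature_hex, secret):
        return None
    try:
        payload = json.loads(raw_body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return None
    # "taskId" as the payload field name is an assumption.
    task_id = payload.get("taskId") if isinstance(payload, dict) else None
    # Only act on jobs this service actually created.
    return task_id if task_id in known_task_ids else None
```

Verify against the raw bytes as received, before any JSON parsing or re-serialisation, since a signature over the body will not survive reformatting.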

What this costs / how pricing works

You pay per call from a prepaid credit balance. There is no subscription and no monthly minimum. You top up from $10, credits never expire, and the same balance covers video, image, music and chat. New accounts get free trial credits once the email is verified, which is enough to render a few clips and decide if it fits.

Video is the most expensive modality here because it is the most expensive upstream. Veo 3.1 is billed per generated video, scaling with duration and resolution. The exact per-call price for every variant is on the model page, and the playground shows the cost before you spend anything. We are an aggregator and add a margin over upstream cost, so if a single model is all you will ever call, going direct may be cheaper. The trade is one key, one budget and one invoice across everything.

When to use this vs going direct

Situation	Better fit
Veo is the only model you will ever touch	Direct provider
You also call image, music or chat models	AI Generate (one key, one bill)
You want to A/B Veo against other video models	AI Generate (swap the slug)
You want one prepaid budget and spend caps across a team	AI Generate

To compare Veo against other video models, change one field. google/veo-3-1 becomes another video slug and the rest of the request is identical. Browse options on the model catalogue.
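As a small illustration of the slug swap, the helper below builds one request body per candidate model. The build_task function is hypothetical and the second slug is a placeholder, not a real catalogue entry; substitute a slug from the model catalogue.

```python
def build_task(model, prompt, aspect_ratio="16:9", duration=8):
    """Build a createTask body; only the model slug differs between candidates."""
    return {
        "model": model,
        "input": {"prompt": prompt, "aspect_ratio": aspect_ratio, "duration": duration},
    }

# The second slug is a placeholder for whichever video model you want to compare.
CANDIDATES = ["google/veo-3-1", "example/other-video-model"]

prompt = "City street at night, neon reflections, slow tracking shot"
bodies = [build_task(slug, prompt) for slug in CANDIDATES]
```

Each body can then be posted to createTask as before, giving one taskId per model for a side-by-side comparison on the same prompt.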

FAQ

Is this the official Google Veo API?

No. AI Generate is an aggregator that routes your request to the upstream provider and returns the same model output. You get the Veo result with one key, one credit pool and an OpenAI-compatible surface shared with every other model.

How long does a Veo 3.1 render take?

Tens of seconds to a few minutes, depending on duration and resolution. Because it is asynchronous, poll on a 5-second interval or use a webhook so you are not blocking.

Can I generate video from an image?

Yes, where the variant supports image-to-video. Pass the reference image in the input object alongside the prompt. Check the model page for the exact field a given variant expects.

What happens if a render fails?

The task record reports a failed state and you are not charged for a failed generation. Retry by resubmitting the persisted request, and keep the failed taskId on record so the attempt stays traceable.
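A retry loop could be sketched as follows. It takes the create and wait steps as arguments (for example the create_video and wait_for_video functions from the polling section) and assumes each resubmission is a fresh task with its own taskId. The function name, attempt count and pause values are illustrative.

```python
import time

def render_with_retries(create, wait, prompt, attempts=3, pause=2.0, sleep=time.sleep):
    """Submit a render and resubmit on failure, returning (record, task_ids_tried)."""
    tried = []
    last_error = None
    for attempt in range(attempts):
        task_id = create(prompt)
        tried.append(task_id)  # keep every id for the audit trail
        try:
            return wait(task_id), tried
        except RuntimeError as err:
            # A reported failed state: safe to resubmit after a short backoff.
            last_error = err
            if attempt < attempts - 1:
                sleep(pause * (2 ** attempt))
    raise RuntimeError(f"all {attempts} attempts failed: {tried}") from last_error
```

A TimeoutError is deliberately not caught: a task that timed out on the polling side may still complete, so the safer move is to keep polling the same taskId rather than start another render.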

Full request and response details are in the documentation, model variants and pricing live on the Veo 3.1 page, and you can mint a key on the register page.