Until late 2025 we ran everything chat-related through OpenAI directly. One account, one key, one bill. In January 2026 we migrated to a multi-model aggregator. Here's what actually happened.

What we had

  • OpenAI enterprise tier — ~$8k/month across all surfaces
  • All chat completions on gpt-4o or gpt-4o-mini
  • One embedding model, one image model (DALL-E 3)
  • Direct SDK integration in four services

What triggered the move

  • Claude Opus 4.6 was scoring noticeably better on our agentic coding evals
  • Gemini 3 Pro became the clear winner for long-context document analysis
  • We wanted to test open-source models (DeepSeek, Llama 4) without standing up inference infrastructure
  • Running five providers meant five contracts, five billing cycles, five rate limit dashboards

The migration

We switched to an OpenAI-compatible aggregator (this one, actually; we use our own product in production). The entire change was swapping the baseURL and apiKey in the SDK. The code diff was 8 lines across the four services.

// Before
const openai = new OpenAI({ apiKey: process.env.OPENAI_KEY });

// After
const openai = new OpenAI({
  apiKey: process.env.AIGEN_KEY,
  baseURL: 'https://aimarcus.eu/aigenerate/api/v1',
});

What changed in the bill

  • Chat (GPT-4o → mixed GPT-4o + Claude Sonnet based on task): -22%
  • Agentic coding (moved to Claude Opus): +8%, but evals improved 15 points
  • Long-context analysis (moved to Gemini 3 Pro): -40%
  • Image generation (moved from DALL-E 3 to Nano Banana): -65%
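The per-surface split above boils down to a routing table from task to model. A minimal sketch of that idea; the task labels and model IDs are illustrative placeholders, not our production config:

```javascript
// Hypothetical task → model routing table. Task names and model IDs
// are illustrative; real configs would likely live outside the code.
const MODEL_BY_TASK = {
  'chat': 'gpt-4o',
  'chat-light': 'claude-sonnet',
  'agentic-coding': 'claude-opus',
  'long-context': 'gemini-3-pro',
  'image': 'nano-banana',
};

function modelFor(task) {
  const model = MODEL_BY_TASK[task];
  if (!model) throw new Error(`no model configured for task: ${task}`);
  return model;
}
```

With an OpenAI-compatible aggregator, this is the only per-request decision left: pick the model string, and the rest of the call site stays identical.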

Net effect on the monthly bill: -28%. That comes from better model choice on each surface plus lower per-token rates the aggregator negotiates upstream, which we no longer have to negotiate ourselves.

Three things that went wrong

  • Streaming SSE format differences. Claude streams slightly differently from OpenAI. Our client-side buffer logic broke for about 10 minutes until we added a compatibility shim.
  • Token counting. Different models tokenize differently. Our internal "tokens used per conversation" metric needed to be re-calibrated per model.
  • Caching keyed on model name. Our prompt cache hardcoded gpt-4o in its key, so when we started routing some requests to Claude, we got stale GPT-4o responses back as cache hits. Lesson: always include the model actually used in cache keys.

Would we do it again?

Yes — for three reasons unrelated to cost.

  • Optionality. When OpenAI had a 40-minute outage on April 3rd, our chat surface kept running because traffic failed over to Anthropic and Gemini automatically. That alone is worth the migration.
  • Model parity is over. There is no "the best model" in 2026. There's a best model for each task. Aggregation is how you actually use that fact.
  • One bill. Finance wanted this more than engineering did. One invoice, one vendor relationship, one contract for 12 providers.
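The failover in the first bullet happens inside the aggregator, not in our code, but the idea is simple enough to sketch client-side. A hypothetical wrapper, where the provider list and the callProvider function are illustrative stand-ins for whatever transport you use:

```javascript
// Try providers in priority order; return the first success.
// If every provider fails, surface the last error seen.
async function completeWithFailover(prompt, providers, callProvider) {
  let lastError;
  for (const provider of providers) {
    try {
      return await callProvider(provider, prompt);
    } catch (err) {
      lastError = err; // provider down or rate-limited; try the next one
    }
  }
  throw lastError;
}
```

The point of the migration is that we never had to write this: the aggregator applies the equivalent logic server-side, which is why the April 3rd outage was invisible to our users.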

If you're running any meaningful volume on a single provider, the switch pays for itself the first month.