Calling the Claude API from Symfony: streaming, structured extraction, and per-plan quotas

A production architecture for Claude in a Symfony SaaS — scoped HttpClient, SSE streaming to the browser, tool-use extraction, retries, and AI quotas wired into your billing plans.

Adding "AI features" to a SaaS in 2026 is table stakes. Adding them well — streaming that doesn't buffer, costs that can't run away, a test suite that doesn't need an API key — is where most integrations fall short. This is the architecture I built for ShipAnvil's AI module, and the decisions behind it.

SDK or no SDK?

Anthropic ships an official PHP SDK (anthropic-ai/sdk), and it's a fine choice. I deliberately went the other way: the Claude API is one JSON endpoint (POST /v1/messages with x-api-key and anthropic-version headers), and Symfony's HttpClient already does everything needed — scoped clients, retries, timeouts, streaming. Owning the ~300 lines means you own every byte on the wire, your retry policy is your retry policy, and there's one less dependency at the very bottom of your product. For a boilerplate whose buyers will read and modify the code, that transparency is the feature. If you'd rather not own it, swap the implementation behind the same interface — that's what the interface is for.

# config/packages/framework.yaml
framework:
    http_client:
        scoped_clients:
            anthropic.client:
                base_uri: 'https://api.anthropic.com'
                headers:
                    x-api-key: '%env(ANTHROPIC_API_KEY)%'
                    anthropic-version: '2023-06-01'
                timeout: 120   # generous: thinking models take their time

Decision 1: an interface, and a fake that actually behaves

Everything depends on AiClientInterface (complete() + stream()), with two implementations: the real Anthropic client and an offline deterministic stub. The stub isn't a mock that returns "ok" — it streams plausible chunks with realistic timing and "extracts" real-looking structures. That buys you:

  • Tests without keys or network. The whole AI feature surface runs in CI, deterministic and free.
  • Development without burning tokens. Front-end work on the chat UI doesn't need a live model.
  • A product you can demo and sell without AI configured — flip one env var to go live.
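
To make the idea concrete, here is a minimal sketch of what such a pair could look like. The method signatures are my assumption for illustration (the article only names complete() and stream()), and the reply text and the delay parameter are invented; a real stub would likely produce richer output.

```php
<?php
declare(strict_types=1);

// Hypothetical signatures: only the method names come from the article.
interface AiClientInterface
{
    public function complete(string $prompt): string;

    /** @return iterable<string> text deltas, in order */
    public function stream(string $prompt): iterable;
}

final class FakeAiClient implements AiClientInterface
{
    // Optional pause between chunks so a chat UI shows a typing effect.
    public function __construct(private int $delayMicroseconds = 0)
    {
    }

    public function complete(string $prompt): string
    {
        // Deterministic: the same prompt always produces the same reply.
        return sprintf(
            'Stubbed reply to a %d-character prompt (ref %s).',
            strlen($prompt),
            substr(md5($prompt), 0, 8),
        );
    }

    public function stream(string $prompt): iterable
    {
        // Split after each whitespace character so chunks resemble streamed tokens.
        $chunks = preg_split('/(?<=\s)/', $this->complete($prompt), -1, PREG_SPLIT_NO_EMPTY);

        foreach ($chunks as $chunk) {
            if ($this->delayMicroseconds > 0) {
                usleep($this->delayMicroseconds);
            }
            yield $chunk;
        }
    }
}
```

Because the streamed chunks concatenate to exactly what complete() returns, tests can assert on full strings while the front end still exercises its incremental rendering path.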

Decision 2: parse SSE incrementally, forward it incrementally

With "stream": true, the Messages API answers with Server-Sent Events: message_start, then content_block_delta events carrying text_delta chunks, then message_stop. Two things go wrong in naive implementations:

Buffering. HttpClient gives you raw chunks; an SSE event can span several chunks or arrive several-per-chunk. You need a small incremental parser that buffers on \n\n boundaries and yields complete events — not a regex over the full body at the end (that's not streaming, that's waiting with extra steps).

foreach ($this->httpClient->stream($response) as $chunk) {
    foreach ($this->sseParser->push($chunk->getContent()) as $event) {
        if ('content_block_delta' === $event->type) {
            yield $event->textDelta();   // a few characters, immediately
        }
    }
}
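
For reference, a parser with that push() shape can stay very small. What follows is a sketch under my own assumptions, not the exact class from the module: the SseEvent value object and its constructor are hypothetical, and the textDelta() accessor assumes the delta payload carries its text under delta.text, as the text_delta events of the Messages API do.

```php
<?php
declare(strict_types=1);

// Hypothetical value object; only ->type and ->textDelta() appear in the article.
final class SseEvent
{
    public function __construct(
        public readonly string $type,
        public readonly array $data,
    ) {
    }

    public function textDelta(): string
    {
        return (string) ($this->data['delta']['text'] ?? '');
    }
}

final class SseEventParser
{
    private string $buffer = '';

    /** @return list<SseEvent> every event completed by this chunk */
    public function push(string $chunk): array
    {
        // Keep unfinished data in the buffer; normalise CRLF so a blank line is "\n\n".
        $this->buffer = str_replace("\r\n", "\n", $this->buffer . $chunk);

        $events = [];
        while (false !== ($pos = strpos($this->buffer, "\n\n"))) {
            $block = substr($this->buffer, 0, $pos);
            $this->buffer = substr($this->buffer, $pos + 2);

            $type = 'message'; // SSE default when no "event:" field is present
            $dataLines = [];
            foreach (explode("\n", $block) as $line) {
                if (str_starts_with($line, 'event:')) {
                    $type = trim(substr($line, 6));
                } elseif (str_starts_with($line, 'data:')) {
                    $value = substr($line, 5);
                    // The SSE format strips a single leading space after the colon.
                    $dataLines[] = str_starts_with($value, ' ') ? substr($value, 1) : $value;
                }
            }

            if ([] === $dataLines) {
                continue; // comment or keep-alive block, nothing to yield
            }

            $data = json_decode(implode("\n", $dataLines), true);
            $events[] = new SseEvent($type, is_array($data) ? $data : []);
        }

        return $events;
    }
}
```

The important property is that push() returns nothing for a partial event and several events when one chunk happens to contain more than one, so the caller never has to care where the network cut the stream.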

The last hop. Your beautifully streamed tokens then hit a controller that buffers the whole response anyway. Use a StreamedResponse, disable output buffering, and check your reverse proxy: with nginx + PHP-FPM, sending X-Accel-Buffering: no and flushing after each chunk is the difference between "typewriter effect" and "30-second pause, then a wall of text".
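
The forwarding side can be sketched without any framework code. The function below is a hypothetical helper (emitSse, the frame shape, and the closing "done" event are my own choices, not part of the module); it takes the write and flush steps as callables so the same logic works inside a StreamedResponse callback and in a unit test.

```php
<?php
declare(strict_types=1);

// Headers a streaming endpoint would typically send; X-Accel-Buffering is the
// proxy hint mentioned above.
const STREAM_HEADERS = [
    'Content-Type' => 'text/event-stream',
    'Cache-Control' => 'no-cache',
    'X-Accel-Buffering' => 'no',
];

/**
 * Writes each text delta as one SSE frame and flushes after every frame.
 *
 * @param iterable<string>       $deltas text chunks coming from the AI client
 * @param callable(string): void $write  sends bytes to the output
 * @param callable(): void       $flush  pushes buffered output to the client
 */
function emitSse(iterable $deltas, callable $write, callable $flush): void
{
    foreach ($deltas as $delta) {
        // JSON-encoding keeps newlines inside a delta from breaking the frame.
        $write('data: ' . json_encode(['text' => $delta], JSON_THROW_ON_ERROR) . "\n\n");
        $flush();
    }

    // Hypothetical end marker so the browser knows the stream finished cleanly.
    $write("event: done\ndata: {}\n\n");
    $flush();
}
```

In a controller, the callables would presumably be a plain echo for $write and, for $flush, a call to ob_flush() when ob_get_level() is above zero followed by flush(), with STREAM_HEADERS passed as the response headers.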

Decision 3: structured extraction via tool use, not "please reply in JSON"

For anything that feeds program logic (extracting contacts, classifying tickets, parsing documents), don't prompt for JSON and json_decode with your fingers crossed. Define a tool whose input_schema is the structure you want, and force it with tool_choice:

$payload = [
    'model' => $this->model,
    'max_tokens' => 1024,
    'messages' => [['role' => 'user', 'content' => $text]],
    'tools' => [[
        'name' => 'save_contact',
        'description' => 'Record the contact details found in the text.',
        'input_schema' => [
            'type' => 'object',
            'properties' => [
                'name' => ['type' => 'string'],
                'email' => ['type' => 'string'],
                'company' => ['type' => 'string'],
            ],
            'required' => ['name'],
        ],
    ]],
    'tool_choice' => ['type' => 'tool', 'name' => 'save_contact'],
];

Because tool_choice forces the call, the model responds with a tool_use block whose input arrives as a parsed array shaped by your schema, not prose that usually contains JSON. Still check the required fields before trusting it: forcing the tool guarantees the call, not strict schema conformance. Wrap this in a small StructuredExtractor service and every future "turn this text into data" feature is a schema definition away. (The API also has a first-class structured-outputs mode via output_config; same idea, worth evaluating. Tool use has the advantage of working uniformly across model generations.)
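
Reading the result back is the other half of the extractor. This is a minimal sketch assuming the decoded response body is a PHP array whose content list holds blocks with type, name, and input keys; the function name extractToolInput and the $required parameter are hypothetical, added for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Pulls the input of a forced tool call out of a decoded Messages API response.
 *
 * @param array<string, mixed> $response decoded JSON body
 * @param list<string>         $required keys that must be present in the input
 *
 * @return array<string, mixed>
 */
function extractToolInput(array $response, string $toolName, array $required = []): array
{
    foreach ($response['content'] ?? [] as $block) {
        $isMatch = 'tool_use' === ($block['type'] ?? null)
            && $toolName === ($block['name'] ?? null);
        if (!$isMatch) {
            continue; // skip text blocks and calls to other tools
        }

        $input = $block['input'] ?? null;
        if (!is_array($input)) {
            throw new UnexpectedValueException('The tool_use block has no usable input.');
        }

        // Cheap defensive check; a full JSON Schema validator could replace it.
        foreach ($required as $key) {
            if (!array_key_exists($key, $input)) {
                throw new UnexpectedValueException(sprintf('Missing required field "%s".', $key));
            }
        }

        return $input;
    }

    throw new UnexpectedValueException(sprintf('No "%s" tool_use block in the response.', $toolName));
}
```

With the payload shown earlier, a call such as extractToolInput($decoded, 'save_contact', ['name']) would either return the contact array or fail loudly, which is easier to handle than a half-parsed string.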

Decision 4: meter usage where billing already lives

The failure mode of AI features isn't technical, it's economic: one enthusiastic user on your 9 €/month plan can spend more in tokens than their subscription. The fix is to treat AI usage like any other entitlement:

  • Every completed request writes an AiUsage row (organization, feature, input/output tokens — the API returns exact counts in usage).
  • A quota check runs before each request, against limits defined in the same plan configuration as the rest of billing — not in a second config that drifts.
  • When the quota's gone, the user gets a clean "you've used this month's AI allowance — upgrade?" — which makes AI a reason to upgrade, the only sane way to price it.

Tie the metering to the organization, not the user, for the same reason subscriptions attach there: teams share a plan, so they share its allowance.
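
A stripped-down version of the check might look like the sketch below. The class names, plan names, and token limits are all hypothetical, and persistence is left out: in the real module the used-token figure would come from summing the organization's AiUsage rows for the current month.

```php
<?php
declare(strict_types=1);

final class AiQuotaExceeded extends RuntimeException
{
}

final class AiQuota
{
    /** @param array<string, int> $monthlyTokenLimits plan name => tokens per month */
    public function __construct(private array $monthlyTokenLimits)
    {
    }

    public function remaining(string $plan, int $usedThisMonth): int
    {
        // A plan without an entry gets no AI allowance at all.
        $limit = $this->monthlyTokenLimits[$plan] ?? 0;

        return max(0, $limit - $usedThisMonth);
    }

    /** Call before sending the request, with the organization's usage so far. */
    public function assertCanSpend(string $plan, int $usedThisMonth): void
    {
        if (0 === $this->remaining($plan, $usedThisMonth)) {
            throw new AiQuotaExceeded("You've used this month's AI allowance.");
        }
    }
}

/**
 * Total tokens to record for one completed request, read from the usage
 * object of the API response.
 *
 * @param array<string, mixed> $usage
 */
function tokensUsed(array $usage): int
{
    return (int) ($usage['input_tokens'] ?? 0) + (int) ($usage['output_tokens'] ?? 0);
}
```

The limits array is deliberately just data, so it can be fed from the same plan configuration billing already reads instead of a second source of truth.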

Decision 5: retries with backoff, but never on POST-that-succeeded

The API returns 429 on rate limits and 529 when overloaded; both are retryable with exponential backoff and jitter. Two subtleties: respect the retry-after header when present, and only retry when you know the request didn't produce a billable completion (connection failures before any response, 429s rejected before processing, explicit overload signals). A blind retry-on-5xx policy around a streaming response can double-bill a long completion that failed at the last byte.
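
The two decisions involved, whether to retry and how long to wait, fit in two small functions. This is a sketch with assumed names and defaults (isRetryable, retryDelayMs, a 500 ms base, a 30 s cap), it treats only 429 and 529 as retryable statuses as a conservative choice, and it handles only the seconds form of retry-after.

```php
<?php
declare(strict_types=1);

/**
 * @param int|null $statusCode        null when the connection failed before any response
 * @param bool     $anyOutputReceived true once a single streamed event has arrived
 */
function isRetryable(?int $statusCode, bool $anyOutputReceived): bool
{
    if ($anyOutputReceived) {
        return false; // tokens may already be billed, so never replay the request
    }
    if (null === $statusCode) {
        return true; // no response at all
    }

    return in_array($statusCode, [429, 529], true);
}

/**
 * Delay before retry number $attempt (0-based), in milliseconds.
 */
function retryDelayMs(int $attempt, ?string $retryAfter, int $baseMs = 500, int $capMs = 30_000): int
{
    // The server's own hint wins over our backoff schedule.
    if (null !== $retryAfter && ctype_digit($retryAfter)) {
        return (int) $retryAfter * 1000;
    }

    // Exponential ceiling, capped; "full jitter" picks a random point below it.
    $ceiling = min($capMs, $baseMs * 2 ** max(0, $attempt));

    return random_int(0, $ceiling);
}
```

Drawing the delay uniformly between zero and the ceiling spreads retries from many clients apart, so they don't all hit the API again at the same instant.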

Model choice, as of June 2026: claude-opus-4-8 as the capable default, claude-sonnet-4-6 for the speed/cost sweet spot on production traffic, claude-haiku-4-5 for cheap high-volume tasks. Make it an env var: model IDs change more often than your code should, so check Anthropic's current model list before pinning one.

The shape of the whole thing

src/Ai/
├── Client/        AiClientInterface, AnthropicClient, FakeAiClient, SseEventParser
├── Extraction/    StructuredExtractor + your schemas
├── Quota/         per-plan limits, usage metering
└── Controller/    streaming chat, extraction endpoints

That module — streaming chat UI included, quotas pre-wired into the billing plans, the fake provider, and the tests for all of it — ships in ShipAnvil alongside the billing it plugs into. You can try the chat in the live demo — it runs on the offline stub, which is rather the point.