June 11, 2026 · Eric

Calling the Claude API from Symfony: streaming, structured extraction, and per-plan quotas

A production architecture for Claude in a Symfony SaaS: scoped HttpClient, SSE streaming to the browser, tool-use extraction, retries, and AI quotas wired into your billing plans.

Adding "AI features" to a SaaS in 2026 is table stakes. Adding them well (streaming that doesn't buffer, costs that can't run away, a test suite that doesn't need an API key) is where most integrations fall short. This is the architecture I built for ShipAnvil's AI module, and the decisions behind it.

SDK or no SDK?

Anthropic ships an official PHP SDK (anthropic-ai/sdk), and it's a fine choice. I deliberately went the other way: the Claude API is one JSON endpoint (POST /v1/messages with x-api-key and anthropic-version headers), and Symfony's HttpClient already does everything needed: scoped clients, retries, timeouts, streaming. Owning the ~300 lines means you own every byte on the wire, your retry policy is your retry policy, and there's one less dependency at the very bottom of your product. For a boilerplate whose buyers will read and modify the code, that transparency is the feature. If you'd rather not own it, swap the implementation behind the same interface: that's what the interface is for.

# config/packages/framework.yaml
framework:
    http_client:
        scoped_clients:
            anthropic.client:
                base_uri: 'https://api.anthropic.com'
                headers:
                    x-api-key: '%env(ANTHROPIC_API_KEY)%'
                    anthropic-version: '2023-06-01'
                timeout: 120   # generous: thinking models take their time

Decision 1: an interface, and a fake that actually behaves

Everything depends on AiClientInterface (complete() + stream()), with two implementations: the real Anthropic client and an offline deterministic stub. The stub isn't a mock that returns "ok": it streams plausible chunks with realistic timing and "extracts" real-looking structures. That buys you:

Tests without keys or network. The whole AI feature surface runs in CI, deterministic and free.
Development without burning tokens. Front-end work on the chat UI doesn't need a live model.
A product you can demo and sell without AI configured: flip one env var to go live.
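
To make the "fake that actually behaves" idea concrete, here is a minimal sketch in plain PHP. The article only names AiClientInterface with complete() and stream(); the signatures, the FakeAiClient constructor argument, and the canned answer text are assumptions made for illustration.

```php
<?php

// Hypothetical signatures: only the interface name and the two method names
// come from the article.
interface AiClientInterface
{
    /** Returns the full completion text for a prompt. */
    public function complete(string $prompt): string;

    /** @return iterable<string> text chunks, in order */
    public function stream(string $prompt): iterable;
}

final class FakeAiClient implements AiClientInterface
{
    public function __construct(private int $delayMicroseconds = 20_000)
    {
    }

    public function complete(string $prompt): string
    {
        // Deterministic: the same prompt always produces the same answer.
        return sprintf('Stubbed answer to "%s" (offline mode, no API call made).', $prompt);
    }

    public function stream(string $prompt): iterable
    {
        // Split after each whitespace character, keeping it, so the chunks
        // concatenate back to exactly what complete() returns.
        $chunks = preg_split('/(?<=\s)/', $this->complete($prompt), -1, PREG_SPLIT_NO_EMPTY);

        foreach ($chunks as $chunk) {
            if ($this->delayMicroseconds > 0) {
                usleep($this->delayMicroseconds); // simulate token latency
            }
            yield $chunk;
        }
    }
}
```

In tests, passing 0 as the delay keeps the suite fast while still exercising the chunk-by-chunk code path.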

Decision 2: parse SSE incrementally, forward it incrementally

With "stream": true, the Messages API answers with Server-Sent Events: message_start, then content_block_delta events carrying text_delta chunks, then message_stop. Two things go wrong in naive implementations:

Buffering. HttpClient gives you raw chunks; an SSE event can span several chunks or arrive several-per-chunk. You need a small incremental parser that buffers on \n\n boundaries and yields complete events, not a regex over the full body at the end (that's not streaming, that's waiting with extra steps).

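// $response is the pending POST /v1/messages request, sent with "stream": true.
foreach ($this->httpClient->stream($response) as $chunk) {
    // Feed raw bytes to the parser; it hands back only events that are complete.
    foreach ($this->sseParser->push($chunk->getContent()) as $event) {
        if ('content_block_delta' === $event->type) {
            yield $event->textDelta();   // a few characters, immediately
        }
    }
}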
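
The parser behind that loop can be small. The following is a sketch, not the module's actual code: the SseEvent class and its textDelta() method are hypothetical shapes chosen to match the calls in the loop above, and push() returns an array rather than a generator so that a chunk is never lost if the caller forgets to iterate.

```php
<?php

// Hypothetical value object matching $event->type and $event->textDelta().
final class SseEvent
{
    public function __construct(
        public readonly string $type,
        public readonly array $data,
    ) {
    }

    public function textDelta(): ?string
    {
        // Only text_delta payloads carry user-visible text.
        if ('text_delta' !== ($this->data['delta']['type'] ?? null)) {
            return null;
        }

        return $this->data['delta']['text'] ?? null;
    }
}

final class SseEventParser
{
    private string $buffer = '';

    /** @return SseEvent[] every event completed by this chunk (possibly none) */
    public function push(string $chunk): array
    {
        // Normalise on the whole buffer so a CRLF split across chunks is handled.
        $this->buffer = str_replace("\r\n", "\n", $this->buffer . $chunk);
        $events = [];

        // A blank line terminates an event; anything after it stays buffered.
        while (false !== ($end = strpos($this->buffer, "\n\n"))) {
            $block = substr($this->buffer, 0, $end);
            $this->buffer = substr($this->buffer, $end + 2);

            $type = 'message';
            $data = [];
            foreach (explode("\n", $block) as $line) {
                if (str_starts_with($line, 'event:')) {
                    $type = trim(substr($line, 6));
                } elseif (str_starts_with($line, 'data:')) {
                    // Leading spaces are insignificant for JSON payloads.
                    $data[] = ltrim(substr($line, 5), ' ');
                }
            }
            if ([] === $data) {
                continue; // comment or keep-alive block
            }

            $payload = json_decode(implode("\n", $data), true);
            $events[] = new SseEvent($type, is_array($payload) ? $payload : []);
        }

        return $events;
    }
}
```

Feeding it half an event returns an empty array; feeding it the rest returns that event plus any others that completed in the same chunk.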

The last hop. Your beautifully streamed tokens then hit a controller that buffers the whole response anyway. Use a StreamedResponse, disable output buffering, and check your reverse proxy: with nginx + PHP-FPM, sending X-Accel-Buffering: no and flushing after each chunk is the difference between "typewriter effect" and "30-second pause, then a wall of text".
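
For the browser-facing side, a framework-free sketch of the callback you might hand to a StreamedResponse. The function names, the {"text": ...} frame shape, and the closing "done" event are assumptions for illustration; only the flush-after-every-chunk behaviour comes from the paragraph above.

```php
<?php

/** Formats one text chunk as a single SSE frame. */
function formatSseFrame(string $text): string
{
    // JSON-encode so a newline inside a chunk cannot break the framing.
    return 'data: ' . json_encode(['text' => $text], JSON_THROW_ON_ERROR) . "\n\n";
}

/**
 * Builds a callback that writes each chunk as it arrives and flushes it.
 * Assumption: $chunks is whatever your AI client's stream() yields.
 */
function sseStreamCallback(iterable $chunks): \Closure
{
    return static function () use ($chunks): void {
        $send = static function (string $frame): void {
            echo $frame;
            // Push past PHP's own output buffer (if one is active), then the SAPI's.
            if (ob_get_level() > 0) {
                ob_flush();
            }
            flush();
        };

        foreach ($chunks as $chunk) {
            $send(formatSseFrame($chunk));
        }
        $send("event: done\ndata: {}\n\n"); // lets the client close cleanly
    };
}

// In a Symfony controller this could be wired up roughly as:
//   return new StreamedResponse(sseStreamCallback($client->stream($prompt)), 200, [
//       'Content-Type' => 'text/event-stream',
//       'Cache-Control' => 'no-cache',
//       'X-Accel-Buffering' => 'no',
//   ]);
```

Keeping the frame formatting in its own function means the wire format can be unit-tested without a web server or a browser.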

Decision 3: structured extraction via tool use, not "please reply in JSON"

For anything that feeds program logic (extracting contacts, classifying tickets, parsing documents), don't prompt for JSON and json_decode with your fingers crossed. Define a tool whose input_schema is the structure you want, and force it with tool_choice:

$payload = [
    'model' => $this->model,
    'max_tokens' => 1024,
    'messages' => [['role' => 'user', 'content' => $text]],
    'tools' => [[
        'name' => 'save_contact',
        'description' => 'Record the contact details found in the text.',
        'input_schema' => [
            'type' => 'object',
            'properties' => [
                'name' => ['type' => 'string'],
                'email' => ['type' => 'string'],
                'company' => ['type' => 'string'],
            ],
            'required' => ['name'],
        ],
    ]],
    'tool_choice' => ['type' => 'tool', 'name' => 'save_contact'],
];

The model must respond with a tool_use block whose input follows your schema: you get a parsed array, not prose that usually contains JSON. Still check the required fields before trusting it, because a response cut off by max_tokens can leave the input incomplete. Wrap this in a small StructuredExtractor service and every future "turn this text into data" feature is a schema definition away. (The API also has a first-class structured-outputs mode via output_config: same idea, worth evaluating; tool use has the advantage of working uniformly across model generations.)
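
Reading the result back is mostly a matter of finding the right content block. A minimal sketch, assuming $response is the json-decoded body of the /v1/messages call; extractToolInput() and its $required parameter are hypothetical names, not part of the module described here.

```php
<?php

/**
 * Returns the input of the first tool_use block for $toolName.
 *
 * @param array    $response json-decoded Messages API response body
 * @param string[] $required keys that must be present in the input
 */
function extractToolInput(array $response, string $toolName, array $required = []): array
{
    foreach ($response['content'] ?? [] as $block) {
        $isMatch = 'tool_use' === ($block['type'] ?? null)
            && $toolName === ($block['name'] ?? null);
        if (!$isMatch) {
            continue; // text blocks and other tools are ignored
        }

        $input = $block['input'] ?? null;
        if (!is_array($input)) {
            throw new \UnexpectedValueException('The tool_use block has no object input.');
        }

        // Cheap guard: a truncated response can omit required fields.
        $missing = array_diff($required, array_keys($input));
        if ([] !== $missing) {
            throw new \UnexpectedValueException('Missing fields: ' . implode(', ', $missing));
        }

        return $input;
    }

    throw new \UnexpectedValueException(sprintf(
        'No "%s" tool_use block in the response (stop_reason: %s).',
        $toolName,
        $response['stop_reason'] ?? 'unknown',
    ));
}
```

With the payload above, a call such as extractToolInput($response, 'save_contact', ['name']) would return the contact array or fail loudly, which is usually what you want before the data reaches program logic.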

Decision 4: meter usage where billing already lives

The failure mode of AI features isn't technical, it's economic: one enthusiastic user on your 9 €/month plan can spend more in tokens than their subscription. The fix is to treat AI usage like any other entitlement:

Every completed request writes an AiUsage row (organization, feature, input and output tokens). The API returns exact counts in the usage field of each response.
A quota check runs before each request, against limits defined in the same plan configuration as the rest of billing, not in a second config that drifts.
When the quota's gone, the user gets a clean "you've used this month's AI allowance, upgrade?", which makes AI a reason to upgrade, the only sane way to price it.

Tie the metering to the organization, not the user, for the same reason subscriptions attach there: teams share a plan, so they share its allowance.
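
A sketch of the pre-request check, in plain PHP so it runs without a database. AiQuotaGuard, AiQuotaExceeded, billableTokens() and the plan names are all hypothetical; in a real application the limits would come from the billing plan configuration and the running total from summing the organization's AiUsage rows for the current month.

```php
<?php

final class AiQuotaExceeded extends \RuntimeException
{
}

final class AiQuotaGuard
{
    /** @param array<string, int|null> $monthlyTokenLimits plan code => allowance, null = unlimited */
    public function __construct(private array $monthlyTokenLimits)
    {
    }

    /** Call before each AI request; throws when the allowance is spent. */
    public function assertCanSpend(string $plan, int $tokensUsedThisMonth): void
    {
        if (!array_key_exists($plan, $this->monthlyTokenLimits)) {
            throw new \InvalidArgumentException(sprintf('Unknown plan "%s".', $plan));
        }

        $limit = $this->monthlyTokenLimits[$plan];
        if (null !== $limit && $tokensUsedThisMonth >= $limit) {
            throw new AiQuotaExceeded(sprintf(
                'Plan "%s" allows %d AI tokens per month; %d already used.',
                $plan,
                $limit,
                $tokensUsedThisMonth,
            ));
        }
    }
}

/** Sums input and output tokens from the "usage" object of a response. */
function billableTokens(array $usage): int
{
    return (int) ($usage['input_tokens'] ?? 0) + (int) ($usage['output_tokens'] ?? 0);
}
```

Catching AiQuotaExceeded in the controller is where the "upgrade?" message would be produced, so the exception doubles as the upsell hook.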

Decision 5: retries with backoff, but never on POST-that-succeeded

The API returns 429 on rate limits and 529 when overloaded: both are retryable with exponential backoff and jitter. Two subtleties: respect the retry-after header when present, and only retry when you know the request didn't produce a billable completion (connection failures, requests rejected before processing such as a 429, explicit overload signals). A blind retry-on-5xx policy around a streaming response can double-bill a long completion that failed at the last byte.
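
One way to encode that policy, as a sketch: both function names, the 500 ms base, and the 20-second cap are assumptions, and treating "no HTTP response at all" as safe to retry is a simplification you may want to tighten for timeouts that occur after the request was sent.

```php
<?php

/**
 * Decides whether a failed attempt may be retried.
 *
 * @param int|null $status        HTTP status, or null when no response arrived
 * @param bool     $streamStarted true once any response body bytes were consumed
 */
function shouldRetry(?int $status, bool $streamStarted): bool
{
    if ($streamStarted) {
        return false; // tokens may already be billed; surface the error instead
    }

    // Connection failure, rate limit, or explicit overload signal.
    return null === $status || in_array($status, [429, 529], true);
}

/**
 * Delay in milliseconds before retry number $attempt (1-based).
 * A retry-after value from the server takes precedence over backoff.
 */
function retryDelayMs(int $attempt, ?int $retryAfterSeconds = null, int $baseMs = 500, int $capMs = 20_000): int
{
    if (null !== $retryAfterSeconds) {
        return $retryAfterSeconds * 1000;
    }

    // Exponential growth, capped, with jitter in the upper half of the window.
    $ceiling = (int) min($capMs, $baseMs * 2 ** ($attempt - 1));

    return random_int(intdiv($ceiling, 2), $ceiling);
}
```

Note that a plain 500 is deliberately not retried here: without knowing whether the model ran, retrying risks paying twice.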

Model choice, as of June 2026: claude-opus-4-8 as the capable default, claude-sonnet-4-6 for the speed/cost sweet spot on production traffic, claude-haiku-4-5 for cheap high-volume tasks. Make it an env var: model IDs change more often than your code should (see Anthropic's models documentation for the current list).

The shape of the whole thing

src/Ai/
├── Client/        AiClientInterface, AnthropicClient, FakeAiClient, SseEventParser
├── Extraction/    StructuredExtractor + your schemas
├── Quota/         per-plan limits, usage metering
└── Controller/    streaming chat, extraction endpoints

That module (streaming chat UI included, quotas pre-wired into the billing plans, the fake provider, and the tests for all of it) ships in ShipAnvil alongside the billing it plugs into. You can try the chat in the live demo: it runs on the offline stub, which is rather the point.