Why we built our capital markets agent stack on Claude

Long context that survives a prospectus, tool use that holds in a regulated environment, and a safety posture buyer-side compliance will sign off on. Here’s the working.

Author: Terry Tsakiris · Founder, Atheryon

Published: 19 May 2026

Reading time: 6 minutes

TL;DR

The short version

We built an evaluation framework around the dimensions that actually matter for a production agent stack across trading, risk, and operations workflows. We chose Claude. Three reasons: long context that survives a real prospectus, tool use that holds up in a regulated environment, and a safety posture that buyer-side compliance functions will actually sign off on. Here’s the working.

§ 01

The problem we were solving

Capital markets workflows are not chatbot workflows. A single agent run might need to:

01. Ingest a 400-page bond prospectus
02. Cross-reference it against an internal credit policy
03. Pull live pricing from S&P Global
04. Reconcile against a position in the firm’s risk system
05. Produce a report the desk can defend to compliance

Most LLM demos break before step two. We needed a model that doesn’t.

§ 02

Our evaluation framework

We built an evaluation framework around the dimensions that actually matter in a front office. Not benchmark scores — production constraints.

| Dimension | Weight | Why it matters |
| --- | --- | --- |
| Context window (effective, not max) | 25% | Prospectuses, ISDAs, and term sheets are all long. |
| Tool use reliability | 20% | Agent must call risk systems without drift. |
| Instruction following under load | 15% | Compliance language must not be paraphrased. |
| Latency under concurrent load | 10% | Trading windows close. |
| Cost per million tokens | 10% | Margin matters. |
| Safety / refusal calibration | 10% | Over-refusal breaks workflows; under-refusal breaks audits. |
| Enterprise data controls | 10% | No training on our data. |
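To make the weighting concrete, here is a minimal sketch of how per-dimension scores could roll up into a single number. The weights come from the table above. The 0 to 10 scoring scale, the dictionary keys, and the example scores are assumptions for illustration, not measured results.

```python
# Weights copied from the evaluation table; they should sum to 1.0.
WEIGHTS = {
    "effective_context": 0.25,
    "tool_use_reliability": 0.20,
    "instruction_following": 0.15,
    "latency_concurrent": 0.10,
    "cost_per_mtok": 0.10,
    "safety_calibration": 0.10,
    "enterprise_data_controls": 0.10,
}


def weighted_score(scores):
    """Combine per-dimension scores (assumed 0-10 scale) into one weighted total."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for an illustrative candidate model, not real data.
example_scores = {
    "effective_context": 8.0,
    "tool_use_reliability": 7.0,
    "instruction_following": 9.0,
    "latency_concurrent": 6.0,
    "cost_per_mtok": 5.0,
    "safety_calibration": 8.0,
    "enterprise_data_controls": 10.0,
}

if __name__ == "__main__":
    print(round(weighted_score(example_scores), 2))
```

Raising an error on a missing dimension is a deliberate choice in this sketch: a silently skipped dimension would inflate or deflate a candidate without anyone noticing.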

§ 03

Why Claude won

Long context that doesn’t degrade

The published context numbers are marketing. What matters is whether the model can answer a question that requires reasoning over content on page 312 of a prospectus. In our tests against EU Green Bond framework documents, Claude maintained recall and reasoning quality through documents that caused noticeable degradation in competing models. For a front-to-back agent, “the answer is somewhere in the document but the model forgot” is a non-starter.
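One way to check for this kind of degradation is to group graded answers by where the supporting passage sits in the document. The sketch below assumes each probe question has already been graded correct or incorrect by a reviewer or rubric. The `ProbeResult` type and the four-bucket split are hypothetical choices, not a description of our harness.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ProbeResult:
    page: int          # page where the supporting passage sits (1-indexed)
    total_pages: int   # length of the source document
    correct: bool      # graded outside this sketch, by a person or rubric


def accuracy_by_depth(results, buckets=4):
    """Return accuracy per depth bucket (0 = start of document, buckets-1 = end)."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for r in results:
        # Map the page position onto a bucket index, clamped to the last bucket.
        idx = min((r.page - 1) * buckets // r.total_pages, buckets - 1)
        counts[idx] += 1
        hits[idx] += int(r.correct)
    return {i: hits[i] / counts[i] for i in sorted(counts)}


if __name__ == "__main__":
    # Illustrative values only.
    sample = [
        ProbeResult(page=10, total_pages=400, correct=True),
        ProbeResult(page=312, total_pages=400, correct=False),
        ProbeResult(page=390, total_pages=400, correct=True),
    ]
    print(accuracy_by_depth(sample))
```

If accuracy in the later buckets falls well below the first bucket, the model is losing content deep in the document, whatever its advertised window.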

Tool use that holds the line

Our agents call internal tools: a pricing service, a risk API, a position store, a compliance check. In stress tests with 8+ tools available and ambiguous instructions, Claude’s tool selection was the most consistent. It also made fewer hallucinated tool calls, which is critical when a wrong call costs real money or generates a real audit finding.
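Both properties can be measured from logged runs. The sketch below assumes you replay the same ambiguous prompt several times and record which tool the model picked each time. The tool names in `REGISTRY` and the sample log are hypothetical stand-ins for the four internal tools mentioned above.

```python
from collections import Counter


def selection_consistency(chosen_tools):
    """Share of repeated runs that picked the most common tool (1.0 = fully consistent)."""
    if not chosen_tools:
        raise ValueError("need at least one run")
    most_common_count = Counter(chosen_tools).most_common(1)[0][1]
    return most_common_count / len(chosen_tools)


def hallucinated_call_rate(chosen_tools, registry):
    """Share of calls that name a tool missing from the registry."""
    if not chosen_tools:
        raise ValueError("need at least one run")
    unknown = sum(1 for name in chosen_tools if name not in registry)
    return unknown / len(chosen_tools)


# Hypothetical tool names, for illustration only.
REGISTRY = {"pricing_service", "risk_api", "position_store", "compliance_check"}

if __name__ == "__main__":
    # Five replays of one ambiguous prompt; the last call names a tool that does not exist.
    runs = ["risk_api", "risk_api", "position_store", "risk_api", "get_var_report"]
    print(selection_consistency(runs), hallucinated_call_rate(runs, REGISTRY))
```

Tracking the two numbers separately matters: a model can be perfectly consistent while consistently calling a tool that does not exist.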

Safety calibration that compliance approves

This one surprised us. We expected to fight the safety layer. Instead, Claude’s calibration — say what you can, decline what you can’t, explain why — aligned almost exactly with how a buyer-side second-line risk function wants an analyst to behave. The same posture that makes Claude “cautious” in consumer settings makes it deployable in regulated ones.

The Anthropic stance on safety is a procurement advantage

When a second-line risk function reviews an AI vendor, it isn’t reading benchmarks. It’s reading model cards, responsible scaling policies, and incident disclosures. Anthropic’s published positions (Responsible Scaling Policy, Constitutional AI, transparent post-deployment monitoring) clear procurement gates that would otherwise require months of legal and risk review.

§ 04

What we built

Atheryon’s reference system is a front-to-back capital markets agent stack.

Orchestration: Claude

Agent 01: Trading Systems · Market Data (S&P)
Agent 02: Risk Management · Reference Data
Agent 03: Portfolio Analytics · Enterprise Data
Agent 04: Operations & Reporting · Unstructured (Research / News)

Reference architecture. For which components are shipped, in build, or on the roadmap, see /roadmap.

Each agent is single-purpose. They share a Claude-driven orchestration layer that routes work, manages tool selection, and enforces compliance constraints declaratively rather than in code.
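As a rough illustration of what "declaratively rather than in code" can mean, the sketch below keeps constraints as data and runs every proposed tool call through one generic checker. The `Constraint` shape, the rule name, the tool name, and the 5,000,000 notional cap are all hypothetical and do not describe Atheryon's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    name: str              # identifier reported when the rule is violated
    applies_to: frozenset  # tool names the rule covers
    field: str             # argument of the tool call to inspect
    max_value: float       # inclusive upper bound for that argument


# Hypothetical rule set; in practice this could be loaded from a reviewed config file.
CONSTRAINTS = [
    Constraint("notional_cap", frozenset({"position_store"}), "notional", 5_000_000),
]


def violations(tool, args, constraints=CONSTRAINTS):
    """Return the names of every constraint a proposed tool call would break."""
    broken = []
    for c in constraints:
        if tool in c.applies_to and args.get(c.field, 0) > c.max_value:
            broken.append(c.name)
    return broken


if __name__ == "__main__":
    print(violations("position_store", {"notional": 10_000_000}))
```

The point of the pattern is that adding or tightening a rule changes the data, not the checker, so a compliance reviewer can read the rule set without reading the orchestration code.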

§ 05

What we’d tell another shop evaluating today

01. Don’t pick on benchmarks. Pick on procurement. The model your compliance team will approve is worth more than the model that’s 2 points higher on MMLU.
02. Test long context with your actual documents. Synthetic needle-in-haystack tests will mislead you.
03. Stress tool use, not raw generation. Agents live or die on tool selection under ambiguity.
04. Measure cost per task, not per token. A cheaper model that needs three retries is not cheaper.
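The cost-per-task point is simple arithmetic. The sketch below assumes attempts are independent and retried until one succeeds, so the expected number of attempts is 1 divided by the success rate. The prices, token counts, and success rates are hypothetical, not any vendor's actual pricing.

```python
def cost_per_task(tokens_per_attempt, price_per_mtok, success_rate):
    """Expected cost of one successful task when failed attempts are retried.

    Assumes independent attempts, so expected attempts = 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = tokens_per_attempt / 1_000_000 * price_per_mtok
    return cost_per_attempt / success_rate


if __name__ == "__main__":
    # Hypothetical numbers: the model with half the token price fails more often.
    pricier = cost_per_task(tokens_per_attempt=50_000, price_per_mtok=3.0, success_rate=0.9)
    cheaper = cost_per_task(tokens_per_attempt=50_000, price_per_mtok=1.5, success_rate=0.3)
    print(f"pricier model: {pricier:.4f} per task, cheaper model: {cheaper:.4f} per task")
```

With these illustrative inputs the model with the lower token price ends up costing more per completed task, before counting the latency of the extra attempts.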

§ 06

Where we go next

We’re building the case studies. If you’re at an Australian bank, asset manager, or capital markets infrastructure provider and want to see the reference system, book a system assessment.
