
Why we built our capital markets agent stack on Claude

Long context that survives a prospectus, tool use that holds in a regulated environment, and a safety posture buyer-side compliance will sign off on. Here’s the working.

Author: Terry Tsakiris · Founder, Atheryon
Published: 19 May 2026
Reading time: 6 minutes
TL;DR

The short version

We built an evaluation framework around the dimensions that matter for a production agent stack across trading, risk, and operations workflows, and on that framework we chose Claude. Three reasons: long context that survives a real prospectus, tool use that holds up in a regulated environment, and a safety posture that buyer-side compliance functions will sign off on. Here’s the working.

§ 01

The problem we were solving

Capital markets workflows are not chatbot workflows. A single agent run might need to:

  • Ingest a 400-page bond prospectus
  • Cross-reference it against an internal credit policy
  • Pull live pricing from S&P Global
  • Reconcile against a position in the firm’s risk system
  • Produce a report the desk can defend to compliance

Most LLM demos break before step two. We needed a model that doesn’t.

§ 02

Our evaluation framework

We built an evaluation framework around the dimensions that actually matter in a front office. Not benchmark scores — production constraints.

Dimension                              Weight   Why it matters
Context window (effective, not max)    25%      Prospectuses, ISDAs, and term sheets are all long.
Tool use reliability                   20%      Agent must call risk systems without drift.
Instruction following under load       15%      Compliance language must not be paraphrased.
Latency under concurrent load          10%      Trading windows close.
Cost per million tokens                10%      Margin matters.
Safety / refusal calibration           10%      Over-refusal breaks workflows; under-refusal breaks audits.
Enterprise data controls               10%      No training on our data.

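A weighted framework like this one reduces to a weighted sum. The Python sketch below shows the arithmetic using the weights from the table. The 0 to 10 scoring scale, the dictionary keys, and the example scores are assumptions made for illustration; they are not measured results.

```python
# Weights from the evaluation table, expressed as fractions of 1.
WEIGHTS = {
    "effective_context": 0.25,
    "tool_use_reliability": 0.20,
    "instruction_following": 0.15,
    "latency_under_load": 0.10,
    "cost_per_million_tokens": 0.10,
    "safety_calibration": 0.10,
    "enterprise_data_controls": 0.10,
}


def weighted_score(scores: dict) -> float:
    """Combine per-dimension scores (assumed 0-10 scale) into one number."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for an imaginary candidate model (illustration only).
example_scores = {
    "effective_context": 8.0,
    "tool_use_reliability": 9.0,
    "instruction_following": 7.0,
    "latency_under_load": 6.0,
    "cost_per_million_tokens": 5.0,
    "safety_calibration": 9.0,
    "enterprise_data_controls": 10.0,
}

total = weighted_score(example_scores)
print(f"weighted score: {total:.2f}")
```

Raising an error on a missing dimension is a deliberate choice in this sketch: a silently skipped row would inflate or deflate the total without anyone noticing.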
§ 03

Why Claude won

01

Long context that doesn’t degrade

The published context numbers are marketing. What matters is whether the model can answer a question that requires reasoning over content on page 312 of a prospectus. In our tests against EU Green Bond framework documents, Claude maintained recall and reasoning quality through documents that caused noticeable degradation in competing models. For a front-to-back agent, “the answer is somewhere in the document but the model forgot” is a non-starter.
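One way to make "degradation" measurable is to tag each test question with the page its answer depends on, then compare accuracy across page bands. The sketch below assumes that setup. The `Result` type, the 100-page band size, and the sample data are all hypothetical placeholders, not our test results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Result:
    page: int       # page the answer depends on (1-indexed)
    correct: bool   # graded by a reviewer or a rubric


def accuracy_by_depth(results, band_size=100):
    """Group results into page bands and return accuracy per band."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        band = (r.page - 1) // band_size
        totals[band] += 1
        hits[band] += int(r.correct)
    return {
        f"pages {b * band_size + 1}-{(b + 1) * band_size}": hits[b] / totals[b]
        for b in sorted(totals)
    }


# Placeholder data for illustration only.
sample = [
    Result(page=12, correct=True),
    Result(page=87, correct=True),
    Result(page=140, correct=True),
    Result(page=199, correct=False),
    Result(page=312, correct=True),
    Result(page=388, correct=False),
]

report = accuracy_by_depth(sample)
for band, accuracy in report.items():
    print(f"{band}: {accuracy:.0%}")
```

A flat profile across bands suggests the model is holding the whole document; a drop in the later bands is the "model forgot" failure described above.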

02

Tool use that holds the line

Our agents call internal tools: a pricing service, a risk API, a position store, a compliance check. In stress tests with 8+ tools available and ambiguous instructions, Claude’s tool selection was the most consistent of the models we tested. It also made fewer hallucinated tool calls, which is critical when a wrong call costs real money or generates a real audit finding.
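Both properties can be scored from logs of repeated runs. The sketch below is a hypothetical illustration: the tool names mirror the four tools mentioned above, `exposure_lookup` stands in for a tool the model invented, and the run data is made up.

```python
from collections import Counter

# Hypothetical registry, mirroring the four internal tools named above.
REGISTERED_TOOLS = {
    "pricing_service",
    "risk_api",
    "position_store",
    "compliance_check",
}


def selection_consistency(chosen_tools):
    """Share of repeated runs that picked the most common tool for one prompt."""
    if not chosen_tools:
        raise ValueError("need at least one run")
    _, top_count = Counter(chosen_tools).most_common(1)[0]
    return top_count / len(chosen_tools)


def hallucinated_calls(chosen_tools):
    """Tool names the model emitted that do not exist in the registry."""
    return [name for name in chosen_tools if name not in REGISTERED_TOOLS]


# Placeholder: the tool chosen on each of five runs of the same ambiguous prompt.
runs = ["risk_api", "risk_api", "position_store", "risk_api", "exposure_lookup"]

consistency = selection_consistency(runs)
invented = hallucinated_calls(runs)
print(f"consistency: {consistency:.0%}, hallucinated: {invented}")
```

Consistency here only measures agreement between runs, not correctness, so in practice it would sit alongside a labelled "expected tool" for each prompt.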

03

Safety calibration that compliance approves

This one surprised us. We expected to fight the safety layer. Instead, Claude’s calibration — say what you can, decline what you can’t, explain why — aligned almost exactly with how a buyer-side second-line risk function wants an analyst to behave. The same posture that makes Claude “cautious” in consumer settings makes it deployable in regulated ones.
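Calibration can be put into numbers by labelling each test prompt with whether a refusal is the right outcome, then counting errors in both directions. This is a minimal sketch under that assumption; the `Case` type and the sample cases are hypothetical, and in practice the labels would come from a compliance reviewer.

```python
from dataclasses import dataclass


@dataclass
class Case:
    should_refuse: bool   # reviewer's label (assumed to exist)
    did_refuse: bool      # what the model actually did


def refusal_rates(cases):
    """Return (over_refusal_rate, under_refusal_rate)."""
    allowed = [c for c in cases if not c.should_refuse]
    restricted = [c for c in cases if c.should_refuse]
    # Over-refusal: declined a request it should have answered.
    over = sum(c.did_refuse for c in allowed) / len(allowed) if allowed else 0.0
    # Under-refusal: answered a request it should have declined.
    under = (
        sum(not c.did_refuse for c in restricted) / len(restricted)
        if restricted
        else 0.0
    )
    return over, under


# Placeholder cases for illustration only.
cases = [
    Case(should_refuse=False, did_refuse=False),
    Case(should_refuse=False, did_refuse=False),
    Case(should_refuse=False, did_refuse=True),
    Case(should_refuse=False, did_refuse=False),
    Case(should_refuse=True, did_refuse=True),
    Case(should_refuse=True, did_refuse=False),
]

over_rate, under_rate = refusal_rates(cases)
print(f"over-refusal: {over_rate:.0%}, under-refusal: {under_rate:.0%}")
```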

04

The Anthropic stance on safety is a procurement advantage

When a second-line risk function reviews an AI vendor, they aren’t reading benchmarks. They’re reading model cards, responsible scaling policies, and incident disclosures. Anthropic’s published positions — Responsible Scaling Policy, Constitutional AI, transparent post-deployment monitoring — close procurement gates that would otherwise require months of legal and risk review.

§ 04

What we built

Atheryon’s reference system is a front-to-back capital markets agent stack.

Orchestration
Claude
  • Agent 01
    Trading Systems
    Market Data (S&P)
  • Agent 02
    Risk Management
    Reference Data
  • Agent 03
    Portfolio Analytics
    Enterprise Data
  • Agent 04
    Operations & Reporting
    Unstructured (Research / News)

Figure: reference architecture. Component status (shipped, building, or roadmap) is listed at /roadmap.

Each agent is single-purpose. They share a Claude-driven orchestration layer that routes work, manages tool selection, and enforces compliance constraints declaratively rather than in code.
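To illustrate what "declaratively rather than in code" can mean, the sketch below keeps the rules as data and runs every proposed tool call through one generic check. It is not the Atheryon implementation: the `Constraint` shape, the agent and tool names, and the notional limit are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Constraint:
    allowed_tools: FrozenSet[str]
    max_notional: Optional[float] = None   # hypothetical per-call limit


# Hypothetical rules, expressed as data rather than as branching code.
CONSTRAINTS = {
    "risk_management": Constraint(frozenset({"risk_api", "position_store"})),
    "trading_systems": Constraint(
        frozenset({"pricing_service"}), max_notional=5_000_000
    ),
}


def check_call(agent: str, tool: str, notional: float = 0.0) -> Tuple[bool, str]:
    """Return (allowed, reason) for a proposed tool call."""
    rule = CONSTRAINTS.get(agent)
    if rule is None:
        # Default-deny: an agent with no declared rule may call nothing.
        return False, f"no constraint declared for agent '{agent}'"
    if tool not in rule.allowed_tools:
        return False, f"'{tool}' is not permitted for '{agent}'"
    if rule.max_notional is not None and notional > rule.max_notional:
        return False, f"notional {notional:,.0f} exceeds limit {rule.max_notional:,.0f}"
    return True, "ok"


print(check_call("risk_management", "risk_api"))
print(check_call("trading_systems", "pricing_service", notional=6_000_000))
```

The appeal of this shape is that a rule change is an edit to `CONSTRAINTS`, which can be reviewed and versioned by a risk function without reading the orchestration logic.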

§ 05

What we’d tell another shop evaluating today

  1. Don’t pick on benchmarks. Pick on procurement.

     The model your compliance team will approve is worth more than the model that’s 2 points higher on MMLU.

  2. Test long context with your actual documents.

     Synthetic needle-in-haystack tests will mislead you.

  3. Stress tool use, not raw generation.

     Agents live or die on tool selection under ambiguity.

  4. Measure cost per task, not per token.

     A cheaper model that needs three retries is not cheaper.
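The cost-per-task point is simple arithmetic: multiply the token price by the tokens per attempt and by the number of attempts a task needs. The prices, token counts, and retry counts below are invented for illustration and do not describe any real model.

```python
def cost_per_task(price_per_million_tokens, tokens_per_attempt, attempts):
    """Total spend to complete one task, counting every attempt."""
    return price_per_million_tokens * tokens_per_attempt / 1_000_000 * attempts


# Hypothetical model A: higher token price, succeeds on the first attempt.
model_a = cost_per_task(price_per_million_tokens=3.00,
                        tokens_per_attempt=200_000,
                        attempts=1)

# Hypothetical model B: half the token price, but needs three retries
# (four attempts in total) to finish the same task.
model_b = cost_per_task(price_per_million_tokens=1.50,
                        tokens_per_attempt=200_000,
                        attempts=4)

print(f"model A: {model_a:.2f} per task")
print(f"model B: {model_b:.2f} per task")
```

With these made-up numbers, the model at half the token price costs twice as much per completed task.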

§ 06

Where we go next

We’re building the case studies. If you’re at an Australian bank, asset manager, or capital markets infrastructure provider and want to see the reference system, book a system assessment.
