TradeArena v0.1 Benchmark Card

Execution realism and risk gates materially change LLM trading-agent evaluation. This is a compact, citable result page for auditable agent evaluation, not a profitability claim or financial advice.


Result Provenance

Where this benchmark card comes from and how to reproduce it.

Field	Value
Software release	v0.1.2
Benchmark lineage	v0.1 snapshot
Benchmark card source	`docs/results/` tracked snapshots plus first-run outputs
Reproduction command	`python -m pip install -e ".[dev]"`; `python scripts/run_showcase.py`; `python scripts/build_benchmark_page.py`
Data	tracked synthetic / timestamp-masked / redacted artifacts under `docs/results/`
Live model calls	not required for first-run reproduction
Raw prompt/response caches	not included
Intended use	benchmark and audit research, not trading advice

What Is Measured

The v0.1 card emphasizes audit and execution dimensions, not only return.

Return and max drawdown.
Fill rate, rejection rate, spread, latency, slippage, and partial fills.
Risk edits, clipped decisions, violations, and audit completeness.
Concentration / Herfindahl for portfolio probes.
Calibration and representation robustness diagnostics.
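
As a rough illustration of two of the metrics listed above, the sketch below computes max drawdown from an equity curve and a Herfindahl concentration index from portfolio weights. The function names, the normalization of weights, and the sign convention (drawdown as a negative fraction, matching the tables on this page) are assumptions for illustration, not TradeArena's actual implementation.

```python
from typing import Sequence


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest peak-to-trough decline as a negative fraction (0.0 if none)."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                 # running high-water mark
        worst = min(worst, value / peak - 1.0)  # decline from that mark
    return worst


def herfindahl(weights: Sequence[float]) -> float:
    """Sum of squared normalized absolute weights; 1/N for N equal weights."""
    total = sum(abs(w) for w in weights)
    if total == 0:
        return 0.0
    return sum((abs(w) / total) ** 2 for w in weights)


curve = [100.0, 110.0, 99.0, 120.0]
print(f"{max_drawdown(curve):.2%}")        # -10.00%
print(f"{herfindahl([1 / 51] * 51):.4f}")  # 0.0196
```

Under this definition, equal weights across 51 names give 1/51, or about 0.0196, which is the lower bound for a fully invested 51-stock portfolio.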

First-Run Execution Benchmark

Deterministic, local benchmark cases generated by the quickstart path.

Scenario	Agent / baseline	Return	Max drawdown	Fill rate	Rejection rate	Risk edits	Audit completeness
deterministic quickstart	buy_and_hold_realistic	53.74%	-6.63%	89.83%	8.90%	0	100.00%
deterministic quickstart	risk_aware_realistic	35.08%	-1.26%	90.34%	7.95%	124	100.00%

Key Result 1: Risk Gates Are Active, Not Cosmetic

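Risk gates repeatedly edit or clip intended allocations before execution. In the deterministic quickstart, risk_aware_realistic records 124 risk edits against 0 for buy_and_hold_realistic, alongside a shallower max drawdown (-1.26% vs -6.63%) and a lower return (35.08% vs 53.74%).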

The benchmark reports risk edits alongside return so that risk control is visible in the result card.
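
To make the idea of a counted risk edit concrete, here is a minimal sketch of a gate that clips each intended weight to a per-position cap and counts how many positions it changed. The function `apply_risk_gate`, the `max_weight` parameter, and the 10% cap are hypothetical; the gates TradeArena actually applies are not described on this page.

```python
from typing import Dict, Tuple


def apply_risk_gate(
    intended: Dict[str, float],
    max_weight: float = 0.10,  # hypothetical per-position cap
) -> Tuple[Dict[str, float], int]:
    """Clip each weight to [-max_weight, max_weight] and count the edits."""
    gated: Dict[str, float] = {}
    edits = 0
    for symbol, weight in intended.items():
        clipped = max(-max_weight, min(max_weight, weight))
        if clipped != weight:
            edits += 1  # one risk edit per clipped position
        gated[symbol] = clipped
    return gated, edits


intended = {"AAA": 0.25, "BBB": 0.05, "CCC": -0.15}
gated, edits = apply_risk_gate(intended)
print(gated)  # {'AAA': 0.1, 'BBB': 0.05, 'CCC': -0.1}
print(edits)  # 2
```

Returning the edit count next to the gated weights is what lets a result row report risk control as a number rather than leaving it implicit in the return.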

Crisis-Scene LLM Benchmark

Timestamp-masked 2022 Tech/Rates and 2023 SVB stress paths, with LLM policy results averaged within each feedback mode (hidden, placebo, true).

Scenario	Agent / baseline	Return	Max drawdown	Fill rate	Rejection rate	Risk edits	Audit completeness
svb_2023	LLM policies (hidden feedback)	1.23%	-1.86%	75.59%	24.41%	212	100.00%
svb_2023	LLM policies (placebo feedback)	1.19%	-1.87%	76.88%	23.12%	204	100.00%
svb_2023	LLM policies (true feedback)	1.08%	-1.87%	78.16%	21.84%	196	100.00%
tech_rates_2022	LLM policies (hidden feedback)	-3.27%	-4.89%	65.93%	34.07%	914	100.00%
tech_rates_2022	LLM policies (placebo feedback)	-2.57%	-4.40%	63.10%	36.90%	906	100.00%
tech_rates_2022	LLM policies (true feedback)	-3.09%	-4.81%	64.33%	35.67%	778	100.00%

True-Feedback Model Rows

Policy-level rows under structured true risk feedback. Model names are redacted or normalized labels; raw provider prompts and responses are not shipped.

Scenario	Policy label	Return	Max drawdown	Fill rate	Risk edits	Violations	Calibration
svb_2023	frontier-policy-D (redacted)	1.03%	-1.87%	80.16%	177	13	0.198
svb_2023	frontier-policy-B (redacted)	1.24%	-1.89%	77.37%	206	14	0.207
svb_2023	frontier-policy-C (redacted)	0.67%	-1.88%	78.52%	196	14	0.397
svb_2023	frontier-policy-A (redacted)	1.39%	-1.86%	76.60%	203	14	0.224
tech_rates_2022	frontier-policy-D (redacted)	-1.84%	-4.79%	62.62%	863	39	0.071
tech_rates_2022	frontier-policy-B (redacted)	-2.49%	-4.53%	63.90%	969	41	0.067
tech_rates_2022	frontier-policy-C (redacted)	-5.32%	-5.32%	69.10%	402	226	0.435
tech_rates_2022	frontier-policy-A (redacted)	-2.72%	-4.61%	61.69%	880	38	0.090
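
The policy-level rows can be cross-checked against the averaged crisis-scene rows. The sketch below takes an unweighted mean over the four svb_2023 true-feedback policies using the values in the table above. That the aggregate is an unweighted mean is an inference from the numbers, not something this page documents.

```python
from statistics import mean

# (return %, fill rate %, risk edits) copied from the svb_2023 rows above
svb_true_rows = {
    "frontier-policy-A": (1.39, 76.60, 203),
    "frontier-policy-B": (1.24, 77.37, 206),
    "frontier-policy-C": (0.67, 78.52, 196),
    "frontier-policy-D": (1.03, 80.16, 177),
}

avg_return = mean(row[0] for row in svb_true_rows.values())
avg_fill = mean(row[1] for row in svb_true_rows.values())
avg_edits = mean(row[2] for row in svb_true_rows.values())

print(f"return {avg_return:.4f}%  fill {avg_fill:.4f}%  edits {avg_edits:.1f}")
```

The means come out near 1.0825%, 78.1625%, and 195.5, which round to the 1.08%, 78.16%, and 196 shown for svb_2023 under true feedback in the Crisis-Scene LLM Benchmark table.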

Key Result 2: Execution Assumptions Change Realized Exposure

The 51-stock hourly probe separates intended allocation from realistic execution outcomes: fill rates in the probe range from 44.40% (Latency Stress) to 87.94% (Markowitz MVO).

Fill rate, rejected orders, latency, and slippage become part of the benchmark outcome.
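
One way to derive such execution metrics from an order log is sketched below. The `Order` record and the definitions used here (fill rate as the share of orders fully filled, with partial fills and rejections tracked separately) are assumptions for illustration; the exact definitions behind the tables on this page are not stated.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Order:
    requested_qty: float
    filled_qty: float  # 0.0 when the order was rejected
    rejected: bool


def execution_summary(orders: List[Order]) -> Dict[str, float]:
    """Share of orders fully filled, partially filled, and rejected."""
    n = len(orders)
    if n == 0:
        return {"fill_rate": 0.0, "partial_rate": 0.0, "rejection_rate": 0.0}
    accepted = [o for o in orders if not o.rejected]
    full = sum(1 for o in accepted if o.filled_qty >= o.requested_qty)
    partial = sum(1 for o in accepted if 0 < o.filled_qty < o.requested_qty)
    rejected = n - len(accepted)
    return {
        "fill_rate": full / n,
        "partial_rate": partial / n,
        "rejection_rate": rejected / n,
    }


orders = [
    Order(100, 100, False),
    Order(50, 50, False),
    Order(80, 30, False),  # partial fill
    Order(40, 0, True),    # rejected before execution
]
print(execution_summary(orders))
# {'fill_rate': 0.5, 'partial_rate': 0.25, 'rejection_rate': 0.25}
```

Keeping partial fills as their own bucket means fill rate and rejection rate need not sum to 100%, so a residual between them stays visible instead of being absorbed into either number.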

51-Stock Intraday Portfolio Probe

Passive, deterministic, Markowitz/MVO, execution-stress, and redacted LLM policy rows on a 51-stock hourly panel.

Agent / baseline	Return	Max drawdown	Fill rate	Rejected orders	Risk edits	Herfindahl	Audit completeness
Buy and Hold	0.71%	-0.79%	87.16%	98	2040	0.019	100.00%
Deterministic Risk-Aware	-1.35%	-3.11%	76.26%	91	672	0.062	100.00%
Markowitz MVO	-0.54%	-1.35%	87.94%	185	357	0.023	100.00%
Low-Liquidity Stress	-2.16%	-3.60%	75.55%	95	672	0.062	100.00%
Latency Stress	1.96%	-1.33%	44.40%	209	672	0.081	100.00%
Frontier Policy A (redacted)	-2.23%	-2.93%	71.86%	378	2924	0.045	100.00%
Frontier Policy C (redacted)	-0.53%	-2.31%	63.60%	254	1200	0.035	100.00%

Key Result 3: Audit Completeness Is A Benchmark Dimension

Each row should be traceable to a trajectory, not just a return curve.

TradeArena keeps compact result snapshots and redacted manifests so users can inspect what happened without shipping raw provider text.
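
A completeness score of this kind could be computed as the share of trajectory steps that carry every required audit field. The field names in `REQUIRED_FIELDS` and the step layout below are hypothetical, chosen only to illustrate the check; they are not TradeArena's manifest schema.

```python
from typing import Iterable, Mapping

# Hypothetical audit fields expected on every trajectory step
REQUIRED_FIELDS = ("step", "intended_weights", "gated_weights", "orders", "fills")


def audit_completeness(trajectory: Iterable[Mapping[str, object]]) -> float:
    """Fraction of steps in which every required field is present and not None."""
    steps = list(trajectory)
    if not steps:
        return 0.0
    complete = sum(
        1
        for step in steps
        if all(step.get(name) is not None for name in REQUIRED_FIELDS)
    )
    return complete / len(steps)


trajectory = [
    {"step": 0, "intended_weights": {"AAA": 0.2}, "gated_weights": {"AAA": 0.1},
     "orders": [{"symbol": "AAA", "qty": 10}], "fills": [{"symbol": "AAA", "qty": 10}]},
    {"step": 1, "intended_weights": {"AAA": 0.2}, "gated_weights": {"AAA": 0.1},
     "orders": [{"symbol": "AAA", "qty": 5}]},  # missing "fills"
]
print(f"{audit_completeness(trajectory):.2%}")  # 50.00%
```

A row reporting 100.00% would then mean that no step in the trajectory is missing any of the required fields.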

Representation Robustness Snapshot

80 rolling failure anchors and 320 pre-failure steps across eight LLM trajectories.

Embedding	View	Anchors	Pre-failure steps	Mean rank delta	Contraction rate	Mean pre-shift
hash64	fused	80	320	0.471	67.50%	0.071
lsa32	fused	80	320	5.123	86.25%	0.084
hash64	plan	80	320	8.703	97.50%	0.122
lsa32	plan	80	320	4.870	85.00%	0.097

Limitations

This page is a benchmark and audit artifact, not financial advice or a live-trading guarantee. First-run reproduction uses tracked artifacts, and public policy rows use redacted or normalized labels. Raw provider prompts, responses, credentials, and caches are not shipped.