TradeArena v0.2 Benchmark Card

Execution realism and risk gates materially change autonomous financial-agent evaluation. This is a compact, citable result page for agent reliability and intent-to-execution audit, not a profitability claim or financial advice.

Execution mode: realistic-stress, not calibrated transaction-cost prediction. Default results use shared stress assumptions; calibrated claims require quote/order-book/fill provenance.

Result Provenance

Where this benchmark card comes from and how to reproduce it.

| Field | Value |
|---|---|
| Software release | v0.2.0 |
| Benchmark lineage | v0.2 snapshot |
| Benchmark card source | `docs/results/` tracked snapshots plus first-run outputs |
| Reproduction command | `python -m pip install -e ".[dev]"`; `python scripts/run_showcase.py`; `python scripts/build_benchmark_page.py` |
| Data | tracked synthetic / timestamp-masked / redacted artifacts under `docs/results/` |
| Live model calls | not required for first-run reproduction |
| Raw prompt/response caches | not included |
| Intended use | agent reliability and audit research, not trading advice |

Claim Badges And Validation Status

Rows are evidence-ranked before they are ranked by performance.

| Surface | Badge / status | Evidence boundary |
|---|---|---|
| Execution mode | `stress-only` by default | Default rows are stress-simulator evidence, not calibrated transaction-cost prediction. |
| Calibration evidence | `quote-calibrated` sample rows | Fixture and public Binance samples show calibration plumbing; broader venue claims need external reports. |
| Provider rows | `cached-provider` / `redacted-prompt` | Useful reliability probes; not enough for strong model-skill claims without independent repetition. |
| Baselines | `deterministic-baseline` | Classical baselines are main anchors, not appendix rows. |
| Reproduction | `fresh-environment` CI passing | CI installs `tradearena-benchmark==0.2.0`, generates artifacts, hashes a run, and replays a step. |
| External validation | open: macOS, Ubuntu, Colab/Binder, baseline, calibration, claim review | Independent reports are requested in issues #43, #44, #45, #46, #47, and #48. |

What Is Measured

The v0.2 card emphasizes audit and execution dimensions, not only return.

First-Run Execution Benchmark

Deterministic, local benchmark cases generated by the quickstart path.

| Scenario | Agent / baseline | Return | Max drawdown | Fill rate | Rejection rate | Risk edits | Audit completeness |
|---|---|---|---|---|---|---|---|
| deterministic quickstart | buy_and_hold_realistic | 53.74% | -6.63% | 89.83% | 8.90% | 0 | 100.00% |
| deterministic quickstart | risk_aware_realistic | 35.08% | -1.26% | 90.34% | 7.95% | 124 | 100.00% |

Non-LLM Classical Baseline Check

Deterministic baselines answer whether LLM policies outperform fixed non-LLM strategies, not only other LLMs.

| Universe | Scenario | Best classical | Classical return | Best LLM | LLM return | Return gap | LLM wins? |
|---|---|---|---|---|---|---|---|
| real_market | Yahoo 2022 rates drawdown | Risk parity | 4.77% | poe:gemini-3.1-pro | 2.56% | -2.21% | no |
| real_market | Yahoo recent GSPC/BTC/BTC futures | Buy and hold | 12.15% | poe:gemini-3.1-pro | 4.86% | -7.29% | no |
| synthetic | Calm trend | Buy and hold | 2.61% | poe:kimi-k2.5 | 3.19% | 0.59% | yes |
| synthetic | High volatility | Mean reversion | 1.88% | poe:gemini-3.1-pro | 1.44% | -0.44% | no |
| synthetic | Jump and tail risk | Buy and hold | 2.81% | poe:gpt-5.5 | 1.67% | -1.14% | no |
| synthetic | Latency spike | Buy and hold | 3.29% | poe:gemini-3.1-pro | 3.29% | 0.00% | no |
| synthetic | Liquidity collapse | Minimum variance | 9.07% | poe:gpt-5.5 | 4.42% | -4.65% | no |
| synthetic | Spread explosion | Buy and hold | 1.07% | deepseek:deepseek-v4-pro | 0.48% | -0.59% | no |
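
The gap and win columns can be recomputed from the two return columns. The sketch below assumes the gap is the LLM return minus the best classical return, in percentage points, and that a win requires a strictly positive gap (so the Latency spike tie reads as "no"). Both rules are inferred from the table, not taken from the TradeArena source, and recomputing from the rounded values shown here can differ from a published gap by 0.01.

```python
from decimal import Decimal

# (scenario, classical_return_pct, llm_return_pct) copied from the table above.
ROWS = [
    ("Yahoo 2022 rates drawdown", "4.77", "2.56"),
    ("Latency spike", "3.29", "3.29"),
    ("Liquidity collapse", "9.07", "4.42"),
]


def return_gap(classical_pct: str, llm_pct: str) -> Decimal:
    """LLM return minus best classical return, in percentage points."""
    return Decimal(llm_pct) - Decimal(classical_pct)


def llm_wins(gap: Decimal) -> bool:
    """Assumed rule: only a strictly positive gap counts as a win."""
    return gap > 0


for scenario, classical, llm in ROWS:
    gap = return_gap(classical, llm)
    verdict = "yes" if llm_wins(gap) else "no"
    print(f"{scenario}: gap={gap:+.2f} pp, LLM wins? {verdict}")
```

`Decimal` is used so that two-decimal percentages subtract exactly instead of picking up binary floating-point noise.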

Classical Baseline Aggregate

Buy-and-hold, equal weight, random, always-hold, momentum, mean reversion, risk parity, minimum variance, and Markowitz/MVO across the benchmark scenarios.

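| Universe | Baseline | Scenarios | Avg return | Worst DD | Avg Sharpe | Avg fill | Rejected | Risk edits |
|---|---|---|---|---|---|---|---|---|
| real_market | Risk parity | 2 | 7.61% | -4.67% | 4.636 | 86.11% | 4 | 0 |
| real_market | Minimum variance | 2 | 6.12% | -5.99% | 3.667 | 84.17% | 5 | 0 |
| real_market | Markowitz MVO | 2 | 5.07% | -5.15% | 3.272 | 79.00% | 7 | 0 |
| real_market | Buy and hold | 2 | 2.53% | -16.87% | 2.269 | 78.89% | 9 | 0 |
| real_market | Equal weight | 2 | 2.41% | -16.89% | 2.224 | 83.33% | 6 | 0 |
| real_market | Always hold | 2 | 0.00% | 0.00% | 0.000 | 0.00% | 0 | 0 |
| real_market | Random | 2 | -0.02% | -11.16% | 0.725 | 71.43% | 14 | 0 |
| real_market | Mean reversion | 2 | -2.52% | -9.84% | -0.546 | 75.00% | 6 | 0 |
| real_market | Naive momentum | 2 | -6.38% | -15.38% | -1.698 | 67.18% | 12 | 0 |
| synthetic | Buy and hold | 6 | 3.00% | -2.03% | 6.631 | 69.58% | 11 | 96 |
| synthetic | Minimum variance | 6 | 2.72% | -3.82% | 4.885 | 71.88% | 9 | 0 |
| synthetic | Risk parity | 6 | 1.85% | -3.42% | 3.759 | 71.88% | 9 | 0 |
| synthetic | Equal weight | 6 | 1.76% | -3.26% | 3.815 | 70.62% | 10 | 0 |
| synthetic | Markowitz MVO | 6 | 1.69% | -3.60% | 2.854 | 65.35% | 15 | 0 |
| synthetic | Naive momentum | 6 | 0.75% | -3.81% | 3.510 | 67.41% | 5 | 0 |
| synthetic | Random | 6 | 0.46% | -3.92% | 0.354 | 67.43% | 11 | 0 |
| synthetic | Mean reversion | 6 | 0.15% | -5.21% | 1.670 | 74.44% | 3 | 0 |
| synthetic | Always hold | 6 | 0.00% | 0.00% | 0.000 | 0.00% | 0 | 0 |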

Decision Quality vs Execution Quality

A three-axis decomposition separates pre-risk intent, risk discipline, and execution robustness.

Figure: Decision Quality vs Execution Quality. Scores are bounded diagnostics in [0, 1]; higher is better. Axes: alpha quality, risk discipline, execution robustness. Benchmark families: LLM synthetic, LLM real-market, Classical synthetic, Classical real-market. Alpha uses proposed target weights before risk edits and execution costs. Risk and execution axes summarize interventions and fill realism.

| Family | Rows | Alpha | Risk | Execution | Pre-risk alpha | Realized return | Fill rate |
|---|---|---|---|---|---|---|---|
| LLM synthetic | 102 | 0.623 | 0.653 | 0.778 | 2.89% | 0.88% | 48.37% |
| LLM real-market | 90 | 0.489 | 0.412 | 0.687 | 0.48% | -4.38% | 65.10% |
| Classical synthetic | 54 | 0.728 | 0.569 | 0.747 | 3.41% | 1.37% | 62.07% |
| Classical real-market | 18 | 0.628 | 0.394 | 0.751 | 3.84% | 1.65% | 69.46% |

Execution Calibration Evidence

Rows here are evidence for calibration plumbing; default benchmark rows remain realistic-stress unless they attach quote/order-book/fill provenance.

| Evidence | Aligned fills | Median spread | P90 spread | Median shortfall | P90 shortfall | Stress MAE | Calibrated MAE |
|---|---|---|---|---|---|---|---|
| fixture | 8 | 0.953 bps | 1.364 bps | 1.467 bps | 1.646 bps | 3.533 bps | 0.447 bps |
| public Binance BTCUSDT perpetual sample | 500 | 0.016 bps | 0.016 bps | 0.008 bps | 1.659 bps | 3.163 bps | 0.908 bps |
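
This card does not spell out how the two MAE columns are computed. One common reading, assumed here, is the mean absolute error between a cost model's predicted shortfall and the realized shortfall of each aligned fill, in basis points. The sketch below illustrates that reading with made-up numbers; `mae_bps` and all sample values are hypothetical and are not taken from the TradeArena code or data.

```python
from statistics import fmean


def mae_bps(predicted_bps, realized_bps):
    """Mean absolute error, in bps, between predicted and realized shortfall."""
    if len(predicted_bps) != len(realized_bps) or not predicted_bps:
        raise ValueError("inputs must be non-empty and of equal length")
    return fmean(abs(p - r) for p, r in zip(predicted_bps, realized_bps))


# Hypothetical aligned fills: realized shortfall per fill, in bps.
realized = [1.0, 2.0, 3.0, 2.0]
# Hypothetical predictions from a conservative stress model and a calibrated one.
stress_prediction = [4.0, 5.0, 6.0, 5.0]
calibrated_prediction = [1.5, 2.5, 2.5, 2.0]

stress_mae = mae_bps(stress_prediction, realized)
calibrated_mae = mae_bps(calibrated_prediction, realized)
print(f"stress MAE: {stress_mae:.3f} bps, calibrated MAE: {calibrated_mae:.3f} bps")
```

Under this reading, a calibrated MAE well below the stress MAE indicates that the calibration step moves predictions closer to observed fills, which is the pattern both evidence rows show.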

Key Result 1: Risk Gates Are Active, Not Cosmetic

Risk gates repeatedly edit or clip intended allocations before execution.

The benchmark reports risk edits alongside return so that risk control is visible in the result card.
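
To make "edit or clip" concrete, here is a minimal sketch of a per-asset weight cap that counts how many intended weights it changes. The function name, the cap values, and the edit-counting rule are all assumptions for illustration; the actual TradeArena gates may apply different or additional rules.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GateResult:
    weights: dict[str, float]  # weights after the gate
    edits: int                 # number of interventions the gate made


def apply_risk_gate(
    target: dict[str, float],
    max_weight: float = 0.25,  # hypothetical per-asset cap
    max_gross: float = 1.0,    # hypothetical gross-exposure cap
) -> GateResult:
    """Clip each weight to +/- max_weight, then rescale if gross exposure is too high."""
    edits = 0
    clipped: dict[str, float] = {}
    for asset, weight in target.items():
        capped = max(-max_weight, min(max_weight, weight))
        if capped != weight:
            edits += 1  # one edit per clipped asset
        clipped[asset] = capped

    gross = sum(abs(w) for w in clipped.values())
    if gross > max_gross:
        scale = max_gross / gross
        clipped = {asset: w * scale for asset, w in clipped.items()}
        edits += 1  # one edit for the portfolio-level rescale

    return GateResult(weights=clipped, edits=edits)


intended = {"AAA": 0.50, "BBB": 0.05, "CCC": -0.20}
result = apply_risk_gate(intended)
print(result.weights, result.edits)
```

Reporting `edits` next to return shows how much of an agent's intended allocation actually reached execution.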

Crisis-Scene LLM Benchmark

Timestamp-masked 2022 Tech/Rates and 2023 SVB stress paths, averaged by feedback mode.

| Scenario | Agent / baseline | Return | Max drawdown | Fill rate | Rejection rate | Risk edits | Audit completeness |
|---|---|---|---|---|---|---|---|
| svb_2023 | LLM policies (hidden feedback) | 1.23% | -1.86% | 75.59% | 24.41% | 212 | 100.00% |
| svb_2023 | LLM policies (placebo feedback) | 1.19% | -1.87% | 76.88% | 23.12% | 204 | 100.00% |
| svb_2023 | LLM policies (true feedback) | 1.08% | -1.87% | 78.16% | 21.84% | 196 | 100.00% |
| tech_rates_2022 | LLM policies (hidden feedback) | -3.27% | -4.89% | 65.93% | 34.07% | 914 | 100.00% |
| tech_rates_2022 | LLM policies (placebo feedback) | -2.57% | -4.40% | 63.10% | 36.90% | 906 | 100.00% |
| tech_rates_2022 | LLM policies (true feedback) | -3.09% | -4.81% | 64.33% | 35.67% | 778 | 100.00% |

True-Feedback Model Rows

Policy-level rows under structured true risk feedback. Model names are redacted or normalized labels; raw provider prompts and responses are not shipped.

| Scenario | Policy label | Return | Max drawdown | Fill rate | Risk edits | Violations | Calibration |
|---|---|---|---|---|---|---|---|
| svb_2023 | frontier-policy-D (redacted) | 1.03% | -1.87% | 80.16% | 177 | 13 | 0.198 |
| svb_2023 | frontier-policy-B (redacted) | 1.24% | -1.89% | 77.37% | 206 | 14 | 0.207 |
| svb_2023 | frontier-policy-C (redacted) | 0.67% | -1.88% | 78.52% | 196 | 14 | 0.397 |
| svb_2023 | frontier-policy-A (redacted) | 1.39% | -1.86% | 76.60% | 203 | 14 | 0.224 |
| tech_rates_2022 | frontier-policy-D (redacted) | -1.84% | -4.79% | 62.62% | 863 | 39 | 0.071 |
| tech_rates_2022 | frontier-policy-B (redacted) | -2.49% | -4.53% | 63.90% | 969 | 41 | 0.067 |
| tech_rates_2022 | frontier-policy-C (redacted) | -5.32% | -5.32% | 69.10% | 402 | 226 | 0.435 |
| tech_rates_2022 | frontier-policy-A (redacted) | -2.72% | -4.61% | 61.69% | 880 | 38 | 0.090 |
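
The svb_2023 true-feedback row of the Crisis-Scene table (return 1.08%, fill rate 78.16%, 196 risk edits) appears to be an unweighted mean of the four svb_2023 policy rows above. The sketch below recomputes that mean from the published values. The unweighted rule is an inference from the numbers, not a documented aggregation method.

```python
from statistics import fmean

# (return_pct, fill_rate_pct, risk_edits) for the svb_2023 true-feedback policy rows.
SVB_TRUE_FEEDBACK = {
    "frontier-policy-A": (1.39, 76.60, 203),
    "frontier-policy-B": (1.24, 77.37, 206),
    "frontier-policy-C": (0.67, 78.52, 196),
    "frontier-policy-D": (1.03, 80.16, 177),
}


def column_means(rows):
    """Unweighted mean of each metric column across policies."""
    columns = zip(*rows.values())
    return tuple(fmean(column) for column in columns)


mean_return, mean_fill, mean_edits = column_means(SVB_TRUE_FEEDBACK)
print(f"return={mean_return:.4f}%  fill={mean_fill:.4f}%  risk_edits={mean_edits:.1f}")
```

The recomputed means agree with the aggregate row to within rounding, which is a quick way to confirm that a feedback-mode row and its policy rows come from the same snapshot.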

Key Result 2: Execution Assumptions Change Realized Exposure

The 51-stock hourly probe separates intended allocation from realistic execution outcomes.

Fill rate, rejected orders, latency, and slippage become part of the benchmark outcome.
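
As an illustration of how such outcome metrics can be derived from an order log, the sketch below computes a quantity-weighted fill rate and an order-count rejection rate. The `Order` record and both definitions are assumptions for this example; the card does not state the exact formulas TradeArena uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    requested: float        # quantity the agent intended to trade
    filled: float           # quantity actually executed
    rejected: bool = False  # True if the order was refused outright


def fill_rate(orders):
    """Filled quantity divided by requested quantity across all orders."""
    requested = sum(o.requested for o in orders)
    return sum(o.filled for o in orders) / requested if requested else 0.0


def rejection_rate(orders):
    """Share of orders that were rejected, by order count."""
    return sum(o.rejected for o in orders) / len(orders) if orders else 0.0


# Hypothetical order log: one partial fill and one rejection.
orders = [
    Order(requested=100, filled=100),
    Order(requested=100, filled=60),
    Order(requested=50, filled=0, rejected=True),
    Order(requested=150, filled=150),
]
print(f"fill rate: {fill_rate(orders):.2%}, rejection rate: {rejection_rate(orders):.2%}")
```

With these two definitions the rates use different denominators (quantity versus order count), so they need not sum to 100%.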

51-Stock Intraday Portfolio Probe

Passive, deterministic, Markowitz/MVO, execution-stress, and redacted LLM policy rows on a 51-stock hourly panel.

| Agent / baseline | Return | Max drawdown | Fill rate | Rejected | Risk edits | Herfindahl | Audit completeness |
|---|---|---|---|---|---|---|---|
| Buy and Hold | 0.71% | -0.79% | 87.16% | 98 | 2040 | 0.019 | 100.00% |
| Deterministic Risk-Aware | -1.35% | -3.11% | 76.26% | 91 | 672 | 0.062 | 100.00% |
| Markowitz MVO | -0.54% | -1.35% | 87.94% | 185 | 357 | 0.023 | 100.00% |
| Low-Liquidity Stress | -2.16% | -3.60% | 75.55% | 95 | 672 | 0.062 | 100.00% |
| Latency Stress | 1.96% | -1.33% | 44.40% | 209 | 672 | 0.081 | 100.00% |
| Frontier Policy A (redacted) | -2.23% | -2.93% | 71.86% | 378 | 2924 | 0.045 | 100.00% |
| Frontier Policy C (redacted) | -0.53% | -2.31% | 63.60% | 254 | 1200 | 0.035 | 100.00% |
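
The Herfindahl column summarizes portfolio concentration. A minimal sketch, assuming the standard definition (the sum of squared normalized absolute weights), is shown below. Whether TradeArena normalizes by gross exposure or averages over time is not stated on this card, so treat those details as assumptions.

```python
def herfindahl(weights):
    """Sum of squared weight shares; 1/N for N equal weights, 1.0 for a single position."""
    gross = sum(abs(w) for w in weights)
    if gross == 0:
        return 0.0  # no positions, so no concentration to report
    return sum((abs(w) / gross) ** 2 for w in weights)


equal_51 = herfindahl([1.0] * 51)                 # equal weight across 51 names
concentrated = herfindahl([0.5] + [0.01] * 50)    # one dominant position
print(f"equal-weight 51 names: {equal_51:.4f}")
print(f"one dominant position: {concentrated:.4f}")
```

For a 51-name panel the index is bounded below by 1/51, roughly 0.0196, so values near that floor indicate a nearly equal-weight book and larger values indicate concentration.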

Key Result 3: Audit Completeness Is A Benchmark Dimension

Each row should be traceable to a trajectory, not just a return curve.

TradeArena keeps compact result snapshots and redacted manifests so users can inspect what happened without shipping raw provider text.
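
One way to make a trajectory checkable without shipping raw provider text is to hash a canonical serialization of each redacted step and chain the digests. The sketch below shows that idea with the standard library. The record fields and the chaining scheme are hypothetical and are not the TradeArena manifest format.

```python
import hashlib
import json


def step_digest(step: dict, prev_digest: str = "") -> str:
    """SHA-256 over the previous digest plus a canonical JSON encoding of the step."""
    payload = json.dumps(step, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_digest + payload).encode("utf-8")).hexdigest()


def run_digest(steps) -> str:
    """Chain step digests so that changing any step changes the final hash."""
    digest = ""
    for step in steps:
        digest = step_digest(step, digest)
    return digest


# Hypothetical redacted trajectory records.
trajectory = [
    {"t": 0, "intent": {"AAA": 0.50}, "after_risk": {"AAA": 0.25}, "fills": {"AAA": 0.25}},
    {"t": 1, "intent": {"AAA": 0.10}, "after_risk": {"AAA": 0.10}, "fills": {"AAA": 0.06}},
]
print(run_digest(trajectory))
```

Sorting keys before hashing keeps the digest stable across dictionary orderings, so two runs that produce the same records yield the same hash.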

Representation Robustness Snapshot

80 rolling failure anchors and 320 pre-failure steps across eight LLM trajectories.

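| Embedding | View | Anchors | Pre-failure steps | Mean rank delta | Contraction rate | Mean pre-shift |
|---|---|---|---|---|---|---|
| hash64 | fused | 80 | 320 | 0.471 | 67.50% | 0.071 |
| lsa32 | fused | 80 | 320 | 5.123 | 86.25% | 0.084 |
| hash64 | plan | 80 | 320 | 8.703 | 97.50% | 0.122 |
| lsa32 | plan | 80 | 320 | 4.870 | 85.00% | 0.097 |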

Limitations

This page is a benchmark and audit artifact, not financial advice or a live-trading guarantee. First-run reproduction uses tracked artifacts, and public policy rows use redacted or normalized labels. Raw provider prompts, responses, credentials, and caches are not shipped.