TradeArena v0.1 Benchmark Card

Execution realism and risk gates materially change LLM trading-agent evaluation. This is a compact, citable result page for auditable agent evaluation, not a profitability claim or financial advice.

Result Provenance

Where this benchmark card comes from and how to reproduce it.

| Field | Value |
| --- | --- |
| Software release | v0.1.2 |
| Benchmark lineage | v0.1 snapshot |
| Benchmark card source | `docs/results/` tracked snapshots plus first-run outputs |
| Reproduction command | `python -m pip install -e ".[dev]"`; `python scripts/run_showcase.py`; `python scripts/build_benchmark_page.py` |
| Data | tracked synthetic / timestamp-masked / redacted artifacts under `docs/results/` |
| Live model calls | not required for first-run reproduction |
| Raw prompt/response caches | not included |
| Intended use | benchmark and audit research, not trading advice |

What Is Measured

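The v0.1 card emphasizes audit and execution dimensions, not only return. Alongside return and max drawdown, the benchmark tables report fill rate, rejection rate or rejected-order counts, risk edits, and audit completeness.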

First-Run Execution Benchmark

Deterministic, local benchmark cases generated by the quickstart path.

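| Scenario | Agent / baseline | Return | Max drawdown | Fill rate | Rejection rate | Risk edits | Audit completeness |
| --- | --- | --- | --- | --- | --- | --- | --- |
| deterministic quickstart | buy_and_hold_realistic | 53.74% | -6.63% | 89.83% | 8.90% | 0 | 100.00% |
| deterministic quickstart | risk_aware_realistic | 35.08% | -1.26% | 90.34% | 7.95% | 124 | 100.00% |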

Key Result 1: Risk Gates Are Active, Not Cosmetic

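Risk gates repeatedly edit or clip intended allocations before execution. In the first-run benchmark, risk_aware_realistic records 124 risk edits against 0 for buy_and_hold_realistic, with a shallower max drawdown (-1.26% vs. -6.63%) and a lower return (35.08% vs. 53.74%).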

The benchmark reports risk edits alongside return so that risk control is visible in the result card.
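
To make "edit or clip" concrete, here is a minimal sketch of a risk gate that caps per-position weight and gross exposure, and counts how many positions it changed. The function name `apply_risk_gate`, the `GateResult` type, and the 10% / 100% limits are illustrative assumptions, not TradeArena's actual gate or thresholds.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    weights: dict[str, float]  # allocations after the gate
    edits: int                 # number of positions the gate changed


def apply_risk_gate(intended: dict[str, float],
                    max_weight: float = 0.10,   # hypothetical per-position cap
                    max_gross: float = 1.0      # hypothetical gross-exposure cap
                    ) -> GateResult:
    """Clip each weight to +/- max_weight, then scale down if gross exposure is too high."""
    clipped = {s: max(-max_weight, min(max_weight, w)) for s, w in intended.items()}
    gross = sum(abs(w) for w in clipped.values())
    if gross > max_gross:
        scale = max_gross / gross
        clipped = {s: w * scale for s, w in clipped.items()}
    # A "risk edit" here is any position whose weight differs from what was intended.
    edits = sum(1 for s in intended if abs(clipped[s] - intended[s]) > 1e-12)
    return GateResult(clipped, edits)


intended = {"AAA": 0.25, "BBB": 0.05, "CCC": 0.08}
result = apply_risk_gate(intended)
print(result.weights)  # AAA is clipped to the cap; BBB and CCC pass through
print(result.edits)
```

Counting edits per step and summing over a run would give a figure comparable in spirit to the "Risk edits" column, although the card does not state how its own count is defined.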

Crisis-Scene LLM Benchmark

Timestamp-masked 2022 Tech/Rates and 2023 SVB stress paths, averaged by feedback mode.

| Scenario | Agent / baseline | Return | Max drawdown | Fill rate | Rejection rate | Risk edits | Audit completeness |
| --- | --- | --- | --- | --- | --- | --- | --- |
| svb_2023 | LLM policies (hidden feedback) | 1.23% | -1.86% | 75.59% | 24.41% | 212 | 100.00% |
| svb_2023 | LLM policies (placebo feedback) | 1.19% | -1.87% | 76.88% | 23.12% | 204 | 100.00% |
| svb_2023 | LLM policies (true feedback) | 1.08% | -1.87% | 78.16% | 21.84% | 196 | 100.00% |
| tech_rates_2022 | LLM policies (hidden feedback) | -3.27% | -4.89% | 65.93% | 34.07% | 914 | 100.00% |
| tech_rates_2022 | LLM policies (placebo feedback) | -2.57% | -4.40% | 63.10% | 36.90% | 906 | 100.00% |
| tech_rates_2022 | LLM policies (true feedback) | -3.09% | -4.81% | 64.33% | 35.67% | 778 | 100.00% |

True-Feedback Model Rows

Policy-level rows under structured true risk feedback. Model names are redacted or normalized labels; raw provider prompts and responses are not shipped.

| Scenario | Policy label | Return | Max drawdown | Fill rate | Risk edits | Violations | Calibration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| svb_2023 | frontier-policy-D (redacted) | 1.03% | -1.87% | 80.16% | 177 | 13 | 0.198 |
| svb_2023 | frontier-policy-B (redacted) | 1.24% | -1.89% | 77.37% | 206 | 14 | 0.207 |
| svb_2023 | frontier-policy-C (redacted) | 0.67% | -1.88% | 78.52% | 196 | 14 | 0.397 |
| svb_2023 | frontier-policy-A (redacted) | 1.39% | -1.86% | 76.60% | 203 | 14 | 0.224 |
| tech_rates_2022 | frontier-policy-D (redacted) | -1.84% | -4.79% | 62.62% | 863 | 39 | 0.071 |
| tech_rates_2022 | frontier-policy-B (redacted) | -2.49% | -4.53% | 63.90% | 969 | 41 | 0.067 |
| tech_rates_2022 | frontier-policy-C (redacted) | -5.32% | -5.32% | 69.10% | 402 | 226 | 0.435 |
| tech_rates_2022 | frontier-policy-A (redacted) | -2.72% | -4.61% | 61.69% | 880 | 38 | 0.090 |
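
As a quick consistency check, the true-feedback rows in the Crisis-Scene table can be recovered by averaging the four policy rows per scenario. The sketch below does this for return and fill rate, using values copied from the rows above. Treating the aggregate as a plain unweighted mean is an assumption that happens to match the published figures.

```python
from statistics import mean

# (return %, fill rate %) per redacted policy, in the order D, B, C, A
policy_rows = {
    "svb_2023": [(1.03, 80.16), (1.24, 77.37), (0.67, 78.52), (1.39, 76.60)],
    "tech_rates_2022": [(-1.84, 62.62), (-2.49, 63.90), (-5.32, 69.10), (-2.72, 61.69)],
}

averages = {
    scenario: (
        round(mean(ret for ret, _ in rows), 2),    # mean return, %
        round(mean(fill for _, fill in rows), 2),  # mean fill rate, %
    )
    for scenario, rows in policy_rows.items()
}

for scenario, (ret, fill) in averages.items():
    print(f"{scenario}: return {ret:.2f}%, fill rate {fill:.2f}%")
# svb_2023: return 1.08%, fill rate 78.16%
# tech_rates_2022: return -3.09%, fill rate 64.33%
```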

Key Result 2: Execution Assumptions Change Realized Exposure

The 51-stock hourly probe separates intended allocation from realistic execution outcomes: under Latency Stress the fill rate drops to 44.40%, against 76.26% for Deterministic Risk-Aware.

Fill rate, rejected orders, latency, and slippage become part of the benchmark outcome.
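
One way to picture how these quantities enter a result row is to summarize a list of order records. The sketch below is illustrative only: the `Order` fields, the count-based definitions of fill and rejection rate, and the sample orders are assumptions, not TradeArena's schema or data. The card itself does not define these metrics; note that fill rate and rejection rate sum to 100% in the crisis-scene rows but not in the quickstart rows, so the actual definitions may treat partial or expired orders differently.

```python
from dataclasses import dataclass


@dataclass
class Order:
    symbol: str
    intended_qty: float    # signed shares the policy wanted (+ buy, - sell)
    filled_qty: float      # signed shares actually executed (0.0 if rejected)
    intended_price: float  # price assumed when the order was planned
    fill_price: float      # price actually obtained (ignored if nothing filled)
    rejected: bool = False


def execution_summary(orders: list[Order]) -> dict[str, float]:
    """Summarize how far realized execution drifted from intent."""
    if not orders:
        raise ValueError("need at least one order")
    n = len(orders)
    filled = [o for o in orders if not o.rejected and o.filled_qty != 0]
    intended_notional = sum(abs(o.intended_qty) * o.intended_price for o in orders)
    realized_notional = sum(abs(o.filled_qty) * o.fill_price for o in filled)
    # Signed slippage in basis points: positive means a worse price than planned.
    slippage_bps = [
        (1 if o.filled_qty > 0 else -1)
        * (o.fill_price - o.intended_price) / o.intended_price * 1e4
        for o in filled
    ]
    return {
        "fill_rate": len(filled) / n,
        "rejection_rate": sum(o.rejected for o in orders) / n,
        "realized_over_intended": realized_notional / intended_notional,
        "mean_slippage_bps": sum(slippage_bps) / len(slippage_bps) if slippage_bps else 0.0,
    }


orders = [
    Order("AAA", 100, 100, 10.0, 10.02),           # full fill, paid slightly more
    Order("BBB", 50, 0, 20.0, 0.0, rejected=True),  # rejected outright
    Order("CCC", -40, -20, 5.0, 4.99),              # partial fill on a sell
    Order("DDD", 10, 10, 100.0, 100.0),             # full fill at the planned price
]
summary = execution_summary(orders)
print(summary)
```

In this toy run three of four orders fill, yet realized notional is only about two thirds of what was intended, which is the kind of gap between intent and exposure that a return-only metric would hide.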

51-Stock Intraday Portfolio Probe

Passive, deterministic, Markowitz/MVO, execution-stress, and redacted LLM policy rows on a 51-stock hourly panel.

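| Agent / baseline | Return | Max drawdown | Fill rate | Rejected | Risk edits | Herfindahl | Audit completeness |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Buy and Hold | 0.71% | -0.79% | 87.16% | 98 | 2040 | 0.019 | 100.00% |
| Deterministic Risk-Aware | -1.35% | -3.11% | 76.26% | 91 | 672 | 0.062 | 100.00% |
| Markowitz MVO | -0.54% | -1.35% | 87.94% | 185 | 357 | 0.023 | 100.00% |
| Low-Liquidity Stress | -2.16% | -3.60% | 75.55% | 95 | 672 | 0.062 | 100.00% |
| Latency Stress | 1.96% | -1.33% | 44.40% | 209 | 672 | 0.081 | 100.00% |
| Frontier Policy A (redacted) | -2.23% | -2.93% | 71.86% | 378 | 2924 | 0.045 | 100.00% |
| Frontier Policy C (redacted) | -0.53% | -2.31% | 63.60% | 254 | 1200 | 0.035 | 100.00% |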
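
The Herfindahl column is a concentration measure. A minimal sketch of the standard Herfindahl index (sum of squared portfolio weights) follows. Normalizing by gross exposure is an assumption here; the card does not state how its weights are normalized, so table values need not coincide with this calculation.

```python
def herfindahl(weights: list[float]) -> float:
    """Sum of squared weights after normalizing by gross exposure."""
    gross = sum(abs(w) for w in weights)
    if gross == 0:
        return 0.0
    return sum((abs(w) / gross) ** 2 for w in weights)


# Equal weights across 51 names give the minimum possible value, 1/51 (about 0.0196).
h_equal = herfindahl([1 / 51] * 51)

# Half the book in one name and the rest spread evenly is far more concentrated.
h_concentrated = herfindahl([0.5] + [0.5 / 50] * 50)

print(round(h_equal, 4), round(h_concentrated, 4))  # 0.0196 0.255
```

Lower values indicate a more diversified book, so rows with higher Herfindahl figures hold more concentrated realized positions.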

Key Result 3: Audit Completeness Is A Benchmark Dimension

Each row should be traceable to a trajectory, not just a return curve.

TradeArena keeps compact result snapshots and redacted manifests so users can inspect what happened without shipping raw provider text.
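
As an illustration of how audit completeness could be scored, the sketch below treats a trajectory as a list of per-step records and reports the share of steps that carry every required field. The field names in `REQUIRED_FIELDS` and the record layout are hypothetical; the card does not specify TradeArena's trajectory schema or its exact completeness definition.

```python
# Hypothetical fields a step record must carry to count as fully auditable.
REQUIRED_FIELDS = ("step", "intended", "risk_edit", "order", "fill")


def audit_completeness(trajectory: list[dict]) -> float:
    """Fraction of step records in which every required field is present and not None."""
    if not trajectory:
        return 0.0
    complete = sum(
        1 for record in trajectory
        if all(record.get(field) is not None for field in REQUIRED_FIELDS)
    )
    return complete / len(trajectory)


trajectory = [
    {"step": 0, "intended": {"AAA": 0.2}, "risk_edit": "clip", "order": "buy", "fill": 0.1},
    {"step": 1, "intended": {"AAA": 0.1}, "risk_edit": "none", "order": "hold", "fill": 0.0},
    {"step": 2, "intended": {"AAA": 0.3}, "risk_edit": "clip", "order": "buy"},  # fill missing
]
score = audit_completeness(trajectory)
print(f"{score:.2%}")  # 66.67%
```

Under this reading, a row reporting 100.00% would mean that no step in its trajectory is missing a required record.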

Representation Robustness Snapshot

80 rolling failure anchors and 320 pre-failure steps across eight LLM trajectories.

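| Embedding | View | Anchors | Pre-failure steps | Mean rank delta | Contraction rate | Mean pre-shift |
| --- | --- | --- | --- | --- | --- | --- |
| hash64 | fused | 80 | 320 | 0.471 | 67.50% | 0.071 |
| lsa32 | fused | 80 | 320 | 5.123 | 86.25% | 0.084 |
| hash64 | plan | 80 | 320 | 8.703 | 97.50% | 0.122 |
| lsa32 | plan | 80 | 320 | 4.870 | 85.00% | 0.097 |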
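
A small arithmetic check helps in reading these rows. If the contraction rate is the share of the 80 anchors that show contraction (an assumption; the card does not define the metric), each reported rate corresponds to a whole number of anchors, and the 320 pre-failure steps work out to 4 steps per anchor.

```python
from decimal import Decimal

ANCHORS = 80
PRE_FAILURE_STEPS = 320

# Contraction rates (%) copied from the snapshot rows, keyed by (embedding, view).
contraction_rate = {
    ("hash64", "fused"): Decimal("67.50"),
    ("lsa32", "fused"): Decimal("86.25"),
    ("hash64", "plan"): Decimal("97.50"),
    ("lsa32", "plan"): Decimal("85.00"),
}

# Implied anchor counts under the share-of-anchors assumption.
implied_anchors = {key: rate * ANCHORS / 100 for key, rate in contraction_rate.items()}
steps_per_anchor = PRE_FAILURE_STEPS // ANCHORS

for key, count in implied_anchors.items():
    print(key, int(count))  # 54, 69, 78, 68
print(steps_per_anchor)     # 4
```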

Limitations

This page is a benchmark and audit artifact, not financial advice or a live-trading guarantee. First-run reproduction uses tracked artifacts, and public policy rows use redacted or normalized labels. Raw provider prompts, responses, credentials, and caches are not shipped.