TradeArena v0.1 Benchmark Card
Execution realism and risk gates materially change LLM trading-agent evaluation. This is a compact, citable result page for auditable agent evaluation, not a profitability claim or financial advice.
Result Provenance
Where this benchmark card comes from and how to reproduce it.
| Field | Value |
|---|---|
| Software release | v0.1.2 |
| Benchmark lineage | v0.1 snapshot |
| Benchmark card source | `docs/results/` tracked snapshots plus first-run outputs |
| Reproduction commands (run in order) | `python -m pip install -e ".[dev]"`; `python scripts/run_showcase.py`; `python scripts/build_benchmark_page.py` |
| Data | tracked synthetic / timestamp-masked / redacted artifacts under `docs/results/` |
| Live model calls | not required for first-run reproduction |
| Raw prompt/response caches | not included |
| Intended use | benchmark and audit research, not trading advice |
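
The three reproduction commands can also be chained from a small Python driver. This is a minimal sketch, not part of the repository: `REPRO_STEPS` and `reproduce` are hypothetical names, and it assumes the current directory is the repository root and that `sys.executable` is the interpreter meant by `python` in the table.

```python
import subprocess
import sys

# The three steps from the provenance table, in the listed order.
# Assumption: the working directory is the repository root.
REPRO_STEPS = [
    [sys.executable, "-m", "pip", "install", "-e", ".[dev]"],
    [sys.executable, "scripts/run_showcase.py"],
    [sys.executable, "scripts/build_benchmark_page.py"],
]


def reproduce(steps=REPRO_STEPS, runner=subprocess.run):
    """Run each step in order and stop at the first failure.

    `runner` is injectable so the sequence can be dry-run without
    installing anything.
    """
    for cmd in steps:
        # check=True raises CalledProcessError on a non-zero exit code.
        runner(cmd, check=True)
    return len(steps)
```

Calling `reproduce()` runs the real commands. Passing a stub as `runner` only records what would be executed.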
What Is Measured
The v0.1 card emphasizes audit and execution dimensions, not only return.
- Return and max drawdown.
- Fill rate, rejection rate, spread, latency, slippage, and partial fills.
- Risk edits, clipped decisions, violations, and audit completeness.
- Portfolio concentration (Herfindahl index) for portfolio probes.
- Calibration and representation robustness diagnostics.
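
For readers who want to recompute the headline metrics from a trajectory, the sketch below uses the common textbook definitions of total return, max drawdown, and the Herfindahl index. It is an assumption that TradeArena uses exactly these formulas, and the function names are illustrative rather than taken from the codebase. Max drawdown is returned as a non-positive fraction, matching the sign convention in the tables on this page.

```python
def total_return(equity):
    """Fractional change from the first to the last equity value."""
    return equity[-1] / equity[0] - 1.0


def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction <= 0."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)          # running high-water mark
        worst = min(worst, value / peak - 1.0)
    return worst


def herfindahl(weights):
    """Sum of squared portfolio weights; higher means more concentrated."""
    return sum(w * w for w in weights)


equity_curve = [100.0, 110.0, 99.0, 120.0]   # made-up illustration, not benchmark data
print(total_return(equity_curve))             # gain from 100 to 120
print(max_drawdown(equity_curve))             # drop from the 110 peak to 99
print(herfindahl([1 / 51] * 51))              # equal weights across 51 names
```

Under this definition an equal-weight, fully invested portfolio of N names has a Herfindahl index of 1/N, which gives a reference point for reading the concentration column.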
First-Run Execution Benchmark
Deterministic, local benchmark cases generated by the quickstart path.
| Scenario | Agent / baseline | Return | Max drawdown | Fill rate | Rejection rate | Risk edits | Audit completeness |
|---|---|---|---|---|---|---|---|
| deterministic quickstart | buy_and_hold_realistic | 53.74% | -6.63% | 89.83% | 8.90% | 0 | 100.00% |
| deterministic quickstart | risk_aware_realistic | 35.08% | -1.26% | 90.34% | 7.95% | 124 | 100.00% |
Key Result 1: Risk Gates Are Active, Not Cosmetic
Risk gates repeatedly edit or clip intended allocations before execution.
The benchmark reports risk edits alongside return so that risk control is visible in the result card.
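
To make "edit or clip" concrete, here is a minimal sketch of one possible gate: a per-asset weight cap that counts how many positions it changed. The function `apply_risk_gate`, the `max_weight` parameter, and the 10% cap are hypothetical illustrations. The actual TradeArena gates and their thresholds are not described on this page.

```python
def apply_risk_gate(intended, max_weight=0.10):
    """Clip each intended weight to [-max_weight, max_weight].

    Returns the gated weights and the number of positions that were edited.
    Hypothetical gate for illustration only.
    """
    gated = {}
    edits = 0
    for symbol, weight in intended.items():
        clipped = max(-max_weight, min(max_weight, weight))
        if clipped != weight:
            edits += 1                   # this position was changed by the gate
        gated[symbol] = clipped
    return gated, edits


# Made-up intended allocation: two of three positions exceed the cap.
intended = {"AAA": 0.25, "BBB": 0.05, "CCC": -0.30}
gated, edits = apply_risk_gate(intended)
print(gated, edits)
```

Counting edits at the gate, rather than inferring them from returns, is what lets a result card show risk control as its own column.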
Crisis-Scene LLM Benchmark
Timestamp-masked 2022 Tech/Rates and 2023 SVB stress paths. Each row averages the LLM policies within one feedback mode (hidden, placebo, or true).
| Scenario | Agent / baseline | Return | Max drawdown | Fill rate | Rejection rate | Risk edits | Audit completeness |
|---|---|---|---|---|---|---|---|
| svb_2023 | LLM policies (hidden feedback) | 1.23% | -1.86% | 75.59% | 24.41% | 212 | 100.00% |
| svb_2023 | LLM policies (placebo feedback) | 1.19% | -1.87% | 76.88% | 23.12% | 204 | 100.00% |
| svb_2023 | LLM policies (true feedback) | 1.08% | -1.87% | 78.16% | 21.84% | 196 | 100.00% |
| tech_rates_2022 | LLM policies (hidden feedback) | -3.27% | -4.89% | 65.93% | 34.07% | 914 | 100.00% |
| tech_rates_2022 | LLM policies (placebo feedback) | -2.57% | -4.40% | 63.10% | 36.90% | 906 | 100.00% |
| tech_rates_2022 | LLM policies (true feedback) | -3.09% | -4.81% | 64.33% | 35.67% | 778 | 100.00% |
True-Feedback Model Rows
Policy-level rows under structured true risk feedback. Model names are redacted or normalized labels; raw provider prompts and responses are not shipped.
| Scenario | Policy label | Return | Max drawdown | Fill rate | Risk edits | Violations | Calibration |
|---|---|---|---|---|---|---|---|
| svb_2023 | frontier-policy-D (redacted) | 1.03% | -1.87% | 80.16% | 177 | 13 | 0.198 |
| svb_2023 | frontier-policy-B (redacted) | 1.24% | -1.89% | 77.37% | 206 | 14 | 0.207 |
| svb_2023 | frontier-policy-C (redacted) | 0.67% | -1.88% | 78.52% | 196 | 14 | 0.397 |
| svb_2023 | frontier-policy-A (redacted) | 1.39% | -1.86% | 76.60% | 203 | 14 | 0.224 |
| tech_rates_2022 | frontier-policy-D (redacted) | -1.84% | -4.79% | 62.62% | 863 | 39 | 0.071 |
| tech_rates_2022 | frontier-policy-B (redacted) | -2.49% | -4.53% | 63.90% | 969 | 41 | 0.067 |
| tech_rates_2022 | frontier-policy-C (redacted) | -5.32% | -5.32% | 69.10% | 402 | 226 | 0.435 |
| tech_rates_2022 | frontier-policy-A (redacted) | -2.72% | -4.61% | 61.69% | 880 | 38 | 0.090 |
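
The policy-level rows can be cross-checked against the true-feedback rows of the Crisis-Scene table. The sketch below assumes the aggregate is an unweighted mean over the four policies, which is consistent with the published numbers up to rounding. The names `TRUE_FEEDBACK_ROWS`, `REPORTED`, `scenario_means`, and `matches` are illustrative and not part of the repository.

```python
from statistics import mean

# (return %, max drawdown %, fill rate %, risk edits) per policy,
# copied from the True-Feedback Model Rows table.
TRUE_FEEDBACK_ROWS = {
    "svb_2023": [
        (1.03, -1.87, 80.16, 177),
        (1.24, -1.89, 77.37, 206),
        (0.67, -1.88, 78.52, 196),
        (1.39, -1.86, 76.60, 203),
    ],
    "tech_rates_2022": [
        (-1.84, -4.79, 62.62, 863),
        (-2.49, -4.53, 63.90, 969),
        (-5.32, -5.32, 69.10, 402),
        (-2.72, -4.61, 61.69, 880),
    ],
}

# True-feedback aggregates as reported in the Crisis-Scene table.
REPORTED = {
    "svb_2023": (1.08, -1.87, 78.16, 196),
    "tech_rates_2022": (-3.09, -4.81, 64.33, 778),
}


def scenario_means(rows):
    """Column-wise unweighted mean over the policy rows."""
    return tuple(mean(col) for col in zip(*rows))


def matches(rows, reported, pct_tol=0.01, count_tol=1.0):
    """True if every mean is within rounding distance of the reported value."""
    tols = (pct_tol, pct_tol, pct_tol, count_tol)
    return all(
        abs(m - r) <= t for m, r, t in zip(scenario_means(rows), reported, tols)
    )


for scenario, rows in TRUE_FEEDBACK_ROWS.items():
    print(scenario, scenario_means(rows), matches(rows, REPORTED[scenario]))
```

The tolerances allow for the two-decimal rounding of percentages and the integer rounding of risk-edit counts in the published tables.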
Key Result 2: Execution Assumptions Change Realized Exposure
The 51-stock hourly probe separates intended allocation from realistic execution outcomes.
Fill rate, rejected orders, latency, and slippage become part of the benchmark outcome.
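
One way to see how execution outcomes diverge from intent is to summarize order records directly. The sketch below is illustrative only: the `Order` record, its fields, and the count-based definitions of fill, partial, and rejection rates are assumptions, and TradeArena's own definitions may differ (for example, by weighting on quantity or notional).

```python
from dataclasses import dataclass


@dataclass
class Order:
    symbol: str
    requested: float        # quantity the agent intended to trade
    filled: float           # quantity actually executed (0 when rejected)
    rejected: bool = False


def execution_summary(orders):
    """Count-based execution rates plus the fraction of requested quantity filled."""
    if not orders:
        raise ValueError("need at least one order")
    n = len(orders)
    rejected = sum(1 for o in orders if o.rejected)
    full = sum(1 for o in orders if not o.rejected and o.filled == o.requested)
    partial = n - full - rejected
    requested_qty = sum(abs(o.requested) for o in orders)
    filled_qty = sum(abs(o.filled) for o in orders)
    return {
        "fill_rate": full / n,
        "partial_rate": partial / n,
        "rejection_rate": rejected / n,
        "realized_fraction": filled_qty / requested_qty,
    }


# Made-up orders: two full fills, one partial fill, one rejection.
orders = [
    Order("AAA", 100, 100),
    Order("BBB", 100, 40),
    Order("CCC", 50, 0, rejected=True),
    Order("DDD", 50, 50),
]
print(execution_summary(orders))
```

Reporting the realized fraction next to the rates shows how much of the intended exposure survived execution, which is the gap this key result is about.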
51-Stock Intraday Portfolio Probe
Passive, deterministic, Markowitz/MVO, execution-stress, and redacted LLM policy rows on a 51-stock hourly panel.
| Agent / baseline | Return | Max drawdown | Fill rate | Rejected orders | Risk edits | Herfindahl | Audit completeness |
|---|---|---|---|---|---|---|---|
| Buy and Hold | 0.71% | -0.79% | 87.16% | 98 | 2040 | 0.019 | 100.00% |
| Deterministic Risk-Aware | -1.35% | -3.11% | 76.26% | 91 | 672 | 0.062 | 100.00% |
| Markowitz MVO | -0.54% | -1.35% | 87.94% | 185 | 357 | 0.023 | 100.00% |
| Low-Liquidity Stress | -2.16% | -3.60% | 75.55% | 95 | 672 | 0.062 | 100.00% |
| Latency Stress | 1.96% | -1.33% | 44.40% | 209 | 672 | 0.081 | 100.00% |
| Frontier Policy A (redacted) | -2.23% | -2.93% | 71.86% | 378 | 2924 | 0.045 | 100.00% |
| Frontier Policy C (redacted) | -0.53% | -2.31% | 63.60% | 254 | 1200 | 0.035 | 100.00% |
Key Result 3: Audit Completeness Is A Benchmark Dimension
Each row should be traceable to a trajectory, not just a return curve.
TradeArena keeps compact result snapshots and redacted manifests so users can inspect what happened without shipping raw provider text.
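
As an illustration of how an audit-completeness percentage could be computed, the sketch below scores a trajectory by the share of steps that carry every required field. The field names in `REQUIRED_FIELDS` and the function `audit_completeness` are hypothetical. The actual trajectory schema is defined by the repository, not by this page.

```python
# Hypothetical per-step fields; the real schema may differ.
REQUIRED_FIELDS = ("timestamp", "intended", "gated", "executed")


def audit_completeness(trajectory, required=REQUIRED_FIELDS):
    """Fraction of steps in which every required field is present and not None."""
    if not trajectory:
        return 0.0
    complete = sum(
        1
        for step in trajectory
        if all(step.get(field) is not None for field in required)
    )
    return complete / len(trajectory)


# Made-up trajectory: the last step is missing its execution record.
trajectory = [
    {"timestamp": 1, "intended": {"AAA": 0.2}, "gated": {"AAA": 0.1}, "executed": {"AAA": 0.1}},
    {"timestamp": 2, "intended": {"AAA": 0.1}, "gated": {"AAA": 0.1}, "executed": {"AAA": 0.08}},
    {"timestamp": 3, "intended": {"AAA": 0.3}, "gated": {"AAA": 0.1}, "executed": None},
]
print(audit_completeness(trajectory))
```

A value of 1.0 under this definition corresponds to the 100.00% reported in the tables: every step can be traced from intent through gating to execution.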
Representation Robustness Snapshot
80 rolling failure anchors and 320 pre-failure steps across eight LLM trajectories.
| Embedding | View | Anchors | Pre-failure steps | Mean rank delta | Contraction rate | Mean pre-shift |
|---|---|---|---|---|---|---|
| hash64 | fused | 80 | 320 | 0.471 | 67.50% | 0.071 |
| lsa32 | fused | 80 | 320 | 5.123 | 86.25% | 0.084 |
| hash64 | plan | 80 | 320 | 8.703 | 97.50% | 0.122 |
| lsa32 | plan | 80 | 320 | 4.870 | 85.00% | 0.097 |
Limitations
This page is a benchmark and audit artifact, not financial advice or a live-trading guarantee. First-run reproduction uses tracked artifacts, and public policy rows use redacted or normalized labels. Raw provider prompts, responses, credentials, and caches are not shipped.