trialdesignbench¶
TrialDesignBench is a community-driven benchmark for evaluating AI agents in clinical trial design.
Scope¶
The benchmark currently focuses on two core tasks:
- Task 1 (reproduction): Given a Statistical Analysis Plan (SAP) or study protocol, evaluate how accurately AI agents reproduce the trial design in R.
- Task 2 (design generation): Given high-level clinical requirements, evaluate how well AI agents draft new clinical trial designs in R.
Task 1 (reproduction)¶
This baseline implements the workflow for reproducing existing designs:
- Create a local benchmark workspace.
- Convert a SAP/protocol PDF to Mathpix Markdown, with optional LaTeX ZIP output.
- Build the standard TrialDesignBench reproduction prompt.
- Run the prompt against a locally installed Codex SDK/runtime and save the run artifacts.
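The workspace and prompt-building steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's implementation: the directory names (`inputs`, `converted`, `runs`), the function names, and the prompt wording are all hypothetical, and the Mathpix conversion and Codex run are left out.

```python
from pathlib import Path
import tempfile

# Hypothetical workspace layout; the real tool may use different names.
WORKSPACE_DIRS = ("inputs", "converted", "runs")


def init_workspace(root: Path) -> Path:
    """Create the workspace root and its subdirectories (safe to call twice)."""
    for name in WORKSPACE_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def build_reproduction_prompt(document_markdown: str) -> str:
    """Wrap converted SAP/protocol text in an illustrative instruction.

    The wording here is an assumption, not the standard TrialDesignBench prompt.
    """
    return (
        "Reproduce the clinical trial design described in the document "
        "below as runnable R code.\n\n"
        "--- DOCUMENT START ---\n"
        f"{document_markdown.strip()}\n"
        "--- DOCUMENT END ---\n"
    )


def save_prompt(workspace: Path, run_name: str, prompt: str) -> Path:
    """Store the prompt under runs/<run_name>/prompt.md and return its path."""
    run_dir = workspace / "runs" / run_name
    run_dir.mkdir(parents=True, exist_ok=True)
    prompt_path = run_dir / "prompt.md"
    prompt_path.write_text(prompt, encoding="utf-8")
    return prompt_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws = init_workspace(Path(tmp) / "tdb-workspace")
        prompt = build_reproduction_prompt("# Example SAP\nTwo-arm design.")
        print(save_prompt(ws, "example-run", prompt))
```

Keeping each run in its own directory means the prompt and any later artifacts from the same run stay together, which makes runs easier to compare.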
Task 2 (design generation)¶
Under development.
Installation¶
For development:
Quick start¶
- Initialize a workspace:
- Configure API credentials (Mathpix):
- Run the benchmark on a protocol PDF:
For a full explanation of CLI commands, artifacts, and configuration options, see the usage guide and the configuration page.
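As an illustration of the credential step in the quick start, the sketch below reads Mathpix credentials from environment variables and builds the `app_id` / `app_key` request headers that the Mathpix API expects. The variable names `MATHPIX_APP_ID` and `MATHPIX_APP_KEY` and the helper name are assumptions; the project may store credentials differently, so treat the configuration page as authoritative.

```python
import os


def mathpix_headers(env=None) -> dict:
    """Build Mathpix auth headers from environment variables.

    MATHPIX_APP_ID and MATHPIX_APP_KEY are assumed variable names,
    not ones confirmed by the project.
    """
    env = os.environ if env is None else env
    required = ("MATHPIX_APP_ID", "MATHPIX_APP_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of sending a bad request.
        raise RuntimeError("Missing Mathpix credentials: " + ", ".join(missing))
    return {"app_id": env["MATHPIX_APP_ID"], "app_key": env["MATHPIX_APP_KEY"]}
```

Failing before any request is sent makes a missing credential show up as a configuration error, not as an opaque API failure partway through a run.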