trialdesignbench

TrialDesignBench is a community-driven benchmark for evaluating AI agents in clinical trial design.

Scope

The benchmark currently focuses on two core tasks:

Task 1 (Reproduction): Given a Statistical Analysis Plan (SAP) or study protocol, evaluate how accurately AI agents can reproduce the trial design using R.
Task 2 (Design generation): Given high-level clinical requirements, evaluate how well AI agents can draft new clinical trial designs using R.

The current baseline implements the Task 1 workflow for reproducing existing designs:

Create a local benchmark workspace.
Convert a SAP/protocol PDF to Mathpix Markdown, with optional LaTeX ZIP output.
Build the standard TrialDesignBench reproduction prompt.
Run the prompt against a locally installed Codex SDK/runtime and save the run artifacts.

Under development.

uv add trialdesignbench

For development:

git clone https://github.com/BBSW-org/TrialDesignBench.git
cd TrialDesignBench
uv sync

Configure Mathpix API credentials:

uv run tdb configure --workspace tdb-workspace

Run the benchmark on a SAP or protocol PDF:

uv run tdb run path/to/sap.pdf --workspace tdb-workspace --case-id tdb-001
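
To run several documents in one go, the command above can be generated once per PDF. The following is a minimal sketch, not part of TrialDesignBench itself: the helper name `build_run_commands`, the `protocols` directory, and the sequential `tdb-001`, `tdb-002`, ... case-ID scheme are assumptions made for illustration. Only the `tdb run` arguments shown above are used.

```python
import shlex
from pathlib import Path


def build_run_commands(pdf_dir, workspace="tdb-workspace", prefix="tdb"):
    """Build one `uv run tdb run ...` argument list per PDF in pdf_dir.

    Hypothetical helper: PDFs are sorted by name and numbered from 1,
    which is an assumed case-ID convention, not one defined by the project.
    """
    commands = []
    for index, pdf in enumerate(sorted(Path(pdf_dir).glob("*.pdf")), start=1):
        case_id = f"{prefix}-{index:03d}"  # e.g. tdb-001 (assumed scheme)
        commands.append([
            "uv", "run", "tdb", "run", str(pdf),
            "--workspace", workspace,
            "--case-id", case_id,
        ])
    return commands


# Print the commands for review instead of executing them.
for command in build_run_commands("protocols"):
    print(shlex.join(command))
```

Once the printed commands look right, each argument list can be passed to `subprocess.run(command, check=True)` to execute the runs one after another.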

For a full explanation of CLI commands, artifacts, and configuration options, see the usage guide and configuration.