trialdesignbench

TrialDesignBench is a community-driven benchmark for evaluating AI agents in clinical trial design.

Scope

The benchmark currently focuses on two core tasks:

  • Task 1 (reproduction): Given a Statistical Analysis Plan (SAP) or study protocol, evaluate how accurately AI agents can reproduce the trial design using R.
  • Task 2 (design generation): Given high-level clinical requirements, evaluate the ability of AI agents to draft new clinical trial designs using R.

Task 1 (reproduction)

The current baseline implements the following workflow for reproducing existing designs:

  1. Create a local benchmark workspace.
  2. Convert a SAP/protocol PDF to Mathpix Markdown, with optional LaTeX ZIP output.
  3. Build the standard TrialDesignBench reproduction prompt.
  4. Run the prompt against a locally installed Codex SDK/runtime and save the run artifacts.

Task 2 (design generation)

Under development.

Installation

uv add trialdesignbench

For development:

git clone https://github.com/BBSW-org/TrialDesignBench.git
cd TrialDesignBench
uv sync
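
After either install route, it can be useful to confirm that the package is visible in the active environment. The following is a minimal sketch that relies only on the standard library's importlib.metadata; it assumes the distribution name matches the one passed to `uv add` above (trialdesignbench), and the helper name `installed_version` is hypothetical, not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "trialdesignbench") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    The default distribution name is an assumption taken from the
    `uv add trialdesignbench` command in this README.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("trialdesignbench is not installed in this environment")
    else:
        print(f"trialdesignbench {found}")
```

Saved as, say, check_install.py (a hypothetical file name), the script can be run inside the project environment with `uv run python check_install.py`.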

Quick start

  1. Initialize a workspace:
    uv run tdb init tdb-workspace
    
  2. Configure API credentials (Mathpix):
    uv run tdb configure --workspace tdb-workspace
    
  3. Run the benchmark on a SAP or protocol PDF:
    uv run tdb run path/to/sap.pdf --workspace tdb-workspace --case-id tdb-001
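
When several documents need to be processed, the run command from step 3 can be generated once per PDF. The sketch below only builds and prints the commands, using the flags shown above (`--workspace`, `--case-id`). The input directory name `protocols`, the helper names, and the sequential `tdb-NNN` case-ID scheme are assumptions for illustration; the README shows only the single ID `tdb-001` and does not specify a naming rule.

```python
import shlex
from pathlib import Path
from typing import List


def build_run_command(pdf: Path, workspace: str, case_id: str) -> List[str]:
    """Build the argv for one `tdb run` call, mirroring the quick-start step."""
    return [
        "uv", "run", "tdb", "run", str(pdf),
        "--workspace", workspace,
        "--case-id", case_id,
    ]


def plan_batch(
    pdf_dir: Path,
    workspace: str = "tdb-workspace",
    prefix: str = "tdb",
) -> List[List[str]]:
    """Return one command per PDF in pdf_dir, in sorted filename order.

    Case IDs are numbered from 001; this scheme is an assumption,
    not a documented convention of the tool.
    """
    pdfs = sorted(pdf_dir.glob("*.pdf"))
    return [
        build_run_command(pdf, workspace, f"{prefix}-{index:03d}")
        for index, pdf in enumerate(pdfs, start=1)
    ]


if __name__ == "__main__":
    # "protocols" is a hypothetical directory holding the SAP/protocol PDFs.
    for command in plan_batch(Path("protocols")):
        print(shlex.join(command))
```

Printing the commands first makes it easy to review the case IDs before anything runs; each list can then be passed to `subprocess.run(command, check=True)` to execute it.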
    

For a full explanation of CLI commands, artifacts, and configuration options, see the usage guide and the configuration page.