Step 1 Usage¶
Workflow step 1 takes a single statistical analysis plan (SAP) or protocol PDF and produces a local reproduction run directory. The baseline implementation uses Mathpix for PDF OCR and a local Codex SDK/runtime for the agent execution step.
Create a workspace¶
The workspace contains:
- .env for Mathpix and Codex configuration.
- .gitignore that excludes credentials, converted documents, and run outputs.
- converted/ for Mathpix Markdown and metadata.
- runs/ for prompts, Codex responses, and run summaries.
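For orientation, the layout above can be reproduced by hand. The following is a minimal sketch, not the tool's own initialiser: the function name `scaffold_workspace` and the ignore patterns in `GITIGNORE_LINES` are assumptions chosen to match the description, and the real .gitignore may differ.

```python
from pathlib import Path

# Assumption: illustrative ignore patterns covering credentials, converted
# documents, and run outputs. The tool's actual .gitignore may differ.
GITIGNORE_LINES = [".env", "converted/", "runs/"]


def scaffold_workspace(root: Path) -> Path:
    """Create the described workspace layout without overwriting existing files."""
    root.mkdir(parents=True, exist_ok=True)
    (root / "converted").mkdir(exist_ok=True)  # Mathpix Markdown and metadata
    (root / "runs").mkdir(exist_ok=True)       # prompts, responses, summaries

    env_file = root / ".env"
    if not env_file.exists():
        env_file.touch()  # left empty; credentials are configured separately

    gitignore = root / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text("\n".join(GITIGNORE_LINES) + "\n", encoding="utf-8")
    return root
```

Calling `scaffold_workspace(Path("tdb-workspace"))` twice is harmless, because existing files and directories are left untouched.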
Configure credentials¶
This writes MATHPIX_APP_ID, MATHPIX_APP_KEY, CODEX_MODEL, and optionally
CODEX_BIN to tdb-workspace/.env. The default model is gpt-5.5; Codex runs
default to high reasoning effort.
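Before running a conversion it can help to confirm that the .env file holds the expected keys. The sketch below assumes a plain `KEY=value` file format; the helper names `read_env_file` and `missing_keys` are hypothetical and not part of the tool.

```python
from pathlib import Path

# Key names taken from the text above; CODEX_BIN is optional.
REQUIRED_KEYS = ("MATHPIX_APP_ID", "MATHPIX_APP_KEY", "CODEX_MODEL")
OPTIONAL_KEYS = ("CODEX_BIN",)


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Assumption: values may be wrapped in single or double quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

For example, `missing_keys(read_env_file(Path("tdb-workspace/.env")))` returns an empty list when all three required keys are set.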
Convert only¶
Add --save-tex-zip to request Mathpix's LaTeX ZIP conversion in addition to
the Mathpix Markdown text.
TrialDesignBench reuses existing non-empty converted/<pdf-stem>.mmd and
converted/<pdf-stem>.mathpix.json files by default. Add --force when you
need to submit the PDF to Mathpix again. Use --http-timeout to raise the
per-request HTTP timeout for large uploads or slow connections; --timeout
continues to control the overall Mathpix polling deadline.
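The reuse rule and the two timeouts can be pictured roughly as follows. This is a sketch of the described behaviour under stated assumptions, not TrialDesignBench's implementation: `cached_conversion`, `poll_until_done`, the `check_status` callback, and the `"completed"` status string are all hypothetical names introduced for illustration.

```python
import time
from pathlib import Path
from typing import Callable, Optional


def cached_conversion(
    converted_dir: Path, pdf_path: Path, force: bool = False
) -> Optional[tuple[Path, Path]]:
    """Return the cached (.mmd, .mathpix.json) pair, or None if a new upload is needed."""
    if force:  # mirrors --force: always resubmit
        return None
    stem = pdf_path.stem
    mmd = converted_dir / f"{stem}.mmd"
    meta = converted_dir / f"{stem}.mathpix.json"
    for artifact in (mmd, meta):
        # Both files must exist and be non-empty to count as reusable.
        if not artifact.is_file() or artifact.stat().st_size == 0:
            return None
    return mmd, meta


def poll_until_done(
    check_status: Callable[[float], str],
    timeout: float,
    http_timeout: float,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the job completes or the overall deadline passes.

    `timeout` bounds the whole polling loop (like --timeout), while
    `http_timeout` is handed to each individual request (like --http-timeout).
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if check_status(http_timeout) == "completed":
            return True
        sleep(interval)
    return False
```

Keeping the two limits separate means a slow upload can be given more time per request without extending how long the pipeline waits for the conversion as a whole.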
Convert and run Codex¶
The command saves:
- converted/<pdf-stem>.mmd
- converted/<pdf-stem>.mathpix.json
- runs/<case-id>/prompt.md
- runs/<case-id>/codex_response.md
- runs/<case-id>/codex_run.json
- runs/<case-id>.step1.json
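To check a finished run against the file list above, the expected paths can be derived from the PDF name and the case ID. This is a small sketch; the helper names `expected_outputs` and `missing_outputs` are hypothetical, and how the tool derives `<case-id>` is not specified here, so it is passed in explicitly.

```python
from pathlib import Path


def expected_outputs(workspace: Path, pdf_path: Path, case_id: str) -> list[Path]:
    """List the six artifacts a full convert-and-run invocation should leave behind."""
    stem = pdf_path.stem
    return [
        workspace / "converted" / f"{stem}.mmd",
        workspace / "converted" / f"{stem}.mathpix.json",
        workspace / "runs" / case_id / "prompt.md",
        workspace / "runs" / case_id / "codex_response.md",
        workspace / "runs" / case_id / "codex_run.json",
        workspace / "runs" / f"{case_id}.step1.json",
    ]


def missing_outputs(workspace: Path, pdf_path: Path, case_id: str) -> list[Path]:
    """Return the expected artifacts that are not present on disk."""
    return [p for p in expected_outputs(workspace, pdf_path, case_id) if not p.is_file()]
```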
Use --no-codex when you only want to test ingestion while still using the same
output layout.
If the Codex step fails after conversion, the pipeline still writes
runs/<case-id>.step1.json with the conversion artifact and codex_run set to
null, so the workspace state remains inspectable before retrying.
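A retry script could read the summary file to decide whether the Codex step still needs to run. In the sketch below only the `codex_run` key comes from the description above; the function name `codex_step_pending` is hypothetical, and the rest of the summary schema is not assumed.

```python
import json
from pathlib import Path


def codex_step_pending(summary_path: Path) -> bool:
    """Return True when the summary records no Codex run (codex_run is null or absent)."""
    summary = json.loads(summary_path.read_text(encoding="utf-8"))
    return summary.get("codex_run") is None
```

For example, `codex_step_pending(Path("tdb-workspace/runs") / "my-case.step1.json")` would return True after a failed Codex step, where `my-case` stands in for a real case ID.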