
Runs causal_llm_extract() (with a real Ollama, OpenAI, or Anthropic backend) or the deterministic llm_benchmark_simulate() fallback over every abstract in the corpus, then writes a draft JSONL file in the same schema as cerrado_gold_standard_v1.jsonl. Claims are marked status = "draft" so the Shiny reviewer can distinguish machine-generated entries from human-added ones.

Usage

llm_preannotate(
  corpus,
  backend = c("ollama", "openai", "anthropic", "simulator"),
  model = NULL,
  output_path = "cerrado_draft_gold.jsonl",
  cache_dir = NULL,
  max_abstracts = NULL,
  verbose = TRUE,
  ...
)
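A minimal invocation might look like the following sketch; the corpus path, output path, and cap are illustrative, and the simulator backend is chosen so the call works without API access:

```r
# Pre-annotate a small batch using the offline simulator backend.
# File paths here are hypothetical examples.
drafts <- llm_preannotate(
  corpus = "cerrado_corpus.jsonl",
  backend = "simulator",
  output_path = "cerrado_draft_gold.jsonl",
  max_abstracts = 10,
  verbose = TRUE
)
```

The returned list of records is the same content written to output_path, so it can be inspected directly in the session.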

Arguments

corpus

Either a path to a JSONL file with one record per line (records must have abstract_id and abstract_text), or a list of records already in memory.
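For reference, each line of the corpus file is one JSON object carrying at least the two required fields; the values below are made up:

```json
{"abstract_id": "abs_0001", "abstract_text": "Fire frequency reduced woody cover in Cerrado plots over a ten-year observation period."}
```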

backend

One of "ollama", "openai", "anthropic", or "simulator". With "simulator", the function runs llm_benchmark_simulate() against a pseudo-gold standard derived from each abstract's first sentence, which is useful for demos and CI builds without API access. Defaults to "ollama" with gemma4:latest.

model

Optional model ID. Defaults to "gemma4:latest" for the Ollama backend.

output_path

Path to write the draft JSONL.

cache_dir

Optional directory for per-record JSON caches. When set, re-running the function over the same corpus short-circuits to cached extractions, so interrupted jobs resume exactly where they left off. Defaults to NULL (no cache).
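The resume behaviour can be sketched as follows; the directory name is hypothetical:

```r
# First run: writes one JSON cache file per record under cache/.
llm_preannotate(corpus, backend = "ollama", cache_dir = "cache/")

# After an interruption, the same call short-circuits on records that
# already have a cache entry and extracts only the remaining abstracts.
llm_preannotate(corpus, backend = "ollama", cache_dir = "cache/")
```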

max_abstracts

Optional integer cap on how many abstracts to process (useful for staged runs).

verbose

Logical; print per-abstract progress.

...

Forwarded to causal_llm_extract().

Value

Invisibly, the list of records written.