
Pre-annotate a corpus with an LLM to produce draft claims
Source: R/llm_annotation.R
llm_preannotate.Rd

Runs causal_llm_extract() (real Ollama / OpenAI / Anthropic backends) or
the deterministic llm_benchmark_simulate() fallback over every
abstract in the corpus, and writes a draft JSONL file in the same
schema as cerrado_gold_standard_v1.jsonl, but with claims
marked status = "draft" so the Shiny reviewer can distinguish
machine-generated entries from human-added ones.
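A draft record might look like the following sketch. Only abstract_id, abstract_text, and status = "draft" are confirmed by this page; the claim fields shown are illustrative assumptions, and the authoritative schema is whatever cerrado_gold_standard_v1.jsonl uses:

```json
{"abstract_id": "A0001",
 "abstract_text": "Fire frequency reduces woody cover in Cerrado savannas ...",
 "claims": [
   {"cause": "fire frequency",
    "effect": "woody cover",
    "status": "draft"}
 ]}
```

In the JSONL output each record occupies a single line; it is pretty-printed here for readability.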
Usage
llm_preannotate(
corpus,
backend = c("ollama", "openai", "anthropic", "simulator"),
model = NULL,
output_path = "cerrado_draft_gold.jsonl",
cache_dir = NULL,
max_abstracts = NULL,
verbose = TRUE,
...
)

Arguments
- corpus
Either a path to a JSONL file with one record per line (records must have abstract_id and abstract_text), or a list of records already in memory.
- backend
One of "ollama", "openai", "anthropic", "simulator". When "simulator", the function uses llm_benchmark_simulate() against a pseudo-gold-standard derived from the abstract's first sentence – useful for demos and CI builds without API access. Defaults to "ollama" with gemma4:latest.
- model
Optional model id. Defaults to "gemma4:latest" for Ollama.
- output_path
Path to write the draft JSONL.
- cache_dir
Optional directory for per-record JSON caches. When set, re-running the function over the same corpus short-circuits to cached extractions, so interrupted jobs resume exactly where they left off. Defaults to NULL (no cache).
- max_abstracts
Optional integer cap on how many abstracts to process (useful for staged runs).
- verbose
Logical; print per-abstract progress.
- ...
Forwarded to causal_llm_extract().
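Examples

Two hedged sketches of typical calls, using only the parameters documented above. The corpus path and cache directory names are hypothetical placeholders:

```r
# Simulator backend: deterministic, no API access needed (demos / CI).
# "cerrado_abstracts.jsonl" is a hypothetical input corpus with
# abstract_id and abstract_text fields on each line.
llm_preannotate(
  corpus        = "cerrado_abstracts.jsonl",
  backend       = "simulator",
  output_path   = "cerrado_draft_gold.jsonl",
  max_abstracts = 10
)

# Resumable real run: per-record caches in "llm_cache" let an
# interrupted job pick up where it left off on re-run.
llm_preannotate(
  corpus    = "cerrado_abstracts.jsonl",
  backend   = "ollama",
  model     = "gemma4:latest",
  cache_dir = "llm_cache"
)
```

Setting cache_dir is cheap insurance for long corpora: each abstract's extraction is written as its own JSON file, so only unprocessed records hit the backend on a restart.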