
Run the Pillar 1 LLM-KG pipeline on a (potentially large) corpus
Source: R/llm_kg_pipeline.R

Drives causal_llm_extract() over a JSONL corpus of abstracts
(one JSON object with source + abstract per line), with
resumable JSONL checkpointing and throttle/retry handling. The
function is engineered for runs of 10,000+ abstracts, which can
take several hours to a day on a local Ollama instance.
Usage
llm_kg_pipeline_run(
  corpus_path,
  output_path,
  backend = c("ollama", "openai", "anthropic"),
  model = NULL,
  host = "http://localhost:11434",
  temperature = 0,
  timeout_sec = 120,
  max_retries = 3L,
  min_confidence = 0.5,
  verbose = TRUE,
  max_abstracts = NULL
)

Arguments
- corpus_path
  Character path to a JSONL file with one
  {"source": ..., "abstract": ...} object per line (see the
  corpus-building sketch after this list).
- output_path
  Character path to the JSONL claims output. Created on the first
  run; appended to on resume.
- backend
  One of "ollama", "openai", "anthropic".
- model
  LLM model identifier; NULL picks a backend-appropriate default.
- host
  Ollama HTTP host (ignored for hosted backends).
- temperature
  Numeric; LLM sampling temperature. Default 0.
- timeout_sec
  Numeric; per-call HTTP timeout in seconds. Default 120.
- max_retries
  Integer; retry budget for transient errors. Default 3L.
- min_confidence
  Numeric; claims below this confidence are discarded. Default 0.5.
- verbose
  Logical; emit progress messages. Default TRUE.
- max_abstracts
  Optional integer; caps the run for testing.
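A minimal end-to-end sketch (the toy corpus, file paths, and the use
of jsonlite::stream_out() to write the JSONL file are illustrative
assumptions; only llm_kg_pipeline_run() and its arguments are taken
from this page):

library(jsonlite)

# Toy corpus: one {"source": ..., "abstract": ...} object per line.
corpus <- data.frame(
  source   = c("pmid:111", "pmid:222"),
  abstract = c("Smoking causes lung cancer ...",
               "Exercise reduces cardiovascular risk ..."),
  stringsAsFactors = FALSE
)
corpus_path <- tempfile(fileext = ".jsonl")
stream_out(corpus, file(corpus_path), verbose = FALSE)  # writes JSONL

llm_kg_pipeline_run(
  corpus_path    = corpus_path,
  output_path    = "claims.jsonl",
  backend        = "ollama",
  host           = "http://localhost:11434",
  min_confidence = 0.5,
  max_abstracts  = 2L   # cap the run while testing
)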
Resumability
On every successful per-abstract extraction, claims are appended
to output_path and the source identifier is appended to
output_path.done. When llm_kg_pipeline_run() is restarted
on the same output_path, abstracts whose source is already in
the .done file are skipped automatically. This makes the
pipeline safe under arbitrary process kills, network drops, or
Ollama restarts.
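The checkpointing contract can be pictured with the following sketch.
It is illustrative only, not the package's implementation; the JSONL
field handling and the bare call to causal_llm_extract() (with its
backend arguments elided) are assumptions:

# Sources already listed in <output_path>.done are skipped on restart.
done_path <- paste0(output_path, ".done")
done <- if (file.exists(done_path)) readLines(done_path) else character(0)

for (line in readLines(corpus_path)) {
  rec <- jsonlite::fromJSON(line)
  if (rec$source %in% done) next              # already processed: skip

  claims <- causal_llm_extract(rec$abstract)  # per-abstract extraction
  # Claims are written first, the done marker second, so a crash between
  # the two writes can only cause a re-run of one abstract, never a gap.
  write(jsonlite::toJSON(claims, auto_unbox = TRUE), output_path, append = TRUE)
  write(rec$source, done_path, append = TRUE)
}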