Reads a PDF (typically a soil survey chapter, field-sheet scan, or
thesis appendix), prompts the configured VLM to extract horizon
attributes against inst/schemas/horizon.json, and merges
the result into pedon. Every extracted attribute is recorded
with source = "extracted_vlm" and the model's reported
confidence and verbatim source quote.
Usage
extract_horizons_from_pdf(
pedon,
pdf_path = NULL,
provider,
max_retries = 3L,
overwrite = FALSE,
prompt_name = "extract_horizons",
schema_name = "horizon",
pdf_text = NULL
)Arguments
- pedon
A
PedonRecordto merge into. Mutated in place AND returned invisibly.- pdf_path
Path to the PDF file. Either
pdf_pathorpdf_textmust be supplied.- provider
A chat provider from
vlm_provider(or aMockVLMProviderfor testing).- max_retries
Integer; how many times to re-prompt on validation failure. Default 3.
- overwrite
If
TRUE, lower-authority values are allowed to clobber higher-authority ones. DefaultFALSE.- prompt_name
Override the default prompt template (
"extract_horizons").- schema_name
Override the default schema (
"horizon").- pdf_text
Optional alternative to
pdf_path: the already-extracted description text. Useful for smoke tests, unit tests withoutpdftools, and for already-OCR'd field-sheet text.
Value
Invisibly, the (mutated) pedon. Carries a
"vlm_extraction" attribute with the parsed response,
number of attempts, and number of provenance entries added.
Details
The PedonRecord's authority order guarantees that values already
tagged "measured" are never silently overwritten by VLM
extraction unless overwrite = TRUE.
If the PDF is long (more than ~30,000 characters), it is chunked page-by-page and each page is sent independently. This is a conservative-but-simple strategy; for very long surveys callers should pre-chunk and call this function once per profile.
