
Build a fully-classified `PedonRecord` from documents in one call
Source: R/classify-from-documents.R

classify_from_documents.Rd

Highest-level entry point of the soilKey VLM pipeline. Given a soil-description PDF and/or a profile-wall photograph, this function extracts horizon, colour, and site attributes with a vision-language model, runs the deterministic classification keys, and optionally renders a report (see Details).
Usage
classify_from_documents(
  pdf = NULL,
  image = NULL,
  fieldsheet = NULL,
  pedon = NULL,
  provider = "auto",
  model = NULL,
  systems = c("wrb", "sibcs", "usda"),
  report = NULL,
  overwrite = FALSE,
  verbose = TRUE
)

Arguments

- pdf
Optional path to a soil-description PDF.

- image
Optional path to a profile-wall image (JPG / PNG); if supplied, Munsell extraction is attempted with the configured provider.

- fieldsheet
Optional path to a site-metadata field sheet (image or PDF).

- pedon
Optional existing `PedonRecord`; when supplied, the function fills only the fields VLM extraction can fill (subject to the provenance-authority order).

- provider
Either a provider name passed to `vlm_provider()` (default `"ollama"`) OR a pre-built ellmer chat object (when you want full control over `system_prompt`, `api_key`, ...).

- model
Optional model identifier; passed through to `vlm_provider()` when `provider` is a string. Defaults to the per-provider default from `default_model`.

- systems
Character vector listing which classification systems to run; a subset of `c("wrb", "sibcs", "usda")`. Default: all three.

- report
Optional output path for a self-contained report (`.html` or `.pdf`). When supplied, `report()` is called on the classification results + pedon. Default `NULL` (no report file).

- overwrite
When merging extracted values into an existing pedon, allow VLM-extracted attributes to clobber already-recorded ones. Default `FALSE` – the provenance authority order (`measured` > `extracted_vlm`) is enforced by `PedonRecord$add_measurement()`.

- verbose
Emit cli progress messages. Default `TRUE`.
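For full control over the chat object, you can build it yourself with ellmer and hand it over via `provider`. A hedged sketch: the `chat_ollama()` constructor and its `system_prompt` argument reflect ellmer's API as commonly documented, and the model tag is borrowed from the Examples section; adjust both to your installation.

```r
library(ellmer)  # chat constructors for the supported VLM providers

# Pre-built chat object instead of a provider name string
chat <- chat_ollama(
  model         = "gemma4:31b",
  system_prompt = "Extract soil-profile attributes verbatim; never infer."
)

res <- classify_from_documents(
  pdf      = "perfil_042_descricao.pdf",
  provider = chat  # bypasses vlm_provider() construction entirely
)
```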
Value

A list with elements:

- pedon
The (mutated) `PedonRecord`.

- classifications
Named list with up to three `ClassificationResult` objects keyed by `wrb`, `sibcs`, `usda`.

- report
Path to the rendered report file (if `report = ...` was supplied), else `NULL`.

- provider
The chat-provider object actually used (useful for downstream debugging or cost accounting).
Details

1. Constructs a vision-language provider chat object via `vlm_provider()` (defaults to local Ollama with Gemma 4 edge for institutional independence and data sovereignty).
2. Extracts horizons from `pdf` via `extract_horizons_from_pdf()`, Munsell colours from `image` via `extract_munsell_from_photo()`, and site metadata from `fieldsheet` via `extract_site_from_fieldsheet()`. Every extracted attribute is stamped `source = "extracted_vlm"` in the `PedonRecord`'s provenance log.
3. Runs the three deterministic keys (`classify_wrb2022()`, `classify_sibcs()`, `classify_usda()`). The VLM never classifies – the package's architectural invariant is preserved.
4. Optionally renders a one-pager HTML / PDF report via `report()`.
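The pipeline steps can be sketched as explicit calls. Function names are the ones documented on this page, but their exact signatures (and the intermediate `pedon` object) are assumptions for illustration only:

```r
# Step 1: build the provider chat object (local Ollama by default)
prov <- vlm_provider("ollama", model = "gemma4:31b")

# Step 2: extraction -- each result is stamped source = "extracted_vlm"
hz  <- extract_horizons_from_pdf("perfil_042_descricao.pdf", provider = prov)
col <- extract_munsell_from_photo("perfil_042_parede.jpg",  provider = prov)

# (extracted values are merged into a PedonRecord here, call it `pedon`)

# Step 3: deterministic keys only -- the VLM is not consulted again
res_wrb <- classify_wrb2022(pedon)
```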
At least one of `pdf`, `image` or `fieldsheet` must be supplied; you can also pass an existing partially-filled `PedonRecord` via `pedon` and let this function fill the gaps.
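Gap-filling an existing record might look like the following. This is a hedged sketch: only `PedonRecord$add_measurement()` is documented here, so the `$new()` constructor and the argument names are hypothetical R6-style placeholders.

```r
# Start from field measurements already in hand (constructor is hypothetical)
ped <- PedonRecord$new()
ped$add_measurement("clay_pct", 62, source = "measured")

# Let VLM extraction fill only the missing attributes;
# the measured clay value is protected by the authority order.
res <- classify_from_documents(
  image = "perfil_042_parede.jpg",
  pedon = ped
)
```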
Why local-first by default
The default provider = "ollama" runs the entire VLM pipeline
on the user's machine via Gemma 4 (edge variant, ~3 GB, multimodal
text+image). No part of the soil description, photograph or
field sheet ever leaves the local network. This is the
recommended configuration for governmental surveys, indigenous
land studies, and unpublished research data; it also makes the
pipeline reproducible without an internet connection. Cloud
providers ("anthropic", "openai", "google")
remain one argument away when they are the right call.
Architectural invariants preserved

- The VLM never classifies. Every extracted value carries `source = "extracted_vlm"`; the deterministic keys consume the resulting `PedonRecord` unaware of how each value was obtained.
- Provenance is preserved end-to-end. The `evidence_grade` on each `ClassificationResult` reflects whether decisive attributes came from `measured`, `predicted_spectra`, `extracted_vlm`, `inferred_prior`, or `user_assumed` – so a caller always knows how robust the classification is.
- Authority order is enforced. A pre-existing `measured` value is never silently overwritten by a later `extracted_vlm` value (unless `overwrite = TRUE`).
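The authority-order rule can be illustrated with a self-contained simplification in base R. This is not soilKey's implementation (the real logic lives in `PedonRecord$add_measurement()`); the `authority` ranking and list-based record are illustrative stand-ins:

```r
# Higher number = higher authority; ranking is illustrative
authority <- c(measured = 3, predicted_spectra = 2, extracted_vlm = 1)

add_measurement <- function(record, field, value, source, overwrite = FALSE) {
  existing <- record[[field]]
  if (!is.null(existing) && !overwrite &&
      authority[[existing$source]] >= authority[[source]]) {
    return(record)  # keep the higher-authority value untouched
  }
  record[[field]] <- list(value = value, source = source)
  record
}

ped <- list()
ped <- add_measurement(ped, "clay_pct", 62, source = "measured")
ped <- add_measurement(ped, "clay_pct", 55, source = "extracted_vlm")  # ignored
ped$clay_pct$value  # still 62, source "measured"
```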
Examples
if (FALSE) { # \dontrun{
# The simplest possible end-to-end call -- local Gemma 4 edge.
res <- classify_from_documents(
  pdf = "perfil_042_descricao.pdf",
  image = "perfil_042_parede.jpg",
  report = "perfil_042.html"
)
res$classifications$wrb$name
#> "Geric Ferric Rhodic Chromic Ferralsol (Clayic, Humic, Dystric, Ochric, Rubic)"

# Cloud provider for a one-shot, production run
res <- classify_from_documents(
  pdf = "perfil_042_descricao.pdf",
  provider = "anthropic"
)

# Different Gemma 4 size on Ollama
res <- classify_from_documents(
  pdf = "perfil_042_descricao.pdf",
  provider = "ollama",
  model = "gemma4:31b"
)
} # }