
Vision-language extraction of pedon data (Module 2)
Source: vignettes/v04_vlm_extraction.Rmd
Module 2 (vlm-*) lets soilKey build a
PedonRecord from a field-description PDF, a profile photo,
or a fieldsheet image – using a vision-language model (VLM) for the
extraction and JSON-Schema validation as a hard gate. The
taxonomic key itself is never delegated to the LLM: the VLM is
restricted to extraction, and every extracted value is recorded in the
provenance log so the evidence grade reflects the lower confidence of
VLM-sourced data.
This vignette walks the extraction loop end to end, using a
MockVLMProvider so the example runs offline and without API
keys. Swapping the mock for a real ellmer chat (Anthropic,
OpenAI, etc.) is a one-line change.
1. The mock provider
MockVLMProvider exposes the same $chat()
method as an ellmer chat object, but pops responses from a
pre-loaded queue. Because all of soilKey’s VLM logic talks
to the provider through $chat(), swapping it for a real
chat (or for any custom backend) is transparent.
mock <- MockVLMProvider$new(responses = list())
class(mock)
#> [1] "MockVLMProvider" "R6"2. Schemas and prompts
2. Schemas and prompts
VLM responses are constrained by a JSON Schema (Draft-07). The schema
for horizon extraction is shipped at
inst/schemas/horizon.json; the rendered prompt template
lives at inst/prompts/extract_horizons.md.
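Both files are installed with the package, so they can be inspected
locally. A quick look at the prompt template (assuming the standard
inst/ installation layout):
# Locate and print the first lines of the shipped prompt template.
prompt_path <- system.file("prompts", "extract_horizons.md",
                           package = "soilKey")
cat(readLines(prompt_path, n = 6), sep = "\n")
The schema itself can be loaded and inspected the same way: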
sch <- jsonlite::fromJSON(soilKey:::load_schema("horizon"),
                          simplifyVector = FALSE)
length(sch$properties$horizons$items$properties)
#> [1] 35
head(names(sch$properties$horizons$items$properties), 12)
#> [1] "top_cm" "bottom_cm" "designation"
#> [4] "boundary_distinctness" "boundary_topography" "munsell_moist"
#> [7] "munsell_dry" "structure_grade" "structure_size"
#> [10] "structure_type" "consistence_moist" "clay_films_amount"The schema enforces strict types – top_cm must be
numeric, designation must be a string, Munsell colours come
as {hue, value, chroma, confidence, source_quote} triples,
etc. Any response that fails validation triggers the retry loop with the
validation error included in the next prompt.
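For intuition, the validate-or-retry contract can be sketched in a few
lines. This is an illustrative re-implementation, not the package's
internal validate_or_retry(); it assumes jsonvalidate for Draft-07
validation and any provider exposing $chat():
# Ask, validate against the schema, and on failure feed the
# validator's error report back into the next prompt.
ask_validated <- function(provider, prompt, schema_json, max_retries = 2L) {
  for (attempt in seq_len(max_retries + 1L)) {
    raw <- provider$chat(prompt)
    ok  <- jsonvalidate::json_validate(raw, schema_json,
                                       engine = "ajv", verbose = TRUE)
    if (isTRUE(ok)) {
      return(list(data = jsonlite::fromJSON(raw, simplifyVector = FALSE),
                  raw = raw, attempts = attempt))
    }
    errs   <- attr(ok, "errors")
    prompt <- paste0(prompt,
                     "\n\nYour previous reply failed JSON-Schema validation:\n",
                     paste(errs$message, collapse = "\n"))
  }
  stop("response failed schema validation after ", max_retries, " retries")
}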
3. Walking the extraction loop with the mock
Suppose the model returns the following horizon-level JSON for a synthetic Latossolo description:
horizon_json <- '{
  "horizons": [
    {
      "top_cm": 0,
      "bottom_cm": 15,
      "designation": "A",
      "munsell_moist": {"hue": "2.5YR", "value": 3, "chroma": 4,
                        "confidence": 0.85, "source_quote": "vermelho-escuro"},
      "clay_pct": {"value": 50, "confidence": 0.9, "source_quote": "muito argilosa (50%)"},
      "oc_pct":   {"value": 2.0, "confidence": 0.85, "source_quote": "C org. 2.0%"}
    },
    {
      "top_cm": 15,
      "bottom_cm": 65,
      "designation": "Bw1",
      "munsell_moist": {"hue": "2.5YR", "value": 3, "chroma": 6,
                        "confidence": 0.85, "source_quote": "vermelho"},
      "clay_pct": {"value": 60, "confidence": 0.9, "source_quote": "muito argilosa"},
      "oc_pct":   {"value": 1.2, "confidence": 0.85, "source_quote": "C org. 1.2%"}
    }
  ]
}'
Wire the mock provider to return this JSON, send a prompt through
validate_or_retry, and watch the loop:
mock <- MockVLMProvider$new(responses = list(horizon_json))
res <- soilKey:::validate_or_retry(
  provider    = mock,
  prompt      = "extract horizons from <fake document>",
  schema      = "horizon",
  max_retries = 0L
)
str(res, max.level = 2)
#> List of 3
#>  $ data    :List of 1
#>   ..$ horizons:List of 2
#>  $ raw     : chr "{\n \"horizons\": [\n {\n \"top_cm\": 0,\n \"bottom_cm\": 15,\n \"designation\": \"A\",\n "| __truncated__
#>  $ attempts: int 1
The returned res$data is the parsed (validated) R
object. res$attempts == 1 because the canned response
passed validation on the first try. The validation_error_at
argument of MockVLMProvider$new() forces an attempt to
return malformed JSON, exercising the retry path – see
tests/testthat/test-vlm-extract.R for a worked retry-loop
test.
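To feel the retry path without opening the test file, force the first
queued response to fail. The sketch below assumes validation_error_at = 1L
corrupts attempt 1, as the argument name suggests, so the loop should
succeed on the second attempt:
# First attempt fails validation; the retry returns the valid JSON.
mock_retry <- MockVLMProvider$new(
  responses = list(horizon_json, horizon_json),
  validation_error_at = 1L
)
res_retry <- soilKey:::validate_or_retry(
  provider    = mock_retry,
  prompt      = "extract horizons from <fake document>",
  schema      = "horizon",
  max_retries = 1L
)
res_retry$attempts
#> [1] 2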
4. Merging extraction into a PedonRecord
apply_horizons_extraction() consumes the parsed JSON and
merges it into a PedonRecord, recording each value in the
provenance log with source = "extracted_vlm".
pr <- PedonRecord$new(
  site = list(id = "VLM-demo", lat = -22.5, lon = -43.7,
              country = "BR", parent_material = "gneiss"),
  horizons = data.table::data.table(top_cm = numeric(0), bottom_cm = numeric(0))
)
added <- soilKey:::apply_horizons_extraction(pr, res$data, overwrite = TRUE)
cat("Provenance entries added:", added, "\n")
#> Provenance entries added: 12
# Inspect what landed in the horizons table.
pr$horizons[, .(top_cm, bottom_cm, designation,
                munsell_hue_moist, munsell_value_moist, munsell_chroma_moist,
                clay_pct, oc_pct)]
#>    top_cm bottom_cm designation munsell_hue_moist munsell_value_moist
#>     <num>     <num>      <char>            <char>               <num>
#> 1:      0        15           A             2.5YR                   3
#> 2:     15        65         Bw1             2.5YR                   3
#>    munsell_chroma_moist clay_pct oc_pct
#>                   <num>    <num>  <num>
#> 1:                    4       50    2.0
#> 2:                    6       60    1.2
5. Provenance and evidence grade
Every value that came from the VLM carries
source = "extracted_vlm" in the provenance log. The
classification’s evidence grade responds accordingly:
prov <- pr$provenance
head(prov[, .(horizon_idx, attribute, source, confidence)])
#>    horizon_idx            attribute        source confidence
#>          <int>               <char>        <char>      <num>
#> 1:           1          designation extracted_vlm         NA
#> 2:           1    munsell_hue_moist extracted_vlm       0.85
#> 3:           1  munsell_value_moist extracted_vlm       0.85
#> 4:           1 munsell_chroma_moist extracted_vlm       0.85
#> 5:           1             clay_pct extracted_vlm       0.90
#> 6:           1               oc_pct extracted_vlm       0.85
Even the simplest call to classify_wrb2022() will
reflect the lower-confidence VLM source via a lower evidence grade than
the same profile assembled from lab data. (See vignette 02 for the grade
ladder.)
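To see this directly, tabulate the provenance sources and classify. The
sketch below assumes classify_wrb2022() accepts the PedonRecord; the
evidence_grade field name is illustrative (see vignette 02 for the real
accessor):
# How many attribute values came from each source?
pr$provenance[, .N, by = source]
# A VLM-only profile should land on a lower rung of the grade ladder
# than the same profile fed from lab data.
wrb <- classify_wrb2022(pr)
wrb$evidence_grade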
6. Production: swap the mock for an ellmer chat
In production, replace MockVLMProvider with a chat
object from ellmer. The rest of the pipeline is
unchanged.
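Because the contract is only $chat(), an ellmer chat object can also be
passed straight through, without the vlm_provider() wrapper. A sketch
(constructor names as in recent ellmer releases; older versions used
chat_claude() instead of chat_anthropic()):
# Any of these satisfies the $chat() contract soilKey expects; pass the
# chat object anywhere a provider is expected below.
chat <- ellmer::chat_ollama(model = "gemma4:e4b")   # local, no API key
# chat <- ellmer::chat_openai(model = "gpt-4o")     # needs OPENAI_API_KEY
# chat <- ellmer::chat_anthropic()                  # needs ANTHROPIC_API_KEY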
6a. Local-first with Gemma 4 (Ollama)
For sensitive field descriptions, governmental surveys, or simply for reproducibility without an API key, the recommended path is a local Gemma 4 model via Ollama: the data never leaves your machine.
# Install once (any platform, shell commands):
#   ollama pull gemma4:e4b   # ~3 GB, multimodal, fits a laptop
#   ollama pull gemma4:31b   # frontier dense, best quality
#   ollama serve             # background server
# Local Gemma 4 edge -- multimodal text + image (and audio).
provider <- vlm_provider("ollama") # default: gemma4:e4b
# provider <- vlm_provider("ollama", model = "gemma4:31b") # frontier
extract_horizons_from_pdf(
  pedon       = pr,
  pdf_path    = "field-reports/perfil-LV-001.pdf",
  provider    = provider,
  max_retries = 3L
)
6b. Cloud providers (Anthropic, OpenAI, Google)
# install.packages("ellmer")
# Anthropic Claude (needs ANTHROPIC_API_KEY in the environment).
provider <- vlm_provider("anthropic") # default: claude-sonnet-4-7
# Or OpenAI / Google in the same one-liner shape:
# provider <- vlm_provider("openai") # default: gpt-4o
# provider <- vlm_provider("google") # default: gemini-2.0-pro
extract_horizons_from_pdf(
  pedon       = pr,
  pdf_path    = "field-reports/perfil-LV-001.pdf",
  provider    = provider,
  max_retries = 3L
)
6c. The one-liner: classify_from_documents()
For the canonical case – “I have a PDF and (optionally) a profile photo, give me the three classifications and a one-pager report” – chain everything in a single call. The default provider is local Gemma 4 edge:
res <- classify_from_documents(
  pdf    = "perfil_042_descricao.pdf",
  image  = "perfil_042_parede.jpg",
  report = "perfil_042.html"   # optional output report
)
res$classifications$wrb$name
#> [1] "Geric Ferric Rhodic Chromic Ferralsol (Clayic, Humic, Dystric, Ochric, Rubic)"
res$classifications$sibcs$name
#> [1] "Latossolos Vermelhos Distroficos tipicos, argilosa, moderado"
res$classifications$usda$name
#> [1] "Rhodic Hapludox"
Photos go through extract_munsell_from_photo(); site
metadata through extract_site_from_fieldsheet(). All three
follow the same validate-or-retry contract so they accept any provider
that exposes $chat(prompt, ...).
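A sketch of the photo and fieldsheet calls; the image-path argument
names are assumptions, so check each function's help page:
# Munsell colour from a profile-wall photo (argument names illustrative).
extract_munsell_from_photo(
  pedon    = pr,
  image    = "perfil_042_parede.jpg",
  provider = provider
)
# Site metadata (coordinates, land use, ...) from a scanned fieldsheet.
extract_site_from_fieldsheet(
  pedon    = pr,
  image    = "caderneta_042.jpg",
  provider = provider
)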
The next vignette (v05_spatial_spectra_pipeline) shows
how the SoilGrids spatial prior and OSSL Vis-NIR predictions can fill
remaining gaps in the pedon before classification.