
Spatial prior + OSSL spectra pipeline (Modules 3 & 4)
Source: vignettes/v05_spatial_spectra_pipeline.Rmd
Modules 3 and 4 sit alongside the deterministic key, never inside it:
- Module 3 (spatial-*) – pulls a probabilistic prior over RSGs from a regional or global map (SoilGrids, an Embrapa pedological map, or any raster the user supplies) and runs a consistency check that warns when the deterministic classification disagrees with the prior. The key is never overwritten – the prior is purely advisory.
- Module 4 (spectra-*, vlm-*) – gap-fills horizon attributes (clay, CEC, BS, OC, pH, …) from Vis-NIR / SWIR or MIR spectra via the OSSL library. Predicted values are recorded with source = "predicted_spectra" so the evidence grade tracks the substitution.
This vignette walks both modules end-to-end on the canonical Ferralsol fixture, with all external dependencies (raster files, OSSL parquet libraries, ellmer chats) replaced by inline synthetic objects so the example is fully reproducible.
1. Start from a partially-described pedon
Take the canonical Latossolo and intentionally erase the CEC and base-saturation values from the lower horizons. We will fill them back in with simulated OSSL predictions in §3.
pr_full <- make_ferralsol_canonical()
pr_partial <- pr_full$clone(deep = TRUE)
pr_partial$horizons[3:5, c("cec_cmol", "bs_pct") := NA]
pr_partial$horizons[, .(top_cm, bottom_cm, designation,
                        clay_pct, cec_cmol, bs_pct, oc_pct)]
#> top_cm bottom_cm designation clay_pct cec_cmol bs_pct oc_pct
#> <num> <num> <char> <num> <num> <num> <num>
#> 1: 0 15 A 50 8.0 24 2.0
#> 2: 15 35 AB 52 6.5 17 1.2
#> 3: 35 65 BA 55 NA NA 0.6
#> 4: 65 130 Bw1 60 NA NA 0.3
#> 5: 130 200 Bw2 60 NA NA 0.2
The classification of this incomplete pedon already differs from the canonical Ferralsol:
res_partial <- classify_wrb2022(pr_partial, on_missing = "silent")
res_partial$rsg_or_order
#> [1] "Nitisols"
res_partial$evidence_grade
#> [1] "A"
The grade is still "A" here; the effect of the missing data is visible in the per-RSG trace, which records which RSGs returned NA because of the attributes we erased.
2. Module 3 – spatial prior consistency check
The prior is a probability vector over RSGs from any source – a regional SoilGrids extract, a national Embrapa map, an interpolated kriging surface, etc. For this vignette we build it inline so the example runs without network access:
# Synthetic prior consistent with the gneiss-Mata-Atlantica context:
# Ferralsols dominate, with a tail of Acrisols and Cambisols.
prior <- data.table::data.table(
  rsg_code    = c("FR", "AC", "CM", "AL"),
  probability = c(0.62, 0.20, 0.12, 0.06)
)
prior
#> rsg_code probability
#> <char> <num>
#> 1: FR 0.62
#> 2: AC 0.20
#> 3: CM 0.12
#> 4: AL 0.06
prior_consistency_check() confirms the deterministic
call (FR) is supported by the prior:
chk <- prior_consistency_check(rsg_code = "FR", prior = prior, threshold = 0.05)
chk
#> $consistent
#> [1] TRUE
#>
#> $p
#> [1] 0.62
#>
#> $threshold
#> [1] 0.05
#>
#> $status
#> [1] "consistent"
#>
#> $note
#> [1] "Assigned RSG 'FR' has prior probability 0.620 at this location (>= threshold 0.050)."
#>
#> $top_prior
#> rsg_code probability
#> <char> <num>
#> 1: FR 0.62
#> 2: AC 0.20
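#> 3: CM 0.12
Now suppose the deterministic key had instead landed on Alisols (AL), the least probable class in the prior. At probability 0.06 the call still clears the 0.05 threshold, so the check reports it as consistent even though Ferralsols dominate at 0.62. A warning is reserved for an RSG whose prior probability falls below the threshold: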
prior_consistency_check(rsg_code = "AL", prior = prior, threshold = 0.05)
#> $consistent
#> [1] TRUE
#>
#> $p
#> [1] 0.06
#>
#> $threshold
#> [1] 0.05
#>
#> $status
#> [1] "consistent"
#>
#> $note
#> [1] "Assigned RSG 'AL' has prior probability 0.060 at this location (>= threshold 0.050)."
#>
#> $top_prior
#> rsg_code probability
#> <char> <num>
#> 1: FR 0.62
#> 2: AC 0.20
#> 3: CM 0.12
The deterministic key is never overwritten by the prior. The check
only flags cases where a manual review is warranted; the user remains in
charge of the final assignment. Real production runs would source the
prior from spatial_prior_soilgrids() (a live SoilGrids-WCS
request) or spatial_prior_embrapa():
prior <- spatial_prior_soilgrids(pr_partial, buffer_m = 250)
3. Module 4 – OSSL gap-filling
The OSSL workflow is:
- Pre-process the raw spectra (SNV / Savitzky-Golay 1st derivative).
- Send each horizon’s spectrum through one of three predictors:
  - predict_ossl_mbl() – memory-based learning (recommended);
  - predict_ossl_plsr_local() – partial-least-squares with a local subset;
  - predict_ossl_pretrained() – a pre-trained Cubist or RF model.
- Convert each property’s prediction-interval width to an A–D confidence grade via pi_to_confidence().
- fill_from_spectra() writes each predicted value into the horizons table AND adds a provenance entry with source = "predicted_spectra".
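The first and third steps of this workflow can be illustrated in base R. This is a minimal sketch, not the package's implementation: `snv()` and `pi_width_to_grade()` are hypothetical helper names, the Savitzky-Golay derivative is left out, and the relative-width cut points (0.25, 0.50, 1.00) are illustrative placeholders rather than the thresholds pi_to_confidence() actually uses.

```r
# Standard normal variate (SNV): centre and scale each spectrum (one row
# per horizon) by its own mean and standard deviation.
snv <- function(spectra) {
  spectra <- as.matrix(spectra)
  centred <- sweep(spectra, 1, rowMeans(spectra), "-")
  sweep(centred, 1, apply(spectra, 1, sd), "/")
}

# Hypothetical grade mapping. The relative prediction-interval width is
# (upper - lower) / |predicted|; the cut points are placeholders only.
pi_width_to_grade <- function(lower, upper, predicted,
                              breaks = c(0.25, 0.50, 1.00)) {
  rel_width <- (upper - lower) / abs(predicted)
  as.character(cut(rel_width,
                   breaks = c(-Inf, breaks, Inf),
                   labels = c("A", "B", "C", "D")))
}

# Two synthetic "spectra" with the same shape but different offset and gain.
spectra <- rbind(h3 = c(0.10, 0.20, 0.30, 0.40),
                 h4 = c(1.00, 1.50, 2.00, 2.50))
spectra_snv <- snv(spectra)

# Synthetic prediction intervals around three predicted CEC values.
grades <- pi_width_to_grade(lower     = c(5.0, 3.0, 1.0),
                            upper     = c(6.0, 7.0, 9.0),
                            predicted = c(5.5, 4.9, 4.7))
grades
#> [1] "A" "C" "D"
```

After SNV the two rows are identical, which is the point of the transform: differences in baseline and overall intensity are removed before the spectra reach a predictor.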
A production call would look like this (skipped in this vignette because OSSL is a multi-GB dataset that would have to be downloaded):
fill_from_spectra(
  pr_partial,
  library      = "ossl",
  region       = "south_america",
  properties   = c("clay_pct", "cec_cmol", "bs_pct", "oc_pct"),
  method       = "mbl",
  preprocess   = "snv+sg1",
  k_neighbors  = 100L,
  ossl_library = "/path/to/ossl-soilsite-vnir.parquet"
)
For the vignette, simulate the predicted values directly through
pedon$add_measurement():
# Simulated spectral predictions for the three horizons blanked in §1.
preds <- list(
  list(idx = 3, attribute = "cec_cmol", value = 5.5, confidence = 0.78),
  list(idx = 3, attribute = "bs_pct",   value = 14,  confidence = 0.72),
  list(idx = 4, attribute = "cec_cmol", value = 4.9, confidence = 0.79),
  list(idx = 4, attribute = "bs_pct",   value = 13,  confidence = 0.74),
  list(idx = 5, attribute = "cec_cmol", value = 4.7, confidence = 0.70),
  list(idx = 5, attribute = "bs_pct",   value = 13,  confidence = 0.71)
)
pr_filled <- pr_partial$clone(deep = TRUE)
# Write each prediction into the pedon, tagged with its provenance.
for (p in preds) {
  pr_filled$add_measurement(
    horizon_idx = p$idx,
    attribute   = p$attribute,
    value       = p$value,
    source      = "predicted_spectra",
    confidence  = p$confidence,
    overwrite   = TRUE
  )
}
pr_filled$horizons[, .(top_cm, bottom_cm, cec_cmol, bs_pct)]
#> top_cm bottom_cm cec_cmol bs_pct
#> <num> <num> <num> <num>
#> 1: 0 15 8.0 24
#> 2: 15 35 6.5 17
#> 3: 35 65 5.5 14
#> 4: 65 130 4.9 13
#> 5: 130 200 4.7 13
The predicted values are now in the horizons table, and the
provenance log records each as predicted_spectra:
prov <- pr_filled$provenance
prov[source == "predicted_spectra", .(horizon_idx, attribute, source, confidence)]
#> horizon_idx attribute source confidence
#> <int> <char> <char> <num>
#> 1: 3 cec_cmol predicted_spectra 0.78
#> 2: 3 bs_pct predicted_spectra 0.72
#> 3: 4 cec_cmol predicted_spectra 0.79
#> 4: 4 bs_pct predicted_spectra 0.74
#> 5: 5 cec_cmol predicted_spectra 0.70
#> 6: 5 bs_pct predicted_spectra 0.71
4. Re-classify with the gap-filled pedon
After OSSL fills the missing CEC/BS, the deterministic key has a complete dataset and the classification’s evidence grade reflects the predicted source.
res_filled <- classify_wrb2022(pr_filled, on_missing = "silent")
res_filled$rsg_or_order
#> [1] "Ferralsols"
res_filled$evidence_grade
#> [1] "B"
res_filled$name
#> [1] "Geric Ferric Rhodic Chromic Ferralsol (Clayic, Humic, Dystric, Ochric, Rubic)"
The evidence grade ladder for the same profile across the four workflows:
| Workflow | Evidence grade |
|---|---|
| Lab-only (full canonical fixture) | A |
| Spectra-filled (OSSL predicted_spectra) | B |
| VLM-extracted only | C |
| User-assumed | D |
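One way to read this ladder is as a "weakest source wins" rule. The sketch below is an assumption made for illustration, not the package's grading code: only the `predicted_spectra` label appears in this vignette, the other three source labels are hypothetical, and `overall_grade()` is an invented helper.

```r
# Grade attached to each data source. Only "predicted_spectra" is a label
# used in this vignette; the other names are hypothetical placeholders.
grade_by_source <- c(
  measured          = "A",
  predicted_spectra = "B",
  extracted_vlm     = "C",
  user_assumed      = "D"
)

# Assumed rule: the overall grade is the weakest grade among all sources
# that contributed a value to the pedon.
overall_grade <- function(sources) {
  unknown <- setdiff(sources, names(grade_by_source))
  if (length(unknown) > 0) {
    stop("Unknown source label(s): ", paste(unknown, collapse = ", "))
  }
  grades <- grade_by_source[unique(sources)]
  # "A" < "B" < "C" < "D" alphabetically, so max() picks the weakest grade.
  max(unname(grades))
}

overall_grade(c("measured", "measured"))
#> [1] "A"
overall_grade(c("measured", "predicted_spectra", "measured"))
#> [1] "B"
```

Under this reading, a pedon with lab values plus spectra-filled CEC and BS drops from A to B, matching the re-classification result shown in this section.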
5. Combining priors and posteriors
combine_priors() merges multiple sources (SoilGrids +
Embrapa + a custom map) with weights, returning a single normalised
prior that you can feed into prior_consistency_check():
combined <- combine_priors(
  priors = list(
    soilgrids = data.table::data.table(rsg_code = c("FR", "AC", "CM"),
                                       probability = c(0.62, 0.20, 0.18)),
    embrapa   = data.table::data.table(rsg_code = c("FR", "AC", "NT"),
                                       probability = c(0.55, 0.30, 0.15))
  ),
  weights = c(soilgrids = 0.6, embrapa = 0.4)
)
combined
#> rsg_code probability
#> <char> <num>
#> 1: FR 0.7139748011
#> 2: AC 0.2841641547
#> 3: CM 0.0017189621
#> 4: NT 0.0001420821
posterior_classify() would then take a
ClassificationResult and a prior, returning a posterior
probability over RSGs (the deterministic key contributes a sharply
peaked likelihood). It is intended for ranking ambiguous fixtures or for
active-learning loops.
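The combined table printed earlier in this section is not a weighted average: linear pooling would give FR = 0.6 × 0.62 + 0.4 × 0.55 = 0.592. The printed values are instead consistent with weighted log-linear (geometric) pooling, where a class missing from a source gets a floor probability of 1e-6. That is an inference from the numbers, not something this vignette states about combine_priors(). The sketch below reproduces the table under that assumption in base R, then adds a hypothetical posterior step in which the key's "sharply peaked likelihood" is modelled as a fixed `peak` mass on the assigned RSG. `pool_log_linear()`, `posterior_sketch()`, `floor` and `peak` are illustrative names and values.

```r
# Weighted log-linear pooling: multiply the priors raised to their weights,
# then renormalise. `floor` is the assumed stand-in for unlisted classes.
pool_log_linear <- function(priors, weights, floor = 1e-6) {
  codes <- unique(unlist(lapply(priors, function(p) as.character(p$rsg_code))))
  log_score <- setNames(numeric(length(codes)), codes)
  for (src in names(priors)) {
    p <- setNames(rep(floor, length(codes)), codes)
    p[as.character(priors[[src]]$rsg_code)] <- priors[[src]]$probability
    log_score <- log_score + weights[[src]] * log(p)
  }
  pooled <- exp(log_score)
  pooled <- sort(pooled / sum(pooled), decreasing = TRUE)
  data.frame(rsg_code = names(pooled), probability = unname(pooled))
}

# Hypothetical posterior: prior times a likelihood that puts `peak` on the
# RSG assigned by the key and spreads the rest evenly over the other classes.
posterior_sketch <- function(assigned, prior, peak = 0.95) {
  k <- nrow(prior)
  stopifnot(k > 1)
  lik <- ifelse(prior$rsg_code == assigned, peak, (1 - peak) / (k - 1))
  post <- prior$probability * lik
  data.frame(rsg_code = prior$rsg_code, probability = post / sum(post))
}

pooled <- pool_log_linear(
  priors = list(
    soilgrids = data.frame(rsg_code = c("FR", "AC", "CM"),
                           probability = c(0.62, 0.20, 0.18)),
    embrapa   = data.frame(rsg_code = c("FR", "AC", "NT"),
                           probability = c(0.55, 0.30, 0.15))
  ),
  weights = c(soilgrids = 0.6, embrapa = 0.4)
)
posterior <- posterior_sketch("FR", pooled)
```

Log-linear pooling penalises any class that one source omits, which is why CM and NT collapse to near zero here even though each has 15–18% support in one of the two maps.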
Summary
- The deterministic key always runs first and is never overwritten.
- Module 3 gives a probabilistic sanity check against external maps and warns on disagreement.
- Module 4 fills missing horizon attributes from spectra, with full provenance, so the evidence grade tracks the substitution.
- Putting them together, you can take a partially-described pedon, fill its gaps from spectra, classify it deterministically, and cross-check the result against a global map – all in one pipeline.
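As a compact illustration of the advisory check in the second bullet, the threshold logic can be sketched in base R. This is not the code behind prior_consistency_check(): `check_prior_sketch()` is a hypothetical helper, treating an RSG absent from the prior as probability 0 is an assumption, and the "inconsistent" status string is a guess at the counterpart of the "consistent" status shown in §2.

```r
# Minimal sketch of a prior consistency check: the assigned RSG is
# "consistent" when its prior probability is at or above the threshold.
check_prior_sketch <- function(rsg_code, prior, threshold = 0.05) {
  hit <- prior$probability[prior$rsg_code == rsg_code]
  # Assumption: an RSG that the prior does not list has probability 0.
  p <- if (length(hit) == 0) 0 else hit[1]
  consistent <- p >= threshold
  list(
    consistent = consistent,
    p          = p,
    threshold  = threshold,
    status     = if (consistent) "consistent" else "inconsistent"
  )
}

prior <- data.frame(rsg_code    = c("FR", "AC", "CM", "AL"),
                    probability = c(0.62, 0.20, 0.12, 0.06))

check_prior_sketch("FR", prior)$consistent
#> [1] TRUE
check_prior_sketch("AL", prior)$consistent   # 0.06 still clears 0.05
#> [1] TRUE
check_prior_sketch("NT", prior)$status       # not listed in the prior
#> [1] "inconsistent"
```

The check only returns a verdict; nothing in it touches the classification itself, which is what keeps the prior advisory.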
The next vignette (v06_wosis_benchmark) shows how to run
this whole stack at scale against the WoSIS global pedon archive for
paper-grade validation.