
Fit 2SLS using foundation-model (or proxy) embeddings as instruments
Source: R/causal_iv.R
causal_iv_from_embeddings.Rd

Convenience wrapper that:
- Takes a matrix of per-profile embeddings (rows = profiles, columns = embedding dims),
- Reduces them to n_pcs principal components,
- Attaches the PCs to data as new columns named PC_1, ...,
- Runs causal_iv_fit_2sls() with the PCs as instruments.
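The steps above can be sketched in base R. This is a minimal illustration on simulated data, not the package's implementation. It assumes the PCA is done with stats::prcomp() (centered, unscaled) and that the PC columns are numbered consecutively from PC_1, and it replaces causal_iv_fit_2sls() with a hand-rolled two-stage lm() fit. The objects latent, emb, sim and pcs and the column names x and y are hypothetical.

```r
set.seed(1)
n <- 200; d <- 32; n_pcs <- 5L

# Simulated embeddings with low-rank structure: rows = profiles, columns = dims.
latent <- matrix(rnorm(n * 3), n, 3)
emb <- latent %*% matrix(rnorm(3 * d), 3, d) + matrix(rnorm(n * d, sd = 0.1), n, d)

# Exposure x depends on the latent factors and on an unobserved confounder u;
# the true effect of x on y is 2.
u <- rnorm(n)
x <- as.numeric(latent %*% c(1, 0.5, -0.5)) + u + rnorm(n)
y <- 2 * x - u + rnorm(n)
sim <- data.frame(x = x, y = y)

# Reduce the embeddings to the top n_pcs principal component scores.
pcs <- prcomp(emb, center = TRUE, scale. = FALSE)$x[, seq_len(n_pcs), drop = FALSE]

# Attach the scores to the data as PC_1, PC_2, ...
colnames(pcs) <- paste0("PC_", seq_len(n_pcs))
sim <- cbind(sim, pcs)

# Stand-in for causal_iv_fit_2sls(): first stage regresses the exposure on the
# instruments, second stage regresses the outcome on the fitted exposure.
first_stage <- lm(reformulate(colnames(pcs), response = "x"), data = sim)
sim$x_hat <- fitted(first_stage)
second_stage <- lm(y ~ x_hat, data = sim)

# Point estimate only: the standard errors reported by this second lm() are
# not valid 2SLS standard errors.
est <- coef(second_stage)[["x_hat"]]
```

The two-stage lm() shortcut reproduces the 2SLS point estimate but not its standard errors, which is one reason to delegate the final fit to a dedicated routine.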
Usage
causal_iv_from_embeddings(
  data,
  embeddings,
  exposure,
  outcome,
  covariates = NULL,
  n_pcs = 5L
)

Arguments
- data
  Data frame with exposure, outcome and any covariates.
- embeddings
  Numeric matrix with nrow(data) rows (one per data row) and any number of columns (embedding dimensions).
- exposure, outcome, covariates
  See causal_iv_fit_2sls().
- n_pcs
  Integer; number of top principal components to keep as instruments. Default 5L.
Value
An edaphos_causal_iv object (see causal_iv_fit_2sls()).
Details
Using the top n_pcs principal components instead of the raw
embedding dimensions keeps the instrument count small relative to
the sample size (2SLS with many instruments is biased toward the
OLS estimate) and makes the instruments mutually uncorrelated
in-sample (which simplifies the Sargan diagnostics). With the
default n_pcs = 5L, a single-exposure query has five instruments
for one endogenous regressor, i.e. 5 - 1 = 4 over-identifying
restrictions. The Sargan J-test requires at least one.
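The over-identification count and the Sargan statistic can be checked by hand. The sketch below uses simulated data and base R only. It is not how edaphos computes its diagnostics, and every object name in it is hypothetical. It uses the textbook form of the test: regress the 2SLS residuals on the instruments, take J = n * R^2, and compare it with a chi-squared distribution whose degrees of freedom equal the number of instruments minus the number of endogenous regressors.

```r
set.seed(42)
n <- 300; d <- 16; n_pcs <- 5L

# Simulated low-rank embeddings and their top n_pcs principal component scores.
latent <- matrix(rnorm(n * 3), n, 3)
emb <- latent %*% matrix(rnorm(3 * d), 3, d) + matrix(rnorm(n * d, sd = 0.5), n, d)
Z <- prcomp(emb)$x[, seq_len(n_pcs), drop = FALSE]

# PC scores are mutually uncorrelated in-sample: the off-diagonal
# correlations are zero up to floating-point error.
cz <- cor(Z)
max_off_diag <- max(abs(cz[upper.tri(cz)]))

# Exposure with an unobserved confounder u; true effect on y is 2.
u <- rnorm(n)
x <- as.numeric(latent %*% c(1, 0.5, -0.5)) + u + rnorm(n)
y <- 2 * x - u + rnorm(n)

# 2SLS point estimate via two stages.
x_hat <- fitted(lm(x ~ Z))
beta <- coef(lm(y ~ x_hat))

# 2SLS residuals use the observed exposure x, not the fitted x_hat.
res <- y - beta[[1]] - beta[[2]] * x

# Sargan statistic: n * R^2 from regressing the residuals on the instruments.
J <- n * summary(lm(res ~ Z))$r.squared

# One endogenous regressor and five instruments give 5 - 1 = 4 restrictions.
df_overid <- n_pcs - 1L
p_value <- pchisq(J, df = df_overid, lower.tail = FALSE)
```

With n_pcs = 1L the degrees of freedom would be zero and the test would be undefined, which is why at least two components are needed before the J-test can be reported.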