A convenient simplifying assumption in statistical modeling is that observations are independent and identically distributed (i.i.d.). This assumption allows us to use a single model to make predictions across all data points. But in practice, this assumption rarely holds. Data are collected across different individuals, environments, and tasks—each with their own characteristics, constraints, and dynamics. When the i.i.d. assumption breaks down, using a single global model can obscure meaningful heterogeneity.
To model this heterogeneity, a growing class of methods aims to make inference adaptive to context. These include varying-coefficient models in statistics, transfer and meta-learning in machine learning, and in-context learning in large foundation models. Though these approaches arise from different traditions, they share a common goal: to use contextual information—whether covariates, environments, or support sets—to inform sample-specific inference. Figure {@fig:overview-bridge} summarizes how statistics, meta-learning, and foundation models flow into a unified context→parameters view, which we formalize in (★).
{#fig:overview-bridge width="90%"}
We formalize this by assuming each observation carries its own parameter. Each unit $i$ contributes covariates $x_i \in \mathcal{X}$, a response $y_i$, and a context $c_i \in \mathcal{C}$, and is modeled with a unit-specific parameter $\theta_i \in \Theta$.

In population models, the assumption is that $\theta_i = \theta$ for all units, so a single shared parameter governs every observation.

This shift raises new modeling challenges. Estimating a unique $\theta_i$ from a single observation is ill-posed; context-adaptive inference becomes feasible only by imposing structure on how $\theta_i$ varies with $c_i$.

We study supervised prediction with units $i = 1, \dots, n$ observed as triples $(x_i, y_i, c_i)$, a loss $\ell(y, \hat y)$, and a predictor $\hat y = g_\theta(x)$.

In global (i.i.d.) models, a single $\widehat\theta$ minimizes the average loss $\tfrac{1}{n}\sum_{i} \ell\big(y_i, g_\theta(x_i)\big)$, possibly with a regularizer.

For a new unit with context $c$, a context-adaptive method instead solves

$$\widehat{\theta}(c) \in \arg\min_{\theta \in \Theta} \; \sum_{i=1}^{n} w_i(c)\, \ell\big(y_i, g_\theta(x_i)\big) + \mathcal{R}(\theta; c), \qquad (\star)$$

where the weights $w_i(c)$ determine which units inform context $c$ (equivalently, a weighted support set $S(c)$), and $\mathcal{R}(\theta; c)$ encodes how parameters may vary with context.
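As a concrete illustration, here is a minimal sketch of a context-weighted estimator in the spirit of (★), assuming squared loss and Gaussian kernel weights over context; the function name and toy data are hypothetical, not part of any method reviewed here.

```python
import numpy as np

def context_weighted_ridge(X, y, C, c_query, bandwidth=1.0, lam=1e-2):
    """Solve a context-weighted ridge problem: parameters are re-estimated
    for each query context, down-weighting training units whose context is
    far from c_query (a Gaussian kernel here; any weighting scheme works)."""
    # w_i(c) = exp(-||c_i - c||^2 / (2 h^2)): a soft neighborhood over context.
    d2 = np.sum((C - c_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    # Weighted ridge normal equations: (X^T W X + lam I) theta = X^T W y.
    XtW = X.T * w
    theta = np.linalg.solve(XtW @ X + lam * np.eye(X.shape[1]), XtW @ y)
    return theta

# Toy data: the true coefficient flips sign with a one-dimensional context.
rng = np.random.default_rng(0)
C = rng.uniform(-1, 1, size=(200, 1))
X = rng.normal(size=(200, 1))
y = np.sign(C[:, 0]) * X[:, 0] + 0.05 * rng.normal(size=200)

# Locally fitted slopes recover the sign flip across contexts.
theta_pos = context_weighted_ridge(X, y, C, c_query=np.array([0.8]), bandwidth=0.2)
theta_neg = context_weighted_ridge(X, y, C, c_query=np.array([-0.8]), bandwidth=0.2)
```

A global i.i.d. fit on the same data would average the two regimes toward a slope near zero; the context weights are what recover the heterogeneity.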
How context enters.

- Explicit parameterization: a map $f:\mathcal{C}\to\Theta$ sets $\theta_i = f(c_i)$ (e.g., varying-coefficient models, hierarchical Bayes, multi-task/meta-learning). Here $\mathcal{R}(\theta;c)$ typically regularizes $f$ (e.g., Lipschitz over $\mathcal{C}$, group lasso, low-rank structure).
- Implicit parameterization: context alters optimization or internal states without exposing $\theta$ directly (e.g., mixture-of-experts with gates $g(x,c)$; retrieval, where $S(c)$ is built by a retriever $R(c)$; in-context learning, where a prompt map $P(c)$ conditions a foundation model).
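The contrast between the two mechanisms can be made concrete with a small sketch. As an illustrative assumption (not a method from the literature reviewed here), we take a linear instance of the explicit map, $\theta(c) = B\,\phi(c)$, recoverable by least squares on joint features, and contrast it with an implicit predictor that never materializes $\theta$; all names, including `implicit_predict`, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dx, dc = 300, 2, 2
X = rng.normal(size=(n, dx))
Phi = rng.normal(size=(n, dc))                 # context features phi(c_i)
B_true = np.array([[1.0, -1.0], [0.5, 2.0]])
theta = Phi @ B_true.T                         # explicit: theta_i = B phi(c_i)
y = np.sum(theta * X, axis=1) + 0.05 * rng.normal(size=n)

# Explicit route: y = <B phi(c), x> is linear in vec(B) with features
# x ⊗ phi(c), so B is identified by ordinary least squares on those features.
Z = np.einsum('ni,nj->nij', X, Phi).reshape(n, dx * dc)
B_hat = np.linalg.lstsq(Z, y, rcond=None)[0].reshape(dx, dc)

# Implicit route: no theta is exposed; predict by attention-like softmax
# weighting of support labels, with similarity taken on the joint features.
def implicit_predict(z_query, Z_support, y_support, temp=1.0):
    logits = Z_support @ z_query / temp
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ y_support
```

The explicit route exposes interpretable parameters ($B$, hence $\theta(c)$ for any $c$); the implicit route produces only predictions, with adaptation hidden in the weighting over the support set.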
Emerging approaches blur this distinction. Instead of treating prompting (implicit) and fine-tuning (explicit) as separate mechanisms, models can map contextual information—such as a natural language task description—directly into parameter updates. Text-to-LoRA provides a concrete example of this paradigm, where context is encoded and used to generate task-specific parameters, effectively collapsing the boundary between implicit and explicit adaptation [@charakorn2025text].
To formalize this connection, we now return to a unified estimator.
For convenience, we use a context encoder $\phi:\mathcal{C}\to\mathbb{R}^{d_c}$ that maps raw context to features, and write $\psi_i = x_i \otimes \phi(c_i)$ for the induced joint features, so that $\langle \psi_a, \psi_b\rangle = \langle x_a, x_b\rangle \cdot \langle \phi(c_a), \phi(c_b)\rangle$.
Granularity. We refer to adaptation granularity as the level at which parameters are allowed to vary: per population, per task or group, or per individual sample. At any granularity, three design knobs control adaptation:

- Information via $S(c)$ or $P(c)$ (what context is exposed),
- Inductive bias via $\mathcal{R}(\theta;c)$ (how parameters may vary),
- Compute via warm-starts/caching/steps (how aggressively we solve (★) at test time).
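The compute knob can be made concrete with a minimal sketch, assuming squared loss: rather than solving (★) exactly for each query, take a few gradient steps on the weighted objective from a warm start. The function `adapt_steps` and the toy check are hypothetical illustrations, not a reviewed algorithm.

```python
import numpy as np

def adapt_steps(theta0, X, y, w, lam=1e-2, lr=0.1, steps=5):
    """Approximately solve a weighted ridge objective with a few gradient
    steps from a warm start theta0 (here zeros for simplicity; in practice
    the global solution). `steps` trades test-time compute for fidelity
    to the exact per-context solution."""
    theta = theta0.copy()
    n = len(y)
    for _ in range(steps):
        resid = X @ theta - y
        grad = X.T @ (w * resid) / n + lam * theta
        theta -= lr * grad
    return theta

# Toy check: more steps moves us closer to the exact weighted-ridge solution.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
w = np.ones(100)
A = X.T @ (w[:, None] * X) / 100 + 1e-2 * np.eye(3)
theta_exact = np.linalg.solve(A, X.T @ (w * y) / 100)
err_few = np.linalg.norm(adapt_steps(np.zeros(3), X, y, w, steps=1) - theta_exact)
err_many = np.linalg.norm(adapt_steps(np.zeros(3), X, y, w, steps=200) - theta_exact)
```

Caching the warm start (or intermediate factorizations) amortizes this cost across queries with similar contexts.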
Standing assumptions (used as needed).

(i) Exchangeability within context: conditional on $c$, the units entering $S(c)$ are exchangeable, so weighted pooling across them is justified.

(ii) Regularity: either $f$ is smooth in $c$ (e.g., Lipschitz) or $\mathcal{R}(\theta;c)$ is strongly convex in $\theta$, so nearby contexts yield nearby parameters.

(iii) Identifiability/stability: the minimizer of (★) is unique, or at least stable under small perturbations of $c$ and the weights $w_i(c)$.

(iv) Resource tracking: we track the test-time cost of adaptation (support-set size, memory, and optimization steps).
With this notation in place, we now formalize the link between explicit and implicit context adaptation. This link has been established in pieces across recent work; here we unify those results.
Recent theoretical work suggests that “explicit” context models (e.g., varying-coefficients, hierarchical/multitask) and “implicit” mechanisms (e.g., in-context learning via attention) often implement the same estimator class under squared loss, differing mainly in how they encode neighborhoods and regularization. Using the notation above, we make this precise.
Proposition 1 (Explicit varying-coefficients and linear ICL coincide with kernel ridge on joint features in the linear squared-loss setting).
Assume squared loss and the linear varying-coefficient regression model $y_i = \langle \theta(c_i), x_i\rangle + \varepsilon_i$ with $\mathbb{E}[\varepsilon_i \mid x_i, c_i] = 0$.
- (A) Explicit varying-coefficients. Let $\theta(c) = B\,\phi(c)$ with $B \in \mathbb{R}^{d_x \times d_c}$ and ridge penalty $\lambda \lVert B\rVert_F^2$. The weighted ridge solution yields
  $$ \widehat y(x,c)=k_{(x,c)}^\top \big(K+\lambda I\big)^{-1} y,\quad K_{ab}=\langle \psi_a,\psi_b\rangle=\langle x_a,x_b\rangle\cdot\langle \phi(c_a),\phi(c_b)\rangle, $$
  i.e., kernel ridge regression (KRR) on joint features.
- (B) Implicit adaptation via linear ICL. Let a single linear attention layer consume the weighted support set $S(c)$ with linear maps $q = Q\psi$, $k = K\psi$, $v = V\psi$ and a linear readout. With attention weights proportional to $w_{ij}(c)\cdot \langle q, k_{ij}\rangle$, the induced predictor equals KRR with kernel
  $$ k\big((x,c),(x',c')\big)=\langle q(x,c),k(x',c')\rangle, $$
  i.e., a learned dot-product kernel on the same joint features. If attention parameters are trained in the linearized/NTK regime, learning equals kernel regression with the network's NTK, which is again a dot-product kernel on linear transforms of $\psi$.
Corollary 1 (Retrieval, gating, and weighting are kernel/measure choices). Choosing the retriever $R(c)$, the gate $g(x,c)$, or the weights $w_i(c)$ amounts to choosing the support and weighting measure of the kernel in (★): each determines which units enter the Gram matrix and how strongly they count.
Proof in Appendix A.
Positioning and prior art. Proposition 1 is expository: part (A) is standard ridge⇔kernel duality on joint features; part (B) follows from (i) fixed attention + trained linear head = ridge on fixed features and (ii) NTK linearization ⇒ kernel regression with the network’s NTK.
Our contribution is the unified context-aware formulation: explicit design knobs via $\mathcal{R}(\theta;c)$, $S(c)$, and $P(c)$ within a single estimator (★), under which explicit and implicit adaptation mechanisms become directly comparable.
Limitations. This linear, squared-loss bridge captures a large class of explicit and implicit adaptors, but it abstracts away at least three realities: (i) non-quadratic losses (e.g., logistic) change the effective kernel and weighting through loss curvature; (ii) a nonlinear prediction head (e.g., an MLP or attention) introduces model curvature: its Jacobian and Hessian depend on the input and parameters, so the effective metric and weights change with the representation, and the fixed-kernel view from the linear case no longer holds; (iii) multi-modal context encoders (text, graphs, images) alter both the neighborhood definition and the regularizer. We view these as open extensions beyond the scope of this review.
In this review, we examine methods that use context to guide inference, either by specifying how parameters change with covariates or by learning to adapt behavior implicitly. We begin with classical models that impose explicit structure, such as varying-coefficient models and multi-task learning, and then turn to more flexible approaches like meta-learning and in-context learning with foundation models. Though these methods arise from different traditions, they share a common goal: to tailor inference to the local characteristics of each observation or task. Along the way, we highlight recurring themes: complex models often decompose into simpler, context-specific components; foundation models can both adapt to and generate context; and context-awareness challenges classical assumptions of homogeneity. These perspectives offer a unifying lens on recent advances and open new directions for building adaptive, interpretable, and personalized models.
Several surveys have examined specific aspects of context-adaptive inference, but they have largely remained confined to individual methodological traditions. Classical statistical surveys focus on varying-coefficient models and related structured regression methods. In machine learning, surveys on transfer and meta-learning emphasize task adaptation and shared representations, while recent work on foundation models explores the implicit adaptation capabilities of large pretrained models. Table 1 summarizes the scope and coverage of representative surveys.
| Survey | Topic Focus | Scope | Coverage of Adaptivity | Gap Relative to This Work |
|---|---|---|---|---|
| Statistical Methods with Varying Coefficient Models [@doi:10.4310/sii.2008.v1.n1.a15] | Varying-coefficient modeling | Classical statistical modeling, with parameters expressed as functions of covariates | Explicit adaptivity: parameters change smoothly with context via $f(c)$ | Limited to explicit, parametric formulations; no connection to neural or emergent adaptation |
| A Survey of Deep Meta-Learning[@doi:10.48550/arXiv.2010.03522] | Meta-learning | Neural meta-learning methods for cross-task adaptation | Task-level adaptivity: models learn to generalize quickly across tasks | Focused on task switching; does not integrate explicit parameter modeling or implicit foundation model adaptation |
| LoRA: Low-Rank Adaptation of Large Language Models[@doi:10.48550/arXiv.2106.09685] | Parameter-efficient adaptation | Adaptation of large pretrained transformer models via low-rank updates while freezing base weights | Implicit adaptivity via parameter-efficient updates, enabling contextual adaptation without full fine-tuning | Strong in efficient adaptation mechanism, but narrow in scope; does not address explicit contextual structure or cross-domain generalization |
| Foundational Models Defining a New Era in Vision: A Survey and Outlook[@doi:10.48550/arXiv.2307.13721] | Vision-based foundation models | Architectures, multimodal integration, prompting, fusion in vision models | Implicit adaptivity in vision contexts, via prompt or fusion mechanisms across visual tasks | Domain-specific focus limits generalization; less discussion on theoretical adaptation across modalities |
| A Comprehensive Survey on Pretrained Foundation Models[@doi:10.48550/arXiv.2302.09419] | Pretrained foundation models | Coverage of models across modalities, training regimes, adaptation and fine-tuning strategies | Implicit adaptivity via representation transfer and generalization across tasks | Broad in scope but does not deeply analyze parameter-level adaptation or explicit–implicit alignment |
Table 1: Representative surveys and key papers covering context-adaptive inference. Most works focus on a single methodological tradition and do not connect explicit and implicit approaches.
While existing surveys have reviewed individual components of this landscape—such as varying-coefficient models, meta-learning, or foundation models—they have remained largely siloed. This work brings together three traditions—statistical varying-coefficient models, meta-learning / transfer, and foundation-model prompting / retrieval—into a single mathematical view and a practical design checklist. By situating classical statistical models, modern machine learning methods, and foundation models along a shared spectrum of context-adaptive inference, we highlight common principles and distinctive challenges. The next section outlines the conceptual foundations of context-adaptive inference, preparing the ground for detailed discussions of explicit and implicit modeling approaches in later sections.