A convenient simplifying assumption in statistical modeling is that observations are independent and identically distributed (i.i.d.). This assumption allows us to use a single model to make predictions across all data points. But in practice, this assumption rarely holds. Data are collected across different individuals, environments, and tasks—each with their own characteristics, constraints, and dynamics. When the i.i.d. assumption breaks down, using a single global model can obscure meaningful heterogeneity.
To model this heterogeneity, a growing class of methods aims to make inference adaptive to context. These include varying-coefficient models in statistics, transfer and meta-learning in machine learning, and in-context learning in large foundation models. Though these approaches arise from different traditions, they share a common goal: to use contextual information—whether covariates, environments, or support sets—to inform sample-specific inference. Figure {@fig:overview-bridge} summarizes how statistics, meta-learning, and foundation models flow into a unified context→parameters view, which we formalize in (★).
{#fig:overview-bridge width="90%"}
We formalize this by assuming each observation carries its own parameter. Each unit $i$ contributes covariates $x_i \in \mathcal{X}$, a response $y_i$, and a context $c_i \in \mathcal{C}$, and is modeled with a unit-specific parameter $\theta_i \in \Theta$.

In population models, the assumption is that $\theta_i = \theta$ for all units, so a single shared parameter governs every observation.

This shift raises new modeling challenges. Estimating a unique $\theta_i$ from a single observation is ill-posed; context-adaptive inference becomes feasible only by imposing structure on how $\theta_i$ varies with $c_i$.

We study supervised prediction with units $i = 1, \dots, n$ observed as triples $(x_i, y_i, c_i)$, a loss $\ell(y, \hat y)$, and a predictor $\hat y = g_\theta(x)$.

In global (i.i.d.) models, a single $\widehat\theta$ minimizes the average loss $\tfrac{1}{n}\sum_{i} \ell\big(y_i, g_\theta(x_i)\big)$, possibly with a regularizer.

For a new unit with context $c$, a context-adaptive method instead solves

$$\widehat{\theta}(c) \in \arg\min_{\theta \in \Theta} \; \sum_{i=1}^{n} w_i(c)\, \ell\big(y_i, g_\theta(x_i)\big) + \mathcal{R}(\theta; c), \qquad (\star)$$

where the weights $w_i(c)$ determine which units inform context $c$ (equivalently, a weighted support set $S(c)$), and $\mathcal{R}(\theta; c)$ encodes how parameters may vary with context.
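As a concrete illustration, here is a minimal sketch of a context-weighted estimator in the spirit of (★), assuming squared loss and Gaussian kernel weights over context; the function name and toy data are hypothetical, not part of any method reviewed here.

```python
import numpy as np

def context_weighted_ridge(X, y, C, c_query, bandwidth=1.0, lam=1e-2):
    """Solve a context-weighted ridge problem: parameters are re-estimated
    for each query context, down-weighting training units whose context is
    far from c_query (a Gaussian kernel here; any weighting scheme works)."""
    # w_i(c) = exp(-||c_i - c||^2 / (2 h^2)): a soft neighborhood over context.
    d2 = np.sum((C - c_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    # Weighted ridge normal equations: (X^T W X + lam I) theta = X^T W y.
    XtW = X.T * w
    theta = np.linalg.solve(XtW @ X + lam * np.eye(X.shape[1]), XtW @ y)
    return theta

# Toy data: the true coefficient flips sign with a one-dimensional context.
rng = np.random.default_rng(0)
C = rng.uniform(-1, 1, size=(200, 1))
X = rng.normal(size=(200, 1))
y = np.sign(C[:, 0]) * X[:, 0] + 0.05 * rng.normal(size=200)

# Locally fitted slopes recover the sign flip across contexts.
theta_pos = context_weighted_ridge(X, y, C, c_query=np.array([0.8]), bandwidth=0.2)
theta_neg = context_weighted_ridge(X, y, C, c_query=np.array([-0.8]), bandwidth=0.2)
```

A global i.i.d. fit on the same data would average the two regimes toward a slope near zero; the context weights are what recover the heterogeneity.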
How context enters.

- Explicit parameterization: a map $f:\mathcal{C}\to\Theta$ sets $\theta_i = f(c_i)$ (e.g., varying-coefficient models, hierarchical Bayes, multi-task/meta-learning). Here $\mathcal{R}(\theta;c)$ typically regularizes $f$ (e.g., Lipschitz over $\mathcal{C}$, group lasso, low-rank structure).
- Implicit parameterization: context alters optimization or internal states without exposing $\theta$ directly (e.g., mixture-of-experts with gates $g(x,c)$; retrieval, where $S(c)$ is built by a retriever $R(c)$; in-context learning, where a prompt map $P(c)$ conditions a foundation model).
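The contrast between the two mechanisms can be made concrete with a small sketch. As an illustrative assumption (not a method from the literature reviewed here), we take a linear instance of the explicit map, $\theta(c) = B\,\phi(c)$, recoverable by least squares on joint features, and contrast it with an implicit predictor that never materializes $\theta$; all names, including `implicit_predict`, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dx, dc = 300, 2, 2
X = rng.normal(size=(n, dx))
Phi = rng.normal(size=(n, dc))                 # context features phi(c_i)
B_true = np.array([[1.0, -1.0], [0.5, 2.0]])
theta = Phi @ B_true.T                         # explicit: theta_i = B phi(c_i)
y = np.sum(theta * X, axis=1) + 0.05 * rng.normal(size=n)

# Explicit route: y = <B phi(c), x> is linear in vec(B) with features
# x ⊗ phi(c), so B is identified by ordinary least squares on those features.
Z = np.einsum('ni,nj->nij', X, Phi).reshape(n, dx * dc)
B_hat = np.linalg.lstsq(Z, y, rcond=None)[0].reshape(dx, dc)

# Implicit route: no theta is exposed; predict by attention-like softmax
# weighting of support labels, with similarity taken on the joint features.
def implicit_predict(z_query, Z_support, y_support, temp=1.0):
    logits = Z_support @ z_query / temp
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ y_support
```

The explicit route exposes interpretable parameters ($B$, hence $\theta(c)$ for any $c$); the implicit route produces only predictions, with adaptation hidden in the weighting over the support set.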
Emerging approaches blur this distinction. Instead of treating prompting (implicit) and fine-tuning (explicit) as separate mechanisms, models can map contextual information—such as a natural language task description—directly into parameter updates. Text-to-LoRA provides a concrete example of this paradigm, where context is encoded and used to generate task-specific parameters, effectively collapsing the boundary between implicit and explicit adaptation [@charakorn2025text].
To formalize this connection, we now return to a unified estimator.
For convenience, we use a context encoder $\phi:\mathcal{C}\to\mathbb{R}^{d_c}$ that maps raw context to features, and write $\psi_i = x_i \otimes \phi(c_i)$ for the induced joint features, so that $\langle \psi_a, \psi_b\rangle = \langle x_a, x_b\rangle \cdot \langle \phi(c_a), \phi(c_b)\rangle$.
Granularity. We refer to adaptation granularity as the level at which parameters are allowed to vary: per population, per task or group, or per individual sample. At any granularity, three design knobs control adaptation:

- Information via $S(c)$ or $P(c)$ (what context is exposed),
- Inductive bias via $\mathcal{R}(\theta;c)$ (how parameters may vary),
- Compute via warm-starts/caching/steps (how aggressively we solve (★) at test time).
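The compute knob can be made concrete with a minimal sketch, assuming squared loss: rather than solving (★) exactly for each query, take a few gradient steps on the weighted objective from a warm start. The function `adapt_steps` and the toy check are hypothetical illustrations, not a reviewed algorithm.

```python
import numpy as np

def adapt_steps(theta0, X, y, w, lam=1e-2, lr=0.1, steps=5):
    """Approximately solve a weighted ridge objective with a few gradient
    steps from a warm start theta0 (here zeros for simplicity; in practice
    the global solution). `steps` trades test-time compute for fidelity
    to the exact per-context solution."""
    theta = theta0.copy()
    n = len(y)
    for _ in range(steps):
        resid = X @ theta - y
        grad = X.T @ (w * resid) / n + lam * theta
        theta -= lr * grad
    return theta

# Toy check: more steps moves us closer to the exact weighted-ridge solution.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
w = np.ones(100)
A = X.T @ (w[:, None] * X) / 100 + 1e-2 * np.eye(3)
theta_exact = np.linalg.solve(A, X.T @ (w * y) / 100)
err_few = np.linalg.norm(adapt_steps(np.zeros(3), X, y, w, steps=1) - theta_exact)
err_many = np.linalg.norm(adapt_steps(np.zeros(3), X, y, w, steps=200) - theta_exact)
```

Caching the warm start (or intermediate factorizations) amortizes this cost across queries with similar contexts.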
Standing assumptions (used as needed).

(i) Exchangeability within context: conditional on $c$, the units entering $S(c)$ are exchangeable, so weighted pooling across them is justified.

(ii) Regularity: either $f$ is smooth in $c$ (e.g., Lipschitz) or $\mathcal{R}(\theta;c)$ is strongly convex in $\theta$, so nearby contexts yield nearby parameters.

(iii) Identifiability/stability: the minimizer of (★) is unique, or at least stable under small perturbations of $c$ and the weights $w_i(c)$.

(iv) Resource tracking: we track the test-time cost of adaptation (support-set size, memory, and optimization steps).
With this notation in place, we now formalize the link between explicit and implicit context adaptation. This link has been established in pieces across recent work; here we unify those results.
Recent theoretical work suggests that “explicit” context models (e.g., varying-coefficients, hierarchical/multitask) and “implicit” mechanisms (e.g., in-context learning via attention) often implement the same estimator class under squared loss, differing mainly in how they encode neighborhoods and regularization. Using the notation above, we make this precise.
Proposition 1 (Explicit varying-coefficients and linear ICL coincide with kernel ridge on joint features in the linear squared-loss setting).
Assume squared loss and the linear varying-coefficient regression model $y_i = \langle \theta(c_i), x_i\rangle + \varepsilon_i$ with $\mathbb{E}[\varepsilon_i \mid x_i, c_i] = 0$.
- (A) Explicit varying-coefficients. Let $\theta(c) = B\,\phi(c)$ with $B \in \mathbb{R}^{d_x \times d_c}$ and ridge penalty $\lambda \lVert B\rVert_F^2$. The weighted ridge solution yields
  $$ \widehat y(x,c)=k_{(x,c)}^\top \big(K+\lambda I\big)^{-1} y,\quad K_{ab}=\langle \psi_a,\psi_b\rangle=\langle x_a,x_b\rangle\cdot\langle \phi(c_a),\phi(c_b)\rangle, $$
  i.e., kernel ridge regression (KRR) on joint features.
- (B) Implicit adaptation via linear ICL. Let a single linear attention layer consume the weighted support set $S(c)$ with linear maps $q = Q\psi$, $k = K\psi$, $v = V\psi$ and a linear readout. With attention weights proportional to $w_{ij}(c)\cdot \langle q, k_{ij}\rangle$, the induced predictor equals KRR with kernel
  $$ k\big((x,c),(x',c')\big)=\langle q(x,c),k(x',c')\rangle, $$
  i.e., a learned dot-product kernel on the same joint features. If attention parameters are trained in the linearized/NTK regime, learning equals kernel regression with the network's NTK, which is again a dot-product kernel on linear transforms of $\psi$.
Corollary 1 (Retrieval, gating, and weighting are kernel/measure choices). Choosing the retriever $R(c)$, the gate $g(x,c)$, or the weights $w_i(c)$ amounts to choosing the support and weighting measure of the kernel in (★): each determines which units enter the Gram matrix and how strongly they count.
Proof in Appendix A.
Positioning and prior art. Proposition 1 is expository: part (A) is standard ridge⇔kernel duality on joint features; part (B) follows from (i) fixed attention + trained linear head = ridge on fixed features and (ii) NTK linearization ⇒ kernel regression with the network’s NTK.
Our contribution is the unified context-aware formulation: explicit design knobs via $\mathcal{R}(\theta;c)$, $S(c)$, and $P(c)$ within a single estimator (★), under which explicit and implicit adaptation mechanisms become directly comparable.
Limitations. This linear, squared-loss bridge captures a large class of explicit and implicit adaptors, but it abstracts away at least three realities: (i) non-quadratic losses (e.g., logistic) change the effective kernel and weighting through loss curvature; (ii) a nonlinear prediction head (e.g., an MLP or attention) introduces model curvature: its Jacobian and Hessian depend on the input and parameters, so the effective metric and weights change with the representation, and the fixed-kernel view from the linear case no longer holds; (iii) multi-modal context encoders (text, graphs, images) alter both the neighborhood definition and the regularizer. We view these as open extensions beyond the scope of this review.
In this review, we examine methods that use context to guide inference, either by specifying how parameters change with covariates or by learning to adapt behavior implicitly. We begin with classical models that impose explicit structure, such as varying-coefficient models and multi-task learning, and then turn to more flexible approaches like meta-learning and in-context learning with foundation models. Though these methods arise from different traditions, they share a common goal: to tailor inference to the local characteristics of each observation or task. Along the way, we highlight recurring themes: complex models often decompose into simpler, context-specific components; foundation models can both adapt to and generate context; and context-awareness challenges classical assumptions of homogeneity. These perspectives offer a unifying lens on recent advances and open new directions for building adaptive, interpretable, and personalized models.
Several surveys have examined specific aspects of context-adaptive inference, but they have largely remained confined to individual methodological traditions. Classical statistical surveys focus on varying-coefficient models and related structured regression methods. In machine learning, surveys on transfer and meta-learning emphasize task adaptation and shared representations, while recent work on foundation models explores the implicit adaptation capabilities of large pretrained models. Table 1 summarizes the scope and coverage of representative surveys.
| Survey | Topic Focus | Scope | Coverage of Adaptivity | Gap Relative to This Work |
|---|---|---|---|---|
| Statistical Methods with Varying Coefficient Models [@doi:10.4310/sii.2008.v1.n1.a15] | Varying-coefficient modeling | Classical statistical modeling, with parameters expressed as functions of covariates | Explicit adaptivity: parameters change smoothly with context via $f(c)$ | Limited to explicit, parametric formulations; no connection to neural or emergent adaptation |
| A Survey of Deep Meta-Learning[@doi:10.48550/arXiv.2010.03522] | Meta-learning | Neural meta-learning methods for cross-task adaptation | Task-level adaptivity: models learn to generalize quickly across tasks | Focused on task switching; does not integrate explicit parameter modeling or implicit foundation model adaptation |
| LoRA: Low-Rank Adaptation of Large Language Models[@doi:10.48550/arXiv.2106.09685] | Parameter-efficient adaptation | Adaptation of large pretrained transformer models via low-rank updates while freezing base weights | Implicit adaptivity via parameter-efficient updates, enabling contextual adaptation without full fine-tuning | Strong in efficient adaptation mechanism, but narrow in scope; does not address explicit contextual structure or cross-domain generalization |
| Foundational Models Defining a New Era in Vision: A Survey and Outlook[@doi:10.48550/arXiv.2307.13721] | Vision-based foundation models | Architectures, multimodal integration, prompting, fusion in vision models | Implicit adaptivity in vision contexts, via prompt or fusion mechanisms across visual tasks | Domain-specific focus limits generalization; less discussion on theoretical adaptation across modalities |
| A Comprehensive Survey on Pretrained Foundation Models[@doi:10.48550/arXiv.2302.09419] | Pretrained foundation models | Coverage of models across modalities, training regimes, adaptation and fine-tuning strategies | Implicit adaptivity via representation transfer and generalization across tasks | Broad in scope but does not deeply analyze parameter-level adaptation or explicit–implicit alignment |
Table 1: Representative surveys and key papers covering context-adaptive inference. Most works focus on a single methodological tradition and do not connect explicit and implicit approaches.
While existing surveys have reviewed individual components of this landscape—such as varying-coefficient models, meta-learning, or foundation models—they have remained largely siloed. This work brings together three traditions—statistical varying-coefficient models, meta-learning / transfer, and foundation-model prompting / retrieval—into a single mathematical view and a practical design checklist. By situating classical statistical models, modern machine learning methods, and foundation models along a shared spectrum of context-adaptive inference, we highlight common principles and distinctive challenges. The next section outlines the conceptual foundations of context-adaptive inference, preparing the ground for detailed discussions of explicit and implicit modeling approaches in later sections.