Early research data came mostly from strictly designed experiments, such as agricultural field trials or psychological studies. Such data were typically small in scale and simple in structure, and researchers usually assumed that observations were independent of one another and identically distributed [@Fisher1949Design]. Under this setting there is no dependency among the data, and analysis focuses mainly on the overall average level or effect.
Linear models emerged as the fundamental approach for analyzing such data. One of the earliest methods in their development, the method of least squares, was first published by Legendre [@Legendre1805Comets] and later independently developed and justified by Gauss [@Gauss1809Ambientium]. By minimizing the squared deviations between observations and fitted values, the method offered a general framework for fitting a regression line to observed data. The estimator is expressed as

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \, \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2,$$

which has a closed-form solution,

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.$$

This provided a systematic procedure for estimating unknown parameters from independently and identically distributed data, laying the foundation of regression analysis.
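As a minimal sketch of the closed-form estimator, the normal equations can be solved directly; the data below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y = X @ beta + noise, then recover beta with the normal equations.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed-form least-squares estimate: (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 2))
```

Solving the linear system with `np.linalg.solve` avoids explicitly forming the matrix inverse, which is the numerically preferred route in practice.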
The establishment of generalized linear models (GLMs) extended the concepts of linear regression to non-Gaussian outcomes. Nelder and Wedderburn [@doi:10.2307/2344614] introduced the central idea of connecting the mean of the response variable to a linear predictor through a link function, which was later formalized into the now-standard compact notation [@doi:10.1007/978-1-4899-3242-6]:

$$g(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta}, \qquad \mu_i = \mathbb{E}(y_i).$$
Before this unified framework emerged, one of its first instances was the use of the logistic function for binary data. Berkson advocated its application to biomedical dose–response studies, illustrating how the probabilities of success or failure in experimental settings could be captured by the link formulation [@doi:10.1080/01621459.1944.10500699]. Building on this, logistic regression was formalized as a regression model for binary sequences, establishing the logit link [@doi:10.1111/j.2517-6161.1958.tb00292.x]:

$$\log\!\left(\frac{p_i}{1 - p_i}\right) = \mathbf{x}_i^\top \boldsymbol{\beta}.$$
For count data, Poisson regression was introduced within the GLM framework, employing the log link [@doi:10.2307/2344614]:

$$\log(\lambda_i) = \mathbf{x}_i^\top \boldsymbol{\beta}, \qquad y_i \sim \mathrm{Poisson}(\lambda_i).$$
These developments strengthened GLMs as a unifying framework for a variety of independently and identically distributed data types by extending linear modeling to categorical and count outcomes.
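As a hedged sketch of how the logit link is fit in practice, the Newton–Raphson (equivalently, iteratively reweighted least squares) loop below recovers the coefficients of a simulated logistic model; the data and settings are illustrative, not drawn from any study cited here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate binary outcomes from a known logistic model.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.5])
prob = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, prob)

# Newton-Raphson iterations for the logit link.
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))   # fitted probabilities
    W = mu * (1 - mu)                  # variance function for the binomial
    grad = X.T @ (y - mu)              # score vector
    hess = X.T @ (X * W[:, None])      # Fisher information
    beta = beta + np.linalg.solve(hess, grad)
print(np.round(beta, 2))
```

The same loop generalizes to other GLMs by swapping in the appropriate link and variance function.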
Alongside regression, analysis of variance (ANOVA) was another early milestone of statistical methodology. ANOVA introduced the concept of splitting total variance into components reflecting within-group and between-group differences [@Fisher1925Research]. Its central quantity is the F ratio,

$$F = \frac{\mathrm{SS}_{\text{between}}/(k-1)}{\mathrm{SS}_{\text{within}}/(N-k)},$$

for $k$ groups and $N$ total observations,
which provided a consistent framework for determining the significance of differences between group means and established the foundation of data analysis in modern experiments.
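The F ratio can be computed from raw group data in a few lines; the three simulated groups below (two with equal means, one shifted) are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three groups of 30 observations; the third group's mean is shifted by 1.
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (0.0, 0.0, 1.0)]
grand = np.mean(np.concatenate(groups))
k = len(groups)
N = sum(len(g) for g in groups)

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# The F ratio: mean square between over mean square within.
F = (ss_between / (k - 1)) / (ss_within / (N - k))
print(round(F, 2))
```

Large values of F indicate that between-group variation is too large to be explained by within-group noise alone.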
Together, these early statistical frameworks provided the foundation for analysis of independently and identically distributed data. However, as studies became more complex and observations were no longer truly independent, new methods were needed, leading to the development of hierarchical models.
With the expansion of research fields, hierarchical structure gradually emerged in the collected empirical data. For example, bulls are nested within sires and herds [@doi:10.1080/00288233.1981.10420865]; respondents are nested within a population of subjects [@doi:10.1037/h0040957]; and repeated measurements of the same unit at different time points induce longitudinal correlation. Such data are more complex than independent and identically distributed samples: they exhibit differences among groups while also requiring that changes in individuals over time be modeled simultaneously. The observations are no longer independent of one another; instead, there is a distinct hierarchical effect.
Early work on hierarchical dependence began with linear models containing both fixed and random effects for estimating genetic parameters [@Henderson1950Genetic]. This approach laid the statistical basis for partitioning variability into systematic and random components, which in turn shaped how intraclass correlation would later be understood and applied. A general form of such linear models can be written as:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon},$$

where $\boldsymbol{\beta}$ collects the fixed effects and $\mathbf{u}$ the random effects.
Around the same period, it became clear that comparative rate estimation in clinical and epidemiological studies depends on the structure of the sampled subgroups, with heterogeneity in source populations affecting inference [@doi:10.1093/jnci/11.6.1269]. Building on this recognition of heterogeneity, applications expanded to repeated measurements and longitudinal structures; restricted maximum likelihood (REML) estimation was introduced to improve variance component inference in unbalanced settings [@doi:10.1093/biomet/58.3.545]. This methodological foundation enabled the formulation of linear mixed-effects models (LMMs) that combine fixed effects for population-level trends with random effects for subject- or group-specific deviations [@doi:10.2307/2529876]. In matrix notation, these models take the form

$$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i + \boldsymbol{\varepsilon}_i, \qquad \mathbf{b}_i \sim N(\mathbf{0}, \mathbf{D}), \quad \boldsymbol{\varepsilon}_i \sim N(\mathbf{0}, \boldsymbol{\Sigma}_i),$$
and later became widely used in biostatistics and the social sciences. To extend mixed-effects modeling beyond Gaussian responses, practical estimation procedures for generalized linear models with random effects were proposed, enabling the application of link functions to clustered binary, count, or categorical outcomes [@doi:10.1093/biomet/78.4.719]. Further methodological advances introduced penalized quasi-likelihood and approximate inference techniques that made practical application feasible in many fields, especially medicine and biology [@doi:10.1080/01621459.1993.10594284]. These developments gave rise to generalized linear mixed models (GLMMs), formulated as

$$g\!\left(\mathbb{E}[y_{ij} \mid \mathbf{b}_i]\right) = \mathbf{x}_{ij}^\top \boldsymbol{\beta} + \mathbf{z}_{ij}^\top \mathbf{b}_i,$$

combining link functions with both fixed effects $\boldsymbol{\beta}$ and random effects $\mathbf{b}_i$ in a single framework.
Nevertheless, it was not until later that Bayesian hierarchical models became practically influential. In the 1980s, the demonstration of Gibbs sampling in image analysis paved the way for its adoption in hierarchical Bayesian modeling [@doi:10.1109/TPAMI.1984.4767596]. Later, advances in Markov chain Monte Carlo methods provided the computational tools needed to fit Bayesian hierarchical models, making them practical for applied researchers [@doi:10.1201/b16018].
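A toy Gibbs sampler conveys the idea on a bivariate normal with correlation `rho`, where both full conditionals are univariate normal; all values here are assumptions for illustration, not taken from the works cited above.

```python
import math
import random

random.seed(3)

# Gibbs sampling for a standard bivariate normal with correlation rho:
# x | y ~ N(rho * y, 1 - rho^2) and y | x ~ N(rho * x, 1 - rho^2),
# so the chain simply alternates the two conditional draws.
rho = 0.8
x, y = 0.0, 0.0
xs = []
for t in range(20000):
    x = random.gauss(rho * y, math.sqrt(1 - rho ** 2))
    y = random.gauss(rho * x, math.sqrt(1 - rho ** 2))
    if t >= 1000:          # discard burn-in iterations
        xs.append(x)

# The retained draws approximate the N(0, 1) marginal of x.
mean_x = sum(xs) / len(xs)
var_x = sum((v - mean_x) ** 2 for v in xs) / len(xs)
print(round(mean_x, 2), round(var_x, 2))
```

The same alternating-conditional scheme scales to hierarchical models whose full conditionals are available in closed form.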
Collectively, these developments shifted statistical modeling from independence assumptions to explicitly capturing correlation and hierarchical structure. As methods for nested data improved, new problems arose with functional, continuous, and high-dimensional observations. This led to the development of approaches for curves, trajectories, and large feature spaces, which in turn led to functional data analysis and high-dimensional inference.
As data collection advanced into the 1980s and 1990s, researchers began encountering observations that are entire curves or functions (e.g., time series, spectra, images) rather than fixed-dimensional vectors. Functional Data Analysis (FDA) [@doi:10.1002/0470013192.bsa239] treats each observation as a smooth function $x_i(t)$ defined over a continuum $t \in \mathcal{T}$, where each curve is regarded as a single realization of an underlying stochastic process, so that smoothing, functional principal component analysis, and functional regression operate directly on the functions themselves.
As data dimensionality and complexity continued to grow (for instance with genomic or imaging data), there was a shift toward automated feature learning. Representation learning [@doi:10.1109/TPAMI.2013.50] generalizes classical dimension reduction (like PCA) to nonlinear, data-driven embeddings. Neural autoencoders are a prototypical example: one trains an encoder network $f_\theta : \mathbb{R}^p \to \mathbb{R}^k$ together with a decoder $g_\phi : \mathbb{R}^k \to \mathbb{R}^p$ to minimize the reconstruction error $\sum_i \lVert x_i - g_\phi(f_\theta(x_i)) \rVert^2$. Here $k \ll p$, so the learned code $z_i = f_\theta(x_i)$ serves as a nonlinear low-dimensional representation of the data.
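With squared loss and linear encoder and decoder, the autoencoder objective is minimized by the principal subspace; the sketch below computes that linear special case via the SVD rather than training a network, and the simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Data lying near a 2-dimensional subspace of R^10, plus small noise.
n, p, k = 300, 10, 2
Z = rng.normal(size=(n, k))
W = rng.normal(size=(k, p))
X = Z @ W + 0.05 * rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

# The top-k right singular vectors span the optimal linear "encoder".
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
encode = Vt[:k].T           # p x k projection (encoder weights)
codes = Xc @ encode         # low-dimensional embedding (the codes)
recon = codes @ encode.T    # linear "decoder" maps codes back to R^p

rel_err = np.linalg.norm(Xc - recon) / np.linalg.norm(Xc)
print(round(rel_err, 3))
```

Replacing the linear maps with nonlinear networks yields the general autoencoder, which can capture curved low-dimensional structure that PCA misses.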
With the proliferation of diverse tasks and domains, statistical learning shifted toward methods that transfer information across problems. Multi-task learning (MTL) [@doi:10.1023/A:1007379606734] arose in response: it jointly models related tasks to improve performance, especially when each task has limited data. Concretely, if we have $T$ related tasks with data $\{(x_i^{(t)}, y_i^{(t)})\}_{i=1}^{n_t}$ for $t = 1, \dots, T$, MTL fits task-specific parameters $w_t$ under a shared structure, for example by minimizing

$$\sum_{t=1}^{T} \sum_{i=1}^{n_t} \ell\!\left(f(x_i^{(t)}; w_t),\, y_i^{(t)}\right) + \lambda\, \Omega(w_1, \dots, w_T),$$

where $\Omega$ is a regularizer that couples the tasks (for instance by penalizing deviations of each $w_t$ from a shared mean), so that statistical strength is borrowed across related problems.
Another fundamental challenge is distribution shift between training and deployment. Covariate shift occurs when the input distribution $p(x)$ changes between training and test data while the conditional distribution $p(y \mid x)$ remains fixed; under this assumption, importance weighting by the density ratio $p_{\text{test}}(x)/p_{\text{train}}(x)$ corrects the training objective.
More generally, domain adaptation methods address cases where both $p(x)$ and $p(y \mid x)$ may differ between source and target domains, typically by learning representations whose distributions align across domains.
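A small simulation sketches importance weighting under covariate shift: training inputs follow N(0, 1), deployment inputs N(1, 1), and density-ratio weights recover the deployment-time mean of $y$. All distributions and constants here are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(5)

# Training inputs ~ N(0,1); with y = x^2 + noise, the naive training
# mean of y estimates E[y] under training, not under deployment (N(1,1)).
x_tr = rng.normal(0.0, 1.0, size=20000)
y_tr = x_tr ** 2 + 0.1 * rng.normal(size=20000)

def density(x, m):
    # Standard normal density centered at m (unit variance).
    return np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)

# Importance weights p_test(x) / p_train(x) reweight the training sample.
w = density(x_tr, 1.0) / density(x_tr, 0.0)
naive = y_tr.mean()                       # targets E[y] under N(0,1): ~1
shifted = np.sum(w * y_tr) / np.sum(w)    # targets E[y] under N(1,1): ~2
print(round(naive, 2), round(shifted, 2))
```

The self-normalized form used for `shifted` only requires the density ratio up to a constant, which is convenient when the densities are estimated rather than known.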
Finally, the combination of many tasks and very few examples per task has given rise to few-shot learning. In few-shot learning [@doi:10.1145/3386252], the goal is to generalize to new classes or tasks from only a handful of labeled examples. Modern approaches typically leverage prior experience across tasks or classes. Metric-based methods, such as Matching Networks and Prototypical Networks [@doi:10.48550/arXiv.1606.04080; @doi:10.48550/arXiv.1703.05175], learn an embedding $f_\theta$ in which classification reduces to comparing a query point with a small support set, for instance by assigning it to the class of the nearest prototype (the mean embedding of that class's support examples).
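A minimal nearest-prototype classifier conveys the metric-based idea; the identity map stands in for a learned embedding, and the 2-way, 5-shot episode is simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

def prototypes(support_x, support_y):
    # One prototype per class: the mean of that class's support embeddings.
    classes = sorted(set(support_y))
    protos = np.stack([support_x[np.array(support_y) == c].mean(axis=0)
                       for c in classes])
    return classes, protos

def classify(query, classes, protos):
    # Label the query by its nearest prototype in the embedding space.
    d = np.linalg.norm(protos - query, axis=1)
    return classes[int(np.argmin(d))]

# 2-way 5-shot episode: two well-separated Gaussian classes in R^2.
sx = np.concatenate([rng.normal(0.0, 0.3, size=(5, 2)),
                     rng.normal(3.0, 0.3, size=(5, 2))])
sy = [0] * 5 + [1] * 5
classes, protos = prototypes(sx, sy)
print(classify(np.array([0.1, -0.2]), classes, protos),
      classify(np.array([2.9, 3.2]), classes, protos))
```

In an actual prototypical network, `support_x` would first pass through a trained encoder so that class clusters become separable in the learned space.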
In the Internet era, the way data is collected has undergone significant changes. User behavior data, sensor data, and platform experimental data exhibit the characteristics of streaming and interactivity [@doi:10.1145/1772690.1772758]. The generation of dynamic data poses new challenges to model construction and problem-solving. Data is not collected all at once but is generated in real time and is often related to the feedback loop of the system [@doi:10.1115/1.3662552]. This type of data is more dynamic and complex compared to traditional experimental or measurement data, involving both time dependence and continuous influence from the environment and interaction.
A natural starting point for this challenge is online learning [@doi:10.48550/arXiv.1802.02871]. A foundational formulation is the Online Convex Optimization (OCO) framework [@doi:10.1561/2200000018]. At each round $t = 1, \dots, T$, the learner chooses a decision $w_t$ from a convex set $\mathcal{W}$, the environment reveals a convex loss $f_t$, and performance is measured by the regret

$$R_T = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} f_t(w),$$

which compares the learner's cumulative loss with that of the best fixed decision in hindsight. A simple and influential algorithm in this framework is Online Gradient Descent (OGD). Given a subgradient $g_t \in \partial f_t(w_t)$, OGD updates

$$w_{t+1} = \Pi_{\mathcal{W}}\!\left(w_t - \eta_t g_t\right),$$

where $\eta_t$ is a step size and $\Pi_{\mathcal{W}}$ denotes projection onto $\mathcal{W}$; with $\eta_t \propto 1/\sqrt{t}$, OGD guarantees $O(\sqrt{T})$ regret for convex losses.
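A short sketch of OGD on a stream of quadratic losses, with a $1/\sqrt{t}$ step-size schedule and regret measured against the best fixed decision in hindsight; the loss family and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stream of quadratic losses f_t(w) = (w - z_t)^2, with targets z_t
# drawn around 2.0; the learner starts far away at w = 0.
z = rng.normal(loc=2.0, scale=0.5, size=2000)

w, total_loss = 0.0, 0.0
for t, zt in enumerate(z, start=1):
    total_loss += (w - zt) ** 2       # loss suffered at round t
    grad = 2 * (w - zt)               # (sub)gradient of f_t at w_t
    w -= 0.5 * grad / np.sqrt(t)      # OGD step; no projection needed here

# Best fixed decision in hindsight is the sample mean of the targets.
best = float(np.sum((z.mean() - z) ** 2))
avg_regret = (total_loss - best) / len(z)
print(round(avg_regret, 4))
```

The average regret shrinks as the horizon grows, matching the sublinear (here $O(\sqrt{T})$-type) guarantee of the framework.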
While OCO provides guarantees with respect to a fixed comparator, real-world data streams rarely remain stationary. The statistical relationship between inputs $x_t$ and outputs $y_t$ can itself change over time, a phenomenon known as concept drift. A common adaptive strategy is to monitor a statistic such as a windowed error rate and to flag a change whenever it exceeds a predefined threshold $\delta$, at which point the model is retrained or its adaptation rate is increased.
Though adaptive learning addresses the challenge of non-stationarity, it still operates under a full-information feedback model: after each round, the entire loss function $f_t$ is revealed to the learner. In many interactive systems, however, only the loss of the action actually taken is observed; this partial-feedback setting is formalized by the multi-armed bandit problem. The arm $a_t$ chosen at round $t$ yields a reward drawn from an unknown distribution with mean $\mu_{a_t}$, and performance is measured by the regret

$$R_T = T\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right],$$

where $\mu^* = \max_a \mu_a$ is the mean reward of the best arm, so the learner must balance exploring uncertain arms against exploiting the empirically best one.
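An epsilon-greedy strategy illustrates bandit feedback, where only the pulled arm's reward is observed; the arm means and parameters below are assumptions for the example.

```python
import random

random.seed(10)

# 3-armed Bernoulli bandit: only the chosen arm's reward is revealed.
true_means = [0.2, 0.5, 0.8]
counts = [0, 0, 0]
means = [0.0, 0.0, 0.0]     # running empirical mean reward per arm
eps, T = 0.1, 5000

total = 0.0
for t in range(T):
    if random.random() < eps:
        a = random.randrange(3)                       # explore uniformly
    else:
        a = max(range(3), key=lambda i: means[i])     # exploit best estimate
    r = 1.0 if random.random() < true_means[a] else 0.0
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]            # incremental mean update
    total += r

best_arm = max(range(3), key=lambda i: means[i])
print(best_arm, round(total / T, 2))
```

More refined strategies such as upper-confidence-bound rules replace the fixed exploration rate with exploration bonuses that shrink as arms are sampled.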
RL formalizes sequential decision-making under uncertainty through the lens of Markov Decision Processes (MDPs) [@Bellman1957Markovian]. An MDP is defined by a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition kernel $P(s' \mid s, a)$, a reward function $r(s, a)$, and a discount factor $\gamma \in [0, 1)$; the goal is to learn a policy $\pi(a \mid s)$ that maximizes the expected discounted return $\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$.
Classical solution approaches include value-based methods (e.g., Q-learning) [@Watkins1989Learning], which estimate action-value functions and act greedily with respect to them, and policy-based methods (e.g., policy gradient, actor–critic) [@doi:10.1023/a:1022672621406; @doi:10.1109/tsmc.1983.6313077], which directly optimize the policy parameters via stochastic gradient ascent. These methods come with theoretical guarantees and have enabled applications ranging from game playing (e.g., Go, Atari) [@doi:10.1038/nature14236] to robotics.
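A tabular Q-learning sketch on a toy chain MDP shows the value-based recipe: temporal-difference updates on $Q(s, a)$ followed by greedy action selection. The environment and constants are invented for illustration.

```python
import random

random.seed(8)

# 5-state chain: actions move left (0) or right (1); reaching the
# terminal state 4 yields reward 1, every other transition reward 0.
n_states, gamma, alpha, eps = 5, 0.9, 0.5, 0.2
Q = [[0.0, 0.0] for _ in range(n_states)]   # Q[s][a]

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

for _ in range(500):                        # training episodes
    s, done = 0, False
    while not done:
        # Epsilon-greedy behavior policy.
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = int(Q[s][1] > Q[s][0])
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])   # TD update toward target
        s = s2

# The greedy policy should move right toward the goal from every state.
policy = [int(Q[s][1] > Q[s][0]) for s in range(n_states - 1)]
print(policy)
```

The learned action values decay geometrically with distance from the goal ($\gamma^k$), which is exactly the discounted-return structure the MDP formalism encodes.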
The trajectory from online convex optimization to reinforcement learning highlights how the statistical study of streaming, interactive data has evolved into increasingly expressive frameworks for context-adaptive inference. As data collection becomes more interactive and non-stationary, the ability to learn from context and feedback loops is central to adaptive intelligence. This theme naturally connects to the next section on multimodal data, where adaptivity must also span across modalities.
With the advancement of digitalization, research and applications have begun to simultaneously involve multiple types of data, such as images [@doi:10.1088/2040-8978/17/7/073001], audio [@doi:10.1121/1.1912679], and text [@doi:10.1145/361219.361220]. These data are not only high-dimensional but also show clear structural and representational differences. Text consists of discrete symbolic sequences that carry semantic and grammatical structure; images are spatially organized pixel matrices that reflect local spatial correlations and global patterns; and audio is a continuous waveform signal that captures dynamic temporal characteristics. Compared with earlier single-type numerical or functional data, multimodal data are more heterogeneous and complex. These new challenges required researchers to explore how to establish semantic correspondences across modalities for cross-modal alignment, to build joint models through shared latent spaces, and to dynamically adjust inter-modal dependencies according to task or context.
From a statistical perspective, representation learning [@doi:10.1109/TPAMI.2013.50] can be regarded as a generalization of traditional linear dimension reduction techniques such as Principal Component Analysis (PCA) and Factor Analysis (FA). While PCA and FA identify low-dimensional subspaces that capture maximum variance or shared covariance structure through linear projections, representation learning extends this idea to nonlinear mappings that can model highly complex and structured data, automatically learning embeddings that map heterogeneous observations into a shared latent space. The basic idea is to learn a nonlinear mapping $f_\theta : \mathcal{X} \to \mathcal{Z}$ that transforms a raw observation $x$ into a latent representation $z = f_\theta(x)$, where $\mathcal{Z}$ is a low-dimensional latent space in which semantically related observations, possibly from different modalities, lie close together.
To infer hidden structure, traditional latent variable models in statistics, such as Factor Analysis and Gaussian Mixture Models (GMMs), require estimating the posterior distribution $p(z \mid x)$ over latent variables, which classically involves a separate iterative procedure (such as an E-step or a sampling run) for every observation.
Amortized inference introduces a major innovation by parameterizing the approximate posterior as a learnable function $q_\phi(z \mid x)$, typically a neural network that maps each observation directly to the parameters of its approximate posterior, as in the variational autoencoder. The model can then adjust its posterior estimate $q_\phi(z \mid x)$ for each new input through a single forward pass, rather than rerunning an iterative inference procedure from scratch.
In many real-world scenarios, the amount of labeled data for each task may be limited. Meta-learning, or learning to learn, provides a framework for acquiring inductive knowledge that enables rapid adaptation to new tasks with few examples. The central idea is to learn a meta-model that captures transferable structure across tasks and can quickly adjust its parameters when facing a new context. It involves a two-level optimization process:

$$\theta_t' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_t}^{\text{train}}(\theta), \qquad \min_\theta \sum_{t} \mathcal{L}_{\mathcal{T}_t}^{\text{val}}(\theta_t'),$$

where the inner loop adapts the shared parameters $\theta$ to each task $\mathcal{T}_t$ with learning rate $\alpha$, and the outer loop updates $\theta$ so that adaptation generalizes across tasks.
Two main paradigms have been developed. Gradient-based meta-learning, exemplified by Model-Agnostic Meta-Learning (MAML) [@doi:10.48550/arXiv.1703.03400], optimizes an initialization of $\theta$ from which a few gradient steps on a new task already yield strong performance. Metric-based meta-learning instead learns an embedding space in which new tasks are solved by simple comparisons against a few labeled examples.
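A deliberately scalar MAML sketch: each task's loss is quadratic with a random optimum, the meta-objective scores the parameter after one inner gradient step, and the exact meta-gradient (available in closed form here) drives the outer update. All constants and the task distribution are illustrative assumptions.

```python
import random

random.seed(9)

# Each task has loss f_c(w) = 0.5 * (w - c)^2 with optimum c ~ N(2, 0.5).
# Inner step: w' = w - alpha * f'(w) = w - alpha * (w - c).
# Meta-gradient: d/dw [0.5 * (w' - c)^2] = (1 - alpha) * (w' - c).
alpha, beta, w = 0.3, 0.1, -5.0      # inner lr, outer lr, meta-initialization
for _ in range(2000):
    c = random.gauss(2.0, 0.5)               # sample a task
    w_adapt = w - alpha * (w - c)            # one inner gradient step
    w -= beta * (1 - alpha) * (w_adapt - c)  # outer (meta) update

# Adapting to a new task (optimum c_new) from the learned initialization:
c_new = 3.0
adapted = w - alpha * (w - c_new)
print(round(w, 2), round(adapted, 2))
```

The meta-initialization settles near the center of the task distribution, so a single inner step moves the parameter substantially closer to any new task's optimum than it was before adaptation.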
Meta-learning can be viewed as a hierarchical model where the outer loop learns the hyperprior over tasks and the inner loop performs task-specific inference. The model internalizes the variability across tasks into shared meta-parameters and dynamically adjusts its learning strategy when exposed to new contextual distributions. Such an approach bridges the gap between multi-task learning and context-aware adaptation, providing a scalable solution for heterogeneous and data-sparse environments.
With the rapid expansion of digital ecosystems and the emergence of web-scale data collection, research has entered a new stage characterized by massive, heterogeneous, and weakly supervised datasets. In recent years, a vast amount of cross-domain data has been centrally collected—such as large-scale text corpora [@doi:10.48550/arXiv.1810.04805] and multimodal alignment data [@doi:10.48550/arXiv.2103.00020] that combine text, images, and audio—representing an unprecedented scale and diversity of information sources. Unlike earlier multimodal datasets that were carefully curated for specific experimental designs, these pretraining datasets are gathered under open, non-controlled conditions and often lack explicit task labels or unified structure.
The central modeling challenge has shifted from fitting a single data distribution to extracting transferable and generalizable patterns that remain stable across heterogeneous contexts. This transition has motivated the development of foundation models and related adaptive paradigms, which aim to capture universal representations, perform context-dependent inference, and achieve robust generalization across domains.
Foundation models [@doi:10.48550/arXiv.2108.07258] represent a fundamental paradigm shift in machine learning toward building universal systems trained on massive and heterogeneous data sources. Unlike earlier models that were designed for specific tasks or modalities, foundation models learn general-purpose representations through large-scale pretraining, which can be adapted to downstream tasks with minimal fine-tuning. Given a massive dataset $\mathcal{D} = \{x_i\}_{i=1}^{N}$, pretraining minimizes a self-supervised objective of the form

$$\min_\theta \sum_{i=1}^{N} \mathcal{L}_{\text{self}}(x_i; \theta),$$

where $\theta$ denotes the shared model parameters and $\mathcal{L}_{\text{self}}$ is a self-supervised loss such as masked-token or next-token prediction; the resulting representations are then adapted to downstream tasks.
The context-adaptive nature of foundation models arises from their ability to integrate knowledge from large and diverse data distributions, enabling emergent behaviors such as cross-modal transfer, domain generalization, and zero-shot reasoning. They thus redefine the notion of generalization—from fitting within a context to adapting across contexts—by embedding statistical invariances into large-scale learned representations.
In-context learning (ICL) [@doi:10.48550/arXiv.2301.00234] describes the ability of large language or multimodal models to adapt to new tasks through contextual examples provided at inference time, without updating model parameters. Given a prompt sequence of input–output demonstrations $(x_1, y_1), \dots, (x_k, y_k)$ followed by a query $x_{k+1}$, the model produces a prediction for $y_{k+1}$ purely by conditioning its fixed parameters on the prompt.
This phenomenon suggests that pretrained models can perform implicit meta-learning within their internal representations, dynamically adjusting inference behavior according to contextual input. From a statistical perspective, ICL transforms the adaptation process from an external optimization (parameter update) to an internal inference mechanism conditioned on observed data, thus embodying a new form of context-adaptive reasoning.