Early research data came mostly from strictly designed experiments, such as agricultural field trials or psychological studies. Such data were typically small in scale and simple in structure, and researchers usually assumed that observations were independent of one another and identically distributed [@Fisher1949Design]. Under this setting there is no dependency among the data, and analysis focuses mainly on the overall average level or effect.
Linear models emerged as the fundamental approach for analyzing such data. One of the earliest methods in their development, the method of least squares, was first published by Legendre [@Legendre1805Comets] and later independently developed and justified by Gauss [@Gauss1809Ambientium]. By minimizing the squared deviations between observations and fitted values, the method offered a general framework for fitting a regression line to observed data. The estimator is expressed as

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \, \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2,$$

which has a closed-form solution,

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.$$

This provided a systematic procedure for estimating unknown parameters from independently and identically distributed data, laying the foundation of regression analysis.
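As a minimal sketch of the closed-form estimator, the normal equations can be solved directly; the data below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y = X @ beta + noise, then recover beta with the normal equations.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed-form least-squares estimate: (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 2))
```

Solving the linear system with `np.linalg.solve` avoids explicitly forming the matrix inverse, which is the numerically preferred route in practice.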
The establishment of generalized linear models (GLMs) extended the concepts of linear regression to non-Gaussian outcomes. Nelder and Wedderburn [@doi:10.2307/2344614] introduced the central idea of connecting the mean of the response variable to a linear predictor through a link function, which was later formalized into the now-standard compact notation [@doi:10.1007/978-1-4899-3242-6]:

$$g(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta}, \qquad \mu_i = \mathbb{E}(y_i).$$
Before this unified framework emerged, one of its first instances was the use of the logistic function for binary data. Berkson advocated its application to biomedical dose–response studies, illustrating how the probabilities of success or failure in experimental settings could be captured by the link formulation [@doi:10.1080/01621459.1944.10500699]. Building on this, logistic regression was formalized as a regression model for binary sequences, establishing the logit link [@doi:10.1111/j.2517-6161.1958.tb00292.x]:

$$\log\!\left(\frac{p_i}{1 - p_i}\right) = \mathbf{x}_i^\top \boldsymbol{\beta}.$$
For count data, Poisson regression was introduced within the GLM framework, employing the log link [@doi:10.2307/2344614]:

$$\log(\lambda_i) = \mathbf{x}_i^\top \boldsymbol{\beta}, \qquad y_i \sim \mathrm{Poisson}(\lambda_i).$$
These developments strengthened GLMs as a unifying framework for a variety of independently and identically distributed data types by extending linear modeling to categorical and count outcomes.
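As a hedged sketch of how the logit link is fit in practice, the Newton–Raphson (equivalently, iteratively reweighted least squares) loop below recovers the coefficients of a simulated logistic model; the data and settings are illustrative, not drawn from any study cited here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate binary outcomes from a known logistic model.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.5])
prob = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, prob)

# Newton-Raphson iterations for the logit link.
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))   # fitted probabilities
    W = mu * (1 - mu)                  # variance function for the binomial
    grad = X.T @ (y - mu)              # score vector
    hess = X.T @ (X * W[:, None])      # Fisher information
    beta = beta + np.linalg.solve(hess, grad)
print(np.round(beta, 2))
```

The same loop generalizes to other GLMs by swapping in the appropriate link and variance function.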
Alongside regression, analysis of variance (ANOVA) was another early milestone of statistical methodology. ANOVA introduced the concept of splitting total variance into components reflecting within-group and between-group differences [@Fisher1925Research]. Its central quantity is the F ratio,

$$F = \frac{\mathrm{SS}_{\text{between}}/(k-1)}{\mathrm{SS}_{\text{within}}/(N-k)},$$

for $k$ groups and $N$ total observations,
which provided a consistent framework for determining the significance of differences between group means and established the foundation of data analysis in modern experiments.
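The F ratio can be computed from raw group data in a few lines; the three simulated groups below (two with equal means, one shifted) are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three groups of 30 observations; the third group's mean is shifted by 1.
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (0.0, 0.0, 1.0)]
grand = np.mean(np.concatenate(groups))
k = len(groups)
N = sum(len(g) for g in groups)

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# The F ratio: mean square between over mean square within.
F = (ss_between / (k - 1)) / (ss_within / (N - k))
print(round(F, 2))
```

Large values of F indicate that between-group variation is too large to be explained by within-group noise alone.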
Together, these early statistical frameworks provided the foundation for analysis of independently and identically distributed data. However, as studies became more complex and observations were no longer truly independent, new methods were needed, leading to the development of hierarchical models.
With the expansion of research fields, hierarchical structure gradually emerged in the collected empirical data. For example, bulls are nested within sires and herds [@doi:10.1080/00288233.1981.10420865]; respondents are nested within a population of subjects [@doi:10.1037/h0040957]; and repeated measurements of the same unit at different time points induce longitudinal correlation. Such data are more complex than independent and identically distributed samples: they exhibit differences among groups while also requiring that changes in individuals over time be modeled simultaneously. The observations are no longer independent of one another; instead, there is a distinct hierarchical effect.
Early work on hierarchical dependence began with linear models containing both fixed and random effects for estimating genetic parameters [@Henderson1950Genetic]. This approach laid the statistical basis for partitioning variability into systematic and random components, which in turn shaped how intraclass correlation would later be understood and applied. A general form of such linear models can be written as:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon},$$

where $\boldsymbol{\beta}$ collects the fixed effects and $\mathbf{u}$ the random effects.
Around the same period, it became clear that comparative rate estimation in clinical and epidemiological studies depends on the structure of the sampled subgroups, with heterogeneity in source populations affecting inference [@doi:10.1093/jnci/11.6.1269]. Building on this recognition of heterogeneity, applications expanded to repeated measurements and longitudinal structures; restricted maximum likelihood (REML) estimation was introduced to improve variance component inference in unbalanced settings [@doi:10.1093/biomet/58.3.545]. This methodological foundation enabled the formulation of linear mixed-effects models (LMMs) that combine fixed effects for population-level trends with random effects for subject- or group-specific deviations [@doi:10.2307/2529876]. In matrix notation, these models take the form

$$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}_i + \boldsymbol{\varepsilon}_i, \qquad \mathbf{b}_i \sim N(\mathbf{0}, \mathbf{D}), \quad \boldsymbol{\varepsilon}_i \sim N(\mathbf{0}, \boldsymbol{\Sigma}_i),$$
and later became widely used in biostatistics and the social sciences. To extend mixed-effects modeling beyond Gaussian responses, practical estimation procedures for generalized linear models with random effects were proposed, enabling the application of link functions to clustered binary, count, or categorical outcomes [@doi:10.1093/biomet/78.4.719]. Further methodological advances introduced penalized quasi-likelihood and approximate inference techniques that made practical application feasible in many fields, especially medicine and biology [@doi:10.1080/01621459.1993.10594284]. These developments gave rise to generalized linear mixed models (GLMMs), formulated as

$$g\!\left(\mathbb{E}[y_{ij} \mid \mathbf{b}_i]\right) = \mathbf{x}_{ij}^\top \boldsymbol{\beta} + \mathbf{z}_{ij}^\top \mathbf{b}_i,$$

combining link functions with both fixed effects $\boldsymbol{\beta}$ and random effects $\mathbf{b}_i$ in a single framework.
Nevertheless, it was not until later that Bayesian hierarchical models became practically influential. In the 1980s, the demonstration of Gibbs sampling in image analysis paved the way for its adoption in hierarchical Bayesian modeling [@doi:10.1109/TPAMI.1984.4767596]. Later, advances in Markov chain Monte Carlo methods provided the computational tools needed to fit Bayesian hierarchical models, making them practical for applied researchers [@doi:10.1201/b16018].
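A toy Gibbs sampler conveys the idea on a bivariate normal with correlation `rho`, where both full conditionals are univariate normal; all values here are assumptions for illustration, not taken from the works cited above.

```python
import math
import random

random.seed(3)

# Gibbs sampling for a standard bivariate normal with correlation rho:
# x | y ~ N(rho * y, 1 - rho^2) and y | x ~ N(rho * x, 1 - rho^2),
# so the chain simply alternates the two conditional draws.
rho = 0.8
x, y = 0.0, 0.0
xs = []
for t in range(20000):
    x = random.gauss(rho * y, math.sqrt(1 - rho ** 2))
    y = random.gauss(rho * x, math.sqrt(1 - rho ** 2))
    if t >= 1000:          # discard burn-in iterations
        xs.append(x)

# The retained draws approximate the N(0, 1) marginal of x.
mean_x = sum(xs) / len(xs)
var_x = sum((v - mean_x) ** 2 for v in xs) / len(xs)
print(round(mean_x, 2), round(var_x, 2))
```

The same alternating-conditional scheme scales to hierarchical models whose full conditionals are available in closed form.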
Collectively, these developments shifted statistical modeling from independence assumptions to explicitly capturing correlation and hierarchical structure. As methods for nested data improved, new problems arose with functional, continuous, and high-dimensional observations. This led to the development of approaches for curves, trajectories, and large feature spaces, which in turn led to functional data analysis and high-dimensional inference.
As data collection advanced into the 1980s and 1990s, researchers began encountering observations that are entire curves or functions (e.g., time series, spectra, images) rather than fixed-dimensional vectors. Functional Data Analysis (FDA) [@doi:10.1002/0470013192.bsa239] treats each observation as a smooth function $x_i(t)$ defined over a continuum $t \in \mathcal{T}$, where each curve is regarded as a single realization of an underlying stochastic process, so that smoothing, functional principal component analysis, and functional regression operate directly on the functions themselves.
As data dimensionality and complexity continued to grow (for instance with genomic or imaging data), there was a shift toward automated feature learning. Representation learning [@doi:10.1109/TPAMI.2013.50] generalizes classical dimension reduction (like PCA) to nonlinear, data-driven embeddings. Neural autoencoders are a prototypical example: one trains an encoder network $f_\theta : \mathbb{R}^p \to \mathbb{R}^k$ together with a decoder $g_\phi : \mathbb{R}^k \to \mathbb{R}^p$ to minimize the reconstruction error $\sum_i \lVert x_i - g_\phi(f_\theta(x_i)) \rVert^2$. Here $k \ll p$, so the learned code $z_i = f_\theta(x_i)$ serves as a nonlinear low-dimensional representation of the data.
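With squared loss and linear encoder and decoder, the autoencoder objective is minimized by the principal subspace; the sketch below computes that linear special case via the SVD rather than training a network, and the simulated data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Data lying near a 2-dimensional subspace of R^10, plus small noise.
n, p, k = 300, 10, 2
Z = rng.normal(size=(n, k))
W = rng.normal(size=(k, p))
X = Z @ W + 0.05 * rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)

# The top-k right singular vectors span the optimal linear "encoder".
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
encode = Vt[:k].T           # p x k projection (encoder weights)
codes = Xc @ encode         # low-dimensional embedding (the codes)
recon = codes @ encode.T    # linear "decoder" maps codes back to R^p

rel_err = np.linalg.norm(Xc - recon) / np.linalg.norm(Xc)
print(round(rel_err, 3))
```

Replacing the linear maps with nonlinear networks yields the general autoencoder, which can capture curved low-dimensional structure that PCA misses.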
With the proliferation of diverse tasks and domains, statistical learning shifted toward methods that transfer information across problems. Multi-task learning (MTL) [@doi:10.1023/A:1007379606734] arose in response: it jointly models related tasks to improve performance, especially when each task has limited data. Concretely, if we have $T$ related tasks with data $\{(x_i^{(t)}, y_i^{(t)})\}_{i=1}^{n_t}$ for $t = 1, \dots, T$, MTL fits task-specific parameters $w_t$ under a shared structure, for example by minimizing

$$\sum_{t=1}^{T} \sum_{i=1}^{n_t} \ell\!\left(f(x_i^{(t)}; w_t),\, y_i^{(t)}\right) + \lambda\, \Omega(w_1, \dots, w_T),$$

where $\Omega$ is a regularizer that couples the tasks (for instance by penalizing deviations of each $w_t$ from a shared mean), so that statistical strength is borrowed across related problems.
Another fundamental challenge is distribution shift between training and deployment. Covariate shift occurs when the input distribution $p(x)$ changes between training and test data while the conditional distribution $p(y \mid x)$ remains fixed; under this assumption, importance weighting by the density ratio $p_{\text{test}}(x)/p_{\text{train}}(x)$ corrects the training objective.
More generally, domain adaptation methods address cases where both $p(x)$ and $p(y \mid x)$ may differ between source and target domains, typically by learning representations whose distributions align across domains.
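A small simulation sketches importance weighting under covariate shift: training inputs follow N(0, 1), deployment inputs N(1, 1), and density-ratio weights recover the deployment-time mean of $y$. All distributions and constants here are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(5)

# Training inputs ~ N(0,1); with y = x^2 + noise, the naive training
# mean of y estimates E[y] under training, not under deployment (N(1,1)).
x_tr = rng.normal(0.0, 1.0, size=20000)
y_tr = x_tr ** 2 + 0.1 * rng.normal(size=20000)

def density(x, m):
    # Standard normal density centered at m (unit variance).
    return np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)

# Importance weights p_test(x) / p_train(x) reweight the training sample.
w = density(x_tr, 1.0) / density(x_tr, 0.0)
naive = y_tr.mean()                       # targets E[y] under N(0,1): ~1
shifted = np.sum(w * y_tr) / np.sum(w)    # targets E[y] under N(1,1): ~2
print(round(naive, 2), round(shifted, 2))
```

The self-normalized form used for `shifted` only requires the density ratio up to a constant, which is convenient when the densities are estimated rather than known.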
Finally, the combination of many tasks and very few examples per task has given rise to few-shot learning. In few-shot learning [@doi:10.1145/3386252], the goal is to generalize to new classes or tasks from only a handful of labeled examples. Modern approaches typically leverage prior experience across tasks or classes. Metric-based methods, such as Matching Networks and Prototypical Networks [@doi:10.48550/arXiv.1606.04080; @doi:10.48550/arXiv.1703.05175], learn an embedding $f_\theta$ in which classification reduces to comparing a query point with a small support set, for instance by assigning it to the class of the nearest prototype (the mean embedding of that class's support examples).
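A minimal nearest-prototype classifier conveys the metric-based idea; the identity map stands in for a learned embedding, and the 2-way, 5-shot episode is simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

def prototypes(support_x, support_y):
    # One prototype per class: the mean of that class's support embeddings.
    classes = sorted(set(support_y))
    protos = np.stack([support_x[np.array(support_y) == c].mean(axis=0)
                       for c in classes])
    return classes, protos

def classify(query, classes, protos):
    # Label the query by its nearest prototype in the embedding space.
    d = np.linalg.norm(protos - query, axis=1)
    return classes[int(np.argmin(d))]

# 2-way 5-shot episode: two well-separated Gaussian classes in R^2.
sx = np.concatenate([rng.normal(0.0, 0.3, size=(5, 2)),
                     rng.normal(3.0, 0.3, size=(5, 2))])
sy = [0] * 5 + [1] * 5
classes, protos = prototypes(sx, sy)
print(classify(np.array([0.1, -0.2]), classes, protos),
      classify(np.array([2.9, 3.2]), classes, protos))
```

In an actual prototypical network, `support_x` would first pass through a trained encoder so that class clusters become separable in the learned space.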
In the Internet era, the way data is collected has undergone significant changes. User behavior data, sensor data, and platform experimental data exhibit the characteristics of streaming and interactivity [@doi:10.1145/1772690.1772758]. The generation of dynamic data poses new challenges to model construction and problem-solving. Data is not collected all at once but is generated in real time and is often related to the feedback loop of the system [@doi:10.1115/1.3662552]. This type of data is more dynamic and complex compared to traditional experimental or measurement data, involving both time dependence and continuous influence from the environment and interaction.
A natural starting point for this challenge is online learning [@doi:10.48550/arXiv.1802.02871]. A foundational formulation is the Online Convex Optimization (OCO) framework [@doi:10.1561/2200000018]. At each round $t = 1, \dots, T$, the learner chooses a decision $w_t$ from a convex set $\mathcal{W}$, the environment reveals a convex loss $f_t$, and performance is measured by the regret

$$R_T = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} f_t(w),$$

which compares the learner's cumulative loss with that of the best fixed decision in hindsight. A simple and influential algorithm in this framework is Online Gradient Descent (OGD). Given a subgradient $g_t \in \partial f_t(w_t)$, OGD updates

$$w_{t+1} = \Pi_{\mathcal{W}}\!\left(w_t - \eta_t g_t\right),$$

where $\eta_t$ is a step size and $\Pi_{\mathcal{W}}$ denotes projection onto $\mathcal{W}$; with $\eta_t \propto 1/\sqrt{t}$, OGD guarantees $O(\sqrt{T})$ regret for convex losses.
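A short sketch of OGD on a stream of quadratic losses, with a $1/\sqrt{t}$ step-size schedule and regret measured against the best fixed decision in hindsight; the loss family and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stream of quadratic losses f_t(w) = (w - z_t)^2, with targets z_t
# drawn around 2.0; the learner starts far away at w = 0.
z = rng.normal(loc=2.0, scale=0.5, size=2000)

w, total_loss = 0.0, 0.0
for t, zt in enumerate(z, start=1):
    total_loss += (w - zt) ** 2       # loss suffered at round t
    grad = 2 * (w - zt)               # (sub)gradient of f_t at w_t
    w -= 0.5 * grad / np.sqrt(t)      # OGD step; no projection needed here

# Best fixed decision in hindsight is the sample mean of the targets.
best = float(np.sum((z.mean() - z) ** 2))
avg_regret = (total_loss - best) / len(z)
print(round(avg_regret, 4))
```

The average regret shrinks as the horizon grows, matching the sublinear (here $O(\sqrt{T})$-type) guarantee of the framework.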
While OCO provides guarantees with respect to a fixed comparator, real-world data streams rarely remain stationary. The statistical relationship between inputs $x_t$ and outputs $y_t$ can itself change over time, a phenomenon known as concept drift. A common adaptive strategy is to monitor a statistic such as a windowed error rate and to flag a change whenever it exceeds a predefined threshold $\delta$, at which point the model is retrained or its adaptation rate is increased.
Though adaptive learning addresses the challenge of non-stationarity, it still operates under a full-information feedback model: after each round, the entire loss function $f_t$ is revealed to the learner. In many interactive systems, however, only the loss of the action actually taken is observed; this partial-feedback setting is formalized by the multi-armed bandit problem. The arm $a_t$ chosen at round $t$ yields a reward drawn from an unknown distribution with mean $\mu_{a_t}$, and performance is measured by the regret

$$R_T = T\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right],$$

where $\mu^* = \max_a \mu_a$ is the mean reward of the best arm, so the learner must balance exploring uncertain arms against exploiting the empirically best one.
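An epsilon-greedy strategy illustrates bandit feedback, where only the pulled arm's reward is observed; the arm means and parameters below are assumptions for the example.

```python
import random

random.seed(10)

# 3-armed Bernoulli bandit: only the chosen arm's reward is revealed.
true_means = [0.2, 0.5, 0.8]
counts = [0, 0, 0]
means = [0.0, 0.0, 0.0]     # running empirical mean reward per arm
eps, T = 0.1, 5000

total = 0.0
for t in range(T):
    if random.random() < eps:
        a = random.randrange(3)                       # explore uniformly
    else:
        a = max(range(3), key=lambda i: means[i])     # exploit best estimate
    r = 1.0 if random.random() < true_means[a] else 0.0
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]            # incremental mean update
    total += r

best_arm = max(range(3), key=lambda i: means[i])
print(best_arm, round(total / T, 2))
```

More refined strategies such as upper-confidence-bound rules replace the fixed exploration rate with exploration bonuses that shrink as arms are sampled.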
RL formalizes sequential decision-making under uncertainty through the lens of Markov Decision Processes (MDPs) [@Bellman1957Markovian]. An MDP is defined by a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition kernel $P(s' \mid s, a)$, a reward function $r(s, a)$, and a discount factor $\gamma \in [0, 1)$; the goal is to learn a policy $\pi(a \mid s)$ that maximizes the expected discounted return $\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$.
Classical solution approaches include value-based methods (e.g., Q-learning) [@Watkins1989Learning], which estimate action-value functions and act greedily with respect to them, and policy-based methods (e.g., policy gradient, actor–critic) [@doi:10.1023/a:1022672621406; @doi:10.1109/tsmc.1983.6313077], which directly optimize the policy parameters via stochastic gradient ascent. These methods come with theoretical guarantees and have enabled applications ranging from game playing (e.g., Go, Atari) [@doi:10.1038/nature14236] to robotics.
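A tabular Q-learning sketch on a toy chain MDP shows the value-based recipe: temporal-difference updates on $Q(s, a)$ followed by greedy action selection. The environment and constants are invented for illustration.

```python
import random

random.seed(8)

# 5-state chain: actions move left (0) or right (1); reaching the
# terminal state 4 yields reward 1, every other transition reward 0.
n_states, gamma, alpha, eps = 5, 0.9, 0.5, 0.2
Q = [[0.0, 0.0] for _ in range(n_states)]   # Q[s][a]

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

for _ in range(500):                        # training episodes
    s, done = 0, False
    while not done:
        # Epsilon-greedy behavior policy.
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = int(Q[s][1] > Q[s][0])
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])   # TD update toward target
        s = s2

# The greedy policy should move right toward the goal from every state.
policy = [int(Q[s][1] > Q[s][0]) for s in range(n_states - 1)]
print(policy)
```

The learned action values decay geometrically with distance from the goal ($\gamma^k$), which is exactly the discounted-return structure the MDP formalism encodes.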
The trajectory from online convex optimization to reinforcement learning highlights how the statistical study of streaming, interactive data has evolved into increasingly expressive frameworks for context-adaptive inference. As data collection becomes more interactive and non-stationary, the ability to learn from context and feedback loops is central to adaptive intelligence. This theme naturally connects to the next section on multimodal data, where adaptivity must also span across modalities.
With the advancement of digitalization, research and applications have begun to simultaneously involve multiple types of data, such as images [@doi:10.1088/2040-8978/17/7/073001], audio [@doi:10.1121/1.1912679], and text [@doi:10.1145/361219.361220]. These data are not only high-dimensional but also show clear structural and representational differences. Text consists of discrete symbolic sequences that carry semantic and grammatical structure; images are spatially organized pixel matrices that reflect local spatial correlations and global patterns; and audio is a continuous waveform signal that captures dynamic temporal characteristics. Compared with earlier single-type numerical or functional data, multimodal data are more heterogeneous and complex. These new challenges required researchers to explore how to establish semantic correspondences across modalities for cross-modal alignment, to build joint models through shared latent spaces, and to dynamically adjust inter-modal dependencies according to task or context.
From a statistical perspective, representation learning [@doi:10.1109/TPAMI.2013.50] can be regarded as a generalization of traditional linear dimension reduction techniques such as Principal Component Analysis (PCA) and Factor Analysis (FA). While PCA and FA identify low-dimensional subspaces that capture maximum variance or shared covariance structure through linear projections, representation learning extends this idea to nonlinear mappings that can model highly complex and structured data, automatically learning embeddings that map heterogeneous observations into a shared latent space. The basic idea is to learn a nonlinear mapping $f_\theta : \mathcal{X} \to \mathcal{Z}$ that transforms a raw observation $x$ into a latent representation $z = f_\theta(x)$, where $\mathcal{Z}$ is a low-dimensional latent space in which semantically related observations, possibly from different modalities, lie close together.
To infer hidden structure, traditional latent variable models in statistics, such as Factor Analysis and Gaussian Mixture Models (GMMs), require estimating the posterior distribution $p(z \mid x)$ over latent variables, which classically involves a separate iterative procedure (such as an E-step or a sampling run) for every observation.
Amortized inference introduces a major innovation by parameterizing the approximate posterior as a learnable function $q_\phi(z \mid x)$, typically a neural network that maps each observation directly to the parameters of its approximate posterior, as in the variational autoencoder. The model can then adjust its posterior estimate $q_\phi(z \mid x)$ for each new input through a single forward pass, rather than rerunning an iterative inference procedure from scratch.
In many real-world scenarios, the amount of labeled data for each task may be limited. Meta-learning, or learning to learn, provides a framework for acquiring inductive knowledge that enables rapid adaptation to new tasks with few examples. The central idea is to learn a meta-model that captures transferable structure across tasks and can quickly adjust its parameters when facing a new context. It involves a two-level optimization process:

$$\theta_t' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_t}^{\text{train}}(\theta), \qquad \min_\theta \sum_{t} \mathcal{L}_{\mathcal{T}_t}^{\text{val}}(\theta_t'),$$

where the inner loop adapts the shared parameters $\theta$ to each task $\mathcal{T}_t$ with learning rate $\alpha$, and the outer loop updates $\theta$ so that adaptation generalizes across tasks.
Two main paradigms have been developed. Gradient-based meta-learning, exemplified by Model-Agnostic Meta-Learning (MAML) [@doi:10.48550/arXiv.1703.03400], optimizes an initialization of $\theta$ from which a few gradient steps on a new task already yield strong performance. Metric-based meta-learning instead learns an embedding space in which new tasks are solved by simple comparisons against a few labeled examples.
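A deliberately scalar MAML sketch: each task's loss is quadratic with a random optimum, the meta-objective scores the parameter after one inner gradient step, and the exact meta-gradient (available in closed form here) drives the outer update. All constants and the task distribution are illustrative assumptions.

```python
import random

random.seed(9)

# Each task has loss f_c(w) = 0.5 * (w - c)^2 with optimum c ~ N(2, 0.5).
# Inner step: w' = w - alpha * f'(w) = w - alpha * (w - c).
# Meta-gradient: d/dw [0.5 * (w' - c)^2] = (1 - alpha) * (w' - c).
alpha, beta, w = 0.3, 0.1, -5.0      # inner lr, outer lr, meta-initialization
for _ in range(2000):
    c = random.gauss(2.0, 0.5)               # sample a task
    w_adapt = w - alpha * (w - c)            # one inner gradient step
    w -= beta * (1 - alpha) * (w_adapt - c)  # outer (meta) update

# Adapting to a new task (optimum c_new) from the learned initialization:
c_new = 3.0
adapted = w - alpha * (w - c_new)
print(round(w, 2), round(adapted, 2))
```

The meta-initialization settles near the center of the task distribution, so a single inner step moves the parameter substantially closer to any new task's optimum than it was before adaptation.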
Meta-learning can be viewed as a hierarchical model where the outer loop learns the hyperprior over tasks and the inner loop performs task-specific inference. The model internalizes the variability across tasks into shared meta-parameters and dynamically adjusts its learning strategy when exposed to new contextual distributions. Such an approach bridges the gap between multi-task learning and context-aware adaptation, providing a scalable solution for heterogeneous and data-sparse environments.
With the rapid expansion of digital ecosystems and the emergence of web-scale data collection, research has entered a new stage characterized by massive, heterogeneous, and weakly supervised datasets. In recent years, a vast amount of cross-domain data has been centrally collected—such as large-scale text corpora [@doi:10.48550/arXiv.1810.04805] and multimodal alignment data [@doi:10.48550/arXiv.2103.00020] that combine text, images, and audio—representing an unprecedented scale and diversity of information sources. Unlike earlier multimodal datasets that were carefully curated for specific experimental designs, these pretraining datasets are gathered under open, non-controlled conditions and often lack explicit task labels or unified structure.
The central modeling challenge has shifted from fitting a single data distribution to extracting transferable and generalizable patterns that remain stable across heterogeneous contexts. This transition has motivated the development of foundation models and related adaptive paradigms, which aim to capture universal representations, perform context-dependent inference, and achieve robust generalization across domains.
Foundation models [@doi:10.48550/arXiv.2108.07258] represent a fundamental paradigm shift in machine learning toward building universal systems trained on massive and heterogeneous data sources. Unlike earlier models that were designed for specific tasks or modalities, foundation models learn general-purpose representations through large-scale pretraining, which can be adapted to downstream tasks with minimal fine-tuning. Given a massive dataset $\mathcal{D} = \{x_i\}_{i=1}^{N}$, pretraining minimizes a self-supervised objective of the form

$$\min_\theta \sum_{i=1}^{N} \mathcal{L}_{\text{self}}(x_i; \theta),$$

where $\theta$ denotes the shared model parameters and $\mathcal{L}_{\text{self}}$ is a self-supervised loss such as masked-token or next-token prediction; the resulting representations are then adapted to downstream tasks.
The context-adaptive nature of foundation models arises from their ability to integrate knowledge from large and diverse data distributions, enabling emergent behaviors such as cross-modal transfer, domain generalization, and zero-shot reasoning. They thus redefine the notion of generalization—from fitting within a context to adapting across contexts—by embedding statistical invariances into large-scale learned representations.
In-context learning (ICL) [@doi:10.48550/arXiv.2301.00234] describes the ability of large language or multimodal models to adapt to new tasks through contextual examples provided at inference time, without updating model parameters. Given a prompt sequence of input–output demonstrations $(x_1, y_1), \dots, (x_k, y_k)$ followed by a query $x_{k+1}$, the model produces a prediction for $y_{k+1}$ purely by conditioning its fixed parameters on the prompt.
This phenomenon suggests that pretrained models can perform implicit meta-learning within their internal representations, dynamically adjusting inference behavior according to contextual input. From a statistical perspective, ICL transforms the adaptation process from an external optimization (parameter update) to an internal inference mechanism conditioned on observed data, thus embodying a new form of context-adaptive reasoning.