copyright ©2016, tim@menzies.us
I used to think that incremental learning was REALLY HARD.
Now I think it's just some for-loops around existing learners.
How to turn a Naive Bayes classifier into an incremental learner:
- Classify new examples BEFORE adding examples to training set
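The recipe above is sometimes called "test then train": every new example is first a test case, then a training case. A minimal sketch (the class name `IncrementalNB` and the toy weather data are mine, not from any particular paper):

```python
from collections import defaultdict
import math

class IncrementalNB:
    """Naive Bayes over discrete features; learns one example at a time."""
    def __init__(self):
        self.n = 0                         # total examples seen
        self.class_n = defaultdict(int)    # examples seen per class
        self.freq = {}                     # (class, slot, value) -> count

    def classify(self, x):
        """Most likely class so far (None before any training)."""
        best, best_like = None, -math.inf
        for c, nc in self.class_n.items():
            like = math.log(nc / self.n)
            for slot, value in enumerate(x):
                seen = self.freq.get((c, slot, value), 0)
                like += math.log((seen + 1) / (nc + 2))   # Laplace smoothing
            if like > best_like:
                best, best_like = c, like
        return best

    def train(self, x, y):
        self.n += 1
        self.class_n[y] += 1
        for slot, value in enumerate(x):
            self.freq[(y, slot, value)] = self.freq.get((y, slot, value), 0) + 1

# Test then train: classify each new example BEFORE adding it.
nb, hits, tries = IncrementalNB(), 0, 0
data = [(("sunny", "hot"), "no"), (("rain", "cool"), "yes"),
        (("sunny", "hot"), "no"), (("rain", "mild"), "yes")] * 5
for x, y in data:
    guess = nb.classify(x)
    if guess is not None:
        tries += 1
        hits += guess == y
    nb.train(x, y)
```

Note that the for-loop at the bottom is the whole trick: the learner itself is an ordinary Naive Bayes; only the evaluation order makes it incremental.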
How to turn any learner into an incremental learner (premise: the learner stabilizes on new examples faster than the modes of the underlying phenomena change):
- Note: this one never survived peer review. Sigh. Incremental Discretization and Bayes Classifiers Handles Concept Drift and Scales Very Well
How to turn any recursive splitting algorithm into an incremental learner:
- Supervised: build one classifier per leaf; keep a sample S of examples at each leaf; turn off learning once the learner's performance stabilizes; keep the X% of examples B that perform badly in that cluster; turn learning back on for that leaf cluster, starting with S+B.
- Unsupervised: read one era of examples and cluster them. For all subsequent eras, if new examples fall into old clusters, then forget about it. Else, sometimes, build new clusters. For more, see below.
A sample incremental method
Farnstrom, F., Lewis, J., and Elkan, C. 2000. Scalability for clustering algorithms revisited. SIGKDD Explor. Newsl. 2, 1 (Jun. 2000), 51-57. http://portal.acm.org/citation.cfm?id=360419
Model clusters as Gaussians; i.e., model each feature by its sum, sumSquared, and n.
- When clusters combine, their new stats are the sum of the old sums, the sum of the old sumSquareds, and the sum of the old n's.
- Do not add a point to a cluster if any of its attributes falls outside mean±1.96sd (for 95% confidence).
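These summaries are cheap to keep and cheap to merge. A sketch (the class name `GaussStats` is mine; the merge and the mean±1.96sd test follow the two bullets above):

```python
import math

class GaussStats:
    """Per-feature running stats: sum, sumSquared, n."""
    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, row):
        self.n += 1
        for i, x in enumerate(row):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def mean(self, i):
        return self.sum[i] / self.n

    def sd(self, i):  # population sd from sum and sumSquared
        return math.sqrt(max(0.0, self.sumsq[i] / self.n - self.mean(i) ** 2))

    def merge(self, other):
        """Combining clusters = adding sums, sumSquareds, and n's."""
        out = GaussStats(len(self.sum))
        out.n = self.n + other.n
        out.sum = [a + b for a, b in zip(self.sum, other.sum)]
        out.sumsq = [a + b for a, b in zip(self.sumsq, other.sumsq)]
        return out

    def accepts(self, row, z=1.96):
        """Reject a point if any attribute falls outside mean ± z*sd."""
        return all(abs(x - self.mean(i)) <= z * self.sd(i)
                   for i, x in enumerate(row))
```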
Algorithm:
- Initialize k random clusters C1... Ck
- Now we have old stats and new stats (initially, old stats is nil)
- LOOP: Take 1% of the data and add it to a buffer.
- Run k-means over the items in the buffer, moving the centroids until they converge. Here, each existing cluster is treated as a single point at its mean values, weighted by the number of points in that cluster. If an item can be added to a cluster (according to the old confidence intervals), add it, updating the sum, sumSquared, and n of the new stats.
- When no more items can be added: if any cluster has no items, seed it with the most distant data point. Then clear the buffer and add the new stats to the old stats.
- If there is no more new data, stop. Otherwise, go to LOOP.
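The loop above can be sketched as follows. This is a much-simplified, hypothetical reading of Farnstrom et al., not their actual code: it skips the reseed-empty-clusters step, it discards (rather than re-buffers) points the confidence intervals reject, and the function name, `init` parameter, and chunk sizes are mine.

```python
import math
import random

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(s, i):
    return s["sum"][i] / s["n"]

def sd(s, i):
    return math.sqrt(max(0.0, s["sumsq"][i] / s["n"] - mean(s, i) ** 2))

def add(s, row):
    s["n"] += 1
    for i, x in enumerate(row):
        s["sum"][i] += x
        s["sumsq"][i] += x * x

def accepts(s, row, z=1.96):
    """Reject a row if any attribute falls outside mean ± z*sd."""
    return all(abs(x - mean(s, i)) <= z * sd(s, i) for i, x in enumerate(row))

def scalable_kmeans(data, k=2, chunk=0.01, loops=10, seed=1, init=None):
    """Single-pass k-means: read the data in small chunks; compress rows
    that fit the old confidence intervals into (sum, sumSquared, n) stats."""
    rnd = random.Random(seed)
    dims = len(data[0])
    centroids = [list(r) for r in (init or rnd.sample(data, k))]
    stats = [{"n": 0, "sum": [0.0] * dims, "sumsq": [0.0] * dims}
             for _ in range(k)]
    step = max(1, int(len(data) * chunk))
    for at in range(0, len(data), step):
        buffer = [list(r) for r in data[at:at + step]]
        for _ in range(loops):   # k-means over buffer + weighted summaries
            sums = [[0.0] * dims for _ in range(k)]
            ns = [0] * k
            for j, s in enumerate(stats):   # summary = its mean, weighted by n
                if s["n"]:
                    for i in range(dims):
                        sums[j][i] += mean(s, i) * s["n"]
                    ns[j] += s["n"]
            for row in buffer:
                j = min(range(k), key=lambda c: dist(row, centroids[c]))
                for i in range(dims):
                    sums[j][i] += row[i]
                ns[j] += 1
            for j in range(k):
                if ns[j]:
                    centroids[j] = [v / ns[j] for v in sums[j]]
        for row in buffer:   # compress rows the old confidence intervals accept
            j = min(range(k), key=lambda c: dist(row, centroids[c]))
            if stats[j]["n"] < 2 or accepts(stats[j], row):
                add(stats[j], row)
    return centroids
```

Only the buffer and the k summaries are ever held in RAM, which is the point: memory stays constant no matter how long the stream runs.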
Runs nearly twice as fast as standard k-means. Here are the results from using various clustering algorithms on the KDD 1998 cup data (95,412 rows, 481 fields). This algorithm is "N1"; normal batch k-means is "K"; "R1" is a random-sampling k-means (which, btw, generated low-quality clusters).
Note that this algorithm, working in increments of 1% of the data, produces clusters of almost the same quality as multiple-pass k-means.
Other advantages:
- Small enough memory that you could have multiple versions going, and pick the one with the lowest variances in the generated clusters.
- Very simple implementation.
- Satisfies the data mining desiderata
The data mining desiderata come from Bradley et al.:
- Require one scan (or less) of the database if possible: a single data scan is considered costly; early termination, if appropriate, is highly desirable.
- On-line "anytime" behavior: a "best" answer is always available, with status information on progress, expected remaining time, etc.
- Suspendable, stoppable, resumable; incremental progress saved to resume a stopped job.
- Ability to incrementally incorporate additional data with existing models efficiently.
- Work within confines of a given limited RAM buffer.
- Utilize a variety of possible scan modes: sequential, index, and sampling scans, if available.
- Ability to operate on forward-only cursor over a view of the database. This is necessary since the database view may be a result of an expensive join query, over a potentially distributed data warehouse, with much processing required to construct each row (case).
