
home | copyright ©2016, tim@menzies.us

overview | syllabus | src | submit | chat


Feature and Instance Selection

Timm's rule: the best thing to do with data is to throw most of it away.

  • Simpler models
  • Quicker to explain and audit
  • Less work for anything downstream that has to watch or act on any variable

Occam's Razor

  • The English philosopher William of Occam (1300-1349) propounded Occam's Razor:
    • Entia non sunt multiplicanda praeter necessitatem.
    • (Latin for "Entities should not be multiplied more than necessary"). That is, the fewer assumptions an explanation of a phenomenon depends on, the better it is.
  • (BTW, Occam's razor did not survive into the 21st century.
    • The data mining community modified it to the Minimum Description Length (MDL) principle.
    • MDL: the best theory is the smallest in BOTH size AND number of errors).
  • Many ways to throw away data
    • feature selection
    • range selection
    • instance selection (prototype generation)

Feature Selection

The case for FSS

Repeated result: throwing out features rarely damages a theory

And, sometimes, feature removal is very useful:

  • E.g. linear regression on bn.arff yielded:

    Defects =     
        82.2602 * S1=L,M,VH +
        158.6082 * S1=M,VH +
        249.407  * S1=VH +
         41.0281 * S2=L,H +
         68.9153 * S2=H +
        151.9207 * S3=M,H +
        125.4786 * S3=H +
        257.8698 * S4=H,M,VL +
        108.1679 * S4=VL +
        134.9064 * S5=L,M +
       -385.7142 * S6=H,M,VH +
        115.5933 * S6=VH +
       -178.9595 * S7=H,L,M,VL +
       ...
       [ 50 lines deleted ]
    
  • On a 10-way cross-validation, predictions from this model correlate 0.45 with actuals.

  • 10 times, take 90% of the data and run WRAPPER: a best-first search through combinations of attributes. At each step, linear regression was called to assess a particular combination of attributes. In those ten experiments, WRAPPER found that adding feature X to features A,B,C,... improved correlation the following number of times:

    number of folds (%)  attribute
               2( 20 %)     1 S1
               0(  0 %)     2 S2
               2( 20 %)     3 S3
               1( 10 %)     4 S4
               0(  0 %)     5 S5
               1( 10 %)     6 S6
               6( 60 %)     7 S7     <==
               1( 10 %)     8 F1
               1( 10 %)     9 F2
               2( 20 %)    10 F3
               2( 20 %)    11 D1
               0(  0 %)    12 D2
               5( 50 %)    13 D3     <==
               0(  0 %)    14 D4
               0(  0 %)    15 T1
               1( 10 %)    16 T2
               1( 10 %)    17 T3
               1( 10 %)    18 T4
               0(  0 %)    19 P1
               1( 10 %)    20 P2
               0(  0 %)    21 P3
               1( 10 %)    22 P4
               6( 60 %)    23 P5     <==
               1( 10 %)    24 P6
               2( 20 %)    25 P7
               1( 10 %)    26 P8
               0(  0 %)    27 P9
               2( 20 %)    28 Hours
               8( 80 %)    29 KLoC   <==
               4( 40 %)    30 Language
               3( 30 %)    32 log(hours)
    
  • Four variables appeared in the majority of folds. A second run did a 10-way cross-validation using just those variables, yielding a smaller model with a (much) larger correlation (98%):

    Defects =
        876.3379 * S7=VL +
       -292.9474 * D3=L,M +
        483.6206 * P5=M +
          5.5113 * KLoC +
         95.4278
    

Excess attributes

  • Confuse decision tree learners
    • Too much early splitting of data
    • Less data available for each sub-tree
  • Too many things correlated to class?
    • Dump some of them!

Why FSS?

  • throw away noisy attributes
  • throw away redundant attributes
  • smaller model = better accuracies (often)
  • smaller model = simpler explanation
  • smaller model = less variance
  • smaller model = any downstream processing will thank you

Problem

  • Exploring all subsets is exponential: 2^N subsets of N attributes.

  • Need heuristic methods to cull search;

    • e.g. forward/back select
  • Forward select (a sketch follows this list):

    • start with the empty set
    • grow via hill climbing:
    • repeat
      • try adding one attribute; if that improves things, keep it
      • try again using the remaining attributes
    • until no improvement after N additions OR nothing left to add
  • Back select

    • as above but start with all attributes and discard, don't add
  • Usually, we throw away most attributes:

    • so forward select often better
    • exception: J48 exploits attribute interactions more than, say, NB
    • so, possibly, back select is better when wrapping J48
    • so, possibly, forward select is as good as it gets for NB
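
A minimal sketch of forward selection in Python; the `score` function here is an assumption (any subset scorer will do, e.g. a cross-validated learner, which turns this into a WRAPPER):

    # greedy forward selection: grow the subset while the score improves
    def forward_select(features, score):
        selected, best = [], float("-inf")
        while True:
            remaining = [f for f in features if f not in selected]
            if not remaining:                 # nothing left to add
                break
            gain, f = max((score(selected + [f]), f) for f in remaining)
            if gain <= best:                  # no single addition improves things
                break
            selected.append(f)                # hill climb: commit to the addition
            best = gain
        return selected, best

    # toy scorer: reward subsets containing "a" and "b", penalize size
    toy = lambda s: ("a" in s) + ("b" in s) - 0.1 * len(s)
    print(forward_select(["a", "b", "c", "d"], toy))   # (['b', 'a'], 1.8)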

FSS types:

  • filters vs wrappers:

    • wrappers: use an actual target learner, e.g. WRAPPER
    • filters: study aspects of the data, e.g. all the rest
    • filters are faster!
    • wrappers exploit the bias of the target learner, so often perform better, when they terminate
      • they may not terminate on large data sets
  • solo vs combinations:

    • evaluate solo attributes: e.g. INFO GAIN, RELIEF
    • evaluate combinations: e.g. PCA, SVD, CFS, CBS, WRAPPER
    • solos can be faster than combinations
  • supervised vs unsupervised:

    • supervised methods use class values, unsupervised methods ignore them; e.g. PCA/SVD is unsupervised, the rest are supervised
  • numeric vs discrete search methods

    • ranker: for schemes that numerically score attributes, e.g. RELIEF, INFO GAIN

    • best first: for schemes that do heuristic search e.g. CBS, CFS, WRAPPER

Hall and Holmes:

This paper pre-discretizes numerics using entropy-based discretization.

INFO GAIN

  • often useful in high-dimensional problems
    • real simple to calculate
  • attributes scored based on info gain: H(C) - H(C|A)
  • Sort of like doing decision tree learning, just to one level.
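
A minimal sketch of that score in Python, assuming discrete data given as (attribute value, class) pairs:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum(c/n * log2(c/n) for c in Counter(labels).values())

    def info_gain(rows):                        # rows = [(value, class), ...]
        h_c = entropy([c for _, c in rows])     # H(C)
        h_c_a = 0.0
        for v in set(a for a, _ in rows):       # weighted H(C|A=v)
            sub = [c for a, c in rows if a == v]
            h_c_a += len(sub) / len(rows) * entropy(sub)
        return h_c - h_c_a                      # H(C) - H(C|A)

    # a perfectly predictive attribute scores the full H(C):
    print(info_gain([("x","yes"), ("x","yes"), ("y","no"), ("y","no")]))  # 1.0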

RELIEF

  • Kononenko97

  • useful attributes differentiate between instances from different classes

  • randomly pick some instances (here, 250)

  • find something similar, in another class

  • compute the distance from this one to the other one

  • Stochastic sampler: scales to large data sets.

  • Binary RELIEF (two-class systems), for "n" instances and weights on features "f" (a runnable sketch ends this section):

    set all weights W[f]=0
    for i = 1 to n; do
       randomly select instance R with class C
       find nearest hit H      // closest thing of same class
   find nearest miss M     // closest thing of different class
       for f = 1 to #features; do
           W[f] = W[f] - diff(f,R,H)/n + diff(f,R,M)/n
       done
    done
    
  • diff:

    • discrete attributes: 0 if same, 1 if not
    • continuous attributes: the absolute difference
    • normalized to 0..1
    • When values are missing, see Kononenko97, p4.
  • N-class RELIEF: not 1 near hit/miss, but k nearest hits, plus k nearest misses from each other class C

    W[f] = W[f] - ∑(i=1..k) diff(f,R,H_i) / (n*k)
                + ∑(C ≠ class(R)) ∑(i=1..k) [  P(C) / (1 - P(class(R)))
                                             * diff(f,R,M_i(C)) / (n*k) ]
    

    The P(C) / (1 - P(class(R)) expression is a normalization function that

    • demotes the effect of R from rare classes
    • and rewards the effect of near hits from common classes.
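
A minimal sketch of binary RELIEF in Python, assuming numeric features already normalized to 0..1 and rows given as (vector, class) pairs:

    import random

    def diff(f, a, b):                    # per-feature distance in [0,1]
        return abs(a[0][f] - b[0][f])

    def relief(rows, n=250):
        nf = len(rows[0][0])
        w = [0.0] * nf
        for _ in range(n):
            r = random.choice(rows)       # randomly selected instance R
            dist = lambda x: sum(diff(f, r, x) for f in range(nf))
            hit  = min((x for x in rows if x is not r and x[1] == r[1]), key=dist)
            miss = min((x for x in rows if x[1] != r[1]), key=dist)
            for f in range(nf):           # W[f] - diff(f,R,H)/n + diff(f,R,M)/n
                w[f] += (diff(f, r, miss) - diff(f, r, hit)) / n
        return w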

CBS (consistency-based evaluation)

  • Seek combinations of attributes that divide the data into subsets, each with a strong single-class majority.
    • Kind of like info gain, but with an emphasis on a single winner
  • Discrete attributes
  • Forward select to find subsets of attributes

WRAPPER

  • Forward select attributes
    • score each combination using a 5-way cross val
  • When wrapping, best to try different target learners
    • Check that we aren't over-exploiting the learner's bias
    • e.g. J48 and NB

PRINCIPAL COMPONENTS ANALYSIS (PCA)

(The traditional way to do FSS.)

  • Only unsupervised method studied here
  • Transform dimensions
  • Find the correlation matrix (the normalized covariance matrix): C[i,j] is the correlation of attributes i and j;
    • C[i,i]=1;
    • C[i,j]=C[j,i]
  • Find eigenvectors
  • Transform the original space to the eigenvectors
  • Rank them by the variance in their predictions
  • Report the top ranked vectors

  • Makes things easier, right? Well...

    if   domain1  <= 0.180 
    then NoDefects 
    else if domain1 > 0.180 
         then if domain1 <= 0.371 then NoDefects 
         else if domain1 > 0.371 then Defects 
    
    
    domain1 = 0.241 * loc     + 0.236 * v(g) 
            + 0.222 * ev(g)   + 0.236 * iv(g)     + 0.241 *  n 
            + 0.238 * v       - 0.086 * l         + 0.199  * d 
            + 0.216 * i       + 0.225 * e + 0.236 * b + 0.221  * t 
            + 0.241 * lOCode  + 0.179 * lOComment 
            + 0.221 * lOBlank + 0.158 * lOCodeAndComment 
            + 0.163 * uniqOp  + 0.234 * uniqOpnd 
            + 0.241 * totalOp + 0.241 * totalOpnd 
            + 0.236 * branchCount
    
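For the curious, a minimal PCA-by-hand sketch with numpy (standardize, eigen-decompose the correlation matrix, project onto the top-ranked eigenvectors; the data here is random and purely illustrative):

    import numpy as np

    def pca(X, k=1):
        Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize columns
        C = np.corrcoef(Z, rowvar=False)          # C[i,i]=1; C[i,j]=C[j,i]
        vals, vecs = np.linalg.eigh(C)            # eigenvectors of C
        order = np.argsort(vals)[::-1]            # rank by variance explained
        return Z @ vecs[:, order[:k]]             # the "domain1..k" scores

    X = np.random.rand(100, 5)
    print(pca(X, k=2).shape)                      # (100, 2)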

PCA vs LDA (linear discriminant analysis)

LDA = PCA + class knowledge

(Note: this LDA should not be confused with LDA as in latent Dirichlet allocation, which is currently all the rage in text mining; that LDA is not covered in this subject.)

Latent Semantic Indexing

  • Performing PCA is the equivalent of performing Singular Value Decomposition (SVD) on the data.

  • Any n * m matrix X (of n terms in m documents) can be rewritten as:

    • X = To * So * Do'
    • So is a diagonal matrix scoring attributes, top to bottom, most interesting to least interesting
    • We can shrink X by dumping the duller (lower) rows of So (see the SVD sketch at the end of this section)

  • Latent Semantic Indexing is a method for selecting informative subspaces of feature spaces.

  • It was developed for information retrieval to reveal semantic information from document co-occurrences.

  • Terms that did not appear in a document may still associate with a document.

  • LSI derives uncorrelated index factors that might be considered artificial concepts.

  • SVD is easy to perform in Matlab

    • Also, there is some C-code.
    • Also Java Classes available
      • class SingularValueDecomposition
        • Constructor: SingularValueDecomposition(Matrix Arg)
        • Methods: GetS(); GetU(); GetV(); (U,V correspond to T,D)
  • Be careful about using these tools blindly

    • It does no harm to understand what is going on!
  • The Matrix Cookbook

  • Note: major win for SVD/LSI: scales very well.

    • Research possibility: text mining for software engineering
      • typically very small corpora
      • so might we find a better FSS for text mining than SVD/LSI?
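
A minimal LSI sketch using numpy's SVD (the tiny term-by-document count matrix is illustrative): keep the top-k singular values of So and rebuild a rank-k approximation of X:

    import numpy as np

    X = np.array([[1, 0, 1],                    # terms x documents counts
                  [0, 1, 1],
                  [1, 1, 0]], dtype=float)

    T, s, Dt = np.linalg.svd(X, full_matrices=False)  # X = To * So * Do'
    k = 2
    Xk = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]  # dump the duller rows of So
    print(np.round(Xk, 2))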

CFS (correlation-based feature selection)

  • Scores highly those subsets whose attributes correlate strongly with the class and weakly with each other.
  • Numerator: how predictive
  • Denominator: how redundant
  • FIRST ranks correlation of solo attributes
  • THEN heuristic search to explore subsets
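
Hall's merit score for a subset of k features is usually written as merit = k*rcf / sqrt(k + k*(k-1)*rff), where rcf is the mean feature-class correlation and rff the mean feature-feature correlation. A minimal sketch:

    from math import sqrt

    def merit(k, rcf, rff):                     # predictive / redundant
        return k * rcf / sqrt(k + k * (k - 1) * rff)

    print(merit(2, 0.6, 0.1))   # 0.81: strong, mostly uncorrelated pair
    print(merit(3, 0.6, 0.8))   # 0.64: a redundant third feature drops merit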

And the winner is:

  • Wrapper! And if that is too slow...
  • CFS and Relief are the best all-round performers
    • CFS selects fewer features
  • Phew. Hall invented CFS

Other Methods

Other methods not explored by Hall and Holmes...

  • Note: the text mining literature has yet to make such an assessment. Usually, SVD rules. But see An Approach to Classify Software Maintenance Requests, from ICSM 2002, for a nice comparison of nearest neighbor, CART, Bayes classifiers, and some other information retrieval methods.

  • Using random forests for feature selection of the mth variable (a sketch follows this list):

    • randomly permute all values of the mth variable in the oob data
    • Put these altered oob x-values down the tree and get classifications.
    • Proceed as though computing a new internal error rate (i.e. run the classifier).
    • The amount by which this new error exceeds the original test set error is defined as the importance of the mth variable.
  • Use the Nomogram scores
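
A simplified sketch of that random forest recipe with scikit-learn; for brevity it permutes and rescores on the training data rather than strictly on the oob sample, and X, y are random, illustrative data:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    X, y = rng.random((200, 5)), rng.integers(0, 2, 200)
    rf = RandomForestClassifier().fit(X, y)
    base = rf.score(X, y)
    for m in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, m] = rng.permutation(Xp[:, m])         # permute the mth variable
        print(m, round(base - rf.score(Xp, y), 3))   # error increase = importance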

Range Selection

From before: use only sharp and relevant ranges.

Using Nomograms

  • Martin Možina, Janez Demšar, Michael Kattan, and Blaž Zupan. 2004. Nomograms for visualization of naive Bayesian classifier. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD '04)
  • if there are two classes best and rest with B and R examples
  • and x occurs with frequency n1,n2 in B,R
  • then the odds of x in best and rest are b=n1/B and r=n2/R
  • then the log of the odds ratio LOR = 0 if b ≤ r else log(b/r)
  • and if we normalize LOR to 0,1 min,max
  • then we can draw some pretty diagrams and do some simple what-if queries:

  • The reverse function, from LOR back to a probability for a class occurring with base probability p, is:
function points2p(lor, p) {
    E = 2.7182818285                                # Euler's number
    # invert: p' = 1 / (1 + e^(-(log-odds(p) + lor)))
    return 1 / (1 + E^(-1*log(p/(1 - p)) - lor)) }
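
The forward direction, from counts to LOR, in a minimal Python sketch (n1, n2 are the frequencies of x in each class; B, R the class sizes):

    from math import log

    def lor(n1, B, n2, R):
        b, r = n1 / B, n2 / R                  # odds of x in best and rest
        return 0 if b <= r else log(b / r)     # log of the odds ratio

    print(lor(8, 10, 9, 90))                   # x much more common in best: ~2.08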

Many variables have low nomogram scores

Best or rest

  • given a numeric class, sort the values then divide them into 90% rest and 10% best.
  • if there are two classes best and rest with B and R examples
  • and x occurs with frequency n1,n2 in B,R
  • then the odds of x in best and rest are b=n1/B and r=n2/R
  • then we "like" x at strength - s= b^2/(b+r) if b > r else 0 - the b^2 term muliples Bayes odds b/(b+r) times a support term that rewards things that occur more often
  • in the usual case, very rew ranges with b > r

Generate rules using just the ranges with s>0
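
A minimal sketch of that scoring (n1, n2 are the frequencies of range x in best and rest; B, R the class sizes):

    def like(n1, B, n2, R):
        b, r = n1 / B, n2 / R
        return b**2 / (b + r) if b > r else 0   # Bayes odds times support

    print(like(8, 10, 1, 90))    # common in best, rare in rest: ~0.79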

Instance Selection

Prototype Selection with Clusters

  • Step 1: Feature selection: sort columns by their InfoGain score; delete the bottom half.
  • Step 2: Cluster: return one example per centroid (a sketch follows below).

Reduction of 800 rows by 24 attributes to 5 attributes by 22 rows

For many data sets:

Note that for classification by weighted scores from the 2 nearest neighbors, the reduced data is as accurate as the full data.
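
A minimal sketch of step 2, assuming scikit-learn's KMeans and returning the row nearest each centroid; the shapes echo the 800x24-to-22x5 reduction above, after feature selection has already pruned the columns:

    import numpy as np
    from sklearn.cluster import KMeans

    def prototypes(X, k=22):
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        keep = []
        for c in range(k):                   # nearest row to each centroid
            rows = np.where(km.labels_ == c)[0]
            d = np.linalg.norm(X[rows] - km.cluster_centers_[c], axis=1)
            keep.append(rows[np.argmin(d)])
        return X[keep]

    X = np.random.rand(800, 5)
    print(prototypes(X).shape)               # (22, 5)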

Active Learning with Quick

Estimation data sets (a few dozen rows)

QUICK is an active learning method that assists in reducing the complexity of data interpretation by identifying the essential content of SEE (software effort estimation) datasets. QUICK works as follows:

  1. Group rows and columns by their similarity;
  2. Discard redundant columns (synonyms) that are too similar;
  3. Sort rows by their outlier score (i.e. ones that are too distant);
  4. Ask the expert's opinion about sorted rows 1..x;
  5. Run CART on rows 1..x, then check estimates on rows x+1..x+5;
  6. Stop when accuracy no longer improves.

Details

  • Input: a data matrix D (D may not have labels for all its rows; the expert supplies those on demand)
  • Normalize all ranges min..max to 0..1
  • Discard redundant columns (those that repeat information in other columns)
    • Transpose matrix (columns become rows);
      D1 = transpose(D)
    • Compute RNN = the reverse nearest neighbor score between "rows" (count popularity, i.e. how often row X is someone else's nearest neighbor; a sketch follows this list)
    • Reject the popular columns (contain repeated information)
      D2 = reject(D1)
  • Discard outlier rows (those that are too weird)
    • Transpose matrix (now columns are columns again);
      D3 = transpose(D2)
    • Compute row popularity again
    • sort the rows by their popularity (most popular ones are most informative)
      D4 = sort(D3)
  • Ask in sorted order
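
A minimal sketch of the RNN popularity score (how often each row is some other row's nearest neighbor; D is a random matrix, purely illustrative):

    import numpy as np

    def popularity(D):
        dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)       # a row is not its own neighbor
        nearest = dist.argmin(axis=1)        # each row's nearest neighbor
        return np.bincount(nearest, minlength=len(D))

    D = np.random.rand(20, 4)                # rows normalized to 0..1
    order = popularity(D).argsort()[::-1]    # most popular first
    print(order[:5])                         # ask the expert in this order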

Results #1: needs data on just a few projects.

Results #2: uses very little of the data.

Prototype Selection