docs/source/16_dimensionality_reduction.rst (8 additions, 21 deletions)
@@ -13,7 +13,7 @@ will allow you to perform several reduction methods and visualise the results as
Dimensionality reduction condenses large numbers of measurements into a more manageable number of components; this can
help to visualise results and identify clusters of objects and outliers.
To use the **Dimensionality Reduction Plot**, select a reduction method from the available choices and click
**Update Chart**. The different methods are explained further in the sections below. CPA will normalise measurements
before applying these methods.
@@ -36,30 +36,17 @@ tool open, you'll also see the option to *send the selected objects directly to
Reduction Methods
*****************
- **Principal Component Analysis (PCA)**: PCA attempts to generate a series of features which capture the variance of the original dataset. Measurements which vary in the same manner are collapsed towards a single new measurement, termed a *Principal Component*. On the resulting axis labels, CPA will also display the proportion of the original variance which is explained by each principal component. Components are sorted by their contribution to variance, so PC1 will always be the most significant feature.
- **Singular Value Decomposition (SVD)**: SVD is very similar to PCA, but does not center the data before processing. This can be much faster and more memory efficient than PCA when working with very large datasets, but a trade-off is that the resulting components will not be ordered by significance (i.e. PC1 may not be the most important feature).
- **Gaussian Random Projection (GRP)**: This method reduces the dimensionality of the dataset by projecting samples into fewer dimensions while preserving the pairwise distances between them. The random matrix used for projection is generated using a Gaussian distribution.
- **Sparse Random Projection (SRP)**: Similar to GRP, but uses a sparse matrix instead of a Gaussian one. This can be more memory efficient with large datasets.
- **Factor Analysis (FA)**: Like PCA, Factor Analysis generates a series of components which describe the variance of the dataset. However, with FA the variance in each direction within the input space can be modelled independently.
- **Feature Agglomeration (FAgg)**: This method utilises hierarchical clustering to group together features that behave similarly. The generated clusters can then be treated like components.
- **t-Distributed Stochastic Neighbor Embedding (t-SNE)**: t-SNE helps to visualise high-dimensional data by giving individual datapoints a coordinate on a 2D map, on which similar points are placed close together. The resulting clusters can help to visualise different object types within a dataset. (An illustrative code sketch follows this list.)
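For orientation, the sketch below shows roughly what two of these reductions look like outside CPA using scikit-learn. This is an illustration only, not CPA's internal code; the array sizes and parameter values are invented. Measurements are z-scored first, mirroring the normalisation CPA applies before reduction.

.. code-block:: python

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # Hypothetical per-object measurement matrix: 500 objects x 20 features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))

    # Normalise measurements (mean 0, standard deviation 1) before reduction.
    X_scaled = StandardScaler().fit_transform(X)

    # PCA: components are ordered by explained variance, so PC1 is always the most
    # significant; the ratios below are analogous to the proportions shown on CPA's axis labels.
    pca = PCA(n_components=2)
    pcs = pca.fit_transform(X_scaled)
    print("Explained variance ratio:", pca.explained_variance_ratio_)

    # t-SNE: embeds each object at a 2D coordinate, placing similar objects close together.
    embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)
    print(pcs.shape, embedding.shape)  # (500, 2) (500, 2)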
docs/source/5_classifier.rst (30 additions, 1 deletion)
@@ -292,10 +292,15 @@ V.B.7 Data preparation
Typically one wouldn't use the raw features as input for machine learning; instead, the data is first cleaned (e.g., by removing zero variance features) and normalized. Data preparation takes place before the machine learning is done, i.e., before training a classifier. Here we describe how you can perform these data preparation steps in CPA.
*Scaling*
*********
Features can be normalised and centered before training/classification by activating the Scaler option in *Advanced > Use Scaler*. The features are centered to have mean 0 and scaled to have standard deviation 1.
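In other words, each feature is z-scored. A minimal illustration of the same operation outside CPA (not CPA's code; the feature values are invented and assumed to already be in a NumPy array):

.. code-block:: python

    import numpy as np

    # Hypothetical per-object feature matrix: rows are objects, columns are measurements.
    features = np.array([[1.0, 200.0],
                         [2.0, 240.0],
                         [3.0, 280.0]])

    # Centre each feature to mean 0 and scale to standard deviation 1.
    scaled = (features - features.mean(axis=0)) / features.std(axis=0)
    print(scaled.mean(axis=0))  # approximately [0, 0]
    print(scaled.std(axis=0))   # [1, 1]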
*Normalization Tool*
********************
Outside the classifier, scaling can be done with the Normalization Tool. From the main menu, navigate to Tools > Normalization Tool. You can choose which features to normalize and save the resulting table for later use.
*Removing zero variance features*
*********************************
@@ -306,3 +311,27 @@ A zero variance feature is a feature that has the same entry for all objects, fo
***************
A standard procedure is finding features with NaN (not a number) entries in the data and removing those cells. CPA automatically ignores cells with NaNs, so this step has already been taken care of.
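For readers preparing data outside CPA, a small pandas sketch of these two steps (removing zero variance features and rows with NaN entries) might look like the following; the table and column names are hypothetical.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Hypothetical per-object measurement table.
    df = pd.DataFrame({
        "area":      [120.0, 98.0, np.nan, 150.0],
        "intensity": [0.40, 0.55, 0.61, 0.47],
        "plate_id":  [1.0, 1.0, 1.0, 1.0],   # zero variance: same entry for every object
    })

    # Drop zero variance features (columns with the same entry for all objects).
    df = df.loc[:, df.nunique(dropna=False) > 1]

    # Drop objects (rows) containing NaN entries; CPA handles this automatically
    # by ignoring such cells.
    df = df.dropna()
    print(df)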
V.B.8 Classifier types
----------------------
CPA supports several different classifier types:
- **RandomForest**: Produces a series of decision tree classifiers and uses averaging across all trees to generate predictions.
- **AdaBoost**: Fits a series of weak learners (simple classification rules which don't perform well alone). The input data is adjusted after each cycle to add weight to samples which the previous learner classified incorrectly. As learners are added, examples that are difficult to predict receive increasing influence. A final prediction is generated from a weighted majority vote from all learners.
- **SVC**: Support Vector Classification. This technique considers all features and attempts to generate dividing boundaries in the multi-dimensional feature space (termed "hyperplanes") which will distinguish between classes.
- **GradientBoosting**: This takes a similar approach to AdaBoost, but uses gradients instead of weights to make adjustments to the importance of individual samples.
- **LogisticRegression**: Classifies objects via logistic regression. A logistic function is fitted to the measurements to estimate the probability that an object belongs to each class, and these probabilities define the decision boundaries.
- **LDA**: Linear Discriminant Analysis. This method projects the input data to a linear subspace consisting of the directions which maximize the separation between classes, then establishes a boundary which discriminates the classes.
- **KNeighbors**: Classifies based on the majority class of the nearest *k* known samples, i.e. the classification of a test object is inferred from the training points closest to it.
- **FastGentleBoosting**: A modification of the AdaBoost classification strategy, with optimisations for working with limited training data.
- **Neural Network**: Generates a multi-layer perceptron neural network. Layers of neurons link each input feature to the output features. Each neuron generates a signal based on its inputs and the weighting from each source. The user can customise the number of intermediate 'hidden' layers between the input (measurement) and output (class) neurons. Additional hidden layers can help to generate more complex classifications. Neuron count per layer should generally be set to between the number of classes and the number of input features. (An illustrative code sketch follows this list.)
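Most of these names correspond to scikit-learn estimators (FastGentleBoosting is CPA's own boosting implementation). As a rough, illustrative sketch of how such classifiers behave on labelled object measurements, not CPA's actual training code, with invented data and parameter values:

.. code-block:: python

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score

    # Hypothetical training set: 200 objects, 10 measurements, 2 classes.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

    # RandomForest: averages predictions across many decision trees.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    print("RandomForest accuracy:", cross_val_score(forest, X, y, cv=5).mean())

    # Neural Network (multi-layer perceptron): one hidden layer sized between the
    # number of classes (2) and the number of input features (10), as suggested above.
    mlp = MLPClassifier(hidden_layer_sizes=(6,), max_iter=2000, random_state=0)
    print("Neural network accuracy:", cross_val_score(mlp, X, y, cv=5).mean())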