Commit a0a73b0 (parent 76545d5)

Author: David Stirling
Commit message: More docs

2 files changed: 38 additions & 22 deletions

docs/source/16_dimensionality_reduction.rst
Lines changed: 8 additions & 21 deletions
@@ -13,7 +13,7 @@ will allow you to perform several reduction methods and visualise the results as
 Dimensionality reduction condenses large numbers of measurements into a more manageable number of components; this can
 help to visualise results and identify clusters of objects and outliers.
 
-To use the **Dimenaionality Reduction Plot**, select a reduction method from the available choices and click
+To use the **Dimensionality Reduction Plot**, select a reduction method from the available choices and click
 **Update Chart**. The different methods are explained further in the sections below. CPA will normalise measurements
 before applying these methods.

@@ -36,30 +36,17 @@ tool open, you'll also see the option to *send the selected objects directly to
 Reduction Methods
 *****************
 
-- **Principal Component Analysis (PCA)**: PCA attempts to generate a series of features which capture the variance of
-  the original dataset. Measurements which vary in the same manner are collapsed towards a single new measurement, termed
-  a *Principal Component*. On the resulting axis labels, CPA will also display the proportion of the original variance
-  which is explained by each principal component. Components are sorted by their contribution to variance, so PC1 will
-  always be the most significant feature.
+- **Principal Component Analysis (PCA)**: PCA attempts to generate a series of features which capture the variance of the original dataset. Measurements which vary in the same manner are collapsed towards a single new measurement, termed a *Principal Component*. On the resulting axis labels, CPA will also display the proportion of the original variance which is explained by each principal component. Components are sorted by their contribution to variance, so PC1 will always be the most significant feature.
 
-- **Singular Value Decomposition (SVD)**: SVD is very similar to PCA, but does not center the data before processing.
-  This can be much faster and more memory efficient than PCA when working with very large datasets, but a trade-off is
-  that the resulting components will not be ordered by significance (i.e. PC1 may not be the most important feature).
+- **Singular Value Decomposition (SVD)**: SVD is very similar to PCA, but does not center the data before processing. This can be much faster and more memory efficient than PCA when working with very large datasets, but a trade-off is that the resulting components will not be ordered by significance (i.e. PC1 may not be the most important feature).
 
-- **Gaussian Random Projection (GRP)**: This method reduces the dimensionality of the dataset by projecting samples into
-  fewer dimensions while preserving the pairwise distances between them. The random matrix used for projection is
-  generated using a Gaussian distribution.
+- **Gaussian Random Projection (GRP)**: This method reduces the dimensionality of the dataset by projecting samples into fewer dimensions while preserving the pairwise distances between them. The random matrix used for projection is generated using a Gaussian distribution.
 
-- **Sparse Random Projection (SRP)**: Similar to GRP, but uses a sparse matrix instead of a Gaussian one. This can be
-  more memory efficient with large datasets.
+- **Sparse Random Projection (SRP)**: Similar to GRP, but uses a sparse matrix instead of a Gaussian one. This can be more memory efficient with large datasets.
 
-- **Factor Analysis (FA)**: Like PCA, Factor Analysis generates a series of components which describe the variance
-  of the dataset. However, with FA the variance in each direction within the input space can be modelled independently.
+- **Factor Analysis (FA)**: Like PCA, Factor Analysis generates a series of components which describe the variance of the dataset. However, with FA the variance in each direction within the input space can be modelled independently.
 
-- **Feature Agglomeration (FAgg)**: This method utilises hierarchical clustering to group together features that behave
-  similarly. The generated clusters can then be treated like components.
+- **Feature Agglomeration (FAgg)**: This method utilises hierarchical clustering to group together features that behave similarly. The generated clusters can then be treated like components.
 
-- **t-Distributed Stochastic Neighbor Embedding (t-SNE)**: t-SNE helps to visualise high dimensional data by giving
-  individual datapoints a coordinate on a 2D map, on which similar points are placed close together. The resulting
-  clusters can help to visualise different object types within a dataset.
+- **t-Distributed Stochastic Neighbor Embedding (t-SNE)**: t-SNE helps to visualise high dimensional data by giving individual datapoints a coordinate on a 2D map, on which similar points are placed close together. The resulting clusters can help to visualise different object types within a dataset.
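The PCA behaviour described in the documentation above (data centered first, components sorted by variance, with the explained-variance ratio reported for each component) can be illustrated with a small NumPy sketch. This is only an illustration of the technique, not CPA's implementation; the function name `pca_reduce` is invented for this example.

```python
import numpy as np

def pca_reduce(measurements, n_components=2):
    """Reduce a (samples x features) matrix to n_components principal components.

    Returns the projected coordinates and the proportion of the original
    variance explained by each component (the figure CPA shows on axis labels).
    """
    # Center each feature; unlike plain SVD, PCA subtracts the mean first.
    centered = measurements - measurements.mean(axis=0)
    # Singular value decomposition of the centered data.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Project onto the top components (rows of V^T).
    projected = centered @ vt[:n_components].T
    # Explained variance ratio; SVD returns singular values in descending
    # order, so PC1 is always the most significant component.
    explained = (s ** 2) / np.sum(s ** 2)
    return projected, explained[:n_components]

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
coords, ratios = pca_reduce(data, n_components=2)
```

Skipping the centering step turns this into the SVD variant described above, which is cheaper on large datasets but loses the guarantee that components are ordered by significance.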

docs/source/5_classifier.rst

Lines changed: 30 additions & 1 deletion
@@ -292,10 +292,15 @@ V.B.7 Data preparation
 
 Typically one wouldn't use the raw features as input for the machine learning, but the data is cleaned in some ways (e.g., by removing zero variance features) and normalized. Data preparation takes place before the machine learning is done, i.e., before training a classifier. We here describe how you can perform data preparation steps in CPA.
 
+*Scaling*
+*********
+
+Features can be normalised and centered before training/classification by activating the Scaler option in *Advanced > Use Scaler*. The features are centered to have mean 0 and scaled to have standard deviation 1.
+
 *Normalization Tool*
 ********************
 
-Typically the features are normalized before training a classifier. For example, the features are centered to have mean 0 and scaled to have standard deviation 1. This can be done in CPA with the Normalization Tool. From the main menu, navigate to Tools > Normalization Tool. You can choose which features to normalize.
+Outside the classifier, scaling can be done with the Normalization Tool. From the main menu, navigate to Tools > Normalization Tool. You can choose which features to normalize and save the resulting table for later use.
 
 *Removing zero variance features*
 *********************************
@@ -306,3 +311,27 @@ A zero variance feature is a feature that has the same entry for all objects, fo
 ***************
 
 A standard procedure is finding features with NAN (not a number) entries in the data and removing those cells. CPA automatically ignores cells with NANs, so this step has already been taken care of.
+
+
+V.B.8 Classifier types
+----------------------
+
+CPA supports several different classifier types:
+
+- **RandomForest**: Produces a series of decision tree classifiers and uses averaging across all trees to generate predictions.
+
+- **AdaBoost**: Fits a series of weak learners (simple classification rules which don't perform well alone). The input data is adjusted after each cycle to add weight to samples which the previous learner classified incorrectly. As learners are added, examples that are difficult to predict receive increasing influence. A final prediction is generated from a weighted majority vote across all learners.
+
+- **SVC**: Support Vector Classification. This technique considers all features and attempts to generate multi-dimensional dividing planes (termed "hyperplanes") which will distinguish between classes.
+
+- **GradientBoosting**: This takes a similar approach to AdaBoost, but uses gradients instead of weights to adjust the importance of individual samples.
+
+- **LogisticRegression**: Classifies objects via logistic regression. Classifications are made based on a series of curves corresponding to decision boundaries.
+
+- **LDA**: Linear Discriminant Analysis. This method projects the input data onto a linear subspace consisting of the directions which maximize the separation between classes, then establishes a boundary which discriminates between the classes.
+
+- **KNeighbors**: Classifies based on the majority class of the nearest *k* known samples. Classification is inferred from the training points nearest to the test sample.
+
+- **FastGentleBoosting**: A modification of the AdaBoost classification strategy, with optimisations for working with limited training data.
+
+- **Neural Network**: Generates a multi-layer perceptron neural network. Layers of neurons link each input feature to output features. Each neuron generates a signal based on its inputs and the weighting from each source. The user can customise the number of intermediate 'hidden' layers between the input (measurement) and output (class) neurons. Additional hidden layers can help to generate more complex classifications. The neuron count per layer should generally be set between the number of classes and the number of input features.
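Of the classifier types added above, KNeighbors is the simplest to illustrate: it takes a majority vote over the *k* training samples closest to the test point. Below is a minimal NumPy sketch of that idea, not CPA's actual implementation; `knn_predict` and the toy data are invented for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, sample, k=3):
    """Classify one sample by majority vote of its k nearest training points."""
    # Euclidean distance from the sample to every training point.
    distances = np.linalg.norm(train_x - sample, axis=1)
    # Indices of the k closest training samples.
    nearest = np.argsort(distances)[:k]
    # Majority class among those neighbours.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: two well-separated clusters of labelled objects.
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
train_y = np.array(["negative", "negative", "positive", "positive"])
label = knn_predict(train_x, train_y, np.array([4.8, 5.1]), k=3)
```

Because distances are computed in the raw feature space, this classifier is particularly sensitive to feature scale, which is one motivation for the scaling step described in the data preparation section.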
