
ALPINE Optimization

The ALPINE package offers a comprehensive set of tools for optimizing the ALPINE model, enhancing its performance and making it easier to use. This package includes functions for running, extending, and saving optimization results, helping users streamline their workflows with the ALPINE model.

Objective of the Optimization

The objective of the optimization process is to minimize the Adjusted Rand Index (ARI) and Homogeneity Score (HS) of the guided covariates. The ARI measures the similarity between two partitions of a dataset, providing an indication of clustering quality, while the HS evaluates the homogeneity within clusters, monitoring the effectiveness of mixing. By minimizing these scores, the goal is to identify the set of hyperparameters that yields the best clustering and mixing results.

$$\text{min} \frac{1}{c}\sum^c_{i=1} (\text{ARI}_i + \text{HS}_i)$$

The objective function is defined as the sum of the ARI and HS scores, normalized by the number of guided covariates (c). Its values range from 0 to 2, with 0 being the best. Note that the objective function does not consider the reconstruction error, which depends heavily on the total number of components in the model: more components lead to a lower reconstruction error.
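
To make the normalization concrete, the objective can be computed from per-covariate (ARI, HS) pairs as in the sketch below. The scores here are made-up illustrative values, not output from ALPINE:

```python
# Hypothetical per-covariate (ARI, HS) scores for c = 3 guided covariates;
# in practice these come from ALPINE's evaluation, not from this snippet.
scores = [(0.12, 0.30), (0.05, 0.22), (0.20, 0.15)]

# Objective: mean of (ARI_i + HS_i) over the c covariates; 0 is best, 2 is worst.
objective = sum(ari + hs for ari, hs in scores) / len(scores)
```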

Create an Optimization Object

To create an optimization object, you can use the following code:

from ALPINE import ComponentOptimizer

co = ComponentOptimizer(
    adata = adata,
    covariate_keys = ['batch', 'condition', 'severity'],
    loss_type='kl-divergence',
    max_iter=None,
    batch_size=None,
    gpu=True,
    random_state=None
)

The ComponentOptimizer requires only the adata object and the covariate_keys. All other hyperparameters are optional and can be automatically set by ALPINE. The covariate_keys parameter should be a list of covariates to be optimized, ensuring that these covariates are present in the adata.obs data frame.

Note: It is strongly recommended to set the random_state to a fixed value to ensure reproducibility of the results. This makes each iteration of the optimization process more comparable, as differences in matrix initialization can lead to variations in the results.

Setting Hyperparameter and Running the Optimization

To set the hyperparameters and run the optimization process, you can use the following code:

co.bayesian_search(
    n_total_components_range=(50, 100),
    lam_power_range=(2, 6),
    alpha_W_range=(0, 1),
    orth_W_range=(0, 0.5),
    l1_ratio_range=(0, 1),
    weight_reduce_covar_dims=0,
    n_splits=None,
    max_evals=50,
    min_components=None,
    trials_filename=None  # see "Extending the Optimization Process" below
)

Hyperparameters:

  • n_total_components_range: The range for the total number of components to be optimized.
  • lam_power_range: The range for the power of the regularization parameter (e.g., 10^2 to 10^6).
  • alpha_W_range: The range for the elastic-net regularization parameter (alpha_W).
  • orth_W_range: The range for the orthogonal regularization parameter (orth_W).
  • l1_ratio_range: The range for the l1_ratio regularization parameter.
  • weight_reduce_covar_dims: The weight controlling how strongly the number of components assigned to the guided covariates is reduced. A larger weight can lead the model to slightly overfit on the guided covariates.
  • n_splits: Number of folds for stratified cross-validation. The default is None, which uses the entire dataset.
  • max_evals: The maximum number of evaluations to be performed.
  • min_components: The minimum number of guided covariate components to be optimized. By default, this is set to the minimum number of categories for each guided covariate. Users can specify a list of values (e.g., [3, 3, 3]).
  • trials_filename: If provided, the trials file will be loaded and the optimization process will continue from the last saved trial.
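
Note that lam_power_range is searched on a log scale: a sampled power p corresponds to a regularization strength of 10^p, as the "(e.g., 10^2 to 10^6)" in the list above indicates. A quick sanity check of the bounds implied by the default range, in plain Python and independent of ALPINE:

```python
# lam_power_range=(2, 6) means the regularization strength is sampled
# as 10**p with p drawn from [2, 6].
lam_power_range = (2, 6)
lam_min = 10 ** lam_power_range[0]  # smallest strength searched
lam_max = 10 ** lam_power_range[1]  # largest strength searched
```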

Saving Optimization Results

To save the optimization results, you can use the following code:

co.save_trials(filename='./trials.pkl')

Extending the Optimization Process

Option 1: Extending the Optimization Process directly

You can extend the optimization process by adding more evaluations using the following code:

co.extend_training(extra_evals=30)

Option 2: Load the trials file and continue the optimization process

You can load the trials file and continue the optimization process by using the following code:

co.bayesian_search(
    n_total_components_range=(50, 100),
    lam_power_range=(2, 6),
    alpha_W_range=(0, 1),
    orth_W_range=(0, 0.5),
    l1_ratio_range=(0, 1),
    weight_reduce_covar_dims=0,
    max_evals=50,
    min_components=None,
    trials_filename='./trials.pkl'
)

In short, passing the trials_filename parameter loads the saved trials file and continues the optimization process from where it left off.

Useful functions

Check the optimization history (sorted):

If you want to check the optimization history, you can use the following code:

train_hist_df = co.get_train_history()

The train_hist_df is a pandas DataFrame containing the training history of the optimization process, already sorted by score (the lower, the better).
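
To illustrate what "sorted by score" means here, the snippet below mimics the idea with plain Python; the record fields are illustrative, not ALPINE's actual column names:

```python
# Illustrative trial records; co.get_train_history() returns a pandas
# DataFrame, but the ordering logic is the same: ascending score.
history = [
    {"trial": 0, "score": 0.41},
    {"trial": 1, "score": 0.28},
    {"trial": 2, "score": 0.35},
]
history_sorted = sorted(history, key=lambda row: row["score"])
best = history_sorted[0]  # the trial with the lowest objective value
```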

Get the hyperparameters from the training history:

After checking the training history, you can get the hyperparameters from the training history using the following code:

params = co.get_hyperparameter(idx)

where idx is the index of the hyperparameters in the training history.

Fit the model with the best hyperparameters:

You can fit the model with the best hyperparameters using the following code:

alpine_model = co.fit_the_best_param()

Notes

Due to the current implementation, all hyperparameters must be optimized simultaneously, which prevents users from setting specific fixed values for individual parameters. To approximate a fixed value effect for a hyperparameter, you can define a very narrow range, such as (10, 10 + 1e-5), to achieve similar results.
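
To see why a narrow range effectively pins a value, consider sampling uniformly from such an interval. This is a plain-Python sketch of the idea, independent of ALPINE's internal sampler:

```python
import random

random.seed(0)  # fixed seed for reproducibility
lo, hi = 10, 10 + 1e-5  # a "nearly fixed" range for a hyperparameter
samples = [random.uniform(lo, hi) for _ in range(1000)]

# Every draw lies within 1e-5 of 10, so the optimizer
# effectively sees a constant value for this hyperparameter.
max_deviation = max(abs(s - 10) for s in samples)
```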