diff --git a/README.md b/README.md index b8189307..1cfbaf36 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,10 @@ - - - - Fallback image description - +
+ + + + Fallback image description + +
--- @@ -14,111 +16,55 @@ [![Static Badge](https://img.shields.io/badge/PyTorch-%3E%3D2.3-blue?logo=pytorch&logoColor=white)](https://pytorch.org/) [![Static Badge](https://img.shields.io/badge/Discord%20-%20community%20-%20%235865F2?logo=discord&logoColor=%23FFFFFF&label=Discord)](https://discord.gg/76KkRnb3nk) -TorchJD is a library extending autograd to enable -[Jacobian descent](https://arxiv.org/pdf/2406.16232) with PyTorch. It can be used to train neural -networks with multiple objectives. In particular, it supports multi-task learning, with a wide -variety of aggregators from the literature. It also enables the instance-wise risk minimization -paradigm. The full documentation is available at [torchjd.org](https://torchjd.org), with several -usage examples. - -## Jacobian descent (JD) -Jacobian descent is an extension of gradient descent supporting the optimization of vector-valued -functions. This algorithm can be used to train neural networks with multiple loss functions. In this -context, JD iteratively updates the parameters of the model using the Jacobian matrix of the vector -of losses (the matrix stacking each individual loss' gradient). For more details, please refer to -Section 2.1 of the [paper](https://arxiv.org/pdf/2406.16232). - -### How does this compare to averaging the different losses and using gradient descent? - -Averaging the losses and computing the gradient of the mean is mathematically equivalent to -computing the Jacobian and averaging its rows. However, this approach has limitations. If two -gradients are conflicting (they have a negative inner product), simply averaging them can result in -an update vector that is conflicting with one of the two gradients. Averaging the losses and making -a step of gradient descent can thus lead to an increase of one of the losses. 
- -This is illustrated in the following picture, in which the two objectives' gradients $g_1$ and $g_2$ -are conflicting, and averaging them gives an update direction that is detrimental to the first -objective. Note that in this picture, the dual cone, represented in green, is the set of vectors -that have a non-negative inner product with both $g_1$ and $g_2$. - -![image](docs/source/_static/gradients_cone_projections_upgrad_mean.svg) - -With Jacobian descent, $g_1$ and $g_2$ are computed individually and carefully aggregated using an -aggregator $\mathcal A$. In this example, the aggregator is the Unconflicting Projection of -Gradients $\mathcal A_{\text{UPGrad}}$: it -projects each gradient onto the dual cone, and averages the projections. This ensures that the -update will always be beneficial to each individual objective (given a sufficiently small step -size). In addition to $\mathcal A_{\text{UPGrad}}$, TorchJD supports -[more than 10 aggregators from the literature](https://torchjd.org/stable/docs/aggregation). +TorchJD is a PyTorch library for training neural networks with **multiple losses**. It supports +two complementary approaches: + +- **Scalarization**: combine losses into a single scalar before backprop, using methods from the + literature (geometric mean, softmax weighting, [etc.](#supported-scalarizers)). This is often a good baseline. +- **[Jacobian descent](https://arxiv.org/pdf/2406.16232)**: compute the Jacobian matrix of losses + with respect to parameters and aggregate it into an update direction using state-of-the-art + aggregators (UPGrad, MGDA, CAGrad, [and many more](#supported-aggregators-and-weightings)). + This in particular allows taking conflict-free + optimization directions, which can resolve problems that may be impossible to solve with standard + scalarizers. + +The full documentation is available at [torchjd.org](https://torchjd.org). 
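The difference between the two approaches comes down to gradient conflict: two gradients conflict when their inner product is negative, and averaging them can then give a direction that conflicts with one of them. The sketch below illustrates this in plain Python with made-up 2-D gradients. It does not call TorchJD, and the helper names `dot` and `mean_direction` are hypothetical, introduced only for this illustration.

```python
def dot(u, v):
    """Inner product of two vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def mean_direction(gradients):
    """Average the gradients coordinate-wise (what averaging the losses amounts to)."""
    n = len(gradients)
    return [sum(column) / n for column in zip(*gradients)]


# Illustrative gradients of two losses with respect to two parameters (made-up values).
g1 = [1.0, 0.0]
g2 = [-3.0, 1.0]

# The two gradients conflict: their inner product is negative.
gradients_conflict = dot(g1, g2) < 0

# Averaging yields a direction that still conflicts with g1 but not with g2,
# so a gradient descent step along it can increase the first loss.
update = mean_direction([g1, g2])
conflicts_with_g1 = dot(update, g1) < 0
conflicts_with_g2 = dot(update, g2) < 0
```

An aggregator designed for conflict-free directions would instead return a vector whose inner product with both `g1` and `g2` is non-negative.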
## Installation + TorchJD can be installed directly with pip: ```bash pip install "torchjd[quadprog_projector]" ``` + This includes the dependencies required by UPGrad and DualProj. Some other aggregators may have additional dependencies. Please refer to the [installation documentation](https://torchjd.org/stable/installation) for them. ## Usage -Compared to standard `torch`, `torchjd` simply changes the way to obtain the `.grad` fields of your -model parameters. - -### Using the `autojac` engine - -The autojac engine is for computing and aggregating Jacobians efficiently. - -#### 1. `backward` + `jac_to_grad` -In standard `torch`, you generally combine your `losses` into a single scalar `loss`, and call -`loss.backward()` to compute the gradient of the loss with respect to each model parameter and to -store it in the `.grad` fields of those parameters. The basic usage of `torchjd` is to replace this -`loss.backward()` by a call to -[`torchjd.autojac.backward(losses)`](https://torchjd.org/stable/docs/autojac/backward/). Instead of -computing the gradient of a scalar loss, it will compute the Jacobian of a vector of losses, and -store it in the `.jac` fields of the model parameters. You then have to call -[`torchjd.autojac.jac_to_grad`](https://torchjd.org/stable/docs/autojac/jac_to_grad/) to aggregate -this Jacobian using the specified -[`Aggregator`](https://torchjd.org/stable/docs/aggregation#torchjd.aggregation.Aggregator), and to -store the result into the `.grad` fields of the model parameters. See this -[usage example](https://torchjd.org/stable/examples/basic_usage/) for more details. - -#### 2. `mtl_backward` + `jac_to_grad` -In the case of multi-task learning, an alternative to -[`torchjd.autojac.backward`](https://torchjd.org/stable/docs/autojac/backward/) is -[`torchjd.autojac.mtl_backward`](https://torchjd.org/stable/docs/autojac/mtl_backward/). 
It computes -the gradient of each task-specific loss with respect to the corresponding task's parameters, and -stores it in their `.grad` fields. It also computes the Jacobian of the vector of losses with -respect to the shared parameters and stores it in their `.jac` field. Then, the -[`torchjd.autojac.jac_to_grad`](https://torchjd.org/stable/docs/autojac/jac_to_grad/) function can -be called to aggregate this Jacobian and replace the `.jac` fields by `.grad` fields for the shared -parameters. - -The following example shows how to use TorchJD to train a multi-task model with Jacobian descent, -using [UPGrad](https://torchjd.org/stable/docs/aggregation/upgrad/). +### Scalarization + +Scalarization methods combine losses into a single scalar before backprop. Here is how to change +a standard training loop to use scalarization: ```diff import torch from torch.nn import Linear, MSELoss, ReLU, Sequential from torch.optim import SGD -+ from torchjd.autojac import jac_to_grad, mtl_backward -+ from torchjd.aggregation import UPGrad ++ from torchjd.scalarization import GeometricMean shared_module = Sequential(Linear(10, 5), ReLU(), Linear(5, 3), ReLU()) task1_module = Linear(3, 1) task2_module = Linear(3, 1) - params = [ - *shared_module.parameters(), - *task1_module.parameters(), - *task2_module.parameters(), - ] + params = [*shared_module.parameters(), *task1_module.parameters(), *task2_module.parameters()] loss_fn = MSELoss() optimizer = SGD(params, lr=0.1) -+ aggregator = UPGrad() ++ scalarizer = GeometricMean() inputs = torch.randn(8, 16, 10) # 8 batches of 16 random input vectors of length 10 task1_targets = torch.randn(8, 16, 1) # 8 batches of 16 targets for the first task @@ -126,126 +72,123 @@ using [UPGrad](https://torchjd.org/stable/docs/aggregation/upgrad/). 
for input, target1, target2 in zip(inputs, task1_targets, task2_targets): features = shared_module(input) - output1 = task1_module(features) - output2 = task2_module(features) - loss1 = loss_fn(output1, target1) - loss2 = loss_fn(output2, target2) + loss1 = loss_fn(task1_module(features), target1) + loss2 = loss_fn(task2_module(features), target2) - loss = loss1 + loss2 - loss.backward() -+ mtl_backward([loss1, loss2], features=features) -+ jac_to_grad(shared_module.parameters(), aggregator) ++ loss = scalarizer(torch.stack([loss1, loss2])) ++ loss.backward() optimizer.step() optimizer.zero_grad() ``` -> [!NOTE] -> In this example, the Jacobian is only with respect to the shared parameters. The task-specific -> parameters are simply updated via the gradient of their task’s loss with respect to them. - -> [!TIP] -> Once your model parameters all have a `.grad` field, it's the role of the -> [optimizer](https://docs.pytorch.org/docs/stable/optim.html#torch.optim.Optimizer) to update the -> parameters values. This is exactly the same as in standard `torch`. - -#### 3. `jac` - -If you're simply interested in computing Jacobians without storing them in the `.jac` fields, you -can also use the [`torchjd.autojac.jac`](https://torchjd.org/stable/docs/autojac/jac/) function, -that is analog to -[`torch.autograd.grad`](https://docs.pytorch.org/docs/stable/generated/torch.autograd.grad.html), -except that it computes the Jacobian of a vector of losses rather than the gradient of a scalar -loss. - -### Using the `autogram` engine +### Jacobian descent -The Gramian of the Jacobian, defined as the Jacobian multiplied by its transpose, contains all the -dot products between individual gradients. It thus contains all the information about conflict and -gradient imbalance. It turns out that most aggregators from the literature -(e.g. 
[UPGrad](https://torchjd.org/stable/docs/aggregation/upgrad/)) make a linear combination of -the rows of the Jacobian, whose weights only depend on the Gramian of the Jacobian. - -An alternative implementation of Jacobian descent is thus to: -- Compute this Gramian incrementally (layer by layer), without ever storing the full Jacobian in - memory. -- Extract the weights from it using a - [`Weighting`](https://torchjd.org/stable/docs/aggregation#torchjd.aggregation.Weighting). -- Combine the losses using those weights and make a step of gradient descent on the combined loss. - -The main advantage of this approach is to save memory because the Jacobian (that is typically large) -never has to be stored in memory. The -[`torchjd.autogram.Engine`](https://torchjd.org/stable/docs/autogram/engine/) is precisely made to -compute the Gramian of the Jacobian efficiently. - -The following example shows how to use the `autogram` engine to minimize the vector of per-instance -losses with Jacobian descent using [UPGrad](https://torchjd.org/stable/docs/aggregation/upgrad/). +Jacobian descent computes per-loss gradients individually and aggregates them into a single update +direction. Some aggregators, like [UPGrad](https://torchjd.org/stable/docs/aggregation/upgrad/), +are specifically designed to find directions that are beneficial to all losses simultaneously. 
+Here is how to change a standard multi-task training loop to use Jacobian descent: ```diff import torch from torch.nn import Linear, MSELoss, ReLU, Sequential from torch.optim import SGD -+ from torchjd.autogram import Engine -+ from torchjd.aggregation import UPGradWeighting - - model = Sequential(Linear(10, 5), ReLU(), Linear(5, 3), ReLU(), Linear(3, 1), ReLU()) ++ from torchjd.autojac import jac_to_grad, mtl_backward ++ from torchjd.aggregation import UPGrad -- loss_fn = MSELoss() -+ loss_fn = MSELoss(reduction="none") - optimizer = SGD(model.parameters(), lr=0.1) + shared_module = Sequential(Linear(10, 5), ReLU(), Linear(5, 3), ReLU()) + task1_module = Linear(3, 1) + task2_module = Linear(3, 1) + params = [*shared_module.parameters(), *task1_module.parameters(), *task2_module.parameters()] -+ weighting = UPGradWeighting() -+ engine = Engine(model, batch_dim=0) + loss_fn = MSELoss() + optimizer = SGD(params, lr=0.1) ++ aggregator = UPGrad() inputs = torch.randn(8, 16, 10) # 8 batches of 16 random input vectors of length 10 - targets = torch.randn(8, 16) # 8 batches of 16 targets for the first task + task1_targets = torch.randn(8, 16, 1) # 8 batches of 16 targets for the first task + task2_targets = torch.randn(8, 16, 1) # 8 batches of 16 targets for the second task - for input, target in zip(inputs, targets): - output = model(input).squeeze(dim=1) # shape [16] -- loss = loss_fn(output, target) # shape [1] -+ losses = loss_fn(output, target) # shape [16] + for input, target1, target2 in zip(inputs, task1_targets, task2_targets): + features = shared_module(input) + loss1 = loss_fn(task1_module(features), target1) + loss2 = loss_fn(task2_module(features), target2) +- loss = loss1 + loss2 - loss.backward() -+ gramian = engine.compute_gramian(losses) # shape: [16, 16] -+ weights = weighting(gramian) # shape: [16] -+ losses.backward(weights) ++ mtl_backward([loss1, loss2], features=features) ++ jac_to_grad(shared_module.parameters(), aggregator) optimizer.step() 
optimizer.zero_grad() ``` -You can even go one step further by considering the multiple tasks and each element of the batch -independently (Instance-Wise Multitask Learning). See [this example](https://torchjd.org/stable/examples/iwmtl/) for more details. - -More usage examples can be found [here](https://torchjd.org/stable/examples/). +### The `autojac` engine + +The [`autojac` engine](https://torchjd.org/stable/docs/autojac/) provides fine-grained control +over Jacobian computation and aggregation. It lets you compute Jacobians with respect to specific +layers or activations (partial Jacobian descent), store them in `.jac` fields for inspection, and +apply any aggregator independently. See the [autojac examples](https://torchjd.org/stable/examples/) +for more details. + +### The `autogram` engine + +TorchJD also provides the [`autogram` engine](https://torchjd.org/stable/docs/autogram/engine/), +which computes the Gramian of the Jacobian incrementally without ever storing the full Jacobian in +memory. This makes Jacobian descent feasible on large models where the full Jacobian would be too +expensive to store. See the [autogram examples](https://torchjd.org/stable/examples/) for more +details. + +More usage examples, including instance-wise risk minimization and partial Jacobian descent, can be +found [in the docs](https://torchjd.org/stable/examples/). 
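To make the `autogram` idea above more concrete: the Gramian of the Jacobian (the Jacobian multiplied by its transpose) can be accumulated block by block, because the Gramian of a column-wise concatenation of blocks equals the sum of the blocks' Gramians. The following is a minimal plain-Python sketch of that identity with made-up per-layer Jacobians. It is not TorchJD's implementation, and `gramian` and `add_matrices` are hypothetical helpers.

```python
def gramian(jacobian):
    """Return J @ J^T for a Jacobian given as a list of rows (one row per loss)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in jacobian] for r1 in jacobian]


def add_matrices(m1, m2):
    """Entry-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


# Made-up Jacobian blocks for 2 losses: one block per "layer" (2 and 1 parameters).
layer1_jac = [[1.0, 0.0], [-3.0, 1.0]]
layer2_jac = [[2.0], [1.0]]

# Incremental route: sum the per-layer Gramians; only one block is needed at a time.
accumulated = add_matrices(gramian(layer1_jac), gramian(layer2_jac))

# Reference route: build the full Jacobian first, then take its Gramian.
full_jac = [r1 + r2 for r1, r2 in zip(layer1_jac, layer2_jac)]
reference = gramian(full_jac)

# Both routes agree. The negative off-diagonal entry is the inner product of the
# two losses' gradients, so it signals that they conflict.
gramians_match = accumulated == reference
losses_conflict = accumulated[0][1] < 0
```

The accumulated matrix has one row and one column per loss, so its size does not depend on the number of parameters. That is why working with it can save memory compared with storing the full Jacobian.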
+ +## Supported Scalarizers + +| Scalarizer | Publication | +|---|---| +| [Constant](https://torchjd.org/stable/docs/scalarization/constant/) | - | +| [COSMOS](https://torchjd.org/stable/docs/scalarization/cosmos/) | [COSMOS: Enhancing Multi-Objective Optimization with Scalarization](https://arxiv.org/pdf/2303.04536) | +| [DWA](https://torchjd.org/stable/docs/scalarization/dwa/) | [End-to-End Multi-Task Learning with Attention](https://arxiv.org/pdf/1803.10704) | +| [FAMO](https://torchjd.org/stable/docs/scalarization/famo/) | [FAMO: Fast Adaptive Multitask Optimization](https://arxiv.org/pdf/2306.03792) | +| [GeometricMean](https://torchjd.org/stable/docs/scalarization/geometric_mean/) | [MultiNet++: Multi-Stream Feature Aggregation and Geometric Loss Strategy for Multi-Task Learning](https://arxiv.org/pdf/1902.08325) | +| [IMTL-L](https://torchjd.org/stable/docs/scalarization/imtl_l/) | [Towards Impartial Multi-task Learning](https://discovery.ucl.ac.uk/id/eprint/10120667/) | +| [Mean](https://torchjd.org/stable/docs/scalarization/mean/) | - | +| [PBI](https://torchjd.org/stable/docs/scalarization/pbi/) | [A Decomposition-Based Evolutionary Algorithm for Many Objective Optimization](https://ieeexplore.ieee.org/document/7445185) | +| [Random](https://torchjd.org/stable/docs/scalarization/random/) | [Reasonable Effectiveness of Random Weighting: A Litmus Test for Multi-Task Learning](https://arxiv.org/pdf/2111.10603) | +| [STCH](https://torchjd.org/stable/docs/scalarization/stch/) | [Smooth Tchebycheff Scalarization for Multi-Objective Optimization](https://arxiv.org/pdf/2402.19078) | +| [Sum](https://torchjd.org/stable/docs/scalarization/sum/) | - | +| [UW](https://torchjd.org/stable/docs/scalarization/uw/) | [Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics](https://arxiv.org/pdf/1705.07115) | ## Supported Aggregators and Weightings + TorchJD provides many existing aggregators from the literature, listed in the following 
table. -| Aggregator | Weighting | Publication | -|------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| [UPGrad](https://torchjd.org/stable/docs/aggregation/upgrad/#torchjd.aggregation.UPGrad) (recommended) | [UPGradWeighting](https://torchjd.org/stable/docs/aggregation/upgrad/#torchjd.aggregation.UPGradWeighting) | [Jacobian Descent For Multi-Objective Optimization](https://arxiv.org/pdf/2406.16232) | -| [AlignedMTL](https://torchjd.org/stable/docs/aggregation/aligned_mtl#torchjd.aggregation.AlignedMTL) | [AlignedMTLWeighting](https://torchjd.org/stable/docs/aggregation/aligned_mtl#torchjd.aggregation.AlignedMTLWeighting) | [Independent Component Alignment for Multi-Task Learning](https://arxiv.org/pdf/2305.19000) | -| [CAGrad](https://torchjd.org/stable/docs/aggregation/cagrad#torchjd.aggregation.CAGrad) | [CAGradWeighting](https://torchjd.org/stable/docs/aggregation/cagrad#torchjd.aggregation.CAGradWeighting) | [Conflict-Averse Gradient Descent for Multi-task Learning](https://arxiv.org/pdf/2110.14048) | -| [ConFIG](https://torchjd.org/stable/docs/aggregation/config#torchjd.aggregation.ConFIG) | - | [ConFIG: Towards Conflict-free Training of Physics Informed Neural Networks](https://arxiv.org/pdf/2408.11104) | -| [Constant](https://torchjd.org/stable/docs/aggregation/constant#torchjd.aggregation.Constant) | [ConstantWeighting](https://torchjd.org/stable/docs/aggregation/constant#torchjd.aggregation.ConstantWeighting) | - | -| - | [CRMOGMWeighting](https://torchjd.org/stable/docs/aggregation/cr_mogm/#torchjd.aggregation.CRMOGMWeighting) | [On the Convergence of Stochastic Multi-Objective Gradient Manipulation and 
Beyond](https://proceedings.neurips.cc/paper_files/paper/2022/file/f91bd64a3620aad8e70a27ad9cb3ca57-Paper-Conference.pdf) | -| [DualProj](https://torchjd.org/stable/docs/aggregation/dualproj#torchjd.aggregation.DualProj) | [DualProjWeighting](https://torchjd.org/stable/docs/aggregation/dualproj#torchjd.aggregation.DualProjWeighting) | [Gradient Episodic Memory for Continual Learning](https://arxiv.org/pdf/1706.08840) | -| [ExcessMTL](https://torchjd.org/stable/docs/aggregation/excess_mtl#torchjd.aggregation.ExcessMTL) | [ExcessMTLWeighting](https://torchjd.org/stable/docs/aggregation/excess_mtl#torchjd.aggregation.ExcessMTLWeighting) | [Robust Multi-Task Learning with Excess Risks](https://proceedings.mlr.press/v235/he24n.html) | -| [FairGrad](https://torchjd.org/stable/docs/aggregation/fairgrad#torchjd.aggregation.FairGrad) | [FairGradWeighting](https://torchjd.org/stable/docs/aggregation/fairgrad#torchjd.aggregation.FairGradWeighting) | [Fair Resource Allocation in Multi-Task Learning](https://arxiv.org/pdf/2402.15638) | -| [GradDrop](https://torchjd.org/stable/docs/aggregation/graddrop#torchjd.aggregation.GradDrop) | - | [Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout](https://arxiv.org/pdf/2010.06808) | -| [GradVac](https://torchjd.org/stable/docs/aggregation/gradvac#torchjd.aggregation.GradVac) | [GradVacWeighting](https://torchjd.org/stable/docs/aggregation/gradvac#torchjd.aggregation.GradVacWeighting) | [Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models](https://arxiv.org/pdf/2010.05874) | -| [IMTLG](https://torchjd.org/stable/docs/aggregation/imtl_g#torchjd.aggregation.IMTLG) | [IMTLGWeighting](https://torchjd.org/stable/docs/aggregation/imtl_g#torchjd.aggregation.IMTLGWeighting) | [Towards Impartial Multi-task Learning](https://www.semanticscholar.org/paper/Towards-Impartial-Multi-task-Learning-Liu-Li/45c0828baec1dd53b81f1b2635788fdf27d0792d) | -| 
[Krum](https://torchjd.org/stable/docs/aggregation/krum#torchjd.aggregation.Krum) | [KrumWeighting](https://torchjd.org/stable/docs/aggregation/krum#torchjd.aggregation.KrumWeighting) | [Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent](https://proceedings.neurips.cc/paper/2017/file/f4b9ec30ad9f68f89b29639786cb62ef-Paper.pdf) | -| [Mean](https://torchjd.org/stable/docs/aggregation/mean#torchjd.aggregation.Mean) | [MeanWeighting](https://torchjd.org/stable/docs/aggregation/mean#torchjd.aggregation.MeanWeighting) | - | -| [MGDA](https://torchjd.org/stable/docs/aggregation/mgda#torchjd.aggregation.MGDA) | [MGDAWeighting](https://torchjd.org/stable/docs/aggregation/mgda#torchjd.aggregation.MGDAWeighting) | [Multiple-gradient descent algorithm (MGDA) for multiobjective optimization](https://comptes-rendus.academie-sciences.fr/mathematique/articles/10.1016/j.crma.2012.03.014/) | -| - | [MoDoWeighting](https://torchjd.org/stable/docs/aggregation/modo/#torchjd.aggregation.MoDoWeighting) | [Three-Way Trade-Off in Multi-Objective Learning: Optimization, Generalization and Conflict-Avoidance](https://www.jmlr.org/papers/volume25/23-1287/23-1287.pdf) | -| [NashMTL](https://torchjd.org/stable/docs/aggregation/nash_mtl#torchjd.aggregation.NashMTL) | - | [Multi-Task Learning as a Bargaining Game](https://arxiv.org/pdf/2202.01017) | -| [PCGrad](https://torchjd.org/stable/docs/aggregation/pcgrad#torchjd.aggregation.PCGrad) | [PCGradWeighting](https://torchjd.org/stable/docs/aggregation/pcgrad#torchjd.aggregation.PCGradWeighting) | [Gradient Surgery for Multi-Task Learning](https://arxiv.org/pdf/2001.06782) | -| [Random](https://torchjd.org/stable/docs/aggregation/random#torchjd.aggregation.Random) | [RandomWeighting](https://torchjd.org/stable/docs/aggregation/random#torchjd.aggregation.RandomWeighting) | [Reasonable Effectiveness of Random Weighting: A Litmus Test for Multi-Task Learning](https://arxiv.org/pdf/2111.10603) | +| Aggregator | Weighting | 
Publication | +|----|----|----| +| [UPGrad](https://torchjd.org/stable/docs/aggregation/upgrad/#torchjd.aggregation.UPGrad) (recommended) | [UPGradWeighting](https://torchjd.org/stable/docs/aggregation/upgrad/#torchjd.aggregation.UPGradWeighting) | [Jacobian Descent For Multi-Objective Optimization](https://arxiv.org/pdf/2406.16232) | +| [AlignedMTL](https://torchjd.org/stable/docs/aggregation/aligned_mtl#torchjd.aggregation.AlignedMTL) | [AlignedMTLWeighting](https://torchjd.org/stable/docs/aggregation/aligned_mtl#torchjd.aggregation.AlignedMTLWeighting) | [Independent Component Alignment for Multi-Task Learning](https://arxiv.org/pdf/2305.19000) | +| [CAGrad](https://torchjd.org/stable/docs/aggregation/cagrad#torchjd.aggregation.CAGrad) | [CAGradWeighting](https://torchjd.org/stable/docs/aggregation/cagrad#torchjd.aggregation.CAGradWeighting) | [Conflict-Averse Gradient Descent for Multi-task Learning](https://arxiv.org/pdf/2110.14048) | +| [ConFIG](https://torchjd.org/stable/docs/aggregation/config#torchjd.aggregation.ConFIG) | - | [ConFIG: Towards Conflict-free Training of Physics Informed Neural Networks](https://arxiv.org/pdf/2408.11104) | +| [Constant](https://torchjd.org/stable/docs/aggregation/constant#torchjd.aggregation.Constant) | [ConstantWeighting](https://torchjd.org/stable/docs/aggregation/constant#torchjd.aggregation.ConstantWeighting) | - | +| - | [CRMOGMWeighting](https://torchjd.org/stable/docs/aggregation/cr_mogm/#torchjd.aggregation.CRMOGMWeighting) | [On the Convergence of Stochastic Multi-Objective Gradient Manipulation and Beyond](https://proceedings.neurips.cc/paper_files/paper/2022/file/f91bd64a3620aad8e70a27ad9cb3ca57-Paper-Conference.pdf) | +| [DualProj](https://torchjd.org/stable/docs/aggregation/dualproj#torchjd.aggregation.DualProj) | [DualProjWeighting](https://torchjd.org/stable/docs/aggregation/dualproj#torchjd.aggregation.DualProjWeighting) | [Gradient Episodic Memory for Continual Learning](https://arxiv.org/pdf/1706.08840) | +| 
[ExcessMTL](https://torchjd.org/stable/docs/aggregation/excess_mtl#torchjd.aggregation.ExcessMTL) | [ExcessMTLWeighting](https://torchjd.org/stable/docs/aggregation/excess_mtl#torchjd.aggregation.ExcessMTLWeighting) | [Robust Multi-Task Learning with Excess Risks](https://proceedings.mlr.press/v235/he24n.html) | +| [FairGrad](https://torchjd.org/stable/docs/aggregation/fairgrad#torchjd.aggregation.FairGrad) | [FairGradWeighting](https://torchjd.org/stable/docs/aggregation/fairgrad#torchjd.aggregation.FairGradWeighting) | [Fair Resource Allocation in Multi-Task Learning](https://arxiv.org/pdf/2402.15638) | +| [GradDrop](https://torchjd.org/stable/docs/aggregation/graddrop#torchjd.aggregation.GradDrop) | - | [Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout](https://arxiv.org/pdf/2010.06808) | +| [GradVac](https://torchjd.org/stable/docs/aggregation/gradvac#torchjd.aggregation.GradVac) | [GradVacWeighting](https://torchjd.org/stable/docs/aggregation/gradvac#torchjd.aggregation.GradVacWeighting) | [Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models](https://arxiv.org/pdf/2010.05874) | +| [IMTLG](https://torchjd.org/stable/docs/aggregation/imtl_g#torchjd.aggregation.IMTLG) | [IMTLGWeighting](https://torchjd.org/stable/docs/aggregation/imtl_g#torchjd.aggregation.IMTLGWeighting) | [Towards Impartial Multi-task Learning](https://www.semanticscholar.org/paper/Towards-Impartial-Multi-task-Learning-Liu-Li/45c0828baec1dd53b81f1b2635788fdf27d0792d) | +| [Krum](https://torchjd.org/stable/docs/aggregation/krum#torchjd.aggregation.Krum) | [KrumWeighting](https://torchjd.org/stable/docs/aggregation/krum#torchjd.aggregation.KrumWeighting) | [Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent](https://proceedings.neurips.cc/paper/2017/file/f4b9ec30ad9f68f89b29639786cb62ef-Paper.pdf) | +| [Mean](https://torchjd.org/stable/docs/aggregation/mean#torchjd.aggregation.Mean) | 
[MeanWeighting](https://torchjd.org/stable/docs/aggregation/mean#torchjd.aggregation.MeanWeighting) | - | +| [MGDA](https://torchjd.org/stable/docs/aggregation/mgda#torchjd.aggregation.MGDA) | [MGDAWeighting](https://torchjd.org/stable/docs/aggregation/mgda#torchjd.aggregation.MGDAWeighting) | [Multiple-gradient descent algorithm (MGDA) for multiobjective optimization](https://comptes-rendus.academie-sciences.fr/mathematique/articles/10.1016/j.crma.2012.03.014/) | +| - | [MoDoWeighting](https://torchjd.org/stable/docs/aggregation/modo/#torchjd.aggregation.MoDoWeighting) | [Three-Way Trade-Off in Multi-Objective Learning: Optimization, Generalization and Conflict-Avoidance](https://www.jmlr.org/papers/volume25/23-1287/23-1287.pdf) | +| [NashMTL](https://torchjd.org/stable/docs/aggregation/nash_mtl#torchjd.aggregation.NashMTL) | - | [Multi-Task Learning as a Bargaining Game](https://arxiv.org/pdf/2202.01017) | +| [PCGrad](https://torchjd.org/stable/docs/aggregation/pcgrad#torchjd.aggregation.PCGrad) | [PCGradWeighting](https://torchjd.org/stable/docs/aggregation/pcgrad#torchjd.aggregation.PCGradWeighting) | [Gradient Surgery for Multi-Task Learning](https://arxiv.org/pdf/2001.06782) | +| [Random](https://torchjd.org/stable/docs/aggregation/random#torchjd.aggregation.Random) | [RandomWeighting](https://torchjd.org/stable/docs/aggregation/random#torchjd.aggregation.RandomWeighting) | [Reasonable Effectiveness of Random Weighting: A Litmus Test for Multi-Task Learning](https://arxiv.org/pdf/2111.10603) | | - | [SDMGradWeighting](https://torchjd.org/stable/docs/aggregation/sdmgrad#torchjd.aggregation.SDMGradWeighting) | [Direction-oriented Multi-objective Learning: Simple and Provable Stochastic Algorithms](https://arxiv.org/pdf/2305.18409) | -| [Sum](https://torchjd.org/stable/docs/aggregation/sum#torchjd.aggregation.Sum) | [SumWeighting](https://torchjd.org/stable/docs/aggregation/sum#torchjd.aggregation.SumWeighting) | - | -| [Trimmed 
Mean](https://torchjd.org/stable/docs/aggregation/trimmed_mean#torchjd.aggregation.TrimmedMean) | - | [Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates](https://proceedings.mlr.press/v80/yin18a/yin18a.pdf) | +| [Sum](https://torchjd.org/stable/docs/aggregation/sum#torchjd.aggregation.Sum) | [SumWeighting](https://torchjd.org/stable/docs/aggregation/sum#torchjd.aggregation.SumWeighting) | - | +| [Trimmed Mean](https://torchjd.org/stable/docs/aggregation/trimmed_mean#torchjd.aggregation.TrimmedMean) | - | [Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates](https://proceedings.mlr.press/v80/yin18a/yin18a.pdf) | ## Release Methodology @@ -257,7 +200,10 @@ release contains breaking changes, the [changelog](CHANGELOG.md) and the GitHub include clear instructions on how to migrate. ## Contribution -Please read the [Contribution page](CONTRIBUTING.md). + +Please read the [Contribution page](CONTRIBUTING.md) and join our +[Discord](https://discord.gg/76KkRnb3nk) +to get involved! Thanks to our amazing contributors for making this project possible: