Use max rating deviation per team to calculate match quality#1060
mankinskin wants to merge 2 commits into FAForever:develop
Conversation
Force-pushed from a1194ef to 41f2084
Hi, I am trying to make the tests pass now. I tried setting up the FAF database locally; however, when I execute the config/init-db.sh script and the Docker container is created, I afterwards get warnings from the container: "Access denied for user 'root'@'localhost' (using password: NO)". I am running on Windows, so for now I am running the tests in GitHub Actions through this PR.
@BlackYps Hi, tests pass now. I think this change should improve the team matchmaking a lot, with more equally distributed teams.
```python
unfairness = rating_disparity / config.MAXIMUM_RATING_IMBALANCE
deviation = statistics.pstdev(ratings)
rating_variety = deviation / config.MAXIMUM_RATING_DEVIATION
max_team_deviation = max(map(statistics.pstdev, [match[0].displayed_ratings, match[1].displayed_ratings]))
```
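For illustration, here is a minimal sketch of the metric swap this PR makes, from match-wide deviation to the maximum per-team deviation. The helper names and the normalizer of 250 are hypothetical; the real quality function in team_matchmaker.py is more involved.

```python
import statistics

def match_wide_variety(team_a, team_b, max_deviation):
    # Old behavior: one deviation over all ratings in the match.
    return statistics.pstdev(team_a + team_b) / max_deviation

def max_team_variety(team_a, team_b, max_deviation):
    # Proposed behavior: the larger of the two per-team deviations.
    return max(statistics.pstdev(team_a), statistics.pstdev(team_b)) / max_deviation

# A diverse team next to a uniform team: the match-wide number hides the spread.
diverse, uniform = [2000, 1000], [1500, 1500]
print(match_wide_variety(diverse, uniform, 250))  # ~1.414
print(max_team_variety(diverse, uniform, 250))    # 2.0
```

The per-team metric penalizes this match noticeably harder, because the uniform team's zero spread no longer averages away the diverse team's 500-point spread.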
Why are you using displayed ratings here?
I was looking at the ratings separately for each team; I was not sure which value is the correct one to use. Which value do you suggest?
use average_rating of the original searches like it is done here:
server/server/matchmaker/algorithm/team_matchmaker.py, lines 311 to 312 in 871c64e
It seems the actual change here is that you calculate the deviation for each team separately and then use the maximum value. I expect that this does make the matchmaker more sensitive to games with a large rating variety. However, besides all of this, I don't see that your change will solve the problem that you stated:
This will still be the case, because it leads to teams with the smallest difference in total rating. If you want to change that, then you have to take a look at the part that assigns search parties to the teams.
Yes, rating variety is usually bad because players are put in more unequal face-offs, even when the average ratings are equal. The difference when using the max rating deviation of both teams is that variety within the teams themselves is limited. An equal team cannot compensate for a highly diverse team. This should result in less diverse teams on both sides and more equal face-offs on the field.
Yes of course, but we can still tweak MAXIMUM_RATING_DEVIATION for this, if the requirements are too harsh.
I don't understand? This is exactly the problem: large differences in rating within the teams. The old version selected for low variety across all players in the match. This allowed high rating variety within one team if the opposing team had very average ratings, effectively compensating for the extreme rating variety in the first team. By limiting the rating variety within the individual teams, no team should have exceptionally strong and weak players; both teams should follow the same bell curve around the average rating.
I think maybe I do see your point. Do you mean it will still select the teams with the smallest rating deviation as higher quality, and therefore there will still be one team with all perfectly average ratings?
There are two parts to this problem.
If I understand the algorithm for "balanced two-way partitioning" correctly, it will simply try to form the matches from player pairs with the least difference, then match "pairs with pairs" of the most similar differences (sorted), such that they can be paired with the least total difference down the line. I don't think this prefers putting the largest and smallest ratings together (if we are using balanced partitioning), as the differences between players' counterparts are minimized the same way. Only with sparse ratings in the queue, i.e. not enough counterparts in the same rating range, is the algorithm forced to put very different ratings into the same buckets; the differences become very large, so one team gets both big and small ratings while the other's are all in between. But that is exactly why we filter the match results for other metrics, and not all possible matches are good enough to be played.

I think it's a good starting point to discount match quality with the rating variety in the whole match, but it has to go one step further and discount rating variety within one team as well. These matches have a high tendency of bad game experiences on random maps. If we trust that the algorithm will find the best matches possible, then it is not a feature to give players unbalanced games when the best possible matches are not good.
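To make the sparse-queue scenario concrete, here is a toy sketch (not the actual matchmaker code) of a pairing-based split: sort descending, pair neighbours so each player has a close counterpart, then alternate which team receives the stronger member of each pair. With a sparse set of ratings this produces equal totals but teams with very different internal spreads:

```python
import statistics

def toy_partition(ratings):
    # Toy illustration only: assumes an even number of players.
    ordered = sorted(ratings, reverse=True)
    team_a, team_b = [], []
    for i in range(0, len(ordered) - 1, 2):
        high, low = ordered[i], ordered[i + 1]
        # Alternate which team gets the stronger half of each pair,
        # so the total ratings stay balanced.
        if (i // 2) % 2 == 0:
            team_a.append(high)
            team_b.append(low)
        else:
            team_b.append(high)
            team_a.append(low)
    return team_a, team_b

a, b = toy_partition([30, 25, 25, 15, 10, 5])
print(a, sum(a))  # [30, 15, 10] 55
print(b, sum(b))  # [25, 25, 5] 55
print(statistics.pstdev(a), statistics.pstdev(b))
```

Both teams total 55, yet their internal deviations differ, which is exactly the kind of match a match-wide deviation check cannot distinguish from a genuinely even one.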
I'm not sure if I understand you correctly. Take 2v2 for example. To have the most balanced result you always need to pair the highest and lowest together against the two in the middle. That's why I'm saying that the changes so far better convey what we want to achieve, so this is already an improvement, but in practice you will not see different team arrangements from the matchmaker. We could say that we would rather have a bit more combined rating difference between the two teams if both teams have a similar rating variety. For example to create a team of 1200 and 1000 against 1100 and 900. This would require more changes though, because it requires changing the algorithm that builds the teams. |
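The 2v2 trade-off described above can be checked numerically. This quick sketch (summarize is a hypothetical helper) shows that pairing highest with lowest minimizes total-rating disparity but maximizes per-team spread, while pairing similar teammates costs some disparity and reduces the spread:

```python
import statistics

def summarize(team_a, team_b):
    # Returns (total-rating disparity, larger of the two per-team deviations).
    disparity = abs(sum(team_a) - sum(team_b))
    max_dev = max(statistics.pstdev(team_a), statistics.pstdev(team_b))
    return disparity, max_dev

# Highest and lowest together: perfectly equal totals, large spread.
print(summarize([1200, 900], [1100, 1000]))   # (0, 150.0)
# Similar teammates together: some disparity, smaller spread.
print(summarize([1200, 1000], [1100, 900]))   # (200, 100.0)
```

Which arrangement counts as "better" depends on how the quality function weighs disparity against per-team deviation, which is why this would require changing the team-building algorithm, not just the quality metric.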
One more thing: this repository has no maintainer at the moment, so it's unclear when the next server release will be.
I see what you are saying: there are multiple components to this problem. This change fixes one issue, but the other issue is that we can still have very different rating variety between the teams, although it is limited. Below the variety limit max_variety we can still have very different values for each of the two teams. There is a "team variety variety", which should also be optimized, to prefer games of similarly varied teams and maximize the chance of a mirror counterpart for each player. Then we have two variables:
I guess that is why you say we have to implement the second part in the match selection process, not in the match quality rating. One important remark about this, though: do we really want to allow for more rating variety in teams, even if both teams in the match are equally varied? This would allow games where, even though there is a mirror for each player, on an asymmetrical map this just means someone has to face the strongest player. The maximum variance here should simply be limited. Setting the maximum rating deviation high to allow for more matches to be made, but then filtering them again for high similarity in team variety, still exposes very different players to the same match. The second parameter is more involved with detecting "balanced variety" for "mirror matching", but simply limiting the total variety that is allowed is already a big factor. We have to remember that the optimization function includes the search time and will weaken restrictions over time. But we should still make an honest attempt at limiting the variety within the same team.

I think these are separate features, and this is basically a bugfix for the team rating deviation limitation: whereas the previous implementation limited the deviation of all ratings in the match, this change limits the deviation of ratings for each team. We can now more directly influence the maximum allowed rating deviation of a single team. One could argue that

Here, both teams have the same total rating, but the total variance is quite different from the individual variances:

```
>>> pstdev([30, 15, 10])
8.498365855987975
>>> pstdev([25, 25, 5])
9.428090415820634
>>> pstdev([30, 25, 25, 15, 10, 5])
8.975274678557506
```

So I am not sure; maybe we have to do some fuzzing tests to see which random games actually get selected, and then do a quality analysis for different configurations? I don't think this would be too much effort. I think a good test would be if simply lowering
Can we at least merge this? The current balancing is really bad. |
Sorry for not coming back to you; there is a lot else going on for me. About the fuzzing tests: I pushed the simulation test I made four years ago to the matchmaker-simulation-test branch in this repo. It's been a long time since I have used it, but it should give you a good starting ground to compare simulations with different parameters or code changes. I'll do a code review, but for the reasons explained above the changes will not see the light of deployment soon.
```python
deviation = statistics.pstdev(ratings)
rating_variety = deviation / config.MAXIMUM_RATING_DEVIATION
max_team_deviation = max(map(statistics.pstdev, [match[0].displayed_ratings, match[1].displayed_ratings]))
max_rating_variety = max_team_deviation / config.MAXIMUM_RATING_DEVIATION
```
I would keep the name rating_variety here
but the point is that it is the maximum rating variety of all teams
```python
assert set(matches[0][0].get_original_searches()) == {c1, s[2], s[5]}
assert set(matches[0][1].get_original_searches()) == {c3, s[1], s[6]}
assert set(matches[1][0].get_original_searches()) == {c4, s[4]}
assert set(matches[1][1].get_original_searches()) == {c2, s[0], s[3]}
```
The tests were not passing (I think even before the change), and this part actually looked wrong. At least the way I made sense of it, these should be the correct assertions, but I am not sure.
Honestly, I don't understand why you would think that. There is a difference between using the total rating deviation of all players in a match and using the deviation of each team. It doesn't lower the allowed deviation as currently implemented; this is just a "bug", as it doesn't actually create well-balanced teams. Instead of applying mirror matching within the rating deviation range, it allows matches to spread ratings out further by putting worse and better players into the same team. They will be allowed to deviate more than "allowed", because the "very average" rated players compensate for the extreme rating difference in the other team. It's just a different metric than what is implemented right now. I regularly play games that are just completely unbalanced because of this exact problem: the variance of the entire match may be low, but the variance in one team is very high. This is not the correct behavior for MAXIMUM_RATING_DEVIATION; it should limit the deviation per team, not across all players in the match. It simply creates unbalanced games, and that creates a lot of toxicity, as there are more unequal and unfair matches between players when they encounter each other.
But also... I mean, what is this balance? The balancing just seems really broken.
Balance rating: 90%
I feel this discussion is not going forward without more data. Please show me examples of these unbalanced games. Also show me the effects of your change by utilising the simulation test that I pushed to the repo. Otherwise we are just talking in the abstract with no way to settle on a solution.
Calling this broken seems extreme to me. Yes, I would expect the 1060 and 990 to be swapped, but one team was probably premade. Still, the total rating difference between the two teams is 170 points, which seems pretty reasonable. |
I mean, 170 points is about 15% of the average rating in the match; T1 has about 9% more total rating points. It's just not balanced, let me tell you that from experience. I also feel like this discussion isn't really progressing anywhere, but mostly because the maintainers here seem to resist any change with the reasoning that it's "not that broken". Maybe it would be better to just recognize actual user feedback and support users in trying to improve the experience on the platform, by making some changes yourself and by running your own tests. But I guess I need to argue for days about a frankly simple change to the balancing, and now have to do more work to prove it actually improves things, without any experience in the codebase. Actually, no thanks. Just let this game die then.
You have chosen a topic that is very sensitive to the community. People have very different opinions about the state of the matchmaking, and it's hard to come to objective answers. I'm sorry that you feel frustrated. The reality of the situation is that this repository is currently unmaintained. Nobody has time to help you complete the feature. And even then it will probably not get deployed, because we just lack the manpower.
I understand. Sorry as well, because I don't have the time to work more on this right now. Still, I feel like with a codebase like this you could dare a bit more experimentation. That would be better than not changing anything at all while the game is objectively balancing badly quite often; I don't think there can be second opinions about this. I see there is a trade-off between waiting times and match quality, but I don't think that is the only reason why games are sometimes unbalanced.
Again, a game like this:
Balance 56%, and I was basically playing a 1v3. This is exactly the type of game that would be prevented by using the max rating deviation per team. Also, I don't see the point in ever playing a 56% balance game. The total ratings were 1221 vs 1617! The only reasonable thing to do here is Ctrl-K and not waste time with this game. It doesn't help that you don't see the ratings in-game anymore.
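For scale, a quick arithmetic check of the quoted totals:

```python
t1, t2 = 1221, 1617
disparity = abs(t1 - t2)                # 396 rating points
relative = disparity / ((t1 + t2) / 2)  # gap relative to the average team total
print(f"{relative:.0%}")                # ~28%
```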
Another perfect example of why this needs fixing: https://replay.faforever.com/26758871
Balance 94%. T1 -> low variance; both together -> average variance -> balance is not derated.
Fix: rate team variance based on the maximum variance of both teams instead of the variance of all players in the match.

#975
The idea here is to keep rating variety within each team of a match low. Instead of discounting quality for deviation across all ratings in a match, the deviation of each team is limited.
This fixes imbalances on maps with uneven spawn positions (lots of mexes, an air spot, a navy spot, ...), where the overall match variety may be low but one team gets a lot of rating variety while the other is very balanced: one team has the best and the worst rated players, while the other has the average ratings.
This will either give the strong player an advantage on a strong spot, or a disadvantage when they are on a weak spot, as the majority of the game will be in the hands of their weaker teammates.
By limiting the maximum variety per team, matches like this should be assigned lower quality.