Use max rating deviation per team to calculate match quality#1060
mankinskin wants to merge 2 commits into FAForever:develop
Conversation
Force-pushed from a1194ef to 41f2084
Hi, I am trying to make the tests pass now. I tried setting up the FAF database locally; however, when I execute the config/init-db.sh script and the Docker container is created, I afterwards get warnings from the container: "Access denied for user 'root'@'localhost' (using password: NO)". I am running on Windows, so for now I am running the tests in GitHub Actions through this PR.
@BlackYps Hi, tests pass now. I think this change should improve the team matchmaking a lot, with more equally distributed teams.
```python
unfairness = rating_disparity / config.MAXIMUM_RATING_IMBALANCE
deviation = statistics.pstdev(ratings)
rating_variety = deviation / config.MAXIMUM_RATING_DEVIATION
max_team_deviation = max(map(statistics.pstdev, [match[0].displayed_ratings, match[1].displayed_ratings]))
```
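For illustration, here is a minimal sketch of the metric swap this PR makes, from match-wide deviation to the maximum per-team deviation. The helper names and the normalizer of 250 are hypothetical; the real quality function in team_matchmaker.py is more involved.

```python
import statistics

def match_wide_variety(team_a, team_b, max_deviation):
    # Old behavior: one deviation over all ratings in the match.
    return statistics.pstdev(team_a + team_b) / max_deviation

def max_team_variety(team_a, team_b, max_deviation):
    # Proposed behavior: the larger of the two per-team deviations.
    return max(statistics.pstdev(team_a), statistics.pstdev(team_b)) / max_deviation

# A diverse team next to a uniform team: the match-wide number hides the spread.
diverse, uniform = [2000, 1000], [1500, 1500]
print(match_wide_variety(diverse, uniform, 250))  # ~1.414
print(max_team_variety(diverse, uniform, 250))    # 2.0
```

The per-team metric penalizes this match noticeably harder, because the uniform team's zero spread no longer averages away the diverse team's 500-point spread.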
Why are you using displayed ratings here?
I was looking at the ratings separately for each team; I was not sure which value is the correct one to use. Which value do you suggest?
use average_rating of the original searches like it is done here:
server/server/matchmaker/algorithm/team_matchmaker.py, lines 311 to 312 in 871c64e
It seems the actual change here is that you calculate the deviation for each team separately and then use the maximum value. I expect that this does make the matchmaker more sensitive to games with a large rating variety. However, besides all of this, I don't see that your change will solve the problem that you stated:
This will still be the case, because it leads to teams with the smallest difference in total rating. If you want to change that, then you have to take a look at the part that assigns search parties to the teams.
Yes, rating variety is usually bad because players are put in more unequal face-offs, even when the average ratings are equal. The difference when using the max rating deviation of both teams is that variety within the teams themselves is limited. An equal team cannot compensate for a highly diverse team. This should result in less diverse teams on both sides and more equal face-offs on the field.
Yes of course, but we can still tweak MAXIMUM_RATING_DEVIATION for this, if the requirements are too harsh.
I don't understand? This is exactly the problem: large differences in rating within the teams. The old version selected for low variety across all players in the match. This allowed high rating variety within one team if the opposing team had very average ratings, effectively compensating for the extreme rating variety in the first team. By limiting the rating variety within the individual teams, no team should have exceptionally strong and weak players; both teams should follow the same bell curve around the average rating.
I think maybe I do see your point. Do you mean it will still select the teams with the smallest rating deviation as higher quality, and therefore there will still be one team with all perfectly average ratings?
There are two parts to this problem.
If I understand the algorithm for "balanced two-way partitioning" correctly, it will simply try to form the matches from player pairs with the least difference, then match "pairs with pairs" of the most similar differences (sorted), such that they can be paired with the least total difference down the line. I don't think this prefers putting the largest and smallest ratings together (if we are using balanced partitioning), as the differences between players' counterparts are minimized the same way. Only with sparse ratings in the queue, i.e. not enough counterparts in the same rating range, is the algorithm forced to put very different ratings into the same buckets; the differences become very large, so one team gets both big and small ratings while the other's are all in between. But that is exactly why we filter the match results for other metrics, and not all possible matches are good enough to be played.

I think it's a good starting point to discount match quality with the rating variety in the whole match, but it has to go one step further and discount rating variety within one team as well. These matches have a high tendency of bad game experiences on random maps. If we trust that the algorithm will find the best matches possible, then it is not a feature to give players unbalanced games when the best possible matches are not good.
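To make the sparse-queue scenario concrete, here is a toy sketch (not the actual matchmaker code) of a pairing-based split: sort descending, pair neighbours so each player has a close counterpart, then alternate which team receives the stronger member of each pair. With a sparse set of ratings this produces equal totals but teams with very different internal spreads:

```python
import statistics

def toy_partition(ratings):
    # Toy illustration only: assumes an even number of players.
    ordered = sorted(ratings, reverse=True)
    team_a, team_b = [], []
    for i in range(0, len(ordered) - 1, 2):
        high, low = ordered[i], ordered[i + 1]
        # Alternate which team gets the stronger half of each pair,
        # so the total ratings stay balanced.
        if (i // 2) % 2 == 0:
            team_a.append(high)
            team_b.append(low)
        else:
            team_b.append(high)
            team_a.append(low)
    return team_a, team_b

a, b = toy_partition([30, 25, 25, 15, 10, 5])
print(a, sum(a))  # [30, 15, 10] 55
print(b, sum(b))  # [25, 25, 5] 55
print(statistics.pstdev(a), statistics.pstdev(b))
```

Both teams total 55, yet their internal deviations differ, which is exactly the kind of match a match-wide deviation check cannot distinguish from a genuinely even one.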
I'm not sure if I understand you correctly. Take 2v2 for example. To have the most balanced result you always need to pair the highest and lowest together against the two in the middle. That's why I'm saying that the changes so far better convey what we want to achieve, so this is already an improvement, but in practice you will not see different team arrangements from the matchmaker. We could say that we would rather have a bit more combined rating difference between the two teams if both teams have a similar rating variety. For example to create a team of 1200 and 1000 against 1100 and 900. This would require more changes though, because it requires changing the algorithm that builds the teams. |
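The 2v2 trade-off described above can be checked numerically. This quick sketch (summarize is a hypothetical helper) shows that pairing highest with lowest minimizes total-rating disparity but maximizes per-team spread, while pairing similar teammates costs some disparity and reduces the spread:

```python
import statistics

def summarize(team_a, team_b):
    # Returns (total-rating disparity, larger of the two per-team deviations).
    disparity = abs(sum(team_a) - sum(team_b))
    max_dev = max(statistics.pstdev(team_a), statistics.pstdev(team_b))
    return disparity, max_dev

# Highest and lowest together: perfectly equal totals, large spread.
print(summarize([1200, 900], [1100, 1000]))   # (0, 150.0)
# Similar teammates together: some disparity, smaller spread.
print(summarize([1200, 1000], [1100, 900]))   # (200, 100.0)
```

Which arrangement counts as "better" depends on how the quality function weighs disparity against per-team deviation, which is why this would require changing the team-building algorithm, not just the quality metric.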
One more thing: this repository has no maintainer at the moment, so it's unclear when the next server release will be.
I see what you are saying: there are multiple components to this problem. This change fixes one issue, but the other issue is that we can still have very different rating variety between the teams, although it is limited. Below the variety limit max_variety we can still have very different values for each of the two teams. There is a "team variety variety", which should also be optimized, to prefer games of similarly varied teams and maximize the chance of a mirror counterpart for each player. Then we have two variables:
I guess that is why you say we have to implement the second part in the match selection process, not in the match quality rating. One important remark about this, though: do we really want to allow for more rating variety in teams, even if both teams in the match are equally varied? This would allow games where, even though there is a mirror for each player, on an asymmetrical map this just means someone has to face the strongest player. The maximum variance here should simply be limited. Setting the maximum rating deviation high to allow for more matches to be made, but then filtering them again for high similarity in team variety, still exposes very different players to the same match. The second parameter is more involved with detecting "balanced variety" for "mirror matching", but simply limiting the total variety that is allowed is already a big factor. We have to remember that the optimization function includes the search time and will weaken restrictions over time. But we should still make an honest attempt at limiting the variety within the same team.

I think these are separate features, and this is basically a bugfix for the team rating deviation limitation: whereas the previous implementation limited the deviation of all ratings in the match, this change limits the deviation of ratings for each team. We can now more directly influence the maximum allowed rating deviation of a single team. One could argue that

Here, both teams have the same total rating, but the total variance is quite different from the individual variances:

```
>>> pstdev([30, 15, 10])
8.498365855987975
>>> pstdev([25, 25, 5])
9.428090415820634
>>> pstdev([30, 25, 25, 15, 10, 5])
8.975274678557506
```

So I am not sure; maybe we have to do some fuzzing tests to see which random games actually get selected, and then do a quality analysis for different configurations? I don't think this would be too much effort. I think a good test would be if simply lowering
Can we at least merge this? The current balancing is really bad. |
Sorry for not coming back to you; there is a lot else going on for me. About the fuzzing tests: I pushed the simulation test I made four years ago to the matchmaker-simulation-test branch in this repo. It's been a long time since I have used it, but it should give you a good starting ground to compare simulations with different parameters or code changes. I'll do a code review, but for the reasons explained above the changes will not see the light of deployment soon.
```python
deviation = statistics.pstdev(ratings)
rating_variety = deviation / config.MAXIMUM_RATING_DEVIATION
max_team_deviation = max(map(statistics.pstdev, [match[0].displayed_ratings, match[1].displayed_ratings]))
max_rating_variety = max_team_deviation / config.MAXIMUM_RATING_DEVIATION
```
I would keep the name rating_variety here
but the point is that it is the maximum rating variety of all teams
```python
assert set(matches[0][0].get_original_searches()) == {c1, s[2], s[5]}
assert set(matches[0][1].get_original_searches()) == {c3, s[1], s[6]}
assert set(matches[1][0].get_original_searches()) == {c4, s[4]}
assert set(matches[1][1].get_original_searches()) == {c2, s[0], s[3]}
```
The tests were not passing (I think even before the change), and this part actually looked wrong. At least the way I made sense of it, these should be the correct assertions, but I am not sure.
Honestly, I don't understand why you would think that. There is a difference between using the total rating deviation of all players in a match and using the deviation of each team. It doesn't lower the allowed deviation as currently implemented; this is just a "bug", as it doesn't actually create well-balanced teams. Instead of applying mirror matching within the rating deviation range, it allows matches to spread ratings out further by putting worse and better players into the same team. They will be allowed to deviate more than "allowed", because the "very average" rated players compensate for the extreme rating difference in the other team. It's just a different metric than what is implemented right now. I regularly play games that are just completely unbalanced because of this exact problem: the variance of the entire match may be low, but the variance in one team is very high. This is not the correct behavior for MAXIMUM_RATING_DEVIATION; it should limit the deviation per team, not across all players in the match. It simply creates unbalanced games, and that creates a lot of toxicity, as there are more unequal and unfair matches between players when they encounter each other.
But also... I mean, what is this balance? The balancing just seems really broken.
Balance rating: 90%
I feel this discussion is not going forward without more data. Please show me examples of these unbalanced games. Also show me the effects of your change by utilising the simulation test that I pushed to the repo. Otherwise we are just talking in the abstract with no way to settle on a solution.
Calling this broken seems extreme to me. Yes, I would expect the 1060 and 990 to be swapped, but one team was probably premade. Still, the total rating difference between the two teams is 170 points, which seems pretty reasonable. |
I mean, 170 points is about 15% of the average rating in the match; T1 has about 9% more total rating points. It's just not balanced, let me tell you that from experience. I also feel like this discussion isn't really progressing anywhere, but mostly because the maintainers here seem to resist any change with the reasoning that it's "not that broken". Maybe it would be better to just recognize actual user feedback and support users in trying to improve the experience on the platform, by making some changes yourself and by running your own tests. But I guess I need to argue for days about a frankly simple change to the balancing, and now have to do more work to prove it actually improves things, without any experience in the codebase. Actually, no thanks. Just let this game die then.
You have chosen a topic that is very sensitive to the community. People have very different opinions about the state of the matchmaking, and it's hard to come to objective answers. I'm sorry that you feel frustrated. The reality of the situation is that this repository is currently unmaintained. Nobody has time to help you complete the feature. And even then it will probably not get deployed, because we just lack the manpower.
I understand. Sorry as well, because I don't have the time to work more on this right now. Still, I feel like with a codebase like this you could dare a bit more experimentation. That would be better than not changing anything at all while the game is objectively balancing badly quite often; I don't think there can be second opinions about this. I see there is a trade-off between waiting times and match quality, but I don't think that is the only reason why games are sometimes unbalanced.
Again, a game like this:
Balance 56%, and I was basically playing a 1v3. This is exactly the type of game that would be prevented by using the max rating deviation per team. Also, I don't see the point in ever playing a 56% balance game. The total ratings were 1221 vs 1617! The only reasonable thing to do here is Ctrl-K and not waste time with this game. It doesn't help that you don't see the ratings in-game anymore.
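For scale, a quick arithmetic check of the quoted totals:

```python
t1, t2 = 1221, 1617
disparity = abs(t1 - t2)                # 396 rating points
relative = disparity / ((t1 + t2) / 2)  # gap relative to the average team total
print(f"{relative:.0%}")                # ~28%
```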
Another perfect example of why this needs fixing: https://replay.faforever.com/26758871
Balance 94%. T1 -> low variance; both together -> average variance -> balance is not derated.
Fix: rate team variance based on the maximum variance of both teams instead of the variance of all players in the match.

#975
The idea here is to keep rating variety within each team of a match low. Instead of discounting quality for deviation across all ratings in a match, the deviation of each team is limited.
This fixes imbalances on maps with uneven spawn positions (lots of mexes, an air spot, a navy spot, ...), where the overall match variety may be low but one team gets a lot of rating variety while the other is very balanced: one team has the best and the worst rated players, while the other has the average ratings.
This will either give the strong player an advantage on a strong spot, or a disadvantage when they are on a weak spot, as the majority of the game will be in the hands of their weaker teammates.
By limiting the maximum variety per team, matches like this should be assigned lower quality.