perf[next-dace]: Update thread block size to (256, 1) #2598
Co-authored-by: Daniel Ganellari <gadaniel@ethz.ch>
philip-paul-mueller
left a comment
LGTM.
However, I am not sure whether it is also set somewhere in ICON4Py.
Judging from the numbers you posted, though, I do not think it is.
Very good results, and nice color gradients in the plot!
However, I am not sure whether these default settings should be set in gt4py or in icon4py, in model_backends.py. To me, it feels more like an application default, so it should belong in icon4py.
I had the same trouble figuring out where this should be set. Since we have to set a default value in |
I also kind of think that it should be in ICON4Py, the question is just what value do we use in GT4Py?
Adding a fourth option: I vote for either 1 or 4, but I don't have a strong opinion.
I think that |
Update thread block size to (256, 1) as a default. Thanks to @dganellari for finding this out 👍

Initially this seemed to improve MI300A performance, but I've also tested GH200 and A100 GPUs with blueline on a single node (icon_ch1-medium-stencils experiment) and I saw improvements in some stencils.

Here are the results per GT4Py program on GH200:

And here are the results from icon_ch1-medium, which reports the timer of the whole dycore (taking the min timer for nh_solve from each output), on A100:

Since we see improvements in all cases, I thought we could use it as the default in GT4Py.
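
For readers less familiar with the setting, here is a minimal sketch of what a (256, 1) thread block shape implies for a 2D kernel launch. It only shows the generic ceiling-division rule for covering a domain with blocks. The `grid_dims` helper and the domain size are hypothetical and chosen for illustration; they are not taken from GT4Py, DaCe, or the benchmarks above.

```python
def grid_dims(domain_shape, block_shape):
    """Return how many thread blocks are needed per dimension to cover the domain.

    Hypothetical helper for illustration; not part of GT4Py or DaCe.
    """
    # Ceiling division: a partially filled block still has to be launched.
    return tuple(-(-n // b) for n, b in zip(domain_shape, block_shape))


# Assumed (horizontal, vertical) domain size, for illustration only.
domain = (20_000, 80)

# With a (256, 1) block, each block spans 256 horizontal points on one level.
print(grid_dims(domain, (256, 1)))  # (79, 80)
```

With this shape, all 256 threads of a block sit along the first dimension, so whether it helps presumably depends on how that dimension maps to memory in a given stencil.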