Impact of Decomposition on WRF performance

Postby tcraig » Sun Dec 08, 2013 4:20 pm

I have a 275 x 205 WRF grid, and I have been playing with task counts and decomposition options. This is using MPI only, no OpenMP.

First, the algorithm that generates the decomposition automatically (with nproc_x and nproc_y = -1) is far from ideal, as far as I can tell. That algorithm, subroutine MPASPECT in external/RSL_LITE/module_dm.F, determines nproc_x and nproc_y when the user doesn't set them. It currently requires that nproc_x divide ntasks (the total number of MPI tasks) evenly, places no such requirement on nproc_y, and then picks the combination that minimizes abs(nproc_y - nproc_x). As a result, the product of nproc_x and nproc_y is often smaller than ntasks, which leaves processors idle, and the chosen decomposition is often poor.
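To make that concrete, here is a toy version of the selection rule as I read it (my own simplification for experimenting, not the actual MPASPECT source):

Code:
! Toy version of the selection rule as I read it: my own simplification
! for experimenting, not the actual code in module_dm.F.
program pick_decomp
  implicit none
  integer :: ntasks, nx, ny, best_nx, best_ny, best_diff

  ntasks    = 575                       ! total MPI tasks
  best_nx   = 1
  best_ny   = ntasks
  best_diff = abs(best_ny - best_nx)

  do nx = 2, ntasks
     if (mod(ntasks, nx) /= 0) cycle    ! nproc_x must divide ntasks evenly
     ny = ntasks / nx                   ! simplest consistent choice for nproc_y
     if (abs(ny - nx) < best_diff) then ! keep the most nearly square pair
        best_diff = abs(ny - nx)
        best_nx   = nx
        best_ny   = ny
     end if
  end do

  print '(a,i0,a,i0,a,i0,a)', 'picked ', best_nx, ' x ', best_ny, &
        ' (', best_nx*best_ny, ' tasks used)'
end program pick_decomp

Note that this simplified version would pick 23 x 25 for 575 tasks, whereas the real code settles on 23 x 23 (see the table below), so nproc_y is clearly being chosen by some other rule, which is part of my confusion.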

Does anyone have any idea whether it's a requirement that nproc_x divide ntasks evenly? And how was the optimization min(abs(nproc_y-nproc_x)) chosen?

Below is a small subset of the timing results for my case on a Cray XE6; it illustrates several of these issues.

Code:
 total   default   nproc_x   nproc_y   nproc_x *   local domain    timing
 tasks   decomp                         nproc_y        size       (sec/day)
  572      no         11        52        572          25x4          164
  574      yes        14        41        574          20x5          167
  575      yes        23        23        529          12x9          197
  575      no         25        23        575          11x9          175


First, comparing the default decomposition at 575 tasks (23 x 23) with a more reasonable one (25 x 23) shows a performance difference of about 10% (197 versus 175 sec/day). The default sets nproc_x and nproc_y to 23 x 23, which uses only 529 tasks, so I believe 46 tasks sit idle. A user-defined decomposition of 25 x 23 (575 tasks) keeps every task busy and is about 10% faster as far as I can tell.
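For reference, this is how I am forcing the 25 x 23 layout; as far as I know these are the relevant entries in the &domains section of namelist.input, and leaving them at -1 gives the automatic decomposition:

Code:
&domains
 nproc_x = 25,   ! MPI tasks along x; -1 (the default) means decide automatically
 nproc_y = 23,   ! MPI tasks along y
/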

Second, I was surprised that the skinnier local domains, 25x4 (572 tasks) and 20x5 (574 tasks), outperformed 11x9 (575 tasks). My expectation was that lower-aspect-ratio tiles would be faster. Does anyone have a better understanding of why the higher-aspect-ratio blocks win here?
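To put a number on why this surprised me: if halo-exchange cost roughly scales with tile perimeter (assuming a one-cell halo purely for illustration), the skinny tiles should be the ones paying relatively more for communication:

Code:
! Perimeter-to-area ratio of the local tiles, assuming a one-cell halo,
! as a crude proxy for communication relative to computation.
program tile_ratio
  implicit none
  integer, dimension(3) :: ni = (/ 25, 20, 11 /)   ! tile size in x
  integer, dimension(3) :: nj = (/  4,  5,  9 /)   ! tile size in y
  integer :: k

  do k = 1, 3
     print '(i0,a,i0,a,f5.2)', ni(k), ' x ', nj(k), '   perimeter/area = ', &
           real(2*(ni(k)+nj(k))) / real(ni(k)*nj(k))
  end do
end program tile_ratio

By that measure 25x4 is the worst of the three (0.58 versus about 0.40 for 11x9), yet it is the fastest in practice, which is exactly what I don't understand.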

I have lots of other data, but I believe the MPASPECT subroutine has plenty of room for improvement, and I would like to know whether others agree. Does the current implementation contain a bug? For instance, should nproc_y also be required to divide ntasks evenly? That would significantly improve the algorithm, although it would make some task counts unrunnable. Or would a better optimization criterion help? I am planning to change my local implementation so that it optimizes a weighted sum of the nproc_x/nproc_y aspect ratio and the number of tasks actually used relative to the target task count.
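To be concrete, the sketch below is roughly what I have in mind; the weights and the exact penalty terms are placeholders I still need to tune, not anything taken from the WRF source:

Code:
! Rough sketch of the optimization I am planning: score each candidate
! (nproc_x, nproc_y) pair by a weighted sum of an aspect-ratio penalty and
! an idle-task penalty.  The weights here are placeholders, not tuned values.
program weighted_decomp
  implicit none
  integer :: ntasks, nx, ny, best_nx, best_ny
  real    :: aspect, idle_frac, score, best_score
  real, parameter :: w_aspect = 1.0, w_idle = 4.0   ! placeholder weights

  ntasks     = 575
  best_score = huge(1.0)
  best_nx    = 1
  best_ny    = ntasks

  do nx = 1, ntasks
     do ny = 1, ntasks / nx             ! only pairs that fit within ntasks
        aspect    = real(max(nx, ny)) / real(min(nx, ny)) - 1.0
        idle_frac = real(ntasks - nx*ny) / real(ntasks)
        score     = w_aspect*aspect + w_idle*idle_frac
        if (score < best_score) then
           best_score = score
           best_nx    = nx
           best_ny    = ny
        end if
     end do
  end do

  print '(a,i0,a,i0,a,i0,a)', 'best pair: ', best_nx, ' x ', best_ny, &
        ' (', ntasks - best_nx*best_ny, ' idle tasks)'
end program weighted_decomp

With these placeholder weights, 575 tasks gives 23 x 25, the same processor counts as the hand-picked 25 x 23 run in the table above.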

I am trying hard to understand WRF performance and to tune toward a reasonably optimal task count and decomposition, but I haven't been able to find much about this in the User's Guide or on the forum. If anyone has pointers to documentation on WRF parallelization performance, that would be very helpful.

thanks,

tony.....