Chronic unexplained terminations w/o error message


Post by grantwp » Wed Jan 16, 2019 2:10 pm

I am a new WRF user and have been trying to run my first simulation in LES mode. My goal is a 4-hour, single-domain idealized simulation at 5-meter resolution with 0.03-second time steps. I'm running it on a new Ubuntu machine with 32 cores and 128 GB of RAM. To date, I have seen no evidence (e.g., using the 'top' command) that the model ever requires more than the expected modest fraction of the available RAM. There are no other competing user processes.

Nevertheless, at about 1:30 into the simulation, it is now quitting after (typically) 24 hours of wall-clock time. I have had to restart it many times to get as far as I have, though earlier in the simulation it usually ran much longer (several days or more) without interruption. Fortunately, I'm saving a restart file for every minute of simulation time, and the model always restarts without complaint from the latest file and successfully outputs a couple more before quitting again (it takes about 6.5 hours of wall-clock time per minute of simulation).

There is never any error message in any of the logs or output to the terminal; the program just quits. I have checked whether there is a system-imposed time limit on user processes, and that doesn't appear to be the case. I have tried running it on 16 rather than 32 cores; no difference. Command used is

mpirun -n 32 ./wrf.exe >& LOG.out &

Any help debugging this problem would be greatly appreciated. I would have thought that any numerical instability leading to a crash would carry over to a restart, so the fact that WRF plugs away for another few hundred time steps after each restart would seem to me to rule this out, but maybe I'm misunderstanding how restarts work.

I'm attaching the namelist.input file in case I have set something incorrectly. EDIT: The forum isn't allowing me to attach the file, regardless of what I set the extension to. If requested, I'll post it as text in a message below.
grantwp
 
Posts: 6
Joined: Mon Nov 19, 2018 7:46 pm

Re: Chronic unexplained terminations w/o error message

Post by kwthomas » Wed Jan 16, 2019 6:00 pm

Hi...

The general rule is that if WRF stops without leaving any error messages, the computer is in an OOM
(Out Of Memory) state. The computer/node will SIGKILL (kill -9) the top memory users. If it doesn't do this,
the system could hang, or crash and reboot.

If this happens, it is logged. Try running "dmesg -T" ("dmesg" only on older systems), and look to see if
anything was killed off. You'll have to do this on all the nodes.
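
A minimal sketch of that check, for illustration. The search patterns are an assumption based on typical Linux OOM-killer wording, which varies between kernel versions, and `filter_oom` is a hypothetical helper name, not a standard tool:

```shell
# filter_oom: hypothetical helper. Reads kernel log text on stdin and keeps
# only the lines that look like OOM-killer activity. The patterns are an
# assumption; the exact wording depends on the kernel version.
filter_oom() {
    grep -i -E 'out of memory|oom-killer|killed process'
}

# Typical use (dmesg may need sudo on systems that restrict it):
#   dmesg -T | filter_oom
```

If this prints nothing, the kernel log that is still available holds no record of an OOM kill. The kernel ring buffer is finite, so older entries may already have been overwritten.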

You can also have your job "echo $status" after the mpirun. This works for csh/tcsh. If you are using something
else, such as "bash", check the man page for the syntax to collect the status of the last command.

At least on systems that I've used, I get 9 (SIGKILL).
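
For bash and other POSIX-style shells, the status of the last command is in `$?`, and a command killed by signal N is normally reported as 128+N (so SIGKILL shows up as 137 rather than 9). A rough sketch of decoding it follows; `describe_status` is a hypothetical helper, and what mpirun itself returns when a rank is killed depends on the MPI implementation, so treat the result as a hint only:

```shell
# describe_status: hypothetical helper that interprets a shell exit status.
# Assumes the shell convention that "killed by signal N" is reported as 128+N;
# a program can also return a value above 128 on its own, so this is a hint.
describe_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "exited normally"
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with error code $status"
    fi
}

# Typical use, run in the foreground so that $? belongs to mpirun:
#   mpirun -n 32 ./wrf.exe > LOG.out 2>&1
#   describe_status $?
```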

The problem happening every 24 hours sounds suspicious. Check with your admin people to see if there are
any "cron" jobs running on the compute nodes. One of them may come along needing a lot of memory.
The OOM daemon won't zap ROOT commands.
Kevin W. Thomas
Center for Analysis and Prediction of Storms
University of Oklahoma
kwthomas
 
Posts: 279
Joined: Thu Aug 07, 2008 6:53 pm

Re: Chronic unexplained terminations w/o error message

Post by grantwp » Thu Jan 17, 2019 1:34 am

'dmesg -T' yields nothing but a long series of [UFW BLOCK] messages going all the way back to hours before the latest WRF instance died. This time it died after only 10 hours -- the interval seems to be getting shorter as I push further into the simulation.

There are no other nodes -- this is being run entirely on a private desktop computer that I built with an AMD Ryzen Threadripper 2950X (16 core/32 thread) processor and 128 GB of RAM. As I mentioned earlier, WRF isn't coming even close to consuming all the available RAM; it is closer to 64 GB every time I check.

I'll try the 'echo $status' command next and see what happens.

Thanks,
Grant

Re: Chronic unexplained terminations w/o error message

Post by grantwp » Wed Jan 23, 2019 12:04 pm

Whatever the cause of the termination, it appears to be quasi-random rather than deterministic: The same simulation has now been running (starting at the last saved restart time) without interruption for five days, advancing from 1:35 simulation time to 1:52, so far.

Re: Chronic unexplained terminations w/o error message

Post by grantwp » Tue Feb 05, 2019 4:22 pm

WRF has now been running continuously for over 18 calendar days: same simulation, same hardware and software environment, everything. I'm completely baffled as to why I had chronic stopping problems about a third of the way into a 4-hour simulation and why they have now mysteriously vanished. The only possibility that comes to mind is that it had something to do with interruptions in the remote connection, even though I had the WRF process set to redirect stdout and stderr to a file rather than to the remote terminal.
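
If dropped remote sessions are indeed the cause (which is unconfirmed), one common precaution is to start the job so that it ignores the hangup signal (SIGHUP) a shell may send to its background jobs when the connection closes. A minimal sketch using the standard `nohup` utility follows; `run_detached` is a hypothetical helper name, and whether it changes anything here is an assumption:

```shell
# run_detached LOGFILE COMMAND [ARGS...]
# Hypothetical helper: starts COMMAND in the background under nohup, with
# stdin detached and stdout/stderr sent to LOGFILE, then prints its PID.
run_detached() {
    logfile=$1
    shift
    nohup "$@" > "$logfile" 2>&1 < /dev/null &
    echo $!
}

# Typical use:
#   run_detached LOG.out mpirun -n 32 ./wrf.exe
```

Running the job inside a terminal multiplexer such as tmux or screen is another widely used way to keep a long run independent of the remote connection.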

