Error in `./wrf.exe': double free or corruption (out): 0x000

Post by ChenShuying » Fri Nov 24, 2017 5:24 am

Dear all, I am running WRF V3.9.1.1 with ERA-Interim as the initial and lateral boundary conditions, using 30 processors to run wrf.exe. My model always breaks down with either a CFL error or the error pasted below. Reducing time_step fixes the CFL error (though that is not a good solution, because I am running a long simulation and it takes too much time), but it does not help with the error below. Has anybody else run into this error? How can I resolve it? In fact, I don't even know what this error means. If you know anything about it, please let me know; it is really urgent.

*** Error in `./wrf.exe': double free or corruption (out): 0x0000000008e38e60 ***
37043 ======= Backtrace: =========
37044 /lib64/libc.so.6(+0x7c503)[0x7f9498b58503]
37045 ./wrf.exe[0x1ce6ee4]
37046 ./wrf.exe[0x1d1c420]
37047 ./wrf.exe[0x1d2570c]
37048 ./wrf.exe[0x16cde17]
37049 ./wrf.exe[0x17b087d]
37050 ./wrf.exe[0x11b1b7a]
37051 ./wrf.exe[0x1088eab]
37052 ./wrf.exe[0x47289b]
37053 ./wrf.exe[0x472e7c]
37054 ./wrf.exe[0x4081c4]
37055 ./wrf.exe[0x4079dd]
37056 /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9498afdb35]
37057 ./wrf.exe[0x407a14]
37058 ======= Memory map: ========
37059 00400000-02f0b000 r-xp 00000000 00:28 41407849012 /home/CSY/Build_WRF/WRFV3/main/wrf.exe
37060 0310a000-0310b000 r--p 02b0a000 00:28 41407849012 /home/CSY/Build_WRF/WRFV3/main/wrf.exe
37061 0310b000-0315d000 rw-p 02b0b000 00:28 41407849012 /home/CSY/Build_WRF/WRFV3/main/wrf.exe
37062 0315d000-08a6d000 rw-p 00000000 00:00 0
37063 08e36000-11a1e000 rw-p 00000000 00:00 0 [heap]
37064 7f9474000000-7f9474021000 rw-p 00000000 00:00 0
37065 7f9474021000-7f9478000000 ---p 00000000 00:00 0
37066 7f9478497000-7f94784d8000 rw-s 00000000 00:12 4381486 /dev/shm/mpich_shar_tmpUaWNns (deleted)
37067 7f94784d8000-7f9478519000 rw-s 00000000 00:12 4607829 /dev/shm/mpich_shar_tmpYP2Qtq (deleted)
37068 7f9478519000-7f947855a000 rw-s 00000000 00:12 4669558 /dev/shm/mpich_shar_tmpfbUHvq (deleted)
37069 7f947855a000-7f947859b000 rw-s 00000000 00:12 3515979 /dev/shm/mpich_shar_tmpWGCiqp (deleted)
37070 7f947859b000-7f94785dc000 rw-s 00000000 00:12 4142561 /dev/shm/mpich_shar_tmpfNbywq (deleted)
37071 7f94785dc000-7f947861d000 rw-s 00000000 00:12 3391270 /dev/shm/mpich_shar_tmp6Mkkrp (deleted)
37072 7f947861d000-7f947865e000 rw-s 00000000 00:12 4145778 /dev/shm/mpich_shar_tmpGzRBuq (deleted)
37073 7f947865e000-7f947869f000 rw-s 00000000 00:12 4386188 /dev/shm/mpich_shar_tmpnFMMnp (deleted)
37074 7f9478fed000-7f947b7e7000 rw-p 00000000 00:00 0
37075 7f947b7e7000-7f947b828000 rw-s 00000000 00:12 4580581 /dev/shm/mpich_shar_tmpKn65AK (deleted)
37076 7f947b828000-7f947d5be000 rw-p 00000000 00:00 0
37077 7f947d5be000-7f947d5ff000 rw-s 00000000 00:12 4580580 /dev/shm/mpich_shar_tmpDjlfDw (deleted)
37078 7f947d5ff000-7f9491f87000 rw-p 00000000 00:00 0
37079 7f9491f87000-7f9491fc8000 rw-s 00000000 00:12 4340120 /dev/shm/mpich_shar_tmp9nf1NX (deleted)
37080 7f9491fc8000-7f9492009000 rw-s 00000000 00:12 4655856 /dev/shm/mpich_shar_tmpk0pfIX (deleted)
37081 7f9492009000-7f949204a000 rw-s 00000000 00:12 4669555 /dev/shm/mpich_shar_tmpauT2KX (deleted)
37082 7f949204a000-7f949208b000 rw-s 00000000 00:12 4463712 /dev/shm/mpich_shar_tmpP8H7XX (deleted)
37083 7f949208b000-7f9493238000 rw-p 00000000 00:00 0
37084 7f9493256000-7f9493fa4000 rw-p 00000000 00:00 0
37085 7f9493fa4000-7f9493fb0000 r-xp 00000000 08:02 134322311 /usr/lib64/libnss_files-2.17.so
37086 7f9493fb0000-7f94941af000 ---p 0000c000 08:02 134322311 /usr/lib64/libnss_files-2.17.so
37087 7f94941af000-7f94941b0000 r--p 0000b000 08:02 134322311 /usr/lib64/libnss_files-2.17.so
37088 7f94941b0000-7f94941b1000 rw-p 0000c000 08:02 134322311 /usr/lib64/libnss_files-2.17.so
37089 7f94941b1000-7f94941b7000 rw-p 00000000 00:00 0
37090 7f94941b7000-7f9498adc000 rw-s 00000000 00:12 4463707 /dev/shm/mpich_shar_tmpneYGvk (deleted)
37091 7f9498adc000-7f9498c92000 r-xp 00000000 08:02 134261429 /usr/lib64/libc-2.17.so
37092 7f9498c92000-7f9498e92000 ---p 001b6000 08:02 134261429 /usr/lib64/libc-2.17.so
37093 7f9498e92000-7f9498e96000 r--p 001b6000 08:02 134261429 /usr/lib64/libc-2.17.so
37094 7f9498e96000-7f9498e98000 rw-p 001ba000 08:02 134261429 /usr/lib64/libc-2.17.so
37095 7f9498e98000-7f9498e9d000 rw-p 00000000 00:00 0
37096 7f9498e9d000-7f9498ed8000 r-xp 00000000 08:02 134256705 /usr/lib64/libquadmath.so.0.0.0
37097 7f9498ed8000-7f94990d7000 ---p 0003b000 08:02 134256705 /usr/lib64/libquadmath.so.0.0.0
37098 7f94990d7000-7f94990d8000 r--p 0003a000 08:02 134256705 /usr/lib64/libquadmath.so.0.0.0
37099 7f94990d8000-7f94990d9000 rw-p 0003b000 08:02 134256705 /usr/lib64/libquadmath.so.0.0.0
37100 7f94990d9000-7f94990ee000 r-xp 00000000 08:02 134217804 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
37101 7f94990ee000-7f94992ed000 ---p 00015000 08:02 134217804 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
37102 7f94992ed000-7f94992ee000 r--p 00014000 08:02 134217804 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
37103 7f94992ee000-7f94992ef000 rw-p 00015000 08:02 134217804 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
37104 7f94992ef000-7f94993ef000 r-xp 00000000 08:02 134261437 /usr/lib64/libm-2.17.so
37105 7f94993ef000-7f94995ef000 ---p 00100000 08:02 134261437 /usr/lib64/libm-2.17.so
37106 7f94995ef000-7f94995f0000 r--p 00100000 08:02 134261437 /usr/lib64/libm-2.17.so
37107 7f94995f0000-7f94995f1000 rw-p 00101000 08:02 134261437 /usr/lib64/libm-2.17.so
37108 7f94995f1000-7f9499710000 r-xp 00000000 08:02 138352275 /usr/lib64/libgfortran.so.3.0.0
37109 7f9499710000-7f9499910000 ---p 0011f000 08:02 138352275 /usr/lib64/libgfortran.so.3.0.0
37110 7f9499910000-7f9499911000 r--p 0011f000 08:02 138352275 /usr/lib64/libgfortran.so.3.0.0
37111 7f9499911000-7f9499913000 rw-p 00120000 08:02 138352275 /usr/lib64/libgfortran.so.3.0.0
37112 7f9499913000-7f949992a000 r-xp 00000000 08:02 134322319 /usr/lib64/libpthread-2.17.so
37113 7f949992a000-7f9499b29000 ---p 00017000 08:02 134322319 /usr/lib64/libpthread-2.17.so
37114 7f9499b29000-7f9499b2a000 r--p 00016000 08:02 134322319 /usr/lib64/libpthread-2.17.so
37115 7f9499b2a000-7f9499b2b000 rw-p 00017000 08:02 134322319 /usr/lib64/libpthread-2.17.so
37116 7f9499b2b000-7f9499b2f000 rw-p 00000000 00:00 0
37117 7f9499b2f000-7f9499b36000 r-xp 00000000 08:02 134322323 /usr/lib64/librt-2.17.so
37118 7f9499b36000-7f9499d35000 ---p 00007000 08:02 134322323 /usr/lib64/librt-2.17.so
37119 7f9499d35000-7f9499d36000 r--p 00006000 08:02 134322323 /usr/lib64/librt-2.17.so
37120 7f9499d36000-7f9499d37000 rw-p 00007000 08:02 134322323 /usr/lib64/librt-2.17.so
37121 7f9499d37000-7f9499d57000 r-xp 00000000 08:02 134261422 /usr/lib64/ld-2.17.so
37122 7f9499d72000-7f9499f42000 rw-p 00000000 00:00 0
37123 7f9499f54000-7f9499f56000 rw-p 00000000 00:00 0
37124 7f9499f56000-7f9499f57000 r--p 0001f000 08:02 134261422 /usr/lib64/ld-2.17.so
37125 7f9499f57000-7f9499f58000 rw-p 00020000 08:02 134261422 /usr/lib64/ld-2.17.so
37126 7f9499f58000-7f9499f59000 rw-p 00000000 00:00 0
37127 7fff378ef000-7fff37912000 rw-p 00000000 00:00 0 [stack]
37128 7fff379b2000-7fff379b4000 r-xp 00000000 00:00 0 [vdso]
37129 ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
37130
37131 Program received signal SIGABRT: Process abort signal.
37132
37133 Backtrace for this error:
37134 #0 0x7F949960A6F7
37135 #1 0x7F949960AD3E
37136 #2 0x7F9498B1124F
37137 #3 0x7F9498B111D7
37138 #4 0x7F9498B128C7
37139 #5 0x7F9498B50F06
37140 #6 0x7F9498B58502
37141 #7 0x1CE6EE3 in __module_ra_cam_MOD_radcswmx
37142 #8 0x1D1C41F in __module_ra_cam_MOD_radctl
37143 #9 0x1D2570B in __module_ra_cam_MOD_camrad
37144 #10 0x16CDE16 in __module_radiation_driver_MOD_radiation_driver
37145 #11 0x17B087C in __module_first_rk_step_part1_MOD_first_rk_step_part1
37146 #12 0x11B1B79 in solve_em_
37147 #13 0x1088EAA in solve_interface_
37148 #14 0x47289A in __module_integrate_MOD_integrate
37149 #15 0x472E7B in __module_integrate_MOD_integrate
37150 #16 0x4081C3 in __module_wrf_top_MOD_wrf_run
ChenShuying
 
Posts: 5
Joined: Fri Aug 18, 2017 8:18 am

Re: Error in `./wrf.exe': double free or corruption (out): 0

Post by kwthomas » Mon Nov 27, 2017 7:26 pm

The "double free" error probably means your job ran out of memory. It's very easy to miss checking the return status from an ALLOCATE when writing source code, and using an ALLOCATEd array that never actually got its memory can very easily mess up memory bookkeeping.

Try more nodes.

CFL errors are a model instability problem. Sometimes the model recovers, and sometimes it crashes. At least in our case, CFL errors are caused by severe convection that doesn't get throttled.

If you want to play around with settings, here are changes we made this past spring:

mp_tend_lim = 0.07
mp_zero_thresh = 1.e-12
diff_6th_opt = 2
diff_6th_factor = 0.25
moist_adv_opt = 1 (we'd been using 2)

This allowed us to use bigger time steps. For 3 km, we used 20 sec with Thompson. In the past, even 18 sec would sometimes seg fault.

Good luck.
Kevin W. Thomas
Center for Analysis and Prediction of Storms
University of Oklahoma
kwthomas
 
Posts: 208
Joined: Thu Aug 07, 2008 6:53 pm

Re: Error in `./wrf.exe': double free or corruption (out): 0

Post by ChenShuying » Fri Dec 01, 2017 8:01 am

Thanks very much for your reply, @kwthomas.
Trying more nodes did not help; I still get the "double free" crash :( We are now checking the server to rule out a problem with the computer itself.
I am also interested in these settings:

mp_tend_lim = 0.07
mp_zero_thresh = 1.e-12
diff_6th_opt = 2
diff_6th_factor = 0.25
moist_adv_opt = 1 (we'd been using 2)

How can I use them so that my model accepts a larger time step when CFL errors appear? Which part of namelist.input should these settings go in? I am very willing to try.

Re: Error in `./wrf.exe': double free or corruption (out): 0

Post by kwthomas » Fri Dec 01, 2017 6:18 pm

mp_tend_lim
mp_zero_out_thresh (name typo in original email)
are in "&physics".

diff_6th_opt
diff_6th_factor
moist_adv_opt
are in "&dynamics"
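
As a rough sketch of the placement described above, the five settings might sit in namelist.input as shown here. The values are simply the ones quoted earlier in this thread for a 3 km run with Thompson, so treat them as a starting point to experiment with, not as recommended defaults. A single domain is assumed, and every other entry in each group is omitted.

```fortran
! Sketch only: merge these lines into the &physics and &dynamics groups
! that already exist in namelist.input; do not create duplicate groups.
&physics
 mp_tend_lim        = 0.07,
 ! Assumption: this threshold only matters when mp_zero_out is non-zero;
 ! check README.namelist for your WRF version.
 mp_zero_out_thresh = 1.e-12,
/

&dynamics
 ! One value per domain; a single domain is assumed here.
 diff_6th_opt       = 2,
 diff_6th_factor    = 0.25,
 moist_adv_opt      = 1,
/
```

After editing, rerunning wrf.exe from the last restart file is a cheap way to see whether the crash point moves.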

It is unlikely that the computer itself is causing the "double free" error. If the computer were having problems with memory, other programs would likely fail too.

More likely is faulty model code, or faulty compiler optimization. Try building your software with a lower optimization level and see what happens.

Compiler optimization problems can happen. I have to build WRF with Intel 17.0.4 at -O2 as -O3 builds fail with some physics schemes that we run. Fortunately, the problems are fixed in Intel 18.0.0 which came out recently.
Kevin W. Thomas
Center for Analysis and Prediction of Storms
University of Oklahoma

Re: Error in `./wrf.exe': double free or corruption (out): 0

Post by ChenShuying » Sat Dec 02, 2017 1:36 am

The settings look great, I'll try them! Thanks, @kwthomas.

I am still fighting with the "double free or corruption (out)" error. I understand your point and agree with you. But the weird thing is that I am running a one-year simulation, starting from winter into spring, and have completed almost 2/3 of it. Early on, the model would also crash, but never with "double free", only with the CFL error, which I fixed by reducing the time step. Now, however, it always breaks down with "double free". So I am a bit suspicious of our server. Furthermore, I ran my model with a different physics scheme on the same server, and this time WRF crashed with "Segmentation fault - invalid memory reference". That also looks like a memory problem. Do you have any idea about the segmentation fault?
Maybe I should try recompiling WRF :(

Re: Error in `./wrf.exe': double free or corruption (out): 0

Post by kwthomas » Fri Dec 08, 2017 5:32 pm

Intel 17.0.4 causes WRF problems with -O3. -O2 fixes things. Try that with your version.
Kevin W. Thomas
Center for Analysis and Prediction of Storms
University of Oklahoma

