Uninitialized ProcOrient, segmentation fault


Uninitialized ProcOrient, segmentation fault

Postby arango » Fri Jan 11, 2019 5:55 pm

I am using WRF version 4.0.3, and I am getting a segmentation fault when running in parallel on 4 MPI nodes with 2x2, 1x4, or 4x1 partitions (patch distributions). The error is in output_wrf.f90 around line 1032:
Code: Select all
       
       p => grid%head_statevars%next
       DO WHILE ( ASSOCIATED( p ) )
         IF ( p%ProcOrient .NE. 'X' .AND. p%ProcOrient .NE. 'Y' ) THEN 

because the character*1 p%ProcOrient is never assigned; it should have been initialized to a blank space ' '. It seems to be an issue in gen_allocs.c. The weird thing is that it runs with 1x2 or 2x1 partitions, so the above conditional does not seem robust in parallel (I am using OpenMPI). As far as I understand it, ProcOrient is either 'X', 'Y', or ' '. A more robust conditional would be:
Code: Select all
        IF ( p%ProcOrient .EQ. CHAR(32) ) THEN 

where CHAR(32) is the ASCII character for a blank space (SP). Does this make sense? I haven't been able to find a namelist parameter that takes care of this problem. I Googled this error and found information about replacing gen_allocs.c from several years ago (2010).
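For illustration, here is a minimal, self-contained sketch of the idea. The derived type below is a hypothetical stand-in for the registry-generated state-variable node, not the actual WRF declaration; the point is that default-initializing the component in the type definition makes the blank test well defined:

Code: Select all
      PROGRAM procorient_check
        IMPLICIT NONE

        ! Hypothetical stand-in for the registry-generated node; the
        ! real WRF derived type has many more components.
        TYPE statevar
          CHARACTER(LEN=1) :: ProcOrient = ' '   ! default-initialize to blank (SP)
        END TYPE statevar

        TYPE(statevar) :: p

        ! Blank means the field is not decomposed along X or Y.
        IF ( p%ProcOrient .EQ. CHAR(32) ) THEN
          PRINT *, 'field is not X- or Y-decomposed'
        END IF
      END PROGRAM procorient_check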

By the way, I am using ifort 19.0.1.144 20181018 and OpenMPI for Darwin.

Any suggestion is appreciated.

Re: Uninitialized ProcOrient, segmentation fault

Postby kwthomas » Fri Jan 11, 2019 7:02 pm

You can try running on more nodes. Every once in a while, a node configuration doesn't play well with WRF.

The Intel 17.x and 18.x compilers have a history of occasionally generating badly optimized code. Maybe Intel 19.x has the same problems; none of the systems I have access to have Intel 19.x.

Try rebuilding WRF from scratch at a lower optimization level if adding more nodes doesn't help.
Kevin W. Thomas
Center for Analysis and Prediction of Storms
University of Oklahoma

Re: Uninitialized ProcOrient, segmentation fault

Postby arango » Fri Jan 11, 2019 10:46 pm

Thank you for the suggestion. It is not related to the optimization level, since I get the same behavior with debugging flags (-g -O0) and optimized (-O3). I ran inside the parallel TotalView debugger, which is why I could provide detailed information about where the failure happens. I also made the modification suggested above with CHAR(32); it still fails at the same line with a SIGSEGV segmentation fault. Maybe the error is not in that IF conditional but in stack memory somewhere else. WRF seems to require lots of memory, yet my application grid is small (200x150x31).

I tried different partitions with more nodes (2x4, 4x2, 3x3) and still get the segmentation fault. It only works with 1x2 or 2x1. I may try gfortran to see if I get the same behavior, but first I need to check that I have consistent libraries built with the same version of gfortran.

Re: Uninitialized ProcOrient, segmentation fault

Postby arango » Mon Mar 04, 2019 7:16 pm

I tried again on a different Linux cluster, and I get a similar segmentation fault, but in input_wrf.f90 around line 1258:

Code: Select all
      p => grid%head_statevars%next
      DO WHILE ( ASSOCIATED( p ) )
        IF ( p%ProcOrient .NE. 'X' .AND. p%ProcOrient .NE. 'Y' ) THEN


The way this conditional is coded in both input_wrf.F and output_wrf.F is definitely not robust! This time I am running WRF on 16 PETs. Memory is not an issue, since I assigned 177 GB in my SLURM batch script. It died when processing the WRF nested grid in my coupled system DATA-WRF-ROMS:

Code: Select all
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
romsG              000000000E7E7704  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B590E3836D0  Unknown               Unknown  Unknown
romsG              0000000003E8626E  input_wrf_               1258  input_wrf.f90
romsG              0000000003BDCEEE  module_io_domain_         667  module_io_domain.f90
romsG              00000000040B22B4  med_initialdata_i          82  mediation_wrfmain.f90
romsG              00000000040B1BFF  med_initialdata_i          18  mediation_wrfmain.f90
romsG              000000000407EB27  med_nest_initial_         378  mediation_integrate.f90
romsG              0000000000DC100A  module_integrate_         300  module_integrate.f90
romsG              0000000000C9EAE3  module_wrf_top_mp         324  module_wrf_top.f90
romsG              000000000042A23A  esmf_wrf_mod_mp_w        2008  esmf_atm.f90


Now, if I remove the WRF nested grid, WRF dies in open_hist_w:

Code: Select all
forrtl: severe (66): output statement overflows record, unit -5, file Internal List-Directed Write
Image              PC                Routine            Line        Source
romsG              000000000E7DDE53  Unknown               Unknown  Unknown
romsG              000000000E836619  Unknown               Unknown  Unknown
romsG              00000000040B0EA9  open_hist_w_             2045  mediation_integrate.f90
romsG              00000000040A27BB  med_hist_out_             925  mediation_integrate.f90
romsG              000000000409E201  med_last_solve_io         710  mediation_integrate.f90
romsG              0000000000DC4619  module_integrate_         412  module_integrate.f90
romsG              0000000000C9EAE3  module_wrf_top_mp         324  module_wrf_top.f90
romsG              000000000042A23A  esmf_wrf_mod_mp_w        2008  esmf_atm.f90


It doesn't like this list-directed WRITE statement:

Code: Select all
  IF ( alarm_id .EQ. AUXHIST5_ALARM .AND. config_flags%mean_diag .EQ. 1 ) THEN
      WRITE(message, *) "RASM STATS: MEAN AUXHIST5 oid=", oid, " fname=", trim(fname), &
                        " alarm_id=", alarm_id, " Time_outNow=", timestr
      CALL wrf_debug(200, message)

Encoding into a character variable message this way is hazardous in parallel; it should be done only by the master PET, and it needs an explicit format specifier, say:

Code: Select all
     WRITE(message,'(a,i0,a,a,a,i0,a,a)') "RASM STATS: MEAN AUXHIST5 oid=", oid, &
                                          " fname=", trim(fname), " alarm_id=", alarm_id, &
                                          " Time_outNow=", timestr


There is no other way if one wants robust code and wants to avoid record overflows in parallel applications!
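To see the failure mode in isolation, here is a stand-alone sketch; the buffer length and values are made up, and the exact list-directed field widths are compiler dependent:

Code: Select all
      PROGRAM overflow_demo
        IMPLICIT NONE
        CHARACTER(LEN=24) :: message   ! deliberately short internal file
        INTEGER :: oid, ios

        oid = 7

        ! List-directed internal WRITE: the processor chooses the field
        ! widths (ifort pads integers to about 12 columns), so the record
        ! length is unpredictable and can overflow message -- exactly the
        ! forrtl severe (66) "output statement overflows record" abort.
        WRITE(message, *, IOSTAT=ios) "oid=", oid, " fname=", "wrfout_d01"
        PRINT *, 'list-directed IOSTAT = ', ios

        ! An explicit format with i0 writes minimal-width integers, and
        ! IOSTAT lets the caller detect overflow instead of aborting.
        WRITE(message, '(a,i0,a,a)', IOSTAT=ios) "oid=", oid, " fname=", "wrfout_d01"
        PRINT *, 'formatted IOSTAT = ', ios, ' message = ', TRIM(message)
      END PROGRAM overflow_demo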

By the way, the compiler optimization is -g -O0, so WRF is not optimized at all and we cannot blame the optimizer. I am using an older Intel compiler, version 17.0.4.

Why can I not run WRF with more than 4 PETs on both Darwin and Linux? Why does it hang when I add a nested grid (max_dom=2)?

Here are some details:

Code: Select all
Operating System : Linux
CPU Hardware     : x86_64
Compiler System  : ifort
Compiler Command : /opt/sw/packages/intel-17.0.4/mvapich2/2.2/bin/mpif90
Compiler Flags   : -fp-model precise -heap-arrays -g -traceback -check uninit
                   -warn interfaces,nouncalled -gen-interfaces
                   -I/projects/dmcs_1/sw/packages/intel-17.0.4/mvapich2-2.2/esmf/7.1.0r_nc3/mod/modO/Linux.intel.64.mvapich2.default
                   -I/projects/dmcs_1/sw/packages/intel-17.0.4/mvapich2-2.2/esmf/7.1.0r_nc3/include
                   -I/projects/dmcs_1/sw/packages/intel-17.0.4/netcdf/3.6.3/include
MPI Communicator : -2080374780  PET size = 16


Notice that I added option -heap-arrays to avoid issues with the stack. WRF is a memory hog.
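As a side note on -heap-arrays, here is a sketch of why the flag matters; the subroutine name and array size are hypothetical. With ifort, automatic arrays and array temporaries are placed on the stack by default, so one large scratch array can exhaust the stack limit and raise SIGSEGV; -heap-arrays moves them to the heap:

Code: Select all
      PROGRAM heap_arrays_demo
        IMPLICIT NONE
        CALL halo_scratch(2000)          ! 2000x2000 reals = ~16 MB of scratch
      CONTAINS
        SUBROUTINE halo_scratch(n)
          INTEGER, INTENT(IN) :: n
          REAL :: work(n,n)   ! automatic array: stack-allocated by default
                              ! with ifort; for large n this exceeds a typical
                              ! 8 MB stack limit and dies with SIGSEGV unless
                              ! compiled with -heap-arrays.
          work = 0.0
          PRINT *, 'scratch array of ', n*n, ' reals initialized'
        END SUBROUTINE halo_scratch
      END PROGRAM heap_arrays_demo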

I am reluctant to start modifying WRF code deep in the numerical kernel. I already have several corrections to WRF; we submitted the details via the git site. I would appreciate it if someone could explain the mechanism for submitting corrections, since I don't want to reapply these fixes to every new version of WRF. If any of the WRF developers/librarians want more details, please send me an email at arango@marine.rutgers.edu. Thank you.

