metgrid corrupts met_em.d01 netcdf files

metgrid corrupts met_em.d01 netcdf files

Postby hafner » Wed Sep 27, 2017 7:51 pm

Hello to all,

I am experiencing a weird problem with the metgrid program. When I run metgrid to generate the met_em.d01.xxxx
files in NetCDF format, some of them come out corrupt, and real.exe then crashes. In some cases the file is incomplete,
or it has a corrupted Times variable, e.g. Times = ""

But when I resubmit metgrid, it apparently fixes the erroneous met_em.d01.xxx files. It is frustrating to go and fix multiple NetCDF files, as it takes time. If anybody has had a similar experience, or has any suggestions on how to attack the problem, please let me know.

thanks!

Jan Hafner (Mr.)
hafner
 
Posts: 4
Joined: Wed Sep 27, 2017 7:43 pm

Re: metgrid corrupts met_em.d01 netcdf files

Postby kwthomas » Thu Sep 28, 2017 4:05 pm

Hi...

If your files are >2 GB and you didn't compile the software with the environment variable WRFIO_NCD_LARGE_FILE_SUPPORT set to 1 (needed for WRF 3.8.1 or earlier, not needed in WRF 3.9 or later), then you can expect corrupt files. The default is 32-bit NetCDF support. For files >2 GB, you need 64-bit NetCDF support. This is finally the default in WRF 3.9 and later.
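If large-file support is in doubt, one quick way to check an existing file is to read its netCDF version byte, which tells 32-bit from 64-bit offset classic files. This is a sketch using only standard Unix tools; the helper name is mine, and the byte values come from the netCDF classic format spec:

```shell
# Report which classic-format variant a netCDF file uses by reading its
# 4th byte (the version field after the "CDF" magic).
nc_offset_kind() {
  v=$(od -An -tu1 -j3 -N1 "$1" | tr -d ' ')
  case "$v" in
    1) echo "32-bit offset (CDF-1)" ;;
    2) echo "64-bit offset (CDF-2)" ;;
    5) echo "CDF-5 (64-bit data)" ;;
    *) echo "not classic netCDF (possibly netCDF-4/HDF5)" ;;
  esac
}
```

Run it as e.g. `nc_offset_kind met_em.d01.xxx.nc`; a >2 GB file written without large-file support would show up as CDF-1, which cannot hold it correctly.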

Are you using a Lustre filesystem with striping set? You may be on a system that allows you to change the stripe count, even though setting it above one can cause corrupt files. In my case, I had files with random large blocks of ASCII nulls in them. In each job run, the location of the null blocks was different. If you just see "Times =", then this *might* be happening to you.

In the directory you are running from type:

lfs getstripe . (make sure you get the period)

Look for the "stripe count" lines or count the number of lines under the "obdidx" line.

If it is greater than one, then stripes are being used and *may* be causing the problem. It is also possible that one of the OSTs has a problem.

For testing, try:

lfs setstripe -c 1 . (get the period)

in the directory you run from. Remove any files created from the previous run.

If this turns out to be the case, you might check with your System Admin to see what the rules for striping are.

The only place that I've run into this is on Lonestar5.
Kevin W. Thomas
Center for Analysis and Prediction of Storms
University of Oklahoma
kwthomas
 
Posts: 188
Joined: Thu Aug 07, 2008 6:53 pm

Re: metgrid corrupts met_em.d01 netcdf files

Postby hafner » Tue Oct 03, 2017 7:46 pm

Kevin

thanks for your suggestion, but it seems it did not work. The NetCDF files are relatively small, about 164 MB,
and the problem affects only a few of the output met_em.d01.xxx.nc files. When I check striping for those files, the stripe count is 1, i.e. not striped.

I am trying to complete a 6-month simulation based on 6-hourly data, so there are a lot of met_em.d01.xxx (and met_em.d02.xxxx) files. Only about 10-15 of the total are problematic, with, for example, a corrupted NetCDF file or a missing Times variable.

When I modify the namelist.wps dates and re-run metgrid, the files come out just fine. It is frustrating to manually look for and fix all the corrupted met_em.d01.xxx files.
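One way to avoid hunting for the bad files by hand is a quick sanity sweep over the run directory that flags files whose first bytes are not a recognized netCDF signature. This is only a sketch (the met_em.d0*.nc pattern and helper name are assumptions); it catches empty or garbage files, but a file that was truncated after a valid header would need a fuller check such as `ncdump -h`:

```shell
# Flag met_em files that do not begin with a recognized netCDF signature:
# classic files start with "CDF", netCDF-4/HDF5 files with "\211HDF".
is_valid_nc() {
  sig=$(head -c 8 "$1" | od -An -c | tr -d ' \n')
  case "$sig" in
    CDF*|*HDF*) return 0 ;;
    *) return 1 ;;
  esac
}

# Run this in the WPS output directory.
for f in met_em.d0*.nc; do
  [ -e "$f" ] || continue          # glob matched nothing
  if ! is_valid_nc "$f"; then
    echo "suspect: $f"             # candidate date for rerunning metgrid
  fi
done
```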

Do you have any other suggestions about what could be wrong, whether it is a problem with metgrid in WPS, or more likely an issue specific to my machine?

Please, let me know your opinion.

Thanks !

Jan

Re: metgrid corrupts met_em.d01 netcdf files

Postby kwthomas » Wed Oct 04, 2017 5:11 pm

Hi Jan...

Are you doing this as an MPI task? If so, try adding or subtracting a node or two. If you use
nproc_x/nproc_y, you'll have to adjust those.

I know of someone on LONESTAR5 that had to do this for WRF after an OS upgrade.

One of the CAPS people ran into the same problem on STAMPEDE2. They changed their values to ones that I've used, and all worked fine.

I can tell you the STAMPEDE2 problem was either missing NETCDF files or files that were only 32 bytes. We write splitfiles (io_form_history=102), so non-splitfiles may error differently.

This is an MPI bug, but who knows where.

Hopefully, it will fix your problem.

Re: metgrid corrupts met_em.d01 netcdf files

Postby hafner » Thu Oct 05, 2017 9:39 pm

Kevin

Yes, I believe it is an MPI job, and I am running it on our local cluster (U. of Hawaii at Manoa).

The Slurm script is:
======================
#!/bin/sh
#SBATCH -J metgrd04
#SBATCH -t 24:00:00
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH --tasks-per-node=20
#SBATCH --partition=exclusive.q
###SBATCH --partition=community.q
#SBATCH -D /home/jhafner/lus/wrf3.8/wps4
source ~/.bash_profile #load your environment and the ability to use modules

cd $SLURM_SUBMIT_DIR

export OMP_NUM_THREADS=1

export I_MPI_FABRICS=tmi
export I_MPI_PMI_LIBRARY=/opt/local/slurm/default/lib64/libpmi.so
env

srun -n 20 /home/jhafner/lus/wrf3.8/wps4/metgrid.exe > /home/jhafner/lus/wrf3.8/wps4/metgrid.exe.out 2>&1

========================== end ==============================================

I will try to run it with a different nproc combination (not 20 as in this case). At least it seems to be a local MPI-related problem, so I can start bugging local support... well, I will see.
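For reference, a minimal variation of the batch header above, in the spirit of Kevin's "add or subtract a node" suggestion: same 20 total MPI ranks, spread over two nodes instead of one. The exact split is an assumption to be tuned; everything else in the script stays as it is.

```shell
# Hypothetical variation: two nodes, 10 tasks each, 20 MPI ranks total.
#SBATCH -N 2
#SBATCH --tasks-per-node=10

srun -n 20 /home/jhafner/lus/wrf3.8/wps4/metgrid.exe
```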

thank you for your kind help ..

Jan

Re: metgrid corrupts met_em.d01 netcdf files

Postby kwthomas » Mon Oct 09, 2017 3:22 pm

Try two nodes, as in

#SBATCH -N 2

Re: metgrid corrupts met_em.d01 netcdf files

Postby hafner » Tue Oct 10, 2017 3:40 pm

Kevin
thanks! I will try ..

Jan

