
Re: bad files causing thredds crash



Hi Bob:

1) What heap memory settings are you giving to the JVM on tomcat startup? (to verify, you can look at http://server/thredds/debug and choose "showRuntimeInfo", assuming you have remote debugging enabled)

2) What is the hardware, OS, and JVM version?

3) how big is /thredds/Hfradar/aggreghfradarC1/u33h/catalog.xml  ?

4) Remind me of the URL of your server so I can see how big some of these datasets are.

I'm guessing there is a "memory leak" (objects not getting GC'd), but I need to eliminate the obvious first.

Re: the info in the log, eg:

2007-11-28T04:55:27.959 -0800 [ 633762099][ 1094643]

= date [msecs since server start] [request number]

So this server has been up about 7 days and has served > 1 million requests (!)
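For what it's worth, that prefix is easy to pull apart mechanically. Here is a small sketch (the class and method names are mine, not part of the TDS) that extracts the two bracketed numbers from a threddsServlet.log line:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the log-prefix layout described above:
//   date [msecs since server start] [request number]
public class LogPrefix {
    private static final Pattern PREFIX =
        Pattern.compile("\\[\\s*(\\d+)\\]\\[\\s*(\\d+)\\]");

    // Returns {uptimeMillis, requestNumber}, or null if the prefix is absent.
    public static long[] parse(String line) {
        Matcher m = PREFIX.matcher(line);
        if (!m.find()) return null;
        return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
    }

    public static void main(String[] args) {
        long[] p = parse("2007-11-28T04:55:27.959 -0800 [ 633762099][ 1094643]");
        // 633762099 ms of uptime is roughly 7.3 days
        System.out.println("uptime days = " + p[0] / 86400000.0 + ", request # " + p[1]);
    }
}
```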

I need to investigate what the TDS does with bad files, especially if you replace them. I'll get back to you ASAP.

Bob Simons wrote:
OK. THREDDS crashed this morning; attached are the catalina.out and threddsServlet.log files from that time period.

I hope that first bracketed number on each line of threddsServlet.log isn't memoryInUse. Is it?


A possibly related question about bad files:
Suppose there is a bad data file (e.g., 0 bytes) which THREDDS notices and we subsequently delete. If we then replace that file, does THREDDS notice that it is a new file (because the cacheAged cache includes information about the file's date and notices that the date has changed)? Or do we need to delete the relevant cacheAged file whenever we delete a data file?


Thanks for looking into this.

John Caron wrote:
Too bad. If/when it happens again, grab it. I will try to replicate here.

(You are checking all files of the form threddsServlet.log.N, N=1..10?)


Bob Simons wrote:
Sorry. It has been too long. That information is no longer available.

John Caron wrote:
Sorry to be vague; in content/thredds/logs/threddsServlet.log there should be at least one more message, possibly two.

Bob Simons wrote:
That was the only message with "[    6343]".


John Caron wrote:
Hi Bob:

Can you look in the log file for all messages from request 6343:

2007-11-15T08:17:57.883 -0800 [ 1024030][ 6343] ERROR - thredds.server.opendap.NcDODSServlet - Error
java.lang.OutOfMemoryError: Java heap space
These will be in order, but may be interleaved with other requests.

I'd like to see what size the original request was.


Bob Simons wrote:
We're having a lot of problems with thredds crashing lately.
We're hoping you can look into the problem or give us advice.
We recently updated to THREDDS Data Server Version 3.16.22 Build Date = 2007-10-25 19:10:28.


The general scenario is:
1) One or more invalid .nc files are added to the data directories.
  Obviously, this isn't intentional.
  And we just found the source of many of these: a bad NIC card.
  But given the huge number of files involved, it has always
  occurred periodically.
  Often, it is a file with 0 bytes.

2) During aggregation, thredds notices the bad file and reports
  it to catalina.out. For example:

2007-11-15T09:17:45.277 -0800 [ 1446340][ 3538] ERROR - thredds.server.opendap.NcDODSServlet - Error
java.io.IOException: Cant read /u00/satellite/AK/sstd/hday/AK2007272_014200h_sstd.nc: not a valid NetCDF file.
        at ucar.nc2.NetcdfFile.open(NetcdfFile.java:459)
        at ucar.nc2.NetcdfFile.open(NetcdfFile.java:261)
        at ucar.nc2.dataset.NetcdfDataset.openFile(NetcdfDataset.java:341)
        at ucar.nc2.ncml.Aggregation$Dataset$PolymorphicReader.open(Aggregation.java:1286)
        at ucar.nc2.NetcdfFileCache.acquire(NetcdfFileCache.java:188)
        at ucar.nc2.ncml.Aggregation$Dataset._acquireFile(Aggregation.java:1186)
        at ucar.nc2.ncml.Aggregation$Dataset.acquireFile(Aggregation.java:1172)
        at ucar.nc2.ncml.Aggregation$Dataset.getNcoords(Aggregation.java:1100)
        at ucar.nc2.ncml.Aggregation$Dataset.setStartEnd(Aggregation.java:1121)
        at ucar.nc2.ncml.Aggregation$Dataset.access$100(Aggregation.java:1002)
        at ucar.nc2.ncml.Aggregation.buildCoords(Aggregation.java:315)
        at ucar.nc2.ncml.AggregationExisting.buildDataset(AggregationExisting.java:61)
        at ucar.nc2.ncml.Aggregation.finish(Aggregation.java:293)
        at ucar.nc2.ncml.NcMLReader.readNetcdf(NcMLReader.java:397)

3) Sometime afterwards
(not immediately, but it seems to be sooner if there are more bad files),
THREDDS crashes. For example:


2007-11-15T08:17:57.883 -0800 [ 1024030][ 6343] ERROR - thredds.server.opendap.NcDODSServlet - Error
java.lang.OutOfMemoryError: Java heap space
DODServlet ERROR (anyExceptionHandler): java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
ReqState:
serverClassName: 'thredds.server.opendap.NcDODSServlet'
dataSet: 'satellite/CM/usfc/hday'
requestSuffix: 'dods'
CE: 'altitude,lat,lon,time'
compressOK: false
InitParameters:
DebugOn: ''
maxNetcdfFilesCached: '10'

java.lang.OutOfMemoryError: Java heap space
There may be a couple of these errors (referring to different datasets) in rapid succession, then:

2007-11-15T08:18:18.048 -0800 [ 1044195][ 6487] ERROR - org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/thredds].[NetcdfDODS] - Servlet.service() for servlet NetcdfDODS threw exception
java.lang.OutOfMemoryError: Java heap space
Nov 15, 2007 8:18:12 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor processChildren
SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: Java heap space
Nov 15, 2007 8:18:53 AM unknown unknown
SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: Java heap space




So, we aren't certain that the bad files are the cause of the OutOfMemoryErrors, but the correlation is strong. With few or no bad files, THREDDS seems to stay up longer; with more bad files, it seems more likely to crash.

As a possible cause, I'll mention the previous problem with .nc files not being automatically closed by an overridden finalize() method. (See Original Message below.) This caused a memory and resource leak if the program didn't explicitly close every file (even when an Exception or Error was thrown during processing). A similar problem could certainly be the cause of the problem here.

And I'll mention our previous request to have THREDDS simply ignore bad files but keep serving the rest of the dataset. (See Original Message below.) You said you were going to work on that. I don't know whether you made changes, or what changes you made. I'll just point out that we only recently upgraded to a version of THREDDS which might have these changes in it. We had previously been using a somewhat older version because of other problems with the intervening versions.

I hope that is helpful.

Thanks for looking into this.

Sincerely,

Bob Simons
Satellite Data Product Manager
Environmental Research Division
NOAA Southwest Fisheries Science Center
1352 Lighthouse Ave
Pacific Grove, CA 93950-2079
(831)658-3205
address@hidden

The contents of this message are mine personally and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric
Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><


-------- Original Message --------
Subject: finalize
Date: Fri, 18 Aug 2006 12:10:57 -0700
From: Bob Simons <address@hidden>
Organization: NOAA/ERD
To: John Caron <address@hidden>

I had a bug in a new procedure I was writing (my fault, I admit),
wherein the close() method of many NetcdfFile objects was never called.


On Windows, this code ran fine. The Java implementation on Windows
(specifically the garbage collector) must have been smart enough to
recover all the memory allocated and close all the files when the
NetcdfFile objects were no longer referenced.

But on Linux, the procedure failed after many calls, and I got an error
message (from Java? from Tomcat? it doesn't matter) saying "too many
files open".


Fine. I fixed the problem in my code. But I have two requests for you:

1) In the JavaDocs for the NetcdfFile class (up at the top), please
include the same suggestion as you already have in NetcdfFileCache's
JavaDocs regarding the use of "finally" to ensure that close() is
called. E.g.,


    NetcdfFile ncfile = null;
    try {
        ncfile = NetcdfFile.open(fileName);
        ...
    } finally {
        if (ncfile != null) ncfile.close();
    }

2) I think it would be a huge improvement to NetcdfFile (and presumably
NetcdfFileCache, NetcdfDataset, NetcdfDatasetCache, DODSNetcdfFile
(which isn't in the latest copy of the documentation I have), and any
similar object), to include the following method (or something like it):


    protected void finalize() throws Throwable {
        try { close(); } finally { super.finalize(); }
    }

As you probably know, the finalize() method is inherited from the
java.lang.Object class and is called by the garbage collector before an
unreachable object is reclaimed. In this case, overriding finalize() lets
you ensure that the close() method eventually gets called.


In real life, it isn't always possible to use try/finally. That opens up
the possibility of not calling close() if an error occurs. It is
dangerous to have that resource leakage be a possibility.


Using finalize() this way would prevent this type of resource leak (as
dangerous as a memory leak in other languages) from occurring.

I admit again that it was my coding error that led to these issues
coming up. But I think it is good defensive programming to have a
finalize() method for classes like these.

Thank you.
***end -------- Original Message --------




-------- Original Message --------
Subject: Request for enhancement
Date: Wed, 03 Jan 2007 15:23:39 -0800
From: Bob Simons <address@hidden>
Organization: NOAA/ERD
To: John Caron <address@hidden>
CC: Roy Mendelssohn <address@hidden>, Dave Foley <address@hidden>


The Problem:

We often use THREDDS to aggregate near-real-time data. We have one
directory per data set and frequently (hourly or daily) add an
additional data file to each data set's directory. Sometimes things go
wrong (e.g., power outages) and a partial and/or damaged file is left in
a data set's directory. We try to deal with this, but our methods aren't
perfect.


Currently, when THREDDS is aggregating a data set, if THREDDS finds an
invalid .nc data file in with all the .nc data files for a given data
set, it throws an Exception ("java.io.IOException: Cant read [fileName]:
not a valid NetCDF file.") and rejects the entire dataset. This is a
serious problem for us because one bad file makes an entire dataset
disappear from THREDDS. The only way for us to find out if this has
happened is to go through the <tomcat>/logs/catalina.out log file (which
we do periodically) and look for the Exception messages.


The Request:

It would be nice if THREDDS simply rejected the bad files (throwing an
Exception as it does now), but kept going with the aggregation of the
data set.
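The behavior we're asking for might look something like the following sketch. The Opener interface and all names here are hypothetical stand-ins for the real ucar.nc2 calls; the point is just that a per-file catch lets the aggregation continue past a bad file instead of rejecting the whole dataset:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of the requested behavior: try each candidate file, log and skip
// the ones that fail to open, and aggregate the rest.
public class TolerantAggregation {
    // Stand-in for a real call such as NetcdfFile.open(path).
    interface Opener { void open(String path) throws IOException; }

    public static List<String> collectReadable(List<String> candidates, Opener opener) {
        List<String> good = new ArrayList<>();
        for (String path : candidates) {
            try {
                opener.open(path);
                good.add(path);
            } catch (IOException e) {
                // Log the bad file (as THREDDS does now) but keep aggregating.
                System.err.println("Skipping " + path + ": " + e.getMessage());
            }
        }
        return good;
    }
}
```

A zero-byte file would then produce the same "not a valid NetCDF file" message in catalina.out, but the rest of the dataset would stay visible.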


(I realize that the proposed recommendation is good for our situation
and you may not be comfortable making it the standard behavior for THREDDS.)



Thank you. ***end -------- Original Message --------






NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.