
Re: bad files causing thredds crash



Hi Bob:

1) What heap memory settings are you giving to the JVM on tomcat startup? (to verify, you can look at http://server/thredds/debug and choose "showRuntimeInfo", assuming you have remote debugging enabled)

2) What is the hardware, OS, and JVM version?

3) how big is /thredds/Hfradar/aggreghfradarC1/u33h/catalog.xml  ?

4) Remind me of the URL of your server so I can see how big some of these datasets are.

I'm guessing there is a "memory leak" (objects not getting GC'd), but I need to eliminate the obvious first.

Re: the info in the log, eg:

2007-11-28T04:55:27.959 -0800 [ 633762099][ 1094643]

= date [msecs since server start] [request number]

So this server has been up about 7 days and has served > 1 million requests (!)
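For what it's worth, that prefix is easy to pull apart mechanically. Here is a small sketch (the class and method names are mine, not part of the TDS) that extracts the two bracketed numbers from a threddsServlet.log line:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the log-prefix layout described above:
//   date [msecs since server start] [request number]
public class LogPrefix {
    private static final Pattern PREFIX =
        Pattern.compile("\\[\\s*(\\d+)\\]\\[\\s*(\\d+)\\]");

    // Returns {uptimeMillis, requestNumber}, or null if the prefix is absent.
    public static long[] parse(String line) {
        Matcher m = PREFIX.matcher(line);
        if (!m.find()) return null;
        return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
    }

    public static void main(String[] args) {
        long[] p = parse("2007-11-28T04:55:27.959 -0800 [ 633762099][ 1094643]");
        // 633762099 ms of uptime is roughly 7.3 days
        System.out.println("uptime days = " + p[0] / 86400000.0 + ", request # " + p[1]);
    }
}
```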

I need to investigate what the TDS does with bad files, especially if you replace them. I'll get back to you ASAP.

Bob Simons wrote:
OK. THREDDS crashed this morning; attached are the catalina.out and threddsServlet.log files from that time period.

I hope that first bracketed number on each line of threddsServlet.log isn't memoryInUse. Is it?


A possibly related question about bad files:
Suppose there is a bad data file (e.g., 0 bytes) which THREDDS notices and we subsequently delete. If we then replace that file, does THREDDS notice that it is a new file (because the cacheAged cache includes information about the file's date and notices that the date has changed)? Or do we need to delete the relevant cacheAged file whenever we delete a data file?


Thanks for looking into this.

John Caron wrote:
Too bad. If/when it happens again, grab it. I will try to replicate here.

(You are checking all files of the form threddsServlet.log.N, N=1..10?)


Bob Simons wrote:
Sorry. It has been too long. That information is no longer available.

John Caron wrote:
Sorry to be vague; in content/thredds/logs/threddsServlet.log there should be at least one more message, possibly two.

Bob Simons wrote:
That was the only message with "[    6343]".


John Caron wrote:
Hi Bob:

Can you look in the log file for all messages from request 6343:

2007-11-15T08:17:57.883 -0800 [ 1024030][ 6343] ERROR - thredds.server.opendap.NcDODSServlet - Error
java.lang.OutOfMemoryError: Java heap space
These will be in order, but may be interleaved with other requests.

I'd like to see what size the original request was.


Bob Simons wrote:
We're having a lot of problems with thredds crashing lately.
We're hoping you can look into the problem or give us advice.
We recently updated to THREDDS Data Server Version 3.16.22 Build Date = 2007-10-25 19:10:28.


The general scenario is:
1) One or more invalid .nc files are added to the data directories.
  Obviously, this isn't intentional.
  And we just found the source of many of these: a bad NIC card.
  But given the huge number of files involved, it has always
  occurred periodically.
  Often, it is a file with 0 bytes.

2) During aggregation, thredds notices the bad file and reports
  it to catalina.out. For example:

2007-11-15T09:17:45.277 -0800 [ 1446340][ 3538] ERROR - thredds.server.opendap.NcDODSServlet - Error
java.io.IOException: Cant read /u00/satellite/AK/sstd/hday/AK2007272_014200h_sstd.nc: not a valid NetCDF file.
        at ucar.nc2.NetcdfFile.open(NetcdfFile.java:459)
        at ucar.nc2.NetcdfFile.open(NetcdfFile.java:261)
        at ucar.nc2.dataset.NetcdfDataset.openFile(NetcdfDataset.java:341)
        at ucar.nc2.ncml.Aggregation$Dataset$PolymorphicReader.open(Aggregation.java:1286)
        at ucar.nc2.NetcdfFileCache.acquire(NetcdfFileCache.java:188)
        at ucar.nc2.ncml.Aggregation$Dataset._acquireFile(Aggregation.java:1186)
        at ucar.nc2.ncml.Aggregation$Dataset.acquireFile(Aggregation.java:1172)
        at ucar.nc2.ncml.Aggregation$Dataset.getNcoords(Aggregation.java:1100)
        at ucar.nc2.ncml.Aggregation$Dataset.setStartEnd(Aggregation.java:1121)
        at ucar.nc2.ncml.Aggregation$Dataset.access$100(Aggregation.java:1002)
        at ucar.nc2.ncml.Aggregation.buildCoords(Aggregation.java:315)
        at ucar.nc2.ncml.AggregationExisting.buildDataset(AggregationExisting.java:61)
        at ucar.nc2.ncml.Aggregation.finish(Aggregation.java:293)
        at ucar.nc2.ncml.NcMLReader.readNetcdf(NcMLReader.java:397)

3) Sometime afterwards
(not immediately, but it seems to be sooner if there are more bad files),
THREDDS crashes. For example:


2007-11-15T08:17:57.883 -0800 [ 1024030][ 6343] ERROR - thredds.server.opendap.NcDODSServlet - Error
java.lang.OutOfMemoryError: Java heap space
DODServlet ERROR (anyExceptionHandler): java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
ReqState:
serverClassName: 'thredds.server.opendap.NcDODSServlet'
dataSet: 'satellite/CM/usfc/hday'
requestSuffix: 'dods'
CE: 'altitude,lat,lon,time'
compressOK: false
InitParameters:
DebugOn: ''
maxNetcdfFilesCached: '10'

java.lang.OutOfMemoryError: Java heap space
There may be a couple of these errors (referring to different datasets) in rapid succession, then:

2007-11-15T08:18:18.048 -0800 [ 1044195][ 6487] ERROR - org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/thredds].[NetcdfDODS] - Servlet.service() for servlet NetcdfDODS threw exception
java.lang.OutOfMemoryError: Java heap space
Nov 15, 2007 8:18:12 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor processChildren
SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: Java heap space
Nov 15, 2007 8:18:53 AM unknown unknown
SEVERE: Exception invoking periodic operation:
java.lang.OutOfMemoryError: Java heap space




So, we aren't certain that the bad files are the cause of the OutOfMemoryErrors, but the correlation is strong. With few or no bad files, THREDDS seems to stay up longer; with more bad files, it seems more likely to crash.

As a possible cause, I'll mention the previous problem with .nc files not being automatically closed by an overridden finalize() method. (See Original Message below.) This caused a memory and resource leak if the program didn't explicitly close every file (even when an Exception or Error was thrown during processing). A similar problem could certainly be the cause of the problem here.

And I'll mention our previous request to have THREDDS simply ignore bad files but keep serving the rest of the dataset. (See Original Message below.) You said you were going to work on that. I don't know whether you made changes, or what changes you made. I'll just point out that we only recently upgraded to a version of THREDDS which might have these changes in it. We had previously been using a somewhat older version because of other problems with the intervening versions.

I hope that is helpful.

Thanks for looking into this.

Sincerely,

Bob Simons
Satellite Data Product Manager
Environmental Research Division
NOAA Southwest Fisheries Science Center
1352 Lighthouse Ave
Pacific Grove, CA 93950-2079
(831)658-3205
address@hidden

The contents of this message are mine personally and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric
Administration.
<>< <>< <>< <>< <>< <>< <>< <>< <><


-------- Original Message --------
Subject: finalize
Date: Fri, 18 Aug 2006 12:10:57 -0700
From: Bob Simons <address@hidden>
Organization: NOAA/ERD
To: John Caron <address@hidden>

I had a bug in a new procedure I was writing (my fault, I admit),
wherein the close() method of many NetcdfFile objects was never called.


On Windows, this code ran fine. The Java implementation on Windows
(specifically the garbage collector) must have been smart enough to
recover all the memory allocated and close all the files when the
NetcdfFile objects were no longer referenced.

But on Linux, the procedure failed after many calls, and I got an error
message (from Java? from Tomcat? it doesn't matter) saying "too many
files open".


Fine. I fixed the problem in my code. But I have two requests for you:

1) In the JavaDocs for the NetcdfFile class (up at the top), please
include the same suggestion as you already have in NetcdfFileCache's
JavaDocs regarding the use of "finally" to ensure that close() is
called. E.g.,


    NetcdfFile ncfile = null;
    try {
        ncfile = NetcdfFile.open(fileName);
        ...
    } finally {
        if (ncfile != null) ncfile.close();
    }

2) I think it would be a huge improvement to NetcdfFile (and presumably
NetcdfFileCache, NetcdfDataset, NetcdfDatasetCache, DODSNetcdfFile
(which isn't in the latest copy of the documentation I have), and any
similar object), to include the following method (or something like it):


    protected void finalize() throws Throwable {
        try { close(); } finally { super.finalize(); }
    }

As you probably know, the finalize() method is inherited from the
java.lang.Object class and is called by the garbage collector before an
unreachable object is reclaimed. In this case, overriding finalize() lets
you ensure that the close() method eventually gets called.


In real life, it isn't always possible to use try/finally. That opens up
the possibility of not calling close() if an error occurs. It is
dangerous to have that resource leakage be a possibility.


Using finalize() this way would prevent this type of resource leak (as
dangerous as a memory leak in other languages) from occurring.

I admit again that it was my coding error that led to these issues
coming up. But I think it is good defensive programming to have a
finalize() method for classes like these.

Thank you.
***end -------- Original Message --------




-------- Original Message --------
Subject: Request for enhancement
Date: Wed, 03 Jan 2007 15:23:39 -0800
From: Bob Simons <address@hidden>
Organization: NOAA/ERD
To: John Caron <address@hidden>
CC: Roy Mendelssohn <address@hidden>, Dave Foley <address@hidden>


The Problem:

We often use THREDDS to aggregate near-real-time data. We have one
directory per data set and frequently (hourly or daily) add an
additional data file to each data set's directory. Sometimes things go
wrong (e.g., power outages) and a partial and/or damaged file is left in
a data set's directory. We try to deal with this, but our methods aren't
perfect.


Currently, when THREDDS is aggregating a data set, if THREDDS finds an
invalid .nc data file in with all the .nc data files for a given data
set, it throws an Exception ("java.io.IOException: Cant read [fileName]:
not a valid NetCDF file.") and rejects the entire dataset. This is a
serious problem for us because one bad file makes an entire dataset
disappear from THREDDS. The only way for us to find out if this has
happened is to go through the <tomcat>/logs/catalina.out log file (which
we do periodically) and look for the Exception messages.


The Request:

It would be nice if THREDDS simply rejected the bad files (throwing an
Exception as it does now), but kept going with the aggregation of the
data set.
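The behavior we're asking for might look something like the following sketch. The Opener interface and all names here are hypothetical stand-ins for the real ucar.nc2 calls; the point is just that a per-file catch lets the aggregation continue past a bad file instead of rejecting the whole dataset:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of the requested behavior: try each candidate file, log and skip
// the ones that fail to open, and aggregate the rest.
public class TolerantAggregation {
    // Stand-in for a real call such as NetcdfFile.open(path).
    interface Opener { void open(String path) throws IOException; }

    public static List<String> collectReadable(List<String> candidates, Opener opener) {
        List<String> good = new ArrayList<>();
        for (String path : candidates) {
            try {
                opener.open(path);
                good.add(path);
            } catch (IOException e) {
                // Log the bad file (as THREDDS does now) but keep aggregating.
                System.err.println("Skipping " + path + ": " + e.getMessage());
            }
        }
        return good;
    }
}
```

A zero-byte file would then produce the same "not a valid NetCDF file" message in catalina.out, but the rest of the dataset would stay visible.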


(I realize that the proposed recommendation is good for our situation
and you may not be comfortable making it the standard behavior for THREDDS.)



Thank you. ***end -------- Original Message --------






NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.