[LDM #NGQ-500529]: ldm questions

Paul,

> Here's a simple depiction of the process:
> 
> GFS files downloaded    ->  cf1 ldm  <->  arps ldm  <-> firewall <-> ldad ldm
> and processed w/ wgrib2
> 
> 1. use wgrib2 to create *.grb2 files.
> 2. insert *.grb2 files into cf1 queue
> 3. arps ldm requests *.grb2 files from cf1
> 4. ldad ldm requests *.grb2 files from arps
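
[Aside: step 2 is typically done with the pqinsert(1) utility.  A sketch, 
assuming a typical default queue path and pqinsert's default EXP feedtype 
(the file name here is hypothetical):

    pqinsert -v -q /home/ldm/var/queues/ldm.pq -f EXP gfs_subset.grb2
]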

Thanks for clearing that up.

The files are arriving at Arps in a timely manner but arrive at Ldad with 
increasing delay, while other (presumably smaller?) files are not delayed.  Is 
this correct?

> My initial suspicions were a bottleneck at the firewall. However, it is only 
> these grb2 files that see the delay; other data through the firewall to the 
> ldad ldm doesn't experience this issue. I will admit, however, that the grb2 
> files are written to the arps queue approximately every two minutes. Yet at 
> the one-hour point, when I stop seeing the data come through on the ldad ldm 
> side, I have verified that the files are still in the arps queue. The ldm 
> transactions between cf1 and arps are nearly instantaneous and are never 
> delayed.
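
[Aside: one way to confirm what is in a queue, and when products arrived, is 
the notifyme(1) utility.  A sketch, run on arps against its local LDM (the 
feedtype and pattern are assumptions; adjust them to match the grb2 products):

    notifyme -vl- -h localhost -f EXP -p grb2 -o 3600

The "-o 3600" starts the listing with products from up to an hour in the 
past.]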

Are the other, timely files on a different LDM connection (i.e., due to a 
different REQUEST entry in the LDM configuration-file on Ldad), and do they use 
less bandwidth than the problematic files?  If so, then there might be 
"packet shaping" occurring at the firewall, whereby the firewall 
preferentially throttles high-bandwidth TCP connections.  We've seen this 
before at several institutions, and it can be difficult to 1) get the IT 
department to admit to it and 2) get them to stop.
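
For example, the relevant entries in Ldad's etc/ldmd.conf might look something 
like the following (the feedtypes and hostname are placeholders, not your 
actual configuration):

    # high-volume grb2 connection
    REQUEST EXP ".*" arps.example.edu
    # low-volume connection
    REQUEST IDS|DDPLUS ".*" arps.example.edu

Each REQUEST entry gets its own TCP connection, which is why a firewall can 
treat the two data-streams differently.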

Is the Ldad LDM reporting rtstats(1) statistics?  Are you plotting them?  Do 
the plots show that high-volume connections have large latencies while 
low-volume ones don't?
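
If reporting isn't enabled, it can usually be turned on with an EXEC entry in 
Ldad's etc/ldmd.conf (the hostname shown is Unidata's usual statistics server):

    EXEC "rtstats -h rtstats.unidata.ucar.edu"

The resulting plots can then be viewed on the Unidata rtstats web pages.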

> I did increase the ldad queue size from the default size of 800M to 2G, but 
> the magic number still seems to be 1 hour.

The LDM system has a "maximum latency" threshold, which is one hour by default. 
If a data-product arrives more than this threshold after it was created, then 
it is discarded.  Based on your description, this appears to be happening at 
the Ldad LDM due to the increasing latency.
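
If you want to verify or raise that threshold, the LDM registry can be queried 
and set with the regutil(1) utility (a sketch, assuming an LDM-6 registry; 
values are in seconds):

    regutil /server/max-latency            # print the current threshold
    regutil -s 7200 /server/max-latency    # raise it to two hours
    ldmadmin restart                       # make the change take effect

Keep in mind that raising the threshold only hides the symptom: the products 
will still arrive late.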

[Regarding sending a SIGUSR2 to the pqact(1) process:]

> How do I do this? Set in a conf file then restart ldm or just execute from 
> terminal window?

You can get the process ID of the pqact(1) process via the ps(1) utility or by 
searching the LDM log file for the string "pqact".  Once you have it, you can 
use the kill(1) utility to send it signals.
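
For example (the process ID 12345 is, of course, hypothetical):

    ps -ef | grep pqact    # find the pqact process ID
    kill -USR2 12345       # send SIGUSR2 to that process

There is no need to restart the LDM; the signal takes effect immediately.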

> > Is the product-queue on some sort of RAID?
> 
> No. I also looked on the MIDDS side and the arps queue was 6.5G (default 2G) 
> and the cf1 queue was 1.5G (default 1G).

By "default 2G" do you mean that output from the command "ldmadmin config" on 
Arps shows that the size of the product-queue is set to "2G"?
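
For reference, both the configured size and the actual usage can be checked on 
each host (the queue path shown is a typical default and may differ on your 
systems):

    ldmadmin config                          # prints the configured queue size
    pqmon -q /home/ldm/var/queues/ldm.pq     # reports actual queue usage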

My question about the size of the computer (32- or 64-bit) should have been 
directed at Ldad and not at the MIDDS systems.
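On Ldad, the command "uname -m" will answer it: x86_64 indicates a 64-bit 
system, while i686 or similar indicates 32-bit.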

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: NGQ-500529
Department: Support LDM
Priority: Normal
Status: Closed