
Re: 20211026: Re: [conduit] 20210830: Re: High CONDUIT latencies from vm-lnx-conduit2.ncep.noaa.gov



Hi all,

Yes, now that we've had four GFS run cycles since the queue upgrade, things continue to look great! Fingers crossed that we've returned to the reliable CONDUIT service that has benefited the community for so many years.

Thanks much!

--Kevin

_________________________________________________
Kevin Tyle, M.S.; Manager of Departmental Computing
NSF XSEDE Campus Champion
Dept. of Atmospheric & Environmental Sciences
UAlbany ETEC Bldg - Harriman Campus
1220 Washington Ave, Room 419
Albany, NY 12226
address@hidden | 518-442-4578 | @nywxguy | he/him/his                           
_________________________________________________

From: Pete Pokrandt <address@hidden>
Sent: Tuesday, January 11, 2022 12:51 AM
To: Jesse Marks - NOAA Affiliate <address@hidden>; Alicia Bentley - NOAA Federal <address@hidden>
Cc: Tyle, Kevin R <address@hidden>; address@hidden <address@hidden>; _NWS NCEP NCO Dataflow <address@hidden>; address@hidden <address@hidden>
Subject: Re: 20211026: Re: [conduit] 20210830: Re: High CONDUIT latencies from vm-lnx-conduit2.ncep.noaa.gov
 
Agreed, latencies are MUCH MUCH better - these recent ones look more like what we used to see before the GFS upgrade cranked up the data volume and latencies.

Many thanks to all involved in this fix, much appreciated!

Pete




-----
Pete Pokrandt - Systems Programmer
UW-Madison Dept of Atmospheric and Oceanic Sciences
608-262-3086  - address@hidden



From: Jesse Marks - NOAA Affiliate <address@hidden>
Sent: Monday, January 10, 2022 6:01 PM
To: Alicia Bentley - NOAA Federal <address@hidden>
Cc: Pete Pokrandt <address@hidden>; Tyle, Kevin R <address@hidden>; address@hidden <address@hidden>; _NWS NCEP NCO Dataflow <address@hidden>; address@hidden <address@hidden>
Subject: Re: 20211026: Re: [conduit] 20210830: Re: High CONDUIT latencies from vm-lnx-conduit2.ncep.noaa.gov
 
I think it's safe to say the queue expansion has solved the latency problem:

[Attachment: image.png (CONDUIT latency plot)]


Happy to have this one fixed for you all!  Thanks for being patient with us while we got a solution in place.

Thanks,
Jesse


On Mon, Jan 10, 2022 at 11:52 AM Alicia Bentley - NOAA Federal <address@hidden> wrote:
Hi Jesse, 

Thank you very much for the update! Looking forward to the 18Z cycle!

Take care, 
Alicia
___________________________________
Alicia M. Bentley, Ph.D.
Physical Scientist | Model Evaluation Group
NOAA/NWS/NCEP/EMC (VPPPG Branch)
Webpage: Model Evaluation Group (MEG)


On Mon, Jan 10, 2022 at 11:29 AM Jesse Marks - NOAA Affiliate <address@hidden> wrote:
All went smoothly with today's work!  The disk expansion and queue growth were done during the 12Z delivery, so it won't be the best test case.  The 18Z delivery should provide us with a much better picture of the improvement this work has provided.

Thanks,
Jesse

On Tue, Jan 4, 2022 at 3:31 PM Pete Pokrandt <address@hidden> wrote:
Hooray! Thanks for the update, hope the work goes smoothly.

Pete

-----
Pete Pokrandt - Systems Programmer
UW-Madison Dept of Atmospheric and Oceanic Sciences
608-262-3086  - address@hidden



From: Jesse Marks - NOAA Affiliate <address@hidden>
Sent: Tuesday, January 4, 2022 2:26 PM
To: Pete Pokrandt <address@hidden>
Cc: Tyle, Kevin R <address@hidden>; address@hidden <address@hidden>; _NWS NCEP NCO Dataflow <address@hidden>; address@hidden <address@hidden>; Alicia Bentley - NOAA Federal <address@hidden>
Subject: Re: 20211026: Re: [conduit] 20210830: Re: High CONDUIT latencies from vm-lnx-conduit2.ncep.noaa.gov
 
Hi Everyone,

We have work scheduled for Monday, 1/10 to grow the disk space where CONDUIT ldm.pq resides.  I will follow up with you all once the work is complete.

Thanks,
Jesse

On Wed, Dec 1, 2021 at 2:49 PM Jesse Marks - NOAA Affiliate <address@hidden> wrote:
Hi Pete,

Apologies for the delayed response here. 

We currently have a ticket open to grow the disk space where the CONDUIT ldm.pq files reside.  This would allow us to expand the size of the LDM queue and hopefully resolve this issue.  If we fail to get meaningful traction there, we are interested in exploring the metered approach you described, or even scaling back the size of the GFS delivery.
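
For reference, once the larger filesystem is in place, the queue resize itself is just a registry change plus a queue rebuild.  A minimal sketch, assuming the standard ldmadmin/regutil tooling (the size shown is purely illustrative):

    ldmadmin stop                   # stop the LDM before touching the queue
    regutil -s 20000M /queue/size   # set the new queue size in the LDM registry (size illustrative)
    ldmadmin delqueue               # remove the old ldm.pq
    ldmadmin mkqueue                # create a new ldm.pq at the new size
    ldmadmin start                  # bring the LDM back up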

We will keep you apprised of our efforts on disk growth and get back to you soon with an update.

Thank you,
Jesse

On Tue, Oct 26, 2021 at 4:44 PM 'Pete Pokrandt' via _NWS NCEP NCO Dataflow <address@hidden> wrote:
Tom,

Is it feasible to build in a bit of delay to the ingest of the GFS forecast hours on the root CONDUIT machines? Something like a 'sleep 10' after each forecast hour becomes available for ingest, before it's pqinserted (or whatever insert method NCEP uses), might be enough to reduce the peaks in data volume to the point where the current queues can handle them.

Doing so would introduce an increasing delay between when the data is available on the NCEP server and when we get it through CONDUIT, up to ~20 minutes by the time we got to the final forecast hours, but in my opinion it's better to get a complete data set a little later than an incomplete data set more quickly.
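
Just to make the idea concrete, here's a rough sketch of the kind of metering I'm picturing (the file names, queue path, and use of pqinsert are all guesses on my part - I don't know what the actual insert mechanism on the NCEP side looks like):

    #!/bin/sh
    # Hypothetical metering sketch - paths and file names are illustrative.
    # Insert each 0.25 degree GFS forecast hour file, then pause briefly so
    # the peak insertion rate into the queue is spread out over time.
    for f in gfs.t00z.pgrb2.0p25.f???; do
        pqinsert -v -f CONDUIT -q /data/ldm/queues/ldm.pq "$f"
        sleep 10
    done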

I'm not sure how much reducing the frequency of 0.25 deg GFS forecast hours is going to help. I think there is value in the 3h forecast hour frequency out to maybe 120 hours, or even 168, but I could personally live without 3h 0.25 degree forecast hours for the later forecasts, where the value/accuracy of the deterministic forecast drops off significantly - but cutting those hours isn't necessarily going to help.

An example - in last night's 00 UTC (26 October) run of the GFS, we started getting incomplete 1.0 degree forecast hours beginning with the 72h forecast. Other 1 degree forecast hours from that run that were incomplete on my end (just visually looking at the file sizes) were 81, 90, 93, 102, 105, 111, 129, 132, 141, 153, 156, and 162.

For the 0.25 degree data, I only save up through 87h in individual files, but the 66, 72, 78, and 87h forecast hours were incomplete.

So the main times where the backlog typically occurs, at least for the 00 UTC run, are between the ingest of, say, the 60 and 180h forecasts. I'm not sure what other data is being ingested at the same time - from a recent ldmadmin watch on the CONDUIT feed it looks like maybe some NDFD products. Also the 03 UTC RUC2 (RAP?) comes through around the same time as the 60-63h GFS forecasts, through maybe 117 to 120h.

And we still don't know why one of the root conduit ingest machines has small latencies, while the other has latencies that consistently rise to ~1000 seconds during each GFS peak. 

Sorry, just rambling and trying to brainstorm..

Pete






-----
Pete Pokrandt - Systems Programmer
UW-Madison Dept of Atmospheric and Oceanic Sciences
608-262-3086  - address@hidden



From: Tom Yoksas <address@hidden>
Sent: Tuesday, October 26, 2021 2:02 PM
To: Tyle, Kevin R <address@hidden>; Pete Pokrandt <address@hidden>
Cc: address@hidden <address@hidden>; _NWS NCEP NCO Dataflow <address@hidden>; address@hidden <address@hidden>; Alicia Bentley - NOAA Federal <address@hidden>
Subject: 20211026: Re: [conduit] 20210830: Re: High CONDUIT latencies from vm-lnx-conduit2.ncep.noaa.gov
 
Hi Kevin, Pete, et al.,

On 10/26/21 12:05, Tyle, Kevin R wrote:
> Hi everyone, over the past several days we’ve noticed another big
> degradation of GFS data receipt via CONDUIT. Pete, can you confirm the
> same on your end?

We can confirm that we are seeing the same high latencies as you
are.  Here is the CONDUIT latency plot for the machine that we use
to REQUEST CONDUIT from NCEP:

https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+conduit.unidata.ucar.edu

During the somewhat recently held LDM virtual training workshop, we
had the opportunity to work with some of the folks who maintain
the top level CONDUIT source machines at NCEP, and we believe that
we identified why some products are not making it into the CONDUIT
feed - there is not enough memory (RAM) on the virtual machines to
allow the LDM queue size to be increased enough for all of the data
to be sent during volume peaks.  This issue was discussed during one
of the recently held virtual User Committee meetings, and we were
advised that it may not be possible to upgrade/redo the virtual
machines until sometime in 2022.  Our NOAA contact was going to see
if something could be done earlier than this, but we have not heard
whether or not this will be possible.
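
The telltale sign of a too-small queue is the minimum product
residence time dropping below the time downstream sites need to
catch up, at which point products are overwritten before they can be
sent.  A minimal sketch of how that can be checked on an LDM host
(standard LDM tooling; exact output fields vary by LDM version):

    pqmon                     # report product-queue usage statistics,
                              # including the age of the oldest product
    pqmon -q /path/to/ldm.pq  # same, for an explicitly named queue file
    ldmadmin check            # logs a warning if the queue's minimum
                              # residence time has become too small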

If the memory cannot be increased on the NCEP CONDUIT source
machines, the option on the table is to reduce the CONDUIT
content.  Since the volume in CONDUIT is _heavily_ dominated by
GFS 0.25 degree data, it would seem that is what would need to be
reduced, possibly by including fewer forecast hours.
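
For sites that want to protect themselves in the meantime, the amount
of GFS 0.25 degree data pulled can also be trimmed on the receiving
side by narrowing the CONDUIT REQUEST pattern.  A minimal ldmd.conf
sketch, with an illustrative upstream host and a pattern that should
be checked against the actual product IDs before use:

    # Illustrative only: request just the GFS 0.25 degree forecast hours
    # f000 through f120; other CONDUIT content would be requested with
    # separate REQUEST lines and patterns.
    REQUEST CONDUIT "pgrb2\.0p25\.f(0[0-9][0-9]|1[01][0-9]|120)" idd.unidata.ucar.edu

That only changes what an individual downstream host pulls, of course;
the queue-size limitation on the source machines still has to be
addressed at NCEP.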

We are not happy with the current situation (a massive understatement),
but we are not in a position to effect the change(s) that are needed
to return CONDUIT to full functionality.

Cheers,

Tom

> _________________________________________________
>
> *From:* conduit <address@hidden> *On Behalf Of *Pete
> Pokrandt via conduit
> *Sent:* Tuesday, August 31, 2021 12:47 PM
> *To:* Jesse Marks - NOAA Affiliate <address@hidden>; address@hidden
> *Cc:* _NWS NCEP NCO Dataflow <address@hidden>; Anne Myckow - NOAA
> Federal <address@hidden>; address@hidden;
> address@hidden
> *Subject:* Re: [conduit] 20210830: Re: High CONDUIT latencies from
> vm-lnx-conduit2.ncep.noaa.gov
>
> Thanks for the update, Jesse. I can confirm that we are seeing smaller
> lags originating from conduit2, and since yesterday's 18 UTC run, I
> don't think we have missed any data here at UW-Madison.
>
> Kevin Tyle, how's your reception been at Albany since the 18 UTC run
> yesterday?
>
> Pete
>
> -----
> Pete Pokrandt - Systems Programmer
> UW-Madison Dept of Atmospheric and Oceanic Sciences
> 608-262-3086  - address@hidden
>
> ------------------------------------------------------------------------
>
> *From:* Jesse Marks - NOAA Affiliate <address@hidden>
> *Sent:* Tuesday, August 31, 2021 10:26 AM
> *To:* address@hidden <address@hidden>
> *Cc:* Pete Pokrandt <address@hidden>; Anne Myckow - NOAA Federal
> <address@hidden>; address@hidden <address@hidden>;
> address@hidden <address@hidden>; _NWS NCEP NCO Dataflow <address@hidden>
> *Subject:* Re: 20210830: Re: High CONDUIT latencies from
> vm-lnx-conduit2.ncep.noaa.gov
>
> Thanks for the quick reply, Tom.  Looking through our conduit2 logs, we
> began seeing sends of product from our conduit2 to conduit1 machine
> after we restarted the LDM server on conduit2 yesterday.  It appears
> latencies improved fairly significantly at that time:
>
> However we still do not see direct sends from conduit2 to external
> LDMs.  Our server team is currently looking into the TCP service issue
> that appears to be causing this problem.
>
> Thanks,
>
> Jesse
>
> On Mon, Aug 30, 2021 at 7:49 PM Tom Yoksas <address@hidden> wrote:
>
>     Hi Jesse,
>
>     On 8/30/21 5:16 PM, Jesse Marks - NOAA Affiliate wrote:
>      > Quick question:  how are you computing these latencies?
>
>     Latency in the LDM/IDD context is the time difference between when a
>     product is first put into an LDM queue for redistribution and the time
>     it is received by a downstream machine.  This measure of latency, of
>     course, requires that the clocks on the originating and receiving
>     machines be maintained accurately.
>
>     re:
>      > More
>      > specifically, how do you determine which conduit machine the data is
>      > coming from?
>
>     The machine on which the product is inserted into the LDM queue is
>     available in the LDM transaction.  We provide a website where users
>     can create graphs of things like feed latencies:
>
>     Unidata HomePage
>     https://www.unidata.ucar.edu
>
>         IDD Operational Status
>     https://rtstats.unidata.ucar.edu/rtstats/
>
>           Real-time IDD Statistics -> Statistics by Host
>     https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/siteindex
>
>     The variety of measures of feed quality for the Unidata machine that
>     is REQUESTing the CONDUIT feed from the NCEP cluster can be found in:
>
>     https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/siteindex?conduit.unidata.ucar.edu
>
>     The latencies being reported by the Unidata machine that is being fed
>     from the NCEP cluster are:
>
>     CONDUIT latencies on conduit.conduit.unidata.ucar.edu:
>
>     https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+conduit.unidata.ucar.edu
>
>     As you can see, the traces are color coded, and the label at the
>     top identifies the source machines for products.
>
>     re:
>      > The reason I ask is because I am not seeing any sends of
>      > product from conduit2 in the last several days of logs both to
>      > our local conduit1 machine and to any distant end users.
>
>     Hmm...  we are.
>
>     re:
>      > Also, we have isolated what is likely the issue and will have our
>      > team take a closer look in the morning.  I'm hopeful they'll be able
>      > to resolve this soon.
>
>     Excellent!  We are hopeful that the source of the high latencies will
>     be identified and fixed.
>
>     Cheers,
>
>     Tom
>
>      > On Mon, Aug 30, 2021 at 5:24 PM Anne Myckow - NOAA Federal
>      > <address@hidden> wrote:
>      >
>      >     Pete,
>      >
>      >     Random aside, can you please update your doco to say that
>      >     Dataflow's email list is now address@hidden? I'm CC'ing it here.
>      >     That other email address is going to get turned off within the
>      >     next year.
>      >
>      >     Thanks,
>      >     Anne
>      >
>      >     On Wed, Aug 18, 2021 at 4:02 PM Pete Pokrandt
>      >     <address@hidden> wrote:
>      >
>      >         Dear Anne, Dustin and all,
>      >
>      >         Recently we have noticed fairly high latencies on the CONDUIT
>      >         ldm data feed originating from the machine
>      >         vm-lnx-conduit2.ncep.noaa.gov. The feed originating
>      >         from vm-lnx-conduit1.ncep.noaa.gov does not have the high
>      >         latencies. Unidata and other top level feeds are seeing
>      >         similar high latencies from vm-lnx-conduit2.ncep.noaa.gov.
>      >
>      >         Here are some graphs showing the latencies that I'm seeing:
>      >
>      >         From
>      >         https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+idd-agg.aos.wisc.edu -
>      >         latencies for CONDUIT data arriving at our UW-Madison AOS
>      >         ingest machine
>      >
>      >
>      >
>      >         From
>      >         https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/siteindex?conduit.unidata.ucar.edu
>      >         (latencies at Unidata)
>      >
>      >
>      >
>      >         At least here at UW-Madison, these latencies are causing
>      >         us to lose some data during the large GFS/GEFS periods.
>      >
>      >         Any idea what might be causing this?
>      >
>      >         Pete
>      >
>      >
>      >
>      >
>      >         -----
>      >         Pete Pokrandt - Systems Programmer
>      >         UW-Madison Dept of Atmospheric and Oceanic Sciences
>      >         608-262-3086  - address@hidden
>      >
>      >
>      >
>      >     --
>      >     Anne Myckow
>      >     Dataflow Team Lead
>      >     NWS/NCEP/NCO
>      >
>      >
>      >
>      > --
>      > Jesse Marks
>      > Dataflow Analyst
>      > NCEP Central Operations
>      > 678-896-9420
>
>     --
>     +----------------------------------------------------------------------+
>     * Tom Yoksas                                      UCAR Unidata Program *
>     * (303) 497-8642 (last resort)                           P.O. Box 3000 *
>     * address@hidden                                    Boulder, CO 80307 *
>     * Unidata WWW Service                     http://www.unidata.ucar.edu/ *
>     +----------------------------------------------------------------------+
>
>
>
> --
>
> Jesse Marks
>
> Dataflow Analyst
>
> NCEP Central Operations
>
> 678-896-9420
>

--
+----------------------------------------------------------------------+
* Tom Yoksas                                      UCAR Unidata Program *
* (303) 497-8642 (last resort)                           P.O. Box 3000 *
* address@hidden                                    Boulder, CO 80307 *
* Unidata WWW Service                     http://www.unidata.ucar.edu/ *
+----------------------------------------------------------------------+


--
Jesse Marks
Dataflow Analyst - NCEP Central Operations
NOAA Affiliate - AceInfo Solutions
240-736-8262


--
Jesse Marks
Dataflow Analyst - NCEP Central Operations
NOAA Affiliate - AceInfo Solutions
240-736-8262


--
Jesse Marks
Dataflow Analyst - NCEP Central Operations
NOAA Affiliate - AceInfo Solutions
240-736-8262


--
Jesse Marks
Dataflow Analyst - NCEP Central Operations
NOAA Affiliate - AceInfo Solutions
240-736-8262