Re: 20050420: Unidata decoders - syn2nc bug re Unicode

NOTE: The decoders mailing list is no longer active. The list archives are made available for historical reasons.
To: "Hanson, Kurt" <khanson@xxxxxxx>
Subject: Re: 20050420: Unidata decoders - syn2nc bug re Unicode
From: Robb Kambic <rkambic@xxxxxxxxxxxxxxxx>
Date: Wed, 25 May 2005 10:18:35 -0600 (MDT)
Kurt,

thanks for the complete bug report, great job.  the "no encoding" pragma
was inserted into the perl decoders, but it is commented out because
versions of perl earlier than 5.8.0 fail with "Can't locate encoding.pm".
the release due tomorrow includes a note about uncommenting the
"no encoding" pragma.
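
for anyone deciding whether to uncomment the pragma on a given host, here
is a minimal sketch of a check for whether that perl can locate
encoding.pm at all. the helper name module_available is made up for
illustration and is not part of the decoders package; it only looks
through @INC and does not load anything.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a module file (e.g. "encoding.pm") exists
# in one of the plain directory entries of @INC. Nothing is loaded.
sub module_available {
    my ($file) = @_;
    for my $dir (@INC) {
        next if ref $dir;              # skip code-ref hooks in @INC
        return 1 if -f "$dir/$file";
    }
    return 0;
}

print "perl $]: encoding.pm ",
    (module_available('encoding.pm') ? "found" : "not found"), "\n";
```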

thanks,
robb...



On Wed, 20 Apr 2005, Unidata Support wrote:

>
> ------- Forwarded Message
>
> >To: <support@xxxxxxxxxxxxxxxx>
> >From: "Hanson, Kurt" <khanson@xxxxxxx>
> >Subject: Unidata decoders - syn2nc bug re Unicode
> >Organization: UCAR/Unidata
> >Keywords: 200504201720.j3KHK5v2016989
>
> I imagine this will end up in Robb Kambic's inbox... if so, hello again
> Robb.
>
> We've been occasionally experiencing issues with syn2nc. The problem is
> that once every week or two, the syn2nc log will suddenly begin filling
> with messages about Unicode:
>
> Malformed UTF-8 character (unexpected continuation byte 0x8e, with no
> preceding start byte) in index at /dicast2-papp/DICAST/tmp/syn2nc_308
> line 618, <STDIN> chunk 1.
> Malformed UTF-8 character (unexpected non-continuation byte 0x2a,
> immediately after start byte 0xf6) in index at
> /dicast2-papp/DICAST/tmp/syn2nc_308 line 618, <STDIN> chunk 1.
>
> The log file grows without bound until finally the disk partition fills,
> hobbling the entire system.
>
> I think I understand the problem and have a fix. The problem appears to
> be due to some garbage characters in several of the synoptic messages
> from today for a single site -- FBSK in Botswana. (I imagine that all of
> the problems we've ever seen are from this site.)
>
> The issue is that since 5.8.0, Perl has some automatic support for
> handling Unicode characters. Once Perl sees a character outside of the
> range [0,127], it assumes that the text data is Unicode rather than
> ASCII. Since the garbage characters from today's FBSK data did not
> conform to Unicode rules, Perl itself (rather than syn2nc) generated the
> messages.
>
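
To illustrate why these particular bytes trip the warning, here is a
small sketch that checks byte strings for well-formed UTF-8 using the
built-in utf8::decode (available since 5.8). The sample lines are made
up; only the byte values 0x8e and 0xf6 0x2a are taken from the log
messages above. It is an illustration of the encoding rules, not a
description of what syn2nc does internally.

```perl
use strict;
use warnings;

# Return true if the byte string is well-formed UTF-8 (pure ASCII included).
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;            # utf8::decode modifies its argument in place
    return utf8::decode($copy) ? 1 : 0;
}

# Hypothetical sample lines; only the 0x8e and 0xf6 0x2a bytes come from the log.
my @samples = (
    "AAXX 20121",           # plain ASCII
    "AAXX \x8e 20121",      # lone continuation byte, no start byte before it
    "AAXX \xf6\x2a 20121",  # start byte followed by a non-continuation byte
);

for my $line (@samples) {
    printf "%-4s %s\n", (is_valid_utf8($line) ? "ok" : "bad"),
        join(' ', map { sprintf '%02x', ord } split //, $line);
}
```
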
> So the magic fix I installed is to put a "no encoding;" line (pragma)
> into the syn2nc script. This ensures that Perl doesn't try to guess what
> sort of character set the text is in -- it just passes the data up to
> the application level in raw form. That's what we need with syn2nc.
>
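
A related sketch, using a different technique from the pragma described
above: the core binmode function asks for raw bytes on a single handle
and does not need encoding.pm, so it also runs on perls older than
5.8.0. Whether it is a full substitute for "no encoding;" inside syn2nc
is an assumption that has not been tested against the decoder. The
function name read_raw_lines and the feed contents are made up for
illustration, and an in-memory handle stands in for STDIN.

```perl
use strict;
use warnings;

# Read every line from a handle as raw bytes, with no encoding layer applied.
sub read_raw_lines {
    my ($fh) = @_;
    binmode($fh);                 # raw bytes; core function, no encoding.pm needed
    my @lines = <$fh>;
    chomp @lines;
    return @lines;
}

# Stand-in for the synoptic feed on STDIN (hypothetical content).
my $feed = "AAXX 20121\nFBSK \x8e\xf6\x2a 11/// =\n";
open my $in, '<', \$feed or die "cannot open in-memory handle: $!";
my @lines = read_raw_lines($in);
close $in;

# index() now works byte-wise on the raw data.
my $pos = index($lines[1], "\x8e");
print "0x8e found at byte offset $pos\n";
```

In the real script the call would be binmode(STDIN) before the first read.
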
> Scope:
> * We experience this problem on a Linux RedHat Enterprise 3.0 Athlon
> system running Perl 5.8.0.
> * We do not experience it on a Solaris 8 system running Perl 5.8.0.
>
> Testing:
> When I pipe the attached synoptic file
> synoptic.20050420.1200.asc.FBSK_210 into the pristine syn2nc on the
> Linux system, the log file grows without bound. When I pipe it into my
> patched version, the log file size remains stable, and the file never
> gets any Unicode error messages.
>
> Discussion:
> Thinking beyond the low-level Perl issue, I'm not sure what syn2nc
> should do when it encounters the garbage characters... Nor am I sure
> what it actually does -- I'd dig into the script itself to find that out
> but I'm running short on time. What do you think?
>
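
One possible policy, offered only as a sketch and not as something
syn2nc is known to do: set aside any report that contains a byte outside
the 7-bit ASCII range before decoding it. The function name
partition_reports and the sample reports are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical policy: split reports into clean ones and ones that
# contain any byte outside the 7-bit ASCII range.
sub partition_reports {
    my (@reports) = @_;
    my (@clean, @suspect);
    for my $report (@reports) {
        if ($report =~ /[^\x00-\x7f]/) {
            push @suspect, $report;
        }
        else {
            push @clean, $report;
        }
    }
    return (\@clean, \@suspect);
}

my ($clean, $suspect) = partition_reports(
    "AAXX 20121 68032 11/// =",     # made-up clean report
    "AAXX 20121 FBSK \x8e\xf6* =",  # made-up report with garbage bytes
);
printf "%d clean, %d suspect\n", scalar @$clean, scalar @$suspect;
```
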
> Also, I'd be curious to hear whether you see the garbage characters in
> your FBSK synoptics for today. It's possible but unlikely that the
> garbage is not due to the FBSK sensor itself but due to some
> communications issue that is WSI-specific.
>
> I'm attaching a few things:
> * syn2nc.new -- an updated version of syn2nc from the 3.0.9 version of
> the decoders package.
> * syn2nc.patch -- a diff of my version vs the pristine 3.0.9
> * synoptic.20050420.1200.asc.FBSK_210 -- message #210 from today's
> synoptic feed, containing garbage characters 0x8e and others in line 6
> of the file.
>
> Relevant Perl references:
> * Unicode intro: http://perldoc.perl.org/perluniintro.html
> * encoding pragma: http://perldoc.perl.org/encoding.html
>
> Whew. I think that's about everything! Feel free to contact me.
>
> Kurt Hanson
> Senior Software Engineer & Scientific Analyst
> WSI Corporation
> 400 Minuteman Rd.
> Andover, MA 01810
> my phone: 978.983.6549
> www.wsi.com
>

==============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
rkambic@xxxxxxxxxxxxxxxx                   WWW: http://www.unidata.ucar.edu/
==============================================================================