to see which character encoding a file on a linux box appears to use, try the
following (file guesses the encoding from the file's content):
file -bi filename
example using a tornado warning ingested via noaaport:
file -bi 2013021922.TOR
application/octet-stream; charset=binary
for comparison, here is the encoding reported for a crontab file i created
on a linux box:
file -bi crontab.ldm
text/plain; charset=us-ascii
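
A rough Python sketch of the same kind of check, for anyone who wants to do it inside an ingest script. The function name guess_charset and its control-byte rule are assumptions made for illustration; this is not the algorithm file itself uses, it only borrows the labels file prints.

```python
def guess_charset(data: bytes) -> str:
    """Very rough charset guess for a byte string (hypothetical helper)."""
    if not data:
        return "empty"
    # Assumption: control bytes other than TAB, LF, FF, CR and ESC mean "binary".
    text_controls = {0x09, 0x0A, 0x0C, 0x0D, 0x1B}
    if any(b < 0x20 and b not in text_controls for b in data):
        return "binary"
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # High bytes that are not valid UTF-8, e.g. a stray 0x92.
        return "unknown-8bit"


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(path, guess_charset(fh.read()))
```

Reading the file in binary mode matters here: the point is to look at the bytes before any decoding has been applied.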
it is quite possible that for us, on the receiving end, the encoding used at
the source does not matter since we change that encoding while we ingest via
noaaport or ldm, but that is just a guess.
cheers,
--patrick
--------------------
Patrick L. Francis
VP Media Logic Group
http://www.medialogicgroup.com
http://www.hamweather.com
http://www.alertsbroadcaster.com
http://www.modelweather.com
FB: http://www.facebook.com/wxprofessor
--
-----Original Message-----
From: ldm-users-bounces@xxxxxxxxxxxxxxxx
[mailto:ldm-users-bounces@xxxxxxxxxxxxxxxx] On Behalf Of daryl herzmann
Sent: Wednesday, February 20, 2013 10:22 AM
To: ldm-users@xxxxxxxxxxxxxxxx
Subject: [ldm-users] What is the \x92 character in PNSWSH?
Hi LDM Users,
So a long term annoyance / curiosity continues to get the best of me.
I figured I would spam the very smart folks on the ldm-users list and hope
somebody could educate me. Attached you will find a PNS statement from WSH
that came down our lovely IDD feed on LDM today.
Within the file, you will find the following character, and here's some
python code showing where it's at :)
>>> a = open('PNSWSH.txt').read()
>>> a.find("\x92")
2191
>>> a[2180:2200]
'ts from FAA\x92s \r\r\nTDW'
So it should have been an apostrophe, but it instead appears to be the Windows
CP1252 encoding for "RIGHT SINGLE QUOTATION MARK"?
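
A minimal sketch to check that reading of the byte, using the 20 bytes shown in the transcript above. It only shows that 0x92 maps to that character under CP1252; it does not prove the product was actually written in CP1252.

```python
import unicodedata

# The bytes printed by the slice in the transcript above.
raw = b"ts from FAA\x92s \r\r\nTDW"

# Decode just the suspicious byte as Windows CP1252.
ch = raw[11:12].decode("cp1252")
print(hex(ord(ch)), unicodedata.name(ch))

# Strict ASCII and UTF-8 decoders both reject the same byte.
for codec in ("ascii", "utf-8"):
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        print(codec, "fails at byte offset", exc.start)
```

The first print should report 0x2019, RIGHT SINGLE QUOTATION MARK, which is consistent with a word processor's "smart quote" replacing a plain apostrophe.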
After much gnashing of teeth and conversations with NWS TOC folks, this
appears to be some issue with products generated in a Word Processor getting
saved to a text file without US-ASCII encoding being set during the process,
so it defaults to some windows encoding? Or it is some copy/paste issue.
The jury never did return a verdict on this and my support ticket with the
TOC was closed, oye.
I asked Unidata and they did not know. So does anybody here know what
character encoding is used for text data that come down the IDD?
If you are still reading this, you are probably wondering two things:
1) why I care.
2) if I have a life.
Well, this is an important problem when saving these products to a database.
See, databases can be sticklers about character encoding and often do not
accept "garbage in".
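
One possible workaround before the database insert, sketched under the assumption that stray high bytes in these products are CP1252 (unconfirmed, as noted above). The helper name to_ascii_text and the replacement table are made up for illustration.

```python
# Hypothetical mapping from common "smart" punctuation to plain ASCII.
SMART_PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (byte 0x92 in CP1252)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def to_ascii_text(raw: bytes) -> str:
    """Decode product bytes as CP1252 (an assumption) and reduce to ASCII."""
    # Bytes that CP1252 leaves undefined become U+FFFD instead of raising.
    text = raw.decode("cp1252", errors="replace")
    text = text.translate(SMART_PUNCTUATION)
    # Anything still outside ASCII is replaced with "?".
    return text.encode("ascii", errors="replace").decode("ascii")


print(to_ascii_text(b"ts from FAA\x92s \r\r\nTDW"))
```

This trades fidelity for safety: the database gets clean ASCII, but any character without an entry in the table is lost as a question mark, so keeping the original bytes somewhere is probably wise.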
If you check out the NWS website, the apostrophe is gone!
http://www.srh.noaa.gov/productview2.php?pil=pnswsh&max=51
http://www.nws.noaa.gov/view/national.php?prod=pns&sid=wsh
There is a large and vast conspiracy afoot!
daryl
--
/**
* Daryl Herzmann
* Assistant Scientist -- Iowa Environmental Mesonet
* http://mesonet.agron.iastate.edu
*/