survex issues on Chinese Windows
Wookey
wookey at wookware.org
Thu Aug 18 10:31:24 BST 2016
On 2016-08-18 02:44 +0100, Olly Betts wrote:
> On Thu, Aug 18, 2016 at 01:44:50AM +0100, Wookey wrote:
> > On any version of survex on any OS: If you process a file which has a
> > *begin string which contains UTF-8 Chinese characters, e.g.:
> > *begin 洞
> >
> > you get:
> > ChineseTest.svx:1:8: error: Character “�” not allowed in station name (use *SET NAMES to set allowed characters)
> >
> > Which I guess is technically correct, but very unhelpful to Chinese
> > people. Do we do the same thing to anyone who uses accented characters
> > in Latin languages, or Cyrillic characters?
>
> Yes, it's only ASCII alphanumerics (and nobody's mentioned it as an
> issue previously).
Right. I think that so far those involved have had good enough English
to put English survey names in. The thing that has caused this to
become a live issue is Topodroid. You enter a survey name in that, and
that's the string it uses in the *begin. It's natural to enter the
survey name in Chinese if you are Chinese, and the Topodroid and
Therion part works fine, but Survex doesn't accept it.
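To illustrate where the error above comes from, here is a minimal
sketch. It is a simplified model, not Survex's actual parser:
first_bad_byte is a hypothetical helper, and it ignores any extra
characters allowed via *SET NAMES. It checks the line one byte at a time
against ASCII alphanumerics, so it trips on the first byte of the UTF-8
sequence for 洞, which matches the reported column 8.

```python
name = "洞"
raw = name.encode("utf-8")
# The single character becomes three bytes in UTF-8.
print(" ".join(f"{b:02x}" for b in raw))  # e6 b4 9e


def first_bad_byte(line: bytes, start: int):
    """Return (1-based column, byte value) of the first byte at or after
    `start` that is not an ASCII letter or digit, or None if all pass.

    Hypothetical helper: a byte-wise model of an ASCII-only name check.
    """
    for i in range(start, len(line)):
        b = line[i]
        ok = (0x30 <= b <= 0x39) or (0x41 <= b <= 0x5A) or (0x61 <= b <= 0x7A)
        if not ok:
            return i + 1, b
    return None


line = "*begin 洞".encode("utf-8")
result = first_bad_byte(line, len(b"*begin "))
print(result)  # (8, 230), i.e. column 8, byte 0xE6

# A lone lead byte is not valid UTF-8 on its own, so a terminal shows it
# as the replacement character, as in the error message.
lone = bytes([result[1]]).decode("utf-8", errors="replace")
print(lone == "\ufffd")  # True
```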
> > Is it reasonable to restrict survey names to US-ASCII characters in
> > these days of UTF-8 in a nominally internationalised program? Perhaps
> > more chars should be added to the list of chars allowed in survey and
> > station names?
>
> Not being able to use accented forms is probably a fairly minor
> annoyance (as there are standard ways to write most Latin alphabet
> languages with just ASCII), but not being able to use any of your
> alphabet is more of a problem.
> One issue is that currently there's no character encoding info in Survex
> files, so we don't know what those bytes represent at all. The lack of
> encoding information is already an issue (e.g. for displaying the title
> in Aven), so we probably should add a new "*encoding" command.
Yes, as Therion has had from the start. Using matching syntax would be
good to avoid gratuitous divergence.
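As a rough sketch of what such a command could enable: Therion files
declare their encoding with a line such as "encoding utf-8". The
"*encoding NAME" first-line syntax below is purely an assumption for
illustration (no such Survex command exists at the time of writing), and
decode_svx is a hypothetical helper, not part of Survex.

```python
import codecs


def decode_svx(data: bytes, default: str = "utf-8") -> str:
    """Decode raw file bytes, honouring a leading '*encoding NAME' line.

    Assumed syntax: the directive, if present, is the first line of the
    file. Falls back to `default` when there is no directive.
    """
    first_line = data.split(b"\n", 1)[0].strip()
    encoding = default
    parts = first_line.split()
    if len(parts) == 2 and parts[0].lower() == b"*encoding":
        name = parts[1].decode("ascii", errors="replace")
        codecs.lookup(name)  # raises LookupError for an unknown encoding
        encoding = name
    return data.decode(encoding)


# The same survey name stored in two different encodings.
gb_file = "*encoding gb18030\n*begin 洞\n".encode("gb18030")
utf8_file = "*begin 洞\n".encode("utf-8")

print("*begin 洞" in decode_svx(gb_file))    # True
print("*begin 洞" in decode_svx(utf8_file))  # True
```

Without the declaration the two byte sequences for 洞 differ, which is
why the bytes alone do not say what they represent.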
> Once we know the encoding we could do a full-Unicode "is alphanumeric"
I think that would be good, and in fact anything that
favours a particular character set is somewhat imperialist.
> (or just treat non-ASCII values as valid in names perhaps).
Anything except the separator and control characters is arguably
acceptable, but keeping smileys and weird 'lookalike' characters out
of survey names is probably a good thing, so "is alphanumeric" seems
like a good test (though it may not exclude the weird chars, I suppose).
Do all Unicode codepoints have a flag to this effect?
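On that last question: every Unicode code point carries a
General_Category property, with letters in the L* categories and numbers
in N*. A minimal sketch of an "is alphanumeric" test built on that,
using Python's standard unicodedata module; is_name_char and
is_valid_name are hypothetical names, and this is not how Survex
currently validates names:

```python
import unicodedata


def is_name_char(ch: str) -> bool:
    """True if ch is a Unicode letter or number (General_Category L* or N*)."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_valid_name(name: str) -> bool:
    """True if name is non-empty and every character passes is_name_char."""
    return bool(name) and all(is_name_char(c) for c in name)


print(unicodedata.category("洞"))   # Lo (Letter, other)
print(is_valid_name("洞"))          # True
print(is_valid_name("cave1"))       # True
print(is_valid_name("\U0001F600"))  # False: the smiley is category So
print(is_valid_name("a.b"))         # False: "." is punctuation
print(is_valid_name("\u0430"))      # True: Cyrillic small a, a Latin "a" lookalike
```

So the test keeps out smileys, punctuation and control characters, but
lookalike letters still pass, because they are genuine letters in their
own scripts.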
Wookey
--
Principal hats: Linaro, Debian, Wookware, ARM
http://wookware.org/