survex issues on Chinese Windows

Thu Aug 18 10:31:24 BST 2016

On 2016-08-18 02:44 +0100, Olly Betts wrote:
> On Thu, Aug 18, 2016 at 01:44:50AM +0100, Wookey wrote:
> > On any version of Survex on any OS: if you process a file which has a
> > *begin string which contains UTF-8 Chinese characters, e.g.:
> > *begin 洞
> > 
> > you get:
> > ChineseTest.svx:1:8: error: Character “�” not allowed in station name (use *SET NAMES to set allowed characters)
> > 
> > Which I guess is technically correct, but very unhelpful to Chinese
> > people. Do we do the same thing to anyone who uses accented characters
> > in Latin languages, or Cyrillic characters?
> 
> Yes, it's only ASCII alphanumerics (and nobody's mentioned it as an
> issue previously).

Right. I think that those involved so far have had good enough English
to be able to enter English survey names. The thing that has
caused this to become a live issue is Topodroid. You enter a survey
name in that, and that's the string it uses in the *begin. It's
natural to enter the survey name in Chinese if you are Chinese, and
the Topodroid and Therion part works fine, but Survex doesn't accept
it.
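
To illustrate why the error points at column 8 and prints "�", here is a
minimal Python sketch of a byte-by-byte, ASCII-only name check. It is not
Survex's actual code: the function name first_rejected_byte and the
"letters and digits only" rule are assumptions made for the example.

```python
import string

# Assumption: only ASCII letters and digits are accepted (a simplification).
ALLOWED_BYTES = set((string.ascii_letters + string.digits).encode("ascii"))


def first_rejected_byte(line, prefix="*begin "):
    """Return (1-based byte column, byte value) of the first disallowed
    byte in the name following `prefix`, or None if every byte is allowed."""
    data = line.encode("utf-8")
    start = len(prefix.encode("utf-8"))
    for column, byte in enumerate(data[start:], start=start + 1):
        if byte not in ALLOWED_BYTES:
            return column, byte
    return None


# Show the UTF-8 bytes of the character and where the check stops.
print("洞".encode("utf-8").hex(" "))
print(first_rejected_byte("*begin 洞"))
```

In UTF-8, 洞 is the three bytes e6 b4 9e. A byte-wise check stops at the
first of them, and that byte on its own is not a valid character, which
would explain the "�" in the message.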

> > Is it reasonable to restrict survey names to US-ASCII characters in
> > these days of UTF-8 in a nominally internationalised program?  Perhaps
> > more chars should be added to the list of chars allowed in survey and
> > station names?
> 
> Not being able to use accented forms is probably a fairly minor
> annoyance (as there are standard ways to write most Latin alphabet
> languages with just ASCII), but not being able to use any of your
> alphabet is more of a problem.

> One issue is that currently there's no character encoding info in Survex
> files, so we don't know what those bytes represent at all.  The lack of
> encoding information is already an issue (e.g. for displaying the title
> in Aven), so we probably should add a new "*encoding" command.

Yes, as Therion has had from the start. Using matching syntax would be
good to avoid gratuitous divergence.
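
As a rough sketch of what honouring such a declaration could look like,
the Python below scans the raw bytes for an "*encoding NAME" line and
decodes the file accordingly. The command does not exist in Survex at the
time of writing, so the syntax, the UTF-8 default, and the function name
decode_svx are all hypothetical.

```python
import codecs


def decode_svx(raw, default="utf-8"):
    """Decode file bytes using a hypothetical '*encoding NAME' line.

    Falls back to `default` when no such line is found.
    """
    encoding = default
    for line in raw.splitlines():
        parts = line.strip().split()
        if len(parts) == 2 and parts[0].lower() == b"*encoding":
            name = parts[1].decode("ascii")
            codecs.lookup(name)  # raises LookupError for an unknown encoding
            encoding = name
            break
    return raw.decode(encoding)


# The same survey name stored in GBK and declared as such.
sample = "*encoding gbk\n*begin 洞\n".encode("gbk")
print(decode_svx(sample))
```

The declaration line itself has to be plain ASCII so that it can be found
before the encoding is known.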

> Once we know the encoding we could do a full-Unicode "is
> alphanumeric"

I think that would be good, and in fact anything that
favours a particular character set is somewhat imperialist.

> (or just treat non-ASCII values as valid in names perhaps).

Anything except the separator and control characters is arguably
acceptable, but keeping smileys and weird 'lookalike' characters out
of survey names is probably a good thing, so "is alphanumeric" seems
like a good test (it may not exclude the weird chars, I suppose). Do all
Unicode codepoints have a flag to this effect?
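
For what it's worth, every Unicode code point does carry a General
Category property, and letters (L*) and numbers (N*) can be picked out
from it. A minimal Python sketch of such a test follows; the helper name
is_name_char is made up for the example and is not anything Survex
provides.

```python
import unicodedata


def is_name_char(ch):
    """True if the code point's General Category is a letter (L*) or a
    number (N*)."""
    return unicodedata.category(ch)[0] in ("L", "N")


# Chinese, accented Latin, Cyrillic, a digit, an emoji, and the Cyrillic
# letter U+0430, which looks like a Latin "a".
for ch in ["洞", "é", "д", "7", "😀", "\u0430"]:
    print(repr(ch), unicodedata.category(ch), is_name_char(ch))
```

This keeps the emoji out, but the Cyrillic lookalike passes because it is
a genuine letter. Combining marks (category M*) are also rejected here, so
accented letters written in decomposed form would need normalising first.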

Wookey
-- 
Principal hats:  Linaro, Debian, Wookware, ARM
http://wookware.org/