* [gentoo-user] [OT] Need advice from people who use non-ascii all day long
@ 2009-12-03 19:20 felix
2009-12-03 19:50 ` Renat Golubchyk
` (3 more replies)
0 siblings, 4 replies; 24+ messages in thread
From: felix @ 2009-12-03 19:20 UTC
To: gentoo-user
I have a project which requires normalizing names, and by that, I mean
converting to lower case etc, whatever eliminates redundancies. I
know Unicode has a different "normalize" meaning, but for my purposes,
that has already been done. Maybe I should call it standardization or
make up a new cromulent word.
By which I really mean I am confused by a lot of advice I have gotten
from USAians who get by with the good old 7 bit ASCII character set on
a daily basis, whether it be written in Unicode or not.
One of the puzzles to me is all the accented chars. Umlauts, etc. I
am not trying to convert names for permanent purposes but for internal
comparison. In Germany there is a district "Busingen", with an umlauted
'u'. Is it reasonable to consider it the same word whether with or
without the umlauted u? French has the cedilla and acute and grave
accents. Spanish has the tilde n. Scandinavian languages (all?
some?) have the o with a slash.
Or put another way, I don't know much about German, French, Spanish,
etc keyboards. Do your keyboards have any of the extra keys, all of
them? Are German keyboards and French and Spanish keyboards as
restricted to their own languages as US keyboards are? If you have to
hit two or three keys to keep the umlauts, accents, and tildes, do you
get lazy sometimes and type the base character by itself? Is it even
considered the base character, or is it considered lazy and sloppy,
much as I get complaints about typing "thru" because "through" is too
much trouble?
I need something the equivalent of the C function strcasecmp() which
not only ignores case, but all other differences without distinction,
whatever they may be. If leaving off umlauts horrifies academics and
purists but is what people do in the real world, I want to take that
into consideration, so that if one person uses the umlaut and another
doesn't, it won't generate two separate entries. But if leaving off
the umlaut or accent yields a distinct place name, then I can't do that --
but if real world people do that and live with the confusion, then I
guess I have to make a different choice.
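As a rough illustration of the kind of comparison being asked for, here is a minimal Python sketch. It is not from the thread: `fold_key` and `same_name` are hypothetical names, and the sketch assumes that ignoring combining accents is acceptable at all, which is exactly the open question. Letters such as "ø" have no Unicode decomposition, so they pass through unchanged.

```python
import unicodedata


def fold_key(name: str) -> str:
    """Return a comparison key that ignores case and combining accents."""
    # Decompose so that e.g. "ü" becomes "u" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a stronger lower(); among other things it maps "ß" to "ss".
    return stripped.casefold()


def same_name(a: str, b: str) -> bool:
    """strcasecmp()-like equality test that also ignores accents."""
    return fold_key(a) == fold_key(b)
```

With this key, "Büsingen" and "BUSINGEN" compare equal, while "Munster" and "Muenster" do not.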
Yes, I am something of an ignorant American. I know some Japanese,
French, and Spanish, but not the details of everyday usage. I'd like
to learn.
--
... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
Felix Finch: scarecrow repairman & rocket surgeon / felix@crowfix.com
GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933
I've found a solution to Fermat's Last Theorem but I see I've run out of room o
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 19:20 [gentoo-user] [OT] Need advice from people who use non-ascii all day long felix
@ 2009-12-03 19:50 ` Renat Golubchyk
2009-12-03 20:07 ` felix
2009-12-03 22:07 ` Volker Armin Hemmann
2009-12-03 22:38 ` Arttu V.
` (2 subsequent siblings)
3 siblings, 2 replies; 24+ messages in thread
From: Renat Golubchyk @ 2009-12-03 19:50 UTC
To: gentoo-user
Hi!
On Thu, 3 Dec 2009 11:20:03 -0800
felix@crowfix.com wrote:
> In Germany there is a district "Busingen", with an umlauted 'u'. Is it
> reasonable to consider it the same word whether with or without the
> umlauted u?
No. For many words it would be ok, but not for all. For example,
"drucken" means "to print", "drücken" (with an umlaut) means "to
press". In German you can replace an umlaut with the combination "base
letter + e", i.e. ü --> ue, ö --> oe, and likewise ß --> ss. But there
are words in which the combination "oe" does not stand for "ö". So it's
not straightforward, especially with names. Those may have a rather odd
spelling for historical reasons.
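The replacement rule described above can be sketched in a few lines of Python. This is only an illustration: `GERMAN_ASCII` and `transliterate_german` are made-up names, and the capitalised replacements ("Ae", "Oe", "Ue") are an assumption about how upper-case umlauts should be written.

```python
import unicodedata

# Conventional ASCII replacements for German umlauts and sharp s.
GERMAN_ASCII = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def transliterate_german(text: str) -> str:
    """Replace umlauts and sharp s with their usual ASCII spellings."""
    # Compose first so that "u" + combining diaeresis is seen as one "ü".
    composed = unicodedata.normalize("NFC", text)
    return "".join(GERMAN_ASCII.get(ch, ch) for ch in composed)
```

The mapping only works in one direction. "drücken" becomes "druecken", but a word that already contains "oe" or "ue" cannot safely be turned back, because the pair may be part of the original spelling.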
> Or put another way, I don't know much about German, French, Spanish,
> etc keyboards. Do your keyboards have any of the extra keys, all of
> them? Are German keyboards and French and Spanish keyboards as
> restricted to their own languages as US keyboards are? If you have to
> hit two or three keys to keep the umlauts, accents, and tildes, do you
> get lazy sometimes and type the base character by itself? Is it even
> considered the base character, or is it considered lazy and sloppy,
> much as I get complaints about typing "thru" because "through" is too
> much trouble?
German keyboards have keys for all umlauts and 'ß'. You can google for
pictures of different keyboard layouts.
> I need something the equivalent of the C function strcasecmp() which
> not only ignores case, but all other differences without distinction,
> whatever they may be.
I'd suggest you use a Unicode library. BTW, what about Cyrillic
letters or other alphabets? Those may have nothing to do with ASCII. Or
is your project restricted to Latin letters?
Cheers,
Renat
--
Probleme kann man niemals mit derselben Denkweise loesen,
durch die sie entstanden sind.
(Einstein)
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 19:50 ` Renat Golubchyk
@ 2009-12-03 20:07 ` felix
2009-12-03 20:29 ` Renat Golubchyk
` (2 more replies)
2009-12-03 22:07 ` Volker Armin Hemmann
1 sibling, 3 replies; 24+ messages in thread
From: felix @ 2009-12-03 20:07 UTC
To: gentoo-user
On Thu, Dec 03, 2009 at 08:50:08PM +0100, Renat Golubchyk wrote:
> I'd suggest you use a Unicode library. BTW, what about Cyrillic
> letters or other alphabets? Those may have nothing to do with ASCII. Or
> is your project restricted to Latin letters?
The data is already in normalized Unicode. My problem is eliminating
errors from near misses :-( Cyrillic doesn't look like the same
problem -- no accents that I can see. Chinese, Japanese, etc, same as
far as I know. Arabic has lots of tricks on combining letters and
leaving out vowels, so it is probably an entirely different problem.
One thing I did not make clear is that this is for place names only,
like cities and whatever the equivalent of a US state or Canadian
province is, such as Busingen.
So do people type in Busingen different ways depending on how they
feel, do some people always leave off the umlaut, do some always use
it? My biggest annoyance is that a lot of the Google results come
from Americans full of theory about languages they only know from the
W3C recommendations. Maybe email or real documents follow proper
usage much more closely than addresses on a web form, but I don't care
about them. Maybe web forms in Germany, where they want a district,
do as many web sites do in English and have a menu of possible
districts, in which case no one types in umlauts anyway :-)
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 20:07 ` felix
@ 2009-12-03 20:29 ` Renat Golubchyk
2009-12-03 22:32 ` Francisco Ares
2009-12-04 0:03 ` Volker Armin Hemmann
2009-12-04 9:17 ` Patrick Holthaus
2 siblings, 1 reply; 24+ messages in thread
From: Renat Golubchyk @ 2009-12-03 20:29 UTC
To: gentoo-user
On Thu, 3 Dec 2009 12:07:26 -0800
felix@crowfix.com wrote:
> So do people type in Busingen different ways depending on how they
> feel, do some people always leave off the umlaut, do some always use
> it?
If you want to leave off the umlaut you have to be absolutely sure that
no other place exists with the spelling without the umlaut. Otherwise
you may have collisions sometimes.
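One way to test for such collisions before deciding to drop umlauts is to group the known place names by their accent-stripped form and look for groups with more than one spelling. The Python sketch below is only an illustration under that assumption; `strip_marks` and `find_collisions` are hypothetical names and the sample names are example input, not project data.

```python
import unicodedata
from collections import defaultdict


def strip_marks(name: str) -> str:
    """Case-fold a name and remove combining accents."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()


def find_collisions(names):
    """Map each stripped key to the distinct spellings that collapse onto it."""
    groups = defaultdict(set)
    for name in names:
        # Compare spellings case-insensitively but keep their accents.
        spelling = unicodedata.normalize("NFC", name).casefold()
        groups[strip_marks(name)].add(spelling)
    # Only keys reached by more than one spelling are collisions.
    return {key: sorted(v) for key, v in groups.items() if len(v) > 1}
```

An empty result means that, for the given list, dropping accents merges nothing that was distinct before.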
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 19:50 ` Renat Golubchyk
2009-12-03 20:07 ` felix
@ 2009-12-03 22:07 ` Volker Armin Hemmann
2009-12-03 22:14 ` Alan McKinnon
1 sibling, 1 reply; 24+ messages in thread
From: Volker Armin Hemmann @ 2009-12-03 22:07 UTC
To: gentoo-user
On Donnerstag 03 Dezember 2009, Renat Golubchyk wrote:
> Hi!
>
> On Thu, 3 Dec 2009 11:20:03 -0800
>
> felix@crowfix.com wrote:
> > In Germany there is a district "Busingen", with an umlauted 'u'. Is it
> > reasonable to consider it the same word whether with or without the
> > umlauted u?
>
> No. For many words it would be ok, but not for all. For example,
> "drucken" means "to print", "drücken" (with an umlaut) means "to
> press". In German you can replace an umlaut with the combination "base
> letter + e", i.e. ü --> ue, ö --> oe, and likewise ß --> ss. But there
> are words in which the combination "oe" does not stand for "ö". So it's
> not straightforward, especially with names. Those may have a rather odd
> spelling for historical reasons.
and it is hilarious to see american media fuck that up almost every time ...
;)
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 22:07 ` Volker Armin Hemmann
@ 2009-12-03 22:14 ` Alan McKinnon
0 siblings, 0 replies; 24+ messages in thread
From: Alan McKinnon @ 2009-12-03 22:14 UTC
To: gentoo-user
On Friday 04 December 2009 00:07:33 Volker Armin Hemmann wrote:
> On Donnerstag 03 Dezember 2009, Renat Golubchyk wrote:
> > Hi!
> >
> > On Thu, 3 Dec 2009 11:20:03 -0800
> >
> > felix@crowfix.com wrote:
> > > In Germany there is a district "Busingen", with an umlauted 'u'. Is it
> > > reasonable to consider it the same word whether with or without the
> > > umlauted u?
> >
> > No. For many words it would be ok, but not for all. For example,
> > "drucken" means "to print", "drücken" (with an umlaut) means "to
> > press". In German you can replace an umlaut with the combination "base
> > letter + e", i.e. ü --> ue, ö --> oe, and likewise ß --> ss. But there
> > are words in which the combination "oe" does not stand for "ö". So it's
> > not straightforward, especially with names. Those may have a rather odd
> > spelling for historical reasons.
>
> and it is hilarious to see american media fuck that up almost every time
> ... ;)
>
What's even funnier is hearing news readers on the South African public
broadcaster try to pronounce regular *English* words...
--
alan dot mckinnon at gmail dot com
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 20:29 ` Renat Golubchyk
@ 2009-12-03 22:32 ` Francisco Ares
2009-12-03 22:54 ` felix
0 siblings, 1 reply; 24+ messages in thread
From: Francisco Ares @ 2009-12-03 22:32 UTC
To: gentoo-user
On Thu, Dec 3, 2009 at 6:29 PM, Renat Golubchyk <ragermany@gmx.net> wrote:
> On Thu, 3 Dec 2009 12:07:26 -0800
> felix@crowfix.com wrote:
> > So do people type in Busingen different ways depending on how they
> > feel, do some people always leave off the umlaut, do some always use
> > it?
>
> If you want to leave off the umlaut you have to be absolutely sure that
> there exists no other place with the spelling without umlaut. Otherwise
> you may have collisions sometimes.
>
>
What about a set of dictionaries? And also a library for mistyped word
search?
Francisco
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 19:20 [gentoo-user] [OT] Need advice from people who use non-ascii all day long felix
2009-12-03 19:50 ` Renat Golubchyk
@ 2009-12-03 22:38 ` Arttu V.
2009-12-03 22:57 ` felix
2009-12-06 1:58 ` daid kahl
2009-12-15 16:05 ` J. Roeleveld
3 siblings, 1 reply; 24+ messages in thread
From: Arttu V. @ 2009-12-03 22:38 UTC
To: gentoo-user
On 12/3/09, felix@crowfix.com <felix@crowfix.com> wrote:
> I have a project which requires normalizing names, and by that, I mean
> converting to lower case etc, whatever eliminates redundancies.
I assume you have already removed the language problem from the
equation? I.e., the fact that København, Copenhague, Kööpenhamina and
Copenhagen all mean the same place, just in different European
languages (Danish, Spanish, Finnish and English, in that order).
If you have input in multiple languages then it is not just about
umlauts or no umlauts ...
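If multilingual input does turn up, no character-level folding will connect these spellings; some kind of alias table is needed. A minimal Python sketch of that idea follows. The table layout and the names `_RAW_ALIASES`, `ALIASES` and `canonical_place` are assumptions made for illustration, and choosing the Danish form as the canonical one is arbitrary.

```python
import unicodedata


def _key(name: str) -> str:
    """Normalise a spelling for case-insensitive lookup."""
    return unicodedata.normalize("NFC", name).strip().casefold()


# Canonical name -> spellings seen in other languages (sample data only).
_RAW_ALIASES = {
    "København": ["København", "Copenhague", "Kööpenhamina", "Copenhagen"],
}

# Flatten into a lookup table keyed by the normalised spelling.
ALIASES = {
    _key(alias): canonical
    for canonical, aliases in _RAW_ALIASES.items()
    for alias in aliases
}


def canonical_place(name: str):
    """Return the canonical name for a known spelling, or None if unknown."""
    return ALIASES.get(_key(name))
```

Unknown spellings return `None` rather than a guess, so they can be routed to manual review.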
--
Arttu V.
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 22:32 ` Francisco Ares
@ 2009-12-03 22:54 ` felix
0 siblings, 0 replies; 24+ messages in thread
From: felix @ 2009-12-03 22:54 UTC
To: gentoo-user
On Thu, Dec 03, 2009 at 08:32:45PM -0200, Francisco Ares wrote:
> What about a set of dictionaries? And also a library for mistyped word
> search?
Way too much effort for this. Nice idea, might even be fun, but it's
just trying to avoid the common things, and I mainly wondered about
how often people whose keyboards have accents etc skip them if they
have the chance and what the repercussions would be.
We have people already who enter Brooklyn for their city instead of
New York, which might even be technically correct, but it doesn't
match anything else and we don't correct it because it doesn't happen
often enough. There may be people in Louisiana who use the original
French spelling for all I know; we don't handle that either. Or ditto
for Spanish names in the southwest that might have a tilde.
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 22:38 ` Arttu V.
@ 2009-12-03 22:57 ` felix
0 siblings, 0 replies; 24+ messages in thread
From: felix @ 2009-12-03 22:57 UTC
To: gentoo-user
On Fri, Dec 04, 2009 at 12:38:34AM +0200, Arttu V. wrote:
> I assume you have already removed the language problem from the
> equation? I.e., the fact that København, Copenhague, Kööpenhamina and
> Copenhagen all mean the same place, just in different European
> languages (Danish, Spanish, Finnish and English, in that order).
We're trying to go by the name in the native language, but that might
not be possible, in which case I guess we'll have to get all the
possible translations. It certainly is messy.
> If you have input in multiple languages then it is not just about
> umlauts or no umlauts ...
I've certainly learned a bit of that ... so far we only have to deal
with country codes (no problem) and districts (can't be toooo many, I
hope).
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 20:07 ` felix
2009-12-03 20:29 ` Renat Golubchyk
@ 2009-12-04 0:03 ` Volker Armin Hemmann
2009-12-04 0:18 ` Alan McKinnon
2009-12-04 0:31 ` felix
2009-12-04 9:17 ` Patrick Holthaus
2 siblings, 2 replies; 24+ messages in thread
From: Volker Armin Hemmann @ 2009-12-04 0:03 UTC
To: gentoo-user
look at my name, ok?
Just dropping the Umlaut is wrong. No if, but, maybe. It is wrong. Error.
Mistake. Fail. If you can not enter ä, ö or ü, you must transform them to ae,
oe or ue.
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-04 0:03 ` Volker Armin Hemmann
@ 2009-12-04 0:18 ` Alan McKinnon
2009-12-04 0:31 ` felix
1 sibling, 0 replies; 24+ messages in thread
From: Alan McKinnon @ 2009-12-04 0:18 UTC
To: gentoo-user; +Cc: Volker Armin Hemmann
On Friday 04 December 2009 02:03:23 Volker Armin Hemmann wrote:
> look at my name, ok?
>
> Just dropping the Umlaut is wrong. No if, but, maybe. It is wrong. Error.
> Mistake. Fail. If you can not enter ä, ö or ü, you must transform them to
> ae, oe or ue.
>
Your name shows here in 7-bit ASCII:
Volker Armin Hemmann <volkerarmin@googlemail.com>
typo?
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-04 0:03 ` Volker Armin Hemmann
2009-12-04 0:18 ` Alan McKinnon
@ 2009-12-04 0:31 ` felix
2009-12-04 13:42 ` Volker Armin Hemmann
1 sibling, 1 reply; 24+ messages in thread
From: felix @ 2009-12-04 0:31 UTC
To: gentoo-user
On Fri, Dec 04, 2009 at 01:03:23AM +0100, Volker Armin Hemmann wrote:
> look at my name, ok?
>
> Just dropping the Umlaut is wrong. No if, but, maybe. It is wrong. Error.
> Mistake. Fail. If you can not enter ä, ö or ü, you must transform them to ae,
> oe or ue.
I'd like to find a program which would do that! Seriously. But
anyway, the purpose of this is not to transform names so our antique
ASCII-7 computers can store them, but to eliminate redundant records.
For instance, we get data from vendors for all cities and states,
geolocation data, which has its own redundancies, such as both FORT
WORTH and FT WORTH, or SAINT LOUIS and ST LOUIS. But we have to
convert to upper case, get rid of punctuation, get rid of extra white
space, etc, and all that is independent of the locale. I want to do
the same for Unicode. If enough Europeans are in the habit of taking
shortcuts and skipping umlauts and accents and cedillas and tildes,
then I'd like to standardize the data for lookup. This has nothing to
do with converting people's names for storage. We don't even store
the transformed place name.
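The locale-independent steps listed above (upper case, no punctuation, collapsed white space, one spelling for FORT/FT and SAINT/ST) could look roughly like this in Python. It is a sketch under assumptions, not the project's actual code: `lookup_key` and `ABBREVIATIONS` are hypothetical names, and punctuation is replaced by a space rather than deleted so that "Ft.Worth" still splits into two words.

```python
import re
import unicodedata

# Abbreviations expanded to one spelling (sample entries only).
ABBREVIATIONS = {"FT": "FORT", "ST": "SAINT"}


def lookup_key(name: str) -> str:
    """Build a throwaway key for matching place names; never stored."""
    text = unicodedata.normalize("NFC", name).upper()
    # Turn punctuation and underscores into spaces; \w is Unicode-aware.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # split() with no argument also collapses runs of white space.
    words = text.split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)
```

Both "Ft. Worth" and "FORT WORTH" map to the key "FORT WORTH", and accented letters are kept as they are.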
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 20:07 ` felix
2009-12-03 20:29 ` Renat Golubchyk
2009-12-04 0:03 ` Volker Armin Hemmann
@ 2009-12-04 9:17 ` Patrick Holthaus
2009-12-04 9:55 ` felix
2 siblings, 1 reply; 24+ messages in thread
From: Patrick Holthaus @ 2009-12-04 9:17 UTC
To: gentoo-user
Hey!
> So do people type in Busingen different ways depending on how they
> feel, do some people always leave off the umlaut, do some always use
> it?
You cannot simply leave the umlaut out, since it is considered a separate
letter in its own right. You cannot choose whether to write an "ö" or an "o". Like
Renat said, there are words that completely change their meaning when
the characters are exchanged.
I think this is especially true when it comes to names. While people get used to
spellings like "Goettingen" for Göttingen, it looks odd and wrong to Germans.
Like someone who doesn't have the character on the keyboard ;)
Also keep in mind that there are cities that are spelled with "oe" or "ae" by
design. (Soest, Oelde, Aerzen, Oestinghausen etc.) Those cannot be spelled
with an "ö" instead. It would simply be wrong.
Patrick
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-04 9:17 ` Patrick Holthaus
@ 2009-12-04 9:55 ` felix
0 siblings, 0 replies; 24+ messages in thread
From: felix @ 2009-12-04 9:55 UTC
To: gentoo-user
On Fri, Dec 04, 2009 at 10:17:30AM +0100, Patrick Holthaus wrote:
> You cannot simply leave the umlaut out since it is considered as a separate
> letter for itself. You cannot choose whether to write an "ö" or an "o". Like
> Renat said, there are words that completely change their meaning when
> exchanging the characters.
>
> I think this is especially true if it comes to names. While people get used to
> spellings like "Goettingen" for Göttingen, it looks odd and wrong to Germans.
> Like someone who doesn't have the character on the keyboard ;)
> Also keep in mind that there are cities that are spelled with "oe" or "ae" by
> design. (Soest, Oelde, Aerzen, Oestinghausen etc.) Those cannot be spelled
> with an "ö" instead. It would simply be wrong.
OK, that settles it :-)
It seems the message you folks are trying to pound into my head is
that people don't just casually drop the umlauts and accents. That's
what was bugging me -- if it is an extra key or weird combinations
like in emacs, maybe people would skip it often enough that we would
have to allow for that.
This is a better answer than I had feared because now I don't have to
sweat weird transliterations. There may still be some, but probably
not enough to worry about.
Now on to other mysteries, like why our (American) customer thinks
people in French Guayana (sp?) are going to write "French Guayana" for
the country name. Even my thick skull doesn't expect people living in
Deutschland (probably spelled wrong too, it is very late in a long and
tiring day, so apologies in advance, and if it is correct, apologies
for not recognizing that :-) to write "Germany" ...
Thanks for not pounding my head too heavily ...
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-04 0:31 ` felix
@ 2009-12-04 13:42 ` Volker Armin Hemmann
2009-12-04 20:50 ` Alan McKinnon
0 siblings, 1 reply; 24+ messages in thread
From: Volker Armin Hemmann @ 2009-12-04 13:42 UTC
To: gentoo-user
On Freitag 04 Dezember 2009, felix@crowfix.com wrote:
> If enough Europeans are in the habit of taking
> shortcuts and skipping umlauts and accents and cedilla and tildes,
We don't. Because skipping Umlaut, accent & co. creates a completely new word.
Probably one that is already there.
Munster is a town.
Münster/Muenster is a town.
You cannot drop it. Ever. Only the idiots at CNN and Fox News drop it.
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-04 13:42 ` Volker Armin Hemmann
@ 2009-12-04 20:50 ` Alan McKinnon
2009-12-05 0:01 ` Neil Bothwick
0 siblings, 1 reply; 24+ messages in thread
From: Alan McKinnon @ 2009-12-04 20:50 UTC
To: gentoo-user
On Friday 04 December 2009 15:42:56 Volker Armin Hemmann wrote:
> On Freitag 04 Dezember 2009, felix@crowfix.com wrote:
> > If enough Europeans are in the habit of taking
> > shortcuts and skipping umlauts and accents and cedilla and tildes,
>
> We don't. Because skipping Umlaut, accent & co. creates a completely new word.
> Probably one that is already there.
>
> Munster is a town.
>
> Münster/Muenster is a town.
>
> You cannot drop it. Ever. Only the idiots at CNN and Fox News drop it.
>
And there's always a counterexample:
In Afrikaans (major ZA language derived from Dutch) there are two accent
characters:
- deelteken: two dots above certain vowels
- kappie: caret-like symbol above certain vowels
In all cases, these characters are used simply as modifiers of the vowel. They
change inflection or the pronunciation of the vowel, not the letter itself.
Sometimes they just make spelling easier:
The verb modifier to indicate past tense is the prefix "ge" and the verb for
eat is "eet". So "ate" translates to "geeet". Three consecutive "e"s look
weird so this word is written "geëet".
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-04 20:50 ` Alan McKinnon
@ 2009-12-05 0:01 ` Neil Bothwick
0 siblings, 0 replies; 24+ messages in thread
From: Neil Bothwick @ 2009-12-05 0:01 UTC
To: gentoo-user
On Fri, 4 Dec 2009 22:50:52 +0200, Alan McKinnon wrote:
> Three consecutive "e"s look weird
Are you calling my laptop weird?
;-)
--
Neil Bothwick
THE BORG: Calm, Cool and Collective...
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 19:20 [gentoo-user] [OT] Need advice from people who use non-ascii all day long felix
2009-12-03 19:50 ` Renat Golubchyk
2009-12-03 22:38 ` Arttu V.
@ 2009-12-06 1:58 ` daid kahl
2009-12-06 2:35 ` felix
2009-12-15 16:05 ` J. Roeleveld
3 siblings, 1 reply; 24+ messages in thread
From: daid kahl @ 2009-12-06 1:58 UTC
To: gentoo-user
> I have a project which requires normalizing names, and by that, I mean
> converting to lower case etc, whatever eliminates redundancies. I
> know Unicode has a different "normalize" meaning, but for my purposes,
> that has already been done. Maybe I should call it standardization or
> make up a new cromulent word.
>
> By which I really mean I am confused by a lot of advice I have gotten
> from USAians who get by with the good old 7 bit ASCII character set on
> a daily basis, whether it be written in Unicode or not.
> Yes, I am something of an ignorant American. I know some Japanese,
> French, and Spanish, but not the details of everyday usage. I'd like
> to learn.
Your project sounds interesting, but I have little to contribute on
the technical side.
I'm curious about your handling of Japanese, just because I'm living
outside Tokyo these days. My grasp on Japanese is basically rubbish,
but I can at least claim to know a thing or two.
To keep this in line with your stated application, I actually wonder
how you handle "Tokyo." For pronunciation purposes, if you put it in
hiragana and literally romanized it, you'd probably get Toukyou. In
Japanese a double vowel just extends the sound and isn't a diphthong
(and usually o is extended by u and only rarely another o). For a lot
of cases on the double 'oo' they'll Romanize the second 'o' as an 'h',
since otherwise someone will pronounce it like (a) "fool." So, take
a family name Ohshiro. Probably it should be Romanized "Oohshiro,"
but then people would say something like seeing fireworks.
Tokyo is Romanized this way, according to one culture book I read,
because everyone knows both the o's are extended! I'm sure all these
people also know that "kyo" is a single syllable, too! So it's not
"To-key-oh" it's just "To-kyo" where both syllables are extended from
the double oo.
Osaka is also an extended O at the beginning as I recall, and Kyoto is
the same case as Tokyo (incidentally, the Chinese characters for those
two cities are the same and just reversed in order!).
Again to speak to the original application, I don't know who types
Tookyoo or Tohkyoh or Toukyou. Probably no one because it's generally
Romanized as we all know it. But for typing purposes, Japanese type
the pronunciation of words via hiragana and then a little list pops up
and they select the word they want. So in this sense, they are typing
"Toukyou" into the keyboard...just it's in hiragana.
If you had any questions about Japanese things, I could ask a
colleague. They are all happy to answer questions.
Regards,
daid
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-06 1:58 ` daid kahl
@ 2009-12-06 2:35 ` felix
2009-12-06 2:45 ` daid kahl
0 siblings, 1 reply; 24+ messages in thread
From: felix @ 2009-12-06 2:35 UTC
To: gentoo-user
On Sun, Dec 06, 2009 at 10:58:59AM +0900, daid kahl wrote:
> I'm curious about your handling of Japanese, just because I'm living
> outside Tokyo these days. My grasp on Japanese is basically rubbish,
> but I can at least claim to know a thing or two.
Our handling is simple -- we don't yet. I don't know how to handle
things like that, or the previous example of Copenhagen in different
languages. Look at Naples -- that's not what Italians call it. Venice
is really bad -- no idea how English got it so mangled. Speaking of
Japanese, their word for Mexico (last time I checked) was taken from
the English MEKS-ih-ko and comes out as may-kee-shoo-ko rather than
the more natural may-hee-ko if they had taken it straight from
Spanish.
As long as things stay in Unicode in the native language, we will do
alright. It's covering accepted mistakes (I can't think of a better
term) that is the problem -- thus my worries I'd have to include all
the accented and unaccented versions.
I learned enough Japanese to travel and hold bare bones questions.
As for romanization of Japanese, how do you even know which system to
use? Just as Peking is now Beijing, I have seen Tokyo with bars over
the Os and of course without. That is the same problem as Rome and
Roma.
As for Tokyo being two syllables ... I think it depends on how one
defines syllables. Ask a Japanese speaker to pronounce three (san) slowly, and
it will be two syllables, sa-n, "saw uhn". Ask for three hundred which
comes out as "sambyaku" because the "n" syllable changes sound when it
sounds better, and they will make quite a few syllables out of it,
such as (I am guessing now) saw-umm-bee-yaw-koo. To write Tokyo in
the proper furigana is probably something like toh-o-kee-yoh-o.
> Kyoto is the same case as Tokyo (incidentally, the Chinese
> characters for those two cities are the same and just reversed in
> order!).
Nope -- Tokyo is 東京, east capital. Kyoto is 京都, capital city. Kyo
is the same, to is different.
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-06 2:35 ` felix
@ 2009-12-06 2:45 ` daid kahl
2009-12-06 2:47 ` daid kahl
2009-12-06 3:19 ` felix
0 siblings, 2 replies; 24+ messages in thread
From: daid kahl @ 2009-12-06 2:45 UTC
To: gentoo-user
> Our handling is simple -- we don't yet. I don't know how to handle
> things like that, or the previous example of Copenhagen in different
> languages. Look at Naples -- that's not what Italians call it. Venice
> is really bad -- no idea how English got it so mangled. Speaking of
> Japanese, their word for Mexico (last time I checked) was taken from
> the English MEKS-ih-ko and comes out as may-kee-shoo-ko rather than
> the more natural may-hee-ko if they had taken it straight from
> Spanish.
Yeah, you get all kinds of crazy. For a long time I couldn't
understand why 'computer' is in katakana (i.e., taken from English) and
'calculus' isn't. As it turns out, the Japanese invented calculus
independently of Newton and Leibniz.
> As for ToKyo being two syllables ... I think it depends on how one
> defines syllables. Ask a Japanese to pronounce three (san) slowly, and
> it will be two syllables, sa-n, "saw uhn". Ask for three hundred which
> comes out as "sambyaku" because the "n" syllable changes sound when it
> sounds better, and they will make quite a few syllables out of it,
> such as (I am guessing now) saw-umm-bee-yaw-koo. To write Tokyo in
> the proper furigana is probably something like toh-o-kee-yoh-o.
Well, I don't think "n" is really a syllable. It's a sound, and it's
the only part of the syllabary in Japanese that doesn't have a vowel.
I'm not really convinced this is a syllable in reality.
The proper way to write Tokyo in the syllabary would be to-u-kyo-u, I
think, but I'm not certain. But really that's misleading because
you're *not* supposed to pronounce the sounds twice, you just extend
them, so they aren't really syllables either; they are just modifiers.
>
>> Kyoto is the same case as Tokyo (incidentally, the Chinese
>> characters for those two cities are the same and just reversed in
>> order!).
>
> Nope -- Tokyo is 東京, east capital. Kyoto is 京都, capital city. Kyo
> is the same, to is different.
Huh. I wonder how the hell I came up with that? I'm convinced I did
not decide that on my own but that someone told me. And they told me
I'm sure because I remember the story that went with it. Very
strange. But you're absolutely right.
Regards,
daid
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-06 2:45 ` daid kahl
@ 2009-12-06 2:47 ` daid kahl
2009-12-06 3:19 ` felix
1 sibling, 0 replies; 24+ messages in thread
From: daid kahl @ 2009-12-06 2:47 UTC
To: gentoo-user
>> such as (I am guessing now) saw-umm-bee-yaw-koo. To write Tokyo in
>> the proper furigana is probably something like toh-o-kee-yoh-o.
Oh, I should mention that this is correct in writing. But the yo is a
subscript, so it's also a modifier, so the ki part isn't pronounced,
it's modified into a different sound...
Good times.
~daid
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-06 2:45 ` daid kahl
2009-12-06 2:47 ` daid kahl
@ 2009-12-06 3:19 ` felix
1 sibling, 0 replies; 24+ messages in thread
From: felix @ 2009-12-06 3:19 UTC
To: gentoo-user
On Sun, Dec 06, 2009 at 11:45:43AM +0900, daid kahl wrote:
> Well, I don't think "n" is really a syllable. It's a sound, and it's
> the only part of the syllabary in Japanese that doesn't have a vowel.
> I'm not really convinced this is a syllable in reality.
It's certainly a syllable in their syllabaries, and their opinion is
all that counts ... it is *their* language ...
> The proper way to write Tokyo for syllabary would be to-u-kyo-u I
No, they don't have kyo in the syllabaries. The furigana I have seen
say that it is ki-yo, two syllables.
Now I may be full of it, as most of what I learned was 30 years ago,
and I never got beyond reading and writing at a third or fourth grade
level. I imagine Japanese readers of this are snickering at the crazy
foreigners.
--
... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
Felix Finch: scarecrow repairman & rocket surgeon / felix@crowfix.com
GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933
I've found a solution to Fermat's Last Theorem but I see I've run out of room o
* Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
2009-12-03 19:20 [gentoo-user] [OT] Need advice from people who use non-ascii all day long felix
` (2 preceding siblings ...)
2009-12-06 1:58 ` daid kahl
@ 2009-12-15 16:05 ` J. Roeleveld
3 siblings, 0 replies; 24+ messages in thread
From: J. Roeleveld @ 2009-12-15 16:05 UTC
To: gentoo-user
On Thursday 03 December 2009 20:20:03 felix@crowfix.com wrote:
> I have a project which requires normalizing names, and by that, I mean
> converting to lower case etc, whatever eliminates redundancies. I
> know Unicode has a different "normalize" meaning, but for my purposes,
> that has already been done. Maybe I should call it standardization or
> make up a new cromulent word.
>
> By which I really mean I am confused by a lot of advice I have gotten
> from USAians who get by with the good old 7 bit ASCII character set on
> a daily basis, whether it be written in Unicode or not.
>
> One of the puzzles to me is all the accented chars. Umlauts, etc. I
> am not trying to convert names for permanent purposes but for internal
> comparison. In Germany is a district "Busingen", with an umlauted
> 'u'. Is it reasonable to consider it the same word whether with or
> without the umlauted u? French has the cedilla and acute and grave
> accents. Spanish has the tilde n. Scandinavian languages (all?
> some?) have the o with a slash.
>
> Or put another way, I don't know much about German, French, Spanish,
> etc keyboards. Do your keyboards have any of the extra keys, all of
> them? Are German keyboards and French and Spanish keyboards as
> restricted to their own languages as US keyboards are? If you have to
> hit two or three keys to keep the umlauts, accents, and tildes, do you
> get lazy sometimes and type the base character by itself? Is it even
> considered the base character, or is it considered lazy and sloppy,
> much as I get complaints about typing "thru" because "through" is too
> much trouble?
>
> I need something the equivalent of the C function strcasecmp() which
> not only ignores case, but all other differences without distinction,
> whatever they may be. If leaving off umlauts horrifies academics and
> purists but is what people do in the real world, I want to take that
> into consideration, so that if one person uses the umlaut and another
> doesn't, it won't generate two separate entries. But if leaving off
> the umlaut or accent is a distinct place name, then I can't do that --
> but if real world people do that and live with the confusion, then I
> guess I have to make a different choice.
>
> Yes, I am something of an ignorant American. I know some Japanese,
> French, and Spanish, but not the details of everyday usage. I'd like
> to learn.
>
Hi Felix,
Apart from what was already mentioned, you might want to also consider the
following:
1) Even though people tend to try to do it correctly, non-natives can still
make mistakes with the names. These mistakes are frowned upon by the natives,
but are a part of life.
2) Names of cities can change over time, for example:
"New York" used to be called "Nieuw Amsterdam" (or "New Amsterdam" in English)
3) Some cities have multiple valid spellings in the same language:
The Hague = "Den Haag" or "'s-Gravenhage" (yes, the apostrophe before the "s"
is part of the second version of the name)
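Points 2 and 3 suggest that some variants cannot be derived by any rule
and simply have to be listed. A rough Python sketch of such an alias
table follows; the table contents, the ID strings and the function name
canonical_place are all invented for illustration.

```python
from typing import Optional

# Hypothetical alias table: every known spelling points at one made-up ID.
ALIASES = {
    "the hague": "NL:the-hague",
    "den haag": "NL:the-hague",
    "'s-gravenhage": "NL:the-hague",
    "'s gravenhage": "NL:the-hague",
    "new york": "US:new-york",
    "nieuw amsterdam": "US:new-york",
    "new amsterdam": "US:new-york",
}


def canonical_place(name: str) -> Optional[str]:
    """Look up a place name, ignoring case and extra whitespace."""
    # Collapse runs of whitespace, trim the ends, then fold case.
    key = " ".join(name.split()).casefold()
    return ALIASES.get(key)  # None when the spelling is not listed


print(canonical_place("Den Haag") == canonical_place(" 's Gravenhage"))  # True
```

Unknown spellings return None rather than a guess, so they can be
reviewed by hand before a new alias is added.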
An easier option might be to filter on the postcodes. These should be unique,
and if you put the country's international abbreviation in front, like
so: "NL-1234 AA" or "D-12345", you have a single field to check and can then
link the actual city name to the postcode area.
Disclaimer: I have no clue if these 2 postcodes actually exist.
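As an illustration of the single-field idea, here is a small Python
sketch. The helper name postcode_key and the choice to drop internal
spaces and upper-case the result are assumptions; the postcodes are the
same made-up ones as above, and the place names are placeholders.

```python
def postcode_key(country: str, postcode: str) -> str:
    """Build one lookup key from a country abbreviation and a postcode."""
    # Assumption: whitespace and letter case carry no meaning in a postcode.
    compact = "".join(postcode.split()).upper()
    return f"{country.strip().upper()}-{compact}"


# Link names to the postcode area instead of comparing names directly.
place_by_postcode = {
    postcode_key("NL", "1234 AA"): "placeholder Dutch town",
    postcode_key("D", "12345"): "placeholder German town",
}

# Differently typed input still lands on the same entry.
print(place_by_postcode.get(postcode_key("nl", "1234aa")))
```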
--
Joost