Non-English characters

pkasting

19 years ago

Oct 23, 2005 - 6:50am

I have long been unhappy at the transliteration of all characters into the English alphabet and the question of how to submit a title that says (in Cyrillic) "MOCKBA" has brought it to a head.

I think we should just allow other character sets into the tree. Any technical reasons not to do that, like some non-Unicode-capable backend software or something?

I can think of one non-technical argument against it, which is that people trying to search for someone's name won't find the results they want because they confuse two similar-looking characters. But I think the counter-arguments to that are that we ALREADY have that problem with people who DO work with or speak languages with these other characters, and who will be confused when they search for the correct thing and don't find stuff; and that the current way is just flat-out inaccurate and portrays artist and album names as something other than they actually are.

What I would propose is that the "inexact match" suggestion algorithm be told a set of English transliterations to try first in finding possible matches. That way if there's an exact English title that looks the same as a transliteration of a non-English title, you'll find the English one first. And if you search for "Motorhead" you'll probably get to the umlauted band because nothing else will match as closely. Actual translations, such as "Live in Moscow", are probably more difficult to do correctly (should we do "Moscow" or "Moskva" or what?) and more difficult to code into the system.
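To make the "Motorhead" case concrete, here is a minimal sketch of one way a search could fold accented names down to plain Roman letters while the tree keeps the correct spelling. It is not the site's actual algorithm: the function names (`fold`, `build_index`, `search`) and the sample names are assumptions made up for illustration, and it only handles marks that Unicode can decompose (umlauts, accents), not full transliteration of other scripts.

```python
import unicodedata


def fold(text: str) -> str:
    """Return a search key: decompose accented letters, drop the marks, ignore case."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def build_index(names):
    """Map each folded key to the correctly spelled names stored in the tree (hypothetical)."""
    index = {}
    for name in names:
        index.setdefault(fold(name), []).append(name)
    return index


def search(index, query):
    """Look up by folded key; a name spelled exactly like the query is listed first."""
    candidates = index.get(fold(query), [])
    return sorted(candidates, key=lambda name: name != query)


if __name__ == "__main__":
    index = build_index(["Mot\u00f6rhead", "Am\u00e9lie"])
    print(search(index, "Motorhead"))
```

The stored name stays accurate and only the lookup key is simplified, so someone typing plain letters still lands on the umlauted entry.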

Incidentally, even if nothing is done, we'll still get transliterations as close matches if not too many characters are different.

Also, I think the pain (on the searcher's end) won't be quite as bad as it might appear. I would suspect the majority of searches happen for band names rather than artist names, and people get to artists' names by clicking on them. I think artist names are going to be the vast majority of cases of using a non-English character, so probably most of the "I couldn't find my favorite ____" issues just won't occur.

I'll also note that this did bite me the other day when I had to keep remembering to Romanize names I was looking up to see if they were already in the database. Argh.

If we allowed extended characters, of course, I'd have to go find all the submissions I've made where I've "Romanized" the characters and make a Shambles post listing the correct names -- but that pain is just going to get worse over time, so if we're going to do this (and I think we should), the sooner the better.

It will be annoying to do correctly, but it's an iterative process; you can make various sweeps and fix things without having to "get everything right before doing the conversion". Just allowing _new_ artists into the tree that use extended character sets would be a start. Then I'll go clean up the names on all the couple hundred bands I've touched. Other people can do whatever they feel is best. The whole point is that we can take it at whatever speed we want, and every change we make moves us monotonically closer to being correct.

I'm 100% behind this

Matt Westwood

19 years ago

Oct 23, 2005 - 9:31am

... I think we'd need to set up a separate shambles forum for pointing out character adjustments needed.

We'd also need a convenient symbol map.

And I'm against it ...

misterpomp

19 years ago

Oct 23, 2005 - 10:53am

... since it's only a partial solution. After all, I presume nobody's wanting to put all those Japanese musicians in Japanese, are they? No, the style of the site should stay as is: English. Focussing the argument around a slightly different set (Cyrillic) is taking the size of this change out of view. Bad bad idea IMHO.

···

pkasting

19 years ago

Oct 23, 2005 - 11:11am

If the Japanese musicians in question are consistently listed on their albums in Kanji, and we're capable of inputting the proper Unicode, then yes, I would in fact list them that way. However, most Japanese musicians and bands Romanize their releases anyway, especially foreign market releases, so there's a legitimate reason to put some of those in using Roman characters, whereas a band like Mot

Motorhead, then ...

Matt Westwood

19 years ago

Oct 23, 2005 - 12:52pm

... can that be amended to show the umlaut? It'd be a start ...

Japan

misterpomp

19 years ago

Oct 23, 2005 - 3:11pm

Taking Japan as an example only. Many Japanese releases of Japanese artists (and we defer to home territory releases, don't forget) will have album names, band names or artist names (or all 3) in their native language. Therefore that's how they'll have to go in. Which will make the site more 'correct' but less usable, IMHO. There is a big difference, as I'm sure you know, between a non-English word spelt using standard English characters and a non-English word that can't be spelt with the standard character set, so your Juan/John analogy is, I think, irrelevant.

All I meant by my comment was that taking a character set with more similarities than dissimilarities, and with relatively few examples, disguises the size of the conceptual change we are dealing with here. I first submitted Torm

···

pkasting

19 years ago

Oct 23, 2005 - 9:47pm

Well I'm NOT happy to live with it.

Assume for the sake of argument that:
(1) It is somehow a misspelling of a Japanese artist/band to put it in using any Romanization
(2) It's impossible for us to use the correct Unicode for Kanji or we decide that the usability cost is "too high"
Even given these (neither of which I'm actually prepared to give, especially (1)), your argument boils down to (using made-up numbers) "if we can't make 0.01% of the artists on the site correct, we should force a full 5% to be incorrect." ANY artist or band names that are actually CORRECT are improvements on the current system, regardless of whether there are exceptions that we cannot accommodate. What I am proposing is at least better than the current system, yet you're objecting on what I consider trumped-up grounds because it "can't be done consistently." Fine. Then let it be done inconsistently by doing it in as many cases as we possibly can.

Also you seem to be suffering under the misimpression that what we've done here is use "English" translations of names in other languages, which is completely incorrect. We've transliterated to the most similar-looking character, which is not the same thing at all. Let me take one example. Dan Swan

IMDB

ajweitzman

19 years ago

Oct 24, 2005 - 1:58am

In agreeing (mostly) with pkasting on this subject, my opinion is that the site should reflect correctness as much as it can, but should allow for less strict alternatives for the purposes of searching.

IMDB does this rather well. Here's a sample page:

[www.imdb.com] http://www.imdb.com/title/tt0211915/

If you are in the US or UK, you most likely know this movie simply as "Amelie." In fact, that's how I searched for it. But the original title of the movie doesn't have "Amelie" in it anywhere, only "Am

···

pkasting

19 years ago

Oct 24, 2005 - 3:38am

The current inexact-match algorithm should already handle things pretty well as far as this is concerned. At most, we'd probably need to give it a set of preferred transliterations.
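As a rough illustration of what "a set of preferred transliterations" could look like, here is a sketch that maps characters through a small table before running a fuzzy comparison. Everything in it is an assumption for the sake of the example: the table `PREFERRED` covers only a handful of Cyrillic letters, `suggest` is a made-up name, and Python's `difflib` stands in for whatever the site's inexact matcher really does.

```python
import difflib

# Hypothetical, deliberately tiny table of preferred transliterations.
# Escapes are used so the Cyrillic letters cannot be mistaken for their
# Latin look-alikes; a real table would cover whole alphabets.
PREFERRED = {
    "\u043c": "m",  # Cyrillic em
    "\u043e": "o",  # Cyrillic o
    "\u0441": "s",  # Cyrillic es
    "\u043a": "k",  # Cyrillic ka
    "\u0432": "v",  # Cyrillic ve
    "\u0430": "a",  # Cyrillic a
    "\u0438": "i",  # Cyrillic i
    "\u043d": "n",  # Cyrillic en
}


def transliterate(text: str) -> str:
    """Case-fold the text, then replace each character that has a preferred Roman form."""
    return "".join(PREFERRED.get(ch, ch) for ch in text.casefold())


def suggest(query, stored_names, cutoff=0.6):
    """Return stored names whose transliteration is close to the transliterated query."""
    by_key = {transliterate(name): name for name in stored_names}
    hits = difflib.get_close_matches(
        transliterate(query), list(by_key), n=3, cutoff=cutoff
    )
    return [by_key[key] for key in hits]


if __name__ == "__main__":
    moskva = "\u041c\u041e\u0421\u041a\u0412\u0410"  # the title written in Cyrillic
    print(suggest("Moskva", [moskva, "Live in Moscow"]))
```

Because the query and the stored names go through the same table, a search typed in Roman letters and one typed in Cyrillic end up comparing the same keys.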

···

bgzimmer

19 years ago

Oct 24, 2005 - 4:12am

FWIW, Last.fm uses Japanese, Chinese, Cyrillic, and Greek characters in its performer database:

[www.last.fm] http://www.last.fm/tag/japanese
[www.last.fm] http://www.last.fm/tag/chinese
[www.last.fm] http://www.last.fm/tag/russian
[www.last.fm] http://www.last.fm/tag/greek

Same for Musicbrainz, e.g.:

[musicbrainz.org] http://musicbrainz.org/showartist.html?artistid=10522
浜崎あゆみ (Ayumi Hamasaki)

[musicbrainz.org] http://musicbrainz.org/showartist.html?artistid=1230
王菲 (Faye Wong)

[musicbrainz.org] http://musicbrainz.org/showartist.html?artistid=162611
Кино (Kino)

[musicbrainz.org] http://musicbrainz.org/showartist.html?artistid=56996
Καιτη Γαρμπη (Katy Garbi)

Potential problems

Python

19 years ago

Oct 24, 2005 - 4:02pm

I can see a potential problem: the end user must have the fonts installed in order to see the non-standard characters. Those first two names result in "?????" and "??" respectively in my browser. Cyrillic and Greek seem to work fine though.

I'm using Firefox and as far as I can remember, Firefox has never asked me to install fonts it doesn't find on my PC. I know Internet Explorer does but I'm pretty sure you can disable that feature.

Also, Καιτη Γαρμπη transliterates to Kaity Garmpy, which I doubt the search algorithm will map to Katy Garbi.
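One way to put a number on that doubt is to score the letter-by-letter transliteration against the usual Romanized name. The sketch below uses Python's `difflib.SequenceMatcher` purely as a stand-in; how the site's search algorithm actually scores matches is not stated in this thread, and `similarity` is a name made up for the example.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two names, ignoring case."""
    return difflib.SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


if __name__ == "__main__":
    score = similarity("Kaity Garmpy", "Katy Garbi")
    print(f"{score:.2f}")
```

Whether a score like this counts as a "close match" depends entirely on the cutoff the matcher uses, so it would need checking against the real algorithm.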

···

bgzimmer

19 years ago

Oct 24, 2005 - 8:57pm

Hmm, I'm using Firefox with Western encoding (ISO-8859-1) with no special fonts loaded, and I have no problem with any of those character sets. No problem for me in IE either.

···

pkasting

19 years ago

Oct 25, 2005 - 12:57am

Weirdly, I can see the first two here at work on Firefox 1.0.x Linux, but not at home on my FF 1.5 Beta install. I'll have to check into that. In any case, while I know the consistency of this bugs misterpomp, I really intended this feature request more for names that use predominantly Roman characters but have umlauts, accents, and other marks. I think fixing those is doable and won't degrade user experiences with the site. Tackling fully non-Roman names, like Asian or Greek character sets, is a bit harder and any of several solutions could be used. I would prefer not to cross that bridge until we're done with this first one, so consider this a "limited" feature request.

Update: OK, works fine at home now that I went into my Windows XP Control Panel and asked it to install support for "East Asian languages", whose font files are apparently not on the system by default (I long ago put Japanese et al. fonts onto my Linux system at work--I deal with that stuff not infrequently in hardware documentation).

···

Python

19 years ago

Nov 5, 2005 - 2:04pm

So how would we enter this album?
[gwardnet.d2g.com] http://gwardnet.d2g.com/winefan/AW_Related_Material/LPs/451/451.html

It should be 451

···

Matt Westwood

19 years ago

Nov 5, 2005 - 3:10pm

Same way as those Pink Cream 69 albums went in. Those ones had degree symbols.

···

Python

19 years ago

Nov 5, 2005 - 4:43pm

True, I forgot about those.
OK, no problem then :)
