Web interface & Firefox

Viewing 13 posts - 16 through 28 (of 28 total)
  • #6571
    CCRDude
    Participant

    Ok, as part of my attempts to make the system more Unicode-aware even on the console, I actually created a locale (de_DE.UTF8). Guess what this fixed as a side effect? 🙂 I’ll play around some more tomorrow, since I also made changes to make paths shorter, but it seems it’s finally working now.

    #6572
    rpedde
    Participant

    @CCRDude wrote:

    Ok, as part of my attempts to make the system more Unicode-aware even on the console, I actually created a locale (de_DE.UTF8). Guess what this fixed as a side effect? 🙂 I’ll play around some more tomorrow, since I also made changes to make paths shorter, but it seems it’s finally working now.

    Ooooooh… interesting. This was German, right? There aren’t any multi-byte characters in German, are there? Just a different codepage?

    Or are there?

    #6573
    Jxn
    Participant

    In UTF-8 everything outside of pure ASCII (7-bit) is multi-byte. So, as in Swedish (äöå), you have multi-byte encodings of ü, ß and ã(?) in German UTF-8.

    #6574
    rpedde
    Participant

    @Jxn wrote:

    In UTF-8 everything outside of pure ASCII (7-bit) is multi-byte. So, as in Swedish (äöå), you have multi-byte encodings of ü, ß and ã(?) in German UTF-8.

    Yeah, for UTF-8. I understand that. But some non-UTF encoding systems are multibyte: Big5, Shift-JIS, etc. I was asking if the native (non-UTF) encoding for German was multibyte. If strlen != bytelen, that would explain those problems.

    — Ron
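To make the strlen versus byte-length point concrete, here is a minimal Python sketch (not mt-daapd code, which is written in C). The sample strings and the list of encodings are assumptions chosen for illustration; the idea is only to compare the number of characters with the number of encoded bytes.

```python
# Compare character count with encoded byte count.
# Sample strings and encodings are illustrative assumptions.
german = "Grüße"
japanese = "日本語"

char_count = len(german)
byte_counts = {enc: len(german.encode(enc)) for enc in ("latin-1", "cp850", "utf-8")}

for enc, n in byte_counts.items():
    print(f"{enc}: {char_count} characters, {n} bytes")

# Shift-JIS is a native (non-UTF) encoding that is itself multi-byte.
sjis_bytes = len(japanese.encode("shift_jis"))
print(f"shift_jis: {len(japanese)} characters, {sjis_bytes} bytes")
```

In the single-byte codepages the two counts agree, while in UTF-8 and Shift-JIS they do not, which is the situation where code that confuses the two lengths would misbehave.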

    #6575
    fizze
    Participant

    no, german is single-byte. just a different codepage. 850, iirc.

    #6576
    CCRDude
    Participant

    I even read somewhere that under Linux de_DE.UTF8 is absolutely identical to en_GB.UTF8 (should be anyway, but imho there are some differences in UTF-8 implementations because of different versions of the UTF-8 standard).

    The codepage for German depends on the base you use; in ISO, German is usually ISO-8859-1 or ISO-8859-15 (the latter including the Euro currency sign). The IBM codepages are 850 and 437, as for English. The Windows codepage is Windows-1252, identical as well, and Mac uses MacRoman for both languages, too. So the only difference from English would be ISO-8859-1 versus ISO-8859-15 (English uses only the first, while German at some point started to switch to the latter), which shouldn’t matter though, since the euro sign is probably not involved here 😀

    But: now that you mention it, the Debian image for the Kurobox seems to have been made by someone from Japan (no wonder, the Kurobox itself can only be imported from there, displays Japanese characters on the box and the shipped manual is Japanese only). It even uses some Japanese apt sources, and while the shell was in English after installation, who knows? I’m still struggling to find out where exactly I can find the charset used by ssh, shells, NFS, Samba, etc. 😉

    #6577
    Jxn
    Participant

    (I shouldn’t write this, I shouldn’t write this… 😉 )

    The locales “de_DE.UTF-8” and “en_GB.UTF-8” are not the same. The character encoding is the same, UTF-8, but language and country are different, and that makes programs behave differently.

    ISO-8859-1 and Windows-1252 are not the same.
    IBM codepages 850 and 437 are not the same.
    In Japan they usually use Shift-JIS, which has the same issues with ASCII (and Latin-X).
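A small Python sketch of where these "almost identical" encodings diverge. The two byte values are picked for illustration; the codec names are the ones Python's standard library uses.

```python
# Decode the same byte under encodings that are often assumed to be identical.
# The byte values are illustrative choices.
euro_byte = b"\x80"
o_byte = b"\x9b"

as_cp1252 = euro_byte.decode("cp1252")    # Euro sign in Windows-1252
as_latin1 = euro_byte.decode("latin-1")   # a C1 control character in ISO-8859-1

as_cp437 = o_byte.decode("cp437")         # cent sign
as_cp850 = o_byte.decode("cp850")         # o with stroke

print(repr(as_cp1252), repr(as_latin1), repr(as_cp437), repr(as_cp850))
```

For plain German text the differences rarely show up, since the letters and umlauts sit at the same positions within each pair; they only matter for the bytes where the tables differ.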

    There is not one single character encoding that is “German”; there are many (as for any language). Note that many countries use several scripts, like three in Japan.

    On servers I usually use locale “C”, and on desktops I use “sv_SE.UTF-8”.

    In Unix (and Linux) you can set your locale when you log in or start a program.
    The locale tells the program which language/country you want the program to use AND which character encoding (code page in MS Windows lingo) you want it to use. You usually set the variable LANG to your preferred locale (locale ‘C’ is special and usually means ASCII and American English. It is used in scripts etc.: no conversion, and characters are ordered according to their ASCII codes).
    Locales should make the program sort and compare strings the right way (‘é’, ‘è’ and ‘e’ count as the same character in French when you compare them), write time and date correctly, etc.

    With this you can have a different locale for each and every program that you run. You can have one Mozilla talking Swedish, and the next time you run it, it could talk German to you.

    Anyway, a locale name has three parts. Take “sv_DE.ISO-8859-1” as an example.

    “sv” says you want the program to talk Swedish (if it can).
    “DE” says that you want the program to use German dates, currencies etc.
    “ISO-8859-1” says which character encoding you want the program to use.
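The three parts can be pulled apart with a few lines of Python. This is only a sketch: `split_locale` and `LocaleParts` are hypothetical names made up for this example, and the handling of an `@modifier` suffix (as in `de_DE.UTF-8@euro`) is an addition not discussed above.

```python
from typing import NamedTuple, Optional

class LocaleParts(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]

def split_locale(name: str) -> LocaleParts:
    """Split 'll_CC.codeset' into its parts; any '@modifier' suffix is dropped."""
    name = name.split("@", 1)[0]
    base, _, codeset = name.partition(".")
    language, _, territory = base.partition("_")
    return LocaleParts(language, territory or None, codeset or None)

print(split_locale("sv_DE.ISO-8859-1"))
print(split_locale("C"))
```

Parts that are missing from the name come back as `None`, which matches the case where the encoding is left for the system to fill in.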

    There are lots of encodings, like CCRDude wrote. ASCII is an old one that Americans used; it’s also called ISO-646 or US-ASCII. (EBCDIC is another, 8-bit encoding that only IBM used.) ASCII encodes each character in 7 bits. There are different national variations of ASCII, where some characters, like {[]}|@$, are usually replaced by national symbols (like åäöÅÄÖ in Swedish in ISO-646-SE). Still only 7 bits, though.

    Later ASCII was extended to 8 bits and called ISO-8859-X or Latin-X (where X is a number between 1 and 16). Still not all characters at the same time, but at least programmers could have both ‘[’ and ‘Å’ in the same file if they used ISO-8859-1 😉

    So each program talks to you in your language and character encoding (code page). The next person who logs in on the same machine can have another locale.

    But still not all human characters in one document. So then came two standards, Unicode and ISO-10646 (or UCS). Unicode started with 16 bits for each character, but now Unicode and ISO-10646 use 32 bits (and can expand to more). They also use the same encodings and are mostly equal. With those two, we can encode all characters used in human languages. Like ‘A’, ‘þ’, ‘Å’ and ‘Å’ (yes, ‘Å’ and ‘Å’ are different: one is the character “A ring”, the other is the unit “Ångström”, but I used Latin 1 here though :oops:). In this way it’s possible to convert and still keep the same semantics. There are some code points in Unicode (and ISO-10646) that combine with another character, so a 32-bit character ‘Å’ could also be composed from the character ‘A’ and ‘a ring over the last character’. So Unicode is not as simple as ISO-8859-1 or ASCII. But human characters are not simple either.
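The “A ring” versus “Ångström” case and the combining form can be checked with Python's standard `unicodedata` module. A minimal sketch; the variable names are illustrative only.

```python
import unicodedata

a_ring = "\u00c5"      # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"    # ANGSTROM SIGN
combined = "A\u030a"   # 'A' followed by COMBINING RING ABOVE

# As raw code point sequences, all three are different strings.
print(a_ring == angstrom, a_ring == combined)

# After NFC normalization they collapse to the same single character.
nfc = [unicodedata.normalize("NFC", s) for s in (a_ring, angstrom, combined)]
print(all(s == a_ring for s in nfc))
```

This is why software that compares Unicode strings usually normalizes them first.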

    Anyway, all 16 bits are not always needed to encode a C program. You mostly only need 7-bit ASCII. To save space (and make old programs kind of work), an encoding of Unicode called UTF-8 was created. With ordinary ASCII characters, you only need one 8-bit byte, with the 8th bit 0. If it is a 1, you use the following bytes to make up more bits of the Unicode character. You can use up to four bytes this way in the UTF-8 encoding of Unicode for one character. There are also UTF-16 and UTF-32 (also called UCS-4).

    So UTF-8 is a multi-byte way of representing Unicode (or ISO-10646). If you use UTF-32, you don’t need to handle variable-width characters.
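The bit patterns are easy to look at in Python. A short sketch; `utf8_bits` is a hypothetical helper name and the sample characters are arbitrary.

```python
from typing import List

def utf8_bits(ch: str) -> List[str]:
    """Return the UTF-8 bytes of one character as 8-digit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

# One ASCII letter, one Latin-1 letter, the euro sign, and one emoji.
for ch in ("A", "\u00e5", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", utf8_bits(ch))
```

An ASCII character is a single byte whose top bit is 0. In a multi-byte sequence the first byte starts with as many 1 bits as there are bytes in the sequence, and every following byte starts with `10`.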

    Short summary: read a book about Unicode, it’s fun…

    Wikipedia:
    ISO-646: http://en.wikipedia.org/wiki/ISO_646
    ISO-8859: http://en.wikipedia.org/wiki/ISO_8859
    ISO-10646: http://en.wikipedia.org/wiki/Universal_character_set
    Unicode: http://en.wikipedia.org/wiki/Unicode
    Shift-JIS: http://en.wikipedia.org/wiki/Shift-JIS
    Web pages:
    Unicode: http://unicode.org/
    Unicode FAQ: http://www.unicode.org/faq/
    Free ISO-standards: http://isotc.iso.org/livelink/livelink/fetch/2000/2489/Ittf_Home/PubliclyAvailableStandards.htm
    Character encodings: http://czyborra.com/charsets/

    #6578
    CCRDude
    Participant

    @Jxn: Yes, that’s what I meant - encodings are the same. So a file created with de_DE.UTF-8 is on a binary level the same as one created with en_GB.UTF-8, so actually it shouldn’t matter if a program created them with one locale and another program or instance of the same reads them with the other locale 😉 (shouldn’t matter with any Unicode, but as mentioned, there are a lot of versions of the UTF-8 standards since it was expanded and changed a few times to include more languages with new characters)

    And the character encoding is often not explicitly mentioned in locales, and sometimes the country is lower case as well, and separated by a - instead of a _ (just take a look at the language preferences your browser sends, or the Linux environment variable LANGUAGE (not LANG))…

    To make this even more complicated, depending on the native byte order of different systems, UTF-16 and UTF-32 can even differ on a byte level, so these texts usually start with a BOM, a Byte Order Mark (two bytes in UTF-16, four in UTF-32), telling the application whether they are little-endian or big-endian. UTF-8 at least has a single fixed byte order; an optional BOM can appear there as well, but only as a signature, it does not change the byte order.
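A quick way to see the byte-order difference and the BOM is Python's `codecs` module. A minimal sketch with an arbitrary one-character sample:

```python
import codecs

text = "ü"

le = text.encode("utf-16-le")         # explicit byte order, no BOM written
be = text.encode("utf-16-be")
with_sig = text.encode("utf-8-sig")   # UTF-8 preceded by the optional BOM signature

print(le.hex(), be.hex(), with_sig.hex())
print(codecs.BOM_UTF16_LE.hex(), codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF8.hex())

# With a BOM in front, the generic "utf-16" decoder picks the byte order itself.
decoded_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
decoded_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
print(decoded_le == decoded_be == text)
```

The two UTF-16 byte strings are mirror images of each other, while the UTF-8 bytes after the signature are the same on every machine.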

    You wrote a very detailed description on the topic though 🙂

    Back to the problem which isn’t one any more, I assume mt-daapd should run under the “standard” locale? Which would most likely be the one from the environment; type “env” and look out for “LANG=…” which in my case is “LANG=de_DE.UTF-8”. That environment variable wasn’t set before I created the locale though, so the question would be what would be used as default if that was not there.
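On the question of what applies when LANG is missing: on POSIX systems the character-handling locale is taken from LC_ALL, then LC_CTYPE, then LANG. The sketch below models that lookup in Python. `effective_ctype_locale` is a hypothetical helper, and the final fallback to "C" is an assumption: POSIX leaves the default implementation-defined, although "C" is what glibc uses. Note also that a C program stays in the "C" locale until it calls `setlocale()` itself, whatever the environment says.

```python
import os
from typing import Mapping

def effective_ctype_locale(env: Mapping[str, str]) -> str:
    """Return the locale name that would govern character handling (LC_CTYPE)."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:              # unset and empty are treated alike
            return value
    return "C"                 # assumed fallback, see the note above

print(effective_ctype_locale(os.environ))
```

Passing a plain dictionary instead of `os.environ` makes it easy to try out combinations, for example `effective_ctype_locale({"LANG": "de_DE.UTF-8", "LC_ALL": "C"})`.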

    #6579
    rpedde
    Participant

    @CCRDude wrote:

    Back to the problem which isn’t one any more, I assume mt-daapd should run under the “standard” locale? Which would most likely be the one from the environment; type “env” and look out for “LANG=…” which in my case is “LANG=de_DE.UTF-8”. That environment variable wasn’t set before I created the locale though, so the question would be what would be used as default if that was not there.

    It cheerfully ignores locale, except where the underlying libc has the presence of mind to do something about it.

    Everything internally is utf-8. The config file is assumed to be in utf-8 as well.

    All metadata is stored as utf-8 also. The metadata is also cleaned, so if there is an invalid utf-8 string that claims to be utf-8, it gets mangled by character replacement to become valid utf-8.
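The cleaning step can be pictured with a short Python sketch. This is not mt-daapd's code (which is C), and the replacement character is an assumption: Python substitutes U+FFFD, while the post does not say which character mt-daapd uses. `clean_utf8` is a hypothetical name.

```python
def clean_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")

valid = "Motörhead".encode("utf-8")
broken = "Motörhead".encode("latin-1")   # a lone 0xF6 byte is not valid UTF-8

print(clean_utf8(valid))
print(clean_utf8(broken))
```

The result is always valid UTF-8 when encoded again, at the cost of losing the original character wherever the input was mislabeled.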

    Two things that don’t get utf-8 linted are the config file and file paths. File paths I can’t mangle: if the readdir/scandir said the path was somesuch, then it *was*, whether or not it is valid utf-8. This can cause problems when the path is actually in a codepage encoding. To fix it, I’d have to iconv codepage filenames to utf-8 for storage and display, and revert to codepage when doing file opens and stuff. I have most of this already ready to go, it’s pretty high on my list right now. I’ll assume codepage based on locale, or allow someone to override it in the config.
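The round trip described here (codepage bytes to UTF-8 for storage and display, and back to codepage bytes for file opens) could look roughly like this in Python. It is only a sketch of the idea, not the planned mt-daapd implementation; the function names, the `CODEPAGE` constant, and the sample filename are all hypothetical.

```python
def to_storage(raw_name: bytes, codepage: str) -> str:
    """Convert a filename as returned by the filesystem into text for storage/display."""
    return raw_name.decode(codepage)

def to_filesystem(stored_name: str, codepage: str) -> bytes:
    """Convert a stored name back into the exact bytes needed to open the file."""
    return stored_name.encode(codepage)

CODEPAGE = "iso-8859-1"       # hypothetical: taken from the locale or a config override
raw = b"Gr\xfc\xdfe.mp3"      # bytes as a readdir-style call might return them

stored = to_storage(raw, CODEPAGE)
utf8_for_db = stored.encode("utf-8")
print(stored, utf8_for_db)
```

The important property is that `to_filesystem(to_storage(raw, cp), cp)` gives back the original bytes, so the file can still be opened after its name has been stored as UTF-8.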

    Maybe when I get all that done it will fix the things you are seeing. I have to assume that’s the problem.

    #6580
    CCRDude
    Participant

    Maybe when I get all that done it will fix the things you are seeing.

    After setting the locale, I don’t see any problems at all any more 🙂

    So actually, I’ll probably grab a newer nightly next and see if I can fix any other HTML standard compliance issues 😉

    #6581
    Jxn
    Participant

    @CCRDude wrote:

    @Jxn: Yes, that’s what I meant - encodings are the same. So a file created with de_DE.UTF-8 is on a binary level the same as one created with en_GB.UTF-8, so actually it shouldn’t matter if a program created them with one locale and another program or instance of the same reads them with the other locale 😉 (shouldn’t matter with any Unicode, but as mentioned, there are a lot of versions of the UTF-8 standards since it was expanded and changed a few times to include more languages with new characters)

    Locales change lots of things, like ordering, currencies and script (how characters are coded).

    So in this case, the only interesting thing is the character coding, which is the same in both locales you tested, namely UTF-8.

    And for your (and our) purpose, any change in Unicode doesn’t matter. And there are not lots of Unicode. It’s lots of locales that use the same script coding, UTF-8.

    Anyway, between locale C (or de_DE.ISO-8859-1) and de_DE.UTF-8 (or en_GB.UTF-8) there is a difference. Locale C uses ASCII (which mostly is equal to ISO-8859-1 in locale de_DE.ISO-8859-1) and the others use UTF-8. That could make a difference for any character outside of ASCII (remember, only 7 bits: no åäöÅÄÖ in ASCII, but they are in ISO-8859-1). Lots of balls in the air right now 😉
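What that difference looks like in practice can be shown in a few lines of Python. A minimal sketch, using the Swedish letters from the post as sample data:

```python
text = "åäöÅÄÖ"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

# Reading UTF-8 bytes as ISO-8859-1 turns each character into two.
misread = utf8_bytes.decode("iso-8859-1")

# Plain ASCII has no representation for these characters at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(latin1_bytes), len(utf8_bytes), len(misread), ascii_ok)
```

Text that is pure ASCII comes out identical in all three cases, which is why such problems only show up once names contain characters like these.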

    #6582
    CCRDude
    Participant

    Hey, you seem to know more about what I tested than I did, how about coming over and fixing all the remains? 😛
    Yes, the locale I set by hand was UTF-8, but there was no explicit locale before I generated one, so how do you know both I “tested” were UTF-8?
    Actually, the problem was (or may have been) that the OS image was partly Japanese at some point.

    Also I don’t quite understand what you mean by “there are not lots of Unicode”, there’s a noun missing, lots of what? If you feel insulted by my statement that there are lots of Unicode versions, keep in mind that we’ve reached Unicode version 5.0.0 by now, with I think nearly 20 versions of Unicode so far. For a standard, that’s actually quite a lot in my opinion.

    And about all the locale changes you mention - there are only two “things” involved here. One is ID3 tags, which use UTF-8 or UTF-16 LE or BE, but the locale is only used to determine the language of comments, sync and unsync lyric fields, so the “change lots of things” won’t hit here (I wrote my own tagging application to be able to properly tag Polish files, so I think I encountered all ID3v2 special cases 😉 ). The other thing is filenames, which are quite useless here since they’re just stored the way they’re read and that’s it.
    So I strongly disagree with your disagreement 😉
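For readers unfamiliar with how ID3v2 text frames carry their own encoding, independent of any locale: each text frame body starts with one byte naming the encoding. The sketch below follows the ID3v2.4 values (0 = ISO-8859-1, 1 = UTF-16 with BOM, 2 = UTF-16BE without BOM, 3 = UTF-8). It is a simplified illustration, not the tagging application mentioned above; the names are hypothetical and frame headers, flags and unsynchronisation are ignored.

```python
# Encoding byte -> Python codec, following the ID3v2.4 text frame convention.
ID3_TEXT_CODECS = {
    0: "iso-8859-1",
    1: "utf-16",      # BOM decides the byte order
    2: "utf-16-be",   # no BOM
    3: "utf-8",
}

def decode_text_payload(payload: bytes) -> str:
    """Decode a text frame body: one encoding byte followed by the encoded text."""
    codec = ID3_TEXT_CODECS[payload[0]]
    return payload[1:].decode(codec).rstrip("\x00")

print(decode_text_payload(b"\x03" + "Żółć".encode("utf-8")))
print(decode_text_payload(b"\x01" + "Żółć".encode("utf-16")))
```

Because the encoding travels with the data, the process locale does not influence how the tag text is decoded.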

    #6583
    Jxn
    Participant

    @CCRDude wrote:

    Hey, you seem to know more about what I tested than I did, how about coming over and fixing all the remains? 😛

    Please, don’t be upset. You are working on this, and that is great. I only read what you write and comment on that information.

    Yes, the locale I set by hand was UTF-8, but there was no explicit locale before I generated one, so how do you know both I “tested” were UTF-8?
    Actually, the problem was (or may have been) that the OS image was partly Japanese at some point.

    Japanese system/locale? Did you get strange output from ‘ls -l’? You should have been using locale “C” if no other locale was specified. But that doesn’t mean I know that; it’s just an observation from the information in your posts and my knowledge about this subject.
    You wrote (or did I misunderstand you?) that you used two other locales with UTF-8 scripting. But locales have influence on more stuff than what script the computer is using.
    So my comments were more to make clear that locales and scripts are not the same. And that the two locales you said you use have the same script, UTF-8. And that should not give any differences in this case. Locale “C”, on the other hand, would have given differences compared to the other locales mentioned here.

    Also I don’t quite understand what you mean by “there are not lots of Unicode”, there’s a noun missing, lots of what? If you feel insulted by my statement that there are lots of Unicode versions, keep in mind that we’ve reached Unicode version 5.0.0 by now, with I think nearly 20 versions of Unicode so far. For a standard, that’s actually quite a lot in my opinion.

    I’m not insulted. Sorry if I have given you that impression, or if I insulted you.
    You are right that there are different Unicode standards, but on the same computer system there is only one. So that shouldn’t give any problems for programs running on the same system.

    And about all the locale changes you mention - there are only two “things” involved here. One is ID3 tags, which use UTF-8 or UTF-16 LE or BE, but the locale is only used to determine the language of comments, sync and unsync lyric fields, so the “change lots of things” won’t hit here (I wrote my own tagging application to be able to properly tag Polish files, so I think I encountered all ID3v2 special cases 😉 ). The other thing is filenames, which are quite useless here since they’re just stored the way they’re read and that’s it.
    So I strongly disagree with your disagreement 😉

    I don’t disagree on possible effects on programs, just the use of terms.
    Locales (like sv_SE.ISO-8859-1 or sv_SE.UTF-8) are not the same thing as scripts (like ISO-8859-1 or UTF-8). Scripts are part of locales. Lots of people (and by that I don’t mean anyone special) get confused by that.
    So using locale en_GB.UTF-8, de_DE.UTF-8 or sv_SE.UTF-8 should give the same results when it comes to character encoding. But comparing those three with en_GB.ISO-8859-1 or C wouldn’t necessarily do that.

    So the intent of my comments on this was to clear up that possible confusion. I might have failed at that 😉

    So, keep up the good work. I do appreciate that.

    “Well, just pass by folks, nothing interesting here.”

  • The forum ‘Setup Issues’ is closed to new topics and replies.