
Foreign characters and XML entries

Viewing 11 posts - 1 through 11 (of 11 total)
  • #274
    zemanel
    Guest

    hi,
    I’ve just installed the latest build and the problems with mt-daapd crashing after every rescan seem to be gone (although I’ve only been using it for a few hours).

    Another good thing: it can find the artist info, album, etc. for almost all WAV files whose info is contained in the XML file.
    So the ability to process XML files is back.

    One problem remains, though: “special” international characters.

    Here’s one example of a fairly popular artist: “Björk”. The XML entry for her song “One Day”, as produced by iTunes running on Win XP, is:


    Track ID: 2932
    Name: One Day
    Artist: Björk
    Composer: Björk Gudmundsdottir
    Album: Debut
    Genre: Alternative & Punk
    Kind: WAV audio file
    Size: 56800844
    Total Time: 322000
    Disc Number: 1
    Disc Count: 1
    Track Number: 7
    Track Count: 12
    Year: 1993
    Date Modified: 2006-04-30T19:19:01Z
    Date Added: 2006-04-30T19:18:57Z
    Bit Rate: 1411
    Sample Rate: 44100
    Rating: 60
    Persistent ID: 171A57BBFDACC25F
    Track Type: File
    Location: file://localhost/F:/Music/Bj%C3%B6rk/Debut/07%20One%20Day.wav
    File Folder Count: 4
    Library Folder Count: 1

    Three fields seem to be the culprits: Artist, Composer and especially Location, which I suspect is the origin of the problem. Notice the way it writes the path:
    …/Bj%C3%B6rk/Debut/07%20One%20Day.wav

    when it’s actually …/Björk/Debut/07 One Day.wav

    btw, mt-daapd does find the file “07 One Day.wav” but it cannot associate any XML data with it.
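For reference, the Location value is a percent-encoded URL whose escapes are UTF-8 bytes. A minimal Python sketch, assuming the value is read as a plain string (the helper name `decode_location` is hypothetical and not part of mt-daapd), shows that decoding it yields the real path:

```python
from urllib.parse import unquote, urlsplit


def decode_location(location: str) -> str:
    """Return the path part of a file:// URL with percent-escapes decoded as UTF-8."""
    # urlsplit separates the scheme and host ("file", "localhost") from the path.
    encoded_path = urlsplit(location).path
    # unquote decodes %XX escapes and interprets the bytes as UTF-8 by default.
    return unquote(encoded_path)


location = "file://localhost/F:/Music/Bj%C3%B6rk/Debut/07%20One%20Day.wav"
print(decode_location(location))
```

So the XML itself is consistent; the open question is which byte encoding the file system behind the server expects for the same name.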

    Now a different example: “Clã”, a Portuguese band with the special character “ã”. Their song “A Grande Pirâmide” appears in the XML file as:


    Track ID: 240
    Name: A Grande Pirâmide
    Artist: Clã
    Album: Kazoo
    Genre: Latin
    Kind: WAV audio file
    Size: 37836668
    Total Time: 214493
    Disc Number: 1
    Disc Count: 1
    Track Number: 1
    Track Count: 13
    Year: 1997
    Date Modified: 2006-05-03T01:55:52Z
    Date Added: 2006-04-30T00:35:26Z
    Bit Rate: 1411
    Sample Rate: 44100
    Rating: 60
    Persistent ID: 7AA7DF1713F05A40
    Track Type: File
    Location: file://localhost/F:/Music/Cl%C3%A3/Kazoo/01%20A%20Grande%20Pir%C3%A2mide.wav
    File Folder Count: 4
    Library Folder Count: 1

    The actual location of the file is …/Clã/Kazoo/01 A Grande Pirâmide.wav
    and mt-daapd shows it as “01 A Grande Pir?mide.wav”.

    So, what should I do? Should I go to the XML file and edit it manually by replacing things like “Cl%C3%A3” with “Clã”? Or is there some option to process such data in the latest build?

    The first option of doing it manually doesn’t bother me, since I can script it. All I need to know is how to present the info in a way that mt-daapd can read.

    cheers (and kudos for addressing the scan crashing and XML processing issues so promptly)

    #4435
    rpedde
    Participant

    @zemanel wrote:

    hi,
    So, what should I do? Should I go to the XML file and edit it manually by replacing things like “Cl%C3%A3” with “Clã”? Or is there some option to process such data in the latest build?

    We talked about this before right? This was a ntfs drive that got moved to a unix box and is now accessed via samba? Is that right?

    Until everyone stores everything in utf-8 straight-through, this kind of thing is just plain going to be a nightmare.

    The files are stored as utf-16 on disk, but iTunes is obviously storing the file names as utf-8. So up-promote utf-8 to utf-16? This might take some work.

    I think I have some bjork around, I’ll see if I can’t replicate it. I might not be able to — at least not on windows. I’m pretty sure I was playing with some Bjork music when I was working through playlist stuff, and it worked okay locally, so it’s probably a conversion issue on samba.

    #4436
    zemanel
    Guest

    @rpedde wrote:

    We talked about this before right? This was a ntfs drive that got moved to a unix box and is now accessed via samba? Is that right?

    Yes, we discussed this in another thread, which disappeared after the recent change to the forum structure/pages. It’s an NTFS drive that got moved to an NSLU2 running the latest Unslung 6.8, with support for Western Europe/Latin 1 (850).

    @rpedde wrote:

    The files are stored as utf-16 on disk, but iTunes is obviously storing the file names as utf-8. So up-promote utf-8 to utf-16? This might take some work.

    Well, I tried changing one entry in the XML file where iTunes had Bj%C3%B6rk (C3 B6 is the UTF-8 encoding of the letter ö). I opened the file with an XML editor and changed %C3%B6 to 00F6, the code for the same character in UTF-16, but mt-daapd still couldn’t find it.

    Also, when I Telnet into the slug I can access the folder named Björk by just typing cd Björk. So the Slug certainly understands ö.

    @rpedde wrote:

    I think I have some bjork around, I’ll see if I can’t replicate it. I might not be able to — at least not on windows. I’m pretty sure I was playing with some Bjork music when I was working through playlist stuff, and it worked okay locally, so it’s probably a conversion issue on samba.

    If you can tell me how to edit the xml file so that mt-daapd can process the path info for something like …/Björk/…, which iTunes writes as ../Bj%C3%B6rk/…, I can do it for the other cases myself, even if that means having to change the xml entries manually.

    Cheers

    #4437
    rpedde
    Participant

    @zemanel wrote:

    If you can tell me how to edit the xml file so that mt-daapd can process the path info for something like …/Björk/…, which iTunes writes as ../Bj%C3%B6rk/…, I can do it for the other cases myself, even if that means having to change the xml entries manually.

    … and if I knew that, I’d do it in code. 🙂

    But maybe it’s in codepage. In 850, an o with a diaeresis is 0x94, looks like. Try that.
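To make the candidates concrete, here is a small Python sketch that prints the bytes for “ö” in each encoding mentioned so far. The helper `percent_bytes` is a hypothetical name used only for illustration; the codec names are Python's standard ones.

```python
def percent_bytes(char: str, encoding: str) -> str:
    """Encode a character and format each resulting byte as %XX."""
    return "".join(f"%{byte:02X}" for byte in char.encode(encoding))


for encoding in ("utf-8", "utf-16-be", "iso-8859-1", "cp850"):
    print(f"{encoding:>10}: {percent_bytes('ö', encoding)}")
```

The UTF-8 form is the two-byte sequence iTunes writes, while code page 850 uses the single byte 0x94.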


    #4438
    fizze
    Participant

    Wow, iTunes is really weird there.

    I too run a codepage 850 NTFS drive on my slug, but I use lots of apps for playlists and the like.

    In fact, I haven’t enabled the process_m3u option because I want to wait until this is more stable.

    Go zemanel, and spot all those bugs 🙂

    #4439
    schiers
    Participant

    Hi,

    We’re lucky that Björk Guðmundsdóttir uses only her first name, aren’t we? 8)

    BR,
    Carsten.

    #4440
    zemanel
    Guest

    @schiers wrote:

    Hi,

    We’re lucky that Björk Guðmundsdóttir uses only her first name, aren’t we? 8)

    BR,
    Carsten.

    Good point mate 😀

    But if you think the set of characters in her last name is bad, check out her early jazz album “Gling-Gló” and take a look at the name of the tracks… 😯

    #4441
    fizze
    Participant

    hehe, funky 😀

    The stuff with the weirdest names I’ve got is probably from either Bugge Wesseltoft or the Esbjörn Svensson Trio….. those crazy Nordics 😉

    #4442
    zemanel
    Guest

    Ok, as Ron suggested I tried to replace Bj%C3%B6rk with Bj%94rk in the file names and this time it worked.
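The replacement described here can be scripted. The following is a minimal Python sketch, assuming the target file system expects CP850 byte names as in this setup; the function name `utf8_to_cp850_location` is hypothetical and nothing here is mt-daapd code.

```python
from urllib.parse import quote, unquote_to_bytes


def utf8_to_cp850_location(location: str) -> str:
    """Re-encode the percent-escapes of a Location value from UTF-8 to CP850."""
    # Turn %XX escapes back into raw bytes, then read those bytes as UTF-8 text.
    text = unquote_to_bytes(location).decode("utf-8")
    # Encode the text as CP850 and escape every byte that is not URL-safe.
    # "/" and ":" are kept literal so the file://localhost/F:/ prefix survives.
    # Raises UnicodeEncodeError if a character does not exist in CP850.
    return quote(text.encode("cp850"), safe="/:")


print(utf8_to_cp850_location(
    "file://localhost/F:/Music/Bj%C3%B6rk/Debut/07%20One%20Day.wav"
))
```

Characters that CP850 cannot represent make the function raise instead of silently producing a wrong path, which seems the safer behaviour when rewriting a library file.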

    So since I couldn’t find a conversion table online (even though there must be one somewhere), I went to Mathematica and wrote a small conversion tool from UTF-8 to code page 850 (CP850) for Western European languages, which defines 128 additional “special” characters beyond ASCII. Here is the result for the ones that are likely to appear in a file name.

    The table format is as follows: A %BB%CC %DD

    “A” is the special character
    “%BB%CC” is the way iTunes writes it in the XML file
    “%DD” is what it should be in the XML file instead of “%BB%CC”

    Char UTF-8 CP850
    Ç %C3%87 %80

    ü %C3%BC %81

    é %C3%A9 %82

    â %C3%A2 %83

    ä %C3%A4 %84

    à %C3%A0 %85

    å %C3%A5 %86

    ç %C3%A7 %87

    ê %C3%AA %88

    ë %C3%AB %89

    è %C3%A8 %8A

    ï %C3%AF %8B

    î %C3%AE %8C

    ì %C3%AC %8D

    Ä %C3%84 %8E

    Å %C3%85 %8F

    É %C3%89 %90

    æ %C3%A6 %91

    Æ %C3%86 %92

    ô %C3%B4 %93

    ö %C3%B6 %94

    ò %C3%B2 %95

    û %C3%BB %96

    ù %C3%B9 %97

    ÿ %C3%BF %98

    Ö %C3%96 %99

    Ü %C3%9C %9A

    ø %C3%B8 %9B

    £ %C2%A3 %9C

    Ø %C3%98 %9D

    × %C3%97 %9E

    ƒ %C6%92 %9F

    á %C3%A1 %A0

    í %C3%AD %A1

    ó %C3%B3 %A2

    ú %C3%BA %A3

    ñ %C3%B1 %A4

    Ñ %C3%91 %A5

    ª %C2%AA %A6

    º %C2%BA %A7

    ¿ %C2%BF %A8

    ® %C2%AE %A9

    ¬ %C2%AC %AA

    ½ %C2%BD %AB

    ¼ %C2%BC %AC

    ¡ %C2%A1 %AD

    « %C2%AB %AE

    » %C2%BB %AF

    Á %C3%81 %B5

    Â %C3%82 %B6

    À %C3%80 %B7

    © %C2%A9 %B8

    ¢ %C2%A2 %BD

    ¥ %C2%A5 %BE

    ã %C3%A3 %C6

    Ã %C3%83 %C7

    ¤ %C2%A4 %CF

    ð %C3%B0 %D0

    Ð %C3%90 %D1

    Ê %C3%8A %D2

    Ë %C3%8B %D3

    È %C3%88 %D4

    Í %C3%8D %D6

    Î %C3%8E %D7

    Ï %C3%8F %D8

    ¦ %C2%A6 %DD

    Ì %C3%8C %DE

    Ó %C3%93 %E0

    ß %C3%9F %E1

    Ô %C3%94 %E2

    Ò %C3%92 %E3

    õ %C3%B5 %E4

    Õ %C3%95 %E5

    µ %C2%B5 %E6

    þ %C3%BE %E7

    Þ %C3%9E %E8

    Ú %C3%9A %E9

    Û %C3%9B %EA

    Ù %C3%99 %EB

    ý %C3%BD %EC

    Ý %C3%9D %ED

    ¯ %C2%AF %EE

    ´ %C2%B4 %EF

    ­ %C2%AD %F0

    ± %C2%B1 %F1

    ¾ %C2%BE %F3

    ¶ %C2%B6 %F4

    § %C2%A7 %F5

    ÷ %C3%B7 %F6

    ¸ %C2%B8 %F7

    ° %C2%B0 %F8

    ¨ %C2%A8 %F9

    · %C2%B7 %FA

    ¹ %C2%B9 %FB

    ³ %C2%B3 %FC

    ² %C2%B2 %FD
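A table like the one above can also be regenerated from a codec rather than typed by hand. This is a sketch in Python rather than the Mathematica tool mentioned in the post; the function name `cp850_table` is hypothetical, and it lists all 128 upper-half bytes, including the box-drawing characters omitted above.

```python
from urllib.parse import quote


def cp850_table():
    """Return (character, UTF-8 escapes, CP850 escape) rows for bytes 0x80-0xFF."""
    rows = []
    for byte in range(0x80, 0x100):
        # Python's cp850 codec maps every byte in this range to a character.
        char = bytes([byte]).decode("cp850")
        # quote() percent-encodes the character's UTF-8 bytes.
        rows.append((char, quote(char), f"%{byte:02X}"))
    return rows


for char, utf8_form, cp850_form in cp850_table():
    print(char, utf8_form, cp850_form)
```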

    Ron, if you want I can send you the table in a txt file, formatted in a way that makes it easier for you to implement the transformation rules in mt-daapd.

    Perhaps there could be an option in the mt-daapd.conf file where people with ntfs drives attached to a slug could turn the option value to 1 thereby telling mt-daapd how to interpret/translate the UTF-8 paths in the XML correctly?

    #4443
    Anonymous
    Inactive

    I’ll try to add to the confusion 🙂
    Aren’t charset conversions handled by iconv in GNU C? Is there a platform-agnostic command for charset conversion?
    On my Samba server I have specified dos charset CP850 and unix charset ISO-8859-1. So DOS clients (I don’t know whether that means just DOS or everything from Windows) write the filename as CP850, and Samba converts it and stores it as ISO-8859-1 on the server.
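As one platform-independent illustration, the same kind of conversion that `iconv -f CP850 -t ISO-8859-1` performs can be sketched with Python's built-in codecs. The function name `convert_charset` is hypothetical, and this only mirrors the conversion step, not what Samba does internally.

```python
def convert_charset(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from one charset and re-encode them in another."""
    # Raises UnicodeEncodeError when the target charset lacks a character.
    return data.decode(source).encode(target)


cp850_name = b"Bj\x94rk"  # "Björk" as a CP850 client would write it
print(convert_charset(cp850_name, "cp850", "iso-8859-1"))
```

Not every CP850 character exists in ISO-8859-1 (for example “ƒ” and the box-drawing characters), so such a conversion can fail for some names.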

    /a

    #4444
    rpedde
    Participant

    @zemanel wrote:

    Ron, if you want I can send you the table in a txt file, formatted in a way that makes it easier for you to implement the transformation rules in mt-daapd.

    Except it’s not that easy. There are dozens of code pages, each with its own mapping. Also, how the conversions work differs based on what kind of file system the data is sitting on. NTFS allows anything, so that’s easy. FAT32 won’t do some characters, so it has its own unique translation strategy. Each file system has a different one. Samba also does translation, as do NFS and atalk….

    In short, there are probably tens of thousands of translations, based on destination file system, transport, and codepage.

    The cheapest way to do it is probably to try converting the .xml file to utf-8, utf-16, and codepage for current locale, testing to see if the file exists at that path and going from there.
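That try-and-test strategy might look roughly like the Python sketch below. It is only an illustration under stated assumptions, not mt-daapd code: the candidate list, the function name `resolve_path`, and the injectable `exists` check are all hypothetical, and UTF-16 is left out because POSIX byte paths cannot contain NUL bytes.

```python
import os
from urllib.parse import unquote

# Assumed candidates for this sketch; a real list would depend on the
# file system, the transport (Samba, NFS, ...) and the current locale.
CANDIDATE_ENCODINGS = ("utf-8", "cp850", "iso-8859-1")


def resolve_path(encoded_path, encodings=CANDIDATE_ENCODINGS, exists=os.path.exists):
    """Return (path_bytes, encoding) for the first candidate that exists, else None."""
    # The XML stores percent-escaped UTF-8, so decode that to text first.
    text = unquote(encoded_path)
    for encoding in encodings:
        try:
            candidate = text.encode(encoding)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the name at all
        if exists(candidate):
            return candidate, encoding
    return None
```

Passing the existence check in as a parameter keeps the lookup testable without touching a real disk; by default it uses `os.path.exists`, which accepts byte paths on POSIX systems.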

    Beyond that, it will be difficult. But I’ll start there and see what happens. Might be that that can fix 99% of the cases, and I’d be happy enough with that.

  • The forum ‘Setup Issues’ is closed to new topics and replies.