large databases

#156
    Anonymous
    Guest

    Hello,

As a new user here, I apologize if this is posted in the wrong place.

I’ve installed the latest mt-daapd CVS snapshot on my NSLU2 running
the latest Unslung 5.5 beta, and my question relates to performance
and general operational architecture.

As I understand it, mt-daapd works by using sqlite2 to build
a database file representing the directory/file/mp3-tag structure
that exists on the storage media. It then sends parts of that
db out to music clients. On systems with LARGE numbers of music
files, the database file gets very large (in a test case on my system,
in excess of 50 MB) and burdens low-resource machines like the NSLU2.
This makes operations like browsing sit-and-wait affairs,
let alone the much-discussed scanning operations (BTW, thanks for the
-s command line option).

My question is this: why can’t this process be done on the fly?
What I would really like is for my Roku Soundbridge to mount and
play MP3s directly from an SMB share; the NSLU2, being designed
for exactly that, stays snappy even with massive numbers of
directories and files. Unfortunately, Roku seems unlikely to address
this.

Since I organize my files in a tiered directory structure
(i.e. /genre/artist/album/song), the need to have the mp3 tags included
in the database is insignificant. What I envision is mt-daapd acting
as a simple layer between the daap protocol and the filesystem, nothing
more. I have not yet studied your code base, figuring that y’all
probably have a much better idea of whether this is practical or useful.
I would be willing to apply my efforts toward this end if needed
(C programming, among other skills).

    Best regards,
    cinbody

    #3901
    rpedde
    Participant

The bottom line is that you are going to have to iterate over all the files in your library (twice) every time someone requests a song. You’ll have to build a map of path to id, and you’ll have to synthesize tag info from paths and send all that data as a stream of dmap blocks back to iTunes, or the Soundbridge, or whatever.

You can do that two ways: you can walk the filesystem and do all that work on every request, or you can do the memory- and processor-intensive part once, save it to a database (or some other cache), and then just send that when you need to. What I’m saying is that the scanning trade-off is what makes something as underpowered as the slug work as a music server at all.
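For illustration, here is a rough sketch of what “synthesize tag info from paths” means for a /genre/artist/album/song layout. The struct and function names are made up for this example; this is not mt-daapd code:

/* Hypothetical sketch: derive tag fields from a path laid out as
 * /genre/artist/album/song.mp3.  Assumes exactly four levels;
 * deeper or shallower trees would need real handling. */
#include <stdio.h>
#include <string.h>

struct fake_tags {
    char genre[256], artist[256], album[256], title[256];
};

static int tags_from_path(const char *path, struct fake_tags *t) {
    char buf[1024];
    char *parts[4];
    int n = 0;

    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    /* split on '/' and collect the first four components */
    for (char *p = strtok(buf, "/"); p && n < 4; p = strtok(NULL, "/"))
        parts[n++] = p;
    if (n != 4)
        return -1;

    snprintf(t->genre,  sizeof(t->genre),  "%s", parts[0]);
    snprintf(t->artist, sizeof(t->artist), "%s", parts[1]);
    snprintf(t->album,  sizeof(t->album),  "%s", parts[2]);
    snprintf(t->title,  sizeof(t->title),  "%s", parts[3]);

    /* strip the file extension from the title */
    char *dot = strrchr(t->title, '.');
    if (dot) *dot = '\0';
    return 0;
}

int main(void) {
    struct fake_tags t;
    if (tags_from_path("/Industrial/Consolidated/Some Album/Some Song.mp3", &t) == 0)
        printf("%s / %s / %s / %s\n", t.genre, t.artist, t.album, t.title);
    return 0;
}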

    As far as the slug goes, there really aren’t any trade-offs possible. You can’t trade disk i/o for memory, or memory for cpu, because the disk i/o, the memory, and the cpu on the slug all suck. 🙂

    It may be that large libraries just aren’t feasible on the slug. Dunno.

That said, I think performance on it can be improved by indexing the browse fields, but I dunno; nobody I’ve asked with a database that big has gotten back to me with the results of that experiment.

    You willing to try messing with it in an effort to improve performance?

    #3902
    rpedde
    Participant

dmap is like mp4 atoms: it’s length-encoded, serialized XML. If, for example, an iTunes client says “give me a complete list of all song names, artist names, and album names”, then you have to build a dmap block for each song. That block looks something like this:


0c adbs (daap.databasesongs) -
   05 mstt (dmap.status) - 00c8 (200)
   01 muty (dmap.updatetype) - 00 (0)
   05 mtco (dmap.specifiedtotalcount) - 1743 (5955)
   05 mrco (dmap.returnedcount) - 1743 (5955)
   0c mlcl (dmap.listing) -
      0c mlit (dmap.listingitem) -
         01 mikd (dmap.itemkind) - 02 (2)
         09 asal (daap.songalbum) - Consolidated
         09 asar (daap.songartist) - Consolidated
         09 minm (dmap.itemname) - Consolidated (Adorno Strength Version)

See how there are containers there? The databasesongs container holds some status info, plus a container with the song listing. The song listing is a container that holds multiple listingitems (themselves containers), each of which holds the info for one song.

Well and good. The crappy thing about this design is that each container also carries exactly how many bytes long it is. So the adbs container’s size is the size of the 4 status elements plus the size of the mlcl container. The mlcl container’s size is the sum of the mlit containers under it. And each mlit container’s size is the sum of the sizes of the elements in it.

    So before you even send the dmap stream, you have to know exactly how big it is. That’s why I’m saying it’s a two-pass deal… you have to know the size of the data you are sending before you send it.
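A minimal sketch of the two-pass problem, using the wire layout of tag + 4-byte big-endian length + payload. This is simplified (a real mlit carries more elements, and this isn’t the mt-daapd encoder), but it shows why the inner sizes have to be computed before the outer header can be written:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl */

/* pass 1: how big will one string element be on the wire? */
static uint32_t dmap_string_size(const char *value) {
    return 4 + 4 + (uint32_t)strlen(value);   /* tag + length + data */
}

/* pass 2: actually emit tag, length, payload */
static void dmap_emit_string(FILE *out, const char *tag, const char *value) {
    uint32_t len = htonl((uint32_t)strlen(value));
    fwrite(tag, 1, 4, out);
    fwrite(&len, 1, 4, out);
    fwrite(value, 1, strlen(value), out);
}

int main(void) {
    const char *artist = "Consolidated";
    const char *title  = "Consolidated (Adorno Strength Version)";

    /* the mlit container's length is the sum of its children,
     * so it must be known before the container header goes out */
    uint32_t inner = dmap_string_size(artist) + dmap_string_size(title);
    uint32_t len = htonl(inner);

    fwrite("mlit", 1, 4, stdout);
    fwrite(&len, 1, 4, stdout);
    dmap_emit_string(stdout, "asar", artist);
    dmap_emit_string(stdout, "minm", title);
    return 0;
}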

Now, here’s a thought experiment. Imagine the speed necessary to walk through that filesystem, calculate all of that, extract the data, and so on. Then decide whether it would be faster to just pre-compute that info and save it in a database. If it’s saved in a database, you get performance advantages from file cache locality, indexes, etc.

In other words, keeping that same info in a database is *bound* to be faster. And that’s the situation that exists right now!

    So that can’t be the bottleneck. Or maybe it is… but if that’s the case, then changing to a “scan on demand and get tag info from file names” strategy would suffer from the same problems.

No, I think the bottleneck is somewhere else; I think it’s in some of the browse queries that the Roku generates. The Roku browses for songs by a particular artist, for example, and those queries look like “select distinct artist from songs” or “select distinct artist from songs where foo”. My guess is that creating an index on the columns the Soundbridge queries on (artist, genre, composer, album) would make those results come back much quicker.
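Something like the following would try that, going through the sqlite2 C API. The songs table and column names come from the queries in this thread; the index names and db path are made up, and the same statements could just as well be run from the sqlite shell:

#include <stdio.h>
#include <sqlite.h>   /* sqlite2 API */

int main(void) {
    char *err = NULL;
    /* path is an example; mode argument is ignored by sqlite2 */
    sqlite *db = sqlite_open("/path/to/songs.db", 0, &err);
    if (!db) {
        fprintf(stderr, "open failed: %s\n", err ? err : "unknown");
        return 1;
    }

    /* one index per browse column the Soundbridge hits */
    const char *ddl =
        "CREATE INDEX idx_artist   ON songs(artist);"
        "CREATE INDEX idx_album    ON songs(album);"
        "CREATE INDEX idx_genre    ON songs(genre);"
        "CREATE INDEX idx_composer ON songs(composer);";

    if (sqlite_exec(db, ddl, NULL, NULL, &err) != SQLITE_OK)
        fprintf(stderr, "index creation failed: %s\n", err ? err : "unknown");

    sqlite_close(db);
    return 0;
}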

Another potential point of slowdown is playlists. Static playlists in particular seem slow… they look something like:


    select * from songs where id in (select distinct songid from playlistitems where playlist=)

    Now I *know* that’s a painful query when there are lots of items in the playlistitems table.

The thing to do would be to run it with a logfile and “-d9 -Ddatabase” and watch the queries. Use the timestamps to figure out how long each query takes and see what indexes you need to speed it up.
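If you’d rather time a suspect query outside the log, a rough harness against the sqlite2 API could look like this. The db path and playlist id are placeholders, and the query is the static-playlist one above rewritten as a join, which might behave better than the IN subquery (note it can return duplicates if a song appears twice in the playlist):

#include <stdio.h>
#include <sys/time.h>
#include <sqlite.h>

/* callback just counts rows returned */
static int count_cb(void *arg, int ncols, char **vals, char **names) {
    (void)ncols; (void)vals; (void)names;
    ++*(int *)arg;
    return 0;
}

int main(void) {
    char *err = NULL;
    int rows = 0;
    struct timeval t0, t1;

    sqlite *db = sqlite_open("/path/to/songs.db", 0, &err);
    if (!db) { fprintf(stderr, "open: %s\n", err ? err : "?"); return 1; }

    const char *sql =
        "SELECT songs.* FROM songs, playlistitems "
        " WHERE songs.id = playlistitems.songid "
        "   AND playlistitems.playlist = 1";   /* playlist id is a placeholder */

    gettimeofday(&t0, NULL);
    sqlite_exec(db, sql, count_cb, &rows, &err);
    gettimeofday(&t1, NULL);

    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                (t1.tv_usec - t0.tv_usec) / 1000.0;
    printf("%d rows in %.1f ms\n", rows, ms);

    sqlite_close(db);
    return 0;
}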

    I don’t have any problems, because my tables have only ~6k items. Not enough to be a performance problem!

    I’ll probably have to build a bogus db for testing. 🙂

    #3903
    rpedde
    Participant

Actually, it’s almost exactly the other way around. Almost all the “smarts” are on the client; the server is pretty dumb. Sorting, searching, almost all of that is done on the client.

For running in debug, set up a log file in your config, and then add “-d9 -Ddatabase,daap” to whatever starts mt-daapd (your init.d script?). You’ll get a lot of crap, but you’ll also get queries you can recognize as the ones fetching various pieces of information. With -Ddatabase,daap you should see all db queries, as well as info about sending the data back to the client, so you should be able to see which parts take the longest.

As far as splitting into multiple databases goes, that might help, in that I think the NSLU2’s biggest problem is memory starvation: it can’t keep enough free memory to build a decent cache for sqlite.

With a smaller db, it might do better that way. I still think that with the right indexes it would perform decently even on an NSLU2, though.

    If not, I’ve fixed much of the sql to allow me to put the gdbm database back in. If I can’t get the sql to perform as well as I want, I’ll go back to gdbm.

    — Ron
