[Booki-dev] Does-booki-have-this-book server

Sat Oct 16 16:00:33 PDT 2010

hi,

In discussions between Raj, Adam, and myself, we realised that the
Internet Archive servers will often need to ask Booki whether an
Archive book has a) been imported into Booki, and b) been changed in
Booki.  The Internet Archive will use this to offer people an
opportunity to correct books and, if a corrected version exists,
download the corrected version.

The trouble with this is that Booki doesn't index its books by Archive
ID, nor by whether they have been edited, and even if it did, its
framework is more geared toward rich interaction than fast look-ups.
Meanwhile the Archive has two million books and goodness knows how
many readers.

We decided we needed an intermediate server that periodically asks
Booki for information about Archive-sourced books.  That means Booki
only needs to trawl through its database every minute or two, while
the Archive can get instant answers from an in-memory store.
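As a rough illustration of that split (not Boouncer's actual code), the intermediate server can be thought of as a snapshot that is replaced wholesale on each poll and read from memory in between. The names SnapshotStore, fetch and poll_forever, and the 90 second interval, are assumptions made up for this sketch.

```python
import threading
import time


class SnapshotStore:
    """Keeps the latest answers in memory and swaps them in one step."""

    def __init__(self, fetch):
        # fetch is a hypothetical callable that asks Booki and returns a dict.
        self._fetch = fetch
        self._data = {}
        self._lock = threading.Lock()

    def refresh(self):
        """Poll once. On failure keep serving the previous snapshot."""
        try:
            fresh = self._fetch()
        except Exception:
            return False
        with self._lock:
            self._data = fresh
        return True

    def get(self, key):
        """Answer from memory, without touching Booki."""
        with self._lock:
            return self._data.get(key)


def poll_forever(store, interval=90):
    """Refresh every `interval` seconds (assumed: 'every minute or two')."""
    while True:
        store.refresh()
        time.sleep(interval)
```

Running poll_forever in a daemon thread would keep the store current while request handlers only ever call get().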

The new server is provisionally called Boouncer (from gatekeeper ->
doorman -> bouncer -> splice with booki; pronunciation is up to you),
and is in git on the booki-dev server:

http://booki-dev.flossmanuals.net/git?p=boouncer.git

As this also involves changes to Booki, it is so far only running
against my test server. The examples below use
does-booki-have-this.halo.gen.nz and booki.halo.gen.nz, which are not
permanent URLs.

How it works 1: Epub links
==========================

Every time an IA book details page is loaded, the Archive server asks
whether a corrected epub exists.  If it does exist, it wants the URL.
"Corrected" means changed in Booki, and "exists" includes epubs
generated on the fly (which at this stage means all of them).

The general form for this type of request is:

 /<id-scheme>/epub/<id>

For Internet Archive ids, the id-scheme is "archive.org".

If the book exists and has been edited, the reply is a "302 Found"
redirection to the epub.  This contains the epub URL in the Location
header. For example:

http://does-booki-have-this.halo.gen.nz/archive.org/epub/fairerthandayfo00hugggoog

If the book doesn't exist or hasn't been edited, the result is a "404
Not Found" error.  E.g.:

http://does-booki-have-this.halo.gen.nz/archive.org/epub/NOT-a-book
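From the Archive side, a check like this could be sketched as follows. This is only a sketch under assumptions: the function names are invented here, and it uses http.client because that module does not follow redirects, so the Location header can be read directly.

```python
import http.client
from urllib.parse import quote


def interpret_reply(status, location):
    """Turn a reply into a corrected-epub URL, or None for '404 Not Found'."""
    if status == 302 and location:
        return location
    if status == 404:
        return None
    raise ValueError("unexpected reply: %r" % (status,))


def corrected_epub_url(host, archive_id, timeout=5):
    """Ask the server at `host` about one Archive id (hypothetical helper)."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        # http.client never follows redirects, so a 302 comes back as-is.
        conn.request("GET", "/archive.org/epub/" + quote(archive_id, safe=""))
        reply = conn.getresponse()
        return interpret_reply(reply.status, reply.getheader("Location"))
    finally:
        conn.close()
```

For example, corrected_epub_url("does-booki-have-this.halo.gen.nz", "fairerthandayfo00hugggoog") would return the epub URL while the test server is up, and the NOT-a-book id would give None.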

How it works 2: Edit links
==========================

On every Archive book details page there will be a link to edit the
book in Booki.  If that book is already in Booki, the link should take
you to the existing edit page.  If more than one copy of the book
exists, it should take you to the "best" one, which is defined as the
first edited one, or the first overall if none have been edited.
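The "best" rule is small enough to write out.  This is a sketch only: it assumes each copy is represented as an (edit_url, edited) pair and that the list is already in whatever order "first" refers to, neither of which is specified above.

```python
def best_copy(copies):
    """Pick the edit URL of the 'best' copy of a book.

    copies: list of (edit_url, edited) pairs, in an assumed "first" order.
    Returns the first edited copy, else the first copy, else None.
    """
    for edit_url, edited in copies:
        if edited:
            return edit_url
    return copies[0][0] if copies else None
```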

If the book is not in Booki, this link should import it for you and
send you to the edit page.  At present Boouncer DOESN'T construct a
URL to import the book: either it can be changed so it does, or IA
and/or Booki can deal with that some other way.  In any case I think
Booki will need tweaking to allow people to jump straight into editing.

The form is:

 /<id-scheme>/edit/<id>

where, as above, id-scheme is "archive.org" for our purposes.  This
redirects to the edit interface (via "302 Found"):

http://does-booki-have-this.halo.gen.nz/archive.org/edit/fairerthandayfo00hugggoog

and this does not; it returns "404 Not Found":

http://does-booki-have-this.halo.gen.nz/archive.org/edit/no-book-here

Other ID schemes
================

This redirect system works for other kinds of ID.  The id-scheme
corresponds to the scheme attribute of a dc:identifier element in an
epub's metadata.  So if the original epub had an identifier like this:

<metadata>
  <dc:identifier scheme="ISBN">978-0-14-050630-3</dc:identifier>
  <!-- possibly other identifiers here too... --> 
</metadata>

Then the URL /ISBN/edit/978-0-14-050630-3 would find its edit page in
Booki.  This is probably of no use to anyone, and ID schemes are not
widely used, but there you go.
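To make the mapping concrete, here is a sketch that reads the identifiers from an epub's OPF metadata and builds the matching paths.  The function name is invented, and accepting both a plain scheme attribute (as in the snippet above) and the namespaced opf:scheme form is an assumption about what real epubs contain, not a description of what Booki does.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

DC = "http://purl.org/dc/elements/1.1/"
OPF = "http://www.idpf.org/2007/opf"


def identifier_paths(opf_xml, mode="edit"):
    """Return /<id-scheme>/<mode>/<id> for every identifier with a scheme."""
    root = ET.fromstring(opf_xml)
    paths = []
    for el in root.iter("{%s}identifier" % DC):
        # Accept opf:scheme as well as an un-namespaced scheme attribute.
        scheme = el.get("{%s}scheme" % OPF) or el.get("scheme")
        value = (el.text or "").strip()
        if scheme and value:
            paths.append("/%s/%s/%s" % (quote(scheme, safe=""), mode,
                                        quote(value, safe="")))
    return paths
```

Identifiers without a scheme are skipped, since there is no id-scheme to put in the path.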

Books imported from the Internet Archive are given an implicit
scheme="archive.org".  This is new, so previously imported books won't
be found.

What Booki provides
===================

The following refers to the "booklist" branch of Booki at
http://booki-dev.flossmanuals.net/git?p=booki.git;a=shortlog;h=refs/heads/booklist

URLs like /list-books-by-id/<id-scheme>.json provide a JSON summary of
books possessing IDs in that scheme.  For example:

http://booki.halo.gen.nz/list-books-by-id/archive.org.json

gets all the archive books.  The JSON is structured like this:

{
  ID : {
      "epub": URL or null,
      "edit": URL or null
  },...
}

That is, for each ID there is a mapping from modes ('edit', 'epub') to
URLs.  If there is no valid URL (e.g., a corrected epub is unavailable
because the book hasn't been changed), then null is used.  Currently
an edit link is never null.

Here's an example:

{
    "fairerthandayfo00hugggoog": {
        "edit": "http://booki.halo.gen.nz/fairer-than-day-for-sunday-school-and-revival-work/edit/", 
        "epub": "http://objavi.halo.gen.nz/objavi.cgi?destination=download&book=fairer-than-day-for-sunday-school-and-revival-work&mode=epub&server=booki.halo.gen.nz"
    }, 
    "bijdragenototde00valegoog": {
        "edit": "http://booki.halo.gen.nz/kauri/edit/", 
        "epub": null
    }
}

That shows one book with changes and one book without.
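Given that JSON, turning a request into a reply is a dictionary lookup.  The following is a sketch of the idea rather than the code in the boouncer repository; the function name and the (status, location) return shape are assumptions.

```python
import json


def answer(booklist, mode, book_id):
    """Map one request to (status, location).

    booklist: the parsed JSON for one id-scheme.
    mode: 'edit' or 'epub'.
    A missing book and a null URL both give 404.
    """
    url = booklist.get(book_id, {}).get(mode)
    if url:
        return 302, url
    return 404, None


# Example data, shortened from the listing above.
booklist = json.loads("""
{
    "fairerthandayfo00hugggoog": {
        "edit": "http://booki.halo.gen.nz/fairer-than-day-for-sunday-school-and-revival-work/edit/",
        "epub": "http://objavi.halo.gen.nz/objavi.cgi?destination=download&book=fairer-than-day-for-sunday-school-and-revival-work&mode=epub&server=booki.halo.gen.nz"
    },
    "bijdragenototde00valegoog": {
        "edit": "http://booki.halo.gen.nz/kauri/edit/",
        "epub": null
    }
}
""")
```

With this data, the unedited book gets a 302 for 'edit' but a 404 for 'epub', and an unknown id gets a 404 for both.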

TODO, questions
===============

1. Perhaps "epub" should be changed to "corrected-epub", in case at
some point in the future we want to know about uncorrected epubs.

2. Actually, all the URL strings could be reviewed, and maybe the
Booki URL needs to move to fit into the API plan.

3. The import-and-edit mechanism needs to be worked out (see under
"edit links" above).

4. Booki changes need to be merged into the mainline, and boouncer put
into production.

5. Corrected epubs can be cached in e.g. the Archive S3 servers, but
Booki needs to learn about this.

6. Is a redirection the right thing?  For the edit links, it means
there is a single link to follow for all the different cases (book not
in Booki, book in Booki, book in Booki in several instances), but for
the epub links the URL is more likely to be "unpacked" than followed.
I have assumed that reading a URL from an HTTP header is as easy as
reading it from an HTTP body, but possibly that is not so.

7. Silly name.  Ideas?

That's all, I think.

Douglas