[Booki-dev] Does-booki-have-this-book server

Sat Oct 16 20:29:07 PDT 2010

Wow.. awesome! Thanks, Douglas!

On Oct 16, 2010, at 4:00 PM, Douglas Bagnall wrote:

> hi,
> 
> In discussions between Raj, Adam, and myself, we realised that the
> Internet Archive servers will often need to ask Booki whether an
> Archive book has a) been imported into Booki, and b) been changed in
> Booki.  The Internet Archive will use this to offer people an
> opportunity to correct books and, if a corrected version exists,
> download the corrected version.
> 
> The trouble with this is that Booki doesn't index its books by Archive
> ID, nor by whether they have been edited, and even if it did, its
> framework is more geared toward rich interaction than fast look-ups.
> Meanwhile the Archive has two million books and goodness knows how
> many readers.
> 
> We decided we needed an intermediate server that periodically asks
> Booki for information about Archive-sourced books.  That means Booki
> only needs to trawl through its database every minute or two, while
> the Archive can get instant answers from an in-memory store.
> 
> The new server is provisionally called Boouncer (from gatekeeper ->
> doorman -> bouncer -> splice with booki; pronunciation is up to you),
> and is in git on the booki-dev server:
> 
> http://booki-dev.flossmanuals.net/git?p=boouncer.git
> 
> As this also involves changes to Booki, it is so far only running
> against my test server. The examples below use
> does-booki-have-this.halo.gen.nz and booki.halo.gen.nz, which are not
> permanent URLs.
> 
> How it works 1: Epub links
> ==========================
> 
> Every time an IA book details page is loaded, the Archive server asks
> whether a corrected epub exists.  If it does exist, it wants the URL.
> "Corrected" means changed in Booki, and "exists" includes epubs
> generated on the fly (which at this stage means all of them).
> 
> The general form for this type of request is:
> 
> /<id-scheme>/epub/<id>
> 
> For Internet Archive ids, the id-scheme is "archive.org".
> 
> If the book exists and has been edited, the reply is a "302 Found"
> redirection to the epub.  This contains the epub URL in the Location
> header. For example:
> 
> http://does-booki-have-this.halo.gen.nz/archive.org/epub/fairerthandayfo00hugggoog
> 
> If the book doesn't exist or hasn't been edited, the result is a "404
> Not Found" error. E.g.:
> 
> http://does-booki-have-this.halo.gen.nz/archive.org/epub/NOT-a-book
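
A minimal Python sketch of how a client might read these replies. The function names (request_path, interpret_reply, lookup) are made up for illustration and are not part of Boouncer. It uses http.client because that module does not follow redirects by itself, so the Location header can be read directly.

```python
import http.client
from urllib.parse import quote


def request_path(scheme, mode, book_id):
    """Build /<id-scheme>/<mode>/<id>; mode is "epub" or "edit"."""
    return "/%s/%s/%s" % (quote(scheme, safe=""), mode, quote(book_id, safe=""))


def interpret_reply(status, location):
    """Map a reply to a URL (302 Found) or None (404 Not Found)."""
    if status == 302 and location:
        return location
    if status == 404:
        return None
    # Anything else is outside the two cases described in the post.
    raise ValueError("unexpected reply: %r" % (status,))


def lookup(host, mode, book_id, scheme="archive.org", timeout=10):
    """Ask the server at `host`; return the redirect target or None."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("GET", request_path(scheme, mode, book_id))
        reply = conn.getresponse()
        reply.read()  # drain the body; only status and headers matter here
        return interpret_reply(reply.status, reply.getheader("Location"))
    finally:
        conn.close()
```

Under these assumptions, lookup("does-booki-have-this.halo.gen.nz", "epub", "fairerthandayfo00hugggoog") would return the epub URL while that test server is up, and None for a book that is unknown or unedited.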
> 
> How it works 2: Edit links
> ==========================
> 
> On every Archive book details page there will be a link to edit the
> book in Booki.  If that book is already in Booki, the link should take
> you to the existing edit page.  If more than one copy of the book
> exists, it should take you to the "best" one, which is defined as the
> first edited one, or the first overall if none have been edited.
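
The "best" rule above could be sketched like this in Python. The list-of-dicts shape and the "edited" key are assumptions for the example, not Booki's actual data model; the list is taken to be in the order the copies were created.

```python
def best_copy(copies):
    """Return the first edited copy, else the first copy, else None."""
    for copy in copies:
        if copy["edited"]:
            return copy
    # No copy has been edited: fall back to the first overall, if any.
    return copies[0] if copies else None
```

For example, given copies "a" (unedited), "b" (edited) and "c" (edited) in that order, this picks "b".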
> 
> If the book is not in Booki, this link should import it for you and
> send you to the edit page.  At present Boouncer DOESN'T construct a
> URL to import the book: either it can be changed so it does, or IA
> and/or Booki can deal with that some other way.  In any case I think
> Booki will need tweaking to allow people to jump straight into editing.
> 
> The form is:
> 
> /<id-scheme>/edit/<id>
> 
> where, as above, id-scheme is "archive.org" for our purposes.  This
> redirects to the edit interface (via "302 Found"):
> 
> http://does-booki-have-this.halo.gen.nz/archive.org/edit/fairerthandayfo00hugggoog
> 
> and this does not; it returns "404 Not Found":
> 
> http://does-booki-have-this.halo.gen.nz/archive.org/edit/no-book-here
> 
> Other ID schemes
> ================
> 
> This redirect system works for other kinds of ID.  The id-scheme
> corresponds to the scheme attribute of a dc:identifier element in an
> epub's metadata.  So if the original epub had an identifier like this:
> 
> <metadata>
>  <dc:identifier scheme="ISBN">978-0-14-050630-3</dc:identifier>
>  <!-- possibly other identifiers here too... --> 
> </metadata>
> 
> Then the URL /ISBN/edit/978-0-14-050630-3 would find its edit page in
> Booki.  This is probably of no use to anyone, and ID schemes are not
> widely used, but there you go.
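
A rough Python sketch of how identifiers like that map onto edit URLs. The function name edit_paths is made up, and the sample adds an xmlns:dc declaration (left out of the snippet above) so that the XML parses on its own; how Booki actually reads the metadata is not shown here.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace, as used for dc:identifier in epub metadata.
DC = "http://purl.org/dc/elements/1.1/"


def edit_paths(metadata_xml):
    """Return /<id-scheme>/edit/<id> for each identifier that has a scheme."""
    root = ET.fromstring(metadata_xml)
    paths = []
    for element in root.iter("{%s}identifier" % DC):
        scheme = element.get("scheme")
        if scheme and element.text:
            paths.append("/%s/edit/%s" % (scheme, element.text.strip()))
    return paths


SAMPLE = """
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
 <dc:identifier scheme="ISBN">978-0-14-050630-3</dc:identifier>
</metadata>
"""
```

With that sample, edit_paths(SAMPLE) gives ["/ISBN/edit/978-0-14-050630-3"].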
> 
> Books imported from the Internet Archive are given an implicit
> scheme="archive.org".  This is new, so previously imported books won't
> be found.
> 
> What Booki provides
> ===================
> 
> The following refers to the "booklist" branch of Booki at
> http://booki-dev.flossmanuals.net/git?p=booki.git;a=shortlog;h=refs/heads/booklist
> 
> URLs like /list-books-by-id/<id-scheme>.json provide a JSON summary of
> books possessing IDs in that scheme.  For example:
> 
> http://booki.halo.gen.nz/list-books-by-id/archive.org.json
> 
> gets all the archive books.  The JSON is structured like this:
> 
> {
>  ID: {
>      "epub": URL or null,
>      "edit": URL or null
>  }, ...
> }
> 
> That is, for each ID there is a mapping from modes ('edit', 'epub') to
> URLs.  If there is no valid URL (e.g., a corrected epub is unavailable
> because the book hasn't been changed), then null is used.  Currently
> an edit link is never null.
> 
> Here's an example:
> 
> {
>    "fairerthandayfo00hugggoog": {
>        "edit": "http://booki.halo.gen.nz/fairer-than-day-for-sunday-school-and-revival-work/edit/", 
>        "epub": "http://objavi.halo.gen.nz/objavi.cgi?destination=download&book=fairer-than-day-for-sunday-school-and-revival-work&mode=epub&server=booki.halo.gen.nz"
>    }, 
>    "bijdragenototde00valegoog": {
>        "edit": "http://booki.halo.gen.nz/kauri/edit/", 
>        "epub": null
>    }
> }
> 
> That shows one book with changes and one book without.
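
To illustrate the in-memory look-up idea, here is a small Python sketch that turns a book list like the one above into 302/404 answers. It is not Boouncer's actual code; build_index and answer are hypothetical names, and the periodic fetch from Booki is left out.

```python
import json

# The example book list from the post, verbatim.
SAMPLE = '''
{
    "fairerthandayfo00hugggoog": {
        "edit": "http://booki.halo.gen.nz/fairer-than-day-for-sunday-school-and-revival-work/edit/",
        "epub": "http://objavi.halo.gen.nz/objavi.cgi?destination=download&book=fairer-than-day-for-sunday-school-and-revival-work&mode=epub&server=booki.halo.gen.nz"
    },
    "bijdragenototde00valegoog": {
        "edit": "http://booki.halo.gen.nz/kauri/edit/",
        "epub": null
    }
}
'''


def build_index(booklist_json):
    """Parse the JSON summary into a dict of {id: {mode: URL or None}}."""
    return json.loads(booklist_json)


def answer(index, mode, book_id):
    """Return (302, url) if a URL is known for this mode, else (404, None)."""
    url = (index.get(book_id) or {}).get(mode)
    if url:
        return 302, url
    return 404, None
```

Here answer(build_index(SAMPLE), "epub", "bijdragenototde00valegoog") is (404, None) because the epub entry is null, while the "edit" mode for the same ID gives (302, "http://booki.halo.gen.nz/kauri/edit/"). Replacing the whole dict after each fetch would keep look-ups independent of Booki's database.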
> 
> TODO, questions
> ===============
> 
> 1. Perhaps "epub" should be changed to "corrected-epub", in case at
> some point in the future we want to know about uncorrected epubs.
> 
> 2. Actually, all the URL strings could be reviewed, and maybe the
> Booki URL needs to move to fit into the API plan.
> 
> 3. The import-and-edit mechanism needs to be worked out (see under
> "edit links" above).
> 
> 4. Booki changes need to be merged into the mainline, and Boouncer put
> into production.
> 
> 5. Corrected epubs can be cached in e.g. the Archive S3 servers, but
> Booki needs to learn about this.
> 
> 6. Is a redirection the right thing?  For the edit links, it means
> there is a single link to follow for all the different cases (book not
> in Booki, book in Booki, book in Booki in several instances), but for
> the epub links it is more likely to be "unpacked" than followed.  I
> have assumed that reading a URL from an HTTP header is as easy as
> reading it from an HTTP body, but possibly that is not so.
> 
> 7. Silly name. Ideas?
> 
> That's all, I think.
> 
> Douglas
> _______________________________________________
> Booki-dev mailing list
> Booki-dev at lists.flossmanuals.net
> http://lists.flossmanuals.net/listinfo.cgi/booki-dev-flossmanuals.net