[Booki-dev] Does-booki-have-this-book server
raj kumar
rkumar at archive.org
Sat Oct 16 20:29:07 PDT 2010
Wow.. awesome! Thanks, Douglas!
On Oct 16, 2010, at 4:00 PM, Douglas Bagnall wrote:
> hi,
>
> In discussions between Raj, Adam, and myself, we realised that the
> Internet Archive servers will often need to ask Booki whether an
> Archive book has a) been imported into Booki, and b) been changed in
> Booki. The Internet Archive will use this to offer people an
> opportunity to correct books and, if a corrected version exists,
> download the corrected version.
>
> The trouble with this is that Booki doesn't index its books by Archive
> ID, nor by whether they have been edited, and even if it did, its
> framework is more geared toward rich interaction than fast look-ups.
> Meanwhile the Archive has two million books and goodness knows how
> many readers.
>
> We decided we needed an intermediate server that periodically asks
> Booki for information about Archive-sourced books. That means Booki
> only needs to trawl through its database every minute or two, while
> the Archive can get instant answers from an in-memory store.
>
> The new server is provisionally called Boouncer (from gatekeeper ->
> doorman -> bouncer -> splice with booki; pronunciation is up to you),
> and is in git on the booki-dev server:
>
> http://booki-dev.flossmanuals.net/git?p=boouncer.git
>
> As this also involves changes to Booki, it is so far only running
> against my test server. The examples below use
> does-booki-have-this.halo.gen.nz and booki.halo.gen.nz, which are not
> permanent URLs.
>
> How it works 1: Epub links
> ==========================
>
> Every time an IA book details page is loaded, the Archive server asks
> whether a corrected epub exists. If it does exist, it wants the URL.
> "Corrected" means changed in Booki, and "exists" includes epubs
> generated on the fly (which at this stage means all of them).
>
> The general form for this type of request is:
>
> /<id-scheme>/epub/<id>
>
> For Internet Archive ids, the id-scheme is "archive.org".
>
> If the book exists and has been edited, the reply is a "302 Found"
> redirect whose Location header contains the epub URL. For example:
>
> http://does-booki-have-this.halo.gen.nz/archive.org/epub/fairerthandayfo00hugggoog
>
> If the book doesn't exist or hasn't been edited, the result is a "404
> Not Found" error. E.g.:
>
> http://does-booki-have-this.halo.gen.nz/archive.org/epub/NOT-a-book
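
A caller on the Archive side could read the redirect target without following it. The following is only a minimal sketch of that idea: the function names are hypothetical, the host is the temporary test server named above, and it assumes a plain GET is answered with either 302 or 404 as described.

```python
import http.client
from urllib.parse import quote

# Temporary test host from the email; not a permanent URL.
BOOUNCER_HOST = "does-booki-have-this.halo.gen.nz"


def request_path(id_scheme, mode, book_id):
    """Build /<id-scheme>/<mode>/<id>, percent-encoding each segment."""
    return "/" + "/".join(quote(part, safe="") for part in (id_scheme, mode, book_id))


def interpret_reply(status, location):
    """Return the redirect target for a 302 reply, or None for a 404."""
    if status == 302 and location:
        return location
    if status == 404:
        return None
    raise ValueError("unexpected reply: %r" % (status,))


def corrected_epub_url(book_id, host=BOOUNCER_HOST):
    """Ask for a corrected epub; http.client never follows redirects itself."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("GET", request_path("archive.org", "epub", book_id))
        reply = conn.getresponse()
        return interpret_reply(reply.status, reply.getheader("Location"))
    finally:
        conn.close()
```

Splitting the path building and the reply handling from the network call keeps the two pure functions easy to check without a live server.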
>
> How it works 2: Edit links
> ==========================
>
> On every Archive book details page there will be a link to edit the
> book in Booki. If that book is already in Booki, the link should take
> you to the existing edit page. If more than one copy of the book
> exists, it should take you to the "best" one, which is defined as the
> first edited one, or the first overall if none have been edited.
>
> If the book is not in Booki, this link should import it for you and
> send you to the edit page. At present Boouncer DOESN'T construct a
> URL to import the book: either it can be changed so it does, or IA
> and/or Booki can deal with that some other way. In any case I think
> Booki will need tweaking to allow people to jump straight into editing.
>
> The form is:
>
> /<id-scheme>/edit/<id>
>
> where, as above, id-scheme is "archive.org" for our purposes. This
> redirects to the edit interface (via "302 Found"):
>
> http://does-booki-have-this.halo.gen.nz/archive.org/edit/fairerthandayfo00hugggoog
>
> and this does not; it returns "404 Not Found":
>
> http://does-booki-have-this.halo.gen.nz/archive.org/edit/no-book-here
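
The "best copy" rule above (first edited copy, else first copy overall) can be sketched as a small function. This is an illustration only: the (edit_url, edited) tuple layout and the name pick_best are assumptions, and the list is assumed to arrive already in whatever order Booki treats as "first".

```python
def pick_best(copies):
    """Pick the edit URL of the best copy of a book.

    copies: list of (edit_url, edited) tuples, already in Booki's order.
    Returns the first edited copy's URL, else the first copy's URL,
    else None when the book is not in Booki at all.
    """
    for edit_url, edited in copies:
        if edited:
            return edit_url
    return copies[0][0] if copies else None
```

A None result corresponds to the not-in-Booki case, which currently ends in a 404 rather than an import link.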
>
> Other ID schemes
> ================
>
> This redirect system works for other kinds of ID. The id-scheme
> corresponds to the scheme attribute of a dc:identifier element in an
> epub's metadata. So if the original epub had an identifier like this:
>
>   <metadata>
>     <dc:identifier scheme="ISBN">978-0-14-050630-3</dc:identifier>
>     <!-- possibly other identifiers here too... -->
>   </metadata>
>
> Then the URL /ISBN/edit/978-0-14-050630-3 would find its edit page in
> Booki. This is probably of no use to anyone, and ID schemes are not
> widely used, but there you go.
>
> Books imported from the Internet Archive are given an implicit
> scheme="archive.org". This is new, so previously imported books won't
> be found.
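
To make the mapping from identifiers to URLs concrete, here is a rough sketch that reads dc:identifier elements and builds the matching edit paths. It is not taken from Boouncer or Booki: edit_paths is a hypothetical name, and the sample adds an xmlns:dc declaration (the standard Dublin Core namespace) that the snippet above leaves out, since the XML would not parse without it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Dublin Core elements namespace, assumed to be what the dc: prefix is bound to.
DC_NS = "http://purl.org/dc/elements/1.1/"


def edit_paths(metadata_xml):
    """Return /<id-scheme>/edit/<id> for every identifier that has a scheme."""
    root = ET.fromstring(metadata_xml)
    paths = []
    for element in root.iter("{%s}identifier" % DC_NS):
        scheme = element.get("scheme")  # unprefixed attribute, as in the example
        value = (element.text or "").strip()
        if scheme and value:
            paths.append("/%s/edit/%s" % (quote(scheme, safe=""), quote(value, safe="")))
    return paths


SAMPLE = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier scheme="ISBN">978-0-14-050630-3</dc:identifier>
  <dc:identifier>no-scheme-here</dc:identifier>
</metadata>"""

print(edit_paths(SAMPLE))
```

Identifiers without a scheme attribute are skipped, because there would be nothing to put in the first path segment.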
>
> What Booki provides
> ===================
>
> The following refers to the "booklist" branch of Booki at
> http://booki-dev.flossmanuals.net/git?p=booki.git;a=shortlog;h=refs/heads/booklist
>
> URLs like /list-books-by-id/<id-scheme>.json provide a JSON summary of
> books possessing IDs in that scheme. For example:
>
> http://booki.halo.gen.nz/list-books-by-id/archive.org.json
>
> gets all the archive books. The JSON is structured like this:
>
>   {
>     ID : {
>       'epub': URL or null,
>       'edit': URL or null
>     },...
>   }
>
> That is, for each ID there is a mapping from modes ('edit', 'epub') to
> URLs. If there is no valid URL (e.g., a corrected epub is unavailable
> because the book hasn't been changed), then null is used. Currently
> an edit link is never null.
>
> Here's an example:
>
>   {
>     "fairerthandayfo00hugggoog": {
>       "edit": "http://booki.halo.gen.nz/fairer-than-day-for-sunday-school-and-revival-work/edit/",
>       "epub": "http://objavi.halo.gen.nz/objavi.cgi?destination=download&book=fairer-than-day-for-sunday-school-and-revival-work&mode=epub&server=booki.halo.gen.nz"
>     },
>     "bijdragenototde00valegoog": {
>       "edit": "http://booki.halo.gen.nz/kauri/edit/",
>       "epub": null
>     }
>   }
>
> That shows one book with changes and one book without.
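
One way to picture the intermediate server's in-memory store: load each scheme's JSON listing into a dict, then answer lookups with a status and a URL. This is a sketch under assumptions, not the Boouncer code: the class and method names are invented, and the periodic fetch from Booki is left out so that only the data handling is shown.

```python
import json


class BookStore:
    """In-memory map: id-scheme -> book id -> {'epub': ..., 'edit': ...}."""

    def __init__(self):
        self._by_scheme = {}

    def refresh(self, id_scheme, json_text):
        """Replace one scheme's listing with freshly fetched JSON."""
        self._by_scheme[id_scheme] = json.loads(json_text)

    def lookup(self, id_scheme, mode, book_id):
        """Return (302, url) when a URL is known, else (404, None)."""
        url = self._by_scheme.get(id_scheme, {}).get(book_id, {}).get(mode)
        if url is None:  # unknown book, unknown mode, or JSON null
            return 404, None
        return 302, url


LISTING = """{
  "bijdragenototde00valegoog": {
    "edit": "http://booki.halo.gen.nz/kauri/edit/",
    "epub": null
  }
}"""

store = BookStore()
store.refresh("archive.org", LISTING)
print(store.lookup("archive.org", "edit", "bijdragenototde00valegoog"))
print(store.lookup("archive.org", "epub", "bijdragenototde00valegoog"))
```

Because a JSON null and a missing key both come back as None, an unedited book and an unknown book give the same 404, which matches the behaviour described for the epub links.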
>
> TODO, questions
> ===============
>
> 1. Perhaps "epub" should be changed to "corrected-epub", in case at
> some point in the future we want to know about uncorrected epubs.
>
> 2. Actually, all the URL strings could be reviewed, and maybe the
> Booki URL needs to move to fit into the API plan.
>
> 3. The import-and-edit mechanism needs to be worked out (see under
> "edit links" above).
>
> 4. Booki changes need to be merged into the mainline, and Boouncer put
> into production.
>
> 5. Corrected epubs can be cached in e.g. the Archive S3 servers, but
> Booki needs to learn about this.
>
> 6. Is a redirection the right thing? For the edit links, it means
> there is a single link to follow for all the different cases (book not
> in Booki, book in Booki, book in Booki in several instances), but for
> the epub links it is more likely to be "unpacked" than followed. I
> have assumed that reading a URL from an HTTP header is as easy as
> reading it from an HTTP body, but possibly that is not so.
>
> 7. Silly name. Ideas?
>
> That's all, I think.
>
> Douglas