[Booki-dev] changes to booki-zip format

Thu Nov 19 01:39:16 PST 2009

to me this looks good. We should put this into the dev guide asap.

I see a few small issues:
* you listed that there are 5 members in the JSON file:
"It is a JSON file [3], containing a single JSON
object with 5 members, as shown here:"

but I think there are only 4?

* it might be worth differentiating "import into booki" ("Tools that
import into Booki should import into this format.") in the first
paragraph from "import with espri". 

* it seems we are leaving the metadata namespace quite open which is
great, but should we define data that *must* be present and some that
are also recommended? For example, "creator" should always be present
even if they are to be listed as 'unknown' or 'orphaned', whereas
"contact" (contact details of the creator) could be a recommended
inclusion.

* do we need to define the 'type' options in the TOC in this doc? i.e.
define the characteristics of the booki index structure?

* can we eliminate references to 'Authors' and instead use
'Contributors'? 'Author' doesn't accurately represent the diversity of
contribution types that people make, nor does it mirror the
development model of collaborative authoring.

* what is the relationship between the 'authors' listed in the manifest
and the 'contributors' listed in the example metadata? Is 'authors'
per-chapter, and 'contributors' a list of all 'authors'?

* since it is intended that some of the books will be taken into various
formats, do we have a way to define/link layout/css for the various
types of output?

* do we need to anticipate fields for forking/branching information for
when we implement git?

* can we define cover graphics somewhere?

---some license issues:
* do we list the copyright owner in the manifest or the metadata?

* licenses should be defined on a per-chapter basis and on a per-book
basis - is there a standard way to list licenses in short form?

adam

On Thu, 2009-11-19 at 21:45 +1300, Douglas Bagnall wrote:
> Booki components use a format called "booki-zip" when they're passing
> books between themselves.  It was developed in a rush and has a few
> problems, and now is a good time to change it.
> 
> There's a proposal attached.  The main changes from the current format are:
> 
> 1. Chapter authorship information has been merged with the manifest.
> 
> 2. The TOC has become more explicit and robust.
> 
> 3. The metadata structure has ridiculously but necessarily deep nesting.
> 
> Some points that might need debate are:
> 
> * is #1 above a good idea?
> 
> * does the TOC structure really need 'nav_id' and 'type' (and if so,
> does 'type' need a constrained vocab)?
> 
> * is the metadata structure too flexible or not flexible enough? (no and
> maybe, in my opinion).
> 
> * and bikeshed number 1 is: what should the booki internal metadata
> namespace be (currently 'http://booki.cc/')?
> 
> I won't rabbit on explaining why it is like it is, but feel free to ask
> questions.
> 
> Douglas
> plain text document attachment (booki-zip-standard.txt)
> booki-zip format
> ================
> 
> This describes the booki-zip format that Booki, Espri, and Objavi use
> to communicate with each other.  Tools that import into Booki should
> import into this format.
> 
> Zip container format.
> =====================
> 
> A booki-zip file is a zip file[1], with certain restrictions.  The
> ultimate test of whether a zip is correctly encoded is whether its
> contents can be extracted by the zipfile modules[2] in Python 2.5 and
> 2.6.  This means the contents must be either uncompressed or
> deflate-compressed.  ZIP64 extensions are OK (though unnecessary in
> practical terms), but encryption and comments are not.
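
The container restrictions above could be checked roughly like this. It is a minimal sketch, assuming Python 3's zipfile module rather than the 2.5/2.6 modules the spec names; `container_problems` is a hypothetical helper, not part of Booki.

```python
import io
import zipfile

# The spec allows only uncompressed (stored) or deflate-compressed entries.
ALLOWED_COMPRESSION = (zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED)


def container_problems(fileobj):
    """Return a list of problems found; an empty list means none were seen."""
    problems = []
    with zipfile.ZipFile(fileobj) as zf:
        if zf.comment:
            problems.append("archive has a comment")
        for info in zf.infolist():
            if info.compress_type not in ALLOWED_COMPRESSION:
                problems.append("%s: unsupported compression" % info.filename)
            # Bit 0 of the general-purpose flags marks an encrypted entry.
            if info.flag_bits & 0x1:
                problems.append("%s: encrypted" % info.filename)
            if info.comment:
                problems.append("%s: entry has a comment" % info.filename)
    return problems
```

If the zip cannot be opened at all, zipfile raises an exception, which already fails the "ultimate test" described above.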
> 
> The first file in the zip should be uncompressed and named "mimetype".
> It should contain only the 23 characters "application/x-booki+zip".
> This string will end up in the first few bytes of the zip file,
> allowing it to be identified without unzipping.
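
To illustrate why "mimetype" goes first and uncompressed, here is a sketch that writes such a zip and then recognises it from its leading bytes alone. It assumes Python 3.7 or later (for the `compress_type` argument to `writestr`); the function names are made up for this example.

```python
import io
import struct
import zipfile

MIMETYPE = b"application/x-booki+zip"  # the 23 characters from the spec


def write_booki_zip(fileobj, files):
    """Write `files` (a dict of name -> bytes) with "mimetype" stored first."""
    with zipfile.ZipFile(fileobj, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            zf.writestr(name, data)


def looks_like_booki_zip(head):
    """Check the first bytes of a file without unzipping anything."""
    if len(head) < 30 or head[:4] != b"PK\x03\x04":
        return False
    # In the local file header the compression method is at offset 8,
    # and the name and extra-field lengths are at offsets 26 and 28.
    (method,) = struct.unpack("<H", head[8:10])
    name_len, extra_len = struct.unpack("<HH", head[26:30])
    start = 30 + name_len + extra_len
    return (method == 0
            and head[30:30 + name_len] == b"mimetype"
            and head[start:start + len(MIMETYPE)] == MIMETYPE)
```

Reading the first hundred bytes or so of a file is enough for `looks_like_booki_zip`.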
> 
> Directory structure.
> ====================
> 
> As well as the just mentioned "mimetype", the booki-zip must have a
> file called "info.json" in its root directory, the contents of which
> will be described shortly.  Any other files in the root directory
> should be html files intended for editing with Booki.  Any associated
> files that are not directly editable by Booki should be in a
> subdirectory named 'static'.   Here is an example structure:
> 
> /
>   mimetype
>   Introduction.html
>   UseCases.html
>   AdamsTips.html
>   Credits.html
>   info.json
>   static/
>     BookSprints-ott-adam-en.jpg
>     Blog-writers-en.png
>     Floss-100-en.gif
>     example.css
> 
> All references from the html to the files in 'static' should use
> relative addresses.  For example, an image should be linked thus:
> 
> <img src="static/BookSprints-ott-adam-en.jpg" alt="" />
> 
> It is recommended but not required that the file names have
> conventional extensions (".html", ".jpg", etc).  File names should not
> contain spaces, and must meet the restrictions imposed by the zip
> format.
> 
> There should be nothing in the root directory other than "mimetype",
> "info.json", and the html files, and there should be no
> subdirectories other than "static".  Apart from starting with
> "mimetype", there is no required order to the arrangement of entries
> within the zip file itself.  Other than "mimetype", files should be
> deflate-compressed.
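
The layout rules in this section can be sketched as a check over the entry names in archive order (for instance the result of `ZipFile.namelist()`). `layout_problems` is a hypothetical name. Because file extensions are only recommended, the sketch does not try to decide whether a root file really is html.

```python
def layout_problems(names):
    """Check zip entry names, in archive order, against the layout rules."""
    problems = []
    if not names or names[0] != "mimetype":
        problems.append("first entry is not 'mimetype'")
    if "info.json" not in names:
        problems.append("missing info.json")
    for name in names:
        if " " in name:
            problems.append("%s: contains a space" % name)
        # Anything that is not in the root must live under 'static/'.
        if "/" in name and not name.startswith("static/"):
            problems.append("%s: outside root and 'static'" % name)
    return problems
```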
> 
> character encoding
> ==================
> 
> All html files, and info.json, should be encoded as utf-8.
> 
> info.json
> =========
> 
> The "info.json" file describes the structure of the document and
> carries metadata.  It is a JSON file [3], containing a single JSON
> object with 5 members, as shown here:
> 
> {
>   "spine": [ ... ],
>   "TOC": [ ... ],
>   "manifest": { ... },
>   "metadata": { ... }
> }
> 
> Being JSON object members, the ordering of these elements is not
> significant.  The following order is for narrative purposes only.
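
Reading info.json might look like the sketch below. It assumes the four members shown in the example are the ones required; `load_info` and `EXPECTED_MEMBERS` are names invented here.

```python
import json

# Assumption: exactly the members that appear in the example above.
EXPECTED_MEMBERS = {"spine", "TOC", "manifest", "metadata"}


def load_info(raw_bytes):
    """Decode info.json (utf-8) and check that the top-level members exist."""
    info = json.loads(raw_bytes.decode("utf-8"))
    if not isinstance(info, dict):
        raise ValueError("info.json must hold a single JSON object")
    missing = EXPECTED_MEMBERS - set(info)
    if missing:
        raise ValueError("info.json lacks: " + ", ".join(sorted(missing)))
    return info
```

Since member order is not significant, the check looks only at which names are present.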
> 
> 
> 
> info.json manifest
> ==================
> 
> The manifest is a mapping of identifiers to file names and mime-types.
> Each entry looks like:
> 
>     identifier: [filename, mimetype, authors]
> 
> The constraints on *identifier* match the XML name specification[4]
> (in short, avoid spaces and most punctuation).  In practice, the
> *identifier* is often related to the *filename*.
> 
> *filename* locates the file within the zip, and must match a path in
> the zip index.
> 
> *mimetype* is the IANA media type [5] of the file.  Booki-editable
>  html files must be of type 'text/html', and other files should be
>  correctly identified.
> 
> *authors* is a list of names of people who have contributed to this
>  file.  It can be empty.
> 
> The manifest shouldn't list the 'mimetype' or 'info.json' files, just
> the editable html and associated static files.
> 
> An example manifest, containing two html files and an image, is shown
> here:
> 
>   "manifest": {
>     "Introduction": [
>       "Introduction.html",
>       "text/html",
>       ["Adam Hyde", "Aleksander Erkalovic"]
>     ],
>     "arbitrary-identifier_0005": [
>       "UseCases.html",
>       "text/html",
>       []
>     ],
>     "BookSprints-ott-adam-en.jpg": [
>       "static/BookSprints-ott-adam-en.jpg",
>       "image/jpeg",
>       ["Ansell Adams"]
>     ]
>   }
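
A sketch of how a manifest like the one above could be checked against the zip contents. `manifest_problems` is a hypothetical helper. It treats any manifest file outside 'static/' as Booki-editable html, which is my reading of the directory rules rather than something the manifest section states.

```python
def manifest_problems(manifest, zip_names):
    """Check each [filename, mimetype, authors] entry of the manifest."""
    problems = []
    for identifier, entry in manifest.items():
        if not (isinstance(entry, list) and len(entry) == 3):
            problems.append("%s: entry is not a 3-item list" % identifier)
            continue
        filename, mimetype, authors = entry
        if filename not in zip_names:
            problems.append("%s: %s is not in the zip" % (identifier, filename))
        if filename in ("mimetype", "info.json"):
            problems.append("%s: %s should not be listed" % (identifier, filename))
        # Assumption: root-level files are the editable html pages.
        elif "/" not in filename and mimetype != "text/html":
            problems.append("%s: root file should be text/html" % identifier)
        if not isinstance(authors, list):
            problems.append("%s: authors is not a list" % identifier)
    return problems
```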
> 
> 
> info.json spine
> ===============
> 
> The spine lists the identifiers of all the html files in the order
> they appear in the book.  It looks like:
> 
>  "spine": [ identifier, identifier,... ]
> 
> where each *identifier* is the manifest identifier for an editable
> html page.
> 
> Here is a possible spine for the manifest used in the previous
> example:
> 
>   "spine": ["Introduction", "arbitrary-identifier_0005"]
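
The relationship between spine and manifest can be sketched as follows. `spine_problems` is a hypothetical name. The sketch reads "all the html files" literally, so an html page left out of the spine is reported as well.

```python
def spine_problems(spine, manifest):
    """Check that the spine names exactly the html pages of the manifest."""
    problems = []
    for identifier in spine:
        entry = manifest.get(identifier)
        if entry is None:
            problems.append("%s: not in the manifest" % identifier)
        elif entry[1] != "text/html":
            problems.append("%s: not an html page" % identifier)
    # Every html page in the manifest should also appear in the spine.
    for identifier, entry in manifest.items():
        if entry[1] == "text/html" and identifier not in spine:
            problems.append("%s: html page missing from spine" % identifier)
    return problems
```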
> 
> info.json TOC
> =============
> 
> The TOC (Table of Contents) specifies navigation points within the book.
> It uses a nested structure, with less significant divisions being
> contained within the "children" attribute of the greater division.
> 
> The "TOC" element itself is a list of objects with the following
> structure:
> 
>  {
>    "nav_id":   identifier,
>    "title":    division title (optional),
>    "url":      filename and possible fragment ID,
>    "type":     string indicating division type (optional),
>    "role":     epub guide type (optional),
>    "children": list of TOC structures (optional)
>  }
> 
> *nav_id* is a unique identifier for this navigation point.  It uses a
>  different namespace than manifest identifiers and need have no
>  relationship to them.
> 
> *title* is a free string giving the division's title. It may be omitted.
> 
> *url* points to the start of the division.  It should consist of a
>  filename as found in the manifest, optionally followed by a '#' and a
>  fragment identifier.
> 
> *type* is a string indicating what kind of navigation point it is.
>  This might be used to determine text styles.
> 
> *role*, if present, indicates the navigation point has a particular
>  structural role.  It must be a keyword for "reference type" as
>  defined in the guide section of the epub OPF specification[6].
> 
> *children*, if present, contains a list of objects following this same
>  specification.  These are subsections of this section.
> 
> An example:
> 
> "TOC": [
>    {"nav_id": "section1",
>     "title": "INTRODUCTION",
>     "url": "Introduction.html",
>     "type": "booki-section",
>     "children": [
>         {"nav_id": "chapter1",
>          "title": "WHAT IS GSoC?",
>          "url": "Introduction.html",
>          "type": "chapter",
>          "role": "text"
>          },
>         {"nav_id": "chapter2",
>          "title": "WHY GSOC MATTERS",
>          "url": "Testimonials.html",
>          "type": "chapter",
>          "children": [ ... ]
>          }
>       ]
>    }
> ]
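
Because the TOC nests through "children", a consumer will most likely walk it recursively. Here is a sketch under that assumption; `walk_toc` and `toc_problems` are names invented here. It checks the two properties the text states: nav_ids are unique, and each url starts with a filename found in the manifest.

```python
def walk_toc(toc, depth=0):
    """Yield (depth, navigation point) pairs in reading order."""
    for point in toc:
        yield depth, point
        yield from walk_toc(point.get("children", []), depth + 1)


def toc_problems(toc, manifest):
    """Check nav_id uniqueness and that each url names a manifest file."""
    filenames = {entry[0] for entry in manifest.values()}
    seen = set()
    problems = []
    for _depth, point in walk_toc(toc):
        nav_id = point.get("nav_id")
        if nav_id in seen:
            problems.append("%s: duplicate nav_id" % nav_id)
        seen.add(nav_id)
        # Drop any '#fragment' before comparing with manifest filenames.
        filename = point.get("url", "").split("#", 1)[0]
        if filename not in filenames:
            problems.append("%s: url not in the manifest" % nav_id)
    return problems
```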
> 
> 
> info.json metadata
> ==================
> 
> The names in the metadata object are "namespaces" in which "keywords"
> are defined.  The objects referred to by keywords are further divided
> by "scheme".  Each scheme points to a list of values.  If the keyword
> is indivisible, there should be a single scheme identified by an empty
> string ("").  Further, if a scheme is the primary default for that
> keyword, it may be identified by an empty string as well as by its
> scheme name.
> 
> Here's the diagram:
> 
>  "metadata": {
>      namespace: {
>         keyword: {
>            scheme: [value, value,...],
>            scheme: [value],...
>         },...
>      },...
>   }
> 
> Booki uses Dublin Core[7] metadata keywords wherever possible, which are
> stored under the namespace "http://purl.org/dc/elements/1.1/".
> 
> An example metadata section is shown below:
> 
>   "metadata": {
>     "http://purl.org/dc/elements/1.1/": {
>       "publisher": {
>         "": ["FLOSS Manuals http://flossmanuals.net"]
>       },
>       "language": {
>         "": ["en"]
>       },
>       "creator": {
>         "": ["The Contributors"]
>       },
>       "contributor": {
>         "": ["Jennifer Redman", "Bart Massey", "Alexander Pico",
>              "selena deckelmann", "Anne Gentle", "adam hyde", "Olly Betts",
>              "Jonathan Leto", "Google Inc And The Contributors",
>              "Leslie Hawthorn"]
>       },
>       "title": {
>         "": ["GSoC Mentoring"]
>       },
>       "date": {
>         "start": ["2009-10-23"],
>         "last-modified": ["2009-10-30"]
>       },
>       "identifier": {
>         "flossmanuals.net": ["http://en.flossmanuals.net/epub/GSoCMentoring/2009.10.23-19.49.01"],
>         "archive.org": ["gsocmentoring00fm"]
>       }
>    },
>    "http://booki.cc/": {
>       "server": {
>         "": ["en.flossmanuals.net"]
>       },
>       "book": {
>          "": ["GSoCMentoring"]
>       },
>       "dir": {
>          "": ["LTR"]
>       }
>    }
>   }
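
The namespace / keyword / scheme nesting is deep enough that small accessors help. A sketch follows; `get_metadata` and `add_metadata` are hypothetical helpers, and only the Dublin Core namespace string comes from the spec.

```python
DC = "http://purl.org/dc/elements/1.1/"


def get_metadata(metadata, namespace, keyword, scheme=""):
    """Return the list of values for a keyword, or [] if it is absent."""
    return metadata.get(namespace, {}).get(keyword, {}).get(scheme, [])


def add_metadata(metadata, namespace, keyword, value, scheme=""):
    """Append a value, creating the intermediate objects as needed."""
    schemes = metadata.setdefault(namespace, {}).setdefault(keyword, {})
    schemes.setdefault(scheme, []).append(value)
```

With the example above, `get_metadata(metadata, DC, "date", "start")` would return `["2009-10-23"]`, and a lookup with the default scheme `""` returns the values of an indivisible keyword such as "title".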
> 
> 
> 
> references
> ==========
> 
> 
> [1] Zip specification: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
> [2] zipfile module: http://docs.python.org/library/zipfile.html
> [3] JSON specification: http://json.org/
> [4] XML name specification http://www.w3.org/TR/REC-xml/#NT-Name
> [5] Media types http://www.iana.org/assignments/media-types/
> [6] Guides in epub http://www.idpf.org/2007/opf/OPF_2.0_final_spec.html#Section2.6
> [7] Dublin Core metadata elements http://dublincore.org/documents/2004/12/20/dces/
> 
> 

-- 
Adam Hyde
Founder FLOSS Manuals
German mobile : + 49 177 4935122
Email : adam at flossmanuals.net
irc: irc.freenode.net #flossmanuals

"Free manuals for free software"
http://www.flossmanuals.net/about