[Booki-dev] changes to booki-zip format

Thu Nov 19 03:33:37 PST 2009

hey,

i made some edits...please check to see if they make sense. Issues I
didn't address are:

* since it is intended some of the books will be taken into various
formats do we have a way to define/link layout/css for various types of
output?

* do we need to anticipate fields for forking/branching information for
when we implement git?

* can we define cover graphics somewhere?

adam

On Thu, 2009-11-19 at 10:39 +0100, adam hyde wrote:
> 
> * since it is intended some of the books will be taken into various
> formats do we have a way to define/link layout/css for various types
> of
> output?
> 
> * do we need to anticipate fields for forking/branching information
> for
> when we implement git?
> 
> * can we define cover graphics somewhere? 

-- 
Adam Hyde
Founder FLOSS Manuals
German mobile : + 49 177 4935122
Email : adam at flossmanuals.net
irc: irc.freenode.net #flossmanuals

"Free manuals for free software"
http://www.flossmanuals.net/about
-------------- next part --------------
booki-zip format
================

This describes the booki-zip format that Booki, Espri, and Objavi use
to communicate with each other.  Tools that import into Booki should
import into this format.  Internally, Espri manages the conversion of
other formats into booki-zip.

Zip container format.
=====================

A booki-zip file is a zip file[1], with certain restrictions.  The
ultimate test of whether a zip is correctly encoded is whether its
contents can be extracted by the zipfile modules[2] in Python 2.5 and
2.6.  This means the contents must be either uncompressed or
deflate-compressed.  ZIP64 extensions are OK (though unnecessary in
practical terms), but encryption and comments are not.
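
The restrictions above can be sketched as a small check.  This is only
an illustration: check_container is a hypothetical name, not part of
Booki, Espri, or Objavi, and the code is written for current Python 3
rather than the 2.5 and 2.6 releases named above.

```python
import io
import zipfile


def check_container(fileobj):
    """Return a list of problems with the zip container itself.

    Hypothetical helper: it only tests the restrictions described
    above (stored or deflated entries, no encryption, no comments).
    """
    problems = []
    with zipfile.ZipFile(fileobj) as zf:
        if zf.comment:
            problems.append("archive comment present")
        for info in zf.infolist():
            if info.compress_type not in (zipfile.ZIP_STORED,
                                          zipfile.ZIP_DEFLATED):
                problems.append("%s: unsupported compression" % info.filename)
            # Bit 0 of the general purpose flags marks an encrypted entry.
            if info.flag_bits & 0x1:
                problems.append("%s: encrypted" % info.filename)
            if info.comment:
                problems.append("%s: entry comment present" % info.filename)
    return problems
```

An empty list means the container passed these particular checks; it
says nothing about the contents.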

The first file in the zip should be uncompressed and named "mimetype".
It should contain only the 23 characters "application/x-booki+zip".
This string will end up near the start of the zip file (at byte offset
38, provided the entry has no extra field), allowing it to be
identified without unzipping.
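
A minimal sketch of writing the "mimetype" entry first and then
recognising the file from its raw bytes.  The function names are
hypothetical, and the sniffing assumes the writer added no extra field
to the first entry (true for Python's zipfile as used here).

```python
import io
import zipfile

MIMETYPE = b"application/x-booki+zip"  # the 23 characters given above


def write_booki_zip(fileobj, files):
    """Write a zip whose first entry is an uncompressed "mimetype".

    `files` maps archive names to bytes.  Hypothetical helper.
    """
    with zipfile.ZipFile(fileobj, "w", zipfile.ZIP_DEFLATED) as zf:
        first = zipfile.ZipInfo("mimetype")
        first.compress_type = zipfile.ZIP_STORED  # stored, not deflated
        zf.writestr(first, MIMETYPE)
        for name, data in files.items():
            zf.writestr(name, data)  # deflated, the archive default


def looks_like_booki_zip(data):
    """Identify a booki-zip from its leading bytes, without unzipping.

    A local file header is 30 bytes, then comes the 8-byte name
    "mimetype", so the stored content begins at offset 38.
    """
    end = 38 + len(MIMETYPE)
    return (data[:4] == b"PK\x03\x04"
            and data[30:38] == b"mimetype"
            and data[38:end] == MIMETYPE)
```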

Directory structure.
====================

As well as the just mentioned "mimetype", the booki-zip must have a
file called "info.json" in its root directory, the contents of which
will be described shortly.  Any other files in the root directory
should be html files intended for editing with Booki.  Any associated
files that are not directly editable by Booki should be in a
subdirectory named 'static'.   Here is an example structure:

/
  mimetype
  Introduction.html
  UseCases.html
  AdamsTips.html
  Credits.html
  info.json
  static/
    BookSprints-ott-adam-en.jpg
    Blog-writers-en.png
    Floss-100-en.gif
    example.css

All references from the html to the files in 'static' should use
relative addresses.  For example, an image should be linked thus:

<img src="static/BookSprints-ott-adam-en.jpg" alt="" />

It is recommended but not required that the file names have
conventional extensions (".html", ".jpg", etc).  File names should not
contain spaces, and must meet the restrictions imposed by the zip
format.

There should be nothing in the root directory other than "mimetype",
"info.json", and the html files, and there should be no
subdirectories other than "static".  Apart from starting with
"mimetype", there is no required order to the arrangement of entries
within the zip file itself.  Other than "mimetype", files should be
deflate-compressed.
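
The layout rules in this section could be checked along these lines.
check_layout is a hypothetical helper; it takes the list of names that
zipfile.ZipFile.namelist() would return.  It does not try to decide
whether a root file really is html, since extensions are only
recommended.

```python
def check_layout(names):
    """Return a list of layout problems for a list of archive names."""
    problems = []
    if not names or names[0] != "mimetype":
        problems.append("first entry must be 'mimetype'")
    if "info.json" not in names:
        problems.append("missing info.json")
    for name in names:
        if name in ("mimetype", "info.json"):
            continue
        if " " in name:
            problems.append("space in file name: %r" % name)
        # Anything containing a slash must live under static/.
        head, sep, _rest = name.partition("/")
        if sep and head != "static":
            problems.append("unexpected subdirectory: %r" % name)
    return problems
```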

character encoding
==================

All html files, and info.json, should be encoded as utf-8.

info.json
=========

The "info.json" file describes the structure of the document and
carries metadata.  It is a JSON file [3], containing a single JSON
object with 4 members, as shown here:

{
  "spine": [ ... ],
  "TOC": [ ... ],
  "manifest": { ... },
  "metadata": { ... }
}

Being JSON object members, the ordering of these elements is not
significant.  The following order is for narrative purposes only.
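
As a rough sketch, a reader might decode info.json and confirm the
four members like this.  load_info is a hypothetical name; the utf-8
decoding follows the character encoding section above.  Python's json
module rejects trailing commas, so a file that has one fails here.

```python
import json

REQUIRED_MEMBERS = {"spine", "TOC", "manifest", "metadata"}


def load_info(raw):
    """Decode info.json bytes and check the top-level members."""
    info = json.loads(raw.decode("utf-8"))
    if not isinstance(info, dict):
        raise ValueError("info.json must contain a single JSON object")
    missing = REQUIRED_MEMBERS - set(info)
    if missing:
        raise ValueError("missing members: %s" % ", ".join(sorted(missing)))
    return info
```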

info.json manifest
==================

The manifest is a mapping of identifiers to file names, mime-types,
and attribution details.  Each entry looks like:

    identifier: [filename, mimetype, contributors, rightsHolder, license]

The constraints on *identifier* match the XML name specification[4]
(in short, avoid spaces and most punctuation).  In practice, the
*identifier* is often related to the *filename*.

*filename* locates the file within the zip, and must match a path in
the zip index.

*mimetype* is the IANA media type [5] of the file.  Booki-editable
 html files must be of type 'text/html', and other files should be
 correctly identified.

*contributors* is a list of names of people who have contributed to this
 file.  It can be empty.

*rightsHolder* is the name of the person, group or organisation that
 manages the rights for the file.

*license* is the copyright license for the file.  It can contain
 more than one entry.

The manifest shouldn't list the 'mimetype' or 'info.json' files, just
the editable html and associated static files.

An example manifest, containing two html files and an image, is shown
here:

  "manifest": {
    "Introduction": [
      "Introduction.html",
      "text/html",
      ["Adam Hyde", "Aleksander Erkalovic"],
      ["Adam Hyde"],
      ["CC-BY-SA"]
    ],
    "arbitrary-identifier_0005": [
      "UseCases.html",
      "text/html",
      [],
      ["Wikimedia Foundation"],
      ["FDL","CC-BY-SA"]
    ],
    "BookSprints-ott-adam-en.jpg": [
      "static/BookSprints-ott-adam-en.jpg",
      "image/jpeg",
      ["Ansell Adams"],
      ["Ansell Adams"],
      ["C"]
    ]
  }
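
A parsed manifest could be compared with the zip's name list roughly
as follows.  check_manifest is a hypothetical helper, and treating
every root-level file as Booki-editable html is an assumption drawn
from the directory structure section.

```python
def check_manifest(manifest, names):
    """Return a list of problems in a parsed manifest.

    `manifest` is the "manifest" object from info.json; `names` is the
    list of paths in the zip index.
    """
    problems = []
    for ident, entry in manifest.items():
        if len(entry) != 5:
            problems.append("%s: expected 5 fields, got %d"
                            % (ident, len(entry)))
            continue
        filename, mimetype = entry[0], entry[1]
        if filename in ("mimetype", "info.json"):
            problems.append("%s: %s should not be listed" % (ident, filename))
        elif filename not in names:
            problems.append("%s: %s is not in the zip" % (ident, filename))
        # Assumption: files outside static/ are the editable html pages.
        if "/" not in filename and mimetype != "text/html":
            problems.append("%s: root files must be text/html" % ident)
    return problems
```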

info.json spine
===============

The spine lists the identifiers of all the html files in the order
they appear in the book.  It looks like:

 "spine": [ identifier, identifier,... ]

where each *identifier* is the manifest identifier for an editable
html page.

Here is a possible spine for the manifest used in the previous
example:

  "spine": ["Introduction", "arbitrary-identifier_0005"]

info.json TOC
=============

The TOC (Table of Contents) specifies navigation points within the
book.  It uses a nested structure, with less significant divisions
being contained within the "children" attribute of the greater
division.

The "TOC" element itself is a list of objects with the following
structure:

 {
   "nav_id":   identifier,
   "title":    division title (optional),
   "url":      filename and possible fragment ID,
   "type":     string indicating division type (optional),
   "role":     epub guide type (optional),
   "children": list of TOC structures (optional)
 }

*nav_id* is a unique identifier for this navigation point.  It uses a
 different namespace than manifest identifiers and need have no
 relationship to them.

*title* is a free string giving the division's title.  It may be omitted.

*url* points to the start of the division.  It should consist of a
 filename as found in the manifest, optionally followed by a '#' and a
 fragment identifier.

*type* is a string indicating what kind of navigation point it is.
 This might be used to determine text styles. The main Booki TOC types are:
- booki-chapter (default)
- booki-section
- booki-title 

*role*, if present, indicates the navigation point has a particular
 structural role.  It must be a keyword for "reference type" as
 defined in the guide section of the epub OPF specification[6].

*children*, if present, contains a list of objects following this same
 specification.  These are subsections of this section.

An example:

"TOC": [
   {"nav_id": "section1",
    "title": "INTRODUCTION",
    "url": "Introduction.html",
    "type": "booki-section",
    "children": [
        {"nav_id": "chapter1",
         "title": "WHAT IS GSoC?",
         "url": "Introduction.html",
         "type": "booki-chapter",
         "role": "text"
         },
        {"nav_id": "chapter2",
         "title": "WHY GSOC MATTERS",
         "url": "Testimonials.html",
         "type": "booki-chapter",
         "children": [ ... ]
         }
      ]
   }
]
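
Because the TOC nests through "children", reading it is naturally
recursive.  The sketch below flattens a parsed TOC into rows; walk_toc
is a hypothetical name, and splitting the url at '#' follows the
description of *url* above.

```python
def walk_toc(toc, depth=0):
    """Yield (depth, nav_id, title, filename) for each navigation point.

    `toc` is a list of TOC objects as parsed from info.json.  Optional
    members ("title", "children") are allowed to be absent.
    """
    for point in toc:
        # Drop any fragment identifier to recover the manifest filename.
        filename = point["url"].partition("#")[0]
        yield depth, point["nav_id"], point.get("title"), filename
        yield from walk_toc(point.get("children", []), depth + 1)
```

The rows come out in document order, so the depth value is enough to
rebuild an indented outline.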

info.json metadata
==================

The names in the metadata object are "namespaces" in which "keywords"
are defined.  The objects referred to by keywords are further divided
by "scheme".  Each scheme points to a list of values.  If the keyword
is indivisible, there should be a single scheme identified by an empty
string ("").  Further, if a scheme is the primary default for that
keyword, it may be identified by an empty string as well as by its
scheme name.

Here's the diagram:

 "metadata": {
     namespace: {
        keyword: {
           scheme: [value, value,...],
           scheme: [value],...
        },...
     },...
 }

Booki uses Dublin Core[7] metadata keywords wherever possible, which are
stored under the namespace "http://purl.org/dc/elements/1.1/".

An example metadata section is shown below:

  "metadata": {
    "http://purl.org/dc/elements/1.1/": {
      "publisher": {
        "": ["FLOSS Manuals http://flossmanuals.net"]
      },
      "language": {
        "": ["en"]
      },
      "creator": {
        "": ["The Contributors"]
      },
      "title": {
        "": ["GSoC Mentoring"]
      },
      "date": {
        "start": ["2009-10-23"],
        "last-modified": ["2009-10-30"]
      },
      "identifier": {
        "flossmanuals.net": ["http://en.flossmanuals.net/epub/GSoCMentoring/2009.10.23-19.49.01"],
        "archive.org": ["gsocmentoring00fm"]
      }
    },
    "http://booki.cc/": {
      "server": {
        "": ["en.flossmanuals.net"]
      },
      "book": {
        "": ["GSoCMentoring"]
      },
      "dir": {
        "": ["LTR"]
      }
    }
  }
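
Looking up a value in this three-level structure might be sketched as
below.  get_metadata and the DC constant are hypothetical names; the
empty-string default for the scheme follows the rule described at the
start of this section.

```python
DC = "http://purl.org/dc/elements/1.1/"


def get_metadata(metadata, namespace, keyword, scheme=""):
    """Return the list of values for namespace/keyword/scheme.

    An absent namespace, keyword, or scheme gives an empty list.
    """
    return metadata.get(namespace, {}).get(keyword, {}).get(scheme, [])
```

With the example above, get_metadata(metadata, DC, "title") would give
the one-element title list, and passing "start" as the scheme for
"date" would give the start date.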

Recommended minimum metadata terms are:
- language - the primary language of the book
- creator - the person or group that primarily created the content
  (usually 'the author')
- title - the title of the book
- rightsHolder - the person or organisation managing the rights of the
  book (usually 'the copyright owner')
- license - the copyright license used for the entire book (blank if
  chapters are variously licensed)

Please note: the Dublin Core 'contributor' term is probably not
necessary, since this information can be included in the manifest.

references
==========

[1] Zip specification: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
[2] zipfile module: http://docs.python.org/library/zipfile.html
[3] JSON specification: http://json.org/
[4] XML name specification: http://www.w3.org/TR/REC-xml/#NT-Name
[5] Media types: http://www.iana.org/assignments/media-types/
[6] Guides in epub: http://www.idpf.org/2007/opf/OPF_2.0_final_spec.html#Section2.6
[7] Dublin Core metadata elements: http://dublincore.org/documents/2004/12/20/dces/