[FM Discuss] Future of Booki

James Simmons nicestep at gmail.com
Thu Jun 24 11:54:40 PDT 2010


Adam,

I am well aware of the limitations of archive.org in the EPUB
department.  I document that in my own book. I'm doing some
proofreading of OCR'd text for my other donation, "Ancient Manners" by
Pierre Louys, which I'm submitting to Distributed Proofreaders for
Project Gutenberg.  The book was published in 1906 and I used
Tesseract on it to get OCR results that were not so good, but no worse
than I got with ABBYY FineReader.  Because of this, I'm proofing and
correcting the text myself, one page at a time, before submitting it
to DP.  Not much fun.

It seems to me that in the E-Book world a lot of different efforts are
converging, and I'm trying to document that as much as possible in my
book.

James Simmons


> Date: Thu, 24 Jun 2010 18:37:31 +0200
> From: adam <adam at xs4all.nl>
> To: discuss at lists.flossmanuals.net
> Subject: Re: [FM Discuss] Future of Booki
> Message-ID: <1277397451.9881.198.camel at esetera>
> Content-Type: text/plain; charset="UTF-8"
>
> nice book by the way...
>
> one thing that hasn't been mentioned but is interesting...booki and
> archive.org are going to be linked together soon i hope. essentially
> archive.org has a zillion books that have been scanned ('OCR') similar
> to the book you have donated James.
>
> when you scroll down the page of that book:
> http://www.booki.cc/big-aviation-book-for-boys/pages/
>
> you see lots of wacky OCR artifacts. These are created because scanners
> can't tell if a blotch on the page is a blotch or an image, and they
> also can't tell the difference between a big letter and an image. If you
> scroll down that page you will see what I mean.
>
> They also don't format the page with headings or well-formatted footnotes
> etc.
>
> Also there is a 5% error rate in the text of OCR scans which is actually
> quite high...
>
> so, this content needs to be improved, and that is what booki will
> enable. we will create a 'round trip' -> books get imported from archive
> into booki and placed directly in the Internet Archive Group. then the
> proofreaders proof and improve, then they export and push back to
> archive.org
>
> i'm hoping we get to that soon now that Doug is back from his art holiday :)
> (Doug is the Objavi developer)
>
>
>
> adam
>
> On Thu, 2010-06-24 at 11:10 -0500, James Simmons wrote:
>> I tried out Booki by importing a book from the Internet Archive which
>> I donated myself, "The Big Aviation Book For Boys".  I'm incredibly
>> impressed.


