Why This Matters

A few people have now asked me why porting archaic academic journal articles to ePub is important when they can usually be found as PDF files or at the excellent Biodiversity Heritage Library.

PDF documents created in the modern era using modern desktop publishing software are often good enough for many accessibility needs (accessibility is abbreviated as a11y and often discussed on Twitter as #a11y). However, PDF documents created by scanning old printed content are often horrid.

This is the description page of a species account in ‘Quadrupeds of Illinois’ by Robert Kennicott.

Screenshot of PDF scan of a species description

Transcribing that to ePub is not as simple as just copying the text. Not if done correctly.

At the beginning there is a list of measurements. A print-disabled user often finds such text easier to consume if a screen reader presents it as a list rather than as a bunch of strings delimited by semicolons. Fortunately, the role attribute in (X)HTML, and by extension ePub 3, allows it to be presented to screen readers as a list; see the image that follows.

Using the role attribute to emulate a list.

By wrapping the measurement parts of the description in <span></span> elements with appropriate role attributes, the measurements are presented to someone using a screen reader as a list they can easily navigate through in sequence.
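The wrapping described above can be sketched in a few lines of Python. This is only an illustration, assuming the ‘appropriate role attributes’ are the WAI-ARIA list and listitem roles. The function name is hypothetical, the measurements are placeholders (A, B and C stand in for the real figures), and this is not necessarily how the Pipfrosch Press files are actually built.

```python
import xml.etree.ElementTree as ET


def measurements_to_aria_list(text: str) -> str:
    """Wrap semicolon-delimited measurements in spans carrying ARIA list roles.

    Hypothetical helper for illustration; assumes role="list" on the outer
    span and role="listitem" on each inner span.
    """
    items = [part.strip() for part in text.split(";") if part.strip()]
    outer = ET.Element("span", {"role": "list"})
    for index, item in enumerate(items):
        child = ET.SubElement(outer, "span", {"role": "listitem"})
        is_last = index == len(items) - 1
        # Keep the semicolons as visible text so sighted readers see the
        # same inline paragraph as in the printed original.
        child.text = item if is_last else item + ";"
        if not is_last:
            child.tail = " "  # space between items, outside the inner span
    return ET.tostring(outer, encoding="unicode")


# Placeholder measurements, not the values from the actual species account.
print(measurements_to_aria_list("head and body, A; tail, B; hind foot, C."))
```

The result is a single outer span with role="list" holding three inner spans with role="listitem". Because these are still plain spans and the semicolons stay in the text, the paragraph looks the same on screen; only assistive technology sees the list structure.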

The end result has no visual impact on the rendering for sighted users.

ePub visual rendering of the species description

Sighted users do not know accommodations were made for print-disabled users, as coding those accommodations had no visual impact.

But creating the ePub goes beyond things like presenting written lists within a paragraph as actual lists to screen readers.

Many of the ‘free’ PDF scans of public domain works are just plain crap to access, even for sighted users with excellent vision.

In this example, I was only able to find this particular paper at one location. The printed document belonged to Harvard, but the scan was made by Google and they did a really poor job of it. Google only cares about doing a quality job with the non-public-domain works they can sell. Their scans of public domain works are poor quality, with poor quality OCR, and it appears to me their motivation for creating these scans is twofold:

  1. So they can put their fricken watermark on EVERY SINGLE FRICKEN PAGE and
  2. So they can leverage their search monopoly to push their crappy Google Book Reader, which they then use to try to sell you publications that are not free.

Many journals are cited using abbreviations. It used to be that you could enter the abbreviated journal title into Google and find the full journal title within the first few results. Now you get page after page after page of Google Book results that use the abbreviated journal title in their Works Cited, but you can’t find the actual journal title. They made their search algorithm worse for the purpose of pushing Google Books. And these Google Book results are the first results. It is a clear violation of antitrust laws, but they own too many politicians to ever be prosecuted.

The quality of the scans is crap: the pages were not lying flat when scanned, and the OCR for the text is very poor. Take this example:

Trying to highlight text from a Google Scan.

I tried highlighting the line starting with ‘The tail is’, but as soon as I got to the word ‘is’ the first two words were no longer highlighted. Instead, a bunch of stuff unrelated to what I wanted was highlighted, including text from the opposite page that the crappy Google Scan has in the left margin.

If I were to copy the highlighted text and paste it, it would be out-of-order gibberish with missing words. That is caused by an extremely poor quality OCR scan. Google clearly does not give a flying squirrel’s ass about quality before they paste their logo on something. They do not have to care: they have monopoly power and can squash any competitors. American Capitalism is not about a free market.

With a PDF created by scanning a document, what you see is not rendered text. What you see is a bitmap image file. Text is (sometimes) added later using OCR technology, which supplies hints for what the words should be when you select a certain portion of the image to highlight. Google does a really poor job of that, at least with the public domain scans.

With the ePub files being produced at Pipfrosch Press, what you see is an actual rendering of the actual text. This means you can adjust the font size, or even completely change the font used, and it still renders beautifully. Highlighting the rendered text to copy and paste it elsewhere just works, like it should.

This is why what I am doing with Pipfrosch Press is so important. Capitalism by its very nature is ableist: it discriminates against both those with disabilities and those with lower income levels. We do not matter to Capitalism, which is why laws have to exist to require companies to do some level of accommodation for disabilities. But the laws do not go far enough and the companies do the bare minimum. Actually less than the bare minimum, and they want to be praised for what they do even though it is insufficient and they have the money to do a much better job and still live better lives than most Americans.

A lot of capitalists claim to be ‘socially liberal but fiscally conservative’. Tell me, what is more fiscally conservative: to pay for proper accessibility once when creating the digitized version of a document, or to leave it up to public schools, libraries, universities, and employers to each pay individually to make an inaccessible document accessible whenever a user with a disability needs the accommodation?

Leaving it up to the schools and libraries to take what isn’t accessible and make it accessible also takes away from the autonomy of the disabled user. They cannot just access the resource; they have to get help to access it.

That is not good enough.

Give us bread but give us roses.

