The How

The Print Copy

For republishing existing content, whenever possible I like to start with an actual printed copy of the work being converted to ePub. The copy should be in good condition but not pristine, because I will remove it from its binding. A copy that used to belong to a library is usually the best option: library copies have frequently been rebound or had their binding repaired already, so I do not feel as guilty about removing the binding.

The purpose of removing it from the binding is to allow high-quality photographs of the figures and plates to be taken. The photographs are restored with photo-editing software as far as my skill allows, and the figures are converted to SVG. The book is then stored in a box with silica gel to absorb moisture until I can rebind it.

Text Transcription

When a PDF version of the work is available, I use the PDF to aid in transcribing the text, but I do not use an automated process. OCR gets some things wrong: it never gets the type of dash (- or – or —) right, and sometimes the text itself is just plain grossly wrong. I also like to apply contextual markup as needed within the small sections as I transcribe them, usually a few sentences at a time.

Doing it this way is important because authors often have a list of things within a paragraph. For sighted users, eye memory helps us jump back in the list when needed, but for print-disabled users that is not possible, so tags need to be added inside the paragraph that let a screen reader understand it is a list and move back and forth through the items as needed.
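
To give a concrete, purely illustrative example of the kind of markup I mean: one way to expose an inline list to assistive technology is WAI-ARIA list roles on spans inside the paragraph. The little Python helper below just builds such a fragment as a string; it is a sketch, not the exact markup or tooling Pipfrosch Press uses.

    from html import escape

    def inline_list(items):
        """Wrap the items of an in-sentence list in WAI-ARIA list/listitem roles.

        Screen readers that honour these roles can announce the span as a list
        and let the user step backwards and forwards item by item.
        """
        spans = ", ".join(
            '<span role="listitem">' + escape(item) + "</span>" for item in items
        )
        return '<span role="list">' + spans + "</span>"

    # Example: an author enumerates three characters in the middle of a sentence.
    print(inline_list(["a broader rostrum", "smaller auditory bullae", "paler pelage"]))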

Bibliography

The bibliography often requires a considerable amount of effort. Modern papers do not do this so much anymore, but historically the works cited were full of abbreviations. Even when the meanings of these abbreviations are well understood, they are a horrible experience for those using a screen reader. Frequently, however, they are not well understood: standard lists of journal abbreviations do exist, but many abbreviations used in older academic works do not appear in those lists. It is especially problematic when the author left some words out of the title, which unfortunately happened with some frequency.

I do my best to find the actual work the author referenced. When there is a DOI associated with the work, I add the DOI to the bibliography. Otherwise, if there is a Handle associated with the work, I add the Handle, unless the Handle points to a copy scanned by Google. Google scans often have sections of very poor quality where it is clear the page was not flat, they have Google’s watermark on every page, and Google is leveraging its search monopoly to try and shove its crappy scans of public domain works down everyone’s throat. I refuse to link to their digital books.
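
When hunting for a DOI, the CrossRef REST API accepts free-text citation queries. The sketch below shows that kind of lookup; it is an illustration using the requests library, not the actual Pipfrosch Press tooling, and the top hit always needs human verification before the DOI goes into the bibliography.

    import requests

    def suggest_doi(citation):
        """Ask the CrossRef REST API for the best DOI match for a free-text citation."""
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"query.bibliographic": citation, "rows": 1},
            timeout=30,
        )
        resp.raise_for_status()
        items = resp.json()["message"]["items"]
        return items[0]["DOI"] if items else None

    # Paste a works-cited entry verbatim and verify the suggestion by hand.
    print(suggest_doi("citation text copied from the works cited"))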

When I cannot find a DOI or Handle for the work, I try to find the work at the Biodiversity Heritage Library (BHL) and link there when I am successful. Fortunately, with articles that have aged into the public domain, everything they reference has aged into the public domain as well, which greatly increases the odds that BHL has it.

For the bibliography, I am using my own style that is similar to APA style but has some differences. Purists can sue me.

Presently the bibliography is created manually, but I am working on a database solution that will make future bibliographies much easier to generate: as the number of entries in the database grows, the number of new works I have to hunt down will shrink.

The bibliography data will live in a centralized MariaDB database, and the necessary entries will be exported to a JSON file that is used to generate the actual bibliography as displayed in the ePub. Those scripts have not yet been written.
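
Since those scripts have not been written, the following is only a guess at what the export step might look like; the table name, column names, and connection details are all hypothetical.

    import json
    import mariadb  # MariaDB Connector/Python

    def export_bibliography(citekeys, outfile="bibliography.json"):
        """Export the requested bibliography entries to JSON (hypothetical schema)."""
        conn = mariadb.connect(
            host="localhost", user="pipfrosch", password="secret", database="biblio"
        )
        cur = conn.cursor()
        placeholders = ", ".join("?" * len(citekeys))
        cur.execute(
            "SELECT citekey, authors, year, title, container, pages, doi, handle, bhl_url"
            " FROM works WHERE citekey IN (" + placeholders + ")",
            citekeys,
        )
        columns = [d[0] for d in cur.description]
        entries = [dict(zip(columns, row)) for row in cur.fetchall()]
        conn.close()
        with open(outfile, "w", encoding="utf-8") as fh:
            json.dump(entries, fh, ensure_ascii=False, indent=2)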

Photographic Plates

Photographic plates are re-photographed using a DSLR attached to a copy stand, and the resulting image is restored as far as my skill allows. Hopefully, in the future, Pipfrosch Press will have a Photoshop guru who can do an even better job of restoring the images. The restored images are then downsized to a width of 1024 pixels for inclusion in the ePub.
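
The downsizing itself is simple; here is a sketch of that step using Pillow, assuming the restored photograph is already an RGB image (the actual tool and settings used may differ).

    from PIL import Image

    def downsize_plate(src, dest, target_width=1024):
        """Scale a restored plate photograph to a 1024-pixel width, keeping aspect ratio."""
        with Image.open(src) as img:
            height = round(img.height * target_width / img.width)
            img.resize((target_width, height), Image.LANCZOS).save(dest, quality=90)

    downsize_plate("plate-03-restored.jpg", "plate-03-1024.jpg")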

Even though my Photoshop skills are not as good as some people’s, the result of this method is generally significantly better than what is typical in a PDF scan.

Drawn Figures

As with photographic plates, drawn figures are photographed using a DSLR. Depending on the type of figure, I sometimes initially use an automated tool to vectorize the figure, but in some cases I trace it manually.

Even when I do use an automated tool to trace the figure, an extensive amount of manual cleanup of the resulting SVG code is then required. Depending upon the complexity of the figure, this may take me days.
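
The automated tool is not named above, so treat this as a generic example: potrace is one widely used bitmap-to-vector tracer, and a wrapper like the one below could produce the first-pass SVG that then receives the manual cleanup.

    import subprocess

    def rough_trace(bitmap_path, svg_out):
        """First-pass vectorization with potrace; the SVG still needs manual cleanup.

        potrace expects a PBM/PGM/PPM/BMP bitmap; -s selects the SVG backend.
        """
        subprocess.run(["potrace", "-s", bitmap_path, "-o", svg_out], check=True)

    rough_trace("figure-07.pbm", "figure-07-rough.svg")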

The result is always a much higher quality representation of the figure than what exists in a PDF created by scanning the work.

I then attempt to write as accurate a text description of the figure as I can for the benefit of print-disabled users. This is particularly imperative with graphs, where the caption for the figure is never adequate.
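
When the figure is an SVG, one natural home for that description is a <desc> element on the root, which assistive technology can announce. The snippet below is merely a sketch of inserting one with lxml, not the actual workflow.

    from lxml import etree

    SVG_NS = "http://www.w3.org/2000/svg"

    def add_description(svg_path, description):
        """Insert a <desc> element as the first child of the SVG root element."""
        tree = etree.parse(svg_path)
        root = tree.getroot()
        desc = etree.Element("{%s}desc" % SVG_NS)
        desc.text = description
        root.insert(0, desc)
        tree.write(svg_path, xml_declaration=True, encoding="utf-8")

    add_description(
        "figure-07.svg",
        "Line graph of average monthly temperature, rising from January to a July peak and then falling.",
    )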

Glossary of Terms

Most academic articles are written for an academic audience. I try to determine which terms may not be understood by a reader who is not an academic and add them to a glossary of terms. When my judgement is that understanding a term is critical to comprehension, I make the first use of that term in the article a hyperlink to the glossary.
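
As a purely illustrative sketch of what that first-use link could look like (the actual markup may differ), the DPUB-ARIA role doc-glossref marks a reference to a glossary entry:

    from html import escape

    def glossary_link(term, entry_id):
        """Markup for the first use of a glossary term; the file name and ids are hypothetical."""
        return (
            '<a role="doc-glossref" href="glossary.xhtml#' + escape(entry_id) + '">'
            + escape(term) + "</a>"
        )

    print(glossary_link("edaphic", "gloss-edaphic"))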

Most of the glossary definitions come from Wiktionary (and are cited as such), but when Wiktionary does not have a definition, or I find its definition inadequate, I write the glossary definition myself.

Supplemental Content

In some cases, it is my judgement that supplemental content is needed for the best comprehension of the academic article. For example, in the Journal of Mammalogy new taxonomic nomenclature is often described; supplemental content can address whether the described nomenclature has stood the test of time and peer review.

When an ePub is created for a book or a journal volume containing many articles, I may also include supplemental content intended to aid educators who are using the ePub in a classroom environment.

The ePub Workflow

I do not use the existing software designed for generating ePub content. Every piece of such software I tried I found very dissatisfying: simple things frequently required manual editing of the code anyway. So I just use a text editor and author in raw XML.

Pipfrosch Press ePubs are produced primarily with the Bluefish text editor on GNU/Linux. I would probably use BBEdit on macOS, as that was my favorite text editor years ago, but I do not currently have a Mac of my own. I do borrow a Mac for things like Photoshop work, but that is not practical for ePub XML content that requires frequent edits. I also frequently use the vim text editor on GNU/Linux.

In addition to using a text editor, I have written several Python scripts to handle many of the tasks involved and am currently creating more.

As an ePub is primarily a collection of text files, the source files are kept in a git repository. When I am ready to do a ‘build’, a shell script does a fresh checkout of the repository, adds the font files (which are not kept in git), runs the needed Python scripts, and then packages the ePub.
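
The packaging step at the end of that shell script can be sketched in Python. An ePub is a ZIP container whose first entry must be an uncompressed file named mimetype, with everything else compressed normally; the paths below are hypothetical.

    import os
    import zipfile

    def package_epub(source_dir, epub_path):
        """Zip an exploded ePub directory into a valid .epub container."""
        with zipfile.ZipFile(epub_path, "w") as zf:
            # The OCF spec requires mimetype to be the first entry, stored uncompressed.
            zf.write(
                os.path.join(source_dir, "mimetype"),
                "mimetype",
                compress_type=zipfile.ZIP_STORED,
            )
            for root, _dirs, files in os.walk(source_dir):
                for name in files:
                    full = os.path.join(root, name)
                    arcname = os.path.relpath(full, source_dir)
                    if arcname != "mimetype":
                        zf.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)

    package_epub("build/epub", "build/article.epub")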

Two versions of the ePub are created: one that uses italicized type with some liberty, and one that applies CSS rules which greatly reduce the use of italicized type for the benefit of users with dyslexia or other disabilities that are triggered by italicized text. Both versions are made at the same time to ensure that the content itself is always mistake-for-mistake identical; the only difference is the CSS rules applied when the viewer renders the content. Well, almost: any ‘Æ/æ’ and ‘Œ/œ’ ligatures from the original content that are preserved in the ePub are also expanded by the script to their non-ligature forms in the reduced-italics version, as those ligatures can also trigger dyslexia and other visual barriers.
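
The ligature expansion itself is trivial; a sketch of that pass (not the actual script) is:

    # Applied only when building the reduced-italics version; the other version
    # keeps the ligatures exactly as they appear in the original work.
    LIGATURES = {"Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe"}

    def expand_ligatures(text):
        """Replace Æ/æ and Œ/œ with their two-letter forms."""
        for ligature, expansion in LIGATURES.items():
            text = text.replace(ligature, expansion)
        return text

    print(expand_ligatures("Cænozoic"))  # -> "Caenozoic"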

The script then runs the result through EPUBcheck to find any violations of the standard, and, when I request it, the script also runs the result through the Ace by DAISY accessibility checker to find obvious accessibility issues.
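
Invoking both checkers from a script looks roughly like the sketch below; the jar location and report directory are assumptions, not the paths actually used.

    import subprocess

    def validate(epub_path, run_ace=False):
        """Run EPUBcheck, and optionally Ace by DAISY, against a packaged ePub."""
        # EPUBcheck ships as a Java jar; the path here is hypothetical.
        subprocess.run(
            ["java", "-jar", "/opt/epubcheck/epubcheck.jar", epub_path], check=True
        )
        if run_ace:
            # Ace by DAISY installs an `ace` command via npm; -o names the report directory.
            subprocess.run(["ace", "-o", "ace-report", epub_path], check=True)

    validate("build/article.epub", run_ace=True)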

While working on a section of the ePub, I use Calibre to visually check the output. When I am finished with a section, I check the output in as many ePub viewers as I have access to.