SHL Open Workshop 1

Cleaning and preparing text data


The resources for the workshop

Please be respectful with your use of Mass Observation content.

Task 1. Calibre

Calibre is an e-book manager. We needed it to convert the books we were using from formats we couldn’t use into one we could. We had a set of e-books, most of them in EPUB, AZW or PDF format, and we needed to convert them into UTF-8 encoded plain text so that we could analyse the text using other tools. We had enough books to be grateful that we could process them all in a single job rather than one at a time.

Calibre can import and export many more formats than these. It also handles several types of encryption and digital rights management, and it has some excellent features for working with metadata. Calibre can import AZW, AZW3, AZW4, CBZ, CBR, CBC, CHM, DJVU, DOCX, EPUB, FB2, HTML, HTMLZ, LIT, LRF, MOBI, ODT, PDF, PRC, PDB, PML, RB, RTF, SNB, TCR, TXT and TXTZ. It can export AZW3, EPUB, DOCX, FB2, HTMLZ, OEB, LIT, LRF, MOBI, PDB, PMLZ, RB, PDF, RTF, SNB, TCR, TXT, TXTZ and ZIP.


  • 1. Please sign in; it helps us justify costs to our funders.
  • 2. Add an eBook to the Calibre Library (eBooks folder).
  • 3. Use Bulk Convert to process multiple files. Export your text as UTF-8.
  • 4. Edit the metadata for Heidi and try to add a cover.

Why did you do this?

In our case we were surprised we needed to; however, somebody had to make the text usable for the work to proceed.

Calibre is designed as an e-book manager, but what we needed was format conversion for a range of electronic document types. When I looked for tools to do this job, Calibre was the one I found that converted the types we needed. Because it can run conversions as a batch process, large folders of unsorted documents could be handled with a few mouse clicks.
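The workshop uses Bulk Convert in the Calibre window, but the same batch job can also be scripted. The sketch below is one possible approach, not the method used in the workshop. It assumes that Calibre's command-line converter, ebook-convert, is installed and on your PATH, and that every book sits directly in one folder. The function names and the list of file extensions are illustrative choices.

```python
from pathlib import Path
import subprocess

# Input extensions to pick up; an illustrative selection, extend as needed.
EBOOK_SUFFIXES = {".epub", ".azw", ".azw3", ".mobi", ".pdf"}


def build_convert_command(source: Path, out_dir: Path) -> list[str]:
    """Return the ebook-convert command that turns one e-book into UTF-8 text."""
    target = out_dir / (source.stem + ".txt")
    return [
        "ebook-convert",
        str(source),
        str(target),
        "--txt-output-encoding", "utf-8",
    ]


def find_ebooks(folder: Path) -> list[Path]:
    """List the e-book files in folder (not its subfolders), sorted by name."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in EBOOK_SUFFIXES
    )


def convert_folder(folder: Path, out_dir: Path) -> list[Path]:
    """Convert every e-book in folder and return the sources that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for source in find_ebooks(folder):
        # Assumes ebook-convert is on PATH; a non-zero exit code means failure.
        result = subprocess.run(build_convert_command(source, out_dir))
        if result.returncode != 0:
            failed.append(source)
    return failed
```

Calling convert_folder(Path("eBooks"), Path("text")) would write one .txt file per book into a folder called text and return the list of books that could not be converted. Both folder names here are placeholders.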

The main pitfall of using Calibre as a format converter is that you have to adapt to the way the tool works, in particular to the library folder where Calibre stores its copies of documents and exported results. You will need to find the location of this folder; it is one of the options you set when you install the program.
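If you have lost track of where the library folder is, one way to find it is to look for the metadata.db file that Calibre keeps at the top level of each library. The sketch below searches a few levels down from a starting folder. The function name and the depth limit are illustrative assumptions, and the search may be slow on a large home directory.

```python
from pathlib import Path


def find_calibre_libraries(root: Path, max_depth: int = 3) -> list[Path]:
    """Return folders at or below root (up to max_depth levels) holding metadata.db."""
    found = []

    def walk(folder: Path, depth: int) -> None:
        if (folder / "metadata.db").is_file():
            found.append(folder)
            return  # do not descend into a library's own book folders
        if depth >= max_depth:
            return
        try:
            children = sorted(p for p in folder.iterdir() if p.is_dir())
        except OSError:
            return  # unreadable folder: skip it
        for child in children:
            walk(child, depth + 1)

    walk(root, 0)
    return found
```

For example, find_calibre_libraries(Path.home()) lists any library folders found within three levels of your home directory.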

Calibre may not supply you with the sort of error or progress reports you expect - so you should check your results.
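One way to check the results is to confirm that every source book has a matching text file, that the file is not empty, and that it really decodes as UTF-8. The sketch below does this under the assumption that each output file has the same name as its source with a .txt extension. That naming rule and the function names are assumptions made for illustration.

```python
from pathlib import Path


def check_text_file(path: Path) -> str:
    """Return 'ok', or a short description of what is wrong with the file."""
    if not path.is_file():
        return "missing"
    data = path.read_bytes()
    if not data.strip():
        return "empty"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"not valid UTF-8 (byte offset {err.start})"
    return "ok"


def check_outputs(sources: list[Path], out_dir: Path) -> dict[str, str]:
    """Map each source file name to the status of its expected .txt output."""
    return {
        src.name: check_text_file(out_dir / (src.stem + ".txt"))
        for src in sources
    }
```

A file that passes these checks can still contain poor text, for example from a scanned PDF, so it is worth opening a few outputs and reading them as well.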

Calibre Batch Job