Task 4. Text editors

Use documents in the sessions folder for this task.

This task is to use a text editor to remove all XML tags from an XML document, except the paragraph tags. This could be a lengthy task if you did it manually. The purpose of this exercise is to use the "find and replace" tool to do the work for you. You can do part of the work in standard mode; the harder part requires regular expressions.

Regular expressions perform powerful text processing operations quickly and efficiently. They have been around for a long time and are included in most programming languages. There are many web pages full of examples, so you can often find one that suits your needs without knowing how to write them. The purpose of this part of the task is to teach you how to use them, not how to write them. Many text editors (including Sublime Text and Atom) include regular expressions as an option for find and replace operations. The icon in Sublime Text and Atom looks like this: .*

The two regular expressions required for this task are:

  • <[^>]+>
  • ^(?:[\t ]*(?:\r?\n|\r))+
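The first one matches XML tags (a "<", then everything up to the next ">"). The second matches runs of blank lines, meaning lines that are empty or contain only spaces and tabs.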
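
You do not need any code for this task, but if you are curious how the two patterns behave, here is a minimal Python sketch. The sample text is made up for illustration (it is not taken from 16890516.xml), and the re.MULTILINE flag is an assumption needed so that ^ matches at the start of every line, which is how text editors usually treat it.

```python
import re

# The two patterns from the list above, unchanged.
TAG = re.compile(r"<[^>]+>")
# re.MULTILINE makes ^ match at the start of every line, not only at the
# start of the whole text.
BLANK_LINES = re.compile(r"^(?:[\t ]*(?:\r?\n|\r))+", re.MULTILINE)

# Made-up sample: two paragraphs separated by an empty line and a line
# that contains only spaces.
sample = '<p>First <name id="x1">John</name> spoke.</p>\n\n   \n<p>Second.</p>\n'

# Replacing every match with nothing is the same as "find and replace"
# with an empty replacement box.
no_tags = TAG.sub("", sample)
no_blank_lines = BLANK_LINES.sub("", no_tags)

print(repr(no_tags))
print(repr(no_blank_lines))
```

Note that the first pattern removes every tag, including the paragraph tags, which is why the suggested process further down protects them first.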

Start by opening 16890516.xml in your text editor and have a look at the contents of the file. There is a lot of inline metadata included in the document. Your task is to remove the metadata but keep the paragraphs (otherwise you’ll be left with one massive paragraph).

My suggested process for completing this task is:


  • 1. Replace the tags you want to keep (<p> and </p>) with your own identifiers (tip: use find and replace, and don’t make identifiers that look like XML or they’ll be removed in the next step)
  • 2. Remove the remaining XML markup
  • 3. Replace your own identifiers (put the <p> and </p> tags back in)
  • 4. Remove the extraneous ‘whitespace’
  • 5. Try doing this to a whole folder of documents
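
For comparison only, the same five steps could be sketched in Python roughly as follows. This is not part of the task, which is meant to be done in the editor. The placeholder strings [[P]] and [[/P]], the function names clean and clean_folder, and the pattern used to find opening paragraph tags (which assumes they look like <p> or <p followed by attributes) are all assumptions made for this illustration.

```python
import re
from pathlib import Path

TAG = re.compile(r"<[^>]+>")
BLANK_LINES = re.compile(r"^(?:[\t ]*(?:\r?\n|\r))+", re.MULTILINE)

# Hypothetical placeholders: any strings that do not occur in the text
# and do not look like XML would do.
P_OPEN, P_CLOSE = "[[P]]", "[[/P]]"


def clean(text: str) -> str:
    """Strip all XML tags except paragraph tags, then drop blank lines."""
    # Step 1: protect the paragraph tags. The pattern matches <p>, <p > and
    # <p attr="..."> but not other tags that merely start with "p".
    text = re.sub(r"<p(?:\s[^>]*)?>", P_OPEN, text)
    text = text.replace("</p>", P_CLOSE)
    # Step 2: remove the remaining XML markup.
    text = TAG.sub("", text)
    # Step 3: put plain paragraph tags back in (attributes are not restored).
    text = text.replace(P_OPEN, "<p>").replace(P_CLOSE, "</p>")
    # Step 4: remove the blank lines left behind.
    return BLANK_LINES.sub("", text)


def clean_folder(src: Path, dst: Path) -> None:
    """Step 5: clean every .xml file in src and write the result to dst."""
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.glob("*.xml")):
        cleaned = clean(path.read_text(encoding="utf-8"))
        (dst / path.name).write_text(cleaned, encoding="utf-8")
```

Writing the cleaned files to a separate folder keeps the originals untouched, which is also worth doing when you run find and replace across a folder in an editor: work on a copy.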

Why did you do this?

The purpose of this exercise was to demonstrate “cleaning” text data that could not be handled by any of the tools we had looked at - without doing any programming. The example we used for this exercise is XML data, and the link to my own work is the need to clean some text so I could use it within a multimedia application I had created (I wanted to harvest advertising content from XML-encoded court reports to display in virtual reality puppet shows).

The techniques demonstrated were a mix of standard find and replace operations and a more advanced technique using regular expressions (see Wikipedia for a definition of regular expressions). Once this was understood, the challenge was to do this to a whole folder of documents from the dataset to show how large datasets can be handled.

I mostly use regular expressions as part of programming, where they are a common feature for handling text data. However, modern text editors provide this function within their search and replace features (the icon is often .*), which lowers the barrier to using this very powerful technique. Despite having used regular expressions for over a decade, I seldom create them and usually just search for one that performs the processing I want.

This exercise was intended to be a little more advanced - because sometimes this is required.

Voyant Tools
