This summer, I was tasked with parsing a legal archive about Luxembourgian companies. The archive followed a fairly strict file format, but the documents were typed by humans. Furthermore, they were only available as PDF files.
We needed to extract the content of the whole archive to get the list of company names, together with various references and other information about them. For simplicity, the PDFs were first converted into text files with pdfminer. The conversion was not perfect, leaving some lines out of order.
Nevertheless, since the format of the archive was somewhat rigid, I first tried to build a finite-state automaton (FSA), with transitions chosen by matching whole lines of text in the current state. Matching could involve regular expressions, counting words, or combining several lines into one.
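A minimal sketch of such a line-driven automaton might look like the following. The state names and French field labels are invented for illustration; the real archive format, and the real transition predicates, were considerably more involved:

```python
import re

# Hypothetical states and field labels, for illustration only.
# Each state maps to a list of (line pattern, successor state) pairs;
# a line that matches no pattern leaves the automaton in place.
TRANSITIONS = {
    "start":   [(re.compile(r"^Dénomination"), "name")],
    "name":    [(re.compile(r"^Siège social"), "address")],
    "address": [(re.compile(r"^Capital"), "capital")],
    "capital": [],
}

def run_fsa(lines, state="start"):
    """Feed whole lines to the automaton, returning the states visited."""
    visited = [state]
    for line in lines:
        for pattern, nxt in TRANSITIONS[state]:
            if pattern.match(line):
                state = nxt
                break
        visited.append(state)
    return visited
```

The fragility described below follows directly from this design: a single typo in a label like `Dénomination` means no pattern matches, the transition is silently skipped, and every subsequent line is interpreted in the wrong state.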
This FSA crashed a lot. I tried to implement a fallback system so it could recover from crashes and state errors, but it remained a very poor implementation. A small error could jeopardize the decision to transition along the FSA and make it fail on a large part of the article. Since the text had been typed by humans and then converted from a PDF, such errors were numerous. Even when it didn't crash and fell back gracefully, it recorded absurd information in some fields, which proved very difficult to clean, and the number of human interventions needed to fine-tune everything was staggering.
So I rewrote the whole content extractor from scratch, using a more heuristic approach that tolerates errors without requiring a lot of tuning. The first idea I had was approximate regular expressions, which are the subject of this article.
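To give the flavour of the idea before diving in: an approximate match accepts a string that is within a few edits (insertions, deletions, substitutions) of the pattern, so a human typo no longer breaks recognition. Here is a toy, stdlib-only illustration for a *literal* pattern, using plain Levenshtein distance; approximate regular expressions generalise this to full patterns, and the names below are mine:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def approx_match(pattern, line, max_errors=2):
    """True if some substring of `line` is within `max_errors`
    edits of the literal `pattern` (a naive O(n * m^2) scan)."""
    n = len(pattern)
    for start in range(len(line)):
        for length in (n - max_errors, n, n + max_errors):
            chunk = line[start:start + length]
            if chunk and edit_distance(pattern, chunk) <= max_errors:
                return True
    return False
```

With `max_errors=2`, a transposed label such as `"Dénomniation"` still matches the pattern `"Dénomination"`, which is exactly the kind of tolerance the crashing FSA lacked.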