This summer, I faced the challenge of parsing a legal archive about Luxembourgish companies. The archive followed a fairly strict file format, but the documents were typed by humans. Furthermore, they were only available as PDF files.

We needed to extract the content of the whole archive to get the list of company names, together with various references and information about them. For simplicity, the PDFs were first converted into text files with pdfminer. The conversion was not perfect, and some lines came out of order.

Nevertheless, since the format of the archive was somewhat rigid, I first tried to build a finite state automaton (FSA), with transitions chosen by matching whole lines of text in the current state. Matching could involve regular expressions, counting words or combining lines into one.

This FSA crashed a lot. I tried to implement a fallback system so it could recover from crashes and state errors, but it was still a very poor implementation. A small error could jeopardize the decision to transition along the FSA and make it fail on a large part of the article. Typed by humans and converted from a PDF, those text errors were numerous. When it didn’t crash and relied on the fallback system, it recorded absurd information in some fields, which proved very difficult to clean, and the number of human interventions needed to fine-tune everything was staggering.

So I rewrote the whole content extractor from scratch, using a more heuristic approach, one more tolerant of errors without requiring a lot of tuning. The first idea I had was approximate regular expressions, which are the subject of this article.

The idea is simple. Regular expressions fail too easily on small mistakes, unless I write a giant, unreadable one that takes all possible little mistakes into account. What if I had a regular expression that accepted strings within a small error threshold of the pattern, instead of matching it exactly?

## The stack of DFA approach

Two approaches are possible for approximate regular expressions. The first one is the direct translation approach: from the pattern, build a new pattern that accepts the same strings, plus the ones that have a number of errors within the threshold.

The errors we shall consider are the ones counted by the Levenshtein distance. In a Levenshtein distance computation, three transformations can turn one string into another: insertion (adding a character), deletion (removing a character) and conversion (changing one character into another). The number of errors is the minimal number of insertions, deletions and conversions needed to change the string into one that matches the regex. Note that a conversion is akin to a deletion followed by an insertion; it just counts as one error instead of two.
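For reference, the distance itself can be computed with the classic dynamic programming algorithm. Here is a minimal sketch in Python (the function name and representation are mine, not taken from any particular library):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of insertions, deletions and conversions turning a into b."""
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion of ca
                cur[j - 1] + 1,            # insertion of cb
                prev[j - 1] + (ca != cb),  # conversion, free if characters match
            ))
        prev = cur
    return prev[-1]
```

For instance, `levenshtein("kitten", "sitting")` is 3: two conversions and one insertion.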

Let’s start with the first approach. It’s obviously easier to build a new FSA that accepts strings up to three errors away from the FSA that accepts exact strings, rather than building a string pattern from another string pattern. For the sake of simplicity, let’s assume the original FSA is a deterministic finite automaton (DFA). We will build a non-deterministic finite automaton (NFA) from it.

The key idea here is to create layers of automata. Each layer is a replica of the original DFA, and each layer corresponds to a number n of errors: the minimal number of errors required so far to match the original DFA. Thus, to build an NFA that accepts the same strings as the original DFA but with up to three errors, you need four layers: one for when no error has been needed so far, and three for when one, two or three errors are required to match.

Transitions between layer n and layer n+1 are then added. Each corresponds to a possible error. Every state of a layer defines the following transitions.

- Insertion: when a character is missing in the string to be matched, it can be inserted. Thus any transition on a character can also be taken as an ε-transition. But since this is an error to be corrected, the ε-transition must go one layer lower.
- Deletion: when an extra, wrong character is present in the string to match, it can be deleted. There are two cases to take into account. In the first case, the original DFA has a transition from the current state s on that character. Then an ε-transition must be added from the destination state back to s, on the layer with one more error; this added transition undoes the progression in the DFA. In the second case, there is no transition from the current state on the wrong character. No progression in the DFA was possible, but the character must still be consumed, so a transition on the to-be-deleted character is added from the current state to the corresponding state on the layer with one more error. This way, in both cases, no progress is achieved in the original DFA, yet the character to be deleted is consumed and discarded.
- Conversion: as said earlier, a conversion is a deletion followed by an insertion that costs only one error. To add the transitions that correspond to conversions, combine a deletion followed by an insertion, but keep the resulting transition going only to the next error layer instead of two layers down.
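To make the construction concrete, here is a hedged sketch in Python. Rather than materializing the layered NFA, it simulates it on the fly: a configuration is a pair (DFA state, errors so far), normal transitions stay on the same layer, and the three error moves above each drop one layer. The DFA encoding (a dict mapping (state, character) to the next state) and all the names are my own assumptions, not the construction from any particular library:

```python
def fuzzy_accepts(dfa, start, accepting, alphabet, text, max_err):
    """Simulate the layered NFA: accept text within max_err errors of the DFA."""
    def eps_closure(states):
        # Insertion: a missing character is "inserted", i.e. any DFA
        # transition can be followed without input, one layer lower.
        stack, seen = list(states), set(states)
        while stack:
            s, e = stack.pop()
            if e < max_err:
                for c in alphabet:
                    t = dfa.get((s, c))
                    if t is not None and (t, e + 1) not in seen:
                        seen.add((t, e + 1))
                        stack.append((t, e + 1))
        return seen

    current = eps_closure({(start, 0)})
    for ch in text:
        nxt = set()
        for s, e in current:
            t = dfa.get((s, ch))
            if t is not None:
                nxt.add((t, e))          # exact match: same layer
            if e < max_err:
                nxt.add((s, e + 1))      # deletion: consume ch, no DFA progress
                for c in alphabet:       # conversion: consume ch as if it were c
                    t2 = dfa.get((s, c))
                    if t2 is not None and c != ch:
                        nxt.add((t2, e + 1))
        current = eps_closure(nxt)
        if not current:
            return False
    return any(s in accepting for s, _ in current)
```

With a toy DFA for the literal string `abc` (`{(0, 'a'): 1, (1, 'b'): 2, (2, 'c'): 3}` with 3 accepting), this accepts `abd`, `ac` and `abcd` with one allowed error, but rejects them with zero.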

The resulting NFA can of course be made into a DFA using the standard algorithms. However, keep in mind that if m is the maximum number of errors allowed by the approximate regex and n the number of states in the original DFA, the layered NFA will have (m+1)·n states, which can be much harder to make deterministic. However, once the conversion is done, testing whether a string matches will be very fast, since a DFA has no backtracking to perform.

An accepting state of the final DFA can also report how many errors were needed to match the original DFA: simply mark each accepting state of the DFA with the layer number of the corresponding accepting state in the layered NFA. But neither the NFA nor the final DFA can tell how many errors a string needs to match if that number exceeds the depth of the stack of layers, i.e. the maximum number of errors allowed.

If you don’t need the advanced features of regular expressions that go beyond regular languages, like backreferences or recursively matching nested parentheses, this can be a very efficient solution. This is probably fine more often than not. A lot of advanced features are there to provide more strictness on which strings are accepted; if there is tolerance for errors in the first place, using a slightly more tolerant regular expression might not be such a big deal.

## The measuring approach

The second approach is quantitative. The aim is to have an algorithm that doesn’t just tell whether a string matches the pattern or not, but how many errors need to be corrected for the string to match it. If the answer is 0, then the string matches the pattern exactly, since nothing needed to be corrected.

This approach involves using an NFA directly, and it’s much simpler to understand. You basically progress through the NFA as for a non-approximate regular expression. When you are blocked in your progression through the NFA, you backtrack and choose a different path.

So, unlike an NFA made from a strict regular expression, when all correct attempts have been exhausted, wrong attempts to traverse the NFA can be made. These “wrong moves” correspond to the errors the Levenshtein distance counts. They translate as follows:

- Insertion: take an ε-transition instead of a regular transition over a character c, as if c had been inserted.
- Deletion: take an ε-transition in the opposite direction of a regular transition over a character c, or take a regular transition over a character c from and to the same state.
- Conversion: do a deletion followed by an insertion.

Each time one of these error transitions is taken, an accumulator counting the number of errors made must be incremented.

Note that it is desirable to try to detect an error elsewhere in the string if the attempted error yields no result: an erroneous partial match can thus backtrack to a partial match with no errors. However, the search is only allowed to consider adding an error once all possibilities without additional errors have been exhausted. This means that all partial matches without errors are tried first, then all partial matches with one error, then all partial matches with two errors, and so on.
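As a sketch of how this ordering can be implemented, the following Python function runs a depth-first search with an error budget and tries budgets in increasing order (0 errors first, then 1, and so on), so the first budget that succeeds is the minimal number of errors. The NFA encoding (a dict mapping (state, character) to a list of successor states) and all names are assumptions of mine:

```python
def min_errors(nfa, start, accepting, alphabet, text, max_err):
    """Return the minimal number of errors for text to match, or None."""
    def search(state, pos, budget):
        if pos == len(text) and state in accepting:
            return True
        # Regular transitions first: no error spent.
        if pos < len(text):
            for nxt in nfa.get((state, text[pos]), ()):
                if search(nxt, pos + 1, budget):
                    return True
        if budget == 0:
            return False
        # Insertion: follow a transition without consuming any input,
        # as if the missing character had been inserted.
        for c in alphabet:
            for nxt in nfa.get((state, c), ()):
                if search(nxt, pos, budget - 1):
                    return True
        if pos < len(text):
            # Deletion: consume the input character without moving in the NFA.
            if search(state, pos + 1, budget - 1):
                return True
            # Conversion: consume it along a transition on another character.
            for c in alphabet:
                if c != text[pos]:
                    for nxt in nfa.get((state, c), ()):
                        if search(nxt, pos + 1, budget - 1):
                            return True
        return False

    for k in range(max_err + 1):  # 0-error matches first, then 1, then 2…
        if search(start, 0, k):
            return k
    return None
```

On a toy NFA for the literal string `abc`, this reports 0 errors for `abc`, 1 for `abd`, `ac` or `abcd`, and 2 for `a`.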

The performance of this naïve version is of course terrible. There are at least twice as many error transitions as there are transitions in the original NFA, and introducing an error may lead to exploring the same part of the NFA again with the same remaining substring.

The issue here is that when the machine tries to make an “error”, it re-enters the same NFA. It is quite likely that the machine will at some point reach a previously visited pair of state and number of consumed characters. Searching from that point again will yield no more results than the previous time.

A memoization technique can be used. At each step, the machine checks a set containing the pairs of state and number of consumed characters that yielded no result for a given number of remaining errors. While backtracking, it adds the pairs for which every possibility failed. With this, the algorithm first works hard to explore the NFA in order to find a solution; but if nothing matches, it then tries to make errors, doing less and less exploration of the NFA, until it finds a path that matches.
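Here is how that memoization could look, as a variant of the naïve budgeted search sketched above (the encoding and names are again my own). One caveat: a pair that failed with a small error budget may still succeed with a larger one, so the table must record, for each (state, consumed characters) pair, the largest budget that is known to fail:

```python
def min_errors_memoized(nfa, start, accepting, alphabet, text, max_err):
    """Like the naïve search, but prune (state, position) pairs known to fail."""
    failed = {}  # (state, pos) -> largest remaining-error budget that failed

    def search(state, pos, budget):
        if failed.get((state, pos), -1) >= budget:
            return False  # already failed here with at least as many errors left
        if pos == len(text) and state in accepting:
            return True
        if pos < len(text):  # regular transitions: no error spent
            for nxt in nfa.get((state, text[pos]), ()):
                if search(nxt, pos + 1, budget):
                    return True
        if budget > 0:
            for c in alphabet:  # insertion: ε-move along a transition
                for nxt in nfa.get((state, c), ()):
                    if search(nxt, pos, budget - 1):
                        return True
            if pos < len(text):
                # deletion: consume the character without moving
                if search(state, pos + 1, budget - 1):
                    return True
                for c in alphabet:  # conversion: consume it on another character
                    if c != text[pos]:
                        for nxt in nfa.get((state, c), ()):
                            if search(nxt, pos + 1, budget - 1):
                                return True
        failed[(state, pos)] = budget  # every possibility failed at this budget
        return False

    for k in range(max_err + 1):
        if search(start, 0, k):
            return k
    return None
```

The table persists across the increasing budgets, so each new round only re-explores pairs whose recorded failure budget is now too small.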

## Conclusion

Implementing these algorithms can be interesting. However, after googling a little about my second approach, I found that a free and open source solution was already available: TRE. It uses the second approach I described here. Furthermore, it allows only a sub-expression of the regex to be matched with a certain number of errors. So in the end I wrote a quick wrapper over it and saved a lot of time.

But if you’re ever stuck parsing text with common small typing errors, consider approximate regular expressions rather than something more complex like Markov chains. They’re as easy to use as regular expressions; they’re just more tolerant of errors.

Next time I’ll talk about the point separator of the domain of a $2^\mathbb{N}$ map, a custom machine learning problem that exemplifies the maximum-likelihood method very simply.