In my previous article about PDF parsing, I explained how approximate regular expressions could be used to find landmarks elements typed by humans. I will now speak of another technique, in the realm of machine learning algorithm, to find the best way to separate a text in several distinct components. Specifically I will search the best separator for the header and the content of an article.
The technique is a simple yet efficient example of a maximum likelihood problem. The purely mathematical version of the problem is the following. Given a finite list of values between 0 and 1 what is the best index k on the list such that every value on the left of that index values are mostly close to 1 while on its right values are mostly close to 0? The problem is simple enough that we can compute all the possible solutions and pick the best. The problem can be stated for multiple separators, however the complexity of computing all solution is exponential in the number of separators.
The problem relates to the parsing of the PDFs by using a list of scores, the $i$th cell representing how likely is the $i$ line to be part of the header of the document article as opposed to be part of the content of it. The separator of the list, then, is the index at which the header stops and the content starts. I used the technique to find where the table of content at the beginning of PDF ended, and where the first article started.