In my previous article about PDF parsing, I explained how approximate regular expressions could be used to find landmarks elements typed by humans. I will now speak of another technique, in the realm of machine learning algorithm, to find the best way to separate a text in several distinct components. Specifically I will search the best separator for the header and the content of an article.

The technique is a simple yet efficient example of a maximum likelihood problem. The purely mathematical version of the problem is the following. Given a finite list of values between 0 and 1 what is the best index k on the list such that every value on the left of that index values are mostly close to 1 while on its right values are mostly close to 0? The problem is simple enough that we can compute all the possible solutions and pick the best. The problem can be stated for multiple separators, however the complexity of computing all solution is exponential in the number of separators.

The problem relates to the parsing of the PDFs by using a list of scores, the $i$th cell representing how likely is the $i$ line to be part of the header of the document article as opposed to be part of the content of it. The separator of the list, then, is the index at which the header stops and the content starts. I used the technique to find where the table of content at the beginning of PDF ended, and where the first article started.

## The Equation

Before delving into the issue, let me remind everybody what a maximum likelihood method is. We are given a set of data measurements $d$ and a set of possible hypotheses $H$ that contains the possible function patterns the data could obey.

The maximum likelihood method tries to find the most likely hypothesis, namely the $h \in H$ that maximize the probability $P(h|d)$ that the hypothesis is correct given the observed data $d$.

The first step to find a solution here is to use Bayes’ law to convert the probability to the probability of the data given the hypothesis.

$$P(h|d) = \frac{1}{P(d)} \cdot P(d|h) \cdot P(h)$$

In that equation, $P(h)$ is the prior probability of the hypothesis $h$, i.e. the probability that hypothesis $h$ is true without looking at the data. $\frac{1}{P(d)}$ is a regularization factor that ensure that the integration (or sum in the discrete case) of over all possible $h$ of $P(d|h)\cdot P(d)$ yields $1$. The key point here is $P(d|h)$, the probability of observing data $d$ under model $h$.

Oftentimes, the factor $P(d|h)$ is a large product. Its maximum value over $h$ must be determined by finding the roots of the derivative. Since the derivation of a product is quite overwhelming, people often compute the log-likelihood, by taking the log of the likelihood, to convert the big product into a big sum, much easier to find the derivative. In this case, however, we work on the discrete case and we have small finite set of hypotheses. We shall compute the product for each hypothesis $h$ and obtain thus its likelihood, and in the end, return the maximum one.

Let’s apply the method to the best list separator problem we’re interested in. We have $N$ lines to separate. Let’s enumerate the lines with a list ranging from $0$ to $N-1$. Each element correspond to a probability $v_i$ that represents the likeness of a line to be before $k$. We shall say it is the likeliness that $i$ belongs to category $\mathcal{C}_1$. On the contrary, $(1-v_i)$ is the likeliness to be in category $\mathcal{C}_0$. The data, $d$ is the list of $v_i$’s.

We search $k \in \{0, \dots, N-1\}$ that separates the front content with the rear content of the list. There are $N+1$ hypthoses, ranging from the separator $k=0$ (everything in the rear content) to the separator $k=N$ (everything in the front content).

Let’s assume a uniform distribution of prior hypotheses. That is, for hypothesis $k$, its prior likelihood is $P(k) = \frac{1}{N+1}$. For ease of notation, we shall also set $\alpha = \frac{1}{P(d)}$. So given hypothesis $k$, the above equation turns to $P(h|d) = \frac{\alpha}{N+1} \cdot P(d|h)$. Hence maximizing $P(h|d)$ is equivalent to maximize $P(d|h)$.

Given hypothesis separator $k$, let’s compute $P(d|k)$. The likeliness of the content before $k$ to be in $\mathcal{C}_1$ is

\begin{equation}

P( \{0, \dots, k-1\} \subseteq \mathcal{C}_1 | k ) = \prod_{0 \le i < k} v_i

\end{equation}

Likewise the likeness of the content after $k$ to be actually after $k$ given $k$,

\begin{equation}

P( \{k, \dots, N\} \subseteq \mathcal{C}_0 | k ) = \prod_{k \le i < N} (1-v_i)

\end{equation}

In the end, the likelihood of the entire data given hypothesis $k$ is the conjunction of the two above likelihood.

\begin{equation}

P( d | k ) = \left( \prod_{0 \le i < k} v_i \right) \cdot \left( \prod_{k \le i < N} (1-v_i) \right)

\end{equation}

## The algorithm

So, for each $k \in \{0, \dots, N\}$, we want to compute $P(d|k)$ and take the maximising $k$, i.e. $\text{argmax}_{0\le k\le N} P(d|k). It’s a big product. The naive algorithm, in python, would be written.

prob_of_data_if_k = []

for k in xrange( 0, N+1 ):

likelihood_for_k = 1.0

for i in xrange( 0, k ):

likelihood_for_k *= v[i]

for i in xrange( k, n+1 )

likelihood_for_k *= 1-v[i]

prob_of_data_if_k.append( likelihood_for_k )

The complexity is in $\matcal{O}(N^2)$. We can do better, in $\matcal{O}(N\cdot\log(N))$. Notice how many of the product are repeated in the formulas for each $k$. If $k_1$, $k_2$ are two distinct hypotheses, then their corresponding products both involve the factors $v_0 \cdot \dots \cdot v_{k_1-1}]$ and on $v_{k_2} \cdot \dots \cdot v_{N-1}]$. If somehow, part of the computation at one step could be reused at a later step, a lot of products could be avoided.

The ideas is to construct a pyramid of partial products. The bottom of the pyramid consists of the $N$ elements of $d$. The layer above contains $\left\lceil\frac{N}{2}\right\rceil$ elements, each one being the product of two consecutive elements of the base layer. The next layer contains one half less of elements, and consists of the product of two consecutive elements of the second layer, i.e. the product of four consecutive elements of $d$. An so forth until the top of the pyramid is reached, which consists of a single element that is the product of all the elements of $d$.

A similar pyramid can be made for the factors $(1-v_i)$. To avoid looping twice, a stack containing pairs of products of $v_i$’s and products of $(1-v_i)$’s could be build. Let’s call the pyramid $S$ and let $S^+(l,j)$ be the $j$th product of $v_i$’s at layer $l$ of the pyramid $S$. Likewise, let $S^-(l,j)$ be the $j$th product of $1-v_i$’s.

Given $k$, how do we find $\prod_{0 \le i < k} v_i $ with this pyramid? Let’s decompose $k$ into its binary form. Thus for some $l \in \mathbb{N}$, $k$ is written \begin{equation} k = \sum_{l} 2^l. \end{equation} Let $a$ be an accumulator that will be recursively multiplied to obtain the final product and let $b$ be an accumulator that will record how many factors of $v$ were already multiplied. For each of the exponents $l$ in the decomposition of $k$, let’s multiply $a$ by one of the precomputed product at layer $l$ in the pyramid. We begin with the largest possible $l$ and go top-down. $a$ starts at value $1$ and $b$ at value $0$.

\begin{equation} a := a \cdot S^+\left(l, \frac{b}{2^l}\right), b := b + 2^l. \end{equation}

Since the iteration is bottom-up, $\frac{b}{2^l}$ is always an integer. Indeed, $b$ is a sum of $2^m$ with $m>l$, so a factor $2^l$ can be extracted from $b$.

$\prod_{k \le i < N} (1-v_i)$ can be computed in a similar way. However, the order is reversed. For hypothesis $k$, decompose $N-k$ in base two and iterate over the exponent $l$ of that decomposition indexing layer $l$ from the right.

Care must be take when the number of elements in $d$ is not a power of $2$. Indeed the algorithm described above assumes each elements at one layer is the product of exactly two elements from the layer below. If $|d|$ is not a power of $2$, then one element at layer $l$ might single and thus be simply copied to at layer $l+1$. Those “single” elements should be at the right end of the layer for the products in $S^+$ and at the right of the layer for products in $S^-$.

Here is a python pseudo-code to get that final product from the pyramid.

#building the pyramid

stack = [ list(zip(data,map(lambda x: 1-x, data))) ]

n = len( stack[-1] )

while n > 1:

new_layer = []

for j in xrange( 0, n, 2 ):

positive_product = stack[-1][2*j][0] * stack[-1][2*j+1][0]

new_layer.append( (positive_product, None) )

if n%2 == 1:

new_layer.append( (stack[-1][-1][0], None) )

for j in xrange( -1, -n-1, -2 ):

negative_product = stack[-1][2*j][1] * stack[-1][2*j-1][1]

new_layer[-j][1] = negative_product

if n%2 == 1:

new_layer[0][1] = stack[-1][-1][1]

stack.append( new_layer )

n = len(stack[-1])

#computing step $k$

def probability_of_data_if_k( k ):

l = ceil( log( k, base=2 ) )

a = 1

b = 0

while l >= 0:

if (k-b) % 2**l != 0:

a *= stack[l][ b / 2**l ]

b += 2**l

l -= 1

b = 0

l = ceil( log( len(stack[0]) – k, base=2 )

while l >= 0:

if (neg_k-b) % 2**l != 0:

a *= stack[l][ len(stack[l]) – b / 2**l ]

b += 2**l

l -= 1

return a