DATA ESSENTIAL

## Persistent volumes with Rancher/Kubernetes on AWS

Volume persistence in Kubernetes (and other orchestrators) is in full swing, and for good reason: Kubernetes is no longer content to be a stateless runtime environment, but more and more often a stateful one. Let’s see how, with Rancher on AWS, we can have volumes provisioned automatically.

### High-level architecture on AWS

For this blog I use a single group of EC2 instances, all in the same security group, which allows all traffic between instances.

I’ve bootstrapped the EC2 instances with the latest RancherOS version available on AWS, which is currently v1.0.2.
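To give a flavour of what automatic provisioning looks like, here is a minimal sketch (resource names and sizes are illustrative, not from the original post): a StorageClass backed by the built-in AWS EBS provisioner, and a PersistentVolumeClaim that references it. Any claim using this class gets an EBS volume created and attached for it automatically.

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: aws-ebs-gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: data-claim
spec:
  storageClassName: aws-ebs-gp2
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```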

## Hosmer-Lemeshow in Python

Before putting a model into production, you should test it and verify that the model you have assumed is correctly specified, with the right assumptions. In this article I present a method to test your model: the Hosmer-Lemeshow test.

To perform the Hosmer-Lemeshow test, you’ll need a dataset.

This dataset contains credit-scoring information about people applying for a loan.

First, we load the dataset from the CSV file into a new DataFrame with the pandas library.
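With the data loaded, the test itself fits in a few lines. The sketch below (the synthetic data at the end is illustrative, not the credit dataset) groups observations into deciles of predicted probability, then compares observed against expected event counts with a chi-square statistic:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, g=10):
    """Hosmer-Lemeshow goodness-of-fit test.

    Splits observations into g groups by quantiles of the predicted
    probability, then compares observed vs expected event counts.
    Returns (H statistic, p-value), with g - 2 degrees of freedom.
    """
    df = pd.DataFrame({"y": y_true, "p": y_prob})
    # Bin by quantiles of the predicted probabilities (deciles by default)
    df["group"] = pd.qcut(df["p"], g, duplicates="drop")
    grouped = df.groupby("group", observed=True)
    obs = grouped["y"].sum()   # observed events per group
    exp = grouped["p"].sum()   # expected events per group
    n = grouped.size()         # group sizes
    h = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
    dof = grouped.ngroups - 2
    return h, chi2.sf(h, dof)

# Illustrative usage on synthetic, well-calibrated scores:
rng = np.random.RandomState(0)
p = rng.uniform(0.05, 0.95, 2000)
y = rng.binomial(1, p)
h, pval = hosmer_lemeshow(y, p)
print(f"H = {h:.2f}, p-value = {pval:.3f}")
```

A large p-value means we cannot reject the hypothesis that the model is well calibrated; a small one signals a specification problem.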

## Hands on Kubernetes with kubeadm

In this technical blog, I’ll show you how to:

• Create four EC2 instances on AWS with CentOS 7.2
• Install Docker on CentOS with the LVM devicemapper storage driver
• Install Kubernetes and kubeadm on each instance
• Run your master and join the minions to it
• Launch your first nginx Pod
• Create your first Replication Controller
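To give a flavour of the last two steps, here is a minimal sketch of an nginx Pod and a ReplicationController manifest (names and replica count are illustrative); each would be applied with `kubectl create -f <file>.yaml`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
    - name: nginx
      image: nginx
      ports:
        - containerPort: 80
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-rc
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
```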

## Introducing the Rancher Partner Network

on Oct 18, 2016

This morning, we’re excited to announce the launch of the Rancher Partner Network – a group of leading organizations focused on building top-notch cloud and container solutions for their customers. These are vendors with whom we collaborate, and whom we trust and endorse to help enterprises bring containers into their development workflows and production environments. The Rancher Partner Network includes consulting partners, systems integrators, resellers, and service providers from the Americas, Europe, Asia, and Australia.

## FinTechMeets: Unlocking the Power of Big Data

How harnessing big data & machine learning will change the face of business

Yesterday, on September 22, 2016, lux future lab held its second FinTechMeets at BGL BNP Paribas in Kirchberg, focusing on what has become a hot topic: big data and how to harness it.

Roughly 350 IT and financial professionals attended the lunchtime gathering, which centered on a global overview of big data, machine learning and their practical applications.

Jonathan Basse, Founder of Data Essential, and Jed Grant, CEO & Founder of KYC3, shared their insights based on years of experience developing solutions that access this wealth of information.

## Hortonworks and Pivotal expand relationship

Hortonworks, Inc. and Pivotal today announced a significant expansion of their strategic relationship centered on the Hortonworks Data Platform and Pivotal HDB. This strategic relationship brings together Hortonworks’ expertise and support for data management and processing with Pivotal’s top analytics engine for Apache Hadoop.

## Hortonworks Splits Core Hadoop from Extended Services

Hortonworks today announced a major change to the way it distributes its Hadoop software. Starting with Hortonworks Data Platform (HDP) 2.4, Hortonworks is increasing the velocity of its release cycle by balancing customers’ requirements for a stable platform with rapid access to the latest innovations in the open source community. To accomplish this goal, Hortonworks announced a new release strategy.

## Best list separator

In my previous article about PDF parsing, I explained how approximate regular expressions could be used to find landmark elements typed by humans. I will now discuss another technique, from the realm of machine learning, to find the best way to separate a text into several distinct components. Specifically, I will search for the best separator between the header and the content of an article.

The technique is a simple yet efficient example of a maximum likelihood problem. The purely mathematical version of the problem is the following: given a finite list of values between 0 and 1, what is the best index k such that the values to the left of that index are mostly close to 1, while the values to its right are mostly close to 0? The problem is simple enough that we can compute all the possible solutions and pick the best. The problem can be stated for multiple separators; however, the complexity of computing all solutions is exponential in the number of separators.

The problem relates to the parsing of the PDFs through a list of scores, the $i$-th cell representing how likely the $i$-th line is to be part of the header of the article, as opposed to being part of its content. The separator of the list, then, is the index at which the header stops and the content starts. I used the technique to find where the table of contents at the beginning of a PDF ended, and where the first article started.
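As a sketch of the idea (treating each score as the probability that the line belongs to the header), the best index k maximizes the log-likelihood of scores[:k] being header lines and scores[k:] being content lines, which a single left-to-right sweep computes in O(n):

```python
import math

def best_separator(scores, eps=1e-9):
    """Return the index k that best splits `scores` so that
    scores[:k] are mostly close to 1 (header) and scores[k:]
    mostly close to 0 (content), by maximum likelihood."""
    # Start with k = 0: every line counted as content.
    ll = sum(math.log(max(1.0 - s, eps)) for s in scores)
    best_k, best_ll = 0, ll
    for k, s in enumerate(scores, start=1):
        # Moving the separator one step right converts scores[k-1]
        # from a "content" term to a "header" term in the likelihood.
        ll += math.log(max(s, eps)) - math.log(max(1.0 - s, eps))
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k

print(best_separator([0.9, 0.8, 0.2, 0.1]))  # 2
```

The `eps` clamp simply guards against taking the logarithm of zero when a score is exactly 0 or 1.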

## Approximate Regular Expressions

This summer, I had the challenge of parsing a legal archive about Luxembourgish companies. The archive followed a fairly strict file format, but the documents were typed by humans. Furthermore, they were only available as PDF files.

We needed to extract the content of the whole archive to get the list of company names, together with various references and information about them. For simplicity, the PDF was first converted into text files with pdfminer. The conversion was not perfect, with some lines out of order.

Nevertheless, since the format of the archive was somewhat rigid, I first tried to build a finite state automaton, with transitions chosen by matching whole lines of text in the current state. Matching could involve regular expressions, counting words, or combining lines into one.

This FSA crashed a lot. I tried to implement a fallback system so it could recover from crashes and state errors, but it was still a very poor implementation. A little error could jeopardize the decision to transition along the FSA and make it fail on large parts of the archive. Typed by humans and converted from a PDF, those text errors were numerous. When it didn’t crash and relied on the fallback system, it recorded absurd information in some fields, which proved very difficult to clean, and the number of human interventions needed to fine-tune everything was staggering.

So I rewrote the whole content extractor from scratch, using a more heuristic approach that is tolerant of errors without requiring a lot of tuning. The first idea I had was approximate regular expressions, which I will discuss in this article.
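As a minimal sketch of the core idea (the landmark string and error budget below are illustrative, not from the original parser), a fixed landmark can be matched approximately by bounding the Levenshtein edit distance; for full regular expressions, the third-party `regex` module offers fuzzy patterns such as `(?:MEMORIAL){e<=2}`.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def approx_match(line: str, landmark: str, max_errors: int = 2) -> bool:
    """True if `line` is within `max_errors` edits of `landmark`,
    ignoring case -- tolerant of the typos and conversion noise
    introduced by human typing and PDF extraction."""
    return edit_distance(line.strip().lower(), landmark.lower()) <= max_errors

print(approx_match("MEMORlAL", "MEMORIAL", max_errors=1))  # True
```

Allowing one or two edits per landmark is usually enough to absorb a mistyped or misconverted character without matching unrelated lines.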

## On interactive storytelling and Big Data: Chris Crawford’s Siboot

Earlier this week, I came across an article about Chris Crawford’s current project in video games. Like everybody, I’ve heard of Chris mostly because of his famous 1992 CGDC lecture, in which he announced his exit from the mainstream game industry by charging away brandishing a sword. But I had never wondered to what new horizons such a singular figure would ride.