DATA ESSENTIAL

MAKE SENSE OF YOUR DATA


Create Dynamic Workflow in Apache Airflow

Problem

For a long time I searched for a way to properly create a workflow where the tasks depend on a dynamic value: a list of tables contained in a text file.

Context explanation through a graphical example

A schematic overview of the DAG’s structure.

                |---> Task B.1 --|
                |---> Task B.2 --|
 Task A --------|---> Task B.3 --|--------> Task C
                |       ....     |
                |---> Task B.N --|

The problem is to import tables from an IBM DB2 database into HDFS/Hive using Sqoop, a powerful tool designed for efficiently transferring bulk data from a relational database to HDFS, automatically through Airflow, an open-source tool for orchestrating complex computational workflows and data processing pipelines.

The imported data is then aggregated daily using Spark jobs, but I’ll only talk about the import part, where I schedule the Sqoop job to dynamically import data into HDFS.
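As an illustration, here is a minimal sketch of such a dynamic DAG: the number of Sqoop import tasks (B.1 … B.N) is driven by a text file read when the DAG is parsed. The file path, connection string, DAG id and task names are assumptions for the example, not the exact values from my setup.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="sqoop_dynamic_import",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

start = DummyOperator(task_id="task_a", dag=dag)  # Task A
end = DummyOperator(task_id="task_c", dag=dag)    # Task C

# The text file (one table name per line) drives how many tasks exist.
with open("/etc/airflow/tables.txt") as f:
    tables = [line.strip() for line in f if line.strip()]

for table in tables:  # Tasks B.1 .. B.N
    sqoop = BashOperator(
        task_id="sqoop_import_{}".format(table),
        bash_command=(
            "sqoop import --connect jdbc:db2://db2-host:50000/MYDB "
            "--table {0} --target-dir /data/raw/{0}".format(table)
        ),
        dag=dag,
    )
    start >> sqoop >> end

Because the task list is rebuilt each time the scheduler parses the file, adding a table to tables.txt is enough to grow the B.x fan-out.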

How to hide credentials in logstash configuration files?


Logstash 6.2 lets you protect credentials with the keystore.

Let’s see how to use logstash-keystore.

In the following example, we will hide the ‘changeme’ password from the Elasticsearch output of your Logstash pipeline config file.

To create a logstash.keystore file, open a terminal window and type the following commands:

./bin/logstash-keystore create
./bin/logstash-keystore add es_password

ℹ️ the default directory is the same directory as the logstash.yml settings file.

./bin/logstash-keystore list should now show es_password as the answer.

📌 The option --path.settings sets the directory for the keystore (e.g. bin/logstash-keystore --path.settings /etc/logstash create). The keystore must be located in Logstash’s path.settings directory.

📌 When you run Logstash from an RPM or DEB package installation, the environment variables are sourced from /etc/sysconfig/logstash. You might need to create /etc/sysconfig/logstash; keep in mind that this file should be owned by root with 600 permissions.

# use es_password in the pipeline:
output {
	elasticsearch {
		hosts => …
		user => "elastic"
		password => "${es_password}"
	}
}

ℹ️ you can set the environment variable LOGSTASH_KEYSTORE_PASS to act as the keystore password.
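For a non-interactive setup, a minimal sketch (the password value is only an example, not a recommended secret):

export LOGSTASH_KEYSTORE_PASS=mypassword   # must also be set in the environment when Logstash runs
./bin/logstash-keystore create
./bin/logstash-keystore add es_password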

Documentation

➡️ Official guide – logstash-keystore

To get help with the CLI, simply use: $ ./bin/logstash-keystore help

Kibana – Setup log rotation


:information_source: In this document we will see how to properly set up log rotation for kibana. The example is based on CentOS 7 but can easily be adapted for Ubuntu or any other Linux distribution.

Documentation

  • Kibana doesn’t handle log rotation, but it is built to work with an external process that rotates logs, such as logrotate.
  • The logrotate utility is designed to simplify the administration of log files on a system which generates a lot of log files. Logrotate allows for the automatic rotation, compression, removal and mailing of log files. Logrotate can be set to handle a log file daily, weekly, monthly or when the log file gets to a certain size.
  • Logrotate can be installed with the logrotate package. It is installed by default and runs daily.
  • The primary configuration file for logrotate is /etc/logrotate.conf; additional configuration files are included from the /etc/logrotate.d directory.

Prerequisites

kibana

  • Have the following elements (pid.file AND logging.dest) set up in the kibana.yml configuration file
    server.port: 5601
    server.host: "${KIBANA_SRV}"
    elasticsearch.url: "http://${ES_SRV}:9200"
    kibana.index: ".kibana"
    pid.file: /var/run/kibana/kibana.pid
    logging.dest: /var/log/kibana/kibana.log
    
  • :warning: verify that kibana is authorised to create the /var/log/kibana/kibana.log file.
    $ mkdir -p /var/log/kibana/ && chown -R kibana:kibana /var/log/kibana/
    $ mkdir -p /var/run/kibana/ && chown -R kibana:kibana /var/run/kibana/
    

logrotate

verify that logrotate is properly installed

$ logrotate --version
    logrotate 3.8.6

verify that the logrotate configuration includes the logrotate.d directory

$ cat /etc/logrotate.conf
  . . .
  include /etc/logrotate.d

Configuration

logrotate file

We will create a new logrotate configuration for kibana:

$ cat << 'EOF' > /etc/logrotate.d/elk-kibana
/var/log/kibana/*.log {
  missingok
  daily
  size 10M
  create 0644 kibana kibana
  rotate 7
  notifempty
  sharedscripts
  compress
  postrotate
    /bin/kill -HUP $(cat /var/run/kibana/kibana.pid 2>/dev/null) 2>/dev/null
  endscript
}
EOF

Verify your file syntax with the following command

logrotate -vd /etc/logrotate.d/elk-kibana

If you didn’t get any errors, you can manually start the first rotation with

logrotate -vf /etc/logrotate.d/elk-kibana

Crontab file

If the line include /etc/logrotate.d is present in /etc/logrotate.conf and logrotate.conf is referenced in /etc/cron.daily/logrotate, you don’t need any further setup.

grep "logrotate.conf" /etc/cron.daily/logrotate
    /usr/sbin/logrotate -s /var/lib/logrotate/logrotate.status /etc/logrotate.conf

 

Happy log rotation!
G.

Why CNCF landscape matters

I grant you, “Cloud Native” has become something of a buzzword, but there is still a reality behind it. A Cloud Native application leverages and takes advantage of Cloud features. And today, a Cloud Native application is likely to be split into microservices, these microservices run in containers, and these containers are orchestrated by Kubernetes.

But anyone who has looked at these technologies in recent years knows how fast they are evolving, which makes technology watch even more relevant, but also more complicated, much more complicated. Indeed: where do you find these new projects, how do you follow them, how do you evaluate their degree of maturity, and is it time to adopt them to solve our production problems?

Read more

Test your deployments locally with minikube

A chapter of the courses that I give demonstrates how easy it is to implement a multi-tier application on Rancher / Kubernetes (thanks to the Rancher Catalog). The recipe is the same, whether you are on Rancher 1.x or 2.0! The exercise also uses a storage driver that allows application deployment with persistent storage.

My application is on Github; it is simply a Node.js application with a MongoDB backend, which displays “Hello MyApp from MongoDB!” in the browser when it is called on the route /myapp. To get a bit familiar with Kubernetes, I decided to deploy this application directly on a cluster, with the CLI.

You can use any K8s cluster: one deployed with Rancher, or with kubeadm (see my previous blog), or do it locally with minikube. This is the option I chose, because it allows us to test our deployment locally. For the installation of minikube and kubectl, I invite you to read this tutorial.

If I consider that my application is a MongoDB backend and a Node.js frontend, what will I need in K8s? First of all, a volume for persistence; with minikube, I use local storage. A volume claim, to make this volume available to MongoDB. A MongoDB deployment. A service, to make MongoDB available to other parties. Another deployment for Node.js, as well as its service. And finally an ingress rule to access my application from outside my cluster. I will even push it a little further by using a configMap for my Node.js environment variables (the MongoDB URI). Ready?
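To give an idea of the shape of these manifests, here is a minimal sketch of the first two pieces, the volume claim and the configMap (the resource names and the MONGO_URI key are illustrative, not the exact manifests from my repository); the deployments, services and ingress follow the same kubectl apply -f pattern:

# persistent volume claim for MongoDB (minikube provisions local storage)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongo-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
# configMap holding the Node.js environment variable (MongoDB URI)
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-config
data:
  MONGO_URI: "mongodb://mongo:27017/myapp"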

Read more

Setup a secured redis-cluster on centos7


What is redis exactly?

Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker.

Redis Sentinel provides high availability for Redis. In practical terms, this means that using Sentinel you can create a Redis deployment that resists certain kinds of failures without human intervention.

⚠️ You need at least three Sentinel instances for a robust deployment.

Monitoring: Sentinel constantly checks if your master and slave instances are working as expected.
Notification: Sentinel can notify the system administrator when something is wrong with one of the monitored instances.

Automatic failover: if a master is not working as expected, Sentinel can start a failover process where a slave is promoted to master, the other additional slaves are reconfigured to use the new master, and the applications using the Redis server are informed about the new address to use when connecting.

ℹ️ Sentinel manages the failover; it doesn’t configure Redis for HA by itself. That is an important distinction.
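For a first taste, a minimal sentinel.conf sketch for one monitored master (the IP, port and timing values are examples): the quorum of 2 means two Sentinels must agree the master is down before a failover starts.

# monitor a master called mymaster; quorum of 2 Sentinels to declare it down
sentinel monitor mymaster 192.168.1.10 6379 2
# consider the master down after 5 seconds without a valid reply
sentinel down-after-milliseconds mymaster 5000
# abort a failover attempt that takes longer than 60 seconds
sentinel failover-timeout mymaster 60000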

Read more

Ansible – how to collect information about remote hosts with Gathers facts


In order to do so, we will discuss the Ansible setup module, which gathers facts.

:information_source: Official webpage: https://docs.ansible.com/ansible/setup_module.html

Display facts from all hosts and store them under /tmp/facts, indexed by hostname

ansible all -m setup --tree /tmp/facts

➡ now check the files to have a clear view of all the variables (facts) collected by Ansible for your host, like the well-known {{ inventory_hostname }}
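Once gathered, facts are usable as ordinary variables in a playbook. A minimal sketch (the fact names are standard Ansible facts; the file name and message are just examples):

# facts.yml – print two gathered facts for every host
- hosts: all
  gather_facts: yes
  tasks:
    - name: show distribution facts
      debug:
        msg: "{{ inventory_hostname }} runs {{ ansible_distribution }} {{ ansible_distribution_version }}"

Run it with: ansible-playbook -i your_inventory facts.yml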

Read more

Usefulness of an MQ layer in an ELK stack


First things first, disambiguation; let’s talk about these strange words.

What does ELK means?

An ELK stack is composed of Elasticsearch, Logstash, and Kibana; these 3 components are owned by the elastic.co company and are particularly useful to handle data (ship, enrich, index and search your data).

What does MQ mean?

In computer science, the message queue paradigm is a sibling of the publisher/subscriber pattern.
You can imagine an MQ component as working like several mailboxes, meaning that publisher(s) and subscriber(s) do not interact with the message at the same time; many publishers can post a message for one subscriber and vice versa.

➡ Redis, Kafka or RabbitMQ can be used as a buffer in the ELK stack.
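As an illustration with Redis, a minimal sketch of the two ends of the buffered pipeline (the host and key are examples): a shipper pushes events onto a Redis list, and an indexer pops them off to send them to Elasticsearch.

# shipper pipeline: push events into the Redis buffer
output {
  redis {
    host => "redis.example.com"
    data_type => "list"
    key => "logstash-buffer"
  }
}

# indexer pipeline: consume the buffer before indexing into Elasticsearch
input {
  redis {
    host => "redis.example.com"
    data_type => "list"
    key => "logstash-buffer"
  }
}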

Read more

Package manager tools for Debian-like systems

Introduction

The need is simple: create a local repository and mirror the official one, for Debian-like systems.

RedHat-like systems have multiple options; Debian-like systems, well, not so much.

In our case we need to manage Ubuntu servers. Canonical has Landscape, but we want to keep it simple.

The possibilities

For Debian-like systems there are 2 possibilities:

  • aptly, which seems to integrate with ansible (see the sketch below)
  • pulp_deb, which is a community project to add to an installation of Katello/Pulp
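As a quick taste of the aptly route, a minimal sketch (the mirror name, distribution and component are examples):

# create a local mirror of the Ubuntu xenial main component
aptly mirror create xenial-main http://archive.ubuntu.com/ubuntu/ xenial main
# download / refresh the packages
aptly mirror update xenial-main
# snapshot the mirror and publish it as a local repository
# (publishing needs a GPG signing key, or the -skip-signing flag)
aptly snapshot create xenial-snap from mirror xenial-main
aptly publish snapshot xenial-snap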

Persistent volumes with Rancher/Kubernetes on AWS

Volume persistence in Kubernetes (and other orchestrators) is in full swing, and for good reason: Kubernetes is no longer content with being a stateless runtime environment, but more and more often a stateful one … Let’s see how, with Rancher and AWS, we can have volumes provisioned automatically.

High level architecture on AWS

For this blog I use a single group of EC2 instances, all in the same security group, which allows all traffic between instances.

I’ve bootstrapped the EC2 instances with the latest RancherOS version available on AWS, which is currently v1.0.2.
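To anticipate where this is going: dynamic provisioning on AWS rests on a StorageClass backed by the built-in EBS provisioner. A minimal sketch (the class name and volume type are examples):

# StorageClass using the built-in AWS EBS provisioner
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: aws-ebs-gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2

A PersistentVolumeClaim that references this class will then get an EBS volume created for it automatically.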

Read more
