Create a Dynamic Workflow in Apache Airflow

Problem

For a long time I searched for a way to properly create a workflow where the tasks depend on dynamic values, based on a list of tables contained in a text file.

Context explained through a graphical example

A schematic overview of the DAG’s structure.

                |---> Task B.1 --|
                |---> Task B.2 --|
 Task A --------|---> Task B.3 --|--------> Task C
                |       ....     |
                |---> Task B.N --|
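
In code, this fan-out structure can be generated dynamically when Airflow parses the DAG file. Below is a minimal sketch, assuming Airflow 2-style imports and a hypothetical text file listing one table name per line (the file path, DAG id, and task names are placeholders, not the exact setup described here):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.empty import EmptyOperator

    # Hypothetical file listing one table name per line.
    TABLES_FILE = "/path/to/tables.txt"

    with open(TABLES_FILE) as f:
        tables = [line.strip() for line in f if line.strip()]

    with DAG(
        dag_id="dynamic_import",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        task_a = EmptyOperator(task_id="task_a")
        task_c = EmptyOperator(task_id="task_c")

        # One "Task B.x" per table read from the file.
        for table in tables:
            task_b = BashOperator(
                task_id=f"import_{table}",
                bash_command=f"echo importing {table}",
            )
            task_a >> task_b >> task_c

Because the file is read at parse time, adding a table to the text file creates a new branch the next time the scheduler parses the DAG, with no code change.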

The problem is to import tables from an IBM Db2 database into HDFS / Hive using Sqoop, a powerful tool designed to efficiently transfer bulk data from a relational database to HDFS. The whole process is automated through Airflow, an open-source tool for orchestrating complex computational workflows and data processing pipelines.

The imported data is aggregated daily using Spark jobs, but I'll only cover the import part, where I schedule the Sqoop jobs to dynamically import data into HDFS.
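
For the import tasks themselves, each Task B.x can shell out to a sqoop import for its table. A hedged sketch of what each task's command might look like, replacing the echo placeholder in the sketch above (host, port, database, user, and target paths are all illustrative, not the real setup):

    # One Sqoop import per table; every value below is a placeholder.
    sqoop_cmd = (
        "sqoop import "
        "--connect jdbc:db2://db2-host:50000/MYDB "
        "--username my_user "
        "--password-file hdfs:///user/me/.sqoop_password "
        f"--table {table} "
        f"--target-dir /data/raw/{table} "
        f"--hive-import --hive-table staging.{table}"
    )
    task_b = BashOperator(
        task_id=f"sqoop_import_{table}",
        bash_command=sqoop_cmd,
    )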

Hosmer-Lemeshow in Python

Before putting a model into production, we should test it and verify that the model we have assumed is correctly specified, with the right assumptions. In this article I present a method for testing such a model: the Hosmer-Lemeshow test, a goodness-of-fit test for logistic regression models.
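
For reference, the test splits the observations into G groups (typically deciles of predicted probability) and compares observed and expected event counts with the statistic

    H = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{E_g \, (1 - E_g / n_g)}

where O_g is the number of observed events in group g, E_g the expected number (the sum of predicted probabilities in the group), and n_g the group size. Under the null hypothesis that the model fits, H approximately follows a chi-square distribution with G - 2 degrees of freedom.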

To perform the Hosmer-Lemeshow test, you’ll need a dataset.

Download: credits_linear_regression_score.csv

This dataset contains relevant information about the scores of people applying for credit.

First, we need to load the dataset from the CSV file into a new DataFrame with the pandas library.
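
A minimal sketch of both steps, the load and the test itself, assuming hypothetical column names ("target" for the observed 0/1 outcome and "score" for the predicted probability; the actual columns in the CSV may differ):

    import pandas as pd
    from scipy import stats

    data = pd.read_csv("credits_linear_regression_score.csv")

    # Hypothetical column names: "target" is the observed 0/1 outcome,
    # "score" the probability predicted by the model.
    y_true = data["target"]
    y_prob = data["score"]

    # Hosmer-Lemeshow: bin observations into deciles of predicted
    # probability and compare observed vs. expected event counts.
    deciles = pd.qcut(y_prob, q=10, duplicates="drop")
    grouped = pd.DataFrame({"obs": y_true, "exp": y_prob}).groupby(
        deciles, observed=True
    )
    obs = grouped["obs"].sum()   # observed events per group
    exp = grouped["exp"].sum()   # expected events per group
    n = grouped["obs"].size()    # observations per group

    # Chi-square statistic with (number of groups - 2) degrees of freedom.
    h = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
    p_value = stats.chi2.sf(h, df=len(obs) - 2)
    print(f"H = {h:.3f}, p-value = {p_value:.3f}")

A small p-value (for example below 0.05) is evidence that the predicted probabilities do not match the observed outcomes, i.e. the model is poorly calibrated.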
