Problem
For a long time I searched for a way to properly create a workflow whose tasks depend on a dynamic value: a list of tables stored in a text file.
Context explanation through a graphical example
A schematic overview of the DAG’s structure.
               |---> Task B.1 --|
               |---> Task B.2 --|
Task A --------|---> Task B.3 --|--------> Task C
               |      ....      |
               |---> Task B.N --|
The goal is to import tables from an IBM Db2 database into HDFS / Hive using Sqoop, a tool designed for efficiently transferring bulk data between relational databases and Hadoop. The imports should run automatically through Airflow, an open-source tool for orchestrating complex computational workflows and data processing pipelines.
The imported data is then aggregated daily by Spark jobs, but I’ll only cover the import part here, where I schedule the Sqoop jobs to dynamically import data into HDFS.
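To make the idea concrete, here is a minimal sketch in plain Python of how the list of tables could be turned into one Sqoop command per table (the Task B.1 to B.N branches in the diagram). It is not the final DAG. The file format (one table name per line, `#` for comments), the helper names, the `import_<table>` task ids, and the connection values used in the usage note are all assumptions made for illustration.

```python
import shlex
from pathlib import Path


def read_tables(path):
    """Return the table names listed in a text file, one per line.

    Blank lines and lines starting with '#' are skipped. This file format
    is an assumption for the sketch.
    """
    tables = []
    for line in Path(path).read_text().splitlines():
        name = line.strip()
        if name and not name.startswith("#"):
            tables.append(name)
    return tables


def sqoop_import_command(table, jdbc_url, username, password_file):
    """Build a single `sqoop import` command line for one table."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--hive-import",                # load the result into Hive
        "--hive-table", table.lower(),  # hypothetical naming convention
        "--num-mappers", "1",
    ]
    # Quote each argument so the command is safe to pass to a shell.
    return " ".join(shlex.quote(a) for a in args)


def build_import_tasks(tables, **conn):
    """Map a hypothetical task id to the Sqoop command for each table."""
    return {
        f"import_{table.lower()}": sqoop_import_command(table, **conn)
        for table in tables
    }
```

With a file containing `CUSTOMERS` and `ORDERS`, calling `build_import_tasks(read_tables("tables.txt"), jdbc_url="jdbc:db2://db2.example.com:50000/SALES", username="etl", password_file="/user/etl/.db2pw")` yields two entries, `import_customers` and `import_orders`. In Airflow, each command could then be handed to something like a `BashOperator`, one per table, placed between Task A and Task C.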