# DAG Factory for TMS Data Ingestion

## Overview

This repository contains a **DAG factory** that generates multiple Apache Airflow DAGs to ingest data from a **Treasury Management System (TMS)** into the data warehouse.

The factory dynamically creates one DAG per TMS dataset, using **YAML-based layouts** to define parameters and metadata. Each DAG:

- Calls the **TMSDB CLI connector** (`TMSDB.py`) to retrieve data in CSV format.
- Loads the data into object storage.
- Creates or refreshes **Oracle external tables** if needed.
- Registers workflow metadata in MRDS tables.
- Processes the landed file for downstream use.

---

## Components

### 1. DAG Factory (`create_dag`)

- **Purpose**: auto-generates one DAG per TMS dataset.
- **Inputs**:
  - `TMS-layouts/<DAG_NAME>.yml`: defines report parameters, visible/hidden flags, and virtual/replacement parameters.
  - `config/TMS.yml`: holds system-wide TMS connection info and storage prefixes.
- **Outputs**:
  - Airflow DAG objects named like `w_ODS_TMS_<ENTITY>`.

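The pattern can be sketched as follows. This is illustrative only: the real `create_dag` reads the dataset's YAML layout and builds a full Airflow DAG, whereas a plain dict stands in for it here, and entity names other than `TRANSACTION` are assumptions.

```python
# Illustrative sketch of the factory pattern. Registering each generated DAG
# in globals() is what lets Airflow discover dynamically created DAG objects
# when it parses the module.
def create_dag(dag_name, registry):
    dag = {"dag_id": dag_name}  # stand-in for the real Airflow DAG object
    registry[dag_name] = dag
    return dag

# Entity names besides TRANSACTION are made up for the example.
for entity in ("TRANSACTION", "BALANCE"):
    create_dag(f"w_ODS_TMS_{entity}", globals())

print(sorted(name for name in globals() if name.startswith("w_ODS_TMS_")))
# → ['w_ODS_TMS_BALANCE', 'w_ODS_TMS_TRANSACTION']
```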
### 2. TMSDB Connector (`TMSDB.py`)

- **Purpose**: CLI tool that interacts with the TMS service.
- **Commands**:
  - `retrieve`: fetch rows from TMS into CSV, spool to storage, and return exit codes (`0` = data, `1` = no data).
  - `create-oracle-table`: generate an Oracle DDL file based on dataset metadata.
  - `create-model`: generate dbt models for dataset integration.
- **Behavior**:
  - Adds synthetic columns (`A_KEY`, `A_WORKFLOW_HISTORY_KEY`).
  - Supports additional columns via `-c`.
  - Uploads to object storage if `bucket:path/file.csv` is given.

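A sketch of what the synthetic-column behaviour might look like when spooling rows to CSV. The exact semantics of `A_KEY` are not documented here, so the per-row counter and the workflow key value below are assumptions for illustration.

```python
# Illustrative spooling: prepend synthetic columns (a row key and the
# workflow history key) plus any extra "-c" columns to each source row.
import csv
import io

def spool(rows, workflow_key, extra_columns=()):
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i, row in enumerate(rows, start=1):
        writer.writerow([i, workflow_key, *extra_columns, *row])
    return buf.getvalue()

print(spool([["2025-01-01", "100.00"]], workflow_key=42))
# → "1,42,2025-01-01,100.00\r\n"
```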
### 3. Manage Files (`mf`)

Utilities for file-level operations:

- `execute_query(sql)`
- `add_source_file_config(...)`
- `process_source_file(prefix, file)`
- `create_external_table(table, source, prefix)`
- `add_column_date_format(...)`

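A sketch of how these helpers might be chained on a dataset's first run. A simple recorder stands in for the real `utils.manage_files` module, and all arguments are illustrative:

```python
# RecordingMF only records which helper was invoked, in order, so the
# control flow can be shown without a database connection.
class RecordingMF:
    def __init__(self):
        self.calls = []
    def __getattr__(self, name):
        return lambda *args, **kwargs: self.calls.append(name)

def bootstrap_and_process(mf, config_exists):
    if not config_exists:  # first run: create the table and register the file
        mf.execute_query("CREATE TABLE ...")  # DDL from TMSDB.py create-oracle-table
        mf.add_column_date_format("W_TMS_TRANSACTION", "LOAD_DATE", "YYYY-MM-DD")
        mf.create_external_table("W_TMS_TRANSACTION", "tms", "landing/tms/")
        mf.add_source_file_config("tms", "transaction.csv")
    mf.process_source_file("landing/tms/", "transaction.csv")

mf = RecordingMF()
bootstrap_and_process(mf, config_exists=False)
print(mf.calls)
```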
### 4. Manage Runs (`mr`)

Utilities for workflow tracking:

- `init_workflow(db, wf_name, run_id)`
- `set_workflow_property(key, db, name, value)`
- `finalise_workflow(key, status)`
- `select_ods_tab(table, expr, cond)`

---

## How a DAG Works

### DAG Structure

Each DAG has a single task:

- `retrieve_report`: a `PythonOperator` that orchestrates all steps internally.

### Task Flow

1. **Read YAML configs**
   - Parameters are split into visible (exposed in the Airflow UI) and hidden.
   - System config (URL, credentials, bucket/prefix) is loaded from `config/TMS.yml`.

2. **Parameter processing**
   - Builds the Cartesian product of all parameter lists.
   - Supports:
     - `column(...)` aligned columns.
     - `select(...)` SQL evaluation (restricted tables only).
     - Virtual parameters (dropped later).
     - Replace-parameter logic.

3. **Workflow init**
   - `mr.init_workflow` creates a workflow key.

4. **Data retrieval**
   - Builds a `TMSDB.py retrieve` command and runs it via subprocess.
   - Handles return codes:
     - `0`: data returned.
     - `1`: no data → workflow finalized as success.
     - any other code: error → workflow finalized as failure.

5. **First-run bootstrap**
   - If no config exists for the dataset:
     - Run `TMSDB.py create-oracle-table` to generate DDL.
     - Execute the DDL via `mf.execute_query`.
     - Add date formats and create the external table with `mf.create_external_table`.
     - Register the config with `mf.add_source_file_config`.

6. **File processing**
   - `mf.process_source_file(prefix, filename)` ingests the CSV.

7. **Workflow finalization**
   - `mr.finalise_workflow(wf_key, 'Y' | 'N')`.

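The Cartesian-product expansion in step 2 can be sketched with `itertools.product`; the parameter names and values below are made up for illustration:

```python
# Each run covers every combination of the list-valued parameters.
from itertools import product

params = {
    "entity": ["GB01", "GB02"],   # illustrative values
    "currency": ["EUR", "USD"],
}
combinations = [dict(zip(params, values)) for values in product(*params.values())]
print(len(combinations))  # → 4
print(combinations[0])    # → {'entity': 'GB01', 'currency': 'EUR'}
```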
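Step 4's return-code handling can be sketched like this; the command at the bottom is a stand-in for the real `TMSDB.py retrieve` invocation:

```python
# Run the connector with shell=False and map its exit code to an outcome:
# 0 → data landed, 1 → no data (still a successful workflow), else → failure.
import subprocess
import sys

def run_retrieve(cmd):
    result = subprocess.run(cmd, shell=False)
    if result.returncode == 0:
        return "data"
    if result.returncode == 1:
        return "no_data"
    return "error"

# Stand-in command that exits 1, as TMSDB.py does when no rows match:
print(run_retrieve([sys.executable, "-c", "raise SystemExit(1)"]))  # → no_data
```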
---

## Example DAG

Example: `w_ODS_TMS_TRANSACTION`

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# default_args and execute_report are defined by the factory module.
with DAG(
    dag_id="w_ODS_TMS_TRANSACTION",
    default_args=default_args,
    schedule_interval=None,
    start_date=datetime(2025, 1, 1),
    catchup=False,
    params={"date_from": "2025-01-01", "date_to": "2025-01-31"},
) as dag:

    retrieve_report = PythonOperator(
        task_id="retrieve_report",
        python_callable=execute_report,
        execution_timeout=timedelta(minutes=30),
    )
```

---

## Repository Layout

```
tms/
├─ generate_tm_ods_dags.py   # DAG generator script (calls create_dag many times)
├─ TMS-layouts/
│  ├─ w_ODS_TMS_TRANSACTION.yml
│  └─ ...
├─ config/
│  └─ TMS.yml
└─ TMS-tables/               # Create-table SQL scripts
```

---

## Security Considerations

- **`eval()` is dangerous.**
  Only `select(...)` expressions are allowed, and they are whitelisted to safe tables.
- **No raw shell commands.**
  Use `subprocess.run([...], shell=False)` so arguments are never shell-interpreted.
- **Secrets in config.**
  The TMS username/password are stored in `TMS.yml`; they are better kept in Airflow Connections or a secrets manager.
- **Exit codes matter.**
  Workflow correctness relies on `TMSDB.py` returning the right codes (`0`, `1`, other).

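One way to get the credentials out of `TMS.yml` is an Airflow Connection supplied through the environment; the connection id `tms_default` and the URI below are assumptions for illustration, not part of this repo:

```python
# Airflow resolves AIRFLOW_CONN_<CONN_ID> environment variables into
# Connections, so the secret never needs to live in TMS.yml. Inside a task,
# BaseHook.get_connection("tms_default") would return the parsed connection.
import os
from urllib.parse import urlparse

os.environ["AIRFLOW_CONN_TMS_DEFAULT"] = "https://tms_user:s3cret@tms.example.com"

# Parsed here with urlparse only so the sketch runs without Airflow installed:
parsed = urlparse(os.environ["AIRFLOW_CONN_TMS_DEFAULT"])
print(parsed.username, parsed.hostname)  # → tms_user tms.example.com
```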
---

## Extending the Factory

### Add a new dataset

1. Create a YAML layout in `TMS-layouts/`, e.g.:

   ```yaml
   parameters:
     date_from:
       value: "2025-01-01"
     date_to:
       value: "2025-01-31"
   ```

2. Add a line in `generate_tm_ods_dags.py`:

   ```python
   create_dag("w_ODS_TMS_NEWENTITY")
   ```

3. Deploy the DAG file to Airflow.

### Run a DAG manually

In the Airflow UI:

1. Find the DAG `w_ODS_TMS_<ENTITY>`.
2. Trigger the DAG → optionally override visible parameters.
3. Monitor the logs for `retrieve_report`.

---
## Diagram

**DAG Factory Flow**

```mermaid
flowchart LR
    subgraph DAGFactory["DAG Factory"]
        direction TB
        B["Load TMS config (TMS.yml)"] --> C["Load dataset layout (YAML)"]
        C --> D["Extract visible & hidden parameters"]
        D --> E["Define Airflow DAG with retrieve_report task"]
        E --> F["Register DAG globally"]
        F --> G["Repeat for each dataset name"]
        G --> H["All DAGs available in Airflow"]
    end
    A["Airflow parses generate_tm_ods_dags.py"] --> DAGFactory
```

**Sample DAG Execution Flow**

```mermaid
flowchart LR
    subgraph ExampleDAG
        direction TB
        B[Read YAML configs] --> C[Build parameter combinations]
        C --> D["Evaluate select(...) and replace virtual params"]
        D --> E["Init workflow (mr.init_workflow)"]
        E --> F["Run TMSDB.py retrieve (subprocess)"]

        %% Branches on return codes
        F --> |rc=1: No data| G[Finalise workflow success]
        F --> |rc=0: Data returned| H[Check if source file config exists]
        F --> |other rc: Error| M[Finalise workflow failure]

        %% Config missing branch
        H --> |"Config missing (first run)"| I[Run TMSDB.py create-oracle-table → Generate DDL]
        I --> J[Execute DDL via mf.execute_query → Create Oracle external table]
        J --> K["Register file source config (mf.add_source_file_config)"]
        K --> L["Process landed file (mf.process_source_file)"]
        L --> N[Finalise workflow success]

        %% Config exists branch
        H --> |Config exists| P["Process landed file (mf.process_source_file)"]
        P --> N
    end
```

---

## Dependencies

- **Airflow 2.x**
- **Python 3.9+**
- **mrds** package (providing `utils.manage_files` and `utils.manage_runs`)
- **Oracle client / Impala client** (for table creation and querying)
- **Object storage client** (for uploading CSVs)

---

## Summary

The DAG factory is a scalable way to create **dozens of ingestion DAGs** for TMS datasets with minimal boilerplate. It leverages:

- **YAML configs** for parameters,
- **TMSDB CLI** for data retrieval and DDL generation,
- **MRDS utilities** for workflow tracking and file handling.

It standardizes ingestion while keeping each dataset's DAG lightweight and uniform.