# DAG Factory for TMS Data Ingestion

## Overview

This repository contains a **DAG factory** that generates multiple Apache Airflow DAGs to ingest data from a **Treasury Management System (TMS)** into the data warehouse. The factory dynamically creates one DAG per TMS dataset, using **YAML-based layouts** to define parameters and metadata.

Each DAG:

- Calls the **TMSDB CLI connector** (`TMSDB.py`) to retrieve data in CSV format.
- Loads the data into object storage.
- Creates or refreshes **Oracle external tables** if needed.
- Registers workflow metadata in MRDS tables.
- Processes the landed file for downstream use.

---

## Components

### 1. DAG Factory (`create_dag`)

- **Purpose**: auto-generates a DAG for each TMS dataset.
- **Inputs**:
  - `TMS-layouts/<dataset>.yml`: defines report parameters, visible/hidden flags, and virtual/replacement parameters.
  - `config/TMS.yml`: holds system-wide TMS connection info and storage prefixes.
- **Outputs**:
  - Airflow DAG objects named like `w_ODS_TMS_<DATASET>`.

### 2. TMSDB Connector (`TMSDB.py`)

- **Purpose**: CLI tool that interacts with the TMS service.
- **Commands**:
  - `retrieve`: fetch rows from TMS into CSV, spool to storage, and return an exit code (`0` = data, `1` = no data).
  - `create-oracle-table`: generate an Oracle DDL file based on dataset metadata.
  - `create-model`: generate dbt models for dataset integration.
- **Behavior**:
  - Adds synthetic columns (`A_KEY`, `A_WORKFLOW_HISTORY_KEY`).
  - Supports additional columns via `-c`.
  - Uploads to object storage if a `bucket:path/file.csv` target is given.

### 3. Manage Files (`mf`)

Utilities for file-level operations:

- `execute_query(sql)`
- `add_source_file_config(...)`
- `process_source_file(prefix, file)`
- `create_external_table(table, source, prefix)`
- `add_column_date_format(...)`

### 4. Manage Runs (`mr`)

Utilities for workflow tracking:

- `init_workflow(db, wf_name, run_id)`
- `set_workflow_property(key, db, name, value)`
- `finalise_workflow(key, status)`
- `select_ods_tab(table, expr, cond)`

---

## How a DAG Works

### DAG Structure

Each DAG has a single task:

- `retrieve_report`: a `PythonOperator` that orchestrates all steps internally.

### Task Flow

1. **Read YAML configs**
   - Parameters are split into visible (exposed in the Airflow UI) and hidden.
   - System config (URL, credentials, bucket/prefix) is loaded from `config/TMS.yml`.
2. **Parameter processing**
   - Cartesian product of parameter lists.
   - Support for:
     - `column(...)` aligned columns.
     - `select(...)` SQL evaluation (restricted tables only).
     - Virtual parameters (dropped later).
     - Replace-parameter logic.
3. **Workflow init**
   - `mr.init_workflow` creates a workflow key.
4. **Data retrieval**
   - Build a `TMSDB.py retrieve` command.
   - Run it via subprocess.
   - Handle return codes:
     - `0`: data returned.
     - `1`: no data → workflow finalized as success.
     - any other non-zero code: error → workflow finalized as failure.
5. **First-run bootstrap**
   - If no config exists for the dataset:
     - Run `TMSDB.py create-oracle-table` to generate SQL.
     - Execute the SQL via `mf.execute_query`.
     - Add date formats and the external table with `mf.create_external_table`.
     - Register the config with `mf.add_source_file_config`.
6. **File processing**
   - `mf.process_source_file(prefix, filename)` ingests the CSV.
7. **Workflow finalization**
   - `mr.finalise_workflow(wf_key, 'Y' | 'N')`.
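Two of the steps above lend themselves to a short sketch: the Cartesian expansion of parameter lists (step 2) and the dispatch on `TMSDB.py` exit codes (step 4). The helper names here (`expand_parameters`, `dispatch_retrieve`) are illustrative, not the actual function names in the repository:

```python
import itertools

def expand_parameters(params: dict) -> list:
    """Step 2: Cartesian product of parameter value lists, one dict per run."""
    keys = list(params)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(params[k] for k in keys))]

def dispatch_retrieve(rc: int):
    """Step 4: map the TMSDB.py `retrieve` exit code to the next action
    and the 'Y'/'N' status passed to mr.finalise_workflow."""
    if rc == 0:
        return ("process_file", None)  # data returned: continue to steps 5-6
    if rc == 1:
        return ("finalise", "Y")       # no data: finalise as success
    return ("finalise", "N")           # any other code: finalise as failure
```

For example, `expand_parameters({"ccy": ["EUR", "USD"], "book": ["A"]})` yields two parameter combinations, each of which would drive one `retrieve` call.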
---

## Example DAG

Example: `w_ODS_TMS_TRANSACTION`

```python
with DAG(
    dag_id="w_ODS_TMS_TRANSACTION",
    default_args=default_args,
    schedule_interval=None,
    start_date=datetime(2025, 1, 1),
    catchup=False,
    params={"date_from": "2025-01-01", "date_to": "2025-01-31"},
) as dag:
    retrieve_report = PythonOperator(
        task_id="retrieve_report",
        python_callable=execute_report,
        execution_timeout=timedelta(minutes=30),
    )
```

---

## Repository Layout

```
tms/
├─ generate_tm_ods_dags.py   # DAG generator script (calls create_dag many times)
├─ TMS-layouts/
│  ├─ w_ODS_TMS_TRANSACTION.yml
│  └─ ...
├─ config/
│  └─ TMS.yml
└─ TMS-tables/               # Create-table SQL scripts
```

---

## Security Considerations

- **`eval()` is dangerous.** Only `select(...)` is allowed, and it is whitelisted to safe tables.
- **No raw shell commands.** Use `subprocess.run([...], shell=False)` for safety.
- **Secrets in config.** The TMS username/password are stored in `TMS.yml`; they are better kept in Airflow Connections or a secrets manager.
- **Exit codes matter.** Workflow correctness relies on `TMSDB.py` returning the right codes (`0`, `1`, other).

---

## Extending the Factory

### Add a new dataset

1. Create a YAML layout in `TMS-layouts/`, e.g.:

   ```yaml
   parameters:
     date_from:
       value: "2025-01-01"
     date_to:
       value: "2025-01-31"
   ```

2. Add a line in `generate_tm_ods_dags.py`:

   ```python
   create_dag("w_ODS_TMS_NEWENTITY")
   ```

3. Deploy the DAG file to Airflow.

### Run a DAG manually

In the Airflow UI:

1. Find the DAG `w_ODS_TMS_<dataset>`.
2. Trigger the DAG → optionally override visible parameters.
3. Monitor the logs for `retrieve_report`.
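To make the "no raw shell commands" rule concrete, here is a minimal sketch of assembling the `retrieve` invocation as an argument list and running it without a shell. The `--param` flag and the argument order are assumptions for illustration only; consult `TMSDB.py` itself for the real CLI interface:

```python
import subprocess

def build_retrieve_command(dataset: str, params: dict, target: str) -> list:
    """Assemble the TMSDB.py retrieve call as an argv list (no shell).

    NOTE: the `--param` flag is hypothetical; the real flag names live
    in TMSDB.py."""
    cmd = ["python", "TMSDB.py", "retrieve", dataset]
    for name, value in params.items():
        cmd += ["--param", f"{name}={value}"]
    cmd.append(target)  # e.g. "bucket:path/file.csv" triggers the upload
    return cmd

# Passing a list with shell=False means parameter values reach TMSDB.py
# verbatim and are never interpreted by a shell:
# rc = subprocess.run(build_retrieve_command(...), shell=False).returncode
```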
---

## Diagram

**DAG Factory Flow**

```mermaid
flowchart LR
    subgraph DAGFactory["DAG Factory"]
        direction TB
        B["Load TMS config (TMS.yml)"] --> C["Load dataset layout (YAML)"]
        C --> D["Extract visible & hidden parameters"]
        D --> E["Define Airflow DAG with retrieve_report task"]
        E --> F["Register DAG globally"]
        F --> G["Repeat for each dataset name"]
        G --> H["All DAGs available in Airflow"]
    end
    A["Airflow parses generate_tm_ods_dags.py"] --> DAGFactory
```

**Sample DAG Execution Flow**

```mermaid
flowchart LR
    subgraph ExampleDAG
        direction TB
        B[Read YAML configs] --> C[Build parameter combinations]
        C --> D["Evaluate select(...) and replace virtual params"]
        D --> E["Init workflow (mr.init_workflow)"]
        E --> F["Run TMSDB.py retrieve (subprocess)"]

        %% Branches on return codes
        F -->|rc=1: No data| G[Finalise workflow success]
        F -->|rc=0: Data returned| H[Check if source file config exists]
        F -->|rc!=0: Error| M[Finalise workflow failure]

        %% Config missing branch
        H -->|"Config missing (first run)"| I[Run TMSDB.py create-oracle-table → Generate DDL]
        I --> J[Execute DDL via mf.execute_query → Create Oracle external table]
        J --> K["Register file source config (mf.add_source_file_config)"]
        K --> L["Process landed file (mf.process_source_file)"]
        L --> N[Finalise workflow success]

        %% Config exists branch
        H -->|Config exists| P["Process landed file (mf.process_source_file)"]
        P --> N
    end
```

---

## Dependencies

- **Airflow 2.x**
- **Python 3.9+**
- **mrds** package (providing `utils.manage_files` and `utils.manage_runs`)
- **Oracle client / Impala client** (for table creation and querying)
- **Object storage client** (for uploading CSVs)

---

## Summary

The DAG factory is a scalable way to create **dozens of ingestion DAGs** for TMS datasets with minimal boilerplate. It leverages:

- **YAML configs** for parameters,
- **TMSDB CLI** for data retrieval and DDL generation,
- **MRDS utilities** for workflow tracking and file handling.
It standardizes ingestion while keeping each dataset’s DAG lightweight and uniform.
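To close, the "Register DAG globally" step in the factory-flow diagram comes down to a small loop. This sketch substitutes a stub for the real `create_dag` (which builds an Airflow `DAG`), and the dataset list is illustrative:

```python
def create_dag(dag_id: str):
    """Stub for the real factory; the actual version builds and
    returns an airflow.DAG object for the dataset."""
    return dag_id

# Illustrative dataset list; in the repository there is one create_dag
# call per TMS-layouts/*.yml file.
DATASETS = ["w_ODS_TMS_TRANSACTION", "w_ODS_TMS_NEWENTITY"]

for _name in DATASETS:
    # Binding each DAG to a module-global name is what lets the Airflow
    # scheduler discover it when it imports generate_tm_ods_dags.py.
    globals()[_name] = create_dag(_name)
```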