# DAG Factory for TMS Data Ingestion

## Overview

This repository contains a **DAG factory** that generates multiple Apache Airflow DAGs to ingest data from a **Treasury Management System (TMS)** into the data warehouse.

The factory dynamically creates one DAG per TMS dataset, using **YAML-based layouts** to define parameters and metadata. Each DAG:

- Calls the **TMSDB CLI connector** (`TMSDB.py`) to retrieve data in CSV format.
- Loads the data into object storage.
- Creates or refreshes **Oracle external tables** if needed.
- Registers workflow metadata in MRDS tables.
- Processes the landed file for downstream use.

---

## Components

### 1. DAG Factory (`create_dag`)

- **Purpose**: auto-generates one DAG per TMS dataset.
- **Inputs**:
  - `TMS-layouts/<DAG_NAME>.yml`: defines report parameters, visible/hidden flags, and virtual/replacement parameters.
  - `config/TMS.yml`: holds system-wide TMS connection info and storage prefixes.
- **Outputs**:
  - Airflow DAG objects named like `w_ODS_TMS_<ENTITY>`.

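The pattern can be sketched as follows. This is illustrative only: the real `create_dag` reads the dataset's YAML layout and builds a full Airflow DAG, whereas a plain dict stands in for it here, and entity names other than `TRANSACTION` are assumptions.

```python
# Illustrative sketch of the factory pattern. Registering each generated DAG
# in globals() is what lets Airflow discover dynamically created DAG objects
# when it parses the module.
def create_dag(dag_name, registry):
    dag = {"dag_id": dag_name}  # stand-in for the real Airflow DAG object
    registry[dag_name] = dag
    return dag

# Entity names besides TRANSACTION are made up for the example.
for entity in ("TRANSACTION", "BALANCE"):
    create_dag(f"w_ODS_TMS_{entity}", globals())

print(sorted(name for name in globals() if name.startswith("w_ODS_TMS_")))
# → ['w_ODS_TMS_BALANCE', 'w_ODS_TMS_TRANSACTION']
```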
### 2. TMSDB Connector (`TMSDB.py`)

- **Purpose**: CLI tool that interacts with the TMS service.
- **Commands**:
  - `retrieve`: fetch rows from TMS into CSV, spool to storage, and return exit codes (`0` = data, `1` = no data).
  - `create-oracle-table`: generate an Oracle DDL file based on dataset metadata.
  - `create-model`: generate dbt models for dataset integration.
- **Behavior**:
  - Adds synthetic columns (`A_KEY`, `A_WORKFLOW_HISTORY_KEY`).
  - Supports additional columns via `-c`.
  - Uploads to object storage if `bucket:path/file.csv` is given.

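A sketch of what the synthetic-column behaviour might look like when spooling rows to CSV. The exact semantics of `A_KEY` are not documented here, so the per-row counter and the workflow key value below are assumptions for illustration.

```python
# Illustrative spooling: prepend synthetic columns (a row key and the
# workflow history key) plus any extra "-c" columns to each source row.
import csv
import io

def spool(rows, workflow_key, extra_columns=()):
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i, row in enumerate(rows, start=1):
        writer.writerow([i, workflow_key, *extra_columns, *row])
    return buf.getvalue()

print(spool([["2025-01-01", "100.00"]], workflow_key=42))
# → "1,42,2025-01-01,100.00\r\n"
```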
### 3. Manage Files (`mf`)

Utilities for file-level operations:

- `execute_query(sql)`
- `add_source_file_config(...)`
- `process_source_file(prefix, file)`
- `create_external_table(table, source, prefix)`
- `add_column_date_format(...)`

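A sketch of how these helpers might be chained on a dataset's first run. A simple recorder stands in for the real `utils.manage_files` module, and all arguments are illustrative:

```python
# RecordingMF only records which helper was invoked, in order, so the
# control flow can be shown without a database connection.
class RecordingMF:
    def __init__(self):
        self.calls = []
    def __getattr__(self, name):
        return lambda *args, **kwargs: self.calls.append(name)

def bootstrap_and_process(mf, config_exists):
    if not config_exists:  # first run: create the table and register the file
        mf.execute_query("CREATE TABLE ...")  # DDL from TMSDB.py create-oracle-table
        mf.add_column_date_format("W_TMS_TRANSACTION", "LOAD_DATE", "YYYY-MM-DD")
        mf.create_external_table("W_TMS_TRANSACTION", "tms", "landing/tms/")
        mf.add_source_file_config("tms", "transaction.csv")
    mf.process_source_file("landing/tms/", "transaction.csv")

mf = RecordingMF()
bootstrap_and_process(mf, config_exists=False)
print(mf.calls)
```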
### 4. Manage Runs (`mr`)

Utilities for workflow tracking:

- `init_workflow(db, wf_name, run_id)`
- `set_workflow_property(key, db, name, value)`
- `finalise_workflow(key, status)`
- `select_ods_tab(table, expr, cond)`

---

## How a DAG Works

### DAG Structure

Each DAG has a single task:

- `retrieve_report`: a `PythonOperator` that orchestrates all steps internally.

### Task Flow

1. **Read YAML configs**
   - Parameters are split into visible (exposed in the Airflow UI) and hidden.
   - System config (URL, credentials, bucket/prefix) is loaded from `config/TMS.yml`.

2. **Parameter processing**
   - Builds the Cartesian product of all parameter lists.
   - Supports:
     - `column(...)` aligned columns.
     - `select(...)` SQL evaluation (restricted tables only).
     - Virtual parameters (dropped later).
     - Replace-parameter logic.

3. **Workflow init**
   - `mr.init_workflow` creates a workflow key.

4. **Data retrieval**
   - Builds a `TMSDB.py retrieve` command and runs it via subprocess.
   - Handles return codes:
     - `0`: data returned.
     - `1`: no data → workflow finalized as success.
     - any other code: error → workflow finalized as failure.

5. **First-run bootstrap**
   - If no config exists for the dataset:
     - Run `TMSDB.py create-oracle-table` to generate DDL.
     - Execute the DDL via `mf.execute_query`.
     - Add date formats and create the external table with `mf.create_external_table`.
     - Register the config with `mf.add_source_file_config`.

6. **File processing**
   - `mf.process_source_file(prefix, filename)` ingests the CSV.

7. **Workflow finalization**
   - `mr.finalise_workflow(wf_key, 'Y' | 'N')`.

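The Cartesian-product expansion in step 2 can be sketched with `itertools.product`; the parameter names and values below are made up for illustration:

```python
# Each run covers every combination of the list-valued parameters.
from itertools import product

params = {
    "entity": ["GB01", "GB02"],   # illustrative values
    "currency": ["EUR", "USD"],
}
combinations = [dict(zip(params, values)) for values in product(*params.values())]
print(len(combinations))  # → 4
print(combinations[0])    # → {'entity': 'GB01', 'currency': 'EUR'}
```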
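Step 4's return-code handling can be sketched like this; the command at the bottom is a stand-in for the real `TMSDB.py retrieve` invocation:

```python
# Run the connector with shell=False and map its exit code to an outcome:
# 0 → data landed, 1 → no data (still a successful workflow), else → failure.
import subprocess
import sys

def run_retrieve(cmd):
    result = subprocess.run(cmd, shell=False)
    if result.returncode == 0:
        return "data"
    if result.returncode == 1:
        return "no_data"
    return "error"

# Stand-in command that exits 1, as TMSDB.py does when no rows match:
print(run_retrieve([sys.executable, "-c", "raise SystemExit(1)"]))  # → no_data
```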
---

## Example DAG

Example: `w_ODS_TMS_TRANSACTION`

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# default_args and execute_report are defined by the factory module.
with DAG(
    dag_id="w_ODS_TMS_TRANSACTION",
    default_args=default_args,
    schedule_interval=None,
    start_date=datetime(2025, 1, 1),
    catchup=False,
    params={"date_from": "2025-01-01", "date_to": "2025-01-31"},
) as dag:

    retrieve_report = PythonOperator(
        task_id="retrieve_report",
        python_callable=execute_report,
        execution_timeout=timedelta(minutes=30),
    )
```

---

## Repository Layout

```
tms/
├─ generate_tm_ods_dags.py   # DAG generator script (calls create_dag many times)
├─ TMS-layouts/
│  ├─ w_ODS_TMS_TRANSACTION.yml
│  └─ ...
├─ config/
│  └─ TMS.yml
└─ TMS-tables/               # Create-table SQL scripts
```

---

## Security Considerations

- **`eval()` is dangerous.**
  Only `select(...)` expressions are allowed, and they are whitelisted to safe tables.
- **No raw shell commands.**
  Use `subprocess.run([...], shell=False)` so arguments are never shell-interpreted.
- **Secrets in config.**
  The TMS username/password are stored in `TMS.yml`; they are better kept in Airflow Connections or a secrets manager.
- **Exit codes matter.**
  Workflow correctness relies on `TMSDB.py` returning the right codes (`0`, `1`, other).

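One way to get the credentials out of `TMS.yml` is an Airflow Connection supplied through the environment; the connection id `tms_default` and the URI below are assumptions for illustration, not part of this repo:

```python
# Airflow resolves AIRFLOW_CONN_<CONN_ID> environment variables into
# Connections, so the secret never needs to live in TMS.yml. Inside a task,
# BaseHook.get_connection("tms_default") would return the parsed connection.
import os
from urllib.parse import urlparse

os.environ["AIRFLOW_CONN_TMS_DEFAULT"] = "https://tms_user:s3cret@tms.example.com"

# Parsed here with urlparse only so the sketch runs without Airflow installed:
parsed = urlparse(os.environ["AIRFLOW_CONN_TMS_DEFAULT"])
print(parsed.username, parsed.hostname)  # → tms_user tms.example.com
```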
---

## Extending the Factory

### Add a new dataset

1. Create a YAML layout in `TMS-layouts/`, e.g.:

   ```yaml
   parameters:
     date_from:
       value: "2025-01-01"
     date_to:
       value: "2025-01-31"
   ```

2. Add a line in `generate_tm_ods_dags.py`:

   ```python
   create_dag("w_ODS_TMS_NEWENTITY")
   ```

3. Deploy the DAG file to Airflow.

### Run a DAG manually

In the Airflow UI:

1. Find the DAG `w_ODS_TMS_<ENTITY>`.
2. Trigger the DAG → optionally override visible parameters.
3. Monitor the logs for `retrieve_report`.

---
## Diagram

**DAG Factory Flow**

```mermaid
flowchart LR
    subgraph DAGFactory["DAG Factory"]
        direction TB
        B["Load TMS config (TMS.yml)"] --> C["Load dataset layout (YAML)"]
        C --> D["Extract visible & hidden parameters"]
        D --> E["Define Airflow DAG with retrieve_report task"]
        E --> F["Register DAG globally"]
        F --> G["Repeat for each dataset name"]
        G --> H["All DAGs available in Airflow"]
    end
    A["Airflow parses generate_tm_ods_dags.py"] --> DAGFactory
```

**Sample DAG Execution Flow**

```mermaid
flowchart LR
    subgraph ExampleDAG
        direction TB
        B[Read YAML configs] --> C[Build parameter combinations]
        C --> D["Evaluate select(...) and replace virtual params"]
        D --> E["Init workflow (mr.init_workflow)"]
        E --> F["Run TMSDB.py retrieve (subprocess)"]

        %% Branches on return codes
        F --> |rc=1: No data| G[Finalise workflow success]
        F --> |rc=0: Data returned| H[Check if source file config exists]
        F --> |other rc: Error| M[Finalise workflow failure]

        %% Config missing branch
        H --> |"Config missing (first run)"| I[Run TMSDB.py create-oracle-table → Generate DDL]
        I --> J[Execute DDL via mf.execute_query → Create Oracle external table]
        J --> K["Register file source config (mf.add_source_file_config)"]
        K --> L["Process landed file (mf.process_source_file)"]
        L --> N[Finalise workflow success]

        %% Config exists branch
        H --> |Config exists| P["Process landed file (mf.process_source_file)"]
        P --> N
    end
```

---

## Dependencies

- **Airflow 2.x**
- **Python 3.9+**
- **mrds** package (providing `utils.manage_files` and `utils.manage_runs`)
- **Oracle client / Impala client** (for table creation and querying)
- **Object storage client** (for uploading CSVs)

---

## Summary

The DAG factory is a scalable way to create **dozens of ingestion DAGs** for TMS datasets with minimal boilerplate. It leverages:

- **YAML configs** for parameters,
- **TMSDB CLI** for data retrieval and DDL generation,
- **MRDS utilities** for workflow tracking and file handling.

It standardizes ingestion while keeping each dataset's DAG lightweight and uniform.