diff --git a/confluence/FILE_ARCHIVER_Guide.md b/confluence/FILE_ARCHIVER_Guide.md
index 18ad5e0..8080425 100644
--- a/confluence/FILE_ARCHIVER_Guide.md
+++ b/confluence/FILE_ARCHIVER_Guide.md
@@ -11,25 +11,55 @@ The FILE_ARCHIVER package provides flexible archival strategies that accommodate
 - **Three Archival Strategies**: THRESHOLD_BASED, MINIMUM_AGE_MONTHS (with 0=current month only), HYBRID
 - **Flexible Configuration**: Per-table archival strategy configuration via A_SOURCE_FILE_CONFIG
 - **Validation**: Automatic validation of strategy-specific configuration requirements
-- **OCI Integration**: Works seamlessly with DBMS_CLOUD operations via cloud_wrapper
 
 ### Package Information
 
 - **Schema**: CT_MRDS
 - **Package**: FILE_ARCHIVER
 - **Current Version**: 3.3.0
-- **Dependencies**: ENV_MANAGER, FILE_MANAGER, cloud_wrapper, A_SOURCE_FILE_CONFIG, A_SOURCE_FILE_RECEIVED, A_WORKFLOW_HISTORY
+- **Dependencies**: ENV_MANAGER, FILE_MANAGER, A_SOURCE_FILE_CONFIG, A_SOURCE_FILE_RECEIVED, A_WORKFLOW_HISTORY
 
 ### Critical Prerequisites
 
-⚠️ **IMPORTANT**: FILE_ARCHIVER requires data to be registered in `CT_MRDS.A_SOURCE_FILE_RECEIVED` table. This table is automatically populated when files are processed through the modern Airflow + DBT workflow via `FILE_MANAGER.PROCESS_SOURCE_FILE`.
+⚠️ **IMPORTANT**: FILE_ARCHIVER requires data to be registered in the `CT_MRDS.A_SOURCE_FILE_RECEIVED` table.
+
+**For new system data (Airflow + DBT):**
+- `A_SOURCE_FILE_RECEIVED` records are automatically created by `FILE_MANAGER.PROCESS_SOURCE_FILE` during file validation
+- No additional configuration needed - the standard workflow handles registration
 
 **For legacy data migrated from Informatica + WLA system:**
-- Legacy data exported using `DATA_EXPORTER` does NOT automatically create `A_SOURCE_FILE_RECEIVED` records
-- Without these records, FILE_ARCHIVER **CANNOT** archive the data
-- See [System Migration Guide](System_Migration_Informatica_to_Airflow_DBT.md) for workaround strategies
+- Use `DATA_EXPORTER` with the **`pRegisterExport => TRUE`** parameter to automatically register exported files in `A_SOURCE_FILE_RECEIVED`
+- This enables FILE_ARCHIVER to process legacy data exports without manual registration
+- Available in both `EXPORT_TABLE_DATA` (single CSV) and `EXPORT_TABLE_DATA_TO_CSV_BY_DATE` (partitioned CSV exports)
 
-**Recommendation for legacy data**: Export directly to ARCHIVE bucket using `DATA_EXPORTER.EXPORT_TABLE_DATA_BY_DATE` with `pBucketArea => 'ARCHIVE'` to bypass this requirement
+**Example - Legacy Data Export with Registration**:
+
+```sql
+-- Export legacy data to DATA bucket WITH automatic registration
+BEGIN
+  CT_MRDS.DATA_EXPORTER.EXPORT_TABLE_DATA_TO_CSV_BY_DATE(
+    pSchemaName => 'OU_TOP',
+    pTableName => 'AGGREGATED_ALLOTMENT',
+    pKeyColumnName => 'A_ETL_LOAD_SET_KEY_FK',
+    pBucketArea => 'DATA',
+    pFolderName => 'legacy_export',
+    pMinDate => DATE '2024-01-01',
+    pMaxDate => DATE '2024-12-31',
+    pRegisterExport => TRUE,  -- ✓ Registers files in A_SOURCE_FILE_RECEIVED
+    pProcessName => 'LEGACY_MIGRATION'
+  );
+END;
+/
+
+-- Now FILE_ARCHIVER can process these files
+-- (vConfigKey = the A_SOURCE_FILE_CONFIG_KEY configured for this table)
+BEGIN
+  CT_MRDS.FILE_ARCHIVER.ARCHIVE_TABLE_DATA(
+    pSourceFileConfigKey => vConfigKey
+  );
+END;
+/
+```
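+
+To confirm the registration succeeded before invoking FILE_ARCHIVER, a quick check against `A_SOURCE_FILE_RECEIVED` can help. This is a minimal sketch - the column names are assumptions based on the registration metadata described above, so verify them against the current table definition:
+
+```sql
+-- Sketch: verify that the export registered the expected files
+-- (column names assumed; check the A_SOURCE_FILE_RECEIVED definition)
+SELECT SOURCE_FILE_NAME, PROCESSING_STATUS, BYTES, RECEPTION_DATE
+FROM   CT_MRDS.A_SOURCE_FILE_RECEIVED
+WHERE  PROCESS_NAME = 'LEGACY_MIGRATION'
+ORDER BY RECEPTION_DATE;
+```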
+
+**Alternative approach**: Export directly to the ARCHIVE bucket using `DATA_EXPORTER.EXPORT_TABLE_DATA_BY_DATE` with `pBucketArea => 'ARCHIVE'` to bypass the archival step entirely
 
 ## Archival Strategies
 
diff --git a/confluence/System_Migration_Informatica_to_Airflow_DBT.md b/confluence/System_Migration_Informatica_to_Airflow_DBT.md
index 90c79b1..1234e53 100644
--- a/confluence/System_Migration_Informatica_to_Airflow_DBT.md
+++ b/confluence/System_Migration_Informatica_to_Airflow_DBT.md
@@ -1,6 +1,6 @@
 # System Migration: Informatica + WLA → Airflow + DBT
 
-This document describes the migration from the legacy Informatica + WLA data processing system to the modern Airflow + DBT architecture, including control table differences, data export strategies, and known limitations.
+This document describes the migration from the legacy Informatica + WLA data processing system to the new Airflow + DBT architecture, including control table differences, data export strategies, and known limitations.
 
 ## Migration Overview
 
@@ -13,7 +13,7 @@ The MRDS (Market Reference Data System) is undergoing a fundamental technology m
 - Primary Control Table: `CT_ODS.A_LOAD_HISTORY`
 - Key Column: `A_ETL_LOAD_SET_KEY`
 
-**Modern System (Airflow + DBT):**
+**New System (Airflow + DBT):**
 - Orchestration: Apache Airflow
 - Transformation: DBT (Data Build Tool)
 - Control Schema: `CT_MRDS` (MRDS Control)
 
@@ -49,7 +49,7 @@ DQ_FLAG VARCHAR2(5) -- Data quality flag
 - Used for temporal partitioning in DATA_EXPORTER
 - Referenced via `A_ETL_LOAD_SET_KEY_FK` foreign key in data tables
 
-### Modern System: CT_MRDS Control Tables
+### New System: CT_MRDS Control Tables
 
 #### 1. A_SOURCE_FILE_RECEIVED
 
@@ -126,7 +126,7 @@ END;
 
 **Result**: CSV files in ODS bucket (DATA area), partitioned by LOAD_START from A_LOAD_HISTORY
 
-### Scenario 2: Modern System Data (Airflow + DBT → ODS → ARCHIVE)
+### Scenario 2: New System Data (Airflow + DBT → ODS → ARCHIVE)
 
 **Use Case**: Ongoing processing with new Airflow + DBT system
 
@@ -150,104 +150,91 @@ END;
 /
 ```
 
-## Critical Gap: Legacy Data Archival
+## Legacy Data Archival
 
-### Problem Statement
+### FILE_ARCHIVER Requirement
 
-**Scenario**: Historical data exported using DATA_EXPORTER from Informatica-loaded tables
+⚠️ **IMPORTANT**: FILE_ARCHIVER requires records in the `A_SOURCE_FILE_RECEIVED` table to track and manage the archival lifecycle.
 
-**Issue**: FILE_ARCHIVER requires records in `A_SOURCE_FILE_RECEIVED`, but legacy exports don't create them
+**For new system data (Airflow + DBT)**:
+- Records automatically created by `FILE_MANAGER.PROCESS_SOURCE_FILE`
+- No additional steps needed
 
-**Impact**: Legacy data exported to ODS/DATA bucket **CANNOT** be archived to ARCHIVE bucket using FILE_ARCHIVER
+**For legacy data (Informatica + WLA)**:
+- Historical data requires registration in `A_SOURCE_FILE_RECEIVED`
+- ✅ **SOLUTION**: Use DATA_EXPORTER v2.9.0+ with the `pRegisterExport => TRUE` parameter
+- Automatically registers exported files with proper metadata (size, checksum, location)
 
-### Technical Analysis
+### Export Strategies for Legacy Data
 
-**DATA_EXPORTER Behavior**:
-```sql
--- Uses A_LOAD_HISTORY for partitioning (Informatica workflows)
-SELECT DISTINCT TO_CHAR(L.LOAD_START,'YYYY') AS YR,
-       TO_CHAR(L.LOAD_START,'MM') AS MN
-FROM OU_TOP.AGGREGATED_ALLOTMENT T, CT_ODS.A_LOAD_HISTORY L
-WHERE T.A_ETL_LOAD_SET_KEY_FK = L.A_ETL_LOAD_SET_KEY
-  AND L.LOAD_START >= :pMinDate
-  AND L.LOAD_START < :pMaxDate;
+#### Strategy 1: Automatic Registration (Recommended)
 
--- Creates CSV files: ODS/legacy_migration/AGGREGATED_ALLOTMENT_YYYYMM.csv
--- Does NOT create A_SOURCE_FILE_RECEIVED records
-```
 
+✅ **DATA_EXPORTER v2.9.0+** supports automatic file registration via the `pRegisterExport` parameter.
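+
+For orientation, the registration-related parameters look roughly like the sketch below (shown for `EXPORT_TABLE_DATA_TO_CSV_BY_DATE`; `EXPORT_TABLE_DATA` gained the same pair). This is an assumption pieced together from the examples on this page - parameter types and defaults are guesses, so consult the DATA_EXPORTER package specification for the authoritative declaration:
+
+```sql
+-- Sketch (assumed): signature with the v2.9.0 registration parameters
+PROCEDURE EXPORT_TABLE_DATA_TO_CSV_BY_DATE(
+  pSchemaName     IN VARCHAR2,
+  pTableName      IN VARCHAR2,
+  pKeyColumnName  IN VARCHAR2,
+  pBucketArea     IN VARCHAR2,
+  pFolderName     IN VARCHAR2,
+  pMinDate        IN DATE,
+  pMaxDate        IN DATE,
+  pRegisterExport IN BOOLEAN  DEFAULT FALSE,  -- assumed default
+  pProcessName    IN VARCHAR2 DEFAULT NULL    -- assumed default
+);
+```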
-**FILE_ARCHIVER Requirement**:
-```sql
--- Joins A_SOURCE_FILE_RECEIVED with A_WORKFLOW_HISTORY
-JOIN CT_MRDS.A_SOURCE_FILE_RECEIVED r
-  ON r.A_SOURCE_FILE_CONFIG_KEY = pSourceFileConfig.A_SOURCE_FILE_CONFIG_KEY
-  AND r.PROCESSING_STATUS = 'INGESTED';
+
+**Benefits**:
+- Simple, one-step export with automatic registration
+- Files tracked in `A_SOURCE_FILE_RECEIVED` (enables FILE_ARCHIVER processing)
+- Proper metadata capture (file size, checksum, location, timestamps)
+- Standard workflow integration (archival strategies, status tracking)
 
--- Without A_SOURCE_FILE_RECEIVED records, archival CANNOT proceed
-```
 
-### Workaround Strategies
-
-#### Strategy 1: Manual Registration (Recommended for Small Datasets)
-
-Manually create `A_SOURCE_FILE_RECEIVED` records for legacy exported files:
+**Example - CSV Export with Registration**:
 
 ```sql
--- Step 1: Export legacy data to ODS/DATA
+-- Export with automatic registration (DATA_EXPORTER v2.9.0+)
 BEGIN
-  DATA_EXPORTER.EXPORT_TABLE_DATA_TO_CSV_BY_DATE(
-    pSchemaName => 'OU_TOP',
-    pTableName => 'AGGREGATED_ALLOTMENT',
-    pKeyColumnName => 'A_ETL_LOAD_SET_KEY_FK',
-    pBucketArea => 'DATA',
-    pFolderName => 'legacy_export',
-    pMinDate => DATE '2024-01-01',
-    pMaxDate => DATE '2024-12-31'
+  CT_MRDS.DATA_EXPORTER.EXPORT_TABLE_DATA_TO_CSV_BY_DATE(
+    pSchemaName => 'OU_TOP',
+    pTableName => 'AGGREGATED_ALLOTMENT',
+    pKeyColumnName => 'A_ETL_LOAD_SET_KEY_FK',
+    pBucketArea => 'DATA',
+    pFolderName => 'legacy_export',
+    pMinDate => DATE '2024-01-01',
+    pMaxDate => DATE '2024-12-31',
+    pRegisterExport => TRUE,  -- ✓ Automatically registers files
+    pProcessName => 'LEGACY_MIGRATION'
   );
 END;
 /
 
--- Step 2: List exported CSV files
-SELECT object_name, time_created, bytes
-FROM TABLE(MRDS_LOADER.cloud_wrapper.list_objects(
-    credential_name => 'DEF_CRED_ARN',
-    location_uri => 'https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/frtgjxu7zl7c/b/data/o/'
-)) WHERE object_name LIKE 'ODS/legacy_export/AGGREGATED_ALLOTMENT_%';
+-- Files now registered in A_SOURCE_FILE_RECEIVED with:
+-- - SOURCE_FILE_NAME: Full OCI path
+-- - PROCESSING_STATUS: 'INGESTED'
+-- - BYTES: Actual file size
+-- - CHECKSUM: File ETag from OCI
+-- - PROCESS_NAME: 'LEGACY_MIGRATION'
 
--- Step 3: Manually register each file in A_SOURCE_FILE_RECEIVED
--- (Requires source configuration for AGGREGATED_ALLOTMENT to exist)
-INSERT INTO CT_MRDS.A_SOURCE_FILE_RECEIVED (
-  A_SOURCE_FILE_RECEIVED_KEY,
-  A_SOURCE_FILE_CONFIG_KEY,
-  SOURCE_FILE_NAME,
-  PROCESSING_STATUS,
-  RECEPTION_DATE,
-  BYTES,
-  CHECKSUM,
-  EXTERNAL_TABLE_NAME
-) VALUES (
-  A_SOURCE_FILE_RECEIVED_KEY_SEQ.NEXTVAL,
-  (SELECT A_SOURCE_FILE_CONFIG_KEY FROM A_SOURCE_FILE_CONFIG
-   WHERE SOURCE_FILE_ID = 'AGGREGATED_ALLOTMENT' AND SOURCE_FILE_TYPE = 'INPUT'),
-  'ODS/legacy_export/AGGREGATED_ALLOTMENT_202401.csv',
-  'INGESTED',  -- Skip validation, mark as already ingested
-  DATE '2024-01-15',
-  1048576,  -- File size in bytes
-  'manual_registration',
-  NULL  -- No external table needed
-);
--- Repeat for all exported CSV files
-COMMIT;
 
--- Step 4: Now FILE_ARCHIVER can process these files
+-- Now FILE_ARCHIVER can process these files
 BEGIN
-  FILE_ARCHIVER.ARCHIVE_TABLE_DATA(pSourceFileConfig => vConfig);
+  CT_MRDS.FILE_ARCHIVER.ARCHIVE_TABLE_DATA(
+    pSourceFileConfigKey => vConfigKey
+  );
+END;
+/
+```
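+
+Because the partitioned export derives its monthly `_YYYYMM.csv` files from `LOAD_START` in `CT_ODS.A_LOAD_HISTORY`, it can be useful to preview which files a date range will produce before running it. A minimal sketch, mirroring the partitioning join shown in earlier revisions of this page:
+
+```sql
+-- Sketch: preview the monthly partitions a date range will export
+SELECT DISTINCT TO_CHAR(L.LOAD_START, 'YYYYMM') AS FILE_SUFFIX
+FROM   OU_TOP.AGGREGATED_ALLOTMENT T
+JOIN   CT_ODS.A_LOAD_HISTORY L
+  ON   T.A_ETL_LOAD_SET_KEY_FK = L.A_ETL_LOAD_SET_KEY
+WHERE  L.LOAD_START >= DATE '2024-01-01'
+  AND  L.LOAD_START <  DATE '2024-12-31'
+ORDER BY FILE_SUFFIX;
+```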
+
+**Example - Single CSV Export with Registration**:
+
+```sql
+-- For a single-file export (not partitioned by date)
+BEGIN
+  CT_MRDS.DATA_EXPORTER.EXPORT_TABLE_DATA(
+    pSchemaName => 'CT_MRDS',
+    pTableName => 'MY_TABLE',
+    pKeyColumnName => 'A_ETL_LOAD_SET_KEY_FK',
+    pBucketArea => 'DATA',
+    pFolderName => 'legacy_export',
+    pFileName => 'my_table_export.csv',
+    pTemplateTableName => 'CT_ET_TEMPLATES.MY_TEMPLATE',
+    pRegisterExport => TRUE,  -- ✓ Registers file
+    pProcessName => 'LEGACY_MIGRATION'
+  );
 END;
 /
 ```
 
 #### Strategy 2: Direct Archive Export (Bypass ODS)
 
+⚠️ **Use when**: You want to skip the ODS bucket entirely and go straight to ARCHIVE
+
 Skip ODS/DATA bucket entirely - export directly to ARCHIVE bucket in Parquet format:
 
 ```sql
@@ -411,18 +398,18 @@ CALL FILE_MANAGER.ADD_SOURCE_FILE_CONFIG(
 
 ## Known Limitations
 
-### 1. No Retroactive A_SOURCE_FILE_RECEIVED Creation
-DATA_EXPORTER does not automatically create A_SOURCE_FILE_RECEIVED records when exporting legacy data. This is by design - it's a one-time export tool, not a file tracking system.
+### 1. FILE_ARCHIVER Requires A_SOURCE_FILE_RECEIVED
+FILE_ARCHIVER cannot archive data without corresponding A_SOURCE_FILE_RECEIVED records.
 
-### 2. FILE_ARCHIVER Requires A_SOURCE_FILE_RECEIVED
-FILE_ARCHIVER cannot archive data without corresponding A_SOURCE_FILE_RECEIVED records. This prevents archiving of:
-- Legacy Informatica-loaded data exported via DATA_EXPORTER
-- Manually uploaded files not processed through FILE_MANAGER.PROCESS_SOURCE_FILE
+**Solutions**:
+- ✅ **New system data**: Automatically registered via `FILE_MANAGER.PROCESS_SOURCE_FILE`
+- ✅ **Legacy data exports**: Use `DATA_EXPORTER` with `pRegisterExport => TRUE` (v2.9.0+)
+- ⚠️ **Manual uploads**: Must be registered via `FILE_MANAGER.PROCESS_SOURCE_FILE` or a manual INSERT (see the appendix at the end of this page)
 
-### 3. Mixed Control Table References
+### 2. Mixed Control Table References
 During migration period, some procedures reference A_LOAD_HISTORY (DATA_EXPORTER) while others reference A_WORKFLOW_HISTORY (FILE_ARCHIVER). This is intentional but requires careful understanding of data lineage.
 
-### 4. A_WORKFLOW_HISTORY vs A_LOAD_HISTORY Column Mismatch
+### 3. A_WORKFLOW_HISTORY vs A_LOAD_HISTORY Column Mismatch
 The control tables have different schemas:
 - **A_LOAD_HISTORY**: `LOAD_START`, `A_ETL_LOAD_SET_KEY`
 - **A_WORKFLOW_HISTORY**: `WORKFLOW_START`, `A_WORKFLOW_HISTORY_KEY`
 
@@ -445,4 +432,8 @@ The migration from Informatica + WLA to Airflow + DBT introduces new control tab
 - **Archival Operations**: Ensuring FILE_ARCHIVER has required metadata
 - **Testing**: Using correct control tables in test scenarios
 
-The recommended approach for legacy data migration is **Strategy 2 (Direct to ARCHIVE)** for large datasets, as it avoids the complexity of manual A_SOURCE_FILE_RECEIVED registration while achieving the goal of moving historical data to long-term archival storage.
+**Recommended Approach for Legacy Data Migration**:
+
+1. ✅ **Strategy 1 (Automatic Registration)** - Use `DATA_EXPORTER` with `pRegisterExport => TRUE` to automatically register files in `A_SOURCE_FILE_RECEIVED`, enabling the full FILE_ARCHIVER workflow (archival strategies, status tracking, rollback capabilities)
+
+2. ⚠️ **Strategy 2 (Direct to ARCHIVE)** - Export directly to the ARCHIVE bucket to bypass the ODS bucket entirely and avoid registration requirements (use when tracking is not needed)
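+
+## Appendix: Manual Registration Sketch
+
+For the manual-upload case in Known Limitations #1, earlier revisions of this page showed a manual INSERT into `A_SOURCE_FILE_RECEIVED`. A condensed sketch is kept here for reference; the column list and sequence name are carried over from that earlier example, the file name and values are placeholders, and everything should be verified against the current table definition before use:
+
+```sql
+-- Sketch: manual registration fallback for files uploaded outside FILE_MANAGER
+-- (column list and sequence name taken from an earlier revision of this page)
+INSERT INTO CT_MRDS.A_SOURCE_FILE_RECEIVED (
+  A_SOURCE_FILE_RECEIVED_KEY,
+  A_SOURCE_FILE_CONFIG_KEY,
+  SOURCE_FILE_NAME,
+  PROCESSING_STATUS,
+  RECEPTION_DATE,
+  BYTES,
+  CHECKSUM,
+  EXTERNAL_TABLE_NAME
+) VALUES (
+  A_SOURCE_FILE_RECEIVED_KEY_SEQ.NEXTVAL,
+  (SELECT A_SOURCE_FILE_CONFIG_KEY FROM CT_MRDS.A_SOURCE_FILE_CONFIG
+   WHERE SOURCE_FILE_ID = 'MY_TABLE' AND SOURCE_FILE_TYPE = 'INPUT'),
+  'ODS/manual_upload/my_table.csv',  -- hypothetical path
+  'INGESTED',  -- mark as already ingested so FILE_ARCHIVER picks it up
+  SYSDATE,
+  1048576,  -- actual file size in bytes
+  'manual_registration',
+  NULL  -- no external table needed
+);
+COMMIT;
+```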