From cdd9dff32d3999e5a22b34f3b0fba34384da8189 Mon Sep 17 00:00:00 2001
From: Grzegorz Michalski
Date: Tue, 10 Feb 2026 08:21:51 +0100
Subject: [PATCH] update documentation for TRASH and the new file statuses
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 confluence/FILE_ARCHIVER_Guide.md             | 92 ++++++++++++++++---
 .../FILE_MANAGER_Configuration_Guide.md       | 18 +++-
 confluence/PROCESS_SOURCE_FILE_Guide.md       |  4 +-
 ...em_Migration_Informatica_to_Airflow_DBT.md |  4 +-
 confluence/Tables_setup.md                    |  5 +
 .../Oracle_External_Tables_Tolerance_Guide.md |  3 +-
 6 files changed, 104 insertions(+), 22 deletions(-)

diff --git a/confluence/FILE_ARCHIVER_Guide.md b/confluence/FILE_ARCHIVER_Guide.md
index b233978..2e9bd6b 100644
--- a/confluence/FILE_ARCHIVER_Guide.md
+++ b/confluence/FILE_ARCHIVER_Guide.md
@@ -18,7 +18,7 @@ The FILE_ARCHIVER package provides flexible archival strategies that accommodate
 
 - **Schema**: CT_MRDS
 - **Package**: FILE_ARCHIVER
-- **Current Version**: 3.1.0
+- **Current Version**: 3.2.0
 - **Dependencies**: ENV_MANAGER, FILE_MANAGER, cloud_wrapper, A_SOURCE_FILE_CONFIG, A_SOURCE_FILE_RECEIVED, A_WORKFLOW_HISTORY
 
 ### Critical Prerequisites
 
@@ -177,30 +177,46 @@ WHERE ...;
    ├─ Active data processing (Airflow + DBT)
    ├─ External tables read data from bucket
    ├─ Status: INGESTED
-   └─ FILE_ARCHIVER.ARCHIVE_TABLE_DATA archives based on strategy
+   ├─ FILE_ARCHIVER.ARCHIVE_TABLE_DATA archives based on strategy
+   └─ CSV files moved to TRASH subfolder (ODS → TRASH/)
+
+2.1 TRASH Subfolder (DATA Bucket - File Retention)
+   ├─ Located in DATA bucket (e.g., TRASH/LM/TABLE_NAME)
+   ├─ Stores CSV files after archival to Parquet
+   ├─ Status: ARCHIVED_AND_TRASHED (default retention)
+   ├─ Enables rollback if archival issues occur
+   └─ Optional cleanup: ARCHIVED_AND_PURGED (pKeepInTrash=FALSE)
 
 3. ARCHIVE Bucket (Long-term Storage)
    ├─ Historical data in Parquet format
    ├─ Hive-style partitioning: PARTITION_YEAR=/PARTITION_MONTH=
-   ├─ Status: ARCHIVED
+   ├─ Status: ARCHIVED_AND_TRASHED or ARCHIVED_AND_PURGED
    └─ Optimized for big data analytics (Spark, Hive)
-```
-
-### Archival Process
-
-The FILE_ARCHIVER package automatically manages data movement from ODS to ARCHIVE:
 
 **Key Procedures**:
-- `ARCHIVE_TABLE_DATA` - Main archival procedure using strategy-specific WHERE clause
+- `ARCHIVE_TABLE_DATA(pSourceFileConfigKey, pKeepInTrash)` - Main archival procedure using strategy-specific WHERE clause
+  - `pKeepInTrash` (BOOLEAN, DEFAULT TRUE) - Controls TRASH folder retention
+    - TRUE: Files kept in TRASH folder for safety and rollback capability (default)
+    - FALSE: Files deleted from TRASH folder after successful archival
 - `GET_ARCHIVAL_WHERE_CLAUSE` - Returns WHERE clause based on configured strategy
 - `GATHER_TABLE_STAT` - Calculates archival statistics using strategy logic
 
 **Archival Execution**:
 ```sql
--- Triggered by FILE_MANAGER or scheduled job
+-- Default behavior: keep files in the TRASH folder (ARCHIVED_AND_TRASHED status)
 BEGIN
   CT_MRDS.FILE_ARCHIVER.ARCHIVE_TABLE_DATA(
-    pSourceFileConfig => vSourceFileConfigRecord
+    pSourceFileConfigKey => vSourceFileConfigKey,
+    pKeepInTrash         => TRUE  -- DEFAULT value
   );
 END;
 /
+
+-- Optional: delete files from TRASH after archival (ARCHIVED_AND_PURGED status)
+BEGIN
+  CT_MRDS.FILE_ARCHIVER.ARCHIVE_TABLE_DATA(
+    pSourceFileConfigKey => vSourceFileConfigKey,
+    pKeepInTrash         => FALSE  -- Clean up the TRASH folder
+  );
+END;
+/
@@ -210,7 +226,9 @@ END;
 - Package retrieves ARCHIVAL_STRATEGY from A_SOURCE_FILE_CONFIG
 - GET_ARCHIVAL_WHERE_CLAUSE generates appropriate WHERE clause
 - Data matching criteria moved from ODS to ARCHIVE bucket
-- Parquet format with Hive-style partitioning applied
+- CSV files moved to TRASH subfolder in DATA bucket (ODS/ → TRASH/)
+- Parquet format with Hive-style partitioning applied to ARCHIVE bucket
+- TRASH retention controlled by pKeepInTrash parameter
 
 ## Configuration Examples
 
@@ -527,8 +545,11 @@ WHERE object_name = 'FILE_ARCHIVER';
 ### OCI Buckets
 - **INBOX**: Incoming file validation (`'INBOX/{SOURCE}/{SOURCE_FILE_ID}/{TABLE_NAME}/'`)
 - **ODS/DATA**: Operational data processing (`'ODS/{SOURCE}/{TABLE_NAME}/'`)
+- **TRASH**: File retention subfolder in DATA bucket (`'TRASH/{SOURCE}/{TABLE_NAME}/'`) - CSV files after archival
 - **ARCHIVE**: Historical data storage (`'ARCHIVE/{SOURCE}/{TABLE_NAME}/PARTITION_YEAR=/PARTITION_MONTH=/'`)
 
+**Note**: TRASH is NOT a separate bucket - it's a subfolder within the DATA bucket for file retention and rollback capability.
+
 ## Best Practices
 
 ### Strategy Selection Guidelines
@@ -609,10 +630,53 @@ WHERE object_name = 'FILE_ARCHIVER';
 - Check for tables without archival configuration
 - Optimize MINIMUM_AGE_MONTHS based on actual usage patterns
 
+### TRASH Folder Retention Best Practices
+
+1. **Default Behavior (pKeepInTrash = TRUE - Recommended)**:
+   - Keeps CSV files in TRASH folder after archival
+   - Provides safety net for rollback if archival issues occur
+   - Supports compliance and audit requirements
+   - Status: ARCHIVED_AND_TRASHED
+   - Use for: Production environments, regulatory compliance, critical data
+
+2. **TRASH Cleanup (pKeepInTrash = FALSE)**:
+   - Deletes CSV files from TRASH folder after successful archival
+   - Reduces storage costs in DATA bucket
+   - Status: ARCHIVED_AND_PURGED
+   - Use for: Non-critical data, storage optimization, test environments
+
+3. **Monitoring TRASH Folder**:
+   ```sql
+   -- Check files in TRASH retention
+   SELECT
+     SOURCE_FILE_NAME,
+     PROCESSING_STATUS,
+     ARCH_FILE_NAME,
+     PARTITION_YEAR,
+     PARTITION_MONTH
+   FROM CT_MRDS.A_SOURCE_FILE_RECEIVED
+   WHERE PROCESSING_STATUS IN ('ARCHIVED_AND_TRASHED', 'ARCHIVED_AND_PURGED')
+     AND RECEPTION_DATE > SYSDATE - 30
+   ORDER BY PROCESSING_STATUS, RECEPTION_DATE DESC;
+   ```
+
+4. **TRASH Folder Structure**:
+   ```
+   DATA Bucket:
+   ├── ODS/LM/STANDING_FACILITIES/file.csv    -- Active operational data
+   └── TRASH/LM/STANDING_FACILITIES/file.csv  -- Retained after archival
+
+   ARCHIVE Bucket:
+   └── ARCHIVE/LM/STANDING_FACILITIES/
+       └── PARTITION_YEAR=2026/
+           └── PARTITION_MONTH=02/
+               └── *.parquet                  -- Archived data
+   ```
+
 ## Author
 
 Created by: Grzegorz Michalski
-Date: 2026-02-04
+Date: 2026-02-06
 Schema: CT_MRDS
 Package: FILE_ARCHIVER
-Version: 3.1.0
+Version: 3.2.0
diff --git a/confluence/FILE_MANAGER_Configuration_Guide.md b/confluence/FILE_MANAGER_Configuration_Guide.md
index 98ecb8f..54ce18f 100644
--- a/confluence/FILE_MANAGER_Configuration_Guide.md
+++ b/confluence/FILE_MANAGER_Configuration_Guide.md
@@ -371,11 +371,14 @@ INBOX Bucket - Pattern: 'INBOX/{SOURCE}/{SOURCE_FILE_ID}/{TABLE_NAME}/'
     └── {pTableId}/            -- e.g., "A_UC_DISSEM_METADATA_LOADS", "STANDING_FACILITIES"
         └── files matching {pSourceFileNamePattern}
 
-ODS Bucket - Pattern: 'ODS/{SOURCE}/{TABLE_NAME}/'
-└── ODS/
+DATA Bucket - Patterns: 'ODS/{SOURCE}/{TABLE_NAME}/' and 'TRASH/{SOURCE}/{TABLE_NAME}/'
+├── ODS/
+│   └── {pSourceKey}/          -- e.g., "C2D", "LM"
+│       └── {pTableId}/        -- e.g., "A_UC_DISSEM_METADATA_LOADS", "STANDING_FACILITIES"
+│           └── processed files
+└── TRASH/                     -- File retention subfolder (not a separate bucket)
     └── {pSourceKey}/          -- e.g., "C2D", "LM"
-    └── {pTableId}/            -- e.g., "A_UC_DISSEM_METADATA_LOADS", "STANDING_FACILITIES"
-        └── processed files
+        └── {pTableId}/        -- CSV files after archival (ARCHIVED_AND_TRASHED status)
 
 ARCHIVE Bucket - Pattern: 'ARCHIVE/{SOURCE}/{TABLE_NAME}/'
 └── ARCHIVE/
@@ -389,9 +392,11 @@ ARCHIVE Bucket - Pattern: 'ARCHIVE/{SOURCE}/{TABLE_NAME}/'
 
 **Critical Path Pattern Requirements:**
 - **INBOX** requires full 3-level path: `INBOX/{SOURCE}/{SOURCE_FILE_ID}/{TABLE_NAME}/`
 - **ODS** uses simplified 2-level path: `ODS/{SOURCE}/{TABLE_NAME}/` (no SOURCE_FILE_ID)
+- **TRASH** uses simplified 2-level path: `TRASH/{SOURCE}/{TABLE_NAME}/` (subfolder in DATA bucket)
 - **ARCHIVE** uses simplified 2-level path: `ARCHIVE/{SOURCE}/{TABLE_NAME}/` (no SOURCE_FILE_ID)
 - **All patterns are mandatory** - no simplified versions allowed
 - File names must match `pSourceFileNamePattern` for automatic processing
+- **Note**: TRASH is NOT a separate bucket - it's a subfolder within the DATA bucket
 
 ## Configuration Management Best Practices
@@ -693,7 +698,10 @@ SELECT FILE_MANAGER.PROCESS_SOURCE_FILE(
 
 1. **File Arrival**: File is uploaded to Oracle Cloud Storage bucket
 2. **Registration**: FILE_MANAGER.REGISTER_SOURCE_FILE_RECEIVED() creates record
-3. **Status**: RECEIVED → VALIDATED → READY_FOR_INGESTION → INGESTED → ARCHIVED
+3. **Status**: RECEIVED → VALIDATED → READY_FOR_INGESTION → INGESTED → ARCHIVED_AND_TRASHED → ARCHIVED_AND_PURGED (optional)
+   - Legacy ARCHIVED status maintained for backward compatibility
+   - ARCHIVED_AND_TRASHED: Files archived to Parquet and kept in TRASH folder (default)
+   - ARCHIVED_AND_PURGED: Files archived to Parquet and deleted from TRASH folder
 4. **External Table**: Created automatically based on template table
 5. **Data Loading**: Data is loaded into target ODS schema
 6. **Archival**: File is moved to archive bucket after processing
diff --git a/confluence/PROCESS_SOURCE_FILE_Guide.md b/confluence/PROCESS_SOURCE_FILE_Guide.md
index c8909f4..e92804d 100644
--- a/confluence/PROCESS_SOURCE_FILE_Guide.md
+++ b/confluence/PROCESS_SOURCE_FILE_Guide.md
@@ -164,7 +164,9 @@ ORDER BY RECEPTION_DATE DESC;
 | `VALIDATED` | File validation completed successfully | After successful validation |
 | `READY_FOR_INGESTION` | File validated and prepared for Airflow+DBT processing | After successful validation and preparation |
 | `INGESTED` | Data has been consumed/ingested by target system | After data consumption |
-| `ARCHIVED` | Data exported to PARQUET format and file moved to archival storage | Final archival state using FILE_ARCHIVER |
+| `ARCHIVED` | (Legacy) Data exported to PARQUET format and file moved to archival storage | Legacy archival state (backward compatibility) |
+| `ARCHIVED_AND_TRASHED` | Data archived to Parquet, CSV files kept in TRASH folder (default) | Archival with file retention using FILE_ARCHIVER |
+| `ARCHIVED_AND_PURGED` | Data archived to Parquet, CSV files deleted from TRASH folder | Archival with TRASH cleanup (pKeepInTrash=FALSE) |
 | `VALIDATION_FAILED` | File validation failed | After failed validation |
 
diff --git a/confluence/System_Migration_Informatica_to_Airflow_DBT.md b/confluence/System_Migration_Informatica_to_Airflow_DBT.md
index d354ed2..90c79b1 100644
--- a/confluence/System_Migration_Informatica_to_Airflow_DBT.md
+++ b/confluence/System_Migration_Informatica_to_Airflow_DBT.md
@@ -68,7 +68,9 @@ ARCH_FILE_NAME VARCHAR2 -- Parquet archive file path
 
 **Status Workflow**:
 ```
-RECEIVED → VALIDATED → READY_FOR_INGESTION → INGESTED → ARCHIVED
+RECEIVED → VALIDATED → READY_FOR_INGESTION → INGESTED → ARCHIVED_AND_TRASHED → ARCHIVED_AND_PURGED (optional)
+
+Note: Legacy ARCHIVED status maintained for backward compatibility
 ```
 
 **Usage Pattern**:
diff --git a/confluence/Tables_setup.md b/confluence/Tables_setup.md
index ac829ee..717c660 100644
--- a/confluence/Tables_setup.md
+++ b/confluence/Tables_setup.md
@@ -394,6 +394,9 @@ DATA Bucket:
 ├── ODS/
 │   └── {SOURCE}/
 │       └── {TABLE_NAME}/
+└── TRASH/                    -- File retention subfolder (not a separate bucket)
+    └── {SOURCE}/
+        └── {TABLE_NAME}/     -- CSV files after archival (ARCHIVED_AND_TRASHED status)
 
 ARCHIVE Bucket:
 └── ARCHIVE/
@@ -402,6 +405,8 @@ ARCHIVE Bucket:
 └── ARCHIVE/
     └── {SOURCE}/
         └── {TABLE_NAME}/
            └── PARTITION_YEAR=*/
               └── PARTITION_MONTH=*/
                  └── *.parquet
+
+Note: TRASH is a subfolder within the DATA bucket for file retention and rollback capability.
 ```
 
 ### 4. Migration Checklist
diff --git a/confluence/additions/Oracle_External_Tables_Tolerance_Guide.md b/confluence/additions/Oracle_External_Tables_Tolerance_Guide.md
index ca56e0e..0925cd7 100644
--- a/confluence/additions/Oracle_External_Tables_Tolerance_Guide.md
+++ b/confluence/additions/Oracle_External_Tables_Tolerance_Guide.md
@@ -123,7 +123,8 @@ WHEN OTHERS THEN
 
 ```sql
 -- Dodano 'VALIDATION_FAILED' do dozwolonych statusów
 PROCESSING_STATUS IN ('RECEIVED', 'VALIDATED', 'READY_FOR_INGESTION',
-                      'INGESTED', 'ARCHIVED', 'VALIDATION_FAILED')
+                      'INGESTED', 'ARCHIVED', 'ARCHIVED_AND_TRASHED',
+                      'ARCHIVED_AND_PURGED', 'VALIDATION_FAILED')
 ```
 
 ## 📊 Testowanie
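---

Reviewer note: the retention semantics this patch documents (one `pKeepInTrash` flag selecting between the two new terminal statuses, with the CSV moving ODS/ → TRASH/ either way) can be sketched as a tiny state model. This is an illustrative Python sketch only — `SourceFile` and `archive_table_data` are hypothetical stand-ins for the PL/SQL procedure `CT_MRDS.FILE_ARCHIVER.ARCHIVE_TABLE_DATA`, and the Parquet export itself is not modelled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SourceFile:
    """Minimal stand-in for a row of A_SOURCE_FILE_RECEIVED."""
    name: str
    status: str = "INGESTED"
    location: Optional[str] = "ODS"  # subfolder within the DATA bucket

def archive_table_data(f: SourceFile, keep_in_trash: bool = True) -> SourceFile:
    """Model of the archival step: data goes to Parquet in the ARCHIVE
    bucket (not shown); the CSV moves ODS/ -> TRASH/ and is optionally purged."""
    if f.status != "INGESTED":
        raise ValueError(f"cannot archive a file in status {f.status}")
    f.location = "TRASH"  # CSV moved to the TRASH subfolder of the DATA bucket
    if keep_in_trash:
        f.status = "ARCHIVED_AND_TRASHED"  # default: keep CSV for rollback
    else:
        f.location = None                  # cleanup: CSV deleted from TRASH
        f.status = "ARCHIVED_AND_PURGED"
    return f

kept = archive_table_data(SourceFile("file.csv"))
print(kept.status, kept.location)      # ARCHIVED_AND_TRASHED TRASH

purged = archive_table_data(SourceFile("other.csv"), keep_in_trash=False)
print(purged.status, purged.location)  # ARCHIVED_AND_PURGED None
```

The sketch also makes explicit that both new statuses are only reachable from INGESTED, matching the workflow diagrams updated by this patch.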