Documentation update for TRASH and the new file statuses.

This commit is contained in:
Grzegorz Michalski
2026-02-10 08:21:51 +01:00
parent 70909ba8c4
commit cdd9dff32d
6 changed files with 104 additions and 22 deletions


@@ -18,7 +18,7 @@ The FILE_ARCHIVER package provides flexible archival strategies that accommodate
- **Schema**: CT_MRDS
- **Package**: FILE_ARCHIVER
- **Current Version**: 3.2.0
- **Dependencies**: ENV_MANAGER, FILE_MANAGER, cloud_wrapper, A_SOURCE_FILE_CONFIG, A_SOURCE_FILE_RECEIVED, A_WORKFLOW_HISTORY
### Critical Prerequisites
@@ -177,30 +177,46 @@ WHERE ...;
├─ Active data processing (Airflow + DBT)
├─ External tables read data from bucket
├─ Status: INGESTED
├─ FILE_ARCHIVER.ARCHIVE_TABLE_DATA archives based on strategy
└─ CSV files moved to TRASH subfolder (ODS → TRASH/)
2.1 TRASH Subfolder (DATA Bucket - File Retention)
├─ Located in DATA bucket (e.g., TRASH/LM/TABLE_NAME)
├─ Stores CSV files after archival to Parquet
├─ Status: ARCHIVED_AND_TRASHED (default retention)
├─ Enables rollback if archival issues occur
└─ Optional cleanup: ARCHIVED_AND_PURGED (pKeepInTrash=FALSE)
3. ARCHIVE Bucket (Long-term Storage)
├─ Historical data in Parquet format
├─ Hive-style partitioning: PARTITION_YEAR=/PARTITION_MONTH=
├─ Status: ARCHIVED_AND_TRASHED or ARCHIVED_AND_PURGED
└─ Optimized for big data analytics (Spark, Hive)
```
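The TRASH retention shown in the diagram can be inspected directly in object storage. A minimal sketch using Oracle's standard `DBMS_CLOUD.LIST_OBJECTS`; the credential name and bucket URL are placeholders, and the project's cloud_wrapper may expose its own equivalent:

```sql
-- Placeholders: credential name and region/namespace/bucket must match your tenancy
SELECT OBJECT_NAME, BYTES, LAST_MODIFIED
FROM DBMS_CLOUD.LIST_OBJECTS(
       'OBJ_STORE_CRED',
       'https://objectstorage.<region>.oraclecloud.com/n/<namespace>/b/<data-bucket>/o/TRASH/LM/'
     )
ORDER BY LAST_MODIFIED DESC;
```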
### Archival Process
The FILE_ARCHIVER package automatically manages data movement from ODS to ARCHIVE:
**Key Procedures**:
- `ARCHIVE_TABLE_DATA(pSourceFileConfigKey, pKeepInTrash)` - Main archival procedure using strategy-specific WHERE clause
  - `pKeepInTrash` (BOOLEAN, DEFAULT TRUE) - Controls TRASH folder retention
    - TRUE: Files kept in TRASH folder for safety and rollback capability (default)
    - FALSE: Files deleted from TRASH folder after successful archival
- `GET_ARCHIVAL_WHERE_CLAUSE` - Returns WHERE clause based on configured strategy
- `GATHER_TABLE_STAT` - Calculates archival statistics using strategy logic
**Archival Execution**:
```sql
-- Default behavior: Keep files in TRASH folder (ARCHIVED_AND_TRASHED status)
BEGIN
  CT_MRDS.FILE_ARCHIVER.ARCHIVE_TABLE_DATA(
    pSourceFileConfigKey => vSourceFileConfigKey,
    pKeepInTrash => TRUE -- DEFAULT value
  );
END;
/

-- Optional: Delete files from TRASH after archival (ARCHIVED_AND_PURGED status)
BEGIN
  CT_MRDS.FILE_ARCHIVER.ARCHIVE_TABLE_DATA(
    pSourceFileConfigKey => vSourceFileConfigKey,
    pKeepInTrash => FALSE -- Cleanup TRASH folder
  );
END;
/
@@ -210,7 +226,9 @@ END;
- Package retrieves ARCHIVAL_STRATEGY from A_SOURCE_FILE_CONFIG
- GET_ARCHIVAL_WHERE_CLAUSE generates appropriate WHERE clause
- Data matching criteria moved from ODS to ARCHIVE bucket
- CSV files moved to TRASH subfolder in DATA bucket (ODS/ → TRASH/)
- Parquet format with Hive-style partitioning applied to ARCHIVE bucket
- TRASH retention controlled by pKeepInTrash parameter
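The flow above can be previewed without moving any data, for example by asking the package for the generated predicate directly. A hedged sketch: the signatures assumed here (a key parameter on `GET_ARCHIVAL_WHERE_CLAUSE` with a VARCHAR2 return, and the same key on `GATHER_TABLE_STAT`) and the sample key value are illustrative, not confirmed:

```sql
-- Illustrative sketch: helper signatures and the config key value are assumptions
DECLARE
  vSourceFileConfigKey NUMBER := 42; -- hypothetical config key
  vWhereClause         VARCHAR2(4000);
BEGIN
  -- Preview the strategy-specific predicate without archiving anything
  vWhereClause := CT_MRDS.FILE_ARCHIVER.GET_ARCHIVAL_WHERE_CLAUSE(
    pSourceFileConfigKey => vSourceFileConfigKey
  );
  DBMS_OUTPUT.PUT_LINE('Archival predicate: ' || vWhereClause);

  -- Recompute archival statistics using the same strategy logic
  CT_MRDS.FILE_ARCHIVER.GATHER_TABLE_STAT(
    pSourceFileConfigKey => vSourceFileConfigKey
  );
END;
/
```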
## Configuration Examples
@@ -527,8 +545,11 @@ WHERE object_name = 'FILE_ARCHIVER';
### OCI Buckets
- **INBOX**: Incoming file validation (`'INBOX/{SOURCE}/{SOURCE_FILE_ID}/{TABLE_NAME}/'`)
- **ODS/DATA**: Operational data processing (`'ODS/{SOURCE}/{TABLE_NAME}/'`)
- **TRASH**: File retention subfolder in DATA bucket (`'TRASH/{SOURCE}/{TABLE_NAME}/'`) - CSV files after archival
- **ARCHIVE**: Historical data storage (`'ARCHIVE/{SOURCE}/{TABLE_NAME}/PARTITION_YEAR=/PARTITION_MONTH=/'`)
**Note**: TRASH is NOT a separate bucket - it's a subfolder within the DATA bucket for file retention and rollback capability.
## Best Practices
### Strategy Selection Guidelines
@@ -609,10 +630,53 @@ WHERE object_name = 'FILE_ARCHIVER';
- Check for tables without archival configuration
- Optimize MINIMUM_AGE_MONTHS based on actual usage patterns
### TRASH Folder Retention Best Practices
1. **Default Behavior (pKeepInTrash = TRUE - Recommended)**:
- Keeps CSV files in TRASH folder after archival
- Provides safety net for rollback if archival issues occur
- Supports compliance and audit requirements
- Status: ARCHIVED_AND_TRASHED
- Use for: Production environments, regulatory compliance, critical data
2. **TRASH Cleanup (pKeepInTrash = FALSE)**:
- Deletes CSV files from TRASH folder after successful archival
- Reduces storage costs in DATA bucket
- Status: ARCHIVED_AND_PURGED
- Use for: Non-critical data, storage optimization, test environments
3. **Monitoring TRASH Folder**:
```sql
-- Check files in TRASH retention
SELECT
SOURCE_FILE_NAME,
PROCESSING_STATUS,
ARCH_FILE_NAME,
PARTITION_YEAR,
PARTITION_MONTH
FROM CT_MRDS.A_SOURCE_FILE_RECEIVED
WHERE PROCESSING_STATUS IN ('ARCHIVED_AND_TRASHED', 'ARCHIVED_AND_PURGED')
AND RECEPTION_DATE > SYSDATE - 30
ORDER BY PROCESSING_STATUS, RECEPTION_DATE DESC;
```
4. **TRASH Folder Structure**:
```
DATA Bucket:
├── ODS/LM/STANDING_FACILITIES/file.csv -- Active operational data
└── TRASH/LM/STANDING_FACILITIES/file.csv -- Retained after archival
ARCHIVE Bucket:
└── ARCHIVE/LM/STANDING_FACILITIES/
└── PARTITION_YEAR=2026/
└── PARTITION_MONTH=02/
└── *.parquet -- Archived data
```
## Author
Created by: Grzegorz Michalski
Date: 2026-02-06
Schema: CT_MRDS
Package: FILE_ARCHIVER
Version: 3.2.0


@@ -371,11 +371,14 @@ INBOX Bucket - Pattern: 'INBOX/{SOURCE}/{SOURCE_FILE_ID}/{TABLE_NAME}/'
└── {pTableId}/ -- e.g., "A_UC_DISSEM_METADATA_LOADS", "STANDING_FACILITIES"
    └── files matching {pSourceFileNamePattern}
DATA Bucket - Patterns: 'ODS/{SOURCE}/{TABLE_NAME}/' and 'TRASH/{SOURCE}/{TABLE_NAME}/'
├── ODS/
│   └── {pSourceKey}/ -- e.g., "C2D", "LM"
│       └── {pTableId}/ -- e.g., "A_UC_DISSEM_METADATA_LOADS", "STANDING_FACILITIES"
│           └── processed files
└── TRASH/ -- File retention subfolder (not a separate bucket)
    └── {pSourceKey}/ -- e.g., "C2D", "LM"
        └── {pTableId}/ -- CSV files after archival (ARCHIVED_AND_TRASHED status)
ARCHIVE Bucket - Pattern: 'ARCHIVE/{SOURCE}/{TABLE_NAME}/'
└── ARCHIVE/
@@ -389,9 +392,11 @@ ARCHIVE Bucket - Pattern: 'ARCHIVE/{SOURCE}/{TABLE_NAME}/'
**Critical Path Pattern Requirements:**
- **INBOX** requires full 3-level path: `INBOX/{SOURCE}/{SOURCE_FILE_ID}/{TABLE_NAME}/`
- **ODS** uses simplified 2-level path: `ODS/{SOURCE}/{TABLE_NAME}/` (no SOURCE_FILE_ID)
- **TRASH** uses simplified 2-level path: `TRASH/{SOURCE}/{TABLE_NAME}/` (subfolder in DATA bucket)
- **ARCHIVE** uses simplified 2-level path: `ARCHIVE/{SOURCE}/{TABLE_NAME}/` (no SOURCE_FILE_ID)
- **All patterns are mandatory** - no simplified versions allowed
- File names must match `pSourceFileNamePattern` for automatic processing
- **Note**: TRASH is NOT a separate bucket - it's a subfolder within the DATA bucket
## Configuration Management Best Practices
@@ -693,7 +698,10 @@ SELECT FILE_MANAGER.PROCESS_SOURCE_FILE(
1. **File Arrival**: File is uploaded to Oracle Cloud Storage bucket
2. **Registration**: FILE_MANAGER.REGISTER_SOURCE_FILE_RECEIVED() creates record
3. **Status**: RECEIVED → VALIDATED → READY_FOR_INGESTION → INGESTED → ARCHIVED_AND_TRASHED → ARCHIVED_AND_PURGED (optional)
- Legacy ARCHIVED status maintained for backward compatibility
- ARCHIVED_AND_TRASHED: Files archived to Parquet and kept in TRASH folder (default)
- ARCHIVED_AND_PURGED: Files archived to Parquet and deleted from TRASH folder
4. **External Table**: Created automatically based on template table
5. **Data Loading**: Data is loaded into target ODS schema
6. **Archival**: File is moved to archive bucket after processing
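The status transitions above can be traced per file through the workflow history table listed among the package dependencies. A minimal sketch; the column names of A_WORKFLOW_HISTORY used here are assumptions:

```sql
-- Column names in A_WORKFLOW_HISTORY are illustrative assumptions
SELECT SOURCE_FILE_NAME,
       PROCESSING_STATUS,
       STATUS_CHANGE_DATE
FROM CT_MRDS.A_WORKFLOW_HISTORY
WHERE SOURCE_FILE_NAME = :file_name
ORDER BY STATUS_CHANGE_DATE;
```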


@@ -164,7 +164,9 @@ ORDER BY RECEPTION_DATE DESC;
| `VALIDATED` | File validation completed successfully | After successful validation |
| `READY_FOR_INGESTION` | File validated and prepared for Airflow+DBT processing | After successful validation and preparation |
| `INGESTED` | Data has been consumed/ingested by target system | After data consumption |
| `ARCHIVED` | (Legacy) Data exported to PARQUET format and file moved to archival storage | Legacy archival state (backward compatibility) |
| `ARCHIVED_AND_TRASHED` | Data archived to Parquet, CSV files kept in TRASH folder (default) | Archival with file retention using FILE_ARCHIVER |
| `ARCHIVED_AND_PURGED` | Data archived to Parquet, CSV files deleted from TRASH folder | Archival with TRASH cleanup (pKeepInTrash=FALSE) |
| `VALIDATION_FAILED` | File validation failed | After failed validation |
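A quick distribution of files across the statuses in this table can be taken from A_SOURCE_FILE_RECEIVED; the table, status column, and RECEPTION_DATE all appear in this documentation's other monitoring examples:

```sql
-- Distribution of recent files across processing statuses
SELECT PROCESSING_STATUS,
       COUNT(*) AS FILE_COUNT
FROM CT_MRDS.A_SOURCE_FILE_RECEIVED
WHERE RECEPTION_DATE > SYSDATE - 30
GROUP BY PROCESSING_STATUS
ORDER BY FILE_COUNT DESC;
```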


@@ -68,7 +68,9 @@ ARCH_FILE_NAME VARCHAR2 -- Parquet archive file path
**Status Workflow**:
```
RECEIVED → VALIDATED → READY_FOR_INGESTION → INGESTED → ARCHIVED_AND_TRASHED → ARCHIVED_AND_PURGED (optional)
Note: Legacy ARCHIVED status maintained for backward compatibility
```
**Usage Pattern**:


@@ -394,6 +394,9 @@ DATA Bucket:
├── ODS/
│   └── {SOURCE}/
│       └── {TABLE_NAME}/
└── TRASH/ -- File retention subfolder (not a separate bucket)
    └── {SOURCE}/
        └── {TABLE_NAME}/ -- CSV files after archival (ARCHIVED_AND_TRASHED status)
ARCHIVE Bucket:
└── ARCHIVE/
@@ -402,6 +405,8 @@ ARCHIVE Bucket:
└── PARTITION_YEAR=*/
    └── PARTITION_MONTH=*/
        └── *.parquet
Note: TRASH is a subfolder within the DATA bucket for file retention and rollback capability.
```
### 4. Migration Checklist


@@ -123,7 +123,8 @@ WHEN OTHERS THEN
```sql
-- Added 'VALIDATION_FAILED' to the allowed statuses
PROCESSING_STATUS IN ('RECEIVED', 'VALIDATED', 'READY_FOR_INGESTION',
                      'INGESTED', 'ARCHIVED', 'ARCHIVED_AND_TRASHED',
                      'ARCHIVED_AND_PURGED', 'VALIDATION_FAILED')
```
## 📊 Testing