Update documentation for TRASH and the new file statuses.

Grzegorz Michalski
2026-02-10 08:21:51 +01:00
parent 70909ba8c4
commit cdd9dff32d
6 changed files with 104 additions and 22 deletions

@@ -18,7 +18,7 @@ The FILE_ARCHIVER package provides flexible archival strategies that accommodate
- **Schema**: CT_MRDS
- **Package**: FILE_ARCHIVER
-- **Current Version**: 3.1.0
+- **Current Version**: 3.2.0
- **Dependencies**: ENV_MANAGER, FILE_MANAGER, cloud_wrapper, A_SOURCE_FILE_CONFIG, A_SOURCE_FILE_RECEIVED, A_WORKFLOW_HISTORY
### Critical Prerequisites
@@ -177,30 +177,46 @@ WHERE ...;
├─ Active data processing (Airflow + DBT)
├─ External tables read data from bucket
├─ Status: INGESTED
-└─ FILE_ARCHIVER.ARCHIVE_TABLE_DATA archives based on strategy
+├─ FILE_ARCHIVER.ARCHIVE_TABLE_DATA archives based on strategy
└─ CSV files moved to TRASH subfolder (ODS → TRASH/)
2.1 TRASH Subfolder (DATA Bucket - File Retention)
├─ Located in DATA bucket (e.g., TRASH/LM/TABLE_NAME)
├─ Stores CSV files after archival to Parquet
├─ Status: ARCHIVED_AND_TRASHED (default retention)
├─ Enables rollback if archival issues occur
└─ Optional cleanup: ARCHIVED_AND_PURGED (pKeepInTrash=FALSE)
3. ARCHIVE Bucket (Long-term Storage)
├─ Historical data in Parquet format
├─ Hive-style partitioning: PARTITION_YEAR=/PARTITION_MONTH=
-├─ Status: ARCHIVED
+├─ Status: ARCHIVED_AND_TRASHED or ARCHIVED_AND_PURGED
└─ Optimized for big data analytics (Spark, Hive)
```
### Archival Process
The FILE_ARCHIVER package automatically manages data movement from ODS to ARCHIVE:
**Key Procedures**:
-- `ARCHIVE_TABLE_DATA` - Main archival procedure using strategy-specific WHERE clause
+- `ARCHIVE_TABLE_DATA(pSourceFileConfigKey, pKeepInTrash)` - Main archival procedure using strategy-specific WHERE clause
- `pKeepInTrash` (BOOLEAN, DEFAULT TRUE) - Controls TRASH folder retention
- TRUE: Files kept in TRASH folder for safety and rollback capability (default)
- FALSE: Files deleted from TRASH folder after successful archival
- `GET_ARCHIVAL_WHERE_CLAUSE` - Returns WHERE clause based on configured strategy
- `GATHER_TABLE_STAT` - Calculates archival statistics using strategy logic
**Archival Execution**:
```sql
-- Triggered by FILE_MANAGER or scheduled job
-- Default behavior: Keep files in TRASH folder (ARCHIVED_AND_TRASHED status)
BEGIN
CT_MRDS.FILE_ARCHIVER.ARCHIVE_TABLE_DATA(
-pSourceFileConfig => vSourceFileConfigRecord
+pSourceFileConfigKey => vSourceFileConfigKey,
pKeepInTrash => TRUE -- DEFAULT value
);
END;
/
-- Optional: Delete files from TRASH after archival (ARCHIVED_AND_PURGED status)
BEGIN
CT_MRDS.FILE_ARCHIVER.ARCHIVE_TABLE_DATA(
pSourceFileConfigKey => vSourceFileConfigKey,
pKeepInTrash => FALSE -- Cleanup TRASH folder
);
END;
/
@@ -210,7 +226,9 @@ END;
- Package retrieves ARCHIVAL_STRATEGY from A_SOURCE_FILE_CONFIG
- GET_ARCHIVAL_WHERE_CLAUSE generates appropriate WHERE clause
- Data matching criteria moved from ODS to ARCHIVE bucket
-- Parquet format with Hive-style partitioning applied
+- CSV files moved to TRASH subfolder in DATA bucket (ODS/ → TRASH/)
+- Parquet format with Hive-style partitioning applied to ARCHIVE bucket
- TRASH retention controlled by pKeepInTrash parameter
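The relationship between `pKeepInTrash` and the resulting file status can be sketched in a few lines of PL/SQL. This is an illustration only, not the package's implementation: the variable names `vKeepInTrash` and `vFinalStatus` are hypothetical, and the block does not call anything in FILE_ARCHIVER.

```sql
-- Illustrative sketch: maps the pKeepInTrash choice to the documented status.
-- vKeepInTrash and vFinalStatus are hypothetical names used only here.
DECLARE
    vKeepInTrash BOOLEAN := TRUE;   -- mirrors the documented DEFAULT TRUE
    vFinalStatus VARCHAR2(30);
BEGIN
    vFinalStatus := CASE
                        WHEN vKeepInTrash THEN 'ARCHIVED_AND_TRASHED'  -- CSV kept in TRASH/
                        ELSE 'ARCHIVED_AND_PURGED'                     -- CSV deleted from TRASH/
                    END;
    -- Output is visible only when server output is enabled in the client
    DBMS_OUTPUT.PUT_LINE('Expected status: ' || vFinalStatus);
END;
/
```

Changing `vKeepInTrash` to `FALSE` selects the `ARCHIVED_AND_PURGED` branch instead.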
## Configuration Examples
@@ -527,8 +545,11 @@ WHERE object_name = 'FILE_ARCHIVER';
### OCI Buckets
- **INBOX**: Incoming file validation (`'INBOX/{SOURCE}/{SOURCE_FILE_ID}/{TABLE_NAME}/'`)
- **ODS/DATA**: Operational data processing (`'ODS/{SOURCE}/{TABLE_NAME}/'`)
- **TRASH**: File retention subfolder in the DATA bucket (`'TRASH/{SOURCE}/{TABLE_NAME}/'`); holds CSV files after archival
- **ARCHIVE**: Historical data storage (`'ARCHIVE/{SOURCE}/{TABLE_NAME}/PARTITION_YEAR=/PARTITION_MONTH=/'`)
**Note**: TRASH is NOT a separate bucket - it's a subfolder within the DATA bucket for file retention and rollback capability.
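The path templates above can be expanded with plain string concatenation. The query below is a minimal sketch and not part of FILE_ARCHIVER: the values `LM`, `STANDING_FACILITIES`, `2026`, and `02` are sample inputs, the column aliases are hypothetical, and it assumes the partition values are appended directly after `PARTITION_YEAR=` and `PARTITION_MONTH=`.

```sql
-- Illustrative sketch: expand the documented prefix templates for one table.
-- SRC, TBL, YR and MON are sample values, not read from any configuration table.
SELECT
    'ODS/'     || SRC || '/' || TBL || '/' AS ODS_PREFIX,
    'TRASH/'   || SRC || '/' || TBL || '/' AS TRASH_PREFIX,
    'ARCHIVE/' || SRC || '/' || TBL
               || '/PARTITION_YEAR='  || YR
               || '/PARTITION_MONTH=' || MON || '/' AS ARCHIVE_PREFIX
FROM (
    SELECT 'LM' AS SRC,
           'STANDING_FACILITIES' AS TBL,
           '2026' AS YR,
           '02' AS MON
    FROM DUAL
);
```

Because ODS and TRASH share the DATA bucket, the two prefixes differ only in their first path segment, which is why moving a file to TRASH is a rename within one bucket rather than a cross-bucket copy.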
## Best Practices
### Strategy Selection Guidelines
@@ -609,10 +630,53 @@ WHERE object_name = 'FILE_ARCHIVER';
- Check for tables without archival configuration
- Optimize MINIMUM_AGE_MONTHS based on actual usage patterns
### TRASH Folder Retention Best Practices
1. **Default Behavior (pKeepInTrash = TRUE - Recommended)**:
- Keeps CSV files in TRASH folder after archival
- Provides safety net for rollback if archival issues occur
- Supports compliance and audit requirements
- Status: ARCHIVED_AND_TRASHED
- Use for: Production environments, regulatory compliance, critical data
2. **TRASH Cleanup (pKeepInTrash = FALSE)**:
- Deletes CSV files from TRASH folder after successful archival
- Reduces storage costs in DATA bucket
- Status: ARCHIVED_AND_PURGED
- Use for: Non-critical data, storage optimization, test environments
3. **Monitoring TRASH Folder**:
```sql
-- Check recently archived files: CSV retained in TRASH (ARCHIVED_AND_TRASHED) or already purged (ARCHIVED_AND_PURGED)
SELECT
SOURCE_FILE_NAME,
PROCESSING_STATUS,
ARCH_FILE_NAME,
PARTITION_YEAR,
PARTITION_MONTH
FROM CT_MRDS.A_SOURCE_FILE_RECEIVED
WHERE PROCESSING_STATUS IN ('ARCHIVED_AND_TRASHED', 'ARCHIVED_AND_PURGED')
AND RECEPTION_DATE > SYSDATE - 30
ORDER BY PROCESSING_STATUS, RECEPTION_DATE DESC;
```
4. **TRASH Folder Structure**:
```
DATA Bucket:
├── ODS/LM/STANDING_FACILITIES/file.csv -- Active operational data
└── TRASH/LM/STANDING_FACILITIES/file.csv -- Retained after archival
ARCHIVE Bucket:
└── ARCHIVE/LM/STANDING_FACILITIES/
└── PARTITION_YEAR=2026/
└── PARTITION_MONTH=02/
└── *.parquet -- Archived data
```
## Author
Created by: Grzegorz Michalski
-Date: 2026-02-04
+Date: 2026-02-06
Schema: CT_MRDS
Package: FILE_ARCHIVER
-Version: 3.1.0
+Version: 3.2.0