# PROCESS_SOURCE_FILE Procedure Guide

This document describes the `FILE_MANAGER.PROCESS_SOURCE_FILE` procedure, which validates incoming files and prepares them for loading by Airflow+DBT pipelines through Oracle Cloud Infrastructure (OCI) file management operations.

## Overview

`PROCESS_SOURCE_FILE` is an umbrella procedure that orchestrates the complete workflow from file registration and validation to OCI storage preparation, ensuring files are properly validated and positioned for consumption by the Airflow+DBT data processing stack.

**Key Characteristics:**

- **File Validation Focus**: Comprehensive validation of incoming CSV files against template structures
- **Airflow+DBT Preparation**: Prepares validated files for loading and processing by Airflow+DBT pipelines
- **OCI File Management**: Handles file operations and movements within Oracle Cloud Infrastructure
- **Umbrella Procedure**: Coordinates multiple validation and file preparation sub-procedures in sequence
- **Automated Workflow**: Requires minimal manual intervention once configured
- **Error Resilient**: Comprehensive error handling and logging for validation and file operations
- **Status Tracking**: Updates the file processing status throughout the validation and preparation workflow

**Migration Context:**

- This procedure is part of the **modern Airflow + DBT system** architecture
- It creates records in `CT_MRDS.A_SOURCE_FILE_RECEIVED` (the modern control table)
- For legacy Informatica + WLA data migration, see the [System Migration Guide](System_Migration_Informatica_to_Airflow_DBT.md)

## Procedure Signatures

The procedure is available in two variants:

### Procedure Version

```sql
PROCEDURE PROCESS_SOURCE_FILE(pSourceFileReceivedName IN VARCHAR2);
```

**Purpose**: Execute the processing workflow without a return value

**Use Case**: Standard automated processing, fire-and-forget scenarios

### Function Version

```sql
FUNCTION PROCESS_SOURCE_FILE(pSourceFileReceivedName IN VARCHAR2) RETURN PLS_INTEGER;
```

**Purpose**: Execute the processing workflow and return a status code

**Use Case**: When the caller needs to check the processing result programmatically

## Parameters

### pSourceFileReceivedName

- **Type**: VARCHAR2
- **Required**: Yes
- **Description**: Relative path to the file within the cloud storage bucket
- **Format**: `INBOX/{SOURCE}/{SOURCE_FILE_ID}/{TABLE_ID}/filename.csv`

**Examples:**

```sql
'INBOX/C2D/UC_DISSEM/A_UC_DISSEM_METADATA_LOADS/UC_NMA_DISSEM-277740.csv'
'INBOX/TOP/ALLOTMENT/AGGREGATED_ALLOTMENT/allotment_data_20241006.csv'
'INBOX/LM/RATES/INTEREST_RATES/rates_monthly_202410.csv'
```

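The path components correspond to rows in the configuration tables. As a sketch (the real matching logic is internal to `REGISTER_SOURCE_FILE_RECEIVED`, and this assumes `SOURCE_FILE_NAME_PATTERN` holds a regular expression), a lookup like the following can confirm which configuration row a given filename would resolve to:

```sql
-- Sketch: which configuration row (if any) does this filename match?
-- Assumes SOURCE_FILE_NAME_PATTERN is a regular expression.
SELECT s.A_SOURCE_KEY,
       sfc.SOURCE_FILE_ID,
       sfc.TABLE_ID,
       sfc.SOURCE_FILE_NAME_PATTERN
FROM   CT_MRDS.A_SOURCE_FILE_CONFIG sfc
JOIN   CT_MRDS.A_SOURCE s ON s.A_SOURCE_KEY = sfc.A_SOURCE_KEY
WHERE  REGEXP_LIKE('UC_NMA_DISSEM-277740.csv', sfc.SOURCE_FILE_NAME_PATTERN);
```

If this returns no rows, `PROCESS_SOURCE_FILE` will fail with error -20002 (see Return Values below).
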
## Processing Workflow

The procedure executes six main steps in sequence:

### Step 1: REGISTER_SOURCE_FILE_RECEIVED

**Purpose**: Register the file in the system and extract its metadata

**Actions:**

- Creates a record in the `CT_MRDS.A_SOURCE_FILE_RECEIVED` table
- Determines the source configuration based on the file path pattern
- Extracts file metadata (size, checksum, creation date)
- Assigns a unique `A_SOURCE_FILE_RECEIVED_KEY`
- Sets the initial status to 'RECEIVED'

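After this step, the registration can be confirmed directly in the control table (column names as used in the monitoring queries later in this guide; the filename is an example):

```sql
-- Confirm registration: the new row should appear with status 'RECEIVED'
SELECT A_SOURCE_FILE_RECEIVED_KEY,
       SOURCE_FILE_NAME,
       PROCESSING_STATUS,
       RECEPTION_DATE
FROM   CT_MRDS.A_SOURCE_FILE_RECEIVED
WHERE  SOURCE_FILE_NAME = 'UC_NMA_DISSEM-277740.csv';
```
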
### Step 2: CREATE_EXTERNAL_TABLE

**Purpose**: Create a temporary external table for data access

**Actions:**

- Generates a unique external table name
- Creates an external table pointing to the CSV file
- Uses the template table structure from `CT_ET_TEMPLATES`
- Configures appropriate column mappings and data types

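The generated DDL is internal to the procedure, but for orientation, an equivalent manual call on Oracle Autonomous Database would look roughly like the sketch below. The table name, column list, and URI are illustrative assumptions; in practice the procedure derives them from the `CT_ET_TEMPLATES` template and the registered file path:

```sql
-- Illustrative sketch only: PROCESS_SOURCE_FILE generates its own unique
-- table name and derives the column list from the template table.
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name      => 'ET_UC_DISSEM_277740',  -- example generated name
    credential_name => 'DEF_CRED_ARN',
    file_uri_list   => 'https://your-bucket-uri/INBOX/C2D/UC_DISSEM/A_UC_DISSEM_METADATA_LOADS/UC_NMA_DISSEM-277740.csv',
    format          => '{"type":"csv","skipheaders":"1"}',
    column_list     => 'LOAD_ID NUMBER, LOAD_NAME VARCHAR2(200)'  -- from template
  );
END;
/
```
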
### Step 3: VALIDATE_SOURCE_FILE_RECEIVED

**Purpose**: Perform comprehensive data validation

**Actions:**

- Validates the CSV column count against the template
- Checks data type compatibility
- Verifies required fields are populated
- Performs business rule validations
- Updates the status to 'VALIDATED' on success

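The core of the column-count check can be approximated from the data dictionary. This sketch compares the template's column count against the temporary external table from Step 2 (both table names here are illustrative assumptions):

```sql
-- Sketch: compare template column count with what the CSV actually delivers
-- (approximated via the temporary external table; names are examples).
SELECT (SELECT COUNT(*) FROM all_tab_columns
        WHERE owner = 'CT_ET_TEMPLATES'
        AND   table_name = 'C2D_UC_DISSEM')      AS template_cols,
       (SELECT COUNT(*) FROM user_tab_columns
        WHERE table_name = 'ET_UC_DISSEM_277740') AS csv_cols
FROM   dual;
```

A mismatch between the two counts is what surfaces as error -20011.
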
### Step 4: DROP_EXTERNAL_TABLE

**Purpose**: Clean up the temporary external table

**Actions:**

- Drops the temporary external table created in Step 2
- Releases database resources
- Maintains a clean schema state

### Step 5: MOVE_FILE

**Purpose**: Relocate the file from INBOX to the ODS location

**Actions:**

- Copies the file from the INBOX bucket to the ODS bucket
- Preserves file metadata
- Deletes the original file from INBOX after a successful copy

### Step 6: SET_SOURCE_FILE_RECEIVED_STATUS

**Purpose**: Update the final processing status

**Actions:**

- Sets `PROCESSING_STATUS` to 'READY_FOR_INGESTION'
- Records the completion timestamp
- Indicates the file is validated and ready for Airflow+DBT processing

## Return Values (Function Version)

| Value | Meaning | Description |
|-------|---------|-------------|
| `0` | Success | File processed successfully through all steps |
| `-20001` | Empty Parameters | Both fileUri and receivedKey parameters are NULL |
| `-20002` | No Config Match | No configuration matches the file pattern |
| `-20011` | Column Mismatch | CSV has a different column count than the template |
| `-20021` | Processing Error | General processing failure |
| Other negative | Various Errors | Specific error codes for other failure scenarios |

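A typical caller of the function version captures the status code and branches on the documented values, for example (a minimal sketch; the messages and error handling policy are up to the caller):

```sql
-- Function version: capture the status code and react to known errors
DECLARE
  vStatus PLS_INTEGER;
BEGIN
  vStatus := CT_MRDS.FILE_MANAGER.PROCESS_SOURCE_FILE(
    pSourceFileReceivedName => 'INBOX/C2D/UC_DISSEM/A_UC_DISSEM_METADATA_LOADS/data_file.csv'
  );

  IF vStatus = 0 THEN
    DBMS_OUTPUT.PUT_LINE('File processed successfully');
  ELSIF vStatus = -20011 THEN
    DBMS_OUTPUT.PUT_LINE('Column mismatch - compare the CSV against the template table');
  ELSE
    DBMS_OUTPUT.PUT_LINE('Processing failed with code ' || vStatus);
  END IF;
END;
/
```
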
## Usage Examples

### Basic Processing

```sql
-- Simple processing (procedure version)
BEGIN
  CT_MRDS.FILE_MANAGER.PROCESS_SOURCE_FILE(
    pSourceFileReceivedName => 'INBOX/C2D/UC_DISSEM/A_UC_DISSEM_METADATA_LOADS/data_file.csv'
  );
END;
/
```

## Prerequisites

Before using `PROCESS_SOURCE_FILE`, ensure proper system configuration is in place. For detailed setup instructions, including source system registration, file type configuration, template table creation, and date format configuration, see the [FILE_MANAGER Configuration Guide](FILE_MANAGER_Configuration_Guide.md).

## Monitoring and Troubleshooting

### Monitoring File Processing Status

```sql
-- Check recent file processing activity
SELECT SOURCE_FILE_NAME,
       PROCESSING_STATUS,
       RECEPTION_DATE,
       EXTERNAL_TABLE_NAME
FROM   CT_MRDS.A_SOURCE_FILE_RECEIVED
WHERE  RECEPTION_DATE >= SYSDATE - 1  -- Last 24 hours
ORDER  BY RECEPTION_DATE DESC;
```

### Processing Status Values

| Status | Description | Workflow Stage |
|--------|-------------|----------------|
| `RECEIVED` | File registered, processing starting | Initial registration |
| `VALIDATED` | File validation completed successfully | After successful validation |
| `READY_FOR_INGESTION` | File validated and prepared for Airflow+DBT processing | After successful validation and preparation |
| `INGESTED` | Data has been consumed/ingested by the target system | After data consumption |
| `ARCHIVED` | (Legacy) Data exported to Parquet format and file moved to archival storage | Legacy archival state (backward compatibility) |
| `ARCHIVED_AND_TRASHED` | Data archived to Parquet, CSV files kept in the TRASH folder (default) | Archival with file retention via FILE_ARCHIVER |
| `ARCHIVED_AND_PURGED` | Data archived to Parquet, CSV files deleted from the TRASH folder | Archival with TRASH cleanup (`pKeepInTrash=FALSE`) |
| `VALIDATION_FAILED` | File validation failed | After failed validation |

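For day-to-day operations, a grouped query over these statuses quickly surfaces backlogs (for example, an accumulation of `VALIDATION_FAILED` rows):

```sql
-- Distribution of files across statuses over the last week
SELECT PROCESSING_STATUS,
       COUNT(*) AS file_count
FROM   CT_MRDS.A_SOURCE_FILE_RECEIVED
WHERE  RECEPTION_DATE >= SYSDATE - 7
GROUP  BY PROCESSING_STATUS
ORDER  BY file_count DESC;
```
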
### Detailed Processing Logs

```sql
-- View detailed processing logs
SELECT LOG_TIMESTAMP,
       PROCEDURE_NAME,
       LOG_LEVEL,
       LOG_MESSAGE,
       PROCEDURE_PARAMETERS
FROM   CT_MRDS.A_PROCESS_LOG
WHERE  PROCEDURE_NAME IN ('PROCESS_SOURCE_FILE', 'REGISTER_SOURCE_FILE_RECEIVED',
                          'CREATE_EXTERNAL_TABLE', 'VALIDATE_SOURCE_FILE_RECEIVED')
AND    LOG_TIMESTAMP >= SYSDATE - 1
ORDER  BY LOG_TIMESTAMP DESC;
```

### Common Error Scenarios and Solutions

#### Error -20002: No Configuration Match

**Problem**: File path doesn't match any configured pattern

```sql
-- Check configured patterns
SELECT s.A_SOURCE_KEY,
       sfc.SOURCE_FILE_ID,
       sfc.SOURCE_FILE_NAME_PATTERN,
       sfc.TABLE_ID
FROM   CT_MRDS.A_SOURCE_FILE_CONFIG sfc
JOIN   CT_MRDS.A_SOURCE s ON s.A_SOURCE_KEY = sfc.A_SOURCE_KEY
ORDER  BY s.A_SOURCE_KEY, sfc.SOURCE_FILE_ID;
```

**Solution**: Add the missing configuration or correct the file naming

#### Error -20011: Column Count Mismatch

**Problem**: CSV file has a different number of columns than the template table

```sql
-- Check the template table structure
SELECT column_name, data_type, column_id
FROM   user_tab_columns
WHERE  table_name = 'YOUR_TEMPLATE_TABLE'
ORDER  BY column_id;

-- Analyze validation errors
-- (replace file_key with the actual A_SOURCE_FILE_RECEIVED_KEY)
SELECT FILE_MANAGER.ANALYZE_VALIDATION_ERRORS(file_key) FROM DUAL;
```

**Solutions**:

1. Fix the CSV file column count
2. Add the missing columns to the template table
3. Remove the excess columns from the CSV

#### File Not Found Errors

**Problem**: File doesn't exist in the expected cloud storage location

```sql
-- List files under the expected prefix
-- (DBMS_CLOUD.LIST_OBJECTS takes the prefix as part of the location URI)
SELECT object_name
FROM   DBMS_CLOUD.LIST_OBJECTS(
         credential_name => 'DEF_CRED_ARN',
         location_uri    => 'https://your-bucket-uri/INBOX/C2D/UC_DISSEM/'
       )
WHERE  ROWNUM <= 20;
```

**Solutions**:

1. Verify the file was uploaded to the correct location
2. Check that the file name matches the expected pattern
3. Verify cloud storage credentials and permissions

## Enhanced Error Monitoring and Logging

### Error Log Monitoring

The FILE_MANAGER system provides comprehensive error logging for troubleshooting:

```sql
-- View recent processing errors
SELECT LOG_TIMESTAMP, LOG_LEVEL, LOG_MESSAGE, PROCEDURE_NAME
FROM   CT_MRDS.A_PROCESS_LOG
WHERE  LOG_LEVEL = 'ERROR'
AND    LOG_TIMESTAMP >= SYSDATE - 1  -- Last 24 hours
ORDER  BY LOG_TIMESTAMP DESC;

-- View validation-specific errors
SELECT LOG_TIMESTAMP, LOG_MESSAGE
FROM   CT_MRDS.A_PROCESS_LOG
WHERE  LOG_MESSAGE LIKE '%EXCESS COLUMNS%'
OR     LOG_MESSAGE LIKE '%VALIDATION%'
ORDER  BY LOG_TIMESTAMP DESC;

-- Analyze errors for a specific file
SELECT sfl.SOURCE_FILE_NAME, pl.LOG_MESSAGE, pl.LOG_TIMESTAMP
FROM   CT_MRDS.A_SOURCE_FILE_RECEIVED sfl
JOIN   CT_MRDS.A_PROCESS_LOG pl
       ON pl.LOG_MESSAGE LIKE '%' || sfl.SOURCE_FILE_NAME || '%'
WHERE  sfl.SOURCE_FILE_NAME = 'your_file.csv'
AND    pl.LOG_LEVEL = 'ERROR';
```

### File Validation and Error Handling

The FILE_MANAGER system includes comprehensive validation features for CSV files during processing:

#### Pre-Processing Validation

- **Column Count Verification**: Automatically checks whether CSV files match the template table structure
- **Error Prevention**: Validates files before creating external tables to prevent processing failures
- **Detailed Error Messages**: Provides specific guidance when validation fails

#### Common Validation Scenarios

**Scenario 1: Excess Columns (Error -20011)**

```
EXCESS COLUMNS DETECTED!
CSV file has 8 columns but template expects only 5
Excess columns: 3
```

**Solutions:**

1. Remove the excess columns from the CSV file
2. Add the missing columns to the template table:

```sql
ALTER TABLE CT_ET_TEMPLATES.{SOURCE}_{TABLE_NAME}
ADD (NEW_COLUMN1 VARCHAR2(100), NEW_COLUMN2 NUMBER);
```

#### Error Analysis for File Validation

```sql
-- Find the file key for analysis
SELECT A_SOURCE_FILE_RECEIVED_KEY
FROM   CT_MRDS.A_SOURCE_FILE_RECEIVED
WHERE  SOURCE_FILE_NAME = 'your_file.csv';

-- Analyze validation errors using the wrapper function
SELECT CT_MRDS.FILE_MANAGER.ANALYZE_VALIDATION_ERRORS(file_key) FROM DUAL;

-- Example with a specific key:
SELECT CT_MRDS.FILE_MANAGER.ANALYZE_VALIDATION_ERRORS(63) FROM DUAL;
```

#### Validation Error Monitoring

```sql
-- View recent validation errors
SELECT LOG_TIMESTAMP, LOG_MESSAGE
FROM   CT_MRDS.A_PROCESS_LOG
WHERE  LOG_LEVEL = 'ERROR'
AND    (LOG_MESSAGE LIKE '%EXCESS COLUMNS%' OR LOG_MESSAGE LIKE '%VALIDATION%')
ORDER  BY LOG_TIMESTAMP DESC;
```

### Common Error Patterns and Solutions

| Error Code | Pattern | Solution |
|------------|---------|----------|
| ORA-20011 | EXCESS COLUMNS DETECTED | Remove excess columns from the CSV or add missing columns to the template table |
| ORA-20002 | No match for source file | Configure the file pattern in A_SOURCE_FILE_CONFIG |
| ORA-29913 | External table open error | Check bucket paths and file existence |
| ORA-01821 | Date format not recognized | Update the date format via ADD_COLUMN_DATE_FORMAT |

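For the ORA-01821 case, the candidate format mask can be tested outside the pipeline before updating the configuration. A quick check against a sample value from the failing file (the value and mask below are examples):

```sql
-- Test a format mask against a sample value from the failing file.
-- A mask that does not fit the data raises ORA-01821 or ORA-01861.
SELECT TO_DATE('2024-10-06 14:30:00', 'YYYY-MM-DD HH24:MI:SS') AS parsed
FROM   dual;
```
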
### Proactive Monitoring Setup

Set up monitoring for critical error patterns:

```sql
-- Create a monitoring view for critical errors
CREATE OR REPLACE VIEW V_CRITICAL_ERRORS AS
SELECT LOG_TIMESTAMP,
       PROCEDURE_NAME,
       CASE
         WHEN LOG_MESSAGE LIKE '%ORA-20011%' THEN 'COLUMN_MISMATCH'
         WHEN LOG_MESSAGE LIKE '%ORA-20002%' THEN 'CONFIG_MISSING'
         WHEN LOG_MESSAGE LIKE '%ORA-29913%' THEN 'FILE_ACCESS'
         ELSE 'OTHER_ERROR'
       END AS ERROR_CATEGORY,
       LOG_MESSAGE
FROM   CT_MRDS.A_PROCESS_LOG
WHERE  LOG_LEVEL = 'ERROR'
AND    LOG_TIMESTAMP >= SYSDATE - 7;  -- Last week
```

This monitoring helps identify and resolve issues quickly, ensuring smooth file processing operations.

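With the view in place, a daily triage query summarizes errors per category:

```sql
-- Daily triage: error counts per category from the monitoring view
SELECT ERROR_CATEGORY,
       COUNT(*) AS occurrences
FROM   V_CRITICAL_ERRORS
GROUP  BY ERROR_CATEGORY
ORDER  BY occurrences DESC;
```
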
## Best Practices

### File Naming Conventions

- Use consistent naming patterns that match `SOURCE_FILE_NAME_PATTERN`
- Avoid special characters that might cause parsing issues

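A candidate filename can be checked against its intended pattern before upload; as a sketch (assuming the patterns are regular expressions, with an example pattern shown):

```sql
-- Pre-upload sanity check: does the filename match the configured pattern?
SELECT CASE
         WHEN REGEXP_LIKE('UC_NMA_DISSEM-277740.csv', '^UC_NMA_DISSEM-\d+\.csv$')
         THEN 'MATCHES'
         ELSE 'NO MATCH'
       END AS pattern_check
FROM   dual;
```
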
## Related Procedures

The following procedures are called internally by `PROCESS_SOURCE_FILE`:

- **REGISTER_SOURCE_FILE_RECEIVED**: File registration and metadata extraction
- **CREATE_EXTERNAL_TABLE**: External table creation for data access
- **VALIDATE_SOURCE_FILE_RECEIVED**: Data validation and structure checking
- **DROP_EXTERNAL_TABLE**: Cleanup of temporary external tables
- **MOVE_FILE**: File relocation between buckets
- **SET_SOURCE_FILE_RECEIVED_STATUS**: Status management

For detailed information about individual procedures, refer to the package documentation.

## Summary

`PROCESS_SOURCE_FILE` is the cornerstone of the FILE_MANAGER system, providing a complete automated workflow for validating files and preparing them for Airflow+DBT processing pipelines. Its umbrella architecture ensures consistent file validation and preparation, while comprehensive error handling and logging provide the visibility and reliability required for enterprise file processing operations that feed downstream Airflow+DBT data workflows.