MRDS APP
The main purpose of this application is to download XML or CSV files from a source, perform basic ETL, and upload them to a target. Below is a simplified workflow of the application.
Application workflow
```mermaid
flowchart LR
    subgraph CoreApplication
        direction TB
        B[Read and validate config file] --> |If valid| C[Download source file]
        C --> D[Unzip if file is ZIP]
        D --> E[Validate source file]
        E --> |If valid| G[Start task defined in config file]
        G --> H[Build output file with selected data from source]
        H --> I[Enrich output file with metadata]
        I --> J[Upload the output file]
        J --> K[Trigger remote function]
        K --> L[Check if more tasks are available in config file]
        L --> |Yes| G
        L --> |No| M[Archive & Delete source file]
        M --> N[Finish workflow]
    end
    A[Trigger app via CLI or Airflow DAG] --> CoreApplication
```
Installation
Check out the repository and cd to the project root directory
cd python/mrds_common
Create new virtual environment using Python >=3.11
python3.11 -m venv .venv
Activate virtual environment
source .venv/bin/activate
Upgrade pip
pip install --upgrade pip
Install app
pip install .
Environment variables
There are two operating system environment variables required by the application:
BUCKET_NAMESPACE - OCI namespace where the main operating bucket is located (if not found, the default value is frcnomajoc7v)
BUCKET - main operating OCI bucket for downloading and uploading files (if not found, the default value is mrds_inbox_poc)
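The fallback behaviour described above can be sketched in Python (the variable names in this snippet are local to the example):

```python
import os

# Fall back to the documented defaults when the variables are not set
bucket_namespace = os.environ.get("BUCKET_NAMESPACE", "frcnomajoc7v")
bucket = os.environ.get("BUCKET", "mrds_inbox_poc")
```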
Usage
The application accepts two required and four optional parameters.
Parameters
| Parameter | Short Flag | Required | Default | Description |
|---|---|---|---|---|
| --workflow-context | -w | No* | None | JSON string representing the workflow context. Must contain run_id and a_workflow_history_key. |
| --generate-workflow-context | | No* | | Flag type. If provided, the app automatically generates and finalizes the workflow context. Use this if --workflow-context is not provided. |
| --source-filename | -s | Yes | None | Name of the source file to be looked up in the source inbox set in the configuration file (inbox_prefix). |
| --config-file | -c | Yes | None | Path to the YAML configuration file. Can be absolute, or relative to the current working directory. |
| --keep-source-file | | No | | Flag type. If provided, the app keeps the source file instead of archiving and deleting it. |
| --keep-tmp-dir | | No | | Flag type. If provided, the app keeps the tmp directory instead of deleting it. |

*--workflow-context and --generate-workflow-context are both optional; however, one of them MUST be provided for the application to run.
CLI
mrds-cli --workflow-context '{"run_id": "0ce35637-302c-4293-8069-3186d5d9a57d", "a_workflow_history_key": 352344}' \
--source-filename 'CSDB_Debt_Daily.ZIP' \
--config-file /home/dbt/GEORGI/projects/mrds_elt/airflow/ods/csdb/debt_daily/config/yaml/csdb_debt_daily.yaml
Python module
Import main function from core module and provide needed parameters:
from mrds.core import main
from mrds.utils.manage_runs import init_workflow, finalise_workflow
from mrds.utils.static_vars import status_success, status_failed
import datetime
import logging
import sys
# Configure logging for your needs. This is just a sample
current_time = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
log_filename = f"mrds_{current_time}.log"
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
    handlers=[
        logging.FileHandler(log_filename),
        logging.StreamHandler(sys.stdout),
    ],
)
STATUS_SUCCESS = status_success
STATUS_FAILURE = status_failed
# Run time parameters
run_id = "0ce35637-302c-4293-8069-3186d5d9a57d"
a_workflow_history_key = init_workflow(database_name='ODS', workflow_name='w_OU_C2D_UC_DISSEM', workflow_run_id=run_id)
workflow_context = {
    "run_id": run_id,
    "a_workflow_history_key": a_workflow_history_key,
}
source_filename = "CSDB_Debt_Daily.ZIP"
config_file = "/home/dbt/GEORGI/projects/mrds_elt/airflow/ods/csdb/debt_daily/config/yaml/csdb_debt_daily.yaml"
main(workflow_context, source_filename, config_file)
# implement your desired error handling logic and provide the correct status to finalise_workflow
finalise_workflow(workflow_context["a_workflow_history_key"], STATUS_SUCCESS)
Configuration
Generate workflow context
Use this if you are running the application in standalone mode. The workflow context will be generated and then finalized automatically.
Source filename
This is the source file name to be looked up in the source inbox set in the configuration file (inbox_prefix).
Workflow context
This is a JSON string (from the application's standpoint, a dictionary) containing the run_id and a_workflow_history_key values.
workflow_context = {
    "run_id": "0ce35637-302c-4293-8069-3186d5d9a57d",
    "a_workflow_history_key": 352344,
}
run_id - this represents the orchestration ID. It can be any string ID of your choice, for example an Airflow DAG ID. a_workflow_history_key - can be generated via the mrds.utils.manage_runs.init_workflow() function.
If you provide the workflow context yourself, you also need to take care of finalizing it.
Config file
This is the main place from which we control the application.
At the top are the application configurations. These apply to all tasks, are all optional, and are used to override specific runtime application settings.
# System configurations
encoding_type: cp1252 # Overrides default encoding type (utf-8) of the app. This encoding is used when reading source csv/xml files and when writing the output csv files of the app. For codec naming, follow guidelines here - https://docs.python.org/3/library/codecs.html#standard-encodings
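For illustration, this is roughly what the override means when the app reads and writes files (a minimal sketch; the file names and the copy step are invented for this example):

```python
import csv

encoding_type = "cp1252"  # value from the config file; the app default is utf-8

# Write a tiny sample "source" file in the configured codec (illustration only)
with open("source_sample.csv", "w", encoding=encoding_type, newline="") as f:
    csv.writer(f).writerows([["ISIN code"], ["XS0000000000"]])

# Read the source and write the output using the same configured codec
with open("source_sample.csv", encoding=encoding_type, newline="") as src, open(
    "output_sample.csv", "w", encoding=encoding_type, newline=""
) as out:
    csv.writer(out).writerows(csv.reader(src))
```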
After that, are the global configurations. These apply to all tasks:
# Global configurations
tmpdir: /tmp # root temporary directory in which to create the runtime temporary directory, download the source file and perform operations on it, before uploading it to the target
inbox_prefix: INBOX/C2D/UC_DISSEM # prefix for the inbox containing the source file
archive_prefix: ARCHIVE/C2D/UC_DISSEM # prefix for the archive bucket
workflow_name: w_OU_C2D_UC_DISSEM # name of the particular workflow
validation_schema_path: 'xsd/UseOfCollateralMessage.xsd' # relative path (to runtime location) to schema used to validate XML or CSV file
file_type: xml # file type of the expected source file - either CSV or XML
Next comes the list of tasks to be performed on the source file. We can have multiple tasks per file, meaning we can generate more than one output file from a single source file. One of the key configuration parameters per task is "output_columns", where we define the columns of the final output file. There are several types of columns:
xpath - used when the source file is XML. It is a standard XPath expression pointing to a path in the XML.
xpath_element_id - used when we need to identify a particular XML element. Used to create foreign keys between two separate tasks. It is a standard XPath expression pointing to a path in the XML.
csv_header - used when the source file is CSV. It points to the corresponding CSV header in the source file.
a_key - generates a key unique per row.
workflow_key - generates a key unique per run of the application.
static - allows the user to define a column with a static value.
The application respects the order of the output columns in the configuration file when generating the output file. Data and columns from the source file that are not included in the configuration file will not be present in the final output file.
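The ordering rule can be illustrated with a small sketch (the column definitions follow this README's types, but the sample data and the builder itself are invented for the example, not the app's actual code):

```python
import uuid

output_columns = [
    {"type": "a_key", "column_header": "A_KEY"},
    {"type": "workflow_key", "column_header": "A_WORKFLOW_HISTORY_KEY"},
    {"type": "csv_header", "value": "ISIN code", "column_header": "ISIN code"},
    {"type": "static", "value": "", "column_header": "FREE_TEXT"},
]

workflow_key = 352344  # unique per run of the application
source_row = {"ISIN code": "XS0000000000", "Unlisted column": "dropped"}

# Header order follows the order in the configuration file
headers = [col["column_header"] for col in output_columns]

def build_cell(col):
    if col["type"] == "a_key":
        return str(uuid.uuid4())  # unique per row
    if col["type"] == "workflow_key":
        return workflow_key
    if col["type"] == "csv_header":
        return source_row[col["value"]]  # looked up by source header name
    if col["type"] == "static":
        return col["value"]

row = [build_cell(col) for col in output_columns]
# "Unlisted column" from the source never reaches the output row
```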
Example of xml task configuration:
# List of tasks
tasks:
  - task_name: ou_lm_standing_facilities_header_create_file # name of the particular task
    ods_prefix: INBOX/LM/STANDING_FACILITIES/STANDING_FACILITIES_HEADER # prefix for the upload location
    output_table: standing_facilities_headers # table in Oracle
    namespaces:
      ns2: 'http://escb.ecb.int/sf' # XML namespace
    output_columns: # Columns in the output file, order will be respected.
      - type: 'a_key' # A_KEY type of column
        column_header: 'A_KEY' # naming of the column in the output file
      - type: 'workflow_key' # WORKFLOW_KEY type of column
        column_header: 'A_WORKFLOW_HISTORY_KEY'
      - type: 'xpath' # xpath type of column
        value: '//ns2:header/ns2:version'
        column_header: 'REV_NUMBER'
        is_key: 'N' # whether the value is transposed across the rows (Y/N). Used when there is only a single value in the source XML
      - type: 'xpath'
        value: '//ns2:header/ns2:referenceDate'
        column_header: 'REF_DATE'
        is_key: 'N'
      - type: 'static'
        value: ''
        column_header: 'FREE_TEXT'
  - task_name: ou_lm_standing_facilities_create_file
    ods_prefix: INBOX/LM/STANDING_FACILITIES/STANDING_FACILITIES
    output_table: standing_facilities
    namespaces:
      ns2: 'http://escb.ecb.int/sf'
    output_columns:
      - type: 'a_key'
        column_header: 'A_KEY'
      - type: 'workflow_key'
        column_header: 'A_SFH_FK'
      - type: 'workflow_key'
        column_header: 'A_WORKFLOW_HISTORY_KEY'
      - type: 'xpath'
        value: '//ns2:disaggregatedStandingFacilities/ns2:standingFacilities/ns2:disaggregatedStandingFacility/ns2:country'
        column_header: 'COUNTRY'
      - type: 'static'
        value: ''
        column_header: 'COMMENT_'
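As an illustration of how an xpath column resolves against the source document, here is a standard-library sketch (the sample XML and the use of xml.etree are specific to this example; the app may parse differently):

```python
import xml.etree.ElementTree as ET

namespaces = {"ns2": "http://escb.ecb.int/sf"}  # from the task configuration

# Invented sample document matching the header xpaths above
xml_doc = """<root xmlns:ns2="http://escb.ecb.int/sf">
  <ns2:header>
    <ns2:version>1.2</ns2:version>
    <ns2:referenceDate>2024-01-01</ns2:referenceDate>
  </ns2:header>
</root>"""

tree = ET.fromstring(xml_doc)

# The config's '//ns2:header/ns2:version' corresponds to a descendant
# search, which ElementTree spells './/'
rev_number = tree.find(".//ns2:header/ns2:version", namespaces).text
ref_date = tree.find(".//ns2:header/ns2:referenceDate", namespaces).text
```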
Example of CSV task configuration:
tasks:
  - task_name: ODS_CSDB_DEBT_DAILY_process_csv
    ods_prefix: ODS/CSDB/DEBT_DAILY
    output_table: DEBT_DAILY
    output_columns:
      - type: 'a_key'
        column_header: 'A_KEY'
      - type: 'workflow_key'
        column_header: 'A_WORKFLOW_HISTORY_KEY'
      - type: 'csv_header' # csv_header type of column
        value: 'Date last modified' # naming of the column in the SOURCE file
        column_header: 'Date last modified' # naming of the column in the OUTPUT file
      - type: 'csv_header'
        value: 'Extraction date'
        column_header: 'Extraction date'
      - type: 'csv_header'
        value: 'ISIN code'
        column_header: 'ISIN code'
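The effect of csv_header columns can be sketched with the standard csv module (the in-memory data is invented for this example):

```python
import csv
import io

# The three csv_header columns above keep their source header names
selected_headers = ["Date last modified", "Extraction date", "ISIN code"]

source = io.StringIO(
    "Date last modified,Extraction date,ISIN code,Unlisted column\n"
    "2024-01-01,2024-01-02,XS0000000000,dropped\n"
)

output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=selected_headers, extrasaction="ignore")
writer.writeheader()
for row in csv.DictReader(source):
    writer.writerow(row)  # only the selected headers are written, in order
```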
Development
Installing requirements
Install the app plus the dev requirements. For an easier workflow, you can install in editable mode
pip install -e .[dev]
In editable mode, instead of copying the package files to the site-packages directory, pip creates a special link that points to the source code directory. This means any changes you make to your source code will be immediately available without needing to reinstall the package.
Code formatting
Run black to reformat the code before pushing changes.
The following will reformat all files recursively from the current directory.
black .
The following will only check and report what needs to be formatted, recursively from the current directory.
black --check --diff .
Tests
Run tests with
pytest .
Tox automation
Tox automates runs of the black checks and the tests
tox