Data Schema
Data Schema & Integrity Verification (payn.DataSchema.DataSchema)
Serves as a template for data structures within the pipeline. It dynamically creates the expected schema (column names for features, metadata, and labels) based on the configuration files provided by the user and decouples the code from specific column names in the dataset.
- Dynamic Validation: The `validate_dataframe` function enforces schema compliance at runtime, ensuring that required columns (derived from `config.yaml`) are present before computationally expensive operations begin.
- Modes: Distinguishes between training mode (requires ground truth labels) and inference mode (relaxed constraints).
- Leakage Prevention: The `verify_no_leakage` and `validate_split_integrity` utilities enforce strict index-based checks. They mathematically verify that training, validation, and test sets share no overlapping indices (data leakage) and that the union of split indices exactly matches the input dataset (data conservation).
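The two index-based checks described in the last bullet can be illustrated with plain set operations. The following is a minimal sketch, not the PAYN implementation: the function name `check_split_integrity` and its signature are hypothetical, and integer indices are assumed.

```python
from typing import Iterable


def check_split_integrity(
    all_indices: Iterable[int],
    train: Iterable[int],
    val: Iterable[int],
    test: Iterable[int],
) -> None:
    """Hypothetical helper: raise ValueError if splits overlap or do not cover the data."""
    train_set, val_set, test_set = set(train), set(val), set(test)

    # Leakage check: no index may appear in more than one split.
    overlaps = {
        "train/val": train_set & val_set,
        "train/test": train_set & test_set,
        "val/test": val_set & test_set,
    }
    for pair, shared in overlaps.items():
        if shared:
            raise ValueError(f"Data leakage between {pair}: {sorted(shared)}")

    # Conservation check: the union of the splits must equal the input index set.
    union = train_set | val_set | test_set
    full = set(all_indices)
    if union != full:
        raise ValueError(
            "Split union differs from dataset: "
            f"missing={sorted(full - union)}, extra={sorted(union - full)}"
        )


# A clean 3/1/2 split of six rows passes silently.
check_split_integrity(range(6), [0, 1, 2], [3], [4, 5])
```

Working on indices rather than row contents keeps both checks cheap and independent of the feature columns.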
Centralized data contract for the PAYN pipeline.
Builds an expected schema (column names) from the configuration to ensure consistency across data loading, training, and validation steps.
Attributes:
| Name | Type | Description |
|---|---|---|
| `config` | `Dict[str, Any]` | The full configuration dictionary. |
| `expected_columns` | `Set[str]` | A set of column names expected to be present. |
Source code in payn\DataSchema\dataschema.py
__init__(config)
Initialize the DataSchema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `Dict[str, Any]` | The configuration dictionary containing 'featurisation' and 'meta_columns' sections. | required |
build_schema()
Build the expected schema using the configuration.
Extracts keys from the 'featurisation' and 'meta_columns' sections of the config and populates `self.expected_columns`.
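As an illustration of how a schema might be derived from the two config sections, here is a minimal sketch. The page does not show the layout of these sections, so the sketch assumes each one maps a logical role to an actual column name; the function `build_expected_columns` and the example keys (`smiles_column`, `id_column`, `label_column`) are hypothetical and may not match PAYN's real configuration.

```python
from typing import Any, Dict, Set


def build_expected_columns(config: Dict[str, Any]) -> Set[str]:
    """Hypothetical helper: collect column names from the two config sections."""
    expected: Set[str] = set()
    for section in ("featurisation", "meta_columns"):
        # Assumption: a missing or empty section contributes no columns.
        mapping = config.get(section) or {}
        # Assumption: values are the column names, keys are logical roles.
        expected.update(str(name) for name in mapping.values() if name)
    return expected


# Illustrative config only; the keys and column names are made up.
example_config = {
    "featurisation": {"smiles_column": "SMILES"},
    "meta_columns": {"id_column": "compound_id", "label_column": "label"},
}
print(sorted(build_expected_columns(example_config)))
```

Because the column names live in the config, renaming a column in the dataset only requires a config change, which is the decoupling the class description refers to.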
get_expected_columns()
Return the list of expected column names.
Returns: A list of string column names expected in the dataframe.
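The expected column names can then be compared against the columns actually present, with the label requirement depending on the mode. The sketch below is an assumption-laden illustration, not the `validate_dataframe` implementation: `find_missing_columns`, its parameters, and the mode strings `"training"` and `"inference"` are hypothetical.

```python
from typing import Iterable, List, Optional, Set


def find_missing_columns(
    columns: Iterable[str],
    expected: Set[str],
    label_column: Optional[str] = None,
    mode: str = "training",
) -> List[str]:
    """Hypothetical helper: return expected columns absent from `columns`, sorted."""
    if mode not in ("training", "inference"):
        raise ValueError(f"Unknown mode: {mode!r}")

    required = set(expected)
    if label_column is not None:
        if mode == "training":
            # Training needs ground truth labels.
            required.add(label_column)
        else:
            # Inference relaxes the constraint: labels are optional.
            required.discard(label_column)

    return sorted(required - set(columns))


expected = {"SMILES", "compound_id", "label"}
present = ["SMILES", "compound_id"]  # e.g. the column labels of a loaded table

print(find_missing_columns(present, expected, "label", mode="training"))
print(find_missing_columns(present, expected, "label", mode="inference"))
```

Returning the missing names instead of a boolean lets the caller report exactly which columns are absent before any expensive step starts.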