
Data Schema

Data Schema & Integrity Verification (payn.DataSchema.DataSchema)

DataSchema serves as a template for data structures within the pipeline. It builds the expected schema (column names for features, metadata, and labels) dynamically from the configuration files provided by the user, which decouples the code from the specific column names in a dataset.

  • Dynamic Validation: The validate_dataframe function enforces schema compliance at runtime, ensuring that required columns (derived from config.yaml) are present before computationally expensive operations begin.
  • Modes: Distinguishes between training mode (requires ground truth labels) and inference mode (relaxed constraints).
  • Leakage Prevention: The verify_no_leakage and validate_split_integrity utilities enforce strict index-based checks. They verify that the training, validation, and test sets share no indices (no data leakage) and that the union of the split indices exactly matches the index of the input dataset (data conservation).
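The index-based checks described in the last bullet can be illustrated with plain Python sets. This is a minimal sketch under stated assumptions, not the PAYN implementation: the function name `check_split_integrity` and its signature are hypothetical, and the real verify_no_leakage and validate_split_integrity utilities may behave differently.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, Set


def check_split_integrity(
    all_indices: Iterable[Hashable],
    splits: Dict[str, Iterable[Hashable]],
) -> None:
    """Raise ValueError if splits overlap or do not cover the dataset exactly.

    Hypothetical helper for illustration only; not part of the PAYN API.
    """
    split_sets: Dict[str, Set[Hashable]] = {
        name: set(indices) for name, indices in splits.items()
    }

    # Leakage: no index may appear in more than one split.
    for (name_a, set_a), (name_b, set_b) in combinations(split_sets.items(), 2):
        overlap = set_a & set_b
        if overlap:
            raise ValueError(f"Leakage between {name_a} and {name_b}: {overlap}")

    # Conservation: the union of all splits must equal the input index set.
    union: Set[Hashable] = set().union(*split_sets.values())
    expected = set(all_indices)
    if union != expected:
        raise ValueError(
            f"Split union mismatch: missing={expected - union}, "
            f"unexpected={union - expected}"
        )


# A clean three-way split of six rows passes without raising.
check_split_integrity(range(6), {"train": [0, 1, 2], "val": [3], "test": [4, 5]})
```

Working on indices rather than on row contents keeps both checks cheap: each one is a set operation, independent of the number of columns.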

Centralized data contract for the PAYN pipeline.

Builds an expected schema (column names) from the configuration to ensure consistency across data loading, training, and validation steps.

Attributes:

  • config (Dict[str, Any]): The full configuration dictionary.
  • expected_columns (Set[str]): A set of column names expected to be present.

Source code in payn\DataSchema\dataschema.py
from typing import Any, Dict, List, Set


class DataSchema:
    """
    Centralized data contract for the PAYN pipeline.

    Builds an expected schema (column names) from the configuration to ensure
    consistency across data loading, training, and validation steps.

    Attributes:
        config (Dict[str, Any]): The full configuration dictionary.
        expected_columns (Set[str]): A set of column names expected to be present.
    """

    def __init__(self, config: Dict[str, Any]) -> None:
        """
        Initialize the DataSchema.

        Args:
            config: The configuration dictionary containing 'featurisation'
                and 'meta_columns' sections.
        """

        self.config = config
        self.expected_columns: Set[str] = set()
        self.build_schema()

    def build_schema(self) -> None:
        """
        Build the expected schema using the configuration.

        Reads the feature column name from the 'featurisation' section and the
        column names (values) from the 'meta_columns' section of the config,
        then populates `self.expected_columns`.
        """
        self.expected_columns = set()
        # Get the feature column (e.g., fingerprint column)
        featurisation_config = self.config.get("featurisation", {})
        fp_col = featurisation_config.get("combined_features_column_name", "FP_combined")
        self.expected_columns.add(fp_col)

        # Get meta columns from config
        meta_columns = self.config.get("meta_columns", {})
        # Only the values (the actual column names) are needed, not the keys
        self.expected_columns.update(meta_columns.values())

    def get_expected_columns(self) -> List[str]:
        """
        Return the list of expected column names.

        Returns:
            A list of column names expected in the dataframe. The order is
            not guaranteed because the names are stored in a set.
        """
        return list(self.expected_columns)

__init__(config)

Initialize the DataSchema.

Parameters:

  • config (Dict[str, Any], required): The configuration dictionary containing 'featurisation' and 'meta_columns' sections.

Source code in payn\DataSchema\dataschema.py
def __init__(self, config: Dict[str, Any]) -> None:
    """
    Initialize the DataSchema.

    Args:
        config: The configuration dictionary containing 'featurisation'
            and 'meta_columns' sections.
    """

    self.config = config
    self.expected_columns: Set[str] = set()
    self.build_schema()

build_schema()

Build the expected schema using the configuration.

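Reads the feature column name from the 'featurisation' section and the column names (values) from the 'meta_columns' section of the config, then populates self.expected_columns. If 'combined_features_column_name' is absent, the feature column defaults to "FP_combined".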

Source code in payn\DataSchema\dataschema.py
def build_schema(self) -> None:
    """
    Build the expected schema using the configuration.

    Reads the feature column name from the 'featurisation' section and the
    column names (values) from the 'meta_columns' section of the config,
    then populates `self.expected_columns`.
    """
    self.expected_columns = set()
    # Get the feature column (e.g., fingerprint column)
    featurisation_config = self.config.get("featurisation", {})
    fp_col = featurisation_config.get("combined_features_column_name", "FP_combined")
    self.expected_columns.add(fp_col)

    # Get meta columns from config
    meta_columns = self.config.get("meta_columns", {})
    # Only the values (the actual column names) are needed, not the keys
    self.expected_columns.update(meta_columns.values())
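To make the mapping from configuration to schema concrete, the following sketch mirrors the logic of build_schema as a standalone function, so it runs without the payn package. The function name `expected_columns_from_config` and the example configuration (the 'meta_columns' keys and the column names) are assumptions made for illustration; only the 'featurisation', 'combined_features_column_name', and 'meta_columns' keys and the "FP_combined" default come from the source listing.

```python
from typing import Any, Dict, Set


def expected_columns_from_config(config: Dict[str, Any]) -> Set[str]:
    """Standalone mirror of the build_schema logic, for illustration only."""
    featurisation = config.get("featurisation", {})
    # The feature column falls back to "FP_combined" when not configured.
    columns = {featurisation.get("combined_features_column_name", "FP_combined")}
    # Only the values of 'meta_columns' (the actual column names) are used.
    columns.update(config.get("meta_columns", {}).values())
    return columns


# Hypothetical config: the meta_columns entries below are invented.
example_config = {
    "featurisation": {"combined_features_column_name": "fingerprint"},
    "meta_columns": {"id": "compound_id", "label": "y_true"},
}

full = expected_columns_from_config(example_config)
defaults = expected_columns_from_config({})
print(sorted(full))      # ['compound_id', 'fingerprint', 'y_true']
print(sorted(defaults))  # ['FP_combined']
```

An empty configuration still yields a one-column schema because of the default feature column name, which is worth keeping in mind when a config section is misspelled or missing.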

get_expected_columns()

Return the list of expected column names.

Returns: A list of string column names expected in the dataframe.

Source code in payn\DataSchema\dataschema.py
def get_expected_columns(self) -> List[str]:
    """
    Return the list of expected column names.

    Returns:
        A list of column names expected in the dataframe. The order is
        not guaranteed because the names are stored in a set.
    """
    return list(self.expected_columns)
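One way the returned list could be used is to report which expected columns a dataset lacks. The sketch below is an assumption-laden illustration, not the pipeline's validate_dataframe: the helper `find_missing_columns` is hypothetical, and plain lists stand in for the output of get_expected_columns() and for a dataframe's column names. Because the list is built from a set, the helper sorts its result to give stable output.

```python
from typing import Iterable, List


def find_missing_columns(expected: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return expected column names absent from `available`, sorted for stable output.

    Hypothetical helper for illustration only; not part of the PAYN API.
    """
    return sorted(set(expected) - set(available))


# Stand-in for schema.get_expected_columns(); the names are invented.
expected = ["FP_combined", "compound_id", "y_true"]
# Stand-in for the column names of a loaded dataframe.
available = ["compound_id", "FP_combined", "smiles"]

missing = find_missing_columns(expected, available)
print(missing)  # ['y_true']
```

Extra columns such as "smiles" are ignored here; the check only concerns columns the schema expects.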