Evaluation

Evaluation (payn.Evaluation.Evaluator)

This class provides specialized metrics to assess the quality of the reliable-negative identification process. This is possible because fully labelled HTE datasets serve as ground truth. Since the primary goal of the framework is to construct a balanced training set, standard classification accuracy is insufficient; this module therefore implements negative-specific metrics and confusion matrices.

  • Negative-Specific Metrics: Isolates the subset of data that the model has labeled as reliable negatives and computes negative precision and negative recall.
    • Negative Precision measures the purity of the negative set (i.e., how many of the identified negatives are actually negative?). High negative precision is crucial because it prevents latent positives from introducing noise.
    • Negative Recall measures the coverage (i.e., what fraction of all true negatives did we successfully find?).
  • Undecisive Analysis: The module tracks the volume of undecisive data points, i.e. those discarded because their predicted probabilities fall into the zone between the reliable negatives and the known positives.
  • Safety Checks: The evaluator automatically flags discrepancies, such as "missed negatives" (true negatives that were lost during processing) or index mismatches between the input and output dataframes.

Evaluator computes dual evaluation metrics for the Spy pipeline.

It assesses performance from multiple perspectives:

  1. Overall Evaluation: Standard classification metrics based on the union of all subsets.
  2. Negative-Specific Evaluation: Performance of the reliable negative extraction.
  3. Undecisive Analysis: Quantification of discarded data.
  4. Missed Negatives: Analysis of ground-truth negatives lost during filtering.
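
For orientation, here is a minimal usage sketch with toy pandas DataFrames. It assumes the class is importable as payn.Evaluation.evaluator.Evaluator (inferred from the heading above) and uses this module's default column names ("true_bin" for the ground-truth label, "augm_bin" for the augmented prediction); the data values are purely illustrative.

import pandas as pd

from payn.Evaluation.evaluator import Evaluator  # import path assumed from the heading above

# Toy ground truth: three true negatives (0) and three true positives (1).
true_data = pd.DataFrame({"true_bin": [0, 0, 0, 1, 1, 1]})

# Pipeline output, split into the three subsets evaluate() expects.
augmen_negatives = pd.DataFrame({"true_bin": [0, 0], "augm_bin": [0, 0]}, index=[0, 1])
known_positives = pd.DataFrame({"true_bin": [1, 1, 1], "augm_bin": [1, 1, 1]}, index=[3, 4, 5])
undecisives = pd.DataFrame({"true_bin": [0], "augm_bin": [1]}, index=[2])  # one discarded point

evaluator = Evaluator(
    meta_columns={"meta_true_label_bin": "true_bin", "meta_augmented_bin": "augm_bin"},
    logger=None,   # no logger: the MLflow logging branch inside evaluate() is skipped
    fold_index=1,
)

results = evaluator.evaluate(true_data, augmen_negatives, known_positives, undecisives)
print(results["negative_specific"])  # negative_precision=1.0, negative_recall~0.67, negative_f1=0.8
print(results["undecisives"])        # {'count': 1, 'ratio': 0.166...}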

Attributes:

  • meta_columns (Dict[str, str]): Mapping of internal column names to dataset columns.
  • logger (Optional[Logger]): Logger instance for tracking results.
  • fold_index (Optional[int]): Current cross-validation fold index.
  • training_mode (bool): Flag indicating if the evaluator is running during training.

Source code in payn\Evaluation\evaluator.py
class Evaluator:
    """
    Evaluator computes dual evaluation metrics for the Spy pipeline.

    It assesses performance from multiple perspectives:
    1. Overall Evaluation: Standard classification metrics based on the union of all subsets.
    2. Negative-Specific Evaluation: Performance of the reliable negative extraction.
    3. Undecisive Analysis: Quantification of discarded data.
    4. Missed Negatives: Analysis of ground-truth negatives lost during filtering.

    Attributes:
        meta_columns (Dict[str, str]): Mapping of internal column names to dataset columns.
        logger (Optional[Logger]): Logger instance for tracking results.
        fold_index (Optional[int]): Current cross-validation fold index.
        training_mode (bool): Flag indicating if the evaluator is running during training.
    """

    def __init__(self, meta_columns: Dict[str, str], logger: Optional[Logger] = None,
                 fold_index: int = None, training_mode: bool = True) -> None:
        """
        Initialize the Evaluator.

        Args:
            meta_columns (Dict[str, str]): Dictionary mapping meta keys to dataframe columns.
            logger (Logger, optional): Logger instance.
            fold_index (int, optional): Index of the current fold.
            training_mode (bool, optional): Whether evaluation is performed during training.
        """
        self.meta_columns = meta_columns
        self.logger = logger
        self.fold_index = fold_index
        self.training_mode = training_mode

    @classmethod
    def from_config(cls, config: Dict[str, Any], logger: Optional[Logger] = None,
                    fold_index: int = 1, training_mode: bool = True) -> "Evaluator":
        """
        Alternative constructor that extracts parameters from a config object.

        Args:
            config (Dict[str, Any]): Configuration dictionary.
            logger (Logger, optional): Logger instance.
            fold_index (int, optional): Current fold index.
            training_mode (bool, optional): Training mode flag.

        Returns:
            Evaluator: An initialized Evaluator instance.
        """
        meta_columns = config.get("meta_columns", {})
        return cls(meta_columns=meta_columns, logger=logger, fold_index=fold_index, training_mode=training_mode)

    def _log(self, message: str) -> None:
        """Helper to log messages with fold context."""
        if self.logger:
            self.logger.log_message(f"Fold {self.fold_index}: {message}")

    def compute_confusion_matrix(self, y_true: np.ndarray, y_pred: np.ndarray) -> List[List[int]]:
        """
        Compute the confusion matrix.

        Args:
            y_true (np.ndarray): True labels.
            y_pred (np.ndarray): Predicted labels.

        Returns:
            List[List[int]]: Confusion matrix as a nested list (JSON serializable).
        """
        cm = confusion_matrix(y_true, y_pred)
        self._log(f"Overall Confusion Matrix: {cm}")
        # Convert to a nested Python list so the result is JSON serializable,
        # matching the documented return type.
        return cm.tolist()

    def compute_classification_metrics(self, y_true: np.ndarray, y_pred: np.ndarray) -> Dict[str, float]:
        """
        Compute standard classification metrics.

        Args:
            y_true (np.ndarray): True labels.
            y_pred (np.ndarray): Predicted labels.

        Returns:
            Dict[str, float]: Dictionary containing accuracy, precision, recall, and f1.
        """
        metrics = {
            "accuracy": np.mean(y_true == y_pred),
            "precision": precision_score(y_true, y_pred, pos_label=0),
            "recall": recall_score(y_true, y_pred, pos_label=0),
            "f1": f1_score(y_true, y_pred, pos_label=0)
        }
        self._log(f"Overall Metrics: {metrics}")
        return metrics

    def compute_negative_specific_metrics(
            self, negatives_df: pd.DataFrame, true_data: pd.DataFrame
    ) -> Dict[str, float]:
        """
        Compute metrics specifically for the subset predicted as negatives.

        Calculates:
          - Negative Precision (NPV): Fraction of predicted negatives that are truly negative.
          - Negative Recall (True Negative Rate): Fraction of true negatives captured.
          - Negative F1: Harmonic mean of negative precision and recall.

        Args:
            negatives_df (pd.DataFrame): Dataframe of predicted reliable negatives.
            true_data (pd.DataFrame): Original dataframe with ground truth.

        Returns:
            Dict[str, float]: Dictionary of negative-specific metrics.
        """
        true_label_col = self.meta_columns.get("meta_true_label_bin", "true_bin")
        total_predicted = negatives_df.shape[0]
        if total_predicted == 0:
            self._log("No datapoints in predicted negatives for evaluation.")
            return {"negative_precision": 0.0, "negative_recall": 0.0, "negative_f1": 0.0}

        TN_neg = negatives_df[negatives_df[true_label_col] == 0].shape[0]
        FP_neg = negatives_df[negatives_df[true_label_col] == 1].shape[0]
        neg_precision = TN_neg / (TN_neg + FP_neg) if (TN_neg + FP_neg) > 0 else 0.0

        total_true_negatives = true_data[true_data[true_label_col] == 0].shape[0]
        neg_recall = TN_neg / total_true_negatives if total_true_negatives > 0 else 0.0

        neg_f1 = (2 * neg_precision * neg_recall / (neg_precision + neg_recall)) if (neg_precision + neg_recall) > 0 else 0.0

        self._log(f"Negative-specific: Total predicted negatives={total_predicted}, All negatives available={total_true_negatives} , TN={TN_neg}, FP={FP_neg}, "
                  f"Negative Precision={neg_precision}, Negative Recall={neg_recall}, Negative F1={neg_f1}")
        return {"negative_precision": neg_precision, "negative_recall": neg_recall, "negative_f1": neg_f1, "negative_TN": TN_neg, "negative_FP": FP_neg, "total_negatives": total_true_negatives}

    def compute_missed_negatives(self, true_data: pd.DataFrame, predicted_union: pd.DataFrame) -> Dict[str, float]:
        """
        Compute the ratio of missed negatives (true negatives not captured in the final set).

        Args:
            true_data (pd.DataFrame): Ground truth data.
            predicted_union (pd.DataFrame): Union of all processed data subsets.

        Returns:
            Dict[str, float]: Dictionary containing the missed negative ratio.
        """
        true_label_col = self.meta_columns.get("meta_true_label_bin", "true_bin")
        total_negatives = true_data[true_data[true_label_col] == 0].shape[0]
        # Filter union for rows predicted as 0 (reliable negative)
        predicted_negatives = predicted_union[
            predicted_union[self.meta_columns.get("meta_augmented_bin", "augm_bin")] == 0]
        # Count how many of those are actually 0
        captured_negatives = predicted_negatives[predicted_negatives[true_label_col] == 0].shape[0]
        missed = total_negatives - captured_negatives
        missed_ratio = missed / total_negatives if total_negatives > 0 else 0.0
        self._log(f"Missed negatives: Total negatives={total_negatives}, Captured negatives={captured_negatives}, "
                  f"Missed negatives={missed} (Ratio: {missed_ratio})")
        return {"missed_negative_ratio": missed_ratio}

    def compute_missing_indices(self, true_data: pd.DataFrame, predicted_union: pd.DataFrame) -> List[Any]:
        """
        Identify indices present in true data but missing from predictions.

        Args:
            true_data (pd.DataFrame): Ground truth data.
            predicted_union (pd.DataFrame): Union of all processed data subsets.

        Returns:
            List[Any]: List of missing indices.
        """
        missing = list(set(true_data.index) - set(predicted_union.index))
        if missing:
            self._log(f"WARNING: {len(missing)} datapoints from true_data are missing in predicted union.")
        else:
            self._log("All datapoints from true_data are present in predicted union.")
        return missing

    def log_evaluation_summary(self, results: Dict[str, Any]) -> None:
        """
        Log the evaluation summary as a JSON artifact in MLflow.

        Args:
            results (Dict[str, Any]): The evaluation results dictionary.
        """
        try:
            evaluation_json = json.dumps(results, indent=2)
            artifact_name = f"evaluation_summary_fold_{self.fold_index}.json"
            mlflow.log_text(evaluation_json, artifact_name)
            self._log(f"Logged evaluation summary to artifact {artifact_name}")
        except Exception as e:
            self._log(f"Error logging evaluation summary: {e}")

    def log_metric_individual(self, metrics: Dict[str, float], prefix: str = "") -> None:
        """
        Log individual metrics using mlflow.log_metric.

        Args:
            metrics (Dict[str, float]): Dictionary of metrics.
            prefix (str, optional): Prefix for the metric name in MLflow.
        """
        for metric_name, value in metrics.items():
            mlflow.log_metric(f"{prefix}{metric_name}", value)
            self._log(f"Logged metric {prefix}{metric_name}: {value}")

    def evaluate(
            self,
            true_data: pd.DataFrame,
            augmen_negatives: pd.DataFrame,
            known_positives: pd.DataFrame,
            undecisives: pd.DataFrame
    ) -> Dict[str, Any]:
        """
        Perform comprehensive dual evaluation of the PU pipeline.

        1. Overall Evaluation: Merges all subsets and compares against ground truth.
        2. Negative-Specific Evaluation: Focuses on the purity of the 'reliable negative' set.
        3. Undecisive Analysis: Tracks data loss.
        4. Missed Negatives: Tracks recall loss relative to ground truth.

        Args:
            true_data (pd.DataFrame): Full training data with ground-truth labels.
            augmen_negatives (pd.DataFrame): Datapoints predicted as reliable negatives.
            known_positives (pd.DataFrame): Datapoints forced to be known positives.
            undecisives (pd.DataFrame): Datapoints labeled as "undecisive".

        Returns:
            Dict[str, Any]: A dictionary containing all computed metrics.
        """
        results: Dict[str, Any] = {}
        true_label_col = self.meta_columns.get("meta_true_label_bin", "true_bin")
        aug_pred_col = self.meta_columns.get("meta_augmented_bin", "augm_bin")

        # Create the union of all predictions (final pipeline output)
        predicted_union = pd.concat([augmen_negatives, known_positives, undecisives]).sort_index()
        self._log(f"Predicted union created with {predicted_union.shape[0]} datapoints.")

        # Determine common indices between true_data and predicted_union
        common_indices = true_data.index.intersection(predicted_union.index)
        if len(common_indices) == 0:
            self._log(
                "WARNING: No overlapping indices between true_data and predicted union; cannot compute overall metrics.")
            results["overall"] = {}
        else:
            if len(common_indices) != len(true_data) or len(common_indices) != len(predicted_union):
                self._log(
                    "WARNING: Not all indices are common between true_data and predicted union. Some datapoints may be missing.")
                results["missing_indices"] = self.compute_missing_indices(true_data, predicted_union)
            y_true = true_data.loc[common_indices, true_label_col].values
            y_pred = predicted_union.loc[common_indices, aug_pred_col].values
            overall_cm = self.compute_confusion_matrix(y_true, y_pred)
            overall_metrics = self.compute_classification_metrics(y_true, y_pred)
            results["overall"] = {"confusion_matrix": overall_cm, "metrics": overall_metrics}

        # Negative-specific evaluation (on reliable negatives only)
        results["negative_specific"] = self.compute_negative_specific_metrics(augmen_negatives, true_data)

        # Undecisive analysis: count and ratio
        undecisive_count = undecisives.shape[0]
        total_union = predicted_union.shape[0]
        undecisive_ratio = undecisive_count / total_union if total_union > 0 else 0.0
        self._log(f"Undecisives: {undecisive_count} out of {total_union} (Ratio: {undecisive_ratio})")
        results["undecisives"] = {"count": undecisive_count, "ratio": undecisive_ratio}

        # Missed negatives evaluation: comparing full true_data negatives vs. captured negatives in union
        results["missed_negatives"] = self.compute_missed_negatives(true_data, predicted_union)

        # Log evaluation summary and individual metrics using the logger
        if self.logger:
            self.log_evaluation_summary(results)
            # Log overall metrics if available
            if "overall" in results and "metrics" in results["overall"]:
                self.log_metric_individual(results["overall"]["metrics"], prefix="overall_")
            # Log negative-specific metrics
            if "negative_specific" in results:
                self.log_metric_individual(results["negative_specific"], prefix="")
            # Log undecisive metrics
            if "undecisives" in results:
                self.log_metric_individual(results["undecisives"], prefix="undecisives_")
            # Log missed negatives
            if "missed_negatives" in results:
                self.log_metric_individual(results["missed_negatives"], prefix="")

        return results

__init__(meta_columns, logger=None, fold_index=None, training_mode=True)

Initialize the Evaluator.

Parameters:

  • meta_columns (Dict[str, str]): Dictionary mapping meta keys to dataframe columns. Required.
  • logger (Logger, optional): Logger instance. Default: None.
  • fold_index (int, optional): Index of the current fold. Default: None.
  • training_mode (bool, optional): Whether evaluation is performed during training. Default: True.
Source code in payn\Evaluation\evaluator.py
def __init__(self, meta_columns: Dict[str, str], logger: Optional[Logger] = None,
             fold_index: int = None, training_mode: bool = True) -> None:
    """
    Initialize the Evaluator.

    Args:
        meta_columns (Dict[str, str]): Dictionary mapping meta keys to dataframe columns.
        logger (Logger, optional): Logger instance.
        fold_index (int, optional): Index of the current fold.
        training_mode (bool, optional): Whether evaluation is performed during training.
    """
    self.meta_columns = meta_columns
    self.logger = logger
    self.fold_index = fold_index
    self.training_mode = training_mode

compute_classification_metrics(y_true, y_pred)

Compute standard classification metrics.

Parameters:

  • y_true (np.ndarray): True labels. Required.
  • y_pred (np.ndarray): Predicted labels. Required.

Returns:

  • Dict[str, float]: Dictionary containing accuracy, precision, recall, and f1.

Source code in payn\Evaluation\evaluator.py
def compute_classification_metrics(self, y_true: np.ndarray, y_pred: np.ndarray) -> Dict[str, float]:
    """
    Compute standard classification metrics.

    Args:
        y_true (np.ndarray): True labels.
        y_pred (np.ndarray): Predicted labels.

    Returns:
        Dict[str, float]: Dictionary containing accuracy, precision, recall, and f1.
    """
    metrics = {
        "accuracy": np.mean(y_true == y_pred),
        "precision": precision_score(y_true, y_pred, pos_label=0),
        "recall": recall_score(y_true, y_pred, pos_label=0),
        "f1": f1_score(y_true, y_pred, pos_label=0)
    }
    self._log(f"Overall Metrics: {metrics}")
    return metrics
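
A small, hypothetical illustration of what these metrics mean: because precision, recall, and f1 are computed with pos_label=0, they score the negative class (the class of interest for reliable-negative extraction), not the positive one. The import path and instance construction are assumptions for the sketch.

import numpy as np

from payn.Evaluation.evaluator import Evaluator  # import path assumed

evaluator = Evaluator(meta_columns={})  # meta_columns is not used by this method

y_true = np.array([0, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0])

metrics = evaluator.compute_classification_metrics(y_true, y_pred)
# accuracy  = 4/5 = 0.8
# precision = 2/2 = 1.0   (all points predicted 0 are truly 0)
# recall    = 2/3 ~ 0.67  (two of the three true 0s were predicted 0)
# f1        = 0.8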

compute_confusion_matrix(y_true, y_pred)

Compute the confusion matrix.

Parameters:

  • y_true (np.ndarray): True labels. Required.
  • y_pred (np.ndarray): Predicted labels. Required.

Returns:

  • List[List[int]]: Confusion matrix as a nested list (JSON serializable).

Source code in payn\Evaluation\evaluator.py
def compute_confusion_matrix(self, y_true: np.ndarray, y_pred: np.ndarray) -> List[List[int]]:
    """
    Compute the confusion matrix.

    Args:
        y_true (np.ndarray): True labels.
        y_pred (np.ndarray): Predicted labels.

    Returns:
        List[List[int]]: Confusion matrix as a nested list (JSON serializable).
    """
    cm = confusion_matrix(y_true, y_pred)
    self._log(f"Overall Confusion Matrix: {cm}")
    # Convert to a nested Python list so the result is JSON serializable,
    # matching the documented return type.
    return cm.tolist()
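
A short, hypothetical call showing the layout of the returned matrix (scikit-learn convention: rows are true labels, columns are predicted labels); the import path is an assumption.

import numpy as np

from payn.Evaluation.evaluator import Evaluator  # import path assumed

evaluator = Evaluator(meta_columns={})
cm = evaluator.compute_confusion_matrix(
    y_true=np.array([0, 0, 1, 1]),
    y_pred=np.array([0, 1, 1, 1]),
)
# cm == [[1, 1],   # true 0: one kept as 0, one flipped to 1
#        [0, 2]]   # true 1: both predicted 1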

compute_missed_negatives(true_data, predicted_union)

Compute the ratio of missed negatives (true negatives not captured in the final set).

Parameters:

  • true_data (pd.DataFrame): Ground truth data. Required.
  • predicted_union (pd.DataFrame): Union of all processed data subsets. Required.

Returns:

  • Dict[str, float]: Dictionary containing the missed negative ratio.

Source code in payn\Evaluation\evaluator.py
def compute_missed_negatives(self, true_data: pd.DataFrame, predicted_union: pd.DataFrame) -> Dict[str, float]:
    """
    Compute the ratio of missed negatives (true negatives not captured in the final set).

    Args:
        true_data (pd.DataFrame): Ground truth data.
        predicted_union (pd.DataFrame): Union of all processed data subsets.

    Returns:
        Dict[str, float]: Dictionary containing the missed negative ratio.
    """
    true_label_col = self.meta_columns.get("meta_true_label_bin", "true_bin")
    total_negatives = true_data[true_data[true_label_col] == 0].shape[0]
    # Filter union for rows predicted as 0 (reliable negative)
    predicted_negatives = predicted_union[
        predicted_union[self.meta_columns.get("meta_augmented_bin", "augm_bin")] == 0]
    # Count how many of those are actually 0
    captured_negatives = predicted_negatives[predicted_negatives[true_label_col] == 0].shape[0]
    missed = total_negatives - captured_negatives
    missed_ratio = missed / total_negatives if total_negatives > 0 else 0.0
    self._log(f"Missed negatives: Total negatives={total_negatives}, Captured negatives={captured_negatives}, "
              f"Missed negatives={missed} (Ratio: {missed_ratio})")
    return {"missed_negative_ratio": missed_ratio}

compute_missing_indices(true_data, predicted_union)

Identify indices present in true data but missing from predictions.

Parameters:

  • true_data (pd.DataFrame): Ground truth data. Required.
  • predicted_union (pd.DataFrame): Union of all processed data subsets. Required.

Returns:

  • List[Any]: List of missing indices.

Source code in payn\Evaluation\evaluator.py
def compute_missing_indices(self, true_data: pd.DataFrame, predicted_union: pd.DataFrame) -> List[Any]:
    """
    Identify indices present in true data but missing from predictions.

    Args:
        true_data (pd.DataFrame): Ground truth data.
        predicted_union (pd.DataFrame): Union of all processed data subsets.

    Returns:
        List[Any]: List of missing indices.
    """
    missing = list(set(true_data.index) - set(predicted_union.index))
    if missing:
        self._log(f"WARNING: {len(missing)} datapoints from true_data are missing in predicted union.")
    else:
        self._log("All datapoints from true_data are present in predicted union.")
    return missing

compute_negative_specific_metrics(negatives_df, true_data)

Compute metrics specifically for the subset predicted as negatives.

Calculates:
  • Negative Precision (NPV): Fraction of predicted negatives that are truly negative.
  • Negative Recall (True Negative Rate): Fraction of true negatives captured.
  • Negative F1: Harmonic mean of negative precision and recall.

Parameters:

  • negatives_df (pd.DataFrame): Dataframe of predicted reliable negatives. Required.
  • true_data (pd.DataFrame): Original dataframe with ground truth. Required.

Returns:

  • Dict[str, float]: Dictionary of negative-specific metrics.

Source code in payn\Evaluation\evaluator.py
def compute_negative_specific_metrics(
        self, negatives_df: pd.DataFrame, true_data: pd.DataFrame
) -> Dict[str, float]:
    """
    Compute metrics specifically for the subset predicted as negatives.

    Calculates:
      - Negative Precision (NPV): Fraction of predicted negatives that are truly negative.
      - Negative Recall (True Negative Rate): Fraction of true negatives captured.
      - Negative F1: Harmonic mean of negative precision and recall.

    Args:
        negatives_df (pd.DataFrame): Dataframe of predicted reliable negatives.
        true_data (pd.DataFrame): Original dataframe with ground truth.

    Returns:
        Dict[str, float]: Dictionary of negative-specific metrics.
    """
    true_label_col = self.meta_columns.get("meta_true_label_bin", "true_bin")
    total_predicted = negatives_df.shape[0]
    if total_predicted == 0:
        self._log("No datapoints in predicted negatives for evaluation.")
        return {"negative_precision": 0.0, "negative_recall": 0.0, "negative_f1": 0.0}

    TN_neg = negatives_df[negatives_df[true_label_col] == 0].shape[0]
    FP_neg = negatives_df[negatives_df[true_label_col] == 1].shape[0]
    neg_precision = TN_neg / (TN_neg + FP_neg) if (TN_neg + FP_neg) > 0 else 0.0

    total_true_negatives = true_data[true_data[true_label_col] == 0].shape[0]
    neg_recall = TN_neg / total_true_negatives if total_true_negatives > 0 else 0.0

    neg_f1 = (2 * neg_precision * neg_recall / (neg_precision + neg_recall)) if (neg_precision + neg_recall) > 0 else 0.0

    self._log(f"Negative-specific: Total predicted negatives={total_predicted}, All negatives available={total_true_negatives} , TN={TN_neg}, FP={FP_neg}, "
              f"Negative Precision={neg_precision}, Negative Recall={neg_recall}, Negative F1={neg_f1}")
    return {"negative_precision": neg_precision, "negative_recall": neg_recall, "negative_f1": neg_f1, "negative_TN": TN_neg, "negative_FP": FP_neg, "total_negatives": total_true_negatives}

evaluate(true_data, augmen_negatives, known_positives, undecisives)

Perform comprehensive dual evaluation of the PU pipeline.

  1. Overall Evaluation: Merges all subsets and compares against ground truth.
  2. Negative-Specific Evaluation: Focuses on the purity of the 'reliable negative' set.
  3. Undecisive Analysis: Tracks data loss.
  4. Missed Negatives: Tracks recall loss relative to ground truth.

Parameters:

  • true_data (pd.DataFrame): Full training data with ground-truth labels. Required.
  • augmen_negatives (pd.DataFrame): Datapoints predicted as reliable negatives. Required.
  • known_positives (pd.DataFrame): Datapoints forced to be known positives. Required.
  • undecisives (pd.DataFrame): Datapoints labeled as "undecisive". Required.

Returns:

  • Dict[str, Any]: A dictionary containing all computed metrics.

Source code in payn\Evaluation\evaluator.py
def evaluate(
        self,
        true_data: pd.DataFrame,
        augmen_negatives: pd.DataFrame,
        known_positives: pd.DataFrame,
        undecisives: pd.DataFrame
) -> Dict[str, Any]:
    """
    Perform comprehensive dual evaluation of the PU pipeline.

    1. Overall Evaluation: Merges all subsets and compares against ground truth.
    2. Negative-Specific Evaluation: Focuses on the purity of the 'reliable negative' set.
    3. Undecisive Analysis: Tracks data loss.
    4. Missed Negatives: Tracks recall loss relative to ground truth.

    Args:
        true_data (pd.DataFrame): Full training data with ground-truth labels.
        augmen_negatives (pd.DataFrame): Datapoints predicted as reliable negatives.
        known_positives (pd.DataFrame): Datapoints forced to be known positives.
        undecisives (pd.DataFrame): Datapoints labeled as "undecisive".

    Returns:
        Dict[str, Any]: A dictionary containing all computed metrics.
    """
    results: Dict[str, Any] = {}
    true_label_col = self.meta_columns.get("meta_true_label_bin", "true_bin")
    aug_pred_col = self.meta_columns.get("meta_augmented_bin", "augm_bin")

    # Create the union of all predictions (final pipeline output)
    predicted_union = pd.concat([augmen_negatives, known_positives, undecisives]).sort_index()
    self._log(f"Predicted union created with {predicted_union.shape[0]} datapoints.")

    # Determine common indices between true_data and predicted_union
    common_indices = true_data.index.intersection(predicted_union.index)
    if len(common_indices) == 0:
        self._log(
            "WARNING: No overlapping indices between true_data and predicted union; cannot compute overall metrics.")
        results["overall"] = {}
    else:
        if len(common_indices) != len(true_data) or len(common_indices) != len(predicted_union):
            self._log(
                "WARNING: Not all indices are common between true_data and predicted union. Some datapoints may be missing.")
            results["missing_indices"] = self.compute_missing_indices(true_data, predicted_union)
        y_true = true_data.loc[common_indices, true_label_col].values
        y_pred = predicted_union.loc[common_indices, aug_pred_col].values
        overall_cm = self.compute_confusion_matrix(y_true, y_pred)
        overall_metrics = self.compute_classification_metrics(y_true, y_pred)
        results["overall"] = {"confusion_matrix": overall_cm, "metrics": overall_metrics}

    # Negative-specific evaluation (on reliable negatives only)
    results["negative_specific"] = self.compute_negative_specific_metrics(augmen_negatives, true_data)

    # Undecisive analysis: count and ratio
    undecisive_count = undecisives.shape[0]
    total_union = predicted_union.shape[0]
    undecisive_ratio = undecisive_count / total_union if total_union > 0 else 0.0
    self._log(f"Undecisives: {undecisive_count} out of {total_union} (Ratio: {undecisive_ratio})")
    results["undecisives"] = {"count": undecisive_count, "ratio": undecisive_ratio}

    # Missed negatives evaluation: comparing full true_data negatives vs. captured negatives in union
    results["missed_negatives"] = self.compute_missed_negatives(true_data, predicted_union)

    # Log evaluation summary and individual metrics using the logger
    if self.logger:
        self.log_evaluation_summary(results)
        # Log overall metrics if available
        if "overall" in results and "metrics" in results["overall"]:
            self.log_metric_individual(results["overall"]["metrics"], prefix="overall_")
        # Log negative-specific metrics
        if "negative_specific" in results:
            self.log_metric_individual(results["negative_specific"], prefix="")
        # Log undecisive metrics
        if "undecisives" in results:
            self.log_metric_individual(results["undecisives"], prefix="undecisives_")
        # Log missed negatives
        if "missed_negatives" in results:
            self.log_metric_individual(results["missed_negatives"], prefix="")

    return results
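
For reference, the shape of the returned dictionary, with illustrative (not real) numbers; the "missing_indices" key is only added when true_data and the predicted union do not share all indices.

results = {
    "overall": {
        "confusion_matrix": [[120, 8], [5, 200]],  # rows: true 0/1, columns: predicted 0/1
        "metrics": {"accuracy": 0.96, "precision": 0.96, "recall": 0.94, "f1": 0.95},
    },
    "negative_specific": {
        "negative_precision": 0.96, "negative_recall": 0.94, "negative_f1": 0.95,
        "negative_TN": 120, "negative_FP": 5, "total_negatives": 128,
    },
    "undecisives": {"count": 30, "ratio": 0.09},
    "missed_negatives": {"missed_negative_ratio": 0.06},
}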

from_config(config, logger=None, fold_index=1, training_mode=True) classmethod

Alternative constructor that extracts parameters from a config object.

Parameters:

  • config (Dict[str, Any]): Configuration dictionary. Required.
  • logger (Logger, optional): Logger instance. Default: None.
  • fold_index (int, optional): Current fold index. Default: 1.
  • training_mode (bool, optional): Training mode flag. Default: True.

Returns:

  • Evaluator: An initialized Evaluator instance.

Source code in payn\Evaluation\evaluator.py
@classmethod
def from_config(cls, config: Dict[str, Any], logger: Optional[Logger] = None,
                fold_index: int = 1, training_mode: bool = True) -> "Evaluator":
    """
    Alternative constructor that extracts parameters from a config object.

    Args:
        config (Dict[str, Any]): Configuration dictionary.
        logger (Logger, optional): Logger instance.
        fold_index (int, optional): Current fold index.
        training_mode (bool, optional): Training mode flag.

    Returns:
        Evaluator: An initialized Evaluator instance.
    """
    meta_columns = config.get("meta_columns", {})
    return cls(meta_columns=meta_columns, logger=logger, fold_index=fold_index, training_mode=training_mode)
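
A hedged configuration sketch; only the "meta_columns" key is read by this constructor, and the column names shown are the module defaults, used here as an assumption.

from payn.Evaluation.evaluator import Evaluator  # import path assumed

config = {
    "meta_columns": {
        "meta_true_label_bin": "true_bin",
        "meta_augmented_bin": "augm_bin",
    },
    # any other configuration keys are ignored by from_config()
}

evaluator = Evaluator.from_config(config, logger=None, fold_index=2)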

log_evaluation_summary(results)

Log the evaluation summary as a JSON artifact in MLflow.

Parameters:

  • results (Dict[str, Any]): The evaluation results dictionary. Required.
Source code in payn\Evaluation\evaluator.py
def log_evaluation_summary(self, results: Dict[str, Any]) -> None:
    """
    Log the evaluation summary as a JSON artifact in MLflow.

    Args:
        results (Dict[str, Any]): The evaluation results dictionary.
    """
    try:
        evaluation_json = json.dumps(results, indent=2)
        artifact_name = f"evaluation_summary_fold_{self.fold_index}.json"
        mlflow.log_text(evaluation_json, artifact_name)
        self._log(f"Logged evaluation summary to artifact {artifact_name}")
    except Exception as e:
        self._log(f"Error logging evaluation summary: {e}")

log_metric_individual(metrics, prefix='')

Log individual metrics using mlflow.log_metric.

Parameters:

  • metrics (Dict[str, float]): Dictionary of metrics. Required.
  • prefix (str, optional): Prefix for the metric name in MLflow. Default: "".
Source code in payn\Evaluation\evaluator.py
def log_metric_individual(self, metrics: Dict[str, float], prefix: str = "") -> None:
    """
    Log individual metrics using mlflow.log_metric.

    Args:
        metrics (Dict[str, float]): Dictionary of metrics.
        prefix (str, optional): Prefix for the metric name in MLflow.
    """
    for metric_name, value in metrics.items():
        mlflow.log_metric(f"{prefix}{metric_name}", value)
        self._log(f"Logged metric {prefix}{metric_name}: {value}")