Evaluation

Evaluation (payn.Evaluation.Evaluator)

This class provides specialized metrics to assess the quality of the reliable-negative identification process. This is possible because fully labelled HTE datasets serve as ground truth. Since the primary goal of the framework is to construct a balanced training set, standard classification accuracy is insufficient; this module therefore implements negative-specific metrics and confusion matrices.

  • Negative-Specific Metrics: Isolates the subset of data that the model has labeled as reliable negatives and computes negative precision and negative recall.
    • Negative Precision measures the purity of the negative set (i.e., how many of the identified negatives are actually negative?). High negative precision is crucial because it prevents latent positives from introducing noise.
    • Negative Recall measures the coverage (i.e., what fraction of all true negatives did we successfully find?).
  • Undecisive Analysis: The module tracks the volume of undecisive data points, i.e. those discarded because their predicted probabilities fall into the zone between the reliable negatives and the known positives.
  • Safety Checks: The evaluator automatically flags discrepancies, such as "missed negatives" (true negatives that were lost during processing) or index mismatches between the input and output dataframes.

Evaluator computes dual evaluation metrics for the Spy pipeline.

It assesses performance from multiple perspectives:

  1. Overall Evaluation: Standard classification metrics based on the union of all subsets.
  2. Negative-Specific Evaluation: Performance of the reliable negative extraction.
  3. Undecisive Analysis: Quantification of discarded data.
  4. Missed Negatives: Analysis of ground-truth negatives lost during filtering.
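
For orientation, here is a minimal usage sketch with toy pandas DataFrames. It assumes the class is importable as payn.Evaluation.evaluator.Evaluator (inferred from the heading above) and uses this module's default column names ("true_bin" for the ground-truth label, "augm_bin" for the augmented prediction); the data values are purely illustrative.

import pandas as pd

from payn.Evaluation.evaluator import Evaluator  # import path assumed from the heading above

# Toy ground truth: three true negatives (0) and three true positives (1).
true_data = pd.DataFrame({"true_bin": [0, 0, 0, 1, 1, 1]})

# Pipeline output, split into the three subsets evaluate() expects.
augmen_negatives = pd.DataFrame({"true_bin": [0, 0], "augm_bin": [0, 0]}, index=[0, 1])
known_positives = pd.DataFrame({"true_bin": [1, 1, 1], "augm_bin": [1, 1, 1]}, index=[3, 4, 5])
undecisives = pd.DataFrame({"true_bin": [0], "augm_bin": [1]}, index=[2])  # one discarded point

evaluator = Evaluator(
    meta_columns={"meta_true_label_bin": "true_bin", "meta_augmented_bin": "augm_bin"},
    logger=None,   # no logger: the MLflow logging branch inside evaluate() is skipped
    fold_index=1,
)

results = evaluator.evaluate(true_data, augmen_negatives, known_positives, undecisives)
print(results["negative_specific"])  # negative_precision=1.0, negative_recall~0.67, negative_f1=0.8
print(results["undecisives"])        # {'count': 1, 'ratio': 0.166...}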

Attributes:

  • meta_columns (Dict[str, str]): Mapping of internal column names to dataset columns.
  • logger (Optional[Logger]): Logger instance for tracking results.
  • fold_index (Optional[int]): Current cross-validation fold index.
  • training_mode (bool): Flag indicating if the evaluator is running during training.

Source code in payn\Evaluation\evaluator.py
class Evaluator:
    """
    Evaluator computes dual evaluation metrics for the Spy pipeline.

    It assesses performance from multiple perspectives:
    1. Overall Evaluation: Standard classification metrics based on the union of all subsets.
    2. Negative-Specific Evaluation: Performance of the reliable negative extraction.
    3. Undecisive Analysis: Quantification of discarded data.
    4. Missed Negatives: Analysis of ground-truth negatives lost during filtering.

    Attributes:
        meta_columns (Dict[str, str]): Mapping of internal column names to dataset columns.
        logger (Optional[Logger]): Logger instance for tracking results.
        fold_index (Optional[int]): Current cross-validation fold index.
        training_mode (bool): Flag indicating if the evaluator is running during training.
    """

    def __init__(self, meta_columns: Dict[str, str], logger: Optional[Logger] = None,
                 fold_index: int = None, training_mode: bool = True) -> None:
        """
        Initialize the Evaluator.

        Args:
            meta_columns (Dict[str, str]): Dictionary mapping meta keys to dataframe columns.
            logger (Logger, optional): Logger instance.
            fold_index (int, optional): Index of the current fold.
            training_mode (bool, optional): Whether evaluation is performed during training.
        """
        self.meta_columns = meta_columns
        self.logger = logger
        self.fold_index = fold_index
        self.training_mode = training_mode

    @classmethod
    def from_config(cls, config: Dict[str, Any], logger: Optional[Logger] = None,
                    fold_index: int = 1, training_mode: bool = True) -> "Evaluator":
        """
        Alternative constructor that extracts parameters from a config object.

        Args:
            config (Dict[str, Any]): Configuration dictionary.
            logger (Logger, optional): Logger instance.
            fold_index (int, optional): Current fold index.
            training_mode (bool, optional): Training mode flag.

        Returns:
            Evaluator: An initialized Evaluator instance.
        """
        meta_columns = config.get("meta_columns", {})
        return cls(meta_columns=meta_columns, logger=logger, fold_index=fold_index, training_mode=training_mode)

    def _log(self, message: str) -> None:
        """Helper to log messages with fold context."""
        if self.logger:
            self.logger.log_message(f"Fold {self.fold_index}: {message}")

    def compute_confusion_matrix(self, y_true: np.ndarray, y_pred: np.ndarray) -> List[List[int]]:
        """
        Compute the confusion matrix.

        Args:
            y_true (np.ndarray): True labels.
            y_pred (np.ndarray): Predicted labels.

        Returns:
            List[List[int]]: Confusion matrix as a nested list (JSON serializable).
        """
        cm = confusion_matrix(y_true, y_pred)
        self._log(f"Overall Confusion Matrix: {cm}")
        # Convert to a nested Python list so the result is JSON serializable,
        # matching the documented return type.
        return cm.tolist()

    def compute_classification_metrics(self, y_true: np.ndarray, y_pred: np.ndarray) -> Dict[str, float]:
        """
        Compute standard classification metrics.

        Args:
            y_true (np.ndarray): True labels.
            y_pred (np.ndarray): Predicted labels.

        Returns:
            Dict[str, float]: Dictionary containing accuracy, precision, recall, and f1.
        """
        metrics = {
            "accuracy": np.mean(y_true == y_pred),
            "precision": precision_score(y_true, y_pred, pos_label=0),
            "recall": recall_score(y_true, y_pred, pos_label=0),
            "f1": f1_score(y_true, y_pred, pos_label=0)
        }
        self._log(f"Overall Metrics: {metrics}")
        return metrics

    def compute_negative_specific_metrics(
            self, negatives_df: pd.DataFrame, true_data: pd.DataFrame
    ) -> Dict[str, float]:
        """
        Compute metrics specifically for the subset predicted as negatives.

        Calculates:
          - Negative Precision (NPV): Fraction of predicted negatives that are truly negative.
          - Negative Recall (True Negative Rate): Fraction of true negatives captured.
          - Negative F1: Harmonic mean of negative precision and recall.

        Args:
            negatives_df (pd.DataFrame): Dataframe of predicted reliable negatives.
            true_data (pd.DataFrame): Original dataframe with ground truth.

        Returns:
            Dict[str, float]: Dictionary of negative-specific metrics.
        """
        true_label_col = self.meta_columns.get("meta_true_label_bin", "true_bin")
        total_predicted = negatives_df.shape[0]
        if total_predicted == 0:
            self._log("No datapoints in predicted negatives for evaluation.")
            return {"negative_precision": 0.0, "negative_recall": 0.0, "negative_f1": 0.0}

        TN_neg = negatives_df[negatives_df[true_label_col] == 0].shape[0]
        FP_neg = negatives_df[negatives_df[true_label_col] == 1].shape[0]
        neg_precision = TN_neg / (TN_neg + FP_neg) if (TN_neg + FP_neg) > 0 else 0.0

        total_true_negatives = true_data[true_data[true_label_col] == 0].shape[0]
        neg_recall = TN_neg / total_true_negatives if total_true_negatives > 0 else 0.0

        neg_f1 = (2 * neg_precision * neg_recall / (neg_precision + neg_recall)) if (neg_precision + neg_recall) > 0 else 0.0

        self._log(f"Negative-specific: Total predicted negatives={total_predicted}, All negatives available={total_true_negatives} , TN={TN_neg}, FP={FP_neg}, "
                  f"Negative Precision={neg_precision}, Negative Recall={neg_recall}, Negative F1={neg_f1}")
        return {"negative_precision": neg_precision, "negative_recall": neg_recall, "negative_f1": neg_f1, "negative_TN": TN_neg, "negative_FP": FP_neg, "total_negatives": total_true_negatives}

    def compute_missed_negatives(self, true_data: pd.DataFrame, predicted_union: pd.DataFrame) -> Dict[str, float]:
        """
        Compute the ratio of missed negatives (true negatives not captured in the final set).

        Args:
            true_data (pd.DataFrame): Ground truth data.
            predicted_union (pd.DataFrame): Union of all processed data subsets.

        Returns:
            Dict[str, float]: Dictionary containing the missed negative ratio.
        """
        true_label_col = self.meta_columns.get("meta_true_label_bin", "true_bin")
        total_negatives = true_data[true_data[true_label_col] == 0].shape[0]
        # Filter union for rows predicted as 0 (reliable negative)
        predicted_negatives = predicted_union[
            predicted_union[self.meta_columns.get("meta_augmented_bin", "augm_bin")] == 0]
        # Count how many of those are actually 0
        captured_negatives = predicted_negatives[predicted_negatives[true_label_col] == 0].shape[0]
        missed = total_negatives - captured_negatives
        missed_ratio = missed / total_negatives if total_negatives > 0 else 0.0
        self._log(f"Missed negatives: Total negatives={total_negatives}, Captured negatives={captured_negatives}, "
                  f"Missed negatives={missed} (Ratio: {missed_ratio})")
        return {"missed_negative_ratio": missed_ratio}

    def compute_missing_indices(self, true_data: pd.DataFrame, predicted_union: pd.DataFrame) -> List[Any]:
        """
        Identify indices present in true data but missing from predictions.

        Args:
            true_data (pd.DataFrame): Ground truth data.
            predicted_union (pd.DataFrame): Union of all processed data subsets.

        Returns:
            List[Any]: List of missing indices.
        """
        missing = list(set(true_data.index) - set(predicted_union.index))
        if missing:
            self._log(f"WARNING: {len(missing)} datapoints from true_data are missing in predicted union.")
        else:
            self._log("All datapoints from true_data are present in predicted union.")
        return missing

    def log_evaluation_summary(self, results: Dict[str, Any]) -> None:
        """
        Log the evaluation summary as a JSON artifact in MLflow.

        Args:
            results (Dict[str, Any]): The evaluation results dictionary.
        """
        try:
            evaluation_json = json.dumps(results, indent=2)
            artifact_name = f"evaluation_summary_fold_{self.fold_index}.json"
            mlflow.log_text(evaluation_json, artifact_name)
            self._log(f"Logged evaluation summary to artifact {artifact_name}")
        except Exception as e:
            self._log(f"Error logging evaluation summary: {e}")

    def log_metric_individual(self, metrics: Dict[str, float], prefix: str = "") -> None:
        """
        Log individual metrics using mlflow.log_metric.

        Args:
            metrics (Dict[str, float]): Dictionary of metrics.
            prefix (str, optional): Prefix for the metric name in MLflow.
        """
        for metric_name, value in metrics.items():
            mlflow.log_metric(f"{prefix}{metric_name}", value)
            self._log(f"Logged metric {prefix}{metric_name}: {value}")

    def evaluate(
            self,
            true_data: pd.DataFrame,
            augmen_negatives: pd.DataFrame,
            known_positives: pd.DataFrame,
            undecisives: pd.DataFrame
    ) -> Dict[str, Any]:
        """
        Perform comprehensive dual evaluation of the PU pipeline.

        1. Overall Evaluation: Merges all subsets and compares against ground truth.
        2. Negative-Specific Evaluation: Focuses on the purity of the 'reliable negative' set.
        3. Undecisive Analysis: Tracks data loss.
        4. Missed Negatives: Tracks recall loss relative to ground truth.

        Args:
            true_data (pd.DataFrame): Full training data with ground-truth labels.
            augmen_negatives (pd.DataFrame): Datapoints predicted as reliable negatives.
            known_positives (pd.DataFrame): Datapoints forced to be known positives.
            undecisives (pd.DataFrame): Datapoints labeled as "undecisive".

        Returns:
            Dict[str, Any]: A dictionary containing all computed metrics.
        """
        results: Dict[str, Any] = {}
        true_label_col = self.meta_columns.get("meta_true_label_bin", "true_bin")
        aug_pred_col = self.meta_columns.get("meta_augmented_bin", "augm_bin")

        # Create the union of all predictions (final pipeline output)
        predicted_union = pd.concat([augmen_negatives, known_positives, undecisives]).sort_index()
        self._log(f"Predicted union created with {predicted_union.shape[0]} datapoints.")

        # Determine common indices between true_data and predicted_union
        common_indices = true_data.index.intersection(predicted_union.index)
        if len(common_indices) == 0:
            self._log(
                "WARNING: No overlapping indices between true_data and predicted union; cannot compute overall metrics.")
            results["overall"] = {}
        else:
            if len(common_indices) != len(true_data) or len(common_indices) != len(predicted_union):
                self._log(
                    "WARNING: Not all indices are common between true_data and predicted union. Some datapoints may be missing.")
                results["missing_indices"] = self.compute_missing_indices(true_data, predicted_union)
            y_true = true_data.loc[common_indices, true_label_col].values
            y_pred = predicted_union.loc[common_indices, aug_pred_col].values
            overall_cm = self.compute_confusion_matrix(y_true, y_pred)
            overall_metrics = self.compute_classification_metrics(y_true, y_pred)
            results["overall"] = {"confusion_matrix": overall_cm, "metrics": overall_metrics}

        # Negative-specific evaluation (on reliable negatives only)
        results["negative_specific"] = self.compute_negative_specific_metrics(augmen_negatives, true_data)

        # Undecisive analysis: count and ratio
        undecisive_count = undecisives.shape[0]
        total_union = predicted_union.shape[0]
        undecisive_ratio = undecisive_count / total_union if total_union > 0 else 0.0
        self._log(f"Undecisives: {undecisive_count} out of {total_union} (Ratio: {undecisive_ratio})")
        results["undecisives"] = {"count": undecisive_count, "ratio": undecisive_ratio}

        # Missed negatives evaluation: comparing full true_data negatives vs. captured negatives in union
        results["missed_negatives"] = self.compute_missed_negatives(true_data, predicted_union)

        # Log evaluation summary and individual metrics using the logger
        if self.logger:
            self.log_evaluation_summary(results)
            # Log overall metrics if available
            if "overall" in results and "metrics" in results["overall"]:
                self.log_metric_individual(results["overall"]["metrics"], prefix="overall_")
            # Log negative-specific metrics
            if "negative_specific" in results:
                self.log_metric_individual(results["negative_specific"], prefix="")
            # Log undecisive metrics
            if "undecisives" in results:
                self.log_metric_individual(results["undecisives"], prefix="undecisives_")
            # Log missed negatives
            if "missed_negatives" in results:
                self.log_metric_individual(results["missed_negatives"], prefix="")

        return results

__init__(meta_columns, logger=None, fold_index=None, training_mode=True)

Initialize the Evaluator.

Parameters:

  • meta_columns (Dict[str, str]): Dictionary mapping meta keys to dataframe columns. Required.
  • logger (Logger, optional): Logger instance. Default: None.
  • fold_index (int, optional): Index of the current fold. Default: None.
  • training_mode (bool, optional): Whether evaluation is performed during training. Default: True.
Source code in payn\Evaluation\evaluator.py
def __init__(self, meta_columns: Dict[str, str], logger: Optional[Logger] = None,
             fold_index: int = None, training_mode: bool = True) -> None:
    """
    Initialize the Evaluator.

    Args:
        meta_columns (Dict[str, str]): Dictionary mapping meta keys to dataframe columns.
        logger (Logger, optional): Logger instance.
        fold_index (int, optional): Index of the current fold.
        training_mode (bool, optional): Whether evaluation is performed during training.
    """
    self.meta_columns = meta_columns
    self.logger = logger
    self.fold_index = fold_index
    self.training_mode = training_mode

compute_classification_metrics(y_true, y_pred)

Compute standard classification metrics.

Parameters:

  • y_true (np.ndarray): True labels. Required.
  • y_pred (np.ndarray): Predicted labels. Required.

Returns:

  • Dict[str, float]: Dictionary containing accuracy, precision, recall, and f1.

Source code in payn\Evaluation\evaluator.py
def compute_classification_metrics(self, y_true: np.ndarray, y_pred: np.ndarray) -> Dict[str, float]:
    """
    Compute standard classification metrics.

    Args:
        y_true (np.ndarray): True labels.
        y_pred (np.ndarray): Predicted labels.

    Returns:
        Dict[str, float]: Dictionary containing accuracy, precision, recall, and f1.
    """
    metrics = {
        "accuracy": np.mean(y_true == y_pred),
        "precision": precision_score(y_true, y_pred, pos_label=0),
        "recall": recall_score(y_true, y_pred, pos_label=0),
        "f1": f1_score(y_true, y_pred, pos_label=0)
    }
    self._log(f"Overall Metrics: {metrics}")
    return metrics
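
A small, hypothetical illustration of what these metrics mean: because precision, recall, and f1 are computed with pos_label=0, they score the negative class (the class of interest for reliable-negative extraction), not the positive one. The import path and instance construction are assumptions for the sketch.

import numpy as np

from payn.Evaluation.evaluator import Evaluator  # import path assumed

evaluator = Evaluator(meta_columns={})  # meta_columns is not used by this method

y_true = np.array([0, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0])

metrics = evaluator.compute_classification_metrics(y_true, y_pred)
# accuracy  = 4/5 = 0.8
# precision = 2/2 = 1.0   (all points predicted 0 are truly 0)
# recall    = 2/3 ~ 0.67  (two of the three true 0s were predicted 0)
# f1        = 0.8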

compute_confusion_matrix(y_true, y_pred)

Compute the confusion matrix.

Parameters:

  • y_true (np.ndarray): True labels. Required.
  • y_pred (np.ndarray): Predicted labels. Required.

Returns:

  • List[List[int]]: Confusion matrix as a nested list (JSON serializable).

Source code in payn\Evaluation\evaluator.py
def compute_confusion_matrix(self, y_true: np.ndarray, y_pred: np.ndarray) -> List[List[int]]:
    """
    Compute the confusion matrix.

    Args:
        y_true (np.ndarray): True labels.
        y_pred (np.ndarray): Predicted labels.

    Returns:
        List[List[int]]: Confusion matrix as a nested list (JSON serializable).
    """
    cm = confusion_matrix(y_true, y_pred)
    self._log(f"Overall Confusion Matrix: {cm}")
    # Convert to a nested Python list so the result is JSON serializable,
    # matching the documented return type.
    return cm.tolist()
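
A short, hypothetical call showing the layout of the returned matrix (scikit-learn convention: rows are true labels, columns are predicted labels); the import path is an assumption.

import numpy as np

from payn.Evaluation.evaluator import Evaluator  # import path assumed

evaluator = Evaluator(meta_columns={})
cm = evaluator.compute_confusion_matrix(
    y_true=np.array([0, 0, 1, 1]),
    y_pred=np.array([0, 1, 1, 1]),
)
# cm == [[1, 1],   # true 0: one kept as 0, one flipped to 1
#        [0, 2]]   # true 1: both predicted 1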

compute_missed_negatives(true_data, predicted_union)

Compute the ratio of missed negatives (true negatives not captured in the final set).

Parameters:

  • true_data (pd.DataFrame): Ground truth data. Required.
  • predicted_union (pd.DataFrame): Union of all processed data subsets. Required.

Returns:

  • Dict[str, float]: Dictionary containing the missed negative ratio.

Source code in payn\Evaluation\evaluator.py
def compute_missed_negatives(self, true_data: pd.DataFrame, predicted_union: pd.DataFrame) -> Dict[str, float]:
    """
    Compute the ratio of missed negatives (true negatives not captured in the final set).

    Args:
        true_data (pd.DataFrame): Ground truth data.
        predicted_union (pd.DataFrame): Union of all processed data subsets.

    Returns:
        Dict[str, float]: Dictionary containing the missed negative ratio.
    """
    true_label_col = self.meta_columns.get("meta_true_label_bin", "true_bin")
    total_negatives = true_data[true_data[true_label_col] == 0].shape[0]
    # Filter union for rows predicted as 0 (reliable negative)
    predicted_negatives = predicted_union[
        predicted_union[self.meta_columns.get("meta_augmented_bin", "augm_bin")] == 0]
    # Count how many of those are actually 0
    captured_negatives = predicted_negatives[predicted_negatives[true_label_col] == 0].shape[0]
    missed = total_negatives - captured_negatives
    missed_ratio = missed / total_negatives if total_negatives > 0 else 0.0
    self._log(f"Missed negatives: Total negatives={total_negatives}, Captured negatives={captured_negatives}, "
              f"Missed negatives={missed} (Ratio: {missed_ratio})")
    return {"missed_negative_ratio": missed_ratio}

compute_missing_indices(true_data, predicted_union)

Identify indices present in true data but missing from predictions.

Parameters:

  • true_data (pd.DataFrame): Ground truth data. Required.
  • predicted_union (pd.DataFrame): Union of all processed data subsets. Required.

Returns:

  • List[Any]: List of missing indices.

Source code in payn\Evaluation\evaluator.py
def compute_missing_indices(self, true_data: pd.DataFrame, predicted_union: pd.DataFrame) -> List[Any]:
    """
    Identify indices present in true data but missing from predictions.

    Args:
        true_data (pd.DataFrame): Ground truth data.
        predicted_union (pd.DataFrame): Union of all processed data subsets.

    Returns:
        List[Any]: List of missing indices.
    """
    missing = list(set(true_data.index) - set(predicted_union.index))
    if missing:
        self._log(f"WARNING: {len(missing)} datapoints from true_data are missing in predicted union.")
    else:
        self._log("All datapoints from true_data are present in predicted union.")
    return missing

compute_negative_specific_metrics(negatives_df, true_data)

Compute metrics specifically for the subset predicted as negatives.

Calculates:
  • Negative Precision (NPV): Fraction of predicted negatives that are truly negative.
  • Negative Recall (True Negative Rate): Fraction of true negatives captured.
  • Negative F1: Harmonic mean of negative precision and recall.

Parameters:

  • negatives_df (pd.DataFrame): Dataframe of predicted reliable negatives. Required.
  • true_data (pd.DataFrame): Original dataframe with ground truth. Required.

Returns:

  • Dict[str, float]: Dictionary of negative-specific metrics.

Source code in payn\Evaluation\evaluator.py
def compute_negative_specific_metrics(
        self, negatives_df: pd.DataFrame, true_data: pd.DataFrame
) -> Dict[str, float]:
    """
    Compute metrics specifically for the subset predicted as negatives.

    Calculates:
      - Negative Precision (NPV): Fraction of predicted negatives that are truly negative.
      - Negative Recall (True Negative Rate): Fraction of true negatives captured.
      - Negative F1: Harmonic mean of negative precision and recall.

    Args:
        negatives_df (pd.DataFrame): Dataframe of predicted reliable negatives.
        true_data (pd.DataFrame): Original dataframe with ground truth.

    Returns:
        Dict[str, float]: Dictionary of negative-specific metrics.
    """
    true_label_col = self.meta_columns.get("meta_true_label_bin", "true_bin")
    total_predicted = negatives_df.shape[0]
    if total_predicted == 0:
        self._log("No datapoints in predicted negatives for evaluation.")
        return {"negative_precision": 0.0, "negative_recall": 0.0, "negative_f1": 0.0}

    TN_neg = negatives_df[negatives_df[true_label_col] == 0].shape[0]
    FP_neg = negatives_df[negatives_df[true_label_col] == 1].shape[0]
    neg_precision = TN_neg / (TN_neg + FP_neg) if (TN_neg + FP_neg) > 0 else 0.0

    total_true_negatives = true_data[true_data[true_label_col] == 0].shape[0]
    neg_recall = TN_neg / total_true_negatives if total_true_negatives > 0 else 0.0

    neg_f1 = (2 * neg_precision * neg_recall / (neg_precision + neg_recall)) if (neg_precision + neg_recall) > 0 else 0.0

    self._log(f"Negative-specific: Total predicted negatives={total_predicted}, All negatives available={total_true_negatives} , TN={TN_neg}, FP={FP_neg}, "
              f"Negative Precision={neg_precision}, Negative Recall={neg_recall}, Negative F1={neg_f1}")
    return {"negative_precision": neg_precision, "negative_recall": neg_recall, "negative_f1": neg_f1, "negative_TN": TN_neg, "negative_FP": FP_neg, "total_negatives": total_true_negatives}

evaluate(true_data, augmen_negatives, known_positives, undecisives)

Perform comprehensive dual evaluation of the PU pipeline.

  1. Overall Evaluation: Merges all subsets and compares against ground truth.
  2. Negative-Specific Evaluation: Focuses on the purity of the 'reliable negative' set.
  3. Undecisive Analysis: Tracks data loss.
  4. Missed Negatives: Tracks recall loss relative to ground truth.

Parameters:

  • true_data (pd.DataFrame): Full training data with ground-truth labels. Required.
  • augmen_negatives (pd.DataFrame): Datapoints predicted as reliable negatives. Required.
  • known_positives (pd.DataFrame): Datapoints forced to be known positives. Required.
  • undecisives (pd.DataFrame): Datapoints labeled as "undecisive". Required.

Returns:

  • Dict[str, Any]: A dictionary containing all computed metrics.

Source code in payn\Evaluation\evaluator.py
def evaluate(
        self,
        true_data: pd.DataFrame,
        augmen_negatives: pd.DataFrame,
        known_positives: pd.DataFrame,
        undecisives: pd.DataFrame
) -> Dict[str, Any]:
    """
    Perform comprehensive dual evaluation of the PU pipeline.

    1. Overall Evaluation: Merges all subsets and compares against ground truth.
    2. Negative-Specific Evaluation: Focuses on the purity of the 'reliable negative' set.
    3. Undecisive Analysis: Tracks data loss.
    4. Missed Negatives: Tracks recall loss relative to ground truth.

    Args:
        true_data (pd.DataFrame): Full training data with ground-truth labels.
        augmen_negatives (pd.DataFrame): Datapoints predicted as reliable negatives.
        known_positives (pd.DataFrame): Datapoints forced to be known positives.
        undecisives (pd.DataFrame): Datapoints labeled as "undecisive".

    Returns:
        Dict[str, Any]: A dictionary containing all computed metrics.
    """
    results: Dict[str, Any] = {}
    true_label_col = self.meta_columns.get("meta_true_label_bin", "true_bin")
    aug_pred_col = self.meta_columns.get("meta_augmented_bin", "augm_bin")

    # Create the union of all predictions (final pipeline output)
    predicted_union = pd.concat([augmen_negatives, known_positives, undecisives]).sort_index()
    self._log(f"Predicted union created with {predicted_union.shape[0]} datapoints.")

    # Determine common indices between true_data and predicted_union
    common_indices = true_data.index.intersection(predicted_union.index)
    if len(common_indices) == 0:
        self._log(
            "WARNING: No overlapping indices between true_data and predicted union; cannot compute overall metrics.")
        results["overall"] = {}
    else:
        if len(common_indices) != len(true_data) or len(common_indices) != len(predicted_union):
            self._log(
                "WARNING: Not all indices are common between true_data and predicted union. Some datapoints may be missing.")
            results["missing_indices"] = self.compute_missing_indices(true_data, predicted_union)
        y_true = true_data.loc[common_indices, true_label_col].values
        y_pred = predicted_union.loc[common_indices, aug_pred_col].values
        overall_cm = self.compute_confusion_matrix(y_true, y_pred)
        overall_metrics = self.compute_classification_metrics(y_true, y_pred)
        results["overall"] = {"confusion_matrix": overall_cm, "metrics": overall_metrics}

    # Negative-specific evaluation (on reliable negatives only)
    results["negative_specific"] = self.compute_negative_specific_metrics(augmen_negatives, true_data)

    # Undecisive analysis: count and ratio
    undecisive_count = undecisives.shape[0]
    total_union = predicted_union.shape[0]
    undecisive_ratio = undecisive_count / total_union if total_union > 0 else 0.0
    self._log(f"Undecisives: {undecisive_count} out of {total_union} (Ratio: {undecisive_ratio})")
    results["undecisives"] = {"count": undecisive_count, "ratio": undecisive_ratio}

    # Missed negatives evaluation: comparing full true_data negatives vs. captured negatives in union
    results["missed_negatives"] = self.compute_missed_negatives(true_data, predicted_union)

    # Log evaluation summary and individual metrics using the logger
    if self.logger:
        self.log_evaluation_summary(results)
        # Log overall metrics if available
        if "overall" in results and "metrics" in results["overall"]:
            self.log_metric_individual(results["overall"]["metrics"], prefix="overall_")
        # Log negative-specific metrics
        if "negative_specific" in results:
            self.log_metric_individual(results["negative_specific"], prefix="")
        # Log undecisive metrics
        if "undecisives" in results:
            self.log_metric_individual(results["undecisives"], prefix="undecisives_")
        # Log missed negatives
        if "missed_negatives" in results:
            self.log_metric_individual(results["missed_negatives"], prefix="")

    return results
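
For reference, the shape of the returned dictionary, with illustrative (not real) numbers; the "missing_indices" key is only added when true_data and the predicted union do not share all indices.

results = {
    "overall": {
        "confusion_matrix": [[120, 8], [5, 200]],  # rows: true 0/1, columns: predicted 0/1
        "metrics": {"accuracy": 0.96, "precision": 0.96, "recall": 0.94, "f1": 0.95},
    },
    "negative_specific": {
        "negative_precision": 0.96, "negative_recall": 0.94, "negative_f1": 0.95,
        "negative_TN": 120, "negative_FP": 5, "total_negatives": 128,
    },
    "undecisives": {"count": 30, "ratio": 0.09},
    "missed_negatives": {"missed_negative_ratio": 0.06},
}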

from_config(config, logger=None, fold_index=1, training_mode=True) classmethod

Alternative constructor that extracts parameters from a config object.

Parameters:

  • config (Dict[str, Any]): Configuration dictionary. Required.
  • logger (Logger, optional): Logger instance. Default: None.
  • fold_index (int, optional): Current fold index. Default: 1.
  • training_mode (bool, optional): Training mode flag. Default: True.

Returns:

  • Evaluator: An initialized Evaluator instance.

Source code in payn\Evaluation\evaluator.py
@classmethod
def from_config(cls, config: Dict[str, Any], logger: Optional[Logger] = None,
                fold_index: int = 1, training_mode: bool = True) -> "Evaluator":
    """
    Alternative constructor that extracts parameters from a config object.

    Args:
        config (Dict[str, Any]): Configuration dictionary.
        logger (Logger, optional): Logger instance.
        fold_index (int, optional): Current fold index.
        training_mode (bool, optional): Training mode flag.

    Returns:
        Evaluator: An initialized Evaluator instance.
    """
    meta_columns = config.get("meta_columns", {})
    return cls(meta_columns=meta_columns, logger=logger, fold_index=fold_index, training_mode=training_mode)
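
A hedged configuration sketch; only the "meta_columns" key is read by this constructor, and the column names shown are the module defaults, used here as an assumption.

from payn.Evaluation.evaluator import Evaluator  # import path assumed

config = {
    "meta_columns": {
        "meta_true_label_bin": "true_bin",
        "meta_augmented_bin": "augm_bin",
    },
    # any other configuration keys are ignored by from_config()
}

evaluator = Evaluator.from_config(config, logger=None, fold_index=2)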

log_evaluation_summary(results)

Log the evaluation summary as a JSON artifact in MLflow.

Parameters:

  • results (Dict[str, Any]): The evaluation results dictionary. Required.
Source code in payn\Evaluation\evaluator.py
def log_evaluation_summary(self, results: Dict[str, Any]) -> None:
    """
    Log the evaluation summary as a JSON artifact in MLflow.

    Args:
        results (Dict[str, Any]): The evaluation results dictionary.
    """
    try:
        evaluation_json = json.dumps(results, indent=2)
        artifact_name = f"evaluation_summary_fold_{self.fold_index}.json"
        mlflow.log_text(evaluation_json, artifact_name)
        self._log(f"Logged evaluation summary to artifact {artifact_name}")
    except Exception as e:
        self._log(f"Error logging evaluation summary: {e}")

log_metric_individual(metrics, prefix='')

Log individual metrics using mlflow.log_metric.

Parameters:

  • metrics (Dict[str, float]): Dictionary of metrics. Required.
  • prefix (str, optional): Prefix for the metric name in MLflow. Default: "".
Source code in payn\Evaluation\evaluator.py
def log_metric_individual(self, metrics: Dict[str, float], prefix: str = "") -> None:
    """
    Log individual metrics using mlflow.log_metric.

    Args:
        metrics (Dict[str, float]): Dictionary of metrics.
        prefix (str, optional): Prefix for the metric name in MLflow.
    """
    for metric_name, value in metrics.items():
        mlflow.log_metric(f"{prefix}{metric_name}", value)
        self._log(f"Logged metric {prefix}{metric_name}: {value}")