Augmentation Models

This module contains the core logic for the positive-unlabeled (PU) learning process: the Spy Model classifier and the decision engine for identifying reliable negatives.

Spy Model (payn.AugmentationModels.SpyModel.SpyModel)

Wraps a CatBoostClassifier, selected for its native handling of categorical features and its robust performance on tabular chemical data without extensive preprocessing. Other model architectures can be used as well, provided a class probability score can be calculated or estimated.

The spy model is trained on the spy_infused_training_data (from payn.SpySplitting) to distinguish between "known positives" (s = 1) and the "unlabeled/spy mixture" (s = 0).

  • Categorical Handling: Unless categorical_column_indices is supplied, the model infers the categorical features from the first row of the feature vectors by counting string-typed values. It assumes that these values (e.g., metadata tags) are appended to the end of the feature vector and passes their indices to CatBoost as cat_features.
  • Parallelisation: Reads SLURM_CPUS_PER_TASK to set the thread count (thread_count) on SLURM clusters. When the variable is unset, thread_count falls back to -1, which lets CatBoost use all available cores.
  • Determinism: The random seed is taken from the global config (general.random_seed) and passed to the CatBoost engine as random_state.
  • Logging: SpyModel is tightly coupled with the payn.Logging system. When a logger is supplied, it logs hyperparameters, the trained model artifact, and evaluation metrics (if a test set is given) to the MLflow run immediately after training.
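
The trailing-categorical convention described in the list above can be sketched in isolation. The helper name `infer_trailing_categorical_indices` and the example row are hypothetical and not part of payn; the sketch assumes categorical values are stored as strings at the end of each feature vector.

```python
from itertools import takewhile
from typing import List, Sequence


def infer_trailing_categorical_indices(row: Sequence[object]) -> List[int]:
    """Return the indices of the string-typed values at the end of a feature vector.

    Hypothetical helper for illustration only; it assumes categorical features
    are strings appended after all numeric features.
    """
    # Walk the row backwards and stop at the first non-string value.
    trailing = sum(1 for _ in takewhile(lambda v: isinstance(v, str), reversed(row)))
    return list(range(len(row) - trailing, len(row)))


# Made-up feature vector: four fingerprint bits, one float, two string tags.
example_row = [0, 1, 1, 0, 0.37, "solvent_A", "batch_7"]
print(infer_trailing_categorical_indices(example_row))  # [5, 6]
```

The SpyModel source shown on this page counts every string in the first row rather than only the trailing run; both approaches give the same indices whenever the layout assumption holds.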

SpyModel encapsulates the CatBoostClassifier used in the Spy-based learning step.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| config_key | str | The key in the config dict relevant to SpyModel. |
| logger | Optional[Logger] | Logger instance for logging model training and evaluation. |
| fold_index | int | Index of the current fold (for cross-validation purposes). |
| random_state | int | Random seed. |
| eval_metric | str | Evaluation metric to use. |
| verbose | int | Verbosity level. |
| model | Optional[CatBoostClassifier] | The trained CatBoost model. |
| feature_column_name | Optional[str] | Column name containing feature vectors. |
| training_target_column_name | Optional[str] | Target column name for training data. |
| validation_target_column_name | Optional[str] | Target column name for validation data. |
| metrics_list | Optional[List[str]] | List of additional metrics to evaluate. |
| categorical_column_indices | List[int] | Indices of features identified as categorical. |
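
As a rough illustration of what feature_column_name refers to: the source on this page expands a single column of list-valued feature vectors into one column per feature position before building the CatBoost Pool. The column names and values below are made up for the example.

```python
import pandas as pd

# Hypothetical column names and values; payn reads the real names from its config.
frame = pd.DataFrame(
    {
        "features": [[0, 1, 0.5, "A"], [1, 1, 0.1, "B"], [0, 0, 0.9, "A"]],
        "label": [1, 0, 0],
    }
)

# Expand each list-valued cell into one column per feature position,
# mirroring the `pd.DataFrame(data[feature_column].to_list())` call in the source.
expanded = pd.DataFrame(frame["features"].to_list())
print(expanded.shape)  # (3, 4)
```

Every row is expected to hold a vector of the same length, since the categorical indices inferred from the first row are applied to all rows.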

Source code in payn\AugmentationModels\SpyModel\spymodel.py
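# The file's own import lines are not shown on this page; this excerpt needs at least:
import os
from typing import Any, Dict, List, Optional

import pandas as pd
from catboost import CatBoostClassifier, Pool

# `Logger` is payn's logging wrapper; its import path is not shown in this excerpt.


class SpyModel:
    """
    SpyModel encapsulates the CatBoostClassifier used in the Spy-based learning step.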

    Attributes:
        config_key (str): The key in the config dict relevant to SpyModel.
        logger (Optional[Logger]): Logger instance for logging model training and evaluation.
        fold_index (int): Index of the current fold (for cross-validation purposes).
        random_state (int): Random seed.
        eval_metric (str): Evaluation metric to use.
        verbose (int): Verbosity level.
        model (Optional[CatBoostClassifier]): The trained CatBoost model.
        feature_column_name (Optional[str]): Column name containing feature vectors.
        training_target_column_name (Optional[str]): Target column name for training data.
        validation_target_column_name (Optional[str]): Target column name for validation data.
        metrics_list (Optional[List[str]]): List of additional metrics to evaluate.
        categorical_column_indices (List[int]): Indices of features identified as categorical.
    """

    config_key = "spy_model"

    def __init__(self, eval_metric: str, random_state: int, verbose: int, fold_index: int = 1, logger: Optional[Logger] = None,
                 feature_column_name: Optional[str] = None, training_target_column_name: Optional[str] = None,
                 validation_target_column_name: Optional[str] = None,
                 metrics_list: Optional[List[str]] = None, categorical_column_indices: Optional[List[int]] = None):
        """
        Initialize the SpyModel class.

        You can either pass a config dict via the alternative constructor `from_config` or pass parameters explicitly.

        Args:
            eval_metric (str): Metric for evaluation of model performance.
            random_state (int): Random seed.
            verbose (int): Verbosity level for CatBoost output.
            fold_index (int, optional): Index of the current fold. Defaults to 1.
            logger (Logger, optional): Logger instance for logging.
            feature_column_name (str, optional): Column name containing feature vectors.
            training_target_column_name (str, optional): Target column name for training data.
            validation_target_column_name (str, optional): Target column name for validation data.
            metrics_list (List[str], optional): List of metrics to evaluate.
            categorical_column_indices (List[int], optional): Indices of features that are categorical.
        """

        self.eval_metric = eval_metric
        self.random_state = random_state
        self.fold_index = fold_index
        self.verbose = verbose
        self.logger = logger
        self.model: Optional[CatBoostClassifier] = None

        # Optional parameters for training; if not provided, they can be set later.
        self.feature_column_name = feature_column_name
        self.training_target_column_name = training_target_column_name
        self.validation_target_column_name = validation_target_column_name
        self.metrics_list = metrics_list
        self.categorical_column_indices = categorical_column_indices or []



    @classmethod
    def from_config(cls, config: Dict[str, Any], logger: Optional[Logger] = None, fold_index: int = 1) -> "SpyModel":
        """
        Alternative constructor that creates a SpyModel instance from a config object.

        Args:
            config (dict): Configuration dictionary.
            logger (Logger, optional): Logger instance.
            fold_index (int): Current fold index.

        Returns:
            SpyModel: An instance of SpyModel with parameters extracted from the config.
        """
        return cls(
            logger=logger,
            fold_index=fold_index,
            random_state = config["general"]["random_seed"],
            eval_metric = config["spy_model"]["eval_metric"],
            verbose = config["general"]["verbose"],
            feature_column_name=config["featurisation"]["combined_features_column_name"],
            training_target_column_name=config["spy_model"]["training_target_column_name"],
            validation_target_column_name=config["spy_model"]["validation_target_column_name"],
            metrics_list = config["spy_model"]["all_metrics"]
        )

    def _prepare_pool(self, data: pd.DataFrame, label_column: Optional[str], feature_column: str) -> Pool:
        """
        Prepare a CatBoost Pool from the dataset.

        Args:
            data (pd.DataFrame): Dataset containing features and target labels.
            label_column (str, optional): Column name for target labels. If None, pool is created without labels.
            feature_column (str): Column name for features.

        Returns:
            Pool: A CatBoost Pool object.

        Raises:
            ValueError: If pool preparation fails.
        """
        try:
            # Expand the list of features into a DataFrame
            pool_data = pd.DataFrame(data[feature_column].to_list())

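            # Dynamically infer and store categorical indices if not already set.
            # Only the first row is inspected: every string value in it is counted,
            # and that many trailing positions are treated as categorical.
            # Warning: Assumes categorical features are appended at the END of the vector.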
            if not hasattr(self, "categorical_column_indices") or not self.categorical_column_indices:
                row_sample = data[feature_column].iloc[0]
                cat_count = sum(isinstance(v, str) for v in reversed(row_sample))
                self.categorical_column_indices = list(range(len(row_sample) - cat_count, len(row_sample)))

            return Pool(
                data=pool_data,
                label=data[label_column] if label_column is not None else None,
                cat_features=self.categorical_column_indices
            )

        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback.
            raise ValueError(f"Error preparing CatBoost Pool: {e}") from e


    def train(
        self,
        train_data: pd.DataFrame,
        val_data: pd.DataFrame,
        test_data: Optional[pd.DataFrame] = None,
        feature_column: Optional[str] = None,
        training_label_column: Optional[str] = None,
        validation_label_column: Optional[str] = None,
        **kwargs: Any
    ) -> CatBoostClassifier:
        """
        Train the Spy model on the given datasets.

        Args:
            train_data (pd.DataFrame): Training dataset with features and target labels.
            val_data (pd.DataFrame): Validation dataset for monitoring training progress.
            test_data (Optional[pd.DataFrame]): Optional test dataset for evaluation (default: None).
            feature_column (Optional[str]): Column name for features.
            training_label_column (Optional[str]): Column name for target labels in training data.
            validation_label_column (Optional[str]): Column name for target labels in validation data.
            **kwargs: Additional hyperparameters (overriding defaults).

        Returns:
            CatBoostClassifier: Trained CatBoost model.
        """
        feature_column = feature_column or self.feature_column_name
        training_label_column = training_label_column or self.training_target_column_name
        validation_label_column = validation_label_column or self.validation_target_column_name

        if isinstance(test_data, pd.DataFrame):
            test_label_column = validation_label_column
            test_pool = self._prepare_pool(test_data, test_label_column, feature_column)


        train_pool = self._prepare_pool(train_data, training_label_column, feature_column)
        val_pool = self._prepare_pool(val_data, validation_label_column, feature_column)

        # Support for SLURM multi-threading: use the allocated CPU count if set;
        # otherwise -1 lets CatBoost use all available cores.
        n_threads = int(os.getenv("SLURM_CPUS_PER_TASK", "-1"))

        # Combine default parameters with kwargs (kwargs take precedence)
        combined_params = {
            "random_state": self.random_state,
            "eval_metric": self.eval_metric,
            "verbose": self.verbose,
            "thread_count": n_threads,
            **kwargs,  # Override defaults with values from kwargs
        }

        # Define CatBoost model
        self.model = CatBoostClassifier(**combined_params)

        # Log user-provided and default hyperparameters
        if self.logger:
            self.logger.log_model_hyperparameters(self.model, **combined_params)

        # Create the MLflow callback
        # mlflow_callback = MLflowCatBoostCallback(eval_metric=self.eval_metric, logger=self.logger)

        # Train the model
        self.model.fit(train_pool, eval_set=val_pool, use_best_model=True) # , callbacks=[mlflow_callback]

        # Log the trained model and evaluation results
        if self.logger:
            self.logger.log_model(self.model, f"spy_model_fold_{self.fold_index}")
            # Log additional model attributes
            self.logger.log_model_attributes(self.model)
            if isinstance(test_data, pd.DataFrame):
                test_results = self.evaluate(test_pool)
                self.logger.log_evaluation_metrics(test_results)

        return self.model

    def evaluate(self, test_pool: Pool) -> Dict[str, Any]:
        """
        Evaluate the trained model on a test dataset.

        Args:
            test_pool (Pool): CatBoost Pool containing test features and labels.

        Returns:
            Dict[str, Any]: Dictionary of evaluation metrics.

        Raises:
            ValueError: If the model has not been trained yet.
        """
        if not self.model:
            raise ValueError("Model has not been trained yet. Call `train` before evaluating.")

        eval_results = self.model.eval_metrics(
            data=test_pool,
            metrics=self.metrics_list
        )
        return eval_results

    def predict(self, data: pd.DataFrame, feature_column: Optional[str] = None) -> pd.Series:
        """
        Make predictions using the trained Spy model.

        Args:
            data (pd.DataFrame): Dataset containing features for prediction.
            feature_column (str, optional): Column name for features.

        Returns:
            pd.Series: Predicted class labels (CatBoost's default prediction type).
        """
        feature_column = feature_column or self.feature_column_name
        if not self.model:
            raise ValueError("Model has not been trained yet. Call `train` before making predictions.")

        data_pool = self._prepare_pool(data, label_column=None, feature_column=feature_column)
        return pd.Series(self.model.predict(data_pool))

__init__(eval_metric, random_state, verbose, fold_index=1, logger=None, feature_column_name=None, training_target_column_name=None, validation_target_column_name=None, metrics_list=None, categorical_column_indices=None)

Initialize the SpyModel class.

You can either pass a config dict via the alternative constructor from_config or pass parameters explicitly.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| eval_metric | str | Metric for evaluation of model performance. | required |
| random_state | int | Random seed. | required |
| verbose | int | Verbosity level for CatBoost output. | required |
| fold_index | int | Index of the current fold. | 1 |
| logger | Logger | Logger instance for logging. | None |
| feature_column_name | str | Column name containing feature vectors. | None |
| training_target_column_name | str | Target column name for training data. | None |
| validation_target_column_name | str | Target column name for validation data. | None |
| metrics_list | List[str] | List of metrics to evaluate. | None |
| categorical_column_indices | List[int] | Indices of features that are categorical. | None |

Source code in payn\AugmentationModels\SpyModel\spymodel.py
def __init__(self, eval_metric: str, random_state: int, verbose: int, fold_index: int = 1, logger: Optional[Logger] = None,
             feature_column_name: Optional[str] = None, training_target_column_name: Optional[str] = None,
             validation_target_column_name: Optional[str] = None,
             metrics_list: Optional[List[str]] = None, categorical_column_indices: Optional[List[int]] = None):
    """
    Initialize the SpyModel class.

    You can either pass a config dict via the alternative constructor `from_config` or pass parameters explicitly.

    Args:
        eval_metric (str): Metric for evaluation of model performance.
        random_state (int): Random seed.
        verbose (int): Verbosity level for CatBoost output.
        fold_index (int, optional): Index of the current fold. Defaults to 1.
        logger (Logger, optional): Logger instance for logging.
        feature_column_name (str, optional): Column name containing feature vectors.
        training_target_column_name (str, optional): Target column name for training data.
        validation_target_column_name (str, optional): Target column name for validation data.
        metrics_list (List[str], optional): List of metrics to evaluate.
        categorical_column_indices (List[int], optional): Indices of features that are categorical.
    """

    self.eval_metric = eval_metric
    self.random_state = random_state
    self.fold_index = fold_index
    self.verbose = verbose
    self.logger = logger
    self.model: Optional[CatBoostClassifier] = None

    # Optional parameters for training; if not provided, they can be set later.
    self.feature_column_name = feature_column_name
    self.training_target_column_name = training_target_column_name
    self.validation_target_column_name = validation_target_column_name
    self.metrics_list = metrics_list
    self.categorical_column_indices = categorical_column_indices or []

evaluate(test_pool)

Evaluate the trained model on a test dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| test_pool | Pool | CatBoost Pool containing test features and labels. | required |

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | Dictionary of evaluation metrics. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the model has not been trained yet. |

Source code in payn\AugmentationModels\SpyModel\spymodel.py
def evaluate(self, test_pool: Pool) -> Dict[str, Any]:
    """
    Evaluate the trained model on a test dataset.

    Args:
        test_pool (Pool): CatBoost Pool containing test features and labels.

    Returns:
        Dict[str, Any]: Dictionary of evaluation metrics.

    Raises:
        ValueError: If the model has not been trained yet.
    """
    if not self.model:
        raise ValueError("Model has not been trained yet. Call `train` before evaluating.")

    eval_results = self.model.eval_metrics(
        data=test_pool,
        metrics=self.metrics_list
    )
    return eval_results

from_config(config, logger=None, fold_index=1) classmethod

Alternative constructor that creates a SpyModel instance from a config object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config | dict | Configuration dictionary. | required |
| logger | Logger | Logger instance. | None |
| fold_index | int | Current fold index. | 1 |

Returns:

| Type | Description |
| --- | --- |
| SpyModel | An instance of SpyModel with parameters extracted from the config. |
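
For orientation, the config layout that from_config reads can be sketched as a plain dictionary. Only the key names come from the source on this page; every value is a placeholder, and `spy_model_kwargs` is a hypothetical stand-in that mirrors the lookups without importing payn.

```python
from typing import Any, Dict

# Key names follow the from_config source; all values are illustrative placeholders.
example_config: Dict[str, Dict[str, Any]] = {
    "general": {"random_seed": 42, "verbose": 0},
    "featurisation": {"combined_features_column_name": "features"},
    "spy_model": {
        "eval_metric": "AUC",
        "training_target_column_name": "train_label",
        "validation_target_column_name": "val_label",
        "all_metrics": ["AUC", "Logloss"],
    },
}


def spy_model_kwargs(config: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: collect the keyword arguments from_config would pass on."""
    return {
        "random_state": config["general"]["random_seed"],
        "eval_metric": config["spy_model"]["eval_metric"],
        "verbose": config["general"]["verbose"],
        "feature_column_name": config["featurisation"]["combined_features_column_name"],
        "training_target_column_name": config["spy_model"]["training_target_column_name"],
        "validation_target_column_name": config["spy_model"]["validation_target_column_name"],
        "metrics_list": config["spy_model"]["all_metrics"],
    }


print(spy_model_kwargs(example_config)["random_state"])  # 42
```

A missing key raises KeyError at construction time, since the lookups use plain indexing rather than defaults.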

Source code in payn\AugmentationModels\SpyModel\spymodel.py
@classmethod
def from_config(cls, config: Dict[str, Any], logger: Optional[Logger] = None, fold_index: int = 1) -> "SpyModel":
    """
    Alternative constructor that creates a SpyModel instance from a config object.

    Args:
        config (dict): Configuration dictionary.
        logger (Logger, optional): Logger instance.
        fold_index (int): Current fold index.

    Returns:
        SpyModel: An instance of SpyModel with parameters extracted from the config.
    """
    return cls(
        logger=logger,
        fold_index=fold_index,
        random_state = config["general"]["random_seed"],
        eval_metric = config["spy_model"]["eval_metric"],
        verbose = config["general"]["verbose"],
        feature_column_name=config["featurisation"]["combined_features_column_name"],
        training_target_column_name=config["spy_model"]["training_target_column_name"],
        validation_target_column_name=config["spy_model"]["validation_target_column_name"],
        metrics_list = config["spy_model"]["all_metrics"]
    )

predict(data, feature_column=None)

Make predictions using the trained Spy model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | Dataset containing features for prediction. | required |
| feature_column | str | Column name for features. | None |

Returns:

| Type | Description |
| --- | --- |
| Series | Predicted class labels (CatBoost's default prediction type). |

Source code in payn\AugmentationModels\SpyModel\spymodel.py
def predict(self, data: pd.DataFrame, feature_column: Optional[str] = None) -> pd.Series:
    """
    Make predictions using the trained Spy model.

    Args:
        data (pd.DataFrame): Dataset containing features for prediction.
        feature_column (str, optional): Column name for features.

    Returns:
        pd.Series: Predicted class labels (CatBoost's default prediction type).
    """
    feature_column = feature_column or self.feature_column_name
    if not self.model:
        raise ValueError("Model has not been trained yet. Call `train` before making predictions.")

    data_pool = self._prepare_pool(data, label_column=None, feature_column=feature_column)
    return pd.Series(self.model.predict(data_pool))

train(train_data, val_data, test_data=None, feature_column=None, training_label_column=None, validation_label_column=None, **kwargs)

Train the Spy model on the given datasets.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| train_data | DataFrame | Training dataset with features and target labels. | required |
| val_data | DataFrame | Validation dataset for monitoring training progress. | required |
| test_data | Optional[DataFrame] | Optional test dataset for evaluation. | None |
| feature_column | Optional[str] | Column name for features. | None |
| training_label_column | Optional[str] | Column name for target labels in training data. | None |
| validation_label_column | Optional[str] | Column name for target labels in validation data. | None |
| **kwargs | Any | Additional hyperparameters (overriding defaults). | {} |

Returns:

| Type | Description |
| --- | --- |
| CatBoostClassifier | Trained CatBoost model. |
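
The way train combines its defaults with **kwargs can be sketched without CatBoost. The function name `build_catboost_params` is hypothetical; the merge order and the SLURM_CPUS_PER_TASK lookup follow the source on this page.

```python
import os
from typing import Any, Dict


def build_catboost_params(random_state: int, eval_metric: str, verbose: int, **overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper mirroring how train() assembles its hyperparameters."""
    # Under SLURM, use the allocated CPU count; otherwise -1 lets CatBoost
    # pick the number of threads from the available cores.
    n_threads = int(os.getenv("SLURM_CPUS_PER_TASK", "-1"))
    return {
        "random_state": random_state,
        "eval_metric": eval_metric,
        "verbose": verbose,
        "thread_count": n_threads,
        **overrides,  # later entries win, so caller-supplied values replace the defaults
    }


params = build_catboost_params(42, "AUC", 0, iterations=200, thread_count=4)
print(params["thread_count"], params["iterations"])  # 4 200
```

Because the overrides are unpacked last, passing thread_count explicitly takes precedence over the SLURM-derived value.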

Source code in payn\AugmentationModels\SpyModel\spymodel.py
def train(
    self,
    train_data: pd.DataFrame,
    val_data: pd.DataFrame,
    test_data: Optional[pd.DataFrame] = None,
    feature_column: Optional[str] = None,
    training_label_column: Optional[str] = None,
    validation_label_column: Optional[str] = None,
    **kwargs: Any
) -> CatBoostClassifier:
    """
    Train the Spy model on the given datasets.

    Args:
        train_data (pd.DataFrame): Training dataset with features and target labels.
        val_data (pd.DataFrame): Validation dataset for monitoring training progress.
        test_data (Optional[pd.DataFrame]): Optional test dataset for evaluation (default: None).
        feature_column (Optional[str]): Column name for features.
        training_label_column (Optional[str]): Column name for target labels in training data.
        validation_label_column (Optional[str]): Column name for target labels in validation data.
        **kwargs: Additional hyperparameters (overriding defaults).

    Returns:
        CatBoostClassifier: Trained CatBoost model.
    """
    feature_column = feature_column or self.feature_column_name
    training_label_column = training_label_column or self.training_target_column_name
    validation_label_column = validation_label_column or self.validation_target_column_name

    if isinstance(test_data, pd.DataFrame):
        test_label_column = validation_label_column
        test_pool = self._prepare_pool(test_data, test_label_column, feature_column)


    train_pool = self._prepare_pool(train_data, training_label_column, feature_column)
    val_pool = self._prepare_pool(val_data, validation_label_column, feature_column)

    # Support for SLURM multi-threading: use the allocated CPU count if set;
    # otherwise -1 lets CatBoost use all available cores.
    n_threads = int(os.getenv("SLURM_CPUS_PER_TASK", "-1"))

    # Combine default parameters with kwargs (kwargs take precedence)
    combined_params = {
        "random_state": self.random_state,
        "eval_metric": self.eval_metric,
        "verbose": self.verbose,
        "thread_count": n_threads,
        **kwargs,  # Override defaults with values from kwargs
    }

    # Define CatBoost model
    self.model = CatBoostClassifier(**combined_params)

    # Log user-provided and default hyperparameters
    if self.logger:
        self.logger.log_model_hyperparameters(self.model, **combined_params)

    # Create the MLflow callback
    # mlflow_callback = MLflowCatBoostCallback(eval_metric=self.eval_metric, logger=self.logger)

    # Train the model
    self.model.fit(train_pool, eval_set=val_pool, use_best_model=True) # , callbacks=[mlflow_callback]

    # Log the trained model and evaluation results
    if self.logger:
        self.logger.log_model(self.model, f"spy_model_fold_{self.fold_index}")
        # Log additional model attributes
        self.logger.log_model_attributes(self.model)
        if isinstance(test_data, pd.DataFrame):
            test_results = self.evaluate(test_pool)
            self.logger.log_evaluation_metrics(test_results)

    return self.model

Reliable Negative Identification (payn.AugmentationModels.SpyModel.augmen_negative_identifier)

This module is the decision-making engine of the PU learning workflow. It leverages the trained Spy Model to filter the unlabeled dataset, identifying a subset of reliable negatives that are statistically distinct from the positive class.

  • Dynamic Thresholding: Instead of using a fixed probability threshold (e.g., 0.5), the module calculates a dynamic cutoff from the predicted probabilities of the spies (known positives injected into the unlabeled set). A user-defined spy_tolerance (default 0.05, i.e. 5%) sets the threshold such that 95% of the spies are correctly recognized as positive by the model. This makes it unlikely that the identified negatives are latent positives. Unlabeled data points scoring below this threshold are classified as reliable negatives.
  • Classification: The module assigns each data point to one of three categories:
    1. Known Positives: Original true positives and recovered spies.
    2. Reliable Negatives: Unlabeled data points with predicted probabilities below the calculated threshold. These form the clean negative set for downstream applications such as regression model training.
    3. Undecisives: Unlabeled data points with probabilities above the threshold but not labeled as positive. These are discarded to prevent "noisy negatives".
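
The thresholding rule described above can be approximated with a quantile over the spy scores. This is a minimal sketch, not payn's implementation: the function names, the label strings, and the probabilities are all made up, and using `numpy.quantile` is an assumption about how the cutoff could be computed.

```python
import numpy as np


def spy_threshold(spy_probabilities, spy_tolerance=0.05):
    """Cutoff below which roughly `spy_tolerance` of the spy scores fall."""
    return float(np.quantile(np.asarray(spy_probabilities, dtype=float), spy_tolerance))


def label_unlabeled(probabilities, threshold):
    """Tag each unlabeled point as a reliable negative or an undecisive."""
    probs = np.asarray(probabilities, dtype=float)
    return np.where(probs < threshold, "reliable_negative", "undecisive")


# Made-up spy scores: 100 evenly spaced probabilities between 0.30 and 0.99.
spy_probs = np.linspace(0.30, 0.99, 100)
threshold = spy_threshold(spy_probs, spy_tolerance=0.05)

# Made-up scores for four unlabeled points.
labels = label_unlabeled([0.02, 0.25, 0.60, 0.95], threshold)
print(round(threshold, 4), list(labels))
```

A smaller spy_tolerance pushes the threshold down, which yields fewer but cleaner reliable negatives; a larger value admits more points at the cost of a higher chance of including latent positives.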

AugmenNegativeIdentifier

Identifies augmented (augmen_) reliable negatives using the Spy technique and an optimized threshold.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| model | CatBoostClassifier | The trained Spy model. |
| spy_tolerance | float | The acceptable proportion of spies within the reliable negatives. |
| logger | Logger | Logger instance for logging messages and artifacts. |
| feature_column_name | str | Default column name for input features. |
| mod_data_point_role_column_name | str | Default column name indicating each data point's role. |
| probability_class_1_column_name | str | Default column name for the predicted probability of class 1. |
| mod_prediction_class_column_name | str | Default column name for the predicted class. |
| augmented_bin_column_name | str | Default column name for the binary augmented label. |
| augmented_role_column_name | str | Default column name for the augmented role label. |

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py
class AugmenNegativeIdentifier:
    """
    Identifies augmented (augmen_) reliable negatives using the Spy technique and an optimized threshold.

    Attributes:
        model (CatBoostClassifier): The trained Spy model.
        spy_tolerance (float): The acceptable proportion of spies within the reliable negatives.
        logger (Logger, optional): Logger instance for logging messages and artifacts.
        feature_column_name (str, optional): Default column name for input features.
        mod_data_point_role_column_name (str, optional): Default column name indicating each data point's role.
        probability_class_1_column_name (str, optional): Default column name for the predicted probability of class 1.
        mod_prediction_class_column_name (str, optional): Default column name for the predicted class.
        augmented_bin_column_name (str, optional): Default column name for the binary augmented label.
        augmented_role_column_name (str, optional): Default column name for the augmented role label.
    """

    def __init__(self, model: CatBoostClassifier, spy_tolerance: float = 0.05, logger: Optional[Logger] = None,
                 feature_column_name: Optional[str] = None,
                 mod_data_point_role_column_name: Optional[str] = None,
                 probability_class_1_column_name: Optional[str] = None,
                 mod_prediction_class_column_name: Optional[str] = None,
                 augmented_bin_column_name: Optional[str] = None,
                 augmented_role_column_name: Optional[str] = None) -> None:
        """
        Initialize the AugmenNegativeIdentifier.

        Args:
            model (CatBoostClassifier): Trained Spy model.
            spy_tolerance (float, optional): Tolerance for spy inclusion in negatives.
            logger (Logger, optional): Logger instance for tracking and logging.
            feature_column_name (str, optional): Column name for input features.
            mod_data_point_role_column_name (str, optional): Column name for data point role.
            probability_class_1_column_name (str, optional): Column name for probability predictions for class 1.
            mod_prediction_class_column_name (str, optional): Column name for predicted class.
            augmented_bin_column_name (str, optional): Column name for binary augmented labels.
            augmented_role_column_name (str, optional): Column name for augmented role labels.
        """
        self.model = model
        self.spy_tolerance = spy_tolerance
        self.logger = logger

        # Optional parameters for classification; if not provided, they can be set later.
        self.feature_column_name = feature_column_name
        self.mod_data_point_role_column_name = mod_data_point_role_column_name
        self.probability_class_1_column_name = probability_class_1_column_name
        self.mod_prediction_class_column_name = mod_prediction_class_column_name
        self.augmented_bin_column_name = augmented_bin_column_name
        self.augmented_role_column_name = augmented_role_column_name

    @classmethod
    def from_config(cls, config: Dict[str, Any], model: CatBoostClassifier,
                    logger: Optional[Logger] = None) -> "AugmenNegativeIdentifier":
        """
        Alternative constructor that extracts the required parameters from a config object.

        The configuration dictionary is expected to have the keys "spy_splitting", "featurisation", and "meta_columns" with appropriate entries.

        Args:
            config (Dict[str, Any]): Configuration dictionary.
            model (CatBoostClassifier): Trained Spy model.
            logger (Logger, optional): Logger instance.

        Returns:
            AugmenNegativeIdentifier: A new instance configured from the provided config.
        """

        return cls(
            model=model,
            spy_tolerance=config["spy_splitting"]["spy_tolerance"],
            logger=logger,
            feature_column_name = config["featurisation"]["combined_features_column_name"],
            mod_data_point_role_column_name = config["meta_columns"]["meta_mod_data_point_role"],
            probability_class_1_column_name = config["meta_columns"]["meta_mod_probability_1"],
            mod_prediction_class_column_name = config["meta_columns"]["meta_mod_prediction_class"],
            augmented_bin_column_name = config["meta_columns"]["meta_augmented_bin"],
            augmented_role_column_name = config["meta_columns"]["meta_augmented_role"]
        )

    def predict_augmen_probabilities(
        self,
        spy_inf_data: pd.DataFrame,
        feature_column_name: Optional[str] = None,
        mod_prediction_class_column_name: Optional[str] = None,
        probability_class_1_column_name: Optional[str] = None
    ) -> pd.DataFrame:
        """
        Predict probabilities and labels for spy-infused training data.

        The method adds two columns to a copy of the input DataFrame:
        one for predicted classes and one for the predicted probabilities for class 1.

        Args:
            spy_inf_data (pd.DataFrame): Spy-infused training data.
            feature_column_name (Optional[str]): Name of the column containing input features; falls back to the instance attribute when omitted.
            mod_prediction_class_column_name (Optional[str]): Override for predicted class column name.
            probability_class_1_column_name (Optional[str]): Override for probability column name.

        Returns:
            pd.DataFrame: A new DataFrame with predicted class and probability columns appended.

        Raises:
            KeyError: If the feature column is not found in the data.
            Exception: Propagates any exceptions raised during prediction.
        """
        mod_pred_col = mod_prediction_class_column_name or self.mod_prediction_class_column_name
        prob_class1_col = probability_class_1_column_name or self.probability_class_1_column_name
        feature_column_name = feature_column_name or self.feature_column_name
        if feature_column_name not in spy_inf_data.columns:
            raise KeyError(f"Feature column '{feature_column_name}' not found in data.")

        result_df = spy_inf_data.copy()
        features = result_df[feature_column_name].tolist()

        try:
            pred_class = self.model.predict(features, prediction_type='Class')
            # CatBoost .predict(prediction_type='Probability') returns shape (N, 2), we want class 1
            pred_prob = self.model.predict(features, prediction_type='Probability')[:, 1]
        except Exception as e:
            if self.logger:
                self.logger.log_message(f"Error during prediction: {e}")
            raise

        result_df[mod_pred_col] = pred_class
        result_df[prob_class1_col] = pred_prob
        return result_df

    def find_augmen_threshold(
        self,
        spy_inf_data: pd.DataFrame,
        mod_data_point_role_column_name: Optional[str] = None,
        probability_class_1_column_name: Optional[str] = None
    ) -> float:
        """
        Find the optimal threshold for classifying augmented negatives.

        The threshold is determined by sorting the predicted probabilities of the examples whose
        data point role is "unlabeled spy" in ascending order and selecting the value at index
        int(n_spies * spy_tolerance), so that roughly a spy_tolerance fraction of the spies score below it.
        If the computed threshold exceeds 0.5, it is capped at 0.5.

        Args:
            spy_inf_data (pd.DataFrame): Data with predicted probabilities.
            mod_data_point_role_column_name (str, optional): Override for the data point role column name.
            probability_class_1_column_name (str, optional): Override for the probability column name.

        Returns:
            float: The determined threshold.

        Raises:
            KeyError: If the role column is not found in the data.
            ValueError: If no spy data is found.
        """
        role_col = mod_data_point_role_column_name or self.mod_data_point_role_column_name
        prob_class1_col = probability_class_1_column_name or self.probability_class_1_column_name

        if role_col not in spy_inf_data.columns:
            raise KeyError(f"Expected column '{role_col}' not found in data.")

        spy_data = spy_inf_data[spy_inf_data[role_col] == "unlabeled spy"]
        probability_values = spy_data[prob_class1_col].values

        if len(probability_values) == 0:
            raise ValueError("No spy data found to calculate threshold.")

        # Clamp the index so that spy_tolerance == 1.0 cannot run past the last spy.
        num_spies_to_catch = min(int(len(spy_data) * self.spy_tolerance), len(spy_data) - 1)
        sorted_probabilities = sorted(probability_values)
        threshold = sorted_probabilities[num_spies_to_catch]

        if threshold > 0.5:
            threshold = 0.5
            if self.logger:
                self.logger.log_message("Threshold is higher than 0.5; setting to 0.5")
        if self.logger:
            self.logger.log_threshold(threshold)
        return threshold

    def filter_augmented_negatives_and_known_positives(
        self,
        spy_inf_data: pd.DataFrame,
        augmented_role_column_name: Optional[str] = None,
        mod_data_point_role_column_name: Optional[str] = None
    ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        """
        Filter the augmented negatives from spy-infused data by excluding known positives.
        Also, return the set of "undecisive" datapoints.

        Known positives are defined as rows where the data point role (from mod_data_point_role_column_name)
        is "unlabeled spy" or "true positive". For these rows, the augmented role is forcibly set to "known positive".
        The method returns three DataFrames:
          - filtered_augmented_negatives: rows with augmented role "reliable negative"
          - known_positives: rows with role in ["unlabeled spy", "true positive"]
          - undecisives: rows with augmented role "undecisive"

        Args:
            spy_inf_data (pd.DataFrame): Spy-infused DataFrame containing meta columns.
            augmented_role_column_name (str, optional): Override for the augmented role column name.
            mod_data_point_role_column_name (str, optional): Override for the data point role column name.

        Returns:
            Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: (filtered_negatives, known_positives, undecisives).

        Raises:
            KeyError: If expected columns are missing.
        """
        role_col = mod_data_point_role_column_name or self.mod_data_point_role_column_name
        aug_role_col = augmented_role_column_name or self.augmented_role_column_name

        for col in [role_col, aug_role_col]:
            if col not in spy_inf_data.columns:
                raise KeyError(f"Expected column '{col}' not found in data.")

        known_positive_roles = ["unlabeled spy", "true positive"]
        updated_df = spy_inf_data.copy()

        updated_df[aug_role_col] = np.where(
            updated_df[role_col].isin(known_positive_roles),
            "known positive",
            updated_df[aug_role_col]
        )
        known_positives = updated_df[updated_df[role_col].isin(known_positive_roles)].copy()
        filtered_negatives = updated_df[updated_df[aug_role_col] == "reliable negative"].copy()
        undecisives = updated_df[updated_df[aug_role_col] == "undecisive"].copy() # Only needed for Evaluation

        if self.logger:
            self.logger._log_dataframe_as_artifact(updated_df, "spy_inf_data.csv")
            self.logger._log_dataframe_as_artifact(known_positives, "true_positives_and_unlabeled_spy.csv")
            self.logger._log_dataframe_as_artifact(filtered_negatives, "augmented_negatives.csv")
            self.logger._log_dataframe_as_artifact(undecisives, "undecisive_datapoints.csv")

        return filtered_negatives, known_positives, undecisives

    def get_augmen_negatives_and_known_positives(
        self,
        spy_inf_data: pd.DataFrame,
        threshold: float,
        augmented_bin_column_name: Optional[str] = None,
        augmented_role_column_name: Optional[str] = None,
        probability_class_1_column_name: Optional[str] = None,
        mod_data_point_role_column_name: Optional[str] = None
    ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        """
        Extract augmented reliable negatives, known positives, and undecisive datapoints based on a threshold.

        The method creates a new binary column (augmented_bin_column_name) for augmented labels based on whether
        the predicted probability (from probability_class_1_column_name) exceeds the threshold. It then assigns an augmented
        role ("reliable negative" if binary label is 0; otherwise "undecisive") and calls the filtering function to separate
        known positives from reliable negatives and to collect undecisive datapoints.

        Args:
            spy_inf_data (pd.DataFrame): Spy-infused training data with probability predictions.
            threshold (float): Threshold for binary classification.
            augmented_bin_column_name (str, optional): Override for the binary augmented column name.
            augmented_role_column_name (str, optional): Override for the augmented role column name.
            probability_class_1_column_name (str, optional): Override for the probability column name.
            mod_data_point_role_column_name (str, optional): Override for the data point role column name.

        Returns:
            Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: (augmen_reliable_negatives, known_positives, undecisives).
        """
        aug_bin_col = augmented_bin_column_name or self.augmented_bin_column_name
        aug_role_col = augmented_role_column_name or self.augmented_role_column_name
        prob_class1_col = probability_class_1_column_name or self.probability_class_1_column_name
        mod_data_role_col = mod_data_point_role_column_name or self.mod_data_point_role_column_name

        df = spy_inf_data.copy()
        df[aug_bin_col] = df[prob_class1_col].apply(lambda x: 1 if x > threshold else 0)
        df[aug_role_col] = df[aug_bin_col].apply(lambda x: "reliable negative" if x == 0 else "undecisive")

        return self.filter_augmented_negatives_and_known_positives(
            spy_inf_data=df,
            augmented_role_column_name=aug_role_col,
            mod_data_point_role_column_name=mod_data_role_col
        )

__init__(model, spy_tolerance=0.05, logger=None, feature_column_name=None, mod_data_point_role_column_name=None, probability_class_1_column_name=None, mod_prediction_class_column_name=None, augmented_bin_column_name=None, augmented_role_column_name=None)

Initialize the AugmenNegativeIdentifier.

Parameters:

Name Type Description Default
model CatBoostClassifier

Trained Spy model.

required
spy_tolerance float

Tolerance for spy inclusion in negatives.

0.05
logger Logger

Logger instance for tracking and logging.

None
feature_column_name str

Column name for input features.

None
mod_data_point_role_column_name str

Column name for data point role.

None
probability_class_1_column_name str

Column name for probability predictions for class 1.

None
mod_prediction_class_column_name str

Column name for predicted class.

None
augmented_bin_column_name str

Column name for binary augmented labels.

None
augmented_role_column_name str

Column name for augmented role labels.

None
Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py
def __init__(self, model: CatBoostClassifier, spy_tolerance: float = 0.05, logger: Optional[Logger] = None,
             feature_column_name: Optional[str] = None,
             mod_data_point_role_column_name: Optional[str] = None,
             probability_class_1_column_name: Optional[str] = None,
             mod_prediction_class_column_name: Optional[str] = None,
             augmented_bin_column_name: Optional[str] = None,
             augmented_role_column_name: Optional[str] = None) -> None:
    """
    Initialize the AugmenNegativeIdentifier.

    Args:
        model (CatBoostClassifier): Trained Spy model.
        spy_tolerance (float, optional): Tolerance for spy inclusion in negatives.
        logger (Logger, optional): Logger instance for tracking and logging.
        feature_column_name (str, optional): Column name for input features.
        mod_data_point_role_column_name (str, optional): Column name for data point role.
        probability_class_1_column_name (str, optional): Column name for probability predictions for class 1.
        mod_prediction_class_column_name (str, optional): Column name for predicted class.
        augmented_bin_column_name (str, optional): Column name for binary augmented labels.
        augmented_role_column_name (str, optional): Column name for augmented role labels.
    """
    self.model = model
    self.spy_tolerance = spy_tolerance
    self.logger = logger

    # Optional parameters for classification; if not provided, they can be set later.
    self.feature_column_name = feature_column_name
    self.mod_data_point_role_column_name = mod_data_point_role_column_name
    self.probability_class_1_column_name = probability_class_1_column_name
    self.mod_prediction_class_column_name = mod_prediction_class_column_name
    self.augmented_bin_column_name = augmented_bin_column_name
    self.augmented_role_column_name = augmented_role_column_name

filter_augmented_negatives_and_known_positives(spy_inf_data, augmented_role_column_name=None, mod_data_point_role_column_name=None)

Filter the augmented negatives from spy-infused data by excluding known positives. Also, return the set of "undecisive" datapoints.

Known positives are defined as rows where the data point role (from mod_data_point_role_column_name) is "unlabeled spy" or "true positive". For these rows, the augmented role is forcibly set to "known positive". The method returns three DataFrames:

  • filtered_augmented_negatives: rows with augmented role "reliable negative"
  • known_positives: rows with role in ["unlabeled spy", "true positive"]
  • undecisives: rows with augmented role "undecisive"

Parameters:

Name Type Description Default
spy_inf_data DataFrame

Spy-infused DataFrame containing meta columns.

required
augmented_role_column_name str

Override for the augmented role column name.

None
mod_data_point_role_column_name str

Override for the data point role column name.

None

Returns:

Type Description
Tuple[DataFrame, DataFrame, DataFrame]

A tuple (filtered_negatives, known_positives, undecisives).

Raises:

Type Description
KeyError

If expected columns are missing.
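
The partitioning described above can be sketched without pandas. This is only an illustration of the rule, not the method itself: plain dictionaries stand in for DataFrame rows, the keys "role" and "augmented_role" are hypothetical column names, and "unlabeled" is an assumed role value for ordinary unlabeled points.

```python
# Roles that always count as known positives, as listed in the description above.
KNOWN_POSITIVE_ROLES = {"unlabeled spy", "true positive"}


def partition_by_role(rows):
    """Split rows into (reliable negatives, known positives, undecisives)."""
    negatives, known_positives, undecisives = [], [], []
    for row in rows:
        row = dict(row)  # copy, so the caller's rows stay untouched
        if row["role"] in KNOWN_POSITIVE_ROLES:
            # The data point role overrides whatever augmented role was assigned.
            row["augmented_role"] = "known positive"
            known_positives.append(row)
        elif row["augmented_role"] == "reliable negative":
            negatives.append(row)
        elif row["augmented_role"] == "undecisive":
            undecisives.append(row)
    return negatives, known_positives, undecisives


# Hypothetical rows; "unlabeled" is an assumed role name for non-spy unlabeled data.
rows = [
    {"id": 1, "role": "true positive", "augmented_role": "undecisive"},
    {"id": 2, "role": "unlabeled spy", "augmented_role": "reliable negative"},
    {"id": 3, "role": "unlabeled", "augmented_role": "reliable negative"},
    {"id": 4, "role": "unlabeled", "augmented_role": "undecisive"},
]
negatives, known_positives, undecisives = partition_by_role(rows)
```

Note that row 2, a spy scored below the threshold, ends up among the known positives rather than the reliable negatives, because the role override is applied before the split.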

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py
def filter_augmented_negatives_and_known_positives(
    self,
    spy_inf_data: pd.DataFrame,
    augmented_role_column_name: Optional[str] = None,
    mod_data_point_role_column_name: Optional[str] = None
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Filter the augmented negatives from spy-infused data by excluding known positives.
    Also, return the set of "undecisive" datapoints.

    Known positives are defined as rows where the data point role (from mod_data_point_role_column_name)
    is "unlabeled spy" or "true positive". For these rows, the augmented role is forcibly set to "known positive".
    The method returns three DataFrames:
      - filtered_augmented_negatives: rows with augmented role "reliable negative"
      - known_positives: rows with role in ["unlabeled spy", "true positive"]
      - undecisives: rows with augmented role "undecisive"

    Args:
        spy_inf_data (pd.DataFrame): Spy-infused DataFrame containing meta columns.
        augmented_role_column_name (str, optional): Override for the augmented role column name.
        mod_data_point_role_column_name (str, optional): Override for the data point role column name.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: (filtered_negatives, known_positives, undecisives).

    Raises:
        KeyError: If expected columns are missing.
    """
    role_col = mod_data_point_role_column_name or self.mod_data_point_role_column_name
    aug_role_col = augmented_role_column_name or self.augmented_role_column_name

    for col in [role_col, aug_role_col]:
        if col not in spy_inf_data.columns:
            raise KeyError(f"Expected column '{col}' not found in data.")

    known_positive_roles = ["unlabeled spy", "true positive"]
    updated_df = spy_inf_data.copy()

    updated_df[aug_role_col] = np.where(
        updated_df[role_col].isin(known_positive_roles),
        "known positive",
        updated_df[aug_role_col]
    )
    known_positives = updated_df[updated_df[role_col].isin(known_positive_roles)].copy()
    filtered_negatives = updated_df[updated_df[aug_role_col] == "reliable negative"].copy()
    undecisives = updated_df[updated_df[aug_role_col] == "undecisive"].copy() # Only needed for Evaluation

    if self.logger:
        self.logger._log_dataframe_as_artifact(updated_df, "spy_inf_data.csv")
        self.logger._log_dataframe_as_artifact(known_positives, "true_positives_and_unlabeled_spy.csv")
        self.logger._log_dataframe_as_artifact(filtered_negatives, "augmented_negatives.csv")
        self.logger._log_dataframe_as_artifact(undecisives, "undecisive_datapoints.csv")

    return filtered_negatives, known_positives, undecisives

find_augmen_threshold(spy_inf_data, mod_data_point_role_column_name=None, probability_class_1_column_name=None)

Find the optimal threshold for classifying augmented negatives.

The threshold is determined by sorting the predicted probabilities of the examples whose data point role is "unlabeled spy" in ascending order and selecting the value at index int(n_spies * spy_tolerance), so that roughly a spy_tolerance fraction of the spies score below it. If the computed threshold exceeds 0.5, it is capped at 0.5.

Parameters:

Name Type Description Default
spy_inf_data DataFrame

Data with predicted probabilities.

required
mod_data_point_role_column_name str

Override for the data point role column name.

None
probability_class_1_column_name str

Override for the probability column name.

None

Returns:

Name Type Description
float float

The determined threshold.

Raises:

Type Description
KeyError

If the role column is not found in the data.

ValueError

If no spy data is found.
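
As a rough, pandas-free sketch of the threshold rule, the function below takes the spy probabilities as a plain list. The name spy_threshold and its cap parameter are hypothetical, and the clamp on the index is an addition of this sketch so that a tolerance of 1.0 still yields a valid position.

```python
def spy_threshold(spy_probabilities, spy_tolerance=0.05, cap=0.5):
    """Return the cut-off below which unlabeled points count as reliable negatives."""
    if not spy_probabilities:
        raise ValueError("No spy data found to calculate threshold.")
    ordered = sorted(spy_probabilities)
    # Position of the spy whose score becomes the threshold; clamped to the last element.
    index = min(int(len(ordered) * spy_tolerance), len(ordered) - 1)
    # Never let the threshold rise above the cap (0.5 in the description above).
    return min(ordered[index], cap)


# Ten made-up spy scores; sorted ascending they start 0.12, 0.31, 0.59, ...
spy_scores = [0.91, 0.12, 0.78, 0.31, 0.85, 0.73, 0.95, 0.59, 0.88, 0.81]
threshold = spy_threshold(spy_scores, spy_tolerance=0.1)         # index 1 -> 0.31
capped_threshold = spy_threshold(spy_scores, spy_tolerance=0.2)  # index 2 -> 0.59, capped to 0.5
```

A larger tolerance moves the index further up the sorted list and therefore raises the threshold, until the 0.5 cap takes over.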

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py
def find_augmen_threshold(
    self,
    spy_inf_data: pd.DataFrame,
    mod_data_point_role_column_name: Optional[str] = None,
    probability_class_1_column_name: Optional[str] = None
) -> float:
    """
    Find the optimal threshold for classifying augmented negatives.

    The threshold is determined by sorting the predicted probabilities of the examples whose
    data point role is "unlabeled spy" in ascending order and selecting the value at index
    int(n_spies * spy_tolerance), so that roughly a spy_tolerance fraction of the spies score below it.
    If the computed threshold exceeds 0.5, it is capped at 0.5.

    Args:
        spy_inf_data (pd.DataFrame): Data with predicted probabilities.
        mod_data_point_role_column_name (str, optional): Override for the data point role column name.
        probability_class_1_column_name (str, optional): Override for the probability column name.

    Returns:
        float: The determined threshold.

    Raises:
        KeyError: If the role column is not found in the data.
        ValueError: If no spy data is found.
    """
    role_col = mod_data_point_role_column_name or self.mod_data_point_role_column_name
    prob_class1_col = probability_class_1_column_name or self.probability_class_1_column_name

    if role_col not in spy_inf_data.columns:
        raise KeyError(f"Expected column '{role_col}' not found in data.")

    spy_data = spy_inf_data[spy_inf_data[role_col] == "unlabeled spy"]
    probability_values = spy_data[prob_class1_col].values

    if len(probability_values) == 0:
        raise ValueError("No spy data found to calculate threshold.")

    # Clamp the index so that spy_tolerance == 1.0 cannot run past the last spy.
    num_spies_to_catch = min(int(len(spy_data) * self.spy_tolerance), len(spy_data) - 1)
    sorted_probabilities = sorted(probability_values)
    threshold = sorted_probabilities[num_spies_to_catch]

    if threshold > 0.5:
        threshold = 0.5
        if self.logger:
            self.logger.log_message("Threshold is higher than 0.5; setting to 0.5")
    if self.logger:
        self.logger.log_threshold(threshold)
    return threshold

from_config(config, model, logger=None) classmethod

Alternative constructor that extracts the required parameters from a config object.

The configuration dictionary is expected to have keys "spy_splitting", "featurisation", and "meta_columns" with appropriate entries.

Parameters:

Name Type Description Default
config Dict[str, Any]

Configuration dictionary.

required
model CatBoostClassifier

Trained Spy model.

required
logger Logger

Logger instance.

None

Returns:

Name Type Description
AugmenNegativeIdentifier AugmenNegativeIdentifier

A new instance configured from the provided config.
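
To make the expected configuration shape concrete, the sketch below builds a minimal dictionary and maps it to constructor keyword arguments in the same way from_config reads them. The helper identifier_kwargs_from_config is hypothetical, and every column-name value in example_config is a made-up placeholder; only the key paths are taken from the source listing.

```python
def identifier_kwargs_from_config(config):
    """Collect the constructor keyword arguments that from_config reads from the config."""
    meta = config["meta_columns"]
    return {
        "spy_tolerance": config["spy_splitting"]["spy_tolerance"],
        "feature_column_name": config["featurisation"]["combined_features_column_name"],
        "mod_data_point_role_column_name": meta["meta_mod_data_point_role"],
        "probability_class_1_column_name": meta["meta_mod_probability_1"],
        "mod_prediction_class_column_name": meta["meta_mod_prediction_class"],
        "augmented_bin_column_name": meta["meta_augmented_bin"],
        "augmented_role_column_name": meta["meta_augmented_role"],
    }


# Placeholder values; real projects will use their own column names.
example_config = {
    "spy_splitting": {"spy_tolerance": 0.05},
    "featurisation": {"combined_features_column_name": "combined_features"},
    "meta_columns": {
        "meta_mod_data_point_role": "mod_data_point_role",
        "meta_mod_probability_1": "mod_probability_1",
        "meta_mod_prediction_class": "mod_prediction_class",
        "meta_augmented_bin": "augmented_bin",
        "meta_augmented_role": "augmented_role",
    },
}
kwargs = identifier_kwargs_from_config(example_config)
```

A missing section or entry surfaces as a KeyError at construction time, since the values are read with plain dictionary indexing.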

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py
@classmethod
def from_config(cls, config: Dict[str, Any], model: CatBoostClassifier,
                logger: Optional[Logger] = None) -> "AugmenNegativeIdentifier":
    """
    Alternative constructor that extracts the required parameters from a config object.

    The configuration dictionary is expected to have keys "spy_splitting", "featurisation", and "meta_columns" with appropriate entries.

    Args:
        config (Dict[str, Any]): Configuration dictionary.
        model (CatBoostClassifier): Trained Spy model.
        logger (Logger, optional): Logger instance.

    Returns:
        AugmenNegativeIdentifier: A new instance configured from the provided config.
    """

    return cls(
        model=model,
        spy_tolerance=config["spy_splitting"]["spy_tolerance"],
        logger=logger,
        feature_column_name = config["featurisation"]["combined_features_column_name"],
        mod_data_point_role_column_name = config["meta_columns"]["meta_mod_data_point_role"],
        probability_class_1_column_name = config["meta_columns"]["meta_mod_probability_1"],
        mod_prediction_class_column_name = config["meta_columns"]["meta_mod_prediction_class"],
        augmented_bin_column_name = config["meta_columns"]["meta_augmented_bin"],
        augmented_role_column_name = config["meta_columns"]["meta_augmented_role"]
    )

get_augmen_negatives_and_known_positives(spy_inf_data, threshold, augmented_bin_column_name=None, augmented_role_column_name=None, probability_class_1_column_name=None, mod_data_point_role_column_name=None)

Extract augmented reliable negatives, known positives, and undecisive datapoints based on a threshold.

The method creates a new binary column (augmented_bin_column_name) for augmented labels based on whether the predicted probability (from probability_class_1_column_name) exceeds the threshold. It then assigns an augmented role ("reliable negative" if binary label is 0; otherwise "undecisive") and calls the filtering function to separate known positives from reliable negatives and to collect undecisive datapoints.

Parameters:

Name Type Description Default
spy_inf_data DataFrame

Spy-infused training data with probability predictions.

required
threshold float

Threshold for binary classification.

required
augmented_bin_column_name str

Override for the binary augmented column name.

None
augmented_role_column_name str

Override for the augmented role column name.

None
probability_class_1_column_name str

Override for the probability column name.

None
mod_data_point_role_column_name str

Override for the data point role column name.

None

Returns:

Type Description
Tuple[DataFrame, DataFrame, DataFrame]

A tuple (augmen_reliable_negatives, known_positives, undecisives).
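
The labelling rule can be illustrated on single values. The helper assign_augmented_role is a hypothetical name for this sketch; it applies the same strict comparison as the description above, so a probability exactly equal to the threshold is still treated as a reliable negative.

```python
def assign_augmented_role(probability, threshold):
    """Map a class-1 probability to (binary augmented label, augmented role)."""
    binary = 1 if probability > threshold else 0  # strict ">": ties stay at 0
    role = "reliable negative" if binary == 0 else "undecisive"
    return binary, role


threshold = 0.31
# Made-up probabilities on both sides of the threshold, plus one exact tie.
labelled = {p: assign_augmented_role(p, threshold) for p in [0.05, 0.31, 0.32, 0.90]}
```

Rows whose data point role marks them as known positives are relabelled afterwards by the filtering step, regardless of the role assigned here.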

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py
def get_augmen_negatives_and_known_positives(
    self,
    spy_inf_data: pd.DataFrame,
    threshold: float,
    augmented_bin_column_name: Optional[str] = None,
    augmented_role_column_name: Optional[str] = None,
    probability_class_1_column_name: Optional[str] = None,
    mod_data_point_role_column_name: Optional[str] = None
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Extract augmented reliable negatives, known positives, and undecisive datapoints based on a threshold.

    The method creates a new binary column (augmented_bin_column_name) for augmented labels based on whether
    the predicted probability (from probability_class_1_column_name) exceeds the threshold. It then assigns an augmented
    role ("reliable negative" if binary label is 0; otherwise "undecisive") and calls the filtering function to separate
    known positives from reliable negatives and to collect undecisive datapoints.

    Args:
        spy_inf_data (pd.DataFrame): Spy-infused training data with probability predictions.
        threshold (float): Threshold for binary classification.
        augmented_bin_column_name (str, optional): Override for the binary augmented column name.
        augmented_role_column_name (str, optional): Override for the augmented role column name.
        probability_class_1_column_name (str, optional): Override for the probability column name.
        mod_data_point_role_column_name (str, optional): Override for the data point role column name.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: (augmen_reliable_negatives, known_positives, undecisives).
    """
    aug_bin_col = augmented_bin_column_name or self.augmented_bin_column_name
    aug_role_col = augmented_role_column_name or self.augmented_role_column_name
    prob_class1_col = probability_class_1_column_name or self.probability_class_1_column_name
    mod_data_role_col = mod_data_point_role_column_name or self.mod_data_point_role_column_name

    df = spy_inf_data.copy()
    df[aug_bin_col] = df[prob_class1_col].apply(lambda x: 1 if x > threshold else 0)
    df[aug_role_col] = df[aug_bin_col].apply(lambda x: "reliable negative" if x == 0 else "undecisive")

    return self.filter_augmented_negatives_and_known_positives(
        spy_inf_data=df,
        augmented_role_column_name=aug_role_col,
        mod_data_point_role_column_name=mod_data_role_col
    )

predict_augmen_probabilities(spy_inf_data, feature_column_name=None, mod_prediction_class_column_name=None, probability_class_1_column_name=None)

Predict probabilities and labels for spy-infused training data.

The method adds two columns to a copy of the input DataFrame: one for predicted classes and one for the predicted probabilities for class 1.

Parameters:

Name Type Description Default
spy_inf_data DataFrame

Spy-infused training data.

required
feature_column_name Optional[str]

Name of the column containing input features; falls back to the instance attribute when omitted.

None
mod_prediction_class_column_name Optional[str]

Override for predicted class column name.

None
probability_class_1_column_name Optional[str]

Override for probability column name.

None

Returns:

Type Description
DataFrame

A new DataFrame with predicted class and probability columns appended.

Raises:

Type Description
KeyError

If the feature column is not found in the data.

Exception

Propagates any exceptions raised during prediction.
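
The shape of the prediction step can be sketched with a stand-in model. StubSpyModel is purely hypothetical and only mimics a predict method that returns either class labels or two-column probabilities; the dictionary keys "features", "predicted_class" and "probability_1" are likewise placeholders for the configured column names.

```python
class StubSpyModel:
    """Hypothetical stand-in for a trained classifier; not a real model."""

    def predict(self, features, prediction_type="Class"):
        # Toy score: the mean of each feature vector, treated as P(class 1).
        scores = [sum(vector) / len(vector) for vector in features]
        if prediction_type == "Probability":
            return [[1.0 - score, score] for score in scores]  # one [P(0), P(1)] pair per row
        return [1 if score > 0.5 else 0 for score in scores]


def add_predictions(rows, model, feature_key="features"):
    """Return copies of the rows with a predicted class and a class-1 probability added."""
    features = [row[feature_key] for row in rows]
    classes = model.predict(features, prediction_type="Class")
    probabilities = model.predict(features, prediction_type="Probability")
    return [
        {**row, "predicted_class": cls, "probability_1": pair[1]}  # keep only column 1
        for row, cls, pair in zip(rows, classes, probabilities)
    ]


rows = [{"features": [1, 1, 0, 1]}, {"features": [0, 0, 1, 0]}]
scored = add_predictions(rows, StubSpyModel())
```

As in the method above, the input rows are left unchanged and the two new fields appear only on the returned copies.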

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py
def predict_augmen_probabilities(
    self,
    spy_inf_data: pd.DataFrame,
    feature_column_name: Optional[str] = None,
    mod_prediction_class_column_name: Optional[str] = None,
    probability_class_1_column_name: Optional[str] = None
) -> pd.DataFrame:
    """
    Predict probabilities and labels for spy-infused training data.

    The method adds two columns to a copy of the input DataFrame:
    one for predicted classes and one for the predicted probabilities for class 1.

    Args:
        spy_inf_data (pd.DataFrame): Spy-infused training data.
        feature_column_name (Optional[str]): Name of the column containing input features; falls back to the instance attribute when omitted.
        mod_prediction_class_column_name (Optional[str]): Override for predicted class column name.
        probability_class_1_column_name (Optional[str]): Override for probability column name.

    Returns:
        pd.DataFrame: A new DataFrame with predicted class and probability columns appended.

    Raises:
        KeyError: If the feature column is not found in the data.
        Exception: Propagates any exceptions raised during prediction.
    """
    mod_pred_col = mod_prediction_class_column_name or self.mod_prediction_class_column_name
    prob_class1_col = probability_class_1_column_name or self.probability_class_1_column_name
    feature_column_name = feature_column_name or self.feature_column_name
    if feature_column_name not in spy_inf_data.columns:
        raise KeyError(f"Feature column '{feature_column_name}' not found in data.")

    result_df = spy_inf_data.copy()
    features = result_df[feature_column_name].tolist()

    try:
        pred_class = self.model.predict(features, prediction_type='Class')
        # CatBoost .predict(prediction_type='Probability') returns shape (N, 2), we want class 1
        pred_prob = self.model.predict(features, prediction_type='Probability')[:, 1]
    except Exception as e:
        if self.logger:
            self.logger.log_message(f"Error during prediction: {e}")
        raise

    result_df[mod_pred_col] = pred_class
    result_df[prob_class1_col] = pred_prob
    return result_df