Augmentation Models

This module contains the core logic for the PU learning process: the Spy Model classifier and the decision engine for identifying reliable negatives.

Spy Model (`payn.AugmentationModels.SpyModel.SpyModel`)

Wraps a CatBoostClassifier, selected for its native handling of categorical features and its robust performance on tabular chemical data without extensive preprocessing. Other model architectures can be used instead, provided they yield a class probability score or allow one to be estimated.
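As a rough illustration of the probability-score requirement, the sketch below defines a hypothetical interface and a toy model that satisfies it. The names `ProbabilityScorer`, `SigmoidScorer`, and `positive_probability` are assumptions made for this example and are not part of payn or CatBoost.

```python
import math
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class ProbabilityScorer(Protocol):
    """Hypothetical interface: anything that can score P(s = 1 | x)."""

    def positive_probability(self, features: Sequence[float]) -> float:
        ...


class SigmoidScorer:
    """Toy linear model whose raw margin is squashed into (0, 1)."""

    def __init__(self, weights: Sequence[float], bias: float = 0.0) -> None:
        self.weights = list(weights)
        self.bias = bias

    def positive_probability(self, features: Sequence[float]) -> float:
        # A raw margin is not a probability; the logistic function maps it to one.
        margin = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return 1.0 / (1.0 + math.exp(-margin))


scorer = SigmoidScorer(weights=[0.5, -0.25])
print(isinstance(scorer, ProbabilityScorer))     # True: the method is present
print(scorer.positive_probability([0.0, 0.0]))   # margin 0 -> 0.5
```

Any replacement for CatBoost would need an equivalent of `positive_probability`, because the downstream thresholding operates on these scores.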

The spy model is trained on the spy_infused_training_data (from payn.SpySplitting) to distinguish "known positives" (s = 1) from the "unlabeled/spy mixture" (s = 0).
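The labelling scheme can be sketched as follows. This is not the payn.SpySplitting implementation; the function name `infuse_spies`, the `spy_fraction` parameter, and the role strings other than "unlabeled spy" are assumptions made for illustration.

```python
import random
from typing import Dict, List


def infuse_spies(labels: List[int], spy_fraction: float, seed: int) -> List[Dict[str, object]]:
    """Relabel a random share of known positives as spies (s = 0).

    `labels` holds the original PU labels: 1 = known positive, 0 = unlabeled.
    """
    rng = random.Random(seed)
    positive_ids = [i for i, s in enumerate(labels) if s == 1]
    n_spies = round(spy_fraction * len(positive_ids))
    spy_ids = set(rng.sample(positive_ids, n_spies))

    rows = []
    for i, s in enumerate(labels):
        if i in spy_ids:
            # Spies are true positives hidden inside the s = 0 group.
            rows.append({"s": 0, "role": "unlabeled spy"})
        elif s == 1:
            rows.append({"s": 1, "role": "known positive"})
        else:
            rows.append({"s": 0, "role": "unlabeled"})
    return rows


rows = infuse_spies([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], spy_fraction=0.25, seed=0)
print(sum(r["role"] == "unlabeled spy" for r in rows))  # 1 of the 4 positives
```

The classifier only ever sees the `s` column as its target; the role column is kept so that the spies can be found again when the threshold is computed.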

Categorical Handling: Categorical features (e.g., specific bit positions or metadata tags) are expected at the end of the feature vector. Unless categorical_column_indices is set explicitly, the model counts the string-valued entries in the first row and marks that many trailing positions as categorical.
Parallelisation: Reads SLURM_CPUS_PER_TASK to set CatBoost's thread_count on SLURM clusters. When the variable is unset (e.g., on a local machine), thread_count falls back to -1, which lets CatBoost use all available cores.
Determinism: Random seeds are propagated strictly from the global config to the CatBoost engine (random_state).
Logging: SpyModel integrates with the payn.Logging system. When a logger is supplied, it logs the hyperparameters before fitting, and the trained model artifact, model attributes, and (if a test set is passed) evaluation metrics to the MLflow run immediately after training.
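The trailing-position assumption behind the categorical detection can be illustrated with a small sketch. The first function follows the inference used in `_prepare_pool`; the second, `count_trailing_strings`, is a stricter variant added here as an assumption to show where the two differ.

```python
from typing import List, Sequence


def infer_trailing_categorical_indices(row: Sequence[object]) -> List[int]:
    """Count string entries and assume they occupy the last positions."""
    cat_count = sum(isinstance(value, str) for value in row)
    return list(range(len(row) - cat_count, len(row)))


def count_trailing_strings(row: Sequence[object]) -> int:
    """Stricter check (not in the source): stop at the first non-string from the end."""
    count = 0
    for value in reversed(row):
        if not isinstance(value, str):
            break
        count += 1
    return count


well_formed = [0, 1, 0, 0.73, "solvent_A", "batch_7"]
print(infer_trailing_categorical_indices(well_formed))  # [4, 5]

misordered = ["solvent_A", 0, 1, 0.73]
print(infer_trailing_categorical_indices(misordered))   # [3], which points at a float
print(count_trailing_strings(misordered))                # 0
```

If string features cannot be guaranteed to sit at the end of the vector, passing `categorical_column_indices` explicitly avoids the inference altogether.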

SpyModel encapsulates the CatBoostClassifier used in the Spy-based learning step.

Attributes:

Name	Type	Description
`config_key`	`str`	The key in the config dict relevant to SpyModel.
`logger`	`Optional[Logger]`	Logger instance for logging model training and evaluation.
`fold_index`	`int`	Index of the current fold (for cross-validation purposes).
`random_state`	`int`	Random seed.
`eval_metric`	`str`	Evaluation metric to use.
`verbose`	`int`	Verbosity level.
`model`	`Optional[CatBoostClassifier]`	The trained CatBoost model.
`feature_column_name`	`Optional[str]`	Column name containing feature vectors.
`training_target_column_name`	`Optional[str]`	Target column name for training data.
`validation_target_column_name`	`Optional[str]`	Target column name for validation data.
`metrics_list`	`Optional[List[str]]`	List of additional metrics to evaluate.
`categorical_column_indices`	`List[int]`	Indices of features identified as categorical.

Source code in payn\AugmentationModels\SpyModel\spymodel.py

class SpyModel:
    """
    SpyModel encapsulates the CatBoostClassifier used in the Spy-based learning step.

    Attributes:
        config_key (str): The key in the config dict relevant to SpyModel.
        logger (Optional[Logger]): Logger instance for logging model training and evaluation.
        fold_index (int): Index of the current fold (for cross-validation purposes).
        random_state (int): Random seed.
        eval_metric (str): Evaluation metric to use.
        verbose (int): Verbosity level.
        model (Optional[CatBoostClassifier]): The trained CatBoost model.
        feature_column_name (Optional[str]): Column name containing feature vectors.
        training_target_column_name (Optional[str]): Target column name for training data.
        validation_target_column_name (Optional[str]): Target column name for validation data.
        metrics_list (Optional[List[str]]): List of additional metrics to evaluate.
        categorical_column_indices (List[int]): Indices of features identified as categorical.
    """

    config_key = "spy_model"

    def __init__(self, eval_metric: str, random_state: int, verbose: int, fold_index: int = 1, logger: Optional[Logger] = None,
                 feature_column_name: str = None, training_target_column_name: str = None, validation_target_column_name: str = None,
                 metrics_list: Optional[List[str]] = None, categorical_column_indices: Optional[List[int]] = None):
        """
        Initialize the SpyModel class.

        You can either pass a config dict via the alternative constructor `from_config` or pass parameters explicitly.

        Args:
            eval_metric (str): Metric for evaluation of model performance.
            random_state (int): Random seed.
            verbose (int): Verbosity level for CatBoost output.
            fold_index (int, optional): Index of the current fold. Defaults to 1.
            logger (Logger, optional): Logger instance for logging.
            feature_column_name (str, optional): Column name containing feature vectors.
            training_target_column_name (str, optional): Target column name for training data.
            validation_target_column_name (str, optional): Target column name for validation data.
            metrics_list (List[str], optional): List of metrics to evaluate.
            categorical_column_indices (List[int], optional): Indices of features that are categorical.
        """

        self.eval_metric = eval_metric
        self.random_state = random_state
        self.fold_index = fold_index
        self.verbose = verbose
        self.logger = logger
        self.model: Optional[CatBoostClassifier] = None

        # Optional parameters for training; if not provided, they can be set later.
        self.feature_column_name = feature_column_name
        self.training_target_column_name = training_target_column_name
        self.validation_target_column_name = validation_target_column_name
        self.metrics_list = metrics_list
        self.categorical_column_indices = categorical_column_indices or []



    @classmethod
    def from_config(cls, config: Dict[str, Any], logger: Optional[Logger] = None, fold_index: int = 1) -> "SpyModel":
        """
        Alternative constructor that creates a SpyModel instance from a config object.

        Args:
            config (dict): Configuration dictionary.
            logger (Logger, optional): Logger instance.
            fold_index (int): Current fold index.

        Returns:
            SpyModel: An instance of SpyModel with parameters extracted from the config.
        """
        return cls(
            logger=logger,
            fold_index=fold_index,
            random_state = config["general"]["random_seed"],
            eval_metric = config["spy_model"]["eval_metric"],
            verbose = config["general"]["verbose"],
            feature_column_name=config["featurisation"]["combined_features_column_name"],
            training_target_column_name=config["spy_model"]["training_target_column_name"],
            validation_target_column_name=config["spy_model"]["validation_target_column_name"],
            metrics_list = config["spy_model"]["all_metrics"]
        )

    def _prepare_pool(self, data: pd.DataFrame, label_column: Optional[str], feature_column: str) -> Pool:
        """
        Prepare a CatBoost Pool from the dataset.

        Args:
            data (pd.DataFrame): Dataset containing features and target labels.
            label_column (str, optional): Column name for target labels. If None, pool is created without labels.
            feature_column (str): Column name for features.

        Returns:
            Pool: A CatBoost Pool object.

        Raises:
            ValueError: If pool preparation fails.
        """
        try:
            # Expand the list of features into a DataFrame
            pool_data = pd.DataFrame(data[feature_column].to_list())

            # Dynamically infer and store categorical indices if not already set.
            # Warning: Assumes categorical features are appended at the END of the vector.
            if not hasattr(self, "categorical_column_indices") or not self.categorical_column_indices:
                row_sample = data[feature_column].iloc[0]
                cat_count = sum(isinstance(v, str) for v in reversed(row_sample))
                self.categorical_column_indices = list(range(len(row_sample) - cat_count, len(row_sample)))

            return Pool(
                data=pool_data,
                label=data[label_column] if label_column is not None else None,
                cat_features=self.categorical_column_indices
            )

        except Exception as e:
            raise ValueError(f"Error preparing CatBoost Pool: {e}")


    def train(
        self,
        train_data: pd.DataFrame,
        val_data: pd.DataFrame,
        test_data: Optional[pd.DataFrame] = None,
        feature_column: Optional[str] = None,
        training_label_column: Optional[str] = None,
        validation_label_column: Optional[str] = None,
        **kwargs: Any
    ) -> CatBoostClassifier:
        """
        Train the Spy model on the given datasets.

        Args:
            train_data (pd.DataFrame): Training dataset with features and target labels.
            val_data (pd.DataFrame): Validation dataset for monitoring training progress.
            test_data (Optional[pd.DataFrame]): Optional test dataset for evaluation (default: None).
            feature_column (Optional[str]): Column name for features.
            training_label_column (Optional[str]): Column name for target labels in training data.
            validation_label_column (Optional[str]): Column name for target labels in validation data.
            **kwargs: Additional hyperparameters (overriding defaults).

        Returns:
            CatBoostClassifier: Trained CatBoost model.
        """
        feature_column = feature_column or self.feature_column_name
        training_label_column = training_label_column or self.training_target_column_name
        validation_label_column = validation_label_column or self.validation_target_column_name

        if isinstance(test_data, pd.DataFrame):
            test_label_column = validation_label_column
            test_pool = self._prepare_pool(test_data, test_label_column, feature_column)


        train_pool = self._prepare_pool(train_data, training_label_column, feature_column)
        val_pool = self._prepare_pool(val_data, validation_label_column, feature_column)

        # Support for slurm multi-threading
        # Falls back to -1 (CatBoost uses all cores) when not running under SLURM
        n_threads = int(os.getenv("SLURM_CPUS_PER_TASK", -1))

        # Combine default parameters with kwargs (kwargs take precedence)
        combined_params = {
            "random_state": self.random_state,
            "eval_metric": self.eval_metric,
            "verbose": self.verbose,
            "thread_count": n_threads,
            **kwargs,  # Override defaults with values from kwargs
        }

        # Define CatBoost model
        self.model = CatBoostClassifier(**combined_params)

        # Log user-provided and default hyperparameters
        if self.logger:
            self.logger.log_model_hyperparameters(self.model, **combined_params)

        # Create the MLflow callback
        # mlflow_callback = MLflowCatBoostCallback(eval_metric=self.eval_metric, logger=self.logger)

        # Train the model
        self.model.fit(train_pool, eval_set=val_pool, use_best_model=True) # , callbacks=[mlflow_callback]

        # Log the trained model and evaluation results
        if self.logger:
            self.logger.log_model(self.model, f"spy_model_fold_{self.fold_index}")
            # Log additional model attributes
            self.logger.log_model_attributes(self.model)
            if isinstance(test_data, pd.DataFrame):
                test_results = self.evaluate(test_pool)
                self.logger.log_evaluation_metrics(test_results)

        return self.model

    def evaluate(self, test_pool: Pool) -> Dict[str, Any]:
        """
        Evaluate the trained model on a test dataset.

        Args:
            test_pool (Pool): Catboost Pool containing test features and labels.

        Returns:
            Dict[str, Any]: Dictionary of evaluation metrics.

        Raises:
            ValueError: If the model has not been trained yet.
        """
        if not self.model:
            raise ValueError("Model has not been trained yet. Call `train` before evaluating.")

        eval_results = self.model.eval_metrics(
            data=test_pool,
            metrics=self.metrics_list
        )
        return eval_results

    def predict(self, data: pd.DataFrame, feature_column: Optional[str] = None) -> pd.Series:
        """
        Make predictions using the trained Spy model.

        Args:
            data (pd.DataFrame): Dataset containing features for prediction.
            feature_column (str, optional): Column name for features.

        Returns:
            pd.Series: Predicted labels or probabilities.
        """
        feature_column = feature_column or self.feature_column_name
        if not self.model:
            raise ValueError("Model has not been trained yet. Call `train` before making predictions.")

        data_pool = self._prepare_pool(data, label_column=None, feature_column=feature_column)
        return pd.Series(self.model.predict(data_pool))

`__init__(eval_metric, random_state, verbose, fold_index=1, logger=None, feature_column_name=None, training_target_column_name=None, validation_target_column_name=None, metrics_list=None, categorical_column_indices=None)`

Initialize the SpyModel class.

You can either pass a config dict via the alternative constructor from_config or pass parameters explicitly.

Parameters:

Name	Type	Description	Default
`eval_metric`	`str`	Metric for evaluation of model performance.	required
`random_state`	`int`	Random seed.	required
`verbose`	`int`	Verbosity level for CatBoost output.	required
`fold_index`	`int`	Index of the current fold. Defaults to 1.	`1`
`logger`	`Logger`	Logger instance for logging.	`None`
`feature_column_name`	`str`	Column name containing feature vectors.	`None`
`training_target_column_name`	`str`	Target column name for training data.	`None`
`validation_target_column_name`	`str`	Target column name for validation data.	`None`
`metrics_list`	`List[str]`	List of metrics to evaluate.	`None`
`categorical_column_indices`	`List[int]`	Indices of features that are categorical.	`None`

Source code in payn\AugmentationModels\SpyModel\spymodel.py

def __init__(self, eval_metric: str, random_state: int, verbose: int, fold_index: int = 1, logger: Optional[Logger] = None,
             feature_column_name: str = None, training_target_column_name: str = None, validation_target_column_name: str = None,
             metrics_list: Optional[List[str]] = None, categorical_column_indices: Optional[List[int]] = None):
    """
    Initialize the SpyModel class.

    You can either pass a config dict via the alternative constructor `from_config` or pass parameters explicitly.

    Args:
        eval_metric (str): Metric for evaluation of model performance.
        random_state (int): Random seed.
        verbose (int): Verbosity level for CatBoost output.
        fold_index (int, optional): Index of the current fold. Defaults to 1.
        logger (Logger, optional): Logger instance for logging.
        feature_column_name (str, optional): Column name containing feature vectors.
        training_target_column_name (str, optional): Target column name for training data.
        validation_target_column_name (str, optional): Target column name for validation data.
        metrics_list (List[str], optional): List of metrics to evaluate.
        categorical_column_indices (List[int], optional): Indices of features that are categorical.
    """

    self.eval_metric = eval_metric
    self.random_state = random_state
    self.fold_index = fold_index
    self.verbose = verbose
    self.logger = logger
    self.model: Optional[CatBoostClassifier] = None

    # Optional parameters for training; if not provided, they can be set later.
    self.feature_column_name = feature_column_name
    self.training_target_column_name = training_target_column_name
    self.validation_target_column_name = validation_target_column_name
    self.metrics_list = metrics_list
    self.categorical_column_indices = categorical_column_indices or []

`evaluate(test_pool)`

Evaluate the trained model on a test dataset.

Parameters:

Name	Type	Description	Default
`test_pool`	`Pool`	Catboost Pool containing test features and labels.	required

Returns:

Type	Description
`Dict[str, Any]`	Dictionary of evaluation metrics.

Raises:

Type	Description
`ValueError`	If the model has not been trained yet.
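As far as CatBoost's documented behaviour goes, `eval_metrics` returns a dictionary that maps each metric name to a list of values, one per evaluated iteration. The sketch below reduces such a result to the final value per metric. The helper `final_metric_values` and the numbers are illustrative assumptions, not output from a real run.

```python
from typing import Dict, List


def final_metric_values(eval_results: Dict[str, List[float]]) -> Dict[str, float]:
    """Keep only the last recorded value of each metric."""
    return {name: values[-1] for name, values in eval_results.items() if values}


# Hand-written stand-in for the dictionary returned by `evaluate`.
example_results = {"AUC": [0.71, 0.78, 0.81], "Logloss": [0.62, 0.55, 0.51]}
print(final_metric_values(example_results))  # {'AUC': 0.81, 'Logloss': 0.51}
```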

Source code in payn\AugmentationModels\SpyModel\spymodel.py

def evaluate(self, test_pool: Pool) -> Dict[str, Any]:
    """
    Evaluate the trained model on a test dataset.

    Args:
        test_pool (Pool): Catboost Pool containing test features and labels.

    Returns:
        Dict[str, Any]: Dictionary of evaluation metrics.

    Raises:
        ValueError: If the model has not been trained yet.
    """
    if not self.model:
        raise ValueError("Model has not been trained yet. Call `train` before evaluating.")

    eval_results = self.model.eval_metrics(
        data=test_pool,
        metrics=self.metrics_list
    )
    return eval_results

`from_config(config, logger=None, fold_index=1)` `classmethod`

Alternative constructor that creates a SpyModel instance from a config object.

Parameters:

Name	Type	Description	Default
`config`	`dict`	Configuration dictionary.	required
`logger`	`Logger`	Logger instance.	`None`
`fold_index`	`int`	Current fold index.	`1`

Returns:

Name	Type	Description
`SpyModel`	`SpyModel`	An instance of SpyModel with parameters extracted from the config.
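A minimal sketch of the configuration shape that `from_config` reads is shown below. Only the key paths are taken from the source; every value is a placeholder, and the helper `spy_model_kwargs` is a hypothetical stand-in that mirrors the lookup without importing payn.

```python
from typing import Any, Dict

# Placeholder values; only the key paths are taken from `from_config`.
config: Dict[str, Any] = {
    "general": {"random_seed": 42, "verbose": 0},
    "featurisation": {"combined_features_column_name": "features"},
    "spy_model": {
        "eval_metric": "AUC",
        "training_target_column_name": "spy_label",
        "validation_target_column_name": "spy_label",
        "all_metrics": ["AUC", "Logloss"],
    },
}


def spy_model_kwargs(config: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the constructor arguments; a missing key raises KeyError."""
    return {
        "random_state": config["general"]["random_seed"],
        "eval_metric": config["spy_model"]["eval_metric"],
        "verbose": config["general"]["verbose"],
        "feature_column_name": config["featurisation"]["combined_features_column_name"],
        "training_target_column_name": config["spy_model"]["training_target_column_name"],
        "validation_target_column_name": config["spy_model"]["validation_target_column_name"],
        "metrics_list": config["spy_model"]["all_metrics"],
    }


kwargs = spy_model_kwargs(config)
print(kwargs["random_state"], kwargs["eval_metric"])  # 42 AUC
```

Note that `categorical_column_indices` is not read from the config, so instances built this way rely on the automatic inference unless the attribute is set afterwards.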

Source code in payn\AugmentationModels\SpyModel\spymodel.py

@classmethod
def from_config(cls, config: Dict[str, Any], logger: Optional[Logger] = None, fold_index: int = 1) -> "SpyModel":
    """
    Alternative constructor that creates a SpyModel instance from a config object.

    Args:
        config (dict): Configuration dictionary.
        logger (Logger, optional): Logger instance.
        fold_index (int): Current fold index.

    Returns:
        SpyModel: An instance of SpyModel with parameters extracted from the config.
    """
    return cls(
        logger=logger,
        fold_index=fold_index,
        random_state = config["general"]["random_seed"],
        eval_metric = config["spy_model"]["eval_metric"],
        verbose = config["general"]["verbose"],
        feature_column_name=config["featurisation"]["combined_features_column_name"],
        training_target_column_name=config["spy_model"]["training_target_column_name"],
        validation_target_column_name=config["spy_model"]["validation_target_column_name"],
        metrics_list = config["spy_model"]["all_metrics"]
    )

`predict(data, feature_column=None)`

Make predictions using the trained Spy model.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Dataset containing features for prediction.	required
`feature_column`	`str`	Column name for features.	`None`

Returns:

Type	Description
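`Series`	Predicted values. As written, `predict` calls the CatBoost model without a `prediction_type`, so CatBoost's default output (class labels) is returned.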

Source code in payn\AugmentationModels\SpyModel\spymodel.py

def predict(self, data: pd.DataFrame, feature_column: Optional[str] = None) -> pd.Series:
    """
    Make predictions using the trained Spy model.

    Args:
        data (pd.DataFrame): Dataset containing features for prediction.
        feature_column (str, optional): Column name for features.

    Returns:
        pd.Series: Predicted labels or probabilities.
    """
    feature_column = feature_column or self.feature_column_name
    if not self.model:
        raise ValueError("Model has not been trained yet. Call `train` before making predictions.")

    data_pool = self._prepare_pool(data, label_column=None, feature_column=feature_column)
    return pd.Series(self.model.predict(data_pool))

`train(train_data, val_data, test_data=None, feature_column=None, training_label_column=None, validation_label_column=None, **kwargs)`

Train the Spy model on the given datasets.

Parameters:

Name	Type	Description	Default
`train_data`	`DataFrame`	Training dataset with features and target labels.	required
`val_data`	`DataFrame`	Validation dataset for monitoring training progress.	required
`test_data`	`Optional[DataFrame]`	Optional test dataset for evaluation (default: None).	`None`
`feature_column`	`Optional[str]`	Column name for features.	`None`
`training_label_column`	`Optional[str]`	Column name for target labels in training data.	`None`
`validation_label_column`	`Optional[str]`	Column name for target labels in validation data.	`None`
`**kwargs`	`Any`	Additional hyperparameters (overriding defaults).	`{}`

Returns:

Name	Type	Description
`CatBoostClassifier`	`CatBoostClassifier`	Trained CatBoost model.
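Two details of `train` are easy to miss: how the thread count is resolved and how `**kwargs` override the defaults. The sketch below reproduces both with plain dictionaries. The helper names `resolve_thread_count` and `combine_params` are assumptions for this example, and `iterations` is used only as a sample CatBoost hyperparameter.

```python
import os
from typing import Any, Dict, Mapping


def resolve_thread_count(env: Mapping[str, str] = os.environ) -> int:
    """Use SLURM's CPU allocation when present, otherwise -1 (all cores in CatBoost)."""
    return int(env.get("SLURM_CPUS_PER_TASK", -1))


def combine_params(defaults: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Later keys win when dictionaries are unpacked, so overrides replace defaults."""
    return {**defaults, **overrides}


defaults = {
    "random_state": 42,
    "eval_metric": "AUC",
    "verbose": 0,
    "thread_count": resolve_thread_count({"SLURM_CPUS_PER_TASK": "8"}),
}
params = combine_params(defaults, thread_count=2, iterations=500)
print(params["thread_count"], params["iterations"])  # 2 500
print(resolve_thread_count({}))                      # -1
```

Because overrides win, passing `thread_count` through `**kwargs` is one way to cap CPU usage on a shared local machine.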

Source code in payn\AugmentationModels\SpyModel\spymodel.py

def train(
    self,
    train_data: pd.DataFrame,
    val_data: pd.DataFrame,
    test_data: Optional[pd.DataFrame] = None,
    feature_column: Optional[str] = None,
    training_label_column: Optional[str] = None,
    validation_label_column: Optional[str] = None,
    **kwargs: Any
) -> CatBoostClassifier:
    """
    Train the Spy model on the given datasets.

    Args:
        train_data (pd.DataFrame): Training dataset with features and target labels.
        val_data (pd.DataFrame): Validation dataset for monitoring training progress.
        test_data (Optional[pd.DataFrame]): Optional test dataset for evaluation (default: None).
        feature_column (Optional[str]): Column name for features.
        training_label_column (Optional[str]): Column name for target labels in training data.
        validation_label_column (Optional[str]): Column name for target labels in validation data.
        **kwargs: Additional hyperparameters (overriding defaults).

    Returns:
        CatBoostClassifier: Trained CatBoost model.
    """
    feature_column = feature_column or self.feature_column_name
    training_label_column = training_label_column or self.training_target_column_name
    validation_label_column = validation_label_column or self.validation_target_column_name

    if isinstance(test_data, pd.DataFrame):
        test_label_column = validation_label_column
        test_pool = self._prepare_pool(test_data, test_label_column, feature_column)


    train_pool = self._prepare_pool(train_data, training_label_column, feature_column)
    val_pool = self._prepare_pool(val_data, validation_label_column, feature_column)

    # Support for slurm multi-threading
    # Falls back to -1 (CatBoost uses all cores) when not running under SLURM
    n_threads = int(os.getenv("SLURM_CPUS_PER_TASK", -1))

    # Combine default parameters with kwargs (kwargs take precedence)
    combined_params = {
        "random_state": self.random_state,
        "eval_metric": self.eval_metric,
        "verbose": self.verbose,
        "thread_count": n_threads,
        **kwargs,  # Override defaults with values from kwargs
    }

    # Define CatBoost model
    self.model = CatBoostClassifier(**combined_params)

    # Log user-provided and default hyperparameters
    if self.logger:
        self.logger.log_model_hyperparameters(self.model, **combined_params)

    # Create the MLflow callback
    # mlflow_callback = MLflowCatBoostCallback(eval_metric=self.eval_metric, logger=self.logger)

    # Train the model
    self.model.fit(train_pool, eval_set=val_pool, use_best_model=True) # , callbacks=[mlflow_callback]

    # Log the trained model and evaluation results
    if self.logger:
        self.logger.log_model(self.model, f"spy_model_fold_{self.fold_index}")
        # Log additional model attributes
        self.logger.log_model_attributes(self.model)
        if isinstance(test_data, pd.DataFrame):
            test_results = self.evaluate(test_pool)
            self.logger.log_evaluation_metrics(test_results)

    return self.model

Reliable Negative Identification (`payn.AugmentationModels.SpyModel.augmen_negative_identifier`)

This module is the decision-making engine of the PU learning workflow. It leverages the trained Spy Model to filter the unlabeled dataset, identifying a subset of reliable negatives that are statistically distinct from the positive class.

Dynamic Thresholding: Instead of using a fixed probability threshold (e.g., 0.5), the module derives a cutoff from the predicted probabilities of the spies (known positives injected into the unlabeled set). A user-defined spy_tolerance (default 0.05) places the threshold so that about 5% of the spies score below it, i.e., about 95% of the spies are still recognized as positive. If the computed threshold exceeds 0.5, it is capped at 0.5. Unlabeled data points scoring below the threshold are classified as reliable negatives, which makes it unlikely that they are latent positives.
Classification: The module segments the data into three distinct categories:
1. Known Positives: Original true positives and recovered spies.
2. Reliable Negatives: Unlabeled data points with predicted probabilities below the calculated threshold. These form the clean negative set for downstream applications such as regression model training.
3. Undecisives: Unlabeled data points with probabilities above the threshold but not labeled as positive. These are discarded to keep noisy negatives out of the negative set.
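The thresholding and the three-way split can be sketched in a few lines. This is an illustration under stated assumptions, not the module's implementation: the index rule `int(spy_tolerance * n)`, the function names, and the role strings are all chosen for this example.

```python
from typing import List


def spy_threshold(spy_probabilities: List[float], spy_tolerance: float = 0.05, cap: float = 0.5) -> float:
    """Pick the cutoff below which roughly `spy_tolerance` of the spies fall."""
    if not spy_probabilities:
        raise ValueError("No spy probabilities supplied.")
    ordered = sorted(spy_probabilities)
    # Assumed index rule; the exact formula used by the module may differ.
    index = min(int(spy_tolerance * len(ordered)), len(ordered) - 1)
    # The cap keeps the threshold from exceeding 0.5.
    return min(ordered[index], cap)


def assign_role(probability: float, is_known_positive: bool, threshold: float) -> str:
    """Map one data point to one of the three categories."""
    if is_known_positive:  # original positives and recovered spies
        return "known positive"
    return "reliable negative" if probability < threshold else "undecisive"


spies = [0.10, 0.35, 0.40, 0.45, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
threshold = spy_threshold(spies, spy_tolerance=0.2)
print(threshold)                            # 0.4
print(assign_role(0.05, False, threshold))  # reliable negative
print(assign_role(0.42, False, threshold))  # undecisive
print(spy_threshold([0.8, 0.9]))            # 0.5 (capped)
```

With a tolerance of 0.2 and ten spies, two spies (0.10 and 0.35) fall below the cutoff, which matches the intended 20% leakage.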

`AugmenNegativeIdentifier`

Identifies augmented (augmen_) reliable negatives using the Spy technique and an optimized threshold.

Attributes:

Name	Type	Description
`model`	`CatBoostClassifier`	The trained Spy model.
`spy_tolerance`	`float`	The acceptable proportion of spies within the reliable negatives.
`logger`	`Logger`	Logger instance for logging messages and artifacts.
`feature_column_name`	`str`	Default column name for input features.
`mod_data_point_role_column_name`	`str`	Default column name indicating each data point's role.
`probability_class_1_column_name`	`str`	Default column name for the predicted probability of class 1.
`mod_prediction_class_column_name`	`str`	Default column name for the predicted class.
`augmented_bin_column_name`	`str`	Default column name for the binary augmented label.
`augmented_role_column_name`	`str`	Default column name for the augmented role label.

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py

class AugmenNegativeIdentifier:
    """
    Identifies augmented (augmen_) reliable negatives using the Spy technique and an optimized threshold.

    Attributes:
        model (CatBoostClassifier): The trained Spy model.
        spy_tolerance (float): The acceptable proportion of spies within the reliable negatives.
        logger (Logger, optional): Logger instance for logging messages and artifacts.
        feature_column_name (str, optional): Default column name for input features.
        mod_data_point_role_column_name (str, optional): Default column name indicating each data point's role.
        probability_class_1_column_name (str, optional): Default column name for the predicted probability of class 1.
        mod_prediction_class_column_name (str, optional): Default column name for the predicted class.
        augmented_bin_column_name (str, optional): Default column name for the binary augmented label.
        augmented_role_column_name (str, optional): Default column name for the augmented role label.
    """

    def __init__(self, model: CatBoostClassifier, spy_tolerance: float = 0.05, logger: Optional[Logger] = None,
                 feature_column_name: Optional[str] = None,
                 mod_data_point_role_column_name: Optional[str] = None,
                 probability_class_1_column_name: Optional[str] = None,
                 mod_prediction_class_column_name: Optional[str] = None,
                 augmented_bin_column_name: Optional[str] = None,
                 augmented_role_column_name: Optional[str] = None) -> None:
        """
        Initialize the AugmenNegativeIdentifier.

        Args:
            model (CatBoostClassifier): Trained Spy model.
            spy_tolerance (float, optional): Tolerance for spy inclusion in negatives.
            logger (Logger, optional): Logger instance for tracking and logging.
            feature_column_name (str, optional): Column name for input features.
            mod_data_point_role_column_name (str, optional): Column name for data point role.
            probability_class_1_column_name (str, optional): Column name for probability predictions for class 1.
            mod_prediction_class_column_name (str, optional): Column name for predicted class.
            augmented_bin_column_name (str, optional): Column name for binary augmented labels.
            augmented_role_column_name (str, optional): Column name for augmented role labels.
        """
        self.model = model
        self.spy_tolerance = spy_tolerance
        self.logger = logger

        # Optional parameters for classification; if not provided, they can be set later.
        self.feature_column_name = feature_column_name
        self.mod_data_point_role_column_name = mod_data_point_role_column_name
        self.probability_class_1_column_name = probability_class_1_column_name
        self.mod_prediction_class_column_name = mod_prediction_class_column_name
        self.augmented_bin_column_name = augmented_bin_column_name
        self.augmented_role_column_name = augmented_role_column_name

    @classmethod
    def from_config(cls, config: Dict[str, Any], model: CatBoostClassifier,
                    logger: Optional[Logger] = None) -> "AugmenNegativeIdentifier":
        """
        Alternative constructor that extracts the required parameters from a config object.

        The configuration dictionary is expected to have keys "spy_splitting" and "meta_columns" with appropriate entries.

        Args:
            config (Dict[str, Any]): Configuration dictionary.
            model (CatBoostClassifier): Trained Spy model.
            logger (Logger, optional): Logger instance.

        Returns:
            AugmenNegativeIdentifier: A new instance configured from the provided config.
        """

        return cls(
            model=model,
            spy_tolerance=config["spy_splitting"]["spy_tolerance"],
            logger=logger,
            feature_column_name = config["featurisation"]["combined_features_column_name"],
            mod_data_point_role_column_name = config["meta_columns"]["meta_mod_data_point_role"],
            probability_class_1_column_name = config["meta_columns"]["meta_mod_probability_1"],
            mod_prediction_class_column_name = config["meta_columns"]["meta_mod_prediction_class"],
            augmented_bin_column_name = config["meta_columns"]["meta_augmented_bin"],
            augmented_role_column_name = config["meta_columns"]["meta_augmented_role"]
        )

    def predict_augmen_probabilities(
        self,
        spy_inf_data: pd.DataFrame,
        feature_column_name: Optional[str] = None,
        mod_prediction_class_column_name: Optional[str] = None,
        probability_class_1_column_name: Optional[str] = None
    ) -> pd.DataFrame:
        """
        Predict probabilities and labels for spy-infused training data.

        The method adds two columns to a copy of the input DataFrame:
        one for predicted classes and one for the predicted probabilities for class 1.

        Args:
            spy_inf_data (pd.DataFrame): Spy-infused training data.
            feature_column_name (str): Name of the column containing input features.
            mod_prediction_class_column_name (Optional[str]): Override for predicted class column name.
            probability_class_1_column_name (Optional[str]): Override for probability column name.

        Returns:
            pd.DataFrame: A new DataFrame with predicted class and probability columns appended.

        Raises:
            KeyError: If the feature column is not found in the data.
            Exception: Propagates any exceptions raised during prediction.
        """
        mod_pred_col = mod_prediction_class_column_name or self.mod_prediction_class_column_name
        prob_class1_col = probability_class_1_column_name or self.probability_class_1_column_name
        feature_column_name = feature_column_name or self.feature_column_name
        if feature_column_name not in spy_inf_data.columns:
            raise KeyError(f"Feature column '{feature_column_name}' not found in data.")

        result_df = spy_inf_data.copy()
        features = result_df[feature_column_name].tolist()

        try:
            pred_class = self.model.predict(features, prediction_type='Class')
            # CatBoost .predict(prediction_type='Probability') returns shape (N, 2), we want class 1
            pred_prob = self.model.predict(features, prediction_type='Probability')[:, 1]
        except Exception as e:
            if self.logger:
                self.logger.log_message(f"Error during prediction: {e}")
            raise

        result_df[mod_pred_col] = pred_class
        result_df[prob_class1_col] = pred_prob
        return result_df

    def find_augmen_threshold(
        self,
        spy_inf_data: pd.DataFrame,
        mod_data_point_role_column_name: Optional[str] = None,
        probability_class_1_column_name: Optional[str] = None
    ) -> float:
        """
        Find the optimal threshold for classifying augmented negatives.

        The threshold is determined by sorting the predicted probabilities for examples with
        a data point role of "unlabeled spy" and selecting the value at an index defined by the spy tolerance.
        If the computed threshold exceeds 0.5, it is set to 0.5.

        Args:
            spy_inf_data (pd.DataFrame): Data with predicted probabilities.
            mod_data_point_role_column_name (str, optional): Override for the data point role column name.
            probability_class_1_column_name (str, optional): Override for the probability column name.

        Returns:
            float: The determined threshold.

        Raises:
            KeyError: If the role column is not found in the data.
            ValueError: If no spy data is found.
        """
        role_col = mod_data_point_role_column_name or self.mod_data_point_role_column_name
        prob_class1_col = probability_class_1_column_name or self.probability_class_1_column_name

        if role_col not in spy_inf_data.columns:
            raise KeyError(f"Expected column '{role_col}' not found in data.")

        spy_data = spy_inf_data[spy_inf_data[role_col] == "unlabeled spy"]
        probability_values = spy_data[prob_class1_col].values

        if len(probability_values) == 0:
            raise ValueError("No spy data found to calculate threshold.")

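        # Position (in ascending order) of the spy score used as the threshold:
        # roughly a spy_tolerance fraction of the spies scores below it.
        num_spies_to_catch = int(len(spy_data) * (self.spy_tolerance))
        sorted_probabilities = sorted(probability_values)
        # Requires spy_tolerance < 1.0; an index equal to len(spy_data) raises IndexError.
        threshold = sorted_probabilities[num_spies_to_catch]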

        if threshold > 0.5:
            threshold = 0.5
            if self.logger:
                self.logger.log_message("Threshold is higher than 0.5; setting to 0.5")
        if self.logger:
            self.logger.log_threshold(threshold)
        return threshold

    def filter_augmented_negatives_and_known_positives(
        self,
        spy_inf_data: pd.DataFrame,
        augmented_role_column_name: Optional[str] = None,
        mod_data_point_role_column_name: Optional[str] = None
    ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        """
        Filter the augmented negatives from spy-infused data by excluding known positives.
        Also, return the set of "undecisive" datapoints.

        Known positives are defined as rows where the data point role (from mod_data_point_role_column_name)
        is "unlabeled spy" or "true positive". For these rows, the augmented role is forcibly set to "known positive".
        The method returns three DataFrames:
          - filtered_augmented_negatives: rows with augmented role "reliable negative"
          - known_positives: rows with role in ["unlabeled spy", "true positive"]
          - undecisives: rows with augmented role "undecisive"

        Args:
            spy_inf_data (pd.DataFrame): Spy-infused DataFrame containing meta columns.
            augmented_role_column_name (str, optional): Override for the augmented role column name.
            mod_data_point_role_column_name (str, optional): Override for the data point role column name.

        Returns:
            Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: (filtered_negatives, known_positives, undecisives).

        Raises:
            KeyError: If expected columns are missing.
        """
        role_col = mod_data_point_role_column_name or self.mod_data_point_role_column_name
        aug_role_col = augmented_role_column_name or self.augmented_role_column_name

        for col in [role_col, aug_role_col]:
            if col not in spy_inf_data.columns:
                raise KeyError(f"Expected column '{col}' not found in data.")

        known_positive_roles = ["unlabeled spy", "true positive"]
        updated_df = spy_inf_data.copy()

        updated_df[aug_role_col] = np.where(
            updated_df[role_col].isin(known_positive_roles),
            "known positive",
            updated_df[aug_role_col]
        )
        known_positives = updated_df[updated_df[role_col].isin(known_positive_roles)].copy()
        filtered_negatives = updated_df[updated_df[aug_role_col] == "reliable negative"].copy()
        undecisives = updated_df[updated_df[aug_role_col] == "undecisive"].copy() # Only needed for Evaluation

        if self.logger:
            self.logger._log_dataframe_as_artifact(updated_df, "spy_inf_data.csv")
            self.logger._log_dataframe_as_artifact(known_positives, "true_positives_and_unlabeled_spy.csv")
            self.logger._log_dataframe_as_artifact(filtered_negatives, "augmented_negatives.csv")
            self.logger._log_dataframe_as_artifact(undecisives, "undecisive_datapoints.csv")

        return filtered_negatives, known_positives, undecisives

    def get_augmen_negatives_and_known_positives(
        self,
        spy_inf_data: pd.DataFrame,
        threshold: float,
        augmented_bin_column_name: Optional[str] = None,
        augmented_role_column_name: Optional[str] = None,
        probability_class_1_column_name: Optional[str] = None,
        mod_data_point_role_column_name: Optional[str] = None
    ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        """
        Extract augmented reliable negatives, known positives, and undecisive datapoints based on a threshold.

        The method creates a new binary column (augmented_bin_column_name) for augmented labels based on whether
        the predicted probability (from probability_class_1_column_name) exceeds the threshold. It then assigns an augmented
        role ("reliable negative" if binary label is 0; otherwise "undecisive") and calls the filtering function to separate
        known positives from reliable negatives and to collect undecisive datapoints.

        Args:
            spy_inf_data (pd.DataFrame): Spy-infused training data with probability predictions.
            threshold (float): Threshold for binary classification.
            augmented_bin_column_name (str, optional): Override for the binary augmented column name.
            augmented_role_column_name (str, optional): Override for the augmented role column name.
            probability_class_1_column_name (str, optional): Override for the probability column name.
            mod_data_point_role_column_name (str, optional): Override for the data point role column name.

        Returns:
            Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: (augmen_reliable_negatives, known_positives, undecisives).
        """
        aug_bin_col = augmented_bin_column_name or self.augmented_bin_column_name
        aug_role_col = augmented_role_column_name or self.augmented_role_column_name
        prob_class1_col = probability_class_1_column_name or self.probability_class_1_column_name
        mod_data_role_col = mod_data_point_role_column_name or self.mod_data_point_role_column_name

        df = spy_inf_data.copy()
        # Vectorised labelling: 1 if the score is strictly above the threshold, else 0.
        df[aug_bin_col] = (df[prob_class1_col] > threshold).astype(int)
        df[aug_role_col] = np.where(df[aug_bin_col] == 0, "reliable negative", "undecisive")

        return self.filter_augmented_negatives_and_known_positives(
            spy_inf_data=df,
            augmented_role_column_name=aug_role_col,
            mod_data_point_role_column_name=mod_data_role_col
        )

`__init__(model, spy_tolerance=0.05, logger=None, feature_column_name=None, mod_data_point_role_column_name=None, probability_class_1_column_name=None, mod_prediction_class_column_name=None, augmented_bin_column_name=None, augmented_role_column_name=None)`

Initialize the AugmenNegativeIdentifier.

Parameters:

Name	Type	Description	Default
`model`	`CatBoostClassifier`	Trained Spy model.	required
`spy_tolerance`	`float`	Fraction of spies tolerated below the reliable-negative threshold; sets the index used by find_augmen_threshold.	`0.05`
`logger`	`Logger`	Logger instance for tracking and logging.	`None`
`feature_column_name`	`str`	Column name for input features.	`None`
`mod_data_point_role_column_name`	`str`	Column name for data point role.	`None`
`probability_class_1_column_name`	`str`	Column name for probability predictions for class 1.	`None`
`mod_prediction_class_column_name`	`str`	Column name for predicted class.	`None`
`augmented_bin_column_name`	`str`	Column name for binary augmented labels.	`None`
`augmented_role_column_name`	`str`	Column name for augmented role labels.	`None`

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py

def __init__(self, model: CatBoostClassifier, spy_tolerance: float = 0.05, logger: Optional[Logger] = None,
             feature_column_name: Optional[str] = None,
             mod_data_point_role_column_name: Optional[str] = None,
             probability_class_1_column_name: Optional[str] = None,
             mod_prediction_class_column_name: Optional[str] = None,
             augmented_bin_column_name: Optional[str] = None,
             augmented_role_column_name: Optional[str] = None) -> None:
    """
    Initialize the AugmenNegativeIdentifier.

    Args:
        model (CatBoostClassifier): Trained Spy model.
        spy_tolerance (float, optional): Tolerance for spy inclusion in negatives.
        logger (Logger, optional): Logger instance for tracking and logging.
        feature_column_name (str, optional): Column name for input features.
        mod_data_point_role_column_name (str, optional): Column name for data point role.
        probability_class_1_column_name (str, optional): Column name for probability predictions for class 1.
        mod_prediction_class_column_name (str, optional): Column name for predicted class.
        augmented_bin_column_name (str, optional): Column name for binary augmented labels.
        augmented_role_column_name (str, optional): Column name for augmented role labels.
    """
    self.model = model
    self.spy_tolerance = spy_tolerance
    self.logger = logger

    # Optional parameters for classification; if not provided, they can be set later.
    self.feature_column_name = feature_column_name
    self.mod_data_point_role_column_name = mod_data_point_role_column_name
    self.probability_class_1_column_name = probability_class_1_column_name
    self.mod_prediction_class_column_name = mod_prediction_class_column_name
    self.augmented_bin_column_name = augmented_bin_column_name
    self.augmented_role_column_name = augmented_role_column_name

`filter_augmented_negatives_and_known_positives(spy_inf_data, augmented_role_column_name=None, mod_data_point_role_column_name=None)`

Filter the augmented negatives from spy-infused data by excluding known positives. Also, return the set of "undecisive" datapoints.

Known positives are defined as rows where the data point role (from mod_data_point_role_column_name) is "unlabeled spy" or "true positive". For these rows, the augmented role is forcibly set to "known positive". The method returns three DataFrames:

- filtered_augmented_negatives: rows with augmented role "reliable negative"
- known_positives: rows with role in ["unlabeled spy", "true positive"]
- undecisives: rows with augmented role "undecisive"

Parameters:

Name	Type	Description	Default
`spy_inf_data`	`DataFrame`	Spy-infused DataFrame containing meta columns.	required
`augmented_role_column_name`	`str`	Override for the augmented role column name.	`None`
`mod_data_point_role_column_name`	`str`	Override for the data point role column name.	`None`

Returns:

Type	Description
`Tuple[DataFrame, DataFrame, DataFrame]`	(filtered_negatives, known_positives, undecisives).

Raises:

Type	Description
`KeyError`	If expected columns are missing.
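
The three-way split described above can be sketched on a toy DataFrame. This is a standalone illustration, not a call into the library: the column names `data_point_role` and `augmented_role`, the role value "unlabeled", and the four rows are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and rows, for illustration only.
ROLE_COL, AUG_ROLE_COL = "data_point_role", "augmented_role"
df = pd.DataFrame({
    ROLE_COL: ["true positive", "unlabeled spy", "unlabeled", "unlabeled"],
    AUG_ROLE_COL: ["undecisive", "reliable negative", "reliable negative", "undecisive"],
})

known_positive_roles = ["unlabeled spy", "true positive"]
is_known_positive = df[ROLE_COL].isin(known_positive_roles)

# Known positives override whatever augmented role the threshold assigned.
df[AUG_ROLE_COL] = np.where(is_known_positive, "known positive", df[AUG_ROLE_COL])

known_positives = df[is_known_positive]
reliable_negatives = df[df[AUG_ROLE_COL] == "reliable negative"]
undecisives = df[df[AUG_ROLE_COL] == "undecisive"]

print(len(known_positives), len(reliable_negatives), len(undecisives))
```

Note that the spy in the second row started out labelled "reliable negative" but ends up among the known positives, which is why the override has to run before the reliable negatives are selected.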

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py

def filter_augmented_negatives_and_known_positives(
    self,
    spy_inf_data: pd.DataFrame,
    augmented_role_column_name: Optional[str] = None,
    mod_data_point_role_column_name: Optional[str] = None
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Filter the augmented negatives from spy-infused data by excluding known positives.
    Also, return the set of "undecisive" datapoints.

    Known positives are defined as rows where the data point role (from mod_data_point_role_column_name)
    is "unlabeled spy" or "true positive". For these rows, the augmented role is forcibly set to "known positive".
    The method returns three DataFrames:
      - filtered_augmented_negatives: rows with augmented role "reliable negative"
      - known_positives: rows with role in ["unlabeled spy", "true positive"]
      - undecisives: rows with augmented role "undecisive"

    Args:
        spy_inf_data (pd.DataFrame): Spy-infused DataFrame containing meta columns.
        augmented_role_column_name (str, optional): Override for the augmented role column name.
        mod_data_point_role_column_name (str, optional): Override for the data point role column name.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: (filtered_negatives, known_positives, undecisives).

    Raises:
        KeyError: If expected columns are missing.
    """
    role_col = mod_data_point_role_column_name or self.mod_data_point_role_column_name
    aug_role_col = augmented_role_column_name or self.augmented_role_column_name

    for col in [role_col, aug_role_col]:
        if col not in spy_inf_data.columns:
            raise KeyError(f"Expected column '{col}' not found in data.")

    known_positive_roles = ["unlabeled spy", "true positive"]
    updated_df = spy_inf_data.copy()

    updated_df[aug_role_col] = np.where(
        updated_df[role_col].isin(known_positive_roles),
        "known positive",
        updated_df[aug_role_col]
    )
    known_positives = updated_df[updated_df[role_col].isin(known_positive_roles)].copy()
    filtered_negatives = updated_df[updated_df[aug_role_col] == "reliable negative"].copy()
    undecisives = updated_df[updated_df[aug_role_col] == "undecisive"].copy() # Only needed for Evaluation

    if self.logger:
        self.logger._log_dataframe_as_artifact(updated_df, "spy_inf_data.csv")
        self.logger._log_dataframe_as_artifact(known_positives, "true_positives_and_unlabeled_spy.csv")
        self.logger._log_dataframe_as_artifact(filtered_negatives, "augmented_negatives.csv")
        self.logger._log_dataframe_as_artifact(undecisives, "undecisive_datapoints.csv")

    return filtered_negatives, known_positives, undecisives

`find_augmen_threshold(spy_inf_data, mod_data_point_role_column_name=None, probability_class_1_column_name=None)`

Find the optimal threshold for classifying augmented negatives.

The threshold is determined by sorting, in ascending order, the predicted probabilities of all rows whose data point role is "unlabeled spy" and selecting the value at index int(number_of_spies * spy_tolerance). Roughly a spy_tolerance fraction of the spies therefore scores below the threshold. If the computed threshold exceeds 0.5, it is capped at 0.5.

Parameters:

Name	Type	Description	Default
`spy_inf_data`	`DataFrame`	Data with predicted probabilities.	required
`mod_data_point_role_column_name`	`str`	Override for the data point role column name.	`None`
`probability_class_1_column_name`	`str`	Override for the probability column name.	`None`

Returns:

Type	Description
`float`	The determined threshold.

Raises:

Type	Description
`KeyError`	If the role column is not found in the data.
`ValueError`	If no spy data is found.
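
The index-based selection can be illustrated with a small standalone sketch. It does not call the library: `spy_threshold` is a hypothetical helper and the scores are made-up values, but the steps follow the description above (ascending sort, index `int(n * spy_tolerance)`, cap at 0.5).

```python
from typing import Sequence


def spy_threshold(spy_scores: Sequence[float], spy_tolerance: float = 0.05) -> float:
    """Standalone sketch of the spy-based threshold rule (hypothetical helper)."""
    if len(spy_scores) == 0:
        raise ValueError("No spy data found to calculate threshold.")
    # Ascending sort: the lowest-scoring spies come first.
    ordered = sorted(spy_scores)
    # Position of the spy whose score becomes the threshold.
    index = int(len(ordered) * spy_tolerance)
    # Cap the result at 0.5, as the documented method does.
    return min(ordered[index], 0.5)


# Ten made-up spy scores; with spy_tolerance=0.2 the index is int(10 * 0.2) = 2.
scores = [0.91, 0.12, 0.35, 0.78, 0.66, 0.08, 0.83, 0.95, 0.41, 0.72]
threshold = spy_threshold(scores, spy_tolerance=0.2)
print(threshold)  # third-lowest score

# With only confident spies, the raw value (0.7) is capped.
capped = spy_threshold([0.9, 0.8, 0.7], spy_tolerance=0.05)
print(capped)
```

In the first call two of the ten spies (0.08 and 0.12) score below the threshold, matching the 20% tolerance; in the second the cap keeps the threshold from exceeding 0.5.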

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py

def find_augmen_threshold(
    self,
    spy_inf_data: pd.DataFrame,
    mod_data_point_role_column_name: Optional[str] = None,
    probability_class_1_column_name: Optional[str] = None
) -> float:
    """
    Find the optimal threshold for classifying augmented negatives.

    The threshold is determined by sorting the predicted probabilities for examples with
    a data point role of "unlabeled spy" and selecting the value at an index defined by the spy tolerance.
    If the computed threshold exceeds 0.5, it is set to 0.5.

    Args:
        spy_inf_data (pd.DataFrame): Data with predicted probabilities.
        mod_data_point_role_column_name (str, optional): Override for the data point role column name.
        probability_class_1_column_name (str, optional): Override for the probability column name.

    Returns:
        float: The determined threshold.

    Raises:
        KeyError: If the role column is not found in the data.
        ValueError: If no spy data is found.
    """
    role_col = mod_data_point_role_column_name or self.mod_data_point_role_column_name
    prob_class1_col = probability_class_1_column_name or self.probability_class_1_column_name

    if role_col not in spy_inf_data.columns:
        raise KeyError(f"Expected column '{role_col}' not found in data.")

    spy_data = spy_inf_data[spy_inf_data[role_col] == "unlabeled spy"]
    probability_values = spy_data[prob_class1_col].values

    if len(probability_values) == 0:
        raise ValueError("No spy data found to calculate threshold.")

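    # Position (in ascending order) of the spy score used as the threshold:
    # roughly a spy_tolerance fraction of the spies scores below it.
    num_spies_to_catch = int(len(spy_data) * (self.spy_tolerance))
    sorted_probabilities = sorted(probability_values)
    # Requires spy_tolerance < 1.0; an index equal to len(spy_data) raises IndexError.
    threshold = sorted_probabilities[num_spies_to_catch]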

    if threshold > 0.5:
        threshold = 0.5
        if self.logger:
            self.logger.log_message("Threshold is higher than 0.5; setting to 0.5")
    if self.logger:
        self.logger.log_threshold(threshold)
    return threshold

`from_config(config, model, logger=None)` `classmethod`

Alternative constructor that extracts the required parameters from a config object.

The configuration dictionary is expected to have the keys "spy_splitting", "featurisation", and "meta_columns" with appropriate entries.

Parameters:

Name	Type	Description	Default
`config`	`Dict[str, Any]`	Configuration dictionary.	required
`model`	`CatBoostClassifier`	Trained Spy model.	required
`logger`	`Logger`	Logger instance.	`None`

Returns:

Type	Description
`AugmenNegativeIdentifier`	A new instance configured from the provided config.
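
As a rough guide to the expected layout, the sketch below builds a dictionary with the keys that from_config reads. The key names are taken from the lookups in the source listing; every value is a made-up placeholder, and the `required` check is a hypothetical convenience that is not part of the library.

```python
from typing import Any, Dict, List

# Key names follow the lookups in from_config; all values are placeholders.
config: Dict[str, Any] = {
    "spy_splitting": {"spy_tolerance": 0.05},
    "featurisation": {"combined_features_column_name": "combined_features"},
    "meta_columns": {
        "meta_mod_data_point_role": "mod_data_point_role",
        "meta_mod_probability_1": "mod_probability_1",
        "meta_mod_prediction_class": "mod_prediction_class",
        "meta_augmented_bin": "augmented_bin",
        "meta_augmented_role": "augmented_role",
    },
}

# Hypothetical pre-flight check: list every (section, key) pair from_config needs.
required = {
    "spy_splitting": ["spy_tolerance"],
    "featurisation": ["combined_features_column_name"],
    "meta_columns": [
        "meta_mod_data_point_role",
        "meta_mod_probability_1",
        "meta_mod_prediction_class",
        "meta_augmented_bin",
        "meta_augmented_role",
    ],
}
missing: List[str] = [
    f"{section}.{key}"
    for section, keys in required.items()
    for key in keys
    if key not in config.get(section, {})
]
print(missing)
```

With such a dictionary and a trained model in hand, the call would presumably look like `AugmenNegativeIdentifier.from_config(config, model=trained_model)`, where `trained_model` stands for whatever CatBoostClassifier the Spy Model produced.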

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py

@classmethod
def from_config(cls, config: Dict[str, Any], model: CatBoostClassifier,
                logger: Optional[Logger] = None) -> "AugmenNegativeIdentifier":
    """
    Alternative constructor that extracts the required parameters from a config object.

    The configuration dictionary is expected to have the keys "spy_splitting", "featurisation", and "meta_columns" with appropriate entries.

    Args:
        config (Dict[str, Any]): Configuration dictionary.
        model (CatBoostClassifier): Trained Spy model.
        logger (Logger, optional): Logger instance.

    Returns:
        AugmenNegativeIdentifier: A new instance configured from the provided config.
    """

    return cls(
        model=model,
        spy_tolerance=config["spy_splitting"]["spy_tolerance"],
        logger=logger,
        feature_column_name = config["featurisation"]["combined_features_column_name"],
        mod_data_point_role_column_name = config["meta_columns"]["meta_mod_data_point_role"],
        probability_class_1_column_name = config["meta_columns"]["meta_mod_probability_1"],
        mod_prediction_class_column_name = config["meta_columns"]["meta_mod_prediction_class"],
        augmented_bin_column_name = config["meta_columns"]["meta_augmented_bin"],
        augmented_role_column_name = config["meta_columns"]["meta_augmented_role"]
    )

`get_augmen_negatives_and_known_positives(spy_inf_data, threshold, augmented_bin_column_name=None, augmented_role_column_name=None, probability_class_1_column_name=None, mod_data_point_role_column_name=None)`

Extract augmented reliable negatives, known positives, and undecisive datapoints based on a threshold.

The method creates a new binary column (augmented_bin_column_name) for augmented labels based on whether the predicted probability (from probability_class_1_column_name) exceeds the threshold. It then assigns an augmented role ("reliable negative" if binary label is 0; otherwise "undecisive") and calls the filtering function to separate known positives from reliable negatives and to collect undecisive datapoints.

Parameters:

Name	Type	Description	Default
`spy_inf_data`	`DataFrame`	Spy-infused training data with probability predictions.	required
`threshold`	`float`	Threshold for binary classification.	required
`augmented_bin_column_name`	`str`	Override for the binary augmented column name.	`None`
`augmented_role_column_name`	`str`	Override for the augmented role column name.	`None`
`probability_class_1_column_name`	`str`	Override for the probability column name.	`None`
`mod_data_point_role_column_name`	`str`	Override for the data point role column name.	`None`

Returns:

Type	Description
`Tuple[DataFrame, DataFrame, DataFrame]`	(augmen_reliable_negatives, known_positives, undecisives).
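
The thresholding step on its own can be sketched as follows. This is an illustration under assumed names: the column names `probability_1`, `augmented_bin`, and `augmented_role` and the four scores are invented, and the snippet reproduces only the labelling rule, not the subsequent filtering.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; in the library they come from the configuration.
PROB_COL, BIN_COL, ROLE_COL = "probability_1", "augmented_bin", "augmented_role"

df = pd.DataFrame({PROB_COL: [0.05, 0.35, 0.36, 0.90]})
threshold = 0.35

# 1 if the score is strictly above the threshold, else 0.
df[BIN_COL] = (df[PROB_COL] > threshold).astype(int)
# Binary label 0 maps to "reliable negative"; 1 maps to "undecisive".
df[ROLE_COL] = np.where(df[BIN_COL] == 0, "reliable negative", "undecisive")

print(df)
```

Because the comparison is strict, a row scoring exactly at the threshold (0.35 here) is still treated as a reliable negative.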

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py

def get_augmen_negatives_and_known_positives(
    self,
    spy_inf_data: pd.DataFrame,
    threshold: float,
    augmented_bin_column_name: Optional[str] = None,
    augmented_role_column_name: Optional[str] = None,
    probability_class_1_column_name: Optional[str] = None,
    mod_data_point_role_column_name: Optional[str] = None
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Extract augmented reliable negatives, known positives, and undecisive datapoints based on a threshold.

    The method creates a new binary column (augmented_bin_column_name) for augmented labels based on whether
    the predicted probability (from probability_class_1_column_name) exceeds the threshold. It then assigns an augmented
    role ("reliable negative" if binary label is 0; otherwise "undecisive") and calls the filtering function to separate
    known positives from reliable negatives and to collect undecisive datapoints.

    Args:
        spy_inf_data (pd.DataFrame): Spy-infused training data with probability predictions.
        threshold (float): Threshold for binary classification.
        augmented_bin_column_name (str, optional): Override for the binary augmented column name.
        augmented_role_column_name (str, optional): Override for the augmented role column name.
        probability_class_1_column_name (str, optional): Override for the probability column name.
        mod_data_point_role_column_name (str, optional): Override for the data point role column name.

    Returns:
        Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: (augmen_reliable_negatives, known_positives, undecisives).
    """
    aug_bin_col = augmented_bin_column_name or self.augmented_bin_column_name
    aug_role_col = augmented_role_column_name or self.augmented_role_column_name
    prob_class1_col = probability_class_1_column_name or self.probability_class_1_column_name
    mod_data_role_col = mod_data_point_role_column_name or self.mod_data_point_role_column_name

    df = spy_inf_data.copy()
    # Vectorised labelling: 1 if the score is strictly above the threshold, else 0.
    df[aug_bin_col] = (df[prob_class1_col] > threshold).astype(int)
    df[aug_role_col] = np.where(df[aug_bin_col] == 0, "reliable negative", "undecisive")

    return self.filter_augmented_negatives_and_known_positives(
        spy_inf_data=df,
        augmented_role_column_name=aug_role_col,
        mod_data_point_role_column_name=mod_data_role_col
    )

`predict_augmen_probabilities(spy_inf_data, feature_column_name=None, mod_prediction_class_column_name=None, probability_class_1_column_name=None)`

Predict probabilities and labels for spy-infused training data.

The method adds two columns to a copy of the input DataFrame: one for predicted classes and one for the predicted probabilities for class 1.

Parameters:

Name	Type	Description	Default
`spy_inf_data`	`DataFrame`	Spy-infused training data.	required
`feature_column_name`	`Optional[str]`	Override for the name of the column containing input features.	`None`
`mod_prediction_class_column_name`	`Optional[str]`	Override for predicted class column name.	`None`
`probability_class_1_column_name`	`Optional[str]`	Override for probability column name.	`None`

Returns:

Type	Description
`DataFrame`	A new DataFrame with predicted class and probability columns appended.

Raises:

Type	Description
`KeyError`	If the feature column is not found in the data.
`Exception`	Propagates any exceptions raised during prediction.
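
The column-appending pattern can be tried without a trained model. In the sketch below, `StubModel` is a hypothetical stand-in that only imitates the call shape used by the method (a `predict` method with a `prediction_type` argument returning an `(N, 2)` array for probabilities); it is not the CatBoost API, and the column names and feature vectors are made up.

```python
import numpy as np
import pandas as pd


class StubModel:
    """Stand-in for a trained classifier; imitates the call shape only."""

    def predict(self, features, prediction_type="Class"):
        # Made-up score: the mean of each feature vector, treated as P(class 1).
        p1 = np.array([float(np.mean(f)) for f in features])
        if prediction_type == "Probability":
            return np.column_stack([1.0 - p1, p1])  # shape (N, 2)
        return (p1 > 0.5).astype(int)


data = pd.DataFrame({"features": [[0.0, 0.2], [1.0, 0.8], [0.5, 0.7]]})
model = StubModel()

result = data.copy()  # work on a copy so the input stays untouched
features = result["features"].tolist()
result["prediction_class"] = model.predict(features, prediction_type="Class")
# Keep only the second column, i.e. the probability of class 1.
result["probability_1"] = model.predict(features, prediction_type="Probability")[:, 1]

print(result)
```

The input frame keeps its single column, while the returned copy carries the two appended prediction columns.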

Source code in payn\AugmentationModels\SpyModel\augmen_negative_identifier.py

def predict_augmen_probabilities(
    self,
    spy_inf_data: pd.DataFrame,
    feature_column_name: Optional[str] = None,
    mod_prediction_class_column_name: Optional[str] = None,
    probability_class_1_column_name: Optional[str] = None
) -> pd.DataFrame:
    """
    Predict probabilities and labels for spy-infused training data.

    The method adds two columns to a copy of the input DataFrame:
    one for predicted classes and one for the predicted probabilities for class 1.

    Args:
        spy_inf_data (pd.DataFrame): Spy-infused training data.
        feature_column_name (Optional[str]): Override for the name of the column containing input features.
        mod_prediction_class_column_name (Optional[str]): Override for predicted class column name.
        probability_class_1_column_name (Optional[str]): Override for probability column name.

    Returns:
        pd.DataFrame: A new DataFrame with predicted class and probability columns appended.

    Raises:
        KeyError: If the feature column is not found in the data.
        Exception: Propagates any exceptions raised during prediction.
    """
    mod_pred_col = mod_prediction_class_column_name or self.mod_prediction_class_column_name
    prob_class1_col = probability_class_1_column_name or self.probability_class_1_column_name
    feature_column_name = feature_column_name or self.feature_column_name
    if feature_column_name not in spy_inf_data.columns:
        raise KeyError(f"Feature column '{feature_column_name}' not found in data.")

    result_df = spy_inf_data.copy()
    features = result_df[feature_column_name].tolist()

    try:
        pred_class = self.model.predict(features, prediction_type='Class')
        # CatBoost .predict(prediction_type='Probability') returns shape (N, 2), we want class 1
        pred_prob = self.model.predict(features, prediction_type='Probability')[:, 1]
    except Exception as e:
        if self.logger:
            self.logger.log_message(f"Error during prediction: {e}")
        raise

    result_df[mod_pred_col] = pred_class
    result_df[prob_class1_col] = pred_prob
    return result_df

