Spy Injection

Spy injection and PU generation (payn.Splitting.SpySplitting)

Transforms a standard, fully labeled dataset (Positive/Negative) into a Positive-Unlabeled (PU) format suitable for the Spy technique. It simulates a scenario in which only a subset of the positives is known, while the rest are hidden within a pool of unlabeled data.

Two different PU partitioning strategies are implemented:

  • Controlled Ratio (split_data_with_controlled_PU_ratio): Enforces a strict ratio between known positives and the unlabeled pool by partitioning the dataset and discarding excess negatives. This allows for controlled experimentation on the impact of class imbalance and is the default partitioning within this work.
  • Original Ratio (split_data_with_original_PU_ratio): Preserves all negative data and mixes in a calculated subset of positives to achieve a target "unlabeled positive concentration".

After partitioning, spies are injected (spy_infiltration): a user-defined fraction (spy_rate, typically 15-20%) of the known positive training set is randomly sampled (deterministically via random_state) and moved into the Unlabeled set.

These "Spies" have their labels changed to 0 (Negative, s = 0) within the model training context but retain their metadata role as unlabeled spy (y = 1). They serve as anchors: since the model should have classified them as positive, their predicted probability distribution helps identify other hidden positives and therefore the calculation of a new threshold.

Class for splitting data for Spy model training (PU Learning).

Naming Convention
  • true_ : Unmodified data, known positive/negative data from ground truth.
  • spy_inf_ : Spy (positive) data infused into (training) data.
  • augmen_ : Augmented negatives identified by the spy model.

This class splits the input dataset into a training set (true positives) and an unlabeled set (combining a fraction of positives with negatives). Then, a subset of positives is designated as spies and infiltrated into the unlabeled set.

Source code in payn\Splitting\spysplitting.py
class SpySplitting:
    """
    Class for splitting data for Spy model training (PU Learning).

    Naming Convention:
        - true_ : Unmodified data, known positive/negative data from ground truth.
        - spy_inf_ : Spy (positive) data infused into (training) data.
        - augmen_ : Augmented negatives identified by the spy model.

    This class splits the input dataset into a training set (true positives) and an unlabeled set
    (combining a fraction of positives with negatives). Then, a subset of positives is designated as spies
    and infiltrated into the unlabeled set.
    """

    def __init__(self, data: pd.DataFrame, true_label_column: str, modified_label_column_name: str, modified_role_column_name: str = None,
                 application_mode: str = None, positive_label: int = 1, unlabeled_positives_ratio: float = 0.2, ratio_positives_to_unlabeled: float = 0.5,
                 spy_rate: float = 0.2, random_state: int = 42, logger: Logger = None):
        """
        Initialize the SpySplitting class.

        Args:
            data (pd.DataFrame): The input dataset.
            true_label_column (str): Name of the column containing the true labels of the data.
            modified_label_column_name (str): Name of the column for the modified labels of the data.
            modified_role_column_name (str): Name of the column for the modified roles of the data points.
            application_mode (str, optional): Application mode (e.g., 'training').
            positive_label (int): Label for positive data points.
            unlabeled_positives_ratio (float): Target proportion of hidden positives within the unlabeled set (used by split_data_with_original_PU_ratio).
            ratio_positives_to_unlabeled (float): Target ratio of known positives to unlabeled samples (used by split_data_with_controlled_PU_ratio).
            spy_rate (float): Proportion of positive samples to infiltrate as spies.
            random_state (int): Seed for reproducibility.
            logger (Logger, optional): Instance of Logger class for logging purposes.

        """
        self.data = data.copy()
        self.true_label_column = true_label_column
        self.modified_label_column_name = modified_label_column_name
        self.positive_label = positive_label
        self.unlabeled_positives_ratio = unlabeled_positives_ratio
        self.ratio_positives_to_unlabeled = ratio_positives_to_unlabeled
        self.spy_rate = spy_rate
        self.random_state = random_state
        self.logger = logger
        #Optional Parameters for data splitting and infiltration
        self.modified_role_column_name = modified_role_column_name
        self.application_mode = application_mode

        # Initialize the modified label column as a copy of the true label column
        self.data[self.modified_label_column_name] = self.data[self.true_label_column]

    @classmethod
    def from_config(cls, positive_label:int, config: dict, data: pd.DataFrame, logger: Logger = None) -> "SpySplitting":
        """
        Alternative constructor that creates a SpySplitting instance from a config object.

        Args:
            positive_label (int): The integer label representing the positive class.
            config (dict): Configuration dictionary containing splitting parameters.
            data (pd.DataFrame): The input dataset.
            logger (Logger, optional): Logger instance for logging purposes.

        Returns:
            SpySplitting: An instance of the SpySplitting class.
        """
        return cls(
            data=data,
            true_label_column=config["meta_columns"]["meta_true_label_bin"],
            modified_label_column_name=config["meta_columns"]["meta_mod_label_bin"],
            positive_label=positive_label,
            ratio_positives_to_unlabeled=config["spy_splitting"]["ratio_positives_to_unlabeled"],
            spy_rate=config["spy_splitting"]["spy_rate"],
            random_state=config["general"]["random_seed"],
            logger=logger,
            modified_role_column_name=config["meta_columns"]["meta_mod_data_point_role"],
        )

    def split_data_with_controlled_PU_ratio(self, meta_column_name: str = None, schema: Any = None) -> Dict[
        str, pd.DataFrame]:
        """
        Splits data by partitioning the *entire* (P+N) dataset into a 'Labeled'
        chunk and an 'Unlabeled' chunk.

        Note:
            A portion of known negatives from the 'Labeled' chunk is DISCARDED to maintain
            the specific `ratio_positives_to_unlabeled`.

        Args:
            meta_column_name (str, optional): Column name to label datapoint roles.
            schema (DataSchema, optional): Optional DataSchema instance for validating output.

        Returns:
            dict: Dictionary containing:
                - "train" (pd.DataFrame): Known positives for training.
                - "unlabeled" (pd.DataFrame): Combined unlabeled set (subset of P + all N).

        Raises:
            ValueError: If the calculated split ratio is invalid (not between 0 and 1).
        """
        meta_column_name = meta_column_name or self.modified_role_column_name

        positives_in_all_ratio = self.data[self.data[self.true_label_column] == 1].shape[0] / self.data.shape[0]

        recalculated_split_ratio = self.ratio_positives_to_unlabeled / (
                    positives_in_all_ratio + self.ratio_positives_to_unlabeled)

        if not (0 < recalculated_split_ratio < 1):
            raise ValueError(
                f"Calculated split_ratio is {recalculated_split_ratio}, which is not between 0 and 1. Check your config.")

        # Partition the entire dataset
        labeled_train_data = self.data.sample(frac=recalculated_split_ratio, random_state=self.random_state)

        # The remaining data becomes the Unlabeled set
        unlabeled_data = self.data.drop(labeled_train_data.index).copy()

        # Process the labels of Unlabeled set
        unlabeled_data.loc[
            unlabeled_data[self.true_label_column] == 0,
            meta_column_name] = 'unlabeled negative'
        unlabeled_data.loc[
            unlabeled_data[self.true_label_column] == 1,
            meta_column_name] = 'unlabeled positive'

        # Process the Labeled set
        # Identify negatives that ended up in the labeled partition (to be discarded)
        labeled_negative_train_data = labeled_train_data[labeled_train_data[self.true_label_column] == 0]
        # Keep only the positives for training set
        pos_train_data = labeled_train_data.drop(labeled_negative_train_data.index).copy()  # This is the final P set
        pos_train_data[meta_column_name] = 'true positive'
        pos_train_data[self.modified_label_column_name] = 1

        if self.logger:
            self.logger.log_message(
                f"Split by partitioning: Discarded {len(labeled_negative_train_data)} known negative samples.")

        # (Validation and Logging)
        if schema:
            validate_dataframe(df=unlabeled_data, schema=schema, mode=self.application_mode)
            validate_split_integrity(input_dfs=[self.data],
                                     output_dfs=[pos_train_data, unlabeled_data, labeled_negative_train_data])
        if self.logger:
            self.logger.log_spysplit_data(train_data=pos_train_data, unlabeled_data=unlabeled_data)

        return {
            "train": pos_train_data,
            "unlabeled": unlabeled_data
        }

    def split_data_with_original_PU_ratio(self, meta_column_name: str = None, schema: Any = None) -> Dict[
        str, pd.DataFrame]:
        """
        Splits data by building an unlabeled set from ALL negatives and a subset of positives.
        The goal is to achieve a specific 'unlabeled_positives_ratio' (concentration) in the U set.
        No data is discarded.

        Args:
            meta_column_name (str, optional): Column name to label datapoint roles.
            schema (DataSchema, optional): Optional DataSchema instance for validating output.

        Returns:
            dict: Dictionary containing:
                - "train" (pd.DataFrame): Known positives for training.
                - "unlabeled" (pd.DataFrame): Combined unlabeled set.

        Raises:
            ValueError: If there are insufficient true positives to satisfy the requested ratio.
        """
        meta_column_name = meta_column_name or self.modified_role_column_name

        # Separate known positives and negatives
        true_positive_data = self.data[self.data[self.true_label_column] == self.positive_label].copy()
        true_negative_data = self.data[self.data[self.true_label_column] != self.positive_label].copy()

        # Recalculate the unlabeled_positives_ratio to determine how many positives to add to the negatives
        # while respecting the target concentration. A 20% concentration means 20 positives for every 80 negatives,
        # i.e., the number of unlabeled positives equals 0.25 times the number of negatives.
        recalculated_unlabeled_positives_ratio = (self.unlabeled_positives_ratio) / (1 - self.unlabeled_positives_ratio)

        # Unlabeled data is generated from positive and negative data
        number_unlabeled_positives = int(recalculated_unlabeled_positives_ratio * len(true_negative_data))
        if number_unlabeled_positives >= true_positive_data.shape[0]:
            raise ValueError(
                f"You are trying to sample {number_unlabeled_positives} unlabeled positives, but there are only {true_positive_data.shape[0]} true positives.")

        # Sample the positives for the Unlabeled set
        unlabeled_true_positives = true_positive_data.sample(n=number_unlabeled_positives,
                                                             random_state=self.random_state)

        # Remaining positives form the positive training set
        true_pos_train = true_positive_data.drop(unlabeled_true_positives.index)

        # Label datapoint roles
        true_pos_train = true_pos_train.copy()
        unlabeled_true_positives = unlabeled_true_positives.copy()
        true_negative_data = true_negative_data.copy()

        true_pos_train[meta_column_name] = "true positive"
        unlabeled_true_positives[meta_column_name] = "unlabeled positive"
        true_negative_data[meta_column_name] = "unlabeled negative"

        # Combine unlabeled parts and shuffle
        unlabeled_data = pd.concat([unlabeled_true_positives, true_negative_data]).sample(frac=1,
                                                                                          random_state=self.random_state)

        # Embedded validation: ensure the unlabeled data conforms to the expected schema.
        if schema:
            validate_dataframe(df=unlabeled_data, schema=schema, mode="training")
            validate_split_integrity(input_dfs=[true_positive_data, true_negative_data],
                                     output_dfs=[true_pos_train, unlabeled_data])
            # Likely this method will only be used for training, not inference
            if self.logger:
                self.logger.log_message("Unlabeled split validated against schema in SpySplitting.")
        # Log datasets as artifacts to MLflow using Logger
        if self.logger:
            self.logger.log_spysplit_data(train_data=true_pos_train, unlabeled_data=unlabeled_data)

        return {
            "train": true_pos_train,
            "unlabeled": unlabeled_data
        }

    def spy_infiltration(self, true_pos_train_data: pd.DataFrame, unlabeled_data: pd.DataFrame,
                         meta_column_name: str = None, application_mode: str = None,
                         schema: Any = None) -> pd.DataFrame:
        """
        Infiltrate spies into the unlabeled data, returning a new spy-infused training set.

        Selects a subset of the True Positive training data, re-labels them as "Spy",
        sets their label to 0 (Negative), and mixes them into the Unlabeled pool.

        Args:
            true_pos_train_data (pd.DataFrame): Known positive training data.
            unlabeled_data (pd.DataFrame): Unlabeled data to be infiltrated with spies.
            meta_column_name (str, optional): Column name to assign spy role labels.
            application_mode (str, optional): Application mode for schema validation.
            schema (DataSchema, optional): Optional DataSchema instance for validating output.

        Returns:
            pd.DataFrame: The spy-infused training dataset (Positives + Unlabeled w/ Spies).
        """
        meta_column_name = meta_column_name or self.modified_role_column_name
        application_mode = application_mode or self.application_mode

        # Sample a subset of spies from positive training data
        number_spies = int(self.spy_rate * len(true_pos_train_data))

        spies = true_pos_train_data.sample(n=number_spies, random_state=self.random_state)
        # Remove spies from the clean Positive set (creating the final "P" set)
        true_pos_train_data = true_pos_train_data.drop(spies.index)

        spies = spies.copy()
        # Mark spies as negatives (Label = 0) to simulate unlabeled status
        spies[self.modified_label_column_name] = 0
        spies[meta_column_name] = "unlabeled spy"

        # Ensure unlabeled data remains marked as negative.
        unlabeled_data = unlabeled_data.copy()
        unlabeled_data[self.modified_label_column_name] = 0

        # Combine spies with unlabeled data to create the spy-infiltrated dataset
        spy_inf_train_data = pd.concat([spies, unlabeled_data, true_pos_train_data]).sample(frac=1,
                                                                                            random_state=self.random_state)

        # Validate spy-infused data if a schema is provided.
        if schema:
            validate_dataframe(df=spy_inf_train_data, schema=schema, mode=application_mode)
            validate_split_integrity(input_dfs=[true_pos_train_data, spies, unlabeled_data],
                                     output_dfs=[spy_inf_train_data])

            if self.logger:
                self.logger.log_message("Spy infiltration output validated against schema in SpySplitting.")

        if self.logger:
            self.logger.log_spy_infiltrated_data(spy_inf_train_data, spies)

        return spy_inf_train_data

__init__(data, true_label_column, modified_label_column_name, modified_role_column_name=None, application_mode=None, positive_label=1, unlabeled_positives_ratio=0.2, ratio_positives_to_unlabeled=0.5, spy_rate=0.2, random_state=42, logger=None)

Initialize the SpySplitting class.

Parameters:

  • data (DataFrame, required): The input dataset.
  • true_label_column (str, required): Name of the column containing the true labels of the data.
  • modified_label_column_name (str, required): Name of the column for the modified labels of the data.
  • modified_role_column_name (str, default None): Name of the column for the modified roles of the data points.
  • application_mode (str, default None): Application mode (e.g., 'training').
  • positive_label (int, default 1): Label for positive data points.
  • unlabeled_positives_ratio (float, default 0.2): Target proportion of hidden positives within the unlabeled set (used by split_data_with_original_PU_ratio).
  • ratio_positives_to_unlabeled (float, default 0.5): Target ratio of known positives to unlabeled samples (used by split_data_with_controlled_PU_ratio).
  • spy_rate (float, default 0.2): Proportion of positive samples to infiltrate as spies.
  • random_state (int, default 42): Seed for reproducibility.
  • logger (Logger, default None): Instance of Logger class for logging purposes.
Source code in payn\Splitting\spysplitting.py
def __init__(self, data: pd.DataFrame, true_label_column: str, modified_label_column_name: str, modified_role_column_name: str = None,
             application_mode: str = None, positive_label: int = 1, unlabeled_positives_ratio: float = 0.2, ratio_positives_to_unlabeled: float = 0.5,
             spy_rate: float = 0.2, random_state: int = 42, logger: Logger = None):
    """
    Initialize the SpySplitting class.

    Args:
        data (pd.DataFrame): The input dataset.
        true_label_column (str): Name of the column containing the true labels of the data.
        modified_label_column_name (str): Name of the column for the modified labels of the data.
        modified_role_column_name (str): Name of the column for the modified roles of the data points.
        application_mode (str, optional): Application mode (e.g., 'training').
        positive_label (int): Label for positive data points.
        unlabeled_positives_ratio (float): Target proportion of hidden positives within the unlabeled set (used by split_data_with_original_PU_ratio).
        ratio_positives_to_unlabeled (float): Target ratio of known positives to unlabeled samples (used by split_data_with_controlled_PU_ratio).
        spy_rate (float): Proportion of positive samples to infiltrate as spies.
        random_state (int): Seed for reproducibility.
        logger (Logger, optional): Instance of Logger class for logging purposes.

    """
    self.data = data.copy()
    self.true_label_column = true_label_column
    self.modified_label_column_name = modified_label_column_name
    self.positive_label = positive_label
    self.unlabeled_positives_ratio = unlabeled_positives_ratio
    self.ratio_positives_to_unlabeled = ratio_positives_to_unlabeled
    self.spy_rate = spy_rate
    self.random_state = random_state
    self.logger = logger
    #Optional Parameters for data splitting and infiltration
    self.modified_role_column_name = modified_role_column_name
    self.application_mode = application_mode

    # Initialize the modified label column as a copy of the true label column
    self.data[self.modified_label_column_name] = self.data[self.true_label_column]

from_config(positive_label, config, data, logger=None) classmethod

Alternative constructor that creates a SpySplitting instance from a config object.

Parameters:

  • positive_label (int, required): The integer label representing the positive class.
  • config (dict, required): Configuration dictionary containing splitting parameters.
  • data (DataFrame, required): The input dataset.
  • logger (Logger, default None): Logger instance for logging purposes.

Returns:

  • SpySplitting (SpySplitting): An instance of the SpySplitting class.

Source code in payn\Splitting\spysplitting.py
@classmethod
def from_config(cls, positive_label:int, config: dict, data: pd.DataFrame, logger: Logger = None) -> "SpySplitting":
    """
    Alternative constructor that creates a SpySplitting instance from a config object.

    Args:
        positive_label (int): The integer label representing the positive class.
        config (dict): Configuration dictionary containing splitting parameters.
        data (pd.DataFrame): The input dataset.
        logger (Logger, optional): Logger instance for logging purposes.

    Returns:
        SpySplitting: An instance of the SpySplitting class.
    """
    return cls(
        data=data,
        true_label_column=config["meta_columns"]["meta_true_label_bin"],
        modified_label_column_name=config["meta_columns"]["meta_mod_label_bin"],
        positive_label=positive_label,
        ratio_positives_to_unlabeled=config["spy_splitting"]["ratio_positives_to_unlabeled"],
        spy_rate=config["spy_splitting"]["spy_rate"],
        random_state=config["general"]["random_seed"],
        logger=logger,
        modified_role_column_name=config["meta_columns"]["meta_mod_data_point_role"],
    )
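
A sketch of the configuration structure that from_config reads, with key names taken from the accessor calls above; the concrete string and numeric values are purely illustrative, and the example reuses the toy DataFrame df and the import from the sketch at the top of the page:

config = {
    "meta_columns": {
        "meta_true_label_bin": "label",                 # ground-truth binary label column
        "meta_mod_label_bin": "mod_label",              # modified (PU) label column
        "meta_mod_data_point_role": "data_point_role",  # role column ('true positive', 'unlabeled spy', ...)
    },
    "spy_splitting": {
        "ratio_positives_to_unlabeled": 0.5,
        "spy_rate": 0.2,
    },
    "general": {
        "random_seed": 42,
    },
}

splitter = SpySplitting.from_config(positive_label=1, config=config, data=df, logger=None)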

split_data_with_controlled_PU_ratio(meta_column_name=None, schema=None)

Splits data by partitioning the entire (P+N) dataset into a 'Labeled' chunk and an 'Unlabeled' chunk.

Note

A portion of known negatives from the 'Labeled' chunk is DISCARDED to maintain the specific ratio_positives_to_unlabeled.

Parameters:

  • meta_column_name (str, default None): Column name to label datapoint roles.
  • schema (DataSchema, default None): Optional DataSchema instance for validating output.

Returns:

  • dict (Dict[str, DataFrame]): Dictionary containing:
      - "train" (pd.DataFrame): Known positives for training.
      - "unlabeled" (pd.DataFrame): Combined unlabeled set (subset of P + all N).

Raises:

  • ValueError: If the calculated split ratio is invalid (not between 0 and 1).

Source code in payn\Splitting\spysplitting.py
def split_data_with_controlled_PU_ratio(self, meta_column_name: str = None, schema: Any = None) -> Dict[
    str, pd.DataFrame]:
    """
    Splits data by partitioning the *entire* (P+N) dataset into a 'Labeled'
    chunk and an 'Unlabeled' chunk.

    Note:
        A portion of known negatives from the 'Labeled' chunk is DISCARDED to maintain
        the specific `ratio_positives_to_unlabeled`.

    Args:
        meta_column_name (str, optional): Column name to label datapoint roles.
        schema (DataSchema, optional): Optional DataSchema instance for validating output.

    Returns:
        dict: Dictionary containing:
            - "train" (pd.DataFrame): Known positives for training.
            - "unlabeled" (pd.DataFrame): Combined unlabeled set (subset of P + all N).

    Raises:
        ValueError: If the calculated split ratio is invalid (not between 0 and 1).
    """
    meta_column_name = meta_column_name or self.modified_role_column_name

    positives_in_all_ratio = self.data[self.data[self.true_label_column] == 1].shape[0] / self.data.shape[0]

    recalculated_split_ratio = self.ratio_positives_to_unlabeled / (
                positives_in_all_ratio + self.ratio_positives_to_unlabeled)

    if not (0 < recalculated_split_ratio < 1):
        raise ValueError(
            f"Calculated split_ratio is {recalculated_split_ratio}, which is not between 0 and 1. Check your config.")

    # Partition the entire dataset
    labeled_train_data = self.data.sample(frac=recalculated_split_ratio, random_state=self.random_state)

    # The remaining data becomes the Unlabeled set
    unlabeled_data = self.data.drop(labeled_train_data.index).copy()

    # Process the labels of Unlabeled set
    unlabeled_data.loc[
        unlabeled_data[self.true_label_column] == 0,
        meta_column_name] = 'unlabeled negative'
    unlabeled_data.loc[
        unlabeled_data[self.true_label_column] == 1,
        meta_column_name] = 'unlabeled positive'

    # Process the Labeled set
    # Identify negatives that ended up in the labeled partition (to be discarded)
    labeled_negative_train_data = labeled_train_data[labeled_train_data[self.true_label_column] == 0]
    # Keep only the positives for training set
    pos_train_data = labeled_train_data.drop(labeled_negative_train_data.index).copy()  # This is the final P set
    pos_train_data[meta_column_name] = 'true positive'
    pos_train_data[self.modified_label_column_name] = 1

    if self.logger:
        self.logger.log_message(
            f"Split by partitioning: Discarded {len(labeled_negative_train_data)} known negative samples.")

    # (Validation and Logging)
    if schema:
        validate_dataframe(df=unlabeled_data, schema=schema, mode=self.application_mode)
        validate_split_integrity(input_dfs=[self.data],
                                 output_dfs=[pos_train_data, unlabeled_data, labeled_negative_train_data])
    if self.logger:
        self.logger.log_spysplit_data(train_data=pos_train_data, unlabeled_data=unlabeled_data)

    return {
        "train": pos_train_data,
        "unlabeled": unlabeled_data
    }
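
A short worked example of the ratio recalculation above, with illustrative numbers (10% positives in the full dataset, target P:U ratio of 0.5); since the labeled chunk is a uniform random sample, the figures hold in expectation:

# Illustrative numbers, not taken from any real dataset
positives_in_all_ratio = 0.10        # positives make up 10% of the full dataset
ratio_positives_to_unlabeled = 0.5   # target ratio of known positives to unlabeled samples

recalculated_split_ratio = ratio_positives_to_unlabeled / (
    positives_in_all_ratio + ratio_positives_to_unlabeled)                     # 0.5 / 0.6 ≈ 0.833

# The labeled chunk takes ~83.3% of the rows, but only its positives are kept as P
labeled_positive_fraction = recalculated_split_ratio * positives_in_all_ratio  # ≈ 0.083 of all rows
unlabeled_fraction = 1 - recalculated_split_ratio                              # ≈ 0.167 of all rows

# The resulting ratio of known positives to unlabeled samples recovers the target
print(labeled_positive_fraction / unlabeled_fraction)  # ≈ 0.5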

split_data_with_original_PU_ratio(meta_column_name=None, schema=None)

Splits data by building an unlabeled set from ALL negatives and a subset of positives. The goal is to achieve a specific 'unlabeled_positives_ratio' (concentration) in the U set. No data is discarded.

Parameters:

  • meta_column_name (str, default None): Column name to label datapoint roles.
  • schema (DataSchema, default None): Optional DataSchema instance for validating output.

Returns:

  • dict (Dict[str, DataFrame]): Dictionary containing:
      - "train" (pd.DataFrame): Known positives for training.
      - "unlabeled" (pd.DataFrame): Combined unlabeled set.

Raises:

  • ValueError: If there are insufficient true positives to satisfy the requested ratio.

Source code in payn\Splitting\spysplitting.py
def split_data_with_original_PU_ratio(self, meta_column_name: str = None, schema: Any = None) -> Dict[
    str, pd.DataFrame]:
    """
    Splits data by building an unlabeled set from ALL negatives and a subset of positives.
    The goal is to achieve a specific 'unlabeled_positives_ratio' (concentration) in the U set.
    No data is discarded.

    Args:
        meta_column_name (str, optional): Column name to label datapoint roles.
        schema (DataSchema, optional): Optional DataSchema instance for validating output.

    Returns:
        dict: Dictionary containing:
            - "train" (pd.DataFrame): Known positives for training.
            - "unlabeled" (pd.DataFrame): Combined unlabeled set.

    Raises:
        ValueError: If there are insufficient true positives to satisfy the requested ratio.
    """
    meta_column_name = meta_column_name or self.modified_role_column_name

    # Separate known positives and negatives
    true_positive_data = self.data[self.data[self.true_label_column] == self.positive_label].copy()
    true_negative_data = self.data[self.data[self.true_label_column] != self.positive_label].copy()

    # Recalculate the unlabeled_positives_ratio to determine how many positives to add to the negatives
    # while respecting the target concentration. A 20% concentration means 20 positives for every 80 negatives,
    # i.e., the number of unlabeled positives equals 0.25 times the number of negatives.
    recalculated_unlabeled_positives_ratio = (self.unlabeled_positives_ratio) / (1 - self.unlabeled_positives_ratio)

    # Unlabeled data is generated from positive and negative data
    number_unlabeled_positives = int(recalculated_unlabeled_positives_ratio * len(true_negative_data))
    if number_unlabeled_positives >= true_positive_data.shape[0]:
        raise ValueError(
            f"You are trying to sample {number_unlabeled_positives} unlabeled positives, but there are only {true_positive_data.shape[0]} true positives.")

    # Sample the positives for the Unlabeled set
    unlabeled_true_positives = true_positive_data.sample(n=number_unlabeled_positives,
                                                         random_state=self.random_state)

    # Remaining positives form the positive training set
    true_pos_train = true_positive_data.drop(unlabeled_true_positives.index)

    # Label datapoint roles
    true_pos_train = true_pos_train.copy()
    unlabeled_true_positives = unlabeled_true_positives.copy()
    true_negative_data = true_negative_data.copy()

    true_pos_train[meta_column_name] = "true positive"
    unlabeled_true_positives[meta_column_name] = "unlabeled positive"
    true_negative_data[meta_column_name] = "unlabeled negative"

    # Combine unlabeled parts and shuffle
    unlabeled_data = pd.concat([unlabeled_true_positives, true_negative_data]).sample(frac=1,
                                                                                      random_state=self.random_state)

    # Embedded validation: ensure the unlabeled data conforms to the expected schema.
    if schema:
        validate_dataframe(df=unlabeled_data, schema=schema, mode="training")
        validate_split_integrity(input_dfs=[true_positive_data, true_negative_data],
                                 output_dfs=[true_pos_train, unlabeled_data])
        # Likely this method will only be used for training, not inference
        if self.logger:
            self.logger.log_message("Unlabeled split validated against schema in SpySplitting.")
    # Log datasets as artifacts to MLflow using Logger
    if self.logger:
        self.logger.log_spysplit_data(train_data=true_pos_train, unlabeled_data=unlabeled_data)

    return {
        "train": true_pos_train,
        "unlabeled": unlabeled_data
    }
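
A short worked example of the concentration recalculation above, with illustrative counts (800 true negatives and a target of 20% hidden positives in the unlabeled set):

# Illustrative counts, not taken from any real dataset
unlabeled_positives_ratio = 0.2   # target share of hidden positives in the unlabeled set
n_true_negatives = 800

multiplier = unlabeled_positives_ratio / (1 - unlabeled_positives_ratio)   # 0.25
number_unlabeled_positives = int(multiplier * n_true_negatives)            # 200

# The unlabeled set then holds 200 hidden positives among 1000 samples
concentration = number_unlabeled_positives / (number_unlabeled_positives + n_true_negatives)
print(concentration)  # 0.2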

spy_infiltration(true_pos_train_data, unlabeled_data, meta_column_name=None, application_mode=None, schema=None)

Infiltrate spies into the unlabeled data, returning a new spy-infused training set.

Selects a subset of the True Positive training data, re-labels them as "Spy", sets their label to 0 (Negative), and mixes them into the Unlabeled pool.

Parameters:

  • true_pos_train_data (DataFrame, required): Known positive training data.
  • unlabeled_data (DataFrame, required): Unlabeled data to be infiltrated with spies.
  • meta_column_name (str, default None): Column name to assign spy role labels.
  • application_mode (str, default None): Application mode for schema validation.
  • schema (DataSchema, default None): Optional DataSchema instance for validating output.

Returns:

  • DataFrame: The spy-infused training dataset (Positives + Unlabeled w/ Spies).

Source code in payn\Splitting\spysplitting.py
def spy_infiltration(self, true_pos_train_data: pd.DataFrame, unlabeled_data: pd.DataFrame,
                     meta_column_name: str = None, application_mode: str = None,
                     schema: Any = None) -> pd.DataFrame:
    """
    Infiltrate spies into the unlabeled data, returning a new spy-infused training set.

    Selects a subset of the True Positive training data, re-labels them as "Spy",
    sets their label to 0 (Negative), and mixes them into the Unlabeled pool.

    Args:
        true_pos_train_data (pd.DataFrame): Known positive training data.
        unlabeled_data (pd.DataFrame): Unlabeled data to be infiltrated with spies.
        meta_column_name (str, optional): Column name to assign spy role labels.
        application_mode (str, optional): Application mode for schema validation.
        schema (DataSchema, optional): Optional DataSchema instance for validating output.

    Returns:
        pd.DataFrame: The spy-infused training dataset (Positives + Unlabeled w/ Spies).
    """
    meta_column_name = meta_column_name or self.modified_role_column_name
    application_mode = application_mode or self.application_mode

    # Sample a subset of spies from positive training data
    number_spies = int(self.spy_rate * len(true_pos_train_data))

    spies = true_pos_train_data.sample(n=number_spies, random_state=self.random_state)
    # Remove spies from the clean Positive set (creating the final "P" set)
    true_pos_train_data = true_pos_train_data.drop(spies.index)

    spies = spies.copy()
    # Mark spies as negatives (Label = 0) to simulate unlabeled status
    spies[self.modified_label_column_name] = 0
    spies[meta_column_name] = "unlabeled spy"

    # Ensure unlabeled data remains marked as negative.
    unlabeled_data = unlabeled_data.copy()
    unlabeled_data[self.modified_label_column_name] = 0

    # Combine spies with unlabeled data to create the spy-infiltrated dataset
    spy_inf_train_data = pd.concat([spies, unlabeled_data, true_pos_train_data]).sample(frac=1,
                                                                                        random_state=self.random_state)

    # Validate spy-infused data if a schema is provided.
    if schema:
        validate_dataframe(df=spy_inf_train_data, schema=schema, mode=application_mode)
        validate_split_integrity(input_dfs=[true_pos_train_data, spies, unlabeled_data],
                                 output_dfs=[spy_inf_train_data])

        if self.logger:
            self.logger.log_message("Spy infiltration output validated against schema in SpySplitting.")

    if self.logger:
        self.logger.log_spy_infiltrated_data(spy_inf_train_data, spies)

    return spy_inf_train_data
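
Continuing the sketch from the top of the page (and reusing its illustrative column names together with the spy_inf_train object defined there), the role and modified label columns make it easy to verify the infiltration:

# Role breakdown after infiltration:
# 'true positive', 'unlabeled negative', 'unlabeled positive', 'unlabeled spy'
print(spy_inf_train["data_point_role"].value_counts())

# Spies keep their positive ground truth but carry a modified label of 0
spies = spy_inf_train[spy_inf_train["data_point_role"] == "unlabeled spy"]
print(spies[["label", "mod_label"]].head())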