Spy Injection

Spy injection and PU generation (payn.Splitting.SpySplitting)

Transforms a standard, fully labeled dataset (Positive/Negative) into a Positive-Unlabeled (PU) format suitable for the Spy technique. It simulates a scenario in which only a subset of the positives is known, while the rest are hidden within a pool of unlabeled data.

Two different PU partitioning strategies are implemented:

  • Controlled Ratio (split_data_with_controlled_PU_ratio): Enforces a strict ratio between known positives and the unlabeled pool by partitioning the dataset and discarding excess negatives. This allows for controlled experimentation on the impact of class imbalance and is the default partitioning within this work.
  • Original Ratio (split_data_with_original_PU_ratio): Preserves all negative data and mixes in a calculated subset of positives to achieve a target "unlabeled positive concentration".

After partitioning, spies are injected (spy_infiltration): a user-defined fraction (spy_rate, typically 15-20%) of the known positive training set is randomly sampled (deterministically via random_state) and moved into the Unlabeled set.

These "Spies" have their labels changed to 0 (Negative, s = 0) within the model training context but retain their metadata role as unlabeled spy (y = 1). They serve as anchors: since the model should have classified them as positive, their predicted probability distribution helps identify other hidden positives and therefore the calculation of a new threshold.

Class for splitting data for Spy model training (PU Learning).

Naming Convention
  • true_ : Unmodified data, known positive/negative data from ground truth.
  • spy_inf_ : Spy (positive) data infused into (training) data.
  • augmen_ : Augmented negatives identified by the spy model.

This class splits the input dataset into a training set (true positives) and an unlabeled set (combining a fraction of positives with negatives). Then, a subset of positives is designated as spies and infiltrated into the unlabeled set.

Source code in payn\Splitting\spysplitting.py
class SpySplitting:
    """
    Class for splitting data for Spy model training (PU Learning).

    Naming Convention:
        - true_ : Unmodified data, known positive/negative data from ground truth.
        - spy_inf_ : Spy (positive) data infused into (training) data.
        - augmen_ : Augmented negatives identified by the spy model.

    This class splits the input dataset into a training set (true positives) and an unlabeled set
    (combining a fraction of positives with negatives). Then, a subset of positives is designated as spies
    and infiltrated into the unlabeled set.
    """

    def __init__(self, data: pd.DataFrame, true_label_column: str, modified_label_column_name: str, modified_role_column_name: str = None,
                 application_mode: str = None, positive_label: int = 1, unlabeled_positives_ratio: float = 0.2, ratio_positives_to_unlabeled: float = 0.5,
                 spy_rate: float = 0.2, random_state: int = 42, logger: Logger = None):
        """
        Initialize the SpySplitting class.

        Args:
            data (pd.DataFrame): The input dataset.
            true_label_column (str): Name of the column containing the true labels of the data.
            modified_label_column_name (str): Name of the column for the modified labels of the data.
            modified_role_column_name (str): Name of the column for the modified roles of the data points.
            application_mode (str, optional): Application mode (e.g., 'training').
            positive_label (int): Label for positive data points.
            unlabeled_positives_ratio (float): Target proportion of hidden positives within the unlabeled set (used by split_data_with_original_PU_ratio).
            ratio_positives_to_unlabeled (float): Target ratio of known positives to unlabeled samples (used by split_data_with_controlled_PU_ratio).
            spy_rate (float): Proportion of positive samples to infiltrate as spies.
            random_state (int): Seed for reproducibility.
            logger (Logger, optional): Instance of Logger class for logging purposes.

        """
        self.data = data.copy()
        self.true_label_column = true_label_column
        self.modified_label_column_name = modified_label_column_name
        self.positive_label = positive_label
        self.unlabeled_positives_ratio = unlabeled_positives_ratio
        self.ratio_positives_to_unlabeled = ratio_positives_to_unlabeled
        self.spy_rate = spy_rate
        self.random_state = random_state
        self.logger = logger
        #Optional Parameters for data splitting and infiltration
        self.modified_role_column_name = modified_role_column_name
        self.application_mode = application_mode

        # Initialize the modified label column as a copy of the true label column
        self.data[self.modified_label_column_name] = self.data[self.true_label_column]

    @classmethod
    def from_config(cls, positive_label:int, config: dict, data: pd.DataFrame, logger: Logger = None) -> "SpySplitting":
        """
        Alternative constructor that creates a SpySplitting instance from a config object.

        Args:
            positive_label (int): The integer label representing the positive class.
            config (dict): Configuration dictionary containing splitting parameters.
            data (pd.DataFrame): The input dataset.
            logger (Logger, optional): Logger instance for logging purposes.

        Returns:
            SpySplitting: An instance of the SpySplitting class.
        """
        return cls(
            data=data,
            true_label_column=config["meta_columns"]["meta_true_label_bin"],
            modified_label_column_name=config["meta_columns"]["meta_mod_label_bin"],
            positive_label=positive_label,
            ratio_positives_to_unlabeled=config["spy_splitting"]["ratio_positives_to_unlabeled"],
            spy_rate=config["spy_splitting"]["spy_rate"],
            random_state=config["general"]["random_seed"],
            logger=logger,
            modified_role_column_name=config["meta_columns"]["meta_mod_data_point_role"],
        )

    def split_data_with_controlled_PU_ratio(self, meta_column_name: str = None, schema: Any = None) -> Dict[
        str, pd.DataFrame]:
        """
        Splits data by partitioning the *entire* (P+N) dataset into a 'Labeled'
        chunk and an 'Unlabeled' chunk.

        Note:
            A portion of known negatives from the 'Labeled' chunk is DISCARDED to maintain
            the specific `ratio_positives_to_unlabeled`.

        Args:
            meta_column_name (str, optional): Column name to label datapoint roles.
            schema (DataSchema, optional): Optional DataSchema instance for validating output.

        Returns:
            dict: Dictionary containing:
                - "train" (pd.DataFrame): Known positives for training.
                - "unlabeled" (pd.DataFrame): Combined unlabeled set (subset of P + all N).

        Raises:
            ValueError: If the calculated split ratio is invalid (not between 0 and 1).
        """
        meta_column_name = meta_column_name or self.modified_role_column_name

        positives_in_all_ratio = self.data[self.data[self.true_label_column] == 1].shape[0] / self.data.shape[0]

        recalculated_split_ratio = self.ratio_positives_to_unlabeled / (
                    positives_in_all_ratio + self.ratio_positives_to_unlabeled)

        if not (0 < recalculated_split_ratio < 1):
            raise ValueError(
                f"Calculated split_ratio is {recalculated_split_ratio}, which is not between 0 and 1. Check your config.")

        # Partition the entire dataset
        labeled_train_data = self.data.sample(frac=recalculated_split_ratio, random_state=self.random_state)

        # The remaining data becomes the Unlabeled set
        unlabeled_data = self.data.drop(labeled_train_data.index).copy()

        # Process the labels of Unlabeled set
        unlabeled_data.loc[
            unlabeled_data[self.true_label_column] == 0,
            meta_column_name] = 'unlabeled negative'
        unlabeled_data.loc[
            unlabeled_data[self.true_label_column] == 1,
            meta_column_name] = 'unlabeled positive'

        # Process the Labeled set
        # Identify negatives that ended up in the labeled partition (to be discarded)
        labeled_negative_train_data = labeled_train_data[labeled_train_data[self.true_label_column] == 0]
        # Keep only the positives for training set
        pos_train_data = labeled_train_data.drop(labeled_negative_train_data.index).copy()  # This is the final P set
        pos_train_data[meta_column_name] = 'true positive'
        pos_train_data[self.modified_label_column_name] = 1

        if self.logger:
            self.logger.log_message(
                f"Split by partitioning: Discarded {len(labeled_negative_train_data)} known negative samples.")

        # (Validation and Logging)
        if schema:
            validate_dataframe(df=unlabeled_data, schema=schema, mode=self.application_mode)
            validate_split_integrity(input_dfs=[self.data],
                                     output_dfs=[pos_train_data, unlabeled_data, labeled_negative_train_data])
        if self.logger:
            self.logger.log_spysplit_data(train_data=pos_train_data, unlabeled_data=unlabeled_data)

        return {
            "train": pos_train_data,
            "unlabeled": unlabeled_data
        }

    def split_data_with_original_PU_ratio(self, meta_column_name: str = None, schema: Any = None) -> Dict[
        str, pd.DataFrame]:
        """
        Splits data by building an unlabeled set from ALL negatives and a subset of positives.
        The goal is to achieve a specific 'unlabeled_positives_ratio' (concentration) in the U set.
        No data is discarded.

        Args:
            meta_column_name (str, optional): Column name to label datapoint roles.
            schema (DataSchema, optional): Optional DataSchema instance for validating output.

        Returns:
            dict: Dictionary containing:
                - "train" (pd.DataFrame): Known positives for training.
                - "unlabeled" (pd.DataFrame): Combined unlabeled set.

        Raises:
            ValueError: If there are insufficient true positives to satisfy the requested ratio.
        """
        meta_column_name = meta_column_name or self.modified_role_column_name

        # Separate known positives and negatives
        true_positive_data = self.data[self.data[self.true_label_column] == self.positive_label].copy()
        true_negative_data = self.data[self.data[self.true_label_column] != self.positive_label].copy()

        # Recalculate the unlabeled_positives_ratio to determine how many positives to add to the negatives
        # while respecting the target concentration. A 20% concentration means 20 positives for every 80 negatives,
        # i.e., the number of unlabeled positives equals 0.25 times the number of negatives.
        recalculated_unlabeled_positives_ratio = (self.unlabeled_positives_ratio) / (1 - self.unlabeled_positives_ratio)

        # Unlabeled data is generated from positive and negative data
        number_unlabeled_positives = int(recalculated_unlabeled_positives_ratio * len(true_negative_data))
        if number_unlabeled_positives >= true_positive_data.shape[0]:
            raise ValueError(
                f"You are trying to sample {number_unlabeled_positives} unlabeled positives, but there are only {true_positive_data.shape[0]} true positives.")

        # Sample the positives for the Unlabeled set
        unlabeled_true_positives = true_positive_data.sample(n=number_unlabeled_positives,
                                                             random_state=self.random_state)

        # Remaining positives form the positive training set
        true_pos_train = true_positive_data.drop(unlabeled_true_positives.index)

        # Label datapoint roles
        true_pos_train = true_pos_train.copy()
        unlabeled_true_positives = unlabeled_true_positives.copy()
        true_negative_data = true_negative_data.copy()

        true_pos_train[meta_column_name] = "true positive"
        unlabeled_true_positives[meta_column_name] = "unlabeled positive"
        true_negative_data[meta_column_name] = "unlabeled negative"

        # Combine unlabeled parts and shuffle
        unlabeled_data = pd.concat([unlabeled_true_positives, true_negative_data]).sample(frac=1,
                                                                                          random_state=self.random_state)

        # Embedded validation: ensure the unlabeled data conforms to the expected schema.
        if schema:
            validate_dataframe(df=unlabeled_data, schema=schema, mode="training")
            validate_split_integrity(input_dfs=[true_positive_data, true_negative_data],
                                     output_dfs=[true_pos_train, unlabeled_data])
            # Likely this method will only be used for training, not inference
            if self.logger:
                self.logger.log_message("Unlabeled split validated against schema in SpySplitting.")
        # Log datasets as artifacts to MLflow using Logger
        if self.logger:
            self.logger.log_spysplit_data(train_data=true_pos_train, unlabeled_data=unlabeled_data)

        return {
            "train": true_pos_train,
            "unlabeled": unlabeled_data
        }

    def spy_infiltration(self, true_pos_train_data: pd.DataFrame, unlabeled_data: pd.DataFrame,
                         meta_column_name: str = None, application_mode: str = None,
                         schema: Any = None) -> pd.DataFrame:
        """
        Infiltrate spies into the unlabeled data, returning a new spy-infused training set.

        Selects a subset of the True Positive training data, re-labels them as "Spy",
        sets their label to 0 (Negative), and mixes them into the Unlabeled pool.

        Args:
            true_pos_train_data (pd.DataFrame): Known positive training data.
            unlabeled_data (pd.DataFrame): Unlabeled data to be infiltrated with spies.
            meta_column_name (str, optional): Column name to assign spy role labels.
            application_mode (str, optional): Application mode for schema validation.
            schema (DataSchema, optional): Optional DataSchema instance for validating output.

        Returns:
            pd.DataFrame: The spy-infused training dataset (Positives + Unlabeled w/ Spies).
        """
        meta_column_name = meta_column_name or self.modified_role_column_name
        application_mode = application_mode or self.application_mode

        # Sample a subset of spies from positive training data
        number_spies = int(self.spy_rate * len(true_pos_train_data))

        spies = true_pos_train_data.sample(n=number_spies, random_state=self.random_state)
        # Remove spies from the clean Positive set (creating the final "P" set)
        true_pos_train_data = true_pos_train_data.drop(spies.index)

        spies = spies.copy()
        # Mark spies as negatives (Label = 0) to simulate unlabeled status
        spies[self.modified_label_column_name] = 0
        spies[meta_column_name] = "unlabeled spy"

        # Ensure unlabeled data remains marked as negative.
        unlabeled_data = unlabeled_data.copy()
        unlabeled_data[self.modified_label_column_name] = 0

        # Combine spies with unlabeled data to create the spy-infiltrated dataset
        spy_inf_train_data = pd.concat([spies, unlabeled_data, true_pos_train_data]).sample(frac=1,
                                                                                            random_state=self.random_state)

        # Validate spy-infused data if a schema is provided.
        if schema:
            validate_dataframe(df=spy_inf_train_data, schema=schema, mode=application_mode)
            validate_split_integrity(input_dfs=[true_pos_train_data, spies, unlabeled_data],
                                     output_dfs=[spy_inf_train_data])

            if self.logger:
                self.logger.log_message("Spy infiltration output validated against schema in SpySplitting.")

        if self.logger:
            self.logger.log_spy_infiltrated_data(spy_inf_train_data, spies)

        return spy_inf_train_data

__init__(data, true_label_column, modified_label_column_name, modified_role_column_name=None, application_mode=None, positive_label=1, unlabeled_positives_ratio=0.2, ratio_positives_to_unlabeled=0.5, spy_rate=0.2, random_state=42, logger=None)

Initialize the SpySplitting class.

Parameters:

  • data (DataFrame, required): The input dataset.
  • true_label_column (str, required): Name of the column containing the true labels of the data.
  • modified_label_column_name (str, required): Name of the column for the modified labels of the data.
  • modified_role_column_name (str, default None): Name of the column for the modified roles of the data points.
  • application_mode (str, default None): Application mode (e.g., 'training').
  • positive_label (int, default 1): Label for positive data points.
  • unlabeled_positives_ratio (float, default 0.2): Target proportion of hidden positives within the unlabeled set (used by split_data_with_original_PU_ratio).
  • ratio_positives_to_unlabeled (float, default 0.5): Target ratio of known positives to unlabeled samples (used by split_data_with_controlled_PU_ratio).
  • spy_rate (float, default 0.2): Proportion of positive samples to infiltrate as spies.
  • random_state (int, default 42): Seed for reproducibility.
  • logger (Logger, default None): Instance of Logger class for logging purposes.
Source code in payn\Splitting\spysplitting.py
def __init__(self, data: pd.DataFrame, true_label_column: str, modified_label_column_name: str, modified_role_column_name: str = None,
             application_mode: str = None, positive_label: int = 1, unlabeled_positives_ratio: float = 0.2, ratio_positives_to_unlabeled: float = 0.5,
             spy_rate: float = 0.2, random_state: int = 42, logger: Logger = None):
    """
    Initialize the SpySplitting class.

    Args:
        data (pd.DataFrame): The input dataset.
        true_label_column (str): Name of the column containing the true labels of the data.
        modified_label_column_name (str): Name of the column for the modified labels of the data.
        modified_role_column_name (str): Name of the column for the modified roles of the data points.
        application_mode (str, optional): Application mode (e.g., 'training').
        positive_label (int): Label for positive data points.
        unlabeled_positives_ratio (float): Target proportion of hidden positives within the unlabeled set (used by split_data_with_original_PU_ratio).
        ratio_positives_to_unlabeled (float): Target ratio of known positives to unlabeled samples (used by split_data_with_controlled_PU_ratio).
        spy_rate (float): Proportion of positive samples to infiltrate as spies.
        random_state (int): Seed for reproducibility.
        logger (Logger, optional): Instance of Logger class for logging purposes.

    """
    self.data = data.copy()
    self.true_label_column = true_label_column
    self.modified_label_column_name = modified_label_column_name
    self.positive_label = positive_label
    self.unlabeled_positives_ratio = unlabeled_positives_ratio
    self.ratio_positives_to_unlabeled = ratio_positives_to_unlabeled
    self.spy_rate = spy_rate
    self.random_state = random_state
    self.logger = logger
    #Optional Parameters for data splitting and infiltration
    self.modified_role_column_name = modified_role_column_name
    self.application_mode = application_mode

    # Initialize the modified label column as a copy of the true label column
    self.data[self.modified_label_column_name] = self.data[self.true_label_column]

from_config(positive_label, config, data, logger=None) classmethod

Alternative constructor that creates a SpySplitting instance from a config object.

Parameters:

  • positive_label (int, required): The integer label representing the positive class.
  • config (dict, required): Configuration dictionary containing splitting parameters.
  • data (DataFrame, required): The input dataset.
  • logger (Logger, default None): Logger instance for logging purposes.

Returns:

  • SpySplitting (SpySplitting): An instance of the SpySplitting class.

Source code in payn\Splitting\spysplitting.py
@classmethod
def from_config(cls, positive_label:int, config: dict, data: pd.DataFrame, logger: Logger = None) -> "SpySplitting":
    """
    Alternative constructor that creates a SpySplitting instance from a config object.

    Args:
        positive_label (int): The integer label representing the positive class.
        config (dict): Configuration dictionary containing splitting parameters.
        data (pd.DataFrame): The input dataset.
        logger (Logger, optional): Logger instance for logging purposes.

    Returns:
        SpySplitting: An instance of the SpySplitting class.
    """
    return cls(
        data=data,
        true_label_column=config["meta_columns"]["meta_true_label_bin"],
        modified_label_column_name=config["meta_columns"]["meta_mod_label_bin"],
        positive_label=positive_label,
        ratio_positives_to_unlabeled=config["spy_splitting"]["ratio_positives_to_unlabeled"],
        spy_rate=config["spy_splitting"]["spy_rate"],
        random_state=config["general"]["random_seed"],
        logger=logger,
        modified_role_column_name=config["meta_columns"]["meta_mod_data_point_role"],
    )
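
A sketch of the configuration structure that from_config reads, with key names taken from the accessor calls above; the concrete string and numeric values are purely illustrative, and the example reuses the toy DataFrame df and the import from the sketch at the top of the page:

config = {
    "meta_columns": {
        "meta_true_label_bin": "label",                 # ground-truth binary label column
        "meta_mod_label_bin": "mod_label",              # modified (PU) label column
        "meta_mod_data_point_role": "data_point_role",  # role column ('true positive', 'unlabeled spy', ...)
    },
    "spy_splitting": {
        "ratio_positives_to_unlabeled": 0.5,
        "spy_rate": 0.2,
    },
    "general": {
        "random_seed": 42,
    },
}

splitter = SpySplitting.from_config(positive_label=1, config=config, data=df, logger=None)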

split_data_with_controlled_PU_ratio(meta_column_name=None, schema=None)

Splits data by partitioning the entire (P+N) dataset into a 'Labeled' chunk and an 'Unlabeled' chunk.

Note

A portion of known negatives from the 'Labeled' chunk is DISCARDED to maintain the specific ratio_positives_to_unlabeled.

Parameters:

  • meta_column_name (str, default None): Column name to label datapoint roles.
  • schema (DataSchema, default None): Optional DataSchema instance for validating output.

Returns:

  • dict (Dict[str, DataFrame]): Dictionary containing:
      - "train" (pd.DataFrame): Known positives for training.
      - "unlabeled" (pd.DataFrame): Combined unlabeled set (subset of P + all N).

Raises:

  • ValueError: If the calculated split ratio is invalid (not between 0 and 1).

Source code in payn\Splitting\spysplitting.py
def split_data_with_controlled_PU_ratio(self, meta_column_name: str = None, schema: Any = None) -> Dict[
    str, pd.DataFrame]:
    """
    Splits data by partitioning the *entire* (P+N) dataset into a 'Labeled'
    chunk and an 'Unlabeled' chunk.

    Note:
        A portion of known negatives from the 'Labeled' chunk is DISCARDED to maintain
        the specific `ratio_positives_to_unlabeled`.

    Args:
        meta_column_name (str, optional): Column name to label datapoint roles.
        schema (DataSchema, optional): Optional DataSchema instance for validating output.

    Returns:
        dict: Dictionary containing:
            - "train" (pd.DataFrame): Known positives for training.
            - "unlabeled" (pd.DataFrame): Combined unlabeled set (subset of P + all N).

    Raises:
        ValueError: If the calculated split ratio is invalid (not between 0 and 1).
    """
    meta_column_name = meta_column_name or self.modified_role_column_name

    positives_in_all_ratio = self.data[self.data[self.true_label_column] == 1].shape[0] / self.data.shape[0]

    recalculated_split_ratio = self.ratio_positives_to_unlabeled / (
                positives_in_all_ratio + self.ratio_positives_to_unlabeled)

    if not (0 < recalculated_split_ratio < 1):
        raise ValueError(
            f"Calculated split_ratio is {recalculated_split_ratio}, which is not between 0 and 1. Check your config.")

    # Partition the entire dataset
    labeled_train_data = self.data.sample(frac=recalculated_split_ratio, random_state=self.random_state)

    # The remaining data becomes the Unlabeled set
    unlabeled_data = self.data.drop(labeled_train_data.index).copy()

    # Process the labels of Unlabeled set
    unlabeled_data.loc[
        unlabeled_data[self.true_label_column] == 0,
        meta_column_name] = 'unlabeled negative'
    unlabeled_data.loc[
        unlabeled_data[self.true_label_column] == 1,
        meta_column_name] = 'unlabeled positive'

    # Process the Labeled set
    # Identify negatives that ended up in the labeled partition (to be discarded)
    labeled_negative_train_data = labeled_train_data[labeled_train_data[self.true_label_column] == 0]
    # Keep only the positives for training set
    pos_train_data = labeled_train_data.drop(labeled_negative_train_data.index).copy()  # This is the final P set
    pos_train_data[meta_column_name] = 'true positive'
    pos_train_data[self.modified_label_column_name] = 1

    if self.logger:
        self.logger.log_message(
            f"Split by partitioning: Discarded {len(labeled_negative_train_data)} known negative samples.")

    # (Validation and Logging)
    if schema:
        validate_dataframe(df=unlabeled_data, schema=schema, mode=self.application_mode)
        validate_split_integrity(input_dfs=[self.data],
                                 output_dfs=[pos_train_data, unlabeled_data, labeled_negative_train_data])
    if self.logger:
        self.logger.log_spysplit_data(train_data=pos_train_data, unlabeled_data=unlabeled_data)

    return {
        "train": pos_train_data,
        "unlabeled": unlabeled_data
    }
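
A short worked example of the ratio recalculation above, with illustrative numbers (10% positives in the full dataset, target P:U ratio of 0.5); since the labeled chunk is a uniform random sample, the figures hold in expectation:

# Illustrative numbers, not taken from any real dataset
positives_in_all_ratio = 0.10        # positives make up 10% of the full dataset
ratio_positives_to_unlabeled = 0.5   # target ratio of known positives to unlabeled samples

recalculated_split_ratio = ratio_positives_to_unlabeled / (
    positives_in_all_ratio + ratio_positives_to_unlabeled)                     # 0.5 / 0.6 ≈ 0.833

# The labeled chunk takes ~83.3% of the rows, but only its positives are kept as P
labeled_positive_fraction = recalculated_split_ratio * positives_in_all_ratio  # ≈ 0.083 of all rows
unlabeled_fraction = 1 - recalculated_split_ratio                              # ≈ 0.167 of all rows

# The resulting ratio of known positives to unlabeled samples recovers the target
print(labeled_positive_fraction / unlabeled_fraction)  # ≈ 0.5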

split_data_with_original_PU_ratio(meta_column_name=None, schema=None)

Splits data by building an unlabeled set from ALL negatives and a subset of positives. The goal is to achieve a specific 'unlabeled_positives_ratio' (concentration) in the U set. No data is discarded.

Parameters:

  • meta_column_name (str, default None): Column name to label datapoint roles.
  • schema (DataSchema, default None): Optional DataSchema instance for validating output.

Returns:

  • dict (Dict[str, DataFrame]): Dictionary containing:
      - "train" (pd.DataFrame): Known positives for training.
      - "unlabeled" (pd.DataFrame): Combined unlabeled set.

Raises:

  • ValueError: If there are insufficient true positives to satisfy the requested ratio.

Source code in payn\Splitting\spysplitting.py
def split_data_with_original_PU_ratio(self, meta_column_name: str = None, schema: Any = None) -> Dict[
    str, pd.DataFrame]:
    """
    Splits data by building an unlabeled set from ALL negatives and a subset of positives.
    The goal is to achieve a specific 'unlabeled_positives_ratio' (concentration) in the U set.
    No data is discarded.

    Args:
        meta_column_name (str, optional): Column name to label datapoint roles.
        schema (DataSchema, optional): Optional DataSchema instance for validating output.

    Returns:
        dict: Dictionary containing:
            - "train" (pd.DataFrame): Known positives for training.
            - "unlabeled" (pd.DataFrame): Combined unlabeled set.

    Raises:
        ValueError: If there are insufficient true positives to satisfy the requested ratio.
    """
    meta_column_name = meta_column_name or self.modified_role_column_name

    # Separate known positives and negatives
    true_positive_data = self.data[self.data[self.true_label_column] == self.positive_label].copy()
    true_negative_data = self.data[self.data[self.true_label_column] != self.positive_label].copy()

    # Recalculate the unlabeled_positives_ratio to determine how many positives to add to the negatives
    # while respecting the target concentration. A 20% concentration means 20 positives for every 80 negatives,
    # i.e., the number of unlabeled positives equals 0.25 times the number of negatives.
    recalculated_unlabeled_positives_ratio = (self.unlabeled_positives_ratio) / (1 - self.unlabeled_positives_ratio)

    # Unlabeled data is generated from positive and negative data
    number_unlabeled_positives = int(recalculated_unlabeled_positives_ratio * len(true_negative_data))
    if number_unlabeled_positives >= true_positive_data.shape[0]:
        raise ValueError(
            f"You are trying to sample {number_unlabeled_positives} unlabeled positives, but there are only {true_positive_data.shape[0]} true positives.")

    # Sample the positives for the Unlabeled set
    unlabeled_true_positives = true_positive_data.sample(n=number_unlabeled_positives,
                                                         random_state=self.random_state)

    # Remaining positives form the positive training set
    true_pos_train = true_positive_data.drop(unlabeled_true_positives.index)

    # Label datapoint roles
    true_pos_train = true_pos_train.copy()
    unlabeled_true_positives = unlabeled_true_positives.copy()
    true_negative_data = true_negative_data.copy()

    true_pos_train[meta_column_name] = "true positive"
    unlabeled_true_positives[meta_column_name] = "unlabeled positive"
    true_negative_data[meta_column_name] = "unlabeled negative"

    # Combine unlabeled parts and shuffle
    unlabeled_data = pd.concat([unlabeled_true_positives, true_negative_data]).sample(frac=1,
                                                                                      random_state=self.random_state)

    # Embedded validation: ensure the unlabeled data conforms to the expected schema.
    if schema:
        validate_dataframe(df=unlabeled_data, schema=schema, mode="training")
        validate_split_integrity(input_dfs=[true_positive_data, true_negative_data],
                                 output_dfs=[true_pos_train, unlabeled_data])
        # Likely this method will only be used for training, not inference
        if self.logger:
            self.logger.log_message("Unlabeled split validated against schema in SpySplitting.")
    # Log datasets as artifacts to MLflow using Logger
    if self.logger:
        self.logger.log_spysplit_data(train_data=true_pos_train, unlabeled_data=unlabeled_data)

    return {
        "train": true_pos_train,
        "unlabeled": unlabeled_data
    }
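
A short worked example of the concentration recalculation above, with illustrative counts (800 true negatives and a target of 20% hidden positives in the unlabeled set):

# Illustrative counts, not taken from any real dataset
unlabeled_positives_ratio = 0.2   # target share of hidden positives in the unlabeled set
n_true_negatives = 800

multiplier = unlabeled_positives_ratio / (1 - unlabeled_positives_ratio)   # 0.25
number_unlabeled_positives = int(multiplier * n_true_negatives)            # 200

# The unlabeled set then holds 200 hidden positives among 1000 samples
concentration = number_unlabeled_positives / (number_unlabeled_positives + n_true_negatives)
print(concentration)  # 0.2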

spy_infiltration(true_pos_train_data, unlabeled_data, meta_column_name=None, application_mode=None, schema=None)

Infiltrate spies into the unlabeled data, returning a new spy-infused training set.

Selects a subset of the True Positive training data, re-labels them as "Spy", sets their label to 0 (Negative), and mixes them into the Unlabeled pool.

Parameters:

  • true_pos_train_data (DataFrame, required): Known positive training data.
  • unlabeled_data (DataFrame, required): Unlabeled data to be infiltrated with spies.
  • meta_column_name (str, default None): Column name to assign spy role labels.
  • application_mode (str, default None): Application mode for schema validation.
  • schema (DataSchema, default None): Optional DataSchema instance for validating output.

Returns:

  • DataFrame: The spy-infused training dataset (Positives + Unlabeled w/ Spies).

Source code in payn\Splitting\spysplitting.py
def spy_infiltration(self, true_pos_train_data: pd.DataFrame, unlabeled_data: pd.DataFrame,
                     meta_column_name: str = None, application_mode: str = None,
                     schema: Any = None) -> pd.DataFrame:
    """
    Infiltrate spies into the unlabeled data, returning a new spy-infused training set.

    Selects a subset of the True Positive training data, re-labels them as "Spy",
    sets their label to 0 (Negative), and mixes them into the Unlabeled pool.

    Args:
        true_pos_train_data (pd.DataFrame): Known positive training data.
        unlabeled_data (pd.DataFrame): Unlabeled data to be infiltrated with spies.
        meta_column_name (str, optional): Column name to assign spy role labels.
        application_mode (str, optional): Application mode for schema validation.
        schema (DataSchema, optional): Optional DataSchema instance for validating output.

    Returns:
        pd.DataFrame: The spy-infused training dataset (Positives + Unlabeled w/ Spies).
    """
    meta_column_name = meta_column_name or self.modified_role_column_name
    application_mode = application_mode or self.application_mode

    # Sample a subset of spies from positive training data
    number_spies = int(self.spy_rate * len(true_pos_train_data))

    spies = true_pos_train_data.sample(n=number_spies, random_state=self.random_state)
    # Remove spies from the clean Positive set (creating the final "P" set)
    true_pos_train_data = true_pos_train_data.drop(spies.index)

    spies = spies.copy()
    # Mark spies as negatives (Label = 0) to simulate unlabeled status
    spies[self.modified_label_column_name] = 0
    spies[meta_column_name] = "unlabeled spy"

    # Ensure unlabeled data remains marked as negative.
    unlabeled_data = unlabeled_data.copy()
    unlabeled_data[self.modified_label_column_name] = 0

    # Combine spies with unlabeled data to create the spy-infiltrated dataset
    spy_inf_train_data = pd.concat([spies, unlabeled_data, true_pos_train_data]).sample(frac=1,
                                                                                        random_state=self.random_state)

    # Validate spy-infused data if a schema is provided.
    if schema:
        validate_dataframe(df=spy_inf_train_data, schema=schema, mode=application_mode)
        validate_split_integrity(input_dfs=[true_pos_train_data, spies, unlabeled_data],
                                 output_dfs=[spy_inf_train_data])

        if self.logger:
            self.logger.log_message("Spy infiltration output validated against schema in SpySplitting.")

    if self.logger:
        self.logger.log_spy_infiltrated_data(spy_inf_train_data, spies)

    return spy_inf_train_data
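
Continuing the sketch from the top of the page (and reusing its illustrative column names together with the spy_inf_train object defined there), the role and modified label columns make it easy to verify the infiltration:

# Role breakdown after infiltration:
# 'true positive', 'unlabeled negative', 'unlabeled positive', 'unlabeled spy'
print(spy_inf_train["data_point_role"].value_counts())

# Spies keep their positive ground truth but carry a modified label of 0
spies = spy_inf_train[spy_inf_train["data_point_role"] == "unlabeled spy"]
print(spies[["label", "mod_label"]].head())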