Spy Injection
Spy injection and PU generation (payn.Splitting.SpySplitting)
Transforms the standard fully labeled dataset (Positive/Negative) into a Positive-Unlabeled (PU) format suitable for the Spy technique. It simulates a scenario where only a subset of positives are known, and the rest are hidden within a pool of unlabeled data.
Two different PU partitioning strategies are implemented:
- Controlled Ratio (`split_data_with_controlled_PU_ratio`): Enforces a strict ratio between known positives and the unlabeled pool by partitioning the dataset and discarding excess negatives. This allows for controlled experimentation on the impact of class imbalance. This partitioning is the default within this work.
- Original Ratio (`split_data_with_original_PU_ratio`): Preserves all negative data and mixes in a calculated subset of positives to achieve a target "unlabeled positive concentration".

Spy Infiltration: A user-defined fraction (`spy_rate`, typically 15-20%) of the known positive training set is randomly sampled (deterministically via `random_state`) and moved into the Unlabeled set.
These "spies" have their training label set to 0 (negative, s = 0) but retain their metadata role as unlabeled spies (true label y = 1). They serve as anchors: because the model should classify them as positive, the distribution of their predicted probabilities shows where other hidden positives are likely to score and is used to calculate a new threshold.
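How the spies' scores are turned into a threshold is not spelled out in this section. The sketch below assumes one common choice: place the threshold at a low quantile of the spy scores, so that only a small tolerated fraction of spies falls below it, and treat unlabeled points scoring under that threshold as likely negatives. The names `spy_threshold`, `reliable_negatives`, and `tolerance` are hypothetical and not part of payn.

```python
def spy_threshold(spy_scores, tolerance=0.05):
    """Return a score such that at most `tolerance` of the spies score below it.

    Assumption: the threshold is a low quantile of the spy scores.
    """
    if not spy_scores:
        raise ValueError("at least one spy score is required")
    if not 0.0 <= tolerance < 1.0:
        raise ValueError("tolerance must be in [0, 1)")
    ranked = sorted(spy_scores)
    # Number of spies we allow to fall strictly below the threshold.
    cut = int(tolerance * len(ranked))
    return ranked[cut]


def reliable_negatives(unlabeled_scores, threshold):
    """Indices of unlabeled points that score below the spy-derived threshold."""
    return [i for i, score in enumerate(unlabeled_scores) if score < threshold]


# Predicted positive-class probabilities for ten spies (made-up numbers).
spy_scores = [0.91, 0.85, 0.40, 0.78, 0.95, 0.88, 0.70, 0.82, 0.93, 0.60]
threshold = spy_threshold(spy_scores, tolerance=0.1)

# Made-up scores for five unlabeled points.
unlabeled_scores = [0.10, 0.65, 0.30, 0.97, 0.59]
negatives = reliable_negatives(unlabeled_scores, threshold)
```

With a tolerance of 0.1, one of the ten spies is allowed below the threshold, so the lowest-scoring spy acts as an outlier and the second-lowest score becomes the cut-off.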
Class for splitting data for Spy model training (PU Learning).
Naming Convention
- true_ : Unmodified data, known positive/negative data from ground truth.
- spy_inf_ : Spy_positive data infused into (training) data.
- augmen_ : Augmented negatives identified by the spy model.
This class splits the input dataset into a training set (true positives) and an unlabeled set (combining a fraction of positives with negatives). Then, a subset of positives is designated as spies and infiltrated into the unlabeled set.
Source code in payn\Splitting\spysplitting.py
__init__(data, true_label_column, modified_label_column_name, modified_role_column_name=None, application_mode=None, positive_label=1, unlabeled_positives_ratio=0.2, ratio_positives_to_unlabeled=0.5, spy_rate=0.2, random_state=42, logger=None)
Initialize the SpySplitting class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | The input dataset. | required |
| `true_label_column` | `str` | Name of the column containing the true labels of the data. | required |
| `modified_label_column_name` | `str` | Name of the column for the modified labels of the data. | required |
| `modified_role_column_name` | `str` | Name of the column for the modified roles of the data points. | `None` |
| `application_mode` | `str` | Application mode (e.g., 'training'). | `None` |
| `positive_label` | `int` | Label for positive data points. | `1` |
| `unlabeled_positives_ratio` | `float` | Target concentration of positives within the unlabeled set (used by `split_data_with_original_PU_ratio`). | `0.2` |
| `ratio_positives_to_unlabeled` | `float` | Target ratio between known positives and the unlabeled pool (used by `split_data_with_controlled_PU_ratio`). | `0.5` |
| `spy_rate` | `float` | Proportion of positive samples to infiltrate as spies. | `0.2` |
| `random_state` | `int` | Seed for reproducibility. | `42` |
| `logger` | `Logger` | Instance of Logger class for logging purposes. | `None` |
from_config(positive_label, config, data, logger=None)
classmethod
Alternative constructor that creates a SpySplitting instance from a config object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `positive_label` | `int` | The integer label representing the positive class. | required |
| `config` | `dict` | Configuration dictionary containing splitting parameters. | required |
| `data` | `DataFrame` | The input dataset. | required |
| `logger` | `Logger` | Logger instance for logging purposes. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `SpySplitting` | `SpySplitting` | An instance of the SpySplitting class. |
split_data_with_controlled_PU_ratio(meta_column_name=None, schema=None)
Splits data by partitioning the entire (P+N) dataset into a 'Labeled' chunk and an 'Unlabeled' chunk.
Note
A portion of the known negatives from the 'Labeled' chunk is discarded to maintain the specified `ratio_positives_to_unlabeled`.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `meta_column_name` | `str` | Column name to label datapoint roles. | `None` |
| `schema` | `DataSchema` | Optional DataSchema instance for validating output. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `dict` | `Dict[str, DataFrame]` | Dictionary containing: "train" (pd.DataFrame): known positives for training; "unlabeled" (pd.DataFrame): combined unlabeled set (subset of P + all N). |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the calculated split ratio is invalid (not between 0 and 1). |
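The controlled-ratio split can be illustrated with a small standard-library sketch. It rests on two assumptions that this page does not confirm: that `ratio_positives_to_unlabeled` means (number of known positives) / (size of the unlabeled pool), and that the labeled chunk takes the same fraction of positives and of negatives. Under those assumptions the labeled fraction is `r / (prevalence + r)`. The function `controlled_pu_split` and the extra "discarded" key are hypothetical and only serve the illustration; the real method returns "train" and "unlabeled".

```python
import random


def controlled_pu_split(labels, ratio_positives_to_unlabeled=0.5,
                        positive_label=1, random_state=42):
    """Sketch of a controlled-ratio PU split over row indices (not payn's code)."""
    if not labels:
        raise ValueError("labels must not be empty")
    pos = [i for i, y in enumerate(labels) if y == positive_label]
    neg = [i for i, y in enumerate(labels) if y != positive_label]

    prevalence = len(pos) / len(labels)
    r = ratio_positives_to_unlabeled
    denom = prevalence + r
    # Assumed derivation: frac * |P| / ((1 - frac) * |P + N|) = r.
    frac = r / denom if denom > 0 else 0.0
    if not 0.0 < frac < 1.0:
        raise ValueError(f"invalid split ratio: {frac}")

    rng = random.Random(random_state)  # seeded for reproducibility
    rng.shuffle(pos)
    rng.shuffle(neg)

    n_pos_labeled = round(frac * len(pos))
    n_neg_labeled = round(frac * len(neg))
    return {
        "train": pos[:n_pos_labeled],          # known positives
        "discarded": neg[:n_neg_labeled],      # negatives of the labeled chunk
        "unlabeled": pos[n_pos_labeled:] + neg[n_neg_labeled:],
    }


# 100 positives and 300 negatives, target ratio 0.5.
labels = [1] * 100 + [0] * 300
split = controlled_pu_split(labels, ratio_positives_to_unlabeled=0.5)
achieved = len(split["train"]) / len(split["unlabeled"])
```

Because the chunk sizes are rounded to whole rows, the achieved ratio only approximates the target on small datasets.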
split_data_with_original_PU_ratio(meta_column_name=None, schema=None)
Splits data by building an unlabeled set from ALL negatives and a subset of positives. The goal is to achieve a specific 'unlabeled_positives_ratio' (concentration) in the U set. No data is discarded.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `meta_column_name` | `str` | Column name to label datapoint roles. | `None` |
| `schema` | `DataSchema` | Optional DataSchema instance for validating output. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `dict` | `Dict[str, DataFrame]` | Dictionary containing: "train" (pd.DataFrame): known positives for training; "unlabeled" (pd.DataFrame): combined unlabeled set. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If there are insufficient true positives to satisfy the requested ratio. |
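The arithmetic behind the original-ratio split can be sketched as follows. If all negatives go into the unlabeled set and the target concentration of positives there is `c`, the number of hidden positives `h` must satisfy `h / (h + |N|) = c`, which gives `h = c * |N| / (1 - c)`. The function `original_pu_split` is a hypothetical standard-library illustration, and the rule that at least one positive must remain for training is an assumption rather than documented behavior.

```python
import random


def original_pu_split(labels, unlabeled_positives_ratio=0.2,
                      positive_label=1, random_state=42):
    """Sketch of an original-ratio PU split over row indices (not payn's code)."""
    c = unlabeled_positives_ratio
    if not 0.0 <= c < 1.0:
        raise ValueError("unlabeled_positives_ratio must be in [0, 1)")
    pos = [i for i, y in enumerate(labels) if y == positive_label]
    neg = [i for i, y in enumerate(labels) if y != positive_label]

    # Hidden positives needed so that they make up a share c of the unlabeled set.
    n_hidden = round(c * len(neg) / (1.0 - c))
    # Assumption: at least one positive must remain as a known positive.
    if n_hidden >= len(pos):
        raise ValueError("insufficient true positives for the requested ratio")

    rng = random.Random(random_state)  # seeded for reproducibility
    hidden = set(rng.sample(pos, n_hidden))
    return {
        "train": [i for i in pos if i not in hidden],  # known positives
        "unlabeled": sorted(hidden) + neg,             # hidden positives + all negatives
    }


# 150 positives and 400 negatives, target concentration 0.2.
labels = [1] * 150 + [0] * 400
split = original_pu_split(labels, unlabeled_positives_ratio=0.2)
hidden_count = sum(1 for i in split["unlabeled"] if labels[i] == 1)
concentration = hidden_count / len(split["unlabeled"])
```

Every row ends up in either "train" or "unlabeled", which matches the statement that no data is discarded.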
spy_infiltration(true_pos_train_data, unlabeled_data, meta_column_name=None, application_mode=None, schema=None)
Infiltrate spies into the unlabeled data, returning a new spy-infused training set.
Selects a subset of the True Positive training data, re-labels them as "Spy", sets their label to 0 (Negative), and mixes them into the Unlabeled pool.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `true_pos_train_data` | `DataFrame` | Known positive training data. | required |
| `unlabeled_data` | `DataFrame` | Unlabeled data to be infiltrated with spies. | required |
| `meta_column_name` | `str` | Column name to assign spy role labels. | `None` |
| `application_mode` | `str` | Application mode for schema validation. | `None` |
| `schema` | `DataSchema` | Optional DataSchema instance for validating output. | `None` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The spy-infused training dataset (Positives + Unlabeled with Spies). |
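The infiltration step described above can be sketched with plain Python records instead of DataFrames. This is an illustration under assumptions, not the library's implementation: the function `infiltrate_spies`, the column names `s` (training label) and `role`, and the role strings are hypothetical stand-ins for the configurable label and role columns.

```python
import random


def infiltrate_spies(positives, unlabeled, spy_rate=0.2, random_state=42):
    """Return one training set: known positives plus unlabeled rows with spies."""
    rng = random.Random(random_state)  # seeded, so the spy choice is repeatable
    n_spies = round(spy_rate * len(positives))
    spy_idx = set(rng.sample(range(len(positives)), n_spies))

    rows = []
    for i, row in enumerate(positives):
        if i in spy_idx:
            # Spies are trained as negatives (s = 0) but keep their true label y.
            rows.append({**row, "s": 0, "role": "spy"})
        else:
            rows.append({**row, "s": 1, "role": "positive"})
    for row in unlabeled:
        rows.append({**row, "s": 0, "role": "unlabeled"})
    return rows


# Ten known positives and twenty unlabeled rows (five of them hidden positives).
positives = [{"id": f"p{i}", "y": 1} for i in range(10)]
unlabeled = [{"id": f"u{i}", "y": 1 if i < 5 else 0} for i in range(20)]

training_set = infiltrate_spies(positives, unlabeled, spy_rate=0.2)
spies = [row for row in training_set if row["role"] == "spy"]
```

Keeping the true label `y` alongside the training label `s` is what later allows the spies' predicted scores to be read back as scores of known positives.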