Optimization

Hyperparameter Optimization (payn.Optimization.Optimization)

The payn.Optimization module provides a unified interface for hyperparameter tuning, supporting both CatBoostClassifier and CatBoostRegressor architectures. It ensures consistent model evaluation by wrapping the training logic and enforcing reproducibility constraints across different optimization strategies.

  • Search Space: The search space is constructed dynamically from the user configuration (config.yaml). The module supports a wide range of CatBoost parameters, including learning_rate (log-uniform distribution), depth (integer), and iterations (integer), as well as several structural parameters (grow_policy, subsample, colsample_bylevel, min_data_in_leaf, one_hot_max_size, max_bin, and l2_leaf_reg). By default in this work, only depth and learning_rate were optimized, with early stopping applied within a budget of 1000 iterations. A sketch of the corresponding configuration keys follows this list.
  • Metric-Aware Directionality: The module automatically maps the chosen evaluation metric (e.g., RMSE, Logloss, F1, MCC) to the appropriate optimization direction (minimize or maximize) via an internal eval_direction_dict. This ensures that the objective function correctly rewards or penalizes trial outcomes without manual intervention.
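
The keys below mirror what from_config (further down this page) reads. This is a hedged sketch rather than a verbatim config.yaml: the "spy_model" section name stands in for target_model.config_key, and the numeric bounds are illustrative.

config = {
    "general": {"random_seed": 42},
    "featurisation": {"combined_features_column_name": "features"},
    "spy_model": {                       # assumed section name; actually keyed by target_model.config_key
        "training_target_column_name": "target",
        "validation_target_column_name": "target",
    },
    "optimisation": {
        "type": "Bayesian",              # or "Grid"
        "iterations": 50,                # default trial budget used in this work
        "search_space": ["depth", "learning_rate"],
    },
    "catboost": {
        "max_depth": 12,                 # illustrative bounds, not necessarily the study's values
        "max_iterations": 1000,
        "min_learning_rate": 1e-3,
        "max_bin": 254,
    },
}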

Base class for hyperparameter optimization strategies.

Attributes:

    target_model (Any): Instance of the model wrapper (e.g., SpyModel).
    random_state (int): Random seed for reproducibility.
    logger (Optional[Logger]): Logger instance for tracking experiments.
    feature_column_name (Optional[str]): Name of the column containing feature vectors.
    training_target_column_name (Optional[str]): Name of the training target column.
    evaluation_target_column_name (Optional[str]): Name of the evaluation target column.
    optimization_type (Optional[str]): Type of optimization ('Bayesian' or 'Grid').
    optimisation_iterations (Optional[int]): Number of optimization iterations.
    search_space (Union[List[str], Dict[str, Any], None]): Search space configuration.
    catboost_max_depth (Optional[int]): Maximum depth constraint for CatBoost.
    catboost_max_iterations (Optional[int]): Maximum iterations constraint for CatBoost.
    catboost_min_learning_rate (Optional[float]): Minimum learning rate constraint.
    catboost_max_bin (Optional[int]): Maximum bin size constraint.

Source code in payn\Optimization\optimization.py
class Optimization:
    """
    Base class for hyperparameter optimization strategies.

    Attributes:
        target_model (Any): Instance of the model wrapper (e.g., SpyModel).
        random_state (int): Random seed for reproducibility.
        logger (Optional[Logger]): Logger instance for tracking experiments.
        feature_column_name (Optional[str]): Name of the column containing feature vectors.
        training_target_column_name (Optional[str]): Name of the training target column.
        evaluation_target_column_name (Optional[str]): Name of the evaluation target column.
        optimization_type (Optional[str]): Type of optimization ('Bayesian' or 'Grid').
        optimisation_iterations (Optional[int]): Number of optimization iterations.
        search_space (Union[List[str], Dict[str, Any], None]): Search space configuration.
        catboost_max_depth (Optional[int]): Maximum depth constraint for CatBoost.
        catboost_max_iterations (Optional[int]): Maximum iterations constraint for CatBoost.
        catboost_min_learning_rate (Optional[float]): Minimum learning rate constraint.
        catboost_max_bin (Optional[int]): Maximum bin size constraint.
    """
    def __init__(self, target_model: Any, random_state: int, logger: Optional[Logger] = None,
                 feature_column_name: Optional[str] = None, training_target_column_name: Optional[str] = None, evaluation_target_column_name: Optional[str] = None,
                 optimization_type: Optional[str] = None, optimisation_iterations: Optional[int] = None, search_space: Union[List[str], Dict[str, Any], None] = None,
                 catboost_max_depth: Optional[int] = None, catboost_max_iterations: Optional[int] = None, catboost_min_learning_rate: Optional[float] = None,
                 catboost_max_bin: Optional[int] = None, **kwargs: Any) -> None:
        """
        Initialize the Optimization class.

        Args:
            target_model (Any): Instance of the model wrapper (e.g., SpyModel).
            random_state (int): Random seed for reproducibility.
            logger (Logger, optional): Instance of Logger class for logging purposes.
            feature_column_name (str, optional): Column name of the feature vectors.
            training_target_column_name (str, optional): Column name of target variable for training data.
            evaluation_target_column_name (str, optional): Column name of target variable for evaluation data.
            optimization_type (str, optional): Type of optimization to perform (Bayesian or Grid).
            optimisation_iterations (int, optional): Number of iterations of optimization.
            search_space (List, optional): List of hyperparameters to optimize.
            catboost_max_depth (int, optional): Maximum depth of catboost trees.
            catboost_max_iterations (int, optional): Maximum number of iterations for catboost.
            catboost_min_learning_rate (float, optional): Minimum learning rate for catboost.
            catboost_max_bin (int, optional): Maximum bin size for catboost.
            **kwargs (Any): Additional keyword arguments.
        """
        self.target_model = target_model
        self.random_state = random_state
        self.logger = logger
        # Optional parameters; if not provided via the constructor, they can be set later.
        self.feature_column_name = feature_column_name
        self.training_target_column_name = training_target_column_name
        self.evaluation_target_column_name = evaluation_target_column_name
        self.optimization_type = optimization_type
        self.optimisation_iterations = optimisation_iterations
        self.search_space = search_space
        self.catboost_max_depth = catboost_max_depth
        self.catboost_max_iterations = catboost_max_iterations
        self.catboost_min_learning_rate = catboost_min_learning_rate
        self.catboost_max_bin = catboost_max_bin
        self.eval_direction_dict = {
            "MAE": "minimize",
            "MSE": "minimize",
            "RMSE": "minimize",
            "R2": "maximize",
            "F1": "maximize",
            "TotalF1": "maximize",
            "Accuracy": "maximize",
            "Precision": "maximize",
            "BalancedAccuracy": "maximize",
            "Logloss": "minimize",
            "AUC": "maximize",
            "Recall": "maximize",
            "PRAUC": "maximize",
            "MCC": "maximize"
        }

    @classmethod
    def from_config(cls, config: Dict[str, Any], target_model: Any, logger: Optional[Logger] = None, **kwargs: Any) -> "Optimization":
        """
        Alternative constructor that extracts the required parameters from a config object.

        Args:
            config (Dict[str, Any]): Configuration dictionary.
            target_model (Any): The model wrapper instance.
            logger (Logger, optional): Logger instance.
            **kwargs (Any): Additional keyword arguments.

        Returns:
            Optimization: An initialized Optimization instance.
        """
        return cls(
            target_model = target_model,
            random_state = config["general"]["random_seed"],
            logger = logger,
            feature_column_name = config["featurisation"]["combined_features_column_name"],
            training_target_column_name = config[target_model.config_key]["training_target_column_name"],
            evaluation_target_column_name = config[target_model.config_key]["validation_target_column_name"],
            optimization_type = config["optimisation"]["type"],
            optimisation_iterations = config["optimisation"]["iterations"],
            search_space = config["optimisation"]["search_space"],
            catboost_max_depth = config["catboost"]["max_depth"],
            catboost_max_iterations = config["catboost"]["max_iterations"],
            catboost_min_learning_rate = config["catboost"]["min_learning_rate"],
            catboost_max_bin = config["catboost"]["max_bin"],
            **kwargs
        )

    def _prepare_pool(self, data: pd.DataFrame,
                      feature_column_name: Optional[str] = None,
                      training_target_column_name: Optional[str] = None,
                      evaluation_target_column_name: Optional[str] = None) -> Pool:
        """
        Prepare a CatBoost Pool from the data using the specified feature and target column names.

        Args:
            data (pd.DataFrame): Dataset containing features and target labels.
            feature_column_name (str, optional): Column name for features.
            training_target_column_name (str, optional): Column name for training target.
            evaluation_target_column_name (str, optional): Column name for evaluation target.

        Returns:
            Pool: A CatBoost Pool object.

        Raises:
            ValueError: If neither training nor evaluation target column name is provided.
        """
        feature_column_name = feature_column_name or self.feature_column_name

        if training_target_column_name:
            return Pool(
                data = pd.DataFrame(data[feature_column_name].to_list()),
                label = data[training_target_column_name]
            )
        elif evaluation_target_column_name:
            return Pool(
                data =  pd.DataFrame(data[feature_column_name].to_list()),
                label = data[evaluation_target_column_name]
            )
        else:
            raise ValueError(f"No valid column name for training or evaluation provided: {training_target_column_name}, {evaluation_target_column_name}")


    def optimize(self, train_data: pd.DataFrame, val_data: pd.DataFrame, test_data: Optional[pd.DataFrame] = None,
                 optimization_type: Optional[str] = None, **kwargs: Any,)-> Union[CatBoostClassifier, CatBoostRegressor]:
        """
        Optimize the model using the specified optimization technique.

        Args:
            train_data (pd.DataFrame): Training dataset.
            val_data (pd.DataFrame): Validation dataset.
            test_data (pd.DataFrame, optional): Test dataset.
            optimization_type (str, optional): Override for the optimization type.

        Returns:
            Union[CatBoostClassifier, CatBoostRegressor]: The optimized model.

        Raises:
            ValueError: If the optimization type is unsupported.
        """
        optimization_type = optimization_type or self.optimization_type
        if optimization_type == "Bayesian":
            # Pass all required parameters to the BayesianOptimization subclass.
            return BayesianOptimization(
                target_model=self.target_model,
                random_state=self.random_state,
                logger=self.logger,
                feature_column_name=self.feature_column_name,
                training_target_column_name=self.training_target_column_name,
                evaluation_target_column_name=self.evaluation_target_column_name,
                optimization_type=self.optimization_type,
                optimisation_iterations=self.optimisation_iterations,
                search_space=self.search_space,
                catboost_max_depth=self.catboost_max_depth,
                catboost_max_iterations=self.catboost_max_iterations,
                catboost_min_learning_rate=self.catboost_min_learning_rate,
                catboost_max_bin=self.catboost_max_bin
            ).optimize(train_data, val_data, test_data)
        elif optimization_type == "Grid":
            return GridOptimization(
                target_model=self.target_model,
                random_state=self.random_state,
                logger=self.logger,
                feature_column_name=self.feature_column_name,
                training_target_column_name=self.training_target_column_name,
                evaluation_target_column_name=self.evaluation_target_column_name,
                optimization_type=self.optimization_type,
                optimisation_iterations=self.optimisation_iterations,
                search_space=self.search_space,
                catboost_max_depth=self.catboost_max_depth,
                catboost_max_iterations=self.catboost_max_iterations,
                catboost_min_learning_rate=self.catboost_min_learning_rate,
                catboost_max_bin=self.catboost_max_bin
            ).optimize(train_data, val_data, test_data)
        else:
            raise ValueError(f"Unsupported optimization type: {optimization_type}")

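A minimal usage sketch, assuming a configured model wrapper (e.g., a SpyModel instance exposing config_key, eval_metric, etc.) and a config dict like the one sketched above. Note that _prepare_pool expects the feature column to hold one feature vector (a list) per row, which it expands into a DataFrame for the CatBoost Pool:

import pandas as pd

# Toy data layout: a list-valued feature column plus a target column.
train_df = pd.DataFrame({
    "features": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
    "target":   [0, 1, 0, 1],
})
val_df = train_df.copy()

# spy_model and config are assumed to exist already (see the config sketch above).
optimizer = Optimization.from_config(config=config, target_model=spy_model)
best_model = optimizer.optimize(train_data=train_df, val_data=val_df)  # dispatches to Bayesian or Grid
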
__init__(target_model, random_state, logger=None, feature_column_name=None, training_target_column_name=None, evaluation_target_column_name=None, optimization_type=None, optimisation_iterations=None, search_space=None, catboost_max_depth=None, catboost_max_iterations=None, catboost_min_learning_rate=None, catboost_max_bin=None, **kwargs)

Initialize the Optimization class.

Parameters:

    target_model (Any, required): Instance of the model wrapper (e.g., SpyModel).
    random_state (int, required): Random seed for reproducibility.
    logger (Logger, default None): Instance of the Logger class for logging purposes.
    feature_column_name (str, default None): Column name of the feature vectors.
    training_target_column_name (str, default None): Column name of the target variable for training data.
    evaluation_target_column_name (str, default None): Column name of the target variable for evaluation data.
    optimization_type (str, default None): Type of optimization to perform ('Bayesian' or 'Grid').
    optimisation_iterations (int, default None): Number of optimization iterations.
    search_space (List, default None): Hyperparameters to optimize.
    catboost_max_depth (int, default None): Maximum depth of CatBoost trees.
    catboost_max_iterations (int, default None): Maximum number of CatBoost iterations.
    catboost_min_learning_rate (float, default None): Minimum learning rate for CatBoost.
    catboost_max_bin (int, default None): Maximum bin size for CatBoost.
    **kwargs (Any): Additional keyword arguments.

from_config(config, target_model, logger=None, **kwargs) classmethod

Alternative constructor that extracts the required parameters from a config object.

Parameters:

    config (Dict[str, Any], required): Configuration dictionary.
    target_model (Any, required): The model wrapper instance.
    logger (Logger, default None): Logger instance.
    **kwargs (Any): Additional keyword arguments.

Returns:

    Optimization: An initialized Optimization instance.


optimize(train_data, val_data, test_data=None, optimization_type=None, **kwargs)

Optimize the model using the specified optimization technique.

Parameters:

    train_data (DataFrame, required): Training dataset.
    val_data (DataFrame, required): Validation dataset.
    test_data (DataFrame, default None): Test dataset.
    optimization_type (str, default None): Override for the optimization type.

Returns:

    Union[CatBoostClassifier, CatBoostRegressor]: The optimized model.

Raises:

    ValueError: If the optimization type is unsupported.


Bayesian Optimization (payn.Optimization.BayesianOptimization)

This strategy integrates the Optuna framework to perform efficient exploration of high-dimensional hyperparameter spaces.

  • Tree-structured Parzen Estimator (TPE): The optimization utilizes a TPE sampler, which models the probability of hyperparameter values given past results. This allows for significantly more efficient traversal of the search space compared to random or grid search methods. In this study, Bayesian Optimization was configured with 50 iterations by default, demonstrating convergence for both SpyModel and Regression tasks.
  • Reproducibility: The TPESampler is explicitly instantiated with a fixed random_state. This guarantees that the sequence of hyperparameter suggestions is deterministic across experimental runs, eliminating variability caused by the stochastic nature of the sampling process; a minimal sketch follows this list.
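
A minimal, self-contained sketch of this guarantee using plain Optuna (independent of payn): two studies created with identically seeded TPESamplers produce the same suggestion sequence.

import optuna
from optuna.samplers import TPESampler

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return x ** 2  # toy objective

runs = []
for _ in range(2):
    study = optuna.create_study(direction="minimize", sampler=TPESampler(seed=42))
    study.optimize(objective, n_trials=5, show_progress_bar=False)
    runs.append([t.params["x"] for t in study.trials])

assert runs[0] == runs[1]  # identical seeds -> identical suggestion sequences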

Bases: Optimization

Subclass for performing Bayesian Optimization using Optuna.

Source code in payn\Optimization\optimization.py
class BayesianOptimization(Optimization):
    """
    Subclass for performing Bayesian Optimization using Optuna.
    """

    def _suggest_params(self, trial: optuna.Trial, search_space: Optional[List[str]] = None) -> Dict[str, Any]:
        """
        Suggest hyperparameters based on the configuration's search space.

        Args:
            trial (optuna.Trial): The trial object used for parameter suggestions.
            search_space (List[str], optional): Override for the search space keys.

        Returns:
            Dict[str, Any]: Dictionary of suggested hyperparameters.
        """
        search_space = search_space or self.search_space
        params = {}

        # Define the search space dynamically based on config
        if "learning_rate" in search_space:
            params["learning_rate"] = trial.suggest_float(name="learning_rate", low=self.catboost_min_learning_rate, high=1, log=True)
        if "depth" in search_space:
            params["depth"] = trial.suggest_int(name="depth", low=1, high=self.catboost_max_depth)
        if "subsample" in search_space:
            params["subsample"] = trial.suggest_float(name="subsample", low=0.05, high=1.0)
        if "colsample_bylevel" in search_space:
            params["colsample_bylevel"] = trial.suggest_float(name="colsample_bylevel", low=0.05, high=1.0)
        if "min_data_in_leaf" in search_space:
            params["min_data_in_leaf"] = trial.suggest_int(name="min_data_in_leaf", low=1, high=100)
        if "l2_leaf_reg" in search_space:
            params["l2_leaf_reg"] = trial.suggest_float(name="l2_leaf_reg",low=1e-5, high=10.0, log=True)
        if "grow_policy" in search_space:
            params["grow_policy"] = trial.suggest_categorical(name="grow_policy",choices=["SymmetricTree", "Depthwise", "Lossguide"])
        if "one_hot_max_size" in search_space:
            params["one_hot_max_size"] = trial.suggest_int(name="one_hot_max_size", low=2, high=10)
        if "max_bin" in search_space:
            params["max_bin"] = trial.suggest_int("max_bin", low=2, high=self.catboost_max_bin)
        if "iterations" in search_space:
            params["iterations"] = trial.suggest_int("iterations", low=1, high=self.catboost_max_iterations)
        return params

    def objective(self, trial: optuna.Trial, train_data: pd.DataFrame, val_data: pd.DataFrame, eval_direction: str,
                  feature_column_name: Optional[str] = None, training_target_column_name: Optional[str] = None, evaluation_target_column_name: Optional[str] = None) -> float:
        """
        Objective function to optimize CatBoost hyperparameters.

        Args:
            trial (optuna.Trial): Trial object for suggestions.
            train_data (pd.DataFrame): Training dataset.
            val_data (pd.DataFrame): Validation dataset.
            eval_direction (str): Either 'minimize' or 'maximize', depending on the metric.
            feature_column_name (str, optional): Override for feature column name.
            training_target_column_name (str, optional): Override for training target column name.
            evaluation_target_column_name (str, optional): Override for evaluation target column name.

        Returns:
            float: Validation score for current hyperparameters.
        """
        with mlflow.start_run(run_name=f"Trial_{trial.number}", nested=True):
            feature_column_name = feature_column_name or self.feature_column_name
            training_target_column_name = training_target_column_name or self.training_target_column_name
            evaluation_target_column_name = evaluation_target_column_name or self.evaluation_target_column_name

            # Determine evaluation direction and corresponding penalty value.
            penalty_value = float('-inf') if eval_direction == "maximize" else float('inf')

            # Prepare the CatBoost Pool for training and validation
            train_pool = self._prepare_pool(data=train_data, feature_column_name=feature_column_name, training_target_column_name=training_target_column_name)
            val_pool = self._prepare_pool(data=val_data, feature_column_name=feature_column_name, evaluation_target_column_name=evaluation_target_column_name)

            # Thread count from the SLURM environment; falls back to -1 (all available cores) when unset.
            n_threads = int(os.getenv("SLURM_CPUS_PER_TASK", -1))

            # Base CatBoost parameters
            params = {
                "random_state": self.target_model.random_state,
                "eval_metric": self.target_model.eval_metric,
                "verbose": False,
                "thread_count": n_threads
            }

            params.update(self._suggest_params(trial))

            # Define model with the suggested parameters
            model_class = CatBoostRegressor if isinstance(self.target_model, RegModel) else CatBoostClassifier
            model = model_class(**params)

            # Log hyperparameters before training
            if self.logger:
                # self.logger.log_model_hyperparameters(CatBoostClassifier(**params), **params)
                self.logger.log_model_hyperparameters(model, **params)

            # MLflow callback for feedback about training progress
            # mlflow_callback = MLflowCatBoostCallback(eval_metric=self.target_model.eval_metric, logger=self.logger)

            try:
                model.fit(X=train_pool, eval_set=val_pool, use_best_model=False) # , callbacks=[mlflow_callback]
            except Exception as e:
                if self.logger:
                    self.logger.log_message(f"Trial {trial.number} failed during training: {e}")
                return penalty_value  # Return penalty on failure

            if self.logger:
                # Per-trial model logging deactivated during optimisation to reduce storage usage
                # self.logger.log_model(model, f"optimized_{self.target_model.config_key}_fold_{self.target_model.fold_index}")
                self.logger.log_model_attributes(model)

            try:
                score = model.evals_result_["validation"][self.target_model.eval_metric][-1]
            except Exception as e:
                if self.logger:
                    self.logger.log_message(f"Failed to extract evaluation metric for trial {trial.number}: {e}")
                score = penalty_value

        return score

    def optimize(self, train_data: pd.DataFrame, val_data: pd.DataFrame, test_data: Optional[pd.DataFrame] = None, iterations: Optional[int] = None) -> Union[CatBoostClassifier, CatBoostRegressor]:
        """
        Perform Bayesian optimization to find the best hyperparameters.

        Args:
            train_data (pd.DataFrame): Training dataset.
            val_data (pd.DataFrame): Validation dataset.
            test_data (pd.DataFrame, optional): Test dataset.
            iterations (int, optional): Override for number of trials.

        Returns:
            CatBoostClassifier or CatBoostRegressor: Trained model with best hyperparameters.
        """
        iterations = iterations or self.optimisation_iterations

        eval_direction = self.eval_direction_dict[self.target_model.eval_metric]
        sampler = TPESampler(seed=self.random_state)
        study = optuna.create_study(direction=eval_direction, sampler=sampler, study_name=f"Bayesian_Optimization_{self.target_model.config_key}_Fold_{self.target_model.fold_index}")

        study.optimize(
            lambda trial: self.objective(trial=trial, train_data=train_data, val_data=val_data, eval_direction=eval_direction),
            n_trials=iterations,
            show_progress_bar=False
        )

        if self.logger:
            self.logger.log_optuna_study(study)
            visualizer = Visualisation(data=train_data, logger=self.logger)
            self.logger.log_study_visualizations(study, visualizer)

        # Update the SpyModel with the best hyperparameters (including defaults if not optimized)
        best_params = study.best_params.copy()

        # Thread count from the SLURM environment; falls back to -1 (all available cores) when unset.
        n_threads = int(os.getenv("SLURM_CPUS_PER_TASK", -1))

        fixed_params = {
            "random_state": self.target_model.random_state,
            "eval_metric": self.target_model.eval_metric,
            "verbose": self.target_model.verbose,
            "thread_count": n_threads
        }
        best_params.update({key: value for key, value in fixed_params.items() if key not in best_params})

        # Retrain the Spy model with the best parameters
        self.target_model.train(
            train_data=train_data,
            val_data=val_data,
            test_data=test_data,
            **best_params,
        )

        return self.target_model.model

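Because _suggest_params only draws the parameters named in search_space, the effective dimensionality of the search is controlled entirely by the configuration. A small sketch using optuna.trial.FixedTrial; target_model is not needed by this method, so None is passed purely for illustration:

import optuna

opt = BayesianOptimization(
    target_model=None,             # unused by _suggest_params
    random_state=0,
    search_space=["learning_rate", "depth"],
    catboost_min_learning_rate=1e-3,
    catboost_max_depth=8,
)
trial = optuna.trial.FixedTrial({"learning_rate": 0.05, "depth": 4})
print(opt._suggest_params(trial))  # {'learning_rate': 0.05, 'depth': 4}
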
objective(trial, train_data, val_data, eval_direction, feature_column_name=None, training_target_column_name=None, evaluation_target_column_name=None)

Objective function to optimize CatBoost hyperparameters.

Parameters:

    trial (Trial, required): Trial object for suggestions.
    train_data (DataFrame, required): Training dataset.
    val_data (DataFrame, required): Validation dataset.
    eval_direction (str, required): Either 'minimize' or 'maximize', depending on the metric.
    feature_column_name (str, default None): Override for the feature column name.
    training_target_column_name (str, default None): Override for the training target column name.
    evaluation_target_column_name (str, default None): Override for the evaluation target column name.

Returns:

    float: Validation score for the current hyperparameters.


optimize(train_data, val_data, test_data=None, iterations=None)

Perform Bayesian optimization to find the best hyperparameters.

Parameters:

    train_data (DataFrame, required): Training dataset.
    val_data (DataFrame, required): Validation dataset.
    test_data (DataFrame, default None): Test dataset.
    iterations (int, default None): Override for the number of trials.

Returns:

    Union[CatBoostClassifier, CatBoostRegressor]: Trained model with the best hyperparameters.


Grid Optimization (payn.Optimization.GridOptimization)

This strategy implements a combinatorial search over a manually defined grid of hyperparameter values.

  • Combinatorial Search Space: The module generates every combination of the provided parameter lists via a Cartesian product (itertools.product), as sketched after this list. Provided the grid resolution is sufficient and the iteration cap does not truncate the grid, the best combination within the defined discrete space is guaranteed to be evaluated.
  • Deterministic Execution: Unlike stochastic search methods, Grid Optimization is inherently deterministic.
  • Use Case: While computationally more expensive than Bayesian methods for high-dimensional spaces, this strategy serves as a robust baseline for validating the stability of specific hyperparameters (e.g., depth or learning_rate) in isolation.
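
The grid expansion follows the same Cartesian-product pattern as the source below; a standalone sketch with illustrative values:

from itertools import product

search_space = {"depth": [4, 6, 8], "learning_rate": [0.03, 0.1]}
combos = [dict(zip(search_space.keys(), comb))
          for comb in product(*search_space.values())]
print(len(combos))  # 6 combinations
print(combos[0])    # {'depth': 4, 'learning_rate': 0.03}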

Bases: Optimization

Subclass for performing Grid Search optimization.

Source code in payn\Optimization\optimization.py
class GridOptimization(Optimization):
    """
    Subclass for performing Grid Search optimization.
    """
    def optimize(self, train_data: pd.DataFrame, val_data: pd.DataFrame, test_data: Optional[pd.DataFrame],
                 search_space: Optional[Dict[str, List[Any]]] = None, eval_direction: Optional[str] = None,
                 iterations: Optional[int] = None, **kwargs: Any) -> Union[CatBoostClassifier, CatBoostRegressor]:
        """
        Perform Grid Search optimization to find the best hyperparameters.

        Args:
            train_data (pd.DataFrame): Training dataset.
            val_data (pd.DataFrame): Validation dataset.
            test_data (pd.DataFrame): Test dataset.
            search_space (Dict[str, List[Any]], optional): Override for the search space.
            eval_direction (str, optional): Override for evaluation direction.
            iterations (int, optional): Override for number of grid combinations to try (if applicable).

        Returns:
            CatBoostClassifier or CatBoostRegressor: The trained model with the best hyperparameters.
        """
        search_space = search_space or self.search_space
        iterations = iterations or self.optimisation_iterations

        # Honour an explicit override; otherwise derive the direction from the metric.
        eval_direction = eval_direction or self.eval_direction_dict[self.target_model.eval_metric]

        best_score = float('-inf') if eval_direction == "maximize" else float('inf')
        best_params = {}
        all_combinations = list(product(*[search_space[param] for param in search_space]))
        # Optionally cap the number of grid combinations evaluated.
        if iterations is not None and iterations < len(all_combinations):
            all_combinations = all_combinations[:iterations]

        for comb in all_combinations:
            trial_params = dict(zip(search_space.keys(), comb))
            params = {
                "random_state": self.target_model.random_state,
                "eval_metric": self.target_model.eval_metric,
                "verbose": False
            }
            params.update(trial_params)
            model_class = CatBoostRegressor if isinstance(self.target_model, RegModel) else CatBoostClassifier
            model = model_class(**params)

            if self.logger:
                self.logger.log_model_hyperparameters(model, **params)

            # Prepare training and validation pools.
            train_pool = self._prepare_pool(data=train_data, feature_column_name=self.feature_column_name, training_target_column_name=self.training_target_column_name)
            val_pool = self._prepare_pool(data=val_data, feature_column_name=self.feature_column_name, evaluation_target_column_name=self.evaluation_target_column_name)

            try:
                model.fit(X=train_pool, eval_set=val_pool, use_best_model=False)
            except Exception as e:
                if self.logger:
                    self.logger.log_message(f"Grid search parameters {trial_params} failed: {e}")
                continue

            try:
                score = model.evals_result_["validation"][self.target_model.eval_metric][-1]
            except Exception as e:
                if self.logger:
                    self.logger.log_message(f"Failed to extract metric for grid search parameters {trial_params}: {e}")
                continue

            if self.logger:
                self.logger.log_message(f"Grid search {trial_params} achieved score {score}")

            # Update best parameters based on evaluation direction.
            if (eval_direction == "maximize" and score > best_score) or (eval_direction == "minimize" and score < best_score):
                best_score = score
                best_params = trial_params

        if self.logger:
            self.logger.log_message(f"Best grid search parameters: {best_params} with score {best_score}")

        fixed_params = {
            "random_state": self.target_model.random_state,
            "eval_metric": self.target_model.eval_metric,
            "verbose": self.target_model.verbose,
        }
        best_params.update({key: value for key, value in fixed_params.items() if key not in best_params})

        self.target_model.train(
            train_data=train_data,
            val_data=val_data,
            test_data=test_data,
            **best_params,
        )

        return self.target_model.model

optimize(train_data, val_data, test_data, search_space=None, eval_direction=None, iterations=None, **kwargs)

Perform Grid Search optimization to find the best hyperparameters.

Parameters:

    train_data (DataFrame, required): Training dataset.
    val_data (DataFrame, required): Validation dataset.
    test_data (DataFrame, required): Test dataset.
    search_space (Dict[str, List[Any]], default None): Override for the search space.
    eval_direction (str, default None): Override for the evaluation direction.
    iterations (int, default None): Override for the number of grid combinations to try.

Returns:

    Union[CatBoostClassifier, CatBoostRegressor]: The trained model with the best hyperparameters.
