Regression Model

Yield prediction Regression Model (payn.RegModel.RegModel)

Wraps a CatBoostRegressor, selected for its native handling of categorical features and its robust performance on tabular chemical data without extensive preprocessing. Other model architectures could be substituted here.

  • Parallelisation: Reads the SLURM_CPUS_PER_TASK environment variable to set CatBoost's thread_count on SLURM clusters. When the variable is unset (for example, on a local machine), thread_count falls back to -1, which lets CatBoost use all available CPU cores.
  • Determinism: Random seeds are propagated strictly from the global config to the CatBoost engine (random_state).
  • Logging: RegModel is tightly coupled with the payn.Logging system. When a logger is supplied, it logs the hyperparameters before fitting, then logs the trained model artifact and, if test data is passed, the test-set evaluation metrics to the MLflow run immediately after training.

RegModel encapsulates the CatBoostRegressor model for yield regression.
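
The thread-count lookup described under Parallelisation can be sketched in isolation. This is a minimal illustration, not the library's code: `resolve_thread_count` is a hypothetical helper name, and the environment is passed in as a mapping so the behaviour can be checked without a SLURM cluster.

```python
import os
from typing import Mapping, Optional


def resolve_thread_count(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the thread_count value implied by the environment.

    Hypothetical helper mirroring the lookup done inside RegModel.train.
    """
    env = os.environ if env is None else env
    # Use the SLURM allocation when it is exported for the job.
    # -1 tells CatBoost to use all available CPU cores.
    return int(env.get("SLURM_CPUS_PER_TASK", "-1"))


print(resolve_thread_count({"SLURM_CPUS_PER_TASK": "8"}))  # 8
print(resolve_thread_count({}))  # -1
```

Passing thread_count explicitly through the keyword arguments of train overrides this value, since user-supplied hyperparameters take precedence over the defaults.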

Attributes:

Name | Type | Description
config_key | str | The key in the config dict relevant to the RegModel.
logger | Optional[Logger] | Logger instance for logging model training and evaluation.
fold_index | int | Index of the current fold (for cross-validation purposes).
random_state | int | Random seed.
eval_metric | str | Evaluation metric to use.
verbose | int | Verbosity level.
model | Optional[CatBoostRegressor] | The trained CatBoost model.
feature_column_name | Optional[str] | Feature column name.
training_target_column_name | Optional[str] | Target column name for training data.
validation_target_column_name | Optional[str] | Target column name for validation data.
metrics_list | Optional[List[str]] | List of metrics to track during testing.

Source code in payn\RegModel\regmodel.py
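import os
from typing import Any, Dict, List, Optional

import pandas as pd
from catboost import CatBoostRegressor, Pool

from payn.Logging import Logger  # assumed import path; the module's own import lines are not shown in this excerpt


class RegModel: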
    """
    RegModel encapsulates the CatBoostRegressor model for yield regression.

    Attributes:
        config_key (str): The key in the config dict relevant to the RegModel.
        logger (Optional[Logger]): Logger instance for logging model training and evaluation.
        fold_index (int): Index of the current fold (for cross-validation purposes).
        random_state (int): Random seed.
        eval_metric (str): Evaluation metric to use.
        verbose (int): Verbosity level.
        model (Optional[CatBoostRegressor]): The trained CatBoost model.
        feature_column_name (Optional[str]): Feature column name.
        training_target_column_name (Optional[str]): Target column name for training data.
        validation_target_column_name (Optional[str]): Target column name for validation data.
        metrics_list (Optional[List[str]]): List of metrics to track during testing.
    """

    config_key = "reg_model"
    def __init__(self, eval_metric: str, random_state: int, verbose: int, fold_index: int = 1,
                 logger: Optional[Logger] = None, feature_column_name: Optional[str] = None, training_target_column_name: Optional[str] = None,
                 validation_target_column_name: Optional[str] = None, metrics_list: Optional[List[str]] = None) -> None:
        """
        Initialize the regression model for yield prediction.

        You can either pass a config dict via the alternative constructor `from_config` or pass parameters explicitly.

        Args:
            eval_metric (str): Metric for evaluation of model performance.
            random_state (int): Random seed.
            verbose (int): Verbosity level.
            fold_index (int, optional): Index of the current fold (for cross-validation purposes).
            logger (Logger, optional): Logger instance for logging.
            feature_column_name (str, optional): Feature column name.
            training_target_column_name (str, optional): Target column name for training data.
            validation_target_column_name (str, optional): Target column name for validation data.
            metrics_list (List[str], optional): List of metrics to track during testing.
        """

        self.eval_metric = eval_metric
        self.random_state = random_state
        self.fold_index = fold_index
        self.verbose = verbose
        self.logger = logger
        self.model: Optional[CatBoostRegressor] = None

        # Optional parameters for training; if not provided, they can be set later.
        self.feature_column_name = feature_column_name
        self.training_target_column_name = training_target_column_name
        self.validation_target_column_name = validation_target_column_name
        self.metrics_list = metrics_list


    @classmethod
    def from_config(cls, config: Dict[str, Any], logger: Optional[Logger] = None, fold_index: int = 1) -> "RegModel":
        """
        Alternative constructor that creates a RegModel instance from a config object.

        Args:
            config (Dict[str, Any]): Configuration dictionary.
            logger (Logger, optional): Logger instance.
            fold_index (int): Current fold index.

        Returns:
            RegModel: An instance of RegModel with parameters extracted from the config.
        """
        return cls(
            logger=logger,
            fold_index=fold_index,
            random_state = config["general"]["random_seed"],
            eval_metric = config["reg_model"]["eval_metric"],
            verbose = config["general"]["verbose"],
            feature_column_name=config["featurisation"]["combined_features_column_name"],
            training_target_column_name=config["reg_model"]["training_target_column_name"],
            validation_target_column_name=config["reg_model"]["validation_target_column_name"],
            metrics_list=config["reg_model"]["all_metrics"]
        )

    def _prepare_pool(self, data: pd.DataFrame, label_column: Optional[str], feature_column: str) -> Pool:
        """
        Prepare a CatBoost Pool from the dataset.

        Args:
            data (pd.DataFrame): Dataset containing features and target labels.
            label_column (str, optional): Column name for target labels.
            feature_column (str): Column name for features.

        Returns:
            Pool: CatBoost Pool object.

        Raises:
            ValueError: If pool preparation fails.
        """
        try:
            pool_data = pd.DataFrame(data[feature_column].to_list())
            if label_column is not None:
                pool = Pool(data=pool_data, label=data[label_column])
            else:
                pool = Pool(data=pool_data)
            return pool
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback.
            raise ValueError(f"Error preparing CatBoost Pool: {e}") from e

    def train(
        self,
        train_data: pd.DataFrame,
        val_data: pd.DataFrame,
        test_data: Optional[pd.DataFrame] = None,
        feature_column: Optional[str] = None,
        training_label_column: Optional[str] = None,
        validation_label_column: Optional[str] = None,
        **kwargs: Any
    ) -> CatBoostRegressor:
        """
        Train the yield regression model on the given datasets.

        Args:
            train_data (pd.DataFrame): Training dataset with features and target labels.
            val_data (pd.DataFrame): Validation dataset for monitoring training progress.
            test_data (pd.DataFrame, optional): Optional test dataset for evaluation (default: None).
            feature_column (str, optional): Column name for features.
            training_label_column (str, optional): Column name for target labels in training data.
            validation_label_column (str, optional): Column name for target labels in validation data.
            **kwargs (Any): Additional keyword arguments for Hyperparameters in training.

        Returns:
            CatBoostRegressor: Trained CatBoost model.
        """
        feature_column = feature_column or self.feature_column_name
        training_label_column = training_label_column or self.training_target_column_name
        validation_label_column = validation_label_column or self.validation_target_column_name

        if isinstance(test_data, pd.DataFrame):
            test_label_column = validation_label_column
            test_pool = self._prepare_pool(data=test_data, feature_column=feature_column, label_column=test_label_column)

        train_pool = self._prepare_pool(data=train_data, feature_column=feature_column, label_column=training_label_column )
        val_pool = self._prepare_pool(data=val_data, feature_column=feature_column, label_column=validation_label_column )

        # Support for SLURM multi-threading: use the allocated CPU count if set,
        # otherwise -1, which lets CatBoost use all available cores.
        n_threads = int(os.getenv("SLURM_CPUS_PER_TASK", "-1"))

        # Combine default parameters with kwargs (kwargs take precedence)
        combined_params = {
            "random_state": self.random_state,
            "eval_metric": self.eval_metric,
            "verbose": self.verbose,
            "thread_count": n_threads,
            **kwargs,  # Override defaults with values from kwargs
        }

        # Define CatBoost model
        self.model = CatBoostRegressor(**combined_params)

        # Log user-provided and default hyperparameters
        if self.logger:
            self.logger.log_model_hyperparameters(self.model, **combined_params)

        # Create the MLflow callback
        # mlflow_callback = MLflowCatBoostCallback(eval_metric=self.eval_metric, logger=self.logger)

        # Train the model
        self.model.fit(X=train_pool, eval_set=val_pool, use_best_model=True) # , callbacks=[mlflow_callback]

        # Log the trained model and evaluation results
        if self.logger:
            self.logger.log_model(self.model, f"reg_model_fold_{self.fold_index}")
            # Log additional model attributes
            self.logger.log_model_attributes(self.model)
            if isinstance(test_data, pd.DataFrame):
                test_results = self.evaluate(test_pool)
                # Single quotes inside the f-string keep this valid before Python 3.12.
                print(f"Evaluation test data: MAE {test_results['MAE'][-1]}")
                print(f"Evaluation test data: R2 {test_results['R2'][-1]}")
                print(f"Evaluation test data: RMSE {test_results['RMSE'][-1]}")
                self.logger.log_evaluation_metrics(test_results)

        return self.model

    def evaluate(self, test_pool: Pool) -> Dict[str, Any]:
        """
        Evaluate the trained model on a test dataset.

        Args:
            test_pool (Pool): Test dataset pre-formatted as a CatBoost Pool.

        Returns:
            Dict[str, Any]: Dictionary of evaluation metrics.
        """
        if not self.model:
            raise ValueError("Model has not been trained yet. Call `train` before evaluating.")

        eval_results = self.model.eval_metrics(
            data=test_pool,
            metrics=self.metrics_list
        )
        return eval_results

    def predict(self, data: pd.DataFrame, feature_column: Optional[str] = None) -> pd.Series:
        """
        Make predictions using the trained regression model.

        Args:
            data (pd.DataFrame): Dataset containing features for prediction.
            feature_column (str, optional): Name of the feature column.

        Returns:
            pd.Series: Predicted yield values.

        Raises:
            ValueError: If model has not been trained yet.
        """
        feature_column = feature_column or self.feature_column_name
        if not self.model:
            raise ValueError("Model has not been trained yet. Call `train` before making predictions.")

        data_pool = self._prepare_pool(data, label_column=None, feature_column=feature_column)
        return pd.Series(self.model.predict(data_pool))

__init__(eval_metric, random_state, verbose, fold_index=1, logger=None, feature_column_name=None, training_target_column_name=None, validation_target_column_name=None, metrics_list=None)

Initialize the regression model for yield prediction.

You can either pass a config dict via the alternative constructor from_config or pass parameters explicitly.

Parameters:

Name | Type | Description | Default
eval_metric | str | Metric for evaluation of model performance. | required
random_state | int | Random seed. | required
verbose | int | Verbosity level. | required
fold_index | int | Index of the current fold (for cross-validation purposes). | 1
logger | Logger | Logger instance for logging. | None
feature_column_name | str | Feature column name. | None
training_target_column_name | str | Target column name for training data. | None
validation_target_column_name | str | Target column name for validation data. | None
metrics_list | List[str] | List of metrics to track during testing. | None

Source code in payn\RegModel\regmodel.py
def __init__(self, eval_metric: str, random_state: int, verbose: int, fold_index: int = 1,
             logger: Optional[Logger] = None, feature_column_name: Optional[str] = None, training_target_column_name: Optional[str] = None,
             validation_target_column_name: Optional[str] = None, metrics_list: Optional[List[str]] = None) -> None:
    """
    Initialize the regression model for yield prediction.

    You can either pass a config dict via the alternative constructor `from_config` or pass parameters explicitly.

    Args:
        eval_metric (str): Metric for evaluation of model performance.
        random_state (int): Random seed.
        verbose (int): Verbosity level.
        fold_index (int, optional): Index of the current fold (for cross-validation purposes).
        logger (Logger, optional): Logger instance for logging.
        feature_column_name (str, optional): Feature column name.
        training_target_column_name (str, optional): Target column name for training data.
        validation_target_column_name (str, optional): Target column name for validation data.
        metrics_list (List[str], optional): List of metrics to track during testing.
    """

    self.eval_metric = eval_metric
    self.random_state = random_state
    self.fold_index = fold_index
    self.verbose = verbose
    self.logger = logger
    self.model: Optional[CatBoostRegressor] = None

    # Optional parameters for training; if not provided, they can be set later.
    self.feature_column_name = feature_column_name
    self.training_target_column_name = training_target_column_name
    self.validation_target_column_name = validation_target_column_name
    self.metrics_list = metrics_list

evaluate(test_pool)

Evaluate the trained model on a test dataset.

Parameters:

Name | Type | Description | Default
test_pool | Pool | Test dataset pre-formatted as a CatBoost Pool. | required

Returns:

Type | Description
Dict[str, Any] | Dictionary of evaluation metrics.

Source code in payn\RegModel\regmodel.py
def evaluate(self, test_pool: Pool) -> Dict[str, Any]:
    """
    Evaluate the trained model on a test dataset.

    Args:
        test_pool (Pool): Test dataset pre-formatted as a CatBoost Pool.

    Returns:
        Dict[str, Any]: Dictionary of evaluation metrics.
    """
    if not self.model:
        raise ValueError("Model has not been trained yet. Call `train` before evaluating.")

    eval_results = self.model.eval_metrics(
        data=test_pool,
        metrics=self.metrics_list
    )
    return eval_results
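
As the print statements in train suggest (`test_results["MAE"][-1]`), the dictionary returned by CatBoost's eval_metrics maps each metric name to a list of values, one per evaluated iteration. The sketch below shows one way to reduce such a result to a single final value per metric. `final_metric_values` is a hypothetical helper, and the numbers in `mock_results` are invented purely for illustration.

```python
from typing import Dict, List


def final_metric_values(eval_results: Dict[str, List[float]]) -> Dict[str, float]:
    """Keep only the last recorded value of every metric (hypothetical helper)."""
    return {name: values[-1] for name, values in eval_results.items() if values}


# Invented numbers standing in for the output of RegModel.evaluate.
mock_results = {"MAE": [12.0, 9.5, 8.0], "RMSE": [15.0, 12.5, 11.0]}
print(final_metric_values(mock_results))  # {'MAE': 8.0, 'RMSE': 11.0}
```

Only metrics listed in metrics_list appear in the result, so looking up a metric that was not requested raises a KeyError.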

from_config(config, logger=None, fold_index=1) classmethod

Alternative constructor that creates a RegModel instance from a config object.

Parameters:

Name | Type | Description | Default
config | Dict[str, Any] | Configuration dictionary. | required
logger | Logger | Logger instance. | None
fold_index | int | Current fold index. | 1

Returns:

Name | Type | Description
RegModel | RegModel | An instance of RegModel with parameters extracted from the config.

Source code in payn\RegModel\regmodel.py
@classmethod
def from_config(cls, config: Dict[str, Any], logger: Optional[Logger] = None, fold_index: int = 1) -> "RegModel":
    """
    Alternative constructor that creates a RegModel instance from a config object.

    Args:
        config (Dict[str, Any]): Configuration dictionary.
        logger (Logger, optional): Logger instance.
        fold_index (int): Current fold index.

    Returns:
        RegModel: An instance of RegModel with parameters extracted from the config.
    """
    return cls(
        logger=logger,
        fold_index=fold_index,
        random_state = config["general"]["random_seed"],
        eval_metric = config["reg_model"]["eval_metric"],
        verbose = config["general"]["verbose"],
        feature_column_name=config["featurisation"]["combined_features_column_name"],
        training_target_column_name=config["reg_model"]["training_target_column_name"],
        validation_target_column_name=config["reg_model"]["validation_target_column_name"],
        metrics_list=config["reg_model"]["all_metrics"]
    )
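
The lookups in from_config imply a particular layout of the configuration dictionary. The following sketch spells that layout out and adds a small pre-flight check. The key names are taken from from_config; every value is a placeholder, and `missing_config_keys` is a hypothetical helper that is not part of payn.

```python
from typing import Any, Dict, List

# Key layout taken from RegModel.from_config; all values are placeholders.
example_config: Dict[str, Any] = {
    "general": {"random_seed": 42, "verbose": 0},
    "featurisation": {"combined_features_column_name": "features"},
    "reg_model": {
        "eval_metric": "RMSE",
        "training_target_column_name": "yield",
        "validation_target_column_name": "yield",
        "all_metrics": ["MAE", "R2", "RMSE"],
    },
}

# (section, key) pairs that from_config reads.
REQUIRED_KEYS = [
    ("general", "random_seed"),
    ("general", "verbose"),
    ("featurisation", "combined_features_column_name"),
    ("reg_model", "eval_metric"),
    ("reg_model", "training_target_column_name"),
    ("reg_model", "validation_target_column_name"),
    ("reg_model", "all_metrics"),
]


def missing_config_keys(config: Dict[str, Any]) -> List[str]:
    """List 'section.key' entries that from_config needs but config lacks."""
    return [
        f"{section}.{key}"
        for section, key in REQUIRED_KEYS
        if key not in config.get(section, {})
    ]


print(missing_config_keys(example_config))  # []
```

Running such a check before calling from_config turns a bare KeyError into a readable list of missing entries.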

predict(data, feature_column=None)

Make predictions using the trained regression model.

Parameters:

Name | Type | Description | Default
data | DataFrame | Dataset containing features for prediction. | required
feature_column | str | Name of the feature column. | None

Returns:

Type | Description
Series | Predicted yield values.

Raises:

Type | Description
ValueError | If model has not been trained yet.

Source code in payn\RegModel\regmodel.py
def predict(self, data: pd.DataFrame, feature_column: Optional[str] = None) -> pd.Series:
    """
    Make predictions using the trained regression model.

    Args:
        data (pd.DataFrame): Dataset containing features for prediction.
        feature_column (str, optional): Name of the feature column.

    Returns:
        pd.Series: Predicted yield values.

    Raises:
        ValueError: If model has not been trained yet.
    """
    feature_column = feature_column or self.feature_column_name
    if not self.model:
        raise ValueError("Model has not been trained yet. Call `train` before making predictions.")

    data_pool = self._prepare_pool(data, label_column=None, feature_column=feature_column)
    return pd.Series(self.model.predict(data_pool))

train(train_data, val_data, test_data=None, feature_column=None, training_label_column=None, validation_label_column=None, **kwargs)

Train the yield regression model on the given datasets.

Parameters:

Name | Type | Description | Default
train_data | DataFrame | Training dataset with features and target labels. | required
val_data | DataFrame | Validation dataset for monitoring training progress. | required
test_data | DataFrame | Optional test dataset for evaluation (default: None). | None
feature_column | str | Column name for features. | None
training_label_column | str | Column name for target labels in training data. | None
validation_label_column | str | Column name for target labels in validation data. | None
**kwargs | Any | Additional keyword arguments for Hyperparameters in training. | {}

Returns:

Name | Type | Description
CatBoostRegressor | CatBoostRegressor | Trained CatBoost model.

Source code in payn\RegModel\regmodel.py
def train(
    self,
    train_data: pd.DataFrame,
    val_data: pd.DataFrame,
    test_data: Optional[pd.DataFrame] = None,
    feature_column: Optional[str] = None,
    training_label_column: Optional[str] = None,
    validation_label_column: Optional[str] = None,
    **kwargs: Any
) -> CatBoostRegressor:
    """
    Train the yield regression model on the given datasets.

    Args:
        train_data (pd.DataFrame): Training dataset with features and target labels.
        val_data (pd.DataFrame): Validation dataset for monitoring training progress.
        test_data (pd.DataFrame, optional): Optional test dataset for evaluation (default: None).
        feature_column (str, optional): Column name for features.
        training_label_column (str, optional): Column name for target labels in training data.
        validation_label_column (str, optional): Column name for target labels in validation data.
        **kwargs (Any): Additional keyword arguments for Hyperparameters in training.

    Returns:
        CatBoostRegressor: Trained CatBoost model.
    """
    feature_column = feature_column or self.feature_column_name
    training_label_column = training_label_column or self.training_target_column_name
    validation_label_column = validation_label_column or self.validation_target_column_name

    if isinstance(test_data, pd.DataFrame):
        test_label_column = validation_label_column
        test_pool = self._prepare_pool(data=test_data, feature_column=feature_column, label_column=test_label_column)

    train_pool = self._prepare_pool(data=train_data, feature_column=feature_column, label_column=training_label_column )
    val_pool = self._prepare_pool(data=val_data, feature_column=feature_column, label_column=validation_label_column )

    # Support for SLURM multi-threading: use the allocated CPU count if set,
    # otherwise -1, which lets CatBoost use all available cores.
    n_threads = int(os.getenv("SLURM_CPUS_PER_TASK", "-1"))

    # Combine default parameters with kwargs (kwargs take precedence)
    combined_params = {
        "random_state": self.random_state,
        "eval_metric": self.eval_metric,
        "verbose": self.verbose,
        "thread_count": n_threads,
        **kwargs,  # Override defaults with values from kwargs
    }

    # Define CatBoost model
    self.model = CatBoostRegressor(**combined_params)

    # Log user-provided and default hyperparameters
    if self.logger:
        self.logger.log_model_hyperparameters(self.model, **combined_params)

    # Create the MLflow callback
    # mlflow_callback = MLflowCatBoostCallback(eval_metric=self.eval_metric, logger=self.logger)

    # Train the model
    self.model.fit(X=train_pool, eval_set=val_pool, use_best_model=True) # , callbacks=[mlflow_callback]

    # Log the trained model and evaluation results
    if self.logger:
        self.logger.log_model(self.model, f"reg_model_fold_{self.fold_index}")
        # Log additional model attributes
        self.logger.log_model_attributes(self.model)
        if isinstance(test_data, pd.DataFrame):
            test_results = self.evaluate(test_pool)
            # Single quotes inside the f-string keep this valid before Python 3.12.
            print(f"Evaluation test data: MAE {test_results['MAE'][-1]}")
            print(f"Evaluation test data: R2 {test_results['R2'][-1]}")
            print(f"Evaluation test data: RMSE {test_results['RMSE'][-1]}")
            self.logger.log_evaluation_metrics(test_results)

    return self.model
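
The way train merges its defaults with user-supplied hyperparameters relies on dictionary unpacking order: later entries win. A minimal standalone sketch of that merge follows. `combine_params` is a hypothetical name, and the default values shown are placeholders rather than values read from any real config.

```python
from typing import Any, Dict


def combine_params(defaults: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]:
    """Merge defaults with overrides; keys in kwargs replace matching defaults."""
    return {**defaults, **kwargs}


# Placeholder defaults in the same shape that train builds internally.
defaults = {"random_state": 42, "eval_metric": "RMSE", "verbose": 0, "thread_count": -1}

params = combine_params(defaults, iterations=500, verbose=100)
print(params["verbose"])     # 100 (overridden by the caller)
print(params["iterations"])  # 500 (added by the caller)
print(params["eval_metric"]) # RMSE (default kept)
```

One consequence of this design is that a caller can also override random_state or thread_count through the keyword arguments, so reproducibility depends on not passing a different seed there.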