Configuration Reference

The `config.yaml` file controls the entire lifecycle of the PAYN framework. Parameters are grouped below by their top-level key.

1. General Settings

Key: general

Parameter	Type	Example	Description
`user`	`str`	`"Maria"`	User or team name used for tagging experiments.
`experiment_id`	`str`	`"001"`	Unique identifier to group related runs.
`random_seed`	`int`	`42`	Global random seed for reproducibility.
`verbose`	`int`	`200`	Logging verbosity level.
`usage_mode`	`str`	`"training"`	Pipeline mode. Options: `"training"` (fully labeled data) or `"inference"` (experimental).
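
As an illustration of how these settings might be checked after loading, here is a minimal sketch. The `general` dict mirrors the example values from the table; `validate_general` and `ALLOWED_USAGE_MODES` are hypothetical names for this example and are not part of PAYN.

```python
import random

# Hypothetical in-memory form of the `general` section, using the
# example values from the table.
general = {
    "user": "Maria",
    "experiment_id": "001",
    "random_seed": 42,
    "verbose": 200,
    "usage_mode": "training",
}

# The two modes listed in the table.
ALLOWED_USAGE_MODES = {"training", "inference"}


def validate_general(cfg: dict) -> None:
    """Raise ValueError if a setting has an unexpected type or value."""
    if cfg["usage_mode"] not in ALLOWED_USAGE_MODES:
        raise ValueError(f"unknown usage_mode: {cfg['usage_mode']!r}")
    if not isinstance(cfg["random_seed"], int):
        raise ValueError("random_seed must be an int")
    if not isinstance(cfg["experiment_id"], str):
        raise ValueError("experiment_id must be a str")


validate_general(general)

# Two generators seeded with the same value produce the same draws,
# which is what makes a global seed useful for reproducibility.
rng_a = random.Random(general["random_seed"])
rng_b = random.Random(general["random_seed"])
same_draws = [rng_a.random() for _ in range(3)] == [rng_b.random() for _ in range(3)]
```

Note that YAML parsers typically read an unquoted `001` as a number, so quoting `experiment_id` in the file keeps it a `str`.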

Key: dataset

Parameter	Type	Example	Description
`file_path`	`str`	`"./data.xlsx"`	Path to the input dataset (Excel or CSV).
`sheet_name`	`str`	`"Sheet 1"`	Name of the sheet to load (if Excel).
`input_columns`	`list`	`["Nucleophile", "Electrophile"]`	Specific feature columns to use. If empty `[]`, auto-detects all non-meta columns.
`absence_flag`	`list`	`[]`	Values in raw data to treat as missing/null (e.g., `["NaN"]`).
`target_column`	`str`	`"Output"`	Name of the yield column (target variable).
`yield_limit`	`float`	`100`	Maximum yield value (e.g., `100` for %, `1.0` for fraction).
`yield_classification_threshold`	`float`	`0.2`	Threshold to binarize yield (e.g., `0.2` = 20% cutoff for Positive class).
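
The interaction between `yield_limit` and `yield_classification_threshold` can be sketched as follows. This assumes the threshold is applied to the yield after dividing by `yield_limit`, and that values exactly at the threshold count as Positive; both points are assumptions, and `binarize_yield` is a hypothetical helper.

```python
def binarize_yield(value: float, yield_limit: float = 100, threshold: float = 0.2) -> int:
    """Return 1 (Positive) if the normalised yield reaches the threshold, else 0."""
    fraction = value / yield_limit  # e.g. 35.0 / 100 -> 0.35
    # Assumption: the boundary value itself is treated as Positive.
    return 1 if fraction >= threshold else 0


# Yields given in percent, so yield_limit stays at 100.
yields = [5.0, 35.0, 63.5]
labels = [binarize_yield(y) for y in yields]

# The same cutoff for yields stored as fractions (yield_limit = 1.0).
fraction_label = binarize_yield(0.15, yield_limit=1.0)
```

With the example settings, a 5% yield falls below the 20% cutoff, while 35% and 63.5% fall above it.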

Key: featurisation

Parameter	Type	Example	Description
`method`	`str`	`"ecfp"`	Vector generation method. Options include `"ecfp"` and `"mff"`.
`ECFP_bit_length`	`int`	`2048`	Length of the fingerprint bit vector.
`ECFP_radius`	`int`	`2`	Radius for Morgan fingerprints (e.g., `2` = ECFP4).
`condense_bits`	`bool`	`True`	If `True`, removes zero-variance columns to reduce sparsity.
`existing_feature_columns`	`list`	`[]`	Columns containing pre-calculated features (e.g., DFT properties).
`combined_features_column_name`	`str`	`"FP_combined"`	Name of the final concatenated feature vector column.
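
To make the effect of `condense_bits` concrete, the sketch below drops every fingerprint column that has the same value in all rows. It uses plain Python lists and a hypothetical `condense_bits` function; how PAYN implements this step is not described here.

```python
def condense_bits(fingerprints: list[list[int]]) -> tuple[list[list[int]], list[int]]:
    """Drop columns whose value is identical in every row (zero variance).

    Returns the condensed rows and the indices of the columns that were kept.
    """
    n_cols = len(fingerprints[0])
    # A column varies if it contains more than one distinct value.
    kept = [j for j in range(n_cols) if len({row[j] for row in fingerprints}) > 1]
    condensed = [[row[j] for j in kept] for row in fingerprints]
    return condensed, kept


# Toy 4-bit fingerprints: column 0 is always 0 and column 1 is always 1.
fps = [
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
condensed, kept = condense_bits(fps)
```

Here only columns 2 and 3 vary across rows, so the condensed vectors have two bits each.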

Key: meta_columns

Parameter	Type	Default	Description
`meta_true_label_bin`	`str`	`"true_bin"`	Ground truth binary label derived from yield.
`meta_data_point_role`	`str`	`"true_role"`	Role in the Outer Split (train/val/test).
`meta_mod_label_bin`	`str`	`"spy_inf_bin"`	Modified label after Spy infiltration ($s$).
`meta_mod_data_point_role`	`str`	`"spy_inf_role"`	Role in the Inner/Spy Split (spy/unlabeled).
`meta_mod_probability_1`	`str`	`"spy_inf_prob_1"`	Probability of belonging to positive class after Spy infiltration.
`meta_mod_prediction_class`	`str`	`"spy_inf_pred_label"`	Predicted class label after Spy infiltration.
`meta_augmented_bin`	`str`	`"augm_bin"`	Binary label after Augmentation.
`meta_augmented_role`	`str`	`"augm_role"`	Role in Regression (`known_positive` or `aug_neg`).
`meta_augmented_target`	`str`	`"augm_yield"`	Target variable for regression (Reliable Negatives assigned 0).
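
As a rough sketch of how the augmented regression target could be derived, the example below assigns 0 to rows with the `aug_neg` role and copies the measured yield otherwise. The column names are the defaults from the table plus the example `target_column` (`"Output"`); the helper `add_augmented_target` and the row layout are assumptions made for illustration.

```python
AUGMENTED_ROLE = "augm_role"     # default of meta_augmented_role
AUGMENTED_TARGET = "augm_yield"  # default of meta_augmented_target
TARGET_COLUMN = "Output"         # example value of dataset.target_column


def add_augmented_target(rows: list[dict]) -> list[dict]:
    """Return copies of the rows with the augmented target column filled in."""
    out = []
    for row in rows:
        new_row = dict(row)
        if row[AUGMENTED_ROLE] == "aug_neg":
            # Reliable negatives are assigned a target of 0.
            new_row[AUGMENTED_TARGET] = 0.0
        else:
            # Known positives keep their measured yield.
            new_row[AUGMENTED_TARGET] = row[TARGET_COLUMN]
        out.append(new_row)
    return out


rows = [
    {AUGMENTED_ROLE: "known_positive", TARGET_COLUMN: 63.5},
    {AUGMENTED_ROLE: "aug_neg", TARGET_COLUMN: None},
]
targets = [r[AUGMENTED_TARGET] for r in add_augmented_target(rows)]
```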

Key: splitting

Parameter	Type	Example	Description
`cross_validation_folds`	`int`	`5`	Number of cross-validation folds.
`test_size`	`float`	`0.1`	Fraction of data reserved for testing (ignored for Scaffold splits).
`validation_size`	`float`	`0.1`	Fraction of training data used for validation.
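
Because `validation_size` is a fraction of the training data rather than of the full dataset, the resulting set sizes are easy to misjudge. The arithmetic below is a sketch under the assumption that the test set is removed first and sizes are rounded to the nearest integer; `split_sizes` is a hypothetical helper, and the interaction with `cross_validation_folds` is not modelled.

```python
def split_sizes(n_samples: int, test_size: float = 0.1, validation_size: float = 0.1) -> dict:
    """Compute train/validation/test counts for the given fractions."""
    n_test = round(n_samples * test_size)
    n_train_full = n_samples - n_test
    # The validation fraction applies to what remains after the test split.
    n_val = round(n_train_full * validation_size)
    n_train = n_train_full - n_val
    return {"train": n_train, "validation": n_val, "test": n_test}


sizes = split_sizes(1000)
```

For 1000 samples with the example values, this gives 100 test points, 90 validation points (10% of the remaining 900), and 810 training points.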

Key: spy_splitting

Parameter	Type	Example	Description
`spy_rate`	`float`	`0.2`	Fraction of positive training data masked as "Spies" (e.g., `0.2` = 20%).
`spy_tolerance`	`float`	`0.05`	Probability threshold tolerance for detecting spies.
`ratio_positives_to_unlabeled`	`float`	`0.5`	Target ratio of Positives to Unlabeled data.
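
The role of `spy_rate` can be illustrated with a small sketch of the spy-masking step: a random fraction of the positive training points has its label hidden so that it sits among the unlabeled data. The function `mask_spies` and the role name `"positive"` are hypothetical; `spy_tolerance` and `ratio_positives_to_unlabeled` are not modelled here.

```python
import random


def mask_spies(labels: list[int], spy_rate: float = 0.2, seed: int = 42):
    """Hide a fraction of the positive labels and mark those points as spies.

    Returns the modified labels and a role per data point.
    """
    rng = random.Random(seed)  # seeded for reproducible spy selection
    positive_idx = [i for i, y in enumerate(labels) if y == 1]
    n_spies = round(len(positive_idx) * spy_rate)
    spy_idx = set(rng.sample(positive_idx, n_spies))

    mod_labels, roles = [], []
    for i, y in enumerate(labels):
        if i in spy_idx:
            mod_labels.append(0)  # the spy's positive label is masked
            roles.append("spy")
        elif y == 1:
            mod_labels.append(1)
            roles.append("positive")  # hypothetical role name
        else:
            mod_labels.append(0)
            roles.append("unlabeled")
    return mod_labels, roles


# 10 positives and 5 unlabeled points; 20% of the positives become spies.
labels = [1] * 10 + [0] * 5
mod_labels, roles = mask_spies(labels)
```

In this sketch the modified labels play the part of the `spy_inf_bin` column and the roles the part of `spy_inf_role`.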

Key: spy_model

Parameter	Type	Example	Description
`eval_metric`	`str`	`"MCC"`	Metric to optimize during training.
`all_metrics`	`list`	`['Accuracy', ...]`	List of metrics to log during evaluation.
`training_target_column_name`	`str`	`"spy_inf_bin"`	The column used as the target for training.
`validation_target_column_name`	`str`	`"true_bin"`	The column used as the target for validation.
`metric_manipulation`	`str`	`"Recall"`	Metric to prioritize for augmented negatives (or `None`).
`target_value`	`float`	`0.5`	Target score for the manipulated metric.

Key: reg_model

Parameter	Type	Example	Description
`eval_metric`	`str`	`"MAE"`	Metric to optimize during regression training.
`all_metrics`	`list`	`['MAE', 'R2']`	List of metrics to log.
`training_target_column_name`	`str`	`"augm_yield"`	The column used as the target for training.
`validation_target_column_name`	`str`	`"Output"`	The column used as the target for validation.

Key: optimisation

Parameter	Type	Example	Description
`type`	`str`	`"Bayesian"`	Optimization strategy.
`search_space`	`list`	`["depth", ...]`	List of hyperparameters to tune.
`iterations`	`int`	`50`	Number of optimization trials to run.

Key: catboost

Parameter	Type	Example	Description
`max_depth`	`int`	`12`	Maximum tree depth.
`max_iterations`	`int`	`10000`	Maximum number of boosting rounds.
`min_learning_rate`	`float`	`0.00001`	Minimum learning rate limit.
`max_bin`	`int`	`254`	Maximum number of splits for numerical features.