Configuration Reference
The config.yaml file controls the entire lifecycle of the PAYN framework. All parameters are organized by module below.
1. General Settings
Key: general
| Parameter | Type | Example | Description |
|---|---|---|---|
| user | str | "Maria" | User or team name used for tagging experiments. |
| experiment_id | str | "001" | Unique identifier to group related runs. |
| random_seed | int | 42 | Global random seed for reproducibility. |
| verbose | int | 200 | Logging verbosity level. |
| usage_mode | str | "training" | Pipeline mode. Options: "training" (fully labeled data) or "inference" (experimental). |
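
As a rough illustration, the example values above could be written in config.yaml as follows. This sketch assumes the parameters nest directly under the section key named in the "Key:" line; the exact layout PAYN expects is not confirmed here.

```yaml
# Sketch only: assumes parameters sit directly under the section key.
general:
  user: "Maria"
  experiment_id: "001"     # quoted so YAML keeps it a string ("001", not 1)
  random_seed: 42
  verbose: 200
  usage_mode: "training"   # alternative from the table: "inference" (experimental)
```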
2. Dataset Settings
Key: dataset
| Parameter | Type | Example | Description |
|---|---|---|---|
| file_path | str | "./data.xlsx" | Path to the input dataset (Excel or CSV). |
| sheet_name | str | "Sheet 1" | Name of the sheet to load (if Excel). |
| input_columns | list | ["Nucleophile", "Electrophile"] | Specific feature columns to use. If empty [], auto-detects all non-meta columns. |
| absence_flag | list | [] | Values in raw data to treat as missing/null (e.g., ["NaN"]). |
| target_column | str | "Output" | Name of the yield column (target variable). |
| yield_limit | float | 100 | Maximum yield value (e.g., 100 for %, 1.0 for fraction). |
| yield_classification_threshold | float | 0.2 | Threshold to binarize yield (e.g., 0.2 = 20% cutoff for Positive class). |
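
A dataset block assembled from the example values might look like the sketch below. The nesting under dataset is assumed from the "Key:" line, and the comment on the threshold only restates the table's description; whether the cutoff is inclusive is not specified in this reference.

```yaml
# Sketch only: values copied from the Example column above.
dataset:
  file_path: "./data.xlsx"
  sheet_name: "Sheet 1"            # only relevant for Excel input
  input_columns: ["Nucleophile", "Electrophile"]
  absence_flag: []                 # e.g. ["NaN"] to treat that string as missing
  target_column: "Output"
  yield_limit: 100                 # yields given in percent
  yield_classification_threshold: 0.2   # described as a 20% cutoff for the Positive class
```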
3. Featurisation
Key: featurisation
| Parameter | Type | Example | Description |
|---|---|---|---|
| method | str | "ecfp" | Vector generation method. Options include "ecfp" or "mff". |
| ECFP_bit_length | int | 2048 | Length of the fingerprint bit vector. |
| ECFP_radius | int | 2 | Radius for Morgan fingerprints (e.g., 2 = ECFP4). |
| condense_bits | bool | True | If True, removes zero-variance columns to reduce sparsity. |
| existing_feature_columns | list | [] | Columns containing pre-calculated features (e.g., DFT properties). |
| combined_features_column_name | str | "FP_combined" | Name of the final concatenated feature vector column. |
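
For orientation, a featurisation block using the example values could be sketched as below, again assuming the parameters nest under the section key. Note the mixed-case ECFP_ prefix, which is kept exactly as listed in the table.

```yaml
# Sketch only: values copied from the Example column above.
featurisation:
  method: "ecfp"
  ECFP_bit_length: 2048
  ECFP_radius: 2                   # radius 2 corresponds to ECFP4 (diameter 4)
  condense_bits: True
  existing_feature_columns: []     # no pre-calculated features in this example
  combined_features_column_name: "FP_combined"
```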
4. Meta Columns (Internal)
Key: meta_columns
| Parameter | Type | Default | Description |
|---|---|---|---|
| meta_true_label_bin | str | "true_bin" | Ground truth binary label derived from yield. |
| meta_data_point_role | str | "true_role" | Role in the Outer Split (train/val/test). |
| meta_mod_label_bin | str | "spy_inf_bin" | Modified label after Spy infiltration ($s$). |
| meta_mod_data_point_role | str | "spy_inf_role" | Role in the Inner/Spy Split (spy/unlabeled). |
| meta_mod_probability_1 | str | "spy_inf_prob_1" | Probability of belonging to positive class after Spy infiltration. |
| meta_mod_prediction_class | str | "spy_inf_pred_label" | Predicted class label after Spy infiltration. |
| meta_augmented_bin | str | "augm_bin" | Binary label after Augmentation. |
| meta_augmented_role | str | "augm_role" | Role in Regression (known_positive or aug_neg). |
| meta_augmented_target | str | "augm_yield" | Target variable for regression (Reliable Negatives assigned 0). |
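
Written out with the listed defaults, this section might look like the sketch below. The nesting is assumed from the "Key:" line. Because the section is marked internal, changing these names is presumably rarely needed; if one is changed, any parameter elsewhere that refers to the same column name would have to match.

```yaml
# Sketch only: defaults copied from the table above.
meta_columns:
  meta_true_label_bin: "true_bin"
  meta_data_point_role: "true_role"
  meta_mod_label_bin: "spy_inf_bin"
  meta_mod_data_point_role: "spy_inf_role"
  meta_mod_probability_1: "spy_inf_prob_1"
  meta_mod_prediction_class: "spy_inf_pred_label"
  meta_augmented_bin: "augm_bin"
  meta_augmented_role: "augm_role"
  meta_augmented_target: "augm_yield"
```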
5. Splitting Strategy (Outer)
Key: splitting
| Parameter | Type | Example | Description |
|---|---|---|---|
| cross_validation_folds | int | 5 | Number of cross-validation folds. |
| test_size | float | 0.1 | Fraction of data reserved for testing (ignored for Scaffold splits). |
| validation_size | float | 0.1 | Fraction of training data used for validation. |
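
A splitting block with the example values could be sketched as follows, assuming the same nesting as the other sections. The comments restate the table: test_size is taken from the whole dataset, while validation_size is taken from the training portion.

```yaml
# Sketch only: values copied from the Example column above.
splitting:
  cross_validation_folds: 5
  test_size: 0.1          # fraction of all data; ignored for Scaffold splits
  validation_size: 0.1    # fraction of the training data
```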
6. Spy Splitting (Inner)
Key: spy_splitting
| Parameter | Type | Example | Description |
|---|---|---|---|
| spy_rate | float | 0.2 | Fraction of positive training data masked as "Spies" (e.g., 0.2 = 20%). |
| spy_tolerance | float | 0.05 | Probability threshold tolerance for detecting spies. |
| ratio_positives_to_unlabeled | float | 0.5 | Target ratio of Positives to Unlabeled data. |
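
The inner split settings might be written as in the sketch below, with nesting assumed from the "Key:" line. The example value 0.2 for spy_rate is a fraction, so it would mask one in five positive training points as spies.

```yaml
# Sketch only: values copied from the Example column above.
spy_splitting:
  spy_rate: 0.2                       # 20% of positive training data become spies
  spy_tolerance: 0.05
  ratio_positives_to_unlabeled: 0.5
```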
7. Spy Model Configuration
Key: spy_model
| Parameter | Type | Example | Description |
|---|---|---|---|
| eval_metric | str | "MCC" | Metric to optimize during training. |
| all_metrics | list | ['Accuracy', ...] | List of metrics to log during evaluation. |
| training_target_column_name | str | "spy_inf_bin" | The column used as the target for training. |
| validation_target_column_name | str | "true_bin" | The column used as the target for validation. |
| metric_manipulation | str | "Recall" | Metric to prioritize for augmented negatives (or None). |
| target_value | float | 0.5 | Target score for the manipulated metric. |
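
A possible spy_model block is sketched below under the same nesting assumption. The all_metrics example in the table is elided after its first entry, so only that entry is shown here. The two target column names match the meta_mod_label_bin and meta_true_label_bin defaults from the Meta Columns section.

```yaml
# Sketch only: values copied from the Example column above.
spy_model:
  eval_metric: "MCC"
  all_metrics: ["Accuracy"]    # the source example is elided after this entry
  training_target_column_name: "spy_inf_bin"
  validation_target_column_name: "true_bin"
  metric_manipulation: "Recall"
  target_value: 0.5
```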
8. Regression Model Configuration
Key: reg_model
| Parameter | Type | Example | Description |
|---|---|---|---|
| eval_metric | str | "MAE" | Metric to optimize during regression training. |
| all_metrics | list | ['MAE', 'R2'] | List of metrics to log. |
| training_target_column_name | str | "augm_yield" | The column used as the target for training. |
| validation_target_column_name | str | "Output" | The column used as the target for validation. |
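
The regression settings could be sketched as follows, assuming the same nesting. In the examples, the validation target "Output" coincides with the dataset target_column, and the training target "augm_yield" with the meta_augmented_target default, so these would presumably need to stay in sync if either is renamed.

```yaml
# Sketch only: values copied from the Example column above.
reg_model:
  eval_metric: "MAE"
  all_metrics: ["MAE", "R2"]
  training_target_column_name: "augm_yield"
  validation_target_column_name: "Output"
```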
9. Optimization Settings
Key: optimisation
| Parameter | Type | Example | Description |
|---|---|---|---|
| type | str | "Bayesian" | Optimization strategy. |
| search_space | list | ["depth", ...] | List of hyperparameters to tune. |
| iterations | int | 50 | Number of optimization trials to run. |
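
An optimisation block might look like the sketch below. Note that the key uses the British spelling given in the "Key:" line. The search_space example is elided in the table, so only its first entry appears here.

```yaml
# Sketch only: values copied from the Example column above.
optimisation:
  type: "Bayesian"
  search_space: ["depth"]    # the source example is elided after this entry
  iterations: 50
```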
10. CatBoost Parameters
Key: catboost
| Parameter | Type | Example | Description |
|---|---|---|---|
| max_depth | int | 12 | Maximum tree depth. |
| max_iterations | int | 10000 | Maximum number of boosting rounds. |
| min_learning_rate | float | 0.00001 | Minimum learning rate limit. |
| max_bin | int | 254 | Maximum number of splits for numerical features. |
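
Finally, the CatBoost limits could be written as in this sketch, with nesting assumed from the "Key:" line. The max_ and min_ prefixes suggest these values act as bounds rather than fixed settings, but how PAYN applies them is not described in this reference.

```yaml
# Sketch only: values copied from the Example column above.
catboost:
  max_depth: 12
  max_iterations: 10000
  min_learning_rate: 0.00001
  max_bin: 254
```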