Configuration Reference

The config.yaml file controls the entire lifecycle of the PAYN framework. All parameters are organized by module below.

1. General Settings

Key: general

| Parameter | Type | Example | Description |
| --- | --- | --- | --- |
| user | str | "Maria" | User or team name used for tagging experiments. |
| experiment_id | str | "001" | Unique identifier to group related runs. |
| random_seed | int | 42 | Global random seed for reproducibility. |
| verbose | int | 200 | Logging verbosity level. |
| usage_mode | str | "training" | Pipeline mode. Options: "training" (fully labeled data) or "inference" (experimental!). |
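
As a rough illustration of how the general block might be consumed, the sketch below checks usage_mode against the two documented options and seeds Python's random module. The dictionary mirrors the example values from the table; the nesting under "general" and the helper name apply_general_settings are assumptions made for this example, not part of PAYN.

```python
import random

# Example values copied from the table above. The nesting under "general"
# is an assumption about how config.yaml is laid out.
config = {
    "general": {
        "user": "Maria",
        "experiment_id": "001",
        "random_seed": 42,
        "verbose": 200,
        "usage_mode": "training",
    }
}

# The two options documented for usage_mode.
ALLOWED_MODES = {"training", "inference"}


def apply_general_settings(cfg: dict) -> str:
    """Hypothetical helper: validate usage_mode, seed the RNG, return a run tag."""
    general = cfg["general"]
    if general["usage_mode"] not in ALLOWED_MODES:
        raise ValueError(f"Unknown usage_mode: {general['usage_mode']!r}")
    random.seed(general["random_seed"])
    return f"{general['user']}_{general['experiment_id']}"


run_tag = apply_general_settings(config)
```

Rejecting an unknown mode early keeps a typo in the config from surfacing later as a harder-to-trace pipeline error.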

2. Dataset Settings

Key: dataset

| Parameter | Type | Example | Description |
| --- | --- | --- | --- |
| file_path | str | "./data.xlsx" | Path to the input dataset (Excel or CSV). |
| sheet_name | str | "Sheet 1" | Name of the sheet to load (if Excel). |
| input_columns | list | ["Nucleophile", "Electrophile"] | Specific feature columns to use. If empty [], auto-detects all non-meta columns. |
| absence_flag | list | [] | Values in raw data to treat as missing/null (e.g., ["NaN"]). |
| target_column | str | "Output" | Name of the yield column (target variable). |
| yield_limit | float | 100 | Maximum yield value (e.g., 100 for %, 1.0 for fraction). |
| yield_classification_threshold | float | 0.2 | Threshold to binarize yield (e.g., 0.2 = 20% cutoff for Positive class). |
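
The interplay between yield_limit and yield_classification_threshold can be sketched as follows. This is a minimal illustration, assuming the yield is first divided by yield_limit and then compared with the threshold; the function name binarize_yield and the inclusive comparison (>=) are assumptions, since the table does not say how a yield exactly at the cutoff is treated.

```python
def binarize_yield(yield_value, yield_limit=100.0, threshold=0.2):
    """Return 1 (Positive) if the normalised yield reaches the threshold, else 0.

    Assumption: the comparison is inclusive (>=). PAYN may treat the
    boundary case differently.
    """
    fraction = yield_value / yield_limit  # e.g. 63.5 % -> 0.635
    return 1 if fraction >= threshold else 0


# Yields given in percent, so yield_limit stays at 100.
yields_percent = [5.0, 20.0, 63.5, 0.0]
labels = [binarize_yield(y) for y in yields_percent]

# Yields given as fractions, so yield_limit is set to 1.0.
label_fraction = binarize_yield(0.15, yield_limit=1.0)
```

With the example values, a 20% cutoff applies regardless of whether the raw data stores yields as percentages or fractions, as long as yield_limit matches the scale.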

3. Featurisation

Key: featurisation

| Parameter | Type | Example | Description |
| --- | --- | --- | --- |
| method | str | "ecfp" | Vector generation method. Options include "ecfp" or "mff". |
| ECFP_bit_length | int | 2048 | Length of the fingerprint bit vector. |
| ECFP_radius | int | 2 | Radius for Morgan fingerprints (e.g., 2 = ECFP4). |
| condense_bits | bool | True | If True, removes zero-variance columns to reduce sparsity. |
| existing_feature_columns | list | [] | Columns containing pre-calculated features (e.g., DFT properties). |
| combined_features_column_name | str | "FP_combined" | Name of the final concatenated feature vector column. |
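
Two of these settings lend themselves to a short illustration: the relation between ECFP_radius and the usual ECFP name (the number in the name is the diameter, twice the radius), and the effect of condense_bits. The sketch below works on plain Python lists and does not call any cheminformatics library; the function names ecfp_name and condense_bits are hypothetical and chosen only for this example.

```python
def ecfp_name(radius):
    """ECFP names use the diameter, so radius 2 corresponds to ECFP4."""
    return f"ECFP{2 * radius}"


def condense_bits(fingerprints):
    """Drop columns whose value is identical across all rows (zero variance).

    Returns the reduced matrix and the indices of the columns that were kept.
    """
    if not fingerprints:
        return [], []
    n_cols = len(fingerprints[0])
    kept = [
        j for j in range(n_cols)
        if len({row[j] for row in fingerprints}) > 1
    ]
    condensed = [[row[j] for j in kept] for row in fingerprints]
    return condensed, kept


# Toy 4-bit fingerprints: columns 0 and 1 never change across rows.
fps = [
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
]
condensed, kept = condense_bits(fps)
```

Keeping the list of retained column indices matters in practice, because the same columns have to be selected again when new data is featurised.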

4. Meta Columns (Internal)

Key: meta_columns

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| meta_true_label_bin | str | "true_bin" | Ground truth binary label derived from yield. |
| meta_data_point_role | str | "true_role" | Role in the Outer Split (train/val/test). |
| meta_mod_label_bin | str | "spy_inf_bin" | Modified label after Spy infiltration ($s$). |
| meta_mod_data_point_role | str | "spy_inf_role" | Role in the Inner/Spy Split (spy/unlabeled). |
| meta_mod_probability_1 | str | "spy_inf_prob_1" | Probability of belonging to positive class after Spy infiltration. |
| meta_mod_prediction_class | str | "spy_inf_pred_label" | Predicted class label after Spy infiltration. |
| meta_augmented_bin | str | "augm_bin" | Binary label after Augmentation. |
| meta_augmented_role | str | "augm_role" | Role in Regression (known_positive or aug_neg). |
| meta_augmented_target | str | "augm_yield" | Target variable for regression (Reliable Negatives assigned 0). |
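
To make the last two columns concrete, the sketch below fills augm_yield from augm_role. It assumes that known positives keep their measured yield from the target column ("Output" in the dataset example) while reliable negatives receive 0; the first half of that rule is an assumption, and the helper name assign_augmented_target is hypothetical.

```python
# Toy rows using the default meta column names from the table above.
rows = [
    {"Output": 72.0, "augm_role": "known_positive"},
    {"Output": 35.0, "augm_role": "known_positive"},
    {"Output": None, "augm_role": "aug_neg"},
]


def assign_augmented_target(row, target_column="Output"):
    """Reliable negatives get a yield of 0; known positives keep their yield.

    Assumption: known positives reuse the measured value unchanged.
    """
    if row["augm_role"] == "aug_neg":
        return 0.0
    return row[target_column]


for row in rows:
    row["augm_yield"] = assign_augmented_target(row)

augmented_targets = [row["augm_yield"] for row in rows]
```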

5. Splitting Strategy (Outer)

Key: splitting

| Parameter | Type | Example | Description |
| --- | --- | --- | --- |
| cross_validation_folds | int | 5 | Number of cross-validation folds. |
| test_size | float | 0.1 | Fraction of data reserved for testing (ignored for Scaffold splits). |
| validation_size | float | 0.1 | Fraction of training data used for validation. |
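
Because validation_size is a fraction of the training data rather than of the full dataset, the resulting counts are easy to misjudge. The following sketch works through the arithmetic for a simple hold-out split. The function name split_sizes and the rounding rule are assumptions, and the sketch does not cover how cross_validation_folds interacts with these fractions.

```python
def split_sizes(n_samples, test_size=0.1, validation_size=0.1):
    """Return (n_train, n_val, n_test) for a simple hold-out split.

    Assumption: counts are rounded to the nearest integer; the rounding
    used by PAYN may differ.
    """
    n_test = round(n_samples * test_size)
    n_remaining = n_samples - n_test          # data left after the test split
    n_val = round(n_remaining * validation_size)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


sizes = split_sizes(1000)
```

For 1000 samples with both fractions at 0.1, the validation set holds 90 points (10% of the remaining 900), not 100.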

6. Spy Splitting (Inner)

Key: spy_splitting

| Parameter | Type | Example | Description |
| --- | --- | --- | --- |
| spy_rate | float | 0.2 | Fraction of positive training data masked as "Spies" (e.g., 0.2 = 20%). |
| spy_tolerance | float | 0.05 | Probability threshold tolerance for detecting spies. |
| ratio_positives_to_unlabeled | float | 0.5 | Target ratio of Positives to Unlabeled data. |
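
The sketch below illustrates what spy_rate and spy_tolerance could look like in code. The masking step follows the table directly. The threshold step is an assumption based on the classic Spy technique from positive-unlabeled learning, where the cutoff is chosen so that roughly a spy_tolerance share of spies scores below it; PAYN's actual rule may differ. Both function names are hypothetical.

```python
import random


def mask_spies(labels, spy_rate=0.2, seed=42):
    """Relabel a random fraction of positives (1) as unlabeled (0).

    Returns the modified labels and the sorted indices chosen as spies.
    """
    rng = random.Random(seed)
    positive_idx = [i for i, y in enumerate(labels) if y == 1]
    n_spies = round(len(positive_idx) * spy_rate)
    spy_idx = sorted(rng.sample(positive_idx, n_spies))
    modified = list(labels)
    for i in spy_idx:
        modified[i] = 0
    return modified, spy_idx


def spy_threshold(spy_probabilities, spy_tolerance=0.05):
    """Cutoff such that about spy_tolerance of the spies score strictly below it.

    Assumption: classic Spy technique; unlabeled points scoring below the
    cutoff would then be treated as reliable negatives.
    """
    ordered = sorted(spy_probabilities)
    k = int(len(ordered) * spy_tolerance)  # number of spies allowed below
    k = min(k, len(ordered) - 1)           # keep the index in range
    return ordered[k]


labels = [1] * 10 + [0] * 10
modified, spy_idx = mask_spies(labels)
```

With 10 positives and spy_rate = 0.2, two positives are hidden among the unlabeled points, so the classifier sees 8 positives and 12 unlabeled examples.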

7. Spy Model Configuration

Key: spy_model

| Parameter | Type | Example | Description |
| --- | --- | --- | --- |
| eval_metric | str | "MCC" | Metric to optimize during training. |
| all_metrics | list | ['Accuracy', ...] | List of metrics to log during evaluation. |
| training_target_column_name | str | "spy_inf_bin" | The column used as the target for training. |
| validation_target_column_name | str | "true_bin" | The column used as the target for validation. |
| metric_manipulation | str | "Recall" | Metric to prioritize for augmented negatives (or None). |
| target_value | float | 0.5 | Target score for the manipulated metric. |
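
For readers unfamiliar with the example eval_metric, the Matthews correlation coefficient (MCC) can be computed from the four confusion-matrix counts. The sketch below is a plain standard-library implementation for illustration only; it is not the routine PAYN uses, and returning 0.0 for a zero denominator is a common convention adopted here as an assumption. In line with the table, predictions would be scored against the validation target (true_bin).

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


true_bin = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 1]
score = mcc(true_bin, predicted)  # TP=2, TN=2, FP=1, FN=1
```

MCC ranges from -1 to 1 and uses all four confusion-matrix cells, so it stays informative when the positive and negative classes are imbalanced.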

8. Regression Model Configuration

Key: reg_model

| Parameter | Type | Example | Description |
| --- | --- | --- | --- |
| eval_metric | str | "MAE" | Metric to optimize during regression training. |
| all_metrics | list | ['MAE', 'R2'] | List of metrics to log. |
| training_target_column_name | str | "augm_yield" | The column used as the target for training. |
| validation_target_column_name | str | "Output" | The column used as the target for validation. |
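
The two example metrics, MAE and R2, are defined below in plain Python for reference. These are standard textbook definitions written for illustration, not the functions PAYN calls internally, and the toy yield values are invented for the example.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all measured values are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy yields in percent: measured values vs. model predictions.
measured = [0.0, 50.0, 100.0]
predicted = [10.0, 50.0, 90.0]

mae = mean_absolute_error(measured, predicted)
r2_value = r2(measured, predicted)
```

MAE is expressed in the same units as the yield, which makes it easy to read against yield_limit, whereas R2 is unitless.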

9. Optimization Settings

Key: optimisation

| Parameter | Type | Example | Description |
| --- | --- | --- | --- |
| type | str | "Bayesian" | Optimization strategy. |
| search_space | list | ["depth", ...] | List of hyperparameters to tune. |
| iterations | int | 50 | Number of optimization trials to run. |

10. CatBoost Parameters

Key: catboost

| Parameter | Type | Example | Description |
| --- | --- | --- | --- |
| max_depth | int | 12 | Maximum tree depth. |
| max_iterations | int | 10000 | Maximum number of boosting rounds. |
| min_learning_rate | float | 0.00001 | Minimum learning rate limit. |
| max_bin | int | 254 | Maximum number of splits for numerical features. |