Positivity is All You Need (PAYN)
The PAYN framework is an open-source Python library for running and evaluating Positive-Unlabeled (PU) learning in organic chemistry. PAYN is implemented in Python 3 and builds on the core scientific stack, including pandas, RDKit, CatBoost, Optuna, NumPy, and scikit-learn. The architecture separates configuration, data featurisation, model training, and evaluation to ensure reproducibility and extensibility. Below we detail the core modules responsible for experiment orchestration.
Background
Literature datasets are heavily skewed towards successful reactions, which severely limits the generalization capability of machine learning models. PAYN addresses this through:
- Spy Technique: Injecting "Spies" (known positives) into the unlabeled pool so that their scores reveal how hidden positives behave.
- Dynamic Thresholding: Statistically identifying "Reliable Negatives" that are distinct from the latent positive distribution within the unlabeled data.
- Balancing Data: Constructing balanced training sets that significantly improve downstream yield prediction accuracy.
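The three steps above can be illustrated with a minimal, standard-library-only sketch. It is not PAYN's implementation: every function name here is hypothetical, the one-dimensional toy data is invented for illustration, and the nearest-centroid scorer is a deliberately simple stand-in for the CatBoost-based SpyModel. The threshold rule (a low quantile of the spy scores) is one common choice and is an assumption about how "dynamic thresholding" might be realised.

```python
import random
from statistics import mean


def inject_spies(positives, unlabeled, spy_fraction=0.2, seed=0):
    """Hide a random fraction of known positives ("spies") in the unlabeled pool."""
    rng = random.Random(seed)
    n_spies = max(1, round(spy_fraction * len(positives)))
    spy_idx = set(rng.sample(range(len(positives)), n_spies))
    spies = [x for i, x in enumerate(positives) if i in spy_idx]
    remaining = [x for i, x in enumerate(positives) if i not in spy_idx]
    return remaining, unlabeled + spies, spies


def fit_centroid_scorer(positives, mixed_pool):
    """Toy stand-in for a PU classifier: position of x between the two class means."""
    mu_pos, mu_mix = mean(positives), mean(mixed_pool)
    span = mu_pos - mu_mix
    if span == 0:
        raise ValueError("class means coincide; the toy scorer cannot separate them")
    return lambda x: (x - mu_mix) / span  # higher score = more positive-like


def spy_threshold(spy_scores, quantile=0.05):
    """Score at the given lower quantile of the spy scores."""
    ordered = sorted(spy_scores)
    k = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[k]


def reliable_negatives(unlabeled, score, threshold):
    """Unlabeled samples that score strictly below the spy-derived threshold."""
    return [x for x in unlabeled if score(x) < threshold]


def balance(positives, negatives, seed=0):
    """Downsample the larger class so both classes have equal size."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)


# Invented toy data: one feature per sample, positives have high values.
known_pos = [0.70, 0.75, 0.79, 0.81, 0.80, 0.85, 0.88, 0.90, 0.92, 0.95]
true_neg = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
unlabeled = [0.96, 0.99] + true_neg  # two hidden positives plus true negatives

remaining, mixed, spies = inject_spies(known_pos, unlabeled)
score = fit_centroid_scorer(remaining, mixed)       # trained on P vs. (U + spies)
threshold = spy_threshold([score(s) for s in spies])
negatives = reliable_negatives(unlabeled, score, threshold)
train_pos, train_neg = balance(known_pos, negatives)
```

On this toy data the hidden positives score above every spy, so only the six low-valued samples are selected as reliable negatives, and the balanced set contains six samples per class. With real fingerprints and a real classifier the separation is rarely this clean, which is why the choice of quantile matters.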
Module Architecture
The repository is structured into modular components to enforce a strict separation of concerns, ensuring reproducibility and extensibility:
| Module | Description |
|---|---|
| payn.ConfigLoader | Manages hierarchical configuration (YAML/JSON) and generates dynamic CLI arguments for SLURM integration. |
| payn.Logging | Centralizes experimental tracking via MLflow, enforcing artifact serialization and parameter provenance. |
| payn.DataSchema | Enforces runtime schema validation and mathematically verifies index disjointness to prevent data leakage. |
| payn.Featurisation | Orchestrates SMILES-to-fingerprint transformation, supporting ECFP, Multi-Feature Fingerprinting (MFF), and custom precalculated features. |
| payn.Splitting | Reproducibly partitions and cross-validates data using random, scaffold, or Butina clustering strategies. |
| payn.SpySplitting | Transforms fully labeled data into PU data and injects known positives ("Spies") into the unlabeled set. |
| payn.AugmentationModels | Contains the SpyModel for PU classification and the engine for identifying Reliable Negatives via dynamic thresholding. |
| payn.Optimization | Performs hyperparameter tuning (Bayesian TPE or grid search) with deterministic state handling for reproducibility. |
| payn.Recombination | Constructs balanced datasets for downstream tasks by merging verified positives with identified reliable negatives. |
| payn.RegModel | Wraps CatBoostRegressor for the final yield prediction task, handling categorical features and parallel execution. |
| payn.Evaluator | Computes specialized PU metrics (Negative Precision/Recall) to assess the purity of the identified negative set. |
| payn.Visualisation | Generates diagnostic plots for data distributions, hyperparameter importance, and optimization history. |
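To make the Negative Precision/Recall metrics concrete, here is a minimal sketch that assumes they follow the usual precision and recall definitions with the negative class as the target. The function name and the sample identifiers are hypothetical, and payn.Evaluator may define or compute these metrics differently. The sketch applies only to benchmark settings where the true labels of the unlabeled samples are known.

```python
def negative_precision_recall(selected_ids, true_negative_ids):
    """Precision and recall of the selected 'reliable negative' set.

    precision: fraction of selected samples that are truly negative (purity)
    recall:    fraction of all true negatives that were selected (coverage)
    """
    selected, truth = set(selected_ids), set(true_negative_ids)
    hits = len(selected & truth)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical sample IDs: four samples flagged, five true negatives in the pool.
precision, recall = negative_precision_recall([1, 2, 3, 4], [2, 3, 4, 5, 6])
print(precision, recall)  # 0.75 0.6
```

Here three of the four flagged samples are true negatives (precision 0.75), and three of the five true negatives were found (recall 0.6). A high negative precision indicates a pure negative set, which is what matters most when those samples are later merged with the positives.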
Use the sidebar to navigate through different sections of the documentation.
Citation
If you use PAYN in your research, please cite:
@article{Boser2025,
title = {Positivity is All You Need (PAYN): A PU Learning Framework for Yield Prediction in Organic Chemistry},
author = {Boser, Florian and Spies, Jan Christoph and Glorius, Frank},
journal = {ChemRxiv},
year = {2025},
doi = {10.26434/chemrxiv-2025-hq4rx},
url = {https://doi.org/10.26434/chemrxiv-2025-hq4rx}
}
Getting Started
- Installation: Set up the environment using Poetry.
- Configuration: Understand the hierarchical config system.
- Naming: Learn how datasets are named throughout the pipeline.