Thursday, May 29, 2025

Beyond the Score: A Critical Look at Rescoring Methods in Molecular Docking

So in the last post I focused on docking scoring functions and their limitations. But if docking scoring functions are not good enough, there must be some other scoring method one could apply to rank docking poses more accurately. This concept is known as rescoring.

Rescoring methods aim to refine docking predictions by using more accurate, computationally intensive approaches. This post outlines the main classes of rescoring techniques, evaluates their strengths and weaknesses, and provides guidance on when and how to apply them effectively.

Rescoring Methods: An Overview

1. Empirical scoring functions (e.g., ChemScore, X-Score): Fast and interpretable, but limited by oversimplified energy terms and training set bias.

2. Force-field based methods: These include MM-GBSA (Molecular Mechanics Generalized Born Surface Area) and MM-PBSA (Molecular Mechanics Poisson-Boltzmann Surface Area), both of which combine molecular mechanics energies with continuum solvation models. They require minimization or short MD simulations to estimate binding energies more accurately.

3. Consensus scoring: Combines multiple scoring functions (empirical, knowledge-based, etc.) to reduce individual model bias. Often improves early enrichment but is not always reliable (a minimal rank-averaging sketch follows after this list).

4. Interaction-based filters: Methods that evaluate pose quality based on H-bonding, ionic interactions, desolvation penalties, and hydrophobic contacts (e.g., PoseBusters, IChem).

5. Machine learning models (e.g., RF-Score, CNN-based models like Gnina): Can capture non-linear relationships but require careful validation and are highly dependent on training data.

6. Quantum mechanical rescoring (e.g., DFT, or semi-empirical methods such as GFN2-xTB): Very accurate but extremely slow, and generally impractical for more than a few poses.
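To make the consensus idea (item 3 above) concrete, here is a minimal rank-averaging sketch in Python. The scoring-function names, score values, and the lower-is-better sign convention are assumptions for illustration, not output from any particular program.

```python
import numpy as np

def consensus_rank(scores_by_function):
    """Rank-average consensus: convert each scoring function's scores to
    ranks over the same set of poses, then average the ranks.

    scores_by_function: dict mapping a scoring-function name to a list of
    scores for the same N poses (lower = better assumed here).
    Returns pose indices ordered by consensus rank, best first.
    """
    n_poses = len(next(iter(scores_by_function.values())))
    rank_sum = np.zeros(n_poses)
    for scores in scores_by_function.values():
        scores = np.asarray(scores, dtype=float)
        # argsort of argsort yields the rank of each pose (0 = best)
        rank_sum += np.argsort(np.argsort(scores))
    mean_rank = rank_sum / len(scores_by_function)
    return np.argsort(mean_rank)

# Toy usage: three hypothetical scoring functions over five poses
order = consensus_rank({
    "empirical":   [-7.2, -6.1, -8.0, -5.5, -6.9],
    "force_field": [-32.4, -28.1, -30.9, -22.0, -31.5],
    "ml_model":    [-0.81, -0.55, -0.77, -0.40, -0.83],
})
print("Consensus order (best first):", order)
```

Rank averaging is a common consensus variant because it sidesteps the problem of combining scores on different scales; z-score averaging or simple voting follow the same pattern.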

When is Rescoring Useful? 

Rescoring is particularly beneficial in these scenarios:

  • The initial docking scoring function fails to distinguish between plausible and implausible poses.
  • The receptor is rigid and the ligand is moderately flexible.
  • Retrospective enrichment is poor (ligands not ranking above decoys).
  • You have a manageable number of poses (~10-100 per target).

Rescoring is less useful:

  • For ultra-large libraries (>1M compounds), due to prohibitive computational cost.
  • When the docking protocol already uses high-accuracy scoring (e.g., Glide XP, CovDock).
  • If the binding mode prediction is uncertain, because rescoring cannot fix a bad pose.

Notably, Sindt et al. (2025) underscore the difficulty of rescoring large-scale docking outputs in ultra-large libraries. They show that despite theoretical improvements, rescoring methods such as MM-GBSA may not effectively reorder top-ranked molecules when the docking protocol has already introduced significant bias. Their findings suggest that the utility of rescoring is context-dependent and that blindly applying more complex scoring may not yield better results without rigorous pose validation.

Pros and Cons of Rescoring 

Pros:

  • Improves enrichment in retrospective and prospective studies.
  • Can recover false negatives missed by docking.
  • Offers more physically realistic estimates of binding.

Cons:

  • Computationally expensive.
  • May introduce false positives due to overfitting or inaccuracies in force fields.
  • Limited by accuracy of the initial pose.

As shown by Sindt et al., rescoring cannot overcome limitations of initial pose sampling in large-scale virtual screens, and may result in poor enrichment if used naively.


MD-based Rescoring: MM-GBSA and MM-PBSA 

These methods can be applied after energy minimization or from snapshots of MD trajectories:

  • Minimization-only is often sufficient, especially for rigid binding pockets or small ligands.
  • Short MD simulations (1-10 ns) can help explore local flexibility and better estimate solvation effects.
  • Long MD (>10 ns) offers marginal benefit for these endpoint methods but is essential for more complex free energy approaches like FEP.

MD-based rescoring also enables assessment of pose stability and interaction persistence over time, which is especially useful in ambiguous binding scenarios.
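For readers who have not run these calculations before, the endpoint estimate itself is just ΔG_bind ≈ ⟨G_complex − G_receptor − G_ligand⟩ averaged over minimized structures or MD snapshots. The sketch below assumes the per-snapshot total energies (MM terms plus GB/PB solvation and nonpolar surface area) have already been produced by your MD or rescoring package; all numbers are hypothetical.

```python
import numpy as np

def mmgbsa_delta_g(complex_g, receptor_g, ligand_g):
    """Single-trajectory MM-GBSA/PBSA endpoint estimate.

    Each argument is a sequence of per-snapshot total energies (kcal/mol)
    for the complex, the isolated receptor, and the isolated ligand.
    Returns the mean binding free energy and its standard error.
    """
    dg = np.asarray(complex_g) - np.asarray(receptor_g) - np.asarray(ligand_g)
    return dg.mean(), dg.std(ddof=1) / np.sqrt(len(dg))

# Hypothetical per-snapshot energies from a short MD run
complex_g  = [-5210.3, -5208.7, -5212.1, -5209.5]
receptor_g = [-4980.2, -4979.0, -4981.4, -4978.9]
ligand_g   = [ -195.1,  -194.8,  -195.6,  -194.9]

mean_dg, sem = mmgbsa_delta_g(complex_g, receptor_g, ligand_g)
print(f"Estimated binding energy: {mean_dg:.1f} +/- {sem:.1f} kcal/mol")
```

This is the single-trajectory variant, where all three energy series come from the same complex trajectory; separate-trajectory protocols follow the same bookkeeping with three independent simulations.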

Rescoring in Virtual Screening vs. Single Ligand Optimization

In virtual screening, rescoring is typically applied to the top 1–5% of hits from docking, due to cost.
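In practice, that selection step is trivial bookkeeping. A minimal sketch of pulling the top fraction of a ranked docking table for rescoring might look like the following; the column names and scores are hypothetical, and a lower-is-better docking score is assumed.

```python
import pandas as pd

# Hypothetical docking results: one best-scored pose per compound
results = pd.DataFrame({
    "compound":   ["cpd_001", "cpd_002", "cpd_003", "cpd_004", "cpd_005"],
    "dock_score": [-9.8, -7.1, -10.4, -6.3, -8.9],
})

top_fraction = 0.05  # rescore only the top 5% of the ranked library
n_keep = max(1, int(len(results) * top_fraction))

to_rescore = results.sort_values("dock_score").head(n_keep)
print(to_rescore)  # these compounds would be passed to MM-GBSA or similar
```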

In single-ligand studies, more exhaustive rescoring (even QM-based) is feasible and recommended, especially for lead optimization.

Different strategies make sense depending on the goal: filtering out decoys vs. refining a pose.

As emphasized by Sindt et al., rescoring should be coupled with pose quality control and chemical plausibility filters, particularly in ultra-large screens where docking noise dominates.

Conclusion 

Rescoring is a valuable tool in the docking pipeline but is not a panacea. It must be applied judiciously, with clear understanding of its limitations and appropriate benchmarks. Studies like Sindt et al. (2025) caution against over-reliance on post hoc scoring without validation, especially in large-scale campaigns.

Scoring the Wrong Thing: Lessons from Pose Prediction Challenges

For the second post of the blog, I thought it might be a good idea to build on the first one about docking by reflecting on its most challenging part.

So there you are! Let's talk about scoring!

Despite decades of research, scoring functions in molecular docking still struggle with a basic task: identifying the correct pose among many alternatives. While pose sampling has become relatively efficient, pose ranking remains error-prone — and in many cases, the “correct” pose is buried far below top rank.

The sampling works — the scoring often doesn’t

In a benchmark of 100 protein-ligand complexes from the DUD-E dataset, Chaput and Mouawad showed that programs like Surflex and Glide frequently generate near-native poses (within 2 Å RMSD), but fail to top-rank them. Glide, for instance, placed the correct pose at rank 1 for only 68 out of 100 cases, even though many more correct poses were present in the full list.

This pattern recurs across many studies: scoring functions often assign high scores to geometrically incorrect poses, including strained conformers or poses with unsatisfied hydrogen bonds. The result is that a correct pose can be missed not because it wasn’t found — but because the scoring function didn’t recognize its plausibility.
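This failure mode is easy to quantify in your own runs: compute the RMSD of each pose to a reference (crystal) ligand and ask at what score rank the first near-native pose appears. The sketch below uses random coordinates as stand-ins for real poses and assumes identical atom ordering and a shared receptor frame; a dedicated tool would also handle symmetry-equivalent atoms.

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """Heavy-atom RMSD between two poses with identical atom ordering,
    assuming both are already in the same (receptor) reference frame."""
    diff = np.asarray(coords_a) - np.asarray(coords_b)
    return np.sqrt((diff ** 2).sum(axis=1).mean())

def rank_of_first_near_native(poses, reference, scores, cutoff=2.0):
    """1-based rank (by score, lower = better) of the first pose within
    `cutoff` angstroms of the reference, or None if none qualifies."""
    order = np.argsort(scores)
    for rank, idx in enumerate(order, start=1):
        if rmsd(poses[idx], reference) <= cutoff:
            return rank
    return None

# Toy usage with random coordinates standing in for real poses
rng = np.random.default_rng(0)
reference = rng.normal(size=(20, 3))                     # "crystal" ligand
poses = [reference + rng.normal(scale=s, size=(20, 3))   # poses of varying quality
         for s in (3.0, 0.3, 2.5, 1.5)]
scores = [-9.1, -7.4, -8.8, -6.2]                        # hypothetical docking scores
print("First near-native pose found at rank:",
      rank_of_first_near_native(poses, reference, scores))
```

In this toy case the only near-native pose sits at rank 3, which is exactly the "found but not recognized" situation described above.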

Scoring functions aren't trained for geometry

Most classical scoring functions are designed for affinity estimation, not geometric discrimination. They aggregate terms related to van der Waals contacts, hydrogen bonds, desolvation, and electrostatics into a linear or empirical model. This works (sometimes) for ranking ligands in virtual screening, but fails when the goal is to select the most realistic binding pose for a single ligand.
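As a caricature of that design, the sketch below regresses a weighted sum of made-up interaction terms against made-up affinities: the resulting model can reproduce affinities on its training data, yet nothing in it explicitly penalizes strained torsions or clashes. The terms, weights, and pKd values are invented for illustration and do not correspond to any published scoring function.

```python
import numpy as np

# Rows = ligands, columns = (vdW, H-bond, desolvation, electrostatics) terms
X = np.array([
    [-30.1, -2.0, 5.2, -4.1],
    [-22.4, -4.0, 3.1, -6.3],
    [-28.7, -1.0, 6.0, -2.2],
    [-18.9, -3.0, 2.4, -5.0],
    [-25.3, -2.5, 4.4, -3.8],
])
pK = np.array([7.2, 8.1, 6.4, 7.5, 7.0])  # "measured" affinities (pKd), invented

# Least-squares weights mapping the interaction terms onto affinity
w, *_ = np.linalg.lstsq(X, pK, rcond=None)
print("Fitted weights:", np.round(w, 3))

# The same weights applied to a new pose give its predicted affinity;
# geometric strain or clashes never enter the model directly.
new_pose = np.array([-26.0, -2.2, 4.0, -4.5])
print("Predicted pKd:", round(float(new_pose @ w), 2))
```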

Deep learning-based scoring functions have shown promise in improving binding affinity prediction, but even these often underperform in pose selection tasks when trained solely on affinity labels. Many lack the resolution to penalize subtle steric clashes or torsional strain — both critical in distinguishing native-like poses from decoys.

What can we do about it?

Some practical mitigation strategies include:

  • Rescoring with more physics-aware functions: Tools like MM-GBSA or quantum-derived methods can improve pose discrimination, though at higher computational cost.
  • Consensus scoring: Chaput and Mouawad found that combining docking programs via a “wisdom of the crowd” approach improved top-4 pose accuracy to 87%, outperforming any individual tool.
  • Post-docking filtering: Using torsion strain filters, hydrogen bond satisfaction checks, or pose interaction fingerprints (IFPs) can eliminate obviously implausible poses before ranking (a crude clash-check sketch follows after this list).
  • Binding pose metadynamics: MD-based approaches like binding pose metadynamics can be used to estimate pose stability by evaluating kinetic escape times and dynamic consistency with the binding site. These methods help discriminate poses that may be energetically plausible but dynamically unstable.
  • Domain-specific ML models: Deep learning methods trained on pose quality (rather than affinity) — or with 3D structural embeddings — can outperform traditional scoring on this task when trained on appropriate datasets like CASF or PDBbind.
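As an illustration of the post-docking filtering idea, here is a deliberately crude clash check: reject any pose in which a ligand heavy atom sits unrealistically close to a protein heavy atom. Real filters such as PoseBusters or IFP-based checks are element- and geometry-aware; the coordinates and the 2.0 Å cutoff here are arbitrary.

```python
import numpy as np

def has_steric_clash(ligand_xyz, protein_xyz, cutoff=2.0):
    """Flag a pose if any ligand heavy atom lies closer than `cutoff`
    angstroms to any protein heavy atom. Purely illustrative: no element
    types, no torsion strain, no hydrogen-bond accounting."""
    lig = np.asarray(ligand_xyz)[:, None, :]    # (n_lig, 1, 3)
    prot = np.asarray(protein_xyz)[None, :, :]  # (1, n_prot, 3)
    dists = np.linalg.norm(lig - prot, axis=-1)
    return bool((dists < cutoff).any())

# Toy usage: a three-atom "ligand" against a few pocket atoms
pocket    = [[0.0, 0.0, 0.0], [3.5, 0.0, 0.0], [0.0, 3.5, 0.0]]
good_pose = [[7.0, 7.0, 7.0], [8.2, 7.0, 7.0], [7.0, 8.2, 7.0]]
clashing  = [[0.5, 0.5, 0.0], [8.2, 7.0, 7.0], [7.0, 8.2, 7.0]]
print(has_steric_clash(good_pose, pocket))  # False
print(has_steric_clash(clashing, pocket))   # True
```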

Final thoughts

Pose prediction is not just a subroutine — it's foundational. Misranking a good pose early in the workflow can cascade into misleading SAR, wrong pharmacophore hypotheses, and wasted synthetic efforts. As long as we keep scoring the wrong thing, even good docking engines will fail us.

Don’t blame docking — fix the scoring!

Wednesday, May 28, 2025

So Nobody Believes in Docking Anymore? Here's Why You Still Should!

It did not take long to decide what the first post of my new blog should be about. If you think about it, there is one method that is probably the most used (and often misused) in CADD. I am talking about molecular docking. 

(Video of molecular docking from Wikipedia)


Despite being widely used, there is a great deal of mistrust in what docking can do.

Ask around in 2025 and you’ll hear a familiar refrain: “Docking doesn’t work.” “It’s too rigid.” “The scoring functions are wrong.” “Machine learning is better.” And in many ways, the critics aren’t wrong — docking has well-known limitations, and its worst-case failures can be misleading or even dangerous in a screening context. But dismissing it entirely is a mistake.

Docking is not dead. It's just misunderstood.

Docking is fast and hypothesis-generating — not predictive chemistry

At its core, docking is a heuristic method for generating testable hypotheses. It doesn’t pretend to capture full protein flexibility or solvent effects (unless paired with more advanced protocols), but it gives you a plausible binding mode fast. This is invaluable when you're dealing with dozens of targets or thousands of ligands.

What docking excels at is ranking by rough compatibility — shape, chemistry, and energetics — and doing so with incredible speed. If you're expecting nanomolar predictions from a rigid-body pose scorer, you're using the tool wrong.

It works well when paired with human or domain expertise

In many cases, docking fails because it’s used in isolation or on poorly prepared systems. But when combined with:

  • a reliable reference ligand,
  • proper protonation and tautomeric states,
  • a curated binding site,
  • an ensemble of protein conformations,
  • and filters for synthetic accessibility or toxicity,

…it becomes a powerful triaging tool. Many successful hit-to-lead campaigns start with a docking pipeline, not because docking is perfect, but because it gets you to “good enough” for the next round.

It's the front-end, not the final word

Most serious drug discovery projects today use docking as a first-pass filter before:
  • molecular dynamics refinement,
  • MM-GBSA/PBSA rescoring,
  • free energy perturbation (FEP),
  • or machine learning rescoring models.

These are all expensive in comparison. Docking lets you throw out the worst 95% before investing in the remaining 5%. If you think of docking as the funnel, not the filter, its role becomes obvious.

It’s still improving — just slowly

We’ve seen progress in ensemble docking, water-aware protocols, flexible receptor docking, and more physics-informed scoring functions. It's true that the field moves slowly compared to ML, but there's a decade of methodological depth here that new tools still rely on.

Bottom line

Docking is not a crystal ball. But used thoughtfully, with good structural data and chemical intuition, it remains one of the most cost-effective, scalable, and explainable tools we have in the early stages of drug discovery.

The real issue isn't that docking doesn’t work — it's that we ask too much of it.

Welcome to Ligand in the Loop

Drug discovery is iterative by nature — an endless loop of hypotheses, experiments, and corrections. This blog sits in the middle of that loop, focusing on the computational side: docking poses, pharmacophore models, MD trajectories, similarity metrics, and machine learning pipelines that try (and often fail) to predict biology from data.

Ligand in the Loop is a personal, anonymous space to share thoughts, methods, tools, and reflections that arise during real research projects in molecular modeling and cheminformatics. I’ll post short essays, technical notes, critiques of published papers, code snippets, and commentary on methods I use or avoid — often with more candor than peer review allows.

The intended audience includes:

  • Computational chemists and molecular modelers
  • Medicinal chemists curious about in silico tools
  • Data scientists working at the chemistry-biology interface
  • PhD students and postdocs navigating the black box of virtual screening

There won’t be fixed formats or schedules — just ideas worth documenting. If any of these resonate with your own work, you’re in the loop too.
