Thursday, May 29, 2025

Scoring the Wrong Thing: Lessons from Pose Prediction Challenges

For the second post of this blog, I thought it would be a good idea to build on the first one about docking by reflecting on its most challenging part.

So here we are. Let's talk about scoring!

Despite decades of research, scoring functions in molecular docking still struggle with a basic task: identifying the correct pose among many alternatives. While pose sampling has become relatively efficient, pose ranking remains error-prone, and in many cases the “correct” pose is buried far below the top rank.

The sampling works — the scoring often doesn’t

In a benchmark of 100 protein-ligand complexes from the DUD-E dataset, Chaput and Mouawad showed that programs like Surflex and Glide frequently generate near-native poses (within 2 Å RMSD), but fail to top-rank them. Glide, for instance, placed the correct pose at rank 1 for only 68 out of 100 cases, even though many more correct poses were present in the full list.
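
To make that concrete, here is roughly what such an evaluation boils down to. This is a minimal sketch, assuming each complex comes with a score-sorted list of pose RMSDs to the crystal ligand; the toy data and function name are mine, only the 2 Å convention comes from benchmarks like the one above.

```python
# Minimal sketch of a top-N pose-ranking evaluation. Assumes each complex
# yields a list of pose RMSDs (in Å) to the crystal ligand, already sorted
# by docking score; the toy data below are invented.

def top_n_success_rate(rmsds_per_complex, n=1, cutoff=2.0):
    """Fraction of complexes with a near-native pose among the top n ranks."""
    hits = sum(
        any(rmsd <= cutoff for rmsd in rmsds[:n])
        for rmsds in rmsds_per_complex
    )
    return hits / len(rmsds_per_complex)

# Three complexes, poses sorted best-score-first:
poses = [
    [1.4, 3.8, 5.1],  # near-native pose ranked first: top-1 hit
    [4.2, 1.9, 6.0],  # near-native pose exists but sits at rank 2
    [5.5, 6.3, 7.1],  # sampling failure: no near-native pose at all
]
print(top_n_success_rate(poses, n=1))  # ~0.33
print(top_n_success_rate(poses, n=2))  # ~0.67
```

The gap between the top-1 and top-2 numbers is precisely the scoring problem: the near-native pose was sampled, it just wasn't ranked first.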

This pattern recurs across many studies: scoring functions often assign high scores to geometrically incorrect poses, including strained conformers or poses with unsatisfied hydrogen bonds. The result is that a correct pose can be missed not because it wasn’t found — but because the scoring function didn’t recognize its plausibility.

Scoring functions aren't trained for geometry

Most classical scoring functions are designed for affinity estimation, not geometric discrimination. They aggregate terms related to van der Waals contacts, hydrogen bonds, desolvation, and electrostatics into a linear or empirical model. This works (sometimes) for ranking ligands in virtual screening, but fails when the goal is to select the most realistic binding pose for a single ligand.
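
To see what that design looks like in practice, here is a deliberately crude caricature of an empirical scoring function. The term names and weights are invented for illustration and match no real program.

```python
# Caricature of a classical empirical scoring function: a weighted linear
# sum of per-pose interaction terms. Term names and weights are invented;
# real programs fit such coefficients to binding-affinity data.

WEIGHTS = {
    "vdw": -0.15,         # van der Waals contact term
    "hbond": -1.2,        # per satisfied hydrogen bond
    "desolvation": 0.8,   # penalty for burying polar atoms
    "electrostatic": -0.3,
}

def empirical_score(terms):
    """Lower (more negative) is better, mimicking a binding free energy."""
    return sum(WEIGHTS[name] * value for name, value in terms.items())

# A strained pose that buries more surface can outscore the native-like one:
# nothing in the functional form measures geometric realism.
native_like = {"vdw": 40.0, "hbond": 3.0, "desolvation": 1.0, "electrostatic": 2.0}
strained = {"vdw": 70.0, "hbond": 1.0, "desolvation": 1.5, "electrostatic": 2.0}
print(round(empirical_score(native_like), 2))  # -9.4
print(round(empirical_score(strained), 2))     # -11.1, the strained pose "wins"
```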

Deep learning-based scoring functions have shown promise in improving binding affinity prediction, but even these often underperform in pose selection tasks when trained solely on affinity labels. Many lack the resolution to penalize subtle steric clashes or torsional strain — both critical in distinguishing native-like poses from decoys.

What can we do about it?

Some practical mitigation strategies include:

  • Rescoring with more physics-aware functions: Tools like MM-GBSA or quantum-derived methods can improve pose discrimination, though at higher computational cost.
  • Consensus scoring: Chaput and Mouawad found that combining docking programs via a “wisdom of the crowd” approach improved top-4 pose accuracy to 87%, outperforming any individual tool (a rank-aggregation sketch follows this list).
  • Post-docking filtering: Using torsion strain filters, hydrogen bond satisfaction checks, or pose interaction fingerprints (IFPs) can eliminate obviously implausible poses before ranking (a minimal clash filter is sketched below).
  • Binding pose metadynamics: MD-based approaches of this kind estimate pose stability from kinetic escape times and dynamic consistency with the binding site, helping flag poses that are energetically plausible but dynamically unstable.
  • Domain-specific ML models: Deep learning methods trained on pose quality (rather than affinity) — or with 3D structural embeddings — can outperform traditional scoring on this task when trained on appropriate datasets like CASF or PDBbind.
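
To illustrate the consensus idea from the list above, here is a generic rank-aggregation sketch. The Borda-style voting is a stand-in for the general approach, not the exact protocol of Chaput and Mouawad, and the pose IDs and program labels are made up.

```python
# Generic rank-aggregation ("wisdom of the crowd") sketch for pose
# consensus. Each program contributes a best-first ranking of the same
# pose set, and poses collect Borda-style points by rank position.
# This is a stand-in for the idea, not the exact published protocol.

from collections import defaultdict

def consensus_ranking(rankings):
    """rankings: list of pose-ID lists, each sorted best-first."""
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, pose_id in enumerate(ranking):
            points[pose_id] += n - position  # best rank earns most points
    return sorted(points, key=points.get, reverse=True)

# Hypothetical rankings of the same three poses by three programs:
glide = ["pose3", "pose1", "pose2"]
surflex = ["pose1", "pose3", "pose2"]
vina = ["pose1", "pose2", "pose3"]

print(consensus_ranking([glide, surflex, vina]))
# ['pose1', 'pose3', 'pose2']: pose1 tops the consensus despite Glide's vote
```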
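And for the post-docking filtering step, a minimal geometric sanity check. Coordinates are plain NumPy arrays and the clash cutoff is an illustrative assumption; a real filter would add torsion strain and hydrogen bond checks on top.

```python
# Minimal post-docking geometric filter: discard poses whose heavy atoms
# come closer to protein heavy atoms than a clash cutoff. Coordinates are
# plain Nx3 NumPy arrays; the 2.2 Å cutoff is an illustrative choice.

import numpy as np

def has_steric_clash(ligand_xyz, protein_xyz, cutoff=2.2):
    """True if any ligand-protein heavy-atom pair is closer than cutoff (Å)."""
    diffs = ligand_xyz[:, None, :] - protein_xyz[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return bool((dists < cutoff).any())

def filter_poses(poses, protein_xyz):
    """Keep only poses that pass the clash check, before any rescoring."""
    return [pose for pose in poses if not has_steric_clash(pose, protein_xyz)]

protein = np.array([[0.0, 0.0, 0.0], [3.5, 0.0, 0.0]])
clashing_pose = np.array([[1.0, 0.0, 0.0]])  # 1.0 Å from a protein atom
clean_pose = np.array([[0.0, 5.0, 0.0]])     # well separated
print(len(filter_poses([clashing_pose, clean_pose], protein)))  # 1
```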

Final thoughts

Pose prediction is not just a subroutine — it's foundational. Misranking a good pose early in the workflow can cascade into misleading SAR, wrong pharmacophore hypotheses, and wasted synthetic efforts. As long as we keep scoring the wrong thing, even good docking engines will fail us.

Don’t blame docking — fix the scoring!
