Custom metrics and selectors
============================

.. jupyter-execute::
    :hide-code:

    from pathlib import Path
    import os
    import pandas as pd

    path_to_systems = Path(os.getcwd()) / "tests" / "data" / "predictions"
    pd.options.display.float_format = lambda x: "{:.2f}".format(x)

.. currentmodule:: peppr

``peppr`` is written with extensibility in mind:
You can easily define your own :class:`Metric` and :class:`Selector` classes
to extend the functionality of the package.

Creating a custom metric
------------------------
To create a custom metric, one only needs to subclass the :class:`Metric` base class.
For the scope of this tutorial we will create an all-atom RMSD as metric.

.. jupyter-execute::

    import peppr
    import numpy as np
    import biotite.structure as struc

    class RMSD(peppr.Metric):

        # This is mandatory
        @property
        def name(self):
            return "My awesome RMSD"

        # This optionally enables binning for `Evaluator.summarize_metrics()`
        @property
        def thresholds(self):
            return {"good": 0.0, "ok": 10.0, "bad": 20.0}

        # The core of each metric
        def evaluate(self, reference, pose):
            # Filter to common atoms
            reference, pose = peppr.filter_matched(reference, pose)
            # `reference` and `pose` are probably not superimposed onto each other
            pose, _ = struc.superimpose(reference, pose)
            return struc.rmsd(reference, pose).item()

        # Defines whether a smaller or larger value indicates a better prediction
        # Used by the `Selector` classes
        def smaller_is_better(self):
            return True

A few notes on the :meth:`evaluate()` method:
The purpose of a :class:`Metric` is to be used in context of an :class:`Evaluator`.
As the :class:`Evaluator` already automatically preprocesses the input systems passed
to :meth:`Evaluator.feed()`, we can rely on some properties of each system that make
our life easier:

- The input system will only contain heavy atoms (i.e. no hydrogen).
- The input reference and pose have a `matched` annotation that can be used as mask
  to obtain corresponding atoms between the reference and pose.
  This means that atom ``reference[reference.matched][i]`` corresponds to the atom
  ``pose[pose.matched][i]``.
  You can use the convenient :func:`peppr.filter_matched()` function to get a
  reference and pose :class:`AtomArray` with matching shape, if required for the metric.
  Finding the optimal atom mapping in case of ambiguous scenarios, as they appear in
  homomers or symmetric molecules, is already handled under the hood beforehand.
- The :class:`Metric` should respect the ``hetero`` annotation of the input
  :class:`biotite.structure.AtomArray`.
  An atom where ``hetero`` is ``False`` is defined as polymer and where ``hetero`` is
  ``True`` as small molecule.
  For example a peptide chain with ``hetero`` being ``True`` should still be interpreted
  as a small molecule.
- Atoms with the same ``chain_id`` should be considered as part of the same molecule,
  atoms with different ``chain_id`` should be considered as different molecules.

Creating a custom selector
--------------------------
Custom selectors can also be created by inheriting from the :class:`Selector` base
class.
Let's create a selector that picks the *worst case* prediction, i.e. the pose
with the worst value for a given metric.

.. jupyter-execute::

    class WorstCaseSelector(peppr.Selector):

        @property
        def name(self):
            return "worst case"

        def select(self, values, smaller_is_better):
            if smaller_is_better:
                return np.nanmax(values)
            else:
                return np.nanmin(values)

Note that the :meth:`Selector.select()` method does not know anything about the actual
pose or metric it is selecting from.
It only obtains values to select from (may also contain *NaN* values) and their
ordering.

Bringing it all together
------------------------
Now that we have created our custom metric and the selector, we can use them in the
:class:`Evaluator` to evaluate our predictions.

.. jupyter-execute::

    import biotite.structure.io.pdbx as pdbx

    evaluator = peppr.Evaluator([RMSD()])

    for system_dir in sorted(path_to_systems.iterdir()):
        if not system_dir.is_dir():
            continue
        system_id = system_dir.name
        pdbx_file = pdbx.CIFFile.read(system_dir / "reference.cif")
        reference = pdbx.get_structure(pdbx_file, model=1, include_bonds=True)
        poses = []
        for pose_path in sorted(system_dir.glob("poses/*.cif")):
            pdbx_file = pdbx.CIFFile.read(pose_path)
            poses.append(pdbx.get_structure(pdbx_file, model=1, include_bonds=True))
        evaluator.feed(system_id, reference, poses)

    evaluator.tabulate_metrics(selectors=[WorstCaseSelector()])

And finally, we will also see the bin thresholds for the RMSD metric in action.

.. jupyter-execute::

    for name, value in evaluator.summarize_metrics(
        selectors=[WorstCaseSelector()]
    ).items():
        print(f"{name}: {value:.2f}")