Example: Benchmarking Isolation Forest with Probabilistic Outliers Prediction
==============================================================================

Prerequisites
---------------

* Python 3.12 (recommended)
* Files on disk:

  * ``database/train_dataset_severson.db`` (benchmark labels per cycle)
  * ``database/train_features_severson.db`` (engineered features per cycle)
  * ``PIPELINE_OUTPUT_DIR/hyperparams_iforest.csv`` (An example of tuned
    hyperparameters per cell for Isolation Forest)

* (Optional) LaTeX installation if you want Matplotlib to render text with
  LaTeX:

  * A TeX distribution (e.g., TeX Live/MacTeX/MiKTeX), dvipng, and fonts
    like cm-super.
  * Don’t have LaTeX installed? Either install it, or set
    ``rcParams["text.usetex"] = False``.

Before running the example in the ``quick_start_tutorials`` section, please
evaluate whether the global directory path specified in ``src/osbad/config.py``
needs to be updated:

.. code-block:: python

    # Modify this global directory path if needed
    PIPELINE_OUTPUT_DIR = Path.cwd().joinpath("artifacts_output_dir")

The following example of running a model with probabilistic outlier prediction
is also provided as a notebook in
``quick_start_tutorials/02_run_model_tutorial.ipynb``.

Step-1: Load libraries
---------------------------

Import the libraries into your local development environment, including the
``osbad`` library for benchmarking anomaly detection.

* ``Path`` is used for robust, cross-platform file paths.
* ``duckdb`` is the embedded analytical database engine storing the dataset.
* ``fireducks.pandas as pd`` gives you a pandas-compatible API;
  you can usually treat it like import pandas as pd.
* ``rcParams["text.usetex"] = True`` tells Matplotlib to render text via
  LaTeX. If you don’t have LaTeX installed, flip this to False.
* ``bconf``: project config utilities (e.g., where to write artifacts).
* ``BenchDB``: a thin layer around DuckDB that provides convenience loaders.
* ``ModelRunner``, ``hp``, ``modval``: modeling, hyperparameters, and
  model validation helpers for benchmarking study in this project.

.. code-block:: python

  # Standard library
  import os
  from pathlib import Path

  # Third-party libraries
  import duckdb
  import fireducks.pandas as pd
  import matplotlib.pyplot as plt
  from matplotlib import rcParams

  rcParams["text.usetex"] = True

  # Custom osbad library for benchmarking anomaly detection
  import osbad.config as bconf
  import osbad.hyperparam as hp
  import osbad.modval as modval
  from osbad.database import BenchDB
  from osbad.model import ModelRunner

Step-2: Load Benchmarking Dataset
------------------------------------

* Pick a specific cell based on the ``cell_index``, which identifies the
  experimental data corresponding to one unique cell.
* Create an artifacts folder for that cell, where you can save figures,
  tables, or model outputs related to this cell.
* Initialize ``BenchDB`` for the selected cell and path to the DuckDB file:
  ``train_dataset_severson.db``.
* Loads all data related to ``selected_cell_label`` from the training
  partition.

.. code-block:: python

    # Get the cell-ID from unique_cell_index_train
    selected_cell_label = "2017-05-12_5_4C-70per_3C_CH17"

    # Create a subfolder to store fig output
    # corresponding to each cell-index
    selected_cell_artifacts_dir = bconf.artifacts_output_dir(
        selected_cell_label)

    # Path to the DuckDB file:
    # "train_dataset_severson.db"
    db_filepath = (
        Path.cwd()
        .parent
        .joinpath("database","train_dataset_severson.db"))

    # Import the BenchDB class
    # Load only the dataset based on the selected cell
    benchdb = BenchDB(
        db_filepath,
        selected_cell_label)

    # load the benchmarking dataset
    df_selected_cell = benchdb.load_benchmark_dataset(
        dataset_type="train")

Step-3: Load the Features DB
------------------------------------

* Load the features (e.g., ``log_max_diff_dQ``, ``log_max_diff_dV``) based
  on ``selected_cell_label`` in ``BenchDB``.
* To make the chart more informative, bubble sizes are scaled by ratios
  calculated from the distributions of feature values (``max_diff_dQ`` and
  ``max_diff_dV``). Using absolute values ensures all sizes are positive.
* Plot the bubble chart using the logarithmic features ``log_max_diff_dQ``
  and ``log_max_diff_dV`` as the x and y axes. Bubble sizes are
  determined by the calculated ratios. Cycles flagged as outliers are
  highlighted via their indices.

.. code-block:: python

    # Define the filepath to ``train_features_severson.db``
    # DuckDB instance.
    db_features_filepath = (
        Path.cwd()
        .parent
        .joinpath("database","train_features_severson.db"))

    # Load only the training features dataset
    df_features_per_cell = benchdb.load_features_db(
        db_features_filepath,
        dataset_type="train")

    unique_cycle_count = (
        df_features_per_cell["cycle_index"].unique())

    # Calculate the bubble size ratio for plotting
    df_bubble_size_dQ = bstats.calculate_bubble_size_ratio(
        df_variable=df_features_per_cell["max_diff_dQ"])

    df_bubble_size_dV = bstats.calculate_bubble_size_ratio(
        df_variable=df_features_per_cell["max_diff_dV"])

    bubble_size = (
        np.abs(df_bubble_size_dV)
        * np.abs(df_bubble_size_dQ))

    # Plot the bubble chart and label the outliers
    axplot = bviz.plot_bubble_chart(
        xseries=df_features_per_cell["log_max_diff_dQ"],
        yseries=df_features_per_cell["log_max_diff_dV"],
        bubble_size=bubble_size,
        unique_cycle_count=unique_cycle_count,
        cycle_outlier_idx_label=true_outlier_cycle_index)

    axplot.set_title(
        f"Cell {selected_cell_label}", fontsize=13)

    axplot.set_xlabel(
        r"$\log(\Delta Q_\textrm{scaled,max,cyc)}\;\textrm{[Ah]}$",
        fontsize=12)
    axplot.set_ylabel(
        r"$\log(\Delta V_\textrm{scaled,max,cyc})\;\textrm{[V]}$",
        fontsize=12)

    output_fig_filename = (
        "log_bubble_plot_"
        + selected_cell_label
        + ".png")

    fig_output_path = (
        selected_cell_artifacts_dir.joinpath(output_fig_filename))

    plt.savefig(
        fig_output_path,
        dpi=200,
        bbox_inches="tight")

    plt.show()

.. image:: docs_figure/log_bubble_plot_2017-05-12_5_4C-70per_3C_CH17.png
   :height: 480px
   :width: 600 px
   :alt: Bubble plot from ``2017-05-12_5_4C-70per_3C_CH17``
   :align: center

Step-4: Train model with tuned hyperparameters
-----------------------------------------------------

* Resolves PIPELINE_OUTPUT_DIR and reads ``hyperparams_iforest.csv`` and
  filters the CSV to the row for ``selected_cell_label``.

* Extracts the hyperparameter dictionary:

  * ``contamination``,
  * ``n_estimators``,
  * ``max_samples``,
  * ``threshold``.

* Loads the Isolation Forest config from ``hp.MODEL_CONFIG["iforest"]``.
* Builds a ModelRunner with the cell label, feature DataFrame, and
  selected features.
* Calls ``runner.create_model_x_input()`` to get the X matrix
  (shape: n_cycles × n_features).
* Instantiates the model with ``cfg.model_param(param_dict)``,
  fits it, and computes probabilities with predict_proba.
* Converts probabilities into outlier indices and associated outlier scores.

.. code-block:: python

    # Access the global filepath variable PIPELINE_OUTPUT_DIR defined
    # in config.py
    PIPELINE_OUTPUT_DIR = bconf.PIPELINE_OUTPUT_DIR

    # Read hyperparameters from the stored artifacts
    hyperparam_filepath =  PIPELINE_OUTPUT_DIR.joinpath(
        "hyperparams_iforest.csv")
    df_hyperparam_from_csv = pd.read_csv(hyperparam_filepath)

    # Filter for the selected_cell_label
    df_param_per_cell = df_hyperparam_from_csv[
        df_hyperparam_from_csv["cell_index"] == selected_cell_label]

    # Hyperparameters for Isolation Forest
    param_dict = {
        "contamination": df_param_per_cell["contamination"].values[0],
        "n_estimators": df_param_per_cell["n_estimators"].values[0],
        "max_samples": df_param_per_cell["max_samples"].values[0],
        "threshold": df_param_per_cell["threshold"].values[0]}


    # Extract the model configuration for Isolation Forest
    # Note: the dict key for Isolation Forest is iforest
    cfg = hp.MODEL_CONFIG["iforest"]

    # The two features implemented in this example
    selected_feature_cols = (
        "log_max_diff_dQ",
        "log_max_diff_dV")

    # Create a ModelRunner instance based on selected_cell_label,
    # df_features_per_cell and
    # selected_feature_cols
    runner = ModelRunner(
        cell_label=selected_cell_label,
        df_input_features=df_features_per_cell,
        selected_feature_cols=selected_feature_cols
    )

    # Create the training input
    Xdata = runner.create_model_x_input()

    # Create the model based on the configured hyperparameters
    model = cfg.model_param(param_dict)
    print(model)

    # Fit the model and get probabilistic outliers prediction
    model.fit(Xdata)
    proba = model.predict_proba(Xdata)

    # Get the predicted outlier indices (pred_outlier_indices)
    # pred_outlier_indices correspond to predicted anomalous cycles in the
    # first example
    (pred_outlier_indices,
     pred_outlier_score) = runner.pred_outlier_indices_from_proba(
        proba=proba,
        threshold=param_dict["threshold"],
        outlier_col=cfg.proba_col
    )

Step-5: Predict Probabilistic Anomaly Score Map
-----------------------------------------------------

* ``pred_outlier_indices`` is a list of cycle indices predicted as
  anomalous by the Isolation Forest model. Using ``.isin()``, we filter the
  dataframe to keep only the cycles identified as anomalies.
* A new column, ``outlier_prob``, is added to store the outliers probability
  computed by the model, making it easy to track how confidently the
  algorithm flags each cycle.
* ``runner.predict_anomaly_score_map`` generates a 2D contour map of anomaly
  scores (outlier probability).

.. code-block:: python

    # Filter the selected features based on predicted outlier indices
    df_outliers_pred = df_features_per_cell[
        df_features_per_cell["cycle_index"].isin(pred_outlier_indices)].copy()

    df_outliers_pred["outlier_prob"] = pred_outlier_score

    # Plot the anomaly score map
    axplot = runner.predict_anomaly_score_map(
      selected_model=model,
      model_name="Isolation Forest",
      xoutliers=df_outliers_pred["log_max_diff_dQ"],
      youtliers=df_outliers_pred["log_max_diff_dV"],
      pred_outliers_index=pred_outlier_indices,
      threshold=param_dict["threshold"])

.. image:: docs_figure/isolation_forest_2017-05-12_5_4C-70per_3C_CH17.png
   :height: 450px
   :width: 600 px
   :alt: Anomaly score map with iForest from ``2017-05-12_5_4C-70per_3C_CH17``
   :align: center

The figure shows the anomaly score map produced by the Isolation Forest model:

* Background Heatmap:

  * Red regions: high anomaly probability (more likely to contain outliers).
  * Blue/white regions: low anomaly probability (normal cycles).
* Dashed Black Contour:

  * Represents the decision boundary defined by the Isolation Forest
    threshold. Points outside are considered anomalies.
* Black Dots:

  * Represent the majority of normal cycles (inlier data).
* Yellow Stars with Labels:

  * Mark the detected anomalous cycles (0, 40, 147, 148).
  * Their positions in the 2D feature space highlight where they deviate from
    typical battery behavior.
* Colorbar (right):

  * Quantifies anomaly probability (0 = normal, 1 = highly anomalous).
* Annotation Box:

  * Summarizes the predicted anomalous cycles.

Cycles 0 and 40 show unusually high voltage deviations, while 147 and 148
show strong deviations in charge capacity. These anomalies might correspond
to specific battery degradation events, sensor errors, or experimental
disturbances.


Step-6: Model performance evaluation
-----------------------------------------------------

* Map predicted outlier indices to the benchmark dataset:

  * ``df_selected_cell`` holds cycle-level records and the ground-truth label
    (e.g., ``outlier`` = 1 for anomalous cycles, else 0).
  * ``pred_outlier_indices`` is the list of cycle indices flagged by the
    model.
* ``modval.evaluate_pred_outliers(...)`` returns a tidy DataFrame with:

  * ``cycle_index``: Cell discharge cycle index
  * ``true_outlier``: ground truth (0/1).
  * ``pred_outlier``: model prediction (0/1) for the same cycles.

* ``modval.generate_confusion_matrix(...)`` aggregates counts of:

  * ``True Negative (TN)``: predicted 0, truth 0.
  * ``False Positive (FP)``: predicted 1, truth 0.
  * ``False Negative (FN)``: predicted 0, truth 1.
  * ``True Positive (TP)``: predicted 1, truth 1.

.. code-block:: python

    # Map the predicted outlier indices
    df_eval_outlier = modval.evaluate_pred_outliers(
      df_benchmark=df_selected_cell,
      outlier_cycle_index=pred_outlier_indices)

    # Confusion matrix
    axplot = modval.generate_confusion_matrix(
        y_true=df_eval_outlier["true_outlier"],
        y_pred=df_eval_outlier["pred_outlier"])

    axplot.set_title(
        "Isolation Forest",
        fontsize=16)

    output_fig_filename = (
        "conf_matrix_iforest_"
        + selected_cell_label
        + ".png")

    fig_output_path = (
        selected_cell_artifacts_dir
        .joinpath(output_fig_filename))

    plt.savefig(
        fig_output_path,
        dpi=600,
        bbox_inches="tight")

    plt.show()

    # Evaluate model performance
    df_current_eval_metrics = modval.eval_model_performance(
        model_name="iforest",
        selected_cell_label=selected_cell_label,
        df_eval_outliers=df_eval_outlier)

.. image:: docs_figure/conf_matrix_iforest_2017-05-12_5_4C-70per_3C_CH17.png
   :height: 420px
   :width: 550 px
   :alt: Confusion matrix with iForest from ``2017-05-12_5_4C-70per_3C_CH17``
   :align: center