Example(6): K-Nearest Neighbors with Hyperparameter Tuning using Proxy Evaluation Metrics
==========================================================================================

This example shows how to use **osbad** to tune the hyperparameters of an
unsupervised anomaly detection model when **no labeled training data** is
available. In such scenarios, traditional hyperparameter optimization methods
based on transfer learning cannot be directly applied. The objective function
must therefore be redefined, because conventional outlier detection metrics
such as precision and recall cannot be computed during the tuning process.
Instead, a surrogate multivariate regression model estimates model performance
through two proxy indicators: the regression loss and the inlier count score.
These serve as practical substitutes in the absence of ground-truth labels.
The underlying principle is that if the anomaly detection model effectively
isolates outliers, the remaining inlier data should exhibit a more coherent
structure, resulting in a lower regression loss.
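
The principle can be illustrated with a small, self-contained sketch. It is
not the ``osbad`` objective function: the synthetic data, the single-feature
least-squares fit, and the helper names ``fit_line_mse`` and
``inlier_count_score`` are assumptions made purely for illustration.

.. code-block:: python

  import random


  def fit_line_mse(x, y):
      """Fit y = a*x + b by least squares and return the mean squared error."""
      n = len(x)
      mean_x = sum(x) / n
      mean_y = sum(y) / n
      sxx = sum((xi - mean_x) ** 2 for xi in x)
      sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      slope = sxy / sxx
      intercept = mean_y - slope * mean_x
      residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
      return sum(r ** 2 for r in residuals) / n


  def inlier_count_score(n_inliers, n_total):
      """Fraction of cycles that are kept as inliers."""
      return n_inliers / n_total


  # Synthetic feature that drifts linearly with the cycle index (assumption)
  rng = random.Random(0)
  cycles = list(range(100))
  feature = [0.01 * c + rng.gauss(0.0, 0.02) for c in cycles]

  # Inject two anomalous cycles
  outlier_idx = {10, 60}
  for i in outlier_idx:
      feature[i] += 2.0

  # Regression loss on all cycles versus on the inlier cycles only
  loss_all = fit_line_mse(cycles, feature)
  inlier_cycles = [c for c in cycles if c not in outlier_idx]
  inlier_feature = [feature[c] for c in inlier_cycles]
  loss_inliers = fit_line_mse(inlier_cycles, inlier_feature)
  score = inlier_count_score(len(inlier_cycles), len(cycles))

  print(f"loss (all cycles):    {loss_all:.5f}")
  print(f"loss (inliers only):  {loss_inliers:.5f}")
  print(f"inlier count score:   {score:.2f}")

Removing the two injected anomalies lowers the regression loss while keeping
the inlier count score close to one. A detector that discards many normal
cycles could also lower the loss, which is why both indicators are optimized
together.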

The following example of running a hyperparameter tuning and anomaly detection
pipeline is also provided as a notebook in 
``machine_learning/hp_tuning_with_regression_proxy/severson_data_source/01_train_dataset/ml_02_knn_hyperparam_proxy_severson.ipynb``.

Step-1: Load libraries
---------------------------

Import the libraries into your local development environment, including the
``osbad`` library for benchmarking anomaly detection.

* ``Path``: robust, cross-platform file paths.
* ``duckdb``: the embedded analytical database engine that stores the dataset.
* ``optuna``: automated hyperparameter optimization that uses efficient
  algorithms such as Bayesian optimization to find the best parameter settings.
* ``bconf``: project config utilities (e.g., where to write artifacts).
* ``BenchDB``: a thin layer around DuckDB that provides convenience loaders.
* ``ModelRunner``, ``hp``, ``modval``: modeling, hyperparameter, and
  model validation helpers for the benchmarking study in this project.

.. code-block:: python

    # Standard library
    from pathlib import Path
    import pprint

    # Third-party libraries
    import duckdb
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    import optuna

    # Custom osbad library for anomaly detection
    import osbad.config as bconf
    import osbad.hyperparam as hp
    import osbad.modval as modval
    import osbad.viz as bviz
    from osbad.database import BenchDB
    from osbad.model import ModelRunner

Step-2: Load Benchmarking Dataset
------------------------------------

* Pick a specific cell based on the ``cell_index``, which identifies the
  experimental data corresponding to one unique cell.
* Create an artifacts folder for that cell, where you can save figures,
  tables, or model outputs related to this cell.
* Initialize ``BenchDB`` for the selected cell and path to the DuckDB file:
  ``train_dataset_severson.db``.
* Load all data related to ``selected_cell_label`` from the training
  partition.

.. code-block:: python

    # Get the cell-ID from cell_inventory
    selected_cell_label = "2017-05-12_5_4C-70per_3C_CH17"

    # Create a subfolder to store fig output
    # corresponding to each cell-index
    selected_cell_artifacts_dir = bconf.artifacts_output_dir(
        selected_cell_label)

    # Path to the DuckDB file:
    # "train_dataset_severson.db"
    db_filepath = (
        Path.cwd()
        .parent
        .joinpath("database","train_dataset_severson.db"))

    # Import the BenchDB class
    # Load only the dataset based on the selected cell
    benchdb = BenchDB(
        db_filepath,
        selected_cell_label)

    # load the benchmarking dataset
    df_selected_cell = benchdb.load_benchmark_dataset(
        dataset_type="train")


Step-3: Load the Features DB
------------------------------------

* Load the features (e.g., ``log_max_diff_dQ``, ``log_max_diff_dV``) based
  on ``selected_cell_label`` in ``BenchDB``.

.. code-block:: python

    # Define the filepath to ``train_features_severson.db``
    # DuckDB instance.
    db_features_filepath = (
        Path.cwd()
        .parent
        .joinpath("database","train_features_severson.db"))

    # Load only the training features dataset
    df_features_per_cell = benchdb.load_features_db(
        db_features_filepath,
        dataset_type="train")

Step-4: Hyperparameter Tuning with Optuna using Proxy Metrics
-------------------------------------------------------------

* Define the search space for K-Nearest Neighbors hyperparameters:

  * ``contamination``: Expected proportion of outliers (0.0 - 0.5)
  * ``n_neighbors``: Number of nearest neighbors to consider when computing
    the anomaly score (2 - 50)
  * ``method``: Specifies how the anomaly score is calculated.
    Common options include:

    - ``largest``: Distance to the farthest of the ``n_neighbors`` nearest
      neighbors.
    - ``mean``: Average distance to all ``n_neighbors`` nearest neighbors.
    - ``median``: Median distance to the ``n_neighbors`` nearest neighbors.

  * ``metric``: The distance metric used to compute neighbor distances.
  * ``threshold``: Decision threshold for outlier probability (0.0 - 1.0)

* Use Optuna's TPE sampler to optimize both proxy metrics jointly: minimize
  the regression loss score and maximize the inlier count score.
* Run 100 trials to find the best hyperparameter configuration.

.. code-block:: python

  # Define the hyperparameter search space for KNN
  hp_space_knn=lambda trial: {
      "contamination": trial.suggest_float(
          "contamination", 0, 0.5),
      "n_neighbors": trial.suggest_int(
          "n_neighbors", 2, 50, step=2),
      "method": trial.suggest_categorical(
          "method", ["largest", "mean", "median"]),
      "metric": trial.suggest_categorical(
          "metric", ["minkowski", "euclidean", "manhattan"]),
      "threshold": trial.suggest_float(
          "threshold", 0, 1)}

  # Instantiate an optuna study for knn model
  sampler = optuna.samplers.TPESampler(seed=42)

  selected_feature_cols = (
      "cycle_index",
      "log_max_diff_dQ",
      "log_max_diff_dV")

  knn_study = optuna.create_study(
      study_name="knn_hyperparam",
      sampler=sampler,
      directions=["minimize","maximize"])

  knn_study.optimize(
      lambda trial: hp.objective(
          trial,
          model_id="knn",
          df_feature_dataset=df_features_per_cell,
          selected_feature_cols=selected_feature_cols,
          hp_space=hp_space_knn,
          selected_cell_label=selected_cell_label),
      n_trials=100)

.. note:: The objective function takes no ``df_benchmark_dataset`` argument.
  The optimization trials do not depend on recall or precision. They rely
  instead on the proxy metrics, which are computed independently of the
  true labels.
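
To make the roles of ``n_neighbors`` and ``method`` in the search space more
concrete, the following sketch scores each point by its distances to its
nearest neighbors. It is a plain-Python illustration with Euclidean distance
only, not the detector used by ``osbad``; the function name
``knn_outlier_scores`` and the toy points are hypothetical.

.. code-block:: python

  import math
  import statistics


  def knn_outlier_scores(points, n_neighbors, method="largest"):
      """Score each point from the distances to its n_neighbors nearest points."""
      reducers = {
          "largest": max,               # distance to the farthest of the k neighbors
          "mean": statistics.fmean,     # average distance to the k neighbors
          "median": statistics.median,  # median distance to the k neighbors
      }
      reduce_fn = reducers[method]
      scores = []
      for i, p in enumerate(points):
          # Sorted Euclidean distances to every other point
          dists = sorted(
              math.dist(p, q) for j, q in enumerate(points) if j != i)
          scores.append(reduce_fn(dists[:n_neighbors]))
      return scores


  # Four tightly clustered points and one isolated point (toy data)
  points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]

  for method in ("largest", "mean", "median"):
      scores = knn_outlier_scores(points, n_neighbors=2, method=method)
      most_anomalous = max(range(len(points)), key=scores.__getitem__)
      print(method, [round(s, 3) for s in scores], most_anomalous)

For every ``method`` the isolated point receives the highest score. The
methods differ in how strongly a single distant neighbor influences the score,
which is why ``method`` is tuned together with ``n_neighbors``.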

Step-5: Aggregate Best Hyperparameters
--------------------------------------

* Extract the optimal trade-off trials, i.e. the best compromise solutions,
  from the Pareto-optimal trials.
* The ``trade_off_trials_detection`` function from the ``hp`` module uses a
  frequency-based approach to detect the best compromise trials (marked in
  green in the Pareto front plot).
* Hyperparameters from these trials are aggregated using the median for
  numeric parameters and the mode for categorical ones.
* Export the optimized hyperparameters to CSV for reproducibility.

.. code-block:: python

  schema_knn = {
      "contamination": "median",
      "n_neighbors": "median_int",
      "method": "mode",
      "metric": "mode",
      "threshold": "median",
  }

  trade_off_trials_list = hp.trade_off_trials_detection(
      study=knn_study)

  df_knn_hyperparam = hp.aggregate_best_trials(
      trade_off_trials_list,
      cell_label=selected_cell_label,
      model_id="knn",
      schema=schema_knn)

  hp.plot_proxy_pareto_front(
      knn_study,
      trade_off_trials_list,
      selected_cell_label,
      fig_title="K Nearest Neighbors (KNN) Pareto Front")

  plt.show()

  # Export current hyperparameters to CSV
  hyperparam_filepath = Path.cwd().joinpath(
      "ml_02_knn_hyperparam_proxy_severson.csv")

  hp.export_current_hyperparam(
      df_knn_hyperparam,
      selected_cell_label,
      export_csv_filepath=hyperparam_filepath,
      if_exists="replace")
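
The aggregation schema above can be read as one rule per hyperparameter. The
sketch below shows one plausible interpretation using only the standard
library. The helper ``aggregate_params`` and the example trial parameters are
hypothetical, and treating ``"median_int"`` as "median rounded to the nearest
integer" is an assumption, not a documented behavior of
``hp.aggregate_best_trials``.

.. code-block:: python

  import statistics


  def aggregate_params(trial_params, schema):
      """Combine the parameter dicts of several trials into a single dict."""
      aggregated = {}
      for name, rule in schema.items():
          values = [params[name] for params in trial_params]
          if rule == "median":
              aggregated[name] = statistics.median(values)
          elif rule == "median_int":
              # Assumption: round the median to the nearest integer
              aggregated[name] = int(round(statistics.median(values)))
          elif rule == "mode":
              aggregated[name] = statistics.mode(values)
          else:
              raise ValueError(f"Unknown aggregation rule: {rule}")
      return aggregated


  # Hypothetical parameters of three trade-off trials
  example_trials = [
      {"contamination": 0.10, "n_neighbors": 10, "method": "mean",
       "metric": "euclidean", "threshold": 0.6},
      {"contamination": 0.20, "n_neighbors": 14, "method": "mean",
       "metric": "manhattan", "threshold": 0.7},
      {"contamination": 0.30, "n_neighbors": 20, "method": "largest",
       "metric": "euclidean", "threshold": 0.9},
  ]

  example_schema = {
      "contamination": "median",
      "n_neighbors": "median_int",
      "method": "mode",
      "metric": "mode",
      "threshold": "median",
  }

  best_params = aggregate_params(example_trials, example_schema)
  print(best_params)

Using the median and mode rather than a single trial makes the final
configuration less sensitive to any one trial on the Pareto front.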

.. image:: docs_figure/ml_06_severson_knn_proxy_regr/k_nearest_neighbors_(knn)_pareto_front_2017-05-12_5_4C-70per_3C_CH17.png
   :height: 500px
   :width: 600 px
   :alt: Pareto front from ``2017-05-12_5_4C-70per_3C_CH17``
   :align: center
        
* This figure illustrates the Pareto front obtained from the Bayesian
  optimization, which minimizes the normalized regression loss score and
  maximizes the normalized inlier count score for K-Nearest Neighbors on the
  Severson dataset.
* The X-axis is the normalized regression loss score: the loss between the
  actual features and the features predicted by a multivariate linear
  regression model, computed on the cycles that the unsupervised anomaly
  detection model predicts as inliers for the selected configuration.
* The Y-axis is the normalized inlier count score: the ratio of predicted
  inlier cycles to the total number of cycles.
* The blue scattered points represent all trials evaluated during the
  optimization process, the red dots denote the Pareto-optimal trials, and the
  green dot denotes the best compromise solution.
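
A trial is Pareto-optimal when no other trial is at least as good on both
objectives and strictly better on one. The sketch below applies this
definition to hypothetical ``(regression loss, inlier count score)`` pairs. It
is a generic illustration of non-dominated filtering, not the selection logic
used inside Optuna or ``osbad``, and the function name ``pareto_optimal`` is an
assumption.

.. code-block:: python

  def pareto_optimal(trials):
      """Return indices of non-dominated (loss, inlier_score) pairs.

      The loss is minimized and the inlier score is maximized.
      """
      front = []
      for i, (loss_i, score_i) in enumerate(trials):
          dominated = any(
              loss_j <= loss_i
              and score_j >= score_i
              and (loss_j < loss_i or score_j > score_i)
              for j, (loss_j, score_j) in enumerate(trials)
              if j != i)
          if not dominated:
              front.append(i)
      return front


  # Hypothetical normalized (regression loss, inlier count score) per trial
  example_objectives = [
      (0.10, 0.70),
      (0.20, 0.90),
      (0.30, 0.95),
      (0.25, 0.80),  # dominated by (0.20, 0.90)
      (0.15, 0.60),  # dominated by (0.10, 0.70)
      (0.40, 0.95),  # dominated by (0.30, 0.95)
  ]

  front_indices = pareto_optimal(example_objectives)
  print(front_indices)

In this toy example the first three trials form the front: each one trades a
higher regression loss for a higher inlier count score.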

Step-6: Train Model with Best Hyperparameters
---------------------------------------------

* Load the optimized hyperparameters from the CSV file.
* Create a ``ModelRunner`` instance with the selected features.
* Train the KNN model using the best hyperparameters.
* Predict outlier probabilities and identify anomalous cycles.

.. code-block:: python

  # Load best trial parameters from CSV output
  df_hyperparam_from_csv = pd.read_csv(hyperparam_filepath)

  df_param_per_cell = df_hyperparam_from_csv[
      df_hyperparam_from_csv["cell_index"] == selected_cell_label]

  param_dict = df_param_per_cell.iloc[0].to_dict()
  pprint.pp(param_dict)

  # Run the model with best trial parameters
  cfg = hp.MODEL_CONFIG["knn"]

  runner = ModelRunner(
      cell_label=selected_cell_label,
      df_input_features=df_features_per_cell,
      selected_feature_cols=selected_feature_cols
  )

  Xdata = runner.create_model_x_input()

  model = cfg.model_param(param_dict)
  print(model)
  model.fit(Xdata)
  proba = model.predict_proba(Xdata)

  (pred_outlier_indices,
   pred_outlier_score) = runner.pred_outlier_indices_from_proba(
      proba=proba,
      threshold=param_dict["threshold"],
      outlier_col=cfg.proba_col
  )

  # Collect the feature rows of the predicted outlier cycles
  df_outliers_pred = (df_features_per_cell[
      df_features_per_cell["cycle_index"]
      .isin(pred_outlier_indices)].copy())

  df_outliers_pred["outlier_prob"] = pred_outlier_score
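
The role of the tuned ``threshold`` can be sketched as a simple filter on the
predicted outlier probabilities. This is an illustration only: the helper
``outliers_from_proba`` is hypothetical, and the strict ``>`` comparison is an
assumption about how ``pred_outlier_indices_from_proba`` applies the
threshold.

.. code-block:: python

  def outliers_from_proba(cycle_index, outlier_proba, threshold):
      """Return the cycle indices and probabilities above the threshold."""
      # Assumption: a cycle is flagged when its probability exceeds the threshold
      flagged = [
          (cycle, prob)
          for cycle, prob in zip(cycle_index, outlier_proba)
          if prob > threshold]
      indices = [cycle for cycle, _ in flagged]
      scores = [prob for _, prob in flagged]
      return indices, scores


  # Hypothetical outlier probabilities for five cycles
  example_cycles = [0, 1, 2, 3, 4]
  example_proba = [0.95, 0.10, 0.05, 0.40, 0.80]

  flagged_cycles, flagged_scores = outliers_from_proba(
      example_cycles, example_proba, threshold=0.7)
  print(flagged_cycles, flagged_scores)

Lowering the threshold flags more cycles as outliers, which reduces the inlier
count score. This is the trade-off that the multi-objective study explores.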


Step-8: Predict Anomaly Score Map
-----------------------------------

* Generate a 2D contour map showing the anomaly probability across the
  feature space.
* Highlight the predicted anomalous cycles.
* The map helps visualize which regions of the feature space are considered
  anomalous by the model.

.. code-block:: python

  axplot = runner.predict_anomaly_score_map(
    selected_model=model,
    model_name="K Nearest Neighbors (KNN)",
    xoutliers=df_outliers_pred["log_max_diff_dQ"],
    youtliers=df_outliers_pred["log_max_diff_dV"],
    pred_outliers_index=pred_outlier_indices,
    threshold=param_dict["threshold"]
  )

  axplot.set_xlabel(
      r"$\log(\Delta Q_{\mathrm{scaled,max,cyc}})$ [Ah]",
      fontsize = 12)

  axplot.set_ylabel(
      r"$\log(\Delta V_{\mathrm{scaled,max,cyc}})$ [V]",
      fontsize = 12)

  output_fig_filename = (
      "knn_"
      + selected_cell_label
      + ".png")

  fig_output_path = (
      selected_cell_artifacts_dir
      .joinpath(output_fig_filename))

  plt.savefig(
      fig_output_path,
      dpi=600,
      bbox_inches="tight")

  plt.show()

.. image:: /docs_figure/ml_06_severson_knn_proxy_regr/knn_2017-05-12_5_4C-70per_3C_CH17.png
   :height: 420px
   :width: 600 px
   :alt: Anomaly score map from ``2017-05-12_5_4C-70per_3C_CH17``
   :align: center

The visualization illustrates the decision boundary and anomaly probability 
distribution in the two-dimensional feature space defined by:

* *log(ΔQ_scaled,max,cyc)*: Represents the scaled change in maximum discharge 
  capacity across cycles.
* *log(ΔV_scaled,max,cyc)*: Represents the scaled change in maximum voltage
  across cycles.

**Color Gradient Interpretation**

* Dark Blue Regions (outlier probability ≈ 0.0): Indicate normal operating
  conditions where cycles exhibit typical capacity and voltage change patterns.
* Light Blue to White Regions (outlier probability ≈ 0.2–0.5): Transition zones
  where the KNN model begins to detect deviations from expected behavior.
* Orange to Red Regions (outlier probability ≈ 0.6–0.8): Areas with moderate
  anomaly likelihood, suggesting unusual combinations of capacity and voltage 
  changes.
* Dark Red Regions (outlier probability ≈ 1.0): High-confidence anomaly zones 
  where cycles are strongly classified as outliers.

**Decision Boundary**

* The dashed black contour represents the decision threshold separating normal 
  cycles from anomalous ones based on the KNN distance metric. 
  
**Predicted Normal vs Anomalous Cycles**

* Yellow stars mark the detected anomalous cycles at indices 0, 40, 147, and 
  148, as annotated in the legend box. The majority of normal cycles cluster 
  in the central dark-blue region, indicating stable degradation behavior.

**Key Insight**

* This visualization shows how the KNN model uses distances to neighboring
  cycles to distinguish anomalous capacity-voltage change patterns from normal
  behavior. The detected anomalies lie far from the dense cluster of normal
  cycles, highlighting their deviation in both engineered features.
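
The idea behind such a score map can be sketched by evaluating a distance-based
score on a regular grid and rescaling it to the range 0 to 1. The snippet
below is a simplified stand-in, not the implementation of
``predict_anomaly_score_map``: the helper ``knn_mean_distance``, the toy
training points, the grid, and the min-max scaling are all assumptions.

.. code-block:: python

  import math


  def knn_mean_distance(query, train_points, n_neighbors):
      """Mean Euclidean distance from query to its nearest training points."""
      dists = sorted(math.dist(query, p) for p in train_points)
      return sum(dists[:n_neighbors]) / n_neighbors


  # Toy two-dimensional training features clustered near the origin
  train_points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2), (0.1, 0.0)]

  # Regular grid over the feature space: 0.0, 0.5, 1.0, 1.5, 2.0
  grid_axis = [i * 0.5 for i in range(5)]
  raw_map = [
      [knn_mean_distance((x, y), train_points, n_neighbors=3)
       for x in grid_axis]
      for y in grid_axis]

  # Min-max scale the raw scores so the map reads like a probability
  flat = [value for row in raw_map for value in row]
  lowest, highest = min(flat), max(flat)
  score_map = [
      [(value - lowest) / (highest - lowest) for value in row]
      for row in raw_map]

  # Grid cells on the anomalous side of a hypothetical threshold of 0.5
  anomalous_cells = [
      (x, y)
      for y, row in zip(grid_axis, score_map)
      for x, value in zip(grid_axis, row)
      if value > 0.5]

  for row in score_map:
      print(" ".join(f"{value:.2f}" for value in row))

Grid cells near the training cluster receive scores close to 0, while cells
far from it approach 1. Drawing the contour where the score equals the
threshold gives a decision boundary of the kind shown in the figure.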

Step-9: Model Performance Evaluation
--------------------------------------

* The optimal hyperparameters are evaluated against the true labels using
  standard anomaly detection metrics for a post hoc evaluation and comparison.
* Generate a confusion matrix to visualize True Positives, False Positives, 
  True Negatives, and False Negatives.
* Calculate performance metrics: precision, recall, F1-score, and accuracy.

.. code-block:: python

  df_eval_outlier = modval.evaluate_pred_outliers(
    df_benchmark=df_selected_cell,
    outlier_cycle_index=pred_outlier_indices)

  # confusion matrix
  axplot = modval.generate_confusion_matrix(
    y_true=df_eval_outlier["true_outlier"],
    y_pred=df_eval_outlier["pred_outlier"])

  axplot.set_title(
      "K Nearest Neighbors (KNN)",
      fontsize=16)

  output_fig_filename = (
      "conf_matrix_knn_"
      + selected_cell_label
      + ".png")

  fig_output_path = (
      selected_cell_artifacts_dir
      .joinpath(output_fig_filename))

  plt.savefig(
      fig_output_path,
      dpi=600,
      bbox_inches="tight")

  plt.show()

  # evaluate model performance
  df_current_eval_metrics = modval.eval_model_performance(
    model_name="knn",
    selected_cell_label=selected_cell_label,
    df_eval_outliers=df_eval_outlier)

  # Export model performance metrics to CSV output
  hyperparam_eval_filepath = Path.cwd().joinpath(
    "eval_metrics_hp_single_cell_severson.csv")

  hp.export_current_model_metrics(
      model_name="knn",
      selected_cell_label=selected_cell_label,
      df_current_eval_metrics=df_current_eval_metrics,
      export_csv_filepath=hyperparam_eval_filepath,
      if_exists="replace")

.. image:: docs_figure/ml_06_severson_knn_proxy_regr/conf_matrix_knn_2017-05-12_5_4C-70per_3C_CH17.png
   :height: 450px
   :width: 550 px
   :alt: Confusion matrix from ``2017-05-12_5_4C-70per_3C_CH17``
   :align: center
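
For reference, the metrics listed in this step follow directly from the four
confusion matrix counts. The sketch below computes them in plain Python for
hypothetical labels. The function ``classification_metrics`` is illustrative
and is not part of ``modval``.

.. code-block:: python

  def classification_metrics(y_true, y_pred):
      """Compute confusion matrix counts and derived metrics for 0/1 labels."""
      pairs = list(zip(y_true, y_pred))
      tp = sum(1 for t, p in pairs if t == 1 and p == 1)
      fp = sum(1 for t, p in pairs if t == 0 and p == 1)
      tn = sum(1 for t, p in pairs if t == 0 and p == 0)
      fn = sum(1 for t, p in pairs if t == 1 and p == 0)

      # Guard against division by zero when a class is never predicted
      precision = tp / (tp + fp) if (tp + fp) else 0.0
      recall = tp / (tp + fn) if (tp + fn) else 0.0
      f1 = (2 * precision * recall / (precision + recall)
            if (precision + recall) else 0.0)
      accuracy = (tp + tn) / len(pairs)

      return {
          "tp": tp, "fp": fp, "tn": tn, "fn": fn,
          "precision": precision, "recall": recall,
          "f1_score": f1, "accuracy": accuracy}


  # Hypothetical true and predicted outlier flags for eight cycles
  example_true = [1, 0, 0, 1, 0, 0, 1, 0]
  example_pred = [1, 0, 0, 0, 0, 1, 1, 0]

  metrics = classification_metrics(example_true, example_pred)
  print(metrics)

Because anomalous cycles are rare, accuracy alone can look high even when
outliers are missed, so precision and recall are the more informative values
to compare across models.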