Example(5): Distance Based Anomaly Detection using Euclidean Distance ====================================================================== This example illustrates how to leverage the capabilities of **osbad** for distance-based anomaly detection using ``osbad.dbad`` with multivariate datasets. Although distance-based metrics are commonly integrated into various ML-based anomaly detection frameworks, this example shows a simpler adaptation using centroid-based distance calculation. In scenarios where low latency, model interpretability, or limited computational resources are critical, such as in process-control hardware or embedded monitoring systems, a straightforward centroid-based method offers a more practical and efficient alternative to computationally intensive ML algorithms. The following example of running a hyperparameter tuning and anomaly detection pipeline is also provided as a notebook in ``distance/distance_01_euclidan.ipynb``. Step-1: Load libraries --------------------------- Import the libraries into your local development environment, including the ``osbad`` library for benchmarking anomaly detection. * ``Path`` is used for robust, cross-platform file paths. * ``duckdb`` is the embedded analytical database engine storing the dataset. * ``bconf``: project config utilities (e.g., where to write artifacts). * ``BenchDB``: a thin layer around DuckDB that provides convenience loaders. * ``ModelRunner``, ``modval``: modeling and model validation helpers for benchmarking study in this project. * ``dbad``: utilities for computing distances, identifying outliers and visualizing the results. .. code-block:: python # Standard libraries from pathlib import Path # Third-party libraries import duckdb import pandas as pd import numpy as np import matplotlib.pyplot as plt # Custom osbad library for anomaly detection import osbad.config as bconf import osbad.modval as modval from osbad.database import BenchDB from osbad.model import ModelRunner # importing distance based anomaly detection utilities from osbad import dbad Step-2: Load Benchmarking Dataset ------------------------------------ * Pick a specific cell based on the ``cell_index``, which identifies the experimental data corresponding to one unique cell. * Create an artifacts folder for that cell, where you can save figures, tables, or model outputs related to this cell. * Initialize ``BenchDB`` for the selected cell and path to the DuckDB file: ``train_dataset_severson.db``. * Loads all data related to ``selected_cell_label`` from the training partition. .. code-block:: python # Get the cell-ID from cell_inventory selected_cell_label = "2017-05-12_5_4C-70per_3C_CH17" # Create a subfolder to store fig output # corresponding to each cell-index selected_cell_artifacts_dir = bconf.artifacts_output_dir( selected_cell_label) # Path to the DuckDB file: # "train_dataset_severson.db" db_filepath = ( Path.cwd() .parent .joinpath("database","train_dataset_severson.db")) # Import the BenchDB class # Load only the dataset based on the selected cell benchdb = BenchDB( db_filepath, selected_cell_label) # load the benchmarking dataset df_selected_cell = benchdb.load_benchmark_dataset( dataset_type="train") Step-3: Load the Features DB ------------------------------------ * Load the features (e.g., ``log_max_diff_dQ``, ``log_max_diff_dV``) based on ``selected_cell_label`` in ``BenchDB``. .. code-block:: python # Define the filepath to ``train_features_severson.db`` # DuckDB instance. db_features_filepath = ( Path.cwd() .parent .joinpath("database","train_features_severson.db")) # Load only the training features dataset df_features_per_cell = benchdb.load_features_db( db_features_filepath, dataset_type="train") Step-4: Select features and calculate distribution centroid ----------------------------------------------------------- * Builds a ModelRunner with the cell label, feature DataFrame, and selected features. * Calls ``runner.create_model_x_input()`` to get the X matrix (shape: n_cycles × n_features). * Calculate centroid on the feature distribution based on the median value. shape of ``centroid`` should be (number of selected features, ). .. code-block:: python # The two features implemented in this example selected_feature_cols = ( "log_max_diff_dQ", "log_max_diff_dV") # Create a ModelRunner instance based on selected_cell_label, # df_features_per_cell and # selected_feature_cols runner = ModelRunner( cell_label=selected_cell_label, df_input_features=df_features_per_cell, selected_feature_cols=selected_feature_cols) # get features and calculate centroid features = runner.create_model_x_input() centroid = np.median(features, axis=0) Step-5: Calculate distance from centroid and detect outliers ------------------------------------------------------------ * Select one of the available distance metrics to use for centroid based anomaly detection. This includes ``euclidean``, ``manhattan``, ``minkowski`` and ``mahalanobis`` distances. * ``dbad`` module provied ``calculate_distance`` method which measures the distance between centroid and each data point in the distribution and return an array of shape (n_cycles,). It also returns the maximum distance (farthest point from centroid) for distance normalization. * ``predict_outliers`` method provides functionality to detects outliers using the threshold based on Median Absolute Deviation (MAD) method. .. code-block:: python # select which distance-metric to use metric_name= "euclidean" # calculate euclidean distance using dbad module euclidean_dist, max_euclidean_dist = dbad.calculate_distance( metric_name=metric_name, features=features, centroid=centroid, norm=True) # predict outlier cycles based on distance. Also returns threshold distance # used, calculated based on MAD (pred_outlier_indices, pred_outlier_distance, pred_outlier_features, euclidean_threshold) = dbad.predict_outliers( distance=euclidean_dist, features= features, mad_threshold=3) print("\nPredicted Anomalous Cycles:", pred_outlier_indices) print("Euclidean distance for Outlier Cycles:", pred_outlier_distance) print("Euclidean Threshold:", euclidean_threshold) Step-6: Plot distance score mapping and contour ----------------------------------------------- * ``pred_outlier_indices`` is a list of cycle indices predicted as anomalous by the centroid based anomaly detection method. * A new column, ``outlier_distance``, is added to store the outliers distance score computed by the model, making it easy to track flagged cycles. * ``runner.create_2d_mesh_grid()`` generates a 2D mesh grid, which is a structured set of points covering the feature space. The mesh grid is used to calculate distances and visualize the score map. The returned values include: - **xx and yy**: Coordinate matrices for plotting. - **meshgrid**: Combined grid points for distance computation. * ``calculate_distance`` is again utilized to get the euclidean distance between each point in the mesh grid and the centroid of the feature space. ``max_distance`` is used for normalization and ``norm`` can be set to ``True`` to ensures distances are scaled between 0 and 1. * ``plot_distance_score_map`` creates a visual representation of the distance score map. .. code-block:: python # Select rows from df_features_per_cell where 'cycle_index' matches the # predicted outlier indices and create a copy to avoid modifying the original # DataFrame. df_outliers_pred = df_features_per_cell[ df_features_per_cell["cycle_index"].isin(pred_outlier_indices) ].copy() # Store the predicted outlier distances in df. df_outliers_pred["outlier_distance"] = pred_outlier_distance # Create a 2D mesh grid for plotting the distance score map. # This returns xx, yy coordinates and the combined meshgrid points. xx, yy, meshgrid = runner.create_2d_mesh_grid() # Calculate the Euclidean distance for each point in the mesh grid relative # to the centroid. The distance is normalized using max_euclidean_dist. grid_euclidean_dist = dbad.calculate_distance( metric_name=metric_name, features=meshgrid, centroid=centroid, max_distance=max_euclidean_dist, norm=True ) # Plot the distance score map using the calculated distances. axplot = dbad.plot_distance_score_map( meshgrid_distance=grid_euclidean_dist, xx=xx, yy=yy, features=features, xoutliers=df_outliers_pred["log_max_diff_dQ"], youtliers=df_outliers_pred["log_max_diff_dV"], centroid=centroid, threshold=euclidean_threshold, pred_outlier_indices=pred_outlier_indices, norm=True ) # Set the title of the plot. axplot.set_title('Euclidean Distance', fontsize=12) # Generate a filename for the output figure filename = f"{metric_name}_dist_map" output_fig_filename = filename + "_" + selected_cell_label + ".png" # Define the full path for saving the figure. fig_output_path = selected_cell_artifacts.joinpath(output_fig_filename) # Save the figure with high resolution and tight bounding box. plt.savefig( fig_output_path, dpi=600, bbox_inches="tight" ) # Display the plot. plt.show() .. image:: docs_figure/ml_05_severson_dbad_euclidean/euclidean_dist_map_2017-05-12_5_4C-70per_3C_CH17.png :height: 450px :width: 600 px :alt: Anomaly score map euclidean dist for ``2017-05-12_5_4C-70per_3C_CH17`` :align: center **Background Heatmap:** * Dark Red Regions: Indicate a high normalized distance from the centroid. These areas are more likely to contain anomalies. * Blue Regions: Represent low normalized distance (close to the centroid), corresponding to normal cycles. The color gradient (blue → white → red) shows increasing anomaly likelihood based on Euclidean distance. **Dashed Black Contour:** * Represents the decision boundary defined by the Euclidean distance threshold. * Contour Shape: - For Euclidean distance, the contour is circular because the metric treats all directions equally (isotropic). - If a different metric is used (e.g., Mahalanobis distance), the contour would become elliptical, reflecting feature correlations and scaling differences. - For Manhattan distance, the contour would appear more like a diamond shape. * This contour visually separates normal regions (inside) from anomalous regions (outside). **Red Cross:** * Marks the centroid of the feature space. Black Dots: * Represent the majority of normal cycles (inlier data) clustered near the centroid. Yellow Stars with Labels: * Highlight the detected anomalous cycles: 0, 40, 147, 148. Their positions in the 2D feature space (log-transformed ΔQ vs. ΔV) show how far they deviate from typical battery behavior. **Colorbar (Right):** * Quantifies the normalized Euclidean distance from the centroid. 0 = normal (close to centroid) and 1 = highly anomalous (far from centroid). **Annotation Box:** * Summarizes the predicted anomalous cycles: [0, 40, 147, 148]. * Interpretation: - Cycles 0 and 40 exhibit unusually high voltage deviations. - Cycles 147 and 148 show strong deviations in charge capacity. * These anomalies may correspond to battery degradation events, sensor errors, or experimental disturbances. Step-7: Model performance evaluation ------------------------------------ * Map predicted outlier indices to the benchmark dataset: * ``df_selected_cell`` holds cycle-level records and the ground-truth label (e.g., ``outlier`` = 1 for anomalous cycles, else 0). * ``pred_outlier_indices`` is the list of cycle indices flagged by the model. * ``modval.evaluate_pred_outliers(...)`` returns a tidy DataFrame with: * ``cycle_index``: Cell discharge cycle index * ``true_outlier``: ground truth (0/1). * ``pred_outlier``: model prediction (0/1) for the same cycles. * ``modval.generate_confusion_matrix(...)`` aggregates counts of: * ``True Negative (TN)``: predicted 0, truth 0. * ``False Positive (FP)``: predicted 1, truth 0. * ``False Negative (FN)``: predicted 0, truth 1. * ``True Positive (TP)``: predicted 1, truth 1. .. code-block:: python # Map the predicted outlier indices df_eval_outlier = modval.evaluate_pred_outliers( df_benchmark=df_selected_cell, outlier_cycle_index=pred_outlier_indices) # generate confusion matrix axplot = modval.generate_confusion_matrix( y_true=df_eval_outlier["true_outlier"], y_pred=df_eval_outlier["pred_outlier"]) axplot.set_title( "Euclidean Distance", fontsize=16) output_fig_filename = ( "conf_matrix_euclidean_" + selected_cell_label + ".png") fig_output_path = ( selected_cell_artifacts .joinpath(output_fig_filename)) plt.savefig( fig_output_path, dpi=600, bbox_inches="tight") plt.show() .. image:: docs_figure/ml_05_severson_dbad_euclidean/conf_matrix_euclidean_2017-05-12_5_4C-70per_3C_CH17.png :height: 480px :width: 600 px :alt: confusion matrix with euclidean distance ``2017-05-12_5_4C-70per_3C_CH17`` :align: center