Example(5): Distance Based Anomaly Detection using Euclidean Distance¶

This example illustrates how to leverage the capabilities of osbad for distance-based anomaly detection using osbad.dbad with multivariate datasets. Although distance-based metrics are commonly integrated into various ML-based anomaly detection frameworks, this example shows a simpler adaptation using centroid-based distance calculation. In scenarios where low latency, model interpretability, or limited computational resources are critical, such as in process-control hardware or embedded monitoring systems, a straightforward centroid-based method offers a more practical and efficient alternative to computationally intensive ML algorithms.

The following example of running a hyperparameter tuning and anomaly detection pipeline is also provided as a notebook in distance/distance_01_euclidan.ipynb.

Step-1: Load libraries¶

Import the libraries into your local development environment, including the osbad library for benchmarking anomaly detection.

Path is used for robust, cross-platform file paths.
duckdb is the embedded analytical database engine storing the dataset.
bconf: project config utilities (e.g., where to write artifacts).
BenchDB: a thin layer around DuckDB that provides convenience loaders.
ModelRunner, modval: modeling and model validation helpers for the benchmarking study in this project.
dbad: utilities for computing distances, identifying outliers and visualizing the results.

# Standard libraries
from pathlib import Path

# Third-party libraries
import duckdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Custom osbad library for anomaly detection
import osbad.config as bconf
import osbad.modval as modval
from osbad.database import BenchDB
from osbad.model import ModelRunner

# importing distance based anomaly detection utilities
from osbad import dbad

Step-2: Load Benchmarking Dataset¶

Pick a specific cell based on the cell_index, which identifies the experimental data corresponding to one unique cell.
Create an artifacts folder for that cell, where you can save figures, tables, or model outputs related to this cell.
Initialize BenchDB for the selected cell and path to the DuckDB file: train_dataset_severson.db.
Load all data related to selected_cell_label from the training partition.

# Get the cell-ID from cell_inventory
selected_cell_label = "2017-05-12_5_4C-70per_3C_CH17"

# Create a subfolder to store fig output
# corresponding to each cell-index
selected_cell_artifacts_dir = bconf.artifacts_output_dir(
    selected_cell_label)

# Path to the DuckDB file:
# "train_dataset_severson.db"
db_filepath = (
    Path.cwd()
    .parent
    .joinpath("database","train_dataset_severson.db"))

# Instantiate the BenchDB class to
# load only the dataset based on the selected cell
benchdb = BenchDB(
    db_filepath,
    selected_cell_label)

# load the benchmarking dataset
df_selected_cell = benchdb.load_benchmark_dataset(
    dataset_type="train")

Step-3: Load the Features DB¶

Load the features (e.g., log_max_diff_dQ, log_max_diff_dV) based on selected_cell_label in BenchDB.

# Define the filepath to ``train_features_severson.db``
# DuckDB instance.
db_features_filepath = (
    Path.cwd()
    .parent
    .joinpath("database","train_features_severson.db"))

# Load only the training features dataset
df_features_per_cell = benchdb.load_features_db(
    db_features_filepath,
    dataset_type="train")

Step-4: Select features and calculate distribution centroid¶

Build a ModelRunner with the cell label, the feature DataFrame, and the selected features.
Call runner.create_model_x_input() to get the X matrix (shape: n_cycles × n_features).
Calculate the centroid of the feature distribution as the per-feature median. The centroid should have shape (n_features,), with one value per selected feature.

# The two features implemented in this example
selected_feature_cols = (
    "log_max_diff_dQ",
    "log_max_diff_dV")

# Create a ModelRunner instance based on selected_cell_label,
# df_features_per_cell and
# selected_feature_cols
runner = ModelRunner(
    cell_label=selected_cell_label,
    df_input_features=df_features_per_cell,
    selected_feature_cols=selected_feature_cols)

# get features and calculate centroid
features = runner.create_model_x_input()
centroid = np.median(features, axis=0)
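As a minimal sketch of why a median-based centroid is a reasonable choice, the snippet below compares the column-wise median and mean on a small synthetic feature matrix. The values and the toy_ variable names are invented for illustration and are not taken from the Severson dataset.

```python
import numpy as np

# Synthetic feature matrix: 5 cycles x 2 features (illustrative values only).
# The last row mimics a single anomalous cycle.
toy_features = np.array([
    [-4.0, -2.0],
    [-4.1, -2.1],
    [-3.9, -1.9],
    [-4.0, -2.0],
    [-1.0,  1.0],
])

# Column-wise median: one value per feature, shape (n_features,)
toy_centroid_median = np.median(toy_features, axis=0)

# Column-wise mean, shown only for comparison
toy_centroid_mean = toy_features.mean(axis=0)

print("median centroid:", toy_centroid_median)
print("mean centroid:  ", toy_centroid_mean)
```

In this toy case the median stays at the center of the four typical cycles, while the mean is pulled toward the anomalous row.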

Step-5: Calculate distance from centroid and detect outliers¶

Select one of the available distance metrics for centroid-based anomaly detection: euclidean, manhattan, minkowski, or mahalanobis.
The dbad module provides the calculate_distance method, which measures the distance between the centroid and each data point in the distribution and returns an array of shape (n_cycles,). It also returns the maximum distance (the point farthest from the centroid), which is used for distance normalization.
The predict_outliers method detects outliers using a threshold based on the Median Absolute Deviation (MAD) method.

# select which distance-metric to use
metric_name = "euclidean"

# calculate euclidean distance using dbad module
euclidean_dist, max_euclidean_dist = dbad.calculate_distance(
    metric_name=metric_name,
    features=features,
    centroid=centroid,
    norm=True)

# predict outlier cycles based on distance. Also returns threshold distance
# used, calculated based on MAD
(pred_outlier_indices,
 pred_outlier_distance,
 pred_outlier_features,
 euclidean_threshold) = dbad.predict_outliers(
    distance=euclidean_dist,
    features=features,
    mad_threshold=3)

print("\nPredicted Anomalous Cycles:", pred_outlier_indices)
print("Euclidean distance for Outlier Cycles:", pred_outlier_distance)
print("Euclidean Threshold:", euclidean_threshold)
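The idea behind this step can be sketched in plain NumPy. This is not the osbad implementation: the helper names euclidean_to_centroid and mad_cutoff are hypothetical, the data is synthetic, and the cutoff rule (median of the distances plus k times their MAD) is an assumption, since the exact formula used by predict_outliers is not shown here.

```python
import numpy as np


def euclidean_to_centroid(points, center, normalize=True):
    """Return each row's Euclidean distance from `center` and the raw maximum."""
    dist = np.linalg.norm(points - center, axis=1)
    max_dist = dist.max()
    if normalize and max_dist > 0:
        dist = dist / max_dist  # the farthest point maps to 1.0
    return dist, max_dist


def mad_cutoff(dist, k=3.0):
    """Assumed rule: median + k * MAD (not confirmed as the osbad formula)."""
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    return med + k * mad


# Synthetic data: five cycles near (1, 1) and one far away (illustrative only)
toy_features = np.array([
    [1.0, 1.0],
    [1.2, 0.9],
    [0.8, 1.1],
    [1.1, 1.3],
    [0.9, 0.7],
    [4.0, 5.0],
])
toy_centroid = np.median(toy_features, axis=0)

toy_dist, toy_max_dist = euclidean_to_centroid(toy_features, toy_centroid)
toy_threshold = mad_cutoff(toy_dist, k=3.0)

# Rows whose normalized distance exceeds the cutoff are flagged
toy_outlier_idx = np.flatnonzero(toy_dist > toy_threshold)

print("normalized distances:", np.round(toy_dist, 3))
print("threshold:", round(float(toy_threshold), 3))
print("flagged rows:", toy_outlier_idx.tolist())
```

Only the last row is flagged in this toy example. Because both the median and the MAD are robust statistics, the single distant point barely shifts the cutoff that is used to detect it.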

Step-6: Plot distance score mapping and contour¶

pred_outlier_indices is a list of cycle indices predicted as anomalous by the centroid-based anomaly detection method.
A new column, outlier_distance, is added to store the outlier distance score computed by the model, making it easy to track flagged cycles.
runner.create_2d_mesh_grid() generates a 2D mesh grid, which is a structured set of points covering the feature space. The mesh grid is used to calculate distances and visualize the score map. The returned values include:
- xx and yy: Coordinate matrices for plotting.
- meshgrid: Combined grid points for distance computation.
calculate_distance is used again to compute the Euclidean distance between each point in the mesh grid and the centroid of the feature space. max_distance is used for normalization, and norm can be set to True to ensure distances are scaled between 0 and 1.
plot_distance_score_map creates a visual representation of the distance score map.
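To make the mesh-grid idea concrete, here is a minimal NumPy sketch that builds a grid over a 2D feature space and evaluates the distance to the centroid at every grid point. It does not reproduce runner.create_2d_mesh_grid(); the padding, the resolution, and all toy_ names are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic 2D features (illustrative only)
toy_features = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]])
toy_centroid = np.median(toy_features, axis=0)

pad = 0.5      # assumed margin around the data range
n_steps = 50   # assumed grid resolution per axis

x_axis = np.linspace(
    toy_features[:, 0].min() - pad, toy_features[:, 0].max() + pad, n_steps)
y_axis = np.linspace(
    toy_features[:, 1].min() - pad, toy_features[:, 1].max() + pad, n_steps)

# Coordinate matrices for plotting, each of shape (n_steps, n_steps)
toy_xx, toy_yy = np.meshgrid(x_axis, y_axis)

# Combined grid points for distance computation, shape (n_steps**2, 2)
toy_grid_points = np.c_[toy_xx.ravel(), toy_yy.ravel()]

# Distance of every grid point from the centroid, reshaped back to the grid
toy_grid_dist = np.linalg.norm(toy_grid_points - toy_centroid, axis=1)
toy_grid_dist_2d = toy_grid_dist.reshape(toy_xx.shape)

print(toy_xx.shape, toy_grid_points.shape, toy_grid_dist_2d.shape)
```

An array shaped like toy_grid_dist_2d can be passed to a filled-contour plot such as Matplotlib's contourf to draw a score map. The osbad code for this step follows and uses runner.create_2d_mesh_grid() instead of a hand-built grid.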

# Select rows from df_features_per_cell where 'cycle_index' matches the
# predicted outlier indices and create a copy to avoid modifying the original
# DataFrame.
df_outliers_pred = df_features_per_cell[
    df_features_per_cell["cycle_index"].isin(pred_outlier_indices)
].copy()

# Store the predicted outlier distances in df.
df_outliers_pred["outlier_distance"] = pred_outlier_distance

# Create a 2D mesh grid for plotting the distance score map.
# This returns xx, yy coordinates and the combined meshgrid points.
xx, yy, meshgrid = runner.create_2d_mesh_grid()

# Calculate the Euclidean distance for each point in the mesh grid relative
# to the centroid. The distance is normalized using max_euclidean_dist.
grid_euclidean_dist = dbad.calculate_distance(
    metric_name=metric_name,
    features=meshgrid,
    centroid=centroid,
    max_distance=max_euclidean_dist,
    norm=True
)

# Plot the distance score map using the calculated distances.
axplot = dbad.plot_distance_score_map(
    meshgrid_distance=grid_euclidean_dist,
    xx=xx,
    yy=yy,
    features=features,
    xoutliers=df_outliers_pred["log_max_diff_dQ"],
    youtliers=df_outliers_pred["log_max_diff_dV"],
    centroid=centroid,
    threshold=euclidean_threshold,
    pred_outlier_indices=pred_outlier_indices,
    norm=True
)

# Set the title of the plot.
axplot.set_title('Euclidean Distance', fontsize=12)

# Generate a filename for the output figure
filename = f"{metric_name}_dist_map"
output_fig_filename = filename + "_" + selected_cell_label + ".png"

# Define the full path for saving the figure.
fig_output_path = selected_cell_artifacts_dir.joinpath(output_fig_filename)

# Save the figure with high resolution and tight bounding box.
plt.savefig(
    fig_output_path,
    dpi=600,
    bbox_inches="tight"
)

# Display the plot.
plt.show()

Anomaly score map using Euclidean distance for ``2017-05-12_5_4C-70per_3C_CH17``

Background Heatmap:

Dark Red Regions: Indicate a high normalized distance from the centroid. These areas are more likely to contain anomalies.
Blue Regions: Represent low normalized distance (close to the centroid), corresponding to normal cycles.
The color gradient (blue → white → red) shows increasing anomaly likelihood based on Euclidean distance.

Dashed Black Contour:

Represents the decision boundary defined by the Euclidean distance threshold.
Contour Shape:
- For Euclidean distance, the contour is circular because the metric treats all directions equally (isotropic).
- If a different metric is used (e.g., Mahalanobis distance), the contour would become elliptical, reflecting feature correlations and scaling differences.
- For Manhattan distance, the contour would appear more like a diamond shape.
This contour visually separates normal regions (inside) from anomalous regions (outside).
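The contour shapes described above follow directly from how each metric measures distance. The sketch below checks this numerically with hand-written helper functions; these helpers and the covariance matrix toy_cov are assumptions for illustration and are not the dbad implementations.

```python
import numpy as np


def euclidean(point, center):
    return float(np.sqrt(np.sum((point - center) ** 2)))


def manhattan(point, center):
    return float(np.sum(np.abs(point - center)))


def mahalanobis(point, center, cov):
    delta = point - center
    return float(np.sqrt(delta @ np.linalg.inv(cov) @ delta))


center = np.zeros(2)
on_axis = np.array([1.0, 0.0])
on_diagonal = np.array([1.0, 1.0]) / np.sqrt(2.0)

# Euclidean: both points lie at the same distance, so the contour is a circle.
print("euclidean:", euclidean(on_axis, center), euclidean(on_diagonal, center))

# Manhattan: the diagonal point is farther, so the contour of equal distance
# pulls inward along the diagonals and forms a diamond.
print("manhattan:", manhattan(on_axis, center), manhattan(on_diagonal, center))

# Mahalanobis with an assumed covariance (variance 4 along x, 1 along y):
# (2, 0) and (0, 1) are equally far, so the contour is an ellipse
# stretched along the high-variance axis.
toy_cov = np.array([[4.0, 0.0], [0.0, 1.0]])
wide_point = np.array([2.0, 0.0])
tall_point = np.array([0.0, 1.0])
print("mahalanobis:",
      mahalanobis(wide_point, center, toy_cov),
      mahalanobis(tall_point, center, toy_cov))
```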

Red Cross:

Marks the centroid of the feature space.

Black Dots:

Represent the majority of normal cycles (inlier data) clustered near the centroid.

Yellow Stars with Labels:

Highlight the detected anomalous cycles: 0, 40, 147, 148. Their positions in the 2D feature space (log-transformed ΔQ vs. ΔV) show how far they deviate from typical battery behavior.

Colorbar (Right):

Quantifies the normalized Euclidean distance from the centroid. 0 = normal (close to centroid) and 1 = highly anomalous (far from centroid).

Annotation Box:

Summarizes the predicted anomalous cycles: [0, 40, 147, 148].
Interpretation:
- Cycles 0 and 40 exhibit unusually high voltage deviations.
- Cycles 147 and 148 show strong deviations in charge capacity.
These anomalies may correspond to battery degradation events, sensor errors, or experimental disturbances.

Step-7: Model performance evaluation¶

Map predicted outlier indices to the benchmark dataset:
- df_selected_cell holds cycle-level records and the ground-truth label (e.g., outlier = 1 for anomalous cycles, else 0).
- pred_outlier_indices is the list of cycle indices flagged by the model.
modval.evaluate_pred_outliers(...) returns a tidy DataFrame with:
- cycle_index: Cell discharge cycle index
- true_outlier: ground truth (0/1).
- pred_outlier: model prediction (0/1) for the same cycles.
modval.generate_confusion_matrix(...) aggregates counts of:
- True Negative (TN): predicted 0, truth 0.
- False Positive (FP): predicted 1, truth 0.
- False Negative (FN): predicted 0, truth 1.
- True Positive (TP): predicted 1, truth 1.
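The four counts can be tallied by hand on a small example. The sketch below is independent of modval: the labels and flagged indices are synthetic, and the variable names are chosen only for illustration.

```python
# Synthetic ground truth for six cycles (1 = anomalous), illustrative only
toy_cycle_index = [0, 1, 2, 3, 4, 5]
toy_true = [0, 0, 1, 0, 1, 0]

# Cycle indices flagged by a hypothetical detector
toy_flagged = {1, 2}

# Map the flagged indices onto a 0/1 prediction for every cycle
toy_pred = [1 if cycle in toy_flagged else 0 for cycle in toy_cycle_index]

# Tally the four confusion-matrix cells
counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
for truth, pred in zip(toy_true, toy_pred):
    if truth == 0 and pred == 0:
        counts["TN"] += 1
    elif truth == 0 and pred == 1:
        counts["FP"] += 1
    elif truth == 1 and pred == 0:
        counts["FN"] += 1
    else:
        counts["TP"] += 1

# Standard summary metrics derived from the counts
precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])

print(counts)
print("precision:", precision, "recall:", recall)
```

Here one of the two flagged cycles is a true anomaly and one of the two true anomalies is caught, so precision and recall are both 0.5. The osbad calls for this step follow.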

# Map the predicted outlier indices
df_eval_outlier = modval.evaluate_pred_outliers(
  df_benchmark=df_selected_cell,
  outlier_cycle_index=pred_outlier_indices)

# generate confusion matrix
axplot = modval.generate_confusion_matrix(
  y_true=df_eval_outlier["true_outlier"],
  y_pred=df_eval_outlier["pred_outlier"])

axplot.set_title(
    "Euclidean Distance",
    fontsize=16)

output_fig_filename = (
    "conf_matrix_euclidean_"
    + selected_cell_label
    + ".png")

fig_output_path = (
    selected_cell_artifacts_dir
    .joinpath(output_fig_filename))

plt.savefig(
    fig_output_path,
    dpi=600,
    bbox_inches="tight")

plt.show()

Confusion matrix using Euclidean distance for ``2017-05-12_5_4C-70per_3C_CH17``