osbad.dbad

This module provides tools for Distance-Based Anomaly Detection (DBAD) in multivariate datasets. It includes utilities for computing distances using various metrics, identifying outliers using robust statistical methods, and visualizing the results through histograms and contour maps.

Key Features:
  • Distance Computation: Supports Euclidean, Manhattan, Minkowski, and Mahalanobis metrics for calculating distances between data points and a centroid. Distances can optionally be normalized.

  • Outlier Detection: Implements Median Absolute Deviation (MAD)-based outlier detection, which suits non-Gaussian distributions of distances.

  • Visualization:
    • Histogram of distances with threshold indication.

    • 2D contour map showing distance scores, decision boundaries, and annotated outliers.

Intended Use:

This module is designed for exploratory data analysis, benchmarking anomaly detection in battery cycle data, and other applications where data is unimodal and identifying deviations from a central tendency is critical.

Example Workflow:
  1. Compute distances using calculate_distance.

  2. Detect outliers with predict_outliers.

  3. Visualize results using plot_hist_distance and plot_distance_score_map.

Note

  • Ensure that input data is properly scaled and shaped before using the functions.

  • The 2D contour map of distance scores assumes a two-dimensional feature space.

Module Contents

osbad.dbad.distance_metrics: Dict[str, Callable[..., float]]
osbad.dbad.calculate_distance(metric_name: Literal['euclidean', 'manhattan', 'minkowski', 'mahalanobis'], features: numpy.ndarray, centroid: numpy.ndarray, p: int = None, inv_cov_matrix: numpy.ndarray = None, max_distance: float = None, norm: bool = True) -> numpy.ndarray

Calculates the distance between each point in a feature set and a given centroid using the specified distance metric.

Parameters:
  • metric_name (Literal) – The name of the distance metric to use. Options include:

    • euclidean: Euclidean (L2) distance

    • manhattan: Manhattan (L1) distance

    • minkowski: Minkowski distance (requires p)

    • mahalanobis: Mahalanobis distance (requires inv_cov_matrix)

  • features (np.ndarray) – A 2D array of shape (n_samples, n_features) representing the dataset.

  • centroid (np.ndarray) – A 1D array of shape (n_features,) representing the centroid point of the data distribution.

  • p (int, optional) – The order parameter for Minkowski distance. Required if metric_name is minkowski.

  • inv_cov_matrix (np.ndarray, optional) – The inverse of the covariance matrix. Required if metric_name is mahalanobis.

  • max_distance (float, optional) – If provided, distances are normalized by dividing by max_distance. Otherwise, the maximum of the calculated distances is used as the divisor.

  • norm (bool) – If True, the returned distances are normalized. Default is True.

Returns:

A 1D array of distances between each feature vector and the centroid.

Return type:

np.ndarray

Raises:
  • ValueError – If a required parameter (p for Minkowski or inv_cov_matrix for Mahalanobis) is not provided.
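
To make the four metrics and the normalization step concrete, here is a minimal NumPy sketch of point-to-centroid distances. It is an illustrative re-implementation, not the osbad source: the helper name distance_to_centroid is hypothetical, and applying the max_distance scaling only when norm is True is an assumption based on the parameter descriptions above.

```python
import numpy as np


def distance_to_centroid(features, centroid, metric="euclidean", p=None,
                         inv_cov_matrix=None, max_distance=None, norm=True):
    """Hypothetical helper mirroring the documented parameters."""
    # Row-wise offset of every sample from the centroid.
    diff = np.asarray(features, dtype=float) - np.asarray(centroid, dtype=float)

    if metric == "euclidean":
        dist = np.sqrt(np.sum(diff ** 2, axis=1))
    elif metric == "manhattan":
        dist = np.sum(np.abs(diff), axis=1)
    elif metric == "minkowski":
        if p is None:
            raise ValueError("p is required for the Minkowski metric")
        dist = np.sum(np.abs(diff) ** p, axis=1) ** (1.0 / p)
    elif metric == "mahalanobis":
        if inv_cov_matrix is None:
            raise ValueError("inv_cov_matrix is required for the Mahalanobis metric")
        # sqrt(d^T S^-1 d), evaluated for each row d of diff.
        dist = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov_matrix, diff))
    else:
        raise ValueError(f"unknown metric: {metric}")

    if norm:
        # Assumption: divide by max_distance if given, else by the largest distance.
        scale = dist.max() if max_distance is None else max_distance
        dist = dist / scale
    return dist


features = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
centroid = np.array([0.0, 0.0])

raw = distance_to_centroid(features, centroid, norm=False)
scaled = distance_to_centroid(features, centroid)
```

For the Mahalanobis case, the inverse covariance matrix is typically obtained with np.linalg.inv(np.cov(features, rowvar=False)); with an identity matrix the result reduces to the Euclidean distance.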

osbad.dbad.predict_outliers(distance: numpy.ndarray, features: numpy.ndarray, mad_threshold: float = 3) -> tuple

Detects outliers in a dataset using the Median Absolute Deviation (MAD) method.

This function identifies data points whose distance from a reference (e.g., centroid) exceeds a threshold based on the MAD, which is a robust measure of statistical dispersion. It is particularly effective for detecting anomalies in skewed or non-Gaussian distributions.

Parameters:
  • distance (np.ndarray) – A 1D array of distance values, one per data point, computed with calculate_distance using the selected distance metric.

  • features (np.ndarray) – A 2D array of shape (n_samples, n_features) representing the original feature vectors corresponding to each distance value.

  • mad_threshold (float) – The threshold multiplier for MAD-based outlier detection. Default is 3. A higher value results in fewer points being classified as outliers.

Returns:

A tuple containing:
  • mad_outlier_indices (np.ndarray): Indices of the detected outliers.

  • outlier_distance (np.ndarray): Distance values of the outliers.

  • outlier_features (np.ndarray): Feature vectors corresponding to the outliers.

  • mad_max_limit (float): The upper limit used for outlier detection based on MAD.

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray, float]

Note

This function relies on _compute_mad_outliers from the stats module, which calculates the MAD factor if it is not provided and applies the threshold to identify anomalous points.
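
The MAD rule can be sketched in a few lines of NumPy. This is only an illustration of the general technique: the exact formula inside _compute_mad_outliers is not shown here, so the one-sided limit median + mad_threshold * mad_factor * MAD and the default mad_factor of 1.4826 (the usual consistency constant for normally distributed data) are assumptions, and mad_upper_limit is a hypothetical helper name.

```python
import numpy as np


def mad_upper_limit(distance, mad_threshold=3.0, mad_factor=1.4826):
    """Assumed form: median + mad_threshold * mad_factor * MAD."""
    distance = np.asarray(distance, dtype=float)
    median = np.median(distance)
    # MAD: median of the absolute deviations from the median.
    mad = np.median(np.abs(distance - median))
    return median + mad_threshold * mad_factor * mad


# Toy data: six ordinary distances and one clear anomaly.
distance = np.array([1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 5.0])
features = np.arange(14, dtype=float).reshape(7, 2)

limit = mad_upper_limit(distance)

# Points whose distance exceeds the limit are flagged as outliers.
outlier_indices = np.flatnonzero(distance > limit)
outlier_distance = distance[outlier_indices]
outlier_features = features[outlier_indices]
```

Because the median and MAD are barely affected by the single large value, the limit stays close to the bulk of the data, which is why this rule is more robust than a mean and standard deviation cutoff.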

osbad.dbad.plot_hist_distance(distance: numpy.ndarray, threshold: float) -> matplotlib.axes._axes.Axes

Plots a histogram of distances from a centroid and highlights the outlier threshold. The histogram displays the distribution of distances and a vertical dashed red line marks the outlier threshold.

Parameters:
  • distance (np.ndarray) – A 1D array of distance values for each data point.

  • threshold (float) – The distance threshold used to classify outliers (e.g., based on MAD).

Returns:

The matplotlib Axes object containing the histogram plot.

Return type:

mpl.axes._axes.Axes
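
A plot of this kind can be approximated directly with matplotlib. The sketch below is not the osbad implementation; the synthetic data, bin count, colors, labels, and threshold value are all illustrative assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, non-negative "distances" for illustration only.
rng = np.random.default_rng(0)
distance = np.abs(rng.normal(size=200))
threshold = 2.5  # hypothetical outlier limit, e.g. from a MAD rule

fig, ax = plt.subplots()
ax.hist(distance, bins=30, color="steelblue", edgecolor="white")

# Vertical dashed red line marking the outlier threshold.
ax.axvline(threshold, color="red", linestyle="--", label="outlier threshold")

ax.set_xlabel("Distance from centroid")
ax.set_ylabel("Count")
ax.legend()
```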

osbad.dbad.plot_distance_score_map(meshgrid_distance: numpy.ndarray, xx: numpy.ndarray, yy: numpy.ndarray, features: numpy.ndarray, xoutliers: pandas.Series, youtliers: pandas.Series, centroid: numpy.ndarray, threshold: numpy.ndarray, pred_outlier_indices: numpy.ndarray, norm: bool = True) -> matplotlib.axes._axes.Axes

Plots a 2D contour map of distance scores over a meshgrid with a decision boundary and highlights inliers, outliers, and the centroid.

This function visualizes the distance landscape using a filled contour plot across a 2D mesh grid. It creates a contour plot showing anomaly distances, a dashed decision boundary at the specified threshold, and highlights the predicted anomalous cycles.

Parameters:
  • meshgrid_distance (np.ndarray) – Flattened array of distance scores computed over a meshgrid.

  • xx (np.ndarray) – Meshgrid array for the x-axis.

  • yy (np.ndarray) – Meshgrid array for the y-axis.

  • features (np.ndarray) – 2D array of shape (n_samples, 2) representing the feature vectors.

  • xoutliers (pd.Series) – Series containing x-coordinates of detected outliers.

  • youtliers (pd.Series) – Series containing y-coordinates of detected outliers.

  • centroid (np.ndarray) – 1D array of shape (2,) representing the centroid coordinates.

  • threshold (np.ndarray) – Distance threshold used to identify outliers.

  • pred_outlier_indices (np.ndarray) – Indices of predicted outliers to be annotated on the plot.

  • norm (bool) – Set to True if normalized distances are used for the contour plot; otherwise set to False.

Returns:

The matplotlib Axes object containing the contour and scatter plot.

Return type:

mpl.axes._axes.Axes

Note

This function assumes the dataset contains exactly two features, which are used for 2D plotting.

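# Assumes runner, features, centroid, df_outliers, euclidean_threshold and
# pred_outlier_indices were defined in earlier steps of the workflow.
from osbad import dbad

# Build the 2D mesh grid that spans the feature space.
xx, yy, meshgrid = runner.create_2d_mesh_grid()

# Distance of every grid point from the centroid (normalized by default).
grid_euclidean_dist = dbad.calculate_distance(
    metric_name="euclidean",
    features=meshgrid,
    centroid=centroid,
)

# Contour map with the decision boundary and annotated outliers.
axplot = dbad.plot_distance_score_map(
    meshgrid_distance=grid_euclidean_dist,
    xx=xx,
    yy=yy,
    features=features,
    xoutliers=df_outliers["feature1"],
    youtliers=df_outliers["feature2"],
    centroid=centroid,
    threshold=euclidean_threshold,
    pred_outlier_indices=pred_outlier_indices,
    norm=True,
)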
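
The example above obtains xx, yy, and meshgrid from runner.create_2d_mesh_grid(), whose internals are not documented here. As a rough idea of what such inputs look like, the following NumPy sketch builds a mesh grid by hand; the padding, grid resolution, toy features, and the use of the median as centroid are all assumptions.

```python
import numpy as np

# Toy 2D features; the column-wise median serves as an assumed centroid.
features = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [2.0, 2.0]])
centroid = np.median(features, axis=0)

# Grid axes covering the data range plus a small margin.
pad = 0.5
x = np.linspace(features[:, 0].min() - pad, features[:, 0].max() + pad, 50)
y = np.linspace(features[:, 1].min() - pad, features[:, 1].max() + pad, 40)
xx, yy = np.meshgrid(x, y)

# Flatten the grid into an (n_points, 2) array of coordinates.
meshgrid = np.column_stack([xx.ravel(), yy.ravel()])

# Normalized Euclidean distance of every grid point from the centroid.
grid_dist = np.linalg.norm(meshgrid - centroid, axis=1)
grid_dist = grid_dist / grid_dist.max()

# Reshape back to the grid layout, as contour plotting expects.
zz = grid_dist.reshape(xx.shape)
```

The flattened grid_dist corresponds to the kind of array passed as meshgrid_distance, while zz has the same shape as xx and yy and could be handed to matplotlib's contourf.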