osbad.dbad
==========

.. py:module:: osbad.dbad

.. autoapi-nested-parse::

   This module provides tools for Distance-Based Anomaly Detection (DBAD) in
   multivariate datasets. It includes utilities for computing distances using
   various metrics, identifying outliers using robust statistical methods, and
   visualizing the results through histograms and contour maps.

   Key Features:

   - **Distance Computation**: Supports Euclidean, Manhattan, Minkowski, and
     Mahalanobis metrics for calculating distances between data points and a
     centroid.
   - **Outlier Detection**: Implements Median Absolute Deviation (MAD)-based
     outlier detection for non-Gaussian distributions of distances.
   - **Visualization**:

     - Histogram of distances with threshold indication.
     - 2D contour map showing distance scores, decision boundaries, and
       annotated outliers.

   Intended Use:

   This module is designed for exploratory data analysis, benchmarking
   anomaly detection in battery cycle data, and other applications where the
   data is unimodal and identifying deviations from a central tendency is
   critical.

   Example Workflow:

   1. Compute distances using `calculate_distance`.
   2. Detect outliers with `predict_outliers`.
   3. Visualize results using `plot_hist_distance` and
      `plot_distance_score_map`.

   Note:

   - Ensure that input data is properly scaled and shaped before using the
     functions.
   - The 2D contour map of distance scores assumes a two-dimensional feature
     space.


Module Contents
---------------

.. py:data:: distance_metrics
   :type: Dict[str, Callable[Ellipsis, float]]

.. py:function:: calculate_distance(metric_name: Literal['euclidean', 'manhattan', 'minkowski', 'mahalanobis'], features: numpy.ndarray, centroid: numpy.ndarray, p: int = None, inv_cov_matrix: numpy.ndarray = None) -> numpy.ndarray

   Calculates the distance between each point in a feature set and a given
   centroid using the specified distance metric.

   :param metric_name: The name of the distance metric to use.

      - "euclidean": Euclidean distance
      - "manhattan": Manhattan (L1) distance
      - "minkowski": Minkowski distance (requires `p`)
      - "mahalanobis": Mahalanobis distance (requires `inv_cov_matrix`)
   :type metric_name: Literal
   :param features: A 2D array of shape (n_samples, n_features) representing
      the dataset.
   :type features: np.ndarray
   :param centroid: A 1D array of shape (n_features,) representing the
      centroid point of the data distribution.
   :type centroid: np.ndarray
   :param p: The order parameter for Minkowski distance. Required if
      `metric_name` is "minkowski".
   :type p: int, optional
   :param inv_cov_matrix: The inverse of the covariance matrix. Required if
      `metric_name` is "mahalanobis".
   :type inv_cov_matrix: np.ndarray, optional

   :returns: A 1D array of distances between each feature vector and the
      centroid.
   :rtype: np.ndarray

   :raises ValueError: If required parameters (`p` for Minkowski or
      `inv_cov_matrix` for Mahalanobis) are not provided when needed.

.. py:function:: predict_outliers(distance: numpy.ndarray, features: numpy.ndarray, mad_threshold: float = 3) -> tuple

   Detects outliers in a dataset using the Median Absolute Deviation (MAD)
   method.

   This function identifies data points whose distance from a reference
   (e.g., centroid) exceeds a threshold based on the MAD, which is a robust
   measure of statistical dispersion. It is particularly effective for
   detecting anomalies in skewed or non-Gaussian distributions.

   :param distance: A 1D array of distance values for each data point,
      computed by `calculate_distance` with the selected distance metric.
   :type distance: np.ndarray
   :param features: A 2D array of shape (n_samples, n_features) representing
      the original feature vectors corresponding to each distance value.
   :type features: np.ndarray
   :param mad_threshold: The threshold multiplier for MAD-based outlier
      detection. Default is 3. A higher value results in fewer points being
      classified as outliers.
   :type mad_threshold: float, optional

   :returns: A tuple containing:

      - mad_outlier_indices (np.ndarray): Indices of the detected outliers.
      - outlier_distance (np.ndarray): Distance values of the outliers.
      - outlier_features (np.ndarray): Feature vectors corresponding to the
        outliers.
      - mad_max_limit (float): The upper limit used for outlier detection
        based on MAD.
   :rtype: tuple

   .. note::

      This function relies on `_compute_mad_outliers`, which calculates the
      MAD factor if not provided and applies the threshold to identify
      anomalous points.

.. py:function:: plot_hist_distance(distance: numpy.ndarray, threshold: float) -> matplotlib.axes._axes.Axes

   Plots a histogram of distances from a centroid and highlights the outlier
   threshold.

   The histogram displays the distribution of distances, and a vertical
   dashed red line marks the outlier threshold.

   :param distance: A 1D array of distance values for each data point.
   :type distance: np.ndarray
   :param threshold: The distance threshold used to classify outliers (e.g.,
      based on MAD).
   :type threshold: float

   :returns: The matplotlib Axes object containing the histogram plot.
   :rtype: mpl.axes._axes.Axes

.. py:function:: plot_distance_score_map(meshgrid_distance: numpy.ndarray, xx: numpy.ndarray, yy: numpy.ndarray, features: numpy.ndarray, xoutliers: pandas.Series, youtliers: pandas.Series, centroid: numpy.ndarray, threshold: numpy.ndarray, pred_outlier_indices: numpy.ndarray) -> matplotlib.axes._axes.Axes

   Plots a 2D contour map of distance scores over a meshgrid with a decision
   boundary and highlights inliers, outliers, and the centroid.

   The plot shows a filled contour map of distance scores across a 2D mesh
   grid, a dashed decision boundary at the specified threshold, and the
   predicted anomalous cycles. Inliers are shown as black dots, the centroid
   as a red 'x', and outliers as gold stars with red edges.

   :param meshgrid_distance: Flattened array of distance scores computed over
      a meshgrid.
   :type meshgrid_distance: np.ndarray
   :param xx: Meshgrid array for the x-axis.
   :type xx: np.ndarray
   :param yy: Meshgrid array for the y-axis.
   :type yy: np.ndarray
   :param features: 2D array of shape (n_samples, 2) representing the feature
      vectors.
   :type features: np.ndarray
   :param xoutliers: Series containing x-coordinates of detected outliers.
   :type xoutliers: pd.Series
   :param youtliers: Series containing y-coordinates of detected outliers.
   :type youtliers: pd.Series
   :param centroid: 1D array of shape (2,) representing the centroid
      coordinates.
   :type centroid: np.ndarray
   :param threshold: Distance threshold used to identify outliers.
   :type threshold: np.ndarray
   :param pred_outlier_indices: Indices of predicted outliers to be annotated
      on the plot.
   :type pred_outlier_indices: np.ndarray

   :returns: The matplotlib Axes object containing the contour and scatter
      plot.
   :rtype: mpl.axes._axes.Axes

   .. note::

      This function assumes the dataset contains exactly two features, which
      are used for 2D plotting.

   .. code-block::

      # Evaluate the distance score at every point of the 2D mesh grid
      xx, yy, meshgrid = runner.create_2d_mesh_grid()

      grid_euclidean_dist = dbad.calculate_distance(
          metric_name="euclidean",
          features=meshgrid,
          centroid=centroid,
      )

      # Draw the contour map, decision boundary, and annotated outliers
      axplot = dbad.plot_distance_score_map(
          meshgrid_distance=grid_euclidean_dist,
          xx=xx,
          yy=yy,
          features=features,
          xoutliers=df_outliers["feature1"],
          youtliers=df_outliers["feature2"],
          centroid=centroid,
          threshold=euclidean_threshold,
          pred_outlier_indices=pred_outlier_indices,
      )
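
The first two steps of the example workflow (compute distances to a centroid, then flag outliers with a MAD-based upper limit) can be sketched with plain NumPy. This is a minimal illustration and not the osbad implementation: the helper names ``distances_to_centroid`` and ``mad_upper_limit``, the use of the per-feature median as the centroid, and the 1.4826 consistency constant are assumptions made for this sketch, and the library's internal `_compute_mad_outliers` may derive its MAD factor differently.

```python
import numpy as np


def distances_to_centroid(features, centroid, metric="euclidean", p=None,
                          inv_cov_matrix=None):
    """Distance from each row of `features` to `centroid` (hypothetical helper)."""
    diff = np.asarray(features, dtype=float) - np.asarray(centroid, dtype=float)
    if metric == "euclidean":
        return np.sqrt(np.sum(diff ** 2, axis=1))
    if metric == "manhattan":
        return np.sum(np.abs(diff), axis=1)
    if metric == "minkowski":
        if p is None:
            raise ValueError("p is required for the Minkowski metric")
        return np.sum(np.abs(diff) ** p, axis=1) ** (1.0 / p)
    if metric == "mahalanobis":
        if inv_cov_matrix is None:
            raise ValueError("inv_cov_matrix is required for the Mahalanobis metric")
        # sqrt(d^T S^-1 d), evaluated for every row d of `diff`
        return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov_matrix, diff))
    raise ValueError(f"unknown metric: {metric}")


def mad_upper_limit(distance, mad_threshold=3.0, mad_factor=1.4826):
    """Upper limit median + threshold * factor * MAD (assumed form, see lead-in)."""
    median = np.median(distance)
    mad = np.median(np.abs(distance - median))
    return median + mad_threshold * mad_factor * mad


# Synthetic 2D data with three injected anomalies (illustrative only)
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
features[:3] += 8.0
centroid = np.median(features, axis=0)

distance = distances_to_centroid(features, centroid)
limit = mad_upper_limit(distance)
outlier_indices = np.flatnonzero(distance > limit)
outlier_features = features[outlier_indices]
```

Raising ``mad_threshold`` raises the limit and therefore flags fewer points, which matches the behaviour documented for `predict_outliers`.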