Example (4): Autoencoder with Hyperparameter Tuning (Tohoku Dataset)
=======================================================================

Prerequisites
---------------

* Python 3.12 (recommended)
* Files on disk:

  * ``database/tohoku_benchmark_dataset.db`` (benchmark labels per cycle)

* (Optional) LaTeX installation if you want Matplotlib to render text with
  LaTeX:

  * A TeX distribution (e.g., TeX Live/MacTeX/MiKTeX), dvipng, and fonts like
    cm-super.
  * Don't have LaTeX installed? Either install it, or set
    ``rcParams["text.usetex"] = False``.

Before running the example in the
``machine_learning/hp_tuning_with_transfer_learning`` section, please
evaluate whether the global directory path specified in
``src/osbad/config.py`` needs to be updated:

.. code-block:: python

    # Modify this global directory path if needed
    PIPELINE_OUTPUT_DIR = Path.cwd().joinpath("artifacts_output_dir")

The following example of running an Autoencoder model with hyperparameter
tuning is also provided as a notebook in
``machine_learning/hp_tuning_with_transfer_learning/tohoku_data_source/01_train_dataset/ml_06_autoencoder_hyperparam_tohoku.ipynb``.

Step-1: Load libraries
---------------------------

Import the libraries into your local development environment, including the
``osbad`` library for benchmarking anomaly detection.

* ``Path`` is used for robust, cross-platform file paths.
* ``pprint`` pretty-prints data structures for readable diagnostics.
* ``duckdb`` is the embedded analytical database engine storing the dataset.
* ``optuna`` is a hyperparameter optimization framework used to search for
  the best model configuration.
* ``EmpiricalCovariance`` from scikit-learn is used to compute the
  Mahalanobis distance for feature engineering.
* ``bconf``: project config utilities (e.g., where to write artifacts).
* ``hp``: hyperparameter tuning utilities including the objective function,
  aggregation of best trials, and Pareto front visualization.
* ``BenchDB``: a thin layer around DuckDB that provides convenience loaders.
* ``ModelRunner``, ``modval``, ``bviz``: modeling, model validation, and
  visualization helpers for the benchmarking study.

.. code-block:: python

    # Standard library
    from pathlib import Path
    import pprint

    # Third-party libraries
    import duckdb
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    import optuna
    from sklearn.covariance import EmpiricalCovariance

    # Custom osbad library for anomaly detection
    import osbad.config as bconf
    import osbad.hyperparam as hp
    import osbad.modval as modval
    import osbad.viz as bviz
    from osbad.database import BenchDB
    from osbad.model import ModelRunner

Step-2: Load Benchmarking Dataset
------------------------------------

* Define the path to the DuckDB database file
  (``tohoku_benchmark_dataset.db``) using the ``DB_DIR`` from ``bconf``.
* Create a DuckDB connection (read-only) and load the full Tohoku dataset
  from the ``df_tohoku_dataset`` table.
* Drop the additional index column and retrieve the unique cell indices
  available in the dataset.

.. code-block:: python

    # Path to database directory
    DB_DIR = bconf.DB_DIR
    db_filepath = DB_DIR.joinpath("tohoku_benchmark_dataset.db")

    # Create a DuckDB connection
    con = duckdb.connect(
        db_filepath,
        read_only=True)

    # Load all training dataset from duckdb
    df_duckdb = con.execute(
        "SELECT * FROM df_tohoku_dataset").fetchdf()

    # Drop the additional index column
    df_duckdb = df_duckdb.drop(
        columns="__index_level_0__",
        errors="ignore")

    unique_cell_index_train = df_duckdb["cell_index"].unique()
    print(unique_cell_index_train)

Step-3: Filter Dataset for a Selected Cell
---------------------------------------------

* There are 10 cells in the Tohoku dataset, and in this work, ``cell-1``,
  ``cell-2``, ``cell-5`` and ``cell-6`` are used for training.
* In this example, the model training is illustrated for one cell:
  ``cell_num_1``.
* Pick a specific cell based on ``selected_cell_label``, which identifies the
  experimental data corresponding to one unique cell.
* Create an artifacts folder for that cell, where you can save figures,
  tables, or model outputs related to this cell.
* Filter the loaded dataset for the selected cell only.
* Initialize ``BenchDB`` for the selected cell.

.. code-block:: python

    # Get the cell-ID from cell_inventory
    selected_cell_label = "cell_num_1"
    # Take the part after the last underscore so that labels
    # with more than one digit are also handled correctly
    cell_num = selected_cell_label.split("_")[-1]

    # Create a subfolder to store fig output
    # corresponding to each cell-index
    selected_cell_artifacts_dir = bconf.artifacts_output_dir(
        selected_cell_label)

    # Filter dataset for specific selected cell only
    df_selected_cell = df_duckdb[
        df_duckdb["cell_index"] == selected_cell_label]

    # Initialize the BenchDB class for the selected cell
    benchdb = BenchDB(
        db_filepath,
        selected_cell_label)

Step-4: Drop True Labels
-----------------------------

* Drop the true outlier labels (denoted as ``outlier``) from the dataframe,
  keeping only the relevant columns for machine learning.

.. code-block:: python

    # Drop the outlier labels
    df_selected_cell_without_labels = df_selected_cell.drop(
        "outlier", axis=1).reset_index(drop=True)

    df_selected_cell_without_labels

Step-5: Plot Cycle Capacity Fade without Labels
---------------------------------------------------

* Calculate the maximum discharge capacity per cycle.
* Visualize the capacity fade curve for the selected cell without displaying
  the true outlier labels. This represents what the model sees before
  training.

.. code-block:: python

    # Calculate maximum capacity per cycle
    max_cap_per_cycle = (
        df_selected_cell_without_labels
        .groupby(["cycle_index"])["discharge_capacity"].max())
    max_cap_per_cycle.name = "max_discharge_capacity"

    unique_cycle_index = (
        df_selected_cell_without_labels["cycle_index"].unique())

.. code-block:: python

    axplot = bviz.plot_cycle_data(
        xseries=unique_cycle_index,
        yseries=max_cap_per_cycle,
        cycle_index_series=unique_cycle_index)

    axplot.set_xlabel(
        r"Cycle index",
        fontsize=14)
    axplot.set_ylabel(
        r"Maximum discharge capacity [mAh/g]",
        fontsize=14)

    axplot.set_title(
        f"Cell-{cell_num}",
        fontsize=16)

    output_fig_filename = (
        "cycling_data_without_labels_"
        + selected_cell_label
        + ".png")

    fig_output_path = (
        selected_cell_artifacts_dir
        .joinpath(output_fig_filename))

    plt.savefig(
        fig_output_path,
        dpi=600,
        bbox_inches="tight")

    plt.show()

.. image:: docs_figure/ml_04_tohoku_autoencoder_hyperparam_tuned/cycling_data_without_labels_cell_num_1.png
   :height: 396px
   :width: 600px
   :alt: Cycling data without labels for ``cell_num_1``
   :align: center

Step-6: Feature Transformation
----------------------------------

In the Tohoku dataset, we want to track the sudden and unintended capacity
drop over the cycle life. Therefore, the features used are:

* **Cycle index**: The cycle number of each cell.
* **Maximum discharge capacity**: Peak discharge capacity per cycle.
* **Normalized Mahalanobis distance**: A multivariate distance metric that
  accounts for the correlation between cycle index and maximum discharge
  capacity.

Create input features for Mahalanobis distance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Mahalanobis distance is calculated from both the cycle index and the
maximum discharge capacity.

.. code-block:: python

    df_cycle_index = pd.Series(
        unique_cycle_index,
        name="cycle_index")

    # Input features for Mahalanobis distance
    df_features_per_cell = pd.concat(
        [df_cycle_index, max_cap_per_cycle], axis=1)

    df_features_per_cell

Compute the normalized Mahalanobis distance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Fit an ``EmpiricalCovariance`` estimator to compute the covariance matrix
  of the feature space.
* Calculate the Mahalanobis distance for each cycle and normalize it by the
  maximum distance to obtain a value between 0 and 1. Note that
  ``EmpiricalCovariance.mahalanobis`` returns squared distances, so the
  normalized feature is a ratio of squared distances.

.. code-block:: python

    Xfeat = df_features_per_cell.values

    # Calculate Mahalanobis distance based on
    # cycle_index and max_discharge_capacity
    # (scikit-learn returns the squared distances)
    cov = EmpiricalCovariance().fit(Xfeat)
    mahal_dist = cov.mahalanobis(Xfeat)

    df_maha_dist = pd.Series(
        mahal_dist,
        name="mahal_dist")

    # Merge calculated mahalanobis distance
    df_merge_features = pd.concat(
        [df_features_per_cell, df_maha_dist], axis=1)

    # Calculate maximum mahal_dist to
    # normalize the distance calculation
    max_mahal_dist = (
        df_merge_features["mahal_dist"].max())

    df_merge_features["norm_mahal_dist"] = (
        df_merge_features["mahal_dist"]/max_mahal_dist)

    selected_feature_cols = (
        "max_discharge_capacity",
        "norm_mahal_dist")

To inspect the merged features:

.. code-block:: python

    df_merge_features

Step-7: Hyperparameter Tuning with Optuna
--------------------------------------------

Optuna is used to search for the best hyperparameters of the Autoencoder
model. The multi-objective optimization maximizes both **recall** and
**precision** simultaneously.

Define the hyperparameter search space
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The hyperparameter search space is defined as a lambda function that maps
each Optuna ``trial`` to a dictionary of sampled hyperparameter values:

* ``batch_size``: Number of samples per training batch (int, 8 to 32).
* ``epoch_num``: Number of training epochs (int, 10 to 50).
* ``learning_rate``: Learning rate for the optimizer (float, 0.0 to 0.1).
* ``dropout_rate``: Dropout rate for regularization (float, 0.1 to 0.5).
* ``threshold``: Decision threshold for the outlier probability score
  (float, 0.0 to 1.0).

.. code-block:: python

    # Define the hyperparameter search space for autoencoder
    hp_space = lambda trial: {
        "batch_size": trial.suggest_int(
            "batch_size", 8, 32),
        "epoch_num": trial.suggest_int(
            "epoch_num", 10, 50),
        "learning_rate": trial.suggest_float(
            "learning_rate", 0.0, 0.1),
        "dropout_rate": trial.suggest_float(
            "dropout_rate", 0.1, 0.5),
        "threshold": trial.suggest_float(
            "threshold", 0.0, 1.0)}

Create and run the Optuna study
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* A ``TPESampler`` with a fixed seed ensures reproducibility.
* The study is configured for multi-objective optimization with two
  directions set to ``maximize`` (recall and precision).
* The ``hp.objective`` function trains the Autoencoder model for each trial
  and evaluates it against the benchmarking dataset.

.. code-block:: python

    # Instantiate an optuna study for autoencoder model
    sampler = optuna.samplers.TPESampler(seed=42)

    autoencoder_study = optuna.create_study(
        study_name="autoencoder_hyperparam",
        sampler=sampler,
        directions=["maximize", "maximize"])

    autoencoder_study.optimize(
        lambda trial: hp.objective(
            trial,
            model_id="autoencoder",
            df_feature_dataset=df_merge_features,
            selected_feature_cols=selected_feature_cols,
            df_benchmark_dataset=df_selected_cell,
            hp_space=hp_space,
            selected_cell_label=selected_cell_label),
        n_trials=20)

Step-8: Aggregate Best Trials
---------------------------------

After the optimization completes, aggregate the best trial hyperparameters
using the median (or median rounded to integer for discrete parameters). The
aggregation schema defines how each hyperparameter is consolidated across the
Pareto-optimal trials:

.. code-block:: python

    schema_autoencoder = {
        "batch_size": "median_int",
        "epoch_num": "median_int",
        "learning_rate": "median",
        "dropout_rate": "median",
        "threshold": "median",
    }

    df_autoencoder_hyperparam = hp.aggregate_best_trials(
        autoencoder_study.best_trials,
        cell_label=selected_cell_label,
        model_id="autoencoder",
        schema=schema_autoencoder)

    df_autoencoder_hyperparam

Step-9: Evaluate Percentage of Perfect Recall and Precision
--------------------------------------------------------------

* Evaluate the percentage of trials in the study that achieved a perfect
  recall score (= 1.0) and a perfect precision score (= 1.0).
* This provides insight into how frequently the optimization found
  configurations that correctly identified all anomalies without any false
  positives.

.. code-block:: python

    recall_score_pct, precision_score_pct = hp.evaluate_hp_perfect_score_pct(
        model_study=autoencoder_study)

Step-10: Plot Pareto Front
------------------------------

* The Pareto front visualizes the trade-off between recall and precision
  across all trials.
* Trials on the Pareto front represent the best achievable combinations of
  recall and precision: improving one metric would require sacrificing the
  other.

.. code-block:: python

    hp.plot_pareto_front(
        autoencoder_study,
        selected_cell_label,
        fig_title="Autoencoder Pareto Front")

    plt.show()

.. image:: docs_figure/ml_04_tohoku_autoencoder_hyperparam_tuned/autoencoder_pareto_front_cell_num_1.png
   :height: 476px
   :width: 600px
   :alt: Pareto front of recall vs precision for Autoencoder hyperparameter tuning on ``cell_num_1``
   :align: center

Step-11: Export Hyperparameters to CSV
-----------------------------------------

* Export the aggregated best hyperparameters to a CSV file for
  record-keeping and reproducibility.
* The ``if_exists="replace"`` option overwrites any existing entry for the
  selected cell.

.. code-block:: python

    # Export current hyperparameters to CSV
    hyperparam_filepath = Path.cwd().joinpath(
        "hp_06_autoencoder_hyperparam_tohoku.csv")

    hp.export_current_hyperparam(
        df_autoencoder_hyperparam,
        selected_cell_label,
        export_csv_filepath=hyperparam_filepath,
        if_exists="replace")

Step-12: Train Model with Best Trial Parameters
---------------------------------------------------

Load best trial parameters from CSV output
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Read back the exported hyperparameters from CSV and filter for the
  selected cell.

.. code-block:: python

    # Read back the exported hyperparameters
    df_hyperparam_from_csv = pd.read_csv(hyperparam_filepath)

    df_param_per_cell = df_hyperparam_from_csv[
        df_hyperparam_from_csv["cell_index"] == selected_cell_label]

    df_param_per_cell

Create a dict for best trial parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    param_dict = df_param_per_cell.iloc[0].to_dict()
    pprint.pp(param_dict)

Run the model with best trial parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Extract the model configuration for the Autoencoder from
  ``hp.MODEL_CONFIG``.
* Instantiate a ``ModelRunner`` with the selected features and cell label.
* Build the training input matrix ``Xdata`` (shape: n_cycles × n_features).
* Create the Autoencoder model using the tuned hyperparameters via
  ``cfg.model_param(param_dict)``.
* Fit the model, compute probabilistic outlier scores, and extract the
  predicted outlier cycle indices using the tuned threshold.

.. code-block:: python

    cfg = hp.MODEL_CONFIG["autoencoder"]

    runner = ModelRunner(
        cell_label=selected_cell_label,
        df_input_features=df_merge_features,
        selected_feature_cols=selected_feature_cols
    )

    Xdata = runner.create_model_x_input()

    model = cfg.model_param(param_dict)
    print(model)

    model.fit(Xdata)

    proba = model.predict_proba(Xdata)

    pred_outlier_indices, pred_outlier_score = runner.pred_outlier_indices_from_proba(
        proba=proba,
        threshold=param_dict["threshold"],
        outlier_col=cfg.proba_col
    )

    pred_outlier_indices, pred_outlier_score

Get predicted outlier dataframe
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Filter the feature dataframe to retain only cycles predicted as anomalous.
* Append the ``outlier_prob`` column with the model's outlier probability
  for each predicted anomalous cycle.

.. code-block:: python

    df_outliers_pred = (df_merge_features[
        df_merge_features["cycle_index"]
        .isin(pred_outlier_indices)].copy())

    df_outliers_pred["outlier_prob"] = pred_outlier_score

    df_outliers_pred

Step-13: Predict Probabilistic Anomaly Score Map
---------------------------------------------------

* ``runner.predict_anomaly_score_map`` generates a 2D contour map of anomaly
  scores (outlier probability).
* The anomaly score map uses the tuned threshold from the hyperparameter
  optimization.
* Two different ``grid_offset`` values are shown to demonstrate how the grid
  resolution affects the visualization.

Anomaly score map with grid offset = 1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    grid_offset_size = 1

    axplot = runner.predict_anomaly_score_map(
        selected_model=model,
        model_name="Autoencoder",
        xoutliers=df_outliers_pred["max_discharge_capacity"],
        youtliers=df_outliers_pred["norm_mahal_dist"],
        pred_outliers_index=pred_outlier_indices,
        threshold=param_dict["threshold"],
        square_grid=False,
        grid_offset=grid_offset_size
    )

    axplot.set_xlabel(
        r"Maximum discharge capacity per cycle",
        fontsize=12)
    axplot.set_ylabel(
        r"Normalized Mahalanobis distance",
        fontsize=12)

    output_fig_filename = (
        f"autoencoder_grid_offset_size_{grid_offset_size}_"
        + selected_cell_label
        + ".png")

    fig_output_path = (
        selected_cell_artifacts_dir
        .joinpath(output_fig_filename))

    plt.savefig(
        fig_output_path,
        dpi=600,
        bbox_inches="tight")

    plt.show()

.. image:: docs_figure/ml_04_tohoku_autoencoder_hyperparam_tuned/autoencoder_grid_offset_size_1_cell_num_1.png
   :height: 418px
   :width: 600px
   :alt: Anomaly score map for ``cell_num_1`` with grid offset size 1
   :align: center

Anomaly score map with grid offset = 50
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    grid_offset_size = 50

    axplot = runner.predict_anomaly_score_map(
        selected_model=model,
        model_name="Autoencoder",
        xoutliers=df_outliers_pred["max_discharge_capacity"],
        youtliers=df_outliers_pred["norm_mahal_dist"],
        pred_outliers_index=pred_outlier_indices,
        threshold=param_dict["threshold"],
        square_grid=False,
        grid_offset=grid_offset_size
    )

    axplot.set_xlabel(
        r"Maximum discharge capacity per cycle",
        fontsize=12)
    axplot.set_ylabel(
        r"Normalized Mahalanobis distance",
        fontsize=12)

    output_fig_filename = (
        f"autoencoder_grid_offset_size_{grid_offset_size}_"
        + selected_cell_label
        + ".png")

    fig_output_path = (
        selected_cell_artifacts_dir
        .joinpath(output_fig_filename))

    plt.savefig(
        fig_output_path,
        dpi=600,
        bbox_inches="tight")

    plt.show()

.. image:: docs_figure/ml_04_tohoku_autoencoder_hyperparam_tuned/autoencoder_grid_offset_size_50_cell_num_1.png
   :height: 418px
   :width: 600px
   :alt: Anomaly score map for ``cell_num_1`` with grid offset size 50
   :align: center

The figure shows the anomaly score map produced by the hyperparameter-tuned
Autoencoder model:

* **Background Heatmap**:

  * Red regions: high anomaly probability (more likely to contain outliers).
  * Blue/white regions: low anomaly probability (normal cycles).

* **Dashed Black Contour**:

  * Represents the decision boundary defined by the optimized threshold.
    Points outside are considered anomalies.

* **Black Dots**:

  * Represent the majority of normal cycles (inlier data).

* **Yellow Stars with Labels**:

  * Mark the detected anomalous cycles. Their positions in the 2D feature
    space highlight where they deviate from typical battery behavior.

* **Colorbar (right)**:

  * Quantifies anomaly probability (0 = normal, 1 = highly anomalous).

Step-14: Model Performance Evaluation
-----------------------------------------

* Map predicted outlier indices to the benchmark dataset to compare against
  ground-truth labels.
* ``modval.evaluate_pred_outliers(...)`` returns a tidy DataFrame with:

  * ``cycle_index``: Cell discharge cycle index.
  * ``true_outlier``: ground truth (0/1).
  * ``pred_outlier``: model prediction (0/1) for the same cycles.

.. code-block:: python

    df_eval_outlier = modval.evaluate_pred_outliers(
        df_benchmark=df_selected_cell,
        outlier_cycle_index=pred_outlier_indices)

Confusion matrix
^^^^^^^^^^^^^^^^^^

.. code-block:: python

    axplot = modval.generate_confusion_matrix(
        y_true=df_eval_outlier["true_outlier"],
        y_pred=df_eval_outlier["pred_outlier"])

    axplot.set_title(
        "Autoencoder",
        fontsize=16)

    output_fig_filename = (
        "conf_matrix_autoencoder_"
        + selected_cell_label
        + ".png")

    fig_output_path = (
        selected_cell_artifacts_dir
        .joinpath(output_fig_filename))

    plt.savefig(
        fig_output_path,
        dpi=600,
        bbox_inches="tight")

    plt.show()

.. image:: docs_figure/ml_04_tohoku_autoencoder_hyperparam_tuned/conf_matrix_autoencoder_cell_num_1.png
   :height: 458px
   :width: 600px
   :alt: Confusion matrix for ``cell_num_1``
   :align: center

Evaluation metrics
^^^^^^^^^^^^^^^^^^^^

In this study, five different metrics are used to evaluate model
performance:

* **Accuracy**: :math:`\frac{\textrm{TP} + \textrm{TN}}{\textrm{Total predictions}}`
* **Precision**: :math:`\frac{\textrm{TP}}{\textrm{TP} + \textrm{FP}}`
* **Recall**: :math:`\frac{\textrm{TP}}{\textrm{TP} + \textrm{FN}}`
* **F1-score**: :math:`\frac{2(\textrm{Precision}\times \textrm{Recall})}{\textrm{Precision} + \textrm{Recall}}`
* **MCC**: :math:`\frac{\textrm{TP} \times \textrm{TN} - \textrm{FP} \times \textrm{FN}}{\sqrt{(\textrm{TP} + \textrm{FP})(\textrm{TP} + \textrm{FN})(\textrm{TN} + \textrm{FP})(\textrm{TN} + \textrm{FN})}}`

.. code-block:: python

    df_current_eval_metrics = modval.eval_model_performance(
        model_name="autoencoder",
        selected_cell_label=selected_cell_label,
        df_eval_outliers=df_eval_outlier)

    df_current_eval_metrics

Step-15: Export Model Performance Metrics
--------------------------------------------

* Export the evaluation metrics to a CSV file for record-keeping and
  comparison across models and cells.

.. code-block:: python

    # Export current metrics to CSV
    hyperparam_eval_filepath = Path.cwd().joinpath(
        "eval_metrics_hp_single_cell_tohoku.csv")

    hp.export_current_model_metrics(
        model_name="autoencoder",
        selected_cell_label=selected_cell_label,
        df_current_eval_metrics=df_current_eval_metrics,
        export_csv_filepath=hyperparam_eval_filepath,
        if_exists="replace")

Step-16: Visualize Predicted Anomalies
-----------------------------------------

Plot predicted anomalous cycles
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Re-plot the cycling data with the predicted anomalous cycles highlighted
  and annotated, allowing visual comparison of model predictions against the
  ground truth.

.. code-block:: python

    axplot = benchdb.plot_cycle_data(
        df_selected_cell_without_labels,
        pred_outlier_indices)

    axplot.set_title(
        f"Cell-{cell_num}: Predicted Anomalies with Autoencoder",
        fontsize=16)

    output_fig_filename = (
        "autoencoder_pred_cycles_with_outliers_"
        + selected_cell_label
        + ".png")

    fig_output_path = (
        selected_cell_artifacts_dir
        .joinpath(output_fig_filename))

    plt.savefig(
        fig_output_path,
        dpi=600,
        bbox_inches="tight")

    plt.show()

.. image:: docs_figure/ml_04_tohoku_autoencoder_hyperparam_tuned/autoencoder_pred_cycles_with_outliers_cell_num_1.png
   :height: 398px
   :width: 600px
   :alt: Cycling data with predicted anomalous cycles highlighted for ``cell_num_1``
   :align: center

Plot predicted capacity fade with outlier annotations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Filter the maximum capacity per cycle to retain only the predicted outlier
  cycles.
* Plot the capacity fade curve with the predicted outlier cycles annotated
  in a text box.

.. code-block:: python

    pred_cap_outlier = max_cap_per_cycle[
        max_cap_per_cycle
        .index.isin(pred_outlier_indices)]

    axplot = bviz.plot_cycle_data(
        xseries=unique_cycle_index,
        yseries=max_cap_per_cycle,
        cycle_index_series=unique_cycle_index,
        xoutlier=pred_cap_outlier.index,
        youtlier=pred_cap_outlier)

    axplot.set_xlabel(
        r"Cycle index",
        fontsize=14)
    axplot.set_ylabel(
        r"Maximum discharge capacity [mAh/g]",
        fontsize=14)

    axplot.set_title(
        f"Cell-{cell_num}: Predicted Anomalies with Autoencoder",
        fontsize=16)

    # Create textbox to annotate anomalous cycle
    textstr = '\n'.join((
        r"Cycle index with anomalies:",
        f"{list(pred_cap_outlier.index)}"))

    # properties for bbox
    props = dict(
        boxstyle='round',
        facecolor='wheat',
        alpha=0.5)

    axplot.text(
        0.95, 0.95,
        textstr,
        transform=axplot.transAxes,
        fontsize=12,
        ha="right", va='top',
        bbox=props)

    output_fig_filename = (
        "autoencoder_pred_cap_fade_with_outliers_"
        + selected_cell_label
        + ".png")

    fig_output_path = (
        selected_cell_artifacts_dir
        .joinpath(output_fig_filename))

    plt.savefig(
        fig_output_path,
        dpi=600,
        bbox_inches="tight")

    plt.show()

.. image:: docs_figure/ml_04_tohoku_autoencoder_hyperparam_tuned/autoencoder_pred_cap_fade_with_outliers_cell_num_1.png
   :height: 396px
   :width: 600px
   :alt: Predicted capacity fade curve with outlier annotations for ``cell_num_1``
   :align: center

----

.. note::

    This notebook serves as an example to explain the workflow for running
    the Autoencoder model with hyperparameter tuning on a single cell. To
    mitigate overfitting, the model should be trained and validated across
    multiple cells in the training dataset instead of a single cell. The
    hyperparameters are then averaged across all cells to find a more
    generalizable configuration that performs well across the entire
    dataset, rather than just one cell.
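
As a rough illustration of the cross-cell aggregation described in the note
above, the sketch below takes one tuned hyperparameter set per training cell
and reduces them to a single shared configuration. It is a minimal sketch and
not part of ``osbad``: the helper ``aggregate_across_cells`` is hypothetical,
the numbers are invented for illustration, and the use of the median (rounded
for integer parameters, mirroring the Step-8 schema) is an assumption. A mean
could be substituted if plain averaging is preferred.

.. code-block:: python

    from statistics import median

    # Hypothetical tuned hyperparameters, one row per training cell.
    # The values are invented for illustration only.
    hp_per_cell = [
        {"cell_index": "cell_num_1", "batch_size": 16, "epoch_num": 30,
         "learning_rate": 0.01, "dropout_rate": 0.2, "threshold": 0.6},
        {"cell_index": "cell_num_2", "batch_size": 24, "epoch_num": 40,
         "learning_rate": 0.03, "dropout_rate": 0.3, "threshold": 0.7},
        {"cell_index": "cell_num_5", "batch_size": 20, "epoch_num": 20,
         "learning_rate": 0.02, "dropout_rate": 0.1, "threshold": 0.5},
        {"cell_index": "cell_num_6", "batch_size": 12, "epoch_num": 50,
         "learning_rate": 0.04, "dropout_rate": 0.4, "threshold": 0.8},
    ]

    # Integer-valued parameters are rounded after aggregation
    INT_PARAMS = ("batch_size", "epoch_num")
    FLOAT_PARAMS = ("learning_rate", "dropout_rate", "threshold")

    def aggregate_across_cells(rows):
        """Reduce per-cell hyperparameters to one shared set (median)."""
        shared = {}
        for name in INT_PARAMS:
            shared[name] = int(round(median(row[name] for row in rows)))
        for name in FLOAT_PARAMS:
            shared[name] = float(median(row[name] for row in rows))
        return shared

    shared_params = aggregate_across_cells(hp_per_cell)
    print(shared_params)

In the actual workflow, the rows could come from the exported CSV once all
training cells have been tuned, for example via
``pd.read_csv(hyperparam_filepath).to_dict("records")``.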