Example (2): Isolation Forest with Hyperparameter Tuning =========================================================== Prerequisites --------------- * Python 3.12 (recommended) * Files on disk: * ``database/train_dataset_severson.db`` (benchmark labels per cycle) * ``database/train_features_severson.db`` (precomputed features per cycle) * (Optional) LaTeX installation if you want Matplotlib to render text with LaTeX: * A TeX distribution (e.g., TeX Live/MacTeX/MiKTeX), dvipng, and fonts like cm-super. * Don't have LaTeX installed? Either install it, or set ``rcParams["text.usetex"] = False``. Before running the example in the ``machine_learning/hp_tuning_with_transfer_learning`` section, please evaluate whether the global directory path specified in ``src/osbad/config.py`` needs to be updated: .. code-block:: python # Modify this global directory path if needed PIPELINE_OUTPUT_DIR = Path.cwd().joinpath("artifacts_output_dir") The following example of running an Isolation Forest model with hyperparameter tuning is also provided as a notebook in ``machine_learning/hp_tuning_with_transfer_learning/severson_data_source/01_train_dataset/ml_01_iforest_hyperparam_severson.ipynb``. Step-1: Load libraries --------------------------- Import the libraries into your local development environment, including the ``osbad`` library for benchmarking anomaly detection. * ``Path`` is used for robust, cross-platform file paths. * ``pprint`` pretty-prints data structures for readable diagnostics. * ``duckdb`` is the embedded analytical database engine storing the dataset. * ``optuna`` is a hyperparameter optimization framework used to search for the best model configuration. * ``bconf``: project config utilities (e.g., where to write artifacts). * ``hp``: hyperparameter tuning utilities including the objective function, aggregation of best trials, and Pareto front visualization. * ``BenchDB``: a thin layer around DuckDB that provides convenience loaders. * ``ModelRunner``, ``modval``, ``bviz``: modeling, model validation, and visualization helpers for the benchmarking study. .. code-block:: python # Standard library from pathlib import Path import pprint # Third-party libraries import duckdb import pandas as pd import matplotlib.pyplot as plt import numpy as np import optuna # Custom osbad library for anomaly detection import osbad.config as bconf import osbad.hyperparam as hp import osbad.modval as modval import osbad.viz as bviz from osbad.database import BenchDB from osbad.model import ModelRunner Step-2: Load Benchmarking Dataset ------------------------------------ * Define the path to the DuckDB database file using the ``DB_DIR`` from ``bconf``. * Create a DuckDB connection (read-only) and load the full training dataset from the ``df_train_dataset_sv`` table. * Retrieve the unique cell indices available in the training dataset. .. code-block:: python # Define a global variable to save fig output PIPELINE_OUTPUT_DIR = bconf.PIPELINE_OUTPUT_DIR # Path to database directory DB_DIR = bconf.DB_DIR # Path to the DuckDB instance: # "osbad/database/train_dataset_severson.db" db_filepath = ( DB_DIR.joinpath("train_dataset_severson.db")) # Create a DuckDB connection con = duckdb.connect( db_filepath, read_only=True) # Load all training dataset from duckdb df_duckdb = con.execute( "SELECT * FROM df_train_dataset_sv").fetchdf() # Get the cell index of training dataset unique_cell_index_train = df_duckdb["cell_index"].unique() print(unique_cell_index_train) .. code-block:: python training_cell_count = len(unique_cell_index_train) print(f"Training cell count: {training_cell_count}") Step-3: Filter Dataset for a Selected Cell --------------------------------------------- * Pick a specific cell based on ``selected_cell_label``, which identifies the experimental data corresponding to one unique cell. * Create an artifacts folder for that cell, where you can save figures, tables, or model outputs related to this cell. .. code-block:: python # Get the cell-ID from cell_inventory selected_cell_label = "2017-05-12_5_4C-70per_3C_CH17" # Create a subfolder to store fig output # corresponding to each cell-index selected_cell_artifacts_dir = bconf.artifacts_output_dir( selected_cell_label) Step-4: Load Benchmarking Dataset for Selected Cell ------------------------------------------------------ * Initialize ``BenchDB`` for the selected cell and load the benchmarking dataset from the training partition. .. code-block:: python # Import the BenchDB class # Load only the dataset based on the selected cell benchdb = BenchDB( db_filepath, selected_cell_label) # load the benchmarking dataset df_selected_cell = benchdb.load_benchmark_dataset( dataset_type="train") Step-5: Drop True Labels ----------------------------- * Drop the true outlier labels (denoted as ``outlier``) from the dataframe and select only the relevant features for machine learning: * ``cell_index``: The cell-ID for data and model versioning purposes. * ``cycle_index``: The cycle number of each cell. * ``discharge_capacity``: Discharge capacity of the cell. * ``voltage``: Discharge voltage of the cell. .. code-block:: python if df_selected_cell is not None: filter_col = [ "cell_index", "cycle_index", "discharge_capacity", "voltage"] # Drop true labels from the benchmarking dataset # and filter for selected columns only df_selected_cell_without_labels = benchdb.drop_labels( df_selected_cell, filter_col) # print a subset of the dataframe # for diagnostics running in terminals print(df_selected_cell_without_labels.head(10).to_markdown()) print("*"*100) Step-6: Plot Cycle Data without Labels ----------------------------------------- * Visualize the cycling data for the selected cell without displaying the true outlier labels. This represents what the model sees before training. .. code-block:: python # If the true outlier cycle index is not known, # cycling data will be plotted without labels benchdb.plot_cycle_data( df_selected_cell_without_labels) output_fig_filename = ( "cycle_data_without_labels_" + selected_cell_label + ".png") fig_output_path = ( selected_cell_artifacts_dir .joinpath(output_fig_filename)) plt.savefig( fig_output_path, dpi=600, bbox_inches="tight") plt.show() .. image:: docs_figure/ml_02_severson_iforest_hyperparam_tuned/cycle_data_without_labels_2017-05-12_5_4C-70per_3C_CH17.png :height: 398px :width: 600px :alt: Cycle data without labels from ``2017-05-12_5_4C-70per_3C_CH17`` :align: center Step-7: Load the Pre-computed Training Features -------------------------------------------------- Instead of computing the statistical feature transformation and physics-informed feature extraction from scratch (as in the baseline example), the hyperparameter tuning workflow loads pre-computed features directly from a dedicated features database. This database stores the already-transformed features per cycle, including: * ``max_diff_dQ``: Maximum scaled capacity difference per cycle. * ``log_max_diff_dQ``: Log-transformed maximum scaled capacity difference. * ``max_diff_dV``: Maximum scaled voltage difference per cycle. * ``log_max_diff_dV``: Log-transformed maximum scaled voltage difference. .. code-block:: python # Define the filepath to ``train_features_severson.db`` # osbad/database/train_features_severson.db db_features_filepath = ( DB_DIR.joinpath( "train_features_severson.db")) # Load only the training features dataset df_features_per_cell = benchdb.load_features_db( db_features_filepath, dataset_type="train") unique_cycle_count = ( df_features_per_cell["cycle_index"].unique()) To inspect the loaded features: .. code-block:: python df_features_per_cell Step-8: Hyperparameter Tuning with Optuna -------------------------------------------- Optuna is used to search for the best hyperparameters of the Isolation Forest model. The multi-objective optimization maximizes both **recall** and **precision** simultaneously. Define the hyperparameter search space ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The hyperparameter search space is defined as a lambda function that maps each Optuna ``trial`` to a dictionary of sampled hyperparameter values: * ``contamination``: Expected proportion of outliers in the dataset (float, 0 to 0.5). * ``n_estimators``: Number of isolation trees in the ensemble (int, 100 to 500). * ``max_samples``: Maximum number of samples to draw for each tree (int, 100 to total cycle count). * ``threshold``: Decision threshold for the outlier probability score (float, 0 to 1). .. code-block:: python # Update the HP config for max_samples # depending on the cycle numbers total_cycle_count = len( df_selected_cell_without_labels["cycle_index"].unique()) # Define the hyperparameter search space for # isolation forest hp_space=lambda trial: { "contamination": trial.suggest_float( "contamination", 0, 0.5), "n_estimators": trial.suggest_int( "n_estimators", 100, 500), "max_samples": trial.suggest_int( "max_samples", 100, total_cycle_count), "threshold": trial.suggest_float( "threshold", 0, 1)} Create and run the Optuna study ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * A ``TPESampler`` with a fixed seed ensures reproducibility. * The study is configured for multi-objective optimization with two directions set to ``maximize`` (recall and precision). * The ``hp.objective`` function trains the Isolation Forest model for each trial and evaluates it against the benchmarking dataset. .. code-block:: python # Instantiate an optuna study for iForest model sampler = optuna.samplers.TPESampler(seed=42) selected_feature_cols = ( "log_max_diff_dQ", "log_max_diff_dV") if_study = optuna.create_study( study_name="iforest_hyperparam", sampler=sampler, directions=["maximize","maximize"]) if_study.optimize( lambda trial: hp.objective( trial, model_id="iforest", df_feature_dataset=df_features_per_cell, selected_feature_cols=selected_feature_cols, df_benchmark_dataset=df_selected_cell, hp_space=hp_space, selected_cell_label=selected_cell_label), n_trials=20) Step-9: Aggregate Best Trials --------------------------------- After the optimization completes, aggregate the best trial hyperparameters using the median (or median rounded to integer for discrete parameters). The aggregation schema defines how each hyperparameter is consolidated across the Pareto-optimal trials: .. code-block:: python schema_iforest = { "threshold": "median", "contamination": "median", "n_estimators": "median_int", "max_samples": "median_int", } df_iforest_hyperparam = hp.aggregate_best_trials( if_study.best_trials, cell_label=selected_cell_label, model_id="iforest", schema=schema_iforest) df_iforest_hyperparam Step-10: Evaluate Percentage of Perfect Recall and Precision -------------------------------------------------------------- * Evaluate the percentage of trials in the study that achieved a perfect recall score (= 1.0) and a perfect precision score (= 1.0). * This provides insight into how frequently the optimization found configurations that correctly identified all anomalies without any false positives. .. code-block:: python recall_score_pct, precision_score_pct = hp.evaluate_hp_perfect_score_pct( model_study=if_study) Step-11: Plot Pareto Front ------------------------------ * The Pareto front visualizes the trade-off between recall and precision across all trials. * Trials on the Pareto front represent the best achievable combinations of recall and precision: improving one metric would require sacrificing the other. .. code-block:: python hp.plot_pareto_front( if_study, selected_cell_label, fig_title="Isolation Forest Pareto Front") plt.show() .. image:: docs_figure/ml_02_severson_iforest_hyperparam_tuned/isolation_forest_pareto_front_2017-05-12_5_4C-70per_3C_CH17.png :height: 485 px :width: 600px :alt: Pareto front for Isolation Forest hyperparameter tuning :align: center Step-12: Export Hyperparameters to CSV ----------------------------------------- * Export the aggregated best hyperparameters to a CSV file for record-keeping and reproducibility. * The ``if_exists="replace"`` option overwrites any existing entry for the selected cell. .. code-block:: python # Export current hyperparameters to CSV hyperparam_filepath = Path.cwd().joinpath( "ml_01_iforest_hyperparam_severson.csv") hp.export_current_hyperparam( df_iforest_hyperparam, selected_cell_label, export_csv_filepath=hyperparam_filepath, if_exists="replace") Step-13: Train Model with Best Trial Parameters --------------------------------------------------- Load best trial parameters from CSV output ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * Read back the exported hyperparameters from CSV and filter for the selected cell. .. code-block:: python # Test reading from exported metrics df_hyperparam_from_csv = pd.read_csv(hyperparam_filepath) df_param_per_cell = df_hyperparam_from_csv[ df_hyperparam_from_csv["cell_index"] == selected_cell_label] df_param_per_cell Create a dict for best trial parameters ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: python param_dict = df_param_per_cell.iloc[0].to_dict() pprint.pp(param_dict) Run the model with best trial parameters ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * Extract the model configuration for Isolation Forest from ``hp.MODEL_CONFIG``. * Instantiate a ``ModelRunner`` with the selected features and cell label. * Build the training input matrix ``Xdata`` (shape: n_cycles × n_features). * Create the Isolation Forest model using the tuned hyperparameters via ``cfg.model_param(param_dict)``. * Fit the model, compute probabilistic outlier scores, and extract the predicted outlier cycle indices using the tuned threshold. .. code-block:: python cfg = hp.MODEL_CONFIG["iforest"] runner = ModelRunner( cell_label=selected_cell_label, df_input_features=df_features_per_cell, selected_feature_cols=selected_feature_cols ) Xdata = runner.create_model_x_input() model = cfg.model_param(param_dict) print(model) model.fit(Xdata) proba = model.predict_proba(Xdata) pred_outlier_indices, pred_outlier_score = runner.pred_outlier_indices_from_proba( proba=proba, threshold=param_dict["threshold"], outlier_col=cfg.proba_col ) pred_outlier_indices, pred_outlier_score Get predicted outlier dataframe ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * Filter the feature dataframe to retain only cycles predicted as anomalous. * Append the ``outlier_prob`` column with the model's outlier probability for each predicted anomalous cycle. .. code-block:: python df_outliers_pred = (df_features_per_cell[ df_features_per_cell["cycle_index"] .isin(pred_outlier_indices)].copy()) df_outliers_pred["outlier_prob"] = pred_outlier_score df_outliers_pred Step-14: Predict Probabilistic Anomaly Score Map --------------------------------------------------- * ``runner.predict_anomaly_score_map`` generates a 2D contour map of anomaly scores (outlier probability). * The anomaly score map uses the tuned threshold from the hyperparameter optimization instead of a fixed value. .. code-block:: python axplot = runner.predict_anomaly_score_map( selected_model=model, model_name="Isolation Forest", xoutliers=df_outliers_pred["log_max_diff_dQ"], youtliers=df_outliers_pred["log_max_diff_dV"], pred_outliers_index=pred_outlier_indices, threshold=param_dict["threshold"] ) axplot.set_xlabel( r"$\log(\Delta Q_{\mathrm{scaled,max,cyc}})$ [Ah]", fontsize = 12) axplot.set_ylabel( r"$\log(\Delta V_{\mathrm{scaled,max,cyc}})$ [V]", fontsize = 12) output_fig_filename = ( "iforest_" + selected_cell_label + ".png") fig_output_path = ( selected_cell_artifacts_dir .joinpath(output_fig_filename)) plt.savefig( fig_output_path, dpi=600, bbox_inches="tight") plt.show() .. image:: docs_figure/ml_02_severson_iforest_hyperparam_tuned/iforest_2017-05-12_5_4C-70per_3C_CH17.png :height: 424 px :width: 600 px :alt: iForest anomaly score map from ``2017-05-12_5_4C-70per_3C_CH17`` :align: center The figure shows the anomaly score map produced by the hyperparameter-tuned Isolation Forest model: * **Background Heatmap**: * Red regions: high anomaly probability (more likely to contain outliers). * Blue/white regions: low anomaly probability (normal cycles). * **Dashed Black Contour**: * Represents the decision boundary defined by the optimized threshold. Points outside are considered anomalies. * **Black Dots**: * Represent the majority of normal cycles (inlier data). * **Yellow Stars with Labels**: * Mark the detected anomalous cycles. Their positions in the 2D feature space highlight where they deviate from typical battery behavior. * **Colorbar (right)**: * Quantifies anomaly probability (0 = normal, 1 = highly anomalous). Step-15: Model Performance Evaluation ----------------------------------------- * Map predicted outlier indices to the benchmark dataset to compare against ground-truth labels. * ``modval.evaluate_pred_outliers(...)`` returns a tidy DataFrame with: * ``cycle_index``: Cell discharge cycle index. * ``true_outlier``: ground truth (0/1). * ``pred_outlier``: model prediction (0/1) for the same cycles. .. code-block:: python df_eval_outlier = modval.evaluate_pred_outliers( df_benchmark=df_selected_cell, outlier_cycle_index=pred_outlier_indices) Confusion matrix ^^^^^^^^^^^^^^^^^^ .. code-block:: python axplot = modval.generate_confusion_matrix( y_true=df_eval_outlier["true_outlier"], y_pred=df_eval_outlier["pred_outlier"]) axplot.set_title( "Isolation Forest", fontsize=16) output_fig_filename = ( "conf_matrix_iforest_" + selected_cell_label + ".png") fig_output_path = ( selected_cell_artifacts_dir .joinpath(output_fig_filename)) plt.savefig( fig_output_path, dpi=600, bbox_inches="tight") plt.show() .. image:: docs_figure/ml_02_severson_iforest_hyperparam_tuned/conf_matrix_iforest_2017-05-12_5_4C-70per_3C_CH17.png :height: 480 px :width: 600 px :alt: iForest confusion matrix from ``2017-05-12_5_4C-70per_3C_CH17`` :align: center Evaluation metrics ^^^^^^^^^^^^^^^^^^^^ In this study, five different metrics are used to evaluate model performance: * **Accuracy**: :math:`\frac{\textrm{TP} + \textrm{TN}}{\textrm{Total prediction}}` * **Precision**: :math:`\frac{\textrm{TP}}{\textrm{TP + FP}}` * **Recall**: :math:`\frac{\textrm{TP}}{\textrm{TP + FN}}` * **F1-score**: :math:`\frac{2(\textrm{Precision}\times \textrm{Recall})}{\textrm{Precision} + \textrm{Recall}}` * **MCC**: :math:`\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN+FN)}}` .. code-block:: python df_current_eval_metrics = modval.eval_model_performance( model_name="iforest", selected_cell_label=selected_cell_label, df_eval_outliers=df_eval_outlier) df_current_eval_metrics Step-16: Export Model Performance Metrics -------------------------------------------- * Export the evaluation metrics to a CSV file for record-keeping and comparison across models and cells. .. code-block:: python # Export current metrics to CSV hyperparam_eval_filepath = Path.cwd().joinpath( "eval_metrics_hp_single_cell_severson.csv") hp.export_current_model_metrics( model_name="iforest", selected_cell_label=selected_cell_label, df_current_eval_metrics=df_current_eval_metrics, export_csv_filepath=hyperparam_eval_filepath, if_exists="replace") Step-17: Verify with True Labels ------------------------------------- Plot cycle data with labels ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * Extract the true outlier cycle indices from the benchmarking dataset. * Re-plot the cycling data with the true anomalous cycles highlighted and annotated, allowing visual comparison of model predictions against the ground truth. .. code-block:: python # Extract true outliers cycle index from benchmarking dataset true_outlier_cycle_index = benchdb.get_true_outlier_cycle_index( df_selected_cell) print(f"True outlier cycle index:") print(true_outlier_cycle_index) # Plot cell data with true anomalies # If the true outlier cycle index is not known, # cycling data will be plotted without labels benchdb.plot_cycle_data( df_selected_cell_without_labels, true_outlier_cycle_index) output_fig_filename = ( "cycle_data_with_labels_" + selected_cell_label + ".png") fig_output_path = ( selected_cell_artifacts_dir.joinpath(output_fig_filename)) plt.savefig( fig_output_path, dpi=600, bbox_inches="tight") plt.show() .. image:: docs_figure/ml_02_severson_iforest_hyperparam_tuned/cycle_data_with_labels_2017-05-12_5_4C-70per_3C_CH17.png :height: 398 px :width: 600 px :alt: Cycle data with true labels from ``2017-05-12_5_4C-70per_3C_CH17`` :align: center Calculate bubble size ratio ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * Calculate the bubble size ratios from the feature distributions for plotting. The bubble size encodes the magnitude of both the capacity and voltage differences, making it easier to identify anomalous cycles visually. .. code-block:: python # Calculate the bubble size ratio for plotting df_bubble_size_dQ = bviz.calculate_bubble_size_ratio( df_variable=df_features_per_cell["max_diff_dQ"]) df_bubble_size_dV = bviz.calculate_bubble_size_ratio( df_variable=df_features_per_cell["max_diff_dV"]) bubble_size = ( np.abs(df_bubble_size_dV) * np.abs(df_bubble_size_dQ)) Plot the bubble chart with true outlier labels ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: python # Plot the bubble chart and label the outliers axplot = bviz.plot_bubble_chart( xseries=df_features_per_cell["log_max_diff_dQ"], yseries=df_features_per_cell["log_max_diff_dV"], bubble_size=bubble_size, unique_cycle_count=unique_cycle_count, cycle_outlier_idx_label=true_outlier_cycle_index) axplot.set_title( f"Cell {selected_cell_label}", fontsize=13) axplot.set_xlabel( r"$\log(\Delta Q_{\mathrm{scaled,max,cyc}})$ [Ah]", fontsize = 12) axplot.set_ylabel( r"$\log(\Delta V_{\mathrm{scaled,max,cyc}})$ [V]", fontsize = 12) output_fig_filename = ( "log_bubble_plot_" + selected_cell_label + ".png") fig_output_path = ( selected_cell_artifacts_dir.joinpath(output_fig_filename)) plt.savefig( fig_output_path, dpi=200, bbox_inches="tight") plt.show() .. image:: docs_figure/ml_02_severson_iforest_hyperparam_tuned/bubble_plot_2017-05-12_5_4C-70per_3C_CH17.png :height: 484 px :width: 600 px :alt: Log bubble plot from ``2017-05-12_5_4C-70per_3C_CH17`` :align: center ---- .. note:: This notebook serves as an example to explain the workflow for running the Isolation Forest model with hyperparameter tuning on a single cell. To mitigate overfitting, the model should be trained and validated across all 23 different cells in the training dataset instead of a single cell (see ml_01_iforest_hyperparam_pipeline_severson.py). The hyperparameters are then averaged across all cells to find a more generalizable configuration that performs well across the entire dataset, rather than just one cell (see ml_01_iforest_export_model.py).