Quickstart

This project is implemented and managed using UV environment, which is an extremely fast Python package and project manager, written in Rust.

Step-1: Install uv

Follow the official Astral installation documentation to install uv.

Step-2: Setup Git LFS for your system

This project uses Git LFS to manage large files such as datasets. Without Git LFS, you will only be able to download the text pointers instead of the actual large files, which will lead to errors when you run the benchmarking scripts.

Example error message when Git LFS is not installed:

IOException: IO Error: The file "train_dataset_severson.db" exists, but it is not a
valid DuckDB database file!

If you are using Git LFS for the first time, you can follow the Git LFS installation guide to install Git LFS.

# On Ubuntu/Debian
sudo apt install git-lfs

# Then initialize Git LFS in your system
git lfs install

Step-3: Clone osbad from the GitHub Repository

  • Clone the osbad repository to access the example notebooks and scripts

  • Sync dependencies and activate the virtual environment

  • Pull the large dataset files with Git LFS

# Clone the osbad repository to access the example notebooks and scripts
git clone git@github.com:meichinpang/osbad.git

# Change into the cloned osbad repository
cd osbad

# Sync Dependencies
uv sync

# Activate the virtual environment
# On macOS/Linux:
source .venv/bin/activate

# On Windows (Command Prompt):
.venv\Scripts\activate.bat

# On Windows (PowerShell):
.venv\Scripts\Activate.ps1

# Pull the large dataset files with Git LFS
git lfs pull

To test osbad installation, run the script test_osbad_installation.py in the root directory of the project. This script imports the osbad package and prints the current version to confirm that the installation is successful.

python test_osbad_installation.py

If the installation is successful, you should see an output similar to the following:

Hello from osbad!
osbad current version: X.Y.Z
OSBAD package installation is successful!

# Then you can start Jupyter Notebook
jupyter notebook

In the Jupyter browser UI, navigate to the following notebook to run the osbad workflow for the Isolation Forest model on the Severson dataset:

machine_learning/hp_tuning_with_transfer_learning/severson_data_source/01_train_dataset/ml_01_iforest_hyperparam_severson.ipynb

Typical Workflow

The notebook ml_01_iforest_hyperparam_severson.ipynb demonstrates the end-to-end osbad workflow for the Isolation Forest model:

  1. Import libraries: Load osbad, optuna, duckdb, pandas, and matplotlib.

  2. Load the benchmarking dataset: Connect to the Severson training dataset (train_dataset_severson.db) via DuckDB and select a cell for analysis.

  3. Drop true labels: Remove ground-truth anomaly labels from the dataset to simulate an unsupervised setting.

  4. Plot raw cycle data: Visualize discharge capacity vs. voltage curves without labels.

  5. Load features database: Import pre-computed features from train_features_severson.db.

  6. Hyperparameter tuning: Run Bayesian optimization with Optuna (TPE sampler, 20 trials) to find the best Isolation Forest hyperparameters (contamination, n_estimators, max_samples, threshold).

  7. Aggregate best trials: Extract the median hyperparameters from the Pareto-optimal trials and export them to CSV.

  8. Train the model: Fit an Isolation Forest with the best trial parameters and predict anomaly scores.

  9. Visualize anomaly score map: Plot the decision boundary and predicted outliers in feature space.

  10. Evaluate model performance: Generate a confusion matrix and compute performance metrics (accuracy, precision, recall, F1-score, and Matthews correlation coefficient).

  11. Export evaluation metrics: Save the model performance results to a CSV file.

  12. Verify with true labels: Compare predicted outliers against the ground-truth labels using cycle plots and bubble charts.

Issues and Troubleshooting

  • If you encounter errors related to missing files or invalid database files, ensure that you have Git LFS installed and have pulled the large dataset files correctly.

  • If you see errors about missing Python packages, make sure you have activated the virtual environment with source .venv/bin/activate (or the appropriate command for your operating system) and that you have run uv sync to install all dependencies.

  • For any other issues, please open an issue on the OSBAD Issue Tracker with details about the error message and your system configuration.

Next Steps

  • See Datasets Guide for an overview of the datasets included in this benchmarking project.

  • Explore Models Guide for details on the models included in this benchmarking study.