Quickstart
=============

This project is developed and managed with uv, an extremely fast Python package and project manager written in Rust.

**Step-1: Install uv**

Follow the `official Astral installation documentation `_ to install uv.

**Step-2: Set up Git LFS on your system**

This project uses Git LFS to manage large files such as datasets. Without Git LFS, cloning the repository downloads only small text pointers instead of the actual large files, which causes errors when you run the benchmarking scripts.

Example error message when Git LFS is not installed:

.. code-block:: text

    IOException: IO Error: The file "train_dataset_severson.db" exists, but it is not a valid DuckDB database file!

If you are using Git LFS for the first time, follow the `Git LFS installation guide `_ to install it.

.. code-block:: bash

    # On Ubuntu/Debian
    sudo apt install git-lfs

    # Then initialize Git LFS on your system
    git lfs install

**Step-3: Clone osbad from the GitHub Repository**

* Clone the osbad repository to access the example notebooks and scripts
* Sync dependencies and activate the virtual environment
* Pull the large dataset files with Git LFS

.. code-block:: bash

    # Clone the osbad repository to access the example notebooks and scripts
    git clone git@github.com:meichinpang/osbad.git

    # Change into the cloned osbad repository
    cd osbad

    # Sync dependencies
    uv sync

    # Activate the virtual environment
    # On macOS/Linux:
    source .venv/bin/activate
    # On Windows (Command Prompt):
    # .venv\Scripts\activate.bat
    # On Windows (PowerShell):
    # .venv\Scripts\Activate.ps1

    # Pull the large dataset files with Git LFS
    git lfs pull

To test the ``osbad`` installation, run the script ``test_osbad_installation.py`` in the root directory of the project. This script imports the osbad package and prints the current version to confirm that the installation succeeded.

.. code-block:: bash

    python test_osbad_installation.py

If the installation is successful, you should see output similar to the following:

.. code-block:: text

    Hello from osbad!
    osbad current version: X.Y.Z
    OSBAD package installation is successful!

You can then start Jupyter Notebook:

.. code-block:: bash

    jupyter notebook

In the Jupyter browser UI, navigate to the following notebook to run the osbad workflow for the Isolation Forest model on the Severson dataset:

``machine_learning/hp_tuning_with_transfer_learning/severson_data_source/01_train_dataset/ml_01_iforest_hyperparam_severson.ipynb``

Typical Workflow
-------------------

The notebook ``ml_01_iforest_hyperparam_severson.ipynb`` demonstrates the end-to-end osbad workflow for the Isolation Forest model:

1. **Import libraries**: Load ``osbad``, ``optuna``, ``duckdb``, ``pandas``, and ``matplotlib``.
2. **Load the benchmarking dataset**: Connect to the Severson training dataset (``train_dataset_severson.db``) via DuckDB and select a cell for analysis.
3. **Drop true labels**: Remove ground-truth anomaly labels from the dataset to simulate an unsupervised setting.
4. **Plot raw cycle data**: Visualize discharge capacity vs. voltage curves without labels.
5. **Load features database**: Import pre-computed features from ``train_features_severson.db``.
6. **Hyperparameter tuning**: Run Bayesian optimization with Optuna (TPE sampler, 20 trials) to find the best Isolation Forest hyperparameters (``contamination``, ``n_estimators``, ``max_samples``, ``threshold``).
7. **Aggregate best trials**: Extract the median hyperparameters from the Pareto-optimal trials and export them to CSV.
8. **Train the model**: Fit an Isolation Forest with the best trial parameters and predict anomaly scores.
9. **Visualize anomaly score map**: Plot the decision boundary and predicted outliers in feature space.
10. **Evaluate model performance**: Generate a confusion matrix and compute performance metrics (accuracy, precision, recall, F1-score, and Matthews correlation coefficient).
11. **Export evaluation metrics**: Save the model performance results to a CSV file.
12. **Verify with true labels**: Compare predicted outliers against the ground-truth labels using cycle plots and bubble charts.

Issues and Troubleshooting
----------------------------

* If you encounter errors related to missing files or invalid database files, ensure that Git LFS is installed and that you have pulled the large dataset files with ``git lfs pull``.
* If you see errors about missing Python packages, make sure you have run ``uv sync`` to install all dependencies and have activated the virtual environment with ``source .venv/bin/activate`` (or the appropriate command for your operating system).
* For any other issues, please open an issue on the `OSBAD Issue Tracker `_ with details about the error message and your system configuration.

Next Steps
--------------

* See :doc:`doc_02_dataset_overview` for an overview of the datasets included in this benchmarking project.
* Explore :doc:`doc_03_models_overview` for details on the models included in this benchmarking study.

.. * Run comparisons in :doc:`doc_04_benchmarking` to evaluate model performance.
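
For the invalid-database error listed under Issues and Troubleshooting, one quick check is whether a dataset file is still a Git LFS text pointer rather than the real database. The following is a minimal sketch, assuming the standard Git LFS pointer format, whose first line is ``version https://git-lfs.github.com/spec/v1``. The helper name ``is_lfs_pointer`` is hypothetical and is not part of the osbad package.

.. code-block:: python

    from pathlib import Path

    # First line of a Git LFS pointer file (assumes the v1 pointer format).
    LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


    def is_lfs_pointer(path):
        """Return True if the file at ``path`` starts like a Git LFS pointer.

        Hypothetical helper for troubleshooting; not part of osbad.
        """
        with Path(path).open("rb") as handle:
            # Read only as many bytes as the prefix needs, so large
            # database files are not loaded into memory.
            head = handle.read(len(LFS_POINTER_PREFIX))
        return head == LFS_POINTER_PREFIX

If ``is_lfs_pointer`` returns ``True`` for a file such as ``train_dataset_severson.db`` (adjust the path to wherever the file lives in your checkout), the real data has probably not been downloaded yet, and running ``git lfs pull`` again is a reasonable next step.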