osbad.hyperparam¶

Hyperparameter tuning utilities for PyOD models with Optuna.

This module provides building blocks to define search spaces, create model instances, run Optuna studies, summarize/aggregate best trials, visualize Pareto fronts (recall vs. precision), and export results to CSV files. It covers six anomaly-detection models: Isolation Forest, KNN, GMM, LOF, PCA, and AutoEncoder.

Key features:

ModelConfigDataClass: Frozen dataclass bundling the Optuna search-space function (hp_space), the model factory that uses tuned hyperparameters (model_param), the model factory that uses no hyperparameter tuning (baseline_model_param), and the probability column index for outliers (proba_col).
MODEL_CONFIG: Registry mapping model IDs (iforest, knn, gmm, lof, pca, autoencoder) to their ModelConfigDataClass.
objective: Generic Optuna objective function that handles both supervised and unsupervised hyperparameter tuning. It samples parameters, builds the model, predicts outliers, and returns performance metrics.
curvature: Calculates the curvature of the loss score vs. inlier score curve at each Pareto-optimal point.
trade_off_trials_detection: Finds the best compromise trials at the point of maximum curvature (the elbow point) of the two objectives.
aggregate_param_method: Aggregate a list of values via median, mean, median_int, mean_int, or mode (ties in mode are broken deterministically by first occurrence).
aggregate_best_trials: Collect parameters from study.best_trials and produce a single-row DataFrame of aggregated hyperparameters.
evaluate_hp_perfect_score_pct: Compute the percentage of trials with perfect recall and precision (value == 1) and log per-trial scores.
plot_pareto_front: Plot the Pareto front (recall vs. precision), annotate perfect-score percentages, and save to the artifacts folder.
export_current_hyperparam: Write the best hyperparameters for a cell to a CSV (existing rows for that cell are replaced or kept according to if_exists) and return the updated DataFrame.
export_current_model_metrics: Write evaluation metrics for a (model, cell) pair to a CSV (existing rows for that pair are replaced or kept according to if_exists) and return the updated DataFrame.

Configuration:

RANDOM_STATE: Shared random seed used by model factories.
bconf.PIPELINE_OUTPUT_DIR: Base directory where per-cell figures are saved.

import osbad.hyperparam as hp

Module Contents¶

osbad.hyperparam.RANDOM_STATE = 42¶

osbad.hyperparam.HpSpaceFuncType¶

A type alias for a function that takes an optuna.trial.Trial as input and returns a dictionary mapping hyperparameter names (str) to their suggested values (Any).

Example

from typing import Any, Dict

import optuna

def knn_hp_space(trial: optuna.trial.Trial) -> Dict[str, Any]:

    # Each entry samples one hyperparameter for the current trial.
    hyperparam_dict = {
        "contamination": trial.suggest_float(
            "contamination", 0.0, 0.5),
        "n_neighbors": trial.suggest_int(
            "n_neighbors", 2, 50, step=2),
        "method": trial.suggest_categorical(
            "method", ["largest", "mean", "median"]),
        "metric": trial.suggest_categorical(
            "metric", ["minkowski", "euclidean", "manhattan"]),
        "threshold": trial.suggest_float(
            "threshold", 0.0, 1.0),
    }

    return hyperparam_dict

osbad.hyperparam.ModelParamFuncType¶

A type alias for a function that takes a dictionary of hyperparameters (Dict[str, Any]) as input and returns a model instance such as KNN, IForest, or GMM (typed as Any).

Example

from typing import Any, Dict

from pyod.models.knn import KNN

input_hyperparam_dict = {
    "contamination": 0.1,
    "n_neighbors": 10,
    "method": "mean",
    "metric": "euclidean",
    "threshold": 0.5,
}

def knn_model_param(param: Dict[str, Any]) -> Any:

    # "threshold" is sampled but not passed to the KNN constructor here.
    model_instance = KNN(
        contamination=param["contamination"],
        n_neighbors=param["n_neighbors"],
        method=param["method"],
        metric=param["metric"],
        n_jobs=-1,
    )

    return model_instance

model = knn_model_param(input_hyperparam_dict)

# model is equivalent to:
# KNN(contamination=0.1, n_neighbors=10, method="mean",
# metric="euclidean", n_jobs=-1)

osbad.hyperparam.PyODModelType¶

Type alias for supported PyOD anomaly detection models.

This alias unifies the set of PyOD estimators commonly used in the benchmarking pipeline, enabling consistent type annotations and improved readability.

Supported models:

IForest: Isolation Forest model.
KNN: K-Nearest Neighbors–based outlier detector.
GMM: Gaussian Mixture Model for density-based detection.
LOF: Local Outlier Factor for neighborhood-based detection.
PCA: Principal Component Analysis for subspace-based detection.
AutoEncoder: Neural network–based autoencoder for reconstruction-based detection.

class osbad.hyperparam.ModelConfigDataClass¶

Immutable container class for the model configuration.

Stores the search space function for Optuna trials, the model factory function, an optional baseline factory that builds the model without tuning, and the probability column index used for PyOD estimators.

hp_space: HpSpaceFuncType¶: Function that defines the hyperparameter search space for an Optuna trial.

model_param: ModelParamFuncType¶: Function that builds a model instance from a set of hyperparameters.

baseline_model_param: Callable[[], PyODModelType] | None = None¶: Function that builds a model instance without hyperparameter tuning.

proba_col: int = 1¶: Index of the probability column in PyOD estimators. Column 0 is the inlier probability, and column 1 is the outlier probability. Defaults to 1.
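The immutability described above can be illustrated with a minimal sketch built on the standard-library dataclasses module. ModelConfigSketch, toy_hp_space, and toy_model_param are hypothetical stand-ins that only mirror the documented field names; they are not the module's own definitions.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class ModelConfigSketch:
    """Hypothetical stand-in mirroring the documented fields."""
    hp_space: Callable[[Any], Dict[str, Any]]
    model_param: Callable[[Dict[str, Any]], Any]
    baseline_model_param: Optional[Callable[[], Any]] = None
    proba_col: int = 1  # column 1 holds the outlier probability


def toy_hp_space(trial: Any) -> Dict[str, Any]:
    # A real search space would call trial.suggest_* here.
    return {"n_neighbors": 10}


def toy_model_param(param: Dict[str, Any]) -> Any:
    # A real factory would return a PyOD estimator instance.
    return ("toy-model", param["n_neighbors"])


config = ModelConfigSketch(hp_space=toy_hp_space, model_param=toy_model_param)

# Frozen dataclasses reject attribute assignment after construction.
try:
    config.proba_col = 0
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Freezing the container means a configuration shared through a registry cannot be altered by accident between trials.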

osbad.hyperparam.DataSource¶

osbad.hyperparam.DATA_SOURCE: DataSource¶

osbad.hyperparam.grab(model: str)¶

osbad.hyperparam.IFOREST_HP_CONFIG¶

osbad.hyperparam.KNN_HP_CONFIG¶

osbad.hyperparam.GMM_HP_CONFIG¶

osbad.hyperparam.LOF_HP_CONFIG¶

osbad.hyperparam.PCA_HP_CONFIG¶

osbad.hyperparam.AUTOENCODER_HP_CONFIG¶

osbad.hyperparam.MODEL_CONFIG: Dict[str, ModelConfigDataClass]¶

Dictionary mapping model identifiers to their configurations.

Each entry contains a ModelConfigDataClass object that defines the search space for Optuna hyperparameter optimization (hp_space) and a factory function (model_param) to create the corresponding model with the chosen hyperparameters.

The following model identifiers are supported:

“iforest”: Isolation Forest
“knn”: k-Nearest Neighbors
“gmm”: Gaussian Mixture Model
“lof”: Local Outlier Factor
“pca”: Principal Component Analysis
“autoencoder”: AutoEncoder

Parameters:

key (str) – Model identifier.
value (ModelConfigDataClass) – Configuration that defines the hyperparameter search space and the factory that instantiates the corresponding model from the sampled hyperparameters.
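The lookup pattern such a registry enables can be sketched as follows. This is an illustration only: REGISTRY_SKETCH, ConfigSketch, StubTrial, and build_from_registry are hypothetical names, and the stub trial returns fixed values instead of sampling so that the example runs without Optuna or PyOD.

```python
from typing import Any, Callable, Dict, NamedTuple


class ConfigSketch(NamedTuple):
    hp_space: Callable[[Any], Dict[str, Any]]
    model_param: Callable[[Dict[str, Any]], Any]


class StubTrial:
    """Hypothetical stand-in for an Optuna trial; returns preset values."""

    def __init__(self, values: Dict[str, Any]) -> None:
        self._values = values

    def suggest_int(self, name: str, low: int, high: int) -> int:
        return self._values[name]


REGISTRY_SKETCH: Dict[str, ConfigSketch] = {
    "knn": ConfigSketch(
        hp_space=lambda trial: {
            "n_neighbors": trial.suggest_int("n_neighbors", 2, 50)},
        model_param=lambda p: {"model": "knn", **p},
    ),
}


def build_from_registry(model_id: str, trial: Any) -> Any:
    if model_id not in REGISTRY_SKETCH:
        raise KeyError(f"Unknown model_id: {model_id!r}")
    cfg = REGISTRY_SKETCH[model_id]
    params = cfg.hp_space(trial)      # sample hyperparameters
    return cfg.model_param(params)    # build the model from them


model = build_from_registry("knn", StubTrial({"n_neighbors": 12}))
```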

osbad.hyperparam.objective(trial: optuna.trial.Trial, model_id: Literal['iforest', 'knn', 'gmm', 'lof', 'pca', 'autoencoder'], df_feature_dataset: fireducks.pandas.DataFrame, selected_feature_cols: list, selected_cell_label: str, df_benchmark_dataset: fireducks.pandas.DataFrame | None = None) → Tuple[float, float]¶

Optimize model hyperparameters using an Optuna trial.

This function evaluates a given anomaly detection model by sampling hyperparameters from the trial, training the model, predicting outliers, and computing evaluation metrics. If a benchmark dataset is provided, recall and precision are computed. Otherwise, proxy evaluation metrics are used based on cycle index and model input features.

Parameters:

trial (optuna.trial.Trial) – Optuna trial object used to suggest hyperparameters.
model_id (Literal) – Identifier of the model to optimize. Must be one of “iforest”, “knn”, “gmm”, “lof”, “pca”, or “autoencoder”.
df_feature_dataset (pd.DataFrame) – Feature dataset containing model input features.
selected_feature_cols (list) – List of selected feature column names.
selected_cell_label (str) – Label identifying the cell for which the model is being trained.
df_benchmark_dataset (Optional[pd.DataFrame]) – Benchmark dataset used to evaluate predicted outliers. If None, proxy evaluation is performed.

Returns:

If benchmark dataset is provided, returns recall and precision scores. Otherwise, returns loss score and inliers score from proxy evaluation.

Return type:

Tuple[float, float]
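For the benchmark case, the two returned scores can be pictured with a small sketch. It assumes outliers are identified by cycle index and compared as sets; recall_precision is a hypothetical helper, the zero-division convention is an assumption, and the module may compute these metrics differently.

```python
from typing import Iterable, Tuple


def recall_precision(
    predicted: Iterable[int], benchmark: Iterable[int]
) -> Tuple[float, float]:
    """Recall and precision of predicted outlier indices vs. a benchmark."""
    pred, true = set(predicted), set(benchmark)
    true_positives = len(pred & true)
    # Returning 0.0 for an empty set is an assumption of this sketch.
    recall = true_positives / len(true) if true else 0.0
    precision = true_positives / len(pred) if pred else 0.0
    return recall, precision


# Toy cycle indices: two of the four benchmark outliers are found,
# and one of the three predictions is a false positive.
scores = recall_precision(predicted=[3, 40, 41], benchmark=[40, 41, 147, 148])
```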

Example

import optuna
import osbad.hyperparam as hp

# Use the TPESampler from optuna
sampler = optuna.samplers.TPESampler(seed=42)

# Create a study to maximize recall and precision score
study = optuna.create_study(
    sampler=sampler,
    directions=["maximize","maximize"])

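# Optimize the hyperparameters for iforest using 20 trials.
# selected_feature_cols and selected_cell_label are assumed to be
# defined beforehand.
study.optimize(
    lambda tr: hp.objective(
        tr,
        model_id="iforest",
        df_feature_dataset=df_features_per_cell,
        selected_feature_cols=selected_feature_cols,
        selected_cell_label=selected_cell_label,
        df_benchmark_dataset=df_selected_cell),
    n_trials=20)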

osbad.hyperparam.trade_off_trials_detection(study: optuna.study.Study, window_size: int = 5) → List[optuna.trial.FrozenTrial]¶

Identifies the most representative Pareto-optimal trials based on curvature analysis of the loss_score vs. inlier_score trade-off curve. The curvature-based selection helps identify the “elbow point” in the trade-off curve, which often represents the best balance between minimizing regression loss and maximizing inlier retention.

This function performs the following steps:

1. Sorts Pareto-optimal trials in descending order of loss_score.
2. Extracts loss_score and inlier_score values from the sorted trials.
3. Computes the curvature of the smoothed loss vs. inlier score plot to identify the point of maximum curvature (the elbow point).
4. Selects the trial at the elbow point as the optimal trade-off between model performance and data retention.
5. Returns all trials that share the same loss_score and inlier_score as the identified optimal trial.

Parameters:

study (optuna.study.study.Study) – An Optuna study object containing multiple trials, including Pareto-optimal ones.
window_size (int, optional) – Size of the smoothing window used for the curvature computation. Defaults to 5.

Returns:

A list of trials that match the optimal trade-off point, determined by the maximum curvature in the loss vs. inlier score plot.

Return type:

List[optuna.trial.FrozenTrial]
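The selection steps can be sketched on plain tuples as follows. This is a simplified illustration, not the module's implementation: select_trade_off is a hypothetical helper, the curvature computation is delegated to a caller-supplied function, and picking the largest absolute curvature is an assumption about how the maximum is taken.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]  # (loss_score, inlier_score)


def select_trade_off(
    points: Sequence[Point],
    curvature_fn: Callable[[List[float], List[float]], Sequence[float]],
) -> List[int]:
    """Return indices of all points that share the scores at the elbow."""
    # Step 1: order the indices by descending loss_score.
    order = sorted(
        range(len(points)), key=lambda i: points[i][0], reverse=True)
    # Step 2: extract the two score series in that order.
    loss = [points[i][0] for i in order]
    inlier = [points[i][1] for i in order]
    # Step 3: curvature along the curve, from the supplied callable.
    kappa = curvature_fn(loss, inlier)
    # Step 4: position of the largest absolute curvature (an assumption).
    elbow = max(range(len(kappa)), key=lambda k: abs(kappa[k]))
    target = (loss[elbow], inlier[elbow])
    # Step 5: every point with identical scores, in original index order.
    return [i for i, p in enumerate(points) if p == target]


# Toy Pareto points; indices 1 and 4 carry identical scores.
toy_points = [(0.9, 10.0), (0.4, 8.0), (0.2, 3.0), (0.6, 9.5), (0.4, 8.0)]


def toy_kappa(xs: List[float], ys: List[float]) -> List[float]:
    # Placeholder curvature values chosen by hand for the sorted curve.
    return [0.0, 0.1, 0.8, 0.8, 0.05]


selected = select_trade_off(toy_points, toy_kappa)
```

Returning every trial with the same pair of scores keeps duplicates that differ only in their hyperparameters, which a later aggregation step can then reduce to one value per parameter.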

osbad.hyperparam.Agg¶

Type alias for the aggregation methods.

Represents allowed strategies for aggregating a list of values:

“median”: Returns the median as a float.

“mean”: Returns the mean as a float.

“median_int”: Returns the median as an integer.

“mean_int”: Returns the mean as an integer.

“mode”: Returns the most frequent value.

osbad.hyperparam.aggregate_param_method(values: List[Any], how: Agg)¶

Aggregate a list of values using the given method.

Supports median, mean, median as integer, mean as integer, and mode. Raises ValueError if an unsupported method is provided.

Parameters:

values (List[Any]) – List of values to aggregate.
how (Agg) – Aggregation method, one of median, mean, median_int, mean_int or mode.

Returns:

Aggregated result based on the specified method.

Return type:

Any

Raises:

ValueError – If how is not a supported aggregation method.

Example

>>> aggregate_param_method([500, 300, 250, 400, 200], "median")
300.0
>>> aggregate_param_method([500, 300, 250, 400, 200], "median_int")
300
>>> aggregate_param_method(
...     ['manhattan', 'manhattan', 'euclidean',
...      'manhattan', 'minkowski'], "mode")
'manhattan'

Note

If several values tie for most frequent, for example method = ['largest','largest','median','median','mean'], the tied value that occurs first in the list (here largest) is chosen.
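A minimal sketch of the five aggregation methods, including the first-occurrence tie-breaking described in the note, might look like this. aggregate_sketch is a hypothetical re-implementation using only the standard library; the rounding applied by the integer variants is an assumption.

```python
from collections import Counter
from statistics import mean, median
from typing import Any, List


def aggregate_sketch(values: List[Any], how: str) -> Any:
    if how == "median":
        return float(median(values))
    if how == "mean":
        return float(mean(values))
    if how == "median_int":
        return int(round(median(values)))  # rounding rule is an assumption
    if how == "mean_int":
        return int(round(mean(values)))    # rounding rule is an assumption
    if how == "mode":
        counts = Counter(values)
        top = max(counts.values())
        # The first value in input order that reaches the top count wins.
        return next(v for v in values if counts[v] == top)
    raise ValueError(f"Unsupported aggregation method: {how!r}")


tie_result = aggregate_sketch(
    ["largest", "largest", "median", "median", "mean"], "mode")
```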

osbad.hyperparam.aggregate_best_trials(best_trials: List[optuna.trial.FrozenTrial], cell_label: str, model_id: str, schema: Dict[str, Agg]) → fireducks.pandas.DataFrame¶

Aggregate parameters from the best Optuna trials.

Collects hyperparameters from the best trials of a study and aggregates them using rules defined in the schema. Each parameter is reduced to a single representative value using one of the Agg methods, such as median, median_int, or mode.

Parameters:

best_trials (List[optuna.trial.FrozenTrial]) – A list of best trials obtained using Pareto optimization or the additional curvature analysis step in case of the proxy hyperparameter tuning method.
cell_label (str) – Identifier for the experimental cell.
model_id (str) – Identifier of the ML-model. Allowed values are “iforest”, “knn”, “gmm”, “lof”, “pca”, “autoencoder”.
schema (Dict[str, Agg]) – Mapping of parameter names to aggregation strategies. Values must be Agg methods, for example “median”, “median_int”, or “mode”.

Returns:

A single-row DataFrame containing the model ID, cell label, and aggregated hyperparameters.

Return type:

pd.DataFrame
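The reduction to a single row can be sketched with plain dictionaries. In this illustration each dictionary stands in for the params of one trial, aggregate_rows and REDUCERS are hypothetical names, the output keys model_id and cell_label are assumed column names, and a dict is returned instead of a DataFrame to keep the example dependency-free.

```python
from statistics import median, mode
from typing import Any, Callable, Dict, List

# Hypothetical reducers keyed by three of the documented method names.
REDUCERS: Dict[str, Callable[[List[Any]], Any]] = {
    "median": lambda v: float(median(v)),
    "median_int": lambda v: int(round(median(v))),
    "mode": mode,
}


def aggregate_rows(
    trial_params: List[Dict[str, Any]],
    cell_label: str,
    model_id: str,
    schema: Dict[str, str],
) -> Dict[str, Any]:
    row: Dict[str, Any] = {"model_id": model_id, "cell_label": cell_label}
    for name, how in schema.items():
        # Gather this parameter across all trials, then reduce it.
        values = [params[name] for params in trial_params]
        row[name] = REDUCERS[how](values)
    return row


toy_params = [
    {"contamination": 0.1, "n_neighbors": 10, "metric": "euclidean"},
    {"contamination": 0.2, "n_neighbors": 14, "metric": "manhattan"},
    {"contamination": 0.3, "n_neighbors": 12, "metric": "manhattan"},
]
toy_schema = {
    "contamination": "median", "n_neighbors": "median_int", "metric": "mode"}
row = aggregate_rows(toy_params, "cell_A", "knn", toy_schema)
```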

Example

schema_knn = {
    "contamination": "median",
    "n_neighbors": "median_int",
    "method": "mode",
    "metric": "mode",
    "threshold": "median"}

df_knn = hp.aggregate_best_trials(
    study.best_trials,
    cell_label=selected_cell_label,
    model_id="knn",
    schema=schema_knn)

osbad.hyperparam.curvature(target_x: List[float] | numpy.ndarray, target_y: List[float] | numpy.ndarray, window_size: int = 5) → numpy.ndarray¶

Calculates the curvature values of a smoothed loss_score vs inlier_score plot.

This function estimates the curvature of a 2D curve defined by target_x and target_y, which represents the trade-off between regression loss and inlier count in outlier detection evaluation. The curvature is computed after applying a uniform smoothing filter to reduce noise.

Parameters:

target_x (Union[List[float], np.ndarray]) – List or array of x-values.
target_y (Union[List[float], np.ndarray]) – List or array of y-values.
window_size (int, optional) – Size of the uniform smoothing window. Defaults to 5.

Returns:

Array of curvature values at each point on the smoothed curve. These values indicate how sharply the curve bends at each location.

Return type:

numpy.ndarray

Note

A smoothing window is applied to reduce noise before computing gradients.

The curvature is calculated using the standard 2D curvature formula:

κ = (dx * ddy - dy * ddx) / (dx² + dy²)^(3/2)

Division by zero is safely handled using NumPy’s error state management.
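The computation described above can be sketched with NumPy as follows. curvature_sketch is a hypothetical re-implementation, not the module's code: the moving-average smoothing with edge padding and the replacement of undefined values by zero are assumptions made for this example.

```python
import numpy as np


def curvature_sketch(target_x, target_y, window_size=5):
    """Signed curvature of a 2D curve after uniform smoothing."""
    x = np.asarray(target_x, dtype=float)
    y = np.asarray(target_y, dtype=float)

    # Uniform (moving-average) filter; edge padding keeps the length.
    kernel = np.ones(window_size) / window_size
    pad = ((window_size - 1) // 2, window_size // 2)
    xs = np.convolve(np.pad(x, pad, mode="edge"), kernel, mode="valid")
    ys = np.convolve(np.pad(y, pad, mode="edge"), kernel, mode="valid")

    # First and second derivatives with respect to the point index.
    dx, dy = np.gradient(xs), np.gradient(ys)
    ddx, ddy = np.gradient(dx), np.gradient(dy)

    # Suppress warnings where dx and dy are both zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        kappa = (dx * ddy - dy * ddx) / (dx ** 2 + dy ** 2) ** 1.5

    # Mapping undefined values to 0 is an assumption of this sketch.
    return np.nan_to_num(kappa, nan=0.0, posinf=0.0, neginf=0.0)


# A quarter circle of radius 2, traversed counterclockwise, has
# curvature 1/2 away from its end points (no smoothing applied here).
theta = np.linspace(0.0, np.pi / 2, 200)
kappa = curvature_sketch(
    2.0 * np.cos(theta), 2.0 * np.sin(theta), window_size=1)
```

Because the formula is invariant under reparameterization, differentiating with respect to the point index is sufficient; the spacing between points does not need to be known.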

osbad.hyperparam.plot_proxy_pareto_front(model_study: optuna.study.study.Study, selected_cell_label: str, fig_title: str) → None¶

Plot and save the Pareto front of the proxy evaluation metrics, i.e., loss score vs. inlier count score.

This function generates a Pareto front plot from an Optuna study that optimizes for both regression loss and predicted inlier count. The plot includes an annotation marking the best compromise (trade-off) solution, found at the knee point of the curve among the Pareto-optimal trials.

Parameters:

model_study (optuna.study.study.Study) – Optuna study object containing trials with loss score and inlier score as objectives.
selected_cell_label (str) – Identifier for the evaluated cell, used to generate the output file path.
fig_title (str) – Title of the plot and basis for the output file name.

Returns:

The function saves the Pareto front plot as a PNG file in the artifacts directory associated with the selected cell.

Return type:

None

Example

hp.plot_proxy_pareto_front(
    if_study,
    selected_cell_label,
    fig_title="Isolation Forest Pareto Front")

osbad.hyperparam.evaluate_hp_perfect_score_pct(model_study: optuna.study.study.Study, output_log_bool: bool = True)¶

Evaluate the percentage of trials with perfect recall and precision scores.

This function analyzes an Optuna study and calculates the percentage of trials that achieved a perfect recall score of 1 and a perfect precision score of 1. Trial-level recall and precision values are logged for inspection. The function provides an overview of how often hyperparameter trials reach ideal performance.

Parameters:

model_study (optuna.study.study.Study) – Optuna study object containing hyperparameter optimization trials. Each trial is expected to have recall and precision as its objective values.
output_log_bool (bool, optional) – If True, enables logging output. Defaults to True.

Returns:

A tuple containing:

recall_score_pct (float): Percentage of trials with recall=1.
precision_score_pct (float): Percentage of trials with precision=1.

Return type:

Tuple[float, float]
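The two percentages can be illustrated with a short sketch on plain tuples, where each tuple stands in for the (recall, precision) objective values of one trial. perfect_score_pct is a hypothetical helper, and returning zeros for an empty study is an assumption.

```python
from typing import Sequence, Tuple


def perfect_score_pct(
    trial_values: Sequence[Tuple[float, float]]
) -> Tuple[float, float]:
    """Percentage of trials with recall == 1 and with precision == 1."""
    n = len(trial_values)
    if n == 0:
        return 0.0, 0.0  # convention for an empty study (assumption)
    recall_hits = sum(1 for recall, _ in trial_values if recall == 1)
    precision_hits = sum(
        1 for _, precision in trial_values if precision == 1)
    return 100.0 * recall_hits / n, 100.0 * precision_hits / n


# Four toy trials: three reach perfect recall, one perfect precision.
pcts = perfect_score_pct([(1.0, 1.0), (1.0, 0.5), (1.0, 0.8), (0.5, 0.5)])
```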

Example

import optuna
import osbad.hyperparam as hp

sampler = optuna.samplers.TPESampler(seed=42)

selected_feature_cols = [
    "log_max_diff_dQ",
    "log_max_diff_dV"]

if_study = optuna.create_study(
    study_name="iforest_hyperparam",
    sampler=sampler,
    directions=["maximize","maximize"])

if_study.optimize(
    lambda trial: hp.objective(
        trial,
        model_id="iforest",
        df_feature_dataset=df_features_per_cell,
        selected_feature_cols=selected_feature_cols,
        df_benchmark_dataset=df_selected_cell,
        selected_cell_label=selected_cell_label),
    n_trials=20)

(recall_score_pct,
precision_score_pct) = hp.evaluate_hp_perfect_score_pct(
    model_study=if_study)

osbad.hyperparam.plot_pareto_front(model_study: optuna.study.study.Study, selected_cell_label: str, fig_title: str, output_log_status: bool = False) → None¶

Plot and save the Pareto front of recall vs. precision scores.

This function generates a Pareto front plot from an Optuna study that optimizes for both recall and precision. The plot includes an annotation showing the percentage of trials with perfect recall and precision scores. The figure is customized with labels, legends, and formatting before being saved as a PNG file.

Parameters:

model_study (optuna.study.study.Study) – Optuna study object containing trials with recall and precision scores as objectives.
selected_cell_label (str) – Identifier for the evaluated cell, used to generate the output file path.
fig_title (str) – Title of the plot and basis for the output file name.
output_log_status (bool, optional) – If True, enables logging of intermediate evaluation steps. Defaults to False.

Returns:

The function saves the Pareto front plot as a PNG file in the artifacts directory associated with the selected cell.

Return type:

None
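What the plotted Pareto front contains can be shown with a small sketch that filters non-dominated points when both objectives are maximized. pareto_front is a hypothetical helper operating on plain (recall, precision) tuples; in a real study, Optuna exposes the Pareto-optimal trials as study.best_trials.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (recall, precision)


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Return the non-dominated points when both objectives are maximized."""
    front = []
    for p in points:
        # p is dominated if a different point is at least as good in both
        # objectives.
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return front


front = pareto_front(
    [(1.0, 0.4), (0.8, 0.8), (0.6, 0.7), (0.5, 1.0), (0.8, 0.6)])
```

In this toy set, (0.6, 0.7) and (0.8, 0.6) are both dominated by (0.8, 0.8), so only the remaining three points lie on the front.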

Example

hp.plot_pareto_front(
    if_study,
    selected_cell_label,
    fig_title="Isolation Forest Pareto Front")

osbad.hyperparam.export_current_hyperparam(df_best_param_current_cell: fireducks.pandas.DataFrame, selected_cell_label: str, export_csv_filepath: pathlib.PosixPath | str, if_exists: Literal['replace', 'keep'] = 'replace', output_log_bool: bool = True)¶

Export best hyperparameters for a cell to a CSV file.

This function manages the storage of best hyperparameters for a specified cell. If the cell’s hyperparameters already exist in the CSV file, the behavior depends on if_exists: either the rows are replaced or kept. If the cell is not found, a new row is created. Logging tracks the export status and duplication handling.

Parameters:

df_best_param_current_cell (pd.DataFrame) – DataFrame containing the best hyperparameters for the current cell.
selected_cell_label (str) – Identifier for the evaluated cell.
export_csv_filepath (Union[pathlib.PosixPath, str]) – Path to the CSV file where hyperparameters are stored.
if_exists (Literal["replace", "keep"], optional) – Action to take if the cell already exists in the CSV. Defaults to "replace".
output_log_bool (bool, optional) – If True, enables logging output. Defaults to True.

Returns:

Updated DataFrame containing hyperparameters from both existing records and the current cell.

Return type:

pd.DataFrame

Raises:

ValueError – If if_exists is not "replace" or "keep".
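The replace-or-keep behavior can be sketched on in-memory rows as follows. upsert_rows is a hypothetical helper, the cell_label column name is an assumption, and reading and writing the CSV file is left out to keep the example short.

```python
from typing import Dict, List

Row = Dict[str, object]


def upsert_rows(
    existing: List[Row],
    new_rows: List[Row],
    cell_label: str,
    if_exists: str = "replace",
    key: str = "cell_label",
) -> List[Row]:
    if if_exists not in ("replace", "keep"):
        raise ValueError("if_exists must be 'replace' or 'keep'")
    already_present = any(r[key] == cell_label for r in existing)
    if not already_present:
        return existing + new_rows  # new cell: append its rows
    if if_exists == "keep":
        return list(existing)       # leave the stored rows untouched
    # "replace": drop the stored rows for this cell, then append new ones.
    return [r for r in existing if r[key] != cell_label] + new_rows


stored = [
    {"cell_label": "A", "n_neighbors": 10},
    {"cell_label": "B", "n_neighbors": 20},
]
fresh = [{"cell_label": "A", "n_neighbors": 14}]
replaced = upsert_rows(stored, fresh, "A", if_exists="replace")
kept = upsert_rows(stored, fresh, "A", if_exists="keep")
```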

Example

# Export current hyperparameters to CSV
hyperparam_filepath = PIPELINE_OUTPUT_DIR.joinpath(
    "hyperparams_autoencoder_tohoku.csv")

hp.export_current_hyperparam(
    df_autoencoder_hyperparam,
    selected_cell_label,
    export_csv_filepath=hyperparam_filepath,
    if_exists="replace")

osbad.hyperparam.export_current_model_metrics(model_name: str, selected_cell_label: str, df_current_eval_metrics: fireducks.pandas.DataFrame, export_csv_filepath: pathlib.PosixPath | str, if_exists: Literal['replace', 'keep'] = 'replace', output_log_bool: bool = True)¶

Export current model evaluation metrics to a CSV file.

This function checks if the evaluation metrics for a given model and cell are already stored in the CSV file. Based on the if_exists flag, it either replaces existing rows, keeps them unchanged, or creates a new entry. Logging tracks the status of the export process.

Parameters:

model_name (str) – Name of the machine learning model.
selected_cell_label (str) – Identifier for the evaluated cell.
df_current_eval_metrics (pd.DataFrame) – DataFrame containing current evaluation metrics to be exported.
export_csv_filepath (Union[pathlib.PosixPath, str]) – Path to the CSV file where evaluation metrics are stored.
if_exists (Literal["replace", "keep"], optional) – Action to take if the model-cell combination already exists in the CSV. Defaults to "replace".
output_log_bool (bool, optional) – If True, enables logging output. Defaults to True.

Returns:

Updated DataFrame containing evaluation metrics from both existing and current evaluations.

Return type:

pd.DataFrame

Raises:

ValueError – If if_exists is not "replace" or "keep".

Example

from pathlib import Path

# Export current metrics to CSV
hyperparam_eval_filepath = Path.cwd().joinpath(
    "eval_metrics_hp_single_cell_tohoku.csv")

hp.export_current_model_metrics(
    model_name="iforest",
    selected_cell_label=selected_cell_label,
    df_current_eval_metrics=df_current_eval_metrics,
    export_csv_filepath=hyperparam_eval_filepath,
    if_exists="replace")