Model Logging Made Easy: MLflow vs. Pickle

Using MLflow to log models offers distinct advantages over Pickle. Here’s a detailed look at the benefits:

Managing Library Versions

Problem

Different machine learning models may depend on various versions of the same library, leading to compatibility conflicts. Manually tracking and configuring the correct environment for each model can be tedious and error-prone.

Solution

MLflow automatically logs all dependencies, allowing users to easily recreate the exact environment necessary to run the model. This feature simplifies deployment and enhances reproducibility.
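To appreciate what MLflow automates here, consider what manual dependency tracking looks like. The sketch below uses only the standard library; `snapshot_environment` is a hypothetical helper for illustration, not part of MLflow:

```python
import sys
from importlib.metadata import version, PackageNotFoundError

def snapshot_environment(packages):
    """Record the Python version and each package's installed version."""
    env = {"python": sys.version.split()[0]}
    for pkg in packages:
        try:
            env[pkg] = version(pkg)
        except PackageNotFoundError:
            env[pkg] = "not installed"
    return env

print(snapshot_environment(["scikit-learn", "numpy"]))
```

Doing this by hand for every model, and keeping it in sync with the code, is exactly the chore MLflow's automatic dependency logging removes.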

Documenting Inputs and Outputs

Problem

The expected inputs and outputs of a model are often not well-documented, making it challenging for others to utilize the model correctly.

Solution

MLflow defines a clear schema for inputs and outputs, ensuring that users know precisely what data to provide and what to expect in return. This clarity fosters better collaboration and reduces confusion.
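As a sketch of what a schema enables, the snippet below validates an input array against a declared dtype and shape before prediction. `validate_input` is a hypothetical helper written for illustration, not MLflow's API, but it loosely mirrors the enforcement MLflow applies when a model logged with a signature is invoked through pyfunc:

```python
import numpy as np

def validate_input(data, dtype="int64", shape=(-1, 1)):
    """Reject inputs whose dtype or shape do not match the declared schema.
    A -1 dimension means 'any size', like the shapes in an MLflow signature."""
    if data.dtype != np.dtype(dtype):
        raise TypeError(f"expected dtype {dtype}, got {data.dtype}")
    if len(shape) != data.ndim or any(
        s != -1 and s != d for s, d in zip(shape, data.shape)
    ):
        raise ValueError(f"expected shape {shape}, got {data.shape}")
    return data

validate_input(np.array([[1], [2]], dtype="int64"))  # passes
```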

Example Implementation

To demonstrate the advantages of MLflow, let’s implement a simple logistic regression model and log it.

Logging the Model

import mlflow
from mlflow.models import infer_signature
import numpy as np
from sklearn.linear_model import LogisticRegression

with mlflow.start_run():
    # Train a simple logistic regression model on toy data
    X = np.array([-2, -1, 0, 1, 2, 1]).reshape(-1, 1)
    y = np.array([0, 0, 1, 1, 1, 0])
    lr = LogisticRegression()
    lr.fit(X, y)

    # Infer the input/output schema from the training data and predictions
    signature = infer_signature(X, lr.predict(X))

    # Log the model together with its signature and dependencies
    model_info = mlflow.sklearn.log_model(
        sk_model=lr, artifact_path="model", signature=signature
    )

    print(f"Saving data to {model_info.model_uri}")

This code will output the location where the model is saved:

Saving data to runs:/f8b0fc900aa14cf0ade8d0165c5a9f11/model

Using the Logged Model

To use the logged model later, you can load it with the model_uri:

import mlflow
import numpy as np

# Load the model by its run URI; pyfunc provides a uniform predict() interface
model_uri = "runs:/1e20d72afccf450faa3b8a9806a97e83/model"
sklearn_pyfunc = mlflow.pyfunc.load_model(model_uri=model_uri)

data = np.array([-4, 1, 0, 10, -2, 1]).reshape(-1, 1)
predictions = sklearn_pyfunc.predict(data)

Inspecting Model Artifacts

Check the saved artifacts in mlruns/0/1e20d72afccf450faa3b8a9806a97e83/artifacts/model:

    MLmodel           model.pkl         requirements.txt
    conda.yaml        python_env.yaml

Understanding Model Configuration

The MLmodel file contains critical information about the model, including its dependencies and input/output specifications:

artifact_path: model
flavors:
  python_function:
    env:
      conda: conda.yaml
      virtualenv: python_env.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    predict_fn: predict
    python_version: 3.11.6
  sklearn:
    code: null
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 1.4.1.post1
mlflow_version: 2.15.0
model_size_bytes: 722
model_uuid: e7487bc3c4ab417c965144efcecaca2f
run_id: 1e20d72afccf450faa3b8a9806a97e83
signature:
  inputs: '[{"type": "tensor", "tensor-spec": {"dtype": "int64", "shape": [-1, 1]}}]'
  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "int64", "shape": [-1]}}]'
  params: null
utc_time_created: '2024-08-02 20:58:16.516963'
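Because the signature schemas are stored as JSON strings, they can be read back programmatically. Parsing the inputs entry from the MLmodel file above shows the expected dtype and shape:

```python
import json

# The signature's inputs field, copied verbatim from the MLmodel file above
inputs = json.loads(
    '[{"type": "tensor", "tensor-spec": {"dtype": "int64", "shape": [-1, 1]}}]'
)
spec = inputs[0]["tensor-spec"]
print(spec["dtype"], spec["shape"])  # int64 [-1, 1]
```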

Environment Specifications

The conda.yaml and python_env.yaml files detail the dependencies required for the model, ensuring a consistent runtime environment. Here’s a look at conda.yaml:

# conda.yaml
channels:
- conda-forge
dependencies:
- python=3.11.6
- pip<=24.2
- pip:
  - mlflow==2.15.0
  - cloudpickle==2.2.1
  - numpy==1.23.5
  - psutil==5.9.6
  - scikit-learn==1.4.1.post1
  - scipy==1.11.3
name: mlflow-env

And for python_env.yaml and requirements.txt:

# python_env.yaml
python: 3.11.6
build_dependencies:
- pip==24.2
- setuptools
- wheel==0.40.0
dependencies:
- -r requirements.txt
# requirements.txt
mlflow==2.15.0
cloudpickle==2.2.1
numpy==1.23.5
psutil==5.9.6
scikit-learn==1.4.1.post1
scipy==1.11.3
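One practical payoff of these pinned files is that you can check whether your current environment matches the logged one before loading the model. The sketch below uses only the standard library; `check_pins` is a hypothetical helper, not an MLflow function:

```python
from importlib.metadata import version, PackageNotFoundError

def check_pins(requirements_text):
    """Compare '==' pins against the installed environment.
    Returns a dict mapping package -> (pinned version, installed version or None)."""
    report = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, pinned = line.split("==", 1)
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (pinned, installed)
    return report

# Example with one pin from the logged requirements.txt above
print(check_pins("numpy==1.23.5"))
```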

Learn more about MLflow Models.
