Poetry: A Better Way to Manage Python Dependencies

Motivation

In the early stages of a data science project, dependency management tools such as pip or conda may be sufficient.

However, as the data science project expands, the number of dependencies also increases. This can make it difficult to reproduce the project’s environment and maintain it effectively when relying solely on pip or conda for dependency management.

To address these challenges, Poetry, an open-source library, provides a powerful tool for creating and maintaining Python projects with consistent environments. In this article, we will delve into the advantages of Poetry and highlight its key distinctions from pip and conda.

Ease of Installation

A straightforward installation process allows developers to quickly adopt and integrate a package into their codebase, saving time and effort.

Conda

Conda’s install command varies between packages, because different packages live in different channels. For example, to install polars, you need to run:

conda install -c conda-forge polars

While to install pandas, the command would be:

conda install -c anaconda pandas

Pip

Pip has a consistent installation format for every package:

pip install package-name

Poetry

Poetry follows the same installation format for every package:

poetry add package-name

Available Packages

Having a broad selection of packages makes it easier for developers to find the specific package and version that best suits their needs.

Conda

Some packages, such as snscrape, cannot be installed with conda. Additionally, certain versions, such as pandas 2.0, might not be available through conda.

While you can use pip inside a conda virtual environment to address package limitations, conda cannot track dependencies installed with pip, making dependency management challenging.

$ conda list
# packages in environment at /Users/khuyentran/miniconda3/envs/test-conda:
#
# Name                    Version                   Build  Channel

Pip

Pip can install any packages from the Python Package Index (PyPI) and other repositories.

Poetry

Poetry also allows the installation of packages from the Python Package Index (PyPI) and other repositories.

Number of Dependencies

Reducing the number of dependencies in an environment simplifies the development process.

Conda

Conda provides full environment isolation, managing both Python packages and system-level dependencies. This can result in larger package sizes compared to other package managers, potentially consuming more storage space during installation and distribution.

$ conda install pandas

$ conda list          

# packages in environment at /Users/khuyentran/miniconda3/envs/test-conda:
#
# Name              Version         Build           Channel             
blas                1.0             openblas                          
bottleneck          1.3.5           py311ha0d4635_0                    
bzip2               1.0.8           h620ffc9_4                        
ca-certificates     2023.05.30      hca03da5_0                        
libcxx              14.0.6          h848a8c0_0                        
libffi              3.4.4           hca03da5_0                        
libgfortran         5.0.0           11_3_0_hca03da5_28                 
libgfortran5        11.3.0          h009349e_28                       
libopenblas         0.3.21          h269037a_0                        
llvm-openmp         14.0.6          hc6e5704_0                        
ncurses             6.4             h313beb8_0                        
numexpr             2.8.4           py311h6dc990b_1                    
numpy               1.24.3          py311hb57d4eb_0                    
numpy-base          1.24.3          py311h1d85a46_0                    
openssl             3.0.8           h1a28f6b_0                        
pandas              1.5.3           py311h6956b77_0                    
pip                 23.0.1          py311hca03da5_0                    
python              3.11.3          hb885b13_1                        
python-dateutil     2.8.2           pyhd3eb1b0_0                      
pytz                2022.7          py311hca03da5_0                    
readline            8.2             h1a28f6b_0                        
setuptools          67.8.0          py311hca03da5_0                    
six                 1.16.0          pyhd3eb1b0_1                      
sqlite              3.41.2          h80987f9_0                        
tk                  8.6.12          hb8d0fd4_0                        
tzdata              2023c           h04d1e81_0                        
wheel               0.38.4          py311hca03da5_0                    
xz                  5.4.2           h80987f9_0                        
zlib                1.2.13          h5a0b063_0                        

Pip

Pip installs only the dependencies required by a package.

$ pip install pandas

$ pip list
Package         Version
--------------- -------
numpy           1.24.3
pandas          2.0.2
pip             22.3.1
python-dateutil 2.8.2
pytz            2023.3
setuptools      65.5.0
six             1.16.0
tzdata          2023.3

Poetry

Poetry also installs only the dependencies required by a package.

$ poetry add pandas

$ poetry show

numpy           1.24.3 Fundamental package for array computing in Python
pandas          2.0.2  Powerful data structures for data analysis, time...
python-dateutil 2.8.2  Extensions to the standard Python datetime module
pytz            2023.3 World timezone definitions, modern and historical
six             1.16.0 Python 2 and 3 compatibility utilities
tzdata          2023.3 Provider of IANA time zone data

Uninstall Packages

Uninstalling packages and their dependencies frees up disk space, prevents unnecessary clutter, and optimizes storage resource usage.

Pip

Pip removes only the specified package, not its dependencies, potentially leading to the accumulation of unused dependencies over time. This can result in increased storage space usage and potential conflicts.

$ pip install pandas

$ pip uninstall pandas

$ pip list

Package         Version
--------------- -------
numpy           1.24.3
pip             22.0.4
python-dateutil 2.8.2
pytz            2023.3
setuptools      56.0.0
six             1.16.0
tzdata          2023.3
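The packages left behind are the ones that only pandas needed. Here is a minimal sketch of how a tool could work that out, assuming a hand-written dependency graph. The `DEPENDS_ON` dictionary and both function names are illustrative and are not part of pip:

```python
# Hypothetical dependency graph mirroring the pandas example above.
# Keys are packages; values are the packages they directly require.
DEPENDS_ON = {
    "pandas": {"numpy", "python-dateutil", "pytz", "tzdata"},
    "python-dateutil": {"six"},
    "numpy": set(),
    "pytz": set(),
    "tzdata": set(),
    "six": set(),
}


def reachable(roots, graph):
    """Return every package reachable from the given top-level packages."""
    seen = set()
    stack = list(roots)
    while stack:
        package = stack.pop()
        if package in seen:
            continue
        seen.add(package)
        stack.extend(graph.get(package, ()))
    return seen


def orphans_after_removal(removed, top_level, graph):
    """Packages that nothing needs once `removed` is dropped from `top_level`."""
    before = reachable(top_level, graph)
    after = reachable(set(top_level) - {removed}, graph)
    return before - after - {removed}


leftovers = orphans_after_removal("pandas", {"pandas"}, DEPENDS_ON)
print(sorted(leftovers))
# ['numpy', 'python-dateutil', 'pytz', 'six', 'tzdata']
```

If numpy were also a top-level requirement, it would stay in the "after" set and would not be reported as an orphan.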

Conda

Conda removes the package and its dependencies. 

$ conda install -c anaconda pandas

$ conda uninstall pandas

Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/khuyentran/miniconda3/envs/test-conda

  removed specs:
    - pandas


The following packages will be REMOVED:

  blas-1.0-openblas
  bottleneck-1.3.5-py311ha0d4635_0
  libcxx-14.0.6-h848a8c0_0
  libgfortran-5.0.0-11_3_0_hca03da5_28
  libgfortran5-11.3.0-h009349e_28
  libopenblas-0.3.21-h269037a_0
  llvm-openmp-14.0.6-hc6e5704_0
  numexpr-2.8.4-py311h6dc990b_1
  numpy-1.24.3-py311hb57d4eb_0
  numpy-base-1.24.3-py311h1d85a46_0
  pandas-1.5.3-py311h6956b77_0
  python-dateutil-2.8.2-pyhd3eb1b0_0
  pytz-2022.7-py311hca03da5_0
  six-1.16.0-pyhd3eb1b0_1


Proceed ([y]/n)? 

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Poetry

Poetry also removes the package and its dependencies.

$ poetry add pandas

$ poetry remove pandas

  • Removing numpy (1.24.3)
  • Removing pandas (2.0.2)
  • Removing python-dateutil (2.8.2)
  • Removing pytz (2023.3)
  • Removing six (1.16.0)
  • Removing tzdata (2023.3)

Dependency Files

Dependency files ensure the reproducibility of a software project’s environment by specifying the exact versions or version ranges of required packages.

This helps recreate the same environment across different systems or at different points in time, so that all collaborators work with the same set of dependencies.

Conda

To save dependencies in a Conda environment, you need to manually write them to a file. Version ranges specified in an environment.yml file can result in different versions being installed, potentially introducing compatibility issues when reproducing the environment.

Let’s assume that we have installed pandas version 1.5.3. Here is an environment.yml file that specifies the dependencies:

# environment.yml
name: test-conda
channels:
  - defaults
dependencies:
  - python=3.8
  - pandas>=1.5

If a new user tries to reproduce the environment when the latest version of pandas is 2.0, pandas 2.0 will be installed instead.

# Create and activate a virtual environment
$ conda env create -f environment.yml -n env
$ conda activate env

# List packages in the current environment
$ conda list
...
pandas                   2.0

If the codebase relies on syntax or behavior specific to pandas version 1.5.3 and the syntax has changed in version 2.0, running the code with pandas 2.0 could introduce bugs.

Pip

The same problem can occur with pip.

# requirements.txt
pandas>=1.5

# Create and activate a virtual environment
$ python3 -m venv venv
$ source venv/bin/activate

# Install dependencies
$ pip install -r requirements.txt

# List packages
$ pip list
Package    Version
---------- -------
pandas       2.0
...

You can pin the versions by freezing the installed packages into a requirements.txt file:

$ pip freeze > requirements.txt

# requirements.txt

numpy==1.24.3
pandas==1.5.3
python-dateutil==2.8.2
pytz==2023.3
six==1.16.0

However, this makes the code environment less flexible and potentially harder to maintain in the long run. Any changes to the dependencies would require manual modifications to the requirements.txt file, which can be time-consuming and error-prone.

Poetry

Poetry automatically updates the pyproject.toml file when installing a package.

In the following example, the “pandas” package is added with the version constraint “^1.5”. This flexible versioning approach ensures that your project can adapt to newer releases without manual adjustments.

$ poetry add 'pandas=^1.5'

# pyproject.toml

[tool.poetry.dependencies]
python = "^3.8"
pandas = "^1.5"
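To make the caret constraint concrete, here is a minimal sketch of the rule Poetry documents for `^` requirements: the lower bound is the version given, and the upper bound bumps the left-most non-zero component. The helper names are hypothetical, and the sketch only handles plain numeric versions (no pre-release tags):

```python
def parse_version(text):
    """Turn '1.5.3' into (1, 5, 3), padding to three components."""
    parts = [int(piece) for piece in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def caret_bounds(spec):
    """Return (lower, upper) for a caret constraint such as '^1.5'."""
    bare = spec.lstrip("^")
    given = [int(piece) for piece in bare.split(".")]
    lower = parse_version(bare)
    # Component to bump: the first non-zero one, else the last one given.
    index = next((i for i, n in enumerate(given) if n != 0), len(given) - 1)
    upper = given[:index] + [given[index] + 1]
    upper += [0] * (3 - len(upper))
    return lower, tuple(upper)


def allowed_by_caret(version, spec):
    """True when `version` falls inside the range described by `spec`."""
    lower, upper = caret_bounds(spec)
    return lower <= parse_version(version) < upper


print(caret_bounds("^1.5"))               # ((1, 5, 0), (2, 0, 0))
print(allowed_by_caret("1.5.3", "^1.5"))  # True
print(allowed_by_caret("2.0.0", "^1.5"))  # False
```

With `^1.5`, pandas 1.5.3 is accepted while pandas 2.0.0 is not, so the constraint is looser than a pinned `==1.5.3` but stricter than the `>=1.5` used in the conda and pip examples.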

The poetry.lock file stores the precise version numbers for each package and its dependencies.

# poetry.lock
...
[[package]]
name = "pandas"
version = "1.5.3"
description = "Powerful data structures for data analysis, time series, and statistics"
category = "main"
optional = false
python-versions = ">=3.8"

[package.dependencies]
numpy = [
    {version = ">=1.20.3", markers = "python_version < \"3.10\""},
    {version = ">=1.21.0", markers = "python_version >= \"3.10\""},
    {version = ">=1.23.2", markers = "python_version >= \"3.11\""},
]
python-dateutil = ">=2.8.2"
pytz = ">=2020.1"
tzdata = ">=2022.1"
...

This guarantees consistency in the installed packages, even if a package has a version range specified in the pyproject.toml file. Here, we can see that pandas 1.5.3 is installed instead of pandas 2.0:

$ poetry install

$ poetry show pandas

name         : pandas                                                                  
version      : 1.5.3                                                                   
description  : Powerful data structures for data analysis, time series, and statistics 

dependencies
 - numpy >=1.20.3
 - numpy >=1.21.0
 - numpy >=1.23.2
 - python-dateutil >=2.8.1
 - pytz >=2020.1

Separate Dependencies for Dev and Prod

By separating the dependencies, you can clearly distinguish the packages required for development, such as testing frameworks and code quality tools, from the packages needed in production, which typically include only the core dependencies.

This ensures that the production environment contains only the necessary packages for running the application, reducing the risk of conflicts or compatibility issues.

Conda

Conda doesn’t inherently support separate dependencies for different environments, but a workaround involves creating two environment files: one for the development environment and one for production. The development file contains both production and development dependencies.

# environment.yml
name: test-conda
channels:
  - defaults
dependencies:
  # Production packages
  - numpy
  - pandas

# environment-dev.yml
name: test-conda-dev
channels:
  - defaults
dependencies:
  # Production packages
  - numpy
  - pandas
  # Development packages
  - pytest
  - pre-commit

Pip

Pip also doesn’t directly support separate dependencies, but a similar approach can be used with separate requirement files.

# requirements.txt
numpy
pandas

# requirements-dev.txt
-r requirements.txt
pytest
pre-commit

# Install prod
$ pip install -r requirements.txt

# Install both dev and prod
$ pip install -r requirements-dev.txt
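The `-r requirements.txt` line is what makes the development file pull in the production packages. As a rough illustration of how that include expands, here is a small sketch that works on in-memory stand-ins for the two files. `FILES` and `expand_requirements` are hypothetical names, and the comment handling is deliberately simplified compared with pip’s real parser:

```python
# In-memory stand-ins for the two requirement files shown above.
FILES = {
    "requirements.txt": "numpy\npandas\n",
    "requirements-dev.txt": "-r requirements.txt\npytest\npre-commit\n",
}


def expand_requirements(filename, files):
    """Return the package names a requirements file resolves to, in order."""
    packages = []
    for raw_line in files[filename].splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("-r "):
            # Follow the include and splice in the referenced file's packages.
            packages.extend(expand_requirements(line[3:].strip(), files))
        else:
            packages.append(line)
    return packages


print(expand_requirements("requirements.txt", FILES))
# ['numpy', 'pandas']
print(expand_requirements("requirements-dev.txt", FILES))
# ['numpy', 'pandas', 'pytest', 'pre-commit']
```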

Poetry

Poetry simplifies managing dependencies by supporting groups within one file. This allows you to keep track of all dependencies in a single place.

$ poetry add numpy pandas
$ poetry add --group dev pytest pre-commit

# pyproject.toml
[tool.poetry.dependencies]
python = "^3.8"
pandas = "^2.0"
numpy = "^1.24.3"

[tool.poetry.group.dev.dependencies]
pytest = "^7.3.2"
pre-commit = "^3.3.2"

To install only production dependencies:

$ poetry install --only main

To install both development and production dependencies:

$ poetry install

Update an Environment

Updating dependencies is essential to benefit from bug fixes, performance improvements, and new features introduced in newer package versions.

Conda

Conda allows you to update only a specified package.

$ conda install -c anaconda pandas
$ conda install -c anaconda scikit-learn
# New versions available
$ conda update pandas
$ conda update scikit-learn

Afterwards, you need to manually update the environment.yml file to keep it in sync with the updated dependencies.

$ conda env export > environment.yml

Pip

Pip also only allows you to update a specified package and requires you to manually update the requirements.txt file.

$ pip install -U pandas
$ pip freeze > requirements.txt

Poetry

Using Poetry, you can use the update command to upgrade all packages specified in the pyproject.toml file. This action automatically updates the poetry.lock file, ensuring consistency between the package specifications and the lock file.

$ poetry add pandas scikit-learn

# New versions available
$ poetry update

Updating dependencies
Resolving dependencies... (0.3s)

Writing lock file

Package operations: 0 installs, 2 updates, 0 removals

  • Updating pandas (2.0.0 -> 2.0.2)
  • Updating scikit-learn (1.2.0 -> 1.2.2)

Dependency Resolution

Dependency conflicts occur when packages or libraries required by a project have conflicting versions or incompatible dependencies. Properly resolving conflicts is crucial to avoid errors, runtime issues, or project failures.

Pip

Pip installs packages sequentially, which means it installs each package one by one, following the specified order. This sequential approach can sometimes lead to conflicts when packages have incompatible dependencies or version requirements.

For example, suppose you install pandas==2.0.2 first, which requires numpy>=1.20.3. Later, you install numpy==1.20.2 using pip. Even though this creates a dependency conflict, pip will proceed to change the version of numpy.

$ pip install pandas==2.0.2

$ pip install numpy==1.20.2
Collecting numpy==1.20.2
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.3
    Uninstalling numpy-1.24.3:
      Successfully uninstalled numpy-1.24.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pandas 2.0.2 requires numpy>=1.20.3; python_version < "3.10", but you have numpy 1.20.2 which is incompatible.
Successfully installed numpy-1.20.2
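The conflict pip reports here comes down to comparing an installed version against a declared requirement. The sketch below reproduces that check for the pandas and numpy example, in the spirit of `pip check`. It is not pip’s implementation: the data structures and `find_conflicts` are hypothetical, and only simple numeric versions with a single operator are handled:

```python
import operator

OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def as_tuple(version):
    """Turn '1.20.3' into (1, 20, 3), padding to three components."""
    parts = [int(piece) for piece in version.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def find_conflicts(installed, requirements):
    """List (package, dependency, constraint, found) for unmet requirements."""
    conflicts = []
    for package, deps in requirements.items():
        if package not in installed:
            continue
        for dependency, (op, wanted) in deps.items():
            found = installed.get(dependency)
            satisfied = found is not None and OPERATORS[op](
                as_tuple(found), as_tuple(wanted)
            )
            if not satisfied:
                conflicts.append((package, dependency, f"{op}{wanted}", found))
    return conflicts


# State of the environment after `pip install numpy==1.20.2`.
INSTALLED = {"pandas": "2.0.2", "numpy": "1.20.2"}
# Requirement taken from the error message above.
REQUIREMENTS = {"pandas": {"numpy": (">=", "1.20.3")}}

print(find_conflicts(INSTALLED, REQUIREMENTS))
# [('pandas', 'numpy', '>=1.20.3', '1.20.2')]
```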

Conda

Conda uses a SAT solver to explore all combinations of package versions and dependencies to find a compatible set.

For instance, if an existing package has a specific constraint for its dependency (e.g., statsmodels==0.13.2 requires numpy>=1.21.2,<2.0a0), and the package you want to install doesn’t meet that requirement (e.g., numpy<1.21.2), conda won’t immediately raise an error. Instead, it will diligently search for compatible versions of all the required packages and their dependencies, only reporting an error if no suitable solution is found.

$ conda install 'statsmodels==0.13.2'

$ conda search 'statsmodels==0.13.2' --info
dependencies: 
  - numpy >=1.21.2,<2.0a0
  - packaging >=21.3
  - pandas >=1.0
  - patsy >=0.5.2
  - python >=3.9,<3.10.0a0
  - scipy >=1.3

$ conda install 'numpy<1.21.2'

...
Package ca-certificates conflicts for:
python=3.8 -> openssl[version='>=1.1.1t,<1.1.2a'] -> ca-certificates
openssl -> ca-certificates
ca-certificates
cryptography -> openssl[version='>1.1.0,<3.1.0'] -> ca-certificates

Package idna conflicts for:
requests -> urllib3[version='>=1.21.1,<1.27'] -> idna[version='>=2.0.0']
requests -> idna[version='>=2.5,<3|>=2.5,<4']
idna
pooch -> requests -> idna[version='>=2.5,<3|>=2.5,<4']
urllib3 -> idna[version='>=2.0.0']

Package numexpr conflicts for:
statsmodels==0.13.2 -> pandas[version='>=1.0'] -> numexpr[version='>=2.7.0|>=2.7.1|>=2.7.3']
numexpr
pandas==1.5.3 -> numexpr[version='>=2.7.3']

Package patsy conflicts for:
statsmodels==0.13.2 -> patsy[version='>=0.5.2']
patsy

Package chardet conflicts for:
requests -> chardet[version='>=3.0.2,<4|>=3.0.2,<5']
pooch -> requests -> chardet[version='>=3.0.2,<4|>=3.0.2,<5']

Package python-dateutil conflicts for:
statsmodels==0.13.2 -> pandas[version='>=1.0'] -> python-dateutil[version='>=2.7.3|>=2.8.1']
python-dateutil
pandas==1.5.3 -> python-dateutil[version='>=2.8.1']

Package setuptools conflicts for:
numexpr -> setuptools
pip -> setuptools
wheel -> setuptools
setuptools
python=3.8 -> pip -> setuptools
pandas==1.5.3 -> numexpr[version='>=2.7.3'] -> setuptools

Package brotlipy conflicts for:
urllib3 -> brotlipy[version='>=0.6.0']
brotlipy
requests -> urllib3[version='>=1.21.1,<1.27'] -> brotlipy[version='>=0.6.0']

Package pytz conflicts for:
pytz
pandas==1.5.3 -> pytz[version='>=2020.1']
statsmodels==0.13.2 -> pandas[version='>=1.0'] -> pytz[version='>=2017.3|>=2020.1']

While this approach enhances the chances of finding a resolution, it can be computationally intensive, particularly when dealing with extensive environments.
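To get a feel for why the search can become expensive, here is a deliberately naive sketch that tries every combination of candidate versions until one satisfies all the rules. This is brute-force enumeration, not conda’s SAT solver. The `AVAILABLE` index, the rule functions, and `solve` are all hypothetical, with the numpy range taken from the statsmodels example above:

```python
from itertools import product


def as_tuple(version):
    """Turn '1.21.2' into (1, 21, 2) for easy comparison."""
    return tuple(int(piece) for piece in version.split("."))


# Hypothetical package index: candidate versions per package.
AVAILABLE = {
    "statsmodels": ["0.13.2"],
    "numpy": ["1.20.3", "1.21.2", "1.23.5"],
}


def statsmodels_needs_numpy(choice):
    # statsmodels 0.13.2 requires numpy >=1.21.2,<2.0.
    return (1, 21, 2) <= as_tuple(choice["numpy"]) < (2, 0, 0)


def user_wants_old_numpy(choice):
    # The request from `conda install 'numpy<1.21.2'`.
    return as_tuple(choice["numpy"]) < (1, 21, 2)


def solve(available, rules):
    """Return the first version assignment that passes every rule, or None."""
    names = sorted(available)
    for versions in product(*(available[name] for name in names)):
        choice = dict(zip(names, versions))
        if all(rule(choice) for rule in rules):
            return choice
    return None


print(solve(AVAILABLE, [statsmodels_needs_numpy]))
# {'numpy': '1.21.2', 'statsmodels': '0.13.2'}
print(solve(AVAILABLE, [statsmodels_needs_numpy, user_wants_old_numpy]))
# None
```

The number of combinations is the product of the candidate counts for every package, which is why real solvers rely on smarter techniques than plain enumeration.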

Poetry

By focusing on the direct dependencies of the project, Poetry’s deterministic resolver narrows down the search space, making the resolution process more efficient. It evaluates the specified constraints, such as version ranges or specific versions, and immediately identifies any conflicts.

$ poetry add 'seaborn==0.12.2'
$ poetry add 'matplotlib<3.1' 

Because poetry shell depends on seaborn (0.12.2) which depends on matplotlib (>=3.1,<3.6.1 || >3.6.1), matplotlib is required.
So, because poetry shell depends on matplotlib (<3.1), version solving failed.

This immediate feedback helps prevent potential issues from escalating and allows developers to address the problem early in the development process. For example, in the following code, we can relax the requirements for seaborn to enable the installation of a specific version of matplotlib:

$ poetry add 'seaborn<=0.12.2' 'matplotlib<3.1'

Package operations: 1 install, 2 updates, 4 removals

  • Removing contourpy (1.0.7)
  • Removing fonttools (4.40.0)
  • Removing packaging (23.1)
  • Removing pillow (9.5.0)
  • Updating matplotlib (3.7.1 -> 3.0.3)
  • Installing scipy (1.9.3)
  • Updating seaborn (0.12.2 -> 0.11.2)

Conclusion

In summary, Poetry provides several advantages over pip and conda:

  1. Consistent Package Installation: Poetry offers a consistent format to install any package, ensuring a standardized approach across your project.
  2. Broad Package Selection: Poetry provides access to a wide range of packages available on PyPI, allowing you to leverage a diverse ecosystem for your project.
  3. Efficient Dependency Management: Poetry installs only the necessary dependencies for a specified package, reducing the number of extraneous packages in your environment.
  4. Streamlined Package Removal: Poetry simplifies the removal of packages and their associated dependencies, making it easy to maintain a clean and efficient project environment.
  5. Dependency Resolution: Poetry’s deterministic resolver efficiently resolves dependencies, identifying and addressing any inconsistencies or conflicts promptly.

While Poetry may require some additional time and effort for your teammates to learn and adapt to, using a tool like Poetry can save you time and effort in the long run.


