Motivation
In the beginning stages of your data science project, using dependency management tools like pip or conda may be sufficient.
However, as the data science project expands, the number of dependencies also increases. This can make it difficult to reproduce the project’s environment and maintain it effectively when relying solely on pip or conda for dependency management.
To address these challenges, Poetry, an open-source packaging and dependency management tool, makes it easier to create and maintain Python projects with consistent environments. In this article, we will look at the advantages of Poetry and highlight its key distinctions from pip and conda.
Ease of Installation
A straightforward installation process allows developers to quickly adopt and integrate a package into their codebase, saving time and effort.
Conda
Conda’s installation command is inconsistent across packages because you often need to specify a different channel. For example, to install polars, you need to run:
conda install -c conda-forge polars
While to install pandas, the command would be:
conda install -c anaconda pandas
Pip
Pip has a consistent installation format for every package:
pip install package-name
Poetry
Poetry follows the same installation format for every package:
poetry add package-name
Available packages
Having a broad selection of packages makes it easier for developers to find the specific package and version that best suits their needs.
Conda
Some packages, like “snscrape,” cannot be installed with conda. Additionally, certain versions, such as Pandas 2.0, might not be available for installation through Conda.
While you can use pip inside a conda virtual environment to address package limitations, conda cannot track dependencies installed with pip, making dependency management challenging.
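For example, suppose you pip-install snscrape inside the activated conda environment (illustrative commands):
$ conda activate test-conda
$ pip install snscrape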
$ conda list
# packages in environment at /Users/khuyentran/miniconda3/envs/test-conda:
#
# Name Version Build Channel
Pip
Pip can install any package from the Python Package Index (PyPI) and other repositories.
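For instance, you can install snscrape straight from PyPI, or point pip at an alternative index (the index URL below is only a placeholder):
$ pip install snscrape
$ pip install --index-url https://my-private-index/simple/ package-name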
Poetry
Poetry also allows the installation of packages from the Python Package Index (PyPI) and other repositories.
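For example, you can add snscrape from PyPI, or register an extra repository and pull packages from it (the repository name and URL here are placeholders):
$ poetry add snscrape
$ poetry source add my-repo https://my-private-index/simple/
$ poetry add --source my-repo package-name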
Number of Dependencies
Reducing the number of dependencies in an environment simplifies the development process.
Conda
Conda provides full environment isolation, managing both Python packages and system-level dependencies. This can result in larger package sizes compared to other package managers, potentially consuming more storage space during installation and distribution.
$ conda install pandas
$ conda list
# packages in environment at /Users/khuyentran/miniconda3/envs/test-conda:
#
# Name Version Build Channel
blas 1.0 openblas
bottleneck 1.3.5 py311ha0d4635_0
bzip2 1.0.8 h620ffc9_4
ca-certificates 2023.05.30 hca03da5_0
libcxx 14.0.6 h848a8c0_0
libffi 3.4.4 hca03da5_0
libgfortran 5.0.0 11_3_0_hca03da5_28
libgfortran5 11.3.0 h009349e_28
libopenblas 0.3.21 h269037a_0
llvm-openmp 14.0.6 hc6e5704_0
ncurses 6.4 h313beb8_0
numexpr 2.8.4 py311h6dc990b_1
numpy 1.24.3 py311hb57d4eb_0
numpy-base 1.24.3 py311h1d85a46_0
openssl 3.0.8 h1a28f6b_0
pandas 1.5.3 py311h6956b77_0
pip 23.0.1 py311hca03da5_0
python 3.11.3 hb885b13_1
python-dateutil 2.8.2 pyhd3eb1b0_0
pytz 2022.7 py311hca03da5_0
readline 8.2 h1a28f6b_0
setuptools 67.8.0 py311hca03da5_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.41.2 h80987f9_0
tk 8.6.12 hb8d0fd4_0
tzdata 2023c h04d1e81_0
wheel 0.38.4 py311hca03da5_0
xz 5.4.2 h80987f9_0
zlib 1.2.13 h5a0b063_0
Pip
Pip installs only the dependencies required by a package.
$ pip install pandas
$ pip list
Package Version
--------------- -------
numpy 1.24.3
pandas 2.0.2
pip 22.3.1
python-dateutil 2.8.2
pytz 2023.3
setuptools 65.5.0
six 1.16.0
tzdata 2023.3
Poetry
Poetry also installs only the dependencies required by a package.
$ poetry add pandas
$ poetry show
numpy 1.24.3 Fundamental package for array computing in Python
pandas 2.0.2 Powerful data structures for data analysis, time...
python-dateutil 2.8.2 Extensions to the standard Python datetime module
pytz 2023.3 World timezone definitions, modern and historical
six 1.16.0 Python 2 and 3 compatibility utilities
tzdata 2023.3 Provider of IANA time zone data
Uninstall Packages
Uninstalling packages and their dependencies frees up disk space, prevents unnecessary clutter, and optimizes storage resource usage.
Pip
Pip removes only the specified package, not its dependencies, potentially leading to the accumulation of unused dependencies over time. This can result in increased storage space usage and potential conflicts.
$ pip install pandas
$ pip uninstall pandas
$ pip list
Package Version
--------------- -------
numpy 1.24.3
pip 22.0.4
python-dateutil 2.8.2
pytz 2023.3
setuptools 56.0.0
six 1.16.0
tzdata 2023.3
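To clean up these orphaned packages with pip, you would have to uninstall each of them by hand, for example:
$ pip uninstall -y numpy python-dateutil pytz six tzdata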
Conda
Conda removes the package and its dependencies.
$ conda install pandas
$ conda uninstall pandas
Collecting package metadata (repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /Users/khuyentran/miniconda3/envs/test-conda
removed specs:
- pandas
The following packages will be REMOVED:
blas-1.0-openblas
bottleneck-1.3.5-py311ha0d4635_0
libcxx-14.0.6-h848a8c0_0
libgfortran-5.0.0-11_3_0_hca03da5_28
libgfortran5-11.3.0-h009349e_28
libopenblas-0.3.21-h269037a_0
llvm-openmp-14.0.6-hc6e5704_0
numexpr-2.8.4-py311h6dc990b_1
numpy-1.24.3-py311hb57d4eb_0
numpy-base-1.24.3-py311h1d85a46_0
pandas-1.5.3-py311h6956b77_0
python-dateutil-2.8.2-pyhd3eb1b0_0
pytz-2022.7-py311hca03da5_0
six-1.16.0-pyhd3eb1b0_1
Proceed ([y]/n)?
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Poetry
Poetry also removes the package and its dependencies.
$ poetry add pandas
$ poetry remove pandas
• Removing numpy (1.24.3)
• Removing pandas (2.0.2)
• Removing python-dateutil (2.8.2)
• Removing pytz (2023.3)
• Removing six (1.16.0)
• Removing tzdata (2023.3)
Dependency Files
Dependency files ensure the reproducibility of a software project’s environment by specifying the exact versions or version ranges of required packages.
This makes it possible to recreate the same environment across different systems or at different points in time, so that every collaborator works with the same set of dependencies.
Conda
To save dependencies in a Conda environment, you need to manually write them to a file. Version ranges specified in an environment.yml file can result in different versions being installed, potentially introducing compatibility issues when reproducing the environment.
Let’s assume we have installed pandas version 1.5.3. Here is an environment.yml file that specifies the dependencies:
# environment.yml
name: test-conda
channels:
- defaults
dependencies:
- python=3.8
- pandas>=1.5
If a new user tries to reproduce the environment when the latest version of pandas is 2.0, pandas 2.0 will be installed instead.
# Create and activate a virtual environment
$ conda env create -f environment.yml -n env
$ conda activate env
# List packages in the current environment
$ conda list
...
pandas 2.0
If the codebase relies on syntax or behavior specific to pandas version 1.5.3 and the syntax has changed in version 2.0, running the code with pandas 2.0 could introduce bugs.
Pip
The same problem can occur with pip.
# requirements.txt
pandas>=1.5
# Create and activate a virtual environment
$ python3 -m venv venv
$ source venv/bin/activate
# Install dependencies
$ pip install -r requirements.txt
# List packages
$ pip list
Package Version
---------- -------
pandas 2.0
...
You can pin exact versions by freezing them into the requirements.txt file:
$ pip freeze > requirements.txt
# requirements.txt
numpy==1.24.3
pandas==1.5.3
python-dateutil==2.8.2
pytz==2023.3
six==1.16.0
However, this makes the code environment less flexible and potentially harder to maintain in the long run. Any changes to the dependencies would require manual modifications to the requirements.txt file, which can be time-consuming and error-prone.
Poetry
Poetry automatically updates the pyproject.toml file when installing a package.
In the following example, the “pandas” package is added with the version constraint “^1.5”. This flexible versioning approach ensures that your project can adapt to newer releases without manual adjustments.
$ poetry add 'pandas@^1.5'
# pyproject.toml
[tool.poetry.dependencies]
python = "^3.8"
pandas = "^1.5"
The poetry.lock file stores the precise version numbers for each package and its dependencies.
# poetry.lock
...
[[package]]
name = "pandas"
version = "1.5.3"
description = "Powerful data structures for data analysis, time series, and statistics"
category = "main"
optional = false
python-versions = ">=3.8"
[package.dependencies]
numpy = [
{version = ">=1.20.3", markers = "python_version < \"3.10\""},
{version = ">=1.21.0", markers = "python_version >= \"3.10\""},
{version = ">=1.23.2", markers = "python_version >= \"3.11\""},
]
python-dateutil = ">=2.8.2"
pytz = ">=2020.1"
tzdata = ">=2022.1"
...
This guarantees consistency in the installed packages, even if a package has a version range specified in the pyproject.toml file. Here, we can see that pandas 1.5.3 is installed instead of pandas 2.0:
$ poetry install
$ poetry show pandas
name : pandas
version : 1.5.3
description : Powerful data structures for data analysis, time series, and statistics
dependencies
- numpy >=1.20.3
- numpy >=1.21.0
- numpy >=1.23.2
- python-dateutil >=2.8.1
- pytz >=2020.1
Separate dependencies for dev and prod
By separating the dependencies, you can clearly distinguish between the packages required for development purposes, such as testing frameworks and code quality tools, from the packages needed for the production environment, which typically include the core dependencies.
This ensures that the production environment contains only the necessary packages for running the application, reducing the risk of conflicts or compatibility issues.
Conda
Conda doesn’t inherently support separate dependencies for different environments, but a workaround involves creating two environment files: one for the development environment and one for production. The development file contains both production and development dependencies.
# environment.yml
name: test-conda
channels:
- defaults
dependencies:
# Production packages
- numpy
- pandas
# environment-dev.yml
name: test-conda-dev
channels:
- defaults
dependencies:
# Production packages
- numpy
- pandas
# Development packages
- pytest
- pre-commit
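You can then create each environment from its own file:
# Create the production environment
$ conda env create -f environment.yml
# Create the development environment
$ conda env create -f environment-dev.yml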
Pip
Pip also doesn’t directly support separate dependencies, but a similar approach can be used with separate requirement files.
# requirements.txt
numpy
pandas
# requirements-dev.txt
-r requirements.txt
pytest
pre-commit
# Install prod
$ pip install -r requirements.txt
# Install both dev and prod
$ pip install -r requirements-dev.txt
Poetry
Poetry simplifies managing dependencies by supporting groups within one file. This allows you to keep track of all dependencies in a single place.
$ poetry add numpy pandas
$ poetry add --group dev pytest pre-commit
# pyproject.toml
[tool.poetry.dependencies]
python = "^3.8"
pandas = "^2.0"
numpy = "^1.24.3"
[tool.poetry.group.dev.dependencies]
pytest = "^7.3.2"
pre-commit = "^3.3.2"
To install only production dependencies:
$ poetry install --only main
To install both development and production dependencies:
$ poetry install
Update an Environment
Updating dependencies is essential to benefit from bug fixes, performance improvements, and new features introduced in newer package versions.
Conda
Conda lets you update packages one at a time by specifying their names.
$ conda install pandas
$ conda install -c anaconda scikit-learn
# New versions available
$ conda update pandas
$ conda update scikit-learn
Afterwards, you need to manually update the environment.yml file to keep it in sync with the updated dependencies.
$ conda env export > environment.yml
Pip
Pip likewise updates packages one at a time and requires you to manually regenerate the requirements.txt file.
$ pip install -U pandas
$ pip freeze > requirements.txt
Poetry
With Poetry, the update command upgrades all packages specified in the pyproject.toml file to their latest compatible versions. This action automatically updates the poetry.lock file, ensuring consistency between the package specifications and the lock file.
$ poetry add pandas scikit-learn
# New versions available
$ poetry update
Updating dependencies
Resolving dependencies... (0.3s)
Writing lock file
Package operations: 0 installs, 2 updates, 0 removals
• Updating pandas (2.0.0 -> 2.0.2)
• Updating scikit-learn (1.2.0 -> 1.2.2)
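If you only want to bump specific packages, poetry update also accepts package names:
$ poetry update pandas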
Dependency Resolution
Dependency conflicts occur when packages or libraries required by a project have conflicting versions or incompatible dependencies. Properly resolving conflicts is crucial to avoid errors, runtime issues, or project failures.
Pip
Pip installs packages sequentially, one at a time in the order you specify. This can sometimes lead to conflicts when packages have incompatible dependencies or version requirements.
For example, suppose you install pandas==2.0.2 first, which requires numpy>=1.20.3. Later, you install numpy==1.20.2. Even though this creates a dependency conflict, pip proceeds to install the incompatible numpy version anyway.
$ pip install pandas==2.0.2
$ pip install numpy==1.20.2
Collecting numpy==1.20.2
Attempting uninstall: numpy
Found existing installation: numpy 1.24.3
Uninstalling numpy-1.24.3:
Successfully uninstalled numpy-1.24.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pandas 2.0.2 requires numpy>=1.20.3; python_version < "3.10", but you have numpy 1.20.2 which is incompatible.
Successfully installed numpy-1.20.2
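To surface incompatibilities like this after the fact, you can run pip’s built-in check command, which reports installed packages whose dependencies are broken:
$ pip check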
Conda
Conda uses a SAT solver to explore all combinations of package versions and dependencies to find a compatible set.
For instance, if an existing package has a specific constraint for its dependency (e.g., statsmodels==0.13.2 requires numpy>=1.21.2,<2.0a0), and the package you want to install doesn’t meet that requirement (e.g., numpy<1.21.2), conda won’t immediately raise an error. Instead, it will diligently search for compatible versions of all the required packages and their dependencies, only reporting an error if no suitable solution is found.
$ conda install 'statsmodels==0.13.2'
$ conda search 'statsmodels==0.13.2' --info
dependencies:
- numpy >=1.21.2,<2.0a0
- packaging >=21.3
- pandas >=1.0
- patsy >=0.5.2
- python >=3.9,<3.10.0a0
- scipy >=1.3
$ conda install 'numpy<1.21.2'
...
Package ca-certificates conflicts for:
python=3.8 -> openssl[version='>=1.1.1t,<1.1.2a'] -> ca-certificates
openssl -> ca-certificates
ca-certificates
cryptography -> openssl[version='>1.1.0,<3.1.0'] -> ca-certificates
Package idna conflicts for:
requests -> urllib3[version='>=1.21.1,<1.27'] -> idna[version='>=2.0.0']
requests -> idna[version='>=2.5,<3|>=2.5,<4']
idna
pooch -> requests -> idna[version='>=2.5,<3|>=2.5,<4']
urllib3 -> idna[version='>=2.0.0']
Package numexpr conflicts for:
statsmodels==0.13.2 -> pandas[version='>=1.0'] -> numexpr[version='>=2.7.0|>=2.7.1|>=2.7.3']
numexpr
pandas==1.5.3 -> numexpr[version='>=2.7.3']
Package patsy conflicts for:
statsmodels==0.13.2 -> patsy[version='>=0.5.2']
patsy
Package chardet conflicts for:
requests -> chardet[version='>=3.0.2,<4|>=3.0.2,<5']
pooch -> requests -> chardet[version='>=3.0.2,<4|>=3.0.2,<5']
Package python-dateutil conflicts for:
statsmodels==0.13.2 -> pandas[version='>=1.0'] -> python-dateutil[version='>=2.7.3|>=2.8.1']
python-dateutil
pandas==1.5.3 -> python-dateutil[version='>=2.8.1']
Package setuptools conflicts for:
numexpr -> setuptools
pip -> setuptools
wheel -> setuptools
setuptools
python=3.8 -> pip -> setuptools
pandas==1.5.3 -> numexpr[version='>=2.7.3'] -> setuptools
Package brotlipy conflicts for:
urllib3 -> brotlipy[version='>=0.6.0']
brotlipy
requests -> urllib3[version='>=1.21.1,<1.27'] -> brotlipy[version='>=0.6.0']
Package pytz conflicts for:
pytz
pandas==1.5.3 -> pytz[version='>=2020.1']
statsmodels==0.13.2 -> pandas[version='>=1.0'] -> pytz[version='>=2017.3|>=2020.1']
While this approach enhances the chances of finding a resolution, it can be computationally intensive, particularly when dealing with extensive environments.
Poetry
By focusing on the direct dependencies of the project, Poetry’s deterministic resolver narrows down the search space, making the resolution process more efficient. It evaluates the specified constraints, such as version ranges or specific versions, and immediately identifies any conflicts.
$ poetry add 'seaborn==0.12.2'
$ poetry add 'matplotlib<3.1'
Because poetry shell depends on seaborn (0.12.2) which depends on matplotlib (>=3.1,<3.6.1 || >3.6.1), matplotlib is required.
So, because poetry shell depends on matplotlib (<3.1), version solving failed.
This immediate feedback helps prevent potential issues from escalating and allows developers to address the problem early in the development process. For example, in the following code, we can relax the requirements for seaborn to enable the installation of a specific version of matplotlib:
$ poetry add 'seaborn<=0.12.2' 'matplotlib<3.1'
Package operations: 1 install, 2 updates, 4 removals
• Removing contourpy (1.0.7)
• Removing fonttools (4.40.0)
• Removing packaging (23.1)
• Removing pillow (9.5.0)
• Updating matplotlib (3.7.1 -> 3.0.3)
• Installing scipy (1.9.3)
• Updating seaborn (0.12.2 -> 0.11.2)
Conclusion
In summary, Poetry provides several advantages over pip and conda:
- Consistent Package Installation: Poetry offers a consistent format to install any package, ensuring a standardized approach across your project.
- Broad Package Selection: Poetry provides access to a wide range of packages available on PyPI, allowing you to leverage a diverse ecosystem for your project.
- Efficient Dependency Management: Poetry installs only the necessary dependencies for a specified package, reducing the number of extraneous packages in your environment.
- Streamlined Package Removal: Poetry simplifies the removal of packages and their associated dependencies, making it easy to maintain a clean and efficient project environment.
- Dependency Resolution: Poetry’s deterministic resolver efficiently resolves dependencies, identifying and addressing any inconsistencies or conflicts promptly.
While Poetry may take your teammates some time to learn and adopt, it can save you considerable time and effort in the long run.
I love writing about data science concepts and playing with different data science tools. You can stay up-to-date with my latest posts by:
- Subscribing to my newsletter on Data Science Simplified.
- Connecting with me on LinkedIn and Twitter.