Machine Learning

MICEforest: An Iterative Predictive Modeling Approach to Missing Data Imputation

miceforest is a Python library that imputes missing data with an iterative series of predictive models. In each iteration, every variable with missing values is imputed from the other variables, and the iterations continue until the imputations appear to have converged.

In the post's example, the correlation between A and B moves much closer to that of the original data after imputing A using B and C, and then imputing B using A and C.
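The iterative idea can be sketched without any library. The snippet below is a simplified illustration, not miceforest's implementation: it assumes only two variables, uses plain least-squares lines as the predictive models, and runs a fixed number of iterations instead of checking convergence. The function names and sample values are made up for the example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # predictor is constant: fall back to the mean
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def impute_from(target, predictor, missing):
    """Fit on rows where target was observed, then refill its missing rows."""
    xs = [p for p, m in zip(predictor, missing) if not m]
    ys = [t for t, m in zip(target, missing) if not m]
    slope, intercept = fit_line(xs, ys)
    return [slope * p + intercept if m else t
            for t, p, m in zip(target, predictor, missing)]


def chained_impute(a, b, iterations=10):
    """Alternate between imputing A from B and B from A."""
    miss_a = [v is None for v in a]
    miss_b = [v is None for v in b]
    # Start from mean imputation so both columns are complete.
    mean_a = mean([v for v in a if v is not None])
    mean_b = mean([v for v in b if v is not None])
    a = [mean_a if m else v for v, m in zip(a, miss_a)]
    b = [mean_b if m else v for v, m in zip(b, miss_b)]
    for _ in range(iterations):
        a = impute_from(a, b, miss_a)  # impute A using B
        b = impute_from(b, a, miss_b)  # then impute B using the updated A
    return a, b


# Hypothetical toy columns; None marks a missing value.
a = [1.0, 2.0, None, 4.0, 5.0]
b = [3.0, None, 7.0, 9.0, 11.0]
a_filled, b_filled = chained_impute(a, b)
print(a_filled)
print(b_filled)
```

Observed values are never overwritten; only the originally missing cells are refreshed on each pass.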

Link to miceforest.

Phoenix: Visualize High-Dimensional Data to Identify Performance Issues

During system performance degradation, pinpointing underlying causes can be challenging, especially with datasets containing numerous features.

Phoenix uses UMAP to project high-dimensional data into a low-dimensional view during periods of performance degradation, making clusters of problematic data easier to identify.

Link to Phoenix.

safetensors: A Simple and Safe Way to Store and Distribute Tensors

PyTorch uses pickle by default to store tensors, which poses a security risk because a malicious pickle file can execute arbitrary code when it is unpickled. In contrast, safetensors is a format dedicated to tensors: a file holds only tensor data and metadata, so loading it does not run any code.

safetensors also uses zero-copy operations, eliminating the need to copy data into new memory locations, thereby enabling fast and efficient data handling.
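The pickle risk mentioned above can be demonstrated with the standard library alone. This is a harmless sketch: the `Payload` class is hypothetical, and `eval` on a fixed arithmetic string stands in for whatever an attacker might choose to run. It does not use PyTorch or safetensors.

```python
import pickle


class Payload:
    """Hypothetical object whose pickle runs code when it is loaded."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it at load time.
        # An attacker could return something like (os.system, ("...",)) here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading calls eval("6 * 7") instead of rebuilding a Payload instance.
result = pickle.loads(blob)
print(type(result), result)
```

The loader never asked for code to be executed; simply reading the bytes was enough, which is why untrusted pickle files should not be loaded.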

Link to safetensors.

Simplify Machine Learning Deployment with MLflow

After training your machine learning model, deploying it for real-world predictions can be complex.

MLflow simplifies this by offering a user-friendly interface for deploying models across various platforms without requiring boilerplate code.

With MLflow, the model, code, and configurations are packaged with the deployment container, ensuring consistency between training and deployment environments.

Learn more about MLflow deployment.

Data Freshness Experiment: A Blueprint for Model Update Frequency

There’s a common belief that fresher data yields better results, but how frequently should you update your models?

To figure this out, train models on different past timeframes and test them on current data.

For instance, train model A on January-May data, model B on April-August, and model C on July-November, then evaluate all on December data.

If model A performs much worse than model C, you should consider updating your model more frequently to maintain high performance.
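The experiment can be sketched in a few lines. Everything in this example is an assumption made for illustration: the data is synthetic with a deliberate upward drift, the "model" is just the mean of its training window, and the names `data_by_month`, `windows`, and `train` are hypothetical.

```python
from statistics import mean

# Hypothetical synthetic data: month number -> observed target values.
# The target drifts upward over the year, so older data is less representative.
data_by_month = {m: [m - 0.5, float(m), m + 0.5] for m in range(1, 13)}

# Training windows from the example: (first month, last month), inclusive.
windows = {"A": (1, 5), "B": (4, 8), "C": (7, 11)}
TEST_MONTH = 12  # December


def train(first, last):
    """'Train' a trivial model: predict the mean target of the window."""
    values = [v for m in range(first, last + 1) for v in data_by_month[m]]
    return mean(values)


def mean_absolute_error(prediction, actuals):
    """Average absolute gap between one constant prediction and the actuals."""
    return mean([abs(a - prediction) for a in actuals])


errors = {
    name: mean_absolute_error(train(first, last), data_by_month[TEST_MONTH])
    for name, (first, last) in windows.items()
}

for name, err in errors.items():
    print(f"model {name}: MAE on December = {err:.2f}")
```

With real data, replace `train` with the actual training routine and compare the gap between the oldest and newest windows; a large gap suggests the model goes stale quickly.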

Reference: Designing Machine Learning Systems by Chip Huyen.

txtai: All-in-one open-source embeddings database for semantic search

Traditional search systems rely on keywords to retrieve data, whereas semantic search uses natural language understanding to identify results with similar meanings.

txtai is an all-in-one embeddings database for semantic search that supports vector search with SQL, topic modeling, retrieval-augmented generation, and more.
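The difference between keyword and semantic search can be shown with a toy example. This sketch does not use txtai: the documents and their two-dimensional vectors are hand-made stand-ins for the embeddings a language model would compute, and the query vector is likewise an assumption.

```python
import math

# Hypothetical documents with hand-made 2-D vectors (vehicle-ness, food-ness).
# A real embeddings database would compute these with a language model.
documents = {
    "How to change a flat tire": (0.9, 0.0),
    "Chocolate cake recipe": (0.0, 1.0),
    "Car maintenance basics": (1.0, 0.1),
}


def keyword_search(query):
    """Return documents that share at least one word with the query."""
    words = set(query.lower().split())
    return [doc for doc in documents if words & set(doc.lower().split())]


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def semantic_search(query_vector, limit=2):
    """Rank documents by similarity of their vectors to the query vector."""
    ranked = sorted(documents,
                    key=lambda doc: cosine(documents[doc], query_vector),
                    reverse=True)
    return ranked[:limit]


query = "automobile repair"
query_vector = (1.0, 0.0)  # assumed embedding of the query

print(keyword_search(query))          # no document shares a word with the query
print(semantic_search(query_vector))  # vehicle-related documents rank first
```

The keyword search misses every document because none contains "automobile" or "repair", while the vector comparison still surfaces the two vehicle-related titles.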

Link to txtai.

Streamline Model Version Management with MLflow Aliases

MLflow aliases enable you to assign meaningful names to different versions of machine learning models.

With aliases, you can effortlessly switch between different model versions without modifying the deployment code. This is particularly useful when promoting a new model version to production.
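A minimal sketch of that workflow, assuming MLflow 2.3 or later and a model already registered in the Model Registry. The helper names, the model name `fraud-detector`, and the alias `champion` are hypothetical; MLflow is imported inside the functions so the URI helper can be tried without MLflow installed.

```python
def alias_uri(model_name: str, alias: str) -> str:
    """Build the model URI that MLflow resolves through an alias."""
    return f"models:/{model_name}@{alias}"


def promote(model_name: str, version: str, alias: str = "champion") -> None:
    """Point the alias at a specific registered model version."""
    from mlflow.tracking import MlflowClient  # assumes mlflow >= 2.3

    MlflowClient().set_registered_model_alias(model_name, alias, version)


def load_current(model_name: str, alias: str = "champion"):
    """Load whichever version the alias currently points to."""
    import mlflow.pyfunc

    return mlflow.pyfunc.load_model(alias_uri(model_name, alias))


# Deployment code only ever references the alias, never a version number.
print(alias_uri("fraud-detector", "champion"))
```

Promoting a new version then becomes a single call such as `promote("fraud-detector", "4")`, while code that calls `load_current("fraud-detector")` stays unchanged.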

Learn more about MLflow aliases.

Work with Khuyen Tran