Python-Magic: Reliable File Type Detection Beyond Extensions

Motivation

File extensions can be misleading or missing entirely. Data processing workflows often receive files from various sources with incorrect extensions, renamed files, or files without extensions altogether.

Traditional approaches rely on file extensions, which are easy to change and are sometimes missing altogether:

import os

# Traditional approach - unreliable
def get_file_type_by_extension(filename):
    _, ext = os.path.splitext(filename)
    return ext.lower()

# Examples of problematic files
files = ["document.txt", "data", "image.jpg.exe"]
for file in files:
    ext = get_file_type_by_extension(file)
    print(f"{file}: {ext if ext else 'No extension'}")

Output:

document.txt: .txt
data: No extension
image.jpg.exe: .exe

This approach fails when:

  • Files lack extensions
  • Extensions are incorrect or misleading
  • Malicious files masquerade as safe file types

Introduction to Python-Magic

Python-magic provides reliable file type detection by analyzing file headers rather than relying on extensions. It interfaces with libmagic, the same library used by the Unix file command.
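
To make "analyzing file headers" concrete, here is a minimal standard-library sketch of signature matching. It is not how python-magic is implemented: libmagic consults a far larger rule database, and the `SIGNATURES` table and `sniff_signature` helper below are hypothetical names used only for illustration.

```python
from typing import Optional

# Illustrative table of leading-byte signatures (hypothetical, far from complete).
# libmagic's real database contains many more rules and deeper checks.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_signature(data: bytes) -> Optional[str]:
    """Return a MIME type if the leading bytes match a known signature, else None."""
    for prefix, mime in SIGNATURES:
        if data.startswith(prefix):
            return mime
    return None


print(sniff_signature(b"%PDF-1.7\n"))   # application/pdf
print(sniff_signature(b"plain text"))   # None
```

A prefix table like this only recognizes what it lists and knows nothing about text encodings or container formats, which is one reason to delegate the work to libmagic.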

Install python-magic and the required system library:

pip install python-magic

# Install libmagic system library
# macOS: brew install libmagic
# Ubuntu/Debian: sudo apt-get install libmagic1

Data Setup

To demonstrate python-magic’s capabilities, we’ll create test files with misleading extensions that highlight the limitations of extension-based file type detection:

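import os

import magic

# Make sure the output directory exists before writing test files
os.makedirs("data", exist_ok=True)

# Create standard test files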
with open("data/sample.txt", "w") as f:
    f.write("This is a sample text file for demonstration purposes.")

with open("data/sample.py", "w") as f:
    f.write("import pandas as pd; df = pd.DataFrame({'a': [1, 2, 3]}); print(df)")

# Create test files with misleading extensions
with open("data/fake_image.txt", "wb") as f:
    # Write PNG header
    f.write(b'\x89PNG\r\n\x1a\n')

with open("data/real_text.jpg", "w") as f:
    f.write("This is actually a text file")

These test files will demonstrate how python-magic detects actual file types regardless of their extensions.

Accurate File Type Detection

Python-magic examines file headers to determine actual file types:

# Detect actual file types
files = [
    "data/sample.txt",
    "data/sample.py",
    "data/fake_image.txt",
    "data/real_text.jpg",
]

for file in files:
    file_type = magic.from_file(file)
    mime_type = magic.from_file(file, mime=True)
    print(f"{file}:")
    print(f"  Type: {file_type}")
    print(f"  MIME: {mime_type}")
    print()

Output:

data/sample.txt:
  Type: ASCII text, with no line terminators
  MIME: text/plain

data/sample.py:
  Type: ASCII text, with no line terminators
  MIME: text/plain

data/fake_image.txt:
  Type: data
  MIME: application/octet-stream

data/real_text.jpg:
  Type: ASCII text, with no line terminators
  MIME: text/plain

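Python-magic looks past the extensions and reports:

  • fake_image.txt as binary data (application/octet-stream) rather than text, despite the .txt extension. The file holds only the 8-byte PNG signature with no image chunks, which is most likely why libmagic stops short of labeling it a PNG
  • real_text.jpg as plain text despite the .jpg extension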
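
The fake_image.txt result in the output above is `data`, not a PNG type. For libmagic to report a PNG, the signature most likely needs to be followed by at least an IHDR chunk. The sketch below builds a complete 1x1 PNG in memory using only the standard library. The helper names `png_chunk`, `minimal_png`, and `detect_mime` are hypothetical, and the python-magic call is guarded so the block still runs where the library is not installed.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(chunk_type: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: length, type, payload, CRC over type + payload."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)


def minimal_png() -> bytes:
    """Return the bytes of a 1x1, 8-bit grayscale PNG."""
    # IHDR fields: width, height, bit depth, color type, compression, filter, interlace
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    # One scanline: filter-type byte 0 followed by a single black pixel
    idat = zlib.compress(b"\x00\x00")
    return (
        PNG_SIGNATURE
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", idat)
        + png_chunk(b"IEND", b"")
    )


def detect_mime(data: bytes):
    """Return the MIME type from python-magic, or None if it is unavailable."""
    try:
        import magic  # requires python-magic plus the libmagic system library
    except ImportError:
        return None
    if not hasattr(magic, "from_buffer"):  # a different package named "magic"
        return None
    return magic.from_buffer(data, mime=True)


png_bytes = minimal_png()
print(len(png_bytes), detect_mime(png_bytes))
```

With python-magic installed, the second printed value should be a PNG MIME type instead of application/octet-stream, though the exact description can vary with the libmagic version.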


Practical Applications

With python-magic, we can build a validation function that verifies uploaded files match expected formats and rejects files with misleading extensions.

Before creating a validation function, let’s set up a sample CSV file for our validation example:

# Create sample CSV file for validation example
import pandas as pd

sample_data = pd.DataFrame({
    'ticket_id': [1, 2, 3, 4, 5],
    'customer_type': ['premium', 'basic', 'premium', 'basic', 'premium'],
    'issue_category': ['billing', 'technical', 'account', 'billing', 'technical'],
    'resolution_time': [24, 48, 12, 36, 18]
})

sample_data.to_csv('data/customer_support_eval.csv', index=False)

Next, create a file validation function for data processing pipelines:

def validate_uploaded_file(filepath, expected_types):
    """Validate file type matches expectations"""
    try:
        actual_mime = magic.from_file(filepath, mime=True)

        if actual_mime in expected_types:
            return True, f"Valid {actual_mime} file"
        else:
            return False, f"Expected {expected_types}, got {actual_mime}"
    except Exception as e:
        return False, f"Error reading file: {e}"


# Example usage for data analysis workflow
csv_file = "data/customer_support_eval.csv"
result, message = validate_uploaded_file(csv_file, ["text/csv", "text/plain"])
print(f"CSV validation: {message}")

# Check for potentially dangerous files
suspicious_file = "data/fake_image.txt"
result, message = validate_uploaded_file(suspicious_file, ["text/plain"])
print(f"Text validation: {message}")

Output:

CSV validation: Valid text/csv file
Text validation: Expected ['text/plain'], got application/octet-stream

The validation correctly accepts the CSV file while rejecting the fake image file despite its .txt extension.
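
A complementary check is to compare the type implied by the extension with the type detected from content. The sketch below uses the standard-library `mimetypes` module for the extension side. The `extension_matches_content` helper is a hypothetical name, and the detected MIME strings are passed in by hand here; in a pipeline they would come from `magic.from_file(path, mime=True)`.

```python
import mimetypes


def extension_matches_content(filename: str, detected_mime: str) -> bool:
    """Return True if the MIME type implied by the extension equals detected_mime."""
    guessed_mime, _ = mimetypes.guess_type(filename)
    # Files without a recognized extension never count as a match
    return guessed_mime is not None and guessed_mime == detected_mime


# Detected values below are copied from the outputs earlier in the article
print(extension_matches_content("data/fake_image.txt", "application/octet-stream"))  # False
print(extension_matches_content("data/real_text.jpg", "text/plain"))                 # False
print(extension_matches_content("photo.png", "image/png"))                           # True
```

Strict equality can be too strict for text formats: libmagic may report a CSV file as text/plain, which is why the validation example accepts both text/csv and text/plain. An allow-list per extension is a reasonable refinement.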

For robust logging in production file validation workflows, see our article Loguru: Simple as Print, Powerful as Logging.

Conclusion

Python-magic provides reliable file type detection by examining file content rather than trusting extensions. This reduces the risk of processing disguised or mislabeled files and helps data processing workflows handle each file according to what it actually contains.

For managing file validation configurations across different environments, see our article Hydra for Python Configuration: Build Modular and Maintainable Pipelines.
