Raw text data can lack the necessary context and factual details required for robust machine learning models. 

Upgini can automatically enrich any text fields with relevant facts from external data sources and generate ready-to-use numeric features from these enriched representations.

Link to Upgini.

Pandas is a single-threaded library, utilizing only a single CPU core. To achieve parallelism with Pandas, you would need to use additional libraries like Dask.

import pandas as pd
import multiprocessing as mp
import dask.dataframe as dd


df = pd.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})

# Perform the groupby and sum operation in parallel 
ddf = dd.from_pandas(df, npartitions=mp.cpu_count())
result = ddf.groupby("A").sum().compute()

Polars, on the other hand, automatically leverages the available CPU cores without any additional configuration.

import polars as pl

df = pl.DataFrame({"A": range(1_000_000), "B": range(1_000_000)})

# Perform the groupby and sum operation in parallel 
result = df.group_by("A").sum()

Link to Polars.

Interact with this code in Google Colab.

Data scientists spend much of their time cleaning data or building features. While the former is unavoidable, the latter can be automated.

tsfresh uses a robust feature selection algorithm to automatically extract relevant time series features, freeing up data scientists’ time.

To demonstrate this, start with loading an example dataset:

from tsfresh.examples.robot_execution_failures import (
    download_robot_execution_failures,
    load_robot_execution_failures,
)

download_robot_execution_failures()
timeseries, y = load_robot_execution_failures()
timeseries.head()

Output:

  id  time  F_x  F_y  F_z  T_x  T_y  T_z
0   1     0   -1   -1   63   -3   -1    0
1   1     1    0    0   62   -3   -1    0
2   1     2   -1   -1   61   -3    0    0
3   1     3   -1   -1   63   -2   -1    0
4   1     4   -1   -1   63   -3   -1    0

Extract features and select only relevant features for each time series.

from tsfresh import extract_relevant_features

# extract relevant features
features_filtered = extract_relevant_features(
    timeseries, y, column_id="id", column_sort="time"
)

You can now use the features in features_filtered to train your classification model.

Link to tsfresh.

Interact with this code in Google Colab.

Kestra has just introduced a new feature called Realtime Triggers that enables triggering workflows in response to events with millisecond latency.

For instance, with Realtime Triggers, a financial services company can call an ML inference endpoint to detect unusual transactions in real-time.

This rapid responsiveness is a new paradigm in orchestration that I haven’t seen yet in any other orchestration tool so far.

Link to Kestra.

PySpark queries with different syntax (DataFrame API or parameterized SQL) can have the same performance, as the physical plan is identical. Here is an example:

from pyspark.sql.functions import col

fruits = spark.createDataFrame(
    [("apple", 4), ("orange", 3), ("banana", 2)], ["item", "price"]
)
fruits.show()

Use the DataFrame API to filter rows where the price is greater than 3.

fruits.where(col("price") > 3).explain()
== Physical Plan ==
*(1) Filter (isnotnull(price#1L) AND (price#1L > 3))
+- *(1) Scan ExistingRDD[item#0,price#1L]

Use the spark.sql() method to execute an equivalent SQL query.

spark.sql("select * from {df} where price > 3", df=fruits).explain()
== Physical Plan ==
*(1) Filter (isnotnull(price#1L) AND (price#1L > 3))
+- *(1) Scan ExistingRDD[item#0,price#1L]

The physical plan for both queries is the same, indicating identical performance.

The choice between DataFrame API and spark.sql() depends on the following:

  • Familiarity: Use spark.sql() if your team prefers SQL syntax. Use the DataFrame API if chained method calls are more intuitive for your team.
  • Complexity of Transformations: The DataFrame API is more flexible for complex manipulations, while SQL is more concise for simpler queries.

Run this code in Google Colab.

Our Subscriber

4.1k+

Don’t miss these daily tips!

Select Frequency
Scroll to Top