Hacker News Semantic Search: Production RAG with CloudQuery and Postgres

Introduction

Building RAG pipelines typically involves moving data from APIs to vector stores, a process that is often complicated and diverts AI engineers from their core work of building intelligent applications. Teams spend weeks debugging connection timeouts and API failures instead of improving model performance.

CloudQuery eliminates this complexity so AI teams can spend time building intelligent applications instead of debugging data pipelines. The same RAG pipeline that takes weeks to build and maintain with custom code becomes a 10-line YAML configuration that runs reliably in production.

This guide demonstrates how to replace custom Python scripts with CloudQuery’s declarative YAML configuration to build production-ready RAG pipelines with pgvector integration.

Key Takeaways

Here’s what you’ll learn:

  • Configure CloudQuery to sync Hacker News stories and generate OpenAI embeddings in a single YAML workflow
  • Set up pgvector extension in PostgreSQL for storing 1536-dimensional vectors with automatic indexing
  • Query semantic similarities using LangChain Postgres similarity_search across embedded content chunks
  • Leverage CloudQuery’s write modes, batching, and time-based incremental syncing for production deployments

Introduction to CloudQuery

CloudQuery operates as a high-performance data movement framework with intelligent batching, state persistence, and automatic error recovery. Its plugin ecosystem includes 100+ source connectors and supports advanced features like pgvector integration for AI applications.

Key features:

  • 10x Faster Development: Configure complex data pipelines in minutes with pre-built connectors for AWS, GCP, Azure, and 100+ SaaS platforms instead of months of custom code
  • Production Data Management: Sync state persistence with resume capability, incremental processing for delta detection, and automatic schema evolution eliminate manual maintenance
  • Built-in Data Governance: Native transformers for column obfuscation, PII removal, and data cleaning ensure compliance without custom code
  • Optimized Resource Usage: Intelligent batching (100MB default), connection pooling, and adaptive scheduling maximize database performance while respecting API limits
  • Built for Modern AI: Native pgvector integration with automated text splitting, embedding generation, and semantic indexing for production RAG applications

Install CloudQuery and Prepare PostgreSQL

To install CloudQuery, run the following command:

# macOS
brew install cloudquery/tap/cloudquery

# or download a release (Linux example)
curl -L https://github.com/cloudquery/cloudquery/releases/download/cli-v6.29.2/cloudquery_linux_amd64 -o cloudquery
chmod +x cloudquery

For other installation methods (Windows, Docker, or package managers), visit the CloudQuery installation guide.

Create a database that will be used for this tutorial:

createdb cloudquery

Export the credentials that will be used for the later steps:

export POSTGRESQL_CONNECTION_STRING="postgresql://user:pass@localhost:5432/database"
export OPENAI_API_KEY=sk-...

Here’s how to get each credential:

  • PostgreSQL: Replace user:pass@localhost:5432/database with your actual database connection details (you can verify the connection with the optional sketch after this list)
  • OpenAI API Key: Get your API key from the OpenAI platform
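
Optionally, before running any syncs, you can confirm that both variables are set and that the connection string actually reaches your database. Here is a minimal sketch using psycopg (installed later in this tutorial); the file name check_credentials.py is just a suggestion:

# check_credentials.py -- optional sanity check, not part of CloudQuery
import os
import psycopg  # pip install "psycopg[binary]"

for var in ["POSTGRESQL_CONNECTION_STRING", "OPENAI_API_KEY"]:
    if not os.getenv(var):
        raise SystemExit(f"Missing environment variable: {var}")

# Open and immediately close a connection to confirm the string is valid
with psycopg.connect(os.environ["POSTGRESQL_CONNECTION_STRING"]) as conn:
    print("PostgreSQL connection OK:", conn.info.dbname)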

Sync Hacker News to PostgreSQL

To sync Hacker News to PostgreSQL, start by authenticating with CloudQuery Hub:

cloudquery login

Then run the following command to create a YAML file that syncs Hacker News to PostgreSQL:

cloudquery init --source hackernews --destination postgresql

This command will create a YAML file called hackernews_to_postgresql.yaml:

kind: source
spec:
  name: "hackernews"
  path: "cloudquery/hackernews"
  registry: "cloudquery"
  version: "v3.8.2"
  tables: ["*"]
  backend_options:
    table_name: "cq_state_hackernews"
    connection: "@@plugins.postgresql.connection"
  destinations:
    - "postgresql"
  spec:
    item_concurrency: 100
    start_time: 3 hours ago
---
kind: destination
spec:
  name: "postgresql"
  path: "cloudquery/postgresql"
  registry: "cloudquery"
  version: "v8.12.1"
  write_mode: "overwrite-delete-stale"
  spec:
    connection_string: "${POSTGRESQL_CONNECTION_STRING}"

We can now sync the Hacker News data to PostgreSQL by running the following command:

cloudquery sync hackernews_to_postgresql.yaml

Output:

Loading spec(s) from hackernews_to_postgresql.yaml
Starting sync for: hackernews (cloudquery/hackernews@v3.8.2) -> [postgresql (cloudquery/postgresql@v8.12.1)]
Sync completed successfully. Resources: 9168, Errors: 0, Warnings: 0, Time: 26s

Let’s check if the data was ingested successfully by connecting to the CloudQuery database:

psql -U postgres -d cloudquery

Then inspect the available tables:

\dt
 Schema |            Name             | Type  |   Owner
--------+-----------------------------+-------+------------
 public | cq_state_hackernews         | table | khuyentran
 public | hackernews_items            | table | khuyentran

CloudQuery automatically creates two tables:

  • cq_state_hackernews: tracks sync state for incremental updates
  • hackernews_items: contains the actual Hacker News data

View the schema of the hackernews_items table:

\d hackernews_items

Output:

                        Table "public.hackernews_items"
     Column      |            Type             | Collation | Nullable | Default
-----------------+-----------------------------+-----------+----------+---------
 _cq_sync_time   | timestamp without time zone |           |          |
 _cq_source_name | text                        |           |          |
 _cq_id          | uuid                        |           | not null |
 _cq_parent_id   | uuid                        |           |          |
 id              | bigint                      |           | not null |
 deleted         | boolean                     |           |          |
 type            | text                        |           |          |
 by              | text                        |           |          |
 time            | timestamp without time zone |           |          |
 text            | text                        |           |          |
 dead            | boolean                     |           |          |
 parent          | bigint                      |           |          |
 kids            | bigint[]                    |           |          |
 url             | text                        |           |          |
 score           | bigint                      |           |          |
 title           | text                        |           |          |
 parts           | bigint[]                    |           |          |
 descendants     | bigint                      |           |          |

Check the first 5 rows with type story:

SELECT id, type, score, title FROM hackernews_items WHERE type='story' LIMIT 5;
    id    | type  | score |                                 title
----------+-------+-------+-----------------------------------------------------------------------
 45316982 | story |     3 | Ask HN: Why don't Americans hire human assistants for everyday tasks?
 45317015 | story |     2 | Streaming Live: San Francisco Low Riders Festival
 45316989 | story |     1 | The Collapse of Coliving Operators, and Why the Solution Is Upstream
 45317092 | story |     0 |
 45317108 | story |     1 | Patrick McGovern was the maven of ancient tipples

Add pgvector Embeddings

pgvector is a PostgreSQL extension that adds vector similarity search capabilities, perfect for RAG applications. For a complete guide on implementing RAG with pgvector, see our semantic search tutorial.

CloudQuery provides built-in pgvector support, automatically generating embeddings alongside your data sync. First, enable the pgvector extension in PostgreSQL:

psql -d cloudquery -c "CREATE EXTENSION IF NOT EXISTS vector;"

Now add the pgvector configuration to the hackernews_to_postgresql.yaml:

kind: source
spec:
  name: "hackernews"
  path: "cloudquery/hackernews"
  registry: "cloudquery"
  version: "v3.8.2"
  tables: ["*"]
  backend_options:
    table_name: "cq_state_hackernews"
    connection: "@@plugins.postgresql.connection"
  destinations:
    - "postgresql"
  spec:
    item_concurrency: 100
    start_time: 3 hours ago
---
kind: destination
spec:
  name: "postgresql"
  path: "cloudquery/postgresql"
  registry: "cloudquery"
  version: "v8.12.1"
  write_mode: "overwrite-delete-stale"
  spec:
    connection_string: "${POSTGRESQL_CONNECTION_STRING}"

    pgvector_config:
      tables:
        - source_table_name: "hackernews_items"
          target_table_name: "hackernews_items_embeddings"
          embed_columns: ["title"]
          metadata_columns: ["id", "type", "url", "by"]
          filter_condition: "type = 'story' AND title IS NOT NULL AND title != ''"
      text_splitter:
        recursive_text:
          chunk_size: 200
          chunk_overlap: 0
      openai_embedding:
        api_key: "${OPENAI_API_KEY}"
        model_name: "text-embedding-3-small"
        dimensions: 1536

Explanation of pgvector configuration:

  • source_table_name: The original CloudQuery table to read data from
  • target_table_name: New table where embeddings and metadata will be stored
  • embed_columns: Which columns to convert into vector embeddings (only non-empty text is processed)
  • metadata_columns: Additional columns to preserve alongside embeddings for filtering and context
  • filter_condition: SQL WHERE clause to only embed specific rows (stories with non-empty titles)
  • chunk_size: Maximum characters per text chunk (short titles become single chunks)
  • chunk_overlap: Overlapping characters between chunks to preserve context across boundaries (see the splitter sketch after this list)
  • model_name: OpenAI embedding model (text-embedding-3-small offers 5x cost savings vs ada-002)
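
To get a feel for how the chunking settings behave, here is a small illustration using LangChain’s RecursiveCharacterTextSplitter with the same chunk_size and chunk_overlap values. It approximates the recursive_text splitter’s behavior rather than reproducing CloudQuery’s internal implementation:

# Illustration only: approximates the recursive_text settings above
from langchain_text_splitters import RecursiveCharacterTextSplitter  # pip install langchain-text-splitters

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)

short_title = "Ask HN: Why don't Americans hire human assistants for everyday tasks?"
print(splitter.split_text(short_title))     # one chunk: the title is well under 200 characters

long_text = "CloudQuery syncs Hacker News items into PostgreSQL. " * 10
print(len(splitter.split_text(long_text)))  # longer text is broken into multiple chunks of at most 200 characters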

Since we’ve already synced the data, let’s reset the saved sync state so the next run re-processes the items and generates embeddings:

psql -U postgres -d cloudquery -c 'DROP TABLE IF EXISTS cq_state_hackernews'

Run the enhanced sync:

cloudquery sync hackernews_to_postgresql.yaml

CloudQuery now produces an embeddings table alongside the source data:

psql -U postgres -d cloudquery

List the available tables:

\dt

Output:

 Schema |            Name             | Type  |   Owner
--------+-----------------------------+-------+------------
 public | cq_state_hackernews         | table | khuyentran
 public | hackernews_items            | table | khuyentran
 public | hackernews_items_embeddings | table | khuyentran

Inspect the embeddings table structure:

-- Check table structure and vector dimensions
\d hackernews_items_embeddings;

Output:

cloudquery=# \d hackernews_items_embeddings;
                  Table "public.hackernews_items_embeddings"
    Column     |            Type             | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------
 _cq_id        | uuid                        |           |          |
 id            | bigint                      |           | not null |
 type          | text                        |           |          |
 url           | text                        |           |          |
 _cq_sync_time | timestamp without time zone |           |          |
 chunk         | text                        |           |          |
 embedding     | vector(1536)                |           |          |

Sample a few records to see the data:

-- Sample a few records to see the data
SELECT id, type, url, by, chunk
FROM hackernews_items_embeddings
LIMIT 5;

Output:

 type  |       by        |                         chunk
-------+-----------------+--------------------------------------------------------
 story | Kaibeezy        | Autonomous Airport Ground Support Equipment
 story | drankl          | Dining across the divide: 'We disagreed on...
 story | wjSgoWPm5bWAhXB | World's First AI-designed viruses a step towards...
 story | Brajeshwar      | Banned in the U.S. and Europe, Huawei aims for...
 story | danielfalbo     | Zig Z-ant: run ML models on microcontrollers

Check for NULL embeddings:

SELECT COUNT(*) as rows_without_embeddings
FROM hackernews_items_embeddings
WHERE embedding IS NULL;

Output:

 rows_without_embeddings
----------------------
                    0

Great! There are no NULL embeddings.

Check the chunk sizes and content distribution:

-- Check chunk sizes and content distribution
SELECT
    type,
    COUNT(*) as count,
    AVG(LENGTH(chunk)) as avg_chunk_length,
    MIN(LENGTH(chunk)) as min_chunk_length,
    MAX(LENGTH(chunk)) as max_chunk_length
FROM hackernews_items_embeddings
GROUP BY type;

Output:

 type  | count |  avg_chunk_length   | min_chunk_length | max_chunk_length
-------+-------+---------------------+------------------+------------------
 story |   695 | 52.1366906474820144 |                4 |               86

The 52-character average chunk length confirms that most Hacker News titles fit comfortably within the configured 200-character chunk_size limit, validating the text splitter settings.
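
Before layering LangChain on top, it helps to see what a similarity query looks like at the SQL level. The following sketch embeds a query string with the same OpenAI model and ranks rows by pgvector’s cosine distance operator (<=>); the query text is just an example, and it assumes the environment variables exported earlier:

# Minimal raw pgvector query sketch (assumes the tables created above)
import os
import psycopg
from openai import OpenAI  # pip install openai "psycopg[binary]"

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Embed the query text with the same model used during the sync
query_vec = client.embeddings.create(
    model="text-embedding-3-small",
    input="self-hosted databases",
).data[0].embedding

# pgvector accepts a bracketed string literal such as "[0.1,0.2,...]"
vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"

with psycopg.connect(os.environ["POSTGRESQL_CONNECTION_STRING"]) as conn:
    rows = conn.execute(
        """
        SELECT chunk, embedding <=> %s::vector AS cosine_distance
        FROM hackernews_items_embeddings
        ORDER BY embedding <=> %s::vector
        LIMIT 5
        """,
        (vec_literal, vec_literal),
    ).fetchall()

for chunk, distance in rows:
    print(f"{distance:.3f}  {chunk}")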

Semantic Search with LangChain Postgres

Now that we have embeddings stored in PostgreSQL, let’s use LangChain Postgres to handle vector operations.

Start with installing the necessary packages:

pip install langchain-postgres langchain-openai psycopg[binary] greenlet

Next, set up the embedding service that will convert text queries into vector representations for similarity matching. In this example, we’ll use the OpenAI embedding model.

import os
from langchain_postgres import PGEngine, PGVectorStore
from langchain_openai import OpenAIEmbeddings

# Initialize embeddings (requires OPENAI_API_KEY environment variable)
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

Connect to the hackernews_items_embeddings table generated by CloudQuery containing the pre-computed embeddings.

# Initialize connection engine
CONNECTION_STRING = "postgresql+asyncpg://khuyentran@localhost:5432/cloudquery"
engine = PGEngine.from_connection_string(url=CONNECTION_STRING)

# Connect to existing CloudQuery vector store table
vectorstore = PGVectorStore.create_sync(
    engine=engine,
    table_name="hackernews_items_embeddings",
    embedding_service=embeddings,
    content_column="chunk",
    embedding_column="embedding",
    id_column="id",  # Map to CloudQuery's id column
    metadata_columns=["type", "url", "by"],  # Include story metadata
)

Finally, we can query the vector store to find semantically similar content.

# Semantic search for Apple-related stories
docs = vectorstore.similarity_search("Apple technology news", k=4)
for doc in docs:
    print(f"Title: {doc.page_content}")

Output:

Title: Apple takes control of all core chips in iPhoneAir with new arch prioritizing AI
Title: Apple used AI to uncover new blood pressure notification
Title: Apple Losing Talent to OpenAI
Title: Standard iPhone 17 Outperforms Expectations as Apple Ramps Up Manufacturing

The vector search successfully identifies relevant Apple content beyond exact keyword matches, capturing stories about AI development, talent acquisition, and product development that share semantic meaning with the query.
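
To inspect the matches more closely, you can also return similarity scores and the metadata columns mapped above. This sketch assumes the standard LangChain similarity_search_with_score method is available on the vector store:

# Retrieve similarity scores and the metadata columns (type, url, by) alongside each chunk
results = vectorstore.similarity_search_with_score("Apple technology news", k=4)
for doc, score in results:
    print(f"score={score:.3f} | by={doc.metadata.get('by')} | url={doc.metadata.get('url')}")
    print(f"  {doc.page_content}")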

Other CloudQuery Features

Let’s explore some of the other features of CloudQuery to enhance your pipeline.

Multi-Destination Support

CloudQuery can route the same source data to multiple destinations in a single sync operation:

destinations: ["postgresql", "bigquery", "s3"]

This capability means you can simultaneously:

  • Store operational data in PostgreSQL for real-time queries
  • Load analytical data into BigQuery for data warehousing
  • Archive raw data to S3 for compliance and backup

Write Modes

CloudQuery provides different write modes to control how data updates are handled:

Smart Incremental Updates:

Smart incremental updates is the default write mode and is recommended for most use cases. It updates existing records and removes any data that’s no longer present in the source. This is perfect for maintaining accurate, up-to-date datasets where items can be deleted or modified.

write_mode: "overwrite-delete-stale"

Append-Only Mode:

Append-only mode adds new records without modifying existing ones. Ideal for time-series data, logs, or when you want to preserve historical versions of records.

write_mode: "append"

Selective Overwrite:

Selective overwrite mode replaces existing records with matching primary keys but doesn’t remove stale data. Useful when you want to refresh matching records while keeping rows that no longer appear in the source.

write_mode: "overwrite"

Performance Batching

You can optimize memory usage and database performance by configuring the batch size and timeout:

spec:
  batch_size: 1000              # Records per batch
  batch_size_bytes: 4194304     # 4MB memory limit
  batch_timeout: "20s"          # Max wait between writes

Details of the parameters:

  • batch_size: Number of records grouped together before writing.
  • batch_size_bytes: Maximum memory size per batch in bytes.
  • batch_timeout: Time limit before writing partial batches.

Retry Handling

The max_retries parameter ensures reliable data delivery by automatically retrying failed write operations a specified number of times before marking them as permanently failed.

spec:
  max_retries: 5                # Number of retry attempts

Time-Based Incremental Syncing

CloudQuery’s start_time configuration prevents unnecessary data reprocessing by only syncing records created after a specified timestamp, dramatically reducing sync time and resource usage:

spec:
  start_time: "7 days ago"       # Only sync recent data
  # Alternative formats:
  # start_time: "2024-01-15T10:00:00Z"  # Specific timestamp
  # start_time: "1 hour ago"            # Relative time

See the source plugin configuration and destination plugin configuration documentation for all available options.

What You’ve Built

In this tutorial, you’ve created a production-ready RAG pipeline that:

  • Syncs live data from Hacker News with automatic retry and state persistence
  • Generates vector embeddings using OpenAI’s latest models
  • Enables semantic search across thousands of posts and comments
  • Scales to handle any CloudQuery-supported data source (100+ connectors)

All with zero custom ETL code and enterprise-grade reliability.

Next steps:

  • Explore the Hacker News source documentation for advanced filtering options (top, best, ask HN, etc.)
  • Add additional text sources (GitHub issues, Typeform surveys, Airtable notes) to the same pipeline
  • Schedule CloudQuery with cron, Airflow, or Kubernetes for continuous refresh (see the Airflow sketch below)
  • Integrate with LangChain PGVector or LlamaIndex PGVector in production RAG systems
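
As one example of scheduling, here is a minimal Airflow 2.x DAG sketch that shells out to the CloudQuery CLI every hour; the DAG id, file path, and Variable names are placeholders to adapt to your own setup:

# Minimal Airflow DAG sketch for a recurring CloudQuery sync (paths and schedule are placeholders)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hackernews_rag_sync",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # start_time in the source spec keeps each run incremental
    catchup=False,
) as dag:
    sync = BashOperator(
        task_id="cloudquery_sync",
        bash_command="cloudquery sync /opt/pipelines/hackernews_to_postgresql.yaml",
        env={
            "POSTGRESQL_CONNECTION_STRING": "{{ var.value.postgres_connection_string }}",
            "OPENAI_API_KEY": "{{ var.value.openai_api_key }}",
        },
    )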
