How to Build Analytics Applications with EDB Postgres AI Analytics Accelerator: A Step-by-Step Guide
Imagine having a high-performance analytics engine that could transform your Postgres database into a columnar powerhouse, capable of processing billions of rows in seconds without ETL pipelines. Now, picture querying both your transactional data and modern lakehouse formats with a single system that separates compute from storage, allowing you to scale resources independently based on workload demands.
In this tutorial, we'll take you on a journey through EDB Postgres AI Analytics Accelerator, showing you how to transform this vision into reality. We'll demonstrate how to run lightning-fast queries across both Postgres databases and modern lakehouse environments while maintaining complete data sovereignty and delivering exceptional analytical capabilities to your organization.
Analytics Accelerator Overview
EDB Postgres AI Analytics Accelerator transforms Postgres into a columnar analytics database, allowing you to run analytics without degrading transactional performance. It powers rapid queries with an analytics engine that separates compute from storage and eliminates ETL pipelines by providing compatibility across both lakehouse and legacy data stores—enabling advanced AI-powered business intelligence.
In summary, the Analytics Accelerator consists of a vectorized query engine attached to an ephemeral Postgres instance, with key components including:
- A decoupled columnar storage layer supporting Delta Lake and Apache Iceberg formats
- The PGFS extension for object storage connectivity
- The PGAA extension for vectorized query processing
- Integration with Iceberg REST Catalog for metadata management
Separation of storage from compute in Analytics Accelerator means the two can be scaled independently, which is ideal for analytical workloads that often have unpredictable and "spiky" query patterns. Because the “source of truth” is in the storage layer, multiple Postgres engines can query the same data simultaneously without impacting each other. Data is stored in object storage using highly compressible formats, and compute resources can be scripted to be provisioned only when needed for queries, avoiding idle machine costs.
For more information on its architecture and configuration options, refer to the Analytics Accelerator documentation.
Quick Start
Let’s kick things off by provisioning a new Lakehouse cluster in EDB Postgres AI Hybrid Manager (HM). This is a single-node Postgres database that comes preloaded with the Analytics Accelerator extension. If you prefer self-hosting, you can also download the extension (pgaa) as packaged software from EDB software repositories, and install it in any Postgres instance, with support for PostgreSQL, EnterpriseDB Advanced Server, and Postgres Extended Server.
Hybrid Manager Console is the control plane of the EDB Postgres AI platform, offering a unified experience for deploying databases, running GenAI workloads, and scaling Postgres-native analytics across cloud, on-prem, or hybrid environments. HM Console runs anywhere you run Kubernetes, and comes with an intuitive web interface and a REST API for managing and monitoring Postgres clusters.
HM Console organizes database resources as projects. First, log into the console and create a project. On your project page, select Create New > Lakehouse Analytics.
The Create Analytical Cluster page opens, where you can choose between a templated build and a custom build for the Lakehouse cluster.
Select Custom Build, and then select Start from Scratch to configure the cluster size and other settings.
After you create the cluster, you return to the Clusters page, where you can see the status of your cluster as it's provisioned. Provisioning can take up to 10-15 minutes, but is usually quicker.
The Lakehouse cluster is composed of individual Postgres nodes, and those nodes are ephemeral: none of the data is stored on the local disk except internal system tables. Otherwise, the disk is used only as a temporary cache for Parquet data and as a "spill-out" space for queries that don't fit in memory. Think of it as an elastic query engine with true "scale to zero" capabilities.
Connect to the Cluster
You can connect to a Lakehouse cluster node with any Postgres client in the same way that you connect to any other cluster from EDB Postgres AI Platform.
Navigate to the cluster detail page (Connect) and copy its connection string.
In general, you can connect to the database with any Postgres client, including psql, pgcli, pgAdmin, and many more.
Here's how you can use psql to connect to the Lakehouse node from your local machine:
psql -U edb_admin -h p-q0iej9l6i1-rw-external-bb4307cac7b13d49.elb.us-east-1.amazonaws.com
Make sure to change the hostname to match yours.
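Once connected, you can confirm that the analytics extensions are in place. Here's a minimal check, assuming the extensions are registered under the names pgaa and pgfs, as they are on HM-provisioned Lakehouse nodes:
-- List the analytics extensions installed in this database
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('pgaa', 'pgfs');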
Inspect Built-In Datasets
Every Lakehouse cluster comes pre-configured with a Catalog of benchmarking datasets stored as Delta Lake tables in a public S3 bucket. The datasets include TPC-H and TPC-DS (with scale factors of 1, 10, 100, and 1000), ClickBench, and the “1 billion row challenge.”
If you're using psql, you can run \dn to see the available schemas.
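The exact listing depends on the datasets bundled with your cluster; it should include schemas such as clickbench and brc_1b (used below), along with schemas for the TPC-H and TPC-DS datasets at each scale factor. An abbreviated, illustrative view:
+------------+-----------+
| Name       | Owner     |
|------------+-----------|
| brc_1b     | ...       |
| clickbench | ...       |
| ...        | ...       |
+------------+-----------+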
Let's try running some basic queries to get an idea of the cluster's performance.
edb_admin> select count(*) from clickbench.hits;
+----------+
| count |
|----------|
| 99997497 |
+----------+
SELECT 1
Time: 0.945s
edb_admin> select count(*) from brc_1b.measurements;
+------------+
| count |
|------------|
| 1000000000 |
+------------+
SELECT 1
Time: 0.651s
Notice how both queries took less than a second to return results. That's because, for the Analytics Accelerator, a count query is a metadata-only operation that reads the footers of the Parquet files making up the table. Unlike with standard Postgres, there is no need to analyze the table or scan any data or index. Whether there are 10 million rows or 1 billion rows, the query time is essentially the same.
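If you're curious how a given query is executed, a standard EXPLAIN works as usual and can help you see whether the work is pushed down to the analytics engine or falls back to standard Postgres execution (the exact plan output varies by PGAA version):
-- Inspect the plan for the count query above
EXPLAIN SELECT count(*) FROM clickbench.hits;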
But the Analytics Accelerator can do more than just count rows! It's fully compatible with all Postgres syntax, and in most cases the analytical engine can handle the full query for a massive speedup. When it can't handle the full query, it falls back to standard Postgres execution. Let's try running a more complex query on the TPC-H data:
edb_admin> SELECT
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
FROM
supplier,
(
SELECT
l_suppkey AS supplier_no,
sum(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM
lineitem
WHERE
l_shipdate >= CAST('1996-01-01' AS date)
AND l_shipdate < CAST('1996-04-01' AS date)
GROUP BY
supplier_no) revenue0
WHERE
s_suppkey = supplier_no
AND total_revenue = (
SELECT
max(total_revenue)
FROM (
SELECT
l_suppkey AS supplier_no,
sum(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM
lineitem
WHERE
l_shipdate >= CAST('1996-01-01' AS date)
AND l_shipdate < CAST('1996-04-01' AS date)
GROUP BY
supplier_no) revenue1)
ORDER BY
s_suppkey;
s_suppkey | s_name | s_address | s_phone | total_revenue
-----------+--------------------+-------------------+-----------------+---------------
8449 | Supplier#000008449 | 5BXWsJERA2mP5OyO4 | 20-469-856-8873 | 1772627.2087
(1 row)
Time: 634.689 ms
In just 634 ms, the query returned results after joining multiple tables, then aggregating and sorting the data.
Querying Lakehouse Tables
Lakehouse tables represent a modern approach to data storage that combines the best features of data warehouses and data lakes. These tables are stored as Parquet files, a columnar format which optimizes query performance by allowing analytics engines to read only the specific columns needed for a query rather than scanning entire rows, and providing metadata the engines can use to read only the precise byte ranges required for their query. This columnar structure is especially beneficial for analytical workloads that typically analyze, aggregate and summarize specific columns across large datasets.
Analytics Accelerator supports querying popular open table formats like Apache Iceberg and Delta Lake. These formats provide important capabilities including ACID transactions for multiple writers, schema evolution, and interoperability with an ecosystem of query engines and tools. Both formats store metadata separately from the data files, enabling efficient data management at scale.
The PGAA extension installed on Lakehouse nodes enables Postgres to efficiently read these lakehouse tables directly from object storage systems like S3, MinIO, or Azure Blob Storage. It accomplishes this through vectorized query processing, where operations are performed on batches of data rather than row by row, dramatically improving performance for analytical workloads. This allows users to query petabyte-scale datasets with the familiar SQL interface of Postgres while achieving performance comparable to dedicated data warehouse solutions.
With the newly provisioned Lakehouse node in HM, let’s query some Lakehouse tables.
Querying Iceberg Tables
Let's assume you have several Apache Iceberg tables in S3-compatible object storage with appropriate storage permissions. First we’ll show how to query them directly, and then we’ll show how to query them via an Iceberg REST Catalog (external or within the Hybrid Manager).
The first step is to create a storage location with the PGFS (Postgres File System) extension, which is already installed in the Lakehouse node. PGFS provides an interface between PostgreSQL and object storage systems (or any POSIX filesystem), enabling Lakehouse nodes to access and query data files directly from object stores. This extension allows you to mount external object storage locations as virtual file systems within PostgreSQL, creating a seamless integration between your database and your data lake. This connection is essential for querying lakehouse tables (like Iceberg and Delta) that are stored in these external systems.
SELECT pgfs.create_storage_location(
name => 'my_s3_iceberg_data',
url => 's3://your-bucket/path/to/iceberg',
options => '{}',
credentials => '{"access_key_id": "...", "secret_access_key": "..."}'
);
Next, create a PGAA external table for Iceberg:
CREATE TABLE public.my_sales_iceberg_data (
sales_id INT,
sales_date DATE,
sales_amount NUMERIC
)
USING PGAA
WITH (
pgaa.storage_location = 'my_s3_iceberg_data',
pgaa.path = 'sales_records/iceberg_table_root',
pgaa.format = 'iceberg'
);
This table exposes your Iceberg table to Postgres, allowing you to query it like:
SELECT * FROM public.my_sales_iceberg_data
WHERE sales_amount > 1000
LIMIT 100;
In the example above, we're directly accessing an Iceberg table by pointing to its location in object storage. This direct connection approach provides significant advantages, including simplicity (no separate catalog service to maintain), reduced dependencies (requires only storage location and credentials), and quick access that's ideal for ad-hoc queries and exploration with minimal operational overhead.
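Before moving on to catalogs, note that such an external table supports full analytical SQL, not just simple filters. Here's a quick sketch of an aggregation over the columns defined above (adjust names to your own schema):
-- Daily revenue computed directly over the Iceberg files in object storage
SELECT sales_date,
       sum(sales_amount) AS daily_revenue
FROM public.my_sales_iceberg_data
GROUP BY sales_date
ORDER BY daily_revenue DESC
LIMIT 10;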
Another method of accessing Iceberg tables is via an Iceberg REST Catalog: a metadata management system that tracks and organizes Iceberg tables, enabling efficient data discovery, access, and versioning across different compute engines. It stores information about table locations, schemas, snapshots, and partitioning, allowing seamless integration with engines like Spark, Trino, and Flink for querying large-scale, mutable datasets in data lakes.
Querying Iceberg tables via an Iceberg REST Catalog provides significant additional benefits:
- Table discovery: Catalogs maintain an inventory of all available tables
- Schema evolution tracking: Catalogs track schema changes over time
- Metadata management: More efficient handling of table metadata
- Centralized permissions: Better control over access management
- Transaction coordination: Ensures ACID compliance across multiple writers
To connect to an Iceberg catalog from your Lakehouse node, you need a few basic configuration details from any catalog implementing the Iceberg REST Catalog standard. Some examples are Snowflake Open Catalog, Databricks Unity Catalog, Amazon S3 Tables, and Lakekeeper, which is also available as a managed service within the Hybrid Manager.
Once the catalog is set up, you can register it with the Lakehouse node using PGAA. For example, here’s how to register the HM-managed Lakekeeper catalog:
SELECT pgaa.add_catalog(
'hm_lakekeeper_main',
'iceberg-rest',
'{
"url": "<http://hm.example.com/catalog/v1>",
"token": "your_hm_api_key",
"warehouse": "lakehouse_warehouse_1"
}'
);
You can visit the Catalogs section in the HM Console to access the catalog configurations specific to your Lakehouse cluster.
After adding the catalog, run pgaa.attach_catalog() to make the catalog's tables visible and queryable in Postgres. This also starts a background process that keeps your Postgres catalog in sync with the Iceberg REST Catalog, so when you add, delete, or alter tables upstream, the changes are reflected in Postgres as well. If you don't want to keep them in sync, you can run import_catalog with the same syntax for a one-time import of the metadata.
SELECT pgaa.attach_catalog('your_catalog_alias');
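For a one-time import with no background sync, the call takes the same argument (a sketch based on the syntax described above):
-- Import the catalog's table metadata once, without ongoing synchronization
SELECT pgaa.import_catalog('your_catalog_alias');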
If you prefer to manually add tables from the Catalog, you can do that too, using standard CREATE TABLE syntax:
CREATE TABLE public.catalog_managed_sensor_data (
device_id TEXT,
event_time TIMESTAMP WITH TIME ZONE,
temperature FLOAT,
humidity FLOAT
)
USING PGAA
WITH (
pgaa.format = 'iceberg',
pgaa.managed_by = 'your_catalog_alias',
pgaa.catalog_namespace = 'iot_data',
pgaa.catalog_table = 'hourly_sensor_readings'
);
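Once defined, a catalog-managed table is queried exactly like the directly attached ones. For example, using the columns declared above (an illustrative sketch):
-- Average temperature per device over the most recent day of data
SELECT device_id,
       avg(temperature) AS avg_temp
FROM public.catalog_managed_sensor_data
WHERE event_time >= now() - interval '1 day'
GROUP BY device_id
ORDER BY avg_temp DESC;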
For production environments, using a catalog is generally recommended as it provides a more robust foundation for data governance (the systematic management of data assets through policies, standards, and controls), schema evolution, and multi-user access. The direct access method we demonstrated earlier is perfect for quick exploration or simpler use cases where the additional catalog infrastructure isn't necessary.
Querying Delta Tables
Querying Delta tables is almost identical to querying Iceberg tables, except that we won't use a catalog. In the future, we might add support for querying Delta tables via a catalog, especially if the Iceberg REST Catalog gains support for governing Delta tables.
Assume you have an HM-provisioned Lakehouse node and several Delta tables in S3-compatible object storage.
First, create a storage location:
SELECT pgfs.create_storage_location(
name => 'my_public_delta_lake_store',
url => 's3://my-public-delta-data/',
options => '{"aws_skip_signature": "true"}',
credentials => '{}'
);
Then, create an external table referencing a Delta Table in a path in that storage location:
CREATE TABLE public.sales_delta_table ()
USING PGAA
WITH (
pgaa.storage_location = 'my_public_delta_lake_store',
pgaa.path = 'path/to/delta_table_root/',
pgaa.format = 'delta'
);
Finally, query this table as if you are querying a regular Postgres table.
SELECT order_id, customer_name, sale_amount
FROM public.sales_delta_table
WHERE sale_date >= '2023-01-01'
AND product_category = 'Electronics'
LIMIT 100;
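Because these external tables live alongside ordinary Postgres tables in the same database, you can also join lakehouse data with transactional data in a single query. Here's a hedged sketch, assuming a small heap table of category targets created purely for illustration:
-- A regular, row-oriented Postgres table (illustrative reference data)
CREATE TABLE public.category_targets (
    product_category TEXT PRIMARY KEY,
    revenue_target   NUMERIC
);

-- Join the Delta table with the heap table in one query
SELECT d.product_category,
       sum(d.sale_amount)    AS total_revenue,
       max(t.revenue_target) AS revenue_target
FROM public.sales_delta_table d
JOIN public.category_targets t USING (product_category)
GROUP BY d.product_category
ORDER BY total_revenue DESC;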
Offloading Older Data with Tiered Tables
So far, we've explored the Analytics Accelerator's capabilities for reading Lakehouse tables. With Tiered Tables, the Analytics Accelerator can also write to Lakehouse tables.
Tiered Tables are a native capability of EDB Postgres AI high availability configurations, supported by the Analytics Accelerator architecture. They enable managing large, time-based datasets efficiently by moving "cold" data to cost-effective object storage, while keeping "hot" data in primary transactional storage. This can be fully automated, or you can manually offload tables, partitions, or rows matching a specific query.
Tiered tables automatically offload older partitions of time-partitioned tables from EDB Postgres AI to object storage (Apache Iceberg format) using:
- PGD AutoPartition for automated partitioning and lifecycle control
- PGAA and PGFS for querying and accessing offloaded data
- Optional Iceberg catalogs for governance and interoperability
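To make that workflow concrete, here's a rough, hypothetical sketch of an AutoPartition-driven setup: monthly partitions on a time-partitioned table, with partitions older than six months offloaded to Iceberg in object storage. The exact parameter names are assumptions here; consult the Tiered Tables and PGD AutoPartition documentation for the precise interface in your version.
-- Hypothetical sketch: parameter names are assumptions, check the docs
SELECT bdr.autopartition(
    relation := 'public.sensor_events',          -- time-partitioned table
    partition_increment := '1 month',            -- create monthly partitions
    analytics_offload_period := '6 months'       -- offload older partitions to Iceberg
);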
Wrapping It Up
In this tutorial, we've journeyed through the powerful capabilities of EDB Postgres AI Analytics Accelerator, demonstrating how it transforms Postgres into a high-performance analytical engine. We've explored how to query both Iceberg and Delta Lake tables directly from object storage, leveraging the PGAA extension for vectorized processing of petabyte-scale datasets.
We've also introduced Tiered Tables, which intelligently manage data lifecycle by automatically moving cold data to cost-effective storage while maintaining seamless query access across all tiers. By eliminating ETL pipelines and providing a unified interface for both transactional and analytical workloads, the Analytics Accelerator represents a significant advancement in database technology—bringing together the best of Postgres reliability with modern analytical performance and lakehouse flexibility.
To try out the steps above, start by installing EDB Postgres AI Hybrid Manager (HM) on your preferred platform and accessing the HM Console. To learn more about how the Analytics Accelerator can transform Postgres into a powerful engine for business insights, read our comprehensive launch blog and explore the documentation.