Reference - EDB Postgres Lakehouse
Postgres Lakehouse is an early product. Eventually, it will support deployment modes across multiple clouds and on-premises. Currently, however, it's limited in terms of where you can deploy it and what data you can query with it.
To get the best experience with Postgres Lakehouse, you should follow the "quick start" guide to query benchmarking data. Then you can try loading your own data with Lakehouse Sync. If you're intrigued, reach out to us and we can talk more about your use case and potential opportunities.
This page details some of the important bits to know.
Supported cloud providers and regions
AWS only: Currently, support for all Lakehouse features (Lakehouse nodes, Managed Storage Locations, and Lakehouse Sync) is limited to AWS.
EDB-hosted only: "Bring Your Own Account" (BYOA) regions are NOT currently supported for Lakehouse resources. Support is limited to EDB Postgres® AI - Hosted environments on AWS (also known as "EDB-hosted AWS regions").
This means you can select from one of the following regions:
- North America
  - US East 1
  - US East 2
  - US West 2
- Europe
  - EU Central 1
  - EU West 1
  - EU West 2
- Asia
  - AP South 1
- Australia
  - AP SouthEast 2
To be precise:
- Lakehouse nodes can only be provisioned in EDB-hosted AWS regions
- Managed Storage Locations can only be created in EDB-hosted AWS regions
- Lakehouse Sync can only sync from source databases in EDB-hosted AWS regions
These limitations will be removed as we continue to improve the product. Eventually, we will support BYOA, as well as Azure and GCP, for all Lakehouse use cases. We will also add better support for "external" buckets ("bring your own bucket").
Supported AWS instances
When deploying a Lakehouse node, you must choose an instance type from the `m6id` family of instances. Importantly, these instances come with NVMe drives attached to them.
Instances are ephemeral. The NVMe drives are used only for spill-out space while processing queries and for caching Delta Tables on disk. All data on the NVMe drives is lost when the cluster is shut down.
System tables are persisted. Persistent data in system tables (users, roles, etc.) is stored on an attached block storage device and survives a pause/resume cycle.
Supported instances
| API Name | Memory | vCPUs | Cores | Storage |
|---|---|---|---|---|
| m6id.large | 8.0 GiB | 2 vCPUs | 1 | 118 GB NVMe SSD |
| m6id.xlarge | 16.0 GiB | 4 vCPUs | 2 | 237 GB NVMe SSD |
| m6id.2xlarge | 32.0 GiB | 8 vCPUs | 4 | 474 GB NVMe SSD |
| m6id.4xlarge | 64.0 GiB | 16 vCPUs | 8 | 950 GB NVMe SSD |
| m6id.8xlarge | 128.0 GiB | 32 vCPUs | 16 | 1900 GB NVMe SSD |
| m6id.12xlarge | 192.0 GiB | 48 vCPUs | 24 | 2850 GB (2 * 1425 GB NVMe SSD) |
| m6id.16xlarge | 256.0 GiB | 64 vCPUs | 32 | 3800 GB (2 * 1900 GB NVMe SSD) |
| m6id.24xlarge | 384.0 GiB | 96 vCPUs | 48 | 5700 GB (4 * 1425 GB NVMe SSD) |
| m6id.32xlarge | 512.0 GiB | 128 vCPUs | 64 | 7600 GB (4 * 1900 GB NVMe SSD) |
Available benchmarking datasets
When you provision a Lakehouse node, it comes preconfigured to point to a public S3 bucket in the same region, which contains sample benchmarking datasets.
You can query tables in these datasets by referencing them with their schema name; see the example after the table below.
| Schema Name | Dataset |
|---|---|
| tpcds_sf_1 | TPC-DS, Scale Factor 1 |
| tpcds_sf_10 | TPC-DS, Scale Factor 10 |
| tpcds_sf_100 | TPC-DS, Scale Factor 100 |
| tpcds_sf_1000 | TPC-DS, Scale Factor 1000 |
| tpch_sf_1 | TPC-H, Scale Factor 1 |
| tpch_sf_10 | TPC-H, Scale Factor 10 |
| tpch_sf_100 | TPC-H, Scale Factor 100 |
| tpch_sf_1000 | TPC-H, Scale Factor 1000 |
| clickbench | ClickBench, 100 million rows |
| brc_1b | Billion row challenge |
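For instance, a quick sanity check against one of the sample schemas (a sketch; `nation` is a standard TPC-H table name, assumed to be present under `tpch_sf_1`):

```sql
-- Reference sample tables by their schema-qualified name.
SELECT count(*) FROM tpch_sf_1.nation;
```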
Notes about ClickBench data:
- Date columns (`EventDate`) are integers, not dates.
- You must quote ClickBench column names, because they contain uppercase letters and Postgres folds unquoted identifiers to lowercase. For example:
  - ✅ `select "Title" from clickbench.hits;`
  - 🚫 `select Title from clickbench.hits;`
User management
When you provision a Lakehouse node, you must provide a password. We do not save this password. You will need it to log in as the `edb_admin` user. This is not a superuser account, but it does have the ability to create users, roles, and grants. Thus, you can either share the credentials for `edb_admin` itself, or you can create other users and distribute those.
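For example, a minimal sketch of creating an additional user while connected as `edb_admin` (the role name, user name, password, and the `tpch_sf_1` schema are illustrative):

```sql
-- Create a role and a login user, then grant access to one of the sample schemas.
-- These statements touch only Postgres system tables, so the resulting
-- users, roles, and grants survive restarts and restores.
CREATE ROLE analysts;
CREATE USER report_reader WITH PASSWORD 'change-me' IN ROLE analysts;
GRANT USAGE ON SCHEMA tpch_sf_1 TO analysts;
```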
Gotcha: Do not set search_path
Do not set `search_path`. Always reference fully qualified table names. Using `search_path` makes Postgres Lakehouse fall back to the PostgreSQL executor, dramatically impacting query performance. To avoid this, qualify all table names in your query with a schema.
For example:
🚫 Do NOT do this!
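A sketch of the anti-pattern, assuming the sample `tpch_sf_1.lineitem` benchmark table (any table behaves the same way):

```sql
-- Relying on search_path forces the query back onto the PostgreSQL executor.
SET search_path = tpch_sf_1;
SELECT count(*) FROM lineitem;
```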
✅ Do this instead!
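The same query with the table schema-qualified, so it stays on the fast path:

```sql
-- Fully qualifying the table avoids the search_path fallback.
SELECT count(*) FROM tpch_sf_1.lineitem;
```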
Supported queries
In general, READ ONLY queries are supported; you cannot write directly to object storage. Query support covers all Postgres built-in functions, statements, and types, as well as any of those provided by EPAS or PGE, depending on which distribution you choose to deploy.
In general, you cannot insert, update, delete, or otherwise modify data. You cannot `CREATE TABLE`. You must load data into the bucket out-of-band, either with your own ETL scripts or with Lakehouse Sync. See "Advanced: Bring your own data" for more details. (In the future, we will make this more usable with custom DDL.)
One exception is Postgres system tables, such as those used for storing users, roles, and grants. These tables are stored on the local block device, which is included in backups and restores. So you can `CREATE USER`, `CREATE ROLE`, or `GRANT USAGE`, and these users, roles, and grants will survive restarts and restores.
DirectScan vs. fallback modes and EXPLAIN
Postgres Lakehouse is fastest when it can "push down" your entire query to DataFusion, the vectorized query engine it uses to handle queries whenever possible. (In the future, this will become more fine-grained as we add support for partial pushdowns.)
Postgres Lakehouse can execute your query in two modes. First, it attempts to run the entire query using Seafowl (a dedicated columnar database based on DataFusion). If Seafowl can't run the entire query (for example, because the query uses PostgreSQL-specific operations like JSON), then Postgres Lakehouse falls back to the PostgreSQL executor, with Seafowl streaming full table contents to it.
If your query is extremely slow, it's possible this fallback is what's happening.
You can check which mode is being used by running an `EXPLAIN` on the query and making sure that the top-most query node is `SeafowlDirectScan`. For example:
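The original example isn't reproduced here; as an illustrative sketch, consider a subquery whose output column is referenced by the name PostgreSQL assigns to it (the `tpch_sf_1.lineitem` table is an assumption):

```sql
-- The outer reference to "count" relies on PostgreSQL's implicit column name,
-- which Seafowl does not share, so the query cannot be pushed down in full.
-- The EXPLAIN output ends up with a SeafowlScan node only at the bottom.
EXPLAIN SELECT count FROM (SELECT count(*) FROM tpch_sf_1.lineitem) AS t;
```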
In this case, the query is executed by PostgreSQL, and Seafowl is only involved when scanning the table (see the `SeafowlScan` at the bottom). The fix in this case is to explicitly name the inner `COUNT(*)` column, since Seafowl gives it the implicit name `count(*)`, whereas PostgreSQL calls it `count`:
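Continuing the sketch, aliasing the inner column removes the naming mismatch:

```sql
-- With an explicit alias, the column has the same name in Seafowl and PostgreSQL,
-- so the whole query can run in Seafowl; EXPLAIN should now show
-- SeafowlDirectScan as the top-most node.
EXPLAIN SELECT c FROM (SELECT count(*) AS c FROM tpch_sf_1.lineitem) AS t;
```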
In the resulting plan, `SeafowlDirectScan` appears at the top, which means that Seafowl is running the entire query.
If you're having trouble rewording your query to make it run fully on Seafowl, open a support ticket.
Load data with Lakehouse Sync
If you have a transactional database running in EDB Postgres AI Cloud Service, then you can sync tables from this database into a Managed Storage Location.
A more detailed guide for this is forthcoming. If you want to try it yourself, look in the UI for "Migrations" or "Sync to Lakehouse."
Advanced: Bring your own data
It's possible to point your Lakehouse node at an arbitrary S3 bucket with Delta Tables inside of it. However, this comes with some major caveats (which will eventually be resolved):
Caveats
- The bucket must be publicly accessible.
  - If you want to use a private bucket, this is technically possible but requires some manual action on our side and your side (to assign the correct IAM policies). Let us know if you want to try it. We will be adding proper support for private, external buckets in the near future.
- The tables must be stored as Delta Tables within the location.
  - A "Delta Table" is a folder of Parquet files along with some JSON metadata.
- Each table must be prefixed with `$schema/$table/`, where `$schema` and `$table` are valid Postgres identifiers (i.e., fewer than 64 characters).
  - For example, this is a valid Delta Table that will be recognized by Beacon Analytics: `my_schema/my_table/{part1.parquet, part2.parquet, _delta_log}`
    - The `$schema` and `$table` identifiers will be queryable in the Lakehouse node, e.g., `SELECT count(*) FROM my_schema.my_table;`
  - This Delta Table will NOT be recognized by Beacon Analytics (missing a schema): `my_table/{part1.parquet, part2.parquet, _delta_log}`
Loading your own data
- You can use the deltalake Python library to create Delta Tables and write them to the bucket.
- You can also use the `lakehouse-loader` utility we created for this purpose, to export data from an arbitrary Postgres instance to Lakehouse tables in a storage bucket.
Pointing to your bucket
By default, each Lakehouse node is configured to point to a bucket with benchmarking datasets inside. To point it to a different bucket, you can call the `seafowl.set_bucket_location` function.
Querying your own data
Continuing the example above: after you've called `set_bucket_location`, you can query the data in `my_schema.my_table`:
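For instance, a minimal sketch (assuming the bucket contains the `my_schema/my_table/` Delta Table from the caveats above):

```sql
-- Query the Delta Table stored under my_schema/my_table/ in the configured bucket.
SELECT count(*) FROM my_schema.my_table;
```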
Note that using an S3 bucket that isn't in the same region as your node will 1) be slow because of cross-region latencies, and 2) incur AWS data transfer costs (between $0.01 and $0.02 per GB). Currently, these egress costs are not passed through to you, but we do track them and reserve the right to terminate an instance.
Switching back to sample data
You can also switch the bucket back to the default sample bucket in the same region as your node.