Concepts - EDB Postgres Lakehouse
Suggest editsEDB Postgres Lakehouse is the solution for running Rapid Analytics against operational data on the EDB Postgres® AI platform.
Major concepts
- Lakehouse nodes query Lakehouse tables in managed storage locations.
- Lakehouse Sync can create Lakehouse tables from Transactional tables in a source database.
Here's how it fits together:
Lakehouse node
A Postgres Lakehouse node is Postgres, with a Vectorized Query Engine that's optimized to query Lakehouse tables, but still fall back to Postgres for full compatibility.
Lakehouse nodes are stateless and ephemeral. Scale them up or down based on workload requirements.
Lakehouse tables
Lakehouse Tables are stored using highly compressible, columnar storage formats optimized for analytics and interoperable with the rest of the Analytics ecosystem. Currently, Postgres Lakehouse nodes can read tables stored using the Delta Protocol ("delta tables"), and Lakehouse Sync can write them.
Managed storage location
A managed storage location is where you can organize Lakehouse tables in object storage, so that Postgres Lakehouse can query them.
A managed storage location is a location in object storage where we control the file layout and write Lakehouse Tables on your behalf. Technically, it's an implementation detail that we store these in buckets. This is really a subset of an upcoming "Storage Location" feature that will also support "External Storage Locations," where you bring your own bucket.
Lakehouse Sync
Lakehouse Sync is a data migration service offered as part of the EDB Postgres AI platform. It can "sync" tables from a transactional database, to Lakehouse Tables in a destination Storage Location. Currently, it supports source databases hosted in the EDB Postgres AI Cloud Service (formerly known as BigAnimal). In the future, it will support more source databases (such as Oracle) and deployment locations (on-premises databases). It's built using Debezium.
Terminology
Lakehouse
The "Lakehouse architecture" is a data engineering practice, which is a portmanteau of "Data Lake" and "Data Warehouse," offering the best of both. The central tenet of the architecture is that data is stored in Object Storage, generally in columnar formats like Parquet, where different query engines can process it for their own specialized purposes, using the optimal compute resources for a given query.
Vectorized query engine
A vectorized query engine is a query engine that's optimized for running queries on columnar data. Most analytics engines use vectorized query execution. Postgres Lakehouse uses Apache DataFusion.
Delta tables
We use the term "Lakehouse tables" to avoid overcommitting to a particular format (since we might eventually support Iceberg or Hudi, for example). But technically, we're using Delta Tables. A Delta Table is a well-defined container of Parquet files and JSON metadata, according to the "Delta Lake" spec and open protocol. Delta Lake is a Linux Foundation project.
How it works
Postgres Lakehouse is built using a number of technologies:
- PostgreSQL
- Seafowl, an analytical database acquired from Splitgraph
- Apache DataFusion, the query engine used by Seafowl
- Delta Lake (and specifically delta-rs), for implementing the storage and retrieval layer of Delta Tables
Postgres Lakehouse is built by much of the Splitgraph team, along with Postgres experts from EDB. We have lots of exciting plans for further integration with existing EDB technologies...
Level 100
The most important thing to understand about Postgres Lakehouse, is that it separates storage from compute. This allows you to scale them independently, which is ideal for analytical workloads where queries can be unpredictable and spiky. You would not want to keep a machine mostly idle just to hold data on its attached hard drives. Instead, you can keep data in object storage (and also in highly compressible formats), and only provision the compute needed to query it when necessary.
On the compute side, a Vectorized Query Engine is optimized to query Lakehouse Tables, but still fall back to Postgres for full compatibility.
On the storage side, Lakehouse Tables are stored using highly compressible, columnar storage formats optimzied for Analytics.
Level 200
Here's a slightly more comprehensive diagram of how these services fit together:
Level 300
Here's the more detailed, zoomed-in view of "what's in the box":
Could this page be better? Report a problem or suggest an addition!