Add pg_lake extension support #616
Draft: minkimipt wants to merge 8 commits into master from danil/add_pg_lake
Add support for the pg_lake extension, which integrates Apache Iceberg
and data lake files into PostgreSQL, enabling it to function as a
lakehouse system with support for querying Parquet, CSV, JSON, and
Iceberg tables in object storage.
Changes made:
* CHANGELOG.md: Add pg_lake to future release notes
* Dockerfile:
  - Add build dependencies required for pg_lake: ninja-build,
    libgeos-dev, libproj-dev, libgdal-dev, and openjdk-21-jdk
  - Add build step to install pg_lake extension via install_extensions
    script (runs as postgres user before switching to root)
  - Add PG_LAKE_VERSIONS to image configuration output
* Makefile:
  - Add PG_LAKE_VERSIONS variable (default: all)
  - Pass PG_LAKE_VERSIONS as build argument to Docker
  - Set PG_LAKE_VERSIONS to empty in fast target for quicker builds
  - Set PG_LAKE_VERSIONS to latest in latest target
* README.md:
  - Add pg_lake to the list of extensions that can be version-managed
    via versions.yaml
* build_scripts/install_extensions:
  - Add pg_lake to accepted command-line arguments
  - Add PG_LAKE_VERSIONS to version output
  - Add pg_lake installation case block that iterates through requested
    versions
* build_scripts/shared_install.sh:
  - Add install_pg_lake() function that clones pg_lake repository from
    GitHub and builds from source
  - Function handles version checkout (tags, branches, or main/master)
  - Builds and installs all pg_lake components: pg_extension_base,
    pg_map, pg_extension_updater, pg_lake_engine, avro,
    pg_lake_iceberg, pg_lake_table, pg_lake_spatial, pg_lake_copy,
    and pg_lake
* build_scripts/shared_versions.sh:
  - Add supported_pg_lake() function to validate PostgreSQL version
    compatibility
  - Add PG_LAKE_VERSIONS processing via requested_pkg_versions
* build_scripts/versions.yaml:
  - Add pg_lake configuration with main branch support
  - Set PostgreSQL compatibility to 16-18 based on pg_lake's
    documented support matrix
* cicd/shared.sh:
  - Add check_pg_lake() function to verify pg_lake installation by
    checking for pg_lake.so library file
  - Add check_pg_lake call to check_base_components
  - Function records extension version and validates installation for
    PostgreSQL 16-18
* cicd/version_info.sql:
  - Add pg_lake to extensions checked in pg_available_extensions
  - Add query to collect pg_lake.available_versions from
    pg_available_extension_versions
Implementation notes:
Unlike pgvectorscale, which is installed from pre-built .deb packages,
pg_lake must be built from source because the upstream project does not
publish binary releases. This requires additional build dependencies
(GEOS, PROJ, and GDAL development headers, plus a Java JDK), which are
already available as PostgreSQL build dependencies or are required for
PostGIS.

The pg_lake extension consists of multiple components that are built
and installed separately through individual make targets. The
installation follows the same pattern as other source-built extensions
in this repository.

The default version is set to the "main" branch to track the latest
development, as no stable releases are available yet. This can be
overridden via the PG_LAKE_VERSIONS environment variable during the build.

PostgreSQL version support is limited to 16-18, based on pg_lake's
build requirements and the testing matrix documented in its repository.
Add explicit PostgreSQL version check in supported_pg_lake() to skip PG15 even for main/master branch builds. pg_lake uses PostgreSQL 16+ APIs (RTEPermissionInfo, p_rteperminfos) that don't exist in PG15, causing compilation errors. The early return for main/master branches was allowing build attempts on all PostgreSQL versions, bypassing the version constraints defined in versions.yaml. The added check ensures pg_lake is only built for PostgreSQL 16 and later.
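The ordering described above (check the PostgreSQL major version before the main/master shortcut) can be sketched roughly as follows. This is a minimal illustration, not the actual supported_pg_lake() from shared_versions.sh: the function name `pg_lake_supported_sketch` and its two-argument signature are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real supported_pg_lake() may take different arguments.
# Usage: pg_lake_supported_sketch <pg_major> <pg_lake_version>
pg_lake_supported_sketch() {
    local pg_major="$1" version="$2"

    # Reject PostgreSQL 15 and older first, before any branch shortcut,
    # because pg_lake relies on APIs introduced in PostgreSQL 16.
    if [ "$pg_major" -lt 16 ]; then
        return 1
    fi

    # Branch builds are accepted only once the version gate has passed.
    case "$version" in
        main|master) return 0 ;;
    esac

    # Tagged versions would be validated against versions.yaml here (omitted).
    return 0
}
```

With this ordering, `pg_lake_supported_sketch 15 main` returns non-zero, so a main-branch build is never attempted on PG15.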
pg_lake's Avro component (required for Iceberg table support) needs libjansson >=2.3 to build. Add libjansson-dev to BUILD_PACKAGES to satisfy this build dependency.
Make /usr/lib/{x86_64,aarch64}-linux-gnu directories writable by
postgres group before switching to postgres user. This allows pg_lake's
Avro installation step (make install-avro) to succeed when installing
libavro.* files to system library directories.
The build runs as postgres user for security and to allow mutability
of installed extensions, but system library directories are owned by
root. Adding group write permissions allows the build to complete
successfully.
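A rough sketch of the permission change, to be run as root before switching to the postgres user. The helper name `make_group_writable_sketch` is hypothetical, and the use of chgrp in addition to chmod is an assumption about how the directories are made writable by the postgres group.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not taken from the Dockerfile in this PR.
# Usage: make_group_writable_sketch <group> <dir>...
make_group_writable_sketch() {
    local group="$1" dir
    shift
    for dir in "$@"; do
        # Only one architecture-specific directory exists on a given image,
        # so missing directories are skipped silently.
        [ -d "$dir" ] || continue
        chgrp "$group" "$dir"   # hand the directory to the given group
        chmod g+w "$dir"        # and let that group write to it
    done
}
```

In the image build this would be called along the lines of `make_group_writable_sketch postgres /usr/lib/x86_64-linux-gnu /usr/lib/aarch64-linux-gnu`.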
Apply patch to fix incorrect include path in pg_lake_table source code. The file pg_lake_table/src/planner/query_pushdown.c incorrectly uses 'server/rewrite/rewriteManip.h' when the correct path for PostgreSQL extension includes is 'rewrite/rewriteManip.h' (without server/ prefix). This appears to be a bug in pg_lake's main branch. The sed patch is applied after checkout to fix the include before compilation.
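The include fix amounts to a one-line substitution. The sketch below wraps it in a hypothetical function, `fix_rewrite_include_sketch`, that takes the path to the checked-out file; the exact sed invocation used in shared_install.sh may differ, and `sed -i` is assumed to be GNU sed as found in the Debian-based build image.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the sed patch described above.
# Usage: fix_rewrite_include_sketch <path/to/query_pushdown.c>
fix_rewrite_include_sketch() {
    local file="$1"
    # Drop the stray "server/" prefix from the include path. The corrected
    # path no longer matches the pattern, so re-running is a no-op.
    sed -i 's|server/rewrite/rewriteManip.h|rewrite/rewriteManip.h|' "$file"
}
```

After checkout it would be applied to pg_lake_table/src/planner/query_pushdown.c inside the cloned repository.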
Update check_pg_lake to properly verify installation even for main/master branch builds. Previously, the function would skip checking for the .so file entirely for main/master versions, causing the final error check to fail when found=false. Now the check:
1. Verifies pg_lake.so exists for main/master versions
2. Records the version and sets found=true if installed
3. Logs (doesn't error) if not built for unsupported PG versions
4. Only errors for tagged versions that should exist but don't
This allows the build to pass install checks when using PG_LAKE_VERSIONS=main.
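The four numbered steps can be sketched as a small decision function. This is an illustration under assumptions, not the real check_pg_lake() from cicd/shared.sh: the name `check_pg_lake_sketch`, its arguments, and the "found"/"skipped"/"missing" output are all hypothetical, and a missing main/master build is treated as log-only regardless of PostgreSQL version.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the decision logic only.
# Usage: check_pg_lake_sketch <libdir> <pg_major> <pg_lake_version>
# Prints "found", "skipped" or "missing"; returns non-zero only for "missing".
check_pg_lake_sketch() {
    local libdir="$1" pg_major="$2" version="$3"

    # Steps 1 and 2: the library is present, so the install counts as found.
    if [ -f "${libdir}/pg_lake.so" ]; then
        echo "found"
        return 0
    fi

    case "$version" in
        main|master)
            # Step 3: a branch build that was not produced is logged, not fatal.
            echo "pg_lake ${version} not built for PostgreSQL ${pg_major}" >&2
            echo "skipped"
            return 0
            ;;
    esac

    # Step 4: a tagged version that should exist but does not is an error.
    echo "missing"
    return 1
}
```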
pg_extension_base (required by pg_lake) must be loaded at PostgreSQL startup via shared_preload_libraries. Add it to the postgresql.conf.sample configuration alongside timescaledb. According to the pg_lake documentation, pg_extension_base acts as a loader that loads other pg_lake modules as needed, so only this extension needs to be preloaded.
Without this, attempting to CREATE EXTENSION pg_extension_base fails with:
ERROR: pg_extension_base can only be loaded via shared_preload_libraries
Preloading it ensures pg_lake extensions can be installed without manual configuration changes.
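One way to add a library to an existing shared_preload_libraries setting without clobbering entries such as timescaledb is sketched below. The function name `add_preload_library_sketch` is hypothetical, the PR may edit postgresql.conf.sample differently, and GNU sed and grep are assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not the edit actually made in this PR.
# Usage: add_preload_library_sketch <conf-file> <library>
add_preload_library_sketch() {
    local conf="$1" lib="$2"

    # Already listed on an active (uncommented) line: nothing to do.
    if grep -Eq "^shared_preload_libraries *=.*[', ]${lib}[', ]" "$conf"; then
        return 0
    fi

    if grep -Eq "^shared_preload_libraries *= *'[^']+'" "$conf"; then
        # Append to the existing comma-separated list, before the closing quote.
        sed -i -E "s/^(shared_preload_libraries *= *'[^']+)'/\1,${lib}'/" "$conf"
    else
        # No active non-empty setting: add one at the end of the file
        # (PostgreSQL uses the last occurrence of a setting).
        echo "shared_preload_libraries = '${lib}'" >> "$conf"
    fi
}
```

For example, a file containing `shared_preload_libraries = 'timescaledb'` would end up with `shared_preload_libraries = 'timescaledb,pg_extension_base'` after calling the function with `pg_extension_base`.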
pg_lake needs the libjansson shared library at runtime for Avro/Iceberg
support. libjansson-dev covers the build, but the runtime package
libjansson4 must be installed separately so that it stays in the final
image rather than being removed during cleanup.
Without this, PostgreSQL fails to start with pg_lake preloaded:
FATAL: could not load library "/usr/lib/postgresql/18/lib/pg_lake_iceberg.so":
libjansson.so.4: cannot open shared object file: No such file or directory
This installs libjansson4 as a runtime dependency similar to libsodium23.
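A missing runtime library like this can be caught before PostgreSQL starts by scanning `ldd` output for unresolved entries. The filter below is a minimal sketch with a hypothetical name, `missing_libs_sketch`; it is not part of the PR's checks. It reads `ldd`-style output on stdin and prints the names of libraries reported as "not found".

```shell
#!/usr/bin/env bash
# Hypothetical helper: list shared libraries that ldd could not resolve.
# Usage: ldd /path/to/module.so | missing_libs_sketch
missing_libs_sketch() {
    # ldd prints unresolved dependencies as "<name> => not found";
    # the first field is the library name.
    awk '/=> not found/ { print $1 }'
}
```

Run against the module from the error above, for example `ldd /usr/lib/postgresql/18/lib/pg_lake_iceberg.so | missing_libs_sketch`, it would list libjansson.so.4 on an image where libjansson4 is absent.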