@minkimipt
Contributor

Add support for the pg_lake extension, which integrates Apache Iceberg and data lake files into PostgreSQL, enabling it to function as a lakehouse system with support for querying Parquet, CSV, JSON, and Iceberg tables in object storage.

Changes made:

  • CHANGELOG.md: Add pg_lake to future release notes

  • Dockerfile:

    • Add build dependencies required for pg_lake: ninja-build, libgeos-dev, libproj-dev, libgdal-dev, and openjdk-21-jdk
    • Add build step to install pg_lake extension via install_extensions script (runs as postgres user before switching to root)
    • Add PG_LAKE_VERSIONS to image configuration output
  • Makefile:

    • Add PG_LAKE_VERSIONS variable (default: all)
    • Pass PG_LAKE_VERSIONS as build argument to Docker
    • Set PG_LAKE_VERSIONS to empty in fast target for quicker builds
    • Set PG_LAKE_VERSIONS to latest in latest target
  • README.md:

    • Add pg_lake to the list of extensions that can be version-managed via versions.yaml
  • build_scripts/install_extensions:

    • Add pg_lake to accepted command-line arguments
    • Add PG_LAKE_VERSIONS to version output
    • Add pg_lake installation case block that iterates through requested versions
  • build_scripts/shared_install.sh:

    • Add install_pg_lake() function that clones pg_lake repository from GitHub and builds from source
    • Function handles version checkout (tags, branches, or main/master)
    • Builds and installs all pg_lake components: pg_extension_base, pg_map, pg_extension_updater, pg_lake_engine, avro, pg_lake_iceberg, pg_lake_table, pg_lake_spatial, pg_lake_copy, and pg_lake
  • build_scripts/shared_versions.sh:

    • Add supported_pg_lake() function to validate PostgreSQL version compatibility
    • Add PG_LAKE_VERSIONS processing via requested_pkg_versions
  • build_scripts/versions.yaml:

    • Add pg_lake configuration with main branch support
    • Set PostgreSQL compatibility to 16-18 based on pg_lake's documented support matrix
  • cicd/shared.sh:

    • Add check_pg_lake() function to verify pg_lake installation by checking for pg_lake.so library file
    • Add check_pg_lake call to check_base_components
    • Function records extension version and validates installation for PostgreSQL 16-18
  • cicd/version_info.sql:

    • Add pg_lake to extensions checked in pg_available_extensions
    • Add query to collect pg_lake.available_versions from pg_available_extension_versions

Implementation notes:

Unlike pgvectorscale, which downloads pre-built .deb packages, pg_lake must be built from source because the upstream project does not publish binary releases. The build needs the GEOS, PROJ, and GDAL development headers and a Java JDK, which are already available as PostgreSQL build dependencies or are required for PostGIS.

The pg_lake extension consists of multiple components that are built and installed separately through individual make targets. The installation follows the same pattern as other source-built extensions in this repository.

The default version is the "main" branch, which tracks the latest development, because no stable releases are available yet. This can be overridden via the PG_LAKE_VERSIONS variable at build time.

PostgreSQL version support is limited to 16-18, based on the build requirements and testing matrix documented in the pg_lake repository.


Add an explicit PostgreSQL version check in supported_pg_lake() to skip
PG15 even for main/master branch builds. pg_lake uses PostgreSQL 16+
APIs (RTEPermissionInfo, p_rteperminfos) that don't exist in PG15,
causing compilation errors.

The early return for main/master branches was allowing build attempts
on all PostgreSQL versions, bypassing the version constraints defined
in versions.yaml. The added check ensures pg_lake is only built for
PostgreSQL 16 and later.
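
The ordering described above (version floor first, branch shortcut second) can be sketched roughly as follows. This is a minimal illustration, not the code from shared_versions.sh: the argument order, the return-code convention, and the hard-coded upper bound of 18 for tagged versions are assumptions standing in for whatever the real function reads from versions.yaml.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of supported_pg_lake(); the real function may take
# different arguments and consult versions.yaml instead of fixed numbers.
supported_pg_lake() {
    local pg_major="$1" version="$2"

    # Hard floor first: pg_lake needs PostgreSQL 16+ APIs, so older majors
    # are rejected before the main/master shortcut can apply.
    if [ "$pg_major" -lt 16 ]; then
        return 1
    fi

    # Development branches have no tagged entry to look up, so accept them.
    case "$version" in
        main|master) return 0 ;;
    esac

    # Tagged versions: upper bound assumed from the 16-18 range in the PR.
    [ "$pg_major" -le 18 ]
}

# Show which PostgreSQL majors would get a main-branch build.
for pg in 15 16 17 18; do
    if supported_pg_lake "$pg" main; then
        echo "PG$pg: build pg_lake main"
    else
        echo "PG$pg: skip pg_lake"
    fi
done
```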

pg_lake's Avro component (required for Iceberg table support) needs
libjansson >=2.3 to build. Add libjansson-dev to BUILD_PACKAGES to
satisfy this build dependency.

Make the /usr/lib/{x86_64,aarch64}-linux-gnu directories writable by the
postgres group before switching to the postgres user. This allows pg_lake's
Avro installation step (make install-avro) to succeed when installing
libavro.* files to system library directories.

The build runs as postgres user for security and to allow mutability
of installed extensions, but system library directories are owned by
root. Adding group write permissions allows the build to complete
successfully.
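
One way to grant that access is sketched below. The helper name and the use of chgrp followed by chmod g+w are assumptions; the Dockerfile may achieve the same result differently (for example with chmod alone if the group already matches). In the image this would run as root, for example as `make_group_writable postgres /usr/lib/x86_64-linux-gnu /usr/lib/aarch64-linux-gnu`, where only the directory for the build architecture exists.

```bash
#!/usr/bin/env bash
# Hypothetical helper: give a group write access to each existing directory.
make_group_writable() {
    local group="$1" dir
    shift
    for dir in "$@"; do
        # Skip the directory that belongs to the other architecture.
        [ -d "$dir" ] || continue
        # Hand the directory to the group, then add the group write bit.
        chgrp "$group" "$dir" && chmod g+w "$dir"
    done
}
```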

Apply a patch to fix an incorrect include path in the pg_lake_table source code.
The file pg_lake_table/src/planner/query_pushdown.c incorrectly uses
'server/rewrite/rewriteManip.h' when the correct path for PostgreSQL
extension includes is 'rewrite/rewriteManip.h' (without the server/ prefix).

This appears to be a bug in pg_lake's main branch. The sed patch is
applied after checkout to fix the include before compilation.
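
A sed patch of this kind might look like the sketch below. The function name and its source-root argument are hypothetical, and the exact expression used in shared_install.sh is not shown in this PR description; only the file path and the two include paths come from the text above. It relies on GNU sed's in-place flag, which is what Debian-based images ship.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rewrite the bad include in a pg_lake checkout.
patch_rewrite_include() {
    local src_root="$1"
    local file="$src_root/pg_lake_table/src/planner/query_pushdown.c"

    # Nothing to do if upstream has moved or removed the file.
    [ -f "$file" ] || return 0

    # Drop the stray "server/" prefix. Running this twice is harmless,
    # because the corrected path no longer matches the pattern.
    sed -i 's|server/rewrite/rewriteManip\.h|rewrite/rewriteManip.h|' "$file"
}
```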

Update check_pg_lake to properly verify installation even for main/master
branch builds. Previously, the function would skip checking for the .so
file entirely for main/master versions, causing the final error check to
fail when found=false.

Now the check:
1. Verifies pg_lake.so exists for main/master versions
2. Records the version and sets found=true if installed
3. Logs (doesn't error) if not built for unsupported PG versions
4. Only errors for tagged versions that should exist but don't

This allows the build to pass install checks when using PG_LAKE_VERSIONS=main.
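
The four rules above can be sketched as a small function. This is an illustration only: the name, the arguments, and the use of a return code in place of the found flag are assumptions, and the real check_pg_lake in cicd/shared.sh also records the extension version through the repository's own helpers.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the check; returns 0 when the install check passes.
check_pg_lake_sketch() {
    local libdir="$1" pg_major="$2" version="$3"

    # Rules 1 and 2: the library exists, so report it as found.
    if [ -f "$libdir/pg_lake.so" ]; then
        echo "pg_lake $version found for PG$pg_major"
        return 0
    fi

    case "$version" in
        main|master)
            # Rule 3: a branch build may have been skipped on an
            # unsupported PostgreSQL version, so only log it.
            echo "pg_lake $version not built for PG$pg_major" >&2
            return 0
            ;;
    esac

    # Rule 4: a tagged version that should exist is an error.
    echo "ERROR: pg_lake $version missing for PG$pg_major" >&2
    return 1
}
```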

pg_extension_base (required by pg_lake) must be loaded at PostgreSQL
startup via shared_preload_libraries. Add it to the postgresql.conf.sample
configuration alongside timescaledb.

According to pg_lake documentation, pg_extension_base acts as a loader
that loads other pg_lake modules as needed, so only this extension needs
to be preloaded.

Without this, attempting to CREATE EXTENSION pg_extension_base fails with:
  ERROR: pg_extension_base can only be loaded via shared_preload_libraries

This ensures pg_lake extensions can be installed without manual
configuration changes.
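
For illustration, appending a library to an existing preload list could be scripted as below. The helper is hypothetical and assumes the sample file already contains an uncommented, non-empty, single-quoted shared_preload_libraries line such as `shared_preload_libraries = 'timescaledb'`; the PR may simply edit that line by hand. Calling `add_preload_library postgresql.conf.sample pg_extension_base` on such a file leaves the line as `shared_preload_libraries = 'timescaledb,pg_extension_base'`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append a library to shared_preload_libraries.
add_preload_library() {
    local conf="$1" lib="$2"

    # Leave the file alone if the library is already in the list.
    if grep -Eq "^shared_preload_libraries[[:space:]]*=.*[',[:space:]]${lib}[',[:space:]]" "$conf"; then
        return 0
    fi

    # Insert ",<lib>" just before the closing quote of the existing list.
    sed -i -E "s/^(shared_preload_libraries[[:space:]]*=[[:space:]]*'[^']+)'/\1,${lib}'/" "$conf"
}
```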

pg_lake requires the libjansson shared library at runtime for Avro/Iceberg
support. libjansson-dev is only needed for building, so the runtime package
libjansson4 must be installed separately to remain in the final image
(it is not removed during cleanup).

Without this, PostgreSQL fails to start with pg_lake preloaded:
  FATAL: could not load library "/usr/lib/postgresql/18/lib/pg_lake_iceberg.so":
         libjansson.so.4: cannot open shared object file: No such file or directory

This installs libjansson4 as a runtime dependency similar to libsodium23.
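
A missing runtime library like this can be caught before PostgreSQL starts by scanning ldd output for unresolved entries. The filter below is a hypothetical helper, not part of this PR; it reads ldd output on standard input, so in the image it would be used as `ldd /usr/lib/postgresql/18/lib/pg_lake_iceberg.so | missing_shared_libs`, with the library path taken from the error message above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the names of libraries that ldd could not
# resolve. ldd marks those lines with "=> not found".
missing_shared_libs() {
    awk '/=> not found/ { print $1 }'
}
```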