Skip to content

support for DuckDB #90

@johntruckenbrodt

Description

@johntruckenbrodt

Hi,
I am currently trying to figure out a way to use pygeofilter for converting CQL2 to a DuckDB query using the spatial extension. This extension is quite prototypical and I stumble across a couple of caveats. Here's my current approach:

import duckdb

duckdb.install_extension('spatial')
duckdb.load_extension('spatial')

Define the CQL2 filter

cql2_filter = {
    "op": "and",
    "args": [
        {
            "op": ">=",
            "args": [
                {
                    "property": "end_datetime"
                },
                '2020-03-28T20:05:46+02'
            ]
        },
        {
            "op": "<=",
            "args": [
                {
                    "property": "start_datetime"
                },
                '2020-03-28T22:06:15+02'
            ]
        },
        {
            "op": "=",
            "args": [
                {
                    "property": "sar:instrument_mode"
                },
                'IW'
            ]
        }
    ]
}

Optionally add spatial filtering

# ext = None
ext = {'xmin': -4, 'xmax': -2, 'ymin': 6, 'ymax': 8}

if ext is not None:
    arg = {
        'op': 's_intersects',
        'args': [
            {
                'property': 'geometry'
            },
            {
                'type': 'Polygon',
                'coordinates': [[[ext['xmin'], ext['ymin']],
                                 [ext['xmin'], ext['ymax']],
                                 [ext['xmax'], ext['ymax']],
                                 [ext['xmax'], ext['ymin']],
                                 [ext['xmin'], ext['ymin']]]]
            }
        ]
    }
    cql2_filter['args'].append(arg)

Convert CQL2 filter to SQL where clause

from pygeofilter.parsers.cql2_json import parse as json_parse

filter = json_parse(cql2_filter)
from pygeofilter.backends.sql.evaluate import to_sql_where

sql_where = to_sql_where(filter, {
    's1:datatake': 's1:datatake',
    'datetime': 'datetime',
    'sar:instrument_mode': 'sar:instrument_mode',
    'end_datetime': 'end_datetime',
    'start_datetime': 'start_datetime',
    'geometry': 'geometry'
})

Create DuckDB query for a geoparquet file:
Here it gets ugly because (1) the WKB column needs to be converted to GEOMETRY type and (2) the DuckDB-spatial implementation of ST_GeomFromWKB cannot read the WKB-HEX representation returned by to_sql_where.

import re

sql_query = "SELECT * EXCLUDE geometry, ST_GeomFromWKB(geometry) AS geometry FROM '20200301_20200401.parquet' WHERE %s" % sql_where

if ext is not None:
    # convert WKB blob to GEOMETRY
    sql_query = sql_query.replace('ST_Intersects("geometry"', 'ST_Intersects(ST_GeomFromWKB(geometry)')
    
    # duckdb_spatial apparently cannot yet read wkb_hex representation -> convert it back to text
    spatial = ("ST_GeomFromText('POLYGON(({xmin} {ymin}, {xmin} {ymax}, "
               "{xmax} {ymax}, {xmax} {ymin}, {xmin} {ymin}))')")
    sql_query = re.sub(r'ST_GeomFromWKB\(x\'.*\'\)', spatial.format(**ext), sql_query)
sql_query

Execute the query:

df = duckdb.query(sql_query)

I wonder, how would you do this? Do you think there is anything that could/should be modified on the pygeofilter end or is it entirely up to the DuckDB-spatial package? I'd appreciate any help.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions