Draft of new multi dataframe synthesis #415

Draft
qubixes wants to merge 15 commits into develop from multiframe

@qubixes qubixes commented Feb 3, 2026

Adds primary and foreign key relationships. See #413

Could be separated out into its own package, though the current implementation (without documentation) is about 100 lines of code, so that might be a bit too light for a new package. It also adds a new demo multi dataset which can be used to explain how it works. A new notebook is also there to show the new feature off.

There are several things to be worked out still. A non-exhaustive list:

  • GMF file / serialization. I'm still not 100% convinced that multiple files is the solution. An SQLite database is also a single file, so...
  • The interface of the classes could be improved for sure. It works well enough as it is, but it lacks a lot of elegance.
  • Efficiency might be a bottleneck. Probably not a big deal, but could become an issue in some circumstances. The current approach is to synthesize the metaframes separately and connect them later. Some columns are synthesized multiple times this way.
  • No documentation yet, including the preferred way to use the interface
  • No tests
  • Ruff compliance
  • Add warning if primary key is not unique
  • Multiframe.fit_dataframes() without adding metaframes (which will be inferred with default keywords).
  • table1[key1] <- table2[key2], arrow determined
  • Update regex for escaping brackets and more allowed characters.
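
To make the "synthesize separately, connect later" approach from the efficiency bullet concrete, here is a minimal sketch. It is not the code in this PR: `connect_foreign_key` is a hypothetical helper, the tables are plain dicts with stand-in values, and sampling foreign keys uniformly from the primary keys is an assumption. It also includes the uniqueness warning from the to-do list.

```python
import random
import warnings


def connect_foreign_key(primary_keys, n_foreign_rows, seed=None):
    """Draw foreign key values from an already-synthesized primary key column.

    Hypothetical helper: both tables are assumed to have been synthesized
    independently, and only the foreign key column is overwritten afterwards.
    """
    primary_keys = list(primary_keys)
    if len(set(primary_keys)) != len(primary_keys):
        # Mirrors the "warn if primary key is not unique" to-do item.
        warnings.warn("Primary key column contains duplicate values.", stacklevel=2)
    rng = random.Random(seed)
    return [rng.choice(primary_keys) for _ in range(n_foreign_rows)]


# Two tables synthesized separately (stand-in values, not real metasyn output).
customers = {"id": [1, 2, 3]}
purchases = {"customer_id": [99, 98, 97, 96], "amount": [5.0, 7.5, 1.2, 3.3]}

# Connect them afterwards: the earlier customer_id values are thrown away,
# which is the duplicated work the efficiency bullet refers to.
purchases["customer_id"] = connect_foreign_key(
    customers["id"], len(purchases["amount"]), seed=0
)
```

After the call, every value in `purchases["customer_id"]` refers to an existing customer, at the cost of having synthesized that column twice.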

Fixes #413

@qubixes qubixes requested a review from vankesteren February 3, 2026 20:46
@vankesteren

first impression: excellent!! Will do more thorough testing and review later.

@vankesteren

So we discussed that there will be two ways of creating between-table relationships:

# preferred way for most use-cases
relations = [
    "customers[id] <- purchases[customer_id]", 
    "products[id] <- purchases[product_id]"
]

# If you want to specify the relation type
# assuming RelationType would live next to ColumnRelation
from metasyn.multitable import ColumnRelation, RelationType
relations = [
    ColumnRelation(
        primary_table="customers", primary_key="id", 
        foreign_table="purchases", foreign_key="customer_id", 
        relation_type=RelationType.Equal
    ),
    ColumnRelation(
        primary_table="products", primary_key="id", 
        foreign_table="purchases", foreign_key="product_id", 
        relation_type=RelationType.Equal
    ),
]

# maybe we should call these `mlf` by default to distinguish from MetaFrame (mf)?
mlf = MultiFrame(metaframes, relations, data)

@vankesteren

Thoughts about integration of multiframe with metasyn

It's great that the multiframe stuff is "separate" from the basic metaframe stuff. We should attempt to keep it that way, so that we can think of multiframe as a kind of wrapper around a dict of metaframes.

I'm also a fan of keeping everything in its own module (metasyn.multiframe) to be explicit about this separation. We should be able to update metasyn without touching multiframe (and vice versa). I don't think it should be another package entirely because it is conceptually a feature of metasyn itself.
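
As a rough illustration of the "wrapper around a dict of metaframes" idea, the sketch below uses hypothetical names (`MultiFrameSketch`, `FakeMetaFrame`); the only thing it assumes about a metaframe is a `synthesize(n)` method. It is meant to show the separation, not the actual class in this PR.

```python
from dataclasses import dataclass, field


class FakeMetaFrame:
    """Stand-in for a MetaFrame so the sketch runs without metasyn installed."""

    def synthesize(self, n):
        return [{"id": i} for i in range(n)]


@dataclass
class MultiFrameSketch:
    """Hypothetical wrapper: a dict of per-table metaframes plus relations."""

    metaframes: dict
    relations: list = field(default_factory=list)

    def synthesize(self, n_rows):
        # Delegate to each wrapped metaframe; relations would be applied
        # afterwards, without the metaframes needing to know about them.
        return {
            name: metaframe.synthesize(n_rows[name])
            for name, metaframe in self.metaframes.items()
        }


mlf = MultiFrameSketch({"customers": FakeMetaFrame(), "purchases": FakeMetaFrame()})
tables = mlf.synthesize({"customers": 2, "purchases": 3})
```

Because the wrapper only calls the public `synthesize` method, the per-table code could change without touching the wrapper, and vice versa.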

We can discuss what this approach entails for serialization.

@vankesteren

vankesteren commented Feb 9, 2026

Thoughts about (de)serialization

We have three options for serialization of MultiFrames:

Option 1: separate files

In this option, the serialization of the MultiFrame is a folder with a GMF file for each individual MetaFrame and a serialized version of the list of ColumnRelation objects. This approach has several advantages: it's immediately clear that this pertains to several datasets, the relations are easy to inspect separately (because it's a single file), and the separate GMF files work by themselves as well in a backwards compatible way: it's possible to synthesize only a single file (or a subset of files?) using plain metasyn (MetaFrame).
The downside of this approach is that it is multiple files, and the relations file depends on being next to the gmf files. This is potentially error-prone and may be confusing for new users?
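
A minimal sketch of what option 1 could look like on disk, using only the standard library. The file names (`<table>.json`, `relations.json`) and the tiny dicts standing in for GMF content are assumptions for illustration, not a proposed layout.

```python
import json
import tempfile
from pathlib import Path


def save_multiframe_folder(folder, tables, relations):
    """Write one JSON file per table plus a relations file next to them."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for name, gmf_dict in tables.items():
        (folder / f"{name}.json").write_text(json.dumps(gmf_dict, indent=2))
    (folder / "relations.json").write_text(json.dumps(relations, indent=2))


# Stand-in table descriptions, far smaller than real GMF files.
tables = {"customers": {"n_rows": 3, "vars": []}, "purchases": {"n_rows": 4, "vars": []}}
relations = ["customers[id] <- purchases[customer_id]"]

with tempfile.TemporaryDirectory() as tmp:
    save_multiframe_folder(tmp, tables, relations)
    written = sorted(path.name for path in Path(tmp).iterdir())
    # A single table can still be read on its own, without the relations file.
    customers_only = json.loads((Path(tmp) / "customers.json").read_text())
```

The sketch also shows the fragility mentioned above: the relations file is only meaningful while it sits next to the table files, and a table that happens to be called "relations" would collide with it.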

Option 2: nested serialization

For this option, the multiframe is serialized to a single JSON file which has GMFs as elements of a list, and the table relations as a separate element. This would require a "new" serialization format with its own schema (and its own file extension and documentation?). This way, the GMF sub-elements can be validated separately by our existing code, but it remains a single file, which is nice for portability. We could still build in the possibility of only generating a single table for this approach, but this would require a bit of work in metasyn itself.
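
A sketch of the nested layout, again with the standard library only. The top-level keys "tables" and "relations" and the per-entry keys "name" and "gmf" are assumptions made for this example; `load_single_table` is a hypothetical helper showing how one table could be pulled out of the single file.

```python
import json


def dump_nested(tables, relations):
    """Serialize all tables and their relations into one JSON string."""
    document = {
        "tables": [{"name": name, "gmf": gmf} for name, gmf in tables.items()],
        "relations": relations,
    }
    return json.dumps(document, indent=2)


def load_single_table(text, name):
    """Pull one table's GMF dict out of the nested document."""
    for entry in json.loads(text)["tables"]:
        if entry["name"] == name:
            return entry["gmf"]
    raise KeyError(f"No table named {name!r} in nested file.")


text = dump_nested(
    {"customers": {"n_rows": 3, "vars": []}, "purchases": {"n_rows": 4, "vars": []}},
    ["customers[id] <- purchases[customer_id]"],
)
purchases_gmf = load_single_table(text, "purchases")
```

Each extracted "gmf" element could then be handed to the existing single-table validation unchanged.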

Option 3: GMF-compliant multi-tables

In this option, we would find a way to create multiple tables within a single gmf-compliant gmf file. This would mean creating all columns in a flat way, all just with metavars. The column relations would be in the additional metadata field which we did already reserve in the GMF specification. Advantages: no other file format extension, things remain gmf-compliant, single file. Disadvantages: you can't really deserialize this gmf file to a MetaFrame in a meaningful way, or you might get a MetaFrame with many weirdly-named columns. Also, how would we deal with the table-level metadata such as the number of rows, which can differ per table?
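
To see why option 3 gets awkward, here is a sketch of the flattening step. The "table.column" naming scheme and the shape of the "metadata" entry are invented for this example and are not taken from the GMF specification.

```python
def flatten_tables(tables, relations):
    """Merge several table descriptions into one flat, GMF-like dict."""
    flat_vars = [
        {**var, "name": f"{table}.{var['name']}"}
        for table, gmf in tables.items()
        for var in gmf["vars"]
    ]
    return {
        "n_columns": len(flat_vars),
        "vars": flat_vars,
        # Hypothetical use of the reserved metadata field: the relations and
        # the per-table row counts have nowhere else to go.
        "metadata": {
            "relations": relations,
            "n_rows": {table: gmf["n_rows"] for table, gmf in tables.items()},
        },
    }


flat = flatten_tables(
    {
        "customers": {"n_rows": 3, "vars": [{"name": "id"}]},
        "purchases": {"n_rows": 4, "vars": [{"name": "customer_id"}]},
    },
    ["customers[id] <- purchases[customer_id]"],
)
names = [var["name"] for var in flat["vars"]]
```

Read back as a single table, this would give columns such as "customers.id" and "purchases.customer_id" side by side, with no single row count that fits both.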

All in all, after writing up these options, my preference is still option 1 because of its clarity and its alignment with the "wrapper" mentality of the MultiFrame stuff that I mentioned in my previous comment. But I'm really happy to be convinced otherwise!

@vankesteren

vankesteren commented Feb 9, 2026

Idea: use regex labeled groups using (?P<groupname>regex) to keep the parsing code for our new ColumnRelation syntax a bit more legible, maintainable, updateable.

# do this somewhere at top-level
import re
RELATIONS_PATTERN = re.compile(r"(?P<ptab>[\s\S]+)\[(?P<pcol>[\s\S]+)\]\s?<-\s?(?P<ftab>[\s\S]+)\[(?P<fcol>[\s\S]+)\]")

# do this within parse code in ColumnRelation constructor
relations_syntax = "customers[id] <- purchases[customer_id]"
match = RELATIONS_PATTERN.match(relations_syntax)
if match is not None:
    ColumnRelation(
        primary_table=match.group("ptab").strip(), 
        primary_key=match.group("pcol").strip(), 
        foreign_table=match.group("ftab").strip(), 
        foreign_key=match.group("fcol").strip(),
    )

We could even split out the [\s\S]+ part, since we reuse it 4 times in a single regex, but I'm not sure that makes it more legible.
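
For comparison, this is roughly what the pattern looks like with the repeated part factored out. It is a sketch with two deliberate changes that are assumptions rather than agreed behaviour: names may not contain square brackets (stricter than [\s\S]+), and the whole string has to match. `parse_relation` is a hypothetical helper name.

```python
import re

# Shared sub-pattern for a table or column name. Assumption: names may not
# contain square brackets, which is stricter than the original [\s\S]+.
NAME = r"[^\[\]]+"
RELATIONS_PATTERN = re.compile(
    rf"(?P<ptab>{NAME})\[(?P<pcol>{NAME})\]\s*<-\s*(?P<ftab>{NAME})\[(?P<fcol>{NAME})\]"
)


def parse_relation(text):
    """Return the four stripped name parts of a relation string, or raise ValueError."""
    match = RELATIONS_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"Cannot parse relation {text!r}.")
    return {key: value.strip() for key, value in match.groupdict().items()}


parts = parse_relation("customers[id] <- purchases[customer_id]")
```

Whether the f-string version is easier to read than the single long literal is a matter of taste; the named groups do most of the work either way.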

Update

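The same pattern can be written with the general (non-Python-specific) named group syntax (?<groupname>regex), which engines such as JavaScript's and PCRE accept. Python's built-in re module only accepts the (?P<groupname>regex) form, so the version below is for trying things out in an online tester, not for re.compile:

(?<ptab>[\s\S]+)\[(?<pcol>[\s\S]+)\]\s?<-\s?(?<ftab>[\s\S]+)\[(?<fcol>[\s\S]+)\]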

Try it out here to see what breaks! https://regexr.com/8joh7

@vankesteren

First comments after trying it out. Haven't managed to look at the code a lot yet.

  • demo_dataframe("shop_multi") now seems weird because this returns a dict of dataframes, not a dataframe. Maybe we should rename this to demo_data in general all over metasyn?
  • MultiFrame.save_json() should become (or be aliased as) MultiFrame.save() just like MetaFrame.save()
  • The two elements in that json should be called "relations" and "tables" instead of "relations" and "metaframes" (easier to understand for external audit)
  • Build support for MetaFrame.load() for these nested GMF files (e.g., selecting the first separate table?).
  • Improve the following error message:
    ValueError: Cannot parse relation 'testy > testx'. It should be of the form: tab1[col1] <- tab2[col2].
    The user should see at a glance what is the primary key and the foreign key. something like tab_a[col_primary] <- tab_b[col_foreign] maybe?
  • Check for that error message before fitting dataframes in MultiFrame.fit_dataframes() (otherwise we do a lot of work which goes to waste)
  • MultiFrame.fit_dataframes() prints some stuff at the end which it should not (if you do <? relations
  • MultiFrame.load() should be a classmethod
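
The "check before fitting" point can be sketched as follows. `MultiFrameSketch` is a hypothetical stand-in, the fitting step is a placeholder, relations are assumed to be plain strings, and the error text uses the tab_a[col_primary] <- tab_b[col_foreign] wording suggested above.

```python
import re

# Assumption: relations are strings and names contain no square brackets.
_RELATION = re.compile(r"\s*[^\[\]]+\[[^\[\]]+\]\s*<-\s*[^\[\]]+\[[^\[\]]+\]\s*")


class MultiFrameSketch:
    """Hypothetical stand-in for MultiFrame; only the call order matters here."""

    fit_calls = 0  # counts how often the (expensive) per-table fit ran

    def __init__(self, tables, relations):
        self.tables = tables
        self.relations = relations

    @classmethod
    def fit_dataframes(cls, dataframes, relations):
        # Validate every relation first, so a typo fails before any fitting work.
        for relation in relations:
            if _RELATION.fullmatch(relation) is None:
                raise ValueError(
                    f"Cannot parse relation {relation!r}. It should be of the "
                    "form: tab_a[col_primary] <- tab_b[col_foreign]."
                )
        tables = {name: cls._fit_one(df) for name, df in dataframes.items()}
        return cls(tables, list(relations))

    @classmethod
    def _fit_one(cls, dataframe):
        # Placeholder for the real, potentially slow, fitting step.
        cls.fit_calls += 1
        return {"n_rows": len(dataframe)}


data = {"customers": [1, 2, 3], "purchases": [1, 1, 2, 3]}
try:
    MultiFrameSketch.fit_dataframes(data, ["testy > testx"])
except ValueError as err:
    message = str(err)
```

With the check up front, the malformed relation is reported while `fit_calls` is still zero, so no table has been fitted for nothing.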

@vankesteren

Actually, I haven't read the docs yet but I'm not entirely sure why we need "<-" AND "<?" in the relations specification thingy
