Draft of new multi dataframe synthesis by qubixes · Pull Request #415 · sodascience/metasyn

qubixes · 2026-02-03T20:46:48Z

Adds primary and foreign key relationships. See #413

Could be separated out into its own package, though the current implementation (without documentation) is about 100 lines of code, so that might be a bit too light for a new package. It also adds a new demo multi dataset which can be used to explain how it works. A new notebook is also there to show the new feature off.

There are several things to be worked out still. A non-exhaustive list:

Fixes #413


vankesteren · 2026-02-05T15:59:08Z

first impression: excellent!! Will do more thorough testing and review later.

vankesteren · 2026-02-09T10:36:57Z

So we discussed that there will be two ways of creating between-table relationships:

# preferred way for most use-cases
relations = [
    "customers[id] <- purchases[customer_id]", 
    "products[id] <- purchases[product_id]"
]

# If you want to specify the relation type
# (RelationType is assumed here to be importable from the same module)
from metasyn.multitable import ColumnRelation, RelationType
relations = [
    ColumnRelation(
        primary_table="customers", primary_key="id", 
        foreign_table="purchases", foreign_key="customer_id", 
        relation_type=RelationType.Equal
    ),
    ColumnRelation(
        primary_table="products", primary_key="id", 
        foreign_table="purchases", foreign_key="product_id", 
        relation_type=RelationType.Equal
    ),
]

# maybe we should call these `mlf` by default to distinguish from MetaFrame (mf)?
mlf = MultiFrame(metaframes, relations, data)

vankesteren · 2026-02-09T10:41:27Z

Thoughts about integration of multiframe with metasyn

It's great that the multiframe stuff is "separate" from the basic metaframe stuff. We should attempt to keep it that way, so that we can think of multiframe as a kind of wrapper around a dict of metaframes.

I'm also a fan of keeping everything in its own module (metasyn.multiframe) to be explicit about this separation. We should be able to update metasyn without touching multiframe (and vice versa). I don't think it should be another package entirely because it is conceptually a feature of metasyn itself.

We can discuss what this approach entails for serialization.
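To make the "wrapper around a dict of metaframes" idea concrete, here is a rough sketch. It does not use metasyn at all: `TableModel` and `MultiWrapper` are hypothetical stand-ins (not the classes in this PR), and the per-table "model" simply resamples observed values. The point is only that each table is generated by its own model and that the wrapper alone knows about the relations.

```python
import random
from dataclasses import dataclass, field


@dataclass
class TableModel:
    """Hypothetical stand-in for a per-table model such as a MetaFrame."""
    columns: dict[str, list]  # column name -> observed values to sample from

    def synthesize(self, n: int, rng: random.Random) -> dict[str, list]:
        # Each column is sampled independently of all other columns.
        return {name: [rng.choice(values) for _ in range(n)]
                for name, values in self.columns.items()}


@dataclass
class MultiWrapper:
    """Hypothetical wrapper: a dict of table models plus relations between them."""
    tables: dict[str, TableModel]
    # Each relation: (primary_table, primary_key, foreign_table, foreign_key)
    relations: list[tuple[str, str, str, str]] = field(default_factory=list)

    def synthesize(self, n_rows: dict[str, int], seed: int = 0) -> dict[str, dict[str, list]]:
        rng = random.Random(seed)
        # Step 1: every table is generated by its own model, unaware of the others.
        out = {name: model.synthesize(n_rows[name], rng)
               for name, model in self.tables.items()}
        # Step 2: only the wrapper enforces relations, by redrawing each
        # foreign key column from the synthetic primary key column.
        for ptab, pcol, ftab, fcol in self.relations:
            keys = out[ptab][pcol]
            out[ftab][fcol] = [rng.choice(keys) for _ in out[ftab][fcol]]
        return out
```

With this split, the per-table model never needs to change when the relation logic changes, which is the separation argued for above.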

vankesteren · 2026-02-09T11:11:07Z

Thoughts about (de)serialization

We have three options for serialization of MultiFrames:

Option 1: separate files

In this option, the serialization of the MultiFrame is a folder with a GMF file for each individual MetaFrame and a serialized version of the list of ColumnRelation objects. This approach has several advantages: it's immediately clear that this pertains to several datasets, the relations are easy to inspect separately (because it's a single file), and the separate GMF files work by themselves as well in a backwards compatible way: it's possible to synthesize only a single file (or a subset of files?) using plain metasyn (MetaFrame).
The downside of this approach is that it is multiple files, and the relations file depends on being next to the gmf files. This is potentially error-prone and may be confusing for new users?
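A minimal sketch of what option 1 could look like on disk, using plain dicts in place of real GMF content. The function names and the file name `relations.json` are assumptions made for illustration, not something this PR defines.

```python
import json
from pathlib import Path


def save_separate(folder: Path, tables: dict[str, dict], relations: list[str]) -> None:
    """Write one JSON file per table plus a relations file into `folder`."""
    folder.mkdir(parents=True, exist_ok=True)
    for name, table in tables.items():
        (folder / f"{name}.json").write_text(json.dumps(table, indent=2))
    (folder / "relations.json").write_text(json.dumps(relations, indent=2))


def load_separate(folder: Path) -> tuple[dict[str, dict], list[str]]:
    """Read the folder back; fail loudly if the relations file is missing."""
    rel_file = folder / "relations.json"
    if not rel_file.exists():
        raise FileNotFoundError(f"No relations.json next to the table files in {folder}")
    relations = json.loads(rel_file.read_text())
    # Every other JSON file in the folder is treated as one table.
    tables = {path.stem: json.loads(path.read_text())
              for path in sorted(folder.glob("*.json"))
              if path.name != "relations.json"}
    return tables, relations
```

The sketch also shows the error-prone part mentioned above: loading only works if the relations file travels together with the table files, and a table that happens to be called "relations" would clash with it.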

Option 2: nested serialization

For this option, the multiframe is serialized to a single json file which has GMFs as elements of a list, and the table relations as a separate element. This would require a "new" serialization format with its own schema (and its own file extension, documentation?). This way, the GMF sub-elements can be separately validated by our existing code, but it remains a single file, which is nice for portability. We could still build in the possibility of only generating a single table for this approach, but this would require a bit of work in metasyn itself.
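A rough sketch of option 2 with plain dicts standing in for GMF content. The top-level keys `"tables"` and `"relations"` and both function names are assumptions for illustration. A mapping keyed by table name is used instead of the list mentioned above, purely to make lookup by name simple.

```python
import json


def dump_nested(tables: dict[str, dict], relations: list[str]) -> str:
    """Serialize all tables and their relations into one JSON document."""
    return json.dumps({"tables": tables, "relations": relations}, indent=2)


def load_single_table(text: str, name: str) -> dict:
    """Pull one table out of the nested document, ignoring the relations."""
    doc = json.loads(text)
    try:
        return doc["tables"][name]
    except KeyError:
        available = sorted(doc["tables"])
        raise KeyError(f"Table {name!r} not found; available: {available}") from None
```

`load_single_table` is the "only generate a single table" path: each sub-element can be handed to existing single-table code unchanged, while the file as a whole stays portable.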

Option 3: GMF-compliant multi-tables

In this option, we would find a way to create multiple tables within a single gmf-compliant gmf file. This would mean creating all columns in a flat way, all just with metavars. The column relations would be in the additional metadata field which we did already reserve in the GMF specification. Advantages: no other file format extension, things remain gmf-compliant, single file. Disadvantages: you can't really deserialize this gmf file to a MetaFrame in a meaningful way, or you might get a MetaFrame with many weirdly-named columns. Also, how would we deal with the table-level metadata such as the number of rows, which can differ per table?
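To illustrate the trade-off in option 3, here is a sketch with a heavily simplified, GMF-like dict (only `"n_rows"`, `"vars"` and `"metadata"` keys; the real format has more fields). The `table.column` naming scheme and both helpers are assumptions, not a proposal for the actual format.

```python
def flatten_tables(tables: dict[str, dict], relations: list[str]) -> dict:
    """Merge several table descriptions into one flat, single-table-like dict."""
    # Per-table row counts have no natural home in a flat file, so they are
    # stored in the metadata next to the relations.
    flat = {"vars": [], "metadata": {"relations": list(relations), "n_rows": {}}}
    for table_name, table in tables.items():
        flat["metadata"]["n_rows"][table_name] = table["n_rows"]
        for var in table["vars"]:
            # Prefix each column with its table name to keep names unique.
            flat["vars"].append({**var, "name": f"{table_name}.{var['name']}"})
    return flat


def unflatten_tables(flat: dict) -> dict[str, dict]:
    """Rebuild the per-table descriptions from the flat representation."""
    tables = {name: {"n_rows": n, "vars": []}
              for name, n in flat["metadata"]["n_rows"].items()}
    for var in flat["vars"]:
        # Split at the first dot, so table names must not contain a dot.
        table_name, _, column = var["name"].partition(".")
        tables[table_name]["vars"].append({**var, "name": column})
    return tables
```

Both disadvantages show up directly: a reader that ignores the metadata sees columns such as `customers.id` and `purchases.customer_id` in one table, and the row counts only survive because they were moved into the metadata.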

All in all, after writing up these options, my preference is still option 1 because of its clarity and its alignment with the "wrapper" mentality of the MultiFrame stuff that I mentioned in my previous comment. But I'm really happy to be convinced otherwise!

vankesteren · 2026-02-09T11:28:42Z

Idea: use regex labeled groups using (?P<groupname>regex) to keep the parsing code for our new ColumnRelation syntax a bit more legible, maintainable, updateable.

# do this somewhere at top-level
import re
RELATIONS_PATTERN = re.compile(r"(?P<ptab>[\s\S]+)\[(?P<pcol>[\s\S]+)\]\s?<-\s?(?P<ftab>[\s\S]+)\[(?P<fcol>[\s\S]+)\]")

# do this within parse code in ColumnRelation constructor
relations_syntax = "customers[id] <- purchases[customer_id]"
# the pattern is already compiled, so call .match() on it directly
match = RELATIONS_PATTERN.match(relations_syntax)
if match is not None:
    relation = ColumnRelation(
        primary_table=match.group("ptab").strip(),
        primary_key=match.group("pcol").strip(),
        foreign_table=match.group("ftab").strip(),
        foreign_key=match.group("fcol").strip(),
    )

We could even split out the [\s\S]+ part because we reuse it four times in a single regex, but I'm not sure that makes it more legible.
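For comparison, this is roughly what splitting out the repeated part could look like. The names `ANY` and `parse_relation` are made up for this sketch, and it returns a plain dict rather than a ColumnRelation so that it runs without metasyn.

```python
import re

# Reusable fragment: one or more characters of any kind, as in the pattern above.
ANY = r"[\s\S]+"

RELATIONS_PATTERN = re.compile(
    rf"(?P<ptab>{ANY})\[(?P<pcol>{ANY})\]"  # primary side: table[column]
    rf"\s?<-\s?"                            # arrow, optionally padded by one space
    rf"(?P<ftab>{ANY})\[(?P<fcol>{ANY})\]"  # foreign side: table[column]
)


def parse_relation(text: str) -> dict[str, str]:
    """Return the four named parts of a relation string, stripped of whitespace."""
    match = RELATIONS_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Cannot parse relation {text!r}.")
    return {name: value.strip() for name, value in match.groupdict().items()}
```

Note that `match` only anchors the start of the string, so trailing text after the last `]` is silently ignored. Using `fullmatch` instead would reject it.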

Update

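The same pattern can also be written with the general (non-Python-specific) named group syntax (?<groupname>regex), which is what other regex engines such as the one on regexr use. Python's built-in re module only accepts the (?P<groupname>regex) form and raises an error on this variant, so it is only meant for testing outside Python:

(?<ptab>[\s\S]+)\[(?<pcol>[\s\S]+)\]\s?<-\s?(?<ftab>[\s\S]+)\[(?<fcol>[\s\S]+)\]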

Try it out here to see what breaks! https://regexr.com/8joh7

vankesteren · 2026-03-11T15:20:51Z

First comments after trying it out. Haven't managed to look at the code a lot yet.

- demo_dataframe("shop_multi") now seems weird because it returns a dict of dataframes, not a dataframe. Maybe we should rename this to demo_data throughout metasyn?
- MultiFrame.save_json() should become (or be aliased as) MultiFrame.save(), just like MetaFrame.save().
- The two elements in that json should be called "relations" and "tables" instead of "relations" and "metaframes" (easier to understand for an external audit).
- Build support for MetaFrame.load() for these nested GMF files (e.g., selecting the first separate table?).
- Improve the following error message:
```
ValueError: Cannot parse relation 'testy > testx'. It should be of the form: tab1[col1] <- tab2[col2].
```
  The user should see at a glance which is the primary key and which is the foreign key. Something like tab_a[col_primary] <- tab_b[col_foreign] maybe?
- Check for that error before fitting dataframes in MultiFrame.fit_dataframes() (otherwise we do a lot of work that goes to waste).
- MultiFrame.fit_dataframes() prints some stuff at the end which it should not (if you use <? relations).
- MultiFrame.load() should be a classmethod.
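The "check before fitting" point could be handled by a small up-front validation step, sketched below. Everything here is hypothetical: `check_relations` is not a function in this PR, the regex is a simplified one that only allows word characters in table and column names, and only the `<-` form is handled.

```python
import re

# Simplified pattern (an assumption): names consist of word characters only.
RELATION_RE = re.compile(r"\s*(\w+)\[(\w+)\]\s*<-\s*(\w+)\[(\w+)\]\s*")
EXPECTED_FORM = "tab_a[col_primary] <- tab_b[col_foreign]"


def check_relations(
    relations: list[str], columns: dict[str, set[str]]
) -> list[tuple[str, str, str, str]]:
    """Parse and validate every relation before any expensive fitting starts."""
    parsed = []
    for rel in relations:
        match = RELATION_RE.fullmatch(rel)
        if match is None:
            raise ValueError(
                f"Cannot parse relation {rel!r}. "
                f"It should be of the form: {EXPECTED_FORM}."
            )
        ptab, pcol, ftab, fcol = match.groups()
        # Both sides must refer to a column that actually exists.
        for tab, col in ((ptab, pcol), (ftab, fcol)):
            if col not in columns.get(tab, set()):
                raise ValueError(f"Relation {rel!r} refers to unknown column {tab}[{col}].")
        parsed.append((ptab, pcol, ftab, fcol))
    return parsed
```

Calling something like this as the first statement of the fitting routine means a typo in a relation fails within milliseconds instead of after all tables have been fitted.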

vankesteren · 2026-03-11T15:21:59Z

Actually, I haven't read the docs yet but I'm not entirely sure why we need "<-" AND "<?" in the relations specification thingy

Also rename get_dataframe -> get_data

Draft of new multi dataframe synthesis

769ba8a

Adds primary and foreign key relationships.

qubixes requested a review from vankesteren February 3, 2026 20:46

qubixes added 12 commits February 25, 2026 15:55

Add serialization to multiframe

98735df

Remove output

0ed1459

Update regex for parsing relations

2fbcbb5

Clear outputs

41135ec

Add more comments and docstring

d08997d

Fix ruff

7892cf0

Fix mypy

63b1ad6

Add warning if primary key is not unique.

f44f599

Update documentation and tutorial

a6c22e6

Add multiframe tests

c2b0f26

Add documentation

4441249

Remove validation

b6edf4a

qubixes added 2 commits March 13, 2026 09:26

Add save/load to multiframe

9ce508f

Also rename get_dataframe -> get_data

Implement more suggestions

855643a

qubixes wants to merge 15 commits into develop from multiframe
