Adds primary and foreign key relationships.
|
first impression: excellent!! Will do more thorough testing and review later. |
|
So we discussed that there will be two ways of creating between-table relationships:
# preferred way for most use-cases
relations = [
    "customers[id] <- purchases[customer_id]",
    "products[id] <- purchases[product_id]"
]
# If you want to specify the relation type
from metasyn.multitable import ColumnRelation
relations = [
    ColumnRelation(
        primary_table="customers", primary_key="id",
        foreign_table="purchases", foreign_key="customer_id",
        relation_type=RelationType.Equal
    ),
    ColumnRelation(
        primary_table="products", primary_key="id",
        foreign_table="purchases", foreign_key="product_id",
        relation_type=RelationType.Equal
    ),
]
# maybe we should call these `mlf` by default to distinguish from MetaFrame (mf)?
mlf = MultiFrame(metaframes, relations, data) |
Thoughts about integration of multiframe with metasyn
It's great that the I'm also a fan of keeping everything in its own module ( We can discuss what this approach entails for serialization. |
Thoughts about (de)serialization
We have three options for serialization of MultiFrames:
Option 1: separate files
In this option, the serialization of the MultiFrame is a folder with a GMF file for each individual MetaFrame and a serialized version of the list of
Option 2: nested serialization
For this option, the multiframe is serialized to a single JSON file which has GMFs as elements of a list, and the table relations as a separate element. This would require a "new" serialization format with its own schema (and its own file extension, documentation?). This way, the GMF sub-elements can be separately validated by our existing code but it remains a single file, which is nice for portability. We could still build in the possibility of only generating a single table for this approach, but this would require a bit of work in metasyn itself.
Option 3: GMF-compliant multi-tables
In this option, we would find a way to create multiple tables within a single GMF-compliant GMF file. This would mean creating all columns in a flat way, all just with metavars. The column relations would be in the additional metadata field which we did already reserve in the GMF specification.
Advantages: no other file format extension, things remain GMF-compliant, single file.
Disadvantages: you can't really deserialize this GMF file to a MetaFrame in a meaningful way, or you might get a MetaFrame with many weirdly-named columns. Also, how would we deal with the table-level metadata such as the number of rows, which can differ per table?
All in all, after writing up these options my preference would still go out to option 1 because of the clarity, and its alignment with the "wrapper" mentality of the MultiFrame stuff that I mentioned in my previous comment. But I'm really happy to be convinced otherwise! |
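To make the trade-off between options 1 and 2 a bit more concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: the GMFs are plain dicts rather than real metasyn output, the relations are the shorthand strings from earlier in this thread, and the names `save_separate`, `load_separate`, `to_nested`, `relations.json`, and the `tables`/`relations` keys are hypothetical, not part of metasyn or the GMF specification.

```python
import json
import tempfile
from pathlib import Path

# Stand-ins for real objects (assumption): each "GMF" is just a dict and
# each relation is the shorthand string discussed in this thread.
gmfs = {
    "customers": {"n_rows": 3, "vars": [{"name": "id"}]},
    "purchases": {"n_rows": 5, "vars": [{"name": "customer_id"}]},
}
relations = ["customers[id] <- purchases[customer_id]"]


def save_separate(folder: Path, gmfs: dict, relations: list) -> None:
    """Option 1: one JSON file per table plus one file for the relations."""
    folder.mkdir(parents=True, exist_ok=True)
    for table_name, gmf in gmfs.items():
        (folder / f"{table_name}.json").write_text(json.dumps(gmf, indent=2))
    # Hypothetical file name; a table called "relations" would collide with it.
    (folder / "relations.json").write_text(json.dumps(relations, indent=2))


def load_separate(folder: Path) -> tuple:
    """Read the folder back; every file except relations.json is a table."""
    loaded_relations = json.loads((folder / "relations.json").read_text())
    loaded_gmfs = {
        path.stem: json.loads(path.read_text())
        for path in sorted(folder.glob("*.json"))
        if path.name != "relations.json"
    }
    return loaded_gmfs, loaded_relations


def to_nested(gmfs: dict, relations: list) -> str:
    """Option 2: a single document holding the tables and the relations."""
    return json.dumps({"tables": gmfs, "relations": relations}, indent=2)
```

With option 1 each per-table file could presumably still be read on its own by existing single-table code, while option 2 needs a new top-level schema around the same per-table content.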
|
Idea: use regex labeled groups using
# do this somewhere at top-level
import re
RELATIONS_PATTERN = re.compile(r"(?P<ptab>[\s\S]+)\[(?P<pcol>[\s\S]+)\]\s?<-\s?(?P<ftab>[\s\S]+)\[(?P<fcol>[\s\S]+)\]")
# do this within parse code in ColumnRelation constructor
relations_syntax = "customers[id] <- purchases[customer_id]"
match = re.match(RELATIONS_PATTERN, relations_syntax)
if match is not None:
    ColumnRelation(
        primary_table=match.group("ptab").strip(),
        primary_key=match.group("pcol").strip(),
        foreign_table=match.group("ftab").strip(),
        foreign_key=match.group("fcol").strip(),
    )
we could even split out the
Update: This also works with the general (non-Python-specific) named group syntax. Note that Python's built-in re module only accepts the (?P<name>...) form, so the pattern below is for testing outside Python:
(?<ptab>[\s\S]+)\[(?<pcol>[\s\S]+)\]\s?<-\s?(?<ftab>[\s\S]+)\[(?<fcol>[\s\S]+)\]
Try it out here to see what breaks! https://regexr.com/8joh7 |
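As one answer to "what breaks": the greedy `[\s\S]+` groups can swallow brackets and arrows, and `re.match` does not anchor at the end of the string. The sketch below keeps the original pattern as `LOOSE` for comparison and adds a tighter variant. `STRICT`, `ParsedRelation`, and `parse_relation` are hypothetical names, and `ParsedRelation` is only a stand-in for `ColumnRelation` so the example runs without metasyn.

```python
import re
from typing import NamedTuple, Optional

# Pattern from the comment above, unchanged.
LOOSE = re.compile(
    r"(?P<ptab>[\s\S]+)\[(?P<pcol>[\s\S]+)\]\s?<-\s?"
    r"(?P<ftab>[\s\S]+)\[(?P<fcol>[\s\S]+)\]"
)

# Hypothetical tighter variant: names may not contain square brackets,
# and any amount of whitespace is allowed around the arrow.
STRICT = re.compile(
    r"\s*(?P<ptab>[^\[\]]+)\[(?P<pcol>[^\[\]]+)\]\s*<-\s*"
    r"(?P<ftab>[^\[\]]+)\[(?P<fcol>[^\[\]]+)\]\s*"
)


class ParsedRelation(NamedTuple):
    """Plain stand-in for ColumnRelation so the sketch runs on its own."""

    primary_table: str
    primary_key: str
    foreign_table: str
    foreign_key: str


def parse_relation(text: str) -> Optional[ParsedRelation]:
    """Return the four parts of a relation string, or None if malformed."""
    # fullmatch rejects trailing text that re.match would silently ignore.
    match = STRICT.fullmatch(text)
    if match is None:
        return None
    parts = [match.group(name).strip()
             for name in ("ptab", "pcol", "ftab", "fcol")]
    # Reject names that consist only of whitespace.
    if not all(parts):
        return None
    return ParsedRelation(*parts)


# A chained string is accepted by the loose pattern with a surprising table name.
chained = "a[b] <- c[d] <- e[f]"
print(LOOSE.match(chained).group("ptab"))
print(parse_relation(chained))
print(parse_relation("customers[id] <- purchases[customer_id]"))
```

Returning `None` mirrors the `if match is not None` check in the original snippet; raising a `ValueError` with the offending string might be friendlier inside a constructor.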
|
First comments after trying it out. Haven't managed to look at the code a lot yet.
|
|
Actually, I haven't read the docs yet but I'm not entirely sure why we need "<-" AND "<?" in the relations specification thingy |
Also rename get_dataframe -> get_data
Adds primary and foreign key relationships. See #413
Could be separated out into its own package, though the current implementation (without documentation) is about 100 lines of code, so that might be a bit too light for a new package. It also adds a new demo multi dataset which can be used to explain how it works. A new notebook is also there to show the new feature off.
There are several things to be worked out still. A non-exhaustive list:
Multiframe.fit_dataframes() without adding metaframes (which will be inferred with default keywords).
table1[key1] <- table2[key2], arrow determined
Fixes #413