Skip to content

[Proposal] Redesign of the Storage to combine Relational, Graph and Vector Databases #1081

@JoshInnis

Description

@JoshInnis

Where we are today

Currently, we use a table basic heap tables to store Vertices and Edges as separate entities.

The vertex table template:

Sudocode

 CREATE TABLE graph_name.vertex_label_name (
  id graphid
  properties gtype
)
 CREATE TABLE graph_name.edge_label_name (
  id graphid,
startid graphid,
endid graphid,
  properties gtype
)

Where we should go

  1. Adjancency List
  2. Vector column
  3. A way to generate a vector via some user system

The Storage Layer

We need to split the nodes adjacency lists from their properties (i.e. we need to have a set of pages holding adjacency lists and another page to hold properties. It the vector should probably go with the adjacency list.

This model as it is, allows for really good traditional graph search algorithms such is a wide variety of ways to implement. Dijkstra, A* BFS, DFS etc, but it also can serve as a format for new Approximation algorithms.

The normal graph tables will have an adjacency list that follows normal graph database conventions. (ie the user creates edges. The database itself does not create edges in these sets of pages.)

However, the user can create indices on the graph that do approximate nearest neighbors. The page system that we have designed for absolute nearest neighbors.

The Concept of Edges

In the cypher model, edges have properties. So that implies that the edge must be stored in a set of pages dedicated entirely to them.

In the nearest neighbor model, they do not by default and to the authors best knowledge there is a gap in research about K-ANN with properties to filter (and the label issue too, but we have some novel solutions to that I need to write down).

What this means for us in the implementation details to note is that we can either store half as many edges in the normal pages vs the index pages for the same amount of space or we can store a reference to the edge's properties AND a reference to the connected vertex.

  1. If we store a reference to the just the edge not the other vertex, that may be a performance slowdown, for traversals. because we have to pull the edge page up no matter what in this case.
  2. On the other hand, if they reference property filters, we have to go that page anyway.

What does the DDL and DQL Features Look Like to Support This

I am not sure yet what this will ultimately look like. Fundamentally the question is, whether I force the user to make it clear whether to use a certain index or not. Postgres's planner may select the ANN index, or it may choose the normal table. It's debatable. Other databases implemented extensions to the SQL language to tell the server whether the use wants to use the approximate index or if it wants exact k-NN.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions