Adopt SF1000+ data sets #173

The SNB Interactive benchmark is currently limited to:
- Data sets up to SF1000
- Append-only workloads without deletions

These limitations could be lifted by backporting the improvements made for the BI workload.

## Larger data sets

Scaling the Interactive workload to SF3000 is not trivial: the Hadoop-based Datagen breaks for SF1000+ data sets (with an NPE), and the old parameter generator has scalability issues (it is a single-threaded Python 2 script that already takes about a day to finish for SF1000). Therefore, we should use the new Spark-based generator. However, this creates at least three development tasks:

- The existing Cypher and SQL solutions need to be updated to work with the new schemas produced by the Spark-based Datagen.
- The Interactive parameter generator has to be ported (effectively reimplemented) in Spark/SparkSQL (https://github.com/ldbc/ldbc_snb_datagen/issues/219).
- The inserts generated by the new data generator (e.g. `inserts/dynamic/Person/part-*.csv`) use a different format than the update streams produced by the old generator. To work around this, we would need to either adjust the driver or implement an "insert file to update stream converter". (The latter seems simpler and mostly doable in SQL.)
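As a rough illustration of the "insert file to update stream converter" idea, here is a minimal Python sketch. Everything format-related in it is an assumption made for illustration, not the actual Datagen or driver format: the `creationDate` column name, the `|` delimiter, the output layout (`scheduledTime|dependencyTime|operationType|payload...`), the placeholder dependency time `0`, and the operation code `OP_ADD_PERSON`. It also assumes timestamps are uniformly formatted ISO 8601 strings, so sorting them as text gives chronological order.

```python
import csv
import io

# Hypothetical operation-type code for "add person"; a real converter
# would have to use whatever codes the driver expects.
OP_ADD_PERSON = 1


def inserts_to_update_stream(insert_csv, op_type=OP_ADD_PERSON,
                             time_column="creationDate", delimiter="|"):
    """Convert rows of one insert file into update-stream-like lines.

    insert_csv is any iterable of text lines with a header row.
    The output layout is an assumed one:
    scheduledTime|dependencyTime|operationType|remaining columns...
    """
    reader = csv.DictReader(insert_csv, delimiter=delimiter)
    events = []
    for row in reader:
        scheduled = row[time_column]
        # Keep all other columns, in file order, as the operation payload.
        payload = [value for key, value in row.items() if key != time_column]
        # The dependency time is a placeholder ("0"); computing the real
        # value is outside the scope of this sketch.
        events.append([scheduled, "0", str(op_type)] + payload)
    # Uniform ISO 8601 strings sort chronologically as plain text.
    events.sort(key=lambda event: event[0])
    return [delimiter.join(event) for event in events]


sample = io.StringIO(
    "creationDate|id|firstName\n"
    "2012-01-02T00:00:00|2|Bob\n"
    "2012-01-01T00:00:00|1|Alice\n"
)
for line in inserts_to_update_stream(sample):
    print(line)
```

The same transformation is a projection plus an `ORDER BY`, which is why doing it in SQL looks feasible.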

## Introducing deletions

Deletions would be a realistic addition to an OLTP benchmark. The generator is capable of producing them, so it's only a matter of integrating them into the workload. The key challenges here are:

1. Figuring out the format: maybe the `deletes/dynamic/Person/part-*.csv` files work well, maybe an `updateStream`-like delete stream would work better.
2. Integrating them into the driver.
3. Tuning their ratio relative to the other operations.
4. Determining how they should be reported in the benchmark results (e.g. a delete can be counted simply as another operation, contributing one operation to the throughput).
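To make the ratio and reporting questions concrete, the following Python sketch interleaves an insert stream and a delete stream by timestamp, then computes the share of deletes and a throughput figure in which each delete counts as one operation. The `(timestamp, operation)` tuple layout and the `INS_`/`DEL_` operation names are hypothetical; this is one possible accounting, not how the driver works today.

```python
import heapq


def merge_streams(inserts, deletes):
    """Interleave two streams of (timestamp, operation) tuples.

    Both inputs must already be sorted by timestamp.
    """
    return list(heapq.merge(inserts, deletes, key=lambda event: event[0]))


def delete_ratio(stream):
    """Fraction of operations in the stream that are deletes."""
    deletes = sum(1 for _, op in stream if op.startswith("DEL_"))
    return deletes / len(stream)


def throughput(stream, elapsed_seconds):
    """Operations per second, counting every delete as one operation."""
    return len(stream) / elapsed_seconds


inserts = [(1, "INS_PERSON"), (4, "INS_PERSON")]
deletes = [(3, "DEL_PERSON")]
merged = merge_streams(inserts, deletes)
print(merged)
print(delete_ratio(merged), throughput(merged, elapsed_seconds=1.5))
```

Under this accounting, tuning the ratio amounts to choosing how many delete events are fed into the merge.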

## Timeline

These are plans for the mid-term future (late 2021 or early 2022), depending on the interest in such a benchmark.

