
Adopt SF1000+ data sets #173

@szarnyasg

The SNB Interactive benchmark is currently limited to:

  • Data sets up to SF1000
  • Append-only workloads without deletions

Both limitations could be lifted by backporting the improvements made for the BI workload.

Larger data sets

Scaling the Interactive workload to SF3000 is not trivial: the Hadoop-based Datagen breaks for SF1000+ data sets (with an NPE), and the old parameter generator has scalability issues (it is a single-threaded Python 2 script that already needs about a day to finish for SF1000). Therefore, we should use the new Spark-based generator. However, this creates at least three development tasks:

  • The existing Cypher and SQL solutions need to be updated to work with the new schemas produced by the Spark-based Datagen.
  • The Interactive parameter generator has to be ported (effectively reimplemented) in Spark/SparkSQL (see "Factor generation for Interactive", ldbc_snb_datagen_spark#219).
  • The inserts generated by the new data generator (e.g. inserts/dynamic/Person/part-*.csv) use a different format than the update streams produced by the old generator. To work around this, we would need to either adjust the driver or implement an "insert file to update stream converter". (The latter seems simpler and mostly doable in SQL.)
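As a rough illustration of the "insert file to update stream converter" idea, here is a minimal Python sketch (the issue suggests SQL would also work). Everything specific in it is an assumption made for illustration, not the actual Datagen or driver format: the pipe-delimited input, the `creationDate` column holding an epoch-millisecond integer, the `scheduled|dependency|opCode|params...` output layout, and the operation code `1` are all hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List

# Hypothetical operation code for "add person"; the real driver's codes may differ.
ADD_PERSON_OP = 1


def inserts_to_update_stream(
    rows: Iterable[Dict[str, str]], op_code: int = ADD_PERSON_OP
) -> Iterator[str]:
    """Turn insert rows into update-stream-like lines, ordered by creation time."""
    # Assumption: creationDate is an integer timestamp (epoch milliseconds).
    ordered = sorted(rows, key=lambda r: int(r["creationDate"]))
    for row in ordered:
        scheduled = int(row["creationDate"])
        # Assumption: a new Person depends on no earlier entity, so use 0.
        dependency = 0
        # All remaining columns become the operation's parameters, in file order.
        payload = [v for k, v in row.items() if k != "creationDate"]
        yield "|".join([str(scheduled), str(dependency), str(op_code), *payload])


def convert(insert_csv: str) -> List[str]:
    """Convert the text of one insert file into a list of update-stream lines."""
    reader = csv.DictReader(io.StringIO(insert_csv), delimiter="|")
    return list(inserts_to_update_stream(reader))
```

A real converter would need one mapping per entity type (Person, Post, Comment, and so on) and a proper dependency time for entities that refer to earlier ones.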

Introducing deletions

Deletions would be a realistic addition to an OLTP benchmark. The generator is capable of producing them, so it's only a matter of integrating them into the workload. The key challenges here are:

  • Figuring out the format: maybe the deletes/dynamic/Person/part-*.csv files work well, maybe an updateStream-like delete stream would work better.
  • Integrating them into the driver.
  • Tuning the ratio of deletes to other operations.
  • Determining how they should be reported in the benchmark results (e.g. a delete can be counted simply as another operation, contributing one operation to the throughput).
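To make the ratio and reporting questions concrete, the following Python sketch interleaves an insert stream and a delete stream by timestamp, then computes the delete ratio and a throughput figure in which each delete counts as exactly one operation. The `(timestamp, kind, entity id)` event tuple and the function names are hypothetical and chosen only for this example; they do not reflect the driver's actual data model.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical event shape: (timestamp, kind, entity id).
Event = Tuple[int, str, str]


def merge_streams(inserts: Iterable[Event], deletes: Iterable[Event]) -> List[Event]:
    """Interleave two streams by timestamp; each input must already be sorted."""
    return list(heapq.merge(inserts, deletes, key=lambda e: e[0]))


def delete_ratio(events: List[Event]) -> float:
    """Fraction of operations in the merged stream that are deletes."""
    if not events:
        return 0.0
    return sum(1 for e in events if e[1] == "delete") / len(events)


def throughput(events: List[Event], duration_seconds: float) -> float:
    """Operations per second, counting every delete as one operation."""
    return len(events) / duration_seconds
```

With this counting rule, deletes need no special treatment in the reported results; a weighted scheme would only change `throughput`.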

Timeline

These are plans for the mid-term future (late 2021 or early 2022), depending on the interest in such a benchmark.
