TL;DR: I found it harder to implement this in the way the spec defines, and found that a split between read and write makes things a lot easier.
As context for this: I previously tried to implement this using async iterables in place of "streams", where the dataset also implemented `[Symbol.asyncIterator]`. However, the implementation was messy and didn't fall cleanly into place; it required awkward inheritance where a factory was needed to create new datasets within `DatasetCore`.

I had hoped that using async iterables would help with the lazy style of reading I wanted to achieve, but it turns out you can achieve the same thing with sync Iterables, as can be seen in the code snippets below.
I have had great success implementing the dataset using two separate concepts.

First, we have our "read" dataset:
```ts
export interface FilterIterateeFn<T> {
  (value: T): boolean
}

export interface RunIteratee<T> {
  (value: T): void
}

export interface MapIteratee<T, R> {
  (value: T): R
}
```
```ts
export interface ReadonlyDataset extends Iterable<Quad> {
  size: number
  empty: boolean
  filter(iteratee: FilterIterateeFn<Quad>): ReadonlyDataset
  except(iteratee: FilterIterateeFn<Quad>): ReadonlyDataset
  match(find: Quad | QuadFind): ReadonlyDataset
  without(find: Quad | QuadFind): ReadonlyDataset
  has(find: Quad | QuadFind): boolean
  contains(dataset: Iterable<Quad | QuadLike>): boolean
  difference(dataset: Iterable<Quad | QuadLike>): ReadonlyDataset
  equals(dataset: Iterable<Quad | QuadLike>): boolean
  every(iteratee: FilterIterateeFn<Quad>): boolean
  forEach(iteratee: RunIteratee<Quad>): void
  intersection(dataset: Iterable<Quad | QuadLike>): ReadonlyDataset
  map(iteratee: MapIteratee<Quad, QuadLike>): ReadonlyDataset
  some(iteratee: FilterIterateeFn<Quad>): boolean
  toArray(): Quad[]
  union(dataset: Iterable<Quad | QuadLike>): ReadonlyDataset
}
```
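To make the lazy reading concrete, here is a minimal sketch (illustrative only, not the actual implementation) of how a read dataset like this can wrap any source `Iterable<Quad>`, with `filter` returning a new read dataset over a generator:

```ts
// Sketch only: a read dataset backed by any Iterable<Quad>.
// Nothing is evaluated until the result is iterated.
class SketchReadonlyDataset implements Iterable<Quad> {
  constructor(private readonly source: Iterable<Quad> = []) {}

  *[Symbol.iterator]() {
    yield* this.source
  }

  get size(): number {
    let count = 0
    for (const _ of this) count += 1
    return count
  }

  get empty(): boolean {
    for (const _ of this) return false
    return true
  }

  filter(iteratee: FilterIterateeFn<Quad>): SketchReadonlyDataset {
    const source = this
    return new SketchReadonlyDataset({
      *[Symbol.iterator]() {
        for (const quad of source) {
          if (iteratee(quad)) yield quad
        }
      }
    })
  }
}
```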
And our write dataset...
```ts
export interface Dataset extends ReadonlyDataset {
  add(value: Quad | QuadLike): Dataset
  addAll(dataset: Iterable<Quad | QuadLike>): Dataset
  import(dataset: AsyncIterable<Quad | QuadLike>): Promise<unknown>
  delete(quad: Quad | QuadLike | QuadFind): Dataset
}
```
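As a sketch of how little the write side needs to add on top of the read dataset sketch above (`delete` omitted for brevity; `fromQuadLike` is a hypothetical helper for normalising a `QuadLike` into a `Quad`, not part of the actual implementation):

```ts
// Sketch only: a write dataset backed by a Set. A Set is itself an
// Iterable<Quad>, so the read side can iterate it directly and stays live.
declare function fromQuadLike(value: Quad | QuadLike): Quad // hypothetical

class SketchDataset extends SketchReadonlyDataset {
  private readonly quads: Set<Quad>

  constructor(quads: Set<Quad> = new Set()) {
    super(quads)
    this.quads = quads
  }

  add(value: Quad | QuadLike): this {
    this.quads.add(fromQuadLike(value))
    return this
  }

  addAll(dataset: Iterable<Quad | QuadLike>): this {
    for (const quad of dataset) this.add(quad)
    return this
  }

  async import(dataset: AsyncIterable<Quad | QuadLike>): Promise<unknown> {
    for await (const quad of dataset) this.add(quad)
    return undefined
  }
}
```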
If we want an immutable write dataset...
```ts
export interface ImmutableDataset extends Dataset {
  add(value: Quad | QuadLike): ImmutableDataset
  addAll(dataset: Iterable<Quad | QuadLike>): ImmutableDataset
  import(dataset: AsyncIterable<Quad | QuadLike>): Promise<ImmutableDataset>
  delete(quad: Quad | QuadLike | QuadFind): ImmutableDataset
}
```
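And a matching sketch of the immutable variant, where each write copies into a new backing Set and returns a fresh dataset (same assumptions as the sketches above):

```ts
// Sketch only: writes never mutate; each returns a new dataset over a copy.
class SketchImmutableDataset extends SketchReadonlyDataset {
  constructor(private readonly quads: Set<Quad> = new Set()) {
    super(quads)
  }

  add(value: Quad | QuadLike): SketchImmutableDataset {
    const next = new Set(this.quads)
    next.add(fromQuadLike(value)) // hypothetical normaliser, as above
    return new SketchImmutableDataset(next)
  }

  async import(dataset: AsyncIterable<Quad | QuadLike>): Promise<SketchImmutableDataset> {
    let result: SketchImmutableDataset = this
    for await (const quad of dataset) {
      result = result.add(quad)
    }
    return result
  }
}
```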
A couple of specific changes:

- `DatasetCore` is "moved up" to become the write dataset on top
- `Dataset` is "moved down" to become the read dataset
- Write functions return writable datasets
- Read functions return readable datasets
In terms of implementation, this felt a lot more natural and took a lot less time than trying to follow the spec one-to-one. I've used TypeScript types here specifically, but I think they show nicely how the implementation works.
Behind the scenes I was able to utilise a Set as the backing collection for quads.

The read dataset in this implementation accepts a source Iterable, which the read dataset also implements itself, meaning we don't need any intermediate steps between creating new datasets; a Set implements Iterable too, making everything very seamless.

Chaining using the read dataset is very clean, as can be seen in the implementation of the read dataset itself.
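As a sketch of what that chaining can look like (illustrative only, building on the read dataset sketch above; `isMatch` is a hypothetical comparator for a `Quad` against a `QuadFind`, not part of the actual implementation):

```ts
// Sketch only: derived read operations chain on top of `filter`,
// so each one stays a lazy one-liner over the source iterable.
declare function isMatch(quad: Quad, find: Quad | QuadFind): boolean // hypothetical

class SketchChainingDataset extends SketchReadonlyDataset {
  except(iteratee: FilterIterateeFn<Quad>) {
    return this.filter(quad => !iteratee(quad))
  }

  match(find: Quad | QuadFind) {
    return this.filter(quad => isMatch(quad, find))
  }

  without(find: Quad | QuadFind) {
    return this.filter(quad => !isMatch(quad, find))
  }

  has(find: Quad | QuadFind) {
    return !this.match(find).empty
  }
}
```

Because every returned dataset just wraps a generator over its source, nothing is evaluated until the result is iterated, which is also what makes the "live" view below possible.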
Using iterables also enables this kind of usage where the returned read dataset is a "live" view of the write dataset:
```js
import { Dataset } from "../esnext/index.js"
import { DefaultDataFactory } from "@opennetwork/rdf-data-model"

const dataset = new Dataset()

const aNameMatch = {
  subject: DefaultDataFactory.blankNode("a"),
  predicate: DefaultDataFactory.namedNode("http://xmlns.com/foaf/0.1/name"),
  graph: DefaultDataFactory.defaultGraph()
}

const aMatcher = dataset.match(aNameMatch)

dataset.add({
  ...aNameMatch,
  object: DefaultDataFactory.literal(`"A"@en`)
})
dataset.add({
  subject: DefaultDataFactory.blankNode("s"),
  predicate: DefaultDataFactory.namedNode("http://xmlns.com/foaf/0.1/name"),
  object: DefaultDataFactory.literal(`"s"@en`),
  graph: DefaultDataFactory.defaultGraph()
})

console.log({ a: aMatcher.size, total: dataset.size })

dataset.add({
  ...aNameMatch,
  object: DefaultDataFactory.literal(`"B"@en`)
})

console.log({ a: aMatcher.size, total: dataset.size })

dataset.add({
  ...aNameMatch,
  object: DefaultDataFactory.literal(`"C"@en`)
})

console.log({ a: aMatcher.size, total: dataset.size })

console.log({ aObjects: aMatcher.toArray().map(({ object }) => object) })
```
This snippet outputs:
```
{ a: 1, total: 2 }
{ a: 2, total: 3 }
{ a: 3, total: 4 }
{
  aObjects: [
    LiteralImplementation {
      termType: 'Literal',
      value: 'A',
      language: 'en',
      datatype: [NamedNodeImplementation]
    },
    LiteralImplementation {
      termType: 'Literal',
      value: 'B',
      language: 'en',
      datatype: [NamedNodeImplementation]
    },
    LiteralImplementation {
      termType: 'Literal',
      value: 'C',
      language: 'en',
      datatype: [NamedNodeImplementation]
    }
  ]
}
```
I have implemented these datasets as sync because I only want to know what's in memory right now, not what's available in some remote dataset. If you want to import information from a remote dataset, you should utilise `import` if you have an async iterable (Node.js ReadableStream, MongoDB cursors, etc.), or `addAll` if you have another in-memory dataset or a sync iterable (Arrays, Sets, etc.).
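For illustration, a sketch of both entry points (the async generator here is just a stand-in for any remote source of quads):

```ts
import { Dataset } from "../esnext/index.js"
import { DefaultDataFactory } from "@opennetwork/rdf-data-model"

const local = new Dataset()

// Sync iterables (Arrays, Sets, other in-memory datasets) go in via addAll.
local.addAll([
  {
    subject: DefaultDataFactory.blankNode("a"),
    predicate: DefaultDataFactory.namedNode("http://xmlns.com/foaf/0.1/name"),
    object: DefaultDataFactory.literal(`"A"@en`),
    graph: DefaultDataFactory.defaultGraph()
  }
])

// Async iterables (Node.js streams, database cursors, remote sources) go via import.
// This generator is a stand-in for quads arriving from elsewhere.
async function* remoteQuads() {
  yield {
    subject: DefaultDataFactory.blankNode("b"),
    predicate: DefaultDataFactory.namedNode("http://xmlns.com/foaf/0.1/name"),
    object: DefaultDataFactory.literal(`"B"@en`),
    graph: DefaultDataFactory.defaultGraph()
  }
}
await local.import(remoteQuads())
```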