
Add a new dataset server

This walkthrough illustrates the steps that the implementor of a dataset server must follow in order to create a DatasetTemplate specification.

In KOBE, a dataset template is defined using a set of Docker images. Additional parameters include the port and the path on which the container will listen for queries. Dataset templates are used in Benchmark specifications to define the dataset servers of the federated endpoints.

Prerequisites

In this walkthrough we assume that you have already prepared a Docker image that provides the SPARQL endpoint of the dataset server (e.g., https://hub.docker.com/r/openlink/virtuoso-opensource-7).

Step 1. Prepare your Docker images

The first step is to provide a set of one or more Docker images that download the dataset, load the data, and start the dataset server. Even though all of this functionality can be provided by a single image, we suggest splitting the tasks into three separate images. More specifically:

  • A Docker image that downloads an RDF dump from a known URL (found in the variable $DATASET_URL) and extracts its contents into the directory /kobe/dataset/$DATASET_NAME/dump.
  • A Docker image that loads the downloaded dump (already present in the directory /kobe/dataset/$DATASET_NAME/dump) into the dataset server. Optionally, it can back up the contents of the database in some directory inside /kobe/dataset/$DATASET_NAME so that the loading process is executed only once.
  • A Docker image that starts the dataset server and exposes its SPARQL endpoint.

The environment variables are initialized by the KOBE operator according to the specification of the benchmark to be executed. The shared volumes are also managed by the KOBE operator (see here for details about the shared storage of KOBE).
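To make this contract concrete, the fragment below sketches what the operator effectively provides to the download init container. It is illustrative only: the values of DATASET_URL and DATASET_NAME, as well as the volume name, are placeholders chosen for this example, and the actual pod specification is generated by the KOBE operator.

- name: initcontainer0
  image: semagrow/url-downloader
  env:
    - name: DATASET_URL      # URL of the dump to download (set by the operator)
      value: http://example.org/dumps/mydataset.tar.gz  # placeholder value
    - name: DATASET_NAME     # used to build the /kobe/dataset/$DATASET_NAME/ paths
      value: mydataset       # placeholder value
  volumeMounts:
    - name: dataset-volume   # hypothetical name for the shared KOBE volume
      mountPath: /kobe/dataset  # shared storage where the dump is extracted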

In the benchmark walkthrough, we suggest that the dataset dumps follow a specific format. Therefore, feel free to use semagrow/url-downloader (source code here) as your first image. However, if you want your template to support more dataset dump formats, you can implement your own URL downloader.

As examples, we present the images for two dataset servers (namely Virtuoso and Strabon).

  • Regarding the Virtuoso RDF store, we use the images semagrow/url-downloader, semagrow/virtuoso-init (source code here), and semagrow/virtuoso-main (source code here). We use the shared storage of KOBE to keep a backup of the /database directory of Virtuoso, which contains all the files used by the database. The last two images are built upon openlink/virtuoso-opensource-7.

  • Regarding the Strabon geospatial RDF store, we use the images semagrow/url-downloader, semagrow/strabon-init (source code here), and semagrow/strabon-main (source code here). We use the shared storage of KOBE to keep a backup of the PostGIS database (directory /var/lib/postgresql/9.4/main), which is where the data are kept inside Strabon. The last two images are built using the Dockerfile of KR-Suite (see http://github.com/GiorgosMandi/KR-Suite-docker).

Step 2. Prepare your YAML file

Once you have prepared the Docker images, creating the dataset template specification for your dataset server is a straightforward task. It should look like the following (we use the template for Virtuoso as an example):

apiVersion: kobe.semagrow.org/v1alpha1
kind: DatasetTemplate
metadata:
  # Each dataset template can be uniquely identified by its name.
  name: virtuosotemplate
spec:
  initContainers:
    # here you put the first two images (that is, the images for initializing
    # your server, in the order in which you want them to be executed).
    - name: initcontainer0
      image: semagrow/url-downloader
    - name: initcontainer1
      image: semagrow/virtuoso-init
  containers:
    # here you put the last image (that is, the image that serves the data)
    - name: maincontainer
      image: semagrow/virtuoso-main
      ports:
        - containerPort: 8890  # port to listen for queries
  port: 8890     # port to listen for queries
  path: /sparql  # path to listen for queries

The default URL of the SPARQL endpoint of Virtuoso is http://localhost:8890/sparql; hence the port and the path to listen for queries are 8890 and /sparql, respectively.
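The same structure applies to any other dataset server; only the images, the port, and the path change. As a rough sketch, a template for Strabon could look like the following (the port and path below are assumptions based on a typical Tomcat deployment of the Strabon endpoint and should be verified against the semagrow/strabon-main image):

apiVersion: kobe.semagrow.org/v1alpha1
kind: DatasetTemplate
metadata:
  name: strabontemplate
spec:
  initContainers:
    - name: initcontainer0
      image: semagrow/url-downloader
    - name: initcontainer1
      image: semagrow/strabon-init
  containers:
    - name: maincontainer
      image: semagrow/strabon-main
      ports:
        - containerPort: 8080  # assumed Tomcat port; verify against your image
  port: 8080            # assumed port to listen for queries
  path: /Strabon/Query  # assumed endpoint path; verify against your image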

Examples

We have already prepared several dataset template specifications to experiment with.

Note

We plan to define more dataset template specifications in the future. We place all dataset template specifications in the examples/ directory under a subdirectory with the prefix dataset-*.