# Support a new database system
This page describes how Ontop can be extended to support a novel relational database system (RDBMS).
The following RDBMSs are currently supported:
- DB2
- MySQL / MariaDB
- Oracle
- PostgreSQL
- SQL Server
- H2
- Denodo
- Dremio
- Apache Spark SQL & Databricks
- Teiid
- Snowflake
- Trino
- Presto
- Athena
- Redshift
- DuckDB
Thanks to dependency injection, it is relatively easy to extend Ontop's source code to support an additional RDBMS.
New: Guided Implementation
Ontop also provides a Python tool that automatically generates a large part of the required code and simplifies the process of developing support for new SQL dialects.
Using it minimizes the implementation overhead for new dialect support and lets you test your code easily right away.
This page provides a detailed explanation of the key concepts that need to be implemented. It was last updated for Ontop 5.0.2; for newer versions of Ontop, it may no longer be up to date.
# Required implementations
Two implementations must be provided, each for a different interface that already has a default (abstract) implementation.
These implementations dictate what the datatypes of the RDBMS are and how certain function symbols can be translated into the SQL dialect supported by the RDBMS.
The default implementation of each of these interfaces is often sufficient to handle many query operators and functions. As a result, only a few methods generally need to be overwritten, to account for specificities of the new SQL dialect.
The two required implementations are the DB function symbol factory and the DB datatype factory.
# DB function symbol factory
The DB function symbol factory provides function symbols which can be directly serialized into the target SQL dialect (e.g. `LENGTH` or `CURRENT_TIMESTAMP`). This factory has methods for common operations with precise semantics (e.g. concatenating 3 string arguments in a null-rejecting manner).
The interface to implement is `DBFunctionSymbolFactory`, and the default implementation is `AbstractSQLDBFunctionSymbolFactory`.
For instance, the DB function symbol factory for PostgreSQL within Ontop is the class `PostgreSQLDBFunctionSymbolFactory`.
WARNING
This interface should not be confused with the `FunctionSymbolFactory`, which is in charge of constructing SPARQL function symbols and other function symbols that cannot be directly serialized into SQL.
A basic template for this implementation is provided here:
```java
public class NewDialectDBFunctionSymbolFactory extends AbstractSQLDBFunctionSymbolFactory {

    @Inject
    protected NewDialectDBFunctionSymbolFactory(TypeFactory typeFactory) {
        super(createNewDialectRegularFunctionTable(typeFactory), typeFactory);
    }

    protected static ImmutableTable<String, Integer, DBFunctionSymbol> createNewDialectRegularFunctionTable(TypeFactory typeFactory) {
        DBTypeFactory dbTypeFactory = typeFactory.getDBTypeFactory();
        DBTermType abstractRootDBType = dbTypeFactory.getAbstractRootDBType();
        Table<String, Integer, DBFunctionSymbol> table = HashBasedTable.create(
                createDefaultRegularFunctionTable(typeFactory));

        /*
        ------------------------------------------
        -        Changed Function Symbols        -
        ------------------------------------------
        */

        return ImmutableTable.copyOf(table);
    }

    /*
    ------------------------------------------
    -             Implementations             -
    ------------------------------------------
    */
}
```
In the template above, two code blocks are commented out. We will now look at each of them individually.
# Changed Function Symbols
The regular function table is a dictionary that maps function names and arities to the function symbol instances they correspond to. The abstract base class we inherit from already defines a default function symbol table that includes all commonly used SQL functions. In some cases, however, some of these default function symbols need to be changed. For those cases, new entries can be created in this block to override existing, incorrect entries.
A popular example is the `CURRENT_TIMESTAMP` function symbol. By default, it is created as a simple function of arity 0, serialized as `CURRENT_TIMESTAMP()`. In some dialects, however, the current time is accessed without parentheses, simply as `CURRENT_TIMESTAMP`. In this case, we need to override the function symbol created by the default implementation so that no parentheses are generated:
```java
DBFunctionSymbol nowFunctionSymbol = new WithoutParenthesesSimpleTypedDBFunctionSymbolImpl(
        CURRENT_TIMESTAMP_STR, dbTypeFactory.getDBDateTimestampType(), abstractRootDBType);
table.put(CURRENT_TIMESTAMP_STR, 0, nowFunctionSymbol);
```
The `CURRENT_TIMESTAMP` function is now a `WithoutParenthesesSimpleTypedDBFunctionSymbol`, so when generating SQL queries, Ontop will not include the parentheses. By putting this new function symbol into the table under the key `CURRENT_TIMESTAMP_STR`, we override the default entry that was put into the table by the default method.
# Implementations
A set of serialization methods has been left abstract by the base class. These always have to be implemented for each new dialect.
All such methods share the same structure:
```java
protected String serialize[FUNCTION_NAME](ImmutableList<? extends ImmutableTerm> terms, Function<ImmutableTerm, String> termConverter, TermFactory termFactory) {
    ...
}
```
Here, `terms` is the list of arguments that were passed to the function, `termConverter` is a function that transforms any given term into its SQL representation, and `termFactory` can be used to generate new terms that may be required for some functions.
We will now go through each of them and provide example implementations.
Contains:
This function returns a boolean value indicating whether a given string (argument 2) is contained in another string (argument 1). While some dialects have their own `CONTAINS` function, in others it may have to be implemented through a string search, as in this example:
```java
@Override
protected String serializeContains(ImmutableList<? extends ImmutableTerm> terms, Function<ImmutableTerm, String> termConverter, TermFactory termFactory) {
    return String.format("(POSITION(%s IN %s) > 0)",
            termConverter.apply(terms.get(1)),
            termConverter.apply(terms.get(0)));
}
```
StrBefore:
This function returns the section of a given string (argument 1) before the first appearance of a given search string (argument 2). If the search string is not contained in the base string, an empty string is returned instead. In many dialects, this can be implemented using string search and substring functions.
```java
@Override
protected String serializeStrBefore(ImmutableList<? extends ImmutableTerm> terms, Function<ImmutableTerm, String> termConverter, TermFactory termFactory) {
    String str = termConverter.apply(terms.get(0));
    String before = termConverter.apply(terms.get(1));
    return String.format("SUBSTRING(%s,1,POSITION(%s IN %s)-1)", str, before, str);
}
```
StrAfter:
The opposite of `STRBEFORE`: this function returns the section of a given string (argument 1) after the first appearance of a search string (argument 2). It can be implemented similarly:
```java
@Override
protected String serializeStrAfter(ImmutableList<? extends ImmutableTerm> terms, Function<ImmutableTerm, String> termConverter, TermFactory termFactory) {
    String str = termConverter.apply(terms.get(0));
    String after = termConverter.apply(terms.get(1));
    return String.format("SUBSTRING(%s, IF(POSITION(%s IN %s) != 0, POSITION(%s IN %s) + LENGTH(%s), 0))", str, after, str, after, str, after);
}
```
Tz: This function extracts the timezone part of a given timestamp. An example implementation may look like this:
```java
@Override
protected String serializeTz(ImmutableList<? extends ImmutableTerm> terms, Function<ImmutableTerm, String> termConverter, TermFactory termFactory) {
    String str = termConverter.apply(terms.get(0));
    return String.format("(LPAD(EXTRACT(TIMEZONE_HOUR FROM %s)::text,2,'0') || ':' || LPAD(EXTRACT(TIMEZONE_MINUTE FROM %s)::text,2,'0'))", str, str);
}
```
DateTimeNorm:
This function converts a given timestamp to the standardized ISO 8601 format. An example implementation may look like this:
```java
@Override
protected String serializeDateTimeNorm(ImmutableList<? extends ImmutableTerm> terms, Function<ImmutableTerm, String> termConverter, TermFactory termFactory) {
    return String.format("TO_ISO8601(%s)", termConverter.apply(terms.get(0)));
}
```
Hashing Functions:
Ontop supports a set of hashing functions:
- `MD5`
- `SHA1`
- `SHA256`
- `SHA384`
- `SHA512`
For each of them, a serialization method must be implemented. An example implementation of the `MD5` function may look like this:
```java
@Override
protected String serializeMD5(ImmutableList<? extends ImmutableTerm> terms, Function<ImmutableTerm, String> termConverter, TermFactory termFactory) {
    return String.format("MD5(%s)", termConverter.apply(terms.get(0)));
}
```
Other than the serialization methods, four additional methods must be implemented by the subclass:
```java
@Override
protected String getUUIDNameInDialect() { return "[UUID_FUNCTION_NAME]"; }
```
This method returns the string that represents the name of the UUID generation function in the SQL dialect.
The methods `createNullRejectingDBConcat(int arity)`, `createDBConcatOperator(int arity)`, and `createRegularDBConcat(int arity)` need to be implemented to define the concat function symbols. The `DBConcat` methods define function symbols that represent `CONCAT` function calls of a given arity. Sample implementations for them may look like this:
```java
@Override
protected DBConcatFunctionSymbol createRegularDBConcat(int arity) {
    return new NullToleratingDBConcatFunctionSymbol("CONCAT", arity, dbStringType, abstractRootDBType, false);
}

@Override
protected DBConcatFunctionSymbol createNullRejectingDBConcat(int arity) {
    return new NullRejectingDBConcatFunctionSymbol("CONCAT_NO_NULL", arity, dbStringType, abstractRootDBType, false);
}
```
The `createDBConcatOperator(int arity)` method represents a concat operation executed through a concat operator (often `+` or `||`). It may be implemented like this:
```java
@Override
protected DBConcatFunctionSymbol createDBConcatOperator(int arity) {
    return new NullRejectingDBConcatFunctionSymbol("||", arity, dbStringType, abstractRootDBType, Serializers.getOperatorSerializer("||"));
}
```
# DB Datatype factory
The DB datatype factory declares the hierarchy of DB datatypes used by the DBMS, and specifies their correspondence with datatypes used in the RDF graphs (such as xsd datatypes).
The interface to implement (for SQL dialects) is `SQLDBTypeFactory`, and the default implementation is `DefaultSQLDBTypeFactory`.
For instance, the datatype factory for PostgreSQL within Ontop is the class `PostgreSQLDBTypeFactory`.
A basic template for this implementation is provided here:
```java
public class NewDialectDBTypeFactory extends DefaultSQLDBTypeFactory {

    /*
    ------------------------------------------
    -            Type Definitions            -
    ------------------------------------------
    */

    @AssistedInject
    protected NewDialectDBTypeFactory(@Assisted TermType rootTermType, @Assisted TypeFactory typeFactory) {
        super(createNewDialectTypeMap(rootTermType, typeFactory), createNewDialectCodeMap());
    }

    protected static Map<String, DBTermType> createNewDialectTypeMap(TermType rootTermType, TypeFactory typeFactory) {
        /*
        ------------------------------------------
        -                Type Map                -
        ------------------------------------------
        */
    }

    protected static ImmutableMap<DefaultTypeCode, String> createNewDialectCodeMap() {
        /*
        ------------------------------------------
        -                Code Map                -
        ------------------------------------------
        */
    }

    /*
    ------------------------------------------
    -              Support Flags              -
    ------------------------------------------
    */
}
```
In the template above, four code blocks are commented out. We will now look at each of them individually.
# Type Definitions
In this section, we define a set of `String` constants that represent the names of datatypes in our SQL dialect. These constants are then used in the other code blocks. An excerpt of the type definitions part of an implementation may look like this:
```java
protected static final String BIT_STR = "BIT";
protected static final String INT2_STR = "INT2";
protected static final String INT4_STR = "INT4";
protected static final String INT8_STR = "INT8";
protected static final String FLOAT4_STR = "FLOAT4";
protected static final String FLOAT8_STR = "FLOAT8";
```
# Type Map
In this method, we define the contents of the type map for our dialect. We create `DBTermType` instances for each supported type and store them in a `Map` object that associates their identifiers with the corresponding `DBTermType`. The base class we extend already generates an extensive type map for many popular types, such as `VARCHAR`, `INTEGER`, `REAL`, `DOUBLE`, and more, in the method `createDefaultSQLTypeMap(...)`. New types can be added to the map in the subclass, and, alternatively, existing entries in the map can be overridden here. An example implementation of this method may look like this:
```java
protected static Map<String, DBTermType> createNewDialectTypeMap(TermType rootTermType, TypeFactory typeFactory) {
    TermTypeAncestry rootAncestry = rootTermType.getAncestry();
    Map<String, DBTermType> map = createDefaultSQLTypeMap(rootTermType, typeFactory);
    // New DBTermType instances (built from rootAncestry) can be added to the map here,
    // and existing entries can be replaced for dialect-specific types.
    return map;
}
```
# Code Map
The code map is a dictionary that maps different type codes to their "default type" in the SQL dialect. This includes the following type codes:
- `STRING`
- `HEXBINARY`
- `LARGE_INTEGER`
- `DECIMAL`
- `DOUBLE`
- `BOOLEAN`
- `DATE`
- `TIME`
- `DATETIMESTAMP`
- `GEOMETRY`
- `GEOGRAPHY`
- `ARRAY`
- `JSON`
The default implementation already maps many of these to the name of the corresponding data type. To support additional ones, or to override existing mappings, this method can be implemented. An example implementation may look like this:
```java
protected static ImmutableMap<DefaultTypeCode, String> createNewDialectCodeMap() {
    Map<DefaultTypeCode, String> map = createDefaultSQLCodeMap();
    map.put(DefaultTypeCode.JSON, JSONB_STR);
    return ImmutableMap.copyOf(map);
}
```
# Support Flags
An instance of `DBTypeFactory` supports a set of methods that, when called, return a flag indicating whether a certain SQL feature is supported by the dialect. These methods are:
- `supportsDBGeometryType()`
- `supportsDBGeographyType()`
- `supportsDBDistanceSphere()`
- `supportsJson()`
- `supportsArrayType()`
Any of these methods can be overridden by the `DBTypeFactory` subclass to return a boolean constant indicating whether the feature is supported.
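For instance, a minimal sketch of enabling JSON support in the new dialect's type factory could simply flip the corresponding flag (assuming the dialect actually offers a JSON datatype):
```java
@Override
public boolean supportsJson() {
    // Declare that the dialect provides a JSON datatype, so Ontop can use JSON-related functionality.
    return true;
}
```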
# Other Implementations
The `DBTypeFactory` base class supports further methods that can be overridden in subclasses for fine-tuning. One popular set of such methods are the lexical-value methods. These return a `String` constant that represents the lexical form the corresponding literal may assume. For instance, by default, the method `getDBFalseLexicalValue()` returns the string `"FALSE"`, as this is a common representation of the boolean `false` value in many SQL dialects. To implement the `DBTypeFactory` for a dialect that instead uses the digit `0` for `false` literals, we could override the method like so:
```java
@Override
public String getDBFalseLexicalValue() {
    return "0";
}
```
# Optional implementations
Additional implementations can optionally be provided to replace the default ones.
# Serializer
The serializer dictates the general shape (SELECT-FROM-WHERE) of the SQL query.
The interface to implement is `SelectFromWhereSerializer`, and the default implementation is `DefaultSelectFromWhereSerializer`.
For instance, the serializer for PostgreSQL within Ontop is the class `PostgresSelectFromWhereSerializer`.
# Normalizer
The normalizer addresses limitations of certain DBMSs (such as a non-canonical evaluation of the ORDER BY clause).
The interface to implement is `DialectExtraNormalizer`. Several implementations are already available, some of which are shared by multiple DBMSs.
For instance, the normalizer associated with PostgreSQL within Ontop is `OnlyInPresenceOfDistinctProjectOrderByTermsNormalizer`.
# Metadata provider
The metadata provider specifies how schema and integrity constraints (for instance primary keys) are retrieved from the DBMS.
The interface to implement is `MetadataProvider`, and the default implementation is `DefaultDBMetadataProvider`.
For instance, the metadata provider for PostgreSQL within Ontop is `PostgreSQLDBMetadataProvider`.
# Declaring an implementation
All the implementations mentioned above can be declared in the property file `sql-default.properties` (which can be found in the directory `it/unibz/inf/ontop/injection/` of the `ontop-rdb` module).
A key-value pair must be added for each of these implementations, where the key indicates the type of the implementation (serializer, function symbol factory, etc.), and the value is the fully qualified class name of the implementation.
The naming scheme for the keys is the following.
Let `<driverName>` be the name of the JDBC driver class for the RDBMS (for instance, the JDBC driver for PostgreSQL is `org.postgresql.Driver`).
Then the keys are:
- `<driverName>-serializer` for a serializer
- `<driverName>-symbolFactory` for a DB function symbol factory
- `<driverName>-typeFactory` for a DB datatype factory
- `<driverName>-normalizer` for a normalizer
- `<driverName>-metadataProvider` for a metadata provider
For instance, the key-value pairs declared for PostgreSQL are:
```properties
org.postgresql.Driver-serializer = it.unibz.inf.ontop.generation.serializer.impl.PostgresSelectFromWhereSerializer
org.postgresql.Driver-symbolFactory = it.unibz.inf.ontop.model.term.functionsymbol.db.impl.PostgreSQLDBFunctionSymbolFactory
org.postgresql.Driver-typeFactory = it.unibz.inf.ontop.model.type.impl.PostgreSQLDBTypeFactory
org.postgresql.Driver-normalizer = it.unibz.inf.ontop.generation.normalization.impl.OnlyInPresenceOfDistinctProjectOrderByTermsNormalizer
org.postgresql.Driver-metadataProvider = it.unibz.inf.ontop.dbschema.impl.PostgreSQLDBMetadataProvider
```
A basic set of key-value pairs may be:
```properties
name.of.jdbc.Driver-serializer = it.unibz.inf.ontop.generation.serializer.impl.DefaultSelectFromWhereSerializer
name.of.jdbc.Driver-symbolFactory = it.unibz.inf.ontop.model.term.functionsymbol.db.impl.NewDialectDBFunctionSymbolFactory
name.of.jdbc.Driver-typeFactory = it.unibz.inf.ontop.model.type.impl.NewDialectDBTypeFactory
name.of.jdbc.Driver-normalizer = it.unibz.inf.ontop.generation.normalization.impl.OnlyInPresenceOfDistinctProjectOrderByTermsNormalizer
name.of.jdbc.Driver-metadataProvider = it.unibz.inf.ontop.dbschema.impl.DefaultSchemaCatalogDBMetadataProvider
```
# Preparing and Running the Testcases
Ontop currently uses a set of tests called `lightweight-tests` to test all main Ontop language-support features on minimal databases. A simple way of testing the correctness of newly implemented dialect support is to run the same tests against a database that uses the corresponding dialect.
This consists of the following steps:
- Set up a database with the required testing conditions
- Prepare files that assist the connection to the database
- Implement sub-classes of the abstract lightweight test classes and run them
In the following sections, we will discuss each of these steps.
# Setting up the Database
The exact procedure of the database preparation depends on the database system. In Ontop, we generally use Docker images that reproduce the exact same database state for each test run, so that the different dialects can be tested automatically.
Each directory inside `test/lightweight-tests/lightweight-db-test-images/` corresponds to one Docker image, and many of them contain table definitions in `.sql` files. For instance, the `mysql` directory contains the exact definitions of the three databases that are used for testing: `books`, `dbconstraints`, and `university`.
To prepare the required testing conditions, you can take these `.sql` files and run their queries on your database. Should any of the features not be supported, you will need to either use equivalent alternative features to reach the same final database state, or disable the corresponding test cases.
# Preparing Connection Files
The directory `test/lightweight-tests/src/test/resources/` contains files that are used for setting up an Ontop connection with different databases, including `.obda` files for the VKG creation and `.properties` files for JDBC connections. While the `.obda` files can usually be reused across systems, the `.properties` files must be provided for each database system. The directories `books` and `prof` inside the test resources contain one subdirectory per dialect, and each subdirectory in turn includes one `.properties` file for the JDBC connection. The `dbconstraints` and `university` directories, on the other hand, directly contain `.properties` files for each dialect.
To run all lightweight tests for a new dialect, the corresponding files and directories have to be created for its JDBC connection. In particular, the connection files should be set up in such a way that the database/schema used by the tests is selected as the default database/schema, because the `.obda` files access the tables through relative paths.
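As an illustration, a JDBC connection file for a new dialect could look like the following, assuming the standard Ontop JDBC property keys (`jdbc.url`, `jdbc.user`, `jdbc.password`, `jdbc.driver`); the URL, credentials, and driver class below are placeholders to adapt to your setup:
```properties
jdbc.url = jdbc:newdialect://localhost:12345/books
jdbc.user = test_user
jdbc.password = test_password
jdbc.driver = name.of.jdbc.Driver
```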
# Implementing the Lightweight Test Classes
The directory `test/lightweight-tests/src/test/java/` contains the test cases that have to run successfully for each new dialect. They are defined in the four abstract classes `AbstractBindTestWithFunctions`, `AbstractConstraintTest`, `AbstractDistinctInAggregateTest`, and `AbstractLeftJoinProfTest`.
For each new dialect, a new package has to be created at this location, containing the implementations of sub-classes for each of these abstract base classes.
# BindTestWithFunctions
This test class tests all main SQL functions that are supported by Ontop. It uses the `books` database, as well as data generated directly through the SPARQL `BIND` construct. An example implementation of this class may look like this:
```java
@NewDialectLightweightTest
public class BindWithFunctionsNewDialectTest extends AbstractBindTestWithFunctions {

    private static final String PROPERTIES_FILE = "/books/new-dialect/books-new-dialect.properties";

    @BeforeAll
    public static void before() throws IOException, SQLException {
        initOBDA(OBDA_FILE, OWL_FILE, PROPERTIES_FILE);
    }

    @AfterAll
    public static void after() throws SQLException {
        release();
    }

    /*
    ------------------------------------------
    -           Result Definitions           -
    ------------------------------------------
    */
}
```
NOTE
Using the `@NewDialectLightweightTest` annotation requires defining the annotation in the same package, similarly to how it was defined for other dialects.
The code block "Result Definitions" is commented out in the above example. This code block may contain redefinitions of expected results that are tested in the base class. To do this, override the get[TEST-NAME]ExpectedValues()
method here. An example implementation may look like this:
```java
@Override
protected ImmutableList<String> getConstantIntegerDivideExpectedResults() {
    return ImmutableList.of("\"0.500\"^^xsd:decimal");
}
```
Should any of the functionalities not be supported by the SQL dialect, the corresponding test case may simply be disabled, with an explanation as to why it is not supported. An example may look like this:
@Disabled("New Dialect does not support SHA384")
@Test
@Override
public void testHashSHA384() {
super.testHashSHA384();
}
Some test cases are disabled by default.
# Constraint
This set of test cases checks the correct interpretation of integrity constraints in the data. It uses the `dbconstraints` database. An example implementation may look like this:
```java
@NewDialectLightweightTest
public class ConstraintNewDialectTest extends AbstractConstraintTest {

    private static final String PROPERTIES_FILE = "/dbconstraints/dbconstraints-new-dialect.properties";

    public ConstraintNewDialectTest(String method) {
        super(method, PROPERTIES_FILE);
    }
}
```
Once again, specific tests may be disabled if they are not supported. For database systems that do not support integrity constraints, this class can be ignored.
# DistinctInAggregate
This set of test cases runs queries containing aggregate functions that include the `DISTINCT` keyword. It can be implemented analogously to the `BindWithFunctions` test class; a sketch is shown below.
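A minimal sketch of such a subclass, mirroring the structure of the `BindWithFunctions` example above; the properties path is a placeholder, and `OBDA_FILE` and `OWL_FILE` are assumed to be provided by the abstract base class, as in the earlier example:
```java
@NewDialectLightweightTest
public class DistinctInAggregateNewDialectTest extends AbstractDistinctInAggregateTest {

    // Placeholder: point this to the properties file created for the new dialect.
    private static final String PROPERTIES_FILE = "/path/to/new-dialect.properties";

    @BeforeAll
    public static void before() throws IOException, SQLException {
        initOBDA(OBDA_FILE, OWL_FILE, PROPERTIES_FILE);
    }

    @AfterAll
    public static void after() throws SQLException {
        release();
    }
}
```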
# LeftJoinProf
This set of test cases runs on the `university` database. It accesses items that have to be fetched from the database through `JOIN`s. Furthermore, it tests the optimization procedures employed by Ontop to minimize the required number of joins. Once again, it can be implemented analogously to the `BindWithFunctions` test class.
If the database management system does not support integrity constraints, many of these test cases will fail because the optimizations will not be performed.
# Further Implementations and Troubleshooting
Once the basic features and lightweight test cases have been implemented for a dialect, the tests can be run to determine which features still do not work correctly.
If a test case fails, its corresponding feature is likely not yet supported by the current implementation. In these cases, changes need to be applied to the implemented `DBTypeFactory` or `DBFunctionSymbolFactory` to solve the issue.
In some situations, however, issues may be caused by problems not related to database types or functions. In these cases, one of the following default implementations may also be extended for the new dialect.
NOTE
New implementations of these classes have to be linked to the driver through the `sql-default.properties` file.
# `DefaultSchemaCatalogDBMetadataProvider`, `DefaultSchemaDBMetadataProvider`, or `DefaultDBMetadataProvider`
If the `DBMetadataProvider` assigned to the dialect is not able to perform its expected tasks through its default implementation, it may have to be extended with a new subclass that is specifically tailored to the dialect.
In that case, its methods may be overridden to change the default behaviour. Common targets include:
- `insertIntegrityConstraints(...)`, if the dialect does not support integrity constraints (see the sketch below);
- `getRelationId(...)`, to return the ID of a given relation.
Additionally, the constructor may be implemented to determine the default schema or default catalog names if they cannot be accessed directly through JDBC, or a different `QuotedIDFactory` instance can be passed to the superclass to force Ontop to generate quoted identifiers in a specific way.
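As a purely illustrative sketch, a dialect without integrity constraints might neutralize constraint extraction in its metadata provider subclass; the parameter types shown here are an assumption about the `MetadataProvider` interface and should be checked against the Ontop sources:
```java
@Override
public void insertIntegrityConstraints(NamedRelationDefinition relation, MetadataLookup metadataLookup) {
    // Assumed no-op: the dialect exposes no primary or foreign keys through JDBC metadata,
    // so no integrity constraints are registered for this relation.
}
```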
# SelectFromWhereSerializer
This class is responsible for the serialization of the SQL query frame. Different SQL dialects have different rules as to how the general shape of an SQL query has to be written. For instance, some dialects allow users to combine `LIMIT` and `OFFSET` clauses as `LIMIT <offset>,<limit>`, while others require the form `OFFSET <offset> LIMIT <limit>`. The following shows an implementation of a `SelectFromWhereSerializer` that achieves the latter behaviour:
```java
@Singleton
public class NewDialectSelectFromWhereSerializer extends DefaultSelectFromWhereSerializer implements SelectFromWhereSerializer {

    @Inject
    private NewDialectSelectFromWhereSerializer(TermFactory termFactory) {
        super(new DefaultSQLTermSerializer(termFactory));
    }

    @Override
    public QuerySerialization serialize(SelectFromWhereWithModifiers selectFromWhere, DBParameters dbParameters) {
        return selectFromWhere.acceptVisitor(
                new DefaultRelationVisitingSerializer(dbParameters.getQuotedIDFactory()) {
                    @Override
                    protected String serializeLimitOffset(long limit, long offset, boolean noSortCondition) {
                        return String.format("OFFSET %d LIMIT %d", offset, limit);
                    }
                });
    }
}
```