# Improve search performance by using a fulltext index

## Problem description
When originally implementing the text search functionality for regesta and index entries, we decided not to use the `FULLTEXT INDEX` feature of Neo4j, but instead to build custom normalized properties and create simple `TEXT` indices for them that are then used in normal `WHERE` clauses. The main motivation behind this decision was to save memory on the server, allowing the regesta database to run in low-resource environments, and to stay in full control of the normalization process.
However, when testing the searches with a dataset that contains not only the department RI XIII but also most of the other departments, the performance of the current approach is not sufficient. A text search with a single keyword on the complete set of regesta (124,989 nodes at the time of testing) takes 8 to 10 seconds on average. Testing the search query with an increased heap size did not yield significant performance improvements.
This average search time is too long, especially since the search functionality is prominently placed at the top of the landing page. When users' first interaction with the database takes around 10 seconds, they will assume that the page is generally slow and will most likely stop using it.
## Solution outline
Neo4j's native approach to implementing fulltext searches is the `FULLTEXT INDEX` feature. Fulltext indices can be created for multiple node labels and properties and can be used for performing complex searches. The feature is internally built on Apache Lucene and therefore tokenizes the properties that the index is created for. This not only allows for fast matching, but also the calculation of a similarity score that can be used to order the results.
When testing our regesta searches with a `FULLTEXT INDEX`, the complete operation typically takes less than 1 second. In addition, it seems possible to slightly improve the performance further by increasing the heap size of Neo4j. Therefore, this approach seems to be the way to go for improving the search performance.
## Index creation
In addition to creating the text indices for the normalized properties in the ri-neo4j-importer, we should also create the fulltext index as part of the import. When creating the index, we need to specify the labels and properties that the index is created for and the analyzer that should be used for tokenizing the properties and building the index.
The following choices should be our first approach:
- Properties: For regesta, instead of indexing the "normal" properties like `summary` and `footnotes`, we should index the normalized properties (`normalizedSummary`, `normalizedFootnotes`, ...). While some analyzers claim to deal with diacritics correctly, it is easier to re-use our already existing and tested normalization. This approach is also more future-proof, as we can simply implement additional requirements via the normalizer plugin.
  - For index entries, we should only index the `normalizedLabels` property to follow the same pattern.
- Analyzer: Since the normalization process is handled via the normalizer plugin, we do not need to rely on the normalization capabilities of an analyzer. In addition, most complex analyzers remove language-specific stopwords from the index. For example, the `german` analyzer removes words like "denn", "daß", "muss", "allem" or "allen". While this behaviour is useful for various search engines and improves performance, our use case requires that users can match full sentences, including these stopwords.
  - Therefore, we should not use an analyzer that removes stopwords. This leaves the `simple` analyzer, which lowercases and tokenizes text into words, but does not remove stopwords.
The following sample command could be used to create the index for the regesta:
```cypher
CREATE FULLTEXT INDEX regestaSearchIndex IF NOT EXISTS
FOR (r:Regesta) ON EACH [r.normalizedSummary, r.normalizedIdentifier, r.normalizedFootnotes, r.normalizedArchivalHistory, r.normalizedPlaceOfIssue, r.normalizedOrigPlaceOfIssue, r.normalizedCommentary]
OPTIONS {
  indexConfig: {
    `fulltext.analyzer`: 'simple',
    `fulltext.eventually_consistent`: false
  }
}
```
## Search queries
As fulltext indices cannot be used implicitly in `WHERE` clauses, we need to rewrite our search queries to use the `db.index.fulltext.queryNodes()` procedure for retrieving nodes instead. It takes the name of an index and a Lucene query as its input and returns the list of nodes that were found, together with their similarity scores.
A simple search for a search term looks like this:
```cypher
CALL db.index.fulltext.queryNodes("regestaSearchIndex", "aachen") YIELD node AS r, score
RETURN r.identifier LIMIT 50
```
To implement the special requirements of our text search feature, several additional steps are needed when building the query.
### Exact matching
By default, the fulltext index query will also return nodes that do not fully match the search term but are merely similar to it. For our use case, we only want exact matches. Therefore, the search term needs to be wrapped in quotes, as shown in the following example:
```cypher
CALL db.index.fulltext.queryNodes("regestaSearchIndex", '"aachen"') YIELD node AS r, score
RETURN r.identifier LIMIT 50
```
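A small helper on the application side could take care of this quoting. The following TypeScript sketch is an assumption on our part (the function name and the escaping of backslashes and embedded quotes are not part of the current implementation); it wraps the term in a Lucene phrase and escapes characters that could break out of it:

```typescript
// Hypothetical helper: wraps a user-supplied term in a Lucene phrase query
// for exact matching. Escaping backslashes and double quotes is an
// assumption to keep user input from breaking out of the phrase.
function buildExactMatchTerm(searchTerm: string): string {
  const escaped = searchTerm
    .replace(/\\/g, "\\\\") // escape backslashes first
    .replace(/"/g, '\\"');  // then escape embedded double quotes
  return `"${escaped}"`;
}

buildExactMatchTerm("aachen"); // → '"aachen"'
```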
### Filtering by selective properties
Since users can choose to filter only by certain properties, we cannot query the full index on every request. Lucene allows limiting a match to a specific property by using the `<property>:<searchTerm>` syntax. As this syntax only applies to a single property, multiple matchers need to be combined with `OR`. When all properties are selected, the following query could be used:
```cypher
CALL db.index.fulltext.queryNodes("regestaSearchIndex", 'normalizedIdentifier:"aachen" OR normalizedSummary:"aachen" OR normalizedFootnotes:"aachen" OR normalizedArchivalHistory:"aachen" OR normalizedPlaceOfIssue:"aachen" OR normalizedOrigPlaceOfIssue:"aachen" OR normalizedCommentary:"aachen"')
YIELD node AS r, score
RETURN r.identifier LIMIT 50
```
Building this `OR` combination should be done by the `buildSearchTermCypherPredicate` function as in the current implementation (although it might be renamed to `buildSearchTermLucenePredicate`).
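Building this combination in application code could look like the following sketch (only the function name is taken from the renaming proposal; the signature and the `"searchTerm"` placeholder convention are assumptions based on the examples in this document):

```typescript
// Hypothetical sketch of buildSearchTermLucenePredicate: builds the OR
// combination of per-property phrase matchers. The "searchTerm" placeholder
// is later substituted via apoc.text.replace.
function buildSearchTermLucenePredicate(properties: string[]): string {
  const matchers = properties.map((p) => `${p}:"searchTerm"`);
  return `(${matchers.join(" OR ")})`;
}

// Example: the user selected only summary and footnotes.
buildSearchTermLucenePredicate(["normalizedSummary", "normalizedFootnotes"]);
// → '(normalizedSummary:"searchTerm" OR normalizedFootnotes:"searchTerm")'
```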
### Combining multiple search terms
Lucene supports `AND`, `OR` and parentheses for scopes, just like Cypher. When a user specifies a combination of search terms, we should build the query accordingly: the property-matching clause described in the previous section is built for each search term, wrapped in its own scope and concatenated with `AND`/`OR`. To achieve this, the currently existing `buildCombinedRegestaSearchTermPredicate` might be sufficient.
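As a sketch of this combination step (function and type names are hypothetical; the per-term predicates are passed in already built):

```typescript
// Hypothetical sketch: wraps each search term's predicate in its own scope
// and joins the scopes with the operator the user selected.
type Combinator = "AND" | "OR";

function combineTermPredicates(predicates: string[], op: Combinator): string {
  return predicates.map((p) => `(${p})`).join(` ${op} `);
}

combineTermPredicates(
  ['normalizedSummary:"münchen"', 'normalizedSummary:"aachen"'],
  "AND"
);
// → '(normalizedSummary:"münchen") AND (normalizedSummary:"aachen")'
```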
### Normalizing the search term
Since we build the index from the normalized properties, the search term has to be normalized in the same way. As the Neo4j documentation does not mention any way to use parameters/arguments inside the Lucene query string, we need to use `apoc.text.replace` to insert the normalized search term into the Lucene query. This can be done as in the following example:
```cypher
CALL db.index.fulltext.queryNodes("regestaSearchIndex",
  apoc.text.replace('(normalizedIdentifier:"searchTerm" OR normalizedSummary:"searchTerm" OR normalizedFootnotes:"searchTerm" OR normalizedArchivalHistory:"searchTerm" OR normalizedPlaceOfIssue:"searchTerm" OR normalizedOrigPlaceOfIssue:"searchTerm" OR normalizedCommentary:"searchTerm")', "searchTerm", janek64.normalizer.normalize("münchen"))
) YIELD node AS r, score
RETURN r.identifier LIMIT 50
```
### Filtering by dates, departments and volumes
Finally, users are able to specify a narrowed date range (minimum and maximum date) as well as the department number and volume numbers by which the regesta should be filtered. While it would be more efficient to filter the regesta first and then apply the fulltext search to the subset, the procedure does not support that.
Instead, we can filter the results yielded by the procedure, as shown in this example (a full query implementing all requirements):
```cypher
CALL db.index.fulltext.queryNodes("regestaSearchIndex",
  apoc.text.replace('(normalizedIdentifier:"searchTerm" OR normalizedSummary:"searchTerm" OR normalizedFootnotes:"searchTerm" OR normalizedArchivalHistory:"searchTerm" OR normalizedPlaceOfIssue:"searchTerm" OR normalizedOrigPlaceOfIssue:"searchTerm" OR normalizedCommentary:"searchTerm")', "searchTerm", janek64.normalizer.normalize("münchen"))
) YIELD node AS r, score
WHERE r.departmentNr = $departmentNr
  AND r.volumeNr = $volumeNr
  AND (r.isoStartDate >= date($minDate) OR r.isoEndDate >= date($minDate))
  AND (r.isoStartDate <= date($maxDate) OR r.isoEndDate <= date($maxDate))
RETURN r.identifier LIMIT 50
```
For the department numbers and volume numbers, it would also be possible to include them in the index itself and optionally add two predicates to the Lucene query that perform an exact match on the respective value (these predicates need to be concatenated with the rest of the query via `AND`). Note that fulltext indices only cover string properties, so the numbers would have to be indexed as strings. As this can slightly improve the search performance, we should consider using this improved approach.