# Elastic Search
Elastic Search indexes the information in the database for quick retrieval, powering endpoints that search through assets with loosely matching queries and return relevancy-ranked results.
## Indexing the Database using Logstash
Elastic Search keeps an independent index for each asset type in the database. Each index is created and maintained by:
- Creating an initial index
- Populating it based on the information already in the database
- Updating the index with new information periodically, removing old entries if necessary
Because the logic for these steps is very similar across the different assets, we generate the scripts that create and maintain the Elastic Search indices. We use Logstash to process the data from our database and export it to Elastic Search.
This happens through two container services:
`es_logstash_setup`
: Generates the common scripts used by Logstash, and creates the Elastic Search indices if necessary. This is a short-running service that only runs on startup, exiting when it's done.

`logstash`
: Continually monitors the database and updates the Elastic Search indices.
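The init/sync/remove lifecycle can be illustrated with a small, self-contained sketch. Plain Python dictionaries stand in for the database and the Elastic Search index here; none of these names come from the actual codebase:

```python
from datetime import datetime, timedelta

# Toy "database": rows with a last-modified timestamp and a deleted flag.
now = datetime.now()
database = {
    1: {"name": "dataset-a", "modified": now - timedelta(days=2), "deleted": False},
    2: {"name": "dataset-b", "modified": now - timedelta(hours=1), "deleted": False},
    3: {"name": "dataset-c", "modified": now - timedelta(minutes=5), "deleted": True},
}

def init_index(db: dict) -> dict:
    """Populate the index from scratch (the 'init' step)."""
    return {id_: row["name"] for id_, row in db.items() if not row["deleted"]}

def sync_index(index: dict, db: dict, since: datetime) -> None:
    """Apply rows changed since the last sync (the 'sync' step), and
    drop rows deleted in the meantime (the 'rm' step)."""
    for id_, row in db.items():
        if row["modified"] <= since:
            continue
        if row["deleted"]:
            index.pop(id_, None)
        else:
            index[id_] = row["name"]

index = init_index(database)          # deleted rows never enter the index
database[2]["name"] = "dataset-b-v2"  # an update arrives in the database...
database[2]["modified"] = now
sync_index(index, database, since=now - timedelta(seconds=1))
print(index)  # → {1: 'dataset-a', 2: 'dataset-b-v2'}
```

In the real setup, the "since" bookmark is maintained by Logstash itself rather than passed in by hand.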
### Logstash Setup
The `es_logstash_setup` service fulfils two roles: generating Logstash files and creating the Elastic Search indices.

The `src/logstash_setup/generate_logstash_config_files.py` script generates Logstash files based on the templates provided in the `src/logstash_setup/templates` directory. The generated files are placed into subdirectories of the `logstash/config` directory, along with predefined files.
To sync the Elastic Search indices, Logstash requires SQL files that extract the necessary data from the database. These are generated from the `src/logstash_setup/templates/sql_{init|sync|rm}.py` files:

- The `sql_init.py` file defines the query template that selects the data to include in an index when it is populated from scratch.
- The `sql_sync.py` file defines the query template that selects the data that has been updated since the last creation or synchronization, so that the ES index can be updated efficiently.
- The `sql_rm.py` file defines the query template that selects the data that should be removed from the index.
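For illustration only, a sync template in the spirit of `sql_sync.py` could define a parametrised query like the one below. The table and column names here are invented, not taken from the repository; `:sql_last_value` is Logstash's own bookmark parameter for the last value seen by its JDBC input:

```python
# Hypothetical sync-query template: {table} is filled in per asset by the
# generator script, and :sql_last_value is substituted by Logstash at run
# time, so each run fetches only the rows changed since the previous one.
SQL_SYNC_TEMPLATE = """
SELECT {table}.*
FROM {table}
WHERE {table}.date_modified > :sql_last_value
ORDER BY {table}.date_modified
"""

def render(template: str, table: str) -> str:
    """Fill the asset-specific table name into the template."""
    return template.format(table=table)

print(render(SQL_SYNC_TEMPLATE, "case_study"))
```

Rendering the same template once per asset is what keeps the per-asset SQL files consistent with each other.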
The `generate_logstash_config_files.py` script also generates the configuration files Logstash needs to run the sync scripts:

- `config.py`: used to generate `logstash.yml`, the general configuration.
- `init_table.py`: contains the configuration needed to run the queries from `sql_init.py`, and defines them for each asset that needs to be indexed.
- `sync_table.py`: contains the configuration needed to run the queries from the `sql_sync.py` and `sql_rm.py` scripts, and defines them for each asset that needs to be synced.

All generated files contain the preamble defined in `file_header.py`.
The `logstash/config/config` directory additionally contains predefined files used to configure Logstash, such as the JVM options.
### Creating a New Index
To create a new index for an asset supported by the metadata catalogue REST API, you only need to create the corresponding "search router"; more on that below.
## Elastic Search in the Metadata Catalogue
The metadata catalogue provides REST API endpoints for querying Elastic Search in a uniform manner. While Elastic Search could be exposed directly in production, this unified endpoint lets us provide more structure and better automated documentation. It also spares users from having to learn the Elastic Search query format.
### Creating a New Search
To extend Elastic Search to a new asset type, create a search router similar to those in `src/routers/search_routers/`. Simply inherit from the base `SearchRouter` class defined in `src/routers/search_router.py` and define a few properties:
```python
@property
def es_index(self) -> str:
    return "case_study"
```
The `es_index` property defines the name of the index. This is the name by which Elastic Search knows it, and it should match the name of the table in the database.
```python
@property
def resource_name_plural(self) -> str:
    return "case_studies"
```
The `resource_name_plural` property is used to define the path of the REST API endpoint, e.g. `api.aiod.eu/search/case_studies`.
```python
@property
def resource_class(self):
    return CaseStudy
```
The `resource_class` property contains a direct reference to the class it indexes, which is used when returning expanded responses from the ES query ("get all").
```python
@property
def extra_indexed_fields(self) -> set[str]:
    return {"headline", "alternative_headline"}
```
The `extra_indexed_fields` property contains the fields of the entity that should be included in the index, in addition to the `global_indexed_fields` defined in the `SearchRouter` class.
```python
@property
def linked_fields(self) -> set[str]:
    return {
        "alternate_name",
        "application_area",
        "industrial_sector",
        "research_area",
        "scientific_domain",
    }
```
The `linked_fields` property contains the fields of the entity that refer to external tables and should be included in the index.
Once the new `SearchRouter` is created (and added to the router list), the script that generates the Logstash files will automatically include it.
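Putting these pieces together, a new router is little more than a handful of properties. The sketch below is self-contained, so the `SearchRouter` base class and the `CaseStudy` model are reduced to minimal stand-ins (including an invented `global_indexed_fields` set); in the real codebase you would import them from `src/routers/search_router.py` and the model definitions instead:

```python
class CaseStudy:
    """Stand-in for the real model class."""

class SearchRouter:
    """Stand-in for the real base class in src/routers/search_router.py."""
    global_indexed_fields = {"name", "description"}  # illustrative, not the real set

class CaseStudySearchRouter(SearchRouter):
    @property
    def es_index(self) -> str:
        # Must match the database table name.
        return "case_study"

    @property
    def resource_name_plural(self) -> str:
        # Determines the endpoint path, e.g. /search/case_studies.
        return "case_studies"

    @property
    def resource_class(self):
        return CaseStudy

    @property
    def extra_indexed_fields(self) -> set[str]:
        # Indexed on top of the base class's global_indexed_fields.
        return {"headline", "alternative_headline"}

    @property
    def linked_fields(self) -> set[str]:
        # Fields living in external tables that should also be indexed.
        return {
            "alternate_name",
            "application_area",
            "industrial_sector",
            "research_area",
            "scientific_domain",
        }

router = CaseStudySearchRouter()
print(router.es_index, router.resource_name_plural)  # → case_study case_studies
```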
## Configuration
Besides the aforementioned configuration files, the Elastic Search configuration is located at `es/elasticsearch.yml`, and shouldn't need much modification.

Some aspects of both Logstash and Elastic Search are configured through environment variables in the `override.env` file (defaults in `.env`). The most notable of these are the Elastic Search password and the JVM resource options.