Installing OpenTapioca

This software is a Python web service that requires Solr.

Installing Solr

It relies on a recent new feature of Solr (since 7.4.0), which was previously available as an external plugin, SolrTextTagger. If you cannot use a recent Solr version, it is possible to use older versions with the plugin installed: this will require changing the class names in the Solr configs (in the configset directory).

Install Solr 7.4.0 or above.

OpenTapioca requires that Solr runs in Cloud mode, so you can start it as follows:

bin/solr start -c -m 4g

The memory available to Solr (here 4 GB) will determine how many indexing operations you can run in parallel (searching is cheap).

In its Cloud mode, Solr reads the configuration for its indices from so-called “configsets” which gouvern the configuration of multiple collections. OpenTapioca comes with the appropriate configsets for its collections and the default one is called “tapioca”. You need to upload it to Solr before indexing any data, as follows:

bin/solr zk -upconfig -z localhost:9983 -n tapioca -d configsets/tapioca

Custom analyzers

Some profiles require custom Solr analyzers and tokenizers. For instance the Twitter profile can be used to index Twitter usernames and hashtags as labels, which is useful to annotate mentions in Twitter feeds. This requires a special tokenizer which handles these tokens appropriately. This tokenizer is provided as a Solr plugin in the plugins directory. It can be installed by adding this jar in the server/solr/lib directory of your Solr instance (the lib subfolder needs to be created first).

Installing Python dependencies

OpenTapioca is known to work with Python 3.6, and offers a command-line interface to manipulate Wikidata dumps and train classifiers from datasets.

In a Virtualenv, do pip install -r requirements.txt to install the Python dependencies, and python setup.py install to install the CLI in your PATH.

When developing OpenTapioca, you can use pip install -e . to install the CLI from the local files, so that your changes on the source code are directly reflected in the CLI, without the need to run python setup.py install every time you change something.