Arrow of time

The Needle Search Server - alpha

I've written before about my Needle light-weight full-text search server. To recap: it's a full-text search server written in C++ with a FastCGI interface, using Google's LevelDB for storage, and with a pure REST API. It's available at BitBucket if you want to test it yourself!

As these things go, finding the time to work on Needle took a lot more effort than I expected, but I'm managing it here and there. Most importantly, the use case I need it for (hosting a searchable and subscribable index of Croatia's state gazette on a Raspberry Pi) still exists and needs Needle to progress.

In the last few months I've finalized most of Needle's internals so it's now actually usable! The most important thing it lacks right now is some kind of smart search query syntax (currently, it searches for all documents containing any of the given words, i.e. it performs OR). With the infrastructure I have in place it should be reasonably easy to implement logical operators and phrase searching (i.e. words near each other), with ranking.
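To illustrate the difference between the current OR behaviour and a possible AND operator, here is a minimal Python sketch over a toy in-memory inverted index. It is not Needle's implementation (Needle is written in C++ and stores its data in LevelDB); the function names and the sample documents are made up for the example, and there is no stemming or ranking.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_or(index, words):
    """Documents containing ANY of the given words (OR semantics)."""
    result = set()
    for word in words:
        result |= index.get(word.lower(), set())
    return result

def search_and(index, words):
    """Documents containing ALL of the given words (AND semantics)."""
    sets = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*sets) if sets else set()

# Hypothetical sample documents, for illustration only.
docs = {
    "doc_a": "full text search server",
    "doc_b": "search on a raspberry pi",
    "doc_c": "state gazette index",
}
index = build_index(docs)
```

With this toy data, search_or(index, ["search", "gazette"]) matches all three documents, while search_and(index, ["search", "server"]) matches only doc_a.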

The full list of completed features is:

  • A pure FastCGI server in C++ with enough tweaks to run under Nginx, Apache and Lighttpd, with a REST API
  • The ability to create (and use) multiple, arbitrary databases (each database is a collection of documents), and up to 4 billion different documents stored in the database
  • Word indexing uses simple stemming, enabling the search of similar word forms. Currently, stemmers for English and Croatian are provided
  • Text document import (indexing) is completely implemented
  • Experimental support for importing JSON dictionaries that describe documents with multiple parts, each of which can have a different ranking (e.g. title, abstract, body)

To run Needle, you need to:

  1. Obtain the source from BitBucket and compile it
  2. Create a config file in /etc as described below
  3. Configure a web server to interface with Needle's FastCGI server
  4. Run the created executable

If everything goes OK, you should be able to immediately run some REST API operations on the server.

Building Needle

To build Needle, your system needs the following development libraries (the names below are valid for Ubuntu):

  • libfcgi-dev
  • libjsoncpp-dev
  • libsnappy-dev
  • libboost1.54-all-dev

If all the dependencies are installed, needled can be compiled simply by running make.

The configuration file

The configuration file is named /etc/needle.conf.json on Linux, and it should contain content like the following:

{
      // Basic server configuration
      "server": {
              // Verbosity level; 0=quiet, 3=most verbose
              "verbosity": 1,
              // FastCGI socket path; use ":12345" for TCP port instead of Unix socket
              "socket_path": "/tmp/needle.sock"
      },
      // Database configuration
      "db": {
              // Parent directory for all databases
              "dir": "/tmp"
      },
      // Word manipulation
      "words": {
              // Configure stemmer ("none" | "croatian" | "english")
              "stemmer": "croatian"
      }
}

The server.socket_path entry governs where and how the FastCGI socket will be created. It can be a path to a Unix socket file, or a value like ":9999" for a TCP socket. The db.dir entry specifies the top-level directory under which Needle will create its databases. Note that the user running the Needle server needs permission to write into this directory. Finally, the words.stemmer entry specifies which stemmer to use when indexing documents. Currently, this is a per-server setting and influences all databases and all documents.
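If you want to sanity-check a config file before starting the server, something like the following Python sketch could help. It is not part of Needle: load_config and socket_kind are hypothetical helpers, and the comment stripping assumes that comments only appear on lines of their own, as in the example above (Python's json module rejects // comments otherwise).

```python
import json

# A shortened copy of the example configuration, for illustration only.
EXAMPLE = '''{
      // Basic server configuration
      "server": {
              "verbosity": 1,
              // FastCGI socket path
              "socket_path": "/tmp/needle.sock"
      },
      "db": { "dir": "/tmp" },
      "words": { "stemmer": "croatian" }
}'''

def load_config(text):
    """Drop whole-line // comments, then parse the rest as plain JSON."""
    lines = [ln for ln in text.splitlines()
             if not ln.lstrip().startswith("//")]
    return json.loads("\n".join(lines))

def socket_kind(socket_path):
    """Classify socket_path as a TCP port (":12345") or a Unix socket path."""
    if socket_path.startswith(":"):
        return ("tcp", int(socket_path[1:]))
    return ("unix", socket_path)

config = load_config(EXAMPLE)
kind = socket_kind(config["server"]["socket_path"])
```

For the example configuration, kind describes a Unix socket at /tmp/needle.sock, while socket_kind(":9999") would describe TCP port 9999.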

Configuring Nginx

Nginx is the simplest web server to configure for the Needle FastCGI server, though additional configuration examples are available in the README file and in the doc directory. You should simply add lines like the following to your Nginx configuration file (e.g. in one of the sites-enabled files):

location /needle {
    include         fastcgi_params;
    root            /needle;
    fastcgi_pass    unix:/tmp/needle.sock;
}

Starting Needle

You should run the needled executable with the -d argument to make it stay in the foreground. A small number of critical messages will be written to the standard output, while the majority of messages will be passed to syslog.

Testing Needle

A small number of helper Python (2.7) scripts are provided in the utils directory.

  • To create a Needle database named test, run ./ncreatedb test
  • To import a text document into the test database, run ./nimporttxt test doc_id filename.txt. The doc_id is the unique document identifier. It can be almost anything you like: it's just an opaque string which must match the regex ([a-zA-Z][a-zA-Z0-9._$%]+).
  • To search for a word, run ./nsearch test word
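To check a document identifier before importing, the regex above can be applied directly. This is a small Python sketch, assuming the whole identifier has to match the pattern; is_valid_doc_id is a hypothetical helper, not one of the provided scripts.

```python
import re

# The doc_id pattern quoted above: a letter followed by one or more
# letters, digits, or any of the characters . _ $ %
DOC_ID_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9._$%]+")

def is_valid_doc_id(doc_id):
    """Return True if the whole string matches the doc_id pattern."""
    return DOC_ID_RE.fullmatch(doc_id) is not None
```

Note that, read literally, the pattern requires at least two characters and a leading letter, so "a" and "42abc" would both be rejected, as would identifiers containing a hyphen.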

Of course, Needle is not meant to be used with such helper scripts, but directly from your applications by accessing its REST API. Each of the scripts prints out the URL it is using to perform its task if it's run with the -v argument.

With this option, you can easily see that, e.g. creating a database means issuing a GET request to a URL in the form of http://example.com/needle/+create/dbname, that importing a text document means a POST request to http://example.com/needle/dbname/doc_id, that searching means a GET request to http://example.com/needle/dbname?q=foo, etc.
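As a rough idea of what calling the API from an application might look like, here is a Python sketch built only on the three URL forms mentioned above. Several things are assumptions: the host is the example.com placeholder, the POST body is assumed to be the raw document text, names are percent-encoded on the assumption that the server decodes them, and the responses are returned as raw bytes because their format is not described here. The helper names are made up.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

BASE = "http://example.com/needle"  # placeholder; use your own server's URL

def create_url(db):
    """URL for creating a database (GET)."""
    return f"{BASE}/+create/{quote(db, safe='')}"

def import_url(db, doc_id):
    """URL for importing a text document (POST)."""
    return f"{BASE}/{quote(db, safe='')}/{quote(doc_id, safe='')}"

def search_url(db, query):
    """URL for searching a database (GET)."""
    return f"{BASE}/{quote(db, safe='')}?{urlencode({'q': query})}"

def create_db(db):
    with urlopen(create_url(db)) as resp:
        return resp.read()

def import_text(db, doc_id, text):
    # Assumption: the document text is sent as the raw request body.
    req = Request(import_url(db, doc_id),
                  data=text.encode("utf-8"), method="POST")
    with urlopen(req) as resp:
        return resp.read()

def search(db, query):
    with urlopen(search_url(db, query)) as resp:
        return resp.read()
```

Running the helper scripts with -v remains the most reliable way to see the exact requests the server expects.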

Have fun, and I would very much welcome your feedback!

