I have recently built an "Open Government" service that takes all the documents from the official Croatian government gazette (which, among other things, publishes laws, changes to laws, and decisions of the Constitutional Court) and indexes them, offering two new services: better full-text search, and a "push" approach that lets users subscribe to arbitrary search queries and get notified when new documents matching those queries are published.
Though these documents are pro forma published online on the gazette's web page, the form in which they are published is not readily searchable and certainly doesn't offer this proactive, "push" approach.
Among the technically interesting bits, for me, was that I did it all on a Raspberry Pi, using completely standard, "large" development tools and techniques like Django and PostgreSQL. It turns out that the 512 MB Raspberry Pi can manage this, but barely. The major bottleneck is the limited memory, which prevents effective caching by the database, forcing even slightly infrequent queries to read data from the SD card. Though PostgreSQL's GiST index on the data fits comfortably in RAM (being < 200 MiB), the database still needs to hit the main table data for functions such as ranking - and the main table is another 300 MiB in size.
But still - the thing works. The Raspberry Pi is slow (a 700 MHz ARM from 2010), but with clever caching and optimization wherever I could apply it, I've managed to get around 10 requests/s from this setup in the best case.
This got me thinking: there has to be a better way, and a more efficient search server. My app doesn't store the documents in the database but on the file system, so there should be a way to hold just the index in the database rather than the actual document contents. The current solution uses PostgreSQL, which, as a "proper" SQL database, stores both the full-text data and an index on it, duplicating the storage requirements and causing queries to hit both the index and the table storage.
The most popular search engine these days seems to be Lucene, but it's a Java library (typically served through the Solr servlet), so that disqualifies it for this purpose. There's a C++ "reimplementation" called CLucene, but after some digging around it doesn't seem to be what I need: both it and Lucene are libraries meant for embedding into larger projects. Sphinx requires a SQL database backend and seems to be more of an enterprise, multi-data-source search engine than the simple full-text search I need. A long time ago I used Swish-E, but that seems to have been abandoned since 2009, and it isn't actually a server but a command-line query tool (whose results can be wrapped in a server, though).
Apparently, I'll have to do things myself.
The Needle Search Server
This is something I'll be working on occasionally, in my free time. At the start I thought about making it an opportunity to learn a new language. I've been following the development of Rust for the last couple of weeks and it seems interesting, but it's in a huge state of flux right now, with syntax and semantics apparently changing on a weekly basis. I feel that Go and D don't offer enough substantial new semantics to be worth learning (for me), so I'll do this in Ye Olde C++.
Contrasted to my last big project of the sort, Bullet Cache, Needle will be mostly a "compositing" project, not about inventing new algorithms. Here's what I intend to do:
- Make it a FastCGI server as the default interface, so it can be used with arbitrary web servers, which can then take care of security, authentication, etc. The server will provide a nice, pragmatic REST API, using JSON for transporting structured data.
- The server will require that documents are submitted as JSON dicts but will only care about two keys: "id" and "content". The "id" needs to be unique, and the "content" needs to be clean text data (stripped of HTML tags, etc.). Any other fields will be treated as opaque metadata, preserved and returned with the document on search operations.
- The server, at least in the first version, will not store the verbatim contents of the "content" field. Internally, the only thing it will maintain will be the index. Clients can store their own copies of the document data and reference it with the unique "id". Anyway, in my experience, the data usually comes in a non-clean format (e.g. HTML, PDF) and I don't want to deal with that issue.
- At first, I plan to implement simple search query syntax similar to that of PostgreSQL's tsearch2.
- I intend to use Google's LevelDB for the backend data storage, JsonCpp for JSON operations, libfcgi for the FastCGI protocol stuff, and Boost for generic algorithms and structures.
- I will offer word stemming in English and Croatian (because those are the only ones I need right now).
- The server will be multithreaded.
- Very likely, the server will support multi-master clustering for load balancing and reliability.
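To make the "index-only" idea above concrete, here is a toy sketch in C++. This is an illustration, not Needle's actual code: the class and method names are hypothetical, tokenization is naive whitespace splitting with no stemming, and a real version would persist the postings to LevelDB rather than keep them in a std::map.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: only the index is kept, never the document text.
class InvertedIndex {
public:
    // Index a document: map each lowercased token to the document's id.
    void add(const std::string& id, const std::string& content) {
        std::istringstream in(content);
        std::string word;
        while (in >> word) {
            std::transform(word.begin(), word.end(), word.begin(),
                           [](unsigned char c) { return std::tolower(c); });
            postings_[word].insert(id);
        }
    }

    // AND query: return ids of documents containing every given term.
    std::set<std::string> search(const std::vector<std::string>& terms) const {
        std::set<std::string> result;
        bool first = true;
        for (const auto& t : terms) {
            auto it = postings_.find(t);
            std::set<std::string> ids =
                (it == postings_.end()) ? std::set<std::string>{} : it->second;
            if (first) {
                result = ids;
                first = false;
            } else {
                std::set<std::string> tmp;
                std::set_intersection(result.begin(), result.end(),
                                      ids.begin(), ids.end(),
                                      std::inserter(tmp, tmp.begin()));
                result = tmp;
            }
        }
        return result;
    }

private:
    // term -> set of document ids
    std::map<std::string, std::set<std::string>> postings_;
};
```

Since clients keep their own copies of the documents, the ids returned by a query are all the server ever needs to hand back.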
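As for the LevelDB backend, one plausible (purely hypothetical) layout is one key per term, with the value holding that term's serialized posting list. In the sketch below a std::map stands in for the leveldb::DB handle so the example stays self-contained; the "term:" key prefix and the newline-separated value encoding are my own assumptions, not a decided design.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for a leveldb::DB handle, to keep the sketch self-contained.
using KVStore = std::map<std::string, std::string>;

// Append a document id to a term's posting list (read-modify-write,
// which is also how an update would work against LevelDB, since a
// key/value store has no in-place append).
void add_posting(KVStore& db, const std::string& term,
                 const std::string& doc_id) {
    std::string& value = db["term:" + term];
    if (!value.empty()) value += '\n';
    value += doc_id;
}

// Fetch the posting list for a term by splitting the stored value.
std::vector<std::string> get_postings(const KVStore& db,
                                      const std::string& term) {
    std::vector<std::string> ids;
    auto it = db.find("term:" + term);
    if (it == db.end()) return ids;
    std::istringstream in(it->second);
    std::string id;
    while (std::getline(in, id)) ids.push_back(id);
    return ids;
}
```

Because LevelDB keeps keys sorted, a scheme like this would also allow prefix scans over all indexed terms, which could come in handy later.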
You can track the progress of the project on my BitBucket repo. Any suggestions are welcome!