We constantly receive questions: What search engine do you use internally? How do you collect files? Why do I need Docker to run Ambar? This post answers these questions and clarifies how Ambar works under the hood.
When we were designing Ambar's architecture, we kept the following goals in mind:
- Ambar must have ultra-fast search and provide the same search experience as Google Search does
- All the text from documents (including their contents) must be extracted and indexed
- The architecture must be scalable: Ambar must be usable in a wide range of environments, from a laptop to a cluster, with no changes in code
- Easy distribution: users must be able to install and update Ambar with just a few commands
You can't reach any of these goals without great technologies; below I describe how we achieved each of them.
Almost every full-text search engine nowadays is built around Apache Lucene. The most programmer-friendly and popular Lucene wrapper is ElasticSearch: it supports hundreds of plugins, can be highly customized, and provides an excellent REST API and a scalable architecture. So we were pretty confident about using ElasticSearch as our search engine. Later on we discovered some nasty things about it: in particular, it's quite difficult to set up ElasticSearch mappings correctly so that it performs well with large documents. We've already posted on this (Making ElasticSearch Perform Well with Large Text Fields, Highlighting Large Documents in ElasticSearch), and more posts are coming. Nevertheless, we stuck with ElasticSearch and have never regretted it. With a properly configured ElasticSearch engine, Ambar responds to your queries within milliseconds.
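To make the mapping problem concrete, here is a minimal sketch of an index mapping tuned for large text fields. The field names (`content`, `file_name`) are hypothetical illustrations, not Ambar's actual mapping; the point is that storing term vectors with positions and offsets lets ElasticSearch's fast vector highlighter work on large fields without re-analyzing the whole document:

```python
import json

# A hedged sketch of an ElasticSearch index mapping for large documents.
# Field names ("file_name", "content") are illustrative, not Ambar's real schema.
mapping = {
    "mappings": {
        "properties": {
            "file_name": {"type": "keyword"},
            "content": {
                "type": "text",
                # Storing term vectors with positions and offsets enables the
                # fast vector highlighter on large fields at the cost of a
                # bigger index.
                "term_vector": "with_positions_offsets",
            },
        }
    }
}

# This JSON body would be sent to ElasticSearch when creating the index.
print(json.dumps(mapping, indent=2))
```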
There is a zoo of different formats and encodings out there, from txt files in DOS encodings to PDFs with scanned images inside. Our goal was to gracefully handle each of these types. In open source you can find myriads of libraries that can open and extract text from one strictly defined format and encoding, but there is only one library that can handle pretty much everything: Apache Tika. Its main purpose is to unify all the open-source content extraction libraries into an easy-to-use, all-in-one system. As Tika tries to embrace everything, it lacks the ability to fine-tune its modules; that's why we extract and OCR PDFs in our own way with the help of Apache PDFBox. For OCR we use Tesseract with proper image preprocessing.
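To illustrate just the encoding part of the problem, here is a toy decoding cascade for text files, a sketch under the assumption of a fixed candidate list (cp866 is a common DOS code page). A production extractor would use real charset detection rather than a hardcoded order:

```python
def decode_text(raw: bytes) -> str:
    """Try a cascade of encodings, strictest first.

    The candidate list is purely illustrative; in practice a fixed cascade
    misdetects legacy code pages, so real pipelines use charset detection.
    """
    for encoding in ("utf-8", "cp1251", "cp866"):  # cp866 is a DOS encoding
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with replacement characters rather than fail
    return raw.decode("utf-8", errors="replace")

print(decode_text(b"hello"))
print(decode_text("привет".encode("utf-8")))
```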
When we discovered Ingest Attachment plugin for ElasticSearch, we tried it first, and quickly realized that it's not the best choice for a production system. You can find the details in Ingest Attachment Plugin for ElasticSearch: Should You Use It? post.
Finally, we created our own text extraction pipeline in Python. There are a lot of details, so we'll post on it later.
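Until that post is out, the core idea can be sketched as a dispatcher that routes each file to a format-specific extractor. All names below are hypothetical; the real pipeline also handles OCR, parser timeouts, and retries:

```python
import tempfile
from pathlib import Path

def extract_plain_text(path: Path) -> str:
    """Read a plain-text file, replacing undecodable bytes."""
    return path.read_text(errors="replace")

def extract_pdf(path: Path) -> str:
    # Placeholder: in the real pipeline, PDFs go through Apache PDFBox and,
    # for scanned pages, Tesseract OCR.
    raise NotImplementedError("PDF extraction handled by PDFBox/Tesseract")

# Map file extensions to extractor functions (illustrative subset)
EXTRACTORS = {
    ".txt": extract_plain_text,
    ".md": extract_plain_text,
    ".pdf": extract_pdf,
}

def extract(path: Path) -> str:
    """Dispatch a file to the extractor registered for its extension."""
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise ValueError(f"unsupported file type: {path.suffix}")
    return extractor(path)

# Quick demonstration with a temporary text file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello ambar")
    sample = Path(f.name)
print(extract(sample))
```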
Our customers are very different: some use Ambar Community Edition to make their small office paperless, while others index terabytes of documents with Ambar Enterprise Edition to analyze them and make the right business decisions. To make Ambar fit any client's needs, we built it on a scalable architecture.
Ambar's core components are:
- Crawlers (modules that collect files from different sources)
- Pipelines (modules that extract text from documents)
- ElasticSearch (full-text search engine)
- WebApi (the core module that ties everything together; it handles every request from every module and provides a REST API for end users)
- RabbitMQ (message queue)
- Redis (main cache for security data, metadata and other stuff)
- MongoDB (settings and miscellaneous database, plus GridFS to store the files' source binary data)
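To show how these components interact, here is a toy simulation of the crawler → queue → pipeline flow, using Python's in-process `queue.Queue` as a stand-in for RabbitMQ and a plain dict as a stand-in for ElasticSearch. All function names are illustrative:

```python
import queue

# Stand-in for RabbitMQ: crawlers publish discovered files as tasks,
# pipelines consume them, extract text, and hand the result to the indexer.
tasks = queue.Queue()

def crawler(files):
    """Publish one task per discovered file (stand-in for a real crawler)."""
    for name, raw in files:
        tasks.put({"name": name, "raw": raw})

def pipeline(index):
    """Consume tasks, 'extract' text, and index it."""
    while not tasks.empty():
        task = tasks.get()
        text = task["raw"].decode("utf-8", errors="replace")  # extraction stub
        index[task["name"]] = text  # stand-in for an ElasticSearch insert

index = {}
crawler([("readme.txt", b"hello ambar")])
pipeline(index)
print(index)  # {'readme.txt': 'hello ambar'}
```

Because the components only talk through the queue, adding capacity means starting more pipeline consumers; no crawler or indexer code changes.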
All the components listed above run in Docker containers and can easily be scaled horizontally with just a few changes in the config.json file. For highly loaded setups we can provide a Docker Swarm configuration.
Easily deploy Ambar with a single docker-compose file.
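As a rough illustration of what such a deployment looks like, here is a minimal docker-compose sketch. Service names and images are hypothetical placeholders, not the actual Ambar compose file:

```yaml
# Illustrative sketch only - images and service names are placeholders,
# not Ambar's real docker-compose configuration.
version: "2"
services:
  elasticsearch:
    image: elasticsearch:5
  rabbitmq:
    image: rabbitmq:3
  redis:
    image: redis:3
  mongodb:
    image: mongo:3
  webapi:
    image: example/ambar-webapi   # hypothetical image name
    depends_on:
      - elasticsearch
      - rabbitmq
      - redis
      - mongodb
```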