Since 2010 we've been creating bespoke security-related software for large enterprises. Almost every enterprise has a security department responsible for background checks on potential customers and partners, and these checks involve regularly searching for information about the subject in question. The search runs over hundreds of millions of documents of different kinds and from different sources: reports, articles, financial data, text notes, or scanned contracts in PDF, DOC, XLS, TXT, and other formats. Fast, efficient search over huge amounts of heterogeneous documents is a substantial issue for any business, especially for large enterprises.
We've gone a long way from using dtSearch to developing our own solution. This post describes our steps and sheds some light on why we eventually stopped using dtSearch and created a new solution.
We've been implementing the background-checking process with our own custom solutions, but as a full-text search engine we used dtSearch for quite a long time (from 2010 to 2016). Here's why:
- Back in 2010 we were choosing between Copernic, Cross and Archivarius (Russian products), dtSearch and some other exotic solutions
- The fastest one turned out to be dtSearch, especially with large amounts of data
- dtSearch had the most sophisticated and flexible query syntax, which allowed us to fine-tune our queries and make the search more efficient
- dtSearch had (and still has) a C# API that we used to integrate our systems with its engine. Not the best option, though it was ok back in 2010
What Happened Next
Years passed, our systems became more and more advanced, and one day dtSearch turned into the slowest and most problematic module:
- As data volumes kept growing, dtSearch performance kept degrading, and by the end of 2016 some queries took up to 5 minutes to execute, which was absolutely unacceptable
- dtSearch does not do OCR, and we had a significant rise in the number of scanned documents, so we were missing important data
- dtSearch does not handle DOS-encoded files well; for example, it can't index CP866-encoded files properly
- dtSearch does not always tokenize phrases, dates and phone numbers properly, so you can miss important info when searching, for example, for a phone number or a complex name
- Since we were moving from an ASP.NET MVC/C#/MSSQL stack towards the more sophisticated React/Node.js/Python/ElasticSearch/MongoDB, integrating with dtSearch via its C# API was becoming tricky and unstable
- We had to keep a dedicated Windows server for the dtSearch indexer
- dtSearch can't be clustered, which is essential when you handle huge amounts of data, so we had to keep a single extra-large VM exclusively for dtSearch
These were the most painful dtSearch issues; there were actually more, but the rest were rather minor in comparison.
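The tokenization problem is easy to reproduce. A naive tokenizer that splits on every non-alphanumeric character shreds a phone number into meaningless fragments, so an exact search for the number finds nothing; a pattern-aware tokenizer keeps the number whole. A minimal Python sketch (the tokenizers and the sample number are ours for illustration, not dtSearch internals):

```python
import re

def naive_tokens(text):
    # Split on every non-alphanumeric character, as a simplistic
    # tokenizer would; a phone number falls apart into fragments.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def pattern_aware_tokens(text):
    # Recognize phone-number-like sequences first, then fall back
    # to plain word tokens, so the number survives as one token.
    phone = r"\+?\d[\d\-\(\) ]{6,}\d"
    tokens = []
    pos = 0
    for m in re.finditer(phone, text):
        tokens.extend(naive_tokens(text[pos:m.start()]))
        tokens.append(re.sub(r"\D", "", m.group()))  # keep digits only
        pos = m.end()
    tokens.extend(naive_tokens(text[pos:]))
    return tokens

doc = "Call John at +7 (495) 123-45-67 tomorrow"
print(naive_tokens(doc))          # number is split into pieces
print(pattern_aware_tokens(doc))  # number kept as '74951234567'
```

With the naive tokenizer, a query for the full number matches nothing because the index only contains the fragments `7`, `495`, `123`, and so on.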
So, at some point, we realized that we had to do something about our full-text search module. First of all, we tested the existing solutions, and the results were not promising. The old players had not changed much since 2010, and we were not impressed with the new ones like LucidWorks Fusion. Then we considered building just a new full-text search module on top of Tika + ElasticSearch or Solr, which would have solved our own problem. But we were still troubled that the market lacked a solid, fast and usable solution.
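For reference, the Tika + ElasticSearch route would have needed little more than an index with a morphology-aware analyzer over Tika's extracted text. A minimal sketch of such index settings, assuming Elasticsearch's built-in `russian` language analyzer (the index layout and field names here are illustrative, not a production design):

```json
{
  "settings": {
    "index": { "number_of_shards": 3 }
  },
  "mappings": {
    "properties": {
      "content":    { "type": "text", "analyzer": "russian" },
      "source":     { "type": "keyword" },
      "sha256":     { "type": "keyword" },
      "indexed_at": { "type": "date" }
    }
  }
}
```

The `keyword` hash field would support deduplication, while the analyzed `content` field handles stemmed full-text queries.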
That's why, after yet another deep-thinking session, we decided to create a new open-source project that would make many lives easier, and so Ambar was born.
Ambar – Full-Text Search System
Since we had constantly struggled with dtSearch and its issues, we kept all of that experience in mind during Ambar's development. The main goal was an instant search experience: any query must take less than 100 ms to execute, no matter how complex it is or how many documents are indexed, even though we were aiming at data volumes of hundreds of millions of files.
Ambar was released in January 2017. This month we deployed Ambar for our first enterprise customer.
The main points about Ambar:
- Instant full-text and fuzzy search through document contents with respect to language morphology; any query takes less than 100 ms to execute
- Ultra-fast, easy-to-use user and administrator interfaces
- Ambar supports almost all existing file types and deduplicates data
- Ambar has advanced OCR with smart image preprocessing
- Ambar has the best PDF parsing on the market today: it smartly detects whether a page contains an image or text and runs OCR only when necessary
- Advanced full-text language analyzer that does tokenizing right
- Simple REST API, easily integrated with any system
- Cloud and self-hosted versions are available
- Self-hosted Ambar supports clustering and can be scaled to any size
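To illustrate what REST integration looks like in practice, here is a hedged client sketch. The endpoint path and parameter names below are assumptions for illustration, not Ambar's documented API; check the GitHub repo for the real contract.

```python
# Hypothetical search-client sketch; the '/api/search' path and the
# 'query'/'size' parameter names are assumptions, not Ambar's real API.
from urllib.parse import urlencode

def build_search_url(base_url, query, size=10):
    # Compose a GET request URL for a REST full-text search endpoint.
    params = urlencode({"query": query, "size": size})
    return f"{base_url.rstrip('/')}/api/search?{params}"

url = build_search_url("http://localhost:8080", "annual report 2016", size=20)
print(url)
# The actual HTTP call could then be made with any client, e.g.:
#   import urllib.request
#   body = urllib.request.urlopen(url).read()
```

Because the interface is plain HTTP, any stack — Node.js, Python, C#, or a shell script — can integrate with it the same way.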
We'll release an email crawler soon, and we'll also extend Ambar's data analysis capabilities with named entity recognition.
Check out Ambar on GitHub: https://github.com/RD17/ambar