Challenges of building a high-performance search engine to process big data
Build a high-performance indexing system suitable for electronic discovery and compliance needs.
The Athena Archiver indexing system is the most accurate distributed search engine on the market. It can process over 100 different data types and search through millions of records in fractions of a second. Unlike many existing search systems, such as open source Lucene or database engines, it returns zero false positives and exact record counts. It also provides a complex search query language and a lightning-fast classification system.
Athena Archiver is an email and document archiving company that aims to deliver next-generation legal discovery and compliance capabilities for businesses. Athena Archiver initially started with a database-oriented solution, but it became clear that the load and search speeds of a traditional database were simply too slow. Athena Archiver customers were processing petabytes of mail each month.
We knew that existing solutions had major shortcomings. The biggest shortcoming of the open source systems was that they were too generalized: none of them met the scale and accuracy requirements that Athena Archiver had. So we set out to build a custom indexing system in C++ with an innovative indexing strategy that does not lose any information in the indexes. We could search millions of records in fractions of a second on a single PC. The core engine was built in partnership and is currently dual licensed to Athena Archiver and BitFortress. In addition to the core engine, we also built automated data conversion systems that handle Unicode, various email formats, attachment indexing, and image categorization through machine learning techniques. The system can easily handle massive stores of data and has served some of the largest financial institutions in the world.
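To illustrate what an index that "does not lose any information" can mean, here is a minimal sketch of a positional inverted index. It is written in Python for brevity and is not the Athena Archiver engine, which is in C++ and whose indexing strategy is not described here. The class name `PositionalIndex` and its methods are hypothetical. Because every token position is kept, a phrase query can be checked exactly, so the result set contains no false positives and its length is an exact count.

```python
from collections import defaultdict


class PositionalIndex:
    """Toy inverted index that keeps every token position (illustrative only)."""

    def __init__(self):
        # term -> {doc_id -> [positions of the term in that document]}
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        # Assumption: whitespace tokenization and lowercasing are good enough here.
        for pos, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def phrase(self, query):
        """Return the sorted ids of documents that contain the exact phrase."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = []
        for doc_id, starts in self.postings.get(terms[0], {}).items():
            for start in starts:
                # Every following term must sit at the next consecutive position.
                if all(start + i in self.postings.get(term, {}).get(doc_id, ())
                       for i, term in enumerate(terms)):
                    hits.append(doc_id)
                    break
        return sorted(hits)


index = PositionalIndex()
index.add(1, "quarterly risk report attached")
index.add(2, "report on quarterly risk")
index.add(3, "risk of a late report")

matches = index.phrase("risk report")
print(matches, len(matches))
```

Only document 1 contains "risk" immediately followed by "report". A bag-of-words index that discarded positions would also have to return documents 2 and 3, or filter them in a second pass over the original text.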
Once the main indexing system was built, we proceeded to build a distributed task scheduler so that we could scale the system to huge clusters of servers, focusing on high throughput in our messaging infrastructure. The system is still evolving. We are beginning to apply artificial intelligence and machine learning techniques to run conceptual searches and flag high-risk emails and documents in real time.
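The idea behind a task scheduler can be sketched on a single machine: tasks go into a shared queue and a pool of workers drains it. The sketch below uses Python threads as a stand-in for servers in a cluster. The function `run_workers` and the task format `(task_id, payload)` are assumptions made for illustration, not the actual scheduler or its messaging layer.

```python
import queue
import threading


def run_workers(tasks, handler, num_workers=4):
    """Fan (task_id, payload) pairs out to worker threads; return {task_id: result}."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task_id, payload = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            outcome = handler(payload)
            with lock:
                results[task_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Hypothetical workload: count the tokens in each message body.
messages = [(i, "word " * i) for i in range(10)]
token_counts = run_workers(messages, lambda body: len(body.split()))
print(token_counts[3])
```

A real cluster scheduler also has to deal with retries, worker failures, and network transport, none of which this sketch attempts.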
Building a high-performance data warehouse to track over 5 million artists
Build a real-time data warehouse of every artist and musician across hundreds of data points
BitFortress built a web parsing and scraping engine that converts web pages into structured content that is indexable and searchable. The Bookwise data warehouse tracks social media activity, tour bookings, ticket prices, and dozens of other data metrics on a daily basis for millions of artists. The system is one of the most comprehensive and accurate music databases in the world.
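As a rough illustration of turning a web page into structured content, the sketch below uses Python's standard `html.parser` module to pull a title, links, and visible text into a dictionary. The class `PageExtractor`, the function `extract`, and the record fields are hypothetical. The production engine is written in C++ and Node.js and is not shown here.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the page title, link targets, and text chunks into one record."""

    def __init__(self):
        super().__init__()
        self.record = {"title": "", "links": [], "text": []}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.record["links"].append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.record["title"] += text
        else:
            # Note: script and style contents are not filtered in this sketch.
            self.record["text"].append(text)


def extract(html):
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return parser.record


page = ('<html><head><title>Example Artist</title></head>'
        '<body><h1>Tour dates</h1><a href="/tickets">Buy tickets</a></body></html>')
record = extract(page)
print(record)
```

The resulting dictionary can be fed to an indexer or stored as a row, which is the sense in which a page becomes "indexable and searchable".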
Bookwise is a data aggregation platform for the music industry that gathers data about an artist from across the internet and provides a real-time view and comparison of artists on hundreds of data points.
BitFortress developed the core data scraping and data warehouse system for the data analysis engine. It was written in C++ and Node.js over a period of 3 months. The core system tracks over 5 million artists and over 14 million albums on a daily basis. By aggregating social media views, YouTube views, and website traffic, we can detect trending artists and segment artists by geography and booking cost, bringing data analysis to the music booking industry for the first time.
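One simple way to detect a trending artist from aggregated daily metrics is to compare the latest day against the average of the preceding days. The sketch below assumes each artist has a list of daily combined totals (for example, social media views plus website visits). The function names, the sample figures, and the 1.5 threshold are all hypothetical, and this is not the scoring method used in the production system.

```python
def trend_score(daily_totals):
    """Ratio of the latest day's total to the mean of the earlier days."""
    if len(daily_totals) < 2:
        return 0.0
    *history, latest = daily_totals
    baseline = sum(history) / len(history)
    return latest / baseline if baseline > 0 else 0.0


def trending(artists, threshold=1.5):
    """Return artist names whose score meets the threshold, highest score first."""
    scored = {name: trend_score(series) for name, series in artists.items()}
    hot = [name for name, score in scored.items() if score >= threshold]
    return sorted(hot, key=lambda name: -scored[name])


# Made-up sample data: combined daily views per artist.
sample = {
    "Artist A": [100, 110, 90, 300],
    "Artist B": [500, 510, 490, 505],
    "Artist C": [10, 10, 10, 40],
}
print(trending(sample))
```

Using a ratio rather than a raw difference lets a small artist with a sharp jump rank above a large artist with steady traffic.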