Our cloud-covered world could be in for a storm. Beijing-based tech company Terark has developed algorithms that allow databases to run up to 200x faster by compressing their data further. On top of that, the algorithms allow the data to be read without decompressing it. This means one server running TerarkDB can do the job of five servers running industry-standard database engines. The cost savings for companies could be huge, and the engine slots straight into existing database ecosystems “like changing a battery,” which makes it easy for Terark to offer free trials.

Terark has secured a $1 million contract with Alibaba Cloud and is already profitable despite being established only in November 2015. The company is not looking for further investment, but it is now heading to Europe and the US to explain the revolutionary concept to potential clients (keep reading for our attempts). It is also making the code free to small users running just one server.

The company is already in talks with undisclosed large clients around the world.

Comparison of TerarkDB, WiredTiger (the MongoDB engine) and RocksDB (Image credit: Terark)

“It’s not an incremental improvement,” said VP Remy Trichard. “You can get up to 200 times faster on random reads. [Big companies like IBM, Google] spend a lot of time and money trying to optimize their servers, adding more memory to get only incremental improvements. In terms of cost savings, it’s five times. One server can do the work of five, in some scenarios, ten.”

Angel Alibaba

So far Terark’s biggest client is Alibaba, which has signed a $1 million deal to integrate the technology into Alibaba Cloud (Aliyun 阿里云), the world’s third-largest cloud company according to Terark CEO Sean Fu. This angel client will give its cloud service users the choice to switch to TerarkDB in a few weeks’ time.

Pricing structures are not yet clear, though the team divulged that Alibaba Cloud will save money when customers switch to the system.

How does it work?

The team uses various analogies (scroll down) to explain how the solution works, and the inventor of the algorithms, CTO Lei Peng, even drew diagrams to explain the difference. “Our whole logic system is different,” said Lei as he got his whiteboard pen.

Databases store their data in blocks with a corresponding index. When data is needed, the index is searched and the relevant block is retrieved. In current engines, those blocks are compressed and must be decompressed before they can be read. The blocks are managed by a file system cache and have to be loaded into a block cache to be decompressed and read, which puts a huge demand on servers.
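To make that traditional read path concrete, here is a minimal Python sketch of a block-compressed store with a sparse index. It is a toy model, not the code of any real engine: the `BlockStore` class, the tab-separated record layout and the use of zlib are all assumptions chosen for illustration.

```python
import bisect
import zlib


class BlockStore:
    """Toy block-compressed store (hypothetical, for illustration only).

    The index is sparse: it holds just the first key of each block, so a
    lookup must decompress a whole block to find a single record.
    Keys and values must not contain tabs or newlines in this sketch.
    """

    def __init__(self, records, block_size=4):
        items = sorted(records.items())
        self.first_keys = []      # sparse index: one entry per block
        self.blocks = []          # zlib-compressed block payloads
        self.decompressions = 0   # counts how often a block was unpacked
        for i in range(0, len(items), block_size):
            chunk = items[i:i + block_size]
            self.first_keys.append(chunk[0][0])
            payload = "\n".join(f"{k}\t{v}" for k, v in chunk)
            self.blocks.append(zlib.compress(payload.encode("utf-8")))

    def get(self, key):
        # Find the last block whose first key is <= the requested key.
        pos = bisect.bisect_right(self.first_keys, key) - 1
        if pos < 0:
            return None
        # The whole block is decompressed, even for one record.
        self.decompressions += 1
        payload = zlib.decompress(self.blocks[pos]).decode("utf-8")
        for line in payload.split("\n"):
            k, v = line.split("\t", 1)
            if k == key:
                return v
        return None


store = BlockStore({f"user{i:03d}": f"value-{i}" for i in range(100)},
                   block_size=10)
print(store.get("user042"))    # value-42
print(store.decompressions)    # 1
```

Even in this toy version, fetching one record means unpacking nine others along with it, and a lookup for a missing key can still cost a full block decompression.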

CTO Lei Peng takes to the whiteboard to explain how Terark works (Image credit: TechNode)

TerarkDB compresses the data further, but its indexing system is where the real difference is. “Traditional system can only index 1% but we can index 100% using the Nested Succinct Trie [pronounced ‘try’],” said Lei. Because the index holds far more information about what is in the data, blocks don’t have to be retrieved and decompressed; they can be read in situ. The compressed index is more comprehensive, which means the data doesn’t have to be compressed block by block but can use a “global compression,” allowing for far greater query speeds.

“We can search directly into the data without decompressing it so we don’t need a big block cache. Traditional databases need to find the relevant block, decompress it, check if it’s the right data, if not then put it back and pick another,” said Trichard.
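The idea of indexing every key, so that a lookup goes straight to one record instead of unpacking a block, can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it uses an ordinary pointer-based trie rather than Terark’s Nested Succinct Trie, whose internals are not described here, and it stores values uncompressed. `TrieIndex` and `FullIndexStore` are hypothetical names.

```python
_END = ""  # sentinel dict key marking "a key ends here" (never a real character)


class TrieIndex:
    """Plain pointer-based trie mapping every key to a record location.

    Assumption: this stands in for a succinct trie only in behaviour,
    not in memory footprint.
    """

    def __init__(self):
        self.root = {}

    def insert(self, key, location):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[_END] = location

    def find(self, key):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return None      # key is absent; no data was touched
        return node.get(_END)


class FullIndexStore:
    """Toy store where the index covers 100% of keys (hypothetical)."""

    def __init__(self, records):
        self.index = TrieIndex()
        buf = bytearray()
        for key, value in records.items():
            data = value.encode("utf-8")
            # Remember exactly where this one value lives.
            self.index.insert(key, (len(buf), len(data)))
            buf += data
        self.data = bytes(buf)

    def get(self, key):
        loc = self.index.find(key)
        if loc is None:
            return None
        offset, length = loc
        # Read only the bytes of the requested record.
        return self.data[offset:offset + length].decode("utf-8")


store = FullIndexStore({"apple": "red", "apricot": "orange"})
print(store.get("apricot"))   # orange
print(store.get("ap"))        # None
```

A pointer-based trie like this uses far more memory than the raw keys, so it shows only the lookup behaviour. A succinct trie aims to offer the same navigation from a compact bit-level encoding.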

Lei came up with the algorithm when devising a way for Chinese characters to be suggested more quickly when typing pinyin into a keyboard. “It was quite a gradual step by step process in itself, but the breakthrough was applying something very specific to something very general: databases,” said Lei.

Terark’s Lei Peng at the HQ in Changping District, Beijing (Image credit: TechNode)

Analogy #1 The Zip File

One way to think about it is to picture the blocks as a Zip file of vacation photos. You can’t see individual photos within the file, so you either have to decompress it to view them and then recompress, or leave them decompressed, taking up more space. With Terark you can access them within the file, still zipped.

Analogy #2 The Library

Trichard prefers the library scenario. Think of blocks as sections of books in a library, such as architecture or history. Each book has a table of contents at the front, and the library has an index of all the books. If you want a book on architecture, the librarian (the index) can direct you to the architecture section (the block), but you have to look at each book’s contents page to decide whether that’s the book you need. Terark lets you put all the tables of contents into the overall library index.

“It’s like putting all your library on Google – you just type the keyword for what you want,” said Trichard.

Plug ‘n’ play, much faster

Will all that speed make your smartphone melt? “The users of everyday apps and websites may notice a faster experience, but it would really be for the company itself. They would be able to reduce their number of servers and reduce the [time spent] querying data from the servers,” says CEO Sean Fu.

“We’ve developed a new engine, not a car,” said Fu. The solution can be slotted straight into existing databases, meaning companies can keep running ecosystems such as MongoDB and MySQL, which are among the most widely used worldwide.

“Everything stays the same, the interface stays the same – the only difference is they get better speed, better storage, better efficiency,” said Fu.

Future

The team numbers only ten, and they don’t see the point of scaling up or opening offices elsewhere. We met them at their small office within a Tencent-run startup space (you have to use WeChat to get into a meeting room) on the edge of Beijing. “Technology does not have the boundaries of countries – if it’s good, people can use it anywhere. We can do almost everything online, though may need sales engineers in some places,” said Fu.

The Nested Succinct Trie is only the beginning. Terark has six patents for its various innovations, but the team is quite resigned to the fact that the key algorithm for the indexing compression is nearing maturity. “There will be an evolution, but then there will have to [be] different indexes,” said Lei. The team is looking into creating indexes suited to handling different types of data sets as they are approached by more interested parties. They may end up developing a range of products targeted at different client types such as genetics companies. “Different indexes will be more efficient for different data,” said Lei.

Frank Hersey is a Beijing-based tech reporter who's been coming to China since 2001. He tries to go beyond the headlines to explain the context and impact of developments in China's tech sector. Get in...
