Our cloud-covered world could be in for a storm. Beijing-based tech company Terark has developed algorithms that allow databases to run up to 200x faster by compressing their data further. On top of that, the algorithms allow the data to be read without decompressing it. This means one server running TerarkDB can do the job of five servers running industry-standard database engines. The cost savings for companies could be huge, and it slots straight into existing database ecosystems, “like changing a battery,” which makes free trials easy to offer.
Terark itself has secured a $1 million contract with Alibaba Cloud and is already profitable despite only being established in November 2015. They’re not looking for further investment, but they are now heading to Europe and the US to explain the revolutionary concept to potential clients (keep reading for our attempts). They’re also making the code free to small users running just one server.
The company is already in talks with undisclosed large clients around the world.
“It’s not an incremental improvement,” said VP Remy Trichard. “You can get up to 200 times faster on random reads. [Big companies like IBM, Google] spend a lot of time and money trying to optimize their servers, adding more memory to get only incremental improvements. In terms of cost savings, it’s five times. One server can do the work of five, in some scenarios, ten.”
Angel Alibaba
So far their biggest client is Alibaba, which has done a $1 million deal with Terark to integrate its technology into Alibaba Cloud (Aliyun 阿里云), the world’s third-largest cloud company according to Terark CEO Sean Fu. This angel client will give its cloud service users the choice to switch to TerarkDB in a few weeks’ time.
Pricing structures are not yet clear, though the team divulged that Alibaba Cloud will save money when customers switch to the system.
How does it work?
The team uses various analogies (scroll down) to explain how the solution works, and the inventor of the algorithms, CTO Lei Peng, even drew diagrams to explain the difference. “Our whole logic system is different,” said Lei as he reached for his whiteboard pen.
Databases store their data in blocks with a corresponding index. When data is needed, a search of the index is made and the relevant block is retrieved. Currently, those blocks are compressed and need to be decompressed. The blocks are managed by a file system cache and have to be dropped into a block cache to be decompressed and read, which puts a huge demand on servers.
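The block-based lookup described above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not code from any particular database engine: the names `build_blocks`, `lookup`, and `BLOCK_SIZE` are made up for this example, `zlib` stands in for whatever compressor a real engine uses, and keys and values are assumed to be strings without tabs or newlines.

```python
import bisect
import zlib

BLOCK_SIZE = 4  # records per block; an arbitrary value chosen for illustration


def build_blocks(records):
    """Sort records by key, compress them block by block, and keep a
    sparse index that holds only the first key of each block."""
    items = sorted(records.items())
    first_keys, blocks = [], []
    for i in range(0, len(items), BLOCK_SIZE):
        chunk = items[i:i + BLOCK_SIZE]
        first_keys.append(chunk[0][0])
        payload = "\n".join(f"{k}\t{v}" for k, v in chunk).encode("utf-8")
        blocks.append(zlib.compress(payload))
    return first_keys, blocks


def lookup(first_keys, blocks, key):
    """Find the one block that could hold the key, decompress the whole
    block, then scan it for the record."""
    pos = bisect.bisect_right(first_keys, key) - 1
    if pos < 0:
        return None  # key sorts before the first block
    raw = zlib.decompress(blocks[pos]).decode("utf-8")
    for line in raw.split("\n"):
        k, v = line.split("\t")
        if k == key:
            return v
    return None  # block was decompressed, but the key was not in it
```

Note that even a failed lookup pays for decompressing an entire block, because the sparse index only says where a key would be, not whether it exists.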
TerarkDB compresses the data further, but its indexing system is where the real difference is. “Traditional system can only index 1% but we can index 100% using the Nested Succinct Trie [pronounced ‘try’],” said Lei. Because the index holds far more information about what is in the data, blocks don’t have to be retrieved and decompressed; they can be read in situ. The compressed index is more comprehensive, which means the data doesn’t have to be compressed block by block but can be compressed as a whole (“global compression”), allowing for far greater query speeds.
“We can search directly into the data without decompressing it so we don’t need a big block cache. Traditional databases need to find the relevant block, decompress it, check if it’s the right data, if not then put it back and pick another,” said Trichard.
Lei came up with the algorithm when devising a way for Chinese characters to be suggested more quickly when typing pinyin into a keyboard. “It was quite a gradual step by step process in itself, but the breakthrough was applying something very specific to something very general: databases,” said Lei.
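To give a feel for what indexing every key means, here is a minimal sketch of an ordinary, uncompressed trie in Python. It is not Terark's Nested Succinct Trie, whose internals the company did not spell out to us, and the `Trie` class and its method names are invented for this example. It only shows the basic idea: each key is a path through the index itself, so a lookup walks the path character by character and never has to open a block.

```python
_VALUE = ""  # reserved dict key marking "a stored value ends here"


class Trie:
    """A plain dict-of-dicts trie: every key is fully indexed."""

    def __init__(self):
        self.root = {}

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[_VALUE] = value

    def get(self, key):
        """Follow the key one character at a time; no block to decompress."""
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return None
        return node.get(_VALUE)

    def keys_with_prefix(self, prefix):
        """Return all stored keys that start with the prefix, sorted."""
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return []
        found, stack = [], [(prefix, node)]
        while stack:
            path, current = stack.pop()
            for label, child in current.items():
                if label == _VALUE:
                    found.append(path)
                else:
                    stack.append((path + label, child))
        return sorted(found)
```

A plain trie like this one uses far more memory than the data it indexes. The "succinct" part of Terark's name refers to encoding such a structure compactly, which this sketch does not attempt.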
Analogy #1 The Zip File
One way to explain how it works is to think of the blocks as a Zip file of vacation photos. You can’t see individual photos within the file, so you either have to decompress it to view them and then recompress, or leave them decompressed and taking up more space. With Terark you can access them within the file, still zipped.
Analogy #2 The Library
Trichard prefers the library scenario. Think of blocks as sections of books in a library, such as architecture or history. Each book has a table of contents at the front, and the library has an index of all the books. So if you want a book on architecture, the librarian/index can direct you to the architecture section/block, but you have to look at each book’s contents page to decide if that’s the book you need. Terark lets you put all the tables of contents into the overall library index.
“It’s like putting all your library on Google: you just type the keyword for what you want,” said Trichard.
Plug ‘n’ play, much faster
Will all that speed make your smartphone melt? “The users of everyday apps and websites may notice a faster experience, but it would really be for the company itself. They would be able to reduce their number of servers and reduce the [time spent] querying data from the servers,” says CEO Sean Fu.
“We’ve developed a new engine, not a car,” said Fu. The solution can be slotted straight into existing databases, meaning companies can keep running ecosystems such as MongoDB and MySQL, among the most commonly used worldwide.
“Everything stays the same, the interface stays the same; the only difference is they get better speed, better storage, better efficiency,” said Fu.
Future
The team numbers only ten, and they don’t see the point of scaling up or opening offices elsewhere. We met them at their small office within a Tencent-run startup space (you have to use WeChat to get into a meeting room) on the edge of Beijing. “Technology does not have the boundaries of countries. If it’s good, people can use it anywhere. We can do almost everything online, though may need sales engineers in some places,” said Fu.
The Nested Succinct Trie is only the beginning. Terark has six patents for its various innovations, but the team is quite resigned to the fact that the key algorithm for the indexing compression is nearing maturity. “There will be an evolution, but then there will have to [be] different indexes,” said Lei. The team is looking into creating indexes suited to handling different types of data sets as they are approached by more interested parties. They may end up developing a range of products targeted at different client types such as genetics companies. “Different indexes will be more efficient for different data,” said Lei.



