
On June, 4th, my colleague Björn and me went on our way to Berlin to visit the Berlin Buzzwords conference. Next to all the Buzzwords, the conference featured talks around the self proclaimed topics search, store, and scale by many well-known web companies like Twitter, Groupon, wooga and many others.
For me, personally, the talks grouped themselves in of three main categories:
1) Stories about Large-scale deployments of the currently hot tools, mainly Hadoop, ElasticSearch, and MongoDB.
One notable example is the talk “You know, for search. Querying 24 Billion Records in 900ms.” by Jodok Batlogg.
2) Architecture Designs to collect, store and provide data from different sources and to different customers integrated in the respective custom infrastructure of web servers and other agents.
I especially liked the talk “Event-Stream Processing with Kafka” by Tim Lossen from wooga. The architecture used Apache Kafka as the major fire hose to collect a lot of log messages from their online gaming clients all over the world and integrated this with their system by using redis as glue between their plethora of data consumers. I’m already a big fan of Kafka and we also use redis a lot at ICANS, so I liked to see another successful solution using these technologies. I was also amazed at the simplicity of their setup: a single, non-distributed Kafka instance with a periodic rsync backup of the data to collect 250 million messages per day.
A similar topic and design was presented by Ralf Neeb from GameDuell in his talk “Real-/Neartime analysis with Hadoop & VoltDB”. Instead of Kafka, they use VoltDB to store their incoming data stream, which gives them fast and direct access to the recent data. Out of VoltDB, the data is then pulled into Hadoop.
My exposure to VoltDB is rather limited, but these talks inspired me to write a bit more about transport and topics that arise around it, like monitoring for data loss and tracking how long it takes for data to move through the transport layer. So watch out for the upcoming post “Event-Stream Processing: Flume, Kafka, VoltDB”.
3) New stuff. Talks about things, I’ve not used so far: Tools, Methods, and technical Details.
@tools: With Basho being very present at the conference, I got very interested in Riak. I’ll definately have a look into it soon.
@methods: Ted Dunning introduced me to concept of Bayesian Bandits in his talk “Real Time Dataming and Aggregation at Scale”. It’s quite interesting, as it basically just counts how often a specific path led to success when traveled, for instance a click on a presented option. What it adds to the count is a probability distribution by which the significance of difference can be estimated and a set of best-performing options can be selected. I think this method could have many applications within our applications in the future. Ted also implemented a Bayesian Bandit for Storm: storm-counts. As I’m also very interested in Storm in general, I plan to dive a bit deeper into this topic and write about my findings in “Fun and Profit with Bayesian Bandits”.
@details: Michael Busch from Twitter presented quite a deep view in how they patched Lucene to make it fit to their kind of data load. Check his talk “Realtime Search at Twitter”.
So far, for my impressions.
Berlin Buzzwords, for me, were two days of interesting talks with many pointers for new stuff to look at. I also attended the post-Buzzwords Workshop “Parallel Processing beyond MapReduce”, which gave me two more days of interesting discussions and insight into Apache Giraph, Stratosphere, and YARN, the new next generation of Hadoop MapReduce.
I intend to follow those pointers and write about them in the following upcoming blog posts:
* Event-Stream Processing: Flume, Kafka, VoltDB
* Fun and Profit with Bayesian Bandits
* Adventures with Riak
* Inside Hadoop Deployments: Unexpected Sources for Errors and Performance-Leaks
* Better Hadoop: Ideas for improving Performance and your Quality of Sleep
* Musings on the Alledged Unreasonable Effectiveness of Data
* Parallel Processing beyond MapReduce