пʼятницю, 3 травня 2013 р.

Storm, distributed and fault-tolerant realtime computation, part 1

Storm is a real big data solution for soft-real time. In the hadoop world batch processing is a main limitation. Let me explain. Imagine, you track some event, however with Hadoop event happened at 11:00AM, it collected, than processed each half hour (in 11:30AM, bacause of Hadoop is batch processing) and this event influences statistics (or whatever) only at 12:00AM. How to do it more "real-time"

Storm comes into play here.

Storm built on idea of streaming data and this is big difference in compare with Hadoop. Not advantage, but difference. Hadoop is a great batch processing system. Data is introduced into the Hadoop file system (by Flume or any other way) and distributed across nodes for processing. Then processing is started, it contains a lot of shuffle operations between nodes and so on, a lot of IO operations. Afterwards,  the result must be picked up from HDFS. 
Storm works with unterminated streams of data. Storm jobs, unlike Hadoop jobs, never stop, instead continuing to process data as it arrives. So, you can think about Strom as about powerful integration framework (like Apache Camel or Spring integration) if you are familiar with it.

So, when even is produced, it is introduced into Storm and go through set of jobs...
Let's look at Storm terminology:
  • Tuple is a data structure that represent standard data types (such as ints, floats, and byte arrays) or user-defined types with some additional serialization code (there isn't analog in Java language, but tuples are very usual in Python and special in Erlang);
  • Stream is a unterminated data flow, sequence of tuples
  • Spout special component which provide data from external source into Storm
  • Bolt is a Storm jobs that can be linked in chain and makes any tuples transformation
  • Topology structure of spouts and bolt connected by streams (see picture below)

By default, streams are made as ØMQ queues, but it's underhood and you should worry about that (expect cluster installation time).
One features that I explored during development, but I didn't mentioned during reading tutorial: Bolt has to be serializable, so all class members must be serializable too.

Немає коментарів:

Дописати коментар