вівторок, 7 травня 2013 р.

Storm, distributed and fault-tolerant realtime computation, part 2

Read first part here

To illustrate main Storm points, let's go though the main points...
So, the idea of application is to scan web-server log (well, be honest this log is created by us but it's very similar to real-world log), push each event into Storm topology, calculate some statistic and update this statistic in database. Well, stupid user-case for Storm, but good enough to show all main concepts.

Let's start from log line, it is something like follow:


Well, first of all, we have to get this line into Storm and we know that we have to use Spout for this purposes. In example, we using Spout to read file, but this is a very bad practice in real-life! You can have your log file on separate machine, or you can configure different number of spouts and log files or something other. Anyway, it will be better to introduce data from external source into spout by some general flexible solution like MQ or Flume. 
The easiest way to write Spout is extending backtype.storm.topology.base.BaseRichSpout.
There are next important method that must be overrode:

public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer)
declare fields in output tuple, i.e. outputFieldsDeclarer.declare(new Fields("country", "url"));

public void nextTuple()
Called by Storm to get new Tuple (call emit method to produce new tuple into topology). This method shouldn't be blocking, if there is nothing to emit, just sleep thread for miliseconds and return nothing. Also, Storm provide two method to control tuple lifecycle - ack, to mark tuple as successful processed, and fail - to mark tuple as unsuccessful processed; use ack and fail methods in Spouts and Bolts both.

public void open(Map map, TopologyContext topologyContext, SpoutOutputCollector spoutOutputCollector)
The method called for Spout configuration on start-up


Bolt is a main component in Storm architecture and can be compared with jobs in Hadoop.
The easiest way to create Bolt is to extend BaseRichBolt. Similar yo Spout you will need to implement
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer)
method and describe output tuple if they are present (otherwise just left empty implementation).

If you wish, you can implement following method:
public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector)
Usualy, it uses to get outputCollector instance available from next method (the most important):
public void execute(Tuple tuple)

This method is called all time when you Bolt get new tuple. In this method you can emit any tuples to make it available on the next step. Or you can push it to datastore or whatever.

Notice, all fields in Bolt must be serializable!


Topology link Spouts and Bolt is the one structure and describe streams between them. After that topology can be send to cluster - remote or local and will be executed by Storm.

To see te whole working example, check out my github

Немає коментарів:

Дописати коментар