Wednesday, December 26, 2012

Surprise in the Hadoop log

When I started working with Hadoop I was confused by the following message in the logs:
  DEBUG conf.Configuration: java.io.IOException: config(config)
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:225)
    at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:183)
This was Hadoop 1.0.3, and I kept asking myself, "what am I doing wrong?". It was the newest Hadoop release at the time and Google turned up nothing useful, so I had to check the source code... and got a surprise! Look at it (line 4!):
  1. public Configuration(boolean loadDefaults) {
  2.   this.loadDefaults = loadDefaults;
  3.   if (LOG.isDebugEnabled()) {
  4.     LOG.debug(StringUtils.stringifyException(new IOException("config()")));
  5.   }
  6.   synchronized(Configuration.class) {
  7.     REGISTRY.put(this, null);
  8.   }
  9.   this.storeResource = false;
  10. }

I can't believe they always log an exception... a strange way to get a stack trace? Maybe...
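For comparison, here is a minimal sketch (my own, not Hadoop code) of getting the call site without constructing an `IOException` at all, via `Thread.currentThread().getStackTrace()`; the class and method names are made up for the example:

```java
public class CallSiteDemo {
    static void whoCalledMe() {
        // trace[0] is Thread.getStackTrace itself, trace[1] is this method,
        // trace[2] is whoever called us
        StackTraceElement[] trace = Thread.currentThread().getStackTrace();
        System.out.println("Called from: " + trace[2].getMethodName());
    }

    public static void main(String[] args) {
        whoCalledMe(); // prints "Called from: main"
    }
}
```

So the same "who created this Configuration?" information is reachable without the alarming `IOException` text in the debug log.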

Wednesday, December 12, 2012

Raspberry Pi cluster for Hadoop?

I just thought: "A Raspberry Pi cluster for Hadoop?" Is it possible? Does it make any sense?

Let's think... Hadoop uses the hard drive very intensively. Memory... it's good to have enough memory, but it isn't critical; I believe 512 MB will be enough. CPU... it depends on your code, but usually it's not the critical point for map-reduce in general.

So, with a Raspberry Pi you get (for just $35!):

  • RAM 512 MB
  • CPU ARM11 700 MHz
  • SD card with Linux, 4-16 GB (you will need to buy it separately)

Some time ago there was a nice article about a Raspberry Pi supercomputer: 64 Raspberry Pi computers connected into one cluster (via Ethernet); each had a 16 GB SD card, which means 1 TB of storage for the whole cluster (!), and the whole thing cost about $4000.
One concern: SD card access speed. It isn't good enough, so you would need to buy an external SSD. I assume each Raspberry Pi has to have its own SSD (32-64 GB should be enough). So, this solution will be more expensive than $4000, but still cheaper than whole PCs or cloud instances.

Let's try to calculate: 64 Raspberry Pis * $35 = $2,240; 64 * 64 GB SSDs = 4 TB of storage, costing about $4,500; so the whole solution will cost $6,500-$7,000 for 64 physical nodes :)
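The back-of-the-envelope math above can be written out explicitly; the SSD total is my rough estimate (about $70 per 64 GB drive), not a quoted price:

```java
public class ClusterCost {
    public static void main(String[] args) {
        int nodes = 64;
        int piPrice = 35;            // $ per Raspberry Pi board
        int ssdTotal = 4500;         // rough estimate for 64 x 64 GB SSDs

        int boards = nodes * piPrice;     // 64 * 35 = 2240
        int total = boards + ssdTotal;    // 2240 + 4500 = 6740

        System.out.println("Boards: $" + boards + ", total: $" + total);
    }
}
```

So $6,740 lands right inside the $6,500-$7,000 range, and the boards themselves are only about a third of the cost.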

So, does it make sense to build a Hadoop-oriented cluster? I believe so; what do you think?
At least, it will be a great experiment!

PS. Maybe someone wants to donate money for this experiment? Kickstarter sounds reasonable here.