Thursday, October 3, 2013

Six Hadoop distributions to consider


Hortonworks is one of the most recent players. It was spun off from Yahoo!, which chose that route instead of maintaining its own Hadoop infrastructure in-house. Everything they do is open source and stays very close to the Apache Hadoop project as it evolves. They have already moved to YARN and provide an easy-to-start virtual machine. A lot of comprehensive tutorials are available on their website, as well as regular webinars.

Cloudera is perhaps the oldest and best-known provider. It turned Hadoop into a commercially viable product and is still the market leader. The Cloudera product is based on Apache Hadoop with a lot of their own patches and enhancements, which are released as open source. Cloudera has also created some entirely new products, such as Cloudera Search (Solr on Hadoop). In addition, some proprietary enterprise components are available at extra cost. They have already moved to YARN and also provide a virtual machine for evaluation.


MapR technology is unique in many ways. They replaced HDFS with their own file system, MapR-FS, written in C instead of Java. MapR-FS is compatible with HDFS, but it is completely distributed, has no single point of failure, and even supports mutable data. MapR claims their file system gives better performance than HDFS. You can also mount MapR-FS as a regular volume on a system, which avoids a complicated data ingestion process. MapR is also famous for its partnerships with Amazon and Google: Amazon EMR offers MapR as a distribution, and MapR set a terabyte sort (TeraSort) record on Google's VM infrastructure. The documentation is well structured, but it is difficult to find anything in it through a Google search. A virtual machine is provided as well.
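To illustrate why a mountable file system simplifies ingestion, here is a minimal Python sketch. It assumes the cluster volume is already mounted on the local machine; the mount point `/mapr/my.cluster.com` and the function name `ingest_file` are hypothetical and only serve the example. Nothing in it is MapR-specific: it uses ordinary file operations, which is exactly the point.

```python
import shutil
from pathlib import Path


def ingest_file(local_file: str, mount_root: str, dest_dir: str) -> Path:
    """Copy a local file into a directory under a mounted cluster volume.

    mount_root is wherever the volume is mounted (an assumption of this
    sketch, for example /mapr/my.cluster.com); dest_dir is relative to it.
    """
    root = Path(mount_root)
    if not root.is_dir():
        # Fail early instead of silently writing to the local disk
        raise FileNotFoundError(f"volume is not mounted at {root}")

    target_dir = root / dest_dir
    target_dir.mkdir(parents=True, exist_ok=True)

    # A plain file copy: no Hadoop client or ingestion tool is involved
    return Path(shutil.copy2(local_file, target_dir))
```

With the volume mounted, a call such as `ingest_file("/var/log/app/events.log", "/mapr/my.cluster.com", "raw/logs")` would place the file where Hadoop jobs can read it. With plain HDFS the same step would normally go through the Hadoop client or a dedicated ingestion tool.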

The Intel Hadoop distribution is perhaps the youngest. Intel claims:

  • the fastest Hadoop optimized for Intel hardware
  • SQL-92 support in Hive (Intel is currently working on that)
  • the first Hadoop with built-in HDFS encryption/decryption 

Unfortunately, not much public information is available.

Pivotal HD was created under the influence of Greenplum, well known for its MPP database. Apart from close integration with Greenplum and high-performance SQL, this distribution is special because of its close collaboration with the Spring Framework team (VMware). Spring is one of the most popular Java development frameworks, so good support for Pivotal is expected in Spring and vice versa. For example, Spring Batch is used instead of Oozie for job scheduling. YARN is included, and a VM for evaluation can be downloaded.

IBM BigInsights. Created by IBM, this distribution has very little public information available, but you can download it and evaluate it. From what I found, a lot of enterprise features are included (LDAP and so on), as well as IBM's own product in addition to Hive: Jaql, a high-level query language based on JavaScript Object Notation (JSON), which also supports SQL.



Basic evolution diagram: 

