Tuesday, December 10, 2013

R connection to Hive

A short instruction on how to query Hive from R via JDBC.



First of all, install rJava: sudo apt-get install r-cran-rjava
After that, install the RJDBC package with all dependencies: install.packages("RJDBC", dep=TRUE)
In the next step, the Hadoop libraries for Hive connections must be added to the classpath. The easiest way to do it: copy all jars matching /usr/lib/hive/lib/*.jar and /usr/lib/hadoop/*.jar to a directory on the target machine (where the RJDBC client is located).
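
For example (assuming the jars live under the standard CDH paths mentioned above; /opt/jars/hive is an arbitrary target directory, the same one used in the R snippet below):

# collect Hive and Hadoop client jars on the machine where the RJDBC client runs
mkdir -p /opt/jars/hive
cp /usr/lib/hive/lib/*.jar /opt/jars/hive/
cp /usr/lib/hadoop/*.jar /opt/jars/hive/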

Also, HiveServer must be started; for the Cloudera distribution use
hive --service hiveserver2

instead of
sudo service hive-server2 start

(as mentioned at http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.1/CDH4-Installation-Guide/cdh4ig_topic_18_8.html)
Now it is time to check whether HiveServer is running properly; follow these command-line steps:
/usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver

Connecting to jdbc:hive2://127.0.0.1:10000/default
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-cdh4.3.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Finally, we can write R code to connect to Hive and fetch some information:
library(RJDBC)
# this is a regular JDBC connection
# jdbc:hive://192.168.0.104:10000/default
drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
            classPath = list.files("/opt/jars/hive",pattern="jar$",full.names=T),
            identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://192.168.0.104:10000/default", "admin", "admin")

r <- dbGetQuery(conn, "select col_1, sum(col_2) from tab2 where id>? group by col_1", "10")
And the result looks like this:
col_1          _c1
1 false 243808846,65
2  true       486,65

Tuesday, November 12, 2013

Create Impala DataMart based on Hive backend

Hive queries are slow; fortunately, on Cloudera it is possible to create a fast-access Impala DataMart.

An Impala table can be populated with data from a Hive table.
The following table is available in the `default` database:


table tab2 (
  id int,
  col_1 boolean,
  col_2 double)

The following queries create Impala and Hive tables with the same content (so the difference in access speed between the two datasets can be compared):

Impala:

create table tab5 (
  col1 boolean,
  col2 double)
STORED AS PARQUETFILE;

insert overwrite tab5
select
 col_1,
 sum(col_2)
from tab2
group by col_1;

Hive:

create table tab5h (
  col1 boolean,
  col2 double)
STORED AS sequencefile;

insert overwrite table tab5h
select
 col_1,
 sum(col_2)
from tab2
group by col_1;

Thursday, October 31, 2013

Basic visualisation in R


# plot two histograms together in R
p1 <- hist(rnorm(500,4))                     # centered at 4
p2 <- hist(rnorm(500,4))                     # centered at 4
plot( p1, col=rgb(0,0,1,1/4), xlim=c(0,10))  # first histogram
plot( p2, col=rgb(1,1,0,1/2), xlim=c(0,10), add=T)  # second


Thursday, October 3, 2013

Six Hadoop distributions to consider


Hortonworks is the most recent player; it was basically spun off from Yahoo!, which preferred that to maintaining its own Hadoop infrastructure in house. Everything they do is always open source and very close to the Apache Hadoop project in its evolution. They have already moved to YARN and provide an easy-to-start virtual machine. A lot of comprehensive tutorials are available on their website, as well as regular webinars.

Cloudera is perhaps the oldest and best-known provider, the one who turned Hadoop into a commercially viable product, and it is still the market leader. The Cloudera product is based on Apache Hadoop with a lot of its own patches and enhancements, which are released as open source. Also, some completely new and unique products were created by Cloudera, such as Cloudera Search (Solr on Hadoop). Moreover, there are some proprietary enterprise components available for additional money. They have already moved to YARN and also provide a virtual machine for evaluation.

Tuesday, October 1, 2013

Why is R so important on Hadoop?

The short history of how R met Hadoop:
  1. Hadoop + R via streaming
  2. RHadoop
  3. this one


Why is R so important on Hadoop?

The short answer is: the reasons are the same as for Oracle R Enterprise

The long answer is:
BI tools provide fairly limited statistical functionality. It is therefore standard practice in many organizations to extract portions of a database into desktop software packages: statistical packages like SAS, Matlab or R, spreadsheets like Excel, or custom code written in languages like Java.
There are various problems with this approach. First, copying out a large database extract is often much less efficient than pushing computation to the data; it is easy to get orders of magnitude performance gains by running code in the database. Second, most stat packages require their data to fit in RAM. For large datasets, this means sampling the database to form an extract, which loses detail.
(from "MAD Skills: New Analysis Practices for Big Data", http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf)


Of course, you can develop in Java on Hadoop. But are your statisticians/data scientists/analysts familiar with Java? Perhaps not; they are familiar with R. RHadoop provides an excellent way to develop a working model locally and then deploy that model on a Hadoop cluster. For sure, MapReduce implemented in R can't be the best approach from a performance point of view. But think how many people you need to create a high-performance solution in Java: one data scientist, one software engineer, and more time to test and debug their solution. On the other hand, one data scientist alone can create and deploy a solution in less time. This "people performance" becomes more and more important, especially when the business requires a lot of one-off pieces of research.

Monday, September 30, 2013

RHadoop

The short history of how R met Hadoop:

  1.  Hadoop + R via streaming
  2.  This one 

So, R through Streaming on Hadoop was discussed in a previous article. It is obvious that streaming is not the best approach on Hadoop, because additional issues appear. RHadoop was created to make development with R on Hadoop easier.

RHadoop is a set of R packages that aim to facilitate writing MapReduce code in R on Hadoop. It still uses streaming under the hood, but brings the following advantages (see the sketch after the list):
  • no need to manage key changes in the reducer
  • no need to control the functions' output manually
  • a simple MapReduce API for R
  • enables access to files on HDFS
  • R code can be run locally or on Hadoop without changes
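
To illustrate the last point, here is a minimal rmr2 sketch (assuming the rmr2 package is installed); with the backend set to "local" it runs entirely on the local machine, and switching the backend to "hadoop" runs the same code on the cluster:

library(rmr2)

rmr.options(backend = "local")          # switch to "hadoop" to run on the cluster

small.ints <- to.dfs(1:10)              # write a small vector to the (local or H)DFS
squares <- mapreduce(input = small.ints,
                     map = function(k, v) keyval(v, v^2))
from.dfs(squares)                       # fetch the key/value pairs back into R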

Saturday, September 28, 2013

Hadoop + R via streaming

The short history of how R met Hadoop



R is a language for stats, math and data science, created by statisticians for statisticians. It contains 5000+ implemented algorithms and an impressive 2M+ users with domain knowledge worldwide. However, it has one big disadvantage: all data is placed in memory ... in one thread.

And there is Hadoop. A new, powerful framework for distributed data processing. Hadoop is built upon the idea of the MapReduce algorithm; this isn't anything very specific, a lot of languages have MR capabilities, but Hadoop brought it to a new level. The main idea of MR is:

  1. Map step: Map(k1,v1) → list(k2,v2) 
  2. Magic here 
  3. Reduce step: Reduce(k2, list (v2)) → list(v3)


Hadoop was developed in Java, and Java is the main programming language for Hadoop. Although Java is the main language, you can still use any other language to write MR jobs: for example, Python, R or OCaml. This is called the "Streaming API".
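
For example, a minimal sketch of an R word-count mapper for the Streaming API (the streaming jar path, input/output locations and the reducer are just examples and depend on your distribution):

#!/usr/bin/env Rscript
# mapper.R: reads lines from stdin and emits "word<TAB>1" for every word
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (word in unlist(strsplit(line, "[[:space:]]+"))) {
    if (nchar(word) > 0) cat(word, "\t", 1, "\n", sep = "")
  }
}
close(con)

# launched with something like (the jar path depends on the distribution;
# a matching reducer.R would sum the counts per key):
# hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
#   -input /data/in -output /data/out \
#   -mapper mapper.R -reducer reducer.R \
#   -file mapper.R -file reducer.R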


Thursday, September 5, 2013

Useful unix commands


1. Let's assume you have a very long CSV file and you wish to find all non-empty values in column #25:
tail -n +1 data.csv | cut -d ',' -f 25 | grep -v "^[[:space:]]*$"

2. In the same file you need to find all rows with the value 'CA' in column 25:
cat -n data.csv | cut -d ',' -f 25,1 | grep -e "CA.*"

In the output you will get the column value as well as the row number.
3. From a huge file you wish to extract row #250007 into a separate file:
sed -n '250007p' data.csv | cat > line_250007.csv

4. To get the number of tab characters in the line:
tr -cd '\t' < line_250007.csv | wc -c

5. Ensure a file copy has completed before processing starts, i.e. that the file is not being used by another process right now:
if [ $(lsof $file | wc -l) -gt 1 ] ; then
    echo "file $file still loading, skipping it"
else
    echo "file $file completed upload, process it"
fi

6. Color script output for better visualisation
NORMAL=$(tput sgr0)
GREEN=$(tput setaf 2; tput bold)
RED=$(tput setaf 1)

function red() {
    echo -e "$RED$*$NORMAL"
}

function green() {
    echo -e "$GREEN$*$NORMAL"
}

red "This is error message"

7. Iterate over all files in a directory matching a regexp:
for FILE in $(ls $DIR | grep '^report_.*csv$')
do
...
done

8. Check the previous command's exit status:
mkdir -p $WRK_DIR
if [ $? -ne 0 ]; then
  #do something
fi

9. Add 12 hours to the current time and format the result:
date -d "+12 hours" +\%Y\%m\%d\%H

Tuesday, September 3, 2013

Several Apache Pig tips and tricks

1. Counters in Pig


PigStatusReporter reporter = PigStatusReporter.getInstance();
if (reporter != null) {
   reporter.getCounter(key).increment(incr);
}


2. Pick up the latest version of a JAR
An incredibly simple way to always use the latest version of a jar file without code changes:

%default elephantBirdJar `hadoop fs -ls /tmp/libs/elephant-bird-core*jar | awk '{print $8;}' | sort -n | tail -1`
register 'hdfs://$elephantBirdJar'


3. Call Java code from a Pig script without a UDF
It's a pity that you have to write a UDF every time a Java call is required, even for one line of code. There is a way to call existing Java methods without writing a UDF, and it is called dynamic invokers. For example, the java.net.URLDecoder#decode method is called in the following example:

DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'data.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');


4. Set a timeout for a long-running UDF
Sometimes a UDF can take much more time than expected. There is a way for Pig to stop a long-running UDF automatically, using the @MonitoredUDF annotation (more information is available here):

/* The timeout for the UDF is 10 seconds; if there is no result by then, the default will be returned.
   Pay close attention: only a few return types are supported, e.g. tuples and bags are not. */

@MonitoredUDF(timeUnit = TimeUnit.MILLISECONDS, duration = 10000, intDefault = 10)
 public class MyUDF extends EvalFunc<Integer> {
   /* implementation goes here */
 }

Thursday, July 25, 2013

Market basket analysis with R

Affinity analysis is a data analysis and data mining technique that discovers co-occurrence relationships <...>. In retail, affinity analysis is used to perform market basket analysis, in which retailers seek to understand the purchase behavior of customers. This information can then be used for purposes of cross-selling and up-selling, in addition to influencing sales promotions, loyalty programs, store design, and discount plans. [from Wikipedia]
In other words, you want to find all items from your sales that are sold together; for example, people usually buy chips with beer. There are several algorithms for this, and one of them is the Apriori algorithm, which is available in R in the 'arules' package.
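
A minimal sketch using the Groceries sample dataset that ships with arules (the support and confidence thresholds are arbitrary examples):

library(arules)

data("Groceries")                                # sample transaction data bundled with arules
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.8))
inspect(sort(rules, by = "lift")[1:5])           # top 5 rules by lift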

Tuesday, July 23, 2013

Build and run Flume agent to work w/ Twitter Streaming API



Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications (source)

The data flow model is perfectly described in the official documentation and contains 3 components:
the source gets data from an external system and delivers it into Flume; the channel transports data from the source to the sink (think of the channel as a queue; this queue also allows the source and sink to run asynchronously); and the sink is the destination. Together they are called an agent, and agents can be grouped to build complex, fail-over flows.
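
Just to illustrate the wiring (this is not the Twitter agent from the title), a minimal sketch of a single-agent configuration that tails a log file into HDFS; all names and paths are examples:

# example agent: exec source -> memory channel -> HDFS sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/access.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1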

Friday, June 7, 2013

Hadoop MySQL Applier

What is the Hadoop MySQL Applier? And how does it differ from Sqoop?

In fact, the Applier provides real-time data migration from MySQL into HDFS. The idea is pretty simple:
  • the Applier reads the MySQL binary logs and replicates insert events to HDFS
  • only insert operations are handled and replicated; there are no updates, deletes or DDL

Saturday, June 1, 2013

JMX batch updates

Some days ago I decided that it would be a good idea to monitor and save (or even set) some JMX metrics in a batch style. In other words, what if I have a server farm and wish to get some JMX attribute(s) at a given moment from all servers? Or even set a new value? I'd like to have it as a command-line tool, because I'd like to run it from the console (even via cron). That's why I started my new repo JMXSample on GitHub. The idea is pretty simple and the whole implementation is located in one file, which can be compiled with javac. There are other files which are an example of a JMX-managed application; you can ignore them.
To run this utility, you have to create a configuration file (see the example), which contains JMX commands line by line (currently only two commands are supported: get an MBean attribute value and set it; also, only primitive Java types, String and Date are supported, check the source code for details). So, a typical configuration line looks like this:
ObjectName attributeName [new_value_if_set] host port
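
For example, a hypothetical conf.txt for the ElectroCar sample application (described in an earlier post) could look like this; the third line also sets a new MaxSpeed value, the others only read attributes:

FOO:name=jmxsample.ElectroCar MaxSpeed localhost 1617
FOO:name=jmxsample.ElectroCar CurrentSpeed localhost 1617
FOO:name=jmxsample.ElectroCar MaxSpeed 250 localhost 1617
FOO:name=jmxsample.ElectroCar CurrentSpeed localhost 1617
FOO:name=jmxsample.ElectroCar CurrentSpeed localhost 1617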

Let's look at how we can use it. I assume you reuse the sample configuration file and run the ElectroCar sample application (I wrote about it earlier) with the following parameters:

-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=1617
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false

After that, feel free to run the batch JMX utility (after compilation, of course) as follows:
java jmxsample.JMXConsole /absolute_path/JMX/example/conf.txt


The expected output will be the following (I tagged the moment when the maximum speed was updated):

[localhost:1617] Attribute MaxSpeed has value 150
[localhost:1617] Attribute CurrentSpeed has value 55
[localhost:1617] Attribute MaxSpeed has value 250
[localhost:1617] Attribute CurrentSpeed has value 90
[localhost:1617] Attribute CurrentSpeed has value 235

Friday, May 31, 2013

Incredible simple JMX example

With JMX you can implement management interfaces for Java applications. Let's look at the simplest and most common example. Imagine you have created some ElectroCar managed by a Java application. You wish to get/monitor some parameters and, moreover, set some of them.

OK, you need to create an MBean (a managed bean that will be controlled via JMX) that gives you read access to 2 car parameters and lets you change the max speed limit at runtime. To describe it, you need the following interface:




public interface ElectroCarMBean {
    public void setMaxSpeed(int maxSpeed);
    public int getMaxSpeed();
    public int getCurrentSpeed();
}

The implementation is very short and simple:
public class ElectroCar implements ElectroCarMBean {

    public int maxSpeed = 150;
    private Random rnd = new Random();

    @Override
    public void setMaxSpeed(int maxSpeed) {
        this.maxSpeed = maxSpeed;
    }

    @Override
    public int getMaxSpeed() {
        return this.maxSpeed;
    }

    @Override
    public int getCurrentSpeed() {
        return rnd.nextInt(this.maxSpeed);
    }
}

And after that you need to register your MBean (yes, manually):
// Get the platform MBeanServer (javax.management.MBeanServer)
MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();

// managed bean instance
ElectroCar car = new ElectroCar();
ObjectName carBean = null;

// Uniquely identify the MBeans and register them with the platform MBeanServer
carBean = new ObjectName("FOO:name=jmxsample.ElectroCar");
mbs.registerMBean(car, carBean);

After that, don't forget to add the following parameters when you run your application:
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=1617 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false

This means: run the application with JMX enabled on port 1617, but without authentication (everyone can connect). So, your application is ready now! Run jconsole and check the result:

You can see that each attribute has a set of properties, and the most important is the read/write property. If the write property is 'true', you are able to set a new value from jconsole. Otherwise you can only read the value. In this particular example, MaxSpeed can be changed at runtime and it affects the max speed of the car. However, CurrentSpeed is read-only, and you can only monitor it (click 'Refresh' to update the value).

As I said, the MBean must be registered manually. So, it can be an issue if you are using some container for bean creation (like Spring). For example, Spring provides a special bean, MBeanExporter, to let you expose your classes as MBeans; read more here.

Friday, May 17, 2013

Fix PigUnit issue on Windows

PigUnit is a nice and extremely easy way to test your Pig scripts. Read more here.
However, it doesn't run on Windows at all. When you run your first PigUnit test, you will get the following exception:


Exception in thread "main" java.io.IOException: Cannot run program "chmod": CreateProcess error=2, The system cannot find the file specified :

In fact, it means Cygwin is not correctly installed. To fix it, you have to download and install Cygwin, then edit the PATH variable and add the path to the Cygwin directory.

Try running it again. The next possible error is:

ERROR mapReduceLayer.Launcher: Backend error message during job submission java.io.IOException: Failed to set permissions of path: \tmp\hadoop-MyUsername\mapred\staging\MyUsername1049214732.staging to 0700

This means your temporary directory is not set correctly (or is not set at all). To be honest, I tried to set this temporary directory with the following code:

        pigServer.getPigContext().getProperties().setProperty("pig.temp.dir", "D:/TMP");
        pigServer.getPigContext().getProperties().setProperty("hadoop.tmp.dir", "D:/TMP");

Unfortunately, it doesn't work... The solution is to set a system property. There are a lot of ways to do it, and one of them is to tune the Java run configuration when you run your test; just add:

 -Djava.io.tmpdir=D:\TMP\
OK, that's much better, but it's not the finish yet; there is another error:
java.io.IOException: Failed to set permissions of path: file:/tmp/hadoop-iwonabb/mapred/staging/iwonabb-1931875024/.staging to 0700 
at org.apache.hadoop.fs.RawLocalFileSystem.checkReturnValue(RawLocalFileSystem.java:526) 

That's because of an error in the Hadoop code.

There are several solutions to fix this bug (it has been present in Hadoop for years... :( ). One of them is to use this patch, or to fix the code and recompile. But for me that was not the best way (in particular, it would be a fix only for a specific version; also it is difficult to maintain across several clusters and dev machines, and so on).
So, I've decided to change the code at runtime with Javassist.

So, the solution is very simple and self-describing.

To apply it, just call it from your test before running the script. I'd recommend doing it right after the PigTest instance is created.
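
The original snippet was embedded as an image; a minimal sketch of the Javassist trick (assuming Javassist is on the test classpath and this runs before Hadoop's filesystem classes are loaded) could look like this:

import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtMethod;

public class WindowsHadoopPatch {
    // Turn RawLocalFileSystem.checkReturnValue into a no-op so the chmod failure
    // on Windows is ignored; call this before the Hadoop class gets loaded.
    public static void apply() {
        try {
            ClassPool pool = ClassPool.getDefault();
            CtClass cc = pool.get("org.apache.hadoop.fs.RawLocalFileSystem");
            CtMethod m = cc.getDeclaredMethod("checkReturnValue");
            m.setBody("{ }");
            cc.toClass();
        } catch (Exception e) {
            throw new RuntimeException("Failed to patch RawLocalFileSystem", e);
        }
    }
}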

Tuesday, May 7, 2013

Storm, distributed and fault-tolerant realtime computation, part 2

Read first part here

To illustrate the main Storm concepts, let's walk through a small application.
The idea of the application is to scan a web-server log (well, to be honest, this log is generated by us, but it is very similar to a real-world log), push each event into a Storm topology, calculate some statistics and update those statistics in a database. A rather simplistic use case for Storm, but good enough to show all the main concepts.

Let's start with the log line; it looks something like this:
datetime,countryCode,URL,responseCode,responseTime
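
A minimal sketch of how such a topology could be wired with the storm-core API (LogSpout and all names here are hypothetical, and LocalCluster means local development mode):

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class LogTopology {

    // Bolt: counts requests per country code from lines like
    // "datetime,countryCode,URL,responseCode,responseTime"
    public static class CountryCountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<String, Long>();

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String[] parts = input.getString(0).split(",");
            String country = parts[1];
            Long current = counts.get(country);
            counts.put(country, current == null ? 1L : current + 1);
            collector.emit(new Values(country, counts.get(country)));
            // a real bolt would periodically flush these counts to the database
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("country", "count"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // LogSpout is a hypothetical spout that emits one raw log line per tuple
        builder.setSpout("log-spout", new LogSpout(), 1);
        builder.setBolt("country-count", new CountryCountBolt(), 2)
               .shuffleGrouping("log-spout");

        LocalCluster cluster = new LocalCluster();   // local mode, for development
        cluster.submitTopology("log-stats", new Config(), builder.createTopology());
    }
}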

Friday, May 3, 2013

Storm, distributed and fault-tolerant realtime computation, part 1

Storm is a real big data solution for soft real-time processing. In the Hadoop world, batch processing is the main limitation. Let me explain. Imagine you track some event with Hadoop: the event happens at 11:00 AM, it is collected, then processed half an hour later (at 11:30 AM, because Hadoop is batch processing), and this event influences the statistics (or whatever) only at 12:00. How can we make it more "real-time"?

Storm comes into play here.

Storm is built on the idea of streaming data, and this is the big difference compared with Hadoop. Not an advantage, just a difference. Hadoop is a great batch processing system. Data is introduced into the Hadoop file system (by Flume or any other way) and distributed across nodes for processing. Then processing is started; it involves a lot of shuffle operations between nodes and so on, a lot of IO operations. Afterwards, the result must be picked up from HDFS.
Storm works with unterminated streams of data. Storm jobs, unlike Hadoop jobs, never stop; instead they continue to process data as it arrives. So, you can think about Storm as a powerful integration framework (like Apache Camel or Spring Integration), if you are familiar with those.

So, when an event is produced, it is introduced into Storm and goes through a set of jobs...
Let's look at the Storm terminology:
  • Tuple is a data structure that represents standard data types (such as ints, floats, and byte arrays) or user-defined types with some additional serialization code (there is no direct analog in the Java language, but tuples are very common in Python and especially in Erlang);
  • Stream is an unterminated data flow, a sequence of tuples;
  • Spout is a special component which feeds data from an external source into Storm;
  • Bolt is a Storm job that can be linked into a chain and performs arbitrary tuple transformations;
  • Topology is the structure of spouts and bolts connected by streams.

By default, streams are implemented as ØMQ queues, but that's under the hood and you shouldn't worry about it (except at cluster installation time).
One feature that I discovered during development but hadn't noticed while reading the tutorial: a Bolt has to be serializable, so all its class members must be serializable too.

Saturday, April 27, 2013

Isolation level in RDBMS

In the famous ACID abbreviation, the I means Isolation (isolation level). According to Wikipedia, isolation is a property that defines how/when the changes made by one operation become visible to other concurrent operations.

To understand isolation levels better, let's review the main collisions that may happen in a concurrent environment:
  • lost update - a second transaction writes a second value of a data item on top of a first value written by a first concurrent transaction, and the first value is lost to other concurrently running transactions which, by their precedence, need to read the first value;
  • dirty read - a transaction reads data inserted or updated by another transaction that will later be rolled back;
  • non-repeatable read - a transaction reads data, then reads it again, and the data has been modified in between;
  • phantom reads - on the second read within a transaction, new rows appear.


So, now it is easy to describe the isolation levels with the following table:

Isolation level  | JDBC equivalent              | Dirty reads | Non-repeatable reads | Phantoms  | Lost update
None             | TRANSACTION_NONE             | may occur   | may occur            | may occur | may occur
Read Uncommitted | TRANSACTION_READ_UNCOMMITTED | may occur   | may occur            | may occur | -
Read Committed   | TRANSACTION_READ_COMMITTED   | -           | may occur            | may occur | -
Repeatable Read  | TRANSACTION_REPEATABLE_READ  | -           | -                    | may occur | -
Serializable     | TRANSACTION_SERIALIZABLE     | -           | -                    | -         | -
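
In JDBC the level is chosen per connection; a minimal sketch (the connection URL and credentials are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class IsolationExample {
    public static void main(String[] args) throws SQLException {
        // hypothetical connection settings
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/shop", "user", "secret");
        try {
            conn.setAutoCommit(false);
            // Repeatable Read: no dirty or non-repeatable reads, phantoms still possible
            conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);

            Statement st = conn.createStatement();
            st.executeQuery("select count(*) from orders");
            // ... re-running the same query inside this transaction sees the same rows
            conn.commit();
        } finally {
            conn.close();
        }
    }
}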


nginx postaction

Some time ago I was thinking about how to count file downloads from a resource. I mean, you have a set of files and you need to gather some real-time statistics on file downloads or just page visits. I know it's easy to analyze the log file. But what if that isn't enough for you and you want to trigger some other service... With nginx the solution is pretty simple.

There is post_action functionality, which makes it possible to call another resource after a resource has been served. It is like a TRIGGER in a regular RDBMS. A simple and self-describing example follows.
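
The original config is not reproduced here, but a minimal sketch of the idea could look like this (locations, port and the tracking URL are just examples):

location /files/ {
    root /var/www;
    # after the download finishes, nginx fires an internal subrequest
    post_action /afterdownload;
}

location = /afterdownload {
    internal;
    # notify a statistics service about the downloaded file
    proxy_pass http://127.0.0.1:8080/track?file=$request_uri;
}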

Tuesday, April 16, 2013

Maven assembly plugin fuck-up when building a Spring project

Yesterday I bumped into a Maven assembly plugin issue. I tried to build a one-jar project with a lot of Spring dependencies. The build was successful, however I got an error on start:
Configuration problem: Unable to locate Spring NamespaceHandler for XML schema namespace [http://www.springframework.org/schema/context]
Offending resource: class path resource [spring-context.xml]
According to the error, I tried to find an issue with the namespaces in the Spring configuration files... However, they were correct. But then I found Spring-related files in my jar, located in the META-INF directory: spring.schemas and spring.handlers, and they had completely strange content - only a few namespaces, most of them were lost. So, the problem is the broken merge done by the Maven assembly plugin. Fortunately, the Maven shade plugin is flexible enough and makes it possible to manage each step of the build.
The correct configuration (which doesn't perform the broken merge, using the shade plugin with appending transformers) is:
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>your.main.Class</mainClass>
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>META-INF/spring.handlers</resource>
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>META-INF/spring.schemas</resource>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

Thursday, April 11, 2013

NoSQL unit testing

Definitely, TDD and unit testing are amazing inventions and it is highly recommended to use them during development. It's very common to have a persistence layer in your application. Several years ago an RDBMS was the standard solution for it. No surprise, it is a good idea to test your persistence layer, both unit and integration.
As a result, there are a lot of in-memory databases for unit testing, and there is the perfect tool DBUnit for integration testing.
So, when you write an integration test you can use DBUnit, which manipulates datasets. It works in the following way:

  • before the test method: set the db content (set the state)
  • execute the test method
  • after the test method: compare the result in the db with the expected one


What about NoSQL? For example, if you are using MongoDB or HBase, how can you create unit tests without a dependency on a standalone server? How can you write integration tests with minimum effort to keep the data consistent?
I know the answer: NoSQLUnit - NoSQL Unit is a JUnit extension that helps you write NoSQL unit tests. It is available on GitHub: https://github.com/lordofthejars/nosql-unit

This perfect, amazing, breathtaking framework makes it possible to create elegant in-memory unit tests and powerful integration tests in the DBUnit way! Moreover, it supports a lot of NoSQL databases:

  • MongoDB (not only one instance, but also replica set and sharding!)
  • Neo4j
  • Cassandra
  • HBase
  • Redis
  • CouchDB
  • Infinispan engine (never heard about that before)
  • ElasticSearch

And what is most important, the documentation is perfect!

So, I want to say THANK YOU VERY MUCH, Alex Soto, you did an amazing job!

Do you know any other tools for unit/integration testing of NoSQL databases? Give me a link, please!

Tuesday, April 9, 2013

SQuirreL - a database administration tool

JPA is a popular way to communicate with a database in the Java world - JPA via Hibernate, JPA via EclipseLink, whatever... However, there is no instant, convenient way to test your JPQL queries.
SQuirreL helps you! There is an overview on the wiki.

When you have a project with JPA and you want to test several JPA queries (aka JPQL), you can use SQuirreL. Build your project first, then you need to tune SQuirreL a bit: go to SQuirreL File -> Global Preferences -> Hibernate,
then click Add classpath entry, and add your JDBC driver jar and all of the project's jars.
Then specify your persistence unit name, and you are ready to run JPQL queries from this tool against your application.
That's much easier than experimenting from Java code.

Moreover, there is a set of useful plugins which help you to compare databases, create ER diagrams or execute specific code on session start.

Also, it is possible to query HBase with https://github.com/forcedotcom/phoenix

Monday, March 18, 2013

The two most important abbreviations

In software design:
S.O.L.I.D.

Single responsibility principle - an object should have only a single responsibility.
Open/closed principle - “software entities … should be open for extension, but closed for modification”.
Liskov substitution principle - “objects in a program should be replaceable with instances of their subtypes without altering the correctness of that program”.
Interface segregation principle - many client-specific interfaces are better than one general-purpose interface.
Dependency inversion principle - depend upon abstractions, do not depend upon concretions.


In unit testing:
F.I.R.S.T.
Each test must be:
Fast
Independent
Repeatable
Self-Validating (green or red, no other options)
Timely (written according to TDD - not in a month, not when your boss requires it)

Thursday, February 7, 2013

Resolving underscore templates and Java JSTL conflict

Front-end development has made a huge step forward over the last couple of years... Could you imagine Backbone or Underscore in early 2010? I couldn't... Now it brings a lot of new possibilities, and new challenges for JSTL users.

Today you can create really cool applications using REST and HTML/JavaScript. So, you can (and you have to) separate backend and frontend development. As for me, the really cool thing is a templating engine in JavaScript. I'm talking about frameworks like underscore.js and mustache.js. Both are awesome, both are cool and worth using. In this article I'll discuss Underscore templating.

Let's imagine you are considering Underscore and reading about templates. The Underscore toolkit includes an easy-to-use templating feature that integrates easily with any JSON data source. For example, you need to repeat some fragment of HTML many times, say, when you get a list of employees in your AJAX request. With a template you have to define the place for inserting the list of users (let's say we want a table).

Your RESTful service returns JSON with a name, position and set of phone numbers; you fetch it with Backbone and create the related DOM.

However, there is one problem. In JSP, <% %> is used to mark scriptlets... So we have a conflict: JSP vs Underscore. And JSP wins! Fortunately, there is a solution - you can override the Underscore template delimiters to use <@ @> instead of <% %>:

    // underscore templating
    $(document).ready(function ()
    {            
     _.templateSettings = {
      interpolate: /\<\@\=(.+?)\@\>/gim,
      evaluate: /\<\@(.+?)\@\>/gim,
      escape: /\<\@\-(.+?)\@\>/gim
  };
     
    });



So, the HTML fragment:
<table id="employees">
</table>

We aim to insert the list of employees here with the following information: employee's name, position and phone number(s). OK, it's really easy. Our template is:
<script id="employee-template" type="text/template">
  <@= name @>
  <@= position @>
  <@ _.each(phones, function(phone) { @> <@= phone @> <@ }); @>
</script>

The template is ready; now let's initialize it:

var template_html = _.template($('#employee-template').html());
    
var h = $(template_html({name: item.get("name"), position: item.get("position"), phones: item.get("phones")}))

$('#employees').append( h );


that's all!


Wednesday, February 6, 2013

Working with MongoDB using Kundera

When we talk about JPA2, we usually imagine a relational database and one of the ORMs (Hibernate or EclipseLink). However, JPA is a general approach, not only for relational systems. It covers NoSQL as well. And one wonderful JPA implementation oriented towards non-relational databases is Kundera.

Initially Kundera was developed for Cassandra. It seems to be a good idea: if you don't use relations, you don't have the main problem of the various ORM frameworks :) HBase was the next database supported by Kundera, and now it is MongoDB's turn.



Also, support for relational schemas was announced! Awesome! To be honest, I don't believe it will work well; otherwise we would get a powerful competitor to Hibernate and co.
But Kundera is an awesome choice for creating a prototype, when you need basic CRUD operations and just want to try your application with Cassandra, HBase or MongoDB. The beautiful part is the possibility to write JPQL queries instead of native database queries.

Let's figure out how we can use Kundera with MongoDB... Mongo JPA
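
As a flavour of the approach, here is a minimal JPA sketch (the persistence unit name "mongo_pu" is hypothetical; its persistence.xml has to point at Kundera's MongoDB client and your database):

import javax.persistence.*;

// the entity is plain JPA; Kundera maps it to a MongoDB collection
@Entity
public class Book {
    @Id
    private String isbn;
    private String title;

    public Book() { }                                    // JPA requires a no-arg constructor
    public Book(String isbn, String title) { this.isbn = isbn; this.title = title; }
    public String getTitle() { return title; }
}

// usage: standard EntityManager calls and JPQL instead of native Mongo queries
EntityManagerFactory emf = Persistence.createEntityManagerFactory("mongo_pu"); // hypothetical PU name
EntityManager em = emf.createEntityManager();

em.getTransaction().begin();
em.persist(new Book("1", "My first book"));
em.getTransaction().commit();

Book found = em.find(Book.class, "1");

em.createQuery("select b from Book b where b.title like :t", Book.class)
  .setParameter("t", "%first%")
  .getResultList();

em.close();
emf.close();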


Wednesday, January 23, 2013

who has stolen my CPU?

A week ago I bumped into an issue with AWS CloudWatch. Originally my load was about 50%, and I had set CloudWatch to notify me when the average CPU utilization was > 90% for 5 minutes. And a week ago I got several alerts! I was really surprised when I realized that the load had been above 90% for a long time.
Nothing had changed in the application configuration or the real load... so I started an investigation. The AWS status page showed all services operating normally.

But the biggest surprise: according to "top", the load was less than 10%!
Now I see this load on a regular basis, for example:
