Monday, November 20, 2017

Terraform

Terraform by HashiCorp enables you to safely and predictably create, change, and improve infrastructure. It is an open source tool that codifies APIs into declarative configuration files that can be shared amongst team members, treated as code, edited, reviewed, and versioned.

Terraform is a simple and reliable way to manage infrastructure in AWS, Google Cloud, Azure, DigitalOcean and other IaaS platforms (providers, in Terraform terms). The main idea of such tools is reproducible infrastructure: Terraform provides a DSL to describe the infrastructure and then apply that description to different environments. Previously we used a set of Python and bash scripts that described what to create in AWS, with conditions that checked whether a resource already existed and created it if it didn't. Terraform does essentially the same under the hood. This post is an introduction covering the simple use case of creating the Alluxio cluster I used in the previous post.

Terraform supports a number of different providers, but the Terraform scripts have to be written separately for each provider.
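For illustration, here is a minimal sketch of the usual workflow (assuming the cluster is described in a directory of .tf files; the directory name below is a placeholder):

cd alluxio-cluster    # directory with *.tf files describing the infrastructure
terraform init        # download the required provider plugins (e.g. AWS)
terraform plan        # preview the resources Terraform is going to create or change
terraform apply       # create or update the infrastructure to match the description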


Applying Alluxio to warm up your data

Alluxio, formerly Tachyon, enables any application to interact with any data from any storage system at memory speed. 
states https://www.alluxio.org/. In this article I'd like to describe the general idea of using Alluxio and how it helped me. Alluxio is not yet known to everyone, but it has a lot of features to offer and can be a game changer for your project. Alluxio already powers data processing at Barclays, Alibaba, Baidu, ZTE, Intel, etc. The current license is Apache 2.0 and the source code can be reviewed at https://github.com/Alluxio/alluxio.

Alluxio provides a virtual filesystem that creates a layer between your application (e.g. a computational framework) and the real storage such as HDFS, S3, Google Cloud Storage, Azure Blob Storage and so on. Alluxio has several interfaces: a Hadoop-compatible FS, a native key-value interface and an NFS interface. From the component point of view, Alluxio has a single Master (plus a Secondary Master, which is similar to the SNN in Hadoop, i.e. it doesn't serve client requests), multiple Slaves (workers) and, obviously, a Client.
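For example, the Hadoop-compatible interface means standard Hadoop tooling can address Alluxio directly. A sketch (it assumes the Alluxio client jar is on the Hadoop classpath, fs.alluxio.impl is set to alluxio.hadoop.FileSystem, and the master hostname is a placeholder):

# list the Alluxio namespace with the usual Hadoop filesystem commands
hadoop fs -ls alluxio://alluxio-master:19998/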


My use case was inspired by tiered storage in HDFS, where you can configure HDFS to keep specific paths on Hot storage (say, in memory), Warm storage (~ SSD) or Cold storage (~ HDD). However, cloud usage is growing every day, hardware Hadoop clusters are seen less and less often, and the issue with clouds (at the same time, a benefit) is that storage is isolated from computation, which makes storage tiers hard or impossible to implement. And that is a very good use case for Alluxio: deploy an Alluxio cluster to play the role of Hot storage that holds only the most frequently used data.


When saving data on S3, we'd like to partition it by year, month and day to speed up access when querying a known time range. However, data is rarely accessed according to a uniform distribution; much more often there are very specific patterns like:


  • actively access the last 3 months
  • actively access the last month and the same month of the previous year
Such data is a natural candidate to be put into Alluxio to speed up access to it, while the rest of the data stays available directly from S3, as sketched below.
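For instance, such a split could be expressed with the Alluxio CLI roughly like this (the year=/month= partition layout and the mount points below are hypothetical, just to illustrate the idea):

# cache only the "hot" partitions in Alluxio; older data keeps being read directly from S3
./alluxio fs mount -readonly alluxio://localhost:19998/mnt/s3-2017-10 s3a://alluxio-poc/data/year=2017/month=10
./alluxio fs mount -readonly alluxio://localhost:19998/mnt/s3-2017-11 s3a://alluxio-poc/data/year=2017/month=11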

Let's walk through a practical example of working with data stored on S3 using Apache Spark on EMR.


I used Terraform to create the Alluxio cluster, with 3 r4.xlarge slaves and one m4.xlarge master. We will also need computational power to run the Spark job, so let's create an AWS EMR cluster:


aws emr create-cluster --name 'Alluxio_EMR_test' \
--instance-type m4.2xlarge \
--instance-count 3 \
--ec2-attributes SubnetId=subnet-131cda0a,KeyName=my-key-name,InstanceProfile=EMR_EC2_DefaultRole \
--service-role EMR_DefaultRole \
--applications Name=Hadoop Name=Spark \
--region us-west-2 \
--log-uri s3://alluxio-poc/emrlogs \
--enable-debugging \
--release-label emr-5.7.0 \
--emrfs Consistent=true


After that, Alluxio is ready to be started and our data is ready to be pulled in:


[ec2-user@ip-172-16-175-35 ~]$ docker ps

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1c876a0ffe4d alluxio "/entrypoint.sh wo..." 9 minutes ago Up 9 minutes cranky_brown
[ec2-user@ip-172-16-175-35 ~]$ docker exec -it 1c876a0ffe4d /bin/sh
/ # cd /opt/alluxio/bin
/opt/alluxio/bin # ./alluxio runTests
/opt/alluxio/bin # ./alluxio fs mkdir /mnt
Successfully created directory /mnt
# the following command mounts an S3 folder inside Alluxio (data is cached on access)
/opt/alluxio/bin # ./alluxio fs mount -readonly alluxio://localhost:19998/mnt/s3 s3a://alluxio-poc/data
Mounted s3a://alluxio-poc/data at alluxio://localhost:19998/mnt/s3
/opt/alluxio/bin #
/opt/alluxio/bin # ./alluxio fs ls /mnt/s3

-rwx------ pc-nord-account66 pc-nord-account66 410916576 09-22-2017 18:03:10:815 Not In Memory /mnt/s3/part-00084-2e9dafb0-2d7a-428e-b517-b6eb4d70f781.snappy.parquet


Then, go back to the EMR master and start the Spark shell:

spark-shell --jars ~/alluxio-1.5.0/client/spark/alluxio-1.5.0-spark-client.jar


The following commands register the Alluxio filesystem in the Spark context:

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")

val x = spark.read.parquet("alluxio://172.16.175.46:19998/mnt/s3")

// let's see how fast it's gonna be

x.select($"itemid", $"itemdescription", $"GlobalTransactionID", $"amount").orderBy(desc("amount")).show(20) // 4 sec

x.count() // 3 sec

// now let's compare with s3 dataset

val p = spark.read.parquet("s3a://alluxio-poc/data")
p.select($"itemid", $"itemdescription", $"GlobalTransactionID", $"amount").orderBy(desc("amount")).show(20) // 19 sec

p.count() // 19 sec

To sum up, Alluxio provides a great way to speed up data processing in a warehouse when you only need fast access to a limited part of the dataset. A typical example: hot data that is accessed and processed 10x more often than the rest, yet makes up only 10% of the whole dataset, is an ideal candidate to be cached with Alluxio.

Einführung in Alluxio (in English)


Thursday, August 31, 2017

Druid: fixed lambda



Druid is an excellent high-performance, column-oriented, distributed data store. It is used by IT giants to get sub-second answers from TB (or even PB) scale datasets. Needless to say, I fell in love with it from day one.

Several examples:
  • Netflix ingests up to 2 TB per hour with the ability to query data as it is being ingested
  • eBay ingests over 100,000 events/sec and supports over 100 concurrent queries without impacting ingest rate or query latency

The main features of Druid that help it stand out from the crowd:

  • Sub-second queries
  • Scalable to PBs
  • Real-time streams
  • Deploy anywhere (can work with or without Hadoop, e.g. by processing data from S3)
I'm excited that I had an opportunity to work with Druid a year ago. It's really cool, works super fast and delivers excellent results! The JSON-based query language wasn't hard to learn; I even managed to calculate an average using a post-aggregation :) Previous MapReduce experience really helped.

One remark I'd like to add here:
we developed and tested the Druid-based system in the us-east-1 region, everything was fine and deployment was automated, so we moved to prod which, surprisingly, had been chosen to be in the Frankfurt AWS region. We got a pretty nasty error in Druid when the deployment script finished its work there:
 Caused by: io.druid.segment.loading.SegmentLoadingException: S3 fail!

It looks like the problem was additional configuration required for non-US-east regions. Unfortunately there isn't documentation for it, so I derived the fix from the source code; it works now:

On each historical node, add the file /opt/druid/config/_common/jets3t.properties with the following content:
storage-service.request-signature-version=AWS4-HMAC-SHA256
s3service.s3-endpoint=s3.eu-central-1.amazonaws.com

The 1st line forces v4 request signing (AWS Signature Version 4).

The 2nd line sets the S3 endpoint; the default is the us-east-1 endpoint, but for Frankfurt it must be s3.eu-central-1.amazonaws.com
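To roll the file out, something along these lines can be used (a sketch; the hostnames are placeholders for your historical nodes):

# copy jets3t.properties to every historical node
for host in historical-1 historical-2 historical-3; do
  scp jets3t.properties "$host":/opt/druid/config/_common/jets3t.properties
done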

Anyway, Metamarkets team, thank you for a great product! Now I'm going to test Caravel from Airbnb.