Monday, November 20, 2017

Applying Alluxio to warm up your data

"Alluxio, formerly Tachyon, enables any application to interact with any data from any storage system at memory speed," states https://www.alluxio.org/. In this article I'd like to describe the general idea behind Alluxio and how it helped me. Alluxio is not yet widely known, but it has a lot of features to offer and can be a game changer for your project. It already powers data processing at Barclays, Alibaba, Baidu, ZTE, Intel, and others. It is licensed under Apache 2.0 and the source code can be reviewed at https://github.com/Alluxio/alluxio .

Alluxio provides a virtual filesystem that creates a layer between your application (i.e. a computational framework) and the real storage, such as HDFS, S3, Google Cloud Storage, Azure Blob Storage, and so on. Alluxio exposes several interfaces: a Hadoop-compatible FS, a native key-value interface, and an NFS interface. From a component point of view, Alluxio has a single Master (plus a Secondary Master, which is similar to the Secondary NameNode in Hadoop, i.e. it doesn't serve client requests), multiple Workers, and, obviously, a Client.


My use case was inspired by tiered storage in HDFS: you can configure HDFS to keep specific paths on Hot storage (say, in memory), Warm storage (~ SSD), or Cold storage (~ HDD). However, cloud usage is growing every day and hardware Hadoop clusters are becoming rare, and the issue with the cloud (at the same time, a benefit) is that storage is isolated from computation, which makes storage tiers hard or impossible to implement. And that is a very good use case for Alluxio: deploy an Alluxio cluster to play the role of Hot storage, holding only the most frequently accessed data.
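For comparison, this is roughly how the HDFS tiering mentioned above is configured on a hardware cluster, using the standard HDFS storage policy CLI (the /data/hot and /data/cold paths are just example paths, not from this setup):

```shell
# Assign HDFS storage policies per path (requires DataNodes configured
# with the corresponding storage types: SSD, DISK, ARCHIVE, RAM_DISK)
hdfs storagepolicies -setStoragePolicy -path /data/hot  -policy ALL_SSD
hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD
# Verify which policy a path currently has
hdfs storagepolicies -getStoragePolicy -path /data/hot
```

In the cloud there is no equivalent of this per-path control over where S3 keeps your bytes, which is exactly the gap Alluxio fills.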


While saving data to S3, we'd like to partition it by year, month, and day to speed up queries over a known time range. However, access rarely follows a uniform distribution; much more often there are very specific patterns, such as:


  • actively access last 3 months
  • actively access last month and the same month of last year
The frequently accessed data is a natural candidate to be put into Alluxio to speed up access, while the rest of the data remains available directly from S3.
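For the "last 3 months" pattern, the hot partitions can be computed and mounted one by one. A minimal sketch, assuming a hypothetical year=YYYY/month=MM layout under s3a://alluxio-poc/data (the echo makes it a dry run; remove it to actually execute the mounts):

```shell
# Print the Alluxio mount commands for the last 3 monthly partitions
# (hypothetical partition layout; uses GNU date arithmetic)
for i in 0 1 2; do
  part=$(date -d "-$i month" +"year=%Y/month=%m")
  echo ./alluxio fs mount -readonly "/mnt/hot/$part" "s3a://alluxio-poc/data/$part"
done
```

A cron job could re-run this monthly, unmounting the partition that fell out of the window.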

Let's look at a practical example of working with data stored on S3, using Apache Spark on EMR.


I used Terraform to create the Alluxio cluster: three r4.xlarge workers and one m4.xlarge master. We also need computational power to run the Spark job, so let's create an AWS EMR cluster:


aws emr create-cluster --name 'Alluxio_EMR_test' \
--instance-type m4.2xlarge \
--instance-count 3 \
--ec2-attributes SubnetId=subnet-131cda0a,KeyName=my-key-name,InstanceProfile=EMR_EC2_DefaultRole \
--service-role EMR_DefaultRole \
--applications Name=Hadoop Name=Spark \
--region us-west-2 \
--log-uri s3://alluxio-poc/emrlogs \
--enable-debugging \
--release-label emr-5.7.0 \
--emrfs Consistent=true
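Cluster creation takes a few minutes; you can block until it is up and grab the master's public DNS name with the AWS CLI (the cluster id below is a placeholder for the ClusterId that create-cluster prints):

```shell
CLUSTER_ID=j-XXXXXXXXXXXXX  # placeholder: use the ClusterId returned by create-cluster
# Wait until the cluster reaches the RUNNING state
aws emr wait cluster-running --cluster-id "$CLUSTER_ID" --region us-west-2
# Print the master node's public DNS name to ssh into
aws emr describe-cluster --cluster-id "$CLUSTER_ID" --region us-west-2 \
  --query 'Cluster.MasterPublicDnsName' --output text
```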


After that, Alluxio is ready to be started and our data is ready to be pulled in:


[ec2-user@ip-172-16-175-35 ~]$ docker ps
CONTAINER ID  IMAGE    COMMAND                 CREATED        STATUS        PORTS  NAMES
1c876a0ffe4d  alluxio  "/entrypoint.sh wo..."  9 minutes ago  Up 9 minutes         cranky_brown
[ec2-user@ip-172-16-175-35 ~]$ docker exec -it 1c876a0ffe4d /bin/sh
/ # cd /opt/alluxio/bin
/opt/alluxio/bin # ./alluxio runTests
/opt/alluxio/bin # ./alluxio fs mkdir /mnt
Successfully created directory /mnt
# the following command mounts an S3 folder inside Alluxio
/opt/alluxio/bin # ./alluxio fs mount -readonly alluxio://localhost:19998/mnt/s3 s3a://alluxio-poc/data
Mounted s3a://alluxio-poc/data at alluxio://localhost:19998/mnt/s3
/opt/alluxio/bin #
/opt/alluxio/bin # ./alluxio fs ls /mnt/s3
-rwx------ pc-nord-account66 pc-nord-account66 410916576 09-22-2017 18:03:10:815 Not In Memory /mnt/s3/part-00084-2e9dafb0-2d7a-428e-b517-b6eb4d70f781.snappy.parquet
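Note that the ls output reports the file as "Not In Memory": mounting only maps the S3 path into the Alluxio namespace, it doesn't fetch anything yet. To warm the data up front rather than on first read, it can be loaded explicitly from the same container:

```shell
# Fetch the mounted S3 data into Alluxio storage; afterwards
# `./alluxio fs ls /mnt/s3` should report the files as "In Memory"
./alluxio fs load /mnt/s3
```

Skipping this step is also fine: the first Spark job that reads the path will populate Alluxio as a side effect, and only subsequent reads get the full speed-up.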


Then go back to the EMR master and start the Spark shell:

spark-shell --jars ~/alluxio-1.5.0/client/spark/alluxio-1.5.0-spark-client.jar


The following commands configure the Spark context to register the Alluxio file system:

import org.apache.spark.sql.functions.desc

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")

val x = spark.read.parquet("alluxio://172.16.175.46:19998/mnt/s3")

// let's see how fast it's gonna be

x.select($"itemid", $"itemdescription", $"GlobalTransactionID", $"amount").orderBy(desc("amount")).show(20) // 4 sec

x.count() // 3 sec

// now let's compare with the S3 dataset

val p = spark.read.parquet("s3a://alluxio-poc/data")
p.select($"itemid", $"itemdescription", $"GlobalTransactionID", $"amount").orderBy(desc("amount")).show(20) // 19 sec

p.count() // 19 sec

To sum up, Alluxio provides a great way to speed up data processing in a warehouse when you only need fast access to a limited part of the dataset. A typical case: hot data that is accessed and processed ~10x more often than the rest, but makes up only 10% of the whole dataset, is an ideal candidate to be cached with Alluxio.


