Monday, November 20, 2017

Applying Alluxio to warm up your data

"Alluxio, formerly Tachyon, enables any application to interact with any data from any storage system at memory speed," states https://www.alluxio.org/. In this article I'd like to describe the general idea behind Alluxio and how it helped me. Alluxio is not yet widely known, but it has a lot of features to offer and can be a game changer for your project. It already powers data processing at Barclays, Alibaba, Baidu, ZTE, Intel, and others. It is licensed under Apache 2.0 and the source code can be reviewed at https://github.com/Alluxio/alluxio .

Alluxio provides a virtual filesystem that creates a layer between your application (i.e. a computational framework) and the real storage, such as HDFS, S3, Google Cloud Storage, Azure Blob Storage, and so on. Alluxio exposes several interfaces: a Hadoop-compatible FS, a native key-value interface, and an NFS interface. From a component point of view, Alluxio has a single Master (plus a Secondary Master, which is similar to the Secondary NameNode in Hadoop, i.e. it doesn't serve client requests), multiple Workers, and, obviously, a Client.


My use case was inspired by tiered storage in HDFS: you can configure HDFS to keep specific paths on Hot storage (say, in memory), Warm storage (~ SSD), or Cold storage (~ HDD). However, cloud usage is growing every day and hardware Hadoop clusters are becoming rare, and the issue with the cloud (at the same time, a benefit) is that storage is isolated from computation, which makes storage tiers hard or impossible to implement. And that is a very good use case for Alluxio: deploy an Alluxio cluster to play the role of Hot storage, holding only the most frequently accessed data.
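For comparison, this is roughly how the HDFS tiering mentioned above is configured on a hardware cluster, using the standard HDFS storage policy CLI (the /data/hot and /data/cold paths are just example paths, not from this setup):

```shell
# Assign HDFS storage policies per path (requires DataNodes configured
# with the corresponding storage types: SSD, DISK, ARCHIVE, RAM_DISK)
hdfs storagepolicies -setStoragePolicy -path /data/hot  -policy ALL_SSD
hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD
# Verify which policy a path currently has
hdfs storagepolicies -getStoragePolicy -path /data/hot
```

In the cloud there is no equivalent of this per-path control over where S3 keeps your bytes, which is exactly the gap Alluxio fills.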


While saving data to S3, we'd like to partition it by year, month, and day to speed up queries over a known time range. However, access rarely follows a uniform distribution; much more often there are very specific patterns, such as:


  • actively access last 3 months
  • actively access last month and the same month of last year
The frequently accessed data is a natural candidate to be put into Alluxio to speed up access, while the rest of the data remains available directly from S3.
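For the "last 3 months" pattern, the hot partitions can be computed and mounted one by one. A minimal sketch, assuming a hypothetical year=YYYY/month=MM layout under s3a://alluxio-poc/data (the echo makes it a dry run; remove it to actually execute the mounts):

```shell
# Print the Alluxio mount commands for the last 3 monthly partitions
# (hypothetical partition layout; uses GNU date arithmetic)
for i in 0 1 2; do
  part=$(date -d "-$i month" +"year=%Y/month=%m")
  echo ./alluxio fs mount -readonly "/mnt/hot/$part" "s3a://alluxio-poc/data/$part"
done
```

A cron job could re-run this monthly, unmounting the partition that fell out of the window.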

Let's look at a practical example of working with data stored on S3, using Apache Spark on EMR.


I used Terraform to create the Alluxio cluster: three r4.xlarge workers and one m4.xlarge master. We also need computational power to run the Spark job, so let's create an AWS EMR cluster:


aws emr create-cluster --name 'Alluxio_EMR_test' \
--instance-type m4.2xlarge \
--instance-count 3 \
--ec2-attributes SubnetId=subnet-131cda0a,KeyName=my-key-name,InstanceProfile=EMR_EC2_DefaultRole \
--service-role EMR_DefaultRole \
--applications Name=Hadoop Name=Spark \
--region us-west-2 \
--log-uri s3://alluxio-poc/emrlogs \
--enable-debugging \
--release-label emr-5.7.0 \
--emrfs Consistent=true
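Cluster creation takes a few minutes; you can block until it is up and grab the master's public DNS name with the AWS CLI (the cluster id below is a placeholder for the ClusterId that create-cluster prints):

```shell
CLUSTER_ID=j-XXXXXXXXXXXXX  # placeholder: use the ClusterId returned by create-cluster
# Wait until the cluster reaches the RUNNING state
aws emr wait cluster-running --cluster-id "$CLUSTER_ID" --region us-west-2
# Print the master node's public DNS name to ssh into
aws emr describe-cluster --cluster-id "$CLUSTER_ID" --region us-west-2 \
  --query 'Cluster.MasterPublicDnsName' --output text
```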


After that, Alluxio is ready to be started and our data is ready to be pulled in:


[ec2-user@ip-172-16-175-35 ~]$ docker ps
CONTAINER ID  IMAGE    COMMAND                 CREATED        STATUS        PORTS  NAMES
1c876a0ffe4d  alluxio  "/entrypoint.sh wo..."  9 minutes ago  Up 9 minutes         cranky_brown
[ec2-user@ip-172-16-175-35 ~]$ docker exec -it 1c876a0ffe4d /bin/sh
/ # cd /opt/alluxio/bin
/opt/alluxio/bin # ./alluxio runTests
/opt/alluxio/bin # ./alluxio fs mkdir /mnt
Successfully created directory /mnt
# the following command mounts an S3 folder inside Alluxio
/opt/alluxio/bin # ./alluxio fs mount -readonly alluxio://localhost:19998/mnt/s3 s3a://alluxio-poc/data
Mounted s3a://alluxio-poc/data at alluxio://localhost:19998/mnt/s3
/opt/alluxio/bin #
/opt/alluxio/bin # ./alluxio fs ls /mnt/s3
-rwx------ pc-nord-account66 pc-nord-account66 410916576 09-22-2017 18:03:10:815 Not In Memory /mnt/s3/part-00084-2e9dafb0-2d7a-428e-b517-b6eb4d70f781.snappy.parquet
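Note that the ls output reports the file as "Not In Memory": mounting only maps the S3 path into the Alluxio namespace, it doesn't fetch anything yet. To warm the data up front rather than on first read, it can be loaded explicitly from the same container:

```shell
# Fetch the mounted S3 data into Alluxio storage; afterwards
# `./alluxio fs ls /mnt/s3` should report the files as "In Memory"
./alluxio fs load /mnt/s3
```

Skipping this step is also fine: the first Spark job that reads the path will populate Alluxio as a side effect, and only subsequent reads get the full speed-up.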


Then go back to the EMR master and start the Spark shell:

spark-shell --jars ~/alluxio-1.5.0/client/spark/alluxio-1.5.0-spark-client.jar


The following commands configure the Spark context to register the Alluxio file system:

import org.apache.spark.sql.functions.desc

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")

val x = spark.read.parquet("alluxio://172.16.175.46:19998/mnt/s3")

// let's see how fast it's gonna be

x.select($"itemid", $"itemdescription", $"GlobalTransactionID", $"amount").orderBy(desc("amount")).show(20) // 4 sec

x.count() // 3 sec

// now let's compare with the S3 dataset

val p = spark.read.parquet("s3a://alluxio-poc/data")
p.select($"itemid", $"itemdescription", $"GlobalTransactionID", $"amount").orderBy(desc("amount")).show(20) // 19 sec

p.count() // 19 sec

To sum up, Alluxio provides a great way to speed up data processing in a warehouse when you only need fast access to a limited part of the dataset. A typical case: hot data that is accessed and processed ~10x more often than the rest, but makes up only 10% of the whole dataset, is an ideal candidate to be cached with Alluxio.


