Friday, July 29, 2016

Protecting Spark UI, part 1: nginx

The Apache Spark Web UI is a decent place to check cluster health and monitor job performance, and it is the starting point for almost every performance optimization. The folks at Databricks keep working hard on improving the UI from version to version.
But it still has one issue which I face on every project and which must be resolved every time: I'm talking about the public visibility of this information. Everyone who can reach the port (by default, 8080 or 4040) can access the UI and all the information there (and there is a lot of stuff you want to keep private).

There are several solutions to deal with it:

  1. Close all ports and configure nginx to listen on a specific port and forward requests (of course, with basic authentication)
  2. Protect the UI using Spark's built-in method: implementing your own filter
In this post, let's start with the first option: how to protect the Spark UI with nginx.

The instructions below are suitable for protecting the standalone Spark Web UI when the job is executed in client mode (so you can predict where the driver is up and running).
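For reference, a client-mode submission against a standalone master looks roughly like this (the master host, class and jar names are placeholders); the driver, and therefore its UI on port 4040, stays on the host you submit from:

spark-submit --master spark://spark-master:7077 --deploy-mode client --class com.example.MyApp my-app.jar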

Let's assume that there is a node with both Spark and nginx installed (obviously, they can live on different nodes).

First of all, close all Spark-related ports (and there are a lot of them): they must still be accessible in-network only. In Amazon it's easy to do with security groups: just specify an appropriate CIDR mask for each inbound rule, for instance 172.16.0.0/12. Next, open two ports not used by Spark which you are going to expose to get into the Spark master UI or the Spark driver UI: for example, let's assume they are 2020 and 2021 (the example below only uses 2020, for the master UI).
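As an illustration, the inbound rules could be created with the AWS CLI roughly like this (the security group ID is a placeholder, and the CIDR values are just the ones from the example above):

# Spark's own ports stay reachable only from inside the private network
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8080 --cidr 172.16.0.0/12
# the nginx port is the only one opened to the outside
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 2020 --cidr 0.0.0.0/0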

Now only a small part is left: configure nginx to perform basic auth and forward requests to the Spark UI. Since nginx sits in the same private network as Spark, it can reach the ports that are closed to the outside, so the request will be handled by Spark and the UI actually presented to the end user.

Before configuring nginx itself, the password file it will check credentials against must be created.
It's simple to do with the htpasswd tool, which can be installed by running:
sudo yum install -y httpd-tools

Then generate a password and store it in a file (the user name will be spark and the password is entered in the CLI):
sudo htpasswd -c /etc/nginx/.htpasswd spark
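If you want a quick sanity check, or need more than one account, something like this should work (the second user name is just an example; note that -c is omitted so the existing file isn't recreated):

cat /etc/nginx/.htpasswd
# spark:$apr1$... — one line per user, password stored as a hash
sudo htpasswd /etc/nginx/.htpasswd analyst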

The last step is to create a proper nginx configuration (the example below simply forwards all requests on port 2020 to the Spark Master UI on 8080):
vi /etc/nginx/nginx2001.conf

events {
    worker_connections 1000;
}

http {
    server {
        listen 2020;

        auth_basic "Private Beta";
        auth_basic_user_file /etc/nginx/.htpasswd;

        location / {
            proxy_pass http://localhost:8080;
        }
    }
}
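Before starting nginx, it doesn't hurt to let it validate the configuration first (depending on how nginx is installed, sudo may be needed):

nginx -t -c /etc/nginx/nginx2001.conf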

Actually, that's it. After that we just need to start nginx:
nginx -c /etc/nginx/nginx2001.conf

Then point the browser to HOST:2020: you will be asked to enter the credentials, and only after that will you be taken to the Spark Master UI.
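The same check can be done from the command line with curl (HOST and the password are placeholders): without credentials the request should be rejected with 401, with them the master UI page should come back.

curl -I http://HOST:2020/
curl -I -u spark:yourpassword http://HOST:2020/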
