Friday, July 29, 2016

Protecting Spark UI, part 2: servlet filter

The previous post described how to configure a simple NGINX instance to add basic auth to a Spark job. In this part, let's see what Spark itself offers: implementing a servlet filter.

A filter is a special class that participates in the Java servlet lifecycle and is called on each request (and even response). Using a filter, a resource can be protected from unauthorized access with basic authentication. According to the documentation, the filter must be implemented and then passed (by its full class name) as a parameter. Let's pass the valid username and password through environment variables; that should be good enough, as it matches the approach used to pass AWS credentials, for instance. Obviously, these env variables must be set on the instance where the driver is supposed to run. Another option is to pass them as arguments into the filter using spark.<class name of filter>.params='param1=value1,param2=value2,...'
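For example, a *.conf fragment wiring the filter in could look like this (the second line shows the params-based option; the login/password parameter names are just an illustration, and the exact params syntax should be checked against your Spark version's documentation):

spark.ui.filters=my.company.filters.BasicAuthFilter
spark.my.company.filters.BasicAuthFilter.params=login=admin,password=secret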

Let's place our class in the package my.company.filters (it uses a couple of helpers from commons-codec and commons-lang):

package my.company.filters;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.commons.codec.binary.Base64;
import org.apache.commons.lang3.StringUtils;

public class BasicAuthFilter implements Filter {

  private String login;
  private String pass;

  // this method is called one time, on Filter creation
  public void init(FilterConfig config) {
     this.login = System.getenv("SPARK_LOGIN");
     this.pass = System.getenv("SPARK_PASS");
  }

  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
        throws IOException, ServletException {

     HttpServletRequest hreq = (HttpServletRequest) req;
     HttpServletResponse hres = (HttpServletResponse) res;

     String auth = hreq.getHeader("Authorization");
     if (auth != null) {
        int index = auth.indexOf(' ');
        if (index > 0) {
           // decode the "login:password" pair from the Base64 payload of the header
           String decoded = new String(Base64.decodeBase64(auth.substring(index + 1)), StandardCharsets.UTF_8);
           String[] creds = StringUtils.split(decoded, ':');
           if (creds.length == 2 && login.equals(creds[0]) && pass.equals(creds[1])) {
              // auth passed successfully: let the request through
              chain.doFilter(req, res);
              return;
           }
        }
     }

     // no or invalid credentials: challenge the client
     hres.setHeader("WWW-Authenticate", "Basic realm=\"ProtectedSpark\"");
     hres.sendError(HttpServletResponse.SC_UNAUTHORIZED);
  }

  public void destroy() {
     // nothing to clean up
  }

}
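By the way, if the params-based option is preferred over environment variables, init() can read the credentials from the filter's init parameters instead (a sketch; the login/password parameter names are assumptions matching the conf example above):

  // alternative init(): reads credentials passed via spark.<class name of filter>.params
  public void init(FilterConfig config) {
     this.login = config.getInitParameter("login");    // assumed parameter name
     this.pass = config.getInitParameter("password");  // assumed parameter name
  }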


Ok, the next step is to build a JAR (pack this filter into it). After that, we can run our job in a secured manner: execute spark-submit, pass the newly assembled jar with the --jars flag, and through configuration (a *.conf file or the --conf param) pass the full class name: spark.ui.filters=my.company.filters.BasicAuthFilter
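A minimal launch sketch, assuming the filter was packed into basic-auth-filter.jar and the job itself lives in my-job.jar (the jar and class names are hypothetical):

export SPARK_LOGIN=admin
export SPARK_PASS=secret

spark-submit \
  --jars basic-auth-filter.jar \
  --conf spark.ui.filters=my.company.filters.BasicAuthFilter \
  --class my.company.Main \
  my-job.jar

Since the job runs in client mode, exporting the variables in the same shell is enough for the driver JVM to inherit them.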


Protecting Spark UI, part 1: nginx

The Apache Spark Web UI is a decent place to check cluster health and monitor job performance, and a starting point for almost every performance optimization. The folks at Databricks work hard on improving the UI from version to version.
But it still has one issue which I face on every project and which must be resolved every time: I'm talking about the publicity of this information. Everyone who can reach the port (by default, 8080 or 4040) can access the UI and all the information there (and there is a lot of stuff you want to keep private).

There are several solutions to deal with it:

  1. Close all ports and configure nginx to listen on a specific port and forward requests (of course, with basic authentication).
  2. Protect the UI using Spark's built-in method: implementing your own filter.
In this post, let's start with the first one: how to protect the Spark UI with NGINX.

The instructions below are suitable for protecting a standalone Spark Web UI when the job is executed in client mode (so you can predict where the driver will be up and running).

Let's assume there is a node with both Spark and nginx installed (obviously, they could be on different nodes).

First of all, close all Spark-related ports (and there are a lot of them): they must still be accessible in-network. In Amazon, it is easy to do with security groups: just specify an appropriate CIDR mask for each inbound rule, for instance 172.16.0.0/12. Next, open 2 ports not used by Spark which you're going to make accessible to get into the Spark Master UI or Spark driver UI: just for example, let's assume they are 2020 and 2021.
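For instance, with the AWS CLI such an in-network-only inbound rule could be added like this (the security group id is hypothetical):

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 8080 \
  --cidr 172.16.0.0/12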

Now only a small part is left: configure nginx to perform basic auth and forward requests to the Spark UI. In this case nginx is inside the private network, so the request will be handled by Spark and the UI actually presented to the end user.

Before configuring nginx itself, the password file must be created.
It's simple to do with the htpasswd tool, which can be installed by running   sudo yum install -y httpd-tools

Then generate a password and store it in a file (the user name will be spark and the password is entered in the CLI):
sudo htpasswd -c /etc/nginx/.htpasswd spark

The last step is to create a proper nginx configuration (this example only forwards all requests coming to port 2020 to the Spark Master UI on 8080):
vi /etc/nginx/nginx2020.conf

events {
   worker_connections 1000;
}

http {
  server {
    listen 2020;

    auth_basic "Private Beta";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
      proxy_pass http://localhost:8080;
    }
  }
}
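The driver UI can be exposed the same way with one more server block inside the http section (a sketch, assuming the driver runs on this node with the default UI port 4040, exposed as 2021):

  server {
    listen 2021;

    auth_basic "Private Beta";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
      proxy_pass http://localhost:4040;
    }
  }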

Actually, that's it. After that we just need to start nginx:
nginx -c /etc/nginx/nginx2020.conf

And point the browser to HOST:2020 to be asked for credentials, and only after entering them get through to the Spark Master UI.
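A quick way to verify from the command line (user spark, the password is whatever was entered for htpasswd):

curl -u spark:yourpassword http://HOST:2020/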