I've been using Spark pretty much daily for about 3 years now. It's our main data analysis engine over at Coupa. Our codebase has grown considerably over the years and we use Spark in various ways. I feel like I've become pretty familiar with where Spark shines and where it falls short.
If I had to choose one particularly bad thing about the analytics engine, it would be this: connecting to remote Spark clusters sucks.
Everything you read about running Spark applications assumes you are running things directly on the cluster, either from a notebook like Zeppelin or straight up with spark-submit. That makes for a really crappy development flow, even more so when you are working in a largish enterprise codebase.
The interesting part of all this is that there is no real reason why it needs to be this way. Before I show how it can be improved, we first need to look at how Python interacts with Spark and the JVM. (I'm showcasing Python here because we use PySpark at work.)
Spark is written in Scala, which runs on the JVM. When you run a PySpark application, most of the work happens in the JVM, not in Python. Python talks to the JVM through a bridge library called Py4J, which exposes JVM classes and functions over a network socket. The Python side of Py4J sends a message over this socket saying which function to call and with what parameters, and the JVM side sends the result back over the same socket.
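To make the call-over-a-socket idea concrete, here is a toy sketch in plain Python. It is not Py4J and does not use Py4J's wire protocol; the JSON-per-line format and the names EXPOSED, CallHandler, RemoteCaller and start_server are all made up for illustration. One side exposes functions by name, the other side sends a function name plus parameters and reads back a result.

```python
import json
import socket
import socketserver
import threading

# Functions the toy "JVM side" exposes by name (hypothetical stand-ins).
EXPOSED = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}


class CallHandler(socketserver.StreamRequestHandler):
    """Serves one JSON request per line: {"func": name, "args": [...]}."""

    def handle(self):
        for line in self.rfile:
            request = json.loads(line)
            result = EXPOSED[request["func"]](*request["args"])
            reply = json.dumps({"result": result}) + "\n"
            self.wfile.write(reply.encode("utf-8"))


class RemoteCaller:
    """Toy "Python side": sends a call description, waits for the result."""

    def __init__(self, host, port):
        self._sock = socket.create_connection((host, port))
        self._reader = self._sock.makefile("r", encoding="utf-8")

    def call(self, func, *args):
        message = json.dumps({"func": func, "args": list(args)}) + "\n"
        self._sock.sendall(message.encode("utf-8"))
        return json.loads(self._reader.readline())["result"]

    def close(self):
        self._reader.close()
        self._sock.close()


def start_server():
    """Starts the toy server on a free local port in a background thread."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), CallHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_server()
    client = RemoteCaller(*server.server_address)
    print(client.call("add", 2, 3))
    client.close()
    server.shutdown()
    server.server_close()
```

Nothing in the client cares whether the address is 127.0.0.1 or a machine on the other side of the world, which is the whole point.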
Since Python is already talking to Spark and the JVM over a network socket, why can't that be done over the internet?
Turns out it's actually a pretty easy thing to set up. When initializing a Spark context, you can pass in a custom Java gateway, so you just hand it one connected to your remote cluster instead of the default local one. I even created an open source library called Pyspark Gateway that helps facilitate connecting over the internet.
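As a rough sketch of the idea using plain py4j and pyspark (not Pyspark Gateway itself), something like the following hands SparkContext a gateway that points at another machine. The function name connect_to_remote_gateway, the PYSPARK_JVM_IMPORTS tuple, and the host, port and token arguments are assumptions for illustration. The import list is written from memory of what pyspark's own launcher registers on the gateway it creates, so check it against your pyspark version. It also assumes a Py4J gateway server is already running inside the remote Spark JVM and that you know its port and auth token. This only covers the Py4J control channel, not the side channels Spark uses to move bulk data.

```python
# JVM packages that pyspark's launcher normally imports on its gateway
# (assumption: recalled from pyspark's java_gateway.py, may vary by version).
PYSPARK_JVM_IMPORTS = (
    "org.apache.spark.SparkConf",
    "org.apache.spark.api.java.*",
    "org.apache.spark.api.python.*",
    "org.apache.spark.ml.python.*",
    "org.apache.spark.mllib.api.python.*",
    "org.apache.spark.sql.*",
    "org.apache.spark.sql.api.python.*",
    "org.apache.spark.sql.hive.*",
    "scala.Tuple2",
)


def connect_to_remote_gateway(host, port, auth_token, conf=None):
    """Build a SparkContext on top of an already-running remote Py4J gateway."""
    # Imported lazily so this module still loads where pyspark is missing.
    from py4j.java_gateway import GatewayParameters, JavaGateway, java_import
    from pyspark import SparkContext

    # Point the Py4J client at the remote JVM instead of a local subprocess.
    gateway = JavaGateway(
        gateway_parameters=GatewayParameters(
            address=host,
            port=port,
            auth_token=auth_token,
            auto_convert=True,
        )
    )

    # Register the same JVM imports pyspark would set up on its own gateway.
    for name in PYSPARK_JVM_IMPORTS:
        java_import(gateway.jvm, name)

    # SparkContext uses the supplied gateway rather than launching one.
    return SparkContext(gateway=gateway, conf=conf)
```

The auth token matters: recent pyspark versions refuse a custom gateway that has no auth_token set.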
So how does this work?
The Pyspark Gateway library has a client and a server component. The client and server communicate to start the Java gateway and pass authentication credentials to the client. Then the client creates a local Java gateway object pointing to the remote gateway with the credentials obtained.
This is really simple from the client side: import the Pyspark Gateway library and initialize your Spark context with the Java gateway created by the library.
Now you have your local Python talking to your remote Spark cluster! You can run various actions in Spark, and they all get executed on the cluster. Even things like UDFs work!
Some shortcomings of Pyspark Gateway
The above method for executing Spark applications works for about 95% of the things you need to do. Spark does some funky things (later post?) that prevent this from being 100% compatible with all Spark APIs.
In particular, Spark avoids passing too much data over Py4J (I think I saw something in some comments about performance). Instead, Spark writes the data to a file on disk and the JVM reads that file. Or, if you have encryption enabled, it creates a temporary socket to exchange the data. A socket is something we can forward across the network, while a file on local disk is not, which is why we have to force encryption (conf.set('spark.io.encryption.enabled', 'true')) when creating the Spark context. We also need to do some funky monkey-patching in order to intercept and forward this data to the client.
Another problem area is RDDs. Many RDD functions require passing in a Python function. This is essentially a mini UDF that needs to get serialized and distributed to the cluster. If the function you are sending is too big, Spark tries to serialize it to disk (regardless of what you have encryption set to). Since your Python is running on your local machine, the JVM on the cluster can't read the local file that Python is trying to hand it.
Pyspark Gateway is in a pretty good spot right now. Many of us at Coupa use it daily. The main thing I want to fix before bumping to a "1.0" release is the janky monkey-patching it does. Reliably patching a module's function doesn't seem so straightforward in Python, at least from my attempts at it.
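One general reason patching is slippery, shown as a small standalone sketch (this is an illustration of Python's name binding, not necessarily the exact problem Pyspark Gateway runs into): replacing a function on a module only affects code that looks the name up on that module at call time. Anything that grabbed a reference earlier, for example via a from-import, keeps calling the original. The module fake_lib and the functions write_payload, send and forward_payload are all made up for this example.

```python
import types

# Source for a stand-in library module (hypothetical).
LIB_SOURCE = '''
def write_payload(data):
    return "disk:" + data

def send(data):
    # Looks up write_payload in the module globals at call time.
    return write_payload(data)
'''

fake_lib = types.ModuleType("fake_lib")
exec(LIB_SOURCE, fake_lib.__dict__)

# Simulates `from fake_lib import write_payload` running before the patch.
early_reference = fake_lib.write_payload


def forward_payload(data):
    """Replacement that pretends to forward data over a socket."""
    return "socket:" + data


# The monkey patch: rebind the name on the module object.
fake_lib.write_payload = forward_payload

# Code inside the module sees the patch, because it resolves the name late.
patched_result = fake_lib.send("rows")

# The early reference still points at the original function.
unpatched_result = early_reference("rows")

if __name__ == "__main__":
    print(patched_result)
    print(unpatched_result)
```

So a patch is only reliable if it lands before every importer has bound its own copy of the name, or if each of those copies gets patched too.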