Adam Bronte


11 Jul 2021

Run Code Locally

[xkcd 378: "Real Programmers"]

Developers tend to be very opinionated about the tools they choose. It mostly comes down to personal preference and whatever somebody is most comfortable and productive with. Imagine telling an Emacs user they have to use Vim to get any work done. There would be pitchforks in the streets!

In the data world, this notion of choosing your tools is completely thrown out the window.

Why is that?

The whole premise of “big data” is that you’re working with data that won’t fit on a single computer. It’s not like developing a web application, where your code, a local database, a browser, and so on all live on your laptop. If your data can’t live next to your code, the next natural step is to run your code next to your data.

Every product I’ve run across solves this the same way: by hosting a remote notebook that you interact with throughout the process. Don’t get me wrong; this is great for getting up and running quickly and for ad-hoc analysis, but it isn’t without its downsides.

Forcing a specific tool

Some people like Vim, some like Emacs, some like VSCode. Developers should be able to use whatever they’re most productive with to get the job done.

Difficulty managing large projects

When working on larger, production-grade projects, you aren’t going to put everything into a single notebook and run it. You’re going to have multiple directories, modules, classes, configuration files, and so on, and I’ve found many of these hosted notebooks difficult to work with at that scale.
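
For a sense of scale, a production-grade Spark project tends to look something like this (an illustrative layout, not any particular repo of mine):

    my_pipeline/
        config/
            dev.yaml
            prod.yaml
        my_pipeline/
            __init__.py
            io.py            # readers/writers for object storage and the warehouse
            transforms.py    # reusable DataFrame transformations
            jobs/
                daily_rollup.py
        tests/
            test_transforms.py
        setup.py

Copy-pasting that into a single hosted notebook, or trying to keep it in sync with one, gets painful quickly.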

Going to production isn’t so straightforward

This is highly dependent on the platform you’re using. An all-encompassing platform like Databricks makes it pretty seamless; other tools like Zeppelin are not so straightforward. Ideally, the way things run in production should closely mirror how they run in development.

Ok, great, I want to run my code locally!

Sadly, I don’t think the industry has gotten to this point yet. In my opinion, the developer experience in the data world is still in its very early stages. There may also be use cases where you will always have to run remotely, such as complex ML models that need specific GPUs.

In Spark, I’ve tried to solve this in two different ways: one by extending PySpark’s Py4J to connect to a remote JVM, in a project I call Pyspark Gateway, and another by exposing the PySpark API as an RPC interface, in a project I call PysparkRPC. Both projects assume most of your heavy computation is going to happen inside the JVM on the cluster, turning Python into more of an orchestration layer than a general-purpose programming language.
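
As a rough illustration of that split (this is just vanilla PySpark against a placeholder cluster URL, not the internals of Pyspark Gateway or PysparkRPC), the local Python process only builds the query plan and collects a small result, while the scan and aggregation run in the cluster’s JVMs:

    # Minimal sketch: local driver script, remote compute.
    # The master URL and input path are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .master("spark://spark-master.example.com:7077")  # hypothetical remote cluster
        .appName("local-driver-remote-compute")
        .getOrCreate()
    )

    # These DataFrame calls only build a query plan on the Python side;
    # the actual scan and aggregation execute inside the cluster's JVMs.
    events = spark.read.parquet("s3a://my-bucket/events/")  # placeholder path
    daily_counts = events.groupBy(F.to_date("created_at").alias("day")).count()

    # Only the small aggregated result comes back to local Python,
    # which is acting purely as the orchestration layer.
    print(daily_counts.limit(10).collect())

    spark.stop()

Whether the connection happens through Py4J, an RPC layer, or something else, the goal is the same: keep your editor and your project on your machine, and ship only the work to the cluster.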

As I mentioned earlier, I still think we are early on in the developer experience. Your tools should enable your end goals and get out of the way as much as possible. I want to continue to explore this problem and see what else I might come up with.

A happy developer is a productive developer.