For the last five months I’ve been using Spark on a daily basis for various types of workloads and analysis. Spark has a few good things going for it, but overall I’ve really come to loathe it. The specific reasons I’ll most likely outline in a later post. All I really need and want out of a data analysis platform is plain old SQL with a little bit of programming. Enough programming to get around the limitations of SQL alone, but not enough to get me into trouble.

Introducing Exalt SQL

Exalt SQL is my answer to Spark. The idea behind it is to take an existing big data SQL engine and supplement it with a programming language, in this case Ruby. I chose Presto as the SQL backend because it’s easy to deploy on EMR, but any backend could be plugged in, such as Redshift, BigQuery, etc.
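
To make the pluggable-backend idea concrete, here’s a rough sketch of what a backend adapter could look like. This is purely illustrative; the class and method names are hypothetical and not ExaltSQL’s actual interface.

# Hypothetical backend interface; any engine that can run SQL and hand back
# rows could slot in behind it. Not ExaltSQL's real API.
class SqlBackend
  # Run a statement and return [column_names, rows].
  def execute(sql)
    raise NotImplementedError
  end
end

class PrestoBackend < SqlBackend
  def initialize(client)
    @client = client # e.g. a client from the presto-client gem
  end

  def execute(sql)
    columns, rows = @client.run(sql)
    [columns.map(&:name), rows]
  end
end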

The main design goal is that SQL should be the primary way to interact with data. The results of every query you execute can themselves be queried by later queries. From there you can add more complex logic to your queries, call external resources, etc. through the Ruby interface.

Here is a simple example:

# Our initial query
data = query("select * from tpch.sf1.customer limit 5")

# Easily query our previous data
data = query("select count(*) FROM #{data.table}")

# Display the final data in a table
data.show()

The query function runs the given query and saves the output to a temporary table, which gets dropped at the end of each run.
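
For a sense of what’s going on under the hood, here’s a minimal sketch of how a query helper like this could work against Presto, assuming the presto-client gem. The Result wrapper, the table-naming scheme, and the coordinator address are my own illustration, not ExaltSQL’s actual implementation.

require 'presto-client'
require 'securerandom'

# Illustrative only: assumes a local Presto coordinator and a catalog that
# supports CREATE TABLE AS.
PRESTO = Presto::Client.new(
  server: 'localhost:8080',
  catalog: 'hive',
  schema: 'default',
  user: 'exalt'
)

TEMP_TABLES = []

# Lightweight wrapper so results expose .table, .rows, .each and .show.
Result = Struct.new(:table, :columns, :rows) do
  def each
    rows.each { |row| yield columns.zip(row).to_h }
  end

  def show
    puts columns.join("\t")
    rows.each { |row| puts row.join("\t") }
  end
end

def query(sql)
  # Materialize the result into a uniquely named temporary table.
  table = "hive.default.exalt_tmp_#{SecureRandom.hex(4)}"
  PRESTO.run("CREATE TABLE #{table} AS #{sql}")
  TEMP_TABLES << table

  columns, rows = PRESTO.run("SELECT * FROM #{table}")
  Result.new(table, columns.map(&:name), rows)
end

# Temporary tables get dropped at the end of each run, as described above.
at_exit do
  TEMP_TABLES.each { |t| PRESTO.run("DROP TABLE IF EXISTS #{t}") }
end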

Even with this really simple and limited feature set, I’m able to run production-level code. Here’s a production sample I’ve retrofitted to run in Exalt SQL.

# For each customer, count how many of their items were not already
# contributed by a previously processed customer.
customer_contributor_counts = []

customers = query("select customer from raptor.default.customer_items group by customer")
unique_items = query("select item from raptor.default.customer_items group by item")
unique_customer_items = query("select customer, item from raptor.default.customer_items group by customer, item")

processed_customers = []

customers.each do |c|
  if customer_contributor_counts.length == 0
    # First customer: every one of their items is a new contribution.
    diff_count = query("select count(*) from #{unique_customer_items.table} where customer = '#{c["customer"]}'").rows[0][0]
  else
    # Later customers: count their items minus anything already seen.
    diff_count = query(<<~SQL).rows[0][0]
      SELECT
        COUNT(*) AS cnt
      FROM (
        SELECT item FROM #{unique_customer_items.table} WHERE customer = '#{c["customer"]}'
        EXCEPT
        SELECT item FROM #{unique_customer_items.table} WHERE customer IN ('#{processed_customers.join("','")}')
      )
    SQL
  end

  customer_contributor_counts << {customer: c["customer"], item_contribution_count: diff_count}
  processed_customers << c["customer"]
end

puts customer_contributor_counts

What’s next?

The next goal is to add more features to make the workflow smoother, and to get more production code running on it. You can see what’s next in the open issues on GitHub.

The project is open source on GitHub under the MIT license: https://github.com/abronte/ExaltSQL