#distributed #query #data #processing #sql

datafusion

DataFusion is a modern distributed compute platform that uses Apache Arrow as the memory model

32 releases

0.3.3 Sep 3, 2018
0.3.1 Jul 3, 2018
0.2.2 Mar 26, 2018

#37 in Database interfaces

Download history 12/week @ 2018-09-15 4/week @ 2018-09-22 100/week @ 2018-09-29 29/week @ 2018-10-06 6/week @ 2018-10-13 363/week @ 2018-10-20 44/week @ 2018-10-27 176/week @ 2018-11-03 274/week @ 2018-11-10 231/week @ 2018-11-17 37/week @ 2018-11-24 260/week @ 2018-12-01 47/week @ 2018-12-08

537 downloads per month

Apache-2.0

23KB
535 lines

DataFusion: Modern Distributed Compute Platform implemented in Rust

License Version Build Status Coverage Status Gitter chat

DataFusion is an attempt at building a modern distributed compute platform in Rust, using Apache Arrow as the memory model.

See my article How To Build a Modern Distributed Compute Platform to learn about the design and my motivation for building this. The TL;DR is that this project is a great way to learn about building distributed systems but there are plenty of better choices if you need something mature and supported.

Status

The original POC no longer works due to changes in Rust nightly since 11/3/18 and since then I have been contributing more code to the Apache Arrow project and decided to start implementing DataFusion from scratch based on that latest Arrow code and incorporating lessons learned from the first attempt. The original POC code is is now on the original_poc branch and supports single threaded SQL execution against Parquet and CSV files using Apache Arrow as the memory model.

The current task list:

  • Delete existing code and update the README with the new plan
  • Implement serializable logical query plan
  • Implement data source for CSV
  • Implement data source for Parquet
  • Implement query execution: Projection
  • Implement query execution: Selection
  • Implement query execution: Sort
  • Implement query execution: Aggregate
  • Implement query execution: Scalar Functions
  • Implement parallel query execution (multithreaded, single process)
  • Generate query plan from SQL
  • Implement worker node that can receive a query plan, execute the query, and return a result in Arrow IPC format
  • Implement distributed query execution using Kubernetes

Prerequisites

  • Rust nightly (required by parquet-rs crate)

Building DataFusion

See BUILDING.md.

Gitter

There is a Gitter channel where you can ask questions about the project or make feature suggestions too.

Contributing

Contributors are welcome! Please see CONTRIBUTING.md for details.

Dependencies

~4MB
~65K SLoC