#distributed #query #data #processing #sql

bin+lib datafusion

DataFusion is a modern distributed compute platform that uses Apache Arrow as the memory model

38 releases (5 breaking)

✓ Uses Rust 2018 edition

0.6.0 Jan 20, 2019
0.5.0 Dec 15, 2018
0.4.0 Nov 26, 2018
0.3.1 Jul 3, 2018
0.2.2 Mar 26, 2018

#51 in Database interfaces


469 downloads per month


3.5K SLoC

DataFusion: Modern Distributed Compute Platform implemented in Rust


DataFusion is an attempt at building a modern distributed compute platform in Rust, using Apache Arrow as the memory model.

See my article How To Build a Modern Distributed Compute Platform to learn about the design and my motivation for building this. The TL;DR is that this project is a great way to learn about building distributed systems, but there are plenty of better choices if you need something mature and supported.


The original POC no longer works due to changes in Rust nightly since 11/3/18. Since then I have been contributing more code to the Apache Arrow project, and I decided to start implementing DataFusion from scratch, based on the latest Arrow code and incorporating lessons learned from the first attempt. The original POC code is now on the original_poc branch and supports single-threaded SQL execution against Parquet and CSV files using Apache Arrow as the memory model.

The current task list:

  • Delete existing code and update the README with the new plan
  • Implement serializable logical query plan
  • Implement data source for CSV
  • Implement data source for Parquet
  • Implement query execution: Projection
  • Implement query execution: Selection
  • Implement query execution: Sort
  • Implement query execution: Aggregate
  • Implement query execution: Scalar Functions
  • Implement parallel query execution (multithreaded, single process)
  • Generate query plan from SQL
  • Implement worker node that can receive a query plan, execute the query, and return a result in Arrow IPC format
  • Implement distributed query execution using Kubernetes
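
To make the logical plan, Projection, and Selection items above more concrete, here is a minimal sketch of how a plan tree could be represented and evaluated over columnar data. It is an illustration only and not DataFusion's actual API: `Batch`, `LogicalPlan`, `execute`, and `column_index` are hypothetical names, plain `Vec<i64>` columns stand in for Arrow arrays, and the only predicate supported is `column > min`. Serialization is left out; one option would be to derive it with a crate such as serde.

```rust
// Hypothetical sketch: none of these types are DataFusion's real API.

/// A batch of rows stored column by column (a stand-in for an Arrow record batch).
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub names: Vec<String>,
    pub columns: Vec<Vec<i64>>,
}

/// A tiny logical plan: each node says what to compute, not how to compute it.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalPlan {
    /// Return the source batch unchanged.
    Scan,
    /// Keep only the named columns, in the given order.
    Projection {
        columns: Vec<String>,
        input: Box<LogicalPlan>,
    },
    /// Keep only the rows where `column > min`.
    Selection {
        column: String,
        min: i64,
        input: Box<LogicalPlan>,
    },
}

/// Look up a column position by name.
fn column_index(batch: &Batch, name: &str) -> Result<usize, String> {
    batch
        .names
        .iter()
        .position(|n| n == name)
        .ok_or_else(|| format!("unknown column: {}", name))
}

/// Evaluate a plan against a single in-memory source batch.
pub fn execute(plan: &LogicalPlan, source: &Batch) -> Result<Batch, String> {
    match plan {
        LogicalPlan::Scan => Ok(source.clone()),
        LogicalPlan::Projection { columns, input } => {
            let batch = execute(input, source)?;
            let mut out = Vec::with_capacity(columns.len());
            for name in columns {
                let idx = column_index(&batch, name)?;
                out.push(batch.columns[idx].clone());
            }
            Ok(Batch { names: columns.clone(), columns: out })
        }
        LogicalPlan::Selection { column, min, input } => {
            let batch = execute(input, source)?;
            let idx = column_index(&batch, column)?;
            // Compute the row mask once, then apply it to every column.
            let keep: Vec<bool> = batch.columns[idx].iter().map(|v| v > min).collect();
            let columns: Vec<Vec<i64>> = batch
                .columns
                .iter()
                .map(|col| {
                    col.iter()
                        .zip(&keep)
                        .filter(|(_, k)| **k)
                        .map(|(v, _)| *v)
                        .collect::<Vec<i64>>()
                })
                .collect();
            Ok(Batch { names: batch.names.clone(), columns })
        }
    }
}
```

In this sketch, a query such as "select id where score > 20" would be built as a `Projection` node wrapping a `Selection` node wrapping `Scan`, and passed to `execute` together with the source batch. Keeping the plan as plain data is what would later allow it to be serialized and sent to a worker node.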


  • Rust nightly (required by the parquet-rs crate)

Building DataFusion



There is a Gitter channel where you can ask questions about the project or make feature suggestions too.


Contributors are welcome! Please see CONTRIBUTING.md for details.


~433K SLoC