Spark
The vortex-spark connector implements Apache Spark’s DataSource V2 API, allowing Spark to read
and write Vortex files as a native data source registered under the format name vortex.
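For orientation, the snippet below shows how a Vortex dataset might be read and written through that format name from Java. The paths and session settings are placeholders; the only assumption is that the connector jar is on the Spark classpath.

```java
// Hypothetical usage sketch: the paths below are placeholders, and the
// connector jar is assumed to be on the Spark classpath.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class VortexQuickstart {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("vortex-quickstart")
                .master("local[*]")
                .getOrCreate();

        // Read a file or a directory of .vortex files as a DataFrame.
        Dataset<Row> df = spark.read().format("vortex").load("/data/events");

        // Write the DataFrame back out as Vortex files.
        df.write().format("vortex").save("/data/events_copy");

        spark.stop();
    }
}
```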
Registration
The connector implements Spark’s TableProvider and DataSourceRegister interfaces. When a
query references the vortex format, Spark creates a VortexTable that supports both batch
reads (SupportsRead) and writes (SupportsWrite). Schema inference reads the footer of a
discovered file to extract the Arrow schema and map it to Spark’s schema representation.
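A trimmed sketch of that registration surface is shown below. The Spark interfaces and the vortex short name come from the description above; the fixed schema and the omitted table construction are placeholders for what the connector actually does (reading a file footer and returning a VortexTable).

```java
// Sketch of the DataSource V2 registration surface. The fixed schema and the
// unimplemented getTable body are placeholders; the real connector infers the
// schema from a file footer and returns a VortexTable supporting reads/writes.
import java.util.Map;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.sources.DataSourceRegister;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class VortexDataSource implements TableProvider, DataSourceRegister {

    // Makes the connector resolvable as format("vortex").
    @Override
    public String shortName() {
        return "vortex";
    }

    // The real connector reads the footer of a discovered file and converts
    // its Arrow schema to Spark's StructType; a fixed schema stands in here.
    @Override
    public StructType inferSchema(CaseInsensitiveStringMap options) {
        return new StructType().add("id", DataTypes.LongType);
    }

    // The real connector returns a VortexTable implementing SupportsRead and
    // SupportsWrite; omitted to keep this sketch self-contained.
    @Override
    public Table getTable(StructType schema, Transform[] partitioning,
                          Map<String, String> properties) {
        throw new UnsupportedOperationException("VortexTable construction omitted");
    }
}
```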
Multiple Files
Spark’s scan builder enumerates Vortex files by scanning the provided path. If the path is a
directory, native code lists all .vortex files within it. Each file becomes an independent
input partition, and Spark’s task scheduler distributes partitions across executors in the
cluster.
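In outline, planning behaves like the sketch below, although the real connector performs the directory listing in native code rather than on the JVM; VortexInputPartition and VortexPartitionPlanner are illustrative names used only here.

```java
// Illustrative one-file-per-partition planning. In the connector the listing
// happens in native code; the class names here are hypothetical.
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.connector.read.InputPartition;

// Each discovered .vortex file becomes one serializable input partition that
// Spark can schedule onto any executor.
class VortexInputPartition implements InputPartition {
    final String filePath;

    VortexInputPartition(String filePath) {
        this.filePath = filePath;
    }
}

class VortexPartitionPlanner {
    static List<InputPartition> plan(String path) {
        List<InputPartition> partitions = new ArrayList<>();
        File root = new File(path);
        if (root.isDirectory()) {
            File[] files = root.listFiles((dir, name) -> name.endsWith(".vortex"));
            if (files != null) {
                for (File f : files) {
                    partitions.add(new VortexInputPartition(f.getAbsolutePath()));
                }
            }
        } else {
            // A single file yields a single partition.
            partitions.add(new VortexInputPartition(root.getAbsolutePath()));
        }
        return partitions;
    }
}
```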
Each partition creates its own file handle and scan state, so there is no shared mutable state between partitions. This maps naturally to Spark’s execution model, where each task runs independently on a separate JVM thread.
Threading Model
The Spark integration crosses the JNI boundary between Java and Rust. Each Spark partition reader opens a Vortex file and creates a native scan via JNI. The native side manages its own async runtime and drives I/O internally, returning results to Java as Arrow-compatible columnar batches.
Because each partition reader owns its native resources exclusively, there is no contention across Spark threads. The JNI boundary is crossed once per batch rather than once per row, keeping overhead low. A prefetching iterator on the Java side buffers upcoming batches to overlap I/O with Spark’s processing.
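The prefetching iterator follows an ordinary producer/consumer pattern. The sketch below is a generic illustration of that pattern rather than the connector’s actual class: a daemon thread pulls batches from an underlying iterator into a bounded queue so native I/O can run ahead of Spark’s consumption.

```java
// Generic sketch of a prefetching iterator: a daemon producer thread fills a
// bounded queue from the source iterator (e.g. native batch reads) while the
// consumer drains it. Assumes the source never yields null elements.
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class PrefetchingIterator<T> implements Iterator<T> {
    private final BlockingQueue<Optional<T>> queue;
    private Optional<T> next;

    PrefetchingIterator(Iterator<T> source, int bufferSize) {
        this.queue = new ArrayBlockingQueue<>(bufferSize);
        Thread producer = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    queue.put(Optional.of(source.next()));
                }
                queue.put(Optional.empty()); // end-of-stream marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "vortex-prefetch");
        producer.setDaemon(true);
        producer.start();
        advance();
    }

    // Blocks until the next buffered element (or the end marker) is available.
    private void advance() {
        try {
            next = queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            next = Optional.empty();
        }
    }

    @Override
    public boolean hasNext() {
        return next.isPresent();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T value = next.get();
        advance();
        return value;
    }
}
```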
Filter and Projection Pushdown
Projection pushdown is supported through Spark’s SupportsPushDownRequiredColumns interface.
The scan builder prunes the column list to only those referenced by the query, and the pruned
column set is passed to the native scan via ScanOptions.
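A scan builder implementing that interface looks roughly like the sketch below; the Spark interfaces are real, while carrying the pruned schema directly on the returned Scan is a simplification of handing it to the native scan through ScanOptions.

```java
// Sketch of projection pushdown via SupportsPushDownRequiredColumns.
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.SupportsPushDownRequiredColumns;
import org.apache.spark.sql.types.StructType;

class VortexScanBuilder implements SupportsPushDownRequiredColumns {
    private StructType requiredSchema;

    VortexScanBuilder(StructType fullSchema) {
        this.requiredSchema = fullSchema;
    }

    // Spark calls this with only the columns the query references; the pruned
    // set is what the connector forwards to the native scan.
    @Override
    public void pruneColumns(StructType requiredSchema) {
        this.requiredSchema = requiredSchema;
    }

    @Override
    public Scan build() {
        final StructType prunedSchema = requiredSchema;
        return new Scan() {
            @Override
            public StructType readSchema() {
                return prunedSchema;
            }
        };
    }
}
```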
Filter pushdown infrastructure exists in the ScanOptions type but is not yet connected to
Spark’s SupportsPushDownFilters interface. This is planned future work.
Data Export
Native Vortex arrays are exported to Arrow via the C Data Interface, then wrapped in Spark’s
columnar batch format using custom ArrowColumnVector wrappers. This avoids copying data between
the Rust and JVM heaps: the Arrow buffers remain in native memory and are accessed from Java
through direct byte buffers.
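The flow can be sketched with the public Arrow Java and Spark classes, keeping in mind that the connector uses its own wrappers and that the C Data Interface struct addresses would arrive via JNI; ArrowBatchImport and its parameters are illustrative names.

```java
// Sketch of importing natively produced Arrow data into Spark's columnar
// format through the Arrow C Data Interface. The struct addresses would come
// from JNI; the connector's own column-vector wrappers differ in detail from
// Spark's ArrowColumnVector used here.
import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class ArrowBatchImport {
    // Wraps C Data Interface structs exported by native code into a Spark
    // ColumnarBatch; the Arrow buffers stay in native memory, so no data is
    // copied onto the JVM heap.
    static ColumnarBatch importBatch(BufferAllocator allocator,
                                     long arrayAddress,
                                     long schemaAddress) {
        ArrowArray array = ArrowArray.wrap(arrayAddress);
        ArrowSchema schema = ArrowSchema.wrap(schemaAddress);

        VectorSchemaRoot root = Data.importVectorSchemaRoot(allocator, array, schema, null);

        ColumnVector[] columns = new ColumnVector[root.getFieldVectors().size()];
        int i = 0;
        for (FieldVector vector : root.getFieldVectors()) {
            columns[i++] = new ArrowColumnVector(vector);
        }

        ColumnarBatch batch = new ColumnarBatch(columns);
        batch.setNumRows(root.getRowCount());
        return batch;
    }
}
```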
Future Work
The current integration builds directly on the native file and scan APIs via JNI. Future work
will migrate it to the Scan API’s Source trait, which will provide a standard interface for
file discovery, partitioning, and pushdown. This will also enable connecting Spark’s filter
pushdown to Vortex’s expression-based filtering.