# Spark
Vortex provides a Spark DataSource V2 connector for reading and writing Vortex files. The connector is published to Maven Central as `dev.vortex:vortex-spark`.
## Installation
Add the dependency to your build. The connector is built against Spark 4.x with Scala 2.13.
Gradle (Kotlin DSL):

```kotlin
implementation("dev.vortex:vortex-spark:<version>")
```

Maven:

```xml
<dependency>
  <groupId>dev.vortex</groupId>
  <artifactId>vortex-spark</artifactId>
  <version>${vortex.version}</version>
</dependency>
```
The connector ships as a shadow JAR that relocates its Arrow, Guava, and Protobuf dependencies to avoid classpath conflicts with Spark.
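The connector can also be pulled in at job submission time via its Maven coordinates; a sketch of a typical invocation (the main class and application JAR names here are placeholders, not from this page):

```shell
# Fetch the connector from Maven Central at launch time.
spark-submit \
  --packages dev.vortex:vortex-spark:<version> \
  --class com.example.MyVortexJob \
  my-vortex-job.jar
```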
## Reading Vortex Files

Use the `vortex` format to read a single file or a directory of Vortex files:

```java
Dataset<Row> df = spark.read()
    .format("vortex")
    .option("path", "/path/to/data.vortex")
    .load();
```
When pointed at a directory, the connector discovers all `.vortex` files and creates one read partition per file.
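The planning step can be modeled as plain name filtering; a minimal sketch (not the connector's actual code) of how each discovered `.vortex` file becomes one read partition:

```java
import java.util.List;
import java.util.stream.Collectors;

public class VortexPartitionPlanner {
    // Models the connector's planning step: every .vortex file discovered
    // under the input path becomes one read partition; other files are ignored.
    static List<String> planPartitions(List<String> discoveredFiles) {
        return discoveredFiles.stream()
                .filter(name -> name.endsWith(".vortex"))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> listing = List.of("a.vortex", "b.vortex", "_SUCCESS");
        System.out.println(planPartitions(listing).size()); // 2
    }
}
```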
Column pruning is pushed down: only the columns referenced by the query are read from the file.
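What pushdown means for the read schema can be sketched without Spark (a simplified model; `readSchema` is a hypothetical helper, not a connector API):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ColumnPruning {
    // Models column pruning: of all columns stored in the file, only those
    // referenced by the query make it into the read schema, preserving the
    // file's column order. Unreferenced columns are never decoded.
    static List<String> readSchema(List<String> fileColumns, Set<String> referenced) {
        return fileColumns.stream()
                .filter(referenced::contains)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> file = List.of("id", "name", "payload", "ts");
        System.out.println(readSchema(file, Set.of("ts", "id"))); // [id, ts]
    }
}
```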
## Writing Vortex Files
```java
df.write()
    .format("vortex")
    .option("path", "/path/to/output")
    .mode(SaveMode.Overwrite)
    .save();
```
Each Spark partition produces one output file named `part-{partitionId}-{taskId}.vortex`.
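The naming scheme is plain string formatting; a small sketch (hypothetical helper, not part of the connector):

```java
public class VortexFileName {
    // Builds the output file name for a given Spark partition and task,
    // following the part-{partitionId}-{taskId}.vortex pattern above.
    static String fileName(int partitionId, long taskId) {
        return String.format("part-%d-%d.vortex", partitionId, taskId);
    }

    public static void main(String[] args) {
        System.out.println(fileName(0, 42)); // part-0-42.vortex
    }
}
```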
Write Options#
Option |
Default |
Description |
|---|---|---|
|
2048 |
Number of rows per batch (1–65536) |
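The row-batch option controls how many rows go into each batch on write; the resulting batch count can be sketched as (a model, not connector code):

```java
public class BatchSizing {
    // Number of batches produced when writing rowCount rows at
    // batchSize rows per batch; the final batch may be smaller.
    static long numBatches(long rowCount, int batchSize) {
        return (rowCount + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        System.out.println(numBatches(10_000, 2048)); // 5
    }
}
```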
### Save Modes

The connector supports all standard Spark save modes: `Overwrite`, `Append`, `Ignore`, and `ErrorIfExists`.
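The four modes differ only in how they treat a target path that already contains output; a simplified model (not the connector's implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class SaveModeSketch {
    enum SaveMode { OVERWRITE, APPEND, IGNORE, ERROR_IF_EXISTS }

    // Models each save mode's behavior as the resulting set of files
    // at the target path after the write.
    static List<String> resolve(SaveMode mode, List<String> existing, List<String> incoming) {
        if (existing.isEmpty()) return incoming; // fresh path: every mode just writes
        switch (mode) {
            case OVERWRITE:
                return incoming;                 // replace existing output
            case APPEND: {
                List<String> out = new ArrayList<>(existing);
                out.addAll(incoming);            // keep both old and new files
                return out;
            }
            case IGNORE:
                return existing;                 // silently skip the write
            default:
                throw new IllegalStateException("path already exists");
        }
    }

    public static void main(String[] args) {
        List<String> old = List.of("part-0-1.vortex");
        List<String> neu = List.of("part-0-2.vortex");
        System.out.println(resolve(SaveMode.APPEND, old, neu).size()); // 2
    }
}
```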
## Supported Types

The Spark type names below are reconstructed from the standard Spark SQL type system; the Vortex column is as documented.

| Spark Type | Vortex Type |
|---|---|
| BooleanType | Bool |
| ByteType | Int8 / UInt8 |
| ShortType | Int16 / UInt16 |
| IntegerType | Int32 / UInt32 |
| LongType | Int64 / UInt64 |
| FloatType | Float32 |
| DoubleType | Float64 |
| StringType | Utf8 |
| BinaryType | Binary |
| DecimalType | Decimal |
| DateType | Date (days) |
| TimestampType | Timestamp (microseconds, UTC) |
| TimestampNTZType | Timestamp (microseconds, no timezone) |
| ArrayType | List |
| StructType | Struct |
## S3 Support

The connector supports reading from and writing to S3 paths:

```java
Dataset<Row> df = spark.read()
    .format("vortex")
    .option("path", "s3://bucket/path/to/data")
    .load();
```