File Format¶
See also
The majority of the complexity of the Vortex file format is encapsulated in Vortex Layouts. Unless you are interested in the specific byte layout of the file, you are probably looking for that documentation!
Recall that Vortex Layouts provide a mechanism to efficiently query large serialized Vortex arrays. The Vortex File Format is designed to provide a container for these serialized arrays, as well as footer definition that allows efficiently querying the layout.
Other considerations for the Vortex file format include:
Backwards compatibility, and (uniquely) forwards compatibility.
Fine-grained encryption.
Efficient access for both local disk and cloud storage.
Minimal overhead reading few columns or rows from wide or long arrays.
File Specification¶
The Vortex file format has a very small definition, with much of the complexity encapsulated in Vortex Layouts.
<4 bytes> magic number 'VTXF'
... segments of binary data, optionally with inter-segment padding
... postfix data
<2 bytes> u16 version tag
<2 bytes> u16 postfix length
<4 bytes> magic number 'VTXF'
The file format begins and ends with the 4-byte magic number VTXF
.
Immediately prior to the trailing magic number are two 16-bit integers: the version tag and the length of the postfix.
Postfix¶
The postfix contains the locations of the file’s root DType
segment, as well as a FileLayout
segment containing
the root Layout
, a segment map, and other shared configuration such as compression and encryption schemes.
/// The `Postscript` is guaranteed by the file format to never exceed
/// 65528 bytes (i.e., u16::MAX - 8 bytes) in length, and is immediately
/// followed by an 8-byte `EndOfFile` struct.
/// The initial read of a Vortex file is always at least 64KB (u16::MAX bytes) and therefore
/// is guaranteed to cover at least the Postscript.
table Postscript {
dtype: Segment;
footer: Segment;
}
Data Type¶
Both viewed arrays and viewed layouts require an external DType
to instantiate them. This helps us to avoid
redundancy in the serialized format since it is very common for a child array or layout to inherit or infer its data
type from the parent type.
The root DType
segment is a flat buffer serialized DType
object. See DType Format for more
information.
Note
Unlike many columnar formats, the DType
of a Vortex file is not required to be a StructDType
. It is perfectly
valid to store a Float64
array, a Boolean
array, or any other root data type.
Backward Compatibility¶
Backward compatability guarantees that any old Vortex file can be read by newer versions of the Vortex library.
The Vortex File Format is currently considered unstable. We are aiming for an 0.x release in Q1 2025 that guarantees no breaking changes within each minor version of Vortex, and a 1.0 release in H2 2025 that guarantees no breaking changes within a major version of Vortex.
Please upvote or comment on the GitHub issue if you would like to see a stable release sooner.
Forward Compatibility¶
Note
Forward compatibility is planned to ship prior to the 1.0 release.
Forward compatibility guarantees that any new Vortex file can be read by older versions of the Vortex library.
This rare feature allows us to continue to evolve the Vortex File Format, avoiding calcification and remaining up to date with new compression codecs and layout optimizations - all without breaking existing readers or requiring them to be updated.
At write-time, a minimum supported reader version is declared. Any new encodings or layouts are then embedded into the file with WebAssembly decompression logic. Old readers are able to decompress new data (slower than native code, but still with SIMD acceleration) and read the file. New readers are able to make the best use of these encodings with native decompression logic and additional push-down compute functions.