File Format

Important

The Vortex File Format has been considered stable since the release of version 0.36.0. Every later version of the Vortex library is expected to read files written by version 0.36.0 or later, up to and including the version doing the reading.

File Specification

The Vortex file format has a very small definition, with much of the complexity encapsulated in Vortex Layouts.

<4 bytes>  magic number 'VTXF'
...        segments of binary data, optionally with inter-segment padding
...        postscript data
<2 bytes>  u16 version tag
<2 bytes>  u16 postscript length
<4 bytes>  magic number 'VTXF'

The file format begins and ends with the 4-byte magic number VTXF. Immediately prior to the trailing magic number are two 16-bit integers: the version tag and the length of the postscript.
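As a rough illustration of the trailer described above, the following Python sketch reads the last 8 bytes of a file and splits them into the version tag, the postscript length, and the trailing magic number. It is not the Vortex reader API: `parse_trailer` is a hypothetical helper, the example bytes are made up, and little-endian encoding of the two u16 fields is an assumption that this section does not state.

```python
import struct

MAGIC = b"VTXF"
EOF_SIZE = 8  # u16 version + u16 postscript length + 4-byte magic


def parse_trailer(tail: bytes) -> tuple[int, int]:
    """Return (version, postscript_length) from the last 8 bytes of a file."""
    if len(tail) < EOF_SIZE:
        raise ValueError("need at least 8 bytes")
    # Assumption: both u16 fields are little-endian.
    version, postscript_len, magic = struct.unpack("<HH4s", tail[-EOF_SIZE:])
    if magic != MAGIC:
        raise ValueError("missing trailing magic number")
    return version, postscript_len


# Made-up file: leading magic, 10 bytes of segment data,
# a 5-byte stand-in postscript, then the 8-byte trailer.
example = MAGIC + b"\x00" * 10 + b"P" * 5 + struct.pack("<HH4s", 1, 5, MAGIC)
version, postscript_len = parse_trailer(example)

# The postscript sits immediately before the trailer.
postscript = example[-EOF_SIZE - postscript_len : -EOF_SIZE]
```

Because the postscript length is stored next to the trailing magic number, a reader that has fetched the tail of the file can locate the postscript without any further lookup.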

Notably, this minimal notion of a Vortex file effectively includes only the byte ranges, alignment, encryption, and compression configurations for other pieces of metadata.

Figure: Minimal Vortex File

Postscript

The postscript contains the locations of:

a dtype segment representing the top-level logical data type (i.e., schema)
a layout segment containing the root Layout
a statistics segment containing file-level per-field statistics (e.g., minima and maxima of each field/column, for whole-file pruning)
a footer segment containing a dictionary-encoded segment map, and other shared configuration such as compression and encryption schemes

/// The `Postscript` is guaranteed by the file format to never exceed
/// 65528 bytes (i.e., u16::MAX - 8 bytes) in length, and is immediately
/// followed by an 8-byte `EndOfFile` struct.
///
/// An initial read of a Vortex file defaults to at least 64KB (u16::MAX bytes) and therefore
/// is guaranteed to cover at least the Postscript.
///
/// The reason for a postscript at all is to ensure minimal but all necessary footer information
/// can be read in two round trips. Since the DType is optional and possibly large, it lives in
/// its own segment. If the footer were arbitrary size, with a pointer to the DType segment, then
/// in the worst case we would need one round trip to read the footer length, one to read the full
/// footer and parse the DType offset, and a third to fetch the DType segment.
///
/// The segments pointed to by the postscript have inline compression and encryption specs to avoid
/// the need to fetch encryption schemes up-front.
table Postscript {
    /// Segment containing the root `DType` flatbuffer.
    dtype: PostscriptSegment;
    /// Segment containing the root `Layout` flatbuffer (required).
    layout: PostscriptSegment;
    /// Segment containing the file-level `Statistics` flatbuffer.
    statistics: PostscriptSegment;
    /// Segment containing the `Footer` flatbuffer (required).
    footer: PostscriptSegment;
}

/// A `PostscriptSegment` describes the location of a segment in the file without referencing any
/// specification objects. That is, encryption and compression are defined inline.
table PostscriptSegment {
    offset: uint64;
    length: uint32;
    alignment_exponent: uint8;
    _compression: CompressionSpec;
    _encryption: EncryptionSpec;
}
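The size bound in the schema comment can be checked with simple arithmetic: a postscript of at most 65528 bytes plus the 8-byte `EndOfFile` struct is at most 65536 bytes, which is exactly 64 KiB. The Python sketch below works through the resulting byte ranges. The function names are hypothetical, and treating the initial read as exactly 64 KiB from the end of the file is an assumption made for illustration.

```python
MAX_POSTSCRIPT_LEN = 65528    # bound stated in the schema comment
EOF_LEN = 8                   # trailing `EndOfFile` struct
INITIAL_READ_LEN = 64 * 1024  # assumed size of the first tail read


def initial_read_range(file_size: int) -> tuple[int, int]:
    """Half-open byte range [start, end) covered by the first tail read."""
    return max(0, file_size - INITIAL_READ_LEN), file_size


def postscript_range(file_size: int, postscript_len: int) -> tuple[int, int]:
    """Half-open byte range of the postscript, which ends where EndOfFile begins."""
    if postscript_len > MAX_POSTSCRIPT_LEN:
        raise ValueError("postscript exceeds the format's stated bound")
    end = file_size - EOF_LEN
    return end - postscript_len, end


# Worst case: a maximum-size postscript in a 1,000,000-byte file.
read_start, read_end = initial_read_range(1_000_000)
ps_start, ps_end = postscript_range(1_000_000, MAX_POSTSCRIPT_LEN)
```

In this worst case the postscript starts exactly where the initial read starts, so the first read always contains the whole postscript, and the second round trip can fetch the segments it points to.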

Data Type

Both viewed arrays and viewed layouts require an external DType to instantiate them. This helps us to avoid redundancy in the serialized format since it is very common for a child array or layout to inherit or infer its data type from the parent type.

The root DType segment is a FlatBuffers-serialized DType object. See DType Format for more information.

Note

Unlike many columnar formats, the DType of a Vortex file is not required to be a StructDType. It is perfectly valid to store a Float64 array, a Boolean array, or any other root data type.

Footer

The footer is a FlatBuffers-serialized Footer object. This object contains all the information required to load the root Layout object into a usable LayoutReader. For example, it contains the locations, compression schemes, encryption schemes, and required alignment of all segments in the file.

/// The `FileStatistics` object contains file-level statistics for the Vortex file.
table FileStatistics {
    /// Statistics for each field in the root schema. If the root schema is not a struct, there will
    /// be a single entry in this array.
    field_stats: [ArrayStats];
}

/// The `Footer` object stores dictionary-encoded configuration for segments,
/// compression schemes, encryption schemes, etc.
table Footer {
    // Dictionary-encoded array specs, up to u16::MAX.
    array_specs: [ArraySpec];
    // Dictionary-encoded layout specs, up to u16::MAX.
    layout_specs: [LayoutSpec];
    // Dictionary-encoded segment specs, up to u32::MAX.
    segment_specs: [SegmentSpec];
    // Dictionary-encoded compression specs, up to 8 entries (a 3-bit index).
    compression_specs: [CompressionSpec];
    // Dictionary-encoded encryption specs, up to u16::MAX.
    encryption_specs: [EncryptionSpec];
}

/// An `ArraySpec` describes the type of a particular array.
///
/// These are identified by a globally unique string identifier, and looked up in the Vortex registry
/// at read-time.
table ArraySpec {
    id: string (required);
}

/// A `LayoutSpec` describes the type of a particular layout.
///
/// These are identified by a globally unique string identifier, and looked up in the Vortex registry
/// at read-time.
table LayoutSpec {
    id: string (required);
}

/// A `SegmentSpec` acts as the locator for a buffer within the file.
struct SegmentSpec {
    /// Offset relative to the start of the file.
    offset: uint64;
    /// Length in bytes of the segment.
    length: uint32;
    /// Base-2 exponent of the alignment of the segment.
    alignment_exponent: uint8;
    // These two fields are reserved for future use and act as indices
    // into `Footer::compression_specs` and `Footer::encryption_specs`
    // respectively. They are not used in the current version of the file format.
    _compression: uint8;
    _encryption: uint16;
}

enum CompressionScheme: uint8 {
    None = 0,
    LZ4 = 1,
    ZLib = 2,
    ZStd = 3,
}

/// Definition of a compression scheme.
table CompressionSpec {
    scheme: CompressionScheme;
}

table EncryptionSpec {
}
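To make the `SegmentSpec` fields concrete, here is a small Python sketch of how a reader might turn a spec into a byte range, and how a writer might compute inter-segment padding. The class and helper below are illustrative stand-ins rather than Vortex types, and the sketch assumes that the alignment applies to the segment's offset within the file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentSpec:
    offset: int              # bytes from the start of the file
    length: int              # segment length in bytes
    alignment_exponent: int  # alignment is 2 ** alignment_exponent

    @property
    def alignment(self) -> int:
        return 1 << self.alignment_exponent

    def byte_range(self) -> tuple[int, int]:
        """Half-open byte range [start, end) of the segment."""
        return self.offset, self.offset + self.length

    def is_aligned(self) -> bool:
        # Assumption: alignment is measured against the file offset.
        return self.offset % self.alignment == 0


def padding_before(position: int, alignment_exponent: int) -> int:
    """Padding bytes needed so a segment written at `position` starts aligned."""
    alignment = 1 << alignment_exponent
    return (-position) % alignment


spec = SegmentSpec(offset=128, length=100, alignment_exponent=6)
next_padding = padding_before(spec.byte_range()[1], 6)
```

Storing the exponent rather than the alignment itself keeps the field to a single byte while still restricting alignments to powers of two.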

The footer is separated from the Data Type such that large schemas can be omitted from the file if they can be shared or fetched from an external source.

Reified File Example

Since Vortex files are largely self-describing, many mainstays of other columnar file formats (e.g., whether or not to have row groups) are decided by the writer, rather than being a rigid part of the specification. To build intuition, consider an example Vortex file with two non-nullable columns, “A” of type i32, and “B” of type UTF-8. Using the defaults as of June 2025, it might look as follows.

Figure: Reified Vortex File

Backward Compatibility

Backward compatibility means that newer versions of the Vortex library can read files written by older versions. This guarantee is expected to hold for all releases of Vortex from version 0.36.0 onwards.

Forward Compatibility

Warning

Forward compatibility is not yet implemented, but is planned to ship prior to the 1.0 release.

Forward compatibility extends the preceding stability guarantee such that newer Vortex files can be read by older versions of the Vortex library.

The intent of this work is to allow us to continue to evolve the Vortex File Format, avoiding calcification and remaining up-to-date with new compression codecs and layout optimizations – without breaking existing readers or requiring lockstep upgrades.

The plan is for the writer to declare a minimum supported reader version at write time. Any encodings or layouts added after that minimum reader version can then be embedded into the file with WebAssembly decompression logic. Old readers would be able to decompress new data (slower than native code, but still with SIMD acceleration) and read the file. New readers would be able to make the best use of these encodings with native decompression logic and additional push-down compute functions, which also provides an incentive to upgrade.

File Determinism and Reproducibility

Encoding Order Indeterminism

When writing Vortex files, each array segment references its encoding via an integer index into the footer’s array_specs list. During serialization, encodings are registered in the order they are first encountered via calls to ArrayContext::encoding_idx(). With concurrent writes, this encounter order depends on thread scheduling and lock acquisition timing, making the ordering in the footer non-deterministic between runs.

This affects the encoding field in each serialized array segment. The same encoding might receive index 0 in one run and index 1 in another, changing the integer value stored in each array segment that uses that encoding. FlatBuffers optimizes storage by omitting fields that hold their default value (such as 0), so when an encoding index is 0, the field may be omitted from the serialized representation. This saves approximately 2 bytes per affected array segment and, with alignment adjustments, can result in a difference of up to 4 bytes per array segment between runs.
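The effect of first-encounter registration can be sketched in a few lines of Python. The `EncodingRegistry` class below is a hypothetical stand-in for the behaviour described above, not the real `ArrayContext`, and the encoding identifiers are made up. The two calls simulate two runs in which thread scheduling produced different encounter orders.

```python
class EncodingRegistry:
    """Hypothetical stand-in: assigns indices in first-encounter order."""

    def __init__(self) -> None:
        self._ids: list[str] = []

    def encoding_idx(self, encoding_id: str) -> int:
        # Register on first encounter; reuse the same index afterwards.
        if encoding_id not in self._ids:
            self._ids.append(encoding_id)
        return self._ids.index(encoding_id)


def assign(encounter_order: list[str]) -> dict[str, int]:
    """Map each encoding id to the index it receives for this encounter order."""
    registry = EncodingRegistry()
    return {enc: registry.encoding_idx(enc) for enc in encounter_order}


# Same set of encodings, different encounter order between two runs.
run_a = assign(["example.dict", "example.bitpacked", "example.dict"])
run_b = assign(["example.bitpacked", "example.dict", "example.dict"])
```

Both runs register the same set of encodings, but `example.dict` receives index 0 in the first run and index 1 in the second, which is the kind of variation that changes the bytes stored in each affected array segment.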

Note

Despite this non-determinism, the practical impact is minimal:

File size may vary by up to 4 bytes per affected array segment
All file contents remain semantically identical and fully readable
Segment ordering (the actual data layout) remains deterministic and consistent across writes