DTypes

A core principle of Vortex is that its data types (or dtypes) are logical rather than physical. This means that the dtype has no bearing on how the data is actually stored in memory, and is instead used to define the domain of values an array may hold.

For example, a u32 dtype represents an unsigned integer domain with values between 0 and 2^32 - 1, even though the underlying array may store values dictionary-encoded, run-length encoded (RLE), or in any other format!
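
To make this concrete, the following is a minimal pure-Python sketch, not Vortex's API. Plain lists stand in for arrays, and the helper names `decode_dictionary` and `decode_rle` are hypothetical. It shows one logical sequence of u32 values held in three physical layouts.

```python
# Illustrative only: plain Python lists stand in for Vortex arrays.

U32_MAX = 2**32 - 1  # upper bound of the u32 domain


def decode_dictionary(codes, values):
    """Expand dictionary-encoded data: each code indexes into `values`."""
    return [values[c] for c in codes]


def decode_rle(runs):
    """Expand run-length encoded data given (value, run_length) pairs."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


# Three physical layouts of the same logical values.
plain = [7, 7, 7, 42, 42, 7]
dictionary = decode_dictionary(codes=[0, 0, 0, 1, 1, 0], values=[7, 42])
rle = decode_rle([(7, 3), (42, 2), (7, 1)])

# All three describe the same logical array, and every value is in the u32 domain.
assert plain == dictionary == rle
assert all(0 <= v <= U32_MAX for v in plain)
```

The dtype answers "which values are allowed?", while the choice between the three layouts is a separate storage decision.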

This principle enables many of Vortex’s advanced features, such as performing compute directly on compressed data.
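
As a rough illustration of compute on compressed data, the sketch below operates on run-length encoded values without expanding them. It is plain Python under stated assumptions, not how Vortex implements its kernels: runs are modeled as `(value, run_length)` pairs, the function names are hypothetical, and u32 overflow is ignored.

```python
def rle_sum(runs):
    """Sum the logical values of run-length encoded data without decoding it."""
    return sum(value * length for value, length in runs)


def rle_add_scalar(runs, scalar):
    """Add a scalar to every logical value by touching each run once."""
    return [(value + scalar, length) for value, length in runs]


runs = [(7, 3), (42, 2), (7, 1)]   # logically [7, 7, 7, 42, 42, 7]

total = rle_sum(runs)              # 7*3 + 42*2 + 7*1
shifted = rle_add_scalar(runs, 1)  # logically [8, 8, 8, 43, 43, 8]
```

Both functions do work proportional to the number of runs rather than the number of logical values. This is only possible because the result is defined by the logical dtype and not by the layout.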

What is a schema?!

Vortex has no separate concept of a schema. Instead, it uses a struct dtype to represent columnar data. This means you can write a Vortex file containing a single integer array just as well as one with many columns.

Logical Types

The following table lists the built-in dtypes in Vortex, each of which can be marked as either nullable or non-nullable.

Name	Domain
`Null`	`null`
`Bool`	`true`, `false`
`Primitive`	See Primitive
`Decimal`	Fixed-precision real numbers
`Utf8`	Variable-length, valid UTF-8 encoded strings
`Binary`	Arbitrary variable-length bytes
`List`	See List
`FixedSizeList`	See List
`Struct`	See Struct
`Extension`	See Extension

Note

There are additional logical types that Vortex does not yet support, for example fixed-length binary, maps, and variants. These may be added in future versions.

Primitive

Primitive dtypes are an enumeration of different fixed-width primitive values.

Name	Domain
`I8`	8-bit signed integer
`I16`	16-bit signed integer
`I32`	32-bit signed integer
`I64`	64-bit signed integer
`U8`	8-bit unsigned integer
`U16`	16-bit unsigned integer
`U32`	32-bit unsigned integer
`U64`	64-bit unsigned integer
`F16`	IEEE 754-2008 half-precision floating point
`F32`	IEEE 754-1985 single-precision floating point
`F64`	IEEE 754-1985 double-precision floating point
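
The widths and integer domains in the table can be derived mechanically. The sketch below uses Python's standard `struct` module. The mapping from the primitive names to `struct` format codes is an assumption made for illustration, and the helper names are hypothetical.

```python
import struct

# Assumed mapping from the primitive names above to `struct` format codes.
# The "<" prefix selects standard sizes, independent of the platform.
FORMATS = {
    "I8": "<b", "I16": "<h", "I32": "<i", "I64": "<q",
    "U8": "<B", "U16": "<H", "U32": "<I", "U64": "<Q",
    "F16": "<e", "F32": "<f", "F64": "<d",
}


def bit_width(name):
    """Fixed width in bits of one value of the given primitive type."""
    return struct.calcsize(FORMATS[name]) * 8


def integer_domain(name):
    """Inclusive (min, max) for an integer primitive such as "I8" or "U32"."""
    bits = bit_width(name)
    if name.startswith("U"):
        return 0, 2**bits - 1
    if name.startswith("I"):
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    raise ValueError(f"{name} is not an integer type")
```

For example, `integer_domain("U32")` gives the bounds 0 and 2^32 - 1, and `bit_width("F16")` is 16.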

List

A List dtype has a single element type, itself a logical dtype, and represents an array of variable-length sequences of elements of that type.

A FixedSizeList dtype is similar, but the length of each sequence is fixed.
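
The difference between the two domains can be sketched as a pair of membership checks. This is plain Python for illustration only. The function names are hypothetical, and `U8` is used as the element type purely as an example.

```python
def in_list_domain(value, element_ok):
    """True if `value` is a sequence whose elements all satisfy `element_ok`."""
    return isinstance(value, list) and all(element_ok(e) for e in value)


def in_fixed_size_list_domain(value, element_ok, size):
    """Like `in_list_domain`, but the sequence must have exactly `size` elements."""
    return in_list_domain(value, element_ok) and len(value) == size


def is_u8(v):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(v, int) and not isinstance(v, bool) and 0 <= v <= 255


rows = [[1, 2, 3], [], [255]]

# Every row is a valid value of a List of U8, whatever its length...
all_lists = all(in_list_domain(r, is_u8) for r in rows)
# ...but only rows of length 3 are valid for a FixedSizeList of U8 with size 3.
all_fixed = all(in_fixed_size_list_domain(r, is_u8, 3) for r in rows)
```

Here `all_lists` is true and `all_fixed` is false, because two of the rows do not have exactly three elements.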

Struct

A Struct dtype is an ordered collection of named fields, each of which has its own logical dtype.
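
A small model may help here. The classes below are hypothetical stand-ins written with Python dataclasses, not Vortex's own types. They show a struct as an ordered sequence of `(name, dtype)` pairs, with nullability carried by each dtype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    name: str              # e.g. "I64" or "F32"
    nullable: bool = False


@dataclass(frozen=True)
class Utf8:
    nullable: bool = False


@dataclass(frozen=True)
class Struct:
    fields: tuple          # ordered (field_name, dtype) pairs
    nullable: bool = False

    def field_names(self):
        return [name for name, _ in self.fields]

    def field(self, name):
        """Look up a field's dtype by name."""
        for field_name, dtype in self.fields:
            if field_name == name:
                return dtype
        raise KeyError(name)


# What other systems call a table schema is just a struct dtype here.
trips = Struct(fields=(
    ("id", Primitive("U64")),
    ("city", Utf8(nullable=True)),
    ("fare", Primitive("F32", nullable=True)),
))
```

Because the fields are held in a tuple, two structs with the same fields in a different order compare as different dtypes in this model, which mirrors the "ordered collection" wording above.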

Extension

An Extension dtype is a logical dtype with an id, a storage dtype, and a metadata field. The id and metadata fields together may implicitly restrict the domain of values of the storage dtype.

For example, a vortex.date type is logically stored as a U32 representing the number of days since the Unix epoch.
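
The relationship between the storage value and the logical value can be sketched with Python's standard `datetime` module. The function names are hypothetical and this is not Vortex's implementation. It only illustrates the "days since the Unix epoch" convention described above.

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)


def date_to_days(d):
    """Storage value: whole days between the Unix epoch and `d`."""
    return (d - UNIX_EPOCH).days


def days_to_date(days):
    """Logical value: the calendar date `days` after the Unix epoch."""
    return UNIX_EPOCH + timedelta(days=days)


stored = date_to_days(date(2024, 1, 1))   # the integer that would be stored
restored = days_to_date(stored)           # back to the calendar date
```

The stored integers are ordinary values of the storage dtype. It is the extension id and metadata that tell a reader to interpret them as dates.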

Vs. Arrow

This section helps readers familiar with Apache Arrow quickly understand how Vortex’s dtypes differ from Arrow’s data types.

In Arrow, nullability is tied to a pyarrow.Field rather than the data type. Data types in Vortex instead always define explicit nullability.
In Arrow, there are multiple ways to describe the same logical data type, for example pyarrow.string() and pyarrow.large_string() both represent UTF-8 values. In Vortex, there is a single Utf8 dtype.
In Arrow, encoded data is described with additional data types, for example pyarrow.dictionary(). In Vortex, encodings are a distinct concept from dtypes.
In Arrow, date and time types are defined as first-class data types. In Vortex, these are represented as Extension dtypes, since they can be composed of other, more primitive logical dtypes.
In Arrow, tables and record batches have a schema that defines the types of the columns. Vortex makes no distinction between a data type and a schema. Columnar data can be stored with a struct dtype, and integer data can be stored equally well without a top-level struct.