Vortex Data Types
A core principle of Vortex is that its data types (or dtypes) are logical rather than physical. This means that the dtype has no bearing on how the data is actually stored in memory, and is instead used to define the domain of values an array may hold.
For example, a u32 dtype represents an unsigned integer domain with values between 0 and 2^32 - 1, even though the underlying array may store values dictionary-encoded, run-length encoded (RLE), or in any other format.
This principle enables many of Vortex’s advanced features, such as performing compute directly on compressed data.
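To make the logical/physical split concrete, the following plain-Python sketch stores the same logical u32 values in two physical layouts and checks that both decode to the same domain values. It does not use Vortex's API; the helper names (dict_encode, rle_encode, rle_sum, and so on) are hypothetical and chosen only for illustration.

```python
from itertools import groupby

# Logical values: every element lies in the u32 domain [0, 2**32 - 1].
values = [7, 7, 7, 42, 42, 7, 7]

def dict_encode(xs):
    """Dictionary encoding: the unique values plus one code per element."""
    uniques = sorted(set(xs))
    index = {v: i for i, v in enumerate(uniques)}
    return uniques, [index[v] for v in xs]

def dict_decode(uniques, codes):
    return [uniques[c] for c in codes]

def rle_encode(xs):
    """Run-length encoding: (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(xs)]

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

def rle_sum(runs):
    """Sum computed on the runs directly, without decoding them first."""
    return sum(v * n for v, n in runs)

uniques, codes = dict_encode(values)
runs = rle_encode(values)

# Both physical layouts describe the same logical array.
same_logical_values = dict_decode(uniques, codes) == rle_decode(runs) == values
```

The rle_sum helper hints at what compute on compressed data can look like: each run contributes value times length, so the result matches a sum over the decoded values without materializing them.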
What is a schema?!
It is worth noting now that Vortex has no concept of a schema, instead preferring to use a struct dtype to represent columnar data. This means you can write a Vortex file containing a single integer array just as well as writing one with many columns.
Owned vs Viewed
As with other possibly large recursive data structures in Vortex, dtypes can be either owned or viewed. Owned dtypes are heap-allocated, while viewed dtypes are lazily unwrapped from an underlying FlatBuffer representation. This allows Vortex to efficiently load and work with very wide data types without needing to deserialize the full type in memory.
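The owned/viewed distinction can be sketched in plain Python as follows. This is a toy model under stated assumptions, not Vortex's implementation: JSON strings stand in for the FlatBuffer representation, and the class names OwnedStruct and ViewedStruct are hypothetical.

```python
import json

class OwnedStruct:
    """Owned: every field dtype is fully materialized up front."""
    def __init__(self, names, blobs):
        self.fields = {n: json.loads(b) for n, b in zip(names, blobs)}

    def field(self, name):
        return self.fields[name]

class ViewedStruct:
    """Viewed: keeps the serialized form and decodes a field only on access."""
    def __init__(self, names, blobs):
        self.names = names   # field names
        self.blobs = blobs   # one serialized dtype per field (JSON stands in for FlatBuffers)
        self.decoded = 0     # number of field dtypes deserialized so far

    def field(self, name):
        self.decoded += 1
        return json.loads(self.blobs[self.names.index(name)])

# A very wide struct dtype with 10,000 columns.
names = [f"col{i}" for i in range(10_000)]
blobs = [json.dumps("u32" if i % 2 == 0 else "utf8") for i in range(10_000)]

viewed = ViewedStruct(names, blobs)
dtype = viewed.field("col3")  # only this one field is deserialized
```

In this sketch, constructing OwnedStruct decodes all 10,000 field dtypes immediately, while ViewedStruct pays the decoding cost only for the fields that are actually inspected.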
Logical Types
The following table lists the built-in dtypes in Vortex, each of which can be marked as either nullable or non-nullable.
Name | Domain |
---|---|
Null | null |
Bool | true, false |
Primitive | See Primitive |
UTF8 | Variable length valid utf-8 encoded strings |
Binary | Arbitrary variable length bytes |
Struct | See Struct |
List | See List |
Extension | See Extension |
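As an illustration of how nullability combines with a logical domain, here is a minimal Python sketch. The DType class and the in_domain function are hypothetical and are not part of Vortex; they only model the idea that a dtype describes which values an array may hold, with null allowed only when the dtype is marked nullable.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DType:
    kind: str               # hypothetical tag: "u32", "utf8", or "binary"
    nullable: bool = False  # nullability is part of the dtype itself

def in_domain(dtype, value):
    """Return True if value belongs to the logical domain of dtype."""
    if value is None:
        return dtype.nullable
    if dtype.kind == "u32":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (isinstance(value, int) and not isinstance(value, bool)
                and 0 <= value <= 2**32 - 1)
    if dtype.kind == "utf8":
        return isinstance(value, str)
    if dtype.kind == "binary":
        return isinstance(value, bytes)
    raise ValueError(f"unknown dtype kind: {dtype.kind}")

u32 = DType("u32")
nullable_u32 = DType("u32", nullable=True)
```

Note that in_domain says nothing about how values are stored; it checks membership in the logical domain only.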
Note
There are additional logical types that Vortex does not yet support, for example fixed-length variants of the binary, utf-8, and list types, as well as a map type. These may be added in future versions.
Primitive
Primitive dtypes are an enumeration of different fixed-width primitive values.
Name | Domain |
---|---|
I8 | 8-bit signed integer |
I16 | 16-bit signed integer |
I32 | 32-bit signed integer |
I64 | 64-bit signed integer |
U8 | 8-bit unsigned integer |
U16 | 16-bit unsigned integer |
U32 | 32-bit unsigned integer |
U64 | 64-bit unsigned integer |
F16 | IEEE 754-2008 half |
F32 | IEEE 754-1985 single |
F64 | IEEE 754-1985 double |
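The integer domains in the table follow directly from the bit width and signedness. The short Python sketch below computes them, assuming two's-complement representation for the signed types; the function name int_domain is hypothetical.

```python
def int_domain(bits, signed):
    """Return the inclusive (min, max) range of a fixed-width integer type."""
    if signed:
        # Two's complement: one bit pattern more on the negative side.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

# Ranges for every integer width listed in the table.
domains = {
    f"{'I' if signed else 'U'}{bits}": int_domain(bits, signed)
    for signed in (True, False)
    for bits in (8, 16, 32, 64)
}
```

For example, domains["U32"] is (0, 4294967295), which matches the range of 0 to 2^32 - 1 given for u32 earlier.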
Struct
A Struct dtype is an ordered collection of named fields, each of which has its own logical dtype.
List
A List dtype has a single element type, itself a logical dtype, and represents an array of variable-length sequences of elements of that type.
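Because struct fields and list elements are themselves logical dtypes, these types nest freely. The following Python sketch models that composition; the class names Leaf, ListDType, and StructDType, and the describe function, are hypothetical and do not mirror Vortex's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Leaf:
    name: str        # a non-nested dtype such as "u64" or "utf8"

@dataclass(frozen=True)
class ListDType:
    element: object  # the single element dtype

@dataclass(frozen=True)
class StructDType:
    fields: tuple    # ordered (field_name, dtype) pairs

def describe(dtype):
    """Render a nested dtype as a compact string."""
    if isinstance(dtype, Leaf):
        return dtype.name
    if isinstance(dtype, ListDType):
        return f"list({describe(dtype.element)})"
    if isinstance(dtype, StructDType):
        inner = ", ".join(f"{n}={describe(t)}" for n, t in dtype.fields)
        return "{" + inner + "}"
    raise TypeError(f"not a dtype: {dtype!r}")

# A struct with a scalar column and a list-of-strings column.
row = StructDType((("id", Leaf("u64")), ("tags", ListDType(Leaf("utf8")))))
text = describe(row)
```

Since the fields are held in a tuple, two structs with the same fields in a different order compare as different dtypes, which reflects the "ordered collection" wording above.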
Extension
An Extension dtype is a logical dtype with an id, a storage dtype, and a metadata field. The id and metadata fields together may implicitly restrict the domain of values of the storage dtype.
For example, a vortex.date type is logically stored as a U32 representing the number of days since the Unix epoch.
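The date example can be sketched in Python as follows. The ExtensionDType class and the to_storage and from_storage helpers are hypothetical, and leaving the metadata empty is an assumption made for this illustration; only the id, the U32 storage, and the days-since-epoch meaning come from the text above.

```python
from dataclasses import dataclass
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)  # the Unix epoch

@dataclass(frozen=True)
class ExtensionDType:
    id: str
    storage: str          # name of the storage dtype
    metadata: bytes = b"" # assumed empty in this sketch

DATE = ExtensionDType(id="vortex.date", storage="u32")

def to_storage(d):
    """Convert a calendar date to its u32 storage value (days since epoch)."""
    days = (d - EPOCH).days
    if not 0 <= days <= 2**32 - 1:
        raise ValueError("date is outside the u32 storage domain")
    return days

def from_storage(days):
    """Convert a u32 storage value back to a calendar date."""
    return EPOCH + timedelta(days=days)

stored = to_storage(date(2000, 1, 1))
```

The storage value is just an unsigned integer; it is the extension id that tells a reader to interpret it as a date rather than a count.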
Vs. Arrow
This section helps readers familiar with Apache Arrow quickly understand how Vortex’s dtypes differ.
- In Arrow, nullability is tied to a pyarrow.Field rather than the data type. Data types in Vortex instead always define explicit nullability.
- In Arrow, there are multiple ways to describe the same logical data type, for example pyarrow.string() and pyarrow.large_string() both represent UTF-8 values. In Vortex, there is a single UTF8 dtype.
- In Arrow, encoded data is described with additional data types, for example pyarrow.dictionary(). In Vortex, encodings are a distinct concept from dtypes.
- In Arrow, date and time types are defined as first-class data types. In Vortex, these are represented as Extension dtypes, since they can be composed of other, more primitive logical dtypes.
- In Arrow, tables and record batches have a schema that defines the types of the columns. Vortex makes no distinction between a data type and a schema. Columnar data can be stored with a struct dtype, and integer data can be stored equally well without a top-level struct.