Input and Output

Vortex arrays support reading from and writing to local and remote file systems, including plain-old HTTP, S3, Google Cloud Storage, and Azure Blob Storage.
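
For example, vortex.open() accepts either a local path or a URL, so the same call works against local files and remote object storage (the bucket and object names below are illustrative):

>>> import vortex as vx
>>> local = vx.open("data.vortex")
>>> remote = vx.open("s3://my-bucket/path/to/data.vortex")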

open – Lazily open a Vortex file located at the given path or URL.

VortexFile – A lazily opened Vortex file.

RepeatedScan – A prepared scan that is optimized for repeated execution.

read_url – Read a vortex struct array from a URL.

write – Write a vortex struct array to the local filesystem.


vortex.open(path: str, *, without_segment_cache: bool = False) → VortexFile

Lazily open a Vortex file located at the given path or URL.

Parameters:
  • path (str) – A local path or URL to the Vortex file.

  • without_segment_cache (bool) – If True, disable the segment cache for this file; useful when memory is constrained.

Examples

Open a Vortex file and perform a scan operation:

>>> import vortex as vx
>>> vxf = vx.open("data.vortex")
>>> array_iterator = vxf.scan()
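
Open the same file with the segment cache disabled, which can help when memory is constrained:

>>> vxf = vx.open("data.vortex", without_segment_cache=True)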

See also: vortex.dataset.VortexDataset

final class vortex.VortexFile(file: VortexFile)

A lazily opened Vortex file.

property dtype: DType

The dtype of the file.

scan(projection: Expr | list[str] | None = None, *, expr: Expr | None = None, indices: Array | None = None, batch_size: int | None = None) → ArrayIterator

Scan the Vortex file returning a vortex.ArrayIterator.

Parameters:
  • projection (vortex.Expr | list[str] | None) – The projection expression to read, or else read all columns.

  • expr (vortex.Expr | None) – The predicate used to filter rows. The filter columns do not need to be in the projection.

  • indices (vortex.Array | None) – The indices of the rows to read. Must be sorted and non-null.

  • batch_size (int | None) – The number of rows to read per chunk.

Examples

Scan a file with a structured column and nulls at multiple levels and in multiple columns.

>>> import vortex as vx
>>> import vortex.expr as ve
>>> a = vx.array([
...     {'name': 'Joseph', 'age': 25},
...     {'name': None, 'age': 31},
...     {'name': 'Angela', 'age': None},
...     {'name': 'Mikhail', 'age': 57},
...     {'name': None, 'age': None},
... ])
>>> vx.io.write(a, "a.vortex")
>>> vxf = vx.open("a.vortex")
>>> vxf.scan().read_all().to_arrow_array()
<pyarrow.lib.StructArray object at ...>
-- is_valid: all not null
-- child 0 type: int64
  [
    25,
    31,
    null,
    57,
    null
  ]
-- child 1 type: string_view
  [
    "Joseph",
    null,
    "Angela",
    "Mikhail",
    null
  ]

Read just the age column:

>>> vxf.scan(['age']).read_all().to_arrow_array()
<pyarrow.lib.StructArray object at ...>
-- is_valid: all not null
-- child 0 type: int64
  [
    25,
    31,
    null,
    57,
    null
  ]

Keep rows with an age above 35. When the file format allows, this reads only O(N_KEPT) rows.

>>> vxf.scan(expr=ve.column("age") > 35).read_all().to_arrow_array()
<pyarrow.lib.StructArray object at ...>
-- is_valid: all not null
-- child 0 type: int64
  [
    57
  ]
-- child 1 type: string_view
  [
    "Mikhail"
  ]
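
Read only specific rows by passing sorted, non-null indices; batch_size controls the number of rows per chunk (a sketch using the indices and batch_size parameters described above):

>>> subset = vxf.scan(indices=vx.array([0, 3]), batch_size=1024).read_all()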

to_arrow(projection: Expr | list[str] | None = None, *, expr: Expr | None = None, batch_size: int | None = None) → RecordBatchReader

Scan the Vortex file as a pyarrow.RecordBatchReader.

Parameters:
  • projection (vortex.Expr | list[str] | None) – Either an expression over the columns of the file (only referenced columns will be read from the file) or an explicit list of desired columns.

  • expr (vortex.Expr | None) – The predicate used to filter rows. The filter columns need not appear in the projection.

  • batch_size (int | None) – The number of rows to read per chunk.
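
Examples

A sketch of reading the file as a pyarrow.RecordBatchReader and collecting it into a Table (the column name follows the earlier examples):

>>> reader = vxf.to_arrow(['age'])
>>> table = reader.read_all()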

to_dataset() → VortexDataset

Scan the Vortex file using the pyarrow.dataset.Dataset API.
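
Examples

A sketch that assumes the standard pyarrow.dataset.Dataset methods, such as to_table(), are available on the returned dataset (the column name is illustrative):

>>> ds = vxf.to_dataset()
>>> table = ds.to_table(columns=['age'])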

to_polars() → LazyFrame

Read the Vortex file as a pl.LazyFrame, supporting column pruning and predicate pushdown.
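
Examples

A sketch of querying the file through Polars, which can benefit from the LazyFrame's column pruning and predicate pushdown (the column name follows the earlier examples):

>>> import polars as pl
>>> df = vxf.to_polars().select('age').filter(pl.col('age') > 35).collect()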

to_repeated_scan(projection: Expr | list[str] | None = None, *, expr: Expr | None = None, indices: Array | None = None, batch_size: int | None = None) → RepeatedScan

Prepare a scan of the Vortex file for repeated reads, returning a vortex.RepeatedScan.

Parameters:
  • projection (vortex.Expr | list[str] | None) – The projection expression to read, or else read all columns.

  • expr (vortex.Expr | None) – The predicate used to filter rows. The filter columns do not need to be in the projection.

  • indices (vortex.Array | None) – The indices of the rows to read. Must be sorted and non-null.

  • batch_size (int | None) – The number of rows to read per chunk.
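
Examples

A sketch of preparing a scan once and executing it over several row ranges (reusing the a.vortex file written in the earlier examples):

>>> import vortex as vx
>>> scan = vx.open("a.vortex").to_repeated_scan(['age'])
>>> first = scan.execute(row_range=(0, 2)).read_all()
>>> rest = scan.execute(row_range=(2, 5)).read_all()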

final class vortex.RepeatedScan(scan: RepeatedScan)

A prepared scan that is optimized for repeated execution.

execute(*, row_range: tuple[int, int] | None = None) → ArrayIterator

Execute the scan returning a vortex.ArrayIterator.

Parameters:
  • row_range (tuple[int, int] | None) – Tuple is interpreted as [start, stop).

Examples

Scan a file with a structured column and nulls at multiple levels and in multiple columns.

>>> import vortex as vx
>>> import vortex.expr as ve
>>> a = vx.array([
...     {'name': 'Joseph', 'age': 25},
...     {'name': None, 'age': 31},
...     {'name': 'Angela', 'age': None},
...     {'name': 'Mikhail', 'age': 57},
...     {'name': None, 'age': None},
... ])
>>> vx.io.write(a, "a.vortex")
>>> scan = vx.open("a.vortex").to_repeated_scan()
>>> scan.execute(row_range=(1, 3)).read_all().to_arrow_array()
<pyarrow.lib.StructArray object at ...>
-- is_valid: all not null
-- child 0 type: int64
  [
    31,
    null
  ]
-- child 1 type: string_view
  [
    null,
    "Angela"
  ]

scalar_at(index: int) → Scalar

Fetch a scalar from the scan returning a vortex.Scalar.

Parameters:
  • index (int) – The row index to fetch. Raises an IndexError if out of bounds or if the given row index was not included in the scan.

Examples

Scan a file with a structured column and nulls at multiple levels and in multiple columns.

>>> import vortex as vx
>>> import vortex.expr as ve
>>> a = vx.array([
...     {'name': 'Joseph', 'age': 25},
...     {'name': None, 'age': 31},
...     {'name': 'Angela', 'age': None},
...     {'name': 'Mikhail', 'age': 57},
...     {'name': None, 'age': None},
... ])
>>> vx.io.write(a, "a.vortex")
>>> scan = vx.open("a.vortex").to_repeated_scan()
>>> scan.scalar_at(1)
<vortex.StructScalar object at ...>
class vortex.io.VortexWriteOptions

Write Vortex files with custom configuration.

static compact()

Prioritize small size over read-throughput and read-latency.

Let’s model some stock ticker data. As you may know, the stock market always (noisily) goes up:

>>> import os
>>> import random
>>> import vortex as vx
>>> sprl = vx.array([random.randint(i, i + 10) for i in range(100_000)])

If we naively wrote 4 bytes for each of these 100,000 integers to a file, we’d have 400,000 bytes! Let’s see how small the file is when we write with the default Vortex write options (which are also used by vortex.io.write()):

>>> vx.io.VortexWriteOptions.default().write_path(sprl, "chonky.vortex")
>>> os.path.getsize('chonky.vortex')
215196

Wow, Vortex manages to use about two bytes per integer! So advanced. So tiny.

But can we do better?

We sure can.

>>> vx.io.VortexWriteOptions.compact().write_path(sprl, "tiny.vortex")
>>> os.path.getsize('tiny.vortex')
54200

Random numbers are not (usually) composed of random bytes!

static default()

Balance size, read-throughput, and read-latency.

write_path(iter, path)

Write an array or iterator of arrays into a local file.

Parameters:
  • iter – The array, or iterator of arrays, to write.

  • path – The local file path to write to.

Examples

Write a single Vortex array a to the local file a.vortex using the default settings:

>>> import vortex as vx
>>> a = vx.array([0, 1, 2, 3, None, 4])
>>> vx.io.VortexWriteOptions.default().write_path(a, "a.vortex")

Write the same array while preferring small file sizes over read-throughput and read-latency:

>>> import vortex as vx
>>> vx.io.VortexWriteOptions.compact().write_path(a, "a.vortex")
vortex.io.read_url(url, *, projection=None, row_filter=None, indices=None, row_range=None)

Read a vortex struct array from a URL.

Parameters:
  • url (str) – The URL to read from.

  • projection (list[str | int] | None) – The columns to read identified either by their index or name.

  • row_filter (Expr | None) – Keep only the rows for which this expression evaluates to true.

  • indices (Array | None) – A list of rows to keep identified by the zero-based index within the file. NB: If row_range is specified, these indices are within the row range, not the file!

  • row_range (tuple[int, int] | None) – A left-inclusive, right-exclusive range of rows to read.

Examples

Read an array from an HTTPS URL:

>>> import vortex as vx
>>> a = vx.io.read_url("https://example.com/dataset.vortex")

Read an array from an S3 URL:

>>> a = vx.io.read_url("s3://bucket/path/to/dataset.vortex")

Read an array from an Azure Blob File System URL:

>>> a = vx.io.read_url("abfss://my_file_system@my_account.dfs.core.windows.net/path/to/dataset.vortex")

Read an array from an Azure Blob Storage URL:

>>> a = vx.io.read_url("https://my_account.blob.core.windows.net/my_container/path/to/dataset.vortex")

Read an array from a Google Storage URL:

>>> a = vx.io.read_url("gs://bucket/path/to/dataset.vortex")

Read an array from a local file URL:

>>> a = vx.io.read_url("file:/path/to/dataset.vortex")
vortex.io.write(iter, path)

Write a vortex struct array to the local filesystem.

Parameters:
  • iter – The struct array, or iterator of arrays, to write.

  • path – The local file path to write to.

Examples

Write a single Vortex array a to the local file a.vortex.

>>> import vortex as vx
>>> a = vx.array([
...     {'x': 1},
...     {'x': 2},
...     {'x': 10},
...     {'x': 11},
...     {'x': None},
... ])
>>> vx.io.write(a, "a.vortex")

Write a PyArrow Table directly to Vortex without first converting it to a Vortex array:

>>> import pyarrow as pa
>>> import vortex as vx
>>> table = pa.table({'x': [1, 2, 3, 4, 5]})
>>> vx.io.write(table, "streamed.vortex")

Stream from a PyArrow RecordBatchReader:

>>> import pyarrow as pa
>>> import vortex as vx
>>> schema = pa.schema([('x', pa.int64())])
>>> batches = [pa.record_batch([pa.array([1, 2, 3])], schema=schema)]
>>> reader = pa.RecordBatchReader.from_batches(schema, batches)
>>> vx.io.write(reader, "streamed.vortex")