Input and Output¶
Vortex arrays support reading and writing to local and remote file systems, including plain-old HTTP, S3, Google Cloud Storage, and Azure Blob Storage.
vortex.open | Lazily open a Vortex file located at the given path or URL.
vortex.RepeatedScan | A prepared scan that is optimized for repeated execution.
vortex.io.read_url | Read a vortex struct array from a URL.
vortex.io.write | Write a vortex struct array to the local filesystem.
- vortex.open(path: str, *, without_segment_cache: bool = False) VortexFile ¶
Lazily open a Vortex file located at the given path or URL.
- Parameters:
path (str) – The path or URL of the Vortex file.
Examples
Open a Vortex file and perform a scan operation:
>>> import vortex as vx
>>> vxf = vx.open("data.vortex")
>>> array_iterator = vxf.scan()
See also:
vortex.dataset.VortexDataset
- final class vortex.VortexFile(file: VortexFile)¶
- scan(projection: Expr | list[str] | None = None, *, expr: Expr | None = None, indices: Array | None = None, batch_size: int | None = None) ArrayIterator ¶
Scan the Vortex file, returning a vortex.ArrayIterator.
- Parameters:
projection (vortex.Expr | list[str] | None) – The projection expression to read, or else read all columns.
expr (vortex.Expr | None) – The predicate used to filter rows. The filter columns do not need to be in the projection.
indices (vortex.Array | None) – The indices of the rows to read. Must be sorted and non-null.
batch_size (int | None) – The number of rows to read per chunk.
Examples
Scan a file with a structured column and nulls at multiple levels and in multiple columns.
>>> import vortex as vx
>>> import vortex.expr as ve
>>> a = vx.array([
...     {'name': 'Joseph', 'age': 25},
...     {'name': None, 'age': 31},
...     {'name': 'Angela', 'age': None},
...     {'name': 'Mikhail', 'age': 57},
...     {'name': None, 'age': None},
... ])
>>> vx.io.write(a, "a.vortex")
>>> vxf = vx.open("a.vortex")
>>> vxf.scan().read_all().to_arrow_array()
<pyarrow.lib.StructArray object at ...>
-- is_valid: all not null
-- child 0 type: int64
  [
    25,
    31,
    null,
    57,
    null
  ]
-- child 1 type: string_view
  [
    "Joseph",
    null,
    "Angela",
    "Mikhail",
    null
  ]
Read just the age column:
>>> vxf.scan(['age']).read_all().to_arrow_array()
<pyarrow.lib.StructArray object at ...>
-- is_valid: all not null
-- child 0 type: int64
  [
    25,
    31,
    null,
    57,
    null
  ]
Keep rows with an age above 35. When the file format allows, this reads only O(N_KEPT) rows.
>>> vxf.scan(expr=ve.column("age") > 35).read_all().to_arrow_array()
<pyarrow.lib.StructArray object at ...>
-- is_valid: all not null
-- child 0 type: int64
  [
    57
  ]
-- child 1 type: string_view
  [
    "Mikhail"
  ]
- to_arrow(projection: Expr | list[str] | None = None, *, expr: Expr | None = None, batch_size: int | None = None) RecordBatchReader ¶
Scan the Vortex file as a pyarrow.RecordBatchReader.
- Parameters:
projection (vortex.Expr | list[str] | None) – Either an expression over the columns of the file (only referenced columns will be read from the file) or an explicit list of desired columns.
expr (vortex.Expr | None) – The predicate used to filter rows. The filter columns need not appear in the projection.
batch_size (int | None) – The number of rows to read per chunk.
- to_dataset() VortexDataset ¶
Scan the Vortex file using the pyarrow.dataset.Dataset API.
- to_polars() LazyFrame ¶
Read the Vortex file as a polars.LazyFrame, supporting column pruning and predicate pushdown.
- to_repeated_scan(projection: Expr | list[str] | None = None, *, expr: Expr | None = None, indices: Array | None = None, batch_size: int | None = None) RepeatedScan ¶
Prepare a scan of the Vortex file for repeated reads, returning a vortex.RepeatedScan.
- Parameters:
projection (vortex.Expr | list[str] | None) – The projection expression to read, or else read all columns.
expr (vortex.Expr | None) – The predicate used to filter rows. The filter columns do not need to be in the projection.
indices (vortex.Array | None) – The indices of the rows to read. Must be sorted and non-null.
batch_size (int | None) – The number of rows to read per chunk.
- final class vortex.RepeatedScan(scan: RepeatedScan)¶
A prepared scan that is optimized for repeated execution.
- execute(*, row_range: tuple[int, int] | None = None) ArrayIterator ¶
Execute the scan, returning a vortex.ArrayIterator.
Examples
Scan a file with a structured column and nulls at multiple levels and in multiple columns.
>>> import vortex as vx
>>> import vortex.expr as ve
>>> a = vx.array([
...     {'name': 'Joseph', 'age': 25},
...     {'name': None, 'age': 31},
...     {'name': 'Angela', 'age': None},
...     {'name': 'Mikhail', 'age': 57},
...     {'name': None, 'age': None},
... ])
>>> vx.io.write(a, "a.vortex")
>>> scan = vx.open("a.vortex").to_repeated_scan()
>>> scan.execute(row_range=(1, 3)).read_all().to_arrow_array()
<pyarrow.lib.StructArray object at ...>
-- is_valid: all not null
-- child 0 type: int64
  [
    31,
    null
  ]
-- child 1 type: string_view
  [
    null,
    "Angela"
  ]
- scalar_at(index: int) Scalar ¶
Fetch a scalar from the scan, returning a vortex.Scalar.
- Parameters:
index (int) – The row index to fetch. Raises an IndexError if out of bounds or if the given row index was not included in the scan.
Examples
Scan a file with a structured column and nulls at multiple levels and in multiple columns.
>>> import vortex as vx
>>> import vortex.expr as ve
>>> a = vx.array([
...     {'name': 'Joseph', 'age': 25},
...     {'name': None, 'age': 31},
...     {'name': 'Angela', 'age': None},
...     {'name': 'Mikhail', 'age': 57},
...     {'name': None, 'age': None},
... ])
>>> vx.io.write(a, "a.vortex")
>>> scan = vx.open("a.vortex").to_repeated_scan()
>>> scan.scalar_at(1)
<vortex.StructScalar object at ...>
- class vortex.io.VortexWriteOptions¶
Write Vortex files with custom configuration.
- static compact()¶
Prioritize small size over read-throughput and read-latency.
Let’s model some stock ticker data. As you may know, the stock market always (noisily) goes up:
>>> import os
>>> import random
>>> import vortex as vx
>>> sprl = vx.array([random.randint(i, i + 10) for i in range(100_000)])
If we naively wrote 4 bytes for each of these integers to a file, we’d have 400,000 bytes! Let’s see how small the file is when we write with the default Vortex write options (which are also used by vortex.io.write()):
>>> vx.io.VortexWriteOptions.default().write_path(sprl, "chonky.vortex")
>>> os.path.getsize('chonky.vortex')
215196
Wow, Vortex manages to use about two bytes per integer! So advanced. So tiny.
But can we do better?
We sure can.
>>> vx.io.VortexWriteOptions.compact().write_path(sprl, "tiny.vortex")
>>> os.path.getsize('tiny.vortex')
54200
Random numbers are not (usually) composed of random bytes!
- static default()¶
Balance size, read-throughput, and read-latency.
- write_path(iter, path)¶
Write an array or iterator of arrays into a local file.
- Parameters:
iter (vortex.Array | vortex.ArrayIterator | pyarrow.Table | pyarrow.RecordBatchReader) – The data to write. Can be a single array, an array iterator, or a PyArrow object that supports streaming. When using PyArrow objects, data is streamed directly without loading the entire dataset into memory.
path (str) – The file path.
Examples
Write a single Vortex array a to the local file a.vortex using the default settings:
>>> import vortex as vx
>>> a = vx.array([0, 1, 2, 3, None, 4])
>>> vx.io.VortexWriteOptions.default().write_path(a, "a.vortex")
Write the same array while preferring small file sizes over read-throughput and read-latency:
>>> import vortex as vx
>>> vx.io.VortexWriteOptions.compact().write_path(a, "a.vortex")
- vortex.io.read_url(url, *, projection=None, row_filter=None, indices=None, row_range=None)¶
Read a vortex struct array from a URL.
- Parameters:
url (str) – The URL to read from.
projection (list[str | int] | None) – The columns to read identified either by their index or name.
row_filter (Expr | None) – Keep only the rows for which this expression evaluates to true.
indices (Array | None) – A list of rows to keep identified by the zero-based index within the file. NB: If row_range is specified, these indices are within the row range, not the file!
row_range (tuple[int, int] | None) – A left-inclusive, right-exclusive range of rows to read.
Examples
Read an array from an HTTPS URL:
>>> import vortex as vx >>> a = vx.io.read_url("https://example.com/dataset.vortex")
Read an array from an S3 URL:
>>> a = vx.io.read_url("s3://bucket/path/to/dataset.vortex")
Read an array from an Azure Blob File System URL:
>>> a = vx.io.read_url("abfss://my_file_system@my_account.dfs.core.windows.net/path/to/dataset.vortex")
Read an array from an Azure Blob Storage URL:
>>> a = vx.io.read_url("https://my_account.blob.core.windows.net/my_container/path/to/dataset.vortex")
Read an array from a Google Storage URL:
>>> a = vx.io.read_url("gs://bucket/path/to/dataset.vortex")
Read an array from a local file URL:
>>> a = vx.io.read_url("file:/path/to/dataset.vortex")
- vortex.io.write(iter, path)¶
Write a vortex struct array to the local filesystem.
- Parameters:
iter (vortex.Array | vortex.ArrayIterator | pyarrow.Table | pyarrow.RecordBatchReader) – The data to write. Can be a single array, an array iterator, or a PyArrow object that supports streaming. When using PyArrow objects, data is streamed directly without loading the entire dataset into memory.
path (str) – The file path.
Examples
Write a single Vortex array a to the local file a.vortex.
>>> import vortex as vx
>>> a = vx.array([
...     {'x': 1},
...     {'x': 2},
...     {'x': 10},
...     {'x': 11},
...     {'x': None},
... ])
>>> vx.io.write(a, "a.vortex")
Stream a PyArrow Table directly to Vortex without loading into memory:
>>> import pyarrow as pa
>>> import vortex as vx
>>> table = pa.table({'x': [1, 2, 3, 4, 5]})
>>> vx.io.write(table, "streamed.vortex")
Stream from a PyArrow RecordBatchReader:
>>> import pyarrow as pa
>>> import vortex as vx
>>> schema = pa.schema([('x', pa.int64())])
>>> batches = [pa.record_batch([pa.array([1, 2, 3])], schema=schema)]
>>> reader = pa.RecordBatchReader.from_batches(schema, batches)
>>> vx.io.write(reader, "streamed.vortex")