Input and Output¶
Vortex arrays support reading and writing to local and remote file systems, including plain-old HTTP, S3, Google Cloud Storage, and Azure Blob Storage.
vortex.open | Lazily open a Vortex file located at the given path or URL.
vortex.RepeatedScan | A prepared scan that is optimized for repeated execution.
vortex.io.read_url | Read a vortex struct array from a URL.
vortex.io.write | Write a vortex struct array to the local filesystem.
- vortex.open(path: str, *, without_segment_cache: bool = False) VortexFile ¶
Lazily open a Vortex file located at the given path or URL.
- Parameters:
path (str) – The path or URL of the Vortex file.
Examples
Open a Vortex file and perform a scan operation:
>>> import vortex as vx
>>> vxf = vx.open("data.vortex")
>>> array_iterator = vxf.scan()
See also:
vortex.dataset.VortexDataset
- final class vortex.VortexFile(file: VortexFile)¶
- scan(projection: Expr | list[str] | None = None, *, expr: Expr | None = None, indices: Array | None = None, batch_size: int | None = None) ArrayIterator ¶
Scan the Vortex file, returning a vortex.ArrayIterator.
- Parameters:
projection (vortex.Expr | list[str] | None) – The projection expression to read, or else read all columns.
expr (vortex.Expr | None) – The predicate used to filter rows. The filter columns do not need to be in the projection.
indices (vortex.Array | None) – The indices of the rows to read. Must be sorted and non-null.
batch_size (int | None) – The number of rows to read per chunk.
Examples
Scan a file with a structured column and nulls at multiple levels and in multiple columns.
>>> import vortex as vx
>>> import vortex.expr as ve
>>> a = vx.array([
...     {'name': 'Joseph', 'age': 25},
...     {'name': None, 'age': 31},
...     {'name': 'Angela', 'age': None},
...     {'name': 'Mikhail', 'age': 57},
...     {'name': None, 'age': None},
... ])
>>> vx.io.write(a, "a.vortex")
>>> vxf = vx.open("a.vortex")
>>> vxf.scan().read_all().to_arrow_array()
<pyarrow.lib.StructArray object at ...>
-- is_valid: all not null
-- child 0 type: int64
  [
    25,
    31,
    null,
    57,
    null
  ]
-- child 1 type: string_view
  [
    "Joseph",
    null,
    "Angela",
    "Mikhail",
    null
  ]
Read just the age column:
>>> vxf.scan(['age']).read_all().to_arrow_array()
<pyarrow.lib.StructArray object at ...>
-- is_valid: all not null
-- child 0 type: int64
  [
    25,
    31,
    null,
    57,
    null
  ]
Keep rows with an age above 35. When the file format allows, this reads only O(N_KEPT) rows.
>>> vxf.scan(expr=ve.column("age") > 35).read_all().to_arrow_array()
<pyarrow.lib.StructArray object at ...>
-- is_valid: all not null
-- child 0 type: int64
  [
    57
  ]
-- child 1 type: string_view
  [
    "Mikhail"
  ]
- to_arrow(projection: Expr | list[str] | None = None, *, expr: Expr | None = None, batch_size: int | None = None) RecordBatchReader ¶
Scan the Vortex file as a pyarrow.RecordBatchReader.
- Parameters:
projection (vortex.Expr | list[str] | None) – Either an expression over the columns of the file (only referenced columns will be read from the file) or an explicit list of desired columns.
expr (vortex.Expr | None) – The predicate used to filter rows. The filter columns need not appear in the projection.
batch_size (int | None) – The number of rows to read per chunk.
- to_dataset() VortexDataset ¶
Scan the Vortex file using the pyarrow.dataset.Dataset API.
- to_polars() LazyFrame ¶
Read the Vortex file as a polars.LazyFrame, supporting column pruning and predicate pushdown.
- to_repeated_scan(projection: Expr | list[str] | None = None, *, expr: Expr | None = None, indices: Array | None = None, batch_size: int | None = None) RepeatedScan ¶
Prepare a scan of the Vortex file for repeated reads, returning a vortex.RepeatedScan.
- Parameters:
projection (vortex.Expr | list[str] | None) – The projection expression to read, or else read all columns.
expr (vortex.Expr | None) – The predicate used to filter rows. The filter columns do not need to be in the projection.
indices (vortex.Array | None) – The indices of the rows to read. Must be sorted and non-null.
batch_size (int | None) – The number of rows to read per chunk.
- final class vortex.RepeatedScan(scan: RepeatedScan)¶
A prepared scan that is optimized for repeated execution.
- execute(*, row_range: tuple[int, int] | None = None) ArrayIterator ¶
Execute the scan, returning a vortex.ArrayIterator.
Examples
Scan a file with a structured column and nulls at multiple levels and in multiple columns.
>>> import vortex as vx
>>> import vortex.expr as ve
>>> a = vx.array([
...     {'name': 'Joseph', 'age': 25},
...     {'name': None, 'age': 31},
...     {'name': 'Angela', 'age': None},
...     {'name': 'Mikhail', 'age': 57},
...     {'name': None, 'age': None},
... ])
>>> vx.io.write(a, "a.vortex")
>>> scan = vx.open("a.vortex").to_repeated_scan()
>>> scan.execute(row_range=(1, 3)).read_all().to_arrow_array()
<pyarrow.lib.StructArray object at ...>
-- is_valid: all not null
-- child 0 type: int64
  [
    31,
    null
  ]
-- child 1 type: string_view
  [
    null,
    "Angela"
  ]
- scalar_at(index: int) Scalar ¶
Fetch a scalar from the scan, returning a vortex.Scalar.
- Parameters:
index (int) – The row index to fetch. Raises an IndexError if out of bounds or if the given row index was not included in the scan.
Examples
Scan a file with a structured column and nulls at multiple levels and in multiple columns.
>>> import vortex as vx
>>> import vortex.expr as ve
>>> a = vx.array([
...     {'name': 'Joseph', 'age': 25},
...     {'name': None, 'age': 31},
...     {'name': 'Angela', 'age': None},
...     {'name': 'Mikhail', 'age': 57},
...     {'name': None, 'age': None},
... ])
>>> vx.io.write(a, "a.vortex")
>>> scan = vx.open("a.vortex").to_repeated_scan()
>>> scan.scalar_at(1)
<vortex.StructScalar object at ...>
- class vortex.io.VortexWriteOptions¶
Write Vortex files with custom configuration.
- static compact()¶
Prioritize small size over read-throughput and read-latency.
Let’s model some stock ticker data. As you may know, the stock market always (noisily) goes up:
>>> import os
>>> import random
>>> import vortex as vx
>>> sprl = vx.array([random.randint(i, i + 10) for i in range(100_000)])
If we naively wrote 4 bytes for each of these integers to a file, we’d have 400,000 bytes! Let’s see how small the file is when we write with the default Vortex write options (which are also used by vortex.io.write()):
>>> vx.io.VortexWriteOptions.default().write_path(sprl, "chonky.vortex")
>>> os.path.getsize('chonky.vortex')
215196
Wow, Vortex manages to use about two bytes per integer! So advanced. So tiny.
But can we do better?
We sure can.
>>> vx.io.VortexWriteOptions.compact().write_path(sprl, "tiny.vortex")
>>> os.path.getsize('tiny.vortex')
54200
Random numbers are not (usually) composed of random bytes!
- static default()¶
Balance size, read-throughput, and read-latency.
- write_path(iter, path)¶
Write an array or iterator of arrays into a local file.
- Parameters:
iter (vortex.Array | vortex.ArrayIterator | pyarrow.Table | pyarrow.RecordBatchReader) – The data to write. Can be a single array, an array iterator, or a PyArrow object that supports streaming. When using PyArrow objects, data is streamed directly without loading the entire dataset into memory.
path (str) – The file path.
Examples
Write a single Vortex array a to the local file a.vortex using the default settings:
>>> import vortex as vx
>>> a = vx.array([0, 1, 2, 3, None, 4])
>>> vx.io.VortexWriteOptions.default().write_path(a, "a.vortex")
Write the same array while preferring small file sizes over read-throughput and read-latency:
>>> import vortex as vx
>>> vx.io.VortexWriteOptions.compact().write_path(a, "a.vortex")
- vortex.io.read_url(url, *, projection=None, row_filter=None, indices=None, row_range=None)¶
Read a vortex struct array from a URL.
- Parameters:
url (str) – The URL to read from.
projection (list[str | int] | None) – The columns to read identified either by their index or name.
row_filter (Expr | None) – Keep only the rows for which this expression evaluates to true.
indices (Array | None) – A list of rows to keep identified by the zero-based index within the file. NB: If row_range is specified, these indices are within the row range, not the file!
row_range (tuple[int, int] | None) – A left-inclusive, right-exclusive range of rows to read.
Examples
Read an array from an HTTPS URL:
>>> import vortex as vx >>> a = vx.io.read_url("https://example.com/dataset.vortex")
Read an array from an S3 URL:
>>> a = vx.io.read_url("s3://bucket/path/to/dataset.vortex")
Read an array from an Azure Blob File System URL:
>>> a = vx.io.read_url("abfss://my_file_system@my_account.dfs.core.windows.net/path/to/dataset.vortex")
Read an array from an Azure Blob Storage URL:
>>> a = vx.io.read_url("https://my_account.blob.core.windows.net/my_container/path/to/dataset.vortex")
Read an array from a Google Storage URL:
>>> a = vx.io.read_url("gs://bucket/path/to/dataset.vortex")
Read an array from a local file URL:
>>> a = vx.io.read_url("file:/path/to/dataset.vortex")
- vortex.io.write(iter, path)¶
Write a vortex struct array to the local filesystem.
- Parameters:
iter (vortex.Array | vortex.ArrayIterator | pyarrow.Table | pyarrow.RecordBatchReader) – The data to write. Can be a single array, an array iterator, or a PyArrow object that supports streaming. When using PyArrow objects, data is streamed directly without loading the entire dataset into memory.
path (str) – The file path.
Examples
Write a single Vortex array a to the local file a.vortex.
>>> import vortex as vx
>>> a = vx.array([
...     {'x': 1},
...     {'x': 2},
...     {'x': 10},
...     {'x': 11},
...     {'x': None},
... ])
>>> vx.io.write(a, "a.vortex")
Stream a PyArrow Table directly to Vortex without loading into memory:
>>> import pyarrow as pa
>>> import vortex as vx
>>> table = pa.table({'x': [1, 2, 3, 4, 5]})
>>> vx.io.write(table, "streamed.vortex")
Stream from a PyArrow RecordBatchReader:
>>> import pyarrow as pa
>>> import vortex as vx
>>> schema = pa.schema([('x', pa.int64())])
>>> batches = [pa.record_batch([pa.array([1, 2, 3])], schema=schema)]
>>> reader = pa.RecordBatchReader.from_batches(schema, batches)
>>> vx.io.write(reader, "streamed.vortex")