PyArrow#

Vortex integrates with PyArrow for reading and writing Vortex files using Arrow tables and record batch readers.

Writing Vortex Files#

Use write() to convert a Parquet file to Vortex. The write function accepts anything that implements IntoArrayIterator, including pyarrow.Table and pyarrow.RecordBatchReader:

>>> import pyarrow.parquet as pq
>>> import vortex as vx
>>>
>>> table = pq.read_table("_static/example.parquet")
>>> vx.io.write(table, 'example.vortex')

Reading Vortex Files#

Use open() to lazily open a Vortex file:

>>> f = vx.open('example.vortex')
>>> len(f)
1000

As an Arrow Table#

VortexFile.to_arrow() returns a pyarrow.RecordBatchReader. Call read_all() to collect into a pyarrow.Table:

>>> table = f.to_arrow().read_all()
>>> table.num_rows
1000

Column Projection#

Read only the columns you need:

>>> table = f.to_arrow(['tip_amount', 'fare_amount']).read_all()
>>> table.column_names
['tip_amount', 'fare_amount']

Streaming Record Batches#

Iterate over record batches for streaming processing:

>>> total = 0
>>> for batch in f.to_arrow():
...     total += batch.num_rows
>>> total
1000

Arrow Interop#

The array() function constructs a Vortex array from an Arrow array without copies:

>>> import pyarrow as pa
>>> arrow = pa.array([1, 2, None, 3])
>>> arr = vx.array(arrow)
>>> arr.dtype
int(64, nullable=True)

Array.to_arrow_array() converts back:

>>> arr.to_arrow_array()
<pyarrow.lib.Int64Array object at ...>
[
1,
2,
null,
3
]

Struct arrays convert to Arrow tables with Array.to_arrow_table():

>>> struct_arr = vx.array([
... {'name': 'Joseph', 'age': 25},
... {'name': 'Narendra', 'age': 31},
... {'name': 'Angela', 'age': 33},
... {'name': 'Mikhail', 'age': 57},
... ])
>>> struct_arr.to_arrow_table()
pyarrow.Table
age: int64
name: string
----
age: [[25,31,33,57]]
name: [["Joseph","Narendra","Angela","Mikhail"]]