PyArrow
Vortex integrates with PyArrow for reading and writing Vortex files using Arrow tables and record batch readers.
Writing Vortex Files
Use write() to write Arrow data to a Vortex file. The example below converts a
Parquet file by reading it into a pyarrow.Table first. The write function accepts
anything that implements IntoArrayIterator, including pyarrow.Table and
pyarrow.RecordBatchReader:
>>> import pyarrow.parquet as pq
>>> import vortex as vx
>>>
>>> table = pq.read_table("_static/example.parquet")
>>> vx.io.write(table, 'example.vortex')
Reading Vortex Files
Use open() to lazily open a Vortex file:
>>> f = vx.open('example.vortex')
>>> len(f)
1000
As an Arrow Table
VortexFile.to_arrow() returns a pyarrow.RecordBatchReader. Call
read_all() to collect into a pyarrow.Table:
>>> table = f.to_arrow().read_all()
>>> table.num_rows
1000
Column Projection
Read only the columns you need:
>>> table = f.to_arrow(['tip_amount', 'fare_amount']).read_all()
>>> table.column_names
['tip_amount', 'fare_amount']
Streaming Record Batches
Iterate over record batches for streaming processing:
>>> total = 0
>>> for batch in f.to_arrow():
...     total += batch.num_rows
...
>>> total
1000
Arrow Interop
The array() function constructs a Vortex array from an Arrow array without copies:
>>> import pyarrow as pa
>>> arrow = pa.array([1, 2, None, 3])
>>> arr = vx.array(arrow)
>>> arr.dtype
int(64, nullable=True)
Array.to_arrow_array() converts back:
>>> arr.to_arrow_array()
<pyarrow.lib.Int64Array object at ...>
[
  1,
  2,
  null,
  3
]
Struct arrays convert to Arrow tables with Array.to_arrow_table():
>>> struct_arr = vx.array([
...     {'name': 'Joseph', 'age': 25},
...     {'name': 'Narendra', 'age': 31},
...     {'name': 'Angela', 'age': 33},
...     {'name': 'Mikhail', 'age': 57},
... ])
>>> struct_arr.to_arrow_table()
pyarrow.Table
age: int64
name: string
----
age: [[25,31,33,57]]
name: [["Joseph","Narendra","Angela","Mikhail"]]