Dataset

Vortex files implement the Arrow Dataset interface, permitting efficient use of a Vortex file within query engines like DuckDB and Polars. In particular, Vortex reads data proportional to the number of rows passing a filter condition and the number of columns in a selection. For most Vortex encodings, this property holds even when the filter condition matches only a single row.

VortexDataset

Read Vortex files with row filter and column selection pushdown.

VortexScanner

A PyArrow Dataset Scanner that reads from a Vortex Array.


class vortex.dataset.VortexDataset(dataset)

Read Vortex files with row filter and column selection pushdown.

This class implements the pyarrow.dataset.Dataset interface which enables its use with Polars, DuckDB, Pandas and others.
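
For example, an opened dataset can be handed directly to these engines. A minimal sketch, assuming a VortexDataset has already been opened as ds and exposes an integer column x (both hypothetical; how ds is constructed depends on your Vortex version):

    import duckdb
    import polars as pl

    from vortex.dataset import VortexDataset

    def query_with_duckdb(ds: VortexDataset):
        # DuckDB's replacement scan resolves the local name `ds` to the
        # pyarrow-compatible dataset; the WHERE clause is pushed down
        # into the Vortex scan. The column name `x` is hypothetical.
        return duckdb.sql("SELECT x FROM ds WHERE x > 0").arrow()

    def query_with_polars(ds: VortexDataset):
        # Polars scans the dataset lazily; the filter and projection
        # are pushed down so only matching rows of `x` are read.
        return pl.scan_pyarrow_dataset(ds).filter(pl.col("x") > 0).collect()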

count_rows(filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → int

Not implemented.

filter(expression: Expression) → VortexDataset

Not implemented.

get_fragments(filter: Expression | None = None) → Iterator[Fragment]

Not implemented.

head(num_rows: int, columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → Table

Load the first num_rows of the dataset.

Parameters:
  • num_rows (int) – The number of rows to load.

  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

table

Return type:

pyarrow.Table
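
A minimal sketch, assuming an opened dataset ds with hypothetical columns name and score:

    import pyarrow.dataset as pads

    from vortex.dataset import VortexDataset

    def preview(ds: VortexDataset):
        # Load at most the first 10 rows that pass the filter, reading
        # only the two requested columns.
        return ds.head(
            10,
            columns=["name", "score"],
            filter=pads.field("score") > 0,
        )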

join(right_dataset, keys, right_keys=None, join_type=None, left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads: bool | None = None) → InMemoryDataset

Not implemented.

join_asof(right_dataset, on, by, tolerance, right_on=None, right_by=None) → InMemoryDataset

Not implemented.

replace_schema(schema: Schema)

Not implemented.

scanner(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → Scanner

Construct a pyarrow.dataset.Scanner.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

scanner

Return type:

pyarrow.dataset.Scanner
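
A minimal sketch, assuming an opened dataset ds with a hypothetical id column:

    import pyarrow.dataset as pads

    from vortex.dataset import VortexDataset

    def scan(ds: VortexDataset):
        # Constructing the scanner is cheap; no data is read until it
        # is consumed, e.g. via to_table() or to_batches().
        scanner = ds.scanner(
            columns=["id"],
            filter=pads.field("id") < 1_000,
            batch_size=8_192,
        )
        return scanner.to_table()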

property schema: Schema

The common schema of the full Dataset.

sort_by(sorting, **kwargs) InMemoryDataset

Not implemented.

take(indices: Array | Any, columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → Table

Load a subset of rows identified by their absolute indices.

Parameters:
  • indices (pyarrow.Array) – A numeric array of absolute indices into the dataset indicating which rows to keep.

  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

table

Return type:

pyarrow.Table
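
A minimal sketch, assuming an opened dataset ds with a hypothetical name column:

    import pyarrow as pa

    from vortex.dataset import VortexDataset

    def fetch_rows(ds: VortexDataset):
        # Fetch three rows by absolute position. For most Vortex
        # encodings this reads data proportional to the rows requested
        # rather than the size of the file.
        return ds.take(pa.array([0, 42, 1_000]), columns=["name"])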

to_batches(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → Iterator[RecordBatch]

Construct an iterator of pyarrow.RecordBatch.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

batches

Return type:

iterator of pyarrow.RecordBatch
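
A minimal sketch, assuming an opened dataset ds with a hypothetical value column:

    from vortex.dataset import VortexDataset

    def stream(ds: VortexDataset):
        # Iterate batch by batch without materializing the full table;
        # useful when the selection is larger than memory.
        total = 0
        for batch in ds.to_batches(columns=["value"], batch_size=4_096):
            total += batch.num_rows
        return total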

to_record_batch_reader(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → RecordBatchReader

Construct a pyarrow.RecordBatchReader.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

reader

Return type:

pyarrow.RecordBatchReader
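
A minimal sketch, assuming an opened dataset ds with a hypothetical value column:

    from vortex.dataset import VortexDataset

    def hand_off(ds: VortexDataset):
        # A RecordBatchReader is Arrow's standard streaming interface;
        # many consumers accept one directly in place of a full table.
        reader = ds.to_record_batch_reader(columns=["value"])
        return reader.read_all()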

to_table(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → Table

Construct an Arrow pyarrow.Table.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

table

Return type:

pyarrow.Table
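
A minimal sketch, assuming an opened dataset ds with hypothetical columns name and score:

    import pyarrow.dataset as pads

    from vortex.dataset import VortexDataset

    def load(ds: VortexDataset):
        # Materialize the filtered, projected data as one Arrow table.
        return ds.to_table(
            columns=["name", "score"],
            filter=pads.field("score") >= 0.5,
        )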

class vortex.dataset.VortexScanner(dataset: VortexDataset, columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None)

A PyArrow Dataset Scanner that reads from a Vortex Array.

Parameters:
  • dataset (VortexDataset) – The dataset to scan.

  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

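A minimal sketch, assuming an opened dataset ds with a hypothetical id column; constructing a VortexScanner directly is expected to behave like calling ds.scanner() with the same arguments:

    import pyarrow.dataset as pads

    from vortex.dataset import VortexDataset, VortexScanner

    def count_and_load(ds: VortexDataset):
        scanner = VortexScanner(ds, columns=["id"], filter=pads.field("id") > 0)
        # count_rows() counts matching rows without materializing them ...
        n = scanner.count_rows()
        # ... while to_table() materializes the projected, filtered rows.
        return n, scanner.to_table()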

count_rows() → int

Count rows matching the scanner filter.

Returns:

count

Return type:

int

head(num_rows: int) → Table

Load the first num_rows of the dataset.

Parameters:

num_rows (int) – The number of rows to read.

Returns:

table

Return type:

pyarrow.Table

scan_batches() → Iterator[TaggedRecordBatch]

Not implemented.

to_batches() → Iterator[RecordBatch]

Construct an iterator of pyarrow.RecordBatch.

Returns:

batches

Return type:

iterator of pyarrow.RecordBatch

to_reader() → RecordBatchReader

Construct a pyarrow.RecordBatchReader.

Returns:

reader

Return type:

pyarrow.RecordBatchReader

to_table() → Table

Construct an Arrow pyarrow.Table.

Returns:

table

Return type:

pyarrow.Table