Dataset

Vortex files implement the Arrow Dataset interface, permitting efficient use of a Vortex file within query engines like DuckDB and Polars. In particular, Vortex reads data proportional to the number of rows passing a filter condition and the number of columns in a selection. For most Vortex encodings, this property holds even when the filter condition matches only a single row.

VortexDataset

Read Vortex files with row filter and column selection pushdown.

VortexScanner

A PyArrow Dataset Scanner that reads from a Vortex Array.


class vortex.dataset.VortexDataset(dataset)

Read Vortex files with row filter and column selection pushdown.

This class implements the pyarrow.dataset.Dataset interface which enables its use with Polars, DuckDB, Pandas and others.
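
For example, an opened dataset can be handed directly to these engines. A minimal sketch, assuming a VortexDataset has already been opened as ds and exposes an integer column x (both hypothetical; how ds is constructed depends on your Vortex version):

    import duckdb
    import polars as pl

    from vortex.dataset import VortexDataset

    def query_with_duckdb(ds: VortexDataset):
        # DuckDB's replacement scan resolves the local name `ds` to the
        # pyarrow-compatible dataset; the WHERE clause is pushed down
        # into the Vortex scan. The column name `x` is hypothetical.
        return duckdb.sql("SELECT x FROM ds WHERE x > 0").arrow()

    def query_with_polars(ds: VortexDataset):
        # Polars scans the dataset lazily; the filter and projection
        # are pushed down so only matching rows of `x` are read.
        return pl.scan_pyarrow_dataset(ds).filter(pl.col("x") > 0).collect()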

count_rows(filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → int

Not implemented.

filter(expression: Expression) → VortexDataset

Not implemented.

get_fragments(filter: Expression | None = None) → Iterator[Fragment]

Not implemented.

head(num_rows: int, columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → Table

Load the first num_rows of the dataset.

Parameters:
  • num_rows (int) – The number of rows to load.

  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

table

Return type:

pyarrow.Table
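
A minimal sketch, assuming an opened dataset ds with hypothetical columns name and score:

    import pyarrow.dataset as pads

    from vortex.dataset import VortexDataset

    def preview(ds: VortexDataset):
        # Load at most the first 10 rows that pass the filter, reading
        # only the two requested columns.
        return ds.head(
            10,
            columns=["name", "score"],
            filter=pads.field("score") > 0,
        )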

join(right_dataset, keys, right_keys=None, join_type=None, left_suffix=None, right_suffix=None, coalesce_keys=True, use_threads: bool | None = None) → InMemoryDataset

Not implemented.

join_asof(right_dataset, on, by, tolerance, right_on=None, right_by=None) → InMemoryDataset

Not implemented.

replace_schema(schema: Schema)

Not implemented.

scanner(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → Scanner

Construct a pyarrow.dataset.Scanner.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

scanner

Return type:

pyarrow.dataset.Scanner
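
A minimal sketch, assuming an opened dataset ds with a hypothetical id column:

    import pyarrow.dataset as pads

    from vortex.dataset import VortexDataset

    def scan(ds: VortexDataset):
        # Constructing the scanner is cheap; no data is read until it
        # is consumed, e.g. via to_table() or to_batches().
        scanner = ds.scanner(
            columns=["id"],
            filter=pads.field("id") < 1_000,
            batch_size=8_192,
        )
        return scanner.to_table()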

property schema: Schema

The common schema of the full Dataset.

sort_by(sorting, **kwargs) InMemoryDataset

Not implemented.

take(indices: Array | Any, columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → Table

Load a subset of rows identified by their absolute indices.

Parameters:
  • indices (pyarrow.Array) – A numeric array of absolute indices into the dataset indicating which rows to keep.

  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

table

Return type:

pyarrow.Table
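
A minimal sketch, assuming an opened dataset ds with a hypothetical name column:

    import pyarrow as pa

    from vortex.dataset import VortexDataset

    def fetch_rows(ds: VortexDataset):
        # Fetch three rows by absolute position. For most Vortex
        # encodings this reads data proportional to the rows requested
        # rather than the size of the file.
        return ds.take(pa.array([0, 42, 1_000]), columns=["name"])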

to_batches(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → Iterator[RecordBatch]

Construct an iterator of pyarrow.RecordBatch.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

batches

Return type:

iterator of pyarrow.RecordBatch
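
A minimal sketch, assuming an opened dataset ds with a hypothetical value column:

    from vortex.dataset import VortexDataset

    def stream(ds: VortexDataset):
        # Iterate batch by batch without materializing the full table;
        # useful when the selection is larger than memory.
        total = 0
        for batch in ds.to_batches(columns=["value"], batch_size=4_096):
            total += batch.num_rows
        return total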

to_record_batch_reader(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → RecordBatchReader

Construct a pyarrow.RecordBatchReader.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

reader

Return type:

pyarrow.RecordBatchReader
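
A minimal sketch, assuming an opened dataset ds with a hypothetical value column:

    from vortex.dataset import VortexDataset

    def hand_off(ds: VortexDataset):
        # A RecordBatchReader is Arrow's standard streaming interface;
        # many consumers accept one directly in place of a full table.
        reader = ds.to_record_batch_reader(columns=["value"])
        return reader.read_all()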

to_table(columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None) → Table

Construct an Arrow pyarrow.Table.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

Returns:

table

Return type:

pyarrow.Table
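
A minimal sketch, assuming an opened dataset ds with hypothetical columns name and score:

    import pyarrow.dataset as pads

    from vortex.dataset import VortexDataset

    def load(ds: VortexDataset):
        # Materialize the filtered, projected data as one Arrow table.
        return ds.to_table(
            columns=["name", "score"],
            filter=pads.field("score") >= 0.5,
        )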

class vortex.dataset.VortexScanner(dataset: VortexDataset, columns: list[str] | None = None, filter: Expression | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, memory_pool: MemoryPool | None = None)

A PyArrow Dataset Scanner that reads from a Vortex Array.

Parameters:
  • dataset (VortexDataset) – The dataset to scan.

  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any rows for which this expression evaluates to Null are removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool) – Not implemented.

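A minimal sketch, assuming an opened dataset ds with a hypothetical id column; constructing a VortexScanner directly is expected to behave like calling ds.scanner() with the same arguments:

    import pyarrow.dataset as pads

    from vortex.dataset import VortexDataset, VortexScanner

    def count_and_load(ds: VortexDataset):
        scanner = VortexScanner(ds, columns=["id"], filter=pads.field("id") > 0)
        # count_rows() counts matching rows without materializing them ...
        n = scanner.count_rows()
        # ... while to_table() materializes the projected, filtered rows.
        return n, scanner.to_table()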

count_rows() → int

Count rows matching the scanner filter.

Returns:

count

Return type:

int

head(num_rows: int) → Table

Load the first num_rows of the dataset.

Parameters:

num_rows (int) – The number of rows to read.

Returns:

table

Return type:

pyarrow.Table

scan_batches() → Iterator[TaggedRecordBatch]

Not implemented.

to_batches() → Iterator[RecordBatch]

Construct an iterator of pyarrow.RecordBatch.

Returns:

batches

Return type:

iterator of pyarrow.RecordBatch

to_reader() → RecordBatchReader

Construct a pyarrow.RecordBatchReader.

Returns:

reader

Return type:

pyarrow.RecordBatchReader

to_table() → Table

Construct an Arrow pyarrow.Table.

Returns:

table

Return type:

pyarrow.Table