Dataset

Vortex files implement the Arrow Dataset interface, permitting efficient use of Vortex files within query engines such as DuckDB and Polars. In particular, Vortex reads data proportional to the number of rows passing a filter condition and the number of columns in a selection. For most Vortex encodings, this property holds even when the filter condition matches only a single row.

VortexDataset

Read Vortex files with row filter and column selection pushdown.

VortexScanner

A PyArrow Dataset Scanner that reads from a Vortex Array.

VortexFragment

Fragment of data from a VortexDataset.


final class vortex.dataset.VortexDataset(dataset: VortexDataset, *, filters: list[Expr] | None = None)

Read Vortex files with row filter and column selection pushdown.

This class implements the pyarrow.dataset.Dataset interface, which enables its use with Polars, DuckDB, Pandas, and others.

count_rows(filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None) int

Count the number of rows in this dataset.

filter(expression: Expression | Expr) VortexDataset

A new Dataset with a filter condition applied.

Each successive call to this method combines the new filter expression with the existing ones using logical AND.

get_fragments(filter: Expression | Expr | None = None) Iterator[VortexFragment]

A fragment for each file in the Dataset.

head(num_rows: int, columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None) Table

Load the first num_rows of the dataset.

Parameters:
  • num_rows (int) – The number of rows to load.

  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which the expression evaluates to null are also removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • cache_metadata (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool | None) – Not implemented.

Returns:

table

Return type:

pyarrow.Table

join(right_dataset: Dataset, keys: str | list[str], right_keys: str | list[str] | None = None, join_type: str = 'left outer', left_suffix: str | None = None, right_suffix: str | None = None, coalesce_keys: bool = True, use_threads: bool = True) InMemoryDataset

Not implemented.

join_asof(right_dataset: Dataset, on: str, by: str | list[str], tolerance: int, right_on: str | list[str] | None = None, right_by: str | list[str] | None = None) InMemoryDataset

Not implemented.

replace_schema(schema: Schema) None

Not implemented.

scanner(columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None) Scanner

Construct a pyarrow.dataset.Scanner.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which the expression evaluates to null are also removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • cache_metadata (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool | None) – Not implemented.

Returns:

scanner

Return type:

pyarrow.dataset.Scanner

property schema: Schema

The common schema of the full Dataset.

sort_by(sorting: str | list[tuple[str, str]], **kwargs) InMemoryDataset

Not implemented.

take(indices: pyarrow.Array[pyarrow.Int8Scalar | pyarrow.Int16Scalar | pyarrow.Int32Scalar | pyarrow.Int64Scalar | pyarrow.UInt8Scalar | pyarrow.UInt16Scalar | pyarrow.UInt32Scalar | pyarrow.UInt64Scalar], columns: list[str] | None = None, filter: pyarrow.dataset.Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: pyarrow.dataset.FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: pyarrow.MemoryPool | None = None, _row_range: tuple[int, int] | None = None) pyarrow.Table

Load a subset of rows identified by their absolute indices.

Parameters:
  • indices (pyarrow.Array) – A numeric array of absolute indices into self indicating which rows to keep.

  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which the expression evaluates to null are also removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • cache_metadata (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool | None) – Not implemented.

Returns:

table

Return type:

pyarrow.Table

to_batches(columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None) Iterator[RecordBatch]

Construct an iterator of pyarrow.RecordBatch.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which the expression evaluates to null are also removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • cache_metadata (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool | None) – Not implemented.

Returns:

batches

Return type:

Iterator of pyarrow.RecordBatch

to_record_batch_reader(columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None) RecordBatchReader

Construct a pyarrow.RecordBatchReader.

Parameters:
  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which the expression evaluates to null are also removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • cache_metadata (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool | None) – Not implemented.

Returns:

reader

Return type:

pyarrow.RecordBatchReader

to_table(columns: list[str] | dict[str, Expression] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None) Table

Construct an Arrow pyarrow.Table.

Parameters:
  • columns (list of str or dict of str to pyarrow.dataset.Expression) – The columns to keep, identified by name, or a mapping from output column name to an expression computed during the scan.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which the expression evaluates to null are also removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • cache_metadata (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool | None) – Not implemented.

Returns:

table

Return type:

pyarrow.Table

final class vortex.dataset.VortexFragment(dataset: VortexDataset, _row_range: tuple[int, int])

Fragment of data from a VortexDataset.

count_rows(filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None) int

See vortex.dataset.VortexDataset.count_rows

head(num_rows: int, columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None) Table

See vortex.dataset.VortexDataset.head

property partition_expression: Expression

An Expression which evaluates to true for all data viewed by this Fragment.

property physical_schema: Schema

Return the physical schema of this Fragment. This schema can be different from the dataset read schema.

scanner(schema: Schema | None = None, columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None) Scanner

See vortex.dataset.VortexDataset.scanner

take(indices: pyarrow.Array[pyarrow.Int8Scalar | pyarrow.Int16Scalar | pyarrow.Int32Scalar | pyarrow.Int64Scalar | pyarrow.UInt8Scalar | pyarrow.UInt16Scalar | pyarrow.UInt32Scalar | pyarrow.UInt64Scalar], columns: list[str] | None = None, filter: pyarrow.dataset.Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: pyarrow.dataset.FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: pyarrow.MemoryPool | None = None) pyarrow.Table

See vortex.dataset.VortexDataset.take

Warning

The indices are absolute indices into the file, not indices relative to this fragment.

to_batches(schema: Schema | None = None, columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool = True, memory_pool: MemoryPool | None = None) Iterator[RecordBatch]

See vortex.dataset.VortexDataset.to_batches

to_table(schema: Schema | None = None, columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None) Table

See vortex.dataset.VortexDataset.to_table

final class vortex.dataset.VortexScanner(dataset: VortexDataset, columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None)

A PyArrow Dataset Scanner that reads from a Vortex Array.

Parameters:
  • dataset (VortexDataset) – The dataset to scan.

  • columns (list of str) – The columns to keep, identified by name.

  • filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Rows for which the expression evaluates to null are also removed.

  • batch_size (int) – The maximum number of rows per batch.

  • batch_readahead (int) – Not implemented.

  • fragment_readahead (int) – Not implemented.

  • fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.

  • use_threads (bool) – Not implemented.

  • cache_metadata (bool) – Not implemented.

  • memory_pool (pyarrow.MemoryPool | None) – Not implemented.


count_rows() int

Count rows matching the scanner filter.

Returns:

count

Return type:

int

head(num_rows: int) Table

Load the first num_rows of the dataset.

Parameters:

num_rows (int) – The number of rows to read.

Returns:

table

Return type:

pyarrow.Table

scan_batches() Iterator[TaggedRecordBatch]

Not implemented.

to_batches() Iterator[RecordBatch]

Construct an iterator of pyarrow.RecordBatch.

Returns:

batches

Return type:

Iterator of pyarrow.RecordBatch

to_reader() RecordBatchReader

Construct a pyarrow.RecordBatchReader.

Returns:

reader

Return type:

pyarrow.RecordBatchReader

to_table() Table

Construct an Arrow pyarrow.Table.

Returns:

table

Return type:

pyarrow.Table