Dataset
Vortex files implement the Arrow Dataset interface, permitting efficient use of a Vortex file within query engines like DuckDB and Polars. In particular, Vortex reads data proportional to the number of rows passing a filter condition and to the number of columns in a selection. For most Vortex encodings, this property holds even when the filter condition matches only a single row. The sketch after the summary list below shows both pushdowns in use.
- VortexDataset – Read Vortex files with row filter and column selection pushdown.
- VortexScanner – A PyArrow Dataset Scanner that reads from a Vortex Array.
- VortexFragment – Fragment of data from a VortexDataset.
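A minimal sketch of both pushdowns. The open_vortex_dataset call is a placeholder for however your vortex version constructs a VortexDataset, which this section does not cover:

```python
import pyarrow.dataset as pda

ds = open_vortex_dataset("example.vortex")  # placeholder, not a real vortex API

# Only the `name` and `score` columns are fetched, and decoding is
# limited to rows that can satisfy the filter.
table = ds.to_table(
    columns=["name", "score"],
    filter=pda.field("score") > 0.9,
)
```

Because VortexDataset implements the pyarrow.dataset.Dataset interface, the same object can be scanned lazily from Polars with pl.scan_pyarrow_dataset(ds), or queried from DuckDB's Python API, which resolves Arrow objects by local variable name.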
- final class vortex.dataset.VortexDataset(dataset: VortexDataset, *, filters: list[Expr] | None = None)
Read Vortex files with row filter and column selection pushdown.
This class implements the pyarrow.dataset.Dataset interface, which enables its use with Polars, DuckDB, Pandas, and others.
- count_rows(filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None) → int
Count the number of rows in this dataset.
- filter(expression: Expression | Expr) → VortexDataset
A new Dataset with a filter condition applied.
Calling this method repeatedly combines all the filter expressions with logical AND, as in the sketch below.
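A sketch of filter chaining; ds is a VortexDataset opened as in the first example:

```python
import pyarrow.dataset as pda

# Both predicates end up ANDed together by the successive filter() calls.
filtered = ds.filter(pda.field("score") > 0.9).filter(pda.field("name") != "unknown")
print(filtered.count_rows())  # rows satisfying both predicates
```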
- get_fragments(filter: Expression | Expr | None = None) → Iterator[VortexFragment]
A fragment for each file in the Dataset.
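For instance, a sketch that walks each fragment and prints its physical schema (ds as in the first example):

```python
for fragment in ds.get_fragments():
    print(fragment.physical_schema)
```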
- head(num_rows: int, columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None) → Table
Load the first num_rows of the dataset.
- Parameters:
num_rows (int) – The number of rows to load.
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool | None) – Not implemented.
- Returns:
table
- Return type:
pyarrow.Table
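A sketch of previewing a file, continuing with ds from the first example:

```python
preview = ds.head(10, columns=["name", "score"])
print(preview.num_rows)  # at most 10
```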
- join(right_dataset: Dataset, keys: str | list[str], right_keys: str | list[str] | None = None, join_type: str = 'left outer', left_suffix: str | None = None, right_suffix: str | None = None, coalesce_keys: bool = True, use_threads: bool = True) → InMemoryDataset
Not implemented.
- join_asof(right_dataset: Dataset, on: str, by: str | list[str], tolerance: int, right_on: str | list[str] | None = None, right_by: str | list[str] | None = None) → InMemoryDataset
Not implemented.
- scanner(columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None) → Scanner
Construct a pyarrow.dataset.Scanner.
- Parameters:
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool | None) – Not implemented.
- Returns:
scanner
- Return type:
pyarrow.dataset.Scanner
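A sketch that builds a scanner and materializes it, with ds as in the first example:

```python
import pyarrow.dataset as pda

scanner = ds.scanner(columns=["name"], filter=pda.field("score") > 0.9)
table = scanner.to_table()
```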
- take(indices: pyarrow.Array[pyarrow.Int8Scalar | pyarrow.Int16Scalar | pyarrow.Int32Scalar | pyarrow.Int64Scalar | pyarrow.UInt8Scalar | pyarrow.UInt16Scalar | pyarrow.UInt32Scalar | pyarrow.UInt64Scalar], columns: list[str] | None = None, filter: pyarrow.dataset.Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: pyarrow.dataset.FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: pyarrow.MemoryPool | None = None, _row_range: tuple[int, int] | None = None) → pyarrow.Table
Load a subset of rows identified by their absolute indices.
- Parameters:
indices (pyarrow.Array) – A numeric array of absolute indices into self indicating which rows to keep.
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
cache_metadata (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool | None) – Not implemented.
- Returns:
table
- Return type:
pyarrow.Table
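A sketch of random access by absolute row index, reusing ds from the first example:

```python
import pyarrow as pa

# Pull three specific rows; only the `name` column is read.
rows = ds.take(pa.array([0, 42, 4999], type=pa.uint64()), columns=["name"])
```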
- to_batches(columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None) → Iterator[RecordBatch]
Construct an iterator of pyarrow.RecordBatch.
- Parameters:
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
cache_metadata (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool | None) – Not implemented.
- Returns:
record batches
- Return type:
Iterator[pyarrow.RecordBatch]
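A sketch of streaming one column in batches without materializing the whole table (ds as above):

```python
total = 0
for batch in ds.to_batches(columns=["score"], batch_size=8192):
    total += batch.num_rows
```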
- to_record_batch_reader(columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None) → RecordBatchReader
Construct a pyarrow.RecordBatchReader.
- Parameters:
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool | None) – Not implemented.
- Returns:
reader
- Return type:
pyarrow.RecordBatchReader
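A reader streams nicely into engines that accept Arrow streams; for example, a sketch handing one to DuckDB's Python API, which resolves the local variable by name:

```python
import duckdb

reader = ds.to_record_batch_reader(columns=["name", "score"])
duckdb.sql("SELECT count(*) FROM reader").show()
```

Note that a RecordBatchReader is single-use: once DuckDB has consumed it, it is exhausted.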
- to_table(columns: list[str] | dict[str, Expression] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None) → Table
Construct an Arrow pyarrow.Table.
- Parameters:
columns (list of str | dict[str, pyarrow.dataset.Expression] | None) – The columns to keep, identified by name, or a mapping from output column name to the expression that computes it.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool | None) – Not implemented.
- Returns:
table
- Return type:
pyarrow.Table
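A sketch using the dict form of columns to project a derived value, assuming the Vortex implementation honors pyarrow's expression-projection semantics for dict-valued columns:

```python
import pyarrow.dataset as pda

# Output columns are named by the dict keys; `score_pct` is computed.
table = ds.to_table(
    columns={"name": pda.field("name"), "score_pct": pda.field("score") * 100},
    filter=pda.field("score") > 0.5,
)
```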
- final class vortex.dataset.VortexFragment(dataset: VortexDataset, _row_range: tuple[int, int])
Fragment of data from a VortexDataset.
- count_rows(filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None) → int
- head(num_rows: int, columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None) → Table
- property partition_expression: Expression
An Expression which evaluates to true for all data viewed by this Fragment.
- property physical_schema: Schema
Return the physical schema of this Fragment. This schema can be different from the dataset read schema.
- scanner(schema: Schema | None = None, columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None) → Scanner
- take(indices: pyarrow.Array[pyarrow.Int8Scalar | pyarrow.Int16Scalar | pyarrow.Int32Scalar | pyarrow.Int64Scalar | pyarrow.UInt8Scalar | pyarrow.UInt16Scalar | pyarrow.UInt32Scalar | pyarrow.UInt64Scalar], columns: list[str] | None = None, filter: pyarrow.dataset.Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: pyarrow.dataset.FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: pyarrow.MemoryPool | None = None) → pyarrow.Table
See vortex.dataset.VortexDataset.take.
Warning
The indices are indices into the file, not indices into this fragment of the file.
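A sketch illustrating the warning, reusing ds from the first example; the indices address rows of the whole file, even when taken from a single fragment:

```python
import pyarrow as pa

fragment = next(iter(ds.get_fragments()))
# File-absolute positions, not positions within this fragment.
rows = fragment.take(pa.array([0, 1], type=pa.uint64()))
```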
- to_batches(schema: Schema | None = None, columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool = True, memory_pool: MemoryPool | None = None) → Iterator[RecordBatch]
- to_table(schema: Schema | None = None, columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None) → Table
- final class vortex.dataset.VortexScanner(dataset: VortexDataset, columns: list[str] | None = None, filter: Expression | Expr | None = None, batch_size: int | None = None, batch_readahead: int | None = None, fragment_readahead: int | None = None, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool | None = None, cache_metadata: bool | None = None, memory_pool: MemoryPool | None = None, _row_range: tuple[int, int] | None = None)
A PyArrow Dataset Scanner that reads from a Vortex Array.
- Parameters:
dataset (VortexDataset) – The dataset to scan.
columns (list of str) – The columns to keep, identified by name.
filter (pyarrow.dataset.Expression) – Keep only rows for which this expression evaluates to True. Any row for which this expression evaluates to Null is removed.
batch_size (int) – The maximum number of rows per batch.
batch_readahead (int) – Not implemented.
fragment_readahead (int) – Not implemented.
fragment_scan_options (pyarrow.dataset.FragmentScanOptions) – Not implemented.
use_threads (bool) – Not implemented.
memory_pool (pyarrow.MemoryPool | None) – Not implemented.
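In typical use a scanner comes from VortexDataset.scanner() rather than from this constructor; a sketch, continuing with ds from the first example:

```python
scanner = ds.scanner(columns=["score"])
reader = scanner.to_reader()
print(reader.schema)
```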
- head(num_rows: int) → Table
Load the first num_rows of the dataset.
- Parameters:
num_rows (int) – The number of rows to read.
- Returns:
table
- Return type:
pyarrow.Table
- scan_batches() → Iterator[TaggedRecordBatch]
Not implemented.
- to_batches() → Iterator[RecordBatch]
Construct an iterator of pyarrow.RecordBatch.
- Returns:
record batches
- Return type:
Iterator[pyarrow.RecordBatch]
- to_reader() → RecordBatchReader
Construct a pyarrow.RecordBatchReader.
- Returns:
reader
- Return type:
pyarrow.RecordBatchReader
- to_table() → Table
Construct an Arrow pyarrow.Table.
- Returns:
table
- Return type:
pyarrow.Table