Class VortexDataSourceV2

java.lang.Object
dev.vortex.spark.VortexDataSourceV2
All Implemented Interfaces:
org.apache.spark.sql.connector.catalog.TableProvider, org.apache.spark.sql.sources.DataSourceRegister

public final class VortexDataSourceV2 extends Object implements org.apache.spark.sql.connector.catalog.TableProvider, org.apache.spark.sql.sources.DataSourceRegister
Spark V2 data source for reading and writing Vortex files.

This class is registered automatically so that the Spark runtime can discover it. To read, use SparkSession.read() and specify "vortex" as the format; to write, use Dataset.write() and specify "vortex" as the format.
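For example, a minimal read/write round trip in Java might look like the sketch below; the application name and file paths are placeholders, not part of this API.

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public final class VortexRoundTrip {
    public static void main(String[] args) {
      SparkSession spark = SparkSession.builder().appName("vortex-example").getOrCreate();

      // Read Vortex files by naming the registered format.
      Dataset<Row> df = spark.read().format("vortex").load("/path/to/input.vortex");

      // Write the DataFrame back out in the same format.
      df.write().format("vortex").save("/path/to/output");

      spark.stop();
    }
  }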

  • Constructor Summary

    Constructors
    Constructor
    Description
    VortexDataSourceV2()
    Creates a new instance of the Vortex data source.
  • Method Summary

    Modifier and Type
    Method
    Description
    org.apache.spark.sql.connector.catalog.Table
    getTable(org.apache.spark.sql.types.StructType schema, org.apache.spark.sql.connector.expressions.Transform[] _partitioning, Map<String,String> properties)
    Creates a Vortex table instance with the given schema and properties.
    org.apache.spark.sql.types.StructType
    inferSchema(org.apache.spark.sql.util.CaseInsensitiveStringMap options)
    Infers the schema of the Vortex files specified in the options.
    String
    shortName()
    Returns the short name identifier for this data source.
    boolean
    supportsExternalMetadata()
    Indicates whether this data source supports external metadata (schemas).

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface org.apache.spark.sql.connector.catalog.TableProvider

    inferPartitioning
  • Constructor Details

    • VortexDataSourceV2

      public VortexDataSourceV2()
      Creates a new instance of the Vortex data source.

      This no-argument constructor is required for Spark to instantiate the data source through reflection.
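
      As background for this requirement: Spark resolves short format names by loading every DataSourceRegister on the classpath through java.util.ServiceLoader, which instantiates each provider via its public no-argument constructor. A minimal sketch of that discovery mechanism:

      import java.util.ServiceLoader;
      import org.apache.spark.sql.sources.DataSourceRegister;

      public final class ListProviders {
        public static void main(String[] args) {
          // ServiceLoader creates each registered provider through its public
          // no-arg constructor; shortName() is then matched against the
          // format string the user supplied.
          for (DataSourceRegister provider : ServiceLoader.load(DataSourceRegister.class)) {
            System.out.println(provider.shortName());
          }
        }
      }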

  • Method Details

    • inferSchema

      public org.apache.spark.sql.types.StructType inferSchema(org.apache.spark.sql.util.CaseInsensitiveStringMap options)
      Infers the schema of the Vortex files specified in the options.

      This method examines the last file in the provided paths to determine the schema. Schema evolution and merging across multiple files are not currently supported.

      Specified by:
      inferSchema in interface org.apache.spark.sql.connector.catalog.TableProvider
      Parameters:
      options - the data source options containing file paths
      Returns:
      the inferred Spark SQL schema
      Throws:
      RuntimeException - if required path options are missing
      RuntimeException - if an error occurs while reading the file or converting the schema
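
      A sketch of invoking this method directly, assuming the conventional Spark "path" option key (the exact keys this source accepts are not documented here):

      import java.util.HashMap;
      import java.util.Map;
      import dev.vortex.spark.VortexDataSourceV2;
      import org.apache.spark.sql.types.StructType;
      import org.apache.spark.sql.util.CaseInsensitiveStringMap;

      public final class InferSchemaSketch {
        public static void main(String[] args) {
          Map<String, String> options = new HashMap<>();
          options.put("path", "/path/to/input.vortex"); // assumed key, placeholder path

          VortexDataSourceV2 source = new VortexDataSourceV2();
          StructType schema = source.inferSchema(new CaseInsensitiveStringMap(options));
          System.out.println(schema.treeString());
        }
      }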
    • getTable

      public org.apache.spark.sql.connector.catalog.Table getTable(org.apache.spark.sql.types.StructType schema, org.apache.spark.sql.connector.expressions.Transform[] _partitioning, Map<String,String> properties)
      Creates a Vortex table instance with the given schema and properties.

      This method creates a VortexWritableTable that can be used to both read from and write to Vortex files. The partitioning parameter is currently ignored.

      Specified by:
      getTable in interface org.apache.spark.sql.connector.catalog.TableProvider
      Parameters:
      schema - the table schema
      _partitioning - table partitioning transforms (currently ignored)
      properties - the table properties containing file paths and other options
      Returns:
      a VortexWritableTable instance for reading and writing data
      Throws:
      RuntimeException - if required path properties are missing
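
      A sketch chaining inferSchema into getTable; the empty Transform array reflects that partitioning is ignored, and the "path" key is again an assumption:

      import java.util.HashMap;
      import java.util.Map;
      import dev.vortex.spark.VortexDataSourceV2;
      import org.apache.spark.sql.connector.catalog.Table;
      import org.apache.spark.sql.connector.expressions.Transform;
      import org.apache.spark.sql.types.StructType;
      import org.apache.spark.sql.util.CaseInsensitiveStringMap;

      public final class GetTableSketch {
        public static void main(String[] args) {
          Map<String, String> properties = new HashMap<>();
          properties.put("path", "/path/to/input.vortex"); // assumed key, placeholder path

          VortexDataSourceV2 source = new VortexDataSourceV2();
          StructType schema = source.inferSchema(new CaseInsensitiveStringMap(properties));

          // Partitioning transforms are ignored, so an empty array suffices.
          Table table = source.getTable(schema, new Transform[0], properties);
          System.out.println(table.name());
        }
      }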
    • supportsExternalMetadata

      public boolean supportsExternalMetadata()
      Indicates whether this data source supports external metadata (schemas).

      Returns true to indicate that this data source accepts external schemas, which is necessary for write operations where the DataFrame provides the schema.

      Specified by:
      supportsExternalMetadata in interface org.apache.spark.sql.connector.catalog.TableProvider
      Returns:
      true to accept external schemas
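
      Concretely, this is what allows a plain DataFrame write to supply the schema; a sketch with placeholder data and output path:

      import java.util.Arrays;
      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.RowFactory;
      import org.apache.spark.sql.SparkSession;
      import org.apache.spark.sql.types.DataTypes;
      import org.apache.spark.sql.types.StructType;

      public final class ExternalSchemaWrite {
        public static void main(String[] args) {
          SparkSession spark = SparkSession.builder().appName("external-schema").getOrCreate();

          // The schema is supplied by the DataFrame itself; because the
          // provider accepts external metadata, Spark passes it to
          // getTable(...) on write instead of calling inferSchema(...).
          StructType schema = new StructType()
              .add("id", DataTypes.LongType)
              .add("name", DataTypes.StringType);
          Dataset<Row> df = spark.createDataFrame(
              Arrays.asList(RowFactory.create(1L, "a"), RowFactory.create(2L, "b")), schema);

          df.write().format("vortex").save("/path/to/output"); // placeholder path
          spark.stop();
        }
      }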
    • shortName

      public String shortName()
      Returns the short name identifier for this data source.

      This name is used by Spark when registering the data source and can be used in SQL queries and DataFrame read operations to specify this format.

      Specified by:
      shortName in interface org.apache.spark.sql.sources.DataSourceRegister
      Returns:
      the short name "vortex"
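
      For instance, the short name can serve as the USING clause in Spark SQL; the table name and path below are placeholders:

      import org.apache.spark.sql.SparkSession;

      public final class ShortNameSketch {
        public static void main(String[] args) {
          SparkSession spark = SparkSession.builder().appName("short-name").getOrCreate();

          // "vortex" identifies this format in SQL DDL as well as in the
          // DataFrame reader/writer API.
          spark.sql("CREATE TABLE events USING vortex OPTIONS (path '/path/to/input.vortex')");
          spark.sql("SELECT COUNT(*) FROM events").show();

          spark.stop();
        }
      }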