Infer Schema
- RFC PR: datafuselabs/databend#8645
- Tracking Issue: datafuselabs/databend#8646
Summary
Allow inferring a schema from staged files or external locations.
Motivation
To load data from a stage or location, users need to create a table first. However, sometimes users don't know the file schema, or the schema is too complex (or simply too tedious) to be entered by hand.
Allowing schema inference from existing files makes our users' lives easier. Also, this feature unblocks Databend's implementation of `SELECT * FROM @my_stage`.
Guide-level explanation
Users will be able to infer the schema of a staged file by running:
INFER @my_stage/data.csv FILE_FORMAT = ( TYPE = CSV );
+-------------+---------+----------+
| COLUMN_NAME | TYPE | NULLABLE |
|-------------+---------+----------|
| CONTINENT | TEXT | True |
| COUNTRY | VARIANT | True |
+-------------+---------+----------+
Inferring from an external location is also supported:
INFER 's3://mybucket/data.csv' FILE_FORMAT = ( TYPE = CSV );
+-------------+---------+----------+
| COLUMN_NAME | TYPE | NULLABLE |
|-------------+---------+----------|
| CONTINENT | TEXT | True |
| COUNTRY | VARIANT | True |
+-------------+---------+----------+
`CREATE TABLE` will support `CREATE TABLE <table> BY ( <Query> )` to accept the output from `INFER`, so users can create a table with `INFER` directly.
For example:
CREATE TABLE test BY (
INFER 's3://mybucket/data.csv' FILE_FORMAT = ( TYPE = CSV )
);
Users can also make some changes to the output of `INFER`:
CREATE TABLE test BY (
    SELECT UPPER(COLUMN_NAME), TYPE, TRUE FROM (
INFER 's3://mybucket/data.csv' FILE_FORMAT = ( TYPE = CSV )
)
);
Reference-level explanation
We will adopt the `infer_schema` feature from arrow2 to make this possible. Take `csv` as an example:
pub fn infer_schema<R: Read + Seek, F: Fn(&[u8]) -> DataType>(
reader: &mut Reader<R>,
max_rows: Option<usize>,
has_header: bool,
infer: &F
) -> Result<(Vec<Field>, usize)>
We will then convert each `Field` into a data type that Databend supports.
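The sketch below shows how that conversion could look. It is a minimal sketch, assuming arrow2's `Field` (with `name`, `data_type`, and `is_nullable`) as input; the Databend type names on the right-hand side and the function names are illustrative placeholders, not the final mapping.

```rust
use arrow2::datatypes::{DataType, Field};

/// Map an arrow2 `DataType` to an illustrative Databend type name.
/// The names returned here are placeholders; the real mapping would
/// target whatever types Databend's planner supports.
fn arrow_type_to_databend(dt: &DataType) -> &'static str {
    match dt {
        DataType::Boolean => "BOOLEAN",
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => "BIGINT",
        DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => "BIGINT UNSIGNED",
        DataType::Float32 | DataType::Float64 => "DOUBLE",
        DataType::Utf8 | DataType::LargeUtf8 => "VARCHAR",
        DataType::Date32 | DataType::Date64 => "DATE",
        DataType::Timestamp(_, _) => "TIMESTAMP",
        _ => "VARIANT",
    }
}

/// Turn the inferred fields into the (COLUMN_NAME, TYPE, NULLABLE)
/// rows shown in the guide-level `INFER` examples.
fn fields_to_rows(fields: &[Field]) -> Vec<(String, String, bool)> {
    fields
        .iter()
        .map(|f| {
            (
                f.name.clone(),
                arrow_type_to_databend(&f.data_type).to_string(),
                f.is_nullable,
            )
        })
        .collect()
}
```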
Drawbacks
None.
Rationale and alternatives
None.
Prior art
Snowflake
Snowflake has an `infer_schema` function:
select *
from table(
infer_schema(
location=>'@mystage'
, file_format=>'my_parquet_format'
)
);
+-------------+---------+----------+---------------------+--------------------------+----------+
| COLUMN_NAME | TYPE | NULLABLE | EXPRESSION | FILENAMES | ORDER_ID |
|-------------+---------+----------+---------------------+--------------------------+----------|
| continent | TEXT | True | $1:continent::TEXT | geography/cities.parquet | 0 |
| country | VARIANT | True | $1:country::VARIANT | geography/cities.parquet | 1 |
| COUNTRY | VARIANT | True | $1:COUNTRY::VARIANT | geography/cities.parquet | 2 |
+-------------+---------+----------+---------------------+--------------------------+----------+
Snowflake's `CREATE TABLE` supports `USING TEMPLATE` to accept the output from `infer_schema`:
create table mytable
using template (
select array_agg(object_construct(*))
within group (order by order_id)
from table(
infer_schema(
location=>'@mystage',
file_format=>'my_parquet_format'
)
));
Unresolved questions
None.
Future possibilities
Generate a CREATE TABLE Statement from INFER
It's possible to generate a `CREATE TABLE` statement from `INFER` in the future, so users can modify the statement manually instead of through SQL (a rough sketch of such generation follows the example below).
For example:
SOME MAGIC SQL
+--------------------------------------------------+
| STATEMENT |
|--------------------------------------------------|
| CREATE TABLE table_name (name VARCHAR, age INT); |
+--------------------------------------------------+
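A minimal sketch of such generation, assuming the `(COLUMN_NAME, TYPE, NULLABLE)` rows produced by `INFER` as input; the function name and the exact DDL formatting are hypothetical.

```rust
/// Hypothetical helper: assemble a CREATE TABLE statement from the
/// (COLUMN_NAME, TYPE, NULLABLE) rows returned by INFER.
/// Identifier quoting and type validation are omitted for brevity.
fn create_table_statement(table: &str, columns: &[(String, String, bool)]) -> String {
    let cols: Vec<String> = columns
        .iter()
        .map(|(name, ty, nullable)| {
            if *nullable {
                format!("{} {}", name, ty)
            } else {
                format!("{} {} NOT NULL", name, ty)
            }
        })
        .collect();
    format!("CREATE TABLE {} ({});", table, cols.join(", "))
}

fn main() {
    let columns = vec![
        ("name".to_string(), "VARCHAR".to_string(), true),
        ("age".to_string(), "INT".to_string(), true),
    ];
    // Prints: CREATE TABLE table_name (name VARCHAR, age INT);
    println!("{}", create_table_statement("table_name", &columns));
}
```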
SELECT FROM external location
With `INFER` support, we can build a temporary table schema to read data from an external location.
SELECT * FROM @my_stage;
SELECT * FROM 's3://my_bucket/data.csv';
We will execute `INFER` internally to get the schema of the files and then fill data into the table.
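A rough sketch of that flow, with all types and function names hypothetical: the same inference used by `INFER` produces a temporary table schema, which the query can then be bound against before the files are scanned.

```rust
/// Hypothetical column description, mirroring the INFER output rows.
struct ColumnDesc {
    name: String,
    type_name: String,
    nullable: bool,
}

/// Hypothetical temporary table schema built for a stage/location scan.
struct TempTableSchema {
    columns: Vec<ColumnDesc>,
}

/// Sketch of planning `SELECT * FROM <location>`: infer the schema first,
/// then expose it as a temporary table the query can read from.
/// `infer_columns` stands in for the internal INFER implementation.
fn plan_select_from_location<F>(location: &str, infer_columns: F) -> TempTableSchema
where
    F: Fn(&str) -> Vec<ColumnDesc>,
{
    // 1. Run schema inference over the files at the location.
    let columns = infer_columns(location);
    // 2. Build the temporary table schema; the actual scan that fills
    //    data into the table is not shown here.
    TempTableSchema { columns }
}
```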