Infer Schema
- RFC PR: datafuselabs/databend#8645
- Tracking Issue: datafuselabs/databend#8646
Summary
Allow inferring a schema from staged files or external locations.
Motivation
To load data from a stage or location, users need to create a table first. However, sometimes users don't know the file schema, or the schema is too complex (or simply too tedious) to be entered by hand.
Allowing schema inference from existing files makes our users' lives easier. Also, this feature unblocks Databend's implementation of `SELECT * FROM @my_stage`.
Guide-level explanation
Users will be able to infer the schema of a staged file by running:
INFER @my_stage/data.csv FILE_FORMAT = ( TYPE = CSV );
+-------------+---------+----------+
| COLUMN_NAME | TYPE | NULLABLE |
|-------------+---------+----------|
| CONTINENT | TEXT | True |
| COUNTRY | VARIANT | True |
+-------------+---------+----------+
Inferring from an external location is also supported:
INFER 's3://mybucket/data.csv' FILE_FORMAT = ( TYPE = CSV );
+-------------+---------+----------+
| COLUMN_NAME | TYPE | NULLABLE |
|-------------+---------+----------|
| CONTINENT | TEXT | True |
| COUNTRY | VARIANT | True |
+-------------+---------+----------+
`CREATE TABLE` will support `CREATE TABLE <table> BY ( <Query> )` to accept the output from `INFER`, so users can create a table with `INFER` directly.
For example:
CREATE TABLE test BY (
INFER 's3://mybucket/data.csv' FILE_FORMAT = ( TYPE = CSV )
);
Users can also make some changes to the output of `INFER`:
CREATE TABLE test BY (
    SELECT UPPER(COLUMN_NAME), TYPE, TRUE FROM (
INFER 's3://mybucket/data.csv' FILE_FORMAT = ( TYPE = CSV )
)
);
Reference-level explanation
We will adopt the `infer_schema` feature from arrow2 to make this possible. Take `csv` as an example:
pub fn infer_schema<R: Read + Seek, F: Fn(&[u8]) -> DataType>(
reader: &mut Reader<R>,
max_rows: Option<usize>,
has_header: bool,
infer: &F
) -> Result<(Vec<Field>, usize)>
We will then convert each `Field` into a data type that Databend supports.
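The sketch below shows how that conversion could look. It is a minimal sketch, assuming arrow2's `Field` (with `name`, `data_type`, and `is_nullable`) as input; the Databend type names on the right-hand side and the function names are illustrative placeholders, not the final mapping.

```rust
use arrow2::datatypes::{DataType, Field};

/// Map an arrow2 `DataType` to an illustrative Databend type name.
/// The names returned here are placeholders; the real mapping would
/// target whatever types Databend's planner supports.
fn arrow_type_to_databend(dt: &DataType) -> &'static str {
    match dt {
        DataType::Boolean => "BOOLEAN",
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => "BIGINT",
        DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => "BIGINT UNSIGNED",
        DataType::Float32 | DataType::Float64 => "DOUBLE",
        DataType::Utf8 | DataType::LargeUtf8 => "VARCHAR",
        DataType::Date32 | DataType::Date64 => "DATE",
        DataType::Timestamp(_, _) => "TIMESTAMP",
        _ => "VARIANT",
    }
}

/// Turn the inferred fields into the (COLUMN_NAME, TYPE, NULLABLE)
/// rows shown in the guide-level `INFER` examples.
fn fields_to_rows(fields: &[Field]) -> Vec<(String, String, bool)> {
    fields
        .iter()
        .map(|f| {
            (
                f.name.clone(),
                arrow_type_to_databend(&f.data_type).to_string(),
                f.is_nullable,
            )
        })
        .collect()
}
```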
Drawbacks
None.
Rationale and alternatives
None.
Prior art
Snowflake
Snowflake has an `infer_schema` function:
select *
from table(
infer_schema(
location=>'@mystage'
, file_format=>'my_parquet_format'
)
);
+-------------+---------+----------+---------------------+--------------------------+----------+
| COLUMN_NAME | TYPE | NULLABLE | EXPRESSION | FILENAMES | ORDER_ID |
|-------------+---------+----------+---------------------+--------------------------+----------|
| continent | TEXT | True | $1:continent::TEXT | geography/cities.parquet | 0 |
| country | VARIANT | True | $1:country::VARIANT | geography/cities.parquet | 1 |
| COUNTRY | VARIANT | True | $1:COUNTRY::VARIANT | geography/cities.parquet | 2 |
+-------------+---------+----------+---------------------+--------------------------+----------+
Snowflake's `CREATE TABLE` supports `USING TEMPLATE` to accept the output from `infer_schema`:
create table mytable
using template (
select array_agg(object_construct(*))
within group (order by order_id)
from table(
infer_schema(
location=>'@mystage',
file_format=>'my_parquet_format'
)
));
Unresolved questions
None.
Future possibilities
Generate a CREATE TABLE Statement from INFER
It's possible to generate a `CREATE TABLE` statement from `INFER` in the future, so users can modify the statement manually instead of through SQL (a rough sketch of such generation follows the example below).
For example:
SOME MAGIC SQL
+--------------------------------------------------+
| STATEMENT |
|--------------------------------------------------|
| CREATE TABLE table_name (name VARCHAR, age INT); |
+--------------------------------------------------+
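A minimal sketch of such generation, assuming the `(COLUMN_NAME, TYPE, NULLABLE)` rows produced by `INFER` as input; the function name and the exact DDL formatting are hypothetical.

```rust
/// Hypothetical helper: assemble a CREATE TABLE statement from the
/// (COLUMN_NAME, TYPE, NULLABLE) rows returned by INFER.
/// Identifier quoting and type validation are omitted for brevity.
fn create_table_statement(table: &str, columns: &[(String, String, bool)]) -> String {
    let cols: Vec<String> = columns
        .iter()
        .map(|(name, ty, nullable)| {
            if *nullable {
                format!("{} {}", name, ty)
            } else {
                format!("{} {} NOT NULL", name, ty)
            }
        })
        .collect();
    format!("CREATE TABLE {} ({});", table, cols.join(", "))
}

fn main() {
    let columns = vec![
        ("name".to_string(), "VARCHAR".to_string(), true),
        ("age".to_string(), "INT".to_string(), true),
    ];
    // Prints: CREATE TABLE table_name (name VARCHAR, age INT);
    println!("{}", create_table_statement("table_name", &columns));
}
```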
SELECT FROM external location
With `INFER` support, we can build a temporary table schema to read data from an external location.
SELECT * FROM @my_stage;
SELECT * FROM 's3://my_bucket/data.csv';
We will execute `INFER` internally to get the schema of the files and then fill data into the table.
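A rough sketch of that flow, with all types and function names hypothetical: the same inference used by `INFER` produces a temporary table schema, which the query can then be bound against before the files are scanned.

```rust
/// Hypothetical column description, mirroring the INFER output rows.
struct ColumnDesc {
    name: String,
    type_name: String,
    nullable: bool,
}

/// Hypothetical temporary table schema built for a stage/location scan.
struct TempTableSchema {
    columns: Vec<ColumnDesc>,
}

/// Sketch of planning `SELECT * FROM <location>`: infer the schema first,
/// then expose it as a temporary table the query can read from.
/// `infer_columns` stands in for the internal INFER implementation.
fn plan_select_from_location<F>(location: &str, infer_columns: F) -> TempTableSchema
where
    F: Fn(&str) -> Vec<ColumnDesc>,
{
    // 1. Run schema inference over the files at the location.
    let columns = infer_columns(location);
    // 2. Build the temporary table schema; the actual scan that fills
    //    data into the table is not shown here.
    TempTableSchema { columns }
}
```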