Data type enforcement in Bigtable
Bigtable's flexible schema lets you store data of any type – strings, dates, numbers, JSON documents, or even images or PDFs – in a Bigtable table.
This document describes the cases in which Bigtable enforces a data type, which requires you to encode or decode the data in your application code. For a list of Bigtable data types, see Type in the Data API reference documentation.
Enforced types
Data type is enforced for the following data:
- Aggregate column families (counters)
- Timestamps
- Continuous materialized views
Aggregates
For the aggregate data type, encoding depends on the aggregation type. When you create an aggregate column family, you must specify an aggregation type.
This table shows the input type and encoding for each aggregation type.
Aggregate type | Input type | Encoding |
---|---|---|
Sum | Int64 | BigEndianBytes |
Min | Int64 | BigEndianBytes |
Max | Int64 | BigEndianBytes |
HLL | Bytes | Zetasketch HLL++ |
When you query the data in aggregate cells using SQL, SQL automatically incorporates type information.
When you read the data in aggregate cells using the Data API's ReadRows method, Bigtable returns bytes, so your application must decode the values using the encoding that Bigtable used to map the typed data to bytes.
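As an illustration of the decoding step, the following minimal sketch converts the raw bytes of a Sum, Min, or Max aggregate cell into a Python integer. It assumes that BigEndianBytes means an 8-byte, big-endian, two's complement signed integer. The function names are hypothetical and aren't part of any Bigtable client library.

```python
import struct


def decode_int64_aggregate(raw: bytes) -> int:
    """Decode an aggregate cell value returned as raw bytes by ReadRows.

    Assumption: the value is an 8-byte big-endian signed integer.
    """
    if len(raw) != 8:
        raise ValueError(f"expected 8 bytes, got {len(raw)}")
    # ">q" means big-endian, signed 64-bit integer.
    return struct.unpack(">q", raw)[0]


def encode_int64(value: int) -> bytes:
    """Inverse of decode_int64_aggregate, for values you write yourself."""
    return struct.pack(">q", value)
```

For example, `decode_int64_aggregate(cell_value)` could be called on the value of each cell that your read returns, where `cell_value` stands in for the bytes your client library gives you.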
You can't convert a column family that contains non-aggregate data into an aggregate column family. Columns in aggregate column families can't contain non-aggregate cells, and standard column families can't contain aggregate cells.
For more information about creating tables with aggregate column families, see Create a table. For code samples that show how to increment an aggregate cell with encoded values, see Increment a value.
Timestamps
Each Bigtable cell has an Int64 timestamp, which is the number of microseconds since the Unix epoch, 1970-01-01 00:00:00 UTC. The value is expressed in microseconds but can have, at most, millisecond precision, so it must be a multiple of 1,000. Bigtable rejects a timestamp with microsecond precision, such as 3023483279876543. In this example, the acceptable timestamp value is 3023483279876000.
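One way to satisfy the precision requirement is to truncate a microsecond value to the nearest lower millisecond before writing. The following sketch shows that arithmetic. The helper names are hypothetical, and truncation (rather than rounding) is a choice made here for illustration.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def truncate_to_millis(micros: int) -> int:
    """Drop sub-millisecond digits so the value is a multiple of 1,000."""
    return micros - (micros % 1000)


def datetime_to_cell_timestamp(dt: datetime) -> int:
    """Convert a timezone-aware datetime to microseconds since the Unix
    epoch, truncated to millisecond precision."""
    delta = dt - EPOCH
    # Integer arithmetic avoids floating-point rounding errors.
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return truncate_to_millis(micros)
```

Applied to the example above, `truncate_to_millis(3023483279876543)` gives `3023483279876000`.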
Continuous materialized views
Continuous materialized views are read-only resources that you can read by using SQL or a Data API ReadRows call. Data in a continuous materialized view is typed based on the query that defines it. For an overview, see Continuous materialized views.
When you use SQL to query a continuous materialized view, SQL automatically incorporates type information.
When you read from a continuous materialized view using a Data API ReadRows request, you must know each column's type and decode it in your application code.
Aggregated values in a continuous materialized view are stored using the encoding described in the following table, based on the output type of the column in the view definition.
Type | Encoding |
---|---|
BOOL | 1 byte value, 1 = true, 0 = false |
BYTES | No encoding |
INT64 (or INT, SMALLINT, INTEGER, BIGINT, TINYINT, BYTEINT) | 64-bit big-endian |
FLOAT64 | 64-bit IEEE 754, excluding NaN and +/-inf |
STRING | UTF-8 |
TIME/TIMESTAMP | 64-bit integer representing the number of microseconds since the Unix epoch (consistent with GoogleSQL) |
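The table can be turned into a small decoder for bytes read from a continuous materialized view. The following is a sketch, not a definitive implementation. The function name and the type-name strings are hypothetical, and two details are assumptions that the table doesn't state: that FLOAT64 and TIMESTAMP values are big-endian like INT64, and that integers are signed.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_view_value(sql_type: str, raw: bytes):
    """Decode raw bytes according to the column's output type."""
    t = sql_type.upper()
    if t == "BOOL":
        if raw not in (b"\x00", b"\x01"):
            raise ValueError("BOOL must be a single 0 or 1 byte")
        return raw == b"\x01"
    if t == "BYTES":
        return raw  # no encoding applied
    if t == "INT64":
        return struct.unpack(">q", raw)[0]  # 64-bit big-endian
    if t == "FLOAT64":
        # Assumption: IEEE 754 double stored big-endian.
        return struct.unpack(">d", raw)[0]
    if t == "STRING":
        return raw.decode("utf-8")
    if t == "TIMESTAMP":
        # Assumption: big-endian signed microseconds since the Unix epoch.
        micros = struct.unpack(">q", raw)[0]
        return EPOCH + timedelta(microseconds=micros)
    raise ValueError(f"unsupported type: {sql_type}")
```

Because the view's types come from its defining query, your application would need to keep a mapping from column qualifier to type and pass the matching type name for each cell.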
Unenforced types
If no type information is provided, then Bigtable treats each cell as bytes with an unknown encoding.
When you query column families that were created without type enforcement, you must provide type information at read time to ensure that the data is read correctly. This matters for database functions whose behavior depends on the data type. GoogleSQL for Bigtable offers CAST functions that do type conversions at query time. These functions convert from bytes to the types that various functions expect.
While Bigtable doesn't enforce types, certain operations assume a data type. Knowing this helps you ensure that your data is written in a way that can be processed within the database. The following are examples:
- Increments using ReadModifyWriteRow assume the cell contains a 64-bit big-endian signed integer.
- The TO_VECTOR64 function in SQL expects the cell to contain a byte array that's a concatenation of the big-endian bytes of 64-bit floating point numbers.
- The TO_VECTOR32 function in SQL expects the cell to contain a byte array that's a concatenation of the big-endian bytes of 32-bit floating point numbers.
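To write data that these operations can process, you could encode values as in the following sketch. The function names are hypothetical. The sketch only produces the byte strings described in the list above and doesn't call any Bigtable API.

```python
import struct
from typing import Sequence


def encode_counter(value: int) -> bytes:
    """64-bit big-endian signed integer, the layout that
    ReadModifyWriteRow increments assume."""
    return struct.pack(">q", value)


def encode_vector64(values: Sequence[float]) -> bytes:
    """Concatenated big-endian 64-bit floats, as TO_VECTOR64 expects."""
    return struct.pack(f">{len(values)}d", *values)


def encode_vector32(values: Sequence[float]) -> bytes:
    """Concatenated big-endian 32-bit floats, as TO_VECTOR32 expects.

    Values are narrowed to single precision when packed.
    """
    return struct.pack(f">{len(values)}f", *values)
```

A vector of n elements occupies 8n bytes in the 64-bit layout and 4n bytes in the 32-bit layout, so the cell length tells you how many elements it holds.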