Data type enforcement in Bigtable

Bigtable's flexible schema lets you store data of any type – strings, dates, numbers, JSON documents, or even images or PDFs – in a Bigtable table.

This document describes the cases in which Bigtable enforces a data type, which means that you must encode or decode values in your application code. For a list of Bigtable data types, see Type in the Data API reference documentation.

Enforced types

Data type is enforced for the following data:

  • Aggregate column families (counters)
  • Timestamps
  • Continuous materialized views

Aggregates

For the aggregate data type, encoding depends on the aggregation type. When you create an aggregate column family, you must specify an aggregation type.

This table shows the input type and encoding for each aggregation type.

Aggregate type   Input type   Encoding
Sum              Int64        BigEndianBytes
Min              Int64        BigEndianBytes
Max              Int64        BigEndianBytes
HLL              Bytes        Zetasketch HLL++

When you query the data in aggregate cells using SQL, SQL automatically incorporates type information.

When you read the data in aggregate cells using the Data API's ReadRows method, Bigtable returns bytes, so your application must decode the values using the encoding that Bigtable used to map the typed data to bytes.
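
As an illustration of the decoding step, the following minimal Python sketch turns the bytes of a Sum, Min, or Max cell back into an integer. It assumes the cell value is exactly 8 bytes holding a signed Int64 in big-endian order, as the table above indicates. The function name and the sample bytes are hypothetical, and HLL cells are not covered because they use the Zetasketch HLL++ encoding.

```python
import struct


def decode_int64_aggregate(raw: bytes) -> int:
    """Decode the raw bytes of a Sum, Min, or Max aggregate cell.

    Assumption: the value is a signed 64-bit integer in big-endian order.
    """
    if len(raw) != 8:
        raise ValueError(f"expected 8 bytes, got {len(raw)}")
    # ">q" means big-endian, signed 64-bit integer.
    (value,) = struct.unpack(">q", raw)
    return value


# Hypothetical cell value, shaped like the bytes a ReadRows response carries.
sample_cell_value = b"\x00\x00\x00\x00\x00\x00\x00\x2a"
print(decode_int64_aggregate(sample_cell_value))
```

The length check is a design choice: failing early on a value that is not 8 bytes is usually easier to debug than a silently wrong number.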

You can't convert a column family that contains non-aggregate data into an aggregate column family. Columns in aggregate column families can't contain non-aggregate cells, and standard column families can't contain aggregate cells.

For more information about creating tables with aggregate column families, see Create a table. For code samples that show how to increment an aggregate cell with encoded values, see Increment a value.

Timestamps

Each Bigtable cell has an Int64 timestamp that is the number of microseconds since the Unix epoch, 1970-01-01 00:00:00 UTC. The value must have, at most, millisecond precision, so it must be a multiple of 1,000. For example, Bigtable rejects 3023483279876543 because it has microsecond precision. In this example, the acceptable timestamp value is 3023483279876000.
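
One way to satisfy this rule in application code is to truncate the microsecond value to a whole millisecond before writing. The following Python sketch shows that arithmetic. The function names are hypothetical, and rounding down (rather than to the nearest millisecond) is a choice made for this example.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def truncate_to_millis(micros: int) -> int:
    """Round a microsecond timestamp down to millisecond precision."""
    return micros - (micros % 1000)


def datetime_to_cell_timestamp(dt: datetime) -> int:
    """Convert a timezone-aware datetime to microseconds since the epoch,
    truncated to millisecond precision. Integer arithmetic avoids
    floating-point rounding."""
    delta = dt - EPOCH
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return truncate_to_millis(micros)


# The value from the text above, reduced to millisecond precision.
print(truncate_to_millis(3023483279876543))
```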

Continuous materialized views

Continuous materialized views are read-only resources that you can read by using SQL or with a ReadRows Data API call. Data in a materialized view is typed based on the query that defines it. For an overview, see Continuous materialized views.

When you use SQL to query a continuous materialized view, SQL automatically incorporates type information.

When you read from a continuous materialized view using a Data API ReadRows request, you must know each column's type and decode it in your application code.

Aggregated values in a continuous materialized view are stored using the encoding described in the following table, based on the output type of the column in the view definition.

Type             Encoding
BOOL             1-byte value: 1 = true, 0 = false
BYTES            No encoding
INT64            64-bit big-endian (also applies to INT, SMALLINT, INTEGER, BIGINT, TINYINT, BYTEINT)
FLOAT64          64-bit IEEE 754, excluding NaN and +/-inf
STRING           UTF-8
TIME/TIMESTAMP   64-bit integer representing the number of microseconds since the Unix epoch (consistent with GoogleSQL)

For more information, see Encoding in the Data API reference.
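
To make the table concrete, the following Python sketch decodes a raw cell value according to the column's output type. It is a sketch under stated assumptions, not a definitive implementation: the function name and the string type labels are hypothetical, and the table does not state a byte order for FLOAT64 or TIMESTAMP, so big-endian is assumed here to match INT64. Check the Encoding reference before relying on that assumption.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_view_value(raw: bytes, sql_type: str):
    """Decode bytes read from a continuous materialized view column."""
    if sql_type == "BOOL":
        if raw not in (b"\x00", b"\x01"):
            raise ValueError("BOOL must be a single byte, 0 or 1")
        return raw == b"\x01"
    if sql_type == "BYTES":
        return raw  # no encoding applied
    if sql_type == "INT64":
        return struct.unpack(">q", raw)[0]  # 64-bit big-endian
    if sql_type == "FLOAT64":
        # Assumption: big-endian byte order for the IEEE 754 double.
        return struct.unpack(">d", raw)[0]
    if sql_type == "STRING":
        return raw.decode("utf-8")
    if sql_type == "TIMESTAMP":
        # Assumption: big-endian 64-bit count of microseconds since the epoch.
        micros = struct.unpack(">q", raw)[0]
        return EPOCH + timedelta(microseconds=micros)
    raise ValueError(f"unsupported type: {sql_type}")


print(decode_view_value(b"\x00\x00\x00\x00\x00\x00\x00\x07", "INT64"))
```

Because a ReadRows response carries only bytes, the application has to keep its own mapping from each view column to its output type and pass that type to a decoder like this one.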

Unenforced types

If no type information is provided, then Bigtable treats each cell as bytes with an unknown encoding.

When you query column families that were created without type enforcement, you must provide type information at read time to ensure that the data is read correctly. This matters for functions whose behavior depends on the data type. GoogleSQL for Bigtable offers CAST functions that do type conversions at query time. These functions convert from bytes to the types that various functions expect.

Although Bigtable doesn't enforce types for this data, certain operations assume a data type. Knowing these assumptions helps you write data in a form that can be processed within the database. The following are examples:

  • Increments using ReadModifyWriteRow assume the cell contains a 64-bit big-endian signed integer.
  • The TO_VECTOR64 function in SQL expects the cell to contain a byte array that's a concatenation of the big-endian bytes of 64-bit floating point numbers.
  • The TO_VECTOR32 function in SQL expects the cell to contain a byte array that's a concatenation of the big-endian bytes of 32-bit floating point numbers.
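
The byte layouts in the preceding list can be produced on the write side with a few lines of Python. This is a minimal sketch with hypothetical helper names. It only builds the byte strings and does not call any Bigtable client library.

```python
import struct


def encode_counter(value: int) -> bytes:
    """64-bit big-endian signed integer, the layout that
    ReadModifyWriteRow increments assume."""
    return struct.pack(">q", value)


def encode_vector64(values) -> bytes:
    """Concatenated big-endian 64-bit floats, the layout TO_VECTOR64 expects."""
    return struct.pack(f">{len(values)}d", *values)


def encode_vector32(values) -> bytes:
    """Concatenated big-endian 32-bit floats, the layout TO_VECTOR32 expects."""
    return struct.pack(f">{len(values)}f", *values)


print(encode_counter(1).hex())
print(len(encode_vector64([0.5, 1.0, 2.0])))  # 8 bytes per element
print(len(encode_vector32([0.5, 1.0, 2.0])))  # 4 bytes per element
```

Note that packing as 32-bit floats loses precision for values that a 32-bit float can't represent exactly, so choose the vector width to match the function you plan to query with.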

What's next