Best practices for spatial analysis

This document describes best practices for optimizing geospatial query performance in BigQuery. You can use these best practices to improve performance and reduce cost and latency.

Datasets can contain large collections of polygons, multipolygon shapes, and linestrings to represent complex features—for example, roads, land parcels, and flood zones. Each shape can contain thousands of points. In most spatial operations in BigQuery (for example, intersections and distance calculations), the underlying algorithm usually visits the majority of points in each shape to produce a result. For some operations, the algorithm visits all points. For complex shapes, visiting each point can increase the cost and duration of the spatial operations. You can use the strategies and methods presented in this guide to optimize these common spatial operations for improved performance and reduced cost.

This document assumes that your BigQuery geospatial tables are clustered on a geography column.

Simplify shapes

Best practice: Use simplify and snap-to-grid functions to store a simplified version of your original dataset as a materialized view.

Many complex shapes with large numbers of points can be simplified without much loss in precision. Use the BigQuery ST_SIMPLIFY and ST_SNAPTOGRID functions separately or together to reduce the number of points in complex shapes. Combine these functions with BigQuery materialized views to store a simplified version of your original dataset as a materialized view that's automatically kept up to date against the base table.

Simplifying shapes is most useful for improving the cost and performance of a dataset in the following use cases:

  • You need to maintain a high degree of similarity to the true shape.
  • You must perform high-precision, high-accuracy operations.
  • You want to speed up visualizations without visible loss in shape detail.

The following code sample shows how to use the ST_SIMPLIFY function on a base table that has a GEOGRAPHY column named geom. The code simplifies shapes and removes points without disturbing any edge of a shape by more than the given tolerance of 1.0 meters.

CREATE MATERIALIZED VIEW project.dataset.base_mv
  CLUSTER BY geom
AS (
  SELECT
    * EXCEPT (geom),
    ST_SIMPLIFY(geom, 1.0) AS geom
  FROM base_table
)

The following code sample shows how to use the ST_SNAPTOGRID function to snap the points to a grid with a resolution of 0.00001 degrees:

CREATE MATERIALIZED VIEW project.dataset.base_mv
  CLUSTER BY geom
AS (
  SELECT
    * EXCEPT (geom),
    ST_SNAPTOGRID(geom, -5) AS geom
  FROM base_table
)

The grid_size argument in this function serves as the exponent, which means 10e-5 = 0.00001. This resolution is equivalent to around 1 meter in the worst case, which occurs at the equator.

After you create these views, query the base_mv view using the same query semantics you would use to query the base table. You can use this technique to quickly identify a collection of shapes that need to be analyzed more deeply, and then you can perform a second deeper analysis on the base table. Test your queries to see which threshold values work best for your data.

For measurement use cases, determine the level of accuracy that your use case requires. When using the ST_SIMPLIFY function, set the threshold_meters parameter to the required level of accuracy. For measuring distances at the scale of a city or larger, set a threshold of 10 meters. At smaller scales—for example, when measuring the distance between a building and the nearest body of water—consider using a smaller threshold of 1 meter or less. Using smaller threshold values results in removing fewer points from the given shape.

When serving map layers from a web service, you can precalculate materialized views for different zoom levels with the bigquery-geotools project, which is a driver for Geoserver that lets you serve spatial layers from BigQuery. This driver creates multiple materialized views with different ST_SIMPLIFY threshold parameters so that less detail is served at higher zoom levels.

Use points and rectangles

Best practice: Reduce the shape to a point or rectangle to represent its location.

You can improve query performance by reducing the shape to a single point or a rectangle. The methods in this section don't accurately represent the details and proportions of the shape, but rather optimize for representing the location of the shape.

You can use the geographic central point of a shape (its centroid) to represent the location of the whole shape. Use a rectangle containing the shape to create the shape's extent, which you can use to represent the shape's location and maintain information about its relative size.

Using points and rectangles is most useful for improving the cost and performance of a dataset when you need to measure the distance between two points, such as between two cities.

For example, consider loading a database of land parcels in the United States into a BigQuery table and then determining the nearest body of water. In this case, precomputing parcel centroids using the ST_CENTROID function in combination with the method described in the Simplify shapes section of this document can reduce the number of comparisons performed when using the ST_DISTANCE or ST_DWITHIN functions. When using the ST_CENTROID function, the parcel centroid needs to be considered in the calculation. Precomputing the parcel centroids in this way can also reduce variability in performance, because different parcel shapes are likely to contain different numbers of points.

A variant of this method is to use the ST_BOUNDINGBOX function instead of the ST_CENTROID function to compute a rectangular envelope around the input shape. While it's not quite as efficient as using a single point, it can reduce the occurrence of certain edge cases. This variant still offers good and consistent performance, since the output of the ST_BOUNDINGBOX function always contains only four points that need to be considered. The bounding box result will be of the type STRUCT, which means you'll need to calculate the distances manually or use the vector index method described later in this document.

Use hulls

Best practice: Use a hull to optimize for representing the location of a shape.

If you imagine shrink-wrapping a shape and computing the boundary of the shrink wrap, that boundary is called the hull. In a convex hull, all the angles of the resulting shape are convex. Like a shape's extent, a convex hull retains some information about the underlying shape's relative size and proportions. However, using a hull comes at the cost of needing to store and consider more points in subsequent analyses.

You can use the ST_CONVEXHULL function to optimize for representing the location of the shape. Using this function improves accuracy, but this comes at the cost of decreased performance. The ST_CONVEXHULL function is similar to the ST_EXTENT function, except the output shape contains more points and varies in the number of points based on the complexity of the input shape. While the performance benefit is likely negligible for small datasets of non-complex shapes, for very large datasets with large and complex shapes, the ST_CONVEXHULL function offers a good balance between cost, performance, and accuracy.

Use grid systems

Best practice: Use geospatial grid systems to compare areas.

If your use cases involve aggregating data within localized areas and comparing statistical aggregations of those areas with each other, you can benefit from utilizing a standardized grid system to compare different areas.

For example, a retailer might want to analyze demographic changes over time in areas where their stores are located or where they are contemplating building a new store. Or, an insurance company might want to improve their understanding of property risks by analyzing the prevailing natural hazard risks in a particular area.

Using standard grid systems such as S2 and H3 can speed up such statistical aggregations and spatial analyses. Using these grid systems can also simplify the development of analytics and improve development efficiency.

For example, comparisons using census tracts in the United States suffer from inconsistency in size, which means corrective factors need to be applied to perform like-for-like comparisons between census tracts. Additionally, census tracts and other administrative boundaries change over time and require effort to correct for these changes. Using grid systems for spatial analysis can address such challenges.

Use vector search and vector indexes

Best practice: Use vector search and vector indexes for nearest-neighbor geospatial queries.

Vector search capabilities were introduced in BigQuery to enable machine-learning use cases such as semantic search, similarity detection, and retrieval-augmented generation. The key to enabling these use cases is an indexing method called approximate nearest-neighbor search. You can use vector search to speed up and simplify nearest-neighbor geospatial queries by comparing vectors that represent points in space.

You can use vector search to search for features by radius. First, establish a radius for your search. You can discover the optimal radius in the result set of a nearest-neighbor search. After you establish the radius, use the ST_DWITHIN function to identify nearby features.

For example, consider finding the ten buildings nearest to a particular anchor building that you already have the location of. You can store the centroids of each building as a vector in a new table, index the table, and search using vector search.

For this example, you can also use Overture Maps data in BigQuery to create a separate table of building shapes corresponding to an area of interest and a vector called geom_vector. The area of interest in this example is the city of Norfolk, VA, United States, represented by FIPS code 51710, as shown in the following code sample:

CREATE TABLE geo-solution-demos.vector_search.norfolk_buildings
AS (
  SELECT
    *,
    [
      ST_X(ST_CENTROID(building.geometry)),
      ST_Y(ST_CENTROID(building.geometry))] AS geom_vector
  FROM geo-solution-demos.overture_maps.building AS building
  INNER JOIN bigquery-public-data.geo_us_boundaries.counties AS county
    ON (st_intersects(county.county_geom, building.geometry))
  WHERE county.county_fips_code = '51710'
)

The following code sample shows how to create a vector index on the table:

CREATE
  vector index building_vector_index
ON
  `geo-solution-demos.vector_search.norfolk_buildings`(geom_vector)
  OPTIONS (index_type = 'IVF')

This query identifies the 10 buildings nearest to a particular anchor building, which is represented as anchor_building_id:

declare anchor_building_id STRING;

SET anchor_building_id = '08b2af6202ce5fff02007d1778948bc4';

SELECT base.*
FROM
  VECTOR_SEARCH(
    TABLE `geo-solution-demos.vector_search.norfolk_buildings`,
    'geom_vector',
    (
      SELECT
        geom_vector
      FROM
        `geo-solution-demos.vector_search.norfolk_buildings`
      WHERE id = anchor_building_id
    ),
    top_k => 10,
    distance_type => 'EUCLIDEAN',
    options => '{"fraction_lists_to_search":0.1}')

When you view the geospatial data in BigQuery Geo Viz, the output is a cluster of building shapes nearest to the anchor building, as displayed in the following image:

Geospatial data visualized in BigQuery Geo Viz.

When you run this query in the Google Cloud console, click Job Information and verify that Vector Index Usage Mode is set to FULLY_USED. This indicates that the query is leveraging the building_vector_index vector index, which you created earlier.

Divide large shapes

Best practice: Divide large shapes with the ST_SUBDIVIDE function.

Use the ST_SUBDIVIDE function to break large shapes or long line strings into smaller shapes.

What's next