Melakukan penelusuran kemiripan vektor di Spanner dengan menemukan K-nearest neighbors
Tetap teratur dengan koleksi
Simpan dan kategorikan konten berdasarkan preferensi Anda.
Halaman ini menjelaskan cara melakukan penelusuran kesamaan vektor di
Spanner dengan menggunakan fungsi vektor jarak kosinus, jarak Euclidean, dan perkalian
titik untuk menemukan K-nearest neighbors. Informasi ini berlaku untuk database dialek GoogleSQL dan database dialek PostgreSQL. Sebelum membaca halaman ini,
Anda harus memahami konsep berikut:
Jarak Euclidean:
mengukur jarak terpendek antara dua vektor.
Jarak kosinus:
mengukur kosinus sudut antara dua vektor.
Perkalian titik: menghitung
kosinus sudut yang dikalikan dengan hasil perkalian besaran vektor yang sesuai. Jika Anda tahu bahwa semua embedding vektor dalam set data Anda telah dinormalisasi, maka Anda dapat menggunakan DOT_PRODUCT() sebagai fungsi jarak.
K-nearest neighbors (KNN):
algoritma machine learning terawasi yang digunakan untuk memecahkan masalah klasifikasi atau regresi.
Anda dapat menggunakan fungsi jarak vektor untuk melakukan penelusuran vektor tetangga K-terdekat (KNN) untuk kasus penggunaan seperti penelusuran kesamaan atau retrieval-augmented generation. Spanner mendukung fungsi COSINE_DISTANCE(),
EUCLIDEAN_DISTANCE(), dan DOT_PRODUCT(), yang beroperasi pada embedding vektor, sehingga Anda dapat menemukan KNN dari embedding input.
Misalnya, setelah membuat dan menyimpan data Spanner operasional sebagai embedding vektor, Anda dapat
memberikan embedding vektor ini sebagai parameter input dalam kueri untuk menemukan
vektor terdekat dalam ruang N-dimensi untuk menelusuri item yang secara semantik mirip atau
terkait.
Ketiga fungsi jarak mengambil argumen vector1 dan vector2, yang
berjenis array<>, dan harus terdiri dari dimensi yang sama dan memiliki
panjang yang sama. Untuk mengetahui detail selengkapnya tentang fungsi ini, lihat:
Contoh berikut menunjukkan penelusuran KNN, penelusuran KNN atas data berpartisi, dan
penggunaan indeks sekunder dengan KNN.
Semua contoh menggunakan EUCLIDEAN_DISTANCE(). Anda juga dapat menggunakan
COSINE_DISTANCE(). Selain itu, jika semua embedding vektor dalam set data Anda dinormalisasi, Anda dapat menggunakan DOT_PRODUCT() sebagai fungsi jarak.
Contoh 1: Penelusuran KNN
Pertimbangkan tabel Documents yang memiliki kolom (DocEmbedding) embedding teks yang telah dikomputasi sebelumnya dari kolom DocContents byte.
Dengan asumsi bahwa embedding input untuk "bisbol, tetapi bukan bisbol profesional"
adalah array [0.3, 0.3, 0.7, 0.7], Anda dapat menemukan lima dokumen terdekat teratas
yang cocok, dengan kueri berikut:
Contoh 2: Penelusuran KNN pada data yang dipartisi
Kueri dalam contoh sebelumnya dapat diubah dengan menambahkan kondisi ke klausa
WHERE untuk membatasi penelusuran vektor ke subset data Anda. Salah satu penerapan umum hal ini adalah untuk menelusuri data yang dipartisi, seperti baris yang termasuk dalam UserId tertentu.
Contoh 3: Penelusuran KNN melalui rentang indeks sekunder
Jika filter klausa WHERE yang Anda gunakan bukan bagian dari kunci utama tabel,
Anda dapat membuat indeks sekunder untuk mempercepat operasi dengan
pemindaian khusus indeks.
GoogleSQL
CREATEINDEXDocsByAuthorONDocuments(Author)STORING(DocEmbedding);SELECTAuthor,DocId,DocEmbeddingFROMDocumentsWHEREAuthor="Mark Twain"ORDERBYEUCLIDEAN_DISTANCE(DocEmbedding,<embeddingsfor"book about the time traveling American">)LIMIT5;
PostgreSQL
CREATEINDEXDocsByAuthorONDocuments(Author)INCLUDE(DocEmbedding);SELECTAuthor,DocId,DocEmbeddingFROMDocumentsWHEREAuthor="Mark Twain"ORDERBYspanner.euclidean_distance(DocEmbedding,<embeddingsfor"that book about the time traveling American">)LIMIT5;
[[["Mudah dipahami","easyToUnderstand","thumb-up"],["Memecahkan masalah saya","solvedMyProblem","thumb-up"],["Lainnya","otherUp","thumb-up"]],[["Sulit dipahami","hardToUnderstand","thumb-down"],["Informasi atau kode contoh salah","incorrectInformationOrSampleCode","thumb-down"],["Informasi/contoh yang saya butuhkan tidak ada","missingTheInformationSamplesINeed","thumb-down"],["Masalah terjemahan","translationIssue","thumb-down"],["Lainnya","otherDown","thumb-down"]],["Terakhir diperbarui pada 2025-08-11 UTC."],[],[],null,["# Perform vector similarity search in Spanner by finding the K-nearest neighbors\n\n| **Note:** This feature is available with the Spanner Enterprise edition and Enterprise Plus edition. For more information, see the [Spanner editions overview](/spanner/docs/editions-overview).\n\n\u003cbr /\u003e\n\nThis page describes how to perform a vector similarity search in\nSpanner by using the cosine distance, Euclidean distance, and dot\nproduct vector functions to find K-nearest neighbors. This information\napplies to both GoogleSQL-dialect databases and PostgreSQL-dialect databases. Before you read this page,\nit's important that you understand the following concepts:\n\n- [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance): measures the shortest distance between two vectors.\n- [Cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity#Cosine_distance): measures the cosine of the angle between two vectors.\n- [Dot product](https://mathworld.wolfram.com/DotProduct.html): calculates the cosine of the angle multiplied by the product of corresponding vector magnitudes. If you know that all the vector embeddings in your dataset are normalized, then you can use `DOT_PRODUCT()` as a distance function.\n- [K-nearest neighbors (KNN)](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm): a supervised machine learning algorithm used to solve classification or regression problems.\n\nYou can use vector distance functions to perform K-nearest neighbors (KNN)\nvector search for use cases like similarity search or retrieval-augmented\ngeneration. Spanner supports the `COSINE_DISTANCE()`,\n`EUCLIDEAN_DISTANCE()`, and `DOT_PRODUCT()` functions, which operate on vector\nembeddings, allowing you to find the KNN of the input embedding.\n\nFor example, after you [generate and save your operational Spanner\ndata as vector embeddings](/spanner/docs/ml-tutorial-embeddings), you can then\nprovide these vector embeddings as an input parameter in your query to find the\nnearest vectors in N-dimensional space to search for semantically similar or\nrelated items.\n\nAll three distance functions take the arguments `vector1` and `vector2`, which\nare of the type `array\u003c\u003e`, and must consist of the same dimensions and have the\nsame length. For more details about these functions, see:\n\n- [`COSINE_DISTANCE()` in GoogleSQL](/spanner/docs/reference/standard-sql/mathematical_functions#cosine_distance)\n- [`EUCLIDEAN_DISTANCE()` in GoogleSQL](/spanner/docs/reference/standard-sql/mathematical_functions#euclidean_distance)\n- [`DOT_PRODUCT()` in GoogleSQL](/spanner/docs/reference/standard-sql/mathematical_functions#dot_product)\n- [Mathematical functions in PostgreSQL](/spanner/docs/reference/postgresql/functions-and-operators#mathematical) (`spanner.cosine_distance()`, `spanner.euclidean_distance()`, and `spanner.dot_product()`)\n- [Choose among vector distance functions to measure vector embeddings similarity](/spanner/docs/choose-vector-distance-function).\n\nExamples\n--------\n\nThe following examples show KNN search, KNN search over partitioned data, and\nusing a secondary index with KNN.\n\nThe examples all use `EUCLIDEAN_DISTANCE()`. You can also use\n`COSINE_DISTANCE()`. In addition, if all the vector embeddings in your dataset\nare normalized, you can use `DOT_PRODUCT()` as a distance function.\n\n### Example 1: KNN search\n\nConsider a `Documents` table that has a column (`DocEmbedding`) of precomputed\ntext embeddings from the `DocContents` bytes column. \n\n### GoogleSQL\n\n CREATE TABLE Documents (\n UserId INT64 NOT NULL,\n DocId INT64 NOT NULL,\n Author STRING(1024),\n DocContents BYTES(MAX),\n DocEmbedding ARRAY\u003cFLOAT32\u003e\n ) PRIMARY KEY (UserId, DocId);\n\n### PostgreSQL\n\n CREATE TABLE Documents (\n UserId bigint NOT NULL,\n DocId bigint NOT NULL,\n Author varchar(1024),\n DocContents bytea,\n DocEmbedding float4[],\n PRIMARY KEY (UserId, DocId)\n );\n\nAssuming that an input embedding for \"baseball, but not professional baseball\"\nis the array `[0.3, 0.3, 0.7, 0.7]`, you can find the top five nearest documents\nthat match, with the following query: \n\n### GoogleSQL\n\n SELECT DocId, DocEmbedding FROM Documents\n ORDER BY EUCLIDEAN_DISTANCE(DocEmbedding,\n ARRAY\u003cFLOAT32\u003e[0.3, 0.3, 0.7, 0.8])\n LIMIT 5;\n\n### PostgreSQL\n\n SELECT DocId, DocEmbedding FROM Documents\n ORDER BY spanner.euclidean_distance(DocEmbedding,\n '{0.3, 0.3, 0.7, 0.8}'::float4[])\n LIMIT 5;\n\nThe expected results of this example: \n\n Documents\n +---------------------------+-----------------+\n | DocId | DocEmbedding |\n +---------------------------+-----------------+\n | 24 | [8, ...] |\n +---------------------------+-----------------+\n | 25 | [6, ...] |\n +---------------------------+-----------------+\n | 26 | [3.2, ...] |\n +---------------------------+-----------------+\n | 27 | [38, ...] |\n +---------------------------+-----------------+\n | 14229 | [1.6, ...] |\n +---------------------------+-----------------+\n\n### Example 2: KNN search over partitioned data\n\nThe query in the previous example can be modified by adding conditions to the\n`WHERE` clause to limit the vector search to a subset of your data. One common\napplication of this is to search over partitioned data, such as rows that belong\nto a specific `UserId`. \n\n### GoogleSQL\n\n SELECT UserId, DocId, DocEmbedding FROM Documents\n WHERE UserId=18\n ORDER BY EUCLIDEAN_DISTANCE(DocEmbedding,\n ARRAY\u003cFLOAT32\u003e[0.3, 0.3, 0.7, 0.8])\n LIMIT 5;\n\n### PostgreSQL\n\n SELECT UserId, DocId, DocEmbedding FROM Documents\n WHERE UserId=18\n ORDER BY spanner.euclidean_distance(DocEmbedding,\n '{0.3, 0.3, 0.7, 0.8}'::float4[])\n LIMIT 5;\n\nThe expected results of this example: \n\n Documents\n +-----------+-----------------+-----------------+\n | UserId | DocId | DocEmbedding |\n +-----------+-----------------+-----------------+\n | 18 | 234 | [12, ...] |\n +-----------+-----------------+-----------------+\n | 18 | 12 | [1.6, ...] |\n +-----------+-----------------+-----------------+\n | 18 | 321 | [22, ...] |\n +-----------+-----------------+-----------------+\n | 18 | 432 | [3, ...] |\n +-----------+-----------------+-----------------+\n\n### Example 3: KNN search over secondary index ranges\n\nIf the `WHERE` clause filter you're using isn't part of the table's primary key,\nthen you can create a secondary index to accelerate the operation with an\n[index-only scan](/spanner/docs/secondary-indexes#storing-clause). \n\n### GoogleSQL\n\n CREATE INDEX DocsByAuthor\n ON Documents(Author)\n STORING (DocEmbedding);\n\n SELECT Author, DocId, DocEmbedding FROM Documents\n WHERE Author=\"Mark Twain\"\n ORDER BY EUCLIDEAN_DISTANCE(DocEmbedding,\n \u003cembeddings for \"book about the time traveling American\"\u003e)\n LIMIT 5;\n\n### PostgreSQL\n\n CREATE INDEX DocsByAuthor\n ON Documents(Author)\n INCLUDE (DocEmbedding);\n\n SELECT Author, DocId, DocEmbedding FROM Documents\n WHERE Author=\"Mark Twain\"\n ORDER BY spanner.euclidean_distance(DocEmbedding,\n \u003cembeddings for \"that book about the time traveling American\"\u003e)\n LIMIT 5;\n\nThe expected results of this example: \n\n Documents\n +------------+-----------------+-----------------+\n | Author | DocId | DocEmbedding |\n +------------+-----------------+-----------------+\n | Mark Twain | 234 | [12, ...] |\n +------------+-----------------+-----------------+\n | Mark Twain | 12 | [1.6, ...] |\n +------------+-----------------+-----------------+\n | Mark Twain | 321 | [22, ...] |\n +------------+-----------------+-----------------+\n | Mark Twain | 432 | [3, ...] |\n +------------+-----------------+-----------------+\n | Mark Twain | 375 | [9, ...] |\n +------------+-----------------+-----------------+\n\nWhat's next\n-----------\n\n- Learn more about the [GoogleSQL `COSINE_DISTANCE()`, `EUCLIDEAN_DISTANCE()`, `DOT_PRODUCT()`](/spanner/docs/reference/standard-sql/mathematical_functions) functions.\n\n- Learn more about the [PostgreSQL `spanner.cosine_distance()`, `spanner.euclidean_distance()`, `spanner.dot_product()`](/spanner/docs/reference/postgresql/functions-and-operators#mathematical) functions.\n\n- Learn more about how to [Choose among vector distance functions to measure vector embeddings similarity](/spanner/docs/choose-vector-distance-function)."]]