API documentation for bigquery package.
Packages Functions
array_agg
array_agg(
obj: groupby.SeriesGroupBy | groupby.DataFrameGroupBy,
) -> series.Series | dataframe.DataFrame
Groups data and creates arrays from the selected columns, omitting NULLs to avoid BigQuery errors (NULLs are not allowed in arrays).
Examples:
>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> import numpy as np
>>> bpd.options.display.progress_bar = None
For a SeriesGroupBy object:
>>> lst = ['a', 'a', 'b', 'b', 'a']
>>> s = bpd.Series([1, 2, 3, 4, np.nan], index=lst)
>>> bbq.array_agg(s.groupby(level=0))
a [1. 2.]
b [3. 4.]
dtype: list<item: double>[pyarrow]
For a DataFrameGroupBy object:
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = bpd.DataFrame(l, columns=["a", "b", "c"])
>>> bbq.array_agg(df.groupby(by=["b"]))
a c
b
1.0 [2] [3]
2.0 [1 1] [3 2]
<BLANKLINE>
[2 rows x 2 columns]
Parameter | |
---|---|
Name | Description |
obj | A GroupBy object to which the function is applied. |
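The NULL-omitting behavior described above can be illustrated locally without a BigQuery session. The following is a minimal sketch in plain Python, not the library's implementation; `array_agg_local` is a hypothetical helper, and `None` stands in for NULL.

```python
from collections import defaultdict


def array_agg_local(keys, values):
    """Group values by key and collect them into lists, skipping None.

    Hypothetical stand-in for the grouping semantics described above;
    None plays the role of a NULL value.
    """
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        if value is not None:  # omit NULLs so no array contains one
            groups[key].append(value)
    return dict(groups)


keys = ["a", "a", "b", "b", "a"]
values = [1.0, 2.0, 3.0, 4.0, None]
result = array_agg_local(keys, values)
print(result)
```

The inputs mirror the SeriesGroupBy example, with the NaN entry replaced by `None`.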
array_length
array_length(series: series.Series) -> series.Series
Compute the length of each array element in the Series.
Examples:
>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> bpd.options.display.progress_bar = None
>>> s = bpd.Series([[1, 2, 8, 3], [], [3, 4]])
>>> bbq.array_length(s)
0 4
1 0
2 2
dtype: Int64
You can also apply this function directly to Series.
>>> s.apply(bbq.array_length, by_row=False)
0 4
1 0
2 2
dtype: Int64
Parameter | |
---|---|
Name | Description |
series | A Series with array columns. |
array_to_string
array_to_string(series: series.Series, delimiter: str) -> series.Series
Converts array elements within a Series into delimited strings.
Examples:
>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> import numpy as np
>>> bpd.options.display.progress_bar = None
>>> s = bpd.Series([["H", "i", "!"], ["Hello", "World"], np.nan, [], ["Hi"]])
>>> bbq.array_to_string(s, delimiter=", ")
0 H, i, !
1 Hello, World
2
3
4 Hi
dtype: string
Parameters | |
---|---|
Name | Description |
series | A Series containing arrays. |
delimiter | The string used to separate array elements. |
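As a rough local illustration of how the delimiter is applied, the sketch below joins each array's elements in plain Python. `array_to_string_local` is a hypothetical helper, and it handles only non-NULL arrays; how the real function treats NULL rows is not modeled here.

```python
def array_to_string_local(arrays, delimiter):
    """Join the elements of each array into one delimited string.

    Hypothetical local sketch; every input row must be a list.
    """
    return [delimiter.join(str(item) for item in array) for array in arrays]


arrays = [["H", "i", "!"], ["Hello", "World"], [], ["Hi"]]
strings = array_to_string_local(arrays, ", ")
print(strings)
```

An empty array produces an empty string, since there is nothing to join.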
json_set
json_set(
series: series.Series,
json_path_value_pairs: typing.Sequence[typing.Tuple[str, typing.Any]],
) -> series.Series
Produces a new JSON value within a Series by inserting or replacing values at specified paths.
Examples:
>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> import numpy as np
>>> bpd.options.display.progress_bar = None
>>> s = bpd.read_gbq("SELECT JSON '{\"a\": 1}' AS data")["data"]
>>> bbq.json_set(s, json_path_value_pairs=[("$.a", 100), ("$.b", "hi")])
0 {"a":100,"b":"hi"}
Name: data, dtype: string
Parameters | |
---|---|
Name | Description |
series | The Series containing JSON data (as native JSON objects or JSON-formatted strings). |
json_path_value_pairs | Pairs of a JSON path and the new value to insert or replace at that path. |
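The insert-or-replace behavior of the path and value pairs can be sketched locally with the standard `json` module. This is an illustration under a strong simplifying assumption: the hypothetical helper `json_set_local` accepts only top-level paths of the form `$.key`, whereas real JSON paths can be nested.

```python
import json


def json_set_local(json_text, json_path_value_pairs):
    """Insert or replace top-level keys in a JSON object string.

    Hypothetical sketch: only paths of the form "$.key" are supported.
    """
    data = json.loads(json_text)
    for path, value in json_path_value_pairs:
        key = path[2:]
        if not path.startswith("$.") or not key or "." in key or "[" in key:
            raise ValueError(f"unsupported path in this sketch: {path}")
        data[key] = value  # replaces an existing key, inserts a new one
    return json.dumps(data, separators=(",", ":"))


updated = json_set_local('{"a": 1}', [("$.a", 100), ("$.b", "hi")])
print(updated)
```

Pairs are applied in order, so a later pair that targets the same path overrides an earlier one in this sketch.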
vector_search
vector_search(
base_table: str,
column_to_search: str,
query: Union[dataframe.DataFrame, series.Series],
*,
query_column_to_search: Optional[str] = None,
top_k: Optional[int] = 10,
distance_type: Literal["euclidean", "cosine"] = "euclidean",
fraction_lists_to_search: Optional[float] = None,
use_brute_force: bool = False
) -> dataframe.DataFrame
Conducts a vector search, which searches embeddings to find semantically similar entities.
Examples:
>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> bpd.options.display.progress_bar = None
DataFrame embeddings for which to find nearest neighbors. The ARRAY<FLOAT64> column is used as the search query:
>>> search_query = bpd.DataFrame({"query_id": ["dog", "cat"],
... "embedding": [[1.0, 2.0], [3.0, 5.2]]})
>>> bbq.vector_search(
... base_table="bigframes-dev.bigframes_tests_sys.base_table",
... column_to_search="my_embedding",
... query=search_query,
... top_k=2)
query_id embedding id my_embedding distance
1 cat [3. 5.2] 5 [5. 5.4] 2.009975
0 dog [1. 2.] 1 [1. 2.] 0.0
0 dog [1. 2.] 4 [1. 3.2] 1.2
1 cat [3. 5.2] 2 [2. 4.] 1.56205
<BLANKLINE>
[4 rows x 5 columns]
Series embeddings for which to find nearest neighbors:
>>> search_query = bpd.Series([[1.0, 2.0], [3.0, 5.2]],
... index=["dog", "cat"],
... name="embedding")
>>> bbq.vector_search(
... base_table="bigframes-dev.bigframes_tests_sys.base_table",
... column_to_search="my_embedding",
... query=search_query,
... top_k=2)
embedding id my_embedding distance
dog [1. 2.] 1 [1. 2.] 0.0
cat [3. 5.2] 5 [5. 5.4] 2.009975
dog [1. 2.] 4 [1. 3.2] 1.2
cat [3. 5.2] 2 [2. 4.] 1.56205
<BLANKLINE>
[4 rows x 4 columns]
You can specify the name of the embedding column in the query DataFrame and the distance type. If you specify query_column_to_search, the function uses that column as the embeddings for which to find nearest neighbors. Otherwise, it uses the column_to_search value.
>>> search_query = bpd.DataFrame({"query_id": ["dog", "cat"],
... "embedding": [[1.0, 2.0], [3.0, 5.2]],
... "another_embedding": [[0.7, 2.2], [3.3, 5.2]]})
>>> bbq.vector_search(
... base_table="bigframes-dev.bigframes_tests_sys.base_table",
... column_to_search="my_embedding",
... query=search_query,
... distance_type="cosine",
... query_column_to_search="another_embedding",
... top_k=2)
query_id embedding another_embedding id my_embedding distance
1 cat [3. 5.2] [3.3 5.2] 2 [2. 4.] 0.005181
0 dog [1. 2.] [0.7 2.2] 4 [1. 3.2] 0.000013
1 cat [3. 5.2] [3.3 5.2] 1 [1. 2.] 0.005181
0 dog [1. 2.] [0.7 2.2] 3 [1.5 7. ] 0.004697
<BLANKLINE>
[4 rows x 6 columns]
Parameters | |
---|---|
Name | Description |
base_table | The table to search for nearest neighbor embeddings. |
column_to_search | The name of the base table column to search for nearest neighbor embeddings. The column must have a type of |
query | A Series or DataFrame that provides the embeddings for which to find nearest neighbors. |
query_column_to_search | Specifies the name of the column in the query that contains the embeddings for which to find nearest neighbors. The column must have a type of |
top_k | Specifies the number of nearest neighbors to return. Defaults to 10. |
distance_type | Specifies the type of metric to use to compute the distance between two vectors. Possible values are "euclidean" and "cosine". Defaults to "euclidean". |
fraction_lists_to_search | Specifies the percentage of lists to search. Specifying a higher percentage leads to higher recall and slower performance, and the converse is true when specifying a lower percentage. It is only used when a vector index is also used. You can only specify |
use_brute_force | Determines whether to use brute force search by skipping the vector index if one is available. Defaults to False. |
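To make the top_k and distance_type parameters concrete, the sketch below runs a brute-force nearest-neighbor search in plain Python. It is not how BigQuery executes the search; `brute_force_search` and the distance helpers are hypothetical names. The five base vectors are copied from the example output above; the full contents of the real base table are not shown in this document.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def brute_force_search(base_rows, query, top_k=10, distance_type="euclidean"):
    """Score every base vector against the query and keep the top_k closest."""
    distance_fn = {
        "euclidean": euclidean_distance,
        "cosine": cosine_distance,
    }[distance_type]
    scored = [(row_id, distance_fn(query, vec)) for row_id, vec in base_rows.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:top_k]


# id -> my_embedding, as they appear in the example output above
base_rows = {
    1: [1.0, 2.0],
    2: [2.0, 4.0],
    3: [1.5, 7.0],
    4: [1.0, 3.2],
    5: [5.0, 5.4],
}

dog = brute_force_search(base_rows, [1.0, 2.0], top_k=2)
cat = brute_force_search(base_rows, [3.0, 5.2], top_k=2)
print(dog)
print(cat)
```

With Euclidean distance and top_k=2, this sketch selects ids 1 and 4 for the "dog" query and ids 2 and 5 for the "cat" query, the same neighbors as in the first example. Scoring every row is what a brute-force search does conceptually; a vector index trades some recall for speed by searching only a fraction of the data.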