Advanced website indexing incurs monthly data
storage charges based on the size of the web data that you import into your data
store. To get an estimate of the size of your web data before importing it, you
can call the estimateDataSize method and specify the web
pages that you want to import. The estimateDataSize method is a long-running
operation that runs until the process for estimating
the data size is complete. This can take from a few minutes to over an hour,
depending on the number of web pages that you specify. After you have an
estimate of the size of your web data, you can get an estimate of your monthly
data storage costs using the AI Applications pricing page (see the Data Index
pricing section) or Google Cloud's pricing calculator (search for
AI Applications).
Important: You are permitted to use the `estimateDataSize` method only on web
domains that your company owns or is authorized to use.

Before you begin
----------------

Determine the URL patterns for the websites that you intend to include (and
optionally exclude) when you import web data into your data store. You
specify these URL patterns when you call the `estimateDataSize` method.

Procedure
---------

To get an estimate of the size of your web data, follow these steps:

1. Call the [`estimateDataSize`](/generative-ai-app-builder/docs/reference/rest/v1alpha/projects.locations/estimateDataSize) method.

       curl -X POST \
       -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
       -H "Content-Type: application/json" \
       "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global:estimateDataSize" \
       -d '{
         "website_data_source": {
           "estimator_uri_patterns": [
             {
               "provided_uri_pattern": "URI_PATTERN_TO_INCLUDE",
               "exact_match": EXACT_MATCH_BOOLEAN
             },
             {
               "provided_uri_pattern": "URI_PATTERN_TO_EXCLUDE",
               "exact_match": EXACT_MATCH_BOOLEAN,
               "exclusive": EXCLUSIVE_BOOLEAN
             }
           ]
         }
       }'

   Replace the following:

   - PROJECT_ID: the ID of your project.

   - URI_PATTERN_TO_INCLUDE: the URL patterns for the websites that
     you want to include in your data size estimate.

   - URI_PATTERN_TO_EXCLUDE: (Optional) The URL patterns for the
     websites that you want to exclude from your data size estimate.

     For URI_PATTERN_TO_INCLUDE and URI_PATTERN_TO_EXCLUDE, you can use
     patterns similar to the following:

     - Entire website: `www.mysite.com`
     - Parts of a website: `www.mysite.com/faq`
     - Entire domain: `mysite.com` or `*.mysite.com`

   - EXCLUSIVE_BOOLEAN: (Optional) If `true`, then the provided URI
     pattern represents web pages that are excluded from your data size
     estimate. The default is `false`, which means that the provided URI
     pattern represents web pages that are included in your data size estimate.

   - EXACT_MATCH_BOOLEAN: (Optional) If `true`, then the provided
     URI pattern represents a single web page, instead of the web page and all
     of its children. The default is `false`, which means that the provided URI
     pattern represents the web page and all of its children.

   The output is similar to the following:

       {
         "name": "projects/PROJECT_ID/locations/global/operations/estimate-data-size-01234567890123456789",
         "metadata": {
           "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.EstimateDataSizeMetadata"
         }
       }

   This output includes the `name` field, which is the name of the long-running
   operation. Save the `name` value to use in the following step.

2. Poll the [`operations.get`](/generative-ai-app-builder/docs/reference/rest/v1/projects.operations/get) method.

       curl -X GET \
       -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
       "https://discoveryengine.googleapis.com/v1/OPERATION_NAME"

   Replace OPERATION_NAME with the `name` value that you saved in the
   previous step. You can also get the operation name by [listing long-running
   operations](/generative-ai-app-builder/docs/long-running-operations#list-lros).

3. Evaluate each response.

   - If a response does not contain `"done": true`, then the process for
     estimating the data size is not complete. Continue polling.

     The output is similar to the following:

         {
           "name": "projects/PROJECT_ID/locations/global/operations/estimate-data-size-01234567890123456789",
           "metadata": {
             "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.EstimateDataSizeMetadata"
           }
         }

   - If a response contains `"done": true`, then the process for estimating the
     data size is complete. Save the DATA_SIZE_BYTES value from the
     response to use in the following step.

     The output is similar to the following:

         {
           "name": "projects/PROJECT_ID/locations/global/operations/estimate-data-size-01234567890123456789",
           "metadata": {
             "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.EstimateDataSizeMetadata",
             "createTime": "2023-12-08T19:54:06.911248Z"
           },
           "done": true,
           "response": {
             "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.EstimateDataSizeResponse",
             "dataSizeBytes": DATA_SIZE_BYTES,
             "documentCount": DOCUMENT_COUNT
           }
         }

     This output includes the following values:

     - DATA_SIZE_BYTES: the estimated size of your web data, in bytes.

     - DOCUMENT_COUNT: the estimated number of web pages in your web data.

4. Divide the DATA_SIZE_BYTES value from the previous step by 1,000,000,000
   to get gigabytes. Save this value for the following step.

5. To get an estimate for your monthly data storage costs:

   1. Go to [Google Cloud's pricing calculator](https://cloud.google.com/products/calculator).

   2. Click **Add to estimate**.

   3. Search for `AI Applications` and then click the
      **AI Applications** box.

   4. In the **Data Index** box, enter the estimated size of your web data, in
      gigabytes, from the previous step.

   See the **Estimated cost** box for your estimated data storage cost.

Last updated 2025-08-29 UTC.
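The steps above can be sketched in Python. This is a minimal, hedged illustration rather than an official client: the helper names (`build_request_body`, `poll_operation`, `estimated_gigabytes`) are hypothetical, the request body assumes that the repeated `estimator_uri_patterns` field is sent as a JSON array, and the HTTP call itself is left to a caller-supplied `fetch` function so that no particular HTTP library or authentication flow is implied.

```python
import time
from typing import Callable, Optional


def build_request_body(include: list[str],
                       exclude: Optional[list[str]] = None,
                       exact_match: bool = False) -> dict:
    """Build a request body for estimateDataSize (hypothetical helper).

    Assumption: the repeated estimator_uri_patterns field is a JSON array.
    """
    patterns = [
        {"provided_uri_pattern": p, "exact_match": exact_match}
        for p in include
    ]
    # Excluded patterns carry "exclusive": true; the default is false.
    patterns += [
        {"provided_uri_pattern": p, "exact_match": exact_match, "exclusive": True}
        for p in (exclude or [])
    ]
    return {"website_data_source": {"estimator_uri_patterns": patterns}}


def poll_operation(fetch: Callable[[], dict],
                   interval_s: float = 60.0,
                   max_polls: int = 120,
                   sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch() until the operation contains "done": true.

    fetch is assumed to perform the operations.get request and return
    the parsed JSON response as a dict.
    """
    for _ in range(max_polls):
        operation = fetch()
        if operation.get("done") is True:
            return operation
        sleep(interval_s)
    raise TimeoutError("operation did not finish within the polling budget")


def estimated_gigabytes(operation: dict) -> float:
    """Convert dataSizeBytes from a finished operation to gigabytes."""
    if "error" in operation:
        raise RuntimeError(f"estimate failed: {operation['error']}")
    # int() accepts the value whether it arrives as a number or a string.
    size_bytes = int(operation["response"]["dataSizeBytes"])
    return size_bytes / 1_000_000_000
```

In practice, `fetch` would wrap an authenticated GET request to the operation URL shown in step 2, and the value returned by `estimated_gigabytes` is what you would enter in the **Data Index** box of the pricing calculator.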