Collect URLScan IO logs
This document explains how to ingest URLScan IO logs to Google Security Operations using Amazon S3.
Before you begin
Make sure you have the following prerequisites:
- A Google SecOps instance
- Privileged access to the URLScan IO tenant
- Privileged access to AWS (S3, IAM, Lambda, EventBridge)
Get URLScan IO prerequisites
- Sign in to URLScan IO.
- Click your profile icon.
- Select API Key from the menu.
- If you don't have an API key yet:
- Click the Create API Key button.
- Enter a description for the API key (for example, Google SecOps Integration).
- Select the permissions for the key (for read-only access, select Read permissions).
- Click Generate API Key.
- Copy the following details and save them in a secure location:
- API_KEY: The generated API key string (format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
- API Base URL: https://urlscan.io/api/v1 (this is constant for all users)
- Note your API quota limits:
- Free accounts: Limited to 1000 API calls per day, 60 per minute
- Pro accounts: Higher limits based on subscription tier
- If you need to restrict searches to your organization's scans only, note down:
- User identifier: Your username or email (for use with the user: search filter)
- Team identifier: If you use the teams feature (for use with the team: search filter)
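The user: and team: filters noted above can be combined with a time window into one search string. The following is a minimal sketch, assuming the search API accepts Elasticsearch-style query strings joined with AND. The function build_search_query, its parameters, and the value exampleuser are hypothetical names used only for illustration; the resulting string is the kind of value the SEARCH_QUERY environment variable of the collector expects.

```python
from typing import Optional


def build_search_query(window: str = "date:>now-1h",
                       user: Optional[str] = None,
                       team: Optional[str] = None) -> str:
    """Combine a time window with optional user:/team: filters.

    Assumption: clauses are joined with AND, as in Elasticsearch
    query-string syntax. Adjust if your tenant expects something else.
    """
    clauses = [window]
    if user:
        clauses.append(f"user:{user}")
    if team:
        clauses.append(f"team:{team}")
    return " AND ".join(clauses)


if __name__ == "__main__":
    # "exampleuser" is a placeholder, not a real account.
    print(build_search_query(user="exampleuser"))
```

Leaving both filters unset returns the time window unchanged, which matches the collector's default query.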
Configure AWS S3 bucket and IAM for Google SecOps
- Create an Amazon S3 bucket following this user guide: Creating a bucket.
- Save the bucket Name and Region for future reference (for example, urlscan-logs-bucket).
- Create a user following this user guide: Creating an IAM user.
- Select the created User.
- Select Security credentials tab.
- Click Create Access Key in section Access Keys.
- Select Third-party service as Use case.
- Click Next.
- Optional: Add a description tag.
- Click Create access key.
- Click Download CSV file to save the Access Key and Secret Access Key for future reference.
- Click Done.
- Select Permissions tab.
- Click Add permissions in section Permissions policies.
- Select Add permissions.
- Select Attach policies directly.
- Search for AmazonS3FullAccess policy.
- Select the policy.
- Click Next.
- Click Add permissions.
Configure the IAM policy and role for S3 uploads
- In the AWS console, go to IAM > Policies.
- Click Create policy > JSON tab.
Enter the following policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowPutObjects",
          "Effect": "Allow",
          "Action": "s3:PutObject",
          "Resource": "arn:aws:s3:::urlscan-logs-bucket/*"
        },
        {
          "Sid": "AllowGetStateObject",
          "Effect": "Allow",
          "Action": "s3:GetObject",
          "Resource": "arn:aws:s3:::urlscan-logs-bucket/urlscan/state.json"
        }
      ]
    }

- Replace urlscan-logs-bucket if you entered a different bucket name.
Click Next > Create policy.
Go to IAM > Roles > Create role > AWS service > Lambda.
Attach the newly created policy.
Name the role urlscan-lambda-role and click Create role.
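If your bucket name or state key differs from the example, the upload policy above can be generated rather than edited by hand. This is a minimal sketch under that assumption; build_uploader_policy is a hypothetical helper, not part of any AWS SDK, and it only reproduces the two statements shown in the policy above.

```python
import json


def build_uploader_policy(bucket: str,
                          state_key: str = "urlscan/state.json") -> dict:
    """Return the Lambda upload policy for the given bucket.

    Mirrors the JSON policy in this section: PutObject on every key,
    GetObject on the state file only.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowPutObjects",
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Sid": "AllowGetStateObject",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{state_key}",
            },
        ],
    }


if __name__ == "__main__":
    # Paste the printed JSON into the policy editor.
    print(json.dumps(build_uploader_policy("urlscan-logs-bucket"), indent=2))
```

The state_key argument should match the STATE_KEY environment variable of the Lambda function, otherwise the function cannot read its saved state.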
Create the Lambda function
- In the AWS Console, go to Lambda > Functions > Create function.
- Click Author from scratch.
Provide the following configuration details:
- Name: urlscan-collector
- Runtime: Python 3.13
- Architecture: x86_64
- Execution role: urlscan-lambda-role
After the function is created, open the Code tab, delete the stub, and enter the following code (urlscan-collector.py):

    import json
    import os
    from datetime import datetime

    import boto3
    import urllib3

    s3 = boto3.client('s3')
    http = urllib3.PoolManager()


    def lambda_handler(event, context):
        # Environment variables
        bucket = os.environ['S3_BUCKET']
        prefix = os.environ['S3_PREFIX']
        state_key = os.environ['STATE_KEY']
        api_key = os.environ['API_KEY']
        api_base = os.environ['API_BASE']
        search_query = os.environ.get('SEARCH_QUERY', 'date:>now-1h')
        page_size = int(os.environ.get('PAGE_SIZE', '100'))
        max_pages = int(os.environ.get('MAX_PAGES', '10'))

        # Load state
        state = load_state(bucket, state_key)
        last_run = state.get('last_run')

        # Widen the search window to cover the time since the last run
        if last_run:
            search_time = datetime.fromisoformat(last_run)
            time_diff = datetime.utcnow() - search_time
            hours = int(time_diff.total_seconds() / 3600) + 1
            search_query = f'date:>now-{hours}h'

        # Search for scans
        headers = {'API-Key': api_key}
        all_results = []

        for page in range(max_pages):
            search_url = f"{api_base}/search/"
            params = {
                'q': search_query,
                'size': page_size,
                'offset': page * page_size
            }

            # Make search request
            response = http.request(
                'GET',
                search_url,
                fields=params,
                headers=headers
            )

            if response.status != 200:
                print(f"Search failed: {response.status}")
                break

            search_data = json.loads(response.data.decode('utf-8'))
            results = search_data.get('results', [])

            if not results:
                break

            # Fetch full result for each scan
            for result in results:
                uuid = result.get('task', {}).get('uuid')
                if uuid:
                    result_url = f"{api_base}/result/{uuid}/"
                    result_response = http.request(
                        'GET',
                        result_url,
                        headers=headers
                    )

                    if result_response.status == 200:
                        full_result = json.loads(result_response.data.decode('utf-8'))
                        all_results.append(full_result)
                    else:
                        print(f"Failed to fetch result for {uuid}: {result_response.status}")

            # Stop when the last page was not full
            if len(results) < page_size:
                break

        # Write results to S3
        if all_results:
            now = datetime.utcnow()
            file_key = (
                f"{prefix}year={now.year}/month={now.month:02d}/day={now.day:02d}/"
                f"hour={now.hour:02d}/urlscan_{now.strftime('%Y%m%d_%H%M%S')}.json"
            )

            # Create NDJSON content (one JSON object per line)
            ndjson_content = '\n'.join(
                [json.dumps(r, separators=(',', ':')) for r in all_results]
            )

            # Upload to S3
            s3.put_object(
                Bucket=bucket,
                Key=file_key,
                Body=ndjson_content.encode('utf-8'),
                ContentType='application/x-ndjson'
            )
            print(f"Uploaded {len(all_results)} results to s3://{bucket}/{file_key}")

        # Update state
        state['last_run'] = datetime.utcnow().isoformat()
        save_state(bucket, state_key, state)

        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': f'Processed {len(all_results)} scan results',
                'location': f"s3://{bucket}/{prefix}"
            })
        }


    def load_state(bucket, key):
        try:
            response = s3.get_object(Bucket=bucket, Key=key)
            return json.loads(response['Body'].read())
        except s3.exceptions.NoSuchKey:
            return {}
        except Exception as e:
            print(f"Error loading state: {e}")
            return {}


    def save_state(bucket, key, state):
        try:
            s3.put_object(
                Bucket=bucket,
                Key=key,
                Body=json.dumps(state),
                ContentType='application/json'
            )
        except Exception as e:
            print(f"Error saving state: {e}")

Go to Configuration > Environment variables.
Click Edit > Add new environment variable.
Enter the following environment variables, replacing with your values:
- S3_BUCKET: urlscan-logs-bucket
- S3_PREFIX: urlscan/
- STATE_KEY: urlscan/state.json
- API_KEY: <your-api-key>
- API_BASE: https://urlscan.io/api/v1
- SEARCH_QUERY: date:>now-1h
- PAGE_SIZE: 100
- MAX_PAGES: 10
After the function is created, stay on its page (or open Lambda > Functions > your-function).
Select the Configuration tab.
In the General configuration panel click Edit.
Change Timeout to 5 minutes (300 seconds) and click Save.
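Two pieces of the collector are easy to check locally before deploying: how the lookback window is derived from the saved last_run timestamp, and how the S3 object key is partitioned by date. The sketch below extracts that logic into two standalone functions. The names lookback_query and object_key are hypothetical and do not appear in the Lambda code; the arithmetic and key layout are copied from it.

```python
from datetime import datetime


def lookback_query(last_run_iso: str, now: datetime) -> str:
    """Build the date filter the collector uses after its first run.

    Elapsed hours are truncated and then increased by one, so the window
    always reaches back at least as far as the previous run.
    """
    elapsed = now - datetime.fromisoformat(last_run_iso)
    hours = int(elapsed.total_seconds() / 3600) + 1
    return f"date:>now-{hours}h"


def object_key(prefix: str, now: datetime) -> str:
    """Build the date-partitioned S3 key used for each NDJSON file."""
    return (
        f"{prefix}year={now.year}/month={now.month:02d}/day={now.day:02d}/"
        f"hour={now.hour:02d}/urlscan_{now.strftime('%Y%m%d_%H%M%S')}.json"
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 30, 0)
    print(lookback_query("2024-01-01T10:00:00", now))  # date:>now-3h
    print(object_key("urlscan/", now))
    # urlscan/year=2024/month=01/day=01/hour=12/urlscan_20240101_123000.json
```

Because the window is rounded up by an hour, consecutive runs overlap and the same scan can be written to more than one file. Whether that matters depends on how duplicates are handled downstream.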
Create an EventBridge schedule
- Go to Amazon EventBridge > Scheduler > Create schedule.
- Provide the following configuration details:
- Recurring schedule: Rate (1 hour).
- Target: your Lambda function urlscan-collector.
- Name: urlscan-collector-1h.
- Click Create schedule.
Optional: Create read-only IAM user & keys for Google SecOps
- Go to AWS Console > IAM > Users.
- Click Add users.
- Provide the following configuration details:
- User: Enter secops-reader.
- Access type: Select Access key – Programmatic access.
- Click Create user.
- Attach minimal read policy (custom): Users > secops-reader > Permissions > Add permissions > Attach policies directly > Create policy.
In the JSON editor, enter the following policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject"],
          "Resource": "arn:aws:s3:::urlscan-logs-bucket/*"
        },
        {
          "Effect": "Allow",
          "Action": ["s3:ListBucket"],
          "Resource": "arn:aws:s3:::urlscan-logs-bucket"
        }
      ]
    }

Set the name to secops-reader-policy.
Go to Create policy > search/select > Next > Add permissions.
Go to Security credentials > Access keys > Create access key.
Download the CSV (these values are entered into the feed).
Configure a feed in Google SecOps to ingest URLScan IO logs
- Go to SIEM Settings > Feeds.
- Click Add New Feed.
- In the Feed name field, enter a name for the feed (for example, URLScan IO logs).
- Select Amazon S3 V2 as the Source type.
- Select URLScan IO as the Log type.
- Click Next.
- Specify values for the following input parameters:
- S3 URI: s3://urlscan-logs-bucket/urlscan/
- Source deletion options: Select the deletion option according to your preference.
- Maximum File Age: Include files modified in the last number of days. Default is 180 days.
- Access Key ID: User access key with access to the S3 bucket.
- Secret Access Key: User secret key with access to the S3 bucket.
- Asset namespace: The asset namespace.
- Ingestion labels: The label applied to the events from this feed.
- Click Next.
- Review your new feed configuration in the Finalize screen, and then click Submit.
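The S3 URI entered in the feed has to point at the same bucket and prefix the Lambda function writes to. As a quick sanity check, the sketch below splits a URI into those two parts so they can be compared with the S3_BUCKET and S3_PREFIX environment variables. The helper split_s3_uri is hypothetical and is not part of Google SecOps or the AWS SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split s3://bucket/prefix/ into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # Drop the leading slash so the prefix matches S3 key naming.
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    bucket, prefix = split_s3_uri("s3://urlscan-logs-bucket/urlscan/")
    print(bucket)  # urlscan-logs-bucket
    print(prefix)  # urlscan/
```

With the example values from this document, the bucket is urlscan-logs-bucket and the prefix is urlscan/, which are the example values of S3_BUCKET and S3_PREFIX.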