This page describes how to perform supervised fine-tuning on open models such as Llama 3.1.
Supported tuning methods
Low-Rank Adaptation (LoRA): LoRA is a parameter-efficient tuning method that adjusts only a subset of the model's parameters. It's more cost-efficient and requires less training data than full fine-tuning. Full fine-tuning, on the other hand, has higher quality potential because it adjusts all parameters.
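In the SDK, this choice maps onto the tuning_mode argument of the sft.preview_train() call shown later on this page. The sketch below is illustrative only; the gs:// paths are placeholders.
# Illustrative sketch only; paths are placeholders. See the full example later on this page.
sft.preview_train(
    source_model=SourceModel(base_model="meta/llama3_1@llama-3.1-8b"),
    tuning_mode="PEFT_ADAPTER",  # LoRA; use "FULL" for full fine-tuning
    train_dataset="gs://{STORAGE-URI}",
    output_uri="gs://{STORAGE-URI}",
)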
Supported models
- meta/llama3_1@llama-3.1-8b
- meta/llama3_1@llama-3.1-8b-instruct
- meta/llama3-2@llama-3.2-1b-instruct: supports only full fine-tuning
- meta/llama3-2@llama-3.2-3b-instruct: supports only full fine-tuning
- meta/llama3-3@llama-3.3-70b-instruct
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Vertex AI and Cloud Storage APIs.
- Install and initialize the Vertex AI SDK for Python.
- Import the following libraries:
import os
import time
import uuid

import vertexai
from google.cloud import aiplatform
from google.cloud.aiplatform.preview.vertexai.tuning import sft, SourceModel

vertexai.init(project=PROJECT_ID, location=REGION)
Prepare dataset for tuning
A training dataset is required for tuning. We also recommend preparing an optional validation dataset if you want to evaluate your tuned model's performance.
Your dataset must be in one of the following supported JSON Lines (JSONL) formats, where each line contains a single tuning example.
Prompt-completion format
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
Turn-based chat format
{"messages": [
{"content": "You are a chatbot that helps with scientific literature and generates state-of-the-art abstracts from articles.",
"role": "system"},
{"content": "Summarize the paper in one paragraph.",
"role": "user"},
{"content": " Here is a one paragraph summary of the paper:\n\nThe paper describes PaLM, ...",
"role": "assistant"}
]}
Upload your JSONL files to Cloud Storage.
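As a minimal sketch, you can write your examples to a local JSONL file and upload it with the google-cloud-storage client. The bucket and object names below are placeholders, not values required by the tuning service.
import json
from google.cloud import storage

# One example per line, in the prompt-completion format shown above.
examples = [
    {"prompt": "<prompt text>", "completion": "<ideal generated text>"},
]
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Upload the file to Cloud Storage (placeholder bucket and object names).
bucket = storage.Client(project=PROJECT_ID).bucket("your-bucket-name")
bucket.blob("tuning/train.jsonl").upload_from_filename("train.jsonl")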
Create tuning job
You can tune from:
- A supported base model, such as Llama 3.1
- A model that has the same architecture as one of the supported base models. This can be either a custom model checkpoint from a repository such as Hugging Face or a previously tuned model from a Vertex AI tuning job, which lets you continue tuning a model that has already been tuned; see the sketch after this list.
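A minimal sketch of the second case: point custom_base_model at the final checkpoint folder written by a previous tuning job (the output layout is described later on this page). The gs:// path is a placeholder.
# Sketch: continue tuning a previously tuned model by passing its final
# checkpoint folder as custom_base_model (placeholder path).
source_model = SourceModel(
    base_model="meta/llama3_1@llama-3.1-8b",
    custom_base_model="gs://{STORAGE-URI}/postprocess/node-0/checkpoints/final",
)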
Cloud Console
You can initiate fine-tuning in the following ways:
Go to the model card, click Fine tune, and choose Managed tuning.
or
Go to the Tuning page and click Create tuned model.
Fill out the parameters and click Start tuning.
This starts a tuning job, which you can see in the Tuning page under the Managed tuning tab.
Once the tuning job has finished, you can view information about the tuned model in the Details tab.
Vertex AI SDK for Python
Replace the parameter values with your own and then run the following code to create a tuning job:
sft_tuning_job = sft.preview_train(
source_model=SourceModel(
base_model="meta/llama3_1@llama-3.1-8b",
    # Optional, folder that contains either a custom model checkpoint or a previously tuned model
custom_base_model="gs://{STORAGE-URI}",
),
tuning_mode="FULL", # FULL or PEFT_ADAPTER
epochs=3,
train_dataset="gs://{STORAGE-URI}", # JSONL file
validation_dataset="gs://{STORAGE-URI}", # JSONL file
output_uri="gs://{STORAGE-URI}",
)
When the job finishes, the model artifacts for the tuned model are stored in the <output_uri>/postprocess/node-0/checkpoints/final folder.
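If you want to block until the job completes before reading the artifacts, a polling loop along these lines should work. This assumes the returned job object exposes refresh() and has_ended, as other Vertex AI supervised tuning job objects do.
# Assumption: the job object supports refresh() and has_ended like other
# Vertex AI supervised tuning jobs. Poll until the job finishes.
while not sft_tuning_job.has_ended:
    time.sleep(60)
    sft_tuning_job.refresh()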
Deploy tuned model
You can deploy the tuned model to a Vertex AI endpoint. You can also export the tuned model from Cloud Storage and deploy it elsewhere.
To deploy the tuned model to a Vertex AI endpoint:
Cloud Console
Go to the Model Garden page and click Deploy model with custom weights.
Fill out the parameters and click Deploy.
Vertex AI SDK for Python
Deploy the model to a G2 machine using a prebuilt container:
from vertexai.preview import model_garden
MODEL_ARTIFACTS_STORAGE_URI = "gs://{STORAGE-URI}/postprocess/node-0/checkpoints/final"
model = model_garden.CustomModel(
gcs_uri=MODEL_ARTIFACTS_STORAGE_URI,
)
# Deploy the model to an endpoint using GPUs. Costs are incurred for the deployment.
endpoint = model.deploy(
machine_type="g2-standard-12",
accelerator_type="NVIDIA_L4",
accelerator_count=1,
)
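The deploy() call returns the endpoint object. If you need to reconnect to it later (for example, in the inference step below), keep its resource name; this assumes the returned object behaves like a standard aiplatform.Endpoint.
# Assumption: the returned endpoint is a standard Vertex AI Endpoint resource.
print(endpoint.resource_name)  # projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_name}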
Get an inference
Once deployment succeeds, you can send requests to the endpoint with text prompts. Note that the first few prompts will take longer to execute.
# Loads the deployed endpoint
endpoint = aiplatform.Endpoint("projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_name}")
prompt = "Summarize the following article. Article: Preparing a perfect risotto requires patience and attention to detail. Begin by heating butter in a large, heavy-bottomed pot over medium heat. Add finely chopped onions and minced garlic to the pot, and cook until they're soft and translucent, about 5 minutes. Next, add Arborio rice to the pot and cook, stirring constantly, until the grains are coated with the butter and begin to toast slightly. Pour in a splash of white wine and cook until it's absorbed. From there, gradually add hot chicken or vegetable broth to the rice, stirring frequently, until the risotto is creamy and the rice is tender with a slight bite.. Summary:"
# Define input to the prediction call
instances = [
{
"prompt": "What is a car?",
"max_tokens": 200,
"temperature": 1.0,
"top_p": 1.0,
"top_k": 1,
"raw_response": True,
},
]
# Request the prediction
response = endpoint.predict(
instances=instances
)
for prediction in response.predictions:
print(prediction)
For more details on getting inferences from a deployed model, see Get an online inference.
Notice that managed open models use the chat.completions method instead of the predict method used by deployed models. For more information on getting inferences from managed models, see Make a call to a Llama model.
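For orientation, a hedged sketch of such a call through the OpenAI-compatible surface is shown below. The base URL, API version, and model name are assumptions for illustration; use the values from Make a call to a Llama model.
# Sketch only: calling a managed Llama model through chat.completions.
# The base URL, API version, and model name below are assumptions; see the
# linked guide for the documented values.
import google.auth
import google.auth.transport.requests
import openai

credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(google.auth.transport.requests.Request())

client = openai.OpenAI(
    base_url=f"https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi",
    api_key=credentials.token,
)
response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct-maas",  # assumed model ID
    messages=[{"role": "user", "content": "Summarize the paper in one paragraph."}],
)
print(response.choices[0].message.content)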
Limits and quotas
Quota is enforced on the number of concurrent tuning jobs. Every project comes
with a default quota to run at least one tuning job. This is a global quota,
shared across all available regions and supported models. If you want to run
more jobs concurrently, you need to request additional
quota for Global
concurrent managed OSS model fine-tuning jobs per project
.
Pricing
You are billed for tuning based on pricing for Model tuning.
You are also billed for related services, such as Cloud Storage and Vertex AI Prediction.
Learn about Vertex AI pricing, Cloud Storage pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.