Tune an open model

This page describes how to perform supervised fine-tuning on open models such as Llama 3.1.

Supported tuning methods

  • Full fine-tuning

  • Low-Rank Adaptation (LoRA): LoRA is a parameter-efficient tuning method that adjusts only a subset of the model's parameters. It is more cost-efficient and requires less training data than full fine-tuning. On the other hand, full fine-tuning has higher quality potential because it adjusts all parameters.

Supported models

  • meta/llama3_1@llama-3.1-8b
  • meta/llama3_1@llama-3.1-8b-instruct
  • meta/llama3-2@llama-3.2-1b-instruct: supports only full fine-tuning
  • meta/llama3-2@llama-3.2-3b-instruct: supports only full fine-tuning
  • meta/llama3-3@llama-3.3-70b-instruct

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Vertex AI and Cloud Storage APIs.

    Enable the APIs

  5. Install and initialize the Vertex AI SDK for Python.
  6. Import the following libraries:
    import os
    import time
    import uuid

    import vertexai
    from google.cloud import aiplatform
    from google.cloud.aiplatform.preview.vertexai.tuning import sft, SourceModel

    # Replace these placeholders with your project ID and a supported region.
    PROJECT_ID = "your-project-id"
    REGION = "us-central1"

    vertexai.init(project=PROJECT_ID, location=REGION)

Prepare dataset for tuning

A training dataset is required for tuning. We also recommend preparing an optional validation dataset so that you can evaluate your tuned model's performance.

Your dataset must be in one of the following supported JSON Lines (JSONL) formats, where each line contains a single tuning example.

Prompt completion

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

Turn-based chat format

{"messages": [
  {"content": "You are a chatbot that helps with scientific literature and generates state-of-the-art abstracts from articles.",
    "role": "system"},
  {"content": "Summarize the paper in one paragraph.",
    "role": "user"},
  {"content": " Here is a one paragraph summary of the paper:\n\nThe paper describes PaLM, ...",
    "role": "assistant"}
]}

Upload your JSONL files to Cloud Storage.
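
For example, the following is a minimal sketch that writes prompt-completion examples to a JSONL file and uploads it with the Cloud Storage client library. The bucket name, object path, and example content are placeholders; PROJECT_ID is the variable set earlier during SDK initialization:

import json

from google.cloud import storage

# Placeholder training examples in the prompt-completion format.
examples = [
    {"prompt": "Summarize the following article. Article: ... Summary:", "completion": "..."},
]

# Write one JSON object per line.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Upload the file to a Cloud Storage bucket that you own.
bucket = storage.Client(project=PROJECT_ID).bucket("your-tuning-bucket")
bucket.blob("datasets/train.jsonl").upload_from_filename("train.jsonl")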

Create tuning job

You can tune from:

  • A supported base model, such as Llama 3.1
  • A model that has the same architecture as one of the supported base models. This could be either a custom model checkpoint from a repository such as Hugging Face or a previously tuned model from a Vertex AI tuning job. This lets you continue tuning a model that has already been tuned.

Cloud Console

  1. Initiate fine-tuning from the Google Cloud console.

  2. Fill out the parameters and click Start tuning.

This starts a tuning job, which you can see on the Tuning page under the Managed tuning tab.

When the tuning job finishes, you can view information about the tuned model on the Details tab.

Vertex AI SDK for Python

Replace the parameter values with your own and then run the following code to create a tuning job:

sft_tuning_job = sft.preview_train(
    source_model=SourceModel(
        base_model="meta/llama3_1@llama-3.1-8b",
        # Optional: a Cloud Storage folder containing either a custom model
        # checkpoint or a previously tuned model.
        custom_base_model="gs://{STORAGE-URI}",
    ),
    tuning_mode="FULL",  # "FULL" for full fine-tuning, "PEFT_ADAPTER" for LoRA
    epochs=3,
    train_dataset="gs://{STORAGE-URI}",       # JSONL training file
    validation_dataset="gs://{STORAGE-URI}",  # optional JSONL validation file
    output_uri="gs://{STORAGE-URI}",
)
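
Tuning can take a while. Below is a minimal polling sketch; it assumes the job object returned by preview_train exposes the refresh() and has_ended members used by other Vertex AI SDK tuning jobs:

# Poll the tuning job until it reaches a terminal state.
while not sft_tuning_job.has_ended:
    time.sleep(60)
    sft_tuning_job.refresh()

print(sft_tuning_job.state)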

When the job finishes, the model artifacts for the tuned model are stored in the <output_uri>/postprocess/node-0/checkpoints/final folder.

Deploy tuned model

You can deploy the tuned model to a Vertex AI endpoint. You can also export the tuned model from Cloud Storage and deploy it elsewhere.
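
If you want to export the artifacts, the following is a minimal sketch that downloads the final checkpoint folder with the Cloud Storage client library; the bucket name, prefix, and local directory are placeholders that depend on the output_uri you used:

import os

from google.cloud import storage

client = storage.Client(project=PROJECT_ID)
prefix = "postprocess/node-0/checkpoints/final/"

# Download every object under the final checkpoint folder.
for blob in client.list_blobs("your-output-bucket", prefix=prefix):
    if blob.name.endswith("/"):
        continue
    destination = os.path.join("tuned-model", os.path.relpath(blob.name, prefix))
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    blob.download_to_filename(destination)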

To deploy the tuned model to a Vertex AI endpoint:

Cloud Console

  1. Go to the Model Garden page and click Deploy model with custom weights.

    Go to Model Garden

  2. Fill out the parameters and click Deploy.

Vertex AI SDK for Python

Deploy the model to a G2 machine using a prebuilt container:

from vertexai.preview import model_garden

MODEL_ARTIFACTS_STORAGE_URI = "gs://{STORAGE-URI}/postprocess/node-0/checkpoints/final"

model = model_garden.CustomModel(
    gcs_uri=MODEL_ARTIFACTS_STORAGE_URI,
)

# Deploy the model to an endpoint that uses GPUs. The deployment incurs costs.
endpoint = model.deploy(
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

Get an inference

Once deployment succeeds, you can send requests to the endpoint with text prompts. Note that the first few prompts will take longer to execute.

# Load the deployed endpoint (or reuse the endpoint object returned by model.deploy())
endpoint = aiplatform.Endpoint("projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_name}")

prompt = "Summarize the following article. Article: Preparing a perfect risotto requires patience and attention to detail. Begin by heating butter in a large, heavy-bottomed pot over medium heat. Add finely chopped onions and minced garlic to the pot, and cook until they're soft and translucent, about 5 minutes. Next, add Arborio rice to the pot and cook, stirring constantly, until the grains are coated with the butter and begin to toast slightly. Pour in a splash of white wine and cook until it's absorbed. From there, gradually add hot chicken or vegetable broth to the rice, stirring frequently, until the risotto is creamy and the rice is tender with a slight bite.. Summary:"

# Define the input to the prediction call, using the prompt defined above
instances = [
    {
        "prompt": prompt,
        "max_tokens": 200,
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": 1,
        "raw_response": True,
    },
]

# Request the prediction
response = endpoint.predict(
    instances=instances
)

for prediction in response.predictions:
    print(prediction)

For more details on getting inferences from a deployed model, see Get an online inference.

Notice that managed open models use the chat.completions method instead of the predict method used by deployed models. For more information on getting inferences from managed models, see Make a call to a Llama model.
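
For example, a managed Llama model is called through the OpenAI-compatible chat completions interface rather than through endpoint.predict. The following is only a sketch; the base URL pattern and the model ID shown here are assumptions that you should confirm against that page:

import openai
from google.auth import default
from google.auth.transport.requests import Request

# Use application default credentials as the bearer token.
credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

client = openai.OpenAI(
    base_url=f"https://{REGION}-aiplatform.googleapis.com/v1beta1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi",
    api_key=credentials.token,
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct-maas",  # assumed model ID; check the Llama model docs
    messages=[{"role": "user", "content": "Summarize the article in one paragraph."}],
)
print(response.choices[0].message.content)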

Limits and quotas

Quota is enforced on the number of concurrent tuning jobs. Every project comes with a default quota to run at least one tuning job. This is a global quota, shared across all available regions and supported models. If you want to run more jobs concurrently, you need to request additional quota for Global concurrent managed OSS model fine-tuning jobs per project.

Pricing

You are billed for tuning based on pricing for Model tuning.

You are also billed for related services, such as Cloud Storage and Vertex AI Prediction.

Learn about Vertex AI pricing and Cloud Storage pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.

What's next