Prepare supervised fine-tuning data for Translation LLM models

This document describes how to define a supervised fine-tuning dataset for a Translation LLM model. Tuning is supported for text data only.

About supervised fine-tuning datasets

A supervised fine-tuning dataset is used to fine-tune a pre-trained model to a specific domain. The input data should be similar to what you expect the model to encounter in real-world use. The output labels should represent the correct answers or outcomes for each input.

Training dataset

To tune a model, you provide a training dataset. For best results, we recommend that you start with 100 examples. You can scale up to thousands of examples if needed. The quality of the dataset is far more important than the quantity.

Limitations:

  • Maximum input and output tokens per example: 1,000
  • Maximum training dataset file size: 1 GB (JSONL)

Validation dataset

We strongly recommend that you provide a validation dataset. A validation dataset helps you measure the effectiveness of a tuning job.

Limitations:

  • Maximum input and output tokens per example: 1,000
  • Maximum number of examples in the validation dataset: 1,024
  • Maximum validation dataset file size: 1 GB (JSONL)

Dataset format

Your model tuning dataset must be in the JSON Lines (JSONL) format, where each line contains a single tuning example. Before tuning your model, you must upload your dataset to a Cloud Storage bucket. Make sure to upload your dataset to a bucket in the us-central1 region.

{
  "contents": [
    {
      "role": string,
      "parts": [
        {
          "text": string
        }
      ]
    }
  ]
}
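Each line in the file is one example that matches this schema. As a minimal illustrative sketch (the text values are placeholders), an example can be built and serialized to a single JSONL line with Python's json module:

```python
import json

# One tuning example: a user prompt and the expected model response.
# The text values here are illustrative placeholders.
example = {
    "contents": [
        {"role": "user", "parts": [{"text": "English: Good morning. Spanish:"}]},
        {"role": "model", "parts": [{"text": "Buenos días."}]},
    ]
}

# json.dumps without indentation keeps the example on a single line,
# as required by the JSON Lines format.
line = json.dumps(example, ensure_ascii=False)
print(line)
```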

Parameters

The example contains data with the following parameters:

contents

Required: Content

The content of the current conversation with the model.

For single-turn queries, this is a single instance.

Dataset example for translation-llm-002

{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "English: Hello. Spanish:"
        }
      ]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "Hola."
        }
      ]
    }
  ]
}
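A full dataset file can be generated from parallel sentence pairs following the same pattern. This is a sketch, not a prescribed workflow: the sentence pairs, output file name, and "English: ... Spanish:" prompt format are assumptions modeled on the example above.

```python
import json

# Illustrative English–Spanish pairs; replace with your own parallel data.
pairs = [
    ("Hello.", "Hola."),
    ("Thank you.", "Gracias."),
    ("Good night.", "Buenas noches."),
]

def to_example(source: str, target: str) -> dict:
    """Build one tuning example in the prompt format shown above."""
    return {
        "contents": [
            {"role": "user", "parts": [{"text": f"English: {source} Spanish:"}]},
            {"role": "model", "parts": [{"text": target}]},
        ]
    }

# Write one JSON object per line (JSONL).
with open("train.jsonl", "w", encoding="utf-8") as f:
    for src, tgt in pairs:
        f.write(json.dumps(to_example(src, tgt), ensure_ascii=False) + "\n")
```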

Contents

The base structured data type containing multi-part content of a message.

This class consists of two main properties: role and parts. The role property denotes the individual producing the content, while the parts property contains multiple elements, each representing a segment of data within a message.

Parameters

role

Optional: string

The identity of the entity that creates the message. The following values are supported:

  • user: This indicates that the message is sent by a real person, typically a user-generated message.
  • model: This indicates that the message is generated by the model.

parts

Required: Part

A list of ordered parts that make up a single message.

For limits on the inputs, such as the maximum number of tokens or the number of images, see the model specifications on the Google models page.

To compute the number of tokens in your request, see Get token count.
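Authoritative token counts come from the token counting API. As a rough local screen before submitting a tuning job — an approximation by whitespace splitting, which is an assumption and not the model's actual tokenizer — you can estimate whether an example stays under the 1,000-token limit:

```python
import json

def approx_tokens(example: dict) -> int:
    """Very rough token estimate: whitespace-split word count across all
    parts. The real count comes from the model's tokenizer; use the token
    counting API for authoritative numbers."""
    count = 0
    for content in example["contents"]:
        for part in content["parts"]:
            count += len(part["text"].split())
    return count

example = {
    "contents": [
        {"role": "user", "parts": [{"text": "English: Hello. Spanish:"}]},
        {"role": "model", "parts": [{"text": "Hola."}]},
    ]
}
print(approx_tokens(example))  # → 4
```

Whitespace counts usually undershoot subword tokenizers, so leave generous headroom below the limit.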

Parts

A data type containing media that is part of a multi-part Content message.

Parameters

text

Optional: string

A text prompt or code snippet.

Upload tuning datasets to Cloud Storage

To run a tuning job, you need to upload one or more datasets to a Cloud Storage bucket. You can either create a new Cloud Storage bucket or use an existing one to store dataset files. Make sure that the bucket is in the us-central1 region, and we recommend that you use a bucket that's in the same Google Cloud project where you plan to tune your model.

After your bucket is ready, upload your dataset file to the bucket.
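Before uploading, it can help to sanity-check the file locally against the limits described earlier. The following sketch assumes a file named train.jsonl; it verifies the 1 GB size limit, that every line is well-formed JSON, and that each role is one of the supported values:

```python
import json
import os

MAX_FILE_BYTES = 1 << 30  # 1 GB JSONL file-size limit

def validate_dataset(path: str) -> int:
    """Check the file size and that every line is a well-formed example
    with supported roles; return the number of examples."""
    assert os.path.getsize(path) <= MAX_FILE_BYTES, "file exceeds 1 GB"
    n = 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            example = json.loads(line)  # raises on malformed JSON
            for content in example["contents"]:
                assert content["role"] in ("user", "model"), f"line {i}: bad role"
            n += 1
    return n

# Demo: write a one-line dataset and validate it.
demo = {
    "contents": [
        {"role": "user", "parts": [{"text": "English: Hi. Spanish:"}]},
        {"role": "model", "parts": [{"text": "Hola."}]},
    ]
}
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(demo) + "\n")
print(validate_dataset("train.jsonl"))  # → 1
```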

What's next