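Create voice audio files

Text-to-Speech converts words and sentences into base64-encoded audio data of natural human speech. You can then decode the base64 data to turn the audio data into a playable audio file such as an MP3. The Text-to-Speech API accepts input as raw text or as Speech Synthesis Markup Language (SSML).

This document describes how to create an audio file from text or SSML input by using Text-to-Speech. If you are unfamiliar with concepts such as speech synthesis or SSML, you can also review the Text-to-Speech basics article.

To use these samples, you must have installed and initialized the Google Cloud CLI. For information about setting up the gcloud CLI, see Authenticate to TTS.

Convert text to synthetic voice audio

The following code samples demonstrate how to convert a string into audio data.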

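You can configure the output of speech synthesis in several ways. For example, you can select a specific voice, or adjust the pitch, volume, speaking rate, and sample rate of the output.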
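As a rough illustration of these settings, the following sketch builds the voice and audioConfig portions of a REST request body as a plain Python dictionary. The field names (speakingRate, pitch, volumeGainDb, sampleRateHertz) follow the v1 REST API. The helper name build_output_config and the specific values are assumptions chosen for illustration, not recommendations, and not every voice necessarily supports every setting.

```python
import json


def build_output_config(voice_name, language_code, speaking_rate=1.0,
                        pitch=0.0, volume_gain_db=0.0, sample_rate_hertz=None):
    """Return the voice and audioConfig sections of a synthesis request.

    Hypothetical helper for illustration; it does not call the API.
    """
    audio_config = {
        "audioEncoding": "MP3",
        "speakingRate": speaking_rate,   # 1.0 is the normal speed
        "pitch": pitch,                  # semitones relative to the default
        "volumeGainDb": volume_gain_db,  # gain in dB relative to the default
    }
    # Only send a sample rate when the caller asks for a specific one.
    if sample_rate_hertz is not None:
        audio_config["sampleRateHertz"] = sample_rate_hertz
    return {
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": audio_config,
    }


config = build_output_config("en-GB-Standard-A", "en-GB",
                             speaking_rate=1.15, pitch=-2.0,
                             sample_rate_hertz=24000)
print(json.dumps(config, indent=2))
```

The resulting dictionary can be merged with an input section to form a complete request body.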

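Protocol

For complete details, see the text:synthesize API endpoint.

To synthesize audio from text, send an HTTP POST request to the text:synthesize endpoint. In the body of the POST request, specify the type of voice to synthesize in the voice configuration section, the text to synthesize in the text field of the input section, and the type of audio to create in the audioConfig section.

The following code snippet sends a synthesis request to the text:synthesize endpoint and saves the result to a file named synthesize-text.txt. Replace PROJECT_ID with your project ID.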

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "x-goog-user-project: PROJECT_ID" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data "{
    'input':{
      'text':'Android is a mobile operating system developed by Google,
         based on the Linux kernel and designed primarily for
         touchscreen mobile devices such as smartphones and tablets.'
    },
    'voice':{
      'languageCode':'en-gb',
      'name':'en-GB-Standard-A',
      'ssmlGender':'FEMALE'
    },
    'audioConfig':{
      'audioEncoding':'MP3'
    }
  }" "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-text.txt
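The same request can also be assembled programmatically, which avoids hand-quoting JSON inside a shell string. The following is a minimal sketch that uses only the Python standard library. It builds the request but does not send it. The function name build_synthesize_request and the placeholder values ACCESS_TOKEN and PROJECT_ID are assumptions for illustration.

```python
import json
import urllib.request

ENDPOINT = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_synthesize_request(text, access_token, project_id):
    """Build (but do not send) a POST request for the text:synthesize endpoint."""
    body = {
        "input": {"text": text},
        "voice": {
            "languageCode": "en-gb",
            "name": "en-GB-Standard-A",
            "ssmlGender": "FEMALE",
        },
        "audioConfig": {"audioEncoding": "MP3"},
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),  # strict JSON, UTF-8 encoded
        headers={
            "Authorization": f"Bearer {access_token}",
            "x-goog-user-project": project_id,
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


# Placeholder credentials: substitute a real token and project ID before sending.
request = build_synthesize_request("Hello there.", "ACCESS_TOKEN", "PROJECT_ID")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a real access token, passing the request to urllib.request.urlopen sends it, and the response body is the JSON document described next.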

The Text-to-Speech API returns the synthesized audio as base64-encoded data contained in the JSON output. The JSON output in the synthesize-text.txt file looks similar to the following code snippet.

{
  "audioContent": "//NExAASCCIIAAhEAGAAEMW4kAYPnwwIKw/BBTpwTvB+IAxIfghUfW.."
}

To decode the result from the Text-to-Speech API into an MP3 audio file, run the following command from the same directory as the synthesize-text.txt file.

cat synthesize-text.txt | grep 'audioContent' | \
sed 's|audioContent| |' | tr -d '\n ":{},' > tmp.txt && \
base64 tmp.txt --decode > synthesize-text-audio.mp3 && \
rm tmp.txt
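The decoding step can also be done without text-processing tools by parsing the JSON and decoding the audioContent field directly. The following is a minimal sketch under that assumption. The function names decode_audio_content and save_audio are hypothetical, and the demonstration uses a tiny stand-in payload rather than real audio.

```python
import base64
import json


def decode_audio_content(response_json):
    """Extract and decode the base64 audioContent field from a JSON response."""
    payload = json.loads(response_json)
    return base64.b64decode(payload["audioContent"])


def save_audio(response_path, audio_path):
    """Read a saved API response and write the decoded audio bytes to a file."""
    with open(response_path, "r", encoding="utf-8") as src:
        audio = decode_audio_content(src.read())
    with open(audio_path, "wb") as dst:
        dst.write(audio)
    return len(audio)  # number of bytes written


# Self-contained demonstration with a stand-in payload (not real audio).
sample = json.dumps(
    {"audioContent": base64.b64encode(b"\xff\xf3demo").decode("ascii")}
)
decoded = decode_audio_content(sample)
print(len(decoded), "bytes decoded")
```

For the file saved earlier, the call would be save_audio("synthesize-text.txt", "synthesize-text-audio.mp3").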

Go

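To learn how to install and use the client library for Text-to-Speech, see Text-to-Speech client libraries. For more information, see the Text-to-Speech Go API reference documentation.

To authenticate to Text-to-Speech, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.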


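import (
	"context"
	"fmt"
	"io"
	"os"

	texttospeech "cloud.google.com/go/texttospeech/apiv1"
	"cloud.google.com/go/texttospeech/apiv1/texttospeechpb"
)

// SynthesizeText synthesizes plain text and saves the output to outputFile.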
func SynthesizeText(w io.Writer, text, outputFile string) error {
	ctx := context.Background()

	client, err := texttospeech.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	req := texttospeechpb.SynthesizeSpeechRequest{
		Input: &texttospeechpb.SynthesisInput{
			InputSource: &texttospeechpb.SynthesisInput_Text{Text: text},
		},
		// Note: the voice can also be specified by name.
		// Names of voices can be retrieved with client.ListVoices().
		Voice: &texttospeechpb.VoiceSelectionParams{
			LanguageCode: "en-US",
			SsmlGender:   texttospeechpb.SsmlVoiceGender_FEMALE,
		},
		AudioConfig: &texttospeechpb.AudioConfig{
			AudioEncoding: texttospeechpb.AudioEncoding_MP3,
		},
	}

	resp, err := client.SynthesizeSpeech(ctx, &req)
	if err != nil {
		return err
	}

	err = os.WriteFile(outputFile, resp.AudioContent, 0644)
	if err != nil {
		return err
	}
	fmt.Fprintf(w, "Audio content written to file: %v\n", outputFile)
	return nil
}

Java

To learn how to install and use the client library for Text-to-Speech, see Text-to-Speech client libraries. For more information, see the Text-to-Speech Java API reference documentation.

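import com.google.cloud.texttospeech.v1.AudioConfig;
import com.google.cloud.texttospeech.v1.AudioEncoding;
import com.google.cloud.texttospeech.v1.SsmlVoiceGender;
import com.google.cloud.texttospeech.v1.SynthesisInput;
import com.google.cloud.texttospeech.v1.SynthesizeSpeechResponse;
import com.google.cloud.texttospeech.v1.TextToSpeechClient;
import com.google.cloud.texttospeech.v1.VoiceSelectionParams;
import com.google.protobuf.ByteString;
import java.io.FileOutputStream;
import java.io.OutputStream;

/**
 * Demonstrates using the Text-to-Speech client to synthesize plain text.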
 *
 * @param text the raw text to be synthesized. (e.g., "Hello there!")
 * @throws Exception on TextToSpeechClient Errors.
 */
public static ByteString synthesizeText(String text) throws Exception {
  // Instantiates a client
  try (TextToSpeechClient textToSpeechClient = TextToSpeechClient.create()) {
    // Set the text input to be synthesized
    SynthesisInput input = SynthesisInput.newBuilder().setText(text).build();

    // Build the voice request
    VoiceSelectionParams voice =
        VoiceSelectionParams.newBuilder()
            .setLanguageCode("en-US") // BCP-47 language tag, for example "en-US"
            .setSsmlGender(SsmlVoiceGender.FEMALE) // ssmlVoiceGender = SsmlVoiceGender.FEMALE
            .build();

    // Select the type of audio file you want returned
    AudioConfig audioConfig =
        AudioConfig.newBuilder()
            .setAudioEncoding(AudioEncoding.MP3) // MP3 audio.
            .build();

    // Perform the text-to-speech request
    SynthesizeSpeechResponse response =
        textToSpeechClient.synthesizeSpeech(input, voice, audioConfig);

    // Get the audio contents from the response
    ByteString audioContents = response.getAudioContent();

    // Write the response to the output file.
    try (OutputStream out = new FileOutputStream("output.mp3")) {
      out.write(audioContents.toByteArray());
      System.out.println("Audio content written to file \"output.mp3\"");
      return audioContents;
    }
  }
}

Node.js

To learn how to install and use the client library for Text-to-Speech, see Text-to-Speech client libraries. For more information, see the Text-to-Speech Node.js API reference documentation.

const textToSpeech = require('@google-cloud/text-to-speech');
const fs = require('fs');
const util = require('util');

const client = new textToSpeech.TextToSpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const text = 'Text to synthesize, e.g. hello';
// const outputFile = 'Local path to save audio file to, e.g. output.mp3';

// In a CommonJS module, await is only valid inside an async function.
async function synthesizeText() {
  const request = {
    input: {text: text},
    voice: {languageCode: 'en-US', ssmlGender: 'FEMALE'},
    audioConfig: {audioEncoding: 'MP3'},
  };
  const [response] = await client.synthesizeSpeech(request);
  const writeFile = util.promisify(fs.writeFile);
  await writeFile(outputFile, response.audioContent, 'binary');
  console.log(`Audio content written to file: ${outputFile}`);
}

synthesizeText().catch(console.error);

Python

To learn how to install and use the client library for Text-to-Speech, see Text-to-Speech client libraries. For more information, see the Text-to-Speech Python API reference documentation.

def synthesize_text():
    """Synthesizes speech from the input string of text."""
    from google.cloud import texttospeech

    text = "Hello there."
    client = texttospeech.TextToSpeechClient()

    input_text = texttospeech.SynthesisInput(text=text)

    # The voice is selected by name here.
    # Names of voices can be retrieved with client.list_voices().
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Chirp3-HD-Charon",
    )

    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    response = client.synthesize_speech(
        input=input_text,
        voice=voice,
        audio_config=audio_config,
    )

    # The response's audio_content is binary.
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')

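Additional languages

C#: Follow the C# setup instructions on the client libraries page, and then see the Text-to-Speech reference documentation for .NET.

PHP: Follow the PHP setup instructions on the client libraries page, and then see the Text-to-Speech reference documentation for PHP.

Ruby: Follow the Ruby setup instructions on the client libraries page, and then see the Text-to-Speech reference documentation for Ruby.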

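Convert SSML to synthetic voice audio

Using SSML in a speech synthesis request can produce audio that more closely resembles natural human speech. Specifically, SSML gives you finer control over how the audio output represents pauses in speech and how the voice pronounces dates, times, acronyms, and abbreviations.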
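To make those controls concrete, the following sketch composes a small SSML document containing a pause, a date, and a spelled-out acronym. The elements used (break, say-as) are standard SSML; the helper name build_ssml and the sample values are assumptions for illustration, and the attributes each voice honors should be checked against the SSML reference.

```python
from xml.etree import ElementTree
from xml.sax.saxutils import escape


def build_ssml(intro, iso_date, acronym):
    """Compose an SSML document with a pause, a date, and a spelled-out acronym.

    Hypothetical helper for illustration only.
    """
    return (
        "<speak>"
        f"{escape(intro)}"  # escape &, <, > so user text cannot break the markup
        '<break time="500ms"/>'
        'The date is <say-as interpret-as="date" format="yyyymmdd" detail="1">'
        f"{escape(iso_date)}</say-as>. "
        f'<say-as interpret-as="characters">{escape(acronym)}</say-as> '
        "is spelled out."
        "</speak>"
    )


ssml = build_ssml("Tom & Jerry start soon.", "1960-09-10", "SSML")
# Parsing confirms the markup is well-formed XML; it does not validate SSML semantics.
root = ElementTree.fromstring(ssml)
print(root.tag)
```

Escaping the free-text portions matters because a stray ampersand or angle bracket would otherwise produce malformed SSML.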

For more details about the SSML elements that the Text-to-Speech API supports, see the SSML reference.

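Protocol

For complete details, see the text:synthesize API endpoint.

To synthesize audio from SSML, send an HTTP POST request to the text:synthesize endpoint. In the body of the POST request, specify the type of voice to synthesize in the voice configuration section, the SSML to synthesize in the ssml field of the input section, and the type of audio to create in the audioConfig section.

The following code snippet sends a synthesis request to the text:synthesize endpoint and saves the result to a file named synthesize-ssml.txt. Replace PROJECT_ID with your project ID.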

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "x-goog-user-project: PROJECT_ID" \
  -H "Content-Type: application/json; charset=utf-8" --data "{
    'input':{
     'ssml':'<speak>The <say-as interpret-as=\"characters\">SSML</say-as> standard
          is defined by the <sub alias=\"World Wide Web Consortium\">W3C</sub>.</speak>'
    },
    'voice':{
      'languageCode':'en-us',
      'name':'en-US-Standard-B',
      'ssmlGender':'MALE'
    },
    'audioConfig':{
      'audioEncoding':'MP3'
    }
  }" "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-ssml.txt
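Because SSML attributes contain double quotes, embedding them in a shell string requires manual escaping, as in the command above. One possible alternative, sketched here under the assumption that the request body is generated by a script, is to let a JSON serializer handle the escaping. The function name build_ssml_request_body is hypothetical.

```python
import json


def build_ssml_request_body(ssml):
    """Return the JSON request body for synthesizing SSML input."""
    body = {
        "input": {"ssml": ssml},
        "voice": {
            "languageCode": "en-us",
            "name": "en-US-Standard-B",
            "ssmlGender": "MALE",
        },
        "audioConfig": {"audioEncoding": "MP3"},
    }
    # json.dumps escapes the double quotes inside the SSML attributes.
    return json.dumps(body)


ssml = (
    '<speak>The <say-as interpret-as="characters">SSML</say-as> standard '
    'is defined by the <sub alias="World Wide Web Consortium">W3C</sub>.</speak>'
)
body_json = build_ssml_request_body(ssml)
print(body_json)
```

The printed body can be saved to a file and passed to curl with --data @FILENAME in place of the inline string.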

The Text-to-Speech API returns the synthesized audio as base64-encoded data contained in the JSON output. The JSON output in the synthesize-ssml.txt file looks similar to the following code snippet.

{
  "audioContent": "//NExAASCCIIAAhEAGAAEMW4kAYPnwwIKw/BBTpwTvB+IAxIfghUfW.."
}

To decode the result from the Text-to-Speech API into an MP3 audio file, run the following command from the same directory as the synthesize-ssml.txt file.

cat synthesize-ssml.txt | grep 'audioContent' | \
sed 's|audioContent| |' | tr -d '\n ":{},' > tmp.txt && \
base64 tmp.txt --decode > synthesize-ssml-audio.mp3 && \
rm tmp.txt

Go


// SynthesizeSSML synthesizes ssml and saves the output to outputFile.
//
// ssml must be well-formed according to:
//
//	https://www.w3.org/TR/speech-synthesis/
//
// Example: <speak>Hello there.</speak>
func SynthesizeSSML(w io.Writer, ssml, outputFile string) error {
	ctx := context.Background()

	client, err := texttospeech.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	req := texttospeechpb.SynthesizeSpeechRequest{
		Input: &texttospeechpb.SynthesisInput{
			InputSource: &texttospeechpb.SynthesisInput_Ssml{Ssml: ssml},
		},
		// Note: the voice can also be specified by name.
		// Names of voices can be retrieved with client.ListVoices().
		Voice: &texttospeechpb.VoiceSelectionParams{
			LanguageCode: "en-US",
			SsmlGender:   texttospeechpb.SsmlVoiceGender_FEMALE,
		},
		AudioConfig: &texttospeechpb.AudioConfig{
			AudioEncoding: texttospeechpb.AudioEncoding_MP3,
		},
	}

	resp, err := client.SynthesizeSpeech(ctx, &req)
	if err != nil {
		return err
	}

	err = os.WriteFile(outputFile, resp.AudioContent, 0644)
	if err != nil {
		return err
	}
	fmt.Fprintf(w, "Audio content written to file: %v\n", outputFile)
	return nil
}