음성 오디오 파일 만들기

Text-to-Speech를 사용하여 단어 및 문장을 자연스러운 인간 음성의 base64 인코딩 오디오 데이터로 변환할 수 있습니다. 그런 다음 base64 데이터를 디코딩하여 오디오 데이터를 MP3와 같은 재생 가능한 오디오 파일로 변환할 수 있습니다. Text-to-Speech API는 원시 텍스트 또는 SSML(Speech Synthesis Markup Language) 형식의 입력을 사용합니다.

이 문서에서는 Text-to-Speech를 사용하여 텍스트 또는 SSML 입력으로 오디오 파일을 만드는 방법을 설명합니다 또한 음성 합성이나 SSML과 같은 개념에 익숙하지 않은 경우 Text-to-Speech 기본 사항 문서를 살펴볼 수도 있습니다.

이 샘플을 사용하려면 Google Cloud CLI를 설치하고 초기화해야 합니다. gcloud CLI 설정에 대한 자세한 내용은 TTS에 인증을 참조하세요.

텍스트를 합성 음성 오디오로 변환

다음 코드 샘플은 문자열을 오디오 데이터로 변환하는 방법을 보여줍니다.

고유한 음성을 선택하거나 출력의 높낮이, 볼륨, 말하기 속도, 샘플링 레이트를 조정하는 등 다양한 방식으로 음성 합성 출력을 구성할 수 있습니다.

프로토콜

자세한 내용은 text:synthesize API 엔드포인트를 참조하세요.

텍스트에서 오디오를 합성하려면 text:synthesize 엔드포인트에 HTTP POST 요청을 수행합니다. POST 요청 본문에서 voice 구성 섹션에 합성할 음성 유형을 지정하고, input 섹션의 text 필드에 합성할 텍스트를 지정하고, audioConfig 섹션에 생성할 오디오 유형을 지정합니다.

다음 코드 스니펫은 합성 요청을 text:synthesize 엔드포인트에 보내고 synthesize-text.txt라는 파일에 결과를 저장합니다. PROJECT_ID를 프로젝트 ID로 바꿉니다.

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "x-goog-user-project: <var>PROJECT_ID</var>" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data "{
    'input':{
      'text':'Android is a mobile operating system developed by Google,
         based on the Linux kernel and designed primarily for
         touchscreen mobile devices such as smartphones and tablets.'
    },
    'voice':{
      'languageCode':'en-gb',
      'name':'en-GB-Standard-A',
      'ssmlGender':'FEMALE'
    },
    'audioConfig':{
      'audioEncoding':'MP3'
    }
  }" "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-text.txt

Text-to-Speech API는 합성된 오디오를 JSON 출력에 포함된 base64 인코딩 데이터로 반환합니다. synthesize-text.txt 파일의 JSON 출력은 다음 코드 스니펫과 유사합니다.

{
  "audioContent": "//NExAASCCIIAAhEAGAAEMW4kAYPnwwIKw/BBTpwTvB+IAxIfghUfW.."
}

Text-to-Speech API의 결과를 MP3 오디오 파일로 디코딩하려면 synthesize-text.txt 파일과 동일한 디렉터리에서 다음 명령어를 실행합니다.

cat synthesize-text.txt | grep 'audioContent' | \
sed 's|audioContent| |' | tr -d '\n ":{},' > tmp.txt && \
base64 tmp.txt --decode > synthesize-text-audio.mp3 && \
rm tmp.txt

Go

Text-to-Speech용 클라이언트 라이브러리를 설치하고 사용하는 방법은 Text-to-Speech 클라이언트 라이브러리를 참조하세요. 자세한 내용은 Text-to-Speech Go API 참고 문서를 확인하세요.

Text-to-Speech에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


// SynthesizeText synthesizes plain text and saves the output to outputFile.
func SynthesizeText(w io.Writer, text, outputFile string) error {
	ctx := context.Background()

	client, err := texttospeech.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	req := texttospeechpb.SynthesizeSpeechRequest{
		Input: &texttospeechpb.SynthesisInput{
			InputSource: &texttospeechpb.SynthesisInput_Text{Text: text},
		},
		// Note: the voice can also be specified by name.
		// Names of voices can be retrieved with client.ListVoices().
		Voice: &texttospeechpb.VoiceSelectionParams{
			LanguageCode: "en-US",
			SsmlGender:   texttospeechpb.SsmlVoiceGender_FEMALE,
		},
		AudioConfig: &texttospeechpb.AudioConfig{
			AudioEncoding: texttospeechpb.AudioEncoding_MP3,
		},
	}

	resp, err := client.SynthesizeSpeech(ctx, &req)
	if err != nil {
		return err
	}

	err = os.WriteFile(outputFile, resp.AudioContent, 0644)
	if err != nil {
		return err
	}
	fmt.Fprintf(w, "Audio content written to file: %v\n", outputFile)
	return nil
}

Java

Text-to-Speech용 클라이언트 라이브러리를 설치하고 사용하는 방법은 Text-to-Speech 클라이언트 라이브러리를 참조하세요. 자세한 내용은 Text-to-Speech Java API 참고 문서를 확인하세요.

Text-to-Speech에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

/**
 * Demonstrates using the Text to Speech client to synthesize text or ssml.
 *
 * @param text the raw text to be synthesized. (e.g., "Hello there!")
 * @throws Exception on TextToSpeechClient Errors.
 */
public static ByteString synthesizeText(String text) throws Exception {
  // Instantiates a client
  try (TextToSpeechClient textToSpeechClient = TextToSpeechClient.create()) {
    // Set the text input to be synthesized
    SynthesisInput input = SynthesisInput.newBuilder().setText(text).build();

    // Build the voice request
    VoiceSelectionParams voice =
        VoiceSelectionParams.newBuilder()
            .setLanguageCode("en-US") // languageCode = "en_us"
            .setSsmlGender(SsmlVoiceGender.FEMALE) // ssmlVoiceGender = SsmlVoiceGender.FEMALE
            .build();

    // Select the type of audio file you want returned
    AudioConfig audioConfig =
        AudioConfig.newBuilder()
            .setAudioEncoding(AudioEncoding.MP3) // MP3 audio.
            .build();

    // Perform the text-to-speech request
    SynthesizeSpeechResponse response =
        textToSpeechClient.synthesizeSpeech(input, voice, audioConfig);

    // Get the audio contents from the response
    ByteString audioContents = response.getAudioContent();

    // Write the response to the output file.
    try (OutputStream out = new FileOutputStream("output.mp3")) {
      out.write(audioContents.toByteArray());
      System.out.println("Audio content written to file \"output.mp3\"");
      return audioContents;
    }
  }
}

Node.js

Text-to-Speech용 클라이언트 라이브러리를 설치하고 사용하는 방법은 Text-to-Speech 클라이언트 라이브러리를 참조하세요. 자세한 내용은 Text-to-Speech Node.js API 참고 문서를 확인하세요.

Text-to-Speech에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

const textToSpeech = require('@google-cloud/text-to-speech');
const fs = require('fs');
const util = require('util');

const client = new textToSpeech.TextToSpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const text = 'Text to synthesize, eg. hello';
// const outputFile = 'Local path to save audio file to, e.g. output.mp3';

const request = {
  input: {text: text},
  voice: {languageCode: 'en-US', ssmlGender: 'FEMALE'},
  audioConfig: {audioEncoding: 'MP3'},
};
const [response] = await client.synthesizeSpeech(request);
const writeFile = util.promisify(fs.writeFile);
await writeFile(outputFile, response.audioContent, 'binary');
console.log(`Audio content written to file: ${outputFile}`);

Python

Text-to-Speech용 클라이언트 라이브러리를 설치하고 사용하는 방법은 Text-to-Speech 클라이언트 라이브러리를 참조하세요. 자세한 내용은 Text-to-Speech Python API 참고 문서를 확인하세요.

Text-to-Speech에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

def synthesize_text():
    """Synthesizes speech from the input string of text."""
    from google.cloud import texttospeech

    text = "Hello there."
    client = texttospeech.TextToSpeechClient()

    input_text = texttospeech.SynthesisInput(text=text)

    # Note: the voice can also be specified by name.
    # Names of voices can be retrieved with client.list_voices().
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Chirp3-HD-Charon",
    )

    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    response = client.synthesize_speech(
        input=input_text,
        voice=voice,
        audio_config=audio_config,
    )

    # The response's audio_content is binary.
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')

추가 언어

C#: 클라이언트 라이브러리 페이지의 C# 설정 안내를 따른 후 .NET용 Text-to-Speech 참고 문서를 참조하세요.

PHP: 클라이언트 라이브러리 페이지의 PHP 설정 안내를 따른 후 PHP용 Text-to-Speech 참고 문서를 참조하세요.

Ruby: 클라이언트 라이브러리 페이지의 Ruby 설정 안내를 따른 후 Ruby용 Text-to-Speech 참고 문서를 참조하세요.

SSML을 합성 음성 오디오로 변환

오디오 합성 요청에 SSML을 사용하면 자연스러운 인간의 음성과 매우 비슷한 오디오를 생성할 수 있습니다. 특히 SSML은 음성에서 오디오 출력의 일시중지 표현 방식이나 오디오의 날짜, 시간, 두문자어, 약어 발음 방법에 대한 세부적인 제어를 제공합니다.

Text-to-Speech API에서 지원하는 SSML 요소에 대한 자세한 내용은 SSML 참조를 참조하세요.

프로토콜

자세한 내용은 text:synthesize API 엔드포인트를 참조하세요.

SSML에서 오디오를 합성하려면 text:synthesize 엔드포인트에 HTTP POST 요청을 수행합니다. POST 요청 본문에서 voice 구성 섹션에 합성할 음성 유형을 지정하고, input 섹션의 ssml 필드에 합성할 SSML을 지정하고, audioConfig 섹션에 생성할 오디오 유형을 지정합니다.

다음 코드 스니펫은 합성 요청을 text:synthesize 엔드포인트에 보내고 synthesize-ssml.txt라는 파일에 결과를 저장합니다. PROJECT_ID를 프로젝트 ID로 바꿉니다.

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "x-goog-user-project: <var>PROJECT_ID</var>" \
  -H "Content-Type: application/json; charset=utf-8" --data "{
    'input':{
     'ssml':'<speak>The <say-as interpret-as=\"characters\">SSML</say-as> standard
          is defined by the <sub alias=\"World Wide Web Consortium\">W3C</sub>.</speak>'
    },
    'voice':{
      'languageCode':'en-us',
      'name':'en-US-Standard-B',
      'ssmlGender':'MALE'
    },
    'audioConfig':{
      'audioEncoding':'MP3'
    }
  }" "https://texttospeech.googleapis.com/v1/text:synthesize" > synthesize-ssml.txt

Text-to-Speech API는 합성된 오디오를 JSON 출력에 포함된 base64 인코딩 데이터로 반환합니다. synthesize-ssml.txt 파일의 JSON 출력은 다음 코드 스니펫과 유사합니다.

{
  "audioContent": "//NExAASCCIIAAhEAGAAEMW4kAYPnwwIKw/BBTpwTvB+IAxIfghUfW.."
}

Text-to-Speech API의 결과를 MP3 오디오 파일로 디코딩하려면 synthesize-ssml.txt 파일과 동일한 디렉터리에서 다음 명령어를 실행합니다.

cat synthesize-ssml.txt | grep 'audioContent' | \
sed 's|audioContent| |' | tr -d '\n ":{},' > tmp.txt && \
base64 tmp.txt --decode > synthesize-ssml-audio.mp3 && \
rm tmp.txt

Go

Text-to-Speech에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


// SynthesizeSSML synthesizes ssml and saves the output to outputFile.
//
// ssml must be well-formed according to:
//
//	https://www.w3.org/TR/speech-synthesis/
//
// Example: <speak>Hello there.</speak>
func SynthesizeSSML(w io.Writer, ssml, outputFile string) error {
	ctx := context.Background()

	client, err := texttospeech.NewClient(ctx)
	if err != nil {
		return err
	}
	defer client.Close()

	req := texttospeechpb.SynthesizeSpeechRequest{
		Input: &texttospeechpb.SynthesisInput{
			InputSource: &texttospeechpb.SynthesisInput_Ssml{Ssml: ssml},
		},
		// Note: the voice can also be specified by name.
		// Names of voices can be retrieved with client.ListVoices().
		Voice: &texttospeechpb.VoiceSelectionParams{
			LanguageCode: "en-US",
			SsmlGender:   texttospeechpb.SsmlVoiceGender_FEMALE,
		},
		AudioConfig: &texttospeechpb.AudioConfig{
			AudioEncoding: texttospeechpb.AudioEncoding_MP3,
		},
	}

	resp, err := client.SynthesizeSpeech(ctx, &req)
	if err != nil {
		return err
	}

	err = os.WriteFile(outputFile, resp.AudioContent, 0644)
	if err != nil {
		return err
	}
	fmt.Fprintf(w, "Audio content written to file: %v\n", outputFile)
	return nil
}

Java

Text-to-Speech에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

/**
 * Demonstrates using the Text to Speech client to synthesize text or ssml.
 *
 * <p>Note: ssml must be well-formed according to: (https://www.w3.org/TR/speech-synthesis/
 * Example: <speak>Hello there.</speak>
 *
 * @param ssml the ssml document to be synthesized. (e.g., "<?xml...")
 * @throws Exception on TextToSpeechClient Errors.
 */
public static ByteString synthesizeSsml(String ssml) throws Exception {
  // Instantiates a client
  try (TextToSpeechClient textToSpeechClient = TextToSpeechClient.create()) {
    // Set the ssml input to be synthesized
    SynthesisInput input = SynthesisInput.newBuilder().setSsml(ssml).build();

    // Build the voice request
    VoiceSelectionParams voice =
        VoiceSelectionParams.newBuilder()
            .setLanguageCode("en-US") // languageCode = "en_us"
            .setSsmlGender(SsmlVoiceGender.FEMALE) // ssmlVoiceGender = SsmlVoiceGender.FEMALE
            .build();

    // Select the type of audio file you want returned
    AudioConfig audioConfig =
        AudioConfig.newBuilder()
            .setAudioEncoding(AudioEncoding.MP3) // MP3 audio.
            .build();

    // Perform the text-to-speech request
    SynthesizeSpeechResponse response =
        textToSpeechClient.synthesizeSpeech(input, voice, audioConfig);

    // Get the audio contents from the response
    ByteString audioContents = response.getAudioContent();

    // Write the response to the output file.
    try (OutputStream out = new FileOutputStream("output.mp3")) {
      out.write(audioContents.toByteArray());
      System.out.println("Audio content written to file \"output.mp3\"");
      return audioContents;
    }
  }
}

Node.js

Text-to-Speech에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

const textToSpeech = require('@google-cloud/text-to-speech');
const fs = require('fs');
const util = require('util');

const client = new textToSpeech.TextToSpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const ssml = '<speak>Hello there.</speak>';
// const outputFile = 'Local path to save audio file to, e.g. output.mp3';

const request = {
  input: {ssml: ssml},
  voice: {languageCode: 'en-US', ssmlGender: 'FEMALE'},
  audioConfig: {audioEncoding: 'MP3'},
};

const [response] = await client.synthesizeSpeech(request);
const writeFile = util.promisify(fs.writeFile);
await writeFile(outputFile, response.audioContent, 'binary');
console.log(`Audio content written to file: ${outputFile}`);

Python

Text-to-Speech에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

def synthesize_ssml():
    """Synthesizes speech from the input string of ssml.

    Note: ssml must be well-formed according to:
        https://www.w3.org/TR/speech-synthesis/

    """
    from google.cloud import texttospeech

    ssml = "<speak>Hello there.</speak>"
    client = texttospeech.TextToSpeechClient()

    input_text = texttospeech.SynthesisInput(ssml=ssml)

    # Note: the voice can also be specified by name.
    # Names of voices can be retrieved with client.list_voices().
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Standard-C",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
    )

    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    response = client.synthesize_speech(
        input=input_text, voice=voice, audio_config=audio_config
    )

    # The response's audio_content is binary.
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)
        print('Audio content written to file "output.mp3"')

추가 언어

C#: 클라이언트 라이브러리 페이지의 C# 설정 안내를 따른 후 .NET용 Text-to-Speech 참고 문서를 참조하세요.

PHP: 클라이언트 라이브러리 페이지의 PHP 설정 안내를 따른 후 PHP용 Text-to-Speech 참고 문서를 참조하세요.

Ruby: 클라이언트 라이브러리 페이지의 Ruby 설정 안내를 따른 후 Ruby용 Text-to-Speech 참고 문서를 참조하세요.

직접 사용해 보기

Google Cloud를 처음 사용하는 경우 계정을 만들어 실제 시나리오에서 Text-to-Speech의 성능을 평가합니다. 신규 고객에게는 워크로드를 실행, 테스트, 배포하는 데 사용할 수 있는 $300의 무료 크레딧이 제공됩니다.

무료로 Text-to-Speech 사용해 보기