데이터 소스 및 모니터링

서로 다른 학습 데이터를 지정하고 모니터링을 추가합니다.

모델을 직접 쿼리하고 Cloud 콘솔에서 다른 매개변수 값을 사용하거나 Vertex AI API를 직접 호출하여 반환된 결과를 테스트할 수 있습니다.

Vertex AI API 사용 예시는 빠른 시작: Vertex AI API 사용을 참조하세요.
Cloud 콘솔에서 이 샘플을 보려면 Google Cloud 콘솔로 이동합니다.

시스템 안내

요청된 코드 개선사항을 제공하여 머신러닝 시스템을 개선하는 데 중점을 두고 있습니다. 항상 변경한 사항과 직접적으로 관련된 최적화나 제안 사항을 한두 가지 간략하게 언급합니다. 응답 끝에 자연어 글머리 기호로 표시하므로 개발자는 이를 무시하거나 추가 정보를 요청할 수 있습니다.

Freeform 프롬프트

TensorFlow와 Keras를 사용하여 고객 의견을 처리하는 감정 분석 프로젝트를 진행하고 있습니다. customer_reviews 대신 Hugging Face의 Yelp Polarity 데이터 세트에서 데이터를 무작위로 샘플링하려고 합니다. 테스트 또는 검증 데이터가 아닌 학습 데이터만 샘플링합니다. 토큰화 전에 샘플링을 실행합니다. 리소스 사용량 모니터링도 통합하려고 합니다. 이를 위한 함수를 추가하고 각 에포크 끝의 콜백에서 사용합니다. CPU 사용량과 메모리 사용량을 모니터링하고 로깅해야 합니다.

무작위 Yelp 검토 500개 샘플을 사용하여 한 번 실행하고 무작위 Yelp 리뷰 1,000개 샘플을 사용하여 한 번 실행합니다.

코드는 다음과 같습니다.

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Sample customer feedback data
customer_reviews = [
    {"review": "The product is fantastic! Much better than expected.", "gender": "female", "state": "CA"},
    {"review": "Terrible service, very disappointing.", "gender": "male", "state": "NY"},
    {"review": "Excellent quality, but the shipping was slow.", "gender": "female", "state": "TX"},
    {"review": "Horrible experience, will not buy again.", "gender": "male", "state": "CA"},
    {"review": "User-friendly and intuitive design.", "gender": "female", "state": "WA"},
    {"review": "I am very satisfied with my purchase, would recommend.", "gender": "female", "state": "TX"},
    {"review": "Not worth the price, feels cheaply made.", "gender": "male", "state": "NY"},
    {"review": "The app constantly crashes, very annoying.", "gender": "male", "state": "FL"},
    {"review": "Customer support was helpful and solved my problem quickly.", "gender": "female", "state": "CA"},
    {"review": "Good quality, but a bit too expensive.", "gender": "male", "state": "TX"}
]

# Extracting reviews and labels for training
reviews = [entry["review"] for entry in customer_reviews]
labels = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 1]) # Binary labels for sentiment (1 = positive, 0 = negative)

# Tokenization and sequence padding
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(reviews)
review_sequences = tokenizer.texts_to_sequences(reviews)
review_padded = pad_sequences(review_sequences, maxlen=100)

# Model creation
model = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(review_padded, labels, epochs=5, verbose=0)

응답

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from datasets import load_dataset
import psutil
import os

# Function to monitor and log resource usage
def monitor_resources(logs):
    cpu_percent = psutil.cpu_percent(interval=1)
    mem = psutil.virtual_memory()
    mem_percent = mem.percent
    logs.update({'cpu_usage': cpu_percent, 'memory_usage': mem_percent})


# Load the Yelp Polarity dataset
dataset = load_dataset('yelp_polarity')

# Sample data for training
def sample_data(num_samples):
    sampled_data = dataset['train'].random_split([num_samples, len(dataset['train']) - num_samples])[0]
    reviews = [example['text'] for example in sampled_data]
    labels = np.array([example['label'] for example in sampled_data])
    return reviews, labels

# Run with 500 samples
reviews_500, labels_500 = sample_data(500)

# Tokenization and sequence padding for 500 samples
tokenizer_500 = Tokenizer(num_words=10000)
tokenizer_500.fit_on_texts(reviews_500)
review_sequences_500 = tokenizer_500.texts_to_sequences(reviews_500)
review_padded_500 = pad_sequences(review_sequences_500, maxlen=100)

# Model creation for 500 samples
model_500 = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

# Compile the model for 500 samples
model_500.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

#Resource monitoring callback
resource_callback = tf.keras.callbacks.LambdaCallback(on_epoch_end=monitor_resources)

# Train the model for 500 samples
model_500.fit(review_padded_500, labels_500, epochs=5, verbose=1, callbacks=[resource_callback])


# Run with 1000 samples
reviews_1000, labels_1000 = sample_data(1000)

# Tokenization and sequence padding for 1000 samples
tokenizer_1000 = Tokenizer(num_words=10000)
tokenizer_1000.fit_on_texts(reviews_1000)
review_sequences_1000 = tokenizer_1000.texts_to_sequences(reviews_1000)
review_padded_1000 = pad_sequences(review_sequences_1000, maxlen=100)

# Model creation for 1000 samples
model_1000 = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

# Compile the model for 1000 samples
model_1000.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model for 1000 samples
model_1000.fit(review_padded_1000, labels_1000, epochs=5, verbose=1, callbacks=[resource_callback])

토큰화 전 데이터 샘플링: 이제 코드는 요청대로 토큰화를 수행하기 전에 Hugging Face 데이터 세트에서 데이터를 샘플링합니다. 이렇게 하면 토큰화 프로세스가 샘플링된 데이터에만 적용되므로 효율성이 향상됩니다.
리소스 모니터링: monitor_resources 함수와 콜백으로의 통합은 학습 중에 CPU 및 메모리 사용량을 실시간으로 모니터링합니다. 이를 통해 리소스를 더욱 효과적으로 관리하고 잠재적인 병목 현상을 파악할 수 있습니다. 영구 기록을 위해 파일에 로깅을 추가하는 것이 좋습니다.

모델:	gemini-1.5-flash-002
강도:	0.2
최대 출력 토큰:	8192
최상위 K:	40.0
최상위 P:	0.95

데이터 소스 및 모니터링 컬렉션을 사용해 정리하기 내 환경설정을 기준으로 콘텐츠를 저장하고 분류하세요.

Freeform 프롬프트

응답

데이터 소스 및 모니터링