Data sources and monitoring

Specify different training data and add monitoring.

You can query the model directly and test the returned results, either by using different parameter values in the Cloud console or by calling the Vertex AI API directly.

System instructions

You will focus on delivering the requested code enhancements to improve the performance of machine learning systems. Always briefly mention one or two optimizations or suggestions directly related to the changes you made. You may list supplementary information in natural language at the end of your response, so that developers are free to ignore it or ask for more detail.

Freeform prompt

I'm working on a sentiment analysis project that uses TensorFlow and Keras to process customer feedback. Instead of customer_reviews, I want to randomly sample data from the Yelp Polarity dataset on Hugging Face. Sample only the training data, not the test or validation data. Perform the sampling before tokenization. I also want to integrate resource usage monitoring: add a function for this and use it in a callback at the end of each epoch. It should monitor and log CPU usage and memory usage.

Run once with a random sample of 500 Yelp reviews and once with a random sample of 1,000 Yelp reviews.

Here is my code:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Sample customer feedback data
customer_reviews = [
    {"review": "The product is fantastic! Much better than expected.", "gender": "female", "state": "CA"},
    {"review": "Terrible service, very disappointing.", "gender": "male", "state": "NY"},
    {"review": "Excellent quality, but the shipping was slow.", "gender": "female", "state": "TX"},
    {"review": "Horrible experience, will not buy again.", "gender": "male", "state": "CA"},
    {"review": "User-friendly and intuitive design.", "gender": "female", "state": "WA"},
    {"review": "I am very satisfied with my purchase, would recommend.", "gender": "female", "state": "TX"},
    {"review": "Not worth the price, feels cheaply made.", "gender": "male", "state": "NY"},
    {"review": "The app constantly crashes, very annoying.", "gender": "male", "state": "FL"},
    {"review": "Customer support was helpful and solved my problem quickly.", "gender": "female", "state": "CA"},
    {"review": "Good quality, but a bit too expensive.", "gender": "male", "state": "TX"}
]

# Extracting reviews and labels for training
reviews = [entry["review"] for entry in customer_reviews]
labels = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 1]) # Binary labels for sentiment (1 = positive, 0 = negative)

# Tokenization and sequence padding
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(reviews)
review_sequences = tokenizer.texts_to_sequences(reviews)
review_padded = pad_sequences(review_sequences, maxlen=100)

# Model creation
model = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(review_padded, labels, epochs=5, verbose=0)

Response

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from datasets import load_dataset
import psutil

# Function to monitor and log resource usage at the end of each epoch
# (LambdaCallback passes both the epoch index and the logs dict)
def monitor_resources(epoch, logs):
    cpu_percent = psutil.cpu_percent(interval=1)
    mem_percent = psutil.virtual_memory().percent
    print(f"Epoch {epoch + 1}: CPU usage {cpu_percent:.1f}%, memory usage {mem_percent:.1f}%")
    if logs is not None:
        logs.update({'cpu_usage': cpu_percent, 'memory_usage': mem_percent})


# Load the Yelp Polarity dataset
dataset = load_dataset('yelp_polarity')

# Sample training data before tokenization
def sample_data(num_samples):
    # Shuffle the training split and take the first num_samples examples
    sampled_data = dataset['train'].shuffle(seed=42).select(range(num_samples))
    reviews = [example['text'] for example in sampled_data]
    labels = np.array([example['label'] for example in sampled_data])
    return reviews, labels

# Run with 500 samples
reviews_500, labels_500 = sample_data(500)

# Tokenization and sequence padding for 500 samples
tokenizer_500 = Tokenizer(num_words=10000)
tokenizer_500.fit_on_texts(reviews_500)
review_sequences_500 = tokenizer_500.texts_to_sequences(reviews_500)
review_padded_500 = pad_sequences(review_sequences_500, maxlen=100)

# Model creation for 500 samples
model_500 = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

# Compile the model for 500 samples
model_500.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Resource monitoring callback
resource_callback = tf.keras.callbacks.LambdaCallback(on_epoch_end=monitor_resources)

# Train the model for 500 samples
model_500.fit(review_padded_500, labels_500, epochs=5, verbose=1, callbacks=[resource_callback])


# Run with 1000 samples
reviews_1000, labels_1000 = sample_data(1000)

# Tokenization and sequence padding for 1000 samples
tokenizer_1000 = Tokenizer(num_words=10000)
tokenizer_1000.fit_on_texts(reviews_1000)
review_sequences_1000 = tokenizer_1000.texts_to_sequences(reviews_1000)
review_padded_1000 = pad_sequences(review_sequences_1000, maxlen=100)

# Model creation for 1000 samples
model_1000 = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dense(1, activation='sigmoid')
])

# Compile the model for 1000 samples
model_1000.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model for 1000 samples
model_1000.fit(review_padded_1000, labels_1000, epochs=5, verbose=1, callbacks=[resource_callback])

  • Sampling before tokenization: The code now samples data from the Hugging Face dataset before tokenization is performed, as requested. This ensures the tokenization pipeline is applied only to the sampled data, improving efficiency.
  • Resource monitoring: The monitor_resources function, integrated as a callback, provides real-time monitoring of CPU and memory usage during training. This enables better resource management and helps surface potential bottlenecks. Consider adding logging to a file for a more permanent record; a sketch follows below.
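
Following up on the file-logging suggestion above, here is a minimal sketch of a custom Keras callback that appends per-epoch CPU and memory readings to a CSV file. The ResourceLogger class and RESOURCE_LOG_PATH are illustrative names, not part of the original response:

import csv
import psutil
import tensorflow as tf

# Hypothetical log path; adjust as needed
RESOURCE_LOG_PATH = "resource_usage.csv"

class ResourceLogger(tf.keras.callbacks.Callback):
    """Appends CPU and memory usage to a CSV file after every epoch."""

    def on_train_begin(self, logs=None):
        # Write the header once at the start of training
        with open(RESOURCE_LOG_PATH, "w", newline="") as f:
            csv.writer(f).writerow(["epoch", "cpu_percent", "memory_percent"])

    def on_epoch_end(self, epoch, logs=None):
        cpu_percent = psutil.cpu_percent(interval=1)
        mem_percent = psutil.virtual_memory().percent
        with open(RESOURCE_LOG_PATH, "a", newline="") as f:
            csv.writer(f).writerow([epoch + 1, cpu_percent, mem_percent])

# Usage: pass alongside or instead of the LambdaCallback, e.g.
# model_500.fit(review_padded_500, labels_500, epochs=5, callbacks=[ResourceLogger()])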
Model: gemini-1.5-flash-002
Temperature: 0.2
Max output tokens: 8192
TopK: 40.0
TopP: 0.95
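
To test this sample outside the Cloud console, the same prompt can be sent with the settings listed above by calling the Vertex AI API through the Python SDK. A minimal sketch, assuming the google-cloud-aiplatform package is installed and a Google Cloud project is configured (the project ID and region are placeholders):

import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

# Placeholder project and region; replace with your own
vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel(
    "gemini-1.5-flash-002",
    system_instruction="You will focus on delivering the requested code enhancements...",  # the system instructions above
)

response = model.generate_content(
    "I'm working on a sentiment analysis project...",  # the freeform prompt above
    generation_config=GenerationConfig(
        temperature=0.2,
        max_output_tokens=8192,
        top_k=40,
        top_p=0.95,
    ),
)
print(response.text)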