While optimizing a transformer can lower time and memory requirements, it might result in a minor decrease in performance. Therefore, it’s crucial to evaluate the model’s performance after applying these optimization techniques.
Deploying transformers in production environments involves a trade-off among several constraints, the most common being model performance, inference latency, and memory footprint.
Let’s start by developing a basic benchmark that evaluates each metric for a specific pipeline and test set.
import numpy as np
import torch
from pathlib import Path
from time import perf_counter

class PerformanceBenchmark:
    def __init__(self, pipeline, dataset, optim_type="BERT baseline"):
        self.pipeline = pipeline
        self.dataset = dataset
        self.optim_type = optim_type

    def compute_accuracy(self):
        preds, labels = [], []
        for example in self.dataset:
            pred = self.pipeline(example["text"])[0]["label"]
            label = example["intent"]
            # `intents` is the dataset's ClassLabel feature (defined
            # elsewhere); it maps the predicted string to its integer id
            preds.append(intents.str2int(pred))
            labels.append(label)
        # `accuracy_score` is an accuracy metric object loaded beforehand
        accuracy = accuracy_score.compute(predictions=preds, references=labels)
        return accuracy

    def compute_size(self):
        state_dict = self.pipeline.model.state_dict()
        tmp_path = Path("model.pt")
        torch.save(state_dict, tmp_path)
        # Calculate size in megabytes
        size_mb = tmp_path.stat().st_size / (1024 * 1024)
        # Delete temporary file
        tmp_path.unlink()
        return {"size_mb": size_mb}

    def time_pipeline(self):
        # `query` is a sample input defined elsewhere
        latencies = []
        # Warmup
        for _ in range(10):
            _ = self.pipeline(query)
        # Timed run
        for _ in range(100):
            start_time = perf_counter()
            _ = self.pipeline(query)
            latency = perf_counter() - start_time
            latencies.append(latency)
        # Compute run statistics
        time_avg_ms = 1000 * np.mean(latencies)
        time_std_ms = 1000 * np.std(latencies)
        return {"time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms}

    def run_benchmark(self):
        metrics = {}
        metrics[self.optim_type] = self.compute_size()
        metrics[self.optim_type].update(self.time_pipeline())
        metrics[self.optim_type].update(self.compute_accuracy())
        return metrics

The PerformanceBenchmark class evaluates a machine learning pipeline's performance along three axes: on-disk size, inference latency, and accuracy.
The compute_accuracy method calculates accuracy by comparing predicted and actual labels.
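In isolation, that comparison reduces to the fraction of predictions that match their reference labels. A minimal pure-Python equivalent of the quantity the metric object computes (the `accuracy` helper below is illustrative, not part of the class):

```python
def accuracy(preds, labels):
    # Fraction of predictions that match their reference labels --
    # the same quantity compute_accuracy delegates to its metric object.
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

print(accuracy([0, 1, 1, 2], [0, 1, 2, 2]))  # 0.75
```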
The compute_size method determines the model’s size in megabytes by temporarily saving and measuring the state dictionary.
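This save-measure-delete pattern works for any serializable object, not just a PyTorch state dictionary. A standalone sketch using pickle in place of torch.save (`object_size_mb` is a hypothetical helper name):

```python
import pickle
from pathlib import Path

def object_size_mb(obj, tmp_path=Path("obj.pkl")):
    # Serialize to a temporary file, read its size from the filesystem,
    # then remove the file -- the same three steps compute_size performs
    # with the model's state_dict.
    with open(tmp_path, "wb") as f:
        pickle.dump(obj, f)
    size_mb = tmp_path.stat().st_size / (1024 * 1024)
    tmp_path.unlink()
    return size_mb

print(object_size_mb({"weights": [0.0] * 100_000}))
```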
The time_pipeline method measures inference latency, with a warm-up phase so that one-off initialization costs do not skew the timings. Finally, run_benchmark compiles these metrics into a comprehensive performance report, organizing the results by optimization type.
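The warm-up-then-measure pattern generalizes to any callable. A self-contained sketch that mirrors the timing logic (`time_callable` is a hypothetical helper, not part of the class):

```python
from time import perf_counter
import numpy as np

def time_callable(fn, warmup=10, runs=100):
    # Discard warm-up calls, then report the mean and standard deviation
    # of latency over repeated timed runs, as time_pipeline does.
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(runs):
        start = perf_counter()
        fn()
        latencies.append(perf_counter() - start)
    return {"time_avg_ms": 1000 * np.mean(latencies),
            "time_std_ms": 1000 * np.std(latencies)}

print(time_callable(lambda: sum(range(10_000))))
```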