Advanced AI Application Development (8): Monitoring and Observability to Make Your System Transparent and Controllable
1. Opening: Invisible Problems Are the Scariest
Hi everyone, I'm Lao Jin.
What's the scariest thing once an AI application is live?
Not knowing where things went wrong.
Today, let's talk about monitoring and observability.
2. The Monitoring System
2.1 The Three Pillars
┌─────────────────────────────────────────┐
│      Three Pillars of Observability     │
├─────────────────────────────────────────┤
│                                         │
│  Metrics                                │
│  ├── Latency, throughput, error rate    │
│  ├── Token usage, cost                  │
│  └── Custom business metrics            │
│                                         │
│  Logs                                   │
│  ├── Structured logs                    │
│  ├── Error logs                         │
│  └── Correlation IDs                    │
│                                         │
│  Traces                                 │
│  ├── End-to-end request path            │
│  ├── Service dependencies               │
│  └── Performance bottlenecks            │
│                                         │
└─────────────────────────────────────────┘
2.2 Core Metrics
# Prometheus metric definitions
from prometheus_client import Counter, Histogram, Gauge

# Request metrics
REQUEST_COUNT = Counter(
    'ai_requests_total',
    'Total requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'ai_request_latency_seconds',
    'Request latency',
    ['endpoint']
)

# LLM metrics
LLM_TOKENS = Counter(
    'ai_llm_tokens_total',
    'LLM tokens used',
    ['model', 'type']
)

LLM_LATENCY = Histogram(
    'ai_llm_latency_seconds',
    'LLM call latency',
    ['model']
)

# Business metrics
ACTIVE_SESSIONS = Gauge(
    'ai_active_sessions',
    'Active sessions'
)

QUEUE_SIZE = Gauge(
    'ai_queue_size',
    'Request queue size'
)
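Defining metrics is only half the job; they have to be updated on every request and exposed over HTTP for Prometheus to scrape. Here is a minimal sketch of how the counters above might be wired into a handler (handle_chat and call_llm are hypothetical placeholders, and the port is arbitrary):

# Sketch: record metrics around a request, then expose them for scraping.
import time
from prometheus_client import start_http_server

async def handle_chat(request):            # hypothetical handler
    start = time.perf_counter()
    status = "success"
    try:
        result = await call_llm(request.prompt)   # hypothetical LLM call
        LLM_TOKENS.labels(model="gpt-4", type="input").inc(result.tokens_input)
        LLM_TOKENS.labels(model="gpt-4", type="output").inc(result.tokens_output)
        return result
    except Exception:
        status = "error"
        raise
    finally:
        REQUEST_COUNT.labels(method="POST", endpoint="/chat", status=status).inc()
        REQUEST_LATENCY.labels(endpoint="/chat").observe(time.perf_counter() - start)

start_http_server(8000)   # serves /metrics in a background thread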
3. Log Management
3.1 Structured Logging
import structlog

# Configure structured logging with JSON output
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logger = structlog.get_logger()

# Usage
logger.info(
    "request_processed",
    user_id="user_123",
    latency_ms=150,
    tokens_input=100,
    tokens_output=200,
    model="gpt-4"
)

# Output:
# {"event": "request_processed", "user_id": "user_123", "latency_ms": 150, ...}
3.2 Distributed Tracing
# OpenTelemetry tracing
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Note: newer OpenTelemetry releases recommend the OTLP exporter instead;
# the Jaeger Thrift exporter below is deprecated but still widely shown in examples
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Setup
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

# Usage
async def process_request(request_id: str):
    with tracer.start_as_current_span("process_request") as span:
        span.set_attribute("request_id", request_id)

        # Child span for the LLM call; attach token usage to the child,
        # not the parent, so the attribute lands on the right span
        with tracer.start_as_current_span("llm_call") as llm_span:
            result = await call_llm()
            llm_span.set_attribute("tokens_used", result.tokens)

        with tracer.start_as_current_span("save_result"):
            await save_to_db(result)
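With both structured logs and traces in place, the natural next step is correlating them: if every log line carries the active trace_id, an error log links directly to its trace in Jaeger. Here is a sketch of a structlog processor that injects it (add_trace_context is my name for it, not a library function):

# Sketch: a structlog processor that copies the current OpenTelemetry
# trace/span IDs into each log entry.
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

# Register it in structlog.configure(), anywhere before JSONRenderer.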
4. Alerting
4.1 Alert Rules
# alerting_rules.yml
groups:
  - name: ai_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of errored requests to all requests, not the absolute error rate
        expr: sum(rate(ai_requests_total{status="error"}[5m])) / sum(rate(ai_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 5%"

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(ai_request_latency_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 5s"

      - alert: HighCost
        expr: increase(ai_llm_tokens_total[1h]) > 1000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Token usage > 1M/hour"
5. Summary
Monitoring checklist:
- [ ] Base metrics (latency, errors, throughput)
- [ ] Business metrics (tokens, cost, sessions)
- [ ] Structured logging
- [ ] Distributed tracing
- [ ] Alert rules
- [ ] Dashboards