Observability and Monitoring
The Three Pillars of Observability
Observability is the ability to infer a system's internal state from its external outputs. Monitoring answers questions you knew to ask in advance; observability lets you investigate questions about the system's behavior that you did not anticipate.
| Pillar | What it is | Used for |
|---|---|---|
| Metrics | Numeric values over time | Alerts, trends, SLIs |
| Logs | Discrete events | Detailed debugging, auditing |
| Traces | The path of a request through the system | Latency analysis, dependencies |
Metrics
Metrics are numeric measurements collected at a regular interval. There are four metric types:
| Type | Description | Example |
|---|---|---|
| Counter | Only increases (resets to zero on restart) | Number of requests |
| Gauge | Can go up and down | CPU usage, memory |
| Histogram | Distribution of values across buckets | Latency buckets |
| Summary | Quantiles computed on the client | p50, p95, p99 latency |
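The difference between the first two types can be sketched in a few lines of Go. The `Counter` and `Gauge` types below are hypothetical illustrations built on `sync/atomic`, not the API of any metrics library: the counter exposes no way to decrease, while the gauge can be set or moved in either direction.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Counter is a hypothetical monotonic metric: it can only be incremented.
type Counter struct{ v atomic.Uint64 }

// Inc adds one to the counter.
func (c *Counter) Inc() { c.v.Add(1) }

// Value returns the current total.
func (c *Counter) Value() uint64 { return c.v.Load() }

// Gauge is a hypothetical metric that can move in both directions.
type Gauge struct{ v atomic.Int64 }

// Set overwrites the current reading.
func (g *Gauge) Set(n int64) { g.v.Store(n) }

// Add changes the reading by delta, which may be negative.
func (g *Gauge) Add(delta int64) { g.v.Add(delta) }

// Value returns the current reading.
func (g *Gauge) Value() int64 { return g.v.Load() }

func main() {
	var requests Counter // total requests served
	var inFlight Gauge   // requests being handled right now

	for i := 0; i < 3; i++ {
		inFlight.Add(1) // request started
		requests.Inc()
		inFlight.Add(-1) // request finished
	}
	fmt.Println(requests.Value(), inFlight.Value()) // prints "3 0"
}
```

A counter is the natural fit for anything you later want to turn into a rate, because the difference between two samples is always meaningful; a gauge only tells you the state at the moment it was sampled.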
Logs
Logs are records of discrete events. The modern approach is structured logging (JSON instead of plain text).
# Plain text: hard to parse
[2024-01-15 10:23:45] ERROR: Payment failed for user 12345
# Structured JSON: easy to search and filter
{"timestamp":"2024-01-15T10:23:45Z","level":"error","message":"Payment failed","user_id":12345,"amount":99.99,"currency":"USD","error_code":"INSUFFICIENT_FUNDS"}
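A minimal sketch of producing such a record with Go's standard `log/slog` package follows. The helper names `logPaymentFailure` and `roundTrip` are assumptions made for this illustration. Note that the default JSON handler uses the keys `time`, `level` and `msg` rather than `timestamp` and `message` as in the example above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
)

// logPaymentFailure writes one structured record describing a failed payment.
func logPaymentFailure(w io.Writer, userID int, amount float64, currency, code string) {
	logger := slog.New(slog.NewJSONHandler(w, nil))
	logger.Error("Payment failed",
		slog.Int("user_id", userID),
		slog.Float64("amount", amount),
		slog.String("currency", currency),
		slog.String("error_code", code),
	)
}

// roundTrip logs into memory and decodes the line back into fields,
// roughly what a log pipeline does before indexing a record.
func roundTrip(userID int, amount float64, currency, code string) (map[string]any, error) {
	var buf bytes.Buffer
	logPaymentFailure(&buf, userID, amount, currency, code)

	var fields map[string]any
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := roundTrip(12345, 99.99, "USD", "INSUFFICIENT_FUNDS")
	if err != nil {
		panic(err)
	}
	// Filtering by field needs no regular expressions.
	fmt.Println(fields["level"], fields["error_code"])
}
```

Because every record is a self-describing JSON object, a query such as "all errors with `error_code` equal to `INSUFFICIENT_FUNDS`" becomes a field comparison instead of a text search.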
Traces
Distributed tracing shows the path of a request across all services. Each trace consists of spans that form a call tree.
Trace ID: abc-123
├── [API Gateway] 250ms
│   ├── [Auth Service] 15ms
│   ├── [Order Service] 180ms
│   │   ├── [Database] 45ms
│   │   ├── [Payment Service] 120ms
│   │   │   └── [External Payment API] 95ms
│   │   └── [Notification Service] 10ms (async)
│   └── [Response] 5ms
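Spans usually reach a tracing backend as a flat list in which each span names its parent. The sketch below rebuilds the tree from such a list. The `Span` struct and its short numeric IDs are hypothetical simplifications; real SDKs also carry timestamps, attributes and status.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is a hypothetical, simplified span record.
type Span struct {
	ID         string
	ParentID   string // empty for the root span
	Name       string
	DurationMs int
}

// childrenByParent indexes spans by the ID of their parent span.
func childrenByParent(spans []Span) map[string][]Span {
	index := make(map[string][]Span)
	for _, s := range spans {
		index[s.ParentID] = append(index[s.ParentID], s)
	}
	return index
}

// render walks the tree depth-first and returns one indented line per span.
func render(index map[string][]Span, parentID string, depth int) []string {
	var lines []string
	for _, s := range index[parentID] {
		indent := strings.Repeat("  ", depth)
		lines = append(lines, fmt.Sprintf("%s[%s] %dms", indent, s.Name, s.DurationMs))
		lines = append(lines, render(index, s.ID, depth+1)...)
	}
	return lines
}

// sampleTrace returns part of the trace shown above as a flat list.
func sampleTrace() []Span {
	return []Span{
		{ID: "1", ParentID: "", Name: "API Gateway", DurationMs: 250},
		{ID: "2", ParentID: "1", Name: "Auth Service", DurationMs: 15},
		{ID: "3", ParentID: "1", Name: "Order Service", DurationMs: 180},
		{ID: "4", ParentID: "3", Name: "Database", DurationMs: 45},
		{ID: "5", ParentID: "3", Name: "Payment Service", DurationMs: 120},
		{ID: "6", ParentID: "5", Name: "External Payment API", DurationMs: 95},
	}
}

func main() {
	for _, line := range render(childrenByParent(sampleTrace()), "", 0) {
		fmt.Println(line)
	}
}
```

The parent reference is the only thing that ties spans from different services together, which is why the trace and span IDs have to travel with every outgoing request.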
Tools
Monitoring Stack
| Tool | Purpose | Model |
|---|---|---|
| Prometheus | Metrics (collection, storage, querying) | Pull-based |
| Grafana | Visualization of metrics and logs | Dashboards |
| Loki | Log aggregation (Grafana stack) | Like Prometheus, but for logs |
| Jaeger / Tempo | Distributed tracing | Trace storage |
| Elasticsearch + Kibana | Logs (ELK stack) | Full-text search |
| Alertmanager | Alert management | Alert routing |
| OpenTelemetry | Unified SDK for all three pillars | Vendor-neutral |
Prometheus + Grafana
Prometheus scrapes metrics over HTTP (pull model). The application exposes its metrics at the /metrics endpoint.
Prometheus exposition format:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/orders",status="200"} 15234
http_requests_total{method="POST",path="/api/orders",status="201"} 892
http_requests_total{method="GET",path="/api/orders",status="500"} 12
# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 9800
http_request_duration_seconds_bucket{le="0.1"} 12400
http_request_duration_seconds_bucket{le="0.25"} 14900
http_request_duration_seconds_bucket{le="0.5"} 15100
http_request_duration_seconds_bucket{le="+Inf"} 15234
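Histogram buckets are cumulative: each `le` line counts every observation at or below that bound. That is enough to estimate a quantile by linear interpolation inside the bucket that contains the target rank, which is roughly what PromQL's `histogram_quantile` does. The sketch below applies the idea to the bucket values above; the `Bucket` type and the function names are assumptions for this illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Bucket is one cumulative histogram bucket: Count observations were <= UpperBound.
type Bucket struct {
	UpperBound float64
	Count      float64
}

// estimateQuantile interpolates linearly inside the bucket that contains the
// requested rank. It assumes a non-empty histogram whose buckets are sorted,
// cumulative, contain at least one finite bound and end with +Inf, and that
// the lowest bucket starts at 0.
func estimateQuantile(q float64, buckets []Bucket) float64 {
	total := buckets[len(buckets)-1].Count
	rank := q * total

	// Find the first bucket whose cumulative count reaches the rank.
	i := sort.Search(len(buckets), func(i int) bool { return buckets[i].Count >= rank })
	if math.IsInf(buckets[i].UpperBound, 1) {
		// The rank lies beyond the last finite bound; report that bound.
		return buckets[i-1].UpperBound
	}

	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = buckets[i-1].UpperBound, buckets[i-1].Count
	}
	return lower + (buckets[i].UpperBound-lower)*(rank-below)/(buckets[i].Count-below)
}

// sampleBuckets holds the http_request_duration_seconds buckets shown above.
func sampleBuckets() []Bucket {
	return []Bucket{
		{0.05, 9800},
		{0.1, 12400},
		{0.25, 14900},
		{0.5, 15100},
		{math.Inf(1), 15234},
	}
}

func main() {
	b := sampleBuckets()
	fmt.Printf("p50 ~ %.3fs, p95 ~ %.3fs\n", estimateQuantile(0.5, b), estimateQuantile(0.95, b))
}
```

For p95 the target rank is 0.95 × 15234 ≈ 14472, which falls in the 0.1 to 0.25 bucket, so the estimate is 0.1 + 0.15 × (14472 − 12400) / (14900 − 12400) ≈ 0.224 s. The precision of the estimate depends entirely on how the bucket bounds were chosen.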
PHP: Structured Logging
Monolog with JSON formatting
<?php
declare(strict_types=1);
namespace App\Logging;
use Psr\Log\LoggerInterface;
final readonly class StructuredLogger
{
    public function __construct(
        private LoggerInterface $logger,
    ) {}
    /**
     * Log an HTTP request with structured context.
     *
     * @param array<string, mixed> $extra Additional context
     */
    public function logRequest(
        string $method,
        string $path,
        int $statusCode,
        float $durationMs,
        array $extra = [],
    ): void {
        $context = [
            'http_method' => $method,
            'http_path' => $path,
            'http_status' => $statusCode,
            'duration_ms' => round($durationMs, 2),
            'trace_id' => $this->getTraceId(),
            ...$extra,
        ];
        $level = match (true) {
            $statusCode >= 500 => 'error',
            $statusCode >= 400 => 'warning',
            default => 'info',
        };
        $this->logger->log($level, 'HTTP request processed', $context);
    }
    /**
     * Log a business event.
     */
    public function logBusinessEvent(
        string $event,
        string $entityType,
        string $entityId,
        array $data = [],
    ): void {
        $this->logger->info('Business event', [
            'event' => $event,
            'entity_type' => $entityType,
            'entity_id' => $entityId,
            'trace_id' => $this->getTraceId(),
            ...$data,
        ]);
    }
    private function getTraceId(): string
    {
        return $_SERVER['HTTP_X_TRACE_ID']
            ?? $_SERVER['HTTP_X_REQUEST_ID']
            ?? bin2hex(random_bytes(16));
    }
}
package logging
import (
	"log/slog"
	"math"
	"net/http"
	"github.com/google/uuid"
)
// StructuredLogger provides structured logging with trace context.
type StructuredLogger struct {
	logger *slog.Logger
}
// NewStructuredLogger creates a new StructuredLogger.
func NewStructuredLogger(logger *slog.Logger) *StructuredLogger {
	return &StructuredLogger{logger: logger}
}
// LogRequest logs an HTTP request with structured context.
func (l *StructuredLogger) LogRequest(r *http.Request, statusCode int, durationMs float64, extra ...slog.Attr) {
	attrs := []slog.Attr{
		slog.String("http_method", r.Method),
		slog.String("http_path", r.URL.Path),
		slog.Int("http_status", statusCode),
		slog.Float64("duration_ms", math.Round(durationMs*100)/100),
		slog.String("trace_id", traceIDFromRequest(r)),
	}
	attrs = append(attrs, extra...)
	level := slog.LevelInfo
	switch {
	case statusCode >= 500:
		level = slog.LevelError
	case statusCode >= 400:
		level = slog.LevelWarn
	}
	l.logger.LogAttrs(r.Context(), level, "HTTP request processed", attrs...)
}
// LogBusinessEvent logs a business event.
func (l *StructuredLogger) LogBusinessEvent(r *http.Request, event, entityType, entityID string, extra ...slog.Attr) {
	attrs := []slog.Attr{
		slog.String("event", event),
		slog.String("entity_type", entityType),
		slog.String("entity_id", entityID),
		slog.String("trace_id", traceIDFromRequest(r)),
	}
	attrs = append(attrs, extra...)
	l.logger.LogAttrs(r.Context(), slog.LevelInfo, "Business event", attrs...)
}
func traceIDFromRequest(r *http.Request) string {
	if id := r.Header.Get("X-Trace-ID"); id != "" {
		return id
	}
	if id := r.Header.Get("X-Request-ID"); id != "" {
		return id
	}
	return uuid.New().String()
}
<?php
declare(strict_types=1);
namespace App\Middleware;
use App\Logging\StructuredLogger;
use Symfony\Component\HttpKernel\Event\RequestEvent;
use Symfony\Component\HttpKernel\Event\ResponseEvent;
final class RequestLoggingMiddleware
{
    // Null until a main request starts, so a response without a recorded start is skipped
    // instead of reading an uninitialized typed property.
    private ?float $startTime = null;
    public function __construct(
        private readonly StructuredLogger $logger,
    ) {}
    public function onKernelRequest(RequestEvent $event): void
    {
        if (!$event->isMainRequest()) {
            return;
        }
        $this->startTime = microtime(true);
    }
    public function onKernelResponse(ResponseEvent $event): void
    {
        if (!$event->isMainRequest() || $this->startTime === null) {
            return;
        }
        $request = $event->getRequest();
        $response = $event->getResponse();
        $durationMs = (microtime(true) - $this->startTime) * 1000;
        $this->logger->logRequest(
            method: $request->getMethod(),
            path: $request->getPathInfo(),
            statusCode: $response->getStatusCode(),
            durationMs: $durationMs,
            extra: [
                'ip' => $request->getClientIp(),
                'user_agent' => $request->headers->get('User-Agent', 'unknown'),
                // Cast so both sizes are logged as integers.
                'request_size' => (int) $request->headers->get('Content-Length', '0'),
                // getContent() returns false for streamed responses.
                'response_size' => strlen($response->getContent() ?: ''),
            ],
        );
    }
}
package middleware
import (
	"log/slog"
	"net/http"
	"time"
)
// RequestLoggingMiddleware logs every HTTP request with duration and metadata.
func RequestLoggingMiddleware(logger *slog.Logger) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			start := time.Now()
			// Wrap ResponseWriter to capture status code and size
			rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
			next.ServeHTTP(rw, r)
			durationMs := float64(time.Since(start).Microseconds()) / 1000.0
			level := slog.LevelInfo
			switch {
			case rw.statusCode >= 500:
				level = slog.LevelError
			case rw.statusCode >= 400:
				level = slog.LevelWarn
			}
			logger.LogAttrs(r.Context(), level, "HTTP request processed",
				slog.String("http_method", r.Method),
				slog.String("http_path", r.URL.Path),
				slog.Int("http_status", rw.statusCode),
				slog.Float64("duration_ms", durationMs),
				slog.String("ip", r.RemoteAddr),
				slog.String("user_agent", r.UserAgent()),
				slog.Int64("request_size", r.ContentLength),
				slog.Int("response_size", rw.bytesWritten),
			)
		})
	}
}
type responseWriter struct {
	http.ResponseWriter
	statusCode   int
	bytesWritten int
}
func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}
func (rw *responseWriter) Write(b []byte) (int, error) {
	n, err := rw.ResponseWriter.Write(b)
	rw.bytesWritten += n
	return n, err
}
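<?php
declare(strict_types=1);
namespace App\Metrics;
/**
 * Simple Prometheus metrics collector.
 * In production, use promphp/prometheus_client_php library.
 */
final class MetricsCollector
{
    /** @var array<string, array{type: string, help: string, values: array, buckets?: list<float>}> */
    private array $metrics = [];
    public function registerCounter(string $name, string $help): void
    {
        $this->metrics[$name] = [
            'type' => 'counter',
            'help' => $help,
            'values' => [],
        ];
    }
    /**
     * @param list<float> $buckets Upper bounds of the histogram buckets
     */
    public function registerHistogram(string $name, string $help, array $buckets): void
    {
        // Cumulative buckets must be rendered in ascending order.
        sort($buckets);
        $this->metrics[$name] = [
            'type' => 'histogram',
            'help' => $help,
            'buckets' => $buckets,
            'values' => [],
        ];
    }
    /**
     * Increment a counter.
     *
     * @param array<string, string> $labels
     */
    public function incrementCounter(string $name, array $labels = [], float $value = 1.0): void
    {
        $key = $this->labelsToKey($labels);
        $this->metrics[$name]['values'][$key] ??= 0;
        $this->metrics[$name]['values'][$key] += $value;
    }
    /**
     * Observe a value for a histogram.
     *
     * @param array<string, string> $labels
     */
    public function observeHistogram(string $name, float $value, array $labels = []): void
    {
        $key = $this->labelsToKey($labels);
        $this->metrics[$name]['values'][$key][] = $value;
    }
    /**
     * Render all metrics in Prometheus exposition format.
     */
    public function render(): string
    {
        $output = '';
        foreach ($this->metrics as $name => $metric) {
            $output .= "# HELP {$name} {$metric['help']}\n";
            $output .= "# TYPE {$name} {$metric['type']}\n";
            foreach ($metric['values'] as $labels => $value) {
                // Histograms store raw observations, so they need their own rendering.
                $output .= $metric['type'] === 'histogram'
                    ? $this->renderHistogram($name, (string) $labels, $value, $metric['buckets'])
                    : $name . ($labels !== '' ? "{{$labels}}" : '') . " {$value}\n";
            }
            $output .= "\n";
        }
        return $output;
    }
    /**
     * Render cumulative _bucket lines plus _sum and _count for one label set.
     *
     * @param list<float> $observations
     * @param list<float> $buckets Upper bounds in ascending order
     */
    private function renderHistogram(string $name, string $labels, array $observations, array $buckets): string
    {
        $prefix = $labels !== '' ? "{$labels}," : '';
        $labelStr = $labels !== '' ? "{{$labels}}" : '';
        $total = count($observations);
        $output = '';
        foreach ($buckets as $le) {
            $count = count(array_filter($observations, static fn (float $v): bool => $v <= $le));
            $output .= "{$name}_bucket{{$prefix}le=\"{$le}\"} {$count}\n";
        }
        $output .= "{$name}_bucket{{$prefix}le=\"+Inf\"} {$total}\n";
        $output .= "{$name}_sum{$labelStr} " . array_sum($observations) . "\n";
        $output .= "{$name}_count{$labelStr} {$total}\n";
        return $output;
    }
    private function labelsToKey(array $labels): string
    {
        $parts = [];
        foreach ($labels as $key => $value) {
            $parts[] = "{$key}=\"{$value}\"";
        }
        return implode(',', $parts);
    }
}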
package metrics
import (
	"net/http"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
// Collector registers and exposes Prometheus metrics.
// In production, use github.com/prometheus/client_golang directly.
type Collector struct {
	requestsTotal   *prometheus.CounterVec
	requestDuration *prometheus.HistogramVec
}
// NewCollector creates a Collector with standard HTTP metrics.
func NewCollector(reg prometheus.Registerer) *Collector {
	c := &Collector{
		requestsTotal: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "http_requests_total",
				Help: "Total HTTP requests",
			},
			[]string{"method", "path", "status"},
		),
		requestDuration: prometheus.NewHistogramVec(
			prometheus.HistogramOpts{
				Name:    "http_request_duration_seconds",
				Help:    "HTTP request latency",
				Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1.0},
			},
			[]string{"method", "path"},
		),
	}
	reg.MustRegister(c.requestsTotal, c.requestDuration)
	return c
}
// IncRequestsTotal increments the request counter.
func (c *Collector) IncRequestsTotal(method, path, status string) {
	c.requestsTotal.WithLabelValues(method, path, status).Inc()
}
// ObserveDuration records a request duration.
func (c *Collector) ObserveDuration(method, path string, seconds float64) {
	c.requestDuration.WithLabelValues(method, path).Observe(seconds)
}
// Handler returns an HTTP handler that serves /metrics from the default
// registry, so pass prometheus.DefaultRegisterer to NewCollector if the
// collector's metrics should appear there.
func Handler() http.Handler {
	return promhttp.Handler()
}
<?php
declare(strict_types=1);
namespace App\Controller;
use App\Metrics\MetricsCollector;
use Symfony\Component\HttpFoundation\Response;
use Symfony\Component\Routing\Attribute\Route;
final class MetricsController
{
    public function __construct(
        private readonly MetricsCollector $collector,
    ) {}
    #[Route('/metrics', methods: ['GET'])]
    public function __invoke(): Response
    {
        return new Response(
            content: $this->collector->render(),
            // Content type of the Prometheus text exposition format.
            headers: ['Content-Type' => 'text/plain; version=0.0.4; charset=utf-8'],
        );
    }
}
package main
import (
	"log"
	"net/http"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
func main() {
	// Register /metrics endpoint for Prometheus scraping
	http.Handle("/metrics", promhttp.Handler())
	// ListenAndServe only returns on failure, so surface the error.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Rules for Good Alerts
| Principle | Description |
|---|---|
| Actionable | The alert calls for a specific action |
| Symptom-based | Alert on symptoms, not causes |
| Low noise | Minimize false positives |
| Documented | Every alert links to a runbook |
| Tiered | Severity determines the notification channel |
Severity Levels
| Severity | Description | Response |
|---|---|---|
| P1 Critical | Service is unavailable | Immediately, on-call |
| P2 High | Degradation for >10% of users | Within 30 minutes |
| P3 Medium | Degradation for <10% of users | During business hours |
| P4 Low | Potential problem | During planning |
In practice: if an alert fires and requires no action, delete it. An alert should wake the on-call engineer only when that is truly necessary.
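The severity table can be read as a small decision function. The sketch below is one possible encoding in Go, not a prescribed policy. The function names and the channel names (`pager`, `chat`, `ticket`, `backlog`) are hypothetical, and treating exactly 10% of users as P2 is an assumption, since the table leaves that boundary open.

```go
package main

import "fmt"

// Severity mirrors the P1..P4 levels from the table above.
type Severity int

const (
	P1Critical Severity = iota + 1
	P2High
	P3Medium
	P4Low
)

// classify maps observed symptoms to a severity level.
// Assumption: exactly 10% of affected users counts as P2.
func classify(serviceDown bool, affectedUsersPct float64) Severity {
	switch {
	case serviceDown:
		return P1Critical
	case affectedUsersPct >= 10:
		return P2High
	case affectedUsersPct > 0:
		return P3Medium
	default:
		return P4Low
	}
}

// channel picks a notification channel per severity.
// The channel names are hypothetical placeholders.
func channel(s Severity) string {
	switch s {
	case P1Critical:
		return "pager"
	case P2High:
		return "chat"
	case P3Medium:
		return "ticket"
	default:
		return "backlog"
	}
}

func main() {
	s := classify(false, 25)
	fmt.Printf("P%d -> %s\n", s, channel(s)) // prints "P2 -> chat"
}
```

Both inputs are symptoms visible to users (availability and the share of users affected), which keeps the alert symptom-based; causes such as high CPU belong on a dashboard or in the runbook.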
OpenTelemetry and PHP
OpenTelemetry is a vendor-neutral standard for collecting telemetry. It brings metrics, logs, and traces together in a single SDK.
<?php
declare(strict_types=1);
namespace App\Tracing;
/**
 * Simplified tracing context propagation.
 * In production use open-telemetry/opentelemetry-php SDK.
 */
final class TraceContext
{
    private string $traceId;
    private string $spanId;
    private ?string $parentSpanId;
    public function __construct(?string $traceId = null)
    {
        $this->traceId = $traceId ?? bin2hex(random_bytes(16));
        $this->spanId = bin2hex(random_bytes(8));
        $this->parentSpanId = null;
    }
    public function createChildSpan(string $operationName): SpanRecord
    {
        $childSpanId = bin2hex(random_bytes(8));
        return new SpanRecord(
            traceId: $this->traceId,
            spanId: $childSpanId,
            parentSpanId: $this->spanId,
            operationName: $operationName,
            startTime: microtime(true),
        );
    }
    public function getTraceId(): string
    {
        return $this->traceId;
    }
    /**
     * Extract trace context from HTTP headers (W3C Trace Context).
     */
    public static function fromHeaders(array $headers): self
    {
        $traceparent = $headers['traceparent'] ?? null;
        // Format: version-traceid-parentid-flags; the flags field is two hex digits, not decimal.
        if ($traceparent !== null && preg_match('/^00-([a-f0-9]{32})-([a-f0-9]{16})-[a-f0-9]{2}$/', $traceparent, $m)) {
            $context = new self($m[1]);
            $context->parentSpanId = $m[2];
            return $context;
        }
        return new self();
    }
}
final readonly class SpanRecord
{
    public function __construct(
        public string $traceId,
        public string $spanId,
        public string $parentSpanId,
        public string $operationName,
        public float $startTime,
        public ?float $endTime = null,
        public array $attributes = [],
    ) {}
    public function finish(): self
    {
        return new self(
            traceId: $this->traceId,
            spanId: $this->spanId,
            parentSpanId: $this->parentSpanId,
            operationName: $this->operationName,
            startTime: $this->startTime,
            endTime: microtime(true),
            attributes: $this->attributes,
        );
    }
    public function getDurationMs(): float
    {
        $end = $this->endTime ?? microtime(true);
        return ($end - $this->startTime) * 1000;
    }
}
package tracing
import (
	"context"
	"fmt"
	"net/http"
	"regexp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)
// In production, use go.opentelemetry.io/otel SDK.
// This example shows the standard OpenTelemetry Go usage.
var tracer = otel.Tracer("myapp")
// StartSpan creates a child span from the current context.
func StartSpan(ctx context.Context, operationName string, attrs ...attribute.KeyValue) (context.Context, trace.Span) {
	ctx, span := tracer.Start(ctx, operationName,
		trace.WithAttributes(attrs...),
	)
	return ctx, span
}
// TraceIDFromContext extracts the trace ID from the current span.
func TraceIDFromContext(ctx context.Context) string {
	span := trace.SpanFromContext(ctx)
	if span.SpanContext().HasTraceID() {
		return span.SpanContext().TraceID().String()
	}
	return ""
}
// W3C Trace Context header regex for manual parsing.
// The trailing flags field is two hex digits, not decimal.
var traceparentRe = regexp.MustCompile(`^00-([a-f0-9]{32})-([a-f0-9]{16})-[a-f0-9]{2}$`)
// ExtractTraceParent parses a W3C traceparent header manually.
func ExtractTraceParent(header string) (traceID, parentSpanID string, ok bool) {
	m := traceparentRe.FindStringSubmatch(header)
	if len(m) != 3 {
		return "", "", false
	}
	return m[1], m[2], true
}
// TracingMiddleware adds OpenTelemetry spans to HTTP requests.
func TracingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, span := tracer.Start(r.Context(),
			fmt.Sprintf("%s %s", r.Method, r.URL.Path),
			trace.WithAttributes(
				attribute.String("http.method", r.Method),
				attribute.String("http.url", r.URL.String()),
			),
		)
		defer span.End()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
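Parsing an incoming `traceparent` header is only half of propagation: the service also has to send the context on to whatever it calls. The standard-library sketch below shows the outgoing side under a few assumptions. The helper names and the URL `http://payment.internal/charge` are hypothetical, the trace ID is just an example value, and the flags are hard-coded to `01` (sampled). With the OpenTelemetry SDK this step is normally handled by its propagators instead of by hand.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

// newSpanID returns 8 random bytes as 16 lowercase hex characters.
func newSpanID() (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// buildTraceParent formats a W3C traceparent value: version 00, the trace ID
// carried over unchanged, the caller's span ID, and flags 01 (sampled).
func buildTraceParent(traceID, spanID string) string {
	return fmt.Sprintf("00-%s-%s-01", traceID, spanID)
}

// injectTraceParent sets the header on an outgoing request.
func injectTraceParent(req *http.Request, traceID, spanID string) {
	req.Header.Set("traceparent", buildTraceParent(traceID, spanID))
}

func main() {
	spanID, err := newSpanID()
	if err != nil {
		panic(err)
	}
	// Hypothetical downstream call; the request is built but never sent.
	req, err := http.NewRequest(http.MethodGet, "http://payment.internal/charge", nil)
	if err != nil {
		panic(err)
	}
	injectTraceParent(req, "4bf92f3577b34da6a3ce929d0e0e4736", spanID)
	fmt.Println(req.Header.Get("traceparent"))
}
```

The downstream service sees the same trace ID and records the received span ID as the parent of its own span, which is what lets a backend stitch spans from different processes into one tree.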
Dashboard Levels
| Level | Audience | Content |
|---|---|---|
| Business | Management | KPIs, conversions, revenue |
| Service | Developers | Latency, errors, throughput |
| Infrastructure | SRE/DevOps | CPU, RAM, disk, network |
| Debug | On-call | Detailed metrics for troubleshooting |
Tip: start with USE (utilization, saturation, errors) and RED (rate, errors, duration) metrics for every service, and add detail only as the need arises. A dashboard that nobody looks at is a dead dashboard.
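As a rough illustration of the RED signals, the sketch below derives all three from a list of handled requests. The `Request` and `RED` types are hypothetical. In practice these values usually come from a counter and a histogram queried in Prometheus rather than from raw request records, and a percentile is often preferred over the mean used here.

```go
package main

import (
	"fmt"
	"time"
)

// Request is a hypothetical record of one handled request.
type Request struct {
	Duration time.Duration
	Failed   bool
}

// RED holds the three RED signals for one service over a time window.
type RED struct {
	RatePerSec  float64       // Rate: requests per second
	ErrorRatio  float64       // Errors: failed requests / all requests
	AvgDuration time.Duration // Duration: mean latency
}

// computeRED aggregates requests observed during windowSeconds.
func computeRED(reqs []Request, windowSeconds float64) RED {
	if len(reqs) == 0 || windowSeconds <= 0 {
		return RED{}
	}
	var failed int
	var total time.Duration
	for _, r := range reqs {
		if r.Failed {
			failed++
		}
		total += r.Duration
	}
	n := len(reqs)
	return RED{
		RatePerSec:  float64(n) / windowSeconds,
		ErrorRatio:  float64(failed) / float64(n),
		AvgDuration: total / time.Duration(n),
	}
}

// sampleRequests returns four requests, one of which failed.
func sampleRequests() []Request {
	return []Request{
		{Duration: 100 * time.Millisecond},
		{Duration: 200 * time.Millisecond},
		{Duration: 300 * time.Millisecond, Failed: true},
		{Duration: 400 * time.Millisecond},
	}
}

func main() {
	red := computeRED(sampleRequests(), 2)
	fmt.Printf("rate=%.1f req/s errors=%.0f%% duration=%s\n",
		red.RatePerSec, red.ErrorRatio*100, red.AvgDuration)
}
```

Three numbers per service are enough for a first Service-level dashboard; the Debug level can then break the same signals down by endpoint or status code.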
Summary
| Concept | Application |
|---|---|
| Structured logging | JSON logs with a trace_id for correlation |
| Prometheus metrics | Exporting counters, gauges, and histograms |
| Distributed tracing | Propagating trace context through headers |
| OpenTelemetry | A single SDK for all three pillars |
| Alerting | Symptom-based, actionable, documented |