Monitoring

Comprehensive monitoring setup for Noid production deployments.

Metrics Overview

Key Metrics to Track

Metric                                     | Type      | Description
-------------------------------------------|-----------|-------------------------
noid_vms_total                             | Gauge     | Total number of VMs
noid_vms_running                           | Gauge     | Number of running VMs
noid_vm_creation_duration_seconds          | Histogram | VM creation time
noid_checkpoint_creation_duration_seconds  | Histogram | Checkpoint creation time
noid_memory_usage_bytes                    | Gauge     | Memory usage per VM
noid_cpu_usage_percent                     | Gauge     | CPU usage per VM
noid_disk_usage_bytes                      | Gauge     | Disk usage
noid_api_requests_total                    | Counter   | API request count
noid_api_errors_total                      | Counter   | API error count
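
A few example PromQL queries over these metrics, of the kind you might chart or alert on:

```promql
# Fraction of VMs currently running
noid_vms_running / noid_vms_total

# 95th-percentile VM creation time over the last 5 minutes
histogram_quantile(0.95, rate(noid_vm_creation_duration_seconds_bucket[5m]))

# API error ratio over the last 5 minutes
rate(noid_api_errors_total[5m]) / rate(noid_api_requests_total[5m])
```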

Prometheus Setup

1. Install Prometheus

# Download Prometheus
VERSION="2.49.0"
wget https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz

# Extract and install
tar xvf prometheus-${VERSION}.linux-amd64.tar.gz
sudo mv prometheus-${VERSION}.linux-amd64 /opt/prometheus
sudo ln -s /opt/prometheus/prometheus /usr/local/bin/

2. Configure Prometheus

Create /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Noid server metrics (adjust the port to where the Noid server
  # exposes /metrics; note 9090 is also Prometheus's own default port)
  - job_name: 'noid-server'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  # VM metrics (per-VM exporter)
  - job_name: 'noid-vms'
    static_configs:
      - targets: ['localhost:9091']

3. Systemd Service

Create /etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/data
Restart=always

[Install]
WantedBy=multi-user.target

Reload systemd, then enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

Grafana Setup

1. Install Grafana

# Add repository (apt-key is deprecated; use a keyring file instead)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Install
sudo apt-get update
sudo apt-get install -y grafana

# Start service
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

2. Add Prometheus Data Source

  1. Open Grafana: http://localhost:3000 (default: admin/admin)
  2. Go to Configuration → Data Sources
  3. Add Prometheus data source
  4. URL: http://localhost:9090
  5. Save & Test
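
Alternatively, Grafana can provision the data source from a file so it survives reinstalls. A sketch at /etc/grafana/provisioning/datasources/prometheus.yaml (the filename is arbitrary):

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```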

3. Import Noid Dashboard

Create a custom dashboard or import a pre-built one:

{
  "dashboard": {
    "title": "Noid Overview",
    "panels": [
      {
        "title": "Total VMs",
        "targets": [{
          "expr": "noid_vms_total"
        }]
      },
      {
        "title": "VM Creation Time",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(noid_vm_creation_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Memory Usage",
        "targets": [{
          "expr": "sum(noid_memory_usage_bytes) by (vm)"
        }]
      }
    ]
  }
}
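
The JSON above already carries the top-level dashboard key that Grafana's HTTP API expects, so it can also be imported non-interactively (default admin/admin credentials assumed; save the JSON as noid-dashboard.json first):

```shell
curl -s -X POST http://admin:admin@localhost:3000/api/dashboards/db \
  -H 'Content-Type: application/json' \
  -d @noid-dashboard.json
```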

Logging

Structured Logging

The server emits structured logs in JSON format:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "message": "VM created successfully",
  "vm_name": "prod-api",
  "user": "alice",
  "duration_ms": 5234
}
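
One advantage of JSON logs is ad-hoc filtering with jq (also used by the health-check script later on this page). A sketch that pulls slow operations out of a sample stream in the same format:

```shell
# Print operations that took longer than 1s; run the same jq program
# against /var/log/noid/server.log for real logs.
jq -r 'select(.duration_ms > 1000) | "\(.timestamp) \(.vm_name): \(.duration_ms)ms"' <<'EOF'
{"timestamp":"2024-01-15T10:30:00Z","level":"info","message":"VM created successfully","vm_name":"prod-api","user":"alice","duration_ms":5234}
{"timestamp":"2024-01-15T10:31:00Z","level":"info","message":"VM started","vm_name":"prod-api","user":"alice","duration_ms":150}
EOF
# -> 2024-01-15T10:30:00Z prod-api: 5234ms
```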

Log Aggregation

Using Loki:

# Install Loki
wget https://github.com/grafana/loki/releases/download/v2.9.3/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki

# Install Promtail (log collector)
wget https://github.com/grafana/loki/releases/download/v2.9.3/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo mv promtail-linux-amd64 /usr/local/bin/promtail

Create /etc/promtail/config.yml:

server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: noid-server
    static_configs:
      - targets:
          - localhost
        labels:
          job: noid-server
          __path__: /var/log/noid/server.log

  - job_name: noid-audit
    static_configs:
      - targets:
          - localhost
        labels:
          job: noid-audit
          __path__: /var/log/noid/audit.log
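
Because the Noid logs are JSON, Promtail can parse fields into labels with pipeline stages. A sketch to append under the noid-server job (keep label cardinality low — level is safe; high-cardinality fields like user generally are not):

```yaml
    pipeline_stages:
      - json:
          expressions:
            level: level
      - labels:
          level:
```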

Alerting

Prometheus Alerting Rules

Create /etc/prometheus/alerts.yml:

groups:
  - name: noid
    interval: 30s
    rules:
      # High VM count
      - alert: HighVMCount
        expr: noid_vms_total > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of VMs"
          description: "{{ $value }} VMs running (threshold: 80)"

      # Slow VM creation
      - alert: SlowVMCreation
        expr: histogram_quantile(0.95, rate(noid_vm_creation_duration_seconds_bucket[5m])) > 60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "VM creation is slow"
          description: "95th percentile: {{ $value }}s"

      # High memory usage
      - alert: HighMemoryUsage
        expr: sum(noid_memory_usage_bytes) / sum(node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage"
          description: "Memory usage: {{ $value | humanizePercentage }}"

      # High error rate
      - alert: HighErrorRate
        expr: rate(noid_api_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API error rate"
          description: "Error rate: {{ $value }} errors/sec"

      # Server down
      - alert: ServerDown
        expr: up{job="noid-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Noid server is down"
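
Prometheus only evaluates these rules if prometheus.yml references the file. Add the following to /etc/prometheus/prometheus.yml (the Alertmanager target assumes its default port, 9093):

```yaml
rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```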

Alert Manager

Configure /etc/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'

    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'Noid Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
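
The severity labels set by the alerting rules can drive routing. A sketch of a child route that re-pages critical alerts more aggressively (the pager receiver here is hypothetical — define it alongside default):

```yaml
route:
  group_by: ['alertname']
  receiver: 'default'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pager'
      repeat_interval: 15m
```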

Health Checks

Server Health Endpoint

# Check server health
curl -k https://localhost:7654/health

# Response
{
  "status": "healthy",
  "uptime_seconds": 3600,
  "vms": {
    "total": 15,
    "running": 12,
    "stopped": 3
  },
  "resources": {
    "memory_used_mb": 8192,
    "memory_available_mb": 24576,
    "disk_used_gb": 150,
    "disk_available_gb": 350
  }
}

Automated Health Checks

#!/bin/bash
# /usr/local/bin/noid-healthcheck.sh

HEALTH_URL="https://localhost:7654/health"

response=$(curl -sk -w "%{http_code}" -o /tmp/health.json "${HEALTH_URL}")

if [ "$response" != "200" ]; then
  echo "CRITICAL: Health check failed (HTTP $response)"
  # Send alert
  exit 2
fi

status=$(jq -r '.status' /tmp/health.json)
if [ "$status" != "healthy" ]; then
  echo "WARNING: Server status is $status"
  exit 1
fi

echo "OK: Server is healthy"
exit 0
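
A simple way to run the script on a schedule is a cron entry (every 5 minutes shown; the log path is an example):

```
# /etc/cron.d/noid-healthcheck
*/5 * * * * root /usr/local/bin/noid-healthcheck.sh >> /var/log/noid/healthcheck.log 2>&1
```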

Performance Monitoring

VM Performance Metrics

# CPU usage per VM
noid exec vm1 -- top -bn1 | head -n 3

# Memory usage
noid exec vm1 -- free -h

# Disk I/O
noid exec vm1 -- iostat -x 1 3

# Network throughput
noid exec vm1 -- iftop -t -s 10

Benchmark Tests

#!/bin/bash
# VM creation benchmark

echo "Testing VM creation speed..."
start=$(date +%s.%N)
noid create benchmark-vm
end=$(date +%s.%N)
duration=$(echo "$end - $start" | bc)

echo "VM created in ${duration}s"

# Checkpoint benchmark
start=$(date +%s.%N)
noid checkpoint benchmark-vm snap1
end=$(date +%s.%N)
duration=$(echo "$end - $start" | bc)

echo "Checkpoint created in ${duration}s"

# Cleanup
noid destroy benchmark-vm
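
The two timed sections above repeat the same pattern. A small helper (hypothetical, not part of the noid CLI) keeps the benchmarks consistent and uses awk for the arithmetic, so it works even where bc is missing:

```shell
#!/bin/bash
# time_op LABEL CMD [ARGS...] -- run CMD and print "LABEL: <seconds>s"
time_op() {
  local label="$1"; shift
  local start end
  start=$(date +%s.%N)
  "$@"
  end=$(date +%s.%N)
  awk -v s="$start" -v e="$end" -v l="$label" \
    'BEGIN { printf "%s: %.2fs\n", l, e - s }'
}

# Usage with the benchmark above:
#   time_op "vm-create"  noid create benchmark-vm
#   time_op "checkpoint" noid checkpoint benchmark-vm snap1
```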

Dashboard Examples

System Overview

  • Total VMs (gauge)
  • VMs by status (pie chart)
  • VM creation rate (time series)
  • API request rate (time series)

Performance

  • VM creation time (histogram)
  • Checkpoint creation time (histogram)
  • API latency (histogram)
  • Resource utilization (stacked area)

Errors & Issues

  • Error rate by endpoint (time series)
  • Failed operations (table)
  • Slow operations (table)
  • Recent errors (logs panel)

Next Steps