Monitoring

Comprehensive monitoring setup for Noid production deployments.

Metrics Overview

Key Metrics to Track

Metric                                     | Type      | Description
-------------------------------------------|-----------|-------------------------
noid_vms_total                             | Gauge     | Total number of VMs
noid_vms_running                           | Gauge     | Number of running VMs
noid_vm_creation_duration_seconds          | Histogram | VM creation time
noid_checkpoint_creation_duration_seconds  | Histogram | Checkpoint creation time
noid_memory_usage_bytes                    | Gauge     | Memory usage per VM
noid_cpu_usage_percent                     | Gauge     | CPU usage per VM
noid_disk_usage_bytes                      | Gauge     | Disk usage
noid_api_requests_total                    | Counter   | API request count
noid_api_errors_total                      | Counter   | API error count
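
A few example PromQL queries over these metrics, of the kind you might chart or alert on:

```promql
# Fraction of VMs currently running
noid_vms_running / noid_vms_total

# 95th-percentile VM creation time over the last 5 minutes
histogram_quantile(0.95, rate(noid_vm_creation_duration_seconds_bucket[5m]))

# API error ratio over the last 5 minutes
rate(noid_api_errors_total[5m]) / rate(noid_api_requests_total[5m])
```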

Prometheus Setup

1. Install Prometheus

# Download Prometheus
VERSION="2.49.0"
wget https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz

# Extract and install
tar xvf prometheus-${VERSION}.linux-amd64.tar.gz
sudo mv prometheus-${VERSION}.linux-amd64 /opt/prometheus
sudo ln -s /opt/prometheus/prometheus /usr/local/bin/

2. Configure Prometheus

Create /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Noid server metrics (adjust the port to where the Noid server
  # exposes /metrics; note 9090 is also Prometheus's own default port)
  - job_name: 'noid-server'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  # VM metrics (per-VM exporter)
  - job_name: 'noid-vms'
    static_configs:
      - targets: ['localhost:9091']

3. Systemd Service

Create /etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/data
Restart=always

[Install]
WantedBy=multi-user.target

Reload systemd, then enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

Grafana Setup

1. Install Grafana

# Add repository (apt-key is deprecated; use a keyring file instead)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Install
sudo apt-get update
sudo apt-get install -y grafana

# Start service
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

2. Add Prometheus Data Source

  1. Open Grafana: http://localhost:3000 (default: admin/admin)
  2. Go to Configuration → Data Sources
  3. Add Prometheus data source
  4. URL: http://localhost:9090
  5. Save & Test
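
Alternatively, Grafana can provision the data source from a file so it survives reinstalls. A sketch at /etc/grafana/provisioning/datasources/prometheus.yaml (the filename is arbitrary):

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```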

3. Import Noid Dashboard

Create a custom dashboard or import a pre-built one:

{
  "dashboard": {
    "title": "Noid Overview",
    "panels": [
      {
        "title": "Total VMs",
        "targets": [{
          "expr": "noid_vms_total"
        }]
      },
      {
        "title": "VM Creation Time",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(noid_vm_creation_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Memory Usage",
        "targets": [{
          "expr": "sum(noid_memory_usage_bytes) by (vm)"
        }]
      }
    ]
  }
}
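
The JSON above already carries the top-level dashboard key that Grafana's HTTP API expects, so it can also be imported non-interactively (default admin/admin credentials assumed; save the JSON as noid-dashboard.json first):

```shell
curl -s -X POST http://admin:admin@localhost:3000/api/dashboards/db \
  -H 'Content-Type: application/json' \
  -d @noid-dashboard.json
```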

Logging

Structured Logging

The server emits structured logs in JSON format:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "message": "VM created successfully",
  "vm_name": "prod-api",
  "user": "alice",
  "duration_ms": 5234
}
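
One advantage of JSON logs is ad-hoc filtering with jq (also used by the health-check script later on this page). A sketch that pulls slow operations out of a sample stream in the same format:

```shell
# Print operations that took longer than 1s; run the same jq program
# against /var/log/noid/server.log for real logs.
jq -r 'select(.duration_ms > 1000) | "\(.timestamp) \(.vm_name): \(.duration_ms)ms"' <<'EOF'
{"timestamp":"2024-01-15T10:30:00Z","level":"info","message":"VM created successfully","vm_name":"prod-api","user":"alice","duration_ms":5234}
{"timestamp":"2024-01-15T10:31:00Z","level":"info","message":"VM started","vm_name":"prod-api","user":"alice","duration_ms":150}
EOF
# -> 2024-01-15T10:30:00Z prod-api: 5234ms
```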

Log Aggregation

Using Loki:

# Install Loki
wget https://github.com/grafana/loki/releases/download/v2.9.3/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki

# Install Promtail (log collector)
wget https://github.com/grafana/loki/releases/download/v2.9.3/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo mv promtail-linux-amd64 /usr/local/bin/promtail

Create /etc/promtail/config.yml:

server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: noid-server
    static_configs:
      - targets:
          - localhost
        labels:
          job: noid-server
          __path__: /var/log/noid/server.log

  - job_name: noid-audit
    static_configs:
      - targets:
          - localhost
        labels:
          job: noid-audit
          __path__: /var/log/noid/audit.log
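
Because the Noid logs are JSON, Promtail can parse fields into labels with pipeline stages. A sketch to append under the noid-server job (keep label cardinality low — level is safe; high-cardinality fields like user generally are not):

```yaml
    pipeline_stages:
      - json:
          expressions:
            level: level
      - labels:
          level:
```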

Alerting

Prometheus Alerting Rules

Create /etc/prometheus/alerts.yml:

groups:
  - name: noid
    interval: 30s
    rules:
      # High VM count
      - alert: HighVMCount
        expr: noid_vms_total > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of VMs"
          description: "{{ $value }} VMs running (threshold: 80)"

      # Slow VM creation
      - alert: SlowVMCreation
        expr: histogram_quantile(0.95, rate(noid_vm_creation_duration_seconds_bucket[5m])) > 60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "VM creation is slow"
          description: "95th percentile: {{ $value }}s"

      # High memory usage
      - alert: HighMemoryUsage
        expr: sum(noid_memory_usage_bytes) / sum(node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage"
          description: "Memory usage: {{ $value | humanizePercentage }}"

      # High error rate
      - alert: HighErrorRate
        expr: rate(noid_api_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API error rate"
          description: "Error rate: {{ $value }} errors/sec"

      # Server down
      - alert: ServerDown
        expr: up{job="noid-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Noid server is down"
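
Prometheus only evaluates these rules if prometheus.yml references the file. Add the following to /etc/prometheus/prometheus.yml (the Alertmanager target assumes its default port, 9093):

```yaml
rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```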

Alert Manager

Configure /etc/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'

    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'Noid Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
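
The severity labels set by the alerting rules can drive routing. A sketch of a child route that re-pages critical alerts more aggressively (the pager receiver here is hypothetical — define it alongside default):

```yaml
route:
  group_by: ['alertname']
  receiver: 'default'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pager'
      repeat_interval: 15m
```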

Health Checks

Server Health Endpoint

# Check server health
curl -k https://localhost:7654/health

# Response
{
  "status": "healthy",
  "uptime_seconds": 3600,
  "vms": {
    "total": 15,
    "running": 12,
    "stopped": 3
  },
  "resources": {
    "memory_used_mb": 8192,
    "memory_available_mb": 24576,
    "disk_used_gb": 150,
    "disk_available_gb": 350
  }
}

Automated Health Checks

#!/bin/bash
# /usr/local/bin/noid-healthcheck.sh

HEALTH_URL="https://localhost:7654/health"

response=$(curl -sk -w "%{http_code}" -o /tmp/health.json "${HEALTH_URL}")

if [ "$response" != "200" ]; then
  echo "CRITICAL: Health check failed (HTTP $response)"
  # Send alert
  exit 2
fi

status=$(jq -r '.status' /tmp/health.json)
if [ "$status" != "healthy" ]; then
  echo "WARNING: Server status is $status"
  exit 1
fi

echo "OK: Server is healthy"
exit 0
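
A simple way to run the script on a schedule is a cron entry (every 5 minutes shown; the log path is an example):

```
# /etc/cron.d/noid-healthcheck
*/5 * * * * root /usr/local/bin/noid-healthcheck.sh >> /var/log/noid/healthcheck.log 2>&1
```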

Performance Monitoring

VM Performance Metrics

# CPU usage per VM
noid exec vm1 -- top -bn1 | head -n 3

# Memory usage
noid exec vm1 -- free -h

# Disk I/O
noid exec vm1 -- iostat -x 1 3

# Network throughput
noid exec vm1 -- iftop -t -s 10

Benchmark Tests

#!/bin/bash
# VM creation benchmark

echo "Testing VM creation speed..."
start=$(date +%s.%N)
noid create benchmark-vm
end=$(date +%s.%N)
duration=$(echo "$end - $start" | bc)

echo "VM created in ${duration}s"

# Checkpoint benchmark
start=$(date +%s.%N)
noid checkpoint benchmark-vm snap1
end=$(date +%s.%N)
duration=$(echo "$end - $start" | bc)

echo "Checkpoint created in ${duration}s"

# Cleanup
noid destroy benchmark-vm
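
The two timed sections above repeat the same pattern. A small helper (hypothetical, not part of the noid CLI) keeps the benchmarks consistent and uses awk for the arithmetic, so it works even where bc is missing:

```shell
#!/bin/bash
# time_op LABEL CMD [ARGS...] -- run CMD and print "LABEL: <seconds>s"
time_op() {
  local label="$1"; shift
  local start end
  start=$(date +%s.%N)
  "$@"
  end=$(date +%s.%N)
  awk -v s="$start" -v e="$end" -v l="$label" \
    'BEGIN { printf "%s: %.2fs\n", l, e - s }'
}

# Usage with the benchmark above:
#   time_op "vm-create"  noid create benchmark-vm
#   time_op "checkpoint" noid checkpoint benchmark-vm snap1
```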

Dashboard Examples

System Overview

  • Total VMs (gauge)
  • VMs by status (pie chart)
  • VM creation rate (time series)
  • API request rate (time series)

Performance

  • VM creation time (histogram)
  • Checkpoint creation time (histogram)
  • API latency (histogram)
  • Resource utilization (stacked area)

Errors & Issues

  • Error rate by endpoint (time series)
  • Failed operations (table)
  • Slow operations (table)
  • Recent errors (logs panel)

Next Steps