Monitoring
This guide covers metrics, dashboards, logging, alerting, and health checks for Noid production deployments.
Metrics Overview
Key Metrics to Track
| Metric | Type | Description |
|---|---|---|
| noid_vms_total | Gauge | Total number of VMs |
| noid_vms_running | Gauge | Number of running VMs |
| noid_vm_creation_duration_seconds | Histogram | VM creation time |
| noid_checkpoint_creation_duration_seconds | Histogram | Checkpoint creation time |
| noid_memory_usage_bytes | Gauge | Memory usage per VM |
| noid_cpu_usage_percent | Gauge | CPU usage per VM |
| noid_disk_usage_bytes | Gauge | Disk usage |
| noid_api_requests_total | Counter | API request count |
| noid_api_errors_total | Counter | API error count |
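To spot-check that these metrics are actually being exported, the sketch below pulls a single unlabelled sample out of Prometheus text exposition output. The function name `metric_value` is hypothetical, and the `PORT` in the usage comment is a placeholder: this page does not state where the Noid server exposes its metrics.

```shell
# metric_value NAME: read Prometheus text exposition format on stdin and
# print the value of the first sample whose metric name is exactly NAME.
# Samples that carry labels (e.g. name{vm="a"} 1) are not matched by this sketch.
# Returns non-zero if the metric is not present.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; found = 1; exit } END { exit !found }'
}

# Usage (PORT is a placeholder for the Noid metrics port):
#   curl -s http://localhost:PORT/metrics | metric_value noid_vms_total
```

A non-zero exit status makes the helper easy to use in scripts that should fail when an expected metric is missing.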
Prometheus Setup
1. Install Prometheus
# Download Prometheus
VERSION="2.49.0"
wget https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz
# Extract and install
tar xvf prometheus-${VERSION}.linux-amd64.tar.gz
sudo mv prometheus-${VERSION}.linux-amd64 /opt/prometheus
sudo ln -s /opt/prometheus/prometheus /usr/local/bin/
2. Configure Prometheus
Create /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Noid server metrics.
  # Note: 9090 is also Prometheus's own default listen port; set this
  # target to the address where the Noid server exposes its metrics.
  - job_name: 'noid-server'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  # VM metrics (per-VM exporter)
  - job_name: 'noid-vms'
    static_configs:
      - targets: ['localhost:9091']
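Before starting Prometheus it can help to validate the file. The sketch below wraps `promtool check config`, which ships in the same release tarball as the `prometheus` binary. The function name `check_prometheus_config` and the `PROMTOOL` override variable are hypothetical; the default path assumes the tarball was moved to /opt/prometheus as in step 1.

```shell
# check_prometheus_config [FILE]: validate a Prometheus config with promtool.
# Returns 1 if the file is unreadable, 127 if promtool is missing,
# otherwise promtool's own exit status.
check_prometheus_config() {
  local config="${1:-/etc/prometheus/prometheus.yml}"
  local promtool="${PROMTOOL:-/opt/prometheus/promtool}"
  if [ ! -r "$config" ]; then
    echo "config not readable: $config" >&2
    return 1
  fi
  if [ ! -x "$promtool" ]; then
    echo "promtool not found at $promtool" >&2
    return 127
  fi
  "$promtool" check config "$config"
}

# Usage:
#   check_prometheus_config /etc/prometheus/prometheus.yml
```

`promtool check config` also checks any rule files the config references, so the same call covers alerting rules once they are listed under `rule_files`.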
3. Systemd Service
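Create /etc/systemd/system/prometheus.service. The unit runs as the prometheus user, so that user must exist and be able to write to /var/lib/prometheus/data: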
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/data
Restart=always

[Install]
WantedBy=multi-user.target
sudo systemctl enable prometheus
sudo systemctl start prometheus
Grafana Setup
1. Install Grafana
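# Add the Grafana APT repository (apt-key is deprecated, so store the key in a keyring file)
sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
# Install
sudo apt-get update
sudo apt-get install -y grafana
# Start service
sudo systemctl enable grafana-server
sudo systemctl start grafana-server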
2. Add Prometheus Data Source
- Open Grafana at http://localhost:3000 (default login: admin/admin; Grafana asks for a new password on first login)
- Go to Configuration → Data Sources
- Add a Prometheus data source
- Set the URL to http://localhost:9090
- Click Save & Test
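The same data source can also be set up without the UI, using Grafana's file-based provisioning. The sketch below writes a provisioning file into a directory of your choice; the function name `write_prometheus_datasource` and the file name `prometheus.yaml` are hypothetical, and the target directory in the usage comment assumes the default layout of the Debian package.

```shell
# write_prometheus_datasource DIR [URL]: write a Grafana data source
# provisioning file that points at a Prometheus server.
write_prometheus_datasource() {
  local dir="$1"
  local url="${2:-http://localhost:9090}"
  mkdir -p "$dir" || return 1
  cat > "$dir/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${url}
    isDefault: true
EOF
}

# Usage (as root; Grafana reads provisioning files at startup):
#   write_prometheus_datasource /etc/grafana/provisioning/datasources
#   systemctl restart grafana-server
```

Provisioning keeps the data source definition in a file, which is easier to reproduce across hosts than manual clicks.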
3. Import Noid Dashboard
Create a custom dashboard or import a pre-built one, for example:
{
  "dashboard": {
    "title": "Noid Overview",
    "panels": [
      {
        "title": "Total VMs",
        "targets": [{
          "expr": "noid_vms_total"
        }]
      },
      {
        "title": "VM Creation Time",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(noid_vm_creation_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Memory Usage",
        "targets": [{
          "expr": "sum(noid_memory_usage_bytes) by (vm)"
        }]
      }
    ]
  }
}
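The "VM Creation Time" panel relies on `histogram_quantile`, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that idea in a simplified form for classic cumulative buckets; the function name is reused for illustration only, and the bucket boundaries in the usage comment are made up rather than taken from Noid.

```shell
# histogram_quantile Q: read "upper_bound cumulative_count" pairs on stdin,
# sorted by upper bound and ending with the +Inf bucket, and print an
# estimate of the Q-quantile (0 <= Q <= 1).
histogram_quantile() {
  awk -v q="$1" '
    { le[NR] = $1; cum[NR] = $2 }
    END {
      if (NR < 2 || cum[NR] == 0) exit 1      # need real buckets and samples
      rank = q * cum[NR]                      # position of the wanted sample
      for (i = 1; i <= NR; i++) if (cum[i] >= rank) break
      if (i == NR) { print le[NR - 1]; exit 0 }   # rank falls in the +Inf bucket
      lower = (i == 1) ? 0 : le[i - 1]        # lower edge of the chosen bucket
      below = (i == 1) ? 0 : cum[i - 1]       # samples in earlier buckets
      print lower + (le[i] - lower) * (rank - below) / (cum[i] - below)
    }'
}

# Usage with illustrative buckets (seconds, cumulative counts):
#   printf '1 10\n5 60\n10 90\n30 100\n+Inf 100\n' | histogram_quantile 0.95
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.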
Logging
Structured Logging
The server writes structured logs in JSON format:
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "message": "VM created successfully",
  "vm_name": "prod-api",
  "user": "alice",
  "duration_ms": 5234
}
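Structured logs like this can be filtered with `jq` directly on the host. The sketch below lists operations that took at least a given number of milliseconds; the function name `slow_operations` is hypothetical, and it assumes every log entry is a JSON object with the fields shown above.

```shell
# slow_operations MIN_MS: read JSON log entries on stdin and print, one per
# line, the timestamp, VM name, and duration of entries whose duration_ms is
# at least MIN_MS. Entries without duration_ms are skipped.
slow_operations() {
  jq -c --argjson min "$1" \
    'select(.duration_ms >= $min) | {timestamp, vm_name, duration_ms}'
}

# Usage:
#   slow_operations 5000 < /var/log/noid/server.log
```

The same `select(...)` pattern works for other fields, for example filtering on `.level` or `.user`.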
Log Aggregation
Using Loki:
# Install Loki
wget https://github.com/grafana/loki/releases/download/v2.9.3/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo mv loki-linux-amd64 /usr/local/bin/loki
# Install Promtail (log collector)
wget https://github.com/grafana/loki/releases/download/v2.9.3/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo mv promtail-linux-amd64 /usr/local/bin/promtail
Create the Promtail config at /etc/promtail/config.yml:
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: noid-server
    static_configs:
      - targets:
          - localhost
        labels:
          job: noid-server
          __path__: /var/log/noid/server.log

  - job_name: noid-audit
    static_configs:
      - targets:
          - localhost
        labels:
          job: noid-audit
          __path__: /var/log/noid/audit.log
Alerting
Prometheus Alerting Rules
Create /etc/prometheus/alerts.yml and list it under rule_files in /etc/prometheus/prometheus.yml so that Prometheus loads the rules:
groups:
  - name: noid
    interval: 30s
    rules:
      # High VM count
      - alert: HighVMCount
        expr: noid_vms_total > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High number of VMs"
          description: "{{ $value }} VMs in total (threshold: 80)"

      # Slow VM creation
      - alert: SlowVMCreation
        expr: histogram_quantile(0.95, rate(noid_vm_creation_duration_seconds_bucket[5m])) > 60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "VM creation is slow"
          description: "95th percentile: {{ $value }}s"

      # High memory usage
      # Both sides are aggregated with sum() so their label sets match;
      # dividing by the raw node metric would produce no result.
      - alert: HighMemoryUsage
        expr: sum(noid_memory_usage_bytes) / sum(node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage"
          description: "Memory usage: {{ $value | humanizePercentage }}"

      # High error rate
      - alert: HighErrorRate
        expr: rate(noid_api_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API error rate"
          description: "Error rate: {{ $value }} errors/sec"

      # Server down
      - alert: ServerDown
        expr: up{job="noid-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Noid server is down"
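To get a feel for the HighErrorRate threshold: a rate of 0.1 errors per second over a 5-minute window corresponds to 30 errors in 300 seconds. The sketch below does that arithmetic for two counter samples. The function name `counter_rate` is hypothetical, and this is only a rough approximation of PromQL's `rate()`, which additionally extrapolates to the window edges.

```shell
# counter_rate FIRST LAST SECONDS: per-second increase of a counter between
# two samples taken SECONDS apart.
counter_rate() {
  awk -v first="$1" -v last="$2" -v secs="$3" 'BEGIN {
    if (secs <= 0) exit 1
    # A counter that went down was reset (e.g. by a server restart); count
    # the new value as the whole increase, a rough stand-in for reset handling.
    delta = (last >= first) ? last - first : last
    print delta / secs
  }'
}

# Usage: 45 new errors in 5 minutes is above the 0.1 errors/sec threshold
#   counter_rate 1200 1245 300
```

Working backwards from the expected error volume in this way is a simple check that a threshold is neither too noisy nor too lax.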
Alertmanager
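Configure /etc/alertmanager/alertmanager.yml, and add the Alertmanager address under alerting.alertmanagers in /etc/prometheus/prometheus.yml so that Prometheus forwards firing alerts to it: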
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'Noid Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
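One way to confirm that the receivers are wired up is to post a hand-made alert to Alertmanager's v2 API and check that it arrives by email and in Slack. The sketch below only builds the JSON payload; the function name `test_alert_payload` and the alert name are hypothetical, and the usage comment assumes Alertmanager is listening on its default port 9093.

```shell
# test_alert_payload [NAME]: print a one-element JSON array describing a
# test alert, in the shape accepted by POST /api/v2/alerts.
test_alert_payload() {
  local name="${1:-NoidTestAlert}"
  printf '[{"labels":{"alertname":"%s","severity":"warning"},"annotations":{"description":"Test alert sent by hand"}}]\n' "$name"
}

# Usage:
#   test_alert_payload | curl -s -X POST -H 'Content-Type: application/json' \
#     -d @- http://localhost:9093/api/v2/alerts
```

The `description` annotation is set because the Slack template above renders that field.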
Health Checks
Server Health Endpoint
# Check server health
curl -k https://localhost:7654/health

# Response
{
  "status": "healthy",
  "uptime_seconds": 3600,
  "vms": {
    "total": 15,
    "running": 12,
    "stopped": 3
  },
  "resources": {
    "memory_used_mb": 8192,
    "memory_available_mb": 24576,
    "disk_used_gb": 150,
    "disk_available_gb": 350
  }
}
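The `resources` block can be turned into utilization percentages for quick threshold checks. The sketch below does this with `jq`; the function name `health_usage` is hypothetical, and it assumes that used plus available equals the total for both memory and disk, which this page does not state.

```shell
# health_usage: read a /health response on stdin and print memory and disk
# utilization as whole percentages (rounded down).
health_usage() {
  jq -r '.resources
    | "memory=\(.memory_used_mb * 100 / (.memory_used_mb + .memory_available_mb) | floor)% disk=\(.disk_used_gb * 100 / (.disk_used_gb + .disk_available_gb) | floor)%"'
}

# Usage:
#   curl -sk https://localhost:7654/health | health_usage
```

For the sample response above this works out to 8192 / 32768 for memory and 150 / 500 for disk.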
Automated Health Checks
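#!/bin/bash
# /usr/local/bin/noid-healthcheck.sh
# Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL
HEALTH_URL="https://localhost:7654/health"

# Use a private temp file instead of a fixed path in /tmp
body=$(mktemp)
trap 'rm -f "$body"' EXIT

# -k skips TLS certificate verification; --max-time keeps the check from hanging
response=$(curl -sk --max-time 10 -w "%{http_code}" -o "$body" "${HEALTH_URL}")
if [ "$response" != "200" ]; then
  echo "CRITICAL: Health check failed (HTTP $response)"
  # Send alert
  exit 2
fi

status=$(jq -r '.status' "$body")
if [ "$status" != "healthy" ]; then
  echo "WARNING: Server status is $status"
  exit 1
fi

echo "OK: Server is healthy"
exit 0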
Performance Monitoring
VM Performance Metrics
# CPU usage per VM
noid exec vm1 -- top -bn1 | head -n 3
# Memory usage
noid exec vm1 -- free -h
# Disk I/O
noid exec vm1 -- iostat -x 1 3
# Network throughput
noid exec vm1 -- iftop -t -s 10
Benchmark Tests
#!/bin/bash
# VM creation benchmark
echo "Testing VM creation speed..."
start=$(date +%s.%N)
noid create benchmark-vm
end=$(date +%s.%N)
duration=$(echo "$end - $start" | bc)
echo "VM created in ${duration}s"
# Checkpoint benchmark
start=$(date +%s.%N)
noid checkpoint benchmark-vm snap1
end=$(date +%s.%N)
duration=$(echo "$end - $start" | bc)
echo "Checkpoint created in ${duration}s"
# Cleanup
noid destroy benchmark-vm
Dashboard Examples
System Overview
- Total VMs (gauge)
- VMs by status (pie chart)
- VM creation rate (time series)
- API request rate (time series)
Performance
- VM creation time (histogram)
- Checkpoint creation time (histogram)
- API latency (histogram)
- Resource utilization (stacked area)
Errors & Issues
- Error rate by endpoint (time series)
- Failed operations (table)
- Slow operations (table)
- Recent errors (logs panel)
Next Steps
- Production Setup - Deploy Noid
- Troubleshooting - Common issues
- Architecture - System architecture