Docs that actually help
Written by engineers, for engineers. No filler, no marketing speak.
User Manual
Complete guide to every page in Pointer APM — dashboard, topology, traces, logs, incidents, alerts, healing, custom dashboards, profiles, change intelligence, capacity forecasting, SLOs, synthetic monitoring, and more.
Dashboard
The Dashboard is your home page showing a high-level overview of the monitored environment.
Charts include Ingestion Rate Over Time (area chart) and Error Rate Distribution (bar chart). Data auto-refreshes, and a time range selector (Last 15m, 1h, 24h) controls the window shown.
Topology Map
Live service dependency graph. Nodes represent services; edges represent dependencies. Edge width shows throughput, edge color shows error rate.
Green — Healthy | Yellow — Degraded | Red — Critical | Gray — Unknown
Click any node for details: overview, metrics, traces, logs, and dependencies. Switch to 3D mode for large architectures (200+ nodes, powered by Three.js). Use the Time Travel slider to view historical topology state.
Trace Explorer
Search and analyze distributed traces. Filter by service, operation, duration, status, and time range.
Click a Trace ID to open the waterfall view — each row is a span, horizontal bars show timing, nested bars show parent-child relationships. Click any span to inspect attributes, resource attributes, events, and timing.
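To make the waterfall layout concrete, here is a minimal sketch of how spans could be ordered into rows with nesting depth. The `Span` fields and the `waterfall_rows` helper are illustrative assumptions, not Pointer's internal data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical span shape for illustration only.
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    duration_ms: float


def waterfall_rows(spans):
    """Return (depth, span) rows: parents before children, siblings by start time."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    for group in children.values():
        group.sort(key=lambda s: s.start_ms)

    rows = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            rows.append((depth, span))
            walk(span.span_id, depth + 1)

    walk(None, 0)  # root spans have no parent
    return rows


spans = [
    Span("b", "a", "db.query", 5, 10),
    Span("a", None, "GET /orders", 0, 30),
    Span("c", "a", "cache.get", 2, 1),
]
for depth, span in waterfall_rows(spans):
    print("  " * depth + f"{span.name} [{span.start_ms}ms +{span.duration_ms}ms]")
```

Each row's indentation corresponds to the nesting of bars in the waterfall; the start and duration values determine where the horizontal bar sits.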
Log Explorer
Full-text search across all ingested logs. Filter by service, severity (TRACE, DEBUG, INFO, WARN, ERROR, FATAL), keyword, and time range.
Expand any log to see the full body, attributes, and a clickable Trace ID to jump to the waterfall. Toggle Live Tail for real-time SSE streaming — ideal during deployments or incident investigation.
Incident Management
Incidents are created automatically by the AI engine when it detects correlated anomalies, or manually by a user.
Each incident has a timeline, AI-generated RCA with confidence scores and evidence links, an impact radius view (mini topology + affected services), and action buttons for acknowledge, assign, status transitions, and healing triggers. Severity is P1–P5.
Alert Management
View active alerts (CRITICAL, WARNING, INFO) and manage PromQL-based alert rules. Rules are evaluated every 15 seconds against VictoriaMetrics. Duplicate alerts are auto-deduplicated and the AI engine may correlate alerts into incidents.
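Deduplication is commonly done by fingerprinting an alert on its rule name and label set. The sketch below shows that general idea; the fingerprint scheme and field names are assumptions, since the manual does not describe how Pointer identifies duplicates.

```python
import hashlib


def fingerprint(rule_name, labels):
    """Stable identity for an alert: rule name plus sorted labels (assumed scheme)."""
    canonical = rule_name + "|" + ",".join(
        f"{key}={value}" for key, value in sorted(labels.items())
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(alerts):
    """Collapse alerts with the same fingerprint, keeping a repeat count."""
    unique = {}
    for alert in alerts:
        fp = fingerprint(alert["rule"], alert["labels"])
        if fp in unique:
            unique[fp]["count"] += 1
        else:
            unique[fp] = {**alert, "count": 1}
    return list(unique.values())


alerts = [
    {"rule": "HighErrorRate", "labels": {"service": "checkout", "env": "prod"}},
    {"rule": "HighErrorRate", "labels": {"env": "prod", "service": "checkout"}},
    {"rule": "HighErrorRate", "labels": {"service": "cart", "env": "prod"}},
]
print(deduplicate(alerts))
```

Sorting the labels before hashing means label order does not affect the fingerprint, so the first two alerts above collapse into one entry.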
Self-Healing
Automatic remediation for known issues. Actions include restart pods, scale deployments, and clear queues.
Healing policies define triggers, actions, and modes: Manual (requires approval), Suggested (one-click approve), or Auto-approved (executes automatically). Each policy also has a configurable scope, cooldown, and active time windows.
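The interaction between mode, cooldown, and active time window can be sketched as a small decision function. The `Policy` fields, the hour-based window, and the returned strings are hypothetical; they only illustrate the order in which such checks might be applied.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Mode(Enum):
    MANUAL = "manual"
    SUGGESTED = "suggested"
    AUTO_APPROVED = "auto_approved"


@dataclass
class Policy:
    # Hypothetical policy shape for illustration.
    mode: Mode
    cooldown_s: int
    active_hours: Tuple[int, int]  # [start, end) hour of day


def decide(policy: Policy, hour: int, seconds_since_last_run: Optional[int]) -> str:
    """Return what should happen when the policy's trigger fires."""
    start, end = policy.active_hours
    if not (start <= hour < end):
        return "skip: outside active window"
    if seconds_since_last_run is not None and seconds_since_last_run < policy.cooldown_s:
        return "skip: cooldown"
    if policy.mode is Mode.AUTO_APPROVED:
        return "execute"
    if policy.mode is Mode.SUGGESTED:
        return "suggest: one-click approve"
    return "wait: approval required"


policy = Policy(Mode.AUTO_APPROVED, cooldown_s=600, active_hours=(8, 20))
print(decide(policy, hour=10, seconds_since_last_run=None))
```

Checking the window and cooldown before the mode keeps an Auto-approved policy from firing repeatedly or outside its allowed hours.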
Custom Dashboards
Drag-and-drop bento grid layout. PromQL-powered panels: line charts, bar charts, stats, tables, and gauges. Panels snap to a grid, auto-save, and support time range selectors, auto-refresh, full screen, and sharing.
Profiles (Flame Graphs)
View CPU and memory profiles as flame graphs. Width represents time in a function, vertical stacking shows the call stack. Hover for exact timing, click to zoom. Comparison view for side-by-side analysis.
Change Intelligence
Track deployments, config changes, and rollbacks on a timeline. Changes near incidents are automatically correlated and flagged.
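One simple way to correlate changes with an incident is to look for changes that landed shortly before it started. The sketch below assumes a 30-minute lookback and a hypothetical change record shape; the manual does not specify the window or logic Pointer uses.

```python
from datetime import datetime, timedelta


def changes_near_incident(changes, incident_start, lookback=timedelta(minutes=30)):
    """Return changes within `lookback` before the incident start, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c["at"] <= incident_start]
    return sorted(hits, key=lambda c: c["at"], reverse=True)


incident_start = datetime(2024, 1, 1, 12, 0)
changes = [
    {"kind": "deployment", "service": "checkout", "at": datetime(2024, 1, 1, 11, 50)},
    {"kind": "config", "service": "cart", "at": datetime(2024, 1, 1, 9, 0)},
    {"kind": "rollback", "service": "checkout", "at": datetime(2024, 1, 1, 11, 40)},
]
for change in changes_near_incident(changes, incident_start):
    print(change["kind"], change["service"], change["at"].isoformat())
```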
Capacity Forecasting
Prophet-powered forecasts for CPU, memory, and disk by service. Historical data (solid), projected trend (dashed), confidence interval (shaded). Predictive alerts fire when exhaustion is projected within 7 days.
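The 7-day predictive alert can be illustrated with a much simpler model than Prophet. The sketch below swaps in a plain least-squares line over daily usage samples; it is a stand-in to show the "projected exhaustion within a horizon" check, not the forecasting method the product uses.

```python
def days_until_exhaustion(daily_usage, capacity):
    """Estimate days until usage reaches capacity using a linear trend.

    Returns None when the trend is flat or decreasing.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var  # usage units gained per day
    if slope <= 0:
        return None
    # Project forward from the last observed value at the fitted growth rate.
    return max((capacity - daily_usage[-1]) / slope, 0.0)


def should_alert(daily_usage, capacity, horizon_days=7):
    days = days_until_exhaustion(daily_usage, capacity)
    return days is not None and days <= horizon_days


print(should_alert([80, 84, 88, 92], capacity=100))
print(should_alert([50, 52, 54, 56], capacity=100))
```

A real forecast would also account for seasonality and uncertainty, which is what the shaded confidence interval in the chart represents.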
SLO & Error Budgets
Define service level objectives with target thresholds and rolling evaluation windows.
Configure SLOs per service with target percentages and evaluation periods. The engine continuously evaluates compliance using a rolling window and calculates error budgets and burn rates across 1h, 6h, 24h, and 30d windows.
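As a worked example of the arithmetic, the error budget is the allowed failure fraction (1 minus the target), and the burn rate for a window is the observed error rate divided by that budget. The function name and input shape below are illustrative assumptions.

```python
def burn_rates(target, windows):
    """Compute burn rate per window.

    target  -- SLO target as a fraction, e.g. 0.999
    windows -- mapping of window name to (total_requests, bad_requests)
    """
    if not 0 < target < 1:
        raise ValueError("target must be between 0 and 1")
    budget = 1.0 - target  # allowed fraction of bad requests
    rates = {}
    for name, (total, bad) in windows.items():
        error_rate = bad / total if total else 0.0
        rates[name] = error_rate / budget
    return rates


rates = burn_rates(0.999, {"1h": (10_000, 20), "30d": (1_000_000, 500)})
for window, rate in rates.items():
    print(f"{window}: burn rate {rate:.2f}")
```

A burn rate of 1.0 means the budget is being spent exactly as fast as the SLO allows; above 1.0 it will run out before the evaluation period ends if the rate is sustained.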
The SLO dashboard includes compliance bars, burn rate charts, and a heatmap view showing SLO health across all services at a glance. Evaluation history is stored in ClickHouse for trend analysis.
Latency Percentile Analytics
Deep latency analysis with P50, P75, P90, P95, P99, and P999 breakdowns. Per-operation percentile overlays and time-series charts for identifying latency trends and outliers across services.
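For readers unfamiliar with the labels, P50 is the median and P999 is the 99.9th percentile. The sketch below uses the nearest-rank method on raw samples; the manual does not say which estimation method Pointer applies, so treat this as one common definition.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(math.ceil(p * len(ordered) / 100), 1)
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # 1..100 ms, illustrative data
for label, p in [("P50", 50), ("P75", 75), ("P90", 90), ("P95", 95), ("P99", 99), ("P999", 99.9)]:
    print(label, percentile(latencies_ms, p))
```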
Synthetic Monitoring
Schedule HTTP checks from multiple locations to proactively detect outages.
Create monitors with target URLs, check intervals, assertion rules, and multi-location execution. The scheduler runs checks periodically and evaluates assertions against response status, body, and latency. Failed checks trigger alerts automatically.
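Assertion evaluation against status, body, and latency can be sketched as below. The assertion type names and the response dictionary are hypothetical; Pointer's actual rule syntax is not shown in this manual.

```python
def evaluate_check(response, assertions):
    """Return a list of failure messages; an empty list means the check passed."""
    failures = []
    for assertion in assertions:
        kind, expected = assertion["type"], assertion["value"]
        if kind == "status_equals":
            ok = response["status"] == expected
        elif kind == "body_contains":
            ok = expected in response["body"]
        elif kind == "latency_below_ms":
            ok = response["latency_ms"] < expected
        else:
            raise ValueError(f"unknown assertion type: {kind}")
        if not ok:
            failures.append(f"{kind} {expected!r} failed")
    return failures


response = {"status": 200, "body": '{"status": "ok"}', "latency_ms": 840}
assertions = [
    {"type": "status_equals", "value": 200},
    {"type": "body_contains", "value": "ok"},
    {"type": "latency_below_ms", "value": 500},
]
print(evaluate_check(response, assertions))
```

In this example the status and body assertions pass but the latency assertion fails, so the check as a whole would count as failed.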
Cost Observability
Allocate infrastructure costs per service with the cost allocation engine. Detect spending anomalies, visualize cost trends over time, and understand which services drive the most infrastructure spend.
Business KPI Correlation
Ingest custom business metrics via the KPI API and correlate them with technical performance data. A correlation scoring engine identifies relationships between business outcomes and system behavior, with revenue impact estimation per incident.
Reliability Score
A weighted health score combining latency, error rates, SLO compliance, incident frequency, and deployment risk. The reliability scorer runs on a schedule, producing trend charts so you can track service health improvements over time.
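A weighted score of this kind is a weighted average of per-factor scores. The weights below are made up for illustration, and each component is assumed to be pre-normalized to 0 through 100; the actual weights and normalization are not documented here.

```python
# Hypothetical weights; they do not reflect Pointer's real configuration.
WEIGHTS = {
    "latency": 0.25,
    "error_rate": 0.25,
    "slo_compliance": 0.20,
    "incident_frequency": 0.15,
    "deployment_risk": 0.15,
}


def reliability_score(components, weights=WEIGHTS):
    """Weighted average of component scores, each expected in the range 0..100."""
    if set(components) != set(weights):
        raise ValueError("components must match the weighted factors")
    total_weight = sum(weights.values())
    return sum(components[name] * weights[name] for name in weights) / total_weight


score = reliability_score({
    "latency": 90,
    "error_rate": 80,
    "slo_compliance": 100,
    "incident_frequency": 70,
    "deployment_risk": 60,
})
print(round(score, 1))
```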
Custom Log Monitors
Define pattern-based log monitors that watch for specific log patterns across services. Configure match patterns, alerting thresholds, and evaluation windows. When thresholds are breached, notifications are triggered via configured channels.
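The pattern, threshold, and evaluation window fit together roughly as follows. This sketch assumes regex patterns and logs given as (timestamp, message) pairs; both are illustrative choices rather than the product's documented format.

```python
import re


def threshold_breached(logs, pattern, threshold, window_s, now):
    """True when at least `threshold` matching logs fall inside the last `window_s` seconds."""
    regex = re.compile(pattern)
    hits = sum(
        1
        for timestamp, message in logs
        if now - window_s <= timestamp <= now and regex.search(message)
    )
    return hits >= threshold


logs = [
    (100, "payment failed: card declined"),
    (130, "payment failed: timeout"),
    (150, "order created"),
    (170, "payment failed: timeout"),
]
print(threshold_breached(logs, r"payment failed", threshold=3, window_s=120, now=180))
print(threshold_breached(logs, r"payment failed", threshold=3, window_s=60, now=180))
```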
Service Catalog
Full service registry with health status, dependency mapping, alert rule associations, and team ownership per service. Provides one place to understand your service landscape, with quick links to traces, logs, incidents, and SLOs for each service.
Ticket Integration
Bi-directional integration with Jira and ServiceNow. Create tickets directly from incidents, sync status changes between Pointer and your ticketing system, and maintain a linked audit trail across both platforms.
Audit Logging
Tamper-proof hash-chain audit trail recording all user actions — logins, configuration changes, incident updates, healing approvals, and policy modifications. Supports SOC 2 and ISO 27001 compliance requirements with exportable audit reports.
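A hash chain links each record to the one before it, so altering any earlier record invalidates every hash that follows. Here is a minimal sketch of the general technique using SHA-256; the record layout and genesis value are assumptions, not Pointer's storage format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first record


def _digest(prev_hash, event):
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(chain, event):
    """Append an event whose hash covers both its content and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify_chain(chain):
    """Recompute every hash; any edit to an earlier event breaks the chain."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev"] != prev_hash or record["hash"] != _digest(prev_hash, record["event"]):
            return False
        prev_hash = record["hash"]
    return True


chain = []
append_event(chain, {"user": "alice", "action": "login"})
append_event(chain, {"user": "alice", "action": "approve_healing", "policy": "restart-pods"})
print(verify_chain(chain))

chain[0]["event"]["user"] = "mallory"  # simulate tampering
print(verify_chain(chain))
```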
ML Anomaly Detection & Baselines
Multi-model anomaly detection using scikit-learn, Prophet, and PyTorch LSTM across all signal types. Automatic baseline generation computes normal behavior patterns for all metrics, enabling deviation detection without manual threshold tuning. Anomaly-based alerts fire proactively when metrics deviate from established baselines.
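The idea of a baseline replacing a manual threshold can be shown with a simple z-score check: learn the mean and spread from history, then flag values that fall too far outside it. This is a deliberately basic stand-in, not the scikit-learn, Prophet, or LSTM models named above, and the threshold of 3 is an assumption.

```python
from statistics import mean, stdev


def is_anomalous(history, value, z_threshold=3.0):
    """Flag a value that deviates from the historical baseline by more than z_threshold sigmas."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != baseline
    return abs(value - baseline) / spread > z_threshold


history = [10, 11, 9, 10, 10, 11, 9, 10]  # illustrative metric samples
print(is_anomalous(history, 20))
print(is_anomalous(history, 10.5))
```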
Predictive Analytics
PyTorch LSTM neural network models for time-series prediction across CPU, memory, disk, and custom metrics. Proactive alerting on predicted anomalies before they impact users. Use it together with capacity forecasting for a forward-looking view of your environment.
GraphQL API
Full GraphQL API alongside the REST API for flexible, efficient querying. It supports complex nested queries and subscriptions for real-time updates, which makes it well suited to building custom integrations, automations, and tooling.
Real-time WebSocket Push
Live push updates via WebSocket for dashboards, alerts, incidents, and topology changes. Eliminates polling for real-time situational awareness during incidents and normal operations.
Retention Policies
Configure data retention periods for traces, logs, and metrics via Settings → Retention. The processor runs a scheduled cleanup job to enforce retention limits automatically.
Command Palette
Press ⌘K to search everything — services, incidents, traces, alert rules, dashboards. Keyboard-first navigation.
Settings & Administration
Manage users and roles (Admin, Operator, Viewer, Auditor, plus custom roles), teams with data isolation, auth providers (LDAP, SAML 2.0, OAuth2/OIDC, Microsoft Entra ID), API keys, license management with key validation and usage metering, notification channels (M365 Email, Microsoft Teams), email notification recipients, retention policies, Jira/ServiceNow ticket integration config, and the tamper-proof hash-chain audit log.