Docs that actually help

Written by engineers, for engineers. No filler, no marketing speak.

User Manual

Complete guide to every page in Pointer APM — dashboard, topology, traces, logs, incidents, alerts, healing, custom dashboards, profiles, change intelligence, capacity forecasting, SLOs, synthetic monitoring, and more.

Dashboard

The Dashboard is your home page showing a high-level overview of the monitored environment.

Total Services — distinct services sending telemetry
Active Incidents — current open/investigating incidents
Ingestion Rate — current telemetry data volume (spans/sec)
Error Rate — aggregate error rate across all services

Charts include Ingestion Rate Over Time (area chart) and Error Rate Distribution (bar chart). Data refreshes automatically, and a time range selector (Last 15m, 1h, 24h) sets the window shown.

Topology Map

Live service dependency graph. Nodes represent services; edges represent dependencies. Edge width shows throughput, edge color shows error rate.

Green — Healthy  | Yellow — Degraded  | Red — Critical  | Gray — Unknown

Click any node for details: overview, metrics, traces, logs, and dependencies. Switch to 3D mode for large architectures (200+ nodes, powered by Three.js). Use the Time Travel slider to view historical topology state.

Trace Explorer

Search and analyze distributed traces. Filter by service, operation, duration, status, and time range.

Click a Trace ID to open the waterfall view — each row is a span, horizontal bars show timing, nested bars show parent-child relationships. Click any span to inspect attributes, resource attributes, events, and timing.
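As a rough illustration of how a waterfall can be laid out from span data, the sketch below walks spans depth-first and computes each row's nesting depth and offset from the start of the trace. The Span fields and the waterfall_rows helper are hypothetical names chosen for this example, not Pointer's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span shape, for illustration only.
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    duration_ms: float


def waterfall_rows(spans):
    """Return (depth, offset_ms, span) rows in depth-first, start-time order."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    for group in children.values():
        group.sort(key=lambda s: s.start_ms)

    roots = children.get(None, [])
    if not roots:
        return []
    trace_start = min(s.start_ms for s in roots)

    rows = []

    def visit(span, depth):
        # Depth drives the indent; offset drives where the bar starts.
        rows.append((depth, span.start_ms - trace_start, span))
        for child in children.get(span.span_id, []):
            visit(child, depth + 1)

    for root in roots:
        visit(root, 0)
    return rows
```

Each row maps to one line in the waterfall: depth gives the nesting, offset_ms and duration_ms give the bar position and length.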

Log Explorer

Full-text search across all ingested logs. Filter by service, severity (TRACE, DEBUG, INFO, WARN, ERROR, FATAL), keyword, and time range.

Expand any log to see the full body, attributes, and a clickable Trace ID to jump to the waterfall. Toggle Live Tail for real-time SSE streaming — ideal during deployments or incident investigation.
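If you want to consume a Live Tail style stream outside the UI, the sketch below shows how Server-Sent Events framing can be parsed from raw text lines. It covers only the generic SSE wire format (data lines, comment lines, blank-line terminators); the stream endpoint and payload shape are not specified here, so parse_sse is an illustrative helper rather than a Pointer client.

```python
def parse_sse(lines):
    """Yield the data payload of each event from an iterable of SSE text lines."""
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield "\n".join(data)
                data = []
        elif line.startswith(":"):
            # Comment line, often used as a keep-alive.
            continue
        elif line.startswith("data:"):
            value = line[5:]
            # The SSE format strips one optional leading space.
            if value.startswith(" "):
                value = value[1:]
            data.append(value)
```

Multiple data lines within one event are joined with newlines, which matches how SSE clients normally reassemble multi-line payloads.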

Incident Management

Incidents are created either automatically, when the AI engine detects correlated anomalies, or manually.

OPEN | ACKNOWLEDGED | INVESTIGATING | RESOLVED | CLOSED

Each incident has a timeline, AI-generated RCA with confidence scores and evidence links, an impact radius view (mini topology + affected services), and action buttons for acknowledge, assign, status transitions, and healing triggers. Severity is P1–P5.
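To make the status transitions concrete, here is a minimal state-machine sketch. The five status names come from this page; the ALLOWED transition table is an assumption made for illustration, and the transitions Pointer actually permits may differ.

```python
from enum import Enum


class IncidentStatus(Enum):
    OPEN = "OPEN"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    INVESTIGATING = "INVESTIGATING"
    RESOLVED = "RESOLVED"
    CLOSED = "CLOSED"


S = IncidentStatus

# Assumed transition table, not taken from the product.
ALLOWED = {
    S.OPEN: {S.ACKNOWLEDGED, S.INVESTIGATING, S.RESOLVED},
    S.ACKNOWLEDGED: {S.INVESTIGATING, S.RESOLVED},
    S.INVESTIGATING: {S.RESOLVED},
    S.RESOLVED: {S.CLOSED, S.INVESTIGATING},  # allow reopening
    S.CLOSED: set(),
}


def transition(current, target):
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```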

Alert Management

View active alerts (CRITICAL, WARNING, INFO) and manage PromQL-based alert rules. Rules are evaluated every 15 seconds against VictoriaMetrics. Duplicate alerts are deduplicated automatically, and the AI engine may correlate alerts into incidents.
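One common way to deduplicate alerts is to fingerprint the rule name together with its sorted labels, so the same condition firing twice collapses into one entry. Whether Pointer does it this way is not stated here; the alert dictionary shape and both helper names below are hypothetical.

```python
import hashlib


def fingerprint(rule_name, labels):
    """Stable identity for an alert: rule name plus labels in sorted order."""
    canonical = rule_name + "|" + ",".join(
        f"{key}={value}" for key, value in sorted(labels.items())
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(alerts):
    """Collapse alerts with the same fingerprint, counting occurrences."""
    seen = {}
    for alert in alerts:
        key = fingerprint(alert["rule"], alert["labels"])
        if key in seen:
            seen[key]["count"] += 1
        else:
            seen[key] = {**alert, "count": 1}
    return list(seen.values())
```

Sorting the labels first means label order does not affect identity.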

Self-Healing

Automatic remediation for known issues. Actions include restarting pods, scaling deployments, and clearing queues.

PENDING | APPROVED | EXECUTING | SUCCEEDED | FAILED | ROLLED_BACK

Healing policies define triggers, actions, and one of three modes: Manual (requires approval), Suggested (one-click approve), or Auto-approved (executes automatically). Each policy also has a configurable scope, cooldown, and active time windows.
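The interaction between mode and cooldown can be sketched as follows. This is an assumed model: the HealingPolicy fields, the lowercase mode strings, and the choice to return None while the cooldown is active are all illustrative, with only the PENDING and APPROVED status names taken from this page.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class HealingPolicy:
    mode: str  # assumed values: "manual", "suggested", "auto"
    cooldown: timedelta
    last_run: Optional[datetime] = None


def next_status(policy, now):
    """Return the initial status for a triggered action, or None if suppressed."""
    in_cooldown = (
        policy.last_run is not None and now - policy.last_run < policy.cooldown
    )
    if in_cooldown:
        return None  # assumption: triggers inside the cooldown are dropped
    if policy.mode == "auto":
        return "APPROVED"  # runs without a human in the loop
    if policy.mode in ("manual", "suggested"):
        return "PENDING"  # waits for an approval
    raise ValueError(f"unknown mode: {policy.mode}")
```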

Custom Dashboards

Drag-and-drop bento grid layout. PromQL-powered panels: line charts, bar charts, stats, tables, and gauges. Panels snap to a grid, auto-save, and support time range selectors, auto-refresh, full screen, and sharing.

Profiles (Flame Graphs)

View CPU and memory profiles as flame graphs. Width represents time in a function, vertical stacking shows the call stack. Hover for exact timing, click to zoom. Comparison view for side-by-side analysis.
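For intuition on how frame width is derived, the sketch below builds a flame graph tree from folded stacks (semicolon-separated frames with a sample count). The folded-stack input and the dictionary node shape are assumptions for this example; the profile format Pointer ingests is not described here.

```python
def build_flame_tree(folded):
    """Build a tree from {'main;handler;query': sample_count} style input."""
    root = {"name": "root", "value": 0, "children": {}}
    for stack, count in folded.items():
        root["value"] += count
        node = root
        for frame in stack.split(";"):
            child = node["children"].setdefault(
                frame, {"name": frame, "value": 0, "children": {}}
            )
            # Every frame on the stack was on-CPU for these samples.
            child["value"] += count
            node = child
    return root


def width_fraction(node, root):
    """Fraction of the total width a frame occupies in the flame graph."""
    return node["value"] / root["value"]
```

A frame's width is its share of all samples, and its children always fit inside it, which is why stacks narrow as they go deeper.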

Change Intelligence

Track deployments, config changes, and rollbacks on a timeline. Changes near incidents are automatically correlated and flagged.
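A simple form of this correlation is a time-window match: flag every change that landed shortly before an incident started. The 30-minute default window, the change dictionary shape, and the changes_near name are assumptions for illustration; the actual correlation logic may weigh other signals.

```python
from datetime import datetime, timedelta


def changes_near(incident_start, changes, window=timedelta(minutes=30)):
    """Return changes made within `window` before the incident, newest first."""
    hits = [
        change
        for change in changes
        if timedelta(0) <= incident_start - change["at"] <= window
    ]
    return sorted(hits, key=lambda change: change["at"], reverse=True)
```

Changes made after the incident started are excluded, since they cannot have caused it.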

Capacity Forecasting

Prophet-powered forecasts for CPU, memory, and disk by service. Historical data (solid), projected trend (dashed), confidence interval (shaded). Predictive alerts fire when exhaustion is projected within 7 days.
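To show what "exhaustion projected within 7 days" means numerically, here is a deliberately simplified sketch. It uses a plain least-squares line, not Prophet, and assumes one usage sample per day expressed as a percentage; days_until_exhaustion is a hypothetical helper.

```python
def days_until_exhaustion(daily_usage_pct, limit_pct=100.0):
    """Extrapolate a linear trend and return days until usage hits the limit.

    Returns None when there is too little data or usage is not growing.
    """
    n = len(daily_usage_pct)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage_pct) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage_pct)
    ) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (limit_pct - intercept) / slope
    # Days remaining, counted from the most recent sample.
    return max(0.0, crossing_day - (n - 1))
```

A predictive alert in this simplified model would fire when the returned value is not None and is at most 7.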

SLO & Error Budgets

Define service level objectives with target thresholds and rolling evaluation windows.

Configure SLOs per service with target percentages and evaluation periods. The engine continuously evaluates compliance using a rolling window and calculates error budgets and burn rates across 1h, 6h, 24h, and 30d windows.
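The arithmetic behind error budgets and burn rates follows the standard SLO definitions, sketched below. The function names and the good/bad request counting are illustrative; how Pointer counts events for an SLO is not specified here.

```python
def error_budget_remaining(target, total, bad):
    """Fraction of the error budget left (1.0 = untouched, below 0 = overspent).

    target is the SLO as a fraction, for example 0.999 for 99.9%.
    """
    if not 0 < target < 1:
        raise ValueError("target must be between 0 and 1, exclusive")
    allowed_bad = (1 - target) * total
    if allowed_bad == 0:
        return 1.0
    return 1 - bad / allowed_bad


def burn_rate(target, total, bad):
    """Observed error rate divided by the error rate the SLO allows.

    A burn rate of 1.0 spends the budget exactly over the SLO period.
    """
    if not 0 < target < 1:
        raise ValueError("target must be between 0 and 1, exclusive")
    if total == 0:
        return 0.0
    return (bad / total) / (1 - target)
```

For example, with a 99.9% target, 250 failures out of 1,000,000 requests uses a quarter of the budget, and 50 failures out of 10,000 requests in a short window is a burn rate of about 5.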

The SLO dashboard includes compliance bars, burn rate charts, and a heatmap view showing SLO health across all services at a glance. Evaluation history is stored in ClickHouse for trend analysis.

Latency Percentile Analytics

Deep latency analysis with P50, P75, P90, P95, P99, and P999 breakdowns. Per-operation percentile overlays and time-series charts for identifying latency trends and outliers across services.
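As a reference for what these percentiles mean, the sketch below computes them with the nearest-rank method over raw samples. This is one of several percentile definitions, and production systems typically use approximations over large datasets, so treat it as an illustration of the concept rather than the query Pointer runs.

```python
import math

PERCENTILES = [
    ("P50", 50), ("P75", 75), ("P90", 90),
    ("P95", 95), ("P99", 99), ("P999", 99.9),
]


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list of values."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_summary(latencies_ms):
    """Return the percentile breakdown for a list of latency samples."""
    ordered = sorted(latencies_ms)
    return {label: percentile(ordered, p) for label, p in PERCENTILES}
```

With few samples the high percentiles collapse onto the maximum, which is why P99 and P999 are only meaningful at sufficient request volume.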

Synthetic Monitoring

Schedule HTTP checks from multiple locations to proactively detect outages.

Create monitors with target URLs, check intervals, assertion rules, and multi-location execution. The scheduler runs checks periodically and evaluates assertions against response status, body, and latency. Failed checks trigger alerts automatically.
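Assertion evaluation against status, body, and latency might look like the sketch below. The assertion type names (status_equals, body_contains, latency_below_ms) and the CheckResult shape are invented for this example and are not Pointer's rule syntax.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    status: int
    body: str
    latency_ms: float


def evaluate(result, assertions):
    """Return a list of failure messages; an empty list means the check passed."""
    failures = []
    for rule in assertions:
        kind, expected = rule["type"], rule["value"]
        if kind == "status_equals":
            ok = result.status == expected
        elif kind == "body_contains":
            ok = expected in result.body
        elif kind == "latency_below_ms":
            ok = result.latency_ms < expected
        else:
            raise ValueError(f"unknown assertion type: {kind}")
        if not ok:
            failures.append(f"{kind} {expected!r} failed")
    return failures
```

Collecting every failure, instead of stopping at the first, gives the resulting alert enough context to show what went wrong.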

Cost Observability

Allocate infrastructure costs per service with the cost allocation engine. Detect spending anomalies, visualize cost trends over time, and understand which services drive the most infrastructure spend.

Business KPI Correlation

Ingest custom business metrics via the KPI API and correlate them with technical performance data. A correlation scoring engine identifies relationships between business outcomes and system behavior, with revenue impact estimation per incident.

Reliability Score

A weighted health score combining latency, error rates, SLO compliance, incident frequency, and deployment risk. The reliability scorer runs on a schedule, producing trend charts so you can track service health improvements over time.
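A weighted score of this kind reduces to a weighted average of normalized sub-scores. The weights below are made up for illustration, and the sketch assumes each component has already been converted to a 0 to 100 scale where higher is healthier; the real weights and normalization are not documented on this page.

```python
# Hypothetical weights; they only need to be positive.
WEIGHTS = {
    "latency": 0.25,
    "error_rate": 0.25,
    "slo_compliance": 0.20,
    "incident_frequency": 0.15,
    "deployment_risk": 0.15,
}


def reliability_score(components, weights=WEIGHTS):
    """Weighted average of 0-100 sub-scores, where higher means healthier."""
    missing = set(weights) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(components[name] * w for name, w in weights.items()) / total_weight
```

Dividing by the total weight keeps the result on the 0 to 100 scale even if the weights do not sum to exactly 1.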

Custom Log Monitors

Define pattern-based log monitors that watch for specific log patterns across services. Configure match patterns, alerting thresholds, and evaluation windows. When thresholds are breached, notifications are triggered via configured channels.
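The core of a pattern-based monitor is a count of matching lines inside an evaluation window, compared against a threshold. The sketch below assumes logs arrive as (timestamp, message) pairs and that the match pattern is a regular expression; both are assumptions, as is the monitor_breached name.

```python
import re
from datetime import datetime, timedelta


def monitor_breached(logs, pattern, threshold, window, now):
    """Return True when at least `threshold` messages match within the window."""
    regex = re.compile(pattern)
    cutoff = now - window
    matches = sum(
        1
        for timestamp, message in logs
        if cutoff <= timestamp <= now and regex.search(message)
    )
    return matches >= threshold
```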

Service Catalog

Full service registry with health status, dependency mapping, alert rule associations, and team ownership per service. Provides a single pane of glass for understanding your service landscape, with quick links to traces, logs, incidents, and SLOs for each service.

Ticket Integration

Bi-directional integration with Jira and ServiceNow. Create tickets directly from incidents, sync status changes between Pointer and your ticketing system, and maintain a linked audit trail across both platforms.

Audit Logging

Tamper-proof hash-chain audit trail recording all user actions — logins, configuration changes, incident updates, healing approvals, and policy modifications. Supports SOC 2 and ISO 27001 compliance requirements with exportable audit reports.
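For readers unfamiliar with hash chains, the sketch below shows the general technique: each entry stores the hash of the previous entry, so editing any record invalidates every hash after it. The entry layout, the SHA-256 choice, and the all-zero genesis value are assumptions for illustration, not Pointer's storage format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash, event):
    """Hash the previous hash together with a canonical form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain, event):
    """Add an event to the chain, linking it to the previous entry."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"event": event, "prev_hash": prev, "hash": entry_hash(prev, event)}
    )


def verify(chain):
    """Recompute every hash; return False if any entry was altered or reordered."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

Verification detects modification after the fact; it does not by itself prevent someone with write access from rebuilding the whole chain, which is why such logs are usually anchored or exported externally.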

ML Anomaly Detection & Baselines

Multi-model anomaly detection using scikit-learn, Prophet, and PyTorch LSTM across all signal types. Automatic baseline generation computes normal behavior patterns for all metrics, enabling deviation detection without manual threshold tuning. Anomaly-based alerts fire proactively when metrics deviate from established baselines.
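The idea of a baseline with deviation detection can be shown with a much simpler stand-in than the models named above. The sketch below uses a z-score against the historical mean and standard deviation; it is not the scikit-learn, Prophet, or LSTM pipeline, and the threshold of 3 standard deviations is an arbitrary illustrative default.

```python
from statistics import mean, stdev


def is_anomalous(history, value, threshold=3.0):
    """Flag a value that deviates from the baseline by more than `threshold` sigmas."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return value != baseline
    return abs(value - baseline) / spread > threshold
```

The baseline adapts to each metric's own variability, which is what removes the need for a hand-tuned static threshold per metric.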

Predictive Analytics

PyTorch LSTM neural network models for time-series prediction across CPU, memory, disk, and custom metrics. Proactive alerting on predicted anomalies before they impact users. Combined with capacity forecasting for comprehensive forward-looking observability.

GraphQL API

Full GraphQL API alongside the REST API for flexible, efficient querying. It supports complex nested queries and subscriptions for real-time updates, which makes it well suited to building custom integrations, automations, and tooling.

Real-time WebSocket Push

Live push updates via WebSocket for dashboards, alerts, incidents, and topology changes. Eliminates polling for real-time situational awareness during incidents and normal operations.

Retention Policies

Configure data retention periods for traces, logs, and metrics via Settings → Retention. The processor runs a scheduled cleanup job to enforce retention limits automatically.

Command Palette

Press ⌘K to search everything — services, incidents, traces, alert rules, dashboards. Keyboard-first navigation.

Settings & Administration

Users & roles (Admin, Operator, Viewer, Auditor + custom), teams with data isolation, auth providers (LDAP, SAML 2.0, OAuth2/OIDC, Microsoft Entra ID), API keys, license management with key validation and usage metering, notification channels (M365 Email, Microsoft Teams), email notification recipients, retention policies, Jira/ServiceNow ticket integration config, and a tamper-proof hash-chain audit log.