
TimeSafari Daily Notification Plugin - Observability Dashboards

Author: Matthew Raymer
Version: 1.0.0
Created: 2025-10-08 06:08:15 UTC

Overview

This document provides sample dashboards, queries, alerting rules, and monitoring configurations for the TimeSafari Daily Notification Plugin. Adapt them to your monitoring system (Grafana, Datadog, New Relic, etc.) to track plugin health and performance.

Key Metrics

Core Performance Metrics

  • Fetch Success Rate: Percentage of successful content fetches
  • Notification Delivery Rate: Percentage of notifications successfully delivered
  • Callback Success Rate: Percentage of successful callback executions
  • Average Fetch Time: Mean time for content fetching operations
  • Average Notification Time: Mean time for notification delivery
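To make the definitions above concrete, here is a minimal sketch of how a success percentage and a mean duration could be derived from raw operation records. The `Operation` record and its fields are assumptions made for this illustration; they are not part of the plugin's API.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class Operation:
    """Hypothetical record of one fetch or notification attempt."""
    succeeded: bool
    duration_seconds: float


def success_rate_percent(ops: Sequence[Operation]) -> Optional[float]:
    """Percentage of successful operations, or None when there is no data."""
    if not ops:
        return None
    return 100.0 * sum(op.succeeded for op in ops) / len(ops)


def average_duration_seconds(ops: Sequence[Operation]) -> Optional[float]:
    """Mean duration across all operations, or None when there is no data."""
    if not ops:
        return None
    return mean(op.duration_seconds for op in ops)
```

Returning `None` for an empty window keeps "no data" distinct from "0% success", which matters when a dashboard panel or alert consumes the value.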

User Interaction Metrics

  • User Opt-out Rate: Percentage of users who opt out of notifications
  • Permission Grant Rate: Percentage of users who grant notification permissions
  • Permission Denial Rate: Percentage of users who deny notification permissions

Platform-Specific Metrics

  • Android WorkManager Starts: Number of Android background task starts
  • iOS Background Task Starts: Number of iOS background task starts
  • Electron Notifications: Number of Electron desktop notifications
  • Platform Error Rate: Percentage of platform-specific errors

Sample Queries

Grafana Queries

1. Notification Delivery Success Rate

# Success rate over last 24 hours
(
  sum(rate(dnp_notifications_success_total[24h])) /
  sum(rate(dnp_notifications_total[24h]))
) * 100
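The ratio above divides two per-second rates taken over the same window, so the time component cancels and the result is approximately the ratio of the two counter increases. The sketch below shows that arithmetic on two counter samples; it is a simplification that ignores counter resets and the extrapolation Prometheus applies inside `rate()`, and the function name is hypothetical.

```python
from typing import Optional


def success_percent_from_counters(
    success_start: float,
    success_end: float,
    total_start: float,
    total_end: float,
) -> Optional[float]:
    """Ratio of counter increases over one window, as a percentage.

    Returns None when the total counter did not increase, which mirrors
    the "no data" result of dividing by zero in the query.
    """
    total_increase = total_end - total_start
    if total_increase <= 0:
        return None
    success_increase = success_end - success_start
    return 100.0 * success_increase / total_increase
```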

2. Average Fetch Time

# Average fetch time over last hour
avg_over_time(dnp_fetch_duration_seconds[1h])

3. User Opt-out Rate

# Opt-out rate over last 7 days
(
  sum(rate(dnp_user_opt_outs_total[7d])) /
  sum(rate(dnp_user_interactions_total[7d]))
) * 100

4. Platform Error Rate

# Platform error rate over last hour
(
  sum(rate(dnp_platform_errors_total[1h])) /
  sum(rate(dnp_platform_events_total[1h]))
) * 100

Datadog Queries

1. Health Status Dashboard

# Notification health score
100 - (
  (sum:dnp.notifications.failed{*}.as_rate() / 
   sum:dnp.notifications.total{*}.as_rate()) * 100
)

2. Fetch Performance

# Fetch performance trend
avg:dnp.fetch.duration{*}.rollup(avg, 300)

3. User Engagement

# User engagement rate
(sum:dnp.user.opt_ins{*}.as_rate() / 
 sum:dnp.user.interactions{*}.as_rate()) * 100

Sample Dashboard Configurations

1. Overview Dashboard

Purpose: High-level plugin health and performance overview

Panels:

  • Notification Success Rate (Gauge): Current success rate percentage
  • Active Schedules (Stat): Number of active notification schedules
  • Recent Errors (Logs): Last 10 error events
  • Performance Trends (Time Series): Fetch and notification times over time
  • User Metrics (Bar Chart): Opt-ins vs opt-outs over last 7 days

2. Platform-Specific Dashboard

Purpose: Monitor platform-specific performance and issues

Panels:

  • Android WorkManager Status (Stat): Active background tasks
  • iOS Background Task Success (Gauge): Success rate for iOS tasks
  • Electron Notification Count (Counter): Desktop notifications sent
  • Platform Error Breakdown (Pie Chart): Errors by platform
  • Platform Performance (Time Series): Performance by platform

3. User Engagement Dashboard

Purpose: Track user interaction and engagement metrics

Panels:

  • Permission Grant Rate (Gauge): Current permission grant rate
  • Opt-out Trends (Time Series): Opt-out rate over time
  • User Interaction Heatmap (Heatmap): User actions by time of day
  • Engagement Funnel (Funnel): Permission → Opt-in → Active usage

Alerting Rules

Critical Alerts

1. Notification Delivery Failure

alert: NotificationDeliveryFailure
expr: dnp_notifications_success_rate < 0.95
for: 5m
labels:
  severity: critical
annotations:
  summary: "Notification delivery success rate below 95%"
  description: "Notification success rate is {{ $value | humanizePercentage }} for the last 5 minutes"
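The `for: 5m` clause means the alert fires only after the expression has stayed true for five minutes; until then it is pending. The following sketch models that behaviour for a "value below threshold" rule. It is a simplified illustration, not Prometheus's implementation, and the function name and sample layout are assumptions.

```python
from typing import Iterable, Optional, Tuple


def alert_state(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> str:
    """Return "inactive", "pending" or "firing" after the last sample.

    samples: (timestamp_seconds, value) pairs ordered by time, one per
    rule evaluation. The condition is `value < threshold`.
    """
    breached_since: Optional[float] = None
    state = "inactive"
    for timestamp, value in samples:
        if value < threshold:
            if breached_since is None:
                breached_since = timestamp  # first evaluation in breach
            held_for = timestamp - breached_since
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            breached_since = None  # any recovery resets the timer
            state = "inactive"
    return state
```

A single healthy evaluation resets the timer, which is why short dips below the threshold do not page anyone.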

2. High Error Rate

alert: HighErrorRate
expr: rate(dnp_errors_total[5m]) > 0.1
for: 2m
labels:
  severity: warning
annotations:
  summary: "High error rate detected"
  description: "Error rate is {{ $value }} errors/second"

3. Platform Errors

alert: PlatformErrors
expr: rate(dnp_platform_errors_total[5m]) > 0.05
for: 3m
labels:
  severity: warning
annotations:
  summary: "Platform-specific errors detected"
  description: "Platform error rate is {{ $value }} errors/second"

Warning Alerts

1. Performance Degradation

alert: PerformanceDegradation
expr: avg_over_time(dnp_fetch_duration_seconds[10m]) > 5
for: 5m
labels:
  severity: warning
annotations:
  summary: "Fetch performance degraded"
  description: "Average fetch time is {{ $value }} seconds"

2. High Opt-out Rate

alert: HighOptOutRate
expr: rate(dnp_user_opt_outs_total[1h]) > 0.1
for: 10m
labels:
  severity: warning
annotations:
  summary: "High user opt-out rate"
  description: "Opt-out rate is {{ $value }} opt-outs/second"

SLO Definitions

Service Level Objectives

1. Notification Delivery SLO

  • Target: 99.5% success rate
  • Measurement: Successful notifications / Total notifications
  • Time Window: 30 days
  • Error Budget: 0.5%

2. Performance SLO

  • Target: 95% of fetches complete within 3 seconds
  • Measurement: Fetch duration percentiles
  • Time Window: 7 days
  • Error Budget: 5%

3. Availability SLO

  • Target: 99.9% uptime
  • Measurement: Plugin health endpoint availability
  • Time Window: 30 days
  • Error Budget: 0.1%
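An error budget is easier to reason about once it is converted into concrete units. As a worked example, the sketch below turns the targets above into allowed downtime and allowed failures; the function names and the traffic figure used in the usage note are illustrative assumptions.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of downtime permitted by a time-based SLO over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def allowed_failures(slo_target: float, total_events: int) -> float:
    """Failed events permitted by an event-based SLO for a given volume."""
    return (1.0 - slo_target) * total_events


def budget_remaining_fraction(
    slo_target: float, total_events: int, failed_events: int
) -> float:
    """Share of the error budget still unspent (negative once exceeded)."""
    budget = allowed_failures(slo_target, total_events)
    if budget == 0:
        return 0.0
    return 1.0 - failed_events / budget
```

For the 99.9% availability target over 30 days, `allowed_downtime_minutes(0.999, 30)` comes to roughly 43.2 minutes. For the 99.5% delivery target, a hypothetical volume of 1,000,000 notifications allows about 5,000 failures.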

Log Analysis

Structured Log Patterns

1. Error Analysis

# Show the first 20 failure events in the log, one event per line
grep "DNP-.*-FAILURE" /var/log/timesafari/daily-notification.log | \
  jq -r '[.timestamp, .eventCode, .message] | @tsv' | \
  head -20

2. Performance Analysis

# Find fetch events with a duration above 5000 (assumes .duration is in milliseconds)
grep "DNP-FETCH-START\|DNP-FETCH-SUCCESS" /var/log/timesafari/daily-notification.log | \
  jq -r 'select(.duration > 5000) | .timestamp, .duration, .message'

3. User Behavior Analysis

# Count user and permission events per event code and user
grep "DNP-USER-\|DNP-PERMISSION-" /var/log/timesafari/daily-notification.log | \
  jq -r '[.eventCode, .data.userId] | @tsv' | \
  sort | uniq -c
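When the shell pipelines above become hard to maintain, the same analysis can be done in a short script. This sketch assumes the log is written as one JSON object per line with the `eventCode` and `duration` fields used above, and that `duration` is in milliseconds; the helper names are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def parse_events(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per valid JSON object line, skipping anything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines such as stack traces
        if isinstance(event, dict):
            yield event


def count_by_event_code(lines: Iterable[str], prefix: str = "DNP-") -> Counter:
    """Count events per eventCode, restricted to codes with the given prefix."""
    return Counter(
        event["eventCode"]
        for event in parse_events(lines)
        if str(event.get("eventCode", "")).startswith(prefix)
    )


def slow_events(lines: Iterable[str], threshold_ms: float = 5000) -> List[Dict]:
    """Return events whose numeric duration exceeds the threshold."""
    return [
        event
        for event in parse_events(lines)
        if isinstance(event.get("duration"), (int, float))
        and event["duration"] > threshold_ms
    ]
```

Typical usage would be `count_by_event_code(open(path, encoding="utf-8"))`, with `path` pointing at the log file used in the commands above.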

Monitoring Best Practices

1. Log Retention

  • Structured Logs: Retain for 30 days
  • Error Logs: Retain for 90 days
  • Performance Logs: Retain for 7 days
  • User Interaction Logs: Retain for 1 year (with privacy compliance)

2. Metric Collection

  • High-frequency metrics: Collect every 30 seconds
  • Medium-frequency metrics: Collect every 5 minutes
  • Low-frequency metrics: Collect every 1 hour
  • User metrics: Collect on-demand

3. Alert Tuning

  • Start with conservative thresholds
  • Adjust based on historical data
  • Use different severity levels
  • Prevent alert fatigue by grouping related alerts and removing ones nobody acts on
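One way to adjust a threshold based on historical data is to take a high percentile of past values and add some headroom. The sketch below uses the nearest-rank percentile method; the function names, the percentile, and the headroom factor are illustrative assumptions, not recommended values.

```python
import math
from typing import Sequence


def nearest_rank_percentile(values: Sequence[float], percentile: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in the range 1..100")
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_threshold(
    history: Sequence[float], percentile: int = 95, headroom: float = 1.2
) -> float:
    """Historical percentile scaled by a headroom factor."""
    return nearest_rank_percentile(history, percentile) * headroom
```

A suggestion produced this way is a starting point for review, since a history that already contains incidents will push the threshold too high.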

4. Dashboard Design

  • Keep dashboards focused and actionable
  • Use consistent color schemes
  • Include context and annotations
  • Review and update dashboards regularly

Integration Examples

Grafana Dashboard JSON

{
  "dashboard": {
    "title": "TimeSafari Daily Notification Plugin",
    "panels": [
      {
        "title": "Notification Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "(sum(rate(dnp_notifications_success_total[24h])) / sum(rate(dnp_notifications_total[24h]))) * 100"
          }
        ]
      }
    ]
  }
}

Prometheus Recording Rules

groups:
  - name: timesafari_daily_notification
    rules:
      - record: dnp:notification_success_rate
        expr: (sum(rate(dnp_notifications_success_total[5m])) / sum(rate(dnp_notifications_total[5m]))) * 100
      
      - record: dnp:fetch_duration_avg
        expr: avg_over_time(dnp_fetch_duration_seconds[5m])
      
      - record: dnp:user_opt_out_rate
        expr: (sum(rate(dnp_user_opt_outs_total[1h])) / sum(rate(dnp_user_interactions_total[1h]))) * 100

Troubleshooting Guide

Common Issues and Queries

1. High Error Rate

# Check recent errors (-G with --data-urlencode stops curl from treating [5m] as a URL glob)
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=rate(dnp_errors_total[5m])' | jq .

2. Performance Issues

# Check fetch performance (the query is URL-encoded so the brackets reach Prometheus intact)
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=avg_over_time(dnp_fetch_duration_seconds[10m])' | jq .

3. User Engagement Issues

# Check user metrics (the query is URL-encoded so the brackets reach Prometheus intact)
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=rate(dnp_user_opt_outs_total[1h])' | jq .

Privacy and Compliance

Data Retention

  • User interaction logs: 1 year maximum
  • Performance metrics: 90 days maximum
  • Error logs: 30 days maximum
  • Personal data: Redacted or anonymized
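One common way to anonymize personal data in logs is to replace user identifiers with a keyed hash before the event is written. The sketch below does this for the `data.userId` field seen in the log queries; it is an illustration under that assumption, not the plugin's actual redaction logic, and the function names are hypothetical.

```python
import hashlib
import hmac
from typing import Any, Dict


def pseudonymize_user_id(user_id: str, secret_key: bytes) -> str:
    """Keyed SHA-256 digest of the user ID, truncated to 16 hex characters."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def redact_event(event: Dict[str, Any], secret_key: bytes) -> Dict[str, Any]:
    """Return a copy of the event with data.userId replaced by a pseudonym."""
    redacted = dict(event)
    data = event.get("data")
    if isinstance(data, dict) and "userId" in data:
        redacted_data = dict(data)  # copy so the caller's event is untouched
        redacted_data["userId"] = pseudonymize_user_id(
            str(data["userId"]), secret_key
        )
        redacted["data"] = redacted_data
    return redacted
```

Using a keyed hash keeps the same user countable across events while preventing anyone without the key from confirming a guessed ID. Rotating or deleting the key breaks that link.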

GDPR Compliance

  • User consent: Tracked and logged
  • Data portability: Export capabilities
  • Right to deletion: Automated cleanup
  • Privacy by design: Built into observability system

Note: These dashboards and queries should be customized based on your specific monitoring infrastructure and requirements. Regular review and updates are recommended to ensure they remain relevant and actionable.