
# TimeSafari Daily Notification Plugin - Observability Dashboards

**Author**: Matthew Raymer
**Version**: 1.0.0
**Created**: 2025-10-08 06:08:15 UTC

## Overview

This document provides sample dashboards, queries, and monitoring configurations for the TimeSafari Daily Notification Plugin. Adapt them to your monitoring system (Grafana, Datadog, New Relic, etc.) to track plugin health and performance.

## Key Metrics

### Core Performance Metrics
- **Fetch Success Rate**: Percentage of successful content fetches
- **Notification Delivery Rate**: Percentage of notifications successfully delivered
- **Callback Success Rate**: Percentage of successful callback executions
- **Average Fetch Time**: Mean time for content fetching operations
- **Average Notification Time**: Mean time for notification delivery

### User Interaction Metrics
- **User Opt-out Rate**: Percentage of users who opt out of notifications
- **Permission Grant Rate**: Percentage of users who grant notification permissions
- **Permission Denial Rate**: Percentage of users who deny notification permissions

### Platform-Specific Metrics
- **Android WorkManager Starts**: Number of Android background task starts
- **iOS Background Task Starts**: Number of iOS background task starts
- **Electron Notifications**: Number of Electron desktop notifications
- **Platform Error Rate**: Percentage of platform-specific errors
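
Most of the metrics above are ratios of two counters. The following Python sketch shows one way to derive them while guarding against an empty denominator. The counter names in `snapshot` are hypothetical stand-ins, not identifiers confirmed by the plugin.

```python
from typing import Dict, Optional


def percentage(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator/denominator as a percentage, or None when there is no data."""
    if denominator <= 0:
        return None  # no events yet: report "no data" instead of dividing by zero
    return 100.0 * numerator / denominator


def core_rates(snapshot: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Derive rate metrics from a snapshot of raw counters (hypothetical names)."""
    return {
        "fetch_success_rate": percentage(
            snapshot.get("fetch_success", 0), snapshot.get("fetch_total", 0)
        ),
        "notification_delivery_rate": percentage(
            snapshot.get("notifications_success", 0),
            snapshot.get("notifications_total", 0),
        ),
        "opt_out_rate": percentage(
            snapshot.get("user_opt_outs", 0), snapshot.get("user_interactions", 0)
        ),
    }


# Illustrative values only
rates = core_rates(
    {
        "fetch_success": 197,
        "fetch_total": 200,
        "notifications_success": 50,
        "notifications_total": 50,
    }
)
```

Returning `None` for an empty denominator keeps "no traffic" distinguishable from "0% success" on a dashboard.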

## Sample Queries

### Grafana Queries (PromQL)

#### 1. Notification Delivery Success Rate
```promql
# Success rate over last 24 hours
(
  sum(rate(dnp_notifications_success_total[24h])) /
  sum(rate(dnp_notifications_total[24h]))
) * 100
```

#### 2. Average Fetch Time
```promql
# Average fetch time over last hour (assumes dnp_fetch_duration_seconds is a gauge)
avg_over_time(dnp_fetch_duration_seconds[1h])
```

#### 3. User Opt-out Rate
```promql
# Opt-out rate over last 7 days
(
  sum(rate(dnp_user_opt_outs_total[7d])) /
  sum(rate(dnp_user_interactions_total[7d]))
) * 100
```

#### 4. Platform Error Rate
```promql
# Platform error rate over last hour
(
  sum(rate(dnp_platform_errors_total[1h])) /
  sum(rate(dnp_platform_events_total[1h]))
) * 100
```

### Datadog Queries

#### 1. Health Status Dashboard
```datadog
# Notification health score
100 - (
  (sum:dnp.notifications.failed{*}.as_rate() /
   sum:dnp.notifications.total{*}.as_rate()) * 100
)
```

#### 2. Performance Trends
```datadog
# Fetch performance trend
avg:dnp.fetch.duration{*}.rollup(avg, 300)
```

#### 3. User Engagement
```datadog
# User engagement rate
(sum:dnp.user.opt_ins{*}.as_rate() /
 sum:dnp.user.interactions{*}.as_rate()) * 100
```

## Sample Dashboard Configurations

### 1. Overview Dashboard

**Purpose**: High-level plugin health and performance overview

**Panels**:
- **Notification Success Rate** (Gauge): Current success rate percentage
- **Active Schedules** (Stat): Number of active notification schedules
- **Recent Errors** (Logs): Last 10 error events
- **Performance Trends** (Time Series): Fetch and notification times over time
- **User Metrics** (Bar Chart): Opt-ins vs opt-outs over last 7 days

### 2. Platform-Specific Dashboard

**Purpose**: Monitor platform-specific performance and issues

**Panels**:
- **Android WorkManager Status** (Stat): Active background tasks
- **iOS Background Task Success** (Gauge): Success rate for iOS tasks
- **Electron Notification Count** (Counter): Desktop notifications sent
- **Platform Error Breakdown** (Pie Chart): Errors by platform
- **Platform Performance** (Time Series): Performance by platform

### 3. User Engagement Dashboard

**Purpose**: Track user interaction and engagement metrics

**Panels**:
- **Permission Grant Rate** (Gauge): Current permission grant rate
- **Opt-out Trends** (Time Series): Opt-out rate over time
- **User Interaction Heatmap** (Heatmap): User actions by time of day
- **Engagement Funnel** (Funnel): Permission → Opt-in → Active usage

## Alerting Rules

### Critical Alerts

#### 1. Notification Delivery Failure
```yaml
alert: NotificationDeliveryFailure
expr: dnp:notification_success_rate < 95
for: 5m
labels:
  severity: critical
annotations:
  summary: "Notification delivery success rate below 95%"
  description: "Notification success rate is {{ $value }}% for the last 5 minutes"
```

#### 2. High Error Rate
```yaml
alert: HighErrorRate
expr: rate(dnp_errors_total[5m]) > 0.1
for: 2m
labels:
  severity: critical
annotations:
  summary: "High error rate detected"
  description: "Error rate is {{ $value }} errors/second"
```

#### 3. Platform Errors
```yaml
alert: PlatformErrors
expr: rate(dnp_platform_errors_total[5m]) > 0.05
for: 3m
labels:
  severity: critical
annotations:
  summary: "Platform-specific errors detected"
  description: "Platform error rate is {{ $value }} errors/second"
```

### Warning Alerts

#### 1. Performance Degradation
```yaml
alert: PerformanceDegradation
expr: avg_over_time(dnp_fetch_duration_seconds[10m]) > 5
for: 5m
labels:
  severity: warning
annotations:
  summary: "Fetch performance degraded"
  description: "Average fetch time is {{ $value }} seconds"
```

#### 2. High Opt-out Rate
```yaml
alert: HighOptOutRate
expr: rate(dnp_user_opt_outs_total[1h]) > 0.1
for: 10m
labels:
  severity: warning
annotations:
  summary: "High user opt-out rate"
  description: "Opt-out rate is {{ $value }} opt-outs/second"
```
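
Each rule above combines a threshold (`expr`) with a hold time (`for`). The Python sketch below is a simplified model of that behavior, not Prometheus's actual implementation: the condition must stay true on every evaluation for the whole hold time before the alert moves from pending to firing. The function name and the sample data are illustrative.

```python
from typing import Iterable, Optional, Tuple


def firing_at(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    hold_seconds: float,
) -> Optional[float]:
    """Return the timestamp at which the alert would start firing, or None.

    samples: (timestamp_seconds, value) pairs in chronological order,
    one per rule evaluation.
    """
    pending_since: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            if timestamp - pending_since >= hold_seconds:
                return timestamp  # held long enough: pending -> firing
        else:
            pending_since = None  # any dip below the threshold resets the timer
    return None


# HighErrorRate-style check: value > 0.1 held for 2 minutes, evaluated every 60 s
steady = [(0, 0.2), (60, 0.2), (120, 0.2)]
flapping = [(0, 0.2), (60, 0.05), (120, 0.2), (180, 0.2)]
```

With these inputs, `steady` fires at the third evaluation, while `flapping` never holds long enough to fire. This is why a longer `for` value suppresses short spikes.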

## SLO Definitions

### Service Level Objectives

#### 1. Notification Delivery SLO
- **Target**: 99.5% success rate
- **Measurement**: Successful notifications / Total notifications
- **Time Window**: 30 days
- **Error Budget**: 0.5%

#### 2. Performance SLO
- **Target**: 95% of fetches complete within 3 seconds
- **Measurement**: Fetch duration percentiles
- **Time Window**: 7 days
- **Error Budget**: 5%

#### 3. Availability SLO
- **Target**: 99.9% uptime
- **Measurement**: Plugin health endpoint availability
- **Time Window**: 30 days
- **Error Budget**: 0.1%
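
To make the error budgets concrete, the following Python sketch converts an SLO target into allowed downtime or allowed failures, and reports how much budget is left. The function names are illustrative and the event counts in the usage lines are made up.

```python
def error_budget_fraction(target_percent: float) -> float:
    """Fraction of events (or time) allowed to fail under the SLO target."""
    if not 0 < target_percent < 100:
        raise ValueError("target_percent must be between 0 and 100 (exclusive)")
    return (100.0 - target_percent) / 100.0


def allowed_downtime_minutes(target_percent: float, window_days: int) -> float:
    """Minutes of downtime the budget permits over the time window."""
    return error_budget_fraction(target_percent) * window_days * 24 * 60


def allowed_failures(target_percent: float, total_events: int) -> float:
    """Number of failed events the budget permits out of total_events."""
    return error_budget_fraction(target_percent) * total_events


def budget_remaining(target_percent: float, total_events: int, failed_events: int) -> float:
    """Share of the error budget left: 1.0 is untouched, below 0 is overspent."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return 1.0 - failed_events / allowed_failures(target_percent, total_events)


# Availability SLO: 99.9% over 30 days
downtime = allowed_downtime_minutes(99.9, 30)

# Delivery SLO: 99.5% with 10,000 notifications sent and 20 failures so far
remaining = budget_remaining(99.5, 10_000, 20)
```

For the Availability SLO this works out to roughly 43.2 minutes of downtime per 30 days. For the Delivery SLO, 10,000 notifications allow about 50 failures, so 20 failures leave about 60% of the budget.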

## Log Analysis

### Structured Log Patterns

#### 1. Error Analysis
```bash
# Show the first 20 failure events, one per line (assumes one JSON object per log line)
grep "DNP-.*-FAILURE" /var/log/timesafari/daily-notification.log | \
  jq -r '[.timestamp, .eventCode, .message] | @tsv' | \
  head -20
```

#### 2. Performance Analysis
```bash
# Find fetch events whose logged duration exceeds 5000 (assumed to be milliseconds)
grep "DNP-FETCH-START\|DNP-FETCH-SUCCESS" /var/log/timesafari/daily-notification.log | \
  jq -r 'select(.duration > 5000) | [.timestamp, .duration, .message] | @tsv'
```

#### 3. User Behavior Analysis
```bash
# Count user interaction events per event code and user
grep "DNP-USER-\|DNP-PERMISSION-" /var/log/timesafari/daily-notification.log | \
  jq -r '[.eventCode, .data.userId] | @tsv' | \
  sort | uniq -c
```
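
When the shell pipelines above become unwieldy, the same analysis can be scripted. This is a minimal Python sketch that assumes each log line is a single JSON object with an `eventCode` field, as the `jq` filters above imply. The sample event codes other than `DNP-FETCH-SUCCESS` are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_event_codes(lines: Iterable[str], prefix: str = "DNP-") -> Counter:
    """Count structured log events by eventCode, skipping non-JSON lines."""
    counts: Counter = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray plain-text lines in the log
        if not isinstance(event, dict):
            continue
        code = event.get("eventCode")
        if isinstance(code, str) and code.startswith(prefix):
            counts[code] += 1
    return counts


# Illustrative log lines; real field contents may differ
sample_lines = [
    '{"timestamp": "2025-10-08T06:00:00Z", "eventCode": "DNP-FETCH-SUCCESS", "duration": 120}',
    '{"timestamp": "2025-10-08T06:05:00Z", "eventCode": "DNP-FETCH-FAILURE", "message": "timeout"}',
    "not json at all",
    '{"timestamp": "2025-10-08T06:10:00Z", "eventCode": "DNP-FETCH-FAILURE", "message": "timeout"}',
]
counts = count_event_codes(sample_lines)
```

To run it against the real log, pass an open file object instead of `sample_lines`; iterating over a file yields one line at a time.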

## Monitoring Best Practices

### 1. Log Retention
- **Structured Logs**: Retain for 30 days
- **Error Logs**: Retain for 30 days (the maximum allowed under Privacy and Compliance)
- **Performance Logs**: Retain for 7 days
- **User Interaction Logs**: Retain for 1 year (with privacy compliance)

### 2. Metric Collection
- **High-frequency metrics**: Collect every 30 seconds
- **Medium-frequency metrics**: Collect every 5 minutes
- **Low-frequency metrics**: Collect every 1 hour
- **User metrics**: Collect on demand

### 3. Alert Tuning
- **Start with conservative thresholds**
- **Adjust based on historical data**
- **Use different severity levels**
- **Prevent alert fatigue by grouping, deduplicating, and rate-limiting notifications**
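
One way to adjust a threshold from historical data is to take a high percentile of past values and add headroom. The Python sketch below uses the nearest-rank percentile method. The default percentile and headroom factor are arbitrary starting points, not recommendations from the plugin authors.

```python
import math
from typing import Sequence


def nearest_rank_percentile(values: Sequence[float], percentile: float) -> float:
    """Return the nearest-rank percentile (0 < percentile <= 100) of values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(percentile * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def suggest_threshold(
    history: Sequence[float], percentile: float = 99.0, headroom: float = 1.2
) -> float:
    """Suggest an alert threshold: a historical percentile times a headroom factor."""
    return nearest_rank_percentile(history, percentile) * headroom


# Illustrative fetch durations in seconds
history = [0.8, 1.1, 0.9, 1.4, 2.0, 1.2, 0.7, 1.0, 3.1, 1.3]
threshold = suggest_threshold(history, percentile=90, headroom=1.5)
```

Recomputing the suggestion periodically, and comparing it with the threshold currently deployed, shows when an alert has drifted away from normal behavior.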

### 4. Dashboard Design
- **Keep dashboards focused and actionable**
- **Use consistent color schemes**
- **Include context and annotations**
- **Review and update dashboards regularly**

## Integration Examples

### Grafana Dashboard JSON
```json
{
  "dashboard": {
    "title": "TimeSafari Daily Notification Plugin",
    "panels": [
      {
        "title": "Notification Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "(sum(rate(dnp_notifications_success_total[24h])) / sum(rate(dnp_notifications_total[24h]))) * 100"
          }
        ]
      }
    ]
  }
}
```

### Prometheus Recording Rules
```yaml
groups:
  - name: timesafari_daily_notification
    rules:
      - record: dnp:notification_success_rate
        expr: (sum(rate(dnp_notifications_success_total[5m])) / sum(rate(dnp_notifications_total[5m]))) * 100

      - record: dnp:fetch_duration_avg
        expr: avg_over_time(dnp_fetch_duration_seconds[5m])

      - record: dnp:user_opt_out_rate
        expr: (sum(rate(dnp_user_opt_outs_total[1h])) / sum(rate(dnp_user_interactions_total[1h]))) * 100
```

## Troubleshooting Guide

### Common Issues and Queries

#### 1. High Error Rate
```bash
# Check recent errors
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=rate(dnp_errors_total[5m])' | jq
```

#### 2. Performance Issues
```bash
# Check fetch performance
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=avg_over_time(dnp_fetch_duration_seconds[10m])' | jq
```

#### 3. User Engagement Issues
```bash
# Check user metrics
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=rate(dnp_user_opt_outs_total[1h])' | jq
```
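
The same checks can be scripted against the Prometheus HTTP API. This Python sketch uses only the standard library: it URL-encodes the PromQL expression and extracts the values from an instant-vector response. The server address matches the examples above. The `platform` label in the sample payload is hypothetical.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> List[Tuple[Dict[str, str], float]]:
    """Extract (labels, value) pairs from an instant-vector query response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample's value is [unix_timestamp, "value-as-string"]
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def query_prometheus(base_url: str, promql: str, timeout: float = 5.0):
    """Run an instant query against a live Prometheus server."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as response:
        return parse_instant_vector(json.load(response))


url = build_query_url("http://localhost:9090", "rate(dnp_errors_total[5m])")

# Shape of a successful response; label names and values are illustrative
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"platform": "android"}, "value": [1759903695, "0.12"]}],
    },
}
samples = parse_instant_vector(sample_payload)
```

Calling `query_prometheus("http://localhost:9090", "rate(dnp_errors_total[5m])")` performs the request for real and returns the same list-of-pairs shape.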

## Privacy and Compliance

### Data Retention
- **User interaction logs**: 1 year maximum
- **Performance metrics**: 90 days maximum
- **Error logs**: 30 days maximum
- **Personal data**: Redacted or anonymized

### GDPR Compliance
- **User consent**: Tracked and logged
- **Data portability**: Export capabilities
- **Right to deletion**: Automated cleanup
- **Privacy by design**: Built into observability system
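
As one possible approach to the "redacted or anonymized" requirement, the Python sketch below replaces `data.userId` with a keyed hash before an event is logged. The field path follows the log examples in this document. The key handling and the 16-character truncation are assumptions for illustration. Keyed hashing is pseudonymization rather than full anonymization, so the output may still need to be treated as personal data.

```python
import hashlib
import hmac
from typing import Any, Dict


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Return a stable, keyed pseudonym for a user ID (HMAC-SHA256, truncated)."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def redact_event(event: Dict[str, Any], secret_key: bytes) -> Dict[str, Any]:
    """Return a copy of a log event with data.userId replaced by a pseudonym."""
    redacted = dict(event)  # shallow copy so the caller's event is not mutated
    data = event.get("data")
    if isinstance(data, dict) and "userId" in data:
        redacted["data"] = {
            **data,
            "userId": pseudonymize(str(data["userId"]), secret_key),
        }
    return redacted


# Hypothetical key and event; in practice the key would come from a secret store
key = b"example-key-do-not-use"
event = {"eventCode": "DNP-USER-OPT-OUT", "data": {"userId": "user-123"}}
safe_event = redact_event(event, key)
```

Because the pseudonym is stable for a given key, per-user counts still work, and rotating the key breaks the link between old and new records.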

---

**Note**: These dashboards and queries should be customized based on your specific monitoring infrastructure and requirements. Regular review and updates are recommended to ensure they remain relevant and actionable.