AWS Services Monitoring¶

Prerequisites

This section assumes familiarity with Connectivity Within AWS, Load Balancing, and Hybrid & Multi-Cloud. Review those topics first if you're new to AWS networking fundamentals.

Monitoring network traffic (covered in Internal Traffic and External Traffic) tells you what's flowing through your network. Monitoring the networking services themselves tells you whether the infrastructure carrying that traffic is healthy. A Transit Gateway with blackhole drops, a NAT gateway exhausting its port allocation, or a Direct Connect connection flapping between states — these are service-level failures that traffic monitoring alone won't catch until users are already affected.

This page focuses on the operational health of AWS networking services: the CloudWatch metrics that matter, the alarms you should configure from day one, and the automation patterns that turn monitoring signals into remediation actions. The goal is to detect degradation in the networking plane before it becomes an outage, and to respond automatically where possible.

Service monitoring in a multi-account AWS environment requires a deliberate architecture. Metrics live in the account that owns the resource, but the networking team needs a unified view across all accounts and Regions. The patterns here assume a centralized monitoring account with cross-account CloudWatch dashboards and a shared EventBridge bus for networking events.

Service monitoring stack showing networking service metrics (Transit Gateway, NAT gateway, Direct Connect, VPN, ALB, NLB, Network Firewall, Route 53, VPC Lattice) feeding into a centralized monitoring account (CloudWatch metrics, alarms, dashboards) which triggers automated response (EventBridge, SNS, Lambda, Incident Manager) — Service monitoring stack — Drawio Source

Critical metrics by service¶

Not all CloudWatch metrics deserve an alarm. The table below identifies the metrics that signal real operational problems — the ones you should alarm on from day one, before the first production workload routes through the service.

Transit Gateway¶

Metric	Why it matters	Alarm condition
`PacketDropCountBlackhole`	Traffic is being sent to a route that leads nowhere. Indicates a missing or misconfigured route table entry.	> 0 for 2 consecutive periods
`PacketDropCountNoRoute`	No matching route exists for the destination. Often caused by missing route propagation or a detached attachment.	> 0 for 2 consecutive periods
`BytesIn` / `BytesOut`	Baseline throughput. Sudden drops indicate connectivity loss; sustained growth signals capacity planning needs.	Anomaly detection band (2 standard deviations)
`AttachmentCount`	Track attachment growth against the per-Region quota (default 5,000).	> 80% of quota

NAT gateway¶

Metric	Why it matters	Alarm condition
`ErrorPortAllocation`	The NAT gateway has exhausted its 55,000 simultaneous connections to a single destination. Workloads will fail to establish new connections.	> 0 for 1 period
`PacketsDropCount`	Packets dropped due to NAT gateway processing limits. Indicates the gateway is overwhelmed.	> 0 sustained over 3 periods
`ActiveConnectionCount`	Tracks connection table utilization. Useful for capacity planning and detecting connection leaks.	Anomaly detection or > 80% of expected baseline
`BytesOutToDestination`	Data processing volume directly correlates with cost. Unexpected spikes indicate misconfigured routing or data exfiltration.	Anomaly detection band
`ConnectionEstablishedCount`	Rate of new connections. Sudden spikes may indicate scanning or misconfigured retry logic.	Anomaly detection band

Key insight: ErrorPortAllocation is the single most critical NAT gateway metric. When it fires, connections are already failing. Alarm on it immediately and consider multiple NAT gateways or destination diversification.

Direct Connect¶

Metric	Why it matters	Alarm condition
`ConnectionState`	Binary: the physical connection is up or down. State changes indicate fiber cuts, router failures, or maintenance events.	State != 1 (up) for 1 period
`VirtualInterfaceBpsEgress` / `VirtualInterfaceBpsIngress`	Per-VIF throughput. Approaching the port capacity means you need to add capacity or shift traffic.	> 80% of port speed sustained over 5 minutes
`ConnectionBpsEgress` / `ConnectionBpsIngress`	Aggregate connection throughput.	> 80% of port speed sustained over 5 minutes
`ConnectionLightLevelTx` / `ConnectionLightLevelRx`	Optical signal strength. Degrading light levels predict physical failures before they happen.	Outside acceptable dBm range for the optic type

Site-to-Site VPN¶

Metric	Why it matters	Alarm condition
`TunnelState`	Binary: the IPsec tunnel is up or down. Each VPN connection has two tunnels for redundancy.	Either tunnel state = 0 for 2 consecutive periods
`TunnelDataIn` / `TunnelDataOut`	Per-tunnel throughput. Asymmetric traffic may indicate a routing problem or a failed tunnel with traffic on the remaining one.	Anomaly detection; alert on zero traffic when traffic is expected

Key insight: Alarm when a single tunnel goes down, not just when both are down. A single-tunnel failure means you're running without redundancy — the next failure is an outage.

Application Load Balancer¶

Metric	Why it matters	Alarm condition
`HealthyHostCount`	Tracks how many targets are passing health checks. A declining count means capacity is shrinking.	< expected minimum per target group
`UnHealthyHostCount`	Targets failing health checks. Non-zero means something is wrong with the application or its dependencies.	> 0 sustained over 2 periods
`HTTPCode_ELB_5XX_Count`	Errors generated by the ALB itself (not the targets). Indicates ALB-level issues like capacity exhaustion or no healthy targets.	> 0 sustained over 3 periods
`TargetResponseTime`	P99 latency from the ALB to targets. Degradation here affects every request.	Anomaly detection or > SLA threshold
`RejectedConnectionCount`	Connections rejected because the ALB hit its maximum connections. Indicates undersized subnets or a traffic spike beyond ALB scaling.	> 0 for 1 period
`RequestCount`	Baseline traffic volume. Useful for anomaly detection and correlating with other metrics.	Anomaly detection band

Network Load Balancer¶

Metric	Why it matters	Alarm condition
`HealthyHostCount` / `UnHealthyHostCount`	Same as ALB — tracks target availability.	Same thresholds as ALB
`TCP_ELB_Reset_Count`	TCP resets generated by the NLB (not targets). Indicates idle timeout mismatches or connection tracking issues.	Anomaly detection; sustained increase
`ProcessedBytes`	Total throughput. Correlates directly with cost and capacity utilization.	Anomaly detection band
`NewFlowCount`	Rate of new TCP/UDP flows. Sudden spikes may indicate DDoS or misconfigured clients.	Anomaly detection band
`UnHealthyHostCount` (per Availability Zone)	Per-AZ health. Critical when cross-zone load balancing is off (NLB default).	> 0 in any single Availability Zone

AWS Network Firewall¶

Metric	Why it matters	Alarm condition
`DroppedPackets`	Packets explicitly dropped by firewall rules. Expected in normal operation, but sudden spikes indicate either an attack or a rule misconfiguration blocking legitimate traffic.	Anomaly detection band
`PassedPackets`	Packets allowed through. A sudden drop to zero means traffic isn't reaching the firewall (routing issue) or the firewall is down.	< baseline for 2 periods
`ReceivedPackets`	Total packets entering the firewall. Baseline for capacity planning.	Anomaly detection band
`Packets` (per rule group)	Per-rule-group hit counts. Identifies which rules are active and whether new rules are matching as expected.	Monitor for zero hits on rules expected to match

Route 53 Resolver¶

Metric	Why it matters	Alarm condition
`InboundQueryVolume`	DNS queries arriving from on-premises or peered networks. Spikes may indicate DNS amplification or misconfigured resolvers.	Anomaly detection band
`OutboundQueryVolume`	DNS queries forwarded to on-premises or external resolvers. Drops indicate forwarding rule issues.	< baseline for 3 periods
`FirewallRuleGroupQueryVolume`	Queries evaluated by DNS Firewall rules. Tracks DNS-layer security enforcement.	Monitor for expected baseline

VPC Lattice¶

Metric	Why it matters	Alarm condition
`RequestCount`	Total requests through the service network. Baseline for capacity and cost tracking.	Anomaly detection band
`HTTPCode_Target_4XX_Count`	Client errors at the target. Elevated counts indicate API contract issues or auth failures.	Anomaly detection band
`HTTPCode_Target_5XX_Count`	Server errors at the target. Direct indicator of backend health problems.	> threshold for 2 periods
`TargetResponseTime`	Latency from VPC Lattice to the target. Degradation affects all consumers on the service network.	> SLA threshold or anomaly detection

Best Practices¶

Alarm design¶

Alarm on state changes, not just thresholds¶

Many networking services have binary state metrics (tunnel up/down, connection active/inactive, BGP session established/idle). These deserve state-change alarms, not threshold-based alarms. A VPN tunnel transitioning from up to down is immediately actionable regardless of traffic volume. Configure alarms that trigger on the state value itself (for example, TunnelState < 1 for 1 evaluation period) rather than waiting for traffic metrics to reflect the failure.

For Direct Connect, monitor ConnectionState transitions. For VPN, monitor individual TunnelState per tunnel. For ALB/NLB, monitor HealthyHostCount dropping below the expected minimum rather than waiting for error rates to climb.

Use composite alarms to reduce noise¶

Individual metric alarms generate noise. A brief spike in PacketDropCountNoRoute during a Transit Gateway route table update is expected. A sustained spike combined with increased ErrorPortAllocation on a NAT gateway in the same path is a real problem.

Composite alarms combine multiple alarm states with AND/OR logic. Configure them to alert only when multiple signals confirm a problem:

Transit Gateway: PacketDropCountBlackhole > 0 AND BytesOut anomaly (confirms traffic is affected, not just a transient routing update)
NAT gateway: ErrorPortAllocation > 0 AND ActiveConnectionCount above baseline (confirms the port exhaustion is real load, not a monitoring artifact)
Load Balancer: UnHealthyHostCount > 0 AND HealthyHostCount < minimum (confirms actual capacity loss, not a single target cycling)

Key insight: Composite alarms are the difference between a monitoring system that gets ignored and one that gets acted on. Every alert that fires without requiring action trains your team to ignore alerts.

Use anomaly detection instead of static thresholds¶

Static thresholds require constant tuning as traffic patterns change. CloudWatch anomaly detection builds a model of expected behavior and alerts when metrics deviate from the learned pattern. This is particularly effective for:

BytesIn/BytesOut on Transit Gateway and NAT gateway (traffic follows daily/weekly patterns)
RequestCount on ALB and VPC Lattice (application traffic has predictable cycles)
NewFlowCount on NLB (connection rates correlate with business activity)

Anomaly detection costs the same as a standard alarm but adapts automatically to traffic growth, seasonal patterns, and baseline shifts. Use a band width of 2 standard deviations for most networking metrics — tight enough to catch real anomalies, loose enough to avoid false positives during normal variance.

Monitor quotas before you hit them¶

Every networking service has quotas. Hitting a quota silently — no new VPN connections, no additional Transit Gateway attachments, no more NAT gateway elastic IPs — causes failures that look like service issues but are actually capacity limits.

Use AWS Service Quotas integration with CloudWatch to alarm at 80% utilization:

Service	Quota to monitor	Default limit
Transit Gateway	Attachments per TGW	5,000
Transit Gateway	Routes per route table	10,000
NAT gateway	NAT gateways per Availability Zone	5
VPN	VPN connections per VGW/TGW	10 / 20
Direct Connect	Virtual interfaces per connection	50
ALB	Rules per ALB	100
NLB	Targets per target group	500 (IP) / 500 (instance)
Network Firewall	Rule groups per firewall policy	20

Multi-account monitoring architecture¶

Deploy a centralized monitoring account¶

In a multi-account environment, networking resources are distributed across shared-services accounts, workload accounts, and connectivity accounts. The networking team needs a single pane of glass.

Use CloudWatch cross-account observability to designate a monitoring account that can view metrics, logs, and traces from all source accounts. Configure this at the AWS Organizations level so new accounts are automatically enrolled.

The monitoring account hosts:

Cross-account dashboards showing all networking service health
Centralized alarms that evaluate metrics from any source account
EventBridge rules that aggregate networking events from all accounts

Build cross-Region dashboards for the networking team¶

A single CloudWatch dashboard can display metrics from multiple Regions. Build dashboards organized by service type, not by account or Region:

Transit Gateway dashboard: all TGW metrics across all Regions, with per-attachment drill-down
Hybrid connectivity dashboard: all Direct Connect and VPN metrics, showing connection state and utilization
Load balancer dashboard: ALB and NLB health across all workload accounts
DNS dashboard: Route 53 Resolver query volumes and DNS Firewall activity

Each dashboard should show the last 3 hours by default with the ability to zoom to 1 week for trend analysis.

Key insight: Organize dashboards by networking concern (connectivity health, capacity, security), not by AWS account or Region. The networking team thinks in terms of paths and services, not account boundaries.

IPv6 monitoring considerations¶

Monitor dual-stack metrics separately¶

Several networking services report metrics that differ between IPv4 and IPv6 traffic paths. When running dual-stack:

ALB/NLB: Monitor IPv6ProcessedBytes and IPv6RequestCount separately from their IPv4 counterparts. A failure in the IPv6 path won't show up in aggregate metrics if IPv4 traffic dominates.
NAT gateway: NAT64 metrics (BytesOutToDestination for IPv6-to-IPv4 translation) track a different failure mode than standard NAT. Monitor both paths.
VPC Lattice: Dual-stack service networks carry both IPv4 and IPv6 traffic. Monitor per-protocol error rates to catch IPv6-specific routing issues.

Configure IPv6-specific health checks¶

For services that support IPv6 health checks (ALB, NLB), configure health checks over both protocols when targets are dual-stack. An IPv4 health check passing doesn't guarantee the IPv6 path is functional — different security groups, NACLs, or routing may apply to each address family.

Cost-effective monitoring¶

Use metric math to reduce alarm count¶

CloudWatch charges per alarm per month. Instead of creating individual alarms for every NAT gateway or every Transit Gateway attachment, use metric math to aggregate:

Sum ErrorPortAllocation across all NAT gateways in a Region into a single alarm
Calculate the ratio of UnHealthyHostCount to total targets across all target groups
Compute PacketDropCountBlackhole + PacketDropCountNoRoute as a single "routing failures" metric

This reduces alarm count (and cost) while maintaining coverage. Create per-resource alarms only for the most critical individual resources (primary Direct Connect connections, production ALBs).

Understand the cost model¶

CloudWatch component	Pricing consideration
Standard metrics	Free (included with the service)
Custom metrics	Per-metric/month (tiered — see CloudWatch pricing)
Alarms (standard)	Per-alarm/month
Alarms (high-resolution)	Per-alarm/month (higher than standard)
Anomaly detection alarms	Per-alarm/month
Composite alarms	Per-alarm/month (highest alarm tier)
Dashboards	Per-dashboard/month (first 3 free)
Cross-account observability	No additional charge for metrics

For a typical multi-account networking setup with dozens of alarms, several dashboards, and anomaly detection alarms, CloudWatch costs are negligible compared to the networking services themselves — but worth understanding to avoid surprise bills from over-instrumentation with custom metrics.

Prefer built-in metrics over custom metrics¶

Every networking service publishes metrics to CloudWatch at no additional cost. Before building custom metrics with Lambda functions or CloudWatch agents, verify the built-in metrics don't already cover your need. Custom metrics charge per-metric/month and add up quickly when you're monitoring hundreds of resources across multiple accounts.

Automated remediation¶

Use EventBridge for automated response¶

CloudWatch alarms transition states. EventBridge captures those transitions and routes them to automated actions. Common networking remediation patterns:

Trigger	Automated action
VPN `TunnelState` → 0 on both tunnels	Trigger failover to backup VPN or Direct Connect path
NAT gateway `ErrorPortAllocation` > 0	Scale out by provisioning additional NAT gateways and updating route tables
ALB `HealthyHostCount` < minimum	Trigger Auto Scaling step scaling or notify on-call
Direct Connect `ConnectionState` → down	Update Route 53 health checks to failover to VPN backup
Transit Gateway `PacketDropCountBlackhole` > 0	Run diagnostic Lambda to identify the affected route and notify
Network Firewall `DroppedPackets` spike	Capture packet samples and create incident ticket

Design health checks for networking services¶

Beyond CloudWatch metrics, active health checking validates end-to-end path availability. Design synthetic checks that probe the networking layer:

VPN path validation: Lambda in the VPC sends ICMP or TCP probes through the VPN tunnel to an on-premises endpoint every 60 seconds. Failure triggers an alarm independent of the TunnelState metric (which only reflects IKE/IPsec state, not actual data-plane forwarding).
NAT gateway validation: Lambda in a private subnet makes an HTTPS request to an external endpoint. Failure indicates NAT gateway or internet gateway issues.
Transit Gateway path validation: Lambda in spoke VPC A sends a request to a known endpoint in spoke VPC B through the Transit Gateway. Validates routing, not just attachment state.
Direct Connect path validation: On-premises probe sends traffic to a known VPC endpoint. Validates the full path including BGP routing, not just the physical connection state.

Key insight: CloudWatch metrics tell you the service is healthy. Synthetic health checks tell you the path works end-to-end. You need both — a healthy service with broken routing still means an outage.

Combining service monitoring with other services¶

Combination	Service monitoring provides	Other service provides
Service monitoring + VPC Flow Logs	Health state of networking services (up/down, error rates, capacity)	Actual traffic patterns, source/destination pairs, accepted/rejected flows
Service monitoring + AWS CloudTrail	Runtime operational metrics	API-level audit trail (who changed what configuration and when)
Service monitoring + AWS Network Manager	Per-service metric alarms and dashboards	Topology visualization and route analysis across the global network
Service monitoring + AWS Health Dashboard	Your resource-specific metrics and alarms	AWS-side service events, maintenance notifications, and regional issues
Service monitoring + Amazon DevOps Guru	Explicit alarm thresholds and anomaly bands you define	ML-driven anomaly detection across related resources that you didn't explicitly instrument
Service monitoring + AWS Trusted Advisor	Real-time operational health	Periodic checks for quota utilization, security, and cost optimization
Service monitoring + Notifications	Metric collection and alarm evaluation	Alert routing, escalation, and on-call integration (see Notifications)

Documentation¶

CloudWatch cross-account observability

Set up a centralized monitoring account to view metrics, logs, and traces across your AWS Organization.

Documentation
CloudWatch anomaly detection

Configure ML-based anomaly detection alarms that adapt to changing traffic patterns without manual threshold tuning.

Documentation
CloudWatch composite alarms

Combine multiple alarm states into a single composite alarm to reduce noise and alert only on confirmed problems.

Documentation
EventBridge rules for CloudWatch alarms

Route alarm state changes to automated remediation actions through EventBridge rules.

Documentation
CloudWatch pricing

Understand the cost model for metrics, alarms, dashboards, and cross-account observability.

Pricing
AWS Service Quotas

Monitor service quota utilization with CloudWatch integration and request increases before hitting limits.

Documentation

Internal Traffic Monitoring — Covers VPC Flow Logs and traffic mirroring for understanding what's flowing through your network, complementing the service health view on this page.
External Traffic Monitoring — Covers monitoring traffic between AWS and the internet, including CloudFront and edge service metrics.
Notifications — Covers alert routing, escalation policies, and integration with incident management tools. Service monitoring generates the signals; notifications deliver them to the right people.

Relationship to other sections:

Connectivity Within AWS: Covers the Transit Gateway, Cloud WAN, and VPC Peering services that this page monitors.
Hybrid & Multi-Cloud: Covers Direct Connect and Site-to-Site VPN architecture; this page covers their operational monitoring.
Load Balancing: Covers ALB, NLB, and GWLB architecture and best practices; this page covers their health metrics and alarms.