Load Balancing Best Practices
Introduction
High availability and fault tolerance are central to modern distributed architectures. Achieving them depends on intelligent health checks at every layer, from Kubernetes (K8s) clusters to F5 load balancers. This briefing gives an overview of the impact of readiness and liveness probes and related health checks, along with best practices for configuring them.
Importance
- Defense in Depth: Health checks, readiness probes, synthetics, and circuit breakers provide active as well as passive protection against overload, intermittency, outages, delays, and "stampeding herd" problems. Layered, these mechanisms act as force multipliers for one another, so platforms have multiple resiliency mechanisms guarding the most common fault points.
- Increased Service Reliability: Proper health checks reduce avoidable outage time by identifying problematic instances before they affect the rest of the system.
- Resource Efficiency: Readiness and liveness probes allow requests to be rerouted and failing workloads to be restarted automatically, so capacity is spent on healthy instances.
Technical Aspects
Kubernetes (K8s)
Readiness Probes: To determine if a pod can accept traffic
- High Availability Impact: Ensures new deployments don't receive traffic until fully initialized.
- initialDelaySeconds: Can be similar to the liveness probe, 30-60 seconds.
- timeoutSeconds: 1-5 seconds, again to fail fast.
- periodSeconds: Around 5-10 seconds for quicker readiness checks.
- failureThreshold: Generally keep it low, around 1-2, to quickly remove a pod from service.
Liveness Probes: To determine whether a running container is still healthy or needs a restart
- High Availability Impact: Recovers failing pods by restarting their containers.
- initialDelaySeconds: Around 30-60 seconds. This allows your app ample time to start.
- timeoutSeconds: Keep this low, around 1-5 seconds. The idea is to fail fast.
- periodSeconds: Around 10-30 seconds. This is the interval at which checks are performed.
- failureThreshold: A value like 3 is reasonable. The container is killed and restarted after 3 consecutive failures.
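To see how these settings interact, the sketch below estimates how long a failure can go unnoticed for a given probe configuration. The formula is a rough upper bound rather than anything Kubernetes reports, and the `ProbeSettings` class and the chosen values are illustrative assumptions picked from the ranges above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeSettings:
    """Subset of the probe fields discussed above (seconds, except the threshold)."""
    initial_delay_seconds: int
    period_seconds: int
    timeout_seconds: int
    failure_threshold: int


def worst_case_detection_seconds(p: ProbeSettings) -> int:
    """Rough upper bound on the time between a failure and the probe tripping.

    Assumes the failure happens just after a successful probe, so the first
    failing probe starts up to one period later, and that the last failing
    probe runs to its full timeout.
    """
    return p.period_seconds * p.failure_threshold + p.timeout_seconds


# Illustrative values taken from the suggested ranges.
liveness = ProbeSettings(initial_delay_seconds=30, period_seconds=10,
                         timeout_seconds=2, failure_threshold=3)
readiness = ProbeSettings(initial_delay_seconds=30, period_seconds=5,
                          timeout_seconds=2, failure_threshold=2)

print(worst_case_detection_seconds(liveness))   # 10 * 3 + 2 = 32
print(worst_case_detection_seconds(readiness))  # 5 * 2 + 2 = 12
```

With these numbers a bad pod leaves rotation in roughly 12 seconds, while a restart takes about 32 seconds to trigger, which matches the intent of a fast readiness check and a more conservative liveness check.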
Best Practices
- Use HTTP GET method on /health endpoints that execute light-weight logic and possibly a database query.
- For stateful applications, consider leveraging gRPC checks for enhanced health assessment.
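A minimal sketch of such an endpoint, using only the Python standard library, might look like the following. The in-memory SQLite database stands in for the real datastore, and the port and response bodies are arbitrary choices for the example.

```python
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: an in-memory SQLite database stands in for the real datastore.
DB = sqlite3.connect(":memory:", check_same_thread=False)


def database_ok(conn: sqlite3.Connection) -> bool:
    """Run the cheapest possible query to prove the connection works."""
    try:
        return conn.execute("SELECT 1").fetchone() == (1,)
    except sqlite3.Error:
        return False


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        healthy = database_ok(DB)
        body = b"ok" if healthy else b"unavailable"
        # 200 keeps the pod in service; 503 makes an HTTP probe fail.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Blocking helper; the port is an arbitrary choice for this example."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the listener. Keeping the query as trivial as `SELECT 1` is deliberate: the probe should confirm connectivity without adding meaningful load.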
F5 Load-Balancers
- HTTP Monitors: For checking application layer health.
  - High Availability Impact: Redirects traffic to healthy nodes.
- External Monitors: To execute complex logic for health checks.
  - High Availability Impact: Allows for the most nuanced and customized assessments.
Best Practices
- Implement application-specific logic for content validation in HTTP monitors.
- Use scripted external monitors for complex multi-step health validations involving database transactions, cache health, etc.
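The logic of a multi-step external monitor can be sketched as below. This is a hypothetical outline, not an F5-supplied script: the step names and stand-in checks are invented for illustration, and `monitor_output` assumes the commonly documented BIG-IP convention that any output on standard output marks the member up while no output marks it down.

```python
from typing import Callable, Iterable, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_steps(steps: Iterable[Step]) -> Tuple[bool, str]:
    """Run steps in order and stop at the first failure.

    Returns (healthy, name_of_failed_step_or_empty_string).
    """
    for name, check in steps:
        try:
            if not check():
                return False, name
        except Exception:
            # Any unexpected error counts as a failed step.
            return False, name
    return True, ""


def body_is_valid(status: int, body: str, expected: str = '"status": "OK"') -> bool:
    """Content validation: require a 200 and an expected marker in the body."""
    return status == 200 and expected in body


def monitor_output(healthy: bool) -> str:
    """Assumed convention: any stdout output means up, no output means down."""
    return "UP" if healthy else ""


steps = [
    ("http content", lambda: body_is_valid(200, '{"status": "OK"}')),
    ("database transaction", lambda: True),  # stand-in for a real write/read round trip
    ("cache health", lambda: True),          # stand-in for a real cache ping
]
healthy, failed_step = run_steps(steps)
print(monitor_output(healthy), end="")
```

Stopping at the first failed step keeps the monitor fast and makes the failing dependency easy to log.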
Recommendations
- Avoid Shallow Checks: TCP port checks are insufficient; logical validations are crucial.
- End-to-End Checks: Include database and cache queries in /health endpoints where applicable.
- Rate Limiting: Throttle health checks to avoid overwhelming services.
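One way to reconcile end-to-end checks with throttling is to cache the result of an expensive check for a short period, so that frequent probes from several load balancers trigger at most one real query per window. The following is a minimal sketch under that assumption; `CachedCheck`, the five-second TTL, and the stand-in check are illustrative names and values.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Caches the result of an expensive check for ttl_seconds."""

    def __init__(self, check: Callable[[], bool], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Optional[float] = None
        self._last = False

    def __call__(self) -> bool:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: run the real check once.
            self._last = self._check()
            self._expires_at = now + self._ttl
        return self._last


calls = []  # records each real execution of the expensive check


def expensive_db_check() -> bool:
    calls.append(1)
    return True


fake_now = [0.0]  # injectable clock so the example does not need to sleep
cached = CachedCheck(expensive_db_check, ttl_seconds=5.0, clock=lambda: fake_now[0])
cached()
cached()            # served from the cache
fake_now[0] = 6.0
cached()            # TTL expired, so the check runs again
print(len(calls))   # 2
```

The trade-off is that a failure can stay hidden for up to one TTL, so the TTL should be short relative to the probe interval and failure threshold.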
Consul Health Checks
Node-Level Health Checks
- Interval: Every 30s to 1m is common.
- Timeout: Around 5s. Fail fast if a node isn't responding.
- DeregisterCriticalServiceAfter: Use this sparingly; it removes a service if it stays critical for the defined time (say, 72h).
Service-Level Health Checks
- Interval: Can vary depending on the service, but every 10-30s is generally a good start.
- Timeout: Usually around 3-5s.
Consul Service Checks
- HTTP or gRPC Checks: Favor these over simple TCP checks.
  - Can be integrated with your /health endpoints.
- Method: HTTP method if applicable (usually GET).
- TLSSkipVerify: Skip TLS verify for HTTPS checks (only if necessary).
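Put together, these fields could form a service registration body like the one sketched below, intended for Consul's agent service registration endpoint. The service name, port, and URL are placeholders, and the intervals are simply the starting points suggested above rather than recommended values for any particular service.

```python
import json


def service_registration(name: str, port: int, health_url: str) -> dict:
    """Build a JSON-serializable body for registering a service with an HTTP check.

    The name, port, and URL are placeholders supplied by the caller.
    """
    return {
        "Name": name,
        "Port": port,
        "Check": {
            "HTTP": health_url,
            "Method": "GET",
            "Interval": "10s",
            "Timeout": "3s",
            "TLSSkipVerify": False,  # only enable for HTTPS checks when unavoidable
            "DeregisterCriticalServiceAfter": "72h",
        },
    }


payload = service_registration("orders", 8080, "http://localhost:8080/health")
print(json.dumps(payload, indent=2))
```

Generating the definition in code makes it easier to keep intervals and timeouts consistent across many services.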
Best Practices
- Scripted Checks: For more complex health evaluations, you can use scripts that run periodically.
  - Keep these lightweight and make sure they complete within the Timeout period.
- TCP Checks: While less favored, they can be useful for simple services that don't expose HTTP endpoints.
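Consul script checks report status through the process exit code: 0 is passing, 1 is warning, and any other value is critical. A lightweight script could map its findings onto those codes roughly as follows; the check names and the notion of "optional" checks are assumptions made for the example.

```python
from typing import Callable, Dict, FrozenSet

PASSING, WARNING, CRITICAL = 0, 1, 2


def evaluate(checks: Dict[str, Callable[[], bool]],
             optional: FrozenSet[str] = frozenset()) -> int:
    """Return an exit code: critical if a required check fails,
    warning if only optional checks fail, passing otherwise."""
    failed = {name for name, check in checks.items() if not check()}
    if failed - optional:
        return CRITICAL
    if failed:
        return WARNING
    return PASSING


# Hypothetical checks: disk space is required, replica lag is advisory.
checks = {"disk": lambda: True, "replica_lag": lambda: False}
exit_code = evaluate(checks, optional=frozenset({"replica_lag"}))
print(exit_code)  # 1 (warning)
```

A real script would finish with `sys.exit(exit_code)` so that Consul sees the result.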
General Tips
- Rate Limiting: Just like with Kubernetes and F5, you'll want to ensure you're not overwhelming your services with health checks.
- Alerts and Metrics: Integrate with monitoring and alerting systems to keep a finger on the pulse of your infrastructure.
AWS Elastic Load Balancing (ELB)
Classic Load Balancer
- Ping Protocol & Port: HTTP/HTTPS and the port your application listens on.
- Ping Path: Typically /health.
- Response Timeout: Around 2-5 seconds.
- Interval: 30 seconds is generally good.
- Unhealthy Threshold: Usually set at 2.
- Healthy Threshold: A value of 2-3 works well.
Application Load Balancer
- Matcher: HTTP codes to be considered healthy, usually 200-399.
- Timeout: About 5 seconds.
- Interval: Ranges from 5 to 300 seconds, but 30 seconds is often a good balance.
Network Load Balancer
- Similar to Classic but usually used for TCP/UDP-based services.
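The healthy and unhealthy thresholds above describe a small state machine: a target changes state only after enough consecutive results that disagree with its current state. The simulation below is a simplified model of that behavior, not AWS code, and the `TargetHealth` class is an illustrative name.

```python
class TargetHealth:
    """Tracks one target's state from a stream of health check results."""

    def __init__(self, healthy_threshold: int = 3, unhealthy_threshold: int = 2,
                 healthy: bool = True) -> None:
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = healthy
        self._streak = 0  # consecutive results that disagree with the current state

    def record(self, success: bool) -> bool:
        if success == self.healthy:
            self._streak = 0  # result agrees with the current state
        else:
            self._streak += 1
            threshold = self.healthy_threshold if success else self.unhealthy_threshold
            if self._streak >= threshold:
                self.healthy = success
                self._streak = 0
        return self.healthy


target = TargetHealth(healthy_threshold=3, unhealthy_threshold=2)
results = [True, False, True, False, False, True, True, True]
states = [target.record(r) for r in results]
print(states)
# [True, True, True, True, False, False, False, True]
```

A single failed check does not remove the target, and three consecutive successes are needed before it returns, which is what keeps flapping instances from bouncing in and out of rotation.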
Azure Load Balancer & Application Gateway
Azure Load Balancer
- Protocol: Usually HTTP/HTTPS.
- Port: Whatever your application listens on.
- Path: Usually /health.
- Interval: 5 seconds is common.
- Retries: Usually set between 2 and 4.
Azure Application Gateway
- Interval: 30 seconds is a typical starting point.
- Timeout: Generally, set this lower than the interval, say at 20 seconds.
- Unhealthy Threshold: 3 is often used.
General Best Practices for Both AWS and Azure
- Multiple Health Checks: Implement diverse types of health checks (e.g., HTTP, database queries).
- Grace Period: Allow instances some time to warm up before subjecting them to health checks.
- Monitoring & Alerts: Integrate health checks with CloudWatch (AWS) or Azure Monitor to set up alerts.
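Two of the rules above, keeping the timeout below the interval and ignoring failures during warm-up, are easy to encode as sanity checks before settings are applied. The sketch below is provider-neutral; `ProbeConfig` and `effective_health` are hypothetical helpers, not part of any AWS or Azure SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeConfig:
    interval_seconds: int
    timeout_seconds: int
    unhealthy_threshold: int

    def validate(self) -> None:
        """Reject settings where a probe could still be running when the next one starts."""
        if self.timeout_seconds >= self.interval_seconds:
            raise ValueError("timeout should be lower than the interval")
        if self.unhealthy_threshold < 1:
            raise ValueError("unhealthy threshold must be at least 1")


def effective_health(probe_passed: bool, started_at: float, now: float,
                     grace_seconds: float) -> bool:
    """Treat an instance as healthy while it is still inside its grace period."""
    in_grace_period = (now - started_at) < grace_seconds
    return probe_passed or in_grace_period


config = ProbeConfig(interval_seconds=30, timeout_seconds=20, unhealthy_threshold=3)
config.validate()  # passes: 20 < 30

# An instance started 10 seconds ago with a 60-second grace period is not
# penalized for a failing probe yet.
print(effective_health(False, started_at=0.0, now=10.0, grace_seconds=60.0))  # True
print(effective_health(False, started_at=0.0, now=90.0, grace_seconds=60.0))  # False
```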
Going Further: Operational best practices that actively protect external systems rather than passively working around internal failures
/health Endpoint Design
Best Practices and Recommendations for Endpoints
- Granular Checks: A well-designed /health endpoint should support various types of checks, like DBHealth, CacheHealth, ServiceDependencyHealth, etc.
- Selective Monitoring: Allow query parameters to perform specific health checks. For example, /health?check=db to only check database health.
- Non-blocking and Asynchronous: Make sure the checks are asynchronous to prevent lockups and to allow for parallel checking of various dependencies.
- Response Codes and Content: Stick with HTTP status codes for quick interpretation but also include a detailed JSON body for a more comprehensive status.

      {
        "status": "OK",
        "details": {
          "database": "OK",
          "cache": "OK",
          "externalService": "FAIL"
        }
      }

- Rate Limit: Add a rate limiter to prevent abuse. This is especially important if the health check involves resource-intensive operations like DB queries.
- Timeouts: Always set sensible timeouts for each individual check in the /health endpoint.
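Several of these recommendations (granular checks, selective monitoring, parallel execution, per-check timeouts, and a status code plus JSON body) can be combined in one handler. The sketch below shows only the core logic, with `asyncio` and placeholder checks; the check names, the 0.1-second timeout, and the choice to report 503 when any dependency fails are assumptions for the example. The `only` argument corresponds to a query parameter such as /health?check=db.

```python
import asyncio
import json
from typing import Awaitable, Callable, Dict, Optional, Tuple

Check = Callable[[], Awaitable[bool]]


async def check_database() -> bool:          # placeholder for DBHealth
    await asyncio.sleep(0.01)
    return True


async def check_cache() -> bool:             # placeholder for CacheHealth
    await asyncio.sleep(0.01)
    return True


async def check_external_service() -> bool:  # placeholder that simulates a hang
    await asyncio.sleep(10)
    return True


CHECKS: Dict[str, Check] = {
    "db": check_database,
    "cache": check_cache,
    "external": check_external_service,
}


async def run_one(check: Check, timeout: float) -> str:
    try:
        ok = await asyncio.wait_for(check(), timeout=timeout)
    except Exception:
        # Timeouts and unexpected errors both count as failures.
        return "FAIL"
    return "OK" if ok else "FAIL"


async def health(only: Optional[str] = None, timeout: float = 0.1) -> Tuple[int, str]:
    """Return (HTTP status code, JSON body) for all checks or a single named one."""
    if only is not None and only not in CHECKS:
        return 400, json.dumps({"error": f"unknown check: {only}"})
    selected = {only: CHECKS[only]} if only else CHECKS
    # Run the selected checks concurrently, each with its own timeout.
    results = await asyncio.gather(*(run_one(c, timeout) for c in selected.values()))
    details = dict(zip(selected, results))
    healthy = all(v == "OK" for v in details.values())
    body = {"status": "OK" if healthy else "FAIL", "details": details}
    return (200 if healthy else 503), json.dumps(body)


status, body = asyncio.run(health())                  # the external check times out
db_status, db_body = asyncio.run(health(only="db"))   # like /health?check=db
print(status, body)
print(db_status, db_body)
```

Because the checks run concurrently, the total response time is bounded by the slowest single timeout rather than the sum of all checks.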
Circuit Breaker Design
Recommendations
- Failure Threshold: Decide on a specific number of failures that will trigger the circuit breaker.
- Retry Mechanism: Implement a progressive retry mechanism, so when a service goes down, it's not bombarded with requests the instant it comes back online.
- Fallback Functionality: Provide a fallback action when a service is down, like serving stale cached data or a default value.
- State Metrics: Keep metrics and expose them for monitoring. This can be invaluable for debugging and performance tuning.
- Configurability: Parameters like timeouts, failure thresholds, and reset timeouts should be configurable at runtime, ideally without requiring a redeployment.
- Distributed Circuit Breakers: In a microservices environment, consider a distributed circuit breaker design that allows shared state across multiple instances of a service.
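A minimal single-process circuit breaker that covers the failure threshold, reset timeout, fallback, and basic state metrics might look like this. It is a sketch under simplifying assumptions: state is kept in memory (so it is not the distributed variant), it is not thread-safe, and `backoff_delay` uses exponential backoff with full jitter as one possible progressive retry policy.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal closed / open / half-open breaker for a single process."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # plain attributes, adjustable at runtime
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.failures = 0                           # consecutive failures (state metric)
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self._clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            return fallback()                       # fail fast without calling the service
        try:
            result = func()                         # closed, or a half-open trial call
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self._clock()      # (re)open and restart the timer
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter for progressive retries."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


now = [0.0]  # injectable clock so the example does not need to sleep
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def flaky() -> str:
    raise ConnectionError("service down")


def stale_cache() -> str:
    return "stale"


breaker.call(flaky, stale_cache)
breaker.call(flaky, stale_cache)
print(breaker.state)                                # open
now[0] = 11.0
print(breaker.state)                                # half-open
print(breaker.call(lambda: "fresh", stale_cache))   # fresh
print(breaker.state)                                # closed
```

The half-open state lets a single trial request through after the reset timeout, so a recovering service is tested gently instead of receiving full traffic at once.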
Next Levels of Protection
- Bulkhead Isolation: Isolate resources in a way that failures in one part of the system won't take down others.
- Throttling: Use a token-bucket or leaky-bucket algorithm to limit the incoming request rate to a level that the system can handle.
- Dead Man's Switch: Implement a mechanism that halts certain operations if regular "all-clear" signals are not received, indicating a possible system failure.
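Throttling and the dead man's switch are both small enough to sketch directly. The classes below are illustrative, single-process versions with an injectable clock; the rates, capacities, and silence window are arbitrary example values.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._updated) * self.rate)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class DeadMansSwitch:
    """Permits an operation only while all-clear signals keep arriving."""

    def __init__(self, max_silence: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_silence = max_silence
        self._clock = clock
        self._last_all_clear = clock()

    def all_clear(self) -> None:
        self._last_all_clear = self._clock()

    def may_proceed(self) -> bool:
        return self._clock() - self._last_all_clear <= self.max_silence


now = [0.0]  # fake clock for a deterministic demonstration
bucket = TokenBucket(rate=1.0, capacity=2.0, clock=lambda: now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
print(burst)                    # [True, True, False]
now[0] = 1.0
print(bucket.allow())           # True, one token refilled after one second

switch = DeadMansSwitch(max_silence=30.0, clock=lambda: now[0])
now[0] = 40.0
print(switch.may_proceed())     # False, no all-clear for more than 30 seconds
switch.all_clear()
print(switch.may_proceed())     # True
```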
By implementing these /health endpoint and circuit breaker strategies, we build resilience and fault tolerance in depth. The goal is more than detecting failure; it is graceful degradation and self-healing, two pillars of a robust microservices architecture.