Load‑Balancing Techniques: Architecture Patterns and Implementations
Introduction
Generated with the help of AI deep research based on AWS, GCP, and OSS documentation. May contain errors; verify important information.
Modern load balancing plays a critical role in scalable infrastructure, ensuring traffic is distributed across multiple backend services to optimize performance and reliability. Over the past decade, load balancers (LBs) have evolved from dedicated hardware appliances to software-defined, distributed systems running on commodity servers[1][2]. Major cloud providers and open-source projects have innovated new architectures and algorithms to handle massive scale, dynamic environments (like microservices), and newer protocols (HTTP/2, HTTP/3). In this article, we’ll deep dive into the latest load-balancing techniques – examining how major cloud providers design their LBs, key architecture patterns, popular open-source solutions, common load-balancing algorithms, and the impact of HTTP/2 and HTTP/3. The focus is on architectural patterns and technical differences, providing experienced infrastructure engineers and architects a comprehensive update on the state of load balancing in 2025.
Load Balancing in Major Cloud Providers
All leading cloud providers offer highly scalable, managed load balancing services. While they share common goals (distributing traffic, scaling out automatically, handling health checks, etc.), their implementations differ in architecture and features. We explore Google Cloud, AWS, and Microsoft Azure to see how each approaches load balancing.
Google Cloud Load Balancing Architecture
Google Cloud’s load balancing is a fully distributed, software-defined service built on the same technologies Google uses internally. In fact, Google Cloud Load Balancing runs on components like Maglev, Andromeda (Google’s SDN stack), Google Front Ends (GFEs), and the Envoy proxy[3]. Google’s design emphasizes global load balancing and edge distribution: a single anycast virtual IP can front services worldwide, with Google’s global network and GFEs routing users to the nearest healthy backend.
Global vs Regional Load Balancing: Google Cloud offers both global and regional load balancers. Its flagship offering is the Global External HTTP(S) Load Balancer, which provides a single anycast IP frontend distributed across Google’s worldwide points of presence[4][5]. Clients connect to the closest Google Front End location (entry point to Google’s network), reducing latency. The LB then proxies requests to backends in any region as needed, enabling cross-region balancing and automatic failover to healthy regions[6]. In contrast, Google’s regional load balancers restrict backends to one region (often used for internal applications or lower-cost tiers)[7]. Both support automatic scaling (no pre-warming needed – Google’s system can scale from zero to millions of QPS in seconds)[8] and health-aware routing.
Anycast and GFEs: By using an Anycast IP address advertised globally, Google’s LB lets clients everywhere use a single IP that routes them to the nearest Google Front End location[9][4]. These GFEs are distributed in over 80 cities and terminate the client connection at the edge. This provides built-in DDoS mitigation and TLS offload at the edge. The GFEs (or Envoy proxies, as we’ll see) then forward traffic over Google’s private network to backend instances. This architecture yields low latency and high reliability – if a backend fails in one region, traffic is seamlessly routed to another region without the client needing to change IP or DNS.
Maglev Load Balancing (L4): Google’s network load balancer relies on Maglev, a software-based L4 load balancer Google pioneered[3]. Maglev is a distributed system running on commodity Linux servers, which replaced traditional hardware load balancers at Google[1][2]. Packets to a given virtual IP (VIP) are distributed via equal-cost multipath (ECMP) to a cluster of Maglev machines; each Maglev node then uses consistent hashing and connection tracking to map flows to backend servers[10][11]. Maglev’s hashing algorithm (often called Maglev hashing) spreads load evenly and minimizes disruption if backends or Maglev nodes change[12][13]. For example, Maglev uses a fixed lookup table of size 65,537 entries and a custom algorithm to assign backends to hash slots, achieving nearly equal distribution[14][15]. This consistent hashing approach ensures that when a backend is added or removed, only a small fraction of flows are remapped (increasing stability for long-lived connections). Maglev’s architecture is highly optimized: by sharding traffic and processing packets in parallel threads, a single Maglev machine can saturate a 10 Gbps link with small packets[16]. Google has used Maglev since 2008 to load-balance at extreme scale, including for Google Cloud networking[17].
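As a rough illustration of how such a lookup table is built, here is a minimal Go sketch of Maglev-style table population. It follows the offset/skip permutation scheme described in the paper, but it substitutes FNV hashing and ignores backend weights, so it is illustrative rather than a faithful reimplementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const tableSize = 65537 // prime table size, as in the Maglev paper

func hash64(s, seed string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(seed))
	h.Write([]byte(s))
	return h.Sum64()
}

// buildTable populates the Maglev lookup table: each backend repeatedly
// claims its next "preferred" slot until every slot is owned by a backend.
func buildTable(backends []string) []int {
	n := len(backends)
	offsets := make([]uint64, n)
	skips := make([]uint64, n)
	for i, b := range backends {
		offsets[i] = hash64(b, "offset") % tableSize
		skips[i] = hash64(b, "skip")%(tableSize-1) + 1
	}
	table := make([]int, tableSize)
	for i := range table {
		table[i] = -1 // -1 marks an unclaimed slot
	}
	next := make([]uint64, n) // per-backend position in its permutation
	for filled := 0; filled < tableSize; {
		for i := 0; i < n && filled < tableSize; i++ {
			// Walk this backend's permutation to its next unclaimed slot.
			c := (offsets[i] + next[i]*skips[i]) % tableSize
			for table[c] >= 0 {
				next[i]++
				c = (offsets[i] + next[i]*skips[i]) % tableSize
			}
			table[c] = i
			next[i]++
			filled++
		}
	}
	return table
}

func main() {
	backends := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	table := buildTable(backends)
	// Route a flow: hash the connection 5-tuple and index the table.
	flow := "198.51.100.7:51334->203.0.113.10:443/tcp"
	fmt.Println("flow goes to", backends[table[hash64(flow, "flow")%tableSize]])
}
```

Because each backend claims slots in turn, the table stays nearly balanced, and removing one backend only reassigns the slots that backend owned.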
Envoy at the Edge: Originally, Google’s global HTTP(S) load balancer ran on proprietary GFEs, but the latest generation has adopted Envoy (an open-source L7 proxy) to add advanced traffic management features[18]. Google’s External HTTP(S) Load Balancer (with Advanced Traffic Management) uses Envoy proxies at the edge to enable features like traffic mirroring, weighted traffic splitting (canary deployments), and content-based routing with header transformations[18]. These capabilities go beyond basic load distribution, allowing more fine-grained control such as mirroring a portion of traffic to a test service or splitting traffic by percentages between service versions. Envoy’s role is transparent to users but represents Google’s shift toward leveraging open-source cloud-native technology within its LB service. (The older “classic” Global HTTP(S) LB continues to use GFEs, with a similar global anycast design but fewer layer-7 features[19].)
Security and CDN Integration: By terminating traffic at Google’s edge, the load balancer also provides security benefits. It defends against network-layer DDoS attacks and can integrate with Cloud Armor for application-layer defenses[20]. TLS is offloaded at the edge, and features like web application firewall and Identity-Aware Proxy can be plugged in[21]. Google’s LBs also integrate with Cloud CDN to cache content at edge locations, improving performance for static content delivery[22].
HTTP/2 and HTTP/3: Google has been a leader in adopting new web protocols on its infrastructure. Google Cloud’s HTTP(S) load balancers support HTTP/2, and since 2021 they also support HTTP/3 (QUIC) for client connections[23][24]. Enabling HTTP/3 on a Google Cloud LB allows clients (modern browsers, etc.) to benefit from QUIC’s faster handshakes and multiplexing, which Google found reduced search latencies and video rebuffering significantly[23]. Importantly, this is backward-compatible – clients that don’t support QUIC simply continue using HTTP/1.1 or HTTP/2, with the load balancer handling protocol negotiation[25]. (Internally, the load balancer terminates QUIC and likely communicates with backends over HTTP/2 or gRPC, so your services don’t need native QUIC support.) Google’s ability to roll out HTTP/3 across its global edge means improved performance (2% lower latency for Google Search, 9% fewer video rebuffers as observed in their tests) for applications behind its load balancers[23].
AWS Elastic Load Balancing (ALB/NLB) Architecture
AWS offers multiple managed load balancer types under the Elastic Load Balancing (ELB) family, chiefly the Application Load Balancer (ALB) for HTTP/HTTPS (Layer 7) and the Network Load Balancer (NLB) for TCP/UDP (Layer 4). Their architectures reflect different approaches to scaling and routing, and AWS’s design choices prioritize integration with AWS networking (VPCs) and ease of use for typical scenarios.
Application Load Balancer (ALB): The ALB operates at Layer 7, terminating HTTP/HTTPS and routing requests based on content (paths, hostnames, etc.). AWS ALBs are regional; when you create an ALB, AWS provisions load balancer nodes in multiple Availability Zones (AZs) for high availability[26]. Each ALB node gets a network interface and (for internet-facing ALBs) a public IP in its AZ[27]. The ALB’s DNS name resolves to a set of IP addresses (one per AZ at minimum), so clients are served by a node in each AZ. As traffic grows, ALB will scale vertically (scaling up capacity on existing nodes) and then horizontally (adding more nodes) – up to 100 nodes across AZs[28]. DNS is updated to include up to 8 IPs at a time (to keep DNS responses small) and will rotate or add/remove IPs as scaling occurs[29]. Clients are expected to honor DNS TTLs (60s) and retry failed connections by re-resolving, so that traffic naturally shifts if nodes are added or replaced[30]. This DNS-based scaling is crucial: AWS essentially load-balances to the ALB by giving multiple entry point IPs, and the ALB then balances among targets.
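A client that plays by these rules simply re-resolves the load balancer’s hostname and spreads new connections across whatever addresses come back. A minimal sketch, using a hypothetical ALB hostname:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// A hypothetical ALB DNS name; it resolves to one IP per AZ (up to 8 at a time).
	// Well-behaved clients re-resolve after the 60s TTL so they pick up
	// node additions and removals as the ALB scales.
	ips, err := net.LookupIP("my-alb-1234567890.us-east-1.elb.amazonaws.com")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, ip := range ips {
		fmt.Println("ALB node IP:", ip)
	}
}
```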
ALB Request Routing: Once traffic hits an ALB node, it terminates the TCP and TLS (if HTTPS) connection. It then chooses a target from the configured target group (which could be EC2 instances, containers (IP targets), or Lambda functions). The default algorithm for ALB is round-robin for HTTP/1.1 requests. For HTTP/2/gRPC, ALB can maintain multiple requests over one connection, and recently AWS enabled end-to-end HTTP/2 so that ALB can speak HTTP/2 to your targets (useful for gRPC services)[31]. This end-to-end HTTP/2 support means ALB no longer has to downgrade to HTTP/1.1 on the backend: it can forward gRPC calls natively, performing health checks that understand gRPC codes and providing gRPC-specific metrics[31][32]. ALB also supports features like sticky sessions (session affinity via cookies), WebSockets, and content-based routing rules. With content routing, ALB can host multiple applications on one LB and direct requests by URL path or host header to different target groups. ALB does not as of 2025 support HTTP/3 for client connections – AWS instead added HTTP/3 at the CDN layer (CloudFront) – so ALB client connections use HTTP/1.1 or HTTP/2 (with TLS). If HTTP/3 is needed end-to-end, AWS suggests using CloudFront in front (CloudFront has HTTP/3 to the viewer and then uses HTTP/1.1 to ALB)[33].
Scaling and Fault Tolerance: AWS ALB’s scaling is managed by AWS’s control plane; it can scale quickly but in extremely sudden traffic scenarios (like very spiky traffic or immediate regional failover) AWS has sometimes recommended pre-warming or a sharding approach[34][35]. Sharding here means deploying multiple ALBs behind a DNS round-robin or Route 53 weighted record to split traffic if one ALB cannot scale single-handedly to the desired level[36][37]. In practice, an ALB can handle millions of requests per second, but it has an upper limit (AWS notes the maximum ~100 nodes) beyond which adding a second ALB is prudent for very large workloads[38][39]. ALB is also multi-AZ: each AZ has at least one LB node, so if an AZ goes down, clients connecting via the other AZs’ IPs continue to be served. The Route 53 integration ensures DNS only returns healthy IPs of ALB nodes; unhealthy nodes or AZs are removed from DNS[40].
Network Load Balancer (NLB): The NLB is AWS’s high-performance Layer 4 load balancer. It’s designed for ultra-low latency and huge scale, capable of tens of millions of requests per second with stable performance[41][42]. Architecturally, the NLB is quite different from ALB: it’s a pass-through, connection-based balancer that does not terminate TCP/UDP but rather steers traffic at the packet level. When you create an NLB, AWS allocates a static IP per AZ (you can also bring your own Elastic IPs)[43]. These IPs are what the NLB’s DNS name resolves to. The NLB is deeply integrated with the AWS cloud network: under the hood, it uses AWS Hyperplane, a distributed network function platform deployed in each AZ[44]. Hyperplane maintains a fleet of “load balancer” engines that run on AWS-managed infrastructure in each AZ, mapping connections to targets. An NLB is essentially a logical construct that, in each AZ, has an array of Hyperplane-powered load balancer instances (ENIs) to handle traffic[45]. Because NLB works at the connection level, it preserves the original client IP and port (no proxy)[46]. This is important: servers see the real client IP, so applications can apply IP-based logic without needing X-Forwarded-For or proxy protocol[46]. It also allows using normal security groups on the targets since the source IP is the actual client.
Hyperplane and NLB Behavior: AWS Hyperplane (which powers NLB, NAT Gateways, etc.) is a highly scalable network virtualization system. Instead of a traditional load balancer appliance, Hyperplane is more like a distributed hash table of connections. The control plane writes configuration (like the list of target IPs for the NLB) into a datastore, and Hyperplane’s data plane instances in each AZ pick up that config (AWS uses a technique where the control plane publishes config to S3 and the Hyperplane nodes poll it, as described in AWS’s architecture library[47][48]). When a new flow arrives at an NLB’s IP, the Hyperplane data plane node will select a target from the target group (likely using a flow-hash algorithm to consistently send all packets of a connection to the same target) and direct the traffic. Target selection is often done per flow via hashing (like 5-tuple hashing) for even distribution. Indeed, AWS mentions that NLB attempts to keep connections from the same client going to the same AZ for performance (a feature called Zonal isolation)[49]. This means if a client connects repeatedly, NLB will try to use the AZ nearest to that client’s IP for all connections, to reduce cross-AZ hops and improve latency. If targets in that AZ fail, it will fail over to targets in another AZ. NLB also has built-in failover across regions: you can set up an NLB with an attached Route 53 health check so that if an entire region’s NLB is unhealthy, Route 53 will steer traffic to an alternate region’s NLB[50]. This provides a DNS-based global failover using NLB, though it’s not an active-active global LB in the way Google’s anycast system is.
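As a rough sketch of per-flow target selection, the following hashes a 5-tuple so that all packets of a connection map to the same target; the FNV hash and string encoding here are placeholders, not what Hyperplane actually uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Flow is the classic 5-tuple that identifies a TCP/UDP connection.
type Flow struct {
	SrcIP, DstIP     string
	SrcPort, DstPort uint16
	Proto            string
}

// pickTarget hashes the 5-tuple so every packet of a connection lands on
// the same backend, with no per-connection state kept by the balancer.
func pickTarget(f Flow, targets []string) string {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s|%s|%d|%d|%s", f.SrcIP, f.DstIP, f.SrcPort, f.DstPort, f.Proto)
	return targets[h.Sum32()%uint32(len(targets))]
}

func main() {
	targets := []string{"10.0.1.10", "10.0.2.10", "10.0.3.10"}
	f := Flow{"198.51.100.7", "203.0.113.10", 51334, 443, "tcp"}
	fmt.Println(pickTarget(f, targets)) // the same flow always hashes to the same target
}
```

Note that plain modulo hashing remaps many flows when the target list changes, which is why systems like Maglev layer consistent hashing and connection tracking on top.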
Performance and Use Cases: NLB’s design allows very long-lived connections (months or more)[46], useful for IoT or gaming scenarios with persistent sockets. It’s also one of the few ways to load balance non-HTTP protocols (e.g. arbitrary TCP or UDP services) in AWS. Because it’s connection-oriented and doesn’t parse higher protocols, NLB is extremely fast – essentially performing routing/NAT at scale. AWS tested NLB with over 3 million requests per second and 30 Gbps of traffic in a demo, and it handled it “with ease”[51]. Such capacity is achieved without users needing to pre-warm; the distributed architecture adds capacity behind the scenes. One trade-off is that NLB is a simpler load balancer – no content-based routing or HTTP-specific features. For example, it cannot do SSL offload or inject cookies. It forwards traffic as-is (though it can optionally perform TLS termination if you use TLS listeners on NLB, introduced later, primarily for proxying to TCP-only backends with encryption).
Global Load Balancing: AWS, unlike Google, doesn’t have a single global anycast load balancer for user applications at L7. Instead, AWS encourages use of Amazon Route 53 for global DNS load balancing across regions, or AWS Global Accelerator which provides anycast static IPs that then route to regional endpoints. Another approach for HTTP workloads is to use Amazon CloudFront (a CDN) as a global edge, which can distribute load to origins in different regions based on latency or geography. CloudFront essentially acts as a globally distributed HTTP(S) proxy (with caching), somewhat akin to Azure Front Door or Google’s global LB, though typically considered a CDN first. Global Accelerator, on the other hand, provides anycast IPs that then tunnel traffic via AWS’s network to NLB or ALB endpoints in designated regions. These solutions can be mentioned as part of AWS’s global LB strategy: e.g., Global Accelerator gives you the single anycast IP convenience, while under the hood it uses AWS’s edge network to distribute to regional load balancers.
Protocol Support: ALB supports HTTP/1.1 and HTTP/2 for client connections (HTTPS listeners) natively[52]. It can also handle WebSockets. With the 2020 update, ALB supports gRPC (which runs over HTTP/2) end-to-end[31]. HTTP/3 is not supported on ALB as of 2025 – AWS’s HTTP/3 support is currently in CloudFront distributions[53]. NLB, being L4, supports TCP and UDP (and some protocols on top, like TLS or SMTP, as raw TCP). One interesting note: since HTTP/3 uses UDP, an NLB could forward QUIC traffic but it would treat it like any UDP datagram stream without understanding QUIC; CloudFront’s approach has been to implement QUIC in software at the edge instead.
Microsoft Azure Load Balancing Options
Microsoft Azure provides a suite of load balancing services for different needs. Azure’s design splits load balancing by scope (regional vs global) and layer (L4 vs L7)[54][55]. The primary services are Azure Load Balancer, Azure Application Gateway, Azure Front Door, and Azure Traffic Manager, each with a distinct role[56].
- Azure Load Balancer – a high-performance Layer 4 load balancer for TCP and UDP, primarily within a region (though it also supports a cross-region mode). It’s analogous to AWS NLB or Google’s network load balancer, focusing on raw throughput and low latency. Azure Load Balancer is designed to handle millions of requests per second with ultra-low latency, and is zone-redundant by default (spanning Availability Zones for resiliency)[57]. This service is typically used to distribute traffic to VM scale sets, container instances, or other IaaS endpoints in a single region. It operates at the network level, so it preserves client IP (direct server return or SNAT depending on config), and does not do any packet inspection beyond L4. A key feature is that it supports both regional and cross-region deployments: a regional load balancer balances among backends in one region, whereas a cross-region load balancer can act as a wrapper that balances among regional load balancers in different regions for global resiliency[57]. (Cross-region LB uses your regional LBs as the unit of balancing, providing a single global VIP that routes to the “closest” or healthiest region – effectively Azure’s anycast-like solution at L4). Azure’s L4 LB is often used under the hood for services like AKS (Kubernetes) to provide service IPs.
- Azure Application Gateway – a Layer 7 web traffic load balancer and web application firewall (WAF). Application Gateway is similar in concept to AWS ALB, terminating HTTP/HTTPS and routing requests based on URL path, host header, etc. It’s a regional service (deployed in one region’s VNet)[58]. Application Gateway supports features like TLS/SSL offloading, cookie-based session affinity, URL-based routing, and it can integrate with Azure’s Web Application Firewall to protect against OWASP threats. You use App Gateway to expose web applications securely, often fronting web servers in a private subnet. A typical use-case is inbound traffic from the internet hitting an Application Gateway in a DMZ subnet, which then forwards to VMs or containers in a protected network. Application Gateway can be scaled to multiple instances (it supports autoscaling mode) and is zone-aware. It now also supports HTTP/2 (both to clients and to backend servers) and even WebSocket and gRPC in newer versions. As of late 2023, HTTP/3 (QUIC) support for Application Gateway is in preview[59] – meaning Azure is testing QUIC support so that eventually App Gateway listeners can accept HTTP/3 from clients. (This is expected to become generally available in the near future, bringing Azure on par with Google and AWS’s CDN in QUIC support). Internally, App Gateway is implemented as a cluster of Azure-managed VMs running a proxy. Microsoft’s earlier research (like the Ananta project) influenced this design: Azure’s original L4 load balancer was software on every host (Ananta), and for L7 they built gateways that provide more advanced features.
- Azure Front Door – a global, scalable Layer 7 entry point service, often described as an application delivery network. Front Door provides global HTTP(S) load balancing and acceleration, acting much like a CDN + smart load balancer for web applications[60]. When you use Front Door, you get a global anycast endpoint (similar to CloudFront or Google’s global LB) on which Microsoft’s edge locations around the world accept traffic. Front Door then routes traffic to your backend pools, which could be in different Azure regions (or even on-premises). It offers features such as TLS offload, path-based routing, fast failover between regions, and can incorporate caching for static content. Essentially, Front Door can ensure users connect to a location nearest them, and if one region’s backend is down, Front Door can automatically fail over to another region with minimal delay (because it’s not dependent on DNS TTLs like Traffic Manager)[60][61]. Front Door is ideal for high-availability web applications that are deployed in multiple regions – it provides instant global failover and performance optimization via Microsoft’s global edge network. (Front Door was originally distinct from Azure CDN, but nowadays Azure’s “Edge Network” has converged such that Front Door can also do CDN tasks. In the context of load balancing, it’s the global L7 load balancer option in Azure’s portfolio). Regarding new protocols: Front Door as of 2025 does not yet support HTTP/3 for incoming traffic – it’s something Microsoft has mentioned experimenting with, but official support lags behind (HTTP/3 on Front Door isn’t generally available at the time of writing). So, Front Door handles HTTP/2 and HTTP/1.1 to clients, and then connects to origins likely using HTTP/2 or HTTP/1 depending on configuration.
- Azure Traffic Manager – a DNS-based global load balancer (similar to what AWS Route 53 or GCP Cloud DNS can do with routing policies). Traffic Manager is not a proxy or a gateway; it simply responds to DNS queries with a selected endpoint’s IP based on policies (e.g. geographic routing, priority (failover) routing, or weighted round-robin). Because it works at the DNS level, it can direct clients to different Azure regions or external endpoints according to health or performance metrics[62]. The downside is the inherent latency in DNS changes – failover isn’t instantaneous (DNS TTLs must expire, which is why Microsoft notes it “can’t fail over as quickly as Front Door” due to DNS caching effects[62]). Traffic Manager is often used for non-HTTP services or as a simple failover mechanism across globally distributed deployments when you don’t need the full proxy layer.
To summarize Azure’s approach: regional L4 load balancing is handled by Azure Load Balancer (fast, multi-protocol, native to VNet), regional L7 by Application Gateway (with WAF and rich HTTP features), global L7 by Front Door (for anycast-driven, fast failover and CDN capabilities), and global any-protocol by Traffic Manager (DNS redirection). These can also be combined (for example, one common pattern is using Front Door globally to route to App Gateways in each region, which then talk to VMs – combining global and regional L7 layers).
Azure’s implementation details: Azure’s L4 Load Balancer, like others, uses distributed software. The 2013 Ananta paper (from MSR) described the design: every host runs an agent, and top-of-rack switches use ECMP to distribute flows among the hosts acting as software load-balancing MUXes. Newer enhancements likely incorporate SDN and perhaps programmable offloads. Azure Load Balancer supports features like floating IP/direct server return (DSR) where the LB does not SNAT outbound traffic, allowing responses to go directly back to clients for certain configurations – this improves performance by bypassing the LB on the response path (at the cost of some complexity on the NIC/IP config). It also supports outbound SNAT for letting VMs without public IPs initiate connections out to the internet.
HTTP/2 support is available on Application Gateway (it can accept HTTP/2 from clients and speak HTTP/2 to backends)[63]. Azure Front Door also supports HTTP/2 to clients. As for HTTP/3: as noted, it is in preview for App Gateway as of 2024 and not yet available in Front Door. Azure did, however, enable TLS 1.3 on Front Door and is likely to add QUIC in the future to stay competitive.
Integration: Azure’s load balancers integrate with other Azure services – for instance, AKS (Kubernetes) can provision an Azure Load Balancer for Services of type LoadBalancer. App Gateway can integrate with Azure Container Instances or App Services as backends. Front Door can directly front Azure Storage static websites or App Service apps across regions. This tight integration means Azure users often use a combination (e.g., Traffic Manager plus App Gateway) to achieve multi-region redundancy with application-layer awareness.
Key Load-Balancing Algorithms and Techniques
Under the hood, all load balancers use one or more algorithms to decide how to distribute incoming requests or connections to backend servers. The choice of algorithm can impact load distribution fairness, cache utilization, latency, and failure resilience. Here we outline the common load-balancing algorithms and where they are used:
- Round Robin (Weighted Round Robin): Perhaps the simplest approach – the load balancer cycles through the list of backends in order, assigning each new request to the next server. This tends to distribute requests evenly, assuming equal server capacity. Most software LBs implement round robin and also weighted round robin, where each server is given a weight proportional to its capacity and appears more often in the rotation[64]. For example, Envoy’s default policy is weighted round robin[64], and NGINX, HAProxy, and others default to round robin if not specified. Round robin is stateless and simple, but doesn’t account for current load – if one server becomes slow, round robin will still send it an equal share of requests (unless combined with active health checks or weights).
- Least Connections / Least Requests: This family of algorithms directs traffic to the server with the fewest active connections or requests at the moment. The idea is to dynamically balance load by considering server load. Least Connections (common in L4 LBs) looks at TCP connection counts; Least Requests (L7 context) might look at how many HTTP requests each server is currently handling. A challenge is that not all requests are equal in cost. Envoy implements an improved variant called Least Request with two random choices (also known as P2C, “power of two choices”)[65]. In P2C, the load balancer picks two random healthy servers and compares their active request counts, choosing the one with fewer requests[65] (a minimal sketch appears after this list). This approach has been proven to achieve near-optimal load balancing with much less overhead than examining all servers[65]. It avoids a thundering herd picking the same “least busy” server by randomization. If weights differ, Envoy adjusts by combining weighted round robin with the least-request calculation to avoid overloading smaller servers[66]. HAProxy and NGINX have similar least_conn algorithms, and AWS ALB internally uses a least outstanding requests approach for HTTP/2 (since with HTTP/2, a single connection can carry many requests, ALB tries to keep the number of concurrent requests per target balanced).
- Hash-Based (Consistent Hashing): In some scenarios, you want the same client or session to consistently go to the same backend (for example, to utilize in-memory cache locality or for session stickiness without cookies). Hashing algorithms achieve this by mapping an attribute of the request (IP address, HTTP cookie, etc.) via a hash function to a specific server. A popular technique is consistent hashing, which minimizes disruption when servers are added or removed. Each server is assigned spots on a hash ring, and a request’s hash (say of client IP) is mapped to the nearest server on the ring[67]. This ensures a form of sticky routing. Envoy provides a Ring Hash load balancer for consistent hashing (similar to the classic Ketama algorithm)[67]; a minimal ring-hash sketch appears at the end of this section. Another approach is Maglev hashing, which uses a fixed-size lookup table (as discussed earlier) to evenly spread keys among servers[15]. Envoy actually has an option for Maglev load balancing, implementing the algorithm from Google’s paper with a 65537-size table[15]. Consistent hashing is commonly used for session persistence (e.g., “source IP affinity” in Azure Load Balancer, or AWS NLB’s “sticky sessions” for TCP, can be seen as a consistent hash on source IP and port). It’s also used in distributed caches and CDNs to keep content consistently routed. The trade-off is that pure hashing doesn’t consider real-time load – a server could become hot, yet hashing will keep sending specific clients there unless combined with other logic.
- Random Choice: A very simple method: pick a random backend for each request. Pure random can in theory lead to imbalance (though the law of large numbers helps when there are many requests). However, in practice, random is often combined with other logic (like the P2C algorithm above, which starts with random picks). AWS’s Route 53 DNS LB, for example, returns up to 8 random healthy IPs for ALB[68] to spread load from clients.
- Weighted Routing / Traffic Splitting: This is not a load-distribution algorithm per se but a technique to deliberately skew traffic distribution. Many load balancers allow assigning a weight to each server (or each server pool) and then traffic is split accordingly (proportionally or by probability). This is used for blue-green deployments or canary releases – e.g., send 5% of traffic to a new version and 95% to the stable version. GCP’s Envoy-based LB supports weighted traffic splitting[18], AWS ALB supports target group weight splitting (in an ALB listener rule you can forward to multiple target groups with weights), and Azure Front Door supports routing rules with weight. This is typically implemented by treating weights as probabilities in the selection algorithm.
- Priority/Fallback: Some systems support a priority list rather than true load distribution. For example, “priority-based” load balancing sends 100% traffic to the highest priority backends, and only if none are healthy does it fail over to the next priority[69]. Azure’s API Management mentions priority mode[69]. This isn’t load balancing for scale, but for high availability – ensuring a backup pool only handles traffic if the primary is down.
- Adaptive Load Balancing: This is an emerging area where load balancers use more than simple counts – e.g., actual response latency or queue lengths – to choose a target. Some LBs can measure response times and direct new requests to the server with the fastest response (with the assumption that faster = less loaded). NGINX Plus, for example, has a mode called Least Time (choose the server with lowest average response time). Envoy can utilize outlier detection (temporarily remove a server that exhibits high latency or error rates). These features effectively adjust the pool rather than the base algorithm. Outlier detection in Envoy will eject hosts that exceed error rate thresholds or latency p99, to avoid sending traffic to a bad instance[70]. This improves overall quality by not strictly balancing to an overloaded (or failing) server.
- Connection Multiplexing and Batching: Some newer techniques aren’t “who to pick” but how to send. For example, HTTP/2 and HTTP/3 allow many requests on one connection. A load balancer could choose to pool connections to backends such that multiple client streams get merged onto one backend connection (to reduce connection overhead). Envoy does connection pooling and can overlap streams. Another example: if a service mesh is doing client-side load balancing, it might send batches of requests to one server until that server’s queue is somewhat filled, then move to the next – to optimize CPU cache usage. These are specialized strategies often internal to proxies.
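To make the least-request idea from the list above concrete, here is a minimal power-of-two-choices sketch. It ignores weights and health status, and unlike production implementations it does not force the two random samples to be distinct hosts.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Backend tracks the number of requests currently in flight.
type Backend struct {
	Addr     string
	InFlight int
}

// pickP2C implements "power of two choices": sample two backends at random
// and take the one with fewer outstanding requests.
func pickP2C(backends []*Backend) *Backend {
	a := backends[rand.Intn(len(backends))]
	b := backends[rand.Intn(len(backends))]
	if b.InFlight < a.InFlight {
		return b
	}
	return a
}

func main() {
	pool := []*Backend{
		{Addr: "10.0.0.1"}, {Addr: "10.0.0.2"}, {Addr: "10.0.0.3"},
	}
	for i := 0; i < 5; i++ {
		b := pickP2C(pool)
		b.InFlight++ // a real caller would decrement when the request completes
		fmt.Println("dispatch to", b.Addr)
	}
}
```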
Many real-world systems use a combination: e.g., Maglev does consistent hashing plus connection tracking to handle flow affinity and failures[11]. AWS NLB’s flow hashing is coupled with health checks and multi-AZ failover logic. ALBs use weighted round robin under the hood but might adjust weights based on instance health or slow-start (new targets start with lower weight until warm). Slow start is another technique: when a server instance is added (or comes back from failure), some load balancers implement a ramp-up period during which that server only gets a gradually increasing portion of traffic (so it can warm caches, etc., without being overloaded immediately). HAProxy supports slow start for servers, for instance.
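A minimal sketch of the slow-start idea mentioned above, assuming a simple linear ramp of the configured weight over a fixed warm-up window (real implementations differ in ramp shape and minimum weight):

```go
package main

import (
	"fmt"
	"time"
)

// effectiveWeight scales a server's configured weight linearly over a
// warm-up window after it joins (or rejoins) the pool, so it is not hit
// with full load while its caches are still cold.
func effectiveWeight(configured int, joined time.Time, warmup time.Duration) int {
	age := time.Since(joined)
	if age >= warmup {
		return configured
	}
	w := int(float64(configured) * float64(age) / float64(warmup))
	if w < 1 {
		w = 1 // always give the server at least a trickle of traffic
	}
	return w
}

func main() {
	joined := time.Now().Add(-30 * time.Second)
	fmt.Println(effectiveWeight(100, joined, 2*time.Minute)) // ~25 after 30s of a 2m warm-up
}
```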
In summary, the algorithmic choice can be tuned to the use-case: round robin for simplicity and broad even distribution, least connections/requests for reactive load-aware distribution, consistent hashing for stickiness, weighted splitting for traffic management, and power-of-two random choices as a robust, efficient way to approximate least-loaded behavior[65]. Modern load balancers often make these pluggable or even automatically choose between them based on context (Envoy allows specifying different policies per cluster[71]). Also, layer matters: at L4, you often simply hash 5-tuple (source/dest IP/port) to keep a connection on one server (Maglev and Azure’s L4 LB do this for flow consistency). At L7, you have more flexibility per request.
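For completeness, here is a minimal ring-hash (Ketama-style) sketch of the consistent-hashing approach described earlier, using FNV hashing and a fixed number of virtual nodes per server; both are placeholder choices.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring maps many virtual points per server onto a hash circle; a key is
// served by the first point clockwise from its own hash, so adding or
// removing one server only remaps the keys that fell on its points.
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(servers []string, vnodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, s := range servers {
		for v := 0; v < vnodes; v++ {
			p := hash32(fmt.Sprintf("%s#%d", s, v))
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

func (r *ring) pick(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the circle
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"cache-a", "cache-b", "cache-c"}, 100)
	fmt.Println(r.pick("client-198.51.100.7")) // the same key always maps to the same server
}
```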
Open-Source Load Balancing Solutions and Trends
Beyond cloud providers’ managed services, the load balancing field has a rich landscape of open-source software and innovative techniques. These are used on-premises, within cloud VMs, or even under the hood of some cloud services. We’ll highlight a few major open-source load balancers – Envoy, HAProxy, NGINX – and some emerging trends like eBPF-based load balancing.
Envoy Proxy (L7 Proxy and Service Mesh LB)
Envoy is a high-performance Layer 7 proxy open-sourced by Lyft (now a CNCF graduated project) that has rapidly become a centerpiece of modern architectures. Envoy is designed for cloud-native environments and provides advanced load balancing features out of the box. It operates as a reverse proxy (handling inbound requests to a service) and is commonly used in two scenarios: as an edge/API gateway, and as a sidecar proxy in service mesh (handling service-to-service traffic with load balancing and telemetry).
Envoy’s load balancing capabilities are extensive: it supports all the algorithms discussed (round robin, least request, ring-hash, Maglev, etc.) via configuration[64][65]. By default, Envoy uses a Least Request (P2C) algorithm for HTTP/2 traffic, which is well-suited to gRPC and microservices traffic (this prevents slow endpoints from accumulating too many outstanding requests)[72]. Envoy can be configured to do circuit breaking (limit the number of concurrent requests per upstream to avoid overload) and outlier detection (automatically remove unhealthy hosts)[73]. These features let Envoy not just balance load, but also shield clients from bad endpoints.
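As an illustration of the outlier-detection idea (a toy model, not Envoy's implementation), a per-host error-rate ejector might look like this:

```go
package main

import (
	"fmt"
	"time"
)

// hostStats tracks recent results for one upstream host.
type hostStats struct {
	total, errors int
	ejectedUntil  time.Time
}

// record updates counters and ejects the host for a cool-off period once
// its error rate over the current window crosses the threshold.
func (h *hostStats) record(success bool, threshold float64, ejectFor time.Duration) {
	h.total++
	if !success {
		h.errors++
	}
	if h.total >= 20 && float64(h.errors)/float64(h.total) > threshold {
		h.ejectedUntil = time.Now().Add(ejectFor)
		h.total, h.errors = 0, 0 // start a fresh window after ejection
	}
}

func (h *hostStats) healthy() bool { return time.Now().After(h.ejectedUntil) }

func main() {
	h := &hostStats{}
	for i := 0; i < 25; i++ {
		h.record(i%2 == 0, 0.4, 30*time.Second) // ~50% errors: crosses the 40% threshold
	}
	fmt.Println("host eligible for traffic:", h.healthy()) // false while ejected
}
```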
One reason Envoy is popular is its dynamic configuration and integration with service discovery. In a microservices platform, the set of healthy instances for “service X” may be constantly changing (scaling up/down, deploying). Envoy supports a dynamic API (xDS protocol) to update its cluster membership and routing rules on the fly. This makes it ideal as the data plane in a service mesh (e.g., Istio uses Envoy sidecars) – each service instance has an Envoy that knows about all instances of other services and balances calls among them. Envoy does client-side load balancing in this context (the sidecar Envoy on the client picks a server instance and connects directly), which can reduce load on a central LB and improve latency.
Envoy also excels at L7 routing: it can route or rewrite requests based on HTTP headers, paths, query params, etc. This goes beyond traditional load balancing into API gateway territory. For example, you can configure Envoy to send /api/v1/* traffic to one cluster and /api/v2/* to another. It can also mirror traffic to a shadow service or do blue/green deploys with percentage-based routing. These capabilities have influenced cloud LBs (as seen by Google’s adoption of Envoy for advanced LB features[18]).
In terms of protocol support: Envoy fully supports HTTP/2 and gRPC, and as of Envoy v1.18+ it has support for HTTP/3 (QUIC) in beta/experimental form, moving toward stable. By Envoy v1.22, QUIC/HTTP3 for downstream connections became generally available (disabled by default, but configurable)[74]. This means Envoy can act as a QUIC server, terminating HTTP/3 from clients. It can also originate HTTP/3 to upstreams if desired, though in practice many deployments terminate at Envoy then use HTTP/2 to backends. Envoy’s HTTP/3 implementation uses QUICHE (Google’s QUIC library) or similar under the hood.
Finally, Envoy’s footprint: It’s written in C++ for performance, and while capable of handling tens of thousands of connections, it is a user-space proxy, so per-node performance is less than something in-kernel. However, it’s usually horizontally scaled or embedded everywhere (e.g., a thousand Envoy sidecars rather than a couple big load balancers). Envoy has effectively become the standard L7 load balancer for cloud-native applications, with broad adoption (e.g., AWS App Mesh and Google Traffic Director both use Envoy proxies, and even Cloudflare uses Envoy in some products).
HAProxy
HAProxy is a legendary open-source load balancer and proxy that has been widely used for two decades. Known for its reliability and extremely efficient C implementation, HAProxy is often found in high-traffic web services as the frontend load balancer. It operates at both Layer 4 and Layer 7. At L4 it can do TCP forwarding with features like connection limits and health checks, and at L7 it speaks HTTP and can do content switching, header manipulation, etc.
Key features of HAProxy include: high performance (it uses an event-driven, single-threaded per-process model, with newer versions supporting multi-threading and even multi-process for scaling on multi-core systems), and a rich set of load balancing algorithms and options. HAProxy supports round robin, least connections, source IP hashing, URI hashing, etc., and can apply weights. It also has advanced features like “slow start” (gradually ramp up traffic to a server when it joins) and connection stickiness (via cookies or source IP). It is highly tunable with a configuration file.
HAProxy has kept up with protocol developments. It has supported HTTP/2 for several years (including acting as an HTTP/2 to HTTP/1 gateway for backends that only support 1.1). Regarding HTTP/3/QUIC, HAProxy added support in its development branches and it’s available in the community as of version 2.6+ (though marked as experimental). Enabling QUIC in HAProxy requires building with a QUIC-enabled TLS library (e.g., quictls) and then HAProxy can accept HTTP/3 on frontend connections[75][76]. In HAProxy Enterprise 2.7 (a commercial version), QUIC support is fully integrated and production-ready[77][75]. This makes HAProxy one of the first enterprise-grade load balancers to offer QUIC. Essentially, HAProxy can now terminate QUIC and convert to HTTP/1 or 2 for the backend, similar to how CDNs do. On the backend side, there’s work in progress to allow HAProxy to connect to servers via QUIC if needed (for now, it’s primarily for frontend).
Another notable feature of HAProxy is its observability and control: It provides detailed stats, a runtime API, and even a built-in minimalistic GUI for monitoring. You can see per-backend connection counts, queue lengths, etc., and it supports sending logs of each request (with timing breakdowns). It also has a proven track record in terms of security (common in front of banking apps, etc.) and in handling edge cases (like it can proxy arbitrary TCP with its TCP mode which some use for database proxies too).
Because of its maturity, HAProxy often appears in comparisons with cloud LBs. For instance, AWS’s ALB is often contrasted with HAProxy or NGINX for feature differences. Some companies still run HAProxy on VMs or containers in the cloud to get more control or specific capabilities, even though managed LBs exist.
NGINX
NGINX, while often thought of as a web server, is also a very capable load balancer (and reverse proxy). NGINX (open source) and its commercial variant NGINX Plus are widely used to distribute HTTP and TCP/UDP traffic. Many Kubernetes ingress controllers are built on NGINX, attesting to its popularity in cloud deployments.
As a load balancer, NGINX supports multiple algorithms: round robin (default), least_conn, and ip_hash (for session persistence by source IP). NGINX Plus (commercial) adds more, like least time (smallest latency). NGINX is known for its high performance and low memory usage, achieved through an event-loop, asynchronous processing model (the famed “Nginx architecture” with worker processes and an efficient kernel polling mechanism).
NGINX can be deployed in a highly available pair (using VRRP or similar for a virtual IP) for on-prem scenarios. In cloud, typically one would put NGINX behind a cloud network LB to get HA.
NGINX also provides layer 7 routing (different requests can be routed to different upstream groups based on request attributes). It has a rich configuration language for modifying headers, rewriting URLs, etc., so it often doubles as both load balancer and basic API gateway.
In terms of new protocols: HTTP/2 has been supported on NGINX for a long time (since ~2015 for clients, and it can proxy to backends via HTTP/2 with some tuning). HTTP/3 support in NGINX is relatively recent. In the open-source NGINX, HTTP/3 (QUIC) support was added as a technology preview in 1.25.0[78]. By now (2025), NGINX’s mainline branch does have QUIC support available (though one might need to enable some compile-time options). This means NGINX can accept QUIC connections and then likely proxy to backends over TCP. NGINX’s QUIC implementation is still maturing, but it’s an active area given QUIC’s rise.
One difference to highlight is that NGINX (open source) does not come with built-in active health checks in the free version – it will proxy to all configured servers and if a connection fails it will mark it down after the fact. NGINX Plus adds a health monitoring feature. HAProxy and Envoy, on the other hand, have active health checking built-in in their open versions.
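Active health checking itself is conceptually simple; here is a toy sketch that assumes each backend exposes a /healthz endpoint (the endpoint name and probe policy are placeholders):

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// healthChecker probes each backend's health endpoint on an interval and
// keeps a thread-safe up/down map that the balancing loop can consult.
type healthChecker struct {
	mu sync.RWMutex
	up map[string]bool
}

func (hc *healthChecker) run(backends []string, interval time.Duration) {
	client := &http.Client{Timeout: 2 * time.Second}
	for {
		for _, b := range backends {
			resp, err := client.Get("http://" + b + "/healthz") // /healthz is an assumed endpoint
			ok := err == nil && resp.StatusCode == http.StatusOK
			if resp != nil {
				resp.Body.Close()
			}
			hc.mu.Lock()
			hc.up[b] = ok
			hc.mu.Unlock()
		}
		time.Sleep(interval)
	}
}

func main() {
	hc := &healthChecker{up: map[string]bool{}}
	go hc.run([]string{"10.0.0.1:8080", "10.0.0.2:8080"}, 5*time.Second)
	time.Sleep(3 * time.Second)
	hc.mu.RLock()
	fmt.Println(hc.up)
	hc.mu.RUnlock()
}
```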
NGINX, like HAProxy, can also serve as an SSL/TLS terminator with certificates and can do TLS session reuse, etc., which makes it efficient for HTTPS load balancing. Both support SNI for hosting multiple domains.
eBPF and Kernel-Level Load Balancing (Katran, XDP, IPVS)
A major trend in load balancing is leveraging the kernel or even bypassing the kernel to achieve higher performance and lower latency. eBPF (extended Berkeley Packet Filter) technology in Linux has opened new possibilities for in-kernel load balancing logic that’s more flexible than traditional iptables/IPVS.
One notable project is Katran, open-sourced by Facebook (Meta). Katran is a high-performance L4 load balancer that uses eBPF and the XDP (Express Data Path) feature of Linux to process packets at the lowest level (very close to the network driver)[79][80]. Katran runs as part of the Linux kernel on standard servers and can handle huge packet rates with minimal CPU by avoiding much of the kernel’s network stack overhead. Facebook deploys Katran on servers in their Points of Presence (PoPs) worldwide as their layer 4 load balancer for user traffic[81][82]. The approach they describe is: advertise a VIP via BGP (anycast) at each POP, use ECMP to spray incoming packets to a set of servers running Katran in that POP (similar to Google’s Maglev ECMP approach)[83][84]. Katran’s eBPF/XDP code then consistently forwards packets to backend servers (these backends can be on the same physical machines – Katran allows co-locating load balancer and application on one machine to maximize utilization[85]). This co-location was a key improvement over their first-gen IPVS-based LB, which required dedicated machines due to networking constraints[86]. By having an XDP program decide packet forwarding, Katran achieves extremely high throughput (millions of packets per second per server) with low latency and low CPU (since XDP can bypass a lot of kernel overhead). It’s essentially a custom in-kernel forwarding plane tailored for load balancing, taking advantage of “recent innovations in kernel engineering” as the FB engineers put it[79].
Another eBPF-based load balancing approach is within Cilium (an open-source CNI for Kubernetes). Cilium uses eBPF to implement things like Kubernetes services (which are a form of internal load balancing) without iptables. This means that every node can direct traffic to pods across the cluster via an eBPF program that does a hash and rewrite in the kernel. This dramatically improves efficiency for East-West traffic in large clusters.
The Linux kernel also has IPVS (IP Virtual Server) which has been around for a long time. IPVS implements various scheduling algorithms (rr, lc, dh (destination hash), sh (source hash), etc.) at kernel level. It was used for LVS (Linux Virtual Server) deployments and is still an option for Kubernetes kube-proxy. However, IPVS operates at a slightly higher overhead than eBPF and is less flexible in modern environments.
Why kernel-level? Performance. User-space proxies like Envoy/HAProxy are powerful for L7, but for pure L4, each extra context switch and copy adds latency. Kernel bypass techniques like XDP can handle packets faster (XDP can even drop or redirect packets before they go through Linux’s TCP stack). eBPF allows logic that can be updated on the fly (for example, update the backend server pool) without restarting the program. Facebook’s Katran eBPF code can be updated as backends change, effectively fulfilling a role similar to Maglev but on modern Linux.
Cloud providers also use kernel/NIC offloads: Azure was known to offload a lot of LB to the host SDN (and now possibly to SmartNICs with FPGA or custom ASIC on their hosts). AWS’s Hyperplane likely uses a mix of software and hardware acceleration (there’s talk that some packet processing is offloaded to AWS’s custom NICs which have FPGA). Google’s Andromeda uses programmable virtual switches in the host (with some functions offloaded to the NIC in newer versions). So the trend even in managed clouds is pushing as much of the packet steering to high-performance paths (kernel or NIC ASIC) as possible, while doing L7 in user-space where needed.
In open source, we see projects like Metallb (for bare-metal k8s clusters) which uses BGP to announce a service VIP and can use ECMP to spread across nodes. That’s more of a control plane mechanism but pairs with something like IPVS or direct routing for the data plane.
In summary, open-source load balancers give architects building their own systems fine-grained control. Envoy provides a comprehensive L7 solution with rich routing, suitable for modern microservices. HAProxy and NGINX remain go-to solutions for many due to their proven performance and ease of deployment (often just a static config file). And pushing the envelope, technologies like Katran (eBPF/XDP) demonstrate how large-scale operators replace hardware appliances with cleverly engineered software on commodity servers, achieving scalability through distributed architectures and kernel-level optimizations. These advancements from companies like Google and Facebook are being shared (through research papers and open source), so the community as a whole benefits – Envoy adopting Maglev hashing[15], or others adopting the power-of-two choice algorithm etc., all cross-pollinate to make load balancing more effective at scale.
HTTP/2 and HTTP/3: Impact on Load Balancing
The advent of HTTP/2 and HTTP/3 (QUIC) has changed some assumptions in load balancing and introduced both opportunities and challenges for LB implementations. These protocols aim to improve performance with features like multiplexing, header compression, and in QUIC’s case, zero-round-trip handshakes and improved loss recovery. Let’s discuss their applicability and how modern load balancers handle them:
HTTP/2 and Load Balancers
HTTP/2 introduced multiplexing of many requests over a single TCP connection. For load balancers, especially L7 proxies, this means a client might only make one connection to the LB and send many requests through it concurrently. Traditional LBs that forwarded at the connection level (as many L4 LBs do) might inadvertently send all those requests to one backend if unaware of the multiplexing. Layer 4 load balancers, which operate below HTTP, see only a TCP connection – so all of an HTTP/2 client’s streams will go to whichever server was chosen initially for that connection. This can lead to uneven load if one client issues many requests: e.g., 100 HTTP/2 streams from one client all end up on the same server because they share a connection. In contrast, under HTTP/1.1 those might have been spread over multiple separate connections (if the client used multiple).
To address this, Layer 7 load balancers (proxies like ALB, Envoy, etc.) can balance at the request level. An L7 LB terminates the client HTTP/2 connection and can then decide how to dispatch each individual HTTP request (or stream) to backends. In practice, some L7 LBs will create a pool of backend connections (maybe also HTTP/2) and assign streams to different backends potentially. For example, Google Cloud’s HTTP/2 LB likely terminates at the GFE and then issues requests to backends, possibly balancing them. AWS’s ALB when it added HTTP/2 end-to-end, effectively does a similar thing: ALB terminates client H2, then opens one or more H2 connections to each target as needed and distributes streams[31][32]. Envoy can act as an H2 to H2 proxy and will distribute streams among multiple upstream connections if one becomes overwhelmed (it has configuration for max concurrent streams per backend).
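The key property is that an L7 proxy makes a balancing decision per request rather than per connection. A minimal sketch using Go's standard-library reverse proxy, with hypothetical backend addresses, shows the shape of this: the selection function runs once per request, so streams multiplexed on one client connection can fan out to different backends.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

var (
	// Hypothetical backend origins; with HTTP/2 on the client side, many
	// streams can share one client connection yet still be spread out here.
	backends = []*url.URL{
		mustParse("http://10.0.0.1:8080"),
		mustParse("http://10.0.0.2:8080"),
	}
	counter uint64
)

func mustParse(s string) *url.URL {
	u, err := url.Parse(s)
	if err != nil {
		log.Fatal(err)
	}
	return u
}

func main() {
	proxy := &httputil.ReverseProxy{
		// Director runs once per request, so each stream of a multiplexed
		// HTTP/2 connection can be sent to a different backend.
		Director: func(r *http.Request) {
			target := backends[atomic.AddUint64(&counter, 1)%uint64(len(backends))]
			r.URL.Scheme = target.Scheme
			r.URL.Host = target.Host
		},
	}
	log.Fatal(http.ListenAndServe(":8080", proxy))
}
```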
Concurrency: With H2, load balancers pay more attention to the number of outstanding requests per server (hence algorithms like “least requests” become more meaningful than just connection counts). In Envoy’s default, as noted, it uses P2C least-request for HTTP/2 to avoid one server getting too many streams[72]. HAProxy similarly has settings for max concurrent streams to a backend.
HPACK Compression and State: HTTP/2’s header compression (HPACK) means the LB has to manage compression contexts if it’s terminating and re-encoding headers. This is a technical detail, but LBs like Envoy handle it transparently. One caution is that HTTP/2 allows very large header fields in theory, so LBs had to implement sensible limits to avoid compression bombs or memory issues.
Health and Multiplexing: In HTTP/1, a single request failure might be isolated. In HTTP/2, if the TCP connection breaks, it could drop many in-flight requests at once. Load balancers must monitor not just connection health but also handle retrying streams or signaling upstream issues. Some LBs use HTTP/2’s GOAWAY mechanism to gradually shed load from a backend (Envoy can propagate GOAWAY to signal no new streams).
gRPC: gRPC is built on HTTP/2, and load balancers have added support to detect gRPC-specific health and status. AWS ALB can check gRPC health by looking at response codes (it knows about gRPC’s unique status trailers)[32]. Load balancers also need to handle long-lived streams (gRPC bi-directional streams could last minutes or hours). ALB and Envoy both support long-lived HTTP/2 streams – they just treat them somewhat like TCP from a balancing perspective (once a stream is on a server, it stays there).
HTTP/3 (QUIC) and Load Balancers
HTTP/3 uses QUIC over UDP as the transport. This introduces a new layer for load balancers to consider:
- Layer 4 (UDP) load balancers: Many network load balancers already support UDP (e.g., AWS NLB, Azure Load Balancer) and treat it similarly to TCP in terms of hashing flows. QUIC connections are identified by the 5-tuple (source IP, source port, dest IP, dest port, protocol), plus a QUIC connection ID at a higher level. A pure L4 LB will usually hash src/dest IP and port (and protocol) to pick a backend. That means all QUIC packets of a connection will go to one server – which is necessary because QUIC runs in user-space on that server managing its own streams. So existing L4 load balancers can handle QUIC traffic distribution with no special awareness (just like any UDP). However, certain QUIC features like connection migration (a client changing IP addresses mid-connection, which QUIC supports via connection IDs) could confuse a dumb L4 LB – if the source IP changes, the 5-tuple hash changes, and the LB might send the packet to a different backend which doesn’t have the state. Ideally, a QUIC-unaware LB should pin based on QUIC Connection ID instead. Some advanced L4 load balancers (like Facebook’s Katran or Google’s Maglev) could be updated to parse QUIC headers and consistently route based on the connection ID to handle migration. But it’s unclear if cloud NLBs do that yet. In practice, connection migration is not widely used at the moment.
- L7 Load Balancers and QUIC: To fully support HTTP/3, an L7 LB must implement a QUIC endpoint (to terminate the QUIC connection from clients). This is non-trivial because QUIC is quite different from TCP – it involves cryptography in user-space, packet loss recovery (no kernel to rely on), etc. Google and Cloudflare were among the first to do this at scale. Google’s edge (GFEs) originally ran “gQUIC” (Google’s pre-standard QUIC) and now IETF QUIC to serve products like YouTube; they extended that to Cloud Load Balancing and CDN in 2021[23]. AWS added HTTP/3 in CloudFront in 2022, as an edge service[53]. In both cases, the LB (or CDN edge) terminates QUIC and then typically uses a standard protocol to talk to backends. For example, Amazon CloudFront when HTTP/3 is enabled will accept QUIC from clients, but will convert and forward those requests to the origin over HTTP/1.1 (or HTTP/2 if supported)[33][87]. This is explicitly stated: enabling HTTP/3 requires no changes on origin servers because CloudFront continues using HTTP/1.1 to them[33]. The load balancer thus acts as a translator between protocols. Similarly, Google’s Cloud LB when speaking HTTP/3 likely translates to HTTP/2 to communicate with the actual service.
From an architecture perspective, terminating QUIC at the LB has benefits: you get the improved user performance (faster handshakes, less head-of-line blocking) for the edge hops. But you don’t burden the backend infrastructure with QUIC yet, since many server platforms are still optimizing their QUIC support. It’s a pragmatic approach: let the load balancer handle the complexities of QUIC/UDP, and keep the backend using well-established HTTP/2 or even HTTP/1.1.
Performance considerations: HTTP/3 can reduce connection setup latency (no TCP 3-way handshake, no extra TLS round-trip) – this is particularly helpful for new or infrequent connections. A load balancer that handles lots of short connections (like CDN for images) benefits by moving to QUIC. However, load balancers need to manage QUIC’s higher CPU usage (crypto for every packet, etc.). CloudFront uses a custom QUIC implementation in Rust (s2n-quic) optimized for this[88]. They tout faster connection times, improved multiplexing, and connection migration as benefits on unreliable networks[89]. Real-world, enabling HTTP/3 on CloudFront has shown about 10% reduction in latency (time-to-first-byte) for some customers[90].
Stateful vs Stateless LBs: QUIC being UDP might tempt one to treat load balancing differently (e.g., anycast at L4 to multiple servers). But since QUIC is connection-oriented (in user space), load balancers often still need to be stateful or consistent for that connection. If anycast is used (like with Cloudflare or Google global LB for QUIC), the load balancer cluster nodes must coordinate or rely on connection ID routing to ensure once a connection is established with one node, all packets go there. There was also a concept of QUIC-LB (an IETF draft) to allow stateless load balancers that use a mapping of QUIC connection IDs to servers (so LB doesn’t terminate, just directs based on encoded info in the ID). This is still an evolving area and not widely implemented yet.
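A toy sketch of that connection-ID-based routing idea: assume a fixed connection ID length and a made-up encoding in which the first two bytes of the ID carry a server index (the actual QUIC-LB draft defines configurable, optionally encrypted encodings).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const cidLen = 8 // the LB and servers agree on a fixed connection ID length

// routeByCID illustrates the idea behind the QUIC-LB draft in toy form:
// servers encode their own ID in the first bytes of connection IDs they
// issue, so a stateless L4 balancer can keep routing packets to the right
// server (even after the client's IP changes) without terminating QUIC.
func routeByCID(packet []byte, servers []string) (string, error) {
	if len(packet) < 1+cidLen {
		return "", fmt.Errorf("packet too short")
	}
	if packet[0]&0x80 != 0 {
		return "", fmt.Errorf("long header: handled separately during the handshake")
	}
	// Short header: the destination connection ID starts right after the flags byte.
	dcid := packet[1 : 1+cidLen]
	serverID := binary.BigEndian.Uint16(dcid[:2]) // toy encoding: first two bytes carry the server index
	return servers[int(serverID)%len(servers)], nil
}

func main() {
	servers := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	// Fake short-header packet: flags byte (0x40) + an 8-byte DCID whose first two bytes encode server 2.
	pkt := []byte{0x40, 0x00, 0x02, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff}
	dst, err := routeByCID(pkt, servers)
	fmt.Println(dst, err)
}
```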
Edge vs direct: In cloud environments, currently only the edge-focused services (CDNs, global anycast LBs) support HTTP/3. As mentioned, AWS ALB doesn’t, Azure’s main services don’t yet (Application Gateway only in preview), and Google’s external HTTP(S) LB does. For on-prem or open-source deployments, HAProxy and NGINX now let you run your own QUIC-enabled LB; Envoy can as well, though its support is newer. So if you operate your own ingress, you can turn on HTTP/3 and take advantage of it. One has to weigh the benefit: if most clients are HTTP/3-capable (browsers, mobile apps) and your network has latency or loss issues, it’s a win. We are at a point where most major load balancers support HTTP/3 in some fashion – if not GA, then in beta or preview[91] – so we can expect universal HTTP/3 support in load balancers soon.
HTTP/2 Rapid Reset Attack: A recent development (late 2023) was an HTTP/2 vulnerability (CVE-2023-44487, “Rapid Reset”) in which an attacker could abuse HTTP/2’s stream-reset feature to overload servers at minimal cost. This affected many LBs and servers, and cloud providers responded by patching ALB, Cloudflare’s edge, and others. The details aren’t crucial here, but the episode underscores that load balancers must evolve with protocol quirks and security issues. LBs often implement rate limiting or anomaly detection at the protocol layer to mitigate such abuse – see the sketch below.
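As a rough sketch of that kind of protocol-layer mitigation, the hypothetical guard below counts streams that a peer resets immediately after opening them and signals that the connection should be closed once a threshold is exceeded. The type, fields, and threshold are illustrative assumptions; real mitigations in ALB, Cloudflare, and others are considerably more sophisticated.

```go
package main

import "fmt"

// h2ConnGuard tracks Rapid Reset-style abuse on a single HTTP/2 connection.
type h2ConnGuard struct {
	opened     int // streams the peer has opened
	rapidReset int // streams it cancelled almost immediately
	maxResets  int // tolerated rapid resets before closing the connection
}

// onStreamOpened is called when the peer sends HEADERS for a new stream.
func (g *h2ConnGuard) onStreamOpened() { g.opened++ }

// onStreamReset is called when the peer sends RST_STREAM; rapid indicates
// the reset arrived before the server produced any response.
func (g *h2ConnGuard) onStreamReset(rapid bool) (closeConn bool) {
	if rapid {
		g.rapidReset++
	}
	return g.rapidReset > g.maxResets
}

func main() {
	g := &h2ConnGuard{maxResets: 100}
	for i := 0; i < 150; i++ {
		g.onStreamOpened()
		if g.onStreamReset(true) { // simulate a flood of open-then-reset streams
			fmt.Printf("abusive connection: %d rapid resets, closing\n", g.rapidReset)
			break
		}
	}
}
```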
QUIC and CPU Offload: In the future we may see more hardware offload for QUIC in load balancers (much as TLS accelerators exist today). Since QUIC is UDP plus user-space processing, smart NICs may eventually handle parts of it; for now, it’s all done in software.
In summary, HTTP/2 and HTTP/3 have generally improved client-to-LB connection efficiency, and load balancers have adapted by terminating these protocols at the edge and translating to simpler protocols toward the backend. All major cloud and open-source LBs support HTTP/2 (table stakes at this point) and are in the process of supporting HTTP/3 (some already fully do, like Google and CloudFront; others are in progress). The impact of these protocols on load balancing is largely positive: faster client experiences, in many cases with no change needed on servers[33]. It does add complexity to the load balancer itself, but that’s a burden the cloud providers and proxy developers are taking on so that users can benefit transparently.
Conclusion
Load balancing is a cornerstone of resilient, high-scale architectures. The latest techniques emphasize software-defined, distributed designs over traditional hardware, enabling tremendous scalability and flexibility. We saw that Google Cloud leverages its global network and innovations like Maglev consistent hashing and Envoy proxies at the edge to provide worldwide load balancing with advanced traffic control[3][18]. AWS takes a slightly different approach with regional LBs that scale out via DNS, using Hyperplane technology under the hood for extreme performance at L4[44] and content-aware routing at L7, and extending globally through Route 53 and CloudFront. Azure offers a layered portfolio splitting L4 vs L7 and global vs regional load balancing, allowing architects to mix and match services like Front Door and Application Gateway[57][60] to achieve both scale and fine-grained control.
Open-source solutions remain highly relevant: Envoy has become a de facto standard for modern L7 load balancing (and a building block within clouds), bringing features like sophisticated LB algorithms (power-of-two, ring hash, etc.) and easy integration with service meshes. HAProxy and NGINX continue to be workhorses for many, now integrating support for cutting-edge protocols like HTTP/3[75]. And at the very forefront, kernel-bypass and eBPF-based load balancers like Facebook’s Katran are pushing performance limits, demonstrating that with clever engineering, even a single server can handle millions of L4 connections with minimal latency[80][86].
In terms of algorithms, the industry has converged on a few best practices: round robin for simplicity, least request (P2C) for adaptivity with minimal overhead[65], and consistent hashing for stickiness and fault tolerance[11][67]. These algorithms are often augmented by health checks, slow start, and outlier detection to improve reliability and the performance perceived by users.
The ongoing adoption of HTTP/2 and HTTP/3 by load balancers is enabling faster and more reliable client experiences without complicating backend systems[33]. Load balancers are effectively becoming protocol translators and accelerators – for instance, handling QUIC at the edge (reducing latency for the user) but using established protocols internally for steadiness[33]. This trend will likely continue, with load balancers taking on responsibilities like TLS optimization, DDoS absorption, and maybe even application-level routing (functioning almost like an edge compute layer).
For experienced infrastructure engineers and architects, the landscape of load balancing in 2025 offers powerful building blocks. One can combine global anycast delivery, regional load distribution, and application-layer logic to achieve remarkable resilience and performance. The key is understanding the strengths and limits of each approach: e.g., use anycast global LB for lowest latency user connections, but be mindful of connection stickiness; use L7 intelligent routing for modern web APIs, but remember to account for H2/H3 behaviors; consider kernel-level LB if ultra-low latency is needed, but weigh the operational complexity.
The good news is that much of this technology is available as managed services or well-tested open source. By leveraging these, architects can design systems that serve millions of users with high availability, as the big cloud providers do, using the same core techniques: distributed software load balancers, smart algorithms, and edge optimization for new protocols. The result is that load balancing, once merely about distributing traffic, is now about improving every aspect of traffic delivery – from the user’s network latency to backend utilization to fault isolation – making it truly an integral part of modern infrastructure design.
Sources:
- Google Cloud – Cloud Load Balancing Overview: Google’s load balancing is built on Maglev, Andromeda, GFEs, and Envoy[3]. Offers single anycast IP global load balancing with automatic multi-region failover[4][5].
- Google Cloud Blog – Global External HTTP(S) LB Deep Dive: New Envoy-based global LB supports traffic mirroring, weighted splitting, header transforms[18]. Anycast IP and global distribution eliminate need for DNS-based global routing[9].
- Google Cloud Blog – HTTP/3 support: By 2021, Google’s Cloud CDN and HTTPS LB support HTTP/3 (QUIC), improving performance (e.g. 9% fewer rebuffers on YouTube)[23].
- AWS Blog – Scaling ELB: ALB/CLB scale by adding up to 100 nodes (IPs) across AZs; DNS resolution yields up to 8 IPs of healthy nodes[26][29]. NLB is built on AWS Hyperplane, a distributed virtualization system in each AZ[44].
- AWS Blog – NLB launch: NLB provides one static IP per AZ, preserves source IP (no X-Forwarded-For needed)[43][46], and can handle millions of requests/sec with ultra-low latency. Long-lived connections supported (great for IoT, gaming)[46].
- AWS News – ALB HTTP/2 & gRPC: ALB supports end-to-end HTTP/2, enabling gRPC services. It provides gRPC-aware health checks and metrics, and retains features like stickiness and TLS offload[31][32].
- Azure Architecture Center – Load Balancing Options: Azure offers Application Gateway (regional L7 with WAF)[58], Front Door (global anycast L7 with CDN features)[60], Load Balancer (regional or cross-region L4, ultra-low latency, millions of req/sec)[57], and Traffic Manager (global DNS load balancer)[62].
- Envoy Documentation – Load Balancers: Envoy supports multiple policies: weighted round robin[64], weighted least request (power-of-two random pick by default)[65], ring-hash and Maglev consistent hashing (Maglev uses algorithm from Google’s paper with table size 65537)[15].
- HAProxy Tech Blog – QUIC/HTTP3: HTTP/3 (QUIC) support introduced in HAProxy 2.5 and stabilized in Enterprise 2.7 – it’s moved beyond experimental and is recommended for serving real-world traffic over QUIC[77][75].
- Facebook (Meta) Engineering – Katran load balancer: Katran is the eBPF/XDP-based L4 load balancer used at Facebook’s POPs, enabling software LB on commodity servers. It leverages XDP and eBPF for a custom forwarding plane, achieving high packet per second throughput and co-existing with backend services on the same machine[79][80].
[1] [2] [10] [11] [12] [13] [14] [16] [17] research.google.com
https://research.google.com/pubs/archive/44824.pdf
[3] [4] [5] [6] [7] [8] [22] Cloud Load Balancing overview | Google Cloud
https://cloud.google.com/load-balancing/docs/load-balancing-overview
[9] [18] [19] [20] [21] Google Cloud Global External HTTP(S) Load Balancer - Deep Dive | Google Cloud Blog
[15] [64] [65] [66] [67] [70] [71] [73] Supported load balancers — envoy 1.36.0-dev-ac4070 documentation
[23] [24] [25] Cloud CDN and Load Balancing support HTTP/3 | Google Cloud Blog
https://cloud.google.com/blog/products/networking/cloud-cdn-and-load-balancing-support-http3
[26] [27] [28] [29] [30] [34] [35] [36] [37] [38] [39] [40] [44] [45] [68] Scaling strategies for Elastic Load Balancing | Networking & Content Delivery
[31] [32] New – Application Load Balancer Support for End-to-End HTTP/2 and gRPC | AWS News Blog
[33] [53] [87] [88] [89] [90] New – HTTP/3 Support for Amazon CloudFront | AWS News Blog
https://aws.amazon.com/blogs/aws/new-http-3-support-for-amazon-cloudfront/
[41] [42] [43] [46] [49] [50] [51] New Network Load Balancer – Effortless Scaling to Millions of Requests per Second | AWS News Blog
[47] [48] Avoiding overload in distributed systems by putting the smaller service in control | Amazon Builders' Library
[52] Listeners for your Application Load Balancers - AWS Documentation
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html
[54] [55] [56] [57] [58] [60] [61] [62] [63] [69] Load Balancing Options - Azure Architecture Center | Microsoft Learn
[59] QUIC based HTTP/3 with Application Gateway: Feature information Private Preview | Microsoft Community Hub
[72] Istio / Traffic Management
https://istio.io/latest/docs/concepts/traffic-management/
[74] Upgrade to HTTP/3 with Envoy - Baptiste Collard
https://baptistout.net/posts/upgrade-envoy-http3/
[75] [76] [77] How to Enable QUIC Load Balancing on HAProxy
https://www.haproxy.com/blog/how-to-enable-quic-load-balancing-on-haproxy
[78] HTTP/3 is everywhere but nowhere - Hacker News
https://news.ycombinator.com/item?id=43360251
[79] [80] [81] [82] [83] [84] [85] [86] Open-sourcing Katran, a scalable network load balancer - Engineering at Meta
[91] Y'all are sleeping on HTTP/3 : r/programming - Reddit
https://www.reddit.com/r/programming/comments/1elhcd1/yall_are_sleeping_on_http3/