Time: December 23rd, 2024

Network Load Balancing in Data Centers: An Overview

Modern data centres are integral to managing a significant portion of data traffic and application operations on the contemporary internet. Ideally, these data centres should deliver differentiated services characterized by high throughput and low latency tailored to various application traffic demands from multiple users. The performance of a data centre's network transmission capacity plays a critical role in determining its overall service capabilities. Consequently, effective traffic management within the data centre can enhance the overall utilization of network links, mitigate congestion, and reduce the need for retransmissions. Therefore, the design of a well-structured and efficient network load-balancing solution is essential for the development of innovative data centre infrastructures.


Challenges of Network Load Balancing

Designing an effective network load-balancing solution for contemporary data centres presents several challenges due to the inherent complexities involved.

1. Traffic Dynamics: Data centre networks experience dynamic traffic patterns, where a limited number of substantial flows can dominate the network load, while numerous smaller flows may induce significant fluctuations in the network state. The latency associated with flow scheduling complicates the problem of load balancing in these environments.

2. Congestion Perception Difficulty: The high level of dynamism in data centre network traffic results in a temporal delay in the perception of network congestion. The congestion information available at any given moment reflects the previous state of the network. Consequently, the accuracy and timeliness of this congestion perception directly influence the efficacy of load-balancing strategies.

3. Packet Out-of-Order Issues: Traditional load balancing in data centre networks typically schedules per flow, using a hash calculation to select a single path for each flow. When two flows collide on the same link, the transmission time can effectively double. Packet-level scheduling, on the other hand, can deliver packets out of order, which interacts badly with the acknowledgement mechanisms of transport layer protocols.

4. Abnormal Traffic Scheduling: When a network device or link fails, the uplinks and downlinks can become asymmetric, which causes congestion and significantly reduces data transmission efficiency. Load-balancing solutions must react to such failures promptly and redistribute the affected traffic to optimize overall transmission performance.


Multi-Link Load Balancing in Data Centre Networks

Multi-link load balancing is a critical consideration for optimizing data centre performance. Data centre network topologies typically employ a CLOS structure, which provides multiple paths between hosts. To meet the demands of throughput-sensitive traffic, data flows are distributed across these paths. To mitigate congestion and improve resource utilization, Equal Cost Multi-Path (ECMP) technology is frequently used as the primary network load-balancing solution.

ECMP is equal-cost multi-path routing: multiple paths of equal cost lead to the same destination address. In environments that support equal-cost routing, Layer 3 traffic forwarded towards a given destination IP address or network segment can be shared across the different paths, which balances load across network links, and traffic can switch promptly to the remaining paths when a link fails. ECMP path selection can be implemented in several ways:

1. HASH: A hash computed over the IP 5-tuple (source and destination IP address, source and destination port, and protocol) selects a specific path for each data flow.
2. Polling (round robin): Data flows are assigned to the available paths in turn.
3. Path Weighting: Flows are allocated in proportion to the weights assigned to the paths, so paths with higher weights carry more flows.

By utilizing these strategies, data centres can achieve more efficient load balancing and overall improved network performance.
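The three path selection strategies listed above can be sketched in a few lines of Python. This is an illustrative model only, not switch firmware: the function names, the use of SHA-256 as the hash, and the path names are all assumptions made for the example.

```python
import hashlib
from itertools import cycle

def hash_select(five_tuple, paths):
    """HASH: hash the 5-tuple and take the result modulo the number of paths.
    The same flow always maps to the same path."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()  # SHA-256 is a stand-in for a hardware hash
    return paths[int.from_bytes(digest[:4], "big") % len(paths)]

def round_robin_selector(paths):
    """Polling: return a function that hands out paths in turn, one per new flow."""
    rotation = cycle(paths)
    return lambda: next(rotation)

def weighted_selector(weights):
    """Path weighting: a path with weight w appears w times in the rotation,
    so it receives a proportionally larger share of flows."""
    schedule = [path for path, w in weights.items() for _ in range(w)]
    rotation = cycle(schedule)
    return lambda: next(rotation)

# Hypothetical paths and flow, for illustration only.
paths = ["spine1", "spine2", "spine3"]
flow = ("10.0.0.1", "10.0.1.1", 40000, 4791, "UDP")
chosen = hash_select(flow, paths)
```

Hash selection keeps every packet of a flow on one path, which preserves ordering; the round-robin and weighted selectors ignore flow identity and only control how many flows each path receives.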


ECMP Load Balancing Flowchart

ECMP (Equal-Cost Multi-Path) is a relatively straightforward load-balancing strategy. However, it presents several challenges in practical application:

1. HASH Polarization Issue
HASH polarization frequently occurs in multi-stage load-balancing scenarios, particularly when interconnected devices at successive stages use the same hash algorithm. Traffic that an upstream device has hashed onto one link is then hashed to the same result by the downstream device, so it concentrates on a subset of the downstream links while the others sit idle.

2. HASH Consistency Issue
An elastic HASH function is designed to maintain consistency when a single link in the ECMP group fails: the switch redistributes the flows of the failed link across the remaining operational links, and flows on healthy links keep their original HASH mapping. However, elastic HASH is effective only for a single port or link failure; it does not provide load balancing when multiple ports or links fail simultaneously.

3. Static HASH Imbalance Issue
Static HASH applies a specific algorithm to a HASH-KEY value, and the result determines which member link forwards the packet. A significant drawback is that static HASH ignores the utilization of each member link. Load is therefore distributed unevenly among member links, particularly when flows are large, which worsens congestion on the selected member link. As switch buffer occupancy grows, Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) may be triggered, reducing throughput at the source.
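The HASH polarization issue in item 1 can be reproduced with a small two-stage model. The sketch below assumes two stages with two links each and uses a seeded SHA-256 as a stand-in for a switch hash; the function names and the seed parameter are hypothetical and chosen only for illustration.

```python
import hashlib

def pick_link(flow_id, n_links, seed=0):
    """Map a flow to a link index with a seeded hash (stand-in for a switch hash)."""
    data = f"{seed}:{flow_id}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big") % n_links

def stage2_links_used(flows, n_links, stage2_seed):
    """Return the set of stage-2 links reached by the flows that stage 1
    (seed 0) sent over its link 0."""
    arriving = [f for f in flows if pick_link(f, n_links, seed=0) == 0]
    return {pick_link(f, n_links, seed=stage2_seed) for f in arriving}

flows = range(1000)
# Same hash at both stages: every arriving flow hashes to 0 again.
polarized = stage2_links_used(flows, 2, stage2_seed=0)  # → {0}
# A different seed at stage 2 decorrelates the two decisions.
spread = stage2_links_used(flows, 2, stage2_seed=1)
```

With identical hashing at both stages, the second stage never uses its other link. Giving each stage a different seed or a different set of hash input fields is a common way to avoid this, although the source does not state which mitigation a particular switch applies.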

In traditional Ethernet-based data centre networks, achieving load balancing through ECMP is effective in scenarios characterized by numerous small flows. However, the flow characteristics associated with artificial intelligence (AI) training often involve substantial concurrent large flows, which may introduce long-tail delays stemming from uneven hashing. This dynamic poses a challenge to maintaining training efficiency and supporting large-scale machine learning initiatives within an Ethernet framework.

Ruijie RALB Load Balancing Technology

To address the challenges associated with network performance, Ruijie has introduced innovations in multi-path traffic scheduling through its Remote Adaptive Load Balancing (RALB) technology. This advanced technology achieves global dynamic load balancing on a per-packet basis by assessing link quality. As a result, network bandwidth utilization can reach an impressive 97.6%.

Comparison of Bandwidth Utilization under Different Load Balancing Techniques


● Dynamic Load Balancing at the Packet Level

This approach involves assessing local link quality to achieve a non-blocking network and maximize throughput through a packet-by-packet dynamic load balancing strategy.


Static Hash vs. Dynamic Load Balancing

Static HASH is a flow-based load-balancing mechanism: once the Equal-Cost Multi-Path (ECMP) group for a destination has been determined, a hash selects one member link of the group, and every packet of the flow is forwarded over that link. Dynamic load balancing instead forwards per packet, which breaks a flow into smaller units; each packet is forwarded according to the current load on each link in the ECMP group. In Artificial Intelligence training scenarios that use RALB, Read and Write data packets are therefore balanced dynamically packet by packet, while other message types keep static per-flow forwarding. This combination yields higher bandwidth utilization.
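The difference between the two mechanisms can be illustrated with a toy model. The sketch below is not Ruijie's RALB algorithm: it assumes equal-sized packets, measures load in packets per link, and uses a simple "least-loaded link" rule as a stand-in for load-aware per-packet forwarding. All names are hypothetical.

```python
import hashlib

def static_hash_loads(flows, n_links):
    """Per-flow: every packet of a flow goes to the link chosen by the flow hash."""
    loads = [0] * n_links
    for flow_id, n_packets in flows.items():
        digest = hashlib.sha256(str(flow_id).encode()).digest()
        loads[int.from_bytes(digest[:4], "big") % n_links] += n_packets
    return loads

def dynamic_per_packet_loads(flows, n_links):
    """Per-packet: each packet goes to the link that currently carries the least load."""
    loads = [0] * n_links
    for n_packets in flows.values():
        for _ in range(n_packets):
            loads[loads.index(min(loads))] += 1
    return loads

# Three large flows and one small flow (packet counts), as in AI training traffic.
flows = {"a": 1000, "b": 1000, "c": 1000, "d": 10}
static_loads = static_hash_loads(flows, 4)
dynamic_loads = dynamic_per_packet_loads(flows, 4)
```

Under static hashing a large flow can never be split, so at least one link carries a full 1000-packet flow and two large flows may share a link. The per-packet rule keeps the links within one packet of each other, at the cost of possible reordering at the receiver.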

● Global Load Balancing
By monitoring changes in the quality of remote links, the Leaf switch schedules traffic precisely at the Leaf layer, which achieves effective global load balancing.


Traditional Flow Control vs. RALB Technology

In the event of a link failure within a data centre network, such as the failure of an optical module, switch port malfunction, or disconnection of a fibre link, an asymmetrical condition arises between the uplinks and downlinks of the switch. This scenario may result in congestion at the switch's egress port. During instances of queue congestion, the switch marks affected packets with Explicit Congestion Notification (ECN). Upon receiving these ECN-marked packets, the receiving end generates a Congestion Notification Packet (CNP) to communicate the congestion back to the sending end. In response, the sender is prompted to throttle the relevant priority queue.

As an illustration, suppose one of the links connecting Spine1 to Leaf2 fails, leaving a single 100G link between Spine1 and Leaf2. Traffic from Leaf1 to Leaf2 via Spine1 then congests the egress port of Spine1 that faces Leaf2, so the rate at which Leaf1 sends through Spine1 must be throttled to 100G. Because traffic is spread packet by packet across all uplinks, the throttling applies to all of Leaf1's outgoing traffic at once and reduces its total upstream rate. If traffic from Leaf1 to Leaf2 is balanced evenly across Spine1 and Spine2, the traffic routed via Spine2 is also throttled to 100G, so only 200G of aggregate bandwidth is used.

With Remote Adaptive Load Balancing (RALB), the Spine switch quickly detects the change in link status and notifies the Top of Rack (TOR) switch. Leaf1 can then account for the state of the remote link and adaptively redistribute traffic destined for Leaf2, which yields global load balancing. With RALB, Leaf1 sends only one-third of the Leaf2-bound traffic through Spine1 and forwards the remaining two-thirds via Spine2. All 300G of the bandwidth still available from Leaf1 to Leaf2 is then used, a 50% increase over the previous maximum of 200G. The benefit grows with the number of Spine switches: usable bandwidth rises from 3/6 to 5/6 of the total with three Spine switches, and from 4/8 to 7/8 with four.
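The bandwidth figures above can be checked with a short calculation. The sketch assumes a topology inferred from those figures rather than stated outright: each Spine normally has two 100G links towards Leaf2, and one link on Spine1 has failed. The function and parameter names are hypothetical.

```python
from fractions import Fraction

def usable_gbps(n_spines, adaptive, links_per_spine=2, link_gbps=100, failed_links=1):
    """Usable Leaf1-to-Leaf2 bandwidth when one Spine has lost failed_links links."""
    healthy = links_per_spine * link_gbps                    # capacity via an intact Spine
    degraded = (links_per_spine - failed_links) * link_gbps  # capacity via Spine1
    if adaptive:
        # Adaptive weights: each Spine carries traffic up to its remaining capacity.
        return degraded + (n_spines - 1) * healthy
    # Even spread: every Spine is throttled to the rate of the degraded one.
    return n_spines * degraded

def spine1_share(n_spines, links_per_spine=2, link_gbps=100, failed_links=1):
    """Fraction of Leaf2-bound traffic sent via the degraded Spine under adaptive weights."""
    degraded = (links_per_spine - failed_links) * link_gbps
    total = usable_gbps(n_spines, True, links_per_spine, link_gbps, failed_links)
    return Fraction(degraded, total)

even_two_spines = usable_gbps(2, adaptive=False)      # → 200
adaptive_two_spines = usable_gbps(2, adaptive=True)   # → 300
share_via_spine1 = spine1_share(2)                    # → Fraction(1, 3)
```

With three Spines the same functions give 300G versus 500G out of 600G (3/6 versus 5/6), and with four Spines 400G versus 700G out of 800G (4/8 versus 7/8), which matches the ratios quoted above.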

Moreover, once RALB technology is implemented, proactive flow control can be initiated on Leaf1, significantly contributing to reduced latency within the network.

Ruijie Networks' RALB load-balancing technology transcends the limitations imposed by conventional practices. Leveraging the current load status of links, it executes global traffic balancing and, in conjunction with dynamic packet-by-packet forwarding, achieves a non-blocking network environment characterized by ultra-high bandwidth utilization. This advancement markedly enhances the efficiency and stability of data transmission, thereby providing robust support for the effective operation of data centres.

In light of the escalating global Internet traffic and the increasingly diverse requirements of data applications, Ruijie Networks is dedicated to advancing and innovating network technology. The introduction of their global load-balancing solution serves as a testament to their relentless pursuit of progress. Through ongoing research and development, as well as product innovation, Ruijie Networks is poised to continue delivering efficient, reliable, and intelligent network solutions for data centres worldwide. The emergence of the AIGC era will significantly facilitate rapid development across various sectors within the Internet enterprise landscape.
