Background
1. "Why should we upgrade to 25G Ethernet?"
In recent years, many Internet data centres have upgraded their server access from 10G Ethernet to 25G Ethernet. Why make this upgrade?
The following points provide a concise answer:
● Supporting high-performance businesses involves collaborating with rapidly expanding businesses to enhance application system performance. For instance, Internet applications based on AI and big data have driven a significant increase in business traffic.
● Supporting business emergencies is crucial, especially when sudden business crises require full infrastructure support from the business side.
● Matching server performance upgrades: enhancing server CPU and storage I/O performance boosts the network throughput of each server, and 10G networking has proven insufficient to meet the resulting bandwidth requirements.
● Reducing the cost per bit is important for public cloud services. The adoption of 25G Ethernet has lowered the network cost per bit, consequently reducing operating expenses.
● Realizing technical dividends: the new generation of 25G Ethernet switch chips offers a wide range of technical features, such as Telemetry and RDMA (Remote Direct Memory Access), significantly enhancing the efficiency of basic network operations and maintenance while cutting costs.
In Internet data centres, how does the networking architecture of 25G Ethernet differ from that of 10G Ethernet? Let's now explore the networking architecture of 25G.
2. What factors determine the 25G networking architecture?
When designing and implementing a 25G Data Centre network, it's important to consider two main factors that influence the choice of products and architecture solution:
1. Server scale: This refers to the expected number of servers in a single cluster.
2. Business application requirements: This includes the network convergence ratio, single/dual uplink of servers, and other requirements specific to different types of business applications.
The two most common network architecture models are the two-level network architecture and the three-level network architecture. In the following analysis, we will examine how these architectures correspond to the server scale and applicable business application requirements.
25G Network Architecture Design Solution
1. Two-level network architecture
▲Figure 1: two-level network architecture topology diagram
Based on Figure 1 above, we analyze the two-level network architecture topologies in terms of the servers' single/dual uplink mode, scale, equipment selection, and convergence ratio:
| Server scale | 1,000~2,000 | 5,000~20,000 | 5,000~20,000 |
| --- | --- | --- | --- |
| Architecture type | Two-level multi-core BOX architecture | Two-level multi-core CHASSIS architecture | Two-level multi-core CHASSIS architecture |
| Servers (single uplink) | 2,000 units | 10,000~20,000 units | 10,000~20,000 units |
| Servers (dual uplink) | 1,000 units | 5,000~10,000 units | 5,000~10,000 units |
| Equipment model | Leaf: RG-S6510-48VS8CQ (48*25G+8*100G); Spine: RG-S6520-64CQ (64*100G) | Leaf: RG-S6510-48VS8CQ (48*25G+8*100G); Spine: RG-N18000-X series (CB-card), 8/16 service slots | Leaf: RG-S6510-48VS8CQ (48*25G+8*100G); Spine: RG-N18000-X series (DB-card), 8/16 service slots |
| Convergence ratio | Spine 3:1; Leaf 1.5:1 | Spine 3:1; Leaf 1.5:1 | Spine 3:1; Leaf 1.5:1 |
▲ Table 1: Comparison of two-level network architectures
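The convergence (oversubscription) ratios in Table 1 follow directly from the port counts of the listed devices. As a rough sketch, assuming all 48x25G leaf ports face servers and all 8x100G ports face the spine:

```python
# Sketch: derive the leaf convergence ratio from the port counts in Table 1.
# Assumes every 25G port is a server-facing downlink and every 100G port an uplink.

def convergence_ratio(downlink_ports, downlink_gbps, uplink_ports, uplink_gbps):
    """Oversubscription = total downlink bandwidth / total uplink bandwidth."""
    return (downlink_ports * downlink_gbps) / (uplink_ports * uplink_gbps)

# RG-S6510-48VS8CQ leaf: 48x25G down, 8x100G up
leaf_ratio = convergence_ratio(48, 25, 8, 100)
print(f"Leaf convergence ratio: {leaf_ratio}:1")  # 1.5:1, matching Table 1
```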
When the scale of a single cluster ranges from 1,000 to 2,000 servers, a box-type (BOX) multi-core two-level architecture can fulfil the demand. This architecture uses the same series of single-chip switches throughout, so PFC (Priority-based Flow Control) + ECN (Explicit Congestion Notification) + MMU (Memory Management Unit) management is handled as one coordinated solution: the chip watermark settings are highly consistent and well-coordinated, forwarding delay is low, and throughput is high. The entire network can implement RDMA services and network visualization solutions.
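The PFC + ECN + MMU coordination described above can be pictured as two buffer watermarks on the same egress queue: ECN marking kicks in first so senders back off end-to-end, and PFC pause fires only at a higher watermark as a lossless last resort. A toy model (the threshold values are illustrative, not real chip defaults):

```python
# Toy model of MMU watermark coordination on a single egress queue.
# The ECN threshold sits below the PFC threshold so congestion is signalled
# end-to-end before hop-by-hop pause frames are needed. Values are illustrative.

ECN_MARK_KB = 200   # start ECN-marking packets above this queue depth
PFC_PAUSE_KB = 600  # send PFC pause frames upstream above this depth

def queue_action(depth_kb):
    if depth_kb >= PFC_PAUSE_KB:
        return "pfc-pause"   # lossless last resort: pause the upstream port
    if depth_kb >= ECN_MARK_KB:
        return "ecn-mark"    # mark CE so DCQCN/DCTCP-style senders slow down
    return "forward"

for depth in (50, 300, 700):
    print(depth, "->", queue_action(depth))
```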
For a scale of 5,000 to 20,000 single cluster servers, a Chassis-based multi-core two-level architecture can be employed. The Spine layer core devices of this architecture offer two types of core boards to choose from:
1. CB-type boards cater to business scenarios with frequent many-to-one traffic and effectively reduce packet loss in such scenarios through a large-cache mechanism.
2. DB-type boards suit business scenarios with high requirements for RDMA networking and network visualization. This architecture also inherits the advantages of the BOX multi-core two-level architecture.
In two-level networking, the choice of architecture depends on the scale of single-cluster servers and business needs. For routing, EBGP (External Border Gateway Protocol) can be used between Spine and Leaf. All Leaf devices are deployed with the same AS number (Autonomous System number), and the Spine layer replaces that AS number after receiving routes from the Leaf layer to work around EBGP's split-horizon (loop-prevention) behaviour. When the business requires dual uplink of servers, it is recommended to use the de-stacking solution for Leaf layer deployment. For details, see [Article 1]: How to "de-stack" data centre network architecture.
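The AS-number trick above works because of eBGP's standard loop prevention: a router discards any route whose AS_PATH already contains its own AS. Since every leaf shares one AS, the spine must rewrite that AS (as-override / replace-as style behaviour) before re-advertising, or the other leaves would reject the route. A minimal sketch, with illustrative AS numbers:

```python
# Sketch of why the spine must replace the leaf AS number.
# eBGP loop prevention: a router rejects routes whose AS_PATH contains its own AS.

LEAF_AS, SPINE_AS = 65001, 65000   # all leaves share AS 65001 (illustrative values)

def accepts(local_as, as_path):
    return local_as not in as_path

# Leaf1 advertises a prefix; at the spine the AS_PATH is [65001].
path = [LEAF_AS]

# Without replacement, the spine re-advertises [65000, 65001] to Leaf2 -> rejected.
print(accepts(LEAF_AS, [SPINE_AS] + path))       # False

# With AS replacement, the spine substitutes its own AS -> accepted by Leaf2.
replaced = [SPINE_AS if asn == LEAF_AS else asn for asn in path]
print(accepts(LEAF_AS, [SPINE_AS] + replaced))   # True
```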
2. Three-level network architecture
▲Figure 2: three-level architecture topology diagram
For ultra-large data centres with more than 20,000 servers in a single cluster, a two-level Spine-Leaf network can no longer meet demand and scales poorly. In this case, it is recommended to adopt a three-level architecture based on PODs (Point of Delivery, the smallest unit of the data centre) and horizontal expansion (scale-out).
As shown in Figure 2, each POD is a two-level Spine-Leaf network in which the number of servers and network devices is standardized and fixed. Multiple PODs are interconnected through core devices to achieve larger-scale networking and flexible expansion. Table 2 compares the options by number of PODs, server scale, device selection, and convergence ratio:
| Server scale | More than 20,000 | More than 20,000 |
| --- | --- | --- |
| Architecture type | Three-level architecture based on POD horizontal expansion | Three-level architecture based on POD horizontal expansion |
| Number of PODs | 14~56 | 14~56 |
| Servers per POD (single uplink) | 2,000 units | 2,000 units |
| Servers per POD (dual uplink) | 1,000 units | 1,000 units |
| Whole-network servers (single uplink) | 30,000~120,000 units | 30,000~120,000 units |
| Whole-network servers (dual uplink) | 15,000~60,000 units | 15,000~60,000 units |
| Equipment model | Leaf: RG-S6510-48VS8CQ (48*25G+8*100G); POD-Spine: RG-S6520-64CQ (64*100G); Core: RG-N18000-X series (CB-card), 16 service slots | Leaf: RG-S6510-48VS8CQ (48*25G+8*100G); POD-Spine: RG-S6520-64CQ (64*100G); Core: RG-N18000-X series (DB-card), 16 service slots |
| Convergence ratio | Core 3:1; POD-Spine 3:1; Leaf 1.5:1 | Core 3:1; POD-Spine 3:1; Leaf 1.5:1 |
▲Table 2: Comparison table of three-level architecture
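The whole-network figures in Table 2 are simply the per-POD capacity multiplied by the number of PODs. A quick sketch of that arithmetic (per-POD figures from Table 2; the example POD counts of 15 and 60 are chosen to match the quoted server ranges):

```python
# Whole-network server scale = servers per POD x number of PODs (per Table 2).

SERVERS_PER_POD_SINGLE = 2000  # single-uplink servers per POD
SERVERS_PER_POD_DUAL = 1000    # dual-uplink servers per POD

def whole_network_scale(pods, servers_per_pod):
    return pods * servers_per_pod

print(whole_network_scale(15, SERVERS_PER_POD_SINGLE))  # 30000 (low end)
print(whole_network_scale(60, SERVERS_PER_POD_SINGLE))  # 120000 (high end)
print(whole_network_scale(60, SERVERS_PER_POD_DUAL))    # 60000 (dual uplink, high end)
```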
In the three-level networking architecture, there are two equipment selections. Within each POD, the standard Spine-Leaf two-level architecture uses the same equipment in both cases; the two options differ only at the core layer, where you choose between Chassis devices with CB-type boards and with DB-type boards, on the same criteria as in the two-level architecture.
When deploying RDMA services for a business, it is recommended to control the deployment scope of the RDMA domain within the POD. This is because the control difficulty of PFC and ECN messages will significantly increase for larger-scale RDMA deployment, leading to a more serious impact of congestion and back pressure.
If planning a larger-scale data centre with over 100,000 servers in a single cluster, it is necessary to upgrade the Spine layer switches to BOX devices that provide 128 100G ports, doubling the server scale in each POD.
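The doubling claim above follows from port arithmetic: if the spine keeps its 3:1 convergence ratio, three quarters of its ports face the leaf layer, so going from 64 to 128 ports doubles the leaf-facing capacity and hence the servers each POD can host. A sketch of that calculation:

```python
# Port arithmetic behind "doubling the server scale in the POD" (sketch).
# At a ratio:1 convergence ratio, ratio/(ratio+1) of the spine's ports face the leaves.

def spine_downlinks(total_ports, ratio=3):
    return total_ports * ratio // (ratio + 1)

print(spine_downlinks(64))   # 48 leaf-facing ports on a 64x100G spine
print(spine_downlinks(128))  # 96 leaf-facing ports: double the POD's leaf capacity
```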
"Outlook for the Next-Generation Data Center Architecture"
According to the International Data Corporation (IDC), the data processed by data centres is projected to reach 175 ZB in 2025, five times the amount in 2018. China is expected to see the fastest growth, with data increasing from 7.6 ZB in 2018 to 48.6 ZB in 2025. With this rapid growth, the foundational network infrastructure needs corresponding improvements, including iterative upgrades of network bandwidth and a move towards an IP CLOS architecture with a 1:1 network convergence ratio. Will the next-generation networking architecture continue to use chassis devices in the IP CLOS network architecture? How will server access evolve and upgrade to meet business needs? The next article will provide detailed explanations.
Summary
The transition from 10G to 25G Ethernet in Internet Data Centers (IDCs) represents a significant advancement in network architecture, driven by the growing demands of modern applications. Upgrading to 25G Ethernet supports high-performance business needs, including improved application efficiency for AI and big data, while also accommodating sudden traffic spikes. The design of 25G networks is influenced by server scale and business application requirements, leading to two primary architectures: two-level and three-level networks. The two-level architecture is suitable for clusters of 1,000 to 20,000 servers, utilizing box-type and chassis-based designs for optimal performance. In contrast, three-level architectures are essential for ultra-large data centres exceeding 20,000 servers, enabling scalable, flexible network solutions through standardized PODs. As data demands continue to rise, advancements in 25G networking will play a crucial role in shaping the future of data centre architecture.