Time: January 13th, 2025
With the rapid advancement of artificial intelligence, AI applications such as chatbots, virtual anchors, and AI PCs (AIPC) have spread into professional, academic, and personal life. To improve the user experience and keep up with time-sensitive demands, large language models are being developed with ever larger parameter counts.

It is noteworthy that the recently launched Llama 3.1 model boasts an impressive parameter scale of 405 billion.

Training a model of this size requires ultra-large-scale intelligent computing centres. Elon Musk recently announced on social media that xAI has begun training at what he described as the world's largest supercomputing facility, the "Supercluster", which uses 100,000 liquid-cooled GPUs. Interconnecting 100,000 GPU cards requires high-speed network links. For details on building high-performance computing networks, see resources such as "Intelligent Computing on the Internet" and the detailed analysis of Ruijie's AIGC network solution.

As intelligent computing centres expand, optics account for a growing share of the data centre market. In the 100G era, the ratio of optical modules to network equipment was approximately 1:1; in the 400G era, it has shifted to 7:3, which underlines how important optical modules are within clusters. This article focuses on the failure rates of optical modules: it analyzes the primary causes of failure in traditional Digital Signal Processing (DSP) modules, compares them with the failure rates of modules based on LPO technology, and discusses the advantages of LPO modules.


The current status of optical modules in computing power networks

Optical modules are familiar components of network infrastructure. Which types of optical modules are used in computing power networks?

The diagram presented illustrates the current mainstream network architecture utilized in the RoCE Ethernet solution for intelligent computing centres. In this configuration, servers connect to the computing power network via a 400G high-speed network card. Additionally, a data centre switch equipped with a 51.2T switching chip establishes a three-tier architecture capable of supporting a cluster scale exceeding ten thousand cards.


The module rate required in intelligent computing centres has reached 400G, and 800G is being considered for switch-to-switch interconnection. The mainstream switching chip today is the 51.2T model, which uses 112G SerDes, so the corresponding optical modules on the switch side must use the Q112 (QSFP112) form factor. On the network card side, the OSFP form factor is mainly used, and the specific model can be chosen according to the distance required in the deployment.
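As a rough check on these figures, the sketch below derives how many 400G or 800G ports a 51.2T chip can expose and how many endpoints a fabric built from such switches could attach. It assumes that each 112G SerDes lane carries a nominal 100 Gbps of payload and that the fabric is a textbook non-blocking fat-tree. The function names and the topology formula are illustrative assumptions, not part of the article, and real deployments may oversubscribe or reserve ports.

```python
# Port-count and fabric-scale arithmetic for a 51.2T switching chip.
# Assumptions (not from the article): each 112G SerDes lane carries a nominal
# 100 Gbps of payload, and the fabric is a textbook non-blocking fat-tree.

SWITCH_CAPACITY_GBPS = 51_200
LANE_PAYLOAD_GBPS = 100


def ports_per_switch(port_speed_gbps: int) -> int:
    """Ports of the given speed that one chip can expose."""
    return SWITCH_CAPACITY_GBPS // port_speed_gbps


def lanes_per_port(port_speed_gbps: int) -> int:
    """SerDes lanes needed for one port of the given speed."""
    return port_speed_gbps // LANE_PAYLOAD_GBPS


def fat_tree_endpoints(radix: int, tiers: int) -> int:
    """Maximum endpoints of a non-blocking fat-tree of `radix`-port switches.

    Every tier except the top splits its ports half down and half up,
    which gives 2 * (radix / 2) ** tiers endpoints.
    """
    return 2 * (radix // 2) ** tiers


radix_400g = ports_per_switch(400)
print(radix_400g, lanes_per_port(400))             # 128 4
print(ports_per_switch(800), lanes_per_port(800))  # 64 8
print(fat_tree_endpoints(radix_400g, tiers=2))     # 8192
print(fat_tree_endpoints(radix_400g, tiers=3))     # 524288
```

Under these assumptions a two-tier fabric of 128-port 400G switches tops out at 8,192 endpoints, which is consistent with a three-tier architecture being needed once a cluster exceeds ten thousand cards.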


The operational principles of Digital Signal Processing (DSP) optical modules

The 400G Q112 VR4 module serves as a representative example for explaining the role of each key component in a DSP optical module. The block diagrams of short-reach (SR) modules and DR modules are fundamentally similar; they differ in how the electro-optical conversion is performed. SR modules use vertical-cavity surface-emitting lasers (VCSELs), while DR modules use electro-absorption modulated lasers (EMLs) or silicon photonics.



1. The switch chip sends 4 × 112 Gbps PAM4 electrical signals into the optical module.
2. The Digital Signal Processor (DSP) chip reshapes the electrical signals and forwards them to the driver.
3. The driver delivers the electrical signals to the laser.
4. The Vertical-Cavity Surface-Emitting Laser (VCSEL) converts the electrical signals into optical signals, which are launched into the optical fibre.
5. After travelling through the fibre, the optical signal is converted back into an electrical signal by the photodiode (PD) array of the receiving optical module.
6. The transimpedance amplifier (TIA) amplifies the electrical signal and passes it to the DSP chip.
7. The DSP chip reshapes the electrical signal again and sends it to the switch chip on the receiving side.


Optical Module Failure Rate Metrics

● The Significance of the Failure Rate
While the architecture of optical modules is relatively simple compared with devices such as switches and servers, these modules play an essential role in computing networks. The failure rate of an individual optical module is low, but its effect is amplified when modules are deployed in clusters of more than ten thousand cards. A module failure can interrupt a training task, which then needs extra time to restart, and this indirectly raises the operating cost of the cluster. The failure rate of optical modules therefore deserves close attention.

● Definition of the Failure Rate Metric
Failures in Time (FIT) quantifies how often a product or system fails. It is expressed as the number of failures per billion (10^9) device-hours of operation. For instance, if a population of devices accumulates 100 failures over 1 billion device-hours, its failure rate is 100 FIT.
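The FIT definition can be turned into two small helper functions. This is a minimal sketch: the function names are illustrative, and it assumes a constant failure rate, which is the usual simplification when FIT figures are quoted.

```python
# FIT arithmetic: 1 FIT = 1 failure per 1e9 device-hours.
# Assumes a constant failure rate over the period considered.

DEVICE_HOURS_PER_FIT_UNIT = 1_000_000_000


def mean_hours_between_failures(fit: float) -> float:
    """Mean device-hours between failures for one device with the given FIT."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return DEVICE_HOURS_PER_FIT_UNIT / fit


def expected_failures(fit: float, device_count: int, hours: float) -> float:
    """Expected number of failures in a population over a period of time."""
    return fit * device_count * hours / DEVICE_HOURS_PER_FIT_UNIT


# The 100 FIT example from the text: one failure per 10 million device-hours.
print(mean_hours_between_failures(100))       # 10000000.0
# 10,000 such devices running for one year (8,760 hours): about 8.76 failures.
print(expected_failures(100, 10_000, 8_760))
```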

The failure rate of an optical module is the sum of the failure rates of all its components. For example, if a specific optical module has a theoretical failure rate of 115.63 FIT, 115.63 failures are expected per billion device-hours of operation.

The mean time for a single module to fail once is one billion divided by 115.63, or approximately 8,647,744 hours. This figure suggests that an individual module is highly reliable, but the statistics need to be evaluated across the entire cluster.

| GPU cluster scale (cards) | Optical modules required for the cluster (pieces) | Interval between failures across all optical modules (hours) |
|---|---|---|
| 4096 | 16384 | 528 |
| 8192 | 32768 | 264 |
| 16384 | 98304 | 88 |
| 32768 | 196608 | 44 |

The table above lists the number of optical modules required at each cluster scale, together with the expected interval between optical module failures across the whole cluster. The interval decreases monotonically as the number of modules grows.
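The intervals can be reproduced by dividing the per-module figure of 8,647,744 hours by the number of modules in the cluster. The sketch below does exactly that; the function name is illustrative, and the calculation assumes that modules fail independently at a constant rate.

```python
# Cluster-level failure interval: per-module mean time to failure divided
# by the number of modules (assumes independent, constant-rate failures).

MODULE_MEAN_HOURS_TO_FAILURE = 8_647_744  # per-module figure used in the text

# GPU cards -> optical modules required, as listed in the article.
MODULES_BY_CLUSTER_SCALE = {
    4096: 16384,
    8192: 32768,
    16384: 98304,
    32768: 196608,
}


def cluster_failure_interval_hours(module_count: int) -> float:
    """Expected hours between module failures anywhere in the cluster."""
    return MODULE_MEAN_HOURS_TO_FAILURE / module_count


for gpus, modules in MODULES_BY_CLUSTER_SCALE.items():
    hours = cluster_failure_interval_hours(modules)
    print(f"{gpus} GPUs, {modules} modules: one failure every {hours:.0f} h")
# The rounded intervals are 528, 264, 88 and 44 hours.
```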

In clusters exceeding ten thousand cards, the effect of the single-module failure rate is amplified. In a cluster of about 32,000 cards, for instance, a module failure can be expected roughly every two days, so the module failure rate deserves serious attention.

There are two principal factors that influence variations in the failure rate of optical modules: the number of components within the module and the operational temperature of the module itself.

The relationships governing these changes are as follows:
1. A reduction in the number of components within a module correlates with a lower failure rate.
2. A decrease in the operational temperature of the module similarly correlates with a lower failure rate.
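Because a module's failure rate is the sum of its components' failure rates, the first relationship can be illustrated with a short sketch. The component names and FIT values below are hypothetical placeholders chosen only to show the arithmetic; they are not measurements from this article or from any datasheet.

```python
# Series-system model: the module fails if any component fails, so the
# module failure rate is the sum of the component failure rates.
# All FIT values here are HYPOTHETICAL placeholders, not measured data.


def module_fit(component_fits: dict) -> float:
    """Sum component failure rates (in FIT) to get the module failure rate."""
    return sum(component_fits.values())


def without(component_fits: dict, *names: str) -> dict:
    """Return a copy of the component list with the named parts removed."""
    return {k: v for k, v in component_fits.items() if k not in names}


hypothetical_module = {
    "dsp": 30.0,
    "oscillator": 5.0,
    "driver": 10.0,
    "tia": 10.0,
    "laser": 40.0,
}

full = module_fit(hypothetical_module)
reduced = module_fit(without(hypothetical_module, "dsp", "oscillator"))
print(full, reduced)  # 95.0 60.0
```

Removing components can only lower the sum, which is the first relationship listed above. The temperature effect is not modelled in this sketch.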

An analysis of the failure rates of traditional Digital Signal Processing (DSP) optical modules reveals several shortcomings:
1. Many components and a high operating temperature: in addition to the DSP chip itself, a DSP module contains peripheral components such as crystal oscillators and a series of additional chips, which together account for more than 50% of the module's power consumption. This significantly raises the operating temperature of the module.
2. A high failure rate of the components themselves: when a DSP module uses EML (electro-absorption modulated laser) or VCSEL (vertical-cavity surface-emitting laser) technology, it contains several discrete III-V lasers, which inherently have a higher failure rate.

From the analysis above, it is clear that the primary causes for the failure of DSP modules are the substantial number of devices involved and the high operating temperatures, including those associated with DSP and peripheral chips, as well as EML and VCSEL lasers. To effectively mitigate the failure rate of these modules, it is essential to address the root causes. In the subsequent sections, we will introduce the Linear Drive Pluggable Optics (LPO) module solution.


LPO Optical Module Solution

● LPO Module


The LPO module removes the DSP chip found in conventional DSP modules and relies on the DSP integrated in the switching chip to process the electrical signals. It uses standard-performance driver and TIA chips and a carefully selected electro-optical conversion scheme to achieve good transmission performance. The electro-optical conversion can be based on VCSEL, EML, or silicon photonics technology, with silicon photonics offering better linearity and lower electrical reflection. To ensure a reliable supply and higher performance, Ruijie Networks has adopted silicon photonics solutions. For the foundational concepts of LPO, readers may refer to earlier articles on the advent of LPO technology as an innovative tool for network construction in intelligent computing centres.

Analysis of LPO Module Failure Rate

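| Solution | Failure rate ratio (relative to LPO + silicon photonics) |
|---|---|
| LPO + silicon photonics | 1 |
| DSP + silicon photonics | 1.31 |
| DSP + EML | 1.64 |
| DSP + VCSEL | 2.35 |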

The accompanying chart shows the failure rate ratios of different technical solutions for 400G modules at the same operating temperature of 55°C. At this temperature, the LPO + silicon photonics solution has the lowest failure rate; the failure rates of the other solutions are 1.31 to 2.35 times that of the LPO + silicon photonics solution.

This comparison holds the operating temperature constant. In actual deployments, however, the operating temperature of LPO + silicon photonics modules is typically lower than that of DSP modules, which reduces the failure rate further.



As illustrated in the above figure, when considering the same ambient temperature, the operating temperature of the LPO module is approximately 15°C lower than that of the DSP module.


When the temperature of the LPO module is reduced from 55°C to 40°C, the failure rate is observed to decrease by 50%, thereby indicating enhanced reliability, as demonstrated in the subsequent figure.


In the context of real-world deployment scenarios, a comparison of 400G modules utilizing different technical solutions at identical ambient temperatures reveals that the failure rate of the LPO and silicon photonics solution is further diminished, a benefit attributable to the lower operating temperature of the module.
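The combined effect can be estimated with simple arithmetic. The sketch below takes the failure rate ratios quoted at 55°C (LPO + silicon photonics 1, DSP + silicon photonics 1.31, DSP + EML 1.64, DSP + VCSEL 2.35) and the 50% reduction quoted for an LPO module running at 40°C instead of 55°C. It assumes, for illustration only, that the DSP modules still run at 55°C while the LPO module runs at 40°C under the same ambient temperature; the resulting figures are derived estimates, not measurements reported in the article.

```python
# Relative failure rates in deployment, derived from the article's figures.
# Assumption: DSP modules run at 55 °C while the LPO module runs at 40 °C
# (about 15 °C cooler), which halves the LPO failure rate per the article.

RATIO_AT_55C = {
    "LPO + silicon photonics": 1.00,
    "DSP + silicon photonics": 1.31,
    "DSP + EML": 1.64,
    "DSP + VCSEL": 2.35,
}
LPO_FACTOR_AT_40C = 0.5  # failure rate at 40 °C relative to 55 °C


def deployment_ratio_vs_lpo(ratio_at_55c: float,
                            lpo_factor: float = LPO_FACTOR_AT_40C) -> float:
    """Failure rate of a DSP option relative to the cooler-running LPO module."""
    return ratio_at_55c / lpo_factor


for name, ratio in RATIO_AT_55C.items():
    if name != "LPO + silicon photonics":
        print(f"{name}: {deployment_ratio_vs_lpo(ratio):.2f}x the LPO failure rate")
# DSP + silicon photonics: 2.62x, DSP + EML: 3.28x, DSP + VCSEL: 4.70x
```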


Summary

In summary, based on the theoretical analysis combined with empirical data, the LPO and silicon photonics solution exhibits the lowest failure rate among the evaluated solutions. The primary reasons for this finding are as follows:
1. Elimination of the DSP chip: The removal of the DSP chip significantly decreases the operating temperature of the module, thereby reducing the adverse effects of elevated temperatures on the laser.
2. Adoption of silicon photonics technology: As depicted in the subsequent figure, with silicon photonics the silicon photonic chip performs the signal modulation, so the laser only needs to supply continuous-wave (unmodulated) light. The EML solution relies on four lasers and a thermoelectric cooler (TEC), whereas the silicon photonics solution requires only one laser, which reduces the number of module components and, with it, the failure rate.



Performance Parameters of LPO Optical Modules

A lower failure rate alone is not enough for the LPO optical module to replace the DSP module entirely. The optical performance of the module must also be assessed, in particular the Bit Error Rate (BER) and Sensitivity (SEN). Both indicators must meet the thresholds defined by the relevant protocol standards.

Evaluation Methods for Optical Module BER and SEN



By adjusting the optical attenuation, the BER can be measured at different received optical power levels, and the results together form a BER versus optical power curve. As the received optical power is reduced (moving left along the horizontal axis of the chart), the BER rises until it reaches the Forward Error Correction (FEC) threshold of 2.4e-4. The optical power at that point is the sensitivity (SEN) of the optical module. The typical BER is measured without an optical attenuator, where the result lies in the BER floor region.
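The sensitivity reading can be automated by interpolating the measured curve at the FEC threshold. The following is a minimal sketch under stated assumptions: the sample points are hypothetical, the function name is illustrative, and the interpolation is linear in log10(BER) between neighbouring measurements, which is a common approximation rather than a method described in the article.

```python
import math

FEC_BER_THRESHOLD = 2.4e-4  # threshold quoted in the text


def sensitivity_dbm(samples, threshold=FEC_BER_THRESHOLD):
    """Estimate the received power (dBm) at which BER equals the threshold.

    `samples` is a list of (received_power_dbm, ber) pairs with ber > 0.
    Interpolates linearly in log10(BER) between adjacent measurements.
    """
    points = sorted(samples)  # ascending power, so BER falls along the list
    target = math.log10(threshold)
    for (p_lo, ber_lo), (p_hi, ber_hi) in zip(points, points[1:]):
        log_lo, log_hi = math.log10(ber_lo), math.log10(ber_hi)
        if log_lo >= target >= log_hi:
            if log_lo == log_hi:
                return p_lo
            fraction = (log_lo - target) / (log_lo - log_hi)
            return p_lo + fraction * (p_hi - p_lo)
    raise ValueError("BER curve does not cross the threshold")


# HYPOTHETICAL measurement points, for illustration only.
curve = [(-10.0, 1e-2), (-8.0, 1e-4), (-6.0, 1e-6), (-4.0, 1e-8)]
print(f"{sensitivity_dbm(curve):.2f} dBm")  # -8.38 dBm
```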


A module with better sensitivity tolerates lower received optical power. This is valuable in practical deployments, where optical power can be reduced by dirty connectors, low launch power, or high insertion loss in fibre optic connectors.


Performance Parameters of the LPO DR Module

The following presents test data for various module schemes under room temperature conditions in short fibre scenarios.


From the BER data chart, the following observations can be made:
1. The BER of the LPO DR module is better than the protocol threshold by a margin of five orders of magnitude.
2. The BER of the LPO DR solution is comparable to that of the DSP + silicon photonics solution, and both are two to three orders of magnitude better than the DSP + EML solution.


From the SEN data chart, the following can be observed:

1. The SEN of the LPO DR Module shows a margin of approximately 3.5 dB relative to the protocol threshold.
2. The SEN parameters across the three solutions exhibit minimal variance.

In summary, the optical performance of the LPO + silicon photonics solution is on par with that of the DSP + silicon photonics solution, and both outperform the DSP + EML solution. It is therefore reasonable to conclude that the LPO module can effectively replace the existing DSP DR solution.


Additional Benefits of LPO Optical Modules

Beyond attributes of high reliability and availability, LPO optical modules present additional value in various dimensions:
1. Lower Power Consumption: By eliminating the DSP chip, the maximum power consumption of the optical module can be reduced by approximately 51.3%, bringing it below 4 W (measured at a case temperature of 70°C).


2. Reduced Latency: Omitting the DSP chip removes one processing step from the module, reducing latency by 95% and satisfying the requirements of low-latency application scenarios.


3. Enhanced Supply Chain Stability: Traditional DSP modules face challenges related to the limited availability of DSP chips and VCSEL lasers, which are currently in short supply and have extended delivery timelines, presenting risks for large-scale deployments. The LPO module's design negates the necessity for a DSP chip, instead employing silicon photonics technology. This strategy effectively mitigates supply risks associated with critical components by circumventing reliance on tightly-supplied DSP chips and VCSEL components.
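The power and latency percentages above can be turned into implied figures for a comparable DSP module. The sketch below is only a back-of-the-envelope estimate: it treats the 4 W and 5 ns figures for the 400G LPO module as the values after the 51.3% and 95% reductions, and the cluster-level saving further assumes that all 196,608 modules of the largest cluster in the earlier table are 400G modules running at maximum power. None of the derived numbers are stated in the article.

```python
# Implied DSP-module figures derived from the quoted reductions.
# These are estimates under the stated assumptions, not measured values.


def implied_baseline(value_after_reduction: float, reduction_fraction: float) -> float:
    """Value before a fractional reduction, e.g. 0.513 for 51.3%."""
    if not 0 <= reduction_fraction < 1:
        raise ValueError("reduction_fraction must be in [0, 1)")
    return value_after_reduction / (1.0 - reduction_fraction)


def cluster_power_saving_kw(module_count: int, per_module_saving_w: float) -> float:
    """Total power saved across a cluster, in kilowatts."""
    return module_count * per_module_saving_w / 1000.0


LPO_MAX_POWER_W = 4.0   # upper bound quoted for the 400G LPO module
LPO_LATENCY_NS = 5.0    # upper bound quoted for the LPO modules

dsp_power_w = implied_baseline(LPO_MAX_POWER_W, 0.513)    # roughly 8.2 W
dsp_latency_ns = implied_baseline(LPO_LATENCY_NS, 0.95)   # roughly 100 ns
saving_kw = cluster_power_saving_kw(196_608, dsp_power_w - LPO_MAX_POWER_W)

print(f"implied DSP module power: {dsp_power_w:.1f} W")
print(f"implied DSP module latency: {dsp_latency_ns:.0f} ns")
print(f"estimated saving for 196,608 modules: {saving_kw:.0f} kW")
```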


Ruijie LPO Optical Module Products



| Parameter | 400G Q112 DR4 LD | 800G OSFP DR8 LD | 800G OSFP 2DR4 LD |
|---|---|---|---|
| Transmission distance | 500 m, single-mode fibre | 500 m, single-mode fibre | 500 m, single-mode fibre |
| Launch optical power | 0~4 dBm | 0~4 dBm | 0~4 dBm |
| Full-temperature BER floor | <1e-8 | <1e-8 | <1e-8 |
| Sensitivity (OMA) | <-7 dBm | <-7 dBm | <-7 dBm |
| Power consumption | Max 4 W | Max 8 W | Max 8 W |
| Latency | <5 ns | <5 ns | <5 ns |
| Case temperature | 0~70°C | 0~70°C | 0~70°C |
| Optical port type | MPO12/APC | MPO16/APC | 2xMPO12/APC |
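The launch power and sensitivity figures give a rough worst-case optical power budget for these modules. The sketch below is illustrative only: it takes the minimum launch power (0 dBm) and the sensitivity limit (-7 dBm) at face value, even though sensitivity is quoted as OMA and launch power may be an average power, so the two are not strictly comparable. The fibre attenuation and connector losses are hypothetical placeholders, not values from the article.

```python
# Worst-case power budget and remaining margin for a 500 m DR link.
# Fibre and connector losses below are HYPOTHETICAL example values.


def power_budget_db(min_launch_dbm: float, sensitivity_dbm: float) -> float:
    """Loss the link can tolerate between transmitter and receiver."""
    return min_launch_dbm - sensitivity_dbm


def link_margin_db(budget_db: float, fibre_km: float,
                   fibre_loss_db_per_km: float, connector_losses_db) -> float:
    """Budget left over after fibre attenuation and connector losses."""
    total_loss = fibre_km * fibre_loss_db_per_km + sum(connector_losses_db)
    return budget_db - total_loss


budget = power_budget_db(min_launch_dbm=0.0, sensitivity_dbm=-7.0)
margin = link_margin_db(
    budget,
    fibre_km=0.5,               # 500 m reach from the table
    fibre_loss_db_per_km=0.4,   # hypothetical single-mode attenuation
    connector_losses_db=[0.5, 0.5],  # hypothetical connector losses
)
print(f"budget: {budget:.1f} dB, margin: {margin:.1f} dB")  # budget: 7.0 dB, margin: 5.8 dB
```

A positive margin under worst-case inputs is what lets the link tolerate the dirty connectors or extra insertion loss mentioned in the sensitivity discussion.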


Ruijie Networks is dedicated to advancing network planning for AIGC computing power scenarios and has developed three in-house LPO DR optical modules to address the interconnection requirements of three distinct network architectures.


Currently, the company is conducting joint adaptation testing with leading manufacturers, and further updates will be provided. See also: "OFC 2024 | Ruijie Networks partners with ByteDance to demonstrate the capabilities of the 800G LPO optical module".

As a comprehensive service expert in the GenAI era, Ruijie Networks is committed to delivering full-stack products and solutions that encompass Infrastructure as a Service (IaaS) to Platform as a Service (PaaS). Our offerings include high-performance networks and optimized GPU computing power scheduling, with the objective of significantly enhancing production efficiency and reducing operational costs through innovative technological solutions. We firmly believe that through our endeavours, we can foster a more intelligent, efficient, and reliable future for our clients. We invite collaboration as we explore opportunities within the AI era together.



