PCIe Error System

1. Error Classification

| Error Classification | Description | Summary | Example Errors |
|---|---|---|---|
| Correctable Errors | Correctable errors include those error conditions where hardware can recover without any loss of information. Hardware corrects these errors and software intervention is not required. For example, an LCRC error in a TLP that might be corrected by Data Link Level Retry is considered a correctable error. Measuring the frequency of Link-level correctable errors may be helpful for profiling the integrity of a Link. Correctable errors also include transaction-level cases where one agent detects an error with a TLP, but another agent is responsible for taking any recovery action if needed, such as re-attempting the operation with a separate subsequent transaction. The detecting agent can be configured to report the error as being correctable since the recovery agent may be able to correct it. If recovery action is indeed needed, the recovery agent must report the error as uncorrectable if the recovery agent decides not to attempt recovery. | Recoverable by hardware; recoverable at the transaction level | Bad TLP, Bad DLLP, Replay Timer Timeout, REPLAY_NUM Rollover |
| Fatal Errors | Uncorrectable errors are those error conditions that impact functionality of the interface. There is no mechanism defined in this specification to correct these errors. Reporting an uncorrectable error is analogous to asserting SERR# in PCI/PCI-X. For more robust error handling by the system, this specification further classifies uncorrectable errors as Fatal and Non-fatal. Fatal errors are uncorrectable error conditions which render the particular Link and related hardware unreliable. For Fatal errors, a reset of the components on the Link may be required to return to reliable operation. Platform handling of Fatal errors, and any efforts to limit the effects of these errors, is platform implementation specific. | The particular Link and related hardware are unreliable | Data Link Protocol Error, Surprise Down, Receiver Overflow, Flow Control Protocol Error, Malformed TLP, TLP Prefix Blocked |
| Non-Fatal Errors | Non-fatal errors are uncorrectable errors which cause a particular transaction to be unreliable but the Link is otherwise fully functional. Isolating Non-fatal from Fatal errors provides Requester/Receiver logic in a device or system management software the opportunity to recover from the error without resetting the components on the Link and disturbing other transactions in progress. Devices not associated with the transaction in error are not impacted by the error. | A particular transaction is unreliable; the Link remains fully functional | Poisoned TLP Received, ECRC Check Failed, Unsupported Request (UR), Completion Timeout, Completer Abort, Unexpected Completion, ACS Violation, MC Blocked TLP, AtomicOp Egress Blocked |
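
The grouping in the table can be expressed as a small lookup. The following is only a sketch: it covers a handful of the errors listed above, and the names `Severity`, `SEVERITY_BY_ERROR`, `RECOVERY_HINT`, and `describe` are made up for illustration. Note also that with Advanced Error Reporting the severity of uncorrectable errors is programmable, so this reflects the table's grouping rather than a fixed rule.

```python
from enum import Enum


class Severity(Enum):
    CORRECTABLE = "correctable"
    NON_FATAL = "non-fatal"
    FATAL = "fatal"


# A few error names from the table, grouped by the row they appear in.
SEVERITY_BY_ERROR = {
    "Bad TLP": Severity.CORRECTABLE,
    "Bad DLLP": Severity.CORRECTABLE,
    "REPLAY_NUM Rollover": Severity.CORRECTABLE,
    "Surprise Down": Severity.FATAL,
    "Flow Control Protocol Error": Severity.FATAL,
    "Malformed TLP": Severity.FATAL,
    "Poisoned TLP Received": Severity.NON_FATAL,
    "Completion Timeout": Severity.NON_FATAL,
    "Completer Abort": Severity.NON_FATAL,
}

# Hypothetical one-line summaries of what each class implies for recovery.
RECOVERY_HINT = {
    Severity.CORRECTABLE: "hardware recovers, no software action required",
    Severity.NON_FATAL: "one transaction is unreliable, the Link still works",
    Severity.FATAL: "Link is unreliable, a reset may be required",
}


def describe(error_name: str) -> str:
    """Return a short description for one of the errors in the lookup."""
    severity = SEVERITY_BY_ERROR[error_name]
    return f"{error_name}: {severity.value}; {RECOVERY_HINT[severity]}"


if __name__ == "__main__":
    for name in ("Bad TLP", "Completion Timeout", "Malformed TLP"):
        print(describe(name))
```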

2. Error Signaling

| Signaling Mechanism | Description | |
|---|---|---|
| Completion Status | The Completion Status field (when status is not Successful Completion) in the Completion header indicates that the associated Request failed (see Section 2.2.9). This is one method of error reporting which enables the Requester to associate an error with a specific Request. In other words, since Non-Posted Requests are not considered complete until after the Completion returns, the Completion Status field gives the Requester an opportunity to "fix" the problem at some higher level protocol (outside the scope of this specification). For example, if a Read is issued to prefetchable Memory Space and the Completion returns with an Unsupported Request Completion Status, the Requester would not be in violation of this specification if it chose to reissue the Read Request. Note that from a PCI Express point of view, the reissued Read Request is a distinct Request, and there is no relationship (on PCI Express) between the initial Request and the reissued Request. | |
| Error Messages | Error Messages are sent to the Root Complex for reporting the detection of errors according to the severity of the error. Error messages that originate from PCI Express or Legacy Endpoints are sent to corresponding Root Ports. Errors that originate from a Root Port itself are reported through the same Root Port. If a Root Complex Event Collector is implemented, errors that originate from a Root Complex Integrated Endpoint may optionally be sent to the corresponding Root Complex Event Collector. Errors that originate from a Root Complex Integrated Endpoint are reported in a Root Complex Event Collector residing on the same Logical Bus as the Root Complex Integrated Endpoint. The Root Complex Event Collector must explicitly declare supported Root Complex Integrated Endpoints as part of its capabilities; each Root Complex Integrated Endpoint must be associated with no more than one Root Complex Event Collector. When multiple errors of the same severity are detected, the corresponding error Messages with the same Requester ID may be merged for different errors of the same severity. At least one error Message must be sent for detected errors of each severity level. Note, however, that the detection of a given error in some cases will preclude the reporting of certain errors. Refer to Section 6.2.3.2.3. Also note special rules in Section 6.2.4 regarding non-Function-specific errors in multi-Function devices. | |
| Error Forwarding (Data Poisoning) | Error Forwarding, also known as data poisoning, is indicated by setting the EP bit in a TLP. Refer to Section 2.7.2. This is another method of error reporting in PCI Express that enables the Receiver of a TLP to associate an error with a specific Request or Completion. In contrast to the Completion Status mechanism, Error Forwarding can be used with either Requests or Completions that contain data. In addition, "intermediate" Receivers along the TLP's route, not just the Receiver at the ultimate destination, are required to detect and report (if enabled) receiving the poisoned TLP. This can help software determine if a particular Switch along the path poisoned the TLP. | |
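
Two of the mechanisms above, Completion Status and the EP bit, are visible directly in a Completion header. The sketch below assumes the header is available as 12 raw bytes in wire order, with the EP bit at bit 6 of byte 2 and the 3-bit Completion Status in bits 7:5 of byte 6. The function name `decode_completion` and the sample headers are made up for illustration.

```python
# Completion Status encodings; any other value is reserved.
COMPLETION_STATUS = {
    0b000: "Successful Completion (SC)",
    0b001: "Unsupported Request (UR)",
    0b010: "Configuration Request Retry Status (CRS)",
    0b100: "Completer Abort (CA)",
}


def decode_completion(header: bytes) -> tuple[str, bool]:
    """Return (completion status name, poisoned flag) for a 3DW Completion header."""
    if len(header) < 12:
        raise ValueError("a Completion header is 3 DW (12 bytes)")
    poisoned = bool(header[2] & 0x40)      # EP bit: byte 2, bit 6
    status_code = (header[6] >> 5) & 0x7   # Completion Status: byte 6, bits 7:5
    status = COMPLETION_STATUS.get(status_code, "Reserved")
    return status, poisoned


if __name__ == "__main__":
    # Hypothetical Cpl (no data) returned with Unsupported Request status.
    ur_cpl = bytes([0x0A, 0x00, 0x00, 0x00, 0x00, 0x00, 0x20, 0x00, 0, 0, 0, 0])
    # Hypothetical CplD (with data) that is successful but poisoned (EP set).
    poisoned_cpld = bytes([0x4A, 0x00, 0x40, 0x01, 0x00, 0x00, 0x00, 0x04, 0, 0, 0, 0])
    print(decode_completion(ur_cpl))
    print(decode_completion(poisoned_cpld))
```

A Requester would typically act on the status (for example, by deciding whether to reissue a read), while the poisoned flag marks the payload itself as corrupt.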

3. Error Logging

  • Device Status
  • Advanced Error Reporting Capability
  • PCI compatible (Type 00h and 01h) configuration registers
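
As an illustration of the first item, the four error bits of the Device Status register can be decoded as follows. This is a sketch that assumes the 16-bit register value has already been read from the PCI Express Capability structure; the helper name `decode_device_status` is made up for this example. These bits record detection regardless of whether reporting is enabled in Device Control, and they are cleared by writing 1.

```python
# Device Status error bits (bit position -> meaning).
DEVICE_STATUS_BITS = {
    0: "Correctable Error Detected",
    1: "Non-Fatal Error Detected",
    2: "Fatal Error Detected",
    3: "Unsupported Request Detected",
}


def decode_device_status(value: int) -> list[str]:
    """List the error-detected flags that are set in a Device Status value."""
    return [name for bit, name in DEVICE_STATUS_BITS.items() if value & (1 << bit)]


if __name__ == "__main__":
    # 0x0009: bits 0 and 3 set.
    print(decode_device_status(0x0009))
```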


| Structure | Register and Fields | Description | |
|---|---|---|---|
| PCI-Compatible Configuration Registers | Command Register: SERR# Enable | Enables reporting of Non-Fatal and Fatal errors (through this bit or the corresponding bits in the Device Control register); controls whether error Messages are sent | |
| PCI Express Capability Structure | Device Control Register: Correctable Error Reporting Enable / Non-Fatal Error Reporting Enable / Fatal Error Reporting Enable / Unsupported Request Reporting Enable | Enables error reporting; controls whether error Messages are sent | |
| | Root Control Register: System Error on Correctable Error Enable / System Error on Non-Fatal Error Enable / System Error on Fatal Error Enable | If Set, each bit indicates that a System Error should be generated if an error of the corresponding severity is reported by any of the devices in the hierarchy associated with this Root Port, or by the Root Port itself. The mechanism for signaling a System Error to the system is system specific. 1. In OS native mode, PCIe AER errors are reported by the corresponding Root Port raising an MSI, so the Root Control register is not needed. 2. In firmware-first mode, the Root Control bits must be enabled so that the error can propagate to a global AER module, which then raises an SMI to notify the BIOS. The error pin is also driven by the global AER module, so the usual case is firmware | |
| Advanced Error Reporting Capability | Root Error Command Register: Correctable Error Reporting Enable / Non-Fatal Error Reporting Enable / Fatal Error Reporting Enable | Enables interrupt generation for AER error reporting | |
| Type 1 Configuration Space Header | Bridge Control Register: SERR# Enable | This bit controls forwarding of ERR_COR, ERR_NONFATAL and ERR_FATAL from secondary to primary | |
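
The first two rows of the table can be read as a simple gating rule for whether a Function sends an error Message. The sketch below makes several assumptions: it takes the raw 16-bit Command and Device Control values, it ignores AER mask registers and the special handling of Unsupported Requests, and the function name `message_enabled` and the severity strings are made up for illustration.

```python
CMD_SERR_ENABLE = 1 << 8        # Command register: SERR# Enable
DEVCTL_CORRECTABLE_EN = 1 << 0  # Device Control: Correctable Error Reporting Enable
DEVCTL_NONFATAL_EN = 1 << 1     # Device Control: Non-Fatal Error Reporting Enable
DEVCTL_FATAL_EN = 1 << 2        # Device Control: Fatal Error Reporting Enable


def message_enabled(severity: str, command: int, devctl: int) -> bool:
    """Return True if an error Message of this severity would be sent.

    severity is one of "correctable", "nonfatal", "fatal".
    """
    serr = bool(command & CMD_SERR_ENABLE)
    if severity == "correctable":
        # ERR_COR is gated only by the Device Control enable bit.
        return bool(devctl & DEVCTL_CORRECTABLE_EN)
    if severity == "nonfatal":
        # ERR_NONFATAL: SERR# Enable or the Device Control enable bit.
        return serr or bool(devctl & DEVCTL_NONFATAL_EN)
    if severity == "fatal":
        # ERR_FATAL: SERR# Enable or the Device Control enable bit.
        return serr or bool(devctl & DEVCTL_FATAL_EN)
    raise ValueError(f"unknown severity: {severity!r}")


if __name__ == "__main__":
    # SERR# Enable alone covers fatal and non-fatal, but not correctable.
    print(message_enabled("fatal", 0x0100, 0x0000))
    print(message_enabled("correctable", 0x0100, 0x0000))
```

Whether the Root Port then raises an interrupt or a System Error is a separate decision, controlled by the Root Error Command and Root Control registers in the later rows.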
