1、 Error Classification
| Error Classification | Description | Summary | Example Errors |
|---|---|---|---|
| Correctable Errors | Correctable errors include those error conditions where hardware can recover without any loss of information. Hardware corrects these errors and software intervention is not required. For example, an LCRC error in a TLP that might be corrected by Data Link Level Retry is considered a correctable error. Measuring the frequency of Link-level correctable errors may be helpful for profiling the integrity of a Link. Correctable errors also include transaction-level cases where one agent detects an error with a TLP, but another agent is responsible for taking any recovery action if needed, such as re-attempting the operation with a separate subsequent transaction. The detecting agent can be configured to report the error as being correctable since the recovery agent may be able to correct it. If recovery action is indeed needed, the recovery agent must report the error as uncorrectable if the recovery agent decides not to attempt recovery. | Recoverable by hardware; recoverable at the transaction level | Bad TLP; Bad DLLP; Replay Timer Timeout; REPLAY_NUM Rollover |
| Fatal Errors | Uncorrectable errors are those error conditions that impact functionality of the interface. There is no mechanism defined in this specification to correct these errors. Reporting an uncorrectable error is analogous to asserting SERR# in PCI/PCI-X. For more robust error handling by the system, this specification further classifies uncorrectable errors as Fatal and Non-fatal. Fatal errors are uncorrectable error conditions which render the particular Link and related hardware unreliable. For Fatal errors, a reset of the components on the Link may be required to return to reliable operation. Platform handling of Fatal errors, and any efforts to limit the effects of these errors, is platform implementation specific. | The particular Link and related hardware are unreliable | Data Link Protocol Error; Surprise Down; Receiver Overflow; Flow Control Protocol Error; Malformed TLP; Uncorrectable TLP Prefix Blocked |
| Non-Fatal Errors | Non-fatal errors are uncorrectable errors which cause a particular transaction to be unreliable but the Link is otherwise fully functional. Isolating Non-fatal from Fatal errors provides Requester/Receiver logic in a device or system management software the opportunity to recover from the error without resetting the components on the Link and disturbing other transactions in progress. Devices not associated with the transaction in error are not impacted by the error. | A particular transaction is unreliable; the Link remains fully functional | Poisoned TLP Received; ECRC Check Failed; Unsupported Request (UR); Completion Timeout; Completer Abort; Unexpected Completion; ACS Violation; MC Blocked TLP; AtomicOp Egress Blocked |
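
The three classes in the table above can be modelled as a small lookup, which may be handy when triaging logs. The following is a minimal sketch and not a spec-defined API: the names `Severity`, `DEFAULT_SEVERITY`, and `may_need_link_reset` are hypothetical, and the mapping covers only a subset of the error names listed in the table.

```python
from enum import Enum


class Severity(Enum):
    """Error classes from the table; values are the matching error Message names."""
    CORRECTABLE = "ERR_COR"
    NON_FATAL = "ERR_NONFATAL"
    FATAL = "ERR_FATAL"


# Hypothetical lookup: a subset of the error names from the table,
# mapped to the class the table lists them under.
DEFAULT_SEVERITY = {
    "Bad TLP": Severity.CORRECTABLE,
    "Bad DLLP": Severity.CORRECTABLE,
    "REPLAY_NUM Rollover": Severity.CORRECTABLE,
    "Data Link Protocol Error": Severity.FATAL,
    "Surprise Down": Severity.FATAL,
    "Malformed TLP": Severity.FATAL,
    "Poisoned TLP Received": Severity.NON_FATAL,
    "Completion Timeout": Severity.NON_FATAL,
    "Completer Abort": Severity.NON_FATAL,
}


def may_need_link_reset(error_name: str) -> bool:
    """Only Fatal errors render the Link unreliable, so only they may need a reset."""
    return DEFAULT_SEVERITY[error_name] is Severity.FATAL


if __name__ == "__main__":
    for name in ("Bad TLP", "Completion Timeout", "Malformed TLP"):
        print(name, DEFAULT_SEVERITY[name].value, may_need_link_reset(name))
```

A Non-fatal error such as Completion Timeout leaves the Link usable, so recovery can be attempted per transaction without touching other traffic.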
2、 Error Signaling
| Mechanism | Description | Notes |
|---|---|---|
| Completion Status | The Completion Status field (when status is not Successful Completion) in the Completion header indicates that the associated Request failed (see Section 2.2.9). This is one method of error reporting which enables the Requester to associate an error with a specific Request. In other words, since Non-Posted Requests are not considered complete until after the Completion returns, the Completion Status field gives the Requester an opportunity to "fix" the problem at some higher level protocol (outside the scope of this specification). For example, if a Read is issued to prefetchable Memory Space and the Completion returns with an Unsupported Request Completion Status, the Requester would not be in violation of this specification if it chose to reissue the Read Request. Note that from a PCI Express point of view, the reissued Read Request is a distinct Request, and there is no relationship (on PCI Express) between the initial Request and the reissued Request. | |
| Error Messages | Error Messages are sent to the Root Complex for reporting the detection of errors according to the severity of the error. Error messages that originate from PCI Express or Legacy Endpoints are sent to corresponding Root Ports. Errors that originate from a Root Port itself are reported through the same Root Port. If a Root Complex Event Collector is implemented, errors that originate from a Root Complex Integrated Endpoint may optionally be sent to the corresponding Root Complex Event Collector. Errors that originate from a Root Complex Integrated Endpoint are reported in a Root Complex Event Collector residing on the same Logical Bus as the Root Complex Integrated Endpoint. The Root Complex Event Collector must explicitly declare supported Root Complex Integrated Endpoints as part of its capabilities; each Root Complex Integrated Endpoint must be associated with no more than one Root Complex Event Collector. When multiple errors of the same severity are detected, the corresponding error Messages with the same Requester ID may be merged for different errors of the same severity. At least one error Message must be sent for detected errors of each severity level. Note, however, that the detection of a given error in some cases will preclude the reporting of certain errors. Refer to Section 6.2.3.2.3. Also note special rules in Section 6.2.4 regarding non-Function-specific errors in multi-Function devices. | |
| Error Forwarding (Data Poisoning) | Error Forwarding, also known as data poisoning, is indicated by setting the EP bit in a TLP. Refer to Section 2.7.2. This is another method of error reporting in PCI Express that enables the Receiver of a TLP to associate an error with a specific Request or Completion. In contrast to the Completion Status mechanism, Error Forwarding can be used with either Requests or Completions that contain data. In addition, "intermediate" Receivers along the TLP's route, not just the Receiver at the ultimate destination, are required to detect and report (if enabled) receiving the poisoned TLP. This can help software determine if a particular Switch along the path poisoned the TLP. | |
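
Two of the mechanisms above are carried in TLP header fields, so they can be illustrated by decoding raw header bytes. The sketch below assumes the header layout as I understand it from the PCI Express Base Specification: the EP bit is bit 6 of byte 2, and the Completion Status is bits 7:5 of byte 6 in a Completion header. The function names are hypothetical, and the offsets should be checked against the specification revision in use.

```python
# Completion Status encodings (3-bit field in the Completion header).
COMPLETION_STATUS = {
    0b000: "Successful Completion (SC)",
    0b001: "Unsupported Request (UR)",
    0b010: "Configuration Request Retry Status (CRS)",
    0b100: "Completer Abort (CA)",
}


def is_poisoned(header: bytes) -> bool:
    """Return True if the EP bit (assumed: byte 2, bit 6) is set in the TLP header."""
    return bool(header[2] & 0x40)


def completion_status(header: bytes) -> str:
    """Decode the Completion Status (assumed: byte 6, bits 7:5) of a Completion header."""
    code = (header[6] >> 5) & 0x7
    return COMPLETION_STATUS.get(code, f"Reserved ({code:03b})")


def request_failed(header: bytes) -> bool:
    """Any status other than Successful Completion means the Request failed."""
    return ((header[6] >> 5) & 0x7) != 0b000


if __name__ == "__main__":
    # Illustrative 3DW Completion header with status = UR (byte 6 = 0x20).
    cpl_ur = bytes([0x0A, 0x00, 0x00, 0x00,
                    0x01, 0x00, 0x20, 0x04,
                    0x02, 0x00, 0x00, 0x00])
    print(completion_status(cpl_ur), request_failed(cpl_ur), is_poisoned(cpl_ur))
```

A Requester that sees `request_failed` return true is free to reissue the Request at a higher protocol level; on the Link, the retry is an unrelated, distinct Request.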
3、 Error Logging
- Device Status
- Advanced Error Reporting Capability
- PCI compatible (Type 00h and 01h) configuration registers
| Structure | Register and Fields | Description | Notes |
|---|---|---|---|
| PCI-Compatible Configuration Registers | Command Register: SERR# Enable | Enables reporting of Non-Fatal and Fatal errors (through this bit or the corresponding bits in the Device Control register); controls whether error Messages are sent | |
| PCI Express Capability Structure | Device Control Register: Correctable Error Reporting Enable / Non-Fatal Error Reporting Enable / Fatal Error Reporting Enable / Unsupported Request Reporting Enable | Enables error reporting; controls whether error Messages are sent | |
| | Root Control Register: System Error on Correctable Error Enable / System Error on Non-Fatal Error Enable / System Error on Fatal Error Enable | If Set, each bit indicates that a System Error should be generated if an error of the corresponding severity (correctable, non-fatal, or fatal) is reported by any of the devices in the hierarchy associated with this Root Port, or by the Root Port itself. The mechanism for signaling a System Error to the system is system specific. 1. In OS-native mode, PCIe AER errors are reported by the corresponding Root Port raising an MSI, so the Root Control register is not needed. 2. In firmware-first mode, the Root Control bits must be enabled so that the error propagates to a global AER module, which then raises an SMI to notify the BIOS. The error pin is also driven by the global AER module, so the usual case is firmware | |
| Advanced Error Reporting Capability | Root Error Command Register: Correctable Error Reporting Enable / Non-Fatal Error Reporting Enable / Fatal Error Reporting Enable | Enables interrupt generation for errors reported through AER | |
| Type 1 Configuration Space Header | Bridge Control Register: SERR# Enable | This bit controls forwarding of ERR_COR, ERR_NONFATAL, and ERR_FATAL from the secondary interface to the primary interface. | |
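
The way the Command and Device Control enable bits in the table combine can be sketched as a small predicate. This is a simplified illustration under stated assumptions, not a complete model of the reporting rules: the bit positions (Command register bit 8 for SERR# Enable, Device Control bits 0 to 3) are as I understand them from the PCI Express Base Specification, the function and constant names are hypothetical, and the extra gating of Unsupported Request errors is ignored.

```python
# Assumed bit positions; verify against the specification revision in use.
CMD_SERR_ENABLE = 1 << 8      # Command register: SERR# Enable
DEVCTL_CORR_EN = 1 << 0       # Device Control: Correctable Error Reporting Enable
DEVCTL_NONFATAL_EN = 1 << 1   # Device Control: Non-Fatal Error Reporting Enable
DEVCTL_FATAL_EN = 1 << 2      # Device Control: Fatal Error Reporting Enable
DEVCTL_UR_EN = 1 << 3         # Device Control: Unsupported Request Reporting Enable


def error_message_enabled(severity: str, command: int, devctl: int) -> bool:
    """Return True if an error Message of the given severity may be sent.

    Non-Fatal and Fatal reporting is enabled by SERR# Enable OR the matching
    Device Control bit; Correctable reporting depends on Device Control only.
    """
    serr = bool(command & CMD_SERR_ENABLE)
    if severity == "correctable":
        return bool(devctl & DEVCTL_CORR_EN)
    if severity == "nonfatal":
        return serr or bool(devctl & DEVCTL_NONFATAL_EN)
    if severity == "fatal":
        return serr or bool(devctl & DEVCTL_FATAL_EN)
    raise ValueError(f"unknown severity: {severity!r}")


if __name__ == "__main__":
    # SERR# Enable set, Device Control all clear.
    for sev in ("correctable", "nonfatal", "fatal"):
        print(sev, error_message_enabled(sev, command=0x0100, devctl=0x0000))
```

With only SERR# Enable set, the sketch reports Non-Fatal and Fatal Messages as enabled and Correctable Messages as disabled, which matches the note in the table that either bit can enable uncorrectable error reporting.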