Documentation: PCI: Sync AER doc with code

The PCIe Advanced Error Reporting driver has evolved over the years but
its documentation hasn't.  Catch up with past code changes:

* The documentation claims that Correctable Errors are logged with
  KERN_INFO severity, but the code uses KERN_WARN.

  It had used KERN_WARN from the beginning with commit 6c2b374d74
  ("PCI-Express AER implemetation: AER core and aerdriver").  In 2013,
  commit 2cced2d959 ("aerdrv: Cleanup log output for AER") switched to
  KERN_ERR, until 2020 when it was reverted back to KERN_WARN by commit
  e83e2ca3c3 ("PCI/AER: Log correctable errors as warning, not error").

* An example log message in the documentation uses the term "Uncorrected",
  but the code uses "Uncorrectable" since commit 02a06f5f1a ("PCI/AER:
  Use 'Correctable' and 'Uncorrectable' spec terms for errors").

* The example contains the Requester ID "id=0500", which is omitted since
  commit 010caed4cc ("PCI/AER: Decode Error Source Requester ID").

* The example contains the error name "Unsupported Request", which is
  instead reported as "UnsupReq" since commit bd237801fe ("PCI/AER:
  Adopt lspci names for AER error decoding").

* The example doesn't prepend "0x" to hex values from the TLP Header Log,
  as introduced by commit f68ea779d9 ("PCI: Add pcie_print_tlp_log() to
  print TLP Header and Prefix Log").

* The documentation refers to a reset_link callback which was removed by
  commit b6cf1a42f9 ("PCI/ERR: Remove service dependency in
  pcie_do_recovery()").

* Commit 5790862255 ("PCI/ERR: Recover from RCiEP AER errors") added
  support to recover Root Complex Integrated Endpoints by applying a
  Function Level Reset, alternatively to the Secondary Bus Reset which is
  applied otherwise.

* On non-fatal errors, a reset was previously never performed.  But the
  AER driver has just been amended to allow drivers to opt in to a reset.

* The documentation claims that a warning message is logged if a driver
  lacks pci_error_handlers.  But the message has been informational
  (logged with KERN_INFO severity) since its introduction with commit
  01daacfb90 ("PCI/AER: Log which device prevents error recovery").

  The documentation claims that the message is only logged for fatal
  errors, which is incorrect.  Moreover it refers to "section 3", even
  though the documentation no longer contains section numbers since commit
  4e37f055a9 ("Documentation: PCI: convert pcieaer-howto.txt to reST").
  Section 3 is titled "Developer Guide".  That's the same section where
  the reference is located, so it is self-referential and can be dropped.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://patch.msgid.link/7501bfc5b9920193a25998a3cbcf72c47674ec63.1757942121.git.lukas@wunner.de
This commit is contained in:
Lukas Wunner 2025-09-15 15:50:01 +02:00 committed by Bjorn Helgaas
parent 0a27bdb14b
commit 01cc0dc9de
1 changed files with 38 additions and 43 deletions

View File

@ -70,16 +70,16 @@ AER error output
----------------
When a PCIe AER error is captured, an error message will be output to
console. If it's a correctable error, it is output as an info message.
console. If it's a correctable error, it is output as a warning message.
Otherwise, it is printed as an error. So users could choose different
log level to filter out correctable error messages.
Below shows an example::
0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
0000:50:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Requester ID)
0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
0000:50:00.0: [20] Unsupported Request (First)
0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100
0000:50:00.0: [20] UnsupReq (First)
0000:50:00.0: TLP Header: 0x04000001 0x00200a03 0x05010000 0x00050100
In the example, 'Requester ID' means the ID of the device that sent
the error message to the Root Port. Please refer to PCIe specs for other
@ -152,18 +152,6 @@ the device driver.
Provide callbacks
-----------------
callback reset_link to reset PCIe link
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This callback is used to reset the PCIe physical link when a
fatal error happens. The Root Port AER service driver provides a
default reset_link function, but different Upstream Ports might
have different specifications to reset the PCIe link, so
Upstream Port drivers may provide their own reset_link functions.
Section 3.2.2.2 provides more detailed info on when to call
reset_link.
PCI error-recovery callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -174,8 +162,8 @@ when performing error recovery actions.
Data struct pci_driver has a pointer, err_handler, to point to
pci_error_handlers who consists of a couple of callback function
pointers. The AER driver follows the rules defined in
pci-error-recovery.rst except PCIe-specific parts (e.g.
reset_link). Please refer to pci-error-recovery.rst for detailed
pci-error-recovery.rst except PCIe-specific parts (see
below). Please refer to pci-error-recovery.rst for detailed
definitions of the callbacks.
The sections below specify when to call the error callback functions.
@ -189,10 +177,21 @@ software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.
Non-correctable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Uncorrectable (non-fatal and fatal) errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If an error message indicates a non-fatal error, performing link reset
The AER driver performs a Secondary Bus Reset to recover from
uncorrectable errors. The reset is applied at the port above
the originating device: If the originating device is an Endpoint,
only the Endpoint is reset. If on the other hand the originating
device has subordinate devices, those are all affected by the
reset as well.
If the originating device is a Root Complex Integrated Endpoint,
there's no port above where a Secondary Bus Reset could be applied.
In this case, the AER driver instead applies a Function Level Reset.
If an error message indicates a non-fatal error, performing a reset
at upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) to all drivers associated within a hierarchy in
question. For example::
@ -204,38 +203,34 @@ Downstream Port B and Endpoint.
A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover or the AER driver calls mmio_enabled as next.
whether it can recover without a reset, considers the device unrecoverable
or needs a reset for recovery. If all affected drivers agree that they can
recover without a reset, it is skipped. Should one driver request a reset,
it overrides all other drivers.
If an error message indicates a fatal error, kernel will broadcast
error_detected(dev, pci_channel_io_frozen) to all drivers within
a hierarchy in question. Then, performing link reset at upstream is
necessary. As different kinds of devices might use different approaches
to reset link, AER port service driver is required to provide the
function to reset link via callback parameter of pcie_do_recovery()
function. If reset_link is not NULL, recovery function will use it
to reset the link. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER
and reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
to mmio_enabled.
a hierarchy in question. Then, performing a reset at upstream is
necessary. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER
to indicate that recovery without a reset is possible, the error
handling goes to mmio_enabled, but afterwards a reset is still
performed.
Frequent Asked Questions
------------------------
In other words, for non-fatal errors, drivers may opt in to a reset.
But for fatal errors, they cannot opt out of a reset, based on the
assumption that the link is unreliable.
Frequently Asked Questions
--------------------------
Q:
What happens if a PCIe device driver does not provide an
error recovery handler (pci_driver->err_handler is equal to NULL)?
A:
The devices attached with the driver won't be recovered. If the
error is fatal, kernel will print out warning messages. Please refer
to section 3 for more information.
Q:
What happens if an upstream port service driver does not provide
callback reset_link?
A:
Fatal error recovery will fail if the errors are reported by the
upstream ports who are attached by the service driver.
The devices attached with the driver won't be recovered.
The kernel will print out informational messages to identify
unrecoverable devices.
Software error injection