Enable propolis to generate ACPI tables #999
base: master
5fed974 to
b234998
Compare
Add a TableLoader builder that can be used to generate the etc/table-loader file to be passed to guest firmware via fw_cfg. The etc/table-loader file in fw_cfg contains the sequence of fixed-size linker/loader commands that instruct the guest to allocate memory for a set of fw_cfg files (e.g. ACPI tables), link allocated memory by patching pointers and calculate the ACPI checksum. Signed-off-by: Amey Narkhede <[email protected]>
b234998 to
d05da54
Compare
|
Thanks for taking a swing at this. It'll be nice to have ACPI table generation wired up. Some initial high-level feedback, which you can take with as much salt as you want, since I'm ex-Oxide now: You're defining quite a few ACPI-specific structs in When it comes to DSDT generation, I think this is probably something we'll want to farm out to the various pieces of specific device emulation? They could own the specific knowledge required, rather than defining all those constants in acpi/dsdt.rs. Maybe think about a trait they could opt into for appending bits to a DSDT we build while assembling the machine? |
|
That makes sense. I'll create a trait for Dsdt and implement it for each device that is being exposed. |
Add builders to generate basic ACPI tables RSDP(ACPI 2.0+) that points to XSDT, XSDT with 64-bit table pointers and RSDT with 32-bit table pointers that would work with the table-loader mechanism in fw_cfg. These tables are used to describe the ACPI table hierarchy to guest firmware. The builders produce raw table data bytes with placeholder addresses and checksums that are fixed up by firmware using table-loader commands. Signed-off-by: Amey Narkhede <[email protected]>
FADT describes fixed hardware features and points to the DSDT. The builder supports both standard and HW-reduced ACPI modes. DSDT contains AML bytecode describing system hardware. The builder provides methods to append AML data which could be populated by an AML generation mechanism in subsequent commits. Signed-off-by: Amey Narkhede <[email protected]>
Add a builder for the Multiple APIC Description Table (MADT) that describes the system's interrupt controllers. Supports adding local APIC, I/O APIC and interrupt source overrides for describing processor and interrupt controller topology. Signed-off-by: Amey Narkhede <[email protected]>
Add builders for MCFG and HPET ACPI tables. MCFG describes the PCIe ECAM base address, PCIe segment group and bus number range for firmware to locate PCI Express configuration space. HPET describes the HPET hardware to the guest. The table uses the bhyve HPET hardware ID (0x80860701) and maps to the standard HPET MMIO address at 0xfed00000. Signed-off-by: Amey Narkhede <[email protected]>
Add the FACS table that provides a memory region for firmware/OS handshaking. The table includes the GlobalLock field for OS/firmware mutual exclusion during ACPI operations. We don't yet have support for GBL_EN handling[1], but expose the table to match OVMF's behaviour. [1]: oxidecomputer#837 Signed-off-by: Amey Narkhede <[email protected]>
Define bytecode opcodes for AML generation per ACPI Specification Chapter 20 [1]. Includes namespace modifiers, named objects, data object prefixes, name path prefixes, local/argument references, control flow and logical/arithmetic operators. These constants will be used in subsequent commits to generate AML bytecode which would enable us to generate ACPI tables ourselves. [1]: https://uefi.org/specs/ACPI/6.5/20_AML_Specification.html Signed-off-by: Amey Narkhede <[email protected]>
Implement NameSeg and NameString encoding per ACPI Specification Section 20.2.2 [1]. Single segments encode as 4 bytes padded with underscores, dual segments use DualNamePrefix and three or more use MultiNamePrefix with a count byte. Also implement EISA ID compression for hardware identification strings like "PNP0A08". [1]: https://uefi.org/specs/ACPI/6.4_A/20_AML_Specification.html#name-objects-encoding Signed-off-by: Amey Narkhede <[email protected]>
Add AML bytecode generation to mainly support dynamically generating ACPI tables and control methods. The bytecode is built in a single pass by directly writing to the output buffer. AML scopes encode their length in a 1-4 byte PkgLength field at the start[1]. Since we don't know the final size until the scope's content is fully written, reserve 4 bytes when opening a scope upfront and splice in the actual encoded length when the scope closes. This avoids complexity of having to build an in memory tree and then walk it twice to measure and serialize. The RAII guards automatically close scopes and finalize the PkgLength on drop. Those guards hold a mutable borrow on the builder so the borrow checker won't let us close a parent while a child scope is still open. The limitation of this approach is that the content has to be written in output order but that is not a big issue for the use case of VM device descriptions. [1]: ACPI Specification Section 20.2.4 https://uefi.org/specs/ACPI/6.4_A/20_AML_Specification.html#package-length-encoding Signed-off-by: Amey Narkhede <[email protected]>
Implement ResourceTemplateBuilder for constructing resource descriptors used in methods like _CRS. Supports QWord/DWord memory and I/O ranges, Word bus numbers and IRQ descriptors per ACPI Specification Section 6.4 [1]. [1]: https://uefi.org/specs/ACPI/6.4_A/06_Device_Configuration.html#resource-data-types-for-acpi Signed-off-by: Amey Narkhede <[email protected]> Signed-off-by: glitzflitz <[email protected]>
Export public API for AML generation AmlBuilder, AmlWriter trait, guard types (ScopeGuard, DeviceGuard, MethodGuard), EisaId and ResourceTemplateBuilder. This would enable generating the dynamic bytecode used in tables like DSDT. Signed-off-by: Amey Narkhede <[email protected]>
d05da54 to
0bf945c
Compare
Add DSDT generation that provides the guest OS with device information via AML. The DSDT contains _SB.PCI0 describing the PCIe host bridge with bus number and MMIO resources. The ECAM is reserved via a separate PNP0C02 motherboard resources device (_SB.MRES) rather than in the PCI host bridge's _CRS. This is required by PCI Firmware Spec 3.2, sec 4.1.2. Also add the DsdtGenerator trait that will be implemented by each device in DSDT to expose its ACPI description. Signed-off-by: Amey Narkhede <[email protected]>
Since we can generate our own ACPI tables, implement the DsdtGenerator trait for the serial console device to expose it in the generated DSDT. Signed-off-by: Amey Narkhede <[email protected]>
Add AT keyboard controller resources to allow guest to enumerate the i8042 controller. Only keyboard is added to match the OVMF's existing behaviour for now. Signed-off-by: Amey Narkhede <[email protected]>
Implement DsdtGenerator for QemuPvPanic to export it via new DSDT. Signed-off-by: Amey Narkhede <[email protected]>
The OS calls _OSC on the PCIe host bridge to negotiate control of native PCIe features like hotplug, AER and PME. Without _OSC, Linux logs a warning about missing capability negotiation (_OSC: platform retains control of PCIe features (AE_NOT_FOUND)). Since we don't have support for any PCIe handling as of now, no capabilities are exposed. In the future, when PCIe handling is implemented, the supported bits can simply be unmasked to expose them to the guest. Also, to simplify the AML generation of _OSC itself, introduce some high-level wrappers around AML generation. [1]: https://learn.microsoft.com/en-us/windows-hardware/drivers/pci/enabling-pci-express-native-control Signed-off-by: Amey Narkhede <[email protected]>
Combine all ACPI tables into the format expected by firmware(OVMF) by using fw_cfg's table-loader commands for address patching and checksum computation. Signed-off-by: Amey Narkhede <[email protected]>
0bf945c to
2e51b62
Compare
|
I pushed the new changes to move table structs out of |
|
Oops I missed the case of base -> new -> base Currently we have The "unset" state is being lost once a VM touches new propolis in this scenario
All tests should pass now |
Integrate the new ACPI table generation into propolis-standalone and propolis-server. Also replace hardcoded memory region addresses with constants that align with ACPI table definitions. The PCIe ECAM base is kept same as before at 0xe000_0000 (3.5GB) to match existing i440fx chipset ECAM placement. ECAM is no longer added to the E820 map as reserved memory since it is MMIO space properly described in the MCFG ACPI table.

Guest physical memory map:
0x0000_0000 - 0xbfff_ffff  Low RAM (up to 3 GiB)
0xc000_0000 - 0xffff_ffff  PCI hole (1 GiB MMIO region)
  0xc000_0000 - 0xdfff_ffff  32-bit PCI MMIO
  0xe000_0000 - 0xefff_ffff  PCIe ECAM (256 MiB, 256 buses)
  0xfec0_0000                IOAPIC
  0xfed0_0000                HPET
  0xffe0_0000 - 0xffff_ffff  Bootrom (2 MiB)
0x1_0000_0000+             High RAM + 64-bit PCI MMIO

e820 map as seen by guest:
0x0000_0000 - 0x0009_ffff  Usable (640 KiB low memory)
0x0010_0000 - 0xbeaf_ffff  Usable (~3 GiB main RAM)
0xbeb0_0000 - 0xbfb6_cfff  Reserved (UEFI runtime/data)
0xbfb6_d000 - 0xbfbf_efff  ACPI Tables + NVS
0xbfbf_f000 - 0xbffd_ffff  Usable (top of low memory)
0xbffe_0000 - 0xffff_ffff  Reserved (PCI hole)
0x1_0000_0000 - highmem    Usable (high RAM above 4 GiB)

To stay on safe side only enable using new ACPI tables for newly launched VMs. Old VMs using OVMF tables would keep using the same OVMF tables throughout multiple migrations. To verify this add the phd test as well for new VM launched with native tables, native tables preserved through migration and VM launched from old propolis without native tables stays with OVMF through multiple future migrations. Signed-off-by: Amey Narkhede <[email protected]> Signed-off-by: glitzflitz <[email protected]>
2e51b62 to
61f0ed4
Compare
Background
As per #695, propolis currently relies on the edk2-stable202105 version of EDK2 OVMF to provide the ACPI tables to the guest, as it was the last version that included static tables.
Another limitation is that the guest only sees whatever OVMF decided to generate rather than what the hypervisor knows about the virtual/emulated hardware.
In newer versions, OVMF expects the VMM to generate a set of ACPI tables and expose them via the fw_cfg table-loader interface. Being able to generate ACPI tables also unlocks other opportunities, such as choosing which tables and control methods to expose, PCIe host bridge and switch emulation, and native PCIe hotplug.
This PR addresses that limitation and adds a mechanism to let propolis generate its own ACPI tables.
Implementation
Overview
The series starts with implementing fw_cfg's table-loader mechanism to enable passing static tables to guest firmware (OVMF). Then the basic static tables such as RSDT, XSDT and RSDP are added.
After that we reach the second milestone: generating the AML bytecode. This is where some technical decisions need to be made, after evaluating the different options and their tradeoffs, about how to generate bytecode without introducing too much complexity.
At the end everything is wired up to switch to using propolis-generated tables.
Details
The fw_cfg Interface
QEMU's fw_cfg interface provides a mechanism for the hypervisor to expose files to guest firmware. Propolis already had basic fw_cfg support for the e820 memory map and bootrom. The ACPI implementation builds on that foundation.
OVMF expects three specific fw_cfg files for ACPI tables:
The table-loader file contains a sequence of fixed-size commands that instruct OVMF to allocate memory, patch pointer fields and compute checksums. This is necessary because the tables contain absolute addresses that are only known after OVMF allocates memory for them.
In the proposed implementation in Add fw_cfg table-loader helpers for ACPI table generation, TableLoader generates three command types:
ALLOCATE - reserves memory for a fw_cfg file with specified alignment in a given zone
ADD_POINTER - patches an address field in one file to point at another file's allocated location. The command specifies source file, destination file, offset within source and pointer size
ADD_CHECKSUM - computes a checksum over a byte range and writes it to a specified offset. ACPI tables use a simple byte sum that must equal zero.
The commands are used in Prepare the ACPI tables for generation
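To make the command format concrete, here is a minimal sketch of how the three commands could be serialized. It assumes the layout of QEMU's linker/loader interface as I understand it (128-byte little-endian commands, 56-byte NUL-padded file names, opcodes 1/2/3, zones HIGH = 1 and FSEG = 2). The helper names (Cmd, allocate, add_pointer, add_checksum, acpi_checksum) and the etc/demo/tables file name are hypothetical and are not the TableLoader API from this series.

```rust
const CMD_SIZE: usize = 128; // every command is padded to a fixed size
const FILE_SIZE: usize = 56; // NUL-padded fw_cfg file name field

#[derive(Clone, Copy)]
enum Zone {
    High = 1,
    FSeg = 2,
}

/// Fixed-size command buffer with a write cursor (hypothetical helper).
struct Cmd {
    bytes: [u8; CMD_SIZE],
    pos: usize,
}

impl Cmd {
    fn new(opcode: u32) -> Self {
        let mut cmd = Cmd { bytes: [0; CMD_SIZE], pos: 0 };
        cmd.put(&opcode.to_le_bytes());
        cmd
    }

    fn put(&mut self, data: &[u8]) {
        self.bytes[self.pos..self.pos + data.len()].copy_from_slice(data);
        self.pos += data.len();
    }

    /// Write a file name and skip to the end of its 56-byte field.
    fn file(&mut self, name: &str) {
        assert!(name.len() < FILE_SIZE, "file name needs a trailing NUL");
        let end = self.pos + FILE_SIZE;
        self.put(name.as_bytes());
        self.pos = end;
    }
}

/// ALLOCATE: place `file` in guest memory with the given alignment and zone.
fn allocate(file: &str, align: u32, zone: Zone) -> [u8; CMD_SIZE] {
    let mut cmd = Cmd::new(1);
    cmd.file(file);
    cmd.put(&align.to_le_bytes());
    cmd.put(&[zone as u8]);
    cmd.bytes
}

/// ADD_POINTER: add the address of `target_file` to the `size`-byte field
/// at `offset` inside `patched_file`.
fn add_pointer(patched_file: &str, target_file: &str, offset: u32, size: u8) -> [u8; CMD_SIZE] {
    let mut cmd = Cmd::new(2);
    cmd.file(patched_file);
    cmd.file(target_file);
    cmd.put(&offset.to_le_bytes());
    cmd.put(&[size]);
    cmd.bytes
}

/// ADD_CHECKSUM: checksum `length` bytes from `start`, store at `result_offset`.
fn add_checksum(file: &str, result_offset: u32, start: u32, length: u32) -> [u8; CMD_SIZE] {
    let mut cmd = Cmd::new(3);
    cmd.file(file);
    cmd.put(&result_offset.to_le_bytes());
    cmd.put(&start.to_le_bytes());
    cmd.put(&length.to_le_bytes());
    cmd.bytes
}

/// The value the firmware would store: it makes the byte sum wrap to zero.
fn acpi_checksum(bytes: &[u8]) -> u8 {
    0u8.wrapping_sub(bytes.iter().fold(0u8, |acc, b| acc.wrapping_add(*b)))
}
```

A table-loader file would then simply be the concatenation of such 128-byte records, in the order the firmware should execute them.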
Static Table Generation
The simpler static tables that don't require AML bytecode are implemented first.
There are some minor differences between the tables generated here and the old tables from https://github.com/oxidecomputer/edk2.
In the new DSDT, the PCIe host bridge (PNP0A08) is exposed (with _CID PNP0A03 for PCI compatibility) instead of just the PCI one in https://github.com/oxidecomputer/edk2. An _OSC method is also provided, but since propolis doesn't handle PCIe as of now, no capability is advertised to the guest.
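Hardware IDs like the host bridge's are stored in AML as compressed 32-bit EISA IDs rather than as strings. A minimal sketch of that compression follows. The function name eisa_id is a hypothetical stand-in for illustration; the EisaId type added in this series may expose a different API.

```rust
/// Compress a 7-character ID such as "PNP0A08" into its 4-byte AML form.
/// `eisa_id` is a hypothetical helper, not the EisaId type from this series.
fn eisa_id(id: &str) -> Option<[u8; 4]> {
    let b = id.as_bytes();
    if b.len() != 7
        || !b[..3].iter().all(|c| c.is_ascii_uppercase())
        || !b[3..].iter().all(|c| c.is_ascii_hexdigit())
    {
        return None;
    }
    // Three vendor letters, 5 bits each ('A' = 1 ... 'Z' = 26).
    let vendor = b[..3]
        .iter()
        .fold(0u16, |acc, &c| (acc << 5) | u16::from(c - b'@'));
    // Four hex digits form the 16-bit product number.
    let product = u16::from_str_radix(&id[3..], 16).ok()?;
    let v = vendor.to_be_bytes();
    let p = product.to_be_bytes();
    // Byte order as it appears in the AML stream.
    Some([v[0], v[1], p[0], p[1]])
}
```

For example, eisa_id("PNP0A08") yields the bytes 41 D0 0A 08, which in AML would follow a DWordPrefix byte.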
_PRT
The edk2 from https://github.com/oxidecomputer/edk2 uses legacy interrupts (https://github.com/oxidecomputer/edk2/blob/907a5fd1763ce5ddd74001261e5b52cd200a25f9/OvmfPkg/AcpiTables/Dsdt.asl#L196),
while the generated _PRT uses direct GSI-based routing.
The legacy LPC bridge is skipped in the generated tables, so no LPC bridge, IRQ link devices or PIRQ registers are present in the new tables.
For ISA, PIC (PNP0000), DMA (PNP0200), Timer (PNP0100), RTC (PNP0B00), Speaker (PNP0800), FPU (PNP0C04) and XTRA (PNP0C02) are skipped in the new tables.
Since propolis does not have hotplug support yet, the SSDT is also skipped at the moment.
AML generation and usage
The DSDT contains AML bytecode for describing devices, methods and resources. AML has a hierarchical structure with scopes containing devices which contain named objects and methods. The encoding uses variable length packages.
Possible approaches
QEMU uses a C-based approach with GArray buffers. Each AML construct is a function returning an Aml pointer that must be explicitly appended to its parent. The design is flexible but has caveats: for example, forgetting a manual aml_append call silently drops content, and there is no type safety around what can be nested. Since we are not bound by the limitations of C and have the borrow checker with us, we can do better.
crosvm defines a single Aml trait with many implementing types. Each construct is a separate struct collecting children in a Vec. The usage pattern is usually a macro followed by to_aml_bytes(), which recursively serializes the tree. Although this provides strong typing, it's a bit more complex and requires constructing the entire tree in memory before serialization. Package lengths use a two-pass approach of first measuring, then writing.
Firecracker follows the same pattern as crosvm, with trait methods along with some additional error handling.
The acpi_tables crate used by cloud-hypervisor uses a dual-trait design that splits the problem into two traits: Aml for things that can be serialized and AmlSink as the destination. The sink abstraction is nice because the same tree can write to a Vec or feed a checksum calculator without changing the serialization code. It's structurally similar to crosvm and uses the same two-pass length encoding, which gets a bit complex when building nested hierarchies.
Approach in this series
Introduce AML bytecode generation adds RAII guards that automatically finalize package lengths when dropped.
The core abstraction is an AmlBuilder that owns a single byte buffer, plus guard types for Scope, Device and Method. Each guard holds a mutable borrow on the builder, so we get compile-time scope safety through the borrow checker. This way it's impossible to miss closing any scope.
Also, using a single buffer from AmlBuilder avoids the overhead of dynamic dispatch as in the crosvm and acpi_tables approaches.
Guards borrow the builder mutably and write content directly to its buffer. When a guard is created it writes the opcode, reserves 4 bytes for the package length (the maximum encoding size) and writes the name. When the guard drops it calculates the actual package length, encodes it in 1-4 bytes and splices out the unused reserved bytes.
Usage looks like
which looks structurally similar to ASL code that is compiled to AML bytecode.
Thanks to the RAII guards, conditional content is simply an if statement, which avoids the complexity of the Option wrappers needed in the other approaches mentioned above. The limitation of this design is that it's less composable. There is no easy way to return a "partial device tree" from a function or store AML fragments for later use.
Note about Package Length Encoding
The ACPI specification Section 20.2.4 defines a variable-length encoding for package sizes. A package length includes itself in the count, which creates a circular dependency: the length must be known to encode it, but the encoding affects the length. That is why the other implementations typically use a two-pass approach.
The implementation in Introduce AML bytecode generation simply reserves the maximum of 4 bytes when opening any scope and splices in the actual encoded length when the scope closes. This produces minimal output with a single pass through the data.
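The reserve-and-splice idea can be sketched in a few dozen lines. This is an illustration under assumptions, not the code from this series: Guard, encode_pkg_length and demo are hypothetical names, a single guard type stands in for the separate Scope/Device/Method guards, and it borrows a bare Vec<u8> instead of an AmlBuilder.

```rust
/// Encode a PkgLength for `content_len` bytes that follow the length field.
/// The encoded value counts the length bytes themselves.
fn encode_pkg_length(content_len: usize) -> Vec<u8> {
    for n in 1..=4usize {
        let total = content_len + n;
        // 1 byte holds 6 bits; n bytes hold 4 + 8 * (n - 1) bits.
        let max = if n == 1 { 0x3F } else { (1usize << (4 + 8 * (n - 1))) - 1 };
        if total > max {
            continue;
        }
        if n == 1 {
            return vec![total as u8];
        }
        // Lead byte: follow-byte count in bits 7-6, low nibble in bits 3-0.
        let mut out = vec![(((n - 1) as u8) << 6) | (total & 0x0F) as u8];
        let mut rest = total >> 4;
        for _ in 1..n {
            out.push((rest & 0xFF) as u8);
            rest >>= 8;
        }
        return out;
    }
    panic!("package too large for PkgLength");
}

const SCOPE_OP: &[u8] = &[0x10];
const DEVICE_OP: &[u8] = &[0x5B, 0x82];

/// Hypothetical guard: reserves 4 length bytes on open, fixes them on drop.
struct Guard<'a> {
    buf: &'a mut Vec<u8>,
    len_at: usize, // offset of the reserved length bytes
}

impl<'a> Guard<'a> {
    fn open(buf: &'a mut Vec<u8>, opcode: &[u8], name: &[u8; 4]) -> Self {
        buf.extend_from_slice(opcode);
        let len_at = buf.len();
        buf.extend_from_slice(&[0; 4]);
        buf.extend_from_slice(name);
        Guard { buf, len_at }
    }

    /// The child borrows the parent mutably, so the parent cannot be
    /// closed or written to while the child is still open.
    fn device(&mut self, name: &[u8; 4]) -> Guard<'_> {
        Guard::open(&mut *self.buf, DEVICE_OP, name)
    }

    fn raw(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        let content_len = self.buf.len() - (self.len_at + 4);
        let encoded = encode_pkg_length(content_len);
        // Replace the 4 reserved bytes with the 1-4 byte encoding.
        drop(self.buf.splice(self.len_at..self.len_at + 4, encoded));
    }
}

/// Scope(_SB_) { Device(PCI0) { optionally Name(_UID, Zero) } }
fn demo(with_uid: bool) -> Vec<u8> {
    let mut buf = Vec::new();
    {
        let mut sb = Guard::open(&mut buf, SCOPE_OP, b"_SB_");
        let mut dev = sb.device(b"PCI0");
        if with_uid {
            // NameOp, "_UID", ZeroOp
            dev.raw(&[0x08, b'_', b'U', b'I', b'D', 0x00]);
        }
    } // `dev` drops first, then `sb`, so inner lengths are final first
    buf
}
```

Because the child is always spliced before its parent measures itself, the parent's recorded offset stays valid and each length is computed exactly once. The conditional in demo is an ordinary if, matching the point made above about conditional content.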
I'd be open to new ideas or going with another approach mentioned above as well :)
DsdtGenerator Trait
The DsdtGenerator trait introduced in Generate DSDT with PCIe host bridge enables device emulators to contribute their ACPI descriptions to the DSDT. Devices can expose themselves in the DSDT by implementing the DsdtGenerator trait and wiring it up through Lifecycle::as_dsdt_generator().
The trait has two methods:
dsdt_scope() - returns SystemBus or PciRoot depending on where the device belongs in the ACPI namespace
generate_dsdt() - receives a ScopeGuard and emits the device's AML (HID, resources, methods, etc.)
During DSDT construction, build_dsdt_aml() iterates over all registered generators and invokes them within the appropriate scope.
See Add PS/2 controller in DSDT and Add Qemu pvpanic device to DSDT for a minimal example. The Qemu pvpanic device declares device PEVT with HID QEMU0001 and a single I/O port resource. The pattern is the same for LpcUart and the PS/2 controller.
This keeps the ACPI bits of a device co-located with the device implementation rather than requiring a central place that knows about every device's resources, and lets each device own its ACPI description.
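The shape of this trait and its dispatch loop can be sketched as follows. Everything here is a simplified stand-in: the method signatures are guesses based on the description above, ScopeGuard records ASL-like text instead of AML bytes, DemoPvPanic and build_dsdt_text are hypothetical, and placing PEVT under SystemBus with port 0x505 is an assumption made only for the example.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DsdtScope {
    SystemBus,
    PciRoot,
}

/// Stand-in for the series' ScopeGuard: collects ASL-like lines.
struct ScopeGuard<'a> {
    out: &'a mut Vec<String>,
}

impl ScopeGuard<'_> {
    fn emit(&mut self, line: impl Into<String>) {
        self.out.push(line.into());
    }
}

/// Assumed shape of the trait; the real signatures may differ.
trait DsdtGenerator {
    fn dsdt_scope(&self) -> DsdtScope;
    fn generate_dsdt(&self, scope: &mut ScopeGuard<'_>);
}

/// Hypothetical device modeled on the pvpanic description above.
struct DemoPvPanic {
    port: u16,
}

impl DsdtGenerator for DemoPvPanic {
    fn dsdt_scope(&self) -> DsdtScope {
        DsdtScope::SystemBus // assumption for this sketch
    }

    fn generate_dsdt(&self, scope: &mut ScopeGuard<'_>) {
        scope.emit("Device (PEVT)");
        scope.emit("  Name (_HID, \"QEMU0001\")");
        scope.emit(format!("  IO (0x{:04X}, length 1)", self.port));
    }
}

/// Visit each scope once and let matching generators write into it.
fn build_dsdt_text(generators: &[&dyn DsdtGenerator]) -> Vec<String> {
    let mut out = Vec::new();
    let scopes = [
        (DsdtScope::SystemBus, "Scope (\\_SB)"),
        (DsdtScope::PciRoot, "Scope (\\_SB.PCI0)"),
    ];
    for (scope, header) in scopes {
        out.push(header.to_string());
        let mut guard = ScopeGuard { out: &mut out };
        for g in generators.iter().filter(|g| g.dsdt_scope() == scope) {
            g.generate_dsdt(&mut guard);
        }
    }
    out
}
```

The central builder only needs to know the two scopes; anything device-specific (names, IDs, ports) stays inside the device's own impl block.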
Wiring up new tables
The new table generation is controlled by a native_acpi_tables flag in the Board spec. Newly launched VMs have this set to true and get the new generated tables via fw_cfg. VMs migrating from older propolis versions won't have this field in their spec, so it defaults to false and they keep using OVMF tables.
So existing VMs can safely migrate to the new propolis without any guest-visible changes to their ACPI tables. Only VMs launched with the new version of propolis will use the new tables.
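The defaulting rule can be stated as a tiny sketch. It models the spec field as an Option<bool>, where None stands for a spec written by an older propolis that has no such field. AcpiSource and acpi_source are hypothetical names, and the real spec type and its deserialization are not shown here.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum AcpiSource {
    Ovmf,   // firmware-provided static tables
    Native, // tables generated by propolis
}

/// `None` models a spec from an older propolis without the field.
fn acpi_source(native_acpi_tables: Option<bool>) -> AcpiSource {
    if native_acpi_tables.unwrap_or(false) {
        AcpiSource::Native
    } else {
        AcpiSource::Ovmf
    }
}
```

Keeping the absent state distinct from an explicit value until the decision point is one way to avoid losing the "unset" information when a spec round-trips between versions.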
Future scope
Being able to generate ACPI tables now opens up several opportunities
CPU hotplug
The MADT generation could be extended to support processor hotplug by including Processor Local APIC entries for potential CPUs along with corresponding _MAT methods and processor container devices in the DSDT.
Memory hotplug
Memory hotplug requires adding memory device objects under the _SB scope with a _HID of PNP0C80. Each memory region would need _CRS, _STA and _EJ0 methods. Propolis could signal memory add/remove via ACPI notifications. This would enable dynamic memory ballooning and live resizing of guest RAM.
PCIe Native Hotplug
The _OSC method added in this series can be easily extended to report support for PCIe capabilities to the guest. Once propolis implements the hotplug controller logic, native PCIe hotplug can be enabled by updating the _OSC return value and adding the necessary _EJ0 and notification methods.
PCIe topology emulation
With DSDT generation, PCIe topologies with multiple host bridges and PCIe switches can be properly described to the guest. This would involve adding additional _BBN methods and extending the PCI routing tables for downstream ports. This would increase the number of devices that can be attached to a guest.
NUMA topology
For guests that benefit from memory locality awareness, SRAT and SLIT tables can be added following the same pattern as other static tables.
Testing
This is the dmesg of Linux when using the new tables. Now the standard OVMF bootrom can be used.
GlobalLock is not supported by propolis yet, so the warning appears with OVMF tables as well.
OVMF ACPI table dump
New ACPI table dump
TODO: