From 9d6033575f5a6129e2150e86d91cf32cbb2962ed Mon Sep 17 00:00:00 2001 From: rusty Date: Fri, 16 Aug 2013 03:17:18 +0000 Subject: Reworked spec into non-PCI order. Issue: https://tools.oasis-open.org/issues/browse/VIRTIO-1 Signed-off-by: Rusty Russell git-svn-id: https://tools.oasis-open.org/version-control/svn/virtio@2 0c8fb4dd-22a2-4bb5-bc14-6c75a5f43652 --- virtio-spec.txt | 1849 +++++++++++++++++++++++++++++++------------------------ 1 file changed, 1030 insertions(+), 819 deletions(-) diff --git a/virtio-spec.txt b/virtio-spec.txt index dcf3918..6a3860e 100644 --- a/virtio-spec.txt +++ b/virtio-spec.txt @@ -1,29 +1,31 @@ -This document describes the specifications of the “virtio” family -of PCI devices. These are devices -are found in virtual environments, -yet by design they are not all that different from physical PCI -devices, and this document treats them as such. This allows the -guest to use standard PCI drivers and discovery mechanisms. +1. INTRODUCTION +=============== + +This document describes the specifications of the “virtio” family of +devices. These devices are found in virtual environments, yet by +design they are not all that different from physical devices, and this +document treats them as such. This allows the guest to use standard +drivers and discovery mechanisms. The purpose of virtio and this specification is that virtual environments and guests should have a straightforward, efficient, standard and extensible mechanism for virtual devices, rather than boutique per-environment or per-OS mechanisms. - Straightforward: Virtio PCI devices use normal PCI mechanisms - of interrupts and DMA which should be familiar to any device - driver author. There is no exotic page-flipping or COW - mechanism: it's just a PCI device.[1] + Straightforward: Virtio devices use normal bus mechanisms of + interrupts and DMA which should be familiar to any device driver + author.
There is no exotic page-flipping or COW mechanism: it's just + a normal device.[1] - Efficient: Virtio PCI devices consist of rings of descriptors + Efficient: Virtio devices consist of rings of descriptors for input and output, which are neatly separated to avoid cache effects from both guest and device writing to the same cache lines. - Standard: Virtio PCI makes no assumptions about the environment - in which it operates, beyond supporting PCI. In fact the virtio - devices specified in the appendices do not require PCI at all: - they have been implemented on non-PCI buses.[2] + Standard: Virtio makes no assumptions about the environment in which + it operates, beyond supporting the bus attaching the device. Virtio + devices are implemented over PCI and other buses, and earlier drafts + been implemented on other buses not included in this spec.[2] Extensible: Virtio PCI devices contain feature bits which are acknowledged by the guest operating system during device setup. @@ -31,170 +33,69 @@ than boutique per-environment or per-OS mechanisms. offers all the features it knows about, and the driver acknowledges those it understands and wishes to use. -1.1 Virtqueues - -The mechanism for bulk data transport on virtio PCI devices is -pretentiously called a virtqueue. Each device can have zero or -more virtqueues: for example, the network device has one for -transmit and one for receive. - -Each virtqueue occupies two or more physically-contiguous pages -(defined, for the purposes of this specification, as 4096 bytes), -and consists of three parts: - - -+-------------------+-----------------------------------+-----------+ -| Descriptor Table | Available Ring (padding) | Used Ring | -+-------------------+-----------------------------------+-----------+ - - -When the driver wants to send a buffer to the device, it fills in -a slot in the descriptor table (or chains several together), and -writes the descriptor index into the available ring. It then -notifies the device. 
When the device has finished a buffer, it -writes the descriptor into the used ring, and sends an interrupt. - -Specification - -2.1 PCI Discovery - -Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through -0x103F inclusive is a virtio device[3]. The device must also have a -Revision ID of 0 to match this specification. - -The Subsystem Device ID indicates which virtio device is -supported by the device. The Subsystem Vendor ID should reflect -the PCI Vendor ID of the environment (it's currently only used -for informational purposes by the guest). - - -+----------------------+--------------------+---------------+ -| Subsystem Device ID | Virtio Device | Specification | -+----------------------+--------------------+---------------+ -+----------------------+--------------------+---------------+ -| 1 | network card | Appendix C | -+----------------------+--------------------+---------------+ -| 2 | block device | Appendix D | -+----------------------+--------------------+---------------+ -| 3 | console | Appendix E | -+----------------------+--------------------+---------------+ -| 4 | entropy source | Appendix F | -+----------------------+--------------------+---------------+ -| 5 | memory ballooning | Appendix G | -+----------------------+--------------------+---------------+ -| 6 | ioMemory | - | -+----------------------+--------------------+---------------+ -| 7 | rpmsg | - | -+----------------------+--------------------+---------------+ -| 8 | SCSI host | Appendix I | -+----------------------+--------------------+---------------+ -| 9 | 9P transport | - | -+----------------------+--------------------+---------------+ -| 10 | mac80211 wlan | - | -+----------------------+--------------------+---------------+ - - -2.2 Device Configuration - -To configure the device, we use the first I/O region of the PCI -device. This contains a virtio header followed by a -device-specific region. 
- -There may be different widths of accesses to the I/O region; the -“natural” access method for each field in the virtio header must be -used (i.e. 32-bit accesses for 32-bit fields, etc), but the -device-specific region can be accessed using any width accesses, and -should obtain the same results. - -Note that this is possible because while the virtio header is PCI -(i.e. little) endian, the device-specific region is encoded in -the native endian of the guest (where such distinction is -applicable). - -2.2.1 Device Initialization Sequence - -We start with an overview of device initialization, then expand -on the details of the device and how each step is preformed. - -1. Reset the device. This is not required on initial start up. - -2. The ACKNOWLEDGE status bit is set: we have noticed the device. - -3. The DRIVER status bit is set: we know how to drive the device. - -4. Device-specific setup, including reading the Device Feature - Bits, discovery of virtqueues for the device, optional MSI-X - setup, and reading and possibly writing the virtio - configuration space. - -5. The subset of Device Feature Bits understood by the driver is - written to the device. +1.1.1. Key words +----------------- -6. The DRIVER_OK status bit is set. +The key words must, must not, required, shall, shall not, should, +should not, recommended, may, and optional are to be interpreted as +described in [RFC 2119]. Note that for reasons of style, these words +are not capitalized in this document. -7. The device can now be used (ie. buffers added to the - virtqueues)[4] +1.1.2. Definitions +------------------- -If any of these steps go irrecoverably wrong, the guest should -set the FAILED status bit to indicate that it has given up on the -device (it can reset the device later to restart if desired). +term + Definition -We now cover the fields required for general setup in detail. +1.1.3. Key concepts +-------------------- -2.2.2 Virtio Header +Guest + Definition... 
-The virtio header looks as follows: +Host + Definition +Device + Definition -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Bits || 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Read/Write || R | R+W | R+W | R | R+W | R+W | R+W | R | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -| Purpose || Device | Guest | Queue | Queue | Queue | Queue | Device | ISR | -| || Features bits 0:31 | Features bits 0:31 | Address | Size | Select | Notify | Status | Status | -+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ +Driver + Definition +1.2. Normative References +========================= -If MSI-X is enabled for the device, two additional fields -immediately follow this header:[5] +[RFC 2119] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, http://www.ietf.org/rfc/rfc2119.txt IETF (Internet Engineering Task Force) RFC 2119, March 1997. +1.3. 
Non-Normative References +========================= -+------------++----------------+--------+ -| Bits || 16 | 16 | - +----------------+--------+ -+------------++----------------+--------+ -| Read/Write || R+W | R+W | -+------------++----------------+--------+ -| Purpose || Configuration | Queue | -| (MSI-X) || Vector | Vector | -+------------++----------------+--------+ -Immediately following these general headers, there may be -device-specific headers: +2 The Virtio Standard +========================= +2.1 Basic Facilities of a Virtio Device +======================================= -+------------++--------------------+ -| Bits || Device Specific | - +--------------------+ -+------------++--------------------+ -| Read/Write || Device Specific | -+------------++--------------------+ -| Purpose || Device Specific... | -| || | -+------------++--------------------+ +A virtio device is discovered and identified by a bus-specific method +(see the bus specific sections *XREF*). Each device consists of the following +parts: +o Device Status field +o Feature bits +o Configuration space +o One or more virtqueues -2.2.2.1 Device Status +2.1.1 Device Status Field +------------------------- The Device Status field is updated by the guest to indicate its progress. This provides a simple low-level diagnostic: it's most useful to imagine them hooked up to traffic lights on the console indicating the status of each device. -The device can be reset by writing a 0 to this field, otherwise -at least one bit should be set: +This field is 0 upon reset, otherwise at least one bit should be set: ACKNOWLEDGE (1) Indicates that the guest OS has found the device and recognized it as a valid virtio device. @@ -213,105 +114,68 @@ at least one bit should be set: even a fatal error during device operation. The device must be reset before attempting to re-initialize. -2.2.2.2 Feature Bits +2.1.2 Feature Bits +------------------ + +Each virtio device lists all the features it understands. 
During +device initialization, the guest reads this and tells the device the +subset that it understands. The only way to renegotiate is to reset +the device. -The first configuration field indicates the features that the -device supports. The bits are allocated as follows: +This allows for forwards and backwards compatibility: if the device is +enhanced with a new feature bit, older guests will not write that +feature bit back to the device and it can go into backwards +compatibility mode. Similarly, if a guest is enhanced with a feature +that the device doesn't support, it sees the new feature is not offered +and can go into backwards compatibility mode (or, for poor +implementations, set the FAILED Device Status bit). + +Feature bits are allocated as follows: - 0 to 23 Feature bits for the specific device type + 0 to 23: Feature bits for the specific device type - 24 to 32 Feature bits reserved for extensions to the queue and + 24 to 32: Feature bits reserved for extensions to the queue and feature negotiation mechanisms For example, feature bit 0 for a network device (i.e. Subsystem Device ID 1) indicates that the device supports checksumming of packets. -The feature bits are negotiated: the device lists all the -features it understands in the Device Features field, and the -guest writes the subset that it understands into the Guest -Features field. The only way to renegotiate is to reset the -device. - -In particular, new fields in the device configuration header are +In particular, new fields in the device configuration space are indicated by offering a feature bit, so the guest can check before accessing that part of the configuration space. -This allows for forwards and backwards compatibility: if the -device is enhanced with a new feature bit, older guests will not -write that feature bit back to the Guest Features field and it -can go into backwards compatibility mode.
Similarly, if a guest -is enhanced with a feature that the device doesn't support, it -will not see that feature bit in the Device Features field and -can go into backwards compatibility mode (or, for poor -implementations, set the FAILED Device Status bit). - -2.2.2.3 Configuration/Queue Vectors - -When MSI-X capability is present and enabled in the device -(through standard PCI configuration space) 4 bytes at byte offset -20 are used to map configuration change and queue interrupts to -MSI-X vectors. In this case, the ISR Status field is unused, and -device specific configuration starts at byte offset 24 in virtio -header structure. When MSI-X capability is not enabled, device -specific configuration starts at byte offset 20 in virtio header. - -Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of -Configuration/Queue Vector registers, maps interrupts triggered -by the configuration change/selected queue events respectively to -the corresponding MSI-X vector. To disable interrupts for a -specific event type, unmap it by writing a special NO_VECTOR -value: - -/* Vector value used to disable MSI for queue */ - -#define VIRTIO_MSI_NO_VECTOR 0xffff - -Reading these registers returns vector mapped to a given event, -or NO_VECTOR if unmapped. All queue and configuration change -events are unmapped by default. - -Note that mapping an event to vector might require allocating -internal device resources, and might fail. Devices report such -failures by returning the NO_VECTOR value when the relevant -Vector field is read. After mapping an event to vector, the -driver must verify success by reading the Vector field value: on -success, the previously written value is returned, and on -failure, NO_VECTOR is returned. If a mapping failure is detected, -the driver can retry mapping with fewervectors, or disable MSI-X. 
- -2.3 Virtqueue Configuration - -As a device can have zero or more virtqueues for bulk data -transport (for example, the network driver has two), the driver -needs to configure them as part of the device-specific -configuration. +2.1.3 Configuration Space +------------------------- -This is done as follows, for each virtqueue a device has: +Configuration space is generally used for rarely-changing or +initialization-time parameters. -1. Write the virtqueue index (first queue is 0) to the Queue - Select field. +Note that this space is generally the guest's native endian, +rather than PCI's little-endian. -2. Read the virtqueue size from the Queue Size field, which is - always a power of 2. This controls how big the virtqueue is - (see below). If this field is 0, the virtqueue does not exist. +2.1.4 Virtqueues +---------------- -3. Allocate and zero virtqueue in contiguous physical memory, on - a 4096 byte alignment. Write the physical address, divided by - 4096 to the Queue Address field.[6] +The mechanism for bulk data transport on virtio devices is +pretentiously called a virtqueue. Each device can have zero or more +virtqueues: for example, the simplest network device has one for +transmit and one for receive. Each queue has a 16-bit queue size +parameter, which sets the number of entries and implies the total size +of the queue. -4. Optionally, if MSI-X capability is present and enabled on the - device, select a vector to use to request interrupts triggered - by virtqueue events. Write the MSI-X Table entry number - corresponding to this vector in Queue Vector field. Read the - Queue Vector field: on success, previously written value is - returned; on failure, NO_VECTOR value is returned. 
+Each virtqueue occupies two or more physically-contiguous pages +(usually defined as 4096 bytes, but depending on the transport) +and consists of three parts: -The Queue Size field controls the total number of bytes required -for the virtqueue according to the following formula: ++-------------------+-----------------------------------+-----------+ +| Descriptor Table | Available Ring (padding) | Used Ring | ++-------------------+-----------------------------------+-----------+ - #define ALIGN(x) (((x) + 4095) & ~4095) +The bus-specific Queue Size field controls the total number of bytes +required for the virtqueue according to the following formula: + #define ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)) static inline unsigned vring_size(unsigned int qsz) { return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2 + qsz)) @@ -319,34 +183,53 @@ for the virtqueue according to the following formula: } This currently wastes some space with padding, but also allows -future extensions. The virtqueue layout structure looks like this -(qsz is the Queue Size field, which is a variable, so this code -won't compile): +future extensions. The virtqueue layout structure looks like this: struct vring { - /* The actual descriptors (16 bytes each) */ - struct vring_desc desc[qsz]; + // The actual descriptors (16 bytes each) + struct vring_desc desc[ Queue Size ]; - /* A ring of available descriptor heads with free-running index. */ + // A ring of available descriptor heads with free-running index. struct vring_avail avail; - // Padding to the next 4096 boundary. - char pad[]; + // Padding to the next PAGE_SIZE boundary. + char pad[ Padding ]; // A ring of used descriptor heads with free-running index. struct vring_used used; }; -2.3.1 A Note on Virtqueue Endianness +When the driver wants to send a buffer to the device, it fills in +a slot in the descriptor table (or chains several together), and +writes the descriptor index into the available ring. It then +notifies the device.
When the device has finished a buffer, it +writes the descriptor into the used ring, and sends an interrupt. + +2.1.4.1 A Note on Virtqueue Endianness +-------------------------------------- -Note that the endian of these fields and everything else in the -virtqueue is the native endian of the guest, not little-endian as -PCI normally is. This makes for simpler guest code, and it is -assumed that the host already has to be deeply aware of the guest -endian so such an “endian-aware” device is not a significant -issue. +Note that the endian of the fields in the virtqueue is the native +endian of the guest, not little-endian as PCI normally is. This makes +for simpler guest code, and it is assumed that the host already has to +be deeply aware of the guest endian so such an “endian-aware” device +is not a significant issue. -2.3.2 Descriptor Table +2.1.4.2 Message Framing +----------------------- + +The descriptors used for a buffer should not affect the semantics +of the message, except for the total length of the buffer. For +example, a network buffer consists of a 10 byte header followed +by the network packet. Whether this is presented in the ring +descriptor chain as (say) a 10 byte buffer and a 1514 byte +buffer, or a single 1524 byte buffer, or even three buffers, +should have no effect. + +In particular, no implementation should use the descriptor +boundaries to determine the size of any header in a request.[10] + +2.1.4.3 The Virtqueue Descriptor Table +-------------------------------------- The descriptor table refers to the buffers the guest is using for the device. The addresses are physical addresses, and the buffers @@ -374,17 +257,18 @@ No descriptor chain may be more than 2^32 bytes long in total. u16 next; }; -The number of descriptors in the table is specified by the Queue -Size field for this virtqueue. +The number of descriptors in the table is defined by the queue size +for this virtqueue.
-2.3.3 Indirect Descriptors +2.1.4.3.1 Indirect Descriptors +------------------------------ Some devices benefit by concurrently dispatching a large number of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be -used to allow this (see Appendix B: Reserved Feature Bits). To increase +used to allow this (see FIXME: Reserved Feature Bits). To increase ring capacity it is possible to store a table of indirect descriptors anywhere in memory, and insert a descriptor in main -virtqueue (with flags&INDIRECT on) that refers to memory buffer +virtqueue (with flags&VRING_DESC_F_INDIRECT on) that refers to memory buffer containing this indirect descriptor table; fields addr and len refer to the indirect table address and length in bytes, respectively. The indirect table layout structure looks like this @@ -399,40 +283,42 @@ which is a variable, so this code won't compile): The first indirect descriptor is located at start of the indirect descriptor table (index 0), additional indirect descriptors are chained by next field. An indirect descriptor without next field -(with flags&NEXT off) signals the end of the indirect descriptor +(with flags&VRING_DESC_F_NEXT off) signals the end of the indirect descriptor table, and transfers control back to the main virtqueue. An indirect descriptor can not refer to another indirect descriptor -table (flags&INDIRECT must be off). A single indirect descriptor +table (flags&VRING_DESC_F_INDIRECT must be off). A single indirect descriptor table can include both read-only and write-only descriptors; -write-only flag (flags&WRITE) in the descriptor that refers to it +write-only flag (flags&VRING_DESC_F_WRITE) in the descriptor that refers to it is ignored. -2.3.4 Available Ring +2.1.4.4 The Virtqueue Available Ring +------------------------------------ The available ring refers to what descriptors we are offering the device: it refers to the head of a descriptor chain. 
The “flags” field is currently 0 or 1: 1 indicating that we do not need an interrupt when the device consumes a descriptor from the available ring. Alternatively, the guest can ask the device to delay interrupts -until an entry with an index specified by the “ used_event” field is +until an entry with an index specified by the “used_event” field is written in the used ring (equivalently, until the idx field in the used ring will reach the value used_event + 1). The method employed by the device is controlled by the VIRTIO_RING_F_EVENT_IDX feature bit -(see Appendix B: Reserved Feature Bits). This interrupt suppression is +(see FIXME: Reserved Feature Bits). This interrupt suppression is merely an optimization; it may not suppress interrupts entirely. The “idx” field indicates where we would put the next descriptor -entry (modulo the ring size). This starts at 0, and increases. +entry (modulo the queue size). This starts at 0, and increases. struct vring_avail { #define VRING_AVAIL_F_NO_INTERRUPT 1 u16 flags; u16 idx; - u16 ring[qsz]; /* qsz is the Queue Size field read from device */ + u16 ring[ /* Queue Size */ ]; u16 used_event; }; -2.3.5 Used Ring +2.1.4.5 The Virtqueue Used Ring +------------------------------- The used ring is where the device returns buffers once it is done with them. The flags field can be used by the device to hint that @@ -443,7 +329,7 @@ with an index specified by the “avail_event” is written in the available ring (equivalently, until the idx field in the available ring will reach the value avail_event + 1). The method employed by the device is controlled by the guest through the -VIRTIO_RING_F_EVENT_IDX feature bit (see Appendix B: Reserved +VIRTIO_RING_F_EVENT_IDX feature bit (see FIXME: Reserved Feature Bits).[7] Each entry in the ring is a pair: the head entry of the @@ -466,31 +352,68 @@ the buffer to ensure no data leakage occurs. 
#define VRING_USED_F_NO_NOTIFY 1 u16 flags; u16 idx; - struct vring_used_elem ring[qsz]; + struct vring_used_elem ring[ /* Queue Size */]; u16 avail_event; }; -2.3.6 Helpers for Managing Virtqueues +2.1.4.6 Helpers for Operating Virtqueues +---------------------------------------- The Linux Kernel Source code contains the definitions above and helper routines in a more usable form, in include/linux/virtio_ring.h. This was explicitly licensed by IBM and Red Hat under the (3-clause) BSD license so that it can be freely used by all other projects, and is reproduced (with slight -variation to remove Linux assumptions) in Appendix A. +variation to remove Linux assumptions) in *XREF*. + +2.2 General Initialization And Device Operation +=============================================== + +We start with an overview of device initialization, then expand on the +details of the device and how each step is performed. This section +should be read along with the bus-specific section which describes +how to communicate with the specific device. + +2.2.1 Device Initialization +--------------------------- + +1. Reset the device. This is not required on initial start up. + +2. The ACKNOWLEDGE status bit is set: we have noticed the device. + +3. The DRIVER status bit is set: we know how to drive the device. + +4. Device-specific setup, including reading the device feature + bits, discovery of virtqueues for the device, optional per-bus + setup, and reading and possibly writing the device's virtio + configuration space. + +5. The subset of device feature bits understood by the driver is + written to the device. + +6. The DRIVER_OK status bit is set. + +7. The device can now be used (ie. buffers added to the + virtqueues)[4] + +If any of these steps go irrecoverably wrong, the guest should +set the FAILED status bit to indicate that it has given up on the +device (it can reset the device later to restart if desired).
-2.4 Device Operation +2.2.2 Device Operation +---------------------- There are two parts to device operation: supplying new buffers to the device, and processing used buffers from the device. As an -example, the virtio network device has two virtqueues: the +example, the simplest virtio network device has two virtqueues: the transmit virtqueue and the receive virtqueue. The driver adds outgoing (read-only) packets to the transmit virtqueue, and then frees them after they are used. Similarly, incoming (write-only) buffers are added to the receive virtqueue, and processed after they are used. -2.4.1 Supplying Buffers to The Device +2.2.2.1 Supplying Buffers to The Device +--------------------------------------- Actual transfer of buffers from the guest OS to the device operates as follows: @@ -531,14 +454,15 @@ distinguish between a full and empty buffer. Here is a description of each stage in more detail. -2.4.1.1 Placing Buffers Into The Descriptor Table +2.2.2.1.1 Placing Buffers Into The Descriptor Table +--------------------------------------------------- A buffer consists of zero or more read-only physically-contiguous elements followed by zero or more physically-contiguous write-only elements (it must have at least one element). This algorithm maps it into the descriptor table: -1. for each buffer element, b: +for each buffer element, b: (a) Get the next free descriptor table entry, d @@ -560,7 +484,8 @@ In practice, the d.next fields are usually used to chain free descriptors, and a separate count kept to check there are enough free descriptors before beginning the mappings. -2.4.1.2 Updating The Available Ring +2.2.2.1.2 Updating The Available Ring +------------------------------------- The head of the buffer we mapped is the first d in the algorithm above. 
A naive implementation would do the following: @@ -573,44 +498,47 @@ device), so we keep a counter of how many we've added: avail->ring[(avail->idx + added++) % qsz] = head; -2.4.1.3 Updating The Index Field +2.2.2.1.3 Updating The Index Field +---------------------------------- -Once the idx field of the virtqueue is updated, the device will +Once the index field of the virtqueue is updated, the device will be able to access the descriptor entries we've created and the memory they refer to. This is why a memory barrier is generally -used before the idx update, to ensure it sees the most up-to-date +used before the index update, to ensure it sees the most up-to-date copy. -The idx field always increments, and we let it wrap naturally at +The index field always increments, and we let it wrap naturally at 65536: avail->idx += added; -2.4.1.4 Notifying The Device +2.2.2.1.4 Notifying The Device +------------------------------ -Device notification occurs by writing the 16-bit virtqueue index -of this virtqueue to the Queue Notify field of the virtio header -in the first I/O region of the PCI device. This can be expensive, -however, so the device can suppress such notifications if it -doesn't need them. We have to be careful to expose the new idx -value before checking the suppression flag: it's OK to notify +The actual method of device notification is bus-specific, but generally +it can be expensive. So the device can suppress such notifications if it +doesn't need them. We have to be careful to expose the new index +value before checking if notifications are suppressed: it's OK to notify gratuitously, but not to omit a required notification. So again, we use a memory barrier here before reading the flags or the avail_event field. -If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated, and if -the VRING_USED_F_NOTIFY flag is not set, we go ahead and write to -the PCI configuration space. 
+If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated, and if the +VRING_USED_F_NO_NOTIFY flag is not set, we go ahead and notify the +device. If the VIRTIO_F_RING_EVENT_IDX feature is negotiated, we read the avail_event field in the available ring structure. If the available index crossed_the avail_event field value since the last notification, we go ahead and write to the PCI configuration -space. The avail_event field wraps naturally at 65536 as well: +space. The avail_event field wraps naturally at 65536 as well, +giving the following algorithm for calculating whether a device needs +notification: (u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx) -2.4.2 Receiving Used Buffers From The Device +2.2.2.2 Receiving Used Buffers From The Device +---------------------------------------------- Once the device has used a buffer (read from or written to it, or parts of both, depending on the nature of the virtqueue and the @@ -621,13 +549,13 @@ buffer: 1. Write the head descriptor number to the next field in the used ring. -2. Update the used ring idx. +2. Update the used ring index. -3. Determine whether an interrupt is necessary: +3. Deliver an interrupt if necessary: (a) If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated: - check if f the VRING_AVAIL_F_NO_INTERRUPT flag is not set in - avail->flags + check if the VRING_AVAIL_F_NO_INTERRUPT flag is not set in + avail->flags. (b) If the VIRTIO_F_RING_EVENT_IDX feature is negotiated: check whether the used index crossed the used_event field value @@ -635,44 +563,13 @@ buffer: at 65536 as well: (u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx) -4. If an interrupt is necessary: - - (a) If MSI-X capability is disabled: - - i. Set the lower bit of the ISR Status field for the device. - - ii. Send the appropriate PCI interrupt for the device. - - (b) If MSI-X capability is enabled: - - i.
Request the appropriate MSI-X interrupt message for the - device, Queue Vector field sets the MSI-X Table entry - number. - - ii. If Queue Vector field value is NO_VECTOR, no interrupt - message is requested for this event. - -The guest interrupt handler should: - -1. If MSI-X capability is disabled: read the ISR Status field, - which will reset it to zero. If the lower bit is zero, the - interrupt was not for this device. Otherwise, the guest driver - should look through the used rings of each virtqueue for the - device, to see if any progress has been made by the device - which requires servicing. - -2. If MSI-X capability is enabled: look through the used rings of - each virtqueue mapped to the specific MSI-X vector for the - device, to see if any progress has been made by the device - which requires servicing. - For each ring, guest should then disable interrupts by writing VRING_AVAIL_F_NO_INTERRUPT flag in avail structure, if required. It can then process used ring entries finally enabling interrupts by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or updating the -EVENT_IDX field in the available structure, Guest should then +EVENT_IDX field in the available structure. The guest should then execute a memory barrier, and then recheck the ring empty -condition. This is necessary to handle the case where, after the +condition. This is necessary to handle the case where after the last check and before enabling interrupts, an interrupt has been suppressed by the device: @@ -692,279 +589,507 @@ suppressed by the device: vq->last_seen_used++; } -2.4.3 Dealing With Configuration Changes - -Some virtio PCI devices can change the device configuration -state, as reflected in the virtio header in the PCI configuration -space. In this case: +2.2.2.3 Notification of Device Configuration Changes +---------------------------------------------------- -1. 
If MSI-X capability is disabled: an interrupt is delivered and - the second highest bit is set in the ISR Status field to - indicate that the driver should re-examine the configuration - space. Note that a single interrupt can indicate both that one - or more virtqueue has been used and that the configuration - space has changed: even if the config bit is set, virtqueues - must be scanned. +For devices where the configuration information can be changed, an +interrupt is delivered when a configuration change occurs. -2. If MSI-X capability is enabled: an interrupt message is - requested. The Configuration Vector field sets the MSI-X Table - entry number to use. If Configuration Vector field value is - NO_VECTOR, no interrupt message is requested for this event. -Creating New Device Types +2.4 Virtio Transport Options +============================ -Various considerations are necessary when creating a new device -type: +Virtio can use various different busses, thus the standard is split +into virtio general and bus-specific sections. - How Many Virtqueues? +2.4.1 Virtio Over PCI Bus +------------------------- -It is possible that a very simple device will operate entirely -through its configuration space, but most will need at least one -virtqueue in which it will place requests. A device with both -input and output (eg. console and network devices described here) -need two queues: one which the driver fills with buffers to -receive input, and one which the driver places buffers to -transmit output. +Virtio devices are commonly implemented as PCI devices. - What Configuration Space Layout? +2.4.1.1 PCI Device Discovery +---------------------------- -Configuration space is generally used for rarely-changing or -initialization-time parameters. But it is a limited resource, so -it might be better to use a virtqueue to update configuration -information (the network device does this for filtering, -otherwise the table in the config space could potentially be very -large). 
+Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through +0x103F inclusive is a virtio device[3]. The device must also have a +Revision ID of 0 to match this specification. -Note that this space is generally the guest's native endian, -rather than PCI's little-endian. +The Subsystem Device ID indicates which virtio device is +supported by the device. The Subsystem Vendor ID should reflect +the PCI Vendor ID of the environment (it's currently only used +for informational purposes by the guest). - What Device Number? +2.4.1.2 PCI Device Layout +------------------------- -Currently device numbers are assigned quite freely: a simple -request mail to the author of this document or the Linux -virtualization mailing list[9] will be sufficient to secure a unique one. +To configure the device, we use the first I/O region of the PCI +device. This contains a virtio header followed by a +device-specific region. -Meanwhile for experimental drivers, use 65535 and work backwards. +There may be different widths of accesses to the I/O region; the +“natural” access method for each field in the virtio header must be +used (i.e. 32-bit accesses for 32-bit fields, etc), but the +device-specific region can be accessed using any width accesses, and +should obtain the same results. - How many MSI-X vectors? +Note that this is possible because while the virtio header is PCI +(i.e. little) endian, the device-specific region is encoded in +the native endian of the guest (where such distinction is +applicable). -Using the optional MSI-X capability devices can speed up -interrupt processing by removing the need to read ISR Status -register by guest driver (which might be an expensive operation), -reducing interrupt sharing between devices and queues within the -device, and handling interrupts from multiple CPUs. However, some -systems impose a limit (which might be as low as 256) on the -total number of MSI-X vectors that can be allocated to all -devices. 
Devices and/or device drivers should take this into -account, limiting the number of vectors used unless the device is -expected to cause a high volume of interrupts. Devices can -control the number of vectors used by limiting the MSI-X Table -Size or not presenting MSI-X capability in PCI configuration -space. Drivers can control this by mapping events to as small -number of vectors as possible, or disabling MSI-X capability -altogether. +2.4.1.2.1 PCI Device Virtio Header +---------------------------------- - Message Framing +The virtio header looks as follows: -The descriptors used for a buffer should not effect the semantics -of the message, except for the total length of the buffer. For -example, a network buffer consists of a 10 byte header followed -by the network packet. Whether this is presented in the ring -descriptor chain as (say) a 10 byte buffer and a 1514 byte -buffer, or a single 1524 byte buffer, or even three buffers, -should have no effect. ++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ +| Bits || 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 | ++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ +| Read/Write || R | R+W | R+W | R | R+W | R+W | R+W | R | ++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ +| Purpose || Device | Guest | Queue | Queue | Queue | Queue | Device | ISR | +| || Features bits 0:31 | Features bits 0:31 | Address | Size | Select | Notify | Status | Status | ++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -In particular, no implementation should use the descriptor -boundaries to determine the size of any header in a request.[10] - Device Improvements +If MSI-X is enabled for the device, two additional fields +immediately follow this header:[5] -Any change to 
configuration space, or new virtqueues, or -behavioural changes, should be indicated by negotiation of a new -feature bit. This establishes clarity[11] and avoids future expansion problems. -Clusters of functionality which are always implemented together -can use a single bit, but if one feature makes sense without the -others they should not be gratuitously grouped together to -conserve feature bits. We can always extend the spec when the -first person needs more than 24 feature bits for their device. ++------------++----------------+--------+ +| Bits || 16 | 16 | + +----------------+--------+ ++------------++----------------+--------+ +| Read/Write || R+W | R+W | ++------------++----------------+--------+ +| Purpose || Configuration | Queue | +| (MSI-X) || Vector | Vector | ++------------++----------------+--------+ +Immediately following these general headers, there may be +device-specific headers: ++------------++--------------------+ +| Bits || Device Specific | + +--------------------+ ++------------++--------------------+ +| Read/Write || Device Specific | ++------------++--------------------+ +| Purpose || Device Specific... | +| || | ++------------++--------------------+ +2.4.1.3 PCI-specific Initialization And Device Operation +-------------------------------------------------------- -Appendix A: virtio_ring.h +The page size for a virtqueue on a PCI virtio device is defined as +4096 bytes. -#ifndef VIRTIO_RING_H -#define VIRTIO_RING_H -/* An interface for efficient virtio implementation. - * - * This header is BSD licensed so anyone can use the definitions - * to implement compatible drivers/servers. - * - * Copyright 2007, 2009, IBM Corporation - * Copyright 2011, Red Hat, Inc - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions - * are met: - * 1. 
Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * 3. Neither the name of IBM nor the names of its contributors - * may be used to endorse or promote products derived from this software - * without specific prior written permission. - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE - * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF - * SUCH DAMAGE. - */ +2.4.1.3.1 Device Initialization +------------------------------- -/* This marks a buffer as continuing via the next field. */ -#define VRING_DESC_F_NEXT 1 -/* This marks a buffer as write-only (otherwise read-only). */ -#define VRING_DESC_F_WRITE 2 +2.4.1.3.1.1 Queue Vector Configuration +-------------------------------------- -/* The Host uses this in used->flags to advise the Guest: don't kick me - * when you add a buffer. It's unreliable, so it's simply an - * optimization. Guest will still kick if it's out of buffers. */ -#define VRING_USED_F_NO_NOTIFY 1 -/* The Guest uses this in avail->flags to advise the Host: don't - * interrupt me when you consume a buffer. 
It's unreliable, so it's - * simply an optimization. */ -#define VRING_AVAIL_F_NO_INTERRUPT 1 +When MSI-X capability is present and enabled in the device +(through standard PCI configuration space) 4 bytes at byte offset +20 are used to map configuration change and queue interrupts to +MSI-X vectors. In this case, the ISR Status field is unused, and +device specific configuration starts at byte offset 24 in virtio +header structure. When MSI-X capability is not enabled, device +specific configuration starts at byte offset 20 in virtio header. + +Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of +Configuration/Queue Vector registers, maps interrupts triggered +by the configuration change/selected queue events respectively to +the corresponding MSI-X vector. To disable interrupts for a +specific event type, unmap it by writing a special NO_VECTOR +value: + + /* Vector value used to disable MSI for queue */ + #define VIRTIO_MSI_NO_VECTOR 0xffff + +Reading these registers returns vector mapped to a given event, +or NO_VECTOR if unmapped. All queue and configuration change +events are unmapped by default. + +Note that mapping an event to vector might require allocating +internal device resources, and might fail. Devices report such +failures by returning the NO_VECTOR value when the relevant +Vector field is read. After mapping an event to vector, the +driver must verify success by reading the Vector field value: on +success, the previously written value is returned, and on +failure, NO_VECTOR is returned. If a mapping failure is detected, +the driver can retry mapping with fewer vectors, or disable MSI-X. + +2.4.1.3.1.2 Virtqueue Configuration +----------------------------------- + +As a device can have zero or more virtqueues for bulk data +transport (for example, the simplest network device has two), the driver +needs to configure them as part of the device-specific +configuration. + +This is done as follows, for each virtqueue a device has: + +1.
Write the virtqueue index (first queue is 0) to the Queue + Select field. + +2. Read the virtqueue size from the Queue Size field, which is + always a power of 2. This controls how big the virtqueue is + (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist. + +3. Allocate and zero virtqueue in contiguous physical memory, on + a 4096 byte alignment. Write the physical address, divided by + 4096 to the Queue Address field.[6] + +4. Optionally, if MSI-X capability is present and enabled on the + device, select a vector to use to request interrupts triggered + by virtqueue events. Write the MSI-X Table entry number + corresponding to this vector in Queue Vector field. Read the + Queue Vector field: on success, previously written value is + returned; on failure, NO_VECTOR value is returned. + +2.4.1.3.2 Notifying The Device +------------------------------ + +Device notification occurs by writing the 16-bit virtqueue index +of this virtqueue to the Queue Notify field of the virtio header +in the first I/O region of the PCI device. + +2.4.1.3.3 Receiving Used Buffers From The Device + +If an interrupt is necessary: + + (a) If MSI-X capability is disabled: + + i. Set the lower bit of the ISR Status field for the device. + + ii. Send the appropriate PCI interrupt for the device. + + (b) If MSI-X capability is enabled: + + i. Request the appropriate MSI-X interrupt message for the + device, Queue Vector field sets the MSI-X Table entry + number. + + ii. If Queue Vector field value is NO_VECTOR, no interrupt + message is requested for this event. + +The guest interrupt handler should: + +1. If MSI-X capability is disabled: read the ISR Status field, + which will reset it to zero. If the lower bit is zero, the + interrupt was not for this device. Otherwise, the guest driver + should look through the used rings of each virtqueue for the + device, to see if any progress has been made by the device + which requires servicing. + +2. 
If MSI-X capability is enabled: look through the used rings of + each virtqueue mapped to the specific MSI-X vector for the + device, to see if any progress has been made by the device + which requires servicing. + +2.4.1.3.4 Notification of Device Configuration Changes +------------------------------------------------------ + +Some virtio PCI devices can change the device configuration +state, as reflected in the virtio header in the PCI configuration +space. In this case: + +1. If MSI-X capability is disabled: an interrupt is delivered and + the second highest bit is set in the ISR Status field to + indicate that the driver should re-examine the configuration + space. Note that a single interrupt can indicate both that one + or more virtqueue has been used and that the configuration + space has changed: even if the config bit is set, virtqueues + must be scanned. + +2. If MSI-X capability is enabled: an interrupt message is + requested. The Configuration Vector field sets the MSI-X Table + entry number to use. If Configuration Vector field value is + NO_VECTOR, no interrupt message is requested for this event. + +2.4.2 Virtio Over MMIO +---------------------- + +Virtual environments without PCI support (a common situation in +embedded devices models) might use simple memory mapped device (“ +virtio-mmio”) instead of the PCI device. + +The memory mapped virtio device behaviour is based on the PCI +device specification. Therefore most of operations like device +initialization, queues configuration and buffer transfers are +nearly identical. Existing differences are described in the +following sections. + +2.4.2.1 MMIO Device Discovery +----------------------------- + +Unlike PCI, MMIO provides no generic device discovery. 
For systems using +a device-tree such as Linux's dtc or Open Firmware, the suggested format is: + + virtio_block@1e000 { + compatible = "virtio,mmio"; + reg = <0x1e000 0x100>; + interrupts = <42>; + } + +2.4.2.2 MMIO Device Layout +-------------------------- + +MMIO virtio devices provides a set of memory mapped control +registers, all 32 bits wide, followed by device-specific +configuration space. The following list presents their layout: + +• Offset from the device base address | Direction | Name + Description + +• 0x000 | R | MagicValue + “virt” string. + +• 0x004 | R | Version + Device version number. Currently must be 1. + +• 0x008 | R | DeviceID + Virtio Subsystem Device ID (ie. 1 for network card). + +• 0x00c | R | VendorID + Virtio Subsystem Vendor ID. + +• 0x010 | R | HostFeatures + Flags representing features the device supports. + Reading from this register returns 32 consecutive flag bits, + first bit depending on the last value written to + HostFeaturesSel register. Access to this register returns bits HostFeaturesSel*32 + + to (HostFeaturesSel*32)+31, eg. feature bits 0 to 31 if + HostFeaturesSel is set to 0 and features bits 32 to 63 if + HostFeaturesSel is set to 1. Also see [sub:Feature-Bits] + +• 0x014 | W | HostFeaturesSel + Device (Host) features word selection. + Writing to this register selects a set of 32 device feature bits + accessible by reading from HostFeatures register. Device driver + must write a value to the HostFeaturesSel register before + reading from the HostFeatures register. + +• 0x020 | W | GuestFeatures + Flags representing device features understood and activated by + the driver. + Writing to this register sets 32 consecutive flag bits, first + bit depending on the last value written to GuestFeaturesSel + register. Access to this register sets bits GuestFeaturesSel*32 + to (GuestFeaturesSel*32)+31, eg. feature bits 0 to 31 if + GuestFeaturesSel is set to 0 and features bits 32 to 63 if + GuestFeaturesSel is set to 1. 
Also see [sub:Feature-Bits] + +• 0x024 | W | GuestFeaturesSel + Activated (Guest) features word selection. + Writing to this register selects a set of 32 activated feature + bits accessible by writing to the GuestFeatures register. + Device driver must write a value to the GuestFeaturesSel + register before writing to the GuestFeatures register. + +• 0x028 | W | GuestPageSize + Guest page size. + Device driver must write the guest page size in bytes to the + register during initialization, before any queues are used. + This value must be a power of 2 and is used by the Host to + calculate Guest address of the first queue page (see QueuePFN). + +• 0x030 | W | QueueSel + Virtual queue index (first queue is 0). + Writing to this register selects the virtual queue that the + following operations on QueueNum, QueueAlign and QueuePFN apply + to. + +• 0x034 | R | QueueNumMax + Maximum virtual queue size. + Reading from the register returns the maximum size of the queue + the Host is ready to process or zero (0x0) if the queue is not + available. This applies to the queue selected by writing to + QueueSel and is allowed only when QueuePFN is set to zero + (0x0), so when the queue is not actively used. + +• 0x038 | W | QueueNum + Virtual queue size. + Queue size is a number of elements in the queue, therefore size + of the descriptor table and both available and used rings. + Writing to this register notifies the Host what size of the + queue the Guest will use. This applies to the queue selected by + writing to QueueSel. + +• 0x03c | W | QueueAlign + Used Ring alignment in the virtual queue. + Writing to this register notifies the Host about alignment + boundary of the Used Ring in bytes. This value must be a power + of 2 and applies to the queue selected by writing to QueueSel. + +• 0x040 | RW | QueuePFN + Guest physical page number of the virtual queue. 
+ Writing to this register notifies the host about location of the + virtual queue in the Guest's physical address space. This value + is the index number of a page starting with the queue + Descriptor Table. Value zero (0x0) means physical address zero + (0x00000000) and is illegal. When the Guest stops using the + queue it must write zero (0x0) to this register. + Reading from this register returns the currently used page + number of the queue, therefore a value other than zero (0x0) + means that the queue is in use. + Both read and write accesses apply to the queue selected by + writing to QueueSel. + +• 0x050 | W | QueueNotify + Queue notifier. + Writing a queue index to this register notifies the Host that + there are new buffers to process in the queue. + +• 0x60 | R | InterruptStatus +Interrupt status. +Reading from this register returns a bit mask of interrupts + asserted by the device. An interrupt is asserted if the + corresponding bit is set, ie. equals one (1). + + – Bit 0 | Used Ring Update + This interrupt is asserted when the Host has updated the Used + Ring in at least one of the active virtual queues. + + – Bit 1 | Configuration change + This interrupt is asserted when configuration of the device has + changed. -/* Virtio ring descriptors: 16 bytes. - * These can chain together via "next". */ -struct vring_desc { - /* Address (guest-physical). */ - uint64_t addr; - /* Length. */ - uint32_t len; - /* The flags as indicated above. */ - uint16_t flags; - /* We chain unused descriptors via this, too */ - uint16_t next; -}; +• 0x064 | W | InterruptACK + Interrupt acknowledge. + Writing to this register notifies the Host that the Guest + finished handling interrupts. Set bits in the value clear the + corresponding bits of the InterruptStatus register. -struct vring_avail { - uint16_t flags; - uint16_t idx; - uint16_t ring[]; - uint16_t used_event; -}; +• 0x070 | RW | Status + Device status. 
+ Reading from this register returns the current device status + flags. + Writing non-zero values to this register sets the status flags, + indicating the Guest progress. Writing zero (0x0) to this + register triggers a device reset. + Also see [sub:Device-Initialization-Sequence] -/* u32 is used here for ids for padding reasons. */ -struct vring_used_elem { - /* Index of start of used descriptor chain. */ - uint32_t id; - /* Total length of the descriptor chain which was written to. */ - uint32_t len; -}; +• 0x100+ | RW | Config + Device-specific configuration space starts at an offset 0x100 + and is accessed with byte alignment. Its meaning and size + depends on the device and the driver. -struct vring_used { - uint16_t flags; - uint16_t idx; - struct vring_used_elem ring[]; - uint16_t avail_event; -}; +Virtual queue size is a number of elements in the queue, +therefore size of the descriptor table and both available and +used rings. -struct vring { - unsigned int num; +The endianness of the registers follows the native endianness of +the Guest. Writing to registers described as “R” and reading from +registers described as “W” is not permitted and can cause +undefined behavior. - struct vring_desc *desc; - struct vring_avail *avail; - struct vring_used *used; -}; +2.4.2.3 MMIO-specific Initialization And Device Operation +--------------------------------------------------------- -/* The standard layout for the ring is a continuous chunk of memory which - * looks like this. We assume num is a power of 2. - * - * struct vring { - * // The actual descriptors (16 bytes each) - * struct vring_desc desc[num]; - * - * // A ring of available descriptor heads with free-running index. - * __u16 avail_flags; - * __u16 avail_idx; - * __u16 available[num]; - * - * // Padding to the next align boundary. - * char pad[]; - * - * // A ring of used descriptor heads with free-running index. 
- * __u16 used_flags; - * __u16 EVENT_IDX; - * struct vring_used_elem used[num]; - * }; - * Note: for virtio PCI, align is 4096. - */ -static inline void vring_init(struct vring *vr, unsigned int num, void *p, - unsigned long align) -{ - vr->num = num; - vr->desc = p; - vr->avail = p + num*sizeof(struct vring_desc); - vr->used = (void *)(((unsigned long)&vr->avail->ring[num] - + align-1) - & ~(align - 1)); -} +2.4.2.3.1 Device Initialization +------------------------------- -static inline unsigned vring_size(unsigned int num, unsigned long align) -{ - return ((sizeof(struct vring_desc)*num + sizeof(uint16_t)*(2+num) - + align - 1) & ~(align - 1)) - + sizeof(uint16_t)*3 + sizeof(struct vring_used_elem)*num; -} +Unlike the fixed page size for PCI, the virtqueue page size is defined +by the GuestPageSize field, as written by the guest. This must be +done before the virtqueues are configured. -static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx) -{ - return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx); -} -#endif /* VIRTIO_RING_H */ +2.4.2.3.1.1 Virtqueue Configuration +----------------------------------- +1. Select the queue writing its index (first queue is 0) to the + QueueSel register. -Appendix B: Reserved Feature Bits +2. Check if the queue is not already in use: read QueuePFN + register, returned value should be zero (0x0). -Currently there are five device-independent feature bits defined: +3. Read maximum queue size (number of elements) from the + QueueNumMax register. If the returned value is zero (0x0) the + queue is not available. - VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature - indicates that the driver wants an interrupt if the device runs - out of available descriptors on a virtqueue, even though - interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT - flag or the used_event field. 
An example of this is the - networking driver: it doesn't need to know every time a packet - is transmitted, but it does need to free the transmitted - packets a finite time after they are transmitted. It can avoid - using a timer if the device interrupts it when all the packets - are transmitted. +4. Allocate and zero the queue pages in contiguous virtual + memory, aligning the Used Ring to an optimal boundary (usually + page size). Size of the allocated queue may be smaller than or + equal to the maximum size returned by the Host. - VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature indicates - that the driver can use descriptors with the VRING_DESC_F_INDIRECT - flag set, as described in 2.3.3 Indirect Descriptors. +5. Notify the Host about the queue size by writing the size to + QueueNum register. - VIRTIO_F_RING_EVENT_IDX(29) This feature enables the used_event - and the avail_event fields. If set, it indicates that the - device should ignore the flags field in the available ring - structure. Instead, the used_event field in this structure is - used by guest to suppress device interrupts. Further, the - driver should ignore the flags field in the used ring - structure. Instead, the avail_event field in this structure is - used by the device to suppress notifications. If unset, the - driver should ignore the used_event field; the device should - ignore the avail_event field; the flags field is used +6. Notify the Host about the used alignment by writing its value + in bytes to QueueAlign register. -Appendix C: Network Device +7. Write the physical number of the first page of the queue to + the QueuePFN register. + +2.4.2.3.2 Notifying The Device +------------------------------ + +The device is notified about new buffers available in a queue by +writing the queue index to register QueueNotify.
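
The MMIO queue setup steps and the notification write can be sketched in C as below. This is only an illustration under stated assumptions: the register offsets come from the register list in 2.4.2.2, but the macro names, the `fake_regs` array standing in for the device, and the helpers `reg_write`, `reg_read`, `mmio_setup_queue` and `mmio_notify_queue` are hypothetical. The sketch assumes GuestPageSize has already been written, that the caller has allocated and zeroed the queue (step 4), and that notification goes through the QueueNotify register at offset 0x050.

```c
#include <stdint.h>

/* Register offsets, taken from the register list in 2.4.2.2. */
#define VIRTIO_MMIO_QUEUE_SEL     0x030
#define VIRTIO_MMIO_QUEUE_NUM_MAX 0x034
#define VIRTIO_MMIO_QUEUE_NUM     0x038
#define VIRTIO_MMIO_QUEUE_ALIGN   0x03c
#define VIRTIO_MMIO_QUEUE_PFN     0x040
#define VIRTIO_MMIO_QUEUE_NOTIFY  0x050

/* Hypothetical stand-in for the device: a plain array, not real MMIO.
 * A real driver would use volatile accesses to the mapped region. */
static uint32_t fake_regs[0x100 / 4];

static void reg_write(uint32_t off, uint32_t val) { fake_regs[off / 4] = val; }
static uint32_t reg_read(uint32_t off) { return fake_regs[off / 4]; }

/* Configure one virtqueue. Returns 0 on success, -1 if the queue is
 * already in use, not available, or smaller than the requested size. */
static int mmio_setup_queue(uint32_t index, uint32_t num, uint32_t align,
                            uint64_t queue_phys, uint32_t page_size)
{
    uint32_t max;

    reg_write(VIRTIO_MMIO_QUEUE_SEL, index);       /* step 1 */
    if (reg_read(VIRTIO_MMIO_QUEUE_PFN) != 0)      /* step 2 */
        return -1;
    max = reg_read(VIRTIO_MMIO_QUEUE_NUM_MAX);     /* step 3 */
    if (max == 0 || num > max)
        return -1;
    /* Step 4 (allocating and zeroing the queue) is left to the caller. */
    reg_write(VIRTIO_MMIO_QUEUE_NUM, num);         /* step 5 */
    reg_write(VIRTIO_MMIO_QUEUE_ALIGN, align);     /* step 6 */
    reg_write(VIRTIO_MMIO_QUEUE_PFN,               /* step 7 */
              (uint32_t)(queue_phys / page_size));
    return 0;
}

/* Tell the device that new buffers are available in queue `index`. */
static void mmio_notify_queue(uint32_t index)
{
    reg_write(VIRTIO_MMIO_QUEUE_NOTIFY, index);
}
```

Because the array is not per-queue, a second call to `mmio_setup_queue` fails at step 2 as soon as QueuePFN holds a non-zero value; a real device tracks QueuePFN separately for each selected queue.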
+ +2.4.2.3.3 Receiving Used Buffers From The Device +------------------------------------------------ + +The memory mapped virtio device uses a single, dedicated +interrupt signal, which is raised when at least one of the +interrupts described in the InterruptStatus register +description is asserted. After receiving an interrupt, the +driver must read the InterruptStatus register to check what +caused the interrupt (see the register description). After the +interrupt is handled, the driver must acknowledge it by writing +a bit mask corresponding to the serviced interrupt to the +InterruptACK register. + +2.4.2.3.4 Notification of Device Configuration Changes +------------------------------------------------------ + +This is indicated by bit 1 in the InterruptStatus register, as +documented in the register description. + +2.5 Device Types +================ + +On top of the queues, config space and feature negotiation facilities +built into virtio, several specific devices are defined. + +The following device IDs are used to identify different types of virtio +devices. Some device IDs are reserved for devices which are not currently +defined in this standard. + +Discovering what devices are available and their type is bus-dependent.
+ ++------------+--------------------+ +| Device ID | Virtio Device | ++------------+--------------------+ ++------------+--------------------+ +| 1 | network card | ++------------+--------------------+ +| 2 | block device | ++------------+--------------------+ +| 3 | console | ++------------+--------------------+ +| 4 | entropy source | ++------------+--------------------+ +| 5 | memory ballooning | ++------------+--------------------+ +| 6 | ioMemory | ++------------+--------------------+ +| 7 | rpmsg | ++------------+--------------------+ +| 8 | SCSI host | ++------------+--------------------+ +| 9 | 9P transport | ++------------+--------------------+ +| 10 | mac80211 wlan | ++------------+--------------------+ + +2.5.1 Network Device +==================== The virtio network device is a virtual ethernet card, and is the most complex of the devices supported so far by virtio. It has @@ -975,13 +1100,20 @@ packets are enqueued into another for transmission in that order. A third command queue is used to control advanced filtering features. -Configuration +2.5.1.1 Device ID +----------------- + + 1 - Subsystem Device ID 1 +2.5.1.2 Virtqueues +------------------ - Virtqueues 0:receiveq. 1:transmitq. 2:controlq[12] + 0:receiveq. 1:transmitq. 2:controlq -Feature bits + Virtqueue 2 only exists if VIRTIO_NET_F_CTRL_VQ set. + +2.5.1.3 Feature bits +-------------------- VIRTIO_NET_F_CSUM (0) Device handles packets with partial checksum @@ -1037,7 +1169,8 @@ Feature bits u16 status; }; -Device Initialization +2.5.1.4 Device Initialization +----------------------------- 1. The initialization routine should identify the receive and transmission virtqueues. @@ -1080,7 +1213,8 @@ Device Initialization equivalents of the features described above. See “Receiving Packets” below. -Device Operation +2.5.1.5 Device Operation +------------------------ Packets are transmitted by placing them in the transmitq, and buffers for incoming packets are placed in the receiveq. 
In each @@ -1106,7 +1240,8 @@ case, the packet itself is preceded by a header: The controlq is used to control device features such as filtering. -Packet Transmission +2.5.1.5.1 Packet Transmission +----------------------------- Transmitting a single packet is simple, but varies depending on the different features the driver negotiated. @@ -1151,7 +1286,8 @@ the different features the driver negotiated. transmitq, and the device is notified of the new entry (see 2.4.1.4 Notifying The Device).[20] - Packet Transmission Interrupt +2.5.1.5.1.1 Packet Transmission Interrupt +----------------------------------------- Often a driver will suppress transmission interrupts using the VRING_AVAIL_F_NO_INTERRUPT flag (see 2.4.2 Receiving Used Buffers From @@ -1164,7 +1300,7 @@ The normal behavior in this interrupt handler is to retrieve and new descriptors from the used ring and free the corresponding headers and packets. - Setting Up Receive Buffers +2.5.1.5.2 Setting Up Receive Buffers It is generally a good idea to keep the receive virtqueue as fully populated as possible: if it runs out, network performance @@ -1180,7 +1316,8 @@ buffer in the receive queue needs to be at least this length [20a] If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at least the size of the struct virtio_net_hdr. - Packet Receive Interrupt +2.5.1.5.2.1 Packet Receive Interrupt +------------------------------------ When a packet is copied into a buffer in the receiveq, the optimal path is to disable further interrupts for the receiveq @@ -1214,7 +1351,8 @@ Processing packet involves: VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the desired MSS (see Packet Transmission point 2). -Control Virtqueue +2.5.1.5.3 Control Virtqueue +--------------------------- The driver uses the control virtqueue (if VIRTIO_NET_F_CTRL_VQ is negotiated) to send commands to manipulate various features of @@ -1239,7 +1377,8 @@ driver, and the device sets the ack byte.
There is little it can do except issue a diagnostic if the ack byte is not VIRTIO_NET_OK. -Packet Receive Filtering +2.5.1.5.3.1 Packet Receive Filtering +------------------------------------ If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can send control commands for promiscuous mode, multicast receiving, @@ -1260,7 +1399,8 @@ VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and off. The command-specific-data is one byte containing 0 (off) or 1 (on). -Setting MAC Address Filtering +2.5.1.5.3.2 Setting MAC Address Filtering +----------------------------------------- struct virtio_net_ctrl_mac { u32 entries; @@ -1277,7 +1417,8 @@ command-specific-data is two variable length tables of 6-byte MAC addresses. The first table contains unicast addresses, and the second contains multicast addresses. -VLAN Filtering +2.5.1.5.3.3 VLAN Filtering +-------------------------- If the driver negotiates the VIRTION_NET_F_CTRL_VLAN feature, it can control a VLAN filter table in the device. @@ -1289,7 +1430,8 @@ can control a VLAN filter table in the device. Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL command take a 16-bit VLAN id as the command-specific-data. -Gratuitous Packet Sending +2.5.1.5.3.4 Gratuitous Packet Sending +------------------------------------- If the driver negotiates the VIRTIO_NET_F_GUEST_ANNOUNCE (depends on VIRTIO_NET_F_CTRL_VQ), it can ask the guest to send gratuitous @@ -1318,22 +1460,24 @@ Processing this notification involves: 2. Sending VIRTIO_NET_CTRL_ANNOUNCE_ACK command through control vq. -3. . - -Appendix D: Block Device +2.5.2 Block Device +================== The virtio block device is a simple virtual block device (ie. disk). Read and write requests (and other exotic requests) are placed in the queue, and serviced (probably out of order) by the device except where noted. 
-Configuration +2.5.2.1 Device ID +----------------- + 2 - Subsystem Device ID 2 +2.5.2.2 Virtqueues +------------------ + 0:requestq - Virtqueues 0:requestq. - - Feature bits +2.5.2.3 Feature bits +-------------------- VIRTIO_BLK_F_BARRIER (0) Host supports request barriers. @@ -1371,7 +1515,8 @@ Configuration u32 blk_size; }; -Device Initialization +2.5.2.4 Device Initialization +----------------------------- 1. The device size should be read from the “capacity” configuration field. No requests should be submitted which goes @@ -1386,7 +1531,8 @@ Device Initialization 3. If the VIRTIO_BLK_F_RO feature is set by the device, any write requests will fail. -Device Operation +2.5.2.5 Device Operation +------------------------ The driver queues requests to the virtqueue, and they are used by the device (not necessarily in order). Each request is of form: @@ -1487,7 +1633,9 @@ data_len, sense_len and residual in a single write-only buffer; and the status field is a separate read-only buffer of size 1 byte, by itself. -Appendix E: Console Device + +2.5.3 Console Device +==================== The virtio console device is a simple device for data input and output. A device may have one or more ports. Each port has a pair @@ -1502,15 +1650,21 @@ successfully added, port open/close, etc.. For data IO, one or more empty buffers are placed in the receive queue for incoming data and outgoing characters are placed in the transmit queue. -Configuration +2.5.3.1 Device ID +----------------- - Subsystem Device ID 3 + 3 - Virtqueues 0:receiveq(port0). 1:transmitq(port0), 2:control - receiveq[24], 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1), +2.5.3.2 Virtqueues +------------------ + + 0:receiveq(port0). 1:transmitq(port0), 2:control receiveq, 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1), ... - Feature bits + Ports 2 onwards only exist if VIRTIO_CONSOLE_F_MULTIPORT is set. 
+ +2.5.3.3 Feature bits +-------------------- VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields are valid. @@ -1519,7 +1673,10 @@ Configuration ports; configuration fields nr_ports and max_nr_ports are valid and control virtqueues will be used. - Device configuration layout The size of the console is supplied +2.5.3.4 Device configuration layout +----------------------------------- + + The size of the console is supplied in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature is set, the maximum number of ports supported by the device can @@ -1531,7 +1688,8 @@ Configuration u32 max_nr_ports; }; -Device Initialization +2.5.3.5 Device Initialization +----------------------------- 1. If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver can read the console dimensions from the configuration fields. @@ -1552,7 +1710,8 @@ Device Initialization 3. The receiveq for each port is populated with one or more receive buffers. -Device Operation +2.5.3.6 Device Operation +------------------------ 1. For output, a buffer containing the characters is placed in the port's transmitq.[25] @@ -1596,32 +1755,42 @@ Device Operation #define VIRTIO_CONSOLE_PORT_OPEN 6 #define VIRTIO_CONSOLE_PORT_NAME 7 -Appendix F: Entropy Device +2.5.4 Entropy Device +==================== The virtio entropy device supplies high-quality randomness for guest use. - Configuration - - Subsystem Device ID 4 +2.5.4.1 Device ID +----------------- + 4 - Virtqueues 0:requestq. +2.5.4.2 Virtqueues +------------------ + 0:requestq. - Feature bits None currently defined +2.5.4.3 Feature bits +-------------------- + None currently defined - Device configuration layout None currently defined. +2.5.4.4 Device configuration layout +----------------------------------- + None currently defined. -Device Initialization +2.5.4.5 Device Initialization +----------------------------- 1. 
The virtqueue is initialized -Device Operation +2.5.4.6 Device Operation +------------------------ When the driver requires random bytes, it places the descriptor of one or more buffers in the queue. It will be completely filled by random data by the device. -Appendix G: Memory Balloon Device +2.5.5 Memory Balloon Device +=========================== The virtio memory balloon device is a primitive device for managing guest memory: the device asks for a certain amount of @@ -1631,21 +1800,27 @@ changes in allowance of underlying physical memory. If the feature is negotiated, the device can also be used to communicate guest memory statistics to the host. - Configuration +2.5.5.1 Device ID +----------------- + 5 - Subsystem Device ID 5 +2.5.5.2 Virtqueues +------------------ + 0:inflateq. 1:deflateq. 2:statsq. - Virtqueues 0:inflateq. 1:deflateq. 2:statsq.[26] - - Feature bits + Virtqueue 2 only exists if VIRTIO_BALLOON_F_STATS_VQ is set. +2.5.5.3 Feature bits +-------------------- VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before pages from the balloon are used. VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest memory statistics is present. - Device configuration layout Both fields of this configuration +2.5.5.4 Device configuration layout +----------------------------------- + Both fields of this configuration are always available. Note that they are little endian, despite convention that device fields are guest endian: @@ -1654,7 +1829,8 @@ guest memory statistics to the host. u32 actual; }; -Device Initialization +2.5.5.5 Device Initialization +----------------------------- 1. The inflate and deflate virtqueues are identified. @@ -1667,7 +1843,8 @@ Device Initialization Device operation begins immediately. -Device Operation +2.5.5.6 Device Operation +------------------------ Memory Ballooning The device is driven by the receipt of a configuration change interrupt.
deflation, the “actual” field of the configuration should be updated to reflect the new number of pages in the balloon.[29] -Memory Statistics +2.5.5.6.1 Memory Statistics +--------------------------- The stats virtqueue is atypical because communication is driven by the device (not the driver). The channel becomes active at @@ -1741,7 +1919,8 @@ as follows: u64 val; } __attribute__((packed)); - Tags +2.5.5.6.2 Memory Statistics Tags +-------------------------------- VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been swapped in (in bytes). @@ -1761,7 +1940,9 @@ as follows: VIRTIO_BALLOON_S_MEMTOT The total amount of memory available (in bytes). -Appendix I: SCSI Host Device + +2.5.6 SCSI Host Device +====================== The virtio SCSI host device groups together one or more virtual logical units (such as disks), and allows communicating to them @@ -1782,13 +1963,16 @@ medium. In the transport protocol, the virtio driver acts as the initiator, while the virtio SCSI host provides one or more targets that receive and process the requests. - Configuration +2.5.6.1 Device ID +----------------- + 8 - Subsystem Device ID 8 +2.5.6.2 Virtqueues +------------------ + 0:controlq; 1:eventq; 2..n:request queues. - Virtqueues 0:controlq; 1:eventq; 2..n:request queues. - - Feature bits +2.5.6.3 Feature bits +-------------------- VIRTIO_SCSI_F_INOUT (0) A single request can include both read-only and write-only data buffers. @@ -1796,9 +1980,11 @@ targets that receive and process the requests. VIRTIO_SCSI_F_HOTPLUG (1) The host should enable hot-plug/hot-unplug of new LUNs and targets on the SCSI bus. - Device configuration layout All fields of this configuration - are always available. sense_size and cdb_size are writable by - the guest. +2.5.6.4 Device configuration layout +----------------------------------- + + All fields of this configuration are always available. sense_size + and cdb_size are writable by the guest. 
struct virtio_scsi_config { u32 num_queues; @@ -1849,7 +2035,8 @@ targets that receive and process the requests. as hints to constrain scanning the logical units on the host.h -Device Initialization +2.5.6.5 Device Initialization +----------------------------- The initialization routine should first of all discover the device's virtqueues. @@ -1861,7 +2048,14 @@ The driver can immediately issue requests (for example, INQUIRY or REPORT LUNS) or task management functions (for example, I_T RESET). -Device Operation: request queues +2.5.6.6 Device Operation +------------------------ + +Device operation consists of operating request queues, the control +queue and the event queue. + +2.5.6.6.1 Device Operation: Request Queues +------------------------------------------ The driver queues requests to an arbitrary request queue, and they are used by the device on that same queue. It is the @@ -1983,7 +2177,8 @@ following: request will be immediately returned with a response equal to VIRTIO_SCSI_S_FAILURE. -Device Operation: controlq +2.5.6.6.2 Device Operation: controlq +------------------------------------ The controlq is used for other SCSI transport operations. Requests have the following format: @@ -2114,7 +2309,8 @@ The following commands are defined: No command-specific values are defined for the response byte. -Device Operation: eventq +2.5.6.6.3 Device Operation: eventq +---------------------------------- The eventq is used by the device to report information on logical units that are attached to it. The driver should always leave a @@ -2257,234 +2453,254 @@ contents of the event field. The following events are defined: When dropped events are reported, the driver should poll for asynchronous events manually using SCSI commands. -Appendix X: virtio-mmio - -Virtual environments without PCI support (a common situation in -embedded devices models) might use simple memory mapped device (“ -virtio-mmio”) instead of the PCI device. 
- -The memory mapped virtio device behaviour is based on the PCI -device specification. Therefore most of operations like device -initialization, queues configuration and buffer transfers are -nearly identical. Existing differences are described in the -following sections. - -Device Initialization - -Instead of using the PCI IO space for virtio header, the “ -virtio-mmio” device provides a set of memory mapped control -registers, all 32 bits wide, followed by device-specific -configuration space. The following list presents their layout: - -• Offset from the device base address | Direction | Name - Description - -• 0x000 | R | MagicValue - “virt” string. - -• 0x004 | R | Version - Device version number. Currently must be 1. - -• 0x008 | R | DeviceID - Virtio Subsystem Device ID (ie. 1 for network card). - -• 0x00c | R | VendorID - Virtio Subsystem Vendor ID. - -• 0x010 | R | HostFeatures - Flags representing features the device supports. - Reading from this register returns 32 consecutive flag bits, - first bit depending on the last value written to - HostFeaturesSel register. Access to this register returns bits HostFeaturesSel*32 - - to (HostFeaturesSel*32)+31, eg. feature bits 0 to 31 if - HostFeaturesSel is set to 0 and features bits 32 to 63 if - HostFeaturesSel is set to 1. Also see [sub:Feature-Bits] -• 0x014 | W | HostFeaturesSel - Device (Host) features word selection. - Writing to this register selects a set of 32 device feature bits - accessible by reading from HostFeatures register. Device driver - must write a value to the HostFeaturesSel register before - reading from the HostFeatures register. +2.6 Reserved Feature Bits +========================= -• 0x020 | W | GuestFeatures - Flags representing device features understood and activated by - the driver. - Writing to this register sets 32 consecutive flag bits, first - bit depending on the last value written to GuestFeaturesSel - register. 
Access to this register sets bits GuestFeaturesSel*32 - to (GuestFeaturesSel*32)+31, eg. feature bits 0 to 31 if - GuestFeaturesSel is set to 0 and features bits 32 to 63 if - GuestFeaturesSel is set to 1. Also see [sub:Feature-Bits] +Currently there are five device-independent feature bits defined: -• 0x024 | W | GuestFeaturesSel - Activated (Guest) features word selection. - Writing to this register selects a set of 32 activated feature - bits accessible by writing to the GuestFeatures register. - Device driver must write a value to the GuestFeaturesSel - register before writing to the GuestFeatures register. + VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature + indicates that the driver wants an interrupt if the device runs + out of available descriptors on a virtqueue, even though + interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT + flag or the used_event field. An example of this is the + networking driver: it doesn't need to know every time a packet + is transmitted, but it does need to free the transmitted + packets a finite time after they are transmitted. It can avoid + using a timer if the device interrupts it when all the packets + are transmitted. -• 0x028 | W | GuestPageSize - Guest page size. - Device driver must write the guest page size in bytes to the - register during initialization, before any queues are used. - This value must be a power of 2 and is used by the Host to - calculate Guest address of the first queue page (see QueuePFN). + VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature indicates + that the driver can use descriptors with the VRING_DESC_F_INDIRECT + flag set, as described in 2.3.3 Indirect Descriptors. -• 0x030 | W | QueueSel - Virtual queue index (first queue is 0). - Writing to this register selects the virtual queue that the - following operations on QueueNum, QueueAlign and QueuePFN apply - to. + VIRTIO_F_RING_EVENT_IDX(29) This feature enables the used_event + and the avail_event fields. 
If set, it indicates that the + device should ignore the flags field in the available ring + structure. Instead, the used_event field in this structure is + used by the guest to suppress device interrupts. Further, the + driver should ignore the flags field in the used ring + structure. Instead, the avail_event field in this structure is + used by the device to suppress notifications. If unset, the + driver should ignore the used_event field; the device should + ignore the avail_event field; the flags field is used. -• 0x034 | R | QueueNumMax - Maximum virtual queue size. - Reading from the register returns the maximum size of the queue - the Host is ready to process or zero (0x0) if the queue is not - available. This applies to the queue selected by writing to - QueueSel and is allowed only when QueuePFN is set to zero - (0x0), so when the queue is not actively used. -• 0x038 | W | QueueNum - Virtual queue size. - Queue size is a number of elements in the queue, therefore size - of the descriptor table and both available and used rings. - Writing to this register notifies the Host what size of the - queue the Guest will use. This applies to the queue selected by - writing to QueueSel. +2.7 virtio_ring.h +================= -• 0x03c | W | QueueAlign - Used Ring alignment in the virtual queue. - Writing to this register notifies the Host about alignment - boundary of the Used Ring in bytes. This value must be a power - of 2 and applies to the queue selected by writing to QueueSel. +#ifndef VIRTIO_RING_H +#define VIRTIO_RING_H +/* An interface for efficient virtio implementation. + * + * This header is BSD licensed so anyone can use the definitions + * to implement compatible drivers/servers. + * + * Copyright 2007, 2009, IBM Corporation + * Copyright 2011, Red Hat, Inc + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1.
Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. Neither the name of IBM nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ -• 0x040 | RW | QueuePFN - Guest physical page number of the virtual queue. - Writing to this register notifies the host about location of the - virtual queue in the Guest's physical address space. This value - is the index number of a page starting with the queue - Descriptor Table. Value zero (0x0) means physical address zero - (0x00000000) and is illegal. When the Guest stops using the - queue it must write zero (0x0) to this register. - Reading from this register returns the currently used page - number of the queue, therefore a value other than zero (0x0) - means that the queue is in use. 
- Both read and write accesses apply to the queue selected by - writing to QueueSel. +/* This marks a buffer as continuing via the next field. */ +#define VRING_DESC_F_NEXT 1 +/* This marks a buffer as write-only (otherwise read-only). */ +#define VRING_DESC_F_WRITE 2 -• 0x050 | W | QueueNotify - Queue notifier. - Writing a queue index to this register notifies the Host that - there are new buffers to process in the queue. +/* The Host uses this in used->flags to advise the Guest: don't kick me + * when you add a buffer. It's unreliable, so it's simply an + * optimization. Guest will still kick if it's out of buffers. */ +#define VRING_USED_F_NO_NOTIFY 1 +/* The Guest uses this in avail->flags to advise the Host: don't + * interrupt me when you consume a buffer. It's unreliable, so it's + * simply an optimization. */ +#define VRING_AVAIL_F_NO_INTERRUPT 1 -• 0x60 | R | InterruptStatus -Interrupt status. -Reading from this register returns a bit mask of interrupts - asserted by the device. An interrupt is asserted if the - corresponding bit is set, ie. equals one (1). +/* Virtio ring descriptors: 16 bytes. + * These can chain together via "next". */ +struct vring_desc { + /* Address (guest-physical). */ + uint64_t addr; + /* Length. */ + uint32_t len; + /* The flags as indicated above. */ + uint16_t flags; + /* We chain unused descriptors via this, too */ + uint16_t next; +}; - – Bit 0 | Used Ring Update - This interrupt is asserted when the Host has updated the Used - Ring in at least one of the active virtual queues. +struct vring_avail { + uint16_t flags; + uint16_t idx; + uint16_t ring[]; + uint16_t used_event; +}; - – Bit 1 | Configuration change - This interrupt is asserted when configuration of the device has - changed. +/* u32 is used here for ids for padding reasons. */ +struct vring_used_elem { + /* Index of start of used descriptor chain. */ + uint32_t id; + /* Total length of the descriptor chain which was written to. 
*/ + uint32_t len; +}; -• 0x064 | W | InterruptACK - Interrupt acknowledge. - Writing to this register notifies the Host that the Guest - finished handling interrupts. Set bits in the value clear the - corresponding bits of the InterruptStatus register. +struct vring_used { + uint16_t flags; + uint16_t idx; + struct vring_used_elem ring[]; + uint16_t avail_event; +}; -• 0x070 | RW | Status - Device status. - Reading from this register returns the current device status - flags. - Writing non-zero values to this register sets the status flags, - indicating the Guest progress. Writing zero (0x0) to this - register triggers a device reset. - Also see [sub:Device-Initialization-Sequence] +struct vring { + unsigned int num; -• 0x100+ | RW | Config - Device-specific configuration space starts at an offset 0x100 - and is accessed with byte alignment. Its meaning and size - depends on the device and the driver. + struct vring_desc *desc; + struct vring_avail *avail; + struct vring_used *used; +}; -Virtual queue size is a number of elements in the queue, -therefore size of the descriptor table and both available and -used rings. +/* The standard layout for the ring is a continuous chunk of memory which + * looks like this. We assume num is a power of 2. + * + * struct vring { + * // The actual descriptors (16 bytes each) + * struct vring_desc desc[num]; + * + * // A ring of available descriptor heads with free-running index. + * __u16 avail_flags; + * __u16 avail_idx; + * __u16 available[num]; + * + * // Padding to the next align boundary. + * char pad[]; + * + * // A ring of used descriptor heads with free-running index. + * __u16 used_flags; + * __u16 used_idx; + * struct vring_used_elem used[num]; + * }; + * Note: for virtio PCI, align is 4096.
+ */ +static inline void vring_init(struct vring *vr, unsigned int num, void *p, + unsigned long align) +{ + vr->num = num; + vr->desc = p; + vr->avail = p + num*sizeof(struct vring_desc); + vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + + align-1) + & ~(align - 1)); +} -The endianness of the registers follows the native endianness of -the Guest. Writing to registers described as “R” and reading from -registers described as “W” is not permitted and can cause -undefined behavior. +static inline unsigned vring_size(unsigned int num, unsigned long align) +{ + return ((sizeof(struct vring_desc)*num + sizeof(uint16_t)*(2+num) + + align - 1) & ~(align - 1)) + + sizeof(uint16_t)*3 + sizeof(struct vring_used_elem)*num; +} -The device initialization is performed as described in 2.2.1 Device -Initialization Sequence with one exception: the Guest must notify the -Host about its page size, writing the size in bytes to GuestPageSize -register before the initialization is finished. +static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx) +{ + return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx); +} +#endif /* VIRTIO_RING_H */ -The memory mapped virtio devices generate single interrupt only, -therefore no special configuration is required. -Virtqueue Configuration -The virtual queue configuration is performed in a similar way to -the one described in 2.3 Virtqueue Configuration with a few -additional operations: +2.10 Creating New Device Types +============================== -1. Select the queue writing its index (first queue is 0) to the - QueueSel register. +Various considerations are necessary when creating a new device +type. + +2.10.1 How Many Virtqueues? +--------------------------- -2. Check if the queue is not already in use: read QueuePFN - register, returned value should be zero (0x0). 
+It is possible that a very simple device will operate entirely +through its configuration space, but most will need at least one +virtqueue in which the driver will place requests. A device with both +input and output (eg. console and network devices described here) +needs two queues: one which the driver fills with buffers to +receive input, and one in which the driver places buffers to +transmit output. -3. Read maximum queue size (number of elements) from the - QueueNumMax register. If the returned value is zero (0x0) the - queue is not available. +2.10.2 What Configuration Space Layout? +--------------------------------------- -4. Allocate and zero the queue pages in contiguous virtual - memory, aligning the Used Ring to an optimal boundary (usually - page size). Size of the allocated queue may be smaller than or - equal to the maximum size returned by the Host. +Configuration space should only be used for initialization-time +parameters. It is a limited resource with no synchronization, so for +most uses it is better to use a virtqueue to update configuration +information (the network device does this for filtering, +otherwise the table in the config space could potentially be very +large). -5. Notify the Host about the queue size by writing the size to - QueueNum register. +2.10.3 What Device Number? +-------------------------- -6. Notify the Host about the used alignment by writing its value - in bytes to QueueAlign register. +Currently device numbers are assigned quite freely: a simple +request mail to the author of this document or the Linux +virtualization mailing list[9] will be sufficient to secure a unique one. -7. Write the physical number of the first page of the queue to - the QueuePFN register. +Meanwhile for experimental drivers, use 65535 and work backwards. -The queue and the device are ready to begin normal operations -now. +2.10.4 How many MSI-X vectors?
(for PCI) +----------------------------------------- -Device Operation +Using the optional MSI-X capability, devices can speed up +interrupt processing by removing the need for the guest driver +to read the ISR Status register (which might be an expensive operation), +reducing interrupt sharing between devices and queues within the +device, and handling interrupts from multiple CPUs. However, some +systems impose a limit (which might be as low as 256) on the +total number of MSI-X vectors that can be allocated to all +devices. Devices and/or device drivers should take this into +account, limiting the number of vectors used unless the device is +expected to cause a high volume of interrupts. Devices can +control the number of vectors used by limiting the MSI-X Table +Size or not presenting MSI-X capability in PCI configuration +space. Drivers can control this by mapping events to as small +a number of vectors as possible, or disabling MSI-X capability +altogether. -The memory mapped virtio device behaves in the same way as -described in 2.4 Device Operation, with the following -exceptions: +2.10.5 Device Improvements +-------------------------- -1. The device is notified about new buffers available in a queue - by writing the queue index to register QueueNum instead of the - virtio header in PCI I/O space (2.4.1.4 Notifying The Device). +Any change to configuration space, or new virtqueues, or +behavioural changes, should be indicated by negotiation of a new +feature bit. This establishes clarity[11] and avoids future expansion problems. -2. The memory mapped virtio device is using single, dedicated - interrupt signal, which is raised when at least one of the - interrupts described in the InterruptStatus register - description is asserted. After receiving an interrupt, the - driver must read the InterruptStatus register to check what - caused the interrupt (see the register description).
After the - interrupt is handled, the driver must acknowledge it by writing - a bit mask corresponding to the serviced interrupt to the - InterruptACK register. +Clusters of functionality which are always implemented together +can use a single bit, but if one feature makes sense without the +others they should not be gratuitously grouped together to +conserve feature bits. We can always extend the spec when the +first person needs more than 24 feature bits for their device. FOOTNOTES: +========== + [1] This lack of page-sharing implies that the implementation of the device (e.g. the hypervisor or host) needs full access to the guest memory. Communication with untrusted parties (i.e. @@ -2524,7 +2740,6 @@ a cautious driver should arrange it so. [11] Even if it does mean documenting design or implementation mistakes! -[12] Only if VIRTIO_NET_F_CTRL_VQ set [13] It was supposed to indicate segmentation offload support, but upon further investigation it became clear that multiple bits @@ -2575,8 +2790,6 @@ does not distinguish between them. [23] The FLUSH and FLUSH_OUT types are equivalent, the device does not distinguish between them -[24] Ports 2 onwards only if VIRTIO_CONSOLE_F_MULTIPORT is set. - [25] Because this is high importance and low bandwidth, the current Linux implementation polls for the buffer to be used, rather than waiting for an interrupt, simplifying the implementation @@ -2585,8 +2798,6 @@ O_NONBLOCK flag set, the polling limitation is relaxed and the consumed buffers are freed upon the next write or poll call or when a port is closed or hot-unplugged. -[26] Only if VIRTIO_BALLON_F_STATS_VQ set. - [27] This is historical, and independent of the guest page size [28] In this case, deflation advice is merely a courtesy -- cgit v1.2.3
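
Note on vring_need_event(): the helper added in the 2.7 virtio_ring.h hunk depends on unsigned 16-bit wraparound, which is easy to misread. The sketch below restates that helper unchanged and adds a hypothetical wrapper, example_should_interrupt(), which is not defined by the spec. It assumes the device has just advanced the used index from old_used_idx by "completed" entries, and asks whether the driver's used_event lies in the half-open window [old_used_idx, new_used_idx).

```c
#include <stdint.h>

/* Restated from the virtio_ring.h hunk in this patch: nonzero when
 * event_idx lies in [old_idx, new_idx), where all three values are
 * free-running 16-bit indices that may wrap past 0xFFFF. */
static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx,
                                   uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/* Hypothetical helper, for illustration only (not part of the spec).
 * Assumes the device moved the used index forward by 'completed'
 * entries, starting at old_used_idx, and wants to know whether the
 * driver asked for an interrupt somewhere in that range. */
static inline int example_should_interrupt(uint16_t used_event,
                                           uint16_t old_used_idx,
                                           uint16_t completed)
{
    /* The index is free-running, so the addition is allowed to wrap. */
    uint16_t new_used_idx = (uint16_t)(old_used_idx + completed);

    return vring_need_event(used_event, new_used_idx, old_used_idx);
}
```

For example, with old_used_idx = 0xFFFE and completed = 4 the new index wraps to 0x0002; a used_event of 0xFFFF or 0x0001 still falls inside the window and requests an interrupt, while 0x0002 (the new index itself) does not.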