-rw-r--r-- | virtio-v1.0-wd01-part1-specification.txt | 1766 |
1 files changed, 883 insertions, 883 deletions
diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt index 5c887f2..a887618 100644 --- a/virtio-v1.0-wd01-part1-specification.txt +++ b/virtio-v1.0-wd01-part1-specification.txt @@ -7,9 +7,9 @@ design they are not all that different from physical devices, and this document treats them as such. This allows the guest to use standard drivers and discovery mechanisms. -The purpose of virtio and this specification is that virtual -environments and guests should have a straightforward, efficient, -standard and extensible mechanism for virtual devices, rather +The purpose of virtio and this specification is that virtual +environments and guests should have a straightforward, efficient, +standard and extensible mechanism for virtual devices, rather than boutique per-environment or per-OS mechanisms. Straightforward: Virtio devices use normal bus mechanisms of @@ -17,9 +17,9 @@ than boutique per-environment or per-OS mechanisms. author. There is no exotic page-flipping or COW mechanism: it's just a normal device.[1] - Efficient: Virtio devices consist of rings of descriptors - for input and output, which are neatly separated to avoid cache - effects from both guest and device writing to the same cache + Efficient: Virtio devices consist of rings of descriptors + for input and output, which are neatly separated to avoid cache + effects from both guest and device writing to the same cache lines. Standard: Virtio makes no assumptions about the environment in which @@ -27,10 +27,10 @@ than boutique per-environment or per-OS mechanisms. devices are implemented over PCI and other buses, and earlier drafts been implemented on other buses not included in this spec.[2] - Extensible: Virtio PCI devices contain feature bits which are - acknowledged by the guest operating system during device setup. 
- This allows forwards and backwards compatibility: the device - offers all the features it knows about, and the driver + Extensible: Virtio PCI devices contain feature bits which are + acknowledged by the guest operating system during device setup. + This allows forwards and backwards compatibility: the device + offers all the features it knows about, and the driver acknowledges those it understands and wishes to use. 1.1.1. Key words @@ -90,28 +90,28 @@ o One or more virtqueues 2.1.1. Device Status Field ------------------------- -The Device Status field is updated by the guest to indicate its -progress. This provides a simple low-level diagnostic: it's most -useful to imagine them hooked up to traffic lights on the console +The Device Status field is updated by the guest to indicate its +progress. This provides a simple low-level diagnostic: it's most +useful to imagine them hooked up to traffic lights on the console indicating the status of each device. This field is 0 upon reset, otherwise at least one bit should be set: - ACKNOWLEDGE (1) Indicates that the guest OS has found the + ACKNOWLEDGE (1) Indicates that the guest OS has found the device and recognized it as a valid virtio device. - DRIVER (2) Indicates that the guest OS knows how to drive the - device. Under Linux, drivers can be loadable modules so there - may be a significant (or infinite) delay before setting this + DRIVER (2) Indicates that the guest OS knows how to drive the + device. Under Linux, drivers can be loadable modules so there + may be a significant (or infinite) delay before setting this bit. - DRIVER_OK (4) Indicates that the driver is set up and ready to + DRIVER_OK (4) Indicates that the driver is set up and ready to drive the device. - FAILED (128) Indicates that something went wrong in the guest, - and it has given up on the device. This could be an internal - error, or the driver didn't like the device for some reason, or - even a fatal error during device operation. 
The device must be + FAILED (128) Indicates that something went wrong in the guest, + and it has given up on the device. This could be an internal + error, or the driver didn't like the device for some reason, or + even a fatal error during device operation. The device must be reset before attempting to re-initialize. 2.1.2. Feature Bits @@ -134,15 +134,15 @@ Feature bits are allocated as follows: 0 to 23: Feature bits for the specific device type - 24 to 31: Feature bits reserved for extensions to the queue and + 24 to 31: Feature bits reserved for extensions to the queue and feature negotiation mechanisms -For example, feature bit 0 for a network device (i.e. Subsystem -Device ID 1) indicates that the device supports checksumming of +For example, feature bit 0 for a network device (i.e. Subsystem +Device ID 1) indicates that the device supports checksumming of packets. -In particular, new fields in the device configuration space are -indicated by offering a feature bit, so the guest can check +In particular, new fields in the device configuration space are +indicated by offering a feature bit, so the guest can check before accessing that part of the configuration space. 2.1.3. Configuration Space @@ -151,7 +151,7 @@ before accessing that part of the configuration space. Configuration space is generally used for rarely-changing or initialization-time parameters. -Note that this space is generally the guest's native endian, +Note that this space is generally the guest's native endian, rather than PCI's little-endian. 2.1.4. Virtqueues @@ -164,7 +164,7 @@ transmit and one for receive. Each queue has a 16-bit queue size parameter, which sets the number of entries and implies the total size of the queue. 
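As a rough illustration of how the queue size implies the total size of a virtqueue, the sketch below adds up the three parts (descriptor table, available ring, used ring) and rounds to page boundaries. It assumes 16-byte descriptors, 2-byte ring entries and index fields, 8-byte used-ring elements, and a 4096-byte page. The names `align_up` and `vring_bytes` are made up for this example and are not part of the specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: transport uses 4096-byte pages */

/* Round n up to the next multiple of PAGE_SIZE_ASSUMED. */
static size_t align_up(size_t n)
{
    return (n + PAGE_SIZE_ASSUMED - 1) & ~(size_t)(PAGE_SIZE_ASSUMED - 1);
}

/* Hypothetical helper: bytes occupied by a virtqueue with qsz entries. */
static size_t vring_bytes(uint16_t qsz)
{
    assert(qsz != 0 && (qsz & (qsz - 1)) == 0);  /* queue size is a power of 2 */

    size_t desc  = (size_t)16 * qsz;                         /* descriptor table */
    size_t avail = sizeof(uint16_t) * ((size_t)3 + qsz);     /* flags, idx, ring[], used_event */
    size_t used  = sizeof(uint16_t) * 3 + (size_t)8 * qsz;   /* flags, idx, ring[], avail_event */

    /* The used ring starts on a page boundary after the first two parts. */
    return align_up(desc + avail) + align_up(used);
}
```

With these assumptions, a queue size of 256 gives 4096 + 518 bytes for the first two parts (rounded up to 8192) plus 2054 bytes for the used ring (rounded up to 4096), that is, three pages in total.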
-Each virtqueue occupies two or more physically-contiguous pages +Each virtqueue occupies two or more physically-contiguous pages (usually defined as 4096 bytes, but depending on the transport) and consists of three parts: @@ -189,10 +189,10 @@ virtqueue layout structure looks like this: struct vring { // The actual descriptors (16 bytes each) struct vring_desc desc[ Queue Size ]; - + // A ring of available descriptor heads with free-running index. struct vring_avail avail; - + // Padding to the next PAGE_SIZE boundary. char pad[ Padding ]; @@ -200,10 +200,10 @@ virtqueue layout structure looks like this: struct vring_used used; }; -When the driver wants to send a buffer to the device, it fills in -a slot in the descriptor table (or chains several together), and -writes the descriptor index into the available ring. It then -notifies the device. When the device has finished a buffer, it +When the driver wants to send a buffer to the device, it fills in +a slot in the descriptor table (or chains several together), and +writes the descriptor index into the available ring. It then +notifies the device. When the device has finished a buffer, it writes the descriptor into the used ring, and sends an interrupt. 2.1.4.1. A Note on Virtqueue Endianness @@ -240,10 +240,10 @@ dividing a network packet into 1500 single-byte descriptors! 2.1.4.3. The Virtqueue Descriptor Table -------------------------------------- -The descriptor table refers to the buffers the guest is using for -the device. The addresses are physical addresses, and the buffers -can be chained via the next field. Each descriptor describes a -buffer which is read-only or write-only, but a chain of +The descriptor table refers to the buffers the guest is using for +the device. The addresses are physical addresses, and the buffers +can be chained via the next field. Each descriptor describes a +buffer which is read-only or write-only, but a chain of descriptors can contain both read-only and write-only buffers. 
No descriptor chain may be more than 2^32 bytes long in total. @@ -253,13 +253,13 @@ No descriptor chain may be more than 2^32 bytes long in total. u64 addr; /* Length. */ u32 len; - + /* This marks a buffer as continuing via the next field. */ #define VRING_DESC_F_NEXT 1 /* This marks a buffer as write-only (otherwise read-only). */ #define VRING_DESC_F_WRITE 2 /* This means the buffer contains a list of buffer descriptors. */ - #define VRING_DESC_F_INDIRECT 4 + #define VRING_DESC_F_INDIRECT 4 /* The flags as indicated above. */ u16 flags; /* Next field if flags & NEXT */ @@ -272,16 +272,16 @@ for this virtqueue. 2.1.4.3.1. Indirect Descriptors ------------------------------ -Some devices benefit by concurrently dispatching a large number -of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be -used to allow this (see "2.6. Reserved Feature Bits"). To increase -ring capacity it is possible to store a table of indirect -descriptors anywhere in memory, and insert a descriptor in main -virtqueue (with flags&VRING_DESC_F_INDIRECT on) that refers to memory buffer -containing this indirect descriptor table; fields addr and len -refer to the indirect table address and length in bytes, -respectively. The indirect table layout structure looks like this -(len is the length of the descriptor that refers to this table, +Some devices benefit by concurrently dispatching a large number +of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be +used to allow this (see "2.6. Reserved Feature Bits"). To increase +ring capacity it is possible to store a table of indirect +descriptors anywhere in memory, and insert a descriptor in main +virtqueue (with flags&VRING_DESC_F_INDIRECT on) that refers to memory buffer +containing this indirect descriptor table; fields addr and len +refer to the indirect table address and length in bytes, +respectively. 
The indirect table layout structure looks like this +(len is the length of the descriptor that refers to this table, which is a variable, so this code won't compile): struct indirect_descriptor_table { @@ -289,15 +289,15 @@ which is a variable, so this code won't compile): struct vring_desc desc[len / 16]; }; -The first indirect descriptor is located at start of the indirect -descriptor table (index 0), additional indirect descriptors are -chained by next field. An indirect descriptor without next field -(with flags&VRING_DESC_F_NEXT off) signals the end of the indirect descriptor -table, and transfers control back to the main virtqueue. An -indirect descriptor can not refer to another indirect descriptor -table (flags&VRING_DESC_F_INDIRECT must be off). A single indirect descriptor -table can include both read-only and write-only descriptors; -write-only flag (flags&VRING_DESC_F_WRITE) in the descriptor that refers to it +The first indirect descriptor is located at start of the indirect +descriptor table (index 0), additional indirect descriptors are +chained by next field. An indirect descriptor without next field +(with flags&VRING_DESC_F_NEXT off) signals the end of the indirect descriptor +table, and transfers control back to the main virtqueue. An +indirect descriptor can not refer to another indirect descriptor +table (flags&VRING_DESC_F_INDIRECT must be off). A single indirect descriptor +table can include both read-only and write-only descriptors; +write-only flag (flags&VRING_DESC_F_WRITE) in the descriptor that refers to it is ignored. 2.1.4.4. The Virtqueue Available Ring @@ -315,7 +315,7 @@ the device is controlled by the VIRTIO_RING_F_EVENT_IDX feature bit (see "2.6. Reserved Feature Bits"). This interrupt suppression is merely an optimization; it may not suppress interrupts entirely. -The “idx” field indicates where we would put the next descriptor +The “idx” field indicates where we would put the next descriptor entry (modulo the queue size). 
This starts at 0, and increases. struct vring_avail { @@ -324,29 +324,29 @@ entry (modulo the queue size). This starts at 0, and increases. u16 idx; u16 ring[ /* Queue Size */ ]; u16 used_event; /* Only if VIRTIO_RING_F_EVENT_IDX */ - }; + }; 2.1.4.5. The Virtqueue Used Ring ------------------------------- -The used ring is where the device returns buffers once it is done -with them. The flags field can be used by the device to hint that -no notification is necessary when the guest adds to the available -ring. Alternatively, the “avail_event” field can be used by the -device to hint that no notification is necessary until an entry -with an index specified by the “avail_event” is written in the -available ring (equivalently, until the idx field in the -available ring will reach the value avail_event + 1). The method -employed by the device is controlled by the guest through the +The used ring is where the device returns buffers once it is done +with them. The flags field can be used by the device to hint that +no notification is necessary when the guest adds to the available +ring. Alternatively, the “avail_event” field can be used by the +device to hint that no notification is necessary until an entry +with an index specified by the “avail_event” is written in the +available ring (equivalently, until the idx field in the +available ring will reach the value avail_event + 1). The method +employed by the device is controlled by the guest through the VIRTIO_RING_F_EVENT_IDX feature bit (see "2.6. Reserved Feature Bits").[7] -Each entry in the ring is a pair: the head entry of the -descriptor chain describing the buffer (this matches an entry -placed in the available ring by the guest earlier), and the total -of bytes written into the buffer. 
The latter is extremely useful -for guests using untrusted buffers: if you do not know exactly -how much has been written by the device, you usually have to zero +Each entry in the ring is a pair: the head entry of the +descriptor chain describing the buffer (this matches an entry +placed in the available ring by the guest earlier), and the total +of bytes written into the buffer. The latter is extremely useful +for guests using untrusted buffers: if you do not know exactly +how much has been written by the device, you usually have to zero the buffer to ensure no data leakage occurs. /* u32 is used here for ids for padding reasons. */ @@ -358,7 +358,7 @@ the buffer to ensure no data leakage occurs. }; struct vring_used { - #define VRING_USED_F_NO_NOTIFY 1 + #define VRING_USED_F_NO_NOTIFY 1 u16 flags; u16 idx; struct vring_used_elem ring[ /* Queue Size */]; @@ -368,11 +368,11 @@ the buffer to ensure no data leakage occurs. 2.1.4.6. Helpers for Operating Virtqueues ---------------------------------------- -The Linux Kernel Source code contains the definitions above and -helper routines in a more usable form, in -include/linux/virtio_ring.h. This was explicitly licensed by IBM -and Red Hat under the (3-clause) BSD license so that it can be -freely used by all other projects, and is reproduced (with slight +The Linux Kernel Source code contains the definitions above and +helper routines in a more usable form, in +include/linux/virtio_ring.h. This was explicitly licensed by IBM +and Red Hat under the (3-clause) BSD license so that it can be +freely used by all other projects, and is reproduced (with slight variation to remove Linux assumptions) in *XREF*. 2.2. General Initialization And Device Operation @@ -392,73 +392,73 @@ how to communicate with the specific device. 3. The DRIVER status bit is set: we know how to drive the device. -4. Device-specific setup, including reading the device feature +4. 
Device-specific setup, including reading the device feature bits, discovery of virtqueues for the device, optional per-bus - setup, and reading and possibly writing the device's virtio + setup, and reading and possibly writing the device's virtio configuration space. -5. The subset of device feature bits understood by the driver is +5. The subset of device feature bits understood by the driver is written to the device. 6. The DRIVER_OK status bit is set. -7. The device can now be used (ie. buffers added to the +7. The device can now be used (ie. buffers added to the virtqueues)[4] -If any of these steps go irrecoverably wrong, the guest should -set the FAILED status bit to indicate that it has given up on the +If any of these steps go irrecoverably wrong, the guest should +set the FAILED status bit to indicate that it has given up on the device (it can reset the device later to restart if desired). 2.2.2. Device Operation ---------------------- -There are two parts to device operation: supplying new buffers to -the device, and processing used buffers from the device. As an -example, the simplest virtio network device has two virtqueues: the -transmit virtqueue and the receive virtqueue. The driver adds -outgoing (read-only) packets to the transmit virtqueue, and then -frees them after they are used. Similarly, incoming (write-only) -buffers are added to the receive virtqueue, and processed after +There are two parts to device operation: supplying new buffers to +the device, and processing used buffers from the device. As an +example, the simplest virtio network device has two virtqueues: the +transmit virtqueue and the receive virtqueue. The driver adds +outgoing (read-only) packets to the transmit virtqueue, and then +frees them after they are used. Similarly, incoming (write-only) +buffers are added to the receive virtqueue, and processed after they are used. 2.2.2.1. 
Supplying Buffers to The Device --------------------------------------- -Actual transfer of buffers from the guest OS to the device +Actual transfer of buffers from the guest OS to the device operates as follows: 1. Place the buffer(s) into free descriptor(s). - (a) If there are no free descriptors, the guest may choose to - notify the device even if notifications are suppressed (to + (a) If there are no free descriptors, the guest may choose to + notify the device even if notifications are suppressed (to reduce latency).[8] -2. Place the id of the buffer in the next ring entry of the +2. Place the id of the buffer in the next ring entry of the available ring. -3. The steps (1) and (2) may be performed repeatedly if batching +3. The steps (1) and (2) may be performed repeatedly if batching is possible. -4. A memory barrier should be executed to ensure the device sees - the updated descriptor table and available ring before the next +4. A memory barrier should be executed to ensure the device sees + the updated descriptor table and available ring before the next step. -5. The available “idx” field should be increased by the number of +5. The available “idx” field should be increased by the number of entries added to the available ring. -6. A memory barrier should be executed to ensure that we update +6. A memory barrier should be executed to ensure that we update the idx field before checking for notification suppression. -7. If notifications are not suppressed, the device should be +7. If notifications are not suppressed, the device should be notified of the new buffers. 
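Steps 2 through 7 above can be sketched roughly as follows. This is only an illustration under stated assumptions: the queue size is fixed at 8, VIRTIO_RING_F_EVENT_IDX is not negotiated, C11 fences stand in for whatever memory barriers the platform actually requires, and the names `avail_ring`, `used_ring_hdr` and `publish_heads` are invented for the example. The caller is expected to have already filled the descriptors (step 1).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QSZ 8u                        /* assumption: small power-of-2 queue size */
#define VRING_USED_F_NO_NOTIFY 1u

struct avail_ring    { uint16_t flags, idx, ring[QSZ]; };
struct used_ring_hdr { uint16_t flags, idx; };  /* only the fields needed here */

/*
 * Hypothetical helper: expose n already-filled descriptor chain heads to
 * the device. Returns true if the device should be notified.
 */
static bool publish_heads(struct avail_ring *avail,
                          const struct used_ring_hdr *used,
                          const uint16_t *heads, uint16_t n)
{
    assert(n <= QSZ);

    /* Steps 2 and 3: place each head in the next available ring entry. */
    for (uint16_t i = 0; i < n; i++)
        avail->ring[(uint16_t)(avail->idx + i) % QSZ] = heads[i];

    /* Step 4: descriptors and ring entries must be visible before idx. */
    atomic_thread_fence(memory_order_release);

    /* Step 5: free-running index, wraps naturally at 65536. */
    avail->idx = (uint16_t)(avail->idx + n);

    /* Step 6: order the idx update before reading the suppression flag. */
    atomic_thread_fence(memory_order_seq_cst);

    /* Step 7: notify unless the device has suppressed notifications. */
    return !(used->flags & VRING_USED_F_NO_NOTIFY);
}
```

The return value leaves the bus-specific notification to the caller, since how the device is actually notified depends on the transport.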
-Note that the above code does not take precautions against the -available ring buffer wrapping around: this is not possible since -the ring buffer is the same size as the descriptor table, so step +Note that the above code does not take precautions against the +available ring buffer wrapping around: this is not possible since +the ring buffer is the same size as the descriptor table, so step (1) will prevent such a condition. -In addition, the maximum queue size is 32768 (it must be a power -of 2 which fits in 16 bits), so the 16-bit “idx” value can always +In addition, the maximum queue size is 32768 (it must be a power +of 2 which fits in 16 bits), so the 16-bit “idx” value can always distinguish between a full and empty buffer. Here is a description of each stage in more detail. @@ -466,9 +466,9 @@ Here is a description of each stage in more detail. 2.2.2.1.1. Placing Buffers Into The Descriptor Table --------------------------------------------------- -A buffer consists of zero or more read-only physically-contiguous -elements followed by zero or more physically-contiguous -write-only elements (it must have at least one element). This +A buffer consists of zero or more read-only physically-contiguous +elements followed by zero or more physically-contiguous +write-only elements (it must have at least one element). This algorithm maps it into the descriptor table: for each buffer element, b: @@ -479,30 +479,30 @@ for each buffer element, b: (c) Set d.len to the length of b. - (d) If b is write-only, set d.flags to VRING_DESC_F_WRITE, + (d) If b is write-only, set d.flags to VRING_DESC_F_WRITE, otherwise 0. (e) If there is a buffer element after this: - i. Set d.next to the index of the next free descriptor + i. Set d.next to the index of the next free descriptor element. ii. Set the VRING_DESC_F_NEXT bit in d.flags. 
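The per-element mapping algorithm above might look like this in C. It is a sketch, not the reference implementation: `buf_elem`, `map_buffer` and the caller-supplied list of free descriptor indices are hypothetical, and the steps for picking a free descriptor and setting its address are filled in on the assumption that they precede step (c).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VRING_DESC_F_NEXT  1u
#define VRING_DESC_F_WRITE 2u

struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Hypothetical description of one physically-contiguous buffer element. */
struct buf_elem {
    uint64_t phys_addr;
    uint32_t len;
    int      device_writes;   /* non-zero for a write-only element */
};

/*
 * Map n elements into the descriptors listed in free_idx[] (indices of
 * free descriptors, supplied by the caller). Returns the head index.
 */
static uint16_t map_buffer(struct vring_desc *table, const uint16_t *free_idx,
                           const struct buf_elem *elems, size_t n)
{
    assert(n >= 1);                       /* a buffer has at least one element */
    for (size_t i = 0; i < n; i++) {
        struct vring_desc *d = &table[free_idx[i]];   /* assumed: take a free descriptor */
        d->addr  = elems[i].phys_addr;                /* assumed: physical address of b */
        d->len   = elems[i].len;                                      /* (c) */
        d->flags = elems[i].device_writes ? VRING_DESC_F_WRITE : 0;   /* (d) */
        if (i + 1 < n) {                                              /* (e) */
            d->next   = free_idx[i + 1];                              /* (e) i.  */
            d->flags |= VRING_DESC_F_NEXT;                            /* (e) ii. */
        }
    }
    return free_idx[0];
}
```

The returned head index is what the driver later writes into the available ring.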
-In practice, the d.next fields are usually used to chain free -descriptors, and a separate count kept to check there are enough +In practice, the d.next fields are usually used to chain free +descriptors, and a separate count kept to check there are enough free descriptors before beginning the mappings. 2.2.2.1.2. Updating The Available Ring ------------------------------------- -The head of the buffer we mapped is the first d in the algorithm +The head of the buffer we mapped is the first d in the algorithm above. A naive implementation would do the following: avail->ring[avail->idx % qsz] = head; -However, in general we can add many descriptor chains before we update -the “idx” field (at which point they become visible to the +However, in general we can add many descriptor chains before we update +the “idx” field (at which point they become visible to the device), so we keep a counter of how many we've added: avail->ring[(avail->idx + added++) % qsz] = head; @@ -510,13 +510,13 @@ device), so we keep a counter of how many we've added: 2.2.2.1.3. Updating The Index Field ---------------------------------- -Once the index field of the virtqueue is updated, the device will -be able to access the descriptor chains we've created and the -memory they refer to. This is why a memory barrier is generally -used before the index update, to ensure it sees the most up-to-date +Once the index field of the virtqueue is updated, the device will +be able to access the descriptor chains we've created and the +memory they refer to. This is why a memory barrier is generally +used before the index update, to ensure it sees the most up-to-date copy. 
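A small worked example may help show why the free-running index and the ring slot computation stay consistent when the 16-bit index wraps. The helper name `avail_slot` is invented for this sketch, which assumes (as the specification requires) that the queue size is a power of 2, so that 65536 is a multiple of it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: ring slot for the head that is added after `added`
 * earlier heads, counted from the last published value of avail->idx.
 */
static uint16_t avail_slot(uint16_t idx, uint16_t added, uint16_t qsz)
{
    assert(qsz != 0 && (qsz & (qsz - 1)) == 0);  /* queue size is a power of 2 */
    /* Truncating to 16 bits first does not change the result modulo qsz. */
    return (uint16_t)((uint16_t)(idx + added) % qsz);
}
```

For example, with a queue size of 8 and a published index of 65534, four new heads land in slots 6, 7, 0 and 1, and adding 4 to the index then yields 2 after the 16-bit wrap.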
-The index field always increments, and we let it wrap naturally at +The index field always increments, and we let it wrap naturally at 65536: avail->idx += added; @@ -525,21 +525,21 @@ The index field always increments, and we let it wrap naturally at ------------------------------ The actual method of device notification is bus-specific, but generally -it can be expensive. So the device can suppress such notifications if it +it can be expensive. So the device can suppress such notifications if it doesn't need them. We have to be careful to expose the new index -value before checking if notifications are suppressed: it's OK to notify -gratuitously, but not to omit a required notification. So again, -we use a memory barrier here before reading the flags or the +value before checking if notifications are suppressed: it's OK to notify +gratuitously, but not to omit a required notification. So again, +we use a memory barrier here before reading the flags or the avail_event field. If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated, and if the VRING_USED_F_NOTIFY flag is not set, we go ahead and notify the device. -If the VIRTIO_F_RING_EVENT_IDX feature is negotiated, we read the -avail_event field in the available ring structure. If the -available index crossed_the avail_event field value since the -last notification, we go ahead and write to the PCI configuration +If the VIRTIO_F_RING_EVENT_IDX feature is negotiated, we read the +avail_event field in the available ring structure. If the +available index crossed_the avail_event field value since the +last notification, we go ahead and write to the PCI configuration space. The avail_event field wraps naturally at 65536 as well, iving the following algorithm for calculating whether a device needs notification: @@ -549,46 +549,46 @@ notification: 2.2.2.2. 
Receiving Used Buffers From The Device ---------------------------------------------- -Once the device has used a buffer (read from or written to it, or -parts of both, depending on the nature of the virtqueue and the -device), it sends an interrupt, following an algorithm very -similar to the algorithm used for the driver to send the device a +Once the device has used a buffer (read from or written to it, or +parts of both, depending on the nature of the virtqueue and the +device), it sends an interrupt, following an algorithm very +similar to the algorithm used for the driver to send the device a buffer: -1. Write the head descriptor number to the next field in the used +1. Write the head descriptor number to the next field in the used ring. 2. Update the used ring index. 3. Deliver an interrupt if necessary: - (a) If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated: - check if the VRING_AVAIL_F_NO_INTERRUPT flag is not set in + (a) If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated: + check if the VRING_AVAIL_F_NO_INTERRUPT flag is not set in avail->flags. - (b) If the VIRTIO_F_RING_EVENT_IDX feature is negotiated: check - whether the used index crossed the used_event field value - since the last update. The used_event field wraps naturally + (b) If the VIRTIO_F_RING_EVENT_IDX feature is negotiated: check + whether the used index crossed the used_event field value + since the last update. The used_event field wraps naturally at 65536 as well: (u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx) -For each ring, guest should then disable interrupts by writing -VRING_AVAIL_F_NO_INTERRUPT flag in avail structure, if required. -It can then process used ring entries finally enabling interrupts -by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or updating the -EVENT_IDX field in the available structure. The guest should then -execute a memory barrier, and then recheck the ring empty -condition. 
This is necessary to handle the case where after the -last check and before enabling interrupts, an interrupt has been +For each ring, guest should then disable interrupts by writing +VRING_AVAIL_F_NO_INTERRUPT flag in avail structure, if required. +It can then process used ring entries finally enabling interrupts +by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or updating the +EVENT_IDX field in the available structure. The guest should then +execute a memory barrier, and then recheck the ring empty +condition. This is necessary to handle the case where after the +last check and before enabling interrupts, an interrupt has been suppressed by the device: vring_disable_interrupts(vq); - + for (;;) { if (vq->last_seen_used != vring->used.idx) { vring_enable_interrupts(vq); mb(); - + if (vq->last_seen_used != vring->used.idx) break; } @@ -624,16 +624,16 @@ Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through 0x103F inclusive is a virtio device[3]. The device must also have a Revision ID of 0 to match this specification. -The Subsystem Device ID indicates which virtio device is -supported by the device. The Subsystem Vendor ID should reflect -the PCI Vendor ID of the environment (it's currently only used +The Subsystem Device ID indicates which virtio device is +supported by the device. The Subsystem Vendor ID should reflect +the PCI Vendor ID of the environment (it's currently only used for informational purposes by the guest). 2.4.1.2. PCI Device Layout ------------------------- -To configure the device, we use the first I/O region of the PCI -device. This contains a virtio header followed by a +To configure the device, we use the first I/O region of the PCI +device. This contains a virtio header followed by a device-specific region. There may be different widths of accesses to the I/O region; the @@ -642,9 +642,9 @@ used (i.e. 
32-bit accesses for 32-bit fields, etc), but the device-specific region can be accessed using any width accesses, and should obtain the same results. -Note that this is possible because while the virtio header is PCI -(i.e. little) endian, the device-specific region is encoded in -the native endian of the guest (where such distinction is +Note that this is possible because while the virtio header is PCI +(i.e. little) endian, the device-specific region is encoded in +the native endian of the guest (where such distinction is applicable). 2.4.1.2.1. PCI Device Virtio Header @@ -662,7 +662,7 @@ The virtio header looks as follows: +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ -If MSI-X is enabled for the device, two additional fields +If MSI-X is enabled for the device, two additional fields immediately follow this header:[5] @@ -676,7 +676,7 @@ immediately follow this header:[5] | (MSI-X) || Vector | Vector | +------------++----------------+--------+ -Immediately following these general headers, there may be +Immediately following these general headers, there may be device-specific headers: +------------++--------------------+ @@ -701,70 +701,70 @@ The page size for a virtqueue on a PCI virtio device is defined as 2.4.1.3.1.1. Queue Vector Configuration -------------------------------------- -When MSI-X capability is present and enabled in the device -(through standard PCI configuration space) 4 bytes at byte offset -20 are used to map configuration change and queue interrupts to -MSI-X vectors. In this case, the ISR Status field is unused, and -device specific configuration starts at byte offset 24 in virtio -header structure. When MSI-X capability is not enabled, device +When MSI-X capability is present and enabled in the device +(through standard PCI configuration space) 4 bytes at byte offset +20 are used to map configuration change and queue interrupts to +MSI-X vectors. 
In this case, the ISR Status field is unused, and +device specific configuration starts at byte offset 24 in virtio +header structure. When MSI-X capability is not enabled, device specific configuration starts at byte offset 20 in virtio header. -Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of -Configuration/Queue Vector registers, maps interrupts triggered -by the configuration change/selected queue events respectively to -the corresponding MSI-X vector. To disable interrupts for a -specific event type, unmap it by writing a special NO_VECTOR +Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of +Configuration/Queue Vector registers, maps interrupts triggered +by the configuration change/selected queue events respectively to +the corresponding MSI-X vector. To disable interrupts for a +specific event type, unmap it by writing a special NO_VECTOR value: /* Vector value used to disable MSI for queue */ - #define VIRTIO_MSI_NO_VECTOR 0xffff + #define VIRTIO_MSI_NO_VECTOR 0xffff -Reading these registers returns vector mapped to a given event, -or NO_VECTOR if unmapped. All queue and configuration change +Reading these registers returns vector mapped to a given event, +or NO_VECTOR if unmapped. All queue and configuration change events are unmapped by default. -Note that mapping an event to vector might require allocating -internal device resources, and might fail. Devices report such -failures by returning the NO_VECTOR value when the relevant -Vector field is read. After mapping an event to vector, the -driver must verify success by reading the Vector field value: on -success, the previously written value is returned, and on -failure, NO_VECTOR is returned. If a mapping failure is detected, +Note that mapping an event to vector might require allocating +internal device resources, and might fail. Devices report such +failures by returning the NO_VECTOR value when the relevant +Vector field is read. 
After mapping an event to vector, the +driver must verify success by reading the Vector field value: on +success, the previously written value is returned, and on +failure, NO_VECTOR is returned. If a mapping failure is detected, the driver can retry mapping with fewervectors, or disable MSI-X. 2.4.1.3.1.2. Virtqueue Configuration ----------------------------------- -As a device can have zero or more virtqueues for bulk data -transport (for example, the simplest network device has two), the driver -needs to configure them as part of the device-specific +As a device can have zero or more virtqueues for bulk data +transport (for example, the simplest network device has two), the driver +needs to configure them as part of the device-specific configuration. This is done as follows, for each virtqueue a device has: -1. Write the virtqueue index (first queue is 0) to the Queue +1. Write the virtqueue index (first queue is 0) to the Queue Select field. -2. Read the virtqueue size from the Queue Size field, which is - always a power of 2. This controls how big the virtqueue is - (see "2.1.4. Virtqueues"). If this field is 0, the virtqueue does not exist. +2. Read the virtqueue size from the Queue Size field, which is + always a power of 2. This controls how big the virtqueue is + (see "2.1.4. Virtqueues"). If this field is 0, the virtqueue does not exist. -3. Allocate and zero virtqueue in contiguous physical memory, on - a 4096 byte alignment. Write the physical address, divided by +3. Allocate and zero virtqueue in contiguous physical memory, on + a 4096 byte alignment. Write the physical address, divided by 4096 to the Queue Address field.[6] -4. Optionally, if MSI-X capability is present and enabled on the - device, select a vector to use to request interrupts triggered - by virtqueue events. Write the MSI-X Table entry number - corresponding to this vector in Queue Vector field. Read the - Queue Vector field: on success, previously written value is +4. 
Optionally, if MSI-X capability is present and enabled on the + device, select a vector to use to request interrupts triggered + by virtqueue events. Write the MSI-X Table entry number + corresponding to this vector in Queue Vector field. Read the + Queue Vector field: on success, previously written value is returned; on failure, NO_VECTOR value is returned. 2.4.1.3.2. Notifying The Device ------------------------------ -Device notification occurs by writing the 16-bit virtqueue index -of this virtqueue to the Queue Notify field of the virtio header +Device notification occurs by writing the 16-bit virtqueue index +of this virtqueue to the Queue Notify field of the virtio header in the first I/O region of the PCI device. 2.4.1.3.3. Virtqueue Interrupts From The Device @@ -780,58 +780,58 @@ If an interrupt is necessary: (b) If MSI-X capability is enabled: - i. Request the appropriate MSI-X interrupt message for the - device, Queue Vector field sets the MSI-X Table entry + i. Request the appropriate MSI-X interrupt message for the + device, Queue Vector field sets the MSI-X Table entry number. - ii. If Queue Vector field value is NO_VECTOR, no interrupt + ii. If Queue Vector field value is NO_VECTOR, no interrupt message is requested for this event. The guest interrupt handler should: -1. If MSI-X capability is disabled: read the ISR Status field, - which will reset it to zero. If the lower bit is zero, the - interrupt was not for this device. Otherwise, the guest driver - should look through the used rings of each virtqueue for the - device, to see if any progress has been made by the device +1. If MSI-X capability is disabled: read the ISR Status field, + which will reset it to zero. If the lower bit is zero, the + interrupt was not for this device. Otherwise, the guest driver + should look through the used rings of each virtqueue for the + device, to see if any progress has been made by the device which requires servicing. -2. 
If MSI-X capability is enabled: look through the used rings of - each virtqueue mapped to the specific MSI-X vector for the - device, to see if any progress has been made by the device +2. If MSI-X capability is enabled: look through the used rings of + each virtqueue mapped to the specific MSI-X vector for the + device, to see if any progress has been made by the device which requires servicing. 2.4.1.3.4. Notification of Device Configuration Changes ------------------------------------------------------ -Some virtio PCI devices can change the device configuration -state, as reflected in the virtio header in the PCI configuration +Some virtio PCI devices can change the device configuration +state, as reflected in the virtio header in the PCI configuration space. In this case: -1. If MSI-X capability is disabled: an interrupt is delivered and - the second highest bit is set in the ISR Status field to - indicate that the driver should re-examine the configuration - space. Note that a single interrupt can indicate both that one - or more virtqueue has been used and that the configuration - space has changed: even if the config bit is set, virtqueues +1. If MSI-X capability is disabled: an interrupt is delivered and + the second highest bit is set in the ISR Status field to + indicate that the driver should re-examine the configuration + space. Note that a single interrupt can indicate both that one + or more virtqueue has been used and that the configuration + space has changed: even if the config bit is set, virtqueues must be scanned. -2. If MSI-X capability is enabled: an interrupt message is - requested. The Configuration Vector field sets the MSI-X Table - entry number to use. If Configuration Vector field value is +2. If MSI-X capability is enabled: an interrupt message is + requested. The Configuration Vector field sets the MSI-X Table + entry number to use. If Configuration Vector field value is NO_VECTOR, no interrupt message is requested for this event. 
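The vector mapping rule used throughout the MSI-X text above (write an MSI-X Table entry number to a Configuration Vector or Queue Vector field, read the field back, and treat NO_VECTOR as failure) can be sketched in C as follows. This is only an illustration under stated assumptions: struct fake_vector_reg, vector_reg_write(), vector_reg_read() and map_event_to_vector() are hypothetical names that model the register in memory, and the max_vectors limit is an invented stand-in for whatever internal resource a device might run out of. A real driver would access the virtio header of the PCI device instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Vector value used to disable MSI for queue (taken from the text above). */
#define VIRTIO_MSI_NO_VECTOR 0xffff

/* Highest valid MSI-X Table entry number named in the text above. */
#define MSIX_MAX_ENTRY 0x7FF

/* Hypothetical in-memory model of a Configuration/Queue Vector field. */
struct fake_vector_reg {
    uint16_t value;       /* what a read of the field returns */
    uint16_t max_vectors; /* assumed device limit: entries >= this fail */
};

/* Model of a device accepting or rejecting a mapping request. */
static void vector_reg_write(struct fake_vector_reg *r, uint16_t v)
{
    if (v == VIRTIO_MSI_NO_VECTOR || v < r->max_vectors)
        r->value = v;                    /* mapped, or explicitly unmapped */
    else
        r->value = VIRTIO_MSI_NO_VECTOR; /* mapping failed */
}

static uint16_t vector_reg_read(const struct fake_vector_reg *r)
{
    return r->value;
}

/*
 * Map an event to `vector` and verify the result by reading the field
 * back: success returns the value just written, failure returns
 * NO_VECTOR and leaves the event unmapped.
 */
static bool map_event_to_vector(struct fake_vector_reg *r, uint16_t vector)
{
    assert(vector == VIRTIO_MSI_NO_VECTOR || vector <= MSIX_MAX_ENTRY);
    vector_reg_write(r, vector);
    return vector_reg_read(r) == vector;
}
```

A caller that sees map_event_to_vector() return false has the choices listed in the text: retry with fewer vectors, or disable MSI-X.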
2.4.2. Virtio Over MMIO ---------------------- -Virtual environments without PCI support (a common situation in +Virtual environments without PCI support (a common situation in embedded device models) might use a simple memory mapped device (“virtio-mmio”) instead of the PCI device. -The memory mapped virtio device behaviour is based on the PCI -device specification. Therefore most operations like device -initialization, queue configuration and buffer transfers are -nearly identical. Existing differences are described in the +The memory mapped virtio device behaviour is based on the PCI +device specification. Therefore most operations like device +initialization, queue configuration and buffer transfers are +nearly identical. Existing differences are described in the following sections. 2.4.2.1. MMIO Device Discovery @@ -849,154 +849,154 @@ a device-tree such as Linux's dtc or Open Firmware, the suggested format is: 2.4.2.2. MMIO Device Layout -------------------------- -MMIO virtio devices provide a set of memory mapped control -registers, all 32 bits wide, followed by device-specific +MMIO virtio devices provide a set of memory mapped control +registers, all 32 bits wide, followed by device-specific configuration space. The following list presents their layout: -• Offset from the device base address | Direction | Name - Description +• Offset from the device base address | Direction | Name + Description -• 0x000 | R | MagicValue - “virt” string. +• 0x000 | R | MagicValue + “virt” string. -• 0x004 | R | Version - Device version number. Currently must be 1. +• 0x004 | R | Version + Device version number. Currently must be 1. -• 0x008 | R | DeviceID - Virtio Subsystem Device ID (e.g. 1 for network card). +• 0x008 | R | DeviceID + Virtio Subsystem Device ID (e.g. 1 for network card). -• 0x00c | R | VendorID + Virtio Subsystem Vendor ID.
-• 0x010 | R | HostFeatures +• 0x010 | R | HostFeatures Flags representing features the device supports. - Reading from this register returns 32 consecutive flag bits, - first bit depending on the last value written to + Reading from this register returns 32 consecutive flag bits, + first bit depending on the last value written to HostFeaturesSel register. Access to this register returns bits HostFeaturesSel*32 - to (HostFeaturesSel*32)+31, eg. feature bits 0 to 31 if - HostFeaturesSel is set to 0 and features bits 32 to 63 if + to (HostFeaturesSel*32)+31, eg. feature bits 0 to 31 if + HostFeaturesSel is set to 0 and features bits 32 to 63 if HostFeaturesSel is set to 1. Also see [sub:Feature-Bits] -• 0x014 | W | HostFeaturesSel +• 0x014 | W | HostFeaturesSel Device (Host) features word selection. - Writing to this register selects a set of 32 device feature bits - accessible by reading from HostFeatures register. Device driver - must write a value to the HostFeaturesSel register before - reading from the HostFeatures register. + Writing to this register selects a set of 32 device feature bits + accessible by reading from HostFeatures register. Device driver + must write a value to the HostFeaturesSel register before + reading from the HostFeatures register. -• 0x020 | W | GuestFeatures - Flags representing device features understood and activated by +• 0x020 | W | GuestFeatures + Flags representing device features understood and activated by the driver. - Writing to this register sets 32 consecutive flag bits, first - bit depending on the last value written to GuestFeaturesSel + Writing to this register sets 32 consecutive flag bits, first + bit depending on the last value written to GuestFeaturesSel register. Access to this register sets bits GuestFeaturesSel*32 - to (GuestFeaturesSel*32)+31, eg. feature bits 0 to 31 if - GuestFeaturesSel is set to 0 and features bits 32 to 63 if + to (GuestFeaturesSel*32)+31, eg. 
feature bits 0 to 31 if + GuestFeaturesSel is set to 0 and features bits 32 to 63 if GuestFeaturesSel is set to 1. Also see [sub:Feature-Bits] -• 0x024 | W | GuestFeaturesSel +• 0x024 | W | GuestFeaturesSel Activated (Guest) features word selection. - Writing to this register selects a set of 32 activated feature - bits accessible by writing to the GuestFeatures register. - Device driver must write a value to the GuestFeaturesSel - register before writing to the GuestFeatures register. + Writing to this register selects a set of 32 activated feature + bits accessible by writing to the GuestFeatures register. + Device driver must write a value to the GuestFeaturesSel + register before writing to the GuestFeatures register. -• 0x028 | W | GuestPageSize +• 0x028 | W | GuestPageSize Guest page size. - Device driver must write the guest page size in bytes to the - register during initialization, before any queues are used. - This value must be a power of 2 and is used by the Host to - calculate Guest address of the first queue page (see QueuePFN). + Device driver must write the guest page size in bytes to the + register during initialization, before any queues are used. + This value must be a power of 2 and is used by the Host to + calculate Guest address of the first queue page (see QueuePFN). -• 0x030 | W | QueueSel +• 0x030 | W | QueueSel Virtual queue index (first queue is 0). - Writing to this register selects the virtual queue that the - following operations on QueueNum, QueueAlign and QueuePFN apply - to. - -• 0x034 | R | QueueNumMax - Maximum virtual queue size. - Reading from the register returns the maximum size of the queue - the Host is ready to process or zero (0x0) if the queue is not - available. This applies to the queue selected by writing to - QueueSel and is allowed only when QueuePFN is set to zero - (0x0), so when the queue is not actively used. 
- -• 0x038 | W | QueueNum + Writing to this register selects the virtual queue that the + following operations on QueueNum, QueueAlign and QueuePFN apply + to. + +• 0x034 | R | QueueNumMax + Maximum virtual queue size. + Reading from the register returns the maximum size of the queue + the Host is ready to process or zero (0x0) if the queue is not + available. This applies to the queue selected by writing to + QueueSel and is allowed only when QueuePFN is set to zero + (0x0), so when the queue is not actively used. + +• 0x038 | W | QueueNum Virtual queue size. - Queue size is the number of elements in the queue, therefore size + Queue size is the number of elements in the queue, therefore size of the descriptor table and both available and used rings. - Writing to this register notifies the Host what size of the - queue the Guest will use. This applies to the queue selected by - writing to QueueSel. + Writing to this register notifies the Host what size of the + queue the Guest will use. This applies to the queue selected by + writing to QueueSel. -• 0x03c | W | QueueAlign +• 0x03c | W | QueueAlign Used Ring alignment in the virtual queue. - Writing to this register notifies the Host about alignment - boundary of the Used Ring in bytes. This value must be a power - of 2 and applies to the queue selected by writing to QueueSel. + Writing to this register notifies the Host about alignment + boundary of the Used Ring in bytes. This value must be a power + of 2 and applies to the queue selected by writing to QueueSel. -• 0x040 | RW | QueuePFN +• 0x040 | RW | QueuePFN Guest physical page number of the virtual queue. - Writing to this register notifies the host about location of the - virtual queue in the Guest's physical address space. This value - is the index number of a page starting with the queue - Descriptor Table. Value zero (0x0) means physical address zero - (0x00000000) and is illegal. 
When the Guest stops using the + Writing to this register notifies the host about location of the + virtual queue in the Guest's physical address space. This value + is the index number of a page starting with the queue + Descriptor Table. Value zero (0x0) means physical address zero + (0x00000000) and is illegal. When the Guest stops using the queue it must write zero (0x0) to this register. - Reading from this register returns the currently used page - number of the queue, therefore a value other than zero (0x0) + Reading from this register returns the currently used page + number of the queue, therefore a value other than zero (0x0) means that the queue is in use. - Both read and write accesses apply to the queue selected by - writing to QueueSel. + Both read and write accesses apply to the queue selected by + writing to QueueSel. -• 0x050 | W | QueueNotify +• 0x050 | W | QueueNotify Queue notifier. - Writing a queue index to this register notifies the Host that - there are new buffers to process in the queue. + Writing a queue index to this register notifies the Host that + there are new buffers to process in the queue. • 0x60 | R | InterruptStatus Interrupt status. -Reading from this register returns a bit mask of interrupts - asserted by the device. An interrupt is asserted if the +Reading from this register returns a bit mask of interrupts + asserted by the device. An interrupt is asserted if the corresponding bit is set, ie. equals one (1). – Bit 0 | Used Ring Update - This interrupt is asserted when the Host has updated the Used + This interrupt is asserted when the Host has updated the Used Ring in at least one of the active virtual queues. – Bit 1 | Configuration change - This interrupt is asserted when configuration of the device has + This interrupt is asserted when configuration of the device has changed. -• 0x064 | W | InterruptACK - Interrupt acknowledge. - Writing to this register notifies the Host that the Guest - finished handling interrupts. 
Set bits in the value clear the - corresponding bits of the InterruptStatus register. - -• 0x070 | RW | Status - Device status. - Reading from this register returns the current device status - flags. - Writing non-zero values to this register sets the status flags, - indicating the Guest progress. Writing zero (0x0) to this - register triggers a device reset. +• 0x064 | W | InterruptACK + Interrupt acknowledge. + Writing to this register notifies the Host that the Guest + finished handling interrupts. Set bits in the value clear the + corresponding bits of the InterruptStatus register. + +• 0x070 | RW | Status + Device status. + Reading from this register returns the current device status + flags. + Writing non-zero values to this register sets the status flags, + indicating the Guest progress. Writing zero (0x0) to this + register triggers a device reset. Also see [sub:Device-Initialization-Sequence] -• 0x100+ | RW | Config - Device-specific configuration space starts at an offset 0x100 - and is accessed with byte alignment. Its meaning and size - depends on the device and the driver. +• 0x100+ | RW | Config + Device-specific configuration space starts at an offset 0x100 + and is accessed with byte alignment. Its meaning and size + depends on the device and the driver. -Virtual queue size is the number of elements in the queue, -therefore size of the descriptor table and both available and +Virtual queue size is the number of elements in the queue, +therefore size of the descriptor table and both available and used rings. -The endianness of the registers follows the native endianness of -the Guest. Writing to registers described as “R” and reading from -registers described as “W” is not permitted and can cause +The endianness of the registers follows the native endianness of +the Guest. Writing to registers described as “R” and reading from +registers described as “W” is not permitted and can cause undefined behavior. 2.4.2.3. 
MMIO-specific Initialization And Device Operation @@ -1012,29 +1012,29 @@ done before the virtqueues are configured. 2.4.2.3.1.1. Virtqueue Configuration ----------------------------------- -1. Select the queue writing its index (first queue is 0) to the - QueueSel register. +1. Select the queue writing its index (first queue is 0) to the + QueueSel register. -2. Check if the queue is not already in use: read QueuePFN - register, returned value should be zero (0x0). +2. Check if the queue is not already in use: read QueuePFN + register, returned value should be zero (0x0). -3. Read maximum queue size (number of elements) from the - QueueNumMax register. If the returned value is zero (0x0) the - queue is not available. +3. Read maximum queue size (number of elements) from the + QueueNumMax register. If the returned value is zero (0x0) the + queue is not available. -4. Allocate and zero the queue pages in contiguous virtual - memory, aligning the Used Ring to an optimal boundary (usually - page size). Size of the allocated queue may be smaller than or - equal to the maximum size returned by the Host. +4. Allocate and zero the queue pages in contiguous virtual + memory, aligning the Used Ring to an optimal boundary (usually + page size). Size of the allocated queue may be smaller than or + equal to the maximum size returned by the Host. -5. Notify the Host about the queue size by writing the size to - QueueNum register. +5. Notify the Host about the queue size by writing the size to + QueueNum register. -6. Notify the Host about the used alignment by writing its value - in bytes to QueueAlign register. +6. Notify the Host about the used alignment by writing its value + in bytes to QueueAlign register. -7. Write the physical number of the first page of the queue to - the QueuePFN register. +7. Write the physical number of the first page of the queue to + the QueuePFN register. 2.4.2.3.2. 
Notifying The Device ------------------------------ @@ -1045,14 +1045,14 @@ writing the queue index to register QueueNotify. 2.4.2.3.3. Receiving Used Buffers From The Device ------------------------------------------------ -The memory mapped virtio device uses a single, dedicated -interrupt signal, which is raised when at least one of the -interrupts described in the InterruptStatus register -description is asserted. After receiving an interrupt, the -driver must read the InterruptStatus register to check what -caused the interrupt (see the register description). After the -interrupt is handled, the driver must acknowledge it by writing -a bit mask corresponding to the serviced interrupt to the +The memory mapped virtio device uses a single, dedicated +interrupt signal, which is raised when at least one of the +interrupts described in the InterruptStatus register +description is asserted. After receiving an interrupt, the +driver must read the InterruptStatus register to check what +caused the interrupt (see the register description). After the +interrupt is handled, the driver must acknowledge it by writing +a bit mask corresponding to the serviced interrupt to the InterruptACK register. 2.4.2.4.4. Notification of Device Configuration Changes @@ -1105,13 +1105,13 @@ Discovering what devices are available and their type is bus-dependent. 2.5.1. Network Device ==================== -The virtio network device is a virtual ethernet card, and is the -most complex of the devices supported so far by virtio. It has -enhanced rapidly and demonstrates clearly how support for new -features should be added to an existing device. Empty buffers are -placed in one virtqueue for receiving packets, and outgoing -packets are enqueued into another for transmission in that order. -A third command queue is used to control advanced filtering +The virtio network device is a virtual ethernet card, and is the +most complex of the devices supported so far by virtio.
It has +enhanced rapidly and demonstrates clearly how support for new +features should be added to an existing device. Empty buffers are +placed in one virtqueue for receiving packets, and outgoing +packets are enqueued into another for transmission in that order. +A third command queue is used to control advanced filtering features. 2.5.1.1. Device ID @@ -1126,7 +1126,7 @@ features. Virtqueue 2 only exists if VIRTIO_NET_F_CTRL_VQ set. -2.5.1.3. Feature bits +2.5.1.3. Feature bits -------------------- VIRTIO_NET_F_CSUM (0) Device handles packets with partial checksum @@ -1138,8 +1138,8 @@ features. VIRTIO_NET_F_MAC (5) Device has given MAC address. - VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets with - any GSO type.[13] + VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets with + any GSO type.[13] VIRTIO_NET_F_GUEST_TSO4 (7) Guest can receive TSOv4. @@ -1159,7 +1159,7 @@ features. VIRTIO_NET_F_MRG_RXBUF (15) Guest can merge receive buffers. - VIRTIO_NET_F_STATUS (16) Configuration status field is + VIRTIO_NET_F_STATUS (16) Configuration status field is available. VIRTIO_NET_F_CTRL_VQ (17) Control channel is available. @@ -1168,14 +1168,14 @@ features. VIRTIO_NET_F_CTRL_VLAN (19) Control channel VLAN filtering. - VIRTIO_NET_F_GUEST_ANNOUNCE(21) Guest can send gratuitous + VIRTIO_NET_F_GUEST_ANNOUNCE(21) Guest can send gratuitous packets. - Device configuration layout Two configuration fields are - currently defined. The mac address field always exists (though - is only valid if VIRTIO_NET_F_MAC is set), and the status field - only exists if VIRTIO_NET_F_STATUS is set. Two read-only bits - are currently defined for the status field: + Device configuration layout Two configuration fields are + currently defined. The mac address field always exists (though + is only valid if VIRTIO_NET_F_MAC is set), and the status field + only exists if VIRTIO_NET_F_STATUS is set. 
Two read-only bits + are currently defined for the status field: VIRTIO_NET_S_LINK_UP and VIRTIO_NET_S_ANNOUNCE. #define VIRTIO_NET_S_LINK_UP 1 @@ -1189,27 +1189,27 @@ features. 2.5.1.4. Device Initialization ----------------------------- -1. The initialization routine should identify the receive and +1. The initialization routine should identify the receive and transmission virtqueues. -2. If the VIRTIO_NET_F_MAC feature bit is set, the configuration - space “mac” entry indicates the “physical” address of the - network card, otherwise a private MAC address should be - assigned. All guests are expected to negotiate this feature if +2. If the VIRTIO_NET_F_MAC feature bit is set, the configuration + space “mac” entry indicates the “physical” address of the + network card, otherwise a private MAC address should be + assigned. All guests are expected to negotiate this feature if it is set. -3. If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, +3. If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, identify the control virtqueue. -4. If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link - status can be read from the bottom bit of the “status” config +4. If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link + status can be read from the bottom bit of the “status” config field. Otherwise, the link should be assumed active. -5. The receive virtqueue should be filled with receive buffers. - This is described in detail below in “Setting Up Receive +5. The receive virtqueue should be filled with receive buffers. + This is described in detail below in “Setting Up Receive Buffers”. -6. A driver can indicate that it will generate checksumless +6. A driver can indicate that it will generate checksumless packets by negotiating the VIRTIO_NET_F_CSUM feature. This “checksum offload” is a common feature on modern network cards. @@ -1221,20 +1221,20 @@ features. Notification bit set, unless the VIRTIO_NET_F_HOST_ECN feature is negotiated.[15] -8.
The converse features are also available: a driver can save +8. The converse features are also available: a driver can save the virtual device some work by negotiating these features.[16] - The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially - checksummed packets can be received, and if it can do that then - the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, - VIRTIO_NET_F_GUEST_UFO and VIRTIO_NET_F_GUEST_ECN are the input - equivalents of the features described above. See “Receiving + The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially + checksummed packets can be received, and if it can do that then + the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, + VIRTIO_NET_F_GUEST_UFO and VIRTIO_NET_F_GUEST_ECN are the input + equivalents of the features described above. See “Receiving Packets” below. 2.5.1.5. Device Operation ------------------------ -Packets are transmitted by placing them in the transmitq, and -buffers for incoming packets are placed in the receiveq. In each +Packets are transmitted by placing them in the transmitq, and +buffers for incoming packets are placed in the receiveq. In each case, the packet itself is preceded by a header: struct virtio_net_hdr { @@ -1254,18 +1254,18 @@ case, the packet itself is preceded by a header: u16 num_buffers; }; -The controlq is used to control device features such as +The controlq is used to control device features such as filtering. 2.5.1.5.1. Packet Transmission ----------------------------- -Transmitting a single packet is simple, but varies depending on +Transmitting a single packet is simple, but varies depending on the different features the driver negotiated. -1. If the driver negotiated VIRTIO_NET_F_CSUM, and the packet has - not been fully checksummed, then the virtio_net_hdr's fields - are set as follows. Otherwise, the packet must be fully +1.
If the driver negotiated VIRTIO_NET_F_CSUM, and the packet has + not been fully checksummed, then the virtio_net_hdr's fields + are set as follows. Otherwise, the packet must be fully checksummed, and flags is zero. • flags has the VIRTIO_NET_HDR_F_NEEDS_CSUM set, @@ -1273,30 +1273,30 @@ the different features the driver negotiated. • csum_start is set to the offset within the packet to begin checksumming, and - • csum_offset indicates how many bytes after the csum_start the + • csum_offset indicates how many bytes after the csum_start the new (16 bit ones' complement) checksum should be placed.[17] -2. If the driver negotiated - VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires - TCP segmentation or UDP fragmentation, then the “gso_type” - field is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6 or UDP. - (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). In this - case, packets larger than 1514 bytes can be transmitted: the - metadata indicates how to replicate the packet header to cut it +2. If the driver negotiated + VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires + TCP segmentation or UDP fragmentation, then the “gso_type” + field is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6 or UDP. + (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). In this + case, packets larger than 1514 bytes can be transmitted: the + metadata indicates how to replicate the packet header to cut it into smaller packets. The other gso fields are set: - • hdr_len is a hint to the device as to how much of the header - needs to be kept to copy into each packet, usually set to the + • hdr_len is a hint to the device as to how much of the header + needs to be kept to copy into each packet, usually set to the length of the headers, including the transport header.[18] - • gso_size is the maximum size of each packet beyond that + • gso_size is the maximum size of each packet beyond that header (ie. MSS). 
- • If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, - the VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as + • If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, + the VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as well, indicating that the TCP packet has the ECN bit set.[19] -3. If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature, +3. If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature, the num_buffers field is set to zero. 4. The header and packet are added as one output buffer to the @@ -1309,71 +1309,71 @@ the different features the driver negotiated. Often a driver will suppress transmission interrupts using the VRING_AVAIL_F_NO_INTERRUPT flag (see "2.4.2. Receiving Used Buffers From The Device") and check for used packets in the transmit path of following -packets. However, it will still receive interrupts if the -VIRTIO_F_NOTIFY_ON_EMPTY feature is negotiated, indicating that +packets. However, it will still receive interrupts if the +VIRTIO_F_NOTIFY_ON_EMPTY feature is negotiated, indicating that the transmission queue is completely emptied. -The normal behavior in this interrupt handler is to retrieve any -new descriptors from the used ring and free the corresponding +The normal behavior in this interrupt handler is to retrieve any +new descriptors from the used ring and free the corresponding headers and packets. 2.5.1.5.2. Setting Up Receive Buffers -It is generally a good idea to keep the receive virtqueue as -fully populated as possible: if it runs out, network performance +It is generally a good idea to keep the receive virtqueue as +fully populated as possible: if it runs out, network performance will suffer. -If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or -VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to -accept packets of up to 65550 bytes long (the maximum size of a -TCP or UDP packet, plus the 14 byte ethernet header), otherwise -1514 bytes. +If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or +VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to +accept packets of up to 65550 bytes long (the maximum size of a +TCP or UDP packet, plus the 14 byte ethernet header), otherwise +1514 bytes.
So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every +If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or +VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to +accept packets of up to 65550 bytes long (the maximum size of a +TCP or UDP packet, plus the 14 byte ethernet header), otherwise +1514. bytes. So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every buffer in the receive queue needs to be at least this length [20a] -If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at +If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at least the size of the struct virtio_net_hdr. 2.5.1.5.2.1. Packet Receive Interrupt ------------------------------------ -When a packet is copied into a buffer in the receiveq, the -optimal path is to disable further interrupts for the receiveq -(see [sub:Receiving-Used-Buffers]) and process packets until no +When a packet is copied into a buffer in the receiveq, the +optimal path is to disable further interrupts for the receiveq +(see [sub:Receiving-Used-Buffers]) and process packets until no more are found, then re-enable them. Processing packet involves: -1. If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature, - then the “num_buffers” field indicates how many descriptors - this packet is spread over (including this one). This allows - receipt of large packets without having to allocate large - buffers. In this case, there will be at least “num_buffers” in - the used ring, and they should be chained together to form a - single packet. The other buffers will not begin with a struct +1. If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature, + then the “num_buffers” field indicates how many descriptors + this packet is spread over (including this one). This allows + receipt of large packets without having to allocate large + buffers. In this case, there will be at least “num_buffers” in + the used ring, and they should be chained together to form a + single packet. 
The other buffers will not begin with a struct virtio_net_hdr. -2. If the VIRTIO_NET_F_MRG_RXBUF feature was not negotiated, or - the “num_buffers” field is one, then the entire packet will be - contained within this buffer, immediately following the struct +2. If the VIRTIO_NET_F_MRG_RXBUF feature was not negotiated, or + the “num_buffers” field is one, then the entire packet will be + contained within this buffer, immediately following the struct virtio_net_hdr. -3. If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the - VIRTIO_NET_HDR_F_NEEDS_CSUM bit in the “flags” field may be +3. If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the + VIRTIO_NET_HDR_F_NEEDS_CSUM bit in the “flags” field may be set: if so, the checksum on the packet is incomplete and the “ - csum_start” and “csum_offset” fields indicate how to calculate + csum_start” and “csum_offset” fields indicate how to calculate it (see Packet Transmission point 1). -4. If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were - negotiated, then the “gso_type” may be something other than - VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the +4. If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were + negotiated, then the “gso_type” may be something other than + VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the desired MSS (see Packet Transmission point 2). 2.5.1.5.3. Control Virtqueue --------------------------- -The driver uses the control virtqueue (if VIRTIO_NET_F_CTRL_VQ is -negotiated) to send commands to manipulate various features of -the device which would not easily map into the configuration +The driver uses the control virtqueue (if VIRTIO_NET_F_CTRL_VQ is +negotiated) to send commands to manipulate various features of +the device which would not easily map into the configuration space.
 All commands are of the following form:
@@ -1387,33 +1387,33 @@ All commands are of the following form:
 /* ack values */
 #define VIRTIO_NET_OK 0
- #define VIRTIO_NET_ERR 1
+ #define VIRTIO_NET_ERR 1

-The class, command and command-specific-data are set by the
-driver, and the device sets the ack byte. There is little it can
-do except issue a diagnostic if the ack byte is not
+The class, command and command-specific-data are set by the
+driver, and the device sets the ack byte. There is little it can
+do except issue a diagnostic if the ack byte is not
 VIRTIO_NET_OK.

 2.5.1.5.3.1. Packet Receive Filtering
 ------------------------------------

-If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can
-send control commands for promiscuous mode, multicast receiving,
+If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can
+send control commands for promiscuous mode, multicast receiving,
 and filtering of MAC addresses.

-Note that in general, these commands are best-effort: unwanted
-packets may still arrive.
+Note that in general, these commands are best-effort: unwanted
+packets may still arrive.

 Setting Promiscuous Mode

 #define VIRTIO_NET_CTRL_RX 0
 #define VIRTIO_NET_CTRL_RX_PROMISC 0
- #define VIRTIO_NET_CTRL_RX_ALLMULTI 1
+ #define VIRTIO_NET_CTRL_RX_ALLMULTI 1

-The class VIRTIO_NET_CTRL_RX has two commands:
-VIRTIO_NET_CTRL_RX_PROMISC turns promiscuous mode on and off, and
-VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and
-off. The command-specific-data is one byte containing 0 (off) or
+The class VIRTIO_NET_CTRL_RX has two commands:
+VIRTIO_NET_CTRL_RX_PROMISC turns promiscuous mode on and off, and
+VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and
+off. The command-specific-data is one byte containing 0 (off) or
 1 (on).

 2.5.1.5.3.2. Setting MAC Address Filtering
@@ -1425,7 +1425,7 @@ off.
The command-specific-data is one byte containing 0 (off) or
 };

 #define VIRTIO_NET_CTRL_MAC 1
- #define VIRTIO_NET_CTRL_MAC_TABLE_SET 0
+ #define VIRTIO_NET_CTRL_MAC_TABLE_SET 0

 The device can filter incoming packets by any number of destination
 MAC addresses.[21] This table is set using the class
@@ -1437,45 +1437,45 @@ contains multicast addresses.
 2.5.1.5.3.3. VLAN Filtering
 --------------------------

-If the driver negotiates the VIRTION_NET_F_CTRL_VLAN feature, it
+If the driver negotiates the VIRTION_NET_F_CTRL_VLAN feature, it
 can control a VLAN filter table in the device.

 #define VIRTIO_NET_CTRL_VLAN 2
 #define VIRTIO_NET_CTRL_VLAN_ADD 0
- #define VIRTIO_NET_CTRL_VLAN_DEL 1
+ #define VIRTIO_NET_CTRL_VLAN_DEL 1

-Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL
+Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL
 command take a 16-bit VLAN id as the command-specific-data.

 2.5.1.5.3.4. Gratuitous Packet Sending
 -------------------------------------

-If the driver negotiates the VIRTIO_NET_F_GUEST_ANNOUNCE (depends
-on VIRTIO_NET_F_CTRL_VQ), it can ask the guest to send gratuitous
-packets; this is usually done after the guest has been physically
-migrated, and needs to announce its presence on the new network
-links. (As hypervisor does not have the knowledge of guest
-network configuration (eg. tagged vlan) it is simplest to prod
+If the driver negotiates the VIRTIO_NET_F_GUEST_ANNOUNCE (depends
+on VIRTIO_NET_F_CTRL_VQ), it can ask the guest to send gratuitous
+packets; this is usually done after the guest has been physically
+migrated, and needs to announce its presence on the new network
+links. (As hypervisor does not have the knowledge of guest
+network configuration (eg. tagged vlan) it is simplest to prod
 the guest in this way).

 #define VIRTIO_NET_CTRL_ANNOUNCE 3
 #define VIRTIO_NET_CTRL_ANNOUNCE_ACK 0

-The Guest needs to check VIRTIO_NET_S_ANNOUNCE bit in status
-field when it notices the changes of device configuration.
The
-command VIRTIO_NET_CTRL_ANNOUNCE_ACK is used to indicate that
-driver has recevied the notification and device would clear the
-VIRTIO_NET_S_ANNOUNCE bit in the status filed after it received
+The Guest needs to check VIRTIO_NET_S_ANNOUNCE bit in status
+field when it notices the changes of device configuration. The
+command VIRTIO_NET_CTRL_ANNOUNCE_ACK is used to indicate that
+driver has recevied the notification and device would clear the
+VIRTIO_NET_S_ANNOUNCE bit in the status filed after it received
 this command.

 Processing this notification involves:

-1. Sending the gratuitous packets or marking there are pending
- gratuitous packets to be sent and letting deferred routine to
+1. Sending the gratuitous packets or marking there are pending
+ gratuitous packets to be sent and letting deferred routine to
 send them.

-2. Sending VIRTIO_NET_CTRL_ANNOUNCE_ACK command through control
- vq.
+2. Sending VIRTIO_NET_CTRL_ANNOUNCE_ACK command through control
+ vq.

 2.5.1.5.3.4. Offloads State Configuration
 -------------------------------------
@@ -1514,9 +1514,9 @@ change of specific offload state.
 2.5.2. Block Device
 ==================

-The virtio block device is a simple virtual block device (ie.
-disk). Read and write requests (and other exotic requests) are
-placed in the queue, and serviced (probably out of order) by the
+The virtio block device is a simple virtual block device (ie.
+disk). Read and write requests (and other exotic requests) are
+placed in the queue, and serviced (probably out of order) by the
 device except where noted.

 2.5.2.1. Device ID
@@ -1532,10 +1532,10 @@ device except where noted.

 VIRTIO_BLK_F_BARRIER (0) Host supports request barriers.

- VIRTIO_BLK_F_SIZE_MAX (1) Maximum size of any single segment is
+ VIRTIO_BLK_F_SIZE_MAX (1) Maximum size of any single segment is
 in “size_max”.

- VIRTIO_BLK_F_SEG_MAX (2) Maximum number of segments in a
+ VIRTIO_BLK_F_SEG_MAX (2) Maximum number of segments in a
 request is in “seg_max”.
 VIRTIO_BLK_F_GEOMETRY (4) Disk-style geometry specified in “
@@ -1549,9 +1549,9 @@ device except where noted.

 VIRTIO_BLK_F_FLUSH (9) Cache flush command support.

- Device configuration layout The capacity of the device
- (expressed in 512-byte sectors) is always present. The
- availability of the others all depend on various feature bits
+ Device configuration layout The capacity of the device
+ (expressed in 512-byte sectors) is always present. The
+ availability of the others all depend on various feature bits
 as indicated above.

 struct virtio_blk_config {
@@ -1569,23 +1569,23 @@ device except where noted.
 2.5.2.4. Device Initialization
 -----------------------------

-1. The device size should be read from the “capacity”
- configuration field. No requests should be submitted which goes
+1. The device size should be read from the “capacity”
+ configuration field. No requests should be submitted which goes
 beyond this limit.

-2. If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, the
- blk_size field can be read to determine the optimal sector size
- for the driver to use. This does not effect the units used in
- the protocol (always 512 bytes), but awareness of the correct
+2. If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, the
+ blk_size field can be read to determine the optimal sector size
+ for the driver to use. This does not effect the units used in
+ the protocol (always 512 bytes), but awareness of the correct
 value can effect performance.

-3. If the VIRTIO_BLK_F_RO feature is set by the device, any write
+3. If the VIRTIO_BLK_F_RO feature is set by the device, any write
 requests will fail.

 2.5.2.5. Device Operation
 ------------------------

-The driver queues requests to the virtqueue, and they are used by
+The driver queues requests to the virtqueue, and they are used by
 the device (not necessarily in order). Each request is of form:

 struct virtio_blk_req {
@@ -1596,7 +1596,7 @@ the device (not necessarily in order).
Each request is of form:
 u8 status;
 };

-If the device has VIRTIO_BLK_F_SCSI feature, it can also support
+If the device has VIRTIO_BLK_F_SCSI feature, it can also support
 scsi packet command requests, each of these requests is of form:

 struct virtio_scsi_pc_req {
@@ -1634,71 +1634,71 @@ flush the host cache.
 #define VIRTIO_BLK_T_FLUSH_OUT 5
 #define VIRTIO_BLK_T_BARRIER 0x80000000

-The ioprio field is a hint about the relative priorities of
-requests to the device: higher numbers indicate more important
+The ioprio field is a hint about the relative priorities of
+requests to the device: higher numbers indicate more important
 requests.

-The sector number indicates the offset (multiplied by 512) where
-the read or write is to occur. This field is unused and set to 0
+The sector number indicates the offset (multiplied by 512) where
+the read or write is to occur. This field is unused and set to 0
 for scsi packet commands and for flush commands.

-The cmd field is only present for scsi packet command requests,
-and indicates the command to perform. This field must reside in a
-single, separate read-only buffer; command length can be derived
-from the length of this buffer.
+The cmd field is only present for scsi packet command requests,
+and indicates the command to perform. This field must reside in a
+single, separate read-only buffer; command length can be derived
+from the length of this buffer.

-Note that these first three (four for scsi packet commands)
-fields are always read-only: the data field is either read-only
-or write-only, depending on the request. The size of the read or
+Note that these first three (four for scsi packet commands)
+fields are always read-only: the data field is either read-only
+or write-only, depending on the request. The size of the read or
 write can be derived from the total size of the request buffers.
-The sense field is only present for scsi packet command requests,
+The sense field is only present for scsi packet command requests,
 and indicates the buffer for scsi sense data.

-The data_len field is only present for scsi packet command
-requests, this field is deprecated, and should be ignored by the
+The data_len field is only present for scsi packet command
+requests, this field is deprecated, and should be ignored by the
 driver. Historically, devices copied data length there.

-The sense_len field is only present for scsi packet command
-requests and indicates the number of bytes actually written to
+The sense_len field is only present for scsi packet command
+requests and indicates the number of bytes actually written to
 the sense buffer.

-The residual field is only present for scsi packet command
-requests and indicates the residual size, calculated as data
+The residual field is only present for scsi packet command
+requests and indicates the residual size, calculated as data
 length - number of bytes actually transferred.
-The final status byte is written by the device: either
-VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for host or guest
+The final status byte is written by the device: either
+VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for host or guest
 error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:

 #define VIRTIO_BLK_S_OK 0
 #define VIRTIO_BLK_S_IOERR 1
 #define VIRTIO_BLK_S_UNSUPP 2

-Historically, devices assumed that the fields type, ioprio and
-sector reside in a single, separate read-only buffer; the fields
-errors, data_len, sense_len and residual reside in a single,
-separate write-only buffer; the sense field in a separate
-write-only buffer of size 96 bytes, by itself; the fields errors,
-data_len, sense_len and residual in a single write-only buffer;
-and the status field is a separate read-only buffer of size 1
+Historically, devices assumed that the fields type, ioprio and
+sector reside in a single, separate read-only buffer; the fields
+errors, data_len, sense_len and residual reside in a single,
+separate write-only buffer; the sense field in a separate
+write-only buffer of size 96 bytes, by itself; the fields errors,
+data_len, sense_len and residual in a single write-only buffer;
+and the status field is a separate read-only buffer of size 1
 byte, by itself.

 2.5.3. Console Device
 ====================

-The virtio console device is a simple device for data input and
-output. A device may have one or more ports. Each port has a pair
-of input and output virtqueues. Moreover, a device has a pair of
-control IO virtqueues. The control virtqueues are used to
-communicate information between the device and the driver about
-ports being opened and closed on either side of the connection,
-indication from the host about whether a particular port is a
-console port, adding new ports, port hot-plug/unplug, etc., and
-indication from the guest about whether a port or a device was
-successfully added, port open/close, etc..
For data IO, one or
-more empty buffers are placed in the receive queue for incoming
+The virtio console device is a simple device for data input and
+output. A device may have one or more ports. Each port has a pair
+of input and output virtqueues. Moreover, a device has a pair of
+control IO virtqueues. The control virtqueues are used to
+communicate information between the device and the driver about
+ports being opened and closed on either side of the connection,
+indication from the host about whether a particular port is a
+console port, adding new ports, port hot-plug/unplug, etc., and
+indication from the guest about whether a port or a device was
+successfully added, port open/close, etc.. For data IO, one or
+more empty buffers are placed in the receive queue for incoming
 data and outgoing characters are placed in the transmit queue.

 2.5.3.1. Device ID
@@ -1709,7 +1709,7 @@ data and outgoing characters are placed in the transmit queue.
 2.5.3.2. Virtqueues
 ------------------

- 0:receiveq(port0). 1:transmitq(port0), 2:control receiveq, 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1),
+ 0:receiveq(port0). 1:transmitq(port0), 2:control receiveq, 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1),
 ...

 Ports 2 onwards only exist if VIRTIO_CONSOLE_F_MULTIPORT is set.
@@ -1717,20 +1717,20 @@ data and outgoing characters are placed in the transmit queue.
 2.5.3.3. Feature bits
 --------------------

- VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields
+ VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields
 are valid.

- VIRTIO_CONSOLE_F_MULTIPORT(1) Device has support for multiple
- ports; configuration fields nr_ports and max_nr_ports are
+ VIRTIO_CONSOLE_F_MULTIPORT(1) Device has support for multiple
+ ports; configuration fields nr_ports and max_nr_ports are
 valid and control virtqueues will be used.

 2.5.3.4.
Device configuration layout
 -----------------------------------

- The size of the console is supplied
- in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature
- is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature
- is set, the maximum number of ports supported by the device can
+ The size of the console is supplied
+ in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature
+ is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature
+ is set, the maximum number of ports supported by the device can
 be fetched.

 struct virtio_console_config {
@@ -1742,52 +1742,52 @@ data and outgoing characters are placed in the transmit queue.
 2.5.3.5. Device Initialization
 -----------------------------

-1. If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver
+1. If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver
 can read the console dimensions from the configuration fields.

-2. If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, the
- driver can spawn multiple ports, not all of which may be
- attached to a console. Some could be generic ports. In this
- case, the control virtqueues are enabled and according to the
- max_nr_ports configuration-space value, the appropriate number
- of virtqueues are created. A control message indicating the
- driver is ready is sent to the host. The host can then send
- control messages for adding new ports to the device. After
- creating and initializing each port, a
- VIRTIO_CONSOLE_PORT_READY control message is sent to the host
- for that port so the host can let us know of any additional
+2. If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, the
+ driver can spawn multiple ports, not all of which may be
+ attached to a console. Some could be generic ports. In this
+ case, the control virtqueues are enabled and according to the
+ max_nr_ports configuration-space value, the appropriate number
+ of virtqueues are created.
A control message indicating the
+ driver is ready is sent to the host. The host can then send
+ control messages for adding new ports to the device. After
+ creating and initializing each port, a
+ VIRTIO_CONSOLE_PORT_READY control message is sent to the host
+ for that port so the host can let us know of any additional
 configuration options set for that port.

-3. The receiveq for each port is populated with one or more
+3. The receiveq for each port is populated with one or more
 receive buffers.

 2.5.3.6. Device Operation
 ------------------------

-1. For output, a buffer containing the characters is placed in
+1. For output, a buffer containing the characters is placed in
 the port's transmitq.[25]

-2. When a buffer is used in the receiveq (signalled by an
- interrupt), the contents is the input to the port associated
+2. When a buffer is used in the receiveq (signalled by an
+ interrupt), the contents is the input to the port associated
 with the virtqueue for which the notification was received.

-3. If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a
- configuration change interrupt may occur. The updated size can
+3. If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a
+ configuration change interrupt may occur. The updated size can
 be read from the configuration fields.

-4. If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT
- feature, active ports are announced by the host using the
- VIRTIO_CONSOLE_PORT_ADD control message. The same message is
+4. If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT
+ feature, active ports are announced by the host using the
+ VIRTIO_CONSOLE_PORT_ADD control message. The same message is
 used for port hot-plug as well.

-5. If the host specified a port `name', a sysfs attribute is
- created with the name filled in, so that udev rules can be
- written that can create a symlink from the port's name to the
+5.
If the host specified a port `name', a sysfs attribute is
+ created with the name filled in, so that udev rules can be
+ written that can create a symlink from the port's name to the
 char device for port discovery by applications in the guest.

-6. Changes to ports' state are effected by control messages.
- Appropriate action is taken on the port indicated in the
- control message. The layout of the structure of the control
+6. Changes to ports' state are effected by control messages.
+ Appropriate action is taken on the port indicated in the
+ control message. The layout of the structure of the control
 buffer and the events associated are:

 struct virtio_console_control {
@@ -1809,7 +1809,7 @@ data and outgoing characters are placed in the transmit queue.
 2.5.4. Entropy Device
 ====================

-The virtio entropy device supplies high-quality randomness for
+The virtio entropy device supplies high-quality randomness for
 guest use.

 2.5.4.1. Device ID
@@ -1836,19 +1836,19 @@ guest use.
 2.5.4.6. Device Operation
 ------------------------

-When the driver requires random bytes, it places the descriptor
-of one or more buffers in the queue. It will be completely filled
+When the driver requires random bytes, it places the descriptor
+of one or more buffers in the queue. It will be completely filled
 by random data by the device.

 2.5.5. Memory Balloon Device
 ===========================

-The virtio memory balloon device is a primitive device for
-managing guest memory: the device asks for a certain amount of
-memory, and the guest supplies it (or withdraws it, if the device
-has more than it asks for). This allows the guest to adapt to
-changes in allowance of underlying physical memory.
If the
-feature is negotiated, the device can also be used to communicate
+The virtio memory balloon device is a primitive device for
+managing guest memory: the device asks for a certain amount of
+memory, and the guest supplies it (or withdraws it, if the device
+has more than it asks for). This allows the guest to adapt to
+changes in allowance of underlying physical memory. If the
+feature is negotiated, the device can also be used to communicate
 guest memory statistics to the host.

 2.5.5.1. Device ID
@@ -1863,16 +1863,16 @@ guest memory statistics to the host.
 2.5.5.3. Feature bits
 --------------------

- VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before
+ VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before
 pages from the balloon are used.

- VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest
+ VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest
 memory statistics is present.

 2.5.5.4. Device configuration layout
 -----------------------------------

- Both fields of this configuration
- are always available. Note that they are little endian, despite
+ Both fields of this configuration
+ are always available. Note that they are little endian, despite
 convention that device fields are guest endian:

 struct virtio_balloon_config {
@@ -1889,7 +1889,7 @@ guest memory statistics to the host.

 (a) Identify the stats virtqueue.

- (b) Add one empty buffer to the stats virtqueue and notify the
+ (b) Add one empty buffer to the stats virtqueue and notify the
 host.

 Device operation begins immediately.
@@ -1897,13 +1897,13 @@ Device operation begins immediately.
 2.5.5.6. Device Operation
 ------------------------

-Memory Ballooning The device is driven by the receipt of a
+Memory Ballooning The device is driven by the receipt of a
 configuration change interrupt.

-1. The “num_pages” configuration field is examined. If this is
- greater than the “actual” number of pages, memory must be given
- to the balloon.
If it is less than the “actual” number of
- pages, memory may be taken back from the balloon for general
+1. The “num_pages” configuration field is examined. If this is
+ greater than the “actual” number of pages, memory must be given
+ to the balloon. If it is less than the “actual” number of
+ pages, memory may be taken back from the balloon for general
 use.

 2. To supply memory to the balloon (aka. inflate):
@@ -1914,49 +1914,49 @@ configuration change interrupt.

 3. To remove memory from the balloon (aka. deflate):

- (a) The driver constructs an array of addresses of memory pages
- it has previously given to the balloon, as described above.
+ (a) The driver constructs an array of addresses of memory pages
+ it has previously given to the balloon, as described above.
 This descriptor is added to the deflateq.

- (b) If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is negotiated, the
- guest may not use these requested pages until that descriptor
+ (b) If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is negotiated, the
+ guest may not use these requested pages until that descriptor
 in the deflateq has been used by the device.

- (c) Otherwise, the guest may begin to re-use pages previously
- given to the balloon before the device has acknowledged their
- withdrawl. [28]
+ (c) Otherwise, the guest may begin to re-use pages previously
+ given to the balloon before the device has acknowledged their
+ withdrawl. [28]

-4. In either case, once the device has completed the inflation or
- deflation, the “actual” field of the configuration should be
+4. In either case, once the device has completed the inflation or
+ deflation, the “actual” field of the configuration should be
 updated to reflect the new number of pages in the balloon.[29]

 2.5.5.6.1. Memory Statistics
 ---------------------------

-The stats virtqueue is atypical because communication is driven
-by the device (not the driver).
The channel becomes active at
-driver initialization time when the driver adds an empty buffer
-and notifies the device. A request for memory statistics proceeds
+The stats virtqueue is atypical because communication is driven
+by the device (not the driver). The channel becomes active at
+driver initialization time when the driver adds an empty buffer
+and notifies the device. A request for memory statistics proceeds
 as follows:

-1. The device pushes the buffer onto the used ring and sends an
+1. The device pushes the buffer onto the used ring and sends an
 interrupt.

 2. The driver pops the used buffer and discards it.

-3. The driver collects memory statistics and writes them into a
+3. The driver collects memory statistics and writes them into a
 new buffer.

-4. The driver adds the buffer to the virtqueue and notifies the
+4. The driver adds the buffer to the virtqueue and notifies the
 device.

-5. The device pops the buffer (retaining it to initiate a
+5. The device pops the buffer (retaining it to initiate a
 subsequent request) and consumes the statistics.

- Memory Statistics Format Each statistic consists of a 16 bit
- tag and a 64 bit value. Both quantities are represented in the
- native endian of the guest. All statistics are optional and the
- driver may choose which ones to supply. To guarantee backwards
+ Memory Statistics Format Each statistic consists of a 16 bit
+ tag and a 64 bit value. Both quantities are represented in the
+ native endian of the guest. All statistics are optional and the
+ driver may choose which ones to supply. To guarantee backwards
 compatibility, unsupported statistics should be omitted.

 struct virtio_balloon_stat {
@@ -1973,46 +1973,46 @@ as follows:
 2.5.5.6.2. Memory Statistics Tags
 --------------------------------

- VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been
+ VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been
 swapped in (in bytes).
- VIRTIO_BALLOON_S_SWAP_OUT The amount of memory that has been
+ VIRTIO_BALLOON_S_SWAP_OUT The amount of memory that has been
 swapped out to disk (in bytes).

- VIRTIO_BALLOON_S_MAJFLT The number of major page faults that
+ VIRTIO_BALLOON_S_MAJFLT The number of major page faults that
 have occurred.

- VIRTIO_BALLOON_S_MINFLT The number of minor page faults that
+ VIRTIO_BALLOON_S_MINFLT The number of minor page faults that
 have occurred.

- VIRTIO_BALLOON_S_MEMFREE The amount of memory not being used
+ VIRTIO_BALLOON_S_MEMFREE The amount of memory not being used
 for any purpose (in bytes).

- VIRTIO_BALLOON_S_MEMTOT The total amount of memory available
+ VIRTIO_BALLOON_S_MEMTOT The total amount of memory available
 (in bytes).

 2.5.6. SCSI Host Device
 ======================

-The virtio SCSI host device groups together one or more virtual
-logical units (such as disks), and allows communicating to them
-using the SCSI protocol. An instance of the device represents a
+The virtio SCSI host device groups together one or more virtual
+logical units (such as disks), and allows communicating to them
+using the SCSI protocol. An instance of the device represents a
 SCSI host to which many targets and LUNs are attached.

 The virtio SCSI device services two kinds of requests:

 • command requests for a logical unit;

-• task management functions related to a logical unit, target or
+• task management functions related to a logical unit, target or
 command.

-The device is also able to send out notifications about added and
-removed logical units.
Together, these capabilities provide a
-SCSI transport protocol that uses virtqueues as the transfer
-medium. In the transport protocol, the virtio driver acts as the
-initiator, while the virtio SCSI host provides one or more
-targets that receive and process the requests.
+The device is also able to send out notifications about added and
+removed logical units. Together, these capabilities provide a
+SCSI transport protocol that uses virtqueues as the transfer
+medium. In the transport protocol, the virtio driver acts as the
+initiator, while the virtio SCSI host provides one or more
+targets that receive and process the requests.

 2.5.6.1. Device ID
 -----------------
@@ -2025,10 +2025,10 @@ targets that receive and process the requests.
 2.5.6.3. Feature bits
 --------------------

- VIRTIO_SCSI_F_INOUT (0) A single request can include both
+ VIRTIO_SCSI_F_INOUT (0) A single request can include both
 read-only and write-only data buffers.

- VIRTIO_SCSI_F_HOTPLUG (1) The host should enable
+ VIRTIO_SCSI_F_HOTPLUG (1) The host should enable
 hot-plug/hot-unplug of new LUNs and targets on the SCSI bus.

 2.5.6.4. Device configuration layout
@@ -2050,54 +2050,54 @@ targets that receive and process the requests.
 u32 max_lun;
 };

- num_queues is the total number of request virtqueues exposed by
- the device. The driver is free to use only one request queue,
+ num_queues is the total number of request virtqueues exposed by
+ the device. The driver is free to use only one request queue,
 or it can use more to achieve better performance.

- seg_max is the maximum number of segments that can be in a
- command. A bidirectional command can include seg_max input
+ seg_max is the maximum number of segments that can be in a
+ command. A bidirectional command can include seg_max input
 segments and seg_max output segments.

- max_sectors is a hint to the guest about the maximum transfer
+ max_sectors is a hint to the guest about the maximum transfer
 size it should use.

- cmd_per_lun is a hint to the guest about the maximum number of
- linked commands it should send to one LUN. The actual value
- to be used is the minimum of cmd_per_lun and the virtqueue
+ cmd_per_lun is a hint to the guest about the maximum number of
+ linked commands it should send to one LUN. The actual value
+ to be used is the minimum of cmd_per_lun and the virtqueue
 size.
- event_info_size is the maximum size that the device will fill
- for buffers that the driver places in the eventq. The driver
- should always put buffers at least of this size. It is
- written by the device depending on the set of negotated
+ event_info_size is the maximum size that the device will fill
+ for buffers that the driver places in the eventq. The driver
+ should always put buffers at least of this size. It is
+ written by the device depending on the set of negotated
 features.

- sense_size is the maximum size of the sense data that the
- device will write. The default value is written by the device
- and will always be 96, but the driver can modify it. It is
+ sense_size is the maximum size of the sense data that the
+ device will write. The default value is written by the device
+ and will always be 96, but the driver can modify it. It is
 restored to the default when the device is reset.

- cdb_size is the maximum size of the CDB that the driver will
- write. The default value is written by the device and will
- always be 32, but the driver can likewise modify it. It is
+ cdb_size is the maximum size of the CDB that the driver will
+ write. The default value is written by the device and will
+ always be 32, but the driver can likewise modify it. It is
 restored to the default when the device is reset.

- max_channel, max_target and max_lun can be used by the driver
- as hints to constrain scanning the logical units on the
+ max_channel, max_target and max_lun can be used by the driver
+ as hints to constrain scanning the logical units on the
 host.h

 2.5.6.5. Device Initialization
 -----------------------------

-The initialization routine should first of all discover the
+The initialization routine should first of all discover the
 device's virtqueues.

-If the driver uses the eventq, it should then place at least a
+If the driver uses the eventq, it should then place at least a
 buffer in the eventq.
-The driver can immediately issue requests (for example, INQUIRY
-or REPORT LUNS) or task management functions (for example, I_T
-RESET).
+The driver can immediately issue requests (for example, INQUIRY
+or REPORT LUNS) or task management functions (for example, I_T
+RESET).

 2.5.6.6. Device Operation
 ------------------------
@@ -2108,13 +2108,13 @@ queue and the event queue.
 2.5.6.6.1. Device Operation: Request Queues
 ------------------------------------------

-The driver queues requests to an arbitrary request queue, and
-they are used by the device on that same queue. It is the
-responsibility of the driver to ensure strict request ordering
-for commands placed on different queues, because they will be
+The driver queues requests to an arbitrary request queue, and
+they are used by the device on that same queue. It is the
+responsibility of the driver to ensure strict request ordering
+for commands placed on different queues, because they will be
 consumed with no order constraints.

-Requests have the following format:
+Requests have the following format:

 struct virtio_scsi_req_cmd {
 // Read-only
@@ -2154,84 +2154,84 @@ Requests have the following format:
 #define VIRTIO_SCSI_S_HEAD 2
 #define VIRTIO_SCSI_S_ACA 3

-The lun field addresses a target and logical unit in the
-virtio-scsi device's SCSI domain. The only supported format for
-the LUN field is: first byte set to 1, second byte set to target,
-third and fourth byte representing a single level LUN structure,
-followed by four zero bytes. With this representation, a
-virtio-scsi device can serve up to 256 targets and 16384 LUNs per
+The lun field addresses a target and logical unit in the
+virtio-scsi device's SCSI domain. The only supported format for
+the LUN field is: first byte set to 1, second byte set to target,
+third and fourth byte representing a single level LUN structure,
+followed by four zero bytes.
With this representation, a +virtio-scsi device can serve up to 256 targets and 16384 LUNs per target. The id field is the command identifier (“tag”). -task_attr, prio and crn should be left to zero. task_attr defines -the task attribute as in the table above, but all task attributes -may be mapped to SIMPLE by the device; crn may also be provided -by clients, but is generally expected to be 0. The maximum CRN -value defined by the protocol is 255, since CRN is stored in an +task_attr, prio and crn should be left to zero. task_attr defines +the task attribute as in the table above, but all task attributes +may be mapped to SIMPLE by the device; crn may also be provided +by clients, but is generally expected to be 0. The maximum CRN +value defined by the protocol is 255, since CRN is stored in an 8-bit integer. -All of these fields are defined in SAM. They are always -read-only, as are the cdb and dataout field. The cdb_size is +All of these fields are defined in SAM. They are always +read-only, as are the cdb and dataout field. The cdb_size is taken from the configuration space. -sense and subsequent fields are always write-only. The sense_len -field indicates the number of bytes actually written to the sense -buffer. The residual field indicates the residual size, -calculated as “data_length - number_of_transferred_bytes”, for -read or write operations. For bidirectional commands, the -number_of_transferred_bytes includes both read and written bytes. -A residual field that is less than the size of datain means that -the dataout field was processed entirely. A residual field that -exceeds the size of datain means that the dataout field was -processed partially and the datain field was not processed at +sense and subsequent fields are always write-only. The sense_len +field indicates the number of bytes actually written to the sense +buffer. 
The residual field indicates the residual size, +calculated as “data_length - number_of_transferred_bytes”, for +read or write operations. For bidirectional commands, the +number_of_transferred_bytes includes both read and written bytes. +A residual field that is less than the size of datain means that +the dataout field was processed entirely. A residual field that +exceeds the size of datain means that the dataout field was +processed partially and the datain field was not processed at all. -The status byte is written by the device to be the status code as +The status byte is written by the device to be the status code as defined in SAM. -The response byte is written by the device to be one of the +The response byte is written by the device to be one of the following: - VIRTIO_SCSI_S_OK when the request was completed and the status - byte is filled with a SCSI status code (not necessarily + VIRTIO_SCSI_S_OK when the request was completed and the status + byte is filled with a SCSI status code (not necessarily "GOOD"). - VIRTIO_SCSI_S_OVERRUN if the content of the CDB requires + VIRTIO_SCSI_S_OVERRUN if the content of the CDB requires transferring more data than is available in the data buffers. - VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an + VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an ABORT TASK or ABORT TASK SET task management function. - VIRTIO_SCSI_S_BAD_TARGET if the request was never processed + VIRTIO_SCSI_S_BAD_TARGET if the request was never processed because the target indicated by the lun field does not exist. - VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus + VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus or device reset (including a task management function). 
- VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed due to a - problem in the connection between the host and the target + VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed due to a + problem in the connection between the host and the target (severed link). - VIRTIO_SCSI_S_TARGET_FAILURE if the target is suffering a + VIRTIO_SCSI_S_TARGET_FAILURE if the target is suffering a failure and the guest should not retry on other paths. - VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is suffering a failure + VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is suffering a failure but retrying on other paths might yield a different result. - VIRTIO_SCSI_S_BUSY if the request failed but retrying on the + VIRTIO_SCSI_S_BUSY if the request failed but retrying on the same path should work. - VIRTIO_SCSI_S_FAILURE for other host or guest error. In - particular, if neither dataout nor datain is empty, and the - VIRTIO_SCSI_F_INOUT feature has not been negotiated, the - request will be immediately returned with a response equal to - VIRTIO_SCSI_S_FAILURE. + VIRTIO_SCSI_S_FAILURE for other host or guest error. In + particular, if neither dataout nor datain is empty, and the + VIRTIO_SCSI_F_INOUT feature has not been negotiated, the + request will be immediately returned with a response equal to + VIRTIO_SCSI_S_FAILURE. 2.5.6.6.2. Device Operation: controlq ------------------------------------ -The controlq is used for other SCSI transport operations. +The controlq is used for other SCSI transport operations. Requests have the following format: struct virtio_scsi_ctrl { @@ -2254,7 +2254,7 @@ The type identifies the remaining fields. 
The following commands are defined: - Task management function + Task management function #define VIRTIO_SCSI_T_TMF 0 #define VIRTIO_SCSI_T_TMF_ABORT_TASK 0 @@ -2282,23 +2282,23 @@ The following commands are defined: #define VIRTIO_SCSI_S_FUNCTION_SUCCEEDED 10 #define VIRTIO_SCSI_S_FUNCTION_REJECTED 11 - The type is VIRTIO_SCSI_T_TMF; the subtype field defines. All - fields except response are filled by the driver. The subtype - field must always be specified and identifies the requested + The type is VIRTIO_SCSI_T_TMF; the subtype field defines. All + fields except response are filled by the driver. The subtype + field must always be specified and identifies the requested task management function. - Other fields may be irrelevant for the requested TMF; if so, - they are ignored but they should still be present. The lun - field is in the same format specified for request queues; the - single level LUN is ignored when the task management function - addresses a whole I_T nexus. When relevant, the value of the id + Other fields may be irrelevant for the requested TMF; if so, + they are ignored but they should still be present. The lun + field is in the same format specified for request queues; the + single level LUN is ignored when the task management function + addresses a whole I_T nexus. When relevant, the value of the id field is matched against the id values passed on the requestq. - The outcome of the task management function is written by the - device in the response field. The command-specific response + The outcome of the task management function is written by the + device in the response field. The command-specific response values map 1-to-1 with those defined in SAM. 
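The lun format shared by the request queues and the task management requests can be sketched as below. This is only an illustration: the helper name virtio_scsi_encode_lun is hypothetical, and the use of SAM flat space addressing (0x40 in the top two bits of the third byte) for the "single level LUN structure" is an assumption. It is consistent with the 16384 LUNs per target stated in the text, but the layout is not spelled out there.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill the 8-byte lun field for a target and a single-level LUN.
 * Byte 0 is fixed at 1 and byte 1 carries the target, as the text states.
 * ASSUMPTION: bytes 2-3 use SAM flat space addressing (0x40 marker in the
 * top two bits, 14-bit LUN below it), which yields 16384 LUNs per target.
 * The function name is hypothetical and not part of the specification.
 */
static void virtio_scsi_encode_lun(uint8_t out[8], uint8_t target, uint16_t lun)
{
    assert(lun < 16384);                    /* only 14 bits are available */
    memset(out, 0, 8);                      /* the last four bytes stay zero */
    out[0] = 1;                             /* first byte is always 1 */
    out[1] = target;                        /* second byte is the target */
    out[2] = (uint8_t)(0x40 | (lun >> 8));  /* assumed flat addressing marker */
    out[3] = (uint8_t)(lun & 0xff);         /* low byte of the LUN */
}
```

For a task management function that addresses a whole I_T nexus, the same encoding could be used with any LUN value, since the text says the single level LUN is ignored in that case.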
- Asynchronous notification query + Asynchronous notification query #define VIRTIO_SCSI_T_AN_QUERY 1 @@ -2319,20 +2319,20 @@ The following commands are defined: #define VIRTIO_SCSI_EVT_ASYNC_MULTI_HOST 32 #define VIRTIO_SCSI_EVT_ASYNC_DEVICE_BUSY 64 - By sending this command, the driver asks the device which - events the given LUN can report, as described in paragraphs 6.6 - and A.6 of the SCSI MMC specification. The driver writes the - events it is interested in into the event_requested; the device - responds by writing the events that it supports into + By sending this command, the driver asks the device which + events the given LUN can report, as described in paragraphs 6.6 + and A.6 of the SCSI MMC specification. The driver writes the + events it is interested in into the event_requested; the device + responds by writing the events that it supports into event_actual. - The type is VIRTIO_SCSI_T_AN_QUERY. The lun and event_requested - fields are written by the driver. The event_actual and response + The type is VIRTIO_SCSI_T_AN_QUERY. The lun and event_requested + fields are written by the driver. The event_actual and response fields are written by the device. No command-specific values are defined for the response byte. - Asynchronous notification subscription + Asynchronous notification subscription #define VIRTIO_SCSI_T_AN_SUBSCRIBE 2 struct virtio_scsi_ctrl_an { @@ -2345,17 +2345,17 @@ The following commands are defined: u8 response; } - By sending this command, the driver asks the specified LUN to - report events for its physical interface, again as described in - the SCSI MMC specification. The driver writes the events it is - interested in into the event_requested; the device responds by + By sending this command, the driver asks the specified LUN to + report events for its physical interface, again as described in + the SCSI MMC specification. 
The driver writes the events it is + interested in into the event_requested; the device responds by writing the events that it supports into event_actual. - Event types are the same as for the asynchronous notification + Event types are the same as for the asynchronous notification query message. - The type is VIRTIO_SCSI_T_AN_SUBSCRIBE. The lun and - event_requested fields are written by the driver. The + The type is VIRTIO_SCSI_T_AN_SUBSCRIBE. The lun and + event_requested fields are written by the driver. The event_actual and response fields are written by the device. No command-specific values are defined for the response byte. @@ -2363,25 +2363,25 @@ The following commands are defined: 2.5.6.6.3. Device Operation: eventq ---------------------------------- -The eventq is used by the device to report information on logical -units that are attached to it. The driver should always leave a -few buffers ready in the eventq. In general, the device will not -queue events to cope with an empty eventq, and will end up -dropping events if it finds no buffer ready. However, when -reporting events for many LUNs (e.g. when a whole target -disappears), the device can throttle events to avoid dropping -them. For this reason, placing 10-15 buffers on the event queue +The eventq is used by the device to report information on logical +units that are attached to it. The driver should always leave a +few buffers ready in the eventq. In general, the device will not +queue events to cope with an empty eventq, and will end up +dropping events if it finds no buffer ready. However, when +reporting events for many LUNs (e.g. when a whole target +disappears), the device can throttle events to avoid dropping +them. For this reason, placing 10-15 buffers on the event queue should be enough. -Buffers are placed in the eventq and filled by the device when -interesting events occur. 
The buffers should be strictly -write-only (device-filled) and the size of the buffers should be -at least the value given in the device's configuration +Buffers are placed in the eventq and filled by the device when +interesting events occur. The buffers should be strictly +write-only (device-filled) and the size of the buffers should be +at least the value given in the device's configuration information. -Buffers returned by the device on the eventq will be referred to -as "events" in the rest of this section. Events have the -following format: +Buffers returned by the device on the eventq will be referred to +as "events" in the rest of this section. Events have the +following format: #define VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000 @@ -2391,33 +2391,33 @@ following format: ... } -If bit 31 is set in the event field, the device failed to report -an event due to missing buffers. In this case, the driver should -poll the logical units for unit attention conditions, and/or do -whatever form of bus scan is appropriate for the guest operating +If bit 31 is set in the event field, the device failed to report +an event due to missing buffers. In this case, the driver should +poll the logical units for unit attention conditions, and/or do +whatever form of bus scan is appropriate for the guest operating system. -Other data that the device writes to the buffer depends on the +Other data that the device writes to the buffer depends on the contents of the event field. The following events are defined: - No event + No event #define VIRTIO_SCSI_T_NO_EVENT 0 - This event is fired in the following cases: + This event is fired in the following cases: - • When the device detects in the eventq a buffer that is - shorter than what is indicated in the configuration field, it - might use it immediately and put this dummy value in the - event field. 
A well-written driver will never observe this + • When the device detects in the eventq a buffer that is + shorter than what is indicated in the configuration field, it + might use it immediately and put this dummy value in the + event field. A well-written driver will never observe this situation. - • When events are dropped, the device may signal this event as - soon as the drivers makes a buffer available, in order to - request action from the driver. In this case, of course, this - event will be reported with the VIRTIO_SCSI_T_EVENTS_MISSED - flag. + • When events are dropped, the device may signal this event as + soon as the drivers makes a buffer available, in order to + request action from the driver. In this case, of course, this + event will be reported with the VIRTIO_SCSI_T_EVENTS_MISSED + flag. - Transport reset + Transport reset #define VIRTIO_SCSI_T_TRANSPORT_RESET 1 struct virtio_scsi_event_reset { @@ -2431,58 +2431,58 @@ contents of the event field. The following events are defined: #define VIRTIO_SCSI_EVT_RESET_RESCAN 1 #define VIRTIO_SCSI_EVT_RESET_REMOVED 2 - By sending this event, the device signals that a logical unit - on a target has been reset, including the case of a new device - appearing or disappearing on the bus.The device fills in all - fields. The event field is set to - VIRTIO_SCSI_T_TRANSPORT_RESET. The lun field addresses a + By sending this event, the device signals that a logical unit + on a target has been reset, including the case of a new device + appearing or disappearing on the bus.The device fills in all + fields. The event field is set to + VIRTIO_SCSI_T_TRANSPORT_RESET. The lun field addresses a logical unit in the SCSI host. 
- The reason value is one of the three #define values appearing + The reason value is one of the three #define values appearing above: - • VIRTIO_SCSI_EVT_RESET_REMOVED (“LUN/target removed”) is used - if the target or logical unit is no longer able to receive + • VIRTIO_SCSI_EVT_RESET_REMOVED (“LUN/target removed”) is used + if the target or logical unit is no longer able to receive commands. - • VIRTIO_SCSI_EVT_RESET_HARD (“LUN hard reset”) is used if the + • VIRTIO_SCSI_EVT_RESET_HARD (“LUN hard reset”) is used if the logical unit has been reset, but is still present. - • VIRTIO_SCSI_EVT_RESET_RESCAN (“rescan LUN/target”) is used if + • VIRTIO_SCSI_EVT_RESET_RESCAN (“rescan LUN/target”) is used if a target or logical unit has just appeared on the device. - The “removed” and “rescan” events, when sent for LUN 0, may - apply to the entire target. After receiving them the driver - should ask the initiator to rescan the target, in order to - detect the case when an entire target has appeared or - disappeared. These two events will never be reported unless the - VIRTIO_SCSI_F_HOTPLUG feature was negotiated between the host + The “removed” and “rescan” events, when sent for LUN 0, may + apply to the entire target. After receiving them the driver + should ask the initiator to rescan the target, in order to + detect the case when an entire target has appeared or + disappeared. These two events will never be reported unless the + VIRTIO_SCSI_F_HOTPLUG feature was negotiated between the host and the guest. 
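A driver-side reading of the transport reset rules above might look like the following sketch. The enum evt_action and the function classify_event are hypothetical names introduced for illustration only. The value 0 for VIRTIO_SCSI_EVT_RESET_HARD is taken from the full specification, since the excerpt above elides it. Asynchronous notification events are deliberately left out of this sketch.

```c
#include <stdint.h>

#define VIRTIO_SCSI_T_EVENTS_MISSED    0x80000000u
#define VIRTIO_SCSI_T_NO_EVENT         0
#define VIRTIO_SCSI_T_TRANSPORT_RESET  1

#define VIRTIO_SCSI_EVT_RESET_HARD     0  /* elided in the excerpt above */
#define VIRTIO_SCSI_EVT_RESET_RESCAN   1
#define VIRTIO_SCSI_EVT_RESET_REMOVED  2

/* Hypothetical driver actions; these are not defined by the specification. */
enum evt_action {
    EVT_IGNORE,          /* nothing to do for this buffer */
    EVT_FULL_RESCAN,     /* events were dropped: poll or rescan everything */
    EVT_NOTE_LUN_RESET,  /* logical unit was reset but is still present */
    EVT_ADD_LUN,         /* a logical unit has just appeared */
    EVT_REMOVE_LUN,      /* a logical unit can no longer receive commands */
    EVT_RESCAN_TARGET    /* LUN 0 case: the whole target may have changed */
};

/* Decide what to do with one event, given its already-decoded LUN number. */
static enum evt_action classify_event(uint32_t event, uint32_t reason,
                                      uint16_t lun)
{
    /* Bit 31 set: the device had to drop at least one event. */
    if (event & VIRTIO_SCSI_T_EVENTS_MISSED)
        return EVT_FULL_RESCAN;

    if (event != VIRTIO_SCSI_T_TRANSPORT_RESET)
        return EVT_IGNORE;

    switch (reason) {
    case VIRTIO_SCSI_EVT_RESET_HARD:
        return EVT_NOTE_LUN_RESET;
    case VIRTIO_SCSI_EVT_RESET_RESCAN:
        /* For LUN 0 the event may apply to the entire target. */
        return lun == 0 ? EVT_RESCAN_TARGET : EVT_ADD_LUN;
    case VIRTIO_SCSI_EVT_RESET_REMOVED:
        return lun == 0 ? EVT_RESCAN_TARGET : EVT_REMOVE_LUN;
    default:
        return EVT_IGNORE;
    }
}
```

Treating the EVENTS_MISSED flag before anything else is a design choice of this sketch: once events have been dropped, a rescan covers whatever the current event would have reported.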
- Events will also be reported via sense codes (this obviously - does not apply to newly appeared buses or targets, since the + Events will also be reported via sense codes (this obviously + does not apply to newly appeared buses or targets, since the application has never discovered them): - • “LUN/target removed” maps to sense key ILLEGAL REQUEST, asc + • “LUN/target removed” maps to sense key ILLEGAL REQUEST, asc 0x25, ascq 0x00 (LOGICAL UNIT NOT SUPPORTED) - • “LUN hard reset” maps to sense key UNIT ATTENTION, asc 0x29 + • “LUN hard reset” maps to sense key UNIT ATTENTION, asc 0x29 (POWER ON, RESET OR BUS DEVICE RESET OCCURRED) - • “rescan LUN/target” maps to sense key UNIT ATTENTION, asc + • “rescan LUN/target” maps to sense key UNIT ATTENTION, asc 0x3f, ascq 0x0e (REPORTED LUNS DATA HAS CHANGED) - The preferred way to detect transport reset is always to use - events, because sense codes are only seen by the driver when it - sends a SCSI command to the logical unit or target. However, in - case events are dropped, the initiator will still be able to - synchronize with the actual state of the controller if the - driver asks the initiator to rescan of the SCSI bus. During the - rescan, the initiator will be able to observe the above sense - codes, and it will process them as if it the driver had - received the equivalent event. - - Asynchronous notification + The preferred way to detect transport reset is always to use + events, because sense codes are only seen by the driver when it + sends a SCSI command to the logical unit or target. However, in + case events are dropped, the initiator will still be able to + synchronize with the actual state of the controller if the + driver asks the initiator to rescan of the SCSI bus. During the + rescan, the initiator will be able to observe the above sense + codes, and it will process them as if it the driver had + received the equivalent event. 
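The sense-code mapping listed above can be expressed as a small lookup, so that sense data seen during a rescan is handled like the equivalent event. This is a sketch under assumptions: the function name sense_to_reset_reason is hypothetical, the sense key numbers 0x05 (ILLEGAL REQUEST) and 0x06 (UNIT ATTENTION) are the standard SCSI values rather than anything defined in this document, and the value 0 for VIRTIO_SCSI_EVT_RESET_HARD comes from the full specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard SCSI sense key values; not defined by this document. */
#define SENSE_KEY_ILLEGAL_REQUEST  0x05
#define SENSE_KEY_UNIT_ATTENTION   0x06

#define VIRTIO_SCSI_EVT_RESET_HARD     0
#define VIRTIO_SCSI_EVT_RESET_RESCAN   1
#define VIRTIO_SCSI_EVT_RESET_REMOVED  2

/*
 * Map sense data to the equivalent transport reset reason.
 * Returns true and stores the reason when the sense data matches one of
 * the three cases in the text; returns false and leaves *reason alone
 * otherwise.
 */
static bool sense_to_reset_reason(uint8_t key, uint8_t asc, uint8_t ascq,
                                  uint32_t *reason)
{
    if (key == SENSE_KEY_ILLEGAL_REQUEST && asc == 0x25 && ascq == 0x00) {
        *reason = VIRTIO_SCSI_EVT_RESET_REMOVED;  /* LUN/target removed */
        return true;
    }
    if (key == SENSE_KEY_UNIT_ATTENTION && asc == 0x29) {
        /* The text gives no ascq for this case, so any value is accepted. */
        *reason = VIRTIO_SCSI_EVT_RESET_HARD;     /* LUN hard reset */
        return true;
    }
    if (key == SENSE_KEY_UNIT_ATTENTION && asc == 0x3f && ascq == 0x0e) {
        *reason = VIRTIO_SCSI_EVT_RESET_RESCAN;   /* rescan LUN/target */
        return true;
    }
    return false;
}
```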
+ + Asynchronous notification #define VIRTIO_SCSI_T_ASYNC_NOTIFY 2 struct virtio_scsi_event_an { @@ -2492,16 +2492,16 @@ contents of the event field. The following events are defined: u32 reason; } - By sending this event, the device signals that an asynchronous + By sending this event, the device signals that an asynchronous event was fired from a physical interface. - All fields are written by the device. The event field is set to - VIRTIO_SCSI_T_ASYNC_NOTIFY. The lun field addresses a logical - unit in the SCSI host. The reason field is a subset of the - events that the driver has subscribed to via the "Asynchronous + All fields are written by the device. The event field is set to + VIRTIO_SCSI_T_ASYNC_NOTIFY. The lun field addresses a logical + unit in the SCSI host. The reason field is a subset of the + events that the driver has subscribed to via the "Asynchronous notification subscription" command. - When dropped events are reported, the driver should poll for + When dropped events are reported, the driver should poll for asynchronous events manually using SCSI commands. @@ -2510,15 +2510,15 @@ contents of the event field. The following events are defined: Currently there are four device-independent feature bits defined: - VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature - indicates that the driver wants an interrupt if the device runs - out of available descriptors on a virtqueue, even though - interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT - flag or the used_event field. An example of this is the - networking driver: it doesn't need to know every time a packet - is transmitted, but it does need to free the transmitted - packets a finite time after they are transmitted. 
It can avoid - using a timer if the device interrupts it when all the packets + VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature + indicates that the driver wants an interrupt if the device runs + out of available descriptors on a virtqueue, even though + interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT + flag or the used_event field. An example of this is the + networking driver: it doesn't need to know every time a packet + is transmitted, but it does need to free the transmitted + packets a finite time after they are transmitted. It can avoid + using a timer if the device interrupts it when all the packets are transmitted. VIRTIO_F_ANY_LAYOUT (27) This feature indicates that the device accepts arbitrary @@ -2528,15 +2528,15 @@ Currently there are four device-independent feature bits defined: that the driver can use descriptors with the VRING_DESC_F_INDIRECT flag set, as described in "2.3.3. Indirect Descriptors". - VIRTIO_F_RING_EVENT_IDX(29) This feature enables the used_event - and the avail_event fields. If set, it indicates that the - device should ignore the flags field in the available ring - structure. Instead, the used_event field in this structure is - used by guest to suppress device interrupts. Further, the - driver should ignore the flags field in the used ring - structure. Instead, the avail_event field in this structure is - used by the device to suppress notifications. If unset, the - driver should ignore the used_event field; the device should + VIRTIO_F_RING_EVENT_IDX(29) This feature enables the used_event + and the avail_event fields. If set, it indicates that the + device should ignore the flags field in the available ring + structure. Instead, the used_event field in this structure is + used by guest to suppress device interrupts. Further, the + driver should ignore the flags field in the used ring + structure. Instead, the avail_event field in this structure is + used by the device to suppress notifications. 
If unset, the + driver should ignore the used_event field; the device should ignore the avail_event field; the flags field is used @@ -2695,7 +2695,7 @@ static inline unsigned vring_size(unsigned int num, unsigned long align) static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx) { - return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx); + return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx); } /* Get location of event indices (only with VIRTIO_RING_F_EVENT_IDX) */ @@ -2717,18 +2717,18 @@ static inline uint16_t *vring_avail_event(struct vring *vr) 2.10. Creating New Device Types ============================== -Various considerations are necessary when creating a new device +Various considerations are necessary when creating a new device type. - + 2.10.1. How Many Virtqueues? --------------------------- -It is possible that a very simple device will operate entirely -through its configuration space, but most will need at least one -virtqueue in which it will place requests. A device with both -input and output (eg. console and network devices described here) -need two queues: one which the driver fills with buffers to -receive input, and one which the driver places buffers to +It is possible that a very simple device will operate entirely +through its configuration space, but most will need at least one +virtqueue in which it will place requests. A device with both +input and output (eg. console and network devices described here) +need two queues: one which the driver fills with buffers to +receive input, and one which the driver places buffers to transmit output. 2.10.2. What Configuration Space Layout? @@ -2736,16 +2736,16 @@ transmit output. Configuration space should only be used for initialization-time parameters. 
It is a limited resource with no synchronization, so for -most uses it is better to use a virtqueue to update configuration -information (the network device does this for filtering, -otherwise the table in the config space could potentially be very +most uses it is better to use a virtqueue to update configuration +information (the network device does this for filtering, +otherwise the table in the config space could potentially be very large). 2.10.3. What Device Number? -------------------------- -Currently device numbers are assigned quite freely: a simple -request mail to the author of this document or the Linux +Currently device numbers are assigned quite freely: a simple +request mail to the author of this document or the Linux virtualization mailing list[9] will be sufficient to secure a unique one. Meanwhile for experimental drivers, use 65535 and work backwards. @@ -2753,67 +2753,67 @@ Meanwhile for experimental drivers, use 65535 and work backwards. 2.10.4. How many MSI-X vectors? (for PCI) ----------------------------------------- -Using the optional MSI-X capability devices can speed up -interrupt processing by removing the need to read ISR Status -register by guest driver (which might be an expensive operation), -reducing interrupt sharing between devices and queues within the -device, and handling interrupts from multiple CPUs. However, some -systems impose a limit (which might be as low as 256) on the -total number of MSI-X vectors that can be allocated to all -devices. Devices and/or device drivers should take this into -account, limiting the number of vectors used unless the device is -expected to cause a high volume of interrupts. Devices can -control the number of vectors used by limiting the MSI-X Table -Size or not presenting MSI-X capability in PCI configuration -space. 
Drivers can control this by mapping events to as small -number of vectors as possible, or disabling MSI-X capability +Using the optional MSI-X capability devices can speed up +interrupt processing by removing the need to read ISR Status +register by guest driver (which might be an expensive operation), +reducing interrupt sharing between devices and queues within the +device, and handling interrupts from multiple CPUs. However, some +systems impose a limit (which might be as low as 256) on the +total number of MSI-X vectors that can be allocated to all +devices. Devices and/or device drivers should take this into +account, limiting the number of vectors used unless the device is +expected to cause a high volume of interrupts. Devices can +control the number of vectors used by limiting the MSI-X Table +Size or not presenting MSI-X capability in PCI configuration +space. Drivers can control this by mapping events to as small +number of vectors as possible, or disabling MSI-X capability altogether. 2.10.5. Device Improvements -------------------------- -Any change to configuration space, or new virtqueues, or -behavioural changes, should be indicated by negotiation of a new +Any change to configuration space, or new virtqueues, or +behavioural changes, should be indicated by negotiation of a new feature bit. This establishes clarity[11] and avoids future expansion problems. -Clusters of functionality which are always implemented together -can use a single bit, but if one feature makes sense without the -others they should not be gratuitously grouped together to -conserve feature bits. We can always extend the spec when the +Clusters of functionality which are always implemented together +can use a single bit, but if one feature makes sense without the +others they should not be gratuitously grouped together to +conserve feature bits. We can always extend the spec when the first person needs more than 24 feature bits for their device. 
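As an illustration of gating a device improvement on a feature bit, the sketch below intersects the offered and understood feature sets and only enables the new behaviour when the bit was negotiated. Everything named here is hypothetical: VIRTIO_EXAMPLE_F_EXTRA_CFG is an invented bit for an imaginary device type, and the two helper functions are not part of any real driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* HYPOTHETICAL feature bit for an imaginary device; not defined by the spec. */
#define VIRTIO_EXAMPLE_F_EXTRA_CFG  (1u << 3)

/*
 * The device offers every feature it knows about; the driver acknowledges
 * only those it also understands, so the result is the intersection.
 */
static uint32_t negotiate_features(uint32_t offered, uint32_t understood)
{
    return offered & understood;
}

/* The new configuration field may be touched only if the bit was negotiated. */
static bool may_use_extra_cfg(uint32_t negotiated)
{
    return (negotiated & VIRTIO_EXAMPLE_F_EXTRA_CFG) != 0;
}
```

An older driver that does not know the bit never acknowledges it, so the device keeps the old behaviour for that driver, which is the compatibility property the text relies on.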
FOOTNOTES: ========== -[1] This lack of page-sharing implies that the implementation of the -device (e.g. the hypervisor or host) needs full access to the -guest memory. Communication with untrusted parties (i.e. +[1] This lack of page-sharing implies that the implementation of the +device (e.g. the hypervisor or host) needs full access to the +guest memory. Communication with untrusted parties (i.e. inter-guest communication) requires copying. -[2] The Linux implementation further separates the PCI virtio code -from the specific virtio drivers: these drivers are shared with +[2] The Linux implementation further separates the PCI virtio code +from the specific virtio drivers: these drivers are shared with the non-PCI implementations (currently lguest and S/390). [3] The actual value within this range is ignored -[4] Historically, drivers have used the device before steps 5 and 6. -This is only allowed if the driver does not use any features +[4] Historically, drivers have used the device before steps 5 and 6. +This is only allowed if the driver does not use any features which would alter this early use of the device. -[5] ie. once you enable MSI-X on the device, the other fields move. +[5] ie. once you enable MSI-X on the device, the other fields move. If you turn it off again, they move back! -[6] The 4096 is based on the x86 page size, but it's also large -enough to ensure that the separate parts of the virtqueue are on +[6] The 4096 is based on the x86 page size, but it's also large +enough to ensure that the separate parts of the virtqueue are on separate cache lines. 
-[7] These fields are kept here because this is the only part of the +[7] These fields are kept here because this is the only part of the virtqueue written by the device -[8] The Linux drivers do this only for read-only buffers: for -write-only buffers, it is assumed that the driver is merely -trying to keep the receive buffer ring full, and no notification +[8] The Linux drivers do this only for read-only buffers: for +write-only buffers, it is assumed that the driver is merely +trying to keep the receive buffer ring full, and no notification of this expected condition is necessary. [9] https://lists.linux-foundation.org/mailman/listinfo/virtualization @@ -2824,19 +2824,19 @@ devices assumed it. In addition, the specifications for virtio_blk and virtio_scsi require intuiting field lengths from frame boundaries. -[11] Even if it does mean documenting design or implementation +[11] Even if it does mean documenting design or implementation mistakes! -[13] It was supposed to indicate segmentation offload support, but -upon further investigation it became clear that multiple bits +[13] It was supposed to indicate segmentation offload support, but +upon further investigation it became clear that multiple bits were required. -[14] ie. VIRTIO_NET_F_HOST_TSO* and VIRTIO_NET_F_HOST_UFO are -dependent on VIRTIO_NET_F_CSUM; a dvice which offers the offload -features must offer the checksum feature, and a driver which -accepts the offload features must accept the checksum feature. -Similar logic applies to the VIRTIO_NET_F_GUEST_TSO4 features +[14] ie. VIRTIO_NET_F_HOST_TSO* and VIRTIO_NET_F_HOST_UFO are +dependent on VIRTIO_NET_F_CSUM; a dvice which offers the offload +features must offer the checksum feature, and a driver which +accepts the offload features must accept the checksum feature. +Similar logic applies to the VIRTIO_NET_F_GUEST_TSO4 features depending on VIRTIO_NET_F_GUEST_CSUM. [15] This is a common restriction in real, older network cards. 
@@ -2845,44 +2845,44 @@ depending on VIRTIO_NET_F_GUEST_CSUM. the same system may not require checksumming at all, nor segmentation, if both guests are amenable. -[17] For example, consider a partially checksummed TCP (IPv4) packet. -It will have a 14 byte ethernet header and 20 byte IP header -followed by the TCP header (with the TCP checksum field 16 bytes -into that header). csum_start will be 14+20 = 34 (the TCP -checksum includes the header), and csum_offset will be 16. The -value in the TCP checksum field should be initialized to the sum -of the TCP pseudo header, so that replacing it by the ones' -complement checksum of the TCP header and body will give the +[17] For example, consider a partially checksummed TCP (IPv4) packet. +It will have a 14 byte ethernet header and 20 byte IP header +followed by the TCP header (with the TCP checksum field 16 bytes +into that header). csum_start will be 14+20 = 34 (the TCP +checksum includes the header), and csum_offset will be 16. The +value in the TCP checksum field should be initialized to the sum +of the TCP pseudo header, so that replacing it by the ones' +complement checksum of the TCP header and body will give the correct result. -[18] Due to various bugs in implementations, this field is not useful +[18] Due to various bugs in implementations, this field is not useful as a guarantee of the transport header size. -[19] This case is not handled by some older hardware, so is called out +[19] This case is not handled by some older hardware, so is called out specifically in the protocol. -[20] Note that the header will be two bytes longer for the +[20] Note that the header will be two bytes longer for the VIRTIO_NET_F_MRG_RXBUF case. -[20a] Obviously each one can be split across multiple descriptor +[20a] Obviously each one can be split across multiple descriptor elements. [21] Since there are no guarentees, it can use a hash filter or silently switch to allmulti or promiscuous mode if it is given too many addresses. 
-[22] The SCSI_CMD and SCSI_CMD_OUT types are equivalent, the device +[22] The SCSI_CMD and SCSI_CMD_OUT types are equivalent, the device does not distinguish between them. [23] The FLUSH and FLUSH_OUT types are equivalent, the device does not distinguish between them -[25] Because this is high importance and low bandwidth, the current -Linux implementation polls for the buffer to be used, rather than -waiting for an interrupt, simplifying the implementation -significantly. However, for generic serial ports with the -O_NONBLOCK flag set, the polling limitation is relaxed and the -consumed buffers are freed upon the next write or poll call or +[25] Because this is high importance and low bandwidth, the current +Linux implementation polls for the buffer to be used, rather than +waiting for an interrupt, simplifying the implementation +significantly. However, for generic serial ports with the +O_NONBLOCK flag set, the polling limitation is relaxed and the +consumed buffers are freed upon the next write or poll call or when a port is closed or hot-unplugged. [27] This is historical, and independent of the guest page size