author rusty <rusty@0c8fb4dd-22a2-4bb5-bc14-6c75a5f43652> 2013-08-16 03:17:18 +0000
committer rusty <rusty@0c8fb4dd-22a2-4bb5-bc14-6c75a5f43652> 2013-08-16 03:17:18 +0000
commit 9d6033575f5a6129e2150e86d91cf32cbb2962ed
tree 1aa17a68815b49e7869281ed1c576a587071727e
parent 06d499635d4c65c4fe92d45d7578d7dc06f5e379
Reworked spec into non-PCI order.

Issue: https://tools.oasis-open.org/issues/browse/VIRTIO-1
Signed-off-by: Rusty Russell <rusty@au1.ibm.com>
git-svn-id: https://tools.oasis-open.org/version-control/svn/virtio@2 0c8fb4dd-22a2-4bb5-bc14-6c75a5f43652
-rw-r--r-- virtio-spec.txt 1807
1 files changed, 1009 insertions, 798 deletions
diff --git a/virtio-spec.txt b/virtio-spec.txt
index dcf3918..6a3860e 100644
--- a/virtio-spec.txt
+++ b/virtio-spec.txt
@@ -1,29 +1,31 @@
-This document describes the specifications of the “virtio” family
-of PCI devices. These are devices
-are found in virtual environments,
-yet by design they are not all that different from physical PCI
-devices, and this document treats them as such. This allows the
-guest to use standard PCI drivers and discovery mechanisms.
+1. INTRODUCTION
+===============
+
+This document describes the specifications of the “virtio” family of
+devices. These devices are found in virtual environments, yet by
+design they are not all that different from physical devices, and this
+document treats them as such. This allows the guest to use standard
+drivers and discovery mechanisms.
The purpose of virtio and this specification is that virtual
environments and guests should have a straightforward, efficient,
standard and extensible mechanism for virtual devices, rather
than boutique per-environment or per-OS mechanisms.
- Straightforward: Virtio PCI devices use normal PCI mechanisms
- of interrupts and DMA which should be familiar to any device
- driver author. There is no exotic page-flipping or COW
- mechanism: it's just a PCI device.[1]
+ Straightforward: Virtio devices use normal bus mechanisms of
+ interrupts and DMA which should be familiar to any device driver
+ author. There is no exotic page-flipping or COW mechanism: it's just
+ a normal device.[1]
- Efficient: Virtio PCI devices consist of rings of descriptors
+ Efficient: Virtio devices consist of rings of descriptors
for input and output, which are neatly separated to avoid cache
effects from both guest and device writing to the same cache
lines.
- Standard: Virtio PCI makes no assumptions about the environment
- in which it operates, beyond supporting PCI. In fact the virtio
- devices specified in the appendices do not require PCI at all:
- they have been implemented on non-PCI buses.[2]
+ Standard: Virtio makes no assumptions about the environment in which
+ it operates, beyond supporting the bus attaching the device. Virtio
+ devices are implemented over PCI and other buses, and earlier drafts
+ have been implemented on other buses not included in this spec.[2]
Extensible: Virtio PCI devices contain feature bits which are
acknowledged by the guest operating system during device setup.
@@ -31,170 +33,69 @@ than boutique per-environment or per-OS mechanisms.
offers all the features it knows about, and the driver
acknowledges those it understands and wishes to use.
-1.1 Virtqueues
+1.1.1. Key words
+-----------------
-The mechanism for bulk data transport on virtio PCI devices is
-pretentiously called a virtqueue. Each device can have zero or
-more virtqueues: for example, the network device has one for
-transmit and one for receive.
+The key words must, must not, required, shall, shall not, should,
+should not, recommended, may, and optional are to be interpreted as
+described in [RFC 2119]. Note that for reasons of style, these words
+are not capitalized in this document.
-Each virtqueue occupies two or more physically-contiguous pages
-(defined, for the purposes of this specification, as 4096 bytes),
-and consists of three parts:
+1.1.2. Definitions
+-------------------
+term
+ Definition
-+-------------------+-----------------------------------+-----------+
-| Descriptor Table | Available Ring (padding) | Used Ring |
-+-------------------+-----------------------------------+-----------+
+1.1.3. Key concepts
+--------------------
+Guest
+ Definition...
-When the driver wants to send a buffer to the device, it fills in
-a slot in the descriptor table (or chains several together), and
-writes the descriptor index into the available ring. It then
-notifies the device. When the device has finished a buffer, it
-writes the descriptor into the used ring, and sends an interrupt.
+Host
+ Definition
-Specification
+Device
+ Definition
-2.1 PCI Discovery
+Driver
+ Definition
-Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through
-0x103F inclusive is a virtio device[3]. The device must also have a
-Revision ID of 0 to match this specification.
+1.2. Normative References
+=========================
-The Subsystem Device ID indicates which virtio device is
-supported by the device. The Subsystem Vendor ID should reflect
-the PCI Vendor ID of the environment (it's currently only used
-for informational purposes by the guest).
+[RFC 2119] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, http://www.ietf.org/rfc/rfc2119.txt IETF (Internet Engineering Task Force) RFC 2119, March 1997.
+1.3. Non-Normative References
+=============================
-+----------------------+--------------------+---------------+
-| Subsystem Device ID | Virtio Device | Specification |
-+----------------------+--------------------+---------------+
-+----------------------+--------------------+---------------+
-| 1 | network card | Appendix C |
-+----------------------+--------------------+---------------+
-| 2 | block device | Appendix D |
-+----------------------+--------------------+---------------+
-| 3 | console | Appendix E |
-+----------------------+--------------------+---------------+
-| 4 | entropy source | Appendix F |
-+----------------------+--------------------+---------------+
-| 5 | memory ballooning | Appendix G |
-+----------------------+--------------------+---------------+
-| 6 | ioMemory | - |
-+----------------------+--------------------+---------------+
-| 7 | rpmsg | - |
-+----------------------+--------------------+---------------+
-| 8 | SCSI host | Appendix I |
-+----------------------+--------------------+---------------+
-| 9 | 9P transport | - |
-+----------------------+--------------------+---------------+
-| 10 | mac80211 wlan | - |
-+----------------------+--------------------+---------------+
-
-
-2.2 Device Configuration
-To configure the device, we use the first I/O region of the PCI
-device. This contains a virtio header followed by a
-device-specific region.
-There may be different widths of accesses to the I/O region; the
-“natural” access method for each field in the virtio header must be
-used (i.e. 32-bit accesses for 32-bit fields, etc), but the
-device-specific region can be accessed using any width accesses, and
-should obtain the same results.
+2 The Virtio Standard
+=====================
-Note that this is possible because while the virtio header is PCI
-(i.e. little) endian, the device-specific region is encoded in
-the native endian of the guest (where such distinction is
-applicable).
-
-2.2.1 Device Initialization Sequence
-
-We start with an overview of device initialization, then expand
-on the details of the device and how each step is preformed.
+2.1 Basic Facilities of a Virtio Device
+=======================================
-1. Reset the device. This is not required on initial start up.
+A virtio device is discovered and identified by a bus-specific method
+(see the bus specific sections *XREF*). Each device consists of the following
+parts:
-2. The ACKNOWLEDGE status bit is set: we have noticed the device.
+o Device Status field
+o Feature bits
+o Configuration space
+o One or more virtqueues
-3. The DRIVER status bit is set: we know how to drive the device.
-
-4. Device-specific setup, including reading the Device Feature
- Bits, discovery of virtqueues for the device, optional MSI-X
- setup, and reading and possibly writing the virtio
- configuration space.
-
-5. The subset of Device Feature Bits understood by the driver is
- written to the device.
-
-6. The DRIVER_OK status bit is set.
-
-7. The device can now be used (ie. buffers added to the
- virtqueues)[4]
-
-If any of these steps go irrecoverably wrong, the guest should
-set the FAILED status bit to indicate that it has given up on the
-device (it can reset the device later to restart if desired).
-
-We now cover the fields required for general setup in detail.
-
-2.2.2 Virtio Header
-
-The virtio header looks as follows:
-
-
-+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
-| Bits || 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 |
-+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
-| Read/Write || R | R+W | R+W | R | R+W | R+W | R+W | R |
-+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
-| Purpose || Device | Guest | Queue | Queue | Queue | Queue | Device | ISR |
-| || Features bits 0:31 | Features bits 0:31 | Address | Size | Select | Notify | Status | Status |
-+------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
-
-
-If MSI-X is enabled for the device, two additional fields
-immediately follow this header:[5]
-
-
-+------------++----------------+--------+
-| Bits || 16 | 16 |
- +----------------+--------+
-+------------++----------------+--------+
-| Read/Write || R+W | R+W |
-+------------++----------------+--------+
-| Purpose || Configuration | Queue |
-| (MSI-X) || Vector | Vector |
-+------------++----------------+--------+
-
-
-Immediately following these general headers, there may be
-device-specific headers:
-
-
-+------------++--------------------+
-| Bits || Device Specific |
- +--------------------+
-+------------++--------------------+
-| Read/Write || Device Specific |
-+------------++--------------------+
-| Purpose || Device Specific... |
-| || |
-+------------++--------------------+
-
-
-2.2.2.1 Device Status
+2.1.1 Device Status Field
+-------------------------
The Device Status field is updated by the guest to indicate its
progress. This provides a simple low-level diagnostic: it's most
useful to imagine them hooked up to traffic lights on the console
indicating the status of each device.
-The device can be reset by writing a 0 to this field, otherwise
-at least one bit should be set:
+This field is 0 upon reset, otherwise at least one bit should be set:
ACKNOWLEDGE (1) Indicates that the guest OS has found the
device and recognized it as a valid virtio device.
@@ -213,105 +114,68 @@ at least one bit should be set:
even a fatal error during device operation. The device must be
reset before attempting to re-initialize.
-2.2.2.2 Feature Bits
+2.1.2 Feature Bits
+------------------
-The first configuration field indicates the features that the
-device supports. The bits are allocated as follows:
+Each virtio device lists all the features it understands. During
+device initialization, the guest reads this and tells the device the
+subset that it understands. The only way to renegotiate is to reset
+the device.
- 0 to 23 Feature bits for the specific device type
+This allows for forwards and backwards compatibility: if the device is
+enhanced with a new feature bit, older guests will not write that
+feature bit back to the device and it can go into backwards
+compatibility mode. Similarly, if a guest is enhanced with a feature
+that the device doesn't support, it sees that the new feature is not offered
+and can go into backwards compatibility mode (or, for poor
+implementations, set the FAILED Device Status bit).
+
+Feature bits are allocated as follows:
- 24 to 32 Feature bits reserved for extensions to the queue and
+ 0 to 23: Feature bits for the specific device type
+
+ 24 to 32: Feature bits reserved for extensions to the queue and
feature negotiation mechanisms
For example, feature bit 0 for a network device (i.e. Subsystem
Device ID 1) indicates that the device supports checksumming of
packets.
-The feature bits are negotiated: the device lists all the
-features it understands in the Device Features field, and the
-guest writes the subset that it understands into the Guest
-Features field. The only way to renegotiate is to reset the
-device.
-
-In particular, new fields in the device configuration header are
+In particular, new fields in the device configuration space are
indicated by offering a feature bit, so the guest can check
before accessing that part of the configuration space.
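
As an illustration of the negotiation described above, here is a minimal sketch in C. The bit numbers and function names are hypothetical and are not taken from the spec; the only point is that the guest acknowledges the intersection of what the device offers and what the driver understands, and checks a bit before relying on the feature it guards.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bit numbers, for illustration only. */
#define EXAMPLE_F_CSUM      0u  /* a device-specific bit (range 0 to 23) */
#define EXAMPLE_F_NEW_FIELD 5u  /* a bit guarding a newer configuration field */

/*
 * The subset the guest writes back to the device: features the device
 * offers AND the driver understands.
 */
static uint32_t negotiate_features(uint32_t device_features,
                                   uint32_t driver_supported)
{
    return device_features & driver_supported;
}

/* Returns nonzero if the given feature bit was negotiated. */
static int has_feature(uint32_t features, unsigned int bit)
{
    assert(bit < 32);
    return (int)((features >> bit) & 1u);
}
```

An older device that offers only bit 0 to a newer driver ends up with only bit 0 acknowledged, so the driver must not touch whatever EXAMPLE_F_NEW_FIELD would have guarded.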
-This allows for forwards and backwards compatibility: if the
-device is enhanced with a new feature bit, older guests will not
-write that feature bit back to the Guest Features field and it
-can go into backwards compatibility mode. Similarly, if a guest
-is enhanced with a feature that the device doesn't support, it
-will not see that feature bit in the Device Features field and
-can go into backwards compatibility mode (or, for poor
-implementations, set the FAILED Device Status bit).
-
-2.2.2.3 Configuration/Queue Vectors
-
-When MSI-X capability is present and enabled in the device
-(through standard PCI configuration space) 4 bytes at byte offset
-20 are used to map configuration change and queue interrupts to
-MSI-X vectors. In this case, the ISR Status field is unused, and
-device specific configuration starts at byte offset 24 in virtio
-header structure. When MSI-X capability is not enabled, device
-specific configuration starts at byte offset 20 in virtio header.
-
-Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of
-Configuration/Queue Vector registers, maps interrupts triggered
-by the configuration change/selected queue events respectively to
-the corresponding MSI-X vector. To disable interrupts for a
-specific event type, unmap it by writing a special NO_VECTOR
-value:
-
-/* Vector value used to disable MSI for queue */
-
-#define VIRTIO_MSI_NO_VECTOR 0xffff
-
-Reading these registers returns vector mapped to a given event,
-or NO_VECTOR if unmapped. All queue and configuration change
-events are unmapped by default.
-
-Note that mapping an event to vector might require allocating
-internal device resources, and might fail. Devices report such
-failures by returning the NO_VECTOR value when the relevant
-Vector field is read. After mapping an event to vector, the
-driver must verify success by reading the Vector field value: on
-success, the previously written value is returned, and on
-failure, NO_VECTOR is returned. If a mapping failure is detected,
-the driver can retry mapping with fewervectors, or disable MSI-X.
+2.1.3 Configuration Space
+-------------------------
-2.3 Virtqueue Configuration
+Configuration space is generally used for rarely-changing or
+initialization-time parameters.
-As a device can have zero or more virtqueues for bulk data
-transport (for example, the network driver has two), the driver
-needs to configure them as part of the device-specific
-configuration.
-
-This is done as follows, for each virtqueue a device has:
-
-1. Write the virtqueue index (first queue is 0) to the Queue
- Select field.
+Note that this space is generally the guest's native endian,
+rather than PCI's little-endian.
-2. Read the virtqueue size from the Queue Size field, which is
- always a power of 2. This controls how big the virtqueue is
- (see below). If this field is 0, the virtqueue does not exist.
+2.1.4 Virtqueues
+----------------
-3. Allocate and zero virtqueue in contiguous physical memory, on
- a 4096 byte alignment. Write the physical address, divided by
- 4096 to the Queue Address field.[6]
+The mechanism for bulk data transport on virtio devices is
+pretentiously called a virtqueue. Each device can have zero or more
+virtqueues: for example, the simplest network device has one for
+transmit and one for receive. Each queue has a 16-bit queue size
+parameter, which sets the number of entries and implies the total size
+of the queue.
-4. Optionally, if MSI-X capability is present and enabled on the
- device, select a vector to use to request interrupts triggered
- by virtqueue events. Write the MSI-X Table entry number
- corresponding to this vector in Queue Vector field. Read the
- Queue Vector field: on success, previously written value is
- returned; on failure, NO_VECTOR value is returned.
+Each virtqueue occupies two or more physically-contiguous pages
+(usually defined as 4096 bytes, but depending on the transport)
+and consists of three parts:
-The Queue Size field controls the total number of bytes required
-for the virtqueue according to the following formula:
++-------------------+-----------------------------------+-----------+
+| Descriptor Table | Available Ring (padding) | Used Ring |
++-------------------+-----------------------------------+-----------+
- #define ALIGN(x) (((x) + 4095) & ~4095)
+The bus-specific Queue Size field controls the total number of bytes
+required for the virtqueue according to the following formula:
+ #define ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
static inline unsigned vring_size(unsigned int qsz)
{
return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2 + qsz))
@@ -319,34 +183,53 @@ for the virtqueue according to the following formula:
}
This currently wastes some space with padding, but also allows
-future extensions. The virtqueue layout structure looks like this
-(qsz is the Queue Size field, which is a variable, so this code
-won't compile):
+future extensions. The virtqueue layout structure looks like this:
struct vring {
- /* The actual descriptors (16 bytes each) */
- struct vring_desc desc[qsz];
+ // The actual descriptors (16 bytes each)
+ struct vring_desc desc[ Queue Size ];
- /* A ring of available descriptor heads with free-running index. */
+ // A ring of available descriptor heads with free-running index.
struct vring_avail avail;
- // Padding to the next 4096 boundary.
- char pad[];
+ // Padding to the next PAGE_SIZE boundary.
+ char pad[ Padding ];
// A ring of used descriptor heads with free-running index.
struct vring_used used;
};
-2.3.1 A Note on Virtqueue Endianness
+When the driver wants to send a buffer to the device, it fills in
+a slot in the descriptor table (or chains several together), and
+writes the descriptor index into the available ring. It then
+notifies the device. When the device has finished a buffer, it
+writes the descriptor into the used ring, and sends an interrupt.
+
+2.1.4.1 A Note on Virtqueue Endianness
+--------------------------------------
-Note that the endian of these fields and everything else in the
-virtqueue is the native endian of the guest, not little-endian as
-PCI normally is. This makes for simpler guest code, and it is
-assumed that the host already has to be deeply aware of the guest
-endian so such an “endian-aware” device is not a significant
-issue.
+Note that the endian of the fields in the virtqueue is the native
+endian of the guest, not little-endian as PCI normally is. This makes
+for simpler guest code, and it is assumed that the host already has to
+be deeply aware of the guest endian so such an “endian-aware” device
+is not a significant issue.
+
+2.1.4.2 Message Framing
+-----------------------
+
+The descriptors used for a buffer should not affect the semantics
+of the message, except for the total length of the buffer. For
+example, a network buffer consists of a 10 byte header followed
+by the network packet. Whether this is presented in the ring
+descriptor chain as (say) a 10 byte buffer and a 1514 byte
+buffer, or a single 1524 byte buffer, or even three buffers,
+should have no effect.
+
+In particular, no implementation should use the descriptor
+boundaries to determine the size of any header in a request.[10]
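
One way to honour this framing rule is to gather header bytes from the chain without looking at where one descriptor ends and the next begins. The sketch below uses a hypothetical example_buf type to stand in for an already-mapped descriptor; it is an illustration, not the method of any particular implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of the memory behind one descriptor in a chain. */
struct example_buf {
    const unsigned char *data;
    size_t len;
};

/*
 * Copy the first `want` bytes of the message into `out`, walking as many
 * chain elements as needed. Returns the number of bytes actually copied,
 * which is less than `want` only if the whole chain is shorter.
 */
static size_t example_gather(const struct example_buf *chain, size_t n,
                             unsigned char *out, size_t want)
{
    size_t copied = 0;

    assert(chain != NULL && out != NULL);
    for (size_t i = 0; i < n && copied < want; i++) {
        size_t take = chain[i].len;
        if (take > want - copied)
            take = want - copied;
        memcpy(out + copied, chain[i].data, take);
        copied += take;
    }
    return copied;
}
```

With this approach a 10 byte header is read identically whether the message arrives as one buffer, as a 10 byte buffer plus payload, or split at arbitrary points.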
-2.3.2 Descriptor Table
+2.1.4.3 The Virtqueue Descriptor Table
+--------------------------------------
The descriptor table refers to the buffers the guest is using for
the device. The addresses are physical addresses, and the buffers
@@ -374,17 +257,18 @@ No descriptor chain may be more than 2^32 bytes long in total.
u16 next;
};
-The number of descriptors in the table is specified by the Queue
-Size field for this virtqueue.
+The number of descriptors in the table is defined by the queue size
+for this virtqueue.
-2.3.3 Indirect Descriptors
+2.1.4.3.1 Indirect Descriptors
+------------------------------
Some devices benefit by concurrently dispatching a large number
of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be
-used to allow this (see Appendix B: Reserved Feature Bits). To increase
+used to allow this (see FIXME: Reserved Feature Bits). To increase
ring capacity it is possible to store a table of indirect
descriptors anywhere in memory, and insert a descriptor in main
-virtqueue (with flags&INDIRECT on) that refers to memory buffer
+virtqueue (with flags&VRING_DESC_F_INDIRECT on) that refers to memory buffer
containing this indirect descriptor table; fields addr and len
refer to the indirect table address and length in bytes,
respectively. The indirect table layout structure looks like this
@@ -399,40 +283,42 @@ which is a variable, so this code won't compile):
The first indirect descriptor is located at start of the indirect
descriptor table (index 0), additional indirect descriptors are
chained by next field. An indirect descriptor without next field
-(with flags&NEXT off) signals the end of the indirect descriptor
+(with flags&VRING_DESC_F_NEXT off) signals the end of the indirect descriptor
table, and transfers control back to the main virtqueue. An
indirect descriptor can not refer to another indirect descriptor
-table (flags&INDIRECT must be off). A single indirect descriptor
+table (flags&VRING_DESC_F_INDIRECT must be off). A single indirect descriptor
table can include both read-only and write-only descriptors;
-write-only flag (flags&WRITE) in the descriptor that refers to it
+write-only flag (flags&VRING_DESC_F_WRITE) in the descriptor that refers to it
is ignored.
-2.3.4 Available Ring
+2.1.4.4 The Virtqueue Available Ring
+------------------------------------
The available ring refers to what descriptors we are offering the
device: it refers to the head of a descriptor chain. The “flags” field
is currently 0 or 1: 1 indicating that we do not need an interrupt
when the device consumes a descriptor from the available
ring. Alternatively, the guest can ask the device to delay interrupts
-until an entry with an index specified by the “ used_event” field is
+until an entry with an index specified by the “used_event” field is
written in the used ring (equivalently, until the idx field in the
used ring will reach the value used_event + 1). The method employed by
the device is controlled by the VIRTIO_RING_F_EVENT_IDX feature bit
-(see Appendix B: Reserved Feature Bits). This interrupt suppression is
+(see FIXME: Reserved Feature Bits). This interrupt suppression is
merely an optimization; it may not suppress interrupts entirely.
The “idx” field indicates where we would put the next descriptor
-entry (modulo the ring size). This starts at 0, and increases.
+entry (modulo the queue size). This starts at 0, and increases.
struct vring_avail {
#define VRING_AVAIL_F_NO_INTERRUPT 1
u16 flags;
u16 idx;
- u16 ring[qsz]; /* qsz is the Queue Size field read from device */
+ u16 ring[ /* Queue Size */ ];
u16 used_event;
};
-2.3.5 Used Ring
+2.1.4.5 The Virtqueue Used Ring
+-------------------------------
The used ring is where the device returns buffers once it is done
with them. The flags field can be used by the device to hint that
@@ -443,7 +329,7 @@ with an index specified by the “avail_event” is written in the
available ring (equivalently, until the idx field in the
available ring will reach the value avail_event + 1). The method
employed by the device is controlled by the guest through the
-VIRTIO_RING_F_EVENT_IDX feature bit (see Appendix B: Reserved
+VIRTIO_RING_F_EVENT_IDX feature bit (see FIXME: Reserved
Feature Bits).[7]
Each entry in the ring is a pair: the head entry of the
@@ -466,31 +352,68 @@ the buffer to ensure no data leakage occurs.
#define VRING_USED_F_NO_NOTIFY 1
u16 flags;
u16 idx;
- struct vring_used_elem ring[qsz];
+ struct vring_used_elem ring[ /* Queue Size */];
u16 avail_event;
};
-2.3.6 Helpers for Managing Virtqueues
+2.1.4.6 Helpers for Operating Virtqueues
+----------------------------------------
The Linux Kernel Source code contains the definitions above and
helper routines in a more usable form, in
include/linux/virtio_ring.h. This was explicitly licensed by IBM
and Red Hat under the (3-clause) BSD license so that it can be
freely used by all other projects, and is reproduced (with slight
-variation to remove Linux assumptions) in Appendix A.
+variation to remove Linux assumptions) in *XREF*.
-2.4 Device Operation
+2.2 General Initialization And Device Operation
+===============================================
+
+We start with an overview of device initialization, then expand on the
+details of the device and how each step is performed. This section
+should be read along with the bus-specific section which describes
+how to communicate with the specific device.
+
+2.2.1 Device Initialization
+---------------------------
+
+1. Reset the device. This is not required on initial start up.
+
+2. The ACKNOWLEDGE status bit is set: we have noticed the device.
+
+3. The DRIVER status bit is set: we know how to drive the device.
+
+4. Device-specific setup, including reading the device feature
+ bits, discovery of virtqueues for the device, optional per-bus
+ setup, and reading and possibly writing the device's virtio
+ configuration space.
+
+5. The subset of device feature bits understood by the driver is
+ written to the device.
+
+6. The DRIVER_OK status bit is set.
+
+7. The device can now be used (ie. buffers added to the
+ virtqueues)[4]
+
+If any of these steps go irrecoverably wrong, the guest should
+set the FAILED status bit to indicate that it has given up on the
+device (it can reset the device later to restart if desired).
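
The sequence above can be sketched against a simulated device as follows. Only the value of ACKNOWLEDGE (1) appears in this excerpt; the values used here for DRIVER (2), DRIVER_OK (4) and FAILED (128) are assumptions, as are the struct and function names. Real access to the status and feature fields is bus-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Status bit values; all but ACKNOWLEDGE are assumptions in this sketch. */
#define EXAMPLE_STATUS_ACKNOWLEDGE 1u
#define EXAMPLE_STATUS_DRIVER      2u
#define EXAMPLE_STATUS_DRIVER_OK   4u
#define EXAMPLE_STATUS_FAILED      128u

/* Hypothetical in-memory stand-in for a device's registers. */
struct example_device {
    uint8_t  status;
    uint32_t device_features;  /* offered by the device */
    uint32_t guest_features;   /* written back by the driver */
};

/* Returns 0 on success, -1 if the driver gave up and set FAILED. */
static int example_init(struct example_device *dev,
                        uint32_t driver_supported, uint32_t driver_required)
{
    assert(dev != NULL);

    dev->status = 0;                              /* 1. reset */
    dev->status |= EXAMPLE_STATUS_ACKNOWLEDGE;    /* 2. noticed the device */
    dev->status |= EXAMPLE_STATUS_DRIVER;         /* 3. know how to drive it */

    /* 4. device-specific setup: here, only reading the feature bits */
    uint32_t acked = dev->device_features & driver_supported;
    if ((acked & driver_required) != driver_required) {
        dev->status |= EXAMPLE_STATUS_FAILED;     /* irrecoverable: give up */
        return -1;
    }

    dev->guest_features = acked;                  /* 5. write the subset */
    dev->status |= EXAMPLE_STATUS_DRIVER_OK;      /* 6. ready for use */
    return 0;
}
```

The driver_required parameter is an addition for the sake of the example: it models a driver that cannot operate without some feature and therefore takes the FAILED path.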
+
+2.2.2 Device Operation
+----------------------
There are two parts to device operation: supplying new buffers to
the device, and processing used buffers from the device. As an
-example, the virtio network device has two virtqueues: the
+example, the simplest virtio network device has two virtqueues: the
transmit virtqueue and the receive virtqueue. The driver adds
outgoing (read-only) packets to the transmit virtqueue, and then
frees them after they are used. Similarly, incoming (write-only)
buffers are added to the receive virtqueue, and processed after
they are used.
-2.4.1 Supplying Buffers to The Device
+2.2.2.1 Supplying Buffers to The Device
+---------------------------------------
Actual transfer of buffers from the guest OS to the device
operates as follows:
@@ -531,14 +454,15 @@ distinguish between a full and empty buffer.
Here is a description of each stage in more detail.
-2.4.1.1 Placing Buffers Into The Descriptor Table
+2.2.2.1.1 Placing Buffers Into The Descriptor Table
+---------------------------------------------------
A buffer consists of zero or more read-only physically-contiguous
elements followed by zero or more physically-contiguous
write-only elements (it must have at least one element). This
algorithm maps it into the descriptor table:
-1. for each buffer element, b:
+for each buffer element, b:
(a) Get the next free descriptor table entry, d
@@ -560,7 +484,8 @@ In practice, the d.next fields are usually used to chain free
descriptors, and a separate count kept to check there are enough
free descriptors before beginning the mappings.
-2.4.1.2 Updating The Available Ring
+2.2.2.1.2 Updating The Available Ring
+-------------------------------------
The head of the buffer we mapped is the first d in the algorithm
above. A naive implementation would do the following:
@@ -573,44 +498,47 @@ device), so we keep a counter of how many we've added:
avail->ring[(avail->idx + added++) % qsz] = head;
-2.4.1.3 Updating The Index Field
+2.2.2.1.3 Updating The Index Field
+----------------------------------
-Once the idx field of the virtqueue is updated, the device will
+Once the index field of the virtqueue is updated, the device will
be able to access the descriptor entries we've created and the
memory they refer to. This is why a memory barrier is generally
-used before the idx update, to ensure it sees the most up-to-date
+used before the index update, to ensure it sees the most up-to-date
copy.
-The idx field always increments, and we let it wrap naturally at
+The index field always increments, and we let it wrap naturally at
65536:
avail->idx += added;
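
Putting the two steps together, a batch of heads can be written into the ring first and exposed with a single index update. This sketch assumes a fixed queue size of 8 and uses the C11 atomic_thread_fence() as a stand-in for whatever memory barrier the platform requires; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define EXAMPLE_QSZ 8u  /* assumption: a small power-of-two queue size */

struct example_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[EXAMPLE_QSZ];
    uint16_t used_event;
};

/* Heads written to the ring but not yet exposed through idx. */
struct example_batch {
    uint16_t added;
};

/* Place one descriptor head in the next free slot, without touching idx. */
static void example_queue_head(struct example_avail *avail,
                               struct example_batch *b, uint16_t head)
{
    assert(b->added < EXAMPLE_QSZ);
    avail->ring[(uint16_t)(avail->idx + b->added) % EXAMPLE_QSZ] = head;
    b->added++;
}

/* Expose the whole batch: barrier first, then the index update. */
static void example_publish(struct example_avail *avail,
                            struct example_batch *b)
{
    atomic_thread_fence(memory_order_release);
    avail->idx = (uint16_t)(avail->idx + b->added);  /* wraps at 65536 */
    b->added = 0;
}
```

Because the queue size is a power of 2 that divides 65536, the slot computation stays correct when the free-running index wraps.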
-2.4.1.4 Notifying The Device
+2.2.2.1.4 Notifying The Device
+------------------------------
-Device notification occurs by writing the 16-bit virtqueue index
-of this virtqueue to the Queue Notify field of the virtio header
-in the first I/O region of the PCI device. This can be expensive,
-however, so the device can suppress such notifications if it
-doesn't need them. We have to be careful to expose the new idx
-value before checking the suppression flag: it's OK to notify
+The actual method of device notification is bus-specific, but generally
+it can be expensive. So the device can suppress such notifications if it
+doesn't need them. We have to be careful to expose the new index
+value before checking if notifications are suppressed: it's OK to notify
gratuitously, but not to omit a required notification. So again,
we use a memory barrier here before reading the flags or the
avail_event field.
-If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated, and if
-the VRING_USED_F_NOTIFY flag is not set, we go ahead and write to
-the PCI configuration space.
+If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated, and if the
+VRING_USED_F_NO_NOTIFY flag is not set, we go ahead and notify the
+device.
If the VIRTIO_F_RING_EVENT_IDX feature is negotiated, we read the
avail_event field in the available ring structure. If the
available index crossed the avail_event field value since the
last notification, we go ahead and write to the PCI configuration
-space. The avail_event field wraps naturally at 65536 as well:
+space. The avail_event field wraps naturally at 65536 as well,
+giving the following algorithm for calculating whether a device needs
+notification:
(u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx)
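
The comparison above relies on unsigned 16-bit wraparound, which is easy to get wrong when the operands are promoted to int. A small sketch with explicit casts follows; the function name is hypothetical, and the same test applies to used_event on the device side.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns nonzero if the index moved past event_idx while going from
 * old_idx to new_idx, that is, if event_idx lies in [old_idx, new_idx)
 * modulo 65536. The casts force the subtraction results back to 16 bits.
 */
static int example_need_event(uint16_t event_idx, uint16_t new_idx,
                              uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

For example, with old_idx 65535 and new_idx 1 (two entries added across the wrap), an event index of 65535 or 0 triggers a notification, while an event index of 1 does not.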
-2.4.2 Receiving Used Buffers From The Device
+2.2.2.2 Receiving Used Buffers From The Device
+----------------------------------------------
Once the device has used a buffer (read from or written to it, or
parts of both, depending on the nature of the virtqueue and the
@@ -621,13 +549,13 @@ buffer:
1. Write the head descriptor number to the next field in the used
ring.
-2. Update the used ring idx.
+2. Update the used ring index.
-3. Determine whether an interrupt is necessary:
+3. Deliver an interrupt if necessary:
(a) If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated:
- check if f the VRING_AVAIL_F_NO_INTERRUPT flag is not set in
- avail->flags
+ check if the VRING_AVAIL_F_NO_INTERRUPT flag is not set in
+ avail->flags.
(b) If the VIRTIO_F_RING_EVENT_IDX feature is negotiated: check
whether the used index crossed the used_event field value
@@ -635,7 +563,204 @@ buffer:
at 65536 as well:
(u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx)
-4. If an interrupt is necessary:
+For each ring, the guest should then disable interrupts by setting the
+VRING_AVAIL_F_NO_INTERRUPT flag in the avail structure, if required.
+It can then process used ring entries, finally enabling interrupts
+by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or updating the
+EVENT_IDX field in the available structure. The guest should then
+execute a memory barrier, and then recheck the ring empty
+condition. This is necessary to handle the case where, after the
+last check and before enabling interrupts, an interrupt has been
+suppressed by the device:
+
+    vring_disable_interrupts(vq);
+
+    for (;;) {
+        /* Ring looks empty: enable interrupts, then recheck. */
+        if (vq->last_seen_used == vring->used.idx) {
+            vring_enable_interrupts(vq);
+            mb();
+
+            if (vq->last_seen_used == vring->used.idx)
+                break;
+
+            /* A buffer arrived in the window: keep processing. */
+            vring_disable_interrupts(vq);
+        }
+
+        struct vring_used_elem *e = &vring->used.ring[vq->last_seen_used%vsz];
+        process_buffer(e);
+        vq->last_seen_used++;
+    }
+
+2.2.2.3 Notification of Device Configuration Changes
+----------------------------------------------------
+
+For devices where the configuration information can be changed, an
+interrupt is delivered when a configuration change occurs.
+
+
+
+2.4 Virtio Transport Options
+============================
+
+Virtio can use various different busses, thus the standard is split
+into virtio general and bus-specific sections.
+
+2.4.1 Virtio Over PCI Bus
+-------------------------
+
+Virtio devices are commonly implemented as PCI devices.
+
+2.4.1.1 PCI Device Discovery
+----------------------------
+
+Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through
+0x103F inclusive is a virtio device[3]. The device must also have a
+Revision ID of 0 to match this specification.
+
+The Subsystem Device ID indicates which virtio device is
+supported by the device. The Subsystem Vendor ID should reflect
+the PCI Vendor ID of the environment (it's currently only used
+for informational purposes by the guest).
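The discovery rule above can be sketched as a simple predicate. This
is only an illustration: struct pci_ids, the macro names and the
function name are hypothetical, and the sketch assumes the driver has
already read the IDs from PCI configuration space by some other
means.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the text above; the macro names are assumptions. */
#define VIRTIO_PCI_VENDOR_ID      0x1AF4
#define VIRTIO_PCI_DEVICE_ID_MIN  0x1000
#define VIRTIO_PCI_DEVICE_ID_MAX  0x103F

/* Hypothetical container for IDs already read from config space. */
struct pci_ids {
    uint16_t vendor_id;
    uint16_t device_id;
    uint8_t  revision_id;
    uint16_t subsystem_device_id;  /* selects the virtio device type */
};

/* True if the IDs match a virtio device as described above. */
static bool is_virtio_pci_device(const struct pci_ids *ids)
{
    return ids->vendor_id == VIRTIO_PCI_VENDOR_ID &&
           ids->device_id >= VIRTIO_PCI_DEVICE_ID_MIN &&
           ids->device_id <= VIRTIO_PCI_DEVICE_ID_MAX &&
           ids->revision_id == 0;
}
```

Once the predicate matches, the Subsystem Device ID tells the driver
which virtio device type it is dealing with.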
+
+2.4.1.2 PCI Device Layout
+-------------------------
+
+To configure the device, we use the first I/O region of the PCI
+device. This contains a virtio header followed by a
+device-specific region.
+
+There may be different widths of accesses to the I/O region; the
+“natural” access method for each field in the virtio header must be
+used (i.e. 32-bit accesses for 32-bit fields, etc), but the
+device-specific region can be accessed using any width accesses, and
+should obtain the same results.
+
+Note that this is possible because while the virtio header is PCI
+(i.e. little) endian, the device-specific region is encoded in
+the native endian of the guest (where such distinction is
+applicable).
+
+2.4.1.2.1 PCI Device Virtio Header
+----------------------------------
+
+The virtio header looks as follows:
+
++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
+| Bits       || 32                  | 32                  | 32       | 16     | 16      | 16      | 8       | 8      |
++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
+| Read/Write || R                   | R+W                 | R+W      | R      | R+W     | R+W     | R+W     | R      |
++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
+| Purpose    || Device              | Guest               | Queue    | Queue  | Queue   | Queue   | Device  | ISR    |
+|            || Features bits 0:31  | Features bits 0:31  | Address  | Size   | Select  | Notify  | Status  | Status |
++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
+
+
+If MSI-X is enabled for the device, two additional fields
+immediately follow this header:[5]
+
+
++------------++----------------+--------+
+| Bits       || 16             | 16     |
++------------++----------------+--------+
+| Read/Write || R+W            | R+W    |
++------------++----------------+--------+
+| Purpose    || Configuration  | Queue  |
+| (MSI-X)    || Vector         | Vector |
++------------++----------------+--------+
+
+Immediately following these general headers, there may be
+device-specific headers:
+
++------------++--------------------+
+| Bits       || Device Specific    |
++------------++--------------------+
+| Read/Write || Device Specific    |
++------------++--------------------+
+| Purpose    || Device Specific... |
+|            ||                    |
++------------++--------------------+
+
+2.4.1.3 PCI-specific Initialization And Device Operation
+--------------------------------------------------------
+
+The page size for a virtqueue on a PCI virtio device is defined as
+4096 bytes.
+
+2.4.1.3.1 Device Initialization
+-------------------------------
+
+2.4.1.3.1.1 Queue Vector Configuration
+--------------------------------------
+
+When MSI-X capability is present and enabled in the device
+(through standard PCI configuration space), 4 bytes at byte offset
+20 are used to map configuration change and queue interrupts to
+MSI-X vectors. In this case, the ISR Status field is unused, and
+device-specific configuration starts at byte offset 24 in the
+virtio header structure. When MSI-X capability is not enabled,
+device-specific configuration starts at byte offset 20 in the
+virtio header.
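The byte offsets quoted above follow from the field widths in the
header tables. A minimal sketch of that arithmetic is shown below;
the VHDR_* names and the helper function are assumptions made for
illustration and are not defined by this document.

```c
#include <stdbool.h>
#include <stddef.h>

/* Byte offsets implied by the field widths in the header tables.
 * All names here are hypothetical. */
enum {
    VHDR_DEVICE_FEATURES    = 0,   /* 32 bits */
    VHDR_GUEST_FEATURES     = 4,   /* 32 bits */
    VHDR_QUEUE_ADDRESS      = 8,   /* 32 bits */
    VHDR_QUEUE_SIZE         = 12,  /* 16 bits */
    VHDR_QUEUE_SELECT       = 14,  /* 16 bits */
    VHDR_QUEUE_NOTIFY       = 16,  /* 16 bits */
    VHDR_DEVICE_STATUS      = 18,  /*  8 bits */
    VHDR_ISR_STATUS         = 19,  /*  8 bits */
    VHDR_MSIX_CONFIG_VECTOR = 20,  /* 16 bits, only with MSI-X enabled */
    VHDR_MSIX_QUEUE_VECTOR  = 22   /* 16 bits, only with MSI-X enabled */
};

/* Offset at which the device-specific region begins. */
static size_t virtio_pci_device_config_offset(bool msix_enabled)
{
    return msix_enabled ? 24 : 20;
}
```

A driver would add the returned offset to the start of the first I/O
region to reach the device-specific fields.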
+
+Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of the
+Configuration/Queue Vector registers maps interrupts triggered
+by the configuration change/selected queue events respectively to
+the corresponding MSI-X vector. To disable interrupts for a
+specific event type, unmap it by writing a special NO_VECTOR
+value:
+
+ /* Vector value used to disable MSI for queue */
+ #define VIRTIO_MSI_NO_VECTOR 0xffff
+
+Reading these registers returns the vector mapped to a given event,
+or NO_VECTOR if unmapped. All queue and configuration change
+events are unmapped by default.
+
+Note that mapping an event to a vector might require allocating
+internal device resources, and might fail. Devices report such
+failures by returning the NO_VECTOR value when the relevant
+Vector field is read. After mapping an event to a vector, the
+driver must verify success by reading the Vector field value: on
+success, the previously written value is returned, and on
+failure, NO_VECTOR is returned. If a mapping failure is detected,
+the driver can retry mapping with fewer vectors, or disable MSI-X.
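The write-then-read-back check described above might look roughly
like this. The struct vector_reg type and its accessors are a
made-up in-memory model of a Vector register (a real driver would
use port I/O on the virtio header); only VIRTIO_MSI_NO_VECTOR comes
from the text.

```c
#include <stdint.h>

/* Vector value used to disable MSI for queue (from the text above). */
#define VIRTIO_MSI_NO_VECTOR 0xffff

/* Hypothetical model of a Configuration/Queue Vector register on a
 * device that can only map a limited number of vectors. */
struct vector_reg {
    uint16_t value;
    uint16_t vectors_available;
};

static void vector_reg_write(struct vector_reg *r, uint16_t v)
{
    /* The model reports a failed mapping as NO_VECTOR. */
    if (v == VIRTIO_MSI_NO_VECTOR || v < r->vectors_available)
        r->value = v;
    else
        r->value = VIRTIO_MSI_NO_VECTOR;
}

static uint16_t vector_reg_read(const struct vector_reg *r)
{
    return r->value;
}

/* Map an event to a vector and verify it: 0 on success, -1 on failure. */
static int map_event_to_vector(struct vector_reg *r, uint16_t vector)
{
    vector_reg_write(r, vector);
    return vector_reg_read(r) == vector ? 0 : -1;
}
```

On failure the caller can retry with a lower vector number shared
between several events, or fall back to disabling MSI-X.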
+
+2.4.1.3.1.2 Virtqueue Configuration
+-----------------------------------
+
+As a device can have zero or more virtqueues for bulk data
+transport (for example, the simplest network device has two), the driver
+needs to configure them as part of the device-specific
+configuration.
+
+This is done as follows, for each virtqueue a device has:
+
+1. Write the virtqueue index (first queue is 0) to the Queue
+ Select field.
+
+2. Read the virtqueue size from the Queue Size field, which is
+ always a power of 2. This controls how big the virtqueue is
+ (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist.
+
+3. Allocate and zero the virtqueue in contiguous physical memory,
+ aligned to 4096 bytes. Write the physical address, divided by
+ 4096, to the Queue Address field.[6]
+
+4. Optionally, if MSI-X capability is present and enabled on the
+ device, select a vector to use to request interrupts triggered
+ by virtqueue events. Write the MSI-X Table entry number
+ corresponding to this vector to the Queue Vector field. Read the
+ Queue Vector field: on success, the previously written value is
+ returned; on failure, the NO_VECTOR value is returned.
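Steps 2 and 3 above involve two small calculations, sketched here
under the assumption that the driver already knows the physical
address of its allocation. The function and macro names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_PCI_QUEUE_ALIGN 4096u  /* alignment required in step 3 */

/* Step 2: a Queue Size of 0 means the queue does not exist;
 * otherwise it is expected to be a power of 2. */
static bool queue_size_is_valid(uint16_t queue_size)
{
    return queue_size != 0 && (queue_size & (queue_size - 1)) == 0;
}

/* Step 3: compute the value to write to the Queue Address field.
 * Fails if the address is misaligned or does not fit in 32 bits. */
static bool queue_address_value(uint64_t phys_addr, uint32_t *out)
{
    uint64_t pfn;

    if (phys_addr % VIRTIO_PCI_QUEUE_ALIGN != 0)
        return false;
    pfn = phys_addr / VIRTIO_PCI_QUEUE_ALIGN;
    if (pfn > UINT32_MAX)
        return false;
    *out = (uint32_t)pfn;
    return true;
}
```

For instance, a virtqueue allocated at physical address 0x12345000
yields a Queue Address value of 0x12345.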
+
+2.4.1.3.2 Notifying The Device
+------------------------------
+
+Device notification occurs by writing the 16-bit virtqueue index
+of this virtqueue to the Queue Notify field of the virtio header
+in the first I/O region of the PCI device.
+
+2.4.1.3.3 Receiving Used Buffers From The Device
+------------------------------------------------
+
+If an interrupt is necessary:
(a) If MSI-X capability is disabled:
@@ -666,33 +791,8 @@ The guest interrupt handler should:
device, to see if any progress has been made by the device
which requires servicing.
-For each ring, guest should then disable interrupts by writing
-VRING_AVAIL_F_NO_INTERRUPT flag in avail structure, if required.
-It can then process used ring entries finally enabling interrupts
-by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or updating the
-EVENT_IDX field in the available structure, Guest should then
-execute a memory barrier, and then recheck the ring empty
-condition. This is necessary to handle the case where, after the
-last check and before enabling interrupts, an interrupt has been
-suppressed by the device:
-
- vring_disable_interrupts(vq);
-
- for (;;) {
- if (vq->last_seen_used != vring->used.idx) {
- vring_enable_interrupts(vq);
- mb();
-
- if (vq->last_seen_used != vring->used.idx)
- break;
- }
-
- struct vring_used_elem *e = vring.used->ring[vq->last_seen_used%vsz];
- process_buffer(e);
- vq->last_seen_used++;
- }
-
-2.4.3 Dealing With Configuration Changes
+2.4.1.3.4 Notification of Device Configuration Changes
+------------------------------------------------------
Some virtio PCI devices can change the device configuration
state, as reflected in the virtio header in the PCI configuration
@@ -711,260 +811,285 @@ space. In this case:
entry number to use. If Configuration Vector field value is
NO_VECTOR, no interrupt message is requested for this event.
+2.4.2 Virtio Over MMIO
+----------------------
-Creating New Device Types
+Virtual environments without PCI support (a common situation in
+embedded device models) might use a simple memory mapped device
+(“virtio-mmio”) instead of the PCI device.
-Various considerations are necessary when creating a new device
-type:
+The memory mapped virtio device behaviour is based on the PCI
+device specification. Therefore most operations, such as device
+initialization, queue configuration and buffer transfers, are
+nearly identical. Existing differences are described in the
+following sections.
- How Many Virtqueues?
+2.4.2.1 MMIO Device Discovery
+-----------------------------
-It is possible that a very simple device will operate entirely
-through its configuration space, but most will need at least one
-virtqueue in which it will place requests. A device with both
-input and output (eg. console and network devices described here)
-need two queues: one which the driver fills with buffers to
-receive input, and one which the driver places buffers to
-transmit output.
+Unlike PCI, MMIO provides no generic device discovery. For systems using
+a device-tree such as Linux's dtc or Open Firmware, the suggested format is:
- What Configuration Space Layout?
+    virtio_block@1e000 {
+        compatible = "virtio,mmio";
+        reg = <0x1e000 0x100>;
+        interrupts = <42>;
+    };
-Configuration space is generally used for rarely-changing or
-initialization-time parameters. But it is a limited resource, so
-it might be better to use a virtqueue to update configuration
-information (the network device does this for filtering,
-otherwise the table in the config space could potentially be very
-large).
+2.4.2.2 MMIO Device Layout
+--------------------------
-Note that this space is generally the guest's native endian,
-rather than PCI's little-endian.
+MMIO virtio devices provide a set of memory mapped control
+registers, all 32 bits wide, followed by device-specific
+configuration space. The following list presents their layout:
- What Device Number?
+• Offset from the device base address | Direction | Name
+ Description
-Currently device numbers are assigned quite freely: a simple
-request mail to the author of this document or the Linux
-virtualization mailing list[9] will be sufficient to secure a unique one.
+• 0x000 | R | MagicValue
+ “virt” string.
-Meanwhile for experimental drivers, use 65535 and work backwards.
+• 0x004 | R | Version
+ Device version number. Currently must be 1.
- How many MSI-X vectors?
+• 0x008 | R | DeviceID
+ Virtio Subsystem Device ID (eg. 1 for network card).
-Using the optional MSI-X capability devices can speed up
-interrupt processing by removing the need to read ISR Status
-register by guest driver (which might be an expensive operation),
-reducing interrupt sharing between devices and queues within the
-device, and handling interrupts from multiple CPUs. However, some
-systems impose a limit (which might be as low as 256) on the
-total number of MSI-X vectors that can be allocated to all
-devices. Devices and/or device drivers should take this into
-account, limiting the number of vectors used unless the device is
-expected to cause a high volume of interrupts. Devices can
-control the number of vectors used by limiting the MSI-X Table
-Size or not presenting MSI-X capability in PCI configuration
-space. Drivers can control this by mapping events to as small
-number of vectors as possible, or disabling MSI-X capability
-altogether.
+• 0x00c | R | VendorID
+ Virtio Subsystem Vendor ID.
- Message Framing
+• 0x010 | R | HostFeatures
+ Flags representing features the device supports.
+ Reading from this register returns 32 consecutive flag bits,
+ first bit depending on the last value written to
+ HostFeaturesSel register. Access to this register returns bits HostFeaturesSel*32
-The descriptors used for a buffer should not effect the semantics
-of the message, except for the total length of the buffer. For
-example, a network buffer consists of a 10 byte header followed
-by the network packet. Whether this is presented in the ring
-descriptor chain as (say) a 10 byte buffer and a 1514 byte
-buffer, or a single 1524 byte buffer, or even three buffers,
-should have no effect.
+ to (HostFeaturesSel*32)+31, eg. feature bits 0 to 31 if
+ HostFeaturesSel is set to 0 and feature bits 32 to 63 if
+ HostFeaturesSel is set to 1. Also see [sub:Feature-Bits]
-In particular, no implementation should use the descriptor
-boundaries to determine the size of any header in a request.[10]
+• 0x014 | W | HostFeaturesSel
+ Device (Host) features word selection.
+ Writing to this register selects a set of 32 device feature bits
+ accessible by reading from HostFeatures register. Device driver
+ must write a value to the HostFeaturesSel register before
+ reading from the HostFeatures register.
- Device Improvements
+• 0x020 | W | GuestFeatures
+ Flags representing device features understood and activated by
+ the driver.
+ Writing to this register sets 32 consecutive flag bits, first
+ bit depending on the last value written to GuestFeaturesSel
+ register. Access to this register sets bits GuestFeaturesSel*32
+ to (GuestFeaturesSel*32)+31, eg. feature bits 0 to 31 if
+ GuestFeaturesSel is set to 0 and feature bits 32 to 63 if
+ GuestFeaturesSel is set to 1. Also see [sub:Feature-Bits]
-Any change to configuration space, or new virtqueues, or
-behavioural changes, should be indicated by negotiation of a new
-feature bit. This establishes clarity[11] and avoids future expansion problems.
+• 0x024 | W | GuestFeaturesSel
+ Activated (Guest) features word selection.
+ Writing to this register selects a set of 32 activated feature
+ bits accessible by writing to the GuestFeatures register.
+ Device driver must write a value to the GuestFeaturesSel
+ register before writing to the GuestFeatures register.
-Clusters of functionality which are always implemented together
-can use a single bit, but if one feature makes sense without the
-others they should not be gratuitously grouped together to
-conserve feature bits. We can always extend the spec when the
-first person needs more than 24 feature bits for their device.
+• 0x028 | W | GuestPageSize
+ Guest page size.
+ Device driver must write the guest page size in bytes to the
+ register during initialization, before any queues are used.
+ This value must be a power of 2 and is used by the Host to
+ calculate Guest address of the first queue page (see QueuePFN).
+• 0x030 | W | QueueSel
+ Virtual queue index (first queue is 0).
+ Writing to this register selects the virtual queue that the
+ following operations on QueueNum, QueueAlign and QueuePFN apply
+ to.
+• 0x034 | R | QueueNumMax
+ Maximum virtual queue size.
+ Reading from the register returns the maximum size of the queue
+ the Host is ready to process or zero (0x0) if the queue is not
+ available. This applies to the queue selected by writing to
+ QueueSel and is allowed only when QueuePFN is set to zero
+ (0x0), so when the queue is not actively used.
+• 0x038 | W | QueueNum
+ Virtual queue size.
+ Queue size is the number of elements in the queue, and therefore
+ the size of the descriptor table and both available and used rings.
+ Writing to this register notifies the Host what size of the
+ queue the Guest will use. This applies to the queue selected by
+ writing to QueueSel.
-Appendix A: virtio_ring.h
+• 0x03c | W | QueueAlign
+ Used Ring alignment in the virtual queue.
+ Writing to this register notifies the Host about alignment
+ boundary of the Used Ring in bytes. This value must be a power
+ of 2 and applies to the queue selected by writing to QueueSel.
-#ifndef VIRTIO_RING_H
-#define VIRTIO_RING_H
-/* An interface for efficient virtio implementation.
- *
- * This header is BSD licensed so anyone can use the definitions
- * to implement compatible drivers/servers.
- *
- * Copyright 2007, 2009, IBM Corporation
- * Copyright 2011, Red Hat, Inc
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in the
- * documentation and/or other materials provided with the distribution.
- * 3. Neither the name of IBM nor the names of its contributors
- * may be used to endorse or promote products derived from this software
- * without specific prior written permission.
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- */
+• 0x040 | RW | QueuePFN
+ Guest physical page number of the virtual queue.
+ Writing to this register notifies the host about location of the
+ virtual queue in the Guest's physical address space. This value
+ is the index number of a page starting with the queue
+ Descriptor Table. Value zero (0x0) means physical address zero
+ (0x00000000) and is illegal. When the Guest stops using the
+ queue it must write zero (0x0) to this register.
+ Reading from this register returns the currently used page
+ number of the queue, therefore a value other than zero (0x0)
+ means that the queue is in use.
+ Both read and write accesses apply to the queue selected by
+ writing to QueueSel.
-/* This marks a buffer as continuing via the next field. */
-#define VRING_DESC_F_NEXT 1
-/* This marks a buffer as write-only (otherwise read-only). */
-#define VRING_DESC_F_WRITE 2
+• 0x050 | W | QueueNotify
+ Queue notifier.
+ Writing a queue index to this register notifies the Host that
+ there are new buffers to process in the queue.
-/* The Host uses this in used->flags to advise the Guest: don't kick me
- * when you add a buffer. It's unreliable, so it's simply an
- * optimization. Guest will still kick if it's out of buffers. */
-#define VRING_USED_F_NO_NOTIFY 1
-/* The Guest uses this in avail->flags to advise the Host: don't
- * interrupt me when you consume a buffer. It's unreliable, so it's
- * simply an optimization. */
-#define VRING_AVAIL_F_NO_INTERRUPT 1
+• 0x060 | R | InterruptStatus
+ Interrupt status.
+ Reading from this register returns a bit mask of interrupts
+ asserted by the device. An interrupt is asserted if the
+ corresponding bit is set, ie. equals one (1).
-/* Virtio ring descriptors: 16 bytes.
- * These can chain together via "next". */
-struct vring_desc {
- /* Address (guest-physical). */
- uint64_t addr;
- /* Length. */
- uint32_t len;
- /* The flags as indicated above. */
- uint16_t flags;
- /* We chain unused descriptors via this, too */
- uint16_t next;
-};
+ – Bit 0 | Used Ring Update
+ This interrupt is asserted when the Host has updated the Used
+ Ring in at least one of the active virtual queues.
-struct vring_avail {
- uint16_t flags;
- uint16_t idx;
- uint16_t ring[];
- uint16_t used_event;
-};
+ – Bit 1 | Configuration change
+ This interrupt is asserted when configuration of the device has
+ changed.
-/* u32 is used here for ids for padding reasons. */
-struct vring_used_elem {
- /* Index of start of used descriptor chain. */
- uint32_t id;
- /* Total length of the descriptor chain which was written to. */
- uint32_t len;
-};
+• 0x064 | W | InterruptACK
+ Interrupt acknowledge.
+ Writing to this register notifies the Host that the Guest
+ finished handling interrupts. Set bits in the value clear the
+ corresponding bits of the InterruptStatus register.
-struct vring_used {
- uint16_t flags;
- uint16_t idx;
- struct vring_used_elem ring[];
- uint16_t avail_event;
-};
+• 0x070 | RW | Status
+ Device status.
+ Reading from this register returns the current device status
+ flags.
+ Writing non-zero values to this register sets the status flags,
+ indicating the Guest progress. Writing zero (0x0) to this
+ register triggers a device reset.
+ Also see [sub:Device-Initialization-Sequence]
-struct vring {
- unsigned int num;
+• 0x100+ | RW | Config
+ Device-specific configuration space starts at offset 0x100
+ and is accessed with byte alignment. Its meaning and size
+ depend on the device and the driver.
- struct vring_desc *desc;
- struct vring_avail *avail;
- struct vring_used *used;
-};
+Virtual queue size is the number of elements in the queue, and
+therefore the size of the descriptor table and both available and
+used rings.
-/* The standard layout for the ring is a continuous chunk of memory which
- * looks like this. We assume num is a power of 2.
- *
- * struct vring {
- * // The actual descriptors (16 bytes each)
- * struct vring_desc desc[num];
- *
- * // A ring of available descriptor heads with free-running index.
- * __u16 avail_flags;
- * __u16 avail_idx;
- * __u16 available[num];
- *
- * // Padding to the next align boundary.
- * char pad[];
- *
- * // A ring of used descriptor heads with free-running index.
- * __u16 used_flags;
- * __u16 EVENT_IDX;
- * struct vring_used_elem used[num];
- * };
- * Note: for virtio PCI, align is 4096.
- */
-static inline void vring_init(struct vring *vr, unsigned int num, void *p,
- unsigned long align)
-{
- vr->num = num;
- vr->desc = p;
- vr->avail = p + num*sizeof(struct vring_desc);
- vr->used = (void *)(((unsigned long)&vr->avail->ring[num]
- + align-1)
- & ~(align - 1));
-}
+The endianness of the registers follows the native endianness of
+the Guest. Writing to registers described as “R” and reading from
+registers described as “W” is not permitted and can cause
+undefined behavior.
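The HostFeaturesSel/HostFeatures pairing in the register list can be
sketched as follows. The struct fake_mmio_dev type and its accessors
are an invented in-memory model standing in for real 32-bit MMIO
accesses; only the two register offsets come from the list above.

```c
#include <stdint.h>

/* Register offsets from the layout list; macro names are assumptions. */
#define MMIO_HOST_FEATURES      0x010
#define MMIO_HOST_FEATURES_SEL  0x014

/* Hypothetical stand-in for a device exposing 64 feature bits. */
struct fake_mmio_dev {
    uint64_t host_features;
    uint32_t host_features_sel;
};

static void fake_write32(struct fake_mmio_dev *d, uint32_t off, uint32_t v)
{
    if (off == MMIO_HOST_FEATURES_SEL)
        d->host_features_sel = v;
}

static uint32_t fake_read32(const struct fake_mmio_dev *d, uint32_t off)
{
    /* Returns bits sel*32 .. sel*32+31; the model has only 2 words. */
    if (off == MMIO_HOST_FEATURES && d->host_features_sel < 2)
        return (uint32_t)(d->host_features >> (32 * d->host_features_sel));
    return 0;
}

/* Select each 32-bit word before reading it, as the list requires. */
static uint64_t read_host_features64(struct fake_mmio_dev *d)
{
    uint64_t lo, hi;

    fake_write32(d, MMIO_HOST_FEATURES_SEL, 0);
    lo = fake_read32(d, MMIO_HOST_FEATURES);
    fake_write32(d, MMIO_HOST_FEATURES_SEL, 1);
    hi = fake_read32(d, MMIO_HOST_FEATURES);
    return lo | (hi << 32);
}
```

Writing GuestFeatures would follow the same select-then-access
pattern using GuestFeaturesSel.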
-static inline unsigned vring_size(unsigned int num, unsigned long align)
-{
- return ((sizeof(struct vring_desc)*num + sizeof(uint16_t)*(2+num)
- + align - 1) & ~(align - 1))
- + sizeof(uint16_t)*3 + sizeof(struct vring_used_elem)*num;
-}
+2.4.2.3 MMIO-specific Initialization And Device Operation
+---------------------------------------------------------
-static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
-{
- return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
-}
-#endif /* VIRTIO_RING_H */
+2.4.2.3.1 Device Initialization
+-------------------------------
+Unlike the fixed page size for PCI, the virtqueue page size is defined
+by the GuestPageSize field, as written by the guest. This must be
+done before the virtqueues are configured.
-Appendix B: Reserved Feature Bits
+2.4.2.3.1.1 Virtqueue Configuration
+-----------------------------------
-Currently there are five device-independent feature bits defined:
+1. Select the queue by writing its index (first queue is 0) to the
+ QueueSel register.
- VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature
- indicates that the driver wants an interrupt if the device runs
- out of available descriptors on a virtqueue, even though
- interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT
- flag or the used_event field. An example of this is the
- networking driver: it doesn't need to know every time a packet
- is transmitted, but it does need to free the transmitted
- packets a finite time after they are transmitted. It can avoid
- using a timer if the device interrupts it when all the packets
- are transmitted.
+2. Check that the queue is not already in use: read the QueuePFN
+ register; the returned value should be zero (0x0).
- VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature indicates
- that the driver can use descriptors with the VRING_DESC_F_INDIRECT
- flag set, as described in 2.3.3 Indirect Descriptors.
+3. Read maximum queue size (number of elements) from the
+ QueueNumMax register. If the returned value is zero (0x0) the
+ queue is not available.
- VIRTIO_F_RING_EVENT_IDX(29) This feature enables the used_event
- and the avail_event fields. If set, it indicates that the
- device should ignore the flags field in the available ring
- structure. Instead, the used_event field in this structure is
- used by guest to suppress device interrupts. Further, the
- driver should ignore the flags field in the used ring
- structure. Instead, the avail_event field in this structure is
- used by the device to suppress notifications. If unset, the
- driver should ignore the used_event field; the device should
- ignore the avail_event field; the flags field is used
+4. Allocate and zero the queue pages in contiguous virtual
+ memory, aligning the Used Ring to an optimal boundary (usually
+ page size). Size of the allocated queue may be smaller than or
+ equal to the maximum size returned by the Host.
-Appendix C: Network Device
+5. Notify the Host about the queue size by writing the size to
+ the QueueNum register.
+
+6. Notify the Host about the Used Ring alignment by writing its
+ value in bytes to the QueueAlign register.
+
+7. Write the physical number of the first page of the queue to
+ the QueuePFN register.
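The seven steps above can be sketched as one routine. This is an
illustration only: the register window is modelled as a plain array
of 32-bit words, the function and macro names are hypothetical, and
the allocation in step 4 is assumed to have been done by the caller
for exactly num entries, with GuestPageSize already written.

```c
#include <stdint.h>

/* Register offsets from the layout list; macro names are assumptions. */
#define MMIO_QUEUE_SEL      0x030
#define MMIO_QUEUE_NUM_MAX  0x034
#define MMIO_QUEUE_NUM      0x038
#define MMIO_QUEUE_ALIGN    0x03c
#define MMIO_QUEUE_PFN      0x040

static uint32_t reg_read(volatile uint32_t *base, uint32_t off)
{
    return base[off / 4];
}

static void reg_write(volatile uint32_t *base, uint32_t off, uint32_t v)
{
    base[off / 4] = v;
}

/* Returns 0 on success, -1 if the queue is in use, unavailable,
 * or smaller than the requested number of entries. */
static int mmio_setup_queue(volatile uint32_t *base, uint32_t index,
                            uint32_t num, uint32_t align,
                            uint64_t queue_phys, uint32_t page_size)
{
    uint32_t max;

    if (page_size == 0)
        return -1;
    reg_write(base, MMIO_QUEUE_SEL, index);        /* step 1 */
    if (reg_read(base, MMIO_QUEUE_PFN) != 0)       /* step 2 */
        return -1;
    max = reg_read(base, MMIO_QUEUE_NUM_MAX);      /* step 3 */
    if (max == 0 || num == 0 || num > max)
        return -1;
    /* step 4: memory at queue_phys was allocated by the caller */
    reg_write(base, MMIO_QUEUE_NUM, num);          /* step 5 */
    reg_write(base, MMIO_QUEUE_ALIGN, align);      /* step 6 */
    reg_write(base, MMIO_QUEUE_PFN,                /* step 7 */
              (uint32_t)(queue_phys / page_size));
    return 0;
}
```

A second call for the same queue fails at step 2 until the driver
writes zero to QueuePFN to release the queue.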
+
+2.4.2.3.2 Notifying The Device
+------------------------------
+
+The device is notified about new buffers available in a queue by
+writing the queue index to the QueueNotify register.
+
+2.4.2.3.3 Receiving Used Buffers From The Device
+------------------------------------------------
+
+The memory mapped virtio device uses a single, dedicated
+interrupt signal, which is raised when at least one of the
+interrupts described in the InterruptStatus register
+description is asserted. After receiving an interrupt, the
+driver must read the InterruptStatus register to check what
+caused the interrupt (see the register description). After the
+interrupt is handled, the driver must acknowledge it by writing
+a bit mask corresponding to the serviced interrupt to the
+InterruptACK register.
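A handler following the read-then-acknowledge sequence above might
look like the sketch below. The register window is again modelled as
an array of 32-bit words, and struct irq_counts stands in for the
real used-ring and configuration handling; both, along with the
macro and function names, are assumptions for illustration.

```c
#include <stdint.h>

/* Offsets and bit positions from the register list; names assumed. */
#define MMIO_INTERRUPT_STATUS  0x060
#define MMIO_INTERRUPT_ACK     0x064
#define MMIO_INT_USED_RING     (1u << 0)
#define MMIO_INT_CONFIG        (1u << 1)

/* Hypothetical placeholder for the driver's real handlers. */
struct irq_counts {
    unsigned used_ring;
    unsigned config;
};

/* Returns the bit mask of interrupts that were serviced. */
static uint32_t mmio_handle_interrupt(volatile uint32_t *base,
                                      struct irq_counts *counts)
{
    uint32_t status = base[MMIO_INTERRUPT_STATUS / 4];
    uint32_t serviced = 0;

    if (status & MMIO_INT_USED_RING) {
        counts->used_ring++;       /* process used buffers here */
        serviced |= MMIO_INT_USED_RING;
    }
    if (status & MMIO_INT_CONFIG) {
        counts->config++;          /* re-read configuration here */
        serviced |= MMIO_INT_CONFIG;
    }
    if (serviced)
        base[MMIO_INTERRUPT_ACK / 4] = serviced;  /* acknowledge */
    return serviced;
}
```

Only the bits that were actually handled are written back, so an
interrupt cause the driver did not service stays asserted.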
+
+2.4.2.3.4 Notification of Device Configuration Changes
+------------------------------------------------------
+
+This is indicated by bit 1 in the InterruptStatus register, as
+documented in the register description.
+
+2.5 Device Types
+================
+
+On top of the queues, config space and feature negotiation facilities
+built into virtio, several specific devices are defined.
+
+The following device IDs are used to identify different types of virtio
+devices. Some device IDs are reserved for devices which are not currently
+defined in this standard.
+
+Discovering what devices are available and their type is bus-dependent.
+
++------------+--------------------+
+| Device ID  | Virtio Device      |
++------------+--------------------+
+| 1          | network card       |
++------------+--------------------+
+| 2          | block device       |
++------------+--------------------+
+| 3          | console            |
++------------+--------------------+
+| 4          | entropy source     |
++------------+--------------------+
+| 5          | memory ballooning  |
++------------+--------------------+
+| 6          | ioMemory           |
++------------+--------------------+
+| 7          | rpmsg              |
++------------+--------------------+
+| 8          | SCSI host          |
++------------+--------------------+
+| 9          | 9P transport       |
++------------+--------------------+
+| 10         | mac80211 wlan      |
++------------+--------------------+
+
+2.5.1 Network Device
+====================
The virtio network device is a virtual ethernet card, and is the
most complex of the devices supported so far by virtio. It has
@@ -975,13 +1100,20 @@ packets are enqueued into another for transmission in that order.
A third command queue is used to control advanced filtering
features.
-Configuration
+2.5.1.1 Device ID
+-----------------
+
+ 1
- Subsystem Device ID 1
+2.5.1.2 Virtqueues
+------------------
- Virtqueues 0:receiveq. 1:transmitq. 2:controlq[12]
+ 0:receiveq. 1:transmitq. 2:controlq
-Feature bits
+ Virtqueue 2 only exists if VIRTIO_NET_F_CTRL_VQ is set.
+
+2.5.1.3 Feature bits
+--------------------
VIRTIO_NET_F_CSUM (0) Device handles packets with partial checksum
@@ -1037,7 +1169,8 @@ Feature bits
u16 status;
};
-Device Initialization
+2.5.1.4 Device Initialization
+-----------------------------
1. The initialization routine should identify the receive and
transmission virtqueues.
@@ -1080,7 +1213,8 @@ Device Initialization
equivalents of the features described above. See “Receiving
Packets” below.
-Device Operation
+2.5.1.5 Device Operation
+------------------------
Packets are transmitted by placing them in the transmitq, and
buffers for incoming packets are placed in the receiveq. In each
@@ -1106,7 +1240,8 @@ case, the packet itself is preceeded by a header:
The controlq is used to control device features such as
filtering.
-Packet Transmission
+2.5.1.5.1 Packet Transmission
+-----------------------------
Transmitting a single packet is simple, but varies depending on
the different features the driver negotiated.
@@ -1151,7 +1286,8 @@ the different features the driver negotiated.
transmitq, and the device is notified of the new entry (see 2.4.1.4
Notifying The Device).[20]
- Packet Transmission Interrupt
+2.5.1.5.1.1 Packet Transmission Interrupt
+-----------------------------------------
Often a driver will suppress transmission interrupts using the
VRING_AVAIL_F_NO_INTERRUPT flag (see 2.4.2 Receiving Used Buffers From
@@ -1164,7 +1300,7 @@ The normal behavior in this interrupt handler is to retrieve and
new descriptors from the used ring and free the corresponding
headers and packets.
- Setting Up Receive Buffers
+2.5.1.5.2 Setting Up Receive Buffers
+------------------------------------
It is generally a good idea to keep the receive virtqueue as
fully populated as possible: if it runs out, network performance
@@ -1180,7 +1316,8 @@ buffer in the receive queue needs to be at least this length [20a]
If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at
least the size of the struct virtio_net_hdr.
- Packet Receive Interrupt
+2.5.1.5.2.1 Packet Receive Interrupt
+------------------------------------
When a packet is copied into a buffer in the receiveq, the
optimal path is to disable further interrupts for the receiveq
@@ -1214,7 +1351,8 @@ Processing packet involves:
VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the
desired MSS (see Packet Transmission point 2).
-Control Virtqueue
+2.5.1.5.3 Control Virtqueue
+---------------------------
The driver uses the control virtqueue (if VIRTIO_NET_F_CTRL_VQ is
negotiated) to send commands to manipulate various features of
@@ -1239,7 +1377,8 @@ driver, and the device sets the ack byte. There is little it can
do except issue a diagnostic if the ack byte is not
VIRTIO_NET_OK.
-Packet Receive Filtering
+2.5.1.5.3.1 Packet Receive Filtering
+------------------------------------
If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can
send control commands for promiscuous mode, multicast receiving,
@@ -1260,7 +1399,8 @@ VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and
off. The command-specific-data is one byte containing 0 (off) or
1 (on).
-Setting MAC Address Filtering
+2.5.1.5.3.2 Setting MAC Address Filtering
+-----------------------------------------
struct virtio_net_ctrl_mac {
u32 entries;
@@ -1277,7 +1417,8 @@ command-specific-data is two variable length tables of 6-byte MAC
addresses. The first table contains unicast addresses, and the second
contains multicast addresses.
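
As a rough illustration of the layout described above, the following C
sketch packs the two tables into one contiguous buffer: each table is a
32-bit entry count followed by that many 6-byte addresses, with the
unicast table first. The helper names (mac_table_put, mac_tables_pack),
the single contiguous buffer, and writing the count in host byte order
are assumptions made for this example only; a real driver may well pass
the two tables as separate descriptors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical helper: write one table (u32 entries, then entries * 6 bytes). */
static size_t mac_table_put(uint8_t *dst, const uint8_t *macs, uint32_t entries)
{
	size_t len = (size_t)entries * ETH_ALEN;

	/* Entry count in host byte order (an assumption of this sketch). */
	memcpy(dst, &entries, sizeof(entries));
	if (len)
		memcpy(dst + sizeof(entries), macs, len);
	return sizeof(entries) + len;
}

/* Hypothetical helper: unicast table first, multicast table second.
 * Returns the number of bytes written, or 0 if buf is too small. */
static size_t mac_tables_pack(uint8_t *buf, size_t cap,
			      const uint8_t *uni, uint32_t n_uni,
			      const uint8_t *multi, uint32_t n_multi)
{
	size_t need = 2 * sizeof(uint32_t)
		    + ((size_t)n_uni + n_multi) * ETH_ALEN;
	size_t off;

	if (cap < need)
		return 0;
	off = mac_table_put(buf, uni, n_uni);
	off += mac_table_put(buf + off, multi, n_multi);
	return off;
}
```

With one unicast and two multicast addresses, the packed data is
4 + 6 + 4 + 12 = 26 bytes long.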
-VLAN Filtering
+2.5.1.5.3.3 VLAN Filtering
+--------------------------
If the driver negotiates the VIRTIO_NET_F_CTRL_VLAN feature, it
can control a VLAN filter table in the device.
@@ -1289,7 +1430,8 @@ can control a VLAN filter table in the device.
Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL
command take a 16-bit VLAN id as the command-specific-data.
-Gratuitous Packet Sending
+2.5.1.5.3.4 Gratuitous Packet Sending
+-------------------------------------
If the driver negotiates the VIRTIO_NET_F_GUEST_ANNOUNCE (depends
on VIRTIO_NET_F_CTRL_VQ), it can ask the guest to send gratuitous
@@ -1318,22 +1460,24 @@ Processing this notification involves:
2. Sending VIRTIO_NET_CTRL_ANNOUNCE_ACK command through control
vq.
-3. .
-
-Appendix D: Block Device
+2.5.2 Block Device
+==================
The virtio block device is a simple virtual block device (ie.
disk). Read and write requests (and other exotic requests) are
placed in the queue, and serviced (probably out of order) by the
device except where noted.
-Configuration
+2.5.2.1 Device ID
+-----------------
+ 2
- Subsystem Device ID 2
+2.5.2.2 Virtqueues
+------------------
+ 0:requestq
- Virtqueues 0:requestq.
-
- Feature bits
+2.5.2.3 Feature bits
+--------------------
VIRTIO_BLK_F_BARRIER (0) Host supports request barriers.
@@ -1371,7 +1515,8 @@ Configuration
u32 blk_size;
};
-Device Initialization
+2.5.2.4 Device Initialization
+-----------------------------
1. The device size should be read from the “capacity”
configuration field. No requests should be submitted which go
@@ -1386,7 +1531,8 @@ Device Initialization
3. If the VIRTIO_BLK_F_RO feature is set by the device, any write
requests will fail.
-Device Operation
+2.5.2.5 Device Operation
+------------------------
The driver queues requests to the virtqueue, and they are used by
the device (not necessarily in order). Each request is of form:
@@ -1487,7 +1633,9 @@ data_len, sense_len and residual in a single write-only buffer;
and the status field is a separate read-only buffer of size 1
byte, by itself.
-Appendix E: Console Device
+
+2.5.3 Console Device
+====================
The virtio console device is a simple device for data input and
output. A device may have one or more ports. Each port has a pair
@@ -1502,15 +1650,21 @@ successfully added, port open/close, etc.. For data IO, one or
more empty buffers are placed in the receive queue for incoming
data and outgoing characters are placed in the transmit queue.
-Configuration
+2.5.3.1 Device ID
+-----------------
- Subsystem Device ID 3
+ 3
- Virtqueues 0:receiveq(port0). 1:transmitq(port0), 2:control
- receiveq[24], 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1),
+2.5.3.2 Virtqueues
+------------------
+
+ 0:receiveq(port0), 1:transmitq(port0), 2:control receiveq, 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1),
...
- Feature bits
+ Ports 2 onwards only exist if VIRTIO_CONSOLE_F_MULTIPORT is set.
+
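
The numbering above can be captured in a small sketch. Port 0 owns
queues 0 and 1, queues 2 and 3 are the control pair, and each later
port takes the next receive/transmit pair. The function names are
hypothetical, and the formula for ports beyond port 1 simply assumes
that the "..." continues the pattern shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index of the receiveq for a console port.
 * Port 0 uses queue 0; queues 2 and 3 are skipped for the control
 * pair, so port n (n >= 1) starts at 2 * (n + 1). */
static uint32_t console_port_receiveq(uint32_t port)
{
	return port == 0 ? 0 : 2 * (port + 1);
}

/* Hypothetical helper: the transmitq always follows its receiveq. */
static uint32_t console_port_transmitq(uint32_t port)
{
	return console_port_receiveq(port) + 1;
}
```

For port 1 this gives queues 4 and 5, matching the list above.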
+2.5.3.3 Feature bits
+--------------------
VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields
are valid.
@@ -1519,7 +1673,10 @@ Configuration
ports; configuration fields nr_ports and max_nr_ports are
valid and control virtqueues will be used.
- Device configuration layout The size of the console is supplied
+2.5.3.4 Device configuration layout
+-----------------------------------
+
+ The size of the console is supplied
in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature
is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature
is set, the maximum number of ports supported by the device can
@@ -1531,7 +1688,8 @@ Configuration
u32 max_nr_ports;
};
-Device Initialization
+2.5.3.5 Device Initialization
+-----------------------------
1. If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver
can read the console dimensions from the configuration fields.
@@ -1552,7 +1710,8 @@ Device Initialization
3. The receiveq for each port is populated with one or more
receive buffers.
-Device Operation
+2.5.3.6 Device Operation
+------------------------
1. For output, a buffer containing the characters is placed in
the port's transmitq.[25]
@@ -1596,32 +1755,42 @@ Device Operation
#define VIRTIO_CONSOLE_PORT_OPEN 6
#define VIRTIO_CONSOLE_PORT_NAME 7
-Appendix F: Entropy Device
+2.5.4 Entropy Device
+====================
The virtio entropy device supplies high-quality randomness for
guest use.
- Configuration
-
- Subsystem Device ID 4
+2.5.4.1 Device ID
+-----------------
+ 4
- Virtqueues 0:requestq.
+2.5.4.2 Virtqueues
+------------------
+ 0:requestq.
- Feature bits None currently defined
+2.5.4.3 Feature bits
+--------------------
+ None currently defined
- Device configuration layout None currently defined.
+2.5.4.4 Device configuration layout
+-----------------------------------
+ None currently defined.
-Device Initialization
+2.5.4.5 Device Initialization
+-----------------------------
1. The virtqueue is initialized
-Device Operation
+2.5.4.6 Device Operation
+------------------------
When the driver requires random bytes, it places the descriptor
of one or more buffers in the queue. It will be completely filled
by random data by the device.
-Appendix G: Memory Balloon Device
+2.5.5 Memory Balloon Device
+===========================
The virtio memory balloon device is a primitive device for
managing guest memory: the device asks for a certain amount of
@@ -1631,21 +1800,27 @@ changes in allowance of underlying physical memory. If the
feature is negotiated, the device can also be used to communicate
guest memory statistics to the host.
- Configuration
-
- Subsystem Device ID 5
+2.5.5.1 Device ID
+-----------------
+ 5
- Virtqueues 0:inflateq. 1:deflateq. 2:statsq.[26]
+2.5.5.2 Virtqueues
+------------------
+ 0:inflateq. 1:deflateq. 2:statsq.
- Feature bits
+ Virtqueue 2 only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
+2.5.5.3 Feature bits
+--------------------
VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before
pages from the balloon are used.
VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest
memory statistics is present.
- Device configuration layout Both fields of this configuration
+2.5.5.4 Device configuration layout
+-----------------------------------
+ Both fields of this configuration
are always available. Note that they are little endian, despite
convention that device fields are guest endian:
@@ -1654,7 +1829,8 @@ guest memory statistics to the host.
u32 actual;
};
-Device Initialization
+2.5.5.5 Device Initialization
+-----------------------------
1. The inflate and deflate virtqueues are identified.
@@ -1667,7 +1843,8 @@ Device Initialization
Device operation begins immediately.
-Device Operation
+2.5.5.6 Device Operation
+------------------------
Memory Ballooning The device is driven by the receipt of a
configuration change interrupt.
@@ -1702,7 +1879,8 @@ configuration change interrupt.
deflation, the “actual” field of the configuration should be
updated to reflect the new number of pages in the balloon.[29]
-Memory Statistics
+2.5.5.6.1 Memory Statistics
+---------------------------
The stats virtqueue is atypical because communication is driven
by the device (not the driver). The channel becomes active at
@@ -1741,7 +1919,8 @@ as follows:
u64 val;
} __attribute__((packed));
- Tags
+2.5.5.6.2 Memory Statistics Tags
+--------------------------------
VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been
swapped in (in bytes).
@@ -1761,7 +1940,9 @@ as follows:
VIRTIO_BALLOON_S_MEMTOT The total amount of memory available
(in bytes).
-Appendix I: SCSI Host Device
+
+2.5.6 SCSI Host Device
+======================
The virtio SCSI host device groups together one or more virtual
logical units (such as disks), and allows communicating to them
@@ -1782,13 +1963,16 @@ medium. In the transport protocol, the virtio driver acts as the
initiator, while the virtio SCSI host provides one or more
targets that receive and process the requests.
- Configuration
-
- Subsystem Device ID 8
+2.5.6.1 Device ID
+-----------------
+ 8
- Virtqueues 0:controlq; 1:eventq; 2..n:request queues.
+2.5.6.2 Virtqueues
+------------------
+ 0:controlq; 1:eventq; 2..n:request queues.
- Feature bits
+2.5.6.3 Feature bits
+--------------------
VIRTIO_SCSI_F_INOUT (0) A single request can include both
read-only and write-only data buffers.
@@ -1796,9 +1980,11 @@ targets that receive and process the requests.
VIRTIO_SCSI_F_HOTPLUG (1) The host should enable
hot-plug/hot-unplug of new LUNs and targets on the SCSI bus.
- Device configuration layout All fields of this configuration
- are always available. sense_size and cdb_size are writable by
- the guest.
+2.5.6.4 Device configuration layout
+-----------------------------------
+
+ All fields of this configuration are always available. sense_size
+ and cdb_size are writable by the guest.
struct virtio_scsi_config {
u32 num_queues;
@@ -1849,7 +2035,8 @@ targets that receive and process the requests.
as hints to constrain scanning the logical units on the
host.
-Device Initialization
+2.5.6.5 Device Initialization
+-----------------------------
The initialization routine should first of all discover the
device's virtqueues.
@@ -1861,7 +2048,14 @@ The driver can immediately issue requests (for example, INQUIRY
or REPORT LUNS) or task management functions (for example, I_T
RESET).
-Device Operation: request queues
+2.5.6.6 Device Operation
+------------------------
+
+Device operation consists of operating request queues, the control
+queue and the event queue.
+
+2.5.6.6.1 Device Operation: Request Queues
+------------------------------------------
The driver queues requests to an arbitrary request queue, and
they are used by the device on that same queue. It is the
@@ -1983,7 +2177,8 @@ following:
request will be immediately returned with a response equal to
VIRTIO_SCSI_S_FAILURE.
-Device Operation: controlq
+2.5.6.6.2 Device Operation: controlq
+------------------------------------
The controlq is used for other SCSI transport operations.
Requests have the following format:
@@ -2114,7 +2309,8 @@ The following commands are defined:
No command-specific values are defined for the response byte.
-Device Operation: eventq
+2.5.6.6.3 Device Operation: eventq
+----------------------------------
The eventq is used by the device to report information on logical
units that are attached to it. The driver should always leave a
@@ -2257,234 +2453,254 @@ contents of the event field. The following events are defined:
When dropped events are reported, the driver should poll for
asynchronous events manually using SCSI commands.
-Appendix X: virtio-mmio
-Virtual environments without PCI support (a common situation in
-embedded devices models) might use simple memory mapped device (“
-virtio-mmio”) instead of the PCI device.
-
-The memory mapped virtio device behaviour is based on the PCI
-device specification. Therefore most of operations like device
-initialization, queues configuration and buffer transfers are
-nearly identical. Existing differences are described in the
-following sections.
+2.6 Reserved Feature Bits
+=========================
-Device Initialization
-
-Instead of using the PCI IO space for virtio header, the “
-virtio-mmio” device provides a set of memory mapped control
-registers, all 32 bits wide, followed by device-specific
-configuration space. The following list presents their layout:
-
-• Offset from the device base address | Direction | Name
- Description
-
-• 0x000 | R | MagicValue
- “virt” string.
-
-• 0x004 | R | Version
- Device version number. Currently must be 1.
-
-• 0x008 | R | DeviceID
- Virtio Subsystem Device ID (ie. 1 for network card).
-
-• 0x00c | R | VendorID
- Virtio Subsystem Vendor ID.
-
-• 0x010 | R | HostFeatures
- Flags representing features the device supports.
- Reading from this register returns 32 consecutive flag bits,
- first bit depending on the last value written to
- HostFeaturesSel register. Access to this register returns bits HostFeaturesSel*32
-
- to (HostFeaturesSel*32)+31, eg. feature bits 0 to 31 if
- HostFeaturesSel is set to 0 and features bits 32 to 63 if
- HostFeaturesSel is set to 1. Also see [sub:Feature-Bits]
-
-• 0x014 | W | HostFeaturesSel
- Device (Host) features word selection.
- Writing to this register selects a set of 32 device feature bits
- accessible by reading from HostFeatures register. Device driver
- must write a value to the HostFeaturesSel register before
- reading from the HostFeatures register.
-
-• 0x020 | W | GuestFeatures
- Flags representing device features understood and activated by
- the driver.
- Writing to this register sets 32 consecutive flag bits, first
- bit depending on the last value written to GuestFeaturesSel
- register. Access to this register sets bits GuestFeaturesSel*32
- to (GuestFeaturesSel*32)+31, eg. feature bits 0 to 31 if
- GuestFeaturesSel is set to 0 and features bits 32 to 63 if
- GuestFeaturesSel is set to 1. Also see [sub:Feature-Bits]
+Currently there are five device-independent feature bits defined:
-• 0x024 | W | GuestFeaturesSel
- Activated (Guest) features word selection.
- Writing to this register selects a set of 32 activated feature
- bits accessible by writing to the GuestFeatures register.
- Device driver must write a value to the GuestFeaturesSel
- register before writing to the GuestFeatures register.
+ VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature
+ indicates that the driver wants an interrupt if the device runs
+ out of available descriptors on a virtqueue, even though
+ interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT
+ flag or the used_event field. An example of this is the
+ networking driver: it doesn't need to know every time a packet
+ is transmitted, but it does need to free the transmitted
+ packets a finite time after they are transmitted. It can avoid
+ using a timer if the device interrupts it when all the packets
+ are transmitted.
-• 0x028 | W | GuestPageSize
- Guest page size.
- Device driver must write the guest page size in bytes to the
- register during initialization, before any queues are used.
- This value must be a power of 2 and is used by the Host to
- calculate Guest address of the first queue page (see QueuePFN).
+ VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature indicates
+ that the driver can use descriptors with the VRING_DESC_F_INDIRECT
+ flag set, as described in 2.3.3 Indirect Descriptors.
-• 0x030 | W | QueueSel
- Virtual queue index (first queue is 0).
- Writing to this register selects the virtual queue that the
- following operations on QueueNum, QueueAlign and QueuePFN apply
- to.
+ VIRTIO_F_RING_EVENT_IDX (29) This feature enables the used_event
+ and the avail_event fields. If set, it indicates that the
+ device should ignore the flags field in the available ring
+ structure. Instead, the used_event field in this structure is
+ used by the guest to suppress device interrupts. Further, the
+ driver should ignore the flags field in the used ring
+ structure. Instead, the avail_event field in this structure is
+ used by the device to suppress notifications. If unset, the
+ driver should ignore the used_event field; the device should
+ ignore the avail_event field; the flags field is used.
-• 0x034 | R | QueueNumMax
- Maximum virtual queue size.
- Reading from the register returns the maximum size of the queue
- the Host is ready to process or zero (0x0) if the queue is not
- available. This applies to the queue selected by writing to
- QueueSel and is allowed only when QueuePFN is set to zero
- (0x0), so when the queue is not actively used.
-• 0x038 | W | QueueNum
- Virtual queue size.
- Queue size is a number of elements in the queue, therefore size
- of the descriptor table and both available and used rings.
- Writing to this register notifies the Host what size of the
- queue the Guest will use. This applies to the queue selected by
- writing to QueueSel.
+2.7 virtio_ring.h
+=================
-• 0x03c | W | QueueAlign
- Used Ring alignment in the virtual queue.
- Writing to this register notifies the Host about alignment
- boundary of the Used Ring in bytes. This value must be a power
- of 2 and applies to the queue selected by writing to QueueSel.
+#ifndef VIRTIO_RING_H
+#define VIRTIO_RING_H
+/* An interface for efficient virtio implementation.
+ *
+ * This header is BSD licensed so anyone can use the definitions
+ * to implement compatible drivers/servers.
+ *
+ * Copyright 2007, 2009, IBM Corporation
+ * Copyright 2011, Red Hat, Inc
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
-• 0x040 | RW | QueuePFN
- Guest physical page number of the virtual queue.
- Writing to this register notifies the host about location of the
- virtual queue in the Guest's physical address space. This value
- is the index number of a page starting with the queue
- Descriptor Table. Value zero (0x0) means physical address zero
- (0x00000000) and is illegal. When the Guest stops using the
- queue it must write zero (0x0) to this register.
- Reading from this register returns the currently used page
- number of the queue, therefore a value other than zero (0x0)
- means that the queue is in use.
- Both read and write accesses apply to the queue selected by
- writing to QueueSel.
+/* This marks a buffer as continuing via the next field. */
+#define VRING_DESC_F_NEXT 1
+/* This marks a buffer as write-only (otherwise read-only). */
+#define VRING_DESC_F_WRITE 2
-• 0x050 | W | QueueNotify
- Queue notifier.
- Writing a queue index to this register notifies the Host that
- there are new buffers to process in the queue.
+/* The Host uses this in used->flags to advise the Guest: don't kick me
+ * when you add a buffer. It's unreliable, so it's simply an
+ * optimization. Guest will still kick if it's out of buffers. */
+#define VRING_USED_F_NO_NOTIFY 1
+/* The Guest uses this in avail->flags to advise the Host: don't
+ * interrupt me when you consume a buffer. It's unreliable, so it's
+ * simply an optimization. */
+#define VRING_AVAIL_F_NO_INTERRUPT 1
-• 0x60 | R | InterruptStatus
-Interrupt status.
-Reading from this register returns a bit mask of interrupts
- asserted by the device. An interrupt is asserted if the
- corresponding bit is set, ie. equals one (1).
+/* Virtio ring descriptors: 16 bytes.
+ * These can chain together via "next". */
+struct vring_desc {
+ /* Address (guest-physical). */
+ uint64_t addr;
+ /* Length. */
+ uint32_t len;
+ /* The flags as indicated above. */
+ uint16_t flags;
+ /* We chain unused descriptors via this, too */
+ uint16_t next;
+};
- – Bit 0 | Used Ring Update
- This interrupt is asserted when the Host has updated the Used
- Ring in at least one of the active virtual queues.
+struct vring_avail {
+ uint16_t flags;
+ uint16_t idx;
+ uint16_t ring[];
+ /* Only if VIRTIO_F_RING_EVENT_IDX: uint16_t used_event; follows ring[num]. */
+};
- – Bit 1 | Configuration change
- This interrupt is asserted when configuration of the device has
- changed.
+/* u32 is used here for ids for padding reasons. */
+struct vring_used_elem {
+ /* Index of start of used descriptor chain. */
+ uint32_t id;
+ /* Total length of the descriptor chain which was written to. */
+ uint32_t len;
+};
-• 0x064 | W | InterruptACK
- Interrupt acknowledge.
- Writing to this register notifies the Host that the Guest
- finished handling interrupts. Set bits in the value clear the
- corresponding bits of the InterruptStatus register.
+struct vring_used {
+ uint16_t flags;
+ uint16_t idx;
+ struct vring_used_elem ring[];
+ /* Only if VIRTIO_F_RING_EVENT_IDX: uint16_t avail_event; follows ring[num]. */
+};
-• 0x070 | RW | Status
- Device status.
- Reading from this register returns the current device status
- flags.
- Writing non-zero values to this register sets the status flags,
- indicating the Guest progress. Writing zero (0x0) to this
- register triggers a device reset.
- Also see [sub:Device-Initialization-Sequence]
+struct vring {
+ unsigned int num;
-• 0x100+ | RW | Config
- Device-specific configuration space starts at an offset 0x100
- and is accessed with byte alignment. Its meaning and size
- depends on the device and the driver.
+ struct vring_desc *desc;
+ struct vring_avail *avail;
+ struct vring_used *used;
+};
-Virtual queue size is a number of elements in the queue,
-therefore size of the descriptor table and both available and
-used rings.
+/* The standard layout for the ring is a continuous chunk of memory which
+ * looks like this. We assume num is a power of 2.
+ *
+ * struct vring {
+ * // The actual descriptors (16 bytes each)
+ * struct vring_desc desc[num];
+ *
+ * // A ring of available descriptor heads with free-running index.
+ * __u16 avail_flags;
+ * __u16 avail_idx;
+ * __u16 available[num];
+ *
+ * // Padding to the next align boundary.
+ * char pad[];
+ *
+ * // A ring of used descriptor heads with free-running index.
+ * __u16 used_flags;
+ * __u16 used_idx;
+ * struct vring_used_elem used[num];
+ * };
+ * Note: for virtio PCI, align is 4096.
+ */
+static inline void vring_init(struct vring *vr, unsigned int num, void *p,
+ unsigned long align)
+{
+ vr->num = num;
+ vr->desc = p;
+ vr->avail = p + num*sizeof(struct vring_desc);
+ vr->used = (void *)(((unsigned long)&vr->avail->ring[num]
+ + align-1)
+ & ~(align - 1));
+}
-The endianness of the registers follows the native endianness of
-the Guest. Writing to registers described as “R” and reading from
-registers described as “W” is not permitted and can cause
-undefined behavior.
+static inline unsigned vring_size(unsigned int num, unsigned long align)
+{
+ return ((sizeof(struct vring_desc)*num + sizeof(uint16_t)*(2+num)
+ + align - 1) & ~(align - 1))
+ + sizeof(uint16_t)*3 + sizeof(struct vring_used_elem)*num;
+}
-The device initialization is performed as described in 2.2.1 Device
-Initialization Sequence with one exception: the Guest must notify the
-Host about its page size, writing the size in bytes to GuestPageSize
-register before the initialization is finished.
+static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
+{
+ return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
+}
+#endif /* VIRTIO_RING_H */
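
To make the size and event-index arithmetic in the header concrete,
here is a minimal standalone sketch. It hard-codes the 16-byte
descriptor and 8-byte used element sizes stated in the header comments
instead of including the header itself; the names example_vring_size and
example_need_event are hypothetical stand-ins for vring_size and
vring_need_event, and the formulas are copied from the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_SIZE      16u /* sizeof(struct vring_desc) per the header */
#define USED_ELEM_SIZE  8u /* sizeof(struct vring_used_elem) */

/* Same arithmetic as vring_size(): descriptor table plus available
 * ring, rounded up to align, followed by the used ring. align must
 * be a power of 2. */
static size_t example_vring_size(unsigned int num, size_t align)
{
	size_t first = DESC_SIZE * (size_t)num
		     + sizeof(uint16_t) * (2 + (size_t)num);

	first = (first + align - 1) & ~(align - 1);
	return first + sizeof(uint16_t) * 3 + USED_ELEM_SIZE * (size_t)num;
}

/* Same test as vring_need_event(): true when the index moved past
 * event_idx while advancing from old_idx to new_idx. The casts to
 * uint16_t keep the comparison correct across index wraparound. */
static int example_need_event(uint16_t event_idx, uint16_t new_idx,
			      uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) <
	       (uint16_t)(new_idx - old_idx);
}
```

For num = 256 and align = 4096, the first part is 4096 + 516 = 4612
bytes, rounded up to 8192; adding 6 + 2048 bytes for the used ring gives
10246 bytes in total. An event index of 10 triggers when the ring index
advances from 10 to 12, but an event index of 12 does not.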
-The memory mapped virtio devices generate single interrupt only,
-therefore no special configuration is required.
-Virtqueue Configuration
-The virtual queue configuration is performed in a similar way to
-the one described in 2.3 Virtqueue Configuration with a few
-additional operations:
+2.10 Creating New Device Types
+==============================
-1. Select the queue writing its index (first queue is 0) to the
- QueueSel register.
+Various considerations are necessary when creating a new device
+type.
+
+2.10.1 How Many Virtqueues?
+---------------------------
-2. Check if the queue is not already in use: read QueuePFN
- register, returned value should be zero (0x0).
+It is possible that a very simple device will operate entirely
+through its configuration space, but most will need at least one
+virtqueue in which it will place requests. A device with both
+input and output (eg. console and network devices described here)
+needs two queues: one which the driver fills with buffers to
+receive input, and one in which the driver places buffers to
+transmit output.
-3. Read maximum queue size (number of elements) from the
- QueueNumMax register. If the returned value is zero (0x0) the
- queue is not available.
+2.10.2 What Configuration Space Layout?
+---------------------------------------
-4. Allocate and zero the queue pages in contiguous virtual
- memory, aligning the Used Ring to an optimal boundary (usually
- page size). Size of the allocated queue may be smaller than or
- equal to the maximum size returned by the Host.
+Configuration space should only be used for initialization-time
+parameters. It is a limited resource with no synchronization, so for
+most uses it is better to use a virtqueue to update configuration
+information (the network device does this for filtering,
+otherwise the table in the config space could potentially be very
+large).
-5. Notify the Host about the queue size by writing the size to
- QueueNum register.
+2.10.3 What Device Number?
+--------------------------
-6. Notify the Host about the used alignment by writing its value
- in bytes to QueueAlign register.
+Currently device numbers are assigned quite freely: a simple
+request mail to the author of this document or the Linux
+virtualization mailing list[9] will be sufficient to secure a unique one.
-7. Write the physical number of the first page of the queue to
- the QueuePFN register.
+Meanwhile for experimental drivers, use 65535 and work backwards.
-The queue and the device are ready to begin normal operations
-now.
+2.10.4 How many MSI-X vectors? (for PCI)
+-----------------------------------------
-Device Operation
+Using the optional MSI-X capability, devices can speed up
+interrupt processing by removing the need for the guest driver to
+read the ISR Status register (which might be an expensive operation),
+reducing interrupt sharing between devices and queues within the
+device, and handling interrupts from multiple CPUs. However, some
+systems impose a limit (which might be as low as 256) on the
+total number of MSI-X vectors that can be allocated to all
+devices. Devices and/or device drivers should take this into
+account, limiting the number of vectors used unless the device is
+expected to cause a high volume of interrupts. Devices can
+control the number of vectors used by limiting the MSI-X Table
+Size or not presenting MSI-X capability in PCI configuration
+space. Drivers can control this by mapping events to as small a
+number of vectors as possible, or disabling MSI-X capability
+altogether.
-The memory mapped virtio device behaves in the same way as
-described in 2.4 Device Operation, with the following
-exceptions:
+2.10.5 Device Improvements
+--------------------------
-1. The device is notified about new buffers available in a queue
- by writing the queue index to register QueueNum instead of the
- virtio header in PCI I/O space (2.4.1.4 Notifying The Device).
+Any change to configuration space, or new virtqueues, or
+behavioural changes, should be indicated by negotiation of a new
+feature bit. This establishes clarity[11] and avoids future expansion problems.
-2. The memory mapped virtio device is using single, dedicated
- interrupt signal, which is raised when at least one of the
- interrupts described in the InterruptStatus register
- description is asserted. After receiving an interrupt, the
- driver must read the InterruptStatus register to check what
- caused the interrupt (see the register description). After the
- interrupt is handled, the driver must acknowledge it by writing
- a bit mask corresponding to the serviced interrupt to the
- InterruptACK register.
+Clusters of functionality which are always implemented together
+can use a single bit, but if one feature makes sense without the
+others they should not be gratuitously grouped together to
+conserve feature bits. We can always extend the spec when the
+first person needs more than 24 feature bits for their device.
FOOTNOTES:
+==========
+
[1] This lack of page-sharing implies that the implementation of the
device (e.g. the hypervisor or host) needs full access to the
guest memory. Communication with untrusted parties (i.e.
@@ -2524,7 +2740,6 @@ a cautious driver should arrange it so.
[11] Even if it does mean documenting design or implementation
mistakes!
-[12] Only if VIRTIO_NET_F_CTRL_VQ set
[13] It was supposed to indicate segmentation offload support, but
upon further investigation it became clear that multiple bits
@@ -2575,8 +2790,6 @@ does not distinguish between them.
[23] The FLUSH and FLUSH_OUT types are equivalent, the device does not
distinguish between them
-[24] Ports 2 onwards only if VIRTIO_CONSOLE_F_MULTIPORT is set.
-
[25] Because this is high importance and low bandwidth, the current
Linux implementation polls for the buffer to be used, rather than
waiting for an interrupt, simplifying the implementation
@@ -2585,8 +2798,6 @@ O_NONBLOCK flag set, the polling limitation is relaxed and the
consumed buffers are freed upon the next write or poll call or
when a port is closed or hot-unplugged.
-[26] Only if VIRTIO_BALLON_F_STATS_VQ set.
-
[27] This is historical, and independent of the guest page size
[28] In this case, deflation advice is merely a courtesy