Diffstat (limited to 'content.tex')
-rw-r--r-- | content.tex | 5887 |
1 files changed, 5887 insertions, 0 deletions
diff --git a/content.tex b/content.tex new file mode 100644 index 0000000..e57ebc5 --- /dev/null +++ b/content.tex @@ -0,0 +1,5887 @@ +\chapter{Basic Facilities of a Virtio Device}\label{sec:Basic Facilities of a Virtio Device} + +A virtio device is discovered and identified by a bus-specific method +(see the bus specific sections: \ref{sec:Virtio Transport Options / Virtio Over PCI Bus}~\nameref{sec:Virtio Transport Options / Virtio Over PCI Bus}, +\ref{sec:Virtio Transport Options / Virtio Over MMIO}~\nameref{sec:Virtio Transport Options / Virtio Over MMIO} and \ref{sec:Virtio Transport Options / Virtio Over Channel I/O}~\nameref{sec:Virtio Transport Options / Virtio Over Channel I/O}). Each +device consists of the following parts: + +\begin{itemize} +\item Device status field +\item Feature bits +\item Device Configuration space +\item One or more virtqueues +\end{itemize} + +\section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Device / Device Status Field} +During device initialization by a driver, +the driver follows the sequence of steps specified in +\ref{sec:General Initialization And Device Operation / Device +Initialization}. + +The \field{device status} field provides a simple low-level +indication of the completed steps of this sequence. +It's most useful to imagine it hooked up to traffic +lights on the console indicating the status of each device. The +following bits are defined (listed below in the order in which +they would be typically set): +\begin{description} +\item[ACKNOWLEDGE (1)] Indicates that the guest OS has found the + device and recognized it as a valid virtio device. + +\item[DRIVER (2)] Indicates that the guest OS knows how to drive the + device. + \begin{note} + There could be a significant (or infinite) delay before setting + this bit. For example, under Linux, drivers can be loadable modules. 
+ \end{note} + +\item[FAILED (128)] Indicates that something went wrong in the guest, + and it has given up on the device. This could be an internal + error, or the driver didn't like the device for some reason, or + even a fatal error during device operation. + +\item[FEATURES_OK (8)] Indicates that the driver has acknowledged all the + features it understands, and feature negotiation is complete. + +\item[DRIVER_OK (4)] Indicates that the driver is set up and ready to + drive the device. + +\item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced + an error from which it can't recover. +\end{description} + +\drivernormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field} +The driver MUST update \field{device status}, +setting bits to indicate the completed steps of the driver +initialization sequence specified in +\ref{sec:General Initialization And Device Operation / Device +Initialization}. +The driver MUST NOT clear a +\field{device status} bit. If the driver sets the FAILED bit, +the driver MUST later reset the device before attempting to re-initialize. + +The driver SHOULD NOT rely on completion of operations of a +device if DEVICE_NEEDS_RESET is set. +\begin{note} +For example, the driver can't assume requests in flight will be +completed if DEVICE_NEEDS_RESET is set, nor can it assume that +they have not been completed. A good implementation will try to +recover by issuing a reset. +\end{note} + +\devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field} +The device MUST initialize \field{device status} to 0 upon reset. + +The device MUST NOT consume buffers or notify the driver before DRIVER_OK. + +\label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state +that a reset is needed. 
If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device +MUST send a device configuration change notification to the driver. + +\section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature Bits} + +Each virtio device offers all the features it understands. During +device initialization, the driver reads this and tells the device the +subset that it accepts. The only way to renegotiate is to reset +the device. + +This allows for forwards and backwards compatibility: if the device is +enhanced with a new feature bit, older drivers will not write that +feature bit back to the device. Similarly, if a driver is enhanced with a feature +that the device doesn't support, it sees that the new feature is not offered. + +Feature bits are allocated as follows: + +\begin{description} +\item[0 to 23] Feature bits for the specific device type + +\item[24 to 33] Feature bits reserved for extensions to the queue and + feature negotiation mechanisms + +\item[34 and above] Feature bits reserved for future extensions. +\end{description} + +\begin{note} +For example, feature bit 0 for a network device (i.e. +Device ID 1) indicates that the device supports checksumming of +packets. +\end{note} + +In particular, new fields in the device configuration space are +indicated by offering a new feature bit. + +\drivernormative{\subsection}{Feature Bits}{Basic Facilities of a Virtio Device / Feature Bits} +The driver MUST NOT accept a feature which the device did not offer, +and MUST NOT accept a feature which requires another feature which was +not accepted. + +The driver SHOULD go into backwards compatibility mode +if the device does not offer a feature it understands, otherwise MUST +set the FAILED \field{device status} bit and cease initialization. + +\devicenormative{\subsection}{Feature Bits}{Basic Facilities of a Virtio Device / Feature Bits} +The device MUST NOT offer a feature which requires another feature +which was not offered.
The device SHOULD accept any valid subset +of features the driver accepts, otherwise it MUST fail to set the +FEATURES_OK \field{device status} bit when the driver writes it. + +\subsection{Legacy Interface: A Note on Feature +Bits}\label{sec:Basic Facilities of a Virtio Device / Feature +Bits / Legacy Interface: A Note on Feature Bits} + +Transitional Drivers MUST detect Legacy Devices by detecting that +the feature bit VIRTIO_F_VERSION_1 is not offered. +Transitional devices MUST detect Legacy drivers by detecting that +VIRTIO_F_VERSION_1 has not been acknowledged by the driver. + +In this case the device is used through the legacy interface. + +Legacy interface support is OPTIONAL. +Thus, both transitional and non-transitional devices and +drivers are compliant with this specification. + +Requirements pertaining to transitional devices and drivers +are contained in sections named 'Legacy Interface' like this one. + +When the device is used through the legacy interface, transitional +devices and transitional drivers MUST operate according to the +requirements documented within these legacy interface sections. +Specification text within these sections generally does not apply +to non-transitional devices. + +\section{Device Configuration Space}\label{sec:Basic Facilities of a Virtio Device / Device Configuration Space} + +Device configuration space is generally used for rarely-changing or +initialization-time parameters. Where configuration fields are +optional, their existence is indicated by feature bits: future +versions of this specification will likely extend the device +configuration space by adding extra fields at the tail. + +\begin{note} +The device configuration space uses the little-endian format +for multi-byte fields. +\end{note} + +Each transport also provides a generation count for the device configuration +space, which will change whenever there is a possibility that two +accesses to the device configuration space can see different versions of that +space.
+ +\drivernormative{\subsection}{Device Configuration Space}{Basic Facilities of a Virtio Device / Device Configuration Space} +Drivers MUST NOT assume reads from +fields greater than 32 bits wide are atomic, nor are reads from +multiple fields: drivers SHOULD read device configuration space fields like so: + +\begin{lstlisting} +u32 before, after; +do { + before = get_config_generation(device); + // read config entry/entries. + after = get_config_generation(device); +} while (after != before); +\end{lstlisting} + +For optional configuration space fields, the driver MUST check that the +corresponding feature is offered before accessing that part of the configuration +space. +\begin{note} +See section \ref{sec:General Initialization And Device Operation / Device Initialization} for details on feature negotiation. +\end{note} + +Drivers MUST +NOT limit structure size and device configuration space size. Instead, +drivers SHOULD only check that device configuration space is {\em large enough} to +contain the fields necessary for device operation. + +\begin{note} +For example, if the specification states that device configuration +space 'includes a single 8-bit field' drivers should understand this to mean that +the device configuration space might also include an arbitrary amount of +tail padding, and accept any device configuration space size equal to or +greater than the specified 8-bit size. +\end{note} + +\devicenormative{\subsection}{Device Configuration Space}{Basic Facilities of a Virtio Device / Device Configuration Space} +The device MUST allow reading of any device-specific configuration +field before FEATURES_OK is set by the driver. This includes fields which are +conditional on feature bits, as long as those feature bits are offered +by the device. 
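The generation-count loop described in this section can be illustrated for a field wider than 32 bits. The following is only a sketch under stated assumptions: struct mock_device, its members and read_config_u64() are hypothetical stand-ins for a real transport, and byte-order conversion and volatile device accesses are deliberately omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a transport; a real driver would use
 * bus-specific accessors instead of plain memory reads. */
struct mock_device {
    uint8_t  generation;   /* configuration generation count */
    uint32_t capacity_lo;  /* low 32 bits of a 64-bit config field */
    uint32_t capacity_hi;  /* high 32 bits of a 64-bit config field */
};

static uint32_t get_config_generation(const struct mock_device *dev)
{
    return dev->generation;
}

/* Read a 64-bit field as two 32-bit accesses, retrying whenever the
 * generation count changed while the two halves were being read. */
static uint64_t read_config_u64(const struct mock_device *dev)
{
    uint32_t before, after, lo, hi;

    assert(dev != NULL);
    do {
        before = get_config_generation(dev);
        lo = dev->capacity_lo;
        hi = dev->capacity_hi;
        after = get_config_generation(dev);
    } while (after != before);

    return ((uint64_t)hi << 32) | lo;
}
```

With a mock whose halves are 0x01234567 and 0x89abcdef, read_config_u64() returns 0x0123456789abcdef once both generation reads agree.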
+ +\subsection{Legacy Interface: A Note on Device Configuration Space endian-ness}\label{sec:Basic Facilities of a Virtio Device / Device Configuration Space / Legacy Interface: A Note on Configuration Space endian-ness} + +Note that for legacy interfaces, device configuration space is generally the +guest's native endian, rather than PCI's little-endian. +The correct endian-ness is documented for each device. + +\subsection{Legacy Interface: Device Configuration Space}\label{sec:Basic Facilities of a Virtio Device / Device Configuration Space / Legacy Interface: Device Configuration Space} + +Legacy devices did not have a configuration generation field, thus are +susceptible to race conditions if configuration is updated. This +affects the block \field{capacity} (see \ref{sec:Device Types / +Block Device / Device configuration layout}) and +network \field{mac} (see \ref{sec:Device Types / Network Device / +Device configuration layout}) fields; +when using the legacy interface, drivers SHOULD +read these fields multiple times until two reads generate a consistent +result. + +\section{Virtqueues}\label{sec:Basic Facilities of a Virtio Device / Virtqueues} + +The mechanism for bulk data transport on virtio devices is +pretentiously called a virtqueue. Each device can have zero or more +virtqueues\footnote{For example, the simplest network device has one virtqueue for +transmit and one for receive.}. Each queue has a 16-bit queue size +parameter, which sets the number of entries and implies the total size +of the queue. + +Each virtqueue consists of three parts: + +\begin{itemize} +\item Descriptor Table +\item Available Ring +\item Used Ring +\end{itemize} + +where each part is physically-contiguous in guest memory, +and has different alignment requirements. 
+ +The memory alignment and size requirements, in bytes, of each part of the +virtqueue are summarized in the following table: + +\begin{tabular}{|l|l|l|} +\hline +Virtqueue Part & Alignment & Size \\ +\hline \hline +Descriptor Table & 16 & $16 * $(Queue Size) \\ +\hline +Available Ring & 2 & $6 + 2 * $(Queue Size) \\ + \hline +Used Ring & 4 & $6 + 8 * $(Queue Size) \\ + \hline +\end{tabular} + +The Alignment column gives the minimum alignment for each part +of the virtqueue. + +The Size column gives the total number of bytes for each +part of the virtqueue. + +Queue Size corresponds to the maximum number of buffers in the +virtqueue\footnote{For example, if Queue Size is 4 then at most 4 buffers +can be queued at any given time.}. The Queue Size value is always a +power of 2. The maximum Queue Size value is 32768. This value +is specified in a bus-specific way. + +When the driver wants to send a buffer to the device, it fills in +a slot in the descriptor table (or chains several together), and +writes the descriptor index into the available ring. It then +notifies the device. When the device has finished a buffer, it +writes the descriptor index into the used ring, and sends an interrupt. + +\drivernormative{\subsection}{Virtqueues}{Basic Facilities of a Virtio Device / Virtqueues} +The driver MUST ensure that the physical address of the first byte +of each virtqueue part is a multiple of the specified alignment value +in the above table.
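As a worked example of the table above, the size and alignment rules can be written as small helpers. This is a sketch only: the function names are hypothetical, and treating a Queue Size of 0 as invalid is an assumption of the sketch rather than something stated in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Queue Size is a power of 2 and at most 32768.
 * Rejecting 0 is an assumption of this sketch. */
static bool queue_size_is_valid(uint32_t qsz)
{
    return qsz != 0 && qsz <= 32768 && (qsz & (qsz - 1)) == 0;
}

/* Sizes in bytes, taken directly from the Size column of the table. */
static size_t desc_table_bytes(uint32_t qsz)
{
    assert(queue_size_is_valid(qsz));
    return (size_t)16 * qsz;
}

static size_t avail_ring_bytes(uint32_t qsz)
{
    assert(queue_size_is_valid(qsz));
    return 6 + (size_t)2 * qsz;
}

static size_t used_ring_bytes(uint32_t qsz)
{
    assert(queue_size_is_valid(qsz));
    return 6 + (size_t)8 * qsz;
}

/* True if addr is a multiple of the alignment from the table
 * (16 for the descriptor table, 2 and 4 for the two rings). */
static bool is_aligned(uint64_t addr, uint64_t alignment)
{
    assert(alignment != 0);
    return addr % alignment == 0;
}
```

For a Queue Size of 256 these helpers give 4096, 518 and 2054 bytes for the descriptor table, available ring and used ring respectively.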
+ +\subsection{Legacy Interfaces: A Note on Virtqueue Layout}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Legacy Interfaces: A Note on Virtqueue Layout} + +For Legacy Interfaces, several additional +restrictions are placed on the virtqueue layout: + +Each virtqueue occupies two or more physically-contiguous pages +(usually defined as 4096 bytes, but depending on the transport; +henceforth referred to as Queue Align) +and consists of three parts: + +\begin{tabular}{|l|l|l|} +\hline +Descriptor Table & Available Ring (\ldots padding\ldots) & Used Ring \\ +\hline +\end{tabular} + +The bus-specific Queue Size field controls the total number of bytes +for the virtqueue. +When using the legacy interface, the transitional +driver MUST retrieve the Queue Size field from the device +and MUST allocate the total number of bytes for the virtqueue +according to the following formula (Queue Align given in qalign and +Queue Size given in qsz): + +\begin{lstlisting} +// Round x up to the next multiple of qalign (qalign is a power of 2). +#define ALIGN(x) (((x) + qalign - 1) & ~(qalign - 1)) +static inline unsigned virtq_size(unsigned int qsz) +{ + return ALIGN(sizeof(struct virtq_desc)*qsz + sizeof(u16)*(3 + qsz)) + + ALIGN(sizeof(u16)*3 + sizeof(struct virtq_used_elem)*qsz); +} +\end{lstlisting} + +This wastes some space with padding. +When using the legacy interface, both transitional +devices and drivers MUST use the following virtqueue layout +structure to locate elements of the virtqueue: + +\begin{lstlisting} +struct virtq { + // The actual descriptors (16 bytes each) + struct virtq_desc desc[ Queue Size ]; + + // A ring of available descriptor heads with free-running index. + struct virtq_avail avail; + + // Padding to the next Queue Align boundary. + u8 pad[ Padding ]; + + // A ring of used descriptor heads with free-running index.
+ struct virtq_used used; +}; +\end{lstlisting} + +\subsection{Legacy Interfaces: A Note on Virtqueue Endianness}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Legacy Interfaces: A Note on Virtqueue Endianness} + +Note that when using the legacy interface, transitional +devices and drivers MUST use the native +endian of the guest as the endian of fields in the virtqueue. +This is opposed to little-endian for the non-legacy interface as +specified by this standard. +It is assumed that the host is already aware of the guest endian. + +\subsection{Message Framing}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing} +The framing of messages with descriptors is +independent of the contents of the buffers. For example, a network +transmit buffer consists of a 12 byte header followed by the network +packet. This could be most simply placed in the descriptor table as a +12 byte output descriptor followed by a 1514 byte output descriptor, +but it could also consist of a single 1526 byte output descriptor in +the case where the header and packet are adjacent, or even three or +more descriptors (possibly with loss of efficiency in that case). + +Note that some device implementations have large-but-reasonable +restrictions on total descriptor size (such as based on IOV_MAX in the +host OS). This has not been a problem in practice: little sympathy +will be given to drivers which create unreasonably-sized descriptors +such as by dividing a network packet into 1500 single-byte +descriptors! + +\devicenormative{\subsubsection}{Message Framing}{Basic Facilities of a Virtio Device / Message Framing} +The device MUST NOT make assumptions about the particular arrangement +of descriptors. The device MAY have a reasonable limit of descriptors +it will allow in a chain.
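To illustrate why the arrangement of descriptors must not matter, the sketch below gathers a message from an arbitrary list of segments into one flat buffer, so that a 12 + 1514 byte split and a single 1526 byte segment yield the same bytes. struct segment and gather() are hypothetical names introduced for this example and do not correspond to any structure in this specification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of one device-readable descriptor's buffer. */
struct segment {
    const unsigned char *data;
    size_t len;
};

/* Copy all segments, in order, into out. Returns the total number of
 * bytes copied, or 0 if the message does not fit into cap bytes. */
static size_t gather(const struct segment *segs, size_t nsegs,
                     unsigned char *out, size_t cap)
{
    size_t total = 0;

    assert(segs != NULL && out != NULL);
    for (size_t i = 0; i < nsegs; i++) {
        if (segs[i].len > cap - total)
            return 0;   /* exceeds the limit this consumer accepts */
        memcpy(out + total, segs[i].data, segs[i].len);
        total += segs[i].len;
    }
    return total;
}
```

A consumer that parses the header and packet from the gathered bytes, rather than from descriptor boundaries, behaves identically for every framing of the same message.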
+ +\drivernormative{\subsubsection}{Message Framing}{Basic Facilities of a Virtio Device / Message Framing} +The driver MUST place any device-writable descriptor elements after +any device-readable descriptor elements. + +The driver SHOULD NOT use an excessive number of descriptors to +describe a buffer. + +\subsubsection{Legacy Interface: Message Framing}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing / Legacy Interface: Message Framing} + +Regrettably, initial driver implementations used simple layouts, and +devices came to rely on it, despite this specification wording. In +addition, the specification for virtio_blk SCSI commands required +intuiting field lengths from frame boundaries (see + \ref{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation}~\nameref{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation}) + +Thus when using the legacy interface, the VIRTIO_F_ANY_LAYOUT +feature indicates to both the device and the driver that no +assumptions were made about framing. Requirements for +transitional drivers when this is not negotiated are included in +each device section. + +\subsection{The Virtqueue Descriptor Table}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table} + +The descriptor table refers to the buffers the driver is using for +the device. \field{addr} is a physical address, and the buffers +can be chained via \field{next}. Each descriptor describes a +buffer which is read-only for the device (``device-readable'') or write-only for the device (``device-writable''), but a chain of +descriptors can contain both device-readable and device-writable buffers. + +The actual contents of the memory offered to the device depends on the +device type. Most common is to begin the data with a header +(containing little-endian fields) for the device to read, and postfix +it with a status tailer for the device to write. 
+ +\begin{lstlisting} +struct virtq_desc { + /* Address (guest-physical). */ + le64 addr; + /* Length. */ + le32 len; + +/* This marks a buffer as continuing via the next field. */ +#define VIRTQ_DESC_F_NEXT 1 +/* This marks a buffer as device write-only (otherwise device read-only). */ +#define VIRTQ_DESC_F_WRITE 2 +/* This means the buffer contains a list of buffer descriptors. */ +#define VIRTQ_DESC_F_INDIRECT 4 + /* The flags as indicated above. */ + le16 flags; + /* Next field if flags & NEXT */ + le16 next; +}; +\end{lstlisting} + +The number of descriptors in the table is defined by the queue size +for this virtqueue: this is the maximum possible descriptor chain length. + +\begin{note} +The legacy \hyperref[intro:Virtio PCI Draft]{[Virtio PCI Draft]} +referred to this structure as vring_desc, and the constants as +VRING_DESC_F_NEXT, etc, but the layout and values were identical. +\end{note} + +\devicenormative{\subsubsection}{The Virtqueue Descriptor Table}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table} +A device MUST NOT write to a device-readable buffer, and a device SHOULD NOT +read a device-writable buffer (it MAY do so for debugging or diagnostic +purposes). + +\drivernormative{\subsubsection}{The Virtqueue Descriptor Table}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table} +Drivers MUST NOT add a descriptor chain longer than $2^{32}$ bytes in total; +this implies that loops in the descriptor chain are forbidden! + +\subsubsection{Indirect Descriptors}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors} + +Some devices benefit by concurrently dispatching a large number +of large requests. The VIRTIO_F_INDIRECT_DESC feature allows this (see \ref{sec:virtio-queue.h}~\nameref{sec:virtio-queue.h}).
To increase +ring capacity the driver can store a table of indirect +descriptors anywhere in memory, and insert a descriptor in main +virtqueue (with \field{flags}\&VIRTQ_DESC_F_INDIRECT on) that refers to memory buffer +containing this indirect descriptor table; \field{addr} and \field{len} +refer to the indirect table address and length in bytes, +respectively. + +The indirect table layout structure looks like this +(\field{len} is the length of the descriptor that refers to this table, +which is a variable, so this code won't compile): + +\begin{lstlisting} +struct indirect_descriptor_table { + /* The actual descriptors (16 bytes each) */ + struct virtq_desc desc[len / 16]; +}; +\end{lstlisting} + +The first indirect descriptor is located at start of the indirect +descriptor table (index 0), additional indirect descriptors are +chained by \field{next}. An indirect descriptor without a valid \field{next} +(with \field{flags}\&VIRTQ_DESC_F_NEXT off) signals the end of the descriptor. +A single indirect descriptor +table can include both device-readable and device-writable descriptors. + +\drivernormative{\paragraph}{Indirect Descriptors}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors} +The driver MUST NOT set the VIRTQ_DESC_F_INDIRECT flag unless the +VIRTIO_F_INDIRECT_DESC feature was negotiated. The driver MUST NOT +set the VIRTQ_DESC_F_INDIRECT flag within an indirect descriptor (ie. only +one table per descriptor). + +A driver MUST NOT create a descriptor chain longer than the Queue Size of +the device. + +A driver MUST NOT set both VIRTQ_DESC_F_INDIRECT and VIRTQ_DESC_F_NEXT +in \field{flags}. + +\devicenormative{\paragraph}{Indirect Descriptors}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors} +The device MUST ignore the write-only flag (\field{flags}\&VIRTQ_DESC_F_WRITE) in the descriptor that refers to an indirect table. 
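The chaining rules in this section can be checked mechanically while walking a chain. The sketch below is one possible validator, not a required algorithm: struct desc is a host-endian mirror of struct virtq_desc, chain_length() is a hypothetical helper, byte-order conversion is omitted, and the walker does not descend into indirect tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT     1
#define VIRTQ_DESC_F_WRITE    2
#define VIRTQ_DESC_F_INDIRECT 4

/* Host-endian mirror of struct virtq_desc (assumption of this sketch). */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Walk the chain starting at head. Returns the number of descriptors in
 * the chain, or -1 if the chain is out of range, longer than the Queue
 * Size (which also catches loops), sets INDIRECT together with NEXT, or
 * places a device-readable descriptor after a device-writable one. */
static int chain_length(const struct desc *table, uint16_t qsz, uint16_t head)
{
    int count = 0;
    int seen_writable = 0;
    uint16_t i = head;

    assert(table != NULL && qsz != 0);
    for (;;) {
        const struct desc *d;

        if (i >= qsz || count >= qsz)
            return -1;
        d = &table[i];
        count++;

        if (d->flags & VIRTQ_DESC_F_INDIRECT) {
            if (d->flags & VIRTQ_DESC_F_NEXT)
                return -1;
            return count;   /* WRITE flag is ignored on this descriptor */
        }
        if (d->flags & VIRTQ_DESC_F_WRITE)
            seen_writable = 1;
        else if (seen_writable)
            return -1;

        if (!(d->flags & VIRTQ_DESC_F_NEXT))
            return count;
        i = d->next;
    }
}
```

Bounding the walk by the Queue Size means a looping chain is rejected after at most Queue Size steps instead of being followed forever.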
+ +The device MUST handle the case of zero or more normal chained +descriptors followed by a single descriptor with \field{flags}\&VIRTQ_DESC_F_INDIRECT. + +\begin{note} +While unusual (most implementations either create a chain solely using +non-indirect descriptors, or use a single indirect element), such a +layout is valid. +\end{note} + +\subsection{The Virtqueue Available Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Available Ring} + +\begin{lstlisting} +struct virtq_avail { +#define VIRTQ_AVAIL_F_NO_INTERRUPT 1 + le16 flags; + le16 idx; + le16 ring[ /* Queue Size */ ]; + le16 used_event; /* Only if VIRTIO_F_EVENT_IDX */ +}; +\end{lstlisting} + +The driver uses the available ring to offer buffers to the +device: each ring entry refers to the head of a descriptor chain. It is only +written by the driver and read by the device. + +\field{idx} field indicates where the driver would put the next descriptor +entry in the ring (modulo the queue size). This starts at 0, and increases. + +\begin{note} +The legacy \hyperref[intro:Virtio PCI Draft]{[Virtio PCI Draft]} +referred to this structure as vring_avail, and the constant as +VRING_AVAIL_F_NO_INTERRUPT, but the layout and value were identical. +\end{note} + +\subsection{Virtqueue Interrupt Suppression}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression} + +If the VIRTIO_F_EVENT_IDX feature bit is not negotiated, +the \field{flags} field in the available ring offers a crude mechanism for the driver to inform +the device that it doesn't want interrupts when buffers are used. Otherwise +\field{used_event} is a more performant alternative where the driver +specifies how far the device can progress before interrupting. + +Neither of these interrupt suppression methods are reliable, as they +are not synchronized with the device, but they serve as +useful optimizations. 
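The two suppression mechanisms can be summarized from the device's point of view as a single decision taken after each used ring update. The following is a sketch under stated assumptions: device_should_interrupt() and its parameters are hypothetical, the values are assumed to be already converted to host byte order, and used_idx_before denotes the used ring \field{idx} value that determined where the entry was placed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_AVAIL_F_NO_INTERRUPT 1

/* Decide whether to interrupt the driver after placing one entry in the
 * used ring. This is advisory only: the driver still has to cope with
 * spurious interrupts. */
static bool device_should_interrupt(bool event_idx_negotiated,
                                    uint16_t avail_flags,
                                    uint16_t used_event,
                                    uint16_t used_idx_before)
{
    if (!event_idx_negotiated) {
        /* used_event is ignored; only the flag in the available ring counts. */
        return (avail_flags & VIRTQ_AVAIL_F_NO_INTERRUPT) == 0;
    }
    /* The flag is ignored; interrupt only when the entry just written
     * was placed at the index the driver asked to be told about. */
    return used_idx_before == used_event;
}
```

Because both indices are 16-bit free-running counters, the comparison in the VIRTIO_F_EVENT_IDX case matches again once the used index has wrapped around to the same value.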
+ +\drivernormative{\subsubsection}{Virtqueue Interrupt Suppression}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression} +If the VIRTIO_F_EVENT_IDX feature bit is not negotiated: +\begin{itemize} +\item The driver MUST set \field{flags} to 0 or 1. +\item The driver MAY set \field{flags} to 1 to advise +the device that interrupts are not needed. +\end{itemize} + +Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated: +\begin{itemize} +\item The driver MUST set \field{flags} to 0. +\item The driver MAY use \field{used_event} to advise the device that interrupts are unnecessary until the device writes entry with an index specified by \field{used_event} into the used ring (equivalently, until \field{idx} in the +used ring will reach the value \field{used_event} + 1). +\end{itemize} + +The driver MUST handle spurious interrupts from the device. + +\devicenormative{\subsubsection}{Virtqueue Interrupt Suppression}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression} + +If the VIRTIO_F_EVENT_IDX feature bit is not negotiated: +\begin{itemize} +\item The device MUST ignore the \field{used_event} value. +\item After the device writes a descriptor index into the used ring: + \begin{itemize} + \item If \field{flags} is 1, the device SHOULD NOT send an interrupt. + \item If \field{flags} is 0, the device MUST send an interrupt. + \end{itemize} +\end{itemize} + +Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated: +\begin{itemize} +\item The device MUST ignore the lower bit of \field{flags}. +\item After the device writes a descriptor index into the used ring: + \begin{itemize} + \item If the \field{idx} field in the used ring (which determined + where that descriptor index was placed) was equal to + \field{used_event}, the device MUST send an interrupt. + \item Otherwise the device SHOULD NOT send an interrupt. 
+ \end{itemize} +\end{itemize} + +\begin{note} +For example, if \field{used_event} is 0, then a device using + VIRTIO_F_EVENT_IDX would interrupt after the first buffer is + used (and again after the 65536th buffer, etc). +\end{note} + +\subsection{The Virtqueue Used Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring} + +\begin{lstlisting} +struct virtq_used { +#define VIRTQ_USED_F_NO_NOTIFY 1 + le16 flags; + le16 idx; + struct virtq_used_elem ring[ /* Queue Size */]; + le16 avail_event; /* Only if VIRTIO_F_EVENT_IDX */ +}; + +/* le32 is used here for ids for padding reasons. */ +struct virtq_used_elem { + /* Index of start of used descriptor chain. */ + le32 id; + /* Total length of the descriptor chain which was used (written to) */ + le32 len; +}; +\end{lstlisting} + +The used ring is where the device returns buffers once it is done with +them: it is only written to by the device, and read by the driver. + +Each entry in the ring is a pair: \field{id} indicates the head entry of the +descriptor chain describing the buffer (this matches an entry +placed in the available ring by the guest earlier), and \field{len} the total +number of bytes written into the buffer. + +\begin{note} +\field{len} is particularly useful +for drivers using untrusted buffers: if a driver does not know exactly +how much has been written by the device, the driver would have to zero +the buffer in advance to ensure no data leakage occurs. + +For example, a network driver may hand a received buffer directly to +an unprivileged userspace application. If the network device has not +overwritten the bytes which were in that buffer, this could leak the +contents of freed memory from other processes to the application. +\end{note} + +\field{idx} field indicates where the device would put the next descriptor +entry in the ring (modulo the queue size). This starts at 0, and increases.
+ +\begin{note} +The legacy \hyperref[intro:Virtio PCI Draft]{[Virtio PCI Draft]} +referred to these structures as vring_used and vring_used_elem, and +the constant as VRING_USED_F_NO_NOTIFY, but the layout and value were +identical. +\end{note} + +\subsubsection{Legacy Interface: The Virtqueue Used +Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues +/ The Virtqueue Used Ring/ Legacy Interface: The Virtqueue Used +Ring} + +Historically, many drivers ignored the \field{len} value, as a +result, many devices set \field{len} incorrectly. Thus, when +using the legacy interface, it is generally a good idea to ignore +the \field{len} value in used ring entries if possible. Specific +known issues are listed per device type. + +\devicenormative{\subsubsection}{The Virtqueue Used Ring}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring} + +The device MUST set \field{len} prior to updating the used \field{idx}. + +The device MUST write at least \field{len} bytes to descriptor, +beginning at the first device-writable buffer, +prior to updating the used \field{idx}. + +The device MAY write more than \field{len} bytes to descriptor. + +\begin{note} +There are potential error cases where a device might not know what +parts of the buffers have been written. This is why \field{len} is +permitted to be an underestimate: that's preferable to the driver believing +that uninitialized memory has been overwritten when it has not. +\end{note} + +\drivernormative{\subsubsection}{The Virtqueue Used Ring}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring} + +The driver MUST NOT make assumptions about data in device-writable buffers +beyond the first \field{len} bytes, and SHOULD ignore this data. 
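A driver typically consumes the used ring by remembering the last \field{idx} value it has processed and walking forward until it catches up with the device. The sketch below shows that pattern under stated assumptions: struct used_ring_view, struct used_elem and drain_used() are hypothetical names, values are assumed to be in host byte order, and the memory barriers a real driver needs are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-endian mirror of struct virtq_used_elem. */
struct used_elem {
    uint32_t id;    /* head of the used descriptor chain */
    uint32_t len;   /* bytes the device reports as written */
};

/* Hypothetical snapshot of the used ring as seen by the driver. */
struct used_ring_view {
    uint16_t idx;                   /* free-running, written by the device */
    const struct used_elem *ring;   /* qsz entries */
    uint16_t qsz;                   /* Queue Size, a power of 2 */
};

/* Process every entry added since *last_seen. Returns the number of
 * entries consumed and adds their len values to *bytes_written; only
 * the first len bytes of each buffer are to be trusted. */
static unsigned drain_used(const struct used_ring_view *u,
                           uint16_t *last_seen, uint64_t *bytes_written)
{
    unsigned n = 0;

    assert(u != NULL && last_seen != NULL && bytes_written != NULL);
    assert(u->qsz != 0 && (u->qsz & (u->qsz - 1)) == 0);
    while (*last_seen != u->idx) {
        const struct used_elem *e = &u->ring[*last_seen % u->qsz];

        *bytes_written += e->len;
        (*last_seen)++;   /* wraps naturally at 65536 */
        n++;
    }
    return n;
}
```

Keeping the index free-running and reducing it modulo the Queue Size only when indexing the ring lets the comparison with \field{idx} work across the 16-bit wrap.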
+ +\subsection{Virtqueue Notification Suppression}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Notification Suppression} + +The device can suppress notifications in a manner analogous to the way +drivers can suppress interrupts as detailed in section \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression}. +The device manipulates \field{flags} or \field{avail_event} in the used ring the +same way the driver manipulates \field{flags} or \field{used_event} in the available ring. + +\drivernormative{\subsubsection}{Virtqueue Notification Suppression}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Notification Suppression} + +The driver MUST initialize \field{flags} in the used ring to 0 when +allocating the used ring. + +If the VIRTIO_F_EVENT_IDX feature bit is not negotiated: +\begin{itemize} +\item The driver MUST ignore the \field{avail_event} value. +\item After the driver writes a descriptor index into the available ring: + \begin{itemize} + \item If \field{flags} is 1, the driver SHOULD NOT send a notification. + \item If \field{flags} is 0, the driver MUST send a notification. + \end{itemize} +\end{itemize} + +Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated: +\begin{itemize} +\item The driver MUST ignore the lower bit of \field{flags}. +\item After the driver writes a descriptor index into the available ring: + \begin{itemize} + \item If the \field{idx} field in the available ring (which determined + where that descriptor index was placed) was equal to + \field{avail_event}, the driver MUST send a notification. + \item Otherwise the driver SHOULD NOT send a notification. 
+ \end{itemize} +\end{itemize} + +\devicenormative{\subsubsection}{Virtqueue Notification Suppression}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Notification Suppression} +If the VIRTIO_F_EVENT_IDX feature bit is not negotiated: +\begin{itemize} +\item The device MUST set \field{flags} to 0 or 1. +\item The device MAY set \field{flags} to 1 to advise +the driver that notifications are not needed. +\end{itemize} + +Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated: +\begin{itemize} +\item The device MUST set \field{flags} to 0. +\item The device MAY use \field{avail_event} to advise the driver that notifications are unnecessary until the driver writes entry with an index specified by \field{avail_event} into the available ring (equivalently, until \field{idx} in the +available ring will reach the value \field{avail_event} + 1). +\end{itemize} + +The device MUST handle spurious notifications from the driver. + +\subsection{Helpers for Operating Virtqueues}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Helpers for Operating Virtqueues} + +The Linux Kernel Source code contains the definitions above and +helper routines in a more usable form, in +include/uapi/linux/virtio_ring.h. This was explicitly licensed by IBM +and Red Hat under the (3-clause) BSD license so that it can be +freely used by all other projects, and is reproduced (with slight +variation) in \ref{sec:virtio-queue.h}~\nameref{sec:virtio-queue.h}. + +\chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation} + +We start with an overview of device initialization, then expand on the +details of the device and how each step is performed. This section +is best read along with the bus-specific section which describes +how to communicate with the specific device.
+ +\section{Device Initialization}\label{sec:General Initialization And Device Operation / Device Initialization} + +\drivernormative{\subsection}{Device Initialization}{General Initialization And Device Operation / Device Initialization} +The driver MUST follow this sequence to initialize a device: + +\begin{enumerate} +\item Reset the device. + +\item Set the ACKNOWLEDGE status bit: the guest OS has noticed the device. + +\item Set the DRIVER status bit: the guest OS knows how to drive the device. + +\item\label{itm:General Initialization And Device Operation / +Device Initialization / Read feature bits} Read device feature bits, and write the subset of feature bits + understood by the OS and driver to the device. During this step the + driver MAY read (but MUST NOT write) the device-specific configuration fields to check that it can support the device before accepting it. + +\item\label{itm:General Initialization And Device Operation / Device Initialization / Set FEATURES-OK} Set the FEATURES_OK status bit. The driver MUST NOT accept + new feature bits after this step. + +\item\label{itm:General Initialization And Device Operation / Device Initialization / Re-read FEATURES-OK} Re-read \field{device status} to ensure the FEATURES_OK bit is still + set: otherwise, the device does not support our subset of features + and the device is unusable. + +\item\label{itm:General Initialization And Device Operation / Device Initialization / Device-specific Setup} Perform device-specific setup, including discovery of virtqueues for the + device, optional per-bus setup, reading and possibly writing the + device's virtio configuration space, and population of virtqueues. + +\item\label{itm:General Initialization And Device Operation / Device Initialization / Set DRIVER-OK} Set the DRIVER_OK status bit. At this point the device is + ``live''.
+\end{enumerate} + +If any of these steps go irrecoverably wrong, the driver SHOULD +set the FAILED status bit to indicate that it has given up on the +device (it can reset the device later to restart if desired). The +driver MUST NOT continue initialization in that case. + +The driver MUST NOT notify the device before setting DRIVER_OK. + +\subsection{Legacy Interface: Device Initialization}\label{sec:General Initialization And Device Operation / Device Initialization / Legacy Interface: Device Initialization} +Legacy devices did not support the FEATURES_OK status bit, and thus did +not have a graceful way for the device to indicate unsupported feature +combinations. They also did not provide a clear mechanism to end +feature negotiation, which meant that devices finalized features on +first-use, and no features could be introduced which radically changed +the initial operation of the device. + +Legacy driver implementations often used the device before setting the +DRIVER_OK bit, and sometimes even before writing the feature bits +to the device. + +The result was the steps \ref{itm:General Initialization And +Device Operation / Device Initialization / Set FEATURES-OK} and +\ref{itm:General Initialization And Device Operation / Device +Initialization / Re-read FEATURES-OK} were omitted, and steps +\ref{itm:General Initialization And Device Operation / +Device Initialization / Read feature bits}, +\ref{itm:General Initialization And Device Operation / Device Initialization / Device-specific Setup} and \ref{itm:General Initialization And Device Operation / Device Initialization / Set DRIVER-OK} +were conflated. 
+ +Therefore, when using the legacy interface: +\begin{itemize} +\item +The transitional driver MUST execute the initialization +sequence as described in \ref{sec:General Initialization And Device +Operation / Device Initialization} +but omitting the steps \ref{itm:General Initialization And Device +Operation / Device Initialization / Set FEATURES-OK} and +\ref{itm:General Initialization And Device Operation / Device +Initialization / Re-read FEATURES-OK}. + +\item +The transitional device MUST support the driver +writing device configuration fields +before the step \ref{itm:General Initialization And Device Operation / +Device Initialization / Read feature bits}. +\item +The transitional device MUST support the driver +using the device before the step \ref{itm:General Initialization +And Device Operation / Device Initialization / Set DRIVER-OK}. +\end{itemize} + +\section{Device Operation}\label{sec:General Initialization And Device Operation / Device Operation} + +There are two parts to device operation: supplying new buffers to +the device, and processing used buffers from the device. + +\begin{note} As an +example, the simplest virtio network device has two virtqueues: the +transmit virtqueue and the receive virtqueue. The driver adds +outgoing (device-readable) packets to the transmit virtqueue, and then +frees them after they are used. Similarly, incoming (device-writable) +buffers are added to the receive virtqueue, and processed after +they are used. 
+\end{note} + +\subsection{Supplying Buffers to The Device}\label{sec:General Initialization And Device Operation / Device Operation / Supplying Buffers to The Device} + +The driver offers buffers to one of the device's virtqueues as follows: + +\begin{enumerate} +\item\label{itm:General Initialization And Device Operation / Device Operation / Supplying Buffers to The Device / Place Buffers} The driver places the buffer into free descriptor(s) in the + descriptor table, chaining as necessary (see \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table}~\nameref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table}). + +\item\label{itm:General Initialization And Device Operation / Device Operation / Supplying Buffers to The Device / Place Index} The driver places the index of the head of the descriptor chain + into the next ring entry of the available ring. + +\item Steps \ref{itm:General Initialization And Device Operation / Device Operation / Supplying Buffers to The Device / Place Buffers} and \ref{itm:General Initialization And Device Operation / Device Operation / Supplying Buffers to The Device / Place Index} MAY be performed repeatedly if batching + is possible. + +\item The driver performs a suitable memory barrier to ensure the device sees + the updated descriptor table and available ring before the next + step. + +\item The available \field{idx} is increased by the number of + descriptor chain heads added to the available ring. + +\item The driver performs a suitable memory barrier to ensure that it updates + the \field{idx} field before checking for notification suppression. + +\item If notifications are not suppressed, the driver notifies the device + of the new available buffers.
+\end{enumerate} + +Note that the above procedure does not take precautions against the +available ring buffer wrapping around: this is not possible since +the ring buffer is the same size as the descriptor table, so step +(1) will prevent such a condition. + +In addition, the maximum queue size is 32768 (the highest power +of 2 which fits in 16 bits), so the 16-bit \field{idx} value can always +distinguish between a full and empty buffer. + +What follows are the requirements of each stage in more detail. + +\subsubsection{Placing Buffers Into The Descriptor Table}\label{sec:General Initialization And Device Operation / Device Operation / Supplying Buffers to The Device / Placing Buffers Into The Descriptor Table} + +A buffer consists of zero or more device-readable physically-contiguous +elements followed by zero or more physically-contiguous +device-writable elements (each buffer has at least one element). This +algorithm maps it into the descriptor table to form a descriptor +chain: + +for each buffer element, b: + +\begin{enumerate} +\item Get the next free descriptor table entry, d. +\item Set \field{d.addr} to the physical address of the start of b. +\item Set \field{d.len} to the length of b. +\item If b is device-writable, set \field{d.flags} to VIRTQ_DESC_F_WRITE, + otherwise 0. +\item If there is a buffer element after this: + \begin{enumerate} + \item Set \field{d.next} to the index of the next free descriptor + element. + \item Set the VIRTQ_DESC_F_NEXT bit in \field{d.flags}. + \end{enumerate} +\end{enumerate} + +In practice, \field{d.next} is usually used to chain free +descriptors, and a separate count is kept to check there are enough +free descriptors before beginning the mappings. + +\subsubsection{Updating The Available Ring}\label{sec:General Initialization And Device Operation / Device Operation / Supplying Buffers to The Device / Updating The Available Ring} + +The descriptor chain head is the first d in the algorithm +above, i.e.
the index of the descriptor table entry referring to the first +part of the buffer. A naive driver implementation MAY do the following (with the +appropriate conversion to-and-from little-endian assumed): + +\begin{lstlisting} +avail->ring[avail->idx % qsz] = head; +\end{lstlisting} + +However, in general the driver MAY add many descriptor chains before it updates +\field{idx} (at which point they become visible to the +device), so it is common to keep a counter of how many the driver has added: + +\begin{lstlisting} +avail->ring[(avail->idx + added++) % qsz] = head; +\end{lstlisting} + +\subsubsection{Updating \field{idx}}\label{sec:General Initialization And Device Operation / Device Operation / Supplying Buffers to The Device / Updating idx} + +\field{idx} always increments, and wraps naturally at +65536: + +\begin{lstlisting} +avail->idx += added; +\end{lstlisting} + +Once available \field{idx} is updated by the driver, this exposes the +descriptor and its contents. The device MAY +access the descriptor chains the driver created and the +memory they refer to immediately. + +\drivernormative{\paragraph}{Updating idx}{General Initialization And Device Operation / Device Operation / Supplying Buffers to The Device / Updating idx} +The driver MUST perform a suitable memory barrier before the \field{idx} update, to ensure the +device sees the most up-to-date copy. + +\subsubsection{Notifying The Device}\label{sec:General Initialization And Device Operation / Device Operation / Supplying Buffers to The Device / Notifying The Device} + +The actual method of device notification is bus-specific, but generally +it can be expensive. So the device MAY suppress such notifications if it +doesn't need them, as detailed in section \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Notification Suppression}. + +The driver has to be careful to expose the new \field{idx} +value before checking if notifications are suppressed. 
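The ordering described above (fill the ring, barrier, update \field{idx}, barrier, then check for suppression) can be sketched as follows. This is only an illustration under stated assumptions: VIRTIO_F_EVENT_IDX is not negotiated, the host is little-endian so no byte swapping is shown, plain memory stands in for the shared rings, and C11 fences stand in for whatever barrier the platform requires. The names virtq_publish, struct avail_ring, struct used_hdr and QSZ are hypothetical and not part of this specification.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QSZ 8u /* hypothetical queue size; must be a power of 2 */

/* Simplified stand-ins for the available ring and the used ring header. */
struct avail_ring { uint16_t flags; uint16_t idx; uint16_t ring[QSZ]; };
struct used_hdr   { uint16_t flags; uint16_t idx; };

/*
 * Publish `count` descriptor chain heads and report whether the device
 * has to be notified (flags == 0 means notifications are not suppressed).
 */
static bool virtq_publish(struct avail_ring *avail, const struct used_hdr *used,
                          const uint16_t *heads, uint16_t count)
{
    /* Place each head into the next ring entries; idx is not yet updated. */
    for (uint16_t added = 0; added < count; added++)
        avail->ring[(uint16_t)(avail->idx + added) % QSZ] = heads[added];

    /* Barrier: ring entries must be visible before the idx update. */
    atomic_thread_fence(memory_order_seq_cst);

    /* idx always increments and wraps naturally at 65536. */
    avail->idx = (uint16_t)(avail->idx + count);

    /* Barrier: idx must be exposed before flags is read. */
    atomic_thread_fence(memory_order_seq_cst);

    return used->flags == 0;
}
```

A caller would invoke the bus-specific notification method only when the function returns true.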
+ +\drivernormative{\paragraph}{Notifying The Device}{General Initialization And Device Operation / Device Operation / Supplying Buffers to The Device / Notifying The Device} +The driver MUST perform a suitable memory barrier before reading \field{flags} or +\field{avail_event}, to avoid missing a notification. + +\subsection{Receiving Used Buffers From The Device}\label{sec:General Initialization And Device Operation / Device Operation / Receiving Used Buffers From The Device} + +Once the device has used buffers referred to by a descriptor (read from or written to them, or +parts of both, depending on the nature of the virtqueue and the +device), it interrupts the driver as detailed in section \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression}. + +\begin{note} +For optimal performance, a driver MAY disable interrupts while processing +the used ring, but beware the problem of missing interrupts between +emptying the ring and reenabling interrupts. This is usually handled by +re-checking for more used buffers after interrupts are re-enabled: + +\begin{lstlisting} +virtq_disable_interrupts(vq); + +for (;;) { + if (vq->last_seen_used == le16_to_cpu(virtq->used.idx)) { + /* No more used buffers: re-enable interrupts, then re-check. */ + virtq_enable_interrupts(vq); + mb(); + + if (vq->last_seen_used == le16_to_cpu(virtq->used.idx)) + break; + + virtq_disable_interrupts(vq); + } + + struct virtq_used_elem *e = &virtq->used.ring[vq->last_seen_used % qsz]; + process_buffer(e); + vq->last_seen_used++; +} +\end{lstlisting} +\end{note} + +\subsection{Notification of Device Configuration Changes}\label{sec:General Initialization And Device Operation / Device Operation / Notification of Device Configuration Changes} + +For devices where the device-specific configuration information can be changed, an +interrupt is delivered when a device-specific configuration change occurs.
+ +In addition, this interrupt is triggered by the device setting +DEVICE_NEEDS_RESET (see \ref{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}). + +\section{Device Cleanup}\label{sec:General Initialization And Device Operation / Device Cleanup} + +Once the driver has set the DRIVER_OK status bit, all the configured +virtqueues of the device are considered live. None of the virtqueues +of a device are live once the device has been reset. + +\drivernormative{\subsection}{Device Cleanup}{General Initialization And Device Operation / Device Cleanup} + +A driver MUST NOT alter descriptor table entries which have been +exposed in the available ring (and not marked consumed by the device +in the used ring) of a live virtqueue. + +A driver MUST NOT decrement the available \field{idx} on a live virtqueue (i.e. +there is no way to ``unexpose'' buffers). + +Thus a driver MUST ensure a virtqueue isn't live (by device reset) before removing exposed buffers. + +\chapter{Virtio Transport Options}\label{sec:Virtio Transport Options} + +Virtio can use various different buses, thus the standard is split +into virtio general and bus-specific sections. + +\section{Virtio Over PCI Bus}\label{sec:Virtio Transport Options / Virtio Over PCI Bus} + +Virtio devices are commonly implemented as PCI devices. + +A Virtio device can be implemented as any kind of PCI device: +a Conventional PCI device or a PCI Express +device. To ensure designs meet the latest +requirements, see +the PCI-SIG home page at \url{http://www.pcisig.com} for any +approved changes. + +\devicenormative{\subsection}{Virtio Over PCI Bus}{Virtio Transport Options / Virtio Over PCI Bus} +A Virtio device using Virtio Over PCI Bus MUST expose to +the guest an interface that meets the specification requirements of +the appropriate PCI specification: \hyperref[intro:PCI]{[PCI]} +and \hyperref[intro:PCIe]{[PCIe]} +respectively.
+ +\subsection{PCI Device Discovery}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Discovery} + +Any PCI device with PCI Vendor ID 0x1AF4, and PCI Device ID 0x1000 through +0x107F inclusive is a virtio device. The actual value within this range +indicates which virtio device is supported by the device. +The PCI Device ID is calculated by adding 0x1040 to the Virtio Device ID, +as indicated in section \ref{sec:Device Types}. +Additionally, devices MAY utilize a Transitional PCI Device ID range, +0x1000 to 0x103F depending on the device type. + +\devicenormative{\subsubsection}{PCI Device Discovery}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Discovery} + +Devices MUST have the PCI Vendor ID 0x1AF4. +Devices MUST either have the PCI Device ID calculated by adding 0x1040 +to the Virtio Device ID, as indicated in section \ref{sec:Device +Types} or have the Transitional PCI Device ID depending on the device type, +as follows: + +\begin{tabular}{|l|c|} +\hline +Transitional PCI Device ID & Virtio Device \\ +\hline \hline +0x1000 & network card \\ +\hline +0x1001 & block device \\ +\hline +0x1002 & memory ballooning (traditional) \\ +\hline +0x1003 & console \\ +\hline +0x1004 & SCSI host \\ +\hline +0x1005 & entropy source \\ +\hline +0x1009 & 9P transport \\ +\hline +\end{tabular} + +For example, the network card device with the Virtio Device ID 1 +has the PCI Device ID 0x1041 or the Transitional PCI Device ID 0x1000. + +The PCI Subsystem Vendor ID and the PCI Subsystem Device ID MAY reflect +the PCI Vendor and Device ID of the environment (for informational purposes by the driver). + +Non-transitional devices SHOULD have a PCI Device ID in the range +0x1040 to 0x107f. +Non-transitional devices SHOULD have a PCI Revision ID of 1 or higher. +Non-transitional devices SHOULD have a PCI Subsystem Device ID of 0x40 or higher. + +This is to reduce the chance of a legacy driver attempting +to drive the device. 
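The ID ranges above can be illustrated with a few helper functions. This is a minimal sketch: the function and macro names are hypothetical, and only the numeric values (0x1AF4, 0x1000, 0x103F, 0x1040, 0x107F) come from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_PCI_VENDOR_ID 0x1AF4u /* hypothetical macro name */

/* PCI Device ID of a non-transitional device: 0x1040 + Virtio Device ID. */
static uint16_t virtio_pci_modern_id(uint16_t virtio_device_id)
{
    return (uint16_t)(0x1040u + virtio_device_id);
}

/* Any device ID in 0x1000..0x107F with vendor 0x1AF4 is a virtio device. */
static bool virtio_pci_is_virtio(uint16_t vendor, uint16_t device)
{
    return vendor == VIRTIO_PCI_VENDOR_ID &&
           device >= 0x1000u && device <= 0x107Fu;
}

/* Transitional PCI Device IDs occupy 0x1000..0x103F. */
static bool virtio_pci_is_transitional_id(uint16_t device)
{
    return device >= 0x1000u && device <= 0x103Fu;
}
```

For the network card (Virtio Device ID 1), the first helper yields 0x1041, while 0x1000 is classified as a Transitional PCI Device ID.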
+ +\drivernormative{\subsubsection}{PCI Device Discovery}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Discovery} +Drivers MUST match devices with the PCI Vendor ID 0x1AF4 and +the PCI Device ID in the range 0x1040 to 0x107f, +calculated by adding 0x1040 to the Virtio Device ID, +as indicated in section \ref{sec:Device Types}. +Drivers for device types listed in section \ref{sec:Virtio +Transport Options / Virtio Over PCI Bus / PCI Device Discovery} +MUST match devices with the PCI Vendor ID 0x1AF4 and +the Transitional PCI Device ID indicated in section + \ref{sec:Virtio +Transport Options / Virtio Over PCI Bus / PCI Device Discovery}. + +Drivers MUST match any PCI Revision ID value. +Drivers MAY match any PCI Subsystem Vendor ID and any +PCI Subsystem Device ID value. + +\subsubsection{Legacy Interfaces: A Note on PCI Device Discovery}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Discovery / Legacy Interfaces: A Note on PCI Device Discovery} +Transitional devices MUST have a PCI Revision ID of 0. +Transitional devices MUST have the PCI Subsystem Device ID +matching the Virtio Device ID, as indicated in section \ref{sec:Device Types}. +Transitional devices MUST have the Transitional PCI Device ID in +the range 0x1000 to 0x103f. + +This is to match legacy drivers. + +\subsection{PCI Device Layout}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout} + +The device is configured via I/O and/or memory regions (though see +\ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / PCI configuration access capability} +for access via the PCI configuration space), as specified by Virtio +Structure PCI Capabilities. + +Fields of different sizes are present in the device +configuration regions. +All 64-bit, 32-bit and 16-bit fields are little-endian. +64-bit fields are to be treated as two 32-bit fields, +with low 32 bit part followed by the high 32 bit part. 
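The layout rule for 64-bit fields can be sketched as below. The sketch models only the byte layout (low 32-bit part first, each part little-endian) on an ordinary byte buffer; it does not model the access widths required of a real driver, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Load a 32-bit value stored in little-endian byte order. */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* A 64-bit field is two 32-bit fields: low part, then high part. */
static void cfg_write64(uint8_t *field, uint64_t v)
{
    put_le32(field, (uint32_t)v);             /* low 32 bits  */
    put_le32(field + 4, (uint32_t)(v >> 32)); /* high 32 bits */
}

static uint64_t cfg_read64(const uint8_t *field)
{
    return (uint64_t)get_le32(field) | ((uint64_t)get_le32(field + 4) << 32);
}
```

Because the two halves are independent 32-bit fields, either half can be read or written on its own.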
+ +\drivernormative{\subsubsection}{PCI Device Layout}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout} + +For device configuration access, the driver MUST use 8-bit wide +accesses for 8-bit wide fields, 16-bit wide and aligned accesses +for 16-bit wide fields and 32-bit wide and aligned accesses for +32-bit and 64-bit wide fields. For 64-bit fields, the driver MAY +access each of the high and low 32-bit parts of the field +independently. + +\devicenormative{\subsubsection}{PCI Device Layout}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout} + +For 64-bit device configuration fields, the device MUST allow driver +independent access to high and low 32-bit parts of the field. + +\subsection{Virtio Structure PCI Capabilities}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / Virtio Structure PCI Capabilities} + +The virtio device configuration layout includes several structures: +\begin{itemize} +\item Common configuration +\item Notifications +\item ISR Status +\item Device-specific configuration (optional) +\item PCI configuration access +\end{itemize} + +Each structure can be mapped by a Base Address register (BAR) belonging to +the function, or accessed via the special VIRTIO_PCI_CAP_PCI_CFG field in the PCI configuration space. + +The location of each structure is specified using a vendor-specific PCI capability located +on the capability list in PCI configuration space of the device. +This virtio structure capability uses little-endian format; all fields are +read-only for the driver unless stated otherwise: + +\begin{lstlisting} +struct virtio_pci_cap { + u8 cap_vndr; /* Generic PCI field: PCI_CAP_ID_VNDR */ + u8 cap_next; /* Generic PCI field: next ptr. */ + u8 cap_len; /* Generic PCI field: capability length */ + u8 cfg_type; /* Identifies the structure. */ + u8 bar; /* Where to find it. */ + u8 padding[3]; /* Pad to full dword. */ + le32 offset; /* Offset within bar. 
*/ + le32 length; /* Length of the structure, in bytes. */ +}; +\end{lstlisting} + +This structure can be followed by extra data, depending on +\field{cfg_type}, as documented below. + +The fields are interpreted as follows: + +\begin{description} +\item[\field{cap_vndr}] + 0x09; Identifies a vendor-specific capability. + +\item[\field{cap_next}] + Link to next capability in the capability list in the PCI configuration space. + +\item[\field{cap_len}] + Length of this capability structure, including the whole of + struct virtio_pci_cap, and extra data if any. + This length MAY include padding, or fields unused by the driver. + +\item[\field{cfg_type}] + identifies the structure, according to the following table: + +\begin{lstlisting} +/* Common configuration */ +#define VIRTIO_PCI_CAP_COMMON_CFG 1 +/* Notifications */ +#define VIRTIO_PCI_CAP_NOTIFY_CFG 2 +/* ISR Status */ +#define VIRTIO_PCI_CAP_ISR_CFG 3 +/* Device specific configuration */ +#define VIRTIO_PCI_CAP_DEVICE_CFG 4 +/* PCI configuration access */ +#define VIRTIO_PCI_CAP_PCI_CFG 5 +\end{lstlisting} + + Any other value is reserved for future use. + + Each structure is detailed individually below. + + The device MAY offer more than one structure of any type - this makes it + possible for the device to expose multiple interfaces to drivers. The order of + the capabilities in the capability list specifies the order of preference + suggested by the device. + \begin{note} + For example, on some hypervisors, notifications using IO accesses are + faster than memory accesses. In this case, the device would expose two + capabilities with \field{cfg_type} set to VIRTIO_PCI_CAP_NOTIFY_CFG: + the first one addressing an I/O BAR, the second one addressing a memory BAR. + In this example, the driver would use the I/O BAR if I/O resources are available, and fall back on + memory BAR when I/O resources are unavailable. 
+ \end{note} + +\item[\field{bar}] + values 0x0 to 0x5 specify a Base Address register (BAR) belonging to + the function located beginning at 10h in PCI Configuration Space + and used to map the structure into Memory or I/O Space. + The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space + or I/O Space. + + Any other value is reserved for future use. + +\item[\field{offset}] + indicates where the structure begins relative to the base address associated + with the BAR. The alignment requirements of \field{offset} are indicated + in each structure-specific section below. + +\item[\field{length}] + indicates the length of the structure. + + \field{length} MAY include padding, or fields unused by the driver, or + future extensions. + + \begin{note} + For example, a future device might present a large structure size of several + MBytes. + As current devices never utilize structures larger than 4KBytes in size, + driver MAY limit the mapped structure size to e.g. + 4KBytes (thus ignoring parts of structure after the first + 4KBytes) to allow forward compatibility with such devices without loss of + functionality and without wasting resources. + \end{note} +\end{description} + +\drivernormative{\subsubsection}{Virtio Structure PCI Capabilities}{Virtio Transport Options / Virtio Over PCI Bus / Virtio Structure PCI Capabilities} + +The driver MUST ignore any vendor-specific capability structure which has +a reserved \field{cfg_type} value. + +The driver SHOULD use the first instance of each virtio structure type they can +support. + +The driver MUST accept a \field{cap_len} value which is larger than specified here. + +The driver MUST ignore any vendor-specific capability structure which has +a reserved \field{bar} value. + + The drivers SHOULD only map part of configuration structure + large enough for device operation. 
The drivers MUST handle + an unexpectedly large \field{length}, but MAY check that \field{length} + is large enough for device operation. + +The driver MUST NOT write into any field of the capability structure, +with the exception of those with \field{cfg_type} VIRTIO_PCI_CAP_PCI_CFG as +detailed in \ref{drivernormative:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / PCI configuration access capability}. + +\devicenormative{\subsubsection}{Virtio Structure PCI Capabilities}{Virtio Transport Options / Virtio Over PCI Bus / Virtio Structure PCI Capabilities} + +The device MUST include any extra data (from the beginning of the \field{cap_vndr} field +through the end of the extra data fields if any) in \field{cap_len}. +The device MAY append extra data +or padding to any structure beyond that. + +If the device presents multiple structures of the same type, it SHOULD order +them from optimal (first) to least-optimal (last). + +\subsubsection{Common configuration structure layout}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common configuration structure layout} + +The common configuration structure is found at the \field{bar} and \field{offset} within the VIRTIO_PCI_CAP_COMMON_CFG capability; its layout is below. + +\begin{lstlisting} +struct virtio_pci_common_cfg { + /* About the whole device. */ + le32 device_feature_select; /* read-write */ + le32 device_feature; /* read-only for driver */ + le32 driver_feature_select; /* read-write */ + le32 driver_feature; /* read-write */ + le16 msix_config; /* read-write */ + le16 num_queues; /* read-only for driver */ + u8 device_status; /* read-write */ + u8 config_generation; /* read-only for driver */ + + /* About a specific virtqueue. */ + le16 queue_select; /* read-write */ + le16 queue_size; /* read-write, power of 2, or 0.
*/ + le16 queue_msix_vector; /* read-write */ + le16 queue_enable; /* read-write */ + le16 queue_notify_off; /* read-only for driver */ + le64 queue_desc; /* read-write */ + le64 queue_avail; /* read-write */ + le64 queue_used; /* read-write */ +}; +\end{lstlisting} + +\begin{description} +\item[\field{device_feature_select}] + The driver uses this to select which feature bits \field{device_feature} shows. + Value 0x0 selects Feature Bits 0 to 31, 0x1 selects Feature Bits 32 to 63, etc. + +\item[\field{device_feature}] + The device uses this to report which feature bits it is + offering to the driver: the driver writes to + \field{device_feature_select} to select which feature bits are presented. + +\item[\field{driver_feature_select}] + The driver uses this to select which feature bits \field{driver_feature} shows. + Value 0x0 selects Feature Bits 0 to 31, 0x1 selects Feature Bits 32 to 63, etc. + +\item[\field{driver_feature}] + The driver writes this to accept feature bits offered by the device. + The Driver Feature Bits written are those selected by \field{driver_feature_select}. + +\item[\field{msix_config}] + The driver sets the Configuration Vector for MSI-X. + +\item[\field{num_queues}] + The device specifies the maximum number of virtqueues supported here. + +\item[\field{device_status}] + The driver writes the device status here (see \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}). Writing 0 into this + field resets the device. + +\item[\field{config_generation}] + Configuration atomicity value. The device changes this every time the + configuration noticeably changes. + +\item[\field{queue_select}] + Queue Select. The driver selects which virtqueue the following + fields refer to. + +\item[\field{queue_size}] + Queue Size. On reset, specifies the maximum queue size supported by + the hypervisor. This can be modified by the driver to reduce memory requirements. + A 0 means the queue is unavailable.
+ +\item[\field{queue_msix_vector}] + The driver uses this to specify the queue vector for MSI-X. + +\item[\field{queue_enable}] + The driver uses this to selectively prevent the device from executing requests from this virtqueue. + 1 - enabled; 0 - disabled. + +\item[\field{queue_notify_off}] + The driver reads this to calculate the offset from start of Notification structure at + which this virtqueue is located. + \begin{note} this is \em{not} an offset in bytes. + See \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Notification capability} below. + \end{note} + +\item[\field{queue_desc}] + The driver writes the physical address of Descriptor Table here. See section \ref{sec:Basic Facilities of a Virtio Device / Virtqueues}. + +\item[\field{queue_avail}] + The driver writes the physical address of Available Ring here. See section \ref{sec:Basic Facilities of a Virtio Device / Virtqueues}. + +\item[\field{queue_used}] + The driver writes the physical address of Used Ring here. See section \ref{sec:Basic Facilities of a Virtio Device / Virtqueues}. +\end{description} + +\devicenormative{\paragraph}{Common configuration structure layout}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common configuration structure layout} +\field{offset} MUST be 4-byte aligned. + +The device MUST present at least one common configuration capability. + +The device MUST present the feature bits it is offering in \field{device_feature}, starting at bit \field{device_feature_select} $*$ 32 for any \field{device_feature_select} written by the driver. +\begin{note} + This means that it will present 0 for any \field{device_feature_select} other than 0 or 1, since no feature defined here exceeds 63. +\end{note} + +The device MUST present any valid feature bits the driver has written in \field{driver_feature}, starting at bit \field{driver_feature_select} $*$ 32 for any \field{driver_feature_select} written by the driver. 
Valid feature bits are those which are a subset of the corresponding \field{device_feature} bits. The device MAY present invalid bits written by the driver. + +\begin{note} + This means that a device can ignore writes for feature bits it never + offers, and simply present 0 on reads. Or it can just mirror what the driver wrote + (but it will still have to check them when the driver sets FEATURES_OK). +\end{note} + +\begin{note} + A driver shouldn't write invalid bits anyway, as per \ref{drivernormative:General Initialization And Device Operation / Device Initialization}, but this attempts to handle it. +\end{note} + +The device MUST present a changed \field{config_generation} after the +driver has read a device-specific configuration value which has +changed since any part of the device-specific configuration was last +read. +\begin{note} +As \field{config_generation} is an 8-bit value, simply incrementing it +on every configuration change could violate this requirement due to wrap. +It is better to set an internal flag when the configuration has changed, +and if that flag is set when the driver reads from the device-specific +configuration, increment \field{config_generation} and clear the flag. +\end{note} + +The device MUST reset when 0 is written to \field{device_status}, and +present a 0 in \field{device_status} once that is done. + +The device MUST present a 0 in \field{queue_enable} on reset. + +The device MUST present a 0 in \field{queue_size} if the virtqueue +corresponding to the current \field{queue_select} is unavailable. + +\drivernormative{\paragraph}{Common configuration structure layout}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common configuration structure layout} + +The driver MUST NOT write to \field{device_feature}, \field{num_queues}, \field{config_generation} or \field{queue_notify_off}. + +The driver MUST NOT write a value which is not a power of 2 to \field{queue_size}.
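The flag-based handling of \field{config_generation} suggested in the note above can be sketched on the device side as follows. This is an illustration only: struct cfg_gen_state and both function names are hypothetical, and a real device would also have to handle concurrency between configuration changes and driver reads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device-side state for config_generation handling. */
struct cfg_gen_state {
    uint8_t config_generation; /* value presented to the driver */
    bool    dirty;             /* config changed since the driver last read it */
};

/* Called whenever the device-specific configuration changes. */
static void cfg_changed(struct cfg_gen_state *s)
{
    s->dirty = true; /* do not increment here: 256 changes would wrap */
}

/* Called when the driver reads from the device-specific configuration. */
static void cfg_on_driver_read(struct cfg_gen_state *s)
{
    if (s->dirty) {
        s->config_generation++;
        s->dirty = false;
    }
}
```

With this scheme any number of changes between two driver reads moves the counter by exactly one, so an 8-bit wrap cannot make a changed configuration look unchanged.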
+ +The driver MUST configure the other virtqueue fields before enabling the virtqueue +with \field{queue_enable}. + +After writing 0 to \field{device_status}, the driver MUST wait for a read of +\field{device_status} to return 0 before reinitializing the device. + +The driver MUST NOT write a 0 to \field{queue_enable}. + +\subsubsection{Notification structure layout}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Notification capability} + +The notification location is found using the VIRTIO_PCI_CAP_NOTIFY_CFG +capability. This capability is immediately followed by an additional +field, like so: + +\begin{lstlisting} +struct virtio_pci_notify_cap { + struct virtio_pci_cap cap; + le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */ +}; +\end{lstlisting} + +\field{notify_off_multiplier} is combined with the \field{queue_notify_off} to +derive the Queue Notify address within a BAR for a virtqueue: + +\begin{lstlisting} + cap.offset + queue_notify_off * notify_off_multiplier +\end{lstlisting} + +The \field{cap.offset} and \field{notify_off_multiplier} are taken from the +notification capability structure above, and the \field{queue_notify_off} is +taken from the common configuration structure. + +\begin{note} +For example, if \field{notify_off_multiplier} is 0, the device uses +the same Queue Notify address for all queues. +\end{note} + +\devicenormative{\paragraph}{Notification capability}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Notification capability} +The device MUST present at least one notification capability. + +The \field{cap.offset} MUST be 2-byte aligned. + +The device MUST either present \field{notify_off_multiplier} as an even power of 2, +or present \field{notify_off_multiplier} as 0. + +The value \field{cap.length} presented by the device MUST be at least 2 +and MUST be large enough to support queue notification offsets +for all supported queues in all possible configurations.
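As a worked illustration of the Queue Notify address formula given earlier in this section, the following sketch computes the offset within the BAR. The function name is hypothetical; the arithmetic is widened to 64 bits so that the multiplication cannot overflow.

```c
#include <stdint.h>

/*
 * Offset of a virtqueue's Queue Notify address within the BAR:
 *   cap.offset + queue_notify_off * notify_off_multiplier
 */
static uint64_t queue_notify_offset(uint32_t cap_offset,
                                    uint16_t queue_notify_off,
                                    uint32_t notify_off_multiplier)
{
    return (uint64_t)cap_offset +
           (uint64_t)queue_notify_off * notify_off_multiplier;
}
```

For instance, with cap.offset 0x3000 and a multiplier of 4, a queue whose \field{queue_notify_off} is 2 is notified at offset 0x3008; with a multiplier of 0 every queue resolves to 0x3000.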
+ +For all queues, the value \field{cap.length} presented by the device MUST satisfy: +\begin{lstlisting} +cap.length >= queue_notify_off * notify_off_multiplier + 2 +\end{lstlisting} + +\subsubsection{ISR status capability}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / ISR status capability} + +The VIRTIO_PCI_CAP_ISR_CFG capability +refers to at least a single byte, which contains the 8-bit ISR status field +to be used for INT\#x interrupt handling. + +The \field{offset} for the \field{ISR status} has no alignment requirements. + +The ISR bits allow the driver to distinguish between device-specific configuration +change interrupts and normal virtqueue interrupts: + +\begin{tabular}{ |l||l|l|l| } +\hline +Bits & 0 & 1 & 2 to 31 \\ +\hline +Purpose & Queue Interrupt & Device Configuration Interrupt & Reserved \\ +\hline +\end{tabular} + +To avoid an extra access, simply reading this register resets it to 0 and +causes the device to de-assert the interrupt. + +In this way, a driver read of ISR status causes the device to de-assert +an interrupt. + +See sections \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Virtqueue Interrupts From The Device} and \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Notification of Device Configuration Changes} for how this is used. + +\devicenormative{\paragraph}{ISR status capability}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / ISR status capability} + +The device MUST present at least one VIRTIO_PCI_CAP_ISR_CFG capability. + +The device MUST set the Device Configuration Interrupt bit +in \field{ISR status} before sending a device configuration +change notification to the driver. + +If MSI-X capability is disabled, the device MUST set the Queue +Interrupt bit in \field{ISR status} before sending a virtqueue +notification to the driver.
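The read-to-clear behaviour of \field{ISR status} can be pictured with a small device-side model. This is a sketch under stated assumptions: the macro, structure and function names are hypothetical, and only the bit positions (bit 0 for Queue Interrupt, bit 1 for Device Configuration Interrupt) come from the table above.

```c
#include <stdint.h>

#define ISR_QUEUE_INTERRUPT  0x1u /* bit 0: Queue Interrupt */
#define ISR_CONFIG_INTERRUPT 0x2u /* bit 1: Device Configuration Interrupt */

/* Hypothetical model of the 8-bit ISR status field. */
struct isr_model {
    uint8_t isr_status;
};

/* Device side: record the cause before signalling the interrupt. */
static void isr_raise(struct isr_model *d, uint8_t bits)
{
    d->isr_status |= bits;
}

/* Driver read: returns the pending causes and resets the field to 0. */
static uint8_t isr_read(struct isr_model *d)
{
    uint8_t pending = d->isr_status;
    d->isr_status = 0;
    return pending;
}
```

A driver would call the equivalent of \texttt{isr\_read} once per interrupt and act on every bit set in the returned value, since a second read returns 0.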
+ +If MSI-X capability is disabled, the device MUST set the Interrupt Status +bit in the PCI Status register in the PCI Configuration Header of +the device to the logical OR of all bits in \field{ISR status} of +the device. The device then asserts/deasserts INT\#x interrupts unless masked +according to standard PCI rules \hyperref[intro:PCI]{[PCI]}. + +The device MUST reset \field{ISR status} to 0 on driver read. + +\drivernormative{\paragraph}{ISR status capability}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / ISR status capability} + +If MSI-X capability is enabled, the driver SHOULD NOT access +\field{ISR status} upon detecting a Queue Interrupt. + +\subsubsection{Device-specific configuration}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Device-specific configuration} + +The device MUST present at least one VIRTIO_PCI_CAP_DEVICE_CFG capability for +any device type which has a device-specific configuration. + +\devicenormative{\paragraph}{Device-specific configuration}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Device-specific configuration} + +The \field{offset} for the device-specific configuration MUST be 4-byte aligned. + +\subsubsection{PCI configuration access capability}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / PCI configuration access capability} + +The VIRTIO_PCI_CAP_PCI_CFG capability +creates an alternative (and likely suboptimal) access method to the +common configuration, notification, ISR and device-specific configuration regions. + +The capability is immediately followed by an additional field like so: + +\begin{lstlisting} +struct virtio_pci_cfg_cap { + struct virtio_pci_cap cap; + u8 pci_cfg_data[4]; /* Data for BAR access. */ +}; +\end{lstlisting} + +The fields \field{cap.bar}, \field{cap.length}, \field{cap.offset} and +\field{pci_cfg_data} are read-write (RW) for the driver. 
+ +To access a device region, the driver writes into the capability +structure (ie. within the PCI configuration space) as follows: + +\begin{itemize} +\item The driver sets the BAR to access by writing to \field{cap.bar}. + +\item The driver sets the size of the access by writing 1, 2 or 4 to + \field{cap.length}. + +\item The driver sets the offset within the BAR by writing to + \field{cap.offset}. +\end{itemize} + +At that point, \field{pci_cfg_data} will provide a window of size +\field{cap.length} into the given \field{cap.bar} at offset \field{cap.offset}. + +\devicenormative{\paragraph}{PCI configuration access capability}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / PCI configuration access capability} + +The device MUST present at least one VIRTIO_PCI_CAP_PCI_CFG capability. + +Upon detecting driver write access +to \field{pci_cfg_data}, the device MUST execute a write access +at offset \field{cap.offset} at BAR selected by \field{cap.bar} using the first \field{cap.length} +bytes from \field{pci_cfg_data}. + +Upon detecting driver read access +to \field{pci_cfg_data}, the device MUST +execute a read access of length cap.length at offset \field{cap.offset} +at BAR selected by \field{cap.bar} and store the first \field{cap.length} bytes in +\field{pci_cfg_data}. + +\drivernormative{\paragraph}{PCI configuration access capability}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / PCI configuration access capability} + +The driver MUST NOT write a \field{cap.offset} which is not +a multiple of \field{cap.length} (ie. all accesses MUST be aligned). + +The driver MUST NOT read or write \field{pci_cfg_data} +unless \field{cap.bar}, \field{cap.length} and \field{cap.offset} +address \field{cap.length} bytes within a BAR range +specified by some other Virtio Structure PCI Capability +of type other than \field{VIRTIO_PCI_CAP_PCI_CFG}. 
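The driver-side constraints on the \field{pci_cfg_data} window can be checked before each access. The sketch below assumes a single target region in the selected BAR, described by a hypothetical offset and length pair taken from some other Virtio Structure PCI Capability. The structure and function names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the values the driver intends to write
 * to cap.bar, cap.offset and cap.length. */
struct pci_cfg_window {
    uint8_t  bar;
    uint32_t offset;
    uint32_t length;
};

/* True if the access is 1, 2 or 4 bytes wide, aligned to its own size,
 * and falls entirely inside [region_offset, region_offset + region_length). */
static bool pci_cfg_access_ok(const struct pci_cfg_window *w,
                              uint32_t region_offset,
                              uint32_t region_length)
{
    if (w->length != 1 && w->length != 2 && w->length != 4)
        return false;
    if (w->offset % w->length != 0)
        return false;
    if (w->offset < region_offset)
        return false;
    return (uint64_t)w->offset + w->length <=
           (uint64_t)region_offset + region_length;
}
```

A real driver would additionally confirm that \field{cap.bar} matches the BAR of the region being targeted. That comparison is omitted here to keep the example short.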
+ +\subsubsection{Legacy Interfaces: A Note on PCI Device Layout}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Legacy Interfaces: A Note on PCI Device Layout} + +Transitional devices MUST present part of configuration +registers in a legacy configuration structure in BAR0 in the first I/O +region of the PCI device, as documented below. +When using the legacy interface, transitional drivers +MUST use the legacy configuration structure in BAR0 in the first +I/O region of the PCI device, as documented below. + +When using the legacy interface the driver MAY access +the device-specific configuration region using any width accesses, and +a transitional device MUST present driver with the same results as +when accessed using the ``natural'' access method (i.e. +32-bit accesses for 32-bit fields, etc). + +Note that this is possible because while the virtio common configuration structure is PCI +(i.e. little) endian, when using the legacy interface the device-specific +configuration region is encoded in the native endian of the guest (where such distinction is +applicable). 
+ +When used through the legacy interface, the virtio common configuration structure looks as follows: + +\begin{tabularx}{\textwidth}{ |X||X|X|X|X|X|X|X|X| } +\hline + Bits & 32 & 32 & 32 & 16 & 16 & 16 & 8 & 8 \\ +\hline + Read / Write & R & R+W & R+W & R & R+W & R+W & R+W & R \\ +\hline + Purpose & Device Features bits 0:31 & Driver Features bits 0:31 & + Queue Address & \field{queue_size} & \field{queue_select} & Queue Notify & + Device Status & ISR \newline Status \\ +\hline +\end{tabularx} + +If MSI-X is enabled for the device, two additional fields +immediately follow this header: + +\begin{tabular}{ |l||l|l| } +\hline +Bits & 16 & 16 \\ +\hline +Read/Write & R+W & R+W \\ +\hline +Purpose (MSI-X) & \field{config_msix_vector} & \field{queue_msix_vector} \\ +\hline +\end{tabular} + +Note: When MSI-X capability is enabled, device-specific configuration starts at +byte offset 24 in the virtio common configuration structure. When MSI-X capability is not +enabled, device-specific configuration starts at byte offset 20 in the virtio +header. That is, once you enable MSI-X on the device, the other fields move. +If you turn it off again, they move back! + +Any device-specific configuration space immediately follows +these general headers: + +\begin{tabular}{|l||l|l|} +\hline +Bits & Device Specific & \multirow{3}{*}{\ldots} \\ +\cline{1-2} +Read / Write & Device Specific & \\ +\cline{1-2} +Purpose & Device Specific & \\ +\hline +\end{tabular} + +When accessing the device-specific configuration space +using the legacy interface, transitional +drivers MUST access the device-specific configuration space +at an offset immediately following the general headers. + +When using the legacy interface, transitional +devices MUST present the device-specific configuration space +if any at an offset immediately following the general headers. + +Note that only Feature Bits 0 to 31 are accessible through the +Legacy Interface.
When used through the Legacy Interface, +Transitional Devices MUST assume that Feature Bits 32 to 63 +are not acknowledged by Driver. + +As legacy devices had no \field{config_generation} field, +see \ref{sec:Basic Facilities of a Virtio Device / Device +Configuration Space / Legacy Interface: Device Configuration +Space}~\nameref{sec:Basic Facilities of a Virtio Device / Device Configuration Space / Legacy Interface: Device Configuration Space} for workarounds. + +\subsubsection{Non-transitional Device With Legacy Driver: A Note +on PCI Device Layout}\label{sec:Virtio Transport Options / Virtio +Over PCI Bus / PCI Device Layout / Non-transitional Device With +Legacy Driver: A Note on PCI Device Layout} + +All known legacy drivers check either the PCI Revision or the +Device and Vendor IDs, and thus won't attempt to drive a +non-transitional device. + +A buggy legacy driver might mistakenly attempt to drive a +non-transitional device. If support for such drivers is required +(as opposed to fixing the bug), the following would be the +recommended way to detect and handle them. +\begin{note} +Such buggy drivers are not currently known to be used in +production. 
+\end{note} + +\subparagraph{ +\DIFdeltextcstwo{Driver Requirements: Non-transitional Device With Legacy Driver} +\DIFaddtextcstwo{Device Requirements: Non-transitional Device With Legacy Driver} +} +\label{drivernormative:Virtio Transport Options / Virtio Over PCI +Bus / PCI-specific Initialization And Device Operation / +Device Initialization / Non-transitional Device With Legacy +Driver} +\label{devicenormative:Virtio Transport Options / Virtio Over PCI +Bus / PCI-specific Initialization And Device Operation / +Device Initialization / Non-transitional Device With Legacy +Driver} + +Non-transitional devices, on a platform where a legacy driver for +a legacy device with the same ID (including PCI Revision, Device +and Vendor IDs) is known to have previously existed, +SHOULD take the following steps to cause the legacy driver to +fail gracefully when it attempts to drive them: + +\begin{enumerate} +\item Present an I/O BAR in BAR0, and +\item Respond to a single-byte zero write to offset 18 + (corresponding to Device Status register in the legacy layout) + of BAR0 by presenting zeroes on every BAR and ignoring writes. +\end{enumerate} + +\subsection{PCI-specific Initialization And Device Operation}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation} + +\subsubsection{Device Initialization}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Device Initialization} + +This documents PCI-specific steps executed during Device Initialization. 
+ +\paragraph{Virtio Device Configuration Layout Detection}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Device Initialization / Virtio Device Configuration Layout Detection} + +As a prerequisite to device initialization, the driver scans the +PCI capability list, detecting virtio configuration layout using Virtio +Structure PCI capabilities as detailed in \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / Virtio Structure PCI Capabilities} + +\subparagraph{Legacy Interface: A Note on Device Layout Detection}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Device Initialization / Virtio Device Configuration Layout Detection / Legacy Interface: A Note on Device Layout Detection} + +Legacy drivers skipped the Device Layout Detection step, assuming legacy +device configuration space in BAR0 in I/O space unconditionally. + +Legacy devices did not have the Virtio PCI Capability in their +capability list. + +Therefore: + +Transitional devices MUST expose the Legacy Interface in I/O +space in BAR0. + +Transitional drivers MUST look for the Virtio PCI +Capabilities on the capability list. +If these are not present, driver MUST assume a legacy device, +and use it through the legacy interface. + +Non-transitional drivers MUST look for the Virtio PCI +Capabilities on the capability list. +If these are not present, driver MUST assume a legacy device, +and fail gracefully. + +\paragraph{MSI-X Vector Configuration}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Device Initialization / MSI-X Vector Configuration} + +When MSI-X capability is present and enabled in the device +(through standard PCI configuration space) \field{config_msix_vector} and \field{queue_msix_vector} are used to map configuration change and queue +interrupts to MSI-X vectors. 
In this case, the ISR Status is unused. + +Writing a valid MSI-X Table entry number, 0 to 0x7FF, to +\field{config_msix_vector}/\field{queue_msix_vector} maps interrupts triggered +by the configuration change/selected queue events respectively to +the corresponding MSI-X vector. To disable interrupts for an +event type, the driver unmaps this event by writing a special NO_VECTOR +value: + +\begin{lstlisting} +/* Vector value used to disable MSI for queue */ +#define VIRTIO_MSI_NO_VECTOR 0xffff +\end{lstlisting} + +Note that mapping an event to a vector might require the device to +allocate internal device resources, and thus could fail. + +\devicenormative{\subparagraph}{MSI-X Vector Configuration}{Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Device Initialization / MSI-X Vector Configuration} + +A device that has an MSI-X capability SHOULD support at least 2 +and at most 0x800 MSI-X vectors. +Device MUST report the number of vectors supported in +\field{Table Size} in the MSI-X Capability as specified in +\hyperref[intro:PCI]{[PCI]}. +The device SHOULD restrict the reported MSI-X Table Size field +to a value that might benefit system performance. +\begin{note} +For example, a device which does not expect to send +interrupts at a high rate might only specify 2 MSI-X vectors. +\end{note} +Device MUST support mapping any event type to any valid +vector 0 to MSI-X \field{Table Size}. +Device MUST support unmapping any event type. + +The device MUST return the vector mapped to a given event +(NO_VECTOR if unmapped) on read of \field{config_msix_vector}/\field{queue_msix_vector}. +The device MUST have all queue and configuration change +events unmapped upon reset. + +Devices SHOULD NOT cause mapping an event to a vector to fail +unless it is impossible for the device to satisfy the mapping +request.
Devices MUST report mapping +failures by returning the NO_VECTOR value when the relevant +\field{config_msix_vector}/\field{queue_msix_vector} field is read. + +\drivernormative{\subparagraph}{MSI-X Vector Configuration}{Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Device Initialization / MSI-X Vector Configuration} + +Driver MUST support device with any MSI-X Table Size 0 to 0x7FF. +Driver MAY fall back on using INT\#x interrupts for a device +which only supports one MSI-X vector (MSI-X Table Size = 0). + +Driver MAY interpret the Table Size as a hint from the device +for the suggested number of MSI-X vectors to use. + +Driver MUST NOT attempt to map an event to a vector +outside the MSI-X Table supported by the device, +as reported by \field{Table Size} in the MSI-X Capability. + +After mapping an event to vector, the +driver MUST verify success by reading the Vector field value: on +success, the previously written value is returned, and on +failure, NO_VECTOR is returned. If a mapping failure is detected, +the driver MAY retry mapping with fewer vectors, disable MSI-X +or report device failure. + +\paragraph{Virtqueue Configuration}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Device Initialization / Virtqueue Configuration} + +As a device can have zero or more virtqueues for bulk data +transport\footnote{For example, the simplest network device has two virtqueues.}, the driver +needs to configure them as part of the device-specific +configuration. + +The driver typically does this as follows, for each virtqueue a device has: + +\begin{enumerate} +\item Write the virtqueue index (first queue is 0) to \field{queue_select}. + +\item Read the virtqueue size from \field{queue_size}. This controls how big the virtqueue is + (see \ref{sec:Basic Facilities of a Virtio Device / Virtqueues}~\nameref{sec:Basic Facilities of a Virtio Device / Virtqueues}).
If this field is 0, the virtqueue does not exist. + +\item Optionally, select a smaller virtqueue size and write it to \field{queue_size}. + +\item Allocate and zero Descriptor Table, Available and Used rings for the + virtqueue in contiguous physical memory. + +\item Optionally, if MSI-X capability is present and enabled on the + device, select a vector to use to request interrupts triggered + by virtqueue events. Write the MSI-X Table entry number + corresponding to this vector into \field{queue_msix_vector}. Read + \field{queue_msix_vector}: on success, previously written value is + returned; on failure, NO_VECTOR value is returned. +\end{enumerate} + +\subparagraph{Legacy Interface: A Note on Virtqueue Configuration}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Device Initialization / Virtqueue Configuration / Legacy Interface: A Note on Virtqueue Configuration} +When using the legacy interface, the queue layout follows \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Legacy Interfaces: A Note on Virtqueue Layout}~\nameref{sec:Basic Facilities of a Virtio Device / Virtqueues / Legacy Interfaces: A Note on Virtqueue Layout} with an alignment of 4096. +Driver writes the physical address, divided +by 4096 to the Queue Address field\footnote{The 4096 is based on the x86 page size, but it's also large +enough to ensure that the separate parts of the virtqueue are on +separate cache lines. +}. There was no mechanism to negotiate the queue size. + +\subsubsection{Notifying The Device}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Notifying The Device} + +The driver notifies the device by writing the 16-bit virtqueue index +of this virtqueue to the Queue Notify address. See \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Notification capability} for how to calculate this address. 
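The write-then-read-back check used when mapping a virtqueue to an MSI-X vector (the last step of the virtqueue configuration sequence above) might look like the following sketch. The register is simulated by a hypothetical structure whose \texttt{out\_of\_resources} flag stands in for a device that cannot satisfy the mapping. None of these names come from this specification, apart from VIRTIO_MSI_NO_VECTOR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Vector value used to disable MSI for queue (from the specification). */
#define VIRTIO_MSI_NO_VECTOR 0xffff

/* Hypothetical stand-in for the queue_msix_vector register of one queue. */
struct msix_vector_reg {
    uint16_t value;            /* what a read currently returns */
    bool     out_of_resources; /* simulated device-side allocation failure */
};

/* Simulated device behaviour on a driver write. */
static void vector_write(struct msix_vector_reg *r, uint16_t vector)
{
    if (vector != VIRTIO_MSI_NO_VECTOR && r->out_of_resources)
        r->value = VIRTIO_MSI_NO_VECTOR; /* mapping failed */
    else
        r->value = vector;
}

/* Simulated device behaviour on a driver read. */
static uint16_t vector_read(const struct msix_vector_reg *r)
{
    return r->value;
}

/* Driver side: map the event and verify success by reading the field back. */
static bool map_vector(struct msix_vector_reg *r, uint16_t vector)
{
    vector_write(r, vector);
    return vector_read(r) == vector;
}
```

When \texttt{map\_vector} returns false, the driver can retry with fewer vectors, disable MSI-X, or report device failure, as permitted by the driver requirements above.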
+ +\subsubsection{Virtqueue Interrupts From The Device}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Virtqueue Interrupts From The Device} + +If an interrupt is necessary for a virtqueue, the device would typically act as follows: + +\begin{itemize} + \item If MSI-X capability is disabled: + \begin{enumerate} + \item Set the lower bit of the ISR Status field for the device. + + \item Send the appropriate PCI interrupt for the device. + \end{enumerate} + + \item If MSI-X capability is enabled: + \begin{enumerate} + \item If \field{queue_msix_vector} is not NO_VECTOR, + request the appropriate MSI-X interrupt message for the + device, \field{queue_msix_vector} sets the MSI-X Table entry + number. + \end{enumerate} +\end{itemize} + +\devicenormative{\paragraph}{Virtqueue Interrupts From The Device}{Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Virtqueue Interrupts From The Device} + +If MSI-X capability is enabled and \field{queue_msix_vector} is +NO_VECTOR for a virtqueue, the device MUST NOT deliver an interrupt +for that virtqueue. + +\subsubsection{Notification of Device Configuration Changes}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Notification of Device Configuration Changes} + +Some virtio PCI devices can change the device configuration +state, as reflected in the device-specific configuration region of the device. In this case: + +\begin{itemize} + \item If MSI-X capability is disabled: + \begin{enumerate} + \item Set the second lower bit of the ISR Status field for the device. + + \item Send the appropriate PCI interrupt for the device. 
+ \end{enumerate} + + \item If MSI-X capability is enabled: + \begin{enumerate} + \item If \field{config_msix_vector} is not NO_VECTOR, + request the appropriate MSI-X interrupt message for the + device, \field{config_msix_vector} sets the MSI-X Table entry + number. + \end{enumerate} +\end{itemize} + +A single interrupt MAY indicate both that one or more virtqueue has +been used and that the configuration space has changed. + +\devicenormative{\paragraph}{Notification of Device Configuration Changes}{Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Notification of Device Configuration Changes} + +If MSI-X capability is enabled and \field{config_msix_vector} is +NO_VECTOR, the device MUST NOT deliver an interrupt +for device configuration space changes. + +\drivernormative{\paragraph}{Notification of Device Configuration Changes}{Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Notification of Device Configuration Changes} + +A driver MUST handle the case where the same interrupt is used to indicate +both device configuration space change and one or more virtqueues being used. + +\subsubsection{Driver Handling Interrupts}\label{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Driver Handling Interrupts} +The driver interrupt handler would typically: + +\begin{itemize} + \item If MSI-X capability is disabled: + \begin{itemize} + \item Read the ISR Status field, which will reset it to zero. + \item If the lower bit is set: + look through the used rings of all virtqueues for the + device, to see if any progress has been made by the device + which requires servicing. + \item If the second lower bit is set: + re-examine the configuration space to see what changed. 
+ \end{itemize} + \item If MSI-X capability is enabled: + \begin{itemize} + \item + Look through the used rings of + all virtqueues mapped to that MSI-X vector for the + device, to see if any progress has been made by the device + which requires servicing. + \item + If the MSI-X vector is equal to \field{config_msix_vector}, + re-examine the configuration space to see what changed. + \end{itemize} +\end{itemize} + +\section{Virtio Over MMIO}\label{sec:Virtio Transport Options / Virtio Over MMIO} + +Virtual environments without PCI support (a common situation in +embedded device models) might use a simple memory mapped device +(``virtio-mmio'') instead of the PCI device. + +The memory mapped virtio device behaviour is based on the PCI +device specification. Therefore most operations including device +initialization, queues configuration and buffer transfers are +nearly identical. Existing differences are described in the +following sections. + +\subsection{MMIO Device Discovery}\label{sec:Virtio Transport Options / Virtio Over MMIO / MMIO Device Discovery} + +Unlike PCI, MMIO provides no generic device discovery mechanism. For each +device, the guest OS will need to know the location of the registers +and interrupt(s) used. The suggested binding for systems using +flattened device trees is shown in this example: + +\begin{lstlisting} +// EXAMPLE: virtio_block device taking 512 bytes at 0x1e000, interrupt 42. +virtio_block@1e000 { + compatible = "virtio,mmio"; + reg = <0x1e000 0x200>; + interrupts = <42>; +}; +\end{lstlisting} + +\subsection{MMIO Device Register Layout}\label{sec:Virtio Transport Options / Virtio Over MMIO / MMIO Device Register Layout} + +MMIO virtio devices provide a set of memory mapped control +registers followed by a device-specific configuration space, +described in the table~\ref{tab:Virtio Trasport Options / Virtio Over MMIO / MMIO Device Register Layout}. + +All register values are organized as Little Endian.
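To make the Little Endian convention concrete, the sketch below assembles a 32-bit register value from four bytes in increasing address order. The helper name is hypothetical. The byte sequence used in the usage note is the ASCII string ``virt'', which the register table gives as the source of the \field{MagicValue} constant.

```c
#include <stdint.h>

/* Decode a 32-bit Little Endian register value from four bytes stored
 * at increasing addresses (b[0] is the least significant byte). */
static uint32_t le32_decode(const uint8_t b[4])
{
    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}
```

Decoding the bytes `v', `i', `r', `t' (0x76, 0x69, 0x72, 0x74) in this way yields 0x74726976.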
+ +\newcommand{\mmioreg}[5]{% Name Function Offset Direction Description + {\field{#1}} \newline #3 \newline #4 & {\bf#2} \newline #5 \\ +} + +\newcommand{\mmiodreg}[7]{% NameHigh NameLow Function OffsetHigh OffsetLow Direction Description + {\field{#1}} \newline #4 \newline {\field{#2}} \newline #5 \newline #6 & {\bf#3} \newline #7 \\ +} + +\begin{longtable}{p{0.2\textwidth}p{0.7\textwidth}} + \caption {MMIO Device Register Layout} + \label{tab:Virtio Trasport Options / Virtio Over MMIO / MMIO Device Register Layout} \\ + \hline + \mmioreg{Name}{Function}{Offset from base}{Direction}{Description} + \hline + \hline + \endfirsthead + \hline + \mmioreg{Name}{Function}{Offset from the base}{Direction}{Description} + \hline + \hline + \endhead + \endfoot + \endlastfoot + \mmioreg{MagicValue}{Magic value}{0x000}{R}{% + 0x74726976 + (a Little Endian equivalent of the ``virt'' string). + } + \hline + \mmioreg{Version}{Device version number}{0x004}{R}{% + 0x2. + \begin{note} + Legacy devices (see \ref{sec:Virtio Transport Options / Virtio Over MMIO / Legacy interface}~\nameref{sec:Virtio Transport Options / Virtio Over MMIO / Legacy interface}) used 0x1. + \end{note} + } + \hline + \mmioreg{DeviceID}{Virtio Subsystem Device ID}{0x008}{R}{% + See \ref{sec:Device Types}~\nameref{sec:Device Types} for possible values. + Value zero (0x0) is used to + define a system memory map with placeholder devices at static, + well known addresses, assigning functions to them depending + on user's needs. + } + \hline + \mmioreg{VendorID}{Virtio Subsystem Vendor ID}{0x00c}{R}{} + \hline + \mmioreg{DeviceFeatures}{Flags representing features the device supports}{0x010}{R}{% + Reading from this register returns 32 consecutive flag bits, + the least significant bit depending on the last value written to + \field{DeviceFeaturesSel}. Access to this register returns + bits $\field{DeviceFeaturesSel}*32$ to $(\field{DeviceFeaturesSel}*32)+31$, eg. 
+ feature bits 0 to 31 if \field{DeviceFeaturesSel} is set to 0 and + features bits 32 to 63 if \field{DeviceFeaturesSel} is set to 1. + Also see \ref{sec:Basic Facilities of a Virtio Device / Feature Bits}~\nameref{sec:Basic Facilities of a Virtio Device / Feature Bits}. + } + \hline + \mmioreg{DeviceFeaturesSel}{Device (host) features word selection.}{0x014}{W}{% + Writing to this register selects a set of 32 device feature bits + accessible by reading from \field{DeviceFeatures}. + } + \hline + \mmioreg{DriverFeatures}{Flags representing device features understood and activated by the driver}{0x020}{W}{% + Writing to this register sets 32 consecutive flag bits, the least significant + bit depending on the last value written to \field{DriverFeaturesSel}. + Access to this register sets bits $\field{DriverFeaturesSel}*32$ + to $(\field{DriverFeaturesSel}*32)+31$, eg. feature bits 0 to 31 if + \field{DriverFeaturesSel} is set to 0 and features bits 32 to 63 if + \field{DriverFeaturesSel} is set to 1. Also see \ref{sec:Basic Facilities of a Virtio Device / Feature Bits}~\nameref{sec:Basic Facilities of a Virtio Device / Feature Bits}. + } + \hline + \mmioreg{DriverFeaturesSel}{Activated (guest) features word selection}{0x024}{W}{% + Writing to this register selects a set of 32 activated feature + bits accessible by writing to \field{DriverFeatures}. + } + \hline + \mmioreg{QueueSel}{Virtual queue index}{0x030}{W}{% + Writing to this register selects the virtual queue that the + following operations on \field{QueueNumMax}, \field{QueueNum}, \field{QueueReady}, + \field{QueueDescLow}, \field{QueueDescHigh}, \field{QueueAvailLow}, \field{QueueAvailHigh}, + \field{QueueUsedLow} and \field{QueueUsedHigh} apply to. The index + number of the first queue is zero (0x0). 
+ } + \hline + \mmioreg{QueueNumMax}{Maximum virtual queue size}{0x034}{R}{% + Reading from the register returns the maximum size (number of + elements) of the queue the device is ready to process or + zero (0x0) if the queue is not available. This applies to the + queue selected by writing to \field{QueueSel}. + } + \hline + \mmioreg{QueueNum}{Virtual queue size}{0x038}{W}{% + Queue size is the number of elements in the queue, therefore in each + of the Descriptor Table, the Available Ring and the Used Ring. + Writing to this register notifies the device what size of the + queue the driver will use. This applies to the queue selected by + writing to \field{QueueSel}. + } + \hline + \mmioreg{QueueReady}{Virtual queue ready bit}{0x044}{RW}{% + Writing one (0x1) to this register notifies the device that it can + execute requests from this virtual queue. Reading from this register + returns the last value written to it. Both read and write + accesses apply to the queue selected by writing to \field{QueueSel}. + } + \hline + \mmioreg{QueueNotify}{Queue notifier}{0x050}{W}{% + Writing a queue index to this register notifies the device that + there are new buffers to process in the queue. + } + \hline + \mmioreg{InterruptStatus}{Interrupt status}{0x060}{R}{% + Reading from this register returns a bit mask of events that + caused the device interrupt to be asserted. + The following events are possible: + \begin{description} + \item[Used Ring Update] - bit 0 - the interrupt was asserted + because the device has updated the Used + Ring in at least one of the active virtual queues. + \item [Configuration Change] - bit 1 - the interrupt was + asserted because the configuration of the device has changed. + \end{description} + } + \hline + \mmioreg{InterruptACK}{Interrupt acknowledge}{0x064}{W}{% + Writing a value with bits set as defined in \field{InterruptStatus} + to this register notifies the device that events causing + the interrupt have been handled.
+ } + \hline + \mmioreg{Status}{Device status}{0x070}{RW}{% + Reading from this register returns the current device status + flags. + Writing non-zero values to this register sets the status flags, + indicating the driver progress. Writing zero (0x0) to this + register triggers a device reset. + See also p. \ref{sec:Virtio Transport Options / Virtio Over MMIO / MMIO-specific Initialization And Device Operation / Device Initialization}~\nameref{sec:Virtio Transport Options / Virtio Over MMIO / MMIO-specific Initialization And Device Operation / Device Initialization}. + } + \hline + \mmiodreg{QueueDescLow}{QueueDescHigh}{Virtual queue's Descriptor Table 64 bit long physical address}{0x080}{0x084}{W}{% + Writing to these two registers (lower 32 bits of the address + to \field{QueueDescLow}, higher 32 bits to \field{QueueDescHigh}) notifies + the device about location of the Descriptor Table of the queue + selected by writing to \field{QueueSel} register. + } + \hline + \mmiodreg{QueueAvailLow}{QueueAvailHigh}{Virtual queue's Available Ring 64 bit long physical address}{0x090}{0x094}{W}{% + Writing to these two registers (lower 32 bits of the address + to \field{QueueAvailLow}, higher 32 bits to \field{QueueAvailHigh}) notifies + the device about location of the Available Ring of the queue + selected by writing to \field{QueueSel}. + } + \hline + \mmiodreg{QueueUsedLow}{QueueUsedHigh}{Virtual queue's Used Ring 64 bit long physical address}{0x0a0}{0x0a4}{W}{% + Writing to these two registers (lower 32 bits of the address + to \field{QueueUsedLow}, higher 32 bits to \field{QueueUsedHigh}) notifies + the device about location of the Used Ring of the queue + selected by writing to \field{QueueSel}. + } + \hline + \mmioreg{ConfigGeneration}{Configuration atomicity value}{0x0fc}{R}{ + Reading from this register returns a value describing a version of the device-specific configuration space (see \field{Config}). 
+ The driver can then access the configuration space and, when finished, read \field{ConfigGeneration} again. + If no part of the configuration space has changed between these two \field{ConfigGeneration} reads, the returned values are identical. + If the values are different, the configuration space accesses were not atomic and the driver has to perform the operations again. + See also \ref {sec:Basic Facilities of a Virtio Device / Device Configuration Space}. + } + \hline + \mmioreg{Config}{Configuration space}{0x100+}{RW}{ + Device-specific configuration space starts at the offset 0x100 + and is accessed with byte alignment. Its meaning and size + depend on the device and the driver. + } + \hline +\end{longtable} + +\devicenormative{\subsubsection}{MMIO Device Register Layout}{Virtio Transport Options / Virtio Over MMIO / MMIO Device Register Layout} + +The device MUST return 0x74726976 in \field{MagicValue}. + +The device MUST return value 0x2 in \field{Version}. + +The device MUST present each event by setting the corresponding bit in \field{InterruptStatus} from the +moment it takes place, until the driver acknowledges the interrupt +by writing a corresponding bit mask to the \field{InterruptACK} register. Bits which +do not represent events which took place MUST be zero. + +Upon reset, the device MUST clear all bits in \field{InterruptStatus} and ready bits in the +\field{QueueReady} register for all queues in the device. + +The device MUST change value returned in \field{ConfigGeneration} if there is any risk of a +driver seeing an inconsistent configuration state. + +The device MUST NOT access virtual queue contents when \field{QueueReady} is zero (0x0). 
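As an illustration of the \field{ConfigGeneration} protocol described in the register table, a driver-side retry loop could look like the following minimal sketch. The helper names mmio_read32 and mmio_read_config64 are hypothetical and not defined by this specification. The sketch assumes a little-endian host, so that plain 32 bit loads return register values directly, and it reads a 64 bit configuration field as two 32 bit wide, aligned accesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register offsets taken from the MMIO register table. */
#define MMIO_CONFIG_GENERATION 0x0fc
#define MMIO_CONFIG            0x100

/* Hypothetical helper: one 32 bit wide, aligned register read. */
static uint32_t mmio_read32(const volatile uint8_t *base, size_t off)
{
    assert(off % 4 == 0);
    return *(const volatile uint32_t *)(base + off);
}

/*
 * Hypothetical helper: read a 64 bit configuration field located
 * field_off bytes into the device-specific configuration space.
 * The read is repeated until ConfigGeneration is unchanged across it.
 */
uint64_t mmio_read_config64(const volatile uint8_t *base, size_t field_off)
{
    uint32_t before, after, lo, hi;

    do {
        before = mmio_read32(base, MMIO_CONFIG_GENERATION);
        lo = mmio_read32(base, MMIO_CONFIG + field_off);
        hi = mmio_read32(base, MMIO_CONFIG + field_off + 4);
        after = mmio_read32(base, MMIO_CONFIG_GENERATION);
    } while (before != after);

    return ((uint64_t)hi << 32) | lo;
}
```

The loop terminates as soon as two consecutive \field{ConfigGeneration} reads agree, which is the condition under which the accesses in between were atomic.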
+ +\drivernormative{\subsubsection}{MMIO Device Register Layout}{Virtio Transport Options / Virtio Over MMIO / MMIO Device Register Layout} +The driver MUST NOT access memory locations not described in the +table \ref{tab:Virtio Trasport Options / Virtio Over MMIO / MMIO Device Register Layout} +(or, in case of the configuration space, described in the device specification), +MUST NOT write to the read-only registers (direction R) and +MUST NOT read from the write-only registers (direction W). + +The driver MUST only use 32 bit wide and aligned reads and writes to access the control registers +described in table \ref{tab:Virtio Trasport Options / Virtio Over MMIO / MMIO Device Register Layout}. +For the device-specific configuration space, the driver MUST use 8 bit wide accesses for +8 bit wide fields, 16 bit wide and aligned accesses for 16 bit wide fields and 32 bit wide and +aligned accesses for 32 and 64 bit wide fields. + +The driver MUST ignore a device with \field{MagicValue} which is not 0x74726976, +although it MAY report an error. + +The driver MUST ignore a device with \field{Version} which is not 0x2, +although it MAY report an error. + +The driver MUST ignore a device with \field{DeviceID} 0x0, +but MUST NOT report any error. + +Before reading from \field{DeviceFeatures}, the driver MUST write a value to \field{DeviceFeaturesSel}. + +Before writing to the \field{DriverFeatures} register, the driver MUST write a value to the \field{DriverFeaturesSel} register. + +The driver MUST write a value to \field{QueueNum} which is less than +or equal to the value presented by the device in \field{QueueNumMax}. + +When \field{QueueReady} is not zero, the driver MUST NOT access +\field{QueueNum}, \field{QueueDescLow}, \field{QueueDescHigh}, +\field{QueueAvailLow}, \field{QueueAvailHigh}, \field{QueueUsedLow}, \field{QueueUsedHigh}. 
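The identification rules in this section (\field{MagicValue}, \field{Version} and \field{DeviceID}) can be sketched as a small probe routine. This is an illustration only: the function name mmio_probe and the result codes are hypothetical, and the register window is modelled as an array of 32 bit words on a little-endian host.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets taken from the MMIO register table. */
#define MMIO_MAGIC_VALUE 0x000
#define MMIO_VERSION     0x004
#define MMIO_DEVICE_ID   0x008

/* Hypothetical result codes for this sketch. */
enum mmio_probe_result {
    MMIO_PROBE_OK,
    MMIO_PROBE_BAD_MAGIC,    /* ignore device, an error MAY be reported */
    MMIO_PROBE_BAD_VERSION,  /* ignore device, an error MAY be reported */
    MMIO_PROBE_NO_DEVICE     /* DeviceID 0x0: ignore silently */
};

enum mmio_probe_result mmio_probe(const volatile uint32_t *regs)
{
    assert(regs != 0);

    if (regs[MMIO_MAGIC_VALUE / 4] != 0x74726976u)
        return MMIO_PROBE_BAD_MAGIC;
    if (regs[MMIO_VERSION / 4] != 0x2u)
        return MMIO_PROBE_BAD_VERSION;
    /* No other register is touched when DeviceID reads as zero. */
    if (regs[MMIO_DEVICE_ID / 4] == 0x0u)
        return MMIO_PROBE_NO_DEVICE;
    return MMIO_PROBE_OK;
}
```

Keeping the three outcomes distinct lets a caller report an error for a bad magic value or version while staying silent for \field{DeviceID} 0x0.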
+ +To stop using the queue the driver MUST write zero (0x0) to +\field{QueueReady} and MUST read the value back to ensure +synchronization. + +The driver MUST ignore undefined bits in \field{InterruptStatus}. + +The driver MUST write a value with a bit mask describing events it handled into \field{InterruptACK} when +it finishes handling an interrupt and MUST NOT set any of the undefined bits in the value. + +\subsection{MMIO-specific Initialization And Device Operation}\label{sec:Virtio Transport Options / Virtio Over MMIO / MMIO-specific Initialization And Device Operation} + +\subsubsection{Device Initialization}\label{sec:Virtio Transport Options / Virtio Over MMIO / MMIO-specific Initialization And Device Operation / Device Initialization} + +\drivernormative{\paragraph}{Device Initialization}{Virtio Transport Options / Virtio Over MMIO / MMIO-specific Initialization And Device Operation / Device Initialization} + +The driver MUST start the device initialization by reading and +checking values from \field{MagicValue} and \field{Version}. +If both values are valid, it MUST read \field{DeviceID} +and if its value is zero (0x0) MUST abort initialization and +MUST NOT access any other register. + +Further initialization MUST follow the procedure described in +\ref{sec:General Initialization And Device Operation / Device Initialization}~\nameref{sec:General Initialization And Device Operation / Device Initialization}. + +\subsubsection{Virtqueue Configuration}\label{sec:Virtio Transport Options / Virtio Over MMIO / MMIO-specific Initialization And Device Operation / Virtqueue Configuration} + +The driver will typically initialize the virtual queue in the following way: + +\begin{enumerate} +\item Select the queue by writing its index (first queue is 0) to + \field{QueueSel}. + +\item Check that the queue is not already in use: read \field{QueueReady}, + and expect a returned value of zero (0x0). 
+ +\item Read maximum queue size (number of elements) from + \field{QueueNumMax}. If the returned value is zero (0x0) the + queue is not available. + +\item Allocate and zero the queue pages, making sure the memory + is physically contiguous. It is recommended to align the + Used Ring to an optimal boundary (usually the page size). + +\item Notify the device about the queue size by writing the size to + \field{QueueNum}. + +\item Write physical addresses of the queue's Descriptor Table, + Available Ring and Used Ring to (respectively) the + \field{QueueDescLow}/\field{QueueDescHigh}, + \field{QueueAvailLow}/\field{QueueAvailHigh} and + \field{QueueUsedLow}/\field{QueueUsedHigh} register pairs. + +\item Write 0x1 to \field{QueueReady}. +\end{enumerate} + +\subsubsection{Notifying The Device}\label{sec:Virtio Transport Options / Virtio Over MMIO / MMIO-specific Initialization And Device Operation / Notifying The Device} + +The driver notifies the device about new buffers being available in +a queue by writing the index of the updated queue to \field{QueueNotify}. + +\subsubsection{Notifications From The Device}\label{sec:Virtio Transport Options / Virtio Over MMIO / MMIO-specific Initialization And Device Operation / Notifications From The Device} + +The memory mapped virtio device is using a single, dedicated +interrupt signal, which is asserted when at least one of the +bits described in the description of \field{InterruptStatus} +is set. This is how the device notifies the +driver about a new used buffer being available in the queue +or about a change in the device configuration. + +\drivernormative{\paragraph}{Notifications From The Device}{Virtio Transport Options / Virtio Over MMIO / MMIO-specific Initialization And Device Operation / Notifications From The Device} +After receiving an interrupt, the driver MUST read +\field{InterruptStatus} to check what caused the interrupt +(see the register description). 
After the interrupt is handled, +the driver MUST acknowledge it by writing a bit mask +corresponding to the handled events to the InterruptACK register. + +\subsection{Legacy interface}\label{sec:Virtio Transport Options / Virtio Over MMIO / Legacy interface} + +The legacy MMIO transport used page-based addressing, resulting +in a slightly different control register layout, the device +initialization and the virtual queue configuration procedure. + +Table \ref{tab:Virtio Trasport Options / Virtio Over MMIO / MMIO Device Legacy Register Layout} +presents control registers layout, omitting +descriptions of registers which did not change their function +nor behaviour: + +\begin{longtable}{p{0.2\textwidth}p{0.7\textwidth}} + \caption {MMIO Device Legacy Register Layout} + \label{tab:Virtio Trasport Options / Virtio Over MMIO / MMIO Device Legacy Register Layout} \\ + \hline + \mmioreg{Name}{Function}{Offset from base}{Direction}{Description} + \hline + \hline + \endfirsthead + \hline + \mmioreg{Name}{Function}{Offset from the base}{Direction}{Description} + \hline + \hline + \endhead + \endfoot + \endlastfoot + \mmioreg{MagicValue}{Magic value}{0x000}{R}{} + \hline + \mmioreg{Version}{Device version number}{0x004}{R}{Legacy device returns value 0x1.} + \hline + \mmioreg{DeviceID}{Virtio Subsystem Device ID}{0x008}{R}{} + \hline + \mmioreg{VendorID}{Virtio Subsystem Vendor ID}{0x00c}{R}{} + \hline + \mmioreg{HostFeatures}{Flags representing features the device supports}{0x010}{R}{} + \hline + \mmioreg{HostFeaturesSel}{Device (host) features word selection.}{0x014}{W}{} + \hline + \mmioreg{GuestFeatures}{Flags representing device features understood and activated by the driver}{0x020}{W}{} + \hline + \mmioreg{GuestFeaturesSel}{Activated (guest) features word selection}{0x024}{W}{} + \hline + \mmioreg{GuestPageSize}{Guest page size}{0x028}{W}{% + The driver writes the guest page size in bytes to the + register during initialization, before any queues are used. 
+ This value should be a power of 2 and is used by the device to + calculate the Guest address of the first queue page + (see \field{QueuePFN}). + } + \hline + \mmioreg{QueueSel}{Virtual queue index}{0x030}{W}{% + Writing to this register selects the virtual queue that the + following operations on the \field{QueueNumMax}, \field{QueueNum}, \field{QueueAlign} + and \field{QueuePFN} registers apply to. The index + number of the first queue is zero (0x0). + } + \hline + \mmioreg{QueueNumMax}{Maximum virtual queue size}{0x034}{R}{% + Reading from the register returns the maximum size of the queue + the device is ready to process or zero (0x0) if the queue is not + available. This applies to the queue selected by writing to + \field{QueueSel} and is allowed only when \field{QueuePFN} is set to zero + (0x0), so when the queue is not actively used. + } + \hline + \mmioreg{QueueNum}{Virtual queue size}{0x038}{W}{% + Queue size is the number of elements in the queue, therefore size + of the descriptor table and both available and used rings. + Writing to this register notifies the device what size of the + queue the driver will use. This applies to the queue selected by + writing to \field{QueueSel}. + } + \hline + \mmioreg{QueueAlign}{Used Ring alignment in the virtual queue}{0x03c}{W}{% + Writing to this register notifies the device about alignment + boundary of the Used Ring in bytes. This value should be a power + of 2 and applies to the queue selected by writing to \field{QueueSel}. + } + \hline + \mmioreg{QueuePFN}{Guest physical page number of the virtual queue}{0x040}{RW}{% + Writing to this register notifies the device about location of the + virtual queue in the Guest's physical address space. This value + is the index number of a page starting with the queue + Descriptor Table. Value zero (0x0) means physical address zero + (0x00000000) and is illegal. When the driver stops using the + queue it writes zero (0x0) to this register. 
+ Reading from this register returns the currently used page + number of the queue, therefore a value other than zero (0x0) + means that the queue is in use. + Both read and write accesses apply to the queue selected by + writing to \field{QueueSel}. + } + \hline + \mmioreg{QueueNotify}{Queue notifier}{0x050}{W}{} + \hline + \mmioreg{InterruptStatus}{Interrupt status}{0x60}{R}{} + \hline + \mmioreg{InterruptACK}{Interrupt acknowledge}{0x064}{W}{} + \hline + \mmioreg{Status}{Device status}{0x070}{RW}{% + Reading from this register returns the current device status + flags. + Writing non-zero values to this register sets the status flags, + indicating the OS/driver progress. Writing zero (0x0) to this + register triggers a device reset. The device + sets \field{QueuePFN} to zero (0x0) for all queues in the device. + Also see \ref{sec:General Initialization And Device Operation / Device Initialization}~\nameref{sec:General Initialization And Device Operation / Device Initialization}. + } + \hline + \mmioreg{Config}{Configuration space}{0x100+}{RW}{} + \hline +\end{longtable} + +The virtual queue page size is defined by writing to \field{GuestPageSize}, +as written by the guest. The driver does this before the +virtual queues are configured. + +The virtual queue layout follows +p. \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Legacy Interfaces: A Note on Virtqueue Layout}~\nameref{sec:Basic Facilities of a Virtio Device / Virtqueues / Legacy Interfaces: A Note on Virtqueue Layout}, +with the alignment defined in \field{QueueAlign}. + +The virtual queue is configured as follows: +\begin{enumerate} +\item Select the queue writing its index (first queue is 0) to + \field{QueueSel}. + +\item Check if the queue is not already in use: read \field{QueuePFN}, + expecting a returned value of zero (0x0). + +\item Read maximum queue size (number of elements) from + \field{QueueNumMax}. If the returned value is zero (0x0) the + queue is not available. 
+ +\item Allocate and zero the queue pages in contiguous virtual + memory, aligning the Used Ring to an optimal boundary (usually + page size). The driver should choose a queue size smaller than or + equal to \field{QueueNumMax}. + +\item Notify the device about the queue size by writing the size to + \field{QueueNum}. + +\item Notify the device about the used alignment by writing its value + in bytes to \field{QueueAlign}. + +\item Write the physical number of the first page of the queue to + the \field{QueuePFN} register. +\end{enumerate} + +Notification mechanisms did not change. + +\section{Virtio Over Channel I/O}\label{sec:Virtio Transport Options / Virtio Over Channel I/O} + +S/390 based virtual machines support neither PCI nor MMIO, so a +different transport is needed there. + +virtio-ccw uses the standard channel I/O based mechanism used for +the majority of devices on S/390. A virtual channel device with a +special control unit type acts as proxy to the virtio device +(similar to the way virtio-pci uses a PCI device) and +configuration and operation of the virtio device is accomplished +(mostly) via channel commands. This means virtio devices are +discoverable via standard operating system algorithms, and adding +virtio support is mainly a question of supporting a new control +unit type. + +As the S/390 is a big endian machine, the data structures transmitted +via channel commands are big-endian: this is made clear by use of +the types be16, be32 and be64. + +\subsection{Basic Concepts}\label{sec:Virtio Transport Options / Virtio over channel I/O / Basic Concepts} + +As a proxy device, virtio-ccw uses a channel-attached I/O control +unit with a special control unit type (0x3832) and a control unit +model corresponding to the attached virtio device's subsystem +device ID, accessed via a virtual I/O subchannel and a virtual +channel path of type 0x32. 
This proxy device is discoverable via +normal channel subsystem device discovery (usually a STORE +SUBCHANNEL loop) and answers to the basic channel commands: + +\begin{itemize} +\item NO-OPERATION (0x03) +\item BASIC SENSE (0x04) +\item TRANSFER IN CHANNEL (0x08) +\item SENSE ID (0xe4) +\end{itemize} + +For a virtio-ccw proxy device, SENSE ID will return the following +information: + +\begin{tabular}{ |l|l|l| } +\hline +Bytes & Description & Contents \\ +\hline \hline +0 & reserved & 0xff \\ +\hline +1-2 & control unit type & 0x3832 \\ +\hline +3 & control unit model & <virtio device id> \\ +\hline +4-5 & device type & zeroes (unset) \\ +\hline +6 & device model & zeroes (unset) \\ +\hline +7-255 & extended SenseId data & zeroes (unset) \\ +\hline +\end{tabular} + +In addition to the basic channel commands, virtio-ccw defines a +set of channel commands related to configuration and operation of +virtio: + +\begin{lstlisting} +#define CCW_CMD_SET_VQ 0x13 +#define CCW_CMD_VDEV_RESET 0x33 +#define CCW_CMD_SET_IND 0x43 +#define CCW_CMD_SET_CONF_IND 0x53 +#define CCW_CMD_SET_IND_ADAPTER 0x73 +#define CCW_CMD_READ_FEAT 0x12 +#define CCW_CMD_WRITE_FEAT 0x11 +#define CCW_CMD_READ_CONF 0x22 +#define CCW_CMD_WRITE_CONF 0x21 +#define CCW_CMD_WRITE_STATUS 0x31 +#define CCW_CMD_READ_VQ_CONF 0x32 +#define CCW_CMD_SET_VIRTIO_REV 0x83 +#define CCW_CMD_READ_STATUS 0x72 +\end{lstlisting} + +\devicenormative{\subsubsection}{Basic Concepts}{Virtio Transport Options / Virtio over channel I/O / Basic Concepts} + +The virtio-ccw device acts like a normal channel device, as specified +in \hyperref[intro:S390 PoP]{[S390 PoP]} and \hyperref[intro:S390 Common I/O]{[S390 Common I/O]}. In particular: + +\begin{itemize} +\item A device MUST post a unit check with command reject for any command + it does not support. 
+ +\item If a driver did not suppress length checks for a channel command, + the device MUST present a subchannel status as detailed in the + architecture when the actual length did not match the expected length. + +\item If a driver did suppress length checks for a channel command, the + device MUST present a check condition if the transmitted data does + not contain enough data to process the command. If the driver submitted + a buffer that was too long, the device SHOULD accept the command. +\end{itemize} + +\drivernormative{\subsubsection}{Basic Concepts}{Virtio Transport Options / Virtio over channel I/O / Basic Concepts} + +A driver for virtio-ccw devices MUST check for a control unit +type of 0x3832 and MUST ignore the device type and model. + +A driver SHOULD attempt to provide the correct length in a channel +command even if it suppresses length checks for that command. + +\subsection{Device Initialization}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization} + +virtio-ccw uses several channel commands to set up a device. + +\subsubsection{Setting the Virtio Revision}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization / Setting the Virtio Revision} + +CCW_CMD_SET_VIRTIO_REV is issued by the driver to set the revision of +the virtio-ccw transport it intends to drive the device with. It uses the +following communication structure: + +\begin{lstlisting} +struct virtio_rev_info { + be16 revision; + be16 length; + u8 data[]; +}; +\end{lstlisting} + +\field{revision} contains the desired revision id, \field{length} the length of the +data portion and \field{data} revision-dependent additional desired options. 
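As an illustration of the structure above, the following minimal sketch serializes a struct virtio_rev_info payload into a byte buffer in big-endian order, as required for data transmitted via channel commands. The function name virtio_rev_info_encode and its return convention are hypothetical and not part of this specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: encode revision, length and optional data into
 * buf using big-endian byte order. Returns the number of bytes written,
 * or 0 if the buffer is too small or data is missing.
 */
size_t virtio_rev_info_encode(uint8_t *buf, size_t buflen, uint16_t revision,
                              const uint8_t *data, uint16_t length)
{
    size_t total = 4 + (size_t)length;

    assert(buf != NULL);
    if (buflen < total || (length != 0 && data == NULL))
        return 0;

    buf[0] = (uint8_t)(revision >> 8);    /* be16 revision */
    buf[1] = (uint8_t)(revision & 0xff);
    buf[2] = (uint8_t)(length >> 8);      /* be16 length */
    buf[3] = (uint8_t)(length & 0xff);
    if (length != 0)
        memcpy(buf + 4, data, length);    /* revision-dependent options */
    return total;
}
```

Encoding byte by byte keeps the result independent of the endianness of the machine the code is compiled for.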
+ +The following values are supported: + +\begin{tabular}{ |l|l|l|l| } +\hline +\field{revision} & \field{length} & \field{data} & remarks \\ +\hline \hline +0 & 0 & <empty> & legacy interface; transitional devices only \\ +\hline +1 & 0 & <empty> & Virtio 1.0 \\ +\hline +2 & 0 & <empty> & CCW_CMD_READ_STATUS support \\ +\hline +3-n & & & reserved for later revisions \\ +\hline +\end{tabular} + +Note that a change in the virtio standard does not necessarily +correspond to a change in the virtio-ccw revision. + +\devicenormative{\paragraph}{Setting the Virtio Revision}{Virtio Transport Options / Virtio over channel I/O / Device Initialization / Setting the Virtio Revision} + +A device MUST post a unit check with command reject for any \field{revision} +it does not support. For any invalid combination of \field{revision}, \field{length} +and \field{data}, it MUST post a unit check with command reject as well. A +non-transitional device MUST reject revision id 0. + +A device MUST answer with command reject to any virtio-ccw specific +channel command that is not contained in the revision selected by the +driver. + +A device MUST answer with command reject to any attempt to select a different revision +after a revision has been successfully selected by the driver. + +A device MUST treat the revision as unset from the time the associated +subchannel has been enabled until a revision has been successfully set +by the driver. This implies that revisions are not persistent across +disabling and enabling of the associated subchannel. + +\drivernormative{\paragraph}{Setting the Virtio Revision}{Virtio Transport Options / Virtio over channel I/O / Device Initialization / Setting the Virtio Revision} + +A driver SHOULD start with trying to set the highest revision it +supports and continue with lower revisions if it gets a command reject. + +A driver MUST NOT issue any other virtio-ccw specific channel commands +prior to setting the revision. 
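The recommended driver behaviour (start with the highest supported revision and fall back on command reject) can be sketched as follows. Everything named here is an assumption made for illustration: struct fake_ccw_dev and fake_set_virtio_rev simulate a device in place of a real CCW_CMD_SET_VIRTIO_REV channel program, and negotiate_revision is a hypothetical driver routine. A non-transitional driver would pass a lowest revision of 1, since it cannot operate a device at revision 0.

```c
#include <assert.h>

/* Simulated device for this sketch; not a real channel device. */
struct fake_ccw_dev {
    int max_rev;    /* highest revision the device supports */
    int selected;   /* -1 while the revision is unset */
};

/*
 * Stand-in for issuing CCW_CMD_SET_VIRTIO_REV. Returns 0 on success and
 * -1 for a unit check with command reject (unsupported revision, or a
 * revision has already been selected).
 */
static int fake_set_virtio_rev(struct fake_ccw_dev *dev, int revision)
{
    if (dev->selected >= 0 || revision > dev->max_rev)
        return -1;
    dev->selected = revision;
    return 0;
}

/*
 * Try revisions from driver_max_rev down to driver_min_rev.
 * Returns the selected revision, or -1 if every attempt was rejected.
 */
int negotiate_revision(struct fake_ccw_dev *dev, int driver_max_rev,
                       int driver_min_rev)
{
    int rev;

    assert(driver_min_rev >= 0 && driver_max_rev >= driver_min_rev);
    for (rev = driver_max_rev; rev >= driver_min_rev; rev--) {
        if (fake_set_virtio_rev(dev, rev) == 0)
            return rev;
    }
    return -1;
}
```

Because the loop stops at the first accepted revision, the driver never attempts to select a second revision after a successful selection.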
+ +After a revision has been successfully selected by the driver, it +MUST NOT attempt to select a different revision. + +\paragraph{Legacy Interfaces: A Note on Setting the Virtio Revision}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization / Setting the Virtio Revision / Legacy Interfaces: A Note on Setting the Virtio Revision} + +A legacy device will not support the CCW_CMD_SET_VIRTIO_REV and answer +with a command reject. A non-transitional driver MUST stop trying to +operate this device in that case. A transitional driver MUST operate +the device as if it had been able to set revision 0. + +A legacy driver will not issue the CCW_CMD_SET_VIRTIO_REV prior to +issuing other virtio-ccw specific channel commands. A non-transitional +device therefore MUST answer any such attempts with a command reject. +A transitional device MUST assume in this case that the driver is a +legacy driver and continue as if the driver selected revision 0. This +implies that the device MUST reject any command not valid for revision +0, including a subsequent CCW_CMD_SET_VIRTIO_REV. + +\subsubsection{Configuring a Virtqueue}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization / Configuring a Virtqueue} + +CCW_CMD_READ_VQ_CONF is issued by the driver to obtain information +about a queue. It uses the following structure for communicating: + +\begin{lstlisting} +struct vq_config_block { + be16 index; + be16 max_num; +}; +\end{lstlisting} + +The requested number of buffers for queue \field{index} is returned in +\field{max_num}. + +Afterwards, CCW_CMD_SET_VQ is issued by the driver to inform the +device about the location used for its queue. 
The transmitted +structure is + +\begin{lstlisting} +struct vq_info_block { + be64 desc; + be32 res0; + be16 index; + be16 num; + be64 avail; + be64 used; +}; +\end{lstlisting} + +\field{desc}, \field{avail} and \field{used} contain the guest addresses for the descriptor table, +available ring and used ring for queue \field{index}, respectively. The actual +virtqueue size (number of allocated buffers) is transmitted in \field{num}. + +\devicenormative{\paragraph}{Configuring a Virtqueue}{Virtio Transport Options / Virtio over channel I/O / Device Initialization / Configuring a Virtqueue} + +\field{res0} is reserved and MUST be ignored by the device. + +\paragraph{Legacy Interface: A Note on Configuring a Virtqueue}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization / Configuring a Virtqueue / Legacy Interface: A Note on Configuring a Virtqueue} + +For a legacy driver or for a driver that selected revision 0, +CCW_CMD_SET_VQ uses the following communication block: + +\begin{lstlisting} +struct vq_info_block_legacy { + be64 queue; + be32 align; + be16 index; + be16 num; +}; +\end{lstlisting} + +\field{queue} contains the guest address for queue \field{index}, \field{num} the number of buffers +and \field{align} the alignment. The queue layout follows \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Legacy Interfaces: A Note on Virtqueue Layout}~\nameref{sec:Basic Facilities of a Virtio Device / Virtqueues / Legacy Interfaces: A Note on Virtqueue Layout}. + +\subsubsection{Communicating Status Information}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization / Communicating Status Information} + +The driver changes the status of a device via the +CCW_CMD_WRITE_STATUS command, which transmits an 8 bit status +value. 
+ +As described in +\ref{devicenormative:Basic Facilities of a Virtio Device / Feature Bits}, +a device sometimes fails to set the \field{status} field: for example, it +might fail to accept the FEATURES_OK status bit during device initialization. + +With revision 2, CCW_CMD_READ_STATUS is defined: it reads an 8 bit status +value from the device and acts as the reverse operation to CCW_CMD_WRITE_STATUS. + +\drivernormative{\paragraph}{Communicating Status Information}{Virtio Transport Options / Virtio over channel I/O / Device Initialization / Communicating Status Information} + +If the device posts a unit check with command reject in response to the +CCW_CMD_WRITE_STATUS command, the driver MUST assume that the device failed +to set the status and the \field{status} field retained its previous value. + +If at least revision 2 has been negotiated, the driver SHOULD use the +CCW_CMD_READ_STATUS command to retrieve the \field{status} field after +a configuration change has been detected. + +If revision 2 or higher has not been negotiated, the driver MUST NOT attempt +to issue the CCW_CMD_READ_STATUS command. + +\devicenormative{\paragraph}{Communicating Status Information}{Virtio Transport Options / Virtio over channel I/O / Device Initialization / Communicating Status Information} + +If the device fails to set the \field{status} field to the value written by +the driver, the device MUST ensure that the \field{status} field is left +unchanged and MUST post a unit check with command reject. + +If at least revision 2 has been negotiated, the device MUST return the +current \field{status} field if the CCW_CMD_READ_STATUS command is issued. + +\subsubsection{Handling Device Features}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization / Handling Device Features} + +Feature bits are arranged in an array of 32 bit values, making +for a total of 8192 feature bits. Feature bits are in +little-endian byte order. 
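The arithmetic behind this layout can be made concrete with a short sketch: 8192 feature bits in 32 bit values correspond to 256 values, so an 8 bit index is enough to select one of them. The helper names below are hypothetical and only illustrate how a feature bit number maps to a word index, a mask within that word, and the little-endian byte sequence of the word.

```c
#include <assert.h>
#include <stdint.h>

#define FEATURE_WORD_BITS 32u
#define FEATURE_WORDS     256u  /* 8192 / 32, selectable by an 8 bit index */
#define FEATURE_BITS      (FEATURE_WORD_BITS * FEATURE_WORDS)

/* Which 32 bit value holds the given feature bit. */
uint8_t feature_word_index(unsigned int bit)
{
    assert(bit < FEATURE_BITS);
    return (uint8_t)(bit / FEATURE_WORD_BITS);
}

/* Mask of the given feature bit inside its 32 bit value. */
uint32_t feature_word_mask(unsigned int bit)
{
    assert(bit < FEATURE_BITS);
    return (uint32_t)1 << (bit % FEATURE_WORD_BITS);
}

/* Store a 32 bit feature value in little-endian byte order. */
void feature_word_to_le32(uint32_t word, uint8_t out[4])
{
    out[0] = (uint8_t)(word & 0xff);
    out[1] = (uint8_t)((word >> 8) & 0xff);
    out[2] = (uint8_t)((word >> 16) & 0xff);
    out[3] = (uint8_t)((word >> 24) & 0xff);
}
```

For example, feature bit 32 is bit 0 of the value at index 1, and feature bit 8191 is the most significant bit of the value at index 255.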
+ +The CCW commands dealing with features use the following +communication block: + +\begin{lstlisting} +struct virtio_feature_desc { + le32 features; + u8 index; +}; +\end{lstlisting} + +\field{features} are the 32 bits of features currently accessed, while +\field{index} describes which of the feature bit values is to be +accessed. No padding is added at the end of the structure, it is +exactly 5 bytes in length. + +The guest obtains the device's device feature set via the +CCW_CMD_READ_FEAT command. The device stores the features at \field{index} +to \field{features}. + +For communicating its supported features to the device, the driver +uses the CCW_CMD_WRITE_FEAT command, denoting a \field{features}/\field{index} +combination. + +\subsubsection{Device Configuration}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization / Device Configuration} + +The device's configuration space is located in host memory. + +To obtain information from the configuration space, the driver +uses CCW_CMD_READ_CONF, specifying the guest memory for the device +to write to. + +For changing configuration information, the driver uses +CCW_CMD_WRITE_CONF, specifying the guest memory for the device to +read from. + +In both cases, the complete configuration space is transmitted. This +allows the driver to compare the new configuration space with the old +version, and keep a generation count internally whenever it changes. + +\subsubsection{Setting Up Indicators}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization / Setting Up Indicators} + +In order to set up the indicator bits for host->guest notification, +the driver uses different channel commands depending on whether it +wishes to use traditional I/O interrupts tied to a subchannel or +adapter I/O interrupts for virtqueue notifications. For any given +device, the two mechanisms are mutually exclusive. 
+ +For the configuration change indicators, only a mechanism using +traditional I/O interrupts is provided, regardless of whether +traditional or adapter I/O interrupts are used for virtqueue +notifications. + +\paragraph{Setting Up Classic Queue Indicators}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization / Setting Up Indicators / Setting Up Classic Queue Indicators} + +Indicators for notification via classic I/O interrupts are contained +in a 64 bit value per virtio-ccw proxy device. + +To communicate the location of the indicator bits for host->guest +notification, the driver uses the CCW_CMD_SET_IND command, +pointing to a location containing the guest address of the +indicators in a 64 bit value. + +If the driver has already set up two-staged queue indicators via the +CCW_CMD_SET_IND_ADAPTER command, the device MUST post a unit check +with command reject to any subsequent CCW_CMD_SET_IND command. + +\paragraph{Setting Up Configuration Change Indicators}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization / Setting Up Indicators / Setting Up Configuration Change Indicators} + +Indicators for configuration change host->guest notification are +contained in a 64 bit value per virtio-ccw proxy device. + +To communicate the location of the indicator bits used in the +configuration change host->guest notification, the driver issues the +CCW_CMD_SET_CONF_IND command, pointing to a location containing the +guest address of the indicators in a 64 bit value. 
+ +\paragraph{Setting Up Two-Stage Queue Indicators}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization / Setting Up Indicators / Setting Up Two-Stage Queue Indicators} + +Indicators for notification via adapter I/O interrupts consist of +two stages: +\begin{itemize} +\item a summary indicator byte covering the virtqueues for one or more + virtio-ccw proxy devices +\item a set of contiguous indicator bits for the virtqueues for a + virtio-ccw proxy device +\end{itemize} + +To communicate the location of the summary and queue indicator bits, +the driver uses the CCW_CMD_SET_IND_ADAPTER command with the following +payload: + +\begin{lstlisting} +struct virtio_thinint_area { + be64 summary_indicator; + be64 indicator; + be64 bit_nr; + u8 isc; +} __attribute__ ((packed)); +\end{lstlisting} + +\field{summary_indicator} contains the guest address of the 8 bit summary +indicator. +\field{indicator} contains the guest address of an area wherein the indicators +for the devices are contained, starting at \field{bit_nr}, one bit per +virtqueue of the device. Bit numbers start at the left, i.e. the most +significant bit in the first byte is assigned the bit number 0. +\field{isc} contains the I/O interruption subclass to be used for the adapter +I/O interrupt. It MAY be different from the isc used by the proxy +virtio-ccw device's subchannel. +No padding is added at the end of the structure; it is exactly 25 bytes +in length. + + +\devicenormative{\subparagraph}{Setting Up Two-Stage Queue Indicators}{Virtio Transport Options / Virtio over channel I/O / Device Initialization / Setting Up Indicators / Setting Up Two-Stage Queue Indicators} +If the driver has already set up classic queue indicators via the +CCW_CMD_SET_IND command, the device MUST post a unit check with +command reject to any subsequent CCW_CMD_SET_IND_ADAPTER command. 
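The 25 byte size of the payload and the left-to-right bit numbering of the indicator area can be checked with a short sketch. The struct mirrors the listing above with the be64 fields modelled as uint64_t (a real driver would still have to store them in big-endian order); the helpers indicator_set and indicator_test are hypothetical names used only for illustration. The indicator bit for virtqueue q of a device is at bit number \field{bit_nr} + q.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct virtio_thinint_area {
    uint64_t summary_indicator;  /* be64 when transmitted */
    uint64_t indicator;          /* be64 when transmitted */
    uint64_t bit_nr;             /* be64 when transmitted */
    uint8_t isc;
} __attribute__ ((packed));

/* Three 8 byte fields plus one byte, no trailing padding. */
_Static_assert(sizeof(struct virtio_thinint_area) == 25,
               "virtio_thinint_area must be exactly 25 bytes");

/* Set indicator bit number 'bit'; bit 0 is the MSB of the first byte. */
void indicator_set(uint8_t *area, uint64_t bit)
{
    assert(area != NULL);
    area[bit / 8] |= (uint8_t)(0x80u >> (bit % 8));
}

/* Return non-zero if indicator bit number 'bit' is set. */
int indicator_test(const uint8_t *area, uint64_t bit)
{
    assert(area != NULL);
    return (area[bit / 8] & (0x80u >> (bit % 8))) != 0;
}
```

With this numbering, bit 0 corresponds to the value 0x80 in the first byte and bit 9 to the value 0x40 in the second byte.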
+ +\paragraph{Legacy Interfaces: A Note on Setting Up Indicators}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Initialization / Setting Up Indicators / Legacy Interfaces: A Note on Setting Up Indicators} + +In some cases, legacy devices will only support classic queue indicators; +in that case, they will reject CCW_CMD_SET_IND_ADAPTER as they don't know that +command. Some legacy devices will support two-stage queue indicators, though, +and a driver will be able to successfully use CCW_CMD_SET_IND_ADAPTER to set +them up. + +\subsection{Device Operation}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Operation} + +\subsubsection{Host->Guest Notification}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Operation / Host->Guest Notification} + +There are two modes of operation regarding host->guest notification: +classic I/O interrupts and adapter I/O interrupts. The driver +determines the mode to be used by setting up queue indicators with either +CCW_CMD_SET_IND or CCW_CMD_SET_IND_ADAPTER. + +For configuration changes, the driver always uses classic I/O +interrupts. + +\paragraph{Notification via Classic I/O Interrupts}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Operation / Host->Guest Notification / Notification via Classic I/O Interrupts} + +If the driver used the CCW_CMD_SET_IND command to set up queue +indicators, the device will use classic I/O interrupts for +host->guest notification about virtqueue activity. + +For notifying the driver of virtqueue buffers, the device sets the +corresponding bit in the guest-provided indicators. If an +interrupt is not already pending for the subchannel, the device +generates an unsolicited I/O interrupt. + +If the device wants to notify the driver about configuration +changes, it sets bit 0 in the configuration indicators and +generates an unsolicited I/O interrupt, if needed. 
This also +applies if adapter I/O interrupts are used for queue notifications. + +\paragraph{Notification via Adapter I/O Interrupts}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Operation / Host->Guest Notification / Notification via Adapter I/O Interrupts} + +If the driver used the CCW_CMD_SET_IND_ADAPTER command to set up +queue indicators, the device will use adapter I/O interrupts for +host->guest notification about virtqueue activity. + +For notifying the driver of virtqueue buffers, the device sets the +bit in the guest-provided indicator area at the corresponding offset. +The guest-provided summary indicator is set to 0x01. An adapter I/O +interrupt for the corresponding interruption subclass is generated. + +The recommended way to process an adapter I/O interrupt by the driver +is as follows: + +\begin{itemize} +\item Process all queue indicator bits associated with the summary indicator. +\item Clear the summary indicator, performing a synchronization (memory +barrier) afterwards. +\item Process all queue indicator bits associated with the summary indicator +again. +\end{itemize} + +\devicenormative{\subparagraph}{Notification via Adapter I/O Interrupts}{Virtio Transport Options / Virtio over channel I/O / Device Operation / Host->Guest Notification / Notification via Adapter I/O Interrupts} + +The device SHOULD only generate an adapter I/O interrupt if the +summary indicator had not been set prior to notification. + +\drivernormative{\subparagraph}{Notification via Adapter I/O Interrupts}{Virtio Transport Options / Virtio over channel I/O / Device Operation / Host->Guest Notification / Notification via Adapter I/O Interrupts} +The driver +MUST clear the summary indicator after receiving an adapter I/O +interrupt before it processes the queue indicators. 
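
A non-normative sketch of how a driver might handle an adapter I/O
interrupt under the requirement above: the summary indicator is cleared
first, with a full barrier, and only then are the queue indicator bits
scanned. A notification that arrives during the scan sets the summary
indicator again and so leads to a fresh interrupt instead of being lost.
The function name \texttt{adapter\_irq\_collect} is hypothetical, the
device's indicator bits are assumed to start at bit 0 of the area, and at
most 64 virtqueues are handled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Clear the summary indicator, then test-and-clear each queue indicator.
 * Returns a mask in which bit q (counting from the least significant bit)
 * is set for every virtqueue q that had a pending notification. */
uint64_t adapter_irq_collect(_Atomic uint8_t *summary,
                             _Atomic uint8_t *indicators,
                             unsigned nr_queues)
{
    assert(nr_queues <= 64);

    /* Sequentially consistent store: doubles as the memory barrier. */
    atomic_store(summary, 0);

    uint64_t pending = 0;
    for (unsigned q = 0; q < nr_queues; q++) {
        /* Indicator bits are numbered from the most significant bit. */
        uint8_t mask = (uint8_t)(0x80u >> (q % 8));
        uint8_t old = atomic_fetch_and(&indicators[q / 8], (uint8_t)~mask);
        if (old & mask)
            pending |= UINT64_C(1) << q;
    }
    return pending;
}
```

The caller would then process the used ring of every virtqueue whose bit is
set in the returned mask.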
+ +\paragraph{Legacy Interfaces: A Note on Host->Guest Notification}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Operation / Host->Guest Notification / Legacy Interfaces: A Note on Host->Guest Notification} + +As legacy devices and drivers support only classic queue indicators, +host->guest notification will always be done via classic I/O interrupts. + +\subsubsection{Guest->Host Notification}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Operation / Guest->Host Notification} + +For notifying the device of virtqueue buffers, the driver +unfortunately can't use a channel command (the asynchronous +characteristics of channel I/O interact badly with the host block +I/O backend). Instead, it uses a diagnose 0x500 call with subcode +3 specifying the queue, as follows: + +\begin{tabular}{ |l|l|l| } +\hline +GPR & Input Value & Output Value \\ +\hline \hline + 1 & 0x3 & \\ +\hline + 2 & Subchannel ID & Host Cookie \\ +\hline + 3 & Virtqueue number & \\ +\hline + 4 & Host Cookie & \\ +\hline +\end{tabular} + +\devicenormative{\paragraph}{Guest->Host Notification}{Virtio Transport Options / Virtio over channel I/O / Device Operation / Guest->Host Notification} +The device MUST ignore bits 0-31 (counting from the left) of GPR2. +This aligns passing the subchannel ID with the way it is passed +for the existing I/O instructions. + +The device MAY return a 64-bit host cookie in GPR2 to speed up the +notification execution. + +\drivernormative{\paragraph}{Guest->Host Notification}{Virtio Transport Options / Virtio over channel I/O / Device Operation / Guest->Host Notification} + +For each notification, the driver SHOULD use GPR4 to pass the host cookie received in GPR2 from the previous notification.
+ +\begin{note} +For example: +\begin{lstlisting} +info->cookie = do_notify(schid, + virtqueue_get_queue_index(vq), + info->cookie); +\end{lstlisting} +\end{note} + +\subsubsection{Resetting Devices}\label{sec:Virtio Transport Options / Virtio over channel I/O / Device Operation / Resetting Devices} + +In order to reset a device, a driver sends the +CCW_CMD_VDEV_RESET command. + + +\chapter{Device Types}\label{sec:Device Types} + +On top of the queues, config space and feature negotiation facilities +built into virtio, several devices are defined. + +The following device IDs are used to identify different types of virtio +devices. Some device IDs are reserved for devices which are not currently +defined in this standard. + +Discovering what devices are available and their type is bus-dependent. + +\begin{tabular} { |l|c| } +\hline +Device ID & Virtio Device \\ +\hline \hline +0 & reserved (invalid) \\ +\hline +1 & network card \\ +\hline +2 & block device \\ +\hline +3 & console \\ +\hline +4 & entropy source \\ +\hline +5 & memory ballooning (traditional) \\ +\hline +6 & ioMemory \\ +\hline +7 & rpmsg \\ +\hline +8 & SCSI host \\ +\hline +9 & 9P transport \\ +\hline +10 & mac80211 wlan \\ +\hline +11 & rproc serial \\ +\hline +12 & virtio CAIF \\ +\hline +13 & memory balloon \\ +\hline +16 & GPU device \\ +\hline +17 & Timer/Clock device \\ +\hline +18 & Input device \\ +\hline +19 & Socket device \\ +\hline +20 & Crypto device \\ +\hline +21 & Signal Distribution Module \\ +\hline +22 & pstore device \\ +\hline +\end{tabular} + +Some of the devices above are unspecified by this document, +because they are seen as immature or especially niche. Be warned +that some are only specified by the sole existing implementation; +they could become part of a future specification, be abandoned +entirely, or live on outside this standard. We shall speak of +them no further. 
+ +\section{Network Device}\label{sec:Device Types / Network Device} + +The virtio network device is a virtual ethernet card, and is the +most complex of the devices supported so far by virtio. It has been +enhanced rapidly and demonstrates clearly how support for new +features is added to an existing device. Empty buffers are +placed in one virtqueue for receiving packets, and outgoing +packets are enqueued into another for transmission in that order. +A third command queue is used to control advanced filtering +features. + +\subsection{Device ID}\label{sec:Device Types / Network Device / Device ID} + + 1 + +\subsection{Virtqueues}\label{sec:Device Types / Network Device / Virtqueues} + +\begin{description} +\item[0] receiveq1 +\item[1] transmitq1 +\item[\ldots] +\item[2(N-1)] receiveqN +\item[2(N-1)+1] transmitqN +\item[2N] controlq +\end{description} + + N=1 if VIRTIO_NET_F_MQ is not negotiated, otherwise N is set by + \field{max_virtqueue_pairs}. + + controlq only exists if VIRTIO_NET_F_CTRL_VQ is set. + +\subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits} + +\begin{description} +\item[VIRTIO_NET_F_CSUM (0)] Device handles packets with partial checksum. This + ``checksum offload'' is a common feature on modern network cards. + +\item[VIRTIO_NET_F_GUEST_CSUM (1)] Driver handles packets with partial checksum. + +\item[VIRTIO_NET_F_CTRL_GUEST_OFFLOADS (2)] Control channel offloads + reconfiguration support. + +\item[VIRTIO_NET_F_MTU (3)] Device maximum MTU reporting is supported. If + offered by the device, device advises driver about the value of + its maximum MTU. If negotiated, the driver uses \field{mtu} as + the maximum MTU value. + +\item[VIRTIO_NET_F_MAC (5)] Device has given MAC address. + +\item[VIRTIO_NET_F_GUEST_TSO4 (7)] Driver can receive TSOv4. + +\item[VIRTIO_NET_F_GUEST_TSO6 (8)] Driver can receive TSOv6. + +\item[VIRTIO_NET_F_GUEST_ECN (9)] Driver can receive TSO with ECN.
+ +\item[VIRTIO_NET_F_GUEST_UFO (10)] Driver can receive UFO. + +\item[VIRTIO_NET_F_HOST_TSO4 (11)] Device can receive TSOv4. + +\item[VIRTIO_NET_F_HOST_TSO6 (12)] Device can receive TSOv6. + +\item[VIRTIO_NET_F_HOST_ECN (13)] Device can receive TSO with ECN. + +\item[VIRTIO_NET_F_HOST_UFO (14)] Device can receive UFO. + +\item[VIRTIO_NET_F_MRG_RXBUF (15)] Driver can merge receive buffers. + +\item[VIRTIO_NET_F_STATUS (16)] Configuration status field is + available. + +\item[VIRTIO_NET_F_CTRL_VQ (17)] Control channel is available. + +\item[VIRTIO_NET_F_CTRL_RX (18)] Control channel RX mode support. + +\item[VIRTIO_NET_F_CTRL_VLAN (19)] Control channel VLAN filtering. + +\item[VIRTIO_NET_F_GUEST_ANNOUNCE (21)] Driver can send gratuitous + packets. + +\item[VIRTIO_NET_F_MQ (22)] Device supports multiqueue with automatic + receive steering. + +\item[VIRTIO_NET_F_CTRL_MAC_ADDR (23)] Set MAC address through control + channel. +\end{description} + +\subsubsection{Feature bit requirements}\label{sec:Device Types / Network Device / Feature bits / Feature bit requirements} + +Some networking feature bits require other networking feature bits +(see \ref{drivernormative:Basic Facilities of a Virtio Device / Feature Bits}): + +\begin{description} +\item[VIRTIO_NET_F_GUEST_TSO4] Requires VIRTIO_NET_F_GUEST_CSUM. +\item[VIRTIO_NET_F_GUEST_TSO6] Requires VIRTIO_NET_F_GUEST_CSUM. +\item[VIRTIO_NET_F_GUEST_ECN] Requires VIRTIO_NET_F_GUEST_TSO4 or VIRTIO_NET_F_GUEST_TSO6. +\item[VIRTIO_NET_F_GUEST_UFO] Requires VIRTIO_NET_F_GUEST_CSUM. + +\item[VIRTIO_NET_F_HOST_TSO4] Requires VIRTIO_NET_F_CSUM. +\item[VIRTIO_NET_F_HOST_TSO6] Requires VIRTIO_NET_F_CSUM. +\item[VIRTIO_NET_F_HOST_ECN] Requires VIRTIO_NET_F_HOST_TSO4 or VIRTIO_NET_F_HOST_TSO6. +\item[VIRTIO_NET_F_HOST_UFO] Requires VIRTIO_NET_F_CSUM. + +\item[VIRTIO_NET_F_CTRL_RX] Requires VIRTIO_NET_F_CTRL_VQ. +\item[VIRTIO_NET_F_CTRL_VLAN] Requires VIRTIO_NET_F_CTRL_VQ. +\item[VIRTIO_NET_F_GUEST_ANNOUNCE] Requires VIRTIO_NET_F_CTRL_VQ.
+\item[VIRTIO_NET_F_MQ] Requires VIRTIO_NET_F_CTRL_VQ. +\item[VIRTIO_NET_F_CTRL_MAC_ADDR] Requires VIRTIO_NET_F_CTRL_VQ. +\end{description} + +\subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / Network Device / Feature bits / Legacy Interface: Feature bits} +\begin{description} +\item[VIRTIO_NET_F_GSO (6)] Device handles packets with any GSO type. +\end{description} + +This was supposed to indicate segmentation offload support, but +upon further investigation it became clear that multiple bits +were needed. + +\subsection{Device configuration layout}\label{sec:Device Types / Network Device / Device configuration layout} +\label{sec:Device Types / Block Device / Feature bits / Device configuration layout} + +Three driver-read-only configuration fields are currently defined. The \field{mac} address field +always exists (though is only valid if VIRTIO_NET_F_MAC is set), and +\field{status} only exists if VIRTIO_NET_F_STATUS is set. Two +read-only bits (for the driver) are currently defined for the status field: +VIRTIO_NET_S_LINK_UP and VIRTIO_NET_S_ANNOUNCE. + +\begin{lstlisting} +#define VIRTIO_NET_S_LINK_UP 1 +#define VIRTIO_NET_S_ANNOUNCE 2 +\end{lstlisting} + +The following driver-read-only field, \field{max_virtqueue_pairs} only exists if +VIRTIO_NET_F_MQ is set. This field specifies the maximum number +of each of transmit and receive virtqueues (receiveq1\ldots receiveqN +and transmitq1\ldots transmitqN respectively) that can be configured once VIRTIO_NET_F_MQ +is negotiated. + +The following driver-read-only field, \field{mtu} only exists if +VIRTIO_NET_F_MTU is set. This field specifies the maximum MTU for the driver to +use. 
+ +\begin{lstlisting} +struct virtio_net_config { + u8 mac[6]; + le16 status; + le16 max_virtqueue_pairs; + le16 mtu; +}; +\end{lstlisting} + +\devicenormative{\subsubsection}{Device configuration layout}{Device Types / Network Device / Device configuration layout} + +The device MUST set \field{max_virtqueue_pairs} to between 1 and 0x8000 inclusive, +if it offers VIRTIO_NET_F_MQ. + +The device MUST set \field{mtu} to between 68 and 65535 inclusive, +if it offers VIRTIO_NET_F_MTU. + +The device SHOULD set \field{mtu} to at least 1280, if it offers +VIRTIO_NET_F_MTU. + +The device MUST NOT modify \field{mtu} once it has been set. + +The device MUST NOT pass received packets that exceed \field{mtu} (plus low +level ethernet header length) size with \field{gso_type} NONE or ECN +after VIRTIO_NET_F_MTU has been successfully negotiated. + +The device MUST forward transmitted packets of up to \field{mtu} (plus low +level ethernet header length) size with \field{gso_type} NONE or ECN, and do +so without fragmentation, after VIRTIO_NET_F_MTU has been successfully +negotiated. + +\drivernormative{\subsubsection}{Device configuration layout}{Device Types / Network Device / Device configuration layout} + +A driver SHOULD negotiate VIRTIO_NET_F_MAC if the device offers it. +If the driver negotiates the VIRTIO_NET_F_MAC feature, the driver MUST set +the physical address of the NIC to \field{mac}. Otherwise, it SHOULD +use a locally-administered MAC address (see \hyperref[intro:IEEE 802]{IEEE 802}, +``9.2 48-bit universal LAN MAC addresses''). + +If the driver does not negotiate the VIRTIO_NET_F_STATUS feature, it SHOULD +assume the link is active, otherwise it SHOULD read the link status from +the bottom bit of \field{status}. + +A driver SHOULD negotiate VIRTIO_NET_F_MTU if the device offers it. 
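
The value ranges and the link-status rule stated above can be expressed as
a few small predicates. This is a non-normative sketch: the function names
are hypothetical, and the arguments are assumed to have already been read
from the configuration space and converted from little-endian to host byte
order.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_NET_S_LINK_UP 1

/* max_virtqueue_pairs is required to lie between 1 and 0x8000 inclusive. */
bool mq_pairs_valid(uint16_t max_virtqueue_pairs)
{
    return max_virtqueue_pairs >= 1 && max_virtqueue_pairs <= 0x8000;
}

/* mtu is required to lie between 68 and 65535 inclusive; the upper bound
 * always holds for a 16-bit field, so only the lower bound is tested. */
bool mtu_valid(uint16_t mtu)
{
    return mtu >= 68;
}

/* Without VIRTIO_NET_F_STATUS the link is assumed active; otherwise the
 * bottom bit of status gives the link state. */
bool link_is_up(bool status_negotiated, uint16_t status)
{
    return !status_negotiated || (status & VIRTIO_NET_S_LINK_UP) != 0;
}
```

A driver could use such checks to detect a misbehaving device before
relying on the fields.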
+ +If the driver negotiates VIRTIO_NET_F_MTU, it MUST supply enough receive +buffers to receive at least one receive packet of size \field{mtu} (plus low +level ethernet header length) with \field{gso_type} NONE or ECN. + +If the driver negotiates VIRTIO_NET_F_MTU, it MUST NOT transmit packets of +size exceeding the value of \field{mtu} (plus low level ethernet header length) +with \field{gso_type} NONE or ECN. + +\subsubsection{Legacy Interface: Device configuration layout}\label{sec:Device Types / Network Device / Device configuration layout / Legacy Interface: Device configuration layout} +\label{sec:Device Types / Block Device / Feature bits / Device configuration layout / Legacy Interface: Device configuration layout} +When using the legacy interface, transitional devices and drivers +MUST format \field{status} and +\field{max_virtqueue_pairs} in struct virtio_net_config +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + +When using the legacy interface, \field{mac} is driver-writable +which provided a way for drivers to update the MAC without +negotiating VIRTIO_NET_F_CTRL_MAC_ADDR. + +\subsection{Device Initialization}\label{sec:Device Types / Network Device / Device Initialization} + +A driver would perform a typical initialization routine like so: + +\begin{enumerate} +\item Identify and initialize the receive and + transmission virtqueues, up to N of each kind. If + VIRTIO_NET_F_MQ feature bit is negotiated, + N=\field{max_virtqueue_pairs}, otherwise identify N=1. + +\item If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, + identify the control virtqueue. + +\item Fill the receive queues with buffers: see \ref{sec:Device Types / Network Device / Device Operation / Setting Up Receive Buffers}. + +\item Even with VIRTIO_NET_F_MQ, only receiveq1, transmitq1 and + controlq are used by default. 
The driver would send the + VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command specifying the + number of the transmit and receive queues to use. + +\item If the VIRTIO_NET_F_MAC feature bit is set, the configuration + space \field{mac} entry indicates the ``physical'' address of the + network card, otherwise the driver would typically generate a random + local MAC address. + +\item If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link + status comes from the bottom bit of \field{status}. + Otherwise, the driver assumes it's active. + +\item A performant driver would indicate that it will generate checksumless + packets by negotiating the VIRTIO_NET_F_CSUM feature. + +\item If that feature is negotiated, a driver can use TCP or UDP + segmentation offload by negotiating the VIRTIO_NET_F_HOST_TSO4 (IPv4 + TCP), VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and VIRTIO_NET_F_HOST_UFO + (UDP fragmentation) features. + +\item The converse features are also available: a driver can save + the virtual device some work by negotiating these features.\note{For example, a network packet transported between two guests on +the same system might not need checksumming at all, nor segmentation, +if both guests are amenable.} + The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially + checksummed packets can be received, and if it can do that then + the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, + VIRTIO_NET_F_GUEST_UFO and VIRTIO_NET_F_GUEST_ECN are the input + equivalents of the features described above. + See \ref{sec:Device Types / Network Device / Device Operation / +Setting Up Receive Buffers}~\nameref{sec:Device Types / Network +Device / Device Operation / Setting Up Receive Buffers} and +\ref{sec:Device Types / Network Device / Device Operation / +Processing of Incoming Packets}~\nameref{sec:Device Types / +Network Device / Device Operation / Processing of Incoming Packets} below.
+\end{enumerate} + +A truly minimal driver would only accept VIRTIO_NET_F_MAC and ignore +everything else. + +\subsection{Device Operation}\label{sec:Device Types / Network Device / Device Operation} + +Packets are transmitted by placing them in the +transmitq1\ldots transmitqN, and buffers for incoming packets are +placed in the receiveq1\ldots receiveqN. In each case, the packet +itself is preceded by a header: + +\begin{lstlisting} +struct virtio_net_hdr { +#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 + u8 flags; +#define VIRTIO_NET_HDR_GSO_NONE 0 +#define VIRTIO_NET_HDR_GSO_TCPV4 1 +#define VIRTIO_NET_HDR_GSO_UDP 3 +#define VIRTIO_NET_HDR_GSO_TCPV6 4 +#define VIRTIO_NET_HDR_GSO_ECN 0x80 + u8 gso_type; + le16 hdr_len; + le16 gso_size; + le16 csum_start; + le16 csum_offset; + le16 num_buffers; +}; +\end{lstlisting} + +The controlq is used to control device features such as +filtering. + +\subsubsection{Legacy Interface: Device Operation}\label{sec:Device Types / Network Device / Device Operation / Legacy Interface: Device Operation} +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_net_hdr +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + +The legacy driver only presented \field{num_buffers} in the struct virtio_net_hdr +when VIRTIO_NET_F_MRG_RXBUF was negotiated; without that feature the +structure was 2 bytes shorter. + +When using the legacy interface, the driver SHOULD ignore the +\field{len} value in used ring entries for the transmit queues +and the controlq queue. +\begin{note} +Historically, some devices put +the total descriptor length there, even though no data was +actually written. +\end{note} + +\subsubsection{Packet Transmission}\label{sec:Device Types / Network Device / Device Operation / Packet Transmission} + +Transmitting a single packet is simple, but varies depending on +the different features the driver negotiated. 
+ +\begin{enumerate} +\item The driver can send a completely checksummed packet. In this case, + \field{flags} will be zero, and \field{gso_type} will be VIRTIO_NET_HDR_GSO_NONE. + +\item If the driver negotiated VIRTIO_NET_F_CSUM, it can skip + checksumming the packet: + \begin{itemize} + \item \field{flags} has the VIRTIO_NET_HDR_F_NEEDS_CSUM set, + + \item \field{csum_start} is set to the offset within the packet to begin checksumming, + and + + \item \field{csum_offset} indicates how many bytes after the csum_start the + new (16 bit ones' complement) checksum is placed by the device. + + \item The TCP checksum field in the packet is set to the sum + of the TCP pseudo header, so that replacing it by the ones' + complement checksum of the TCP header and body will give the + correct result. + \end{itemize} + +\begin{note} +For example, consider a partially checksummed TCP (IPv4) packet. +It will have a 14 byte ethernet header and 20 byte IP header +followed by the TCP header (with the TCP checksum field 16 bytes +into that header). \field{csum_start} will be 14+20 = 34 (the TCP +checksum includes the header), and \field{csum_offset} will be 16. +\end{note} + +\item If the driver negotiated + VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires + TCP segmentation or UDP fragmentation, then \field{gso_type} + is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6 or UDP. + (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). In this + case, packets larger than 1514 bytes can be transmitted: the + metadata indicates how to replicate the packet header to cut it + into smaller packets. The other gso fields are set: + + \begin{itemize} + \item \field{hdr_len} is a hint to the device as to how much of the header + needs to be kept to copy into each packet, usually set to the + length of the headers, including the transport header\footnote{Due to various bugs in implementations, this field is not useful +as a guarantee of the transport header size. +}. 
+ + \item \field{gso_size} is the maximum size of each packet beyond that + header (ie. MSS). + + \item If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, + the VIRTIO_NET_HDR_GSO_ECN bit in \field{gso_type} + indicates that the TCP packet has the ECN bit set\footnote{This case is not handled by some older hardware, so is called out +specifically in the protocol.}. + \end{itemize} + +\item \field{num_buffers} is set to zero. This field is unused on transmitted packets. + +\item The header and packet are added as one output descriptor to the + transmitq, and the device is notified of the new entry + (see \ref{sec:Device Types / Network Device / Device Initialization}~\nameref{sec:Device Types / Network Device / Device Initialization}). +\end{enumerate} + +\drivernormative{\paragraph}{Packet Transmission}{Device Types / Network Device / Device Operation / Packet Transmission} + +The driver MUST set \field{num_buffers} to zero. + +If VIRTIO_NET_F_CSUM is not negotiated, the driver MUST set +\field{flags} to zero and SHOULD supply a fully checksummed +packet to the device. + +If VIRTIO_NET_F_HOST_TSO4 is negotiated, the driver MAY set +\field{gso_type} to VIRTIO_NET_HDR_GSO_TCPV4 to request TCPv4 +segmentation, otherwise the driver MUST NOT set +\field{gso_type} to VIRTIO_NET_HDR_GSO_TCPV4. + +If VIRTIO_NET_F_HOST_TSO6 is negotiated, the driver MAY set +\field{gso_type} to VIRTIO_NET_HDR_GSO_TCPV6 to request TCPv6 +segmentation, otherwise the driver MUST NOT set +\field{gso_type} to VIRTIO_NET_HDR_GSO_TCPV6. + +If VIRTIO_NET_F_HOST_UFO is negotiated, the driver MAY set +\field{gso_type} to VIRTIO_NET_HDR_GSO_UDP to request UDP +segmentation, otherwise the driver MUST NOT set +\field{gso_type} to VIRTIO_NET_HDR_GSO_UDP. 
+ +The driver SHOULD NOT send to the device TCP packets requiring segmentation offload +which have the Explicit Congestion Notification bit set, unless the +VIRTIO_NET_F_HOST_ECN feature is negotiated, in which case the +driver MUST set the VIRTIO_NET_HDR_GSO_ECN bit in +\field{gso_type}. + +If the VIRTIO_NET_F_CSUM feature has been negotiated, the +driver MAY set the VIRTIO_NET_HDR_F_NEEDS_CSUM bit in +\field{flags}, if so: +\begin{enumerate} +\item the driver MUST validate the packet checksum at + offset \field{csum_offset} from \field{csum_start} as well as all + preceding offsets; +\item the driver MUST set the packet checksum stored in the + buffer to the TCP/UDP pseudo header; +\item the driver MUST set \field{csum_start} and + \field{csum_offset} such that calculating a ones' + complement checksum from \field{csum_start} up until the end of + the packet and storing the result at offset \field{csum_offset} + from \field{csum_start} will result in a fully checksummed + packet; +\end{enumerate} + +If none of the VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO options have +been negotiated, the driver MUST set \field{gso_type} to +VIRTIO_NET_HDR_GSO_NONE. + +If \field{gso_type} differs from VIRTIO_NET_HDR_GSO_NONE, then +the driver MUST also set the VIRTIO_NET_HDR_F_NEEDS_CSUM bit in +\field{flags} and MUST set \field{gso_size} to indicate the +desired MSS. + +If one of the VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO options have +been negotiated, the driver SHOULD set \field{hdr_len} to a value +not less than the length of the headers, including the transport +header. + +The driver MUST NOT set the VIRTIO_NET_HDR_F_DATA_VALID bit in +\field{flags}. + +\devicenormative{\paragraph}{Packet Transmission}{Device Types / Network Device / Device Operation / Packet Transmission} +The device MUST ignore \field{flag} bits that it does not recognize. + +If VIRTIO_NET_HDR_F_NEEDS_CSUM bit in \field{flags} is not set, the +device MUST NOT use the \field{csum_start} and \field{csum_offset}. 
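
When VIRTIO_NET_HDR_F_NEEDS_CSUM is set, the transmission rules above imply
that the checksum is completed by summing from \field{csum_start} to the
end of the packet and storing the result \field{csum_offset} bytes after
\field{csum_start}. The following non-normative sketch shows one way this
could be done. The function names are hypothetical, the packet is assumed
to be in one contiguous buffer, and the checksum field is assumed to
already hold the pseudo header sum, as the driver requirements state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit ones' complement sum over big-endian 16-bit words. */
static uint16_t ones_complement_sum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)                      /* odd trailing byte, zero padded */
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                 /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Complete a partial checksum in place. Returns 0 on success, or -1 if
 * the checksum field would fall outside the packet. */
int complete_partial_csum(uint8_t *pkt, size_t len,
                          uint16_t csum_start, uint16_t csum_offset)
{
    assert(pkt != NULL);
    size_t field = (size_t)csum_start + csum_offset;
    if (csum_start > len || field + 2 > len)
        return -1;
    /* The sum covers the pseudo header value stored in the field. */
    uint16_t csum = (uint16_t)~ones_complement_sum(pkt + csum_start,
                                                   len - csum_start);
    pkt[field] = (uint8_t)(csum >> 8);
    pkt[field + 1] = (uint8_t)(csum & 0xff);
    return 0;
}
```

For the TCP/IPv4 example given earlier, the call would use
\field{csum_start} 34 and \field{csum_offset} 16. Protocol-specific special
cases, such as the reserved zero value of the UDP checksum, are not handled
in this sketch.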
+ +If one of the VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO options have +been negotiated, the device MAY use \field{hdr_len} only as a hint about the +transport header size. +The device MUST NOT rely on \field{hdr_len} to be correct. +\begin{note} +This is due to various bugs in implementations. +\end{note} + +If VIRTIO_NET_HDR_F_NEEDS_CSUM is not set, the device MUST NOT +rely on the packet checksum being correct. +\paragraph{Packet Transmission Interrupt}\label{sec:Device Types / Network Device / Device Operation / Packet Transmission / Packet Transmission Interrupt} + +Often a driver will suppress transmission interrupts using the +VIRTQ_AVAIL_F_NO_INTERRUPT flag + (see \ref{sec:General Initialization And Device Operation / Device Operation / Receiving Used Buffers From The Device}~\nameref{sec:General Initialization And Device Operation / Device Operation / Receiving Used Buffers From The Device}) +and check for used packets in the transmit path of following +packets. + +The normal behavior in this interrupt handler is to retrieve any +new descriptors from the used ring and free the corresponding +headers and packets. + +\subsubsection{Setting Up Receive Buffers}\label{sec:Device Types / Network Device / Device Operation / Setting Up Receive Buffers} + +It is generally a good idea to keep the receive virtqueue as +fully populated as possible: if it runs out, network performance +will suffer. + +If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or +VIRTIO_NET_F_GUEST_UFO features are used, the maximum incoming packet +will be 65550 bytes long (the maximum size of a +TCP or UDP packet, plus the 14 byte ethernet header), otherwise +1514 bytes. The 12-byte struct virtio_net_hdr is prepended to this, +making for 65562 or 1526 bytes.
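
The arithmetic above can be captured in a small helper. This is a
non-normative sketch; the constant and function names are hypothetical, and
it covers only the case where VIRTIO_NET_F_MRG_RXBUF is not negotiated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
    VIRTIO_NET_HDR_LEN = 12,    /* struct virtio_net_hdr */
    ETH_FRAME_MAX      = 1514,  /* largest frame without receive offloads */
    ETH_FRAME_MAX_GSO  = 65550  /* largest frame with GUEST_TSO4/TSO6/UFO */
};

/* Compile-time check that the constants reproduce the figures in the text. */
static_assert(ETH_FRAME_MAX + VIRTIO_NET_HDR_LEN == 1526, "small buffer");
static_assert(ETH_FRAME_MAX_GSO + VIRTIO_NET_HDR_LEN == 65562, "large buffer");

/* Smallest receive buffer able to hold any incoming packet plus header.
 * guest_gso is true if any of VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6
 * or VIRTIO_NET_F_GUEST_UFO was negotiated. */
size_t rx_buf_min_len(bool guest_gso)
{
    size_t frame = guest_gso ? ETH_FRAME_MAX_GSO : ETH_FRAME_MAX;
    return frame + VIRTIO_NET_HDR_LEN;
}
```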
+ +\drivernormative{\paragraph}{Setting Up Receive Buffers}{Device Types / Network Device / Device Operation / Setting Up Receive Buffers} + +\begin{itemize} +\item If VIRTIO_NET_F_MRG_RXBUF is not negotiated: + \begin{itemize} + \item If VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or + VIRTIO_NET_F_GUEST_UFO are negotiated, the driver SHOULD populate + the receive queue(s) with buffers of at least 65562 bytes. + \item Otherwise, the driver SHOULD populate the receive queue(s) + with buffers of at least 1526 bytes. + \end{itemize} +\item If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer MUST be + greater than the size of the struct virtio_net_hdr. +\end{itemize} + +\begin{note} +Obviously each buffer can be split across multiple descriptor elements. +\end{note} + +If VIRTIO_NET_F_MQ is negotiated, each of receiveq1\ldots receiveqN +that will be used SHOULD be populated with receive buffers. + +\devicenormative{\paragraph}{Setting Up Receive Buffers}{Device Types / Network Device / Device Operation / Setting Up Receive Buffers} + +The device MUST set \field{num_buffers} to the number of descriptors used to +hold the incoming packet. + +The device MUST use only a single descriptor if VIRTIO_NET_F_MRG_RXBUF +was not negotiated.
+\begin{note} +{This means that \field{num_buffers} will always be 1 +if VIRTIO_NET_F_MRG_RXBUF is not negotiated.} +\end{note} + +\subsubsection{Processing of Incoming Packets}\label{sec:Device Types / Network Device / Device Operation / Processing of Incoming Packets} +\label{sec:Device Types / Network Device / Device Operation / Processing of Packets}%old label for latexdiff + +When a packet is copied into a buffer in the receiveq, the +optimal path is to disable further interrupts for the receiveq +(see \ref{sec:General Initialization And Device Operation / Device Operation / Receiving Used Buffers From The Device}~\nameref{sec:General Initialization And Device Operation / Device Operation / Receiving Used Buffers From The Device}) and process +packets until no more are found, then re-enable them. + +Processing incoming packets involves: + +\begin{enumerate} +\item \field{num_buffers} indicates how many descriptors + this packet is spread over (including this one): this will + always be 1 if VIRTIO_NET_F_MRG_RXBUF was not negotiated. + This allows receipt of large packets without having to allocate large + buffers. In this case, there will be at least \field{num_buffers} in + the used ring, and the device chains them together to form a + single packet. The other buffers will not begin with a struct + virtio_net_hdr. + +\item If + \field{num_buffers} is one, then the entire packet will be + contained within this buffer, immediately following the struct + virtio_net_hdr. +\item If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the + VIRTIO_NET_HDR_F_DATA_VALID bit in \field{flags} can be + set: if so, device has validated the packet checksum. + In case of multiple encapsulated protocols, one level of checksums + has been validated. 
+\end{enumerate} + +Additionally, VIRTIO_NET_F_GUEST_CSUM, TSO4, TSO6, UFO and ECN +features enable receive checksum, large receive offload and ECN +support which are the input equivalents of the transmit checksum, +transmit segmentation offloading and ECN features, as described +in \ref{sec:Device Types / Network Device / Device Operation / +Packet Transmission}: +\begin{enumerate} +\item If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the + VIRTIO_NET_HDR_F_NEEDS_CSUM bit in \field{flags} can be + set: if so, the packet checksum at offset \field{csum_offset} + from \field{csum_start} and any preceding checksums + have been validated. The checksum on the packet is incomplete and + \field{csum_start} and \field{csum_offset} indicate how to calculate + it (see Packet Transmission point 1). + +\item If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were + negotiated, then \field{gso_type} MAY be something other than + VIRTIO_NET_HDR_GSO_NONE, and \field{gso_size} field indicates the + desired MSS (see Packet Transmission point 2). +\end{enumerate} + +\devicenormative{\paragraph}{Processing of Incoming Packets}{Device Types / Network Device / Device Operation / Processing of Incoming Packets} +\label{devicenormative:Device Types / Network Device / Device Operation / Processing of Packets}%old label for latexdiff + +If VIRTIO_NET_F_MRG_RXBUF has not been negotiated, the device MUST set +\field{num_buffers} to 1. + +If VIRTIO_NET_F_MRG_RXBUF has been negotiated, the device MUST set +\field{num_buffers} to indicate the number of descriptors +the packet (including the header) is spread over. + +The device MUST use all descriptors used by a single receive +packet together, by atomically incrementing \field{idx} in the +used ring by the \field{num_buffers} value. + +If VIRTIO_NET_F_GUEST_CSUM is not negotiated, the device MUST set +\field{flags} to zero and SHOULD supply a fully checksummed +packet to the driver.
+ +If VIRTIO_NET_F_GUEST_TSO4 is not negotiated, the device MUST NOT set +\field{gso_type} to VIRTIO_NET_HDR_GSO_TCPV4. + +If VIRTIO_NET_F_GUEST_UFO is not negotiated, the device MUST NOT set +\field{gso_type} to VIRTIO_NET_HDR_GSO_UDP. + +If VIRTIO_NET_F_GUEST_TSO6 is not negotiated, the device MUST NOT set +\field{gso_type} to VIRTIO_NET_HDR_GSO_TCPV6. + +The device SHOULD NOT send to the driver TCP packets requiring segmentation offload +which have the Explicit Congestion Notification bit set, unless the +VIRTIO_NET_F_GUEST_ECN feature is negotiated, in which case the +device MUST set the VIRTIO_NET_HDR_GSO_ECN bit in +\field{gso_type}. + +If the VIRTIO_NET_F_GUEST_CSUM feature has been negotiated, the +device MAY set the VIRTIO_NET_HDR_F_NEEDS_CSUM bit in +\field{flags}, if so: +\begin{enumerate} +\item the device MUST validate the packet checksum at + offset \field{csum_offset} from \field{csum_start} as well as all + preceding offsets; +\item the device MUST set the packet checksum stored in the + receive buffer to the TCP/UDP pseudo header; +\item the device MUST set \field{csum_start} and + \field{csum_offset} such that calculating a ones' + complement checksum from \field{csum_start} up until the + end of the packet and storing the result at offset + \field{csum_offset} from \field{csum_start} will result in a + fully checksummed packet; +\end{enumerate} + +If none of the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options have +been negotiated, the device MUST set \field{gso_type} to +VIRTIO_NET_HDR_GSO_NONE. + +If \field{gso_type} differs from VIRTIO_NET_HDR_GSO_NONE, then +the device MUST also set the VIRTIO_NET_HDR_F_NEEDS_CSUM bit in +\field{flags} and MUST set \field{gso_size} to indicate the desired MSS. + +If one of the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options have +been negotiated, the device SHOULD set \field{hdr_len} to a value +not less than the length of the headers, including the transport +header.
+ +If the VIRTIO_NET_F_GUEST_CSUM feature has been negotiated, the +device MAY set the VIRTIO_NET_HDR_F_DATA_VALID bit in +\field{flags}, if so, the device MUST validate the packet +checksum (in case of multiple encapsulated protocols, one level +of checksums is validated). + +\drivernormative{\paragraph}{Processing of Incoming +Packets}{Device Types / Network Device / Device Operation / +Processing of Incoming Packets} + +The driver MUST ignore \field{flags} bits that it does not recognize. + +If VIRTIO_NET_HDR_F_NEEDS_CSUM bit in \field{flags} is not set, the +driver MUST NOT use the \field{csum_start} and \field{csum_offset}. + +If one of the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options has +been negotiated, the driver MAY use \field{hdr_len} only as a hint about the +transport header size. +The driver MUST NOT rely on \field{hdr_len} to be correct. +\begin{note} +This is due to various bugs in implementations. +\end{note} + +If neither VIRTIO_NET_HDR_F_NEEDS_CSUM nor +VIRTIO_NET_HDR_F_DATA_VALID is set, the driver MUST NOT +rely on the packet checksum being correct. +\subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue} + +The driver uses the control virtqueue (if VIRTIO_NET_F_CTRL_VQ is +negotiated) to send commands to manipulate various features of +the device which would not easily map into the configuration +space. + +All commands are of the following form: + +\begin{lstlisting} +struct virtio_net_ctrl { + u8 class; + u8 command; + u8 command-specific-data[]; + u8 ack; +}; + +/* ack values */ +#define VIRTIO_NET_OK 0 +#define VIRTIO_NET_ERR 1 +\end{lstlisting} + +The \field{class}, \field{command} and command-specific-data are set by the +driver, and the device sets the \field{ack} byte. There is little the driver can +do except issue a diagnostic if \field{ack} is not +VIRTIO_NET_OK.
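The command layout above can be sketched in C as follows. This is a minimal illustration, not a normative interface: ctrl_cmd_pack and ctrl_cmd_succeeded are hypothetical helper names, and the sketch assumes the driver-written part (\field{class}, \field{command} and command-specific-data) is serialized into one device-readable buffer, with \field{ack} living in a separate device-writable byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_NET_OK  0
#define VIRTIO_NET_ERR 1

/* Hypothetical helper: serialize class, command and command-specific-data
 * into out. Returns the number of bytes used, or 0 if out is too small. */
static size_t ctrl_cmd_pack(uint8_t *out, size_t cap,
                            uint8_t class_id, uint8_t command,
                            const void *data, size_t data_len)
{
    assert(out != NULL);
    assert(data != NULL || data_len == 0);

    if (cap < 2 || data_len > cap - 2)
        return 0;
    out[0] = class_id;
    out[1] = command;
    if (data_len)
        memcpy(out + 2, data, data_len);
    return 2 + data_len;
}

/* Hypothetical helper: interpret the ack byte written by the device. */
static int ctrl_cmd_succeeded(uint8_t ack)
{
    return ack == VIRTIO_NET_OK;
}
```

A driver would typically log a diagnostic when ctrl_cmd_succeeded reports failure, since no further recovery is defined.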
+ +\paragraph{Packet Receive Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Packet Receive Filtering} +\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Setting Promiscuous Mode}%old label for latexdiff + +If the VIRTIO_NET_F_CTRL_RX and VIRTIO_NET_F_CTRL_RX_EXTRA +features are negotiated, the driver can send control commands for +promiscuous mode, multicast, unicast and broadcast receiving. + +\begin{note} +In general, these commands are best-effort: unwanted +packets could still arrive. +\end{note} + +\begin{lstlisting} +#define VIRTIO_NET_CTRL_RX 0 + #define VIRTIO_NET_CTRL_RX_PROMISC 0 + #define VIRTIO_NET_CTRL_RX_ALLMULTI 1 + #define VIRTIO_NET_CTRL_RX_ALLUNI 2 + #define VIRTIO_NET_CTRL_RX_NOMULTI 3 + #define VIRTIO_NET_CTRL_RX_NOUNI 4 + #define VIRTIO_NET_CTRL_RX_NOBCAST 5 +\end{lstlisting} + + +\devicenormative{\subparagraph}{Packet Receive Filtering}{Device Types / Network Device / Device Operation / Control Virtqueue / Packet Receive Filtering} + +If the VIRTIO_NET_F_CTRL_RX feature has been negotiated, +the device MUST support the following VIRTIO_NET_CTRL_RX class +commands: +\begin{itemize} +\item VIRTIO_NET_CTRL_RX_PROMISC turns promiscuous mode on and +off. The command-specific-data is one byte containing 0 (off) or +1 (on). If promiscuous mode is on, the device SHOULD receive all +incoming packets. +This SHOULD take effect even if one of the other modes set by +a VIRTIO_NET_CTRL_RX class command is on. +\item VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and +off. The command-specific-data is one byte containing 0 (off) or +1 (on). When all-multicast receive is on the device SHOULD allow +all incoming multicast packets. +\end{itemize} + +If the VIRTIO_NET_F_CTRL_RX_EXTRA feature has been negotiated, +the device MUST support the following VIRTIO_NET_CTRL_RX class +commands: +\begin{itemize} +\item VIRTIO_NET_CTRL_RX_ALLUNI turns all-unicast receive on and +off.
The command-specific-data is one byte containing 0 (off) or +1 (on). When all-unicast receive is on the device SHOULD allow +all incoming unicast packets. +\item VIRTIO_NET_CTRL_RX_NOMULTI suppresses multicast receive. +The command-specific-data is one byte containing 0 (multicast +receive allowed) or 1 (multicast receive suppressed). +When multicast receive is suppressed, the device SHOULD NOT +send multicast packets to the driver. +This SHOULD take effect even if VIRTIO_NET_CTRL_RX_ALLMULTI is on. +This filter SHOULD NOT apply to broadcast packets. +\item VIRTIO_NET_CTRL_RX_NOUNI suppresses unicast receive. +The command-specific-data is one byte containing 0 (unicast +receive allowed) or 1 (unicast receive suppressed). +When unicast receive is suppressed, the device SHOULD NOT +send unicast packets to the driver. +This SHOULD take effect even if VIRTIO_NET_CTRL_RX_ALLUNI is on. +\item VIRTIO_NET_CTRL_RX_NOBCAST suppresses broadcast receive. +The command-specific-data is one byte containing 0 (broadcast +receive allowed) or 1 (broadcast receive suppressed). +When broadcast receive is suppressed, the device SHOULD NOT +send broadcast packets to the driver. +This SHOULD take effect even if VIRTIO_NET_CTRL_RX_ALLMULTI is on. +\end{itemize} + +\drivernormative{\subparagraph}{Packet Receive Filtering}{Device Types / Network Device / Device Operation / Control Virtqueue / Packet Receive Filtering} + +If the VIRTIO_NET_F_CTRL_RX feature has not been negotiated, +the driver MUST NOT issue commands VIRTIO_NET_CTRL_RX_PROMISC or +VIRTIO_NET_CTRL_RX_ALLMULTI. + +If the VIRTIO_NET_F_CTRL_RX_EXTRA feature has not been negotiated, +the driver MUST NOT issue commands + VIRTIO_NET_CTRL_RX_ALLUNI, + VIRTIO_NET_CTRL_RX_NOMULTI, + VIRTIO_NET_CTRL_RX_NOUNI or + VIRTIO_NET_CTRL_RX_NOBCAST. 
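The driver requirements above amount to a simple lookup from command to the feature that gates it. A minimal sketch follows; rx_cmd_permitted is a hypothetical helper name, and the negotiated state of VIRTIO_NET_F_CTRL_RX and VIRTIO_NET_F_CTRL_RX_EXTRA is passed in as booleans rather than read from a feature bitmap.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_NET_CTRL_RX_PROMISC  0
#define VIRTIO_NET_CTRL_RX_ALLMULTI 1
#define VIRTIO_NET_CTRL_RX_ALLUNI   2
#define VIRTIO_NET_CTRL_RX_NOMULTI  3
#define VIRTIO_NET_CTRL_RX_NOUNI    4
#define VIRTIO_NET_CTRL_RX_NOBCAST  5

/* Hypothetical helper: may the driver issue this VIRTIO_NET_CTRL_RX
 * class command, given which features were negotiated? */
static bool rx_cmd_permitted(uint8_t command, bool ctrl_rx, bool ctrl_rx_extra)
{
    switch (command) {
    case VIRTIO_NET_CTRL_RX_PROMISC:
    case VIRTIO_NET_CTRL_RX_ALLMULTI:
        return ctrl_rx;
    case VIRTIO_NET_CTRL_RX_ALLUNI:
    case VIRTIO_NET_CTRL_RX_NOMULTI:
    case VIRTIO_NET_CTRL_RX_NOUNI:
    case VIRTIO_NET_CTRL_RX_NOBCAST:
        return ctrl_rx_extra;
    default:
        return false;   /* unknown command: do not send */
    }
}
```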
+ +\paragraph{Setting MAC Address Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Setting MAC Address Filtering} + +If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can +send control commands for MAC address filtering. + +\begin{lstlisting} +struct virtio_net_ctrl_mac { + le32 entries; + u8 macs[entries][6]; +}; + +#define VIRTIO_NET_CTRL_MAC 1 + #define VIRTIO_NET_CTRL_MAC_TABLE_SET 0 + #define VIRTIO_NET_CTRL_MAC_ADDR_SET 1 +\end{lstlisting} + +The device can filter incoming packets by any number of destination +MAC addresses\footnote{Since there are no guarantees, it can use a hash filter or +silently switch to allmulti or promiscuous mode if it is given too +many addresses. +}. This table is set using the class +VIRTIO_NET_CTRL_MAC and the command VIRTIO_NET_CTRL_MAC_TABLE_SET. The +command-specific-data is two variable length tables of 6-byte MAC +addresses (as described in struct virtio_net_ctrl_mac). The first table contains unicast addresses, and the second +contains multicast addresses. + +The VIRTIO_NET_CTRL_MAC_ADDR_SET command is used to set the +default MAC address which rx filtering +accepts (and if VIRTIO_NET_F_MAC_ADDR has been negotiated, +this will be reflected in \field{mac} in config space). + +The command-specific-data for VIRTIO_NET_CTRL_MAC_ADDR_SET is +the 6-byte MAC address. + +\devicenormative{\subparagraph}{Setting MAC Address Filtering}{Device Types / Network Device / Device Operation / Control Virtqueue / Setting MAC Address Filtering} + +The device MUST have an empty MAC filtering table on reset. + +The device MUST update the MAC filtering table before it consumes +the VIRTIO_NET_CTRL_MAC_TABLE_SET command. + +The device MUST update \field{mac} in config space before it consumes +the VIRTIO_NET_CTRL_MAC_ADDR_SET command, if VIRTIO_NET_F_MAC_ADDR has +been negotiated. 
+ +The device SHOULD drop incoming packets which have a destination MAC which +matches neither the \field{mac} (or that set with VIRTIO_NET_CTRL_MAC_ADDR_SET) +nor the MAC filtering table. + +\drivernormative{\subparagraph}{Setting MAC Address Filtering}{Device Types / Network Device / Device Operation / Control Virtqueue / Setting MAC Address Filtering} + +If VIRTIO_NET_F_CTRL_RX has not been negotiated, +the driver MUST NOT issue VIRTIO_NET_CTRL_MAC class commands. + +If VIRTIO_NET_F_CTRL_RX has been negotiated, +the driver SHOULD issue VIRTIO_NET_CTRL_MAC_ADDR_SET +to set the default mac if it is different from \field{mac}. + +The driver MUST follow the VIRTIO_NET_CTRL_MAC_TABLE_SET command +by a le32 number, followed by that number of non-multicast +MAC addresses, followed by another le32 number, followed by +that number of multicast addresses. Either number MAY be 0. + +\subparagraph{Legacy Interface: Setting MAC Address Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Setting MAC Address Filtering / Legacy Interface: Setting MAC Address Filtering} +When using the legacy interface, transitional devices and drivers +MUST format \field{entries} in struct virtio_net_ctrl_mac +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + +Legacy drivers that didn't negotiate VIRTIO_NET_F_CTRL_MAC_ADDR +changed \field{mac} in config space when NIC is accepting +incoming packets. These drivers always wrote the mac value from +first to last byte, therefore after detecting such drivers, +a transitional device MAY defer MAC update, or MAY defer +processing incoming packets until driver writes the last byte +of \field{mac} in the config space. 
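For the non-legacy interface, the command-specific-data of VIRTIO_NET_CTRL_MAC_TABLE_SET described above (a le32 count and the unicast addresses, then a le32 count and the multicast addresses) could be serialized as in the following sketch. The names put_le32 and mac_table_pack are hypothetical, and the addresses are assumed to be supplied as flat arrays of 6-byte entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical helper: store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Hypothetical helper: pack the unicast table followed by the multicast
 * table. Returns the number of bytes written, or 0 if out is too small. */
static size_t mac_table_pack(uint8_t *out, size_t cap,
                             const uint8_t *uni, uint32_t n_uni,
                             const uint8_t *multi, uint32_t n_multi)
{
    size_t uni_len, multi_len, off = 0;

    assert(out != NULL);
    if (n_uni > cap / MAC_LEN || n_multi > cap / MAC_LEN)
        return 0;
    uni_len = (size_t)n_uni * MAC_LEN;
    multi_len = (size_t)n_multi * MAC_LEN;
    if (cap < 8 || uni_len > cap - 8 || multi_len > cap - 8 - uni_len)
        return 0;

    put_le32(out + off, n_uni);             /* entries of first table */
    off += 4;
    if (n_uni) {
        memcpy(out + off, uni, uni_len);
        off += uni_len;
    }
    put_le32(out + off, n_multi);           /* entries of second table */
    off += 4;
    if (n_multi) {
        memcpy(out + off, multi, multi_len);
        off += multi_len;
    }
    return off;
}
```

Either count may be 0, in which case only the 4-byte count is emitted for that table.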
+ +\paragraph{VLAN Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / VLAN Filtering} + +If the driver negotiates the VIRTIO_NET_F_CTRL_VLAN feature, it +can control a VLAN filter table in the device. + +\begin{lstlisting} +#define VIRTIO_NET_CTRL_VLAN 2 + #define VIRTIO_NET_CTRL_VLAN_ADD 0 + #define VIRTIO_NET_CTRL_VLAN_DEL 1 +\end{lstlisting} + +Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL +commands take a little-endian 16-bit VLAN id as the command-specific-data. + +\subparagraph{Legacy Interface: VLAN Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / VLAN Filtering / Legacy Interface: VLAN Filtering} +When using the legacy interface, transitional devices and drivers +MUST format the VLAN id +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + +\paragraph{Gratuitous Packet Sending}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Gratuitous Packet Sending} + +If the driver negotiates the VIRTIO_NET_F_GUEST_ANNOUNCE feature (which depends +on VIRTIO_NET_F_CTRL_VQ), the device can ask the driver to send gratuitous +packets; this is usually done after the guest has been physically +migrated, and needs to announce its presence on the new network +links. (As the hypervisor does not know the guest +network configuration (e.g. tagged vlan), it is simplest to prod +the guest in this way.) + +\begin{lstlisting} +#define VIRTIO_NET_CTRL_ANNOUNCE 3 + #define VIRTIO_NET_CTRL_ANNOUNCE_ACK 0 +\end{lstlisting} + +The driver checks the VIRTIO_NET_S_ANNOUNCE bit in the device configuration \field{status} field +when it notices a device configuration change. The +command VIRTIO_NET_CTRL_ANNOUNCE_ACK is used to indicate that the +driver has received the notification, and the device clears the +VIRTIO_NET_S_ANNOUNCE bit in \field{status}.
+ +Processing this notification involves: + +\begin{enumerate} +\item Sending the gratuitous packets (e.g. ARP) or marking that + gratuitous packets are pending and letting a deferred routine + send them. + +\item Sending the VIRTIO_NET_CTRL_ANNOUNCE_ACK command through the control + vq. +\end{enumerate} + +\drivernormative{\subparagraph}{Gratuitous Packet Sending}{Device Types / Network Device / Device Operation / Control Virtqueue / Gratuitous Packet Sending} + +If the driver negotiates VIRTIO_NET_F_GUEST_ANNOUNCE, it SHOULD notify +network peers of its new location after it sees the VIRTIO_NET_S_ANNOUNCE bit +in \field{status}. The driver MUST send a command on the command queue +with class VIRTIO_NET_CTRL_ANNOUNCE and command VIRTIO_NET_CTRL_ANNOUNCE_ACK. + +\devicenormative{\subparagraph}{Gratuitous Packet Sending}{Device Types / Network Device / Device Operation / Control Virtqueue / Gratuitous Packet Sending} + +If VIRTIO_NET_F_GUEST_ANNOUNCE is negotiated, the device MUST clear the +VIRTIO_NET_S_ANNOUNCE bit in \field{status} upon receipt of a command buffer +with class VIRTIO_NET_CTRL_ANNOUNCE and command VIRTIO_NET_CTRL_ANNOUNCE_ACK +before marking the buffer as used. + +\paragraph{Automatic receive steering in multiqueue mode}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Automatic receive steering in multiqueue mode} + +If the driver negotiates the VIRTIO_NET_F_MQ feature bit (depends +on VIRTIO_NET_F_CTRL_VQ), it MAY transmit outgoing packets on one +of the multiple transmitq1\ldots transmitqN and ask the device to +queue incoming packets into one of the multiple receiveq1\ldots receiveqN +depending on the packet flow.
+ +\begin{lstlisting} +struct virtio_net_ctrl_mq { + le16 virtqueue_pairs; +}; + +#define VIRTIO_NET_CTRL_MQ 4 + #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET 0 + #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MIN 1 + #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MAX 0x8000 +\end{lstlisting} + +Multiqueue is disabled by default. The driver enables multiqueue by +executing the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command, specifying +the number of the transmit and receive queues to be used up to +\field{max_virtqueue_pairs}; subsequently, +transmitq1\ldots transmitqn and receiveq1\ldots receiveqn where +n=\field{virtqueue_pairs} MAY be used. + +When multiqueue is enabled, the device MUST use automatic receive steering +based on packet flow. Programming of the receive steering +classifier is implicit. After the driver has transmitted a packet of a +flow on transmitqX, the device SHOULD cause incoming packets for that flow to +be steered to receiveqX. For uni-directional protocols, or where +no packets have been transmitted yet, the device MAY steer a packet +to a random queue out of the specified receiveq1\ldots receiveqn. + +Multiqueue is disabled by setting \field{virtqueue_pairs} to 1 (this is +the default) and waiting for the device to use the command buffer. + +\drivernormative{\subparagraph}{Automatic receive steering in multiqueue mode}{Device Types / Network Device / Device Operation / Control Virtqueue / Automatic receive steering in multiqueue mode} + +The driver MUST configure the virtqueues before enabling them with the +VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command. + +The driver MUST NOT request a \field{virtqueue_pairs} of 0 or +greater than \field{max_virtqueue_pairs} in the device configuration space. + +The driver MUST queue packets only on transmitq1 before the +VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command. + +The driver MUST NOT queue packets on transmit queues greater than +\field{virtqueue_pairs} once it has placed the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command in the available ring.
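The driver-side checks above can be sketched as follows. The helper names vq_pairs_valid, tx_queue_allowed and vq_pairs_pack are hypothetical; the sketch assumes the non-legacy (little-endian) encoding of \field{virtqueue_pairs} and numbers transmit queues from 1, as in transmitq1\ldots transmitqN.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: a requested virtqueue_pairs value must be non-zero
 * and must not exceed max_virtqueue_pairs from the configuration space. */
static bool vq_pairs_valid(uint16_t pairs, uint16_t max_virtqueue_pairs)
{
    return pairs >= 1 && pairs <= max_virtqueue_pairs;
}

/* Hypothetical helper: may the driver queue packets on transmitqX once
 * VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET with this pairs value has been placed
 * in the available ring? Before the command, pass pairs = 1. */
static bool tx_queue_allowed(uint16_t x, uint16_t pairs)
{
    return x >= 1 && x <= pairs;
}

/* Hypothetical helper: encode struct virtio_net_ctrl_mq as a le16. */
static void vq_pairs_pack(uint8_t out[2], uint16_t pairs)
{
    out[0] = (uint8_t)(pairs & 0xff);
    out[1] = (uint8_t)(pairs >> 8);
}
```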
+ +\devicenormative{\subparagraph}{Automatic receive steering in multiqueue mode}{Device Types / Network Device / Device Operation / Control Virtqueue / Automatic receive steering in multiqueue mode} + +The device MUST queue packets only on receiveq1 before the +VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command. + +The device MUST NOT queue packets on receive queues greater than +\field{virtqueue_pairs} once it has placed the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command in the used ring. + +\subparagraph{Legacy Interface: Automatic receive steering in multiqueue mode}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Automatic receive steering in multiqueue mode / Legacy Interface: Automatic receive steering in multiqueue mode} +When using the legacy interface, transitional devices and drivers +MUST format \field{virtqueue_pairs} +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + +\paragraph{Offloads State Configuration}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Offloads State Configuration} + +If the VIRTIO_NET_F_CTRL_GUEST_OFFLOADS feature is negotiated, the driver can +send control commands for dynamic offloads state configuration. + +\subparagraph{Setting Offloads State}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Offloads State Configuration / Setting Offloads State} + +\begin{lstlisting} +le64 offloads; + +#define VIRTIO_NET_F_GUEST_CSUM 1 +#define VIRTIO_NET_F_GUEST_TSO4 7 +#define VIRTIO_NET_F_GUEST_TSO6 8 +#define VIRTIO_NET_F_GUEST_ECN 9 +#define VIRTIO_NET_F_GUEST_UFO 10 + +#define VIRTIO_NET_CTRL_GUEST_OFFLOADS 5 + #define VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET 0 +\end{lstlisting} + +The class VIRTIO_NET_CTRL_GUEST_OFFLOADS has one command: +VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET applies the new offloads configuration.
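A sketch of how a driver might build the \field{offloads} value follows. It rests on an assumption that should be checked against the implementation in use: each offload is taken to occupy the bit position given by its feature number in the listing above. The helper names offload_bit and offloads_permitted are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_NET_F_GUEST_CSUM 1
#define VIRTIO_NET_F_GUEST_TSO4 7
#define VIRTIO_NET_F_GUEST_TSO6 8
#define VIRTIO_NET_F_GUEST_ECN  9
#define VIRTIO_NET_F_GUEST_UFO  10

/* Hypothetical helper: mask bit for one offload, assuming the bit
 * position equals the feature number. */
static uint64_t offload_bit(unsigned feature)
{
    return (uint64_t)1 << feature;
}

/* Hypothetical helper: true if every offload requested in offloads
 * corresponds to a feature present in the negotiated mask. */
static bool offloads_permitted(uint64_t offloads, uint64_t negotiated)
{
    return (offloads & ~negotiated) == 0;
}
```

An all-zero mask is always permitted under this check, since it enables nothing.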
+ +The le64 value passed as command data is a bitmask: set bits define +offloads to be enabled, cleared bits define offloads to be disabled. + +There is a corresponding device feature for each offload. Upon feature +negotiation, the corresponding offload gets enabled to preserve backward +compatibility. + +\drivernormative{\subparagraph}{Setting Offloads State}{Device Types / Network Device / Device Operation / Control Virtqueue / Offloads State Configuration / Setting Offloads State} + +A driver MUST NOT enable an offload for which the appropriate feature +has not been negotiated. + +\subparagraph{Legacy Interface: Setting Offloads State}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Offloads State Configuration / Setting Offloads State / Legacy Interface: Setting Offloads State} +When using the legacy interface, transitional devices and drivers +MUST format \field{offloads} +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + + +\subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device +Types / Network Device / Legacy Interface: Framing Requirements} + +When using legacy interfaces, transitional drivers which have not +negotiated VIRTIO_F_ANY_LAYOUT MUST use a single descriptor for the +struct virtio_net_hdr on both transmit and receive, with the +network data in the following descriptors. + +Additionally, when using the control virtqueue (see \ref{sec:Device +Types / Network Device / Device Operation / Control Virtqueue}) +, transitional drivers which have not +negotiated VIRTIO_F_ANY_LAYOUT MUST: +\begin{itemize} +\item for all commands, use a single 2-byte descriptor including the first two +fields: \field{class} and \field{command} +\item for all commands except VIRTIO_NET_CTRL_MAC_TABLE_SET +use a single descriptor including command-specific-data +with no padding.
+\item for the VIRTIO_NET_CTRL_MAC_TABLE_SET command use exactly +two descriptors including command-specific-data with no padding: +the first of these descriptors MUST include the +virtio_net_ctrl_mac table structure for the unicast addresses with no padding, +the second of these descriptors MUST include the +virtio_net_ctrl_mac table structure for the multicast addresses +with no padding. +\item for all commands, use a single 1-byte descriptor for the +\field{ack} field +\end{itemize} + +See \ref{sec:Basic +Facilities of a Virtio Device / Virtqueues / Message Framing}. + +\section{Block Device}\label{sec:Device Types / Block Device} + +The virtio block device is a simple virtual block device (ie. +disk). Read and write requests (and other exotic requests) are +placed in the queue, and serviced (probably out of order) by the +device except where noted. + +\subsection{Device ID}\label{sec:Device Types / Block Device / Device ID} + 2 + +\subsection{Virtqueues}\label{sec:Device Types / Block Device / Virtqueues} +\begin{description} +\item[0] requestq +\end{description} + +\subsection{Feature bits}\label{sec:Device Types / Block Device / Feature bits} + +\begin{description} +\item[VIRTIO_BLK_F_SIZE_MAX (1)] Maximum size of any single segment is + in \field{size_max}. + +\item[VIRTIO_BLK_F_SEG_MAX (2)] Maximum number of segments in a + request is in \field{seg_max}. + +\item[VIRTIO_BLK_F_GEOMETRY (4)] Disk-style geometry specified in + \field{geometry}. + +\item[VIRTIO_BLK_F_RO (5)] Device is read-only. + +\item[VIRTIO_BLK_F_BLK_SIZE (6)] Block size of disk is in \field{blk_size}. + +\item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. + +\item[VIRTIO_BLK_F_TOPOLOGY (10)] Device exports information on optimal I/O + alignment. + +\item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback + and writethrough modes. 
+\end{description} + +\subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits} + +\begin{description} +\item[VIRTIO_BLK_F_BARRIER (0)] Device supports request barriers. + +\item[VIRTIO_BLK_F_SCSI (7)] Device supports scsi packet commands. +\end{description} + +\begin{note} + In the legacy interface, VIRTIO_BLK_F_FLUSH was also + called VIRTIO_BLK_F_WCE. +\end{note} + +\subsection{Device configuration layout}\label{sec:Device Types / Block Device / Device configuration layout} + +The \field{capacity} of the device (expressed in 512-byte sectors) is always +present. The availability of the others all depend on various feature +bits as indicated above. + +\begin{lstlisting} +struct virtio_blk_config { + le64 capacity; + le32 size_max; + le32 seg_max; + struct virtio_blk_geometry { + le16 cylinders; + u8 heads; + u8 sectors; + } geometry; + le32 blk_size; + struct virtio_blk_topology { + // # of logical blocks per physical block (log2) + u8 physical_block_exp; + // offset of first aligned logical block + u8 alignment_offset; + // suggested minimum I/O size in blocks + le16 min_io_size; + // optimal (suggested maximum) I/O size in blocks + le32 opt_io_size; + } topology; + u8 writeback; +}; +\end{lstlisting} + + +\subsubsection{Legacy Interface: Device configuration layout}\label{sec:Device Types / Block Device / Device configuration layout / Legacy Interface: Device configuration layout} +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_blk_config +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + + +\subsection{Device Initialization}\label{sec:Device Types / Block Device / Device Initialization} + +\begin{enumerate} +\item The device size can be read from \field{capacity}. 
+ +\item If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, + \field{blk_size} can be read to determine the optimal sector size + for the driver to use. This does not affect the units used in + the protocol (always 512 bytes), but awareness of the correct + value can affect performance. + +\item If the VIRTIO_BLK_F_RO feature is set by the device, any write + requests will fail. + +\item If the VIRTIO_BLK_F_TOPOLOGY feature is negotiated, the fields in the + \field{topology} struct can be read to determine the physical block size and optimal + I/O lengths for the driver to use. This also does not affect the units + in the protocol, only performance. + +\item If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the cache + mode can be read or set through the \field{writeback} field. 0 corresponds + to a writethrough cache, 1 to a writeback cache\footnote{Consistent with + \ref{devicenormative:Device Types / Block Device / Device Operation}, + a writethrough cache can be defined broadly as a cache that commits + writes to persistent device backend storage before reporting their + completion. For example, a battery-backed writeback cache actually + counts as writethrough according to this definition.}. The cache mode + after reset can be either writeback or writethrough. The actual + mode can be determined by reading \field{writeback} after feature + negotiation. +\end{enumerate} + +\drivernormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} + +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_FLUSH if they are incapable of +sending VIRTIO_BLK_T_FLUSH commands. + +If neither VIRTIO_BLK_F_CONFIG_WCE nor VIRTIO_BLK_F_FLUSH are +negotiated, the driver MAY deduce the presence of a writethrough cache. +If VIRTIO_BLK_F_CONFIG_WCE was not negotiated but VIRTIO_BLK_F_FLUSH was, +the driver SHOULD assume presence of a writeback cache. + +The driver MUST NOT read \field{writeback} before setting +the FEATURES_OK \field{status} bit. 
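The cache-mode rules above can be condensed into a small decision function. This is only a sketch: the enum and the name assumed_cache_mode are hypothetical, the feature state is passed in as booleans, and the \field{writeback} argument is assumed to have been read after FEATURES_OK was set (it is ignored unless VIRTIO_BLK_F_CONFIG_WCE was negotiated).

```c
#include <stdbool.h>
#include <stdint.h>

enum cache_mode { CACHE_WRITETHROUGH = 0, CACHE_WRITEBACK = 1 };

/* Hypothetical helper: which cache mode should the driver assume? */
static enum cache_mode assumed_cache_mode(bool config_wce, bool flush,
                                          uint8_t writeback)
{
    if (config_wce)     /* mode is reported by the writeback field */
        return writeback ? CACHE_WRITEBACK : CACHE_WRITETHROUGH;
    if (flush)          /* FLUSH without CONFIG_WCE: assume writeback */
        return CACHE_WRITEBACK;
    return CACHE_WRITETHROUGH;  /* neither negotiated: writethrough */
}
```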
+ +\devicenormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} + +Devices SHOULD always offer VIRTIO_BLK_F_FLUSH, and MUST offer it +if they offer VIRTIO_BLK_F_CONFIG_WCE. + +If VIRTIO_BLK_F_CONFIG_WCE is negotiated but VIRTIO_BLK_F_FLUSH +is not, the device MUST initialize \field{writeback} to 0. + +\subsubsection{Legacy Interface: Device Initialization}\label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization} + +Because legacy devices do not have FEATURES_OK, transitional devices +MUST implement slightly different behavior around feature negotiation +when used through the legacy interface. In particular, when using the +legacy interface: + +\begin{itemize} +\item the driver MAY read or write \field{writeback} before setting + the DRIVER or DRIVER_OK \field{status} bit + +\item the device MUST NOT modify the cache mode (and \field{writeback}) + as a result of a driver setting a status bit, unless + the DRIVER_OK bit is being set and the driver has not set the + VIRTIO_BLK_F_CONFIG_WCE driver feature bit. + +\item the device MUST NOT modify the cache mode (and \field{writeback}) + as a result of a driver modifying the driver feature bits, for example + if the driver sets the VIRTIO_BLK_F_CONFIG_WCE driver feature bit but + does not set the VIRTIO_BLK_F_FLUSH bit. +\end{itemize} + + +\subsection{Device Operation}\label{sec:Device Types / Block Device / Device Operation} + +The driver queues requests to the virtqueue, and they are used by +the device (not necessarily in order). Each request is of form: + +\begin{lstlisting} +struct virtio_blk_req { + le32 type; + le32 reserved; + le64 sector; + u8 data[][512]; + u8 status; +}; +\end{lstlisting} + +The type of the request is either a read (VIRTIO_BLK_T_IN), a write +(VIRTIO_BLK_T_OUT), or a flush (VIRTIO_BLK_T_FLUSH). 
+ +\begin{lstlisting} +#define VIRTIO_BLK_T_IN 0 +#define VIRTIO_BLK_T_OUT 1 +#define VIRTIO_BLK_T_FLUSH 4 +\end{lstlisting} + +The \field{sector} number indicates the offset (multiplied by 512) where +the read or write is to occur. This field is unused and set to 0 +for scsi packet commands and for flush commands. + +The final \field{status} byte is written by the device: either +VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for device or driver +error or VIRTIO_BLK_S_UNSUPP for a request unsupported by the device: + +\begin{lstlisting} +#define VIRTIO_BLK_S_OK 0 +#define VIRTIO_BLK_S_IOERR 1 +#define VIRTIO_BLK_S_UNSUPP 2 +\end{lstlisting} + +\drivernormative{\subsubsection}{Device Operation}{Device Types / Block Device / Device Operation} + +A driver MUST NOT submit a request which would cause a read or write +beyond \field{capacity}. + +A driver SHOULD accept the VIRTIO_BLK_F_RO feature if offered. + +A driver MUST set \field{sector} to 0 for a VIRTIO_BLK_T_FLUSH request. +A driver SHOULD NOT include any data in a VIRTIO_BLK_T_FLUSH request. + +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY +switch to writethrough or writeback mode by writing respectively 0 and +1 to the \field{writeback} field. After writing a 0 to \field{writeback}, +the driver MUST NOT assume that any volatile writes have been committed +to persistent device backend storage. + +\devicenormative{\subsubsection}{Device Operation}{Device Types / Block Device / Device Operation} + +A device MUST set the \field{status} byte to VIRTIO_BLK_S_IOERR +for a write request if the VIRTIO_BLK_F_RO feature is offered, and MUST NOT +write any data. + +A write is considered volatile when it is submitted; the contents of +sectors covered by a volatile write are undefined in persistent device +backend storage until the write becomes stable.
A write becomes stable +once it is completed and one or more of the following conditions is true: + +\begin{enumerate} +\item\label{item:flush1} neither VIRTIO_BLK_F_CONFIG_WCE nor + VIRTIO_BLK_F_FLUSH feature were negotiated, but VIRTIO_BLK_F_FLUSH was + offered by the device; + +\item\label{item:flush2} the VIRTIO_BLK_F_CONFIG_WCE feature was negotiated and the + \field{writeback} field in configuration space was 0 \textbf{all the time between + the submission of the write and its completion}; + +\item\label{item:flush3} a VIRTIO_BLK_T_FLUSH request is sent \textbf{after the write is + completed} and is completed itself. +\end{enumerate} + +If the device is backed by persistent storage, the device MUST ensure that +stable writes are committed to it, before reporting completion of the write +(cases~\ref{item:flush1} and~\ref{item:flush2}) or the flush +(case~\ref{item:flush3}). Failure to do so can cause data loss +in case of a crash. + +If the driver changes \field{writeback} between the submission of the write +and its completion, the write could be either volatile or stable when +its completion is reported; in other words, the exact behavior is undefined. + +% According to the device requirements for device initialization: +% Offer(CONFIG_WCE) => Offer(FLUSH). +% +% After reversing the implication: +% not Offer(FLUSH) => not Offer(CONFIG_WCE). + +If VIRTIO_BLK_F_FLUSH was not offered by the + device\footnote{Note that in this case, according to + \ref{devicenormative:Device Types / Block Device / Device Initialization}, + the device will not have offered VIRTIO_BLK_F_CONFIG_WCE either.}, the +device MAY also commit writes to persistent device backend storage before +reporting their completion. Unlike case~\ref{item:flush1}, however, this +is not an absolute requirement of the specification. + +\begin{note} + An implementation that does not offer VIRTIO_BLK_F_FLUSH and does not commit + completed writes will not be resilient to data loss in case of crashes. 
+ Not offering VIRTIO_BLK_F_FLUSH is an absolute requirement + for implementations that do not wish to be safe against such data losses. +\end{note} + +\subsubsection{Legacy Interface: Device Operation}\label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation} +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_blk_req +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + +When using the legacy interface, transitional drivers +SHOULD ignore the \field{len} value in used ring entries. +\begin{note} +Historically, some devices put the total descriptor length, +or the total length of device-writable buffers there, +even when only the status byte was actually written. +\end{note} + +The \field{reserved} field was previously called \field{ioprio}. \field{ioprio} +is a hint about the relative priorities of requests to the device: +higher numbers indicate more important requests. + +\begin{lstlisting} +#define VIRTIO_BLK_T_FLUSH_OUT 5 +\end{lstlisting} + +The command VIRTIO_BLK_T_FLUSH_OUT was a synonym for VIRTIO_BLK_T_FLUSH; +a driver MUST treat it as a VIRTIO_BLK_T_FLUSH command. + +\begin{lstlisting} +#define VIRTIO_BLK_T_BARRIER 0x80000000 +\end{lstlisting} + +If the device has VIRTIO_BLK_F_BARRIER +feature the high bit (VIRTIO_BLK_T_BARRIER) indicates that this +request acts as a barrier and that all preceding requests SHOULD be +complete before this one, and all following requests SHOULD NOT be +started until this is complete. + +\begin{note} A barrier does not flush +caches in the underlying backend device in host, and thus does not +serve as data consistency guarantee. Only a VIRTIO_BLK_T_FLUSH request +does that. +\end{note} + +Some older legacy devices did not commit completed writes to persistent +device backend storage when VIRTIO_BLK_F_FLUSH was offered but not +negotiated. 
In order to work around this, the driver MAY set the +\field{writeback} to 0 (if available) or it MAY send an explicit flush +request after every completed write. + +If the device has VIRTIO_BLK_F_SCSI feature, it can also support +scsi packet command requests, each of these requests is of form: + +\begin{lstlisting} +/* All fields are in guest's native endian. */ +struct virtio_scsi_pc_req { + u32 type; + u32 ioprio; + u64 sector; + u8 cmd[]; + u8 data[][512]; +#define SCSI_SENSE_BUFFERSIZE 96 + u8 sense[SCSI_SENSE_BUFFERSIZE]; + u32 errors; + u32 data_len; + u32 sense_len; + u32 residual; + u8 status; +}; +\end{lstlisting} + +A request type can also be a scsi packet command (VIRTIO_BLK_T_SCSI_CMD or +VIRTIO_BLK_T_SCSI_CMD_OUT). The two types are equivalent, the device +does not distinguish between them: + +\begin{lstlisting} +#define VIRTIO_BLK_T_SCSI_CMD 2 +#define VIRTIO_BLK_T_SCSI_CMD_OUT 3 +\end{lstlisting} + +The \field{cmd} field is only present for scsi packet command requests, +and indicates the command to perform. This field MUST reside in a +single, separate device-readable buffer; command length can be derived +from the length of this buffer. + +Note that these first three (four for scsi packet commands) +fields are always device-readable: \field{data} is either device-readable +or device-writable, depending on the request. The size of the read or +write can be derived from the total size of the request buffers. + +\field{sense} is only present for scsi packet command requests, +and indicates the buffer for scsi sense data. + +\field{data_len} is only present for scsi packet command +requests, this field is deprecated, and SHOULD be ignored by the +driver. Historically, devices copied data length there. + +\field{sense_len} is only present for scsi packet command +requests and indicates the number of bytes actually written to +the \field{sense} buffer. 
+ +\field{residual} field is only present for scsi packet command +requests and indicates the residual size, calculated as data +length - number of bytes actually transferred. + +\subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device +Types / Block Device / Legacy Interface: Framing Requirements} + +When using legacy interfaces, transitional drivers which have not +negotiated VIRTIO_F_ANY_LAYOUT: + +\begin{itemize} +\item MUST use a single 16-byte descriptor containing \field{type}, + \field{reserved} and \field{sector}, followed by descriptors + for \field{data}, then finally a separate 1-byte descriptor + for \field{status}. + +\item For SCSI commands there are additional constraints. + \field{sense} MUST reside in a + single separate device-writable descriptor of size 96 bytes, + and \field{errors}, \field{data_len}, \field{sense_len} and + \field{residual} MUST reside in a single separate + device-writable descriptor. +\end{itemize} + +See \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing}. + +\section{Console Device}\label{sec:Device Types / Console Device} + +The virtio console device is a simple device for data input and +output. A device MAY have one or more ports. Each port has a pair +of input and output virtqueues. Moreover, a device has a pair of +control IO virtqueues. The control virtqueues are used to +communicate information between the device and the driver about +ports being opened and closed on either side of the connection, +indication from the device about whether a particular port is a +console port, adding new ports, port hot-plug/unplug, etc., and +indication from the driver about whether a port or a device was +successfully added, port open/close, etc.
For data IO, one or +more empty buffers are placed in the receive queue for incoming +data and outgoing characters are placed in the transmit queue. + +\subsection{Device ID}\label{sec:Device Types / Console Device / Device ID} + + 3 + +\subsection{Virtqueues}\label{sec:Device Types / Console Device / Virtqueues} + +\begin{description} +\item[0] receiveq(port0) +\item[1] transmitq(port0) +\item[2] control receiveq +\item[3] control transmitq +\item[4] receiveq(port1) +\item[5] transmitq(port1) +\item[\ldots] +\end{description} + +The port 0 receive and transmit queues always exist: other queues +only exist if VIRTIO_CONSOLE_F_MULTIPORT is set. + +\subsection{Feature bits}\label{sec:Device Types / Console Device / Feature bits} + +\begin{description} +\item[VIRTIO_CONSOLE_F_SIZE (0)] Configuration \field{cols} and \field{rows} + are valid. + +\item[VIRTIO_CONSOLE_F_MULTIPORT (1)] Device has support for multiple + ports; \field{max_nr_ports} is valid and control virtqueues will be used. + +\item[VIRTIO_CONSOLE_F_EMERG_WRITE (2)] Device has support for emergency write. + Configuration field emerg_wr is valid. +\end{description} + +\subsection{Device configuration layout}\label{sec:Device Types / Console Device / Device configuration layout} + + The size of the console is supplied + in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature + is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature + is set, the maximum number of ports supported by the device can + be fetched. + + If VIRTIO_CONSOLE_F_EMERG_WRITE is set then the driver can use emergency write + to output a single character without initializing virtio queues, or even + acknowledging the feature. 
+ +\begin{lstlisting} +struct virtio_console_config { + le16 cols; + le16 rows; + le32 max_nr_ports; + le32 emerg_wr; +}; +\end{lstlisting} + +\subsubsection{Legacy Interface: Device configuration layout}\label{sec:Device Types / Console Device / Device configuration layout / Legacy Interface: Device configuration layout} +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_console_config +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + +\subsection{Device Initialization}\label{sec:Device Types / Console Device / Device Initialization} + +\begin{enumerate} +\item If the VIRTIO_CONSOLE_F_EMERG_WRITE feature is offered, the + \field{emerg_wr} field of the configuration can be written at any time. + Thus it works for very early boot debugging output as well as + catastrophic OS failures (e.g. virtio ring corruption). + +\item If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver + can read the console dimensions from \field{cols} and \field{rows}. + +\item If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, the + driver can spawn multiple ports, not all of which are necessarily + attached to a console. Some could be generic ports. In this + case, the control virtqueues are enabled and according to + \field{max_nr_ports}, the appropriate number + of virtqueues are created. A control message indicating the + driver is ready is sent to the device. The device can then send + control messages for adding new ports to the device. After + creating and initializing each port, a + VIRTIO_CONSOLE_PORT_READY control message is sent to the device + for that port so the device can let the driver know of any additional + configuration options set for that port. + +\item The receiveq for each port is populated with one or more + receive buffers.
+\end{enumerate} + +\devicenormative{\subsubsection}{Device Initialization}{Device Types / Console Device / Device Initialization} + +The device MUST allow a write to \field{emerg_wr}, even on an +unconfigured device. + +The device SHOULD transmit the lower byte written to \field{emerg_wr} to +an appropriate log or output method. + +\subsection{Device Operation}\label{sec:Device Types / Console Device / Device Operation} + +\begin{enumerate} +\item For output, a buffer containing the characters is placed in + the port's transmitq\footnote{Because this is high importance and low bandwidth, the current +Linux implementation polls for the buffer to be used, rather than +waiting for an interrupt, simplifying the implementation +significantly. However, for generic serial ports with the +O_NONBLOCK flag set, the polling limitation is relaxed and the +consumed buffers are freed upon the next write or poll call or +when a port is closed or hot-unplugged. +}. + +\item When a buffer is used in the receiveq (signalled by an + interrupt), its contents are the input to the port associated + with the virtqueue for which the notification was received. + +\item If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a + configuration change interrupt indicates that the updated size can + be read from the configuration fields. This size applies to port 0 only. + +\item If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT + feature, active ports are announced by the device using the + VIRTIO_CONSOLE_PORT_ADD control message. The same message is + used for port hot-plug as well. +\end{enumerate} + +\drivernormative{\subsubsection}{Device Operation}{Device Types / Console Device / Device Operation} + +The driver MUST NOT put a device-readable buffer in a receiveq. The driver +MUST NOT put a device-writable buffer in a transmitq.
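As an illustration of the console configuration layout and the emergency write described above, the following C sketch mirrors struct virtio_console_config and shows how a device might extract the byte to log from a write to \field{emerg_wr}. The names console_config_sketch, has_feature and emerg_wr_byte are hypothetical and are not defined by this specification; fixed-width host integers stand in for the le16 and le32 types, so byte order is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Feature bit numbers as listed in the console feature bits subsection. */
#define VIRTIO_CONSOLE_F_SIZE        0
#define VIRTIO_CONSOLE_F_MULTIPORT   1
#define VIRTIO_CONSOLE_F_EMERG_WRITE 2

/* Hypothetical mirror of struct virtio_console_config.
 * uint16_t/uint32_t stand in for le16/le32 (endianness not modelled). */
struct console_config_sketch {
    uint16_t cols;          /* valid if VIRTIO_CONSOLE_F_SIZE */
    uint16_t rows;          /* valid if VIRTIO_CONSOLE_F_SIZE */
    uint32_t max_nr_ports;  /* valid if VIRTIO_CONSOLE_F_MULTIPORT */
    uint32_t emerg_wr;      /* valid if VIRTIO_CONSOLE_F_EMERG_WRITE */
};

/* Hypothetical helper: test one bit of a feature bitmap. */
static int has_feature(uint64_t features, unsigned bit)
{
    return (int)((features >> bit) & 1u);
}

/* Hypothetical helper: only the lower byte written to emerg_wr is
 * passed on to the log or output method. */
static uint8_t emerg_wr_byte(uint32_t written)
{
    return (uint8_t)(written & 0xffu);
}
```

A device model could call emerg_wr_byte() from its configuration-space write handler without checking the device status, since the write is allowed even on an unconfigured device.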
+ +\subsubsection{Multiport Device Operation}\label{sec:Device Types / Console Device / Device Operation / Multiport Device Operation} + +If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT feature, the two +control queues are used to manipulate the different console ports: the +control receiveq for messages from the device to the driver, and the +control transmitq for driver-to-device messages. The layout of the +control messages is: + +\begin{lstlisting} +struct virtio_console_control { + le32 id; /* Port number */ + le16 event; /* The kind of control event */ + le16 value; /* Extra information for the event */ +}; +\end{lstlisting} + +The values for \field{event} are: +\begin{description} +\item [VIRTIO_CONSOLE_DEVICE_READY (0)] Sent by the driver at initialization + to indicate that it is ready to receive control messages. A value of + 1 indicates success, and 0 indicates failure. The port number \field{id} is unused. +\item [VIRTIO_CONSOLE_DEVICE_ADD (1)] Sent by the device, to create a new + port. \field{value} is unused. +\item [VIRTIO_CONSOLE_DEVICE_REMOVE (2)] Sent by the device, to remove an + existing port. \field{value} is unused. +\item [VIRTIO_CONSOLE_PORT_READY (3)] Sent by the driver in response + to the device's VIRTIO_CONSOLE_PORT_ADD message, to indicate that + the port is ready to be used. A \field{value} of 1 indicates success, and 0 + indicates failure. +\item [VIRTIO_CONSOLE_CONSOLE_PORT (4)] Sent by the device to nominate + a port as a console port. There MAY be more than one console port. +\item [VIRTIO_CONSOLE_RESIZE (5)] Sent by the device to indicate + a console size change. \field{value} is unused. The buffer is followed by the number of columns and rows: +\begin{lstlisting} +struct virtio_console_resize { + le16 cols; + le16 rows; +}; +\end{lstlisting} +\item [VIRTIO_CONSOLE_PORT_OPEN (6)] This message is sent by both the + device and the driver. \field{value} indicates the state: 0 (port + closed) or 1 (port open).
This allows for ports to be used directly + by guest and host processes to communicate in an application-defined + manner. +\item [VIRTIO_CONSOLE_PORT_NAME (7)] Sent by the device to give a tag + to the port. This control command is immediately + followed by the UTF-8 name of the port for identification + within the guest (without a NUL terminator). +\end{description} + +\devicenormative{\paragraph}{Multiport Device Operation}{Device Types / Console Device / Device Operation / Multiport Device Operation} + +The device MUST NOT specify a port which exists in a +VIRTIO_CONSOLE_DEVICE_ADD message, nor a port which is equal or +greater than \field{max_nr_ports}. + +The device MUST NOT specify a port in VIRTIO_CONSOLE_DEVICE_REMOVE +which has not been created with a previous VIRTIO_CONSOLE_DEVICE_ADD. + +\drivernormative{\paragraph}{Multiport Device Operation}{Device Types / Console Device / Device Operation / Multiport Device Operation} + +The driver MUST send a VIRTIO_CONSOLE_DEVICE_READY message if +VIRTIO_CONSOLE_F_MULTIPORT is negotiated. + +Upon receipt of a VIRTIO_CONSOLE_CONSOLE_PORT message, the driver +SHOULD treat the port in a manner suitable for text console access +and MUST respond with a VIRTIO_CONSOLE_PORT_OPEN message, which MUST +have \field{value} set to 1. + +\subsubsection{Legacy Interface: Device Operation}\label{sec:Device Types / Console Device / Device Operation / Legacy Interface: Device Operation} +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_console_control +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + +When using the legacy interface, the driver SHOULD ignore the +\field{len} value in used ring entries for the transmit queues +and the control transmitq. +\begin{note} +Historically, some devices put the total descriptor length there, +even though no data was actually written. 
+\end{note} + +\subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device +Types / Console Device / Legacy Interface: Framing Requirements} + +When using legacy interfaces, transitional drivers which have not +negotiated VIRTIO_F_ANY_LAYOUT MUST use only a single +descriptor for all buffers in the control receiveq and control transmitq. + +\section{Entropy Device}\label{sec:Device Types / Entropy Device} + +The virtio entropy device supplies high-quality randomness for +guest use. + +\subsection{Device ID}\label{sec:Device Types / Entropy Device / Device ID} + 4 + +\subsection{Virtqueues}\label{sec:Device Types / Entropy Device / Virtqueues} +\begin{description} +\item[0] requestq +\end{description} + +\subsection{Feature bits}\label{sec:Device Types / Entropy Device / Feature bits} + None currently defined. + +\subsection{Device configuration layout}\label{sec:Device Types / Entropy Device / Device configuration layout} + None currently defined. + +\subsection{Device Initialization}\label{sec:Device Types / Entropy Device / Device Initialization} + +\begin{enumerate} +\item The virtqueue is initialized. +\end{enumerate} + +\subsection{Device Operation}\label{sec:Device Types / Entropy Device / Device Operation} + +When the driver requires random bytes, it places the descriptors +of one or more device-writable buffers in the queue. The device fills +each buffer, fully or partially, with random data. + +\drivernormative{\subsubsection}{Device Operation}{Device Types / Entropy Device / Device Operation} + +The driver MUST NOT place device-readable buffers into the queue. + +The driver MUST examine the length written by the device to determine +how many random bytes were received. + +\devicenormative{\subsubsection}{Device Operation}{Device Types / Entropy Device / Device Operation} + +The device MUST place one or more random bytes into the buffer, but it +MAY use less than the entire buffer length.
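Because the device may fill only part of a buffer, a driver has to honour the reported length rather than the buffer size. The following C sketch shows one way a driver could accumulate random bytes from a used buffer; the function entropy_take and its parameters are hypothetical, and the virtqueue handling that produces \field{len} is assumed to exist elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the random bytes reported by the device
 * into a driver-side pool.
 *   pool/pool_cap : destination pool and its capacity
 *   filled        : bytes already in the pool (must be <= pool_cap)
 *   buf/buf_len   : the buffer the driver made available, and its size
 *   used_len      : length written by the device, from the used ring
 * Returns the new fill level of the pool. */
static size_t entropy_take(uint8_t *pool, size_t pool_cap, size_t filled,
                           const uint8_t *buf, size_t buf_len,
                           uint32_t used_len)
{
    size_t n = used_len;

    assert(filled <= pool_cap);
    if (n > buf_len)             /* never read past the supplied buffer */
        n = buf_len;
    if (n > pool_cap - filled)   /* never overflow the pool */
        n = pool_cap - filled;
    memcpy(pool + filled, buf, n);
    return filled + n;
}
```

A driver that needs more bytes than were returned would simply make another buffer available and call the helper again when it is used.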
+ +\section{Traditional Memory Balloon Device}\label{sec:Device Types / Memory Balloon Device} + +This is the traditional balloon device. The device number 13 is +reserved for a new memory balloon interface, with different +semantics, which is expected in a future version of the standard. + +The traditional virtio memory balloon device is a primitive device for +managing guest memory: the device asks for a certain amount of +memory, and the driver supplies it (or withdraws it, if the device +has more than it asks for). This allows the guest to adapt to +changes in allowance of underlying physical memory. If the +feature is negotiated, the device can also be used to communicate +guest memory statistics to the host. + +\subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device ID} + 5 + +\subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues} +\begin{description} +\item[0] inflateq +\item[1] deflateq +\item[2] statsq +\end{description} + + Virtqueue 2 only exists if VIRTIO_BALLOON_F_STATS_VQ is set. + +\subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits} +\begin{description} +\item[VIRTIO_BALLOON_F_MUST_TELL_HOST (0)] Host has to be told before + pages from the balloon are used. + +\item[VIRTIO_BALLOON_F_STATS_VQ (1)] A virtqueue for reporting guest + memory statistics is present. +\item[VIRTIO_BALLOON_F_DEFLATE_ON_OOM (2) ] Deflate balloon on + guest out of memory condition. + +\end{description} + +\drivernormative{\subsubsection}{Feature bits}{Device Types / Memory Balloon Device / Feature bits} +The driver SHOULD accept the VIRTIO_BALLOON_F_MUST_TELL_HOST +feature if offered by the device.
+ +\devicenormative{\subsubsection}{Feature bits}{Device Types / Memory Balloon Device / Feature bits} +If the device offers the VIRTIO_BALLOON_F_MUST_TELL_HOST feature +bit, and if the driver did not accept this feature bit, the +device MAY signal failure by failing to set FEATURES_OK +\field{device status} bit when the driver writes it. +\subparagraph{Legacy Interface: Feature bits}\label{sec:Device +Types / Memory Balloon Device / Feature bits / Legacy Interface: +Feature bits} +As the legacy interface does not have a way to gracefully report feature +negotiation failure, when using the legacy interface, +transitional devices MUST support guests which do not negotiate +VIRTIO_BALLOON_F_MUST_TELL_HOST feature, and SHOULD +allow guest to use memory before notifying host if +VIRTIO_BALLOON_F_MUST_TELL_HOST is not negotiated. + +\subsection{Device configuration layout}\label{sec:Device Types / Memory Balloon Device / Device configuration layout} + Both fields of this configuration + are always available. + +\begin{lstlisting} +struct virtio_balloon_config { + le32 num_pages; + le32 actual; +}; +\end{lstlisting} + +\subparagraph{Legacy Interface: Device configuration layout}\label{sec:Device Types / Memory Balloon Device / Device +configuration layout / Legacy Interface: Device configuration layout} +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_balloon_config +according to the little-endian format. +\begin{note} +This is unlike the usual convention that legacy device fields are guest endian. +\end{note} + +\subsection{Device Initialization}\label{sec:Device Types / Memory Balloon Device / Device Initialization} + +The device initialization process is outlined below: + +\begin{enumerate} +\item The inflate and deflate virtqueues are identified. + +\item If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated: + \begin{enumerate} + \item Identify the stats virtqueue. 
+ \item Add one empty buffer to the stats virtqueue. + \item DRIVER_OK is set: device operation begins. + \item Notify the device about the stats virtqueue buffer. + \end{enumerate} +\end{enumerate} + +\subsection{Device Operation}\label{sec:Device Types / Memory Balloon Device / Device Operation} + +The device is driven either by the receipt of a configuration +change interrupt, or by changing guest memory needs, such as +performing memory compaction or responding to out of memory +conditions. + +\begin{enumerate} +\item \field{num_pages} configuration field is examined. If this is + greater than the \field{actual} number of pages, the balloon wants + more memory from the guest. If it is less than \field{actual}, + the balloon doesn't need it all. + +\item To supply memory to the balloon (aka. inflate): + \begin{enumerate} + \item The driver constructs an array of addresses of unused memory + pages. These addresses are divided by 4096\footnote{This is historical, and independent of the guest page size. +} and the descriptor + describing the resulting 32-bit array is added to the inflateq. + \end{enumerate} + +\item To remove memory from the balloon (aka. deflate): + \begin{enumerate} + \item The driver constructs an array of addresses of memory pages + it has previously given to the balloon, as described above. + This descriptor is added to the deflateq. + + \item If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is negotiated, the + guest informs the device of pages before it uses them. + + \item Otherwise, the guest is allowed to re-use pages previously + given to the balloon before the device has acknowledged their + withdrawal\footnote{In this case, deflation advice is merely a courtesy. +}. + \end{enumerate} + +\item In either case, the device acknowledges inflate and deflate +requests by using the descriptor. +\item Once the device has acknowledged the inflation or + deflation, the driver updates \field{actual} to reflect the new number of pages in the balloon. 
+\end{enumerate} + +\drivernormative{\subsubsection}{Device Operation}{Device Types / Memory Balloon Device / Device Operation} +The driver SHOULD supply pages to the balloon when \field{num_pages} is +greater than the actual number of pages in the balloon. + +The driver MAY use pages from the balloon when \field{num_pages} is +less than the actual number of pages in the balloon. + +The driver MAY supply pages to the balloon when \field{num_pages} is +greater than or equal to the actual number of pages in the balloon. + +If VIRTIO_BALLOON_F_DEFLATE_ON_OOM has not been negotiated, the +driver MUST NOT use pages from the balloon when \field{num_pages} +is less than or equal to the actual number of pages in the +balloon. + +If VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been negotiated, the +driver MAY use pages from the balloon when \field{num_pages} +is less than or equal to the actual number of pages in the +balloon if this is required for system stability +(e.g. if memory is required by applications running within + the guest). + +The driver MUST use the deflateq to inform the device of pages that it +wants to use from the balloon. + +If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is negotiated, the +driver MUST NOT use pages from the balloon until +the device has acknowledged the deflate request. + +Otherwise, if the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is not +negotiated, the driver MAY begin to re-use pages previously +given to the balloon before the device has acknowledged the +deflate request. + +In any case, the driver MUST NOT use pages from the balloon +after adding the pages to the balloon, but before the device has +acknowledged the inflate request. + +The driver MUST NOT request deflation of pages in +the balloon before the device has acknowledged the inflate +request. + +The driver MUST update \field{actual} after changing the number +of pages in the balloon. + +The driver MAY update \field{actual} once after multiple +inflate and deflate operations. 
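The inflate and deflate arrays carry page addresses divided by 4096, and the direction of the next operation follows from comparing \field{num_pages} with \field{actual}. The C sketch below illustrates both computations; balloon_pfn and balloon_delta are hypothetical helper names, and the truncation to 32 bits is an assumption about how a driver would fit the quotient into the 32-bit array entry.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses are divided by 4096 regardless of the guest page size. */
#define BALLOON_PFN_SHIFT 12

/* Hypothetical helper: convert a guest physical address into the
 * 32-bit value placed in the inflateq/deflateq array. Addresses at or
 * above 2^44 do not fit and would be truncated by the cast. */
static uint32_t balloon_pfn(uint64_t guest_phys_addr)
{
    return (uint32_t)(guest_phys_addr >> BALLOON_PFN_SHIFT);
}

/* Hypothetical helper: how far the balloon is from its target.
 *   > 0 : the driver SHOULD supply this many more pages (inflate)
 *   < 0 : the driver MAY take back this many pages (deflate)
 *   = 0 : the balloon is at the requested size */
static int64_t balloon_delta(uint32_t num_pages, uint32_t actual)
{
    return (int64_t)num_pages - (int64_t)actual;
}
```

After the device has acknowledged the corresponding request, the driver would write the new page count back to \field{actual}.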
+ +\devicenormative{\subsubsection}{Device Operation}{Device Types / Memory Balloon Device / Device Operation} + +The device MAY modify the contents of a page in the balloon +after detecting its physical number in an inflate request +and before acknowledging the inflate request by using the inflateq +descriptor. + +If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is negotiated, the +device MAY modify the contents of a page in the balloon +after detecting its physical number in an inflate request +and before detecting its physical number in a deflate request +and acknowledging the deflate request. + +\paragraph{Legacy Interface: Device Operation}\label{sec:Device +Types / Memory Balloon Device / Device Operation / Legacy +Interface: Device Operation} +When using the legacy interface, the driver SHOULD ignore the \field{len} value in used ring entries. +\begin{note} +Historically, some devices put the total descriptor length there, +even though no data was actually written. +\end{note} +When using the legacy interface, the driver MUST write out all +4 bytes each time it updates the \field{actual} value in the +configuration space, using a single atomic operation. + +When using the legacy interface, the device SHOULD NOT use the +\field{actual} value written by the driver in the configuration +space, until the last, most-significant byte of the value has been +written. +\begin{note} +Historically, devices used the \field{actual} value, even though +when using Virtio Over PCI Bus the device-specific configuration +space was not guaranteed to be atomic. Using intermediate +values during update by driver is best avoided, except for +debugging. + +Historically, drivers using Virtio Over PCI Bus wrote the +\field{actual} value by using multiple single-byte writes in +order, from the least-significant to the most-significant value. 
+\end{note} +\subsubsection{Memory Statistics}\label{sec:Device Types / Memory Balloon Device / Device Operation / Memory Statistics} + +The stats virtqueue is atypical because communication is driven +by the device (not the driver). The channel becomes active at +driver initialization time when the driver adds an empty buffer +and notifies the device. A request for memory statistics proceeds +as follows: + +\begin{enumerate} +\item The device pushes the buffer onto the used ring and sends an + interrupt. + +\item The driver pops the used buffer and discards it. + +\item The driver collects memory statistics and writes them into a + new buffer. + +\item The driver adds the buffer to the virtqueue and notifies the + device. + +\item The device pops the buffer (retaining it to initiate a + subsequent request) and consumes the statistics. +\end{enumerate} + + Within the buffer, statistics are an array of 10-byte entries. + Each statistic consists of a 16-bit + tag and a 64-bit value. All statistics are optional and the + driver chooses which ones to supply. To guarantee backwards + compatibility, devices omit unsupported statistics. + +\begin{lstlisting} +struct virtio_balloon_stat { +#define VIRTIO_BALLOON_S_SWAP_IN 0 +#define VIRTIO_BALLOON_S_SWAP_OUT 1 +#define VIRTIO_BALLOON_S_MAJFLT 2 +#define VIRTIO_BALLOON_S_MINFLT 3 +#define VIRTIO_BALLOON_S_MEMFREE 4 +#define VIRTIO_BALLOON_S_MEMTOT 5 + le16 tag; + le64 val; +} __attribute__((packed)); +\end{lstlisting} + +\drivernormative{\paragraph}{Memory Statistics}{Device Types / Memory Balloon Device / Device Operation / Memory Statistics} +Normative statements in this section apply if and only if the +VIRTIO_BALLOON_F_STATS_VQ feature has been negotiated. + +The driver MUST make at most one buffer available to the device +in the statsq, at all times. + +After initializing the device, the driver MUST make an output +buffer available in the statsq.
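A packed entry with a 16-bit tag and a 64-bit value occupies 10 bytes, so a statistics buffer is a sequence of 10-byte records. The C sketch below serializes entries byte by byte in little-endian order, which avoids relying on host byte order or on compiler packing attributes; stat_put and STAT_ENTRY_SIZE are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_BALLOON_S_MEMFREE 4
#define VIRTIO_BALLOON_S_MEMTOT  5

/* 2-byte tag + 8-byte value, no padding. */
#define STAT_ENTRY_SIZE 10u

/* Hypothetical helper: append one statistic at offset 'off' in a
 * buffer of 'cap' bytes (requires off <= cap). Returns the offset
 * after the new entry, or 'off' unchanged if the entry does not fit. */
static size_t stat_put(uint8_t *buf, size_t off, size_t cap,
                       uint16_t tag, uint64_t val)
{
    int i;

    if (cap - off < STAT_ENTRY_SIZE)
        return off;
    buf[off]     = (uint8_t)(tag & 0xffu);   /* le16 tag, low byte first */
    buf[off + 1] = (uint8_t)(tag >> 8);
    for (i = 0; i < 8; i++)                  /* le64 val, low byte first */
        buf[off + 2 + i] = (uint8_t)(val >> (8 * i));
    return off + STAT_ENTRY_SIZE;
}
```

The final offset returned after all calls is the buffer length the driver would submit to the statsq, and it is always a multiple of the entry size.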
+ +Upon detecting that the device has used a buffer in the statsq, the +driver MUST make an output buffer available in the statsq. + +Before making an output buffer available in the statsq, the +driver MUST initialize it, including one struct +virtio_balloon_stat entry for each statistic that it supports. + +The driver MUST use an output buffer size which is a multiple of 10 +bytes for all buffers submitted to the statsq. + +The driver MAY supply struct virtio_balloon_stat entries in the +output buffer submitted to the statsq in any order, without +regard to \field{tag} values. + +The driver MAY supply a subset of all statistics in the output buffer +submitted to the statsq. + +The driver MUST supply the same subset of statistics in all buffers +submitted to the statsq. + +\devicenormative{\paragraph}{Memory Statistics}{Device Types / Memory Balloon Device / Device Operation / Memory Statistics} +Normative statements in this section apply if and only if the +VIRTIO_BALLOON_F_STATS_VQ feature has been negotiated. + +Within an output buffer submitted to the statsq, +the device MUST ignore entries with \field{tag} values that it does not recognize. + +Within an output buffer submitted to the statsq, +the device MUST accept struct virtio_balloon_stat entries in any +order without regard to \field{tag} values. + +\paragraph{Legacy Interface: Memory Statistics}\label{sec:Device Types / Memory Balloon Device / Device Operation / Memory Statistics / Legacy Interface: Memory Statistics} + +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_balloon_stat +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + +When using the legacy interface, +the device SHOULD ignore all values in the first buffer in the +statsq supplied by the driver after device initialization. +\begin{note} +Historically, drivers supplied an uninitialized buffer in the +first buffer.
+\end{note} + +\subsubsection{Memory Statistics Tags}\label{sec:Device Types / Memory Balloon Device / Device Operation / Memory Statistics Tags} + +\begin{description} +\item[VIRTIO_BALLOON_S_SWAP_IN (0)] The amount of memory that has been + swapped in (in bytes). + +\item[VIRTIO_BALLOON_S_SWAP_OUT (1)] The amount of memory that has been + swapped out to disk (in bytes). + +\item[VIRTIO_BALLOON_S_MAJFLT (2)] The number of major page faults that + have occurred. + +\item[VIRTIO_BALLOON_S_MINFLT (3)] The number of minor page faults that + have occurred. + +\item[VIRTIO_BALLOON_S_MEMFREE (4)] The amount of memory not being used + for any purpose (in bytes). + +\item[VIRTIO_BALLOON_S_MEMTOT (5)] The total amount of memory available + (in bytes). +\end{description} + +\section{SCSI Host Device}\label{sec:Device Types / SCSI Host Device} + +The virtio SCSI host device groups together one or more virtual +logical units (such as disks), and allows communicating to them +using the SCSI protocol. An instance of the device represents a +SCSI host to which many targets and LUNs are attached. + +The virtio SCSI device services two kinds of requests: +\begin{itemize} +\item command requests for a logical unit; + +\item task management functions related to a logical unit, target or + command. +\end{itemize} + +The device is also able to send out notifications about added and +removed logical units. Together, these capabilities provide a +SCSI transport protocol that uses virtqueues as the transfer +medium. In the transport protocol, the virtio driver acts as the +initiator, while the virtio SCSI host provides one or more +targets that receive and process the requests. + +This section relies on definitions from \hyperref[intro:SAM]{SAM}. 
+ +\subsection{Device ID}\label{sec:Device Types / SCSI Host Device / Device ID} + 8 + +\subsection{Virtqueues}\label{sec:Device Types / SCSI Host Device / Virtqueues} + +\begin{description} +\item[0] controlq +\item[1] eventq +\item[2\ldots n] request queues +\end{description} + +\subsection{Feature bits}\label{sec:Device Types / SCSI Host Device / Feature bits} + +\begin{description} +\item[VIRTIO_SCSI_F_INOUT (0)] A single request can include both + device-readable and device-writable data buffers. + +\item[VIRTIO_SCSI_F_HOTPLUG (1)] The host SHOULD enable reporting of + hot-plug and hot-unplug events for LUNs and targets on the SCSI bus. + The guest SHOULD handle hot-plug and hot-unplug events. + +\item[VIRTIO_SCSI_F_CHANGE (2)] The host will report changes to LUN + parameters via a VIRTIO_SCSI_T_PARAM_CHANGE event; the guest + SHOULD handle them. + +\item[VIRTIO_SCSI_F_T10_PI (3)] The extended fields for T10 protection + information (DIF/DIX) are included in the SCSI request header. +\end{description} + +\subsection{Device configuration layout}\label{sec:Device Types / SCSI Host Device / Device configuration layout} + + All fields of this configuration are always available. + +\begin{lstlisting} +struct virtio_scsi_config { + le32 num_queues; + le32 seg_max; + le32 max_sectors; + le32 cmd_per_lun; + le32 event_info_size; + le32 sense_size; + le32 cdb_size; + le16 max_channel; + le16 max_target; + le32 max_lun; +}; +\end{lstlisting} + +\begin{description} +\item[\field{num_queues}] is the total number of request virtqueues exposed by + the device. The driver MAY use only one request queue, + or it can use more to achieve better performance. + +\item[\field{seg_max}] is the maximum number of segments that can be in a + command. A bidirectional command can include \field{seg_max} input + segments and \field{seg_max} output segments. + +\item[\field{max_sectors}] is a hint to the driver about the maximum transfer + size to use. 
+ +\item[\field{cmd_per_lun}] tells the driver the maximum number of + linked commands it can send to one LUN. + +\item[\field{event_info_size}] is the maximum size that the device will fill + for buffers that the driver places in the eventq. It is + written by the device depending on the set of negotiated + features. + +\item[\field{sense_size}] is the maximum size of the sense data that the + device will write. The default value is written by the device + and MUST be 96, but the driver can modify it. It is + restored to the default when the device is reset. + +\item[\field{cdb_size}] is the maximum size of the CDB that the driver will + write. The default value is written by the device and MUST + be 32, but the driver can likewise modify it. It is + restored to the default when the device is reset. + +\item[\field{max_channel}, \field{max_target} and \field{max_lun}] can be + used by the driver as hints to constrain scanning the logical units + on the host to channel/target/logical unit numbers that are less than + or equal to the value of the fields. \field{max_channel} SHOULD + be zero. \field{max_target} SHOULD be less than or equal to 255. + \field{max_lun} SHOULD be less than or equal to 16383. +\end{description} + +\drivernormative{\subsubsection}{Device configuration layout}{Device Types / SCSI Host Device / Device configuration layout} + +The driver MUST NOT write to device configuration fields other than +\field{sense_size} and \field{cdb_size}. + +The driver MUST NOT send more than \field{cmd_per_lun} linked commands +to one LUN, and MUST NOT send more than the virtqueue size number of +linked commands to one LUN. + +\devicenormative{\subsubsection}{Device configuration layout}{Device Types / SCSI Host Device / Device configuration layout} + +On reset, the device MUST set \field{sense_size} to 96 and +\field{cdb_size} to 32. 
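The configuration limits above can be turned into two small driver-side checks. The following is a minimal sketch, not part of the specification: the type `example_scsi_limits` and both function names are hypothetical, and the values are assumed to have already been converted from little-endian to host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-endian copy of the virtio_scsi_config fields used here. */
struct example_scsi_limits {
    uint32_t cmd_per_lun;
    uint16_t max_channel;
    uint16_t max_target;
    uint32_t max_lun;
};

/* Upper bound on linked commands sent to one LUN: the driver must stay
 * within both cmd_per_lun and the virtqueue size, so take the smaller. */
static uint32_t example_max_linked_cmds(const struct example_scsi_limits *cfg,
                                        uint32_t vq_size)
{
    assert(cfg);
    return cfg->cmd_per_lun < vq_size ? cfg->cmd_per_lun : vq_size;
}

/* Returns 1 if (channel, target, lun) lies within the scanning hints,
 * i.e. each number is less than or equal to the matching field. */
static int example_in_scan_range(const struct example_scsi_limits *cfg,
                                 uint32_t channel, uint32_t target,
                                 uint32_t lun)
{
    assert(cfg);
    return channel <= cfg->max_channel &&
           target <= cfg->max_target &&
           lun <= cfg->max_lun;
}
```

A driver could call `example_in_scan_range` while enumerating logical units to skip addresses outside the hinted range. The fields are hints, so a driver is free to scan differently.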
+ +\subsubsection{Legacy Interface: Device configuration layout}\label{sec:Device Types / SCSI Host Device / Device configuration layout / Legacy Interface: Device configuration layout} +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_scsi_config +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + +\devicenormative{\subsection}{Device Initialization}{Device Types / SCSI Host Device / Device Initialization} + +On initialization the driver SHOULD first discover the +device's virtqueues. + +If the driver uses the eventq, the driver SHOULD place at least one +buffer in the eventq. + +The driver MAY immediately issue requests\footnote{For example, INQUIRY +or REPORT LUNS.} or task management functions\footnote{For example, I_T +RESET.}. + +\subsection{Device Operation}\label{sec:Device Types / SCSI Host Device / Device Operation} + +Device operation consists of operating request queues, the control +queue and the event queue. + +\paragraph{Legacy Interface: Device Operation}\label{sec:Device +Types / SCSI Host Device / Device Operation / Legacy +Interface: Device Operation} +When using the legacy interface, the driver SHOULD ignore the \field{len} value in used ring entries. +\begin{note} +Historically, devices put the total descriptor length, +or the total length of device-writable buffers there, +even when only part of the buffers were actually written. +\end{note} + +\subsubsection{Device Operation: Request Queues}\label{sec:Device Types / SCSI Host Device / Device Operation / Device Operation: Request Queues} + +The driver queues requests to an arbitrary request queue, and +they are used by the device on that same queue. It is the +responsibility of the driver to ensure strict request ordering +for commands placed on different queues, because they will be +consumed with no order constraints. 
+ +Requests have the following format: + +\begin{lstlisting} +struct virtio_scsi_req_cmd { + // Device-readable part + u8 lun[8]; + le64 id; + u8 task_attr; + u8 prio; + u8 crn; + u8 cdb[cdb_size]; + // The next three fields are only present if VIRTIO_SCSI_F_T10_PI + // is negotiated. + le32 pi_bytesout; + le32 pi_bytesin; + u8 pi_out[pi_bytesout]; + u8 dataout[]; + + // Device-writable part + le32 sense_len; + le32 residual; + le16 status_qualifier; + u8 status; + u8 response; + u8 sense[sense_size]; + // The next field is only present if VIRTIO_SCSI_F_T10_PI + // is negotiated. + u8 pi_in[pi_bytesin]; + u8 datain[]; +}; + + +/* command-specific response values */ +#define VIRTIO_SCSI_S_OK 0 +#define VIRTIO_SCSI_S_OVERRUN 1 +#define VIRTIO_SCSI_S_ABORTED 2 +#define VIRTIO_SCSI_S_BAD_TARGET 3 +#define VIRTIO_SCSI_S_RESET 4 +#define VIRTIO_SCSI_S_BUSY 5 +#define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6 +#define VIRTIO_SCSI_S_TARGET_FAILURE 7 +#define VIRTIO_SCSI_S_NEXUS_FAILURE 8 +#define VIRTIO_SCSI_S_FAILURE 9 + +/* task_attr */ +#define VIRTIO_SCSI_S_SIMPLE 0 +#define VIRTIO_SCSI_S_ORDERED 1 +#define VIRTIO_SCSI_S_HEAD 2 +#define VIRTIO_SCSI_S_ACA 3 +\end{lstlisting} + +\field{lun} addresses the REPORT LUNS well-known logical unit, or +a target and logical unit in the virtio-scsi device's SCSI domain. +When used to address the REPORT LUNS logical unit, \field{lun} is 0xC1, +0x01 and six zero bytes. The virtio-scsi device SHOULD implement the +REPORT LUNS well-known logical unit. + +When used to address a target and logical unit, the only supported format +for \field{lun} is: first byte set to 1, second byte set to target, +third and fourth byte representing a single level LUN structure, followed +by four zero bytes. With this representation, a virtio-scsi device can +serve up to 256 targets and 16384 LUNs per target. The device MAY also +support having well-known logical units in the third and fourth byte. + +\field{id} is the command identifier (``tag''). 
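The two \field{lun} layouts described above can be sketched as a pair of helpers. This is illustrative only: the function names and `EXAMPLE_MAX_SINGLE_LEVEL_LUN` are hypothetical, and the sketch assumes the single level LUN is written with the SAM flat space addressing method (top two bits of the third byte set to 01b), which is one way to cover the full 0 to 16383 range.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: largest LUN representable in the 14-bit field. */
#define EXAMPLE_MAX_SINGLE_LEVEL_LUN 16383u

/* Fill lun[8] for a target and logical unit:
 * byte 0 = 1, byte 1 = target, bytes 2-3 = single level LUN,
 * bytes 4-7 = 0.
 * Assumption: flat space addressing (0x40 in the top bits of byte 2). */
static void example_encode_lun(uint8_t out[8], uint8_t target, uint16_t lun)
{
    assert(lun <= EXAMPLE_MAX_SINGLE_LEVEL_LUN);
    memset(out, 0, 8);
    out[0] = 1;
    out[1] = target;
    out[2] = (uint8_t)(0x40 | (lun >> 8));
    out[3] = (uint8_t)(lun & 0xff);
}

/* Fill lun[8] for the REPORT LUNS well-known logical unit:
 * 0xC1, 0x01 and six zero bytes. */
static void example_encode_report_luns_wlun(uint8_t out[8])
{
    memset(out, 0, 8);
    out[0] = 0xC1;
    out[1] = 0x01;
}
```

For example, target 3 and LUN 0x1234 would be encoded as the bytes 01 03 52 34 00 00 00 00 under the flat space assumption.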
+ +\field{task_attr} defines the task attribute as in the table above, but +all task attributes MAY be mapped to SIMPLE by the device. Some commands +are defined by SCSI standards as ``implicit head of queue''; for such +commands, all task attributes MAY also be mapped to HEAD OF QUEUE. +Drivers and applications SHOULD NOT send a command with the ORDERED +task attribute if the command has an implicit HEAD OF QUEUE attribute, +because whether the ORDERED task attribute is honored is vendor-specific. + +\field{crn} may also be provided by clients, but is generally expected +to be 0. The maximum CRN value defined by the protocol is 255, since +CRN is stored in an 8-bit integer. + +The CDB is included in \field{cdb} and its size, \field{cdb_size}, +is taken from the configuration space. + +All of these fields are defined in \hyperref[intro:SAM]{SAM} and are +always device-readable. + +\field{pi_bytesout} determines the size of the \field{pi_out} field +in bytes. If it is nonzero, the \field{pi_out} field contains outgoing +protection information for write operations. \field{pi_bytesin} determines +the size of the \field{pi_in} field in the device-writable section, in bytes. +All three fields are only present if VIRTIO_SCSI_F_T10_PI has been negotiated. + +The remainder of the device-readable part is the data output buffer, +\field{dataout}. + +\field{sense_len} and subsequent fields are always device-writable. \field{sense_len} +indicates the number of bytes actually written to the sense +buffer. + +\field{residual} indicates the residual size, +calculated as ``data_length - number_of_transferred_bytes'', for +read or write operations. For bidirectional commands, the +number_of_transferred_bytes includes both read and written bytes. +A \field{residual} that is less than the size of \field{datain} means that +\field{dataout} was processed entirely. 
A \field{residual} that +exceeds the size of \field{datain} means that \field{dataout} was +processed partially and \field{datain} was not processed at +all. + +If the \field{pi_bytesin} is nonzero, the \field{pi_in} field contains +incoming protection information for read operations. \field{pi_in} is +only present if VIRTIO_SCSI_F_T10_PI has been negotiated\footnote{There + is no separate residual size for \field{pi_bytesout} and + \field{pi_bytesin}. It can be computed from the \field{residual} field, + the size of the data integrity information per sector, and the sizes + of \field{pi_out}, \field{pi_in}, \field{dataout} and \field{datain}.}. + +The remainder of the device-writable part is the data input buffer, +\field{datain}. + + +\devicenormative{\paragraph}{Device Operation: Request Queues}{Device Types / SCSI Host Device / Device Operation / Device Operation: Request Queues} + +The device MUST write the \field{status} byte as the status code as +defined in \hyperref[intro:SAM]{SAM}. + +The device MUST write the \field{response} byte as one of the following: + +\begin{description} + +\item[VIRTIO_SCSI_S_OK] when the request was completed and the \field{status} + byte is filled with a SCSI status code (not necessarily + ``GOOD''). + +\item[VIRTIO_SCSI_S_OVERRUN] if the content of the CDB (such as the + allocation length, parameter length or transfer size) requires + more data than is available in the datain and dataout buffers. + +\item[VIRTIO_SCSI_S_ABORTED] if the request was cancelled due to an + ABORT TASK or ABORT TASK SET task management function. + +\item[VIRTIO_SCSI_S_BAD_TARGET] if the request was never processed + because the target indicated by \field{lun} does not exist. + +\item[VIRTIO_SCSI_S_RESET] if the request was cancelled due to a bus + or device reset (including a task management function). + +\item[VIRTIO_SCSI_S_TRANSPORT_FAILURE] if the request failed due to a + problem in the connection between the host and the target + (severed link). 
+ +\item[VIRTIO_SCSI_S_TARGET_FAILURE] if the target is suffering a + failure; this tells the driver not to retry on other paths. + +\item[VIRTIO_SCSI_S_NEXUS_FAILURE] if the nexus is suffering a failure + but retrying on other paths might yield a different result. + +\item[VIRTIO_SCSI_S_BUSY] if the request failed but retrying on the + same path is likely to work. + +\item[VIRTIO_SCSI_S_FAILURE] for any other host or driver error. In + particular, if neither \field{dataout} nor \field{datain} is empty, and the + VIRTIO_SCSI_F_INOUT feature has not been negotiated, the + request will be immediately returned with a response equal to + VIRTIO_SCSI_S_FAILURE. +\end{description} + +All commands must be completed before the virtio-scsi device is +reset or unplugged. The device MAY choose to abort them, or if +it does not do so MUST pick the VIRTIO_SCSI_S_FAILURE response. + +\drivernormative{\paragraph}{Device Operation: Request Queues}{Device Types / SCSI Host Device / Device Operation / Device Operation: Request Queues} + +\field{task_attr}, \field{prio} and \field{crn} SHOULD be zero. + +Upon receiving a VIRTIO_SCSI_S_TARGET_FAILURE response, the driver +SHOULD NOT retry the request on other paths. + +\paragraph{Legacy Interface: Device Operation: Request Queues}\label{sec:Device Types / SCSI Host Device / Device Operation / Device Operation: Request Queues / Legacy Interface: Device Operation: Request Queues} +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_scsi_req_cmd +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + +\subsubsection{Device Operation: controlq}\label{sec:Device Types / SCSI Host Device / Device Operation / Device Operation: controlq} + +The controlq is used for other SCSI transport operations. 
+Requests have the following format: + +{ +\lstset{escapechar=\$} +\begin{lstlisting} +struct virtio_scsi_ctrl { + le32 type; +$\ldots$ + u8 response; +}; + +/* response values valid for all commands */ +#define VIRTIO_SCSI_S_OK 0 +#define VIRTIO_SCSI_S_BAD_TARGET 3 +#define VIRTIO_SCSI_S_BUSY 5 +#define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6 +#define VIRTIO_SCSI_S_TARGET_FAILURE 7 +#define VIRTIO_SCSI_S_NEXUS_FAILURE 8 +#define VIRTIO_SCSI_S_FAILURE 9 +#define VIRTIO_SCSI_S_INCORRECT_LUN 12 +\end{lstlisting} +} + +The \field{type} identifies the remaining fields. + +The following commands are defined: + +\begin{itemize} +\item Task management function. +\begin{lstlisting} +#define VIRTIO_SCSI_T_TMF 0 + +#define VIRTIO_SCSI_T_TMF_ABORT_TASK 0 +#define VIRTIO_SCSI_T_TMF_ABORT_TASK_SET 1 +#define VIRTIO_SCSI_T_TMF_CLEAR_ACA 2 +#define VIRTIO_SCSI_T_TMF_CLEAR_TASK_SET 3 +#define VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET 4 +#define VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET 5 +#define VIRTIO_SCSI_T_TMF_QUERY_TASK 6 +#define VIRTIO_SCSI_T_TMF_QUERY_TASK_SET 7 + +struct virtio_scsi_ctrl_tmf +{ + // Device-readable part + le32 type; + le32 subtype; + u8 lun[8]; + le64 id; + // Device-writable part + u8 response; +}; + +/* command-specific response values */ +#define VIRTIO_SCSI_S_FUNCTION_COMPLETE 0 +#define VIRTIO_SCSI_S_FUNCTION_SUCCEEDED 10 +#define VIRTIO_SCSI_S_FUNCTION_REJECTED 11 +\end{lstlisting} + + The \field{type} is VIRTIO_SCSI_T_TMF; \field{subtype} defines which + task management function. All + fields except \field{response} are filled by the driver. + + Other fields which are irrelevant for the requested TMF + are ignored but they are still present. \field{lun} + is in the same format specified for request queues; the + single level LUN is ignored when the task management function + addresses a whole I_T nexus. When relevant, the value of \field{id} + is matched against the id values passed on the requestq. 
+ + The outcome of the task management function is written by the + device in \field{response}. The command-specific response + values map 1-to-1 with those defined in \hyperref[intro:SAM]{SAM}. + + Task management functions can affect the response value for commands that + are in the request queue and have not been completed yet. For example, + the device MUST complete all active commands on a logical unit + or target (possibly with a VIRTIO_SCSI_S_RESET response code) + upon receiving a ``logical unit reset'' or ``I_T nexus reset'' TMF. + Similarly, the device MUST complete the selected commands (possibly + with a VIRTIO_SCSI_S_ABORTED response code) upon receiving an ``abort + task'' or ``abort task set'' TMF. Such effects MUST take place before + the TMF itself is successfully completed, and the device MUST use + memory barriers appropriately in order to ensure that the driver sees + these writes in the correct order. + +\item Asynchronous notification query. +\begin{lstlisting} +#define VIRTIO_SCSI_T_AN_QUERY 1 + +struct virtio_scsi_ctrl_an { + // Device-readable part + le32 type; + u8 lun[8]; + le32 event_requested; + // Device-writable part + le32 event_actual; + u8 response; +}; + +#define VIRTIO_SCSI_EVT_ASYNC_OPERATIONAL_CHANGE 2 +#define VIRTIO_SCSI_EVT_ASYNC_POWER_MGMT 4 +#define VIRTIO_SCSI_EVT_ASYNC_EXTERNAL_REQUEST 8 +#define VIRTIO_SCSI_EVT_ASYNC_MEDIA_CHANGE 16 +#define VIRTIO_SCSI_EVT_ASYNC_MULTI_HOST 32 +#define VIRTIO_SCSI_EVT_ASYNC_DEVICE_BUSY 64 +\end{lstlisting} + + By sending this command, the driver asks the device which + events the given LUN can report, as described in paragraphs 6.6 + and A.6 of \hyperref[intro:SCSI MMC]{SCSI MMC}. The driver writes the + events it is interested in into \field{event_requested}; the device + responds by writing the events that it supports into + \field{event_actual}. + + The \field{type} is VIRTIO_SCSI_T_AN_QUERY. \field{lun} and \field{event_requested} + are written by the driver. 
\field{event_actual} and \field{response} + fields are written by the device. + + No command-specific values are defined for the \field{response} byte. + +\item Asynchronous notification subscription. +\begin{lstlisting} +#define VIRTIO_SCSI_T_AN_SUBSCRIBE 2 + +struct virtio_scsi_ctrl_an { + // Device-readable part + le32 type; + u8 lun[8]; + le32 event_requested; + // Device-writable part + le32 event_actual; + u8 response; +}; +\end{lstlisting} + + By sending this command, the driver asks the specified LUN to + report events for its physical interface, again as described in + \hyperref[intro:SCSI MMC]{SCSI MMC}. The driver writes the events it is + interested in into \field{event_requested}; the device responds by + writing the events that it supports into \field{event_actual}. + + Event types are the same as for the asynchronous notification + query message. + + The \field{type} is VIRTIO_SCSI_T_AN_SUBSCRIBE. \field{lun} and + \field{event_requested} are written by the driver. + \field{event_actual} and \field{response} are written by the device. + + No command-specific values are defined for the \field{response} byte. +\end{itemize} + +\paragraph{Legacy Interface: Device Operation: controlq}\label{sec:Device Types / SCSI Host Device / Device Operation / Device Operation: controlq / Legacy Interface: Device Operation: controlq} + +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_scsi_ctrl, struct +virtio_scsi_ctrl_tmf and struct virtio_scsi_ctrl_an +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + + +\subsubsection{Device Operation: eventq}\label{sec:Device Types / SCSI Host Device / Device Operation / Device Operation: eventq} + +The eventq is populated by the driver for the device to report information on logical +units that are attached to it. 
In general, the device will not +queue events to cope with an empty eventq, and will end up +dropping events if it finds no buffer ready. However, when +reporting events for many LUNs (e.g. when a whole target +disappears), the device can throttle events to avoid dropping +them. For this reason, placing 10-15 buffers on the event queue +is sufficient. + +Buffers returned by the device on the eventq will be referred to +as ``events'' in the rest of this section. Events have the +following format: + +\begin{lstlisting} +#define VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000 + +struct virtio_scsi_event { + // Device-writable part + le32 event; + u8 lun[8]; + le32 reason; +}; +\end{lstlisting} + +The device sets bit 31 in \field{event} to report lost events +due to missing buffers. + +The meaning of \field{reason} depends on the +contents of \field{event}. The following events are defined: + +\begin{itemize} +\item No event. +\begin{lstlisting} +#define VIRTIO_SCSI_T_NO_EVENT 0 +\end{lstlisting} + + This event is fired in the following cases: + +\begin{itemize} +\item When the device detects in the eventq a buffer that is + shorter than what is indicated in the configuration field, it + MAY use it immediately and put this dummy value in \field{event}. + A well-written driver will never observe this + situation. + +\item When events are dropped, the device MAY signal this event as + soon as the driver makes a buffer available, in order to + request action from the driver. In this case, of course, this + event will be reported with the VIRTIO_SCSI_T_EVENTS_MISSED + flag. +\end{itemize} + +\item Transport reset +\begin{lstlisting} +#define VIRTIO_SCSI_T_TRANSPORT_RESET 1 + +#define VIRTIO_SCSI_EVT_RESET_HARD 0 +#define VIRTIO_SCSI_EVT_RESET_RESCAN 1 +#define VIRTIO_SCSI_EVT_RESET_REMOVED 2 +\end{lstlisting} + + By sending this event, the device signals that a logical unit + on a target has been reset, including the case of a new device + appearing or disappearing on the bus. 
The device fills in all + fields. \field{event} is set to + VIRTIO_SCSI_T_TRANSPORT_RESET. \field{lun} addresses a + logical unit in the SCSI host. + + The \field{reason} value is one of the three \#define values appearing + above: + + \begin{description} + \item[VIRTIO_SCSI_EVT_RESET_REMOVED] (``LUN/target removed'') is used + if the target or logical unit is no longer able to receive + commands. + + \item[VIRTIO_SCSI_EVT_RESET_HARD] (``LUN hard reset'') is used if the + logical unit has been reset, but is still present. + + \item[VIRTIO_SCSI_EVT_RESET_RESCAN] (``rescan LUN/target'') is used if + a target or logical unit has just appeared on the device. + \end{description} + + The ``removed'' and ``rescan'' events can happen when the + VIRTIO_SCSI_F_HOTPLUG feature was negotiated; when sent for LUN 0, + they MAY apply to the entire target so the driver can ask the + initiator to rescan the target to detect this. + + Events will also be reported via sense codes (this obviously + does not apply to newly appeared buses or targets, since the + application has never discovered them): + + \begin{itemize} + \item ``LUN/target removed'' maps to sense key ILLEGAL REQUEST, asc + 0x25, ascq 0x00 (LOGICAL UNIT NOT SUPPORTED) + + \item ``LUN hard reset'' maps to sense key UNIT ATTENTION, asc 0x29 + (POWER ON, RESET OR BUS DEVICE RESET OCCURRED) + + \item ``rescan LUN/target'' maps to sense key UNIT ATTENTION, asc + 0x3f, ascq 0x0e (REPORTED LUNS DATA HAS CHANGED) + \end{itemize} + + The preferred way to detect transport reset is always to use + events, because sense codes are only seen by the driver when it + sends a SCSI command to the logical unit or target. However, in + case events are dropped, the initiator will still be able to + synchronize with the actual state of the controller if the + driver asks the initiator to rescan the SCSI bus. 
During the + rescan, the initiator will be able to observe the above sense + codes, and it will process them as if the driver had + received the equivalent event. + + \item Asynchronous notification +\begin{lstlisting} +#define VIRTIO_SCSI_T_ASYNC_NOTIFY 2 +\end{lstlisting} + + By sending this event, the device signals that an asynchronous + event was fired from a physical interface. + + All fields are written by the device. \field{event} is set to + VIRTIO_SCSI_T_ASYNC_NOTIFY. \field{lun} addresses a logical + unit in the SCSI host. \field{reason} is a subset of the + events that the driver has subscribed to via the ``Asynchronous + notification subscription'' command. + + \item LUN parameter change +\begin{lstlisting} +#define VIRTIO_SCSI_T_PARAM_CHANGE 3 +\end{lstlisting} + + By sending this event, the device signals a change in the configuration parameters + of a logical unit, for example the capacity or cache mode. + \field{event} is set to VIRTIO_SCSI_T_PARAM_CHANGE. + \field{lun} addresses a logical unit in the SCSI host. + + The same event SHOULD also be reported as a unit attention condition. + \field{reason} contains the additional sense code and additional sense code qualifier, + respectively in bits 0\ldots 7 and 8\ldots 15. + \begin{note} + For example, a change in capacity will be reported as asc 0x2a, ascq 0x09 + (CAPACITY DATA HAS CHANGED). + \end{note} + + For MMC devices (inquiry type 5) there would be some overlap between this + event and the asynchronous notification event, so for simplicity the host never + reports this event for MMC devices. +\end{itemize} + +\drivernormative{\paragraph}{Device Operation: eventq}{Device Types / SCSI Host Device / Device Operation / Device Operation: eventq} + +The driver SHOULD keep the eventq populated with buffers. These +buffers MUST be device-writable, and SHOULD be at least +\field{event_info_size} bytes long, and MUST be at least the size of +struct virtio_scsi_event. 
+ +If \field{event} has bit 31 set, the driver SHOULD +poll the logical units for unit attention conditions, and/or do +whatever form of bus scan is appropriate for the guest operating +system and SHOULD poll for asynchronous events manually using SCSI commands. + +When receiving a VIRTIO_SCSI_T_TRANSPORT_RESET message with +\field{reason} set to VIRTIO_SCSI_EVT_RESET_REMOVED or +VIRTIO_SCSI_EVT_RESET_RESCAN for LUN 0, the driver SHOULD ask the +initiator to rescan the target, in order to detect the case when an +entire target has appeared or disappeared. + +\devicenormative{\paragraph}{Device Operation: eventq}{Device Types / SCSI Host Device / Device Operation / Device Operation: eventq} + +The device MUST set bit 31 in \field{event} if events were lost due to +missing buffers, and it MAY use a VIRTIO_SCSI_T_NO_EVENT event to report +this. + +The device MUST NOT send VIRTIO_SCSI_T_TRANSPORT_RESET messages +with \field{reason} set to VIRTIO_SCSI_EVT_RESET_REMOVED or +VIRTIO_SCSI_EVT_RESET_RESCAN unless VIRTIO_SCSI_F_HOTPLUG was negotiated. + +The device MUST NOT report VIRTIO_SCSI_T_PARAM_CHANGE for MMC devices. + +\paragraph{Legacy Interface: Device Operation: eventq}\label{sec:Device Types / SCSI Host Device / Device Operation / Device Operation: eventq / Legacy Interface: Device Operation: eventq} +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_scsi_event +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. 
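The eventq rules above imply a small decoding step on the driver side: separate the ``events missed'' flag (bit 31) from the event type, and for a LUN parameter change split \field{reason} into its asc and ascq parts. The following is a minimal sketch under stated assumptions: `example_decoded_event` and `example_decode_event` are hypothetical names, and \field{event} and \field{reason} are assumed to be already converted to host byte order.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000u
#define VIRTIO_SCSI_T_PARAM_CHANGE  3u

/* Hypothetical host-side view of one completed eventq buffer. */
struct example_decoded_event {
    uint32_t type;          /* event with bit 31 cleared */
    int      events_missed; /* 1 if bit 31 was set, else 0 */
    uint8_t  asc;           /* only meaningful for PARAM_CHANGE */
    uint8_t  ascq;          /* only meaningful for PARAM_CHANGE */
};

static struct example_decoded_event
example_decode_event(uint32_t event, uint32_t reason)
{
    struct example_decoded_event d = {0};

    d.events_missed = (event & VIRTIO_SCSI_T_EVENTS_MISSED) != 0;
    d.type = event & ~VIRTIO_SCSI_T_EVENTS_MISSED;
    /* Postcondition: the flag bit never leaks into the type. */
    assert((d.type & VIRTIO_SCSI_T_EVENTS_MISSED) == 0);

    if (d.type == VIRTIO_SCSI_T_PARAM_CHANGE) {
        d.asc  = (uint8_t)(reason & 0xff);        /* bits 0..7  */
        d.ascq = (uint8_t)((reason >> 8) & 0xff); /* bits 8..15 */
    }
    return d;
}
```

With this sketch, the capacity-change example from the text (asc 0x2a, ascq 0x09) corresponds to a \field{reason} value of 0x092a. When `events_missed` is set, the driver would go on to poll or rescan as described in the driver requirements.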
+ +\subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device +Types / SCSI Host Device / Legacy Interface: Framing Requirements} + +When using legacy interfaces, transitional drivers which have not +negotiated VIRTIO_F_ANY_LAYOUT MUST use a single descriptor for the +\field{lun}, \field{id}, \field{task_attr}, \field{prio}, +\field{crn} and \field{cdb} fields, and MUST only use a single +descriptor for the \field{sense_len}, \field{residual}, +\field{status_qualifier}, \field{status}, \field{response} and +\field{sense} fields. + +\chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits} + +Currently these device-independent feature bits are defined: + +\begin{description} + \item[VIRTIO_F_RING_INDIRECT_DESC (28)] Negotiating this feature indicates + that the driver can use descriptors with the VIRTQ_DESC_F_INDIRECT + flag set, as described in \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors}~\nameref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors}. + + \item[VIRTIO_F_RING_EVENT_IDX (29)] This feature enables the \field{used_event} + and the \field{avail_event} fields as described in \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression} and \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring}. + + \item[VIRTIO_F_VERSION_1 (32)] This indicates compliance with this + specification, giving a simple way to detect legacy devices or drivers. + + \item[VIRTIO_F_IOMMU_PLATFORM (33)] This feature indicates that the device is + behind an IOMMU that translates bus addresses from the device into physical + addresses in memory. If this feature bit is set to 0, then the device emits + physical addresses which are not translated further, even though an IOMMU + may be present. 
+\end{description} + +\drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits} + +A driver MUST accept VIRTIO_F_VERSION_1 if it is offered. A driver +MAY fail to operate further if VIRTIO_F_VERSION_1 is not offered. + +A driver SHOULD accept VIRTIO_F_IOMMU_PLATFORM if it is offered, and it MUST +then either disable the IOMMU or configure the IOMMU to translate bus addresses +passed to the device into physical addresses in memory. If +VIRTIO_F_IOMMU_PLATFORM is not offered, then a driver MUST pass only physical +addresses to the device. + +\devicenormative{\section}{Reserved Feature Bits}{Reserved Feature Bits} + +A device MUST offer VIRTIO_F_VERSION_1. A device MAY fail to operate further +if VIRTIO_F_VERSION_1 is not accepted. + +A device SHOULD offer VIRTIO_F_IOMMU_PLATFORM if it is behind an IOMMU that +translates bus addresses from the device into physical addresses in memory. +A device MAY fail to operate further if VIRTIO_F_IOMMU_PLATFORM is not +accepted. + +\section{Legacy Interface: Reserved Feature Bits}\label{sec:Reserved Feature Bits / Legacy Interface: Reserved Feature Bits} + +Transitional devices MAY offer the following: +\begin{description} +\item[VIRTIO_F_NOTIFY_ON_EMPTY (24)] If this feature + has been negotiated by the driver, the device MUST issue + an interrupt if the device runs + out of available descriptors on a virtqueue, even though + interrupts are suppressed using the VIRTQ_AVAIL_F_NO_INTERRUPT + flag or the \field{used_event} field. +\begin{note} + An example of a driver using this feature is the legacy + networking driver: it doesn't need to know every time a packet + is transmitted, but it does need to free the transmitted + packets a finite time after they are transmitted. It can avoid + using a timer if the device interrupts it when all the packets + are transmitted. 
+\end{note} +\end{description} + +Transitional devices MUST offer, and if offered by the device +transitional drivers MUST accept the following: +\begin{description} +\item[VIRTIO_F_ANY_LAYOUT (27)] This feature indicates that the device + accepts arbitrary descriptor layouts, as described in Section + \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing / Legacy Interface: Message Framing}~\nameref{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing / Legacy Interface: Message Framing}. + +\item[UNUSED (30)] Bit 30 is used by qemu's implementation to check + for experimental early versions of virtio which did not perform + correct feature negotiation, and SHOULD NOT be negotiated. +\end{description}
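The reserved feature bit rules in this chapter can be illustrated with a driver-side selection helper. This is a sketch only: `example_select_features`, `EXAMPLE_FEATURE_BIT` and `EXAMPLE_F_UNUSED` are hypothetical names, and the sketch assumes the offered and supported feature sets are held as 64-bit masks, which is needed because VIRTIO_F_VERSION_1 is bit 32.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_F_RING_INDIRECT_DESC 28
#define VIRTIO_F_RING_EVENT_IDX     29
#define VIRTIO_F_VERSION_1          32
#define VIRTIO_F_IOMMU_PLATFORM     33
#define EXAMPLE_F_UNUSED            30 /* hypothetical name for bit 30 */

/* 64-bit mask with only bit n set. */
#define EXAMPLE_FEATURE_BIT(n) (UINT64_C(1) << (n))

/* Pick the features a driver accepts from those the device offers.
 * 'supported' is the set of features the driver implements. */
static uint64_t example_select_features(uint64_t offered, uint64_t supported)
{
    uint64_t accepted = offered & supported;

    /* A driver MUST accept VIRTIO_F_VERSION_1 if it is offered. */
    accepted |= offered & EXAMPLE_FEATURE_BIT(VIRTIO_F_VERSION_1);

    /* Bit 30 SHOULD NOT be negotiated. */
    accepted &= ~EXAMPLE_FEATURE_BIT(EXAMPLE_F_UNUSED);

    /* Postcondition: never accept a feature that was not offered. */
    assert((accepted & ~offered) == 0);
    return accepted;
}
```

A real driver would also decide what to do when VIRTIO_F_VERSION_1 is not offered (it MAY fail to operate further), which this sketch leaves to the caller.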