author | Michael S. Tsirkin <mst@redhat.com> | 2018-03-09 23:23:30 +0200 |
---|---|---|
committer | Michael S. Tsirkin <mst@redhat.com> | 2018-03-20 02:29:10 +0200 |
commit | 6131f8d9477b8ac4930f56cb936731675fe416e3 (patch) | |
tree | 0510463e79ff8f0468cd4eb26a24e89980e7da5d /split-ring.tex | |
parent | 2c0ec7d7065b4249cf6fa53420458aa9da604338 (diff) |
content: move ring text out to a separate file
Will be easier to manage this way.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Approved-by: https://www.oasis-open.org/apps/org/workgroup/virtio/ballot.php?id=3177
Fixes: https://github.com/oasis-tcs/virtio-spec/issues/3
Diffstat (limited to 'split-ring.tex')
-rw-r--r-- | split-ring.tex | 498 |
1 files changed, 498 insertions, 0 deletions
diff --git a/split-ring.tex b/split-ring.tex new file mode 100644 index 0000000..418f63d --- /dev/null +++ b/split-ring.tex @@ -0,0 +1,498 @@ +\section{Split Virtqueues}\label{sec:Basic Facilities of a Virtio Device / Split Virtqueues} +The split virtqueue format is the original format used by legacy +virtio devices. The split virtqueue format separates the +virtqueue into several parts, where each part is write-able by +either the driver or the device, but not both. Multiple +locations need to be updated when making a buffer available +and when marking it as used. + + +Each queue has a 16-bit queue size +parameter, which sets the number of entries and implies the total size +of the queue. + +Each virtqueue consists of three parts: + +\begin{itemize} +\item Descriptor Table +\item Available Ring +\item Used Ring +\end{itemize} + +where each part is physically-contiguous in guest memory, +and has different alignment requirements. + +The memory aligment and size requirements, in bytes, of each part of the +virtqueue are summarized in the following table: + +\begin{tabular}{|l|l|l|} +\hline +Virtqueue Part & Alignment & Size \\ +\hline \hline +Descriptor Table & 16 & $16 * $(Queue Size) \\ +\hline +Available Ring & 2 & $6 + 2 * $(Queue Size) \\ + \hline +Used Ring & 4 & $6 + 8 * $(Queue Size) \\ + \hline +\end{tabular} + +The Alignment column gives the minimum alignment for each part +of the virtqueue. + +The Size column gives the total number of bytes for each +part of the virtqueue. + +Queue Size corresponds to the maximum number of buffers in the +virtqueue\footnote{For example, if Queue Size is 4 then at most 4 buffers +can be queued at any given time.}. Queue Size value is always a +power of 2. The maximum Queue Size value is 32768. This value +is specified in a bus-specific way. 
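As an illustration of the alignment and size table in the added text, the following sketch computes the byte size of each virtqueue part and checks a candidate Queue Size and address. The helper names are hypothetical and not part of the patch; only the formulas, the power-of-2 rule and the 32768 limit come from the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Sizes in bytes, taken from the table: 16*qsz, 6 + 2*qsz, 6 + 8*qsz. */
static inline size_t desc_table_bytes(uint16_t qsz) { return 16u * (size_t)qsz; }
static inline size_t avail_ring_bytes(uint16_t qsz) { return 6u + 2u * (size_t)qsz; }
static inline size_t used_ring_bytes(uint16_t qsz)  { return 6u + 8u * (size_t)qsz; }

/* Queue Size is always a power of 2 and at most 32768.
 * Zero is rejected here because it is not a power of 2. */
static inline int queue_size_is_valid(uint32_t qsz)
{
    return qsz != 0 && qsz <= 32768u && (qsz & (qsz - 1u)) == 0;
}

/* True if addr is a multiple of align; align is 16, 2 or 4 in the table,
 * so the power-of-2 mask trick applies. */
static inline int is_aligned(uint64_t addr, uint64_t align)
{
    return (addr & (align - 1u)) == 0;
}
```

With a Queue Size of 256, for instance, the three parts occupy 4096, 518 and 2054 bytes respectively.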
+ +When the driver wants to send a buffer to the device, it fills in +a slot in the descriptor table (or chains several together), and +writes the descriptor index into the available ring. It then +notifies the device. When the device has finished a buffer, it +writes the descriptor index into the used ring, and sends an interrupt. + +\drivernormative{\subsection}{Virtqueues}{Basic Facilities of a Virtio Device / Virtqueues} +The driver MUST ensure that the physical address of the first byte +of each virtqueue part is a multiple of the specified alignment value +in the above table. + +\subsection{Legacy Interfaces: A Note on Virtqueue Layout}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Legacy Interfaces: A Note on Virtqueue Layout} + +For Legacy Interfaces, several additional +restrictions are placed on the virtqueue layout: + +Each virtqueue occupies two or more physically-contiguous pages +(usually defined as 4096 bytes, but depending on the transport; +henceforth referred to as Queue Align) +and consists of three parts: + +\begin{tabular}{|l|l|l|} +\hline +Descriptor Table & Available Ring (\ldots padding\ldots) & Used Ring \\ +\hline +\end{tabular} + +The bus-specific Queue Size field controls the total number of bytes +for the virtqueue. +When using the legacy interface, the transitional +driver MUST retrieve the Queue Size field from the device +and MUST allocate the total number of bytes for the virtqueue +according to the following formula (Queue Align given in qalign and +Queue Size given in qsz): + +\begin{lstlisting} +#define ALIGN(x) (((x) + qalign) & ~qalign) +static inline unsigned virtq_size(unsigned int qsz) +{ + return ALIGN(sizeof(struct virtq_desc)*qsz + sizeof(u16)*(3 + qsz)) + + ALIGN(sizeof(u16)*3 + sizeof(struct virtq_used_elem)*qsz); +} +\end{lstlisting} + +This wastes some space with padding. 
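The `ALIGN` macro in the listing above adds and masks with `qalign` directly, which rounds up to a boundary only if `qalign` holds the alignment minus one. The sketch below is one possible reading, not the patch's own code: it assumes Queue Align is passed as a power-of-two byte count such as 4096, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Round x up to the next multiple of qalign.
 * Assumption: qalign is a power of two given in bytes (for example 4096). */
static inline size_t align_up(size_t x, size_t qalign)
{
    return (x + qalign - 1) & ~(qalign - 1);
}

/* Bytes taken by the descriptor table plus the available ring:
 * 16 bytes per descriptor, then (3 + qsz) 16-bit fields. */
static inline size_t legacy_first_part(size_t qsz)
{
    return 16 * qsz + sizeof(uint16_t) * (3 + qsz);
}

/* Offset of the used ring from the start of the virtqueue. */
static inline size_t legacy_used_offset(size_t qsz, size_t qalign)
{
    return align_up(legacy_first_part(qsz), qalign);
}

/* Total bytes to allocate for a legacy virtqueue:
 * three 16-bit fields plus 8 bytes per used element, also rounded up. */
static inline size_t legacy_virtq_size(size_t qsz, size_t qalign)
{
    size_t used = sizeof(uint16_t) * 3 + 8 * qsz;
    return legacy_used_offset(qsz, qalign) + align_up(used, qalign);
}
```

Under these assumptions a queue of 256 entries with 4096-byte alignment needs 12288 bytes, with the used ring starting at offset 8192; the 3578 bytes between the available ring and that offset are the padding the text refers to.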
+When using the legacy interface, both transitional +devices and drivers MUST use the following virtqueue layout +structure to locate elements of the virtqueue: + +\begin{lstlisting} +struct virtq { + // The actual descriptors (16 bytes each) + struct virtq_desc desc[ Queue Size ]; + + // A ring of available descriptor heads with free-running index. + struct virtq_avail avail; + + // Padding to the next Queue Align boundary. + u8 pad[ Padding ]; + + // A ring of used descriptor heads with free-running index. + struct virtq_used used; +}; +\end{lstlisting} + +\subsection{Legacy Interfaces: A Note on Virtqueue Endianness}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Legacy Interfaces: A Note on Virtqueue Endianness} + +Note that when using the legacy interface, transitional +devices and drivers MUST use the native +endian of the guest as the endian of fields and in the virtqueue. +This is opposed to little-endian for non-legacy interface as +specified by this standard. +It is assumed that the host is already aware of the guest endian. + +\subsection{Message Framing}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing} +The framing of messages with descriptors is +independent of the contents of the buffers. For example, a network +transmit buffer consists of a 12 byte header followed by the network +packet. This could be most simply placed in the descriptor table as a +12 byte output descriptor followed by a 1514 byte output descriptor, +but it could also consist of a single 1526 byte output descriptor in +the case where the header and packet are adjacent, or even three or +more descriptors (possibly with loss of efficiency in that case). + +Note that, some device implementations have large-but-reasonable +restrictions on total descriptor size (such as based on IOV_MAX in the +host OS). 
This has not been a problem in practice: little sympathy +will be given to drivers which create unreasonably-sized descriptors +such as by dividing a network packet into 1500 single-byte +descriptors! + +\devicenormative{\subsubsection}{Message Framing}{Basic Facilities of a Virtio Device / Message Framing} +The device MUST NOT make assumptions about the particular arrangement +of descriptors. The device MAY have a reasonable limit of descriptors +it will allow in a chain. + +\drivernormative{\subsubsection}{Message Framing}{Basic Facilities of a Virtio Device / Message Framing} +The driver MUST place any device-writable descriptor elements after +any device-readable descriptor elements. + +The driver SHOULD NOT use an excessive number of descriptors to +describe a buffer. + +\subsubsection{Legacy Interface: Message Framing}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing / Legacy Interface: Message Framing} + +Regrettably, initial driver implementations used simple layouts, and +devices came to rely on it, despite this specification wording. In +addition, the specification for virtio_blk SCSI commands required +intuiting field lengths from frame boundaries (see + \ref{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation}~\nameref{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation}) + +Thus when using the legacy interface, the VIRTIO_F_ANY_LAYOUT +feature indicates to both the device and the driver that no +assumptions were made about framing. Requirements for +transitional drivers when this is not negotiated are included in +each device section. + +\subsection{The Virtqueue Descriptor Table}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table} + +The descriptor table refers to the buffers the driver is using for +the device. \field{addr} is a physical address, and the buffers +can be chained via \field{next}. 
Each descriptor describes a +buffer which is read-only for the device (``device-readable'') or write-only for the device (``device-writable''), but a chain of +descriptors can contain both device-readable and device-writable buffers. + +The actual contents of the memory offered to the device depends on the +device type. Most common is to begin the data with a header +(containing little-endian fields) for the device to read, and postfix +it with a status tailer for the device to write. + +\begin{lstlisting} +struct virtq_desc { + /* Address (guest-physical). */ + le64 addr; + /* Length. */ + le32 len; + +/* This marks a buffer as continuing via the next field. */ +#define VIRTQ_DESC_F_NEXT 1 +/* This marks a buffer as device write-only (otherwise device read-only). */ +#define VIRTQ_DESC_F_WRITE 2 +/* This means the buffer contains a list of buffer descriptors. */ +#define VIRTQ_DESC_F_INDIRECT 4 + /* The flags as indicated above. */ + le16 flags; + /* Next field if flags & NEXT */ + le16 next; +}; +\end{lstlisting} + +The number of descriptors in the table is defined by the queue size +for this virtqueue: this is the maximum possible descriptor chain length. + +\begin{note} +The legacy \hyperref[intro:Virtio PCI Draft]{[Virtio PCI Draft]} +referred to this structure as vring_desc, and the constants as +VRING_DESC_F_NEXT, etc, but the layout and values were identical. +\end{note} + +\devicenormative{\subsubsection}{The Virtqueue Descriptor Table}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table} +A device MUST NOT write to a device-readable buffer, and a device SHOULD NOT +read a device-writable buffer (it MAY do so for debugging or diagnostic +purposes). 
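To make the chaining rules concrete, here is a minimal sketch of how a device-side implementation might walk a descriptor chain. The function `walk_chain` and its return convention are hypothetical. The sketch assumes a little-endian host, so the `le16`/`le32`/`le64` fields are read without byte swapping, and it ignores indirect descriptors.

```c
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT  1
#define VIRTQ_DESC_F_WRITE 2

/* Same layout as the listing above, using fixed-width host types. */
struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Walk the chain starting at head.
 * Returns the number of descriptors, or -1 if the chain is malformed:
 * an index outside the table, more than qsz descriptors (which also
 * catches loops), or a device-readable descriptor after a writable one.
 * Byte totals are accumulated in *readable and *writable. */
static int walk_chain(const struct virtq_desc *table, uint16_t qsz,
                      uint16_t head, uint64_t *readable, uint64_t *writable)
{
    uint16_t i = head;
    int count = 0;
    int seen_write = 0;

    *readable = 0;
    *writable = 0;
    for (;;) {
        if (i >= qsz || count >= qsz)
            return -1;
        const struct virtq_desc *d = &table[i];
        count++;
        if (d->flags & VIRTQ_DESC_F_WRITE) {
            seen_write = 1;
            *writable += d->len;
        } else {
            if (seen_write)
                return -1;
            *readable += d->len;
        }
        if (!(d->flags & VIRTQ_DESC_F_NEXT))
            return count;
        i = d->next;
    }
}
```

Bounding the walk by the queue size is one simple way to reject looping chains without tracking visited entries.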
+ +\drivernormative{\subsubsection}{The Virtqueue Descriptor Table}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table} +Drivers MUST NOT add a descriptor chain over than $2^{32}$ bytes long in total; +this implies that loops in the descriptor chain are forbidden! + +\subsubsection{Indirect Descriptors}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors} + +Some devices benefit by concurrently dispatching a large number +of large requests. The VIRTIO_F_INDIRECT_DESC feature allows this (see \ref{sec:virtio-queue.h}~\nameref{sec:virtio-queue.h}). To increase +ring capacity the driver can store a table of indirect +descriptors anywhere in memory, and insert a descriptor in main +virtqueue (with \field{flags}\&VIRTQ_DESC_F_INDIRECT on) that refers to memory buffer +containing this indirect descriptor table; \field{addr} and \field{len} +refer to the indirect table address and length in bytes, +respectively. + +The indirect table layout structure looks like this +(\field{len} is the length of the descriptor that refers to this table, +which is a variable, so this code won't compile): + +\begin{lstlisting} +struct indirect_descriptor_table { + /* The actual descriptors (16 bytes each) */ + struct virtq_desc desc[len / 16]; +}; +\end{lstlisting} + +The first indirect descriptor is located at start of the indirect +descriptor table (index 0), additional indirect descriptors are +chained by \field{next}. An indirect descriptor without a valid \field{next} +(with \field{flags}\&VIRTQ_DESC_F_NEXT off) signals the end of the descriptor. +A single indirect descriptor +table can include both device-readable and device-writable descriptors. 
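The indirect table described above can be sketched as follows. All three helper names are hypothetical, the host is assumed to be little-endian, and rejecting a length that is zero or not a multiple of 16 is an assumption of this sketch rather than a rule stated in the text.

```c
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT     1
#define VIRTQ_DESC_F_WRITE    2
#define VIRTQ_DESC_F_INDIRECT 4

struct virtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Chain entries 0..n-1 of an indirect table in order, keeping each
 * entry's WRITE flag and leaving NEXT off on the last entry. */
static void link_indirect_table(struct virtq_desc *tbl, uint16_t n)
{
    for (uint16_t i = 0; i < n; i++) {
        tbl[i].flags &= VIRTQ_DESC_F_WRITE;
        if (i + 1 < n) {
            tbl[i].flags |= VIRTQ_DESC_F_NEXT;
            tbl[i].next = (uint16_t)(i + 1);
        } else {
            tbl[i].next = 0;
        }
    }
}

/* Fill the main-ring descriptor that refers to a table of n entries
 * located at guest-physical address table_addr. */
static void make_indirect_ref(struct virtq_desc *d, uint64_t table_addr,
                              uint16_t n)
{
    d->addr = table_addr;
    d->len = 16u * n;               /* 16 bytes per descriptor */
    d->flags = VIRTQ_DESC_F_INDIRECT;
    d->next = 0;
}

/* Number of entries in the table a descriptor refers to, or -1 if the
 * descriptor is not a usable indirect reference. */
static int indirect_entries(const struct virtq_desc *d)
{
    if (!(d->flags & VIRTQ_DESC_F_INDIRECT))
        return -1;
    if (d->flags & VIRTQ_DESC_F_NEXT)   /* INDIRECT and NEXT together */
        return -1;
    if (d->len == 0 || d->len % 16 != 0)
        return -1;                      /* assumption of this sketch */
    return (int)(d->len / 16);
}
```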
+ +\drivernormative{\paragraph}{Indirect Descriptors}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors} +The driver MUST NOT set the VIRTQ_DESC_F_INDIRECT flag unless the +VIRTIO_F_INDIRECT_DESC feature was negotiated. The driver MUST NOT +set the VIRTQ_DESC_F_INDIRECT flag within an indirect descriptor (ie. only +one table per descriptor). + +A driver MUST NOT create a descriptor chain longer than the Queue Size of +the device. + +A driver MUST NOT set both VIRTQ_DESC_F_INDIRECT and VIRTQ_DESC_F_NEXT +in \field{flags}. + +\devicenormative{\paragraph}{Indirect Descriptors}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors} +The device MUST ignore the write-only flag (\field{flags}\&VIRTQ_DESC_F_WRITE) in the descriptor that refers to an indirect table. + +The device MUST handle the case of zero or more normal chained +descriptors followed by a single descriptor with \field{flags}\&VIRTQ_DESC_F_INDIRECT. + +\begin{note} +While unusual (most implementations either create a chain solely using +non-indirect descriptors, or use a single indirect element), such a +layout is valid. +\end{note} + +\subsection{The Virtqueue Available Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Available Ring} + +\begin{lstlisting} +struct virtq_avail { +#define VIRTQ_AVAIL_F_NO_INTERRUPT 1 + le16 flags; + le16 idx; + le16 ring[ /* Queue Size */ ]; + le16 used_event; /* Only if VIRTIO_F_EVENT_IDX */ +}; +\end{lstlisting} + +The driver uses the available ring to offer buffers to the +device: each ring entry refers to the head of a descriptor chain. It is only +written by the driver and read by the device. + +\field{idx} field indicates where the driver would put the next descriptor +entry in the ring (modulo the queue size). This starts at 0, and increases. 
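The free-running behaviour of \field{idx} can be illustrated with a short sketch of the driver publishing one descriptor head. `QUEUE_SIZE`, the fixed-size struct and `avail_publish` are chosen for illustration only. The release fence is an assumption of this sketch (this section does not discuss memory ordering); a real driver would use whatever barrier its platform requires. A little-endian host is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_SIZE 8   /* illustrative; must be a power of 2 */

struct virtq_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[QUEUE_SIZE];
    uint16_t used_event;   /* only meaningful with VIRTIO_F_EVENT_IDX */
};

/* Place the head of a descriptor chain in the next ring slot, then
 * advance idx. idx is never reduced modulo the queue size when stored:
 * it simply wraps as a 16-bit counter. */
static void avail_publish(struct virtq_avail *avail, uint16_t head)
{
    avail->ring[avail->idx % QUEUE_SIZE] = head;
    /* Make the ring entry visible before the new idx (assumed ordering). */
    atomic_thread_fence(memory_order_release);
    avail->idx++;
}
```

Starting from an \field{idx} of 65535, one call writes slot 7 and leaves \field{idx} at 0; the next call writes slot 0.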
+ +\begin{note} +The legacy \hyperref[intro:Virtio PCI Draft]{[Virtio PCI Draft]} +referred to this structure as vring_avail, and the constant as +VRING_AVAIL_F_NO_INTERRUPT, but the layout and value were identical. +\end{note} + +\subsection{Virtqueue Interrupt Suppression}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression} + +If the VIRTIO_F_EVENT_IDX feature bit is not negotiated, +the \field{flags} field in the available ring offers a crude mechanism for the driver to inform +the device that it doesn't want interrupts when buffers are used. Otherwise +\field{used_event} is a more performant alternative where the driver +specifies how far the device can progress before interrupting. + +Neither of these interrupt suppression methods are reliable, as they +are not synchronized with the device, but they serve as +useful optimizations. + +\drivernormative{\subsubsection}{Virtqueue Interrupt Suppression}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression} +If the VIRTIO_F_EVENT_IDX feature bit is not negotiated: +\begin{itemize} +\item The driver MUST set \field{flags} to 0 or 1. +\item The driver MAY set \field{flags} to 1 to advise +the device that interrupts are not needed. +\end{itemize} + +Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated: +\begin{itemize} +\item The driver MUST set \field{flags} to 0. +\item The driver MAY use \field{used_event} to advise the device that interrupts are unnecessary until the device writes entry with an index specified by \field{used_event} into the used ring (equivalently, until \field{idx} in the +used ring will reach the value \field{used_event} + 1). +\end{itemize} + +The driver MUST handle spurious interrupts from the device. 
+ +\devicenormative{\subsubsection}{Virtqueue Interrupt Suppression}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression} + +If the VIRTIO_F_EVENT_IDX feature bit is not negotiated: +\begin{itemize} +\item The device MUST ignore the \field{used_event} value. +\item After the device writes a descriptor index into the used ring: + \begin{itemize} + \item If \field{flags} is 1, the device SHOULD NOT send an interrupt. + \item If \field{flags} is 0, the device MUST send an interrupt. + \end{itemize} +\end{itemize} + +Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated: +\begin{itemize} +\item The device MUST ignore the lower bit of \field{flags}. +\item After the device writes a descriptor index into the used ring: + \begin{itemize} + \item If the \field{idx} field in the used ring (which determined + where that descriptor index was placed) was equal to + \field{used_event}, the device MUST send an interrupt. + \item Otherwise the device SHOULD NOT send an interrupt. + \end{itemize} +\end{itemize} + +\begin{note} +For example, if \field{used_event} is 0, then a device using + VIRTIO_F_EVENT_IDX would interrupt after the first buffer is + used (and again after the 65536th buffer, etc). +\end{note} + +\subsection{The Virtqueue Used Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring} + +\begin{lstlisting} +struct virtq_used { +#define VIRTQ_USED_F_NO_NOTIFY 1 + le16 flags; + le16 idx; + struct virtq_used_elem ring[ /* Queue Size */]; + le16 avail_event; /* Only if VIRTIO_F_EVENT_IDX */ +}; + +/* le32 is used here for ids for padding reasons. */ +struct virtq_used_elem { + /* Index of start of used descriptor chain. */ + le32 id; + /* Total length of the descriptor chain which was used (written to) */ + le32 len; +}; +\end{lstlisting} + +The used ring is where the device returns buffers once it is done with +them: it is only written to by the device, and read by the driver. 
+ +Each entry in the ring is a pair: \field{id} indicates the head entry of the +descriptor chain describing the buffer (this matches an entry +placed in the available ring by the guest earlier), and \field{len} the total +of bytes written into the buffer. + +\begin{note} +\field{len} is particularly useful +for drivers using untrusted buffers: if a driver does not know exactly +how much has been written by the device, the driver would have to zero +the buffer in advance to ensure no data leakage occurs. + +For example, a network driver may hand a received buffer directly to +an unprivileged userspace application. If the network device has not +overwritten the bytes which were in that buffer, this could leak the +contents of freed memory from other processes to the application. +\end{note} + +\field{idx} field indicates where the driver would put the next descriptor +entry in the ring (modulo the queue size). This starts at 0, and increases. + +\begin{note} +The legacy \hyperref[intro:Virtio PCI Draft]{[Virtio PCI Draft]} +referred to these structures as vring_used and vring_used_elem, and +the constant as VRING_USED_F_NO_NOTIFY, but the layout and value were +identical. +\end{note} + +\subsubsection{Legacy Interface: The Virtqueue Used +Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues +/ The Virtqueue Used Ring/ Legacy Interface: The Virtqueue Used +Ring} + +Historically, many drivers ignored the \field{len} value, as a +result, many devices set \field{len} incorrectly. Thus, when +using the legacy interface, it is generally a good idea to ignore +the \field{len} value in used ring entries if possible. Specific +known issues are listed per device type. + +\devicenormative{\subsubsection}{The Virtqueue Used Ring}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring} + +The device MUST set \field{len} prior to updating the used \field{idx}. 
+ +The device MUST write at least \field{len} bytes to descriptor, +beginning at the first device-writable buffer, +prior to updating the used \field{idx}. + +The device MAY write more than \field{len} bytes to descriptor. + +\begin{note} +There are potential error cases where a device might not know what +parts of the buffers have been written. This is why \field{len} is +permitted to be an underestimate: that's preferable to the driver believing +that uninitialized memory has been overwritten when it has not. +\end{note} + +\drivernormative{\subsubsection}{The Virtqueue Used Ring}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring} + +The driver MUST NOT make assumptions about data in device-writable buffers +beyond the first \field{len} bytes, and SHOULD ignore this data. + +\subsection{Virtqueue Notification Suppression}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Notification Suppression} + +The device can suppress notifications in a manner analogous to the way +drivers can suppress interrupts as detailed in section \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression}. +The device manipulates \field{flags} or \field{avail_event} in the used ring the +same way the driver manipulates \field{flags} or \field{used_event} in the available ring. + +\drivernormative{\subsubsection}{Virtqueue Notification Suppression}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Notification Suppression} + +The driver MUST initialize \field{flags} in the used ring to 0 when +allocating the used ring. + +If the VIRTIO_F_EVENT_IDX feature bit is not negotiated: +\begin{itemize} +\item The driver MUST ignore the \field{avail_event} value. +\item After the driver writes a descriptor index into the available ring: + \begin{itemize} + \item If \field{flags} is 1, the driver SHOULD NOT send a notification. + \item If \field{flags} is 0, the driver MUST send a notification. 
+ \end{itemize} +\end{itemize} + +Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated: +\begin{itemize} +\item The driver MUST ignore the lower bit of \field{flags}. +\item After the driver writes a descriptor index into the available ring: + \begin{itemize} + \item If the \field{idx} field in the available ring (which determined + where that descriptor index was placed) was equal to + \field{avail_event}, the driver MUST send a notification. + \item Otherwise the driver SHOULD NOT send a notification. + \end{itemize} +\end{itemize} + +\devicenormative{\subsubsection}{Virtqueue Notification Suppression}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Notification Suppression} +If the VIRTIO_F_EVENT_IDX feature bit is not negotiated: +\begin{itemize} +\item The device MUST set \field{flags} to 0 or 1. +\item The device MAY set \field{flags} to 1 to advise +the driver that notifications are not needed. +\end{itemize} + +Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated: +\begin{itemize} +\item The device MUST set \field{flags} to 0. +\item The device MAY use \field{avail_event} to advise the driver that notifications are unnecessary until the driver writes entry with an index specified by \field{avail_event} into the available ring (equivalently, until \field{idx} in the +available ring will reach the value \field{avail_event} + 1). +\end{itemize} + +The device MUST handle spurious notifications from the driver. + +\subsection{Helpers for Operating Virtqueues}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Helpers for Operating Virtqueues} + +The Linux Kernel Source code contains the definitions above and +helper routines in a more usable form, in +include/uapi/linux/virtio_ring.h. This was explicitly licensed by IBM +and Red Hat under the (3-clause) BSD license so that it can be +freely used by all other projects, and is reproduced (with slight +variation) in \ref{sec:virtio-queue.h}~\nameref{sec:virtio-queue.h}. |
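As an illustration of the suppression rules in the added text, the sketch below shows one way a driver could decide whether to notify the device after adding buffers. It uses a wrap-safe 16-bit comparison in the spirit of the helpers mentioned above; the names `need_event` and `driver_should_notify` and their signatures are assumptions of this sketch, not definitions from the patch.

```c
#include <stdint.h>

#define VIRTQ_USED_F_NO_NOTIFY 1

/* True if the entry whose index equals event_idx was written while idx
 * advanced from old_idx to new_idx. All arithmetic is modulo 65536, so
 * the test stays correct when the free-running index wraps. */
static inline int need_event(uint16_t event_idx, uint16_t new_idx,
                             uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/* Driver-side decision after the available idx moved from old_idx to
 * new_idx. used_flags and avail_event are read from the used ring. */
static inline int driver_should_notify(int event_idx_negotiated,
                                       uint16_t used_flags,
                                       uint16_t avail_event,
                                       uint16_t new_idx, uint16_t old_idx)
{
    if (!event_idx_negotiated)
        return !(used_flags & VIRTQ_USED_F_NO_NOTIFY);
    /* With VIRTIO_F_EVENT_IDX the lower bit of flags is ignored. */
    return need_event(avail_event, new_idx, old_idx);
}
```

The same comparison applies in the other direction, with the device comparing the used \field{idx} against \field{used_event} before sending an interrupt.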