\section{Split Virtqueues}\label{sec:Basic Facilities of a Virtio Device / Split Virtqueues}
The split virtqueue format was the only format supported
by version 1.0 (and earlier) of this standard.

The split virtqueue format separates the virtqueue into several
parts, where each part is writable by either the driver or the
device, but not both. Multiple parts and/or locations within
a part need to be updated when making a buffer
available and when marking it as used.

Each queue has a 16-bit queue size
parameter, which sets the number of entries and implies the total size
of the queue.

Each virtqueue consists of three parts:

\begin{itemize}
\item Descriptor Table - occupies the Descriptor Area
\item Available Ring - occupies the Driver Area
\item Used Ring - occupies the Device Area
\end{itemize}

where each part is physically-contiguous in guest memory,
and has different alignment requirements.

The memory alignment and size requirements, in bytes, of each part of the
virtqueue are summarized in the following table:

\begin{tabular}{|l|l|l|}
\hline
Virtqueue Part    & Alignment & Size \\
\hline \hline
Descriptor Table  & 16        & $16 * $(Queue Size) \\
\hline
Available Ring    & 2         & $6 + 2 * $(Queue Size) \\
 \hline
Used Ring         & 4         & $6 + 8 * $(Queue Size) \\
 \hline
\end{tabular}

The Alignment column gives the minimum alignment for each part
of the virtqueue.

The Size column gives the total number of bytes for each
part of the virtqueue.

Queue Size corresponds to the maximum number of buffers in the
virtqueue\footnote{For example, if Queue Size is 4 then at most 4 buffers
can be queued at any given time.}.  The Queue Size value is always a
power of 2.  The maximum Queue Size value is 32768.  This value
is specified in a bus-specific way.
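
As a non-normative illustration of the table above, the following C
sketch computes the size in bytes of each part for a given Queue Size
and checks the power-of-2 and maximum constraints. The function names
are hypothetical and are not defined by this standard.

\begin{lstlisting}
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; the names are illustrative only. */

/* Queue Size must be a power of 2 and at most 32768. */
static bool virtq_size_is_valid(uint32_t qsz)
{
        return qsz != 0 && qsz <= 32768 && (qsz & (qsz - 1)) == 0;
}

/* Descriptor Table: 16 bytes per entry. */
static size_t virtq_desc_table_bytes(uint32_t qsz)
{
        return (size_t)16 * qsz;
}

/* Available Ring: 6 bytes of fixed fields plus 2 bytes per entry. */
static size_t virtq_avail_ring_bytes(uint32_t qsz)
{
        return 6 + (size_t)2 * qsz;
}

/* Used Ring: 6 bytes of fixed fields plus 8 bytes per entry. */
static size_t virtq_used_ring_bytes(uint32_t qsz)
{
        return 6 + (size_t)8 * qsz;
}
\end{lstlisting}

For example, a Queue Size of 256 gives a 4096-byte Descriptor Table,
a 518-byte Available Ring and a 2054-byte Used Ring.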

When the driver wants to send a buffer to the device, it fills in
a slot in the descriptor table (or chains several together), and
writes the descriptor index into the available ring.  It then
notifies the device. When the device has finished a buffer, it
writes the descriptor index into the used ring, and sends an interrupt.

\drivernormative{\subsection}{Virtqueues}{Basic Facilities of a Virtio Device / Virtqueues}
The driver MUST ensure that the physical address of the first byte
of each virtqueue part is a multiple of the specified alignment value
in the above table.

\subsection{Legacy Interfaces: A Note on Virtqueue Layout}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Legacy Interfaces: A Note on Virtqueue Layout}

For Legacy Interfaces, several additional
restrictions are placed on the virtqueue layout:

Each virtqueue occupies two or more physically-contiguous pages
(usually defined as 4096 bytes, but depending on the transport;
henceforth referred to as Queue Align)
and consists of three parts:

\begin{tabular}{|l|l|l|}
\hline
Descriptor Table & Available Ring (\ldots padding\ldots) & Used Ring \\
\hline
\end{tabular}

The bus-specific Queue Size field controls the total number of bytes
for the virtqueue.
When using the legacy interface, the transitional
driver MUST retrieve the Queue Size field from the device
and MUST allocate the total number of bytes for the virtqueue
according to the following formula (Queue Align given in qalign and
Queue Size given in qsz):

\begin{lstlisting}
#define ALIGN(x) (((x) + qalign - 1) & ~(qalign - 1))
static inline unsigned virtq_size(unsigned int qsz)
{
     return ALIGN(sizeof(struct virtq_desc)*qsz + sizeof(u16)*(3 + qsz))
          + ALIGN(sizeof(u16)*3 + sizeof(struct virtq_used_elem)*qsz);
}
\end{lstlisting}

This wastes some space with padding.
When using the legacy interface, both transitional
devices and drivers MUST use the following virtqueue layout
structure to locate elements of the virtqueue:

\begin{lstlisting}
struct virtq {
        // The actual descriptors (16 bytes each)
        struct virtq_desc desc[ Queue Size ];

        // A ring of available descriptor heads with free-running index.
        struct virtq_avail avail;

        // Padding to the next Queue Align boundary.
        u8 pad[ Padding ];

        // A ring of used descriptor heads with free-running index.
        struct virtq_used used;
};
\end{lstlisting}
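
As a non-normative illustration of this layout, the following C sketch
computes the byte offsets of the available ring and the used ring
within a legacy virtqueue, and the total size. It assumes that Queue
Align is a power of 2; the function names are hypothetical.

\begin{lstlisting}
#include <stddef.h>
#include <stdint.h>

/* Round x up to the next multiple of align (a power of 2). */
static size_t align_up(size_t x, size_t align)
{
        return (x + align - 1) & ~(align - 1);
}

/* The available ring directly follows the descriptors. */
static size_t legacy_avail_offset(size_t qsz)
{
        return 16 * qsz;
}

/* The used ring starts at the next Queue Align boundary. */
static size_t legacy_used_offset(size_t qsz, size_t qalign)
{
        return align_up(16 * qsz + 2 * (3 + qsz), qalign);
}

/* Total number of bytes occupied by the legacy virtqueue. */
static size_t legacy_virtq_bytes(size_t qsz, size_t qalign)
{
        return legacy_used_offset(qsz, qalign)
             + align_up(2 * 3 + 8 * qsz, qalign);
}
\end{lstlisting}

With a Queue Size of 256 and a Queue Align of 4096, the descriptor
table takes 4096 bytes and the available ring 518 bytes, so the used
ring starts at offset 8192 and the virtqueue occupies 12288 bytes.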

\subsection{Legacy Interfaces: A Note on Virtqueue Endianness}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Legacy Interfaces: A Note on Virtqueue Endianness}

Note that when using the legacy interface, transitional
devices and drivers MUST use the native
endianness of the guest for the fields in the virtqueue.
This is in contrast to the little-endian format that this standard
specifies for the non-legacy interface.
It is assumed that the host is already aware of the guest's endianness.
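
As a non-normative illustration, a device that supports both
interfaces could decode a 16-bit virtqueue field along the following
lines. The helper name and its parameters are hypothetical; how the
host learns the guest's endianness is outside the scope of this sketch.

\begin{lstlisting}
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode a 16-bit virtqueue field from raw bytes.
 * The non-legacy interface is always little-endian; the legacy
 * interface uses the guest's native endianness, supplied by the caller.
 */
static uint16_t virtq_read_u16(const uint8_t *p, bool legacy,
                               bool guest_big_endian)
{
        if (legacy && guest_big_endian)
                return (uint16_t)((p[0] << 8) | p[1]);
        return (uint16_t)(p[0] | (p[1] << 8));
}
\end{lstlisting}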

\subsection{Message Framing}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing}
The framing of messages with descriptors is
independent of the contents of the buffers. For example, a network
transmit buffer consists of a 12 byte header followed by the network
packet. This could be most simply placed in the descriptor table as a
12 byte output descriptor followed by a 1514 byte output descriptor,
but it could also consist of a single 1526 byte output descriptor in
the case where the header and packet are adjacent, or even three or
more descriptors (possibly with loss of efficiency in that case).

Note that some device implementations have large-but-reasonable
restrictions on total descriptor size (for example, based on IOV_MAX in the
host OS). This has not been a problem in practice: little sympathy
will be given to drivers which create unreasonably-sized descriptors
such as by dividing a network packet into 1500 single-byte
descriptors!

\devicenormative{\subsubsection}{Message Framing}{Basic Facilities of a Virtio Device / Message Framing}
The device MUST NOT make assumptions about the particular arrangement
of descriptors.  The device MAY have a reasonable limit of descriptors
it will allow in a chain.

\drivernormative{\subsubsection}{Message Framing}{Basic Facilities of a Virtio Device / Message Framing}
The driver MUST place any device-writable descriptor elements after
any device-readable descriptor elements.

The driver SHOULD NOT use an excessive number of descriptors to
describe a buffer.

\subsubsection{Legacy Interface: Message Framing}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing / Legacy Interface: Message Framing}

Regrettably, initial driver implementations used simple layouts, and
devices came to rely on them, despite this specification wording.  In
addition, the specification for virtio_blk SCSI commands required
intuiting field lengths from frame boundaries (see
 \ref{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation}~\nameref{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation}).

Thus when using the legacy interface, the VIRTIO_F_ANY_LAYOUT
feature indicates to both the device and the driver that no
assumptions were made about framing.  Requirements for
transitional drivers when this is not negotiated are included in
each device section.

\subsection{The Virtqueue Descriptor Table}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table}

The descriptor table refers to the buffers the driver is using for
the device. \field{addr} is a physical address, and the buffers
can be chained via \field{next}. Each descriptor describes a
buffer which is read-only for the device (``device-readable'') or write-only for the device (``device-writable''), but a chain of
descriptors can contain both device-readable and device-writable buffers.

The actual contents of the memory offered to the device depend on the
device type.  Most commonly, the data begins with a header
(containing little-endian fields) for the device to read, and ends
with a status trailer for the device to write.

\begin{lstlisting}
struct virtq_desc {
        /* Address (guest-physical). */
        le64 addr;
        /* Length. */
        le32 len;

/* This marks a buffer as continuing via the next field. */
#define VIRTQ_DESC_F_NEXT   1
/* This marks a buffer as device write-only (otherwise device read-only). */
#define VIRTQ_DESC_F_WRITE     2
/* This means the buffer contains a list of buffer descriptors. */
#define VIRTQ_DESC_F_INDIRECT   4
        /* The flags as indicated above. */
        le16 flags;
        /* Next field if flags & NEXT */
        le16 next;
};
\end{lstlisting}

The number of descriptors in the table is defined by the queue size
for this virtqueue: this is the maximum possible descriptor chain length.
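
As a non-normative illustration, a device could walk a descriptor
chain as sketched below, using the queue size as an upper bound on the
chain length so that a malformed (for example looping) chain is
rejected. The sketch assumes a little-endian host so that the le
fields can be read directly, and the helper name is hypothetical.

\begin{lstlisting}
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT  1
#define VIRTQ_DESC_F_WRITE 2

/* Host-endian stand-in for struct virtq_desc (little-endian host). */
struct desc {
        uint64_t addr;
        uint32_t len;
        uint16_t flags;
        uint16_t next;
};

/*
 * Hypothetical helper: walk the chain starting at head and add up the
 * device-readable and device-writable byte counts. Returns false if
 * the chain leaves the table or has more than qsz descriptors (which
 * also catches loops).
 */
static bool chain_lengths(const struct desc *table, uint16_t qsz,
                          uint16_t head, uint64_t *readable,
                          uint64_t *writable)
{
        uint16_t i = head;
        uint32_t seen = 0;

        *readable = 0;
        *writable = 0;
        for (;;) {
                if (i >= qsz || ++seen > qsz)
                        return false;
                if (table[i].flags & VIRTQ_DESC_F_WRITE)
                        *writable += table[i].len;
                else
                        *readable += table[i].len;
                if (!(table[i].flags & VIRTQ_DESC_F_NEXT))
                        return true;
                i = table[i].next;
        }
}
\end{lstlisting}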

\begin{note}
The legacy \hyperref[intro:Virtio PCI Draft]{[Virtio PCI Draft]}
referred to this structure as vring_desc, and the constants as
VRING_DESC_F_NEXT, etc, but the layout and values were identical.
\end{note}

\devicenormative{\subsubsection}{The Virtqueue Descriptor Table}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table}
A device MUST NOT write to a device-readable buffer, and a device SHOULD NOT
read a device-writable buffer (it MAY do so for debugging or diagnostic
purposes).

\drivernormative{\subsubsection}{The Virtqueue Descriptor Table}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table}
Drivers MUST NOT add a descriptor chain longer than $2^{32}$ bytes in total;
this implies that loops in the descriptor chain are forbidden!

\subsubsection{Indirect Descriptors}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors}

Some devices benefit by concurrently dispatching a large number
of large requests. The VIRTIO_F_INDIRECT_DESC feature allows this (see \ref{sec:virtio-queue.h}~\nameref{sec:virtio-queue.h}). To increase
ring capacity the driver can store a table of indirect
descriptors anywhere in memory, and insert a descriptor in the main
virtqueue (with \field{flags}\&VIRTQ_DESC_F_INDIRECT on) that refers to a memory buffer
containing this indirect descriptor table; \field{addr} and \field{len}
refer to the indirect table address and length in bytes,
respectively.

The indirect table layout structure looks like this
(\field{len} is the length of the descriptor that refers to this table,
which is a variable, so this code won't compile):

\begin{lstlisting}
struct indirect_descriptor_table {
        /* The actual descriptors (16 bytes each) */
        struct virtq_desc desc[len / 16];
};
\end{lstlisting}

The first indirect descriptor is located at the start of the indirect
descriptor table (index 0); additional indirect descriptors are
chained by \field{next}. An indirect descriptor without a valid \field{next}
(with \field{flags}\&VIRTQ_DESC_F_NEXT off) signals the end of the descriptor chain.
A single indirect descriptor
table can include both device-readable and device-writable descriptors.
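
As a non-normative illustration, the following C sketch derives the
number of entries in an indirect table from \field{len} and checks the
flag combinations that the requirements in this section rule out. The
function names are hypothetical.

\begin{lstlisting}
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT     1
#define VIRTQ_DESC_F_INDIRECT 4

/* Number of 16-byte descriptors in an indirect table of len bytes. */
static uint32_t indirect_table_entries(uint32_t len)
{
        return len / 16;
}

/*
 * Flags of a descriptor in the main table: INDIRECT needs the
 * negotiated feature and cannot be combined with NEXT.
 */
static bool main_desc_flags_ok(uint16_t flags, bool indirect_negotiated)
{
        if (!(flags & VIRTQ_DESC_F_INDIRECT))
                return true;
        if (!indirect_negotiated)
                return false;
        return !(flags & VIRTQ_DESC_F_NEXT);
}

/* Flags of a descriptor inside an indirect table: no nesting. */
static bool indirect_entry_flags_ok(uint16_t flags)
{
        return !(flags & VIRTQ_DESC_F_INDIRECT);
}
\end{lstlisting}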

\drivernormative{\paragraph}{Indirect Descriptors}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors}
The driver MUST NOT set the VIRTQ_DESC_F_INDIRECT flag unless the
VIRTIO_F_INDIRECT_DESC feature was negotiated.  The driver MUST NOT
set the VIRTQ_DESC_F_INDIRECT flag within an indirect descriptor (i.e. only
one table per descriptor).

A driver MUST NOT create a descriptor chain longer than the Queue Size of
the device.

A driver MUST NOT set both VIRTQ_DESC_F_INDIRECT and VIRTQ_DESC_F_NEXT
in \field{flags}.

\devicenormative{\paragraph}{Indirect Descriptors}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table / Indirect Descriptors}
The device MUST ignore the write-only flag (\field{flags}\&VIRTQ_DESC_F_WRITE) in the descriptor that refers to an indirect table.

The device MUST handle the case of zero or more normal chained
descriptors followed by a single descriptor with \field{flags}\&VIRTQ_DESC_F_INDIRECT.

\begin{note}
While unusual (most implementations either create a chain solely using
non-indirect descriptors, or use a single indirect element), such a
layout is valid.
\end{note}

\subsection{The Virtqueue Available Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Available Ring}

\begin{lstlisting}
struct virtq_avail {
#define VIRTQ_AVAIL_F_NO_INTERRUPT      1
        le16 flags;
        le16 idx;
        le16 ring[ /* Queue Size */ ];
        le16 used_event; /* Only if VIRTIO_F_EVENT_IDX */
};
\end{lstlisting}

The driver uses the available ring to offer buffers to the
device: each ring entry refers to the head of a descriptor chain.  It is only
written by the driver and read by the device.

The \field{idx} field indicates where the driver would put the next descriptor
entry in the ring (modulo the queue size). This starts at 0, and increases.

\begin{note}
The legacy \hyperref[intro:Virtio PCI Draft]{[Virtio PCI Draft]}
referred to this structure as vring_avail, and the constant as
VRING_AVAIL_F_NO_INTERRUPT, but the layout and value were identical.
\end{note}

\drivernormative{\subsubsection}{The Virtqueue Available Ring}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Available Ring}
A driver MUST NOT decrement the available \field{idx} on a virtqueue (i.e.
there is no way to ``unexpose'' buffers).

\subsection{Virtqueue Interrupt Suppression}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression}

If the VIRTIO_F_EVENT_IDX feature bit is not negotiated,
the \field{flags} field in the available ring offers a crude mechanism for the driver to inform
the device that it doesn't want interrupts when buffers are used.  Otherwise
\field{used_event} is a more performant alternative where the driver
specifies how far the device can progress before interrupting.

Neither of these interrupt suppression methods is reliable, as they
are not synchronized with the device, but they serve as
useful optimizations.

\drivernormative{\subsubsection}{Virtqueue Interrupt Suppression}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression}
If the VIRTIO_F_EVENT_IDX feature bit is not negotiated:
\begin{itemize}
\item The driver MUST set \field{flags} to 0 or 1.
\item The driver MAY set \field{flags} to 1 to advise
the device that interrupts are not needed.
\end{itemize}

Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated:
\begin{itemize}
\item The driver MUST set \field{flags} to 0.
\item The driver MAY use \field{used_event} to advise the device that interrupts are unnecessary until the device writes an entry with an index specified by \field{used_event} into the used ring (equivalently, until \field{idx} in the
used ring reaches the value \field{used_event} + 1).
\end{itemize}

The driver MUST handle spurious interrupts from the device.

\devicenormative{\subsubsection}{Virtqueue Interrupt Suppression}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression}

If the VIRTIO_F_EVENT_IDX feature bit is not negotiated:
\begin{itemize}
\item The device MUST ignore the \field{used_event} value.
\item After the device writes a descriptor index into the used ring:
  \begin{itemize}
  \item If \field{flags} is 1, the device SHOULD NOT send an interrupt.
  \item If \field{flags} is 0, the device MUST send an interrupt.
  \end{itemize}
\end{itemize}

Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated:
\begin{itemize}
\item The device MUST ignore the lower bit of \field{flags}.
\item After the device writes a descriptor index into the used ring:
  \begin{itemize}
  \item If the \field{idx} field in the used ring (which determined
    where that descriptor index was placed) was equal to
    \field{used_event}, the device MUST send an interrupt.
  \item Otherwise the device SHOULD NOT send an interrupt.
  \end{itemize}
\end{itemize}

\begin{note}
For example, if \field{used_event} is 0, then a device using
  VIRTIO_F_EVENT_IDX would interrupt after the first buffer is
  used (and again after the 65536th buffer, etc).
\end{note}
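
As a non-normative illustration, the device-side decision described
above can be sketched in C as follows. The comparison generalizes the
single-entry rule to the case where the device adds several used
entries before checking, which is an assumption of this sketch; it
follows the approach of helpers such as vring_need_event in the Linux
header include/uapi/linux/virtio_ring.h. All names below are
hypothetical.

\begin{lstlisting}
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_AVAIL_F_NO_INTERRUPT 1

/*
 * True if the entry whose ring index equals event was written while
 * idx advanced from old_idx to new_idx. All arithmetic is modulo 65536.
 */
static bool virtq_event_crossed(uint16_t event, uint16_t new_idx,
                                uint16_t old_idx)
{
        return (uint16_t)(new_idx - event - 1) <
               (uint16_t)(new_idx - old_idx);
}

/* Device-side decision after adding entries to the used ring. */
static bool device_should_interrupt(bool event_idx_negotiated,
                                    uint16_t avail_flags,
                                    uint16_t used_event,
                                    uint16_t old_used_idx,
                                    uint16_t new_used_idx)
{
        if (event_idx_negotiated)
                return virtq_event_crossed(used_event, new_used_idx,
                                           old_used_idx);
        return !(avail_flags & VIRTQ_AVAIL_F_NO_INTERRUPT);
}
\end{lstlisting}

With \field{used_event} set to 0, the sketch requests an interrupt when
\field{idx} advances from 0 to 1, and not when it advances from 1 to 2.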

\subsection{The Virtqueue Used Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring}

\begin{lstlisting}
struct virtq_used {
#define VIRTQ_USED_F_NO_NOTIFY  1
        le16 flags;
        le16 idx;
        struct virtq_used_elem ring[ /* Queue Size */];
        le16 avail_event; /* Only if VIRTIO_F_EVENT_IDX */
};

/* le32 is used here for ids for padding reasons. */
struct virtq_used_elem {
        /* Index of start of used descriptor chain. */
        le32 id;
        /* Total length of the descriptor chain which was used (written to) */
        le32 len;
};
\end{lstlisting}

The used ring is where the device returns buffers once it is done with
them: it is only written to by the device, and read by the driver.

Each entry in the ring is a pair: \field{id} indicates the head entry of the
descriptor chain describing the buffer (this matches an entry
placed in the available ring by the guest earlier), and \field{len} the total
number of bytes written into the buffer.

\begin{note}
\field{len} is particularly useful
for drivers using untrusted buffers: if a driver does not know exactly
how much has been written by the device, the driver would have to zero
the buffer in advance to ensure no data leakage occurs.

For example, a network driver may hand a received buffer directly to
an unprivileged userspace application.  If the network device has not
overwritten the bytes which were in that buffer, this could leak the
contents of freed memory from other processes to the application.
\end{note}
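
As a non-normative illustration of the note above, a driver could
expose only the bytes reported in \field{len} when handing data
onwards, as in the following C sketch. The helper name and its
parameters are hypothetical.

\begin{lstlisting}
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical driver helper: copy the bytes the device reported as
 * written out of a device-writable buffer. Bytes beyond used_len are
 * never exposed; the rest of dst is zero-filled instead.
 */
static uint32_t copy_used_bytes(uint8_t *dst, uint32_t dst_size,
                                const uint8_t *buf, uint32_t buf_size,
                                uint32_t used_len)
{
        uint32_t n = used_len;

        if (n > buf_size)       /* do not trust len beyond the buffer */
                n = buf_size;
        if (n > dst_size)
                n = dst_size;
        memcpy(dst, buf, n);
        memset(dst + n, 0, dst_size - n);
        return n;
}
\end{lstlisting}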

The \field{idx} field indicates where the device would put the next descriptor
entry in the ring (modulo the queue size). This starts at 0, and increases.

\begin{note}
The legacy \hyperref[intro:Virtio PCI Draft]{[Virtio PCI Draft]}
referred to these structures as vring_used and vring_used_elem, and
the constant as VRING_USED_F_NO_NOTIFY, but the layout and value were
identical.
\end{note}

\subsubsection{Legacy Interface: The Virtqueue Used
Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues
/ The Virtqueue Used Ring/ Legacy Interface: The Virtqueue Used
Ring}

Historically, many drivers ignored the \field{len} value; as a
result, many devices set \field{len} incorrectly.  Thus, when
using the legacy interface, it is generally a good idea to ignore
the \field{len} value in used ring entries if possible.  Specific
known issues are listed per device type.

\devicenormative{\subsubsection}{The Virtqueue Used Ring}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring}

The device MUST set \field{len} prior to updating the used \field{idx}.

The device MUST write at least \field{len} bytes to the descriptor chain,
beginning at the first device-writable buffer,
prior to updating the used \field{idx}.

The device MAY write more than \field{len} bytes to the descriptor chain.

\begin{note}
There are potential error cases where a device might not know what
parts of the buffers have been written.  This is why \field{len} is
permitted to be an underestimate: that's preferable to the driver believing
that uninitialized memory has been overwritten when it has not.
\end{note}

\drivernormative{\subsubsection}{The Virtqueue Used Ring}{Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring}

The driver MUST NOT make assumptions about data in device-writable buffers
beyond the first \field{len} bytes, and SHOULD ignore this data.

\subsection{Virtqueue Notification Suppression}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Notification Suppression}

The device can suppress notifications in a manner analogous to the way
drivers can suppress interrupts as detailed in section \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression}.
The device manipulates \field{flags} or \field{avail_event} in the used ring the
same way the driver manipulates \field{flags} or \field{used_event} in the available ring.

\drivernormative{\subsubsection}{Virtqueue Notification Suppression}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Notification Suppression}

The driver MUST initialize \field{flags} in the used ring to 0 when
allocating the used ring.

If the VIRTIO_F_EVENT_IDX feature bit is not negotiated:
\begin{itemize}
\item The driver MUST ignore the \field{avail_event} value.
\item After the driver writes a descriptor index into the available ring:
  \begin{itemize}
        \item If \field{flags} is 1, the driver SHOULD NOT send a notification.
        \item If \field{flags} is 0, the driver MUST send a notification.
  \end{itemize}
\end{itemize}

Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated:
\begin{itemize}
\item The driver MUST ignore the lower bit of \field{flags}.
\item After the driver writes a descriptor index into the available ring:
  \begin{itemize}
        \item If the \field{idx} field in the available ring (which determined
          where that descriptor index was placed) was equal to
          \field{avail_event}, the driver MUST send a notification.
        \item Otherwise the driver SHOULD NOT send a notification.
  \end{itemize}
\end{itemize}

\devicenormative{\subsubsection}{Virtqueue Notification Suppression}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Notification Suppression}
If the VIRTIO_F_EVENT_IDX feature bit is not negotiated:
\begin{itemize}
\item The device MUST set \field{flags} to 0 or 1.
\item The device MAY set \field{flags} to 1 to advise
the driver that notifications are not needed.
\end{itemize}

Otherwise, if the VIRTIO_F_EVENT_IDX feature bit is negotiated:
\begin{itemize}
\item The device MUST set \field{flags} to 0.
\item The device MAY use \field{avail_event} to advise the driver that notifications are unnecessary until the driver writes an entry with an index specified by \field{avail_event} into the available ring (equivalently, until \field{idx} in the
available ring reaches the value \field{avail_event} + 1).
\end{itemize}

The device MUST handle spurious notifications from the driver.

\subsection{Helpers for Operating Virtqueues}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Helpers for Operating Virtqueues}

The Linux Kernel Source code contains the definitions above and
helper routines in a more usable form, in
include/uapi/linux/virtio_ring.h. This was explicitly licensed by IBM
and Red Hat under the (3-clause) BSD license so that it can be
freely used by all other projects, and is reproduced (with slight
variation) in \ref{sec:virtio-queue.h}~\nameref{sec:virtio-queue.h}.

\subsection{Virtqueue Operation}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Operation}

There are two parts to virtqueue operation: supplying new
available buffers to the device, and processing used buffers from
the device.

\begin{note} As an
example, the simplest virtio network device has two virtqueues: the
transmit virtqueue and the receive virtqueue. The driver adds
outgoing (device-readable) packets to the transmit virtqueue, and then
frees them after they are used. Similarly, incoming (device-writable)
buffers are added to the receive virtqueue, and processed after
they are used.
\end{note}

What follows are the requirements of each of these two parts
when using the split virtqueue format, in more detail.

\subsection{Supplying Buffers to The Device}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device}

The driver offers buffers to one of the device's virtqueues as follows:

\begin{enumerate}
\item\label{itm:Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device / Place Buffers} The driver places the buffer into free descriptor(s) in the
   descriptor table, chaining as necessary (see \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table}~\nameref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Descriptor Table}).

\item\label{itm:Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device / Place Index} The driver places the index of the head of the descriptor chain
   into the next ring entry of the available ring.

\item Steps \ref{itm:Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device / Place Buffers} and \ref{itm:Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device / Place Index} MAY be performed repeatedly if batching
  is possible.

\item The driver performs a suitable memory barrier to ensure the device sees
  the updated descriptor table and available ring before the next
  step.

\item The available \field{idx} is increased by the number of
  descriptor chain heads added to the available ring.

\item The driver performs a suitable memory barrier to ensure that it updates
  the \field{idx} field before checking for notification suppression.

\item If notifications are not suppressed, the driver notifies the device
    of the new available buffers.
\end{enumerate}
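
As a non-normative illustration, steps 2 to 7 can be sketched in C as
follows for the case where VIRTIO_F_EVENT_IDX is not negotiated. The
sketch assumes a little-endian host, a fixed example queue size, and
C11 fences as stand-ins for the memory barriers; whether these fences
are sufficient depends on the platform and transport. All names are
hypothetical.

\begin{lstlisting}
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QSZ 4   /* example Queue Size; must be a power of 2 */
#define VIRTQ_USED_F_NO_NOTIFY 1

/* Host-endian stand-ins for the rings (little-endian host assumed). */
struct avail_ring { uint16_t flags, idx, ring[QSZ], used_event; };
struct used_hdr   { uint16_t flags, idx; };

/*
 * Publish n descriptor chain heads and report whether the device
 * needs to be notified.
 */
static bool virtq_publish(struct avail_ring *avail,
                          const volatile struct used_hdr *used,
                          const uint16_t *heads, uint16_t n)
{
        uint16_t i;

        for (i = 0; i < n; i++)
                avail->ring[(uint16_t)(avail->idx + i) % QSZ] = heads[i];

        /* Ring entries must be visible before the new idx. */
        atomic_thread_fence(memory_order_release);
        avail->idx = (uint16_t)(avail->idx + n);

        /* idx must be visible before the suppression flag is read. */
        atomic_thread_fence(memory_order_seq_cst);
        return !(used->flags & VIRTQ_USED_F_NO_NOTIFY);
}
\end{lstlisting}

The caller performs the bus-specific notification only when the
function returns true.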

Note that the steps above do not take precautions against the
available ring buffer wrapping around: this is not possible since
the ring buffer is the same size as the descriptor table, so step
(1) will prevent such a condition.

In addition, the maximum queue size is 32768 (the highest power
of 2 which fits in 16 bits), so the 16-bit \field{idx} value can always
distinguish between a full and empty buffer.
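
As a non-normative illustration, the number of outstanding entries can
be computed as a 16-bit difference of two free-running indices, as in
the following C sketch. Because the queue size is at most 32768, a full
ring gives a difference equal to the queue size and an empty ring gives
0; with a queue size of 65536 both cases would give 0. The function
names are hypothetical.

\begin{lstlisting}
#include <stdbool.h>
#include <stdint.h>

/* Entries made available that have not yet been consumed. */
static uint16_t virtq_pending(uint16_t avail_idx, uint16_t last_seen_idx)
{
        return (uint16_t)(avail_idx - last_seen_idx);
}

static bool virtq_ring_empty(uint16_t avail_idx, uint16_t last_seen_idx)
{
        return virtq_pending(avail_idx, last_seen_idx) == 0;
}

static bool virtq_ring_full(uint16_t avail_idx, uint16_t last_seen_idx,
                            uint16_t qsz)
{
        return virtq_pending(avail_idx, last_seen_idx) == qsz;
}
\end{lstlisting}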

What follows are the requirements of each stage in more detail.

\subsubsection{Placing Buffers Into The Descriptor Table}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device / Placing Buffers Into The Descriptor Table}

A buffer consists of zero or more device-readable physically-contiguous
elements followed by zero or more physically-contiguous
device-writable elements (each has at least one element). This
algorithm maps it into the descriptor table to form a descriptor
chain:

for each buffer element, b:

\begin{enumerate}
\item Get the next free descriptor table entry, d
\item Set \field{d.addr} to the physical address of the start of b
\item Set \field{d.len} to the length of b.
\item If b is device-writable, set \field{d.flags} to VIRTQ_DESC_F_WRITE,
    otherwise 0.
\item If there is a buffer element after this:
    \begin{enumerate}
    \item Set \field{d.next} to the index of the next free descriptor
      element.
    \item Set the VIRTQ_DESC_F_NEXT bit in \field{d.flags}.
    \end{enumerate}
\end{enumerate}

In practice, \field{d.next} is usually used to chain free
descriptors, and a separate count kept to check there are enough
free descriptors before beginning the mappings.
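
As a non-normative illustration, the algorithm above, together with a
free list chained through \field{next} and a count of free
descriptors, can be sketched in C as follows. The sketch assumes a
little-endian host and a fixed example queue size, and expects the
caller to list device-readable elements before device-writable ones.
All names are hypothetical.

\begin{lstlisting}
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_DESC_F_NEXT  1
#define VIRTQ_DESC_F_WRITE 2
#define QSZ 8   /* example Queue Size */

/* Host-endian stand-in for struct virtq_desc (little-endian host). */
struct desc { uint64_t addr; uint32_t len; uint16_t flags, next; };

/* One physically-contiguous buffer element. */
struct element { uint64_t addr; uint32_t len; bool device_writable; };

/* Driver state: free descriptors are chained through next. */
struct desc_pool {
        struct desc table[QSZ];
        uint16_t free_head;
        uint16_t num_free;
};

static void pool_init(struct desc_pool *p)
{
        for (uint16_t i = 0; i < QSZ; i++)
                p->table[i].next = (uint16_t)(i + 1);
        p->free_head = 0;
        p->num_free = QSZ;
}

/* Map n elements into a chain; returns the head index or -1. */
static int pool_add_buffer(struct desc_pool *p, const struct element *e,
                           uint16_t n)
{
        uint16_t head = p->free_head;
        uint16_t d = head;

        if (n == 0 || n > p->num_free)
                return -1;
        for (uint16_t i = 0; i < n; i++) {
                d = p->free_head;
                p->free_head = p->table[d].next; /* next free entry */
                p->table[d].addr = e[i].addr;
                p->table[d].len = e[i].len;
                p->table[d].flags =
                        e[i].device_writable ? VIRTQ_DESC_F_WRITE : 0;
                if (i + 1 < n)  /* next already holds the next index */
                        p->table[d].flags |= VIRTQ_DESC_F_NEXT;
        }
        p->num_free = (uint16_t)(p->num_free - n);
        return head;
}
\end{lstlisting}

Because the free list already links each free descriptor to the next
one, the chain is formed simply by leaving \field{next} untouched and
setting VIRTQ_DESC_F_NEXT on every descriptor except the last.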

\subsubsection{Updating The Available Ring}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device / Updating The Available Ring}

The descriptor chain head is the first d in the algorithm
above, i.e. the index of the descriptor table entry referring to the first
part of the buffer.  A naive driver implementation MAY do the following (with the
appropriate conversion to-and-from little-endian assumed):

\begin{lstlisting}
avail->ring[avail->idx % qsz] = head;
\end{lstlisting}

However, in general the driver MAY add many descriptor chains before it updates
\field{idx} (at which point they become visible to the
device), so it is common to keep a counter of how many the driver has added:

\begin{lstlisting}
avail->ring[(avail->idx + added++) % qsz] = head;
\end{lstlisting}

\subsubsection{Updating \field{idx}}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device / Updating idx}

\field{idx} always increments, and wraps naturally at
65536:

\begin{lstlisting}
avail->idx += added;
\end{lstlisting}

Once available \field{idx} is updated by the driver, this exposes the
descriptor and its contents.  The device MAY
access the descriptor chains the driver created and the
memory they refer to immediately.

\drivernormative{\paragraph}{Updating idx}{Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device / Updating idx}
The driver MUST perform a suitable memory barrier before the \field{idx} update, to ensure the
device sees the most up-to-date copy.

\subsubsection{Notifying The Device}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device / Notifying The Device}

The actual method of device notification is bus-specific, but generally
it can be expensive.  So the device MAY suppress such notifications if it
doesn't need them, as detailed in section \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Notification Suppression}.

The driver has to be careful to expose the new \field{idx}
value before checking if notifications are suppressed.

\drivernormative{\paragraph}{Notifying The Device}{Basic Facilities of a Virtio Device / Virtqueues / Supplying Buffers to The Device / Notifying The Device}
The driver MUST perform a suitable memory barrier before reading \field{flags} or
\field{avail_event}, to avoid missing a notification.

\subsection{Receiving Used Buffers From The Device}\label{sec:Basic Facilities of a Virtio Device / Virtqueues / Receiving Used Buffers From The Device}

Once the device has used buffers referred to by a descriptor (read from or written to them, or
parts of both, depending on the nature of the virtqueue and the
device), it interrupts the driver as detailed in section \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Interrupt Suppression}.

\begin{note}
For optimal performance, a driver MAY disable interrupts while processing
the used ring, but beware the problem of missing interrupts between
emptying the ring and reenabling interrupts.  This is usually handled by
re-checking for more used buffers after interrupts are re-enabled:

\begin{lstlisting}
virtq_disable_interrupts(vq);

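for (;;) {
        /* No more used buffers seen: re-enable interrupts and re-check. */
        if (vq->last_seen_used == le16_to_cpu(virtq->used.idx)) {
                virtq_enable_interrupts(vq);
                mb();

                /* Still nothing to do: stop and wait for an interrupt. */
                if (vq->last_seen_used == le16_to_cpu(virtq->used.idx))
                        break;

                virtq_disable_interrupts(vq);
        }

        /* vsz is the Queue Size. */
        struct virtq_used_elem *e = &virtq->used.ring[vq->last_seen_used%vsz];
        process_buffer(e);
        vq->last_seen_used++;
}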
\end{lstlisting}
\end{note}