author Keith Packard <keithp@keithp.com> 2008-05-10 21:04:42 -0700
committer Keith Packard <keithp@keithp.com> 2008-05-10 21:04:42 -0700
commit 177b8b07033c56c84d335808121690d235516bb5
tree 5c2b68e8a16b36f14c800a327e0889543361a9ed /linux-core
parent a37ac493da1730436028ecc79a38513380ce15d0
[GEM] Add drm-gem.txt
Add some API and implementation documentation for GEM.
Diffstat (limited to 'linux-core')
-rw-r--r-- linux-core/drm-gem.txt 799
1 files changed, 799 insertions, 0 deletions
diff --git a/linux-core/drm-gem.txt b/linux-core/drm-gem.txt
new file mode 100644
index 00000000..8f95c206
--- /dev/null
+++ b/linux-core/drm-gem.txt
@@ -0,0 +1,799 @@
+ The Graphics Execution Manager
+ Part of the Direct Rendering Manager
+ ==============================
+
+ Keith Packard <keithp@keithp.com>
+ Eric Anholt <eric@anholt.net>
+ 2008-5-9
+
+Contents:
+
+ 1. Graphics Execution Manager Overview
+ 2. API Overview and Conventions
+ 3. Object Creation and Destruction
+ 4. Basic read/write operations
+ 5. Mapping objects to user space
+ 6. Memory Domains
+ 7. Execution (Intel specific)
+ 8. Other misc Intel-specific functions
+
+1. Graphics Execution Manager Overview
+
+Gem is designed to manage graphics memory, control access to the graphics
+device execution context and handle the essentially NUMA environment unique
+to modern graphics hardware. Gem allows multiple applications to share
+graphics device resources without the need to constantly reload the entire
+graphics card. Data may be shared between multiple applications with gem
+ensuring that the correct memory synchronization occurs.
+
+Graphics data can consume arbitrary amounts of memory, with 3D applications
+constructing ever larger sets of textures and vertices. With graphics card
+memory growing larger every year, and graphics APIs growing more
+complex, we can no longer insist that each application save a complete copy
+of its graphics state so that the card can be re-initialized from user
+space at each context switch. Ensuring that graphics data remains persistent
+across context switches allows applications significant new functionality
+while also improving performance for existing APIs.
+
+Modern Linux desktops include significant 3D rendering as a fundamental
+component of the desktop image construction process. 2D and 3D applications
+paint their content to offscreen storage and the central 'compositing
+manager' constructs the final screen image from those window contents. This
+means that pixel image data from these applications must move within reach
+of the compositing manager and be used as source operands for screen image
+rendering operations.
+
+Gem provides simple mechanisms to manage graphics data and control execution
+flow within the Linux operating system. Using many existing kernel
+subsystems, it does this with a modest amount of code.
+
+2. API Overview and Conventions
+
+All APIs here are defined in terms of ioctls applied to the DRM file
+descriptor. To create and manipulate objects, an application must be
+'authorized' using the DRI or DRI2 protocols with the X server. To relax
+that, we will need to implement some better access control mechanisms within
+the hardware portion of the driver to prevent inappropriate
+cross-application data access.
+
+Any DRM driver which does not support GEM will return -ENODEV for all of
+these ioctls. Invalid object handles return -EINVAL. Invalid object names
+return -ENOENT. Other errors are as documented in the specific API below.
+
+To avoid the need to translate ioctl contents on mixed-size systems (with
+32-bit user space running on a 64-bit kernel), the ioctl data structures
+contain explicitly sized objects, using 64-bits for all size and pointer
+data and 32-bits for identifiers. In addition, the 64-bit objects are all
+carefully aligned on 64-bit boundaries. Because of this, all pointers in the
+ioctl data structures are passed as uint64_t values. Suitable casts will
+be necessary.
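+
+A minimal sketch of the casts this convention calls for, using a local copy
+of struct drm_gem_pread from section 4. The helpers to_user_ptr() and
+from_user_ptr() are hypothetical names, not part of the API. The
+compile-time checks assert the 32-byte, padding-free layout; because every
+64-bit field already falls on an 8-byte boundary, 32- and 64-bit builds
+agree on it.
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copy of struct drm_gem_pread (section 4), for illustration only. */
struct drm_gem_pread {
	uint32_t handle;
	uint32_t pad;
	uint64_t offset;
	uint64_t size;
	uint64_t data_ptr;	/* void * */
};

/* Every 64-bit field sits on an 8-byte boundary, so no compiler inserts
 * padding and the size is the same for 32- and 64-bit user space. */
_Static_assert(sizeof(struct drm_gem_pread) == 32, "unexpected padding");
_Static_assert(offsetof(struct drm_gem_pread, offset) % 8 == 0,
	       "64-bit field misaligned");

/* Hypothetical helpers. Going through uintptr_t zero-extends a 32-bit
 * pointer into the 64-bit field and narrows it back without loss. */
static uint64_t to_user_ptr(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

static void *from_user_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}
```
+A request would then set, for example, pread.data_ptr = to_user_ptr(buffer).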
+
+One significant operation which is explicitly left out of this API is object
+locking. Applications are expected to perform locking of shared objects
+outside of the GEM API. This kind of locking is not necessary to safely
+manipulate the graphics engine, and with multiple objects interacting in
+unknown ways, per-object locking would likely introduce all kinds of
+lock-order issues. Punting this to the application seems like the only
+sensible plan. Given that DRM already offers a global lock on the hardware,
+this doesn't change the current situation.
+
+3. Object Creation and Destruction
+
+Gem provides explicit memory management primitives. System pages are
+allocated when the object is created, either as the fundamental storage for
+hardware where system memory is used by the graphics processor directly, or
+as backing store for graphics-processor resident memory.
+
+Objects are referenced from user space using handles. These are, for all
+intents and purposes, equivalent to file descriptors. We could simply use
+file descriptors were it not for the small limit (1024) of file descriptors
+available to applications, and for the fact that the X server (a rather
+significant user of this API) uses 'select', which can only handle file
+descriptors below a fixed maximum (FD_SETSIZE). If we could allocate more
+file descriptors, and place these 'higher' in the file descriptor space,
+we'd love to simply use file descriptors.
+
+Objects may be published with a name so that other applications can access
+them. The name remains valid as long as the object exists. Right now, our
+DRI APIs use 32-bit integer names, so that's what we expose here
+
+ A. Creation
+
+ struct drm_gem_create {
+ /**
+ * Requested size for the object.
+ *
+ * The (page-aligned) allocated size for the object
+ * will be returned.
+ */
+ uint64_t size;
+ /**
+ * Returned handle for the object.
+ *
+ * Object handles are nonzero.
+ */
+ uint32_t handle;
+ uint32_t pad;
+ };
+
+ /* usage */
+ create.size = 16384;
+ ret = ioctl (fd, DRM_IOCTL_GEM_CREATE, &create);
+ if (ret == 0)
+ return create.handle;
+
+ Note that the size is rounded up to a page boundary, and that
+ the rounded-up size is returned in 'size'. No name is assigned to
+ this object, making it local to this process.
+
+ If insufficient memory is available, -ENOMEM will be returned.
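+
+ The rounding rule can be sketched as follows. gem_round_size() is a
+ hypothetical helper, not the kernel's code; it assumes a power-of-two
+ page size (4096 bytes on x86) and does not model how the kernel
+ treats a zero-byte request.
```c
#include <assert.h>
#include <stdint.h>

/* Round a requested object size up to a whole number of pages, matching
 * the rounded value the create ioctl hands back in 'size'. page_size must
 * be a power of two. */
static uint64_t gem_round_size(uint64_t size, uint64_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (size + page_size - 1) & ~(page_size - 1);
}
```
+ With 4096-byte pages, the 16384-byte request above comes back
+ unchanged, while a 4097-byte request comes back as 8192.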
+
+ B. Closing
+
+ struct drm_gem_close {
+ /** Handle of the object to be closed. */
+ uint32_t handle;
+ uint32_t pad;
+ };
+
+
+ /* usage */
+ close.handle = <handle>;
+ ret = ioctl (fd, DRM_IOCTL_GEM_CLOSE, &close);
+
+ This call makes the specified handle invalid, and if no other
+ applications are using the object, any necessary graphics hardware
+ synchronization is performed and the resources used by the object
+ released.
+
+ C. Naming
+
+ struct drm_gem_flink {
+ /** Handle for the object being named */
+ uint32_t handle;
+
+ /** Returned global name */
+ uint32_t name;
+ };
+
+ /* usage */
+ flink.handle = <handle>;
+ ret = ioctl (fd, DRM_IOCTL_GEM_FLINK, &flink);
+ if (ret == 0)
+ return flink.name;
+
+ Flink creates a name for the object and returns it to the
+ application. This name can be used by other applications to gain
+ access to the same object.
+
+ D. Opening by name
+
+ struct drm_gem_open {
+ /** Name of object being opened */
+ uint32_t name;
+
+ /** Returned handle for the object */
+ uint32_t handle;
+
+ /** Returned size of the object */
+ uint64_t size;
+ };
+
+ /* usage */
+ open.name = <name>;
+ ret = ioctl (fd, DRM_IOCTL_GEM_OPEN, &open);
+ if (ret == 0) {
+ *sizep = open.size;
+ return open.handle;
+ }
+
+ Open accesses an existing object and returns a handle for it. If the
+ object doesn't exist, -ENOENT is returned. The size of the object is
+ also returned. This handle has all the same capabilities as the
+ handle used to create the object. In particular, the object is not
+ destroyed until all handles are closed.
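+
+ The lifetime rules for handles and names can be modeled as a simple
+ reference count. The sketch below is a hypothetical user-space toy,
+ not kernel code: each handle, whether returned by create or by open,
+ holds one reference, and the object is released only when the last
+ handle is closed.
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of GEM object lifetime (hypothetical, not driver code). */
struct toy_gem_object {
	uint32_t handle_count;	/* live handles in all processes */
	bool destroyed;
};

/* DRM_IOCTL_GEM_CREATE: the new object starts with one handle. */
static void toy_create(struct toy_gem_object *obj)
{
	obj->handle_count = 1;
	obj->destroyed = false;
}

/* DRM_IOCTL_GEM_OPEN: a name stays valid while the object exists,
 * and opening it adds another handle. */
static void toy_open(struct toy_gem_object *obj)
{
	assert(!obj->destroyed);
	obj->handle_count++;
}

/* DRM_IOCTL_GEM_CLOSE: drop one handle; the last close releases the
 * object (in the driver, after any needed GPU synchronization). */
static void toy_close(struct toy_gem_object *obj)
{
	assert(obj->handle_count > 0);
	if (--obj->handle_count == 0)
		obj->destroyed = true;
}
```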
+
+4. Basic read/write operations
+
+By default, gem objects are not mapped to the application's address space;
+getting data in and out of them is done with I/O operations instead. This
+allows the data to reside in otherwise unmapped pages, including pages in
+video memory on an attached discrete graphics card. In addition, using
+explicit I/O operations allows better control over cache contents: because
+graphics devices are generally not cache coherent with the CPU, mapping
+pages used for graphics into an application address space requires the use
+of expensive cache flushing operations. Providing direct control over
+graphics data access ensures that data are handled in the most efficient
+possible fashion.
+
+ A. Reading
+
+ struct drm_gem_pread {
+ /** Handle for the object being read. */
+ uint32_t handle;
+ uint32_t pad;
+ /** Offset into the object to read from */
+ uint64_t offset;
+ /** Length of data to read */
+ uint64_t size;
+ /** Pointer to write the data into. */
+ uint64_t data_ptr; /* void * */
+ };
+
+ This copies data out of the specified object, starting at the
+ specified offset, into the waiting user memory. Any necessary
+ graphics device synchronization and flushing will be done
+ automatically.
+
+ B. Writing
+
+ struct drm_gem_pwrite {
+ /** Handle for the object being written to. */
+ uint32_t handle;
+ uint32_t pad;
+ /** Offset into the object to write to */
+ uint64_t offset;
+ /** Length of data to write */
+ uint64_t size;
+ /** Pointer to read the data from. */
+ uint64_t data_ptr; /* void * */
+ };
+
+ This copies data from user memory into the specified object at
+ the specified offset. Again, device synchronization will be
+ handled by the kernel to ensure user space sees a consistent
+ view of the graphics device.
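+
+ A sketch of filling in a pwrite request from a user buffer follows.
+ fill_pwrite() is a hypothetical helper; its range check against the
+ object size is a client-side precaution, since this document does not
+ say how the kernel treats out-of-range requests. The filled structure
+ would then be passed to the pwrite ioctl on the DRM file descriptor.
```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Local copy of struct drm_gem_pwrite, for illustration only. */
struct drm_gem_pwrite {
	uint32_t handle;
	uint32_t pad;
	uint64_t offset;
	uint64_t size;
	uint64_t data_ptr;	/* void * */
};

/* Hypothetical helper: describe a write of 'len' bytes from 'buf' to
 * 'offset' within an object of 'obj_size' bytes. Returns 0, or -EINVAL
 * if the range would run past the end of the object. */
static int fill_pwrite(struct drm_gem_pwrite *req, uint32_t handle,
		       uint64_t obj_size, uint64_t offset,
		       const void *buf, uint64_t len)
{
	assert(req != NULL);
	/* Two comparisons, so offset + len cannot overflow. */
	if (offset > obj_size || len > obj_size - offset)
		return -EINVAL;
	memset(req, 0, sizeof(*req));	/* zero the pad field too */
	req->handle = handle;
	req->offset = offset;
	req->size = len;
	req->data_ptr = (uint64_t)(uintptr_t)buf;
	return 0;
}
```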
+
+5. Mapping objects to user space
+
+For most objects, reading/writing is the preferred interaction mode.
+However, when the CPU is involved in rendering to cover deficiencies in
+hardware support for particular operations, the CPU will want to directly
+access the relevant objects.
+
+Because mmap is fairly heavyweight, we allow applications to retain maps to
+objects persistently and then update how they're using the memory through a
+separate interface. Applications which fail to use this separate interface
+may exhibit unpredictable behaviour as memory consistency will not be
+preserved.
+
+ A. Mapping
+
+ struct drm_gem_mmap {
+ /** Handle for the object being mapped. */
+ uint32_t handle;
+ uint32_t pad;
+ /** Offset in the object to map. */
+ uint64_t offset;
+ /**
+ * Length of data to map.
+ *
+ * The value will be page-aligned.
+ */
+ uint64_t size;
+ /** Returned pointer the data was mapped at */
+ uint64_t addr_ptr; /* void * */
+ };
+
+ /* usage */
+ mmap.handle = <handle>;
+ mmap.offset = <offset>;
+ mmap.size = <size>;
+ ret = ioctl (fd, DRM_IOCTL_GEM_MMAP, &mmap);
+ if (ret == 0)
+ return (void *) (uintptr_t) mmap.addr_ptr;
+
+
+ B. Unmapping
+
+ munmap (addr, length);
+
+ Nothing strange here, just use the normal munmap syscall.
+
+6. Memory Domains
+
+Graphics devices remain a strong bastion of non-cache-coherent memory. As a
+result, accessing data through one functional unit will end up loading that
+unit's cache with data which then needs to be manually synchronized when
+that data is used with another functional unit.
+
+Tracking where data are resident is done by identifying how functional units
+deal with caches. Each cache is labeled as a separate memory domain. Then,
+each sequence of operations is expected to load data into various read
+domains and leave data in at most one write domain. Gem tracks the read and
+write memory domains of each object and performs the necessary
+synchronization operations when objects move from one domain set to another.
+
+For example, if operation 'A' constructs an image that is immediately used
+by operation 'B', then when the read domain for 'B' is not the same as the
+write domain for 'A', then the write domain must be flushed, and the read
+domain invalidated. If these two operations are both executed in the same
+command queue, then the flush operation can go in between them in the same
+queue, avoiding any kind of CPU-based synchronization and leaving the GPU to
+do the work itself.
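+
+The rule in the example above can be sketched with domains as bit masks.
+The DOMAIN_* values here are hypothetical placeholders rather than the
+driver's real constants, and transition() is an illustrative helper. For
+instance, moving a render target into the sampler (see the Intel domains
+in section 7.1.1) yields a render-cache flush and a sampler-cache
+invalidate.
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical domain bits; real values come from the driver headers. */
#define DOMAIN_CPU     (1u << 0)
#define DOMAIN_RENDER  (1u << 1)
#define DOMAIN_SAMPLER (1u << 2)

struct sync_ops {
	uint32_t flush;		/* caches whose dirty data must be written back */
	uint32_t invalidate;	/* caches whose contents must be discarded */
};

/* Operation A left the object in 'write_domain'; operation B reads it
 * through 'read_domains'. If B reads through any cache other than A's
 * write cache, flush A's cache and invalidate B's other caches. */
static struct sync_ops transition(uint32_t write_domain, uint32_t read_domains)
{
	struct sync_ops ops = { 0, 0 };

	assert((write_domain & (write_domain - 1)) == 0);	/* at most one */
	if (write_domain != 0 && (read_domains & ~write_domain) != 0) {
		ops.flush = write_domain;
		ops.invalidate = read_domains & ~write_domain;
	}
	return ops;
}
```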
+
+6.1 Memory Domains (GPU-independent)
+
+ * DRM_GEM_DOMAIN_CPU.
+
+ Objects in this domain are using caches which are connected to the CPU.
+ Moving objects from non-CPU domains into the CPU domain can involve waiting
+ for the GPU to finish with operations using this object. Moving objects
+ from this domain to a GPU domain can involve flushing CPU caches and chipset
+ buffers.
+
+6.2 GPU-independent memory domain ioctl
+
+This ioctl is independent of the GPU in use. So far, no use other than
+synchronizing objects to the CPU domain has been found; if that turns out
+to be generally true, this ioctl may be simplified further.
+
+ A. Explicit domain control
+
+ struct drm_gem_set_domain {
+ /** Handle for the object */
+ uint32_t handle;
+
+ /** New read domains */
+ uint32_t read_domains;
+
+ /** New write domain */
+ uint32_t write_domain;
+ };
+
+ /* usage */
+ set_domain.handle = <handle>;
+ set_domain.read_domains = <read_domains>;
+ set_domain.write_domain = <write_domain>;
+ ret = ioctl (fd, DRM_IOCTL_GEM_SET_DOMAIN, &set_domain);
+
+ When the application wants to explicitly manage memory domains for
+ an object, it can use this ioctl. Usually, this is only used
+ when the application wants to synchronize object contents between
+ the GPU and CPU-based application rendering. In that case,
+ the <read_domains> would be set to DRM_GEM_DOMAIN_CPU, and if the
+ application were going to write to the object, the <write_domain>
+ would also be set to DRM_GEM_DOMAIN_CPU. After the call, gem
+ guarantees that all previous rendering operations involving this
+ object are complete. The application is then free to access the
+ object through the address returned by the mmap call. Afterwards,
+ when the application again uses the object through the GPU, any
+ necessary CPU flushing will occur and the object will be correctly
+ synchronized with the GPU.
+
+7. Execution (Intel specific)
+
+Managing the command buffers is inherently chip-specific, so the core of gem
+doesn't have any intrinsic functions. Rather, execution is left to the
+device-specific portions of the driver.
+
+The Intel DRM_I915_GEM_EXECBUFFER ioctl takes a list of gem objects, all of
+which are mapped to the graphics device. The last object in the list is the
+command buffer.
+
+7.1. Relocations
+
+Command buffers often refer to other objects, and to allow the kernel driver
+to move objects around, a sequence of relocations is associated with each
+object. Device-specific relocation operations are used to place the
+target-object relative value into the object.
+
+The Intel driver has a single relocation type:
+
+ struct drm_i915_gem_relocation_entry {
+ /**
+ * Handle of the buffer being pointed to by this
+ * relocation entry.
+ *
+ * It's appealing to make this be an index into the
+ * mm_validate_entry list to refer to the buffer, but
+ * handle lookup should be O(1) anyway, and prevents
+ * O(n) search in userland to find what that index is.
+ */
+ uint32_t target_handle;
+
+ /**
+ * Value to be added to the offset of the target
+ * buffer to make up the relocation entry.
+ */
+ uint32_t delta;
+
+ /**
+ * Offset in the buffer the relocation entry will be
+ * written into
+ */
+ uint64_t offset;
+
+ /**
+ * Offset value of the target buffer that the
+ * relocation entry was last written as.
+ *
+ * If the buffer has the same offset as last time, we
+ * can skip syncing and writing the relocation. This
+ * value is written back out by the execbuffer ioctl
+ * when the relocation is written.
+ */
+ uint64_t presumed_offset;
+
+ /**
+ * Target memory domains read by this operation.
+ */
+ uint32_t read_domains;
+
+ /*
+ * Target memory domains written by this operation.
+ *
+ * Note that only one domain may be written by the
+ * whole execbuffer operation, so that where there are
+ * conflicts, the application will get -EINVAL back.
+ */
+ uint32_t write_domain;
+ };
+
+ 'target_handle' is the handle of the target object. This object
+ must be one of the objects listed in the execbuffer request or
+ bad things will happen; the kernel doesn't check for this.
+
+ 'offset' is where, in the source object, the relocation data
+ are written. Each relocation value is a 32-bit value consisting
+ of the location of the target object in the GPU memory space plus
+ the 'delta' value included in the relocation.
+
+ 'presumed_offset' is where user-space believes the target object
+ lies in GPU memory space. If this value matches where the object
+ actually is, then no relocation data are written; the kernel
+ assumes that user space has set up data in the source object
+ using this presumption. This offers a fairly important optimization
+ as writing relocation data requires mapping of the source object
+ into the kernel memory space.
+
+ 'read_domains' and 'write_domain' describe how the source object
+ uses the target object. The kernel unions all of the domain
+ information from all relocations in the execbuffer request. No more
+ than one write_domain is allowed; otherwise -EINVAL is
+ returned. read_domains must contain write_domain. This domain
+ information is used to synchronize buffer contents as described
+ above in the section on domains.
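+
+ The handling of a single relocation can be sketched as below. This is
+ an illustration, not the driver's code: apply_reloc() is hypothetical,
+ 'obj' stands for a CPU mapping of the source object, and the value is
+ stored in host byte order, assumed here to match the GPU's
+ little-endian layout.
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Local copy of struct drm_i915_gem_relocation_entry, for illustration. */
struct drm_i915_gem_relocation_entry {
	uint32_t target_handle;
	uint32_t delta;
	uint64_t offset;
	uint64_t presumed_offset;
	uint32_t read_domains;
	uint32_t write_domain;
};

/* Hypothetical: 'target_gtt_offset' is where the target object actually
 * lives; 'obj' is a CPU mapping of the 'obj_size'-byte source object.
 * Returns true if a relocation value had to be written. */
static bool apply_reloc(struct drm_i915_gem_relocation_entry *r,
			uint64_t target_gtt_offset,
			uint8_t *obj, uint64_t obj_size)
{
	uint32_t value;

	/* User space guessed right: its data is already correct. */
	if (r->presumed_offset == target_gtt_offset)
		return false;

	assert(obj_size >= sizeof(value) &&
	       r->offset <= obj_size - sizeof(value));
	/* 32-bit GPU address of the target plus the delta. */
	value = (uint32_t)(target_gtt_offset + r->delta);
	memcpy(obj + r->offset, &value, sizeof(value));	/* host byte order */
	/* Record the new offset, as the execbuffer ioctl writes it back. */
	r->presumed_offset = target_gtt_offset;
	return true;
}
```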
+
+7.1.1 Memory Domains (Intel specific)
+
+The Intel GPU has several internal caches which are not coherent and hence
+require explicit synchronization. Memory domains provide the necessary data
+to synchronize what is needed while leaving other cache contents intact.
+
+ * DRM_GEM_DOMAIN_I915_RENDER.
+ The GPU 3D and 2D rendering operations use a unified rendering cache, so
+ operations doing 3D painting and 2D blts will use this domain.
+
+ * DRM_GEM_DOMAIN_I915_SAMPLER
+ Textures are loaded by the sampler through a separate cache, so
+ any texture reading will use this domain. Note that the sampler
+ and renderer use different caches, so moving an object from render target
+ to texture source will require a domain transfer.
+
+ * DRM_GEM_DOMAIN_I915_COMMAND
+ The command buffer doesn't have an explicit cache (although it does
+ read ahead quite a bit), so this domain just indicates that the object
+ needs to be flushed to the GPU.
+
+ * DRM_GEM_DOMAIN_I915_INSTRUCTION
+ Fragment programs on Gen3 and all of the programs on later
+ chips use an instruction cache to speed program execution. It must be
+ explicitly flushed when new programs are written to memory by the CPU.
+
+ * DRM_GEM_DOMAIN_I915_VERTEX
+ Vertex data uses two different vertex caches, but they're
+ both flushed with the same instruction.
+
+7.2 Execution object list (Intel specific)
+
+ struct drm_i915_gem_exec_object {
+ /**
+ * User's handle for a buffer to be bound into the GTT
+ * for this operation.
+ */
+ uint32_t handle;
+
+ /**
+ * List of relocations to be performed on this buffer
+ */
+ uint32_t relocation_count;
+ /* struct drm_i915_gem_relocation_entry *relocs */
+ uint64_t relocs_ptr;
+
+ /**
+ * Required alignment in graphics aperture
+ */
+ uint64_t alignment;
+
+ /**
+ * Returned value of the updated offset of the object,
+ * for future presumed_offset writes.
+ */
+ uint64_t offset;
+ };
+
+ Each object involved in a particular execution operation must be
+ listed using one of these structures.
+
+ 'handle' references the object.
+
+ 'relocs_ptr' is a user-mode pointer to an array of 'relocation_count'
+ drm_i915_gem_relocation_entry structs (see above) that
+ define the relocations necessary in this buffer. Note that all
+ relocations must reference other exec_object structures in the same
+ execbuffer ioctl and that those other buffers must come earlier in
+ the exec_object array. In other words, the dependencies mapped by the
+ exec_object relocations must form a directed acyclic graph.
+
+ 'alignment' is the byte alignment necessary for this buffer. Each
+ object has specific alignment requirements; as the kernel doesn't
+ know what each object is being used for, those requirements must be
+ provided by user mode. If an object is used in two different ways,
+ it's quite possible that the alignment requirements will differ.
+
+ 'offset' is a return value, receiving the location of the object
+ during this execbuffer operation. The application should use this
+ as the presumed offset in future operations; if the object does not
+ move, then the kernel need not write relocation data.
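+
+ The relocation section notes that the kernel does not check whether
+ relocation targets appear in the request, so user space may want to
+ verify the ordering itself before submitting. A minimal sketch, using
+ a hypothetical, simplified view of each exec_object (its handle plus
+ the target_handle of each relocation) rather than the real structures:
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one exec_object. */
struct exec_view {
	uint32_t handle;
	const uint32_t *reloc_targets;	/* target_handle of each relocation */
	size_t reloc_count;
};

/* Returns true if every relocation refers to an object that appears
 * earlier in the array, which also rules out cycles. The batch buffer,
 * listed last, may therefore refer to any other object. Quadratic,
 * which is fine for short lists. */
static bool exec_list_ordered(const struct exec_view *objs, size_t n)
{
	assert(objs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		for (size_t r = 0; r < objs[i].reloc_count; r++) {
			bool found = false;

			for (size_t j = 0; j < i && !found; j++)
				found = objs[j].handle == objs[i].reloc_targets[r];
			if (!found)
				return false;
		}
	}
	return true;
}
```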
+
+7.3 Execbuffer ioctl (Intel specific)
+
+ struct drm_i915_gem_execbuffer {
+ /**
+ * List of buffers to be validated with their
+ * relocations to be performed on them.
+ *
+ * These buffers must be listed in an order such that
+ * all relocations a buffer is performing refer to
+ * buffers that have already appeared in the validate
+ * list.
+ */
+ /* struct drm_i915_gem_exec_object *buffers */
+ uint64_t buffers_ptr;
+ uint32_t buffer_count;
+
+ /**
+ * Offset in the batchbuffer to start execution from.
+ */
+ uint32_t batch_start_offset;
+
+ /**
+ * Bytes used in batchbuffer from batch_start_offset
+ */
+ uint32_t batch_len;
+ uint32_t DR1;
+ uint32_t DR4;
+ uint32_t num_cliprects;
+ uint64_t cliprects_ptr; /* struct drm_clip_rect *cliprects */
+ };
+
+
+ 'buffers_ptr' is a user-mode pointer to an array of 'buffer_count'
+ drm_i915_gem_exec_object structures which contains the complete set
+ of objects required for this execbuffer operation. The last entry in
+ this array, the 'batch buffer', is the buffer of commands which will
+ be linked to the ring and executed.
+
+ 'batch_start_offset' is the byte offset within the batch buffer which
+ contains the first command to execute. So far, we haven't found a
+ reason to use anything other than '0' here, but the thought was that
+ some space might be allocated for additional initialization which
+ could be skipped in some cases. This must be a multiple of 4.
+
+ 'batch_len' is the length, in bytes, of the data to be executed
+ (i.e., the amount of data after batch_start_offset). This must
+ be a multiple of 4.
+
+ 'num_cliprects' and 'cliprects_ptr' reference an array of
+ drm_clip_rect structures that is num_cliprects long. The entire
+ batch buffer will be executed multiple times, once for each
+ rectangle in this list. If num_cliprects is 0, then no clipping
+ rectangle will be set.
+
+ 'DR1' and 'DR4' are portions of the 3DSTATE_DRAWING_RECTANGLE
+ command which will be queued when this operation is clipped
+ (num_cliprects != 0).
+
+ DR1 bit definition
+ 31 Fast Scissor Clip Disable (debug only).
+ Disables a hardware optimization that
+ improves performance. This should have
+ no visible effect, other than reducing
+ performance.
+
+ 30 Depth Buffer Coordinate Offset Disable.
+ This disables the addition of the
+ depth buffer offset bits which are used
+ to change the location of the depth buffer
+ relative to the front buffer.
+
+ 27:26 X Dither Offset. Specifies the X pixel
+ offset to use when accessing the dither table.
+
+ 25:24 Y Dither Offset. Specifies the Y pixel
+ offset to use when accessing the dither
+ table.
+
+ DR4 bit definition
+ 31:16 Drawing Rectangle Origin Y. Specifies the Y
+ origin of coordinates relative to the
+ draw buffer.
+
+ 15:0 Drawing Rectangle Origin X. Specifies the X
+ origin of coordinates relative to the
+ draw buffer.
+
+ As you can see, these two fields are necessary for correctly
+ offsetting drawing within a buffer which contains multiple surfaces.
+ Note that DR1 is only used on Gen3 and earlier hardware and that
+ newer hardware sticks the dither offset elsewhere.
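+
+ The constraints and bit layouts above can be expressed as small
+ helpers. These are hypothetical illustrations: the sketch assumes
+ non-negative drawing-rectangle origins and packs only the dither
+ fields of DR1, leaving bits 31 and 30 clear.
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* batch_start_offset and batch_len must both be multiples of 4. */
static bool batch_args_valid(uint32_t batch_start_offset, uint32_t batch_len)
{
	return batch_start_offset % 4 == 0 && batch_len % 4 == 0;
}

/* DR4: origin Y in bits 31:16, origin X in bits 15:0. */
static uint32_t pack_dr4(uint16_t origin_x, uint16_t origin_y)
{
	return ((uint32_t)origin_y << 16) | origin_x;
}

/* DR1 (Gen3 and earlier): X dither offset in bits 27:26 and Y dither
 * offset in bits 25:24; each is a 2-bit field. */
static uint32_t pack_dr1_dither(unsigned x_dither, unsigned y_dither)
{
	assert(x_dither < 4 && y_dither < 4);
	return ((uint32_t)x_dither << 26) | ((uint32_t)y_dither << 24);
}
```
+ For a surface whose drawing origin is at (100, 200), pack_dr4(100, 200)
+ gives 0x00C80064.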
+
+7.3.1 Detailed Execution Description
+
+ Execution of a single batch buffer requires several preparatory
+ steps to make the objects visible to the graphics engine and resolve
+ relocations to account for their current addresses.
+
+ A. Mapping and Relocation
+
+ Each exec_object structure in the array is examined in turn.
+
+ If the object is not already bound to the GTT, it is assigned a
+ location in the graphics address space. If no space is available in
+ the GTT, some other object will be evicted. This may require waiting
+ for previous execbuffer requests to complete before that object can
+ be unmapped. With the location assigned, the pages for the object
+ are pinned in memory using find_or_create_page and the GTT entries
+ are updated to point at the relevant pages using drm_agp_bind_pages.
+
+ Then the array of relocations is traversed. Each relocation record
+ looks up the target object and, if the presumed offset does not
+ match the current offset (remember that this buffer has already been
+ assigned an address as it must have been mapped earlier), the
+ relocation value is computed using the current offset. If the
+ object is currently in use by the graphics engine, writing the data
+ out must wait until the object is no longer busy.
+ Once it is idle, then the page containing the relocation is mapped
+ by the CPU and the updated relocation data written out.
+
+ The read_domains and write_domain entries in each relocation are
+ used to compute the new read_domains and write_domain values for the
+ target buffers. The actual execution of the domain changes must wait
+ until all of the exec_object entries have been evaluated as the
+ complete set of domain information will not be available until then.
+
+ B. Memory Domain Resolution
+
+ After all of the new memory domain data has been pulled out of the
+ relocations and computed for each object, the list of objects is
+ again traversed and the new memory domains compared against the
+ current memory domains. There are two basic operations involved here:
+
+ * Flushing the current write domain. If the new read domains
+ are not equal to the current write domain, then the current
+ write domain must be flushed. Otherwise, reads will not see data
+ present in the write domain cache. In addition, any new read domains
+ other than the current write domain must be invalidated to ensure
+ that the flushed data are re-read into their caches.
+
+ * Invalidating new read domains. Any domains which were not currently
+ used for this object must be invalidated, as old objects which
+ were mapped at the same location may have stale data in the new
+ domain caches.
+
+ If the CPU cache is being invalidated and some GPU cache is being
+ flushed, then we'll have to wait for rendering to complete so that
+ any pending GPU writes will be complete before we flush the GPU
+ cache.
+
+ If the CPU cache is being flushed, then we use 'clflush' to write
+ the dirty cache lines out of the CPU cache.
+
+ Because the GPU caches cannot be partially flushed or invalidated,
+ we don't actually flush them during this traversal stage. Rather, we
+ gather the invalidate and flush bits up in the device structure.
+
+ Once all of the object domain changes have been evaluated, then the
+ gathered invalidate and flush bits are examined. For any GPU flush
+ operations, we emit a single MI_FLUSH command that performs all of
+ the necessary flushes. We then look to see if the CPU cache was
+ flushed. If so, we use the chipset flush magic (writing to a special
+ page) to get the data out of the chipset and into memory.
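+
+ The traversal and gathering of flush and invalidate bits can be
+ sketched as below. plan_domains() and the DOMAIN_* bit values are
+ hypothetical; the sketch follows the two rules above, folds all GPU
+ flushes into one mask for a single MI_FLUSH, and records whether a
+ CPU flush (clflush plus the chipset flush) is needed. It does not
+ model the wait for rendering.
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical domain bits; real values come from the driver headers. */
#define DOMAIN_CPU     (1u << 0)
#define DOMAIN_RENDER  (1u << 1)
#define DOMAIN_SAMPLER (1u << 2)

struct obj_domains {
	uint32_t read_domains;	/* current read domains */
	uint32_t write_domain;	/* current write domain, 0 if none */
	uint32_t new_read;	/* gathered from the relocations */
	uint32_t new_write;
};

struct flush_plan {
	uint32_t gpu_flush;	/* all emitted by one MI_FLUSH */
	uint32_t gpu_invalidate;
	bool cpu_flush;		/* clflush, then the chipset flush */
};

static struct flush_plan plan_domains(struct obj_domains *objs, size_t n)
{
	struct flush_plan p = { 0, 0, false };

	assert(objs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct obj_domains *o = &objs[i];
		uint32_t flush = 0;
		/* Caches the object was not already valid in may be stale. */
		uint32_t invalidate = o->new_read & ~o->read_domains;

		/* Readers outside the write domain need the data flushed out,
		 * and their own caches invalidated so they re-read it. */
		if (o->write_domain && (o->new_read & ~o->write_domain)) {
			flush = o->write_domain;
			invalidate |= o->new_read & ~o->write_domain;
		}
		if (flush & DOMAIN_CPU)
			p.cpu_flush = true;
		p.gpu_flush |= flush & ~DOMAIN_CPU;
		/* This sketch only gathers GPU invalidations. */
		p.gpu_invalidate |= invalidate & ~DOMAIN_CPU;

		o->read_domains = o->new_read;
		o->write_domain = o->new_write;
	}
	return p;
}
```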
+
+ C. Queuing Batch Buffer to the Ring
+
+ With all of the objects resident in graphics memory space, and all
+ of the caches prepared with appropriate data, the batch buffer
+ object can be queued to the ring. If there are clip rectangles, then
+ the buffer is queued once per rectangle, with suitable clipping
+ inserted into the ring just before the batch buffer.
+
+ D. Creating an IRQ Cookie
+
+ Right after the batch buffer is placed in the ring, a request to
+ generate an IRQ is added to the ring along with a command to write a
+ marker into memory. When the IRQ fires, the driver can look at the
+ memory location to see where in the ring the GPU has passed. This
+ magic cookie value is stored in each object used in this execbuffer
+ command; it is used wherever you saw 'wait for rendering' above in
+ this document.
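+
+ This document does not say how the cookie is compared. Assuming it is
+ a 32-bit sequence number that increases with each execbuffer, a
+ wraparound-safe test for whether the GPU has passed an object's cookie
+ might look like the following; seqno_passed() is a hypothetical name.
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 'hw_seqno' is the marker the GPU most recently wrote to memory;
 * 'seqno' is the cookie stored in an object. The signed difference stays
 * correct across 32-bit wraparound as long as the two values are less
 * than 2^31 apart. The uint32_t-to-int32_t conversion is
 * implementation-defined in C but wraps modulo 2^32 on GCC and Clang. */
static bool seqno_passed(uint32_t hw_seqno, uint32_t seqno)
{
	return (int32_t)(hw_seqno - seqno) >= 0;
}
```
+ In this model, an object is idle once seqno_passed() is true for its
+ cookie.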
+
+ E. Writing back the new object offsets
+
+ So that the application has a better idea what to use for
+ 'presumed_offset' values later, the current object offsets are
+ written back to the exec_object structures.
+
+
+8. Other misc Intel-specific functions.
+
+To complete the driver, a few other functions were necessary.
+
+8.1 Initialization from the X server
+
+As the X server is currently responsible for apportioning memory between 2D
+and 3D, it must tell the kernel which region of the GTT aperture is
+available for 3D objects to be mapped into.
+
+ struct drm_i915_gem_init {
+ /**
+ * Beginning offset in the GTT to be managed by the
+ * DRM memory manager.
+ */
+ uint64_t gtt_start;
+ /**
+ * Ending offset in the GTT to be managed by the DRM
+ * memory manager.
+ */
+ uint64_t gtt_end;
+ };
+ /* usage */
+ init.gtt_start = <gtt_start>;
+ init.gtt_end = <gtt_end>;
+ ret = ioctl (fd, DRM_IOCTL_I915_GEM_INIT, &init);
+
+ The GTT aperture between gtt_start and gtt_end will be used to map
+ objects. This also tells the kernel that the ring can be used,
+ pulling the ring addresses from the device registers.
+
+8.2 Pinning objects in the GTT
+
+For scan-out buffers and the current shared depth and back buffers, we need
+to have them always available in the GTT, at least for now. Pinning means
+locking their pages in memory and keeping them at a fixed offset in the
+graphics aperture. These operations are available only to root.
+
+ struct drm_i915_gem_pin {
+ /** Handle of the buffer to be pinned. */
+ uint32_t handle;
+ uint32_t pad;
+
+ /** alignment required within the aperture */
+ uint64_t alignment;
+
+ /** Returned GTT offset of the buffer. */
+ uint64_t offset;
+ };
+
+ /* usage */
+ pin.handle = <handle>;
+ pin.alignment = <alignment>;
+ ret = ioctl (fd, DRM_IOCTL_I915_GEM_PIN, &pin);
+ if (ret == 0)
+ return pin.offset;
+
+ Pinning an object ensures that it will not be evicted from the GTT
+ or moved. It will stay resident until destroyed or unpinned.
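+
+ A caller might sanity-check the offset returned by the pin ioctl as
+ sketched below. pin_offset_ok() is hypothetical; it assumes that
+ pinned objects are placed inside the range given to
+ DRM_IOCTL_I915_GEM_INIT, with gtt_end treated as exclusive, and that
+ an alignment of 0 means no constraint.
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check on a pinned object's returned GTT offset. Assumes
 * 'alignment' is a power of two (0 meaning no constraint) and that the
 * object must fit within [gtt_start, gtt_end). */
static bool pin_offset_ok(uint64_t offset, uint64_t size, uint64_t alignment,
			  uint64_t gtt_start, uint64_t gtt_end)
{
	if (alignment != 0) {
		assert((alignment & (alignment - 1)) == 0);
		if (offset & (alignment - 1))
			return false;	/* misaligned */
	}
	/* Written to avoid overflow in offset + size. */
	return offset >= gtt_start && offset <= gtt_end &&
	       size <= gtt_end - offset;
}
```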
+
+ struct drm_i915_gem_unpin {
+ /** Handle of the buffer to be unpinned. */
+ uint32_t handle;
+ uint32_t pad;
+ };
+
+ /* usage */
+ unpin.handle = <handle>;
+ ret = ioctl (fd, DRM_IOCTL_I915_GEM_UNPIN, &unpin);
+
+ Unpinning an object makes it possible to evict this object from the
+ GTT. It doesn't ensure that it will be evicted, just that it may.
+