What is mGPU in DirectX 12

The DX12 API places more responsibilities on the programmer than any former DirectX™ API. This starts with resource state barriers and continues with the use of fences to synchronize command queues. Likewise, illegal API usage won’t be caught or corrected by the DX runtime or the driver. To stay on top of things, the developer needs to lean heavily on the debug runtime and pay close attention to any errors it reports. Also make sure you are thoroughly familiar with the DX12 feature specifications.

Prefer a task graph architecture for parallel draw submission

This way you may achieve sufficient parallelism in terms of draw submission whilst making sure that resource and command queue dependencies get respected

The idea is to have the worker threads generate command lists and the master thread pick those up and submit them

The app has to replace driver reasoning about how to most efficiently drive the underlying hardware
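
The worker/master split described above can be sketched without any Direct3D calls. This is only an illustration under stated assumptions: `RecordedList`, `recordInParallel` and `submit` are hypothetical names, and a plain vector of strings stands in for a real command list. Each worker records into its own list, and the calling thread collects the lists in a fixed order once all workers have finished.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recorded command list: just a list of draw names.
struct RecordedList {
    std::vector<std::string> commands;
};

// Each worker records into its own list, so no list is touched by two threads.
std::vector<RecordedList> recordInParallel(std::size_t workerCount,
                                           std::size_t drawsPerWorker) {
    assert(workerCount > 0);
    std::vector<RecordedList> lists(workerCount);
    std::vector<std::thread> workers;
    workers.reserve(workerCount);
    for (std::size_t w = 0; w < workerCount; ++w) {
        workers.emplace_back([&lists, w, drawsPerWorker] {
            for (std::size_t d = 0; d < drawsPerWorker; ++d) {
                lists[w].commands.push_back("draw " + std::to_string(w) + ":" +
                                            std::to_string(d));
            }
        });
    }
    // The master thread waits for all recording to finish before submitting.
    for (std::thread& t : workers) t.join();
    return lists;
}

// Master-thread submission: flattens the lists in worker order, which keeps
// the submission order deterministic regardless of thread timing.
std::vector<std::string> submit(const std::vector<RecordedList>& lists) {
    std::vector<std::string> queue;
    for (const RecordedList& l : lists)
        queue.insert(queue.end(), l.commands.begin(), l.commands.end());
    return queue;
}
```

In a real renderer the submission step would correspond to a call to ExecuteCommandLists on the command queue, and a task graph would replace the simple join with explicit dependencies.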

Don’ts

Don’t rely on the driver to parallelize any Direct3D12 work in driver threads

On DX11 the driver does farm off asynchronous tasks to driver worker threads where possible – this doesn’t happen anymore under DX12
While the total cost of work submission in DX12 has been reduced, the amount of work measured on the application’s thread may be larger due to the loss of driver threading. The more efficiently one can use parallel hardware cores of the CPU to submit work in parallel, the more benefit in terms of draw call submission performance can be expected.

Accept the fact that you are responsible for achieving and controlling GPU/CPU parallelism

Submitting work to command lists doesn’t start any work on the GPU
Calls to ExecuteCommandLists() finally do start work on the GPU

Recording commands is a CPU intensive operation and no driver threads come to the rescue
Command lists are not free threaded so parallel work submission means submitting to multiple command lists

You still need a reasonable number of command lists for efficient parallel work submission
Fences force the splitting of command lists for various reasons (multiple command queues, picking up the results of queries)

No need to spend CPU time once again

This allows bundles to be reused with less overhead as it facilitates more thoroughly cooked bundles
Check carefully if the use of a separate compute command queues really is advantageous

Don’ts

Don’t use bundles to record more than a few draw calls (e.g. ~12 draw calls is fine)

Otherwise you typically limit the reusability of the bundle

This may lead to bubbles in the asynchronous compute queue
Switch compute workload to graphics workloads in this case if possible

Small command lists can sometimes complete faster than the OS scheduler on the CPU can submit new ones. This can result in wasted idle GPU cycles.
The OS takes 50-80 microseconds to schedule command lists after the previous ExecuteCommandLists call. If a command list, or all command lists in the call, execute faster than that, there will be a bubble in the HW queue
Check for bubbles using GPUView
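
As a rough aid alongside GPUView, the 50-80 microsecond figure above can be turned into a simple check. This is a minimal sketch, assuming you already have per-command-list GPU time estimates (for example from timestamp queries); `classifyBatch` and `BubbleRisk` are hypothetical names, and the thresholds are just the two ends of the range quoted in the text.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Scheduling latency window quoted in the text: 50-80 microseconds.
constexpr double kSchedulingLatencyMinUs = 50.0;
constexpr double kSchedulingLatencyMaxUs = 80.0;

enum class BubbleRisk { None, Possible, Likely };

// gpuTimesUs: estimated GPU execution time of each command list in one
// ExecuteCommandLists call (hypothetical input supplied by the caller).
BubbleRisk classifyBatch(const std::vector<double>& gpuTimesUs) {
    assert(!gpuTimesUs.empty());
    const double total =
        std::accumulate(gpuTimesUs.begin(), gpuTimesUs.end(), 0.0);
    if (total < kSchedulingLatencyMinUs) return BubbleRisk::Likely;    // GPU finishes before the next batch can arrive
    if (total < kSchedulingLatencyMaxUs) return BubbleRisk::Possible;  // inside the quoted latency range
    return BubbleRisk::None;
}
```

A batch classified as `Likely` is a candidate for merging with neighbouring command lists before submission.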

This limits your ability to fully utilize all your CPU cores
Also building a few large command lists means you’ll potentially find it harder to keep the GPU from going idle

You may waste the opportunity to keep the GPU working in parallel with the recording of other command lists

There are usually many per-frame changes in terms of object visibility etc.
Post-processing may be an exception

Too many threads will oversubscribe your CPU resources, whilst too many command lists may accumulate too much overhead

PSO creation is where shader compilation and related stalls happen

Gets you up and running faster even if you are not running the most optimal PSO/shader yet
It is your job to generate shader specializations – the driver will not generate constant optimized shader variants behind your back

The driver-managed shader disk cache may come to the rescue though

A PSO doesn’t necessarily map to an atomic state change on the GPU

This allows for more possibilities for PSO reuse

This allows for the compiler to do a better job at optimizing texture accesses. We have seen frame rate improvements of > 1% when toggling this flag on.

Don’ts

Don’t toggle between compute and graphics on the same command queue more than absolutely necessary

This is still a heavyweight switch to make

Again, this is still a heavyweight switch to make

It is really important to create PSO asynchronously and early enough before they get used

Tread carefully with thread priorities for PSO compilation threads

Use Idle priority if there is no ‘hurry’ to prevent slowdowns for game threads
Consider temporarily boosting priorities when there is a ‘hurry’

Place constants and CBVs (SRVs and UAVs only if you have directly into the root signature if possible on NVIDIA Hardware

Start with the entries for the pixel stage

Constants that sit directly in root can speed up pixel shaders significantly on NVIDIA hardware – specifically consider shader constants that toggle parts of uber-shaders
CBVs that sit in the root signature can also speed up pixel shaders significantly on NVIDIA hardware

We have seen significant speedups through managing changes properly

There is overhead in the driver and on the GPU for each stage that needs to see those views
Use the DENY_*_ACCESS flags to explicitly limit resource-shader visibility

The problem is not the change of the RS but there is usually a follow up cost of initializing the root signature entries after such a change

For these Tiers, the application must fill in all descriptors defined in the root signature (and descriptor tables used) by the time the command list executes. This is even the case if the used shaders may not reference all these descriptors.
For Tier 3 do keep your unused descriptors bound – don’t waste time unbinding them as this can easily introduce state thrashing bottlenecks

Don’ts

Don’t group CBVs into CBV descriptor tables that have a different update frequency

Ideally all CBVs in a table would need updating at the same time

Try to aim at using a minimum set of entries for each set of materials

For current drivers the deny flags only work when D3D12_SHADER_VISIBILITY_ALL is set

A change in root signatures removes/clears all resource binding used in the previous root signature

Don’ts

Don’t forget that Allocator and Lists consume GPU memory

Allocators that are too large may limit your GPU working set in other undesirable ways

Save the overhead for allocator creation/destruction

This leads to worst case size allocator

Not resetting an allocator means leaking memory!

This is illegal and may free or overwrite memory that the command list is still using

Avoid vidmem overcommitment

Use IDXGIAdapter3::QueryVideoMemoryInfo() to gain accurate information about the available video memory
Foreground app isn’t necessarily allocated all, or even a high %, of vidmem

Respond to budget changes from OS

Consider using IDXGIAdapter3::RegisterVideoMemoryBudgetChangeNotificationEvent
Consider capping graphics settings based on memory available

Break up command lists so that the amount of memory referenced in each one fits in vidmem.
Keep track of what's used per CL
Consider using MakeResident/Evict before/after executing command lists when you are going over the vidmem budget
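
One way to picture the splitting advice above is a greedy pass over the recorded work. The sketch below is an assumption-laden illustration, not a residency manager: `WorkItem` and `splitByBudget` are hypothetical names, and each item simply carries the number of bytes it references.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one chunk of recorded work and the video
// memory (in bytes) of the resources it references.
struct WorkItem {
    std::uint64_t referencedBytes;
};

// Greedily groups consecutive work items so that each group's referenced
// memory stays within budgetBytes. Simplification: a resource shared by
// several items is counted once per item, so the estimate is conservative.
std::vector<std::vector<WorkItem>> splitByBudget(
    const std::vector<WorkItem>& items, std::uint64_t budgetBytes) {
    assert(budgetBytes > 0);
    std::vector<std::vector<WorkItem>> groups;
    std::uint64_t used = 0;
    for (const WorkItem& item : items) {
        // An item larger than the whole budget still gets its own group;
        // the caller has to handle that case separately.
        if (groups.empty() || used + item.referencedBytes > budgetBytes) {
            groups.emplace_back();
            used = 0;
        }
        groups.back().push_back(item);
        used += item.referencedBytes;
    }
    return groups;
}
```

Each resulting group would map to one command list (or one ExecuteCommandLists call), with MakeResident before and Evict after when the application is over budget.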

Use committed resources where possible to give the driver more knowledge

This allows the driver to better manage GPU memory
A good use case for placed resources is resource heaps that are, for example, used during streaming and hold different sets of read-only textures over their lifetime

This lowers the overhead inside the driver and the GPU

Do drop mip levels of tiled resources as needed
Need to handle the case when MakeResident fails

UAV count across all stages may be limited to 8 or 64
CBV count may be limited to 14 per stage
Sampler count may be limited to 16 per stage

See tiled resource specification for a good roll-up

On some heap tiers there may be more restrictions than on others
Check resource heap tier capabilities

Copying only the depth part of the resource may hit a slow path

Don’ts

Don’t go overboard with your re-use count for placed resources for depth stencil and render target resources

On top of the need to clear those resources before they can be rendered to, there may be other hardware dependent book-keeping operations that make those switches expensive

Don’t rely on the availability of tiled resources (check cap bits)

Still need to think about different DX12 hardware classes

Depending on the underlying GPU architecture the memory may or may not be segmented

Cost might be deferred until another MakeResident call utilizes the memory

Use GPUView analysis to find out about deferred paging requests

Better to use Evict and MakeResident where possible
Saves the overhead of creation and destruction of resources

Minimize the use of barriers and fences

We have seen redundant barriers and associated wait for idle operations as a major performance problem for DX11 to DX12 ports

The DX11 driver is doing a great job of reducing barriers – now under DX12 you need to do it

Stay away from using D3D12_RESOURCE_STATE_GENERIC_READ unless you really need every single flag that is set in this combination of flags
Redundant flags may trigger redundant flushes and stalls and slow down your game unnecessarily
To reiterate: We have seen redundant and/or overly conservative barrier flags and their associated wait for idle operations as a major performance problem for DX11 to DX12 ports.

Adding false dependencies adds redundancy

This way the worst case can be picked instead of sequentially going through all barriers

Use the _BEGIN_ONLY/_END_ONLY flags
This helps the driver do a more efficient job

Don’ts

Don’t insert redundant barriers

This limits parallelism
A transition from D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE to D3D12_RESOURCE_STATE_RENDER_TARGET and back without any draw calls in-between is redundant

Avoid read-to-read barriers

Get the resource in the right state for all subsequent reads

For transitions from write-to-read states, ensure the transition target is inclusive of all required read states needed before the next transition to write. This is done from the API by combining read state flags– and is preferred over transitioning from read-to-read in subsequent ResourceBarrier calls.
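
The difference between one combined write-to-read transition and a chain of read-to-read transitions can be shown with a toy state tracker. This is a sketch under stated assumptions: the state bits below are local to the example and are not the real D3D12_RESOURCE_STATES enumerators, and `transition` only counts barriers instead of recording them.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative state bits; names and values are local to this sketch.
enum State : std::uint32_t {
    kRenderTarget       = 1u << 0,  // write state
    kPixelShaderRead    = 1u << 1,  // read state
    kNonPixelShaderRead = 1u << 2,  // read state
    kCopySource         = 1u << 3,  // read state
};

constexpr std::uint32_t kReadMask =
    kPixelShaderRead | kNonPixelShaderRead | kCopySource;

struct TrackedResource {
    std::uint32_t state = kRenderTarget;
    int barriersIssued = 0;
};

// Requests that the resource be usable in all states of `needed`.
// Returns true if a barrier had to be recorded.
bool transition(TrackedResource& r, std::uint32_t needed) {
    // Either a combination of read states or the single write state.
    assert(needed != 0 &&
           ((needed & ~kReadMask) == 0 || needed == kRenderTarget));
    if ((r.state & needed) == needed) return false;  // already covered: skip
    r.state = needed;
    ++r.barriersIssued;
    return true;
}
```

Transitioning once to `kPixelShaderRead | kNonPixelShaderRead` costs a single barrier and makes both later reads free, whereas requesting the two read states one after the other costs two.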

This doesn’t allow the driver to pick the worst case of a set of barriers

Use the DX12 standard checks to find out how many GPUs are in your system

No need to use vendor specific APIs anymore
Make sure to check the CROSS_NODE_SHARING tier

Make full use of the explicit control over resources

Create resources that need to be synchronized on each node

Use the proper CreationNodeMask
Make them visible on other nodes that need access

Always compare performance to a tier 1 type implementation

Keep the main queue open to do rendering work in parallel

Don’ts

Don’t rely on any surface syncs to be done automatically (implicitly behind your back)
You should take full control over what syncs happen if you need them

Do use flip mode swap-chains
Do use SetFullScreenState(TRUE) along with a (borderless) fullscreen window and a non-windowed flip model swap-chain to switch to true immediate independent flip mode
This is at the moment, according to Microsoft, the only mode you can get unleashed frame rates with tearing out of D3D12 when calling Present(0,0)
Any other mode doesn’t allow unlimited frame rates with tearing

The flag is not necessary to achieve unlimited frame rates (see above) if your window size matches the current screen resolution
If this flag is set, trying to change resolution using ResizeTarget() before calling SetFullScreenState(TRUE) works fine and you’ll achieve uncapped FPS. If this flag is not set, trying to change resolution using ResizeTarget() before calling SetFullScreenState(TRUE) results in no change of display resolution. Your target will get stretched to the current resolution and FPS won’t be uncapped.

Use IDXGISwapChain2::SetMaximumFrameLatency(MaxLatency) to set the desired latency

For this to work you need to create your swap-chain with the DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT flag set.

At the default latency of 3 this means that your FPS can’t go higher than 2 * RefreshRate. So for a 60Hz monitor the FPS can’t go above 120 FPS.

Please note that this will lead to some frames never being even partially visible, but may be a good solution for benchmarking
Using the waitable object swapchain and GetFrameLatencyWaitableObject(), one can test if a buffer is available before rendering to it or presenting it – the following options are available:

Use an additional off-screen surface
- Render to the off-screen surface. Test the waitable object with timeout 0 to check if a buffer is available. If so copy to the swap-chain back buffer and Present(). If no buffer is available start the frame over again. At the beginning of the frame, test the waitable object. If it succeeds, render to the available swapchain buffer. If it fails, render to the offscreen surface.
Use a 3 or 4 buffer swapchain
- Render directly to a back buffer. Before calling Present(), test the waitable object. If it succeeds, call Present(), if not, start over.

Don’ts

Don’t forget that there's a per swap-chain limit of 3 queued frames before DXGI will start to block in Present().

Set the DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT flag on swapchain creation and use IDXGISwapChain2::SetMaximumFrameLatency to modify this default value

Don’t ever call SetStablePowerState(TRUE) from game engine code.

Do consider carefully whether or not you need highly stable results at the expense of lower performance. See the discussion in our blog.

If and only if you want its stable results, do call SetStablePowerState from a separate, standalone application.

To avoid confusion, do make it crystal clear when the function is in effect or not. (One way to make it obvious is to record clocks along with performance results. We often do that. Our blog has a code snippet showing how to query GPU clocks on NVIDIA.)

Do use the DX12 API and our standalone program to stabilize the clocks when testing other APIs.

It is commonly believed that pairs of graphics processors are inefficient in modern games. In reality, however, the technology is still widely supported and delivers impressive results. We will also discuss how multi-GPU operation works in Direct3D 12 and what the future holds for "two-headed" systems.

This article was conceived as a tribute to an old tradition: we wanted to admire expensive multi-GPU systems, as we did in 2014 and 2016, and then we were prepared to admit with disappointment that SLI and CrossFire had lost all practical value. Instead, it turned out to be a kind of sequel to our recent publications on next-generation APIs (see the first and second parts of the study), since most of the games in our test suite support Direct3D 12. Under this API a pair of graphics cards works quite differently than in Direct3D 11, and there is no point in limiting the comparison to the formally outdated but still dominant Direct3D 11 interface.

As for the main question (is there any benefit left in SLI and CrossFire), we have to admit that the funeral of dual-adapter systems is postponed once again! Yes, the change of API brought new problems related to the Multi-Adapter implementation in Direct3D 12, to PCI Express bandwidth, and to the notorious CPU dependence of games. On the other hand, it is Direct3D 12 that carries the means to overcome them.

On the whole, both discrete GPU vendors are now lukewarm about multi-adapter rendering. This shows in how the variety of configurations in which SLI and CrossFire work has shrunk. Dual-GPU graphics cards, which used to accompany every architecture update, have disappeared. AMD and NVIDIA drivers officially support no more than two GPUs. Yet in 2011 we tested setups of three and four GeForce GTX 580 class cards, and back then games could load three top-tier accelerators quite well. In addition, NVIDIA has pushed SLI to the very top of its product line: the lowest card in the GeForce 10 series that has SLI connectors is the comparatively powerful and expensive GeForce GTX 1070.

Of course, it is mostly powerful cards that get installed in pairs, and the lost ability to use three or four GPUs was largely useful only for setting record scores in 3DMark. But the loss of interest in SLI and CrossFire on the part of the companies that dominate the graphics card market reflects the general stagnation of this field, caused by the mismatch between the complexity of the technical problems and low demand from ordinary gamers.

Conceptually, rendering with multiple GPUs is a fairly obvious idea that follows logically from the high parallelism of the computations, but in practice such technologies have always been rather capricious and required constant attention from driver writers. In Direct3D 11 the driver alone is responsible for splitting the load between several graphics adapters. The game engine, as with a single GPU, issues commands through a common queue, and the driver distributes them so that while the first GPU produces its frame, the second GPU works on the next frame (the AFR method, Alternate Frame Rendering). For completeness, multi-adapter rendering is not limited to AFR. In rare cases the SFR method (Split Frame Rendering) is used, in which each GPU processes its own part of a single frame, and Direct3D 12 provides for more complex modes as well, which are covered a little later.

Ideally, multi-adapter rendering in Direct3D 11 is transparent to the game engine and requires no extra effort from its developers, but in practice it is not that simple. AFR has to cope with limitations whenever there are dependencies between consecutive frames, which is practically unavoidable in modern games. As a result, a multi-adapter system works adequately only when there are settings profiles for each specific game, through which the driver gets hints about what the engine is doing, and the game developers, ideally, need to understand what the driver is trying to do. It also helps to use a proprietary API (such as NVAPI for NVIDIA GPUs) that gives access to the GPU bypassing the Direct3D abstraction layer, but even in ideal conditions not every game can be made to scale well across several adapters.

The notorious "CPU dependence" contributes to this problem too: graphics chips are currently advancing faster than the performance of x86 CPUs with a small number of threads, and this is noticeable even in configurations with one powerful graphics card, let alone two. Finally, as we see time after time in group GPU tests in popular games, a single top-tier adapter will most likely satisfy any gamer. On the other hand, the difference in image quality between low and maximum graphics settings is no longer what it was in the golden days of SLI and CrossFire. As a result, the urge to raise the frame rate at any cost has faded, and if so, why should the people working on GPU drivers invest effort in the niche technology of multi-adapter rendering? For now the work continues (especially on NVIDIA's side), but with the arrival of Direct3D 12 the situation may change completely, for better or for worse.

Microsoft's new graphics API provides two different approaches to programming multi-adapter rendering. In Implicit Multi-Adapter mode the driver splits the work between GPUs, as in Direct3D 11, with all its pros and cons. In Explicit Multi-Adapter mode, by contrast, the game engine has full control over GPU resources, which is both a blessing and a curse, because then everything depends on the developers' willingness to invest in Multi-Adapter support.

With due effort, programmers can extract performance from a GPU pair that was fundamentally unattainable in the previous API version. In particular, they can abandon AFR and apply complex load distribution methods such as frame pipelining, in which several GPUs perform different stages of rendering one frame and the problem of dependencies between adjacent frames does not arise at all. Pipelining can also be used for rendering quality rather than frame rate, for example by loading the second GPU with global illumination, ray tracing, physics and so on.

Explicit Multi-Adapter has two implementation methods: Linked Node and Unlinked Node. The first is the analog of SLI and CrossFire within the new multi-adapter rendering paradigm. As in those proprietary technologies, which work at the driver level under Direct3D 11, several adapters are presented through common command queues: graphics, general-purpose compute, and a Copy queue for transferring data over the PCI Express bus. Simply put, the game "sees" several adapters as one, and which GPU a command belongs to is determined by a so-called node mask.

Direct3D 12 Linked Node

Linked Node has several important advantages. First, it implies that the nodes share the same architecture and performance, which greatly simplifies load balancing. Linked Node also lets nodes access each other's memory directly, bypassing system RAM. Finally, when rendering with AFR, frames can be transferred to the "master" node through interfaces other than PCI Express (that is, SLI bridges, since AMD long ago dropped the dedicated bus in CrossFire).

In Unlinked Node, in turn, each node exposes its own set of command queues and allows the most flexible control over GPU resources. In particular, asymmetric configurations of adapters with unequal power and different architectures are possible. Even a combination of AMD and NVIDIA devices in one system is viable. No less tempting is the possibility of boosting discrete graphics performance with the GPU integrated into the CPU, which can perform the final stages of frame processing (in effect, this is a special case of frame pipelining, and it does not require the careful load balancing that is unavoidable in asymmetric GPU pairs).

Graphics adapters are switched to Linked Node at the driver level, which for the user means enabling the SLI or CrossFire option in the settings. However, not every game can use adapters in Linked Node or, conversely, Unlinked Node. For example, Ashes of the Singularity requires Unlinked Node, while the rest of our tests work only in Linked mode.

Direct3D 12 Unlinked Node

To get Linked Node under Direct3D 12 on NVIDIA cards, they must be connected with a bridge, otherwise the driver simply will not allow SLI to be enabled. As noted above, Direct3D 12 allows bridges to be used for their intended purpose in linked mode, but we checked: in none of the test games is there a frame rate difference between the modern HB SLI Bridge and a simple flexible bridge, which would certainly have appeared otherwise. On the one hand, this is a problem, because the bridge provides the bandwidth needed to transfer frames between GPUs bypassing PCI Express even in modes as heavy as 4K. On the other hand, rigid illuminated bridges that can run in dual-channel mode at a higher frequency are an expensive pleasure, whereas in Direct3D 12 a dirt-cheap flexible connector will do.

Test bench configuration
CPU	Intel Core i7-5960X @ 4 GHz (100 MHz × 40), fixed frequency
Motherboard	ASUS RAMPAGE V EXTREME
RAM	Corsair Vengeance LPX, 2133 MHz, 4 × 4 GB
Storage	Intel SSD 520 240 GB + Crucial M550 512 GB
Power supply	Corsair AX1200i, 1200 W
CPU cooler	Thermalright Archon
Case	CoolerMaster Test Bench V1.0
Monitor	NEC EA244UHD
Operating system	Windows 10 Pro x64
Software for AMD GPUs
All graphics cards	AMD Radeon Software Crimson ReLive Edition 18.6.1
Software for NVIDIA GPUs
All graphics cards	NVIDIA GeForce Game Ready Driver 398.11

Benchmarks: games
Game (in order of release date)	API	Settings, test method	Full-screen anti-aliasing at 1920 × 1080 / 2560 × 1440	Full-screen anti-aliasing at 3840 × 2160
GTA V	DirectX 11	Max. quality. Built-in benchmark	MSAA 4x + FXAA + Reflection MSAA 4x	Off
The Witcher 3: Wild Hunt	DirectX 11	Max. quality. FRAPS, Kaer Morhen location	AA + HairWorks AA 4x
Rise of the Tomb Raider	DirectX 11 / Direct3D 12	Max. quality, VXAO off. Built-in benchmark	SSAA 4x
Tom Clancy's The Division	DirectX 11 / Direct3D 12	Max. quality, HFTS off. Built-in benchmark	SMAA 1x Ultra + TAA: Supersampling	TAA: Stabilization
Deus Ex: Mankind Divided	DirectX 11 / Direct3D 12	Max. quality. Built-in benchmark	MSAA 4x
Battlefield 1	DirectX 11 / Direct3D 12	Max. quality. OCAT, start of the Over the Top mission	TAA
Ashes of the Singularity: Escalation	DirectX 11 / Direct3D 12	Max. quality. Built-in benchmark	MSAA 4x + TAA 4x
Total War: WARHAMMER II	DirectX 11 / Direct3D 12	Max. quality. Built-in benchmark (Battle Benchmark)	MSAA 4x
Far Cry 5	DirectX 11	Max. quality. Built-in benchmark	TAA

The benchmark set includes nine games released in 2016–2017, six of which can run under the Direct3D 12 API. Only the relatively old games GTA V and The Witcher 3: Wild Hunt, along with Far Cry 5, are tied to Direct3D 11.

As for compatibility with multi-adapter systems under Direct3D 11, it is ruled out in Ashes of the Singularity, and Battlefield 1 does not support CrossFire. In the remaining games SLI and CrossFire work under the old API.

In Direct3D 12 mode two graphics cards are used in Ashes of the Singularity, Battlefield 1, Deus Ex: Mankind Divided and Rise of the Tomb Raider. Tom Clancy’s The Division and Total War: WARHAMMER II lack this capability. In addition, a pair of Vega 64 accelerators does not work in Deus Ex: Mankind Divided under Direct3D 12, although there are no problems with the Radeon RX 580.

The shift of focus from CrossFire to mGPU was accompanied by another change of priorities, as colleagues from PCWorld found out during a recent conversation with AMD representatives. The company has recently stopped seeing much point in supporting setups of more than two graphics cards in games.

At least, this applies to mGPU configurations under DirectX 12 and Radeon RX Vega cards. Presumably the rule will extend to their successors as well. For compute or professional applications, the gain from using three or four cards in one system can still be felt.

It is worth adding that NVIDIA has also cooled toward 3-Way SLI and 4-Way SLI configurations lately, keeping support for them in the "entertainment" segment only for a few benchmark applications in which enthusiasts continue to chase records. So this is a general trend, and AMD's move is not exceptional in this situation.

Engine requirements

To explicitly utilize multiple GPUs, the renderer needs to be aware of their existence. This requires some new code infrastructure. Building the infrastructure can actually be the task that requires most effort. Once the infrastructure exists, it’s easier to experiment with different ways of utilizing all the GPUs in the system.

When using a linked node adapter, you create one ID3D12Device as you would with a single physical GPU. But various objects that you create through the ID3D12Device use node affinity masks to identify the nodes with which the object is associated. The affinity mask is a bitmask where each bit represents one node (physical GPU). Some objects are exclusive to one physical GPU meaning that exactly one bit in the node mask must be set. Others can be associated with (or created on) arbitrary GPUs.

Image 6. Node affinity bitmasks are used to reference nodes (GPUs)
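
The bit arithmetic behind node affinity masks is small enough to show directly. The helpers below are a minimal sketch with hypothetical names (`nodeMask`, `isSingleNode`, `allNodesMask`); in a real application the node count would come from ID3D12Device::GetNodeCount().

```cpp
#include <cassert>
#include <cstdint>

// Bit i of a node mask refers to node (physical GPU) i.
// nodeIndex must be below 32 for the shift to be valid.
constexpr std::uint32_t nodeMask(unsigned nodeIndex) {
    return 1u << nodeIndex;
}

// True when exactly one bit is set, as required for objects that are
// exclusive to one node.
constexpr bool isSingleNode(std::uint32_t mask) {
    return mask != 0 && (mask & (mask - 1)) == 0;
}

// Mask with the low nodeCount bits set, i.e. every node of the adapter.
std::uint32_t allNodesMask(unsigned nodeCount) {
    assert(nodeCount >= 1 && nodeCount <= 32);
    return nodeCount == 32 ? 0xFFFFFFFFu : (1u << nodeCount) - 1u;
}
```

For a two-GPU system, `nodeMask(0)` and `nodeMask(1)` identify the individual GPUs and `allNodesMask(2)` covers both.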

For example, when you create an ID3D12CommandQueue for submitting work to the GPU, you specify a node mask to identify the physical GPU to which the command queue feeds work. The ID3D12CommandQueue is one of the APIs that are exclusive to one node. Likewise, the ID3D12CommandList objects are exclusive to one node. As a result, you have to replicate your command list pooling system for each node. ID3D12PipelineState, on the other hand, is an example of an object that can be associated with arbitrary nodes. You don’t have to create a separate object for each node. The same pipeline state object can be set to command lists associated with any node.

When sharing resources, the application is responsible for synchronizing the command queues to avoid access conflicts. Also, the application must ensure that the queues see resources in the same state, i.e. the resource barriers set by different command queues must match. ID3D12Fence is the synchronization tool that is used for these purposes.

Image 7. Fences are used to synchronize resource access by different queues (GPUs)
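
The ordering guarantee that a fence provides between two queues can be illustrated with a single-threaded toy model. Everything here is an assumption made for illustration: `Fence`, `Op` and `run` are hypothetical stand-ins, not the ID3D12Fence API, and the two queues are stepped round-robin instead of running on separate GPUs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy model of a fence: a monotonically increasing value.
struct Fence { std::uint64_t value = 0; };

struct Op {
    enum Kind { Work, Signal, Wait } kind;
    std::string name;          // used by Work
    std::uint64_t fenceValue;  // used by Signal and Wait
};

// Steps two queues round-robin, one op per turn. A Wait op stalls its queue
// until the fence has reached the requested value. Returns the order in
// which Work ops executed.
std::vector<std::string> run(const std::vector<Op>& q0,
                             const std::vector<Op>& q1) {
    Fence fence;
    std::vector<std::string> log;
    const std::vector<Op>* queues[2] = {&q0, &q1};
    std::size_t pc[2] = {0, 0};
    bool progressed = true;
    while (progressed) {  // stops when both queues are done or deadlocked
        progressed = false;
        for (int q = 0; q < 2; ++q) {
            if (pc[q] >= queues[q]->size()) continue;
            const Op& op = (*queues[q])[pc[q]];
            if (op.kind == Op::Wait && fence.value < op.fenceValue) continue;
            if (op.kind == Op::Signal) {
                assert(op.fenceValue > fence.value);  // values must increase
                fence.value = op.fenceValue;
            }
            if (op.kind == Op::Work) log.push_back(op.name);
            ++pc[q];
            progressed = true;
        }
    }
    return log;
}
```

With queue 0 holding a wait for value 1 followed by a consuming pass, and queue 1 holding a producing pass followed by a signal of value 1, the consumer always runs after the producer even though its queue is stepped first.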

DirectX 12 exposes “copy engines”, i.e. command queues that accept only command lists containing copy operations such as ID3D12GraphicsCommandList::CopyResource(). Copy engines are special hardware that can perform copy operations at the same time as graphics and compute engines are doing other work. They are additional parallel processing power available for copy operations. The number of copy engines available at hardware level varies on each GPU model, but a safe assumption is that there is at least one hardware copy engine available per each physical GPU. The copy engines are very useful in multi-GPU programming. Copying resources over PCIe bus is slow and the copy engines allow other processing on the GPU to go on while they are doing the slow copy operations. (However, copy engines are not good for copies within a physical GPU because they are not designed to operate faster than the PCIe bus allows. Graphics and compute engines are much faster for this purpose. Copy engines do have their place on single-GPU systems - for copying from system memory to video memory over the PCIe bus.)

Image 8. A copy engine can do the slow copies between GPUs while other engines continue working.

In a linked node adapter, there is a tier system for cross node sharing functionality. Tier 1 supports only copy operations on resources residing on other nodes. Tier 2 supports using resources through SRVs, CBVs and UAVs in draw and dispatch calls. The tier 2 functionality may seem convenient, but the parallel copy engines cannot do draw or dispatch calls and slowing down the other engines with cross node resource access is usually not wise.

When creating resources, you specify two separate node affinity masks: CreationNodeMask and VisibleNodeMask. The CreationNodeMask determines the node where the resource physically resides. VisibleNodeMask determines the nodes on which the resource is mapped for access. In the CreationNodeMask, exactly one bit must be set, but in the VisibleNodeMask, arbitrary bits can be set. When a resource is accessed from a node other than its creation node, data is transferred between nodes. The transferred data may be cached to avoid retransfer when it’s accessed again, but the application should not rely on this. There are no guarantees about the caching behavior, i.e. the application cannot ensure that a given resource stays in cache and it cannot see whether or not a given resource is still in cache. For achieving reliable performance, manually replicating art assets (vertex buffers, textures etc.) for each node that uses them is recommended. In other words, don’t just create resources on one node and make them visible to others. Using them through SRVs from other nodes is possible, but there are no guarantees about the performance.
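
The replication advice can be sketched as data, without creating any real resources. `Placement`, `isValid` and `replicatePerNode` are hypothetical names; the validity check encodes the one-bit rule from the text plus one assumption of this sketch, namely that the creation node is also listed as visible.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical placement record; the two fields mirror the CreationNodeMask
// and VisibleNodeMask described in the text.
struct Placement {
    std::uint32_t creationNodeMask;
    std::uint32_t visibleNodeMask;
};

constexpr bool hasExactlyOneBit(std::uint32_t m) {
    return m != 0 && (m & (m - 1)) == 0;
}

// Exactly one creation bit (from the text), and the creation node must be
// among the visible nodes (assumption of this sketch).
bool isValid(const Placement& p) {
    return hasExactlyOneBit(p.creationNodeMask) &&
           (p.visibleNodeMask & p.creationNodeMask) != 0;
}

// One private copy of an asset per node: each copy lives on, and is visible
// only to, its own node, so reading it never crosses nodes.
std::vector<Placement> replicatePerNode(unsigned nodeCount) {
    assert(nodeCount >= 1 && nodeCount <= 32);
    std::vector<Placement> copies;
    for (unsigned i = 0; i < nodeCount; ++i) {
        const std::uint32_t bit = 1u << i;
        copies.push_back({bit, bit});
    }
    return copies;
}
```

A resource meant to be the target of cross-node copies would instead keep one creation bit and add the other node's bit to its visible mask, for example `{1, 3}`.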

ID3D12DescriptorHeap objects are exclusive to one node. This means that, whether or not you replicate the resource objects on each node that uses them, the resource views must be replicated in any case. However, when cross node access happens through a copy engine, resource views are not used. Copy engine access just needs properly set bits in VisibleNodeMask.

If you implement classic AFR, you should manually duplicate all your resources on all nodes. This includes art assets, render targets, constant buffers and other dynamic resources. On each frame, you use the resources that reside on the node doing the rendering for that frame. You upload dynamic data from the CPU to resources on that node and render to target resources on that node using asset resources residing on that node. The transfer of the rendered frame to the primary node (the node to which the monitor is attached) for presentation on screen is best done using the special API provided for it. (See again IDXGISwapChain3::ResizeBuffers1().) Possible data dependencies between the frames should be handled with additional resource copy operations using the copy engine.
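The per-frame node selection described above can be sketched in a few lines of plain C++. This is an illustration under stated assumptions, not code from the post: the names `AfrNodeIndex`, `AfrNodeMask` and `ResourcesForFrame` are hypothetical, frames are assumed to rotate round-robin over the nodes, and the per-node resource sets are assumed to live in a vector indexed by node.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Node that renders frame `frameIndex` under round-robin AFR.
// `nodeCount` must be in the range 1..32.
constexpr std::uint32_t AfrNodeIndex(std::uint64_t frameIndex,
                                     std::uint32_t nodeCount) {
    return static_cast<std::uint32_t>(frameIndex % nodeCount);
}

// Single-bit node mask for the node that renders frame `frameIndex`.
constexpr std::uint32_t AfrNodeMask(std::uint64_t frameIndex,
                                    std::uint32_t nodeCount) {
    return 1u << AfrNodeIndex(frameIndex, nodeCount);
}

// Picks the replicated resource set that lives on the rendering node.
// `perNode` holds one entry per node, in node order.
template <typename T>
T& ResourcesForFrame(std::vector<T>& perNode, std::uint64_t frameIndex) {
    assert(!perNode.empty());
    return perNode[static_cast<std::size_t>(frameIndex % perNode.size())];
}
```

With two nodes, even frames use the resource set of node 0 and odd frames that of node 1, so no frame touches assets that reside on the other GPU.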

This concludes part 1. The second part of the blog post examines frame pipelining, one new alternative to classic AFR that’s now possible with the exposed functionality.

Adapters - Multiple or Linked Together

DirectX 12 exposes two alternate ways of controlling multiple physical GPUs. They can be controlled as multiple independent adapters, where each adapter represents one physical GPU. Alternatively, they can be configured as one “linked node adapter”, where each node represents one physical GPU. However, it’s important to note that the application cannot control how it sees multiple GPUs. It cannot link or unlink adapters. The selection is made by the end user through the display driver settings.

Image 1. Multiple physical GPUs seen as multiple adapters
Image 2. Multiple physical GPUs seen as a linked node adapter
Image 3. Enabling SLI in NVIDIA control panel enables linked node mode in DirectX 12 API

In practice, the linked node mode is meant for multiple equal, discrete GPUs, i.e. classic SLI setups. It offers a couple of important benefits. Within a linked node adapter, resources can be copied directly from memory of one discrete GPU to memory of another. The copy doesn’t have to pass through system memory. Additionally, when presenting frames from secondary GPUs in AFR, there’s a special API for supporting connections other than PCIe. (See IDXGISwapChain3::ResizeBuffers1() in DirectX documentation.)

Image 4. In linked node adapter, connections other than PCIe can be utilized for presenting frames from secondary GPUs in alternate frame rendering

Today, all available linked node implementations link only equivalent GPUs. In practice, applications can build their load balancing between the nodes on this assumption. However, the linked node API doesn’t actually guarantee that nodes have equal performance. Someday, heterogeneous linked node adapters may become available, making the load balancing less trivial.

Image 5. Today, linked node adapters are homogeneous in practice.

In this blog post, I’ll focus on the linked node adapter due to its suitability for classic multi-GPU setups.

Background

Since the launch of SLI, a long time ago, utilization of multiple GPUs was handled automatically by the display driver. The application always saw one graphics device object no matter how many physical GPUs were behind it. With DirectX 12, this is not the case anymore. But why start doing something manually that has been working automatically? Because, actually, for a good while before DirectX 12 arrived, the utilization of multiple GPUs has not been that automatic anymore.

As rendering engines have grown more sophisticated, the distribution of rendering workload automatically to multiple GPUs has become problematic. Namely, temporal techniques that create data dependencies between consecutive frames make it challenging to execute alternate frame rendering (AFR), which still is the method of choice for distribution of work to multiple GPUs. In practice, the display driver needs hints from the application to understand which resources it must copy from one GPU to another and which it should not. Data transfer bandwidth between GPUs is very limited and copying too much stuff can make the transfers the bottleneck in the rendering process. Giving hints to the driver can be implemented with NVAPI or by making additional Clear() or Discard() calls for selected resources.

Consequently, even when you didn’t have explicit control over multiple GPUs, you had to understand what happened implicitly and give the driver hints for doing it efficiently in order to get the desired performance out of multi-GPU setups. Now with DirectX 12, you can take full and explicit control of what is happening. And you are no longer limited to AFR. You are free to invent new ways of making use of multiple GPUs that better suit your application.


Our results show that performance scaling on two graphics cards follows two rules. First, the higher the screen resolution, the more effectively a multi-adapter system works. Second, the more powerful a single graphics card is, the less benefit a second identical card brings.

Still, even the top NVIDIA and AMD models (GeForce GTX 1080 Ti and Radeon RX Vega 64) in SLI and CrossFire under Direct3D 11 delivered performance gains of 58% and 52% respectively at 2160p, that is, in exactly the mode that calls for the most powerful hardware available.

The GeForce GTX 1070 did even better under the same conditions, showing a 72% increase in FPS from running in SLI. But look at the Radeon RX 580: a second card of this class added another 68% to the frame rate already at 1080p, and at 2160p CrossFire efficiency reaches 82%!
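The article does not spell out how its percentages are computed. A plausible reading, offered here only as an assumption, is the relative frame-rate gain of the two-card setup over a single card. The helper name `ScalingGainPercent` and the frame rates in the usage note are hypothetical, not measured results.

```cpp
#include <cassert>

// Relative frame-rate gain, in percent, of a two-card setup over one card.
// Assumed formula: (dualFps / singleFps - 1) * 100.
inline double ScalingGainPercent(double singleFps, double dualFps) {
    assert(singleFps > 0.0);
    return (dualFps / singleFps - 1.0) * 100.0;
}
```

Under this reading, a hypothetical card that renders 50 FPS alone and 79 FPS in a pair shows a gain of about 58%, and a gain of 100% would mean perfect scaling.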

As for multi-adapter rendering in Direct3D 12, a direct comparison with Direct3D 11 based on the test results is not entirely fair, because not all of the test games support the new API, and not all of those that do can use two graphics cards. Still, certain trends are visible here too. Pairs of Radeon RX 580 or Radeon RX Vega 64 cards work best under Direct3D 12. The latter pair loses a little efficiency at 1080p compared with what we saw in Direct3D 11, but otherwise there is practically no difference between the two API versions for AMD cards.

Things are much worse for the GeForce GTX 1070 and GeForce GTX 1080 Ti. Depending on the resolution, a pair of these cards can lose a great deal when moving to Direct3D 12. The efficiency of the GeForce GTX 1080 Ti pair at 2160p dropped by no less than half. On the other hand, it is the GTX 1080 Ti that prefers Direct3D 12 at 1080p.

[Tables: 1920 × 1080, 2560 × 1440, 3840 × 2160]

Since multi-adapter rendering is still quite effective on mid-range accelerators (such as the Radeon RX 580) and on those a class higher (GeForce GTX 1070), it is useful to compare a pair of such cards with the AMD and NVIDIA flagships. After all, two Radeon RX 580 or GeForce GTX 1070 cards have more shader ALUs and texture units than the Radeon RX Vega 64 and GeForce GTX 1080 Ti respectively.

As for the Radeon RX 580, the pair does indeed beat the Vega 64 in average frame rate (averaged over all games) at any screen resolution, especially under Direct3D 12, and it also has a good performance margin in games that are not ideally optimized for CrossFire and mGPU.

Note: the percentages in the tables are calculated only from results in games that support multi-adapter rendering for the corresponding API and GPU architecture.

[Tables: 1920 × 1080, 2560 × 1440, 3840 × 2160]

A pair of GeForce GTX 1070 cards is also a better deal than a single GeForce GTX 1080 Ti, but the SLI advantage is limited to gaming under Direct3D 11. Under Direct3D 12, only at 2160p does the GTX 1070 pair deliver, on average, the same frame rate as the GTX 1080 Ti.

Note: the percentages in the tables are calculated only from results in games that support multi-adapter rendering for the corresponding API and GPU architecture.

[Tables: 1920 × 1080, 2560 × 1440, 3840 × 2160]

Although AMD and NVIDIA have paid little attention to SLI and CrossFire in recent years, the test results show that, as long as Direct3D 11 is alive, nothing has changed in this area. Most games support multi-adapter rendering under the old API, and under favorable conditions a second graphics card can well increase performance by 70–80%.

The best scaling is seen in heavy graphics modes (at least 1440p with full-screen anti-aliasing, and preferably 4K) and on cards that are not too high-end: the Radeon RX 580 and GeForce GTX 1070, the RX 580 in particular. A pair of GeForce GTX 1060 cards would be at least as effective, except that NVIDIA has chosen not to make this model SLI-compatible.

Loading two powerful graphics adapters such as the Radeon RX Vega 64 and GeForce GTX 1080 Ti equally well is impossible in modern computers. Apparently, scaling runs into CPU limits, and as a result the average SLI/CrossFire efficiency does not exceed 60%.

Multi-Adapter technology in Direct3D 12 does not yet enjoy much support from game developers. Several test titles that are compatible with both the old and the new API use two graphics cards only in Direct3D 11 mode. In addition, NVIDIA chips have certain performance problems. GeForce GTX 1070 and GTX 1080 Ti pairs suffered badly under Direct3D 12 compared with Direct3D 11, while the Radeon RX 580 and Vega 64 survived the API upgrade without significant losses. It turns out that AMD's bet on the next-generation API has, as in previous Direct3D 12 tests, fully paid off.

This blog post is about explicit multi-GPU programming, which became possible with the introduction of the DirectX 12 API. In previous versions of DirectX, the driver had to manage multiple SLI GPUs. Now, DirectX 12 gives that control to the application. There are two parts in this blog post. In this first part, I’ll explain how multiple GPUs are exposed in the DirectX API, giving some pointers to the API documentation. Please look for further details in the documentation itself. In the second part, I’ll describe a technique called frame pipelining, a new way of utilizing multiple GPUs that was not possible before DirectX 12.
