Reviewing the GPA trace in Figure 17.7, the frame shows the animation and
render tasksets executing in sequence. Wall time for the frame increased to 2.8 ms,
which is unexpected. The GPA trace shows that both animation and rendering take
about 1.8 ms, which is an improvement. The submit task, however, forces a
serialization point: the drivers used to gather these data do not support
multithreaded command lists. Internally, the D3D11 API records commands as tokens,
which are then played back in the ExecuteCommandList function. This multithreaded
emulation slightly increases the frame cost, yet all is not lost. Even on drivers
where multithreaded submission is not enabled, we can use the drain-out time
(defined below) with pipelining.
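As a toy illustration of the record-and-playback pattern described above (this is not the driver's actual internals, just a sketch of the idea): command recording can proceed concurrently on worker threads, but playback of the recorded tokens happens serially at submit time, which is why submission remains a serialization point.

```python
import threading

class DeferredContext:
    """Toy stand-in for a deferred context: records commands as tokens."""
    def __init__(self):
        self.tokens = []

    def record(self, command, *args):
        # Recording is cheap and can happen on any worker thread.
        self.tokens.append((command, args))

class ImmediateContext:
    """Toy immediate context: plays recorded tokens back serially."""
    def __init__(self):
        self.executed = []

    def execute_command_list(self, deferred):
        # Playback happens on one thread -- the serialization point.
        for command, args in deferred.tokens:
            self.executed.append((command, args))

# Two systems record in parallel; submission replays their tokens serially.
anim_ctx, render_ctx = DeferredContext(), DeferredContext()

t1 = threading.Thread(target=anim_ctx.record, args=("draw", "skinned_mesh"))
t2 = threading.Thread(target=render_ctx.record, args=("draw", "scene"))
t1.start(); t2.start(); t1.join(); t2.join()

immediate = ImmediateContext()
immediate.execute_command_list(anim_ctx)    # serial playback
immediate.execute_command_list(render_ctx)  # serial playback
```

The command names here are made up; the point is only that the cost of the serial playback loop is paid once per submit regardless of how many threads recorded.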
17.3.7 Pipelining Systems across Frames and Latency
The tasksets are now free of synchronization points, and their dependencies are
properly specified. There are surely many more algorithmic and implementation
optimizations possible for this example. From a tasking perspective, however, the
scheduling is as efficient as possible for this frame. The tasking system schedules
each game system's work as soon as its dependencies allow, and the systems' tasks
run concurrently. To get to the next level of tasking utilization, multiple
frames need to be in flight at once.
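A minimal sketch of this "run as soon as dependencies allow" scheduling, using Python's thread pool (the taskset names and dependency graph below are illustrative, not the chapter's actual systems):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical frame tasksets and their dependencies (names are illustrative).
DEPS = {
    "animation": [],
    "particles": [],            # independent of animation, so it can run concurrently
    "render":    ["animation", "particles"],
    "submit":    ["render"],
}

def run_frame(deps):
    """Launch every taskset; each one blocks only on its own dependencies."""
    futures = {}
    order = []
    with ThreadPoolExecutor(max_workers=len(deps)) as pool:
        def run(name):
            for d in deps[name]:
                futures[d].result()  # wait for this dependency to finish
            order.append(name)       # record completion order
            return name
        for name in deps:            # submit in dependency order
            futures[name] = pool.submit(run, name)
        for f in futures.values():
            f.result()
    return order

order = run_frame(DEPS)
```

Because each task waits only on its declared inputs, "animation" and "particles" may complete in either order, but "render" always follows both and "submit" always comes last.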
Pipelining in a thread-per-system game. With the thread-per-system model, the
number of frames in flight required to achieve maximum throughput is equal
to the number of dependent systems in a frame.³ Let us assume a game has three
systems A, B, and C and that the frame is CPU bound. System B depends on the
output of A, and system C depends on the output of B. If the game systems run
on one thread, the total frame time is the sum of the running times of A, B, and C,
and the latency is one frame:

Time(Frame) = Time(A) + Time(B) + Time(C).
If each system runs on its own thread, then to achieve maximum throughput the
latency will be three frames and

ExecTime(Frame) = max(ExecTime(A), ExecTime(B), ExecTime(C)).
Also, the memory footprint grows, since the inputs and outputs of the systems
must be queued. For simplicity, we can describe the memory usage as a function of
the latency, because each pipelined frame needs independent memory to operate on.
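The two formulas above can be checked with a small timing model. Using made-up per-system times, the recurrence `finish[s][f] = max(finish[s-1][f], finish[s][f-1]) + t[s]` models a thread-per-system pipeline: system `s` on frame `f` starts once the previous system's output for frame `f` and its own work on frame `f-1` are both done. In steady state, the interval between completed frames equals the maximum per-system time.

```python
# Illustrative per-system execution times in ms (made-up numbers).
times = {"A": 1.0, "B": 2.0, "C": 1.5}

# Single-threaded: Time(Frame) = Time(A) + Time(B) + Time(C).
serial_frame = sum(times.values())  # 4.5 ms

def pipeline_finish_times(times, num_frames):
    """Completion time of each system on each frame in the pipelined model."""
    systems = list(times)
    finish = [[0.0] * num_frames for _ in systems]
    for f in range(num_frames):
        for s, name in enumerate(systems):
            prev_system = finish[s - 1][f] if s > 0 else 0.0  # input ready
            prev_frame = finish[s][f - 1] if f > 0 else 0.0   # thread free
            finish[s][f] = max(prev_system, prev_frame) + times[name]
    return finish

finish = pipeline_finish_times(times, num_frames=10)
last = finish[-1]  # completion times of the final system, C, per frame
# Steady-state interval between frames = max(ExecTime(A), ExecTime(B), ExecTime(C)).
steady_interval = last[-1] - last[-2]  # 2.0 ms, the time of system B
```

With these numbers the pipeline more than doubles throughput (2.0 ms per frame versus 4.5 ms), at the cost of three frames of latency and three frames' worth of queued inputs and outputs.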
³ Complex threading systems that more closely resemble tasking can achieve lower frame
latency at the cost of code complexity. There is a continuum of solutions between the rigid
thread-per-system model and the tasking model.