
AMD Technical Day at Develop 2007

Started by R300 August 11, 2007
http://www.beyond3d.com/content/articles/89/1

AMD Tech Day at Develop 2007
written by Rys for Consumer Graphics



In the 12 months since last year's AMD technical day at the Develop
conference, a lot has happened. The day was dominated by ATI,
Microsoft and Intel talking about D3D10 and multi-threaded game
development. Since then AMD purchased ATI, creating the AMD Graphics
Products Group; Microsoft shipped Vista early this year to provide the
official OS platform for D3D10; AMD joined NVIDIA in shipping D3D10
hardware to accelerate the new API; and finally multi-threading for PC
platforms has taken centre stage as the battle for multi-core CPU
supremacy inside your development, home or gaming PC rages on.

Looking forward, Agena will fight Kentsfield and Yorkfield in short
order at the tail end of this year to decide who rules the quad-core
multi-threaded PC roost. NVIDIA will likely ship their next round of
D3D10 acceleration before 2008, too, and AMD have a Radeon or two left
up their sleeves before year's end. Microsoft recently shipped the
Tech Preview for D3D10.1, too, which some of the aforementioned coming
hardware might support. All these factors combine to focus AMD and its
Develop Tech Day efforts. Put simply, there's enough going on in
graphics and games development that such a Tech Day almost becomes
mandatory.

While this year's turnout was low compared to last year's, those who
didn't attend missed a trick, in our opinion. More than enough good
information was imparted -- although the crowd kept quiet feeding that
back to the presenters at times -- to make us wonder why IHV devrel
teams don't tour slide decks like those presented more often.
Technical details, implementation tricks, things to try and experiment
with in game engines and more were all thrown out there for the ISV
representatives to soak up. Hopefully those not taking notes either
have photographic memories, tiny Nick Thibieroz or Emil Persson clones
in their shirt pockets, or they fancy a read of this article.

We'll cover what was presented by the AMD Classic, AMD Graphics
Products Group and Microsoft speakers, discussing things as we go
along. The focus is definitely D3D10 on Vista and multi-threaded game
programming on Windows, with a little about the upcoming new dual- and
quad-core AMD processor architecture at the end. Let's forge on.


Richard Huddy


Richard Huddy was up first, discussing the Graphics Products Group's
latest hardware and where they think the industry is headed as process
nodes get ever smaller. The early part of his presentation centered
around the best use of their R6-family of processors for both D3D9 and
D3D10. Richard indicated that given the performance of RV610 in D3D10,
it might not be unreasonable to consider it a good D3D9 GPU and stop
there, essentially asking developers to consider leaving RV610 out of
the loop when it came to D3D10 GPU targets.

That was in reference to the minimal capability checking that D3D10
development favours. Given the base specification for D3D10,
hardware implementations are always guaranteed to meet that minimum
specification with no exceptions, with Microsoft working with the IHVs
to ensure that base spec is as feature-rich as possible. That standard
feature base means that game developers get to concentrate on scaling
the game experience to different hardware platforms on the PC by
rendering fidelity, not rendering ability.

The (hopefully) more predictable performance of D3D10 accelerators,
which all offer hardware unified shading and transparent load
balancing, means that developers can be left to offer simple
resolution, GPU-based and asset-based quality controls. Turning down
texture or shadow map size for example, or enabling lesser levels of
surface filtering and MSAA via the API and user control, means the end-
user tailors the gaming experience to his or her machine. Huddy's
advice for RV610 goes one step further, with a developer that follows
that advice possibly not offering D3D10 execution on that GPU at all,
because of its performance limitations.

Crossfire
Crossfire was next on the agenda, Richard talking about how important
multi-GPU is going to be going forward. At this point it's one of
those quietly whispered secrets that AMD's next generation graphics
architecture is going to centre around a dual-chip offering for the
high-end, with R700 likely comprising two discrete GPU dice.


With that in mind, Huddy's advice about Crossfire maybe signals intent
on AMD's part to push multi-GPU as graphics progresses deep into D3D10
and then on to D3D11 and beyond.

Huddy says that AMD simply aren't interested in building large
monolithic GPUs any more, instead preferring to scale their offerings
via multi-GPU. Speaking about a future where GPUs are built on process
nodes approaching 22nm and beyond, he noted that process limitations
start to encroach on how the IHVs build their chips. Huddy argues that
chip size and pad count are among the biggest things the GPG
architects are considering going forward, with no real chance of
512-bit external memory buses on tiny chips.

We asked him why you wouldn't just build an R600-sized GPU at 22nm,
encompassing way north of a billion transistors, while retaining the
large external bus size, but he was firm that the future of graphics
scaling lay not in large monolithic designs, despite their inherent
attractions. Instead the future lies with multi-GPU, and AMD are
looking to implement consumer rendering systems where more than two
GPUs take part in the rendering process. Richard made mention that the
software to drive it gets more complex, both at the application level
and at the driver level, but AMD will ensure their part of the
software stack works and ISVs should be prepared to do the same.

Richard urged developers to test on Crossfire systems now, since
they're available and sport some of the basic scaling traits that AMD
expect to see in future multi-GPU implementations. Tips for developers
centered around rendertarget usage, and temporal usage of rendered
frames for feedback algorithms including exposure systems and the
like.

On a multi-GPU system rendering in a round-robin AFR fashion, with
each GPU in the system rendering a discrete frame on its own, if one
frame depends on the one (or more) that was rendered previously,
shuffling its contents to the GPU drawing the current one costs
performance in multiple (and sometimes subtle) ways.

The message was a clear one: Crossfire is important now, and it will
only become even more so in the future, especially as AMD introduce
new architectures and deliver consumer game rendering systems with
more than 2 GPUs contributing to rendering. We can't help but think
that the push to thinking about multiple hardware components
contributing to rendering is a basic tenet of the industry going
forward for graphics, and that it'll occur not just at AMD.

Back to the R6-family
Coming back to the R6 family, Richard made note of various current
properties of their shipping hardware range: the L2 cache sizes of
the GPUs (256KiB on R600, half that on RV630 and no L2 texel cache on
RV610), the fact that each SIMD can only run one program type per
clock group, and the advice to keep the chip, its memory controller
and its on-chip memories happy in D3D10 by indexing RT arrays sensibly
(use RTs 0, 1 and 2 rather than 0, 4 and 7, for example, if possible).
The chip will speculatively fetch across sensible memory locations
into cache, so you can reduce misses by packing data not just
per-surface but by groups of surfaces.
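
To make that concrete, here's a minimal sketch of what contiguous RT
binding might look like in a D3D10 renderer. The view names and the
three-target layout are our own invention, purely for illustration.

    // Hypothetical sketch: bind MRT outputs in contiguous slots (0, 1, 2)
    // rather than scattering them across the RT array.
    #include <d3d10.h>

    void BindGBuffer(ID3D10Device* device,
                     ID3D10RenderTargetView* albedo,
                     ID3D10RenderTargetView* normals,
                     ID3D10RenderTargetView* depth,
                     ID3D10DepthStencilView* dsv)
    {
        // Packing the three targets into slots 0..2 keeps the GPU's
        // speculative fetches across neighbouring surfaces useful.
        ID3D10RenderTargetView* rtvs[3] = { albedo, normals, depth };
        device->OMSetRenderTargets(3, rtvs, dsv);
    }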

Before we get to the tessellator chatter, it's worth noting that AMD
recommend a 1:4 input:output ratio for GS amplification as
an upper bound to keep output primitive data on chip, and thus
performance high, before the hardware has to regularly spill to board
memory via SMX.

Tessellation
The R6-family tessellator's light burned strongly in Richard's
presentation. A programmable primitive tessellator will become part of
the official D3D spec with DirectX 11, but AMD are the first out of
the gate with a pre-11 implementation. AMD look to expose it first in
D3D9 applications if developers are interested, with D3D10 support
provided to the ISVs probably closer to 2008.

Richard said the tessellator's main use is for added graphical
fidelity, rather than situations where the tessellated geometry might
interact in a crucial, game-changing way. The unit's performance was
also mentioned: triangle throughput upwards of a billion per second is
pretty trivial for the tessellator to push down to the rest of the
hardware.

Nicolas Thibieroz



Nick is one of the GPG devrel ninjas, coming to AMD by way of ATI and
PowerVR before that. Nick's had a hand in more triple-A games from
an ISV devrel perspective than we care to mention, and D3D10's his
favourite D3D so far. So much so, in fact, that he decided to reprise
his D3D10 For Techies presentation from Develop last year and present
even more information.

Good D3D10 Practices And Things To Remember
Nick's talk centered around porting a D3D9-level renderer to D3D10,
outlining some of the pitfalls and things to watch out for as you make
that transition. He started by urging developers to separate their
D3D10 shaders according to opaqueness or transparency, running
different versions depending on whether the alpha channel is a
consideration when shading, and to remember to set render state
correctly at each point.

The geometry shader was talked about as the replacement for point
sprites, where you emit the primitives for the sprite in the GS at the
screen space point you require. Nick also reaffirmed that there's no
longer a half texel offset when mapping texture UVs to geometry, so
D3D9-era shaders you might port to D3D10 that take that into account
will need to be corrected. Input semantics for the vertex shader stage
were also mentioned, with D3D10 more strict about your vertex
structures and which shaders you feed them to. Make sure your input
layouts match to ensure correct binding semantics, so that D3D10 can
do the right thing with shader caching and binding.
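
As a hedged illustration of that strictness, the sketch below declares
an input layout to match a hypothetical vertex shader's input
signature; the semantics, formats and offsets are illustrative rather
than taken from any shipping engine.

    // Hypothetical sketch: create an input layout that matches the VS
    // input signature exactly, so D3D10 can validate the binding up front.
    #include <d3d10.h>

    HRESULT CreateMatchingLayout(ID3D10Device* device,
                                 const void* vsBytecode, SIZE_T vsLength,
                                 ID3D10InputLayout** layout)
    {
        const D3D10_INPUT_ELEMENT_DESC elements[] =
        {
            { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0,  0, D3D10_INPUT_PER_VERTEX_DATA, 0 },
            { "NORMAL",   0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 12, D3D10_INPUT_PER_VERTEX_DATA, 0 },
            { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT,    0, 24, D3D10_INPUT_PER_VERTEX_DATA, 0 },
        };
        // Creation fails if the shader's input signature expects something
        // the layout doesn't supply, which is the strictness in question.
        return device->CreateInputLayout(elements, 3, vsBytecode, vsLength, layout);
    }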

Nick went on to discuss how D3D10 might have a base level spec, but
there are still additions to that base spec that hardware might not
implement, which require asking the API to check for compatibility
before you perform the operation. Surface filtering, RT blending, MSAA
on RTs and MSAA resolve for an RT were all mentioned in that group,
and it's up to you to make sure the hardware can deal with the surface
format you want to use for those operations. For example, current
Radeon D3D10 hardware can't perform MSAA on the 96-bit RGB float
surface type (DXGI_FORMAT_R32G32B32_FLOAT), but can if you make use of
DXGI_FORMAT_R32G32B32A32_FLOAT instead, with its extra channel. Full
orthogonality isn't there yet, so make sure the formats you use are
compatible with what you're doing.
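
By way of example, a check along these lines (our own sketch, not
lifted from Nick's slides) asks the device whether a format can be
used as a multisampled rendertarget before you commit to it.

    // Hypothetical sketch: query format capabilities instead of assuming
    // full orthogonality across surface formats.
    #include <d3d10.h>

    bool SupportsMsaaRenderTarget(ID3D10Device* device, DXGI_FORMAT format, UINT samples)
    {
        UINT support = 0;
        if (FAILED(device->CheckFormatSupport(format, &support)))
            return false;
        if (!(support & D3D10_FORMAT_SUPPORT_MULTISAMPLE_RENDERTARGET))
            return false;

        // The format can be a multisampled RT; check the requested sample count.
        UINT qualityLevels = 0;
        device->CheckMultisampleQualityLevels(format, samples, &qualityLevels);
        return qualityLevels > 0;
    }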

Talking about DX10's improved small batch performance, Nick
highlighted the fact that the improvements are harder to spot if you
port your D3D9 engine in a naive way and don't make the best use of
the features on offer. Making good and correct use of constant
buffers, state objects, the geometry shader, texture arrays and
resource views, and geometry instancing (to name a few key
considerations) is key to extracting the best driver-level and runtime
performance from the D3D10 subsystem and to reducing small batch
overhead.

The general idea Nick pushed in that section of his talk is that you
need to give the driver and runtime the very best picture of what
you're trying to achieve in your application for any given frame, so
that they can combine to drive the hardware efficiently. Make sure you
only update key resources when you need to, which might not be every
frame. That leads nicely on to constant buffer management.

Constant Buffer Management
Somewhat surprisingly, at least to us, Nick says that the number one
cause for poor performance in the first D3D10 games is bad CB
management. Developers forget that when any part of a constant buffer
is updated, the entire buffer is resent to the GPU, not just the
changed values. Between one and four CBs is the optimal number for
performance on current hardware, with Nick mentioning that this
doesn't just apply to Radeon, but to GeForce too. Nick also mentioned
that you should keep your
CBs in a controlled namespace, since filling up the global CB is a
quick way to worse performance.

Best practice with CBs also centres on the position of constants
within the buffer itself, to help caching work properly. If a shader
asks for, say, four constants from one of your CBs at a particular
point, make sure those four constants sit next to each other in the
buffer. Speculative constant fetch into the constant
cache means that the hardware is more likely to have neighbouring
constant values available as soon as possible, rather than if it has
to jump all over the constant space for the values you require. Think
about ordering constants in your CB index by their size, too. Profile
your app before and after a CB order change if possible, to check how
performance was affected.
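
A minimal sketch of that sort of CB organisation might look like the
following, with hypothetical per-frame and per-object buffers grouped
by update frequency and constants that are read together kept
adjacent. The struct names and members are purely illustrative.

    // Hypothetical sketch: a small number of CBs, split by update frequency.
    #include <d3d10.h>

    struct PerFrameCB              // updated at most once per frame
    {
        float viewProj[16];        // constants read together sit together
        float cameraPos[4];
        float lightDir[4];
    };

    struct PerObjectCB             // updated per draw call
    {
        float world[16];
        float materialParams[4];
    };

    void UpdatePerFrame(ID3D10Device* device, ID3D10Buffer* cb, const PerFrameCB& data)
    {
        // The whole buffer is re-sent when any part of it changes, so only
        // touch it when something in it actually changed this frame.
        device->UpdateSubresource(cb, 0, nullptr, &data, 0, 0);
    }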

More D3D10 Tips
Next up was a note that you need to make sure to specify the winding
order for output primitives in the GS for best performance and to make
sure the hardware actually draws your geometry. A mistake he's seeing
in current games is not specifying the winding order and then
wondering why geometry is missing, so be sure to tell the API how it
should instruct the hardware to draw your GS output.
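
On the application side, a small sketch of making the winding and cull
decisions explicit through the rasterizer state might look like this;
the values are illustrative, not a recommendation for any particular
engine.

    // Hypothetical sketch: an explicit rasterizer state, so GS output isn't
    // silently culled because of an assumed winding order.
    #include <d3d10.h>

    HRESULT CreateExplicitCullState(ID3D10Device* device, ID3D10RasterizerState** state)
    {
        D3D10_RASTERIZER_DESC desc = {};
        desc.FillMode              = D3D10_FILL_SOLID;
        desc.CullMode              = D3D10_CULL_BACK;   // cull back faces...
        desc.FrontCounterClockwise = FALSE;             // ...with clockwise = front
        desc.DepthClipEnable       = TRUE;
        return device->CreateRasterizerState(&desc, state);
    }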

If you're making use of streamout from your geometry stage, be it VS
or GS (you can use VS streamout by setting a null or passthrough GS),
hide latency in that stage of the pipeline by doing more math. Nick
mentions that you can move math ops around the pipeline in certain
cases, to help balance your workload and mask latency. If your VS is
bound by streamout latency for example, think about giving it some of
your PS work to do if that's possible.
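
For illustration, a hedged sketch of the stream-output plumbing on the
application side follows; the output declaration, stride and buffer
are hypothetical and would need to match your own GS output signature.

    // Hypothetical sketch: create a streamout-capable GS and bind a target.
    #include <d3d10.h>

    HRESULT CreateStreamOutGS(ID3D10Device* device,
                              const void* gsBytecode, SIZE_T gsLength,
                              ID3D10GeometryShader** gs)
    {
        const D3D10_SO_DECLARATION_ENTRY decl[] =
        {
            // Semantic  Index StartComponent ComponentCount OutputSlot
            { "POSITION", 0,   0,             4,             0 },
            { "TEXCOORD", 0,   0,             2,             0 },
        };
        const UINT stride = (4 + 2) * sizeof(float);   // bytes per output vertex
        return device->CreateGeometryShaderWithStreamOutput(
            gsBytecode, gsLength, decl, 2, stride, gs);
    }

    void BindStreamOutTarget(ID3D10Device* device, ID3D10Buffer* soBuffer)
    {
        UINT offset = 0;
        device->SOSetTargets(1, &soBuffer, &offset);
    }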

After that was the reminder that the GS isn't only capable of
amplifying geometry, but it can also cull or deamplify your geometry
load. If there's scope for it in your application, consider using the
GS for culling before rasterisation. General hints for programming the
GS were sensible, and Nick outlined that the smaller the output struct
the faster things will run, and the smaller your amplification level
the better as well.

MSAA, Z and S access in your D3D10 shader
Moving on to MSAA, depth and stencil considerations in a D3D10 app,
Nick talked about 10.0's restriction that you can only access depth
and stencil buffers in your shader when MSAA is off, a restriction
that's removed in D3D10.1. Depth can't be bound in your shader while
depth writing is on in your render state, and you have to remember to
bind the stencil buffer with an integer format for the view, not
float.

For custom MSAA resolve in your shader, Nick made note that in a
modern game engine with HDR pixel output, it's not always correct to
let the hardware perform the resolve since that might take place
before a tonemapping pass on the frame. Instead, you'll want to
tonemap first to filter colour values into the right range before
resolve, which you perform yourself in the shader.

Output Merger
Lastly, talking about the OM stage of the D3D10 pipe, Nick encouraged
developers to remember that the blender can take two input colours, as
long as you observe some semantic rules about which rendertarget you
output to. Dual-source colour blending as implemented by D3D10 has
some restrictions in terms of the blend ops you have available, but
it's the beginning of fully programmable blending in the ROP, which is
on the D3D roadmap for the future.

Alpha-to-coverage was mentioned, Nick explaining how it works
regardless of MSAA being on or off, with the hardware taking alpha
samples as a coverage mask for the pixel, to determine how to blend
your bound rendertargets in the output stage of the pipe.
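
A sketch of the relevant state setup might look like the following;
the helper and the chosen blend factors are illustrative, but they
show the SRC1 factors that carry the second colour, and the
alpha-to-coverage switch living in the same description.

    // Hypothetical sketch: dual-source colour blending via the SRC1 factors,
    // with alpha-to-coverage available in the same blend description.
    #include <d3d10.h>

    HRESULT CreateDualSourceBlend(ID3D10Device* device, ID3D10BlendState** state)
    {
        D3D10_BLEND_DESC desc = {};
        desc.AlphaToCoverageEnable    = FALSE;   // flip on for alpha-to-coverage
        desc.BlendEnable[0]           = TRUE;    // dual-source uses RT slot 0
        desc.SrcBlend                 = D3D10_BLEND_SRC1_COLOR;
        desc.DestBlend                = D3D10_BLEND_INV_SRC1_COLOR;
        desc.BlendOp                  = D3D10_BLEND_OP_ADD;
        desc.SrcBlendAlpha            = D3D10_BLEND_ONE;
        desc.DestBlendAlpha           = D3D10_BLEND_ZERO;
        desc.BlendOpAlpha             = D3D10_BLEND_OP_ADD;
        desc.RenderTargetWriteMask[0] = D3D10_COLOR_WRITE_ENABLE_ALL;
        return device->CreateBlendState(&desc, state);
    }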

Essentially Nick's talk and slides were all about using D3D10
sensibly, all while keeping in mind what the hardware can do for you
at the various render stages. Chances are if you don't abuse the API
you won't abuse the hardware, and common sense prevails. Being mostly
hardware agnostic, the advice given for ISVs looking to make the most
of their D3D10 investment, be it a fresh codebase or a D3D9 port, will
apply to their development on other hardware, not just Radeons.

Bruce Dawson



Bruce Dawson is one of Microsoft's DirectX ISV guys, and his
presentation focused on application development in terms of
performance profiling. Bruce's early message was: don't make
performance profiling a last-minute step you take before going to
master. It needs to be part of the software development cycle from the
outset, with Bruce urging Windows game developers to make use of a
well understood profiling infrastructure, especially to get the most
out of the GPU and multi-core CPUs.

Bruce talked about having clearly designed performance goals that
allow for Draw* call costs from the beginning, using those goals and
profiling to focus development to make sure the performance targets
are hit. Having the user adjust the performance experience only gets
you so far, and having the user throw more hardware at what could be
your problem as a developer isn't really what you want.
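
As a trivial illustration of budget-driven measurement, a sketch like
the one below (ours, with a purely illustrative 16.6ms budget) is the
sort of thing that can be wired into an automated nightly run.

    // Hypothetical sketch: a frame-budget check built on QueryPerformanceCounter.
    #include <windows.h>
    #include <cstdio>

    class FrameTimer
    {
    public:
        FrameTimer()
        {
            QueryPerformanceFrequency(&m_freq);
            QueryPerformanceCounter(&m_last);
        }

        // Returns the last frame's duration in milliseconds and warns when
        // the agreed budget is blown.
        double Tick(double budgetMs = 16.6)
        {
            LARGE_INTEGER now;
            QueryPerformanceCounter(&now);
            const double ms = 1000.0 * double(now.QuadPart - m_last.QuadPart)
                                     / double(m_freq.QuadPart);
            m_last = now;
            if (ms > budgetMs)
                std::printf("frame over budget: %.2f ms (budget %.1f ms)\n", ms, budgetMs);
            return ms;
        }

    private:
        LARGE_INTEGER m_freq, m_last;
    };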

Bruce urges developers to create a representative benchmark or test
levels to use for performance profiles and public demos, to help
collect lots of data from the users who'll eventually run the final
game. Collecting that as often as the development process allows is
key. A couple of laughs from the crowd ensued when Bruce suggested
developers should use lower spec machines than they're used to, to
make sure the game runs well on those (and thus automatically well on
anything with a better spec).

Then came the slightly dubious advice of expecting around 40-60% of a
modern CPU core to be taken up by the driver and OS on a Windows
system, as overhead. Whether that was meant to urge developers to
make the most efficient use of the remaining CPU time (when in reality
more should reasonably be available), or whether Bruce was speaking
from experience, wasn't clear. Automated performance profiling can
also be one of the biggest boons to developers, allowing nightly
performance testing of builds without developer interaction, with
Bruce telling the audience to make sure that's in their next and
upcoming games, to catch performance issues early and often.

The next round of slides concentrated on making sure developers were
using the best available performance-testing tools. Intel VTune,
AMD Code Analyst, Microsoft's CLR profiler if it's managed code, the
ANTS profiler, Microsoft PIX, NVIDIA PerfKit and AMD GPU PerfStudio
were all mentioned for Windows performance analysis, along with Event
Tracing for Windows (ETW). Developers should use those, and others, to
continually performance profile their applications, but not to the
point of obsession or exclusion of other important development that
needs to take place.

Regular performance testing should also happen on release builds,
without any asserts, logging or debug code, if possible, on machines
not used for development. Care should be taken to make sure the
application runs as well as possible in that instance, too, Bruce
recommending turning off the DWM on Vista (although it's disabled in
full-screen exclusive mode anyway), making sure Vsync is off and
things like the Windows Sidebar are disabled, to give the game the
best chance. In places it's the inverse of the environment an
end-user will experience the game in, but it removes more chances of
external software interfering with your application's runtime
performance.

For graphics debugging, Bruce pushed PIX as a means to get the big
picture, using its default frame tracing to get an idea of what's
going on. PIX's ability to do per-pixel debugging, especially in terms
of pixel history (to show how a pixel has been shaded), can be a key
tool to figure out why a pixel looks the way it does, if it's not what
you're expecting. PIX will capture draw calls per frame, so you can
check your call budget, and it'll tell you constant update frequency,
and where you change state.

File I/O Bottlenecks
Dawson moved on to talking about file I/O next, saying that
bottlenecks here simply don't get enough attention. His advice was
simple:

*Don't compile your shaders from HLSL stored on disk in the middle of
a frame, because the disk will slow you down (he was quite serious;
presumably that's happened)
*Use asynchronous I/O to load resources if you can, so you don't block
the CPU waiting for the return
*Use I/O worker threads to control that asynchronous loading scheme
*Fully memory map large files if you have the virtual address space,
which can be a huge win on 64-bit systems
*Remember to use the right file access flags to trigger disk I/O fast
paths (SEQUENTIAL_SCAN and RANDOM_ACCESS are hints for Windows to do
the right thing); see the sketch after this list
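
A hedged sketch of two of those points, using plain Win32 calls with
illustrative paths and minimal error handling:

    // Hypothetical sketch: a sequential-scan/overlapped open for streamed
    // resources, and memory-mapping a whole pack file for read-only access.
    #include <windows.h>

    HANDLE OpenForStreaming(const wchar_t* path)
    {
        // FILE_FLAG_SEQUENTIAL_SCAN lets the cache manager read ahead;
        // FILE_FLAG_OVERLAPPED enables asynchronous reads.
        return CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING,
                           FILE_FLAG_SEQUENTIAL_SCAN | FILE_FLAG_OVERLAPPED, nullptr);
    }

    const void* MapWholeFile(const wchar_t* path, HANDLE* outMapping)
    {
        HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return nullptr;
        *outMapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        CloseHandle(file);                 // the mapping keeps its own reference
        if (!*outMapping) return nullptr;
        return MapViewOfFile(*outMapping, FILE_MAP_READ, 0, 0, 0);  // map the lot
    }
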
Helping Windows Do The Right Thing
Next up was helping Windows do the right thing. Only run one
heavyweight thread per CPU core so that the thread scheduler can do
the right thing, and don't try and outsmart it by forcing threads to
run on certain cores, like you might do on Xbox 360. The PC isn't a
console and you need to respect other apps that might be running while
yours is, so let the Windows scheduler manage your processor usage for
you.

If you know the rough size of your application's working set, use
SetProcessWorkingSetSize() to let Windows know you're going to ask for
roughly that amount of memory, so it can know that allocation is
coming and move things out of the way if need be. Dawson spent a good
amount of time on those points for developers writing games for
Windows, especially those whose engines might do things a little
differently on Xbox 360 or other target platforms.
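
A minimal sketch of that hint, with a purely illustrative size and
helper name:

    // Hypothetical sketch: tell Windows roughly how much memory the game
    // expects to keep resident, so the allocation doesn't come as a surprise.
    #include <windows.h>

    void HintWorkingSet(SIZE_T expectedBytes)
    {
        SetProcessWorkingSetSize(GetCurrentProcess(),
                                 expectedBytes / 2,   // minimum working set
                                 expectedBytes);      // maximum working set
    }

    // e.g. HintWorkingSet(512u * 1024 * 1024);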

Justin Boggs


Justin is one of AMD's senior developer relations engineers on the CPU
side, helping game developers get the best CPU performance out of
their products. He gave two presentations, which we'll coalesce into
one page. The morning presentation led into the afternoon's topics.
Both focused on AMD processor technology, with special attention given
to their upcoming native quad-core implementations.

Corporate Housekeeping and High-Level Bits
Justin started by mentioning that even though AMD bought ATI, most at
AMD see it as a merger of expertise, technology and products. As a CPU
guy, hearing him say that x86 was pervasive and had application in
graphics wasn't too surprising. Fusion was next on the agenda. Boggs
confirmed that the products would appear as both MCMs and single-die
chips. Justin discussed Fusion's tie to Torrenza as well. Torrenza
itself was portrayed not only as a socket architecture for
coprocessors but also covering processors on add-in boards in slots,
encompassing both common methods of getting new silicon connected to a
system.

Fusion was all about increasing the minimum PC spec, Boggs said to
quiet cheers from the crowd. AMD will ensure that Fusion isn't
integration for the sake of cost, and Boggs emphasised that the
available compute power from a Fusion product would help lift the
baseline of performance for systems that use it as their central
processing devices.

AMD's fabs were up next, with Boggs talking about their in-progress
32nm fab in New York, the migration of Fab36 to 300/45 (mm/nm) from
300/65, and Fab30 getting a wafer size boost to 300/45 as AMD sells
off the 200mm equipment currently producing 65nm devices.
Closer relationships with AMD's foundry partners are also on the
cards, as AMD anticipates volume growth in the number of wafer starts
it'll order; next month's Barcelona launch is one of the main volume
growth drivers that AMD is anticipating.

Roadmap details like the DDR3 transition and Socket AM3, DX10 IGPs in
2008, and HyperTransport 3.0 and PCI Express Gen 2 in 2007 were all
mentioned, and Boggs was keen to talk up Griffin on top of that.
Griffin is AMD's next-generation mobile processor architecture, and a
first for the company according to Boggs, in that it has been
engineered from the ground up as a mobile processor rather than being
a binned desktop die.
DisplayPort comes in 2008, and indeed AMD have recently tested a GPU
implementation with VESA. PCIe Gen2 will appear on mobile platforms in
2008 as well.


Native Quad Core Architecture Highlights
SSE4a support (a four-instruction subset of the SSE4 instruction set)
in Barcelona has been known for a while, as has the
architecture's 128-bit SSE FPU. Boggs mentioned overclocking potential
due to the split power plane, while keeping the CPU within its defined
TDP. He mentioned the more efficient memory controller (~85%
efficiency apparently), the float IPC rate (four 64-bit IEEE754 ops
per clock, eight single precision, split 50:50 ADD:MUL) and the fact
that AMD have tweaked the memory controller to better feed four
processor cores.

The software support for the new architecture is what most were
interested in, though. Boggs talked about the AMD Performance Library,
which will ship with Barcelona microarchitecture support for the
performance-critical code sections on launch. APL 1.1 will support
updated SSE routines for Barcelona processors, and the library is
increasingly popular with game developers according to Boggs, with its
support for image and signal processing functions at high speed on the
processor.

Looking forward to compilers supporting Barcelona microarchitecture
enhancements, Microsoft will have support for Barcelona (in terms of
SSE4a, 128-bit SSE operations and knowledge of the cache hierarchy in
particular) in their CLR (and presumably C) compiler due to ship with
Visual Studio 2008. Indeed, the current beta versions, codenamed
Orcas, already have some of that support built in, allowing developers
to test performance-critical CPU code on Barcelona systems before the
official launch of the tools next year. For cross-platform developers
using GCC, that compiler has had support for the new architecture for
a little while now, thanks to AMD's engineers, so use a recent GCC4
build to get that.

Note that you don't need one of these compilers to run code on
Barcelona; they're just the current compilers that support the
specific architecture improvements that'll help software performance
on the CPU.

To end, Boggs quickly returned to the hardware side, mentioning that
the RDTSC instruction (which reads the CPU's internal timestamp
counter) is now invariant: every core reports a consistent, correct
value to software no matter which core the instruction runs on. The
invariance comes at a cost though, so if you're using it for timing in
your application, beware: there's now a latency of around 60 clocks,
so sampling it repeatedly might cause slowdowns you weren't previously
experiencing.
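
For illustration only, here's a sketch of the two timing routes on
Windows: the MSVC __rdtsc intrinsic used sparingly, and
QueryPerformanceCounter for routine measurement.

    // Hypothetical sketch: prefer QueryPerformanceCounter for everyday timing,
    // and keep raw TSC reads out of hot loops now that each read costs more.
    #include <windows.h>
    #include <intrin.h>

    unsigned long long ReadTimestamp()
    {
        // Invariant TSC means this is consistent across cores on Barcelona,
        // but each read carries extra latency.
        return __rdtsc();
    }

    double SecondsNow()
    {
        LARGE_INTEGER freq, now;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&now);
        return double(now.QuadPart) / double(freq.QuadPart);
    }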

Minor architectural details were also presented, but we'll leave that
for a more in-depth architecture analysis after Barcelona launches.

So software support at the compiler level should be good for AMD going
forward and the architecture improvements hint at increased
performance in gaming workloads, especially those that make heavy use
of the FPU and SSE.

Thoughts
Let's get the biggest thing out of the way first. During the session
breaks, more than one AMD representative had reservations about
holding the Tech Day next year because of the low turnout for this
year. As one of the few presentation runs AMD do in the UK for their
game development and technology outreach efforts, it'd be a shame to
can it for 2008. There's plenty to be learned from the day for
UK-based game developers at all levels, so don't give up because of a
sub-par turnout.

Hopefully, it's clear from reading our summaries of the presentations
-- D3D10 and multi-GPU from Richard and Nicolas, Bruce Dawson's
profiling and performance talk and Justin's architecture and CPU-
driven software talk -- that game development on the PC is at one of
those nice technology and performance sweet spots that comes around
every so often. The best D3D yet is shipping, graphics hardware is
around to accelerate it and multi-core processors are in almost all
modern shipping PCs. Quad-core on the desktop is about to explode.

If you're a keen developer of 3D graphics applications on the PC, you
should be rubbing your hands at the thought of working with a
reasonably clean and predictable API that lets you express your
problem more easily and efficiently than ever before. You should also
be happy that basic CPU performance is taking big leaps in terms of
per-socket performance, and the first D3D10 hardware accelerators are
enabling D3D10 development to go much faster.

The current D3D10 application stutters we're seeing, especially with
the first wave of games that make use of it, should go away in due
course. Nick's talk on how to use D3D10 is eye-opening in places, if
you're getting to grips with it now, and Richard's talk should
hopefully have motivated more developers to think about multi-GPU and
D3D10-level hardware features as they build their applications. Those
D3D10-level hardware features will get faster over time, and there's
always the tessellator to start thinking about now, especially if
you're looking towards DX11.

Any large API revision will always take time to get used to,
especially one like D3D10 that ushers in some fairly large changes to
the pipeline, and multi-GPU has joined multi-core CPU programming as
topics with that "argh, more than one is too many to deal with!" gut
reaction. The Develop Tech Day is designed to make that all a bit
easier and less scary, with good advice (in our opinion; we'd love to
hear comments from developers using the mentioned technologies in
anger, especially D3D10) from the presenters and lots to take away and
think about.

We should mention that one of the GPG's latest recruits, Holger Gruen,
presented on the use of GPU PerfStudio and GPU Shader Analyser for GPU-
level performance tuning and optimisation. We'd cover his presentation
here but we're in the process of reviewing those tools for another
piece, so we'll hold back for that. Holger, who is ex-Intel and is no
stranger to getting his hands dirty with code, presented well on the
tools, so we're somewhat remiss in not covering them.

So the high-level highlights from this year's day, then, in no
particular order:

*Multi-GPU is going to be a big part of your development life on the
PC in years to come, more so than now
*D3D10 works best if you think sensibly about the new pipe and
improvements before deploying them
*Programmable primitive tessellation is coming in DX11, but you can
try it now on R6-family GPUs
*D3D10 doesn't have full orthogonality, so be mindful of what works
with what, despite the base spec
*Constant buffer management is the main reason early D3D10 apps are
slow; use CBs properly
*Be sensible with GS geometry amp and streamout
*Profile often and well; be sure to remember file I/O when profiling
*Make public benchmarks available to let users profile your code for
you
*Help Windows do the right thing with memory allocation and thread
management in games
*Quad-core CPUs are going to become the norm in the average gamer's
desktop in fairly short order

