Rendering Text Fast
While Insight3D was under development, the team wrote a series of blog posts, which are now archived here to provide in-depth technical information about some of the underlying techniques and design decisions. These posts reflect the work-in-progress state of the product at the time.
The following blog post was originally published on the Insight3D blog on August 29, 2008 by Patrick Cozzi. At the time of publishing, Insight3D was referred to by its development code-name, "Point Break".
Our new text rendering code is 5x faster (in a bad case) than our STK code - and the new code uses the same algorithm! How's that work? Well, we are using the same algorithm but the implementation is vastly different. In this post, I'll describe the new implementation, which offloads work from the CPU to a vertex shader running on the GPU, enabling the use of static vertex buffer objects.
The algorithm STK uses to render 2D text in the 3D scene is to render a textured quad for each character. First, each character from a font is written into one or more texture atlases.
The bounding texture coordinates for each character are computed. To render a string, its world position is converted to window coordinates on the CPU (now is a good time to review your transformations). Then, a textured quad is rendered for each character with the corresponding texture coordinates. The window coordinate is translated on the CPU after each character so that characters aren't rendered on top of each other.
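To make this concrete, here is a minimal sketch of the CPU approach in the style of typical example code, immediate mode and all. The Glyph struct and drawString function are illustrative, not STK's actual code, and an orthographic projection matching the viewport is assumed to be active:

#include <GL/glu.h>

// Illustrative glyph record: atlas texture coordinates and size in pixels.
struct Glyph
{
    float s0, t0, s1, t1; // bounding texture coordinates in the atlas
    float width, height;  // character size in pixels
};

void drawString(const char* text, const Glyph* glyphs,
                GLdouble worldX, GLdouble worldY, GLdouble worldZ,
                const GLdouble* modelview, const GLdouble* projection,
                const GLint* viewport)
{
    // Convert the string's world position to window coordinates on the CPU.
    GLdouble x, y, z;
    gluProject(worldX, worldY, worldZ,
               modelview, projection, viewport, &x, &y, &z);

    // One textured quad per character, translating the pen after each one.
    glBegin(GL_QUADS);
    for (const char* c = text; *c != '\0'; ++c)
    {
        const Glyph& g = glyphs[(unsigned char)*c];
        glTexCoord2f(g.s0, g.t0); glVertex3d(x,           y,            z);
        glTexCoord2f(g.s1, g.t0); glVertex3d(x + g.width, y,            z);
        glTexCoord2f(g.s1, g.t1); glVertex3d(x + g.width, y + g.height, z);
        glTexCoord2f(g.s0, g.t1); glVertex3d(x,           y + g.height, z);
        x += g.width; // so characters aren't rendered on top of each other
    }
    glEnd();
}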
This is probably the most widely used text rendering algorithm. It works fine but has performance problems: each character is processed by the CPU every frame. Even worse, pretty much any example code you are going to find uses immediate mode to render the quad! Today's GPUs are programmable and massively parallel, so let's offload this work to the GPU.
The GPU implementation is a straightforward extension of the CPU implementation. Instead of translating each vertex on the CPU one after another, each vertex is translated on the GPU in parallel. The mesh (i.e., one quad per character) is now static as far as the CPU is concerned, so it can be stored in fast video memory, and CPU-to-GPU bus traffic can be avoided by using static vertex buffer objects.
Each vertex contains its window space translation. For example, the first character's quad will have four vertices: (0, 0), (w, 0), (w, h), and (0, h), where w and h are the width and height of the character, respectively, in pixels. The job of the vertex shader is to take the text's origin in world coordinates and the vertex's pixel translation and output the transformed vertex in clip coordinates.
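As an illustration, building that static buffer might look like the following sketch. The Glyph and Vertex types and the createStringVbo function are made up for this example, and the index buffer used for drawing is omitted:

#include <vector>
#include <GL/glew.h> // or your extension loader of choice

// Hypothetical types for illustration.
struct Glyph  { float s0, t0, s1, t1, width, height; };
struct Vertex { float x, y, s, t; };

// Build a static vertex buffer in which every vertex carries its
// window-space translation from the string's origin. The buffer is
// created once and never touched again by the CPU.
GLuint createStringVbo(const char* text, const Glyph* glyphs)
{
    std::vector<Vertex> vertices;
    float penX = 0.0f;
    for (const char* c = text; *c != '\0'; ++c)
    {
        const Glyph& g = glyphs[(unsigned char)*c];
        const Vertex quad[4] = {
            { penX,           0.0f,     g.s0, g.t0 },
            { penX + g.width, 0.0f,     g.s1, g.t0 },
            { penX + g.width, g.height, g.s1, g.t1 },
            { penX,           g.height, g.s0, g.t1 }
        };
        vertices.insert(vertices.end(), quad, quad + 4);
        penX += g.width; // each character's translation is baked in here
    }

    GLuint vbo = 0;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER,
                 (GLsizeiptr)(vertices.size() * sizeof(Vertex)),
                 vertices.empty() ? 0 : &vertices[0],
                 GL_STATIC_DRAW); // static: resident in video memory
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    return vbo;
}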
The following GLSL vertex shader is one possible implementation. The shader we use in production is quite a bit different due to techniques to improve precision and other features that are unimportant here.
uniform vec3 uTextOrigin; // In world space
uniform vec4 uViewport;   // [left, bottom, width, height]

void main(void)
{
    //
    // The model-view matrix (hyphen in model-view?) is really the
    // model-view-projection matrix, so v is in clip coordinates.
    //
    vec4 v = gl_ModelViewMatrix * vec4(uTextOrigin, 1);

    //
    // Perspective divide to get normalized device coordinates.
    //
    v.xyz /= v.w;

    //
    // Viewport transform to get window coordinates.
    //
    v.x = uViewport.x + uViewport.z * (v.x + 1.0) * 0.5;
    v.y = uViewport.y + uViewport.w * (v.y + 1.0) * 0.5;
    v.z = (v.z + 1.0) * -0.5;

    //
    // v is now the text's origin in window coordinates. Translate
    // to get this character's position.
    //
    gl_Position = gl_ProjectionMatrix * (vec4(v.xyz, 0) + gl_Vertex);

    //
    // Pass through color and texture coordinates
    //
    gl_FrontColor = gl_Color;
    gl_TexCoord[0] = gl_MultiTexCoord0;
}
First, the text's origin is transformed by the perspective model-view-projection matrix. This is the "standard" vertex shader output. Since we need to operate in window coordinates, the vertex then goes through a perspective divide and a viewport transform. Now we simply add the translation to position the vertex for its character. Finally, the translated vertex is transformed by an orthographic projection matrix, and a few inputs are passed through to the fragment shader (which, in this case, can be fixed-function).
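For completeness, one matrix setup consistent with this shader (a sketch, not necessarily what we do) loads the scene's full model-view-projection product into the fixed-function model-view slot and a window-space orthographic projection into the projection slot:

void setTextMatrices(GLdouble viewportWidth, GLdouble viewportHeight,
                     const GLdouble* sceneProjection, // column-major 4x4
                     const GLdouble* sceneModelView)  // column-major 4x4
{
    // Projection slot: a window-space orthographic projection. The 0..1
    // depth range pairs with the shader's v.z = (v.z + 1.0) * -0.5 line,
    // so depth survives the second projection unchanged.
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0.0, viewportWidth, 0.0, viewportHeight, 0.0, 1.0);

    // Model-view slot: the scene's projection times its model-view matrix,
    // which is why gl_ModelViewMatrix in the shader is really the MVP.
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    glMultMatrixd(sceneProjection);
    glMultMatrixd(sceneModelView);
}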
Now that we have pushed a bunch of work to the GPU, there are other details to consider.
First, not all fonts will fit in a single texture, even if the texture is big. When a font spans multiple textures, sort the draw calls by texture to reduce the number of texture binds. A single string is not guaranteed to be renderable in one draw call, since the string may span multiple textures. Texture arrays could make that possible, but they require a GeForce 8 or better.
Having one or more draw calls per string isn't terrible (in OpenGL at least). The thing to avoid is setting the text's origin uniform for every draw call. I found it to be very expensive, so I eliminated it by duplicating the text's origin across all of a string's vertices and modifying the vertex shader accordingly. Besides not having to set the uniform every draw call, this allows draw calls from multiple strings to be batched into a single draw call as long as they share the same texture. In some cases, I got a 250% performance gain on a GeForce 8800 GTX. I'm still not sure if this is a good idea - I've heard setting uniforms may cause an expensive shader recompilation to allow for optimizations. But I've also heard that an improvement this drastic could imply a driver bug in setting uniforms.
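As a sketch of how the duplicated origin can reach the shader without a uniform (the exact layout here is illustrative): store the origin in every vertex and bind it as a second texture coordinate set, with the shader reading gl_MultiTexCoord1.xyz where it used to read uTextOrigin.

// Extended vertex: the string's world-space origin is duplicated into
// every one of its vertices (illustrative layout, 7 floats per vertex).
struct Vertex
{
    float x, y;       // window-space translation (gl_Vertex)
    float s, t;       // atlas texture coordinates (gl_MultiTexCoord0)
    float ox, oy, oz; // text origin, identical across the string
};

void bindStringArrays(GLuint vbo)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_FLOAT, sizeof(Vertex), (const GLvoid*)0);

    glClientActiveTexture(GL_TEXTURE0);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glTexCoordPointer(2, GL_FLOAT, sizeof(Vertex),
                      (const GLvoid*)(2 * sizeof(float)));

    // The origin rides along as a second texture coordinate set; the
    // shader reads gl_MultiTexCoord1.xyz in place of the old uniform.
    glClientActiveTexture(GL_TEXTURE1);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glTexCoordPointer(3, GL_FLOAT, sizeof(Vertex),
                      (const GLvoid*)(4 * sizeof(float)));
}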
Regardless, once draw calls are batched and sorted by texture, the rendering code is trivial.
// load matrices, enable blending, alpha test, texturing, etc.

useProgram(...);        // activate vertex shader
uniform(viewport);
bind(vertexBuffer);     // one vertex buffer for all strings

for each texture
    bind(texture);
    for each drawCall   // should only be one or a few per texture
        draw(drawCall.indexCount, drawCall.indexOffset);

// clean up
To be fair, sorting and batching make the rendering code elegant and efficient but make dynamic updates difficult. What if a string changes? Batched draw calls may need to be broken apart. What if the length of the string changes? Not fun. What if changing the string means different textures are required? This is a lot of bookkeeping and not interesting enough to discuss here.
Here are two ideas that may provide modest gains but could also be more trouble than they are worth.
When a space is rendered, a quad is sent down the pipeline but the frame buffer is never written because all of the fragments fail the alpha test. That's a lot of work to do nothing. If you know which characters are blank for a given font, you can simply skip creating quads for them. I'm pretty sure you can't guarantee that ' ' is a space in every font, so you'll have to inspect the bitmap for each character. If you're going to that much trouble, you might as well shrink each quad to its glyph's actual extents and adjust the translation accordingly - you'll save even more fragments from failing the alpha test.
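Detecting a blank glyph by inspecting its bitmap might look like the following sketch; the Bitmap type and alphaAt accessor are hypothetical, standing in for whatever your font rasterizer produces:

// Scan a rasterized glyph for any pixel that would pass the alpha test.
struct Bitmap
{
    int width, height;
    const unsigned char* alpha; // one alpha byte per pixel

    unsigned char alphaAt(int x, int y) const { return alpha[y * width + x]; }
};

bool isBlankGlyph(const Bitmap& glyph)
{
    for (int y = 0; y < glyph.height; ++y)
        for (int x = 0; x < glyph.width; ++x)
            if (glyph.alphaAt(x, y) != 0)
                return false; // at least one fragment would be written
    return true; // no quad needed; just advance the pen position
}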
There has been a ton of recent work in cache-coherent vertex layouts, e.g. OpenCCL, Tom Forsyth's algorithm, and the algorithm we use: Fast Triangle Reordering for Vertex Locality and Reduced Overdraw. I'm unaware of any work in cache-coherent texture atlas layouts, although there are plenty of heuristics for texture atlas layouts that minimize wasted space. Perhaps a cache-coherent layout can be computed using knowledge of the language (e.g., the letter e is the most common letter in the English language) or using a machine learning algorithm with the expected set of strings to be rendered. I suppose such a layout could improve texture cache hits. Given the high degree of GPU threading used to hide latencies and the fact that this algorithm is already very light in fragment processing, I didn't explore this any further. This might be a good project for a graduate student but I won't promise that it will work.
I rendered 13,691 strings, with a total of 192,073 characters using both Point Break's GPU-based implementation and STK's CPU-based implementation. It is not exactly an apples-to-apples comparison but at least we can get a feel for the GPU implementation's speed.
In an empty 3D window with just the text, the GPU version was 545% faster. Not bad - but I expected better. This test doesn't capture the full potential of the algorithm. Since the vertex crunching is all done on the GPU, the CPU is now free to do other things. This is really the use case we care about; an application built with Point Break is likely to have lots of CPU-intensive analysis computations (potentially on multiple threads).
To emulate this, I put a dummy loop that counts to 50,000,000 and outputs the value to the console window at the start of each frame. Without any text rendering, the frame rate was 17.31 fps - obviously all CPU load. The frame rate dropped to 10.86 fps when text was added using the CPU code. When the GPU code was used, the frame rate dropped only slightly, to 17.28 fps. The difference is almost "in the noise."
This is an approach we are implementing throughout Point Break - utilizing the programmable GPU to free up the CPU for more important things, such as your application.