§ ¶FPO and a callee-pops parameter passing convention make perfect stack walks impossible
There's a bit of discussion over at Larry Osterman's blog, which I've been participating in, about the Frame Pointer Omission (FPO) optimization in the Visual C++ compiler and how it affects stack walking. I figured I'd expound on it a bit more here.
The basic problem to be solved when doing a stack walk is finding the locations of return addresses in the stack, which are also the locations of the stack pointer upon entry to each function in the call stack. If you can somehow determine how much local data is present at each stack frame, you can maintain a virtual stack pointer and hop from stack frame to stack frame until the call stack is determined. On x86, the steps involved are generally as follows:
1. Obtain the instruction pointer (EIP) and the stack pointer (ESP) of the thread.
2. Look up the current virtual EIP in debugging information to determine the current function.
3. Obtain the base of the stack frame, either by reading the frame pointer on the stack, or offsetting ESP if there is no frame pointer. This is now the new virtual ESP.
4. Read the return address into the virtual EIP.
5. Go to step 2.
The trick is determining the base of each stack frame. When EBP frame pointers are present, this is easy -- just keep following the saved frame pointers next to each return address. What's not so easy is the FPO case, where ESP is used directly, because the offset from ESP to the return address depends on how much local variable space is allocated, and how many parameters for called functions are present.
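The easy frame-pointer case can be sketched roughly as follows. This is only an illustration over simulated stack memory, not how any particular debugger does it; the names `StackMemory` and `WalkFramePointers` are made up for this example, and the frame layout (saved EBP at `[ebp]`, return address at `[ebp+4]`, a zero saved EBP ending the chain) is the conventional one assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Simulated 32-bit stack memory: address -> dword stored at that address.
// (Hypothetical stand-in for reading the target thread's memory.)
using StackMemory = std::map<uint32_t, uint32_t>;

// Walks an EBP chain, assuming every frame looks like this:
//   [ebp]     saved EBP of the caller (0 terminates the chain)
//   [ebp + 4] return address into the caller
// Returns the code addresses of the call stack, innermost first.
std::vector<uint32_t> WalkFramePointers(const StackMemory& stack,
                                        uint32_t eip, uint32_t ebp,
                                        std::size_t maxFrames = 64) {
    assert(maxFrames > 0);
    std::vector<uint32_t> callStack{eip};

    while (ebp != 0 && callStack.size() < maxFrames) {
        auto savedEbp = stack.find(ebp);
        auto retAddr = stack.find(ebp + 4);
        if (savedEbp == stack.end() || retAddr == stack.end())
            break;  // unreadable memory: stop rather than guess

        callStack.push_back(retAddr->second);

        // The stack grows down, so a caller's frame must sit at a higher
        // address; anything else means the chain is corrupt.
        if (savedEbp->second != 0 && savedEbp->second <= ebp)
            break;

        ebp = savedEbp->second;
    }
    return callStack;
}
```

Note that nothing in this loop needs to know how many locals or outgoing parameters each function has on the stack, which is exactly the information the FPO case has to get from somewhere else.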
I claim that it is impossible to reliably stack walk in the general case with the __stdcall or thiscall calling convention and FPO involved -- even with full debugging information! And no, the code doesn't have to be that weird.
Consider the following function disassembly:
    00000000: 8B 01    mov   eax,dword ptr [ecx]
    00000002: 6A 02    push  2
    00000004: 6A 01    push  1
    00000006: FF D0    call  eax
    00000008: FF D0    call  eax
    0000000A: C3       ret
What would be the appropriate debug information for this function? Ideally, you would want to encode an ESP-to-return-address offset for each instruction, so based on the instruction pointer you could unambiguously determine the offset from every possible instruction that could crash. In some cases, you wouldn't even need to encode this information, if you could walk the instruction stream and update a virtual ESP based on executed instructions. This is frequently possible with compiler-generated code, since the compiler uses well-defined and simple patterns to maintain the stack. This is often done with RISC CPUs that have very easy to parse instruction streams. It's also done on X64, with the help of restrictions on prolog/epilog code and unwind bytecode. X86, however, has neither of these advantages.
Let's say the second CALL instruction in the above code crashes, due to EAX=0 -- which would mean that the first function call returned a null pointer. What would be the proper offset to add to ESP to get to the return address? You can't tell from the called function, because the call is indirect and you don't know which function was called.
If you answered 0, you were wrong. If you answered 8, you were wrong. In fact, no matter what value you picked, you would be wrong.
Here's the source code that produced the above machine code, when compiled with Visual Studio 2005 SP1 at /O2:
    typedef void *(__stdcall *StateFunction0)();
    typedef void *(__stdcall *StateFunction1)(int a);
    typedef void *(__stdcall *StateFunction2)(int a, int b);

    struct IState {
        virtual void RunState() = 0;
    };

    struct State0 : public IState {
        virtual void RunState();
        StateFunction0 fn;
    };

    struct State1 : public IState {
        virtual void RunState();
        StateFunction1 fn;
    };

    void State0::RunState() {
        ((StateFunction2)fn())(1, 2);
    }

    void State1::RunState() {
        ((StateFunction1)fn(1))(2);
    }
(The unusual casting is due to C++'s inability to create a recursive typedef. Returning a function pointer as a void* is common when programming state machines, for this reason.)
You might say, aren't there two methods there? Yes, but they compile to the exact same code, and the Visual C++ linker will collapse two functions that have the same code even if they do completely different things. Essentially, the correct ESP offset at the second CALL instruction can be either +8 or +4, depending on whether State0::RunState() or State1::RunState() was executing. Both of these are implementations of the same virtual call on the same interface, so knowing the parent function doesn't help; the only way you could tell is by examining the type of this by checking the vtable pointer, and unfortunately after the first CALL instruction this is no longer available (ECX is a volatile register in the thiscall calling convention). I'm pretty sure that this is unsolvable in the general case except by knowing the entire execution history of the program up to this point.
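To make the ambiguity concrete, here is a small sketch that tracks a virtual ESP through the shared function body. It is a model of the arithmetic only, not a real unwinder; the function name `OffsetAtSecondCall` and its parameter are invented for this illustration.

```cpp
#include <cassert>

// Distance in bytes from ESP to the return address just before the second
// CALL in the shared body:
//   push 2 / push 1 / call eax / call eax / ret
// firstCalleeArgBytes is the number of argument bytes the first __stdcall
// callee removes with its RET n instruction. That value is precisely what
// the stack walker cannot know for an indirect call.
int OffsetAtSecondCall(int firstCalleeArgBytes) {
    int offset = 0;  // on entry, ESP points directly at the return address
    offset += 4;     // push 2
    offset += 4;     // push 1

    // First CALL: the return address it pushes is popped again by the
    // callee's RET, so only the callee-popped arguments change the offset.
    offset -= firstCalleeArgBytes;
    return offset;
}
```

With `State0::RunState()` the first callee takes no arguments, so `OffsetAtSecondCall(0)` gives 8; with `State1::RunState()` the first callee pops one `int`, so `OffsetAtSecondCall(4)` gives 4. The instruction bytes and the instruction pointer are identical in both cases, so a per-instruction offset table can only hold one of the two answers.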
Moral of this story: Callee-pops calling conventions are absolutely evil with regard to accurate stack walks.
§ ¶HTTRANSPARENT is evil
I spent part of last weekend tracking down an annoying problem in 1.7.2's video display code. One of my current obsessions is field display in Windows -- now that I have a very small and convenient video capture device, it annoys me that most programs in Windows still display video as if it were progressive, which leads to a poor quality live display. For some reason, DScaler has abnormally high latency with my USB 2.0 device, so it's back to rolling my own. I also want to make use of 3D hardware acceleration, because (a) it's extremely CPU intensive to fill a 1920x1200 display at 60fps, and (b) I'm lazy and it's easier to experiment with pixel shaders than highly optimized SSE2 code.
(As I've said in the past, nearly all features in VirtualDub are tied to some sort of video game or anime series. The non-interlaced field display code got me through Lunar 2. Interlaced field display is for Valkyrie Profile 2.)
Now, the problem with doing 60 fps field display with 3D acceleration is that with a 60Hz refresh rate, you must hit every frame exactly, or at least close enough that the glitches are more than several seconds apart. This is hard enough when you take into account the need to avoid tearing by not switching frames/fields in the middle of the screen, and it is even harder in windowed mode. DirectX is lame and doesn't give you any sort of vertical blank event or interrupt -- well, actually, it's IBM's fault for reportedly making the VBI optional for VGA -- and so the only option is to poll. I tried just letting Direct3D do this with D3DPRESENT_INTERVAL_ONE, and not only did it do a poor job of avoiding the beam in windowed mode, but it burned up a lot of CPU time doing so and also blocked my message loop for unacceptable periods, which caused the latency on the DirectShow graph to skyrocket. So, I had to resort to another method.
What I ended up with was moving the entire display window to another thread, so that it could poll in peace at high priority. A persistent problem that kept cropping up here was the display thread taking 100% of the CPU, even though I had a MsgWaitForMultipleObjects() loop with a 1ms timeout. I tracked the problem down to that function constantly returning WAIT_OBJECT_0, meaning that a message was available, without there actually being one -- so PeekMessage() was getting called in a tight loop. I hacked in a Sleep(1) as a temporary workaround, but then I had the weird problem of the UI becoming totally unresponsive, though still repainting, even though the CPU was idle 80-90% of the time. Even weirder, when I took the Sleep() out, VTune showed an abnormally high amount of time being spent in the kernel (ring 0) in functions like "win32k!xxxWindowHitTest."
It wasn't until I looked at the ReactOS and Wine source code that I discovered the culprit.
The problem was a WM_NCHITTEST handler I had put in to accommodate the cropping UI. The cropping UI needs mouse clicks to go through the display, so the display code returns HTTRANSPARENT so that all mouse input propagates to the parent window. There is a warning in MSDN saying that this only applies to windows within the same thread, and it turns out that returning HTTRANSPARENT when your parent is on a different thread is indeed a very bad idea. What happens is that Windows has problems determining which window "owns" the mouse message, and keeps bouncing it back and forth between the threads, resending WM_NCHITTEST to the transparent window each time. In Wine, this is apparently caused by a WindowFromPoint() call after the thread hop, which apparently doesn't return faithful results for transparent windows. Somehow in the real Windows this doesn't cause the threads to lock together, so the threads do idle, but the loop still blocks input messages, giving you a set of windows that repaints properly but doesn't respond to input. This also likely explains the phantom returns from MsgWaitForMultipleObjects(), probably caused by some sort of internal callback.
Removing the WM_NCHITTEST handler gave silky smooth 60Hz video, which freed me to solve some evil jumping puzzles in VP2. :)
The next problem I have to solve is trying to come up with a pixel shader that does better than bicubic interpolation with motion-detection-based weave/bob switching and gamma correction, but that's less enigmatic, at least.
§ ¶Is it too much to ask to have ONE good image display API in Windows?
Lately, I've been becoming increasingly frustrated with how difficult it is in Windows to reliably and efficiently blit an image to the screen with high quality. It shouldn't be that hard, but it is, because there are half a dozen different ways to do so and none of them meet all of the requirements. So I sat down and made a table of all of the ways to blit an image to the screen in Windows, and how they all suck in some fashion.
VirtualDub has code paths for GDI, DirectDraw blit, DirectDraw overlay, Direct3D, and OpenGL. GDI+ is here because it looks like a good API, until you discover that it has no useful hardware acceleration, has incorrect subpixel positioning for image blits, and is no longer being evolved. I put WPF (Avalon) here because I looked into it as a possible alternative when operating under DWM composition / Aero Glass on Vista, which is problematic since neither GDI nor DirectDraw is accelerated there, and Direct3D in child windows seems very flaky. The huge problem with WPF is that it requires .NET managed code, since the API is in .NET and the underlying MIL API isn't documented (grumble); another problem is that it seems unusually slow and flickers a lot whenever windows are resized.
Anyway, the table of image blitting woe:
| | GDI | GDI+ | DirectDraw (blit) | DirectDraw (overlay) | Direct3D | OpenGL | WPF (Avalon) |
---|---|---|---|---|---|---|---|
Platform support | 95+ NT3.1+ | 98+ NT4+ [1] | 98+ NT4+ [2] | 98+ NT4+ [2] | 98+ NT4+ [2] | driver | XP+ [12] |
Requires managed code | no | no | no | no | no | no | yes |
Hardware accel w/o 3D HW | yes | no | yes | yes | no | no | no |
Hardware accel with 3D HW | yes | no | yes | yes | yes | yes | yes |
Software fallback | yes | yes | yes | no | yes [3] | yes [4] | yes [3] |
Works with DWM composition | sw | sw | sw | no [5] | yes | yes | yes |
Bilinear filtering | sw [13] | sw | yes [6] | yes [7] | yes | yes | yes |
Bicubic filtering | no | sw | no | no | yes [8] | yes [8] | sw |
Terminal Services | sw | sw | sw | no | no | no | sw [9] |
Supports 256 color display | yes | yes | yes | yes | no | no | ? |
RGB format conversion | yes | sw | no | no [10] | yes | yes | yes |
YCbCr format conversion | no | no | no | yes | yes | yes | no |
Beam detection | no | no | yes | yes | yes | no | no |
Beam avoidance (vsync) | no [11] | no [11] | yes | yes | yes | yes | no [11] |
Explanations:
- sw: Supported, but with software emulation only.
- Platform support: Versions of Windows on which this API is available, including ones for which a redistributable is required.
- Requires managed code: Whether this API can only be used from .NET managed code.
- Hardware acceleration w/o 3D HW: Whether this API can be accelerated with older graphics hardware that only supports 2D acceleration.
- Hardware acceleration with 3D HW: Whether this API can be accelerated on 3D-capable hardware.
- Software fallback: If operation is possible without hardware support.
- Works with DWM composition: Operation when DWM composition (Aero Glass) is active under Windows Vista.
- Bilinear filtering: If bilinear filtering (4 tap) is supported on stretched images.
- Bicubic filtering: If bicubic filtering (16 tap) is supported on stretched images.
- Terminal Services: Whether the API works when Terminal Services (Remote Desktop) is active.
- Supports 256 color display: If operation is possible on a paletted display.
- RGB format conversion: If display of an RGB image in a different format than the display buffer is supported.
- YCbCr format conversion: If display of a YCbCr-encoded image is supported.
- Beam detection: If the API supports reading the position of the display image scanning beam.
- Beam avoidance (vsync): If the API supports altering the timing of image display to avoid tearing.
Notes:
1. Requires redistributable prior to Windows XP.
2. Requires redistributable for Windows 95.
3. With RGBRast. (Refrast is not counted as it requires the SDK and is excruciatingly slow.)
4. Microsoft's OpenGL 1.1 software implementation is available, but it is very slow.
5. Not supported. Overlay creation succeeds, but the overlay never shows up.
6. DirectDraw blits are point-sampled when DWM composition (Aero Glass) is active. Otherwise, filtering is up to the driver.
7. Varies widely; some drivers don't interpolate vertically, and some only interpolate chroma.
8. Requires custom implementation.
9. Can be hardware accelerated between two Vista-based systems using Avalon Remoting.
10. RGB overlays are possible, but I've never seen hardware that supported it.
11. Automatic if DWM composition (Aero Glass) is enabled.
12. Requires redistributable.
13. Requires Windows NT; quite slow.