Borland compilers and floating-point exceptions

One of the difficulties of releasing a program into the wild is that sometimes you get reports of crashes that are simply weird and don't make any sense at first. Take this crash, for instance:

Disassembly:
0043eba0: d8afb0000000    fsubr  dword ptr [edi+b0]      <-- FAULT

Crash reason: FP Invalid Operation

I used to get crashes like this periodically, mostly in audio codecs, and for the longest time couldn't figure out what was happening. The users reporting the problem could not reproduce it on demand and I had never seen it myself, which made it difficult to diagnose the problem, much less fix or work around it. So basically, I had to file the problem into the Could Not Reproduce file and keep going.

The problems, as it turns out, were caused by completely unrelated video codecs that had been compiled with the Borland C/C++ compiler. These codecs weren't actively being used at the time, but merely having them installed was enough to trip the problem. It took me a while to understand what was going on.

The evidence

It wasn't until I looked really closely at the crash reports that I realized what was going on. The tipoff turned out to be this line:

FPUCW = ffff1270

FPUCW stands for Floating Point Unit Control Word. Only the low 16 bits of this register are important, and under Windows it has the default value of 027F. At the time of the crash, however, it was 1270. Bit 12 being set doesn't matter, but bits 0-3 being cleared is really important, as those are the mask bits for the invalid operation, denormal, zero-divide, and overflow exceptions, respectively. Clearing those bits enables exceptions that would otherwise be masked (not occur), and since Win32 code doesn't usually expect such exceptions, this results in a fatal crash.
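To make the bit arithmetic concrete, here is a small sketch that decodes the low six mask bits of a control word as printed in a crash dump. The helper name and the output format are invented for illustration; only the bit positions come from the x87 control word layout.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// x87 control word exception-mask bits. A set bit means the exception is
// masked (suppressed); a cleared bit means the exception will trap.
enum : std::uint16_t {
    kMaskInvalid   = 0x0001,  // bit 0: invalid operation
    kMaskDenormal  = 0x0002,  // bit 1: denormal operand
    kMaskZeroDiv   = 0x0004,  // bit 2: zero divide
    kMaskOverflow  = 0x0008,  // bit 3: overflow
    kMaskUnderflow = 0x0010,  // bit 4: underflow
    kMaskPrecision = 0x0020   // bit 5: precision (inexact)
};

// Hypothetical helper: names the exceptions that a control word leaves
// unmasked. Only the low 16 bits of the dumped value are examined.
std::string UnmaskedExceptions(std::uint32_t fpucw) {
    const std::uint16_t cw = static_cast<std::uint16_t>(fpucw & 0xFFFFu);
    const struct { std::uint16_t bit; const char* name; } kBits[] = {
        { kMaskInvalid,   "invalid"     },
        { kMaskDenormal,  "denormal"    },
        { kMaskZeroDiv,   "zero-divide" },
        { kMaskOverflow,  "overflow"    },
        { kMaskUnderflow, "underflow"   },
        { kMaskPrecision, "precision"   },
    };
    std::string out;
    for (const auto& b : kBits) {
        if ((cw & b.bit) == 0) {          // cleared bit = exception enabled
            if (!out.empty()) out += ' ';
            out += b.name;
        }
    }
    return out;
}

// Checks the two values quoted in the article.
void CheckArticleValues() {
    // Crash-time value: bits 0-3 are clear, so four exceptions can trap.
    assert(UnmaskedExceptions(0xffff1270u) ==
           "invalid denormal zero-divide overflow");
    // Windows default: every exception is masked.
    assert(UnmaskedExceptions(0x027Fu).empty());
}
```

Running the crash-time value through this gives exactly the four exceptions named above, while the Windows default leaves nothing unmasked.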

Tripping floating-point exceptions

Floating-point exceptions are, as you might expect by virtue of the name, rare. The ones that happen most commonly by mistake in my experience are the zero-divide and invalid operation exceptions. Zero divide tends to happen whenever you have an unchecked normalization operation, such as resetting a 2D or 3D vector to unit length, which works fine until someone hands you a vector of length zero. Another example would be trying to normalize a portion of audio that was totally silent. When the zero-divide exception is masked, the FPU spits out a signed infinity instead, which sometimes works out in the end. For instance, if the expression is of the form |x/y| > n, then the infinity would give you the correct result.
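As a rough illustration of both halves of that paragraph, the sketch below shows the |x/y| > n test surviving a zero divisor when the exception is masked, next to a checked normalization. It assumes IEEE 754 arithmetic with exceptions masked and no fast-math compiler options; the function names are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE 754 floats");

// Relies on masked zero-divide behavior: a nonzero x divided by zero
// yields a signed infinity, and |inf| > n is true for any finite n.
bool RatioExceeds(float x, float y, float n) {
    volatile float denom = y;  // volatile discourages compile-time folding
    return std::fabs(x / denom) > n;
}

// Checked normalization: refuses a zero-length vector instead of dividing.
// Returns false and leaves the vector untouched in that case.
bool Normalize2D(float& x, float& y) {
    const float len = std::sqrt(x * x + y * y);
    if (len == 0.0f)
        return false;
    x /= len;
    y /= len;
    return true;
}
```

Note that the first function only gives the right answer because the exception is masked. With the mask cleared, as in the crash above, the same division would trap instead.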

Invalid operation exceptions are more serious and result from operations that don't have a graceful way to degrade, such as 0/0, the square root of -1, etc. These too often result from the lack of bounds checks. For instance, a common way to determine the angle between two vectors is through dot product, since the shortest angle between two vectors is acos(dot(v1 / |v1|, v2 / |v2|)). Unfortunately, the common way of normalizing vectors is to multiply by the reciprocal square root of the squared length (dot(v,v)), which can give you a not-quite-unit-length vector since the squaring operation discards half of the usual precision. This can then lead to taking the arccosine of a number slightly larger than 1. When such an operation occurs and invalid operation exceptions are masked, the FPU spits out a Not a Number (NaN) value and keeps going. You can also trip such an exception by trying to operate on NaNs, especially by loading garbage data that isn't a valid IEEE finite number.
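One common remedy for the arccosine case, which the article itself does not prescribe, is to clamp the cosine into [-1, 1] before calling acos. The following is a minimal sketch under that assumption; Vec3, Dot, and AngleBetween are hypothetical names, and returning 0 for a zero-length input is an arbitrary policy chosen for the example.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

float Dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Angle between two vectors in radians. The cosine is clamped so that
// rounding error can never push it outside the domain of acos.
float AngleBetween(const Vec3& a, const Vec3& b) {
    const float lenSq = Dot(a, a) * Dot(b, b);
    if (!(lenSq > 0.0f))
        return 0.0f;  // degenerate input (zero length or NaN): example policy
    float c = Dot(a, b) / std::sqrt(lenSq);
    if (c > 1.0f)  c = 1.0f;
    if (c < -1.0f) c = -1.0f;
    return std::acos(c);
}
```

Without the clamp, a cosine that rounds to just above 1 would produce a NaN when the invalid operation exception is masked, or a trap when it is not.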

In general, you don't want to be tripping floating-point exceptions, even if they are masked. The reason is that when the FPU hits one, the fast hardware can't handle it and punts to the microcode, which then takes about twenty times longer. This is especially bad with NaNs since any operation with a NaN produces another NaN, causing them to spread throughout your calculations (NaN disease) and slow down everything massively. You can even crash due to NaNs blowing past clamp expressions, since any comparison with a NaN other than != is false and converting one to integer form results in integer indefinite (0x80000000). Despite the erroneous results, though, NaNs can appear sporadically in a large Win32 program without anyone knowing, and may go unnoticed in a code base for years.
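The clamp problem is easy to reproduce. In the sketch below, which assumes IEEE 754 semantics and no fast-math options, the naive clamp lets a NaN straight through because both of its comparisons are false, while a variant with one comparison inverted catches it. Both function names are invented for the example.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Naive clamp: for a NaN input both tests are false, so the NaN is returned.
float ClampNaive(float v, float lo, float hi) {
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

// NaN-aware clamp: !(v >= lo) is true both for v < lo and for NaN,
// so a NaN is forced to the lower bound instead of escaping.
float ClampSafe(float v, float lo, float hi) {
    assert(lo <= hi);
    if (!(v >= lo)) return lo;
    if (v > hi) return hi;
    return v;
}
```

Sending a NaN to the lower bound is only one possible policy. The point is that the range check has to be written so that a false comparison lands on the safe side.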

Note that although exceptions are really slow and usually indicate mistakes, the results when the exceptions are masked are well-defined. It is possible, and sometimes reasonable, to actually depend on and test for specific results from masked exceptions. So it isn't valid to simply say "don't do that."

How Borland C/C++ factors into the picture

The Borland DLL run-time library, as it turns out, enables floating-point exceptions on initialization. This happens even if you simply load the DLL! Because Windows programs generally don't touch the floating-point control word, the effects of this can persist long after the DLL has been unloaded. For instance, you could:

It is possible to disable this behavior of the Borland run-time library and avoid this problem, but most people aren't aware of it, and unintentionally release DLLs that cause this issue. I have heard that DLLs built with Delphi can cause this problem as well. The best way to fix it is to not modify the control word, but I don't know if that is possible; barring that, a usable workaround is to remask the exceptions with _controlfp(), as noted at http://homepages.borland.com/ccalvert/TechPapers/FloatingPoint.html.

It's not just me, either. The Java bug database has an interesting incident where loading a Delphi DLL caused the JVM to subsequently crash with a floating-point divide-by-zero exception: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4644270

Checking for this problem is easy. Execute sqrt(-1) after your DLL loads and see if it crashes.
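A gentler alternative to crashing on purpose is to read the exception mask directly and remask if needed. The sketch below is an assumption-laden illustration, not VirtualDub's code: on Microsoft's C runtime it uses _controlfp() from <float.h>, on glibc it uses the GNU extensions fegetexcept() and fedisableexcept(), and on anything else it does nothing because there is no portable way to read the mask. The two function names are hypothetical.

```cpp
#include <cassert>
#if defined(_MSC_VER)
  #include <float.h>
#else
  #include <fenv.h>
#endif

// Returns true if the common floating-point exceptions are all masked.
bool FpExceptionsMasked() {
#if defined(_MSC_VER)
    const unsigned int want = _EM_INVALID | _EM_ZERODIVIDE | _EM_OVERFLOW |
                              _EM_UNDERFLOW | _EM_INEXACT;
    // A zero mask argument reads the control word without changing it.
    return (_controlfp(0, 0) & want) == want;
#elif defined(__GLIBC__)
    // fegetexcept() reports the exceptions that are currently enabled.
    return fegetexcept() == 0;
#else
    return true;  // cannot inspect the mask here; assume the default state
#endif
}

// Masks all floating-point exceptions again, e.g. after loading a DLL.
void RemaskFpExceptions() {
#if defined(_MSC_VER)
    _controlfp(_MCW_EM, _MCW_EM);
#elif defined(__GLIBC__)
    fedisableexcept(FE_ALL_EXCEPT);
#endif
}
```

A host application could call FpExceptionsMasked() right after LoadLibrary() returns to identify the offending DLL, then call RemaskFpExceptions() to recover. Newer Microsoft runtimes also provide _controlfp_s() as the preferred variant.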

VirtualDub contains a rather brute-force workaround for this problem: it wraps all calls to video codecs, audio codecs, and video filters with a pair of routines that check for and fix broken FPU/MMX state. This protects VirtualDub from having its floating-point calculations screwed up by a broken driver. It also works the other way, too: if I screw up, the FPU state will be reset before the external routine is invoked. 1.6.7 will be even more aggressive and will check for such issues whenever the primary message loop is idle.
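The general shape of such a wrapper can be sketched with the standard <cfenv> facilities. This is not VirtualDub's actual implementation: it restores the saved environment unconditionally rather than checking first, it does not touch MMX state, and whether the exception masks are part of the saved environment depends on the C library (they are on common x86 implementations). FpStateGuard, CallExternal, and MisbehavingCodec are all hypothetical names.

```cpp
#include <cassert>
#include <cfenv>

// Saves the floating-point environment on construction and restores it on
// destruction, so external code cannot leave altered state behind.
class FpStateGuard {
public:
    FpStateGuard() { std::fegetenv(&saved_); }
    ~FpStateGuard() { std::fesetenv(&saved_); }
    FpStateGuard(const FpStateGuard&) = delete;
    FpStateGuard& operator=(const FpStateGuard&) = delete;
private:
    std::fenv_t saved_;
};

// Stand-in for a driver that changes FPU state and never puts it back.
void MisbehavingCodec() {
    std::fesetround(FE_UPWARD);
}

// Runs an external routine with the caller's FPU state protected.
template <typename Fn>
void CallExternal(Fn fn) {
    FpStateGuard guard;
    fn();
}  // guard's destructor restores the saved environment here
```

Wrapping each codec or filter call in CallExternal() means a single misbehaving component can no longer affect calculations made after the call returns. Restoring the environment also resets the exception status flags to their saved values.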

Fairness doctrine

It wouldn't be fair if I just knocked Borland for this problem. While DLLs built with Visual C++ don't commit this particular sin, Microsoft has committed a far worse one in the Direct3D API. Initializing Direct3D with default settings causes the precision bits in the floating-point control word to be reset such that FPU calculations always occur with 24-bit precision (single precision / float). This is much more serious as it causes roundoff errors to become much larger, and it means that double-precision math can no longer represent all valid values of a 32-bit int. For this reason, if you invoke Direct3D within an application that may not be expecting it, such as within a video filter, you should set the D3DCREATE_FPU_PRESERVE flag when creating the device. VirtualDub does this in its display code to ensure that the accuracy of its floating-point calculations is not disturbed.
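The integer claim can be checked without touching Direct3D at all. The sketch below uses the float type as a stand-in for an FPU limited to a 24-bit significand, which approximates the effect rather than reproducing the actual precision-control change. The function names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// True if v survives a round trip through a 24-bit significand (float).
bool RoundTripsThroughFloat(std::int32_t v) {
    volatile float f = static_cast<float>(v);
    // Widen to 64 bits so the comparison cannot overflow.
    return static_cast<std::int64_t>(f) == v;
}

// True if v survives a round trip through a 53-bit significand (double).
bool RoundTripsThroughDouble(std::int32_t v) {
    volatile double d = static_cast<double>(v);
    return static_cast<std::int64_t>(d) == v;
}
```

Every 32-bit integer survives the double round trip, but with 24 bits the first failure appears at 16777217 (2^24 + 1), which rounds to 16777216.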

Comments

Comments posted:


Sadly, this has been the case in a few DirectShow filters, namely the WinDVD MPEG2 Video Decoder.

Before I realised what the problem was, I contacted InterVideo about this and they managed to find and fix the problem. The problem was that some header entry had a value of 0 and Delphi made the problem visible, while VC++ hid it so the WinDVD engineers didn't realise there was a problem.

They did fix it (after quite a few months of me bugging), but sadly, the next version had resurfaced the exact same issue. By then I realised the difference and put the code to circumvent this issue directly into Zoom Player. Still... this can be quite a headache.

Blight (link) - 13 06 05 - 19:46


In my old-forked copy of VD's crash handler (used in another application), I do:

pExc->ContextRecord->FloatSave.ControlWord |= 0x3F;
return EXCEPTION_CONTINUE_EXECUTION;

to forcibly re-mask FPU exceptions if one makes it to the default exception handler.

Glenn Maynard - 13 06 05 - 23:20


Sadly, strictly speaking you cannot simply recover by remasking the FPU exceptions and restarting the faulting instruction. The problem is that the x87 FPU doesn't signal the interrupt until the next floating-point instruction, at which point necessary information to retry the instruction is irretrievably lost. Take this instruction sequence for example:

FDIV DWORD PTR [EAX]
XOR EAX, EAX
FSTP DWORD PTR [EDX]

A divide-by-zero error here will actually result in the FSTP instruction faulting, not the FDIV. Even if you could backtrack and find the FDIV (an ultimately impossible task), you could not emulate the FDIV as the old value of EAX has been lost. Unfortunately, if it was a popping arithmetic instruction that faulted, attempting to resume can result in debris being left on the stack, causing unrelated calculations to fail.

Phaeron - 14 06 05 - 00:30


In practice, it's worked. I hit this problem when using DirectShow to decode movies in a game; some codecs would screw with the FPUCW, causing exceptions down the line. However, (long) since then I switched to using avcodec (since depending on people having sensible codecs installed is asking for endless headaches), so there may have been other related issues that I've forgotten.

Glenn Maynard - 15 06 05 - 01:51


I've been getting this error a lot recently too, though we've been using Borland tools for many years without ever encountering it... (Is it contagious? :) )

Furthermore, I have found the work-around to not always work. Sometimes, it just makes no difference at all, and other times, it seems to cause other subtle problems elsewhere. Why can't Borland just fix the bug? Is working with the FPU really THAT complex?

Jonathan Neve (link) - 14 03 06 - 11:58


Do you think this could be the root cause behind this ...?

http://bugzilla.gnome.org/show_bug.cgi?i..

Similar crashes have also affected programs such as Yahoo! Messenger and perhaps have the same root cause?

Malcolm - 19 06 06 - 17:56


A maintainer of the Delphi compiler here. Just wanted to point out (particularly to Jonathan Neve) that this exception behaviour is considered a feature, not a bug; dividing by zero or sqrt of -1, and having the results flow through your program, can make it very hard to find out what happened, hence the desire to signal a problem at the right location.

Of course, because RTL initialization is shared between DLLs and EXEs, it can have unintended side effects. But similar issues happen in Delphi programs that end up running MSVC initialization logic, which fiddles with precision or other settings.

As to the exception at the next instruction issue, Delphi outputs a WAIT (depending on version) after floating point operations to make sure the exception occurs at the right time.

Barry Kelly (link) - 15 10 09 - 10:21


That's fine within Delphi applications. That Delphi-compiled DLLs change the floating point exception mode of the loading thread is NOT okay, and is responsible for destabilizing non-Delphi applications. I consider that to be a bug.

Phaeron - 15 10 09 - 15:17


Phaeron: In case you don't realize it, this response really is from the vendor of the compilers in question. Embarcadero bought Borland's CodeGear division in 2008.

Yuhong Bao - 11 03 10 - 15:42


Which, by his response, implies there's little chance that this will be fixed. MSVC DLLs cause similar breakage in Delphi programs? Sorry, but that's their problem, and they can go fight it out with Microsoft. That's no excuse for Delphi DLLs breaking other programs, especially since MSVC is setting the default expected FP state and Delphi isn't.

To repeat: as a DLL, you do not own the thread state of the loading thread, and you do not have the right to change it.

It's also worth noting that the Windows x64 ABI forbids this behavior:
http://msdn.microsoft.com/en-us/library/..
http://msdn.microsoft.com/en-us/library/..

The exception mask is explicitly part of the non-volatile state in the calling convention.

Phaeron - 11 03 10 - 16:05


"MSVC DLLs cause similar breakage in Delphi programs? Sorry, but that's their problem, and they can go fight it out with Microsoft"
And Raymond Chen has a blog article mentioning one such problem:
http://blogs.msdn.com/oldnewthing/archiv..

Yuhong Bao - 11 03 10 - 16:09
