Friday, July 31, 2009

Numerical question in C code?

A student writes the following C code.


#include %26lt;stdio.h%26gt;


int main()


{


float a=1.000002, b=1.000001, result;


result=1./((a-b)*(a-b));


printf("Result = %g\n",result);


return 0;


}


When the code is run the ouput is as follows.


Result = 8.6875e+11





What should the correct result be? Briefly explain why the result produced by the code is so inaccurate. Suggest a way in which the code might be improved.





Is this because there is an underflow error as (a-b)*(a-b) is so small?

Numerical question in C code?
Correct answer is 1e+12





You can rewrite the first line as :





float a=1+(2e-6), b=1+(1e-6), result;





because 2e-6 = 2 * (10^-6) = 2 / (10^6) = 0.000002





so a-b = 1e-6 so (a-b) * (a-b) = 1e-12





Finally 1/((a-b)*(a-b)) = 1e+12





You can check this easily by changing the first line





float a=1.000002, b=1.000001, result;





to





double a=1.000002, b=1.000001, result;





This question is highly dependent on the platform, but assuming that the platform has a more precise double than float then here is your explanation:





The size of the floating point value does not contain enough bits to retain the precision for the calculation so the division ends up using truncated values and the result is not accurate.





Floating point values in general are less precise than double precision values. This is because they use less bits to hold the exponent and the significant part of the number than a double does.





A real easy way to see how precise your platform is :





Keep changing the precision of the numbers on the first line.





So start with





float a=1.2, b=1.1, result;





So far so good - the floating point value has enough bits to containing the required precision of the calculation.





now keep going until you get to the original code:





float a=1.02, b=1.01, result;


float a=1.002, b=1.001, result;


float a=1.0002, b=1.0001, result;


float a=1.00002, b=1.00001, result;





You can see the floating point value result is gradually losing precision because it does not use enough bits to retain the accuracy for the calculation.





In general the size of a double %26gt;= size of a float which normally means a double uses more bits for its calculations and is therefore more precise.





So the improvement is to use double instead of float.


No comments:

Post a Comment