
Source ports and fixed_t



Maes said:

To make things more difficult, the performance may change dramatically betwen PI, PII, PIII etc. and Intel/AMD/Cyrix/whatever. What did you test it on?

True. That's because (I think) EE is faster than PrBoom on demo1 for SoM (for me it is slower on a Core 2). He is using AMD, and EE uses floats extensively.

But do not worry, SoM promised me his next computer will be Intel based :)

Maes said:

Yeah, but if you start a new instruction on every cycle, on average you can expect that a full, non-stalled pipeline will also start churning one completed instruction (or more) per cycle, with some latency.

E.g. if I pump 20 different instructions into my 20-cycle pipeline, after 20 cycles of nothing I will start getting an apparent one-cycle-per-instruction throughput. And if I have superscalarity and no stalls, I can get even more. And if I keep the pipeline full, then the miraculous one cycle per instruction will be an effective reality even without a RISC architecture, and without the 20-cycle latency. Unless of course the pipeline stalls and flushes everything....


You do not need all this theory. Just compile glboom-plus with two versions of R_PointToAngleEx():

First:

angle_t R_PointToAngleEx(fixed_t x, fixed_t y)
{
  int y_viewy = y - viewy;
  int x_viewx = x - viewx;

  if (y_viewy < INT_MAX/4 && x_viewx < INT_MAX/4
      && y_viewy > -INT_MAX/4 && x_viewx > -INT_MAX/4)
  {
    return (y -= viewy, (x -= viewx) || y) ?
      x >= 0 ?
        y >= 0 ?
          (x > y) ? tantoangle[SlopeDiv(y,x)] :                      // octant 0
                  ANG90-1-tantoangle[SlopeDiv(x,y)] :                // octant 1
          x > (y = -y) ? 0-tantoangle[SlopeDiv(y,x)] :                // octant 8
                         ANG270+tantoangle[SlopeDiv(x,y)] :          // octant 7
        y >= 0 ? (x = -x) > y ? ANG180-1-tantoangle[SlopeDiv(y,x)] : // octant 3
                              ANG90 + tantoangle[SlopeDiv(x,y)] :    // octant 2
          (x = -x) > (y = -y) ? ANG180+tantoangle[ SlopeDiv(y,x)] :  // octant 4
                                ANG270-1-tantoangle[SlopeDiv(x,y)] : // octant 5
      0;
  }
  else
  {
    return (int)((float)atan2(y_viewy, x_viewx) * (ANG180/M_PI));
  }
}
Second:
angle_t R_PointToAngleEx(fixed_t x, fixed_t y)
{
  return (int)((float)atan2(y - viewy, x - viewx) * (ANG180/M_PI));
}
You can see how much more complicated the first one is than the second. It even still uses atan2 sometimes, to avoid overflows, but it is ~14% (!) faster on epic.wad map05 on my Core2 + MSVC2008 compiler. And it's not just this function: the whole timedemo is 14% faster.


And, your point, at this point?

It's obvious that a function that only does some bit shifting and some logic checking will be faster than one relying on FPU, at least on an architecture where there's so much performance disparity in the case of atan2 vs multiplications.

My initial argument was about FixedMul and FixedDiv. FixedMul is definitively slower than FMUL; FDIV, however, takes ~40 cycles. If the benchmarks don't indicate a conclusive superiority of one approach over the other (fixed point vs. full floats), then it means they're not as important to performance as I thought, at least in certain source ports on a certain platform.
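
For concreteness, this is the shape of the trade-off I mean (a rough sketch in Java-style code to match the later posts, not any port's actual source): the 16.16 fixed-point multiply needs a widening 64-bit multiply plus a shift, while the float version maps to a single hardware multiply.

// 16.16 fixed-point multiply: widening multiply plus a shift (several integer ops).
static int fixedMul(int a, int b)
{
    return (int)(((long) a * b) >> 16);
}

// Float equivalent: one hardware multiply (FMUL on x87, MULSS with SSE).
static float floatMul(float a, float b)
{
    return a * b;
}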

It would be interesting to see if the results would change more unambiguously in other OSes, CPUs (680x0, PPC, ARM etc.), pre-Pentium Intel etc.


Why do you get so hung up on this one function? It's not in any way something that even remotely defines the performance of the code, as any serious profiling would show you. Yes, of course FMUL is faster, but what does it matter if there are countless other things where floats would mean more overhead?


I would gladly see a float rewrite of Doom only to see whether the code gets simpler at all.


Not really. At least not if you restrict yourself to C. With C++ you could simplify some things by encapsulating them in classes, like vectors, for example.

The biggest advantage is that with floats you'd get rid of overflow problems - but on the other hand you'd have to be aware of the implied underflows in the fixed-point code that would no longer happen with floats. These may even need some added code to take care of them.
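
A minimal sketch of what I mean (Java-style, not code from any port): in 16.16 fixed point the product of two sub-unit values truncates to exactly zero, while the float version keeps a tiny nonzero result that code relying on the underflow would not expect.

    int a = 16, b = 16;                                      // each is 16/65536 ~= 0.000244 in 16.16 fixed point
    int fixedProduct = (int)(((long) a * b) >> 16);          // truncates to 0
    float floatProduct = (a / 65536.0f) * (b / 65536.0f);    // ~= 5.96e-8, still nonzero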


Some (good) news: I implemented a Java vs C test suite to verify the accuracy of some of the results, and so far it proved possible to produce bit-exact results with both languages.

Essentially I wrote a "float to fixed" method in both C and Java so I can set explicit float values as fixed, then compared the hex values of the results after operations such as adding and multiplying, and also carefully checked the integer and fractional parts of the results.

Surprisingly, all methods worked "as they were" in Java with practically no adaptations or cryptic masking, including overflow handling. An example:

public static fixed_t FixedMul
( fixed_t	a,
  fixed_t	b )
{
    return new fixed_t((int)(((long) a.val * (long) b.val) >> FRACBITS));
}
and
public static fixed_t
FixedDiv
( fixed_t	a,
  fixed_t	b )
{
    if ( (Math.abs(a.val)>>14) >= Math.abs(b.val))
	return (a.val^b.val)<0 ? new fixed_t(Integer.MIN_VALUE) : new fixed_t(Integer.MAX_VALUE);
    return FixedDiv2 (a,b);
}
etc.
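
FixedDiv2, which the division above delegates to, isn't shown here; a 64-bit sketch equivalent to the Boom-style C code might look like this (my own transliteration, not necessarily what ends up in the port):

public static fixed_t FixedDiv2
( fixed_t a,
  fixed_t b )
{
    // Widen to 64 bits, pre-shift the dividend by FRACBITS, then divide.
    return new fixed_t((int)((((long) a.val) << FRACBITS) / b.val));
}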

You can see how I use that "val" member, since I can't just treat fixed_t as an int with a typedef like you can in C/C++/Delphi. A getter/setter also exists; I should probably switch to that, since Java optimizes them away.

Other changes include making fixed_t implement the Comparable interface (essentially it also has an equals(Object o) method) and giving it instance methods for multiplication/division in addition to the static ones. I will probably also slip in instance/static addition/subtraction methods to do away with some of the clutter.
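
Putting those pieces together, the wrapper class looks roughly like this (a sketch with assumed names, not the port's actual code):

public class fixed_t implements Comparable<fixed_t> {
    public static final int FRACBITS = 16;
    public static final int FRACUNIT = 1 << FRACBITS;

    public int val;

    public fixed_t(int val) { this.val = val; }

    public int get()        { return val; }
    public void set(int v)  { val = v; }

    @Override
    public boolean equals(Object o) {
        return (o instanceof fixed_t) && ((fixed_t) o).val == val;
    }

    @Override
    public int hashCode() { return val; }

    public int compareTo(fixed_t other) {
        return (val < other.val) ? -1 : (val == other.val ? 0 : 1);
    }

    // Instance variant alongside the static FixedMul shown above.
    public fixed_t mul(fixed_t other) {
        return new fixed_t((int)(((long) val * other.val) >> FRACBITS));
    }
}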

BTW, addition/subtraction work just fine "directly", as in C, as long as you use the getters/setters.

E.g.
C:

    a=F2F(2.5);
    b=F2F(2.5);
        
    a=a+b-FixedMul(F2F(1.5),a);
    printf("%8X\n",(int)a);    

Java:

    a=F2F(2.5f);
    b=F2F(2.5f);
        
    a.set(a.val+b.val-(fixed_t.FixedMul(F2F(1.5f),a)).val);
    System.out.println(Integer.toHexString(a.get()));  
F2F is a function that converts floats to fixed point, implemented in both languages. I can get rid of the "fixed_t" static class reference if I do a static import in Java (I couldn't now because I was using the default package).
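
For reference, one plausible F2F on the Java side just scales by FRACUNIT and truncates toward zero, matching the (int) cast in the C version (a sketch under that assumption):

public static fixed_t F2F(float f) {
    // 2.5f -> 0x00028000, 1.5f -> 0x00018000, etc.
    return new fixed_t((int)(f * FRACUNIT));
}

With that definition both snippets above should print 14000, since 2.5 + 2.5 - 1.5*2.5 = 1.25 = 0x00014000.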

Edit: I did some quick n' dirty benchmarks timing several million FixedMul and FixedDiv operations in both Java 1.6 and the latest gcc. With GCC I could only get a very coarse msec resolution by using GetSystemTicks() from windows.h, but more or less Java's times were between 50-70% of the C ones.

Also, I tried versions of FixedMul and FixedDiv that passed the result storage object by reference, or returned a plain int, rather than returning a new object. These versions were faster by a constant amount of time, which in the case of FixedMul was about 90% of the whole computation, so the version that passed object references was actually about 10x faster and closer to the C time.
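
Such variants might look roughly like this (hypothetical names, building on the fixed_t wrapper sketched earlier):

// Caller supplies the result object, so no allocation per operation.
public static void FixedMul(fixed_t a, fixed_t b, fixed_t result) {
    result.val = (int)(((long) a.val * b.val) >> FRACBITS);
}

// Or skip the wrapper entirely on hot paths and work on raw ints.
public static int FixedMulInt(int a, int b) {
    return (int)(((long) a * b) >> FRACBITS);
}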

Actual numbers: for 1000000 random FixedMuls:

C: 15 ms (uncertainty between 15-31 due to timer coarseness)
Java: Pretty constant around 27-30 ms with object reuse, varied wildly between 29 and 478 ms with new object creations on returns!

Actual numbers: for 1500000 random FixedDivs:

C: 93 ms (again, timer coarseness means an uncertainty of +15 ms).
Java: 100 ms with new object creations, 200 ms with object reuse. Strangely, the trend got reversed here, so I repeated the FixedMuls test too:

Actual numbers: for 1500000 random FixedMuls:

C: 16 ms (again, timer coarseness means an uncertainty of +15ms and round numbers).
Java: 42-50ms with new object creations, from 50 to 500 ms with object reuse.

Probably the GC's habit of kicking in when it shouldn't is causing this discrepancy (and occasional similarity) in results. By using instance methods/object references I can avoid this completely.

So reusing objects is beneficial in any case, and I'll be using this approach in Java code, since there is no performance-efficient equivalent for C-structs.

Test hardware is a Pentium IV 3.00 GHz with hyperthreading.

If anything, that proves that at least the fixed math part is viable. Now on to the rest....
