
Dynamic wiggle/Tall Sector fix for fixed-point software renderer


kb1


fabian said:

I don't think that float math should be considered slow per-se anymore -- at least since the rise of the 486-DX ;) -- but converting back and forth between float and int types wastes CPU time, c.f.
http://stereopsis.com/FPU.html

You can always maintain both float (for rendering) and fixed (for gameplay) copies of vars, but, even so, simply having a co-processor (486DX) does not make float operations anywhere near free. The newer processors have faster math hardware, which helps, and they can do floating point simultaneously with integer math, in some cases. This is improved incrementally with each new family of processors, generally. But to take full advantage of it requires either a brilliant compiler, or some hand-written assembly. You're no longer in a place where you can just write good C code and be done with it. To implement float, and have it be as fast as possible, requires knowledge of the platform, CPU, etc, in the extreme cases.

A nice middle ground is to use float to set everything up, and do the heavy lifting (inner loops) with fixed. Then, you don't have to be quite as clever. But, as you said, fabian, you don't want to be doing a lot of conversions, unless you also tell the CPU to ignore the complexities of denormals and such (which can really slow down a conversion). And, you have to make sure your compiler is not adding stupid FP setup code every time you use a float. But, then, you're back to writing/examining CPU-specific low-level stuff.
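
To make that "float for setup, fixed for the inner loop" idea concrete, here is a minimal, hypothetical sketch (the function and its 128-pixel-tall texture assumption are illustrative, not Doom's actual column code): the reciprocal of the scale is computed once per column in float, and the per-pixel loop then runs entirely in 16.16 fixed point, with no conversions inside it.

#include <stdint.h>

typedef int32_t fixed_t;
#define FRACBITS 16
#define FRACUNIT (1 << FRACBITS)

static void DrawColumnScaled(const uint8_t *src, uint8_t *dest,
                             int count, float scale, int pitch)
{
    fixed_t fracstep = (fixed_t)((1.0f / scale) * FRACUNIT);  /* float, once */
    fixed_t frac = 0;

    while (count--)                      /* fixed point only from here on */
    {
        *dest = src[(frac >> FRACBITS) & 127];
        dest += pitch;
        frac += fracstep;
    }
}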

Don't get me wrong, there's nothing wrong with building super-fast CPU-specific branches either, but the point I'm making is that it's not as simple as just switching to float and being done with it, at least not for me. For my port, I'll probably add options to switch out fixed/float versions of the renderer functions, profile on various hardware, tweak, repeat until satisfied. After doing all that, I may find that it was not worth it, but, I'll never trust it unless I do all the profiling work first :(

And, that's what it's really all about : Me being satisfied. Honestly, that's always the final real test, right? What I DON'T want is to be in the middle of a kick-ass coop game, on a huge, detailed map, getting 5 FPS because I prevented walls from wiggling... "Oh, but those 5 frames look nice, don't they?". In the end, above all, it has to be fast.

EDIT: Excellent article you linked, about FP conversions.

VGA

You don't trust modern compilers to optimize float operations? You should write properly working and readable code, then let the compiler optimize according to the target CPU and flags you set.

Then if you have performance problems you work on the bottleneck. But this is just a hobby project so I can understand the over the top code nerdiness! :-)

fabian

I agree with you, kb1. I find it acceptable to use float math during game or level setup but try to avoid it at every cost in per-tic calculations. My reasons are manifold:

1) For purely technical reasons: Conversions between float and int are costly, so try to avoid them in time-critical calculations.

2) For "aesthetic" reasons: I find it challenging to get things done in int math. After all, that was a design decision done by the original authors of the engine and I somehow feel obliged to comply with it.

3) For pragmatic reasons: Things are actually possible in int math. See, for example, what we have already achieved by the discussions in this thread alone. If, a few months ago, someone had written here that he wanted wiggling lines and jumping long walls fixed in the Vanilla software renderer, all fingers would have pointed him to Cardboard or maybe ZDoom instead.

kb1

VGA said:

You don't trust modern compilers to optimize float operations? You should write properly working and readable code, then let the compiler optimize according to the target CPU and flags you set.

Then if you have performance problems you work on the bottleneck. But this is just a hobby project so I can understand the over the top code nerdiness! :-)

I trust modern compilers to do a decent job, most of the time. And, most of the time, decent is ok. Obviously, if you are writing a piece of software targeted for multiple platforms, you sacrifice most opportunity to tweak things. For me, writing "readable" code means writing code that the compiler can read first, the human second.

Now, specifically, for floating-point, no, I do not trust the compiler blindly, as it attempts to decide which vars to optimize the most using its built-in heuristics. You see, the compiler must always do "the right thing", and not assume anything. However, only I know if a particular function will be passed, say, a negative number, or a number within a given range.

The compiler can have some rules that it checks, but it cannot "understand" the goal of the program, and, therefore, it does not know what assumptions can be made, or what shortcuts can be taken. Hell, some math bugs may even "look good", and be desirable, in some cases. You should always, at least, take a look at the assembly output of the compiler, because it simply cannot possibly be aware of the best way to assemble the code in every case. Yeah, it's nerdy :)

fabian said:

I agree with you, kb1. I find it acceptable to use float math during game or level setup but try to avoid it at every cost in per-tic calculations.

It will be a long, long time before screen coordinates no longer fit within the bounds of an int :). As we've seen in Doom, scaling to non-int sizes can be done with ints, but it's tricky, and can have artifacts, run-on precision errors, wiggles, jagged lines, and all sorts of other considerations, which translates to a lot of debugging. If we could use floats everywhere, those issues would all but disappear. But, there's always a price to pay.

Then again, in defense of VGA's view, sometimes a good compiler can pair float instructions with integer instructions, and essentially get one or the other for free! Modern compilers can be good at this, which is what you want, cause it's a son-bitch to do manually :)

fabian

kb1 said:

For me, writing "readable" code means writing code that the compiler can read first, the human second.

Lee Killough must have felt the same. ;)

You should always, at least, take a look at the assembly output of the compiler,

Never, not even once. ;)

Quasar

Unless you start throwing in calls to stuff like atan2 like they're candy, or repeatedly convert back and forth from float to int, then in any realistic situation where the Doom renderer would be running at 5 FPS, you're going to get at most maybe 6 FPS by switching back to fixed if I had to do a rough estimate? We're talking on the scales of nanoseconds here, pretty much.

You need to understand where Doom spends the most time in its renderer in order to understand what can be done to improve it when it is under stress. It is in fact primarily cache-bound, and almost impervious to improvement through mathematical optimizations inside the critical code segments - this is particularly true for the innermost loops of the drawseg and vissprite rendering.

Improving minor operations in that code wouldn't budge EE's performance on Speed of Doom MAP28. On the other hand, adapting the locality-of-reference optimization involving caching parts of the drawsegs into a local array bought the game a full 5 FPS improvement, which is dramatic considering it made the difference between an unplayable 15 fps and 20 fps (back up near the normal running speed of the engine).

Then you have maps like Sunder MAP11, where thinker processing cost actually overtakes the renderer. No amount of float vs fixed will help you here, since the game code's already using fixed-point math.
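
For what it's worth, the locality-of-reference idea Quasar mentions can be sketched roughly like this. This is not Eternity's actual code; the cache struct, function name and pruning test are illustrative, and only drawseg_t's fields (x1, x2, silhouette, maskedtexturecol) and the drawsegs/ds_p globals come from the vanilla renderer's declarations.

/* Copy just the fields the sprite-clipping loop touches into a compact,
   contiguous array once per frame, so the loop streams through a few
   cache lines instead of striding across full drawseg_t records. */
typedef struct
{
    int        x1, x2;        /* screen columns covered by this drawseg */
    int        silhouette;    /* SIL_TOP / SIL_BOTTOM flags             */
    drawseg_t *ds;            /* back-pointer for the rare full lookup  */
} dscache_t;

static dscache_t ds_cache[MAXDRAWSEGS];
static int       ds_cache_count;

static void R_CacheDrawSegs(void)
{
    drawseg_t *ds;

    ds_cache_count = 0;
    for (ds = drawsegs; ds != ds_p; ds++)
    {
        if (!ds->silhouette && !ds->maskedtexturecol)
            continue;                    /* can never clip a sprite */

        ds_cache[ds_cache_count].x1         = ds->x1;
        ds_cache[ds_cache_count].x2         = ds->x2;
        ds_cache[ds_cache_count].silhouette = ds->silhouette;
        ds_cache[ds_cache_count].ds         = ds;
        ds_cache_count++;
    }
}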

kb1

fabian said:

Lee Killough must have felt the same. ;)

I am sure of it. I admire him for it (though some of it was a bit ridiculous). I bet he looked at the compiler's output often, which drove him to write the one-line wonders.

fabian said:

Never, not even once. ;)

It's not really necessary (or helpful) when you're designing highly-portable source. But, if you're going for a bare-metal, CPU-specific engine, you might find the compiler working against you. And, it's not all about hand-written assembly either: looking at the compiler's output may suggest that you should rearrange your source a bit. Sometimes, a simple swap of source lines makes a big difference in the efficiency of the generated code. But, you'll never know it if you don't look at the assembly code.

Quasar said:

Unless you start throwing in calls to stuff like atan2 like they're candy, or repeatedly convert back and forth from float to int, then in any realistic situation where the Doom renderer would be running at 5 FPS, you're going to get at most maybe 6 FPS by switching back to fixed if I had to do a rough estimate? We're talking on the scales of nanoseconds here, pretty much.

Now I know that's all true for modern processors - float instructions have come a long way in recent years. But I've got it stuck in my mind that floats were a lot more expensive on the earlier processors, even with math coprocessors. It's hard to let go, but I am starting to believe that I should stop worrying about supporting the old CPUs so much. Nowadays, $300 can get you a reasonably fast desktop that probably does, in fact, have nice float support.

As far as cache goes, yes, it's awful. Doom wanting to paint walls top to bottom causes per-line cache misses on each column! You're better off turning the output buffer cache off, if you can. And, the quad rendering stuff makes the renderer convoluted.

But, yes, Quasar, I think you've convinced me - I just needed to hear it from someone else: My next version will use float for all the setup, and, of course, ints in the inner loops. It was nice to have things like the WiggleFix available: when your renderer is working 99.9% of the time but you've got this one nasty bug, it's handy to be able to put in a quick fix. But now there's a half-dozen quick fixes everywhere. It's time to bite the bullet :)

Graf Zahl

Depending on the use case, fixed point is most likely even slower on modern systems. A few years back when investigating a problem in ZDoom I ended up doing some profiling of P_InterceptVector.

That function is interesting because it comes with a commented-out floating point implementation that was prohibitive to use back in 1993, so I had 3 versions of this function to play with:

- Doom's original
- ZDoom's increased precision, fixed point version
- the 'optimal but slow' floating point

and, not surprisingly (at least for me), the floating point version performed better than the 64-bit fixed point version.

It performed even better after removing the redundant divisions by FRACUNIT from it and using doubles instead of floats (to get around the inefficient code MSVC uses for strict compliance with single precision floats). In the end it was faster than the original fixed point version.

Here's my changelog for this:

- did some profiling of P_InterceptVector. Here's the results I got:
* the unaltered floating point version is 10% faster than the 64 bit integer version.
* using doubles instead of floats increases performance by another 25%.
* another 15% can be gained by manually optimizing the code.
- P_InterceptVector now uses the optimized floating point version which is almost twice as fast as the 64bit integer version.
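
To make the changelog concrete, a hedged sketch of what such a double-precision P_InterceptVector can look like follows; this is not ZDoom's actual code, just the general shape. Because the numerator and denominator carry the same fixed-point scale, the per-field divisions by FRACUNIT cancel, which is the "redundant divisions" removal mentioned above.

/* Sketch only -- divline_t, fixed_t and FRACUNIT are Doom's own types;
   the overflow handling of the real implementations is omitted. */
static fixed_t P_InterceptVector_dbl(const divline_t *v2, const divline_t *v1)
{
    double den = (double)v1->dy * v2->dx - (double)v1->dx * v2->dy;

    if (den == 0.0)
        return 0;                                /* lines are parallel */

    double num = (double)(v1->x - v2->x) * v1->dy
               + (double)(v2->y - v1->y) * v1->dx;

    return (fixed_t)(num / den * FRACUNIT);      /* back to 16.16 for callers */
}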

kb1

Graf Zahl said:

...and not surprisingly (at least for me) the floating point version performed better than the 64 bit version...

Wow, interesting. Of course, 64-bit math running in a 32-bit process is going to be slow, due to the need for multiple instructions, but, nonetheless, that's what's required to get the job done.

These results are encouraging. I must admit, I did enjoy fixing the fixed-point bugs in a fixed-point way, for a couple of reasons:

1. It's nice to know what the problem was, and fun to figure out a possible way of handling it within those constraints.

2. It was nice to be able to fix the bugs with simple solutions, without having to revamp the whole renderer.

But, now that those things have been accomplished, and I've gotten a few confirmations, I am ready to move forward with some floaty goodness. Still, I will be moving slowly, doing some real profiling, checking the assembly output, and adjusting as necessary. If there's even a chance that I'll be trading some speed for more accuracy, I want to minimize the impact.

kb1 said:

checking the assembly output


... and don't forget that x87 code and SSE2 are entirely different with slight variations in performance.

fabian

kb1 said:

I bet he looked at the compiler's output often, which drove him to write the one-line wonders.


As has been said before, there's quite a difference if you are coding for one specific compiler, one specific OS and one specific CPU. Killough could be sure that his code runs well when compiled with Watcom for DOS on an i486. Nowadays, however, source ports are expected to get compiled with Clang and then run on 64-bit Linux on an ARM system. The compiler output of MSVC on a 32-bit Windows XP machine running on an Athlon CPU doesn't really tell you anything about your code quality then. ;)


... and I don't think that it comes as a surprise when I say that Killough's code mess creates significantly worse object code than something clean on most modern compilers.

These days it's completely and utterly counterproductive to optimize source for one compiler. Just switch to another system and all these 'improvements' will come bite you in the ass.

What I have found is that clean, human-readable code tends to produce good results everywhere. You may squeeze a bit more out of each compiler by doing special optimizations, but each compiler is different and needs different handling.

And now consider the Doom source port situation, where cross-platform code is the norm: there, it's completely the wrong approach.

fabian said:

Killough could be sure that his code runs well when compiled with Watcom for DOS on an i486.

GCC actually. 2.something, IIRC. Utterly ancient by any standard today.

wesleyjohnson

I have consistently checked for smallest code while coding DoomLegacy.
This may be of interest:

- Compiler optimizes most simple code adequately, or better than combined one liners. Assignment within an IF is not necessary.
Assembly has not proven to be faster any more (DoomLegacy still has both).
- Compiler cannot optimize repeated use of indexing, especially if the index itself is an expression. Changing this code to use a ptr will reduce size considerably (see the sketch after this list).
- Compiler cannot optimize common sub-expressions, if one is in conditional code. If you want to save executable you have to optimize this yourself.
- Simply loading a value to a local var, doing the calculations, and then at the end of the function storing the value back to the source, has often reduced code size considerably.
- Adding vars costs.
- Adding separate var declare scopes costs even more. Combine them into one, preferably at the start of a function, if you can stand it.
Declaring a single var local to an IF will cost you.
- Making a separate var for a separate issue can help the compiler when it reduces the scope of a var. Otherwise it thinks one use of 'val' may be related to a later use of 'val', and so it preserves the value.
- Pushing rare error conditions to the end of a function, and having the test do a GOTO the rare code, keeps rarely executed code from being loaded to your program cache. The render loop is not tight enough to keep all the functions in cache.
- Function calls are cheap, if they have no parameters.
Passing the parameters costs more than the function call.
- Adding a function is expensive. It needs to be used more than once, and be at least 1/2 page, or involve a loop, to really pay off.
However, the need to reduce complexity often overrides this, so the code gets a little larger.
- I have caught the compiler adding 3K to executable size simply for exiting a function early. This happened whether it was done using an IF or a GOTO. I must assume this is some compiler bug. Repeating the function call and return statement at the error point gave a program that was 3K smaller.
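
As a small, hypothetical illustration of the indexing-vs-pointer point above (the span_t type and both functions are invented for the example; whether a given compiler actually needs the help is exactly the kind of thing that has to be checked):

typedef struct { int top, bottom; } span_t;

/* Indexed form: the address of spans[first + i] is recomputed from the
   expression-valued index on every iteration. */
int CountPixels_indexed(span_t *spans, int first, int n)
{
    int i, total = 0;
    for (i = 0; i < n; i++)
        total += spans[first + i].bottom - spans[first + i].top;
    return total;
}

/* Pointer form: the base address is computed once and then only
   incremented, which weaker optimizers handle much better. */
int CountPixels_ptr(span_t *spans, int first, int n)
{
    span_t *p = &spans[first];
    int i, total = 0;
    for (i = 0; i < n; i++, p++)
        total += p->bottom - p->top;
    return total;
}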


I think Lee's goals were a bit different than his team's goals. The team wanted a clean, bug-fixed, portable code base, which was to serve as a baseline for enhancements to be added. But Lee wanted a bug-fixed, fast DOS executable, with enhancements that he personally liked.

There's nothing wrong with either approach, and with care you can have both. My philosophy is as follows:

1. Build the cleanest, most portable code base possible.
2. Bug fix it like crazy.
3. Clean it again.
4. Rinse and repeat, from step 2.
5. Once you're satisfied that you have a highly-stable code base, switch yourself into "optimization mode":

a. Check out your compiler's assembly output, and look for places where the compiler goofs up (cause it does, occasionally, no matter how good it is). In this phase, look for opportunities to rearrange your "clean" functions so they are still just as clean, but now the compiler does a better job.

b. Profile. Determine which functions are the bottlenecks. Do this while playing various types of maps (maps with tons of monsters, then maps with lots of sectors/subsectors, etc).

c. Only when you're sure you've found a bottleneck, should you consider adding CPU-specific versions of the base functions. Create a set of conditional compilation directives (#ifdef's). Figure out the visually cleanest way to incorporate multiple versions of functions into your files.

d. Profile the new functions, to make sure they behave identically, and that they are actually faster.


My basic point is that there is nothing wrong with writing CPU-specific optimizations, so long as they are treated as add-ons. Maintain a nice, clean, portable codebase at all times. Only then should you consider adding separate, conditional optimized replacements for base code. These add-ons do not have to be quite as clean.

The only big caveat is that, once you add those optimized code blocks, you must carry them along for the life of the program, keeping them working identically to their base code counterparts. In fact, they may lose their edge and no longer be optimal, if the base code changes enough. Optimizations hinder maintainability, which is an additional "cost" on top of the fact that you have to carry around 2 or more versions of a function. So the question becomes: Is this optimization worth the effort, now and later?

wesleyjohnson said:

I have consistently checked for smallest code while coding DoomLegacy.
This may be of interest:

- Compiler optimizes most simple code adequately, or better than combined one liners. Assignment within an IF is not necessary.
Assembly has not proven to be faster any more (DoomLegacy still has both).
- Compiler cannot optimize repeated use of indexing, especially if the index itself is an expression. Changing this code to use a ptr will reduce size considerably.
- Compiler cannot optimize common sub-expressions, if one is in conditional code. If you want to save executable you have to optimize this yourself.
- Simply loading a value to a local var, doing the calculations, and then at the end of the function storing the value back to the source, has often reduced code size considerably.
- Adding vars costs.


Yes, these are all common high level optimizations that have been known for decades.

wesleyjohnson said:

- Adding separate var declare scopes costs even more. Combine them into one, preferably at the start of a function, if you can stand it.
Declaring a single var local to an IF will cost you.


These I can not confirm, at least not with Microsoft's compiler. It's obvious that if there are too many variables in scope at the same time it may not be able to merge them, but my experience is that variables in mutually exclusive scopes will not gobble up registers for nothing.

wesleyjohnson said:

- Pushing rare error conditions to the end of a function, and having the test do a GOTO the rare code, keeps rarely executed code from being loaded to your program cache. The render loop is not tight enough to keep all the functions in cache.


Heh, yes. Of course this common wisdom will make many coding purists go ballistic (Noooooo, GOTO is the root of all evil - what a load of bullshit... :D)
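
A tiny example of that pattern, for anyone who hasn't seen it (the wrapper itself is hypothetical; W_ReadLump, numlumps and I_Error are Doom's): the hot path falls straight through and returns, while the rarely taken error handling sits after the return, so it doesn't share cache lines with the code that actually runs.

static int ReadLumpChecked(int lumpnum, void *dest)
{
    if (lumpnum < 0 || lumpnum >= numlumps)
        goto bad_lump;                 /* rarely taken */

    W_ReadLump(lumpnum, dest);         /* normal, hot path */
    return 0;

bad_lump:                              /* cold code, out of the way */
    I_Error("ReadLumpChecked: bad lump %d", lumpnum);
    return -1;
}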

wesleyjohnson said:

- Function calls are cheap, if they have no parameters.
Passing the parameters costs more than the function call.
- Adding a function is expensive. It needs to be used more than once, and be at least 1/2 page, or involve a loop, to really pay off.
However, the need to reduce complexity often overrides this, so the code gets a little larger.


Depends. Modern compilers can auto-inline small functions and there's many situations where calling a function is better than having it all inside the function - if only for better register allocation.
But in general it's true - large quantities of small functions can really have a negative impact on performance. I once knew a guy who made every C++ accessor method non-inlined and his code suffered badly for it, not only performance wise, it was also a nightmare to debug.

wesleyjohnson said:

- I have caught the compiler adding 3K to executable size simply for exiting a function early. This happened whether it was done using an IF or a GOTO. I must assume this is some compiler bug. Repeating the function call and return statement at the error point gave a program that was 3K smaller.


That can happen if a function needs to run some costly destructors. Many compilers just cannot efficiently merge such cases of expensive function epilogue code.

wesleyjohnson said:

- I have caught the compiler adding 3K to executable size simply for exiting a function early. This happened whether it was done using an IF or a GOTO. I must assume this is some compiler bug. Repeating the function call and return statement at the error point gave a program that was 3K smaller.

Please don't mix up object code size and code performance. There is a reason why optimizations like "-funroll-loops" exist.

wesleyjohnson said:

I have consistently checked for smallest code while coding DoomLegacy.

Pretty sure you're wasting your time.

wesleyjohnson

Yea, this may be 'old hat' to some of you and too obvious. I don't know exactly who may find it interesting. I try to keep it to one page, and have to leave out much of the obvious.

Doom Legacy 4.25 is C code, so it cannot be destructors on that 3K size gain.
There were three function calls at the end of the function, and I wanted my error exit to just call the last one. Doing that with an IF statement made the code grow by 3K. Doing it with a GOTO made the code grow by 3K. I could not identify anything to even suspect; there was nothing that it could be destroying. If interest is high enough I could look at a dump of the assembly.
The routine is not important enough to warrant more tuning.
I left it with duplicated code instead.

I am using GCC. I expect Microsoft optimizations could be similar, but the results could be wildly different.

There are cases where I do know that unrolling a loop will increase speed. DoomLegacy has unrolled loops in its drawing routines.
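
For illustration, a manually unrolled column copy in that spirit (hypothetical, not DoomLegacy's actual drawing code): four destination pixels are written per iteration, cutting the loop-control overhead to a quarter, with a small remainder loop for the leftovers.

#include <stdint.h>

static void CopyColumn(uint8_t *dest, const uint8_t *src, int count, int pitch)
{
    while (count >= 4)                 /* unrolled by four */
    {
        dest[0]         = src[0];
        dest[pitch]     = src[1];
        dest[pitch * 2] = src[2];
        dest[pitch * 3] = src[3];
        dest += pitch * 4;
        src  += 4;
        count -= 4;
    }
    while (count--)                    /* 0-3 leftover pixels */
    {
        *dest = *src++;
        dest += pitch;
    }
}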

With every new compiler there are new claims of how well it can optimize. I have to check what actually works. I keep finding these
cases where it has not optimized something simple, that I thought it would.

One of my concerns is avoiding code bloat. Every new feature adds more to the code size, and Doom Legacy has enough features for this to be a concern. The easiest and fastest check on code bloat is the executable size. If I add a little feature the executable should grow by only 300 to 500 bytes, or even 1K to 2K for a major feature. More than this is an indication that something has gone wrong.
I usually find the problem and create something much more efficient.

Code size also directly affects speed, in that repeatedly loading the executable into cache often dominates execution time. This is usually true of code that executes all over the place. Except for the draw loops, the rest of the Doom code thrashes the cache. For that code, reducing the size helps keep program execution time from bogging down. It is also much quicker to check size than to do a timed trial.
Any un-executed code that can be kept out of the cache will speed the program. It has little cost, even if code has to be duplicated to achieve separation.

You have to recognize that code growth in areas outside the main rendering loops is not very important to rendering speed; there it is a matter of aesthetics. Init, loading, and other setup code can bloat. Even though the total code size increases, pushing code to setup routines is usually a positive overall gain. This is obvious stuff that I should not have to mention.

This technique must have some benefit. Last I heard, Doom Legacy had the top rendering frame rate. But as far as I am concerned this just represents my current findings, which you may find interesting, or not. They probably will change with the next compiler upgrade, or some different compiler settings.

wesleyjohnson said:

Yea, this may be 'old hat' to some of you and too obvious. I don't know exactly who may find it interesting. I try to keep it to one page, and have to leave out much of the obvious.

Doom Legacy 4.25 is C code, so it cannot be destructors on that 3K size gain.
There were three function calls at the end of the function, and I wanted my error exit to just call the last one. Doing that with an IF statement made the code grow by 3K. Doing it with a GOTO made the code grow by 3K. I could not identify anything to even suspect; there was nothing that it could be destroying. If interest is high enough I could look at a dump of the assembly.
The routine is not important enough to warrant more tuning.
I left it with duplicated code instead.


Are you sure the code gets added to that particular function? It may be that this just triggered a page overflow in the executable, which required adding another full page of file space.

Older Windows compilers had this value at 4k; today it's 512 bytes if I am not mistaken, but there are other factors coming into play as well, like segment alignment and other obscure necessities to allow memory mapping the EXE.

wesleyjohnson

That was the first thing I suspected. But tests of other patches would not trigger it. Not even adding a few variables, just to test that. The latest patch seems to have triggered it again, so it seems inevitable.

Looked at the assembly.
The function is virtually identical before and after. The optimizer is making the same change that I am. But when I do it I get an approx. 3K hit. What it does do is move some functions around so the function order is different than in the source code. The static variables move closer, but something else in the code moves out. I cannot tell because every assembly line after the modified function has changed due to functions being rearranged.

That means the 3K bump must be overhead, or coincidental to rearrangement. The optimizer response to small changes is not nicely behaved, and I have encountered a corner case.

Other things I noticed.

Static functions appear to be packed, but externally visible functions are aligned on 16-byte addresses using NOPs and have some table after their first Return. Apparently, it is the externally visible functions that are larger than would be expected, but at least not in a way that would be expensive in execution time.

The optimizer is moving small IF bodies that end in Return, out past the return. This is the same optimization that I mentioned for Rare errors, but less discriminating. It also moves some other IF bodies out there.

The optimizer is not catching obvious optimizations. It does not notice that the function call cleanup is identical in most calls. Even when it has optimized to share the Return code, it misses the identical duplicated instructions right before that Return.
It also cleans up the stack after a function call and cleans up the stack for Return separately, even when they both simply ADD to the SP.

Enough.

wesleyjohnson said:

...
The optimizer is moving small IF bodies that end in Return, out past the return. This is the same optimization that I mentioned for Rare errors, but less discriminating. It also moves some other IF bodies out there.

The optimizer is not catching obvious optimizations. It does not notice that the function call cleanup is identical in most calls. Even when it has optimized to share the Return code, it misses the identical duplicated instructions right before that Return.
It also cleans up the stack after a function call and cleans up the stack for Return separately, even when they both simply ADD to the SP.

Enough.

Returns are predicted separately and cached by modern CPUs. I imagine your compiler knows that, and that's why it sets things up as it does. Messing with the SP incorrectly, JUMPing into functions so they return to a previous CALL, POPing your own return value off the stack, etc, generally can really screw this up, and, a mis-predicted return will really slow things down.

Unfortunately, some obvious optimizations are unobviously non-optimal these days. Duplicating those returns and adding that 3k is probably your compiler's way of avoiding some nasty pipeline flushing issues. You'd really have to spend a lot of time building lots of hand-assembled test cases to find out exactly what's going on. But, I understand your frustration: I used to be able to optimize assembly size and speed with some real skill, but, these days, it's non-intuitive. 99% of most any program is best left to the compiler. I'd suggest letting it add the padding, the duplicate blocks, etc, 'cause there's probably a good reason for it. You'd hope the compiler writers did a bunch of that testing.

Having said all that, still, sometimes the compiler just screws it up, especially in time-critical sections. There, you'll benefit from some hand-written assembly. But, it would be wise to learn some tricks from the ugly compiler output when doing so, otherwise your super-fast column renderer will flush the cache or pipeline on exit!

wesleyjohnson

My guess was that it optimized at a higher level (common expression), before it broke the CALL into assembly instructions, and then did not do common instruction optimization. There was already a JMP to the common cleanup and return instructions, so the pipeline was going to get flushed anyway. And, it would only have had to target the JMP to two instructions earlier in the same sequence.

There will always be stuff that hand optimization can do better than the heuristics used by the compiler. The problem is guessing what that is and how easiest to accomplish it.

kb1

wesleyjohnson said:

My guess was that it optimized at a higher level (common expression), before it broke the CALL into assembly instructions, and then did not do common instruction optimization. There was already a JMP to the common cleanup and return instructions, so the pipeline was going to get flushed anyway.

Not necessarily. If predicted properly, there will be no flush, and it will run at full speed.

What I was describing were the hacks that people sometimes write. For example, say you are at the end of function A, that needs to call function B, and then return.

Instead of writing CALL Func_B; RET; sometimes you'll see JMP Func_B;
Once Func_B finishes, it will return to the main program, bypassing Func_A. That kind of "optimization" screws up the return stack. Doing SP arithmetic can screw it up too.

Point being, the compiler writers are supposed to avoid certain "tricks" that are known to cause problems. And, sometimes they appear to waste space, or duplicate things, which may actually help execution time. But, it's hard to tell.

A general rule is to profile, and find the bottleneck. Then, you can always try out an optimization, and re-profile, to see if it helps. If it's an inner-loop that gets executed a lot, that's a function that should be examined in detail. For Doom, I'd guess that line-of-sight, renderer setup, draw functions, and blit code are the most-likely, and most-useful areas to be concerned with, and probably offer the most bang for your buck.

Another clue is when you're looking at the assembly, and it's shifting data out of registers, into memory, on the stack, reading data via indexing, moving it back to registers, etc - that's probably a routine that could be rewritten to handle data in a more-efficient manner. Compilers have a hard time choosing which variables to make registers, so you end up with this data shifting garbage which can really add up.

Graf Zahl

kb1 said:

A general rule is to profile, and find the bottleneck. Then, you can always try out an optimization, and re-profile, to see if it helps. If it's an inner-loop that gets executed a lot, that's a function that should be examined in detail.


Indeed. 95% of all code has no need for optimization because it only adds up to 1% of overall run time. Why even bother to speed any of this up by 10%? That'd be tiny peanuts.

kb1 said:

For Doom, I'd guess that line-of-sight, renderer setup, draw functions, and blit code are the most-likely, and most-useful areas to be concerned with, and probably offer the most bang for your buck.


Correct. Line of sight is one critical part, especially since the algorithm in Doom is rather poor - the old 1.2 version is a lot faster and scales a lot better to larger maps because it doesn't have to traverse the entire huge BSP for these maps (you just need to fix the corner bug.)

Another part where I could save quite a bit of time in GZDoom was the clipper. Unlike software rendering, this requires far more precise angle calculations, so R_PointToAngle2 is useless - but using atan2 is quite a bit slower. After a bit of research I found an approach using 'pseudo angles' which calculate just as fast as R_PointToAngle2 but don't translate 1:1 to real angles. For the clipper they are fully sufficient, though.
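
The general shape of a pseudo angle, sketched from the description above (this is not GZDoom's actual implementation): the result is not an angle, but it increases monotonically with the true angle of (dx, dy) around the full circle, which is all an ordering-based clipper needs - and there is no atan2 call.

#include <math.h>

/* Assumes (dx, dy) is not the zero vector. */
static double PseudoAngle(double dx, double dy)
{
    double p = dx / (fabs(dx) + fabs(dy));   /* in [-1, 1] */
    return (dy < 0.0) ? 3.0 + p : 1.0 - p;   /* in [0, 4), monotone in angle */
}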

Both cases show one thing: if you want to do some optimization that really counts, it's rarely about optimizing code; you're far more likely to get mileage by rethinking the algorithm the code uses.

I personally consider the kind of micro-optimization which wesleyjohnson describes a complete waste of time; it rarely does anything good, but it really can hurt the readability of the code.

What's the point anyway if you have to target 4 CPU architectures these days (PowerPC, x86-32, x86-64 and ARM)? What's good on one architecture may be deadly on another and even have a negative impact on the next generation of the architecture it was optimized for.

kb1 said:

Another clue is when you're looking at the assembly, and it's shifting data out of registers, into memory, on the stack, reading data via indexing, moving it back to registers, etc - that's probably a routine that could be rewritten to handle data in a more-efficient manner. Compilers have a hard time choosing which variables to make registers, so you end up with this data shifting garbage which can really add up.


That's often easier said than done, and the question still remains whether the code in question is actually worth it. This rarely adds up unless the code consists of loops with complex content and if you ask me, that's a design style that just asks for problems.


@Graf:

RE: Pseudo angles:
Interesting. Like faking a sine wave with a multi-point line? Yes, atan2 is awful.

RE: micro-optimization, and multiple architectures
Not necessarily useless, but I do see your point. It all depends on how much dedication you have to sticking with it, now and in the future. With the proper approach, you can do CPU-specific optimizations on a CPU-by-CPU basis.

In other words, you can run a different code path, based on which CPU you're compiling for. Basically, you have your 'base code' that is written to work reasonably well on any architecture. Then, if you have a CPU-specific optimization, you can build a special routine targeted to that architecture. (#ifdef X686 ...). You have to build separate executables for each architecture anyway, so, why not?
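
As a trivial sketch of that pattern (the macro and the variant names are hypothetical, not from any particular port), the portable base version always exists and is what every other architecture builds; the tuned variant is only selected when its define is enabled:

#ifdef OPT_X86_SSE2
  #define R_DrawColumn R_DrawColumn_SSE2   /* CPU-specific, hand-tuned path */
#else
  #define R_DrawColumn R_DrawColumn_C      /* portable base code */
#endif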

It can bloat your code, and you will have to maintain all versions of each function now.

There's nothing wrong with this approach. For example, if I ever get around to releasing my source port, I do want it to be able to run on multiple architectures. But, you can be sure that it will certainly run great on my home computer! I will optimize the code to take advantage of what I personally have available. But, those optimizations will be inside #ifdefs, so they do not have detrimental effects on other architectures.

But, again, and in Graf's defense: It is extra work, and must be done carefully to prevent detriment on other potential systems. However, if done logically and properly, it's not that much more work. You just have to be careful and mindful.
