
FastDoom: DOS Vanilla Doom optimized for 386/486 processors



Some fun testing, the idea proposed by @Optimus (avoid re-setting the video plane in potato mode) is extremely effective for the Cyrix MediaGX SoC (random ingame scene, unlimited framerate, Neoware Capio 600, sound and music enabled):

 

Low quality: 16 fps

Medium quality: 29.8 fps (86% faster)

Potato quality: 144.1 fps (800.6% faster)

 

The OUT instruction (outp in C) literally kills performance on this SoC; maybe the Rendition Verité has exactly the same problem. It should be possible to modify the renderer to draw the scene plane by plane, which should increase performance, but I don't know how complex that would be to develop.
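To put a rough number on the plane-by-plane idea, here is a minimal sketch in plain C (no hardware access) that only counts how many plane selections each drawing order would need for one row of columns. It assumes the usual unchained Mode-X layout, where pixel x lives in plane x & 3 at byte offset y * 80 + (x >> 2), and that each plane change costs one write to the Sequencer Map Mask register. The function names are made up for illustration and are not FastDoom code.

```c
/* Unchained VGA "Mode X" addressing for a 320-pixel-wide screen. */
unsigned modex_plane(unsigned x)              { return x & 3u; }
unsigned modex_offset(unsigned x, unsigned y) { return y * 80u + (x >> 2); }

/* Plane selections needed when columns are drawn left to right. */
unsigned plane_writes_column_order(unsigned width)
{
    unsigned writes = 0, current = 0xFFu;   /* 0xFF = no plane selected yet */
    for (unsigned x = 0; x < width; x++) {
        unsigned p = modex_plane(x);
        if (p != current) { current = p; writes++; }
    }
    return writes;
}

/* Plane selections needed when every column of plane 0 is drawn first,
   then every column of plane 1, and so on. */
unsigned plane_writes_plane_order(unsigned width)
{
    unsigned writes = 0, current = 0xFFu;
    for (unsigned p = 0; p < 4; p++)
        for (unsigned x = p; x < width; x += 4)
            if (p != current) { current = p; writes++; }
    return writes;
}
```

For a 320-column view the first order changes plane on every column, while the second changes it four times per frame, which is why port writes would matter so much on hardware where each one is expensive.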

 


Lmao, that SoC is in a ton of old thin clients, and it's not a fully physical VGA card to begin with (Cyrix XpressGRAPHICS virtualizes certain VGA calls, being one of the first IGP implementations of this kind). I am surprised by the performance uptick for Potato, however; that's neat.

I have that exact thin client

 

I am running a lightweight Linux on it, but how well does it work in DOS with Doom?

I will be sure to try this optimized Doom on it.


Vanilla Doom and FastDoom don't perform very well on the Neoware Capio 600 (Cyrix MediaGX 300MHz), usually 15~25 fps on high detail. This is due to the use of VGA Mode-X (which the hardware doesn't support natively and has to emulate), but you can use MBF 2.04 as an alternative. It uses VESA graphics modes (even with a linear framebuffer) and never drops under 35 fps. Sound and OPL music are fully supported on Vanilla Doom, FastDoom and MBF.

  • 3 weeks later...

New release, FastDoom 0.7 Release Candidate 2. Minor changes, mostly bugfixes.

  • Fixed some command line parameters not working as expected
  • Vsync option now is stored correctly
  • Faster melting screen
  • Faster flat drawing functions (unrolling by 4)
  • More ingame code optimizations
  • Fixed issue #14 (Cacodemon bites instakill, even with IDDQD enabled; only happened in FastDoom 0.7 RC1)

https://github.com/viti95/FastDoom/releases/tag/0.7_RC2

4 hours ago, VGA said:

A DOS program with a Vsync option?

 

Every write to any address mapped to the DOS VGA framebuffer is sent directly to video memory, but it's still not every cycle. It cannot draw faster than the screen refreshes, so perhaps it's to prevent flickering between drawing the walls and flats and sprites on low-end hardware?

There is page flipping between 3 framebuffers in VRAM (as there always was in the original DOS Doom), and that would prevent flickering anyway.

In FastDoom, I think turning on Uncapped Framerate and then enabling Vsync will lock the game at 70 fps, which is the physical refresh rate of VGA (although I am not sure whether the gameplay still updates at half that rate; it does not seem any smoother than 35).
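As a rough illustration of the three-page flipping mentioned above, here is a tiny sketch in C. The 0x4000-byte page spacing is an assumption on my part (it is enough room for one 320x200 Mode-X page per plane), and the names are hypothetical rather than taken from the FastDoom source.

```c
/* Assumed layout: three pages at VRAM offsets 0x0000, 0x4000 and 0x8000. */
#define VRAM_PAGE_BYTES 0x4000u
#define VRAM_PAGE_COUNT 3u

/* Returns the offset of the page to draw into next, wrapping after the third. */
unsigned next_draw_page(unsigned page_offset)
{
    page_offset += VRAM_PAGE_BYTES;
    if (page_offset == VRAM_PAGE_BYTES * VRAM_PAGE_COUNT)
        page_offset = 0;
    return page_offset;
}
```

The renderer always draws into a page that is not being displayed and the display start address is moved once the frame is complete, so a half-drawn frame is never shown.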

  • 3 weeks later...

Hi! Time for the final version of FastDoom 0.7. I've been really busy in life for the last months and now I have more time to develop FastDoom (or other projects, who knows ^^), so here it is. This is the full changelog:

  • Added Stereo OPL2, OPL3 and Stereo OPL3 music support (Adlib uses basic OPL2, Sound Blaster tries to detect the best possible option)
  • Added Disney Sound Source (also Tandy Sound Source) sound support. You have to manually set "snd_sfxdevice" variable in the "default.cfg" file to 12 (Disney Sound Source) or 13 (Tandy Sound Source). There are three new command line options to force the parallel port in case auto-initialization process doesn't work ("-LPT1" -> port 3bc, "-LPT2" -> port 378, "-LPT3" -> port 278). COVOX / LPT DAC is not supported.
  • Fixed Gravis Ultrasound music support
  • Added profiling support in the makefile (Intel Pentium processor required)
  • Added option that forces 8 bit audio mixing instead of 16 bit (-8bitsound)
  • Fixed potato invisible column renderer (issue #2)
  • Fixed chainsaw incorrect behaviour (issue #9)
  • Fixed all items respawn when loading a savegame (issue #10)
  • Fixed Arch-Vile fire spawned at the wrong location (original Vanilla Doom blatant error)
  • Fixed Chaingun makes two sounds firing single bullet (another Vanilla Doom error, fixed as there is no multiplayer support)
  • Fixed invulnerability sky colormap
  • Fixed issue #14 (Cacodemons can instakill you with a single bite in 0.7 RC1, even with god mode)
  • Fixed issue #16 (Video garbage being drawn outside the game window area, only in 0.7 RC2)
  • Fixed some command line parameters not working as expected
  • Removed IDMYPOS cheat
  • Added VSYNC support (-vsync)
  • Added new command line parameters that disable some optimizations (bypassing the saved configuration, made for benchmarking): -normalsurfaces, -normalsky, -normalshadows, -normalsprites, -stereo, -melt, -capped, -novsync, -nofps.
  • All new options finally are saved in default.cfg
  • As always, added more optimizations (mostly ASM optimized multiplications, divisions and modulo, faster screen melting code, faster cheats detection, optimized sound code, faster potato mode [thanks @Optimus6128 !!], faster flat drawing functions)

https://github.com/viti95/FastDoom/releases/tag/0.7

 

About the new VSYNC option: it's possible to wait for the VGA screen retrace and then draw the whole scene. I'm using the same ASM code as Duke Nukem 3D. Vanilla Doom only does this in-game when the VGA palette needs to change (when being hurt, for example), because some old video cards draw garbage on the screen between refreshes. But it's possible to do it whenever the screen updates, and that is what the FastDoom VSYNC option enables. This allows the following scenarios:

 

  • VSYNC OFF, Uncapped framerate OFF: 35fps maximum, screen tearing can occur but the chances are reeaaaaally small. Old VGA cards will show garbage on palette changes.
  • VSYNC ON, Uncapped framerate OFF: 35fps maximum, no tearing at all.
  • VSYNC OFF, Uncapped framerate ON: Unlimited fps, the sprites will tear when the framerate is higher than 70 (I don't know exactly why this happens). As @Optimus said, the frames aren't interpolated, so it will look exactly the same as the 35fps mode. For the next version I'll try to implement interpolation.
  • VSYNC ON, Uncapped framerate ON: 70fps maximum, no tearing at all.
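
For readers who have not seen a retrace wait before, here is a minimal sketch of the usual polling loop, written in C rather than ASM. It is not the Duke Nukem 3D routine. It relies on bit 3 of the VGA Input Status #1 register (port 0x3DA) being set during vertical retrace; the port reader is passed in as a function pointer so the loop can be exercised without hardware, and fake_inp is a purely hypothetical stand-in for the real port read.

```c
#include <stdint.h>

#define VGA_INPUT_STATUS_1 0x3DAu
#define VRETRACE_BIT       0x08u

typedef uint8_t (*port_read_fn)(uint16_t port);

/* Waits for the start of the next vertical retrace and returns how many
   polls it spent waiting. */
unsigned wait_for_vsync(port_read_fn inp)
{
    unsigned polls = 0;
    /* If a retrace is already in progress, let it finish first... */
    while (inp(VGA_INPUT_STATUS_1) & VRETRACE_BIT)
        polls++;
    /* ...then wait for the next one to begin. */
    while (!(inp(VGA_INPUT_STATUS_1) & VRETRACE_BIT))
        polls++;
    return polls;
}

/* Hypothetical test double: reports "in retrace" for 2 reads out of every 10. */
unsigned fake_clock;
uint8_t fake_inp(uint16_t port)
{
    (void)port;
    return (fake_clock++ % 10u) < 2u ? VRETRACE_BIT : 0u;
}
```

On real hardware the reader would be the compiler's port input routine. Calling this once per frame before flipping pages is what caps the uncapped mode at the monitor's refresh rate.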
  • 1 month later...

Sorry for not noticing this sooner. The main goal of ZokumBSP was to tune maps to make them smaller and have fewer visplanes, to enable bigger maps within the vanilla engine. Producing balanced BSP trees for fast rendering was not one of the goals. This might explain the minor difference in speed seen in the benchmarks.

You can tune the nodebuilder for focusing on reducing segs or subsectors or a combination of them (default). If your map produces a VPO, going for the subsector reduction might "save" it. I also recommend using the wide mode for final builds for that tiny bit of extra performance you can get. This might take minutes, hours, days or weeks depending on the map complexity. I have never run benchmarks for rendering speed, so you might find a combo of settings that do produce faster rendering.

As for blockmap, there is a switch to specifically instruct ZokumBSP to produce id-compatible blockmaps. It should perform exactly like the ones doombsp generated, but take less space due to compression. I tested it on Henning Skogstø's ~45 minute Doom 2 nm run until it played back perfectly and then on a range of other demos to check. No desyncs. As far as I know, no other tool apart from maybe idbsp can do this feat. If there's a difference it's most likely due to floating point precision and determining whether a diagonal line is inside a block or not.

You will however still get desyncs IF you change the nodes, but they will be much more rare. When diagonal lines are cut they can lead to vertex integer round-offs that can slightly enlarge or shrink some sectors. In some rare cases monsters can be in one subsector in one set of nodes, and in another subsector and different sector in another. This could change monster/player height and visibility, leading to desyncs. This is an extremely rare occurrence, maybe only theoretical, but it will most likely show up sooner or later in a demo. If you want once-in-a-million rare desyncs and problems, rebuilding nodes is the way to go :)

I know of a few easy ways to optimize the engine. The first thing I would go for would be to do sidedef compression. This could lower the amount of memory needed and should make lookups faster. None of the original maps have this compression, but many tools and editors do it as part of the build process these days so it could be that gains in pwads would be low. There are other tricks you can do to reduce data structures along these lines. I plan to add some of them in the next version of ZokumBSP.
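
To make the sidedef compression idea concrete, here is a minimal sketch in C of the basic step: identical sidedefs are merged and a remap table says which surviving sidedef each old index now points to. The struct follows the on-disk sidedef layout; compress_sidedefs and the O(n^2) search are only illustrative and are not how ZokumBSP or any particular editor does it.

```c
#include <string.h>

/* Map sidedef as stored in a WAD. */
typedef struct {
    short textureoffset, rowoffset;
    char  toptexture[8], bottomtexture[8], midtexture[8];
    short sector;
} sidedef_t;

int same_sidedef(const sidedef_t *a, const sidedef_t *b)
{
    return a->textureoffset == b->textureoffset
        && a->rowoffset == b->rowoffset
        && a->sector == b->sector
        && memcmp(a->toptexture, b->toptexture, 8) == 0
        && memcmp(a->bottomtexture, b->bottomtexture, 8) == 0
        && memcmp(a->midtexture, b->midtexture, 8) == 0;
}

/* Packs the unique sidedefs to the front of sides[], stores the new index of
   every old sidedef in remap[], and returns the number of unique sidedefs. */
int compress_sidedefs(sidedef_t *sides, int count, int *remap)
{
    int unique = 0;
    for (int i = 0; i < count; i++) {
        int j;
        for (j = 0; j < unique; j++)
            if (same_sidedef(&sides[i], &sides[j]))
                break;
        if (j == unique)
            sides[unique++] = sides[i];   /* first time this sidedef is seen */
        remap[i] = j;
    }
    return unique;
}
```

Linedefs would then have their sidedef numbers rewritten through remap[]. A real tool would probably have to skip sidedefs on lines with special effects (scrolling walls, switches), since sharing those changes behaviour.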

  • 3 weeks later...

https://github.com/viti95/FastDoom/blob/master/fastmath.h:

/* 16.16 fixed-point multiply: 64-bit product in edx:eax, keep bits 16..47 */
fixed_t FixedMul(fixed_t a, fixed_t b);
#pragma aux FixedMul = \
    "imul ebx",        \
    "shrd eax,edx,16" parm[eax][ebx] value[eax] modify exact[eax edx]

/* 16.16 fixed-point divide: clamps to MAXINT or MININT when the quotient would overflow */
#define FixedDiv(a,b) (((abs(a) >> 14) >= abs(b)) ? (((a) ^ (b)) >> 31) ^ MAXINT : FixedDiv2(a, b))
/* Builds (a << 16) as a 64-bit value in edx:eax, then divides it by b */
fixed_t FixedDiv2(fixed_t a, fixed_t b);
#pragma aux FixedDiv2 = \
    "cdq",              \
    "shld edx,eax,16",  \
    "sal eax,16",       \
    "idiv ebx" parm[eax][ebx] value[eax] modify exact[eax edx]

/* value * 80 = (value * 5) << 4 */
int Mul80(int value);
#pragma aux Mul80 = \
    "lea eax, [eax+eax*4]", \
    "shl eax, 4" parm[eax] value[eax] modify exact[eax]

/* value * 320 = (value * 5) << 6 */
int Mul320(int value);
#pragma aux Mul320 = \
    "lea eax, [eax+eax*4]", \
    "sal eax, 6" parm[eax] value[eax] modify exact[eax]

/* value * 10 = (value * 5) + (value * 5) */
int Mul10(int value);
#pragma aux Mul10 = \
    "lea eax, [eax+eax*4]", \
    "add eax, eax" parm[eax] value[eax] modify exact[eax]

/* value * 100 = (value * 5 * 5) << 2 */
int Mul100(int value);
#pragma aux Mul100 = \
    "lea eax, [eax+eax*4]", \
    "lea eax, [eax+eax*4]", \
    "sal eax, 2" parm[eax] value[eax] modify exact[eax]

Ok, wow. And this is faster than just regular math?


I've been trying to generate fully optimized IWADs with the wide mode, but the process always ends up in an infinite loop (or something like that). Maybe I'm doing the process wrong; what settings do you recommend?

The best performance I've seen in testing comes from using the wide mode in combination with the tool WadPtr. I think reducing the number of visplanes, segments, nodes and memory pressure makes the Doom engine run smoother. I prefer that external tools optimize the WADs instead of implementing those optimizations in FastDoom; 386 and 486 processors are way too slow to do these things. Maybe I could add ZokumBSP and WadPtr to the releases, with some batch process to let people optimize the IWADs themselves.

 


OpenWatcom doesn't optimize divisions (it just uses the DIV and IDIV instructions, which are really slow) nor multiplications (it uses the MUL and IMUL instructions). Only power-of-two divisions and multiplications are optimized. What I do is compile the same code with a recent version of GCC and then use the assembly it generates in FastDoom, instead of relying on the OpenWatcom compiler. I've tested it to be faster. GCC is a better compiler, but I can't use it to build FastDoom, as there are multiple assembly files that can only be assembled with Borland TASM (John Carmack's span and column rendering functions, sound mixing code...)
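
For anyone who doesn't read x86 assembly, here is a portable C rendering of what those fastmath.h snippets compute. It is only a sketch for illustration, not the FastDoom code: the lowercase names are mine, the helpers take unsigned values to keep the shifts well defined, and fixed_mul assumes the compiler performs an arithmetic right shift on negative 64-bit values (true of the mainstream ones).

```c
#include <stdint.h>

typedef int32_t fixed_t;   /* 16.16 fixed point */
#define FRACBITS 16

/* Equivalent of "imul ebx / shrd eax,edx,16": take the 64-bit product and
   keep bits 16..47. */
fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) >> FRACBITS);
}

/* "lea eax,[eax+eax*4]" computes v * 5 with one add and one shift; the
   remaining factor is a power of two, so no MUL is needed. */
unsigned mul80(unsigned v)  { unsigned x5 = v + (v << 2); return x5 << 4; }
unsigned mul320(unsigned v) { unsigned x5 = v + (v << 2); return x5 << 6; }
unsigned mul10(unsigned v)  { unsigned x5 = v + (v << 2); return x5 + x5; }
unsigned mul100(unsigned v)
{
    unsigned x5  = v + (v << 2);     /* v * 5  */
    unsigned x25 = x5 + (x5 << 2);   /* v * 25 */
    return x25 << 2;                 /* v * 100 */
}
```

On a 386 or 486, adds, shifts and LEA typically take far fewer cycles than MUL or IMUL, which is the whole point of spelling the constants out this way. A modern optimizing compiler will usually do the same rewrite on its own when given v * 80.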


The problem with using ZokumBSP in its current state is that the best level of optimization takes a long time. A multitree rebuild with a width of 2 would take several days, even with a multi-GHz CPU. The algorithms aren't very intelligent; it's mostly a proof of concept. The idea is that mappers need only do this once, so a long build time isn't a problem.

I do think there is a lot of memory and speed to be saved by sidedef/segs compression.

There is one thing that might speed things up a bit. I haven't checked, but in most cases the REJECT map contains a lookup for sector A to B and for sector B to A, and in most cases they will show the same result. You could reduce the memory needed by almost 50%. If you have a map with, say, 5 sectors, sector 0 would need lookup data for sectors 1, 2, 3 and 4. Sector 1 would need it for sectors 2, 3 and 4, etc. Sector 4 would not need a lookup table at all, since you can check the lower-numbered sectors' lookup tables.

You would still need to have some checks done both ways if maps include special effects. There is no clean way of doing that efficiently without breaking things. I think the order of sectors matter :)
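
A minimal sketch in C of the triangular lookup described above, assuming the table is fully symmetric and that a sector is never rejected against itself (both are assumptions; as noted, special-effect maps can break them). The function names are made up for illustration.

```c
/* Bit index of the pair (a, b) in a table that stores only a < b.
   Row a holds the pairs (a, a+1) .. (a, n-1). Requires a < b < n. */
unsigned tri_index(unsigned a, unsigned b, unsigned n)
{
    return a * n - a * (a + 1u) / 2u + (b - a - 1u);
}

/* Returns 1 when sectors a and b are marked as unable to see each other. */
int tri_rejected(const unsigned char *bits, unsigned a, unsigned b, unsigned n)
{
    unsigned i;
    if (a == b)
        return 0;                       /* assumption: never rejected */
    if (a > b) { unsigned t = a; a = b; b = t; }
    i = tri_index(a, b, n);
    return (bits[i >> 3] >> (i & 7u)) & 1;
}

/* Table sizes in bytes for n sectors. */
unsigned full_reject_bytes(unsigned n) { return (n * n + 7u) / 8u; }
unsigned tri_reject_bytes(unsigned n)  { return (n * (n - 1u) / 2u + 7u) / 8u; }
```

With 5 sectors the full table needs 25 bits and the triangular one needs 10, which matches the "sector 0 needs 1,2,3,4; sector 4 needs nothing" description. The price is a slightly more expensive index calculation on every lookup.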


Correct, Borland TASM generates .obj files that GCC doesn't understand (it only supports .coff files), so we can't use GCC without porting all the assembly code that originally came with vanilla Doom. The most problematic file is planar.asm, which contains the column and span drawing functions (highly optimized by John Carmack). That code is hard to understand and includes self-modifying parts, which makes it even harder to port to other assemblers. Even using a TASM version other than 3.1 causes the generated code not to work.

16 minutes ago, Lol 6 said:

 

I got it running above 1000 FPS, is that fast?

 

Not so long ago I tested that scenario on real hardware, and found that if you pass ~2000 fps, the fps counter glitches and starts showing negative numbers.

  • 3 weeks later...

 

Here is a little update on what I'm currently developing for FastDoom: text video mode rendering at 80x25 and 80x50 (and maybe more hidden modes such as 132x44) with 16 colors. It's missing all menus / intermissions / text messages and the HUD, but rendering is nearly 100% done. I still have to solve the flickering (by using multiple video pages), fix some off-by-one errors and make better use of the 16-color palette.

This idea came from the SMMU port, but I implemented it in a different way (much faster). Also, having 16-color support will let me port FastDoom to CGA (160x100) and EGA (320x200).


@viti95: a "cheap" way of displaying more than 16 (dithered) colors in text mode is to abuse special characters like B0-B2 and have different colors for the foreground/background, or even use different characters depending on what color mixing proportions you want to achieve.
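
A minimal sketch of that shade-character trick in C, assuming the standard text-mode cell layout (character in the low byte, attribute in the high byte, foreground in the low nibble and background in the high nibble) and the code page 437 glyphs 0xB0, 0xB1 and 0xB2. The quarter-step coverage values are nominal; the real mix depends on the font. shade_cell is a hypothetical helper.

```c
#include <stdint.h>

/* Builds a text cell that mixes bg and fg in steps of roughly one quarter:
   0 = all background, 4 = all foreground. */
uint16_t shade_cell(uint8_t bg, uint8_t fg, unsigned quarters)
{
    /* space, light shade, medium shade, dark shade, full block */
    static const uint8_t glyph[5] = { 0x20, 0xB0, 0xB1, 0xB2, 0xDB };
    uint8_t attr;
    if (quarters > 4u)
        quarters = 4u;
    attr = (uint8_t)(((bg & 0x0Fu) << 4) | (fg & 0x0Fu));
    return (uint16_t)(((unsigned)attr << 8) | glyph[quarters]);
}
```

One caveat: with the attribute blink bit enabled, bit 7 of the attribute selects blinking instead of a bright background, so only 8 background colors are available unless blinking is turned off.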

 

 

Alternatively, you could use that same trick and abuse characters DC and DF to "double" your vertical resolution, by remaining at 16 colors but rendering two "pixels" as one character, by changing the foreground and background color, as each of those characters is essentially a "half-height pixel". I thought there also was a "half-width column" ASCII character, that would allow you to double your horizontal resolution instead, but I guess if you are going into that sort of trickery, you could always use custom character sets...
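
The half-block variant can be sketched the same way. This assumes character 0xDF (upper half block) draws its top half in the foreground color and leaves the bottom half in the background color, so one cell carries two vertically stacked 16-color "pixels". The function names and buffer layout are hypothetical, not FastDoom's.

```c
#include <stdint.h>

/* One text cell showing `top` above `bottom`. */
uint16_t half_block_cell(uint8_t top, uint8_t bottom)
{
    uint8_t attr = (uint8_t)(((bottom & 0x0Fu) << 4) | (top & 0x0Fu));
    return (uint16_t)(((unsigned)attr << 8) | 0xDFu);
}

/* Converts a width x (2 * text_rows) image of 4-bit pixels into
   width x text_rows text cells. */
void pack_half_blocks(const uint8_t *pixels, unsigned width,
                      unsigned text_rows, uint16_t *cells)
{
    for (unsigned row = 0; row < text_rows; row++)
        for (unsigned x = 0; x < width; x++) {
            uint8_t top    = pixels[(2u * row) * width + x];
            uint8_t bottom = pixels[(2u * row + 1u) * width + x];
            cells[row * width + x] = half_block_cell(top, bottom);
        }
}
```

In an 80x50 text mode this gives an effective 80x100 picture. As with the shade characters, bottom colors above 7 only display as intended when attribute blinking is disabled.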

 

Now, supporting CGA and EGA is a whole other ballgame. EGA is just plain fugly and slow (bit-planar...so rendering to it is likely to slow things down, rather than speeding them up, unless you make some really clever use of pre-rendered tiles to avoid pixel-by-pixel drawing or C2P conversion).

 

CGA on the other hand may be less problematic from that point of view, as its hacky 160x100 mode (a text mode, if I don't err) should be "chunky" and thus directly addressable.


Strictly speaking, if you're in text mode, you could probably just make a really barebones text HUD instead of recreating the graphical one with ASCII characters. Just say "Health: 100", "Armor: 58", "Ammo: 67", "Keys: BYR" etc.

 

You'd lose out on the Doom Guy face and the overall ammo counter far to the right, I suppose. The latter could easily be optionally enabled by taking out another line for it above the main bar, Quake-style. As for the former, you could probably sort of mimic it with the ol' ZZT smiley (characters 0x01 and 0x02), color-shifting it from blue to green to yellow to red to gray as you lose health I guess. Getting hit from the side could pop up a quick < or > to indicate which direction you were hit from.
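
Something like the following is what I have in mind, as a rough sketch in C. The layout string, the function names and the health thresholds are all invented for illustration; the color values are the usual 16-color text attribute numbers.

```c
#include <stdio.h>

/* Writes a one-line status bar and returns its length (as snprintf does). */
int format_text_hud(char *out, size_t size, int health, int armor,
                    int ammo, const char *keys)
{
    return snprintf(out, size, "Health: %d  Armor: %d  Ammo: %d  Keys: %s",
                    health, armor, ammo, keys);
}

/* Picks the smiley's color: blue, green, yellow, red, then gray. */
unsigned char face_color(int health)
{
    if (health >= 80) return 0x09;   /* light blue  */
    if (health >= 60) return 0x0A;   /* light green */
    if (health >= 40) return 0x0E;   /* yellow      */
    if (health >= 20) return 0x0C;   /* light red   */
    return 0x08;                     /* dark gray   */
}
```

The resulting string would be copied into the bottom row of the text screen, and the smiley (character 0x01 or 0x02) drawn next to it with the attribute returned by face_color.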

16 minutes ago, Shadow Hog said:

You'd lose out on the Doom Guy face and the overall ammo counter far to the right, I suppose.

 

Well...a few -very few- select graphics (fixed-resolution graphics such as title screen and HUD elements are perfect candidates) could probably deserve having their own extra colorized ASCII-art text lumps, with a proper rendition of the original graphics and a small hack in the engine to draw those directly if available -those could be provided by a small external PWAD, delimited e.g. by a TTEXT/_TTEXT pair of markers. Their names would be identical to the actual graphics lumps they'd replace.


FWIW, it may be significantly faster to load the IWAD on an old PC if you "unpack" it so that all textures are a single patch. In a sense you are precalculating the work that the engine has to do every time you boot the game.

 


YES! Genuine text-mode rendering was only ever attempted by SMMU (everything else relies on a library to simulate the behavior in a console shell), so this pleases me greatly. And in text mode, there is a lot of CPU available to just throw at new effects. That's beyond FastDoom's scope, but it will be great to see this run in text modes :)


very fascinating

but a 386 CAN run vanilla doom in VGA

 

however... I look forward to trying DOOM in EGA mode


@Maes thanks for the ideas, I have implemented the 80x50 mode by doubling the vertical resolution with the DF character! The 80x25 mode now looks much better, but I still have to fix screen tearing (should be easy to fix with triple buffering), and fix CGA snow somehow (that should be more complex).

 

I've discovered that someone has implemented Doom in glorious CGA 320x200 (4 colors), so it shouldn't be hard to implement. The main problem with FastDoom is that I still haven't implemented a backbuffer for certain modes, as the original Mode-X code writes directly to the video card. A backbuffer is required to avoid flickering when multiple video pages aren't available.

 

https://hup.hu/index.php/node/161564

 

 

EDIT: You can grab that Doom CGA port from here: http://forum.amigaspirit.hu/index.php?action=vthread&forum=8&topic=356&page=2

 

@Linguica How did you unpack the IWADs? The speedup in loading times is very noticeable!! This should be even more noticeable with 386 processors.

 

EDIT 2: Another small update, new video showing a CGA card running FastDoom (80x50). Also I've solved the flickering problem using multiple video pages.

 

 

  • 1 month later...
10 minutes ago, EpicTyphlosion said:

All it needs is non text mode 16-color and 4-color rendering and it will be a perfect port ;)

 

Well, a reduction in total displayed color depth can simply be achieved by supplying appropriately crafted palettes to the existing 8-bit color renderer. In fact, those have existed for some time in idgames as EGADooM II (I guess EGADOOM 1 was bollocks) and CGADOOM.

 

Making a native, pixel-addressable CGA or EGA mode is going to be a lot harder, and ultimately slower due to the planar display and the need to perform chunky-to-planar conversions, thus missing the point of this port (speed...).
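
To show where the cost comes from, here is a minimal chunky-to-planar sketch in C for the EGA-style layout: 8 pixels of 4 bits each become one byte in each of the 4 bitplanes, with the leftmost pixel in bit 7. It is the naive version, written for clarity; c2p_8pixels is a hypothetical name.

```c
#include <stdint.h>

/* px[0] is the leftmost pixel (values 0..15). planes[p] receives the byte
   that has to be written to bitplane p. */
void c2p_8pixels(const uint8_t px[8], uint8_t planes[4])
{
    for (unsigned p = 0; p < 4; p++) {
        uint8_t byte = 0;
        for (unsigned i = 0; i < 8; i++)
            byte = (uint8_t)((byte << 1) | ((px[i] >> p) & 1u));
        planes[p] = byte;
    }
}
```

That is 32 shift-and-mask steps plus four separate plane writes for every 8 pixels, compared with simply copying bytes in a chunky mode, which is why this hurts so much on a 386.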


A decent EGA mode should use dithering to improve the graphics. A palette hack would look pretty bad in comparison to what would be possible to do if an algorithm used dithering.
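
As one example of what such an algorithm could look like, here is a minimal ordered-dithering sketch in C using a 2x2 Bayer matrix. It reduces an 8-bit intensity to a small number of evenly spaced levels and uses the pixel position to decide whether to round up or down. This is just one possible technique, not something FastDoom is known to use, and mapping the levels onto actual EGA palette entries is left out.

```c
#include <stdint.h>

static const uint8_t bayer2[2][2] = { { 0, 2 }, { 3, 1 } };

/* Returns a level in 0 .. levels-1 for an intensity in 0 .. 255.
   Requires levels >= 2. */
unsigned dither_level(uint8_t value, unsigned x, unsigned y, unsigned levels)
{
    unsigned scaled    = value * (levels - 1u);
    unsigned base      = scaled / 255u;     /* level when rounding down   */
    unsigned frac      = scaled % 255u;     /* how far towards the next   */
    /* Position-dependent thresholds: 31, 159, 223 and 95 out of 255. */
    unsigned threshold = (bayer2[y & 1u][x & 1u] * 2u + 1u) * 255u / 8u;
    return base + (frac > threshold ? 1u : 0u);
}
```

A mid-gray value with two levels lights two of every four pixels in a checkerboard, so in-between shades are approximated by a pattern instead of being snapped to the nearest palette color.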
