About sqpat
-
Here's a run I did today on real hardware - a 16.5 MHz V20 with 0 wait states, reduced DRAM refresh times and some other tweaks. I didn't have XMS configured for the WAD to load into. Is that possible with a simple EMS card and some driver? I'm not as familiar with XMS setup on XT class stuff. EDIT: This was with the mode Y executable... would mode 13h be faster on an 8 bit constrained XT system?
-
RealDOOM 0.21 is released

Main Changes:
- R_DrawColumn, R_DrawSpan, R_MapPlanes render functions in asm and optimized
- FixedMul functions all in asm and optimized
- A few bugfixes (re-fixed the doom2 timedemos, which I somehow broke before the 0.20 release, days after fixing them...)
- Memory usage up 8k, so 8088/V20 builds are temporarily on hold since they won't fit in memory easily.

I have been running a lot of benchmarks over the past week. They are now in one easily accessible google spreadsheet here. Generally everything was stock settings - don't take the fastdoom ones super duper seriously. I'm sure choosing the appropriate EXE for a given machine may give even better numbers.

The general gist of performance gains from 0.20b to 0.21:
- 35% FPS gain on big screen (screenblocks 9) hi detail
- 70% FPS gain on big screen low detail
- 30% FPS gain on small screen (screenblocks 5) hi detail
- 45% FPS gain on small screen low detail
- 45% FPS gain on potato over low detail

(There was already some semi-optimized drawspan code in high detail, hence the disparity, I think.)

In general, on big screens, RealDOOM 0.21 low detail is a tiny bit faster than DOOM 1.9 high detail. RealDOOM Potato is a bit slower than DOOM low detail. RealDOOM Potato is a bit faster than FastDOOM hi quality (lol).

I'd say 0.21 took all the 'easy performance gains' and got us halfway to the goal of matching DOOM 1.9 speed. That means finding another 50% performance improvement... There's not much left that's an obvious big (10-20%) performance gain; it'll be a fight to make small gains here and there. FixedDiv is an obvious target, as is handwriting the rest of the render pipeline in asm. Eventually the busiest physics code (movement and collision stuff) will probably be looked at too. I also want to move from the TASM assembler toolchain to something a little better, either MASM or NASM. I need to figure out some very basic issues first, like how to call into C from ASM - maybe it's a linker problem.

I'm juggling a lot of hardware projects on the side right now - many related to older motherboards and hardware and being able to run realdoom benchmarks. In a few days I'll hopefully have some real 286 SCAT chipset based boards that can run proper benchmarks. Same for XT class stuff via memory boards. I'll keep adding to that spreadsheet and post here once there's more interesting data. I will also go back and run some timedemos on 0.11/0.20 to have more data on older builds.

Meanwhile I'm going to take a short break to try and work on an EMS 4.0 driver for the VLSI SCAMP chipset, which is what my fastest 286 boards use. If I write something generic enough, it may be of use to some other chipsets, hobbyist hardware, 86box devs, stuff like that. Once it's done, hopefully I can run this on 30-35 MHz 0ws 286 hardware.

EDIT: Whoops, the benchmarks link should have the right permissions now!
-
I've made updated versions of FixedMul. According to martypc, the original FixedMul on an 8088 (simply casting to long long and using the internal openwatcom i8m.asm 8 byte multiply function) took ~3006 cycles in situations where it couldn't "quick-out" by catching zero inputs and such. The new one is around 805 cycles for the base version. I'm sure it's much, much faster on a 286, but it's still a heavy operation. There are also various faster versions I've implemented, for example when inputs are 16 bits signed, unsigned, 17 bits signed (finesine/finecosine mults) and some 24 * 24 options too.

A couple of timedemo bugs popped up in doom2; I will have to revisit them for accuracy. I might leave a for loop running overnight that tests the algo with every possible input to see if it catches anything incorrect.

FixedDiv is going to be even worse, and another similar can of worms. I haven't even looked at drawfuzzcol either - I'm not even 100% sure how it works, so I need to examine the VGA register usage. I'll be looking at that over the next few days and will continue to work on MapPlanes and DrawPlanes ASM. Aside from that I need to fix a couple of render bugs I've noticed and cut a 0.21 release with a whole bunch of the early ASM optimizations. I know most of the processing time is currently spent in BSP and I'd like to rewrite the whole BSP function tree in asm too - not just the draws.

I don't want to get too into the weeds with alternate render options yet due to hassles with config files, inputs, menus, etc. But I did add basic support for potato mode since it was trivial. I was worried the menu might crash or display garbage when potato was selected, but we got something much better by accident: It's not wrong...

Anyway, I think for the next two months I might work a bit less on this again, as I have some hardware projects to get to. There should be some progress though. August might again see a lot of work done, then September and October will likely see nearly no work. Then things might pick up again.
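For anyone curious what splitting a 16.16 multiply into smaller pieces looks like, here is a minimal C sketch. It is not the actual RealDOOM asm: the names fixed_mul_ref and fixed_mul_parts are made up for illustration, and it assumes the usual two's complement behaviour (arithmetic right shift of negative values, wrapping unsigned-to-signed conversion). The idea is that the 64-bit product can be rebuilt from four 16x16 partial products, which is roughly the shape a 16-bit CPU has to work with.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;   /* 16.16 fixed point */
#define FRACBITS 16

/* Reference version: widen to 64 bits, multiply, shift back down. */
fixed_t fixed_mul_ref(fixed_t a, fixed_t b) {
    return (fixed_t)(((int64_t)a * (int64_t)b) >> FRACBITS);
}

/* Same result built from 16-bit halves (hypothetical helper, not RealDOOM code).
   a = ah * 65536 + al with ah signed and al unsigned, likewise for b. */
fixed_t fixed_mul_parts(fixed_t a, fixed_t b) {
    int32_t  ah = a >> 16;                 /* signed high word (assumes arithmetic shift) */
    int32_t  bh = b >> 16;
    uint32_t al = (uint32_t)a & 0xFFFFu;   /* unsigned low word */
    uint32_t bl = (uint32_t)b & 0xFFFFu;

    /* All sums are done in uint32_t so they wrap instead of overflowing. */
    uint32_t result = (uint32_t)(ah * bh) << 16;   /* high * high: only its low 16 bits survive */
    result += (uint32_t)(ah * (int32_t)bl);        /* cross term */
    result += (uint32_t)(bh * (int32_t)al);        /* cross term */
    result += (al * bl) >> 16;                     /* low * low: only its high 16 bits matter */
    return (fixed_t)result;
}
```

The special-case versions mentioned above (16 bit inputs and so on) would correspond to dropping the partial products that are known to be zero.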
-
Yeah that's pretty bad. I've noticed that sometimes instructions end up in a weird order, I'm guessing to keep the prefetch queue happy. But sometimes there are back to back instructions that access the same memory. Here's a favorite from the weekend. I even found that the "AND AX, immediate" instructions it generates use a one byte larger encoding than what tasm generates.

Cool, I wasn't aware of godbolt... The tools for x86-16 are really sparse, unfortunately. I have been using shell-storm, which has an x86-16 option that is mostly fine for quickly assembling/disassembling bits of asm back and forth. I bounce around development machines and between mac and PC, so having something web based is convenient. I've been looking to get martypc set up to boot and run arbitrary code so I can actively debug it, especially for math-heavy operations, to make sure things are coming out as expected. But it doesn't support VGA and EMS yet, so I can't debug application-level stuff there. My workflow has been to take the openwatcom generated code, toss it in shell-storm, toss that in tasm, work out the variables and function calls and improve from there. It's not too bad for getting something working quickly.

Yeah that's probably true - I commented out the prelevel texture caching a few months back. It was a constant source of hassle at one point when the engine was undergoing a lot of changes; it kept getting in the way and causing bugs. Now that the architecture is more settled I can go back and rewrite that, I suppose. I will add it to the list of stuff that needs to be improved.

The game doesn't really dynamically allocate more or less memory per level, it's always the same: 256 KB base system memory + 64 KB in one upper block, then 2 MB of EMS memory. The memory limits are set at build time. I will eventually have a build config tailored around machines with more memory, or have the game use more EMS if it's there with dynamic-sized texture caches, but that feels more like polish for a finished product, so I don't think I'm there yet.
-
To be clear, normal high and low quality options work just fine too. I'm sure I will eventually add more quality-reducing options like non textured flats and what not, but not until the high quality options are as optimized as is reasonable. Potato quality feels like it should eventually be pretty playable, but maybe we can get something closer to low quality playable on a fast 286. I ran it for a bit on an mmx-233, it felt great to play, though obviously that hardware is overkill. I tried to get a 286 here up and running but I need to fight with QRAM/EMS settings a bit... it wasn't behaving right away and I didn't have more than 15 minutes to mess around with it and I don't want to debug stuff like that right now. I'll do more real hardware testing soon, especially for having benching numbers and results, and I will have comparisons to older versions, commercial doom, and fastdoom. This project won't ever be as fast as fastdoom (well optimized 32-bit is too powerful) but I think realdoom potato is already faster than commercial doom low quality, and I'm feeling better about realdoom eventually matching and beating commercial doom performance on equal quality levels.
-
Here's a video! I'll have bench numbers later when 0.21 is a little more stable and close to release. On 486 and older I think we're running at least ~40% faster fps than 2 weeks ago on large screen sizes, and at least ~20-30% faster on smaller sizes. Draws are all ASM, loops unrolled and potato supported. Still a long way to go though.

I still have some bug cleanup and code/memory cleanup related to the drawspan/drawcolumn implementations. While I'm doing planes, I want to move R_MapPlane to assembly and maybe DrawPlanes while I'm at it. DrawFuzzColumn is also not done yet. I think I can also write a faster version of drawcolumn for sky textures. I think I should probably move DrawMaskedColumn to assembly too... I am pretty serious about eventually handwriting the whole render pipeline. It's not actually a large number of functions.

One thing I'm also curious to try is a dynamic distance-based graduation from hi to potato as the spans get further from the player. I actually have a single function that does drawcolumn and a single one that does drawspan for all qualities, which handle the renders based on supplied parameters. There's no reason I can't do a combination of hi, low and potato quality calls in the same frame. I think it might be a good compromise on speed and quality for planes.

Meanwhile I'm still very new to the asm tooling. I am using TASM since that's what PCDoomC2 had, not using macros at all, etc. It's all a little ugly with manually unrolled loops, and there's a lot of ASM black magic (self modifying code, disabling interrupts and using stack registers for math, etc.) but I've squeezed a lot of performance out of these functions. I might make the jump to NASM at some point, I just really hate messing with tooling instead of being able to sit down and code. Another thing I need to look into is how to replace some of these FixedMul/etc calls with the asm optimized, possibly inlined versions that I have worked on.

Every time I disassemble a compiled C function from openwatcom I see so much that is trivial to optimize. Not just in making things faster (usually fewer memory accesses, or swapping high and low bytes instead of shifting by 8, which is slow on 8088/286), but usually the code can be made smaller at the same time. After doing just 2 or 3 functions like this, it hurts to imagine how much faster the entire codebase could be...
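The distance-based graduation idea could look something like the C sketch below. Everything here is hypothetical: the enum, the function name and especially the distance cutoffs are invented for illustration and are not taken from RealDOOM. It only shows the selection logic, which picks a coarser quality as a span gets further away but never goes finer than the user's chosen detail setting.

```c
#include <assert.h>
#include <stdint.h>

/* Higher value = coarser rendering. */
typedef enum { DETAIL_HIGH = 0, DETAIL_LOW = 1, DETAIL_POTATO = 2 } detail_t;

/* Hypothetical cutoffs in whole map units; real values would need tuning. */
#define LOW_DETAIL_DISTANCE    512
#define POTATO_DETAIL_DISTANCE 1024

/* Pick the quality for one span given its distance from the player.
   base is the detail level selected in the options menu. */
detail_t detail_for_distance(int32_t distance, detail_t base) {
    detail_t wanted = DETAIL_HIGH;
    if (distance >= POTATO_DETAIL_DISTANCE) {
        wanted = DETAIL_POTATO;
    } else if (distance >= LOW_DETAIL_DISTANCE) {
        wanted = DETAIL_LOW;
    }
    /* Never render finer than the user's setting. */
    return (wanted > base) ? wanted : base;
}
```

With a single draw function that takes the quality as a parameter, the caller would just pass the returned value along per span.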
-
The Everex does not support EMS 4.0 with backfill. It is a standard EMS 3.2 card with a page frame. You will be able to run RealDOOM 0.10/0.11 with such a setup, though. There is actually no emulator out there that will run realdoom on an 8088. You can do a 286 with a SCAT/NEAT chipset and its drivers on 86box just fine. A real physical 8088 machine with a fully filled out Intel Above Board should be able to do it. I will test it on real hardware sometime next week, as I'm away from my machines right now. Hopefully one day someone will support an Above Board in an emulator, or make a device that can match it. The Lo-tech boards and PicoMEM don't do it either. It's not a popular feature, I guess.

(Small update - ASM R_DrawColumn leads to 10% faster timedemos on small screens and 20% faster on big screens. I'm cleaning up the implementation a bit, then I will generalize it to Low/Potato modes, handle sky draws and fuzz draws, then move on to spans...)

EDIT: Okay, my mistake. I remembered 86box having the Micro Mainframe card. But the Everex actually seems to support the proper EMS functions, though I'm not 100% sure. First you will have to set your XT to 256 KB of RAM, then you need to configure your card to start at 256k and have 384 KB contiguous size. Make sure EMS is enabled and you have 3 MB of memory size. Make sure EMS is configured to at least 2 MB in setup.

Now here is the problem. I can't find a way to load anything high on an 86box XT. Usually you use some memory card to open up C000 or D000 on your machine, then use USE!UMBS and load DOS and some other drivers high. I don't think 86box supports any of this kind of hardware, so we are stuck with 560k conventional or so, which is about 50k short of what we need. You should have 610-620k free in DOS. I'll let you know how real hardware goes when I try it next week. In theory it should work, I just don't believe emulators support enough yet.
-
OK - a followup release as promised. This was mostly required because 0.20 had a masked render bug. But this one is much more improved than I had originally planned.

- CC00, B000 memory regions no longer necessary
- Binary is 5-6k smaller too
- Thanks to the shrunken binary, the 8088 build is small enough to fit into memory again. (8088/188 builds included)

The memory/binary savings came from a few things:

- I went through and labeled just about every function __near or __far, which I had previously not done (about 2k in binary savings, near function calls don't need push cs before them).
- I went through and changed all #define variables to be of 0x####0000 address form, so everything is now at a 0000 offset, which means many of these variable accesses no longer require an extra offset addition to figure out their memory address. (For example, let's say the sectors memory region is now at 0xE1000000 instead of 0xE0001000; every sectors access is now a bit faster and smaller in code.) This saved about 2k in the binary. It added some small gaps to memory in these regions, but that's not a big deal.
- I moved some static code (info.c stuff) into a binary file and put it in a higher EMS region, and we call it from there. That was about 1 KB of code removed.
- Various small code optimizations. For example, constructing wad filenames using char math instead of sprintf saves a couple hundred bytes in the binary.

So here is v0.20b, a much better version with the main bugfix, but also way less of a headache to run. https://github.com/sqpat/RealDOOM/releases/tag/0.20b

I've been working on this project for 10 or 11 months now. Up until now the work has been maybe 90% architecture work and 10% speed. I think from this point on, for a couple of releases, it's going to be more like 50/50. I did a little bit of work on draw column (+low, potato) asm today. I think I will have that working and optimized in a couple of days as long as nothing unexpected comes up.
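To illustrate the 0x####0000 point: in real mode the same physical byte can be named by many segment:offset pairs, so a region can always be re-expressed with a zero offset as long as it starts on a 16 byte boundary. The C sketch below models a far pointer as a plain 32-bit number (segment in the high word, offset in the low word); the type and function names are made up for this example, and the 20-bit wrap assumes 8086-style addressing.

```c
#include <assert.h>
#include <stdint.h>

/* A far pointer packed as 0xSSSSOOOO: segment in the high word, offset in the low word. */
typedef uint32_t farptr_t;

/* Real-mode physical address = segment * 16 + offset. */
uint32_t far_to_linear(farptr_t p) {
    uint32_t segment = p >> 16;
    uint32_t offset  = p & 0xFFFFu;
    return (segment << 4) + offset;
}

/* Re-express the same address with the smallest possible offset.
   The offset becomes 0 whenever the address is paragraph (16 byte) aligned. */
farptr_t far_normalize(farptr_t p) {
    uint32_t linear = far_to_linear(p) & 0xFFFFFu;   /* assume 20-bit wraparound */
    return ((linear >> 4) << 16) | (linear & 0xFu);
}
```

Here far_normalize(0xE0001000) gives 0xE1000000: both name physical address 0xE1000, but with the second form a field at structure offset N is simply at offset N in the segment, with no base offset to add.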
-
I created the v0.20 release. https://github.com/sqpat/RealDOOM/releases/tag/0.20

Don't forget to set up EMM386 or run your EMS driver. A sample EMM386 config is:

DOS=UMB,HIGH
DEVICE=C:\DOS\HIMEM.SYS /TESTMEM:OFF
DEVICEHIGH=C:\DOS\EMM386.EXE RAM /I=E000-EFFF /I=B000-B7FF /I=C800-CFFF

The release is a 286 binary, so the 8088 won't work right now. An 8088 build ends up with a larger binary and is still too big to fit everything in conventional memory. This should naturally be fixed in a few weeks to months as I make the binary smaller.

I got things working earlier today on a 286 in 86box, but then a mishap led to my good 286 config.sys getting deleted. I just spent 30 minutes failing to get my QRAM parameters correct again. Oh well. You need some upper memory available in both the B000 block and the CC00 block. You also need the entire E000 block free (whether this is an EMS page frame or not doesn't matter currently, since we aren't paging here). D000 will eventually be used for sounds/music and the page frame, but right now it isn't used.

There may be a followup release somewhat soon, and hopefully I can reduce the dependency on the B000/CC00 upper memory regions, because it's kind of annoying to configure around. But I need to do some more memory optimization on the binary first.

EDIT: something like this will work for a 286 SCAT board in 86box:

DOS=UMB,HIGH
DEVICE=C:\DOS\HIMEM.SYS /TESTMEM:OFF
DEVICEHIGH=C:\QRAM\QRAM.SYS EXCLUDE=D000-D3FF
DEVICE=C:\QRAM\LOADHI.SYS /R:3 C:\SCATEMM\SCATEMM.SYS FRAME=E000

And I walked around shareware e1m1 a bit, and of course the release has some graphical glitches going on in a few spots. Oh well, like I said, there will be a followup release that is hopefully a little more cleaned up.
-
I put EMS visplanes aside for a while and more or less finished all the other remaining bugs:

- All doom1/doom2 timedemos now play accurately (accurate to version 1.9, that is: doom2 timedemo 1 fails just like it does under engine version 1.9). The three timedemos had three different bugs. I wouldn't be surprised if I could find a lot more bugs if I try out some other timedemos in the future. I'm sure the engine's far from bug free...
- Fixed the "Read This" menu bug
- Fixed some colormap bugs, which I broke in the last couple of weeks
- Fixed the finale, which I also broke in the last couple of weeks
- Fixed an issue with sky textures not rendering on tick 1 of the level (which I also broke in the last couple of weeks...)

I also did some work on trying to run functions that have been taken out of the binary and loaded dynamically into EMS memory locations. There might be an openwatcom compiler bug leading to this not working right - I am going to update the toolchain from 1.9 to 2.0 to see if the bug is still present. Basically I want to be able to do stuff like this:

#define getDamageAddr ((uint8_t (__far *)(uint8_t)) (0x6EA902DA)) // function previously loaded from file into memory
#define getDamage(a) ((getDamageAddr)(a) )
...
// inlined getDamage should work with same syntax as before when it was a compiled function in the binary
damage = ((P_Random()%8)+1)*getDamage(tmthing->type);

(Since the damage field in info.c was mostly repetitive, I saved memory long ago by tossing such fields in getter functions with switch cases. These are really simple functions that don't depend on outside data or near variables.)

It seems it's possible to do this, but only if you store the function pointer in a correctly typed variable first, rather than using the inlined #define "function". That probably doesn't generate different machine code than if it were correctly compiled in inline fashion, but the code would be annoying that way.

(Then of course, I will have monstrosities like this going on:)

#define getPainChanceAddr ((int16_t (__far *)(uint8_t)) (0x6EA90034))
#define getRaiseStateAddr ((int16_t (__far *)(uint8_t)) (0x6EA900B2))
#define getXDeathStateAddr ((int16_t (__far *)(uint8_t)) (0x6EA9010A))
#define getMeleeStateAddr ((int16_t (__far *)(uint8_t)) (0x6EA9015A))
#define getMobjMassAddr ((int32_t (__far *)(uint8_t)) (0x6EA901B8))
#define getActiveSoundAddr ((sfxenum_t (__far *)(uint8_t)) (0x6EA90222))
#define getPainSoundAddr ((sfxenum_t (__far *)(uint8_t)) (0x6EA90284))
#define getAttackSoundAddr ((sfxenum_t (__far *)(uint8_t)) (0x6EA902B8))
#define getDamageAddr ((uint8_t (__far *)(uint8_t)) (0x6EA902DA))
#define getSeeStateAddr ((statenum_t (__far *)(uint8_t)) (0x6EA90350))
#define getMissileStateAddr ((statenum_t (__far *)(uint8_t)) (0x6EA903F4))
#define getDeathStateAddr ((statenum_t (__far *)(uint8_t)) (0x6EA904A8))
#define getPainStateAddr ((statenum_t (__far *)(uint8_t)) (0x6EA90586))
#define getSpawnHealthAddr ((int16_t (__far *)(uint8_t)) (0x6EA9063C))
// (who needs linkers)

Of course, most of my variables are already typed #define memory locations too. So even more complicated functions accessing outside variables can potentially be moved around (it will really depend on the compiled code though). In theory I can also start to mix and match openwatcom generated functions and those generated by other compilers, and pick and choose based on size or speed.

Anyway. EMS visplanes are the only other thing I really want to get done for a release, but if I can't figure it out by Friday I will probably just do the release anyway, then work on it alongside some other pre-asm refactorings and optimizations I want to do before a smaller second release that will follow.

EDIT: visplanes are working... my function that unpaged EMS visplanes back to their base state was broken. I'm still not sure what's wrong with the original code, but I did things in a different way and now it runs perfectly up to 125 visplanes (at which point a dramatic crash occurs). I'll package up a release tomorrow, probably.
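The "store it in a correctly typed variable first" workaround can be sketched in portable C like this. It is only an analogy: there are no __far pointers or fixed EMS addresses here, the address comes from a pretend loader (load_get_damage, a name invented for this example), and the damage values in the switch are made up rather than taken from info.c. Converting between integers and function pointers is implementation-defined, which is fine for a sketch like this on common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for one of the switch-based getters; the values are invented. */
uint8_t get_damage_impl(uint8_t type) {
    switch (type) {
        case 1:  return 3;
        case 2:  return 8;
        default: return 0;
    }
}

/* The function pointer type the raw address has to be cast to. */
typedef uint8_t (*damage_fn_t)(uint8_t);

/* Pretend loader: returns the address where the code "was loaded".
   In the setup described above this would be a fixed address in an EMS region. */
uintptr_t load_get_damage(void) {
    return (uintptr_t)&get_damage_impl;
}

/* Store the address in a correctly typed variable first, then call through it. */
uint8_t get_damage(uint8_t type) {
    damage_fn_t fn = (damage_fn_t)load_get_damage();
    return fn(type);
}
```

Call sites keep the same shape as before, e.g. something like damage = ((P_Random()%8)+1) * get_damage(type).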
-
I've found a half dozen bugs or so thus far in my EMS visplanes implementation, and things are much more stable now. Basically things used to go crazy once I hit 50 visplanes and went into my 3rd physical page of visplanes. I fixed a few off by one bugs and bad arguments to my memory management functions, and now things are mostly just going bad at 75 visplanes, which is when things actually get paged in and out. But that also isolates the bugs to that region of code. I also fixed some bugs related to sprites and visplanes overlapping in EMS page regions, found another spot where I could save an EMS page, and came up with a handful of other small optimization ideas to implement later.

As for the drawsegs issue, I think it's possible that there is some sort of leak or miscalculation causing the number of drawsegs to increase, but it's not causing any visual artifacting. It still crashes if I keep drawsegs at 32 instead of 64 though. I'll revisit it and compare drawsegs per rendered frame in a timedemo logfile against pcdoomv2. (Maybe it's even a result of me having dropped some fields from 32 to 16 bit precision...?)

I'm going to keep working at the visplanes, as it does seem to be almost perfect. I even built a little visplane counter (piggybacking on HUD code) for debugging. It made me realize how easy it'd be to dupe this HUD ticker code into an FPS or other diagnostic counter at some point, as a debug build option. I think I'll end up taking another week to get a clean build ready after all. I actually had EMS visplanes working right back in the 0.10/0.11 releases and didn't realize how hard it would be to reintroduce the feature to the current memory model.
-
Aha, I see. Well, in that case I will have to look at the algorithms and figure out why mine is using more segs in the same spots, because mine definitely overflowed. Maybe I introduced a bug at some point... maybe it's related to that extra memmove I have in there, which is supposed to be an old Lee Killough optimization.
-
Oh sorry, I took a simple quick video to show how I triggered it in FastDOOM, see attached. EDIT: Sorry, I realize it's an older version of FDOOM I have on this disk image from 2022 (0.8.15). Let me try with the newest version. EDIT 2: It actually doesn't happen in the same way after all on FDOOM 0.9.9c. I had checked the code on github yesterday when I saw that; the limits were the same, and I saw that my code was hitting the limit. I know there are algorithm improvements that combine visplanes. Is there some drawseg combining code too that I'm perhaps missing? Screen Recording 2024-05-13 at 15.33.57.mov.zip
-
Okay - I think I found the problem. It took me way too long to track this down, but the original solidsegs limit of 32 is wayyy too low for commercial doom 2! Map 15 has some (inbounds!) areas where it's very easy to hit this limit. Basically the memmove in R_ClipSolidWallSegment ended up going out of bounds once this limit got broken, and in my case it was nulling out ds_p and causing garbage to overwrite the OS in memory.

I'm kind of confused... is this a well known thing? I checked the fastdoom and pcdoomv2 sources and both have the limit of 32, and I was able to reproduce this problem pretty easily. Unlike with visplanes, this error doesn't even get caught by code, so it's just a nasty crash. My copy of commercial doom 2 doesn't run into the same problem, so maybe the limit was actually raised there? In my case I upped it to 64 and the problem has gone away for now. Some out of bounds areas reached via IDCLIP cheats can still trigger it... in theory SCREENWIDTH / 2 + 1, or 161, is ideal and would prevent the issue from ever happening, but I don't want to throw so much memory at it if it's not causing trouble, so I'll leave it at 64 for now.

While debugging this I also realized something else, which is that the only patch/composite texture used during R_DrawPlanes is the sky texture, and it's not used in other render steps. So I can have the sky texture paged into the "texture cache" area, since it shouldn't be used during other rendering steps.... This will also prevent it from eating up pages of regular texture cache memory, which should improve overall rendering speed by decreasing paging. It's sort of annoying because the sky textures are about 34k in size, but I will have to throw 48k (3 pages) of EMS memory at them. Maybe I can come up with a clever way to fit other useful things in the other 14k of that memory, not sure yet.
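For reference, here is a minimal C sketch of the kind of guard that would turn this overflow into a detectable failure instead of a memory stomp. It is not the vanilla or RealDOOM code: the solidsegs_t wrapper and the function names are invented, and only the insert-with-memmove step is modelled, not the clipping logic in R_ClipSolidWallSegment.

```c
#include <assert.h>
#include <string.h>

#define SCREENWIDTH   320
#define MAXSEGS       64                       /* the raised limit mentioned above */
#define MAXSEGS_WORST (SCREENWIDTH / 2 + 1)    /* 161, the theoretical worst case */

typedef struct { int first; int last; } cliprange_t;

/* Hypothetical wrapper that keeps the count next to the array. */
typedef struct {
    cliprange_t segs[MAXSEGS];
    int count;
} solidsegs_t;

/* Insert a range at index pos, shifting later entries up by one.
   Returns 1 on success, 0 if the table is full or pos is invalid,
   so the memmove can never run past the end of the array. */
int solidsegs_insert(solidsegs_t *s, int pos, int first, int last) {
    if (pos < 0 || pos > s->count) return 0;
    if (s->count >= MAXSEGS) return 0;
    memmove(&s->segs[pos + 1], &s->segs[pos],
            (size_t)(s->count - pos) * sizeof(cliprange_t));
    s->segs[pos].first = first;
    s->segs[pos].last = last;
    s->count++;
    return 1;
}

/* Demo helper: try to append `attempts` ranges, return how many were accepted. */
int solidsegs_fill_count(int attempts) {
    solidsegs_t s;
    int accepted = 0;
    int i;
    s.count = 0;
    for (i = 0; i < attempts; i++) {
        accepted += solidsegs_insert(&s, s.count, i * 2, i * 2);
    }
    return accepted;
}
```

The check costs a compare per insert, so whether it belongs in a release build or only a debug build is a judgment call.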
-
I've made a lot of memory improvements over the past few days while trying to get higher visplane limits working:

- lowering various sprite fields (the offsets and widths) to 1 byte instead of 2 (around 4 KB saved)
- replacing vissprite prev/next pointers with just a next index (about 1 KB saved)
- changing far pointers in colormaps and zlights to segment offsets (about 3 KB saved)
- combining some similar functions (like some plat functions and thinker accessory functions)
- moving a few fields like fuzz offsets and par/cpar time arrays to upper memory

With that I was able to make room for another 16 KB page of visplanes in the EMS memory region, in addition to the original 2 pages which were already present. Each page contains 25 visplanes, for 75 total without paging, and page-outs support up to 125 visplanes.

Unfortunately, something is crashing and giving me trouble when lots of visplanes are in use, and after two days I've just realized it's not the visplane implementation at all, but rather some other mysterious cause that pops up when the rendered scene becomes very complex (DOOM2 map 15 with over 60 visplanes seems to cause it). It seems like there are errant writes to VRAM (I think), which don't always cause a crash but sometimes just kill the video output, making this hard to trap. I have some ideas, and hopefully will be able to figure it out in the next couple of days.

Once higher visplane limits are supported, there are only a couple of bugs left to fix - the "Read This" menu is bugged and of course there are some timedemo desyncs to track down - but once those are fixed I should be able to cut a release, hopefully in the next week or two.

After that, ASM optimizations are coming. While I'm a little comfortable with x86 asm itself, I still have a lot to learn about the tooling and how to actually compile and integrate these things. I will probably make a first attempt with improving the FixedMul/FixedDiv functions a bit, which should be easy. However, I have a lot of much more complicated things planned, such as modifying the compiler to hack the DS value, and dynamically loading code into EMS regions then jumping to that code (so I can use CS to access strategically placed data like colormaps in key render functions).
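The visplane page arithmetic works out like the small C sketch below. The constants come from the numbers in this post (25 visplanes per 16 KB page, 3 pages resident, 125 maximum); the function names are made up, and the actual EMS page mapping calls are left out.

```c
#include <assert.h>

#define VISPLANES_PER_PAGE 25    /* visplanes stored in one 16 KB EMS page */
#define RESIDENT_PAGES     3     /* pages that are always mapped in */
#define MAX_VISPLANES      125   /* 5 logical pages in total */

/* Which logical page a visplane index lives in. */
int visplane_page(int index) {
    assert(index >= 0 && index < MAX_VISPLANES);
    return index / VISPLANES_PER_PAGE;
}

/* Position of the visplane inside its page. */
int visplane_slot(int index) {
    assert(index >= 0 && index < MAX_VISPLANES);
    return index % VISPLANES_PER_PAGE;
}

/* Nonzero when touching this visplane means paging something in or out. */
int visplane_needs_paging(int index) {
    return visplane_page(index) >= RESIDENT_PAGES;
}
```

So indices 0-74 never need paging, and 75-124 land on the two extra logical pages that have to be swapped in and out.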