GooberMan Posted October 11, 2020 (edited)

https://github.com/GooberMan/rum-and-raisin-doom

Rum and Raisin Doom is a limit-removing port focusing on vanilla accuracy by default and performance on modern systems. Features include:

- Startup screen that looks like ye olde DOS Doom's text startup
- Complete integrated launcher, allowing free IWAD downloads, idgames browsing, and more
- Limit removing when using the -removelimits command line parameter
- A multithreaded renderer
- 64-bit software rendering
- Dynamic resolution scaling
- Frame interpolation
- Widescreen assets support
- Support for unsigned values in map formats, as well as DeepBSP and ZDoom nodes
- UMAPINFO and DMAPINFO support
- Flats and wall textures can be used anywhere
- Dashboard functionality implemented with Dear ImGui
- Full support for Raspberry Pi on Debian-based operating systems

Latest pre-release: 0.3.1-pre.8 for Windows
https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.3.1-pre.8

- 1:1 square pixels!
- FOV slider!
- Reduced memory consumption because I killed colormap-expanded textures!
- Sigil 2 maps sit in the correct episode number!

Latest release: 0.3.0
https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.3.0

This one's a big release, because it comes with official support. Got a bug? Let me know!

New Features:

- Complete launcher frontend is now brought up if you launch without an -iwad parameter. This is designed for a good first-time-user experience. If no IWADs are detected, you can download some. Want some new maps? There's a fully integrated idgames browser. More features and polish to come in the future.
- Dynamic Resolution Scaling is now included in an initial form. It scales both horizontally and vertically. This is currently in a preview stage, as I need to work on a few other things before it works as intended.
- DMAPINFO and UMAPINFO are supported. Tested on Unity IWADs, Sigil, and Knee-Deep in Knee-Deep in ZDoom.
This involved a complete reworking of how maps are referred to, with vanilla-exact matching of functionality retained. Further map info format support will come down to writing the handler. It also required rewriting the intermission screens, so that's waiting for some kind of format to exploit it.

- Multiple software backbuffers. DOS Doom had three backbuffers, giving noclip/HoM effects a very noticeable forward velocity. You can now replicate that effect by increasing the number of software backbuffers.
- Stats overlay. Currently very spartan, but only because the default layout is like that. The system implements far more functionality, which I will expose to users in the future via an options window panel.
- Sound and detail shortcuts, as well as Options from the main menu, now bring you to the relevant section of the dashboard.
- Backbuffer resolution can now match window sizes; and the drop-down box for selecting backbuffer sizes is categorised as well as having named shortcuts (such as Vanilla, Crispy, and Unity style resolutions).
- Every platform now saves data to the user's home folder. You can make a portable install by placing an empty file called rnr_portable.dat next to the executable.
- Additional lighting slider added to the "View" tab in the options window.

Bug Fixes:

- 0-or-less column rendering bug in balance loading no longer occurs
- Rendering interpolates while the menu is active
- Distance lighting is far more accurate for resolution heights that don't cleanly divide by 200

Previous contents of this post:

Spoiler

Latest release: 0.2.3
https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.2.3

Features:

- New GL-driven backbuffer decompression, which converts paletted rendering to 32-bit colour on the GPU instead of on the CPU. Speed increase on all tested hardware.
- Thread balancer now functions as intended; more than 4 threads balances correctly.
As such, the number of threads has been promoted to a fully supported option, savable to your configuration file.

- Vsync options added.
- -removelimits now lets frame 966 of Dehacked load in correctly.
- BFG Edition IWADs are now detected by looking for M_CHG, since Unity IWADs also include DMENUPIC.

Bug fixes:

- Loading a save game doesn't munge up thing Y and Z values.

Latest release: 0.2.2
https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.2.2

Interim release to push out some features:

- 4K resolutions
- Widescreen assets support

Latest release: 0.2.1
https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.2.1

Still the same deal as the last release: it's semi-supported. I want limit-removing maps that break this port, so I can work out why and tighten it up. This release has some null pointer bug fixes, and fixes some oddities I encountered when trying to -merge Alien Vendetta instead of -file. The big one y'all will be interested in, though: I decided it was well past time I implemented frame interpolation. Now it hits whatever your video card can handle. As it's borderless fullscreen on Windows, it'll be limited to your desktop refresh rate.

Latest release: 0.2.0
https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.2.0

This release is semi-supported. Basically, I want anyone that tries this port and finds a limit-removing map that crashes it to link it here so I can see what the problem is. Keep in mind that this is limit-removing only. Boom or higher will not work.

https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.1.1
https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.1.0

(I'll go through and put a better release front up here when this port deserves it. Until then, enjoy the original announcement.)

Haha software render go BRRRRR!
So I heard you like performance.

Currently tested on the following operating systems:

- Windows 10
- Linux Mint 19 Cinnamon
- Ubuntu 20 MATE
- Raspbian/Raspberry Pi OS
- OSX (thanks to Linguica)

And with the following development hardware:

- i7-6700HQ, 16GiB DDR4 RAM (main dev machine, laptop)
- i7-3930K, 16GiB DDR3 RAM
- Raspberry Pi 4B, Cortex-A72 (ARM v8), 8GiB RAM

This is source-only for now, with the exact same build steps as Chocolate Doom. I have no time to provide support for anyone (so don't ask), and the code certainly isn't release quality yet. Until that changes, you can consider this project to be academic in nature, with real and testable results.

What is Rum & Raisin Doom?

R&R Doom is a vanilla-compatible source port, using Chocolate Doom as a base. It focuses exclusively on optimising the software renderer for modern hardware. And since it's pointless to attempt these optimisations without high resolution support, all work is currently being tested with a 2560x1600 backbuffer. I'm otherwise taking a preservationist approach, sticking as close to the original software renderer as I can (I even scale the HUD/end text/flats/interpics etc. correctly, and the screen melt functionality is 100% proportional to the original code). But I will be fixing bugs like wobbly walls and potato-quality flats at the near plane.

This entire exercise is just a way for me to relax. I'm working at Housemarque these days on Returnal, and since I haven't been able to switch off from programming, I decided to sit down and finally do this. I've preferred to spread knowledge over the years rather than write code, especially since contracts with previous employers have restricted what I own outside of the office. So now that I can write code, this will also help out those people who look at my resume (AAA engine programmer, specialising in low-level optimisations and multithreading) and demand code anyway whenever I speak.

Progress is constantly being pushed to Github...
Which means it can occasionally be in a broken state, like that time everything went all acid-trip melty. Honestly, breaking the Doom renderer is fun. I've got many screenshots and videos saved locally.

I'm also writing articles explaining why things are done a certain way and what benefits they give, whenever I have results. The wiki on the Github repository will be updated whenever I feel like it.

What hardware shouldn't you run this on?

Yeah, don't expect this to make Doom run faster on VGA hardware. Most of what I'm doing will outright run worse on it. This is explicitly targeting modern systems.

What's already being done better?

- Transposed backbuffer. It's certainly not a new idea, but it's a very obvious first step. It also immediately gives wins... and makes everything else planned possible.
- Pre-lit textures and flats. Reads are bad. Indirect reads are worse. We end up using 33 times as much memory, but whatever, it shaves some time off and reduces code complexity, again making some things possible that previously weren't.

What's currently being worked on?

SIMD wall rendering. Yep. If you think things through, there are some very solid optimisations you can make with SIMD. Short story though: you can't have just one column render function and call it optimised. For any given 16-byte output block, you will need to do anywhere from 0 to 16 unique texture data reads. You know what's great about reading 0-15 texels from the source texture? They're in the same 16-byte location in your texture. So to account for wrapping, you only need to do two SIMD reads for any given output block at most. Shuffle the bytes the right way, write. Boom.

Currently, I've got the 0-1 texel read function up and running (with one very annoying visual bug that I'm so very close to solving). Want some performance analysis? Here's a scene where a good chunk of the screen calls that particular column function, versus one that doesn't.
So that's 5-10% off the frame on average already (and remember, this is after my transposed backbuffer optimisations; a scene like this will look much, much better compared to the stock renderer). There's plenty of room to improve on this, actually; the target for just this one function is at least another half a millisecond off the time taken. My algorithm for inflating a value from 0-15 into a 128-bit mask is rubbish, as it turns out. I'll have to go find some binary gods out there that used to do this stuff with their eyes closed back in the day.

And I'm not even finished yet. This is just my first attempt to get an algorithm running. I already know theoretically better ways to do it. The next step, however, will be to fill out the rest of those functions so that I have a complete working implementation that I can improve on. But I might take a break from SIMD first to do something else. I've done plenty of SIMD over the years but barely touched integer SIMD, so I need a little break before finishing it.

What's planned?

- Multithreaded rendering. Yep. I have a solid plan of attack here. More information when I try it and get some results.
- SIMD flat rendering. Interestingly, transposing the backbuffer resulted in roughly the same performance for flats as the non-transposed backbuffer. I will rewrite the flat renderer to render by column first; and once that's up and running, make it SIMD.

Thanks are in order for people that have been commenting/testing/etc. (in alphabetical order): AlexMax, Altazimuth, Edward850, fraggle, Linguica, Quasar, sponge. If I accidentally forgot you, you know how to harass me directly and make me update this post.

DISCLAIMER: Do you suffer from the following symptoms?

- You think software renderers are pointless
- You think it will have a limited audience
- You don't think anything I'm doing is technically possible or worthwhile

Then by all means, direct your concerns to the correct part of the internet. This is just a way for me to relax.
I really don't care if anyone's honour gets offended / it invalidates lies pushed around for years / etc. This work is all open and explained, so it can only benefit anyone interested in software rendering. Take your negativity elsewhere.

Edited January 27 by GooberMan: 0.3.1-pre.8 release
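[Editor's note: the "at most two SIMD reads" claim in the wall-rendering section above can be sanity-checked with a few lines of C. This is a hedged sketch, not code from the port; it just counts how many 16-byte-aligned blocks a run of up to 16 consecutive texels can span.]

```c
#include <assert.h>

/* Number of 16-byte-aligned blocks that a read of `len` consecutive
   bytes starting at offset `start` touches. For len <= 16 this is
   always 1 or 2, which is why handling texture wrap costs at most one
   extra SIMD load per output block. */
static int aligned_blocks_touched(unsigned start, unsigned len)
{
    unsigned first_block = start / 16;
    unsigned last_block  = (start + len - 1) / 16;
    return (int)(last_block - first_block + 1);
}
```

A run that starts on a 16-byte boundary needs one load; any other start needs at most two, regardless of where within the block it begins.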
boris Posted October 11, 2020

Now that sounds like @Wadmodder RiderPùdu's dream.
GooberMan Posted October 11, 2020

Just did a run against a nearly-stock Chocco I have here locally that adds high res support and not much else. And wow. I've been so focused on incremental improvements that I forgot how far it's already come. First post updated with the profile in question.
esselfortium Posted October 11, 2020

This is amazing! Looking forward to seeing how much further you can push it.
GooberMan Posted October 11, 2020

You think that's amazing? I just compared the high-res Chocco running on my i7-6700HQ to my optimisations running on the Raspberry Pi at the same resolution.

aaaaahahahahahahaha

An ARM with a maximum clockrate of 1.5GHz running my optimisations performs basically as well as an i7 running an uprezzed Chocco.

*ahem* So. Uh. That red line is gonna go further down by the time I'm done.
Altazimuth Posted October 11, 2020

Oh man, it's finally here. Time to fail to bring these improvements to EE. This work has continually astounded me and I look forward to further developments.
esselfortium Posted October 11, 2020

9 minutes ago, Altazimuth said:

Oh man it's finally here. Time to fail to bring these improvements to EE. This work has continually astounded me and I look forward to further developments.

Heck yes. It'll be great if you can get these optimizations into Eternity.
GooberMan Posted October 12, 2020

At a minimum, the backbuffer transpose should be applicable to every port with a software renderer. I am curious to see it profiled against ports that try to render multiple columns at a time, but my suspicion is that this will perform better because I'm not branching all over the place to handle multiple columns, and it stays within one cache line for writes far longer than other methods. This really should have been done and made standard years ago IMO.
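[Editor's note: the cache-line argument for the transpose can be illustrated with a small sketch. This is not the port's actual code; the names and the 2560x1600 size are taken from the opening post, and the indexing functions are illustrative.]

```c
#include <assert.h>
#include <stddef.h>

enum { WIDTH = 2560, HEIGHT = 1600 };  /* the test backbuffer from the OP */

/* Row-major (vanilla) layout: consecutive pixels of a wall column are
   WIDTH bytes apart, so at high resolutions every pixel of a tall
   column lands on a different cache line. */
static size_t rowmajor_index(int x, int y)   { return (size_t)y * WIDTH + x; }

/* Column-major (transposed) layout: consecutive pixels of the same
   column are adjacent, so a column write streams through memory and
   stays within one cache line far longer. */
static size_t transposed_index(int x, int y) { return (size_t)x * HEIGHT + y; }
```

The stride between vertically adjacent pixels drops from 2560 bytes to 1, which is the whole win: Doom's inner loops draw columns, so the transposed layout matches the write pattern to the memory layout.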
Boomslang Posted October 12, 2020

Sorry, I am really fucking dumb here... so... it's another modern source port that, uh, tries to be as close to vanilla as possible? So what does it do better than the other source ports like Chocolate Doom, Crispy Doom (the one I mostly use), and PrBoom?
Altazimuth Posted October 12, 2020

It's using Choco more as a proof that it'll work with the classic renderer, to my knowledge. Many of these optimisations should be applicable to just about any traditional software renderer, which would greatly improve the rendering performance of said ports.
Blzut3 Posted October 12, 2020

Definitely will be following the progress on this project. I've thought about similar ideas (particularly transposing the frame buffer) myself, but usually figured these things would be a wash at best for various reasons. Prerendering the light levels for textures in order to make the code SIMD friendly is a pretty cool idea I've never even thought about. Although I'd be surprised if you have a significant problem with vanilla compatible data sets, I'm not sure if that would be a reasonable default for, say, GZDoom's or EE's software renderer. Maybe I'm overestimating, but I would suspect that some of the larger mods could easily use several GB of memory with this technique. It's pretty easy to push GZDoom over a GB with 4x texture resizing. It would make a lot of sense as an opt-in feature for systems with enough RAM, since if the hardware is there then the performance benefit could be huge.
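[Editor's note: the memory concern is easy to put rough numbers on. A hedged back-of-envelope sketch follows; the variant count comes from the "33 times as much memory" figure in the opening post, while the example sizes are made up for illustration.]

```c
#include <assert.h>
#include <stdint.h>

#define LIGHT_VARIANTS 33u  /* per the OP: pre-lit copies across light levels */

/* Total bytes needed if every texture is stored once per light variant. */
static uint64_t prelit_bytes(uint64_t texture_bytes)
{
    return texture_bytes * LIGHT_VARIANTS;
}
```

A vanilla-scale 16 MiB of texture data balloons to roughly 528 MiB, which modern machines can absorb; a mod shipping 1 GiB of high-resolution assets would need 33 GiB, which supports the "opt-in for systems with enough RAM" suggestion.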
Mayomancer Posted October 12, 2020

High performance on a vanilla source port using software rendering? You have my attention! So excited about this.
M_W Posted October 12, 2020

The most fascinating part of this project to me is explaining why something like Doom can still have performance problems on more modern low-end hardware that, in my uneducated brain, should still be far more capable than anything from the DOS era. I never considered how many optimizations were made for the limited hardware at the time that would end up being a problem on modern hardware at higher resolutions. This is amazing!

7 hours ago, GooberMan said:

DISCLAIMER: Do you suffer from the following symptoms?

- You think software renderers are pointless
- You think it will have a limited audience
- You don't think anything I'm doing is technically possible or worthwhile

Then by all means, direct your concerns to the correct part of the internet...

Why would you say such a thing? Surely no one here has a track record of behaving like that :^)
Turin Turambar Posted October 12, 2020

Interested in your attempt at a multithreaded renderer. I will follow your progress.
GooberMan Posted October 12, 2020

4 hours ago, Blzut3 said:

Prerendering the light levels for textures in order to make the code SIMD friendly is a pretty cool idea I've never even thought about. Although I'd be surprised if you have a significant problem with vanilla compatible data sets, I'm not sure if that would be a reasonable default for, say, GZDoom's or EE's software renderer.

Which reminds me, how maintained is GZ's software renderer these days? I looked at the code the other day but not the history. Things like PNGs would definitely need special consideration to even run properly in this code path.

(I've also stated how I'd do a hardware renderer previously on these forums. I'll get back to that at some point, but now that I'm learning the software renderer inside out, this will honestly improve the methods I was going to employ.)

I have had to bump the default page size to 128MiB thanks to REKKR. I'll rewrite the allocator one day to be a bit more modern, specifically grabbing new virtual pages when needed. Likely a solved problem in every other source port, but as noted above, Chocco is so close to vanilla.
Blzut3 Posted October 12, 2020

1 hour ago, GooberMan said:

Which reminds me, how maintained is GZ's software renderer these days? I looked at the code the other day but not the history. Things like PNGs would definitely need special consideration to even run properly in this code path.

I don't have enough time in the day to stay active with GZDoom, so I don't know what the official current status is. dpJudas did a lot of refactoring over the past few years, but it looks like he recently stepped away from the project. As you may know, there are now two software renderers, since the softpoly backend has a "hardware accelerated" mode which is full 3D. The classic renderer is still largely the same, just reorganized for multithreading and whatnot.

Just realized that my statement about pre-lighting textures not being really possible with more advanced mods is more true than I initially thought, since with colored lighting and fog there are potentially tens or hundreds more texture variations required depending on the map. (I was previously just thinking about the sheer number of textures/sprites and higher resolution assets in larger mods.) Of course, one could treat those as exceptions and fall back to a slow path. In any case, not something you'd need to worry about for this project, since the scope is limited.
GooberMan Posted October 12, 2020

Fog is something I'll have to deal with when I get to making Hexen run again. Let's see what I come up with when I get to it.
GooberMan Posted October 12, 2020

So, the advantage of working at a company with a strong demoscene culture/history: one of the graphics guys programs Atari ST demos in his spare time. He suggested just using a lookup table for a SIMD mask I was trying to calculate at runtime. Given that I've been trying to avoid loads, I didn't think of it. Or, as I've been putting it: "It's so obvious, it's unintuitive." Because the results speak for themselves.

Before:

And after:

(It looks much clearer side-by-side; open in different tabs and switch back and forth.)
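[Editor's note: a sketch of the lookup-table trick described here. The names and table shape are illustrative, not the port's; the idea is simply that all seventeen possible 16-byte masks are precomputed once, so mask "computation" at render time becomes a single indexed load.]

```c
#include <assert.h>
#include <stdint.h>

static uint8_t mask_lut[17][16];

/* Fill the table once at startup: entry n has its first n bytes set to
   0xFF and the rest to 0x00, ready to load into a 128-bit register. */
static void init_mask_lut(void)
{
    for (int n = 0; n <= 16; ++n)
        for (int i = 0; i < 16; ++i)
            mask_lut[n][i] = (i < n) ? 0xFF : 0x00;
}

/* Runtime inflation of a count into a mask is now just a table load,
   replacing the bit-twiddling the post says was attempted at runtime. */
static const uint8_t *inflate_mask(int n)
{
    return mask_lut[n];
}
```

The table is 272 bytes, small enough to live in L1 cache permanently, which is why a load beats recomputing the mask even for code that is otherwise trying to avoid loads.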
Redneckerz Posted October 12, 2020

I actually knew of this before, but it was not exactly clear what you planned to do with it, @GooberMan. Now that the Lost Soul has escaped its prison, this is fucking awesome. Turbocharged software rendering with multicore support? Where can I sign up? This is as beastly as FastDoom, for totally different reasons. And it's being done at Housemarque with a demoscene background (which is all the more amazing): this has serious potential. Any kind of demoscene magic/influence applied to Doom should be cherished with a shrine of dedication, for the scene is where the real coding comes to be.
GooberMan Posted October 12, 2020

Oh, to be clear, I'm Australian and haven't written a demo in my life. Working at Remedy and Housemarque though, I've been surrounded by demosceners. Getting arcane knowledge about bit twiddling is just a matter of finding the right person to ask.
Blzut3 Posted October 13, 2020

14 hours ago, GooberMan said:

Fog is something I'll have to deal with when I get to making Hexen run again. Let's see what I come up with when I get to it.

Unless I'm misremembering, the fog in Hexen is just a full level colormap swap (to fogmap), so if nothing else you could just re-render the textures on map change. But even if you rendered the textures once, that's still only going to double what you have now. Compare Boom colormaps or ZDoom's colored lighting/fog, where the growth is basically unbounded depending on the mod (and dynamic). The more interesting question for Heretic/Hexen will be if there's anything that can be done about TINTTAB.
dpJudas Posted October 13, 2020

About the backbuffer transpose, I wonder how much that would affect the bottleneck in the GZDoom software renderer. Right now the drawers don't really seem to be the performance bottleneck in GZD, at least not if you increase the resolution to 4K. Even though I went from a Haswell i7 CPU (4c/8t) to a Threadripper (32c/64t), the frame time stayed virtually the same. Right now it seems that drawer setup is what slows it down more than anything. If you're lucky you'll be less impacted by this in vanilla Doom, because there are fewer features there than what ZDoom supported.

For very complex scenes the BSP traversal and sprites become the main bottlenecks. You can multithread the BSP by splitting the frame buffer into multiple subsections of the scene, reducing the field of view for each thread. The sprite performance can be improved by not always calculating the top/bottom clipping lists from scratch.
GooberMan Posted October 13, 2020

4 hours ago, dpJudas said:

For very complex scenes the BSP traversal and sprites become the main bottlenecks. You can multithread the BSP by splitting the frame buffer into multiple subsections of the scene, reducing the field of view for each thread.

This is my exact plan, in fact. Well, in my experience with similar splitting of buffers, you need to pay attention to cache sizes on your system or else the L3 will trip over itself trying to propagate the buffers before it needs to. So I won't use a single buffer. There are extra advantages to not using a single buffer besides the complete avoidance of cache contention. I'll be doing threading next, actually; it's time to take that break from SIMD, so I'll have more information soonish on whether it actually works as I think it should.
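[Editor's note: the screen splitting dpJudas describes and GooberMan plans can be sketched as follows. The names and the thread count are hypothetical; the real design also narrows each thread's view angles to its columns and, per the post above, gives each slice its own buffer to avoid cache contention.]

```c
#include <assert.h>

enum { SCREENWIDTH = 2560, NUM_THREADS = 8 };

typedef struct { int first_column, last_column; } renderslice_t;

/* Contiguous, non-overlapping column range for one worker thread. Each
   thread then walks the BSP with its view clipped to this range only,
   so the threads never write to the same region of the frame. */
static renderslice_t slice_for_thread(int thread)
{
    renderslice_t s;
    s.first_column = thread * SCREENWIDTH / NUM_THREADS;
    s.last_column  = (thread + 1) * SCREENWIDTH / NUM_THREADS - 1;
    return s;
}
```

Because the slices partition the screen exactly, no locking is needed on the output: thread safety falls out of the geometry.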
GooberMan Posted October 13, 2020

Oh, and just to highlight that cache really is the problem on modern systems, here's performance against a non-transposed renderer at 2560x1600 on an ARM processor. Ignore the titlepic performance; I didn't patch the scaling code across to my clean Chocco build. But that graph is essentially the same as the original i7 graph I captured at lower resolutions. Notice how outright terribly ARM's cache performs on wall/sprite-heavy parts of the DEMO1 loop. (The capture is 700 frames from program start; it ends around where the barrel in front of the secret wall is being shot.)
GooberMan Posted October 20, 2020

So here's a sneak preview of something I'll be ready to talk about properly in a few days' time, screencapped from the Pi used in the above post.
GooberMan Posted October 21, 2020

And another sneak preview. Being Chocolate based (and testing that I don't break Vanilla every step of the way) means that I can just go ahead and load up Plutonia 2 to get a screenshot.
Eric Claus Posted October 22, 2020

I should stop being a wimp and figure out how to compile this on Windows if you need testers.

Edit: Bah, having trouble with Windows. I'll just set up a Linux environment and mess around with it when I am bored.
GooberMan Posted October 22, 2020

The CMake files are currently not up to date, so until I fix that in about 12 hours I can't compile it for my Raspberry Pi or my Linux box, and neither can you. So you might want to hold off a little there.