V1.1.0 - Python wrapper for wad archive data dump - fully works with compressed archive


Right - I have an end-to-end proof of concept working. YAY!...

 

I built a basic REST API using the Flask framework that exposes:

 

/app/count - count of files

/app/files/<int:page_size>/<int:page_num> - page page_num of files, with page_size entries per page

/app/file/<guid> - returns the file specified by GUID, but also returns the filename and offers to save under that name

 

So my very crude UI so far lists the first 25 (by default) entries from the filenames collection, and builds links behind the scenes that pass the GUID to the /app/file/ endpoint above. The back end uses this GUID to look up the filename and passes it back to the front end.
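Roughly, those endpoints look something like this in Flask (a simplified sketch - the collection and field names here are illustrative rather than exactly what's in the repo):

from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient('127.0.0.1', 27017)['wadarchive']

@app.route('/app/count')
def file_count():
    # total number of records in the filenames collection
    return jsonify({'count': db.filenames.count_documents({})})

@app.route('/app/files/<int:page_size>/<int:page_num>')
def list_files(page_size, page_num):
    # page page_num of records, page_size entries at a time
    cursor = db.filenames.find().skip(page_size * page_num).limit(page_size)
    return jsonify([{'guid': d['guid'], 'filename': d['filename']} for d in cursor])

@app.route('/app/file/<guid>')
def get_file(guid):
    # look up the real filename for this GUID; the actual download/save
    # response is described further down the thread
    doc = db.filenames.find_one({'guid': guid})
    return jsonify({'guid': guid, 'filename': doc['filename']})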

 

[Image: ui.png]

 

So, for the first link, we have an href value of

 

http://127.0.0.1:5000/app/file/0000e0b4993f0b7130fc3b58abf996bbb4acb287

 

where the GUID maps to the file we want.

 

Clicking this link (a normal click, NOT a right-click) will prompt to save with the correct filename:

 

[Image: save-as.png]

 

So the caveat:

 

The main 4GB archives need to be decompressed - HOWEVER, the actual WAD.gz or PK3.gz DOES NOT, because the back-end code does this for you (with the gzip library) before the file is sent for download/save.
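The decompress-before-send step is basically just the gzip module plus flask.send_file - something along these lines (a sketch; paths and names are illustrative):

import gzip
import io
from flask import send_file

def send_decompressed(gz_path, real_filename):
    # gunzip the stored WAD.gz/PK3.gz into memory...
    with gzip.open(gz_path, 'rb') as fh:
        payload = io.BytesIO(fh.read())
    payload.seek(0)
    # ...then hand it to the browser under its proper name, which is what
    # triggers the save-as prompt with the correct filename
    return send_file(payload, as_attachment=True, download_name=real_filename)

(download_name is the Flask 2.x parameter; older Flask versions call it attachment_filename.)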

 

I have already implemented the listing endpoint to accept paging flags (page num, page size, entry count), and the returned JSON includes these, so I just need to build the front-end JavaScript to capture them and pass them back to get the next page of results.
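The envelope itself is nothing fancy - something like this, so the front end can use the entry count to work out how many pages exist (field names are illustrative):

def paged_response(results, page_num, page_size, entry_count):
    # echo the paging flags back alongside the data so the front-end
    # JavaScript can build next/previous links from them
    return {
        'page_num': page_num,
        'page_size': page_size,
        'entry_count': entry_count,
        'results': results,
    }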

 

I can likely get at a whole bunch more data to display in the UI as it looks to all be tied together with the GUIDs. 

 

IMPORTANT: When you unzip the main archives, DO NOT change the folder name! This is used as a prefix to the GUIDs, so the code knows which folder to look in - this looks to be how the folder lookup was implemented before.

 

There's a bunch for me to do - not least making it configurable... Currently, the path to the downloads is hardcoded. I need to test on Linux, as well as pretty up the UI somewhat. I'm happy that my PoC works so far and I'll update this thread as I add more.

 

For those interested in hacking about with my code, it is here:

 

https://github.com/smeghammer/wad-archive/

 

 

 

 


OK so today I added the pagination code - a bit Mickey Mouse ATM, but it proves the point. It uses the /app/files/ REST endpoint I noted above, with some simple JavaScript driving it. The GitHub repo has been updated if you want to have a look.

 

Here are a couple of pages, showing different downloads on each, and the page number:

 

[Image: page1.png]

 

[Image: page5.png]

 

As you see, there are a LOT of pages. I will add links to jump by 10 or something, but really I need to add a filter/search option. I'm thinking about a string search against the idgames collection - this holds a bunch of ID Games text files amongst other metadata - and as they are held as a single text field, they should be regex-searchable, so a search box to filter the list would be possible... It is not clear whether ALL WADs/PK3s have an associated text file. Let's find out...
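In pymongo terms that filter is just a $regex query over the stored text - roughly this (the 'text' field name is illustrative):

import re

def search_idgames(db, term, page_size=20, page_num=0):
    # case-insensitive substring match over the ID Games text files
    query = {'text': {'$regex': re.escape(term), '$options': 'i'}}
    return list(db.idgames.find(query)
                          .skip(page_size * page_num)
                          .limit(page_size))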

 


So I added a filter for the filenames, fixed the pagination properly and tidied up the CSS a bit:

[Image: Screenshot_20221111_130637.png]

 

[Image: Screenshot_20221111_130759.png]

 

Note the filtering and the different file counts.

 

I also added a simple configuration file for various things:

settings = {
    # root of the unzipped wad-archive data dump - raw string so the
    # backslashes are not treated as escape sequences
    'archive_root_path' : r'E:\wad-archive dump\DATA',
    'records_per_page' : 20,
    # MongoDB connection details for the metadata
    'metadata_database_name' : 'wadarchive',
    'metadata_database_address' : '127.0.0.1',
    'metadata_database_port' : 27017
}

I need to look at testing on Ubuntu, and making sure that the archive_root_path is correctly processed as a Linux path and as a UNC network path...
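One way to keep that sane is to lean on pathlib rather than hand-rolled string joins - a minimal sketch:

from pathlib import Path

def archive_path(root, *parts):
    # Path() copes with 'E:\\wad-archive dump\\DATA', '/mnt/wadarchive' or a
    # mounted share alike, and joins the parts with the right separator
    return Path(root).joinpath(*parts)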

 

The next step is to tie in other metadata, such as ID Games and readme collections.

 


OK so today I altered the download links to be detail links.

 

Currently, the right-hand panel shows the filename as a heading and a download link for the selected file, as well as the contents of the readme, if one is present.

 

Here are a few examples:
 

[Image: readmes1.png]

 

[Image: readmes2.png]

 

[Image: readmes3.png]

 

The next step is to get at the /MAPS, /SCREENSHOTS and /GRAPHICS directories and - if they exist - find any .PNGs inside and return the binary data as image src attribute values - more flask.send_file() shenanigans...
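That will be an endpoint of roughly this shape (a hedged sketch - the directory layout and helper name are illustrative, not the final implementation):

from pathlib import Path
from flask import abort, send_file

IMAGE_DIRS = ('MAPS', 'SCREENSHOTS', 'GRAPHICS')

def send_record_image(archive_root, guid, image_dir, name):
    # called from a Flask route; serves one PNG for the selected record
    if image_dir not in IMAGE_DIRS:
        abort(404)
    png = Path(archive_root) / guid / image_dir / name
    if not png.is_file():
        abort(404)
    return send_file(png, mimetype='image/png')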


Update 14/11/22:

 

Rather than just showing all the images inline, I added a bit of JavaScript to render links to each image/map/graphic found for each record.

 

See OP for a video.


Big stylistic update - see OP - I am using my smeghammer site style.

 

Not finished, but you can see the difference a couple of classes make:


[Image: Screenshot_20221118_210719.png]

 

The section titles will only show if there is content, and I think the paginators for the images would be better above the images.

 

Also, I was looking into whether I need to unzip the big archives first...

 - I think it is actually unnecessary, as I should be able to use Python's core zipfile library.
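Something along these lines should be enough to peek inside a big archive without extracting it first (the archive name is illustrative):

import zipfile

with zipfile.ZipFile('wadarchive-000.zip') as zf:
    # the member keys are the original paths the files were zipped up from
    for key in zf.namelist()[:10]:
        print(key)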

 

Watch this space...

 

Also, I want to add overflow handling for the really long, unbroken filenames so they don't overlap the main content area.


ARGH! That was frustrating.

 

I'm trying to extract the files directly from the zipped archives, rather than unzipping each big archive first. The thing is, a zip file doesn't really have the concept of a directory hierarchy (thanks to @gez for pointing this out ages ago); the key for each file within the archive is the path string of the location from which the file was originally zipped up.

 

I spent ages constructing the correct path strings to get at the files - on Windows... I was building paths with backslashes, as you would expect on Windows, and didn't notice that the bloody keys in the archive were 'nix style... Flipping the '\' for '/' did the trick.
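In other words, the member keys want posix-style joins even on Windows - e.g. with PurePosixPath (the member name below is made up for illustration):

from pathlib import PurePosixPath

def member_key(*parts):
    # always joins with '/', regardless of the OS the code runs on
    return str(PurePosixPath(*parts))

member_key('0000e0b4993f0b7130fc3b58abf996bbb4acb287', 'file.WAD.gz')
# -> '0000e0b4993f0b7130fc3b58abf996bbb4acb287/file.WAD.gz'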

 

Should be able to extract everything from the zipped archives now.


OK so the map files can now be downloaded directly from the zipped archives - you don't need to uncompress the big archives first... It did involve double decompression, so it's a bit fiddly. Works, though.
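The double decompression boils down to reading the gzipped member out of the big zip and then gunzipping it in memory - roughly (names are illustrative):

import gzip
import zipfile

def extract_wad(big_archive_path, member_key):
    with zipfile.ZipFile(big_archive_path) as zf:
        gz_bytes = zf.read(member_key)    # pass 1: pull the member out of the zip
    return gzip.decompress(gz_bytes)      # pass 2: gunzip the .WAD.gz/.PK3.gz payload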

 

I need to do something similar for the image lists - they are still extracted from the unzipped archives - so there's a bit of a logical mismatch ATM. I also need to handle exceptions better - currently, errors don't bubble up if there is no archive present or if the file being opened is not a .wad or a .pk3. There's a bit of a hacky (but probably 'pythonic') test for .pk3 vs .wad that involves trying to open the file one way, catching the exception, and then trying the other...
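The try-it-and-see check could look something like this - a .pk3 is just a zip, so attempt that first and fall back to sniffing the WAD magic bytes (a sketch, not the repo's actual code):

import io
import zipfile

def classify(payload):
    # EAFP: try to open the bytes as a zip (.pk3); if that fails,
    # check for the IWAD/PWAD magic that starts every WAD file
    try:
        with zipfile.ZipFile(io.BytesIO(payload)):
            return 'pk3'
    except zipfile.BadZipFile:
        return 'wad' if payload[:4] in (b'IWAD', b'PWAD') else 'unknown'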


Added a loading spinner for the details page - many of the map details contain loads of images - the map images in particular are quite large - and loading sometimes takes several seconds. I needed a visual indicator to show that something is actually happening. Oh yeah, and I added your friendly neighbourhood BoH to say hello...:

 

 

11 hours ago, smeghammer said:

...particularly the map images are quite large...

I remember that every time I loaded those from "WadArchive", my web browser really suffered from the RAM usage while loading them...

Can it be optimized somehow?


Yeah. Ideally, what I want to do is only load the first image if an array is found, and load incrementally as you paginate. It may be a JavaScript optimisation thing, but more likely a Python/server-side code thing. It may also be a Mongo index thing, but the images are coming from the filesystem anyway.

 

It is also likely that the original code was having to process the zipped big archives and then the gzipped individual record. That would add a significant overhead too.


OK RC1 is released! I would appreciate feedback.

 

It's quite techy, so I would value questions around what further details I should provide to help out setting this shit up and getting it running...


@doomlover thanks! Much appreciated. 

 

Just to clarify, this isn't directly for upload to a web host - I don't have hosting with a MongoDB and Python 3 back end - it is designed to run on a local machine/network with the database, Python 3 and sufficient file storage.

 

You could of course use an AWS or Azure instance, but that's expensive... don't forget that hosting it online also needs the terabyte of storage for the data.

 

As I said, I can document this more extensively if anyone wants further info.

 

Also, as all my code is open source, please feel free to fork or otherwise copy from my GH repo if you want to put something online or modify it to suit you.

  • 2 months later...
  • 1 month later...

This is some phenomenal work you've put in. I'm impressed by how far you've come. I got here from looking into the WA dump and not being able to make sense of it, but it looks like you have! Bravo.


@Xenaero - thanks man! Have you managed to get it running? I'd love it if someone else uses this.

 

Unfortunately, due to the size of the data and the Python/Mongo back end, I have not hosted it on t'interweb, because of the cost of AWS etc.

 

Also, you probably need to download all 300 or so archives (about 1TB) so you don't end up with missing data - there is no obvious correlation between the lexical order of the JSON metadata and which physical archive each WAD sits in.

  • 2 weeks later...

I admit that I have not, mostly due to file storage concerns. However, it certainly would be helpful to have a workable archive of the website to sift through to find old MP levels we can no longer find anywhere else. Multiplayer retention for historical reasons has been on my mind recently, so I might come back to this thread when I get some more storage set up.

  • 11 months later...

Apologies for the self-bump...

 

Some minor fixes to this:

 - Added some notes on updating Ubuntu so that a UNC path can be used to mount the archive data (e.g. from a NAS or a NAS-attached USB disk)

 - Updated a SASS module for webfonts (the path was wrong), which means the main heading is styled properly again

 - Made sure the master branch is up to date

 

Tested with a UNC path, and it is running just fine on my Ubuntu machine with a big 4TB USB disk attached to one of my NAS drives.

It is slightly more long-winded than on Windows, but it's straightforward enough if you are comfortable with the terminal. Please see this article for how to enable UNC support on Ubuntu.

 

>>>  https://github.com/smeghammer/wad-archive  <<<

 

 

 

 

 

 
