V1.1.0 - Python wrapper for wad archive data dump - fully works with compressed archive


Right - I have an end-to-end proof of concept working. YAY!...

 

I built a basic REST API using the Flask framework that exposes:

 

/app/count - count of files

/app/files/<int:page_size>/<int:page_num> - page page_num of files, with page_size entries per page

/app/file/<guid> - returns the file specified by GUID, but also returns the filename and offers to save under that name

 

So my very crude UI so far lists the first 25 (by default) entries from the filenames collection, and builds links behind the scenes that pass the GUID to the /app/file/ endpoint above. The back end uses this GUID to look up the filename and passes it back to the front end.
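Roughly, those endpoints look something like this in Flask (a simplified sketch - the collection and field names here are illustrative rather than exactly what's in the repo):

from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient('127.0.0.1', 27017)['wadarchive']

@app.route('/app/count')
def file_count():
    # total number of records in the filenames collection
    return jsonify({'count': db.filenames.count_documents({})})

@app.route('/app/files/<int:page_size>/<int:page_num>')
def list_files(page_size, page_num):
    # page page_num of records, page_size entries at a time
    cursor = db.filenames.find().skip(page_size * page_num).limit(page_size)
    return jsonify([{'guid': d['guid'], 'filename': d['filename']} for d in cursor])

@app.route('/app/file/<guid>')
def get_file(guid):
    # look up the real filename for this GUID; the actual download/save
    # response is described further down the thread
    doc = db.filenames.find_one({'guid': guid})
    return jsonify({'guid': guid, 'filename': doc['filename']})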

 

[Image: ui.png]

 

So, for the first link, we have an href value of

 

http://127.0.0.1:5000/app/file/0000e0b4993f0b7130fc3b58abf996bbb4acb287

 

where the GUID maps to the file we want.

 

Clicking this link (a normal click, NOT a right-click) will prompt to save with the correct filename:

 

[Image: save-as.png]

 

So the caveat:

 

The main 4GB archives need to be decompressed - HOWEVER, the actual WAD.gz or PK3.gz DOES NOT, because the back-end code does this for you (with the gzip library) before the file is sent for download/save.
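The decompress-before-send step is basically just the gzip module plus flask.send_file - something along these lines (a sketch; paths and names are illustrative):

import gzip
import io
from flask import send_file

def send_decompressed(gz_path, real_filename):
    # gunzip the stored WAD.gz/PK3.gz into memory...
    with gzip.open(gz_path, 'rb') as fh:
        payload = io.BytesIO(fh.read())
    payload.seek(0)
    # ...then hand it to the browser under its proper name, which is what
    # triggers the save-as prompt with the correct filename
    return send_file(payload, as_attachment=True, download_name=real_filename)

(download_name is the Flask 2.x parameter; older Flask versions call it attachment_filename.)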

 

I have already implemented the listing endpoint to accept paging flags (page num, page size, entry count), and the returned JSON includes these, so I just need to build the front-end JavaScript to capture them and pass them back to get the next page of results.
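The envelope itself is nothing fancy - something like this, so the front end can use the entry count to work out how many pages exist (field names are illustrative):

def paged_response(results, page_num, page_size, entry_count):
    # echo the paging flags back alongside the data so the front-end
    # JavaScript can build next/previous links from them
    return {
        'page_num': page_num,
        'page_size': page_size,
        'entry_count': entry_count,
        'results': results,
    }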

 

I can likely get at a whole bunch more data to display in the UI as it looks to all be tied together with the GUIDs. 

 

IMPORTANT: When you unzip the main archives, DO NOT change the folder name! This is used as a prefix to the GUIDs, so the code knows which folder to look in - this looks to be how the folder lookup was implemented before.

 

There's a bunch for me to do - not least making it configurable... Currently, the path to the downloads is hardcoded. I need to test on Linux, as well as pretty up the UI somewhat. I'm happy that my PoC works so far and I'll update this thread as I add more.

 

For those interested in hacking about with my code, it is here:

 

https://github.com/smeghammer/wad-archive/

 

 

 

 


OK so today I added the pagination code - a bit Mickey Mouse ATM, but it proves the point. It uses the /app/files/ REST endpoint I noted above, with some simple JavaScript driving it. The GitHub repo has been updated if you want to have a look.

 

Here are a couple of pages, showing different downloads on each, and the page number:

 

[Image: page1.png]

 

[Image: page5.png]

 

As you see, there are a LOT of pages. I will add links to jump by 10 or something, but really I need to add a filter/search option. I'm thinking about a string search against the idgames collection - this holds a bunch of ID Games text files amongst other metadata - and as they are held as a single text field, they should be regex-searchable, so a search box to filter the list would be possible... It is not clear whether ALL WADs/PK3s have an associated text file. Let's find out...
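In pymongo terms that filter is just a $regex query over the stored text - roughly this (the 'text' field name is illustrative):

import re

def search_idgames(db, term, page_size=20, page_num=0):
    # case-insensitive substring match over the ID Games text files
    query = {'text': {'$regex': re.escape(term), '$options': 'i'}}
    return list(db.idgames.find(query)
                          .skip(page_size * page_num)
                          .limit(page_size))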

 


So I added a filter for the filenames, fixed the pagination properly and tidied up the CSS a bit:

[Image: Screenshot_20221111_130637.png]

 

[Image: Screenshot_20221111_130759.png]

 

Note the filtering and the different file counts.

 

I also added a simple configuration file for various things:

settings = {
    # root of the unzipped wad-archive data dump - raw string so the
    # backslashes are not treated as escape sequences
    'archive_root_path' : r'E:\wad-archive dump\DATA',
    'records_per_page' : 20,
    # MongoDB connection details for the metadata
    'metadata_database_name' : 'wadarchive',
    'metadata_database_address' : '127.0.0.1',
    'metadata_database_port' : 27017
}

I need to look at testing on Ubuntu, and making sure that the archive_root_path is correctly processed as a Linux path and as a UNC network path...
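One way to keep that sane is to lean on pathlib rather than hand-rolled string joins - a minimal sketch:

from pathlib import Path

def archive_path(root, *parts):
    # Path() copes with 'E:\\wad-archive dump\\DATA', '/mnt/wadarchive' or a
    # mounted share alike, and joins the parts with the right separator
    return Path(root).joinpath(*parts)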

 

The next step is to tie in other metadata, such as ID Games and readme collections.

 


OK so today I altered the download links to be detail links.

 

Currently, the right-hand panel shows the filename as a heading and a download link for the selected file, as well as the contents of the readme, if one is present.

 

Here are a few examples:
 

[Image: readmes1.png]

 

[Image: readmes2.png]

 

[Image: readmes3.png]

 

The next step is to get at the /MAPS, /SCREENSHOTS and /GRAPHICS directories and - if they exist - find any .PNGs inside and return the binary data as image src attribute values - more flask.send_file() shenanigans...
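That will be an endpoint of roughly this shape (a hedged sketch - the directory layout and helper name are illustrative, not the final implementation):

from pathlib import Path
from flask import abort, send_file

IMAGE_DIRS = ('MAPS', 'SCREENSHOTS', 'GRAPHICS')

def send_record_image(archive_root, guid, image_dir, name):
    # called from a Flask route; serves one PNG for the selected record
    if image_dir not in IMAGE_DIRS:
        abort(404)
    png = Path(archive_root) / guid / image_dir / name
    if not png.is_file():
        abort(404)
    return send_file(png, mimetype='image/png')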


Update 14/11/22:

 

Rather than just showing all the images inline, I added a bit of JavaScript to render links to each image/map/graphic found for each record.

 

See OP for a video.


Big stylistic update - see OP - I am using my smeghammer site style.

 

Not finished, but you can see the difference a couple of classes make:


[Image: Screenshot_20221118_210719.png]

 

The section titles will only show if there is content, and I think the paginators for the images would be better above the images.

 

Also, I was looking into whether I need to unzip the big archives first...

 - I think it is actually unnecessary, as I should be able to use Python's core zipfile library.
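Something along these lines should be enough to peek inside a big archive without extracting it first (the archive name is illustrative):

import zipfile

with zipfile.ZipFile('wadarchive-000.zip') as zf:
    # the member keys are the original paths the files were zipped up from
    for key in zf.namelist()[:10]:
        print(key)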

 

Watch this space...

 

Also, I want to add overflow handling for the really long, unbroken filenames so they don't overlap the main content area.


ARGH! That was frustrating.

 

I'm trying to extract the files directly from the zipped archives, rather than unzipping each big archive first. The thing is, a zip file doesn't really have the concept of a directory hierarchy (thanks to @gez for pointing this out ages ago); the key for each file within the archive is the path string of the location from which the file was originally zipped up.

 

I spent ages constructing the correct path strings to get at the files - on Windows... I was building paths with backslashes, as you would expect on Windows, and didn't notice that the bloody keys in the archive were 'nix style... Flipping the '\' for '/' did the trick.
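In other words, the member keys want posix-style joins even on Windows - e.g. with PurePosixPath (the member name below is made up for illustration):

from pathlib import PurePosixPath

def member_key(*parts):
    # always joins with '/', regardless of the OS the code runs on
    return str(PurePosixPath(*parts))

member_key('0000e0b4993f0b7130fc3b58abf996bbb4acb287', 'file.WAD.gz')
# -> '0000e0b4993f0b7130fc3b58abf996bbb4acb287/file.WAD.gz'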

 

Should be able to extract everything from the zipped archives now.


OK so the map files can now be downloaded directly from the zipped archives - you don't need to uncompress the big archives first... It did involve double decompression, so it's a bit fiddly. Works, though.
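The double decompression boils down to reading the gzipped member out of the big zip and then gunzipping it in memory - roughly (names are illustrative):

import gzip
import zipfile

def extract_wad(big_archive_path, member_key):
    with zipfile.ZipFile(big_archive_path) as zf:
        gz_bytes = zf.read(member_key)    # pass 1: pull the member out of the zip
    return gzip.decompress(gz_bytes)      # pass 2: gunzip the .WAD.gz/.PK3.gz payload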

 

I need to do something similar for the image lists - they are still extracted from the unzipped archives - so there's a bit of a logical mismatch ATM. I also need to handle exceptions better - currently, errors don't bubble up if there is no archive present or if the file being opened is not a .wad or a .pk3. There's a bit of a hacky (but probably 'pythonic') test for .pk3 vs .wad that involves trying to open the file one way, catching the exception, and then trying the other...
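The try-it-and-see check could look something like this - a .pk3 is just a zip, so attempt that first and fall back to sniffing the WAD magic bytes (a sketch, not the repo's actual code):

import io
import zipfile

def classify(payload):
    # EAFP: try to open the bytes as a zip (.pk3); if that fails,
    # check for the IWAD/PWAD magic that starts every WAD file
    try:
        with zipfile.ZipFile(io.BytesIO(payload)):
            return 'pk3'
    except zipfile.BadZipFile:
        return 'wad' if payload[:4] in (b'IWAD', b'PWAD') else 'unknown'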


Added a loading spinner for the details page - many of the map details contain loads of images - the map images in particular are quite large - and loading sometimes takes several seconds. I needed a visual indicator to show that something is actually happening. Oh yeah, and I added your friendly neighbourhood BoH to say hello...:

 

 

11 hours ago, smeghammer said:

...particularly the map images are quite large...

I remember that every time I loaded those from "WadArchive", my web browser really suffered from the RAM usage while loading them...

Can it be optimized somehow?


Yeah. Ideally, what I want to do is only load the first image if an array is found, and load incrementally as you paginate. It may be a JavaScript optimisation thing, but more likely a Python/server-side code thing. It may also be a Mongo index thing, but the images are coming from the filesystem anyway.

 

It is also likely that the original code was having to process the zipped big archives and then the gzipped individual record. That would add a significant overhead too.


OK RC1 is released! I would appreciate feedback.

 

It's quite techy, so I would value questions around what further details I should provide to help out setting this shit up and getting it running...


@doomlover thanks! Much appreciated. 

 

Just to clarify, this isn't directly for upload to a web host - I don't have hosting with a MongoDB and Python 3 back end - it is designed to run on a local machine/network with the database, Python 3 and sufficient file storage.

 

You could of course use an AWS or Azure instance, but that's expensive... don't forget that hosting it online also needs the terabyte of storage for the data.

 

As I said, I can document this more extensively if anyone wants further info.

 

Also, as all my code is open source, please feel free to fork or otherwise copy from my GH repo if you want to put something online or modify it to suit you.

  • 2 months later...
  • 1 month later...

This is some phenomenal work you've put in. I'm impressed by how far you've come. I got here from looking into the WA dump and not being able to make sense of it, but it looks like you have! Bravo.


@Xenaero - thanks man! Have you managed to get it running? I'd love it if someone else uses this.

 

Unfortunately, due to the size of the data and the Python/Mongo back end, I have not hosted it on t'interweb, because of the cost of AWS etc.

 

Also, you probably need to download all 300 or so archives (about 1TB) so you don't end up with missing data - there is no obvious correlation between the lexical order of the JSON metadata and which physical archive each WAD sits in.

  • 2 weeks later...

I admit that I have not, mostly due to file storage concerns. However, it certainly would be helpful to have a workable archive of the website to sift through to find old MP levels we can no longer find anywhere else. Multiplayer retention for historical reasons has been on my mind recently, so I might come back to this thread when I get some more storage set up.

  • 11 months later...

Apologies for the self-bump...

 

Some minor fixes to this:

 - Added some notes on updating Ubuntu so that a UNC path can be used to mount the archive data (e.g. from a NAS or a NAS-attached USB disk)

 - Updated a SASS module for webfonts (the path was wrong), which means the main heading is styled properly again

 - Made sure the master branch is up to date

 

Tested with a UNC path, and it is running just fine on my Ubuntu machine with a big 4TB USB disk attached to one of my NAS drives.

It is slightly more long-winded than on Windows, but it's straightforward enough if you are comfortable with the terminal. Please see this article for how to enable UNC support on Ubuntu.

 

>>>  https://github.com/smeghammer/wad-archive  <<<

 

 

 

 

 

 
