Doomkid Posted January 9, 2021

Just curious if anyone here knows how one could download one or all of these invaluable archives of WADs? There must be at least 100,000 individual files between all three. A very helpful user on YouTube has offered help in looking for the lost Dwango9. Additionally, I'd like to mirror as many of these files as possible on a personal drive. If anyone can offer help with this, I'd greatly appreciate it!
smeghammer Posted January 9, 2021

Website mirroring/spidering: if you go for the bull-in-a-china-shop approach, you can use (Win)HTTrack to spider the whole site. Not very selective, though, as you really would get everything (though it would be browsable offline)...

A better approach, though more work up front: as you probably know, there is a REST API for the Doomworld view on idgames - there may be APIs for the other sites too. The DW API is pretty straightforward - I built a browser for it a year or so ago. Certainly it would be possible to build a crawler against this API, with some logic built in to only store the contents of WAD directories, and to pull all the WADs it finds and store them either in a database or on the filesystem.

In the real world, I use Python to build web crawlers to analyse website hierarchies - it should be straightforward to build a crawler for the idgames archive API. It would also be straightforward to build an HTTP(S) web crawler (again, potentially with some intelligence built in) if the WADs are exposed as download links. I did something related a year or so ago to find and download Lego plans via a REST API that exposes - er - Lego plans.

Basically, a nice project would be to build this kind of WAD spider, configurable per site and able to selectively store .WAD files somewhere. It may even be possible to collect and apply metadata (if using a database). Now that I have mentioned it, it's actually an interesting idea - I might have a crack at this anyway (kids and real life allowing).
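A minimal sketch of what such a crawler could look like in Python, assuming the idgames API's getcontents action and the JSON shape I remember it returning (a content object holding file and dir entries) - verify the field names against the API docs at doomworld.com/idgames/api/ before relying on them:

```python
import time
import requests

API = "https://www.doomworld.com/idgames/api/api.php"

def crawl(dirname="levels/"):
    """Recursively walk the idgames tree, printing the WADs it finds."""
    resp = requests.get(API, params={"action": "getcontents",
                                     "name": dirname, "out": "json"})
    resp.raise_for_status()
    content = resp.json().get("content", {})
    for key in ("file", "dir"):
        entries = content.get(key, [])
        if isinstance(entries, dict):   # a lone entry may come back as a dict
            entries = [entries]
        for entry in entries:
            if key == "file":
                print(entry.get("title"), entry.get("filename"))
            else:
                time.sleep(1)           # be gentle with the API
                crawl(entry["name"])

crawl()
```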
wrkq Posted January 9, 2021

4 hours ago, smeghammer said: As you probably know, there is a REST API for the DW view on ID Games - there may be APIs for the other sites too.

You're thinking a little unnecessarily modern here. :) /idgames is a public anonymous FTP archive, after all. You can use WinSCP or any other non-trash FTP client to connect to gamers.org or (preferably) one of the mirrors, select all the folders, and add them to the download queue. Just be nice and set the queue to run one file at a time rather than in parallel.

For the other sites - to be completely honest, it might be best to contact their operators; presumably they have ties to the greater Doom community, and you can discuss the options with them. Maybe they'd offer some sensible access method.
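For anyone scripting this rather than clicking through a GUI client, a rough ftplib sketch of that serial, one-file-at-a-time queue might look like the following. The host and path are illustrative - substitute whichever mirror you prefer - and you would need to fall back to NLST on servers that don't support MLSD:

```python
import time
from ftplib import FTP

HOST, ROOT = "ftp.gamers.org", "/pub/idgames/levels/doom"  # illustrative

def mirror(ftp):
    """Download the current remote directory tree, one file at a time."""
    for name, facts in ftp.mlsd():
        if facts["type"] == "dir":
            ftp.cwd(name)
            mirror(ftp)
            ftp.cwd("..")
        elif facts["type"] == "file":
            with open(name, "wb") as fh:   # files land flat in the local cwd
                ftp.retrbinary(f"RETR {name}", fh.write)
            time.sleep(1)                  # pause between files; be nice

with FTP(HOST) as ftp:
    ftp.login()          # anonymous login
    ftp.cwd(ROOT)
    mirror(ftp)
```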
smeghammer Posted January 9, 2021

@wrkq, absolutely - I use FileZilla for regular downloads. And yes, you can queue the whole lot by dropping the containing folder onto your queue window.

I like tinkering with code and APIs, though, and a generic downloader - perhaps one that could run as a systemd service - could collect all the file metadata and download links, drop that into a database, and then download at leisure. It could also have a rate limiter built in: wait one minute before downloading the next file, etc. The DW API certainly offers a simple REST interface that can return JSON, and Python plays very nicely with JSON. TBH, this is probably just an excuse for me to fire up Eclipse, as I'm still on furlough...

** UPDATE ** If you are interested, the first cut of my new Doomworld REST API crawler is here. It's just a PoC so far, but it will crawl recursively into the /levels directory and - at the moment - print the level names it finds. Eventually the file metadata will be pushed to a database (MongoDB, probably) and I'll build a fetcher to download files based on the list in the database. Check the readme.
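As a sketch of that decoupled fetcher idea: assuming the crawler has already populated a MongoDB collection with documents holding a url, filename, and downloaded flag (a made-up schema, purely for illustration), the rate-limited download pass could be as simple as:

```python
import time
import requests
from pymongo import MongoClient

wads = MongoClient()["idgames"]["wads"]   # hypothetical database/collection

for doc in wads.find({"downloaded": False}):
    resp = requests.get(doc["url"], timeout=30)
    resp.raise_for_status()
    with open(doc["filename"], "wb") as fh:
        fh.write(resp.content)
    wads.update_one({"_id": doc["_id"]}, {"$set": {"downloaded": True}})
    time.sleep(60)   # rate limiter: wait a minute before the next file
```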
WadArchive Posted January 9, 2021

Wad Archive isn't really set up to be easily shared; it is set up to serve a webpage/downloads quickly while taking up as little space as possible. For example, all the WADs are stored in the format <sha1 hash>.<ext>.gz, e.g. 0a0a37cfa7782b19604794886e5e42fdc676d733.pk3.gz, with the mapping back to filenames kept in a database. This is because each WAD can have any number of names, and disk space isn't free. They are also gzipped so that the webserver doesn't have to gzip them on the fly for the browser (as well as taking up less space on disk). If I was ever to shut the site down, I would make it all available to download in some format.
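What WadArchive describes is essentially content-addressed storage. A toy Python version of the scheme, using sqlite3 to stand in for the real (unspecified) name-mapping database, might look like:

```python
import gzip
import hashlib
import sqlite3
from pathlib import Path

db = sqlite3.connect("names.db")
db.execute("CREATE TABLE IF NOT EXISTS names (sha1 TEXT, filename TEXT, "
           "UNIQUE(sha1, filename))")

def store(path, store_dir="store"):
    """File a WAD under its SHA-1, gzipped; remember its original name."""
    data = Path(path).read_bytes()
    digest = hashlib.sha1(data).hexdigest()
    ext = Path(path).suffix.lstrip(".") or "wad"
    dest = Path(store_dir) / f"{digest}.{ext}.gz"   # e.g. 0a0a37...pk3.gz
    dest.parent.mkdir(exist_ok=True)
    if not dest.exists():               # identical content is stored once
        with gzip.open(dest, "wb") as out:
            out.write(data)
    db.execute("INSERT OR IGNORE INTO names VALUES (?, ?)",
               (digest, Path(path).name))
    db.commit()
```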
smeghammer Posted January 10, 2021

Just finished the downloader for idgames - see the repo linked above. I'll have a crack at adding TSPG and Wad Archive at some point...
smeghammer Posted January 11, 2021

@Doomkid - Looks like TSPG and wad-archive don't expose an API or have FTP access? Web crawling, scraping, and programmatic identification of WAD download links is therefore likely the only way to do this, unless the site admins have options that are not publicly exposed. Such web scraping, although certainly possible, is probably not a nice thing to do, particularly if the target web servers are under heavy load already. A rate-limited scrape (waiting, say, 30 seconds between each download) would be better, but of course much slower.
hrr1000 Posted January 11, 2021

59 minutes ago, smeghammer said: Looks like TSPG and wad-archive don't expose an API or have FTP access?

https://www.wad-archive.com/api/
smeghammer Posted January 11, 2021

Yeah, I found that, but I get the following JSON back:

{"error":"Not found"}

and the HTTP status code is actually a 404, so anonymous access is not possible. I don't have an account there, so I can't test whether the API is available to registered users.

EDIT: It doesn't look like you can register an account? From what little I have found out about wad-archive so far, it looks like the API can retrieve a WAD by its hash, but what I was hoping for was some sort of index of hashes from which download links could be assembled.
Doomkid Posted January 12, 2021

This is a bit beyond my knowledge base, but if you're having trouble creating a wad-archive account, I can give you access to mine if it will help at all?
WadArchive Posted January 12, 2021

The API requires that you know the hash or a filename. Also, I have updated the site and removed the need for accounts (however, downloads are currently offline until I have some free time to fix things).
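In case it helps anyone experimenting, a lookup against such an API might look like the sketch below. The endpoint path here is purely a guess, since the real routes aren't publicly documented; only the 404-with-JSON-error behaviour is confirmed above.

```python
import requests

def lookup(sha1):
    """Fetch metadata for a WAD by hash. The URL shape is hypothetical."""
    resp = requests.get(f"https://www.wad-archive.com/api/{sha1}")
    if resp.status_code == 404:   # the API answers {"error":"Not found"}
        return None
    resp.raise_for_status()
    return resp.json()

print(lookup("0a0a37cfa7782b19604794886e5e42fdc676d733"))
```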
smeghammer Posted January 12, 2021

@Doomkid, I'll be offline for a bit - actual real-life work takes precedence here, unfortunately - but I'll be back...

@WadArchive - do you have any objection to me building a proof-of-concept web scraper against the wad-archive site? It would crawl and collect hashes and assemble download links. The spider would fetch text/html MIME types only, so it would be only a minor bandwidth hit. The second part would be the actual download of the found WADs; this is obviously the bit that could potentially cause problems for the server, particularly if there are multiple parallel scrapers running. It would be architected like my idgames downloader (a decoupled metadata crawler and binary downloader).
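A rough shape for that HTML-only crawl, assuming requests and BeautifulSoup: a HEAD request checks the MIME type before anything heavy is fetched, non-HTML URLs are set aside as download candidates for the decoupled second pass, and a small delay keeps it polite.

```python
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start):
    """Spider same-host pages, fetching text/html only; collect other links."""
    host = urlparse(start).netloc
    seen, queue, found = set(), [start], []
    while queue:
        url = queue.pop()
        if url in seen or urlparse(url).netloc != host:
            continue
        seen.add(url)
        head = requests.head(url, allow_redirects=True)
        if not head.headers.get("Content-Type", "").startswith("text/html"):
            found.append(url)           # candidate WAD download; fetch later
            continue
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        queue += [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
        time.sleep(1)                   # keep the load on the server low
    return found
```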
Diabolución Posted March 24, 2021

Off-topic: @WadArchive, could/would you add to your database the European PSN variants of both tnt.wad and plutonia.wad? They actually differ slightly from the American release.
Maes Posted March 24, 2021

This sounds kind of like the inverse of the problem of making it easier to upload to idgames... hmmm ;-)
WadArchive Posted March 24, 2021

5 hours ago, Diabolución said: Could/would you add to your database the European PSN variants of both tnt.wad and plutonia.wad?

It is on my list of things to do.