Downloading mirrors of idgames, TSPG and Wad-Archive?


Doomkid

Just curious if anyone here knows how one could possibly download one or all of these invaluable archives of wads? There must be 100,000 individual files at least between all 3.

 

A very helpful user on YT has offered help on looking for the lost Dwango9. Additionally, I’d like to mirror as many of these files as possible on a personal drive.

 

If anyone can offer help with this, I’d greatly appreciate it!

Edited by Doomkid


Sorry, I misunderstood the thread's question. Just ignore my post.

Edited by Dexiaz
Ignore this post

5 hours ago, Doomkid said:

Just curious if anyone here knows how one could possibly download one or all of these invaluable archives of wads? There must be 100,000 individual files at least between all 3.

 

A very helpful user on YT has offered help on looking for the lost Dwango9. Additionally, I’d like to mirror as many of these files as possible on a personal drive.

 

If anyone can offer help with this, I’d greatly appreciate it!

 

Website mirroring/spidering:

If you go for the bull-in-a-china-shop approach (aside: see this though :-)), you can use (Win)HTTrack to spider the whole site. It's not very selective, though, as you really would get everything (though it would be browsable offline)...

 

 

A better approach, though more work up front:

As you probably know, there is a REST API for the DW view on ID Games - there may be APIs for the other sites too. The DW API is pretty straightforward - I built a browser for it a year or so ago.

 

Certainly, it would be possible to build a crawler against this API, with some logic built in to store only the contents of WAD directories, pull all the WADs it finds, and store them either in a database or on the filesystem.
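
To make that concrete, a single call against the API could look something like the sketch below. The api.php endpoint, the getcontents action and the out=json parameter are written from memory here, so treat them as assumptions and check them against the API documentation first.

import json
import urllib.request

# Assumed endpoint and parameters for the Doomworld /idgames API - verify before relying on them.
API = "https://www.doomworld.com/idgames/api/api.php"

def get_contents(path):
    """Ask the API for the listing (files and subdirectories) of one archive directory."""
    url = f"{API}?action=getcontents&name={path}&out=json"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Peek at one directory to see what a crawler would have to work with.
print(json.dumps(get_contents("levels/doom2/a-c/"), indent=2)[:800])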

 

In the real world, I use Python to build web crawlers to analyse website hierarchies - it should be straightforward to build a crawler for the ID Games archive API. It would also be straightforward to build an HTTP(S) web crawler (again, potentially with some intelligence built in) if the WADs are exposed as download links.

 

I did something related a year or so ago to find and download Lego plans via a REST API that exposes - er - Lego plans.

 

Basically, a nice project would be to build this kind of WAD spider that is configurable per site and able to selectively store .WAD files somewhere. It may even be possible to collect and apply metadata (if using a database).

 

Now that I've mentioned it, it's actually an interesting idea - I might have a crack at this anyway (kids and real life allowing). 

 

 

 

Edited by smeghammer

4 hours ago, smeghammer said:

As you probably know, there is a REST API for the DW view on ID Games - there may be APIs for the other sites too. The DW API is pretty straightforward - I built a browser for it a year or so ago. 

 

You're thinking a little unnecessarily modern here. :)

/idgames is a public anonymous FTP after all.

You can use WinSCP or any other not-trash FTP client to connect to gamers.org or (preferably) one of the mirrors, select all the folders and add them to the download queue.

Just be nice and set the queue not to run in parallel - one file at a time.
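
If you'd rather script it than drive an FTP client, the same one-file-at-a-time idea looks roughly like this in Python. The hostname and root path are placeholders - check the directory layout of whichever mirror you actually pick, and note that not every server supports MLSD.

import ftplib
import os
import time

HOST = "ftp.gamers.org"     # placeholder - use whichever /idgames mirror you prefer
ROOT = "/pub/idgames"       # placeholder - check the mirror's actual directory layout

def mirror(ftp, remote_dir, local_dir):
    """Recursively download one directory, one file at a time, with a polite pause."""
    os.makedirs(local_dir, exist_ok=True)
    ftp.cwd(remote_dir)
    # MLSD isn't supported everywhere; fall back to ftp.nlst()/ftp.dir() if needed.
    for name, facts in list(ftp.mlsd()):
        if facts.get("type") == "dir":
            mirror(ftp, f"{remote_dir}/{name}", os.path.join(local_dir, name))
            ftp.cwd(remote_dir)
        elif facts.get("type") == "file":
            with open(os.path.join(local_dir, name), "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)
            time.sleep(2)   # no parallel transfers, and a short pause between files

with ftplib.FTP(HOST) as ftp:
    ftp.login()             # anonymous login
    mirror(ftp, ROOT, "idgames-mirror")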

 

For the other sites - to be completely honest, it might be best to contact their operators, who presumably have ties to the greater Doom community, and discuss the options. Maybe they'd offer some sensible access method.


@wrkq, absolutely - I use FileZilla for regular downloads. And yes, you can queue the whole lot by dropping the containing folder onto your queue window.

 

I like tinkering with code and APIs, and a generic downloader - perhaps one that could run as a systemd service - could collect all the file metadata and download links, drop that into a database and then download at leisure. It could also have a rate limiter built in - wait one minute before downloading the next file, etc. The DW site certainly offers a simple REST API that can return JSON, and Python plays very nicely with JSON.
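
As a sketch of that "download at leisure" half, assuming the crawler has already written filenames and download links into a SQLite table - the database file, table and column names here are hypothetical:

import sqlite3
import time
import urllib.request

# Hypothetical schema written by the crawler:
#   wads(filename TEXT, url TEXT, downloaded INTEGER DEFAULT 0)
db = sqlite3.connect("wad_index.db")

rows = db.execute("SELECT filename, url FROM wads WHERE downloaded = 0").fetchall()
for filename, url in rows:
    urllib.request.urlretrieve(url, filename)                          # fetch one file
    db.execute("UPDATE wads SET downloaded = 1 WHERE url = ?", (url,))
    db.commit()
    time.sleep(60)                                                     # rate limiter: one file per minute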

 

TBH, this is probably just an excuse for me to fire up Eclipse as I'm still on furlough...

 

** UPDATE **

If you are interested, the first cut of my new Doomworld REST API crawler is here. It's just a PoC so far, but it will crawl recursively into the /levels directory and - at the moment - print the level names it finds. Eventually, the file metadata will be pushed to a database (MongoDB, probably) and I'll build a fetcher to download files based on the list in the database. Check the readme.
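
Not the PoC itself, but a minimal sketch of the same idea - recursing into /levels via the API and printing what it finds. The endpoint and the shape of the JSON response (a "content" object holding "dir" and "file" lists) are assumptions from memory and may need adjusting:

import json
import time
import urllib.request

API = "https://www.doomworld.com/idgames/api/api.php"   # assumed endpoint

def get_contents(path):
    url = f"{API}?action=getcontents&name={path}&out=json"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def crawl(path):
    """Walk the archive tree under `path`, printing each file found."""
    content = get_contents(path).get("content") or {}
    for d in content.get("dir", []):
        crawl(d["name"])                  # "name" is assumed to be the full sub-path
    for f in content.get("file", []):
        print(f.get("title"), "-", f.get("filename"))
    time.sleep(1)                         # go easy on the API

crawl("levels/")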

Edited by smeghammer
info on my DW browser


Wad Archive isn't really set up to be easily shared; it's more set up to serve a web page/downloads quickly while taking up as little space as possible. For example, all the WADs are stored in the format:

 

<sha1 hash>.<ext>.gz

e.g.

0a0a37cfa7782b19604794886e5e42fdc676d733.pk3.gz

 

With the mapping back to filenames kept in a database. This is because each WAD can have any number of names, and disk space isn't free. They are also gzipped so that the webserver doesn't have to gzip them on the fly for the browser (as well as taking up less space on disk).
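
In code, that scheme is roughly the following - a sketch of the idea as described above, not Wad-Archive's actual implementation:

import gzip
import hashlib
import os
import shutil

def store(path, storage_dir="store"):
    """Store a file as <sha1>.<ext>.gz and return the mapping entry for the database."""
    os.makedirs(storage_dir, exist_ok=True)
    with open(path, "rb") as fh:
        digest = hashlib.sha1(fh.read()).hexdigest()
    ext = path.rsplit(".", 1)[-1].lower()
    stored_name = f"{digest}.{ext}.gz"
    with open(path, "rb") as src, gzip.open(os.path.join(storage_dir, stored_name), "wb") as dst:
        shutil.copyfileobj(src, dst)
    # The original-filename -> stored-name mapping lives in the database.
    return {"filename": os.path.basename(path), "stored_as": stored_name}

print(store("AV.WAD"))   # e.g. {'filename': 'AV.WAD', 'stored_as': '<sha1>.wad.gz'}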

 

If I were ever to shut the site down, I would make it all available to download in some format.


@Doomkid - Looks like TSPG and wad-archive don't expose an API or have FTP access? 

 

Web crawling, scraping and programmatic identification of WAD download links is therefore likely the only way to do this, unless the site admins have options that are not publicly exposed. Such web scraping, although certainly possible, is probably not a nice thing to do, particularly if the target web servers are under heavy load already. A rate-limited scrape (e.g. waiting 30 seconds between each download) would be better, but of course much slower.
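
A rate-limited scrape might look roughly like this - the listing URLs and the download-link pattern are pure placeholders, since the real sites' HTML would need inspecting first:

import re
import time
import urllib.request
from urllib.parse import urljoin

PAGES = ["https://example.com/wads?page=1"]                       # placeholder listing pages
LINK_RE = re.compile(r'href="([^"]+\.(?:wad|pk3|zip))"', re.I)    # naive download-link pattern

for page in PAGES:
    html = urllib.request.urlopen(page).read().decode("utf-8", errors="replace")
    for href in LINK_RE.findall(html):
        link = urljoin(page, href)                  # resolve relative links
        urllib.request.urlretrieve(link, link.rsplit("/", 1)[-1])
        time.sleep(30)                              # rate limit: 30 seconds between downloads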

59 minutes ago, smeghammer said:

@Doomkid - Looks like TSPG and wad-archive don't expose an API or have FTP access? 

 

Web crawling, scraping and programmatic identification of WAD download links is therefore likely the only way to do this, unless the site admins have options that are not publicly exposed. Such web scraping, although certainly possible, is probably not a nice thing to do, particularly if the target web servers are under heavy load already. A rate-limited scrape (e.g. waiting 30 seconds between each download) would be better, but of course much slower.

  https://www.wad-archive.com/api/  


Yeah, I found that, but I get the following JSON back:

 {"error":"Not found"}

and the HTTP status code is actually 404.

 

So anonymous access is not possible. I don't have an account there, so can't test if that API is available to registered users.

 

EDIT: Doesn't look like you can register an account?

 

From what little I have found out about wad-archive so far, it looks like that API is available to retrieve a WAD by its hash, but what I was hoping for was some sort of index of hashes, from which download links can be assembled. 

Edited by smeghammer


This is a bit beyond my knowledge base, but if you’re having trouble creating a wad-archive account I can give you access to mine if it will help at all?


The API requires that you know the hash or a filename. Also, I have updated the site and removed the need for accounts (however, downloads are currently offline until I have some free time to fix things).


@Doomkid, I'll be offline for a bit - actual real-life work takes precedence here unfortunately - but I'll be back...

 

@WadArchive - do you have any objection to me looking at building a proof-of-concept web scraper against the wad-archive site? It would crawl and collect hashes and assemble download links. The spidering would fetch only the text/html MIME type, so only a minor bandwidth hit.

 

The second part would be the actual download of the found WADs. This is obviously the bit that could potentially cause problems for the server - particularly if there are multiple parallel scrapers running.

 

It would be architected similarly to my ID Games downloader (decoupled metadata crawler and binary downloader).
