
WAD Downloader - v1.1


smeghammer


DETAILS HERE

 

A python3 and MongoDB based framework for WAD downloading. It is set up to be configurable to download from any reasonably structured source HTML. Bear in mind, you do need a bit of coding knowledge to create new scraper modules, though there are currently 6 available to use and adapt.

 

Have fun!

 

 

##  EDIT 8 (23/06/23)  ##

June 2023

Updated Realm667 crawler to account for changed DOM. Also some housekeeping changes to improve 'pythonicness' - started converting concatenated strings to f-strings; tidied up some comments and redundant code, reworked many inline code documentation blocks to address pylint bitching...

 

All changes merged with master branch, so you are good to go.

 

As always, please PM me if you need help setting this up.

 

## EDIT 7 ##

May 2023

Added a crawler to extract metadata into a simple SQLite database, to enable sorting and other analysis. Prompted by this thread, asking about mapper longevity.

 

##  EDIT 6  ##

Added crawler for DoomShack

 

Spoiler

@Doomkid - you might want to think about wrapping the list of <li></li> tags in a <ul>...</ul> (unordered list) element. Using correct HTML syntax will help with Google ranking, if you are interested in that. Sorry, I'm a techy web dev geek...

 

 

##  EDIT 5  ## 

Changed popup to show a real-time view on the download queue numbers:
image.png

 

 

## EDIT 4 ##

Added a first cut of a Chrome extension to allow adding arbitrary WAD links from any website to the download queue. Currently limited to a specified server port on localhost (see readme) but is working as intended. YAY!

image.png

 

Links added like this will be flagged as source 'api'. It will check for existence of URL first.
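A minimal sketch of the "check for existence, then add as source 'api'" behaviour described above. A plain dict stands in for the MongoDB queue, and the names `add_api_link` and `filename_from_url`, plus the NOTFETCHED starting state for new entries, are assumptions for illustration rather than the project's actual code.

```python
import posixpath
from urllib.parse import parse_qs, urlparse


def filename_from_url(url):
    """Guess a download filename: a ?file= query value if present, else the path basename."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if "file" in query:
        return query["file"][0]
    return posixpath.basename(parsed.path)


def add_api_link(queue, url):
    """Queue url with source 'api' unless that URL is already queued.

    queue is a dict keyed by URL (a stand-in for the database collection).
    Returns True if an entry was added, False if the URL already existed.
    """
    if url in queue:
        return False
    queue[url] = {
        "url": url,
        "state": "NOTFETCHED",  # assumed initial state
        "source": "api",
        "metadata": {"filename": filename_from_url(url)},
    }
    return True
```

Calling `add_api_link` twice with the same URL adds it once; the second call returns False and leaves the queue unchanged.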

 

Tested with IDGames, The Sentinels Playground and Archive.org.

 

TODO:

 - Add feedback

 - Add configuration options for different server location

 

## EDIT 3 ##

Jan 2022:

Added crawlers for:

 

Realm 667 repository

Camo Yoshi WAD archive

 

Updated Windows batch files (shell files to do!)

 

Tested, currently have 45 GB downloaded from all repos:

image.png

 

 

 

 

## EDIT 2 ##

Jan 2022:

Added crawlers for:

 

 

Check out the readme and later post for details.

 

 

 

Triggered by this post by @Doomkid, I wrote a simple downloader for the ID Games archive. Yes, I know, you could just use FileZilla or WinSCP or similar, but I'm a bit of a nerd and thought this was a good idea to play with.

 

Anyway, it's a command-line thing written in Python 3, using a MongoDB database for metadata and storing the downloaded WAD files in the correct folder structure on the filesystem.

 

I added windows batch files to run the parts, and it works just fine on Ubuntu as well, with equivalent shell files.

 

You can read all about it and download from my github site. I'd love to hear what you think.

 

##  EDIT  ##

Updated title as it will download from more than IDGames

 

Edited by smeghammer


OK I have been adding to this and testing.

 

I added crawlers for Sentinels Playground (nice Priest background BTW!) and for wad-archive (downloads from there are disabled at source, and a temp text file is all you get at the moment). TSPG (and wad-archive) don't have a nice API, so the crawlers rely on web scraping and prior knowledge of the HTML DOM structure, rather than neat API endpoints...

 

I tested by pointing the doomworld crawler at node id 6 (/levels) and the TSPG at https://allfearthesentinel.net/zandronum/wads.php and scraping the links in the table. I got about 12,000 links for each and - so far - downloaded about 300 files.

 

For coders:

Spoiler

 

The download links are stored in a simple queue. This is a MongoDB database with a common format to the link data, and some metadata for the source (DW, TSPG etc.) and the folder structure for later replication at download time.

 

Because of the common format of the stored download links, a single fetcher can pull WADs from any source and store them on the file system, in the folder structure indicated by the download link metadata.

 

Essentially, I can build and add a crawler for any download source (as long as there is some way of identifying the HTML that holds the links, and how to crawl to the next page) and the links gathered will be put in a queue for the fetcher to process.
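A rough sketch of why the common format matters: once every crawler emits entries of the same shape, one function can plan downloads for all of them. The field names (`url`, `state`, `source`, `metadata.dir`, `metadata.filename`) follow the queue entries posted later in this thread; `make_entry` and `fetch_plan` are hypothetical helper names, and a plain list stands in for the MongoDB collection.

```python
def make_entry(source, url, directory, filename, **extra):
    """Build a queue entry in the shared shape, whichever site it came from."""
    metadata = {"dir": directory, "filename": filename}
    metadata.update(extra)  # site-specific extras (title, author, ...) ride along
    return {"url": url, "state": "NOTFETCHED", "source": source, "metadata": metadata}


def fetch_plan(entries):
    """Return (url, relative_path) pairs for every entry not yet fetched.

    Only the common fields are read, so the source site does not matter here.
    """
    plan = []
    for entry in entries:
        if entry["state"] != "NOTFETCHED":
            continue
        meta = entry["metadata"]
        plan.append((entry["url"], meta["dir"] + meta["filename"]))
    return plan


queue = [
    make_entry(
        "tspg",
        "https://allfearthesentinel.net/zandronum/download.php?file=(2)[mod]part2.pk3",
        "page1/",
        "(2)[mod]part2.pk3",
    ),
    make_entry(
        "doomworld",
        "ftp://ftp.fu-berlin.de/pc/games/idgames/levels/strife/lamasery.zip",
        "levels/strife/",
        "lamasery.zip",
        title="The Silenced Lamasery",
    ),
]
```

Here `fetch_plan(queue)` yields one pair per entry, with relative paths `page1/(2)[mod]part2.pk3` and `levels/strife/lamasery.zip`.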

 

This is an exercise in geeky coding, rather than a slick, user friendly thing. As such, you do need to be familiar with CLI applications, starting arguments and - in particular - setting up of a Mongo database. Sorry...

 

I'll add a few more batch/shell files, with suitable comments, that start the crawler in Windows or Ubuntu to crawl each of the sites I have so far built crawlers for.

 

I have used Python3 with a simple class-based architecture. I have used Abstract Base Class methodology to enforce consistency for the fetchers. I have also used dynamic class loading so I can use the same crawler front-end to launch the correct specific crawler for the site being crawled. This also benefits from the ABC methodology.
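To illustrate the pattern (not the repo's actual classes), here is a small sketch of an abstract base class plus runtime selection of the concrete crawler by name. The class names, the `crawl` and `source_name` methods and the `load_crawler` helper are all hypothetical. How the project actually locates its classes is not shown in this thread, so a registry built from `__subclasses__()` stands in for it.

```python
from abc import ABC, abstractmethod


class Crawler(ABC):
    """Base class: every site-specific crawler must implement crawl()."""

    def __init__(self, start_url):
        self.start_url = start_url

    @abstractmethod
    def crawl(self):
        """Return a list of download URLs found from start_url."""

    def source_name(self):
        """Concrete helper shared by all subclasses."""
        return type(self).__name__.replace("Crawler", "").lower()


class TspgCrawler(Crawler):
    def crawl(self):
        # A real crawler would fetch and parse HTML here.
        return [self.start_url + "?file=example.pk3"]


class DoomworldCrawler(Crawler):
    def crawl(self):
        # A real crawler would call the site's API here.
        return [self.start_url + "/levels/example.zip"]


def load_crawler(name, start_url):
    """Pick the crawler class by name at runtime and instantiate it."""
    registry = {cls.__name__.lower(): cls for cls in Crawler.__subclasses__()}
    try:
        crawler_class = registry[name.lower() + "crawler"]
    except KeyError:
        raise ValueError(f"no crawler named {name!r}") from None
    return crawler_class(start_url)
```

The front-end only needs the name from the command line, e.g. `load_crawler("tspg", url).crawl()`. Forgetting to implement `crawl()` in a new subclass raises a `TypeError` at instantiation, which is the consistency the ABC buys.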

 

 

Edited by smeghammer


Right - 12 hours later... My test run went all night - so far I have downloaded 70GB from TSPG and 17GB from IDGames archive. My disk is now full. (I need to add a CLI arg to specify a save location I think...)

Spoiler


folders.png?raw=true

 

files.png?raw=true

 


 

 

@Doomkid, I'd say this thing works :-)

 

If anyone has suggestions for other big sources of WAD downloads, please let me know and I'll try and create a crawler for it.

 

I'll update the documentation on the repo README, and add some reference to mongoDB installation.

 

2 minutes ago, P41R47 said:

If it works for downloading from Idgames, could it also work for uploading to Idgames?

 

@P41R47 - no, it won't push anything.

 

For ID Games, using the standard FTP upload is probably best. The idea behind this is essentially a bulk downloader. ID Games can be accessed via FTP anyway, so not really needed for that, but other sites, such as WAD Archive, that only expose the downloads as webpage links can be crawled and downloaded from.

 

This whole thing was triggered by @Doomkid's post of a few days ago

11 minutes ago, smeghammer said:

@P41R47 - no, it won't push anything.

 

For ID Games, using the standard FTP upload is probably best. The idea behind this is essentially a bulk downloader. ID Games can be accessed via FTP anyway, so not really needed for that, but other sites, such as WAD Archive, that only expose the downloads as webpage links can be crawled and downloaded from.

 

This whole thing was triggered by @Doomkid's post of a few days ago

Thanks for answering, pal!

 

But well, I know that these days the standard FTP client and uploading to Idgames is something that most of us struggle with at first.

So something like your downloader that can also upload things, with some prior setup, would be really cool.


OK I'll have a think about it. It would probably be a separate tool though - it would ideally need to be graphical with drag-and-drop. Maybe a project for next week... :-)


Right it's in a more friendly usable state. I updated the shell files to prompt if you start without arguments. The batch files can be easily adapted if needed.

 

I'd love to hear what you guys think. Also, I'd love to have a crack at crawlers for any other WAD download sites you can suggest. I have the framework, so it should be straightforward to implement more.

9 hours ago, P41R47 said:

But well, I know that these days the standard FTP client and uploading to Idgames is something that most of us struggle with at first.

So something like your downloader that can also upload things, with some prior setup, would be really cool.

That was a recent topic and there is a plan to address that. TGH agreed, and I have some ideas for a nice upload form, but no time to start developing yet (partly related to Dynamo128 flooding my wiki backlog ;) ).

Edited by Xymph

5 minutes ago, Xymph said:

That was a recent topic and there is a plan to address that. TGH agreed, and I have some ideas for a nice upload form, but no time to start developing yet (partly related to Dynamo128 flooding my wiki backlog ;) ).

Thanks for that, Xymph!

I did read that thread when it first appeared, and I think I also posted an opinion, but I didn't follow it to see how it developed.

Good to know!


A few changes:

 

 - Updated to include a doomwadstation crawler as well

 - Changed the source name to come from the relevant config entry

 - Updated the MongoDB _id field to be the download filename

 - Altered abstract class to include two concrete methods

 - Updated DWS, TSPG and W-A to inherit/use these methods

 

I created a master branch on the repo - that is the one to use if you want to check it out.

 

Link at top updated.

Edited by smeghammer

  • 11 months later...

OK big update to this.

 

  • Code is tidied up and streamlined, 
  • better use of base class methods,
  • added WAD Archive crawler,
  • added flag to fetcher to download by source
  • parameterised shell/batch files with messaging (shortly)

 

The readme is updated somewhat too.

 

If you can suggest other sources for lists of WADs, I can build fetchers for these too.

 

 


What about dogsoft.net (or search for dogsoft on Google)? Can you add this website to it?

 

https://camoy.sdf.org/wads/

Bunches of files that Camo Yoshi put on the site at my request: bunches of Skulltag mods and Zandronum mods too, and maybe some GZDoom ones as well. I will update my website so he can add the rest to it.


OK so progress of downloads so far. I'm separating the downloads by source, so I can kick off a specific fetcher and know I will get all of a certain source...

 

  • Doomworld (via REST API/FTP): Just the /levels directory - 17GB so far
  • Camo Yoshi site (HTML scrape, single big page): 550 or so files, 7GB so far
  • R667 (HTML spidering, repository section): 1100 files, 870MB (completed) 
  • The Sentinels Playground: 37GB in 33 pages scraped out of 240 pages (so looking at 300GB?)
  • Wad Archive: 18GB from 31 pages
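As a quick check on the Sentinels Playground guess, a straight linear extrapolation from the figures above (assuming every page holds roughly the same volume, which may well not be true) comes out at about 269 GB, a little under the rough 300 GB estimate. The function name is just for illustration.

```python
def projected_total_gb(gb_so_far, pages_done, pages_total):
    """Linear extrapolation: assume each page carries the same average volume."""
    return gb_so_far / pages_done * pages_total


# 37 GB from 33 of 240 pages, as reported in the list above.
tspg_estimate = projected_total_gb(37, 33, 240)
```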

Looking good...

 

 

 


Well that was a pointless rabbit hole.

 

Wanted to enable a central REST server for the Chrome plugin to talk to. Spent the entire day struggling with CORS blocking. Gave up. The REST API server (currently) needs to run on the same machine as the plugin.

 

I'll definitely revisit this though.


So I changed the popup to show the status of the download queue:
 

image.png

 

It updates in real time (as long as a crawler and/or a fetcher is running). It will also indicate whether it is actually connected to a suitable local download server.

  • 1 year later...

Bumpety bump.

 

Added IDGames metadata crawler for data analysis. WIP.

 

Prompted by this thread.

 

Currently on dev branch 'idgames_metadata_sqlite'

 

Just ran a test crawl, extracted 13446 records.

 

I collected the author (string, with a few non UTF8 entries barfing), the age (int, epoch datestamp), the title and id (int) from the wads root (id 6).

I can now order the data by author and then by age.

This should give me enough info to get the first and last map by each author, and get the time diff between them - giving the dates and duration each author was/is active in uploading maps to IDGames. Converting the timestamps back to readable dates is easy enough, so hopefully I'll end up with another table with author (unique), first map date, last map date and time active. 
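The first/last/duration idea can be sketched with a single GROUP BY query. This assumes a table called `maps` with `id`, `title`, `author` and `age` (epoch seconds) columns; the table name and the sample rows are made up for illustration, and only the column meanings come from the description above.

```python
import sqlite3
from datetime import datetime, timezone


def author_activity(conn):
    """Return (author, first_date, last_date, days_active) rows, one per author."""
    rows = conn.execute(
        """
        SELECT author, MIN(age) AS first_age, MAX(age) AS last_age
        FROM maps
        GROUP BY author
        ORDER BY author
        """
    ).fetchall()
    result = []
    for author, first_age, last_age in rows:
        # Convert epoch timestamps back to readable UTC dates.
        first = datetime.fromtimestamp(first_age, tz=timezone.utc).date()
        last = datetime.fromtimestamp(last_age, tz=timezone.utc).date()
        result.append((author, first.isoformat(), last.isoformat(), (last - first).days))
    return result


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE maps (id INTEGER PRIMARY KEY, title TEXT, author TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO maps (id, title, author, age) VALUES (?, ?, ?, ?)",
    [
        (1, "First Map", "Example Author", 946684800),  # 2000-01-01 00:00:00 UTC
        (2, "Later Map", "Example Author", 978307200),  # 2001-01-01 00:00:00 UTC
        (3, "Only Map", "Other Author", 946684800),
    ],
)
```

With the sample rows, "Example Author" comes out as active from 2000-01-01 to 2001-01-01 (366 days) and "Other Author" as 0 days. Note the query treats author strings as exact matches, so spelling or case variants of the same person count as separate authors.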

 


What I've personally been wanting is a program which utilises the idgames API to store zips in directory by ./author/yyyy.mm.dd name/file.zip, this might be a useful reference for me to work off.

Requiring network access to a database server is a little strange, would normally only expect that kind of thing from a web server I think. Most programs I've dealt with just use the filesystem to access an sqlite/json/csv database.


@houston - The ID Games crawler is most of the way there to what you want. Bear with me, I'll describe what happens (also see the readme, which needs updating).

 

Architecture:

 - Platform and tech - back end language is Python 3, using the Flask framework, and the front end is a bit of hand-cranked JavaScript

 

 - MongoDB to hold queue metadata - NOT the binaries! I chose MongoDB because I like it. I also chose it because I expected that different sites may require different data to be stored, and that is easier in Mongo because each document (stored thing) can be any arbitrary JSON structure. Further, returned JSON can be natively cast to a Python dict, which makes life easier. There is of course nothing stopping you using a SQL database. SQLite is nice because the Python driver (and, I assume, drivers for other languages) allows direct manipulation of JSON, so the end result is again a native cast of database return value(s) to a Python data structure. Also, Mongo allows non-local access, so I can use an existing Mongo server sitting on my network somewhere.

 

 - Queue data - Here you can see an entry for a Sentinels Playground and an ID games entry:

Spoiler

 


{
    "_id" : "(2)[mod]part2.pk3",
    "url" : "https://allfearthesentinel.net/zandronum/download.php?file=(2)[mod]part2.pk3",
    "state" : "FETCHED",
    "source" : "tspg",
    "metadata" : {
        "_id" : "(2)[mod]part2.pk3",
        "href" : "/zandronum/download.php?file=(2)[mod]part2.pk3",
        "filename" : "(2)[mod]part2.pk3",
        "dir" : "page1/"
    }
},
...,
{
    "_id" : ObjectId("6001bb7313489352fe70a7b3"),
    "url" : "ftp://ftp.fu-berlin.de/pc/games/idgames/levels/strife/lamasery.zip",
    "state" : "FETCHED",
    "source" : "doomworld",
    "metadata" : {
        "id" : NumberInt(12123),
        "title" : "The Silenced Lamasery",
        "dir" : "levels/strife/",
        "filename" : "lamasery.zip",
        "size" : NumberInt(50584),
        "age" : NumberInt(1059195600),
        "date" : "2003-07-26",
        "author" : "Mike Fredericks (Gokuma)",
        "email" : "chohatsu@yahoo.com",
        "description" : "The first Strife pwad available on the internet as far as I know. This is a good size deathmatch level that should be suitable for anywhere from 2 to 8 players.<br><br> Two player performance tested with a 700mhz comp connected to a 120mhz comp with a 19200 serial connection and it ran good. I tried to keep the detail at reasonably medium level, not too high or too low.<br><br> Multiplayer varies substantially in Strife opposed to Doom. I suggest reading the additional gameplay notes further down in this text file and also Strife's multi.txt.<br><br> I recommend patching Strife up to v1.31.",
        "rating" : 3.2857,
        "votes" : NumberInt(14),
        "url" : "https://www.doomworld.com/idgames/levels/strife/lamasery",
        "idgamesurl" : "idgames://levels/strife/lamasery.zip"
    }
}

 

Note the root-level state field - this is key. It records whether the entry has been processed (i.e. whether the WAD file has been downloaded).

 

 - start.py. The framework for crawling and retrieving the metadata, and inserting into the queue database. On its own it doesn't do anything, because it needs to know which concrete crawler to use. There are several but the one you should look at is doomworldcrawler.py. You can choose which concrete crawler to run with CLI args

 

 - fetcher.py - this looks at the queue and retrieves the next available NOTFETCHED queue entry. The relative path to the link is used to generate a folder structure in [deploy]/downloads/. See below... NOTE: This is a one-shot CLI script, which is why there are shell/batch files, to repeat it.

 

 - server.py - runs a simple Flask-based REST API for accessing and adding to the metadata - created for a Chrome extension to add arbitrary links to the queue via a REST AJAX framework. Wraps much of the backend, so it is effectively the V of MVC.
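A simplified sketch of the one-shot fetch step described in the list above: take the next NOTFETCHED entry, mirror its relative `dir` under a downloads folder, save the file and flip the state to FETCHED. A plain list stands in for the MongoDB queue, the function names are hypothetical, and the `download` callable is injected so the sketch needs no network; it is not the real fetcher.py.

```python
import tempfile
from pathlib import Path


def next_unfetched(queue):
    """Return the first entry whose state is NOTFETCHED, or None."""
    for entry in queue:
        if entry["state"] == "NOTFETCHED":
            return entry
    return None


def target_path(entry, downloads_root):
    """Mirror the entry's relative dir under the downloads folder."""
    meta = entry["metadata"]
    return Path(downloads_root) / meta["dir"] / meta["filename"]


def fetch_one(queue, downloads_root, download):
    """Process a single queue entry; return the saved path, or None if nothing is left.

    download is a callable taking a URL and returning the file's bytes.
    """
    entry = next_unfetched(queue)
    if entry is None:
        return None
    path = target_path(entry, downloads_root)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(download(entry["url"]))
    entry["state"] = "FETCHED"
    return path


# Demo with a fake downloader and a throwaway downloads folder.
with tempfile.TemporaryDirectory() as root:
    demo_queue = [{
        "url": "ftp://ftp.fu-berlin.de/pc/games/idgames/levels/strife/lamasery.zip",
        "state": "NOTFETCHED",
        "source": "doomworld",
        "metadata": {"dir": "levels/strife/", "filename": "lamasery.zip"},
    }]
    saved = fetch_one(demo_queue, root, lambda url: b"fake wad bytes")
    saved_relative = saved.relative_to(root).as_posix()
    second = fetch_one(demo_queue, root, lambda url: b"")
```

Because each call handles exactly one entry, something outside (a shell loop or batch file) has to call it repeatedly until it returns None, which matches the one-shot note above.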

 

Given that fetcher.py works out the file path for the downloaded WADs from the URL's relative path, it would be straightforward to modify it to build file paths based on author names. I suggest using the first THREE letters only, because there are nearly 7000 unique author names in 13.5k records (based on my harvest yesterday of IDGames /levels entries). This count doesn't take into account case, or re-registered people with nearly identical usernames, which may reduce it somewhat.
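One possible shape for the author-based path, as a sketch only. The helper names, the upper-casing, the stripping of non-alphanumeric characters and the "UNKNOWN" fallback are all assumptions; only the first-three-letters idea and the `author` and `filename` metadata fields come from the thread.

```python
import re
from pathlib import PurePosixPath


def author_bucket(author, length=3):
    """First `length` alphanumeric characters of the author name, upper-cased."""
    letters = re.sub(r"[^A-Za-z0-9]", "", author)
    return letters[:length].upper() or "UNKNOWN"


def author_path(metadata, root="downloads"):
    """Build root/<bucket>/<filename> from a queue entry's metadata."""
    bucket = author_bucket(metadata.get("author", ""))
    return PurePosixPath(root) / bucket / metadata["filename"]
```

For the lamasery.zip entry shown earlier (author "Mike Fredericks (Gokuma)") this gives `downloads/MIK/lamasery.zip`. Upper-casing the bucket also merges authors who differ only by case, which trims the directory count a little.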

 

I'll have a crack at this later on.

 

>>> EDIT <<<

Looks like I need to tweak the Doomworld crawler. The others seem OK though.

 

 

 

 

Edited by smeghammer


@houston - OK I added a first cut of an IDGames crawler that sorts by author. Hmmm...

 

As I noted above, there are many thousands of different authors, and even with taking the first three letters as the directory name, that's still a crap tonne of new directories... 

 

image.png

 

Given my other WIP dev branch is the metadata collected from the IDGames repository and stored in a SQLite database - which explicitly collects the author and map name etc. - it might be best to have a web view onto that, driven by API calls onto the database. That way, it could easily be a sortable, searchable UI. Given that the metadata I collect records the eventual download path, it would be straightforward to build a download/open link to wherever it was stored (dependent of course on you not deleting or moving them...).

 

It's currently in the DW_BY_AUTHOR branch of the GH repo linked in the OP.

 

It basically works though - /CLI has @Clippy's maps, /SME has mine, and so on.

 

I'll merge this, and the idgames_metadata_sqlite branch, in due course. There's some refactoring I need to do (linting and shit) for the merge to work properly...

Edited by smeghammer

  • 5 weeks later...

Ping! 

 

Changes to the Realm 667 crawler to account for DOM changes (tested and working; there are about 1000 repository items). Mainly alterations in the meganav.

 

Updated minor version to 1.2.

