
Episode: 2120
Title: HPR2120: WEBDUMP wmap EyeWitness phantomjs selenium
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2120/hpr2120.mp3
Transcribed: 2025-10-18 14:33:27
---
This is HPR episode 2120 entitled WEBDUMP wmap EyeWitness phantomjs selenium.
It is hosted by operat0r and is about 11 minutes long.
The summary is: automate the process of finding unique websites, removing duplicates, and getting screenshots.
Today's show is licensed under a CC Zero license.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15.
That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hello, my name is Robert McCurdy.
I'm doing this for Hacker Public Radio so I'm going to try to be as verbose as I can
for just an audio podcast.
What I'm going to do is explain a method I have of processing a bunch of websites, finding the unique ones, and then following up on them.
Here's the use case: you're on a security assessment scanning 30,000 hosts, and out of those 30,000 you're probably going to have 60,000 actual websites.
So what you can do is feed a CSV file of the host name and the port into this web dump script.
And essentially all it's going to do is download the first page for each of them, following redirects and all that.
It's going to try to guess the protocol: if the port is 443 it's going to use HTTPS. I haven't quite gotten it to detect HTTPS on other ports and switch over automatically.
So the first part of the script is you run Burp Suite, and it uses curl with Burp Suite as a proxy.
It will multi-thread the downloads and save the first web page of each site to a file in HTML format, which is okay except you're going to have a lot of duplicates, and you're actually going to be missing websites that are heavy on JavaScript or heavy on Flash.
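To make that first step concrete, here is a minimal sketch, not the actual WEBDUMP script: it reads a host,port CSV, guesses HTTPS when the port is 443, and pulls just the first page of each site through a local Burp Suite proxy with curl. The file names, output directory, and proxy address (Burp's default listener on 127.0.0.1:8080) are assumptions.

#!/bin/bash
# Sketch only -- not the author's WEBDUMP script.
# Assumes hosts.csv holds "hostname,port" lines and Burp Suite is listening on 127.0.0.1:8080.
PROXY="http://127.0.0.1:8080"
mkdir -p dump
while IFS=, read -r host port; do
    scheme="http"
    [ "$port" = "443" ] && scheme="https"   # guess HTTPS only for port 443
    curl --silent --insecure --location --max-time 15 \
         --proxy "$PROXY" \
         "${scheme}://${host}:${port}/" > "dump/${host}_${port}.html" &
done < hosts.csv
wait   # crude multi-threading: background each curl, then wait for them all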
So there's a second step to that so I'll go over that.
In this instance it's a Cygwin script that could probably be ported over to straight-up Linux fairly easily.
What it does is it sorts and downloads the website of everything in the scope.
So say, for example, this list of websites is thousands of entries long, and we don't know which of these are duplicates or even which of them are valid.
But it's a CSV file of the host name, then a comma, and a port. Sometimes I'll even feed it an entire scope, depending; I'll just remove things like SSH and other known non-web services first.
And then I'll feed that into web dump.
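For that pre-filtering, something as simple as the following would do. The port list and file names here are just examples of the idea, not what the author actually uses.

# Drop obvious non-web services (FTP, SSH, telnet, SMTP, RDP) from the scope CSV before feeding it to web dump
grep -Ev ',(21|22|23|25|3389)$' full_scope.csv > hosts.csv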
What you're going to end up with is that it more or less removes anything that has less than a three-line difference from another page.
So if two pages have fewer than three lines that are different, if I remember correctly, it will basically move one of them to a delete folder, a folder of files to be deleted.
So the idea is that it's going to hit a bunch of different hosts, ports and services using curl; it's going to follow redirects, accept self-signed certificates, and try to view each one of those sites.
But it's only going to try to view the first page; it's not going to crawl like Burp Suite would or anything else.
And what we're going to end up with at first is a bunch of files that are mostly duplicates, right, and then it goes through the process of removing the duplicates that are less than three lines different from each other.
Then you're going to end up with this list of websites that are actually different, right.
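Here is one way that de-duplication step could look, as I understand the description. The three-line threshold comes from the episode; the folder names and the pairwise diff approach are my assumptions, and this quadratic comparison is only meant to illustrate the idea, not to scale to tens of thousands of pages.

# Sketch: move near-duplicate pages (fewer than three differing lines) to a delete folder
mkdir -p delete
for a in dump/*.html; do
    [ -f "$a" ] || continue
    for b in dump/*.html; do
        [ "$a" = "$b" ] && continue
        [ -f "$b" ] || continue            # may already have been moved
        changed=$(diff "$a" "$b" | grep -c '^[<>]')   # count lines that differ
        if [ "$changed" -lt 3 ]; then
            mv "$b" delete/                # near-duplicate of $a
        fi
    done
done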
Now what we need to do is look at these manually with a browser; if it's a low number of websites you can just go to them by hand.
But what I've found is that I'll have maybe 1,200 different websites, and I don't particularly want to look at 1,200 different files and figure out which ones are duplicates and which aren't.
So what I'm trying to do is kind of bundle this web dump script along with a program called EyeWitness.
Now, EyeWitness is kind of a follow-up to something I did a video on, WMAP; that video was about the Burp Suite WMAP change-request-method active scanner.
It goes over the active scanner and using web dump to automatically crawl different websites and essentially auto-pwn them.
What I found is that WMAP is just not up to par; it doesn't scale very well, and it just uses the IP API to crawl, essentially.
So with that said, we need something a little stronger that supports multi-threading and does the downloads a bit quicker.
So what I've been doing is playing with EyeWitness.
Now what you can do with EyeWitness is use the web feature, which is Selenium, and there's also the headless feature, which uses PhantomJS.
The thing to understand about those two options for downloading a website automatically is that PhantomJS is essentially not going to do things like Flash.
It's going to do pretty much everything else, more or less; it's about as close as you can get to rendering a website without Flash or without using Selenium.
So those are the two options: if it's a big bunch of websites you want to use the headless mode, and if it's only a few, maybe a hundred or maybe even 600, you might get away with using Selenium.
The headless mode is definitely going to be much quicker, but you might miss things like Flash websites.
So if you're pretty sure there's some Flash in there, you might want to single those sites out and run Selenium directly against them.
So what we're going to do is create this input file; I already have it here.
This input file is going to have our de-duplicated list of sites that we want to double-check on.
Here we have something like seven websites that we want to take a look at manually; in a real case this would be more like 600 or so basically unique websites.
And since we have a low number we can actually use Selenium, which gives us a bit more of a chance of having everything look real, rendering the page with a real browser.
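The input file for EyeWitness is just a plain-text list, one URL per line. The hostnames below are placeholders, not the sites from the episode.

# Hypothetical urls.txt built from the de-duplicated dump
cat > urls.txt <<'EOF'
http://10.0.0.12:8080/
https://intranet.example.com/
https://10.0.0.40:8443/
EOF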
There's an installation procedure for EyeWitness, and there are a few caveats in there.
You just have to make sure that you get the dependencies set up correctly.
I didn't have something installed, I think it was PhantomJS, and it was erroring out, giving me all kinds of issues.
So the command is EyeWitness.py -f, which takes the input file; that was our input file of our seven websites, and we're going to use Selenium.
So we're going to do --web, and I can tell you that with these seven websites it might take up to eight seconds per site, more or less, sometimes.
And then sometimes it might even hang, but I think they have the ability to reconnect or save sessions, essentially.
But it's going pretty quickly even with just the --web option; we're almost done with the seven sites.
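Put together, the two invocations look roughly like this. Exact flag names vary between EyeWitness versions, so treat this as a sketch and check ./EyeWitness.py --help before relying on it.

# Selenium mode: drives a real browser; slower, but better for Flash/JavaScript-heavy pages
./EyeWitness.py -f urls.txt --web

# Headless mode: PhantomJS; much faster for large lists, but may miss Flash content
./EyeWitness.py -f urls.txt --headless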
And the idea here is that, instead of me manually looking at 300 HTML files, or manually copying and pasting a URL and clicking it, or making some kind of script to actually render the page,
I can use a tool like EyeWitness to automatically render them, or just use the headless mode, to show all the websites in scope, all the websites to look at.
It gives you some information here in the report listing, but really not that much information that you're going to get any value out of.
But what you want is the screenshots on the right side here.
So essentially what you get is a bunch of screenshots of whatever websites you told it to download.
So you can quickly go through here and say, oh, this is kind of interesting, we have a PHP error here on this website for whatever reason.
I guess their website was crapping out in the middle of us doing this test.
So anyway, with that said, we actually have a path disclosure on this website from when we were doing our testing.
We exhausted the memory on the PHP site and it gives us a path of /home/moron for moron.com.
Anyway, that's just a quick overview of how I'm chaining all this together to give me a quick picture of essentially every unique website that's going to be in scope.
Some other things that I'll backtrack and go over are what you can do while Burp Suite is running.
The great thing about that is that while web dump is running you can turn on the active scanner and essentially make it a kind of auto-pwn style scanner.
So you go to live active scanning, then scope, then just choose use custom scope, click add, and make it match any.
So now any website that Burp hits, it's going to basically auto-pwn.
Which is kind of interesting across the board if you do it on, you know, 2,000 websites.
You're probably guaranteed, you know, some kind of low-hanging fruit in that respect.
That's pretty much all the stuff I can go over.
Again, as I said, WMAP is a nice little plugin to play around with; it's slower, but it doesn't have any other requirements, it just needs Chrome.
So that's the plus there: if you're going to do just a handful of websites I would use WMAP, but if you're going to do 100 or 200 or 600 or more, you want to use something like EyeWitness along with the web dump script to get rid of the duplicates.
So anyway, that's pretty much the setup.
What we've done here is taken a list of 26 websites, dumped them out to HTML, removed the duplicates, and then fed those non-duplicate, unique items into EyeWitness so that we can take a quick screenshot of every single website that we may want to take a look at and say, oh look, this is some kind of Flash login.
What's the default login for this website? Or maybe there's an error there that you can follow up on.
Anyways, hope this helps somebody out and gets you aimed in the right direction.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club and is part of the binary revolution at binrev.com.
If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.