Episode: 126
Title: HPR0126: Ripping the Web
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr0126/hpr0126.mp3
Transcribed: 2025-10-07 11:50:43

---

[Music]

Hello and welcome to this episode of Hacker Public Radio, with your host, operat0r. It's entitled Ripping the Web, the way you want to. This episode will go over some techniques and tools you can use to rip web pages and the content from them, and to circumvent the security measures used to protect those pages.

I'm going to go ahead and talk about web page rippers now: really, any web page automation tool that has recursive options. We have wget and friends, links, lynx. Generally, if you want to do just a basic text-based dump, maybe to dump out to a database somewhere or something, I use lynx with -dump and a width of 999. It pretty cleanly dumps the web page, and you can parse out the text you need later. It's good for database stuff like that.

We've got HTTrack and WinHTTrack: great filters, great rules, whitelists. You can filter out all kinds of things, follow streams, follow tags, change the agent, everything. As far as an out-of-the-box web page ripper goes, this is the best one, or at least the one you want to start with; then go from there and figure out exactly what parts aren't ripping and what parts are.

We've got curl: pretty popular, does cookie support, agent hacks, stuff like that. You can also incorporate curl into PHP if you want to do interactive scripts, as far as logging in, sending a cookie, downloading something. On my website you'll find videos for ripping Flash template sites, and a separate one for ripping pages that use things like cookies or referrer checking or JavaScript to help prevent their pages from getting ripped.

For streaming media, I've found Replay Media Catcher works pretty well. It's fairly portable; you don't have to do a full install of it or anything like that. It's good for streams, and it has support for, basically, MIME types: any kind of MIME type you can add in there. For example, if you want to download just JPEGs, you can add JPEG in there. But it does have a minimum size limit; it only catches files of 500K and up, so if whatever you're trying to download is under 500K, which I think is the minimum size, it won't get picked up. It also has some issues with sites like Veoh, where you can only get the segment before the first commercial instead of getting the whole thing. And WebEx I've had some problems with too; I haven't been able to figure out what MIME type it uses or how to pick that up.

Here are things you'll need to know before you start ripping a website. For troubleshooting or trying to figure things out, you'll want to start with some Firefox plugins. Live HTTP Headers will show you the headers: the server tag, the target, what it's POSTing, all that good yummy stuff. Web Developer is good for the same thing, and it's what I've found for converting POSTs to GET requests. It also has cookie manipulation, where you can view cookies, set cookies, change cookies, delete cookies, and see what happens on the server side; if you delete a cookie and still get the results you want, you don't necessarily have to worry about that cookie. DownThemAll, of course, has multi-thread support and swarming. It works right off a web page, and it's pretty much the downloader I'll use for grabbing a large number of files.
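To make that tool roundup concrete, here's a minimal sketch of the kinds of invocations being described; the URL, output directory, and filter patterns are my own placeholders for illustration, not from the episode:

    # plain-text dump of a page, wide enough that lines don't wrap
    lynx -dump -width=999 http://example.com/page.html > page.txt

    # mirror a site with httrack, whitelisting one path ('+' includes a pattern, '-' excludes)
    httrack http://example.com/ -O ./mirror '+example.com/photos/*' '-*forum*'

    # curl with a cookie jar (-b reads cookies, -c writes them) and a spoofed agent (-A)
    curl -b cookies.txt -c cookies.txt -A 'Mozilla/5.0' http://example.com/gallery.php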
Some basic things to work out, as far as the web page is concerned, before you start: does the content actually come from that host? Or does it come from a different host, or from multiple hosts? You want to narrow that down: OK, it's gallery.com, but the actual images come from albums.com, so you have to recognize that your content is actually going to come from albums.com. Is everything on the same path? For example, if it's albums.com, is it all under /photos on albums.com, or is it scattered all over the place in different paths? If it's in one specific folder path, then you can just filter out everything except that. So, for example, if the site has forums or an FAQ or something like that, you can go ahead and filter all of that out and not have to worry about picking up extra content. Does it have security checks, for example referrer checking? Does it set a cookie, or have JavaScript that loads a cookie? Or something like a Flash raw socket connection, which will override the browser's proxy settings? Or any agent tag checks, anything that secures the site's actual content?

Next I'll go over ripping Flash. This works for just about any SWF site. You might have undesirable results on the decoding, but you know, that's how decoding works. It's going to work with pretty much any template site; they generally work the same, going more toward XML flat-file databases and loading up files through XML. You're going to want URL Snooper, which will track the URLs that the SWF, the Flash movie, is downloading. You're going to need wget with the -x flag, which keeps the path information intact so that you can just copy and paste and be done with it, with the correct path names and destinations. And you'll want SWF Decompiler, version 3 and up; I'm currently running 3.6, which works very well. It's actually portable without an install: you can just copy a couple of DLLs over with it and you're good to go.

For any other details on that, there's the actual video that covers all that good stuff. But basically, how it works is you start off with URL Snooper, get the SWF, and once you've navigated through the Flash movie, copy and paste all the URLs to all the content. Paste them into Notepad, add wget -x in front, which keeps the paths intact, and you'll have a dump of all the files that were loaded with that SWF. The only thing left to do is decompile the SWF (of course, download the SWF first, then decompile it), and you'll have the source to that SWF along with all the files you accessed using it. All this is pretty much covered in the video tutorial. You might have to pause it at some points, but it pretty much covers everything you'll have to do to get it working right.
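Here's a rough sketch of that wget step, assuming the URLs captured by URL Snooper have been pasted into a file called urls.txt (the file name and URL are mine, for illustration):

    # prepend 'wget -x' to every captured URL; -x recreates the full
    # host/path directory tree locally, so the decompiled SWF can find
    # its files at the same relative locations
    sed 's|^|wget -x |' urls.txt > fetch.sh
    sh fetch.sh

    # grab the SWF itself too, then open it in SWF Decompiler
    wget -x http://example.com/movie.swf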
The next part I'll go over is ripping sites that have JavaScript, cookie, or referrer-based protection; there's also a video on my site for that. In it I show how I went about ripping a proxy list site that used JavaScript, and also referrer and cookie checks. It starts off with you navigating the site in Firefox and looking at what's going on. OK, a cookie is being set: well, turn off cookies and see what happens; do I still get what I want? Or turn off the referrer, so it won't send a referrer; do I still get what I want? Turning things like that on and off inside Firefox will help a lot in figuring out how to circumvent whatever security the site is using.

As far as what I use for JavaScript interpretation, I picked SpiderMonkey, and I kind of went and made my own JavaScript interpreter; of course, you're all welcome to do that. It does a pretty decent job of getting everything to run and spitting out the right text or data that you want. It's fairly straightforward: you feed it some JavaScript, and you might have to do some sed magic to get it to parse out just right, but generally, even if it's a dynamic site, SpiderMonkey is pretty good at running the JavaScript you want to drive.

Next, referrer checking. Sites use referrer checking so that if you come to a page from a different location, it won't let you view that page, or it'll give you an error or redirect you somewhere else. Generally, that's what a lot of sites use to check whether you came from where you're supposed to be coming from.

PHP with curl is good for automated login scripts, or other interaction, and other things besides. You can go from a shell script into PHP with curl, and then back into shell, and eventually get everything you need to rip. Along with the video, I've posted the source under scripts/proxy, and you can pretty much just download all of that. It's a very rough, disgusting source, as a basic example for a site that uses cookies, referrer checks, and JavaScript to hide its code or keep you from downloading the site.

Some other examples would be a gallery ripping script, for Coppermine or whatever you have. The easiest thing to start off with on those, especially if the site has thumbnails, is something like HTTrack, preferably WinHTTrack. It's good to start with because if you're not familiar with the syntax, it's a point-and-click thing, and you can add and change the syntax and save different profiles. That'll help you out before you get to the actual command-line deal. You can do things like grab images prefixed with tn_ (thumbnails, for example), or files in a certain directory, anything like that. And you'll have options like follow robots.txt or turn robots off.

Another good example is using wget with SHOUTcast to grab the first 25 feeds and then throwing that through a parser to get only 480 feeds. Depending on how advanced the site is, you can pretty much find a way to use wget or curl, whatever you're partial to; the key is knowing how far they went to secure or hide their content.

Another good example is a Snort login script I did in PHP with curl: log in, go to the downloads, download the current snapshot, automatically apply it, restart Snort, all that stuff. There's also a Cingular check script: basically, it logs in, checks your minutes, and if you're at a certain amount, it dumps out to a text file and emails you whatever minutes you've used, or whether you're over your minutes. The source for that is on my site also.
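As a rough illustration of those cookie, referrer, and login tricks in shell: the site, form fields, and file paths below are hypothetical, and the last line assumes SpiderMonkey's standalone js shell is installed.

    # log in once and save the session cookie
    curl -c cookies.txt -d 'user=me&pass=secret' http://example.com/login.php

    # replay the cookie (-b), fake the referrer (-e) and the agent (-A);
    # try -e '' as well, to see whether the referrer check even matters
    curl -b cookies.txt -e 'http://example.com/list.php' -A 'Mozilla/5.0' \
         -O http://example.com/files/proxylist.txt

    # run a snippet of the page's JavaScript through SpiderMonkey's js shell
    # and capture whatever it print()s for use back in the shell script
    echo 'var port = 3128; print("port=" + port);' | js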
For more advanced sites, you'll also see things like Flash, people trying to hide their code with Flash. Then again, you can use the decompiler, see what's going on, see what's loading, and see if you can reproduce it without using Flash. Sometimes there's not a whole lot you can do: once they go to the XML socket type of security that's an option with Flash, there's not a whole lot you can do there that I know of. Other things they'll use are raw socket connections that won't even listen to your proxy settings, so that's another thing Flash can do to help protect the source.

You can get the code to anything I've talked about, and of course the two videos, on Flash ripping and on ripping sites that have JavaScript; all that's on my site, rmccurdy.com, that's R-M-C-C-U-R-D-Y dot com.

Thank you for listening to Hacker Public Radio. HPR is sponsored by caro.net, so head on over to C-A-R-O dot N-E-T for all of us here. Thanks for listening.