A new robot crawled my site on 2nd Jan. I know because it hit my download script, which is prohibited by my robots.txt. The download log that was emailed to me showed the following info:
IP Address: 216.179.125.69
Host: tubgirl.biz
User-agent: WebVulnCrawl.blogspot.com/1.0 libwww-perl/5.803
Date: 2006-01-02 19:24:23
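For reference, a robots.txt rule that prohibits a script in this way looks something like the following (the path here is hypothetical, not my actual one):

    # Ask all well-behaved crawlers to keep away from the download script
    User-agent: *
    Disallow: /cgi-bin/download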
My interest piqued, I followed the trail. It's a project run by Dennis Brown; his first post explains it best.
At this time, not many other folks seem to have commented on the WebVulnCrawl robot: Technorati gives nothing, and a Google search currently turns up a single blog entry, which seems quite negative in its reaction. I'm personally not that bothered. Dennis' research quite intrigues me, and I for one will be subscribing to his feed and waiting for his results, due in March.
For those who can't quite see what Dennis is trying to do, it recalled for me this great article on The Register - Crackers use search engines to exploit weak sites:
"HotBot advanced search allows you to specify your search with file extensions, looking for sites or directories that include .dat files and the words 'index of' and 'admin' or 'customer'", Utreg says.
He showed us a file named data.txt on ISP Lanline.com's servers which contained the personal information of several hundred people, including their names, addresses, social security numbers and credit card account details - and all of it in plain text.
The article includes other tips on how one might locate such data:
Nothing listed in a 'robots.txt' file will turn up in a search query; but once a person has seen the directory and file names it contains, they can type them directly into their browser to access the various subdirectories and pages which the site administrators would rather keep hidden. These are of course the very subdirectories and files most likely to be of interest to crackers.
The article finishes, of course, by noting:
For Web site operators afraid of falling prey to such backdoor inquiries, the solution is painfully obvious and quite simple. Stop putting sensitive data in public places. A file which you would not print out and post on a billboard simply has no business being posted on a Web site.
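To make the quoted trick concrete, here is a minimal Python sketch (purely illustrative - Dennis' bot identifies itself as libwww-perl, so this is not his code) of how easily anyone can list the paths a robots.txt asks crawlers to avoid:

    import urllib.request

    def disallowed_paths(site):
        # Fetch the site's robots.txt and pull out the value of every Disallow rule.
        with urllib.request.urlopen(site.rstrip("/") + "/robots.txt") as resp:
            text = resp.read().decode("utf-8", errors="replace")
        return [line.split(":", 1)[1].strip()
                for line in text.splitlines()
                if line.lower().startswith("disallow:")]

    print(disallowed_paths("http://example.com"))

Every path it prints is one the site owner explicitly flagged as off-limits - exactly the hint a curious visitor needs.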
Dennis' bot is an attempt at researching how many web site operators are doing precisely this. Oops. If Dennis is professional with the way he handles the data he collects, notifying any vulnerable sites of their accidentally exposed data, then all power to WebVulnCrawl say I.
Epilogue:
Amusingly, browsing to tubgirl.biz (the hostname returned by a reverse lookup of the WebVulnCrawl IP) never seems to load an index page, but a Google search for the domain gives the following snippet. Nice touch, Dennis.
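For anyone wanting to repeat the reverse lookup, it's a one-liner in Python:

    import socket

    # gethostbyaddr returns (hostname, alias list, IP address list)
    print(socket.gethostbyaddr("216.179.125.69")[0])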
Keeping it light-hearted, should WebVulnCrawl come visiting here again, it will find a new page to analyse. First person to post the "phrase of the day" in a comment will win instant fame and fortune ;)
4 comments
that secret passwords thing is great!
I heard about this bot too, over at: http://www.carelessthought.com/?p=1040
-W
Rather than commenting on his blog, I thought I'd comment here. First, I run several web sites, and the site he came to isn't even googled, because my robots.txt file disallows crawling of the whole site. There may be external links to the site, but the site itself is not googled, except for robots.txt.
Second, the directory he looked for was /mtfiles. This failed because my Movable Type directories are on a different website on the same server. He didn't find that directory name either in robots.txt or on Google because, as I said, it doesn't exist on the site he looked at. And he couldn't do much with it anyway, because I've renamed the comment script, etc.
So the claim that he's only going through googled links or hacks to find sites is bogus. FYI, here's his perusal of my site (with site links etc. blocked out):
2006-03-03 20:29:08 216.179.125.69 - (local server) 192.168.0.1 80 GET /robots.txt - 200 www.*******.com WebVulnCrawl.blogspot.com/1.0+libwww-perl/5.803 -
2006-03-03 20:29:09 216.179.125.69 - (local server) 192.168.0.1 80 GET /index.html - 200 www.*******.com WebVulnCrawl.blogspot.com/1.0+libwww-perl/5.803 -
2006-03-03 20:29:11 216.179.125.69 - (local server) 192.168.0.1 80 GET /mtfiles/ - 302 www.*******.com WebVulnCrawl.blogspot.com/1.0+libwww-perl/5.803 -
2006-03-03 20:29:11 216.179.125.69 - (local server) 192.168.0.1 80 GET /mtfiles/ - 403 www.*******.com WebVulnCrawl.blogspot.com/1.0+libwww-perl/5.803 -
2006-03-03 20:29:12 216.179.125.69 - (local server) 192.168.0.1 80 GET / - 404 www.*******.com WebVulnCrawl.blogspot.com/1.0+libwww-perl/5.803 -
As you can see, he wasn't able to find the /mtfiles directory he was looking for (because that's not where I keep them). He's now been banned from ALL of my websites.
@julie:
I think you have misunderstood. You say:

"So the claim that he's only going through googled links or hacks to find sites is bogus."

I've not seen anyone claim that's what he's doing.
What he is doing is simply making his way one-by-one through a big list of dotcoms. Your dotcom is obviously one of them, as is mine. For each dotcom, his bot looks at the robots.txt and deliberately does the opposite of what a regular bot does - it downloads all the disallowed content for later analysis.
Why? Because some people are silly enough to put sensitive restricted material on their websites and then try to "hide" it by disallowing bots from indexing it. They think this means that because it isn't listed in a search engine, no-one will be able to find it, so it will be safe. But of course anyone can download it once they know the URL, so it is far from safe. In fact, by listing it in robots.txt, a foolish webmaster is providing the means to work out the URL, thereby actually advertising the fact that the "hidden" data exists!
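In code terms, the bot presumably works something like this rough Python sketch - an assumption on my part, since the real bot is libwww-perl, and the domain list, user-agent string and output layout here are all invented for illustration:

    import pathlib
    import urllib.request

    DOMAINS = ["example.com", "example.org"]  # stand-in for the big list of dotcoms
    OUT = pathlib.Path("harvest")             # where downloaded pages are saved

    def fetch(url):
        req = urllib.request.Request(
            url, headers={"User-Agent": "inverse-crawl-sketch/0.1"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()

    for domain in DOMAINS:
        try:
            robots = fetch(f"http://{domain}/robots.txt").decode("utf-8", errors="replace")
        except OSError:
            continue  # no robots.txt means nothing to invert
        # Collect the paths a well-behaved crawler would skip...
        paths = [line.split(":", 1)[1].strip()
                 for line in robots.splitlines()
                 if line.lower().startswith("disallow:")]
        # ...then download exactly those paths, saving each for later analysis.
        for path in paths:
            if not path:  # a bare "Disallow:" permits everything, so skip it
                continue
            try:
                body = fetch(f"http://{domain}{path}")
            except OSError:
                continue
            name = path.strip("/").replace("/", "_") or "root"
            dest = OUT / domain / name
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(body)

Nothing here requires any cleverness, which is rather the point: robots.txt itself is the treasure map.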
Dennis' bot is automatically gathering the material for his research which will reveal how widespread such foolishness is. He will have to manually analyse all the data collected to work out if any of it is sensitive material. Hopefully he can help a few webmasters secure their stuff properly.
So WebVulnCrawl downloaded all the disallowed stuff on my site, which piqued my curiosity but didn't cause me any trouble or reveal any embarrassing secrets. I'm surprised at what you say it did on your site - why would it look for /mtfiles if that is not in your robots.txt? It didn't do that on my site. Perhaps Dennis' implementation is a bit buggy?