Blogs

Yahoo! Pipes, Caching and robots.txt

Every time I create or use a pipe, I'm indirectly causing hits on some third party's website.  So I was curious to learn how the Yahoo! Pipes backend behaves.  What caching does it do?  Is it a good, well-behaved web citizen in general?

There isn't much official documentation to go on.  The Pipes Troubleshooting guide has some notes on how to stop Pipes from downloading a feed too frequently and how to stop Pipes from using feeds at all.
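To make the "stop Pipes from using feeds at all" idea concrete, here's a minimal Python sketch of how a polite fetcher could check a site's robots.txt before downloading anything. It assumes robots.txt is the opt-out mechanism (the teaser above doesn't spell that out), and the user agent name ExampleFeedBot, the rules and the example.com URLs are all made up for illustration; this is not how the Pipes backend is known to work.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that wants one bot kept away from its feeds.
# "ExampleFeedBot" is an invented name, not the real Pipes user agent.
ROBOTS_TXT = """\
User-agent: ExampleFeedBot
Disallow: /feeds/

User-agent: *
Disallow:
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for agent in ("ExampleFeedBot", "OtherBot"):
        for url in ("http://example.com/feeds/rss.xml", "http://example.com/about.html"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

A well-behaved fetcher would run a check like this before every download and simply skip any URL that comes back disallowed.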

So I put myself in the shoes of the third-party website that the pipes hit, to find out more.  ...read more »

Yahoo! Pipes Tutorial - An example using the Fetch Page module to make a web scraper

Yahoo! recently released a new Fetch Page module which dramatically increases the number of useful things that Pipes can do.  With this new "pipe input" module we're no longer restricted to working with well-organised data sets in supported formats such as CSV, RSS, Atom, XML, JSON, iCal or KML.  Now we can grab any HTML page we like and use the power of the Regex module to slice and dice the raw text into shape.

In a nutshell, the Fetch Page module turns Yahoo! Pipes into a fully fledged web scraping IDE!
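For anyone who hasn't tried Pipes, the fetch-then-regex idea looks roughly like this in plain Python. This is only an analogy and not what the Fetch Page or Regex modules do internally; the function names and the choice of scraping h2 headings are assumptions made up for the example.

```python
import re
from urllib.request import urlopen

# Hypothetical target: the text of every <h2> heading on a page.
HEADING_RE = re.compile(r"<h2[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def fetch_page(url, timeout=10):
    """Download a page and return its body as text (the 'Fetch Page' step)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

def extract_headings(html):
    """Slice headings out of raw HTML with a regex (the 'Regex' step)."""
    # Strip any tags nested inside the heading, then trim whitespace.
    return [TAG_RE.sub("", inner).strip() for inner in HEADING_RE.findall(html)]

if __name__ == "__main__":
    sample = '<h2 class="t">One</h2><p>x</p><H2><a href="#">Two</a></H2>'
    print(extract_headings(sample))
```

Regexes on raw HTML are brittle, which is exactly the trade-off a quick scraper accepts: they break when the page layout changes, but they're fast to write and easy to patch.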

As it happens, I already have a web scraping project which has been broken for some time now.  I don't have the energy to check out the hacky old PHP scrapers and debug the problem.  But with Yahoo! Pipes and the Fetch Page module to hand, I can throw away my PHP scripts and their associated libraries, delete the cron jobs and free my overloaded webserver from the onerous responsibility.  Time to get cracking.  ...read more »

AVG Anti-Virus and Internet (In)Security

Oh dear. I've just logged in to our home PC and been greeted by a wonderful pop up advert for AVG Internet Security.

AVG Anti-Virus launches browser as SYSTEM user

This is a real shame. I've been using Grisoft's Anti-Virus product for years, and tactics like this are only going to make me look for alternatives. Especially when it appears that their "security" software has launched my browser as the SYSTEM user! Whoops.

Google Chart API

Google has launched the Google Chart API. With a simple URL we get a simple chart:

Google Chart API - Survey Results
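To show what "a simple URL" means in practice, here's a small Python sketch that builds one for a 3D pie chart. The parameter names (cht, chs, chd, chl) are the ones I remember from the Chart API documentation; the endpoint is assumed to be the launch-era address, and the survey numbers and labels are invented for the example.

```python
from urllib.parse import urlencode

# Assumed launch-era endpoint; treat it as illustrative.
CHART_ENDPOINT = "http://chart.apis.google.com/chart"

def pie_chart_url(values, labels, size=(250, 100)):
    """Build a chart URL for a 3D pie chart from values and labels."""
    params = {
        "cht": "p3",                                      # chart type: 3D pie
        "chs": f"{size[0]}x{size[1]}",                    # width x height in pixels
        "chd": "t:" + ",".join(str(v) for v in values),   # data in text encoding
        "chl": "|".join(labels),                          # one label per slice
    }
    # Keep ':', ',' and '|' unescaped so the URL stays readable.
    return CHART_ENDPOINT + "?" + urlencode(params, safe=":,|")

if __name__ == "__main__":
    # Made-up survey results: 60% yes, 40% no.
    print(pie_chart_url([60, 40], ["Yes", "No"]))
```

The whole chart is described by the query string, so the resulting URL can be dropped straight into an img tag.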

Disappointing end to WebVulnCrawl project

It's been nearly a year and a half since I blogged about the WebVulnCrawl bot, and slightly less than that since the crawling was completed by Dennis.

I was very eager to see the results, and had pestered Dennis via his blog several times over the months (comments from me and others have now disappeared from his blogspot for some reason). So I was initially surprised and then very interested to see a post from this now unfamiliar blog in my reader - "Long Overdue - The Final Post".

Dennis comments on some of the ridiculous overreactions that his project incited.  ...read more »
