Every time I create or use a pipe, I'm indirectly causing hits on some third party's website, so I was curious to learn how the Yahoo! Pipes backend behaves. What caching does it do? Is it a good, well-behaved web citizen in general?
There isn't much official documentation to go on. The Pipes Troubleshooting guide has some notes on how to stop Pipes from downloading a feed too frequently and how to stop Pipes from using feeds at all.
So I put myself in the shoes of a third-party website that Pipes hits, to find out more. ...read more »
Yahoo! recently released a new Fetch Page module which dramatically increases the number of useful things that Pipes can do. With this new "pipe input" module we're no longer restricted to working with well-organised data sets in supported formats such as CSV, RSS, Atom, XML, JSON, iCal or KML. Now we can grab any HTML page we like and use the power of the Regex module to slice and dice the raw text into shape.
In a nutshell, the Fetch Page module turns Yahoo! Pipes into a fully fledged web scraping IDE!
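To give a feel for the fetch-then-regex idea outside of Pipes, here is a minimal Python sketch. It is not how the Fetch Page or Regex modules work internally; the function names `fetch_page` and `extract_items`, the User-Agent string, the sample HTML and the link pattern are all made up for illustration. The sample runs against an inline string so it doesn't hit anyone's server.

```python
import re
from urllib.request import Request, urlopen


def fetch_page(url, timeout=10):
    """Download a page and return its body as text.

    Hypothetical stand-in for what a "fetch page" step does; the
    User-Agent value is an arbitrary placeholder.
    """
    req = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_items(html, pattern):
    """Apply a regex with named groups and return one dict per match."""
    flags = re.DOTALL | re.IGNORECASE
    return [m.groupdict() for m in re.finditer(pattern, html, flags)]


# Inline sample so the example runs without any network access.
SAMPLE_HTML = """
<ul>
  <li><a href="/posts/1">First post</a></li>
  <li><a href="/posts/2">Second post</a></li>
</ul>
"""

# Assumed markup: plain anchors with a double-quoted href and no other attributes.
LINK_PATTERN = r'<a\s+href="(?P<link>[^"]+)">(?P<title>.*?)</a>'

items = extract_items(SAMPLE_HTML, LINK_PATTERN)
for item in items:
    print(item["title"], "->", item["link"])
```

For a live page you would pass the result of `fetch_page(some_url)` to `extract_items` instead of the sample string. Regexes like this are brittle against markup changes, which is part of the trade-off of scraping raw HTML rather than consuming a feed.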
As it happens, I already have a web scraping project which has been broken for some time now. I don't have the energy to check out the hacky old PHP scrapers and debug the problem. But with Yahoo! Pipes and the Fetch Page module to hand, I can throw away my PHP scripts and their associated libraries, delete the cron jobs and free my overloaded webserver from the onerous responsibility. Time to get cracking. ...read more »