Self-healing scrapers

Of course I’d rather talk to an API to download data from a website. APIs are stable and provide a structured view of the information you’re after. But sometimes you’re going to have to use a scraper, which means downloading the webpage containing the information: a page that has been optimized for humans, not computers.

For one of our projects we administer a pool of scrapers that aggregates data from different websites for internal use. By automating the information retrieval step we save ourselves the trouble of having to manually visit 50+ websites to check what’s new.

Although aggregating the information saves us effort, we now have to spend time monitoring and fixing the scrapers whenever something changes on the websites we visit. Breaking changes, such as a new layout, occur roughly once a week. Every time this happens, one of our developers needs to analyze the problem and fix it.

All in all this is not a huge price to pay for having easy access to the data in question. We have a robust monitoring system in place that detects errors while scraping, and the devs have built up enough expertise to fix any kind of problem within the hour.
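As a rough sketch, the detection step can be as simple as checking each scraper's output against the fields we know the site provides. The schema and function below are made up for illustration; they are not our actual monitoring code:

```python
# Hypothetical validation step: flag a scraper as broken when the
# records it returns are missing fields we know the site provides.
REQUIRED_FIELDS = {"title", "url", "published_at"}  # assumed schema

def detect_failure(records: list[dict]) -> list[str]:
    """Return a list of problems found in the scraped records."""
    problems = []
    if not records:
        problems.append("scraper returned no records")
    for i, record in enumerate(records):
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            problems.append(f"record {i} is missing fields: {sorted(missing)}")
    return problems
```

A check like this is also exactly what a coding agent would need later on: a precise description of what is missing, rather than a vague "the scraper is broken".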

But it can get a bit tedious. And it takes time out of the budgets we have for other projects that are more exciting or deliver more value to our clients.

So, why not look at coding agents to cover this?

Whenever we detect one of our scrapers is failing, we can have a coding agent inspect the site, see what has changed, and fix the scraper. Since we know what kind of data to expect from each site, we’d be able to tell the agent exactly what is missing so that the agent can determine what needs to be fixed. The agent can test the fix and create a merge request so that one of our devs can have a final look before pushing to production.

What would we achieve with building something like this?

  • The turn-around time for scraper issues would be minimized: as soon as a scraper fails, a coding agent can step in to fix it. The data we aggregate stays up to date and consistent.
  • Our developers can stay focused on other, more valuable and interesting tasks, such as figuring out new ways of extracting trending topics from time series data, or how to project saliency maps onto 3D assets.
  • We’d learn a lot about the possibilities and limitations of coding agents. Will they be able to inspect a site and fix the scraper? What will their fixes look like? Will the resulting code remain maintainable for humanoids?
  • Scaling the number of scrapers would no longer imply having to spend more time administering and maintaining them. We’d expect the time involved with scraper management to flatten out.

All in all, interesting results that could make it worthwhile to build a proof of concept! But we can also imagine downsides to this approach:

  • Having a coding agent inspect a website can consume a lot of tokens. One simple test run where we asked devstral to analyze one of the websites we scrape ended up costing us 10M tokens (the HTML sources were huge). We’d have to figure out some kind of proxy system to limit token spending.
  • Scraper maintenance is a great way for junior devs joining the team to get familiar with the system and the code-base. We’ll need to find other entry level stuff to get them up to speed.
  • The agent will most likely not be able to fix every problem. When a password for a site needs to be changed, I wouldn’t want a coding agent changing it without me knowing. And sometimes the data provided by the site is just lacking; how will the agent know it should stop looking for something that isn’t there?
  • What’s the carbon footprint here? How does firing up a set of GPUs measure up against a junior developer fueled by snacks and sandwiches?
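On the token-spending point: one crude mitigation (a sketch, not our actual proxy) is to strip scripts, styles, and redundant whitespace from the page before handing it to the agent, since those rarely help it locate the data:

```python
import re

def shrink_html(html: str, max_chars: int = 50_000) -> str:
    """Crudely reduce a page's size before sending it to a coding agent.

    A regex pass over HTML is a blunt instrument, but for token-budget
    purposes losing a little fidelity is acceptable.
    """
    # Drop <script> and <style> blocks entirely.
    html = re.sub(r"<(script|style)\b.*?</\1>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Collapse runs of whitespace into single spaces.
    html = re.sub(r"\s+", " ", html)
    # Hard cap as a last resort against enormous pages.
    return html[:max_chars]
```

Whether trimming like this leaves enough context for the agent to diagnose a layout change is exactly the kind of thing a proof of concept would have to measure.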

Hmm, we’ll have to crunch the numbers to reach a conclusion on this plan. Using agents to maintain scraper code sounds like something that is fun to build, and it might make sense in terms of economics and developer happiness. But let’s see if the win is big enough to warrant what will no doubt be weeks of wrangling coding agents through scraper issues.

Also, keep in mind that this all started with wanting to save time browsing 50+ individual websites. Building a complex, self-healing pool of scrapers might be a bit over the top.

