Scrape Anything with AI

live demo

Motivation

Ever since GPT brought about a paradigm shift in December '22, altering the way I work, learn, and code, I have felt compelled to dig deeper into the world of large language models (LLMs).

My work on Lupus Scripts, a web automation Chrome extension, gave me the idea of using LLMs to build a tool that generates the code for any data extraction task.

Such a tool would simplify data extraction immensely, and I saw a myriad of applications for the extracted data:

  1. Developers can turn websites into APIs at the snap of a finger.
  2. Traders can harness online sentiment for strategic decisions.
  3. Product scouts can swiftly gauge market demands from reviews.
  4. Resellers can stay ahead with real-time stock alerts.
  5. Retail strategists can keep an edge with live pricing and inventory insights.
  6. Journalists can curate comprehensive news stories.
  7. Real estate agents can offer their clients data-driven insights.
  8. HR managers can optimize recruitment by sifting through online listings and reviews.
  9. Researchers can expedite findings through automated literature scans.

So I started building it. Only recently did I come across competitors doing similar things, such as Automatic Extraction by Zyte and, on Product Hunt, BrowseAI and Hexofy.

Backend

Virtual Document

To validate that the idea was feasible, I began with the backend. Because the full HTML of a web page is usually too large to fit within the context limit of GPT models, I came up with the concept of a Virtual Document. This document would present only the most relevant parts of a webpage and let GPT explore the rest on demand.

I implemented this virtual document as a straightforward Node TypeScript app that uses Playwright to open a URL and extract the loaded page's HTML. The HTML is then parsed with jsdom, which makes the browser's document API available in a Node environment. To make the content more concise, I removed “irrelevant” tags (such as script, style, path, and noscript), kept only “relevant” attributes (such as id, name, and data-*), and truncated lengthy text.
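
The three condensing steps can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: to stay dependency-free it walks a hypothetical VNode tree instead of real jsdom elements, and the names prune, truncate, render, and the default 80-character text limit are all assumptions made for the example.

```typescript
// Hypothetical stand-in for a DOM element; the real app would walk jsdom nodes.
interface VNode {
  tag: string;
  attrs: { [name: string]: string };
  children: Array<VNode | string>;
}

// Tags the text lists as "irrelevant".
const IRRELEVANT_TAGS = ["script", "style", "path", "noscript"];

// Attributes the text lists as "relevant": id, name, and data-*.
const RELEVANT_ATTR = /^(id|name|data-.+)$/;

// Cut text to maxLength characters and mark the cut with "...".
function truncate(text: string, maxLength: number): string {
  return text.length > maxLength ? text.slice(0, maxLength) + "..." : text;
}

// Return a condensed copy of the tree, or null if the node itself is dropped.
// The default limit of 80 characters is an assumed value.
function prune(node: VNode, maxTextLength: number = 80): VNode | null {
  if (IRRELEVANT_TAGS.indexOf(node.tag.toLowerCase()) !== -1) return null;

  // Keep only the relevant attributes.
  const attrs: { [name: string]: string } = {};
  for (const name of Object.keys(node.attrs)) {
    if (RELEVANT_ATTR.test(name)) attrs[name] = node.attrs[name];
  }

  // Recurse into elements; trim, drop empty, and truncate text nodes.
  const children: Array<VNode | string> = [];
  for (const child of node.children) {
    if (typeof child === "string") {
      const text = child.trim();
      if (text.length > 0) children.push(truncate(text, maxTextLength));
    } else {
      const kept = prune(child, maxTextLength);
      if (kept !== null) children.push(kept);
    }
  }
  return { tag: node.tag, attrs, children };
}

// Serialize the condensed tree back to compact HTML-like text (no escaping).
function render(node: VNode | string): string {
  if (typeof node === "string") return node;
  const { tag, attrs, children } = node;
  const attrText = Object.keys(attrs)
    .map((name) => ` ${name}="${attrs[name]}"`)
    .join("");
  return `<${tag}${attrText}>${children.map((c) => render(c)).join("")}</${tag}>`;
}

const samplePage: VNode = {
  tag: "div",
  attrs: { id: "product", class: "card large", "data-sku": "A1" },
  children: [
    { tag: "script", attrs: {}, children: ["track()"] },
    {
      tag: "p",
      attrs: { style: "color:red" },
      children: ["A very long product description that goes on and on"],
    },
  ],
};

const pruned = prune(samplePage, 10);
if (pruned !== null) {
  console.log(render(pruned));
  // prints: <div id="product" data-sku="A1"><p>A very lon...</p></div>
}
```

In the sample, the script element and the class and style attributes disappear, while id and data-sku survive and the paragraph text is cut to ten characters. With jsdom, the same traversal would presumably run over the parsed document's elements rather than a hand-built tree.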