Scrape Anything with AI

Motivation

Ever since GPT brought about a paradigm shift in December '22, altering the way I work, learn and code, I have felt compelled to venture deeper into the world of large language models (LLMs).

My work on Lupus Scripts, a web automation Chrome extension, gave me the idea of using LLMs to build a tool that generates the code for any data extraction task.

This tool would simplify data extraction immensely. I saw a myriad of applications for the derived data:

Developers can turn websites into APIs at the snap of a finger.
Traders can harness online sentiment for strategic decisions.
Product scouts can swiftly gauge market demand from reviews.
Resellers can stay ahead with real-time stock alerts.
Retail strategists can keep an edge with live pricing and inventory insights.
Journalists can curate comprehensive news stories.
Real estate agents can offer their clients data-driven insights.
HR managers can optimize recruitment by sifting through online listings and reviews.
Researchers can expedite findings through automated literature scans.

So I began working on it. Only recently did I come across competitors doing similar things, such as Automatic Extraction by Zyte and Product Hunt tools like BrowseAI and Hexofy.

Backend

Virtual Document

To validate that this idea was feasible, I began with the backend. Because the HTML of a full web page is typically too large for the context limit of GPT models, I came up with the concept of a Virtual Document. This document would present only the most relevant parts of a webpage and let GPT explore the rest on demand.

I implemented this virtual document as a straightforward Node TypeScript app that uses Playwright to open a URL and extract the loaded page's HTML. This HTML is then parsed with jsdom, which makes the browser's document API available in a Node environment. To make the content more concise, I removed “irrelevant” tags (like script, style, path, noscript, etc.), retained only “relevant” attributes (such as id, name, data-*), and abbreviated lengthy text.
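
To make the three pruning steps concrete, here is a minimal sketch of the idea. It is not the actual implementation: to stay dependency-free it walks a plain object tree (the hypothetical VElement type) instead of a jsdom document, and the tag list, attribute list, and the 40-character cutoff are assumptions based on the examples above. In the real app, the same walk would run over the document that jsdom builds from the HTML Playwright returns.

```typescript
// Hypothetical plain-tree stand-in for a parsed DOM, so the sketch runs without jsdom.
interface VElement {
  tag: string;
  attrs: Record<string, string>;
  children: VNode[];
}
type VNode = VElement | string;

// Lists taken from the examples in the text; the exact sets are assumptions.
const IRRELEVANT_TAGS = ["script", "style", "path", "noscript"];
const RELEVANT_ATTRS = ["id", "name"];
const MAX_TEXT_LENGTH = 40; // assumed cutoff, the original value is not stated

function isRelevantAttr(name: string): boolean {
  return RELEVANT_ATTRS.indexOf(name) !== -1 || name.slice(0, 5) === "data-";
}

// Collapses whitespace and cuts text that is longer than `max` characters.
function abbreviate(text: string, max: number = MAX_TEXT_LENGTH): string {
  const collapsed = text.replace(/\s+/g, " ").trim();
  return collapsed.length > max ? collapsed.slice(0, max) + "..." : collapsed;
}

// Returns a pruned copy of the node, or null if the node should be dropped.
function prune(node: VNode): VNode | null {
  if (typeof node === "string") {
    const text = abbreviate(node);
    return text.length > 0 ? text : null;
  }
  if (IRRELEVANT_TAGS.indexOf(node.tag.toLowerCase()) !== -1) return null;

  const attrs: Record<string, string> = {};
  for (const name of Object.keys(node.attrs)) {
    if (isRelevantAttr(name)) attrs[name] = node.attrs[name];
  }

  const children: VNode[] = [];
  for (const child of node.children) {
    const kept = prune(child);
    if (kept !== null) children.push(kept);
  }
  return { tag: node.tag, attrs, children };
}

// Serializes the pruned tree back to compact markup (attribute values are not escaped).
function render(node: VNode): string {
  if (typeof node === "string") return node;
  const attrs = Object.keys(node.attrs)
    .map((name) => ` ${name}="${node.attrs[name]}"`)
    .join("");
  return `<${node.tag}${attrs}>${node.children.map(render).join("")}</${node.tag}>`;
}

const page: VElement = {
  tag: "div",
  attrs: { id: "product", class: "card card--wide", "data-sku": "A1" },
  children: [
    { tag: "script", attrs: {}, children: ["trackPageView();"] },
    { tag: "p", attrs: { style: "color:red" }, children: ["  A short   description.  "] },
  ],
};

const pruned = prune(page);
console.log(pruned === null ? "" : render(pruned));
// prints: <div id="product" data-sku="A1"><p>A short description.</p></div>
```

Returning a pruned copy rather than mutating the input keeps the original tree available, which matters if GPT later asks to expand a part of the page that was abbreviated.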