Ever since GPT brought about a paradigm shift in December '22, changing the way I work, learn, and code, I have felt compelled to venture deeper into the world of large language models (LLMs).
My work on Lupus Scripts, a Chrome extension for web automation, spurred the idea of harnessing LLMs to build a tool that generates code for any data extraction task.
Such a tool would simplify data extraction immensely, and I saw myriad applications for the extracted data.
So I began working on it. Only recently did I discover competitors doing similar things, such as Zyte's Automatic Extraction and Product Hunt finds like BrowseAI and Hexofy.
To validate that the idea was feasible, I began with the backend. Since a full webpage is large and exceeds the context window of GPT models, I conceptualized a Virtual Document: a condensed view that presents only the most relevant parts of an entire webpage and enables GPT to explore it.
I implemented this virtual document as a straightforward Node TypeScript app that uses Playwright to open a URL and extract the loaded page's HTML. The HTML is then parsed with jsdom, which makes the browser's document API available in a Node environment. To condense the content, I removed “irrelevant” tags (like script, style, path, noscript, etc.), retained only “relevant” attributes (such as id, name, and data-*), and abbreviated lengthy text.
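A minimal sketch of that pipeline might look like the following. The tag list, attribute whitelist, and truncation length here are illustrative assumptions rather than the exact values I used:

```typescript
import { chromium } from "playwright";
import { JSDOM } from "jsdom";

// Tags whose contents carry no extractable data (illustrative list).
const IRRELEVANT_TAGS = ["script", "style", "path", "noscript", "svg", "meta", "link"];

// Attributes worth keeping so GPT can identify elements (data-* is kept separately).
const RELEVANT_ATTRS = ["id", "name"];

// Hypothetical cutoff for abbreviating long text nodes.
const MAX_TEXT_LENGTH = 80;

async function buildVirtualDocument(url: string): Promise<string> {
  // Load the page in a headless browser and grab the fully rendered HTML.
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle" });
  const html = await page.content();
  await browser.close();

  // Parse the HTML with jsdom so the document API works in Node.
  const dom = new JSDOM(html);
  const { document } = dom.window;

  // 1. Remove irrelevant tags entirely.
  for (const tag of IRRELEVANT_TAGS) {
    document.querySelectorAll(tag).forEach((el) => el.remove());
  }

  // 2. Strip every attribute except the relevant ones and data-*.
  document.querySelectorAll("*").forEach((el) => {
    for (const attr of Array.from(el.attributes)) {
      const keep = RELEVANT_ATTRS.includes(attr.name) || attr.name.startsWith("data-");
      if (!keep) el.removeAttribute(attr.name);
    }
  });

  // 3. Abbreviate lengthy text nodes.
  const walker = document.createTreeWalker(
    document.body,
    dom.window.NodeFilter.SHOW_TEXT
  );
  let node: Node | null;
  while ((node = walker.nextNode())) {
    const text = node.textContent ?? "";
    if (text.length > MAX_TEXT_LENGTH) {
      node.textContent = text.slice(0, MAX_TEXT_LENGTH) + "…";
    }
  }

  return document.body.outerHTML;
}
```

Calling `buildVirtualDocument("https://example.com")` would then yield a slimmed-down HTML string small enough to fit into a GPT prompt while preserving the page's structure.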