Lightfeed Extractor: TypeScript Library for Robust Web Data Extraction with LLMs

✍️ OpenClawRadar | 📅 Published: March 26, 2026

Lightfeed Extractor is a TypeScript library built for robust web data extraction using LLMs and Playwright browser automation. It targets two common failure modes in web scraping pipelines: traditional CSS selectors break when a site changes its layout, and raw LLM approaches struggle with HTML noise, malformed JSON output, and broken or relative URLs.

Key Features

  • HTML to LLM-ready markdown conversion: Extracts main content while stripping navigation bars, headers, footers, and tracking junk. Images can optionally be kept, and URLs can be cleaned.
  • LLM extraction with Zod schemas: Works with any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama) and uses Zod schemas for type-safe extraction with real validation.
  • JSON recovery: Sanitizes and recovers partial data from malformed LLM output instead of failing entirely. If 19 out of 20 products parse correctly, you get those 19.
  • Built-in browser automation: Uses Playwright with support for local, serverless, or remote browsers. Includes anti-bot patches for reliable web scraping.
  • AI browser navigation integration: Pairs with @lightfeed/browser-agent for AI-driven page navigation before extraction.
  • URL handling: Manages relative URLs, removes invalid ones, repairs markdown-escaped links, and cleans tracking parameters.
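
The URL handling described in the last bullet can be sketched with the standard `URL` class. This is only an illustration, not the library's implementation: the helper name `cleanUrl`, the list of tracking parameters, and the set of markdown-escaped characters are all assumptions made for the example.

```typescript
// Hypothetical helper, not part of @lightfeed/extractor.
// Assumed tracking parameters; a real list would likely be longer.
const TRACKING_PARAMS = new Set(["fbclid", "gclid", "ref"]);

function cleanUrl(raw: string, baseUrl: string): string | null {
  // Undo markdown escaping such as "\_" or "\&" that can leak into links.
  const unescaped = raw.trim().replace(/\\([_&~()\[\]*#-])/g, "$1");

  let url: URL;
  try {
    // Relative paths are resolved against the page URL.
    url = new URL(unescaped, baseUrl);
  } catch {
    return null; // unparseable: drop it rather than pass it downstream
  }

  // Keep only web links; "javascript:", "mailto:", etc. are treated as invalid.
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;

  // Remove utm_* and other assumed tracking parameters, keep the rest.
  for (const key of Array.from(url.searchParams.keys())) {
    if (key.startsWith("utm_") || TRACKING_PARAMS.has(key)) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}

const pageUrl = "https://example.com/shop/";
console.log(cleanUrl("/bread/sourdough?utm_source=news&id=7", pageUrl));
console.log(cleanUrl("javascript:void(0)", pageUrl));
```

Returning `null` for invalid links lets the caller decide whether to drop the field or the whole item, which fits optional fields such as `imageUrl` in a schema.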

Installation and Usage

Install via npm:

npm install @lightfeed/extractor

Then install your preferred LLM provider:

# OpenAI
npm install @langchain/openai
# Google Gemini
npm install @langchain/google-genai
# Anthropic
npm install @langchain/anthropic
# Ollama (local models)
npm install @langchain/ollama

Example usage for e-commerce product extraction:

import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { extract, ContentFormat, Browser } from "@lightfeed/extractor";
import { z } from "zod";

// Define schema for product catalog extraction
const productCatalogSchema = z.object({
  products: z.array(
    z.object({
      name: z.string().describe("Product name or title"),
      brand: z.string().optional().describe("Brand name"),
      price: z.number().describe("Current price"),
      originalPrice: z.number().optional().describe("Original price if on sale"),
      rating: z.number().optional().describe("Product rating out of 5"),
      reviewCount: z.number().optional().describe("Number of reviews"),
      productUrl: z.string().url().describe("Link to product detail page"),
      imageUrl: z.string().url().optional().describe("Product image URL")
    })
  ).describe("List of bread and bakery products")
});

// Create browser instance
const browser = new Browser({
  type: "local", // also supporting serverless and remote browser
  headless: false
});

The library is Apache 2.0 licensed and used in production at Lightfeed for data pipelines that scrape websites and extract structured data. It's designed for developers building web scraping workflows who want to avoid writing repetitive boilerplate for HTML cleanup, markdown conversion, LLM calls, JSON parsing, error recovery, and schema validation.
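
To make the error-recovery idea concrete, here is a minimal sketch of item-level recovery: each array entry is validated on its own, so one bad product does not discard the rest. It is not the library's code. The names `Product`, `isProduct`, and `recoverProducts` are hypothetical, a hand-written type guard stands in for a Zod schema to keep the example dependency-free, and repairing syntactically broken JSON is left out.

```typescript
// Hypothetical names throughout; this is not the @lightfeed/extractor API.
interface Product {
  name: string;
  price: number;
}

// Hand-written stand-in for a schema check (the library itself uses Zod schemas).
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.name === "string" &&
    typeof candidate.price === "number" &&
    Number.isFinite(candidate.price)
  );
}

// Validate each item independently and report how many were rejected.
function recoverProducts(llmOutput: string): { products: Product[]; rejected: number } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(llmOutput);
  } catch {
    // Syntactically broken JSON is out of scope for this sketch.
    return { products: [], rejected: 0 };
  }
  const items = (parsed as { products?: unknown } | null)?.products;
  if (!Array.isArray(items)) return { products: [], rejected: 0 };

  const products = items.filter(isProduct);
  return { products, rejected: items.length - products.length };
}

const sampleOutput = JSON.stringify({
  products: [
    { name: "Sourdough loaf", price: 6.5 },
    { name: "Baguette", price: "3.00" }, // price is a string, so this item is rejected
    { name: "Rye bread", price: 5 },
  ],
});
const recovered = recoverProducts(sampleOutput);
console.log(recovered.products.map((p) => p.name), recovered.rejected);
```

Reporting the rejected count alongside the valid items keeps partial results usable while still making data loss visible to the pipeline.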

📖 Read the full source: HN LLM Tools
