LiteParse: Fast Open-Source Document Parser for AI Agents

✍️ OpenClawRadar | 📅 Published: March 21, 2026

LiteParse is an open-source document parser focused on fast, local parsing with spatial text extraction and bounding boxes. It runs entirely locally without cloud dependencies or GPU requirements, processing hundreds of pages in seconds.

Key Features

  • Apache 2.0 licensed open-source tool
  • Spatial text parsing with bounding boxes for precise text positioning
  • No dependency on local or frontier VLMs (Vision Language Models)
  • Runs on any machine without GPU requirements
  • Supports multiple file formats: PDFs, Office documents, images
  • Higher accuracy, according to its creators, than comparable tools such as PyPDF, PyMuPDF, and MarkItDown
  • One-line installation as a skill for 40+ AI agents including Claude Code, Cursor, OpenClaw, Windsurf
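To illustrate why bounding boxes matter for downstream agents, here is a minimal sketch of how positioned text fragments could be regrouped into reading order. The `TextItem` shape, its field names, the top-left origin, and the `toReadingOrder` helper are all hypothetical and chosen for illustration; LiteParse's actual output schema may differ.

```typescript
// Hypothetical shape for a positioned text fragment. These field names are
// assumptions for illustration, not LiteParse's documented schema.
interface TextItem {
  text: string;
  x: number; // left edge of the bounding box, in page units
  y: number; // top edge of the bounding box (origin assumed at top-left)
  width: number;
  height: number;
}

// Group fragments into lines by vertical proximity, then order each line
// from left to right and join the fragments with spaces.
function toReadingOrder(items: TextItem[], lineTolerance = 2): string[] {
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines: TextItem[][] = [];
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    // An item joins the current line if its top edge is close to the
    // top edge of the first item on that line.
    if (last && Math.abs(last[0].y - item.y) <= lineTolerance) {
      last.push(item);
    } else {
      lines.push([item]);
    }
  }
  return lines.map((line) =>
    line
      .sort((a, b) => a.x - b.x)
      .map((i) => i.text)
      .join(" ")
  );
}
```

A fixed tolerance is the simplest possible line-grouping rule; multi-column layouts would need a column-detection step before this one.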

Installation Options

CLI Tool Installation:

npm i -g @llamaindex/liteparse

Then use:

lit parse document.pdf
lit screenshot document.pdf

For macOS and Linux via Homebrew:

brew tap run-llama/liteparse
brew install llamaindex-liteparse

Agent Skill Installation:

npx skills add run-llama/llamaparse-agent-skills --skill liteparse

Usage Examples

Basic parsing:

lit parse document.pdf
lit parse document.pdf --format json -o output.json
lit parse document.pdf --target-pages "1-5,10,15-20"
lit parse document.pdf --no-ocr
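The `--target-pages` value mixes single pages and inclusive ranges. The sketch below shows one plausible reading of that syntax by expanding a spec into explicit page numbers. The `expandPageSpec` helper is hypothetical, and the assumption that pages are 1-based with inclusive ranges is inferred from the examples rather than confirmed by the tool's documentation.

```typescript
// Expand a page specification such as "1-5,10,15-20" into a sorted list of
// unique page numbers. Assumes 1-based pages and inclusive ranges.
function expandPageSpec(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page token: "${token}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    for (let p = start; p <= end; p++) pages.add(p);
  }
  return Array.from(pages).sort((a, b) => a - b);
}
```

Under these assumptions, "1-5,10,15-20" selects 12 pages in total.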

Batch parsing:

lit batch-parse ./input-directory ./output-directory
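For agents or scripts that shell out to the CLI, the `lit parse` flags shown in this section can be assembled programmatically. This is a minimal sketch: `ParseOptions` and `buildParseArgs` are hypothetical names, only flags that appear in the examples above are used, and `"text"` as a `--format` value is an assumption (only `json` is shown explicitly).

```typescript
// Hypothetical options object mirroring the CLI flags shown in this article.
interface ParseOptions {
  format?: "json" | "text"; // "text" is assumed; only "json" appears above
  output?: string; // passed to -o
  targetPages?: string; // e.g. "1-5,10,15-20"
  noOcr?: boolean; // adds --no-ocr when true
}

// Build the argument list for `lit parse` without going through a shell,
// so file names containing spaces or quotes need no escaping.
function buildParseArgs(file: string, opts: ParseOptions = {}): string[] {
  const args = ["parse", file];
  if (opts.format) args.push("--format", opts.format);
  if (opts.output) args.push("-o", opts.output);
  if (opts.targetPages) args.push("--target-pages", opts.targetPages);
  if (opts.noOcr) args.push("--no-ocr");
  return args;
}
```

The resulting array could then be handed to Node's `execFile("lit", args)` from `node:child_process`, which runs the binary directly rather than via a shell.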

Screenshot generation (useful for LLM agents):

lit screenshot document.pdf -o ./screenshots
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots
lit screenshot document.pdf --dpi 300 -o ./screenshots
lit screenshot document.pdf --target-pages "1-10" -o ./screenshots
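When choosing a `--dpi` value, it helps to estimate the resulting image size, since large screenshots cost more tokens when passed to a vision-capable model. The sketch below uses the standard PDF convention of 72 points per inch; the `renderedSize` helper is hypothetical, and it assumes LiteParse scales pages in the conventional way, which has not been verified here.

```typescript
// PDF page dimensions are expressed in points, where 1 point = 1/72 inch.
const POINTS_PER_INCH = 72;

// Estimate the pixel dimensions of a page rendered at a given DPI.
function renderedSize(
  widthPt: number,
  heightPt: number,
  dpi: number
): { width: number; height: number } {
  const scale = dpi / POINTS_PER_INCH;
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// A US Letter page is 612 x 792 points (8.5 x 11 inches).
const letterAt300 = renderedSize(612, 792, 300); // { width: 2550, height: 3300 }
const letterAt150 = renderedSize(612, 792, 150); // { width: 1275, height: 1650 }
```

Halving the DPI cuts the pixel count to a quarter, which is often a reasonable trade-off when the agent only needs to see page layout rather than fine print.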

Library Usage

Install as a dependency:

npm install @llamaindex/liteparse
# or
pnpm add @llamaindex/liteparse

Basic usage:

import { LiteParse } from '@llamaindex/liteparse';
const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse('document.pdf');
console.log(result.text);

Buffer/Uint8Array input (no disk I/O):

import { LiteParse } from '@llamaindex/liteparse';
import { readFile } from 'fs/promises';
const parser = new LiteParse();
const pdfBytes = await readFile('document.pdf');
const result = await parser.parse(pdfBytes);

Technical Details

  • Flexible OCR system with built-in Tesseract.js (zero setup)
  • Supports HTTP servers for OCR (EasyOCR, PaddleOCR, custom)
  • Standard OCR API specification
  • Multiple output formats: JSON and Text
  • Standalone binary with no cloud dependencies
  • Multi-platform support: Linux, macOS (Intel/ARM), Windows

For complex documents with dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs, the creators recommend LlamaParse, their cloud-based document parser built for production document pipelines.

📖 Read the full source: HN AI Agents
