June 1, 2026

Audily

Turn any PDF or EPUB into an audiobook with AI-powered text extraction and natural-sounding narration.

Python, Next.js, AI, TTS, Docker, Ollama, Self-Hosted

The Idea

I read a lot of technical books and papers, but I don't always have time to sit down and actually read them. Audiobook versions rarely exist for niche stuff, and the ones generated by basic TTS sound terrible. I wanted something that could take any PDF or EPUB, extract the actual content (not the copyright notices and table of contents), and turn it into a decent-sounding audiobook I can listen to while doing other things.

How It Works

Audily runs a multi-step pipeline on every document you upload. First, it extracts text using PyMuPDF for PDFs or ebooklib for EPUBs. If the PDF is scanned (no selectable text), it falls back to OCR with Tesseract. Then comes the interesting part — it uses a local LLM running on Ollama to clean up the extracted text.
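As a rough illustration of the extraction step with its OCR fallback, here is a minimal sketch. It is not Audily's actual code: the function names, the 20-character threshold for deciding that a page has no selectable text, and the 300 DPI render setting are all assumptions made for the example. It uses PyMuPDF (imported as fitz), pytesseract, and Pillow, which are imported lazily so the helper can be used without them.

```python
import io


def page_needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when it has almost no selectable text.

    The 20-character threshold is an assumption chosen for illustration.
    """
    return len(text.strip()) < min_chars


def extract_pdf_text(path: str, min_chars: int = 20) -> list[str]:
    """Return one text string per page, using OCR only where needed."""
    # Third-party imports are local so page_needs_ocr works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text()
            if page_needs_ocr(text, min_chars):
                # Render the page to an image and hand it to Tesseract.
                pixmap = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pixmap.tobytes("png")))
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return pages
```

Deciding per page rather than per document means a mostly digital PDF with a few scanned inserts only pays the OCR cost for those pages.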

The LLM does a 4-pass evaluation to filter out noise that you don't want narrated: copyright notices, DRM boilerplate, ISBNs, headers, footers, table of contents entries, and index pages. It also detects chapter boundaries and builds a chapter tree from the document structure.
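To make the filtering idea concrete, here is a hedged sketch of a multi-pass check against a local Ollama server. The post does not say what the four passes actually are, so the four questions below, the YES/NO protocol, and the function names are all hypothetical. The only parts taken as given are the model name (qwen3:8b) and Ollama's default local endpoint for text generation.

```python
import json
import urllib.request
from typing import Callable

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

# Hypothetical passes: the real four passes are not described in the post.
PASS_QUESTIONS = [
    "Is this a copyright notice, DRM boilerplate, or an ISBN line?",
    "Is this a running page header or footer?",
    "Is this a table of contents entry?",
    "Is this part of an index page?",
]


def ask_ollama(prompt: str, model: str = "qwen3:8b") -> str:
    """Send one non-streaming prompt to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def is_noise(block: str, ask: Callable[[str], str] = ask_ollama) -> bool:
    """Run every pass; a block is noise as soon as one pass says YES."""
    for question in PASS_QUESTIONS:
        reply = ask(f"{question} Answer only YES or NO.\n\n{block}")
        # Drop any reasoning the model emits before its final answer.
        answer = reply.split("</think>")[-1].strip().upper()
        if answer.startswith("YES"):
            return True
    return False


def filter_blocks(blocks: list[str], ask: Callable[[str], str] = ask_ollama) -> list[str]:
    """Keep only the blocks that no pass flagged as noise."""
    return [block for block in blocks if not is_noise(block, ask)]
```

Passing the `ask` callable in makes the filter easy to exercise with a stub instead of a running model.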

The Pipeline

Text extraction — parse the document and pull out raw text, with OCR fallback
Layout analysis — detect headings, TOC, indexes, headers and footers
Content filtering — 4-pass LLM evaluation to remove publishing noise
Chapter detection — build a semantic chapter tree from the structure
Image captioning — describe images using BLIP-2 so they're mentioned in the narration
TTS synthesis — generate MP3 audio for each chapter with neural voices
Timeline assembly — create seek timestamps for synchronized playback
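The chapter detection step in the list above can be pictured as turning a flat sequence of detected headings into a tree. The sketch below shows one common way to do that with a stack. It assumes the layout analysis has already produced (level, title) pairs with level 1 as the top level; the `Chapter` class and `build_chapter_tree` function are hypothetical names, not Audily's own.

```python
from dataclasses import dataclass, field


@dataclass
class Chapter:
    title: str
    level: int
    children: list["Chapter"] = field(default_factory=list)


def build_chapter_tree(headings: list[tuple[int, str]]) -> Chapter:
    """Nest (level, title) headings under a synthetic root node."""
    root = Chapter(title="(document)", level=0)
    stack = [root]  # path from the root to the most recent heading
    for level, title in headings:
        level = max(1, level)  # guard so the root is never popped
        # Climb back up until the top of the stack is a shallower heading.
        while stack[-1].level >= level:
            stack.pop()
        node = Chapter(title=title, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root
```

A heading that skips a level (for example a level 3 directly under a level 1) is simply attached to the nearest shallower heading, which keeps the tree usable for imperfectly structured documents.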

Listening Experience

The frontend is a Next.js app where you manage your library and listen to your audiobooks. The cool part is synchronized text highlighting — as the audio plays, the current paragraph is highlighted and the view scrolls along. You can also click any paragraph to jump to that point in the audio. It sounds like a small thing, but it makes it way more usable than just dumping an MP3.
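The logic behind synchronized highlighting and click-to-seek is small enough to sketch. The real player lives in the Next.js frontend, so this Python version only illustrates the idea; it assumes each paragraph's audio duration in seconds is known after synthesis, and the function names are made up for the example.

```python
from bisect import bisect_right
from itertools import accumulate


def build_timeline(durations: list[float]) -> list[float]:
    """Return the start time of each paragraph from its audio duration."""
    # Running totals give end times; shift by one to get start times.
    return [0.0, *accumulate(durations)][:-1]


def paragraph_at(starts: list[float], position: float) -> int:
    """Index of the paragraph playing at `position` seconds (for highlighting)."""
    if not starts:
        raise ValueError("timeline is empty")
    return max(0, bisect_right(starts, position) - 1)


def seek_time(starts: list[float], paragraph_index: int) -> float:
    """Audio position to jump to when a paragraph is clicked."""
    return starts[paragraph_index]
```

Because the start times are sorted, the lookup is a binary search, so it stays cheap even when it runs on every playback time update.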

Fully Local

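Everything except the default TTS runs on your own hardware. The LLM runs locally through Ollama (using qwen3:8b), and OCR runs locally too. TTS uses edge-tts (Microsoft's free neural voices, which actually sound really good), but edge-tts gets its audio from Microsoft's online service, so the text being narrated is sent there. Keeping data on my machine was a hard requirement for me since I process work-related documents, so for anything sensitive the TTS step needs a local engine such as Chatterbox.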
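For reference, generating one MP3 per chapter with edge-tts can look roughly like this. It is a sketch under assumptions: the file naming scheme, the function names, and the choice of the en-US-AriaNeural voice are illustrative, not taken from Audily. Note that edge-tts needs network access, because the audio is produced by Microsoft's online service.

```python
import asyncio
import re


def chapter_filename(index: int, title: str) -> str:
    """Build a stable, filesystem-safe MP3 name for a chapter (hypothetical scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{index:03d}-{slug or 'untitled'}.mp3"


async def synthesize_chapter(text: str, out_path: str,
                             voice: str = "en-US-AriaNeural") -> None:
    """Write narrated audio for one chapter to out_path."""
    import edge_tts  # imported lazily; requires the edge-tts package

    await edge_tts.Communicate(text, voice).save(out_path)


def synthesize_book(chapters: list[tuple[str, str]]) -> list[str]:
    """Synthesize (title, text) chapters in order and return the file names."""
    paths = []
    for index, (title, text) in enumerate(chapters, start=1):
        path = chapter_filename(index, title)
        asyncio.run(synthesize_chapter(text, path))
        paths.append(path)
    return paths
```

Zero-padded indexes keep the files in reading order when sorted by name.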

If you have a GPU available, you can optionally use Chatterbox for even higher-quality TTS, or spin up vast.ai GPU workers for faster processing. But the base setup runs fine on a regular machine — just slower.

Tech Stack

Layer	Technology
Backend	Python 3.12, FastAPI, Celery
Frontend	Next.js 15, Tailwind CSS
Database	PostgreSQL 16
Queue	Redis 7 + Celery workers
LLM	Ollama (qwen3:8b)
TTS	edge-tts, optional Chatterbox
OCR	Tesseract + Kraken fallback
Auth	JWT + TOTP 2FA

Deployment

The whole thing runs in Docker Compose. For a minimal setup, a Hetzner CX22 at around 4 euros per month is enough. If you want GPU acceleration for faster processing, you can set up a hybrid architecture — keep the API and database on a cheap VPS and spin up GPU workers on vast.ai only when you're processing new books.
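One way to get that hybrid split with Celery is to route the heavy tasks to a dedicated queue that only the GPU workers consume. The sketch below assumes this approach; the task names, queue names, and broker URL are hypothetical, and the post does not say how Audily actually divides the work.

```python
# Hypothetical task names: the heavy steps go to a queue served by GPU workers.
GPU_TASK_ROUTES = {
    "audily.tasks.synthesize_chapter": {"queue": "gpu"},
    "audily.tasks.caption_images": {"queue": "gpu"},
}
DEFAULT_QUEUE = "cpu"  # everything else stays on the cheap VPS


def queue_for(task_name: str) -> str:
    """Queue a task would land on under the routing table above."""
    return GPU_TASK_ROUTES.get(task_name, {"queue": DEFAULT_QUEUE})["queue"]


def make_celery_app(broker_url: str = "redis://localhost:6379/0"):
    """Create a Celery app that applies the routing table."""
    from celery import Celery  # imported lazily; requires the celery package

    app = Celery("audily", broker=broker_url)
    app.conf.task_default_queue = DEFAULT_QUEUE
    app.conf.task_routes = GPU_TASK_ROUTES
    return app


# On the VPS:         celery -A <module> worker -Q cpu
# On a GPU instance:  celery -A <module> worker -Q gpu
```

With this layout, GPU tasks simply wait in Redis until a GPU worker connects, so the rented instance only needs to be up while there are books to process.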