Audily
Turn any PDF or EPUB into an audiobook with AI-powered text extraction and natural-sounding narration.
The Idea
I read a lot of technical books and papers, but I don't always have time to sit down and actually read them. Audiobook versions rarely exist for niche stuff, and the ones generated by basic TTS sound terrible. I wanted something that could take any PDF or EPUB, extract the actual content (not the copyright notices and table of contents), and turn it into a decent-sounding audiobook I can listen to while doing other things.
How It Works
Audily runs a multi-step pipeline on every document you upload. First, it extracts text using PyMuPDF for PDFs or ebooklib for EPUBs. If the PDF is scanned (no selectable text), it falls back to OCR with Tesseract. Then comes the interesting part: a local LLM running on Ollama cleans up the extracted text.
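To make the OCR fallback concrete, here is a minimal sketch of the extraction step for PDFs. It is not Audily's actual code: the `MIN_CHARS_PER_PAGE` threshold, the function names, and the idea of passing the OCR step in as a callable are all assumptions for illustration. The only PyMuPDF calls used are `fitz.open` and `page.get_text`.

```python
from typing import Callable, List

# Assumed threshold: a page with fewer stripped characters than this is
# treated as scanned. The real cutoff may differ.
MIN_CHARS_PER_PAGE = 20


def needs_ocr(page_text: str, min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """Return True when a page has too little selectable text to trust."""
    return len(page_text.strip()) < min_chars


def extract_pages(pdf_path: str, ocr_page: Callable[[object], str]) -> List[str]:
    """Extract text page by page, calling `ocr_page` only where needed.

    `ocr_page` is a hypothetical hook that receives a PyMuPDF page and
    returns its text, for example by rendering it and running Tesseract.
    """
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    texts: List[str] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                text = ocr_page(page)
            texts.append(text)
    return texts
```

Deciding per page rather than per document means a mostly digital PDF with a few scanned inserts only pays the OCR cost for those pages.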
The LLM does a 4-pass evaluation to filter out noise you don't want narrated: copyright notices, DRM boilerplate, ISBNs, headers, footers, table of contents entries, and index pages. It also detects chapter boundaries and builds a chapter tree from the document structure.
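One way to picture a multi-pass filter is as a vote. The sketch below assumes each pass gives an independent yes/no verdict on a text block and that a block is dropped once enough passes flag it. That voting rule, the `min_votes` default, and every name here are assumptions, and a toy keyword matcher stands in for the LLM so the example runs on its own.

```python
from typing import Callable, Iterable, List

# Hypothetical signature: a judge takes (block_text, pass_index) and returns
# True when it considers the block publishing noise. In the real pipeline
# this role would be played by the local LLM.
Judge = Callable[[str, int], bool]


def filter_blocks(
    blocks: Iterable[str],
    judge: Judge,
    passes: int = 4,
    min_votes: int = 3,
) -> List[str]:
    """Keep the blocks that fewer than `min_votes` passes flag as noise."""
    kept: List[str] = []
    for block in blocks:
        votes = sum(1 for i in range(passes) if judge(block, i))
        if votes < min_votes:
            kept.append(block)
    return kept


def keyword_judge(block: str, pass_index: int) -> bool:
    """Toy stand-in for the LLM: flags blocks with obvious boilerplate markers."""
    markers = ("copyright", "isbn", "all rights reserved")
    lowered = block.lower()
    return any(marker in lowered for marker in markers)
```

With a real model, each pass could use a different prompt or focus on a different kind of noise. Requiring several votes makes it less likely that a single bad verdict deletes real content.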
The Pipeline
- Text extraction — parse the document and pull out raw text, with OCR fallback
- Layout analysis — detect headings, TOC, indexes, headers and footers
- Content filtering — 4-pass LLM evaluation to remove publishing noise
- Chapter detection — build a semantic chapter tree from the structure
- Image captioning — describe images using BLIP-2 so they're mentioned in the narration
- TTS synthesis — generate MP3 audio for each chapter with neural voices
- Timeline assembly — create seek timestamps for synchronized playback
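The timeline step at the end of the list is essentially a running sum. As a minimal sketch, assuming the synthesis step reports how long each paragraph's audio lasts, the seek timestamps can be built like this. The `Cue` structure and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    paragraph_index: int
    start: float  # seconds from the start of the chapter audio
    end: float


def build_timeline(durations: List[float]) -> List[Cue]:
    """Turn per-paragraph audio durations into cumulative start/end cues."""
    cues: List[Cue] = []
    cursor = 0.0
    for index, duration in enumerate(durations):
        cues.append(Cue(index, cursor, cursor + duration))
        cursor += duration
    return cues
```

For durations of 2.5, 4.0 and 1.5 seconds, the paragraphs start at 0.0, 2.5 and 6.5 seconds, and the chapter ends at 8.0 seconds.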
Listening Experience
The frontend is a Next.js app where you manage your library and listen to your audiobooks. The cool part is synchronized text highlighting: as the audio plays, the current paragraph is highlighted and the view scrolls along. You can also click any paragraph to jump to that point in the audio. It sounds like a small thing, but it makes the result far more usable than a bare MP3.
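Both directions of the sync come down to a lookup in a sorted list of paragraph start times. The sketch below shows the logic in Python to stay consistent with the backend. The real frontend would do the equivalent in the browser, and the function names are hypothetical.

```python
from bisect import bisect_right
from typing import List


def paragraph_at(starts: List[float], position: float) -> int:
    """Index of the paragraph playing at `position` seconds.

    `starts` holds each paragraph's start time in ascending order.
    """
    if not starts:
        raise ValueError("timeline is empty")
    # bisect_right counts the paragraphs that have already started;
    # the last of those is the one currently playing.
    return max(0, bisect_right(starts, position) - 1)


def seek_target(starts: List[float], paragraph_index: int) -> float:
    """Audio position to jump to when a paragraph is clicked."""
    return starts[paragraph_index]
```

Binary search keeps the lookup cheap enough to run on every playback time update, even for chapters with thousands of paragraphs.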
Fully Local
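Almost everything runs on your own hardware. The LLM runs locally through Ollama (using qwen3:8b), and OCR runs locally too. The one exception in the default setup is TTS: edge-tts uses Microsoft's free neural voices, which actually sound really good, but it gets them by calling Microsoft's online speech service, so the text being narrated is sent there. Document extraction, filtering, and OCR never leave your machine, which was a hard requirement for me since I process work-related documents.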
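Talking to a local Ollama server needs nothing beyond the standard library. This is a minimal sketch, assuming Ollama is listening on its default address (`localhost:11434`) and using its `/api/generate` endpoint with streaming turned off. The helper names and the timeout are assumptions, not Audily's code.

```python
import json
import urllib.request

# Ollama's default local address; adjust if your server is configured differently.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen3:8b") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen3:8b", timeout: float = 120.0) -> str:
    """Send the prompt and return the model's text reply."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Because the URL points at `localhost`, the prompt and the document text inside it stay on the machine running Ollama.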
If you have a GPU available, you can optionally use Chatterbox for even higher-quality TTS, or spin up vast.ai GPU workers for faster processing (those are rented remote machines, so that work does leave your hardware). But the base setup runs fine on a regular machine, just slower.
Tech Stack
| Layer | Technology |
|---|---|
| Backend | Python 3.12, FastAPI, Celery |
| Frontend | Next.js 15, Tailwind CSS |
| Database | PostgreSQL 16 |
| Queue | Redis 7 + Celery workers |
| LLM | Ollama (qwen3:8b) |
| TTS | edge-tts, optional Chatterbox |
| OCR | Tesseract + Kraken fallback |
| Auth | JWT + TOTP 2FA |
Deployment
The whole thing runs in Docker Compose. For a minimal setup, a Hetzner CX22 at around 4 euros per month is enough. If you want GPU acceleration for faster processing, you can set up a hybrid architecture: keep the API and database on a cheap VPS and spin up GPU workers on vast.ai only when you're processing new books.