Convert (almost) every document to Markdown
Microsoft has released its own document parser for LLM use!
Introducing MarkItDown, an open-source, one-stop tool for converting most common file types to Markdown. It's a good fit for text analysis, indexing, and more!
Here’s what makes it special:
↳ Converts PDF, Word, Excel, PowerPoint, images, and audio to Markdown
↳ Extracts EXIF metadata and OCR text from images, and transcripts from audio
↳ Available via CLI, Python API, or Docker
↳ Offers LLM-based image descriptions
↳ Supports batch conversions
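To make the Python API and batch-conversion points concrete, here is a minimal sketch rather than an official recipe. `MarkItDown().convert(...)` and `result.text_content` follow the usage shown in the project's README. The helper names `markitdown_to_text` and `batch_convert` are hypothetical, and writing one `.md` file per input is an assumption about what "batch" should mean here.

```python
from pathlib import Path
from typing import Callable, Iterable


def markitdown_to_text(path: Path) -> str:
    """Convert one file with MarkItDown (assumes `pip install markitdown`)."""
    # Imported lazily so batch_convert can be used with any other converter.
    from markitdown import MarkItDown

    result = MarkItDown().convert(str(path))
    return result.text_content


def batch_convert(
    files: Iterable[Path],
    out_dir: Path,
    to_markdown: Callable[[Path], str] = markitdown_to_text,
) -> list[Path]:
    """Write one .md file per input file and skip inputs that fail to convert."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in files:
        try:
            text = to_markdown(src)
        except Exception as exc:  # unsupported format, corrupt file, ...
            print(f"skipped {src.name}: {exc}")
            continue
        dest = out_dir / (src.stem + ".md")
        dest.write_text(text, encoding="utf-8")
        written.append(dest)
    return written
```

With the library installed, something like `batch_convert(Path("docs").glob("*.docx"), Path("out"))` would convert a folder. The converter is passed in as a parameter so the loop can be tried out, or pointed at a different tool, without changing anything else.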
“Technology is best when it brings people together.” – Matt Mullenweg
Comments
Had a quick look, and it seems for docx they actually use mammoth to first convert it into HTML, and then convert that to markdown. Just wondering how accurate the conversion is for something a bit more complex...
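To show why a docx → HTML → Markdown pipeline can struggle with complex documents, here is a toy version of the second stage, built only on Python's standard `html.parser`. This is not MarkItDown's or mammoth's actual code. The class `HtmlToMarkdown` and the function `html_to_markdown` are hypothetical names, and the sketch only handles headings, paragraphs, bold, italics, and flat list items.

```python
from html.parser import HTMLParser


class HtmlToMarkdown(HTMLParser):
    """Toy converter: headings, paragraphs, bold/italic, flat list items."""

    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}
    HEADINGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag == "li":
            self.parts.append("- ")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        # Any other tag (table, img, ...) is silently ignored.

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html: str) -> str:
    parser = HtmlToMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(html_to_markdown("<h1>Title</h1><p>Some <strong>bold</strong> text.</p>"))
# A table loses its structure because the tags are unknown to the converter:
print(html_to_markdown("<table><tr><td>a</td><td>b</td></tr></table>"))
```

Simple markup survives the trip, while anything the converter has no rule for (here, the table) collapses to bare text. A real converter covers far more tags, but the same kind of loss is the risk with merged cells, nested lists, footnotes, and similar structures.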
There are also Firecrawl, Readability, and Jina Reader in a similar space, plus some others I forgot.
Obsidian it shall be, this is the way.
Insert signature here, $5 tip required
Pandoc doesn't have that AI stuff but handles way more text formats.