MarkItDown is a new Python library from Microsoft that aims to convert everything to Markdown
Microsoft has introduced MarkItDown, a new Python-based tool that converts various file types, including Office documents, into Markdown, the lightweight markup language widely used for formatting text. Because it is a Python library, the conversion can be scripted and customized, which makes documentation workflows faster and more consistent.
This is particularly useful for developers, technical writers, and anyone who regularly works with both Office documents and Markdown, since it simplifies moving content from one format to the other. The release also reflects Microsoft's continued support for open-source tooling.
These formats are supported by MarkItDown:
- PDF (.pdf)
- PowerPoint (.pptx)
- Word (.docx)
- Excel (.xlsx)
- Images (EXIF metadata, and OCR)
- Audio (EXIF metadata, and speech transcription)
- HTML (special handling of Wikipedia, etc.)
- Various other text-based formats (csv, json, xml, etc.)
To add it to a virtual environment, simply run:
pip install markitdown
It has a super simple API:
from markitdown import MarkItDown
markitdown = MarkItDown()
result = markitdown.convert("test.xlsx")
print(result.text_content)
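Building on the snippet above, here is a minimal sketch of how a whole folder could be converted in one go. It relies only on the `convert()` call and the `text_content` attribute shown above. The names `convert_folder`, `make_markitdown_converter`, and `DEFAULT_SUFFIXES` are made up for this example and are not part of MarkItDown, and the suffix list is simply derived from the formats listed earlier.

```python
from pathlib import Path
from typing import Callable, Iterable

# Assumption: suffixes derived from the format list above; adjust to your needs.
DEFAULT_SUFFIXES = (".pdf", ".pptx", ".docx", ".xlsx", ".html", ".csv", ".json", ".xml")


def convert_folder(
    src: Path,
    dst: Path,
    to_markdown: Callable[[str], str],
    suffixes: Iterable[str] = DEFAULT_SUFFIXES,
) -> list[Path]:
    """Convert every matching file under src and write the results under dst.

    to_markdown takes a file path and returns Markdown text.
    The original file name is kept and ".md" is appended, so report.pdf and
    report.docx do not overwrite each other.
    """
    wanted = {s.lower() for s in suffixes}
    written = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        rel = path.relative_to(src)
        target = dst / rel.parent / (rel.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(to_markdown(str(path)), encoding="utf-8")
        written.append(target)
    return written


def make_markitdown_converter() -> Callable[[str], str]:
    """Wrap the API shown above; imported lazily and instantiated once."""
    from markitdown import MarkItDown

    md = MarkItDown()
    return lambda path: md.convert(path).text_content
```

With the package installed, `convert_folder(Path("docs"), Path("markdown"), make_markitdown_converter())` would walk a hypothetical `docs` folder and write the Markdown files into `markdown`. Passing the converter in as a function keeps the folder logic easy to try out without any real documents.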
Once the Python package is installed, you can also run it from the command line:
markitdown path-to-file.pdf > document.md
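If you would rather drive the command-line tool from a script, the redirect above can be reproduced with Python's standard `subprocess` module. This is only a sketch: `convert_with_cli` is a hypothetical helper name, and it assumes the `markitdown` command is on your `PATH` and writes the Markdown to standard output, as the redirect above implies.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def convert_with_cli(
    source: Path,
    target: Path,
    command: Sequence[str] = ("markitdown",),
) -> Path:
    """Run `<command> <source>` and write its standard output to target."""
    with target.open("wb") as out:
        # check=True raises CalledProcessError if the command exits with an error.
        subprocess.run([*command, str(source)], stdout=out, check=True)
    return target
```

Calling `convert_with_cli(Path("path-to-file.pdf"), Path("document.md"))` should then behave like the shell command above. Note that if the command fails, the target file may be left empty or incomplete.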
The following short snippet uses MarkItDown together with OpenAI to generate descriptions of images:
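from markitdown import MarkItDown
from openai import OpenAI
# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()
# Note: later MarkItDown releases name these parameters llm_client and llm_model
md = MarkItDown(mlm_client=client, mlm_model="gpt-4o")
result = md.convert("example.jpg")
# text_content holds the generated Markdown for the image
print(result.text_content)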
Pretty awesome, no?