MarkItDown is a new Python library from Microsoft that aims to convert everything to Markdown

Microsoft has released MarkItDown, a Python-based tool that converts various file types, including Office documents, into Markdown, a lightweight markup language widely used for formatting text. Because it is a Python library, the conversion can be scripted, automated, and customized, which helps keep documentation workflows efficient and consistent.

MarkItDown GitHub repo

This is particularly useful for developers, technical writers, and anyone else who frequently works with both Office documents and Markdown, since it simplifies getting content out of other document formats and into Markdown. The release also reflects Microsoft's ongoing support for open-source tooling.

These formats are supported by MarkItDown:

  • PDF (.pdf)
  • PowerPoint (.pptx)
  • Word (.docx)
  • Excel (.xlsx)
  • Images (EXIF metadata, and OCR)
  • Audio (EXIF metadata, and speech transcription)
  • HTML (special handling of Wikipedia, etc.)
  • Various other text-based formats (csv, json, xml, etc.)
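
Before handing a mixed folder to MarkItDown, it can help to pre-filter files by extension. The sketch below is only an illustration: `LIKELY_SUPPORTED`, `looks_supported`, and `split_by_support` are hypothetical helper names, and the image, audio, and HTML extensions are assumptions, since the list above does not spell them out. MarkItDown itself decides what it can actually convert.

```python
from pathlib import Path

# Extensions taken from the list above, plus assumed ones for the
# image, audio, and HTML entries (the list does not name those).
LIKELY_SUPPORTED = {
    ".pdf", ".pptx", ".docx", ".xlsx",  # named explicitly in the list
    ".csv", ".json", ".xml",            # named explicitly in the list
    ".html", ".htm",                    # assumption
    ".jpg", ".jpeg", ".png",            # assumption
    ".wav", ".mp3",                     # assumption
}


def looks_supported(path):
    """Return True if the file extension is in LIKELY_SUPPORTED (case-insensitive)."""
    return Path(path).suffix.lower() in LIKELY_SUPPORTED


def split_by_support(paths):
    """Partition paths into (candidates, skipped), judging by extension only."""
    candidates, skipped = [], []
    for p in paths:
        (candidates if looks_supported(p) else skipped).append(p)
    return candidates, skipped
```

This check looks at file names only, so it is a convenience for skipping obvious non-candidates rather than a guarantee that a conversion will succeed.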

To add it to a virtual environment, simply run:

pip install markitdown

It has a super simple API:

from markitdown import MarkItDown

markitdown = MarkItDown()
result = markitdown.convert("test.xlsx")

print(result.text_content)
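
Since the API is just `convert()` plus `text_content`, converting a whole folder is a short loop. The following is a minimal sketch under a few assumptions: `convert_folder` is a hypothetical helper (not part of MarkItDown), it only looks at the top level of the folder, and it assumes a failed conversion raises an exception, which it records before moving on.

```python
from pathlib import Path


def convert_folder(src_dir, out_dir, converter=None):
    """Convert every file in src_dir and write the Markdown into out_dir.

    `converter` is any object with a .convert(path) method whose result has
    a .text_content attribute, as in the snippet above. If it is omitted,
    a MarkItDown instance is created.
    """
    if converter is None:
        # Imported lazily so the helper can also be tried with a stand-in object.
        from markitdown import MarkItDown
        converter = MarkItDown()

    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written, failed = [], []
    for src in sorted(p for p in src_dir.iterdir() if p.is_file()):
        try:
            result = converter.convert(str(src))
        except Exception as exc:
            # Assumed behavior: unsupported or unreadable files raise; note and continue.
            failed.append((src.name, str(exc)))
            continue
        # Keep the original extension in the name so report.docx and
        # report.pdf do not overwrite each other.
        target = out_dir / (src.name + ".md")
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target.name)
    return written, failed
```

Calling `convert_folder("docs", "docs_md")` would then return the names of the Markdown files it wrote and a list of `(file name, error message)` pairs for anything that could not be converted.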

Once the Python package is installed, you can also run it from the command line:

markitdown path-to-file.pdf > document.md

The following simple code lets you use MarkItDown together with OpenAI to describe images:

from markitdown import MarkItDown
from openai import OpenAI

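# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()
# Parameter names as of the initial release; later MarkItDown releases
# rename them to llm_client / llm_model
md = MarkItDown(mlm_client=client, mlm_model="gpt-4o")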

result = md.convert("example.jpg")

print(result.text_content)

Pretty awesome, no?