Document Import

LeafPress can import Word (.docx), PowerPoint (.pptx), Excel (.xlsx), and LaTeX (.tex) files and convert them to Markdown. This is useful for migrating existing documents into an MkDocs project.

leafpress import report.docx
leafpress import deck.pptx
leafpress import data.xlsx
leafpress import paper.tex

# Import multiple files at once — mix and match formats
leafpress import *.docx *.pptx *.xlsx *.tex

# Send all output to a directory
leafpress import *.docx *.pptx *.xlsx *.tex -o docs/

See the CLI Reference for all flags and examples.


Word Import (DOCX)

Word documents are converted using the mammoth library, which maps Word styles to semantic HTML, then to Markdown.

What's supported

| Feature | How it's handled |
| --- | --- |
| Headings | Mapped to # through ###### based on Word heading level |
| Bold / italic | Preserved as **bold** and *italic* |
| Hyperlinks | Converted to Markdown link syntax |
| Ordered / unordered lists | Converted to Markdown lists |
| Tables | Converted to pipe-style Markdown tables |
| Images | Extracted to an assets/ directory and referenced via Markdown image syntax |
| Code blocks | Detected by Word style name; use --code-styles to specify which styles are code |

Code block detection

By default, no Word styles are treated as code. Use --code-styles to specify which style names should become fenced code blocks:

leafpress import report.docx --code-styles "Code Block,Source Code"
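
The source does not say how LeafPress passes these style names to mammoth. One plausible route is mammoth's style-map syntax, where a rule such as `p[style-name='Code Block'] => pre:separate('\n')` maps a paragraph style to a preformatted block. The sketch below assumes that route; `parse_code_styles` and `build_style_map` are hypothetical helper names, not LeafPress functions.

```python
def parse_code_styles(value: str) -> list[str]:
    """Split a comma-separated --code-styles value into clean style names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def build_style_map(style_names: list[str]) -> str:
    """Build one mammoth style-map rule per Word paragraph style.

    Assumption: every code style is a paragraph style mapped to <pre>.
    """
    rules = [
        f"p[style-name='{name}'] => pre:separate('\\n')"
        for name in style_names
    ]
    return "\n".join(rules)


styles = parse_code_styles("Code Block, Source Code")
print(build_style_map(styles))
```

Style names containing a single quote would need escaping before being placed in a rule; the sketch leaves that out.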

Limitations

The following Word features are not currently supported and may be lost or simplified during import:

| Feature | Reason |
| --- | --- |
| Track changes / comments | Mammoth reads only the final document state; tracked changes and comments are not included in the output. |
| Headers / footers | Document headers and footers are not part of the body content and are skipped. |
| Page breaks / columns | Page layout is a visual property with no Markdown equivalent. |
| Text boxes | Floating text boxes are not part of the main document flow and are skipped by mammoth. |
| SmartArt / charts | Rendered as embedded images by Word internally. If image extraction is enabled, they may appear as images, but labels and data are not extractable as text. |
| Custom fonts / colors | Mammoth maps semantic styles (bold, italic, headings) but ignores visual-only formatting like font family, size, and color. |
| Table of contents | Word TOC fields are not resolved; they appear as static text or are omitted. |
| Footnotes / endnotes | Converted to inline text rather than Markdown footnote syntax. |

Tip

For best results, use Word's built-in heading styles (Heading 1, Heading 2, etc.) rather than manually formatted bold text. Mammoth relies on styles, not visual formatting.


PowerPoint Import (PPTX)

PowerPoint presentations are converted using the python-pptx library. Each slide becomes a section in the output Markdown.

What's supported

| Feature | How it's handled |
| --- | --- |
| Slide titles | Each slide title becomes an H2 heading (##) |
| Untitled slides | Get a fallback heading: ## Slide N |
| Body text | Extracted with paragraph structure preserved |
| Bold / italic | Preserved as **bold** and *italic* |
| Hyperlinks | Converted to Markdown link syntax |
| Indented text | Rendered as nested bullet lists based on indent level |
| Tables | Converted to pipe-style Markdown tables |
| Images | Extracted to assets/ and referenced via Markdown image syntax |
| Speaker notes | Included as blockquotes (toggleable) |
| Group shapes | Recursed into; nested shapes are extracted individually |

Speaker notes

By default, speaker notes are included as Markdown blockquotes beneath each slide:

## Quarterly Review

Revenue increased by 15% over the prior quarter.

> Remember to highlight the APAC growth numbers.

To omit speaker notes:

leafpress import deck.pptx --no-notes
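
As a rough illustration of the layout described above (H2 title with a `## Slide N` fallback, body paragraphs, then notes as a blockquote), here is a minimal sketch in plain Python. It works on already-extracted strings rather than on python-pptx objects; `render_slide` and its parameters are hypothetical and are not LeafPress's actual code.

```python
def render_slide(index, title, body, notes="", include_notes=True):
    """Render one slide as Markdown.

    index: 1-based slide number, used for the fallback heading.
    body: list of paragraph strings taken from the slide.
    include_notes: False mimics the effect of --no-notes.
    """
    heading = title.strip() if title and title.strip() else f"Slide {index}"
    blocks = [f"## {heading}"]
    blocks.extend(p for p in body if p.strip())
    if include_notes and notes.strip():
        # Prefix every notes line so multi-line notes stay in one blockquote.
        quoted = "\n".join(f"> {line}".rstrip() for line in notes.strip().splitlines())
        blocks.append(quoted)
    return "\n\n".join(blocks) + "\n"


print(render_slide(
    1,
    "Quarterly Review",
    ["Revenue increased by 15% over the prior quarter."],
    notes="Remember to highlight the APAC growth numbers.",
))
```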

Limitations

The following PowerPoint features are not currently supported and will be silently skipped during import:

| Feature | Reason |
| --- | --- |
| SmartArt | SmartArt diagrams are stored as complex XML structures that python-pptx cannot access. They appear as opaque shapes with no extractable text. |
| Charts | Embedded charts (bar, pie, line, etc.) are rendered as OLE objects. The chart data and labels are not accessible through the shape API. |
| Animations / transitions | Markdown has no equivalent; these are presentation-only features. |
| Audio / video | Embedded media cannot be meaningfully represented in Markdown. |
| Slide master / layout formatting | Only content is extracted, not visual styling from the theme. |

Tip

If a slide contains SmartArt or charts, consider replacing them with static images in PowerPoint before importing — images are fully supported and will be extracted to assets/.


Excel Import (XLSX)

Excel spreadsheets are converted using the openpyxl library. Each worksheet becomes a section with a Markdown table.

What's supported

| Feature | How it's handled |
| --- | --- |
| Multiple sheets | Each sheet becomes a ## Sheet Name section |
| Header row | First row treated as the table header |
| Text values | Rendered as-is |
| Numbers | Integers and floats rendered as strings (whole-number floats drop the .0) |
| Dates / times | Formatted as YYYY-MM-DD or HH:MM:SS |
| Empty cells | Rendered as blank table cells |
| Pipe characters | Escaped with a leading backslash (`\|`) so they don't break table syntax |
| Empty sheets | Skipped silently |

Example output

A sheet named "Servers" with three columns produces:

## Servers

| Host     | Role     | CPU |
| -------- | -------- | --- |
| web-01   | frontend | 4   |
| db-01    | database | 8   |
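
The cell-formatting rules listed under "What's supported" can be sketched in a few lines of standard-library Python. This is an illustration of the documented behavior, not LeafPress's implementation: `format_cell` and `sheet_to_markdown` are hypothetical names, rows are assumed to be rectangular lists of already-loaded values, and rendering a full datetime as a date only is an assumption.

```python
from datetime import date, datetime, time


def format_cell(value) -> str:
    """Turn one cell value into table-safe text."""
    if value is None:
        return ""  # empty cell -> blank table cell
    if isinstance(value, (datetime, date)):
        return value.strftime("%Y-%m-%d")  # assumption: time part is dropped
    if isinstance(value, time):
        return value.strftime("%H:%M:%S")
    if isinstance(value, float) and value.is_integer():
        return str(int(value))  # 4.0 -> "4"
    return str(value).replace("|", "\\|")  # keep pipes from splitting columns


def sheet_to_markdown(name, rows) -> str:
    """Render one sheet as a '## name' section with a pipe table."""
    if not rows:
        return ""  # empty sheets are skipped
    cells = [[format_cell(v) for v in row] for row in rows]
    header, body = cells[0], cells[1:]
    lines = [
        f"## {name}",
        "",
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines) + "\n"


print(sheet_to_markdown("Servers", [["Host", "Role", "CPU"], ["web-01", "frontend", 4.0]]))
```

The sketch does not pad columns to equal width; Markdown renderers do not require the alignment shown in the example output.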

Limitations

| Feature | Reason |
| --- | --- |
| Merged cells | Merged regions are not unmerged; only the top-left cell value is read. |
| Formulas | Cell values are read in data-only mode, so you see computed results, not formula text. Cells that have never been calculated in Excel may appear blank. |
| Charts / images | Embedded charts and images are not extracted. |
| Conditional formatting / colors | Visual-only formatting has no Markdown equivalent. |
| Multiple header rows | Only the first row is treated as the header. |

Tip

For best results, open the workbook in Excel and save it before importing; this ensures all formula results are cached. LeafPress reads cached values, not formulas.


LaTeX Import (TEX)

LaTeX documents are converted using a native parser (pylatexenc). The converter handles the most common academic paper and documentation constructs.

What's supported

| Feature | How it's handled |
| --- | --- |
| Headings | \section → ##, \subsection → ###, \subsubsection → ####, etc. |
| Bold / italic / code | \textbf → **bold**, \textit / \emph → *italic*, \texttt → `code` |
| Lists | itemize → bullets, enumerate → numbered, with nesting support |
| Math | Inline $...$ and display $$...$$ / \[...\] / equation / align passed through for MathJax/KaTeX |
| Images | \includegraphics resolved relative to .tex file, copied to assets/ |
| Tables | tabular → pipe-style Markdown tables with column alignment |
| Links | \href{url}{text} → Markdown links, \url{url} → angle-bracket URLs |
| Code blocks | verbatim, lstlisting, minted → fenced code blocks (with language detection) |
| Figures | \caption text used as image alt text |
| Title / author | \title and \author rendered at top of document |
| Blockquotes | abstract, quote, quotation → blockquotes |
| Footnotes | \footnote → Markdown footnote syntax |
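
To make a few of these mappings concrete, the toy converter below handles headings, inline formatting, and links on a single line using regular expressions. It is deliberately not how LeafPress works (the real converter parses the document with pylatexenc), it ignores nested braces, and `convert_line` is a hypothetical name used only for this illustration.

```python
import re

HEADINGS = {"section": "##", "subsection": "###", "subsubsection": "####"}

# (pattern, replacement) pairs for simple one-level inline commands.
INLINE = [
    (r"\\textbf\{([^{}]*)\}", r"**\1**"),
    (r"\\(?:textit|emph)\{([^{}]*)\}", r"*\1*"),
    (r"\\texttt\{([^{}]*)\}", r"`\1`"),
    (r"\\href\{([^{}]*)\}\{([^{}]*)\}", r"[\2](\1)"),
    (r"\\url\{([^{}]*)\}", r"<\1>"),
]


def convert_line(line: str) -> str:
    """Convert one line of LaTeX to Markdown for a handful of commands."""
    for pattern, replacement in INLINE:
        line = re.sub(pattern, replacement, line)
    match = re.fullmatch(
        r"\\(section|subsection|subsubsection)\*?\{(.*)\}", line.strip()
    )
    if match:
        return f"{HEADINGS[match.group(1)]} {match.group(2)}"
    return line


print(convert_line(r"\section{The \textbf{big} idea}"))
print(convert_line(r"See \href{https://example.com}{the site} or \url{https://example.com}."))
```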

Limitations

| Feature | Reason |
| --- | --- |
| \input / \include | Multi-file LaTeX projects are not supported; only the specified .tex file is converted. |
| Custom macros | \newcommand / \def definitions are skipped with a warning. Usages of custom macros appear as raw text. |
| TikZ / PGF diagrams | tikzpicture and pgfpicture environments are skipped with a warning. |
| Beamer | Beamer-specific environments (frame) and overlay commands (\pause, \only<>) are not converted. |
| Cross-references | \ref and \cite produce placeholder text like [ref:label] and [key]; they are not resolved to numbers or bibliography entries. |
| Bibliography | .bib files are not parsed. \cite, \citet, \citep commands produce bracketed keys. |
| EPS/PDF images | Only web-displayable image formats (PNG, JPG, SVG, etc.) are copied. EPS and PDF images produce a warning. |
| Theorem-like environments | theorem, lemma, proof, corollary, definition, and other custom environments render their body as plain text with a warning. |
| \paragraph headings | Registered as a heading level but may not render with the ##### prefix due to parser argument handling. |
| \multicolumn in tables | Column spanning is not represented; cells render but span information is lost. |
| \subfigure / \subcaption | Sub-figure environments are not supported; arguments render as plain text. |
| Accented characters | LaTeX accent commands (\"o, \'{e}, \~{n}) are not converted to Unicode; they appear as raw LaTeX. |
| siunitx package | \SI, \si, \num commands are not converted; they appear as raw text. |
| \label inside math | Labels within math environments are passed through verbatim in the $$ block. |
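
The placeholder behavior for cross-references and citations can be pictured with a small regex-based sketch. It only covers the single-key forms named above and ignores optional arguments such as `\cite[p. 3]{key}`; `placeholder_refs` is a hypothetical name, and the real converter may treat edge cases differently.

```python
import re


def placeholder_refs(text: str) -> str:
    """Replace \\ref and \\cite-style commands with bracketed placeholders."""
    text = re.sub(r"\\ref\{([^{}]*)\}", r"[ref:\1]", text)
    # \cite, \citet and \citep all collapse to the bare bracketed key.
    text = re.sub(r"\\cite[tp]?\{([^{}]*)\}", r"[\1]", text)
    return text


print(placeholder_refs(r"See Figure \ref{fig:arch} and \citep{knuth84}."))
```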

Tip

For best results with math, ensure your Markdown renderer supports MathJax or KaTeX. Math expressions are passed through verbatim in LaTeX syntax.


Common options

Image extraction

The Word, PowerPoint, and LaTeX importers extract embedded images to an assets/ directory next to the output file. To skip image extraction:

leafpress import report.docx --no-extract-images
leafpress import deck.pptx --no-extract-images
leafpress import paper.tex --no-extract-images

Output path

By default, the output file uses the same stem as the input (e.g., deck.pptx → deck.md). You can specify a path or directory:

# Explicit file path (single file only)
leafpress import deck.pptx -o docs/presentation.md

# Directory (creates deck.md inside it)
leafpress import deck.pptx -o docs/

When importing multiple files, --output must be a directory (or omitted):

leafpress import *.docx *.pptx *.xlsx -o docs/
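
The rules above can be summarized in a short sketch. Several details here are assumptions rather than documented behavior: an output value counts as a directory if it ends in a slash or has no file extension, and the default output is placed next to the input file. `resolve_outputs` is a hypothetical helper, not part of LeafPress.

```python
from pathlib import PurePosixPath


def resolve_outputs(inputs, output=None):
    """Map each input path to the Markdown path it would be written to."""
    sources = [PurePosixPath(p) for p in inputs]
    if output is None:
        # Same stem, .md extension (assumed: written next to the input).
        return [str(src.with_suffix(".md")) for src in sources]
    out = PurePosixPath(output)
    is_dir = output.endswith("/") or out.suffix == ""
    if not is_dir:
        if len(sources) > 1:
            raise ValueError(
                "--output must be a directory when importing multiple files"
            )
        return [str(out)]
    return [str(out / (src.stem + ".md")) for src in sources]


print(resolve_outputs(["deck.pptx"]))
print(resolve_outputs(["deck.pptx"], "docs/presentation.md"))
print(resolve_outputs(["report.docx", "data.xlsx"], "docs/"))
```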

URL import

You can import documents directly from URLs — the file is downloaded to a temp directory, converted, and the temp file is cleaned up automatically:

# Import a LaTeX paper from a URL
leafpress import https://example.com/paper.tex

# Import a Word document from a URL
leafpress import https://example.com/report.docx -o docs/

# Mix local files and URLs
leafpress import report.docx https://example.com/slides.pptx -o docs/

The file type is inferred from the URL path extension (.docx, .pptx, .xlsx, .tex). If the URL has no recognized extension, the Content-Type header is used as a fallback. URLs that cannot be mapped to a supported format produce an error.
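
A minimal sketch of this two-step detection, using only the standard library, might look like the following. The MIME-type table is an assumption (the source does not list which Content-Type values LeafPress recognizes), and `infer_format` is a hypothetical name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

SUPPORTED = {".docx", ".pptx", ".xlsx", ".tex"}

# Assumed fallback table: standard Office MIME types plus common TeX types.
MIME_TO_EXT = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
    "application/x-tex": ".tex",
    "text/x-tex": ".tex",
}


def infer_format(url, content_type=None):
    """Return the document extension for a URL, or raise ValueError."""
    # Step 1: extension of the URL path (query string is ignored).
    ext = PurePosixPath(urlparse(url).path).suffix.lower()
    if ext in SUPPORTED:
        return ext
    # Step 2: fall back to the Content-Type header, minus any parameters.
    if content_type:
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in MIME_TO_EXT:
            return MIME_TO_EXT[mime]
    raise ValueError(f"Unsupported document type: {url}")


print(infer_format("https://example.com/paper.tex"))
print(infer_format("https://example.com/download", "application/x-tex; charset=utf-8"))
```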

Batch import

You can pass multiple files and URLs in a single command. Formats can be mixed freely — each source is detected and routed to the appropriate converter:

# Import everything in one shot
leafpress import report.docx proposal.docx slides.pptx data.xlsx paper.tex

# Use shell globs to grab all supported files
leafpress import *.docx *.pptx *.xlsx *.tex

# Mix local and remote sources
leafpress import *.docx https://example.com/paper.tex -o docs/

# Combine with other options
leafpress import *.docx --code-styles "Code Block" --no-extract-images
leafpress import *.pptx --no-notes -o imported/

If one source fails (e.g., missing file, download error, or corrupt document), the remaining sources are still processed. A summary of failures is shown at the end.
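
The continue-on-failure behavior follows a common pattern: wrap each conversion in its own try/except, record the error, and report at the end. The sketch below shows that pattern in isolation; `import_all`, `summarize`, and the `convert` callback are hypothetical stand-ins, and the summary wording is invented for illustration.

```python
def import_all(sources, convert):
    """Convert every source, collecting failures instead of stopping."""
    converted, failures = [], []
    for source in sources:
        try:
            converted.append(convert(source))
        except Exception as exc:  # one bad source must not abort the batch
            failures.append((source, str(exc)))
    return converted, failures


def summarize(failures):
    """Build the end-of-run failure summary."""
    if not failures:
        return "All sources imported."
    lines = [f"{len(failures)} source(s) failed:"]
    lines += [f"  {source}: {reason}" for source, reason in failures]
    return "\n".join(lines)


def fake_convert(source):
    """Stand-in converter that fails for one file."""
    if source == "missing.docx":
        raise FileNotFoundError("no such file")
    return source.rsplit(".", 1)[0] + ".md"


done, failed = import_all(["a.docx", "missing.docx", "b.tex"], fake_convert)
print(done)
print(summarize(failed))
```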