HTML to Markdown

I’m working on a documentation project where I might need to convert some existing HTML pages back into plain text or Markdown for the new system. Rather than editing the HTML source by hand, I’m testing a couple of ways to script the conversion.

Lynx

Lynx is an open-source text-mode web browser that is usually present on Linux machines and can be installed on Mac and Windows. I’ve used it in the past to see how web pages will appear to search engines and for accessibility testing. In both cases, you can quickly tell whether the text alone communicates your content. For saving web pages as text, Lynx has a command-line option, -dump, which writes the rendered page to standard output:

$ lynx -dump http://www.whatismyip.com/ > example.txt

In my test case I couldn’t convince Lynx to fetch an HTTPS page, so I downloaded it with curl and piped it into Lynx:

$ curl --silent https://www.linux.com/blog/Learn/2019/2/miyolinux-Lightweight-distro-Old-School-Approach | lynx -dump -stdin > lynx.txt
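The same download-then-render step can be wrapped in a small function for reuse across pages. This is only a sketch: the name fetch_as_text is hypothetical, and saving the download to a temporary file first (so that a failed download is not silently turned into an empty text file) is my own assumption rather than part of the command above.

```shell
#!/bin/sh
# fetch_as_text URL OUTFILE
# Hypothetical helper: download a page with curl, then render it to
# plain text with lynx. Not part of curl or lynx themselves.
fetch_as_text() {
    url=$1
    out=$2
    tmp=$(mktemp) || return 1

    # --fail makes curl exit non-zero on HTTP errors (404, 500, ...);
    # --location follows redirects.
    if curl --silent --fail --location "$url" > "$tmp"; then
        # Same lynx invocation as above, reading the saved HTML on stdin.
        lynx -dump -stdin < "$tmp" > "$out"
        status=$?
    else
        status=$?
        echo "download failed: $url" >&2
    fi

    rm -f "$tmp"
    return "$status"
}
```

After sourcing the file, a call would look like: fetch_as_text https://www.linux.com/blog/Learn/2019/2/miyolinux-Lightweight-distro-Old-School-Approach lynx.txt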

Pandoc

Pandoc is an open-source “universal document converter” that can convert between dozens of document formats. It’s well suited to writing a document in one source format and then converting it to other formats for different publishing options. The feature we’ll use here is Pandoc’s ability to convert from HTML to Markdown, for example:

$ pandoc -s -r html http://www.whatismyip.com/ -o pandoc.md

For my page, I used the same trick as above because Pandoc couldn’t fetch the HTTPS page directly:

$ curl --silent https://www.linux.com/blog/Learn/2019/2/miyolinux-Lightweight-distro-Old-School-Approach | pandoc -s -r html -o pandoc.md
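If the HTML pages are already saved on disk, the same Pandoc flags can be applied to a whole directory. The sketch below assumes a flat directory of .html files; the function name convert_dir and the directory layout are hypothetical.

```shell
#!/bin/sh
# convert_dir SRC_DIR DEST_DIR
# Hypothetical helper: convert every .html file in SRC_DIR to Markdown
# in DEST_DIR, using the same pandoc flags as the commands above.
convert_dir() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1

    for f in "$src"/*.html; do
        # If the glob matched nothing, the pattern is left as-is; skip it.
        [ -e "$f" ] || continue
        name=$(basename "$f" .html)
        # Report a failed file and carry on with the rest.
        pandoc -s -r html "$f" -o "$dest/$name.md" || echo "failed: $f" >&2
    done
}
```

A call such as convert_dir old-site markdown would write one .md file per .html file, keeping the base file names.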

Conclusion

Both of these options do a pretty decent job of converting HTML into text or Markdown. Pandoc seems slightly better at producing Markdown, but I would need to run more samples to see how much manual editing would be needed afterward. I’m also going to play a bit more with Aaron Swartz’s html2text. In my quick test it appeared to have a problem with malformed HTML, so I need to test it further.
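One rough way to estimate the manual cleanup left in a converted file is to count the lines that still contain raw HTML tags, since converters sometimes pass through markup they cannot express in Markdown. This is only a heuristic of my own, not a feature of any of the tools above; the function name leftover_tag_lines is hypothetical, and the pattern is written to skip Markdown autolinks such as <http://example.com/>.

```shell
#!/bin/sh
# leftover_tag_lines FILE
# Hypothetical helper: print how many lines of FILE still contain
# something that looks like an HTML tag (<div ...>, </span>, <br/>).
leftover_tag_lines() {
    [ -r "$1" ] || return 1
    # A tag name must be followed by a space, "/" or ">", which excludes
    # autolinks like <http://...> where the name is followed by ":".
    # grep -c exits 1 when the count is 0, so that status is ignored.
    grep -E -c '</?[A-Za-z][A-Za-z0-9]*[ />]' "$1" || true
}
```

Running leftover_tag_lines pandoc.md on each sample gives a quick number for comparing pages before reading through the output by hand.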

Carlos Dagorret
CTO Facultad de Ciencias Económicas

My research interests include distributed robotics, mobile computing and programmable matter.
