Converting between HTML and Markdown

1. Purpose

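The 藏经阁 section of the 商高学院 project needs to collect and organize content from the web, hence this document.
Much of the book content available online is in PDF or DOC format; this method can convert some of the DOC files.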

2. Save As in WPS with an English file name

When converting with WPS, make sure to save the file under an English file name, to avoid problems in later processing.
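
The file-name rule above can be checked before any conversion starts. The following is a minimal sketch, not part of the original workflow: the helper names `is_ascii_name` and `ascii_fallback_name` and the default name `article` are hypothetical, and the code assumes Python 3.7 or later (for `str.isascii`).

```python
import re
import unicodedata
from pathlib import Path


def is_ascii_name(path):
    """Return True if the file name (not the directory part) is pure ASCII."""
    return Path(path).name.isascii()


def ascii_fallback_name(path, default="article"):
    """Build an ASCII-only file name, keeping the original extension.

    Hypothetical helper: non-ASCII characters are dropped, not transliterated.
    """
    p = Path(path)
    # Decompose accented letters, then drop everything that is not ASCII.
    stem = unicodedata.normalize("NFKD", p.stem).encode("ascii", "ignore").decode("ascii")
    # Collapse runs of unsafe characters (spaces, punctuation) into one hyphen.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "-", stem).strip("-.")
    # Fall back to the default when nothing usable is left.
    return (stem or default) + p.suffix
```

A name such as `文章.html` fails the check and falls back to `article.html`, so a batch of files would still need distinct names chosen by hand or by a counter.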

3. Use the html2text tool to strip redundant markup

I. Environment setup

( 1 ) python3

During installation, remember to check the option that adds python3 to the environment variables.

( 2 ) html2text

01
Project URL: https://github.com/aaronsw/html2text
After downloading the zip package, extract it to the following directory

D:\dev\env\tools\py3\html2text (master)
λ dir
COPYING html2text.py MANIFEST.in README.md setup.py test

Add it to the environment variables

D:\dev\env\tools\py3\html2text

02
Test: html2text.py --help

D:\>html2text.py --help
Usage: html2text.py [(filename|url) [encoding]]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
--ignore-emphasis don't include any formatting for emphasis
--ignore-links don't include any formatting for links
--ignore-images don't include any formatting for images
-g, --google-doc convert an html-exported Google Document
-d, --dash-unordered-list
use a dash rather than a star for unordered list items
-e, --asterisk-emphasis
use an asterisk rather than an underscore for
emphasized text
-b BODY_WIDTH, --body-width=BODY_WIDTH
number of characters per output line, 0 for no wrap
-i LIST_INDENT, --google-list-indent=LIST_INDENT
number of pixels Google indents nested lists
-s, --hide-strikethrough
hide strike-through text. only relevant when -g is
specified as well
--escape-all Escape all special characters. Output is less
readable, but avoids corner case formatting issues.
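
When many files need the same options, it can help to assemble the command line in one place. The sketch below only builds an argument list from flags that appear in the help output above; it does not run anything. The function name `build_html2text_command` and its parameters are hypothetical, and it assumes the script is launched through the current Python interpreter.

```python
import sys


def build_html2text_command(source, script="html2text.py", ignore_links=False,
                            ignore_images=False, body_width=None, encoding=None):
    """Return the argument list for one html2text.py run (nothing is executed)."""
    # Assumption: the script is started with the interpreter running this code.
    cmd = [sys.executable, script]
    if ignore_links:
        cmd.append("--ignore-links")
    if ignore_images:
        cmd.append("--ignore-images")
    if body_width is not None:
        # 0 means "no wrap" according to the help text.
        cmd.append("--body-width=%d" % body_width)
    # Positional arguments follow the usage line: (filename|url) [encoding].
    cmd.append(source)
    if encoding:
        cmd.append(encoding)
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, stdout=...)` to write the Markdown to a file. Because the interpreter does not search the environment variables for scripts, `script` would most likely need to be the full path to `html2text.py` in the directory chosen during setup.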

II. Usage

  1. article.doc
  2. article.html
  3. article-body.html
    html2text article-body.html > article.md
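
The source does not say how article-body.html is produced from article.html. One possible reading, stated here as an assumption, is that it holds only the contents of the `<body>` element, with the head and style sections of the exported page removed. A minimal sketch under that assumption follows; the function name `extract_body` is hypothetical, and the regular expression is a rough shortcut that may fail on unusual markup.

```python
import re

# Match the contents of the first <body>...</body> pair, ignoring case
# and letting "." span line breaks.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(html):
    """Return the markup inside <body>, or the input unchanged if no body tag is found."""
    match = BODY_RE.search(html)
    return match.group(1).strip() if match else html
```

To apply it, read article.html as text, pass the string to `extract_body`, and write the result to article-body.html. The file encoding depends on how the page was exported, so it should be checked rather than assumed.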