1. Purpose

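The 藏经阁 (library) section of the 商高学院 project needs web content to be collected and organized, which is why this document was written.
A large amount of book content on the web currently exists as PDF or DOC files; this method can convert some of the DOC files.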

2. Save with an English file name in WPS

When converting in WPS, make sure to save the file under an English file name, so that later processing steps do not run into problems.
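
As a rough illustration of the naming rule above, the following sketch derives an ASCII-only file name from an arbitrary one. The function names `ascii_file_name` and `is_ascii_file_name` and the fallback name `article` are assumptions made for this example, not part of WPS or html2text, and the sketch assumes the file extension is already ASCII.

```python
import re
import unicodedata
from pathlib import Path


def is_ascii_file_name(name: str) -> bool:
    """Return True if the file name contains only ASCII characters."""
    return name.isascii()


def ascii_file_name(name: str, fallback: str = "article") -> str:
    """Return an ASCII-only version of a file name, keeping the extension.

    `fallback` (a hypothetical default) is used when nothing ASCII remains.
    """
    path = Path(name)
    stem, suffix = path.stem, path.suffix
    # Decompose accented letters, then drop every non-ASCII character.
    stem = unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode("ascii")
    # Collapse spaces and other unsafe characters into single hyphens.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "-", stem).strip("-.")
    return (stem or fallback) + suffix


if __name__ == "__main__":
    for original in ["我的文章.doc", "my article 1.doc"]:
        print(original, "->", ascii_file_name(original))
```

A name made up entirely of Chinese characters collapses to the fallback, so in practice it is still better to pick a meaningful English name by hand when saving.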

3. Use the html2text tool to strip redundant markup

I. Environment setup

( 1 ) python3

During installation, remember to check the option that adds python3 to the PATH environment variable.

( 2 ) html2text

Project page: https://github.com/aaronsw/html2text
After downloading the zip package, extract it to the following directory:

D:\dev\env\tools\py3\html2text (master)
λ dir
COPYING  html2text.py  MANIFEST.in  README.md  setup.py  test

Add this directory to the PATH environment variable:

D:\dev\env\tools\py3\html2text

Test: html2text.py --help

D:\>html2text.py --help
Usage: html2text.py [(filename|url) [encoding]]
Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --ignore-emphasis     don't include any formatting for emphasis
  --ignore-links        don't include any formatting for links
  --ignore-images       don't include any formatting for images
  -g, --google-doc      convert an html-exported Google Document
  -d, --dash-unordered-list
                        use a dash rather than a star for unordered list items
  -e, --asterisk-emphasis
                        use an asterisk rather than an underscore for
                        emphasized text
  -b BODY_WIDTH, --body-width=BODY_WIDTH
                        number of characters per output line, 0 for no wrap
  -i LIST_INDENT, --google-list-indent=LIST_INDENT
                        number of pixels Google indents nested lists
  -s, --hide-strikethrough
                        hide strike-through text. only relevant when -g is
                        specified as well
  --escape-all          Escape all special characters.  Output is less
                        readable, but avoids corner case formatting issues.
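
If the help text does not appear, one thing worth checking is whether the html2text directory really ended up on PATH. The sketch below is a minimal check under that assumption; `find_on_path` is a hypothetical helper written for this note, and it only looks for the file in each PATH entry without testing whether Windows will actually run `.py` files.

```python
import os
from typing import Optional


def find_on_path(file_name: str, path_value: Optional[str] = None) -> Optional[str]:
    """Return the full path of the first PATH entry containing file_name, or None."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        candidate = os.path.join(directory, file_name)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    location = find_on_path("html2text.py")
    print(location or "html2text.py was not found on PATH")
```

A new command prompt usually has to be opened after editing the environment variable before the change is visible.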

II. Usage

article.doc
article.html

article-body.html

html2text.py article-body.html > article.md
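
The note does not say how article-body.html is produced from article.html. One plausible reading is that it holds only the contents of the `<body>` element, with the head and style definitions left out. Under that assumption, a minimal sketch of the step could look like this; `extract_body`, `write_body`, and the default `utf-8` encoding are hypothetical choices, and the encoding of the exported file may differ.

```python
import re

# Matches the contents of the first <body>...</body> pair, case-insensitively.
_BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(html: str) -> str:
    """Return the markup inside <body>, or the input unchanged if there is no body tag."""
    match = _BODY_RE.search(html)
    return match.group(1).strip() if match else html


def write_body(src: str, dst: str, encoding: str = "utf-8") -> None:
    """Read src, keep only the body markup, and write it to dst as UTF-8."""
    with open(src, encoding=encoding, errors="replace") as handle:
        html = handle.read()
    with open(dst, "w", encoding="utf-8") as handle:
        handle.write(extract_body(html))


if __name__ == "__main__":
    sample = "<html><head><style>p{}</style></head><body><p>Hello</p></body></html>"
    print(extract_body(sample))
```

Calling `write_body("article.html", "article-body.html")` would then give html2text a smaller input to convert.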

