BeautifulSoup

# Installing BeautifulSoup

pip install beautifulsoup4 lxml html5lib

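| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; moderate speed; lenient with malformed documents | Poor leniency in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Fast; lenient with malformed documents | Requires a C library |
| lxml XML parser | `BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")` | Fast; the only XML parser supported | Requires a C library |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Most lenient; parses the document the way a browser does; produces HTML5-format output | Slow; pure Python with no C extension, but a separate package that must be installed |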
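Because the optional parsers may not be installed everywhere, one way to use the table above is to try parsers in order of preference and fall back to the built-in one. The sketch below is only an illustration: the helper name `make_soup` and its `preferred` parameter are hypothetical and not part of Beautiful Soup; `FeatureNotFound` is the exception Beautiful Soup raises when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first installed parser in `preferred`."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<ul><li>one</li><li>two</li></ul>")
items = [li.get_text() for li in soup.find_all("li")]
print(used, items)
```

For well-formed markup like this, every parser yields the same list items. The parsers can disagree on malformed markup, so it is probably safer to pin one parser explicitly when results must be reproducible.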

# Kinds of objects

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:
- Tag
- NavigableString
- BeautifulSoup
- Comment
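The four kinds can be seen in a very small document. The following is a minimal sketch, assuming the built-in `html.parser` and a made-up one-line HTML string; it is not taken from the original notes.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up markup: one tag holding a text node and a comment.
html = "<p class='intro'>Hello<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p               # Tag: an element of the document
text, note = p.contents  # the tag's two children

print(type(soup).__name__)  # BeautifulSoup: the whole document
print(type(p).__name__)     # Tag
print(type(text).__name__)  # NavigableString: text inside a tag
print(type(note).__name__)  # Comment: a special kind of NavigableString
print(p.name, p["class"])   # tag name and attribute access
```

`Comment` is a subclass of `NavigableString`, so code that must tell comments apart from ordinary text should check for `Comment` first.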

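from bs4 import BeautifulSoup, ResultSet, Tag

# Parse index.html with the lxml parser; the with-block closes the file afterwards.
with open("index.html", encoding="utf-8") as f:
    soup: BeautifulSoup = BeautifulSoup(f, "lxml")

li1: Tag = soup.li  # Attribute access returns only the first <li> tag
li_all: ResultSet = soup.find_all(name="li")  # All <li> tags in the document

meta: Tag = soup.head.meta  # First <meta> tag inside <head>

print(meta)
print([s.text for s in li_all])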


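# Traversing the document tree

# (1) Direct children

Key point: the `.contents` and `.children` attributes

`.contents`: a tag's `.contents` attribute returns the tag's direct children as a list.

`.children`: this does not return a list. It returns an iterator over the direct children, so the nodes are obtained by looping over it. Printing `.children` itself shows only an iterator object (for example a `list_iterator`; the exact type depends on the Python and Beautiful Soup version), not the nodes.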
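The difference between the two attributes can be sketched as follows, assuming the built-in `html.parser` and a made-up two-item list as input.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "html.parser")
ul = soup.ul

# .contents is a real list, so it supports len() and indexing.
child_list = ul.contents
print(len(child_list), child_list[0])

# .children is an iterator: printing it shows the iterator object,
# and the nodes only appear when it is looped over.
print(ul.children)
names = [child.name for child in ul.children]
print(names)
```

Use `.contents` when you need indexing or a length, and `.children` when a single pass over the children is enough.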

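# (2) All descendants

Key point: the `.descendants` attribute

`.descendants`: the `.contents` and `.children` attributes contain only a tag's direct children, whereas `.descendants` iterates recursively over all of a tag's descendants. As with `.children`, it has to be looped over to get the nodes.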
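A short comparison may make the recursion concrete. This is a minimal sketch, assuming the built-in `html.parser` and a made-up nested snippet; text nodes are shown as plain strings and tags by name.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<div><p>Hi <b>there</b></p></div>", "html.parser")
div = soup.div

# Direct children only: the single <p> tag.
direct = [node.name for node in div.children]

# All descendants in document order: tags and the text nodes inside them.
all_nodes = [
    node.name if isinstance(node, Tag) else str(node)
    for node in div.descendants
]

print(direct)     # ['p']
print(all_nodes)  # ['p', 'Hi ', 'b', 'there']
```

Note that `.descendants` yields text nodes as well as tags, so filter with `isinstance(node, Tag)` when only elements are wanted.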

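# (3) Node content

Key point: the `.string` attribute

If a tag has exactly one child and that child is a NavigableString, `.string` returns that child. If a tag's only child is another tag, `.string` returns the same result as that child's `.string`. Put simply: when a tag contains no further tags, `.string` returns the text inside it; when it contains exactly one tag, `.string` returns the innermost text. When a tag has more than one child, `.string` is `None`, because it is unclear which child is meant.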
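The three cases can be checked side by side. The snippet below is a minimal sketch, assuming the built-in `html.parser` and made-up markup; `get_text()` is shown as one way to collect text when `.string` is `None`.

```python
from bs4 import BeautifulSoup

html = "<p>plain</p><div><b>nested</b></div><span>a<i>b</i></span>"
soup = BeautifulSoup(html, "html.parser")

only_text = soup.p.string     # one text child: returns that text
single_tag = soup.div.string  # one tag child: returns the innermost text
several = soup.span.string    # two children: ambiguous, so None
all_text = soup.span.get_text()  # concatenates every text node instead

print(only_text, single_tag, several, all_text)
```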

