Python web scraping

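# The robots protocol

The robots protocol, also called the crawler protocol or robots exclusion protocol, is seen mostly in the context of search engines. In essence it is how a website communicates with search-engine crawlers: it guides them toward crawling the site's content properly, and it is not meant to be a tool for search engines to restrict one another or compete unfairly.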

User-agent:Baiduspider
Disallow: /musi
Disallow: /boo
Disallow: /secre
Disallow: /ran
Disallow: /order_cen
Disallow: /hotel_z
Disallow: /sales/or
Disallow: /u/*.html

User-agent: *
Disallow: /
Disallow: /po

Sitemap: http://www.mafengwo.cn/sitemapIndex.xml

https://www.mafengwo.cn/robots.txt
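
Python's standard library can evaluate rules like these before a crawler fetches a page. The following is a minimal sketch using urllib.robotparser. The rule set, the MyCrawler agent name, and the example.com URLs are made up for illustration and are not the site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, written in the same style as the file above
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /secret
Disallow: /u/

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Baiduspider may fetch anything that is not under one of its Disallow prefixes
print(parser.can_fetch("Baiduspider", "http://example.com/hotel"))
print(parser.can_fetch("Baiduspider", "http://example.com/secret/x"))
# Every other agent falls back to the "*" entry, which disallows everything
print(parser.can_fetch("MyCrawler", "http://example.com/hotel"))
```

For a live site, set_url() followed by read() downloads the file instead of parsing a string.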

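# HTTP request and response handling

# Request headers

Host (host name and port)
Connection (connection type)
Upgrade-Insecure-Requests (asks the server to upgrade to HTTPS)
User-Agent (identifies the browser)
Accept (media types the client accepts)
Referer (the page the request came from)
Accept-Encoding (content encodings the client accepts)
Cookie (cookies)
X-Requested-With: XMLHttpRequest (marks an asynchronous Ajax request)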

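# 1. The urllib package

urllib.request - opens and reads URLs.
urllib.error - contains the exceptions raised by urllib.request.
urllib.parse - parses URLs.
urllib.robotparser - parses robots.txt files.

# 01. The urllib.request module

urllib.request defines functions and classes for opening URLs, covering authentication, redirects, browser cookies, and more.

urllib.request can simulate the way a browser issues a request.

Use the urlopen function in urllib.request to open a URL. Its signature is:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url: the URL to open, either a string or a Request object.
data: extra data to send to the server; defaults to None.
timeout: the timeout for the request, in seconds.
cafile and capath: cafile is a file containing CA certificates and capath is a directory of CA certificate files; both are used for HTTPS. They were removed in Python 3.13, so prefer context.
cadefault: deprecated and ignored.
context: an ssl.SSLContext instance that specifies the SSL settings.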

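# The urlopen function

from http.client import HTTPResponse
from urllib.request import urlopen

# With data=None this sends a GET request; otherwise it sends a POST.
response: HTTPResponse = urlopen("http://www.bing.com", data=None)
with response:  # the response is a context manager and is closed on exit
    print(type(response), response)
    print(response.status, response.reason)  # status code and reason phrase
    print(response.info())  # response headers
    print(response.geturl())  # final URL; differs from the request URL after a redirect
    print(response.read())  # response body as bytes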

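Calling urllib.request.urlopen sends an HTTP GET request, and the web server returns the page content. The response data is wrapped in a file-like object, so it can be read with read, readline, or readlines. The status and reason attributes hold the response status code and reason phrase, and info returns the headers. Note that the URL changes, which shows that the request was redirected.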

http://httpbin.org/

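# The User-Agent problem

So far urlopen has built the HTTP request from a URL string and data. To change HTTP headers such as User-Agent, another mechanism is needed. The library source constructs the default User-Agent like this: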

class OpenerDirector:
    def __init__(self):
        client_version = "Python-urllib/%s" % __version__
        self.addheaders = [('User-agent', client_version)]

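The resulting header is "User-Agent": "Python-urllib/3.11", where the suffix follows the Python version. Some sites block crawlers, so the crawler has to present itself as a browser: open any browser, copy its User-Agent value, and send that instead.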

# The Request class

from http.client import HTTPResponse
import random
from urllib.request import urlopen, Request

url = "http://httpbin.org/get"  # or "http://www.bing.com"
uaPc = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',  # Firefox
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27',  # Safari
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',  # Chrome
    'Mozilla/5.0 (compatible; WOW64; MSIE 10.0; Windows NT 6.2)',  # IE10
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',  # IE9
]

request = Request(url)
# Equivalent but more verbose: uaPc[random.randint(0, len(uaPc) - 1)]
request.add_header("user-agent", random.choice(uaPc))

# With data=None this sends a GET request; otherwise it sends a POST.
response: HTTPResponse = urlopen(request, data=None)
with response:
    print(response.read().decode())  # httpbin echoes the headers it received

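# 02. The urllib.parse module

parse.urlencode(): query parameters must be encoded before they are appended to a URL. urlencode() takes a dict (or a sequence of key/value pairs) and returns the encoded query string.
parse.quote(): percent-encodes a single string, for example the Chinese characters in a URL. Unlike urlencode(), it works on one string rather than on key/value pairs.
parse.unquote(): the inverse of quote(); it decodes a percent-encoded string back into an ordinary Unicode string.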
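
The three functions can be compared side by side without touching the network. This is a small sketch; the keyword and the pn parameter are arbitrary values chosen for illustration.

```python
from urllib import parse

keyword = "python 爬虫"  # arbitrary sample text containing a space and Chinese characters

# urlencode() builds a whole query string and encodes spaces as "+"
query = parse.urlencode({"wd": keyword, "pn": 10})
# quote() encodes one string and encodes spaces as "%20"
quoted = parse.quote(keyword)
# unquote() reverses quote()
restored = parse.unquote(quoted)
# parse_qs() reverses urlencode(); every value comes back as a list of strings
parsed_back = parse.parse_qs(query)

print(query)        # wd=python+%E7%88%AC%E8%99%AB&pn=10
print(quoted)       # python%20%E7%88%AC%E8%99%AB
print(restored)     # python 爬虫
print(parsed_back)  # {'wd': ['python 爬虫'], 'pn': ['10']}
```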

from urllib import parse
from urllib.request import urlopen, Request

baseUrl = "https://www.baidu.com/s"
param = parse.urlencode({"wd": '周享平'})  # encode the query; parse.unquote() decodes it
url = "{}?{}".format(baseUrl, param)

request = Request(url)
request.add_header("user-agent", 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)')  # IE9

with urlopen(request, timeout=20) as response:  # closed automatically on exit
    # response.read() returns bytes, so open the file in binary mode without an encoding argument
    with open("aa.html", mode="wb") as f:
        f.write(response.read())

Reference: https://blog.csdn.net/sallyyellow/article/details/128846430

# 2. Request methods

# The GET method

# The POST method

# Handling JSON data
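
These three headings have no body in the notes, so the following is only a sketch of how the cases might look with urllib.request. It builds the Request objects without sending them. The httpbin.org paths, the field names, and the sample response bytes are assumptions made for illustration.

```python
import json
from urllib import parse
from urllib.request import Request

BASE = "http://httpbin.org"  # test service mentioned earlier in these notes


def build_get(params: dict) -> Request:
    """GET: parameters travel in the query string and there is no body."""
    return Request(f"{BASE}/get?{parse.urlencode(params)}")


def build_form_post(fields: dict) -> Request:
    """POST: form fields are URL-encoded and sent as a bytes body."""
    return Request(f"{BASE}/post", data=parse.urlencode(fields).encode("utf-8"))


def build_json_post(payload: dict) -> Request:
    """POST with a JSON body; Content-Type tells the server how to parse it."""
    body = json.dumps(payload).encode("utf-8")
    return Request(f"{BASE}/post", data=body,
                   headers={"Content-Type": "application/json"}, method="POST")


get_req = build_get({"q": "python"})
form_req = build_form_post({"name": "tom", "age": 20})
json_req = build_json_post({"name": "tom", "tags": ["a", "b"]})

# Request picks POST automatically as soon as data is not None
print(get_req.get_method(), get_req.full_url)
print(form_req.get_method(), form_req.data)
# Request stores header names capitalized, hence "Content-type"
print(json_req.get_method(), json_req.get_header("Content-type"))

# A JSON response body is decoded with json.loads after reading the bytes
sample_body = b'{"json": {"name": "tom"}}'  # hypothetical response bytes
print(json.loads(sample_body)["json"]["name"])
```

Passing any of these objects to urlopen would send the request; the response body could then be fed to json.loads in the same way as sample_body.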

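# 3. Ignoring HTTPS certificate verification

import ssl
from urllib.request import urlopen

# Skip verification of the server's SSL certificate (unsafe; use only for testing).
# Note that _create_unverified_context is a private helper of the ssl module.
context = ssl._create_unverified_context()
urlopen(request, context=context, timeout=20)  # request is the Request object built earlier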
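
The same effect can be reached through the public ssl API. This is a sketch under the assumption that disabling verification is acceptable only for local testing; the two helper function names are made up for this example.

```python
import ssl


def make_unverified_context() -> ssl.SSLContext:
    """Build a context that accepts any certificate (testing only)."""
    context = ssl.create_default_context()
    # check_hostname must be switched off before verify_mode can be set to CERT_NONE
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def make_verified_context() -> ssl.SSLContext:
    """Build the default context, which verifies certificates and host names."""
    return ssl.create_default_context()


unverified = make_unverified_context()
verified = make_verified_context()
print(unverified.verify_mode, unverified.check_hostname)
print(verified.verify_mode, verified.check_hostname)
```

Either context can be passed to urlopen through its context parameter.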

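# 4. urllib3

pip install urllib3

from urllib3 import PoolManager, HTTPResponse

# url and ua are the target URL and User-Agent string from the earlier examples
http: PoolManager = PoolManager()
with http:  # the connection pool is cleared on exit
    response: HTTPResponse = http.request("GET", url, headers={"user-agent": ua})
    print(response.status)
    print(response.data)  # response body as bytes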

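# 5. The requests library

requests is built on urllib3 but has a friendlier API and is the recommended choice. A Session can send a series of related requests: cookies set by one response are sent automatically with the following requests.

pip install requests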

import random
import requests
from requests import Response, Session

url = "https://movie.douban.com/"
uaPc = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',  # Firefox
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27',  # Safari
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',  # Chrome
    'Mozilla/5.0 (compatible; WOW64; MSIE 10.0; Windows NT 6.2)',  # IE10
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',  # IE9
]

# Equivalent but more verbose: uaPc[random.randint(0, len(uaPc) - 1)]
ua = random.choice(uaPc)

session: Session = requests.Session()
with session:  # a session can send a series of related requests
    # Use session.get rather than requests.get so that cookies are shared across requests
    response: Response = session.get(url, headers={"user-agent": ua})
    with response:  # the response is a context manager and is closed on exit
        print(type(response), response)
        print(*response.headers.items(), sep="\n")  # response headers
        print(response.url)  # final URL; differs from the request URL after a redirect
        print(response.cookies)

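# XPath syntax

XPath has seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes.

# Nodes

Expression	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any type.
nodename	Selects the child elements named nodename of the current node.
/	Selects from the root node.
//	Selects matching nodes anywhere below the current node, regardless of their position in the document.
.	Selects the current node.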

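# Predicates

A predicate finds a specific node or a node that contains a specific value.

Predicates are written inside square brackets [].

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00.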
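
Several of these predicates can be tried with the standard library alone. This is a minimal sketch using xml.etree.ElementTree, which implements only a subset of XPath: positional and attribute predicates work, but numeric comparisons such as price>35.00 do not, so that case is filtered in Python here (lxml, covered later in these notes, supports full XPath 1.0). The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document
XML = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title lang="zh">Python</title><price>49.00</price></book>
</bookstore>
"""

root = ET.fromstring(XML)  # root is the <bookstore> element, so paths are relative to it


def titles(books):
    """Return the title text of each book element."""
    return [book.findtext("title") for book in books]


first = titles(root.findall("book[1]"))
last = titles(root.findall("book[last()]"))
second_to_last = titles(root.findall("book[last()-1]"))
with_lang = [t.text for t in root.findall(".//title[@lang]")]
english = [t.text for t in root.findall(".//title[@lang='eng']")]
# No numeric comparison in ElementTree's XPath subset, so compare in Python
expensive = titles(b for b in root.findall("book") if float(b.findtext("price")) > 35.00)

print(first, last, second_to_last)
print(with_lang, english, expensive)
```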

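# XPath axes

Axis name	Result
ancestor	Selects all ancestors (parent, grandparent, and so on) of the current node.
ancestor-or-self	Selects all ancestors (parent, grandparent, and so on) of the current node and the current node itself.
attribute	Selects all attributes of the current node.
child	Selects all children of the current node.
descendant	Selects all descendants (children, grandchildren, and so on) of the current node.
descendant-or-self	Selects all descendants (children, grandchildren, and so on) of the current node and the current node itself.
following	Selects everything in the document after the closing tag of the current node.
namespace	Selects all namespace nodes of the current node.
parent	Selects the parent of the current node.
preceding	Selects all nodes in the document that come before the start tag of the current node.
preceding-sibling	Selects all siblings before the current node.
self	Selects the current node.

Example	Result
child::book	Selects all book nodes that are children of the current node.
attribute::lang	Selects the lang attribute of the current node.
child::*	Selects all element children of the current node.
attribute::*	Selects all attributes of the current node.
child::text()	Selects all text node children of the current node.
child::node()	Selects all children of the current node.
descendant::book	Selects all book descendants of the current node.
ancestor::book	Selects all book ancestors of the current node.
ancestor-or-self::book	Selects all book ancestors of the current node, and the current node as well if it is a book node.
child::*/child::price	Selects all price grandchildren of the current node.

Recommended tool: XMLQuire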

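# lxml

lxml is a high-performance Python XML library used mainly to parse and generate XML and HTML files (parsing, serialization, transformation). It natively supports XPath 1.0, XSLT 1.0, custom element classes, and even a Pythonic data-binding interface. lxml is implemented in Cython on top of the C libraries libxml2 and libxslt, which is why it performs well. When these notes were written, the latest release supported Python 2.6+ and Python 3 up to 3.6; check the official documentation for the versions supported today.

Official documentation: https://lxml.de/

Reference: Basic usage of the lxml library

yum install libxml2-devel libxslt-devel python3-devel  # build dependencies on CentOS
pip install lxml

To work out the HTML structure, open the browser developer tools with F12, inspect the page, and press Ctrl+F in the elements view. The search box accepts XPath expressions, so they can be tried out there.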

import random
import requests
from lxml import etree
from lxml.etree import _Element

url = "https://movie.douban.com/"
uaPc = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',  # Firefox
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27',  # Safari
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',  # Chrome
    'Mozilla/5.0 (compatible; WOW64; MSIE 10.0; Windows NT 6.2)',  # IE10
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',  # IE9
]

ua = random.choice(uaPc)

# Send the request through the session so that related requests share cookies
with requests.Session() as session, session.get(url, headers={"user-agent": ua}) as response:

    if 200 <= response.status_code < 300:
        print(response.text[:300])
        root: _Element = etree.HTML(response.content)
        # Text of the links inside the table cells of the "billboard-bd" block
        titles = root.xpath('//div[@class="billboard-bd"]//td/a/text()')
        print(titles)

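# MongoDB

MongoDB is a database built on distributed file storage and written in C++. It aims to provide a scalable, high-performance data store for web applications.

MongoDB sits between relational and non-relational databases: among non-relational databases it is the most feature-rich and the closest to a relational database.

# 01. Installing and running

Download MongoDB from the official site (see also the official MongoDB documentation) and pick the build for your system. On Windows, download the zip file, extract it, and run mongod.exe. Create the default data directories d:\data\db and d:\data\log first.

bin\mongod.exe

Configuration file: /etc/mongod.conf, or <install dir>\bin\mongod.cfg on Windows.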

storage:
   dbPath: "d:/data"
net:
   bindIp: 127.0.0.1
   port: 27017

bin\mongod.exe -f config.yml

netstat -anp tcp

Connect with a database client such as Navicat.

show dbs

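# 02. pymongo

pip install pymongo

Reference: Using MongoDB from Python

# Documents

A document is a set of key/value pairs.
The key/value pairs in a document are ordered.
Keys are strings:
- Keys are case-sensitive and use UTF-8 characters.
- Keys cannot contain \0 (the null character), which marks the end of a key string.
- . and $ have special meanings and may be used only in specific contexts.
- Keys that begin with an underscore _ are reserved, for example _id.
Values can be:
- Strings, 32-bit or 64-bit integers, doubles, timestamps (milliseconds), booleans, null
- Byte arrays, BSON arrays, BSON objects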

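# Data types

Object ID: the document ID

String: a string, the most common type; must be valid UTF-8

Boolean: stores a boolean value, true or false

Integer: an integer, 32-bit or 64-bit depending on the server

Double: stores a floating-point value

Arrays: an array or list, storing several values under one key

Object: an embedded document, that is, a value that is itself a document

Null: stores a null value

Timestamp: a timestamp, the number of seconds elapsed since 1970-01-01

Date: stores a date or time in Unix time format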

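# CRUD in MongoDB

A MongoDB server can hold many databases, but some database names are reserved. These databases have special roles and can be accessed directly.

admin: from a permissions point of view this is the "root" database. A user added to this database automatically inherits permissions on all databases. Some server-side commands, such as listing all databases or shutting down the server, can only be run from this database.
local: the data in this database is never replicated; it can hold any collections that should stay local to a single server.
config: used internally when MongoDB runs in a sharded setup, to store information about the shards.

Every inserted document has a unique key: the _id field uniquely identifies a document. If _id is not given explicitly, an _id of type ObjectId is generated automatically. An ObjectId is 12 bytes long (see the source of the ObjectId class). The layout described in older documentation is:

4-byte timestamp
3-byte machine identifier
2-byte process id
3-byte counter (starting from a random value)

Newer MongoDB and driver versions replace the machine identifier and process id with a single 5-byte random value, so the layout becomes a 4-byte timestamp, a 5-byte random value, and a 3-byte counter.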

    @property
    def generation_time(self) -> datetime.datetime:
        """A :class:`datetime.datetime` instance representing the time of
        generation for this :class:`ObjectId`.

        The :class:`datetime.datetime` is timezone aware, and
        represents the generation time in UTC. It is precise to the
        second.
        """
        timestamp = struct.unpack(">I", self.__id[0:4])[0]
        return datetime.datetime.fromtimestamp(timestamp, utc)
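
The timestamp handling in generation_time can be reproduced with the standard library. This is a sketch, not pymongo's implementation: make_object_id is a hypothetical helper that packs 12 bytes as a 4-byte big-endian timestamp, 5 random bytes, and a 3-byte counter, and the timestamp value is arbitrary.

```python
import datetime
import os
import struct


def make_object_id(timestamp: int, counter: int) -> bytes:
    """Pack a 12-byte id: 4-byte big-endian timestamp, 5 random bytes, 3-byte counter."""
    return struct.pack(">I", timestamp) + os.urandom(5) + counter.to_bytes(3, "big")


def generation_time(oid: bytes) -> datetime.datetime:
    """Read the first 4 bytes back as a timezone-aware UTC datetime."""
    timestamp = struct.unpack(">I", oid[0:4])[0]
    return datetime.datetime.fromtimestamp(timestamp, datetime.timezone.utc)


oid = make_object_id(timestamp=1_700_000_000, counter=1)
print(oid.hex(), len(oid))
print(generation_time(oid).isoformat())
```

Because the timestamp occupies the leading bytes, ids created later compare greater, which is why sorting by _id roughly follows insertion time.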

# Querying documents

db.col.find().pretty()

Operation	Format	Example	Equivalent RDBMS clause
Equal	`{<key>:<value>}`	`db.col.find({"by":"菜鸟教程"}).pretty()`	`where by = '菜鸟教程'`
Less than	`{<key>:{$lt:<value>}}`	`db.col.find({"likes":{$lt:50}}).pretty()`	`where likes < 50`
Less than or equal	`{<key>:{$lte:<value>}}`	`db.col.find({"likes":{$lte:50}}).pretty()`	`where likes <= 50`
Greater than	`{<key>:{$gt:<value>}}`	`db.col.find({"likes":{$gt:50}}).pretty()`	`where likes > 50`
Greater than or equal	`{<key>:{$gte:<value>}}`	`db.col.find({"likes":{$gte:50}}).pretty()`	`where likes >= 50`
Not equal	`{<key>:{$ne:<value>}}`	`db.col.find({"likes":{$ne:50}}).pretty()`	`where likes != 50`

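# Operators

Comparison operators:
- $eq: equal to
- $ne: not equal to
- $gt: greater than
- $gte: greater than or equal to
- $lt: less than
- $lte: less than or equal to
- $in: value is in the given array
- $nin: value is not in the given array

Logical operators:
- $and: logical AND
- $or: logical OR
- $not: logical NOT
- $nor: logical NOR (none of the conditions match)

Element operators:
- $exists: whether the field exists
- $type: the field's data type

Array operators:
- $all: matches arrays that contain all the given elements
- $elemMatch: matches arrays with an element that satisfies the given conditions
- $size: array length

Regular expression operator:
- $regex: regular expression match

Text search operator:
- $text: full-text index search

Date operator:
- $dateToString: formats a date

Aggregation operators:
- $group: grouping
- $match: filtering
- $project: field projection
- $sort: sorting
- $skip: skips the given number of documents
- $limit: limits the number of documents returned
- $unwind: expands an array

There are also some special operators, for example:
- $where: runs JavaScript code
- $near: finds nearby documents
- $geoWithin: finds documents inside a polygon region
- $geoIntersects: finds documents that intersect a polygon region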

# AND and OR conditions

db.col.find({$and:[{"by":"菜鸟教程"},{"title": "MongoDB 教程"}]}).pretty()

db.col.find({$or:[{"by":"菜鸟教程"},{"title": "MongoDB 教程"}]}).pretty()

db.col.find({"likes": {$gt:50}, $or: [{"by": "菜鸟教程"},{"title": "MongoDB 教程"}]}).pretty()
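
To make the semantics of these filters concrete without a running server, here is a toy matcher in plain Python. It is only an illustration, not how MongoDB evaluates queries: matches, find_likes, and the sample documents are invented for this sketch, it handles just the comparison operators plus $and and $or, and it treats a missing field as a non-match.

```python
import operator

# Comparison operators mapped to Python functions
COMPARISONS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}


def matches(doc: dict, query: dict) -> bool:
    """Return True if doc satisfies query (toy subset of the filter language)."""
    for key, condition in query.items():
        if key == "$and":
            if not all(matches(doc, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # {"likes": {"$gt": 50}}: every operator on the field must hold
            if key not in doc:
                return False
            if not all(COMPARISONS[op](doc[key], value) for op, value in condition.items()):
                return False
        elif doc.get(key) != condition:
            # {"by": "..."}: plain equality
            return False
    return True


docs = [
    {"by": "菜鸟教程", "title": "MongoDB 教程", "likes": 100},
    {"by": "菜鸟教程", "title": "Redis 教程", "likes": 20},
    {"by": "someone", "title": "MongoDB 教程", "likes": 60},
]


def find_likes(query: dict) -> list:
    """Return the likes value of every matching document."""
    return [d["likes"] for d in docs if matches(d, query)]


and_query = {"$and": [{"by": "菜鸟教程"}, {"title": "MongoDB 教程"}]}
or_query = {"$or": [{"by": "菜鸟教程"}, {"title": "MongoDB 教程"}]}
# Top-level keys are combined with AND: likes > 50 AND (by = ... OR title = ...)
mixed_query = {"likes": {"$gt": 50}, "$or": [{"by": "菜鸟教程"}, {"title": "MongoDB 教程"}]}

print(find_likes(and_query))    # [100]
print(find_likes(or_query))     # [100, 20, 60]
print(find_likes(mixed_query))  # [100, 60]
```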

# The $type operator

Type	Number	Notes
Double	1
String	2
Object	3
Array	4
Binary data	5
Undefined	6	Deprecated.
Object id	7
Boolean	8
Date	9
Null	10
Regular Expression	11
JavaScript	13
Symbol	14
JavaScript (with scope)	15
32-bit integer	16
Timestamp	17
64-bit integer	18
Min key	255	Query with `-1`.
Max key	127

To fetch the documents in the col collection whose title is a String, use either the type number or the type name:

db.col.find({"title" : {$type : 2}})
db.col.find({"title" : {$type : 'string'}})

Reference: MongoDB tutorial

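# Inserting

db.<collection>.insert(document)

db.stu.insert({name:'gj',gender:1})

db.stu.insert({_id:"20170101",name:'gj',gender:1})

If _id is not specified when a document is inserted, MongoDB assigns the document a unique ObjectId. In current versions of the mongosh shell, insert() is deprecated in favor of insertOne() and insertMany().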

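# Updating

db.<collection>.update(<query>, <update>, {multi: <boolean>})

query: the query condition
update: the update operators, or a replacement document
multi: optional, defaults to false, which updates only the first matching document; true updates every matching document

db.stu.update({name:'hr'},{name:'mnc'})   // replaces the whole first matching document
db.stu.update({name:'hr'},{$set:{name:'hys'}})    // changes one field of the first matching document
db.stu.update({},{$set:{gender:0}},{multi:true})   // updates all documents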

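Note: with multi: true the update document must use $ operators such as $set. A plain replacement document is rejected with the error "multi update only works with $ operators".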
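
The difference between a replacement document and a $set update is easy to miss, so here is a plain-Python simulation of the three calls above. It only mimics the behavior described in these notes: apply_update and update_collection are hypothetical helpers, only $set and equality queries are handled, and no MongoDB server is involved.

```python
import copy


def apply_update(doc: dict, update: dict) -> dict:
    """Return the document that results from applying update to doc."""
    if not any(key.startswith("$") for key in update):
        # Replacement: keep only _id and take every other field from the new document
        replaced = {"_id": doc["_id"]} if "_id" in doc else {}
        replaced.update(copy.deepcopy(update))
        return replaced
    updated = copy.deepcopy(doc)
    for field, value in update.get("$set", {}).items():
        updated[field] = value  # $set touches only the listed fields
    return updated


def update_collection(collection: list, query: dict, update: dict, multi: bool = False) -> list:
    """Apply update to the first match, or to every match when multi is True."""
    if multi and not any(key.startswith("$") for key in update):
        raise ValueError("multi update only works with $ operators")
    result, done = [], False
    for doc in collection:
        hit = all(doc.get(k) == v for k, v in query.items())
        if hit and (multi or not done):
            result.append(apply_update(doc, update))
            done = True
        else:
            result.append(doc)
    return result


stu = [
    {"_id": 1, "name": "hr", "gender": 1},
    {"_id": 2, "name": "hr", "gender": 1},
]

replaced = update_collection(stu, {"name": "hr"}, {"name": "mnc"})
set_one = update_collection(stu, {"name": "hr"}, {"$set": {"name": "hys"}})
set_all = update_collection(stu, {}, {"$set": {"gender": 0}}, multi=True)

print(replaced[0])  # gender is gone: the whole document was replaced
print(set_one[0])   # gender is kept: only name changed
print([d["gender"] for d in set_all])
```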

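# How BSON works

From the encyclopedia entry: BSON (Binary Serialized Document Format) is a binary storage format that represents data as name/value pairs, much like a C struct. It supports embedded documents and arrays, is lightweight, traversable, and efficient, and can describe both unstructured and structured data.

BSON is a JSON-like binary storage format, short for Binary JSON. Like JSON it supports embedded documents and arrays, but it also has data types that JSON lacks, such as Date and BinData.

BSON can also serve as a format for exchanging data over a network, somewhat like Google's Protocol Buffers. Unlike Protocol Buffers it is schema-less, which makes it very flexible at the cost of less efficient use of space.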
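
The length-prefixed layout that makes BSON traversable can be seen by encoding a tiny document by hand. This is a sketch covering only strings, booleans, and 32-bit integers; encode_element and encode_document are hypothetical helpers written for this example, and real code would use the bson package that ships with pymongo.

```python
import struct


def encode_element(name: str, value) -> bytes:
    """Encode one key/value pair as: type byte, key as a C string, then the value."""
    key = name.encode("utf-8") + b"\x00"
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # 0x02 = string: little-endian int32 byte length (including the trailing null), then bytes
        return b"\x02" + key + struct.pack("<i", len(data)) + data
    if isinstance(value, bool):  # check bool before int, since bool is a subclass of int
        return b"\x08" + key + (b"\x01" if value else b"\x00")  # 0x08 = boolean
    if isinstance(value, int):
        return b"\x10" + key + struct.pack("<i", value)  # 0x10 = 32-bit integer
    raise TypeError(f"unsupported type: {type(value).__name__}")


def encode_document(doc: dict) -> bytes:
    """Encode a document as: int32 total size, the elements, and a terminating null byte."""
    body = b"".join(encode_element(k, v) for k, v in doc.items()) + b"\x00"
    return struct.pack("<i", len(body) + 4) + body


encoded = encode_document({"hello": "world"})
print(encoded)       # b'\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00'
print(len(encoded))  # 22, the same value stored in the first four bytes
```

A reader can skip a whole document, or a single string value, by reading its length prefix, which is what the notes mean by traversability.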

# Selenium development

# scrapy

# scrapy-redis
