These are some notes I put together myself. If anything is unclear, let's learn from each other.


Scrapy tutorial (Python 3)

Python · ZZT · 2017-11-02

Scrapy basics
Scrapy: crawling Zhihu
Scrapy: BeautifulSoup and XPath
Scrapy: adding a random User-Agent
Environment: Python 2.7 and Python 3.5 installed side by side on Ubuntu 16; everything below uses Python 3.5.

Installation:

sudo apt-get install python3.5-dev # Python 3.5 development headers (a system package, installed with apt rather than pip)
python3.5 -m pip install scrapy # install Scrapy

Dependencies

(I) BeautifulSoup: a handy tool for parsing scraped web pages

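# Example

import re  # needed by the regular-expression searches further down

import requests

from bs4 import BeautifulSoup

headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}

url = 'http://example.com/'  # placeholder address; replace it with the page you want to fetch

html = requests.get(url, headers=headers)

soup = BeautifulSoup(html.text, 'lxml')  # the lines above fetch the HTML over the network and parse it

soup = BeautifulSoup(open('index.html'), 'lxml')  # to parse a local HTML file, pass an open file object instead

print(soup.prettify())  # print the HTML with standard indentation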
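To try the parsing step without network access, the same calls can be pointed at an inline string. This is only a minimal sketch: the markup and the name sample_html are made up for illustration, and it uses Python's built-in 'html.parser' instead of 'lxml' so that nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example only.
sample_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello, <b>world</b>!</p></body></html>"
)

# 'html.parser' ships with Python; 'lxml' can be used the same way if installed.
soup = BeautifulSoup(sample_html, "html.parser")

print(soup.title.string)  # text inside the <title> tag
print(soup.p["class"])    # class can hold several values, so bs4 returns a list
print(soup.prettify())    # indented rendering of the parsed tree
```

Swapping sample_html for html.text from requests gives the same soup object to work with.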

Introduction to soup.find_all(). soup.find() works in much the same way, except that it returns only the first match.

find_all(name, attrs, recursive, text, limit, **kwargs)

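soup.find_all('b')  # find every <b> tag; returns a list

soup.find_all(re.compile("^b"))  # regular expression: every tag whose name starts with "b"

soup.find_all(["a", "b"])  # pass a list to find every <a> tag and every <b> tag

soup.find_all(id='link2')  # keyword argument: Beautiful Soup checks the id attribute of every tag

soup.find_all(href=re.compile("elsie"))  # regular expression on an attribute: every tag whose href contains "elsie"

soup.find_all(href=re.compile("elsie"), id='link1')  # combined filters: the href must match and the id must equal link1

soup.find_all("div", class_="sister")  # the most common lookup; the keyword is class_ because class is a reserved word in Python (other attribute names need no underscore)

soup.find_all(attrs={"data-foo": "value"})  # HTML5 data-* attributes contain a hyphen, so they cannot be keyword arguments; pass them through attrs

soup.find_all(text="Elsie")  # search by text content (newer Beautiful Soup releases call this argument string; text is the older spelling)

soup.find_all(text=["Tillie", "Elsie", "Lacie"])  # a list of strings, analogous to the list of tag names above

soup.find_all(text=re.compile("Dormouse"))  # a regular expression, analogous to the examples above

soup.find_all("a", limit=2)  # return only the first two <a> tags; limit caps the number of results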
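The filters above can be tried on a small inline document. The sketch below assumes a made-up sample_html modelled on the familiar "three sisters" snippet, and uses the built-in 'html.parser' so that lxml is not required. It also shows how find() and find_all() differ when nothing matches.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup used only for this demonstration.
sample_html = """
<html><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<div data-foo="value">foo!</div>
</body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

by_name = soup.find_all("a")                          # all three links
by_regex = soup.find_all(re.compile("^b"))            # tag names starting with "b"
by_id = soup.find_all(id="link2")                     # filter on the id attribute
by_href = soup.find_all(href=re.compile("elsie"))     # regex on the href attribute
by_class = soup.find_all("a", class_="sister")        # class_ avoids the Python keyword
by_data = soup.find_all(attrs={"data-foo": "value"})  # hyphenated attribute via attrs
by_string = soup.find_all(string="Elsie")             # string is the newer name for text
limited = soup.find_all("a", limit=2)                 # stop after two matches

first_link = soup.find("a")          # a single Tag: the first match
missing = soup.find("table")         # no match, so find() gives None
none_found = soup.find_all("table")  # no match, so find_all() gives an empty list

print([tag.name for tag in by_regex])
print(first_link["id"], missing, len(none_found))
```

Because find() can return None, code that chains attribute access onto its result should check for None first.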

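The select() function is also useful. Basic usage:

# In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. soup.select() filters elements with the same syntax and returns a list.

(1) Select by tag name

soup.select('title')

(2) Select by class name

soup.select('.sister')

(3) Select by id

soup.select('#link1')

(4) Combined selectors

Combined selectors follow the same rules as in a CSS file: tag names, class names, and ids can be combined. For example, to find the element with id link1 inside a p tag, separate the two parts with a space.

soup.select('p #link1')

(5) Attribute selectors

A selector can also include an attribute condition, written in square brackets. The attribute belongs to the same node as the tag, so there must be no space between them; otherwise nothing will match.

soup.select('a[class="sister"]')

soup.select('a[href="http://example.com/elsie"]')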
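The five selector forms can be checked against a small inline document. This is a sketch under assumptions: sample_html is invented for the example, and the built-in 'html.parser' is used in place of lxml. The last two attribute lookups show why the space matters.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this demonstration.
sample_html = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

titles = soup.select("title")                    # (1) by tag name
sisters = soup.select(".sister")                 # (2) by class name
link1 = soup.select("#link1")                    # (3) by id
nested = soup.select("p #link1")                 # (4) id link1 somewhere inside a <p>
same_node = soup.select('a[class="sister"]')     # (5) attribute on the <a> itself
with_space = soup.select('a [class="sister"]')   # space: looks for descendants of <a>
by_href = soup.select('a[href="http://example.com/elsie"]')

print(len(same_node), len(with_space))
```

With the space, the selector asks for an element with class sister nested inside an a tag; the links here contain only text, so that query comes back empty.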

The get_text() method returns the text content of a tag. See the code below:

soup = BeautifulSoup(html.text, 'lxml')

print(type(soup.select('title')))  # select() returns a list-like result

print(soup.select('title')[0].get_text())  # text of the first title tag

for title in soup.select('title'):
    print(title.get_text())  # text of each title tag in the list
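A self-contained version of the same idea, assuming a made-up sample_html and the built-in 'html.parser' in place of lxml. It also shows that get_text() includes the text of nested tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this demonstration.
sample_html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>Once upon a time there were <b>three</b> little sisters.</p>"
    "</body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

result = soup.select("title")                     # list-like collection of Tag objects
first_title = result[0].get_text()                # text of the first match
all_titles = [tag.get_text() for tag in result]   # text of every match
paragraph = soup.select("p")[0].get_text()        # nested <b> text is merged in

print(first_title)
print(paragraph)
```

Indexing with [0] raises IndexError when select() finds nothing, so looping over the result is the safer pattern for pages whose structure is not known in advance.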

(II) Introduction to XPath and its usage
