Python Study Notes 5: The BeautifulSoup Module

2021-04-29


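I had an English midterm a few days ago and wanted to do well, so I spent a lot of time on academic English. Now that the exam is over, I can get back to learning web scraping. I have already covered basic Python syntax and the requests library, so next up is BeautifulSoup.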

Introduction

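BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.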

The examples use a simplified version of the Baidu home page:

<!DOCTYPE html>
<html>
<head>
    <meta content="text/html;charset=utf-8" http-equiv="content-type" />
    <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
    <meta content="always" name="referrer" />
    <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
    <title>百度一下，你就知道 </title>
</head>
<body link="#0000cc">
  <div id="wrapper">
    <div id="head">
        <div class="head_wrapper">
          <div id="u1">
            <a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--新闻--></a>
            <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>
            <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
            <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a>
            <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a>
            <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a>
            <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
          </div>
        </div>
    </div>
  </div>
</body>
</html>

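The program starts as shown below. This is a simplified copy of the Baidu page source that I downloaded from a group chat and saved locally, so there is no need to fetch it with requests. I open it directly with with open and parse it with html.parser: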

from bs4 import BeautifulSoup
with open("baidu.html","rb") as f:
    html=f.read()
    bs=BeautifulSoup(html,"html.parser")
    # All of the code below builds on this setup

Tag

A Tag is a tag together with its contents. Accessing a tag name as an attribute returns the first matching tag:

# 1. Tag: a tag and its contents; attribute access returns the first match
print(bs.title) # the first <title> tag and its contents; likewise below
print(bs.a)
print(bs.head)
print(type(bs.a)) # <class 'bs4.element.Tag'>
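
Once a Tag is in hand, its attributes can be read much like dictionary entries. The sketch below is only an illustration: it assumes a one-line inline snippet modeled on the sample page rather than the baidu.html file itself.

```python
from bs4 import BeautifulSoup

# Assumption: a tiny inline stand-in for the sample page, not the original file
html = '<div id="u1"><a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a></div>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a                # first <a> tag in the document
print(link.name)             # the tag's name, not its HTML "name" attribute
print(link["href"])          # dictionary-style lookup; raises KeyError if missing
print(link["name"])          # the HTML attribute called "name"
print(link.get("class"))     # class is multi-valued, so a list comes back
print(link.get("title"))     # .get() returns None for a missing attribute
```

Using .get() is the safer choice when an attribute may be absent, since it avoids the KeyError.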

NavigableString

A NavigableString is the string inside a tag:

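print(bs.title.text) # the tag's text as a plain str; .string returns a NavigableString
print(bs.a.attrs) # .attrs returns a dict of all the tag's attributes
print(bs.a.name) # .name returns the tag's name as a string
print(type(bs.title.string)) # <class 'bs4.element.NavigableString'>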
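
The difference between .string and .text is easy to miss, so here is a minimal sketch of it. The HTML is an inline snippet made up for illustration, not the sample file.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumption: a made-up snippet chosen to show both cases
html = "<p>Hello <b>world</b></p><title>百度一下，你就知道 </title>"
soup = BeautifulSoup(html, "html.parser")

title_string = soup.title.string   # NavigableString (a subclass of str)
title_text = soup.title.text       # plain str with the same characters

# <p> has two children (a string and a <b> tag), so .string is None,
# while get_text() joins every string found underneath the tag
p_string = soup.p.string
p_text = soup.p.get_text()

print(type(title_string), type(title_text))
print(p_string, p_text)
```

So .string is only useful when a tag has exactly one child; for anything nested, get_text() (or .text) is the more reliable choice.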

BeautifulSoup

The BeautifulSoup object represents the whole document:


print(bs.name) # [document]
print(bs.attrs) # {}
print(bs) # the whole document tree

Comment

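A Comment is a special kind of NavigableString. Printing it shows the comment text without the <!-- and --> markers. In the sample page the first <a> tag contains only the comment <!--新闻-->, so bs.a.string is a Comment: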

print(bs.a.string)
print(type(bs.a.string))
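
Because a Comment prints just like ordinary text, it can slip into scraped data unnoticed. A small sketch of telling the two apart with isinstance, assuming an inline copy of the first two links from the sample page:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Assumption: inline copy of the first two <a> tags from the sample page
html = ('<a href="http://news.baidu.com"><!--新闻--></a>'
        '<a href="http://news.baidu.com">新闻</a>')
soup = BeautifulSoup(html, "html.parser")

kinds = []
for link in soup.find_all("a"):
    content = link.string
    # Comment is a NavigableString subclass, so check for it explicitly
    kind = "comment" if isinstance(content, Comment) else "text"
    kinds.append(kind)
    print(kind, content)
```

Both links print the same characters, and only the type check distinguishes them.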

Traversing the document

The .contents attribute returns a list of a tag's direct children:


print(bs.head.contents)
print(bs.head.contents[1])
# For more, look up the BeautifulSoup document tree

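From what I read online, traversal can go up, down, or sideways through the tree, and .contents is a downward traversal. There are many other attributes I have not learned yet; the video barely covered them, perhaps because they are unnecessary or rarely used. I will look them up when I run into a problem that needs them.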
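
For reference, the three directions can be sketched roughly as follows. This assumes an inline snippet with no whitespace between tags; in a real page the newlines between tags show up as string siblings, which is why find_next_sibling() is often more convenient than .next_sibling.

```python
from bs4 import BeautifulSoup

# Assumption: a compact stand-in for the #u1 block, with no whitespace between tags
html = ('<div id="u1">'
        '<a name="tj_trnews">新闻</a>'
        '<a name="tj_trmap">地图</a>'
        '<a name="tj_trtieba">贴吧</a>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

first = soup.a

# Upward: .parent is the enclosing tag
parent_id = first.parent["id"]

# Downward: .children iterates over direct children (.contents is the list form)
child_names = [child["name"] for child in soup.div.children]

# Sideways: .next_sibling is the very next node, whatever its type;
# find_next_sibling("a") skips ahead to the next <a> tag
next_name = first.next_sibling["name"]
next_a = first.find_next_sibling("a")

print(parent_id, child_names, next_name, next_a.get_text())
```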

Searching the document

Some viewers in the video's live comments suggested using XPath. I may look into that when I have time.

find_all()

find_all() returns a list of every match.

String filter

Pass a string to find the tags whose name matches it exactly:

t_list=bs.find_all("a")

This returns a list of all the <a> tags.

Regular expression filter

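When given a compiled regular expression, find_all() matches tag names using the pattern's search() method, so the call below returns every tag whose name contains the letter a: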

import re
t_list=bs.find_all(re.compile("a"))
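
Since search() matches anywhere in the tag name, the pattern "a" picks up more than just <a> tags. A small sketch of the difference, assuming a minimal inline page:

```python
import re
from bs4 import BeautifulSoup

# Assumption: a minimal page with a few tag names that contain the letter "a"
html = ("<html><head><meta charset='utf-8'><title>t</title></head>"
        "<body><a href='#'>x</a></body></html>")
soup = BeautifulSoup(html, "html.parser")

# search() semantics: any tag name containing "a"
contains_a = [tag.name for tag in soup.find_all(re.compile("a"))]

# Anchoring the pattern restricts it to tags named exactly "a"
exactly_a = [tag.name for tag in soup.find_all(re.compile("^a$"))]

print(contains_a)
print(exactly_a)
```

When only <a> tags are wanted, the plain string filter find_all("a") is simpler than an anchored pattern.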

Function filter

Pass in a function, and find_all() returns the tags for which it returns True (it is enough to know this exists):

def name_is_exist(tag):
    return tag.has_attr("name") # True for tags that have a name attribute
t_list=bs.find_all(name_is_exist)
for item in t_list:
    print(item)
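
A function filter becomes more useful when a condition cannot be expressed as a single attribute test. The sketch below is a hypothetical example (the function is_https_link and the inline HTML are my own, not from the video): it keeps only <a> tags whose href starts with https://.

```python
from bs4 import BeautifulSoup

# Assumption: three links modeled on the sample page, one of them without an href
html = ('<a class="mnav" href="http://news.baidu.com">新闻</a>'
        '<a class="mnav" href="https://www.hao123.com">hao123</a>'
        '<a class="bri">更多产品</a>')
soup = BeautifulSoup(html, "html.parser")

def is_https_link(tag):
    """Return True for <a> tags whose href starts with https://."""
    # .get() with a default avoids a KeyError on tags that have no href
    return tag.name == "a" and tag.get("href", "").startswith("https://")

https_links = soup.find_all(is_https_link)
https_texts = [a.get_text() for a in https_links]
print(https_texts)
```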

Keyword arguments (kwargs)

t_list=bs.find_all(id="head") # tags whose id attribute equals "head"
for item in t_list:
    print(item)

t_list=bs.find_all(class_=True) # class is a Python keyword, hence class_; True matches any tag that has a class attribute
for item in t_list:
    print(item)
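
One catch with keyword arguments is the HTML attribute called name, because find_all() already uses name for the tag name. A sketch of the workaround with the attrs dictionary, assuming an inline snippet modeled on the sample page:

```python
from bs4 import BeautifulSoup

# Assumption: two links copied in shape from the sample page
html = ('<div id="u1">'
        '<a class="mnav" name="tj_trmap" href="http://map.baidu.com">地图</a>'
        '<a class="bri" name="tj_briicon" href="//www.baidu.com/more/">更多产品</a>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

# class_ with a string matches tags carrying that CSS class
by_class = soup.find_all("a", class_="mnav")

# The HTML attribute "name" has to go through attrs
by_name_attr = soup.find_all(attrs={"name": "tj_trmap"})

# name= means the tag name, so this looks for <tj_trmap> tags and finds none
as_tag_name = soup.find_all(name="tj_trmap")

print(by_class, by_name_attr, as_tag_name)
```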

The text argument

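The text argument searches the strings inside tags, that is, the content between an opening <...> and a closing </...>. Newer versions of Beautiful Soup call this argument string; text is the older name.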

t_list=bs.find_all(text="hao123") # exact match on a single string
t_list=bs.find_all(text=["hao123","地图","贴吧"]) # match any string in the list
t_list=bs.find_all(text=re.compile(r"\d")) # regular expression: strings inside tags that contain a digit
for item in t_list:
    print(item)
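
Note that a text search on its own returns the matching strings, not the tags around them. A sketch of getting back to the tag, assuming an inline snippet and a Beautiful Soup version that accepts the newer string= spelling (4.4.0 or later):

```python
import re
from bs4 import BeautifulSoup

# Assumption: two links copied in shape from the sample page
html = ('<a href="https://www.hao123.com">hao123</a>'
        '<a href="http://map.baidu.com">地图</a>')
soup = BeautifulSoup(html, "html.parser")

# On its own, string= returns NavigableString objects
matches = soup.find_all(string=re.compile(r"\d"))
matched_text = [str(s) for s in matches]

# .parent walks from each string back to its enclosing tag
hrefs = [s.parent["href"] for s in matches]

# Combined with a tag name, the result is the tags themselves
map_links = soup.find_all("a", string="地图")

print(matched_text, hrefs, map_links)
```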

The limit argument

limit caps the number of results returned:


t_list=bs.find_all("a",limit=3)
for item in t_list:
    print(item)
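
When only the first match is needed, find() is a related shortcut. A brief sketch, assuming a made-up snippet of four links:

```python
from bs4 import BeautifulSoup

# Assumption: a made-up snippet with four numbered links
html = "<a>1</a><a>2</a><a>3</a><a>4</a>"
soup = BeautifulSoup(html, "html.parser")

first_two = [a.get_text() for a in soup.find_all("a", limit=2)]

# find() returns the first match directly instead of a one-element list
first = soup.find("a")

# With no match, find() gives None while find_all() gives an empty list
missing_one = soup.find("table")
missing_all = soup.find_all("table")

print(first_two, first.get_text(), missing_one, missing_all)
```

The None result means code using find() should check for it before reading attributes from the return value.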

select()

select() takes a CSS selector and returns a list:

t_list=bs.select("title") # element (type) selector
t_list=bs.select(".mnav") # class selector
t_list = bs.select("#u1") # id selector
t_list=bs.select("a[class='bri']") # attribute selector
t_list=bs.select("head>title") # child selector
t_list=bs.select(".mnav~.bri") # general sibling selector
print(t_list[0].get_text()) # get_text() returns the tag's text
for item in t_list:
    print(item)
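
Selectors can also be combined, and select_one() returns a single tag rather than a list. The following is a sketch under the assumption of an inline copy of part of the #u1 block; it collects link text and href into a dictionary.

```python
from bs4 import BeautifulSoup

# Assumption: inline copy of part of the #u1 block from the sample page
html = ('<div id="u1">'
        '<a class="mnav" href="http://news.baidu.com">新闻</a>'
        '<a class="mnav" href="http://map.baidu.com">地图</a>'
        '<a class="bri" href="//www.baidu.com/more/">更多产品</a>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

# Combined selector: <a> tags with class mnav that are direct children of #u1
links = {a.get_text(): a["href"] for a in soup.select("#u1 > a.mnav")}

# select_one() returns the first match, or None when nothing matches
more = soup.select_one("a.bri")
nothing = soup.select_one("a.missing")

print(links, more["href"], nothing)
```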

I have studied CSS before, and this looks much the same.

Final thoughts

I do not think I learned find_all() very thoroughly. The second reference below explains it quite clearly, and I plan to reread it.

Finally, here are my complete notes:

# -*-coding=utf-8-*-
# @Time     : 2021/4/22 8:15
# @Author   : Tianze
# @Email    : 1252448508@qq.com
# @File     : demo8.py
# @Software : PyCharm

# BeautifulSoup4 converts a complex HTML document into a tree structure; every node is a Python object, and all objects fall into four kinds:
# Tag
# NavigableString
# BeautifulSoup
# Comment

import re
from bs4 import BeautifulSoup
with open("baidu.html","rb") as f:
    html=f.read()
    bs=BeautifulSoup(html,"html.parser")

    # 1. Tag: a tag and its contents; attribute access returns the first match
    print(bs.title) # the first <title> tag and its contents
    print(bs.a)
    print(bs.head)
    print(type(bs.a))

    # 2. NavigableString: the string inside a tag
    print(bs.title.text) # or .string
    print(bs.a.attrs) # .attrs returns a dict of all the tag's attributes
    print(type(bs.title.string))

    # 3. BeautifulSoup: represents the whole document
    print(bs.name)
    print(bs.attrs)
    print(bs)

    # 4. Comment: a special NavigableString; printed without the comment markers
    print(bs.a.string)
    print(type(bs.a.string))

    # Traversing the document
    print(bs.head.contents)
    print(bs.head.contents[1])
    # For more, look up the BeautifulSoup document tree

    # Searching the document (live comments suggested XPath)
    # 1. find_all() returns a list

    # String filter: finds tags whose name exactly matches the string
    t_list=bs.find_all("a")

    # Regular expression filter: tag names are matched with search()
    t_list=bs.find_all(re.compile("a"))

    # Function filter: pass in a function and search by its result (enough to know it exists)
    def name_is_exist(tag):
        return tag.has_attr("name")
    t_list=bs.find_all(name_is_exist)
    for item in t_list:
        print(item)

    # Keyword arguments (kwargs)
    t_list=bs.find_all(id="head")
    for item in t_list:
        print(item)

    t_list=bs.find_all(class_=True) # class is a Python keyword, hence class_; True matches any tag that has a class attribute
    for item in t_list:
        print(item)

    # The text argument
    t_list=bs.find_all(text="hao123")
    t_list=bs.find_all(text=["hao123","地图","贴吧"])
    t_list=bs.find_all(text=re.compile(r"\d")) # regular expression: strings inside tags that contain a digit
    for item in t_list:
        print(item)

    # The limit argument
    t_list=bs.find_all("a",limit=3)
    for item in t_list:
        print(item)

    # 2. select(): CSS selectors, returns a list
    t_list=bs.select("title") # element (type) selector
    t_list=bs.select(".mnav") # class selector
    t_list = bs.select("#u1") # id selector
    t_list=bs.select("a[class='bri']") # attribute selector
    t_list=bs.select("head>title") # child selector
    t_list=bs.select(".mnav~.bri") # general sibling selector
    print(t_list[0].get_text()) # get_text() returns the tag's text
    for item in t_list:
        print(item)

References

https://zhuanlan.zhihu.com/p/181410680

https://www.cnblogs.com/tjp40922/p/10428447.html
