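Introduction
BeautifulSoup is a Python library for extracting data from HTML or XML files; working through a parser of your choice, it gives you idiomatic ways to navigate, search, and modify the document.
BeautifulSoup is a parsing library that sits on top of an underlying parser (such as html.parser, lxml, or html5lib) rather than on regular expressions, and it offers powerful parsing features; using BeautifulSoup can make both data extraction and crawler development more efficient.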
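As a rough illustration of the three activities mentioned above (navigating, searching, modifying), here is a minimal sketch. The `markup` string is a made-up sample, not taken from this article, and `html.parser` is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this sketch.
markup = '<ul id="fruits"><li class="red">apple</li><li class="yellow">banana</li></ul>'

# "html.parser" is the parser bundled with Python, so no extra package is needed.
soup = BeautifulSoup(markup, "html.parser")

# Navigate: attribute access returns the first tag with that name.
first_item = soup.ul.li
print(first_item.string)   # apple

# Search: find_all returns every matching tag.
names = [li.get_text() for li in soup.find_all("li")]
print(names)               # ['apple', 'banana']

# Modify: change an attribute and the text of a tag in place.
first_item["class"] = "green"
first_item.string = "lime"
print(soup.ul)
```

After the modification step, the tree itself has changed, so printing `soup.ul` shows the new class and text.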
Installation
shell
pip install beautifulsoup4
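The examples in this article pass `'lxml'` as the parser name, and lxml is a separate package from beautifulsoup4. The sketch below shows one possible way to fall back to Python's built-in `html.parser` when lxml is missing; the helper name `make_soup` is an assumption introduced here for illustration, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is unavailable.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>The Dormouse's story</b></p>")
print(soup.b.string)
```

Different parsers can build slightly different trees from malformed markup, so a fallback like this is best suited to quick experiments rather than to code that depends on an exact tree shape.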
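Usage
1 Building the document tree
BeautifulSoup parses a document by representing it as a tree, and that tree is built from four kinds of objects: BeautifulSoup, Tag, NavigableString, and Comment.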
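The four object types can be observed directly. The following is a minimal sketch using an abridged fragment of the "Dormouse" sample that appears later in this article; it assumes `html.parser` so that it runs without extra packages.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Abridged fragment of the article's sample document.
markup = '<p class="story"><a id="link1"><!--Elsie--></a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

first_link = soup.find("a", id="link1")    # contains only a comment
second_link = soup.find("a", id="link2")   # contains ordinary text

print(type(soup).__name__)                 # BeautifulSoup: the whole document
print(type(first_link).__name__)           # Tag: one element of the tree
print(type(second_link.string).__name__)   # NavigableString: text inside a tag
print(type(first_link.string).__name__)    # Comment: text of an HTML comment

# Comment is a subclass of NavigableString, so test for Comment first
# when the two need to be told apart.
print(isinstance(first_link.string, NavigableString))   # True
```

Because a Comment prints just like ordinary text ("Elsie" here), checking the type is the reliable way to avoid treating comment text as visible page content.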
python
from bs4 import BeautifulSoup
html = """<div class="post js_watermark quill-editor" style="background-repeat: repeat; background-image: url(""); background-size: 940px 222.942px;">
<h1 class="title">版面分析------网页HTML解析 BeautifulSoup</h1>
<div class="group-info">
<a href="https://wx.zsxq.com/dweb2/index/group/51112141255244">
<span>来自:</span>
<span class="group-name">AiGC面试宝典</span>
</a>
</div>
<div class="author-info">
<div class="author">
<img src="https://images.zsxq.com/FpFYmnHpgmz5J4DicXxscPfi3GI2?e=2064038400&token=kIxbL07-8jAj8w1n4s9zv64FuZZNEATmlU_Vm6zD:hS7fTOpUpCI18IU4GweitfivQIU=" alt="用户头像">
<span class="nick-name">Just do it!</span>
</div>
<span class="date" id="article-date">2024年04月27日 14:30</span>
</div>
<div class="ql-snow">
<div class="content ql-editor"><p><img src="https://article-images.zsxq.com/FsOmOdM3jIkLawUT9z7sEbkMZgpV"></p><p><img src="https://article-images.zsxq.com/FnbQkQK1pNTESbYjScR42_PrYb9E"></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-keyword">from</span> bs4 <span class="ql-token hljs-keyword">import</span> BeautifulSoup</div><div class="ql-code-block">html = <span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><html><head><title>The Dormouse's story</title></head></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="title"><b>The Dormouse's story</b></p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">Once upon a time there were three little sisters; and their names were</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</span></div><div class="ql-code-block"><span class="ql-token hljs-string">and they lived at the bottom of a well.</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">...</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></html></span></div><div class="ql-code-block"><span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-comment">#1、BeautifulSoup对象</span></div><div class="ql-code-block">soup = 
BeautifulSoup(html, <span class="ql-token hljs-string">'lxml'</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"type(soup):{type(soup)} \n"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#2、Tag对象</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"soup.head:{soup.head} \n"</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"soup.head.name:{soup.head.name} \n"</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"soup.head.attrs:{soup.head.attrs} \n"</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"type(soup.head):{type(soup.head)} \n"</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>()</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#3、Navigable String对象</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"soup.title.string:{soup.title.string} \n"</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"type(soup.title.string):{type(soup.title.string)} \n"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#4、Comment对象</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"soup.a.string:{soup.a.string} \n"</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token 
hljs-string">f"type(soup.a.string):{type(soup.a.string)} \n"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#5、结构化输出soup对象</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"soup.prettify()=>{soup.prettify()}"</span>)</div></div><p><img src="https://article-images.zsxq.com/FmlPl-0tw4xgHRqKTWm5F2R15YJq"></p><div class="ql-code-block-container"><div class="ql-code-block">type(soup):<span class="ql-token hljs-tag"><class 'bs4.BeautifulSoup'></span></div><div class="ql-code-block"><br></div><div class="ql-code-block">soup.head:<span class="ql-token hljs-tag"><head><title></span>The Dormouse's story<span class="ql-token hljs-tag"></title></head></span></div><div class="ql-code-block"><br></div><div class="ql-code-block">soup.head.name:head</div><div class="ql-code-block"><br></div><div class="ql-code-block">soup.head.attrs:{}</div><div class="ql-code-block"><br></div><div class="ql-code-block">type(soup.head):<span class="ql-token hljs-tag"><class 'bs4.element.Tag'></span></div><div class="ql-code-block"><br></div><div class="ql-code-block">soup.title.string:The Dormouse's story</div><div class="ql-code-block"><br></div><div class="ql-code-block">type(soup.title.string):<span class="ql-token hljs-tag"><class 'bs4.element.NavigableString'></span></div><div class="ql-code-block"><br></div><div class="ql-code-block">soup.a.string:Elsie</div><div class="ql-code-block"><br></div><div class="ql-code-block">type(soup.a.string):<span class="ql-token hljs-tag"><class 'bs4.element.Comment'></span></div><div class="ql-code-block"><br></div><div class="ql-code-block">soup.prettify()=><span class="ql-token hljs-tag"><html></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><head></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><title></span></div><div class="ql-code-block"> The Dormouse's 
story</div><div class="ql-code-block"> <span class="ql-token hljs-tag"></title></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"></head></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><body></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><p class="title"></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><b></span></div><div class="ql-code-block"> The Dormouse's story</div><div class="ql-code-block"> <span class="ql-token hljs-tag"></b></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"></p></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><p class="story"></span></div><div class="ql-code-block"> Once upon a time there were three little sisters; and their names were</div><div class="ql-code-block"> <span class="ql-token hljs-tag"><a class="sister" href="http://example.com/elsie" id="link1"></span></div><div class="ql-code-block"> <span class="ql-token hljs-comment"><!--Elsie--></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"></a></span></div><div class="ql-code-block"> ,</div><div class="ql-code-block"> <span class="ql-token hljs-tag"><a class="sister" href="http://example.com/lacie" id="link2"></span></div><div class="ql-code-block"> Lacie</div><div class="ql-code-block"> <span class="ql-token hljs-tag"></a></span></div><div class="ql-code-block"> and</div><div class="ql-code-block"> <span class="ql-token hljs-tag"><a class="sister" href="http://example.com/tillie" id="link3"></span></div><div class="ql-code-block"> Tillie</div><div class="ql-code-block"> <span class="ql-token hljs-tag"></a></span></div><div class="ql-code-block"> ;</div><div class="ql-code-block">and they lived at the bottom of a well.</div><div class="ql-code-block"> <span class="ql-token hljs-tag"></p></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><p class="story"></span></div><div class="ql-code-block"> 
...</div><div class="ql-code-block"> <span class="ql-token hljs-tag"></p></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"></body></span></div><div class="ql-code-block"><span class="ql-token hljs-tag"></html></span></div></div><p><br></p><p><img src="https://article-images.zsxq.com/FtPX-qsEEgZYHos3AnyDni1jH6rn"></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-keyword">from</span> bs4 <span class="ql-token hljs-keyword">import</span> BeautifulSoup</div><div class="ql-code-block">html = <span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><html><head><title>The Dormouse's story</title></head></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="title"><b>The Dormouse's story</b></p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">Once upon a time there were three little sisters; and their names were</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</span></div><div class="ql-code-block"><span class="ql-token hljs-string">and they lived at the bottom of a well.</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">...</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></html></span></div><div class="ql-code-block"><span class="ql-token 
hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-comment">#1、BeautifulSoup对象</span></div><div class="ql-code-block">soup = BeautifulSoup(html, <span class="ql-token hljs-string">'lxml'</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"type(soup):{type(soup)} \n"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#1、向下遍历</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.p.contents)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-built_in">list</span>(soup.p.children))</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-built_in">list</span>(soup.p.descendants))</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#2、向上遍历</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.p.parent.name,<span class="ql-token hljs-string">'\n'</span>)</div><div class="ql-code-block"><span class="ql-token hljs-keyword">for</span> i <span class="ql-token hljs-keyword">in</span> soup.p.parents:</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(i.name)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#3、平行遍历</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'a_next:'</span>,soup.a.next_sibling)</div><div class="ql-code-block"><span class="ql-token hljs-keyword">for</span> i <span class="ql-token hljs-keyword">in</span> soup.a.next_siblings:</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'a_nexts:'</span>,i)</div><div 
class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'a_previous:'</span>,soup.a.previous_sibling)</div><div class="ql-code-block"><span class="ql-token hljs-keyword">for</span> i <span class="ql-token hljs-keyword">in</span> soup.a.previous_siblings:</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'a_previouss:'</span>,i)</div></div><p><br></p><p><img src="https://article-images.zsxq.com/FuyJDzHROhQahkpBUh4jWRuaB-mo"></p><p><br></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-built_in">type</span>(soup):<<span class="ql-token hljs-keyword">class</span> <span class="ql-token hljs-string">'bs4.BeautifulSoup'</span>></div><div class="ql-code-block"><br></div><div class="ql-code-block">[<b>The Dormouse<span class="ql-token hljs-string">'s story</b>]</span></div><div class="ql-code-block"><span class="ql-token hljs-string">[<b>The Dormouse'</span>s story</b>]</div><div class="ql-code-block">[<b>The Dormouse<span class="ql-token hljs-string">'s story</b>, "The Dormouse'</span>s story<span class="ql-token hljs-string">"]</span></div><div class="ql-code-block"><span class="ql-token hljs-string">body</span></div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-string">body</span></div><div class="ql-code-block"><span class="ql-token hljs-string">html</span></div><div class="ql-code-block"><span class="ql-token hljs-string">[document]</span></div><div class="ql-code-block"><span class="ql-token hljs-string">a_next: ,</span></div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-string">a_nexts: ,</span></div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-string">a_nexts: <a class="</span>siste<span class="ql-token hljs-string">r" href="</span>http://example.com/lacie<span 
class="ql-token hljs-string">" id="</span>link2<span class="ql-token hljs-string">">Lacie</a></span></div><div class="ql-code-block"><span class="ql-token hljs-string">a_nexts: and</span></div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-string">a_nexts: <a class="</span>siste<span class="ql-token hljs-string">r" href="</span>http://example.com/tillie<span class="ql-token hljs-string">" id="</span>link3<span class="ql-token hljs-string">">Tillie</a></span></div><div class="ql-code-block"><span class="ql-token hljs-string">a_nexts: ;</span></div><div class="ql-code-block"><span class="ql-token hljs-string">and they lived at the bottom of a well.</span></div><div class="ql-code-block"><span class="ql-token hljs-string">a_previous: Once upon a time there were three little sisters; and their names were</span></div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-string">a_previouss: Once upon a time there were three little sisters; and their names were</span></div></div><p><img src="https://article-images.zsxq.com/FtqWdNWSM0b8quez92lJ9SqPTK76"></p><p><span style="background-color: rgb(240, 240, 240); color: rgb(92, 92, 92);">代码</span></p><p><br></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-keyword">from</span> bs4 <span class="ql-token hljs-keyword">import</span> BeautifulSoup</div><div class="ql-code-block">html = <span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><html><head><title>The Dormouse's story</title></head></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="title"><b>The Dormouse's story</b></p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">Once upon a time there were three little sisters; and their names 
were</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</span></div><div class="ql-code-block"><span class="ql-token hljs-string">and they lived at the bottom of a well.</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">...</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></html></span></div><div class="ql-code-block"><span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-comment">#1、BeautifulSoup对象</span></div><div class="ql-code-block">soup = BeautifulSoup(html, <span class="ql-token hljs-string">'lxml'</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"type(soup):{type(soup)} \n"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#1、find_all( )</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.find_all(<span class="ql-token hljs-string">'a'</span>)) <span class="ql-token hljs-comment">#检索标签名</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.find_all(<span class="ql-token hljs-string">'a'</span>,<span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">'link1'</span>)) <span class="ql-token hljs-comment">#检索属性值</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.find_all(<span 
class="ql-token hljs-string">'a'</span>,class_=<span class="ql-token hljs-string">'sister'</span>)) </div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.find_all(text=[<span class="ql-token hljs-string">'Elsie'</span>,<span class="ql-token hljs-string">'Lacie'</span>]))</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#2、find( )</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.find(<span class="ql-token hljs-string">'a'</span>))</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.find(<span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">'link2'</span>))</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#3 、向上检索</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.p.find_parent().name)</div><div class="ql-code-block"><span class="ql-token hljs-keyword">for</span> i <span class="ql-token hljs-keyword">in</span> soup.title.find_parents():</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(i.name)</div><div class="ql-code-block"> </div><div class="ql-code-block"><span class="ql-token hljs-comment">#4、平行检索</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.head.find_next_sibling().name)</div><div class="ql-code-block"><span class="ql-token hljs-keyword">for</span> i <span class="ql-token hljs-keyword">in</span> soup.head.find_next_siblings():</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(i.name)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.title.find_previous_sibling())</div><div class="ql-code-block"><span class="ql-token hljs-keyword">for</span> i <span class="ql-token hljs-keyword">in</span> 
soup.title.find_previous_siblings():</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(i.name)</div></div><p><img src="https://article-images.zsxq.com/FgdDcWod8Suvbq5UuGYLvXz0UI8R"></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-built_in">type</span>(soup):<<span class="ql-token hljs-keyword">class</span> <span class="ql-token hljs-string">'bs4.BeautifulSoup'</span>></div><div class="ql-code-block"><br></div><div class="ql-code-block">[<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link2"</span>>Lacie</a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/tillie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link3"</span>>Tillie</a>]</div><div class="ql-code-block">[<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div><div class="ql-code-block">[<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token 
hljs-string">"link1"</span>><!--Elsie--></a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link2"</span>>Lacie</a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/tillie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link3"</span>>Tillie</a>]</div><div class="ql-code-block">F:\AwesomeRAG\tutorial\layout_analysis\html\tutorial\BeautifulSoup4\test3.py:<span class="ql-token hljs-number">24</span>: DeprecationWarning: The <span class="ql-token hljs-string">'text'</span> argument to find()-<span class="ql-token hljs-built_in">type</span> methods <span class="ql-token hljs-keyword">is</span> deprecated. Use <span class="ql-token hljs-string">'string'</span> instead.</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(soup.find_all(text=[<span class="ql-token hljs-string">'Elsie'</span>,<span class="ql-token hljs-string">'Lacie'</span>]))</div><div class="ql-code-block">[<span class="ql-token hljs-string">'Elsie'</span>, <span class="ql-token hljs-string">'Lacie'</span>]</div><div class="ql-code-block"><a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a></div><div class="ql-code-block"><a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token 
hljs-string">"link2"</span>>Lacie</a></div><div class="ql-code-block">body</div><div class="ql-code-block">head</div><div class="ql-code-block">html</div><div class="ql-code-block">[document]</div><div class="ql-code-block">body</div><div class="ql-code-block">body</div><div class="ql-code-block"><span class="ql-token hljs-literal">None</span></div></div><p><img src="https://article-images.zsxq.com/FiqTmlpR_fGE6pUZ8gCcdD9z1ao_"></p><div class="ql-code-block-container"><div class="ql-code-block">HTML标题:<h> </h></div><div class="ql-code-block">HTML段落:<p> </p></div><div class="ql-code-block">HTML链接:<a href=<span class="ql-token hljs-string">'httts://www.baidu.com/'</span>> this <span class="ql-token hljs-keyword">is</span> a link </a></div><div class="ql-code-block">HTML图像:<img src=<span class="ql-token hljs-string">'Ai-code.jpg'</span>,width=<span class="ql-token hljs-string">'104'</span>,height=<span class="ql-token hljs-string">'144'</span> /></div><div class="ql-code-block">HTML表格:<table> </table></div><div class="ql-code-block">HTML列表:<ul> </ul></div><div class="ql-code-block">HTML块:<div> </div></div></div><p><img src="https://article-images.zsxq.com/FkTgptMBTLt2w7nUUKs13PNKkckn"></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-keyword">from</span> bs4 <span class="ql-token hljs-keyword">import</span> BeautifulSoup</div><div class="ql-code-block"><br></div><div class="ql-code-block">html = <span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><html><head><title>The Dormouse's story</title></head></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="title"><b>The Dormouse's story</b></p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">Once upon a time there were three little sisters; and their names 
were</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</span></div><div class="ql-code-block"><span class="ql-token hljs-string">and they lived at the bottom of a well.</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">...</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></html></span></div><div class="ql-code-block"><span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-comment">#1、BeautifulSoup对象</span></div><div class="ql-code-block">soup = BeautifulSoup(html, <span class="ql-token hljs-string">'lxml'</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"type(soup):{type(soup)} \n"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'标签查找:'</span>,soup.select(<span class="ql-token hljs-string">'a'</span>))</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'属性查找:'</span>,soup.select(<span class="ql-token hljs-string">'a[id="link1"]'</span>))</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'类名查找:'</span>,soup.select(<span class="ql-token hljs-string">'.sister'</span>))</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span 
class="ql-token hljs-string">'id查找:'</span>,soup.select(<span class="ql-token hljs-string">'#link1'</span>))</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'组合查找:'</span>,soup.select(<span class="ql-token hljs-string">'p #link1'</span>))</div></div><p><img src="https://article-images.zsxq.com/FjQkiig9fOl0Bd5qiCbyH4OddW50"></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-built_in">type</span>(soup):<<span class="ql-token hljs-keyword">class</span> <span class="ql-token hljs-string">'bs4.BeautifulSoup'</span>></div><div class="ql-code-block"><br></div><div class="ql-code-block">标签查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link2"</span>>Lacie</a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/tillie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link3"</span>>Tillie</a>]</div><div class="ql-code-block">属性查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div><div class="ql-code-block">类名查找: [<a <span class="ql-token hljs-keyword">class</span>=<span 
class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link2"</span>>Lacie</a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/tillie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link3"</span>>Tillie</a>]</div><div class="ql-code-block"><span class="ql-token hljs-built_in">id</span>查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div><div class="ql-code-block">组合查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div></div><p><img src="https://article-images.zsxq.com/FqZck0in441U4EYGi6KobKlS0emA"></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-keyword">import</span> requests</div><div class="ql-code-block"><span class="ql-token hljs-keyword">from</span> bs4 <span class="ql-token hljs-keyword">import</span> BeautifulSoup</div><div class="ql-code-block"><span class="ql-token hljs-keyword">import</span> os</div><div 
class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-keyword">def</span> <span class="ql-token hljs-title">getUrl</span>(<span class="ql-token hljs-params">url</span>):</div><div class="ql-code-block">    <span class="ql-token hljs-keyword">try</span>:</div><div class="ql-code-block">        read = requests.get(url)</div><div class="ql-code-block">        read.raise_for_status()</div><div class="ql-code-block">        read.encoding = read.apparent_encoding</div><div class="ql-code-block">        <span class="ql-token hljs-keyword">return</span> read.text</div><div class="ql-code-block">    <span class="ql-token hljs-keyword">except</span> requests.RequestException:</div><div class="ql-code-block">        <span class="ql-token hljs-keyword">return</span> <span class="ql-token hljs-string">"Connection failed!"</span></div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-keyword">def</span> <span class="ql-token hljs-title">getPic</span>(<span class="ql-token hljs-params">html</span>):</div><div class="ql-code-block">    soup = BeautifulSoup(html, <span class="ql-token hljs-string">"html.parser"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block">    all_img = soup.find(<span class="ql-token hljs-string">'ul'</span>).find_all(<span class="ql-token hljs-string">'img'</span>)</div><div class="ql-code-block">    <span class="ql-token hljs-keyword">for</span> img <span class="ql-token hljs-keyword">in</span> all_img:</div><div class="ql-code-block">        src = img[<span class="ql-token hljs-string">'src'</span>]</div><div class="ql-code-block">        img_url = src</div><div class="ql-code-block">        <span class="ql-token hljs-built_in">print</span>(img_url)</div><div class="ql-code-block">        root = <span class="ql-token hljs-string">"F:/Pic/"</span></div><div class="ql-code-block">        path = root + img_url.split(<span class="ql-token hljs-string">'/'</span>)[-<span class="ql-token hljs-number">1</span>]</div><div class="ql-code-block">        <span class="ql-token
hljs-built_in">print</span>(path)</div><div class="ql-code-block">        <span class="ql-token hljs-keyword">try</span>:</div><div class="ql-code-block">            <span class="ql-token hljs-keyword">if</span> <span class="ql-token hljs-keyword">not</span> os.path.exists(root):</div><div class="ql-code-block">                os.mkdir(root)</div><div class="ql-code-block">            <span class="ql-token hljs-keyword">if</span> <span class="ql-token hljs-keyword">not</span> os.path.exists(path):</div><div class="ql-code-block">                read = requests.get(img_url)</div><div class="ql-code-block">                <span class="ql-token hljs-keyword">with</span> <span class="ql-token hljs-built_in">open</span>(path, <span class="ql-token hljs-string">"wb"</span>) <span class="ql-token hljs-keyword">as</span> f:</div><div class="ql-code-block">                    f.write(read.content)</div><div class="ql-code-block">                <span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">"File saved!"</span>)</div><div class="ql-code-block">            <span class="ql-token hljs-keyword">else</span>:</div><div class="ql-code-block">                <span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">"File already exists!"</span>)</div><div class="ql-code-block">        <span class="ql-token hljs-keyword">except</span> (requests.RequestException, OSError):</div><div class="ql-code-block">            <span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">"Failed to download the file!"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-keyword">if</span> __name__ == <span class="ql-token hljs-string">'__main__'</span>:</div><div class="ql-code-block">    html_url = getUrl(<span class="ql-token hljs-string">"https://findicons.com/search/nature"</span>)</div><div class="ql-code-block">    getPic(html_url)</div></div><p><br></p><p><img src="https://article-images.zsxq.com/Fh_dDSbuteEI_0ArnWoZrCFDRuvm"></p><div class="ql-code-block-container"><div class="ql-code-block">标签查找: [<a <span class="ql-token 
hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link2"</span>>Lacie</a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/tillie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link3"</span>>Tillie</a>]</div><div class="ql-code-block">属性查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div><div class="ql-code-block">类名查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link2"</span>>Lacie</a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/tillie"</span> <span class="ql-token 
hljs-built_in">id</span>=<span class="ql-token hljs-string">"link3"</span>>Tillie</a>]</div><div class="ql-code-block"><span class="ql-token hljs-built_in">id</span>查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div><div class="ql-code-block">组合查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div></div></div>
</div>
<div class="milkdown-preview" style="display: none;"><p><img src="https://article-images.zsxq.com/FsOmOdM3jIkLawUT9z7sEbkMZgpV"></p><p><img src="https://article-images.zsxq.com/FnbQkQK1pNTESbYjScR42_PrYb9E"></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-keyword">from</span> bs4 <span class="ql-token hljs-keyword">import</span> BeautifulSoup</div><div class="ql-code-block">html = <span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><html><head><title>The Dormouse's story</title></head></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="title"><b>The Dormouse's story</b></p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">Once upon a time there were three little sisters; and their names were</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</span></div><div class="ql-code-block"><span class="ql-token hljs-string">and they lived at the bottom of a well.</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">...</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></html></span></div><div class="ql-code-block"><span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-comment">#1、BeautifulSoup对象</span></div><div 
class="ql-code-block">soup = BeautifulSoup(html, <span class="ql-token hljs-string">'lxml'</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"type(soup):{type(soup)} \n"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#2、Tag对象</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"soup.head:{soup.head} \n"</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"soup.head.name:{soup.head.name} \n"</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"soup.head.attrs:{soup.head.attrs} \n"</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"type(soup.head):{type(soup.head)} \n"</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>()</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#3、Navigable String对象</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"soup.title.string:{soup.title.string} \n"</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"type(soup.title.string):{type(soup.title.string)} \n"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#4、Comment对象</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"soup.a.string:{soup.a.string} \n"</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token 
hljs-string">f"type(soup.a.string):{type(soup.a.string)} \n"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#5、结构化输出soup对象</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"soup.prettify()=>{soup.prettify()}"</span>)</div></div><p><img src="https://article-images.zsxq.com/FmlPl-0tw4xgHRqKTWm5F2R15YJq"></p><div class="ql-code-block-container"><div class="ql-code-block">type(soup):<span class="ql-token hljs-tag"><class 'bs4.BeautifulSoup'></span></div><div class="ql-code-block"><br></div><div class="ql-code-block">soup.head:<span class="ql-token hljs-tag"><head><title></span>The Dormouse's story<span class="ql-token hljs-tag"></title></head></span></div><div class="ql-code-block"><br></div><div class="ql-code-block">soup.head.name:head</div><div class="ql-code-block"><br></div><div class="ql-code-block">soup.head.attrs:{}</div><div class="ql-code-block"><br></div><div class="ql-code-block">type(soup.head):<span class="ql-token hljs-tag"><class 'bs4.element.Tag'></span></div><div class="ql-code-block"><br></div><div class="ql-code-block">soup.title.string:The Dormouse's story</div><div class="ql-code-block"><br></div><div class="ql-code-block">type(soup.title.string):<span class="ql-token hljs-tag"><class 'bs4.element.NavigableString'></span></div><div class="ql-code-block"><br></div><div class="ql-code-block">soup.a.string:Elsie</div><div class="ql-code-block"><br></div><div class="ql-code-block">type(soup.a.string):<span class="ql-token hljs-tag"><class 'bs4.element.Comment'></span></div><div class="ql-code-block"><br></div><div class="ql-code-block">soup.prettify()=><span class="ql-token hljs-tag"><html></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><head></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><title></span></div><div class="ql-code-block"> The Dormouse's 
story</div><div class="ql-code-block"> <span class="ql-token hljs-tag"></title></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"></head></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><body></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><p class="title"></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><b></span></div><div class="ql-code-block"> The Dormouse's story</div><div class="ql-code-block"> <span class="ql-token hljs-tag"></b></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"></p></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><p class="story"></span></div><div class="ql-code-block"> Once upon a time there were three little sisters; and their names were</div><div class="ql-code-block"> <span class="ql-token hljs-tag"><a class="sister" href="http://example.com/elsie" id="link1"></span></div><div class="ql-code-block"> <span class="ql-token hljs-comment"><!--Elsie--></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"></a></span></div><div class="ql-code-block"> ,</div><div class="ql-code-block"> <span class="ql-token hljs-tag"><a class="sister" href="http://example.com/lacie" id="link2"></span></div><div class="ql-code-block"> Lacie</div><div class="ql-code-block"> <span class="ql-token hljs-tag"></a></span></div><div class="ql-code-block"> and</div><div class="ql-code-block"> <span class="ql-token hljs-tag"><a class="sister" href="http://example.com/tillie" id="link3"></span></div><div class="ql-code-block"> Tillie</div><div class="ql-code-block"> <span class="ql-token hljs-tag"></a></span></div><div class="ql-code-block"> ;</div><div class="ql-code-block">and they lived at the bottom of a well.</div><div class="ql-code-block"> <span class="ql-token hljs-tag"></p></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"><p class="story"></span></div><div class="ql-code-block"> 
...</div><div class="ql-code-block"> <span class="ql-token hljs-tag"></p></span></div><div class="ql-code-block"> <span class="ql-token hljs-tag"></body></span></div><div class="ql-code-block"><span class="ql-token hljs-tag"></html></span></div></div><p><br></p><p><img src="https://article-images.zsxq.com/FtPX-qsEEgZYHos3AnyDni1jH6rn"></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-keyword">from</span> bs4 <span class="ql-token hljs-keyword">import</span> BeautifulSoup</div><div class="ql-code-block">html = <span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><html><head><title>The Dormouse's story</title></head></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="title"><b>The Dormouse's story</b></p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">Once upon a time there were three little sisters; and their names were</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</span></div><div class="ql-code-block"><span class="ql-token hljs-string">and they lived at the bottom of a well.</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">...</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></html></span></div><div class="ql-code-block"><span class="ql-token 
hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-comment">#1、BeautifulSoup对象</span></div><div class="ql-code-block">soup = BeautifulSoup(html, <span class="ql-token hljs-string">'lxml'</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"type(soup):{type(soup)} \n"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#1、向下遍历</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.p.contents)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-built_in">list</span>(soup.p.children))</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-built_in">list</span>(soup.p.descendants))</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#2、向上遍历</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.p.parent.name,<span class="ql-token hljs-string">'\n'</span>)</div><div class="ql-code-block"><span class="ql-token hljs-keyword">for</span> i <span class="ql-token hljs-keyword">in</span> soup.p.parents:</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(i.name)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#3、平行遍历</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'a_next:'</span>,soup.a.next_sibling)</div><div class="ql-code-block"><span class="ql-token hljs-keyword">for</span> i <span class="ql-token hljs-keyword">in</span> soup.a.next_siblings:</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'a_nexts:'</span>,i)</div><div 
class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'a_previous:'</span>,soup.a.previous_sibling)</div><div class="ql-code-block"><span class="ql-token hljs-keyword">for</span> i <span class="ql-token hljs-keyword">in</span> soup.a.previous_siblings:</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'a_previouss:'</span>,i)</div></div><p><br></p><p><img src="https://article-images.zsxq.com/FuyJDzHROhQahkpBUh4jWRuaB-mo"></p><p><br></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-built_in">type</span>(soup):<<span class="ql-token hljs-keyword">class</span> <span class="ql-token hljs-string">'bs4.BeautifulSoup'</span>></div><div class="ql-code-block"><br></div><div class="ql-code-block">[<b>The Dormouse<span class="ql-token hljs-string">'s story</b>]</span></div><div class="ql-code-block"><span class="ql-token hljs-string">[<b>The Dormouse'</span>s story</b>]</div><div class="ql-code-block">[<b>The Dormouse<span class="ql-token hljs-string">'s story</b>, "The Dormouse'</span>s story<span class="ql-token hljs-string">"]</span></div><div class="ql-code-block"><span class="ql-token hljs-string">body</span></div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-string">body</span></div><div class="ql-code-block"><span class="ql-token hljs-string">html</span></div><div class="ql-code-block"><span class="ql-token hljs-string">[document]</span></div><div class="ql-code-block"><span class="ql-token hljs-string">a_next: ,</span></div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-string">a_nexts: ,</span></div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-string">a_nexts: <a class="</span>siste<span class="ql-token hljs-string">r" href="</span>http://example.com/lacie<span 
class="ql-token hljs-string">" id="</span>link2<span class="ql-token hljs-string">">Lacie</a></span></div><div class="ql-code-block"><span class="ql-token hljs-string">a_nexts: and</span></div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-string">a_nexts: <a class="</span>siste<span class="ql-token hljs-string">r" href="</span>http://example.com/tillie<span class="ql-token hljs-string">" id="</span>link3<span class="ql-token hljs-string">">Tillie</a></span></div><div class="ql-code-block"><span class="ql-token hljs-string">a_nexts: ;</span></div><div class="ql-code-block"><span class="ql-token hljs-string">and they lived at the bottom of a well.</span></div><div class="ql-code-block"><span class="ql-token hljs-string">a_previous: Once upon a time there were three little sisters; and their names were</span></div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-string">a_previouss: Once upon a time there were three little sisters; and their names were</span></div></div><p><img src="https://article-images.zsxq.com/FtqWdNWSM0b8quez92lJ9SqPTK76"></p><p><span style="background-color: rgb(240, 240, 240); color: rgb(92, 92, 92);">代码</span></p><p><br></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-keyword">from</span> bs4 <span class="ql-token hljs-keyword">import</span> BeautifulSoup</div><div class="ql-code-block">html = <span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><html><head><title>The Dormouse's story</title></head></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="title"><b>The Dormouse's story</b></p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">Once upon a time there were three little sisters; and their names 
were</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</span></div><div class="ql-code-block"><span class="ql-token hljs-string">and they lived at the bottom of a well.</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">...</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></html></span></div><div class="ql-code-block"><span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-comment">#1、BeautifulSoup对象</span></div><div class="ql-code-block">soup = BeautifulSoup(html, <span class="ql-token hljs-string">'lxml'</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"type(soup):{type(soup)} \n"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#1、find_all( )</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.find_all(<span class="ql-token hljs-string">'a'</span>)) <span class="ql-token hljs-comment">#检索标签名</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.find_all(<span class="ql-token hljs-string">'a'</span>,<span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">'link1'</span>)) <span class="ql-token hljs-comment">#检索属性值</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.find_all(<span 
class="ql-token hljs-string">'a'</span>,class_=<span class="ql-token hljs-string">'sister'</span>)) </div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.find_all(text=[<span class="ql-token hljs-string">'Elsie'</span>,<span class="ql-token hljs-string">'Lacie'</span>]))</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#2、find( )</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.find(<span class="ql-token hljs-string">'a'</span>))</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.find(<span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">'link2'</span>))</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-comment">#3 、向上检索</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.p.find_parent().name)</div><div class="ql-code-block"><span class="ql-token hljs-keyword">for</span> i <span class="ql-token hljs-keyword">in</span> soup.title.find_parents():</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(i.name)</div><div class="ql-code-block"> </div><div class="ql-code-block"><span class="ql-token hljs-comment">#4、平行检索</span></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.head.find_next_sibling().name)</div><div class="ql-code-block"><span class="ql-token hljs-keyword">for</span> i <span class="ql-token hljs-keyword">in</span> soup.head.find_next_siblings():</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(i.name)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(soup.title.find_previous_sibling())</div><div class="ql-code-block"><span class="ql-token hljs-keyword">for</span> i <span class="ql-token hljs-keyword">in</span> 
soup.title.find_previous_siblings():</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(i.name)</div></div><p><img src="https://article-images.zsxq.com/FgdDcWod8Suvbq5UuGYLvXz0UI8R"></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-built_in">type</span>(soup):<<span class="ql-token hljs-keyword">class</span> <span class="ql-token hljs-string">'bs4.BeautifulSoup'</span>></div><div class="ql-code-block"><br></div><div class="ql-code-block">[<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link2"</span>>Lacie</a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/tillie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link3"</span>>Tillie</a>]</div><div class="ql-code-block">[<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div><div class="ql-code-block">[<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token 
hljs-string">"link1"</span>><!--Elsie--></a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link2"</span>>Lacie</a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/tillie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link3"</span>>Tillie</a>]</div><div class="ql-code-block">F:\AwesomeRAG\tutorial\layout_analysis\html\tutorial\BeautifulSoup4\test3.py:<span class="ql-token hljs-number">24</span>: DeprecationWarning: The <span class="ql-token hljs-string">'text'</span> argument to find()-<span class="ql-token hljs-built_in">type</span> methods <span class="ql-token hljs-keyword">is</span> deprecated. Use <span class="ql-token hljs-string">'string'</span> instead.</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(soup.find_all(text=[<span class="ql-token hljs-string">'Elsie'</span>,<span class="ql-token hljs-string">'Lacie'</span>]))</div><div class="ql-code-block">[<span class="ql-token hljs-string">'Elsie'</span>, <span class="ql-token hljs-string">'Lacie'</span>]</div><div class="ql-code-block"><a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a></div><div class="ql-code-block"><a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token 
hljs-string">"link2"</span>>Lacie</a></div><div class="ql-code-block">body</div><div class="ql-code-block">head</div><div class="ql-code-block">html</div><div class="ql-code-block">[document]</div><div class="ql-code-block">body</div><div class="ql-code-block">body</div><div class="ql-code-block"><span class="ql-token hljs-literal">None</span></div></div><p><img src="https://article-images.zsxq.com/FiqTmlpR_fGE6pUZ8gCcdD9z1ao_"></p><div class="ql-code-block-container"><div class="ql-code-block">HTML heading: <h1> </h1></div><div class="ql-code-block">HTML paragraph: <p> </p></div><div class="ql-code-block">HTML link: <a href=<span class="ql-token hljs-string">'https://www.baidu.com/'</span>> this <span class="ql-token hljs-keyword">is</span> a link </a></div><div class="ql-code-block">HTML image: <img src=<span class="ql-token hljs-string">'Ai-code.jpg'</span> width=<span class="ql-token hljs-string">'104'</span> height=<span class="ql-token hljs-string">'144'</span> /></div><div class="ql-code-block">HTML table: <table> </table></div><div class="ql-code-block">HTML list: <ul> </ul></div><div class="ql-code-block">HTML block: <div> </div></div></div><p><img src="https://article-images.zsxq.com/FkTgptMBTLt2w7nUUKs13PNKkckn"></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-keyword">from</span> bs4 <span class="ql-token hljs-keyword">import</span> BeautifulSoup</div><div class="ql-code-block"><br></div><div class="ql-code-block">html = <span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><html><head><title>The Dormouse's story</title></head></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="title"><b>The Dormouse's story</b></p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">Once upon a time there were three little sisters; and their names 
were</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and</span></div><div class="ql-code-block"><span class="ql-token hljs-string"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;</span></div><div class="ql-code-block"><span class="ql-token hljs-string">and they lived at the bottom of a well.</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"><p class="story">...</p></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></body></span></div><div class="ql-code-block"><span class="ql-token hljs-string"></html></span></div><div class="ql-code-block"><span class="ql-token hljs-string">"""</span></div><div class="ql-code-block"><span class="ql-token hljs-comment">#1、BeautifulSoup对象</span></div><div class="ql-code-block">soup = BeautifulSoup(html, <span class="ql-token hljs-string">'lxml'</span>)</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">f"type(soup):{type(soup)} \n"</span>)</div><div class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'标签查找:'</span>,soup.select(<span class="ql-token hljs-string">'a'</span>))</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'属性查找:'</span>,soup.select(<span class="ql-token hljs-string">'a[id="link1"]'</span>))</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'类名查找:'</span>,soup.select(<span class="ql-token hljs-string">'.sister'</span>))</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span 
class="ql-token hljs-string">'id查找:'</span>,soup.select(<span class="ql-token hljs-string">'#link1'</span>))</div><div class="ql-code-block"><span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">'组合查找:'</span>,soup.select(<span class="ql-token hljs-string">'p #link1'</span>))</div></div><p><img src="https://article-images.zsxq.com/FjQkiig9fOl0Bd5qiCbyH4OddW50"></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-built_in">type</span>(soup):<<span class="ql-token hljs-keyword">class</span> <span class="ql-token hljs-string">'bs4.BeautifulSoup'</span>></div><div class="ql-code-block"><br></div><div class="ql-code-block">标签查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link2"</span>>Lacie</a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/tillie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link3"</span>>Tillie</a>]</div><div class="ql-code-block">属性查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div><div class="ql-code-block">类名查找: [<a <span class="ql-token hljs-keyword">class</span>=<span 
class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link2"</span>>Lacie</a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/tillie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link3"</span>>Tillie</a>]</div><div class="ql-code-block"><span class="ql-token hljs-built_in">id</span>查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div><div class="ql-code-block">组合查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div></div><p><img src="https://article-images.zsxq.com/FqZck0in441U4EYGi6KobKlS0emA"></p><div class="ql-code-block-container"><div class="ql-code-block"><span class="ql-token hljs-keyword">import</span> requests</div><div class="ql-code-block"><span class="ql-token hljs-keyword">from</span> bs4 <span class="ql-token hljs-keyword">import</span> BeautifulSoup</div><div class="ql-code-block"><span class="ql-token hljs-keyword">import</span> os</div><div 
class="ql-code-block"><br></div><div class="ql-code-block"><span class="ql-token hljs-keyword">def</span> <span class="ql-token hljs-title">getUrl</span>(<span class="ql-token hljs-params">url</span>):</div><div class="ql-code-block"> <span class="ql-token hljs-keyword">try</span>:</div><div class="ql-code-block"> read = requests.get(url) </div><div class="ql-code-block"> read.raise_for_status() </div><div class="ql-code-block"> read.encoding = read.apparent_encoding </div><div class="ql-code-block"> <span class="ql-token hljs-keyword">return</span> read.text </div><div class="ql-code-block"> <span class="ql-token hljs-keyword">except</span>:</div><div class="ql-code-block"> <span class="ql-token hljs-keyword">return</span> <span class="ql-token hljs-string">"连接失败!"</span></div><div class="ql-code-block"> </div><div class="ql-code-block"><span class="ql-token hljs-keyword">def</span> <span class="ql-token hljs-title">getPic</span>(<span class="ql-token hljs-params">html</span>):</div><div class="ql-code-block"> soup = BeautifulSoup(html, <span class="ql-token hljs-string">"html.parser"</span>)</div><div class="ql-code-block"> </div><div class="ql-code-block"> all_img = soup.find(<span class="ql-token hljs-string">'ul'</span>).find_all(<span class="ql-token hljs-string">'img'</span>) </div><div class="ql-code-block"> <span class="ql-token hljs-keyword">for</span> img <span class="ql-token hljs-keyword">in</span> all_img:</div><div class="ql-code-block"> src = img[<span class="ql-token hljs-string">'src'</span>] </div><div class="ql-code-block"> img_url = src</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(img_url)</div><div class="ql-code-block"> root = <span class="ql-token hljs-string">"F:/Pic/"</span> </div><div class="ql-code-block"> path = root + img_url.split(<span class="ql-token hljs-string">'/'</span>)[-<span class="ql-token hljs-number">1</span>] </div><div class="ql-code-block"> <span class="ql-token 
hljs-built_in">print</span>(path)</div><div class="ql-code-block"> <span class="ql-token hljs-keyword">try</span>:</div><div class="ql-code-block"> <span class="ql-token hljs-keyword">if</span> <span class="ql-token hljs-keyword">not</span> os.path.exists(root): </div><div class="ql-code-block"> os.mkdir(root)</div><div class="ql-code-block"> <span class="ql-token hljs-keyword">if</span> <span class="ql-token hljs-keyword">not</span> os.path.exists(path):</div><div class="ql-code-block"> read = requests.get(img_url)</div><div class="ql-code-block"> <span class="ql-token hljs-keyword">with</span> <span class="ql-token hljs-built_in">open</span>(path, <span class="ql-token hljs-string">"wb"</span>)<span class="ql-token hljs-keyword">as</span> f:</div><div class="ql-code-block"> f.write(read.content)</div><div class="ql-code-block"> f.close()</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">"文件保存成功!"</span>)</div><div class="ql-code-block"> <span class="ql-token hljs-keyword">else</span>:</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">"文件已存在!"</span>)</div><div class="ql-code-block"> <span class="ql-token hljs-keyword">except</span>:</div><div class="ql-code-block"> <span class="ql-token hljs-built_in">print</span>(<span class="ql-token hljs-string">"文件爬取失败!"</span>)</div><div class="ql-code-block"> </div><div class="ql-code-block"><span class="ql-token hljs-keyword">if</span> __name__ == <span class="ql-token hljs-string">'__main__'</span>:</div><div class="ql-code-block"> html_url=getUrl(<span class="ql-token hljs-string">"https://findicons.com/search/nature"</span>)</div><div class="ql-code-block"> getPic(html_url)</div></div><p><br></p><p><img src="https://article-images.zsxq.com/Fh_dDSbuteEI_0ArnWoZrCFDRuvm"></p><div class="ql-code-block-container"><div class="ql-code-block">标签查找: [<a <span class="ql-token 
hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link2"</span>>Lacie</a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/tillie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link3"</span>>Tillie</a>]</div><div class="ql-code-block">属性查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div><div class="ql-code-block">类名查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/lacie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link2"</span>>Lacie</a>, <a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/tillie"</span> <span class="ql-token 
hljs-built_in">id</span>=<span class="ql-token hljs-string">"link3"</span>>Tillie</a>]</div><div class="ql-code-block"><span class="ql-token hljs-built_in">id</span>查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div><div class="ql-code-block">组合查找: [<a <span class="ql-token hljs-keyword">class</span>=<span class="ql-token hljs-string">"sister"</span> href=<span class="ql-token hljs-string">"http://example.com/elsie"</span> <span class="ql-token hljs-built_in">id</span>=<span class="ql-token hljs-string">"link1"</span>><!--Elsie--></a>]</div></div><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p><p><br></p></div>
</div>
"""
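# Note: the HTML above does not match the results shown below, but that should not get in the way of understanding; substitute your own HTML when you run the code.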
# 1. BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
print(f"type(soup):{type(soup)}\n")
# 2. Tag object
print(f"soup.head:{soup.head} \n")
print(f"soup.head.name:{soup.head.name} \n")
print(f"soup.head.attrs:{soup.head.attrs} \n")
print(f"type(soup.head):{type(soup.head)} \n")
# 3. NavigableString object
print(f"soup.title.string:{soup.title.string} \n")
print(f"type(soup.title.string):{type(soup.title.string)} \n")
# 4. Comment object
print(f"soup.a.string:{soup.a.string} \n")
print(f"type(soup.a.string):{type(soup.a.string)} \n")
# 5. Pretty-print the soup object
print(f"soup.prettify()=>{soup.prettify()}")
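Because the HTML string above is not the document the calls were written for, the following is a minimal self-contained sketch of the same four object types. It assumes the sample document from "The Dormouse's story" and uses the built-in `html.parser` instead of `lxml` so that no extra parser is required; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# 1. BeautifulSoup object: the whole parsed document
soup = BeautifulSoup(html, "html.parser")

# 2. Tag object: an element such as <head>
head = soup.head

# 3. NavigableString object: the text inside a tag
title_text = soup.title.string

# 4. Comment object: the first <a> contains only an HTML comment,
#    so .string returns a Comment (a subclass of NavigableString)
first_link_text = soup.a.string

print(type(soup).__name__)                              # BeautifulSoup
print(type(head).__name__, head.name, head.attrs)       # Tag head {}
print(type(title_text).__name__, title_text)            # NavigableString The Dormouse's story
print(type(first_link_text).__name__, first_link_text)  # Comment Elsie
```

Checking `isinstance(node, Comment)` before using `.string` is one way to avoid treating comment text as visible page content.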
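2 Traversing the document tree
BeautifulSoup converts a document into a tree because a tree structure makes it easier to traverse the content and extract data from it.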
python
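from bs4 import BeautifulSoup

# The sample document is abbreviated here; the calls below assume the full
# sample document, which contains <p> and <a> tags.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p ...>...</p>
...
</body>
</html>
"""
# 1. BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
print(f"type(soup):{type(soup)}\n")
# 2. Traverse downward
print(soup.p.contents)           # direct children as a list
print(list(soup.p.children))     # direct children as an iterator
print(list(soup.p.descendants))  # all nested nodes
# 3. Traverse upward
print(soup.p.parent.name)        # immediate parent
for i in soup.p.parents:         # every ancestor up to the document root
    print(i.name)
# 4. Traverse sideways (siblings)
print('a_next:', soup.a.next_sibling)
for i in soup.a.next_siblings:
    print('a_nexts:', i)
print('a_previous:', soup.a.previous_sibling)
for i in soup.a.previous_siblings:
    print('a_previous:', i)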
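The three traversal directions can be sketched on a complete document as follows. This is an illustrative example that assumes the sample document from "The Dormouse's story" and the built-in `html.parser`; names such as `story` and `link2` are introduced here only for the demonstration.

```python
from bs4 import BeautifulSoup, Tag

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
story = soup.find("p", class_="story")  # first paragraph with class "story"
link2 = soup.find(id="link2")

# Downward: .children yields direct children (text nodes and tags alike),
# .descendants also walks into nested nodes.
child_tags = [c.name for c in story.children if isinstance(c, Tag)]
descendant_count = len(list(story.descendants))

# Upward: .parent is the immediate parent, .parents walks to the root.
ancestor_names = [p.name for p in soup.title.parents]

# Sideways: siblings include the text between tags, so filter for Tag
# when only elements are wanted.
next_ids = [s["id"] for s in link2.next_siblings if isinstance(s, Tag)]
prev_ids = [s["id"] for s in link2.previous_siblings if isinstance(s, Tag)]

print(child_tags)      # ['a', 'a', 'a']
print(ancestor_names)  # ['head', 'html', '[document]']
print(next_ids)        # ['link3']
print(prev_ids)        # ['link1']
```

Note that `next_sibling` (singular) returns one node, which is often a whitespace or punctuation string, while `next_siblings` (plural) is the iterator to loop over.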
4 Searching the document tree
Search methods:
python
from bs4 import BeautifulSoup

# The sample document is abbreviated here; the calls below assume the full
# sample document, which contains <p> and <a> tags.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p ...>...</p>
...
</body>
</html>
"""
# 1. BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
print(f"type(soup):{type(soup)}\n")
# 2. find(): return the first match
print(soup.find('a'))  # first <a> tag
print(soup.find(id='link2'))  # element whose id is link2
# 3. find_all(): return every match as a list
print(soup.find_all('a'))  # match by tag name
print(soup.find_all('a', id='link1'))  # match by attribute value
print(soup.find_all('a', class_='sister'))  # class is a Python keyword, so use class_
print(soup.find_all(string=['Elsie', 'Lacie']))  # match by text; string= replaces the deprecated text=
# 4. Search upward
print(soup.p.find_parent().name)
for i in soup.title.find_parents():
    print(i.name)
# 5. Search sideways (siblings)
print(soup.head.find_next_sibling().name)
for i in soup.head.find_next_siblings():
    print('head_next:', i)
print(soup.title.find_previous_sibling())
for i in soup.title.find_previous_siblings():
    print('title_previous:', i)
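The search methods can be tried end to end with a complete document. The sketch below assumes the sample document from "The Dormouse's story" and the built-in `html.parser`; it passes explicit tag names to the parent and sibling searches so that the results do not depend on how a parser treats whitespace between tags.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match (or None); find_all() returns a list.
first_link = soup.find("a")
link2 = soup.find(id="link2")
sisters = soup.find_all("a", class_="sister")
lacie_strings = soup.find_all(string="Lacie")

# Upward search: nearest enclosing <p> of the second link.
enclosing_p = link2.find_parent("p")

# Sideways search: nearest <a> siblings on either side.
next_link = link2.find_next_sibling("a")
prev_link = link2.find_previous_sibling("a")
following = [a["id"] for a in first_link.find_next_siblings("a")]

print(first_link["id"])      # link1
print(len(sisters))          # 3
print(lacie_strings)         # ['Lacie']
print(enclosing_p["class"])  # ['story']
print(next_link["id"], prev_link["id"])  # link3 link1
print(following)             # ['link2', 'link3']
```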
5 CSS selectors
Pass a selector string to the select() method of a Tag or BeautifulSoup object to find Tags with a CSS selector.
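As a short illustration of `select()`, the sketch below runs five common selector forms (tag, attribute, class, id, and a descendant combination) against the sample document from "The Dormouse's story". It assumes the built-in `html.parser`; the variable names are chosen for this example only.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("a")                # by tag name
by_attr = soup.select('a[id="link1"]')   # by attribute value
by_class = soup.select(".sister")        # by class name
by_id = soup.select("#link1")            # by id
combined = soup.select("p #link1")       # element with id link1 inside a <p>

print("tag:", len(by_tag))                        # tag: 3
print("attribute:", [a["id"] for a in by_attr])   # attribute: ['link1']
print("class:", [a["id"] for a in by_class])      # class: ['link1', 'link2', 'link3']
print("id:", [a["id"] for a in by_id])            # id: ['link1']
print("combined:", [a["id"] for a in combined])   # combined: ['link1']
```

`select()` always returns a list; `select_one()` returns only the first match or `None`.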
6 Example: scraping images
python
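import os

import requests
from bs4 import BeautifulSoup


def getUrl(url):
    """Download a page and return its HTML text, or None on failure."""
    try:
        read = requests.get(url)
        read.raise_for_status()                 # raise on an HTTP error status
        read.encoding = read.apparent_encoding  # use the encoding detected from the body
        return read.text
    except requests.RequestException:
        print("Connection failed!")
        return None


def getPic(html):
    """Save every image found in the first <ul> of the page under F:/Pic/."""
    soup = BeautifulSoup(html, "html.parser")
    all_img = soup.find('ul').find_all('img')
    for img in all_img:
        src = img['src']
        img_url = src
        print(img_url)
        root = "F:/Pic/"
        path = root + img_url.split('/')[-1]    # file name = last URL segment
        print(path)
        try:
            if not os.path.exists(root):
                os.mkdir(root)
            if not os.path.exists(path):
                read = requests.get(img_url)
                with open(path, "wb") as f:     # the with block closes the file
                    f.write(read.content)
                print("File saved!")
            else:
                print("File already exists!")
        except (requests.RequestException, OSError):
            print("Failed to download the file!")


if __name__ == '__main__':
    html = getUrl("https://findicons.com/search/nature")
    if html is not None:
        getPic(html)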
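The example above uses each `src` value directly as the download URL, which only works when the page provides absolute URLs. As a hedged sketch of one way to handle relative paths, the helpers below resolve each `src` against the page URL with the standard library's `urljoin` and derive a local file name from the URL path. The function names `collect_image_urls` and `local_filename`, the inline HTML, and the `example.com` URLs are hypothetical and are not part of the original example.

```python
import os
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def collect_image_urls(html, base_url):
    """Return absolute URLs for every <img> that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.find_all("img"):
        src = img.get("src")  # .get avoids a KeyError when src is missing
        if not src:
            continue
        urls.append(urljoin(base_url, src))  # absolute URLs pass through unchanged
    return urls


def local_filename(img_url):
    """Derive a file name from the URL path, ignoring any query string."""
    return os.path.basename(urlparse(img_url).path)


# Hypothetical page fragment used only for this demonstration.
sample = (
    '<ul>'
    '<li><img src="/icons/leaf.png"></li>'
    '<li><img src="https://cdn.example.com/a/tree.png?size=64"></li>'
    '<li><img alt="no source"></li>'
    '</ul>'
)

urls = collect_image_urls(sample, "https://example.com/search/nature")
names = [local_filename(u) for u in urls]
print(urls)
print(names)  # ['leaf.png', 'tree.png']
```

Stripping the query string before building the file name avoids characters such as `?` that are not valid in file names on some systems.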