I. Requirements
1. Crawl the news on the news homepage, parsing out the title, content, news category, and image, and store them in a database.
2. Only crawl news items that carry an image; a single image is enough.
II. Code
Below is example code for crawling Xinhuanet (新华网).
```python
import requests as rq
from bs4 import BeautifulSoup
import re
import os
from gbase import GBASE_DB
from conf import IMGPATH, LOCALPATH, PICSIZE

# Map Xinhuanet channel paths to category IDs
xinhua_dict = {
    'politics': 1,  # current politics
    'culture': 2,   # culture
    'health': 3,    # health
    'fortune': 4,   # finance
    'world': 5,     # world
}

def classify_news(s, news_list):
    """Return the first channel keyword found in s, or None."""
    for li in news_list:
        if li in s:
            return li
    return None

def get_xinhua_news(url):
    """Scrape the title, content, category, and image of a Xinhuanet article."""
    newsWeb = rq.get(url)
    newsWeb.encoding = 'utf-8'
    soup = BeautifulSoup(newsWeb.text, 'html.parser')
    # Extract the title
    title_element = soup.find('span', class_='title')
    title = title_element.get_text(strip=True)
    # Determine the category from the channel segment of the URL
    news_type = xinhua_dict[classify_news(url, xinhua_dict.keys())]
    # Extract the body text
    content_element = soup.find('div', id='detail')
    paragraphs = content_element.find_all('p')
    content = '\n'.join(paragraph.get_text(strip=True) for paragraph in paragraphs)
    content = re.sub('\n+', '\n', content).replace('"', '\\"').replace('\n', '\\n')
    # Extract the first sufficiently large image
    jpg_elements = soup.find_all('img')
    jpg_pattern = re.compile(r'src="([^"]*1n\.(jpg|jpeg))"')
    j_list = jpg_pattern.findall(str(jpg_elements))
    return_path = None
    for j in j_list:
        jpg_path = os.path.basename(j[0])
        jpg_url = url[:url.find('c_')] + jpg_path
        picture = rq.get(jpg_url)
        if picture.status_code == 200 and len(picture.content) > PICSIZE:
            with open(LOCALPATH + jpg_path, 'wb') as f:
                f.write(picture.content)
            return_path = IMGPATH + jpg_path
            break
    if return_path is None:
        # No qualifying image: raise so that main() skips this article
        raise ValueError('no image larger than PICSIZE found')
    return title, news_type, content, return_path

def main():
    db = GBASE_DB()
    newsUrl = 'http://www.xinhuanet.com/'
    newsWeb = rq.get(newsUrl)
    newsWeb.encoding = 'utf-8'
    soup = BeautifulSoup(newsWeb.text, 'lxml')
    # Collect the list of article URLs from the homepage
    link_list = []
    li_elements = soup.find_all('li')
    for li_element in li_elements:
        a_element = li_element.find('a')
        if a_element:
            url = a_element.get('href')
            if (url and url.startswith('http://www.news.cn/')
                    and url.endswith('.htm') and 'c_' in url
                    and classify_news(url, xinhua_dict.keys()) is not None):
                link_list.append(url)
    # Parse each article URL in turn
    for link in link_list:
        try:
            title, news_type, content, jpg_path = get_xinhua_news(link)
            sql = '''insert into table_name(title,type,content,image)
                     values ('{}',{},'{}','{}')
                  '''.format(title, news_type, content, jpg_path)
            db.execute_sql(sql)
            print('(Success) Crawled Xinhuanet:', title)
        except Exception as e:
            print('Crawl failed:', link, ':', e)
            continue

if __name__ == '__main__':
    main()
```
First, the script crawls the Xinhuanet homepage, collects every news link on the page, and stores them in the link_list list.
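To see the link filter in isolation, here is a minimal, self-contained sketch of the same predicate applied to a few made-up URLs (the sample URLs are hypothetical and only imitate Xinhuanet's path layout):

```python
# Minimal sketch of the homepage link filter; the candidate URLs are hypothetical.
xinhua_dict = {'politics': 1, 'culture': 2, 'health': 3, 'fortune': 4, 'world': 5}

def classify_news(s, news_list):
    for li in news_list:
        if li in s:
            return li
    return None

candidates = [
    'http://www.news.cn/politics/2023-01/01/c_1129248563.htm',  # kept: known channel + c_ id
    'http://www.news.cn/video/2023-01/01/c_1129248564.htm',     # dropped: channel not in dict
    'http://www.xinhuanet.com/politics/index.htm',              # dropped: wrong host, no c_
]
link_list = [u for u in candidates
             if u.startswith('http://www.news.cn/') and u.endswith('.htm')
             and 'c_' in u and classify_news(u, xinhua_dict.keys()) is not None]
print(link_list)  # only the first URL survives
```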
Next, it visits each news link in turn and parses out the title and content, cleaning up whitespace and special characters along the way. The article is categorized by the channel segment of its URL path, and it downloads an image whose file size exceeds the PICSIZE threshold (this filters out small images such as QR codes on the page), saving it to a local folder on the server. If no qualifying image is found, get_xinhua_news raises an exception, which is caught in main so that this news link is skipped.
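The cleaning itself is just the chained transforms from get_xinhua_news; a standalone snippet with made-up text shows their effect:

```python
import re

# Hypothetical article text with blank lines and a double quote
raw = 'Paragraph one.\n\n\nHe said "hello".\nParagraph two.'
# Collapse repeated newlines, escape double quotes, then encode newlines literally
cleaned = re.sub('\n+', '\n', raw).replace('"', '\\"').replace('\n', '\\n')
print(cleaned)
# Paragraph one.\nHe said \"hello\".\nParagraph two.
```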
Finally, the results are stored in the database.
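One caveat: building the INSERT statement with str.format means a stray single quote in a title can still break the statement, even after the escaping above. If the database driver supports parameter binding (GBASE_DB is a custom wrapper, so this is an assumption), passing the values as parameters is safer. The sketch below uses the standard sqlite3 module purely as a stand-in to show the pattern:

```python
import sqlite3

# Stand-in demo with sqlite3; GBASE_DB's own binding API, if any, will differ.
conn = sqlite3.connect(':memory:')
conn.execute('create table table_name(title text, type int, content text, image text)')
row = ("Title with 'quotes'", 1, 'Body text', '/img/demo1n.jpg')  # hypothetical values
conn.execute('insert into table_name(title,type,content,image) values (?,?,?,?)', row)
print(conn.execute('select title from table_name').fetchone()[0])
```

With placeholders, quoting and escaping are handled by the driver, so the manual replace calls on the content would no longer be needed for the database's sake.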