Tutorial: Generating a sitemap.xml for Your Website

To generate a sitemap.xml file, you need a crawler that walks every valid link on the site. Below is a complete solution:


Step 1: Install the required Python libraries

    pip install requests beautifulsoup4 lxml


Step 2: Create the Python crawler script (sitemap_generator.py)

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse
    import xml.etree.ElementTree as ET
    from datetime import datetime

    def get_all_links(base_url):
        # Track visited links, the crawl queue, and all collected URLs
        visited = set()
        queue = [base_url]
        all_links = set()

        while queue:
            url = queue.pop(0)
            if url in visited:
                continue

            try:
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
                }
                response = requests.get(url, headers=headers, timeout=10)
                if response.status_code != 200:
                    continue

                # Mark the page as visited and record it
                visited.add(url)
                all_links.add(url)
                print(f"Crawled: {url}")

                # Parse the HTML and collect new links
                soup = BeautifulSoup(response.text, 'lxml')
                for link in soup.find_all('a', href=True):
                    href = link['href'].strip()
                    full_url = urljoin(url, href)

                    # Filter out invalid links
                    parsed = urlparse(full_url)
                    if parsed.scheme not in ('http', 'https'):
                        continue
                    if not parsed.netloc.endswith('91kaiye.cn'):  # same-site links only
                        continue
                    if '#' in full_url:  # strip anchors
                        full_url = full_url.split('#')[0]

                    # Queue the link if it has not been visited yet
                    if full_url not in visited:
                        queue.append(full_url)

            except Exception as e:
                print(f"Error crawling {url}: {str(e)}")

        return all_links

    def create_sitemap(links, filename='sitemap.xml'):
        root = ET.Element('urlset', xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')
        for link in sorted(links):
            url_elem = ET.SubElement(root, 'url')
            ET.SubElement(url_elem, 'loc').text = link
            ET.SubElement(url_elem, 'lastmod').text = datetime.now().strftime('%Y-%m-%d')
            ET.SubElement(url_elem, 'changefreq').text = 'daily'
            ET.SubElement(url_elem, 'priority').text = '0.8'
        tree = ET.ElementTree(root)
        tree.write(filename, encoding='utf-8', xml_declaration=True)
        print(f"\nSitemap generated: {filename} with {len(links)} URLs")

    if __name__ == '__main__':
        base_url = 'https://www.91kaiye.cn/'
        print("Starting crawl...")
        links = get_all_links(base_url)
        create_sitemap(links)


Step 3: Run the script

    python sitemap_generator.py


How the script works:

  1. Crawler logic

    • Performs a breadth-first crawl starting from the homepage https://www.91kaiye.cn/

    • Automatically filters out off-site links, anchors, and invalid URLs

    • Records a last-modified date for each page (defaults to the current date)

    • Sets the change frequency to daily and the priority to 0.8

  2. Output file

    • The generated sitemap.xml has the following format (a quick sanity check follows the sample):

        <?xml version='1.0' encoding='utf-8'?>
        <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
          <url>
            <loc>https://www.91kaiye.cn/page1</loc>
            <lastmod>2023-10-05</lastmod>
            <changefreq>daily</changefreq>
            <priority>0.8</priority>
          </url>
          ...
        </urlset>
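
As an optional sanity check (not part of the original script), the generated file can be re-parsed with the same xml.etree.ElementTree module to confirm it is well-formed and to count its entries:

    import xml.etree.ElementTree as ET

    # The namespace URI matches the urlset declaration written by create_sitemap()
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    urls = ET.parse('sitemap.xml').getroot().findall('sm:url', ns)
    print(f"{len(urls)} URLs in sitemap")
    for url in urls[:5]:                      # show the first few entries
        print(url.find('sm:loc', ns).text)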

Caveats:

  1. Anti-crawling measures

    • If the site has anti-crawling protections, you may need to (see the sketch after this list):
      • Add a time.sleep(1) delay between requests

      • Route requests through proxy IPs

      • Send more realistic request headers

  2. Dynamic content

    • For pages rendered with JavaScript (e.g. Vue/React), switch to Selenium or Playwright instead (a minimal Playwright example follows below).
  3. Optimization tips
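
As a minimal sketch of the anti-crawling measures above, the requests.get call in get_all_links could be replaced with a helper like the one below. The name polite_get and the proxy address are placeholders for illustration, not part of the original script:

    import time
    import random
    import requests

    def polite_get(url, session=None):
        # Hypothetical helper: throttled, browser-like GET request
        session = session or requests.Session()
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept-Language': 'zh-CN,zh;q=0.9',
        }
        # Placeholder proxy; supply a real one or drop the argument entirely
        # proxies = {'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'}
        time.sleep(1 + random.random())  # 1-2 second pause between requests
        return session.get(url, headers=headers, timeout=10)  # pass proxies=proxies if needed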

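For the JavaScript-rendered case, here is a minimal sketch using Playwright's synchronous API (assuming `pip install playwright` and `playwright install chromium` have been run); the returned HTML can be handed to BeautifulSoup in place of response.text:

    from playwright.sync_api import sync_playwright

    def get_rendered_html(url):
        # Hypothetical helper: fetch the fully rendered HTML of a JS-driven page
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url, wait_until='networkidle')  # wait for JS requests to settle
            html = page.content()                     # rendered DOM as HTML
            browser.close()
        return html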

Alternative: use an online tool

If you would rather not run any code, an online service can generate the sitemap for you:

  1. XML-Sitemaps.com

  2. Screaming Frog SEO Spider (desktop tool)


Once generated, upload sitemap.xml to the site's root directory and submit it via the Baidu/Google webmaster tools.
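
Optionally, the sitemap can also be advertised through the robots.txt file in the same root directory, which most crawlers check automatically; a single directive like the one below (using this site's URL as an example) is sufficient:

    Sitemap: https://www.91kaiye.cn/sitemap.xml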
