
Introduction to Computation for the Social Sciences Assignment 7

Prof. Dr. Karsten Donnay, Stefan Scholz Winter Term 2018 / 2019

Please solve the exercises below and submit your solutions to our GitHub Classroom by Dec 18th, midnight. Submit all your code in one executable file (py / ipynb) and your list in one delimited text file (csv).

This assignment sheet is graded according to the following scheme: 3 points (fully complete and correct), 2 points (mostly), 1 point (partly) and 0 points (hardly). You will get individual feedback in your repository.

Exercise 1: Simple Web Crawler

Web crawling, i.e. the automated access of Web resources via software, and Web scraping, i.e. the extraction of content from Web resources, are essential techniques for gathering data for many analysis tasks in the social sciences. We will implement a very simple Web crawler that starts at a given URL and iteratively follows all links, more specifically all href attributes in the HTML source code of a Website.
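For orientation, extracting the internal links from a single page could look roughly like the following sketch. It assumes urllib and BeautifulSoup and a simple /wiki/ prefix filter; these are illustrative choices, not a prescribed solution.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the front page and parse it as a BeautifulSoup object
html = urlopen("https://de.wikipedia.org/wiki/Wikipedia:Hauptseite")
soup = BeautifulSoup(html, "html.parser")

# Collect all href attributes that point to other pages on de.wikipedia.org
internal_links = set()
for a in soup.find_all("a", href=True):
    href = a["href"]
    if href.startswith("/wiki/"):  # relative links stay on the same server
        internal_links.add("https://de.wikipedia.org" + href)

print(len(internal_links), "internal links found")
```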

To get an idea of how to approach this problem using Python, have a look at the sample implementation of a crawler in the subchapters Traversing a Single Domain and Crawling an Entire Site in the book Web Scraping with Python by Ryan Mitchell. If you search for the book online, you will find a PDF version of it.

Implement an iterative crawler (no recursive calls) that:

  • opens the front page of the German Wikipedia[1] and downloads the HTML resource

  • parses the HTML file as a BeautifulSoup object

  • collects all internal href tags, i.e. only links to other Wikipedia pages on the server de.wikipedia.org

  • opens each of the collected links and parses the returned HTML resource for additional internal links

Additionally, your crawler has to meet the following requirements (a rough sketch of one possible structure follows after this list):

  • Consider the Wikipedia front page as the root (level 0) of a tree and the Wikipedia pages linked to on the front page as level 1. Crawl no deeper than level 2!

  • Visit each unique link only once, i.e. disregard links that you have already traversed.

  • Store all links of the same level in a list, i.e. one list per level.

  • Include links that you encounter more than once in the list of the lowest level (closer to the root) they appear at.

  • Write the lists of links to a CSV file in the format [level ; link_URL].
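One way to organize the level-by-level traversal and the CSV export is sketched below, assuming urllib, BeautifulSoup and the csv module. The helper name internal_links, the depth constant MAX_LEVEL and the output file name links.csv are illustrative choices, not part of the required solution.

```python
import csv
from urllib.request import urlopen
from urllib.error import URLError
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "https://de.wikipedia.org"
ROOT = BASE + "/wiki/Wikipedia:Hauptseite"
MAX_LEVEL = 2  # crawl no deeper than level 2

def internal_links(url):
    """Return the set of links on `url` that stay on de.wikipedia.org."""
    try:
        soup = BeautifulSoup(urlopen(url), "html.parser")
    except URLError:
        return set()  # skip pages that cannot be retrieved
    links = set()
    for a in soup.find_all("a", href=True):
        href = urljoin(BASE, a["href"])
        if href.startswith(BASE + "/wiki/"):
            links.add(href)
    return links

visited = {ROOT}
levels = [[ROOT]]  # one list of links per level; level 0 is the front page

# Iterative traversal: the links found on level n-1 pages form level n
for level in range(1, MAX_LEVEL + 1):
    current = []
    for url in levels[level - 1]:
        for link in internal_links(url):
            if link not in visited:  # each unique link only once
                visited.add(link)
                current.append(link)
    levels.append(current)

# Write one row per link in the format [level ; link_URL]
with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";")
    for level, links in enumerate(levels):
        for link in links:
            writer.writerow([level, link])
```

Because each link is added to visited the first time it is seen, a URL that also appears on deeper pages is only listed at the level closest to the root, as required.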

Finally, submit your solutions by adding your code and CSV file to your Git repository within the folder assignment07 > solution. Do not forget to commit and push your solutions afterwards.

[1] https://de.wikipedia.org/wiki/Wikipedia:Hauptseite
