pythonComputation for the Social Sciences

Introduction to Computation for the Social Sciences Assignment 7

Prof. Dr. Karsten Donnay, Stefan Scholz Winter Term 2018 / 2019

Please solve the exercises below and submit your solutions to our GitHub Classroom until Dec, 18th midnight. Submit all your code in one executable file (py

/ ipynb ) and your list in one delimited text file (csv).

This assignment sheet is graded according to the following scheme: 3 points (fully complete and correct), 2 points (mostly ), 1 point (partly ) and 0 points (hardly). You will get individual feedback in your repository.

Exercise 1: Simple Web Crawler

Web Crawling, i.e. the automated access of Web resources via software, and Web scraping, i.e. the extraction of content from Web resources, are essential techniques to gather data for many analysis tasks in the social sciences. We will implement a very simple Web crawler that starts at a given URL and iteratively accesses all links, more specifically all href-tags in the HTML source code of a Website.

To get an idea how to approach this problem using Python, have a look at the sample implementation of a crawler in the subchapters Traversing a Single Domain and Crawling an Entire Site in the book Web Scraping with Python by Ryan Mitchell. If you search for the book online, you will find a PDF version of it.

Implement an iterative crawler (no recursive calls) that:

  • opens the front page of the German Wikipedia[1] and downloads the html resource

  • parses the html file as a BeautifulSoup object.

  • collects all internal href tags, i.e. only links to other Wikipedia pages on the server de.wikipedia.org

  • opens each of the collected links and parses the returned html resource for additional internal links

    Additionally, your crawler has to meet the following requirements:

  • Consider the Wikipedia front page as the root (level 0) of a tree and the Wikipedia pages linked to on the front page as level 1. Crawl no deeper than level 2!

  • Visit each unique link only once, i.e. disregard links that you traversed before

  • Store all links of the same level in a list, i.e. one list per level.

  • Include links that you encounter more than once in the list of the lowest level (closer to the root) they appear at.

  • Write the lists of links to a CSV file in the format [level ; link_URL].

Finally, submit your solutions by adding your code and csv file in your Git repository within the folder assignment07 > solution. Do not forget to commit and push your solutions afterwards.

1\] https://de.wikipedia.org/wiki/Wikipedia:Hauptseite

相关推荐
知行合一。。。1 天前
Python--04--数据容器(总结)
开发语言·python
架构师老Y1 天前
008、容器化部署:Docker与Python应用打包
python·容器·架构
澈2071 天前
深入浅出C++滑动窗口算法:原理、实现与实战应用详解
数据结构·c++·算法
A.A呐1 天前
【C++第二十九章】IO流
开发语言·c++
lifewange1 天前
pytest-类中测试方法、多文件批量执行
开发语言·python·pytest
ambition202421 天前
从暴力搜索到理论最优:一道任务调度问题的完整算法演进历程
c语言·数据结构·c++·算法·贪心算法·深度优先
pluvium271 天前
记对 xonsh shell 的使用, 脚本编写, 迁移及调优
linux·python·shell·xonsh
2401_827499991 天前
python项目实战09-AI智能伴侣(ai_partner_5-6)
开发语言·python
kebeiovo1 天前
atomic原子操作实现无锁队列
服务器·c++
PD我是你的真爱粉1 天前
MCP 协议详解:从架构、工作流到 Python 技术栈落地
开发语言·python·架构