Computation for the Social Sciences

Introduction to Computation for the Social Sciences Assignment 7

Prof. Dr. Karsten Donnay, Stefan Scholz Winter Term 2018 / 2019

Please solve the exercises below and submit your solutions to our GitHub Classroom by December 18th, midnight. Submit all your code in one executable file (py / ipynb) and your list in one delimited text file (csv).

This assignment sheet is graded according to the following scheme: 3 points (fully complete and correct), 2 points (mostly), 1 point (partly) and 0 points (hardly). You will get individual feedback in your repository.

Exercise 1: Simple Web Crawler

Web crawling, i.e. the automated access of web resources via software, and web scraping, i.e. the extraction of content from web resources, are essential techniques for gathering data for many analysis tasks in the social sciences. We will implement a very simple web crawler that starts at a given URL and iteratively follows all links, more specifically all href tags in the HTML source code of a website.

To get an idea of how to approach this problem using Python, have a look at the sample implementation of a crawler in the subchapters Traversing a Single Domain and Crawling an Entire Site of the book Web Scraping with Python by Ryan Mitchell. If you search for the book online, you will find a PDF version of it.
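
The core building block in those subchapters is a routine that downloads a single page and extracts its links. A minimal sketch of that pattern is shown below, assuming the third-party packages requests and beautifulsoup4 are installed; the function name internal_links and the /wiki/ URL filter are illustrative choices, not part of the assignment.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def internal_links(url):
    """Download the page at `url` and return the set of links that stay on de.wikipedia.org."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):        # every tag with an href attribute
        full_url = urljoin(url, a["href"])          # resolve relative links
        if full_url.startswith("https://de.wikipedia.org/wiki/"):
            links.add(full_url.split("#")[0])       # drop fragment anchors
    return links
```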

Implement an iterative crawler (no recursive calls) that:

  • opens the front page of the German Wikipedia[1] and downloads the HTML resource,

  • parses the HTML resource into a BeautifulSoup object,

  • collects all internal href tags, i.e. only links to other Wikipedia pages on the server de.wikipedia.org,

  • opens each of the collected links and parses the returned HTML resource for additional internal links.

Additionally, your crawler has to meet the following requirements (a sketch of one possible crawl loop follows after this list):

  • Consider the Wikipedia front page as the root (level 0) of a tree and the Wikipedia pages linked on the front page as level 1. Crawl no deeper than level 2!

  • Visit each unique link only once, i.e. disregard links that you have traversed before.

  • Store all links of the same level in a list, i.e. one list per level.

  • Include links that you encounter more than once only in the list of the lowest level (closest to the root) at which they appear.

  • Write the lists of links to a CSV file in the format [level ; link_URL].
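
The following sketch shows one possible way to combine these requirements into an iterative, level-limited crawl loop. It reuses the hypothetical internal_links() helper from the sketch above (both would sit in the same file); the output file name, the delay, and all variable names are assumptions, not requirements.

```python
import csv
import time

# internal_links() is the hypothetical helper sketched earlier on this sheet.
ROOT_URL = "https://de.wikipedia.org/wiki/Wikipedia:Hauptseite"
MAX_LEVEL = 2

visited = {ROOT_URL}
levels = {0: [ROOT_URL]}                 # one list of links per level
current_level = [ROOT_URL]

for level in range(1, MAX_LEVEL + 1):
    next_level = []
    for url in current_level:            # open every page of the previous level ...
        for link in internal_links(url):
            if link not in visited:      # ... and keep each unique link only once,
                visited.add(link)        # at the lowest level where it appears
                next_level.append(link)
        time.sleep(0.5)                  # small delay to be polite to the server
    levels[level] = next_level
    current_level = next_level           # level-2 links are collected but never opened

# Write the result as semicolon-delimited [level ; link_URL] rows.
with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";")
    for level, links in levels.items():
        for link in links:
            writer.writerow([level, link])
```

Because the loop processes one complete level before moving on to the next, a link that occurs on several levels is recorded the first time it is seen, i.e. at the level closest to the root.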

Finally, submit your solutions by adding your code and CSV file to your Git repository within the folder assignment07 > solution. Do not forget to commit and push your solutions afterwards.

[1] https://de.wikipedia.org/wiki/Wikipedia:Hauptseite
