pythonComputation for the Social Sciences

Introduction to Computation for the Social Sciences Assignment 7

Prof. Dr. Karsten Donnay, Stefan Scholz Winter Term 2018 / 2019

Please solve the exercises below and submit your solutions to our GitHub Classroom until Dec, 18th midnight. Submit all your code in one executable file (py

/ ipynb ) and your list in one delimited text file (csv).

This assignment sheet is graded according to the following scheme: 3 points (fully complete and correct), 2 points (mostly ), 1 point (partly ) and 0 points (hardly). You will get individual feedback in your repository.

Exercise 1: Simple Web Crawler

Web Crawling, i.e. the automated access of Web resources via software, and Web scraping, i.e. the extraction of content from Web resources, are essential techniques to gather data for many analysis tasks in the social sciences. We will implement a very simple Web crawler that starts at a given URL and iteratively accesses all links, more specifically all href-tags in the HTML source code of a Website.

To get an idea how to approach this problem using Python, have a look at the sample implementation of a crawler in the subchapters Traversing a Single Domain and Crawling an Entire Site in the book Web Scraping with Python by Ryan Mitchell. If you search for the book online, you will find a PDF version of it.

Implement an iterative crawler (no recursive calls) that:

  • opens the front page of the German Wikipedia[1] and downloads the html resource

  • parses the html file as a BeautifulSoup object.

  • collects all internal href tags, i.e. only links to other Wikipedia pages on the server de.wikipedia.org

  • opens each of the collected links and parses the returned html resource for additional internal links

    Additionally, your crawler has to meet the following requirements:

  • Consider the Wikipedia front page as the root (level 0) of a tree and the Wikipedia pages linked to on the front page as level 1. Crawl no deeper than level 2!

  • Visit each unique link only once, i.e. disregard links that you traversed before

  • Store all links of the same level in a list, i.e. one list per level.

  • Include links that you encounter more than once in the list of the lowest level (closer to the root) they appear at.

  • Write the lists of links to a CSV file in the format [level ; link_URL].

Finally, submit your solutions by adding your code and csv file in your Git repository within the folder assignment07 > solution. Do not forget to commit and push your solutions afterwards.

1\] https://de.wikipedia.org/wiki/Wikipedia:Hauptseite

相关推荐
pele7 分钟前
PHP源码运行受主板供电影响吗_供电相数重要性说明【技巧】
jvm·数据库·python
sinat_383437367 分钟前
CSS如何实现元素悬浮在页面底部_利用fixed定位与底部间距
jvm·数据库·python
gmaajt38 分钟前
mysql如何备份与恢复函数定义_mysql mysqldump导出存储对象
jvm·数据库·python
十五年专注C++开发44 分钟前
WaitingSpinnerWidget: 一个高度可配置的自定义Qt等待加载动画组件
开发语言·c++·qt·waitingspinner
qeen871 小时前
【数据结构】树的基本概念及存储
c语言·数据结构·c++·学习·
qq_460978401 小时前
Python爬虫怎么模拟手机端抓取_设置手机型号User-Agent字符串
jvm·数据库·python
love530love1 小时前
Clink 调校指南:让 Windows CMD 拥有现代终端的便捷体验
人工智能·windows·python·cmd·clink
王老师青少年编程1 小时前
csp信奥赛C++高频考点专项训练之贪心算法 --【区间贪心】:种树
c++·算法·贪心·csp·信奥赛·区间贪心·种树
hi_ro_a1 小时前
C++ 哈希表封装 unordered_map /unordered_set
数据结构·c++·算法·哈希算法