Computation for the Social Sciences

Introduction to Computation for the Social Sciences Assignment 7

Prof. Dr. Karsten Donnay, Stefan Scholz Winter Term 2018 / 2019

Please solve the exercises below and submit your solutions to our GitHub Classroom by midnight on December 18th. Submit all your code in one executable file (py / ipynb) and your list in one delimited text file (csv).

This assignment sheet is graded according to the following scheme: 3 points (fully complete and correct), 2 points (mostly), 1 point (partly) and 0 points (hardly). You will get individual feedback in your repository.

Exercise 1: Simple Web Crawler

Web crawling, i.e. the automated access of Web resources via software, and Web scraping, i.e. the extraction of content from Web resources, are essential techniques for gathering data for many analysis tasks in the social sciences. We will implement a very simple Web crawler that starts at a given URL and iteratively follows all links, more specifically all href attributes of the anchor tags in the HTML source code of a Web page.

To get an idea of how to approach this problem using Python, have a look at the sample implementation of a crawler in the subchapters Traversing a Single Domain and Crawling an Entire Site in the book Web Scraping with Python by Ryan Mitchell. If you search for the book online, you will find a PDF version of it.

Implement an iterative crawler (no recursive calls) that:

  • opens the front page of the German Wikipedia[1] and downloads the HTML resource.

  • parses the HTML file as a BeautifulSoup object.

  • collects all internal links (href attributes), i.e. only links to other Wikipedia pages on the server de.wikipedia.org.

  • opens each of the collected links and parses the returned HTML resource for additional internal links.
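The first three steps above could be sketched as follows. This is only a minimal illustration, not a reference solution: the function names fetch_html and internal_links, the User-Agent string, and the choice to drop URL fragments (the part after #) are assumptions made for this example, and "internal" is interpreted here simply as "the host is de.wikipedia.org".

```python
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

START_URL = "https://de.wikipedia.org/wiki/Wikipedia:Hauptseite"
HOST = "de.wikipedia.org"


def fetch_html(url):
    """Download a page and return its HTML source as text."""
    # Assumption: a descriptive User-Agent; the value is made up for this sketch.
    request = Request(url, headers={"User-Agent": "assignment07-crawler"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def internal_links(html, base_url):
    """Return the unique internal links of a page in order of appearance."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchor tags that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        # Resolve relative links against the page URL and drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(base_url, anchor["href"]))
        if urlparse(absolute).netloc == HOST and absolute not in links:
            links.append(absolute)
    return links
```

The level 1 candidates would then be obtained with internal_links(fetch_html(START_URL), START_URL). Note that this host check also keeps non-article pages such as special pages; whether to filter those out is left open by the task.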

Additionally, your crawler has to meet the following requirements:

  • Consider the Wikipedia front page as the root (level 0) of a tree and the Wikipedia pages linked to on the front page as level 1. Crawl no deeper than level 2!

  • Visit each unique link only once, i.e. disregard links that you traversed before.

  • Store all links of the same level in a list, i.e. one list per level.

  • Include links that you encounter more than once in the list of the lowest level (closest to the root) at which they appear.

  • Write the lists of links to a CSV file in the format [level ; link_URL].
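One possible way to satisfy these requirements without recursion is a level-by-level (breadth-first) traversal, sketched below. The names crawl_levels, write_levels and the get_links parameter are hypothetical: get_links stands for any function that takes a URL and returns the internal links found on that page. The sketch also assumes one reading of the depth limit, namely that level 2 links are recorded but their pages are not opened, and it writes no header row because the task does not mention one.

```python
import csv


def crawl_levels(start_url, get_links, max_level=2):
    """Return one list of links per level, assigning each link to one level."""
    levels = [[start_url]]  # level 0 holds only the root
    seen = {start_url}      # every link that already belongs to some level
    while len(levels) <= max_level:
        next_level = []
        for url in levels[-1]:  # open the pages of the deepest level so far
            for link in get_links(url):
                # The first sighting fixes the level; because levels are
                # processed in order, this is the level closest to the root.
                if link not in seen:
                    seen.add(link)
                    next_level.append(link)
        levels.append(next_level)
    return levels


def write_levels(levels, path):
    """Write one 'level;link_URL' row per link to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=";")
        for level, links in enumerate(levels):
            for link in links:
                writer.writerow([level, link])
```

Passing the link source in as a function keeps the traversal logic independent of the network, so it can be tried out on a small hand-written link graph before it is pointed at Wikipedia.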

Finally, submit your solutions by adding your code and CSV file to your Git repository in the folder assignment07 > solution. Do not forget to commit and push your solutions afterwards.

[1] https://de.wikipedia.org/wiki/Wikipedia:Hauptseite
