Computation for the Social Sciences

Introduction to Computation for the Social Sciences Assignment 7

Prof. Dr. Karsten Donnay, Stefan Scholz
Winter Term 2018 / 2019

Please solve the exercises below and submit your solutions to our GitHub Classroom by Dec 18th, midnight. Submit all your code in one executable file (py / ipynb) and your list in one delimited text file (csv).

This assignment sheet is graded according to the following scheme: 3 points (fully complete and correct), 2 points (mostly), 1 point (partly) and 0 points (hardly). You will get individual feedback in your repository.

Exercise 1: Simple Web Crawler

Web crawling, i.e. the automated access of web resources via software, and web scraping, i.e. the extraction of content from web resources, are essential techniques for gathering data for many analysis tasks in the social sciences. We will implement a very simple web crawler that starts at a given URL and iteratively follows all links, more specifically all href attributes in the HTML source code of a website.

To get an idea of how to approach this problem in Python, have a look at the sample implementation of a crawler in the subchapters Traversing a Single Domain and Crawling an Entire Site of the book Web Scraping with Python by Ryan Mitchell. If you search for the book online, you will find a PDF version of it.
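For orientation, a minimal sketch of the fetch-and-parse step in the style of that book could look as follows. The choice of the built-in urllib module and the html.parser backend are assumptions here; any equivalent setup (e.g. requests, lxml) works as well.

```python
# Minimal sketch: download one page and parse it with BeautifulSoup.
# Requires the beautifulsoup4 package; urllib is in the standard library.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://de.wikipedia.org/wiki/Wikipedia:Hauptseite")
soup = BeautifulSoup(html.read(), "html.parser")

# Every <a> tag with an href attribute is a candidate link.
for link in soup.find_all("a", href=True):
    print(link["href"])
```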

Implement an iterative crawler (no recursive calls) that:

  • opens the front page of the German Wikipedia[1] and downloads the HTML resource

  • parses the HTML file into a BeautifulSoup object

  • collects all internal links (href attributes), i.e. only links to other Wikipedia pages on the server de.wikipedia.org

  • opens each of the collected links and parses the returned HTML resource for additional internal links

Additionally, your crawler has to meet the following requirements (a structural sketch follows this list):

  • Consider the Wikipedia front page as the root (level 0) of a tree and the Wikipedia pages linked to on the front page as level 1. Crawl no deeper than level 2!

  • Visit each unique link only once, i.e. disregard links that you have traversed before.

  • Store all links of the same level in a list, i.e. one list per level.

  • Include links that you encounter more than once only in the list of the lowest level (closest to the root) at which they appear.

  • Write the lists of links to a CSV file in the format [level ; link_URL].
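One possible skeleton that satisfies these requirements is sketched below; it is not the reference solution. The helper internal_links, the '/wiki/' prefix test for internal links, and the output filename links.csv are illustrative assumptions.

```python
# One possible structure for the iterative crawl: a breadth-first traversal.
import csv
from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "https://de.wikipedia.org"
ROOT = BASE + "/wiki/Wikipedia:Hauptseite"
MAX_LEVEL = 2

def internal_links(url):
    """Return the set of internal Wikipedia links found on one page."""
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith("/wiki/"):          # stay on de.wikipedia.org
            links.add(urljoin(BASE, href))
    return links

seen = {ROOT}
levels = {0: [ROOT]}                           # one list of links per level
for level in range(MAX_LEVEL):                 # open pages at levels 0 and 1;
    next_level = []                            # level-2 links are recorded
    for url in levels[level]:                  # but never opened
        for link in internal_links(url):
            if link not in seen:               # visit each unique link once
                seen.add(link)
                next_level.append(link)
    levels[level + 1] = next_level

with open("links.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")      # format [level ; link_URL]
    for level, links in levels.items():
        for link in links:
            writer.writerow([level, link])
```

Because the traversal is breadth-first and the seen set filters duplicates, each link automatically ends up in the list of the lowest level at which it first appears.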

Finally, submit your solutions by adding your code and CSV file to your Git repository within the folder assignment07 > solution. Do not forget to commit and push your solutions afterwards.

[1] https://de.wikipedia.org/wiki/Wikipedia:Hauptseite
