pythonComputation for the Social Sciences

Introduction to Computation for the Social Sciences Assignment 7

Prof. Dr. Karsten Donnay, Stefan Scholz Winter Term 2018 / 2019

Please solve the exercises below and submit your solutions to our GitHub Classroom until Dec, 18th midnight. Submit all your code in one executable file (py

/ ipynb ) and your list in one delimited text file (csv).

This assignment sheet is graded according to the following scheme: 3 points (fully complete and correct), 2 points (mostly ), 1 point (partly ) and 0 points (hardly). You will get individual feedback in your repository.

Exercise 1: Simple Web Crawler

Web Crawling, i.e. the automated access of Web resources via software, and Web scraping, i.e. the extraction of content from Web resources, are essential techniques to gather data for many analysis tasks in the social sciences. We will implement a very simple Web crawler that starts at a given URL and iteratively accesses all links, more specifically all href-tags in the HTML source code of a Website.

To get an idea how to approach this problem using Python, have a look at the sample implementation of a crawler in the subchapters Traversing a Single Domain and Crawling an Entire Site in the book Web Scraping with Python by Ryan Mitchell. If you search for the book online, you will find a PDF version of it.

Implement an iterative crawler (no recursive calls) that:

  • opens the front page of the German Wikipedia[1] and downloads the html resource

  • parses the html file as a BeautifulSoup object.

  • collects all internal href tags, i.e. only links to other Wikipedia pages on the server de.wikipedia.org

  • opens each of the collected links and parses the returned html resource for additional internal links

    Additionally, your crawler has to meet the following requirements:

  • Consider the Wikipedia front page as the root (level 0) of a tree and the Wikipedia pages linked to on the front page as level 1. Crawl no deeper than level 2!

  • Visit each unique link only once, i.e. disregard links that you traversed before

  • Store all links of the same level in a list, i.e. one list per level.

  • Include links that you encounter more than once in the list of the lowest level (closer to the root) they appear at.

  • Write the lists of links to a CSV file in the format [level ; link_URL].

Finally, submit your solutions by adding your code and csv file in your Git repository within the folder assignment07 > solution. Do not forget to commit and push your solutions afterwards.

1\] https://de.wikipedia.org/wiki/Wikipedia:Hauptseite

相关推荐
小徐学编程-zZ2 小时前
量产测试数据
python·压力测试·数据库架构
QQ8057806512 小时前
django基于机器学习的电商评论情感分析系统设计实现
python·机器学习·django
wx09092 小时前
stata实现机器学习的环境配置
python·机器学习·stata
REDcker3 小时前
C++变量存储与ELF段布局详解 从const全局到rodata与nm_readelf验证实践
java·c++·面试
nuowenyadelunwen4 小时前
CS 61A Lab 2 笔记:短路求值、高阶函数与 Lambda 表达式
python·函数式编程·cs61a·berkeley
王老师青少年编程5 小时前
csp信奥赛C++高频考点专项训练之字符串 --【字符串排序】:合并序列
c++·字符串·csp·高频考点·信奥赛·字符串排序·合并序列
qq_422828625 小时前
android图形学之SurfaceControl和Surface的关系 五
android·开发语言·python
weixin_444012935 小时前
c++如何将std--vector直接DUMP到二进制文件_指针地址直写【附代码】
jvm·数据库·python
woxihuan1234566 小时前
Go语言中--=运算符详解:位右移赋值操作的原理与应用
jvm·数据库·python
handler016 小时前
UDP协议与网络通信知识点
c语言·网络·c++·笔记·网络协议·udp