pythonComputation for the Social Sciences

Introduction to Computation for the Social Sciences Assignment 7

Prof. Dr. Karsten Donnay, Stefan Scholz Winter Term 2018 / 2019

Please solve the exercises below and submit your solutions to our GitHub Classroom until Dec, 18th midnight. Submit all your code in one executable file (py

/ ipynb ) and your list in one delimited text file (csv).

This assignment sheet is graded according to the following scheme: 3 points (fully complete and correct), 2 points (mostly ), 1 point (partly ) and 0 points (hardly). You will get individual feedback in your repository.

Exercise 1: Simple Web Crawler

Web Crawling, i.e. the automated access of Web resources via software, and Web scraping, i.e. the extraction of content from Web resources, are essential techniques to gather data for many analysis tasks in the social sciences. We will implement a very simple Web crawler that starts at a given URL and iteratively accesses all links, more specifically all href-tags in the HTML source code of a Website.

To get an idea how to approach this problem using Python, have a look at the sample implementation of a crawler in the subchapters Traversing a Single Domain and Crawling an Entire Site in the book Web Scraping with Python by Ryan Mitchell. If you search for the book online, you will find a PDF version of it.

Implement an iterative crawler (no recursive calls) that:

  • opens the front page of the German Wikipedia1 and downloads the html resource

  • parses the html file as a BeautifulSoup object.

  • collects all internal href tags, i.e. only links to other Wikipedia pages on the server de.wikipedia.org

  • opens each of the collected links and parses the returned html resource for additional internal links

    Additionally, your crawler has to meet the following requirements:

  • Consider the Wikipedia front page as the root (level 0) of a tree and the Wikipedia pages linked to on the front page as level 1. Crawl no deeper than level 2!

  • Visit each unique link only once, i.e. disregard links that you traversed before

  • Store all links of the same level in a list, i.e. one list per level.

  • Include links that you encounter more than once in the list of the lowest level (closer to the root) they appear at.

  • Write the lists of links to a CSV file in the format level ; link_URL.

Finally, submit your solutions by adding your code and csv file in your Git repository within the folder assignment07 > solution. Do not forget to commit and push your solutions afterwards.

1 https://de.wikipedia.org/wiki/Wikipedia:Hauptseite

相关推荐
lifloveyou6 小时前
table接口结构
python
Warson_L8 小时前
class 扩展
python
前端与小赵9 小时前
Python 数据结构陷阱与复数运算优化:列表、元组、字典成员操作辨析及 NumPy 高效实践
python
天天进步20159 小时前
Python全栈项目--基于深度学习的视频目标跟踪系统
python·深度学习·音视频
天天进步20159 小时前
Python全栈项目--Python自动化运维工具开发
运维·python·自动化
(●—●)橘子……10 小时前
力扣第503场周赛练习理解
python·学习·算法·leetcode·职场和发展·周赛
爱吃羊的老虎10 小时前
【JAVA】python转java:Spring Boot 入门
java·spring boot·python
小桥流水---人工智能11 小时前
【已解决】ImportError: cannot import name ‘AdamW‘ from ‘transformers.optimization‘
python
芝麻开门GEO11 小时前
泰安GEO优化服务,真的能提升效果吗?
人工智能·python
颜酱11 小时前
选读:工业级调用 LangChain:从 Demo 到企业级应用
python