
Scrapinghub and GSoC 2017

At Scrapinghub, we love open source and we know the community can build amazing things.

If you haven’t heard about it already, Google Summer of Code is a global program that offers students stipends to write code for open source projects. Scrapinghub is applying to GSoC for the 4th time, having participated in 2014, 2015 and 2016. Julia Medina, our student in 2014, did amazing work on Scrapy’s API and settings. Jakob de Maeyer, our student in 2015, did a great job getting Scrapy Addons off the ground.

If you’re interested in participating in GSoC 2017 as a student, take a look at the curated list of ideas below. Check the corresponding "Information for Students" section and get in touch with the mentors. Don’t be afraid, we’re nice people :)

We would be thrilled to see any of the ideas below happen, but these are just our ideas; you are free to come up with a new subject, preferably around information retrieval :)

Let’s make it a great Google Summer of Code!

Scrapy Ideas for GSoC 2017

Scrapy and Google Summer of Code

Scrapy is a very popular web crawling and scraping framework for Python (15th among GitHub’s most trending Python projects) used to write spiders that crawl and extract data from websites. Scrapy has a healthy and active community, and it’s applying for Google Summer of Code in 2017.

Information for Students

If you’re interested in participating in GSoC 2017 as a student, you should join the scrapy-users mailing list and post your questions and ideas there. You can also join the #scrapy IRC channel on Freenode to chat with other Scrapy users & developers. All Scrapy development happens in the GitHub Scrapy repo.

Ideas

Scrapy integration tests

Brief explanation

Add integration tests for different networking scenarios.

Expected Results

Be able to test scenarios ranging from vertical to horizontal crawling, against websites on the same and on different IPs, respecting throttling and handling timeouts, retries and DNS failures. It must be simple to define new scenarios with predefined components (websites, proxies, routers, injected error rates).
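
For illustration, a scenario could be declared as plain data and picked up by the test runner; the structure and field names below are purely hypothetical, not an existing Scrapy API:

    # Hypothetical scenario declaration: the test harness would spin up the
    # simulated websites and proxies, inject the configured error rates and
    # assert that the crawler respects throttling and retry settings.
    SCENARIOS = {
        "broad_crawl_with_flaky_dns": {
            "websites": 50,                    # number of simulated sites
            "ips_per_website": 1,              # same vs. different IPs
            "proxies": 2,
            "error_rates": {"dns_failure": 0.05, "timeout": 0.02},
            "throttling": {"download_delay": 0.25},
        },
    }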

Required skills Python, Networking, Virtualization
Difficulty level Intermediate
Mentor(s) tbd

New HTTP/1.1 download handler

Brief explanation

Replace the current HTTP/1.1 download handler with an in-house solution that is easily customizable to crawling needs. The current HTTP/1.1 download handler depends on code shipped with Twisted that is not easily extensible by us; we ship Twisted code under scrapy.xlib.tx to support running Scrapy on older Twisted versions for distributions that don’t ship up-to-date Twisted packages. But this is an ongoing cat-and-mouse game: the HTTP download handler is an essential component of a crawling framework, and having no control over its release cycle leaves us with code that is hard to support. The idea of this task is to depart from the current Twisted code and look for a design that can cover current and future needs, keeping in mind that the goal is to deal with websites that don’t follow the standards to the letter.

Expected Results

An HTTP parser that degrades gracefully when parsing invalid responses, filtering out offending headers and cookies as browsers do. It must be able to avoid downloading responses bigger than a size limit, it should be configurable to throttle the bandwidth used per download, and, if there is enough time, it can lay out the interface for response streaming and support features such as HTTP pipelining.
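
As a rough sketch of what "degrading gracefully" might look like for headers (this is not Scrapy’s or Twisted’s actual code, just a toy illustration), a lenient parser can simply skip lines it cannot make sense of:

    import re

    # Toy example: parse a raw header block, silently dropping malformed lines
    # instead of failing the whole response, the way browsers tend to do.
    HEADER_RE = re.compile(br"^([^:\s]+):\s*(.*?)\s*$")

    def parse_headers_leniently(raw_headers):
        headers = {}
        for line in raw_headers.split(b"\r\n"):
            match = HEADER_RE.match(line)
            if match is None:
                continue  # offending header: ignore it and keep going
            name, value = match.groups()
            headers.setdefault(name.lower(), []).append(value)
        return headers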

Required skills Python, Twisted, HTTP protocol
Difficulty level Intermediate
Mentor(s) tbd

Asyncio Prototype

Brief explanation

The asyncio library provides infrastructure for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives. We are looking to see how it fits into Scrapy’s architecture.
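
As a tiny, self-contained illustration of that model (nothing Scrapy-specific here), several simulated downloads can run concurrently inside a single event loop:

    import asyncio

    # Toy example of single-threaded concurrency with coroutines; the sleep
    # stands in for the non-blocking network I/O an asyncio-based core would do.
    async def fake_download(url, delay):
        await asyncio.sleep(delay)
        return url

    async def crawl(urls):
        tasks = [fake_download(url, 0.1 * i) for i, url in enumerate(urls, 1)]
        return await asyncio.gather(*tasks)

    if __name__ == "__main__":
        loop = asyncio.get_event_loop()
        print(loop.run_until_complete(crawl(["http://example.com/a",
                                             "http://example.com/b"])))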

Expected Results

A working prototype of an asyncio-based Scrapy.

Required skills Python
Difficulty level Advanced
Mentor(s) Steven Almeroth

IPython IDE for Scrapy

Brief explanation

Develop a better IPython + Scrapy integration that would display the HTML page inline in the console, provide some interactive widgets and run Python code against the results.

Here is an old scrapy-ipython proof of concept demo. See also: Splash custom IPython/Jupyter kernel.
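
Some of the building blocks already exist; for instance, a fetched page can be rendered inline in a Jupyter notebook with a couple of lines (assuming response is a Scrapy TextResponse):

    # Minimal sketch: render a downloaded page inline in a notebook cell.
    from IPython.display import HTML, display

    def show_response(response):
        display(HTML(response.text))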

Expected Results

It should become possible to develop Scrapy spiders interactively and visually inside IPython notebooks.

Required skills Python, JavaScript, HTML, Interface Design, Security
Difficulty level Advanced
Mentor(s) Mikhail Korobov

Scrapy benchmarking suite

Brief explanation

Develop a more comprehensive benchmarking suite. Profile and address CPU bottlenecks found. Address both known memory inefficiencies (which will be provided) and new ones uncovered.
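
A starting point could be as simple as a profiling harness wrapped around each benchmark scenario; run_benchmark below is a placeholder for an actual scenario, such as a local crawl against a static site:

    import cProfile
    import pstats

    # Sketch of a profiling wrapper: run a benchmark under cProfile and print
    # the most expensive call paths, so CPU bottlenecks can be compared between
    # Scrapy versions or patches.
    def profile(run_benchmark):
        profiler = cProfile.Profile()
        profiler.enable()
        run_benchmark()
        profiler.disable()
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)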

Expected Results

Reusable benchmarks, measurable performance improvements.

Required skills Python, Profiling, Algorithms, Data Structures
Difficulty level Advanced
Mentor(s) Konstantin Lopukhin

Portia Ideas for GSoC 2017

Information for Students

If you’re interested in participating in GSoC 2017 as a student contributing to Portia, you should join the portia-scraper mailing list and post your questions and ideas there. All Portia development happens in the GitHub Portia repo.

Ideas

Increase Crawling Performance through page clustering

Brief explanation

With the rise of Angular and React, web crawling has come to require tools like Selenium, PhantomJS and Splash to render pages so that data can be extracted. Rendering pages like this can cause a crawl to take 10 times as long to complete. This project aims to use page clustering to examine pages before and after rendering and build up rules, based on the available data and links, to decide whether other similar pages should be rendered or not.
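
In its simplest form, the per-cluster decision could boil down to a rule like the one below; the inputs are whatever links and items were extracted from the plain and the rendered version of a sample page, and the function is only illustrative:

    # Toy decision rule: rendering is only worth the cost if it exposed links
    # or data that the non-rendered version of the page did not contain.
    def rendering_needed(plain_links, rendered_links, plain_items, rendered_items):
        new_links = set(rendered_links) - set(plain_links)
        new_items = rendered_items != plain_items
        return bool(new_links) or new_items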

Expected Results

It should be possible to reduce the number of pages rendered for some crawls, reducing the time and bandwidth needed to extract data from a site.

Required skills Python
Difficulty level Intermediate
Mentor(s) Ruairi Fahy

Portia Spider Generation

Brief explanation

One problem with traditional website scraping using XPath and CSS selectors is that when a website changes its layout, your spiders may no longer work. This project aims to use crawl datasets to build new Portia spiders from website content and extracted data, repair spiders when the website layout has changed, and then merge the templates used by the spiders into a small, manageable number.
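
One small piece of this puzzle is recovering a selector from previously extracted data. A naive sketch using parsel (the helper and its heuristic are purely illustrative):

    from parsel import Selector

    # Given a page and a value we know was extracted from it, try to find a
    # CSS class selector that still matches that value, so an annotation can
    # be rebuilt after a layout change.
    def css_class_for_value(html, value):
        sel = Selector(text=html)
        for node in sel.xpath("//*[normalize-space(.)=$val]", val=value):
            css_class = node.xpath("@class").extract_first()
            if css_class:
                return "." + css_class.split()[0]
        return None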

Expected Results

A tool that takes crawl data and web pages and uses them to generate new spiders.

Required skills Python
Difficulty level Advanced
Mentor(s) Ruairi Fahy

Splash Ideas for GSoC 2017

Information for Students

Splash doesn’t yet have a mailing list, so if you’re interested in discussing any of these ideas, drop us a line via email at gsoc@scrapinghub.com, or open an issue on GitHub. You can also check the documentation at https://splash.readthedocs.org/en/latest/.

All Splash development happens in the GitHub Splash repo.

Ideas

Migrate to QtWebEngine

Brief explanation

Implement as many Splash features as possible using QtWebEngine instead of QtWebKit, while keeping QtWebKit compatibility.

Expected Results

Most (if not all) tests should pass under Python 3.4 with Qt 5.5/5.6.

Required skills Python 2, Python 3, PyQt
Difficulty level Intermediate
Mentor(s) Mikhail Korobov

Frontera Ideas for GSoC 2017

Frontera and Google Summer of Code

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build a large-scale online web crawler. Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and is capable of doing so in a distributed manner.

Frontera is on its way to building a healthy and active community, and it’s applying for Google Summer of Code in 2017.

Information for Students

You can check the documentation at https://frontera.readthedocs.org/en/latest/.

All Frontera development happens in the GitHub Frontera repo. Please use the Frontera mailing list as the main communications channel.

Ideas

A tool to configure and set up a Frontera-based crawler on a Kubernetes cluster

Brief explanation

Many people face difficulties trying to configure a Frontera-based crawler stack (Kafka, HBase, Frontera and Scrapy). Another problem is that once a Frontera-based crawler is running, the whole application needs to be provisioned, e.g. checking that instances are up and the hardware is functioning properly. The goal of this project is to automate the configuration and provisioning of a Frontera-based crawler using Kubernetes and a well-crafted config generation tool.
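
The config generation part could be as simple as rendering per-component Kubernetes manifests from one shared description, e.g. with Jinja2; the template below is illustrative only and not a complete or validated manifest, and the component names and images are made up:

    from jinja2 import Template

    # Toy config generator: one template, one Deployment-like manifest per
    # crawler component.
    DEPLOYMENT = Template(
        "apiVersion: apps/v1beta1\n"
        "kind: Deployment\n"
        "metadata:\n"
        "  name: {{ name }}\n"
        "spec:\n"
        "  replicas: {{ replicas }}\n"
        "  template:\n"
        "    spec:\n"
        "      containers:\n"
        "      - name: {{ name }}\n"
        "        image: {{ image }}\n")

    components = [
        {"name": "frontera-db-worker", "replicas": 2, "image": "example/db-worker"},
        {"name": "scrapy-spider", "replicas": 4, "image": "example/spider"},
    ]
    manifests = [DEPLOYMENT.render(**component) for component in components]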

Expected Results

Code and documentation for the tool.

Required skills Linux administration, Kubernetes and Docker ecosystem, distributed systems design
Difficulty level Advanced
Mentor(s) Alexander Sibiryakov

Frontera Web UI

Brief explanation

A web management UI would ease the use of Frontera and make it more attractive for people new to Frontera. It could provide a way to get the status of all components (e.g. errors, download speed), view the storage contents and manage the crawler: stop generating new batches, revisit URLs, add new seeds, adjust priorities. A possible solution would be to have all components send their stats to a Kafka topic, a process collecting those stats from the topic, and a UI made with Django rendering them to web pages.
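
For example, the collector side of that pipeline could be a small Kafka consumer feeding the latest per-component stats to whatever the Django views read from; the topic name and message format below are assumptions:

    import json
    from kafka import KafkaConsumer

    # Sketch of a stats collector: consume JSON stats messages published by
    # Frontera components and keep the latest snapshot per component.
    def collect_stats(bootstrap_servers="localhost:9092", topic="frontera-stats"):
        consumer = KafkaConsumer(
            topic,
            bootstrap_servers=bootstrap_servers,
            value_deserializer=lambda m: json.loads(m.decode("utf-8")))
        latest = {}
        for message in consumer:
            stats = message.value  # e.g. {"component": "sw-0", "errors": 1, ...}
            latest[stats.get("component")] = stats
            yield dict(latest)     # the web UI would read these snapshots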

Expected Results

Code and documentation contribution.

Required skills Django, Web UI
Difficulty level Intermediate
Mentor(s) Alexander Sibiryakov

Implement message bus on RabbitMQ

Brief explanation

The message bus component in Frontera is responsible for transferring messages between workers and spiders. Currently, there are two implementations, supporting Kafka and ZeroMQ. We would like to have a new one running on RabbitMQ.
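
The producer half of such a bus could look roughly like this with pika (a deliberately bare-bones sketch; a real implementation would have to cover Frontera’s full message bus interface, including partitioning and consumer offsets):

    import pika

    # Minimal RabbitMQ producer sketch: declare a queue and publish messages
    # to it, which is the core primitive the message bus would build on.
    class RabbitMQProducer(object):
        def __init__(self, url, queue):
            self.connection = pika.BlockingConnection(pika.URLParameters(url))
            self.channel = self.connection.channel()
            self.channel.queue_declare(queue=queue)
            self.queue = queue

        def send(self, message):
            self.channel.basic_publish(
                exchange="", routing_key=self.queue, body=message)

        def close(self):
            self.connection.close()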

Expected Results

Code and docs contribution.

Required skills RabbitMQ understanding, distributed systems
Difficulty level Advanced
Mentor(s) Alexander Sibiryakov

Dateparser Ideas for GSoC 2017

Information for Students

Learn what dateparser can do at https://dateparser.readthedocs.io/en/latest/.

dateparser doesn’t have a mailing list yet, so if you’re interested in discussing any of these ideas, drop us a line via email at gsoc@scrapinghub.com, or open an issue on GitHub.

All dateparser development happens in the GitHub dateparser repo.

Ideas

Make dateparser Fully Compliant with ISO 8601:2004

Brief explanation

dateparser has fair support for parsing calendar dates, i.e. dates represented in terms of calendar year, calendar month and day of the month. But it lacks support for parsing the following representations, which are documented in ISO 8601, the International Standard for the representation of dates and times:

  • ordinal dates, expressed in terms of calendar year and calendar day of the year
  • week dates, expressed in terms of calendar year, calendar week number and calendar day of the week
  • time intervals
  • recurring time intervals
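
To give an idea of the first two items, the underlying arithmetic is small and easy to test (a simplified sketch; a compliant subparser also has to handle the compact forms, time-of-day parts, intervals and recurrence rules):

    from datetime import date, timedelta

    def from_ordinal_date(year, day_of_year):
        """ISO 8601 ordinal date, e.g. 2017-048 -> 2017-02-17."""
        return date(year, 1, 1) + timedelta(days=day_of_year - 1)

    def from_week_date(year, week, weekday):
        """ISO 8601 week date, e.g. 2017-W07-5 -> 2017-02-17."""
        jan4 = date(year, 1, 4)  # 4 January always falls in week 1
        week1_monday = jan4 - timedelta(days=jan4.isoweekday() - 1)
        return week1_monday + timedelta(weeks=week - 1, days=weekday - 1)
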
Expected Results

A subparser fully compliant with ISO 8601:2004, able to parse all representations of dates and times covered in the standard, along with extensive test coverage.

Required skills Python, TDD
Difficulty level Intermediate
Mentor(s) Waqas Shabir, Artur Sadurski

Integrate the Unicode CLDR database with dateparser

Brief explanation

dateparser now supports parsing dates written in 26 languages. The conversion data comes from the community, individuals and web data sources.

Although this methodology has worked out well for dateparser, it is slow, and we struggle to find peers who know the language well enough to review the additions. For this reason we want to include new languages from the Unicode CLDR database, which is already a reliable source of locale-specific references and data for a number of software applications.
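
As a quick illustration of what CLDR already provides (shown here via Babel, which bundles CLDR data; dateparser would more likely consume the CLDR data files directly):

    from babel.dates import get_day_names, get_month_names

    # CLDR knows the localized month and weekday names for hundreds of locales,
    # which is exactly the kind of data dateparser currently collects by hand.
    finnish_months = get_month_names("wide", locale="fi")  # {1: ..., 2: ..., ...}
    russian_days = get_day_names("wide", locale="ru")      # {0: ..., 1: ..., ...}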

Expected Results

Full support for all the languages in the CLDR database.

Required skills Python, TDD
Difficulty level Intermediate
Mentor(s) Waqas Shabir, Artur Sadurski

Find and Parse Expressions of Times in Large Texts

Brief explanation

Add support to dateparser for finding dates in large chunks of text in a language-agnostic fashion, returning the corresponding Python datetime, timedelta or range objects.

It should also be able to parse dates expressed with numerals and other natural constructs describing date and time that are available in CLDR data.
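
A deliberately naive sketch of such an entry point, built on the existing dateparser.parse (the candidate regex is a placeholder; the real project would need language detection and much smarter candidate extraction):

    import re
    import dateparser

    # Scan a chunk of text for substrings that look like dates and try to
    # parse each one, yielding (matched text, datetime) pairs.
    CANDIDATE_RE = re.compile(r"\b(\d{1,2}\s+\w+\s+\d{4}|\d{4}-\d{2}-\d{2})\b")

    def find_dates(text):
        for match in CANDIDATE_RE.finditer(text):
            parsed = dateparser.parse(match.group(0))
            if parsed is not None:
                yield match.group(0), parsed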

Expected Results

An entry point like dateparser.parse that accepts large chunks of text in a fashion similar to parse() and returns the corresponding Python objects.

Required skills Python, TDD
Difficulty level Intermediate
Mentor(s) Waqas Shabir, Artur Sadurski