
Scrapinghub in GSoC 2017

Scrapinghub is applying!

Scrapinghub is a company focused on information retrieval and data processing, deeply involved in developing and contributing to open source projects for web crawling and data processing technologies.

This year we are applying with five of our best-known projects: Scrapy, Portia, Splash, Frontera, and dateparser. You can learn more about these projects in their respective repositories on GitHub: scrapy/scrapy, scrapinghub/portia, scrapinghub/splash, scrapinghub/frontera and scrapinghub/dateparser.


Scrapy is a very popular web crawling and scraping framework for Python (10th among the most trending Python projects on GitHub), used to write spiders that crawl websites and extract data from them.

Check Scrapy ideas


Portia is a tool that lets you scrape websites visually, with no programming knowledge required. Users annotate web pages to identify the data they wish to extract, and Portia uses these annotations to learn how to scrape data from similar pages.

Check Portia ideas


Splash is a lightweight web browser with an HTTP API. It is used to render web pages that rely on JavaScript, interact with them, retrieve detailed information, and take screenshots of crawled websites as they appear in a browser.

Check Splash ideas


Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, which allow you to build a large-scale online web crawler.

Check Frontera ideas


dateparser is a Python library for parsing localized dates in almost any string format commonly found on web pages.

Check dateparser ideas