    Repositories list

    • arachnado

      Public
      Web Crawling UI and HTTP API, based on Scrapy and Tornado
      Python
      611601621 Updated Nov 18, 2025
    • soft404

      Public
      A classifier for detecting soft 404 pages
      Jupyter Notebook
      135746 Updated Oct 10, 2025
    • A library to collect data from search forms
      Python
      3011 Updated Oct 10, 2025
    • autopager

      Public
      Detect and classify pagination links
      HTML
      2610471 Updated Oct 10, 2025
    • Middleware that limits the number of internal/external links followed during a broad crawl
      Python
      2211 Updated Oct 10, 2025
    • autologin

      Public
      A project that attempts to log in to a website automatically, given a single seed
      Python
      41127106 Updated Oct 10, 2025
    • Python
      3211 Updated Oct 10, 2025
    • eli5

      Public
      A library for debugging/inspecting machine learning classifiers and explaining their predictions
      Jupyter Notebook
      3292.8k14518 Updated Apr 19, 2025
    • Formasaurus

      Public
      Formasaurus tells you the type of an HTML form and its fields using machine learning
      HTML
      46119122 Updated Jun 18, 2024
    • scikit-learn inspired API for CRFsuite
      Python
      2134323412 Updated Sep 25, 2023
    • agnostic

      Public
      Agnostic Database Migrations
      Python
      185491 Updated Aug 10, 2023
    • Log TensorBoard events without touching TensorFlow
      Python
      5062994 Updated Dec 26, 2022
    • Scrapy middleware that restricts a crawl to new content only
      Python
      217943 Updated Oct 31, 2022
    • Use multiple proxies with Scrapy
      Python
      161770469 Updated May 20, 2022
    • Show summary of a large number of URLs in a Jupyter Notebook
      Python
      71701 Updated Jun 8, 2021
    • Site Hound (previously THH) is a Domain Discovery Tool
      HTML
      92324 Updated Jun 1, 2021
    • html-text

      Public
      Extract text from HTML
      HTML
      22135132 Updated Jul 22, 2020
    • A rotating socks proxy using Tor, Delegate and Haproxy
      Dockerfile
      131310 Updated Dec 19, 2019
    • aquarium

      Public
      Splash + HAProxy + Docker Compose
      Python
      38197240 Updated Nov 29, 2018
    • Read JSON lines (jl) files, including gzipped and broken
      Python
      83520 Updated Nov 21, 2018
    • Scrapy extension which writes crawled items to Kafka
      Python
      83020 Updated Nov 8, 2018
    • Item definition and utilities for storing items in CDR format for Scrapy
      Python
      5700 Updated Oct 29, 2018
    • Sitehound's backend
      HTML
      4600 Updated Oct 17, 2018
    • Scrapy middleware for autologin
      Python
      143651 Updated May 29, 2018
    • A generic crawler
      Python
      2378170 Updated May 29, 2018
    • Broad crawler for domain discovery
      Python
      81920 Updated May 29, 2018
    • Simple heuristic for measuring web page similarity (& data set)
      HTML
      169110 Updated May 29, 2018
    • Headless Horseman Page Classifier service
      Python
      4500 Updated May 29, 2018
    • deep-deep

      Public
      Adaptive crawler which uses Reinforcement Learning methods
      Jupyter Notebook
      3516800 Updated May 29, 2018
    • A collection of example Lua scripts and JS utilities
      JavaScript
      3700 Updated May 29, 2018