TeamHG

74 repositories

arachnado
Public
Web Crawling UI and HTTP API, based on Scrapy and Tornado
Python
•61•160•16•21•Updated Nov 18, 2025
soft404
Public
A classifier for detecting soft 404 pages
Jupyter Notebook
•13•57•4•6•Updated Oct 10, 2025
crazy-form-submitter
Public
A library to collect data from search forms
Python
•3•0•1•1•Updated Oct 10, 2025
autopager
Public
Detect and classify pagination links
HTML
•26•104•7•1•Updated Oct 10, 2025
broadcrawl
Public
Middleware that limits the number of internal/external links during a broad crawl
Python
•2•2•1•1•Updated Oct 10, 2025
autologin
Public
A project that attempts to log in to a website automatically, given a single seed
Python
•
Apache License 2.0
•41•127•10•6•Updated Oct 10, 2025
autoregister
Public
Python
•3•2•1•1•Updated Oct 10, 2025
eli5
Public
A library for debugging/inspecting machine learning classifiers and explaining their predictions
python nlp data-science machine-learning scikit-learn xgboost lightgbm inspection explanation crfsuite
Jupyter Notebook
•
MIT License
•329•2.8k•145•18•Updated Apr 19, 2025
Formasaurus
Public
Formasaurus tells you the type of an HTML form and its fields using machine learning
HTML
•46•119•12•2•Updated Jun 18, 2024
sklearn-crfsuite
Public
scikit-learn inspired API for CRFsuite
Python
•213•432•34•12•Updated Sep 25, 2023
agnostic
Public
Agnostic Database Migrations
Python
•
MIT License
•18•54•9•1•Updated Aug 10, 2023
tensorboard_logger
Public
Log TensorBoard events without touching TensorFlow
Python
•
MIT License
•50•629•9•4•Updated Dec 26, 2022
scrapy-crawl-once
Public
Scrapy middleware that allows crawling only new content
scrapy
Python
•
MIT License
•21•79•4•3•Updated Oct 31, 2022
scrapy-rotating-proxies
Public
Use multiple proxies with Scrapy
proxy scrapy
Python
•
MIT License
•161•770•46•9•Updated May 20, 2022
url-summary
Public
Show summary of a large number of URLs in a Jupyter Notebook
Python
•
MIT License
•7•17•0•1•Updated Jun 8, 2021
sitehound-frontend
Public
Site Hound (previously THH) is a Domain Discovery Tool
topics domain-discovery
HTML
•
Apache License 2.0
•9•23•2•4•Updated Jun 1, 2021
html-text
Public
Extract text from HTML
HTML
•
MIT License
•22•135•13•2•Updated Jul 22, 2020
docker-tor-rotator
Public
A rotating SOCKS proxy using Tor, DeleGate, and HAProxy
Dockerfile
•13•13•1•0•Updated Dec 19, 2019
aquarium
Public
Splash + HAProxy + Docker Compose
Python
•
MIT License
•38•197•24•0•Updated Nov 29, 2018
json-lines
Public
Read JSON lines (jl) files, including gzipped and broken
Python
•
MIT License
•8•35•2•0•Updated Nov 21, 2018
scrapy-kafka-export
Public
Scrapy extension which writes crawled items to Kafka
Python
•
MIT License
•8•30•2•0•Updated Nov 8, 2018
scrapy-cdr
Public
Item definition and utilities for storing items in CDR format for Scrapy
Python
•
MIT License
•5•7•0•0•Updated Oct 29, 2018
sitehound-backend
Public
Sitehound's backend
HTML
•
Apache License 2.0
•4•6•0•0•Updated Oct 17, 2018
autologin-middleware
Public
Scrapy middleware for autologin
Python
•14•36•5•1•Updated May 29, 2018
undercrawler
Public
A generic crawler
Python
•23•78•17•0•Updated May 29, 2018
domain-discovery-crawler
Public
Broad crawler for domain discovery
Python
•
MIT License
•8•19•2•0•Updated May 29, 2018
page-compare
Public
Simple heuristic for measuring web page similarity (& data set)
HTML
•16•91•1•0•Updated May 29, 2018
hh-page-classifier
Public
Headless Horseman Page Classifier service
Python
•
MIT License
•4•5•0•0•Updated May 29, 2018
deep-deep
Public
Adaptive crawler which uses Reinforcement Learning methods
Jupyter Notebook
•35•168•0•0•Updated May 29, 2018
scrash-lua-examples
Public
A collection of example Lua scripts and JS utilities
JavaScript
•3•7•0•0•Updated May 29, 2018