Super-Simple Scraper

This is a very thin layer on top of Colly that allows configuration from a JSON file. The output is JSONL, ready to be imported into Typesense.


  • Scrape HTML & PDF documents based on the configured selectors
  • Selectors can be CSS selectors or template-based ones, which have sprig functions available.
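To give a feel for what a template-based selector might look like, here is a minimal Go sketch built on the standard `text/template` package. It is not the scraper's own code: the `render` helper, the `title` key, and the template string are hypothetical, and the `trim` and `upper` functions are hand-rolled stand-ins for the sprig functions of the same names so the example has no third-party dependencies.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// standInFuncs mimics two sprig helpers (trim, upper) with the standard
// library. This is an assumption for illustration; the real project
// makes sprig's full function set available instead.
var standInFuncs = template.FuncMap{
	"trim":  strings.TrimSpace,
	"upper": strings.ToUpper,
}

// render executes a template-based selector against scraped values.
func render(tmpl string, data map[string]string) (string, error) {
	t, err := template.New("selector").Funcs(standInFuncs).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Hypothetical value that a CSS selector might have extracted.
	data := map[string]string{"title": "  getting started  "}
	out, err := render(`{{ .title | trim | upper }}`, data)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints "GETTING STARTED"
}
```

The pipeline syntax passes the value through each function in turn, which is what makes templates handy for cleaning up extracted text.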


See the example configuration. Many of these options are copied directly to their Colly equivalents.


We have an image on Docker Hub, so after installing Docker and jq, something like this will work:

docker run -it -v `pwd`:/go/src/app -e "CONFIG=$(cat ./path/to/your/config.json | jq -r tostring)" gotripod/ssscraper:main
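
The command above hands the whole configuration to the container as a single-line JSON string in the `CONFIG` environment variable. As a hedged sketch of how a Go program could consume such a variable (not necessarily how this project does it), the hypothetical `parseConfig` helper below decodes it into a generic map, since the actual configuration schema is not reproduced here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// parseConfig decodes a raw JSON string into a generic map. A real
// implementation would more likely decode into a typed struct.
func parseConfig(raw string) (map[string]interface{}, error) {
	if raw == "" {
		return nil, errors.New("CONFIG is empty")
	}
	var cfg map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return nil, fmt.Errorf("invalid CONFIG JSON: %w", err)
	}
	return cfg, nil
}

func main() {
	raw := os.Getenv("CONFIG")
	if raw == "" {
		// Fallback so the sketch still runs outside Docker; the key is
		// a placeholder, not a real option.
		raw = `{"example": true}`
	}
	cfg, err := parseConfig(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("loaded %d top-level options\n", len(cfg))
}
```

Compacting the file with `jq -r tostring` keeps the JSON on one line, which avoids quoting problems when it is passed through `-e`.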

The manual method is:

docker build -t ssscraper .
docker run -v `pwd`:/go/src/app -it --rm --name ssscraper-ahoy ssscraper

# you're now in the docker container

cd src/app
go build


With VSCode and the Containers extension installed, clone the repo and open its directory.

Future ideas

  • Webhook support - POST the output to a URL on completion
  • Different output formats
  • Custom weighting for selectors
  • Extract the selector/template logic to a common function
  • Add Word doc support