This is a very thin layer on top of Colly that allows configuration from a JSON file. The output is JSONL, ready to be imported into Typesense.
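For readers unfamiliar with JSONL, it is one JSON object per line. The sketch below shows how such output can be produced in Go. The `Document` type and its fields are hypothetical and only illustrate the shape; they are not the scraper's actual output schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
)

// Document is a hypothetical scraped record. The field names are
// illustrative only and are not taken from the scraper's real output.
type Document struct {
	URL   string `json:"url"`
	Title string `json:"title"`
	Body  string `json:"body"`
}

// toJSONL serialises each document as one JSON object per line.
func toJSONL(docs []Document) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode writes a newline after every value
	for _, d := range docs {
		if err := enc.Encode(d); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

func main() {
	out, err := toJSONL([]Document{
		{URL: "https://example.com/a", Title: "A", Body: "first"},
		{URL: "https://example.com/b", Title: "B", Body: "second"},
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Because every line is a complete JSON object, the file can be streamed or split without parsing it as a whole.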
- Scrape HTML & PDF documents based on the configured selectors
- Selectors can be CSS selectors or template-based ones, which have sprig functions available.
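To give a feel for what a template-based selector could look like, here is a minimal sketch using Go's standard `text/template`. It assumes the templates are Go templates (which is what sprig extends). The `trim` and `upper` helpers are hand-written stand-ins for the sprig functions of the same name so the example runs without extra dependencies, and the `.Title` field is a hypothetical piece of template data, not necessarily what the scraper exposes.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// funcs stands in for sprig. With the real library you would pass
// sprig.TxtFuncMap() to Funcs instead of this hand-written map.
var funcs = template.FuncMap{
	"trim":  strings.TrimSpace,
	"upper": strings.ToUpper,
}

// renderSelector parses tmplText and executes it against data.
func renderSelector(tmplText string, data interface{}) (string, error) {
	tmpl, err := template.New("selector").Funcs(funcs).Parse(tmplText)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Hypothetical data: a title extracted from a page, with stray whitespace.
	data := map[string]string{"Title": "  getting started  "}
	out, err := renderSelector(`{{ .Title | trim | upper }}`, data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out) // prints "GETTING STARTED"
}
```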
See the example configuration. Many of its options are copied directly to their Colly equivalents.
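As a rough illustration of options mapping onto Colly, the sketch below decodes a JSON config into a Go struct. The JSON keys are hypothetical; the struct fields are merely named after Colly's `Collector` fields (`AllowedDomains`, `MaxDepth`, `UserAgent`). Refer to the example configuration for the keys this project really uses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Config is a hypothetical subset of a scraper configuration. The field
// names mirror fields on Colly's Collector; the JSON keys are assumptions.
type Config struct {
	AllowedDomains []string `json:"allowedDomains"`
	MaxDepth       int      `json:"maxDepth"`
	UserAgent      string   `json:"userAgent"`
}

// parseConfig decodes a JSON string into a Config.
func parseConfig(raw string) (Config, error) {
	var cfg Config
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return Config{}, fmt.Errorf("invalid config: %w", err)
	}
	return cfg, nil
}

func main() {
	raw := `{"allowedDomains":["example.com"],"maxDepth":2,"userAgent":"my-scraper"}`
	cfg, err := parseConfig(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Each value could then be assigned to the Colly field of the same name.
	fmt.Printf("%+v\n", cfg)
}
```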
We have an image on Docker Hub, so after installing jq, something like this will work:
docker run -it -v `pwd`:/go/src/app -e "CONFIG=$(cat ./path/to/your/config.json | jq -r tostring)" gotripod/ssscraper:main
The manual method is:
docker build -t ssscraper .
docker run -v `pwd`:/go/src/app -it --rm --name ssscraper-ahoy ssscraper
# you're now in the docker container
cd src/app
go build
./ssscraper
Using VS Code with the Dev Containers extension installed, clone the repo and open its directory.
- Webhook support - POST the output to a URL on completion
- Different output formats
- Custom weighting for selectors
- Extract the selector/template logic to a common function
- Add Word doc support