gh-archive-clickhouse

Saves the public GitHub event stream to ClickHouse as raw JSON.

CREATE TABLE github_events_raw
(
    id  Int64,
    ts  DateTime32,
    raw String CODEC (ZSTD(16))
) ENGINE = ReplacingMergeTree
      PARTITION BY toYYYYMMDD(ts)
      ORDER BY (ts, id);

An alternative to the gharchive crawler with a lower probability of missing events.

  • Streams to ClickHouse via the native protocol instead of writing files, so storage and fetching are decoupled.
  • Automatic pagination when more than one page of new events is available.
  • Automatic fetch rate adjustment based on GitHub rate limit headers and request duration.
  • ETag support to skip cached results.
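
The last two points can be illustrated with a minimal Python sketch. This is not the project's actual code: the helper names `next_delay` and `conditional_headers` are hypothetical, and the pacing policy (spread the remaining request budget evenly over the time until the rate limit resets, minus the time the last request took) is only one assumed way to combine rate limit headers with request duration. `X-RateLimit-Remaining`, `X-RateLimit-Reset`, and `If-None-Match` are the standard GitHub API and HTTP header names.

```python
import time


def next_delay(headers, request_duration, now=None, min_delay=0.0):
    """Return seconds to wait before the next poll (hypothetical helper).

    Assumed policy: spread the remaining request budget evenly over the
    time left until the rate limit window resets, then subtract the time
    the last request already took.

    `headers` is assumed to be a mapping keyed by the exact header names
    used below (a plain dict is case-sensitive).
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 0))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))  # UTC epoch seconds
    window = max(reset_at - now, 0.0)
    if remaining <= 0:
        # Budget exhausted: wait until the window resets.
        return max(window, min_delay)
    interval = window / remaining
    return max(interval - request_duration, min_delay)


def conditional_headers(etag):
    """Build request headers for a conditional GET.

    If the server's content still matches `etag`, it can answer
    304 Not Modified and the response body can be skipped.
    """
    return {"If-None-Match": etag} if etag else {}
```

For example, with 10 requests remaining, 100 seconds until reset, and a last request that took 2 seconds, `next_delay` returns 8.0. A crawler would store the `ETag` value from each response and pass it to `conditional_headers` on the next poll.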
