S3 Parallel Downloader

CLI utility that downloads multiple S3 objects concurrently, issuing multiple range requests per object. It can also copy between local filesystem locations using multiple threads.
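
The concurrency model this implies is two-level: a pool of workers each claims one object, and each worker fans out into several ranged GETs. Below is a minimal Go sketch of that pattern, under the assumption that --workers and --threads map to these two levels; the function and values here are illustrative, not the tool's actual internals.

package main

import (
	"fmt"
	"sync"
)

// Two-level fan-out: `workers` objects in flight at once, each split
// into up to `threads` concurrent ranged reads of `partsize` bytes.
// Illustrative sketch only — not the tool's real implementation.
func download(keys []string, size, workers, threads, partsize int64) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := int64(0); i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range jobs {
				sem := make(chan struct{}, threads) // caps parts in flight per object
				var pg sync.WaitGroup
				for off := int64(0); off < size; off += partsize {
					pg.Add(1)
					sem <- struct{}{}
					go func(off int64) {
						defer pg.Done()
						defer func() { <-sem }()
						end := min(off+partsize, size) - 1
						// Real code would issue GetObject with
						// "Range: bytes=<off>-<end>" here.
						fmt.Printf("GET %s bytes=%d-%d\n", key, off, end)
					}(off)
				}
				pg.Wait()
			}
		}()
	}
	for _, k := range keys {
		jobs <- k
	}
	close(jobs)
	wg.Wait()
}

func main() {
	download([]string{"a.bin", "b.bin"}, 32<<20, 2, 4, 8<<20)
}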

Operations always recurse into the specified directories. When reading from a local filesystem, symlinks are not followed.
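
A sketch of what that local traversal could look like in Go, assuming filepath.WalkDir is a fair stand-in for the tool's directory walk (WalkDir does not descend into symlinked directories, and the explicit mode check below skips symlinked files as well):

package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

// Recursively list regular files under root, ignoring symlinks.
func listFiles(root string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type()&fs.ModeSymlink != 0 {
			return nil // do not follow symlinks
		}
		if d.Type().IsRegular() {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

func main() {
	files, err := listFiles(".")
	if err != nil {
		panic(err)
	}
	fmt.Println(files)
}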

Known Issues:

  • If your S3 bucket contains an object and a "folder" (key prefix) with the same name, this utility will fail (e.g. s3://mybucket/test.txt and s3://mybucket/test.txt/another-object.txt). This fails because POSIX filesystems cannot have a directory and a file at the same absolute path; one way such collisions could be detected is sketched after this list.
  • Support for writing to S3 has not yet been added.
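
A hypothetical pre-flight check for the name collision above, run before any file is written — not something the tool currently does:

package main

import (
	"fmt"
	"strings"
)

// checkCollisions returns an error if any S3 key is also a prefix
// ("folder") of another key, since both would map to the same local
// path — one as a file, one as a directory.
func checkCollisions(keys []string) error {
	seen := make(map[string]bool, len(keys))
	for _, k := range keys {
		seen[k] = true
	}
	for _, k := range keys {
		for dir := k; ; {
			i := strings.LastIndexByte(dir, '/')
			if i < 0 {
				break
			}
			dir = dir[:i]
			if seen[dir] {
				return fmt.Errorf("key %q is both an object and a prefix of %q", dir, k)
			}
		}
	}
	return nil
}

func main() {
	err := checkCollisions([]string{"test.txt", "test.txt/another-object.txt"})
	fmt.Println(err)
}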

Benchmark results

65.6069 Gibps – 7926 ms transferring 65 GiB of data – downloaded 2,080 32 MiB objects across 370 (185 × 2) concurrent HTTP requests

./s3pd-linux-amd64 \
--region=us-west-2 \
--workers=185 \
--threads=2 \
--partsize=$((4*1024*1024)) \
s3://test-400gbps-s3/32MiB/ /mnt/ram-disk
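
As a sanity check on how these figures relate: 2,080 objects × 32 MiB = 65 GiB, and 65 GiB × 8 bits/byte ÷ 7.926 s ≈ 65.6 Gibps, which matches the reported throughput.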

73.9592 Gibps – 31585 ms transferring 292 GiB of data – downloaded 146 2 GiB objects across 1,280 (40 × 32) concurrent HTTP requests

./s3pd-linux-amd64 \
--region=us-west-2 \
--workers=40 \
--threads=32 \
--partsize=$((16*1024*1024)) \
s3://test-400gbps-s3/2GiB/ /mnt/ram-disk

258.0953 Gibps – 2014 ms transferring 65 GiB of data from a local RAM disk to a local RAM disk

./s3pd-linux-amd64 \
--workers=300 \
--threads=1 \
--partsize=$((128*1024)) \
/mnt/ram-disk/32MiB /mnt/ram-disk/234

Example CLI usage

Equivalent to: aws s3 cp --recursive s3://ml-training-dataset/pictures/ /mnt/my-nvme-local-disks
But instead of downloading objects one at a time, it will download 40 objects concurrently, each split into 32 range requests, for a much higher concurrency level than the aws s3 utility.

./s3pd-linux-amd64 \
--region=us-west-2 \
--workers=40 \
--threads=32 \
--partsize=$((16*1024*1024)) \
s3://ml-training-dataset/pictures /mnt/my-nvme-local-disks
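
For reference, the AWS CLI's transfer parallelism is governed by its max_concurrent_requests setting, which defaults to 10 and can be raised:

aws configure set default.s3.max_concurrent_requests 100

By contrast, the invocation above keeps 40 × 32 = 1,280 range requests in flight.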

If you just want to run a benchmark and avoid having to provision a large enough RAM disk, you can pass the --benchmark flag, which stores the data only temporarily in an in-memory buffer. For example:

./s3pd-linux-amd64 \
--workers=40 \
--threads=32 \
--partsize=$((16*1024*1024)) \
--benchmark \
s3://test-400gbps-s3/2GiB/
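
A minimal sketch of the kind of sink --benchmark implies, assuming downloaded bytes are streamed into a counting writer and discarded rather than written to disk (illustrative, not the tool's actual implementation):

package main

import (
	"fmt"
	"io"
	"strings"
	"sync/atomic"
	"time"
)

// countingDiscard counts bytes written and throws them away, so a
// benchmark run measures transfer throughput without touching disk.
type countingDiscard struct{ n atomic.Int64 }

func (c *countingDiscard) Write(p []byte) (int, error) {
	c.n.Add(int64(len(p)))
	return len(p), nil
}

func main() {
	sink := &countingDiscard{}
	start := time.Now()
	// Stand-in for the HTTP response bodies a real run would copy.
	io.Copy(sink, strings.NewReader(strings.Repeat("x", 32<<20)))
	secs := time.Since(start).Seconds()
	fmt.Printf("%.2f Gib/s\n", float64(sink.n.Load())*8/secs/(1<<30))
}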

If you want to copy between local filesystem locations:

./s3pd-linux-amd64 \
--workers=40 \
--threads=32 \
--partsize=$((8*1024*1024)) \
/mnt/my-nvme-disk-1/datasetA /mnt/my-nvme-disk-2/
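
A hedged sketch of how one part-wise local copy could work, assuming positional I/O via ReadAt/WriteAt so threads never contend on a shared file offset (names and error handling are illustrative):

package main

import (
	"io"
	"os"
	"sync"
)

// copyFileParallel copies src to dst in up to `threads` concurrent
// chunks of `partsize` bytes each.
func copyFileParallel(src, dst string, partsize int64, threads int) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	info, err := in.Stat()
	if err != nil {
		return err
	}
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()

	sem := make(chan struct{}, threads)
	var wg sync.WaitGroup
	for off := int64(0); off < info.Size(); off += partsize {
		wg.Add(1)
		sem <- struct{}{}
		go func(off int64) {
			defer wg.Done()
			defer func() { <-sem }()
			buf := make([]byte, min(partsize, info.Size()-off))
			n, err := in.ReadAt(buf, off)
			if err != nil && err != io.EOF {
				return // a real tool would propagate this error
			}
			out.WriteAt(buf[:n], off)
		}(off)
	}
	wg.Wait()
	return nil
}

func main() {
	if err := copyFileParallel("datasetA.bin", "copy.bin", 8<<20, 4); err != nil {
		panic(err)
	}
}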
