Introduction
“How to transfer a bulk of small files from AWS S3 in parallel using Python?” After not finding a reliable answer on Stack Overflow, I went to the Boto3 documentation and started developing code for this purpose. I published that code as a Python package called bulkboto3, and I’m writing this post to explain how to use it. You can find more details, as well as the source code, on my GitHub and on the PyPI Package Index:
https://github.com/iamirmasoud/bulkboto3
https://pypi.org/project/bulkboto3/
Boto3 is the official Python SDK for accessing and managing all AWS resources, such as Amazon Simple Storage Service (S3). It works fine for transferring a small number of files. However, transferring a large number of small files hurts performance: although each file only takes a few milliseconds to transfer, moving hundreds of thousands, or even millions, of files sequentially can take hours. Moreover, because Amazon S3 does not have real folders/directories, manually managing the hierarchy of directories and files can be tedious, especially when many files are spread across different folders.
The bulkboto3 package solves these issues. It speeds up the transfer of many small files to and from Amazon S3 by executing multiple download/upload operations in parallel, leveraging the Python multiprocessing module. Depending on the number of cores of your machine, bulkboto3 can make S3 transfers up to 100X faster than the sequential mode of traditional Boto3! Furthermore, bulkboto3 can keep the original folder structure of files and directories when transferring them. It also offers a few other convenience features, which are shown in the examples below.
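To give an intuition for where the speedup comes from, here is a minimal, illustrative sketch of the idea: many small transfers are I/O-bound, so running them on a thread pool instead of one by one hides most of the per-file latency. This is not the internal implementation of bulkboto3; the bucket name, local directory, and thread count below are placeholders.

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

# Assumes AWS credentials are already configured (e.g., via environment variables).
s3 = boto3.resource("s3")
bucket = s3.Bucket("my-example-bucket")  # placeholder bucket name

files = [p for p in Path("local_dir").rglob("*") if p.is_file()]

def upload_one(path: Path) -> None:
    # Use the relative path as the object key to preserve the folder structure.
    bucket.upload_file(str(path), str(path.relative_to("local_dir")))

# A pool of worker threads uploads many files concurrently instead of sequentially.
with ThreadPoolExecutor(max_workers=50) as pool:
    list(pool.map(upload_one, files))  # consuming the iterator surfaces any errors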
Note: You can deploy a free S3 server using MinIO on your local machine by following the steps explained in: Deploy Standalone MinIO using Docker Compose on Linux.
Use the package manager pip to install bulkboto3:
pip install bulkboto3
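If you want to double-check the installation, pip can print the installed version and metadata of the package:

pip show bulkboto3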
You can find the following scripts in examples.py.
Instantiate a BulkBoto3 object with your credentials:
from bulkboto3 import BulkBoto3
TARGET_BUCKET = "test-bucket"
NUM_TRANSFER_THREADS = 50
TRANSFER_VERBOSITY = True
bulkboto_agent = BulkBoto3(
    resource_type="s3",
    endpoint_url="<Your storage endpoint>",
    aws_access_key_id="<Your access key>",
    aws_secret_access_key="<Your secret key>",
    max_pool_connections=300,
    verbose=TRANSFER_VERBOSITY,
)
bulkboto_agent.create_new_bucket(bucket_name=TARGET_BUCKET)
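For instance, if you are testing against a local MinIO server (as in the note above), the instantiation could look like the following. The endpoint and credentials here are MinIO's common local defaults, not values from the original example, so replace them with your own setup:

# Hypothetical configuration for a local MinIO server running with default settings.
bulkboto_agent = BulkBoto3(
    resource_type="s3",
    endpoint_url="http://localhost:9000",  # default MinIO API port
    aws_access_key_id="minioadmin",  # default MinIO credentials; change them in production
    aws_secret_access_key="minioadmin",
    max_pool_connections=300,
    verbose=True,
)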
Suppose that there is a directory with the following structure on your local machine:
test_dir
├── first_subdir
│   ├── f1
│   ├── f2
│   └── f3
└── second_subdir
    └── f4
To upload the directory (with its subdirectories) to the bucket under a new directory name called my_storage_dir:
bulkboto_agent.upload_dir_to_storage(
    bucket_name=TARGET_BUCKET,
    local_dir="test_dir",
    storage_dir="my_storage_dir",
    n_threads=NUM_TRANSFER_THREADS,
)
# output:
# 2022-03-26 18:12:40 — INFO — Start uploading from local 'test_dir' to 'my_storage_dir' on the object storage with 50 threads.
# 100%|██████████| 4/4 [00:00<00:00, 4.00s/it]
# 2022-03-26 18:12:41 — INFO — Successfully uploaded 4 files to bucket 'test-bucket' in 0.07 seconds.
To download the directory (with its structure) from the bucket into a local directory named new_test_dir:
bulkboto_agent.download_dir_from_storage(
    bucket_name=TARGET_BUCKET,
    storage_dir="my_storage_dir",
    local_dir="new_test_dir",
    n_threads=NUM_TRANSFER_THREADS,
)
# output:
# 2022-03-26 18:14:08 — INFO — Start downloading from 'my_storage_dir' on storage to local 'new_test_dir' with 50 threads.
# 100%|██████████| 4/4 [00:00<00:00, 4.00it/s]
# 2022-03-26 18:14:09 — INFO — Successfully downloaded 4 files from bucket: 'test-bucket' in 0.04 seconds.
The downloaded directory will have the following structure on your local machine:
new_test_dir
└── my_storage_dir
    ├── first_subdir
    │   ├── f1
    │   ├── f2
    │   └── f3
    └── second_subdir
        └── f4
You can set local_dir='' (the default value) to avoid creating the new_test_dir directory.
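For example, with the same parameters as above, a call like the following would recreate my_storage_dir directly in the current working directory (shown here only to illustrate the default):

bulkboto_agent.download_dir_from_storage(
    bucket_name=TARGET_BUCKET,
    storage_dir="my_storage_dir",
    local_dir="",  # default value: no extra parent directory is created
    n_threads=NUM_TRANSFER_THREADS,
)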
To transfer a list of arbitrary files to or from a bucket, instantiate the StorageTransferPath class to specify the storage (S3) and local path of each file, and then use the .upload() and .download() methods. Here is an example:
from bulkboto3 import StorageTransferPath

# upload arbitrary files from local to an S3 bucket
upload_paths = [
    StorageTransferPath(
        local_path="test_dir/first_subdir/f2",
        storage_path="f2",
    ),
    StorageTransferPath(
        local_path="test_dir/second_subdir/f4",
        storage_path="my_storage_dir/f4",
    ),
]
bulkboto_agent.upload(bucket_name=TARGET_BUCKET, upload_paths=upload_paths)
# output:
# 100%|██████████| 2/2 [00:00<00:00, 2.44it/s]
# 2022-04-05 13:40:10 — INFO — Successfully uploaded 2 files to bucket: 'test-bucket'.
# download arbitrary files from an S3 bucket to local
download_paths = [
    StorageTransferPath(
        storage_path="f2",
        local_path="f2",
    ),
    StorageTransferPath(
        storage_path="my_storage_dir/f4",
        local_path="f5",
    ),
]
bulkboto_agent.download(bucket_name=TARGET_BUCKET, download_paths=download_paths)
# output:
# 100%|██████████| 2/2 [00:00<00:00, 2.44it/s]
# 2022-04-05 13:34:10 — INFO — Successfully downloaded 2 files from bucket: 'test-bucket'.
To delete all objects in the bucket (empty it):
bulkboto_agent.empty_bucket(TARGET_BUCKET)
# output:
# 2022-03-26 22:23:23 — INFO — Successfully deleted objects on: 'test-bucket'.
To check whether an object exists in the bucket:
print(
    bulkboto_agent.check_object_exists(
        bucket_name=TARGET_BUCKET, object_path="my_storage_dir/first_subdir/test_file.txt"
    )
)
# output: False
print(
    bulkboto_agent.check_object_exists(
        bucket_name=TARGET_BUCKET, object_path="my_storage_dir/first_subdir/f1"
    )
)
# output: True
To get the list of objects in a directory of the bucket:
print(
    bulkboto_agent.list_objects(
        bucket_name=TARGET_BUCKET, storage_dir="my_storage_dir"
    )
)
# output:
# ['my_storage_dir/first_subdir/f1', 'my_storage_dir/first_subdir/f2', 'my_storage_dir/first_subdir/f3', 'my_storage_dir/second_subdir/f4']
print(
    bulkboto_agent.list_objects(
        bucket_name=TARGET_BUCKET, storage_dir="my_storage_dir/second_subdir"
    )
)
# output:
# ['my_storage_dir/second_subdir/f4']
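As one way to combine the calls shown above (this is not from the original examples), you could feed the output of list_objects into StorageTransferPath entries, for instance to download every object under a prefix to its base file name in the current directory:

# Hypothetical combination of list_objects() and download().
objects = bulkboto_agent.list_objects(
    bucket_name=TARGET_BUCKET, storage_dir="my_storage_dir/second_subdir"
)
download_paths = [
    StorageTransferPath(storage_path=obj, local_path=obj.split("/")[-1])
    for obj in objects
]
bulkboto_agent.download(bucket_name=TARGET_BUCKET, download_paths=download_paths)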
As a benchmark, I uploaded 88,800 small files (about 7 GB in total) with 100 threads in 505 seconds, which is about 72X faster than the non-parallel mode.
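If you want to time a similar run on your own data, a minimal sketch could look like the following; the local directory and thread count here are placeholders, not my original benchmark setup:

import time

start = time.perf_counter()
bulkboto_agent.upload_dir_to_storage(
    bucket_name=TARGET_BUCKET,
    local_dir="path/to/many_small_files",  # hypothetical directory of small files
    storage_dir="benchmark_dir",
    n_threads=100,
)
print(f"Uploaded in {time.perf_counter() - start:.1f} seconds")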
Any contributions you make are greatly appreciated. If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag “enhancement”. To contribute to bulkboto3, follow these steps:
1. Fork the repository.
2. Create your feature branch (git checkout -b feature/AmazingFeature).
3. Commit your changes (git commit -m 'Add some AmazingFeature').
4. Push to the branch (git push origin feature/AmazingFeature).
5. Open a pull request.
Alternatively, see the GitHub documentation on creating a pull request.
In this post, I introduced the bulkboto3 Python package, which helps you transfer bulk files to and from an S3 bucket quickly and in parallel. Stay tuned for the next posts!