Efficiently Listing Large AWS S3 Buckets with Python
In AWS S3, there are no folders, just a flat namespace. For effective object distribution, it's recommended to use hashes as key prefixes, which improves performance and scalability.
A major issue arises when you need to list the objects in a large bucket with tens of thousands or even millions of keys. This can take many minutes, because to S3 the bucket is just one huge flat list of keys, collected from many storage nodes.
To address this problem, I've created a Python class called `S3BucketObjects` that efficiently lists large buckets.
Async
`S3BucketObjects` uses aiobotocore for non-blocking IO, which lets the package run many S3 requests in parallel.
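To illustrate the building block, here is a minimal sketch of the general aiobotocore pattern (not the package's internals; the bucket and prefix are placeholders):

```python
import asyncio
from aiobotocore.session import get_session

# Minimal sketch of the non-blocking building block: several such
# coroutines can run concurrently on one event loop instead of each
# waiting on its HTTP responses in turn.
async def count_objects(bucket: str, prefix: str) -> int:
    session = get_session()
    async with session.create_client("s3") as client:
        paginator = client.get_paginator("list_objects_v2")
        count = 0
        async for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            count += len(page.get("Contents", []))
        return count

print(asyncio.run(count_objects("my-bucket", "my-prefix/")))
```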
Intelligent Parallelism
Although S3 doesn't have actual folders, `S3BucketObjects` simulates recursive folder traversal by using the `Delimiter` parameter of the AWS S3 `list_objects` API call. With this parameter, S3 treats the given delimiter (`/`) as a folder separator, so `list_objects` returns not every key under the requested prefix, but two lists: objects ("files") and "common prefixes", i.e. logical "subfolders".
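For instance, with `Delimiter="/"` a response has roughly this shape (the field names are the real response fields; the values are made up for illustration):

```python
# Simplified shape of a list_objects response with Delimiter="/" set.
response = {
    "Contents": [                       # objects ("files") directly under "a/"
        {"Key": "a/report.csv", "Size": 1024},
    ],
    "CommonPrefixes": [                 # logical "subfolders" under "a/"
        {"Prefix": "a/2023/"},
        {"Prefix": "a/2024/"},
    ],
}
```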
`S3BucketObjects` starts by listing objects at the specified prefix (the root directory), using the delimiter to retrieve only the immediate objects and "subfolders" instead of listing all the objects at once. It then recursively calls `list_objects` for each "subfolder", treating it as a new root prefix.
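A rough sketch of that traversal, assuming `list_objects_v2` paginators (the actual implementation in async-s3 may differ):

```python
import asyncio
from aiobotocore.session import get_session

# Sketch of the recursive traversal: list one "folder" with Delimiter="/",
# then walk all returned "subfolders" (CommonPrefixes) concurrently.
async def walk(client, bucket: str, prefix: str) -> list[dict]:
    paginator = client.get_paginator("list_objects_v2")
    objects: list[dict] = []
    subfolders: list[str] = []
    async for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        objects.extend(page.get("Contents", []))  # "files" at this level
        subfolders.extend(cp["Prefix"] for cp in page.get("CommonPrefixes", []))
    # Every "subfolder" becomes a new root prefix, walked in parallel.
    for result in await asyncio.gather(*(walk(client, bucket, sub) for sub in subfolders)):
        objects.extend(result)
    return objects

async def main() -> None:
    session = get_session()
    async with session.create_client("s3") as client:
        objects = await walk(client, "my-bucket", "")  # placeholder bucket
        print(f"{len(objects)} objects found")

asyncio.run(main())
```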
To reduce the number of API calls, `S3BucketObjects` employs two key optimizations.
Recursion Depth Limitation
You can specify a maximum recursion depth (`max_depth`) to limit how deep the package traverses into the directory structure. Once that depth is reached, `S3BucketObjects` lists everything below the prefix as a plain flat list, with no further recursion.
This can significantly reduce the number of API calls required, especially for buckets with deeply nested “subfolders”.
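Extending the sketch above, the cut-off could look like this (assumed logic; the `depth` and `max_depth` parameters here are illustrative):

```python
import asyncio

# Sketch of the depth cut-off: past max_depth the Delimiter is dropped,
# so one paginated request returns the whole remaining subtree as a flat
# list instead of recursing further.
async def walk(client, bucket: str, prefix: str, depth: int = 0, max_depth: int = 2):
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    if depth < max_depth:
        kwargs["Delimiter"] = "/"  # still splitting into "subfolders"
    paginator = client.get_paginator("list_objects_v2")
    objects, subfolders = [], []
    async for page in paginator.paginate(**kwargs):
        objects.extend(page.get("Contents", []))
        subfolders.extend(cp["Prefix"] for cp in page.get("CommonPrefixes", []))
    # Without the Delimiter, no CommonPrefixes are returned, so the
    # recursion naturally stops at max_depth.
    for result in await asyncio.gather(
        *(walk(client, bucket, sub, depth + 1, max_depth) for sub in subfolders)
    ):
        objects.extend(result)
    return objects
```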
Prefix Grouping
`S3BucketObjects` also groups "folder" prefixes intelligently to minimize the number of API calls. Instead of listing objects for each individual "folder" (which would require a separate API call per "folder"), it groups "folders" by common prefixes and makes a single API call for each group prefix.
For example, with object keys like:
folder0001/
folder0002/
folder0003/
...
folder9999/
Instead of making thousands of API calls (one for each "folder"), `S3BucketObjects` with the `max_folders=10` parameter groups them into just ten prefix groups:
folder0
folder1
...
folder9
This significantly reduces the number of API calls and speeds up listing for "folders" with a large number of "subfolders".
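One simple way to implement such grouping (an assumed sketch, not necessarily the package's algorithm) is to shorten the prefixes until the number of distinct groups fits within `max_folders`:

```python
# Sketch: collapse prefixes into at most max_folders shorter group
# prefixes by truncating them to the longest length that still keeps
# the group count within the limit.
def group_prefixes(prefixes: list[str], max_folders: int) -> list[str]:
    groups = set(prefixes)
    length = max(len(p) for p in prefixes)
    while length > 1 and len(groups) > max_folders:
        length -= 1
        groups = {p[:length] for p in prefixes}
    return sorted(groups)

folders = [f"folder{i:04d}/" for i in range(1, 10000)]
print(group_prefixes(folders, 10))
# -> ['folder0', 'folder1', ..., 'folder9']
```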
By combining asynchronous operations, recursive traversal with the `Delimiter` parameter, depth control, and intelligent prefix grouping, `S3BucketObjects` can improve listing performance tenfold.
Usage
See the documentation.
Command-Line Utility
The `async-s3` package also includes a command-line utility for convenient experiments with your S3 buckets.
```shell
as3 du s3://my-bucket/my-key -l 1 -f 20 -r 3
```
This command shows the size and number of objects under `s3://my-bucket/my-key`, limiting the recursion level to 1 (`-l 1`). If there are more than 20 folders at one level (`-f 20`), it tries to group them by prefixes. The request is repeated three times (`-r 3`), and the average time is reported.