
Obtain Folder size in Azure Data Lake Gen2 using Java

There are examples online for computing a folder's size in C#, but I could not find one for Java.

  1. Is there an easy way to get the size of a folder in Gen2?
  2. If not, how can it be computed?

There are several examples online for (2) in C# and PowerShell. Is there a way to do it in Java?


Answer

As far as I am aware, there is no API that directly provides the folder size in Azure Data Lake Gen2.

To do it recursively:

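import com.azure.storage.common.StorageSharedKeyCredential;
import com.azure.storage.file.datalake.DataLakeDirectoryClient;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import com.azure.storage.file.datalake.DataLakeServiceClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;
import com.azure.storage.file.datalake.models.PathItem;

// Build the service client once and reuse it; storageAccountName, secret,
// endpoint and containerName are supplied by the caller.
DataLakeServiceClient dataLakeServiceClient = new DataLakeServiceClientBuilder()
        .credential(new StorageSharedKeyCredential(storageAccountName, secret))
        .endpoint(endpoint)
        .buildClient();
DataLakeFileSystemClient container = dataLakeServiceClient.getFileSystemClient(containerName);


/**
 * Returns the total size, in bytes, of all files under the given folder.
 *
 * @param folder path of the folder, relative to the container root
 * @return sum of the content lengths of every file below the folder
 */
public long getSize(String folder) {
    DataLakeDirectoryClient directoryClient = container.getDirectoryClient(folder);
    if (directoryClient.exists()) {
        // recursive = true, userPrincipleNameReturned = false,
        // maxResults = null (service default), timeout = null (no timeout)
        return directoryClient.listPaths(true, false, null, null)
                .stream()
                .filter(x -> !x.isDirectory())          // count files only
                .mapToLong(PathItem::getContentLength)
                .sum();
    }
    throw new IllegalArgumentException("Not a valid folder: " + folder);
}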

This lists every path under the folder recursively and sums the content lengths of the files, skipping the directory entries themselves.
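If you want to check the filter-and-sum logic without an Azure account, the following is a minimal sketch that runs the same stream pipeline over an in-memory list. `FolderSizeSketch` and `Entry` are hypothetical names made up for this illustration; `Entry` is only a stand-in for the SDK's `PathItem` and carries just the fields the sum needs.

```java
import java.util.List;

// Hypothetical sketch, not part of the Azure SDK.
final class FolderSizeSketch {

    /** Stand-in for PathItem: a name, a directory flag and a length in bytes. */
    static final class Entry {
        final String name;
        final boolean directory;
        final long contentLength;

        Entry(String name, boolean directory, long contentLength) {
            this.name = name;
            this.directory = directory;
            this.contentLength = contentLength;
        }
    }

    /** Sums the lengths of all non-directory entries, mirroring getSize. */
    static long totalBytes(List<Entry> entries) {
        return entries.stream()
                .filter(e -> !e.directory)          // skip folder entries
                .mapToLong(e -> e.contentLength)    // bytes per file
                .sum();
    }
}
```

A recursive listing returns folders and files in one flat sequence, which is why a single pass over the results is enough and no explicit recursion is needed in the calling code.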

The default page size is 5,000 records, so a folder containing 12,000 records (folders and files combined) needs 3 API calls to list. From the docs for listPaths:

recursive – Specifies if the call should recursively include all paths.

userPrincipleNameReturned – If “true”, the user identity values returned in the x-ms-owner, x-ms-group, and x-ms-acl response headers will be transformed from Azure Active Directory Object IDs to User Principal Names. If “false”, the values will be returned as Azure Active Directory Object IDs. The default value is false. Note that group and application Object IDs are not translated because they do not have unique friendly names.

maxResults – Specifies the maximum number of blobs to return per page, including all BlobPrefix elements. If the request does not specify maxResults or specifies a value greater than 5,000, the server will return up to 5,000 items per page. If iterating by page, the page size passed to byPage methods such as PagedIterable.iterableByPage(int) will be preferred over this value.

timeout – An optional timeout value beyond which a RuntimeException will be raised.
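The page arithmetic mentioned earlier (12,000 records at 5,000 per page gives 3 calls) can be sketched as a small helper. `PageMath` and `pagesNeeded` are hypothetical names introduced here, not SDK APIs, and the result should be read as an estimate: it assumes every page except the last is full, which the service may not guarantee.

```java
// Hypothetical helper, not part of the Azure SDK.
final class PageMath {

    /** Per-page cap taken from the maxResults description quoted above. */
    static final int SERVICE_MAX_PAGE_SIZE = 5000;

    /**
     * Estimates how many list calls are needed for totalPaths entries,
     * assuming every page except the last one is full.
     */
    static long pagesNeeded(long totalPaths, int requestedPageSize) {
        if (totalPaths < 0) {
            throw new IllegalArgumentException("totalPaths must be >= 0");
        }
        // An unspecified or oversized request falls back to the service cap.
        int effective = (requestedPageSize <= 0 || requestedPageSize > SERVICE_MAX_PAGE_SIZE)
                ? SERVICE_MAX_PAGE_SIZE
                : requestedPageSize;
        // Ceiling division without floating point.
        return (totalPaths + effective - 1) / effective;
    }
}
```

For example, `PageMath.pagesNeeded(12000, 5000)` evaluates to 3, and asking for a page size of 10,000 gives the same answer because the request is capped at 5,000.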
