Tip: Top 10 largest folders

A question came up today where the requester may need to do some disk cleanup, and wanted to find “the Top 10 Largest Subdirectories of a directory”.

Finding “The Largest” folder has been discussed here before, but with no native method of Sorting, much less “Indexing an Array”, this problem has mostly been out-of-reach…at least, until the fairly recent additions of ‘tuple string’ inspectors and ‘tuple string indexing’.

This is still fairly complicated logic, and in a real-world case I’d probably recommend pushing this off into a shell script instead of native Relevance, but it was a handy challenge that might be helpful.

The first step is to get each folder and the size of its descendants:

q: (sum of sizes of descendants of it, pathname of it) of folders of folders "c:\temp"
A: 46, c:\temp\actions
A: 5891233, c:\temp\autopatch_export
A: 21079034, c:\temp\BESDownloadCacher
A: 53, c:\temp\crypto
A: 170, c:\temp\dir1
A: 170, c:\temp\dir2
A: 2850, c:\temp\keys
A: 58164339, c:\temp\log4j
A: 11915072, c:\temp\logpresso-log4j2-scan-2.8.1-win64
A: 738885569, c:\temp\export
A: 9882596, c:\temp\ServerKeyTool (3)
A: 5373604, c:\temp\sql_backup
A: 6001023342, c:\temp\WinServer2016

Our next step is to make a ‘set’ from the sizes, but we’ll still need to track which size goes with which folder. So let’s save the size and pathname together in the form “folder size:folder path”. I’ll display the ‘elements’ of the set here for debugging, but in the query following this I’ll keep it packed up as a set.

q: elements of set of (item 0 of it as string & ":" & item 1 of it) of (sum of sizes of descendants of it, pathname of it) of folders of folders "c:\temp"
A: 11915072:c:\temp\logpresso-log4j2-scan-2.8.1-win64
A: 170:c:\temp\dir1
A: 170:c:\temp\dir2
A: 21079034:c:\temp\BESDownloadCacher
A: 2850:c:\temp\keys
A: 376:c:\temp\yara
A: 46:c:\temp\actions
A: 5373604:c:\temp\sql_backup
A: 53:c:\temp\crypto
A: 58164339:c:\temp\log4j
A: 5891233:c:\temp\autopatch_export
A: 6001023342:c:\temp\WinServer2016
A: 738885569:c:\temp\export
A: 9882596:c:\temp\ServerKeyTool (3)

‘Set’ automatically sorts the values, but since we’re treating the sizes as Strings here we aren’t sorted in numerical order - we’re sorted in alphabetical order, where “2” comes after “10”. But given we now have a string set, we can grab the sizes (as the string before the first “:”), cast them back to Integers, and make a new Set from those which will come out in sorted order:

q: ( elements of (set of (preceding texts of firsts ":" of elements of it as integer)) of it) of set of (item 0 of it as string & ":" & item 1 of it) of (sum of sizes of descendants of it, pathname of it) of folders of folders "c:\temp"
A: 46
A: 53
A: 170
A: 376
A: 2850
A: 5373604
A: 5891233
A: 9882596
A: 11915072
A: 21079034
A: 58164339
A: 738885569
A: 6001023342
T: 1658.701 ms
I: plural integer
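
To see why the cast back to Integer matters, here is a minimal sketch of the same ordering effect in Python (not Relevance). The variable names are mine, and the sizes are a handful copied from the sample output above.

```python
# A few of the sizes from the sample output, held as strings.
sizes_as_text = ["46", "5891233", "21079034", "53", "170", "2850", "376"]

# Sorting strings compares character by character,
# so "21079034" lands before "2850" and "53" before "5891233".
text_order = sorted(set(sizes_as_text))

# Casting to integers first gives true numeric order.
numeric_order = sorted({int(s) for s in sizes_as_text})

print(text_order)
print(numeric_order)
```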

Now, to the fairly new methods available given ‘tuple string’. One of the nice things about tuple string of <plural results> is that we get a single result that is formatted as a ‘tuple string’ - so if we take that Integer set, cast it back into Strings, and make a Tuple String of it, we’ll now have a tuple result in sorted order:

q: ( tuple string of (it as string) of elements of (set of (preceding texts of firsts ":" of elements of it as integer)) of it) of set of (item 0 of it as string & ":" & item 1 of it) of (sum of sizes of descendants of it, pathname of it) of folders of folders "c:\temp"
A: 46, 53, 170, 376, 2850, 5373604, 5891233, 9882596, 11915072, 21079034, 58164339, 738885569, 6001023342

Another nice thing about ‘tuple string’ is we can request any index like tuple string item 2 of ... and we can also find how many elements there are with number of tuple string items of ...

Given the query so far, we can find out how many elements there are (13), and the value of the last/largest one (which is ‘tuple string item 12 of it’), with:

q: ( (number of tuple string items of it, tuple string item (number of tuple string items of it - 1) of it) of  tuple string of (it as string) of elements of (set of (preceding texts of firsts ":" of elements of it as integer)) of it) of set of (item 0 of it as string & ":" & item 1 of it) of (sum of sizes of descendants of it, pathname of it) of folders of folders "c:\temp"
A: 13, 6001023342
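
As a rough mental model only, the tuple string behaves much like a comma-separated string that can be split and indexed from zero. The Python sketch below assumes that analogy; it is not how the inspectors are implemented, and the variable names are hypothetical.

```python
# The sorted sizes from the query output above.
sorted_sizes = [46, 53, 170, 376, 2850, 5373604, 5891233, 9882596,
                11915072, 21079034, 58164339, 738885569, 6001023342]

# Stand-in for 'tuple string of ...': one comma-separated string.
tuple_string = ", ".join(str(n) for n in sorted_sizes)

# Stand-in for 'tuple string items of ...': split it back into items.
items = tuple_string.split(", ")

count = len(items)          # like 'number of tuple string items of it'
largest = items[count - 1]  # like 'tuple string item (count - 1) of it'
print(count, largest)
```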

Working backwards through (integers in (12, 3)) we can get the 10 largest sizes: the tuple string items starting from the last index and working back through the nine items before it.

q: ( (tuple string items (integers in (number of tuple string items of it - 1, number of tuple string items of it - 10)) of it) of tuple string of (it as string) of elements of (set of (preceding texts of firsts ":" of elements of it as integer)) of it) of set of (item 0 of it as string & ":" & item 1 of it) of (sum of sizes of descendants of it, pathname of it) of folders of folders "c:\temp"
A: 6001023342
A: 738885569
A: 58164339
A: 21079034
A: 11915072
A: 9882596
A: 5891233
A: 5373604
A: 2850
A: 376
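
The index arithmetic is easy to get off by one, so here is a small Python sketch of the same walk from the last index down to the tenth-from-last. It is an illustration with names of my own choosing, and like the query above it assumes there are at least ten items.

```python
sorted_sizes = [46, 53, 170, 376, 2850, 5373604, 5891233, 9882596,
                11915072, 21079034, 58164339, 738885569, 6001023342]
items = [str(n) for n in sorted_sizes]
count = len(items)  # 13 in this sample

# Indexes count - 1 down to count - 10, inclusive: 12, 11, ..., 3.
# range() excludes its stop value, hence count - 11.
indexes = list(range(count - 1, count - 11, -1))

# Assumes count >= 10; with fewer items the indexes would go negative.
top_ten = [items[i] for i in indexes]
print(indexes)
print(top_ten)
```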

Now that we know the top 10 folder sizes, we need to preserve the original string set that ties each folder path back to its size. This changes some of our comparisons from ‘of it’ into ‘of item 0 of it’, since we’re now carrying forward both the Integer set and the String set. Here I’ll unwind the string set, matching up every size with every size:path combination. This gives a large number of results (10 sizes times 14 entries is 140 rows here), but that’s ok: since this is just string and set manipulation, it’s quite fast compared to the directory scans.

q: (item 0 of it, elements of item 1 of it) of  (tuple string items (integers in (number of tuple string items of it - 1, number of tuple string items of it - 10)) of tuple string of (it as string) of elements of (set of (preceding texts of firsts ":" of elements of it as integer)) of it, it) of set of (item 0 of it as string & ":" & item 1 of it) of (sum of sizes of descendants of it, pathname of it) of folders of folders "c:\temp"
A: 6001023342, 11915072:c:\temp\logpresso-log4j2-scan-2.8.1-win64
A: 6001023342, 170:c:\temp\dir1
A: 6001023342, 170:c:\temp\dir2
A: 6001023342, 21079034:c:\temp\BESDownloadCacher
A: 6001023342, 2850:c:\temp\keys
A: 6001023342, 376:c:\temp\yara
A: 6001023342, 46:c:\temp\actions
A: 6001023342, 5373604:c:\temp\sql_backup
A: 6001023342, 53:c:\temp\crypto
A: 6001023342, 58164339:c:\temp\log4j
A: 6001023342, 5891233:c:\temp\autopatch_export
A: 6001023342, 6001023342:c:\temp\WinServer2016
A: 6001023342, 738885569:c:\temp\export
A: 6001023342, 9882596:c:\temp\ServerKeyTool (3)
A: 738885569, 11915072:c:\temp\logpresso-log4j2-scan-2.8.1-win64
A: 738885569, 170:c:\temp\dir1
A: 738885569, 170:c:\temp\dir2
A: 738885569, 21079034:c:\temp\BESDownloadCacher
... MANY more results

That query tied every directory to each of the top 10 sizes - so now we need to filter to only those directories whose size is actually equal to one of the top 10 sizes…

q: (item 0 of it, elements of item 1 of it) whose (item 0 of it as string = preceding text of first ":" of item 1 of it) of  (tuple string items (integers in (number of tuple string items of it - 1, number of tuple string items of it - 10)) of tuple string of (it as string) of elements of (set of (preceding texts of firsts ":" of elements of it as integer)) of it, it) of set of (item 0 of it as string & ":" & item 1 of it) of (sum of sizes of descendants of it, pathname of it) of folders of folders "c:\temp"
A: 6001023342, 6001023342:c:\temp\WinServer2016
A: 738885569, 738885569:c:\temp\export
A: 58164339, 58164339:c:\temp\log4j
A: 21079034, 21079034:c:\temp\BESDownloadCacher
A: 11915072, 11915072:c:\temp\logpresso-log4j2-scan-2.8.1-win64
A: 9882596, 9882596:c:\temp\ServerKeyTool (3)
A: 5891233, 5891233:c:\temp\autopatch_export
A: 5373604, 5373604:c:\temp\sql_backup
A: 2850, 2850:c:\temp\keys
A: 376, 376:c:\temp\yara
T: 556.321 ms
I: plural ( string, string )
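
The pair-everything-then-filter pattern can be sketched in Python as a cross product followed by a comparison on the text before the first ":". This only models the logic; the variable names are mine and the data is a small subset of the sample above.

```python
from itertools import product

# Three of the top sizes (as strings) and four of the "size:path" entries.
top_sizes = ["6001023342", "738885569", "58164339"]
entries = [
    r"170:c:\temp\dir1",
    r"58164339:c:\temp\log4j",
    r"6001023342:c:\temp\WinServer2016",
    r"738885569:c:\temp\export",
]

# Every size paired with every entry: 3 * 4 = 12 pairs.
pairs = list(product(top_sizes, entries))

# Keep only the pairs where the size equals the text before the first ":".
# split(":", 1) splits once, so the ":" in "c:\..." is left alone.
matches = [(size, entry) for size, entry in pairs
           if size == entry.split(":", 1)[0]]

for size, entry in matches:
    print(size, entry)
```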

Now that we know the top 10 sizes and which folder goes with each of them, we can just strip the pathnames out of the items 1 (the text following the first “:”) …

q: following texts of firsts ":" of items 1 of (item 0 of it, elements of item 1 of it) whose (item 0 of it as string = preceding text of first ":" of item 1 of it) of  (tuple string items (integers in (number of tuple string items of it - 1, number of tuple string items of it - 10)) of tuple string of (it as string) of elements of (set of (preceding texts of firsts ":" of elements of it as integer)) of it, it) of set of (item 0 of it as string & ":" & item 1 of it) of (sum of sizes of descendants of it, pathname of it) of folders of folders "c:\temp"
A: c:\temp\WinServer2016
A: c:\temp\export
A: c:\temp\log4j
A: c:\temp\BESDownloadCacher
A: c:\temp\logpresso-log4j2-scan-2.8.1-win64
A: c:\temp\ServerKeyTool (3)
A: c:\temp\autopatch_export
A: c:\temp\sql_backup
A: c:\temp\keys
A: c:\temp\yara
T: 563.794 ms
I: plural substring

There, we have the 10 largest subdirectories of my C:\Temp folder!
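
For comparison, the script route I mentioned at the start might look something like this in Python. Treat it as a sketch under assumptions rather than a drop-in tool: the function names are hypothetical, unreadable files are silently skipped, and only immediate subfolders of the parent are ranked.

```python
import heapq
import os


def folder_size(path):
    """Sum the sizes of all files below path, skipping unreadable ones."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished or access was denied
    return total


def largest_subfolders(parent, count=10):
    """Return up to count (size, path) pairs, largest first."""
    sized = [(folder_size(entry.path), entry.path)
             for entry in os.scandir(parent)
             if entry.is_dir(follow_symlinks=False)]
    return heapq.nlargest(count, sized)

# Example call (path taken from this post):
#   for size, path in largest_subfolders(r"c:\temp"):
#       print(size, path)
```

Because the sizes stay numeric the whole way through, there is no need for the string set and cast-back steps, and a parent with fewer than 10 subfolders simply returns all of them.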

Hope this was helpful,

  • Jason

Just thought of another variation to do this. Once we have the ordered set of folder sizes, we don’t necessarily need to list out the last ten of them; we just need to know “the 10th largest”, and output each folder that is larger than or equal to it in size.

We should also account for the possibility that we have fewer than 10 subdirectories in total. So, instead of looping through the largest 10 folders, we find the 10th largest; if there are fewer than 10 folders, we’ll take the smallest of them. We can do that by finding the tuple string item at the index that is the greater of (0, the tenth-to-last index).

q:  items 0 of (tuple string items (maximum of (0; number of tuple string items of it - 10)) of tuple string of (it as string) of elements of (set of (preceding texts of firsts ":" of elements of it as integer)) of it, it) of set of (item 0 of it as string & ":" & item 1 of it) of (sum of sizes of descendants of it, pathname of it) of folders of folders "c:\temp"
A: 376

Now we know the tenth-largest directory is 376 bytes, we’ll take every directory that is that size or larger:

q: (item 0 of it, elements of item 1 of it) whose (preceding text of first ":" of item 1 of it as integer >= item 0 of it as integer) of (tuple string items (maximum of (0 ; number of tuple string items of it - 10)) of tuple string of (it as string) of elements of (set of (preceding texts of firsts ":" of elements of it as integer)) of it, it) of set of (item 0 of it as string & ":" & item 1 of it) of (sum of sizes of descendants of it, pathname of it) of folders of folders "c:\temp"
A: 376, 11915072:c:\temp\logpresso-log4j2-scan-2.8.1-win64
A: 376, 21079034:c:\temp\BESDownloadCacher
A: 376, 2850:c:\temp\keys
A: 376, 376:c:\temp\yara
A: 376, 5373604:c:\temp\sql_backup
A: 376, 58164339:c:\temp\log4j
A: 376, 5891233:c:\temp\autopatch_export
A: 376, 6001023342:c:\temp\WinServer2016
A: 376, 738885569:c:\temp\export
A: 376, 9882596:c:\temp\ServerKeyTool (3)

And again we strip off just the pathnames from the ‘item 1’ element:

q: following texts of firsts ":" of items 1 of (item 0 of it, elements of item 1 of it) whose (preceding text of first ":" of item 1 of it as integer >= item 0 of it as integer) of (tuple string items (maximum of (0 ; number of tuple string items of it - 10)) of tuple string of (it as string) of elements of (set of (preceding texts of firsts ":" of elements of it as integer)) of it, it) of set of (item 0 of it as string & ":" & item 1 of it) of (sum of sizes of descendants of it, pathname of it) of folders of folders "c:\temp"
A: c:\temp\logpresso-log4j2-scan-2.8.1-win64
A: c:\temp\BESDownloadCacher
A: c:\temp\keys
A: c:\temp\yara
A: c:\temp\sql_backup
A: c:\temp\log4j
A: c:\temp\autopatch_export
A: c:\temp\WinServer2016
A: c:\temp\export
A: c:\temp\ServerKeyTool (3)
T: 588.323 ms

(Side note: this query gives the 10 largest folders, but not ordered by size; they come back in the string set’s alphabetical order. The previous query returned the 10 largest folders sorted from largest to smallest.)
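
The threshold idea, including the fewer-than-10 guard, can be modeled in a few lines of Python. This is a sketch of the logic only: the function name is hypothetical, the sizes come from the sample output above, and folder names are shortened to drop the c:\temp prefix.

```python
def at_least_nth_largest(sizes_by_path, n=10):
    """Return every path whose size is >= the n-th largest distinct size."""
    distinct = sorted(set(sizes_by_path.values()))
    if not distinct:
        return []
    # Greater of 0 and the n-th-from-last index, so short lists still work.
    threshold = distinct[max(0, len(distinct) - n)]
    return [path for path, size in sizes_by_path.items() if size >= threshold]


sample = {
    "actions": 46, "autopatch_export": 5891233, "BESDownloadCacher": 21079034,
    "crypto": 53, "dir1": 170, "dir2": 170, "keys": 2850, "log4j": 58164339,
    "logpresso-log4j2-scan-2.8.1-win64": 11915072, "export": 738885569,
    "ServerKeyTool (3)": 9882596, "sql_backup": 5373604,
    "WinServer2016": 6001023342, "yara": 376,
}

# Results follow the dictionary's insertion order, not size order.
print(at_least_nth_largest(sample))
```

One thing this makes visible: because the threshold comes from distinct sizes, folders that tie at the threshold size are all returned, so the result can hold more than n entries.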
