How to? Use wild cards in include list without duplicate processing

Dan Glynhampton · Aug 30, 2013

rconn said:
Unless you're running Windows x64 with a LOT of RAM (>8Gb), you're not going to be able to do this regardless. (And you'll need a lot of patience.)

As it happens I am, Windows x64 with 16Gb on this machine, but whether I have the patience is another question... :)

I had the idea a little while ago to use PDIR and then pipe the output to TPIPE, but sorting in TPIPE is disabled. If the TPIPE developers ever fix that performance issue and we get TPIPE sorting working in TCC then it might not need quite so much patience.

Dan

vefatica · Aug 30, 2013

How about the ten largest files on the drive?

rconn said:
Unless you're running Windows x64 with a LOT of RAM (>8Gb), you're not going to be able to do this regardless. (And you'll need a lot of patience.)

This took about a minute and used about 60 MB in the process. I didn't pay much attention to the sorting but these seem to be the 10 biggest files on my system drive.

Code:

v:\> pdir /s /a /a:-d /(z fpn) c:\ | sort | tail
      129304692 C:\ProgramData\Adobe\Setup\{AC76BA86-7AD7-1033-7B44-AB0000000001}\Data1.cab
      129304692 C:\Users\All Users\Adobe\Setup\{AC76BA86-7AD7-1033-7B44-AB0000000001}\Data1.cab
      159453696 C:\Windows\Installer\41bf127.msp
      161521664 C:\Windows\Installer\41bf128.msp
      172586496 C:\Windows\Installer\1f09653.msp
      181483595 C:\Windows\Microsoft.NET\Framework\v4.0.30319\SetupCache\Client\netfx_core.mzz
      192231424 C:\Windows\Installer\1f09b15.msp
      198917120 C:\Windows\Installer\1f0987c.msp
      201392128 C:\Windows\SoftwareDistribution\DataStore\DataStore.edb
    1073741824 C:\pagefile.sys

vefatica · Aug 30, 2013

And the size range /[s1M] cut that down to 8 seconds and very little memory.

Steve Fabian · Aug 30, 2013

This posting panel is below the nearly hidden "Page 1 of 2" display on Page 1. This post is physically just after the previous post on the first page. I wonder where it will end up: in its chronologically correct place, or in its physical order location?

Steve Fabian · Aug 30, 2013

This was really weird! Posting was chronologically correct, and caused the merging of 2nd page posts into the first page...

rconn · Aug 30, 2013

vefatica said:
How about the ten largest files on the drive?

This took about a minute and used about 60 MB in the process. I didn't pay much attention to the sorting but these seem to be the 10 biggest files on my system drive.

You've got a tiny drive or not very many files.

vefatica · Aug 30, 2013

rconn said:
You've got a tiny drive or not very many files.

Right, it's pretty much OS only, about 20GB. But I think that's reasonable since it must plow through 85,000 files. As the size grows I'd expect the time to go through the files to increase linearly and the time to sort to grow only a little faster than that (roughly n log n). I suspect my 2GB system could handle 400GB of files in under 20 minutes and without running out of RAM (it might be close). If the target is only the biggest files, a size range (as I said before) can cut the RAM needed to very little.

Charles Dye · Aug 30, 2013

If the goal is to find just the 10 largest (oldest, newest, etc.) files, then you only need to retain info on 10 files....
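
A minimal sketch of that idea, in Python rather than TCC and not taken from anyone's actual implementation: every file still has to be enumerated, but a min-heap capped at n entries means only n (size, path) pairs are ever held in memory. The function names `iter_files` and `largest` are made up for illustration.

```python
import heapq
import os


def iter_files(root):
    """Yield (size_in_bytes, path) for every file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                yield os.path.getsize(path), path
            except OSError:
                # Skip files that vanish or cannot be read.
                continue


def largest(entries, n=10):
    """Return the n largest (size, path) pairs, smallest first.

    At most n entries are held at any time, so memory use does not
    grow with the number of files scanned.
    """
    if n <= 0:
        return []
    heap = []  # min-heap: heap[0] is the smallest of the current top n
    for entry in entries:
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The new entry beats the smallest kept one, so swap it in.
            heapq.heapreplace(heap, entry)
    return sorted(heap)
```

Something like `largest(iter_files("C:\\"), n=10)` would then return the ten biggest files in ascending order, the same ordering as the `sort | tail` output earlier in the thread. The enumeration cost is unchanged; only the sort and the memory needed for it shrink.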

vefatica · Aug 30, 2013

Charles Dye said:
If the goal is to find just the 10 largest (oldest, newest, etc.) files, then you only need to retain info on 10 files....

(I may have missed your point, Charles) ... If you have no prior knowledge you still have to enumerate all the files. If you minimize the processing and eliminate the outputting of many (as with a size range) then the enumeration goes a little faster and the sorting goes a lot faster.

My main point was that we have tools to do this sort of thing already and, given what must be done, I doubt TCC alone could do it any faster or without running into limits imposed by sheer size and hardware.

samintz · Sep 4, 2013

Surprisingly, I'm going to take Rex's side in this argument. :woot:

You can always use a regular expression to do the either/or processing.

Vince's example:

Code:

    dir /km *ses*;*sion*

can be done using a regex:

    dir /km "::.*ses.*|.*sion.*"

-Scott

Steve Fabian · Sep 4, 2013

Scott:
You are right, RE does the trick. I just manipulated your example to allow me to test the RE processing vs. the simple include list, using the same set of files my OP was about, and to my surprise, found that REs overcome the multiple match / multiple report issue. I need to learn more about the TCC / Oniguruma implementation. In long ago times I was reasonably proficient in POSIX-style REs, which did not - IIRC - include matching alternate strings, only alternate characters. Of course, Rex could turn an include list into an RE alternate string list, possibly even eliminating some code, though I doubt it would be worth the effort. I just need to bite the bullet...
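
To illustrate why a single alternation avoids the duplicates, here is a rough Python sketch. It assumes (based only on the duplicate reports described in this thread, not on TCC's source) that an include list is handled one pattern at a time, while an RE is tested once per file name. The file names and the helpers `per_pattern` and `per_name` are hypothetical.

```python
from fnmatch import fnmatch

names = ["session.txt", "glasses.txt", "version.txt", "notes.txt"]
patterns = ["*ses*", "*sion*"]


def per_pattern(names, patterns):
    """One pass per pattern: a name matching several patterns is listed several times."""
    return [n for p in patterns for n in names if fnmatch(n, p)]


def per_name(names, patterns):
    """One pass per name: each name is listed at most once, however many patterns it matches."""
    return [n for n in names if any(fnmatch(n, p) for p in patterns)]


duplicated = per_pattern(names, patterns)
unique = per_name(names, patterns)
```

Here `session.txt` matches both wildcards, so it appears twice in `duplicated` and once in `unique`. An alternation such as `ses|sion` behaves like `per_name`: each name is tested once against the whole expression.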

ben · Sep 4, 2013

TCC's regex filename matching differs from its wildcard filename matching in that while a wildcard expression must match the entire filename, a regex may match any part of the filename. (That is, the regex does not have an implied ^ at its beginning and $ at its end. I have no idea why.)

So Scott's example can be written as
dir /km ::"ses|sion"
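
The difference between the two matching styles can be sketched with Python's `re` module, used here only as a stand-in (TCC uses Oniguruma, and the file names below are invented). `search` accepts a match anywhere in the name, while `fullmatch` behaves as if the pattern had an implied ^ and $.

```python
import re

names = ["session.log", "glasses.txt", "ses", "notes.txt"]

short = re.compile(r"ses|sion", re.IGNORECASE)
padded = re.compile(r".*ses.*|.*sion.*", re.IGNORECASE)

# Unanchored: the pattern may match anywhere inside the name.
partial = [n for n in names if short.search(n)]

# Anchored at both ends: the pattern must account for the whole name.
whole = [n for n in names if short.fullmatch(n)]

# Padding with .* on both sides makes a whole-name match equivalent
# to the unanchored search with the short pattern.
padded_whole = [n for n in names if padded.fullmatch(n)]
```

With unanchored matching, the short form `ses|sion` selects the same names as the padded form `.*ses.*|.*sion.*`, which is why the `.*` wrappers in Scott's version are not needed.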

Steve Fabian · Sep 4, 2013

Much easier! Also interesting to note: the leading quotation mark (") can be either before or after the two colons :: which signal that an RE is used. As for why REs match partial file names, it is the nature of REs - they are floating matches except when anchored to a specific part of the objects searched, e.g., to the beginning or end of the file name.

ben · Sep 5, 2013

It is not the nature of regular expressions that they specify partial matches: the regular expression "x" matches "x", not "zxz". (Regular expressions may be composed of other regular expressions, which plainly do not match partially.) But there are many tools that employ regular expressions to match partially: these tools repeatedly attempt to make whole matches within the text.

Since we are accustomed to wildcard expressions matching whole filenames, it seems natural to me for regular expressions to be used in a similar way when matching filenames.

Anyway, I am wholly (not partially) aware that this isn't going to change. My observation was not a request for change.

Steve Fabian · Sep 5, 2013

ben said:
It is not the nature of regular expressions that they specify partial matches: the regular expression "x" matches "x", not "zxz". (Regular expressions may be composed of other regular expressions, which plainly do not match partially.) But there are many tools that employ regular expressions to match partially: these tools repeatedly attempt to make whole matches within the text....

Sorry Ben, that is not my understanding, based on how every search utility I ever used in the last half century works. The pattern to be matched (the RE) is just a substring of the objects (words, syllables, text lines, file names) to be found, unless the RE includes beginning-of-object and / or end-of-object anchors. OTOH the whole RE (in one of its manifestations) must be present in the searched object for there to be a match. What is reported as a matching object may be a syllable, a word, a substring, a line, a paragraph, or - in this thread - a file specification.

ben · Sep 5, 2013

That is to confuse regular expressions with the tools that use them. Each tool uses them in a way appropriate to the context of its use.

I said:
Regular expressions may be composed of other regular expressions, which plainly do not match partially.

If the component regular expressions matched in the way you suggest, the containing regular expression as a whole would not match what it does match.

I didn't mean to raise the question of what a regular expression is, and this might not be the place to discuss it.

Steve Fabian · Sep 5, 2013

We are, in fact, discussing the terminology of regular expression matching. If we do not want users, esp. those new to REs, to have to experiment a lot in order to formulate the RE search string they desire, our terminology describing the match rules must be unambiguous, i.e., very precise. Fortunately, the RE help page is such, even if it contains many concepts and many rules, and probably does require several readings for proficiency.
I used this command: *dir/b/ogen/a-d/p/ne ::"post|path"
to find the specifications (filenames) of certain files. This was the response (listing all desired files and only them):
ADDPATH.BTM
OLDPATH.BTM
POSTPATH.BTM
POSTVAR.BTM
SHOWPATHEXT.BTM

As you can see, in every match some element of the total RE was a substring of the matching file specification (as intended). IMHO discussing matching the whole RE is meaningless; it is contrary to the definition of RE.

vefatica · Sep 5, 2013

Steve Fabian said:
IMHO discussing matching the whole RE is meaningless; it is contrary to the definition of RE.

That's just semantics. IMHO, "post|path" is a regular expression. And "addpath.btm" contains a match for it.

Steve Fabian · Sep 6, 2013

Vince, I agree, "post|path" is an RE, and each file in my list matches it. It's Ben's claim that they do not match "the whole RE"... I think we have gone much too far in this subthread...

ben · Sep 7, 2013

I claimed nothing of the sort.

I doubt that anyone else is inspired by this discussion, so I've taken it elsewhere. If I'm mistaken, please let me know.
