Relevance Challenge April 2018 - Unique without sort

JasonWalker · May 23, 2018, 12:13am

@brolly33 posted an interesting relevance challenge on his DeveloperWorks blog at Legacy Communities - IBM TechXchange Community

This is a blatant repost. Because it took me quite a while to figure out a solution, I want to be able to refer back to it here later. There are at least two solutions to this. I’ll post my solution in a few days, please take a crack at it and see if we can come up with others!

Is there a scalable, efficient, reusable relevance technique that allows de-duplication of a string input while preserving the original order of the strings?

Example

q: (“g”;“bob”;“foo”;“bar”;“bob”;“a”;“g”;“zoo”;“apple”)

desired output:

A: g
A: bob
A: foo
A: bar
A: a
A: zoo
A: apple

The second instance of “g” and “bob” have been de-duped, but the original order is preserved.

Using set, or unique values will de-dupe, but will order the list. Because these do not retain the order, they are not solutions.

q: elements of set of (“g”;“bob”;“foo”;“bar”;“bob”;“a”;“g”;“zoo”;“apple”)

A: a
A: apple
A: bar
A: bob
A: foo
A: g
A: zoo

q: unique values of (“g”;“bob”;“foo”;“bar”;“bob”;“a”;“g”;“zoo”;“apple”)

A: a
A: apple
A: bar
A: bob
A: foo
A: g
A: zoo

Other Challenges:

jgo · May 23, 2018, 1:05am

Oh nice. I think I see how that would be done. Cool. Post answers here?

JasonWalker · May 29, 2018, 1:13am

I’d still like to hear other solutions on this! I think there’s probably a way to do it with a regular expression, but I’m not sure it would be supported in Relevance as the “lookahead/lookbehind” expressions don’t seem to work.

I ended up finding a couple of different solutions to this. As is often the case, the simpler solution came much later, so I’ll give them here out of order. Both illustrate what I think are lesser-used properties of the “substring” inspector that may be useful in different contexts.

Simpler solution:

Summary

q:  (substrings separated by "|" of it) whose (not exists (it, substrings separated by "|" of preceding text of it) whose (item 0 of it = item 1 of it)) of concatenation "|" of ("g";"bob";"foo";"bar";"bob";"a";"g";"zoo";"apple")
A: g
A: bob
A: foo
A: bar
A: a
A: zoo
A: apple
T: 0.386 ms
I: plural substring

This solution first joins all of the strings into one larger string -

q: concatenation "|" of ("g";"bob";"foo";"bar";"bob";"a";"g";"zoo";"apple")
A: g|bob|foo|bar|bob|a|g|zoo|apple
T: 0.062 ms
I: singular string

Then, we take advantage of the fact that a substring is not a string. In higher-level languages, taking part of a string results in a new string. For example, szNewString=Left(szOldString,3); would generate a new string, and retain no awareness of the properties of the original string. In relevance language, however, a substring has several properties, including the ability to see what came before or after the given substring.

This can be illustrated by retrieving substrings separated by "|"; we can return not just the substring, but also have awareness of what came before…

q: (it, preceding text of it) of substrings separated by "|" of concatenation "|" of ("g";"bob";"foo";"bar";"bob";"a";"g";"zoo";"apple")
A: g, 
A: bob, g|
A: foo, g|bob|
A: bar, g|bob|foo|
A: bob, g|bob|foo|bar|
A: a, g|bob|foo|bar|bob|
A: g, g|bob|foo|bar|bob|a|
A: zoo, g|bob|foo|bar|bob|a|g|
A: apple, g|bob|foo|bar|bob|a|g|zoo|

Thus, the solution -

q: (substrings separated by "|" of it) whose (not exists (it, substrings separated by "|" of preceding text of it) whose (item 0 of it = item 1 of it)) of concatenation "|" of ("g";"bob";"foo";"bar";"bob";"a";"g";"zoo";"apple")

Join the original strings into one long string by concatenation with pipe character. Split the results by the same character. Keep only the results where this string is not present earlier in the string (i.e., this is the first occurrence of this string)

The more complex solution I came up with involved another property available with a substring:

Summary

Start of <substring> gives the index of the starting point of a string within another. For instance we can get the starting positions of each word in a sentence:

q: (start of it, it) of substrings separated by " " of "The quick red fox jumps over the lazy red dog."
A: 0, The
A: 4, quick
A: 10, red
A: 14, fox
A: 18, jumps
A: 24, over
A: 29, the
A: 33, lazy
A: 38, red
A: 42, dog.
T: 0.234 ms
I: plural ( string position, substring )

The first (more complex) solution I found starts the same way - by concatenating the plural strings together, and then separating them into substrings. From there I tracked each word, along with the index at which each word starts, along with an set of unique words found (using the ‘set’ object). For each unique word in the set, I found the matching word with the earliest starting position, keeping both the word itself and its starting position. Then I used a set of integers to put the results back into order.

q: following texts of firsts ":" of items 1 of (elements of item 0 of it, elements of item 1 of it) whose (item 0 of it = preceding text of first ":" of item 1 of it as integer) of (set of (preceding texts of firsts ":" of it as integer) of elements of it, it) of set of unique values of items 1 of (/* unique strings */item 0 of it, /* indexed strings */ elements of item 1 of it,  /* set of indexed strings */ item 1 of it) whose (/* check this indexed element matches the unique string */ following text of first ":" of item 1 of it = item 0 of it and /* check this match is the minimum indexed match by comparing back tot he full set*/ preceding text of first ":" of item 1 of it as integer = minimum of (preceding texts of firsts ":" of item 0 of it as integer) of (elements of item 2 of it, item 0 of it) whose (following text of first ":" of item 0 of it = item 1 of it))  of (elements of item 0 of it, item 1 of it) of (/* set of unique values */ set of substrings separated by "|" of it, /* set of strings with indexes */ set of (start of it as string & ":" & it) of substrings separated by "|" of it) of  concatenation "|" of ("g";"bob";"foo";"bar";"bob";"a";"g";"zoo";"apple")
A: g
A: bob
A: foo
A: bar
A: a
A: zoo
A: apple
T: 3.271 ms
I: plural substring

Create a concatenated string from the original plural result (to preserve their order), determine indexes of the starting position of each substring, determine the unique values from the substring, and for each unique substring find the earliest instance of that string in the concatenation. The resulting matches can be sorted by their indexes within the concatenation.

Concatenate the original plural strings into a single longer string, preserving the original order:

q: concatenation "|" of ("g";"bob";"foo";"bar";"bob";"a";"g";"zoo";"apple")
A: g|bob|foo|bar|bob|a|g|zoo|apple
T: 0.065 ms
I: singular string

Next, build one set of the unique values of the substrings, and a second set containing all of the substrings along with their positions. These positions are used later in ordering the set of results (allowing us to build a result sorted by numerical order of substring index)

q: (/* debug output */ "-------unique:"; elements of item 0 of it ; "-----------indexed:" ; elements of item 1 of it ) of (/* set of unique values */ set of substrings separated by "|" of it, /* set of strings with indexes */ set of (start of it as string & ":" & it) of substrings separated by "|" of it) of  concatenation "|" of ("g";"bob";"foo";"bar";"bob";"a";"g";"zoo";"apple")
A: -------unique:
A: a
A: apple
A: bar
A: bob
A: foo
A: g
A: zoo
A: -----------indexed:
A: 0:g
A: 10:bar
A: 14:bob
A: 18:a
A: 20:g
A: 22:zoo
A: 26:apple
A: 2:bob
A: 6:foo
T: 0.368 ms
I: plural string

Iterate through each element of the non-indexed set, to find the matching value in the indexed set with the lowest index value. This finds the first match of each unique string.

q: (/* debug output */ item 1 of it) of  (/* unique strings */item 0 of it, /* indexed strings */ elements of item 1 of it,  /* set of indexed strings */ item 1 of it) whose (/* check this indexed element matches the unique string */ following text of first ":" of item 1 of it = item 0 of it and /* check this match is the minimum indexed match by comparing back tot he full set*/ preceding text of first ":" of item 1 of it as integer = minimum of (preceding texts of firsts ":" of item 0 of it as integer) of (elements of item 2 of it, item 0 of it) whose (following text of first ":" of item 0 of it = item 1 of it))  of (elements of item 0 of it, item 1 of it) of (/* set of unique values */ set of substrings separated by "|" of it, /* set of strings with indexes */ set of (start of it as string & ":" & it) of substrings separated by "|" of it) of  concatenation "|" of ("g";"bob";"foo";"bar";"bob";"a";"g";"zoo";"apple")
A: 18:a
A: 26:apple
A: 10:bar
A: 2:bob
A: 6:foo
A: 0:g
A: 22:zoo
T: 2.429 ms
I: plural string

Try to build a set of these matches to put them in order - the first match for each unique string. This set is ordered by the index of each matching string. But, the ordering is treating the index as a string - sorting “2” and “3” after “10”, for example.

q: elements of set of items 1 of  (/* unique strings */item 0 of it, /* indexed strings */ elements of item 1 of it,  /* set of indexed strings */ item 1 of it) whose (/* check this indexed element matches the unique string */ following text of first ":" of item 1 of it = item 0 of it and /* check this match is the minimum indexed match by comparing back tot he full set*/ preceding text of first ":" of item 1 of it as integer = minimum of (preceding texts of firsts ":" of item 0 of it as integer) of (elements of item 2 of it, item 0 of it) whose (following text of first ":" of item 0 of it = item 1 of it))  of (elements of item 0 of it, item 1 of it) of (/* set of unique values */ set of substrings separated by "|" of it, /* set of strings with indexes */ set of (start of it as string & ":" & it) of substrings separated by "|" of it) of  concatenation "|" of ("g";"bob";"foo";"bar";"bob";"a";"g";"zoo";"apple")
A: 0:g
A: 10:bar
A: 18:a
A: 22:zoo
A: 26:apple
A: 2:bob
A: 6:foo
T: 2.579 ms
I: plural string

Try again; this time, build a set of matching indexes as integers, so we can sort the indexes in numeric order.
Iterate through the set of indexes, capturing the matching indexed string element from the indexed string set.
This returns the unique strings, sorted back into the order of their original positions.

q: following texts of firsts ":" of items 1 of (elements of item 0 of it, elements of item 1 of it) whose (item 0 of it = preceding text of first ":" of item 1 of it as integer) of (set of (preceding texts of firsts ":" of it as integer) of elements of it, it) of set of unique values of items 1 of (/* unique strings */item 0 of it, /* indexed strings */ elements of item 1 of it,  /* set of indexed strings */ item 1 of it) whose (/* check this indexed element matches the unique string */ following text of first ":" of item 1 of it = item 0 of it and /* check this match is the minimum indexed match by comparing back tot he full set*/ preceding text of first ":" of item 1 of it as integer = minimum of (preceding texts of firsts ":" of item 0 of it as integer) of (elements of item 2 of it, item 0 of it) whose (following text of first ":" of item 0 of it = item 1 of it))  of (elements of item 0 of it, item 1 of it) of (/* set of unique values */ set of substrings separated by "|" of it, /* set of strings with indexes */ set of (start of it as string & ":" & it) of substrings separated by "|" of it) of  concatenation "|" of ("g";"bob";"foo";"bar";"bob";"a";"g";"zoo";"apple")
A: g
A: bob
A: foo
A: bar
A: a
A: zoo
A: apple
T: 3.368 ms
I: plural substring

Thanks to @brolly33 for posing this riddle, keep them coming!