Scrub invalid XML characters from Property with Relevance?

We have a property that’s returning characters that are invalid for the XML returned by the following Session Relevance query:
"(ids of computers of it, names of properties of it, values of it) of (results of properties of bes analysis whose (name of it is "AnalysisName")) whose (last report time of computer of it > (now - (5 * minute)))"

The query’s not important, but the property values have invalid XML characters in them.

(Insert complaint that BigFix isn’t scrubbing these as part of the XML creation process >here<.)

Is there a way to pre-scrub the values in the Analysis? Any magic relevance to replace any non-ASCII characters with “”?

If I understand you correctly, you checked this query on the QNA Web Reports and it returned results, but when you took it to the Query API it did not return any results?

I think you should open a bug on it, the server should handle whatever XML escaping or replacements are needed.


The query works fine, but the results, returned as XML, include characters that are invalid in XML. This results in the consumer of the results, which is expecting valid XML, throwing errors.

Can you please share exactly the workflow you are doing and the thrown error you get?

Is this related to XML Unicode Root API ?

Not sure… here’s what I just submitted in Case #CS0400174

We have an analysis that gathers computer information that is then gathered via API using the following Session Relevance query:
“(ids of computers of it, names of properties of it, values of it) of (results of properties of bes analysis whose (name of it is "Planisphere")) whose (last report time of computer of it > (now - (5 * minute)))”
The Analysis properties and API query work just fine, BUT the data returned in the properties occasionally includes characters that are not valid when used in XML. These characters are not removed or replaced when returned via the API which is supposed to be returning an XML response. Solutions consuming this response are (rightly) throwing “Invalid XML” errors because BigFix is indeed returning invalid XML by including these characters.

Examples of data passed on through the API that are invalid in XML:

The “smart” single quotes used in macOS computer names:

67488730
ps_comp_name
Hunter’s MacBook Pro

Em-dashes and en-dashes used in application names:

817038
ps_apps
Microsoft SQL Server Data Tools – Database Projects – Web installer entry point|10.3.20116.0

Accented characters and other foreign language characters used in application names (the second example is a Chinese application):

4879640
ps_apps
Herramientas de corrección de Microsoft Office 2016: español|16.0.4266.1001


13306779
ps_apps
好压 - 2345|v6.3

Trademark characters used in application names (Intel seems to like those a lot):

4879640
ps_apps
Intel® Software Installer|21.110.2.1

NUL characters (which apparently display fine here, but both of those fields end with a NUL, not a space, for whatever reason):

86779226
ps_apps
ZTE USB Driver |1.0.1.5_Turkcell

The character set in valid XML is well defined. The BigFix API, when returning XML, should not return any characters not in the valid character set for XML. HOW this should be done is entirely up to HCL: simple deletion? swap with a uniform replacement (like a question mark)? swap based on the invalid character (“-” for en-dash, “--” for em-dash, “(TM)” for ™, etc.)?

In any case, BigFix is currently returning invalid characters in its XML responses and solutions attempting to consume that XML are rightly throwing errors because of it.

The workflow is an in-house ruby-based inventory system reading in this XML and stopping when it hits an invalid character. I’m searching for an example now, but (as I’ve just noted in my support case) many of my examples (whether invalid or not) are coming through okay.

Oh, that’s a different issue altogether then, I think.

I can send a simple API query to return the “™” string, and I believe it actually is valid as an XML character in the value returned -

<BESAPI xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BESAPI.xsd">
<Query Resource="&quot;™&quot;">
<Result>
<Answer type="string">™</Answer>
</Result>
<Evaluation>
<Time>0.45ms</Time>
<Plurality>Singular</Plurality>
</Evaluation>
</Query>
</BESAPI>

How it’s processed by your XML client, though, may vary depending on what kind of encoding you choose when saving the XML file or passing it along upstream.
In fact we can embed the ™ character as part of a Fixlet Title as well. When I export that in the Console, it’s saved as UTF-8 which recognizes that TM character directly

<?xml version="1.0" encoding="UTF-8"?>
<BES xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BES.xsd">
	<Fixlet>
		<Title>Test Trademark ™</Title> 

Most of the API work I do is in Python, which uses utf-8 for its strings by default anyway…are you using some language that expects ASCII, or latin-1, or something else besides UTF-8? It may require some encoding in your application
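To illustrate the encoding point, here is a minimal Python sketch. The response body is a hypothetical, trimmed-down stand-in for a BESAPI reply, and the `answers` helper is made up for this example. It suggests that parsing the raw UTF-8 bytes preserves characters such as ® and ™, while decoding the same bytes as latin-1 mangles them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down response body, as UTF-8 bytes off the wire.
raw = (
    '<BESAPI><Query><Result>'
    '<Answer type="string">Intel\u00ae \u2122</Answer>'
    '</Result></Query></BESAPI>'
).encode("utf-8")


def answers(xml_bytes: bytes) -> list:
    """Parse the bytes directly so the XML parser handles the decoding."""
    root = ET.fromstring(xml_bytes)
    return [a.text for a in root.iter("Answer")]


parsed = answers(raw)

# Decoding the same bytes with the wrong codec does not fail, but each
# multi-byte character is turned into several unrelated characters.
mangled = raw.decode("latin-1")

print(parsed)
print("\u2122" in mangled)
```

If the consumer reads the response as text before parsing, the codec it uses for that step is the place to look first.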

Honestly, I’m not sure that I’m even experiencing a problem anymore. I can’t find an example of the issue in our external consumer of the XML (and the dev is out until next week, so…)

I do have evidence of NUL characters being brought through, however, and they’re never valid XML.
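For reference, XML 1.0 defines its legal characters in the `Char` production: tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF. A minimal Python sketch of that check follows. The function names are hypothetical and the sample values are adapted from the examples earlier in this thread.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def invalid_chars(value: str) -> list:
    """Return the characters in value that XML 1.0 does not allow."""
    return [ch for ch in value if not is_xml_char(ch)]


samples = [
    "Hunter\u2019s MacBook Pro",               # smart single quote
    "Data Tools \u2013 Database Projects",     # en-dash
    "\u597d\u538b - 2345|v6.3",                # Chinese application name
    "Intel\u00ae Software Installer",          # registered sign
    "ZTE USB Driver\x00|1.0.1.5_Turkcell",     # embedded NUL
]

for s in samples:
    print(repr(s), invalid_chars(s))
```

Under this definition only the NUL example contains a disallowed character; the quotes, dashes, accented letters, and symbols are all legal XML characters.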

Hm yeah I think I can reproduce that, sending a relevance query for ‘character 0’.

In the browser I get an ‘invalid XML’ error message, and if I view the page source I get

<?xml version="1.0" encoding="UTF-8"?>
<BESAPI xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BESAPI.xsd">
	<Query Resource="character 0">
		<Result>
			<Answer type="string"></Answer>
		</Result>
		<Evaluation>
			<Time>0.30ms</Time>
			<Plurality>Singular</Plurality>
		</Evaluation>
	</Query>
</BESAPI>

I would have expected the ‘Answer’ field to include the XML-escape of the null character, like \u0000 , but reading a little more on the XML spec I don’t think the null character is even allowed. I’m not sure what should be done in a case like that.
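As a quick check of that reading of the spec, here is a small Python sketch using the standard library parser. The `parses` helper and the sample documents are made up for illustration. It suggests that neither a raw NUL byte nor a numeric character reference to it is accepted, while a reference to ™ is.

```python
import xml.etree.ElementTree as ET


def parses(xml_bytes: bytes) -> bool:
    """Return True if the standard library parser accepts the document."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


raw_nul = b'<Answer type="string">\x00</Answer>'      # literal NUL byte
ref_nul = b'<Answer type="string">&#0;</Answer>'      # escaped NUL
ref_tm = b'<Answer type="string">&#8482;</Answer>'    # escaped trademark sign

print(parses(raw_nul), parses(ref_nul), parses(ref_tm))
```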

First of all: the null character is treated as valid data, because it is read and stored in the DB, and it cannot be skipped or substituted with another character when it is read back. It is true that the XML returned by the REST API is not valid (because it contains an invalid character), but it is consistent with the data present in the DB. This behavior has already been discussed in several support tickets, where we suggested using the JSON format instead of XML. The JSON format was introduced to handle non-printable characters; see https://www.ibm.com/support/pages/apar/IV73297 for further details. So when a customer receives special characters in the XML output, they can use the JSON format, which does not have this limitation.
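To show why JSON does not share the limitation, here is a minimal Python sketch using only the standard `json` module. The `ps_apps` key is borrowed from the examples above and the value is a made-up stand-in. How the BigFix API is asked for JSON is not covered here.

```python
import json

# A property value with an embedded NUL, like the ZTE example above.
value = "ZTE USB Driver\x00|1.0.1.5_Turkcell"

# JSON has an escape for every control character, so NUL is written
# as the six-character sequence \u0000 rather than as a raw byte.
encoded = json.dumps({"ps_apps": value})

# Decoding restores the original string, NUL included.
decoded = json.loads(encoded)["ps_apps"]

print(encoded)
print(decoded == value)
```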

These characters are not removed or replaced when returned via the API which is supposed to be returning an XML response. Solutions consuming this response are (rightly) throwing “Invalid XML” errors because BigFix

This presumes that the customer has the option of selecting between XML and JSON. This also presumes, for the customer that needs XML, that invalid XML will be acceptable. It seems to me that the more sensible approach would be to strip any invalid characters from the data in the XML response (so as to return valid XML) and instruct the user that, if such characters are needed, they should try to use the JSON response option (which will include the characters stripped from the XML).
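Until something like that exists on the server side, a consumer could apply the same stripping itself before parsing. The following is a rough Python sketch of that idea, not anything BigFix provides. The `scrub` helper and the sample response are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches anything outside the XML 1.0 'Char' production.
_INVALID_XML = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def scrub(text: str, replacement: str = "") -> str:
    """Delete (or replace) characters that XML 1.0 does not allow."""
    return _INVALID_XML.sub(replacement, text)


# Hypothetical response fragment containing a raw NUL byte.
raw = b'<Answer type="string">ZTE USB Driver\x00|1.0.1.5_Turkcell</Answer>'

text = raw.decode("utf-8")   # NUL decodes fine; only the XML parser objects
clean = scrub(text)
answer = ET.fromstring(clean).text

print(repr(answer))
```

This only handles raw characters. A response that carried the NUL as a numeric character reference would need separate handling.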

So, my original question still stands: how to scrub NUL characters from data via Relevance? (These characters are in the software title and publisher data embedded in an executable, so we cannot remove them from the source. I can, however, attempt to scrub them in my Analysis instead of trying to clean the XML API response.)

Still discussing this with the support & dev teams on how we’d want to approach generating the XML when the property has characters that can’t be represented in XML…but if they do make changes, that would take time to build, test, and publish.

As a workaround for now, if you want to change the property definitions to filter out those characters in relevance, you could use something like this to limit the values to the ASCII printable range (characters 32 through 126 of the ASCII table).
In this example string, the NUL character (%00) and characters 0x01 and 0x02 are stripped off the front.

q: concatenation of characters whose ((it >= 32 and it <= 126 ) of hexadecimal integer (it as hexadecimal)) of "%00%01%02 !%22#$%25&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
A:  !"#$%25&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
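For comparison, the same printable-ASCII filter can be sketched on the consuming side in Python. This is an illustration of the technique, not part of BigFix, and the function name is hypothetical. It also shows the trade-off: characters that are valid in XML, such as smart quotes and ®, are dropped along with NUL.

```python
def printable_ascii_only(value: str) -> str:
    """Keep only characters 32 through 126 of the ASCII table."""
    return "".join(ch for ch in value if 32 <= ord(ch) <= 126)


print(repr(printable_ascii_only("\x00\x01\x02 !\"#$%&'()*+,-./0123456789")))
print(repr(printable_ascii_only("ZTE USB Driver\x00|1.0.1.5_Turkcell")))
print(repr(printable_ascii_only("Hunter\u2019s MacBook Pro")))
print(repr(printable_ascii_only("Intel\u00ae Software Installer")))
```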

Thanks, Jason! This is a good start. Some of these software publishers put other characters in there, but I can work with this.