File Encoding Inspection

JonL · August 6, 2019, 9:36pm

Is there a way to query the encoding of a file as a property of the file? I was digging through the inspector list for files, but it wasn’t obvious how to get that.

JonL · August 7, 2019, 7:43pm

Any ideas? I have a file that is being changed from one encoding to another. I need a way to inspect what encoding the file currently has.

TimRice · August 7, 2019, 8:23pm

What kind of encoding are you talking about? Video Encoding?

alinder · August 7, 2019, 8:42pm

Hmm… I can’t find a property that just returns the encoding, but there is the encoding cast: https://developer.bigfix.com/relevance/reference/encoding.html

I won’t pretend to know everything there is to know about file encoding, but I do like a relevance challenge! Using the info in the above link, we can check whether the result of reading a file with a given encoding contains a character or string we would expect:

q: line 1 of file "C:\test.txt" of encoding "UTF-8" contains "a"
A: True
T: 20.367 ms
I: singular boolean

q: line 1 of file "C:\test.txt" of encoding "UTF-16" contains "a"
A: False
T: 20.088 ms
I: singular boolean

Here’s the raw data:

q: line 1 of file "C:\test.txt" of encoding "UTF-8"
A: Encrypted and Escrowed - 744
T: 11.450 ms
I: singular file line

q: line 1 of file "C:\test.txt" of encoding "UTF-16" 
A: 䕮捲祰瑥搠慮搠䕳捲潷敤‭‷㐴ഊ䕮捲祰瑥搠慮搠敳捲潷敤Ⱐ扵琠湯⁍慳瑥爠䭥祣桡楮‭‱ഊ䕮捲祰瑥搬⁮潴⁥獣牯睥搠ⴠ㐳ഊ乯琠敮捲祰瑥搠ⴠ㄰㈍ૌ
T: 11.171 ms
I: singular file line

You might say “well, that only helps me if I know the content of the file,” and you’re right! But that got me thinking about how to do this if you didn’t know the content of the file. If it’s enough to assume it will definitely have some letter of the alphabet, this approach could work. Again, I’m not an expert in this area so I can’t speak to issues with edge cases or anything:

q: elements of intersection of (set of characters of "abcdefghijklmnopqrstuvwxyz" ; set of characters of (it as lowercase) of lines of file "C:\test.txt" of encoding "UTF-8")
A: a
A: b
A: c
A: d
A: e
A: h
A: i
A: k
A: m
A: n
A: o
A: p
A: r
A: s
A: t
A: u
A: w
A: y
T: 21.039 ms
I: plural string

q: exists elements of intersection of (set of characters of "abcdefghijklmnopqrstuvwxyz" ; set of characters of (it as lowercase) of lines of file "C:\test.txt" of encoding "UTF-8")
A: True
T: 19.621 ms
I: singular boolean

q: elements of intersection of (set of characters of "abcdefghijklmnopqrstuvwxyz" ; set of characters of (it as lowercase) of lines of file "C:\test.txt" of encoding "UTF-16")
T: 18.283 ms
I: plural string

q: exists elements of intersection of (set of characters of "abcdefghijklmnopqrstuvwxyz" ; set of characters of (it as lowercase) of lines of file "C:\test.txt" of encoding "UTF-16")
A: False
T: 16.457 ms
I: singular boolean

This builds a set of the lowercase letters of the alphabet and a set of the characters in the file (lowercased so the comparison is like-for-like), then returns whether the two sets have anything in common. As we can see in the example above, I get a true when I test for UTF-8 and a false when I test for UTF-16. Generalized, that leaves us with:

exists elements of intersection of (set of characters of "abcdefghijklmnopqrstuvwxyz" ; set of characters of (it as lowercase) of lines of file "MYFILE" of encoding "MYENCODING")

Hopefully this helps!
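
For anyone who wants to try the same idea outside of relevance, here is a minimal Python sketch of the letter-intersection check. It is not BigFix code: the function name `has_ascii_letters` is made up for illustration, and the sample bytes assume the first line of the test file above was saved as plain ASCII/UTF-8 with a CRLF ending. The byte order is given explicitly as `utf-16-be` because the garbled output above corresponds to a big-endian reading of those bytes.

```python
ASCII_LETTERS = set("abcdefghijklmnopqrstuvwxyz")

def has_ascii_letters(data: bytes, encoding: str) -> bool:
    """Decode data with a candidate encoding and report whether any a-z letter appears."""
    # errors="replace" keeps a wrong guess from raising; bad sequences become U+FFFD.
    text = data.decode(encoding, errors="replace")
    return bool(ASCII_LETTERS & set(text.lower()))

# Assumed bytes of the first line of the example file (ASCII text plus CRLF).
sample = b"Encrypted and Escrowed - 744\r\n"

print(has_ascii_letters(sample, "utf-8"))      # True
print(has_ascii_letters(sample, "utf-16-be"))  # False
```

In this sketch the check is only informative in one direction: UTF-16 text made of ASCII characters still contains the letter bytes, so decoding it as UTF-8 would also report letters.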

JonL · August 7, 2019, 8:45pm

File encoding. For example, is a file ANSI, UTF-8, UTF-16, UTF-32, UTF-EBCDIC, ASCII, etc? It would be super handy to have a native inspector.

I may have to resort to running a PowerShell script against it, dropping the results somewhere, then picking them up in an analysis.

JonL · August 7, 2019, 8:47pm

@alinder, thanks! It looks like content within the lines of the file can be inspected, but not the file itself. I can probably work with that.

JonL · August 8, 2019, 10:48am

So I tried @alinder’s suggestion. Technically it works …

q: exists elements of intersection of (set of characters of "abcdefghijklmnopqrstuvwxyz" ; set of characters of (it as lowercase) of lines of file "c:\somefile.log" of encoding "UTF-16")
A: False
T: 451771.839 ms

… the duration makes it impractical.

Likely I’ll need to go back to running a PowerShell script and then parsing the output in an analysis.

alinder · August 8, 2019, 11:05am

I was testing it on a one line simple text file. Presumably you wouldn’t need to test against your whole giant log file, just one line of it, or maybe a handful. What if you change it to this?

exists elements of intersection of (set of characters of "abcdefghijklmnopqrstuvwxyz" ; set of characters of (it as lowercase) of line 1 of file "c:\somefile.log" of encoding "UTF-16")

If line 1 of the file doesn’t reliably have alphabetic characters, maybe just check the first 5 lines of the file.

JonL · August 8, 2019, 1:36pm

I agree with you in theory. However my use case is a strange one. I have a UTF-8 file that retains that encoding when the system is functioning properly. When the app writing to it malfunctions, the encoding changes to UTF-16 Little-Endian with lines that are multiple million characters in length. I’m attempting to detect when the app is going south by finding the encoding delta.

So inspecting the content of the lines is no trouble when the app is behaving normally with ~150 character lines. The trouble comes when it misbehaves and writes lines millions of characters long; inspecting those that way would hang the agent. The debugger result in my prior post came from inspecting one of the bad files, with about 20 million characters per line.

alinder · August 8, 2019, 2:05pm

Whoa, funky. Well, what if we just keep restricting further…

exists elements of intersection of (set of characters of "abcdefghijklmnopqrstuvwxyz" ; set of characters of (it as lowercase) of firsts 5 of line 1 of file "c:\somefile.log" of encoding "UTF-16")

Adding firsts 5 in there will give us only the first 5 characters of the line.

JasonWalker · August 8, 2019, 2:53pm

I’ll need to check some references, but I think you may be able to tell when it changes by checking the first few bytes of the file to look for a BOM header, the way a text editor would.

brolly33 · August 8, 2019, 2:55pm

I wonder if…

q: set of lines of file "c:\somefile.log" of encoding "UTF-8" = set of lines of file "c:\somefile.log"
A: True

JonL · August 8, 2019, 3:23pm

Good ideas everyone!

Inspecting the first 5 of a given line reduces the evaluation time significantly. It still is expensive at 940ms.

To Jason’s idea of inspecting the first bytes, I’ve been running a PowerShell script that does exactly that from this site. It does work well and is my plan B if I can’t evaluate it properly directly.

In testing Brolly33’s approach, I found what I suspect is a bug in the encoding inspector. If I use the encoding that I know the file to have, it evaluates to true. However if I put in an incorrect value, it continues to evaluate to true.

For example, if the file is actually UTF-8, then this is correct:
q: set of lines of file "c:\somefile.log" of encoding "UTF-8" = set of lines of file "c:\somefile.log"
A: True

However this also incorrectly shows true:
q: set of lines of file "c:\somefile.log" of encoding "UTF-16" = set of lines of file "c:\somefile.log"
A: True

q: set of lines of file "c:\somefile.log" of encoding "ANSI" = set of lines of file "c:\somefile.log"
A: True

Can you duplicate this bug?

I’m using 9.5.12.68 Debugger in my test.

JasonWalker · August 8, 2019, 3:52pm

Here’s an example of what I was getting at. I have one text file saved in Notepad++'s default encoding, and another that is converted to UTF-8 BOM. Other than the BOM header they are identical. I can inspect the first few bytes of each file, though, to see the difference…

q: (name of it, lines of it, byte 0 of it, byte 1 of it, byte 2 of it, byte 3 of it, byte 4 of it) of files of folder "C:\temp2"
A: test-UTF8-BOM.txt, This is a test file, 239, 187, 191, 84, 104
A: test1.txt, This is a test file, 84, 104, 105, 115, 32

The UTF-8 BOM file starts with the bytes with integer values 239, 187, 191, then starts the string with ASCII values 84, 104, etc.; while the file with no header starts right into the ASCII values 84, 104, …

Perhaps you can check only for the first 3 bytes of the file to match your expected encodings, or to at least be above the printable range of ASCII values.
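
As a rough illustration of the same byte-prefix check outside of relevance, here is a small Python sketch that maps the leading bytes of a file to a BOM name. The function names `sniff_bom` and `sniff_file_bom` are hypothetical; the BOM constants come from Python's standard `codecs` module. A result of `None` only means no BOM was found, which says nothing about the encoding of the file's actual content.

```python
import codecs
from typing import Optional

# Longest BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]

def sniff_bom(prefix: bytes) -> Optional[str]:
    """Return the name of the BOM that prefix starts with, or None if there is none."""
    for bom, name in BOMS:
        if prefix.startswith(bom):
            return name
    return None

def sniff_file_bom(path: str) -> Optional[str]:
    """Read at most four bytes from the file, so the cost does not depend on line length."""
    with open(path, "rb") as handle:
        return sniff_bom(handle.read(4))

# Byte values taken from the two example files above, plus a UTF-16 LE prefix.
print(sniff_bom(bytes([239, 187, 191, 84, 104])))  # UTF-8
print(sniff_bom(bytes([84, 104, 105, 115, 32])))   # None
print(sniff_bom(bytes([255, 254, 84, 0])))         # UTF-16 LE
```

One known ambiguity in this sketch: a UTF-16 LE file whose first character is NUL would be reported as UTF-32 LE, since the two byte sequences are identical.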

JonL · August 9, 2019, 12:04pm

That makes sense Jason. Inspecting the first few bytes turns out to be handy as it evaluates quickly and can be done with a native inspector.

To test for UTF-16 LE in this particular instance, I can match the first two bytes of the header.

q: (byte 0 of it = 255 AND byte 1 of it = 254) of file "c:\somefile.log"
A: True
T: 1.915 ms

Thanks all for the input!