Friday, December 28, 2018

Information Science: Information is Meaningless Without Context - Now With Puppies!

Motivation


Most simple text editors will open any file and try to display it as text.  That behavior is where this post came from: a junior coworker asked why an .exe looks like garbled text when viewed in a text editor.

The answer is context.

What does "10110101101000101110101110" mean?  


Well it depends on who you ask.
Let's put those exact bits into a file called:
  • my_file.jpeg and open it with Photoshop - it's going to be interpreted as an image
  • my_file.mp3 and open it with iTunes - it's going to be interpreted as sound
  • my_file.txt and open it with Notepad - it's going to be interpreted as text

That's oversimplifying it, because many standard file types such as JPEGs and MP3s normally start with a "header" that basically says "this is an MP3," but let's ignore that for a moment.  Let's just talk about the 1s and 0s that represent the song in the MP3 file.
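For the curious, that "header" check can be sketched in a few lines of Python. This is a simplified illustration, not how Photoshop or iTunes actually do it. The function name `sniff` is made up for this post, and it only knows the JPEG start-of-image marker (FF D8 FF) and the two common ways an MP3 file begins (an "ID3" tag or an audio frame sync pattern).

```python
def sniff(data: bytes) -> str:
    """Guess a file type from its first few bytes (hypothetical helper)."""
    # JPEG files begin with the start-of-image marker FF D8, followed by FF.
    if data[:3] == b"\xff\xd8\xff":
        return "jpeg"
    # Many MP3s begin with an ID3v2 metadata tag, which starts with "ID3".
    if data[:3] == b"ID3":
        return "mp3"
    # Otherwise an MP3 starts with an audio frame: 11 sync bits all set to 1.
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"
    return "unknown"


print(sniff(b"\xff\xd8\xff\xe0rest of image"))  # jpeg
print(sniff(b"ID3\x03\x00rest of song"))        # mp3
print(sniff(b"HEY!"))                           # unknown
```

Notice that even here the answer depends on context: the program has to already know which byte patterns to look for.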

How many puppies does "11" represent - A, B, or C?

[Images: three pictures of puppies, labeled A, B, and C]

The answer again is context.
  • If "11" is read as decimal (base 10, which we all use in everyday life), then it's A: eleven puppies.
  • If "11" is read as binary (base 2), then it's B: three puppies.
  • If "11" is read as octal (base 8), then it's C: nine puppies.
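If you want to check the puppy math yourself, here is a small Python sketch. The variable names are just for illustration; the second argument to the built-in `int` is the base to read the digits in.

```python
puppies = "11"  # the exact same two characters every time

as_decimal = int(puppies, 10)  # read as base 10
as_binary = int(puppies, 2)    # read as base 2
as_octal = int(puppies, 8)     # read as base 8

print(as_decimal, as_binary, as_octal)  # 11 3 9
```

The text "11" never changes; only the base we tell Python to use (the context) does.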

For a quick read on this, see Information Science - A Simple Explanation of Base Number Systems: Binary, Decimal, Octal, Hexadecimal

So when a program like iTunes sees a string of 1s and 0s, it's going to read it as sound, and if it's just a random string of 1s and 0s, it'll be pretty noisy.  An .exe file is meant to be read by the operating system as instructions for the computer, so when you open it with a text editor, the editor tries to turn those 1s and 0s into letters, and the result looks like garbled text.
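Here is a rough Python sketch of what the text editor is doing. The bytes are illustrative only: Windows .exe files do start with the two letters "MZ", but the remaining values are arbitrary ones picked for this example, and real editors may use a different encoding than the UTF-8 assumed here.

```python
# Illustrative bytes: "MZ" (4D 5A) followed by arbitrary non-text values.
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00, 0xFF, 0xFF])

# Force the bytes to be read as UTF-8 text. Anything that is not valid
# text gets swapped for the replacement character U+FFFD.
as_text = sample.decode("utf-8", errors="replace")

print(repr(as_text))
```

The "MZ" survives because those two bytes happen to be valid letters. Everything else comes out as control characters or replacement symbols, which is the garbled text you see on screen.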

Other examples

Human Languages

To think of this yet another way, I could write "Plato" on a piece of paper and give it to someone who speaks English and they'd probably interpret it as the name of the Greek philosopher.  If I give that same piece of paper to someone who speaks Spanish, they'd probably interpret it as a flat round thing you eat food off of ("plato" is Spanish for the English word "plate").

Character encoding schemes

Similar to the decimal vs binary example, we have different standards that make the same set of bits represent different readable symbols.  Some examples you may have heard of are ASCII, ANSI, and Unicode.  Let's take this set of bits:
01001000 01000101 01011001 00100001

ASCII was designed long ago as a simple character set.  It uses 7 bits per character, for a small range of 0000000 to 1111111 (a total of 128 unique characters), and each character is usually stored in an 8-bit byte with a leading 0.  So each of the 4 chunks above represents a character... in fact, together they spell "HEY!"
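A short Python sketch of that decoding, for anyone who wants to try it. The variable names are made up for this example.

```python
bits = "01001000 01000101 01011001 00100001"

# Read each 8-bit chunk as a base 2 number...
codes = [int(chunk, 2) for chunk in bits.split()]

# ...then look each number up in the ASCII table.
text = bytes(codes).decode("ascii")

print(codes)  # [72, 69, 89, 33]
print(text)   # HEY!
```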

Unicode is intended to be the last character set we'll ever need, so it has a huge range of 1,114,112 possible code points, of which well over 100,000 are currently assigned to characters.  It's a much more complex system, covering the Latin, Greek, Cyrillic, Chinese, and Thai scripts (along with many others) and all kinds of symbols for music, math, just about anything you can think of.  Unicode text also has to be stored using an encoding such as UTF-8 or UTF-16.  Suffice to say, if you take a long string of 1s and 0s, interpreting it as ASCII and as a Unicode encoding such as UTF-16 will likely not yield the same result.
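To see the difference, here is a small Python sketch that reads the same four bytes from the ASCII example in three ways. The choice of big-endian UTF-16 (`utf-16-be`) is an assumption made for this example; it is just one of several Unicode encodings.

```python
# The same four bytes as in the ASCII example above.
data = bytes([0b01001000, 0b01000101, 0b01011001, 0b00100001])

as_ascii = data.decode("ascii")      # one character per byte
as_utf16 = data.decode("utf-16-be")  # one character per pair of bytes here
as_utf8 = data.decode("utf-8")       # UTF-8 agrees with ASCII for these bytes

print(as_ascii)                           # HEY!
print([hex(ord(ch)) for ch in as_utf16])  # ['0x4845', '0x5921']
print(as_utf8 == as_ascii)                # True
```

Read as UTF-16, the four bytes become two characters from the Chinese/Japanese/Korean ideograph ranges instead of "HEY!". UTF-8 was deliberately designed to match ASCII for these values, which is why that one comes out the same.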

To sum up...


and reiterate (to a hopefully tediously clear degree): information in a file is just like information in the real world - it can only be correctly understood if the context is correctly understood first.
