Coding for Dummies
Not being an expert in this field, I nevertheless read a lot of specialized literature to get acquainted with the subject and, fighting my way through the thorns toward the stars, collected plenty of bumps and bruises in the early stages. For all the abundance of information, I could not find simple articles on coding as such, outside the scope of specialist literature (so to speak, without formulas and with pictures).
The first part of this article is a primer on coding as such, with examples of manipulating bit codes; in the second part I would like to touch on the simplest ways of encoding images.
Since I am addressing newcomers to the subject, I see nothing shameful in referring to Wikipedia, which defines the coding of information as the process of converting a signal from a form convenient for the direct use of the information into a form convenient for transmission, storage or automatic processing.
What I lacked at school in the 70s and 80s, if not in computer science then, say, in mathematics lessons, was basic information on coding. The fact is that every one of us encodes information every second, constantly, without ever concentrating on the coding itself. That is, in everyday life we do it all the time. So how does this happen?
Facial expressions, gestures, speech, signals of various levels: a sign with an inscription, a road sign, a traffic light, and, in the modern world, barcodes, URLs, hashtags.
Let’s take a look at some of these in more detail.
1.1 Speech, facial expressions, gestures
Surprisingly, all of these are codes. With their help we transmit information about our actions, sensations and emotions. The most important thing is that the codes are understood by everyone. For example, someone born in the dense forests near the Amazon who has never seen a modern city dweller may run into the problem of not understanding the code: a smile, as a display of teeth, will be perceived as a threat rather than an expression of joy.
By this definition, what happens when we speak? A thought, a form convenient for direct use, is converted into speech, a form convenient for transmission. And note that since sound is limited in both speed and transmission distance, a gesture can, in some situations, be chosen to transmit the same information over a greater distance.
But we will still be limited by the range of our eyesight, and so people began to invent other ways of transmitting and transforming information, such as fire or smoke.
1.2 Alternating signals
Coding with alternating signals has been used by mankind, in primitive form, for a very long time. In the previous section we talked about smoke and fire. If an obstacle is alternately placed and removed between the observer and the fire, the observer sees alternating on/off signals. By varying the timing of these switches, we can build a sequence of codes that the receiving side will interpret unambiguously.
Along with signal flags on sea and river vessels, with the advent of radio came Morse code. And for all its apparent binarity (the code is represented by two values, dot and dash), it is in fact a ternary code, since a pause is required to separate the individual code symbols. That is, besides "dot-dash", which gives us the letter "A", Morse code can also sound like "dot-pause-dash", and that is already the two letters "ET".
When we use a computer, we understand that information comes in different kinds: sound, video, text. But what are the main differences? Before encoding information, for example in order to transmit it over communication channels, you need to understand what the information is in each specific case, that is, pay attention to its content. Sound is a sequence of discrete samples of an audio signal, video is a sequence of image frames, text is a sequence of characters. If we ignore this context and, say, use Morse code to transmit all three kinds, the method may be acceptable for text, but for sound and video the time spent transmitting even one second of information could turn out to be far too long: an hour, or even a couple of weeks.
2. Text encoding
Let’s move from the general description of coding to the practical part. As a convention, we will assume that we are encoding data for a personal computer, where the units of information are bits and bytes. A bit is like an atom of information, and a byte is a conventional block of 8 bits.
Text in a computer is drawn from a set of 256 characters; one byte is allocated per character, so the values 0 through 255 can serve as codes, with 255 written as 11111111. Such a binary number is read from right to left, so one is written as 00000001.
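This single-byte representation is easy to inspect in Python (a minimal illustration, assuming each character fits in one byte):

```python
# Each character occupies one byte: a code from 0 to 255, i.e. 8 bits.
for ch in "A1 ":
    code = ord(ch)                 # numeric code of the character
    bits = format(code, "08b")     # the same code as 8 bits, least significant bit on the right
    print(repr(ch), code, bits)    # e.g. 'A' 65 01000001
```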
So, the alphabet gives us 26 uppercase and 26 lowercase letters, plus 10 digits. There are also punctuation marks and other symbols, but for our experiments we will use only uppercase letters and the space.
Our test phrase is the Russian tongue-twister about the Greek and the crab, which we will write in transliteration: "EHAL GRECA CHEREZ RECU VIDIT GRECA V RECHKE RAC SUNUL GRECA RUCU V RECU RAC ZA RUCU GRECU DAC" (roughly: a Greek rode across the river, sees a Greek, in the river there is a crab, he stuck his hand into the river, and the crab snapped at the Greek's hand). All character counts below refer to the Cyrillic original.
2.1 Block coding
Information in a PC is already represented as blocks of 8 bits, but knowing the context, we will try to represent it with smaller blocks. To do this we gather statistics on the symbols that occur and, for later use, immediately count how often each symbol is used:
We used 19 distinct characters in total (including the space). 18 + 12 + 11 + 11 + 9 + 8 + 4 + 3 + 2 + 2 + 2 + 2 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 91 bytes (91 × 8 = 728 bits).
We take these values as given and try to change the way the blocks are stored. We notice that of the 256 available codes we use only 19. To find out how many bits are needed to store 19 values, we compute log2(19) ≈ 4.25; since we cannot use a fractional number of bits, we round up to 5, which gives us a maximum of 32 distinct values (4 bits would give only 16 values, not enough to encode the whole string).
It is easy to calculate that we get 91 × 5 = 455 bits; that is, by knowing the context and changing the storage method, we reduced memory use by 37.5%.
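The arithmetic above can be sketched in a few lines of Python (19 and 91 are the figures collected for our test phrase):

```python
import math

alphabet_size = 19    # distinct symbols in the test phrase, space included
message_length = 91   # total symbols in the phrase

bits_per_symbol = math.ceil(math.log2(alphabet_size))   # log2(19) ~= 4.25 -> round up to 5
block_bits = message_length * bits_per_symbol           # size with 5-bit blocks
saving = 1 - block_bits / (message_length * 8)          # versus one byte per symbol

print(bits_per_symbol, block_bits, f"{saving:.1%}")     # 5 455 37.5%
```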
Unfortunately, such a representation cannot be decoded unambiguously unless we also store the mapping between symbols and their new codes, and adding this dictionary increases the data size. Coding methods of this kind therefore pay off only on large amounts of data. In addition, to store 19 values we used enough bits for 32, which reduces the coding efficiency.
2.2 Variable length codes
Let’s take the same string and table and try to encode the data differently. We drop the fixed-size blocks and represent the data according to frequency of use: the more often a symbol occurs, the fewer bits we spend on it. This gives us a second table:
| Symbol | Count | Variable-length code, bits |
To calculate the length of the encoded message, we sum, over all symbols, the number of occurrences multiplied by the code length in bits, which gives 179 bits.
But this method, although it yields decent memory savings, will not work, because the result cannot be decoded. In such a situation we cannot determine what the code "111" means: it could be "PPP", "PA", "AP" or "X".
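A small Python sketch makes this ambiguity visible. The codes P → 1, A → 11, X → 111 are the assignments implied by the example; a brute-force search finds every way to split the bit string into codewords:

```python
# Non-prefix variable-length codes implied by the example above.
codes = {"P": "1", "A": "11", "X": "111"}

def parses(bits):
    """Return every way to split `bits` into codewords; more than one means ambiguity."""
    if not bits:
        return [""]
    results = []
    for symbol, code in codes.items():
        if bits.startswith(code):
            results += [symbol + rest for rest in parses(bits[len(code):])]
    return results

print(sorted(parses("111")))  # -> ['AP', 'PA', 'PPP', 'X']
```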
2.3 Prefix block codes
To solve the problem from the previous example, we need prefix codes: codes that can be decoded unambiguously as they are read, because no codeword is the prefix of another. Remember that earlier, when we talked about Morse code, the pause acted as such a separator. So now we need to introduce some marker that determines the beginning and/or end of a specific code value.
Let’s create a third table for the same string:
| Symbol | Count | Prefix code with variable blocks, bits |
The peculiarity of the new codes is that the first bit indicates the size of the block that follows: 0 means a three-bit block, 1 a four-bit block. It is easy to calculate that this approach encodes our string in 379 bits. Previously, with block coding, we got 455 bits.
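A minimal Python sketch of decoding such a stream; the three codeword assignments in it are hypothetical illustrations of the scheme, not the actual table from the article:

```python
def decode_prefix_blocks(bits, table):
    """Decode a stream where each codeword starts with a 1-bit length flag:
    '0' means a 3-bit block follows, '1' means a 4-bit block follows."""
    out, i = [], 0
    while i < len(bits):
        size = 3 if bits[i] == "0" else 4
        codeword = bits[i : i + 1 + size]   # the flag plus its block
        out.append(table[codeword])
        i += 1 + size
    return "".join(out)

# Hypothetical assignments: frequent symbols get 3-bit blocks, rare ones 4-bit blocks.
table = {"0000": "E", "0001": "R", "10000": "Z"}
print(decode_prefix_blocks("0000100000001", table))  # -> "EZR"
```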
We can develop this approach and extend the prefix to 2 bits, which lets us create 4 groups of blocks:
| Symbol | Count | Prefix code with variable blocks, bits |
Here 00 denotes a 1-bit block, 01 a 2-bit block, and 10 and 11 3-bit blocks. Counting the size of the string gives 356 bits.
As a result, across three variations of one method, we steadily reduce the string size: from 455 to 379, and then to 356 bits.
2.4 Huffman code
This is one of the most popular ways of building prefix codes. When certain conditions are met, it yields an optimal code for any data, while still allowing freedom in how the codes are constructed. The code guarantees a unique sequence of bit values for each character, assigning short codes to frequent characters.
Counting the result as before, we get 328 bits.
Note that although codes of 6 and 7 bits have appeared, there are too few of them to affect the outcome.
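For the curious, one common way to build Huffman codes is a priority queue: repeatedly merge the two least frequent nodes. A compact sketch with Python's heapq (the demo string "ABRACADABRA" is just an illustration, not our test phrase; tie-breaking details vary between implementations):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman prefix code by merging the two least frequent nodes."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Each heap entry: (frequency, unique tie-breaker, {symbol: code-so-far})
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}   # left subtree gets a leading 0
        merged.update({s: "1" + c for s, c in c2.items()})  # right subtree a leading 1
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("ABRACADABRA")
print(sum(len(codes[ch]) for ch in "ABRACADABRA"))  # -> 23 encoded bits
```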
2.5 Strings and Substrings
So far we have encoded the data by treating it as a collection of individual characters. Now let's try encoding whole words.
Let me remind you of our string: "EHAL GRECA CHEREZ RECU VIDIT GRECA V RECHKE RAC SUNUL GRECA RUCU V RECU RAC ZA RUCU GRECU DAC".
Let’s make a table of word frequencies:
To encode, we need to settle on a scheme, for example: we create a dictionary and assign an index to each word; spaces are ignored and not encoded, since we assume that every word is separated by a single space character.
First, we create a dictionary:
Thus, our string is encoded into a sequence:
7, 0, 12, 3, 5, 0, 1, 9, 2, 10, 0, 4, 1, 3, 2, 8, 4, 6, 11
That was the preparatory stage; exactly how we encode the dictionary and the data after this preparation is a creative process. For now we will stay within the methods already known to us and start with block coding.
We write the indices as 4-bit blocks (indices from 0 to 15 can be represented this way). We will have two such chains: one for the encoded message and one for matching indices to words. The words themselves will be encoded with Huffman codes. We still need a separator for dictionary entries. We could, for example, store the word length in a block (the longest word has 5 characters, so 3 bits would suffice), but we can also use the space code, which takes only two bits, so let's do that. As a result, we get the following dictionary storage scheme:
| Index / bits | Word / bits | End of word / bits |
| 0 / 4 | GRECA / 18 | SPACE / 2 |
| 1 / 4 | V / 5 | SPACE / 2 |
| 2 / 4 | RAC / 10 | SPACE / 2 |
| 3 / 4 | RECU / 12 | SPACE / 2 |
| 4 / 4 | RUCU / 12 | SPACE / 2 |
| 5 / 4 | VIDIT / 31 | SPACE / 2 |
| 6 / 4 | GRECU / 17 | SPACE / 2 |
| 7 / 4 | EHAL / 20 | SPACE / 2 |
| 8 / 4 | ZA / 10 | SPACE / 2 |
| 9 / 4 | RECHKE / 18 | SPACE / 2 |
| 10 / 4 | SUNUL / 26 | SPACE / 2 |
| 11 / 4 | DAC / 17 | SPACE / 2 |
| 12 / 4 | CHEREZ / 21 | SPACE / 2 |
and the message itself, at 4 bits per code.
Counting everything together, we get 371 bits; the message itself was encoded in 19 × 4 = 76 bits. But, as in all the previous cases, we still have to store the correspondence between the Huffman codes and the symbols.
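As a sanity check, we can decode the index sequence and recount the bits in Python. The word spellings here are one consistent transliteration of the Russian originals, and the per-word Huffman bit counts are taken from the dictionary table:

```python
# Dictionary in index order; word_bits = Huffman-coded length of each word, per the table.
dictionary = ["GRECA", "V", "RAC", "RECU", "RUCU", "VIDIT", "GRECU",
              "EHAL", "ZA", "RECHKE", "SUNUL", "DAC", "CHEREZ"]
word_bits = [18, 5, 10, 12, 12, 31, 17, 20, 10, 18, 26, 21, 17]
indices = [7, 0, 12, 3, 5, 0, 1, 9, 2, 10, 0, 4, 1, 3, 2, 8, 4, 6, 11]

decoded = " ".join(dictionary[i] for i in indices)   # reconstruct the phrase
total_bits = (len(indices) * 4                 # the message: 4 bits per index
              + len(dictionary) * (4 + 2)      # per entry: 4-bit index + 2-bit space code
              + sum(word_bits))                # the Huffman-coded words themselves

print(decoded)
print(total_bits)  # -> 371
```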
I hope this article gives you a general impression of coding and shows that it is not only the business of military cipher clerks or complex algorithms for mathematical geniuses.
From time to time I see students trying to solve coding problems who simply cannot abstract away from the details and approach the process creatively. But coding is like a hairstyle or fancy trousers: one more way in which we display our social code.