Coding for Dummies, Part 1


I am not an expert in this area, but I nevertheless read a lot of specialized literature to get acquainted with the subject and, fighting my way through the thorns to the stars, took plenty of bumps in the early stages. For all the abundance of information, I could not find simple articles on coding as such outside the special literature (so to speak, without formulas and with pictures).

The first part of the article is a primer on coding as such, with examples of manipulating bit codes; in the second part I would like to touch on the simplest ways of encoding images.

0. Beginning

Since I am addressing newcomers to the subject, I see nothing shameful in referring to Wikipedia. There, encoding of information is defined as the process of converting a signal from a form convenient for direct use of the information into a form convenient for transmission, storage, or automatic processing.

What I lacked at school in the 70s and 80s, if not in computer science then, say, in mathematics lessons, was basic information on coding. The fact is that each of us encodes information every second, constantly, without ever concentrating on the coding itself. That is, in everyday life we do it all the time. So how does this happen?

Facial expressions, gestures, speech, signals of various levels: a sign with an inscription, a road sign, a traffic light, and, in the modern world, barcodes and QR codes, URLs, hashtags.

Let’s take a look at some in more detail.

1.1 Speech, facial expressions, gestures

Surprisingly, these are all codes. With their help we transmit information about our actions, sensations, and emotions. The most important thing is that the codes be clear to everyone. For example, a person born in the dense forests near the Amazon who has never seen a modern city dweller may run into a failure to understand the code: a smile, as a baring of teeth, will be perceived as a threat rather than an expression of joy.

By this definition, what happens when we speak? A thought, a form convenient for direct use, is converted into speech, a form convenient for transmission. And note: since sound is limited both in speed and in transmission distance, in some situations a gesture can be chosen to transmit the same information over a greater distance.

But we will still be limited by the range of our eyesight, and so people began to invent other ways of transmitting and transforming information, such as fire or smoke.

1.2 Alternating signals

[Image: Indian smoke signals]

In a primitive form, coding with alternating signals has been used by mankind for a very long time. In the previous section we talked about smoke and fire. If an obstacle is placed between the observer and the fire and then removed, the observer sees alternating on/off signals. By varying the timing of these switches, we can work out a sequence of codes that the receiving side will interpret unambiguously.

Along with signal flags on sea and river vessels, with the advent of radio Morse code came into use. And for all its apparent binarity (representation of the code by two values), since dot and dash signals are used, it is in fact a ternary code: a pause is required in the transmission to separate individual code symbols. That is, in Morse code, besides "dot-dash", which gives us the letter "A", there is also "dot-pause-dash", which is already the two letters "ET".
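The role of the pause as a third symbol can be shown in a tiny sketch (only three real Morse entries, not a full Morse handler):

```python
# A tiny fragment of the Morse table: "." = dot, "-" = dash.
MORSE = {".-": "A", ".": "E", "-": "T"}

# Without a pause, the receiver cannot split the stream unambiguously:
# the same two signals mean either one letter or two.
print(MORSE[".-"])               # A   ("dot-dash" read as one letter)
print(MORSE["."] + MORSE["-"])   # ET  ("dot-pause-dash" read as two letters)
```

The pause is what tells the receiver which of the two readings was intended.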

1.3 Context

When we use a computer, we understand that information comes in different kinds: sound, video, text. But what are the main differences? Before starting to encode information, for example in order to transmit it through communication channels, we need to understand what the information is in each specific case, that is, to pay attention to its content. Sound is a sequence of discrete values describing a sound signal, video is a sequence of image frames, text is a sequence of text characters. If we ignore the context and, say, use Morse code to transmit all three kinds of information, then while this method may be acceptable for text, for sound and video the time spent transmitting even one second of information may turn out to be far too long: an hour, or even a couple of weeks.

2. Text encoding

Let’s move on from the general description of coding to the practical part. By convention we will take it as given that we are encoding data for a personal computer, where the units of information are bits and bytes. A bit is like an atom of information, and a byte is a conventional block of 8 bits.

Text in a computer is built from a set of 256 characters; one byte is allocated per character, so the values from 0 to 255 can serve as codes. 255 in binary is 11111111. The positions in such a representation are read from right to left, that is, the least significant bit is rightmost, so one is written as 00000001.
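A quick sketch of this byte representation, using Python’s standard binary formatting:

```python
# One byte = 8 bits, values 0..255. The most significant bit is on the left,
# so the positions are read from right to left (1, 2, 4, 8, ...).
def to_byte(n: int) -> str:
    """Render an integer 0..255 as an 8-bit binary string."""
    assert 0 <= n <= 255
    return format(n, "08b")

print(to_byte(255))  # 11111111
print(to_byte(1))    # 00000001
```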

So, an alphabet gives us letters in upper and lower case (26 each in English, 33 in Russian), plus 10 digits. There are also punctuation marks and other symbols, but for the experiments we will use only uppercase letters and the space.

Test phrase: “ЕХАЛ ГРЕКА ЧЕРЕЗ РЕКУ ВИДИТ ГРЕКА В РЕЧКЕ РАК СУНУЛ ГРЕКА РУКУ В РЕКУ РАК ЗА РУКУ ГРЕКУ ЦАП” (a well-known Russian tongue twister: a Greek rode across a river; the Greek sees a crab in the river; the Greek stuck his hand into the river; the crab snapped the Greek by the hand).

2.1 Block coding

Information in the PC is already presented as blocks of 8 bits, but, knowing the context, we will try to represent it with smaller blocks. To do this we need to collect information about the symbols that occur and, for later use, immediately calculate how often each symbol appears:

Symbol | Count
SPACE | 18
Р | 12
К | 11
Е | 11
У | 9
А | 8
Г | 4
В | 3
Ч | 2
Л | 2
И | 2
З | 2
Д | 1
Х | 1
С | 1
Т | 1
Ц | 1
Н | 1
П | 1

In total we used 19 distinct characters (including the space). The string itself is 18 + 12 + 11 + 11 + 9 + 8 + 4 + 3 + 2 + 2 + 2 + 2 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 91 characters, i.e. 91 bytes (91 × 8 = 728 bits).
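These counts are easy to reproduce (a sketch; the phrase is the Cyrillic test string from above):

```python
from collections import Counter

phrase = ("ЕХАЛ ГРЕКА ЧЕРЕЗ РЕКУ ВИДИТ ГРЕКА В РЕЧКЕ РАК "
          "СУНУЛ ГРЕКА РУКУ В РЕКУ РАК ЗА РУКУ ГРЕКУ ЦАП")

freq = Counter(phrase)          # symbol -> number of occurrences
print(len(phrase))              # 91 characters -> 91 bytes -> 728 bits
print(len(freq))                # 19 distinct symbols, space included
print(freq[" "], freq["Р"])     # 18 spaces, 12 letters "Р"
```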

We take these values as constants and try to change the approach to storing the blocks. Notice that out of 256 possible codes we use only 19. To find out how many bits are needed to store 19 values, we calculate log2(19) ≈ 4.25; since we cannot use a fractional number of bits, we round up to 5, which gives us up to 32 different values (4 bits would give only 16 values, and we could not encode the whole string).

It is easy to calculate that we get 91 × 5 = 455 bits; that is, by knowing the context and changing the storage method, we reduced memory use by 37.5%.
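The arithmetic above, as a sketch:

```python
import math

distinct_symbols = 19
total_symbols = 91

# log2(19) ≈ 4.25, rounded up to a whole number of bits
bits_per_symbol = math.ceil(math.log2(distinct_symbols))
total_bits = total_symbols * bits_per_symbol
print(bits_per_symbol, total_bits)    # 5 455
print(f"{1 - total_bits / 728:.1%}")  # saving versus the 728-bit original
```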

Unfortunately, such a representation cannot be decoded unambiguously without also storing the mapping between symbols and their new codes, and adding this dictionary increases the size of the data. Therefore, such coding methods only pay off on large amounts of data.

In addition, to store 19 values we used enough bits for 32, which reduces the coding efficiency.

2.2 Variable length codes

Let’s use the same string and table and try to encode the data differently. We drop the fixed-size blocks and represent the data according to frequency of use: the more often a symbol occurs, the fewer bits we spend on it. We get a second table:

Symbol | Count | Variable-length code
SPACE | 18 | 0
Р | 12 | 1
К | 11 | 00
Е | 11 | 01
У | 9 | 10
А | 8 | 11
Г | 4 | 000
В | 3 | 001
Ч | 2 | 010
Л | 2 | 011
И | 2 | 100
З | 2 | 101
Д | 1 | 110
Х | 1 | 111
С | 1 | 0000
Т | 1 | 0001
Ц | 1 | 0010
Н | 1 | 0011
П | 1 | 0100

To calculate the length of the encoded message, we sum, over all symbols, the symbol count multiplied by its code length in bits; this gives 179 bits.

But this method, although it yields a decent memory saving, will not work, because it cannot be decoded. In such a situation we cannot determine what the sequence "111" means: it could be "РРР", "РА", "АР" or "Х".
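The ambiguity is easy to demonstrate by enumerating every way to split "111" under the codes from the table (a sketch using just the three symbols involved):

```python
# Variable-length codes from the table above; note that "1" is a prefix
# of "11", which is a prefix of "111" - that is exactly the problem.
codes = {"Р": "1", "А": "11", "Х": "111"}

def parses(bits, prefix=""):
    """Yield every possible decoding of a bit string under an ambiguous table."""
    if not bits:
        yield prefix
    for symbol, code in codes.items():
        if bits.startswith(code):
            yield from parses(bits[len(code):], prefix + symbol)

print(sorted(parses("111")))  # four different readings of the same bits
```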

2.3 Prefix block codes

To solve the problem of the previous example, we need to use prefix codes: codes that can be decoded unambiguously as they are read, because no code value can be confused with the beginning of another. Remember, earlier we talked about Morse code, where a pause served as the separator. So now we need to put into circulation some kind of code that determines the beginning and/or end of a specific code value.

Let’s create a third table for the same row:

Symbol | Count | Prefix code with variable blocks
SPACE | 18 | 0 000
Р | 12 | 0 001
К | 11 | 0 010
Е | 11 | 0 011
У | 9 | 0 100
А | 8 | 0 101
Г | 4 | 0 110
В | 3 | 0 111
Ч | 2 | 1 0001
Л | 2 | 1 0010
И | 2 | 1 0011
З | 2 | 1 0100
Д | 1 | 1 0101
Х | 1 | 1 0110
С | 1 | 1 0111
Т | 1 | 1 1000
Ц | 1 | 1 1001
Н | 1 | 1 1010
П | 1 | 1 1011

The peculiarity of the new codes is that the first bit indicates the size of the following block: 0 means a three-bit block, 1 a four-bit block. It is easy to calculate that this approach encodes our string in 379 bits. Earlier, with block coding, we got 455 bits.
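A minimal decoder for this scheme (a sketch; only three entries of the table are shown):

```python
# First bit selects the block size: 0 -> 3 more bits, 1 -> 4 more bits.
# A subset of the table above, stored as full codes including the prefix bit.
table = {"0000": " ", "0001": "Р", "10110": "Х"}

def decode(bits: str) -> str:
    out, i = [], 0
    while i < len(bits):
        size = 4 if bits[i] == "0" else 5  # prefix bit + 3- or 4-bit block
        out.append(table[bits[i:i + size]])
        i += size
    return "".join(out)

print(decode("0001" + "10110" + "0000"))  # "Р", then "Х", then a space
```

Because the prefix bit announces the block size up front, the decoder never has to guess where one code ends and the next begins.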

We can develop this approach and increase the prefix to 2 bits, which will allow us to create 4 groups of blocks:

Symbol | Count | Prefix code with variable blocks
SPACE | 18 | 00 0
Р | 12 | 00 1
К | 11 | 01 00
Е | 11 | 01 01
У | 9 | 01 10
А | 8 | 01 11
Г | 4 | 10 000
В | 3 | 10 001
Ч | 2 | 10 010
Л | 2 | 10 011
И | 2 | 10 100
З | 2 | 10 101
Д | 1 | 10 110
Х | 1 | 10 111
С | 1 | 11 000
Т | 1 | 11 001
Ц | 1 | 11 010
Н | 1 | 11 011
П | 1 | 11 100

Here prefix 00 is followed by a 1-bit block, 01 by a 2-bit block, and 10 and 11 by 3-bit blocks. Counting the size of the string gives 356 bits.

As a result, over three modifications of one method we steadily reduce the string size: from 455 to 379, and then to 356 bits.

2.4 Huffman code

This is one of the most popular ways to build prefix codes. When certain conditions are met, it yields an optimal code for any data, while leaving considerable freedom in how the codes are actually constructed.

The code guarantees a unique bit sequence for every character, and assigns short codes to the frequent ones.

Symbol | Count | Code
SPACE | 18 | 00
Р | 12 | 101
К | 11 | 100
Е | 11 | 011
У | 9 | 010
А | 8 | 1111
Г | 4 | 11011
В | 3 | 11001
Ч | 2 | 111011
Л | 2 | 111010
И | 2 | 111001
З | 2 | 111000
Д | 1 | 1101011
Х | 1 | 1101010
С | 1 | 1101001
Т | 1 | 1101000
Ц | 1 | 1100011
Н | 1 | 1100010
П | 1 | 110000

We calculate the result: 328 bits.

Note that although codes of 6 and 7 bits have appeared, the symbols that use them are too rare to affect the outcome.
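The 328-bit total can be verified by building a Huffman tree from the frequencies (a sketch using heapq; tie-breaking may produce codes that differ from the table above, but the total length of any optimal code is the same):

```python
import heapq
from collections import Counter

phrase = ("ЕХАЛ ГРЕКА ЧЕРЕЗ РЕКУ ВИДИТ ГРЕКА В РЕЧКЕ РАК "
          "СУНУЛ ГРЕКА РУКУ В РЕКУ РАК ЗА РУКУ ГРЕКУ ЦАП")

def huffman_lengths(freq):
    """Map each symbol to its Huffman code length in bits."""
    # Each heap item: (total count, tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)  # monotone tie-breaker so dicts are never compared
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

freq = Counter(phrase)
lengths = huffman_lengths(freq)
print(sum(freq[s] * d for s, d in lengths.items()))  # 328 bits in total
```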

2.5.1 Strings and Substrings

So far, we have encoded data by treating it as a collection of individual characters. Now we will try to encode with whole words.

Let me remind you of our string: “ЕХАЛ ГРЕКА ЧЕРЕЗ РЕКУ ВИДИТ ГРЕКА В РЕЧКЕ РАК СУНУЛ ГРЕКА РУКУ В РЕКУ РАК ЗА РУКУ ГРЕКУ ЦАП”.

Let’s make a table of repetitions of words:

Word | Count
SPACE | 18
ГРЕКА | 3
В | 2
РАК | 2
РЕКУ | 2
РУКУ | 2
ВИДИТ | 1
ГРЕКУ | 1
ЕХАЛ | 1
ЗА | 1
РЕЧКЕ | 1
СУНУЛ | 1
ЦАП | 1
ЧЕРЕЗ | 1

To encode, we need to settle on a scheme. For example: we create a dictionary and assign an index to each word; we ignore spaces and do not encode them, assuming instead that every word is separated by a space character.

First, we create a dictionary:

Word | Count | Index
ГРЕКА | 3 | 0
В | 2 | 1
РАК | 2 | 2
РЕКУ | 2 | 3
РУКУ | 2 | 4
ВИДИТ | 1 | 5
ГРЕКУ | 1 | 6
ЕХАЛ | 1 | 7
ЗА | 1 | 8
РЕЧКЕ | 1 | 9
СУНУЛ | 1 | 10
ЦАП | 1 | 11
ЧЕРЕЗ | 1 | 12

Thus, our string is encoded into a sequence:

7, 0, 12, 3, 5, 0, 1, 9, 2, 10, 0, 4, 1, 3, 2, 8, 4, 6, 11
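The dictionary and the index sequence can be reproduced as follows (a sketch; ties in frequency are broken alphabetically, which happens to match the table above):

```python
from collections import Counter

words = ("ЕХАЛ ГРЕКА ЧЕРЕЗ РЕКУ ВИДИТ ГРЕКА В РЕЧКЕ РАК "
         "СУНУЛ ГРЕКА РУКУ В РЕКУ РАК ЗА РУКУ ГРЕКУ ЦАП").split()

freq = Counter(words)
# Most frequent words get the smallest indices; ties sorted alphabetically.
vocab = sorted(freq, key=lambda w: (-freq[w], w))
index = {w: i for i, w in enumerate(vocab)}

encoded = [index[w] for w in words]
print(encoded)  # [7, 0, 12, 3, 5, 0, 1, 9, 2, 10, 0, 4, 1, 3, 2, 8, 4, 6, 11]
```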

This is the preparatory stage; exactly how we encode the dictionary and the data after this preparation is a creative matter. For now we stay within the methods already known to us and start with block coding.

We write the indices as blocks of 4 bits (enough to represent indices from 0 to 15); we will have two chains of such blocks, one for the encoded message and one for matching indices to words. The words themselves will be encoded with Huffman codes. We still have to choose a separator between dictionary entries. We could, for example, store the word length in a block (the longest word has 5 characters, so 3 bits would suffice), but we can also use the space code, which takes only two bits, so let’s do that. As a result, we get the following dictionary storage scheme:

Index / bits | Word / bits | End of word / bits
0 / 4 | ГРЕКА / 18 | SPACE / 2
1 / 4 | В / 5 | SPACE / 2
2 / 4 | РАК / 10 | SPACE / 2
3 / 4 | РЕКУ / 12 | SPACE / 2
4 / 4 | РУКУ / 12 | SPACE / 2
5 / 4 | ВИДИТ / 31 | SPACE / 2
6 / 4 | ГРЕКУ / 17 | SPACE / 2
7 / 4 | ЕХАЛ / 20 | SPACE / 2
8 / 4 | ЗА / 10 | SPACE / 2
9 / 4 | РЕЧКЕ / 18 | SPACE / 2
10 / 4 | СУНУЛ / 26 | SPACE / 2
11 / 4 | ЦАП / 17 | SPACE / 2
12 / 4 | ЧЕРЕЗ / 21 | SPACE / 2
followed by the message itself, 4 bits per index:

7, 0, 12, 3, 5, 0, 1, 9, 2, 10, 0, 4, 1, 3, 2, 8, 4, 6, 11

Counting everything together, we get 371 bits; the message itself took 19 × 4 = 76 bits. But, as in all the previous cases, we would still need to store the correspondence between the Huffman codes and the symbols.
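Adding up the pieces (a sketch; the per-word bit counts are the sums of the section 2.4 Huffman code lengths, as listed in the dictionary scheme above):

```python
# Bits per word under the section 2.4 Huffman codes.
word_bits = {"ГРЕКА": 18, "В": 5, "РАК": 10, "РЕКУ": 12, "РУКУ": 12,
             "ВИДИТ": 31, "ГРЕКУ": 17, "ЕХАЛ": 20, "ЗА": 10, "РЕЧКЕ": 18,
             "СУНУЛ": 26, "ЦАП": 17, "ЧЕРЕЗ": 21}

entries = len(word_bits)                  # 13 dictionary entries
dictionary = (sum(word_bits.values())     # Huffman-coded words
              + entries * 2               # 2-bit space code closing each entry
              + entries * 4)              # 4-bit index per entry
message = 19 * 4                          # 19 word indices, 4 bits each
print(dictionary, message, dictionary + message)  # 295 76 371
```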

Afterword

I hope this article gives you a general impression of coding and shows that it is not only the business of military cipher clerks or complex algorithms for mathematical geniuses.

From time to time I see students trying to solve coding problems who simply cannot abstract and approach the process creatively. But coding is like a hairstyle or fancy trousers: in its own way, it expresses our social code.

Thanks for visiting http://nullcode.cc
