Better Adaptive Text Compression Scheme

A data compression scheme suggested by Ziv and Lempel, LZ77, is applied to text compression. A slightly modified version suggested by Storer and Szymanski, LZSS, is found to achieve compression ratios as good as most existing schemes for a wide range of texts. In these two methods, and in all other dynamic methods, a text file is searched from left to right to find the longest match between the previously encoded text and the characters to be encoded (the lookahead buffer). The method suggested in this work searches in two directions, from left to right and from right to left. Although this process takes more time, better compression results were obtained.


Introduction
Data compression is the business of reducing the amount of space needed to store files on computers or of reducing the amount of time taken to transmit the information over a channel of given bandwidth [7].
Data stored on a computer falls into two groups. The first is digital representations of data that is continuous in nature, such as images, sounds, and video sequences. Because the stored form is already a quantized version of the original, it is appropriate to permit further approximation, and lossy compression techniques can be used to obtain extremely compact representations. The second is data such as text and archival images of historical documents, where the original source of the data must be capable of being reconstructed exactly, so lossless compression methods are used to compact and save these kinds of data [17].
In this work, we concentrate on lossless compression, specifically on adaptive methods (explained in section 3), and especially on LZSS, which is an improvement of LZ77. A table of previous works (from 1977 to 2005) is also presented in that section. A development is applied to LZSS, generating a new scheme (named LZD) with better compression results; the idea of LZD is explained in detail in section 4. Finally, section 5 shows some experimental results on several files, followed by a simple comparison between the LZSS and LZD methods.

Types of Compression Methods
Compression methods can be classified into many groups (Fig. 2.1), as they have been designed for a wide variety of types of information such as text, images, and sound. These usually call for quite different approaches to the problem because of the different types of information they contain. In general, compression methods are divided into two groups, lossy and lossless:
-Lossy, or irreversible, compression is used for digitized analogue signals such as speech and pictures.
-Lossless, reversible, or noiseless compression (where the original can be recovered exactly from its compressed version) is particularly important for text, since in this situation errors are not acceptable [2].
Lossless compression methods, in turn, can be divided into online and offline methods:
-Online methods accomplish the compression in one pass.
-Offline methods process the entire input string several times before the final encoding strategy is determined [25].
Another categorization distinguishes static, semiadaptive, and adaptive compression schemes:
-In static schemes, the rules used to encode the string are kept fixed during the process.
-Semiadaptive schemes differ from static ones in using a different model for each encoded text.
-Adaptive (or dynamic) methods change the rules according to the characteristics of the text; this is normally implemented as an online learning process [2].
Finally, another classification can be made according to entropy theory (where the entropy is a measure of the information content of the text and gives a limit on the best possible compression), into entropy and nonentropy methods:
-Entropy methods are used when the objective is to maximize compression.
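As a rough illustration of entropy as a limit on compression, the sketch below computes the zero-order (single-character) Shannon entropy of a string in bits per character. The function name `zero_order_entropy` and the choice of a zero-order model are assumptions made for this example; models that take more context into account generally give lower estimates.

```python
import math
from collections import Counter


def zero_order_entropy(text):
    """Shannon entropy in bits per character, treating characters as independent."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


sample = "abbaabbbabab"
bits_per_char = zero_order_entropy(sample)
print(f"{bits_per_char:.3f} bits/char, "
      f"about {bits_per_char * len(sample):.1f} bits for the whole string")
```

A string made of a single repeated character has entropy 0 under this model, while a string with two equally frequent characters has entropy of 1 bit per character.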

Adaptive Methods
Almost all practical adaptive encoders are encompassed by a family of algorithms derived from the work of Ziv and Lempel. The essence is that phrases are replaced with a pointer to where they have occurred earlier in the text. This family of schemes is called Ziv-Lempel compression, abbreviated as LZ compression [6]. The method adapts quickly to a new topic, and it is also able to code short function words because they appear so frequently.
Decoding a text that has been compressed in this manner is straightforward; the decoder simply replaces a pointer with the already decoded text that it points to. In practice, LZ coding achieves good compression, and an important feature is that decoding can be very fast.
One form of a pointer is a pair (m,l) that represents the phrase of l characters starting at position m of the input string. The pointer is constructed from the earlier text within a predefined window. The window may be unrestricted (a growing window) or it may be restricted to a fixed-size window of the previous N characters, where N is typically several thousand [6].
-The growing window offers better compression by making more substrings available. As the window becomes larger, however, the encoding may slow down because of the time taken to search for matching substrings; compression may get worse because pointers must be larger; and if memory runs out, the window may have to be discarded, giving poor compression until it grows again.
-A fixed-size window avoids all these problems, but it has fewer substrings available as targets of pointers. Within the window chosen, limiting the set of substrings that may be the target of pointers makes the pointers smaller and encoding faster.
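To make the cost of a larger window concrete, the sketch below estimates the size of a pointer as the bits needed for an offset into a window of N characters plus the bits needed for a match length of up to F characters. The function `pointer_bits` and the plain fixed-width binary encoding are assumptions for illustration, not details taken from the cited schemes.

```python
def pointer_bits(window_size, max_match):
    """Bits for a fixed-width pointer: an offset in 0..N-1 and a length in 1..F."""
    offset_bits = (window_size - 1).bit_length()
    length_bits = (max_match - 1).bit_length()
    return offset_bits + length_bits


# Pointer size grows with the window, so every pointer costs more in a larger window.
for n in (1024, 4096, 65536):
    print(n, pointer_bits(n, 16))
```

With a maximum match length of 16, this gives 14, 16, and 20 bits for windows of 1024, 4096, and 65536 characters.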
The table below lists the previous works and the most significant variations of LZ compression, and summarizes the main distinguishing features among them:
LZW [24] (1984): the output contains pointers only; pointers indicate a previously parsed substring; pointers are of fixed size.
LZMW [16] (1984): same as LZT, but phrases are built by concatenating the previous two phrases.
LZJ [11] (1985): the output contains pointers only; pointers indicate a substring anywhere in the previous characters.
LZFG [9] (1989): pointers select a node in a trie; strings in the trie are from a sliding window.
LZRW [5] (1991): refers to variants of LZ77 with an emphasis on improving compression speed through the use of a hash table.
LZMA [5] (1998): uses a dictionary scheme similar to LZ77 with a variable size of up to 4GB.
LZWL [5] (2005): works with syllables.
Better Adaptive Text Compression Scheme, Journal of Education and Science - Pure Sciences, 52

3.1-LZ77 Scheme:-
LZ77 was the first form of LZ compression to be published [27]. In this scheme, pointers denote phrases in a fixed-size window that precedes the coding position. There is a maximum length for substrings that may be replaced by a pointer, given by the parameter F (typically 10-20). These restrictions allow LZ77 to be implemented using a "sliding window" of N characters. Of these, the first N-F have already been encoded and the last F constitute a lookahead buffer.
To encode a character, the first N-F characters of the window are searched to find the longest match with the lookahead buffer. The match may overlap with the buffer but obviously cannot be the buffer itself.
The longest match is then coded into the triple (i,j,a), where i is the offset of the longest match from the lookahead buffer, j is the length of the match, and a is the first character that did not match the substring in the window. The window is then shifted right j+1 characters, ready for another coding step. Attaching the explicit character to each pointer ensures that coding can proceed even if no match is found for the first character of the lookahead buffer [13].
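The coding step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the original implementation: the names `lz77_encode` and `lz77_decode` are hypothetical, the search is brute force, the offset i is counted backward from the coding position, and the match is capped so that one character always remains to serve as the explicit character a.

```python
def lz77_encode(text, search_size=4080, lookahead_size=16):
    """Encode text as a list of (offset, length, next_char) triples."""
    triples = []
    pos = 0
    while pos < len(text):
        best_len, best_off = 0, 0
        # Leave at least one character for the explicit character 'a'.
        limit = min(lookahead_size - 1, len(text) - pos - 1)
        for start in range(max(0, pos - search_size), pos):
            length = 0
            # The match may run past pos, i.e. overlap the lookahead buffer.
            while length < limit and text[start + length] == text[pos + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - start
        triples.append((best_off, best_len, text[pos + best_len]))
        pos += best_len + 1  # shift the window by j + 1 characters
    return triples


def lz77_decode(triples):
    """Rebuild the text by copying each match and appending the explicit character."""
    out = []
    for offset, length, char in triples:
        start = len(out) - offset
        for k in range(length):
            out.append(out[start + k])  # one at a time, so overlaps work
        out.append(char)
    return "".join(out)


encoded = lz77_encode("abbaabbbabab")
print(encoded)
print(lz77_decode(encoded))
```

Because every triple carries an explicit character, the first triple of any input has offset 0 and length 0, which is the case the text mentions where no match is found.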

3.2-LZSS Scheme:-
The output of LZ77 is a series of triples, which can also be viewed as a series of alternating pointers and characters. The use of an explicit character following every pointer is wasteful in practice because it could often be included as part of the next pointer. LZSS addresses this problem by using a free mixture of pointers and characters, the latter being included whenever a pointer would take more space than the characters it codes. A window of N characters is used in the same way as for LZ77, so the pointer size is fixed. An extra bit is added to each pointer or character to distinguish between them, and the output is packed to eliminate unused bits [2]. In the LZSS algorithm, p is the number of characters (or bytes) taken by a pointer [2].
If we take the same string as in section 3.1, abbaabbbabab, the coded string would be:
(0,a) (0,b) (0,b) (0,a) (1,1,3) (1,3,2) (1,8,3)
Each output element contains either two or three components. In both cases the first component is a single distinguishing bit: if it is 0, there is no coding and a complete character follows in the coded file; if it is 1, a pointer consisting of an offset and a match length follows.
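The LZSS coding step can be sketched as follows. This is a minimal illustration, not the implementation from [2]: the names `lzss_encode` and `lzss_decode`, the brute-force search, the minimum match length of 2, and the use of 1-based positions counted from the start of the text (as in the example above) are assumptions.

```python
def lzss_encode(text, window=4096, max_len=16, min_len=2):
    """Return a list of (0, char) literals and (1, position, length) pointers."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_len, best_start = 0, 0
        for start in range(max(0, pos - window), pos):
            length = 0
            # The match may extend past pos into the lookahead buffer.
            while (length < max_len and pos + length < len(text)
                   and text[start + length] == text[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_start = length, start
        if best_len >= min_len:
            tokens.append((1, best_start + 1, best_len))  # 1-based position
            pos += best_len
        else:
            tokens.append((0, text[pos]))  # a pointer would not pay off
            pos += 1
    return tokens


def lzss_decode(tokens):
    """Replace each pointer with the already decoded text it points to."""
    out = []
    for token in tokens:
        if token[0] == 0:
            out.append(token[1])
        else:
            _, position, length = token
            for k in range(length):
                out.append(out[position - 1 + k])
    return "".join(out)


tokens = lzss_encode("abbaabbbabab")
print(tokens)
print(lzss_decode(tokens))
```

Under these assumptions the encoder produces the same sequence as the example for abbaabbbabab, including the last pointer (1,8,3), whose match overlaps the characters being encoded.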

4-LZD (the proposed) Scheme:-
LZD is an abbreviation of LZSS with two-Dimension search, so it uses the structure of the LZSS scheme. The LZD encoder is parameterized by N, the size of the window in the text, and F, the maximum length of the substring that may be replaced by a pointer, as in LZSS.
The main difference between the two methods is that in LZSS the search for a match uses a greedy algorithm and proceeds from left to right, while in LZD the search for a match proceeds from left to right and then back from right to left. This gives a better chance of finding a longer match between the string to be encoded and the previously encoded text. For example, if we take the two words "MACHINE" and "CAMERA", LZSS finds no common phrase between them, but in LZD the first phrase of the word CAMERA, "CAM", matches the phrase "MAC" of the other word when the backward search is performed.
A single bit is added to distinguish whether the substring is coded in a forward or a backward manner.
LZD needs more time than the LZSS scheme. Today this is not very important, as CPUs have become cheap and fast, so recent schemes have tended to concentrate on achieving the best possible compression rather than on the time they take.
If the LZD algorithm is applied to the string abbaabbbabab, the output would be:
(0,a) (0,b) (1,0, ...
In a pointer (i,j,k,l), the first bit, i, is used as in LZSS (to distinguish whether the output is a character or a pointer). The second bit, j (which is either 0 or 1), distinguishes whether the match is from left to right (if the bit is 1) or from right to left (if the bit is 0). The elements k and l represent the offset and the length of the longest match, respectively (as in LZSS).
To decode the compressed string, the buffer of decoded text (referred to here as the lookahead buffer) starts with size zero, and a single bit is read from the coded string or file. If it is 0, the code of a complete character (8 bits) is read (and the size of the buffer increases by 1). If it is 1, the three elements of the pointer, j, k, and l, must be read, and l characters are taken from the buffer starting at position k, in a direction that depends on the value of j: if j is 0, the characters are read backward, from position k down toward position 0, and if j is 1, they are read forward from position k.
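A minimal sketch of the two-direction search and the decoding rule described above is given below. It is written under stated assumptions and should not be read as the authors' program: the names `lzd_encode` and `lzd_decode` are hypothetical, the search is brute force, the minimum match length is 2, a forward match is preferred when both directions give the same length, and k is a 1-based position counted from the start of the text.

```python
def lzd_encode(text, window=4096, max_len=16, min_len=2):
    """Return (0, char) literals and (1, j, k, l) pointers; j=1 forward, j=0 backward."""
    tokens, pos, n = [], 0, len(text)
    while pos < n:
        lo = max(0, pos - window)
        best_len, best_dir, best_start = 0, 1, 0
        for start in range(lo, pos):
            # Left-to-right match, as in LZSS (may overlap the lookahead).
            fwd = 0
            while (fwd < max_len and pos + fwd < n
                   and text[start + fwd] == text[pos + fwd]):
                fwd += 1
            # Right-to-left match: read the window backward from 'start'.
            bwd = 0
            while (bwd < max_len and pos + bwd < n and start - bwd >= lo
                   and text[start - bwd] == text[pos + bwd]):
                bwd += 1
            if fwd > best_len:
                best_len, best_dir, best_start = fwd, 1, start
            if bwd > best_len:
                best_len, best_dir, best_start = bwd, 0, start
        if best_len >= min_len:
            tokens.append((1, best_dir, best_start + 1, best_len))
            pos += best_len
        else:
            tokens.append((0, text[pos]))
            pos += 1
    return tokens


def lzd_decode(tokens):
    """Copy l characters from position k, forward if j is 1, backward if j is 0."""
    out = []
    for token in tokens:
        if token[0] == 0:
            out.append(token[1])
        else:
            _, direction, position, length = token
            step = 1 if direction == 1 else -1
            index = position - 1
            for _ in range(length):
                out.append(out[index])
                index += step
    return "".join(out)


tokens = lzd_encode("MACHINE CAMERA")
print(tokens)
print(lzd_decode(tokens))
```

For "MACHINE CAMERA" the sketch emits a backward pointer for "CAM", read leftward from the "C" of "MACHINE". For abbaabbbabab its first two tokens are the literals a and b, and the third is a backward pointer, which agrees with the part of the output shown above.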

Experimental Results
In this section, the results of some experiments with the coding scheme are presented, using a variety of different sorts of text files; respectable performance is achieved with all of them. An empirical comparison between the enhanced method and the standard one is also described. These results were obtained by a program written in C++. The files used in the experiments are listed below, with the name and type of each file:
1-Huge1, Huge2, Huge3: text files collected from a set of research abstracts (size: 1Mb-5Mb).
2-Small: a help file taken from a C-language package (17970 characters (≈ 7KB)).
3-Lzd.c: a commented C program, the same program used in compression (10649 characters).
4-Data: a collection of characters and numeric data in text format (24000 characters (≈ 4 KB)).
Table 5.1 shows the tests run on those files; the size of the file after compression is shown under the file name, and the second row of the table shows the compressed file as a percentage of the original using the LZSS method. Table 5.2 shows the same tests, but using the LZD method.

LZD: same as LZSS, except that pointers contain 4 elements (i,j,k,l).
The two methods are applied to the same five files with the same window size (N=4096 bytes or characters), which is a medium window size: if the window becomes larger, the search and encoding process may slow down, while with a smaller window the chance of finding a match between the lookahead buffer and the encoded text decreases, which reduces compression performance.
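The percentage figures can be related to the coded output with a small calculation. The sketch below estimates the packed size of a token sequence and expresses it as a percentage of the original. The helper names are hypothetical, and the bit widths are assumptions: 12 offset bits correspond to N=4096, while 4 length bits assume F=16, a value this section does not state.

```python
def compressed_bits(tokens, offset_bits=12, length_bits=4, char_bits=8):
    """Estimated packed size in bits of a list of literal and pointer tokens."""
    total = 0
    for token in tokens:
        if token[0] == 0:
            total += 1 + char_bits  # flag bit + the character itself
        else:
            # A 3-element pointer (1,k,l) has no extra bit; a 4-element
            # pointer (1,j,k,l) carries one extra direction bit.
            extra_bits = len(token) - 3
            total += 1 + extra_bits + offset_bits + length_bits
    return total


def percent_of_original(tokens, original_length, char_bits=8):
    """Compressed size as a percentage of the original size."""
    return 100.0 * compressed_bits(tokens) / (original_length * char_bits)


# Coded string from section 3.2 for the 12-character text abbaabbbabab.
example = [(0, "a"), (0, "b"), (0, "b"), (0, "a"), (1, 1, 3), (1, 3, 2), (1, 8, 3)]
print(compressed_bits(example), "bits")
print(f"{percent_of_original(example, 12):.3f}% of the original")
```

For such a short string the saving is small, because each pointer costs 17 bits under these assumptions; longer texts with more repeated phrases benefit more.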
From the tables, it can be clearly seen that LZD achieved better compression performance than LZSS. The difference in compression performance between the two methods falls in the range of 4-10%. The best result was obtained with the "data" file, which suggests that this file contains more phrases that match in the reverse direction than the others.
LZD can be considered a development of LZSS, which is a lossless compression method, meaning that there is no loss of data and the file after decompression is identical to the original.
