Printed Arabic Character Recognition Using Variation Method and Discrete Cosine Transform

Automatic character recognition has been the subject of intensive research for decades. Because of the complexity of printed and handwritten Arabic text, comparatively little research has been conducted on the automatic recognition of Arabic characters. This research proposes a new technique for recognizing printed Arabic characters. After the Arabic character image is acquired, a number of preprocessing steps are applied to the digitized image. These steps include smoothing with a median filter, segmentation using the horizontal and vertical histogram profiles, and thinning with a standard Guo thinning algorithm. The variation method and the discrete cosine transform are used for feature extraction, and a radial basis function (RBF) network is used for classification. In the experiments reported here, the technique handles the printed Arabic character recognition task efficiently, with the best results on isolated and end characters.


Introduction
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it. OCR is a field of research in pattern recognition, artificial intelligence and computer vision [Wikipedia 2011].
The ultimate objective of any OCR system is to simulate human reading capabilities. That is why OCR systems are considered a branch of both artificial intelligence and computer vision [Sunji Mori 1999]. Character recognition has received a lot of attention and achieved considerable success for Latin- and Chinese-based languages, but this is not the case for Arabic and Arabic-like languages such as Urdu, Persian, Jawi, Pashto and others [Abdelmalek 2004].
Researchers classify the OCR problem into two domains. The first deals with the image of the character after it has been input to the system, for instance by scanning; this is called off-line recognition. The second uses a different input method, in which the writer writes directly to the system using, for example, a light pen as the input tool; this is called on-line recognition [Aburas 2008].
OCR systems require calibration to read a specific font; early versions needed to be programmed with images of each character and worked on one font at a time. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems are capable of reproducing formatted output that closely approximates the original scanned page, including images, columns and other non-textual components [Wikipedia 2011]. The potential of OCR systems is enormous because they enable users to harness the power of computers to access printed documents. OCR is already widely used in the legal profession, where searches that once required hours or days can now be accomplished in a few seconds [Wikipedia 2011].

Previous Work:
There are many off-line OCR systems available for printed Arabic documents. Khelifi B. and Zaghden N. proposed a method for finding similar text regions based on their fonts. Text regions are extracted, and font matching is then performed using fractal descriptors (box counting). Experiments were carried out on both maps and ancient documents [Khelifi 2008]. Tang, Yuan Y., Tao, Yu, Tao, Jin, and Xi, Dihuav [Tang 1999] present a feature extraction method based on the principles of fractal geometry (box-counting approach) and wavelets to classify isolated Chinese characters. A variety of different approaches have been applied to the recognition process of OCR.
In the proposed research, two feature extraction techniques that have not previously been used for Arabic OCR were investigated for Arabic cursive character recognition. The first is the variation method and the second is the discrete cosine transform. The output of each feature extraction technique was tested using a radial basis function (RBF) classifier. The flow chart of the new approach produced in this research is shown in Figure (1).

Characteristics of Arabic text:
Arabic is a popular script. It is estimated that there are more than one billion users of the Arabic script in the world. If OCR systems are available for Arabic characters, they will have great commercial value. However, due to the cursive nature of Arabic script, the development of Arabic OCR systems involves many technical problems, especially in the segmentation stage [Al-A'ali 2007].
The Arabic alphabet consists of 28 characters, and the shape of each character depends on its position within a word. The characters are therefore divided into four disjoint sets, which are listed in detail in Table 1. The first set includes the characters that appear in isolated form regardless of where they occur in a word. The second set includes the characters at the head of words, named beginning characters. The third set includes the characters within words, named middle characters. Finally, the last set includes the characters at the tail of words, named end characters. Thus, after a given word has been segmented, it is known a priori which character set needs to be considered [Jannoud 2007].

Data Acquisition and Preprocessing:
The text is scanned off-line from the input document by a scanning device and is stored as a portable grey map (PGM) file with a resolution of 300 dpi. A number of preprocessing steps are then applied to the digitized image. These steps generally include smoothing, segmentation, thinning, etc. Noise introduced by the data acquisition system needs to be eliminated from the scanned document. A median filter is used to remove this nonlinear noise, with a 3 x 3 window used to examine each pixel.
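The smoothing step can be illustrated with a minimal pure-Python sketch of a 3 x 3 median filter. The function name `median_filter_3x3`, the list-of-rows image representation, and the handling of border pixels (using only the neighbours that lie inside the image) are assumptions made for illustration; the paper does not specify them.

```python
from statistics import median_low


def median_filter_3x3(image):
    """Return a smoothed copy of `image` (a list of rows of grey levels).

    Each output pixel is the median of the 3 x 3 window centred on it.
    Border handling is an assumption: windows are clipped to the image.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            # median_low keeps the result an existing grey level
            # even when a clipped window has an even number of pixels.
            out[r][c] = median_low(window)
    return out
```

An isolated bright or dark pixel is replaced by the typical value of its neighbourhood, while the input image itself is left unmodified.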

Segmentation
Segmentation is a necessary step that isolates the text image objects to be passed to the recognition stage. For characters to be recognized correctly, the image must be segmented into a set of images that each contain only one character. These character images are then passed to the OCR module for recognition. Text lines are first separated by examining the horizontal histogram profile. Line separation is usually followed by a procedure that separates each text line into words and then into characters; see Figure (2). This procedure focuses on identifying physical gaps using only the components. The outer rectangle of each character image must then be found. The outer rectangle is the smallest rectangle that contains all pixels of the character. It can be found using the horizontal and vertical projections of the image.
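The projection-based steps can be sketched as follows. This is a simplified illustration that assumes a binary image stored as a list of rows with foreground pixels equal to 1; the helper names `projection`, `runs` and `bounding_box` are hypothetical. It only splits on blank rows or columns (physical gaps), so connected cursive characters would need additional handling that is not shown here.

```python
def projection(binary, axis):
    """Count foreground pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in binary]
    return [sum(col) for col in zip(*binary)]


def runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            spans.append((start, i))       # a gap closes the run
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def bounding_box(binary):
    """Smallest rectangle (top, bottom, left, right), end exclusive.

    Assumes the image contains at least one foreground pixel.
    """
    rows = runs(projection(binary, 0))
    cols = runs(projection(binary, 1))
    return rows[0][0], rows[-1][1], cols[0][0], cols[-1][1]
```

Calling `runs(projection(page, 0))` yields the row ranges of the text lines; applying `runs(projection(line, 1))` to each line then yields the column ranges separated by blank columns.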

Thinning
Thinning is a morphological operation that successively erodes away foreground pixels until the strokes are one pixel wide, leaving a skeleton. A standard thinning algorithm [Guo 1989] is employed. The resulting shape is one pixel wide, with continuous lines that carry the important feature points of the script image; see Figure (3).
The last preprocessing step is size normalization. It is the most important preprocessing phase, because it directly affects the recognition rate [George 2002]. All character images are normalized to 64 x 64 pixels.
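Size normalization can be sketched with nearest-neighbour resampling. The interpolation scheme is an assumption, since the paper only states the 64 x 64 target size, and the function name `normalize_size` is hypothetical.

```python
def normalize_size(char_img, size=64):
    """Resample a character image (list of rows) to size x size pixels.

    Nearest-neighbour sampling is assumed here; each output pixel copies
    the source pixel whose index is proportional to its own.
    """
    rows, cols = len(char_img), len(char_img[0])
    return [
        [char_img[r * rows // size][c * cols // size] for c in range(size)]
        for r in range(size)
    ]
```

Because every character ends up on the same grid regardless of the original font size, the feature vectors computed afterwards all have the same length.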

Feature Extraction
Feature extraction addresses the problem of finding the most compact and informative set of features, to improve the efficiency of data storage and processing. Defining feature vectors remains the most common and convenient means of data representation for classification and regression problems. Data can then be stored in simple tables (lines representing "entries", "data points", "samples", or "patterns", and columns representing "features"). Each feature results from a quantitative or qualitative measurement; it is an "attribute" or a "variable". Modern feature extraction methodology is driven by the size of the data tables, which is ever increasing as data storage becomes more and more efficient [Switzerland 2006].

Characters Features Extraction
This approach proposes two methods for feature extraction that have not previously been used for Arabic character recognition. The methods are described below.

Variation Method
In this research the variation method is considered. This method provides a means of estimating the fractal dimension of an image whose intensity or amplitude can be written as a function of the spatial coordinates, z = f(x, y). The method requires choosing a range of element sizes ε. A covering element of radius ε is placed at each data point within the image. The variation at each point, v_f(x, y, ε), is then measured within the element by taking the difference between the maximum, z_max, and the minimum, z_min, function values contained within the covering element's region.
These variations are then averaged over the image to form the ε-variation, V_f(ε), of the image. In the finite case this is a simple summing operation; in the analytic or general case it becomes an integral. The fractal dimension is then defined from the rate at which V_f(ε) changes with ε, and for a finite sampled set it is estimated from the slope of a log-log plot relating V_f(ε) to ε [Dubuc 1996] [Dubuc 1989] [Summers 1999]. The variation method is computed over each pixel of the segmented and normalized character images; see Figure (4). The extracted features for each character image are written to a file in a specific format. This file is the input to the classifier.
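The computation described above can be sketched in pure Python. Several points are assumptions rather than details taken from the paper: the covering element is a square window of half-width ε clipped at the image border, the ε values are 1 to 4 pixels, and the dimension is taken as D = 3 minus the least-squares slope of log V_f(ε) against log ε, which is the usual form of the variation estimator for surfaces. The function names are hypothetical, and the paper does not state whether the feature vector holds D alone or the individual variation values.

```python
import math


def epsilon_variation(img, eps):
    """Average over all pixels of (z_max - z_min) inside the covering element.

    The covering element is assumed to be a square window of half-width
    `eps`, clipped where it extends past the image border.
    """
    rows, cols = len(img), len(img[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            window = [
                img[rr][cc]
                for rr in range(max(0, r - eps), min(rows, r + eps + 1))
                for cc in range(max(0, c - eps), min(cols, c + eps + 1))
            ]
            total += max(window) - min(window)
    return total / (rows * cols)


def fractal_dimension(img, eps_values=(1, 2, 3, 4)):
    """Estimate D = 3 - slope, where slope fits log V(eps) against log eps.

    Assumes V(eps) > 0 for every eps (a constant image would give log(0)).
    """
    xs = [math.log(e) for e in eps_values]
    ys = [math.log(epsilon_variation(img, e)) for e in eps_values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return 3.0 - slope
```

For a smooth intensity ramp the variation grows almost linearly with ε, so the estimate stays close to 2; rougher intensity surfaces push it towards 3.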

Discrete Cosine Transform
The discrete cosine transform (DCT) measures the contribution of cosine functions at different discrete frequencies. The DCT is applied to image blocks of N x N pixels, where N is usually a power of two (2^n); it provides excellent energy compaction, and fast algorithms exist. The fact that the DCT is discrete makes it especially convenient for efficient computation. The DCT coefficients can be computed using equation (2):

\[ F(u,v) = \frac{2}{N}\, C(u)\, C(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right] \qquad (2) \]

where x and y are spatial coordinates in the image block, and u and v are coordinates in the DCT coefficient block. The C terms are defined as:

\[ C(k) = \begin{cases} 1/\sqrt{2} & \text{for } k = 0 \\ 1 & \text{otherwise} \end{cases} \qquad k \in \{u, v\} \]

The DCT has been used in many practical applications, especially in signal compression. For example, the compression achieved in the well-known JPEG image format is based on the DCT. The strong capability of the DCT to compress energy makes it a good candidate for pattern recognition applications. Coupled with classification techniques such as vector quantization (VQ) and ANNs, the DCT can constitute an integral part of a successful pattern recognition system. For example, the DCT has been used successfully in face recognition applications [Sarhan 2009].
The DC value is the coefficient F(0,0) of the DCT block. For u = v = 0, equation (2) simplifies to equation (4):

\[ F(0,0) = \frac{1}{N} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \qquad (4) \]

The segmented and normalized character images are divided into blocks of fixed size (8 x 8 pixels). The DCT is then computed over the pixels of each block.
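A direct, unoptimized sketch of the block DCT is given below. It assumes the orthonormal scaling 2/N with C(0) = 1/√2, which for 8 x 8 blocks gives the familiar factor 1/4; the function names `split_blocks` and `dct2` are hypothetical, and a practical implementation would use a fast algorithm rather than this quadruple loop.

```python
import math


def split_blocks(img, size=8):
    """Cut an image (list of rows) into non-overlapping size x size blocks."""
    return [
        [row[c:c + size] for row in img[r:r + size]]
        for r in range(0, len(img), size)
        for c in range(0, len(img[0]), size)
    ]


def dct2(block):
    """Two-dimensional DCT of a square N x N block, computed from the definition."""
    n = len(block)
    # cos_table[k][x] = cos((2x + 1) * k * pi / (2N)), shared by both axes.
    cos_table = [
        [math.cos((2 * x + 1) * k * math.pi / (2 * n)) for x in range(n)]
        for k in range(n)
    ]

    def scale(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += block[x][y] * cos_table[u][x] * cos_table[v][y]
            coeffs[u][v] = (2.0 / n) * scale(u) * scale(v) * total
    return coeffs
```

With this scaling, the DC coefficient of a block equals the sum of its pixels divided by N, so a uniform 8 x 8 block of grey level 100 has a DC value of 800 and AC coefficients that vanish up to rounding error.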
The result of this DCT step is a set of 64 coefficients for each 8 x 8 pixel block. Most of the resulting coefficients are zero or near zero. The DC coefficient holds most of the image energy and represents the average of the block; the remaining 63 coefficients, known as AC coefficients, describe the intensity changes within the block. To achieve high compaction of the DCT coefficients, the near-zero coefficients are eliminated by a quantization operation, which sets the near-zero coefficients to zero and reduces the precision of the others. As a result, the non-zero coefficients are concentrated in the upper left-hand corner of the block and the zeros in the lower right-hand corner. Each block is then converted to a linear form known as a stream. To maximize the number of consecutive zeros in the stream, the block coefficients are read not line by line but in a zigzag pattern, as shown in Figure (5). The DCT method is applied to the segmented and normalized character images; see Figure (6). The extracted features for each character image are written to a file in a specific format. This file is the input to the classifier.
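Quantization and the zigzag read-out can be sketched as follows. The uniform quantization step of 16 is an arbitrary illustrative value (the paper does not give its quantization table), the traversal follows the JPEG zigzag convention, and the function names are hypothetical.

```python
def quantize(coeffs, step=16):
    """Divide each coefficient by `step` and round to the nearest integer.

    Coefficients much smaller than `step` become zero; the others are
    kept at reduced precision. The step size is an assumed example value.
    """
    return [[int(round(value / step)) for value in row] for row in coeffs]


def zigzag_indices(n=8):
    """Return (row, col) pairs of an n x n block in JPEG-style zigzag order."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Walk the anti-diagonals in order; alternate direction on each one.
        key=lambda rc: (
            rc[0] + rc[1],
            rc[0] if (rc[0] + rc[1]) % 2 else -rc[0],
        ),
    )


def zigzag_scan(block):
    """Flatten a square block into a stream by reading it in zigzag order."""
    return [block[r][c] for r, c in zigzag_indices(len(block))]
```

Because the scan starts at the upper left-hand corner and ends at the lower right-hand corner, the non-zero low-frequency coefficients appear first in the stream and the zeros cluster at its end.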

Classification
In character recognition, the main task is the extraction of features from data. Classification methods are well developed, and they generally give low error rates if the features are suitable for the task. The classification stage consists of two parts, training and testing. In the training phase, the features of the character data are computed and fed to the classifier for training purposes. In the testing phase, the features of the unknown input character are extracted. The constructed feature vector is sent to the classifier, which matches it to the nearest class.
The classifier chosen for this task was the radial basis function (RBF) network.

Radial Basis Function Neural Network
A radial basis function (RBF) network is a special type of neural network that uses a radial basis function as its activation function. RBF networks are very popular for function approximation, curve fitting, time series prediction, control and classification problems. The radial basis function network is different from other neural networks, possessing several distinctive features. Because of their universal approximation, more compact topology and faster learning speed, RBF networks have attracted considerable attention and they have been widely applied in many science and engineering fields [Kurban 2009].
The RBF network is a multi-layer neural network consisting of an input layer, hidden layers, and an output layer; see Figure (7). Nodes in each layer are fully connected to those in the layers above and below, and nodes in the hidden layers (basis function nodes) have kernel functions that are usually given as Gaussian profiles. Each connection is associated with a synaptic weight, but unit weight is assigned to all connections between the input layer and the hidden layers [Ng 1991]. The RBF network is trained first by unsupervised learning, to determine the characteristics of the hidden layer, and then by supervised learning. In the unsupervised learning process, the means and variances of the basis functions for the hidden layers are determined using the K-means clustering algorithm [Hush 1993]. The supervised learning process then presents each input-output pattern to the network and calculates the basis function node outputs. The basis function node outputs and the desired outputs are used to determine the network output weights.
A classifier is used to identify the characters from the features obtained by applying the variation method and the DCT. These are then compared and saved as models for the training stage.
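The two-stage training described above can be sketched on toy data. This is an illustrative sketch rather than the network used in the experiments: the names `kmeans` and `RBFNetwork` are hypothetical, the initial cluster centres are simply evenly spaced training samples, and the output weights are fitted with a plain delta-rule (LMS) update as a stand-in, since the paper does not say which solver was used.

```python
import math


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Unsupervised stage: return (centres, variances) of k clusters."""
    centers = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centers[j]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centre if a cluster becomes empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    variances = []
    for j, group in enumerate(groups):
        var = sum(dist2(p, centers[j]) for p in group) / len(group) if group else 0.0
        variances.append(max(var, 1e-6))  # floor avoids division by zero
    return centers, variances


class RBFNetwork:
    def __init__(self, centers, variances, n_classes):
        self.centers, self.variances = centers, variances
        # One weight per basis function node plus a bias, per output node.
        self.weights = [[0.0] * (len(centers) + 1) for _ in range(n_classes)]

    def hidden(self, x):
        """Gaussian basis function outputs, followed by a constant bias input."""
        phi = [
            math.exp(-dist2(x, c) / (2.0 * v))
            for c, v in zip(self.centers, self.variances)
        ]
        return phi + [1.0]

    def train(self, samples, labels, epochs=500, lr=0.1):
        """Supervised stage: delta-rule fit of the output weights (assumed solver)."""
        for _ in range(epochs):
            for x, label in zip(samples, labels):
                phi = self.hidden(x)
                for j, row in enumerate(self.weights):
                    target = 1.0 if j == label else 0.0
                    err = target - sum(w * p for w, p in zip(row, phi))
                    for i in range(len(row)):
                        row[i] += lr * err * phi[i]

    def predict(self, x):
        phi = self.hidden(x)
        out = [sum(w * p for w, p in zip(row, phi)) for row in self.weights]
        return out.index(max(out))
```

In use, the feature vectors of the training characters would be passed to `kmeans`, the resulting centres and variances would define the hidden layer, and `train` would then fit one output node per character class.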

Results and Discussion
In order to investigate the effectiveness of the proposed algorithm, a series of tests was performed using the radial basis function neural network. The number of neurons in the input layer equals the number of features produced by the feature extraction method used. The output layer contains one node for each class, so the number of neurons in the output layer is 28.
Fifty input documents were used as training and testing data. They contain hundreds of characters of all types (isolated, beginning, middle and end characters) in different fonts (Arabic Transparent, Simplified Arabic, Arial, Courier), with font sizes from 12 to 18 points. Thirty documents were used for training and twenty for testing.
Recognition results are organized according to the position of the character within the word, i.e. isolated, beginning, middle, and end. Both methods (variation method and DCT) were used and compared in terms of the information content of the features extracted from the characters, with the same ANN structure in the classification stage.
The results are displayed in tabular form for each set of experiments. Tables (2), (3) and (4) show the performance of the proposed algorithm when using the features extracted by the two feature extraction methods. The proposed method was efficient: the best results were achieved with the features extracted by the variation method for isolated characters, and the worst results were obtained for middle characters when using the DCT method. Both methods were efficient for isolated and end characters.

Conclusion
This research proposed a new method of printed Arabic character recognition in which the variation method and the discrete cosine transform are used to extract features from normalized characters.
Classification and normalization of off-line printed Arabic characters proved efficient with the proposed approach. A comparison of this method with other character recognition methods shows a significant increase in accuracy; see Table (5). With the addition of sufficient preprocessing, the approach offers a simple and fast structure for building a full OCR system. The experimental results show that the two methods (variation method and discrete cosine transform) achieved good performance when used for printed Arabic characters, and that both methods gave very good results for isolated and end characters. Because size normalization and thinning are used in the preprocessing stage, font size had no effect in any of the experiments.

Table (5): Performance analysis of the proposed algorithm compared with existing ones.
The use of the radial basis function (RBF) neural network contributed to the good results obtained in this research; this may be attributed to the Gaussian function in the hidden layer of the RBF network.
In future research, experiments will use non-resized character images. Further experiments will also use non-thinned character images, other font types, handwritten characters, and characters from other scripts such as English or Turkish, and will include comparisons with other kinds of algorithms and other types of neural networks.

Figure (1): The steps of the proposed algorithm
AlKhateeb et al. use DCT features and a neural network classifier. They discard 80% of the DCT coefficients without sacrificing the recognition accuracy [AlKhateeb 2008].