A Survey Study on Relation Extraction for Web Pages

Abstract: Natural language is language used by humans for communication. Natural Language Processing (NLP) helps machines to understand natural language. The natural language of web pages contains many semantic relations between entities. Discovering significant types of relations on the web is challenging because of its open nature. In this paper we survey several important types of semantic relations. The paper also covers relation extraction (RE) approaches, which are divided into the supervised approach, comprising feature-based and kernel-based methods, and the unsupervised approach. Three relation extraction algorithms are discussed: Support Vector Machine (SVM), Genetic Algorithm and Naive Bayes classifier. This survey should be useful for three kinds of readers: first, newcomers to the field who want to learn quickly about relation extraction; second, researchers who want to know how the various relation extraction techniques developed over time; and third, practitioners who just need to know which RE technique works best in different settings.


1-Introduction:
As the World Wide Web grows, an increasing amount of information, text and knowledge is available in digital archives, and most web content is stored in HTML, the "Hyper Text Markup Language" [1]. In this form the web is intended for human use, because HTML is a syntax-based format for displaying content. Query ambiguity reduces the quality of retrieval over HTML. For example, "bank" may mean the border of a body of water or a monetary establishment. Web pages carry more information than the regular text content visible in a browser, such as HTML tags, hyperlinks and anchor text. These characteristics of pages are useful for classification [2]. There has been an increasing demand for "Information Extraction" (IE), which recognizes relevant information (usually of predefined types) in text documents on a specific subject and gathers it in a structured format [3]. One of the purposes of relation extraction is to identify the named entities and to extract the relationships between entities and events [4]. Relation extraction is defined as the process of discovering and describing the "semantic relations" between entities in text [5]. Most relation extraction algorithms begin with some linguistic analysis, parsing the text to find relations directly from the sentences [6]. The relation extraction system in Figure 1, which is inspired by [7], takes the text of a document as input and produces a list of (entity, relation, entity) triples as its output.
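The input/output contract of such a system can be illustrated with a minimal sketch. The snippet below maps text to (entity, relation, entity) triples using one hand-written surface pattern. The pattern, the relation name `born_in` and the function name `extract_triples` are assumptions made for illustration only; they are not taken from the system in [7], which relies on linguistic analysis rather than a single regular expression.

```python
import re

# Hypothetical surface pattern: "<Name> was born in <Place>".
# Capitalised word sequences stand in for named entities (an assumption).
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
BORN_IN = re.compile(rf"({NAME}) was born in ({NAME})")

def extract_triples(text):
    """Return a list of (entity, relation, entity) triples found in text."""
    return [(m.group(1), "born_in", m.group(2)) for m in BORN_IN.finditer(text)]

triples = extract_triples("Alan Turing was born in London. He studied at Cambridge.")
print(triples)
```

A real system would replace the regular expression with named entity recognition and parsing, but the shape of the output stays the same.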

2-Data Source:
This research reviews web documents whose information is derived from several sources, such as Wikipedia, ACE RDC 2003 and 2004, social networks (Twitter and Facebook), the Clueweb09 dataset, MEDLINE, the PharmGKB database and PubMed. A web document can be one of the following.

2.1-XML document
XML, the "eXtensible Markup Language", is a typical format used to share and transfer information in different fields, because it can carry the logical structure of content into documents and it is platform independent [8].

2.2-HTML document
Hypertext Markup Language (HTML) is the standard markup language for producing web pages and web applications [9]. A document may contain many links, a technical text or a short answer to a specific question [10].
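As a hedged illustration of how the HTML-specific information mentioned in the introduction (hyperlinks and anchor text alongside the visible text) can be collected, the following sketch uses Python's standard `html.parser` module. The class name `LinkTextParser` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    """Collects (href, anchor text) pairs and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.links = []        # list of (href, anchor text)
        self.text_parts = []   # visible text fragments in document order
        self._href = None      # href of the <a> element currently open, if any
        self._anchor = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        fragment = data.strip()
        if fragment:
            self.text_parts.append(fragment)
            if self._href is not None:
                self._anchor.append(fragment)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None

parser = LinkTextParser()
parser.feed('<p>See the <a href="https://example.org/re">survey page</a> for details.</p>')
print(parser.links)
print(" ".join(parser.text_parts))
```

The anchor text and link target can then be used as additional features next to the plain text of the page.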

3-Text relation
A text relation is a relation between the words in a sentence. It can be a syntactic, lexical or semantic relation. A syntactic relation describes how words are grouped and connected to each other in a sentence [11], while a lexical relation is a pattern of association that exists between lexical units in a language [12].

3.1-Semantic Relations
The primary aim of recent research is to extract relevant documents. As the web develops into its next generation, called the "Semantic Web" [13], attention will move from looking for documents to obtaining facts and useful information [12]. The increasing ability to find information in the form of entities contained within documents leads to important results in extracting relations between these entities [14]. Relationships are fundamental to semantics because they connect meanings to words, terms and entities [15]. The semantic relationships between words are described as follows:
• Synonyms: a synonym is a word with the same or nearly the same meaning as another word in the same language [16].
• Antonyms: words with opposite meanings, such as "good" and "bad". Opposites can also be formed by adding one of the prefixes un-, il-, im-, in- or ir-, as shown in Table 1 [6].
• Metonyms: words used in place of another word with which they have a strong relation, as shown in Figure 4.

Figure 4: The Metonyms Relation
• Hyponymy and hypernymy: a hyponym is a subcategory of a more general class, as in the relationship between "dog" and "animal", while hypernymy is the state or quality of being a hypernym or superordinate (a general class under which a set of subcategories is subsumed), as shown in Figure 5 [17].

Figure 5: The Hyponym & Hypernymy Relations
• Polysemy: a word, phrase or concept that has more than one meaning or connotation, as shown in Figure 6 [18].

Figure 6: The Polysemy Relation
In this example, "paper" in the first sentence refers to a piece of paper, in the second sentence it means a research paper, and in the third it denotes a newspaper.
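The word-level relations listed above can be represented in a small lookup structure. The sketch below is a toy lexicon, not a real lexical resource: the entries, including the intermediate class "mammal", and the function names are assumptions chosen to mirror the "dog"/"animal" and "paper" examples.

```python
# Toy lexicon; every entry is an illustrative assumption.
HYPERNYM = {"dog": "mammal", "cat": "mammal", "mammal": "animal"}
SENSES = {"paper": ["sheet of paper", "research paper", "newspaper"]}

def hypernym_chain(word):
    """Follow hypernym links upward from word to the most general class."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def is_hyponym_of(word, general):
    """True if general is reachable from word through hypernym links."""
    return general in hypernym_chain(word)

def is_polysemous(word):
    """True if the lexicon lists more than one sense for word."""
    return len(SENSES.get(word, [])) > 1

print(hypernym_chain("dog"))
print(is_hyponym_of("dog", "animal"), is_polysemous("paper"))
```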

4-Relation Extraction (RE)
The aim of relation extraction is to discover semantic relations between entities [19]. This is challenging in the open domain of the web: a relation extraction system must be able to deal with a very large and rapidly growing scale, multiple styles of documents and many more types of relations [20]. To find these relations, a system should not expect a specific set of relation types, nor rely on a rigid set of relation argument types. It must also be able to deal efficiently with a huge amount of data [21]. Supervised learning algorithms need a large amount of hand-labeled data, but annotating training data is an undesirable and time-consuming job [22]. On the web, manually labeling data for each subject area is impractical, because the number of subjects of interest is simply very large. Relation extraction with automated labeling is called "unsupervised relation extraction" [23].

4.1-Supervised Relation Extraction Approach
Supervised approaches concentrate on relation extraction in a particular domain. These approaches need labeled data in which each pair of entity mentions is labeled with one of the predefined relation types [24].

Feature Based Approach
Feature-based methods are used to find useful lexical features, syntactic structure features and so on, as shown in Table 2. In Lishuang Li et al. [25], combining feature-based and kernel-based calculation gives a lower cost in the prediction phase than the other methods, but a higher computational cost in the training phase.
The feature-based approach is an excellent method for extracting the logical structures of HTML tables and moving them into XML documents: Yeon-Seok & Yeon-Seok [8] use an area segmentation and structure analysis algorithm together with a semantic coherency feature. Bonnie & Gaasterland [26] use a feature-based approach to identify the tense of sentences from the Penn Treebank tags in the parse tree. Their work covers the extraction, reanalysis and reinterpretation of both temporal and non-temporal relations between two events.
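To make the notion of lexical features concrete, the following sketch builds a simple feature dictionary for a pair of entity mentions in a tokenized sentence. The feature names and the function `pair_features` are illustrative assumptions and are not the feature sets used in [8], [25] or [26].

```python
def pair_features(tokens, e1, e2):
    """Lexical features for two entity mentions in a tokenized sentence.

    e1 and e2 are (start, end) token index ranges, end exclusive,
    with e1 located before e2.
    """
    between = tokens[e1[1]:e2[0]]
    feats = {
        "e1_head": tokens[e1[1] - 1].lower(),   # last token of first mention
        "e2_head": tokens[e2[1] - 1].lower(),   # last token of second mention
        "num_between": len(between),            # distance between mentions
    }
    for word in between:                         # bag of words between mentions
        feats["between=" + word.lower()] = 1
    if e1[0] > 0:                                # token just before first mention
        feats["before_e1=" + tokens[e1[0] - 1].lower()] = 1
    if e2[1] < len(tokens):                      # token just after second mention
        feats["after_e2=" + tokens[e2[1]].lower()] = 1
    return feats

tokens = "Marie Curie worked at the University of Paris .".split()
features = pair_features(tokens, (0, 2), (5, 8))
print(features)
```

A dictionary of this kind can be passed to any standard classifier after vectorization; syntactic features from a parser would be added in the same way.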

Kernel based approach
The kernel-based approach compares the structure of two patterns using the syntax tree, from the node at the top (the "root") down to the lowest nodes (the "children"). This approach still has restrictions in measuring patterns of multiple types, which reduces its performance in extracting new relations. The main advantage of kernel-based methods is that explicit feature engineering is avoided [27], as shown in Table 3.
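The idea of comparing two syntax trees from the root down to the children can be sketched with a simple tree kernel. The code below uses the generic recursion that counts shared tree fragments, as commonly used in convolution tree kernels; it is a swapped-in textbook formulation, not the specific kernel of [27] or of any surveyed system. Trees are assumed to be nested tuples `(label, child, ...)` with words as plain strings, and all function names are hypothetical.

```python
def label(node):
    """Label of an internal node, or the word itself for a leaf."""
    return node if isinstance(node, str) else node[0]

def production(node):
    """Label of a node followed by the labels of its direct children."""
    return (node[0],) + tuple(label(child) for child in node[1:])

def internal_nodes(tree):
    """All non-leaf nodes of the tree, root first."""
    if isinstance(tree, str):
        return []
    nodes = [tree]
    for child in tree[1:]:
        nodes.extend(internal_nodes(child))
    return nodes

def common_fragments(n1, n2, decay=1.0):
    """Weighted count of shared fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = decay
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str) and not isinstance(c2, str):
            score *= 1.0 + common_fragments(c1, c2, decay)
    return score

def tree_kernel(t1, t2, decay=1.0):
    """Sum of shared-fragment counts over all pairs of internal nodes."""
    return sum(common_fragments(a, b, decay)
               for a in internal_nodes(t1) for b in internal_nodes(t2))

t1 = ("S", ("NP", "john"), ("VP", ("V", "founded"), ("NP", "acme")))
t2 = ("S", ("NP", "mary"), ("VP", ("V", "founded"), ("NP", "acme")))
print(tree_kernel(t1, t2), tree_kernel(t1, t1))
```

The two example trees differ only in the subject, so their kernel value is lower than the value of a tree compared with itself; no feature vector is ever built explicitly.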

Semantic relation
The framework of Zhang et al. [28] exploits "trigger words" as a semantic constraint to guide the "bootstrapping iterations". It extends the usual bootstrapping model for relation extraction by constructing a novel way of defining trigger words, pattern representation, the similarity method and the evaluation method. Furthermore, a novel "bottom up kernel" algorithm is defined to decide whether the pattern obtained from a new sentence expresses a relation or not. Maengsik & Harksoo [29] use the SVM algorithm with a kernel-based approach to identify named entities in a social network application. Zhou et al. [3] combine different types of syntactic and semantic information into one tree structure, and capture these varieties via a novel context-sensitive convolution tree kernel.

4.2-Unsupervised Relation Extraction Approach
Unsupervised relation extraction refers to the task of automatically finding interesting relations between entities in large text corpora (Yulan et al. [30]), as shown in Table 4. Ya-nan et al. [4] propose a "statistical score S" to measure the association between strongly related events, and prune relations with a low S value. Ying et al. [31] investigate social networks using an unsupervised feature-based method that extracts named entity features for a disambiguation system. Its main advantage is that the collection of unsupervised features extracted from broad resources can effectively improve the robustness of a disambiguation system.
Bonan et al. [21] use an algorithm that handles the polysemy of relation instances on the Clueweb09 dataset and achieves a significant improvement in recall while maintaining the same level of precision.
Yulan et al. [30] work on Wikipedia; their method can abstract away from different surface realizations of text. These relations are expressed in different "dependency structures", with redundant information arising from the growing size of the web.
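One common way to find relations without labeled data is to group entity pairs whose connecting contexts are similar. The sketch below does this with Jaccard similarity and a greedy single pass. It is a generic illustration under stated assumptions (toy contexts, a fixed threshold of 0.5, hypothetical function names) and does not reproduce the methods of [4], [21], [30] or [31].

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of context words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_pairs(contexts, threshold=0.5):
    """Greedy clustering of entity pairs by context similarity.

    Each pair joins the first cluster whose seed context is similar enough;
    otherwise it starts a new cluster.
    """
    clusters = []  # list of (seed context, [entity pairs])
    for pair, words in contexts.items():
        for seed, members in clusters:
            if jaccard(seed, words) >= threshold:
                members.append(pair)
                break
        else:
            clusters.append((words, [pair]))
    return [members for _, members in clusters]

# Toy data: words observed between the two entities of each pair.
contexts = {
    ("Google", "YouTube"): {"acquired"},
    ("Facebook", "Instagram"): {"acquired", "bought"},
    ("Paris", "France"): {"capital", "of"},
}
groups = cluster_pairs(contexts)
print(groups)
```

Each resulting group can be read as one induced relation type; no relation inventory is specified in advance.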

5-Relation Extraction Algorithms
This section discusses three algorithms used in relation extraction: Support Vector Machines, Genetic Algorithm and Naive Bayes classifier.

5-1 Support Vector Machines (SVM)
A Support Vector Machine is a "vector space based machine-learning method" that finds a decision boundary between two classes that is as far as possible from any point in the training data. Apart from performing linear classification, SVMs can perform non-linear classification efficiently using what is called the "kernel trick", which implicitly maps their inputs into high-dimensional feature spaces [32]. Table 5 illustrates the different uses of SVM in relation extraction.
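The "kernel trick" can be checked numerically on a small case. The sketch below assumes a degree-2 homogeneous polynomial kernel on two-dimensional inputs and shows that evaluating the kernel directly gives the same value as a dot product in the explicit three-dimensional feature space. The choice of kernel and the function names are illustrative assumptions; the sketch covers only the kernel identity, not SVM training.

```python
import math

def poly_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel on 2-D inputs: (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, y)        # computed in the 2-D input space
explicit = dot(phi(x), phi(y))      # computed in the 3-D feature space
print(implicit, explicit)
```

The kernel never constructs the mapped vectors, which is why non-linear classification remains efficient even when the feature space is very high-dimensional.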

SVM
Bonan & Ralph [19] found that "one-pass annotation" is more cost-effective than annotation with effective assurance, while Zhou et al. [33] unify multiple types of syntactic and semantic information into one tree structure and capture these differences via a context-sensitive convolution tree kernel.

5-2 Genetic Algorithm (GA)
Christy & Thambidurai [34] show that the Genetic Algorithm performs well in rule mining and feature optimization for text.
The authors of [35] deploy a genetic algorithm and obtain high precision but low recall; they combine the benefits of ML algorithms with "rule-based" techniques to find related Arabic named entities. Each algorithm uses a linguistic module to improve on the results of the previous one, but the method is unable to capture some of the relations that exist between words far from the named entity locations, especially in long and complex sentences. Table 6 illustrates the use of GA in relation extraction.
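A genetic algorithm for feature optimization can be sketched as a search over bit masks, where each bit states whether a feature is selected. The code below is a minimal generic GA with elitism, one-point crossover and bit-flip mutation. The fitness function, the set `USEFUL` and all parameter values are hypothetical; they stand in for a real objective such as classifier accuracy and do not reproduce the setups of [34] or [35].

```python
import random

def run_ga(fitness, n_bits, pop_size=20, generations=30, mutation=0.05, seed=0):
    """Evolve bit masks of length n_bits and return the fittest one found."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        next_pop = pop[:2]  # elitism: the two best masks survive unchanged
        while len(next_pop) < pop_size:
            p1, p2 = rng.sample(pop[:pop_size // 2], 2)  # parents from the better half
            cut = rng.randint(1, n_bits - 1)             # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < mutation else b for b in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

# Hypothetical objective: features 0, 2 and 5 are useful,
# and every selected feature carries a small cost.
USEFUL = {0, 2, 5}

def fitness(mask):
    gain = sum(1.0 for i, bit in enumerate(mask) if bit and i in USEFUL)
    return gain - 0.1 * sum(mask)

best = run_ga(fitness, n_bits=8)
print(best, fitness(best))
```

Because the best masks are carried over unchanged, the best fitness in the population never decreases from one generation to the next.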

5-3 Naive Bayes classifier
The Naive Bayes classifier is a method that can learn from both annotated and unannotated documents in a "semi-supervised algorithm". Suresh & Kumar [36] applied the Naive Bayes classifier to Q/A systems using "lexico-syntactic and lexico-semantic features". They reach high precision and recall (near the ideal case).
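The core of the classifier can be illustrated with a minimal multinomial Naive Bayes using add-one smoothing. Note the assumptions: the sketch is fully supervised (the semi-supervised use of unannotated documents is not shown), the training phrases and relation labels are toy data, and the function names are hypothetical. It does not reproduce the features or the system of [36].

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (tokens, label). Returns (label counts, word counts, vocabulary)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest log posterior for tokens."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokens:
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data: context phrases labeled with a relation type.
train = [
    ("was born in".split(), "birthplace"),
    ("born and raised in".split(), "birthplace"),
    ("works for".split(), "employer"),
    ("is employed by".split(), "employer"),
]
model = train_nb(train)
prediction = predict_nb(model, "was born near".split())
print(prediction)
```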

6-Evaluation Metrics
A common way of evaluating the results of machine learning experiments is to use Precision, Recall and the F1-measure [37]. Precision, shown in equation (1), is the proportion of correctly retrieved items among all retrieved items [38]. A good system achieves high precision in retrieving correct items [39].
\[ \text{Precision} = \frac{\text{number of correct retrieved items}}{\text{total number of retrieved items}} \quad (1) \]
Recall, on the other hand, is the proportion of all correct items that are retrieved, as computed in equation (2). A higher recall rate indicates fewer missed correct items [40].
\[ \text{Recall} = \frac{\text{number of correct retrieved items}}{\text{total number of correct items}} \quad (2) \]
Finally, the F1-measure is the harmonic mean of precision and recall, as given in equation (3). The F-measure is widely used because in many studies it is considered the best measure of the result of a classifier [40].
\[ F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (3) \]
Table 7 illustrates the evaluation metrics for the different algorithms that have been used in relation extraction to extract a specified feature for a given application.
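Under the standard definitions (precision as correct retrieved items over all retrieved items, recall as correct retrieved items over all correct items, and F1 as their harmonic mean), the three metrics can be computed for a set of extracted triples as in the sketch below. The gold and predicted triples are toy data and the function name is hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for two collections of extracted items."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # correctly retrieved items
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy example: one of two predictions is correct, and two gold triples are missed.
gold = {
    ("Turing", "born_in", "London"),
    ("Curie", "born_in", "Warsaw"),
    ("Curie", "worked_at", "Sorbonne"),
}
predicted = {
    ("Turing", "born_in", "London"),
    ("Curie", "born_in", "Paris"),
}
p, r, f1 = precision_recall_f1(predicted, gold)
print(p, r, f1)
```

The harmonic mean penalizes an imbalance between the two measures, so a system cannot obtain a high F1 by maximizing precision or recall alone.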

Conclusion
This survey discussed the importance of relation extraction techniques in the field of natural language processing. It also discussed the different approaches that are widely used for the relation extraction task, and then the evaluation metrics. The Naive Bayes classifier, using "lexico-syntactic and lexico-semantic features", gives the best evaluation measures, near the ideal case.
On the other hand, it is very important to reduce the time needed to extract web relations accurately without losing efficiency. The use of a pattern-based method with local dependency trees increases the accuracy and recall of the event-argument extraction process. Supervised approaches, for the most part, do well when the domain is more restricted, while unsupervised approaches appear more appropriate for unrestricted-domain relation extraction systems, because they can grow easily with the size of the database and scale to new relations easily. Rule sets take advantage of sentence structure and grammar to capture more specific information. Moreover, these rule sets can be placed in an ontology that allows modification of relationships and inference over them [41]. This work suggests that future work in this area could apply fuzzy logic, which is a principal component of soft computing.