Classification of Software Systems attributes based on quality factors using linguistic knowledge and machine learning: A review

A Software Requirements Specification (SRS) documents both the functional and the non-functional requirements of a software system, that is, what the system does and the constraints under which it does it. In requirements engineering, system requirements are classified into several categories, such as functional, quality, and constraint classes. We therefore evaluate several machine learning approaches and methodologies from the literature for automatic requirements extraction and classification, systematically reviewing previous work on software requirements classification to assist software engineers in selecting the best requirement classification technique. The study aims to answer several questions: “Which machine learning algorithms were used to classify the requirements?”, “How do these algorithms work and how are they evaluated?”, “What methods were used for extracting features from text?”, “What evaluation criteria were used in comparing results?”, and “Which machine learning techniques and methods provided the highest accuracy?”.

This paper reviews a number of previous studies on the classification of software requirements. We compare the techniques used in the classification process to determine which technique gives the best results (accuracy, precision, recall, and F-measure) when classifying software system attributes on the basis of quality factors using machine learning.
The paper is organized as follows. Section 2 presents the theoretical background: software systems requirements, the ML techniques and methods used in software requirements classification, text vectorization and feature extraction models, and the performance measures used to evaluate a classifier. Section 3 (Methodology) reviews related work on machine learning for software requirements classification, covering the various data sets that represent different types of software requirements, and summarizes the relevant published papers. Sections 4 and 5 present the discussion and the conclusions, respectively.

Research Method
This section describes the theoretical background of the problem: an overview of software systems requirements (SSR), the ML techniques used to classify SSR, text feature extraction, and classifier performance measures.

Software Systems Requirements (SSR)
Within the software engineering industry, requirements engineering (RE) is considered one of the most natural language-intensive fields. As a result, many works have been produced over the years to automate the analysis of natural language artifacts relevant to RE, such as requirements documents, app reviews, privacy policies, and social media posts about software products. Recently, the spread of game-changing natural language processing (NLP) techniques and platforms has piqued RE researchers' interest. However, there is currently no reference framework that provides a comprehensive view of the field of NLP for RE [25].
Requirements activities include capturing both functional requirements (FRs) and non-functional requirements (NFRs), which describe what the system must do and how it must do it, respectively. Business analysts and domain specialists gather and document FRs. Technology experts, on the other hand, make architectural decisions. As a result, the bodies of knowledge for FRs and for architectural solutions are kept distinct. FRs are considered critical, high-risk, volatile, and liable to entail costly rework or legal implications [26]. When software quality is discussed, however, the relevant term is NFRs. NFRs are important constraints on a software system's development and behavior. Security, performance, availability, extensibility, and portability are just a few of the properties they specify. These characteristics are crucial in architectural design [27]. The existing issues with the concept of NFRs can be separated into three categories: definition issues, classification issues, and representation issues [28].

2-2 Techniques and methods used in the Software Requirements Classification
Machine learning (ML) is a subfield of AI; it is a data analysis method that automates the building of analytical models. A trained algorithm can generate an output for an input it has never seen before without the need for human interaction. Machine learning algorithms that learn from input/output pairs are known as supervised learning algorithms: for each example they learn from, a "teacher" supervises the algorithm by providing the desired output [29]. In unsupervised learning, only the input data is known and the method is given no known output data. Although these techniques have many good uses, they are difficult to comprehend and evaluate [30]. The reviewed papers used the following ML algorithms to classify software requirements automatically.
Latent Dirichlet Allocation (LDA): documents are categorized based on the frequency of word co-occurrences [31].
Biterm Topic Model (BTM): learns topics by modeling word co-occurrence patterns (e.g., biterms) [32]. Recent research on the categorization of short text documents indicates that BTM is better able to represent short and sparse text, such as that found in requirements specifications.
Naive Bayes (NB): a well-known supervised learning technique for binary classification [33]. It is based on Bayes' theorem and makes strong feature independence assumptions. It is straightforward and effective and, unlike most other classifiers, does not require a vast training set. It uses Bayes' theorem to predict the class of unseen data [34].
SVM (Support Vector Machine): a supervised algorithm for classification and regression that is characterized by its robustness and flexibility [35]. SVM is a versatile and powerful machine learning algorithm that can perform linear and nonlinear classification, regression, and outlier detection. For classification, it constructs a hyperplane with the maximum margin between two classes. This wide margin leaves little room for new cases to be misclassified [36].
MNB (Multinomial Naive Bayes): a method that calculates the conditional probabilities of the data set. The input features in MNB are assumed to be independent of one another given the class (conditional independence). This variant of Naive Bayes is used to classify documents and text [37].
k-NN (k-Nearest Neighbor): a classification method based on the neighbor principle, which states that instances within a data set are generally found near other instances with comparable characteristics [38]. The algorithm computes the distance between the incoming data and the instances already in the data set, selects the k closest cases, and then takes their mean for regression problems or their mode for classification [39].
LR (Logistic Regression): a regression approach used to estimate the likelihood that a specific instance belongs to a specific class [40]. The independent variables are used to predict the dependent variable. When the dependent variable has only two classes, binary logistic regression is used; when it has more than two categories, multinomial logistic regression is used [41].
LSTM (Long Short-Term Memory): a recurrent neural network (RNN) that employs memory blocks to address the vanishing gradient problem. The model's first layer, the input layer, receives preprocessed data in time steps. Each element is first passed to the embedding layer to produce feature vectors, and the LSTM's hidden layer then processes the sequence in the forward direction only. The LSTM has three primary gates for controlling the cell state and updating weights: input, forget, and output [42].
BiLSTM (Bidirectional Long Short-Term Memory): contains two hidden layers, both of which are connected to the input and the output. To make use of the information in the surrounding tokens, BiLSTM has a forward LSTM layer as well as a backward LSTM layer, so better predictions can be achieved. Stacking LSTM layers is the best way to take advantage of BiLSTM. The forward layer iterates from t = 1 to T, whereas the backward layer iterates from t = T to 1 [43].
CNN (Convolutional Neural Network): local features can be extracted by applying convolutional filters [44]. The filters have a width fixed to the size of the word embedding vector, and different filter sizes L = 3, 4, and 5 cover different vertical localities. This makes the network useful for learning many features [45].
RNN (Recurrent Neural Network): one of the modern algorithms for processing sequential data. Because it has an internal memory that lets it remember its input [46], it is suitable for machine learning problems involving sequential data. It is a powerful type of neural network that has contributed to the breakthroughs in deep learning over the past few years [47].
GRU (Gated Recurrent Unit): a type of RNN that differs from LSTM in that it processes data faster and requires fewer parameters [48]. It has two gates: update and reset. It has no separate memory cell and works directly on the unit's state. The update gate determines how much information is refreshed, and the reset gate determines how much previous information is forgotten. When the reset gate is set to zero, the unit reads the input and discards the previously computed state [49].
SVD (Singular Value Decomposition): a technique frequently used in natural language processing and information retrieval. To recover the most useful attributes, it factorizes the rectangular matrix A (m × n) into three matrices. When dealing with vast volumes of data and dimensionality reduction, SVD offers the mathematical framework for text classification and latent semantic indexing. It removes noise and redundancy from high-dimensional data, resulting in cleaner data after words that appear in almost every document are removed [50].
MaxEnt: also known as maximum entropy or multinomial logistic regression, a multi-class classification technique. MaxEnt uses a linear combination of the features of a review to calculate the probability of each class for that review [51].
Decision tree: a classification technique [52]. It presupposes that all features have finite discrete domains and that the classification is represented by a single target feature (i.e., the leaves of the tree) [51] [53].
J48: one of the leading machine learning algorithms for categorizing and examining data. It is used in many applications and gives accurate classification results. It splits the data into subsets on one attribute at a time to reach a decision, selecting the attribute with the highest normalized information gain to split the data [54].
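To make the Multinomial Naive Bayes description above concrete, the following is a minimal sketch in plain Python, not the implementation used in any of the reviewed papers. The function names (`train_mnb`, `predict_mnb`), the add-one (Laplace) smoothing, the whitespace tokenization, and the four toy requirement sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing (assumed)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    model = {"log_prior": {}, "log_lik": {}}
    for c, n_c in class_docs.items():
        # Prior: share of training documents that carry class c.
        model["log_prior"][c] = math.log(n_c / len(docs))
        # Likelihood: smoothed relative frequency of each word within class c.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model

def predict_mnb(model, doc):
    """Return the class with the highest posterior log-score."""
    scores = {}
    for c, log_prior in model["log_prior"].items():
        score = log_prior
        for w in doc.lower().split():
            if w in model["log_lik"][c]:  # out-of-vocabulary words are ignored
                score += model["log_lik"][c][w]
        scores[c] = score
    return max(scores, key=scores.get)

# Toy, made-up requirement sentences (not taken from PROMISE).
train_docs = [
    "the system shall allow users to upload files",
    "the user shall be able to search products",
    "the system shall respond within two seconds",
    "the system shall encrypt all stored passwords",
]
train_labels = ["FR", "FR", "NFR", "NFR"]

model = train_mnb(train_docs, train_labels)
prediction = predict_mnb(model, "users shall search files")
print(prediction)
```

The conditional independence assumption appears in `predict_mnb` as a plain sum of per-word log-likelihoods; working in log space avoids numerical underflow on longer requirements.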

2-3 Text vectorization models and feature extraction
Machine learning techniques require numerical inputs to perform classification. Software requirements are documented as text, so to build an ML classifier the text must first be converted into numerical feature vectors using a word embedding or vectorization model. Several techniques were used to convert text data into numerical vectors:
1-BoW (Bag of Words): a straightforward and efficient method for extracting information from text sources. It converts text documents to numeric vectors, yielding for each document a vector of the frequencies of all distinct words in the document vector space [55]. Using BoW, requirement j is expressed as the vector $X_j = (x_{1,j}, \ldots, x_{i,j}, \ldots, x_{n,j})$, where $x_{i,j}$ is the weight of feature i, computed as the frequency of word i in requirement j, and n is the number of terms in the dictionary. The manually labeled requirements are then converted into vectors and used to train classifiers with supervised machine learning algorithms [56].
2-TF-IDF (Term Frequency-Inverse Document Frequency): combines two measures: the frequency of a term in a given document and the inverse document frequency of the term. The latter is calculated by dividing the total number of documents by the document frequency of the term and then applying logarithmic scaling to the result [57], as shown in Equation 1:
$$\mathrm{idf}_i = \log\frac{N}{\mathrm{df}_i} \qquad (1)$$
When the two measures are combined, the weight is given by Equation 2:
$$\text{TF-IDF}(\mathrm{Term}_{i,j}) = \mathrm{tf}_{i,j} \times \mathrm{idf}_i \qquad (2)$$
For term i and document j, $\mathrm{tf}_{i,j}$ is the term frequency, $\mathrm{df}_i$ is the number of documents that contain term i, N is the total number of documents, and $\mathrm{idf}_i$ is the inverse document frequency.
3-Chi-square (CHI2): a statistical test that analyzes the deviation from the expected distribution when the occurrence of a feature is assumed to be independent of the class value. It measures the lack of independence between a term t and a class c [58], as defined in Equation 3:
$$\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \qquad (3)$$
where N is the total number of documents, A is the number of times t and c occur together, B is the number of times t occurs without c, C is the number of times c occurs without t, and D is the number of times neither c nor t occurs [59].
4-Part-of-speech tagging: the technique of labeling each word in a text with its part of speech, based on both its definition and its context; it is also known as POS tagging or grammatical tagging [60].
5-Word2Vec: a deep learning-based predictive model in the category of unsupervised models that is used to compute high-quality, dense, distributed word representations. Words are represented as continuous vectors that capture contextual and semantic similarities [61].
6-AUR-BoW: when user reviews are split into sentences, most of the resulting sentences are too short to be classified reliably. To work around this problem, several similar words are added to the user reviews. This technique is called AUR-BoW [59].
7-Bagging: also called bootstrap aggregating, a widely used ensemble learning method for reducing variance within a noisy data set. It works as follows. First, random samples of the training set are drawn with replacement, so individual data points may be selected more than once. After many such samples have been produced, a weak model is trained separately on each, and their predictions are combined according to the task: the mean of the predictions for regression, or the majority vote for classification, gives a more accurate estimate [62].
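The BoW, TF-IDF, and chi-square weights described in items 1 to 3 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code of any reviewed study: the helper names, the whitespace tokenization, the unsmoothed natural-log IDF, and the four toy requirements are all hypothetical. The chi-square function follows the standard four-cell form with the counts A, B, C, and D as defined in the text.

```python
import math
from collections import Counter

def bow_vectors(docs):
    """Bag of Words: one raw term-count vector per document."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

def tfidf_vectors(docs):
    """TF-IDF: tf(i, j) * log(N / df(i)), without smoothing (assumed variant)."""
    vocab, tf = bow_vectors(docs)
    n_docs = len(docs)
    df = [sum(1 for row in tf if row[i] > 0) for i in range(len(vocab))]
    idf = [math.log(n_docs / df_i) for df_i in df]
    return vocab, [[row[i] * idf[i] for i in range(len(vocab))] for row in tf]

def chi_square(docs, labels, term, cls):
    """Chi-square score of a term for a class from the counts A, B, C, D."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc.lower().split()
        if has_term and label == cls:
            a += 1          # term and class occur together
        elif has_term:
            b += 1          # term occurs without the class
        elif label == cls:
            c += 1          # class occurs without the term
        else:
            d += 1          # neither occurs
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# Toy, made-up requirements.
docs = [
    "system shall encrypt data",
    "system shall respond fast",
    "user shall upload data",
    "user shall search data",
]
labels = ["NFR", "NFR", "FR", "FR"]

vocab, tfidf = tfidf_vectors(docs)
print(tfidf[0][vocab.index("shall")])   # word present in every document
print(chi_square(docs, labels, "user", "FR"))
```

A word that occurs in every document, such as "shall" here, receives an IDF of log(1) = 0 and therefore carries no weight, which is the behavior that motivates TF-IDF over raw counts.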

2-4 Performance Metrics
A set of performance measures is used to evaluate a classifier's performance in machine learning tasks. Performance is verified with fixed mathematical formulas that compare the model's predictions with the true values in the data set being used [63]. We outline below the measures that are commonly considered when evaluating machine learning tasks.

Confusion matrix
The confusion matrix is frequently used in binary classification tasks. It shows how well the items in the validation set are classified and gives more detail about the classifier's performance. Figure 1 shows the names given to each outcome of a class prediction, according to the difference between the true and predicted values [64] [65].

Accuracy
Accuracy is the percentage of samples that are correctly classified overall [66]. If the validation set has size N, accuracy is given by Equation 4:

$$\mathrm{Accuracy} = \frac{TP + TN}{N} \qquad (4)$$
or, equivalently, by Equation 5:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)$$
Because accuracy does not show how well a model handles a particular class, it is considered a primitive measure. If a validation set contains four positive samples and six negative samples, and the classifier predicts that all ten samples are negative, the classifier achieves an apparent accuracy of 60%. On closer inspection, however, the model labeled everything negative and failed to capture the features that distinguish the two groups, so it is in fact a poor classifier [67].
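The four-positive, six-negative example above can be checked with a short sketch. The function names and the encoding of the positive class as 1 are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

# Four positive and six negative samples; the classifier predicts all negative.
y_true = [1] * 4 + [0] * 6
y_pred = [0] * 10

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
print(tp, fp, tn, fn, acc)
```

The counts show why the 60% figure is misleading: every positive sample ends up as a false negative, so the classifier never identifies the positive class at all.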

Precision
Precision is a performance metric for machine learning tasks that relates the number of samples correctly classified as positive to the total number of samples classified as positive [68]: the number of correct positive classifications is divided by the total number of positive classifications made [69]. It is thus the ratio of true positives (TP) to all predicted positives (TP + FP) [63], as in Equation 6:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (6)$$

Recall
Recall is the percentage of actual positive samples that were correctly predicted [70]. Also referred to as sensitivity or the true positive rate, recall measures how well a classifier predicts the actual positive samples [71]. It is calculated as in Equation 7:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (7)$$

F1-score
The F1-score is a measure of test accuracy. It is a weighted average of recall and precision [72]. It is often appropriate to combine precision and recall into a single measure, the F1-score (also called F-measure), especially when comparing two classifiers. The F1-score is the harmonic mean of precision and recall. While the arithmetic mean treats all values equally, the harmonic mean gives lower values greater weight, so a classifier obtains a high F1-score only if both recall and precision are good [73]. The F1-score is given by Equation 8:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (8)$$
The Fβ-score is the weighted harmonic mean of recall and precision, where β sets the relative importance of the two [52], as in Equation 9:
$$F_\beta = \frac{(1 + \beta^2) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^2 \times \mathrm{Precision} + \mathrm{Recall}} \qquad (9)$$
The measure weights recall and precision equally when β = 1, in which case the F1-score is highest when recall = precision = 1 and lowest when recall = precision = 0.
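As a worked illustration of precision, recall, and the Fβ-score, the sketch below evaluates them on hypothetical counts (TP = 8, FP = 2, FN = 4). The counts and function names are invented for the example, and returning 0.0 when a denominator is zero is a convention assumed here, not something the reviewed papers specify.

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 by convention when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN); 0.0 by convention when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r."""
    b2 = beta ** 2
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom else 0.0

# Hypothetical confusion-matrix counts.
tp, fp, fn = 8, 2, 4

p = precision(tp, fp)        # 8 / 10
r = recall(tp, fn)           # 8 / 12
f1 = f_beta(p, r)            # beta = 1: precision and recall weighted equally
f2 = f_beta(p, r, beta=2.0)  # beta = 2: recall weighted more heavily
print(p, r, f1, f2)
```

Because recall is lower than precision in this example, weighting recall more heavily (β = 2) pulls the score below F1.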

2-5 Methodology
ML techniques have been widely used by researchers to classify SSR. Applying ML algorithms reduces the time experts spend and improves accuracy.
Zahra Shakeri Hossein et al [74] examined 625 requirements from the "Open Science tera-PROMISE" collection. The goal of their research was to determine how to improve the automated classification of requirements into FR and NFR, and how well various machine learning algorithms perform in the classification process. Their methodology was based on preprocessing. They found that preprocessing, which standardizes and normalizes requirements before the classification algorithms are applied, improved the effectiveness of the existing classification technique. They also examined how algorithms such as Latent Dirichlet Allocation, Biterm Topic Modeling, and Naive Bayes performed for subclassifying NFRs. Advantages: using a preprocessing method in the FR/NFR classification process, as well as subclassifying NFRs into subcategories, can result in greater classification accuracy.
Bruno Cordeiro Mendes and Edna Dias Canedo [63] classified software requirements into FR and NFR with subcategories using machine learning techniques. For the requirements classification task, the researchers compared different text feature extraction techniques in combination with machine learning algorithms. The feature selection techniques BoW, TF-IDF, and CHI2 were used, together with the classification algorithms Logistic Regression (LR), SVM, MNB, and KNN. The study used the PROMISE_exp data set and then applied TF-IDF to weight the features of the requirements. LR gave the best classification result, with an accuracy of up to 0.91. Advantages: for binary classification and for non-functional requirements classification, the combination of TF-IDF with LR has the best performance metrics. Disadvantages: automatic classification performance suffers on unbalanced data, where some labels have few requirements.
Nouf Rahimi et al [75] published a study on classifying software requirements (SRs): binary classification of SRs into FRs or NFRs, and multi-class classification of both FRs and NFRs into various categories. The approach combined four deep learning models, long short-term memory (LSTM), bidirectional long short-term memory (BiLSTM), a gated recurrent unit (GRU), and a convolutional neural network (CNN), using three ensemble methods: mean ensemble, accuracy as a weight ensemble, and accuracy per class as a weight ensemble. Models were trained and tested on the PROMISE dataset. The two-phase classification system outperformed the single-phase approach: the accuracy of the one-phase system was 92.56%, while the two-phase system reached 95.75% in the binary phase and 93.4% in the multiclass phase. Advantages: the creation and distribution of SR classification systems that will assist software engineers, developers, and analysts in creating complete SRs for the development of reliable software systems. Disadvantages: the proposed model and classification systems support only one language and only SRs written as structured documents, from which sentences are recovered by processing the extracted structured sentences.
[76] presented the relationship between requirements engineering and NLP in order to perform binary classification of requirements into FRs and NFRs. The work used natural language processing together with singular value decomposition (SVD) on the TERA-PROMISE dataset. The author employed five representation models: TF, TF-IDF, TF-IDF-CF, Bigram, and Trigram. Advantages: cosine distance was calculated using the SVD model, and under this cosine distance the trigram was the best representation model. Disadvantages: the classification depends on high-frequency words in documents belonging to the same category, which means that frequency represents both the document and the category at the same time; this is the method's weakness.
Ishrar Hussain et al [6] applied natural language processing (NLP) approaches to software requirements engineering with the goal of detecting NFR sentences using a text classifier with a part-of-speech (POS) tagger. The authors used 10-fold cross-validation on the same data used in the literature. The reported accuracy was 98.56%. Advantages: software analysts can flag NFRs in SRS text documents to users, avoiding oversights in the development process that might result in poor quality of the final product and, eventually, project failure. Disadvantages: it was not possible to build a complete prototype.
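The 10-fold cross-validation used in [6] can be sketched as follows. This shows only how the train/test splits are formed; it is not the authors' code, and the function name, the fixed random seed, and the sample count of 25 are assumptions made for illustration.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(25, k=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

In each round a classifier would be trained on `train_idx` and scored on `test_idx`, and the k scores averaged, so that every requirement contributes to the reported accuracy exactly once as test data.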

Muhammad Mahmoud Al-Tarawneh
Kortanovic et al [77] used metadata, lexical, and syntactic features with the support vector machine (SVM) method to build and evaluate a supervised machine learning technique. The authors relied on these techniques to classify software requirements into FR, NFR, and subcategories of NFR, using the PROMISE repository. Advantages: in place of the data set used for this task, requirements might also be gathered from user comments. Disadvantages: user reviews are typically brief, unstructured, and rarely follow grammar and punctuation rules, resulting in reduced accuracy.
Tamai and Taichi Anzai [78] used machine learning technology. They developed the QRMiner tool to analyze quality requirements statements in software requirements specifications (SRS) and classify them by quality characteristic. Thirteen SRS documents created for real-world applications were used in the case studies. Advantages: the use of recent machine learning, deep learning, and Doc2Vec technologies, which greatly enhanced the performance of QRMiner, and the use of openly available requirements documents rather than data from student projects or open source projects. Disadvantages: the SRS must be written in English only.
Hui Yang and Peng Liang [79] proposed an approach in which requirements information is automatically identified in user reviews and classified into FRs and NFRs, using both TF-IDF and NLP (regular expressions), with human intervention to select the keywords used to identify and classify requirements. The approach was validated on user reviews of the popular app iBooks from the English-language app store. Advantages: it is useful and practical for app developers to elicit requirements from user reviews. Disadvantages: when hundreds or thousands of requirements flow to developers, the classified requirements cannot be further prioritized to show their importance.
Walid Maalej et al [80] presented several techniques for classifying app reviews into categories such as user experiences, text ratings, bug reports, and feature requests. They used review metadata such as time and star rating, text classification, sentiment analysis, and natural language processing. A series of experiments was carried out to assess the accuracy of the techniques and to compare them with simple string matching. Overall, metadata alone was found to give poor classification accuracy. When metadata was combined with basic text classification and natural language preprocessing of the text, notably of capital and lowercase letters, the classification precision and recall for all review categories rose to 88-92 percent and 90-99 percent, respectively. Multiple binary classifiers outperformed single multiclass classifiers. Advantages: the approach helps filter reviews relevant to particular stakeholders such as developers, analysts, and other users. Disadvantages: stopword removal and lemmatization should be applied with care in NLP text preprocessing, since stopword removal can lower classification accuracy.
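The kind of text preprocessing discussed for [80] (case normalization and optional stopword removal) can be sketched as below. The stopword list is a tiny illustrative set and the regular expression is an assumed tokenizer; neither is taken from the reviewed study. Lemmatization is left out because it needs a linguistic resource beyond the standard library.

```python
import re

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "it"}

def preprocess(text, remove_stopwords=True):
    """Lowercase the text, split it into word tokens, optionally drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

review = "The app CRASHES when I try to open a PDF, please fix it!"
tokens = preprocess(review)
print(tokens)
```

Stopword removal is kept behind a flag because, as noted above, it can lower classification accuracy; the two settings would need to be compared on the data at hand.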
Jonas Winkler and Andreas Vogelsang [2] proposed an approach for automatically classifying the content elements of a natural language requirements specification as "requirement" or "information", using convolutional neural networks. The data set was a DOORS database belonging to an industrial partner. Advantages: the method can be used to classify content items in documents that have not been classified before, or to analyze documents that are already classified and point the author to possibly incorrect classifications of content items. Disadvantages: the user is given only the classification results, with no explanation of why a content item was classified incorrectly, and accuracy and recall are not reasonably high.
Abderahman Rashwan et al [81] presented a method for automated analysis of SRS documents for different types of NFR, using a Support Vector Machine (SVM) classifier to automatically categorize requirement sentences into distinct ontology classes. The resulting classes are Functional, External and Internal Quality, Constraints, and Other NFR. The PROMISE corpus and the Concordia RE corpus were the data sets used. Advantages: the findings will interest researchers who evaluate the effort spent on building requirements in general and on improving software quality in particular. Disadvantages: the focus was specifically on NFR rather than FR.
Muhammad Younas and Karzan Wakil [82] applied the Word2Vec model and common keywords to identify subtypes of NFR. The result is an automated approach based on semantic similarity that identifies NFRs in requirements documents without requiring pre-classified requirements. The approach was evaluated on the PROMISE-NFR data set in terms of precision, recall, and F-measure. The findings suggest that a semi-supervised automated approach to NFR detection reduces manual human work. Advantages: because the method does not need pre-classified requirements for training the Word2Vec model, manual work in the NFR identification process is reduced. Disadvantages: the NFR types in the off-the-shelf PROMISE data set are limited to those specified by the data set's developer. Furthermore, the data set may contain some mislabeled NFR classifications, and the Word2Vec model is tied to Wikipedia's vocabulary: the model cannot compute a similarity value if a word in the requirements does not occur in Wikipedia.
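A keyword-similarity approach of the kind described for [82] can be illustrated with cosine similarity over word vectors. The sketch below is not the authors' method: the three-dimensional embeddings, the keyword lists, and the rule of taking the maximum similarity are all hypothetical stand-ins for a trained Word2Vec model and a curated keyword set.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy embeddings; a real model would supply these vectors.
EMBEDDINGS = {
    "fast": [0.9, 0.1, 0.0],
    "speed": [0.8, 0.2, 0.1],
    "respond": [0.7, 0.1, 0.2],
    "encrypt": [0.0, 0.9, 0.2],
    "secure": [0.1, 0.8, 0.3],
}
# Hypothetical indicator keywords per NFR type.
NFR_KEYWORDS = {"performance": ["speed"], "security": ["secure"]}

def nfr_scores(requirement):
    """Score each NFR type by its best word-to-keyword similarity."""
    words = [w for w in requirement.lower().split() if w in EMBEDDINGS]
    scores = {}
    for nfr_type, keywords in NFR_KEYWORDS.items():
        sims = [cosine(EMBEDDINGS[w], EMBEDDINGS[k]) for w in words for k in keywords]
        scores[nfr_type] = max(sims) if sims else 0.0
    return scores

scores = nfr_scores("the system shall respond fast")
best = max(scores, key=scores.get)
print(best, scores)
```

Words missing from `EMBEDDINGS` are skipped, which mirrors the limitation noted above: a requirement whose words are outside the model's vocabulary yields no similarity value.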
Mengmeng Lu and Peng Liang [59] automatically classified user reviews into four categories of NFR (usability, dependability, performance, and portability), functional requirements (FRs), and others. This was accomplished by combining four feature extraction techniques (TF-IDF, CHI2, BoW, and AUR-BoW) with three machine learning methods (J48, Naive Bayes, and Bagging). The study's data came from iBooks and WhatsApp. The results show that combining AUR-BoW with Bagging produces the best outcome of all combinations (71.4 percent precision, 72.3 percent recall, and 71.8 percent F-measure). Advantages: automatic NFR classification of user reviews may help application developers better understand user reviews and address user demands from an NFR standpoint, as well as help developers retain existing users and attract new ones. Disadvantages: two categories of NFRs, compatibility and security, do not exist in the experimental data set, and the number of NFRs for portability and performance is relatively small.
Pir Sami Ullah Shah et al [83] developed an automated classification of software requirements into two broad categories, functional and non-functional, using natural language processing and machine learning. They used NLP, TF-IDF, Support Vector Machine, Naive Bayes, and a Recurrent Neural Network (RNN). On the Software Requirements Dataset, taken from the Kaggle repository, the RNN achieved the highest accuracy, 92 percent. Advantages: the study gives NFRs as much attention as FRs; software developers mostly focus on FRs rather than NFRs, which ends in massive software failures, users have trouble describing NFRs, and NFRs are sometimes hidden in user stories. Disadvantages: NFRs are not further classified into safety, security, performance, and usability requirements.
S Tiun et al [5] used Word2Vec and the FastText model on the RE'17 data challenge data set to see how word embeddings compare with traditional features (such as bag of words) in NFR and FR classification, and to understand whether the best performance in classifying NFR and FR requires a complex neural classifier. The findings showed that FastText is a good classification model: it obtained the highest F1 score, 92.8 percent. Advantages: FastText is successful in binary text classification when the documents to be classified are very short and have a small vocabulary. Disadvantages: FastText fails to classify large documents with a large vocabulary, in which case TF-IDF with Naive Bayes (NB) should be considered as the classification model.
Alex Dekhtyar and Vivian Fong [84] applied learning with TensorFlow and Word2Vec-based representations to classification problems in requirements documents, comparing three classes of machine learning techniques for identifying requirements in the SecReq and NFR data sets. The first, used as the baseline, applied Naive Bayes over word count and TF-IDF representations of the requirements. In the other two approaches, TensorFlow convolutional neural networks were trained on random and on pretrained Word2Vec embeddings of the words in the requirements. The SecReq data set was used in the study. Advantages: using Word2Vec to represent individual words in requirements improves classification accuracy by a significant amount. Disadvantages: the classification covered only two categories, security requirements or NFR, regardless of the other sub-types of NFR (reliability, usability, etc.) and FR.
Vivian Fong [52] applied a Naïve Bayes classifier and a deep learning convolutional neural network (CNN) classifier to the automatic classification of software requirements, using word embeddings to represent documents when training the CNN. The datasets used for training and testing were the Quality Attributes (NFR) dataset (PROMISE corpus) and the SecReq dataset. Advantages: The study compares three word embedding strategies for representing requirements documents while training CNNs, and provides a set of evaluations of requirements categorization on two well-studied datasets. Disadvantages: When configuring the CNNs, the emphasis is on filter sizes, filter count, and number of training epochs, leaving out a vast array of other CNN hyperparameters. Additionally, the research did not investigate the FastText classifier or compare its performance and training time with those of the CNN.

2-6 Data Sets
When performing software requirements classification using deep learning techniques, a data set must be provided for training and testing the model. The following paragraphs describe a number of data sets that have been used in research on the classification of software requirements.

PROMISE repository
The repository contains 625 labeled natural language requirements (255 functional requirements and 370 non-functional requirements). The labels first group the requirements into FR and NFR; the NFRs are further divided into eleven sub-categories. Table 1 shows the number of requirements for each category of software requirements in this repository.
Table 1. Number of requirements in the PROMISE repository

PROMISE-Exp
PROMISE-Exp is an expansion of the original PROMISE dataset. New software requirements were added, and the resulting dataset was evaluated using known machine learning methods. The added requirements were identified and extracted through a manual study, following the types of software requirements used in the original repository. The results of the ML algorithms used to validate this extension were compared with the results obtained on the original repository for similar methods. It was found that the new PROMISE-Exp database can be used in research with ML algorithms for the automated software requirements classification task, and that it represents an increase of 55% over the original PROMISE database. The number of requirements of each type before and after the expansion is shown in Table 2.

Corpus Repository
This is one of the datasets used by [85] and is available for download via [86]. It contains 15 SRS problem statements from various disciplines, with a total of 765 sentences. Of these, 270 (35 percent) carry the "FR" annotation and 495 (65 percent) carry the "NFR" annotation; the NFR and FR subsets are referred to as CorpusN and CorpusF, respectively.

SRS (NIRS: National Institute of Radiological Sciences, JUAS: Japan Users Association of Information Systems, IPA: Information Technology Promotion Agency, RFP: Requests for Proposal)
This dataset consists of thirteen documents available online that are used by Japanese local governments or other public entities. The majority of the requests for proposals (RFPs) concern information systems, although there are exceptions, such as RFPs for medical systems. In total, 11,538 requirement sentences, all written in Japanese, were gathered; they are shown in Table 3.

iBooks app
Out of 1,000 user reviews of the iBooks app in the English-language app store, 217 contain FR information and 622 contain NFR information (some user reviews may contain both FR and NFR information); these labels serve as the ground truth.
Apple store data and Google store data
Review data were gathered from the Apple App Store and the Google Play Store. Around 1.1 million reviews were collected for 1,100 applications, half of which are paid and the other half free. From the Google store, which restricted the collection of reviews, only 146,057 reviews for 80 applications (half paid, half free) were obtained. A random sample was then drawn from the collected data for manual tagging: 1,000 reviews from the Google store data and 1,000 reviews from the Apple store data.

DOORS database
The DOORS (Dynamic Object-Oriented Requirements System) database contains 10,000 items extracted from 89 documents. These items fall into two categories: information and requirements.

PROMISE Corpus
The PROMISE Corpus consists of 15 SRS documents developed as semester projects by Master's students at DePaul University. These specifications contain a total of 326 NFRs and 358 FRs. The NFRs are divided into 9 categories: availability (A), look-and-feel (LF), legal (L), operational (O), performance (P), maintainability (M), security (SE), usability (US), and scalability (SC). Table 4 shows the NFR classes and the number of sentences for each class in the PROMISE Corpus.

SecReq
SecReq is a data set used in research to improve the identification of security requirements. The data set consists of requirements labeled as either security-related or non-security-related. A Naïve Bayes classifier was trained on the data for classification.

Results And Discussion
Summary of results
A summary of the relevant published papers considered in this review is given in Table 5.

Results Discussion
Looking at the data sets used in studies on requirements classification, we found that most studies used the PROMISE repository for training and testing. It contains a set of publicly available data sets and tools that serve researchers in building predictive software models (PSM) and the software engineering community in general. The survey of previous studies also showed that natural language processing is one of the first and most important stages carried out before building models for requirements classification. Several different text vectorization and feature extraction techniques were used to convert text data into numerical vectors, and among them Term Frequency-Inverse Document Frequency (TF-IDF) outperformed the other techniques performing the same function. There are likewise multiple methods for classifying software requirements with different machine learning algorithms: some are supervised, others are unsupervised, and there are also semi-supervised algorithms. Requirements classification takes place at two levels. At the first level, the objective is a binary classification of requirements into FRs and NFRs only; at the second level, the objective is a multi-class classification of requirements into FRs and NFRs together with the sub-types of NFRs.
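As an illustration of how TF-IDF converts requirement text into numerical vectors, the sketch below uses the plain weighting tf(t, d) × log(N / df(t)). This is one common variant, assumed here for simplicity; library implementations often add smoothing and normalization, and the three example sentences are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocab = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, vectors


# Hypothetical requirement sentences used only to show the weighting.
docs = [
    "the system shall encrypt all data",
    "the system shall respond quickly",
    "the user shall reset the password",
]
vocab, vectors = tfidf_vectors(docs)
weights_doc0 = dict(zip(vocab, vectors[0]))
print(weights_doc0["shall"], weights_doc0["system"], weights_doc0["encrypt"])
```

Terms that occur in every document, such as "shall", receive a weight of zero, while terms specific to one requirement, such as "encrypt", receive the largest weights, which is why TF-IDF tends to highlight the words that discriminate between requirement classes.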
By observing the results in the previous table, it was found that the highest accuracy, 0.9856, was reported by Ishrar Hussain et al. [6], who used a text classifier equipped with a part-of-speech (POS) tagger to identify non-functional requirement subtypes, with a Precision of 0.98, a Recall of 1, and an F-Measure of 0.99.
The lowest accuracy, 0.81, was found in the study by Winkler and Andreas Vogelsang [2], who applied convolutional neural networks to classify the content of requirements documents into two basic categories, requirements and information; the Precision reached 0.815, with a Recall of 0.82 and an F-Measure of 0.81.
A large difference in classification accuracy was also found within the study by Vivian Fong [52], which applied deep learning techniques to three classification questions over requirements documents (whether they are safety requirements or not, whether they are non-functional requirements from other categories or not, and whether they are non-functional requirements or security requirements). With the Naïve Bayes classifier, the classification accuracy was about 0.858, with a Precision of 0.888, a Recall of 0.817, and an F-Measure of 0.841. With convolutional neural networks applied to the same classification, the accuracy was higher, reaching 0.94, with a Precision of 0.935, a Recall of 0.883, and an F-Measure of 0.903. This indicates that convolutional neural networks provide better results in this classification process than the Naïve Bayes classifier.
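The four measures compared above follow the standard definitions based on the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The sketch below shows those definitions; the function names and the confusion-matrix counts are hypothetical. Note that F-Measure values reported in the reviewed papers may be averaged over classes or folds, so they need not equal the harmonic mean of the reported averaged Precision and Recall.

```python
def f_measure_from(precision, recall):
    """F-Measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f_measure) for a binary classifier."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall, f_measure_from(precision, recall)


# Hypothetical confusion-matrix counts for an NFR-vs-FR classifier.
accuracy, precision, recall, f_measure = classification_metrics(tp=8, fp=2, fn=2, tn=8)

# Precision 0.98 and Recall 1.0, as reported for [6], give an F-Measure of about 0.99.
print(round(f_measure_from(0.98, 1.0), 2))  # 0.99
```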

Conclusion
From the metric results of previous research in Table 1, it was found that in the binary classification of requirements (the first level of classification), the highest Accuracy was achieved when applying Latent Dirichlet Allocation, Biterm Topic Modeling, and Naïve Bayes to classify the requirements in a document into FRs and NFRs: Accuracy reached 94.72%, Precision 94%, Recall 95%, and F-Measure 94%.
The lowest Accuracy in binary classification, 83%, was obtained when applying convolutional neural networks to classify the content of a requirements document as either a requirement or information. The lowest Precision, 55%, was obtained when applying natural language processing techniques with TF-IDF for feature extraction. The lowest Recall and F-Measure, 30% and 40% respectively, were obtained when using natural language processing techniques and TF-IDF feature extraction with an RNN.
In multi-class classification (classifying requirements into multiple classes), all metrics reached their maximum when using a text classifier equipped with a part-of-speech (POS) tagger to classify the NFRs: 98.56% Accuracy, 98% Precision, 100% Recall, and 99% F-Measure. The minimum Accuracy in multi-class classification, 88.5%, was obtained when using the Naïve Bayes classifier to classify requirements as security requirements or not, and as FRs or NFRs. When the Word2Vec model and popular keywords for the identification of NFRs were used to obtain the subtypes of non-functional requirements, the minimum values of the remaining metrics were reached: 62.25% Precision, 44.49% Recall, and 42.28% F-Measure.

Acknowledgements
This paper and the research behind it would not have been possible without the exceptional support of my supervisor, Ass. Prof. Dr. Nada N. Saleem. His enthusiasm, knowledge, and keen attention to detail have been inspiring and kept my work on the right track from the very beginning of this research all the way to the list of references.