
Summary

By: Reem Alkasimi

Table of Contents

Machine learning
Artificial neural network
K-Nearest Neighbor
Support Vector Machine (SVM)
CART
C4.5
REPTree
Alternating Decision Trees (ADTree)
LADTree Algorithm
Naïve Bayes
NBTree
Deep learning
Model ensembles
Bagging
Boosting
AdaBoost Algorithm
MultiBoosting
RandomTree Classifiers
Random Forests
Rotation Forest
The Random Subspace Method
Classification model evaluation
Model evaluation procedures

Machine learning

Machine learning is the programming of computers to improve their performance on a task using example data or past experience. A machine learning model may be predictive, to forecast the future; descriptive, to gain knowledge from data; or both. Because machine learning is a part of artificial intelligence, a learning system can adapt to changes, so the system designer does not need to anticipate and provide solutions for every possible situation. Machine learning also helps us solve many problems in vision, speech recognition, and robotics.

Artificial neural network

Artificial neural networks are inspired by the way the human brain functions. The brain consists of a very large number of neurons working in parallel, whereas a conventional computer has a single processor. A perceptron with a single layer of weights can only approximate linear functions of the input and cannot be used for nonlinear regression. This limitation does not apply to networks with intermediate or hidden layers between the input and output layers. Such applications have a clear practical advantage when implemented on machines: if we can understand how the brain performs its functions, we can express solutions to these tasks as formal algorithms and implement them on computers.

K-Nearest Neighbor

The k-nearest-neighbor technique is computationally intensive when given large training sets, and it has been widely used in the field of pattern recognition. The training tuples are described by n attributes,

so all the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to that unknown tuple.

The unknown tuple is assigned the most popular class among its k nearest neighbors. When k = 1, the unknown tuple is assigned the class of the training tuple closest to it in pattern space. Many measures can be used to calculate the distance between two points; however, the preferable distance measure is one for which a smaller distance between two objects implies a greater likelihood that they belong to the same class. For instance, if k-NN is being applied to classify documents, it may be better to use cosine similarity instead of Euclidean distance.
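As a concrete sketch of the procedure above, here is a minimal k-NN classifier in NumPy; the toy points, labels, and the choice of Euclidean distance are illustrative assumptions only:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training tuples."""
    # Euclidean distance from the query point to every stored tuple
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest tuples
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]          # the most popular class wins

# Toy pattern space: two small clusters in 2-D
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
label = knn_predict(X, y, np.array([0.05, 0.1]), k=3)  # query near cluster 0
```

For document classification, the distance line would be swapped for a cosine-based measure, as noted in the text.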

Support Vector Machine (SVM)

SVM is considered one of the most robust and accurate methods among all known machine learning algorithms. In a two-class learning task, the goal of SVM is to find the best classification function to distinguish between members of the two classes in the training data. SVM ensures that the best function is found by maximizing the margin between the two classes. The main reason SVM insists on finding the maximum-margin hyperplane is that this offers the best generalization ability: it allows not only the best classification performance on the training data, but also leaves much room for the correct classification of future data.
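The maximum-margin idea can be illustrated with scikit-learn's SVC on a linearly separable toy set (the data points and parameter values are invented for the example):

```python
from sklearn.svm import SVC

# Two linearly separable classes in 2-D (toy data, illustrative only)
X = [[0, 0], [0, 1], [1, 0], [2, 2], [2, 3], [3, 2]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel finds the maximum-margin hyperplane between the classes
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
pred = clf.predict([[0.2, 0.3], [2.5, 2.5]])  # one query near each cluster
```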

CART

Classification and Regression Trees (CART) are machine-learning methods for constructing prediction models from data. CART is a binary recursive partitioning procedure that can process continuous and nominal attributes both as targets and as predictors. The procedure generates trees that are invariant under any order-preserving transformation of the predictor attributes. The CART mechanism aims to produce a sequence of nested pruned trees that are candidates for the optimal tree. CART does not provide any internal performance measures for selecting trees based on training data; such measures are regarded as suspect. Tree performance is always measured on independent test data, and tree selection proceeds only after test-data-based evaluation. If no test data exist and cross-validation is not performed, CART remains neutral regarding which tree in the sequence is best. In its default classification mode, CART always calculates class frequencies in any node relative to the class frequencies in the root. In CART, the handling of missing values is fully automatic and locally adapted in each node. At each node in the tree, the selected splitter induces a binary partition of the data.
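Binary recursive partitioning can be sketched with scikit-learn's DecisionTreeClassifier, which implements an optimized variant of CART (the toy data below, where the class depends only on the first attribute, is invented for the example):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: the class is determined entirely by the first binary attribute
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]

# Each internal node of the fitted tree induces a binary partition of the data
tree = DecisionTreeClassifier()
tree.fit(X, y)
acc = tree.score(X, y)  # training accuracy of the fully grown tree
```

In scikit-learn, the nested sequence of pruned candidate trees described above corresponds to cost-complexity pruning (`cost_complexity_pruning_path` and the `ccp_alpha` parameter).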

C4.5

C4.5 generates classifiers expressed as decision trees; in addition, it can construct classifiers in the more comprehensible ruleset form. C4.5 uses two heuristic criteria to rank possible tests: information gain, which minimizes the total entropy of the subsets, and the default gain ratio, which divides information gain by the information provided by the test outcomes. C4.5 checks the estimated error that would result if a subtree were replaced by one of its branches, and when this appears beneficial, the tree is modified accordingly.

REPTree

REPTree builds a decision or regression tree using information gain/variance reduction and prunes it using reduced-error pruning. You can set the minimum number of instances per leaf, the maximum tree depth, the minimum proportion of training-set variance for a split, and the number of folds for pruning.

Alternating Decision Trees (ADTree)

ADTree is a machine learning method for classification. It generalizes decision trees and is directly connected to boosting: in each boosting iteration, three nodes are added to the tree. Another appealing property of AD Trees, not possible with other classic boosting procedures, is their ability to be merged together. This is a useful feature in the context of multiclass problems, because such problems are usually reformulated in the two-class setting using one or more classes against the others; in this setting, AD Trees can be combined into a single classifier.

LADTree Algorithm

The LADTree learning algorithm applies the logistic boosting algorithm to induce an alternating decision tree.

Naïve Bayes

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions among the features. The naive Bayes method is important for several reasons. The classifiers are very easy to construct and do not need any complex iterative parameter-estimation schemes, which means the method can easily be applied to large data sets. The results are easy to interpret, so users unskilled in classification technology can understand why a classification was made. Finally, it often performs surprisingly well: it may not be the best possible classifier in any particular application, but it is usually robust and works well. In practice, various factors may come into play, which means the independence assumption is not as harmful as it may seem.
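The single-pass simplicity described above can be seen with scikit-learn's GaussianNB (the one-dimensional toy data is an illustrative assumption):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy one-dimensional data: two well-separated classes
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.9]])
y = np.array([0, 0, 0, 1, 1, 1])

# GaussianNB applies Bayes' theorem with the naive independence assumption,
# estimating one Gaussian per feature per class in a single pass over the data
nb = GaussianNB()
nb.fit(X, y)
pred = nb.predict([[1.1], [5.1]])  # one query near each class
```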

NBTree

NBTree induces a hybrid of decision trees and naive Bayes classifiers: the decision nodes contain univariate splits as in standard decision trees, but the leaves contain naive Bayes classifiers. Naive Bayes classifiers are generally easy to understand, and inducing them is very fast, requiring only a single pass through the data if all attributes are discrete. Naive Bayes can also be cross-validated in time that is linear in the number of instances, the number of attributes, and the number of label values. The reason is that instances can be removed, the counters updated, the instances classified, and the process repeated for a different set of instances.

Deep learning

Deep learning is a type of artificial intelligence (AI) concerned with mimicking the learning process that humans use to gain certain kinds of knowledge. Deep learning can be considered a way to automate predictive analytics.

In deep learning, the goal is to learn feature levels of increasing abstraction with minimal human interference, because in most applications we do not know the structure of the input, and any dependencies, for instance locality, should be discovered automatically during training.

A deep neural network is usually trained one layer at a time. The goal of each layer is to extract the salient features in the data fed to it, and a method such as the autoencoder can be used for this purpose. An autoencoder learns to compress data from the input layer into a short code and then to uncompress that code into something that closely matches the original data. This forces the autoencoder to engage in dimensionality reduction, for example by learning how to ignore noise.

Model ensembles

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. The idea of ensemble modeling is to create and combine multiple inductive models for the same domain, and perhaps obtain better prediction quality than most or all of them individually. Ensemble modeling applies to the two main predictive modeling tasks, classification and regression.

Bagging

Bagging is an ensemble meta-algorithm designed to improve the accuracy and stability of machine learning algorithms used in regression and statistical classification. Bagging is the simplest ensemble modeling algorithm, combining the most basic approaches to base-model creation and aggregation: it creates bootstrap samples of the training set, trains one base model on each, and combines them by averaging for the regression task or plain voting for the classification task. The performance of a bagging ensemble tends to improve as the number of base models increases, up to a certain point after which it stabilizes. This is where the limit of the model diversity achievable using bootstrap samples is reached.
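The bootstrap-and-vote scheme can be sketched with scikit-learn's BaggingClassifier (the synthetic dataset and the ensemble size of 25 are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic classification data, just for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each base model (a decision tree by default) is trained on a bootstrap
# sample of the training set; predictions are combined by plain voting
bag = BaggingClassifier(n_estimators=25, random_state=0)
bag.fit(X, y)
acc = bag.score(X, y)
```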

Boosting

The idea of boosting classifiers was introduced by Freund and Schapire, who proved several important properties justifying the approach. In general, boosting is a method of converting a "weak" learning algorithm into a "strong" one with arbitrarily high accuracy. After a number of models have been learned, they form a committee with properly weighted decisions.

AdaBoost Algorithm

One of the first boosting procedures proposed by Freund and Schapire was AdaBoost, and the algorithm is still the most popular one of its kind. The method builds a collection of models based on the training dataset and on probabilities p(x) assigned to each object x of the training set. The learning stage can be realized in different ways, depending on the preferred strategy: weighting or sampling.

When a model at some stage perfectly classifies the training data, the next stage of AdaBoost is not feasible, because the distribution degenerates. In such cases, some authors suggest resetting the probability distribution and building the next model on another bootstrap sample.
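The reweighting strategy can be illustrated with scikit-learn's AdaBoostClassifier (the synthetic data and the choice of 50 boosting stages are assumptions for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each boosting stage reweights the training instances so that later
# models concentrate on the examples that earlier models misclassified
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X, y)
acc = ada.score(X, y)
```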

MultiBoosting

The observations that bagging and AdaBoost appear to operate by different mechanisms, have different effects, and both obtain their greatest effect from the first few committee members suggest that it might be possible to obtain benefit by combining the two. MultiBoosting can be considered as wagging committees formed by AdaBoost. For a single run, a decision has to be made about how many sub-committees should be formed and the size of those sub-committees. MultiBoost has a potential computational advantage over AdaBoost in that the sub-committees may be learned in parallel, although this would require a change to the handling of early termination when learning a sub-committee.

RandomTree Classifiers

Random Tree is a supervised ensemble-learning classifier that produces several individual learners. It employs the bagging idea to draw a random sample of the data for building each decision tree. The algorithm can deal with both classification and regression problems. A group of such prediction trees is called a forest. Model trees are decision trees in which each leaf holds a linear model that is optimized for the local subspace described by that leaf.

Random Forests

Random forests are a generalization of recursive partitioning that combines a collection of trees, called an ensemble. A random forest is best seen as a bootstrapped version of a classification-tree-generating procedure. It was invented and substantially developed by Breiman and Cutler. A random forest is a collection of identically distributed trees whose predicted classes are obtained by a variant on majority vote. Another way to say this is that random forests are a bagged version of a tree classifier, improved by two clever tricks.
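A minimal sketch with scikit-learn's RandomForestClassifier (synthetic data and the forest size of 100 trees are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Bagged trees plus the second "trick": only a random subset of features
# is considered at each split, which decorrelates the individual trees
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
acc = rf.score(X, y)
```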

Rotation Forest

Rotation Forest is a method for generating classifier ensembles based on feature extraction. To create the training data for a base classifier, the feature set is randomly split into K subsets (K is a parameter of the algorithm) and Principal Component Analysis (PCA) is applied to each subset. The idea of the rotation approach is to encourage individual accuracy and diversity within the ensemble simultaneously. Diversity is promoted through the feature extraction performed for each base classifier. Decision trees were chosen as the base classifiers because of their sensitivity to rotation of the feature axes, hence the name "forest". Accuracy is sought by keeping all principal components and by using the whole dataset to train each base classifier.

The Random Subspace Method

The Random Subspace Method (RSM) is a combining technique proposed by Ho (Ho, 1998). In the RSM, one also modifies the training data; however, this modification is performed in the feature space. The RSM may benefit from using random subspaces for both constructing and aggregating the classifiers. When the number of training objects is relatively small compared with the data dimensionality, constructing classifiers in random subspaces may solve the small-sample-size problem.
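One way to sketch the RSM is with scikit-learn's BaggingClassifier configured to sample features instead of objects; the dataset, the ensemble size, and the half-sized subspaces are assumptions for the example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# bootstrap=False keeps every training object; max_features=0.5 gives
# each base tree a random half of the features, i.e. a random subspace
rsm = BaggingClassifier(n_estimators=25, max_features=0.5,
                        bootstrap=False, random_state=0)
rsm.fit(X, y)
acc = rsm.score(X, y)
```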

Classification model evaluation

Model evaluation is an integral part of the model development process. It helps to find the best model to represent our data and to judge how well the chosen model will work in the future.

The aim of classification model evaluation is to obtain a reliable assessment of the quality of the target concept's approximation, represented by the model's predictive performance. Different performance measures can be used, depending on the intended application of the model. This reflects the model's predictive utility, i.e., its capability to correctly classify arbitrary new instances from the given domain. Since the true class labels are generally unavailable for the whole domain, the true performance always remains unknown and can only be estimated from dataset performance.

Classification performance measures are calculated by comparing the predictions generated by the classifier on a dataset S with the true class labels of the instances from this dataset. For classification, especially for two-class problems, a variety of measures has been proposed. There are four possible cases. For a positive example, if the prediction is also positive, this is a true positive; if our prediction is negative for a positive example, this is a false negative. For a negative example, if the prediction is also negative, we have a true negative, and we have a false positive if we predict a negative example as positive.

In some two-class problems, we make a distinction between the two classes and hence between the two types of errors, false positives and false negatives. In many applications, it may not be sufficient to know how often the evaluated model is wrong, or even how costly its mistakes are on average; it may be similarly or even more important to know how often it fails to predict particular classes correctly. In such cases, model performance can be evaluated more deeply based on the confusion matrix.

The confusion matrix gives an extremely useful insight into the model’s capability to predict particular classes, and – under a proper evaluation procedure – into its generalization properties.

Classification accuracy is the ratio of correct predictions to the total number of predictions, or, more simply, how often the classifier is correct.

Accuracy = (number of correct predictions) / (total number of predictions)

We can calculate the accuracy using the confusion matrix. Following is the equation to calculate the accuracy using the confusion matrix:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The misclassification error and accuracy are the same class-insensitive performance measures that were defined above, and their confusion matrix-based definitions are given here for the sake of completeness only. The remaining performance measures are much more interesting class-sensitive indicators that describe the level at which the evaluated classifier succeeds or fails to correctly detect the positive class.
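The accuracy and misclassification-error formulas follow directly from the four confusion-matrix cells; the counts below are hypothetical, invented only to show the arithmetic:

```python
# Hypothetical confusion-matrix counts for a two-class problem
tp, fn = 40, 10   # positive instances: correctly / incorrectly classified
fp, tn = 5, 45    # negative instances: incorrectly / correctly classified

accuracy = (tp + tn) / (tp + tn + fp + fn)  # (40 + 45) / 100
error = 1 - accuracy                        # misclassification error is the complement
```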

Model evaluation procedures

Training and testing on the same data

Rewards overly complex models that “overfit” the training data and won’t necessarily generalize

Train/test split

Split the dataset into two pieces, so that the model can be trained and tested on different data

Better estimate of out-of-sample performance, but still a “high variance” estimate

Useful due to its speed, simplicity, and flexibility

K-fold cross-validation

Systematically create “K” train/test splits and average the results together

Even better estimate of out-of-sample performance

Runs “K” times slower than train/test split
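The K-fold procedure above can be sketched with scikit-learn's cross_val_score; the choice of a decision tree, the Iris dataset, and K = 5 are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5 train/test splits; each fold serves once as the test set, and the
# per-fold scores are averaged for a more stable out-of-sample estimate
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
mean_score = scores.mean()
```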

The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. An ROC curve is a commonly used way to visualize the performance of a binary classifier, meaning a classifier with two possible output classes. The curve plots the true positive rate (recall) against the false positive rate (also interpreted as 1 − specificity).
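The threshold-sweeping idea can be shown with scikit-learn's roc_curve on a tiny set of hypothetical classifier scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical classifier scores for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# (fpr, tpr) pairs as the decision threshold is swept over the scores
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)  # area under the ROC curve
```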

A measure called the Kappa statistic takes this expected figure into account by deducting it from the predictor's successes and expressing the result as a proportion of the total for a perfect predictor. The Kappa statistic measures the agreement between the predicted and observed categorizations of a dataset while correcting for agreement that occurs by chance. However, like the plain success rate, it does not take costs into account.
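The chance-corrected calculation is available as scikit-learn's cohen_kappa_score; the labels below are hypothetical, chosen so the arithmetic is easy to follow:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical observed vs. predicted labels (6 of 8 agree)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

# kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
# (0.75 here) and p_e the chance agreement from the marginals (0.5 here)
kappa = cohen_kappa_score(y_true, y_pred)
```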