Top 10 algorithms in data mining
04 December 2007
DOI: 10.1007/s10115-007-0114-2
Authors: Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand & Dan Steinberg
- C4.5 and beyond: A suite of algorithms for generating decision trees used for classification.
- The k-means algorithm: A simple iterative method to partition data into distinct clusters based on similarity.
- Support vector machines: A robust method that finds the hyperplane separating different classes with the maximum margin.
- The Apriori algorithm: A seminal algorithm for finding frequent itemsets and deriving association rules, commonly used in market basket analysis.
- The EM algorithm: An iterative algorithm for finding maximum likelihood estimates in models with latent variables.
- PageRank: A link-analysis algorithm used by search engines to rank the importance of web pages.
- AdaBoost: An ensemble learning method that combines multiple "weak" classifiers into a strong one.
- kNN (k-nearest neighbor classification): A simple classification and regression algorithm that labels a data point according to the labels of its nearest neighbors.
- Naive Bayes: A family of probabilistic classifiers based on Bayes' Theorem with strong independence assumptions.
- CART (Classification and regression trees): A decision tree learning technique that produces either classification or regression trees.
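To make the k-means entry above concrete, here is a minimal pure-Python sketch of its iterative assign/update loop; the toy 2-D data, k = 2, and the deterministic initialization are assumptions for the example, not part of the original paper.

```python
def kmeans(points, k, iters=20):
    """Minimal k-means on 2-D points: repeatedly assign each point to
    its nearest centroid, then move each centroid to its cluster mean."""
    centroids = list(points[:k])  # deterministic init, just for the sketch
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: recompute each non-empty cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# Two well-separated blobs; k-means should recover one cluster per blob.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
centroids, clusters = kmeans(data, k=2)
```

Real implementations add random restarts and a convergence test instead of a fixed iteration count, since k-means can get stuck in a poor local optimum depending on initialization.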
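The Apriori entry is worth a sketch as well, because its key idea is easy to show in code: an itemset can only be frequent if every one of its subsets is frequent, so candidates are built level by level and pruned. The toy baskets and support threshold below are illustrative assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search for frequent itemsets: size-k candidates are
    joins of frequent (k-1)-itemsets, pruned if any subset is infrequent."""
    items = sorted({i for t in transactions for i in t})
    support = lambda s: sum(1 for t in transactions if s <= t)
    frequent = {frozenset([i]) for i in items if support({i}) >= min_support}
    result = {s: support(s) for s in frequent}
    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        result.update((s, support(s)) for s in frequent)
        k += 1
    return result

# Toy market-basket data: {milk, bread} appears in 3 of the 4 baskets.
baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread", "eggs"}, {"milk", "bread"}]
freq = apriori(baskets, min_support=3)
```

Association rules (e.g. "bread implies milk") are then derived from the surviving itemsets by comparing their support counts.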
Top 10 in Machine Learning
- Linear Regression: Used to predict a continuous numerical value (e.g., house prices) by finding the best-fit straight line through data points.
- Logistic Regression: Despite its name, this is a classification tool used to predict binary outcomes, such as “yes/no” or “spam/not spam,” by outputting a probability.
- Decision Trees: A flowchart-like structure that makes decisions based on a series of yes/no questions about the data’s features.
- Random Forest: An ensemble method that combines multiple individual decision trees and takes their majority vote to improve accuracy and prevent overfitting.
- Support Vector Machines (SVM): A powerful classifier that finds the hyperplane (a separating boundary) that maximizes the distance between different classes of data.
- k-Nearest Neighbors (kNN): A simple “lazy learner” that classifies new data points based on how their closest neighbors in the dataset are classified.
- Naive Bayes: A probabilistic classifier based on Bayes’ Theorem that assumes all features are independent, often used for quick text categorization like sentiment analysis.
- k-Means Clustering: An unsupervised algorithm that groups unlabeled data into distinct clusters based on shared similarities.
- Principal Component Analysis (PCA): A dimensionality reduction technique that simplifies complex datasets with many variables into a few “principal components” while retaining the most important information.
- Gradient Boosting (e.g., XGBoost, LightGBM): An advanced ensemble technique that builds models sequentially, with each new model focusing on correcting the errors made by previous ones.
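Logistic regression's "outputs a probability" behavior from the list above can be sketched in a few lines: fit weights by stochastic gradient descent on the log-loss and pass the linear score through a sigmoid. The toy 1-D data, learning rate, and epoch count are assumptions for the example.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Stochastic gradient descent on the log-loss; the model
    predicts P(y = 1 | x) = sigmoid(w . x + b)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
            g = p - yi                      # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict_proba(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D data: class 1 roughly when the feature exceeds 2.
X = [[0.5], [1.0], [1.5], [2.5], [3.0], [3.5]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
```

Thresholding the probability at 0.5 turns this into the binary "yes/no" classifier described in the list.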
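The "each new model corrects the errors of the previous ones" loop of gradient boosting can also be sketched directly, using one-dimensional regression stumps as the weak learners and squared error as the loss; the toy step-function data, learning rate, and round count are assumptions for the example.

```python
def fit_stump(x, residuals):
    """Least-squares 1-D decision stump: a threshold plus a
    constant prediction on each side of it."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1], best[2], best[3]

def gradient_boost(x, y, n_rounds=50, lr=0.1):
    """Start from the mean, then repeatedly fit a stump to the current
    residuals and add a shrunken copy of it to the ensemble."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lv, rv = fit_stump(x, residuals)
        stumps.append((t, lv, rv))
        pred = [pi + lr * (lv if xi <= t else rv) for xi, pi in zip(x, pred)]
    def predict(xq):
        out = base
        for t, lv, rv in stumps:
            out += lr * (lv if xq <= t else rv)
        return out
    return predict

# Toy step function: y jumps from 0 to 1 at x = 0.5.
x = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
model = gradient_boost(x, y)
```

Libraries such as XGBoost and LightGBM follow this same additive scheme but use full trees, arbitrary differentiable losses, regularization, and heavy engineering for speed.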