block by aaizemberg

Top 10 algorithms in data mining

04 December 2007

DOI: 10.1007/s10115-007-0114-2

Authors: Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand & Dan Steinberg

  1. C4.5 and beyond: A suite of algorithms for generating decision trees used for classification.
  2. The k-means algorithm: A simple iterative method that partitions data into distinct clusters based on similarity.
  3. Support vector machines: A robust method that finds the hyperplane separating different classes with the maximum margin.
  4. The Apriori algorithm: A seminal algorithm for finding frequent itemsets and deriving association rules, commonly used in market basket analysis.
  5. The EM algorithm: An iterative algorithm for finding maximum likelihood estimates in models with latent variables.
  6. PageRank: A link-analysis algorithm used by search engines to rank the importance of web pages.
  7. AdaBoost: An ensemble learning method that combines multiple “weak” classifiers to create a strong one.
  8. kNN (k-nearest neighbor classification): A simple classification and regression algorithm that labels a data point according to how its nearest neighbors are labeled.
  9. Naive Bayes: A family of probabilistic classifiers based on Bayes’ Theorem with strong independence assumptions.
  10. CART (Classification and Regression Trees): A decision tree learning technique that produces either classification or regression trees.
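The k-means entry above (item 2) can be sketched in a few lines: alternate between assigning each point to its nearest centroid and recomputing each centroid as its cluster's mean. This is a minimal pure-Python illustration (the function name, random initialization, and fixed iteration count are choices made here, not part of the paper); practical use would reach for an optimized library implementation such as scikit-learn's KMeans.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then recompute each centroid as its cluster's mean; repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # initialize with k random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                   # assignment step
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        for i, c in enumerate(clusters):   # update step (skip empty clusters)
            if c:
                centroids[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centroids, clusters

# Two well-separated 2-D blobs recover centroids near (0, 0.5) and (10, 10.5):
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, clusters = kmeans(pts, 2)
```

A fixed iteration count keeps the sketch short; real implementations instead stop when assignments no longer change.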

Top 10 in Machine Learning

  1. Linear Regression: Used to predict a continuous numerical value (e.g., house prices) by finding the best-fit straight line through data points.
  2. Logistic Regression: Despite its name, this is a classification tool used to predict binary outcomes, such as “yes/no” or “spam/not spam,” by outputting a probability.
  3. Decision Trees: A flowchart-like structure that makes decisions based on a series of yes/no questions about the data’s features.
  4. Random Forest: An ensemble method that combines multiple individual decision trees and takes their majority vote to improve accuracy and prevent overfitting.
  5. Support Vector Machines (SVM): A powerful classifier that finds the “hyperplane” (a boundary line) that maximizes the distance between different classes of data.
  6. k-Nearest Neighbors (kNN): A simple “lazy learner” that classifies new data points based on how their closest neighbors in the dataset are classified.
  7. Naive Bayes: A probabilistic classifier based on Bayes’ Theorem that assumes all features are independent, often used for quick text categorization like sentiment analysis.
  8. k-Means Clustering: An unsupervised algorithm that groups unlabeled data into distinct clusters based on shared similarities.
  9. Principal Component Analysis (PCA): A dimensionality reduction technique that simplifies complex datasets with many variables into a few “principal components” while retaining the most important information.
  10. Gradient Boosting (e.g., XGBoost, LightGBM): An advanced ensemble technique that builds models sequentially, with each new model focusing on correcting the errors made by previous ones.
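The “lazy learner” idea in item 6 fits in a few lines: no training phase at all, just a distance computation at prediction time followed by a majority vote. A minimal sketch (function name and data layout are illustrative assumptions, not a standard API):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    points. `train` is a list of (point, label) pairs; distance is
    squared Euclidean."""
    neighbors = sorted(train,
                       key=lambda pl: sum((a - b) ** 2
                                          for a, b in zip(pl[0], query)))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two small clusters labeled "a" and "b"; queries near each get that label.
train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
         ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b")]
label = knn_predict(train, (0.5, 0.5))
```

Since all work happens at query time, prediction cost grows with the training set; production kNN uses spatial indexes (k-d trees, ball trees) to avoid the full scan.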