Thursday, January 14, 2016

Basics of data mining

Data Mining, or Machine Learning, is not something new. It has been around for years, and many have used it for various types of analysis and for finding hidden patterns in a data set. However, it is not something we routinely apply and integrate into our solutions, because it caters for specific requirements. In addition, the definition many have picked up covers only part of what Data Mining is actually used for and misses its purpose. Let's discuss the basics of Data Mining and see how it can be used, even in a simple scenario.

What is Data Mining?
Data Mining is a special kind of analysis technique that uses statistical models to reveal previously unknown or difficult-to-identify connections and correlations in a large dataset. This aligns with data warehousing and business intelligence, as it always works with large datasets. A business needs to identify useful patterns in the data captured and stored in its data warehouses in order to improve productivity and efficiency while satisfying its customers. For an organization to run smoothly and successfully, it helps to identify potential customers for product recommendations, predict the future behavior of customers, and understand trends, including those of competitors. Although this can be done to some extent with the reporting and analysis functionality of BI client tools, it is difficult to get everything done efficiently without statistical analysis, that is, Data Mining.

Data Mining Algorithms

Data Mining uses algorithms to analyze data. There are many algorithms; some are heavily used and some are used only in specific scenarios. Several algorithms serve the same purpose with small differences, so that the best one can be selected for a given scenario. Algorithms can be categorized as below:
  • Classification algorithms: Predict one or more discrete variables based on other attributes. Example: Predict whether credit can be granted to a customer or not. Algorithms: Microsoft Decision Trees, Neural Network, Naive Bayes.
  • Regression algorithms: Predict one or more continuous variables. Example: Predict the sales revenue. Algorithms: Microsoft Time Series, Linear Regression, Logistic Regression.
  • Segmentation or clustering algorithms: Group data into multiple segments. Example: Group customers based on their attributes for a marketing campaign. Algorithms: Microsoft Clustering.
  • Association algorithms: Find correlations between different attributes in a dataset. Example: Finding products to be bundled for selling. Algorithms: Microsoft Association.
  • Sequence analysis algorithms: Find sequences (or order) in a data set. Example: Finding common clickstream patterns in a web site. Algorithms: Microsoft Sequence Clustering.
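
As a rough illustration of the association category above, here is a minimal Python sketch that counts how often pairs of products appear in the same basket. It is not the Microsoft Association algorithm; the basket data and the `pair_support` helper are made up for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of products per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
    {"milk", "bread"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(baskets)

# The pair bought together most often is a candidate for bundling.
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])
```

Pairs with high support are the ones worth considering for a bundle; a real association algorithm also looks at larger item sets and at measures such as confidence.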

Data Mining Solution
Let's talk about a scenario. Assume that we have a large number of customer records that include general attributes such as name, age, job, and housing, plus a specific attribute indicating whether the customer has been granted credit or not. With a dataset like this, we can determine whether a new customer should be granted credit or not.

Here, credit risk is marked as the target attribute and all other attributes are treated as features. We can have this dataset analyzed by an algorithm (or multiple algorithms) using a Model. A Model specifies the data structure, marking the attributes to be used and the attribute to be predicted, along with a dataset. Generally, a Model uses 70% of the dataset (the Training set) for identifying patterns and 30% for testing (the Testing set). The Model trains its algorithms on the training set to predict the target column and then uses the testing set to check the accuracy. Accuracy can be checked easily because the testing set also contains the actual value of the attribute being predicted. If multiple algorithms have been used with the model, the best one can be picked based on its accuracy in testing. Once picked, the model can be marked as a Trained Model, which can be used for new customers.
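
The workflow described above (split 70/30, train several algorithms, pick the most accurate one on the testing set) can be sketched in plain Python. This is only an illustration under stated assumptions: the customer data is generated artificially so that credit risk depends on job alone, and the two "algorithms" (a majority-class baseline and a one-attribute rule) are simple stand-ins, not the Microsoft algorithms listed earlier. All function and attribute names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def split_dataset(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and split them into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_majority(train, target):
    """Baseline: always predict the most common target value."""
    most_common = Counter(r[target] for r in train).most_common(1)[0][0]
    return lambda row: most_common


def train_one_rule(train, target, feature):
    """Predict the most common target value seen for each value of one feature."""
    by_value = defaultdict(Counter)
    for r in train:
        by_value[r[feature]][r[target]] += 1
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    fallback = Counter(r[target] for r in train).most_common(1)[0][0]
    return lambda row: table.get(row[feature], fallback)


def accuracy(model, test, target):
    """Fraction of testing rows whose known target value is predicted correctly."""
    return sum(1 for r in test if model(r) == r[target]) / len(test)


# Artificial customer records: credit risk is driven by job only (an assumption
# made purely so that the example has a pattern to find).
rows = []
for i in range(200):
    job = "skilled" if i % 2 == 0 else "unskilled"
    housing = ["own", "rent", "free"][i % 3]
    rows.append({
        "job": job,
        "housing": housing,
        "credit_risk": "good" if job == "skilled" else "bad",
    })

train, test = split_dataset(rows)
target = "credit_risk"

# Train each candidate on the training set only.
models = {
    "majority": train_majority(train, target),
    "one_rule_job": train_one_rule(train, target, "job"),
    "one_rule_housing": train_one_rule(train, target, "housing"),
}

# Check each candidate against the testing set, which holds the real answers.
accuracies = {name: accuracy(m, test, target) for name, m in models.items()}
best = max(accuracies, key=accuracies.get)
trained_model = models[best]

print(accuracies)
print("picked:", best)
print(trained_model({"job": "skilled", "housing": "rent"}))
```

The picked model plays the role of the Trained Model: it is the one then applied to new customers whose credit risk is not yet known.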

Tools available for Data Mining
There are many tools and applications on different platforms. Here are some of those offered by Microsoft:
  • Data Mining Add-ins for Excel: These provide many facilities for performing data mining. The add-ins include Table Analysis tools, which can be used without knowing much about data mining.
  • Microsoft Analysis Services: This allows us to create data structures for data mining. Once created, they can be used for creating reports or analysis.
  • Azure Machine Learning: This is the cloud offering, which can easily be used for data mining. It allows us to create Models with a drag-and-drop facility, train them, and then expose the trained models as web services.
