A data mining software package is not something we usually get to choose. It is either already in place in the company, or taught at the university, or the decision is made at the corporate level. However, with the growing number of companies tapping into the power of analytics, the question is probably asked a lot. I was recently asked for advice on good data mining software, and I came up with some points I would like to share with you.
I believe the first step is to clarify the company's needs and goals for implementing data mining (DM) in general, as well as its aspirations for the future. Do they just want to try it out, do they have a business plan for embedding data mining services in their offering or internal processes, or is it something else? The purpose could be to study DM, to implement simple models, to deploy sophisticated models over large data, and so on. This is a crucial step because it determines the policy and the size and distribution of investments over time, since data mining packages come at many different prices and with many different feature sets. Once the goal is clarified, the things to consider, in my view, are:
- The richness and usefulness of the data exploration feature
Data mining requires a very good understanding of the data, and that is why I consider this feature very important. Of course, there are software packages built specifically for this purpose, and it is OK to use one alongside the DM package, but that would mean extra work and divided attention.
- The data transformations that come with the package
All DM packages provide data transformation facilities, but the difference is in their ease of use, their transparency, and the option to implement your own transformations.
- Availability and versatility of machine learning methods
Some packages provide a rich set of methods and their variations to best suit the needs of the user, while others make available just one version of a method. This is not good or bad by itself; it depends on the purpose. It is usually related to the free/paid version of the software as well as to the application field where the software originates, but it is important in the overall decision. I would also include here the option of adding a custom-built algorithm.
- Learning curve
Even if you, or the staff who will be using the package, have a good level of DM skills, each package comes with its own set of specifics. For example, R-project requires learning a list of "magical spells" to be typed in, while Rapid Miner requires learning its components, how to build the calculation process with them, and so on. The price and availability of courses should also be part of this.
- Existing experience with DM packages
If the team has experience with a DM package, then choosing it or a similar one is a good step, as the learning curve will be gentle or non-existent.
- Availability and quality of the support
Problems with any software do come up, and we had better have a plan for how to solve them. Free and open source solutions are a cheaper alternative but, as a rule, if you have a problem, you have to solve it on your own. That usually translates into hours of browsing through user forums and blog posts, only to find that the solution does not work for you. Some companies have nightmarish support for a nice product while others are the other way around, so you need to decide how important that is for you. Usually, it is a neglected part of the deal until an angry customer (or even worse - an angry boss) knocks on your door while you are trying to make the damn thing work.
- How it works with data from different sources and different sizes
Almost all packages work with almost all data formats and connect to all sorts of data sources. However, analyze your sources: maybe there is data that comes in a file format that is not supported by the DM software and would require re-formatting. Also, packages manage data in different ways: some load all the data into memory while others process it in portions. If you are planning to deal with large data sets, you need to pay attention to that.
- How easy it is to deploy a model and how it will work in your infrastructure
You may start small with DM, but keep in mind that your needs could grow, and your package of choice might not be able to handle them, or might do so only at too steep a cost.
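To make the point about data sizes a bit more concrete, here is a minimal sketch of what "processing data in portions" means, as opposed to loading everything into memory. It is only an illustration using the Python standard library; the function name `mean_in_chunks`, the column name `amount`, and the tiny in-memory sample are all made up for this example and do not describe how any particular DM package works internally.

```python
import csv
import io
from itertools import islice


def mean_in_chunks(rows, column, chunk_size=1000):
    """Average one numeric column while holding at most chunk_size rows in memory."""
    reader = csv.DictReader(rows)
    total, count = 0.0, 0
    while True:
        # Pull the next portion of rows; the rest of the file stays unread.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Hypothetical sample data; in practice this would be an open file handle.
sample = io.StringIO("amount\n10\n20\n30\n40\n")
result = mean_in_chunks(sample, "amount", chunk_size=2)
print(result)
```

A package that works this way can handle files larger than the available memory, at the cost of being limited to calculations that can be done one portion at a time.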
Last but not least, the price. There are some astronomically expensive packages available for the largest of the large corporations, while there are reasonably priced ones that offer very good value for the money. However, with DM software, the idea that more expensive means better might hold true.
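One way to put the criteria above to work is a simple weighted scoring sheet. The sketch below is just one possible approach, not a recommendation: the package names, the 1 to 5 scores and the weights are entirely hypothetical, and you would replace them with the criteria and priorities that follow from your own goals.

```python
def weighted_score(scores, weights):
    """Weighted average of the criterion scores, using the given weights."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical weights: how much each criterion matters to this team.
weights = {"exploration": 3, "learning_curve": 2, "support": 2, "price": 1}

# Hypothetical scores from 1 (poor) to 5 (excellent) for two made-up candidates.
candidates = {
    "Package A": {"exploration": 4, "learning_curve": 2, "support": 5, "price": 1},
    "Package B": {"exploration": 3, "learning_curve": 5, "support": 2, "price": 5},
}

# Rank the candidates from best to worst weighted score.
ranking = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], weights),
    reverse=True,
)

for name in ranking:
    print(name, round(weighted_score(candidates[name], weights), 3))
```

The numbers themselves matter less than the exercise: agreeing on the weights forces the team to state what it actually needs from the package.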
These are some highlights to consider when selecting data mining software to implement. I have intentionally not mentioned any software packages, as there are so many of them and I have experience with only a few. Some of the free packages are listed here, and a much more complete list can be found on KDnuggets, where you may be surprised to see how many there are. It is not easy to find your way around all these offers and pick a few for further testing, so searching for the top software that meets specific criteria is a better way to go. It is worth knowing that some packages have free versions that allow you to test their functionality and feel. I find this great, as demos and reading materials cannot provide the actual look and feel of the package. I would like to hear your thoughts and experience on this topic.