In the blog post below, just for grins, I fabricated another 50 data elements and called them Iris Mythica.
Now I wanted to run some basic Machine Learning on the "new" data to see how it performed.
There are a number of great tools out there, including a package for R. For my example here, however, I'm going to use the Machine Learning engine from Wise.IO, because I'm familiar with it and know the guys there. Given the power of the Wise tools, it's way more horsepower than needed, but what the heck! Coming soon, I'll step through the R solution.
Goal: Feed a PORTION of the data into an ML tool and let the tool learn. Then test/validate the tool by submitting data for classification (blind) to see what the model thinks the data is. Then remove the blindfold, cross-check the results, and see how good the tool was. (Spoiler alert: pretty good.)
- GET the CSV file of 200 data points. It should have an Index, 4 attributes for each sample, and a class(ification).
- SEPARATE LEARN/TEST - we're not going to give the ML tool all the data, just a big chunk. We'll hold some back to test the model later and see how smart it is. Method: I simply used =RAND() and then told the spreadsheet to tag as TEST any row where the value was more than 0.8.
- SUBMIT learn data - 153 rows - to the ML model to learn. In this case, I told the model to ignore the Index (1-200 incremental), use the 4 attributes, and classify on class (setosa, versicolor, virginica and "mythica").
- MACHINE LEARNING (magic :) - more to come on the methods in the next blog post. The model trained successfully.
- PREPARE 'Test' data - strip out the 'class' tag on the 47 data elements, because that's what we want the ML to predict for us. Upload. Ask the model to predict what the class is.
- RECEIVE file with classification. Examine confidence (most were above 90%).
- COMPARE the predictions to the actual classes - remove the blindfold - CSV here
- (model got 46 out of 47 correct with default settings, and high confidence) - so in summary, good outcome.
- Error - the model classified row 120 (test data with attributes 6, 2.2, 5, 1.5) as versicolor; it's actually Iris-virginica. I noted later that this sample had the lowest value in the test group for 3 of the 4 attributes, and the second lowest for the other (it was near the edge of the cluster).
- CONCLUSION: Both the Data Set and Tool performed well. Tool classified 46 of 47 test samples correctly.
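For anyone who would rather script the SEPARATE and COMPARE steps than do them in a spreadsheet, here is a minimal Python sketch. It is not the spreadsheet or Wise.IO workflow used above: the function names `split_learn_test` and `score` are hypothetical, the rows in the demo are placeholders rather than the real CSV, and the comparison assumes class labels are spelled identically in both files.

```python
import random


def split_learn_test(rows, threshold=0.8, seed=None):
    """Mimic the =RAND() trick: a row goes to TEST when its draw exceeds threshold."""
    rng = random.Random(seed)  # seed is optional; pass one for a repeatable split
    learn, test = [], []
    for row in rows:
        if rng.random() > threshold:
            test.append(row)
        else:
            learn.append(row)
    return learn, test


def score(actual, predicted):
    """Compare predictions to held-back classes.

    Both arguments map a row index to a class label.
    Returns (number_correct, misses), where each miss is
    (index, actual_label, predicted_label).
    """
    misses = [
        (idx, actual[idx], predicted.get(idx))
        for idx in sorted(actual)
        if predicted.get(idx) != actual[idx]
    ]
    return len(actual) - len(misses), misses


# Placeholder rows only: 200 indexed rows with a dummy class label.
rows = [{"index": i, "class": "placeholder"} for i in range(1, 201)]
learn, test = split_learn_test(rows)
print(len(learn), "learn rows,", len(test), "test rows")
```

Because the split is random, the learn/test counts will vary from run to run, just as they would each time the spreadsheet recalculates =RAND(). Once the model returns its predictions, `score` takes the held-back labels and the predicted labels, both keyed by Index, and reports how many matched along with the rows that did not.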
Created: July 25, 2014