Iris Mythica - New and Improved Iris Data Set
Summary: Created an additional 50 data points to augment the trusted, original Iris data set used in so many machine learning tutorials.
Background - Tonight I had a 45-minute bus ride home to Berkeley with a patchy connection on the AC Transit bus, and thought I'd use the time to play around with the "hello world" source data of many introductory machine learning exercises: the Edgar Anderson Iris Data Set.
Background from Wikipedia: "The Iris flower data set is Edgar Anderson's Iris data set to measure morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on a linear discriminant model, this data set became a typical test case for many classification techniques in machine learning."
Sources: http://en.wikipedia.org/wiki/Iris_flower_data_set and http://en.wikipedia.org/wiki/Edgar_Anderson
To spice up the IRIS data set, I thought about mapping the data 1:1 to some non-flower element (e.g. animal, rock band) but didn't have enough time or battery life, so I opted to EXPAND the 3 x 50 = 150 row data set by another 50 rows.
So my goal was to end up with an experimental / augmented IRIS data set that was a bit bigger (200 rows) and had 4 species.
One benefit of this new set would be to allow newbs like me to play with another set of data that was similar enough to compare to the 'standard' set. I also thought the traits of HOW the new 'random' set was generated might be fun to see.
- Pulled Original Data Set
- Imported into R
- Plotted Original
- Looked at traits of the clusters - DECIDED WHERE GAPS / SPACE existed to fit a 'new' species (sepal.length 6 to 7cm; sepal.width 3 to 4cm; petal.length 1.5 to 3.5cm; petal.width 0.5 to 1cm)
- Fired up Excel. Added 50 new rows (151 to 200) with IRIS MYTHICA tag on species
- Generated Random Numbers for each Cell (XLS) - from 0 to 1; then scaled them to correct range. (When I did this I thought I heard the sound of data scientists screaming, but probably just my imagination)
- Saved new iris_mythica.csv
- Launched R. Ran the standard plots on the original set, then copied and pasted the R code, replaced iris with iris_mythica, and re-ran the plots. Noted a new cluster in the area where we expected to see data.
- Bus arrived at destination; Feet up; Poured Drink
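The Excel step (random numbers from 0 to 1, scaled into each range) can also be sketched directly in R. This is only a rough equivalent, not what was done on the bus. It assumes R's built-in iris data frame (columns Sepal.Length ... Species) rather than the downloaded file, a made-up species label "mythica", an arbitrary seed, rounding to one decimal to match the original measurements, and a temp-folder path for the CSV.

```r
# Sketch only: an R stand-in for the Excel RAND()-and-scale step.
set.seed(42)   # arbitrary seed (assumption) so the run is repeatable
n_new <- 50

# Ranges chosen in the "gaps" step above, in cm: lower and upper bound.
ranges <- list(
  Sepal.Length = c(6.0, 7.0),
  Sepal.Width  = c(3.0, 4.0),
  Petal.Length = c(1.5, 3.5),
  Petal.Width  = c(0.5, 1.0)
)

# Draw uniform numbers in [0, 1], then scale: min + u * (max - min).
# Rounding to 1 decimal is an assumption, to mimic the original data.
scale_uniform <- function(bounds, n) {
  u <- runif(n)
  round(bounds[1] + u * (bounds[2] - bounds[1]), 1)
}

new_rows <- as.data.frame(lapply(ranges, scale_uniform, n = n_new))
new_rows$Species <- "mythica"   # hypothetical label for the 4th species

# Append to the built-in iris set (150 rows) to get 200 rows.
original <- iris
original$Species <- as.character(original$Species)
iris_mythica <- rbind(original, new_rows)
iris_mythica$Species <- factor(iris_mythica$Species)

# Written to a temp folder here; change the path to keep the file.
out_file <- file.path(tempdir(), "iris_mythica.csv")
write.csv(iris_mythica, out_file, row.names = FALSE)
```

Because every value is drawn independently and uniformly inside a fixed box, the new species has no correlation between traits, which is one way it differs from the three real ones.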
Below left: Surveying some space to 'fit' the new data in the original 3 x 50 = 150 row data set. Below right: Excel random functions fitting values to the ranges.
Below left: R Studio console with the 'original' IRIS code, then the modified code: plot(mythica[2:5], main="IRIS - Now with Mythica !", pch=23, bg = c("lightblue","red","orange","green") [unclass(mythica$class)])
Below right: NEW Mythica data right where we 'boxed' it. Note how tidy it is (given the RAND method).
R PLOTS - ggplot(mythica, aes(x = petal.width, fill = class)) + geom_density() and ggplot(mythica, aes(x = petal.width, fill = class)) + geom_density() + facet_grid(class ~ .)
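One note for anyone re-running the base plot() call today: it picks colours with unclass(mythica$class), which only gives integer codes when class is a factor. Since R 4.0, read.csv() leaves text columns as plain character strings by default, so a conversion step is needed first. A minimal sketch follows, using a stand-in data frame built from the built-in iris set instead of the real CSV. The lowercase column names, the id column and the "mythica" label are assumptions based on the plot calls above.

```r
# Stand-in for the CSV so the sketch runs on its own (assumed layout:
# an id column, four lowercase measurement columns, and `class`).
set.seed(1)   # arbitrary seed (assumption)
mythica <- data.frame(
  id           = 1:200,
  sepal.length = c(iris$Sepal.Length, runif(50, 6, 7)),
  sepal.width  = c(iris$Sepal.Width,  runif(50, 3, 4)),
  petal.length = c(iris$Petal.Length, runif(50, 1.5, 3.5)),
  petal.width  = c(iris$Petal.Width,  runif(50, 0.5, 1)),
  class        = c(as.character(iris$Species), rep("mythica", 50)),
  stringsAsFactors = FALSE   # mimics read.csv() defaults in R >= 4.0
)

# Convert to a factor so unclass() returns integer codes 1..4.
mythica$class <- factor(mythica$class)

# Levels sort alphabetically: mythica, setosa, versicolor, virginica,
# so in this sketch mythica gets the first colour in the vector.
fill_colours <- c("lightblue", "red", "orange", "green")
point_fill <- fill_colours[unclass(mythica$class)]

# Same scatterplot matrix as in the post, on columns 2 to 5.
plot(mythica[2:5], main = "IRIS - Now with Mythica !", pch = 23, bg = point_fill)
```

Which colour lands on which species depends on the factor level order, so the colours in the screenshot may map differently if the original file used other labels.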
Created: July 25, 2014