

Iris Mythica - New and Improved Iris Data Set

POSTED IN: Data Analytics & Visualization Blog


Summary: I created 50 additional data points to augment the trusted, original Iris data set used in so many machine learning tutorials.

   

Background - Tonight I had a 45-minute bus ride home to Berkeley and a patchy connection on the AC Transit bus, so I thought I'd use the time to play around with the "hello world" source data of many introductory machine learning exercises: the Edgar Anderson Iris data set.

Background from Wikipedia: "The Iris flower data set is Edgar Anderson's Iris data set to measure morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on a linear discriminant model, this data set became a typical test case for many classification techniques in machine learning."

Sources: http://en.wikipedia.org/wiki/Iris_flower_data_set and http://en.wikipedia.org/wiki/Edgar_Anderson

Objective

To spice up the Iris data set. I thought about mapping the data 1:1 to some non-flower element (e.g. animals, rock bands) but didn't have enough time or battery life, so I opted to EXPAND the 3 x 50 = 150 row data set by another 50 rows.

So my goal was to end up with an experimental, augmented Iris data set that was a bit bigger and had 4 species.

One benefit of this new set is that it lets newbs like me play with another set of data that is similar enough to compare to the 'standard' set. I also thought it might be fun to see HOW the new 'random' set was generated.

Method

  1. Pulled Original Data Set
  2. Imported into R
  3. Plotted Original
  4. Looked at traits of the clusters - DECIDED WHERE GAPS / SPACE existed to fit a 'new' species (sepal.length 6 to 7cm; sepal.width 3 to 4cm; petal.length 1.5 to 3.5cm; petal.width 0.5 to 1cm)
  5. Fired up Excel.  Added 50 new rows (151 to 200) with IRIS MYTHICA tag on species
  6. Generated random numbers for each cell (XLS) from 0 to 1, then scaled them to the correct range for that column.  (When I did this I thought I heard the sound of data scientists screaming, but it was probably just my imagination)
  7. Saved new iris_mythica.csv
  8. Launched R.  Ran the standard plots on the original set, then copied and pasted the R code and replaced iris with iris_mythica;  Re-ran plots.  Noted a new cluster in the area where we expected to see data.
  9. Plotted
  10. Bus arrived at destination; Feet up; Poured Drink
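
For anyone who would rather skip the spreadsheet, steps 5 to 7 can be sketched in R as well. This is not the code used for the post (that was Excel's random function plus scaling); it is a minimal equivalent that assumes R's built-in `iris` data frame, whose column names (`Sepal.Length`, ..., `Species`) differ from the ones in iris_mythica.csv. The helper `scaled_rand`, the seed, and the label "mythica" are made up for illustration.

```r
# Sketch only: the post did this step in Excel; this is an R equivalent.
set.seed(2014)  # arbitrary seed, only so the sketch is repeatable
n_new <- 50

# Same recipe as the spreadsheet: draw from 0 to 1, then scale into [lo, hi].
scaled_rand <- function(n, lo, hi) lo + runif(n) * (hi - lo)

# Ranges (in cm) are the ones chosen in step 4.
mythica_rows <- data.frame(
  Sepal.Length = scaled_rand(n_new, 6.0, 7.0),
  Sepal.Width  = scaled_rand(n_new, 3.0, 4.0),
  Petal.Length = scaled_rand(n_new, 1.5, 3.5),
  Petal.Width  = scaled_rand(n_new, 0.5, 1.0),
  Species      = "mythica",
  stringsAsFactors = FALSE
)

# The built-in Species column is a factor; switch to character before
# binding so the new label is not turned into NA.
original <- iris
original$Species <- as.character(original$Species)

mythica <- rbind(original, mythica_rows)
mythica$Species <- factor(
  mythica$Species,
  levels = c("setosa", "versicolor", "virginica", "mythica")
)

# Row count per species, original three plus the new one.
print(table(mythica$Species))
```

Because every value is a uniform draw inside a fixed range, the new species fills a box rather than a natural-looking cloud, which is worth keeping in mind when comparing it with the three real species.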

Images

Below left: Surveying some space to 'fit' the new data in the original 3 x 50 = 150 row data set.    Below right: Excel random functions fitting to ranges
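
The gap survey was done by eye on the plots, but a numeric check is a handy companion. The sketch below is an assumption about how one might do it, not what was run for the post: it uses R's built-in `iris` data frame (with its own column names) to list each species' minimum and maximum per measurement, then counts how many original flowers fall inside the chosen box on all four measurements at once.

```r
# Sketch only: a numeric version of "looking for gaps" on the built-in iris data.

# Smallest and largest value of each measurement, per species.
species_min <- aggregate(. ~ Species, data = iris, FUN = min)
species_max <- aggregate(. ~ Species, data = iris, FUN = max)
print(species_min)
print(species_max)

# The box picked for the new species (cm), taken from step 4 of the method.
in_box <- with(
  iris,
  Sepal.Length >= 6.0 & Sepal.Length <= 7.0 &
    Sepal.Width  >= 3.0 & Sepal.Width  <= 4.0 &
    Petal.Length >= 1.5 & Petal.Length <= 3.5 &
    Petal.Width  >= 0.5 & Petal.Width  <= 1.0
)

# Number of original flowers that already sit inside the box.
print(sum(in_box))
```

A small count here suggests the box really is empty space in the original data, even though some of the individual ranges overlap with an existing species when looked at one measurement at a time.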

    

Below left: RStudio console with the original Iris code, then the modified code:

    plot(mythica[2:5], main="IRIS - Now with Mythica !", pch=23, bg = c("lightblue","red","orange","green")[unclass(mythica$class)])

Below right: NEW Mythica data right where we 'boxed' it.  Note how tidy it is (given the RAND method).

    

 

R PLOTS:

    ggplot(mythica, aes(x = petal.width, fill = class)) + geom_density()

and

    ggplot(mythica, aes(x = petal.width, fill = class)) + geom_density() + facet_grid(class ~ .)

 

 

R Code:  https://drive.google.com/file/d/0BwjxYjWyopXhTjZxYWZVQ0QyeUM4cnAyX2lwckVNWjQ2VjJF/edit?usp=sharing

Data:   https://docs.google.com/spreadsheet/ccc?key=0AgjxYjWyopXhdEhMT2JaTlA5REt6TEFIc3VSZ0xMLVE&usp=sharing  or https://dreamtolearn.com/doc/BL1TCDF3M2V0E4NO1COGFLGSR

 


About the Author

Ryan Anderson


Hi! I like to play with data, analytics and hack around with robots and gadgets in my garage. Lately I've been learning about machine learning.

About this blog

Data Analytics & Visualization Blog - Generating insights from Data since 2013

Created: July 25, 2014
