From 2c88987eb4ec7213b36271f097468ab098d454dc Mon Sep 17 00:00:00 2001 From: Joe McCarthy Date: Sun, 22 Feb 2015 15:09:43 -0800 Subject: [PATCH 01/16] Added support for Python 3 (print_function, division) --- Python_for_Data_Science_all.ipynb | 880 +++++++++++++++++------------- README.md | 13 +- simple_ml.py | 48 +- 3 files changed, 537 insertions(+), 404 deletions(-) diff --git a/Python_for_Data_Science_all.ipynb b/Python_for_Data_Science_all.ipynb index 12fb2f9..86eaa40 100644 --- a/Python_for_Data_Science_all.ipynb +++ b/Python_for_Data_Science_all.ipynb @@ -1,7 +1,7 @@ { "metadata": { "name": "", - "signature": "sha256:b9093d2c18739c3d520c8914e782d87c8f634bd136d6f2a4a0687fe761c50b2c" + "signature": "sha256:70904b83221e3051c52ba052346cc93ef71896373e06c07b0d982ac4605674c8" }, "nbformat": 3, "nbformat_minor": 0, @@ -336,6 +336,42 @@ "3. Python: Basic Concepts" ] }, + { + "cell_type": "heading", + "level": 3, + "metadata": {}, + "source": [ + "A note on Python 2 vs. Python 3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are 2 major versions of Python in widespread use: [Python 2](https://docs.python.org/2/) and [Python 3](https://docs.python.org/3/). Python 3 has some features that are not backward compatible with Python 2, and some Python 2 libraries have not been updated to work with Python 3. I have been using Python 2, primarily because I use some of those Python 2 libraries, but an increasing proportion of them are migrating to Python 3, and I anticipate shifting to Python 3 in the near future.\n", + "\n", + "For more on the topic, I recommend a very well documented IPython Notebook, which includes numerous helpful examples and links, by [Sebastian Raschka](http://sebastianraschka.com/), [Key differences between Python 2.7.x and Python 3.x](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/key_differences_between_python_2_and_3.ipynb) ... 
or [googling Python 2 vs 3](https://www.google.com/search?q=python%202%20vs%203).\n", + "\n", + "I received an email request from a Python 3 programmer who suggested that a relatively minor change in this notebook would enable it to run with Python 2 or Python 3: importing the `print_function` from [`__future__`](https://docs.python.org/2/library/__future__.html), and changing my [`print` statements (Python 2)](https://docs.python.org/2/reference/simple_stmts.html#print) to [`print` function calls (Python 3)](https://docs.python.org/3/library/functions.html#print). Although this is a relatively minor conceptual change, it requires changing many cells to reflect the Python 3 `print` syntax.\n", + "\n", + "I find the arguments for [making `print` a function rather than a statement](https://www.python.org/dev/peps/pep-3105/) compelling - especially as it is more consistent with printing functionality in many other programming languages - and so while I do not want to convert this notebook to a Python 3 notebook, I have implemented this change so that it can be used in either Python 2 or Python 3. However, while I have verified that it still works in Python 2, I have not tested it in Python 3.\n", + "\n", + "I also find the arguments for [changing the division operator](https://www.python.org/dev/peps/pep-0238/) compelling, so I will import that as well. Without this import in Python 2, `1 / 2` returns `0` (the integer portion of the quotient); with this import, `1 / 2` returns `0.5`, and if you want only the integer portion of the quotient (*floor division*), you can use `1 // 2` (which works the same way in Python 2 and Python 3).\n", + "\n", + "Note that if you don't understand some/any of the above discussion about Python 2 and Python 3, it should not affect your ability to understand the rest of this notebook." 
+ ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "from __future__ import print_function, division" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 2 + }, { "cell_type": "heading", "level": 3, @@ -348,7 +384,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The sample instance shown above can be represented as a string. A Python *string* ([`str`](http://docs.python.org/2/tutorial/introduction.html#strings)) is a sequence of 0 or more characters enclosed within a pair of single quotes (`'`) or a pair double quotes (`\"`). " + "The sample instance of a mushroom shown above can be represented as a string. A Python *string* ([`str`](http://docs.python.org/2/tutorial/introduction.html#strings)) is a sequence of 0 or more characters enclosed within a pair of single quotes (`'`) or a pair double quotes (`\"`). " ] }, { @@ -363,13 +399,13 @@ { "metadata": {}, "output_type": "pyout", - "prompt_number": 2, + "prompt_number": 3, "text": [ "'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'" ] } ], - "prompt_number": 2 + "prompt_number": 3 }, { "cell_type": "markdown", @@ -389,22 +425,22 @@ "language": "python", "metadata": {}, "outputs": [], - "prompt_number": 3 + "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The [**`print`**](http://docs.python.org/2/tutorial/inputoutput.html) statement writes the value of its comma-delimited arguments to [`sys.stdout`](http://docs.python.org/2/library/sys.html#sys.stdout) (typically the console). Each value in the output is separated by a single blank space. If the last argument is followed by a comma, the output cursor will stay on the same line." + "The [**`print`**](https://docs.python.org/3/library/functions.html#print) function writes the value of its comma-delimited arguments to [`sys.stdout`](http://docs.python.org/2/library/sys.html#sys.stdout) (typically the console). Each value in the output is separated by a single blank space. 
If an `end` argument that does not include `\\n` (newline character) is supplied, the output cursor will not move to the next line." ] }, { "cell_type": "code", "collapsed": false, "input": [ - "print 'Instance 1:', single_instance_str\n", - "print 'A', 'B', # note comma at the end\n", - "print 'C' # will appear on same line" + "print('Instance 1:', single_instance_str)\n", + "print('A', 'B', end=' ') # use a space rather than newline at the end of the line\n", + "print('C') # will appear on same line" ], "language": "python", "metadata": {}, @@ -418,7 +454,7 @@ ] } ], - "prompt_number": 4 + "prompt_number": 5 }, { "cell_type": "markdown", @@ -437,7 +473,7 @@ "A multi-line\n", "comment\n", "'''\n", - "print 'no comment'" + "print('no comment')" ], "language": "python", "metadata": {}, @@ -450,7 +486,7 @@ ] } ], - "prompt_number": 5 + "prompt_number": 6 }, { "cell_type": "markdown", @@ -464,7 +500,7 @@ "collapsed": false, "input": [ "single_instance_list = single_instance_str.split(',')\n", - "print single_instance_list" + "print(single_instance_list)" ], "language": "python", "metadata": {}, @@ -477,7 +513,7 @@ ] } ], - "prompt_number": 6 + "prompt_number": 7 }, { "cell_type": "markdown", @@ -491,7 +527,7 @@ "collapsed": false, "input": [ "mixed_list = ['a', 1, 2.3, True, [1, 'b']]\n", - "print mixed_list" + "print(mixed_list)" ], "language": "python", "metadata": {}, @@ -504,7 +540,7 @@ ] } ], - "prompt_number": 7 + "prompt_number": 8 }, { "cell_type": "markdown", @@ -518,7 +554,7 @@ "collapsed": false, "input": [ "concatenated_list = ['a', 1] + [2.3, True] + [[1, 'b']]\n", - "print concatenated_list" + "print(concatenated_list)" ], "language": "python", "metadata": {}, @@ -531,7 +567,7 @@ ] } ], - "prompt_number": 8 + "prompt_number": 9 }, { "cell_type": "markdown", @@ -544,7 +580,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "print single_instance_str[2], single_instance_list[2]" + "print(single_instance_str[2], single_instance_list[2])" ], 
"language": "python", "metadata": {}, @@ -557,7 +593,7 @@ ] } ], - "prompt_number": 9 + "prompt_number": 10 }, { "cell_type": "markdown", @@ -570,7 +606,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "print single_instance_str[-1], single_instance_list[-1]" + "print(single_instance_str[-1], single_instance_list[-1])" ], "language": "python", "metadata": {}, @@ -583,7 +619,7 @@ ] } ], - "prompt_number": 10 + "prompt_number": 11 }, { "cell_type": "markdown", @@ -596,8 +632,8 @@ "cell_type": "code", "collapsed": false, "input": [ - "print single_instance_str[2:4]\n", - "print single_instance_list[2:4]" + "print(single_instance_str[2:4])\n", + "print(single_instance_list[2:4])" ], "language": "python", "metadata": {}, @@ -611,7 +647,7 @@ ] } ], - "prompt_number": 11 + "prompt_number": 12 }, { "cell_type": "markdown", @@ -624,8 +660,8 @@ "cell_type": "code", "collapsed": false, "input": [ - "print single_instance_str[-4:-2]\n", - "print single_instance_list[-4:-2]" + "print(single_instance_str[-4:-2])\n", + "print(single_instance_list[-4:-2])" ], "language": "python", "metadata": {}, @@ -639,7 +675,7 @@ ] } ], - "prompt_number": 12 + "prompt_number": 13 }, { "cell_type": "markdown", @@ -652,10 +688,10 @@ "cell_type": "code", "collapsed": false, "input": [ - "print single_instance_str[:-1] # all but the last\n", - "print single_instance_list[:-1]\n", - "print single_instance_str[1:] # all but the first\n", - "print single_instance_list[1:]" + "print(single_instance_str[:-1]) # all but the last\n", + "print(single_instance_list[:-1])\n", + "print(single_instance_str[1:]) # all but the first\n", + "print(single_instance_list[1:])" ], "language": "python", "metadata": {}, @@ -671,7 +707,7 @@ ] } ], - "prompt_number": 13 + "prompt_number": 14 }, { "cell_type": "markdown", @@ -684,10 +720,10 @@ "cell_type": "code", "collapsed": false, "input": [ - "print single_instance_str\n", - "print single_instance_str[::2] # print elements in even-numbered positions (the 
values, in this case)\n", - "print single_instance_str[1::2] # print elements in odd-numbered positions (the commas, in this case)\n", - "print single_instance_str[::-1] # reverse the string" + "print(single_instance_str)\n", + "print(single_instance_str[::2]) # print elements in even-numbered positions (the values, in this case)\n", + "print(single_instance_str[1::2]) # print elements in odd-numbered positions (the commas, in this case)\n", + "print(single_instance_str[::-1]) # reverse the string" ], "language": "python", "metadata": {}, @@ -703,7 +739,7 @@ ] } ], - "prompt_number": 14 + "prompt_number": 15 }, { "cell_type": "markdown", @@ -744,7 +780,7 @@ " 'spore-print-color', \n", " 'population', \n", " 'habitat']\n", - "print attribute_names" + "print(attribute_names)" ], "language": "python", "metadata": {}, @@ -757,20 +793,22 @@ ] } ], - "prompt_number": 15 + "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The [`str.strip(\\[chars\\]`)](http://docs.python.org/2/library/stdtypes.html#str.strip) method returns a copy of `str` in which any leading or trailing `chars` are removed. If no `chars` are specified, it removes all leading and trailing whitespace. [*Whitespace* is any sequence of spaces, tabs (`'\\t'`) and/or newline (`'\\n'`) characters.] " + "The [`str.strip(\\[chars\\]`)](http://docs.python.org/2/library/stdtypes.html#str.strip) method returns a copy of `str` in which any leading or trailing `chars` are removed. If no `chars` are specified, it removes all leading and trailing whitespace. [*Whitespace* is any sequence of spaces, tabs (`'\\t'`) and/or newline (`'\\n'`) characters.] \n", + "\n", + "Note that since a blank space is inserted in the output after every item in a comma-delimited list, the last asterisk is printed after a leading blank space is inserted on the new line." 
] }, { "cell_type": "code", "collapsed": false, "input": [ - "print '*', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n', '*'" + "print('*', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n', '*')" ], "language": "python", "metadata": {}, @@ -780,17 +818,17 @@ "stream": "stdout", "text": [ "* \tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", - "*\n" + " *\n" ] } ], - "prompt_number": 16 + "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ - "print '*', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'.strip(), '*'" + "print('*', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'.strip(), '*')" ], "language": "python", "metadata": {}, @@ -803,7 +841,7 @@ ] } ], - "prompt_number": 17 + "prompt_number": 18 }, { "cell_type": "markdown", @@ -824,7 +862,7 @@ "input": [ "single_instance_str = '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'\n", "single_instance_list = single_instance_str.strip().split(',') # first strip leading & trailing whitespace, then split on commas\n", - "print single_instance_list" + "print(single_instance_list)" ], "language": "python", "metadata": {}, @@ -837,7 +875,7 @@ ] } ], - "prompt_number": 18 + "prompt_number": 19 }, { "cell_type": "markdown", @@ -850,7 +888,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "print '*', ','.join(single_instance_list), '*'" + "print('*', ','.join(single_instance_list), '*')" ], "language": "python", "metadata": {}, @@ -863,7 +901,7 @@ ] } ], - "prompt_number": 19 + "prompt_number": 20 }, { "cell_type": "markdown", @@ -878,7 +916,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "print len(single_instance_str), len(single_instance_list)" + "print(len(single_instance_str), len(single_instance_list))" ], "language": "python", "metadata": {}, @@ -891,7 +929,7 @@ ] } ], - "prompt_number": 20 + "prompt_number": 21 }, { "cell_type": "markdown", @@ -906,7 +944,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "print ',' in 
single_instance_str, ',' in single_instance_list" + "print(',' in single_instance_str, ',' in single_instance_list)" ], "language": "python", "metadata": {}, @@ -919,7 +957,7 @@ ] } ], - "prompt_number": 21 + "prompt_number": 22 }, { "cell_type": "markdown", @@ -932,7 +970,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "print single_instance_str.count(','), single_instance_list.count('f')" + "print(single_instance_str.count(','), single_instance_list.count('f'))" ], "language": "python", "metadata": {}, @@ -945,7 +983,7 @@ ] } ], - "prompt_number": 22 + "prompt_number": 23 }, { "cell_type": "markdown", @@ -958,7 +996,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "print single_instance_str.index(','), single_instance_list.index('f')" + "print(single_instance_str.index(','), single_instance_list.index('f'))" ], "language": "python", "metadata": {}, @@ -971,7 +1009,7 @@ ] } ], - "prompt_number": 23 + "prompt_number": 24 }, { "cell_type": "markdown", @@ -992,16 +1030,16 @@ "input": [ "list_1 = [4, 2, 3, 5, 1]\n", "list_2 = list_1 # list_2 now references the same object as list_1\n", - "print 'list_1: ', list_1\n", - "print 'list_2: ', list_2\n", + "print('list_1: ', list_1)\n", + "print('list_2: ', list_2)\n", "list_1.remove(1)\n", - "print 'list_1.remove(1):', list_1\n", + "print('list_1.remove(1):', list_1)\n", "list_1.append(6)\n", - "print 'list_1.append(6):', list_1\n", + "print('list_1.append(6):', list_1)\n", "list_1.sort()\n", - "print 'list_1.sort(): ', list_1\n", + "print('list_1.sort(): ', list_1)\n", "list_1.reverse()\n", - "print 'list_1.reverse():', list_1" + "print('list_1.reverse():', list_1)" ], "language": "python", "metadata": {}, @@ -1019,7 +1057,7 @@ ] } ], - "prompt_number": 24 + "prompt_number": 25 }, { "cell_type": "markdown", @@ -1032,8 +1070,8 @@ "cell_type": "code", "collapsed": false, "input": [ - "print 'list_1: ', list_1\n", - "print 'list_2: ', list_2" + "print('list_1: ', list_1)\n", + "print('list_2: ', list_2)" 
], "language": "python", "metadata": {}, @@ -1047,7 +1085,7 @@ ] } ], - "prompt_number": 25 + "prompt_number": 26 }, { "cell_type": "markdown", @@ -1060,10 +1098,10 @@ "cell_type": "code", "collapsed": false, "input": [ - "print 'sorted(list_1):', sorted(list_1) # return a copy of list_1 in sorted order\n", - "print 'list_1: ', list_1\n", - "print 'sorted(single_instance_str):', sorted(single_instance_str) # returns a list of sorted elements in the string\n", - "print 'single_instance_str: ', single_instance_str" + "print('sorted(list_1):', sorted(list_1)) # return a copy of list_1 in sorted order\n", + "print('list_1: ', list_1)\n", + "print('sorted(single_instance_str):', sorted(single_instance_str)) # returns a list of sorted elements in the string\n", + "print('single_instance_str: ', single_instance_str)" ], "language": "python", "metadata": {}, @@ -1080,7 +1118,7 @@ ] } ], - "prompt_number": 26 + "prompt_number": 27 }, { "cell_type": "markdown", @@ -1094,8 +1132,8 @@ "collapsed": false, "input": [ "x = (1, 2, 3, 4, 5) # a tuple\n", - "print 'x =', x, ', len(x) =', len(x), ', x.index(3) =', x.index(3), ', x[4:2:-1] = ', x[4:2:-1]\n", - "print 'sorted(x, reverse=True):', sorted(x, reverse=True) # sorted always returns a list; reverse=True specifies reverse sort order" + "print('x =', x, ', len(x) =', len(x), ', x.index(3) =', x.index(3), ', x[4:2:-1] = ', x[4:2:-1])\n", + "print('sorted(x, reverse=True):', sorted(x, reverse=True)) # sorted always returns a list; reverse=True specifies reverse sort order" ], "language": "python", "metadata": {}, @@ -1109,7 +1147,7 @@ ] } ], - "prompt_number": 27 + "prompt_number": 28 }, { "cell_type": "markdown", @@ -1122,7 +1160,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "print x.index(6) # a ValueError will be raised" + "print(x.index(6)) # a ValueError will be raised" ], "language": "python", "metadata": {}, @@ -1133,12 +1171,12 @@ "output_type": "pyerr", "traceback": [ 
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m6\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# a ValueError will be raised\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m6\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# a ValueError will be raised\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mValueError\u001b[0m: tuple.index(x): x not in tuple" ] } ], - "prompt_number": 28 + "prompt_number": 29 }, { "cell_type": "heading", @@ -1166,11 +1204,11 @@ "class_value = 'e' # try changing this to 'p' or 'x'\n", "\n", "if class_value == 'e':\n", - " print 'edible'\n", + " print('edible')\n", "elif class_value == 'p':\n", - " print 'poisonous'\n", + " print('poisonous')\n", "else:\n", - " print 'unknown'" + " print('unknown')" ], "language": "python", "metadata": {}, @@ -1183,7 +1221,7 @@ ] } ], - "prompt_number": 29 + "prompt_number": 30 }, { "cell_type": "markdown", @@ -1208,9 +1246,9 @@ "\n", "if attribute in attribute_names:\n", " i = attribute_names.index(attribute)\n", - " print attribute, 'is in position', i\n", + " print(attribute, 'is in position', i)\n", "else:\n", - " print attribute, 'is not in', attribute_names" + " print(attribute, 'is not in', attribute_names)" ], "language": "python", "metadata": {}, @@ -1223,7 +1261,7 @@ ] } ], - "prompt_number": 30 + "prompt_number": 31 }, { "cell_type": "markdown", @@ -1244,9 +1282,9 @@ "\n", "try:\n", " i = 
attribute_names.index(attribute)\n", - " print attribute, 'is in position', i\n", + " print(attribute, 'is in position', i)\n", "except ValueError:\n", - " print attribute, 'is not found'" + " print(attribute, 'is not found')" ], "language": "python", "metadata": {}, @@ -1259,7 +1297,7 @@ ] } ], - "prompt_number": 31 + "prompt_number": 32 }, { "cell_type": "markdown", @@ -1280,7 +1318,7 @@ " i = attribute_names.index(attribute)\n", " value = single_instance_list[i]\n", " \n", - "print attribute, '=', value" + "print(attribute, '=', value)" ], "language": "python", "metadata": {}, @@ -1293,7 +1331,7 @@ ] } ], - "prompt_number": 32 + "prompt_number": 33 }, { "cell_type": "heading", @@ -1307,7 +1345,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Python [*functions definitions*](http://docs.python.org/2/tutorial/controlflow.html#defining-functions) start with the **`def`** keyword followed by a function name, a list of 0 or more comma-delimited *parameters* (aka 'formal parameters') enclosed within parentheses, and then a colon ('`:`'). \n", + "Python [*function definitions*](http://docs.python.org/2/tutorial/controlflow.html#defining-functions) start with the **`def`** keyword followed by a function name, a list of 0 or more comma-delimited *parameters* (aka 'formal parameters') enclosed within parentheses, and then a colon ('`:`'). \n", "\n", "A function definition may include one or more [**`return`**](http://docs.python.org/2/reference/simple_stmts.html#the-return-statement) statemens to indicate the value(s) returned to where the function is called. It is good practice to include a short [docstring](http://docs.python.org/2/tutorial/controlflow.html#tut-docstrings) to briefly describe the behavior of the function and the value(s) it returns." 
] @@ -1327,7 +1365,7 @@ "language": "python", "metadata": {}, "outputs": [], - "prompt_number": 33 + "prompt_number": 34 }, { "cell_type": "markdown", @@ -1341,7 +1379,7 @@ "collapsed": false, "input": [ "attribute = 'cap-shape' # try substituting any of the other attribute names shown above\n", - "print attribute, '=', attribute_value(single_instance_list, attribute, attribute_names)" + "print(attribute, '=', attribute_value(single_instance_list, attribute, attribute_names))" ], "language": "python", "metadata": {}, @@ -1354,7 +1392,7 @@ ] } ], - "prompt_number": 34 + "prompt_number": 35 }, { "cell_type": "markdown", @@ -1370,10 +1408,12 @@ "collapsed": false, "input": [ "x = 0\n", - "print 'x used as a variable:', x, type(x)\n", + "print('x used as a variable:', x, type(x))\n", + "\n", "def x():\n", - " print 'x'\n", - "print 'x used as a function:', x, type(x)" + " print('x')\n", + " \n", + "print('x used as a function:', x, type(x))" ], "language": "python", "metadata": {}, @@ -1383,34 +1423,36 @@ "stream": "stdout", "text": [ "x used as a variable: 0 \n", - "x used as a function: \n" + "x used as a function: \n" ] } ], - "prompt_number": 35 + "prompt_number": 36 }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Also note that Python function arguments are passed using *call by object reference*. Thus any modifications made to a parameter that has been passed a mutable object bound to a name as an argument will persist after the function exits." + "Another way to determine the `type` of an object is to use [`isinstance(object, class)`](https://docs.python.org/2/library/functions.html#isinstance). This is generally [preferable](http://stackoverflow.com/questions/1549801/differences-between-isinstance-and-type-in-python), as it takes into account [class inheritance](https://docs.python.org/2/tutorial/classes.html#inheritance). 
There is a larger issue of [*duck typing*](https://en.wikipedia.org/wiki/Duck_typing), and whether code should ever explicitly check for the type of an object, but we will omit further discussion of the topic in this notebook.\n", + "\n", + "Checking whether an object is a `function` type will require the use of the `types` library ... and more [thorough checking](http://stackoverflow.com/questions/624926/how-to-detect-whether-a-python-variable-is-a-function/624948#624948) could be done if one wants to include built-in as well as user-defined functions." ] }, { "cell_type": "code", "collapsed": false, "input": [ - "def insert_x(list_parameter):\n", - " '''Inserts \"x\" at the head of a list, modifying the list argument'''\n", - " list_parameter.insert(0, 'x')\n", - " print 'Inserted x:', list_parameter\n", - " return list_parameter\n", - "\n", - "insert_x([1, 2, 3]) # passing an unnamed object does not affect any existing names\n", - "list_argument = [1, 2, 3] # passing a named object will affect the object bound to that name\n", - "print 'Before:', list_argument\n", - "insert_x(list_argument)\n", - "print 'After:', list_argument" + "import types\n", + "\n", + "x = 0\n", + "print('Is x an int?', isinstance(x, int))\n", + "print('Is x a function?', isinstance(x, types.FunctionType))\n", + "\n", + "def x():\n", + " print('x')\n", + " \n", + "print('Is x an int?', isinstance(x, int))\n", + "print('Is x a function?', isinstance(x, types.FunctionType))" ], "language": "python", "metadata": {}, @@ -1419,14 +1461,65 @@ "output_type": "stream", "stream": "stdout", "text": [ - "Inserted x: ['x', 1, 2, 3]\n", - "Before: [1, 2, 3]\n", - "Inserted x: ['x', 1, 2, 3]\n", - "After: ['x', 1, 2, 3]\n" + "Is x an int? True\n", + "Is x a function? False\n", + "Is x an int? False\n", + "Is x a function? 
True\n" ] } ], - "prompt_number": 36 + "prompt_number": 37 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another important feature of Python functions is that arguments are passed using [*call by sharing*](https://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_sharing). If a *mutable* object is passed as an argument to a function parameter, assignment statements using that parameter do not affect the passed argument; however, mutations to the parameter do affect the passed argument.\n", + "\n", + "Not being aware of - or forgetting - this important distinction can lead to challenging debugging sessions. " + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def modify_parameters(parameter1, parameter2):\n", + " '''Inserts \"x\" at the head of parameter1, assigns \"x\" to parameter2'''\n", + " parameter1.insert(0, 'x')\n", + " print('parameter1, after inserting \"x\":', parameter1)\n", + " parameter2 = 'x'\n", + " print('parameter2, after assigning \"x\"', parameter2)\n", + " return\n", + "\n", + "argument1 = [1, 2, 3] # passing a named object will affect the object bound to that name\n", + "argument2 = 4\n", + "print('argument1, before calling modify_parameters:', argument1)\n", + "print('argument2, before calling modify_parameters:', argument2)\n", + "print()\n", + "modify_parameters(argument1, argument2)\n", + "print()\n", + "print('argument1, after calling modify_parameters:', argument1)\n", + "print('argument2, after calling modify_parameters:', argument2)" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "argument1, before calling modify_parameters: [1, 2, 3]\n", + "argument2, before calling modify_parameters: 4\n", + "\n", + "parameter1, after inserting \"x\": ['x', 1, 2, 3]\n", + "parameter2, after assigning \"x\" x\n", + "\n", + "argument1, after calling modify_parameters: ['x', 1, 2, 3]\n", + "argument2, after calling 
modify_parameters: 4\n" + ] + } + ], + "prompt_number": 38 }, { "cell_type": "markdown", @@ -1441,18 +1534,17 @@ "cell_type": "code", "collapsed": false, "input": [ - "def insert_x_copy(list_parameter):\n", - " '''Inserts \"x\" at the head of a list, without modifying the list argument'''\n", - " list_parameter_copy = list_parameter[:]\n", - " list_parameter_copy.insert(0, 'x')\n", - " print 'Inserted x:', list_parameter_copy\n", - " return list_parameter_copy\n", - "\n", - "insert_x_copy([1, 2, 3]) # passing an unnamed object does not affect any existing names\n", - "list_argument = [1, 2, 3] # passing a named object will affect the object bound to that name\n", - "print 'Before:', list_argument\n", - "insert_x_copy(list_argument)\n", - "print 'After:', list_argument" + "def modify_parameter_copy(parameter_1):\n", + " '''Inserts \"x\" at the head of parameter_1, without modifying the list argument'''\n", + " parameter_1_copy = parameter_1[:]\n", + " parameter_1_copy.insert(0, 'x')\n", + " print('Inserted x:', parameter_1_copy)\n", + " return\n", + "\n", + "argument_1 = [1, 2, 3] # passing a named object will not affect the object bound to that name\n", + "print('Before:', argument_1)\n", + "modify_parameter_copy(argument_1)\n", + "print('After:', argument_1)" ], "language": "python", "metadata": {}, @@ -1461,14 +1553,13 @@ "output_type": "stream", "stream": "stdout", "text": [ - "Inserted x: ['x', 1, 2, 3]\n", "Before: [1, 2, 3]\n", "Inserted x: ['x', 1, 2, 3]\n", "After: [1, 2, 3]\n" ] } ], - "prompt_number": 37 + "prompt_number": 39 }, { "cell_type": "markdown", @@ -1486,13 +1577,13 @@ " return min(list_of_values), max(list_of_values)\n", "\n", "list_1 = [3, 1, 4, 2, 5]\n", - "print 'min and max of', list_1, ':', min_and_max(list_1)\n", + "print('min and max of', list_1, ':', min_and_max(list_1))\n", "\n", "min_and_max_list_1 = min_and_max(list_1) # a single variable is assigned the two-element tuple\n", - "print 'min and max of', list_1, ':', 
min_and_max_list_1\n", + "print('min and max of', list_1, ':', min_and_max_list_1)\n", "\n", "min_list_1, max_list_1 = min_and_max(list_1) # the 1st variable is assigned the 1st value, the 2nd variable is assigned the 2nd value\n", - "print 'min and max of', list_1, ':', min_list_1, ',', max_list_1" + "print('min and max of', list_1, ':', min_list_1, ',', max_list_1)" ], "language": "python", "metadata": {}, @@ -1507,7 +1598,7 @@ ] } ], - "prompt_number": 38 + "prompt_number": 40 }, { "cell_type": "heading", @@ -1530,11 +1621,12 @@ "cell_type": "code", "collapsed": false, "input": [ - "print 'Index values for attributes:', range(len(attribute_names)), '\\n'\n", + "print('Index values for attributes:', range(len(attribute_names)), end='\\n\\n') # 2 newlines\n", "\n", - "print 'Values for the', len(attribute_names), 'attributes:\\n'\n", + "print('Values for the', len(attribute_names), 'attributes:', end='\\n\\n')\n", "for i in range(len(attribute_names)):\n", - " print attribute_names[i], '=', attribute_value(single_instance_list, attribute_names[i], attribute_names)" + " print(attribute_names[i], '=', \n", + " attribute_value(single_instance_list, attribute_names[i], attribute_names))" ], "language": "python", "metadata": {}, @@ -1543,7 +1635,7 @@ "output_type": "stream", "stream": "stdout", "text": [ - "Index values for attributes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] \n", + "Index values for attributes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]\n", "\n", "Values for the 23 attributes:\n", "\n", @@ -1573,7 +1665,7 @@ ] } ], - "prompt_number": 39 + "prompt_number": 41 }, { "cell_type": "markdown", @@ -1586,9 +1678,9 @@ "cell_type": "code", "collapsed": false, "input": [ - "print 'range(5, 10):', range(5, 10)\n", - "print 'range(10, 5, -1):', range(10, 5, -1)\n", - "print 'range(0, 10, 2):', range(0, 10, 2)" + "print('range(5, 10):', range(5, 10))\n", + "print('range(10, 5, -1):', 
range(10, 5, -1))\n", + "print('range(0, 10, 2):', range(0, 10, 2))" ], "language": "python", "metadata": {}, @@ -1603,7 +1695,7 @@ ] } ], - "prompt_number": 40 + "prompt_number": 42 }, { "cell_type": "markdown", @@ -1618,11 +1710,12 @@ "cell_type": "code", "collapsed": false, "input": [ - "print xrange(len(attribute_names)), '\\n'\n", + "print(xrange(len(attribute_names)), end='\\n\\n') # prints the string representation of the object\n", "\n", - "print 'Values for the', len(attribute_names), 'attributes:\\n'\n", + "print('Values for the', len(attribute_names), 'attributes:', end='\\n\\n')\n", "for i in xrange(len(attribute_names)):\n", - " print attribute_names[i], '=', attribute_value(single_instance_list, attribute_names[i], attribute_names)" + " print(attribute_names[i], '=', \n", + " attribute_value(single_instance_list, attribute_names[i], attribute_names))" ], "language": "python", "metadata": {}, @@ -1631,7 +1724,7 @@ "output_type": "stream", "stream": "stdout", "text": [ - "xrange(23) \n", + "xrange(23)\n", "\n", "Values for the 23 attributes:\n", "\n", @@ -1661,7 +1754,7 @@ ] } ], - "prompt_number": 41 + "prompt_number": 43 }, { "cell_type": "heading", @@ -1717,6 +1810,7 @@ " return\n", "\n", "import simple_ml # this module contains my solutions to exercises\n", + "\n", "# to test your function, delete the 'simple_ml.' 
module specification in the call to print_attribute_names_and_values() below\n", "simple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)" ], @@ -1755,7 +1849,7 @@ ] } ], - "prompt_number": 42 + "prompt_number": 44 }, { "cell_type": "heading", @@ -1790,8 +1884,8 @@ " for line in f:\n", " all_instances.append(line.strip().split(','))\n", " \n", - "print 'Read', len(all_instances), 'instances from', data_filename\n", - "print 'First instance:', all_instances[0] # we don't want to print all the instances, so let's just print one to verify" + "print('Read', len(all_instances), 'instances from', data_filename)\n", + "print('First instance:', all_instances[0]) # we don't want to print all the instances, so let's just print one to verify" ], "language": "python", "metadata": {}, @@ -1805,7 +1899,7 @@ ] } ], - "prompt_number": 43 + "prompt_number": 45 }, { "cell_type": "heading", @@ -1838,8 +1932,8 @@ "data_filename = 'agaricus-lepiota.data'\n", "# to test your function, delete the 'simple_ml.' module specification in the call to load_instances() below\n", "all_instances_2 = simple_ml.load_instances(data_filename)\n", - "print 'Read', len(all_instances_2), 'instances from', data_filename\n", - "print 'First instance:', all_instances_2[0] " + "print('Read', len(all_instances_2), 'instances from', data_filename)\n", + "print('First instance:', all_instances_2[0])" ], "language": "python", "metadata": {}, @@ -1853,7 +1947,7 @@ ] } ], - "prompt_number": 44 + "prompt_number": 46 }, { "cell_type": "markdown", @@ -1863,14 +1957,18 @@ "\n", "As we saw earlier, the [`str.join(words)`](http://docs.python.org/2/library/stdtypes.html#str.join) method returns a single `str`-delimited string containing each of the strings in the list `words`.\n", "\n", - "SQL and Hive database tables often use the pipe ('|') delimiter to separate column values for each row when they are stored as flat files. 
The following code creates a new data file using pipes rather than commas to separate the attribute values." + "SQL and Hive database tables often use the pipe ('|') delimiter to separate column values for each row when they are stored as flat files. The following code creates a new data file using pipes rather than commas to separate the attribute values.\n", + "\n", + "To help maintain internal consistency, it is generally a good practice to define a variable such as `delimiter` or `separator` and bind it to the intended delimiter string, and then use the variable throughout." ] }, { "cell_type": "code", "collapsed": false, "input": [ - "print 'Converting to pipe delimiter, e.g.,', '|'.join(all_instances[0])\n", + "delimiter = '|'\n", + "\n", + "print('Converting to {}-delimited strings, e.g.,'.format(delimiter), delimiter.join(all_instances[0]))\n", "\n", "datafile2 = 'agaricus-lepiota-2.data'\n", "with open(datafile2, 'w') as f:\n", @@ -1880,9 +1978,9 @@ "all_instances_3 = []\n", "with open(datafile2, 'r') as f:\n", " for line in f:\n", - " all_instances_3.append(line.strip().split('|')) # note: changed ',' to '|'\n", - "print 'Read', len(all_instances_3), 'instances from', datafile2\n", - "print 'First instance:', all_instances_3[0] # we don't want to print all the instances, so let's just print one to verify" + " all_instances_3.append(line.strip().split(delimiter)) # note: changed ',' to '|'\n", + "print('Read', len(all_instances_3), 'instances from', datafile2)\n", + "print('First instance:', all_instances_3[0]) # we don't want to print all the instances, so let's just print one to verify" ], "language": "python", "metadata": {}, @@ -1891,13 +1989,13 @@ "output_type": "stream", "stream": "stdout", "text": [ - "Converting to pipe delimiter, e.g., p|x|s|n|t|p|f|c|n|k|e|e|s|s|w|w|p|w|o|p|k|s|u\n", + "Converting to |-delimited strings, e.g., p|x|s|n|t|p|f|c|n|k|e|e|s|s|w|w|p|w|o|p|k|s|u\n", "Read 8124 instances from agaricus-lepiota-2.data\n", "First instance: 
['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']\n" ] } ], - "prompt_number": 45 + "prompt_number": 47 }, { "cell_type": "heading", @@ -1924,11 +2022,11 @@ "cell_type": "code", "collapsed": false, "input": [ - "pipe_delimited_string = ''\n", + "delimited_string = ''\n", "for x in [1, 2, 3]:\n", - " pipe_delimited_string += str(x) + '|'\n", - "pipe_delimited_string = pipe_delimited_string[:-1]\n", - "pipe_delimited_string" + " delimited_string += str(x) + delimiter\n", + "delimited_string = delimited_string[:-1]\n", + "delimited_string" ], "language": "python", "metadata": {}, @@ -1936,13 +2034,13 @@ { "metadata": {}, "output_type": "pyout", - "prompt_number": 46, + "prompt_number": 48, "text": [ "'1|2|3'" ] } ], - "prompt_number": 46 + "prompt_number": 48 }, { "cell_type": "markdown", @@ -1955,7 +2053,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "'|'.join([str(x) for x in [1, 2, 3]])" + "delimiter.join([str(x) for x in [1, 2, 3]])" ], "language": "python", "metadata": {}, @@ -1963,13 +2061,13 @@ { "metadata": {}, "output_type": "pyout", - "prompt_number": 47, + "prompt_number": 49, "text": [ "'1|2|3'" ] } ], - "prompt_number": 47 + "prompt_number": 49 }, { "cell_type": "markdown", @@ -1987,12 +2085,14 @@ "collapsed": false, "input": [ "# version 1: using an if statement nested within a for statement\n", + "unknown_value = '?'\n", + "\n", "clean_instances = []\n", "for instance in all_instances:\n", - " if '?' 
not in instance:\n", + " if unknown_value not in instance:\n", " clean_instances.append(instance)\n", " \n", - "print len(clean_instances), 'clean instances'" + "print(len(clean_instances), 'clean instances')" ], "language": "python", "metadata": {}, @@ -2005,16 +2105,16 @@ ] } ], - "prompt_number": 48 + "prompt_number": 50 }, { "cell_type": "code", "collapsed": false, "input": [ "# version 2: using an equivalent list comprehension\n", - "clean_instances_2 = [instance for instance in all_instances if '?' not in instance]\n", + "clean_instances_2 = [instance for instance in all_instances if unknown_value not in instance]\n", "\n", - "print len(clean_instances_2), 'clean instances'" + "print(len(clean_instances_2), 'clean instances')" ], "language": "python", "metadata": {}, @@ -2027,7 +2127,7 @@ ] } ], - "prompt_number": 49 + "prompt_number": 51 }, { "cell_type": "heading", @@ -2058,10 +2158,15 @@ "cell_type": "code", "collapsed": false, "input": [ - "attribute_values_cap_type = {'b': 'bell', 'c': 'conical', 'x': 'convex', 'f': 'flat', 'k': 'knobbed', 's': 'sunken'}\n", + "attribute_values_cap_type = {'b': 'bell', \n", + " 'c': 'conical', \n", + " 'x': 'convex', \n", + " 'f': 'flat', \n", + " 'k': 'knobbed', \n", + " 's': 'sunken'}\n", "\n", "attribute_value_abbrev = 'x'\n", - "print attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev]" + "print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])" ], "language": "python", "metadata": {}, @@ -2074,7 +2179,7 @@ ] } ], - "prompt_number": 50 + "prompt_number": 52 }, { "cell_type": "markdown", @@ -2090,7 +2195,7 @@ "collapsed": false, "input": [ "for attribute_value_abbrev in attribute_values_cap_type:\n", - " print attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev]" + " print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])" ], "language": "python", "metadata": {}, @@ -2108,7 +2213,7 @@ ] } ], - 
"prompt_number": 51 + "prompt_number": 53 }, { "cell_type": "markdown", @@ -2123,8 +2228,8 @@ "cell_type": "code", "collapsed": false, "input": [ - "attribute_values_cap_type_2 ={x[0]: x for x in ['bell', 'conical', 'flat', 'knobbed', 'sunken']}\n", - "print attribute_values_cap_type_2" + "attribute_values_cap_type_2 = {x[0]: x for x in ['bell', 'conical', 'flat', 'knobbed', 'sunken']}\n", + "print(attribute_values_cap_type_2)" ], "language": "python", "metadata": {}, @@ -2137,7 +2242,7 @@ ] } ], - "prompt_number": 52 + "prompt_number": 54 }, { "cell_type": "markdown", @@ -2190,8 +2295,8 @@ "\n", "attribute_filename = 'agaricus-lepiota.attributes'\n", "attribute_values = load_attribute_values(attribute_filename)\n", - "print 'Read', len(attribute_values), 'attribute values from', attribute_filename\n", - "print 'First attribute values list:', attribute_values[0]" + "print('Read', len(attribute_values), 'attribute values from', attribute_filename)\n", + "print('First attribute values list:', attribute_values[0])" ], "language": "python", "metadata": {}, @@ -2205,7 +2310,7 @@ ] } ], - "prompt_number": 53 + "prompt_number": 55 }, { "cell_type": "heading", @@ -2250,8 +2355,9 @@ "attribute_filename = 'agaricus-lepiota.attributes'\n", "# to test your function, delete the 'simple_ml.' 
module specification in the call to load_attribute_names_and_values() below\n", "attribute_names_and_values = simple_ml.load_attribute_names_and_values(attribute_filename)\n", - "print 'Read', len(attribute_names_and_values), 'attribute values from', attribute_filename\n", - "print 'First attribute name:', attribute_names_and_values[0]['name'], '; values:', attribute_names_and_values[0]['values']" + "print('Read', len(attribute_names_and_values), 'attribute values from', attribute_filename)\n", + "print('First attribute name:', attribute_names_and_values[0]['name'], \n", + " '; values:', attribute_names_and_values[0]['values'])" ], "language": "python", "metadata": {}, @@ -2265,7 +2371,7 @@ ] } ], - "prompt_number": 54 + "prompt_number": 56 }, { "cell_type": "heading", @@ -2291,7 +2397,8 @@ " if instance[0] == 'e':\n", " edible_count += 1 # this is shorthand for edible_count = edible_count + 1\n", "\n", - "print 'There are', edible_count, 'edible mushrooms among the', len(clean_instances), 'clean instances'" + "print('There are', edible_count, 'edible mushrooms among the', \n", + " len(clean_instances), 'clean instances')" ], "language": "python", "metadata": {}, @@ -2304,7 +2411,7 @@ ] } ], - "prompt_number": 55 + "prompt_number": 57 }, { "cell_type": "markdown", @@ -2326,9 +2433,9 @@ " cap_state_value_counts[cap_state_value] = 0\n", " cap_state_value_counts[cap_state_value] += 1\n", "\n", - "print 'Counts for each value of cap-state:'\n", + "print('Counts for each value of cap-state:')\n", "for value in cap_state_value_counts:\n", - " print value, ':', cap_state_value_counts[value]" + " print(value, ':', cap_state_value_counts[value])" ], "language": "python", "metadata": {}, @@ -2347,7 +2454,7 @@ ] } ], - "prompt_number": 56 + "prompt_number": 58 }, { "cell_type": "markdown", @@ -2369,9 +2476,9 @@ " cap_state_value = instance[1]\n", " cap_state_value_counts[cap_state_value] += 1\n", "\n", - "print 'Counts for each value of cap-state:'\n", + "print('Counts for 
each value of cap-state:')\n", "for value in cap_state_value_counts:\n", - " print value, ':', cap_state_value_counts[value]" + " print(value, ':', cap_state_value_counts[value])" ], "language": "python", "metadata": {}, @@ -2390,7 +2497,7 @@ ] } ], - "prompt_number": 57 + "prompt_number": 59 }, { "cell_type": "heading", @@ -2417,9 +2524,9 @@ "# remove 'simple_ml.' below to test your function definition\n", "attribute_value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)\n", "\n", - "print 'Counts for each value of', attribute, ':'\n", + "print('Counts for each value of', attribute, ':')\n", "for value in attribute_value_counts:\n", - " print value, ':', attribute_value_counts[value]" + " print(value, ':', attribute_value_counts[value])" ], "language": "python", "metadata": {}, @@ -2438,7 +2545,7 @@ ] } ], - "prompt_number": 58 + "prompt_number": 60 }, { "cell_type": "heading", @@ -2463,8 +2570,8 @@ "input": [ "original_list = [3, 1, 4, 2, 5]\n", "sorted_list = sorted(original_list)\n", - "print original_list\n", - "print sorted_list" + "print(original_list)\n", + "print(sorted_list)" ], "language": "python", "metadata": {}, @@ -2478,7 +2585,7 @@ ] } ], - "prompt_number": 59 + "prompt_number": 61 }, { "cell_type": "markdown", @@ -2491,7 +2598,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "print sorted('python')" + "print(sorted('python'))" ], "language": "python", "metadata": {}, @@ -2504,7 +2611,7 @@ ] } ], - "prompt_number": 60 + "prompt_number": 62 }, { "cell_type": "markdown", @@ -2517,7 +2624,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "print sorted(attribute_values_cap_type) # returns a list of sorted keys (but not values) in the dictionary" + "print(sorted(attribute_values_cap_type)) # returns a list of sorted keys (but not values) in the dictionary" ], "language": "python", "metadata": {}, @@ -2530,7 +2637,7 @@ ] } ], - "prompt_number": 61 + "prompt_number": 63 }, { "cell_type": "markdown", 
@@ -2544,7 +2651,7 @@ "collapsed": false, "input": [ "for attribute_value_abbrev in sorted(attribute_values_cap_type):\n", - " print attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev]" + " print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])" ], "language": "python", "metadata": {}, @@ -2562,7 +2669,7 @@ ] } ], - "prompt_number": 62 + "prompt_number": 64 }, { "cell_type": "markdown", @@ -2575,7 +2682,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "print sorted([3, 1, 4, 2, 5], reverse=True)" + "print(sorted([3, 1, 4, 2, 5], reverse=True))" ], "language": "python", "metadata": {}, @@ -2588,13 +2695,13 @@ ] } ], - "prompt_number": 63 + "prompt_number": 65 }, { "cell_type": "code", "collapsed": false, "input": [ - "print sorted(attribute_values_cap_type, reverse=True) " + "print(sorted(attribute_values_cap_type, reverse=True))" ], "language": "python", "metadata": {}, @@ -2607,7 +2714,7 @@ ] } ], - "prompt_number": 64 + "prompt_number": 66 }, { "cell_type": "code", @@ -2616,9 +2723,9 @@ "attribute = 'cap-shape'\n", "attribute_value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)\n", "\n", - "print 'Counts for each value of', attribute, ':'\n", + "print('Counts for each value of', attribute, ':')\n", "for value in sorted(attribute_value_counts):\n", - " print value, ':', attribute_value_counts[value]" + " print(value, ':', attribute_value_counts[value])" ], "language": "python", "metadata": {}, @@ -2637,7 +2744,7 @@ ] } ], - "prompt_number": 65 + "prompt_number": 67 }, { "cell_type": "heading", @@ -2679,7 +2786,7 @@ { "metadata": {}, "output_type": "pyout", - "prompt_number": 66, + "prompt_number": 68, "text": [ "[('c', 'conical'),\n", " ('b', 'bell'),\n", @@ -2690,7 +2797,7 @@ ] } ], - "prompt_number": 66 + "prompt_number": 68 }, { "cell_type": "markdown", @@ -2711,20 +2818,20 @@ { "metadata": {}, "output_type": "pyout", - "prompt_number": 67, + 
"prompt_number": 69, "text": [ - "" + "" ] } ], - "prompt_number": 67 + "prompt_number": 69 }, { "cell_type": "code", "collapsed": false, "input": [ "for key, value in attribute_values_cap_type.iteritems():\n", - " print key, value" + " print(key, ':', value)" ], "language": "python", "metadata": {}, @@ -2733,16 +2840,16 @@ "output_type": "stream", "stream": "stdout", "text": [ - "c conical\n", - "b bell\n", - "f flat\n", - "k knobbed\n", - "s sunken\n", - "x convex\n" + "c : conical\n", + "b : bell\n", + "f : flat\n", + "k : knobbed\n", + "s : sunken\n", + "x : convex\n" ] } ], - "prompt_number": 68 + "prompt_number": 70 }, { "cell_type": "markdown", @@ -2771,7 +2878,7 @@ { "metadata": {}, "output_type": "pyout", - "prompt_number": 69, + "prompt_number": 71, "text": [ "[('b', 'bell'),\n", " ('c', 'conical'),\n", @@ -2782,7 +2889,7 @@ ] } ], - "prompt_number": 69 + "prompt_number": 71 }, { "cell_type": "markdown", @@ -2798,9 +2905,9 @@ "attribute = 'cap-shape'\n", "value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)\n", "\n", - "print 'Counts for each value of', attribute, ':'\n", + "print('Counts for each value of', attribute, '(sorted by count):')\n", "for value, count in sorted(value_counts.iteritems(), key=operator.itemgetter(1), reverse=True):\n", - " print value, ':', count" + " print(value, ':', count)" ], "language": "python", "metadata": {}, @@ -2809,7 +2916,7 @@ "output_type": "stream", "stream": "stdout", "text": [ - "Counts for each value of cap-shape :\n", + "Counts for each value of cap-shape (sorted by count):\n", "x : 2840\n", "f : 2432\n", "b : 300\n", @@ -2819,7 +2926,7 @@ ] } ], - "prompt_number": 70 + "prompt_number": 72 }, { "cell_type": "heading", @@ -2844,12 +2951,12 @@ "cell_type": "code", "collapsed": false, "input": [ - "print 'Output of a sample line using str.format():'\n", - "print 'class:', # comma at end keeps cursor on the same line for subsequent print statements\n", - "print '{} = {} 
({:5.3f}),'.format('e', 3488, 3488 / 5644.0),\n", - "print '{} = {} ({:5.3f}),'.format('p', 2156, 2156 / 5644.0),\n", - "print # a print statement with no arguments will advance the cursor to the beginning of the next line\n", - "print 'End of sample line'" + "print('Output of a sample line using str.format():')\n", + "print('class:', end=' ') # keeps cursor on the same line for subsequent print statements\n", + "print('{} = {} ({:5.3f}),'.format('e', 3488, 3488 / 5644.0), end=' ')\n", + "print('{} = {} ({:5.3f}),'.format('p', 2156, 2156 / 5644.0), end=' ')\n", + "print() # a print statement with no arguments will advance the cursor to the beginning of the next line\n", + "print('End of sample line')" ], "language": "python", "metadata": {}, @@ -2859,12 +2966,12 @@ "stream": "stdout", "text": [ "Output of a sample line using str.format():\n", - "class: e = 3488 (0.618), p = 2156 (0.382),\n", + "class: e = 3488 (0.618), p = 2156 (0.382), \n", "End of sample line\n" ] } ], - "prompt_number": 71 + "prompt_number": 73 }, { "cell_type": "markdown", @@ -2879,7 +2986,7 @@ "input": [ "# your function definition goes here\n", "\n", - "print '\\nCounts for all attributes and values:\\n'\n", + "print('\\nCounts for all attributes and values:\\n')\n", "simple_ml.print_all_attribute_value_counts(clean_instances, attribute_names)" ], "language": "python", @@ -2892,33 +2999,33 @@ "\n", "Counts for all attributes and values:\n", "\n", - "class: e = 3488 (0.618), p = 2156 (0.382),\n", - "cap-shape: x = 2840 (0.503), f = 2432 (0.431), b = 300 (0.053), k = 36 (0.006), s = 32 (0.006), c = 4 (0.001),\n", - "cap-surface: y = 2220 (0.393), f = 2160 (0.383), s = 1260 (0.223), g = 4 (0.001),\n", - "cap-color: g = 1696 (0.300), n = 1164 (0.206), y = 1056 (0.187), w = 880 (0.156), e = 588 (0.104), b = 120 (0.021), p = 96 (0.017), c = 44 (0.008),\n", - "bruises?: t = 3184 (0.564), f = 2460 (0.436),\n", - "odor: n = 2776 (0.492), f = 1584 (0.281), a = 400 (0.071), l = 400 (0.071), p = 256 
(0.045), c = 192 (0.034), m = 36 (0.006),\n", - "gill-attachment: f = 5626 (0.997), a = 18 (0.003),\n", - "gill-spacing: c = 4620 (0.819), w = 1024 (0.181),\n", - "gill-size: b = 4940 (0.875), n = 704 (0.125),\n", - "gill-color: p = 1384 (0.245), n = 984 (0.174), w = 966 (0.171), h = 720 (0.128), g = 656 (0.116), u = 480 (0.085), k = 408 (0.072), r = 24 (0.004), y = 22 (0.004),\n", - "stalk-shape: t = 2880 (0.510), e = 2764 (0.490),\n", - "stalk-root: b = 3776 (0.669), e = 1120 (0.198), c = 556 (0.099), r = 192 (0.034),\n", - "stalk-surface-above-ring: s = 3736 (0.662), k = 1332 (0.236), f = 552 (0.098), y = 24 (0.004),\n", - "stalk-surface-below-ring: s = 3544 (0.628), k = 1296 (0.230), f = 552 (0.098), y = 252 (0.045),\n", - "stalk-color-above-ring: w = 3136 (0.556), p = 1008 (0.179), g = 576 (0.102), n = 448 (0.079), b = 432 (0.077), c = 36 (0.006), y = 8 (0.001),\n", - "stalk-color-below-ring: w = 3088 (0.547), p = 1008 (0.179), g = 576 (0.102), n = 496 (0.088), b = 432 (0.077), c = 36 (0.006), y = 8 (0.001),\n", - "veil-type: p = 5644 (1.000),\n", - "veil-color: w = 5636 (0.999), y = 8 (0.001),\n", - "ring-number: o = 5488 (0.972), t = 120 (0.021), n = 36 (0.006),\n", - "ring-type: p = 3488 (0.618), l = 1296 (0.230), e = 824 (0.146), n = 36 (0.006),\n", - "spore-print-color: n = 1920 (0.340), k = 1872 (0.332), h = 1584 (0.281), w = 148 (0.026), r = 72 (0.013), u = 48 (0.009),\n", - "population: v = 2160 (0.383), y = 1688 (0.299), s = 1104 (0.196), a = 384 (0.068), n = 256 (0.045), c = 52 (0.009),\n", - "habitat: d = 2492 (0.442), g = 1860 (0.330), p = 568 (0.101), u = 368 (0.065), m = 292 (0.052), l = 64 (0.011),\n" + "class: e = 3488 (0.618), p = 2156 (0.382), \n", + "cap-shape: x = 2840 (0.503), f = 2432 (0.431), b = 300 (0.053), k = 36 (0.006), s = 32 (0.006), c = 4 (0.001), \n", + "cap-surface: y = 2220 (0.393), f = 2160 (0.383), s = 1260 (0.223), g = 4 (0.001), \n", + "cap-color: g = 1696 (0.300), n = 1164 (0.206), y = 1056 (0.187), w = 880 (0.156), e = 
588 (0.104), b = 120 (0.021), p = 96 (0.017), c = 44 (0.008), \n", + "bruises?: t = 3184 (0.564), f = 2460 (0.436), \n", + "odor: n = 2776 (0.492), f = 1584 (0.281), a = 400 (0.071), l = 400 (0.071), p = 256 (0.045), c = 192 (0.034), m = 36 (0.006), \n", + "gill-attachment: f = 5626 (0.997), a = 18 (0.003), \n", + "gill-spacing: c = 4620 (0.819), w = 1024 (0.181), \n", + "gill-size: b = 4940 (0.875), n = 704 (0.125), \n", + "gill-color: p = 1384 (0.245), n = 984 (0.174), w = 966 (0.171), h = 720 (0.128), g = 656 (0.116), u = 480 (0.085), k = 408 (0.072), r = 24 (0.004), y = 22 (0.004), \n", + "stalk-shape: t = 2880 (0.510), e = 2764 (0.490), \n", + "stalk-root: b = 3776 (0.669), e = 1120 (0.198), c = 556 (0.099), r = 192 (0.034), \n", + "stalk-surface-above-ring: s = 3736 (0.662), k = 1332 (0.236), f = 552 (0.098), y = 24 (0.004), \n", + "stalk-surface-below-ring: s = 3544 (0.628), k = 1296 (0.230), f = 552 (0.098), y = 252 (0.045), \n", + "stalk-color-above-ring: w = 3136 (0.556), p = 1008 (0.179), g = 576 (0.102), n = 448 (0.079), b = 432 (0.077), c = 36 (0.006), y = 8 (0.001), \n", + "stalk-color-below-ring: w = 3088 (0.547), p = 1008 (0.179), g = 576 (0.102), n = 496 (0.088), b = 432 (0.077), c = 36 (0.006), y = 8 (0.001), \n", + "veil-type: p = 5644 (1.000), \n", + "veil-color: w = 5636 (0.999), y = 8 (0.001), \n", + "ring-number: o = 5488 (0.972), t = 120 (0.021), n = 36 (0.006), \n", + "ring-type: p = 3488 (0.618), l = 1296 (0.230), e = 824 (0.146), n = 36 (0.006), \n", + "spore-print-color: n = 1920 (0.340), k = 1872 (0.332), h = 1584 (0.281), w = 148 (0.026), r = 72 (0.013), u = 48 (0.009), \n", + "population: v = 2160 (0.383), y = 1688 (0.299), s = 1104 (0.196), a = 384 (0.068), n = 256 (0.045), c = 52 (0.009), \n", + "habitat: d = 2492 (0.442), g = 1860 (0.330), p = 568 (0.101), u = 368 (0.065), m = 292 (0.052), l = 64 (0.011), \n" ] } ], - "prompt_number": 72 + "prompt_number": 74 }, { "cell_type": "heading", @@ -2992,7 +3099,9 @@ "\n", "From the output 
above, we know that the proportion of `clean_instances` that are labeled `'e'` (class `edible`) in the UCI dataset is $3488 \\div 5644 = 0.618$, and the proportion labeled `'p'` (class `poisonous`) is $2156 \\div 5644 = 0.382$.\n", "\n", - "After importing the Python [`math`](http://docs.python.org/2/library/math.html) module, we can use the [`math.log(x[, base])`](http://docs.python.org/2/library/math.html#math.log) function in computing the entropy of the `clean_instances` of the UCI mushroom data set as follows:" + "After importing the Python [`math`](http://docs.python.org/2/library/math.html) module, we can use the [`math.log(x[, base])`](http://docs.python.org/2/library/math.html#math.log) function in computing the entropy of the `clean_instances` of the UCI mushroom data set as follows.\n", + "\n", + "Note that you can use a backslash character (`\\`) at the end of a line to continue the statement on the next line (this should generally be used sparingly)." ] }, { @@ -3000,8 +3109,10 @@ "collapsed": false, "input": [ "import math\n", - "entropy = - (3488 / 5644.0) * math.log(3488 / 5644.0, 2) - (2156 / 5644.0) * math.log(2156 / 5644.0, 2)\n", - "print entropy" + "\n", + "entropy = - (3488 / 5644.0) * math.log(3488 / 5644.0, 2) \\\n", + " - (2156 / 5644.0) * math.log(2156 / 5644.0, 2)\n", + "print(entropy)" ], "language": "python", "metadata": {}, @@ -3014,7 +3125,7 @@ ] } ], - "prompt_number": 73 + "prompt_number": 75 }, { "cell_type": "heading", @@ -3040,7 +3151,7 @@ "# your function definition here\n", "\n", "# delete 'simple_ml.' 
below to test your function\n", - "print simple_ml.entropy(clean_instances)" + "print(simple_ml.entropy(clean_instances))" ], "language": "python", "metadata": {}, @@ -3053,7 +3164,7 @@ ] } ], - "prompt_number": 74 + "prompt_number": 76 }, { "cell_type": "heading", @@ -3086,9 +3197,10 @@ "cell_type": "code", "collapsed": false, "input": [ - "print 'Information gain for different attributes:\\n'\n", + "print('Information gain for different attributes:', end='\\n\\n')\n", "for i in range(1, len(attribute_names)):\n", - " print '{:5.3f} {:2} {}'.format(simple_ml.information_gain(clean_instances, i), i, attribute_names[i])" + " print('{:5.3f} {:2} {}'.format(\n", + " simple_ml.information_gain(clean_instances, i), i, attribute_names[i]))" ], "language": "python", "metadata": {}, @@ -3115,7 +3227,9 @@ "0.306 14 stalk-color-above-ring\n", "0.279 15 stalk-color-below-ring\n", "0.000 16 veil-type\n", - "0.002 17 veil-color" + "0.002 17 veil-color\n", + "0.012 18 ring-number\n", + "0.463 19 ring-type" ] }, { @@ -3123,15 +3237,13 @@ "stream": "stdout", "text": [ "\n", - "0.012 18 ring-number\n", - "0.463 19 ring-type\n", "0.583 20 spore-print-color\n", "0.110 21 population\n", "0.101 22 habitat\n" ] } ], - "prompt_number": 75 + "prompt_number": 77 }, { "cell_type": "markdown", @@ -3144,13 +3256,14 @@ "cell_type": "code", "collapsed": false, "input": [ - "print 'Information gain for different attributes:\\n'\n", - "sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) for i in range(1, len(attribute_names))], \n", + "print('Information gain for different attributes:', end='\\n\\n')\n", + "sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) \\\n", + " for i in range(1, len(attribute_names))], \n", " reverse=True)\n", - "print sorted_information_gain_indexes, '\\n'\n", + "print(sorted_information_gain_indexes, end='\\n\\n')\n", "\n", "for gain, i in sorted_information_gain_indexes:\n", - " print 
'{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])" + " print('{:5.3f} {:2} {}'.format(gain, i, attribute_names[i]))" ], "language": "python", "metadata": {}, @@ -3168,7 +3281,7 @@ "output_type": "stream", "stream": "stdout", "text": [ - " \n", + "\n", "\n", "0.860 5 odor\n", "0.583 20 spore-print-color\n", @@ -3195,7 +3308,7 @@ ] } ], - "prompt_number": 76 + "prompt_number": 78 }, { "cell_type": "markdown", @@ -3208,7 +3321,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "print 'Information gain for different attributes:\\n'\n", + "print('Information gain for different attributes:', end='\\n\\n')\n", "\n", "information_gain_values = []\n", "for i in range(1, len(attribute_names)):\n", @@ -3216,10 +3329,10 @@ " \n", "sorted_information_gain_indexes = sorted(information_gain_values, \n", " reverse=True)\n", - "print sorted_information_gain_indexes, '\\n'\n", + "print(sorted_information_gain_indexes, end='\\n\\n')\n", "\n", "for gain, i in sorted_information_gain_indexes:\n", - " print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])" + " print('{:5.3f} {:2} {}'.format(gain, i, attribute_names[i]))" ], "language": "python", "metadata": {}, @@ -3237,7 +3350,7 @@ "output_type": "stream", "stream": "stdout", "text": [ - " \n", + "\n", "\n", "0.860 5 odor\n", "0.583 20 spore-print-color\n", @@ -3264,7 +3377,7 @@ ] } ], - "prompt_number": 77 + "prompt_number": 79 }, { "cell_type": "heading", @@ -3291,9 +3404,9 @@ "sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) for i in range(1, len(attribute_names))], \n", " reverse=True)\n", "\n", - "print 'Information gain for different attributes:\\n'\n", + "print('Information gain for different attributes:', end='\\n\\n')\n", "for gain, i in sorted_information_gain_indexes:\n", - " print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])" + " print('{:5.3f} {:2} {}'.format(gain, i, attribute_names[i]))" ], "language": "python", "metadata": {}, @@ -3329,7 +3442,7 @@ ] 
} ], - "prompt_number": 78 + "prompt_number": 80 }, { "cell_type": "heading", @@ -3390,7 +3503,7 @@ " return partitions\n", "\n", "partitions = split_instances(clean_instances, 5)\n", - "print [(partition, len(partitions[partition])) for partition in partitions]" + "print([(partition, len(partitions[partition])) for partition in partitions])" ], "language": "python", "metadata": {}, @@ -3403,7 +3516,7 @@ ] } ], - "prompt_number": 79 + "prompt_number": 81 }, { "cell_type": "markdown", @@ -3434,7 +3547,8 @@ "# your function here\n", "\n", "# delete 'simple_ml.' below to test your function:\n", - "print 'Best attribute index:', simple_ml.choose_best_attribute_index(clean_instances, range(1, len(attribute_names)))" + "print('Best attribute index:', \n", + " simple_ml.choose_best_attribute_index(clean_instances, range(1, len(attribute_names))))" ], "language": "python", "metadata": {}, @@ -3443,18 +3557,11 @@ "output_type": "stream", "stream": "stdout", "text": [ - "Best attribute index: " - ] - }, - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "5\n" + "Best attribute index: 5\n" ] } ], - "prompt_number": 80 + "prompt_number": 82 }, { "cell_type": "markdown", @@ -3478,10 +3585,10 @@ "from collections import Counter\n", "\n", "class_counts = Counter([instance[0] for instance in clean_instances])\n", - "print 'class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(\n", + "print('class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(\n", " class_counts, # the Counter object\n", " class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count\n", - " class_counts.most_common(1)[0][0]) # the most common value (1st element in that tuple)" + " class_counts.most_common(1)[0][0])) # the most common value (1st element in that tuple)" ], "language": "python", "metadata": {}, @@ -3494,7 +3601,7 @@ ] } ], - "prompt_number": 81 + "prompt_number": 83 }, { "cell_type": 
"markdown", @@ -3512,10 +3619,10 @@ " class_values.append(instance[0])\n", " \n", "class_counts = Counter(class_values)\n", - "print 'class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(\n", + "print ('class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(\n", " class_counts, # the Counter object\n", " class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count\n", - " class_counts.most_common(1)[0][0]) # the most common value (1st element in that tuple)" + " class_counts.most_common(1)[0][0])) # the most common value (1st element in that tuple)" ], "language": "python", "metadata": {}, @@ -3528,7 +3635,7 @@ ] } ], - "prompt_number": 82 + "prompt_number": 84 }, { "cell_type": "markdown", @@ -3549,11 +3656,11 @@ "collapsed": false, "input": [ "for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n", - " print '\"{}\" is'.format(x),\n", + " print('\"{}\" is'.format(x), end=' ')\n", " if x:\n", - " print True\n", + " print(True)\n", " else:\n", - " print False" + " print(False)" ], "language": "python", "metadata": {}, @@ -3573,7 +3680,7 @@ ] } ], - "prompt_number": 83 + "prompt_number": 85 }, { "cell_type": "markdown", @@ -3587,7 +3694,7 @@ "collapsed": false, "input": [ "for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n", - " print '\"{}\" is {}'.format(x, True if x else False) # using conditional expression as second argument to format()" + " print('\"{}\" is {}'.format(x, True if x else False)) # using conditional expression as second argument to format()" ], "language": "python", "metadata": {}, @@ -3607,7 +3714,7 @@ ] } ], - "prompt_number": 84 + "prompt_number": 86 }, { "cell_type": "markdown", @@ -3628,7 +3735,7 @@ "input": [ "def parameter_test(parameter1=None, parameter2=None):\n", " '''Prints the values of parameter1 and parameter2'''\n", - " print 'parameter1: {}; parameter2: {}'.format(parameter1, parameter2)\n", + " print('parameter1: {}; parameter2: 
{}'.format(parameter1, parameter2))\n", " \n", "parameter_test() # no args are required\n", "parameter_test(1) # if any args are provided, 1st arg gets assigned to parameter1\n", @@ -3653,7 +3760,7 @@ ] } ], - "prompt_number": 85 + "prompt_number": 87 }, { "cell_type": "heading", @@ -3677,10 +3784,13 @@ "# your definition of majority_value(instances) here\n", "\n", "# delete 'simple_ml.' below to test your function:\n", - "print 'Majority value of index {}: {}'.format(0, simple_ml.majority_value(clean_instances)) # note: relying on default parameter here\n", + "print('Majority value of index {}: {}'.format(\n", + " 0, simple_ml.majority_value(clean_instances))) # note: relying on default parameter here\n", "# although there is only one class_index for the dataset, we'll test it by providing non-default values\n", - "print 'Majority value of index {}: {}'.format(1, simple_ml.majority_value(clean_instances, 1)) # using an optional 2nd argument\n", - "print 'Majority value of index {}: {}'.format(2, simple_ml.majority_value(clean_instances, class_index=2)) # using a keyword" + "print('Majority value of index {}: {}'.format(\n", + " 1, simple_ml.majority_value(clean_instances, 1))) # supplying an optional 2nd argument\n", + "print('Majority value of index {}: {}'.format(\n", + " 2, simple_ml.majority_value(clean_instances, class_index=2))) # supplying argument as a keyword" ], "language": "python", "metadata": {}, @@ -3695,7 +3805,7 @@ ] } ], - "prompt_number": 86 + "prompt_number": 88 }, { "cell_type": "markdown", @@ -3734,22 +3844,24 @@ " # If the dataset is empty or the candidate attributes list is empty, return the default value\n", " if not instances or not candidate_attribute_indexes:\n", " if trace:\n", - " print '{}Using default class {}'.format('< ' * trace, default_class)\n", + " print('{}Using default class {}'.format('< ' * trace, default_class))\n", " return default_class\n", " \n", " # If all the instances have the same class label, return that class 
label\n", " elif len(class_labels_and_counts) == 1:\n", " class_label = class_labels_and_counts.most_common(1)[0][0]\n", " if trace:\n", - " print '{}All {} instances have label {}'.format('< ' * trace, len(instances), class_label)\n", + " print('{}All {} instances have label {}'.format(\n", + " '< ' * trace, len(instances), class_label))\n", " return class_label\n", " else:\n", " default_class = simple_ml.majority_value(instances, class_index)\n", "\n", " # Choose the next best attribute index to best classify the instances\n", - " best_index = simple_ml.choose_best_attribute_index(instances, candidate_attribute_indexes, class_index) \n", + " best_index = simple_ml.choose_best_attribute_index(\n", + " instances, candidate_attribute_indexes, class_index) \n", " if trace:\n", - " print '{}Creating tree node for attribute index {}'.format('> ' * trace, best_index)\n", + " print('{}Creating tree node for attribute index {}'.format('> ' * trace, best_index))\n", "\n", " # Create a new decision tree node with the best attribute index and an empty dictionary object (for now)\n", " tree = {best_index:{}}\n", @@ -3761,13 +3873,13 @@ " remaining_candidate_attribute_indexes = [i for i in candidate_attribute_indexes if i != best_index]\n", " for attribute_value in partitions:\n", " if trace:\n", - " print '{}Creating subtree for value {} ({}, {}, {}, {})'.format(\n", + " print('{}Creating subtree for value {} ({}, {}, {}, {})'.format(\n", " '> ' * trace,\n", " attribute_value, \n", " len(partitions[attribute_value]), \n", " len(remaining_candidate_attribute_indexes), \n", " class_index, \n", - " default_class)\n", + " default_class))\n", " \n", " # Create a subtree for each value of the the best attribute\n", " subtree = create_decision_tree(\n", @@ -3786,7 +3898,7 @@ "training_instances = clean_instances[:-20]\n", "testing_instances = clean_instances[-20:]\n", "tree = create_decision_tree(training_instances, trace=1) # remove trace=1 to turn off tracing\n", - "print tree" + 
"print(tree)" ], "language": "python", "metadata": {}, @@ -3828,7 +3940,7 @@ ] } ], - "prompt_number": 87 + "prompt_number": 89 }, { "cell_type": "markdown", @@ -3858,6 +3970,7 @@ "collapsed": false, "input": [ "from pprint import pprint\n", + "\n", "pprint(tree)" ], "language": "python", @@ -3880,7 +3993,7 @@ ] } ], - "prompt_number": 88 + "prompt_number": 90 }, { "cell_type": "heading", @@ -3921,7 +4034,7 @@ "for instance in testing_instances:\n", " predicted_label = classify(tree, instance)\n", " actual_label = instance[0]\n", - " print 'predicted: {}; actual: {}'.format(predicted_label, actual_label)" + " print('predicted: {}; actual: {}'.format(predicted_label, actual_label))" ], "language": "python", "metadata": {}, @@ -3953,7 +4066,7 @@ ] } ], - "prompt_number": 89 + "prompt_number": 91 }, { "cell_type": "heading", @@ -3990,9 +4103,9 @@ " actual_value = testing_instances[i][class_index]\n", " if prediction == actual_value:\n", " num_correct += 1\n", - " return float(num_correct) / len(testing_instances)\n", + " return num_correct / len(testing_instances)\n", "\n", - "print classification_accuracy(tree, testing_instances)" + "print(classification_accuracy(tree, testing_instances))" ], "language": "python", "metadata": {}, @@ -4005,7 +4118,7 @@ ] } ], - "prompt_number": 90 + "prompt_number": 92 }, { "cell_type": "markdown", @@ -4026,13 +4139,13 @@ { "metadata": {}, "output_type": "pyout", - "prompt_number": 91, + "prompt_number": 93, "text": [ "[(0, 'a'), (1, 'b'), (2, 'c')]" ] } ], - "prompt_number": 91 + "prompt_number": 93 }, { "cell_type": "markdown", @@ -4054,9 +4167,9 @@ " predicted_labels = [classify(tree, instance, default_class) for instance in instances]\n", " actual_labels = [x[class_index] for x in instances]\n", " counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])\n", - " return counts[True], counts[False], float(counts[True]) / len(instances)\n", + " return counts[True], counts[False], counts[True] / len(instances)\n", 
"\n", - "print classification_accuracy(tree, testing_instances)" + "print(classification_accuracy(tree, testing_instances))" ], "language": "python", "metadata": {}, @@ -4069,7 +4182,7 @@ ] } ], - "prompt_number": 92 + "prompt_number": 94 }, { "cell_type": "markdown", @@ -4088,12 +4201,13 @@ "input": [ "def partition_instances(instances, num_partitions):\n", " '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''\n", - " return [[instances[j] for j in xrange(i, len(instances), num_partitions)] for i in xrange(num_partitions)]" + " return [[instances[j] for j in xrange(i, len(instances), num_partitions)] \\\n", + " for i in xrange(num_partitions)]" ], "language": "python", "metadata": {}, "outputs": [], - "prompt_number": 93 + "prompt_number": 95 }, { "cell_type": "markdown", @@ -4111,9 +4225,9 @@ "\n", "simplified_instances = [[j for j in xrange(i, instance_length + i)] for i in xrange(num_instances)]\n", "\n", - "print 'Instances:', simplified_instances\n", + "print('Instances:', simplified_instances)\n", "partitions = partition_instances(simplified_instances, 2)\n", - "print 'Partitions:', partitions" + "print('Partitions:', partitions)" ], "language": "python", "metadata": {}, @@ -4127,7 +4241,7 @@ ] } ], - "prompt_number": 94 + "prompt_number": 96 }, { "cell_type": "markdown", @@ -4158,9 +4272,9 @@ " new_instance.append(j)\n", " simplified_instances.append(new_instance)\n", "\n", - "print 'Instances:', simplified_instances\n", + "print('Instances:', simplified_instances)\n", "partitions = partition_instances(simplified_instances, 2)\n", - "print 'Partitions:', partitions" + "print('Partitions:', partitions)" ], "language": "python", "metadata": {}, @@ -4174,7 +4288,7 @@ ] } ], - "prompt_number": 95 + "prompt_number": 97 }, { "cell_type": "markdown", @@ -4188,7 +4302,7 @@ "collapsed": false, "input": [ "for i, x in enumerate(['a', 'b', 'c']):\n", - " print i, x" + " print(i, x)" ], "language": "python", 
"metadata": {}, @@ -4203,7 +4317,7 @@ ] } ], - "prompt_number": 96 + "prompt_number": 98 }, { "cell_type": "markdown", @@ -4217,9 +4331,9 @@ "collapsed": false, "input": [ "for i in xrange(5):\n", - " print '\\n# partitions:', i\n", + " print('\\n# partitions:', i)\n", " for j, partition in enumerate(partition_instances(simplified_instances, i)):\n", - " print 'partition {}: {}'.format(j, partition)" + " print('partition {}: {}'.format(j, partition))" ], "language": "python", "metadata": {}, @@ -4251,7 +4365,7 @@ ] } ], - "prompt_number": 97 + "prompt_number": 99 }, { "cell_type": "markdown", @@ -4265,7 +4379,7 @@ "collapsed": false, "input": [ "partitions = partition_instances(clean_instances, 10)\n", - "print [len(partition) for partition in partitions]" + "print([len(partition) for partition in partitions])" ], "language": "python", "metadata": {}, @@ -4278,7 +4392,7 @@ ] } ], - "prompt_number": 98 + "prompt_number": 100 }, { "cell_type": "markdown", @@ -4292,8 +4406,8 @@ "collapsed": false, "input": [ "for partition in partitions:\n", - " print len(partition), # note the comma at the end\n", - "print" + " print(len(partition), end=' ') # note the comma at the end\n", + "print()" ], "language": "python", "metadata": {}, @@ -4302,11 +4416,11 @@ "output_type": "stream", "stream": "stdout", "text": [ - "565 565 565 565 564 564 564 564 564 564\n" + "565 565 565 565 564 564 564 564 564 564 \n" ] } ], - "prompt_number": 99 + "prompt_number": 101 }, { "cell_type": "markdown", @@ -4320,15 +4434,17 @@ "collapsed": false, "input": [ "tree0 = create_decision_tree(partitions[0])\n", - "print 'Tree trained with {} instances:'.format(len(partitions[0]))\n", + "print('Tree trained with {} instances:'.format(len(partitions[0])))\n", "pprint(tree0)\n", + "print()\n", "\n", "tree1 = create_decision_tree(partitions[0] + partitions[1])\n", - "print '\\nTree trained with {} instances:'.format(len(partitions[0] + partitions[1]))\n", + "print('Tree trained with {} 
instances:'.format(len(partitions[0] + partitions[1])))\n", "pprint(tree1)\n", + "print()\n", "\n", "tree = create_decision_tree(clean_instances)\n", - "print '\\nTree trained with {} instances:'.format(len(clean_instances))\n", + "print('Tree trained with {} instances:'.format(len(clean_instances)))\n", "pprint(tree)" ], "language": "python", @@ -4380,7 +4496,7 @@ ] } ], - "prompt_number": 100 + "prompt_number": 102 }, { "cell_type": "markdown", @@ -4408,7 +4524,7 @@ "input": [ "x = [1, 2, 3]\n", "x.extend([4, 5])\n", - "print x" + "print(x)" ], "language": "python", "metadata": {}, @@ -4421,7 +4537,7 @@ ] } ], - "prompt_number": 101 + "prompt_number": 103 }, { "cell_type": "markdown", @@ -4458,7 +4574,7 @@ " return accuracy_list\n", "\n", "accuracy_list = compute_learning_curve(clean_instances)\n", - "print accuracy_list" + "print(accuracy_list)" ], "language": "python", "metadata": {}, @@ -4471,7 +4587,7 @@ ] } ], - "prompt_number": 102 + "prompt_number": 104 }, { "cell_type": "markdown", @@ -4485,7 +4601,7 @@ "collapsed": false, "input": [ "accuracy_list = compute_learning_curve(clean_instances, 100)\n", - "print accuracy_list[:10]" + "print(accuracy_list[:10])" ], "language": "python", "metadata": {}, @@ -4498,7 +4614,7 @@ ] } ], - "prompt_number": 103 + "prompt_number": 105 }, { "cell_type": "heading", @@ -4596,7 +4712,7 @@ " predicted_labels = [self.classify(instance, default_class) for instance in instances]\n", " actual_labels = [x[0] for x in instances]\n", " counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])\n", - " return counts[True], counts[False], float(counts[True]) / len(instances)\n", + " return counts[True], counts[False], counts[True] / len(instances)\n", " \n", " def pprint(self):\n", " pprint(self._tree)" @@ -4604,7 +4720,7 @@ "language": "python", "metadata": {}, "outputs": [], - "prompt_number": 104 + "prompt_number": 106 }, { "cell_type": "markdown", @@ -4619,13 +4735,13 @@ "input": [ "simple_decision_tree = 
SimpleDecisionTree(training_instances)\n", "simple_decision_tree.pprint()\n", - "print\n", + "print()\n", "for instance in testing_instances:\n", " predicted_label = simple_decision_tree.classify(instance)\n", " actual_label = instance[0]\n", - " print 'Model: {}; truth: {}'.format(predicted_label, actual_label)\n", - "print\n", - "print 'Classification accuracy:', simple_decision_tree.classification_accuracy(testing_instances)" + " print('Model: {}; truth: {}'.format(predicted_label, actual_label))\n", + "print()\n", + "print('Classification accuracy:', simple_decision_tree.classification_accuracy(testing_instances))" ], "language": "python", "metadata": {}, @@ -4670,7 +4786,7 @@ ] } ], - "prompt_number": 105 + "prompt_number": 107 }, { "cell_type": "heading", diff --git a/README.md b/README.md index e9b2d9d..2a8e261 100644 --- a/README.md +++ b/README.md @@ -25,4 +25,15 @@ There are several exercises included in the notebooks. Sample solutions to those There are also 2 data files, based on the [mushroom dataset](https://archive.ics.uci.edu/ml/datasets/Mushroom) in the UCI Machine Learning Repository, used for coding examples, exploratory data analysis and building and evaluating decision trees in Python: * [`agaricus-lepiota.data`](agaricus-lepiota.data): a machine-readable list of examples or instances of mushrooms, represented by a comma-separated list of attribute values -* [`agaricus-lepiota.attributes`](agaricus-lepiota.attributes): a machine-readable list of attribute names and possible attribute values and their abbreviations \ No newline at end of file +* [`agaricus-lepiota.attributes`](agaricus-lepiota.attributes): a machine-readable list of attribute names and possible attribute values and their abbreviations + +## Change Log + +2015-02-22 + +* Added `from __future__ import print_function, division` for Python 3 compatibility +* Updated `simple_ml.py` to also use Python 3 `print_function` and `division` +* Changed "call by reference" to "call by 
sharing" +* Added `isinstance()` (and reference to duck typing) to section on `type()` +* Added variable for `delimiter` rather than hard-coding `'|'` character +* Cleaned up various cells \ No newline at end of file diff --git a/simple_ml.py b/simple_ml.py index 9a6e26d..04cc51b 100644 --- a/simple_ml.py +++ b/simple_ml.py @@ -1,3 +1,5 @@ +from __future__ import print_function, division + ''' Utility functions to implement some simple Machine Learning tasks ''' @@ -12,6 +14,7 @@ import math import operator + from collections import defaultdict, Counter @@ -149,9 +152,10 @@ def attribute_value(instance, attribute, attribute_names): def print_attribute_names_and_values(instance, attribute_names): '''Prints the attribute names and values for instance''' - print 'Values for the', len(attribute_names), 'attributes:\n' + print('Values for the', len(attribute_names), 'attributes:', end='\n\n') for i in range(len(attribute_names)): - print attribute_names[i], '=', attribute_value(instance, attribute_names[i], attribute_names) + print(attribute_names[i], '=', + attribute_value(instance, attribute_names[i], attribute_names)) def attribute_value_counts(instances, attribute, attribute_names): @@ -167,13 +171,13 @@ def attribute_value_counts(instances, attribute, attribute_names): def print_all_attribute_value_counts(instances, attribute_names): '''Returns a list of defaultdicts containing the counts of occurrences of each value of each attribute in the list of instances. 
attribute_names is a list of names of attributes.''' - num_instances = len(instances) * 1.0 + num_instances = len(instances) for attribute in attribute_names: value_counts = attribute_value_counts(instances, attribute, attribute_names) - print '{}:'.format(attribute), + print('{}:'.format(attribute), end=' ') for value, count in sorted(value_counts.iteritems(), key=operator.itemgetter(1), reverse=True): - print '{} = {} ({:5.3f}),'.format(value, count, count / num_instances), - print + print('{} = {} ({:5.3f}),'.format(value, count, count / num_instances), end=' ') + print() def entropy(instances, class_index=0, attribute_name=None, value_name=None): @@ -188,18 +192,18 @@ def entropy(instances, class_index=0, attribute_name=None, value_name=None): if num_values <= 1: return 0 attribute_entropy = 0.0 - n = float(num_instances) if attribute_name: - print 'entropy({}{}) = '.format(attribute_name, '={}'.format(value_name) if value_name else '') + print('entropy({}{}) = '.format(attribute_name, + '={}'.format(value_name) if value_name else '')) for value in value_counts: - value_probability = value_counts[value] / n + value_probability = value_counts[value] / num_instances child_entropy = value_probability * math.log(value_probability, num_values) attribute_entropy -= child_entropy if attribute_name: - print ' - p({0}) x log(p({0}), {1}) = - {2:5.3f} x log({2:5.3f}) = {3:5.3f}'.format( - value, num_values, value_probability, child_entropy) + print(' - p({0}) x log(p({0}), {1}) = - {2:5.3f} x log({2:5.3f}) = {3:5.3f}'.format( + value, num_values, value_probability, child_entropy)) if attribute_name: - print ' = {:5.3f}'.format(attribute_entropy) + print(' = {:5.3f}'.format(attribute_entropy)) return attribute_entropy @@ -210,10 +214,11 @@ def information_gain(instances, parent_index, class_index=0, attribute_name=Fals for instance in instances: child_instances[instance[parent_index]].append(instance) children_entropy = 0.0 - n = float(len(instances)) + num_instances = 
len(instances) for child_value in child_instances: - child_probability = len(child_instances[child_value]) / n - children_entropy += child_probability * entropy(child_instances[child_value], class_index, attribute_name, child_value) + child_probability = len(child_instances[child_value]) / num_instances + children_entropy += child_probability * entropy( + child_instances[child_value], class_index, attribute_name, child_value) return parent_entropy - children_entropy @@ -283,14 +288,15 @@ def create_decision_tree(instances, candidate_attribute_indexes=None, class_inde # If the dataset is empty or the candidate attributes list is empty, return the default value if not instances or not candidate_attribute_indexes: if trace: - print '{}Using default class {}'.format('< ' * trace, default_class) + print('{}Using default class {}'.format('< ' * trace, default_class)) return default_class # If all the instances have the same class label, return that class label elif len(class_labels_and_counts) == 1: class_label = class_labels_and_counts.most_common(1)[0][0] if trace: - print '{}All {} instances have label {}'.format('< ' * trace, len(instances), class_label) + print('{}All {} instances have label {}'.format('< ' * trace, + len(instances), class_label)) return class_label else: default_class = simple_ml.majority_value(instances, class_index) @@ -298,7 +304,7 @@ def create_decision_tree(instances, candidate_attribute_indexes=None, class_inde # Choose the next best attribute index to best classify the instances best_index = simple_ml.choose_best_attribute_index(instances, candidate_attribute_indexes, class_index) if trace: - print '{}Creating tree node for attribute index {}'.format('> ' * trace, best_index) + print('{}Creating tree node for attribute index {}'.format('> ' * trace, best_index)) # Create a new decision tree node with the best attribute index and an empty dictionary object (for now) tree = {best_index:{}} @@ -310,13 +316,13 @@ def 
create_decision_tree(instances, candidate_attribute_indexes=None, class_inde remaining_candidate_attribute_indexes = [i for i in candidate_attribute_indexes if i != best_index] for attribute_value in partitions: if trace: - print '{}Creating subtree for value {} ({}, {}, {}, {})'.format( + print('{}Creating subtree for value {} ({}, {}, {}, {})'.format( '> ' * trace, attribute_value, len(partitions[attribute_value]), len(remaining_candidate_attribute_indexes), class_index, - default_class) + default_class)) # Create a subtree for each value of the the best attribute subtree = create_decision_tree( @@ -355,7 +361,7 @@ def classification_accuracy(tree, testing_instances, class_index=0): actual_value = testing_instances[i][class_index] if prediction == actual_value: num_correct += 1 - return float(num_correct) / len(testing_instances) + return num_correct / len(testing_instances) def compute_learning_curve(instances, num_partitions=10): From 2dc36233917ee28315fd45c62323a2e94e828164 Mon Sep 17 00:00:00 2001 From: Joe McCarthy Date: Mon, 23 Feb 2015 15:18:41 -0800 Subject: [PATCH 02/16] Re-ran nbconvert on main notebook --- Python_for_Data_Science_all.html | 6673 ++++++++++++++++++------------ 1 file changed, 3975 insertions(+), 2698 deletions(-) diff --git a/Python_for_Data_Science_all.html b/Python_for_Data_Science_all.html index 3ccb02d..d2c3937 100644 --- a/Python_for_Data_Science_all.html +++ b/Python_for_Data_Science_all.html @@ -1,1586 +1,1553 @@ - -[] + + +Python_for_Data_Science_all + + + + - - + + + + + + + +
+
+
+
+
+

Python for Data Science

+
+
+
+
+
+

Joe McCarthy, Director, Analytics & Data Science, Atigeo, LLC

-
-
+
+
+
+
In [1]:
-
+
+
from IPython.display import display, Image, HTML
 
+
+
+
+
+

1. Introduction

+
+
+
+
+
+

python-logo-master-v3-TM.png This short primer on Python is designed to provide a rapid "on-ramp" to enable computer programmers who are already familiar with concepts and constructs in other programming languages to learn enough about Python to facilitate the effective use of open-source and proprietary Python-based machine learning and data science tools.

nltk_book_cover.gif The primer is motivated, in part, by the approach taken in the Natural Language Toolkit (NLTK) book, which provides a rapid on-ramp for using Python and the open-source NLTK library to develop programs using natural language processing techniques (many of which involve machine learning).

@@ -1740,14 +1746,32 @@

1. Introduction +
+
+
+

+
+
+
+

Data Science and Data Mining

+
+
+
+
+
+

DataScienceForBusiness_cover.jpg Foster Provost and Tom Fawcett offer succinct descriptions of data science and data mining in Data Science for Business:

@@ -1755,10 +1779,22 @@

Data Science and Data Mining +
+
+
+

+
+
+
+

Provost & Fawcett also offer some history and insights into the relationship between data mining and machine learning, terms which are often used somewhat interchangeably:

@@ -1767,10 +1803,22 @@

Knowledge Discove

Historically, KDD spun off from Machine Learning as a research field focused on concerns raised by examining real-world applications, and a decade and a half later the KDD community remains more concerned with applications than Machine Learning is. As such, research focused on commercial applications and business issues of data analysis tends to gravitate toward the KDD community rather than to Machine Learning. KDD also tends to be more concerned with the entire process of data analytics: data preparation, model learning, evaluation, and so on.

+
+
+
+
+
+

Cross Industry Standard Process for Data Mining (CRISP-DM)

+
+
+
+
+
+

The Cross Industry Standard Process for Data Mining introduced a process model for data mining in 2000 that has become widely adopted.

CRISP-DM_Process_Diagram

@@ -1784,14 +1832,32 @@

Cross Indust

We will be focusing primarily on using Python for data preparation and modeling.

+
+
+
+
+
+

Data Science Workflow

+
+
+
+
+
+

Philip Guo presents a Data Science Workflow offering a slightly different process model emphasizing the importance of reflection and some of the meta-data, data management and bookkeeping challenges that typically arise in the data science process. His 2012 PhD thesis, Software Tools to Facilitate Research Programming, offers an insightful and more comprehensive description of many of these challenges.

pguo-data-science-overview.jpg

+
+
+ + +
+
+
+

The Natural Language Toolkit (NLTK) book provides a diagram and succinct description (below, with italics and bold added for emphasis) of supervised classification:

nltk_ch06_supervised-classification.png

@@ -1822,10 +1900,22 @@

Supervised Classification +

+
+
+
+
+
+
+ +
+
+
+

The Center for Machine Learning and Intelligent Systems at the University of California, Irvine (UCI), hosts a Machine Learning Repository containing over 200 publicly available data sets.

mushroom We will use the mushroom data set, which forms the basis of several examples in Chapter 3 of the Provost & Fawcett data science book.

@@ -1907,89 +2009,166 @@

Data Mining Example: UCI Mush

Building a model with this data set will serve as a motivating example throughout much of this primer.

+
+
+
+
+
+

3. Python: Basic Concepts

+
+
+
+
+
+
-

Identifiers, strings, lists and tuples

+

A note on Python 2 vs. Python 3

+
+
+
+
+
+
-

The sample instance shown above can be represented as a string. A Python string (str) is a sequence of 0 or more characters enclosed within a pair of single quotes (') or a pair double quotes (").

+

There are 2 major versions of Python in widespread use: Python 2 and Python 3. Python 3 has some features that are not backward compatible with Python 2, and some Python 2 libraries have not been updated to work with Python 3. I have been using Python 2, primarily because I use some of those Python 2 libraries, but an increasing proportion of them are migrating to Python 3, and I anticipate shifting to Python 3 in the near future.

+

For more on the topic, I recommend a very well documented IPython Notebook, which includes numerous helpful examples and links, by Sebastian Raschka, Key differences between Python 2.7.x and Python 3.x ... or googling Python 2 vs 3.

+

I received an email request from a Python 3 programmer who suggested that a relatively minor change in this notebook would enable it to run with Python 2 or Python 3: importing the print_function from __future__, and changing my print statements (Python 2) to print function calls (Python 3). Although a relatively minor conceptual change, it necessitates the changing of many cells to reflect the Python 3 print syntax.

+

I find the arguments for making print a function rather than a statement compelling - especially as it is more consistent with printing functionality in many other programming languages - and so while I do not want to convert this notebook to a Python 3 notebook, I have implemented this change so that it can be used in either Python 2 or Python 3. However, while I have verified that it still works in Python 2, I have not tested it in Python 3.
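Because the notebook has not been tested under Python 3, it is worth noting two calls that appear elsewhere in this patch and do not exist in Python 3: `xrange` (used in `partition_instances`) and `dict.iteritems()` (used in `simple_ml.py`). The following is a minimal sketch of one way to keep such code running under both versions. The `xrange` shim and the sample `value_counts` dictionary are assumptions added for illustration and are not part of the original notebook.

```python
from __future__ import print_function, division
import operator

# Hypothetical compatibility shim (not in the original notebook):
# Python 2 has a built-in xrange; Python 3 does not, but its range is already lazy.
try:
    xrange
except NameError:
    xrange = range

def partition_instances(instances, num_partitions):
    '''Returns a list of relatively equally sized disjoint sublists of instances
    (same logic as the notebook's partition_instances)'''
    return [[instances[j] for j in xrange(i, len(instances), num_partitions)]
            for i in xrange(num_partitions)]

partitions = partition_instances(list(range(10)), 3)
print(partitions)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]

# dict.items() exists in both Python 2 and Python 3, unlike dict.iteritems()
value_counts = {'a': 3, 'b': 1, 'c': 2}  # illustrative data only
sorted_counts = sorted(value_counts.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_counts)  # [('a', 3), ('c', 2), ('b', 1)]
```

In Python 2, `items()` builds a full list rather than an iterator, which is unlikely to matter for dictionaries as small as the attribute value counts used here.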

+

I also find the arguments for changing the division operator compelling, so I will import that as well. Without this import in Python 2, 1 / 2 returns 0 (the integer portion of the quotient); with this import, 1 / 2 returns 0.5, and if you want only the integer portion of the quotient (floor division), you can use 1 // 2 (which works the same way in Python 2 and Python 3).

+

Note that if you don't understand some/any of the above discussion about Python 2 and Python 3, it should not affect your ability to understand the rest of this notebook.
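The division behavior described above can be checked with a short sketch. The variable names are chosen for illustration only; with the `__future__` import in place, the results should be the same under Python 2 and Python 3.

```python
from __future__ import print_function, division

true_quotient = 1 / 2       # true division, enabled in Python 2 by the import above
floor_quotient = 1 // 2     # floor division: only the integer portion of the quotient
negative_floor = -7 // 2    # floor division rounds toward negative infinity

print(true_quotient)   # 0.5
print(floor_quotient)  # 0
print(negative_floor)  # -4
```

The last line shows that `//` floors the result rather than truncating toward zero, which matters when operands can be negative.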

+
-
-
+
+
+
In [2]:
-
+
+
+
from __future__ import print_function, division
+
+ +
+
+
+ +
+
+
+
+
+
+

Identifiers, strings, lists and tuples

+
+
+
+ +
+
+
+
+
+

The sample instance of a mushroom shown above can be represented as a string. A Python string (str) is a sequence of 0 or more characters enclosed within a pair of single quotes (') or a pair of double quotes (").

+
+
+
+
+
+
+In [3]: +
+
+
'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'
 
+
-
-
+
+
-
- Out[2]:
-
+
+ Out[3]:
+
 'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'
 
-
+
+
+
+
+

Python identifiers (or names) are composed of letters, numbers and/or underscores ('_'), starting with a letter or underscore. Python identifiers are case sensitive. Although camelCase identifiers can be used, it is generally considered more pythonic to use underscores. Python variables and functions typically start with lowercase letters; Python classes start with uppercase letters.

The following assignment statement binds the value of the string shown above to the name single_instance_str.

-
-
+
+
+
+
-In [3]: +In [4]:
-
+
+
single_instance_str = 'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'
 
+
+
+
+
+
-

The print statement writes the value of its comma-delimited arguments to sys.stdout (typically the console). Each value in the output is separated by a single blank space. If the last argument is followed by a comma, the output cursor will stay on the same line.

+

The print function writes the value of its comma-delimited arguments to sys.stdout (typically the console). Each value in the output is separated by a single blank space. If an end argument that does not include \n (newline character) is supplied, the output cursor will not move to the next line.

+
+
-
-
+
+
-In [4]: +In [5]:
-
-
print 'Instance 1:', single_instance_str
-print 'A', 'B', # note comma at the end
-print 'C' # will appear on same line
+
+
+
print('Instance 1:', single_instance_str)
+print('A', 'B', end=' ') # use a space rather than newline at the end of the line
+print('C') # will appear on same line
 
+
-
-
+
+
-
-
+
+
 Instance 1: p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d
 A B C
@@ -2002,32 +2181,40 @@ 

Identifiers, strings, lists and

+
+
+
+

The Python comment character is '#': anything after '#' on the line is ignored by the Python interpreter.

Pairs of triple quotes (''' or """) can be used to delimit multi-line comments.

-
-
+
+
+
+
-In [5]: +In [6]:
-
+
+
'''
 A multi-line
 comment
 '''
-print 'no comment'
+print('no comment')
 
+
-
-
+
+
-
-
+
+
 no comment
 
@@ -2039,28 +2226,36 @@ 

Identifiers, strings, lists and

+
+
+
+

A list is an ordered sequence of 0 or more comma-delimited elements enclosed within square brackets ('[', ']'). The Python str.split(sep) method can be used to split a sep-delimited string into a corresponding list of elements.

-
-
+
+
+
+
-In [6]: +In [7]:
-
+
+
single_instance_list = single_instance_str.split(',')
-print single_instance_list
+print(single_instance_list)
 
+
-
-
+
+
-
-
+
+
 ['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']
 
@@ -2072,28 +2267,36 @@ 

Identifiers, strings, lists and

+
+
+
+

Python lists are heterogeneous, i.e., they can contain elements of different types.

-
-
+
+
+
+
-In [7]: +In [8]:
-
+
+
mixed_list = ['a', 1, 2.3, True, [1, 'b']]
-print mixed_list
+print(mixed_list)
 
+
-
-
+
+
-
-
+
+
 ['a', 1, 2.3, True, [1, 'b']]
 
@@ -2105,28 +2308,36 @@ 

Identifiers, strings, lists and

+
+
+
+

The Python + operator can be used to concatenate lists.

-
-
+
+
+
+
-In [8]: +In [9]:
-
+
+
concatenated_list = ['a', 1] + [2.3, True] + [[1, 'b']]
-print concatenated_list
+print(concatenated_list)
 
+
-
-
+
+
-
-
+
+
 ['a', 1, 2.3, True, [1, 'b']]
 
@@ -2138,27 +2349,35 @@ 

Identifiers, strings, lists and

+
+
+
+

Individual elements of sequences (lists, strings and other data structures) can be accessed by specifying their zero-based index position within square brackets ('[', ']').

-
-
+
+
+
+
-In [9]: +In [10]:
-
-
print single_instance_str[2], single_instance_list[2]
+
+
+
print(single_instance_str[2], single_instance_list[2])
 
+
-
-
+
+
-
-
+
+
 k f
 
@@ -2170,27 +2389,35 @@ 

Identifiers, strings, lists and

+
+
+
+

Negative index values can be used to specify a position offset from the end of the sequence. It is often useful to use a -1 index value to access the last element of a sequence.

-
-
+
+
+
+
-In [10]: +In [11]:
-
-
print single_instance_str[-1], single_instance_list[-1]
+
+
+
print(single_instance_str[-1], single_instance_list[-1])
 
+
-
-
+
+
-
-
+
+
 d d
 
@@ -2202,28 +2429,36 @@ 

Identifiers, strings, lists and

+
+
+
+

The Python slice notation can be used to access subsequences by specifying two index positions separated by a colon (':'); seq[start:stop] returns all the elements in seq between start and stop - 1 (inclusive).

-
-
+
+
+
+
-In [11]: +In [12]:
-
-
print single_instance_str[2:4]
-print single_instance_list[2:4]
+
+
+
print(single_instance_str[2:4])
+print(single_instance_list[2:4])
 
+
-
-
+
+
-
-
+
+
 k,
 ['f', 'n']
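As a small additional sketch (the `letters` list is made-up example data, not part of the mushroom dataset), the length of `seq[start:stop]` is `stop - start` when both positions fall inside the sequence, and a stop position beyond the end is truncated rather than treated as an error:

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']  # hypothetical example data

start, stop = 1, 4
middle = letters[start:stop]  # elements at positions 1, 2 and 3
tail = letters[4:100]         # a stop beyond the end is silently truncated

print(middle)
print(len(middle) == stop - start)
print(tail)
```

Because out-of-range slice positions are truncated, slicing is more forgiving than single-element indexing, which raises an IndexError for a position outside the sequence.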
@@ -2236,28 +2471,36 @@ 

Identifiers, strings, lists and

+
+
+
+

Slice indices can be negative values.

-
-
+
+
+
+
-In [12]: +In [13]:
-
-
print single_instance_str[-4:-2]
-print single_instance_list[-4:-2]
+
+
+
print(single_instance_str[-4:-2])
+print(single_instance_list[-4:-2])
 
+
-
-
+
+
-
-
+
+
 ,v
 ['e', 'w']
@@ -2270,30 +2513,38 @@ 

Identifiers, strings, lists and

+
+
+
+

The start and/or stop index can be omitted. A common use of slices with a single index value is to access all but the first element or all but the last element of a sequence.

-
-
+
+
+
+
-In [13]: +In [14]:
-
-
print single_instance_str[:-1] # all but the last
-print single_instance_list[:-1]
-print single_instance_str[1:] # all but the first
-print single_instance_list[1:]
+
+
+
print(single_instance_str[:-1]) # all but the last
+print(single_instance_list[:-1])
+print(single_instance_str[1:]) # all but the first
+print(single_instance_list[1:])
 
+
-
-
+
+
-
-
+
+
 p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,
 ['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v']
@@ -2308,30 +2559,38 @@ 

Identifiers, strings, lists and

+
+
+
+

Slice notation includes an optional third element, step, as in seq[start:stop:step], that specifies the steps or increments by which elements are retrieved from seq between start and stop - 1:

-
-
+
+
+
+
-In [14]: +In [15]:
-
-
print single_instance_str
-print single_instance_str[::2] # print elements in even-numbered positions (the values, in this case)
-print single_instance_str[1::2] # print elements in odd-numbered positions (the commas, in this case)
-print single_instance_str[::-1] # reverse the string
+
+
+
print(single_instance_str)
+print(single_instance_str[::2]) # print elements in even-numbered positions (the values, in this case)
+print(single_instance_str[1::2]) # print elements in odd-numbered positions (the commas, in this case)
+print(single_instance_str[::-1]) # reverse the string
 
+
-
-
+
+
-
-
+
+
 p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d
 pkfnfnfcnwe?kywnpwoewvd
@@ -2346,6 +2605,10 @@ 

Identifiers, strings, lists and

+
+
+
+

The Python tutorial offers a helpful ASCII art representation to show how positive and negative indexes are interpreted:

@@ -2356,15 +2619,24 @@ 

Identifiers, strings, lists and -5 -4 -3 -2 -1

+
+
+
+
+
+

Python statements are typically separated by newlines (rather than, say, the semi-colon in Java). Statements can extend over more than one line; it is generally best to break the lines after commas within parentheses, braces or brackets. Inserting a backslash character ('\') at the end of a line will also enable continuation of the statement on the next line, but it is generally best to look for other alternatives.

-
-
+
+
+
+
-In [15]: +In [16]:
-
+
+
attribute_names = ['class', 
                    'cap-shape', 'cap-surface', 'cap-color', 
                    'bruises?', 
@@ -2378,18 +2650,19 @@ 

Identifiers, strings, lists and
                    'spore-print-color', 'population', 'habitat']
-print attribute_names
+print(attribute_names)

+
-
-
+
+
-
-
+
+
 ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']
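The two ways of continuing a statement across lines can be compared in a minimal sketch (the arithmetic here is only an illustration):

```python
# implicit continuation: the open parenthesis keeps the statement going
total = (1 + 2 +
         3 + 4)

# explicit continuation: the backslash must be the very last character on the line
same_total = 1 + 2 + \
             3 + 4

print(total == same_total)
```

The parenthesized form is usually less fragile, since a stray space after a backslash produces a syntax error.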
 
@@ -2401,30 +2674,39 @@ 

Identifiers, strings, lists and

+
+
+
+

The str.strip([chars]) method returns a copy of str in which any leading or trailing chars are removed. If no chars are specified, it removes all leading and trailing whitespace. [Whitespace is any sequence of spaces, tabs ('\t') and/or newline ('\n') characters.]

+

Note that print() inserts a blank space between the items in a comma-delimited list, so the last asterisk is printed after a leading blank space on the new line.

+
-
-
+
+
+
-In [16]: +In [17]:
-
-
print '*', '\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n', '*'
+
+
+
print('*', '\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n', '*')
 
+
-
-
+
+
-
-
+
+
 * 	p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d
-*
+ *
 
 
@@ -2434,24 +2716,26 @@

Identifiers, strings, lists and

-
-
+
+
-In [17]: +In [18]:
-
-
print '*', '\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n'.strip(), '*'
+
+
+
print('*', '\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n'.strip(), '*')
 
+
-
-
+
+
-
-
+
+
 * p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d *
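A short sketch of the strip family may help here; the strings are invented for illustration. It shows that only the ends of a string are affected, and that a chars argument changes which characters are removed:

```python
padded = '\t a b \n'
starred = '**a*b**'

print(repr(padded.strip()))      # whitespace removed from both ends only
print(repr(padded.lstrip()))     # leading whitespace only
print(repr(padded.rstrip()))     # trailing whitespace only
print(repr(starred.strip('*')))  # interior '*' characters are kept
```

Using repr() makes the remaining whitespace visible in the printed output.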
 
@@ -2463,6 +2747,10 @@ 

Identifiers, strings, lists and

+
+
+
+

A common programming pattern when dealing with CSV (comma-separated values) data files is to repeatedly

    @@ -2472,26 +2760,30 @@

    Identifiers, strings, lists and

We will get to repetition control structures (loops) and file input and output shortly, but here is an example of how str.strip() and str.split() can be chained together in a single instruction:

-
-
+
+
+
+
-In [18]: +In [19]:
-
+
+
single_instance_str = '\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n'
 single_instance_list = single_instance_str.strip().split(',') # first strip leading & trailing whitespace, then split on commas
-print single_instance_list
+print(single_instance_list)
 
+
-
-
+
+
-
-
+
+
 ['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']
 
@@ -2503,27 +2795,35 @@ 

Identifiers, strings, lists and

+
+
+
+

The str.join(words) method is the inverse of str.split(), returning a single string in which each string in the sequence of words is separated by str.

-
-
+
+
+
+
-In [19]: +In [20]:
-
-
print '*', ','.join(single_instance_list), '*'
+
+
+
print('*', ','.join(single_instance_list), '*')
 
+
-
-
+
+
-
-
+
+
 * p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d *
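To illustrate the inverse relationship, here is a minimal round-trip sketch using a made-up line of data. It also shows one way to handle a sequence of non-strings, since join() accepts only strings:

```python
raw_line = '  e,x,s,y\n'  # hypothetical line from a CSV file

fields = raw_line.strip().split(',')  # string -> list of strings
rebuilt = ','.join(fields)            # list of strings -> string

numbers = [1, 2, 3]
numbers_joined = ','.join(map(str, numbers))  # convert each number to a string first

print(fields)
print(rebuilt == raw_line.strip())
print(numbers_joined)
```

Calling ','.join(numbers) directly would raise a TypeError, because the items are integers rather than strings.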
 
@@ -2535,28 +2835,36 @@ 

Identifiers, strings, lists and

+
+
+
+

A number of Python functions, operators and methods can be used on strings, lists and other sequences.

The len(s) function can be used to find the length of (number of items in) a sequence s. It will also return the number of items in a dictionary, a data structure we will cover further below.

-
-
+
+
+
+
-In [20]: +In [21]:
-
-
print len(single_instance_str), len(single_instance_list)
+
+
+
print(len(single_instance_str), len(single_instance_list))
 
+
-
-
+
+
-
-
+
+
 47 23
 
@@ -2568,28 +2876,36 @@ 

Identifiers, strings, lists and

+
+
+
+

The in operator can be used to determine whether a sequence contains a value.

Boolean values in Python are True and False (note the capitalization).

-
-
+
+
+
+
-In [21]: +In [22]:
-
-
print ',' in single_instance_str, ',' in single_instance_list
+
+
+
print(',' in single_instance_str, ',' in single_instance_list)
 
+
-
-
+
+
-
-
+
+
 True False
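The difference between the two results above comes from how in treats each type: for a string it tests for a substring, while for a list it tests whether any element is equal to the value. A small sketch with invented data:

```python
text = 'p,k,f'           # a string: `in` tests for a substring
items = text.split(',')  # a list: `in` tests for an equal element

print(',' in text)     # the comma is a character in the string
print(',' in items)    # no element of the list equals ','
print('k,f' in text)   # substrings longer than one character also match
print('k,f' in items)  # but the list has no element equal to 'k,f'
```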
 
@@ -2601,27 +2917,35 @@ 

Identifiers, strings, lists and

+
+
+
+

The s.count(x) method can be used to count the number of occurrences of item x in sequence s.

-
-
+
+
+
+
-In [22]: +In [23]:
-
-
print single_instance_str.count(','), single_instance_list.count('f')
+
+
+
print(single_instance_str.count(','), single_instance_list.count('f'))
 
+
-
-
+
+
-
-
+
+
 22 3
 
@@ -2633,27 +2957,35 @@ 

Identifiers, strings, lists and

+
+
+
+

The s.index(x) method can be used to find the first 0-based index of item x in sequence s.

-
-
+
+
+
+
-In [23]: +In [24]:
-
-
print single_instance_str.index(','), single_instance_list.index('f')
+
+
+
print(single_instance_str.index(','), single_instance_list.index('f'))
 
+
-
-
+
+
-
-
+
+
 2 2
 
@@ -2665,41 +2997,49 @@ 

Identifiers, strings, lists and

+
+
+
+

One important distinction between strings and lists has to do with their mutability.

Python strings are immutable, i.e., they cannot be modified. Most string methods (like str.strip()) return modified copies of the strings on which they are used.

Python lists are mutable, i.e., they can be modified.

The examples below illustrate a number of list methods that modify lists.

-
-
+
+
+
+
-In [24]: +In [25]:
-
+
+
list_1 = [4, 2, 3, 5, 1]
 list_2 = list_1 # list_2 now references the same object as list_1
-print 'list_1:          ', list_1
-print 'list_2:          ', list_2
+print('list_1:          ', list_1)
+print('list_2:          ', list_2)
 list_1.remove(1)
-print 'list_1.remove(1):', list_1
+print('list_1.remove(1):', list_1)
 list_1.append(6)
-print 'list_1.append(6):', list_1
+print('list_1.append(6):', list_1)
 list_1.sort()
-print 'list_1.sort():   ', list_1
+print('list_1.sort():   ', list_1)
 list_1.reverse()
-print 'list_1.reverse():', list_1
+print('list_1.reverse():', list_1)
 
+
-
-
+
+
-
-
+
+
 list_1:           [4, 2, 3, 5, 1]
 list_2:           [4, 2, 3, 5, 1]
@@ -2716,28 +3056,36 @@ 

Identifiers, strings, lists and

+
+
+
+

When more than one name (e.g., a variable) is bound to the same mutable object, changes made to that object are reflected in all names bound to that object. For example, in the second statement above, list_2 is bound to the same object that is bound to list_1, namely, the list [4, 2, 3, 5, 1]. All changes made to the object bound to list_1 will thus be reflected in list_2 (since they both reference the same object).

-
-
+
+
+
+
-In [25]: +In [26]:
-
-
print 'list_1:          ', list_1
-print 'list_2:          ', list_2
+
+
+
print('list_1:          ', list_1)
+print('list_2:          ', list_2)
 
+
-
-
+
+
-
-
+
+
 list_1:           [6, 5, 4, 3, 2]
 list_2:           [6, 5, 4, 3, 2]
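When an independent list is wanted, one option is to build a copy before modifying the original. The following sketch uses hypothetical names and the is operator to show which names share an object:

```python
original = [4, 2, 3, 5, 1]       # hypothetical example list
alias = original                 # a second name for the same list object
shallow_copy = list(original)    # a new list containing the same elements

original.sort()

print(alias is original)         # same object, so the alias sees the sort
print(shallow_copy is original)  # different object, unaffected by the sort
print(alias)
print(shallow_copy)
```

The copy made by list() is shallow: nested mutable elements, if any, would still be shared between the two lists.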
@@ -2750,30 +3098,38 @@ 

Identifiers, strings, lists and

+
+
+
+

There are sorting and reversing functions, sorted() and reversed(), that do not modify their arguments, and can thus be used on mutable or immutable objects. We will elaborate on each of these functions further below, but here are a couple of examples of how sorted() returns a sorted list of each element in its argument.

-
-
+
+
+
+
-In [26]: +In [27]:
-
-
print 'sorted(list_1):', sorted(list_1) # return a copy of list_1 in sorted order
-print 'list_1:        ', list_1
-print 'sorted(single_instance_str):', sorted(single_instance_str) # returns a list of sorted elements in the string
-print 'single_instance_str:        ', single_instance_str
+
+
+
print('sorted(list_1):', sorted(list_1)) # return a copy of list_1 in sorted order
+print('list_1:        ', list_1)
+print('sorted(single_instance_str):', sorted(single_instance_str)) # returns a list of sorted elements in the string
+print('single_instance_str:        ', single_instance_str)
 
+
-
-
+
+
-
-
+
+
 sorted(list_1): [2, 3, 4, 5, 6]
 list_1:         [6, 5, 4, 3, 2]
@@ -2789,29 +3145,37 @@ 

Identifiers, strings, lists and

+
+
+
+

A tuple is an ordered, immutable sequence of 0 or more comma-delimited values enclosed in parentheses ('(', ')'). Many of the functions that operate on strings and lists also operate on tuples.

-
-
+
+
+
+
-In [27]: +In [28]:
-
+
+
x = (1, 2, 3, 4, 5) # a tuple
-print 'x =', x, ', len(x) =', len(x), ', x.index(3) =', x.index(3), ', x[4:2:-1] = ', x[4:2:-1]
-print 'sorted(x, reverse=True):', sorted(x, reverse=True) # sorted always returns a list; reverse=True specifies reverse sort order
+print('x =', x, ', len(x) =', len(x), ', x.index(3) =', x.index(3), ', x[4:2:-1] = ', x[4:2:-1])
+print('sorted(x, reverse=True):', sorted(x, reverse=True)) # sorted always returns a list; reverse=True specifies reverse sort order
 
+
-
-
+
+
-
-
+
+
 x = (1, 2, 3, 4, 5) , len(x) = 5 , x.index(3) = 2 , x[4:2:-1] =  (5, 4)
 sorted(x, reverse=True): [5, 4, 3, 2, 1]
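Two details of tuple syntax are easy to trip over, so here is a brief sketch with made-up values: a one-element tuple needs a trailing comma, and "changing" a tuple really means building a new one.

```python
point = (1, 2, 3)      # hypothetical tuple
longer = point + (4,)  # the trailing comma makes (4,) a one-element tuple
not_a_tuple = (4)      # without the comma, this is just the integer 4
empty = ()             # a tuple of 0 values

print(longer)
print(point)           # the original tuple is unchanged
print(type(not_a_tuple) is int)
print(len(empty))
```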
@@ -2824,32 +3188,40 @@ 

Identifiers, strings, lists and

+
+
+
+

If the s.index(x) or list.remove(x) method is used on a sequence s or list that does not contain the value x, a ValueError exception is raised.

-
-
+
+
+
+
-In [28]: +In [29]:
-
-
print x.index(6) # a ValueError will be raised
+
+
+
print(x.index(6)) # a ValueError will be raised
 
+
-
-
+
+
-
-
+
+
 ---------------------------------------------------------------------------
 ValueError                                Traceback (most recent call last)
-<ipython-input-28-67b920d0bd80> in <module>()
-----> 1 print x.index(6) # a ValueError will be raised
+<ipython-input-29-553ddc490596> in <module>()
+----> 1 print(x.index(6)) # a ValueError will be raised
 
 ValueError: tuple.index(x): x not in tuple
@@ -2859,40 +3231,54 @@

Identifiers, strings, lists and

+
+
+
+

Conditionals

+
+
+
+
+
+

One common approach to handling errors is to look before you leap (LBYL), i.e., test for potential exceptions before executing instructions that might raise those exceptions.

This approach can be implemented using the if statement (which may optionally include an else and any number of elif clauses).

The following is a simple example of an if statement:

-
-
+
+
+
+
-In [29]: +In [30]:
-
+
+
class_value = 'e' # try changing this to 'p' or 'x'
 
 if class_value == 'e':
-    print 'edible'
+    print('edible')
 elif class_value == 'p':
-    print 'poisonous'
+    print('poisonous')
 else:
-    print 'unknown'
+    print('unknown')
 
+
-
-
+
+
-
-
+
+
 edible
 
@@ -2904,6 +3290,10 @@ 

Conditionals&#

+
+
+
+
-
-
+
+
+
+
-In [30]: +In [31]:
-
+
+
attribute = 'bruises' # try substituting 'bruises?' for 'bruises' and re-running this code
 
 if attribute in attribute_names:
     i = attribute_names.index(attribute)
-    print attribute, 'is in position', i
+    print(attribute, 'is in position', i)
 else:
-    print attribute, 'is not in', attribute_names
+    print(attribute, 'is not in', attribute_names)
 
+
-
-
+
+
-
-
+
+
 bruises is not in ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']
 
@@ -2949,35 +3343,43 @@ 

Conditionals&#

+
+
+
+

Another perspective on handling errors championed by some pythonistas is that it is easier to ask forgiveness than permission (EAFP).

As in many practical applications of philosophy, religion or dogma, it is helpful to think before you choose (TBYC). There are a number of factors to consider in deciding whether to follow the EAFP or LBYL paradigm, including code readability and the anticipated likelihood and relative severity of encountering an exception. Oran Looney wrote a blog post providing a nice overview of the debate over LBYL vs. EAFP.

We will follow the LBYL paradigm throughout most of this primer. However, as an illustration of EAFP in Python, here is an alternate implementation of the functionality of the code above, using a try/except statement.

-
-
+
+
+
+
-In [31]: +In [32]:
-
+
+
attribute = 'bruises' # try substituting 'bruises?' for 'bruises' and re-running this code
 
 try:
     i = attribute_names.index(attribute)
-    print attribute, 'is in position', i
+    print(attribute, 'is in position', i)
 except ValueError:
-    print attribute, 'is not found'
+    print(attribute, 'is not found')
 
+
-
-
+
+
-
-
+
+
 bruises is not found
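The two styles can be placed side by side as functions. This is only a sketch: the function names and the choice to return None for a missing name are assumptions made for illustration, not part of the notebook or of simple_ml.

```python
def position_lbyl(names, name):
    '''Look before you leap: test membership first (hypothetical helper)'''
    if name in names:
        return names.index(name)
    return None

def position_eafp(names, name):
    '''Easier to ask forgiveness: attempt the lookup, handle the failure (hypothetical helper)'''
    try:
        return names.index(name)
    except ValueError:
        return None

names = ['class', 'cap-shape', 'bruises?']
print(position_lbyl(names, 'bruises?'))
print(position_eafp(names, 'bruises?'))
print(position_lbyl(names, 'bruises'))
print(position_eafp(names, 'bruises'))
```

Both versions give the same results; the LBYL version scans the list twice when the name is present (once for in, once for index()), while the EAFP version scans it once.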
 
@@ -2989,15 +3391,22 @@ 

Conditionals&#

+
+
+
+

The Python null object is None (note the capitalization).

-
-
+
+
+
+
-In [32]: +In [33]:
-
-
-
+
+
-
-
+ + +
+
+
+
-
-
+
+
+
+
-In [33]: +In [34]:
-
+
+
def attribute_value(instance, attribute, attribute_names):
     '''Returns the value of attribute in instance, based on the position of attribute in the list of attribute_names'''
     if attribute not in attribute_names:
@@ -3052,32 +3475,41 @@ 

Defining and calling functions
     return instance[i] # using the parameter name here

+
+
+
+
+

A function call starts with the function name, followed by a list of 0 or more comma-delimited arguments (aka 'actual parameters') enclosed within parentheses. A function call can be used as a statement or within an expression.

-
-
+
+
+
+
-In [34]: +In [35]:
-
+
+
attribute = 'cap-shape' # try substituting any of the other attribute names shown above
-print attribute, '=', attribute_value(single_instance_list, attribute, attribute_names)
+print(attribute, '=', attribute_value(single_instance_list, attribute, attribute_names))
 
+
-
-
+
+
-
-
+
+
 cap-shape = k
 
@@ -3089,35 +3521,45 @@ 

Defining and calling functions +
+
+
-
-
+
+
+
+
-In [35]: +In [36]:
-
+
+
x = 0
-print 'x used as a variable:', x, type(x)
+print('x used as a variable:', x, type(x))
+
 def x():
-    print 'x'
-print 'x used as a function:', x, type(x)
+    print('x')
+    
+print('x used as a function:', x, type(x))
 
+
-
-
+
+
-
-
+
+
 x used as a variable: 0 <type 'int'>
-x used as a function: <function x at 0x10671f140> <type 'function'>
+x used as a function: <function x at 0x104e44578> <type 'function'>
 
 
@@ -3127,42 +3569,115 @@

Defining and calling functions +
+
+
-
-
+
+
+
-In [36]: +In [37]:
-
-
def insert_x(list_parameter):
-    '''Inserts "x" at the head of a list, modifying the list argument'''
-    list_parameter.insert(0, 'x')
-    print 'Inserted x:', list_parameter
-    return list_parameter
-
-insert_x([1, 2, 3]) # passing an unnamed object does not affect any existing names
-list_argument = [1, 2, 3] # passing a named object will affect the object bound to that name
-print 'Before:', list_argument
-insert_x(list_argument)
-print 'After:', list_argument
+
+
+
import types
+
+x = 0
+print('Is x an int?', isinstance(x, int))
+print('Is x a function?', isinstance(x, types.FunctionType))
+
+def x():
+    print('x')
+    
+print('Is x an int?', isinstance(x, int))
+print('Is x a function?', isinstance(x, types.FunctionType))
 
+
-
-
+
+
-
-
+
+
-Inserted x: ['x', 1, 2, 3]
-Before: [1, 2, 3]
-Inserted x: ['x', 1, 2, 3]
-After: ['x', 1, 2, 3]
+Is x an int? True
+Is x a function? False
+Is x an int? False
+Is x a function? True
+
+
+
+
+ +
+
+ +
+
+
+
+
+
+

Another important feature of Python functions is that arguments are passed using call by sharing. If a mutable object is passed as an argument to a function parameter, assignment statements using that parameter do not affect the passed argument; mutations to the parameter, however, do affect the passed argument.

+

Not being aware of - or forgetting - this important distinction can lead to challenging debugging sessions.

+
+
+
+
+
+
+In [38]: +
+
+
+
def modify_parameters(parameter1, parameter2):
+    '''Inserts "x" at the head of parameter1, assigns "x" to parameter2'''
+    parameter1.insert(0, 'x')
+    print('parameter1, after inserting "x":', parameter1)
+    parameter2 = 'x'
+    print('parameter2, after assigning "x"', parameter2)
+    return
+
+argument1 = [1, 2, 3] # passing a named object will affect the object bound to that name
+argument2 = 4
+print('argument1, before calling modify_parameters:', argument1)
+print('argument2, before calling modify_parameters:', argument2)
+print()
+modify_parameters(argument1, argument2)
+print()
+print('argument1, after calling modify_parameters:', argument1)
+print('argument2, after calling modify_parameters:', argument2)
+
+ +
+
+
+ +
+
+ + +
+
+
+argument1, before calling modify_parameters: [1, 2, 3]
+argument2, before calling modify_parameters: 4
+
+parameter1, after inserting "x": ['x', 1, 2, 3]
+parameter2, after assigning "x" x
+
+argument1, after calling modify_parameters: ['x', 1, 2, 3]
+argument2, after calling modify_parameters: 4
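The same distinction can be seen with a single list passed to two small functions. The function names below are hypothetical, chosen only for this sketch:

```python
def rebind(values):
    '''Assigns a new list to the parameter; the caller's list is unchanged'''
    values = values + ['x']  # builds a new list and rebinds the local name only
    return values

def mutate(values):
    '''Appends in place; the caller's list changes'''
    values.append('x')

data = [1, 2, 3]
result = rebind(data)
after_rebind = list(data)  # snapshot of data after the rebinding call
mutate(data)

print(result)
print(after_rebind)
print(data)
```

Returning the new object, as rebind() does, is a common way to hand a rebinding result back to the caller.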
 
 
@@ -3172,41 +3687,47 @@

Defining and calling functions +
+
+
-
-
+
+
+
+
-In [37]: +In [39]:
-
-
def insert_x_copy(list_parameter):
-    '''Inserts "x" at the head of a list, without modifying the list argument'''
-    list_parameter_copy = list_parameter[:]
-    list_parameter_copy.insert(0, 'x')
-    print 'Inserted x:', list_parameter_copy
-    return list_parameter_copy
-
-insert_x_copy([1, 2, 3]) # passing an unnamed object does not affect any existing names
-list_argument = [1, 2, 3] # passing a named object will affect the object bound to that name
-print 'Before:', list_argument
-insert_x_copy(list_argument)
-print 'After:', list_argument
+
+
+
def modify_parameter_copy(parameter_1):
+    '''Inserts "x" at the head of parameter_1, without modifying the list argument'''
+    parameter_1_copy = parameter_1[:]
+    parameter_1_copy.insert(0, 'x')
+    print('Inserted x:', parameter_1_copy)
+    return
+
+argument_1 = [1, 2, 3] # passing a named object will not affect the object bound to that name
+print('Before:', argument_1)
+modify_parameter_copy(argument_1)
+print('After:', argument_1)
 
+
-
-
+
+
-
-
+
+
-Inserted x: ['x', 1, 2, 3]
 Before: [1, 2, 3]
 Inserted x: ['x', 1, 2, 3]
 After: [1, 2, 3]
@@ -3219,38 +3740,46 @@ 

Defining and calling functions +
+
+

Python functions can return more than one value, by separating those return values with commas in the return statement. Multiple values are returned as a tuple. If the function-invoking expression is an assignment statement, multiple variables can be assigned the multiple values returned by the function in a single statement. This combining of values and subsequent separation is known as tuple packing and unpacking.

-
-
+
+
+
+
-In [38]: +In [40]:
-
+
+
def min_and_max(list_of_values):
     '''Returns a tuple containing the min and max values in the list_of_values'''
     return min(list_of_values), max(list_of_values)
 
 list_1 = [3, 1, 4, 2, 5]
-print 'min and max of', list_1, ':', min_and_max(list_1)
+print('min and max of', list_1, ':', min_and_max(list_1))
 
 min_and_max_list_1 = min_and_max(list_1) # a single variable is assigned the two-element tuple
-print 'min and max of', list_1, ':', min_and_max_list_1
+print('min and max of', list_1, ':', min_and_max_list_1)
 
 min_list_1, max_list_1 = min_and_max(list_1) # the 1st variable is assigned the 1st value, the 2nd variable is assigned the 2nd value
-print 'min and max of', list_1, ':', min_list_1, ',', max_list_1
+print('min and max of', list_1, ':', min_list_1, ',', max_list_1)
 
+
-
-
+
+
-
-
+
+
 min and max of [3, 1, 4, 2, 5] : (1, 5)
 min and max of [3, 1, 4, 2, 5] : (1, 5)
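Tuple packing and unpacking are not limited to function results. A short sketch, using a hypothetical helper function, shows unpacking a returned pair and swapping two variables in one statement:

```python
def quotient_and_remainder(a, b):
    '''Returns the floor-division quotient and the remainder as a tuple (hypothetical helper)'''
    return a // b, a % b

pair = quotient_and_remainder(17, 5)  # one name receives the whole tuple
q, r = quotient_and_remainder(17, 5)  # two names receive one value each

first, second = 1, 2
first, second = second, first         # the right side is packed, then unpacked

print(pair)
print(q)
print(r)
print(first)
print(second)
```

The number of names on the left must match the number of values on the right; otherwise a ValueError is raised.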
@@ -3264,38 +3793,53 @@ 

Defining and calling functions +
+
+
+

+
+
+
+

The for statement iterates over the elements of a sequence.

The range(stop) function returns a list of values from 0 up to stop - 1 (inclusive).

-
-
+
+
+
+
-In [39]: +In [41]:
-
-
print 'Index values for attributes:', range(len(attribute_names)), '\n'
+
+
+
print('Index values for attributes:', range(len(attribute_names)), end='\n\n') # 2 newlines
 
-print 'Values for the', len(attribute_names), 'attributes:\n'
+print('Values for the', len(attribute_names), 'attributes:', end='\n\n')
 for i in range(len(attribute_names)):
-    print attribute_names[i], '=', attribute_value(single_instance_list, attribute_names[i], attribute_names)
+    print(attribute_names[i], '=', 
+          attribute_value(single_instance_list, attribute_names[i], attribute_names))
 
+
-
-
+
+
-
-
+
+
-Index values for attributes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] 
+Index values for attributes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
 
 Values for the 23 attributes:
 
@@ -3331,29 +3875,37 @@ 

Iteration: for, range

+
+
+
+

The more general form of the function, range(start, stop[, step]), returns a list of values from start to stop - 1 (inclusive) increasing by step (which defaults to 1), or from start down to stop + 1 (inclusive) decreasing by step if step is negative.

-
-
+
+
+
+
-In [40]: +In [42]:
-
-
print 'range(5, 10):', range(5, 10)
-print 'range(10, 5, -1):', range(10, 5, -1)
-print 'range(0, 10, 2):', range(0, 10, 2)
+
+
+
print('range(5, 10):', range(5, 10))
+print('range(10, 5, -1):', range(10, 5, -1))
+print('range(0, 10, 2):', range(0, 10, 2))
 
+
-
-
+
+
-
-
+
+
 range(5, 10): [5, 6, 7, 8, 9]
 range(10, 5, -1): [10, 9, 8, 7, 6]
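The list output shown above is what Python 2 prints. In Python 3, range() returns a lazy range object, so printing it directly shows something like `range(5, 10)` instead. One way to get the same result under either version, sketched here, is to wrap the call in list():

```python
# list() materializes the values, so the result is a list under both
# Python 2 (where range() already returns a list) and Python 3
ascending = list(range(5, 10))
descending = list(range(10, 5, -1))
evens = list(range(0, 10, 2))

print(ascending)
print(descending)
print(evens)
```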
@@ -3367,34 +3919,43 @@ 

Iteration: for, range

+
+
+
+

The xrange([start,] stop[, step]) function is a lazily evaluated version of the range() function. In the context of a for loop, it returns the next item of the sequence for each iteration of the loop rather than creating all the elements of the sequence before the first iteration. This can reduce memory consumption, particularly in cases where iteration over all the items is not required.

The range() function returns a list, which can then be manipulated by any list or sequence methods. An xrange object supports only iteration (e.g., in a for loop), indexing and the len() function. Iterators, a related and slightly more general class of objects, include a next() method for explicitly returning the next item.
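Note that xrange() exists only in Python 2; in Python 3, range() itself is lazy and xrange() has been removed. The following is a minimal compatibility sketch, where the name `lazy_range` is an assumption introduced for illustration:

```python
try:
    lazy_range = xrange  # Python 2: use the lazy version
except NameError:
    lazy_range = range   # Python 3: range() is already lazy

total = 0
for i in lazy_range(5):  # iterates over 0, 1, 2, 3, 4 without building a list
    total += i

print(total)
```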

-
-
+
+
+
+
-In [41]: +In [43]:
-
-
print xrange(len(attribute_names)), '\n'
+
+
+
print(xrange(len(attribute_names)), end='\n\n') # prints the string representation of the object
 
-print 'Values for the', len(attribute_names), 'attributes:\n'
+print('Values for the', len(attribute_names), 'attributes:', end='\n\n')
 for i in xrange(len(attribute_names)):
-    print attribute_names[i], '=', attribute_value(single_instance_list, attribute_names[i], attribute_names)
+    print(attribute_names[i], '=', 
+          attribute_value(single_instance_list, attribute_names[i], attribute_names))
 
+
-
-
+
+
-
-
+
+
-xrange(23) 
+xrange(23)
 
 Values for the 23 attributes:
 
@@ -3430,10 +3991,20 @@ 

Iteration: for, range

+ +
+
+
+

A Python module is a file containing related definitions (e.g., of functions and variables). Modules are used to help organize Python namespaces, the sets of identifiers accessible in particular contexts. All of the functions and variables we define in this IPython Notebook are in the __main__ namespace, so accessing them does not require any specification of a module.

A Python module named simple_ml (in the file simple_ml.py) contains a set of solutions to the exercises in this IPython Notebook. Accessing functions in that module requires that we first import the module, and then prefix the function names with the module name followed by a dot (this is known as dotted notation).

@@ -3444,38 +4015,55 @@

Modules, namespaces and dotted

print_attribute_names_and_values(single_instance_list, attribute_names)

This will reference the print_attribute_names_and_values() function in the current namespace (__main__), i.e., the top-level interpreter environment. The simple_ml.print_attribute_names_and_values() function will still be accessible in the simple_ml namespace by using the "simple_ml." prefix.
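The same behavior can be sketched with the standard library's math module standing in for simple_ml. The local floor() function is deliberately different and purely hypothetical, so the two results are easy to tell apart:

```python
import math

def floor(x):
    '''A hypothetical floor() defined in the current namespace'''
    return 'my floor of {}'.format(x)

local_result = floor(2.7)        # resolves to the function defined above
module_result = math.floor(2.7)  # dotted notation still reaches the module's function

print(local_result)
print(module_result)
```

Defining a function with the same name in the current namespace does not change or hide the module's version, as long as the module prefix is used.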

+
+
+
+
+
+

Exercise 1: define print_attribute_names_and_values()

+
+
+
+
+
+

Complete the following function definition, print_attribute_names_and_values(instance, attribute_names), so that it generates exactly the same output as the code above.

-
-
+
+
+
+
-In [42]: +In [44]:
-
+
+
def print_attribute_names_and_values(instance, attribute_names):
     '''Prints the attribute names and values for an instance'''
     # your code goes here
     return
 
 import simple_ml # this module contains my solutions to exercises
+
 # to test your function, delete the 'simple_ml.' module specification in the call to print_attribute_names_and_values() below
 simple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)
 
+
-
-
+
+
-
-
+
+
 Values for the 23 attributes:
 
@@ -3511,10 +4099,20 @@ 

Exercise 1: defin

+
+
+
+

File I/O

+
+
+
+
+
+

Python file input and output is done through file objects. A file object is created with the open(name[, mode]) function, where name is a string representing the name of the file, and mode is 'r' (read), 'w' (write) or 'a' (append); if no second argument is provided, the mode defaults to 'r'.

A common Python programming pattern for processing an input text file is to

@@ -3524,12 +4122,15 @@

File I/O

The following code creates a list of instances, where each instance is a list of attribute values (like instance_1_str above).

-
-
+
+
+
+
-In [43]: +In [45]:
-
+
+
all_instances = [] # initialize instances to an empty list
 data_filename = 'agaricus-lepiota.data'
 
@@ -3537,19 +4138,20 @@ 

File I/O

     for line in f:
         all_instances.append(line.strip().split(','))
-print 'Read', len(all_instances), 'instances from', data_filename
-print 'First instance:', all_instances[0] # we don't want to print all the instances, so let's just print one to verify
+print('Read', len(all_instances), 'instances from', data_filename)
+print('First instance:', all_instances[0]) # we don't want to print all the instances, so let's just print one to verify
+
-
-
+
+
-
-
+
+
 Read 8124 instances from agaricus-lepiota.data
 First instance: ['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']
@@ -3562,19 +4164,32 @@ 

File I/O

+
+
+
+

Exercise 2: define load_instances()

+
+
+
+
+
+

Define a function, load_instances(filename), that returns a list of instances in a text file. The function definition is started for you below. The function should exhibit the same behavior as the code above.

-
-
+
+
+
+
-In [44]: +In [46]:
-
+
+
def load_instances(filename):
     '''Returns a list of instances stored in a file.
     
@@ -3587,19 +4202,20 @@ 

Exercise 2: define load_instances()
 data_filename = 'agaricus-lepiota.data'
 # to test your function, delete the 'simple_ml.' module specification in the call to load_instances() below
 all_instances_2 = simple_ml.load_instances(data_filename)
-print 'Read', len(all_instances_2), 'instances from', data_filename
-print 'First instance:', all_instances_2[0]
+print('Read', len(all_instances_2), 'instances from', data_filename)
+print('First instance:', all_instances_2[0])

+
-
-
+
+
-
-
+
+
 Read 8124 instances from agaricus-lepiota.data
 First instance: ['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']
@@ -3612,18 +4228,28 @@ 

Exercise 2: define load_instances()

+
+
+
+

Output to a text file is usually done via the file.write(str) method.

As we saw earlier, the str.join(words) method returns a single str-delimited string containing each of the strings in the list words.

SQL and Hive database tables often use the pipe ('|') delimiter to separate column values for each row when they are stored as flat files. The following code creates a new data file using pipes rather than commas to separate the attribute values.

+

To help maintain internal consistency, it is generally a good practice to define a variable such as delimiter or separator and bind it to the intended delimiter string, and then use the variable throughout.

-
-
+
+
+
+
-In [45]: +In [47]:
-
-
print 'Converting to pipe delimiter, e.g.,', '|'.join(all_instances[0])
+
+
+
delimiter = '|'
+
+print('Converting to {}-delimited strings, e.g.,'.format(delimiter), delimiter.join(all_instances[0]))
 
 datafile2 = 'agaricus-lepiota-2.data'
 with open(datafile2, 'w') as f:
@@ -3633,22 +4259,23 @@ 

Exercise 2: define load_instances()
 all_instances_3 = []
 with open(datafile2, 'r') as f:
     for line in f:
-        all_instances_3.append(line.strip().split('|')) # note: changed ',' to '|'
-print 'Read', len(all_instances_3), 'instances from', datafile2
-print 'First instance:', all_instances_3[0] # we don't want to print all the instances, so let's just print one to verify
+        all_instances_3.append(line.strip().split(delimiter)) # note: changed ',' to '|'
+print('Read', len(all_instances_3), 'instances from', datafile2)
+print('First instance:', all_instances_3[0]) # we don't want to print all the instances, so let's just print one to verify

+
-
-
+
+
-
-
+
+
-Converting to pipe delimiter, e.g., p|x|s|n|t|p|f|c|n|k|e|e|s|s|w|w|p|w|o|p|k|s|u
+Converting to |-delimited strings, e.g., p|x|s|n|t|p|f|c|n|k|e|e|s|s|w|w|p|w|o|p|k|s|u
 Read 8124 instances from agaricus-lepiota-2.data
 First instance: ['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']
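The write-then-read pattern can also be tried on a small scale without touching the mushroom data. The sketch below is an assumption-laden illustration: it writes two invented rows to a temporary file (created with the standard tempfile module), reads them back with the same delimiter variable, and then removes the file.

```python
import os
import tempfile

delimiter = ';'  # bound once, then used for both writing and reading
rows = [['p', 'x', 's'], ['e', 'b', 'y']]  # hypothetical instances

# create a temporary file so the sketch does not overwrite any real data
handle, path = tempfile.mkstemp(suffix='.data')
os.close(handle)

with open(path, 'w') as f:
    for row in rows:
        f.write(delimiter.join(row) + '\n')  # write() does not add a newline itself

loaded = []
with open(path, 'r') as f:
    for line in f:
        loaded.append(line.strip().split(delimiter))

os.remove(path)
print(loaded == rows)
```

Because the delimiter is defined in one place, switching to a different separator requires changing only a single line.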
 
@@ -3660,116 +4287,148 @@ 

Exercise 2: define load_instances()

+
+
+
+

List comprehensions

+
+
+
+
+
+

Python provides a powerful list comprehension construct to simplify the creation of a list by specifying a formula in a single expression.

Some programmers find list comprehensions confusing, and avoid their use. We won't rely on list comprehensions here, but will show examples with and without list comprehensions below.

One common use of list comprehensions is in the context of the str.join(words) method we saw earlier.

If we wanted to construct a pipe-delimited string containing elements of the list, we could use a for loop to iteratively add list elements and pipe delimiters to a string. We would thereby add one pipe delimiter too many, and would thus have to shave that off at the end.

-
-
+
+
+
+
-In [46]: +In [48]:
-
-
pipe_delimited_string = ''
+
+
+
delimited_string = ''
 for x in [1, 2, 3]:
-    pipe_delimited_string += str(x) + '|'
-pipe_delimited_string = pipe_delimited_string[:-1]
-pipe_delimited_string
+    delimited_string += str(x) + delimiter
+delimited_string = delimited_string[:-1]
+delimited_string
 
+
-
-
+
+
-
- Out[46]:
-
+
+ Out[48]:
+
 '1|2|3'
 
-
+
+
+
+
+

This process is much simpler using a list comprehension.

-
-
+
+
+
+
-In [47]: +In [49]:
-
-
'|'.join([str(x) for x in [1, 2, 3]])
+
+
+
delimiter.join([str(x) for x in [1, 2, 3]])
 
+
-
-
+
+
-
- Out[47]:
-
+
+ Out[49]:
+
 '1|2|3'
 
-
+
+
+
+
+

As noted in the initial description of the UCI mushroom set above, 2480 of the 8124 instances have missing values (denoted by '?') for an attribute. There are several techniques for dealing with instances that include missing values, but to simplify things in the context of this primer - and following the example in the Data Science for Business book - we will restrict our focus to only those clean instances that have no missing values.

We could use several lines of code - with an if statement inside a for loop - to create a clean_instances list from the all_instances list. Or we could use a list comprehension.

We will show both approaches to creating clean_instances below.

-
-
+
+
+
+
-In [48]: +In [50]:
-
+
+
# version 1: using an if statement nested within a for statement
+unknown_value = '?'
+
 clean_instances = []
 for instance in all_instances:
-    if '?' not in instance:
+    if unknown_value not in instance:
         clean_instances.append(instance)
         
-print len(clean_instances), 'clean instances'
+print(len(clean_instances), 'clean instances')
 
+
-
-
+
+
-
-
+
+
 5644 clean instances
 
@@ -3781,27 +4440,29 @@ 

List comprehensions -
+
+
-In [49]: +In [51]:
-
+
+
# version 2: using an equivalent list comprehension
-clean_instances_2 = [instance for instance in all_instances if '?' not in instance]
+clean_instances_2 = [instance for instance in all_instances if unknown_value not in instance]
 
-print len(clean_instances_2), 'clean instances'
+print(len(clean_instances_2), 'clean instances')
 
+
-
-
+
+
-
-
+
+
 5644 clean instances
 
@@ -3813,10 +4474,20 @@ 

List comprehensions +
+
+
+

+
+
+
+

Although single character abbreviations of attribute values (e.g., 'x') allow for more compact data files, they are not as easy to understand by human readers as the longer attribute value descriptions (e.g., 'convex').

A Python dictionary (or dict) is an unordered, comma-delimited collection of key: value pairs, serving a similar function to a hash table or hashmap in other programming languages.

@@ -3827,27 +4498,36 @@

Dictionaries (dicts) -
+
+

+
+
-In [50]: +In [52]:
-
-
attribute_values_cap_type = {'b': 'bell', 'c': 'conical', 'x': 'convex', 'f': 'flat', 'k': 'knobbed', 's': 'sunken'}
+
+
+
attribute_values_cap_type = {'b': 'bell', 
+                             'c': 'conical', 
+                             'x': 'convex', 
+                             'f': 'flat', 
+                             'k': 'knobbed', 
+                             's': 'sunken'}
 
 attribute_value_abbrev = 'x'
-print attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev]
+print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])
 
+
-
-
+
+
-
-
+
+
 x = convex
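Looking up a key that is not in a dictionary with square brackets raises a KeyError. A minimal sketch of two common ways to guard against that, the in operator and the dict.get() method, is shown here; the 'z' abbreviation and the 'unknown' fallback string are made up for illustration.

```python
attribute_values_cap_type = {'b': 'bell', 'c': 'conical', 'x': 'convex',
                             'f': 'flat', 'k': 'knobbed', 's': 'sunken'}

# The 'in' operator tests membership among the keys (not the values)
has_x = 'x' in attribute_values_cap_type
has_z = 'z' in attribute_values_cap_type  # 'z' is a hypothetical, missing abbreviation

# dict.get() returns a fallback value instead of raising KeyError for a missing key
known_description = attribute_values_cap_type.get('x', 'unknown')
missing_description = attribute_values_cap_type.get('z', 'unknown')

print(known_description)
print(missing_description)
```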
 
@@ -3859,29 +4539,37 @@ 

Dictionaries (dicts) +
+
+

A Python dictionary is an iterable container, so we can iterate over the keys in a dictionary using a for loop.

Note that since a dictionary is an unordered collection, the sequence of abbreviations and associated values is not guaranteed to appear in any particular order.

-
-
+
+
+
+
-In [51]: +In [53]:
-
+
+
for attribute_value_abbrev in attribute_values_cap_type:
-    print attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev]
+    print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])
 
+
-
-
+
+
-
-
+
+
 c = conical
 b = bell
@@ -3898,29 +4586,37 @@ 

Dictionaries (dicts) +
+
+

Python supports dictionary comprehensions, which have a similar form to the list comprehensions described above.

For example, if we provisionally omit the 'convex' cap-type (whose abbreviation is the last letter rather than the first letter of the attribute value), we could construct a dictionary of abbreviations and descriptions using the following expression.

-
-
+
+
+
+
-In [52]: +In [54]:
-
-
attribute_values_cap_type_2 ={x[0]: x for x in ['bell', 'conical', 'flat', 'knobbed', 'sunken']}
-print attribute_values_cap_type_2
+
+
+
attribute_values_cap_type_2 = {x[0]: x for x in ['bell', 'conical', 'flat', 'knobbed', 'sunken']}
+print(attribute_values_cap_type_2)
 
+
-
-
+
+
-
-
+
+
 {'s': 'sunken', 'c': 'conical', 'b': 'bell', 'k': 'knobbed', 'f': 'flat'}
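One possible way to handle the irregular 'convex' abbreviation is to add it after the comprehension has built the regular entries. The sketch below also inverts the mapping with a second dictionary comprehension; the variable names are made up for this illustration and do not appear elsewhere in the notebook.

```python
regular_descriptions = ['bell', 'conical', 'flat', 'knobbed', 'sunken']

# Regular entries: the abbreviation is the first letter of the description
cap_type_by_abbrev = {x[0]: x for x in regular_descriptions}

# Irregular entry: 'convex' is abbreviated by its last letter, so add it explicitly
cap_type_by_abbrev['x'] = 'convex'

# A second comprehension inverts the mapping (description -> abbreviation)
abbrev_by_cap_type = {description: abbrev
                      for abbrev, description in cap_type_by_abbrev.items()}

print(sorted(cap_type_by_abbrev))
print(abbrev_by_cap_type['convex'])
```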
 
@@ -3932,6 +4628,10 @@ 

Dictionaries (dicts) +
+
+
+
+
-In [53]: +In [55]:
-
-
-
+
+
-
-
+
+
+
+
+
+

We earlier created the attribute_names list manually. The load_attribute_values() function above creates the attribute_values list automatically from the contents of a file ... each line of which starts with the name of each attribute ... which we discard.

Complete the following function definition so that the code implements the functionality described in the docstring.

-
-
+
+
+
+
-In [54]: +In [56]:
-
+
+
def load_attribute_names_and_values(filename):
     '''Returns a list of attribute names and values in a file.
     
@@ -4042,19 +4759,21 @@ 

Exercise 3: define load_attr attribute_filename = 'agaricus-lepiota.attributes' # to test your function, delete the 'simple_ml.' module specification in the call to load_attribute_names_and_values() below attribute_names_and_values = simple_ml.load_attribute_names_and_values(attribute_filename) -print 'Read', len(attribute_names_and_values), 'attribute values from', attribute_filename -print 'First attribute name:', attribute_names_and_values[0]['name'], '; values:', attribute_names_and_values[0]['values'] +print('Read', len(attribute_names_and_values), 'attribute values from', attribute_filename) +print('First attribute name:', attribute_names_and_values[0]['name'], + '; values:', attribute_names_and_values[0]['values'])

+
-
-
+
+
-
-
+
+
 Read 23 attribute values from agaricus-lepiota.attributes
 First attribute name: class ; values: {'p': 'poisonous', 'e': 'edible'}
@@ -4067,36 +4786,51 @@ 

Exercise 3: define load_attr

+
+
+
+

Counting

+
+
+
+
+
+

Data scientists often need to count things. For example, we might want to count the numbers of edible and poisonous mushrooms in the clean_instances list we created earlier.

-
-
+
+
+
+
-In [55]: +In [57]:
-
+
+
edible_count = 0
 for instance in clean_instances:
     if instance[0] == 'e':
         edible_count += 1 # this is shorthand for edible_count = edible_count + 1
 
-print 'There are', edible_count, 'edible mushrooms among the', len(clean_instances), 'clean instances'
+print('There are', edible_count, 'edible mushrooms among the', 
+      len(clean_instances), 'clean instances')
 
+
-
-
+
+
-
-
+
+
 There are 3488 edible mushrooms among the 5644 clean instances
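The same kind of count can be written more compactly by passing a generator expression to the built-in sum() function. This is only a sketch of that idiom; toy_instances is a small made-up list standing in for clean_instances.

```python
# Hypothetical stand-in for clean_instances: the class label is at index 0
toy_instances = [['e', 'x'], ['p', 'f'], ['e', 'b'], ['e', 'x']]

# Add 1 for every instance whose class label is 'e' (edible)
edible_count = sum(1 for instance in toy_instances if instance[0] == 'e')

print(edible_count)
```

The explicit for loop is arguably easier to read for newcomers; the sum() form is handy once comprehensions feel familiar.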
 
@@ -4108,16 +4842,23 @@ 

Counting

+
+
+
+

More generally, we often want to count the number of occurrences (frequencies) of each possible value for an attribute. One way to do so is to create a dictionary where each dictionary key is an attribute value and each dictionary value is the count of instances with that attribute value.

Using an ordinary dictionary, we must be careful to create a new dictionary entry the first time we see a new attribute value (that is not already contained in the dictionary).

-
-
+
+
+
+
-In [56]: +In [58]:
-
+
+
cap_state_value_counts = {}
 for instance in clean_instances:
     cap_state_value = instance[1] # cap-state is the 2nd attribute
@@ -4125,20 +4866,21 @@ 

Counting

cap_state_value_counts[cap_state_value] = 0 cap_state_value_counts[cap_state_value] += 1 -print 'Counts for each value of cap-state:' +print('Counts for each value of cap-state:') for value in cap_state_value_counts: - print value, ':', cap_state_value_counts[value] + print(value, ':', cap_state_value_counts[value])
+
-
-
+
+
-
-
+
+
 Counts for each value of cap-state:
 c : 4
@@ -4156,16 +4898,23 @@ 

Counting

+
+
+
+

The Python collections module provides a number of high performance container datatypes. A frequently useful datatype is a defaultdict, which automatically creates an appropriate default value for a new key. For example, a defaultdict(int) automatically initializes a new dictionary entry to 0 (zero); a defaultdict(list) automatically initializes a new dictionary entry to the empty list ([]).

After first importing defaultdict from collections, we can use defaultdict(int) to simplify the code above:

-
-
+
+
+
+
-In [57]: +In [59]:
-
+
+
from collections import defaultdict # don't need to use collections.defaultdict() below
 
 cap_state_value_counts = defaultdict(int)
@@ -4173,20 +4922,21 @@ 

Counting

cap_state_value = instance[1] cap_state_value_counts[cap_state_value] += 1 -print 'Counts for each value of cap-state:' +print('Counts for each value of cap-state:') for value in cap_state_value_counts: - print value, ':', cap_state_value_counts[value] + print(value, ':', cap_state_value_counts[value])
+
-
-
+
+
-
-
+
+
 Counts for each value of cap-state:
 c : 4
@@ -4204,39 +4954,53 @@ 

Counting

+
+
+
+

Exercise 4: define attribute_value_counts()

+
+
+
+
+
+

Define a function, attribute_value_counts(instances, attribute, attribute_names), that returns a defaultdict containing the counts of occurrences of each value of attribute in the list of instances. attribute_names is the list we created above, where each element is the name of an attribute.

-
-
+
+
+
+
-In [58]: +In [60]:
-
+
+
# your definition goes here
 
 attribute = 'cap-shape'
 # remove 'simple_ml.' below to test your function definition
 attribute_value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)
 
-print 'Counts for each value of', attribute, ':'
+print('Counts for each value of', attribute, ':')
 for value in attribute_value_counts:
-    print value, ':', attribute_value_counts[value]
+    print(value, ':', attribute_value_counts[value])
 
+
-
-
+
+
-
-
+
+
 Counts for each value of cap-shape :
 c : 4
@@ -4254,35 +5018,49 @@ 

Exercise 4: define attribut

+
+
+
+

Sorting

+
+
+
+
+
+

Earlier, we saw that there is a list.sort() method that will sort a list in-place, i.e., by replacing the original value of list with a sorted version of the elements in list.

The Python sorted(iterable[, cmp[, key[, reverse]]]) function returns a new list containing the elements of the list, dictionary or any other iterable container it is passed, in ascending order; the original container is left unchanged.

-
-
+
+
+
+
-In [59]: +In [61]:
-
+
+
original_list = [3, 1, 4, 2, 5]
 sorted_list = sorted(original_list)
-print original_list
-print sorted_list
+print(original_list)
+print(sorted_list)
 
+
-
-
+
+
-
-
+
+
 [3, 1, 4, 2, 5]
 [1, 2, 3, 4, 5]
@@ -4295,27 +5073,35 @@ 

Sorting

+
+
+
+

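Since it returns a new list rather than sorting in place, sorted() can be used with strings (which are immutable, and so have no sort() method); the result is a sorted list of the characters in the string.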

-
-
+
+
+
+
-In [60]: +In [62]:
-
-
print sorted('python')
+
+
+
print(sorted('python'))
 
+
-
-
+
+
-
-
+
+
 ['h', 'n', 'o', 'p', 't', 'y']
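If a sorted string, rather than a list of characters, is what we want, the str.join() method seen earlier can glue the characters back together. A minimal sketch:

```python
letters = sorted('python')      # a new list of single-character strings
sorted_word = ''.join(letters)  # join with the empty string as the delimiter

print(letters)
print(sorted_word)
```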
 
@@ -4327,27 +5113,35 @@ 

Sorting

+
+
+
+

sorted() can also be used with dictionaries (it returns a sorted list of the dictionary keys).

-
-
+
+
+
+
-In [61]: +In [63]:
-
-
print sorted(attribute_values_cap_type) # returns a list of sorted keys (but not values) in the dictionary
+
+
+
print(sorted(attribute_values_cap_type)) # returns a list of sorted keys (but not values) in the dictionary
 
+
-
-
+
+
-
-
+
+
 ['b', 'c', 'f', 'k', 's', 'x']
 
@@ -4359,28 +5153,36 @@ 

Sorting

+
+
+
+

However, we can use the sorted keys to access the values of a dictionary.

-
-
+
+
+
+
-In [62]: +In [64]:
-
+
+
for attribute_value_abbrev in sorted(attribute_values_cap_type):
-    print attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev]
+    print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])
 
+
-
-
+
+
-
-
+
+
 b = bell
 c = conical
@@ -4397,27 +5199,35 @@ 

Sorting

+
+
+
+

An optional keyword argument, reverse, can be used to reverse the order of the sorted list returned by the function. The default value of this optional parameter is False; to get non-default behavior, we must specify the name and value of the argument: reverse=True.

-
-
+
+
+
+
-In [63]: +In [65]:
-
-
print sorted([3, 1, 4, 2, 5], reverse=True)
+
+
+
print(sorted([3, 1, 4, 2, 5], reverse=True))
 
+
-
-
+
+
-
-
+
+
 [5, 4, 3, 2, 1]
 
@@ -4429,24 +5239,26 @@ 

Sorting

-
-
+
+
-In [64]: +In [66]:
-
-
print sorted(attribute_values_cap_type, reverse=True) 
+
+
+
print(sorted(attribute_values_cap_type, reverse=True))
 
+
-
-
+
+
-
-
+
+
 ['x', 's', 'k', 'f', 'c', 'b']
 
@@ -4458,29 +5270,31 @@ 

Sorting

-
-
+
+
-In [65]: +In [67]:
-
+
+
attribute = 'cap-shape'
 attribute_value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)
 
-print 'Counts for each value of', attribute, ':'
+print('Counts for each value of', attribute, ':')
 for value in sorted(attribute_value_counts):
-    print value, ':', attribute_value_counts[value]
+    print(value, ':', attribute_value_counts[value])
 
+
-
-
+
+
-
-
+
+
 Counts for each value of cap-shape :
 b : 300
@@ -4498,40 +5312,60 @@ 

Sorting

+
+
+
+

Sorting a dictionary by values

+
+
+
+
+
+
+

The dict.items() method returns an unordered list of (key, value) tuples in dict.

-
-
+
+
+
+
-In [66]: +In [68]:
-
+
+
attribute_values_cap_type.items()
 
+
-
-
+
+
-
- Out[66]:
-
+
+ Out[68]:
+
 [('c', 'conical'),
  ('b', 'bell'),
@@ -4540,75 +5374,85 @@ 

Sorting a dictionary by values +
+
+

A related method, dict.iteritems(), returns an iterator - an object that produces the next item in a sequence each time it is advanced (e.g., during each iteration of a for loop), which can be more efficient than generating all the items in the sequence before any are used. This is similar to the distinction between xrange() and range() described above.

-
-
+
+
+
+
-In [67]: +In [69]:
-
+
+
attribute_values_cap_type.iteritems()
 
+
-
-
+
+
-
- Out[67]:
-
+
+ Out[69]:
+
-<dictionary-itemiterator at 0x108e1adb8>
+<dictionary-itemiterator at 0x107481730>
 
-
+
-
-
+
+
-In [68]: +In [70]:
-
+
+
for key, value in attribute_values_cap_type.iteritems():
-    print key, value
+    print(key, ':', value)
 
+
-
-
+
+
-
-
+
+
-c conical
-b bell
-f flat
-k knobbed
-s sunken
-x convex
+c : conical
+b : bell
+f : flat
+k : knobbed
+s : sunken
+x : convex
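Note that dict.iteritems() exists only in Python 2; in Python 3, dict.items() returns a lazy view that can be iterated over in the same way. The following sketch therefore uses dict.items(), which works in both versions (at the cost of building a full list in Python 2), and sorts the pairs by value with operator.itemgetter(1).

```python
import operator

attribute_values_cap_type = {'b': 'bell', 'c': 'conical', 'x': 'convex',
                             'f': 'flat', 'k': 'knobbed', 's': 'sunken'}

# items() is available in both Python 2 and Python 3;
# itemgetter(1) picks the value out of each (key, value) pair as the sort key
pairs_sorted_by_value = sorted(attribute_values_cap_type.items(),
                               key=operator.itemgetter(1))

for key, value in pairs_sorted_by_value:
    print('{} : {}'.format(key, value))
```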
 
 
@@ -4618,35 +5462,43 @@

Sorting a dictionary by values +
+
+

The Python operator module contains a number of functions that perform object comparisons, logical operations, mathematical operations, sequence operations, and abstract type tests.

To facilitate sorting a dictionary by values, we will use the operator.itemgetter(item) function that can be used to retrieve an indexed value (item) in a tuple (such as a (key, value) pair returned by [iter]items()).

We can use operator.itemgetter(1) to reference the value - the 2nd item in each (key, value) tuple (at zero-based index position 1) - rather than the key - the first item in each (key, value) tuple (at index position 0).

We will use the optional keyword argument key in sorted(iterable[, cmp[, key[, reverse]]]) to specify a sorting key that is not the same as the dict key (the dict key is the default sorting key).

-
-
+
+
+
+
-In [69]: +In [71]:
-
+
+
import operator
 
 sorted(attribute_values_cap_type.iteritems(), key=operator.itemgetter(1))
 
+
-
-
+
+
-
- Out[69]:
-
+
+ Out[71]:
+
 [('b', 'bell'),
  ('c', 'conical'),
@@ -4655,42 +5507,50 @@ 

Sorting a dictionary by values +
+
+

We can now sort the counts of attribute values in descending frequency of occurrence, and print them out using tuple unpacking.

-
-
+
+
+
+
-In [70]: +In [72]:
-
+
+
attribute = 'cap-shape'
 value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)
 
-print 'Counts for each value of', attribute, ':'
+print('Counts for each value of', attribute, '(sorted by count):')
 for value, count in sorted(value_counts.iteritems(), key=operator.itemgetter(1), reverse=True):
-    print value, ':', count
+    print(value, ':', count)
 
+
-
-
+
+
-
-
+
+
-Counts for each value of cap-shape :
+Counts for each value of cap-shape (sorted by count):
 x : 2840
 f : 2432
 b : 300
@@ -4706,41 +5566,55 @@ 

Sorting a dictionary by values +
+
+
+

+
+
+
+

Define a function, print_all_attribute_value_counts(instances, attribute_names), that prints each attribute name in attribute_names, and then for each attribute value, prints the value abbreviation, the count of occurrences of that value and the proportion of instances that have that attribute value.

You may find it helpful to use fancier output formatting. More details can be found in the Python documentation on format string syntax.

Examples of the str.format() function used in conjunction with print are shown below, followed by sample output of the simple_ml version of print_all_attribute_value_counts() (which uses similar formatting, but without hard-coded values).

-
-
+
+
+
+
-In [71]: +In [73]:
-
-
print 'Output of a sample line using str.format():'
-print 'class:', # comma at end keeps cursor on the same line for subsequent print statements
-print '{} = {} ({:5.3f}),'.format('e', 3488, 3488 / 5644.0),
-print '{} = {} ({:5.3f}),'.format('p', 2156, 2156 / 5644.0),
-print # a print statement with no arguments will advance the cursor to the beginning of the next line
-print 'End of sample line'
+
+
+
print('Output of a sample line using str.format():')
+print('class:', end=' ') # end=' ' keeps the cursor on the same line for subsequent print() calls
+print('{} = {} ({:5.3f}),'.format('e', 3488, 3488 / 5644.0), end=' ')
+print('{} = {} ({:5.3f}),'.format('p', 2156, 2156 / 5644.0), end=' ')
+print() # calling print() with no arguments advances the cursor to the beginning of the next line
+print('End of sample line')
 
+
-
-
+
+
-
-
+
+
 Output of a sample line using str.format():
-class: e = 3488 (0.618), p = 2156 (0.382),
+class: e = 3488 (0.618), p = 2156 (0.382), 
 End of sample line
 
 
@@ -4751,57 +5625,65 @@

Exercise 5: defin

+
+
+
+

Define your version of print_all_attribute_value_counts(instances, attribute_names) below, deleting the simple_ml. module specification when you are ready to test your function.

-
-
+
+
+
+
-In [72]: +In [74]:
-
+
+
# your function definition goes here
 
-print '\nCounts for all attributes and values:\n'
+print('\nCounts for all attributes and values:\n')
 simple_ml.print_all_attribute_value_counts(clean_instances, attribute_names)
 
+
-
-
+
+
-
-
+
+
 
 Counts for all attributes and values:
 
-class: e = 3488 (0.618), p = 2156 (0.382),
-cap-shape: x = 2840 (0.503), f = 2432 (0.431), b = 300 (0.053), k = 36 (0.006), s = 32 (0.006), c = 4 (0.001),
-cap-surface: y = 2220 (0.393), f = 2160 (0.383), s = 1260 (0.223), g = 4 (0.001),
-cap-color: g = 1696 (0.300), n = 1164 (0.206), y = 1056 (0.187), w = 880 (0.156), e = 588 (0.104), b = 120 (0.021), p = 96 (0.017), c = 44 (0.008),
-bruises?: t = 3184 (0.564), f = 2460 (0.436),
-odor: n = 2776 (0.492), f = 1584 (0.281), a = 400 (0.071), l = 400 (0.071), p = 256 (0.045), c = 192 (0.034), m = 36 (0.006),
-gill-attachment: f = 5626 (0.997), a = 18 (0.003),
-gill-spacing: c = 4620 (0.819), w = 1024 (0.181),
-gill-size: b = 4940 (0.875), n = 704 (0.125),
-gill-color: p = 1384 (0.245), n = 984 (0.174), w = 966 (0.171), h = 720 (0.128), g = 656 (0.116), u = 480 (0.085), k = 408 (0.072), r = 24 (0.004), y = 22 (0.004),
-stalk-shape: t = 2880 (0.510), e = 2764 (0.490),
-stalk-root: b = 3776 (0.669), e = 1120 (0.198), c = 556 (0.099), r = 192 (0.034),
-stalk-surface-above-ring: s = 3736 (0.662), k = 1332 (0.236), f = 552 (0.098), y = 24 (0.004),
-stalk-surface-below-ring: s = 3544 (0.628), k = 1296 (0.230), f = 552 (0.098), y = 252 (0.045),
-stalk-color-above-ring: w = 3136 (0.556), p = 1008 (0.179), g = 576 (0.102), n = 448 (0.079), b = 432 (0.077), c = 36 (0.006), y = 8 (0.001),
-stalk-color-below-ring: w = 3088 (0.547), p = 1008 (0.179), g = 576 (0.102), n = 496 (0.088), b = 432 (0.077), c = 36 (0.006), y = 8 (0.001),
-veil-type: p = 5644 (1.000),
-veil-color: w = 5636 (0.999), y = 8 (0.001),
-ring-number: o = 5488 (0.972), t = 120 (0.021), n = 36 (0.006),
-ring-type: p = 3488 (0.618), l = 1296 (0.230), e = 824 (0.146), n = 36 (0.006),
-spore-print-color: n = 1920 (0.340), k = 1872 (0.332), h = 1584 (0.281), w = 148 (0.026), r = 72 (0.013), u = 48 (0.009),
-population: v = 2160 (0.383), y = 1688 (0.299), s = 1104 (0.196), a = 384 (0.068), n = 256 (0.045), c = 52 (0.009),
-habitat: d = 2492 (0.442), g = 1860 (0.330), p = 568 (0.101), u = 368 (0.065), m = 292 (0.052), l = 64 (0.011),
+class: e = 3488 (0.618), p = 2156 (0.382), 
+cap-shape: x = 2840 (0.503), f = 2432 (0.431), b = 300 (0.053), k = 36 (0.006), s = 32 (0.006), c = 4 (0.001), 
+cap-surface: y = 2220 (0.393), f = 2160 (0.383), s = 1260 (0.223), g = 4 (0.001), 
+cap-color: g = 1696 (0.300), n = 1164 (0.206), y = 1056 (0.187), w = 880 (0.156), e = 588 (0.104), b = 120 (0.021), p = 96 (0.017), c = 44 (0.008), 
+bruises?: t = 3184 (0.564), f = 2460 (0.436), 
+odor: n = 2776 (0.492), f = 1584 (0.281), a = 400 (0.071), l = 400 (0.071), p = 256 (0.045), c = 192 (0.034), m = 36 (0.006), 
+gill-attachment: f = 5626 (0.997), a = 18 (0.003), 
+gill-spacing: c = 4620 (0.819), w = 1024 (0.181), 
+gill-size: b = 4940 (0.875), n = 704 (0.125), 
+gill-color: p = 1384 (0.245), n = 984 (0.174), w = 966 (0.171), h = 720 (0.128), g = 656 (0.116), u = 480 (0.085), k = 408 (0.072), r = 24 (0.004), y = 22 (0.004), 
+stalk-shape: t = 2880 (0.510), e = 2764 (0.490), 
+stalk-root: b = 3776 (0.669), e = 1120 (0.198), c = 556 (0.099), r = 192 (0.034), 
+stalk-surface-above-ring: s = 3736 (0.662), k = 1332 (0.236), f = 552 (0.098), y = 24 (0.004), 
+stalk-surface-below-ring: s = 3544 (0.628), k = 1296 (0.230), f = 552 (0.098), y = 252 (0.045), 
+stalk-color-above-ring: w = 3136 (0.556), p = 1008 (0.179), g = 576 (0.102), n = 448 (0.079), b = 432 (0.077), c = 36 (0.006), y = 8 (0.001), 
+stalk-color-below-ring: w = 3088 (0.547), p = 1008 (0.179), g = 576 (0.102), n = 496 (0.088), b = 432 (0.077), c = 36 (0.006), y = 8 (0.001), 
+veil-type: p = 5644 (1.000), 
+veil-color: w = 5636 (0.999), y = 8 (0.001), 
+ring-number: o = 5488 (0.972), t = 120 (0.021), n = 36 (0.006), 
+ring-type: p = 3488 (0.618), l = 1296 (0.230), e = 824 (0.146), n = 36 (0.006), 
+spore-print-color: n = 1920 (0.340), k = 1872 (0.332), h = 1584 (0.281), w = 148 (0.026), r = 72 (0.013), u = 48 (0.009), 
+population: v = 2160 (0.383), y = 1688 (0.299), s = 1104 (0.196), a = 384 (0.068), n = 256 (0.045), c = 52 (0.009), 
+habitat: d = 2492 (0.442), g = 1860 (0.330), p = 568 (0.101), u = 368 (0.065), m = 292 (0.052), l = 64 (0.011), 
 
 
@@ -4811,14 +5693,30 @@

Exercise 5: defin

+
+
+
+

4. Using Python to Build and Use a Simple Decision Tree Classifier

+
+
+
+
+
+

Decision Trees

+
+
+
+
+
+

Wikipedia offers the following description of a decision tree (with italics added to emphasize terms that will be elaborated below):

@@ -4842,10 +5740,22 @@

Decision Trees +
+
+
+

+
+
+
+

When building a supervised classification model, the frequency distribution of attribute values is a potentially important factor in determining the relative importance of each attribute at various stages in the model building process.

In data modeling, we can use frequency distributions to compute entropy, a measure of disorder (impurity) in a set.

@@ -4854,28 +5764,35 @@

Entropy

\(entropy(S) = - p_1 \log_2 (p_1) - p_2 \log_2 (p_2)\)

where \(p_i\) is the proportion (relative frequency) of class i within the set S.

From the output above, we know that the proportion of clean_instances that are labeled 'e' (class edible) in the UCI dataset is \(3488 \div 5644 = 0.618\), and the proportion labeled 'p' (class poisonous) is \(2156 \div 5644 = 0.382\).
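As a worked example, we can plug these two proportions into the formula. The first line below is the standard generalization of the two-class formula to k classes, stated here as background rather than taken from this notebook; the remaining lines use the rounded proportions, so the result is approximate.

```latex
\begin{aligned}
entropy(S) &= - \sum_{i=1}^{k} p_i \log_2 (p_i) \\
entropy(clean\_instances) &\approx - 0.618 \log_2 (0.618) - 0.382 \log_2 (0.382) \\
&\approx 0.429 + 0.530 \\
&\approx 0.959
\end{aligned}
```

A value this close to 1 (the maximum for two classes) indicates that the edible and poisonous classes are fairly evenly mixed.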

-

After importing the Python math module, we can use the math.log(x[, base]) function in computing the entropy of the clean_instances of the UCI mushroom data set as follows:

+

After importing the Python math module, we can use the math.log(x[, base]) function in computing the entropy of the clean_instances of the UCI mushroom data set as follows.

+

Note that you can use a backslash character (\) at the end of a line to continue the statement on the next line (this should generally be used sparingly).

+
-
-
+
+
+
-In [73]: +In [75]:
-
+
+
import math
-entropy = - (3488 / 5644.0) * math.log(3488 / 5644.0, 2) - (2156 / 5644.0) * math.log(2156 / 5644.0, 2)
-print entropy
+
+entropy = - (3488 / 5644.0) * math.log(3488 / 5644.0, 2) \
+    - (2156 / 5644.0) * math.log(2156 / 5644.0, 2)
+print(entropy)
 
+
-
-
+
+
-
-
+
+
 0.959441337353
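The hard-coded calculation above can be generalized. The following is a minimal sketch of a helper that computes entropy from a list of class counts; the name entropy_from_counts is made up for this illustration and is not part of simple_ml, and it takes counts rather than instances, so it is not a solution to the exercise that follows.

```python
import math

def entropy_from_counts(class_counts):
    '''Returns the entropy (in bits) of a set described by its class counts.

    class_counts is a list of non-negative counts, one per class; classes
    with a count of zero contribute nothing to the entropy.
    '''
    total = float(sum(class_counts))  # float() avoids integer division in Python 2
    proportions = [count / total for count in class_counts if count > 0]
    return -sum(p * math.log(p, 2) for p in proportions)

# 3488 edible and 2156 poisonous mushrooms among the 5644 clean instances
mushroom_entropy = entropy_from_counts([3488, 2156])
print(mushroom_entropy)
```

Skipping zero counts matters because math.log(0, 2) raises a ValueError; by convention, a class with no members contributes zero entropy.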
 
@@ -4887,35 +5804,49 @@ 

Entropy

+
+
+
+

Exercise 6: define entropy()

+
+
+
+
+
+

Define a function, entropy(instances), that computes the entropy of instances. You may assume the class label is in position 0; we will later see how to specify default parameter values in function definitions.

[Note: the class label in many data files is the last rather than the first item on each line.]

-
-
+
+
+
+
-In [74]: +In [76]:
-
+
+
# your function definition here
 
 # delete 'simple_ml.' below to test your function
-print simple_ml.entropy(clean_instances)
+print(simple_ml.entropy(clean_instances))
 
+
-
-
+
+
-
-
+
+
 0.959441337353
 
@@ -4927,10 +5858,20 @@ 

Exercise 6: define entropy() +
+
+
+

+
+
+
+
-
-
+
+
+
+
-In [75]: +In [77]:
-
-
print 'Information gain for different attributes:\n'
+
+
+
print('Information gain for different attributes:', end='\n\n')
 for i in range(1, len(attribute_names)):
-    print '{:5.3f}  {:2} {}'.format(simple_ml.information_gain(clean_instances, i), i, attribute_names[i])
+    print('{:5.3f}  {:2} {}'.format(
+        simple_ml.information_gain(clean_instances, i), i, attribute_names[i]))
 
+
-
-
+
+
-
-
+
+
 Information gain for different attributes:
 
@@ -4997,37 +5943,46 @@ 

Information Gain +
+
+

We can sort the attributes in decreasing order of information gain.

-
-
+
+
+
+
-In [76]: +In [78]:
-
-
print 'Information gain for different attributes:\n'
-sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) for i in range(1, len(attribute_names))], 
+
+
+
print('Information gain for different attributes:', end='\n\n')
+sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) \
+                                          for i in range(1, len(attribute_names))], 
                                          reverse=True)
-print sorted_information_gain_indexes, '\n'
+print(sorted_information_gain_indexes, end='\n\n')
 
 for gain, i in sorted_information_gain_indexes:
-    print '{:5.3f}  {:2} {}'.format(gain, i, attribute_names[i])
+    print('{:5.3f}  {:2} {}'.format(gain, i, attribute_names[i]))
 
+
-
-
+
+
-
-
+
+
 Information gain for different attributes:
 
-[(0.8596704358849709, 5), (0.5828694793608379, 20), (0.46290566555455265, 19), (0.42456477093655975, 12), (0.40865780788318695, 13), (0.3062989793570199, 14), (0.27891994708759504, 15), (0.2750355212178639, 10), (0.2127971869976022, 9), (0.19495343617580085, 3), (0.1400386042032834, 4), (0.1097880400299237, 21), (0.10067585994181227, 22), (0.09733858997769329, 11), (0.05836192763098613, 7), (0.03242975884332899, 8), (0.01740692300090696, 1), (0.01205967443646827, 18), (0.004572013423856602, 2), (0.0044397141315495325, 6), (0.0019702590992403124, 17), (0.0, 16)] 
+[(0.8596704358849709, 5), (0.5828694793608379, 20), (0.46290566555455265, 19), (0.42456477093655975, 12), (0.40865780788318695, 13), (0.3062989793570199, 14), (0.27891994708759504, 15), (0.2750355212178639, 10), (0.2127971869976022, 9), (0.19495343617580085, 3), (0.1400386042032834, 4), (0.1097880400299237, 21), (0.10067585994181227, 22), (0.09733858997769329, 11), (0.05836192763098613, 7), (0.03242975884332899, 8), (0.01740692300090696, 1), (0.01205967443646827, 18), (0.004572013423856602, 2), (0.0044397141315495325, 6), (0.0019702590992403124, 17), (0.0, 16)]
 
 0.860   5 odor
 0.583  20 spore-print-color
@@ -5060,16 +6015,23 @@ 

Information Gain +
+
+

The following variation does not use a list comprehension:

-
-
+
+
+
+
-In [77]: +In [79]:
-
- -
-
+
+
-
-
+
+
+
+
+
+

Define a function, information_gain(instances, i), that returns the information gain achieved by selecting the ith attribute to split instances. It should exhibit the same behavior as the simple_ml version of the function.

-
-
+
+
+
+
-In [78]: +In [80]:
-
+
+
# your definition of information_gain(instances, i) here
 
 # delete 'simple_ml.' below to test your function
 sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) for i in range(1, len(attribute_names))], 
                                          reverse=True)
 
-print 'Information gain for different attributes:\n'
+print('Information gain for different attributes:', end='\n\n')
 for gain, i in sorted_information_gain_indexes:
-    print '{:5.3f}  {:2} {}'.format(gain, i, attribute_names[i])
+    print('{:5.3f}  {:2} {}'.format(gain, i, attribute_names[i]))
 
+
-
-
+
+
-
-
+
+
 Information gain for different attributes:
 
@@ -5195,10 +6172,20 @@ 

Exercise 7: define information_ga

+
+
+
+

Building a Simple Decision Tree

+
+
+
+
+
+

We will implement a modified version of the ID3 algorithm for building a simple decision tree.

ID3 (Examples, Target_Attribute, Attributes)
@@ -5219,17 +6206,26 @@ 

Building a Simple Decision Tree

+
+
+
+
+
+

In building a decision tree, we will need to split the instances based on the index of the best attribute, i.e., the attribute that offers the highest information gain. We will use separate utility functions to handle these subtasks. To simplify the functions, we will rely exclusively on attribute indexes rather than attribute names.

Note: the algorithm above is recursive, i.e., there is a recursive call to ID3 within the definition of ID3. Covering recursion is beyond the scope of this primer, but there are a number of other resources on using recursion in Python. Familiarity with recursion will be important for understanding both the tree construction and classification functions below.

First, we will define a function to split a set of instances based on any attribute. This function will return a dictionary in which each key is a distinct value of the attribute at the specified attribute_index, and each value is a list representing the subset of instances that have that attribute value.

-
-
+
+
+
+
-In [79]: +In [81]:
-
+
+
def split_instances(instances, attribute_index):
     '''Returns a list of dictionaries, splitting a list of instances according to their values of a specified attribute''
     
@@ -5241,18 +6237,19 @@ 

Building a Simple Decision Treereturn partitions partitions = split_instances(clean_instances, 5) -print [(partition, len(partitions[partition])) for partition in partitions] +print([(partition, len(partitions[partition])) for partition in partitions])

+
-
-
+
+
-
-
+
+
+
+
+
+

Now that we can split instances based on a particular attribute, we would like to be able to choose the best attribute with which to split the instances, where best is defined as the attribute that provides the greatest information gain if instances were split based on that attribute. We will want to restrict the candidate attributes so that we don't bother trying to split on an attribute that was used higher up in the decision tree (or use the target attribute as a candidate).

+
+
+
+
+
+
+

Define a function, choose_best_attribute_index(instances, candidate_attribute_indexes), that returns the index in the list of candidate_attribute_indexes that provides the highest information gain if instances are split based on that attribute index.

-
-
+
+
+
+
-In [80]: +In [82]:
-
+
+
# your function here
 
 # delete 'simple_ml.' below to test your function:
-print 'Best attribute index:', simple_ml.choose_best_attribute_index(clean_instances, range(1, len(attribute_names)))
+print('Best attribute index:', 
+      simple_ml.choose_best_attribute_index(clean_instances, range(1, len(attribute_names))))
 
+
-
-
+
+
-
-
+
+
 Best attribute index: 5
 
@@ -5306,6 +6324,10 @@ 

Exercise 8: define cho

+
+
+
+

A leaf node in a decision tree represents the most frequently occurring - or majority - class value for that path through the tree. We will need a function that determines the majority value for the class index among a set of instances.

We earlier saw how the defaultdict container in the collections module can be used to simplify the construction of a dictionary containing the counts of all attribute values for all attributes, by automatically setting the count for any attribute value to zero when the attribute value is first added to the dictionary.

@@ -5313,30 +6335,34 @@

Exercise 8: define cho

This container has an additional method, most_common([n]), which returns a list of 2-element tuples representing the values and their associated counts for the most common n values; if n is omitted, the method returns all tuples.

The following is an example of how we can use a Counter to represent the frequency of different class labels, and how we can identify the most frequent value and its count.

-
-
+
+
+
+
-In [81]: +In [83]:
-
+
+
from collections import Counter
 
 class_counts = Counter([instance[0] for instance in clean_instances])
-print 'class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(
+print('class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(
     class_counts, # the Counter object
     class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count
-    class_counts.most_common(1)[0][0]) # the most common value (1st element in that tuple)
+    class_counts.most_common(1)[0][0])) # the most common value (1st element in that tuple)
 
+
-
-
+
+
-
-
+
+
 class_counts: Counter({'e': 3488, 'p': 2156}); most_common(1): [('e', 3488)], most_common(1)[0][0]: e
 
@@ -5348,35 +6374,43 @@ 

Exercise 8: define cho

+
+
+
+

The following variation does not use a list comprehension:

-
-
+
+
+
+
-In [82]: +In [84]:
-
+
+
class_values = []
 for instance in clean_instances:
     class_values.append(instance[0])
     
 class_counts = Counter(class_values)
-print 'class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(
+print ('class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(
     class_counts, # the Counter object
     class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count
-    class_counts.most_common(1)[0][0]) # the most common value (1st element in that tuple)
+    class_counts.most_common(1)[0][0])) # the most common value (1st element in that tuple)
 
+
-
-
+
+
-
-
+
+
 class_counts: Counter({'e': 3488, 'p': 2156}); most_common(1): [('e', 3488)], most_common(1)[0][0]: e
 
@@ -5388,35 +6422,49 @@ 

Exercise 8: define cho

+
+
+
+

Before putting all this together to define a decision tree construction function, it may be helpful to cover a few additional aspects of Python the function will utilize.

+
+
+
+
+
+

Python offers a very flexible mechanism for the testing of truth values: in an if condition, any null object, zero-valued numerical expression or empty container (string, list, dictionary or tuple) is interpreted as False (i.e., not True):

-
-
+
+
+
+
-In [83]: +In [85]:
-
+
+
for x in [False, None, 0, 0.0, "", [], {}, ()]:
-    print '"{}" is'.format(x),
+    print('"{}" is'.format(x), end=' ')
     if x:
-        print True
+        print(True)
     else:
-        print False
+        print(False)
 
+
-
-
+
+
-
-
+
+
 "False" is False
 "None" is False
@@ -5435,28 +6483,36 @@ 

Exercise 8: define cho

+
+
+
+

Python also offers a conditional expression (ternary operator) that allows the functionality of an if/else statement that returns a value to be implemented as an expression. For example, the if/else statement in the code above could be implemented as a conditional expression as follows:

-
-
+
+
+
+
-In [84]: +In [86]:
-
+
+
for x in [False, None, 0, 0.0, "", [], {}, ()]:
-    print '"{}" is {}'.format(x, True if x else False) # using conditional expression as second argument to format()
+    print('"{}" is {}'.format(x, True if x else False)) # using conditional expression as second argument to format()
 
+
-
-
+
+
-
-
+
+
 "False" is False
 "None" is False
@@ -5475,6 +6531,10 @@ 

Exercise 8: define cho

+
+
+
+

Python function definitions can specify default parameter values indicating the value those parameters will have if no argument is explicitly provided when the function is called. Arguments can also be passed using keyword parameters indicating which parameter will be assigned a specific argument value (which may or may not correspond to the order in which the parameters are defined).

The Python Tutorial page on default parameters includes the following warning:

@@ -5483,15 +6543,18 @@

Exercise 8: define cho

Thus it is generally better to use the Python null object, None, rather than an empty list ([]), dict ({}) or other mutable data structure when specifying default parameter values for any of those data types.
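To make the pitfall concrete, here is a minimal sketch (not from the original notebook) contrasting a mutable default with a None default. The function names append_shared and append_fresh are hypothetical.

```python
def append_shared(item, items=[]):
    '''Pitfall: the default list is created once and shared across calls'''
    items.append(item)
    return items

def append_fresh(item, items=None):
    '''Safer: create a new list on each call when none is supplied'''
    if items is None:
        items = []
    items.append(item)
    return items

first_shared = list(append_shared(1))   # copy the result so later calls cannot alter it
second_shared = list(append_shared(2))  # the default list still holds 1 from the first call
first_fresh = append_fresh(1)
second_fresh = append_fresh(2)          # a new list, unaffected by the earlier call

print('shared default: {} then {}'.format(first_shared, second_shared))
print('None default:   {} then {}'.format(first_fresh, second_fresh))
```

The shared version accumulates items across calls because the default value is evaluated only once, when the function is defined.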

-
-
+
+
+
+
-In [85]: +In [87]:
-
+
+
def parameter_test(parameter1=None, parameter2=None):
     '''Prints the values of parameter1 and parameter2'''
-    print 'parameter1: {}; parameter2: {}'.format(parameter1, parameter2)
+    print('parameter1: {}; parameter2: {}'.format(parameter1, parameter2))
     
 parameter_test() # no args are required
 parameter_test(1) # if any args are provided, 1st arg gets assigned to parameter1
@@ -5501,15 +6564,16 @@ 

Exercise 8: define cho
 parameter_test(parameter2=2, parameter1=1) # can use keywords for either arg, in either order

+
-
-
+
+
-
-
+
+
 parameter1: None; parameter2: None
 parameter1: 1; parameter2: None
@@ -5526,37 +6590,54 @@ 

Exercise 8: define cho

+
+
+
+

Exercise 9: define majority_value()

+
+
+
+
+
+

Define a function, majority_value(instances, class_index), that returns the most frequently occurring value of class_index in instances. The class_index parameter should be optional, and have a default value of 0 (zero).

-
-
+
+
+
+
-In [86]: +In [88]:
-
+
+
# your definition of majority_value(instances) here
 
 # delete 'simple_ml.' below to test your function:
-print 'Majority value of index {}: {}'.format(0, simple_ml.majority_value(clean_instances)) # note: relying on default parameter here
+print('Majority value of index {}: {}'.format(
+    0, simple_ml.majority_value(clean_instances))) # note: relying on default parameter here
 # although there is only one class_index for the dataset, we'll test it by providing non-default values
-print 'Majority value of index {}: {}'.format(1, simple_ml.majority_value(clean_instances, 1)) # using an optional 2nd argument
-print 'Majority value of index {}: {}'.format(2, simple_ml.majority_value(clean_instances, class_index=2)) # using a keyword
+print('Majority value of index {}: {}'.format(
+    1, simple_ml.majority_value(clean_instances, 1))) # supplying an optional 2nd argument
+print('Majority value of index {}: {}'.format(
+    2, simple_ml.majority_value(clean_instances, class_index=2))) # supplying argument as a keyword
 
+
-
-
+
+
-
-
+
+
 Majority value of index 0: e
 Majority value of index 1: x
@@ -5570,16 +6651,23 @@ 

Exercise 9: define majority_value()

+
+
+
+

The recursive create_decision_tree() function below uses an optional parameter, class_index, which defaults to 0. This is to accommodate other datasets in which the class label is the last element on each line (which would be most easily specified by using a -1 value). Most data files in the UCI Machine Learning Repository have the class labels as either the first element or the last element.

To show how the decision tree is being built, an optional trace parameter, when non-zero, will generate some trace information as the tree is constructed. The indentation level is incremented with each recursive call via the use of the conditional expression (ternary operator), trace + 1 if trace else 0.

-
-
+
+
+
+
-In [87]: +In [89]:
-
+
+
def create_decision_tree(instances, candidate_attribute_indexes=None, class_index=0, default_class=None, trace=0):
     '''Returns a new decision tree trained on a list of instances.
     
@@ -5604,22 +6692,24 @@ 

Exercise 9: define majority_value()
     # If the dataset is empty or the candidate attributes list is empty, return the default value
     if not instances or not candidate_attribute_indexes:
         if trace:
-            print '{}Using default class {}'.format('< ' * trace, default_class)
+            print('{}Using default class {}'.format('< ' * trace, default_class))
         return default_class
     # If all the instances have the same class label, return that class label
     elif len(class_labels_and_counts) == 1:
         class_label = class_labels_and_counts.most_common(1)[0][0]
         if trace:
-            print '{}All {} instances have label {}'.format('< ' * trace, len(instances), class_label)
+            print('{}All {} instances have label {}'.format(
+                '< ' * trace, len(instances), class_label))
         return class_label
     else:
         default_class = simple_ml.majority_value(instances, class_index)
         # Choose the next best attribute index to best classify the instances
-        best_index = simple_ml.choose_best_attribute_index(instances, candidate_attribute_indexes, class_index)
+        best_index = simple_ml.choose_best_attribute_index(
+            instances, candidate_attribute_indexes, class_index)
         if trace:
-            print '{}Creating tree node for attribute index {}'.format('> ' * trace, best_index)
+            print('{}Creating tree node for attribute index {}'.format('> ' * trace, best_index))
         # Create a new decision tree node with the best attribute index and an empty dictionary object (for now)
         tree = {best_index:{}}
@@ -5631,13 +6721,13 @@

Exercise 9: define majority_value()
         remaining_candidate_attribute_indexes = [i for i in candidate_attribute_indexes if i != best_index]
         for attribute_value in partitions:
             if trace:
-                print '{}Creating subtree for value {} ({}, {}, {}, {})'.format(
+                print('{}Creating subtree for value {} ({}, {}, {}, {})'.format(
                     '> ' * trace,
                     attribute_value,
                     len(partitions[attribute_value]),
                     len(remaining_candidate_attribute_indexes),
                     class_index,
-                    default_class)
+                    default_class))
             # Create a subtree for each value of the best attribute
             subtree = create_decision_tree(
@@ -5656,18 +6746,19 @@

Exercise 9: define majority_value()
 training_instances = clean_instances[:-20]
 testing_instances = clean_instances[-20:]
 tree = create_decision_tree(training_instances, trace=1) # remove trace=1 to turn off tracing
-print tree
+print(tree)

+
-
-
+
+
-
-
+
+
 > Creating tree node for attribute index 5
 > Creating subtree for value a (400, 21, 0, e)
@@ -5708,6 +6799,10 @@ 

Exercise 9: define majority_value()

+
+
+
+

The structure of the tree shown above is rather difficult to discern from the normal printed representation of a dictionary.

The Python pprint module has a number of useful methods for pretty-printing or formatting objects in a more human readable way.

@@ -5719,25 +6814,30 @@

Exercise 9: define majority_value()
import pprint

so that we can still access the pprint() function in the pprint module (since defining pprint() in the current namespace would otherwise override the imported definition of the function).

-
-
+
+
+
+
-In [88]: +In [90]:
-
+
+
from pprint import pprint
+
 pprint(tree)
 
+
-
-
+
+
-
-
+
+
 {5: {'a': 'e',
      'c': 'p',
@@ -5758,21 +6858,34 @@ 

Exercise 9: define majority_value()

+
+
+
+

Classifying Instances with a Simple Decision Tree

+
+
+
+
+
+

Usually, when we construct a decision tree based on a set of training instances, we do so with the intent of using that tree to classify a set of one or more testing instances.

We will define a function, classify(tree, instance, default_class=None), to use a decision tree to classify a single instance, where an optional default_class can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.

We will use a design pattern in which we will use a series of if statements, each of which returns a value if the condition is true, rather than a nested series of if, elif and/or else clauses, as it helps constrain the levels of indentation in the function.
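The pattern can be sketched on a toy tree as follows. This is an illustration under stated assumptions, not the notebook's own definition: classify_sketch and toy_tree are hypothetical names, and the tree is assumed to be a nested dictionary of the form {attribute_index: {attribute_value: subtree_or_label}}, matching the pretty-printed trees shown earlier.

```python
def classify_sketch(tree, instance, default_class=None):
    '''Returns a label for instance by walking a nested-dictionary decision tree'''
    if not tree:                                  # empty tree: nothing to consult
        return default_class
    if not isinstance(tree, dict):                # a leaf: the tree is itself a class label
        return tree
    attribute_index = list(tree.keys())[0]        # each node holds exactly one attribute index
    attribute_values = tree[attribute_index]      # maps attribute values to subtrees or labels
    instance_value = instance[attribute_index]
    if instance_value not in attribute_values:    # value never seen during training
        return default_class
    # Otherwise descend into the matching subtree
    return classify_sketch(attribute_values[instance_value], instance, default_class)

toy_tree = {1: {'x': 'e', 'y': {2: {'a': 'p', 'b': 'e'}}}}
print(classify_sketch(toy_tree, ['?', 'x', 'a']))
print(classify_sketch(toy_tree, ['?', 'y', 'a']))
print(classify_sketch(toy_tree, ['?', 'z', 'a'], default_class='e'))
```

Because every if statement returns, the body never needs more than one level of indentation beyond the function itself.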

-
-
+
+
+
+
-In [89]: +In [91]:
-
+
+
def classify(tree, instance, default_class=None):
     '''Returns a classification label for instance, given a decision tree'''
     if not tree:
@@ -5789,18 +6902,19 @@ 

Classifying Instances
 for instance in testing_instances:
     predicted_label = classify(tree, instance)
     actual_label = instance[0]
-    print 'predicted: {}; actual: {}'.format(predicted_label, actual_label)
+    print('predicted: {}; actual: {}'.format(predicted_label, actual_label))

+
-
-
+
+
-
-
+
+
 predicted: p; actual: p
 predicted: p; actual: p
@@ -5831,10 +6945,20 @@ 

Classifying Instances

+
+
+
+

Evaluating the Accuracy of a Simple Decision Tree

+
+
+
+
+
+

It is often helpful to evaluate the performance of a model using a dataset not used in the training of that model. In the simple example shown above, we used all but the last 20 instances to train a simple decision tree, then classified those last 20 instances using the tree.

The advantage of this training/testing split is that visual inspection of the classifications (sometimes called predictions) is relatively straightforward, revealing that all 20 instances were correctly classified.

@@ -5842,12 +6966,15 @@

Evaluating the Accura

The accuracy of the model above, given the set of 20 testing instances, is 100% (20/20).

The function below calculates the classification accuracy of a tree over a set of testing_instances (with an optional class_index parameter indicating the position of the class label in each instance).

-
-
+
+
+
+
-In [90]: +In [92]:
-
+
+
def classification_accuracy(tree, testing_instances, class_index=0, default_class=None):
     '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index'''
     num_correct = 0
@@ -5856,20 +6983,21 @@ 

Evaluating the Accura
         actual_value = testing_instances[i][class_index]
         if prediction == actual_value:
             num_correct += 1
-    return float(num_correct) / len(testing_instances)
+    return num_correct / len(testing_instances)
-print classification_accuracy(tree, testing_instances)
+print(classification_accuracy(tree, testing_instances))

+
-
-
+
+
-
-
+
+
 1.0
 
@@ -5881,41 +7009,53 @@ 

Evaluating the Accura

+
+
+
+

The zip([iterable, ...]) function combines 2 or more sequences or iterables; in Python 2, the function returns a list of tuples (in Python 3, an iterator over tuples), where the ith tuple contains the ith element from each of the argument sequences or iterables.
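Code that needs an actual list under either Python 2 or Python 3 can wrap the call in list(), since Python 3's zip() produces its tuples lazily. A minimal sketch (the variable names are ours):

```python
pairs = zip([0, 1, 2], ['a', 'b', 'c'])
pairs_list = list(pairs)  # a plain copy in Python 2; materializes the iterator in Python 3
print(pairs_list)
```

Iterating directly over zip() in a for loop or list comprehension, as the functions in this section do, works the same way in both versions.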

-
-
+
+
+
+
-In [91]: +In [93]:
-
+
+
zip([0, 1, 2], ['a', 'b', 'c'])
 
+
-
-
+
+
-
- Out[91]:
-
+
+ Out[93]:
+
 [(0, 'a'), (1, 'b'), (2, 'c')]
 
-
+
+
+
+
+

We can use list comprehensions, the Counter class and the zip() function to modify classification_accuracy() so that it returns a packed tuple with

    @@ -5924,31 +7064,35 @@

    Evaluating the Accura
  • the percentage of instances correctly classified
-
-
+
+
+
+
-In [92]: +In [94]:
-
+
+
def classification_accuracy(tree, instances, class_index=0, default_class=None):
     '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index'''
     predicted_labels = [classify(tree, instance, default_class) for instance in instances]
     actual_labels = [x[class_index] for x in instances]
     counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])
-    return counts[True], counts[False], float(counts[True]) / len(instances)
+    return counts[True], counts[False], counts[True] / len(instances)
 
-print classification_accuracy(tree, testing_instances)
+print(classification_accuracy(tree, testing_instances))
 
+
-
-
+
+
-
-
+
+
 (20, 0, 1.0)
 
@@ -5960,54 +7104,71 @@ 

Evaluating the Accura

+
+
+
+

We sometimes want to partition the instances into subsets of equal sizes to measure performance. One metric this partitioning allows us to compute is a learning curve, i.e., an assessment of how well the model performs as a function of the size of its training set. Another use of these partitions (aka folds) would be to conduct an n-fold cross validation evaluation.

The following function, partition_instances(instances, num_partitions), partitions a set of instances into num_partitions relatively equally sized subsets.

We'll use this as yet another opportunity to demonstrate the power of using list comprehensions, this time, to condense the use of nested for loops.

-
-
+
+
+
+
-In [93]: +In [95]:
-
+
+
def partition_instances(instances, num_partitions):
     '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''
-    return [[instances[j] for j in xrange(i, len(instances), num_partitions)] for i in xrange(num_partitions)]
+    return [[instances[j] for j in xrange(i, len(instances), num_partitions)] \
+            for i in xrange(num_partitions)]
 
+
+
+
+
+

Before testing this function on the 5644 clean_instances from the UCI mushroom dataset, let's create a small number of simplified instances to verify that the function has the desired behavior.

-
-
+
+
+
+
-In [94]: +In [96]:
-
+
+
instance_length = 3
 num_instances = 5
 
 simplified_instances = [[j for j in xrange(i, instance_length + i)] for i in xrange(num_instances)]
 
-print 'Instances:', simplified_instances
+print('Instances:', simplified_instances)
 partitions = partition_instances(simplified_instances, 2)
-print 'Partitions:', partitions
+print('Partitions:', partitions)
 
+
-
-
+
+
-
-
+
+
 Instances: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
 Partitions: [[[0, 1, 2], [2, 3, 4], [4, 5, 6]], [[1, 2, 3], [3, 4, 5]]]
@@ -6020,15 +7181,22 @@ 

Evaluating the Accura

+
+
+
+

The following variations do not use list comprehensions.

-
-
+
+
+
+
-In [95]: +In [97]:
-
+
+
def partition_instances(instances, num_partitions):
     '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''
     partitions = []
@@ -6047,20 +7215,21 @@ 

Evaluating the Accura
         new_instance.append(j)
     simplified_instances.append(new_instance)
-print 'Instances:', simplified_instances
+print('Instances:', simplified_instances)
 partitions = partition_instances(simplified_instances, 2)
-print 'Partitions:', partitions
+print('Partitions:', partitions)

+
-
-
+
+
-
-
+
+
 Instances: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
 Partitions: [[[0, 1, 2], [2, 3, 4], [4, 5, 6]], [[1, 2, 3], [3, 4, 5]]]
@@ -6073,28 +7242,36 @@ 

Evaluating the Accura

+
+
+
+

The enumerate(sequence, start=0) function creates an iterator that successively returns the index and value of each element in a sequence, beginning at the start index.
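The optional start argument only changes the first index reported; it does not skip any elements. A small sketch using a non-default start (the labels list is a hypothetical name for illustration):

```python
labels = []
for i, x in enumerate(['a', 'b', 'c'], start=1):
    labels.append('{}: {}'.format(i, x))  # i counts from 1; x still begins with 'a'
print(labels)
```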

-
-
+
+
+
+
-In [96]: +In [98]:
-
+
+
for i, x in enumerate(['a', 'b', 'c']):
-    print i, x
+    print(i, x)
 
+
-
-
+
+
-
-
+
+
 0 a
 1 b
@@ -6108,30 +7285,38 @@ 

Evaluating the Accura

+
+
+
+

We can use enumerate() to facilitate slightly more rigorous testing of our partition_instances function on our simplified_instances.

-
-
+
+
+
+
-In [97]: +In [99]:
-
+
+
for i in xrange(5):
-    print '\n# partitions:', i
+    print('\n# partitions:', i)
     for j, partition in enumerate(partition_instances(simplified_instances, i)):
-        print 'partition {}: {}'.format(j, partition)
+        print('partition {}: {}'.format(j, partition))
 
+
-
-
+
+
-
-
+
+
 
 # partitions: 0
@@ -6162,28 +7347,36 @@ 

Evaluating the Accura

+
+
+
+

Returning our attention to the UCI mushroom dataset, the following will partition our clean_instances into 10 relatively equally sized disjoint subsets. We will use a list comprehension to print out the length of each partition.

-
-
+
+
+
+
-In [98]: +In [100]:
-
+
+
partitions = partition_instances(clean_instances, 10)
-print [len(partition) for partition in partitions]
+print([len(partition) for partition in partitions])
 
+
-
-
+
+
-
-
+
+
 [565, 565, 565, 565, 564, 564, 564, 564, 564, 564]
 
@@ -6195,31 +7388,39 @@ 

Evaluating the Accura

+
+
+
+

The following variation does not use a list comprehension.

-
-
+
+
+
+
-In [99]: +In [101]:
-
+
+
for partition in partitions:
-    print len(partition),  # note the comma at the end
-print
+    print(len(partition), end=' ')  # end=' ' replaces the Python 2 trailing comma
+print()
 
+
-
-
+
+
-
-
+
+
-565 565 565 565 564 564 564 564 564 564
+565 565 565 565 564 564 564 564 564 564 
 
 
@@ -6229,37 +7430,47 @@

Evaluating the Accura

+
+
+
+

The following shows the different trees that are constructed based on partition 0 (first 10th) of clean_instances, partitions 0 and 1 (first 2/10ths) of clean_instances and all clean_instances.

-
-
+
+
+
+
-In [100]: +In [102]:
-
+
+
tree0 = create_decision_tree(partitions[0])
-print 'Tree trained with {} instances:'.format(len(partitions[0]))
+print('Tree trained with {} instances:'.format(len(partitions[0])))
 pprint(tree0)
+print()
 
 tree1 = create_decision_tree(partitions[0] + partitions[1])
-print '\nTree trained with {} instances:'.format(len(partitions[0] + partitions[1]))
+print('Tree trained with {} instances:'.format(len(partitions[0] + partitions[1])))
 pprint(tree1)
+print()
 
 tree = create_decision_tree(clean_instances)
-print '\nTree trained with {} instances:'.format(len(clean_instances))
+print('Tree trained with {} instances:'.format(len(clean_instances)))
 pprint(tree)
 
+
-
-
+
+
-
-
+
+
 Tree trained with 565 instances:
 {5: {'a': 'e',
@@ -6302,35 +7513,49 @@ 

Evaluating the Accura

+
+
+
+

The only difference between the first two trees - tree0 and tree1 - is that in the first tree, instances with no odor (attribute index 5 is 'n') and a spore-print-color of white (attribute 20 = 'w') are classified as edible ('e'). With additional training data in the 2nd partition, an additional distinction is made such that instances with no odor, a white spore-print-color and a clustered population (attribute 21 = 'c') are classified as poisonous ('p'), while all other instances with no odor and a white spore-print-color (and any other value for the population attribute) are classified as edible ('e').

Note that there is no difference between tree1 and tree (the tree trained with all instances). This early convergence on an optimal model is uncommon on most datasets (outside the UCI repository).

+
+
+
+
+
+

Now that we can partition our instances into subsets, we can use these subsets to construct different-sized training sets in the process of computing a learning curve.

We will start off with an initial training set consisting only of the first partition, and then progressively extend that training set by adding a new partition during each iteration of computing the learning curve.

The list.extend(L) method enables us to extend list by appending all the items in another list, L, to the end of list.

-
-
+
+
+
+
-In [101]: +In [103]:
-
+
+
x = [1, 2, 3]
 x.extend([4, 5])
-print x
+print(x)
 
+
-
-
+
+
-
-
+
+
 [1, 2, 3, 4, 5]
 
@@ -6342,16 +7567,23 @@ 

Evaluating the Accura

+
+
+
+

We can now define the function, compute_learning_curve(instances, num_partitions=10), that will take a list of instances, partition it into num_partitions relatively equally sized disjoint partitions, and then iteratively evaluate the accuracy of models trained with an incrementally increasing combination of instances in the first num_partitions - 1 partitions then tested with instances in the last partition. That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on.

The function will return a list of num_partitions - 1 tuples representing the size of the training set and the accuracy of a tree trained with that set (and tested on the last partition). This will provide some indication of the relative impact of the size of the training set on model performance.
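The overall loop can be sketched independently of the decision tree code. The version below is a simplified illustration under stated assumptions, not the notebook's compute_learning_curve(): the train and accuracy arguments are hypothetical callables, the toy model simply predicts the majority class, and each result is a (training size, accuracy) pair rather than the richer tuple returned by classification_accuracy().

```python
from collections import Counter

def learning_curve_sketch(instances, train, accuracy, num_partitions=10):
    '''Returns (training set size, accuracy) pairs for incrementally larger training sets'''
    # Interleaved partitioning: partition i takes every num_partitions-th instance from i
    partitions = [instances[i::num_partitions] for i in range(num_partitions)]
    testing_instances = partitions[-1]          # hold out the last partition for testing
    training_instances = []
    curve = []
    for partition in partitions[:-1]:
        training_instances.extend(partition)    # grow the training set by one partition
        model = train(training_instances)
        curve.append((len(training_instances), accuracy(model, testing_instances)))
    return curve

def train_majority(instances):
    '''Toy "model": the most common class label (index 0 of each instance)'''
    return Counter(instance[0] for instance in instances).most_common(1)[0][0]

def accuracy_majority(label, instances):
    '''Fraction of instances whose class label matches the predicted label'''
    num_correct = sum(1 for instance in instances if instance[0] == label)
    return float(num_correct) / len(instances)

toy_instances = [['e', i] for i in range(6)] + [['p', i] for i in range(4)]
curve = learning_curve_sketch(toy_instances, train_majority, accuracy_majority, num_partitions=5)
print(curve)
```

Passing the training and scoring steps in as functions keeps the sketch self-contained; in the notebook those roles are played by create_decision_tree() and classification_accuracy().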

-
-
+
+
+
+
-In [102]: +In [104]:
-
+
+
def compute_learning_curve(instances, num_partitions=10):
     '''Returns a list of training sizes and scores for incrementally increasing partitions.
 
@@ -6374,18 +7606,19 @@ 

Evaluating the Accura
     return accuracy_list
 accuracy_list = compute_learning_curve(clean_instances)
-print accuracy_list
+print(accuracy_list)

+
-
-
+
+
-
-
+
+
 [(565, (562, 2, 0.9964539007092199)), (1130, (564, 0, 1.0)), (1695, (564, 0, 1.0)), (2260, (564, 0, 1.0)), (2824, (564, 0, 1.0)), (3388, (564, 0, 1.0)), (3952, (564, 0, 1.0)), (4516, (564, 0, 1.0)), (5080, (564, 0, 1.0))]
 
@@ -6397,28 +7630,36 @@ 

Evaluating the Accura

+
+
+
+

Due to the quick convergence on an optimal decision tree for classifying the UCI mushroom dataset, we can use a larger number of smaller partitions to see a little more variation in accuracy performance.

-
-
+
+
+
+
-In [103]: +In [105]:
-
+
+
accuracy_list = compute_learning_curve(clean_instances, 100)
-print accuracy_list[:10]
+print(accuracy_list[:10])
 
+
-
-
+
+
-
-
+
+
 [(57, (55, 1, 0.9821428571428571)), (114, (56, 0, 1.0)), (171, (55, 1, 0.9821428571428571)), (228, (56, 0, 1.0)), (285, (56, 0, 1.0)), (342, (56, 0, 1.0)), (399, (56, 0, 1.0)), (456, (56, 0, 1.0)), (513, (56, 0, 1.0)), (570, (56, 0, 1.0))]
 
@@ -6430,16 +7671,26 @@ 

Evaluating the Accura

+
+
+
+

Object-Oriented Programming: Defining a Python Class to Encapsulate a Simple Decision Tree

+
+
+
+
+
+

The simple decision tree defined above uses a Python dictionary for its representation. One can imagine using other data structures, and/or extending the decision tree to support confidence estimates, numeric features and other capabilities that are often included in more fully functional implementations. To support future extensibility, and hide the details of the representation from the user, it would be helpful to have a user-defined class for simple decision trees.

Python is an object-oriented programming language, offering simple syntax and semantics for defining classes and instantiating objects of those classes. [It is assumed that the reader is already familiar with the concepts of object-oriented programming]

A Python class starts with the keyword class followed by a class name (identifier), a colon (':'), and then any number of statements, which typically take the form of assignment statements for class or instance variables and/or function definitions for class methods. All statements are indented to reflect their inclusion in the class definition.

The members - methods, class variables and instance variables - of a class are accessed by prepending self. to each reference. Class methods always include self as the first parameter.

-

All class members in Python are public (accessible outside the class). There is no mechanism for private class members, but identifiers with leading double underscores (__member_identifier) are 'mangled' (translated into _class_name_memberidentifier), and thus not directly accessible outside their class, and can be used to approximate private members by Python programmers.

+

All class members in Python are public (accessible outside the class). There is no mechanism for private class members, but identifiers with leading double underscores (__member_identifier) are 'mangled' (translated into *_class_name__member_identifier*), and thus not directly accessible outside their class, and can be used to approximate private members by Python programmers.

There is also no mechanism for protected identifiers - accessible only within a defining class and its subclasses - in the Python language, and so Python programmers have adopted the convention of using a single underscore (_identifier) at the start of any identifier that is intended to be protected (i.e., not to be accessed outside the class or its subclasses).
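A short sketch of how the two underscore conventions behave in practice. The Account class and its attributes are hypothetical, chosen only to illustrate the naming rules described above.

```python
class Account(object):
    def __init__(self):
        self._owner = 'pat'     # single underscore: "protected" by convention only
        self.__balance = 100    # double underscore: stored under the mangled name _Account__balance

    def balance(self):
        return self.__balance   # inside the class, the unmangled name works

account = Account()
print(account._owner)                  # accessible, though the convention says to leave it alone
print(hasattr(account, '__balance'))   # the unmangled name does not exist on the object
print(account._Account__balance)       # the mangled name is still reachable from outside
```

Name mangling therefore discourages accidental access rather than preventing it, which is one reason many programmers rely on the single underscore convention alone.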

Some Python programmers only use the single underscore prefixes and avoid double underscore prefixes due to unintended consequences that can arise when names are mangled. The following warning about single and double underscore prefixes is issued in Code Like a Pythonista:

@@ -6452,12 +7703,15 @@

sklearn.tree.DecisionTreeClassifier class (and in most sklearn classifier classes), the method for constructing a classifier is named fit() - since it "fits" the data to a model - and the method for classifying instances is named predict() - since it is predicting the class label for an instance.

Most comments and the use of the trace parameter have been removed to make the code more compact, but are included in the version found in SimpleDecisionTree.py.

-
-
+
+
+
+
-In [104]: +In [106]:
-
+
+
class SimpleDecisionTree:
 
     _tree = {} # this instance variable becomes accessible to class methods via self._tree
@@ -6509,45 +7763,54 @@ 

         predicted_labels = [self.classify(instance, default_class) for instance in instances]
         actual_labels = [x[0] for x in instances]
         counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])
-        return counts[True], counts[False], float(counts[True]) / len(instances)
+        return counts[True], counts[False], counts[True] / len(instances)
     def pprint(self):
         pprint(self._tree)

+
+
+
+
+

The following statements instantiate a SimpleDecisionTree using all but the last 20 clean_instances, print out the tree using its pprint() method, and then use the classify() method to print the classification of the last 20 clean_instances.

-
-
+
+
+
+
-In [105]: +In [107]:
-
+
+
simple_decision_tree = SimpleDecisionTree(training_instances)
 simple_decision_tree.pprint()
-print
+print()
 for instance in testing_instances:
     predicted_label = simple_decision_tree.classify(instance)
     actual_label = instance[0]
-    print 'Model: {}; truth: {}'.format(predicted_label, actual_label)
-print
-print 'Classification accuracy:', simple_decision_tree.classification_accuracy(testing_instances)
+    print('Model: {}; truth: {}'.format(predicted_label, actual_label))
+print()
+print('Classification accuracy:', simple_decision_tree.classification_accuracy(testing_instances))
 
+
-
-
+
+
-
-
From e5781c463640402d99af9a4c3b05389ede5b2bb9 Mon Sep 17 00:00:00 2001 From: Joe McCarthy Date: Tue, 24 Feb 2015 22:10:51 -0800 Subject: [PATCH 03/16] Fixed some Python 3 compatibility issues --- Python_for_Data_Science_all.ipynb | 312 ++++++++++++++---------------- simple_ml.py | 18 +- 2 files changed, 154 insertions(+), 176 deletions(-) diff --git a/Python_for_Data_Science_all.ipynb b/Python_for_Data_Science_all.ipynb index 86eaa40..4ef3ce1 100644 --- a/Python_for_Data_Science_all.ipynb +++ b/Python_for_Data_Science_all.ipynb @@ -1,7 +1,7 @@ { "metadata": { "name": "", - "signature": "sha256:70904b83221e3051c52ba052346cc93ef71896373e06c07b0d982ac4605674c8" + "signature": "sha256:e4e44d764641165b5be7b5e9e173657c21423c423f68e3f034e2dd22505466ed" }, "nbformat": 3, "nbformat_minor": 0, @@ -348,7 +348,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "There are 2 major versions of Python in widespread use: [Python 2](https://docs.python.org/2/) and [Python 3](https://docs.python.org/3/). Python 3 has some features that are not backward compatible with Python 2, and some Python 2 libraries have not been updated to work with Python 3. I have been using Python 2, primarily because I use some of those Python 2 libraries, but an increasing proportion of them are migrating to Python 3, and I anticipate shifting to Python 3 in the near future.\n", + "There are 2 major versions of Python in widespread use: [Python 2](https://docs.python.org/2/) and [Python 3](https://docs.python.org/3/). Python 3 has some features that are not backward compatible with Python 2, and some Python 2 libraries have not been updated to work with Python 3. 
I have been using Python 2, primarily because I use some of those Python 2[-only] libraries, but an increasing proportion of them are migrating to Python 3, and I anticipate shifting to Python 3 in the near future.\n", "\n", "For more on the topic, I recommend a very well documented IPython Notebook, which includes numerous helpful examples and links, by [Sebastian Raschka](http://sebastianraschka.com/), [Key differences between Python 2.7.x and Python 3.x](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/key_differences_between_python_2_and_3.ipynb) ... or [googling Python 2 vs 3](https://www.google.com/q=python%202%20vs%203).\n", "\n", @@ -384,7 +384,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The sample instance of a mushroom shown above can be represented as a string. A Python *string* ([`str`](http://docs.python.org/2/tutorial/introduction.html#strings)) is a sequence of 0 or more characters enclosed within a pair of single quotes (`'`) or a pair double quotes (`\"`). " + "The sample instance of a mushroom shown above can be represented as a string. A Python *string* ([**`str`**](http://docs.python.org/2/tutorial/introduction.html#strings)) is a sequence of 0 or more characters enclosed within a pair of single quotes (`'`) or a pair double quotes (`\"`). " ] }, { @@ -431,7 +431,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [**`print`**](https://docs.python.org/3/library/functions.html#print) function writes the value of its comma-delimited arguments to [`sys.stdout`](http://docs.python.org/2/library/sys.html#sys.stdout) (typically the console). Each value in the output is separated by a single blank space. If an `end` argument that does not include `\\n` (newline character) is supplied, the output cursor will not move to the next line." 
+ "The [**`print`**](https://docs.python.org/3/library/functions.html#print) function writes the value of its comma-delimited arguments to [**`sys.stdout`**](http://docs.python.org/2/library/sys.html#sys.stdout) (typically the console). Each value in the output is separated by a single blank space. If an **`end`** argument that does not include `\\n` (newline character) is supplied, the output cursor will not move to the next line." ] }, { @@ -460,9 +460,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The Python comment character is `'#'`: anything after `'#'` on the line is ignored by the Python interpreter. \n", + "The Python comment character is **`'#'`**: anything after `'#'` on the line is ignored by the Python interpreter. \n", "\n", - "Pairs of triple quotes (`'''` or `\"\"\"`) can be used to delimit multi-line comments." + "Pairs of triple quotes (**`'''`** or **`\"\"\"`**) can be used to delimit multi-line comments." ] }, { @@ -492,7 +492,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "A [`list`](http://docs.python.org/2/tutorial/introduction.html#lists) is an ordered sequence of 0 or more comma-delimited elements enclosed within square brackets ('`[`', '`]`'). The Python [`str.split(sep)`](http://docs.python.org/2/library/stdtypes.html#str.split) method can be used to split a `sep`-delimited string into a corresponding list of elements." + "A [**`list`**](http://docs.python.org/2/tutorial/introduction.html#lists) is an ordered sequence of 0 or more comma-delimited elements enclosed within square brackets ('`[`', '`]`'). The Python [**`str.split(sep)`**](http://docs.python.org/2/library/stdtypes.html#str.split) method can be used to split a `sep`-delimited string into a corresponding list of elements." 
] }, { @@ -799,7 +799,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [`str.strip(\\[chars\\]`)](http://docs.python.org/2/library/stdtypes.html#str.strip) method returns a copy of `str` in which any leading or trailing `chars` are removed. If no `chars` are specified, it removes all leading and trailing whitespace. [*Whitespace* is any sequence of spaces, tabs (`'\\t'`) and/or newline (`'\\n'`) characters.] \n", + "The [**`str.strip(\\[chars\\]`**)](http://docs.python.org/2/library/stdtypes.html#str.strip) method returns a copy of `str` in which any leading or trailing `chars` are removed. If no `chars` are specified, it removes all leading and trailing whitespace. [*Whitespace* is any sequence of spaces, tabs (`'\\t'`) and/or newline (`'\\n'`) characters.] \n", "\n", "Note that since a blank space is inserted in the output after every item in a comma-delimited list, the last asterisk is printed after a leading blank space is inserted on the new line." ] @@ -881,7 +881,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [`str.join(words)`](http://docs.python.org/2/library/string.html#string.join) method is the inverse of `str.split()`, returning a single string in which each string in the sequence of `words` is separated by `str`." + "The [**`str.join(words)`**](http://docs.python.org/2/library/string.html#string.join) method is the inverse of `str.split()`, returning a single string in which each string in the sequence of `words` is separated by `str`." ] }, { @@ -909,7 +909,7 @@ "source": [ "A number of Python methods can be used on strings, lists and other sequences.\n", "\n", - "The [`len(s)`](http://docs.python.org/2/library/functions.html#len) function can be used to find the length of (number of items in) a sequence `s`. It will also return the number of items in a *dictionary*, a data structure we will cover further below." 
+ "The [**`len(s)`**](http://docs.python.org/2/library/functions.html#len) function can be used to find the length of (number of items in) a sequence `s`. It will also return the number of items in a *dictionary*, a data structure we will cover further below." ] }, { @@ -963,7 +963,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [`s.count(x)`](http://docs.python.org/2/library/stdtypes.html#str.count) ormethod can be used to count the number of occurrences of item `x` in sequence `s`." + "The [**`s.count(x)`**](http://docs.python.org/2/library/stdtypes.html#str.count) method can be used to count the number of occurrences of item `x` in sequence `s`." ] }, { @@ -989,7 +989,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [`s.index(x)`](http://docs.python.org/2/library/stdtypes.html#str.index) method can be used to find the first 0-based index of item `x` in sequence `s`. " + "The [**`s.index(x)`**](http://docs.python.org/2/library/stdtypes.html#str.index) method can be used to find the first 0-based index of item `x` in sequence `s`. " ] }, { @@ -1271,7 +1271,7 @@ "\n", "As in many practical applications of philosophy, religion or dogma, it is helpful to *think before you choose (TBYC)*. There are a number of factors to consider in deciding whether to follow the EAFP or LBYL paradigm, including code readability and the anticipated likelihood and relative severity of encountering an exception. Oran Looney wrote a blog post providing a nice overview of the debate over [LBYL vs. EAFP](http://oranlooney.com/lbyl-vs-eafp/).\n", "\n", - "We will follow the LBYL paradigm throughout most of this primer.
However, as an illustration of EAFP in Python, here is an alternate implementation of the functionality of the code above, using a [**`try/except`**](http://docs.python.org/2/tutorial/errors.html#handling-exceptions) statement." ] }, { @@ -1400,7 +1400,7 @@ "source": [ "Note that Python does not distinguish between names used for *variables* and names used for *functions*. An assignment statement binds a value to a name; a function definition also binds a value to a name. At any given time, the value most recently bound to a name is the one that is used. \n", "\n", - "The [`type(object)`](http://docs.python.org/2.7/library/functions.html#type) function returns the `type` of `object`." + "The [**`type(object)`**](http://docs.python.org/2.7/library/functions.html#type) function returns the `type` of `object`." ] }, { @@ -1423,7 +1423,7 @@ "stream": "stdout", "text": [ "x used as a variable: 0 \n", - "x used as a function: \n" + "x used as a function: \n" ] } ], @@ -1433,7 +1433,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Another way to determine the `type` of an object is to use [`isinstance(object, class)`](https://docs.python.org/2/library/functions.html#isinstance). This is generally [preferable](http://stackoverflow.com/questions/1549801/differences-between-isinstance-and-type-in-python), as it takes into account [class inheritance](https://docs.python.org/2/tutorial/classes.html#inheritance). There is a larger issue of [*duck typing*](https://en.wikipedia.org/wiki/Duck_typing), and whether code should ever explicitly check for the type of an object, but we will omit further discussion of the topic in this notebook.\n", + "Another way to determine the `type` of an object is to use [**`isinstance(object, class)`**](https://docs.python.org/2/library/functions.html#isinstance). 
This is generally [preferable](http://stackoverflow.com/questions/1549801/differences-between-isinstance-and-type-in-python), as it takes into account [class inheritance](https://docs.python.org/2/tutorial/classes.html#inheritance). There is a larger issue of [*duck typing*](https://en.wikipedia.org/wiki/Duck_typing), and whether code should ever explicitly check for the type of an object, but we will omit further discussion of the topic in this notebook.\n", "\n", "Checking whether an object is a `function` type will require the use of the `types` library ... and more [thorough checking](http://stackoverflow.com/questions/624926/how-to-detect-whether-a-python-variable-is-a-function/624948#624948) could be done if one wants to include built-in as well as user-defined functions." ] @@ -1612,21 +1612,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [**`for`**](http://docs.python.org/2/tutorial/controlflow.html#for-statements) statement iterates over the elements of a sequence.\n", - "\n", - "The [`range(stop)`](http://docs.python.org/2/tutorial/controlflow.html#the-range-function) function returns a list of values from 0 up to `stop - 1` (inclusive). " + "The [**`for`**](http://docs.python.org/2/tutorial/controlflow.html#for-statements) statement iterates over the elements of a sequence." 
] }, { "cell_type": "code", "collapsed": false, "input": [ - "print('Index values for attributes:', range(len(attribute_names)), end='\\n\\n') # 2 newlines\n", - "\n", - "print('Values for the', len(attribute_names), 'attributes:', end='\\n\\n')\n", - "for i in range(len(attribute_names)):\n", - " print(attribute_names[i], '=', \n", - " attribute_value(single_instance_list, attribute_names[i], attribute_names))" + "for i in [0, 1, 2]:\n", + " print(i)" ], "language": "python", "metadata": {}, @@ -1635,52 +1629,20 @@ "output_type": "stream", "stream": "stdout", "text": [ - "Index values for attributes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]\n", - "\n", - "Values for the 23 attributes:\n", - "\n", - "class = p\n", - "cap-shape = k\n", - "cap-surface = f\n", - "cap-color = n\n", - "bruises? = f\n", - "odor = n\n", - "gill-attachment = f\n", - "gill-spacing = c\n", - "gill-size = n\n", - "gill-color = w\n", - "stalk-shape = e\n", - "stalk-root = ?\n", - "stalk-surface-above-ring = k\n", - "stalk-surface-below-ring = y\n", - "stalk-color-above-ring = w\n", - "stalk-color-below-ring = n\n", - "veil-type = p\n", - "veil-color = w\n", - "ring-number = o\n", - "ring-type = e\n", - "spore-print-color = w\n", - "population = v\n", - "habitat = d\n" + "0\n", + "1\n", + "2\n" ] } ], "prompt_number": 41 }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The more general form of the function, [`range(start, stop[, step])`](http://docs.python.org/2/library/functions.html#range), returns a list of values from `start` to `stop - 1` (inclusive) increasing by `step` (which defaults to `1`), or from `start` down to `stop + 1` (inclusive) decreasing by `step` if `step` is negative." 
- ] - }, { "cell_type": "code", "collapsed": false, "input": [ - "print('range(5, 10):', range(5, 10))\n", - "print('range(10, 5, -1):', range(10, 5, -1))\n", - "print('range(0, 10, 2):', range(0, 10, 2))" + "for c in 'abc':\n", + " print(c)" ], "language": "python", "metadata": {}, @@ -1689,9 +1651,9 @@ "output_type": "stream", "stream": "stdout", "text": [ - "range(5, 10): [5, 6, 7, 8, 9]\n", - "range(10, 5, -1): [10, 9, 8, 7, 6]\n", - "range(0, 10, 2): [0, 2, 4, 6, 8]\n" + "a\n", + "b\n", + "c\n" ] } ], @@ -1701,19 +1663,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [`xrange(stop[, stop[, step]])`](http://docs.python.org/2/library/functions.html#xrange) function is an [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) version of the `range()` function. In the context of a `for` loop, it returns the *next* item of the sequence for each iteration of the loop rather than creating *all* the elements of the sequence before the first iteration. This can reduce memory consumption in cases where iteration over all the items is not required. \n", - "\n", - "The `range()` function returns a list, which can then be manipulated by any list or sequence methods. An *xrange object* can only be used in a `for` loop or the `len()` function. A related and slightly more general class of container objects, [*iterators*](http://docs.python.org/2/library/stdtypes.html#typeiter), include a [`next()`](http://docs.python.org/2/library/stdtypes.html#iterator.next) method for explicitly returning the next item in the container." + "In Python 2, the [**`range(stop)`**](http://docs.python.org/2/tutorial/controlflow.html#the-range-function) function returns a list of values from 0 up to `stop - 1` (inclusive). It is often used in the context of a `for` loop that iterates over the list of values." 
] }, { "cell_type": "code", "collapsed": false, "input": [ - "print(xrange(len(attribute_names)), end='\\n\\n') # prints the string representation of the object\n", - "\n", "print('Values for the', len(attribute_names), 'attributes:', end='\\n\\n')\n", - "for i in xrange(len(attribute_names)):\n", + "for i in range(len(attribute_names)):\n", " print(attribute_names[i], '=', \n", " attribute_value(single_instance_list, attribute_names[i], attribute_names))" ], @@ -1724,8 +1682,6 @@ "output_type": "stream", "stream": "stdout", "text": [ - "xrange(23)\n", - "\n", "Values for the 23 attributes:\n", "\n", "class = p\n", @@ -1756,6 +1712,46 @@ ], "prompt_number": 43 }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The more general form of the function, [**`range(start, stop[, step])`**](http://docs.python.org/2/library/functions.html#range), returns a list of values from `start` to `stop - 1` (inclusive) increasing by `step` (which defaults to `1`), or from `start` down to `stop + 1` (inclusive) decreasing by `step` if `step` is negative." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "for i in range(3, 0, -1):\n", + " print(i)" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "3\n", + "2\n", + "1\n" + ] + } + ], + "prompt_number": 44 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Python 2, the [**`xrange(start, stop[, step])`**](http://docs.python.org/2/library/functions.html#xrange) function is an [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) version of the `range()` function. In the context of a `for` loop, it returns the *next* item of the sequence for each iteration of the loop rather than creating *all* the elements of the sequence before the first iteration.
This can reduce memory consumption in cases where iteration over all the items is not required.\n", + "\n", + "In Python 3, the `range()` function behaves the same way as the `xrange()` function does in Python 2, and so the `xrange()` function is not defined in Python 3. \n", + "\n", + "To maximize compatibility, we will use `range()` throughout this notebook; however, note that it is generally advisable to use `xrange()` rather than `range()` wherever possible in Python 2." + ] + }, { "cell_type": "heading", "level": 3, @@ -1768,9 +1764,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "A Python [***module***](http://docs.python.org/2/tutorial/modules.html) is a file containing related definitions (e.g., of functions and variables). Modules are used to help organize the Python [***namespaces***](http://docs.python.org/2/tutorial/classes.html#python-scopes-and-namespaces), the set of identifiers accessible in a particular contexts. All of the functions and variables we define in this IPython Notebook are in the `__main__` namespace, so accessing them does not require any specification of a module.\n", + "A Python [*module*](http://docs.python.org/2/tutorial/modules.html) is a file containing related definitions (e.g., of functions and variables). Modules are used to help organize the Python [*namespaces*](http://docs.python.org/2/tutorial/classes.html#python-scopes-and-namespaces), the set of identifiers accessible in a particular contexts. All of the functions and variables we define in this IPython Notebook are in the `__main__` namespace, so accessing them does not require any specification of a module.\n", "\n", - "A Python module named `simple_ml` (in the file `simple_ml.py`), contains a set of solutions to the exercises in this IPython Notebook. 
Accessing functions in that module requires that we first [`import`](http://docs.python.org/2/reference/simple_stmts.html#the-import-statement) the module, and then prefix the function names with the module name followed by a dot (this is known as ***dotted notation***).\n", + "A Python [*module*](http://docs.python.org/2/tutorial/modules.html) is a file containing related definitions (e.g., of functions and variables). Modules are used to help organize the Python [*namespaces*](http://docs.python.org/2/tutorial/classes.html#python-scopes-and-namespaces), the set of identifiers accessible in a particular context. All of the functions and variables we define in this IPython Notebook are in the `__main__` namespace, so accessing them does not require any specification of a module.\n", "\n", + "A Python module named `simple_ml` (in the file `simple_ml.py`) contains a set of solutions to the exercises in this IPython Notebook. Accessing functions in that module requires that we first [`import`](http://docs.python.org/2/reference/simple_stmts.html#the-import-statement) the module, and then prefix the function names with the module name followed by a dot (this is known as *dotted notation*).\n", "\n", "For example, the following function call Exercise 1 below: \n", "\n", @@ -1849,7 +1845,7 @@ ] } ], - "prompt_number": 44 + "prompt_number": 45 }, { "cell_type": "heading", @@ -1899,7 +1895,7 @@ ] } ], - "prompt_number": 45 + "prompt_number": 46 }, { "cell_type": "heading", @@ -1947,13 +1943,13 @@ ] } ], - "prompt_number": 46 + "prompt_number": 47 }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Output to text file is usually done via [`file.write(str)`](http://docs.python.org/2/library/stdtypes.html#file.write) method.\n", + "Output to a text file is usually done via the [**`file.write(str)`**](http://docs.python.org/2/library/stdtypes.html#file.write) method.\n", "\n", "As we saw earlier, the [`str.join(words)`](http://docs.python.org/2/library/stdtypes.html#str.join) method returns a single `str`-delimited string containing each of the strings in the list `words`.\n", "\n", @@ -1995,7 +1991,7 @@ ] } ], - "prompt_number": 47 + "prompt_number": 48 }, { "cell_type": "heading", @@ -2013,7 +2009,7 @@ "\n", "Some programmers find list comprehensions confusing, and avoid their use.
We won't rely on list comprehensions here, but will show examples with and without list comprehensions below.\n", "\n", - "One common use of list comprehensions is in the context of the [str.join(words)](http://docs.python.org/2/library/string.html#string.join) method we saw earlier.\n", + "One common use of list comprehensions is in the context of the [`str.join(words)`](http://docs.python.org/2/library/string.html#string.join) method we saw earlier.\n", "\n", "If we wanted to construct a pipe-delimited string containing elements of the list, we could use a `for` loop to iteratively add list elements and pipe delimiters to a string. We would thereby add one pipe delimiter too many, and would thus have to shave that off at the end." ] @@ -2034,13 +2030,13 @@ { "metadata": {}, "output_type": "pyout", - "prompt_number": 48, + "prompt_number": 49, "text": [ "'1|2|3'" ] } ], - "prompt_number": 48 + "prompt_number": 49 }, { "cell_type": "markdown", @@ -2061,13 +2057,13 @@ { "metadata": {}, "output_type": "pyout", - "prompt_number": 49, + "prompt_number": 50, "text": [ "'1|2|3'" ] } ], - "prompt_number": 49 + "prompt_number": 50 }, { "cell_type": "markdown", @@ -2105,7 +2101,7 @@ ] } ], - "prompt_number": 50 + "prompt_number": 51 }, { "cell_type": "code", @@ -2127,7 +2123,7 @@ ] } ], - "prompt_number": 51 + "prompt_number": 52 }, { "cell_type": "heading", @@ -2143,7 +2139,7 @@ "source": [ "Although single character abbreviations of attribute values (e.g., 'x') allow for more compact data files, they are not as easy to understand by human readers as the longer attribute value descriptions (e.g., 'convex').\n", "\n", - "A Python [dictionary (or `dict`)](http://docs.python.org/2/tutorial/datastructures.html#dictionaries) is an unordered, comma-delimited collection of *key, value* pairs, serving a siimilar function as a hash table or hashmap in other programming languages.\n", + "A Python [dictionary (or 
**`dict`**)](http://docs.python.org/2/tutorial/datastructures.html#dictionaries) is an unordered, comma-delimited collection of *key, value* pairs, serving a similar function to a hash table or hashmap in other programming languages.\n", "\n", "We could create a dictionary for the `cap-type` attribute values shown above:\n", "\n", @@ -2179,7 +2175,7 @@ ] } ], - "prompt_number": 52 + "prompt_number": 53 }, { "cell_type": "markdown", @@ -2213,7 +2209,7 @@ ] } ], - "prompt_number": 53 + "prompt_number": 54 }, { "cell_type": "markdown", @@ -2242,7 +2238,7 @@ ] } ], - "prompt_number": 54 + "prompt_number": 55 }, { "cell_type": "markdown", @@ -2310,7 +2306,7 @@ ] } ], - "prompt_number": 55 + "prompt_number": 56 }, { "cell_type": "heading", @@ -2371,7 +2367,7 @@ ] } ], - "prompt_number": 56 + "prompt_number": 57 }, { "cell_type": "heading", @@ -2411,7 +2407,7 @@ ] } ], - "prompt_number": 57 + "prompt_number": 58 }, { "cell_type": "markdown", @@ -2454,13 +2450,13 @@ ] } ], - "prompt_number": 58 + "prompt_number": 59 }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The Python [`collections`](http://docs.python.org/2/library/collections.html) module provides a number of high performance container datatypes. A frequently useful datatype is a [`defaultdict`](http://docs.python.org/2/library/collections.html#defaultdict-objects), which automatically creates an appropriate default value for a new key. For example, a `defaultdict(int)` automatically initializes a new dictionary entry to 0 (zero); a `defaultdict(list)` automatically initializes a new dictionary entry to the empty list (`[]`).\n", + "The Python [**`collections`**](http://docs.python.org/2/library/collections.html) module provides a number of high performance container datatypes. A frequently useful datatype is a [**`defaultdict`**](http://docs.python.org/2/library/collections.html#defaultdict-objects), which automatically creates an appropriate default value for a new key.
For example, a `defaultdict(int)` automatically initializes a new dictionary entry to 0 (zero); a `defaultdict(list)` automatically initializes a new dictionary entry to the empty list (`[]`).\n", "\n", "After first importing `defaultdict` from `collections`, we can use `defaultdict(int)` to simplify the code above:" ] @@ -2497,7 +2493,7 @@ ] } ], - "prompt_number": 59 + "prompt_number": 60 }, { "cell_type": "heading", @@ -2545,7 +2541,7 @@ ] } ], - "prompt_number": 60 + "prompt_number": 61 }, { "cell_type": "heading", @@ -2561,7 +2557,7 @@ "source": [ "Earlier, we saw that there is a `list.sort()` method that will sort a list in-place, i.e., by replacing the original value of `list` with a sorted version of the elements in `list`. \n", "\n", - "The Python [`sorted(iterable[, cmp[, key[, reverse]]])`](http://docs.python.org/2/library/functions.html#sorted) function can be used to return a *copy* of a list, dictionary or any other [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) container it is passed, in ascending order." + "The Python [**`sorted(iterable[, cmp[, key[, reverse]]])`**](http://docs.python.org/2/library/functions.html#sorted) function can be used to return a *copy* of a list, dictionary or any other [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) container it is passed, in ascending order." 
] }, { @@ -2585,7 +2581,7 @@ ] } ], - "prompt_number": 61 + "prompt_number": 62 }, { "cell_type": "markdown", @@ -2611,7 +2607,7 @@ ] } ], - "prompt_number": 62 + "prompt_number": 63 }, { "cell_type": "markdown", @@ -2637,7 +2633,7 @@ ] } ], - "prompt_number": 63 + "prompt_number": 64 }, { "cell_type": "markdown", @@ -2669,13 +2665,13 @@ ] } ], - "prompt_number": 64 + "prompt_number": 65 }, { "cell_type": "markdown", "metadata": {}, "source": [ - "An optional [keyword argument](http://docs.python.org/2/tutorial/controlflow.html#keyword-arguments), `reverse`, can be used to reverse the order of the sorted list returned by the function. The default value of this optional parameter is `False`, to get non-default behavior, we must specify the name and value of the argument: `reverse=True`. " + "An optional [keyword argument](http://docs.python.org/2/tutorial/controlflow.html#keyword-arguments), **`reverse`**, can be used to reverse the order of the sorted list returned by the function. The default value of this optional parameter is `False`, to get non-default behavior, we must specify the name and value of the argument: `reverse=True`. " ] }, { @@ -2695,7 +2691,7 @@ ] } ], - "prompt_number": 65 + "prompt_number": 66 }, { "cell_type": "code", @@ -2714,7 +2710,7 @@ ] } ], - "prompt_number": 66 + "prompt_number": 67 }, { "cell_type": "code", @@ -2744,7 +2740,7 @@ ] } ], - "prompt_number": 67 + "prompt_number": 68 }, { "cell_type": "heading", @@ -2771,7 +2767,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [`dict.items()`](http://docs.python.org/2/library/stdtypes.html#dict.items) method returns an unordered list of `(key, value)` tuples in `dict`." + "The [**`dict.items()`**](http://docs.python.org/2/library/stdtypes.html#dict.items) method returns an unordered list of `(key, value)` tuples in `dict`." 
] }, { @@ -2786,7 +2782,7 @@ { "metadata": {}, "output_type": "pyout", - "prompt_number": 68, + "prompt_number": 69, "text": [ "[('c', 'conical'),\n", " ('b', 'bell'),\n", @@ -2797,40 +2793,22 @@ ] } ], - "prompt_number": 68 + "prompt_number": 69 }, { "cell_type": "markdown", "metadata": {}, "source": [ - "A related method, [`dict.iteritems()`](http://docs.python.org/2/library/stdtypes.html#dict.iteritems), returns an [`iterator`](http://docs.python.org/2/library/stdtypes.html#iterator-types) - a callable object that returns the *next* item in a sequence each time it is referenced (e.g., during each iteration of a for loop), which can be more efficient than generating *all* the items in the sequence before any are used. This is similar to the distinction between `xrange()` and `range()` described above." + "In Python 2, a related method, [**`dict.iteritems()`**](http://docs.python.org/2/library/stdtypes.html#dict.iteritems), returns an [**`iterator`**](http://docs.python.org/2/library/stdtypes.html#iterator-types): an object that returns the *next* item in a sequence each time it is advanced (e.g., during each iteration of a `for` loop), which can be more efficient than generating *all* the items in the sequence before any are used ... and so should be used rather than `items()` wherever possible in Python 2.\n", + "\n", + "This is similar to the distinction between `xrange()` and `range()` described above ... and, also similarly, `dict.items()` in Python 3 returns a lazily evaluated view object rather than a list, and so `dict.iteritems()` is no longer needed (nor defined) ... and further similarly, we will use only `dict.items()` in this notebook, but `dict.iteritems()` should be used wherever possible in Python 2."
] }, { "cell_type": "code", "collapsed": false, "input": [ - "attribute_values_cap_type.iteritems()" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "metadata": {}, - "output_type": "pyout", - "prompt_number": 69, - "text": [ - "" - ] - } - ], - "prompt_number": 69 - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "for key, value in attribute_values_cap_type.iteritems():\n", + "for key, value in attribute_values_cap_type.items():\n", " print(key, ':', value)" ], "language": "python", @@ -2855,13 +2833,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The Python [`operator`](http://docs.python.org/2/library/operator.html) module contains a number of functions that perform object comparisons, logical operations, mathematical operations, sequence operations, and abstract type tests.\n", + "The Python [**`operator`**](http://docs.python.org/2/library/operator.html) module contains a number of functions that perform object comparisons, logical operations, mathematical operations, sequence operations, and abstract type tests.\n", "\n", - "To facilitate sorting a dictionary by values, we will use the [`operator.itemgetter(item)`](http://docs.python.org/2/library/operator.html#operator.itemgetter) function that can be used to retrieve an indexed value (`item`) in a tuple (such as a `(key, value)` pair returned by `[iter]items()`).\n", + "To facilitate sorting a dictionary by values, we will use the [**`operator.itemgetter(item)`**](http://docs.python.org/2/library/operator.html#operator.itemgetter) function that can be used to retrieve an indexed value (`item`) in a tuple (such as a `(key, value)` pair returned by `[iter]items()`).\n", "\n", "We can use `operator.itemgetter(1)`) to reference the value - the 2nd item in each `(key, value)` tuple, (at zero-based index position 1) - rather than the key - the first item in each `(key, value)` tuple (at index position 0).\n", "\n", - "We will use the optional keyword argument 
`key` in [`sorted(iterable[, cmp[, key[, reverse]]])`](http://docs.python.org/2/library/functions.html#sorted) to specify a *sorting* key that is not the same as the `dict` key (the `dict` key is the default *sorting* key)" + "We will use the optional keyword argument **`key`** in [`sorted(iterable[, cmp[, key[, reverse]]])`](http://docs.python.org/2/library/functions.html#sorted) to specify a *sorting* key that is not the same as the `dict` key (the `dict` key is the default *sorting* key)" ] }, { @@ -2870,7 +2848,7 @@ "input": [ "import operator\n", "\n", - "sorted(attribute_values_cap_type.iteritems(), key=operator.itemgetter(1))" + "sorted(attribute_values_cap_type.items(), key=operator.itemgetter(1))" ], "language": "python", "metadata": {}, @@ -2906,7 +2884,7 @@ "value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)\n", "\n", "print('Counts for each value of', attribute, '(sorted by count):')\n", - "for value, count in sorted(value_counts.iteritems(), key=operator.itemgetter(1), reverse=True):\n", + "for value, count in sorted(value_counts.items(), key=operator.itemgetter(1), reverse=True):\n", " print(value, ':', count)" ], "language": "python", @@ -3485,7 +3463,7 @@ "\n", "***Note:*** the algorithm above is *recursive*, i.e., the there is a recursive call to `ID3` within the definition of `ID3`. Covering recursion is beyond the scope of this primer, but there are a number of other resources on [using recursion in Python](https://www.google.com/search?q=python+recursion). Familiarity with recursion will be important for understanding both the tree construction and classification functions below.\n", "\n", - "First, we will define a function to split a set of instances based on any attribute. 
This function will return a dictionary where the *key* of each dictionary is a distinct value of the specified `attribute_index`, and the *value* of each dictionary is a list representing the subset of `instances` that have that attribute value." + "First, we will define a function, **`split_instances(instances, attribute_index)`**, to split a set of instances based on any attribute. This function will return a dictionary where the *key* of each dictionary is a distinct value of the specified `attribute_index`, and the *value* of each dictionary is a list representing the subset of `instances` that have that attribute value." ] }, { @@ -3571,9 +3549,9 @@ "\n", "We earlier saw how the [`defaultdict`](http://docs.python.org/2/library/collections.html#collections.defaultdict) container in the [`collections`](http://docs.python.org/2/library/collections.html) module can be used to simplify the construction of a dictionary containing the counts of all attribute values for all attributes, by automatically setting the count for any attribute value to zero when the attribute value is first added to the dictionary.\n", "\n", - "The `collections` module has another useful container, a [`Counter`](http://docs.python.org/2/library/collections.html#collections.Counter) class, that can further simplify the construction of a specialized dictionary of counts. When a `Counter` object is instantiated with a list of items, it returns a dictionary-like container in which the *keys* are the unique items in the list, and the *values* are the counts of each unique item in that list. \n", + "The `collections` module has another useful container, a [**`Counter`**](http://docs.python.org/2/library/collections.html#collections.Counter) class, that can further simplify the construction of a specialized dictionary of counts. 
When a `Counter` object is instantiated with a list of items, it returns a dictionary-like container in which the *keys* are the unique items in the list, and the *values* are the counts of each unique item in that list. \n", "\n", - "This container has an additional method, [`most_common([n])`](http://docs.python.org/2/library/collections.html#collections.Counter.most_common), which returns a list of 2-element tuples representing the values and their associated counts for the most common `n` values; if `n` is omitted, the method returns all tuples.\n", + "This container has an additional method, [**`most_common([n])`**](http://docs.python.org/2/library/collections.html#collections.Counter.most_common), which returns a list of 2-element tuples representing the values and their associated counts for the most common `n` values; if `n` is omitted, the method returns all tuples.\n", "\n", "The following is an example of how we can use a `Counter` to represent the frequency of different class labels, and how we can identify the most frequent value and its count." 
] @@ -3836,8 +3814,8 @@ " \n", " # if no candidate_attribute_indexes are provided, assume that we will use all but the target_attribute_index\n", " if candidate_attribute_indexes is None:\n", - " candidate_attribute_indexes = range(len(instances[0]))\n", - " candidate_attribute_indexes.remove(class_index)\n", + " candidate_attribute_indexes = [i for i in range(len(instances[0])) if i != class_index]\n", + " #candidate_attribute_indexes.remove(class_index)\n", " \n", " class_labels_and_counts = Counter([instance[class_index] for instance in instances])\n", "\n", @@ -3948,9 +3926,9 @@ "source": [ "The structure of the tree shown above is rather difficult to discern from the normal printed representation of a dictionary.\n", "\n", - "The Python [`pprint`](http://docs.python.org/2/library/pprint.html) module has a number of useful methods for pretty-printing or formatting objects in a more human readable way.\n", + "The Python [**`pprint`**](http://docs.python.org/2/library/pprint.html) module has a number of useful methods for pretty-printing or formatting objects in a more human readable way.\n", "\n", - "The [`pprint.pprint(object, stream=None, indent=1, width=80, depth=None)`](http://docs.python.org/2/library/pprint.html#pprint.pprint) method will print `object` to a `stream` (a default value of `None` will dictate the use of [sys.stdout](http://docs.python.org/2/library/sys.html#sys.stdout), the same destination as `print` statement output), using `indent` spaces to differentiate nesting levels, using up to a maximum `width` columns and up to to a maximum nesting level `depth` (`None` indicating no maximum).\n", + "The [**`pprint.pprint(object, stream=None, indent=1, width=80, depth=None)`**](http://docs.python.org/2/library/pprint.html#pprint.pprint) method will print `object` to a `stream` (a default value of `None` will dictate the use of [sys.stdout](http://docs.python.org/2/library/sys.html#sys.stdout), the same destination as `print` function output), using 
`indent` spaces to differentiate nesting levels, using up to a maximum `width` columns and up to a maximum nesting level `depth` (`None` indicating no maximum).\n", "\n", "We will use a variation on the import statement that imports one or more functions into the current namespace:\n", "\n", @@ -4009,7 +3987,7 @@ "source": [ "Usually, when we construct a decision tree based on a set of *training* instances, we do so with the intent of using that tree to classify a set of one or more *testing* instances.\n", "\n", - "We will define a function, `classify(tree, instance, default_class=None)`, to use a decision `tree` to classify a single `instance`, where an optional `default_class` can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.\n", + "We will define a function, **`classify(tree, instance, default_class=None)`**, to use a decision `tree` to classify a single `instance`, where an optional `default_class` can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.\n", "\n", "We will use a design pattern in which we will use a series of `if` statements, each of which returns a value if the condition is true, rather than a nested series of `if`, `elif` and/or `else` clauses, as it helps constrain the levels of indentation in the function."
] @@ -4024,8 +4002,8 @@ " return default_class\n", " if not isinstance(tree, dict): \n", " return tree\n", - " attribute_index = tree.keys()[0]\n", - " attribute_values = tree.values()[0]\n", + " attribute_index = list(tree.keys())[0] # using list(dict.keys()) for Python 3 compatibility\n", + " attribute_values = list(tree.values())[0]\n", " instance_attribute_value = instance[attribute_index]\n", " if instance_attribute_value not in attribute_values:\n", " return default_class\n", @@ -4098,7 +4076,7 @@ "def classification_accuracy(tree, testing_instances, class_index=0, default_class=None):\n", " '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index'''\n", " num_correct = 0\n", - " for i in xrange(len(testing_instances)):\n", + " for i in range(len(testing_instances)):\n", " prediction = classify(tree, testing_instances[i], default_class)\n", " actual_value = testing_instances[i][class_index]\n", " if prediction == actual_value:\n", @@ -4124,7 +4102,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [`zip([iterable, ...])`](http://docs.python.org/2.7/library/functions.html#zip) function combines 2 or more sequences or iterables; the function returns a list of tuples, where the *i*th tuple contains the *i*th element from each of the argument sequences or iterables." + "The [**`zip([iterable, ...])`**](http://docs.python.org/2.7/library/functions.html#zip) function combines 2 or more sequences or iterables; the function returns a list of tuples, where the *i*th tuple contains the *i*th element from each of the argument sequences or iterables." ] }, { @@ -4190,7 +4168,7 @@ "source": [ "We sometimes want to partition the instances into subsets of equal sizes to measure performance. One metric this partitioning allows us to compute is a [learning curve](https://en.wikipedia.org/wiki/Learning_curve), i.e., assess how well the model performs based on the size of its training set. 
Another use of these partitions (aka *folds*) would be to conduct an [*n-fold cross validation*](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) evaluation.\n", "\n", - "The following function, `partition_instances(instances, num_partitions)`, partitions a set of `instances` into `num_partitions` relatively equally sized subsets.\n", + "The following function, **`partition_instances(instances, num_partitions)`**, partitions a set of `instances` into `num_partitions` relatively equally sized subsets.\n", "\n", "We'll use this as yet another opportunity to demonstrate the power of using list comprehensions, this time, to condense the use of nested `for` loops." ] @@ -4201,8 +4179,8 @@ "input": [ "def partition_instances(instances, num_partitions):\n", " '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''\n", - " return [[instances[j] for j in xrange(i, len(instances), num_partitions)] \\\n", - " for i in xrange(num_partitions)]" + " return [[instances[j] for j in range(i, len(instances), num_partitions)] \\\n", + " for i in range(num_partitions)]" ], "language": "python", "metadata": {}, @@ -4223,7 +4201,7 @@ "instance_length = 3\n", "num_instances = 5\n", "\n", - "simplified_instances = [[j for j in xrange(i, instance_length + i)] for i in xrange(num_instances)]\n", + "simplified_instances = [[j for j in range(i, instance_length + i)] for i in range(num_instances)]\n", "\n", "print('Instances:', simplified_instances)\n", "partitions = partition_instances(simplified_instances, 2)\n", @@ -4257,18 +4235,18 @@ "def partition_instances(instances, num_partitions):\n", " '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''\n", " partitions = []\n", - " for i in xrange(num_partitions):\n", + " for i in range(num_partitions):\n", " partition = []\n", " # iterate over instances starting at position i in increments of num_paritions\n", - " for j in xrange(i, 
len(instances), num_partitions): \n", + " for j in range(i, len(instances), num_partitions): \n", " partition.append(instances[j])\n", " partitions.append(partition)\n", " return partitions\n", "\n", "simplified_instances = []\n", - "for i in xrange(num_instances):\n", + "for i in range(num_instances):\n", " new_instance = []\n", - " for j in xrange(i, instance_length + i):\n", + " for j in range(i, instance_length + i):\n", " new_instance.append(j)\n", " simplified_instances.append(new_instance)\n", "\n", @@ -4294,7 +4272,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [`enumerate(sequence, start=0)`](http://docs.python.org/2.7/library/functions.html#enumerate) function creates an iterator that successively returns the index and value of each element in a `sequence`, beginning at the `start` index." + "The [**`enumerate(sequence, start=0)`**](http://docs.python.org/2.7/library/functions.html#enumerate) function creates an iterator that successively returns the index and value of each element in a `sequence`, beginning at the `start` index." ] }, { @@ -4330,7 +4308,7 @@ "cell_type": "code", "collapsed": false, "input": [ - "for i in xrange(5):\n", + "for i in range(5):\n", " print('\\n# partitions:', i)\n", " for j, partition in enumerate(partition_instances(simplified_instances, i)):\n", " print('partition {}: {}'.format(j, partition))" @@ -4515,7 +4493,7 @@ "\n", "We will start off with an initial training set consisting only of the first partition, and then progressively extend that training set by adding a new partition during each iteration of computing the learning curve.\n", "\n", - "The [`list.extend(L)`](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) method enables us to extend `list` by appending all the items in another list, `L`, to the end of `list`." 
+ "The [**`list.extend(L)`**](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) method enables us to extend `list` by appending all the items in another list, `L`, to the end of `list`." ] }, { @@ -4543,7 +4521,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can now define the function, `compute_learning_curve(instances, num_partitions=10)`, that will take a list of `instances`, partition it into `num_partitions` relatively equally sized disjoint partitions, and then iteratively evaluate the accuracy of models trained with an incrementally increasing combination of instances in the first `num_partitions - 1` partitions then tested with instances in the last partition. That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on. \n", + "We can now define the function, **`compute_learning_curve(instances, num_partitions=10)`**, that will take a list of `instances`, partition it into `num_partitions` relatively equally sized disjoint partitions, and then iteratively evaluate the accuracy of models trained with an incrementally increasing combination of instances in the first `num_partitions - 1` partitions then tested with instances in the last partition. That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on. \n", "\n", "The function will return a list of `num_partitions - 1` tuples representing the size of the training set and the accuracy of a tree trained with that set (and tested on the `num_partitions - 1` set). This will provide some indication of the relative impact of the size of the training set on model performance." 
] @@ -4565,7 +4543,7 @@ " testing_instances = partitions[-1][:]\n", " training_instances = []\n", " accuracy_list = []\n", - " for i in xrange(0, num_partitions - 1):\n", + " for i in range(0, num_partitions - 1):\n", " # for each iteration, the training set is composed of partitions 0 through i - 1\n", " training_instances.extend(partitions[i][:])\n", " tree = create_decision_tree(training_instances)\n", @@ -4646,15 +4624,15 @@ "\n", "We will follow this advice and avoid using the double underscore prefix in user-defined member variables and methods.\n", "\n", - "Python has a number of pre-defined [special method names](http://docs.python.org/2/reference/datamodel.html#special-method-names), all of which are denoted by leading and trailing double underscores. For example, the [`object.__init__(self[, ...])`](http://docs.python.org/2/reference/datamodel.html#object.__init__) method is used to specify instructions that should be executed whenever a new object of a class is instantiated. \n", + "Python has a number of pre-defined [special method names](http://docs.python.org/2/reference/datamodel.html#special-method-names), all of which are denoted by leading and trailing double underscores. For example, the [**`object.__init__(self[, ...])`**](http://docs.python.org/2/reference/datamodel.html#object.__init__) method is used to specify instructions that should be executed whenever a new object of a class is instantiated. \n", "\n", - "The code below defines a class, `SimpleDecisionTree`, with a single pseudo-protected member variable `_tree` and a pseudo-protected tree construction method `_create()`, two public methods - `classify()` and `pprint()` - and an initialization method that takes an optional list of training `instances` and a `target_attribute_index`. 
\n", + "The code below defines a class, **`SimpleDecisionTree`**, with a single pseudo-protected member variable `_tree` and a pseudo-protected tree construction method `_create()`, two public methods - `classify()` and `pprint()` - and an initialization method that takes an optional list of training `instances` and a `target_attribute_index`. \n", "\n", "The `_create()` method is identical to the `create_decision_tree()` function above, with the inclusion of the `self` parameter (as it is now a class method). The `classify()` method is a similarly modified version of the `classify()` and `classification_accuracy()` functions above, with references to `tree` converted to `self._tree`. The `pprint()` method prints the tree in a human-readable format.\n", "\n", "Note that other machine learning libraries may use different terminology for the methods we've defined here. For example, in the [`sklearn.tree.DecisionTreeClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class (and in most `sklearn` classifier classes), the method for constructing a classifier is named [`fit()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) - since it \"fits\" the data to a model - and the method for classifying instances is named [`predict()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict) - since it is predicting the class label for an instance.\n", "\n", - "Most comments and the use of the trace parameter have been removed to make the code more compact, but are included in the version found in `SimpleDecisionTree.py`." + "Most comments and the use of the trace parameter have been removed to make the code more compact, but are included in the version found in **`SimpleDecisionTree.py`**." 
] }, { @@ -4701,8 +4679,8 @@ " return default_class\n", " if not isinstance(tree, dict):\n", " return tree\n", - " attribute_index = tree.keys()[0]\n", - " attribute_values = tree.values()[0]\n", + " attribute_index = list(tree.keys())[0] # using list(dict.keys()) for Python 3 compatibility\n", + " attribute_values = list(tree.values())[0]\n", " instance_attribute_value = instance[attribute_index]\n", " if instance_attribute_value not in attribute_values:\n", " return default_class\n", diff --git a/simple_ml.py b/simple_ml.py index 04cc51b..6d4e108 100644 --- a/simple_ml.py +++ b/simple_ml.py @@ -5,7 +5,7 @@ __author__ = 'Joe McCarthy' __copyright__ = 'Copyright 2014, Atigeo LLC' -__version__ = '1.0.1' +__version__ = '1.0.2' __date__ = '2014-04-04' __maintainer__ = 'Joe McCarthy' __email__ = 'joe.mccarthy@atigeo.com' @@ -175,7 +175,7 @@ def print_all_attribute_value_counts(instances, attribute_names): for attribute in attribute_names: value_counts = attribute_value_counts(instances, attribute, attribute_names) print('{}:'.format(attribute), end=' ') - for value, count in sorted(value_counts.iteritems(), key=operator.itemgetter(1), reverse=True): + for value, count in sorted(value_counts.items(), key=operator.itemgetter(1), reverse=True): print('{} = {} ({:5.3f}),'.format(value, count, count / num_instances), end=' ') print() @@ -261,7 +261,7 @@ def split_instances(instances, attribute_index): def partition_instances(instances, num_partitions): '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances''' - return [[instances[j] for j in xrange(i, len(instances), num_partitions)] for i in xrange(num_partitions)] + return [[instances[j] for j in range(i, len(instances), num_partitions)] for i in range(num_partitions)] def create_decision_tree(instances, candidate_attribute_indexes=None, class_index=0, default_class=None, trace=0): @@ -280,8 +280,8 @@ def create_decision_tree(instances, candidate_attribute_indexes=None,
class_inde # if no candidate_attribute_indexes are provided, assume that we will use all but the target_attribute_index if candidate_attribute_indexes is None: - candidate_attribute_indexes = range(len(instances[0])) - candidate_attribute_indexes.remove(class_index) + candidate_attribute_indexes = [i for i in range(len(instances[0])) if i != class_index] + #candidate_attribute_indexes.remove(class_index) class_labels_and_counts = Counter([instance[class_index] for instance in instances]) @@ -344,8 +344,8 @@ def classify(tree, instance, default_class=None): return default_class if not isinstance(tree, dict): return tree - attribute_index = tree.keys()[0] - attribute_values = tree.values()[0] + attribute_index = list(tree.keys())[0] # using list(dict.keys()) for Python 3 compatibility + attribute_values = list(tree.values())[0] instance_attribute_value = instance[attribute_index] if instance_attribute_value not in attribute_values: return default_class @@ -356,7 +356,7 @@ def classification_accuracy(tree, testing_instances, class_index=0): '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index''' num_correct = 0 - for i in xrange(len(testing_instances)): + for i in range(len(testing_instances)): prediction = classify(tree, testing_instances[i]) actual_value = testing_instances[i][class_index] if prediction == actual_value: @@ -376,7 +376,7 @@ def compute_learning_curve(instances, num_partitions=10): testing_instances = partitions[-1][:] training_instances = partitions[0][:] accuracy_list = [] - for i in xrange(1, num_partitions): + for i in range(1, num_partitions): # for each iteration, the training set is composed of partitions 0 through i - 1 tree = create_decision_tree(training_instances) partition_accuracy = classification_accuracy(tree, testing_instances) From ca9894e54d1f0ca882045a412d565715eb24a9ff Mon Sep 17 00:00:00 2001 From: Joe McCarthy Date: Tue, 24 Feb 2015 22:13:22 -0800 Subject: [PATCH
04/16] Reconverted updated IPYNB to HTML --- Python_for_Data_Science_all.html | 337 ++++++++++++++----------------- 1 file changed, 157 insertions(+), 180 deletions(-) diff --git a/Python_for_Data_Science_all.html b/Python_for_Data_Science_all.html index d2c3937..7c68b04 100644 --- a/Python_for_Data_Science_all.html +++ b/Python_for_Data_Science_all.html @@ -2036,7 +2036,7 @@

A note on Python 2 vs. Python 3
-

There are 2 major versions of Python in widespread use: Python 2 and Python 3. Python 3 has some features that are not backward compatible with Python 2, and some Python 2 libraries have not been updated to work with Python 3. I have been using Python 2, primarily because I use some of those Python 2 libraries, but an increasing proportion of them are migrating to Python 3, and I anticipate shifting to Python 3 in the near future.

+

There are 2 major versions of Python in widespread use: Python 2 and Python 3. Python 3 has some features that are not backward compatible with Python 2, and some Python 2 libraries have not been updated to work with Python 3. I have been using Python 2, primarily because I use some of those Python 2[-only] libraries, but an increasing proportion of them are migrating to Python 3, and I anticipate shifting to Python 3 in the near future.

For more on the topic, I recommend a very well documented IPython Notebook, which includes numerous helpful examples and links, by Sebastian Raschka, Key differences between Python 2.7.x and Python 3.x ... or googling Python 2 vs 3.

I received an email request from a Python 3 programmer who suggested that a relatively minor change in this notebook would enable it to run with Python 2 or Python 3: importing the print_function from __future__, and changing my print statements (Python 2) to print function calls (Python 3). Although a relatively minor conceptual change, it necessitates the changing of many cells to reflect the Python 3 print syntax.

I find the arguments for making print a function rather than a statement compelling - especially as it is more consistent with printing functionality in many other programming languages - and so while I do not want to convert this notebook to a Python 3 notebook, I have implemented this change so that it can be used in either Python 2 or Python 3. However, while I have verified that it still works in Python 2, I have not tested it in Python 3.

@@ -2075,7 +2075,7 @@

Identifiers, strings, lists and

-

The sample instance of a mushroom shown above can be represented as a string. A Python string (str) is a sequence of 0 or more characters enclosed within a pair of single quotes (') or a pair of double quotes (").

+

The sample instance of a mushroom shown above can be represented as a string. A Python string (str) is a sequence of 0 or more characters enclosed within a pair of single quotes (') or a pair of double quotes (").

@@ -2143,7 +2143,7 @@

Identifiers, strings, lists and

-

The print function writes the value of its comma-delimited arguments to sys.stdout (typically the console). Each value in the output is separated by a single blank space. If an end argument that does not include \n (newline character) is supplied, the output cursor will not move to the next line.

+

The print function writes the value of its comma-delimited arguments to sys.stdout (typically the console). Each value in the output is separated by a single blank space. If an end argument that does not include \n (newline character) is supplied, the output cursor will not move to the next line.
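As a minimal sketch of the behaviour described above (written for Python 3; the sample strings are made up, and the `file` argument is a standard `print` parameter used here only to capture the output in memory so it can be inspected):

```python
import io

buffer = io.StringIO()                      # in-memory stand-in for sys.stdout
print('edible', 'poisonous', file=buffer)   # arguments are separated by one blank space
print('first', end=' ', file=buffer)        # end=' ' replaces the default newline
print('second', file=buffer)                # so this continues on the same line
output = buffer.getvalue()
print(output, end='')                       # show what was captured
```

Without `file=buffer`, the same calls write directly to `sys.stdout`.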

@@ -2186,8 +2186,8 @@

Identifiers, strings, lists and

-

The Python comment character is '#': anything after '#' on the line is ignored by the Python interpreter.

-

Pairs of triple quotes (''' or """) can be used to delimit multi-line comments.

+

The Python comment character is '#': anything after '#' on the line is ignored by the Python interpreter.

+

Pairs of triple quotes (''' or """) can be used to delimit multi-line comments.

@@ -2231,7 +2231,7 @@

Identifiers, strings, lists and

-

A list is an ordered sequence of 0 or more comma-delimited elements enclosed within square brackets ('[', ']'). The Python str.split(sep) method can be used to split a sep-delimited string into a corresponding list of elements.

+

A list is an ordered sequence of 0 or more comma-delimited elements enclosed within square brackets ('[', ']'). The Python str.split(sep) method can be used to split a sep-delimited string into a corresponding list of elements.

@@ -2679,7 +2679,7 @@

Identifiers, strings, lists and

-

The str.strip([chars]) method returns a copy of str in which any leading or trailing chars are removed. If no chars are specified, it removes all leading and trailing whitespace. [Whitespace is any sequence of spaces, tabs ('\t') and/or newline ('\n') characters.]

+

The str.strip([chars]) method returns a copy of str in which any leading or trailing chars are removed. If no chars are specified, it removes all leading and trailing whitespace. [Whitespace is any sequence of spaces, tabs ('\t') and/or newline ('\n') characters.]

Note that since a blank space is inserted in the output after every item in a comma-delimited list, the last asterisk is printed after a leading blank space is inserted on the new line.
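A small illustration of `str.strip()`, using made-up strings rather than the notebook's data:

```python
padded = '  \tbell, fibrous \n'
stripped = padded.strip()            # no chars given: surrounding whitespace is removed
starred = '**convex**'.strip('*')    # only leading and trailing asterisks are removed
inner = '*a*b*'.strip('*')           # characters in the interior are left alone

print(repr(stripped))
print(repr(starred))
print(repr(inner))
```

The original string is never modified; each call returns a new string.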

@@ -2800,7 +2800,7 @@

Identifiers, strings, lists and

-

The str.join(words) method is the inverse of str.split(), returning a single string in which each string in the sequence of words is separated by str.

+

The str.join(words) method is the inverse of str.split(), returning a single string in which each string in the sequence of words is separated by str.
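To make the inverse relationship concrete, here is a short sketch; the comma-delimited string is a hypothetical stand-in for an instance, not taken from the data set:

```python
instance = 'p,k,f,n,f'                     # hypothetical comma-delimited instance
attribute_values = instance.split(',')     # string -> list of strings
rejoined = ','.join(attribute_values)      # list of strings -> string
spaced = ' '.join(attribute_values)        # any separator string can be used

print(attribute_values)
print(rejoined == instance)
print(spaced)
```

Note that `join()` is called on the separator, and every element of the sequence must already be a string.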

@@ -2841,7 +2841,7 @@

Identifiers, strings, lists and

A number of Python methods can be used on strings, lists and other sequences.

-

The len(s) function can be used to find the length of (number of items in) a sequence s. It will also return the number of items in a dictionary, a data structure we will cover further below.

+

The len(s) function can be used to find the length of (number of items in) a sequence s. It will also return the number of items in a dictionary, a data structure we will cover further below.

@@ -2922,7 +2922,7 @@

Identifiers, strings, lists and

-

The s.count(x) method can be used to count the number of occurrences of item x in sequence s.

+

The s.count(x) method can be used to count the number of occurrences of item x in sequence s.

@@ -2962,7 +2962,7 @@

Identifiers, strings, lists and

-

The s.index(x) method can be used to find the first 0-based index of item x in sequence s.

+

The s.index(x) method can be used to find the first 0-based index of item x in sequence s.
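The sequence operations covered in this part of the notebook (`len()`, `count()` and `index()`) can be tried together on a list and on a string. The values below are made up for illustration:

```python
values = ['p', 'k', 'f', 'n', 'f']      # hypothetical list of attribute values

num_values = len(values)                # number of items in the list
num_f = values.count('f')               # 'f' appears twice
first_f = values.index('f')             # position of the first 'f' (0-based)

word = 'mushroom'
num_o = word.count('o')                 # the same methods work on strings
first_o = word.index('o')

# index() raises ValueError when the item is absent, so guard or catch it
try:
    values.index('z')
    found_z = True
except ValueError:
    found_z = False

print(num_values, num_f, first_f, num_o, first_o, found_z)
```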

@@ -3350,7 +3350,7 @@

Conditionals&#

Another perspective on handling errors championed by some pythonistas is that it is easier to ask forgiveness than permission (EAFP).

As in many practical applications of philosophy, religion or dogma, it is helpful to think before you choose (TBYC). There are a number of factors to consider in deciding whether to follow the EAFP or LBYL paradigm, including code readability and the anticipated likelihood and relative severity of encountering an exception. Oran Looney wrote a blog post providing a nice overview of the debate over LBYL vs. EAFP.

-

We will follow the LBYL paradigm throughout most of this primer. However, as an illustration of EAFP in Python, here is an alternate implementation of the functionality of the code above, using a try/except statement.

+

We will follow the LBYL paradigm throughout most of this primer. However, as an illustration of EAFP in Python, here is an alternate implementation of the functionality of the code above, using a try/except statement.

@@ -3527,7 +3527,7 @@

Defining and calling functions

@@ -3559,7 +3559,7 @@

Defining and calling functions
 x used as a variable: 0 <type 'int'>
-x used as a function: <function x at 0x104e44578> <type 'function'>
+x used as a function: <function x at 0x1050076e0> <type 'function'>
 
 

@@ -3574,7 +3574,7 @@

Defining and calling functions
-

Another way to determine the type of an object is to use isinstance(object, class). This is generally preferable, as it takes into account class inheritance. There is a larger issue of duck typing, and whether code should ever explicitly check for the type of an object, but we will omit further discussion of the topic in this notebook.

+

Another way to determine the type of an object is to use isinstance(object, class). This is generally preferable, as it takes into account class inheritance. There is a larger issue of duck typing, and whether code should ever explicitly check for the type of an object, but we will omit further discussion of the topic in this notebook.

Checking whether an object is a function type will require the use of the types library ... and more thorough checking could be done if one wants to include built-in as well as user-defined functions.
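A rough sketch of both points follows. The `Animal` and `Dog` classes are hypothetical names introduced only for this example, and `x` mirrors the function named `x` used earlier in the notebook:

```python
import types

class Animal(object):        # hypothetical base class
    pass

class Dog(Animal):           # hypothetical subclass
    pass

def x():
    return 0

dog = Dog()
same_type = type(dog) == Animal                           # exact type comparison ignores inheritance
is_animal = isinstance(dog, Animal)                       # isinstance() takes inheritance into account

user_defined = isinstance(x, types.FunctionType)          # functions created with def
builtin_as_function = isinstance(len, types.FunctionType) # built-ins are a different type
builtin = isinstance(len, types.BuiltinFunctionType)
both_callable = callable(x) and callable(len)             # a looser, duck-typing style check

print(same_type, is_animal, user_defined, builtin_as_function, builtin, both_callable)
```

If all that matters is whether an object can be called, `callable()` is usually the simpler test.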

@@ -3809,7 +3809,6 @@

Iteration: for, range

The for statement iterates over the elements of a sequence.

-

The range(stop) function returns a list of values from 0 up to stop - 1 (inclusive).

@@ -3820,12 +3819,8 @@

Iteration: for, range -
-
-
-
-
-

The more general form of the function, range(start, stop[, step]), returns a list of values from start to stop - 1 (inclusive) increasing by step (which defaults to 1), or from start down to stop + 1 (inclusive) decreasing by step if step is negative.

-
-
@@ -3891,9 +3853,8 @@

Iteration: for, range
-
print('range(5, 10):', range(5, 10))
-print('range(10, 5, -1):', range(10, 5, -1))
-print('range(0, 10, 2):', range(0, 10, 2))
+
for c in 'abc':
+    print(c)
 
@@ -3907,9 +3868,9 @@

Iteration: for, range
-range(5, 10): [5, 6, 7, 8, 9]
-range(10, 5, -1): [10, 9, 8, 7, 6]
-range(0, 10, 2): [0, 2, 4, 6, 8]
+a
+b
+c
 
 
@@ -3924,8 +3885,7 @@

Iteration: for, range
-

The xrange([start,] stop[, step]) function is an iterable version of the range() function. In the context of a for loop, it returns the next item of the sequence for each iteration of the loop rather than creating all the elements of the sequence before the first iteration. This can reduce memory consumption in cases where iteration over all the items is not required.

-

The range() function returns a list, which can then be manipulated by any list or sequence methods. An xrange object can only be used in a for loop or the len() function. A related and slightly more general class of container objects, iterators, include a next() method for explicitly returning the next item in the container.

+

In Python 2, the range(stop) function returns a list of values from 0 up to stop - 1 (inclusive). It is often used in the context of a for loop that iterates over the list of values.

@@ -3936,10 +3896,8 @@

Iteration: for, range
- +
+
+
+
+
+

The more general form of the function, range(start, stop[, step]), returns a list of values from start to stop - 1 (inclusive) increasing by step (which defaults to 1), or from start down to stop + 1 (inclusive) decreasing by step if step is negative.

+
+
+
+
+
+
+In [44]: +
+
+
+
for i in range(3, 0, -1):
+    print(i)
+
+ +
+
+
+ +
+
+ + +
+
+
+3
+2
+1
+
+
+
+
+ +
+
+ +
+
+
+
+
+
+

In Python 2, the xrange([start,] stop[, step]) function is an iterable version of the range() function. In the context of a for loop, it returns the next item of the sequence for each iteration of the loop rather than creating all the elements of the sequence before the first iteration. This can reduce memory consumption in cases where iteration over all the items is not required.

+

In Python 3, the range() function behaves the same way as the xrange() function does in Python 2, and so the xrange() function is not defined in Python 3.

+

To maximize compatibility, we will use range() throughout this notebook; however, note that it is generally advisable to use xrange() rather than range() wherever possible in Python 2.
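The compatibility point above can be sketched as follows (a minimal illustration, not taken from the notebook; the variable names are made up). Wrapping `range()` in `list()` gives the same result whether `range()` returns a list (Python 2) or a lazy range object (Python 3), and a `for` loop works with either:

```python
from __future__ import print_function

# list() forces all values to be generated, so the results are the same
# in Python 2 (range() returns a list) and Python 3 (range() is lazy)
ascending = list(range(5, 10))
descending = list(range(10, 5, -1))
evens = list(range(0, 10, 2))

print('range(5, 10):', ascending)
print('range(10, 5, -1):', descending)
print('range(0, 10, 2):', evens)

# a for loop can stop early; with a lazy range (or xrange() in Python 2),
# the remaining values are never generated
first_big_square = None
for i in range(10 ** 6):
    if i * i > 50:
        first_big_square = i * i
        break

print('first square above 50:', first_big_square)
```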

+
+
@@ -4006,8 +4016,8 @@

Modules, namespaces and dotted

-

A Python module is a file containing related definitions (e.g., of functions and variables). Modules are used to help organize Python namespaces, the sets of identifiers accessible in particular contexts. All of the functions and variables we define in this IPython Notebook are in the __main__ namespace, so accessing them does not require any specification of a module.

-

A Python module named simple_ml (in the file simple_ml.py) contains a set of solutions to the exercises in this IPython Notebook. Accessing functions in that module requires that we first import the module, and then prefix the function names with the module name followed by a dot (this is known as dotted notation).

+

A Python module is a file containing related definitions (e.g., of functions and variables). Modules are used to help organize Python namespaces, the sets of identifiers accessible in particular contexts. All of the functions and variables we define in this IPython Notebook are in the __main__ namespace, so accessing them does not require any specification of a module.

+

A Python module named simple_ml (in the file simple_ml.py) contains a set of solutions to the exercises in this IPython Notebook. Accessing functions in that module requires that we first import the module, and then prefix the function names with the module name followed by a dot (this is known as dotted notation).

For example, the following function call in Exercise 1 below:

simple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)

uses dotted notation to reference the print_attribute_names_and_values() function in the simple_ml module.
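Since `simple_ml` is specific to this repository, here is a small sketch of the same idea using the standard library `math` module instead (an assumption made only so the example runs anywhere). It contrasts dotted notation with importing a name directly into the current namespace:

```python
from __future__ import print_function
import math             # import the module; its names are reached via dotted notation
from math import floor  # import one name directly into the current namespace

root = math.sqrt(16)       # dotted notation: module name, dot, function name
rounded_down = floor(2.7)  # no module prefix is needed after from ... import

print(root, rounded_down)
```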

@@ -4039,7 +4049,7 @@

Exercise 1: defin
-In [44]: +In [45]:
@@ -4127,7 +4137,7 @@

File I/O

-In [45]: +In [46]:
@@ -4186,7 +4196,7 @@

Exercise 2: define load_instances()
-In [46]: +In [47]:
@@ -4233,7 +4243,7 @@

Exercise 2: define load_instances()

-

Output to a text file is usually done via the file.write(str) method.

+

Output to a text file is usually done via the file.write(str) method.

As we saw earlier, the str.join(words) method returns a single str-delimited string containing each of the strings in the list words.

SQL and Hive database tables often use the pipe ('|') delimiter to separate column values for each row when they are stored as flat files. The following code creates a new data file using pipes rather than commas to separate the attribute values.

To help maintain internal consistency, it is generally a good practice to define a variable such as delimiter or separator and bind it to the intended delimiter string, and then use the variable throughout.
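A minimal sketch of the delimiter practice described above. The rows are made-up values, and `io.StringIO` is used as an in-memory stand-in for a file opened for writing, so nothing is written to disk; neither is taken from the notebook:

```python
from __future__ import print_function
import io

delimiter = '|'  # bind the delimiter once and reuse the variable everywhere

# two made-up rows of single-character attribute values
rows = [['p', 'x', 's'], ['e', 'b', 'y']]

# io.StringIO behaves like a text file opened for writing
output = io.StringIO()
for row in rows:
    # join the values with the delimiter and end each row with a newline
    output.write(u'{}\n'.format(delimiter.join(row)))

written = output.getvalue()
print(written)
```

Changing the output format to, say, tab-delimited then only requires changing the one line that binds `delimiter`.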

@@ -4243,7 +4253,7 @@

Exercise 2: define load_instances()
-In [47]: +In [48]:
@@ -4304,7 +4314,7 @@

List comprehensions

Python provides a powerful list comprehension construct to simplify the creation of a list by specifying a formula in a single expression.

Some programmers find list comprehensions confusing, and avoid their use. We won't rely on list comprehensions here, but will show examples with and without list comprehensions below.

-

One common use of list comprehensions is in the context of the str.join(words) method we saw earlier.

+

One common use of list comprehensions is in the context of the str.join(words) method we saw earlier.

If we wanted to construct a pipe-delimited string containing elements of the list, we could use a for loop to iteratively add list elements and pipe delimiters to a string. We would thereby add one pipe delimiter too many, and would thus have to shave that off at the end.

@@ -4312,7 +4322,7 @@

List comprehensions
-In [48]: +In [49]:
@@ -4359,7 +4369,7 @@

List comprehensions
-In [49]: +In [50]:
@@ -4404,7 +4414,7 @@

List comprehensions
-In [50]: +In [51]:
@@ -4443,7 +4453,7 @@

List comprehensions
-In [51]: +In [52]:
@@ -4490,7 +4500,7 @@

Dictionaries (dicts)

Although single character abbreviations of attribute values (e.g., 'x') allow for more compact data files, they are not as easy to understand by human readers as the longer attribute value descriptions (e.g., 'convex').

-

A Python dictionary (or dict) is an unordered, comma-delimited collection of key, value pairs, serving a similar function to a hash table or hashmap in other programming languages.

+

A Python dictionary (or dict) is an unordered, comma-delimited collection of key, value pairs, serving a similar function to a hash table or hashmap in other programming languages.

We could create a dictionary for the cap-type attribute values shown above:

bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s

@@ -4503,7 +4513,7 @@

Dictionaries (dicts)
-In [52]: +In [53]:
@@ -4552,7 +4562,7 @@

Dictionaries (dicts)
-In [53]: +In [54]:
@@ -4599,7 +4609,7 @@

Dictionaries (dicts)
-In [54]: +In [55]:
@@ -4647,7 +4657,7 @@

Dictionaries (dicts)
-In [55]: +In [56]:
@@ -4733,7 +4743,7 @@

Exercise 3: define load_attr
-In [56]: +In [57]:
@@ -4808,7 +4818,7 @@

Counting

-In [57]: +In [58]:
@@ -4855,7 +4865,7 @@

Counting

-In [58]: +In [59]:
@@ -4903,7 +4913,7 @@

Counting

-

The Python collections module provides a number of high performance container datatypes. A frequently useful datatype is a defaultdict, which automatically creates an appropriate default value for a new key. For example, a defaultdict(int) automatically initializes a new dictionary entry to 0 (zero); a defaultdict(list) automatically initializes a new dictionary entry to the empty list ([]).

+

The Python collections module provides a number of high performance container datatypes. A frequently useful datatype is a defaultdict, which automatically creates an appropriate default value for a new key. For example, a defaultdict(int) automatically initializes a new dictionary entry to 0 (zero); a defaultdict(list) automatically initializes a new dictionary entry to the empty list ([]).

After first importing defaultdict from collections, we can use defaultdict(int) to simplify the code above:

@@ -4911,7 +4921,7 @@

Counting

-In [59]: +In [60]:
@@ -4976,7 +4986,7 @@

Exercise 4: define attribut
-In [60]: +In [61]:
@@ -5034,14 +5044,14 @@

Sorting

Earlier, we saw that there is a list.sort() method that will sort a list in-place, i.e., by replacing the original value of list with a sorted version of the elements in list.

-

The Python sorted(iterable[, cmp[, key[, reverse]]]) function returns a new list containing the items of the list, dictionary (its keys) or other iterable container it is passed, sorted in ascending order.

+

The Python sorted(iterable[, cmp[, key[, reverse]]]) function returns a new list containing the items of the list, dictionary (its keys) or other iterable container it is passed, sorted in ascending order.

-In [61]: +In [62]:
@@ -5085,7 +5095,7 @@

Sorting

-In [62]: +In [63]:
@@ -5125,7 +5135,7 @@

Sorting

-In [63]: +In [64]:
@@ -5165,7 +5175,7 @@

Sorting

-In [64]: +In [65]:
@@ -5204,14 +5214,14 @@

Sorting

-

An optional keyword argument, reverse, can be used to reverse the order of the sorted list returned by the function. The default value of this optional parameter is False; to get non-default behavior, we must specify the name and value of the argument: reverse=True.

+

An optional keyword argument, reverse, can be used to reverse the order of the sorted list returned by the function. The default value of this optional parameter is False; to get non-default behavior, we must specify the name and value of the argument: reverse=True.
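A short sketch of the `reverse` keyword argument, using a made-up list of counts rather than the notebook's data. It also shows that `sorted()` leaves its argument unchanged, unlike `list.sort()`:

```python
from __future__ import print_function

counts = [3, 1, 2]

ascending = sorted(counts)                 # reverse defaults to False
descending = sorted(counts, reverse=True)  # keyword argument given by name

print('ascending:', ascending)
print('descending:', descending)
print('original is unchanged:', counts)
```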

-In [65]: +In [66]:
@@ -5242,7 +5252,7 @@

Sorting

-In [66]: +In [67]:
@@ -5273,7 +5283,7 @@

Sorting

-In [67]: +In [68]:
@@ -5339,14 +5349,14 @@

Sorting a dictionary by values
-

The dict.items() method returns an unordered list of (key, value) tuples in dict.

+

The dict.items() method returns an unordered list of (key, value) tuples in dict.

-In [68]: +In [69]:
@@ -5362,7 +5372,7 @@

Sorting a dictionary by values
- Out[68]:
+ Out[69]:

@@ -5387,43 +5397,10 @@

Sorting a dictionary by values
-

A related method, dict.iteritems(), returns an iterator - a callable object that returns the next item in a sequence each time it is referenced (e.g., during each iteration of a for loop), which can be more efficient than generating all the items in the sequence before any are used. This is similar to the distinction between xrange() and range() described above.

-
-

-
-
-
-
-In [69]: -
-
-
-
attribute_values_cap_type.iteritems()
-
- -
+

In Python 2, a related method, dict.iteritems(), returns an iterator: an object that returns the next item in a sequence each time it is referenced (e.g., during each iteration of a for loop), which can be more efficient than generating all the items in the sequence before any are used ... and so should be used rather than items() wherever possible.

+

This is similar to the distinction between xrange() and range() described above ... and, also similarly, dict.items() returns a lazily evaluated view rather than a list in Python 3, and so dict.iteritems() is no longer needed (nor defined) ... and further similarly, we will use only dict.items() in this notebook, but dict.iteritems() should be used wherever possible in Python 2.
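The following sketch, using a small made-up dictionary of abbreviations, shows a pattern that should behave the same way in Python 2 and Python 3: iterate over `dict.items()` directly, and wrap it in `sorted()` or `list()` whenever an actual list is needed (e.g., for indexing):

```python
from __future__ import print_function

# hypothetical abbreviations, for illustration only
cap_shapes = {'b': 'bell', 'c': 'conical', 'x': 'convex'}

# iterating directly over items() works in both Python 2 and Python 3
lines = []
for key, value in cap_shapes.items():
    lines.append('{}: {}'.format(key, value))

# sorted() accepts a list or a view and always returns a new list,
# which can then be indexed
pairs = sorted(cap_shapes.items())
first_pair = pairs[0]

print(sorted(lines))
print(first_pair)
```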

- -
-
- - -
- Out[69]:
- - -
-
-<dictionary-itemiterator at 0x107481730>
-
-
- -
- -
-
-
@@ -5432,7 +5409,7 @@

Sorting a dictionary by values
-
for key, value in attribute_values_cap_type.iteritems():
+
for key, value in attribute_values_cap_type.items():
     print(key, ':', value)
 
@@ -5467,10 +5444,10 @@

Sorting a dictionary by values
-

The Python operator module contains a number of functions that perform object comparisons, logical operations, mathematical operations, sequence operations, and abstract type tests.

-

To facilitate sorting a dictionary by values, we will use the operator.itemgetter(item) function that can be used to retrieve an indexed value (item) in a tuple (such as a (key, value) pair returned by [iter]items()).

+

The Python operator module contains a number of functions that perform object comparisons, logical operations, mathematical operations, sequence operations, and abstract type tests.

+

To facilitate sorting a dictionary by values, we will use the operator.itemgetter(item) function that can be used to retrieve an indexed value (item) in a tuple (such as a (key, value) pair returned by [iter]items()).

We can use operator.itemgetter(1) to reference the value - the 2nd item in each (key, value) tuple (at zero-based index position 1) - rather than the key - the first item in each (key, value) tuple (at index position 0).

-

We will use the optional keyword argument key in sorted(iterable[, cmp[, key[, reverse]]]) to specify a sorting key that is not the same as the dict key (the dict key is the default sorting key).

+

We will use the optional keyword argument key in sorted(iterable[, cmp[, key[, reverse]]]) to specify a sorting key that is not the same as the dict key (the dict key is the default sorting key).

@@ -5483,7 +5460,7 @@

Sorting a dictionary by values
import operator
 
-sorted(attribute_values_cap_type.iteritems(), key=operator.itemgetter(1))
+sorted(attribute_values_cap_type.items(), key=operator.itemgetter(1))
 

@@ -5535,7 +5512,7 @@

Sorting a dictionary by values
 value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)
 
 print('Counts for each value of', attribute, '(sorted by count):')
-for value, count in sorted(value_counts.iteritems(), key=operator.itemgetter(1), reverse=True):
+for value, count in sorted(value_counts.items(), key=operator.itemgetter(1), reverse=True):
     print(value, ':', count)

@@ -6215,7 +6192,7 @@

Building a Simple Decision Tree

In building a decision tree, we will need to split the instances based on the index of the best attribute, i.e., the attribute that offers the highest information gain. We will use separate utility functions to handle these subtasks. To simplify the functions, we will rely exclusively on attribute indexes rather than attribute names.

Note: the algorithm above is recursive, i.e., there is a recursive call to ID3 within the definition of ID3. Covering recursion is beyond the scope of this primer, but there are a number of other resources on using recursion in Python. Familiarity with recursion will be important for understanding both the tree construction and classification functions below.

-

First, we will define a function to split a set of instances based on any attribute. This function will return a dictionary in which each key is a distinct value of the attribute at the specified attribute_index, and each value is a list representing the subset of instances that have that attribute value.

+

First, we will define a function, split_instances(instances, attribute_index), to split a set of instances based on any attribute. This function will return a dictionary in which each key is a distinct value of the attribute at the specified attribute_index, and each value is a list representing the subset of instances that have that attribute value.
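One way such a function could be written is sketched below, using `defaultdict(list)` so that each new attribute value starts with an empty list. This is only an illustration of the described behavior under that assumption; the notebook's own definition may differ in its details, and `toy_instances` is made-up data:

```python
from __future__ import print_function
from collections import defaultdict

def split_instances(instances, attribute_index):
    '''Returns a dictionary mapping each distinct value found at attribute_index
    to the list of instances that have that value'''
    partitions = defaultdict(list)
    for instance in instances:
        # the first time a value is seen, defaultdict creates an empty list for it
        partitions[instance[attribute_index]].append(instance)
    return partitions

# three made-up instances; the class label is at index 0
toy_instances = [['e', 'x', 'y'], ['p', 'x', 'w'], ['e', 'b', 'w']]

by_class = split_instances(toy_instances, 0)
for value in sorted(by_class):
    print(value, ':', by_class[value])
```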

@@ -6331,8 +6308,8 @@

Exercise 8: define cho

A leaf node in a decision tree represents the most frequently occurring - or majority - class value for that path through the tree. We will need a function that determines the majority value for the class index among a set of instances.

We earlier saw how the defaultdict container in the collections module can be used to simplify the construction of a dictionary containing the counts of all attribute values for all attributes, by automatically setting the count for any attribute value to zero when the attribute value is first added to the dictionary.

-

The collections module has another useful container, a Counter class, that can further simplify the construction of a specialized dictionary of counts. When a Counter object is instantiated with a list of items, it returns a dictionary-like container in which the keys are the unique items in the list, and the values are the counts of each unique item in that list.

-

This container has an additional method, most_common([n]), which returns a list of 2-element tuples representing the values and their associated counts for the most common n values; if n is omitted, the method returns all tuples.

+

The collections module has another useful container, a Counter class, that can further simplify the construction of a specialized dictionary of counts. When a Counter object is instantiated with a list of items, it returns a dictionary-like container in which the keys are the unique items in the list, and the values are the counts of each unique item in that list.

+

This container has an additional method, most_common([n]), which returns a list of 2-element tuples representing the values and their associated counts for the most common n values; if n is omitted, the method returns all tuples.

The following is an example of how we can use a Counter to represent the frequency of different class labels, and how we can identify the most frequent value and its count.

@@ -6684,8 +6661,8 @@

Exercise 9: define majority_value()
     # if no candidate_attribute_indexes are provided, assume that we will use all but the target_attribute_index
     if candidate_attribute_indexes is None:
-        candidate_attribute_indexes = range(len(instances[0]))
-        candidate_attribute_indexes.remove(class_index)
+        candidate_attribute_indexes = [i for i in range(len(instances[0])) if i != class_index]
+        #candidate_attribute_indexes.remove(class_index)
 
     class_labels_and_counts = Counter([instance[class_index] for instance in instances])
@@ -6805,8 +6782,8 @@

Exercise 9: define majority_value()

The structure of the tree shown above is rather difficult to discern from the normal printed representation of a dictionary.

-

The Python pprint module has a number of useful methods for pretty-printing or formatting objects in a more human readable way.

-

The pprint.pprint(object, stream=None, indent=1, width=80, depth=None) method will print object to a stream (a default value of None will dictate the use of sys.stdout, the same destination as print statement output), using indent spaces to differentiate nesting levels, using up to a maximum of width columns and up to a maximum nesting level of depth (None indicating no maximum).

+

The Python pprint module has a number of useful methods for pretty-printing or formatting objects in a more human readable way.

+

The pprint.pprint(object, stream=None, indent=1, width=80, depth=None) method will print object to a stream (a default value of None will dictate the use of sys.stdout, the same destination as print function output), using indent spaces to differentiate nesting levels, using up to a maximum of width columns and up to a maximum nesting level of depth (None indicating no maximum).

We will use a variation on the import statement that imports one or more functions into the current namespace:

from pprint import pprint

This will enable us to use pprint() rather than having to use dotted notation, i.e., pprint.pprint().
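A brief sketch of the `width` and `depth` parameters described above. The nested dictionary `toy_tree` is hypothetical and much smaller than a real decision tree, and `pformat()` (which returns the formatted string instead of printing it) is used here only to make the result easy to inspect:

```python
from __future__ import print_function
from pprint import pprint, pformat

# a small hypothetical tree: attribute indexes map to dicts of attribute values
toy_tree = {5: {'a': 'e', 'n': {20: {'r': 'p', 'w': 'e'}}}}

# a narrow width forces the nested levels onto separate lines
pprint(toy_tree, width=20)

# depth=1 keeps only the top level; deeper levels are elided
summary = pformat(toy_tree, depth=1)
print(summary)
```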

@@ -6874,7 +6851,7 @@

Classifying Instances

Usually, when we construct a decision tree based on a set of training instances, we do so with the intent of using that tree to classify a set of one or more testing instances.

-

We will define a function, classify(tree, instance, default_class=None), to use a decision tree to classify a single instance, where an optional default_class can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.

+

We will define a function, classify(tree, instance, default_class=None), to use a decision tree to classify a single instance, where an optional default_class can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.

We will use a design pattern in which we will use a series of if statements, each of which returns a value if the condition is true, rather than a nested series of if, elif and/or else clauses, as it helps constrain the levels of indentation in the function.

@@ -6892,8 +6869,8 @@

Classifying Instances
         return default_class
     if not isinstance(tree, dict):
         return tree
-    attribute_index = tree.keys()[0]
-    attribute_values = tree.values()[0]
+    attribute_index = list(tree.keys())[0] # using list(dict.keys()) for Python 3 compatibility
+    attribute_values = list(tree.values())[0]
     instance_attribute_value = instance[attribute_index]
     if instance_attribute_value not in attribute_values:
         return default_class
@@ -6978,7 +6955,7 @@

Evaluating the Accura
def classification_accuracy(tree, testing_instances, class_index=0, default_class=None):
     '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index'''
     num_correct = 0
-    for i in xrange(len(testing_instances)):
+    for i in range(len(testing_instances)):
         prediction = classify(tree, testing_instances[i], default_class)
         actual_value = testing_instances[i][class_index]
         if prediction == actual_value:
@@ -7014,7 +6991,7 @@ 

Evaluating the Accura

-

The zip([iterable, ...]) function combines 2 or more sequences or iterables; the function returns a list of tuples, where the ith tuple contains the ith element from each of the argument sequences or iterables.

+

The zip([iterable, ...]) function combines 2 or more sequences or iterables; the function returns a list of tuples, where the ith tuple contains the ith element from each of the argument sequences or iterables.
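A brief sketch of `zip()` with made-up lists. Note that in Python 3 `zip()` returns an iterator rather than a list, so the sketch wraps each call in `list()` to behave the same way in both versions:

```python
from __future__ import print_function

names = ['class', 'cap-shape', 'cap-surface']
values = ['p', 'x', 's']

# the ith tuple pairs the ith name with the ith value
pairs = list(zip(names, values))
print(pairs)

# when the arguments differ in length, zip() stops at the shortest one
truncated = list(zip([1, 2, 3], 'ab'))
print(truncated)
```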

@@ -7110,7 +7087,7 @@

Evaluating the Accura

We sometimes want to partition the instances into subsets of equal sizes to measure performance. One metric this partitioning allows us to compute is a learning curve, i.e., assess how well the model performs based on the size of its training set. Another use of these partitions (aka folds) would be to conduct an n-fold cross validation evaluation.

-

The following function, partition_instances(instances, num_partitions), partitions a set of instances into num_partitions relatively equally sized subsets.

+

The following function, partition_instances(instances, num_partitions), partitions a set of instances into num_partitions relatively equally sized subsets.

We'll use this as yet another opportunity to demonstrate the power of using list comprehensions, this time, to condense the use of nested for loops.

@@ -7124,8 +7101,8 @@

Evaluating the Accura
def partition_instances(instances, num_partitions):
     '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''
-    return [[instances[j] for j in xrange(i, len(instances), num_partitions)] \
-            for i in xrange(num_partitions)]
+    return [[instances[j] for j in range(i, len(instances), num_partitions)] \
+            for i in range(num_partitions)]
 
@@ -7152,7 +7129,7 @@

Evaluating the Accura
instance_length = 3
 num_instances = 5
 
-simplified_instances = [[j for j in xrange(i, instance_length + i)] for i in xrange(num_instances)]
+simplified_instances = [[j for j in range(i, instance_length + i)] for i in range(num_instances)]
 
 print('Instances:', simplified_instances)
 partitions = partition_instances(simplified_instances, 2)
@@ -7200,18 +7177,18 @@ 

Evaluating the Accura
def partition_instances(instances, num_partitions):
     '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''
     partitions = []
-    for i in xrange(num_partitions):
+    for i in range(num_partitions):
         partition = []
         # iterate over instances starting at position i in increments of num_partitions
-        for j in xrange(i, len(instances), num_partitions): 
+        for j in range(i, len(instances), num_partitions): 
             partition.append(instances[j])
         partitions.append(partition)
     return partitions
 
 simplified_instances = []
-for i in xrange(num_instances):
+for i in range(num_instances):
     new_instance = []
-    for j in xrange(i, instance_length + i):
+    for j in range(i, instance_length + i):
         new_instance.append(j)
     simplified_instances.append(new_instance)
 
@@ -7247,7 +7224,7 @@ 

Evaluating the Accura

-

The enumerate(sequence, start=0) function creates an iterator that successively returns the index and value of each element in a sequence, beginning at the start index.

+

The enumerate(sequence, start=0) function creates an iterator that successively returns the index and value of each element in a sequence, beginning at the start index.
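A short sketch of `enumerate()` with a made-up list, including the optional `start` argument:

```python
from __future__ import print_function

letters = ['a', 'b', 'c']

# each iteration yields an (index, value) pair
for index, letter in enumerate(letters):
    print(index, letter)

indexed = list(enumerate(letters))               # indexes start at 0 by default
indexed_from_one = list(enumerate(letters, start=1))
print(indexed_from_one)
```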

@@ -7301,7 +7278,7 @@

Evaluating the Accura

-
for i in xrange(5):
+
for i in range(5):
     print('\n# partitions:', i)
     for j, partition in enumerate(partition_instances(simplified_instances, i)):
         print('partition {}: {}'.format(j, partition))
@@ -7530,7 +7507,7 @@ 

Evaluating the Accura

Now that we can partition our instances into subsets, we can use these subsets to construct different-sized training sets in the process of computing a learning curve.

We will start off with an initial training set consisting only of the first partition, and then progressively extend that training set by adding a new partition during each iteration of computing the learning curve.

-

The list.extend(L) method enables us to extend list by appending all the items in another list, L, to the end of list.

+

The list.extend(L) method enables us to extend list by appending all the items in another list, L, to the end of list.
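To make the behavior of `list.extend()` concrete, here is a small sketch with made-up lists that also contrasts it with `list.append()`, which would add the other list as a single nested element:

```python
from __future__ import print_function

training = [1, 2]
training.extend([3, 4])  # adds each item of the other list individually
print(training)

nested = [1, 2]
nested.append([3, 4])    # adds the other list itself as one element
print(nested)
```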

@@ -7572,7 +7549,7 @@

Evaluating the Accura

-

We can now define the function, compute_learning_curve(instances, num_partitions=10), that will take a list of instances, partition it into num_partitions relatively equally sized disjoint partitions, and then iteratively evaluate the accuracy of models trained with an incrementally increasing combination of instances in the first num_partitions - 1 partitions then tested with instances in the last partition. That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on.

+

We can now define the function, compute_learning_curve(instances, num_partitions=10), that will take a list of instances, partition it into num_partitions relatively equally sized disjoint partitions, and then iteratively evaluate the accuracy of models trained with an incrementally increasing combination of instances in the first num_partitions - 1 partitions then tested with instances in the last partition. That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on.

The function will return a list of num_partitions - 1 tuples representing the size of the training set and the accuracy of a tree trained with that set (and tested on the num_partitions - 1 set). This will provide some indication of the relative impact of the size of the training set on model performance.

@@ -7597,7 +7574,7 @@

Evaluating the Accura
     testing_instances = partitions[-1][:]
     training_instances = []
     accuracy_list = []
-    for i in xrange(0, num_partitions - 1):
+    for i in range(0, num_partitions - 1):
         # for each iteration, the training set is composed of partitions 0 through i - 1
         training_instances.extend(partitions[i][:])
         tree = create_decision_tree(training_instances)
@@ -7697,11 +7674,11 @@

special method names, all of which are denoted by leading and trailing double underscores. For example, the object.__init__(self[, ...]) method is used to specify instructions that should be executed whenever a new object of a class is instantiated.

-

The code below defines a class, SimpleDecisionTree, with a single pseudo-protected member variable _tree and a pseudo-protected tree construction method _create(), two public methods - classify() and pprint() - and an initialization method that takes an optional list of training instances and a target_attribute_index.

+

Python has a number of pre-defined special method names, all of which are denoted by leading and trailing double underscores. For example, the object.__init__(self[, ...]) method is used to specify instructions that should be executed whenever a new object of a class is instantiated.

+

The code below defines a class, SimpleDecisionTree, with a single pseudo-protected member variable _tree and a pseudo-protected tree construction method _create(), two public methods - classify() and pprint() - and an initialization method that takes an optional list of training instances and a target_attribute_index.

The _create() method is identical to the create_decision_tree() function above, with the inclusion of the self parameter (as it is now a class method). The classify() method is a similarly modified version of the classify() and classification_accuracy() functions above, with references to tree converted to self._tree. The pprint() method prints the tree in a human-readable format.

Note that other machine learning libraries may use different terminology for the methods we've defined here. For example, in the sklearn.tree.DecisionTreeClassifier class (and in most sklearn classifier classes), the method for constructing a classifier is named fit() - since it "fits" the data to a model - and the method for classifying instances is named predict() - since it is predicting the class label for an instance.

-

Most comments and the use of the trace parameter have been removed to make the code more compact, but are included in the version found in SimpleDecisionTree.py.

+

Most comments and the use of the trace parameter have been removed to make the code more compact, but are included in the version found in SimpleDecisionTree.py.

@@ -7752,8 +7729,8 @@

return default_class if not isinstance(tree, dict): return tree - attribute_index = tree.keys()[0] - attribute_values = tree.values()[0] + attribute_index = list(tree.keys())[0] # using list(dict.keys()) for Python 3 compatibility + attribute_values = list(tree.values())[0] instance_attribute_value = instance[attribute_index] if instance_attribute_value not in attribute_values: return default_class From 4ff6be79776ffd5a5aa29c77af6cdec94ec6dec8 Mon Sep 17 00:00:00 2001 From: Joe McCarthy Date: Tue, 24 Feb 2015 22:38:26 -0800 Subject: [PATCH 05/16] Removed errant backslashes --- Python_for_Data_Science_all.html | 2 +- Python_for_Data_Science_all.ipynb | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/Python_for_Data_Science_all.html b/Python_for_Data_Science_all.html index 7c68b04..2e7ff51 100644 --- a/Python_for_Data_Science_all.html +++ b/Python_for_Data_Science_all.html @@ -2679,7 +2679,7 @@

Identifiers, strings, lists and

-

The str.strip(\[chars\]) method returns a copy of str in which any leading or trailing chars are removed. If no chars are specified, it removes all leading and trailing whitespace. [Whitespace is any sequence of spaces, tabs ('\t') and/or newline ('\n') characters.]

+

The str.strip([chars]) method returns a copy of str in which any leading or trailing chars are removed. If no chars are specified, it removes all leading and trailing whitespace. [Whitespace is any sequence of spaces, tabs ('\t') and/or newline ('\n') characters.]

Note that since a blank space is inserted in the output after every item in a comma-delimited list, the last asterisk is printed after a leading blank space is inserted on the new line.
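To illustrate the `str.strip()` behavior described above, here is a minimal sketch with made-up strings. Note that `chars` is treated as a set of characters to remove from both ends, not as a prefix or suffix:

```python
from __future__ import print_function

padded = '  \t hello world \n'
stripped = padded.strip()        # no chars given: remove surrounding whitespace

starred = '**bold**'.strip('*')  # remove leading and trailing asterisks

# any combination of the given characters is removed from both ends;
# characters in the middle of the string are left alone
mixed = 'xxhixyx'.strip('xy')

print(repr(stripped), repr(starred), repr(mixed))
```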

diff --git a/Python_for_Data_Science_all.ipynb b/Python_for_Data_Science_all.ipynb index 4ef3ce1..3185b92 100644 --- a/Python_for_Data_Science_all.ipynb +++ b/Python_for_Data_Science_all.ipynb @@ -1,7 +1,7 @@ { "metadata": { "name": "", - "signature": "sha256:e4e44d764641165b5be7b5e9e173657c21423c423f68e3f034e2dd22505466ed" + "signature": "sha256:bcca0950b7dac5cf922034044015c39bca27062f58925c71f26294b7f0bfd371" }, "nbformat": 3, "nbformat_minor": 0, @@ -799,7 +799,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The [**`str.strip(\\[chars\\]`**)](http://docs.python.org/2/library/stdtypes.html#str.strip) method returns a copy of `str` in which any leading or trailing `chars` are removed. If no `chars` are specified, it removes all leading and trailing whitespace. [*Whitespace* is any sequence of spaces, tabs (`'\\t'`) and/or newline (`'\\n'`) characters.] \n", + "The [**`str.strip([chars]`**)](http://docs.python.org/2/library/stdtypes.html#str.strip) method returns a copy of `str` in which any leading or trailing `chars` are removed. If no `chars` are specified, it removes all leading and trailing whitespace. [*Whitespace* is any sequence of spaces, tabs (`'\\t'`) and/or newline (`'\\n'`) characters.] \n", "\n", "Note that since a blank space is inserted in the output after every item in a comma-delimited list, the last asterisk is printed after a leading blank space is inserted on the new line." ] From d5ec7c57d93cffff9386d554d89b88dda87b40e5 Mon Sep 17 00:00:00 2001 From: Joe McCarthy Date: Wed, 25 Feb 2015 13:55:57 -0800 Subject: [PATCH 06/16] Updated README.md, shortened the Py2 vs. 
Py3 section --- Python_for_Data_Science_all.html | 9 +++------ Python_for_Data_Science_all.ipynb | 14 ++++---------- README.md | 14 ++++++++++---- SimpleDecisionTree.py | 18 ++++++++++-------- 4 files changed, 27 insertions(+), 28 deletions(-) diff --git a/Python_for_Data_Science_all.html b/Python_for_Data_Science_all.html index 2e7ff51..d980b94 100644 --- a/Python_for_Data_Science_all.html +++ b/Python_for_Data_Science_all.html @@ -2037,18 +2037,15 @@

A note on Python 2 vs. Python 3

There are 2 major versions of Python in widespread use: Python 2 and Python 3. Python 3 has some features that are not backward compatible with Python 2, and some Python 2 libraries have not been updated to work with Python 3. I have been using Python 2, primarily because I use some of those Python 2[-only] libraries, but an increasing proportion of them are migrating to Python 3, and I anticipate shifting to Python 3 in the near future.

-

For more on the topic, I recommend a very well documented IPython Notebook, which includes numerous helpful examples and links, by Sebastian Raschka, Key differences between Python 2.7.x and Python 3.x ... or googling Python 2 vs 3.

-

I received an email request from a Python 3 programmer who suggested that a relatively minor change in this notebook would enable it to run with Python 2 or Python 3: importing the print_function from __future__, and changing my print statements (Python 2) to print function calls (Python 3). Although a relatively minor conceptual change, it necessitates the changing of many cells to reflect the Python 3 print syntax.

-

I find the arguments for making print a function rather than a statement compelling - especially as it is more consistent with printing functionality in many other programming languages - and so while I do not want to convert this notebook to a Python 3 notebook, I have implemented this change so that it can be used in either Python 2 or Python 3. However, while I have verified that it still works in Python 2, I have not tested it in Python 3.

-

I also find the arguments for changing the division operator compelling, so will import that as well. Without this import in Python 2, 1 / 2 returns 0 (the integer portion of the quotient); with this import, 1 / 2 returns 0.5, and if you want only the integer portion of the quotient (floor division), you can use 1 // 2 (which works the same way in Python 2 and Python 3).
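The division behavior described above can be checked with a short sketch. It assumes Python 3, or Python 2 with `from __future__ import division` as the first statement in the file; the variable names are illustrative only.

```python
# In Python 2, `from __future__ import division` must be the first statement
# in the file for `/` to behave as shown here; Python 3 behaves this way
# by default.

true_quotient = 1 / 2      # true division yields a float
floor_quotient = 1 // 2    # floor division yields an int for int operands
negative_floor = -1 // 2   # floor division rounds toward negative infinity

print(true_quotient, floor_quotient, negative_floor)
```

One detail worth keeping in mind: `//` floors the result rather than truncating it, so it matches "the integer portion of the quotient" only for non-negative results.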

-

Note that if you don't understand some/any of the above discussion about Python 2 and Python 3, it should not affect your ability to understand the rest of this notebook.

+

For more on the topic, I recommend a very well documented IPython Notebook, which includes numerous helpful examples and links, by Sebastian Raschka, Key differences between Python 2.7.x and Python 3.x, the Cheat Sheet: Writing Python 2-3 compatible code by Ed Schofield ... or googling Python 2 vs 3.

+

Nick Coghlan, a CPython core developer, sent me an email suggesting that relatively minor changes in this notebook would enable it to run with Python 2 or Python 3: importing the print_function from the __future__ module, and changing my print statements (Python 2) to print function calls (Python 3). Although a relatively minor conceptual change, it required changing many individual cells to reflect the Python 3 print syntax. I also needed to replace a few functions that are no longer available in Python 3 with related functions that are available in both versions; in nearby cells, I've added notes explaining how the replacements relate to the removed functions and why the originals are no longer available.
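The kinds of replacements mentioned above (and listed in the change log of this patch series: `xrange()` to `range()`, `dict.iteritems()` to `dict.items()`, and wrapping `dict.keys()` in `list()`) can be sketched as follows. The `tree` dictionary is a hypothetical one-node example, not data from the notebook.

```python
# In Python 2, `from __future__ import print_function, division` would be
# the first statement in the file; Python 3 needs no import.

# Hypothetical one-node decision tree: attribute index -> {value: label}.
tree = {3: {'a': 'e', 'f': 'p'}}

# dict.keys() and dict.values() return views in Python 3, which cannot be
# indexed, so wrap them in list() to get code that works in both versions.
attribute_index = list(tree.keys())[0]
attribute_values = list(tree.values())[0]

# range() and dict.items() exist in both versions
# (xrange() and dict.iteritems() are Python 2 only).
squares = [i * i for i in range(4)]
pairs = sorted(attribute_values.items())

print('attribute index: {}'.format(attribute_index))
print('value/label pairs: {}'.format(pairs))
```

In Python 2, `range()` and `items()` build full lists rather than lazy iterators, which may matter for very large inputs but is unlikely to be noticeable at the scale of this primer.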

-In [2]: +In [1]:
diff --git a/Python_for_Data_Science_all.ipynb b/Python_for_Data_Science_all.ipynb index 3185b92..4ed8ee4 100644 --- a/Python_for_Data_Science_all.ipynb +++ b/Python_for_Data_Science_all.ipynb @@ -1,7 +1,7 @@ { "metadata": { "name": "", - "signature": "sha256:bcca0950b7dac5cf922034044015c39bca27062f58925c71f26294b7f0bfd371" + "signature": "sha256:f066be0cc350e6372bbdb872eb18175b0f4453d758f4206d4d337a6a9f7e6fde" }, "nbformat": 3, "nbformat_minor": 0, @@ -350,15 +350,9 @@ "source": [ "There are 2 major versions of Python in widespread use: [Python 2](https://docs.python.org/2/) and [Python 3](https://docs.python.org/3/). Python 3 has some features that are not backward compatible with Python 2, and some Python 2 libraries have not been updated to work with Python 3. I have been using Python 2, primarily because I use some of those Python 2[-only] libraries, but an increasing proportion of them are migrating to Python 3, and I anticipate shifting to Python 3 in the near future.\n", "\n", - "For more on the topic, I recommend a very well documented IPython Notebook, which includes numerous helpful examples and links, by [Sebastian Raschka](http://sebastianraschka.com/), [Key differences between Python 2.7.x and Python 3.x](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/key_differences_between_python_2_and_3.ipynb) ... or [googling Python 2 vs 3](https://www.google.com/q=python%202%20vs%203).\n", + "For more on the topic, I recommend a very well documented IPython Notebook, which includes numerous helpful examples and links, by [Sebastian Raschka](http://sebastianraschka.com/), [Key differences between Python 2.7.x and Python 3.x](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/key_differences_between_python_2_and_3.ipynb), the [Cheat Sheet: Writing Python 2-3 compatible code](http://python-future.org/compatible_idioms.html) by Ed Schofield ... 
or [googling Python 2 vs 3](https://www.google.com/q=python%202%20vs%203).\n", "\n", - "I received an email request from a Python 3 programmer who suggested that a relatively minor change in this notebook would enable it to run with Python 2 or Python 3: importing the `print_function` from [`__future__`](https://docs.python.org/2/library/__future__.html), and changing my [`print` statements (Python 2)](https://docs.python.org/2/reference/simple_stmts.html#print) to [`print` function calls (Python 3)](https://docs.python.org/3/library/functions.html#print). Although a relatively minor conceptual change, it necessitates the changing of many cells to reflect the Python 3 `print` syntax.\n", - "\n", - "I find the arguments for [making `print` a function rather than statement](https://www.python.org/dev/peps/pep-3105/) compelling - especially as it is more consistent with printing functionality in many other programming langugages - and so while I do not want to convert this notebook to a Python 3 notebook, I have implemented this change so that it can be used in either Python 2 or Python 3. However, while I have verified that it still works in Python 2, I have not tested it in Python 3.\n", - "\n", - "I also find the arguments for [changing the division operator](https://www.python.org/dev/peps/pep-0238/) compelling, so will import that as well. Without this import in Python 2, `1 / 2` returns `0` (the integer portion of the quotient); with this import, `1 / 2` returns `0.5`, and if you want only the integer portion of the quotient (*floor division*), you can use `1 // 2` (which works the same way in Python 2 and Python 3).\n", - "\n", - "Note that if you don't understand some/any of the above discussion about Python 2 and Python 3, it should not affect your ability to understand the rest of this notebook." 
+ "[Nick Coghlan](https://twitter.com/ncoghlan_dev), a CPython core developer, sent me an email suggesting that relatively minor changes in this notebook would enable it to run with Python 2 *or* Python 3: importing the `print_function` from the [**`__future__`**](https://docs.python.org/2/library/__future__.html) module, and changing my [`print` *statements* (Python 2)](https://docs.python.org/2/reference/simple_stmts.html#print) to [`print` *function calls* (Python 3)](https://docs.python.org/3/library/functions.html#print). Although a relatively minor conceptual change, it necessitated the changing of many individual cells to reflect the Python 3 `print` syntax. I also needed to replace a few functions that are no longer available in Python 3 with related functions that are available in both versions; I've added notes in nearby cells where the incompatible functions were removed explaining why they are related ... and no longer available." ] }, { @@ -370,7 +364,7 @@ "language": "python", "metadata": {}, "outputs": [], - "prompt_number": 2 + "prompt_number": 1 }, { "cell_type": "heading", diff --git a/README.md b/README.md index 2a8e261..7ec9517 100644 --- a/README.md +++ b/README.md @@ -4,12 +4,12 @@ This short primer on [Python](http://www.python.org/) is designed to provide a r The primer is spread across a collection of [IPython Notebooks](http://ipython.org/notebook.html), and the easiest way to use the primer is to [install IPython Notebook](http://ipython.org/install.html) on your computer. You can also [install Python](https://www.python.org/downloads/), and manually copy and paste the pieces of sample code into the Python interpreter, as the primer only makes use of the Python standard libraries. -There are three versions of the primer. Two versions contain the entire primer in a single notebook: +There are three versions of the primer. 
Two versions are compatible with Python 2 or Python 3), and contain the entire primer in a single notebook: * Single IPython Notebook: [Python_for_Data_Science_all.ipynb](Python_for_Data_Science_all.ipynb) * Single web page (HTML): [Python_for_Data_Science_all.html](Python_for_Data_Science_all.html) -The other version divides the primer into 5 separate notebooks: +The other version divides the primer into 5 separate notebooks (Python 2 only): * [Introduction](1_Introduction.ipynb) * [Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb) @@ -29,11 +29,17 @@ There are also 2 data files, based on the [mushroom dataset](https://archive.ics ## Change Log +2015-02-23 + +* Added attribution for suggested changes to accommodate Python 3 to [Nick Coghlan](https://twitter.com/ncoghlan_dev) + 2015-02-22 * Added `from __future__ import print_function, division` for Python 3 compatibility -* Updated `simple_ml.py` to also use Python 3 `print_function` and `division` -* Changed "call by reference" to "call by sharing" +* Updated `simple_ml.py` and `SimpleDecisionTree.py` to also use Python 3 `print_function` and `division` +* Replaced `xrange()` (Python 2 only) with `range()` (Python 2 or 3) +* Replaced `dict.iteritems()` (Python 2 only) with `dict.items()` (Python 2 or 3) +* Changed ["call by reference"](https://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_reference) to ["call by sharing"](https://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_sharing) * Added `isinstance()` (and reference to duck typing) to section on `type()` * Added variable for `delimiter` rather than hard-coding `'|'` character * Cleaned up various cells \ No newline at end of file diff --git a/SimpleDecisionTree.py b/SimpleDecisionTree.py index 491ff05..ddd4d92 100644 --- a/SimpleDecisionTree.py +++ b/SimpleDecisionTree.py @@ -1,3 +1,5 @@ +from __future__ import print_function, division + ''' A class to implement a simple decision tree (based on ID3) ''' @@ -38,14 +40,14 @@ def 
_create(self, instances, candidate_attribute_indexes, target_attribute_index # If the dataset is empty or the candidate attributes list is empty, return the default value. if not instances or not candidate_attribute_indexes: if trace: - print '{}Using default class {}'.format('< ' * trace, default_class) + print('{}Using default class {}'.format('< ' * trace, default_class)) return default_class # If all the instances have the same class label, return that class label elif len(class_labels_and_counts) == 1: class_label = class_labels_and_counts.most_common(1)[0][0] if trace: - print '{}All {} instances have label {}'.format('< ' * trace, len(instances), class_label) + print('{}All {} instances have label {}'.format('< ' * trace, len(instances), class_label)) return class_label else: default_class = simple_ml.majority_value(instances, target_attribute_index) @@ -53,7 +55,7 @@ def _create(self, instances, candidate_attribute_indexes, target_attribute_index # Choose the next best attribute index to best classify the instances best_index = simple_ml.choose_best_attribute_index(instances, candidate_attribute_indexes, target_attribute_index) if trace: - print '{}Creating tree node for attribute index {}'.format('> ' * trace, best_index) + print('{}Creating tree node for attribute index {}'.format('> ' * trace, best_index)) # Create a new decision tree node with the best attribute index and an empty dictionary object (for now) tree = {best_index:{}} @@ -66,13 +68,13 @@ def _create(self, instances, candidate_attribute_indexes, target_attribute_index for attribute_value in partitions: if trace: - print '{}Creating subtree for value {} ({}, {}, {}, {})'.format( + print('{}Creating subtree for value {} ({}, {}, {}, {})'.format( '> ' * trace, attribute_value, len(partitions[attribute_value]), len(remaining_candidate_attribute_indexes), target_attribute_index, - default_class) + default_class)) # Create a subtree for each value of the the best attribute subtree = self._create( 
@@ -99,8 +101,8 @@ def _classify(self, tree, instance, default_class=None): return default_class if not isinstance(tree, dict): return tree - attribute_index = tree.keys()[0] - attribute_values = tree.values()[0] + attribute_index = list(tree.keys())[0] + attribute_values = list(tree.values())[0] instance_attribute_value = instance[attribute_index] if instance_attribute_value not in attribute_values: return default_class @@ -115,7 +117,7 @@ def evaluate_accuracy(self, instances, default_class=None): predicted_labels = self.classify_list(instances, default_class) actual_labels = [x[0] for x in instances] counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)]) - return counts[True], counts[False], float(counts[True]) / len(instances) + return counts[True], counts[False], counts[True] / len(instances) def pprint(self): From a3aa0a29ccf31185ab9559f3be86ed3fcdf3b7ff Mon Sep 17 00:00:00 2001 From: joem Date: Tue, 21 Jul 2015 15:19:14 -0700 Subject: [PATCH 07/16] Updated for PyData Seattle tutorial --- Python_for_Data_Science_clean.ipynb | 4108 +++++++++++++++++++++++++++ simple_decision_tree.py | 167 ++ 2 files changed, 4275 insertions(+) create mode 100644 Python_for_Data_Science_clean.ipynb create mode 100644 simple_decision_tree.py diff --git a/Python_for_Data_Science_clean.ipynb b/Python_for_Data_Science_clean.ipynb new file mode 100644 index 0000000..3f1cb05 --- /dev/null +++ b/Python_for_Data_Science_clean.ipynb @@ -0,0 +1,4108 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python for Data Science" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Joe McCarthy](http://interrelativity.com/joe), \n", + "*Data Scientist*, [Indeed](http://www.indeed.com/)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from IPython.display import display, Image, HTML" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "# 1. Introduction" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"python-logo-master-v3-TM.png\"\n", + "This short primer on [Python](http://www.python.org/) is designed to provide a rapid \"on-ramp\" to enable computer programmers who are already familiar with concepts and constructs in other programming languages learn enough about Python to facilitate the effective use of open-source and proprietary Python-based machine learning and data science tools.\n", + "\n", + "\"nltk_book_cover.gif\"\n", + "The primer is motivated, in part, by the approach taken in the [Natural Language Toolkit (NLTK) book](http://www.nltk.org/book/), which provides a rapid on-ramp for using Python and the open-source [NLTK library](http://www.nltk.org/) to develop programs using natural language processing techniques (many of which involve [machine learning](http://www.nltk.org/book/ch06.html)).\n", + "\n", + "The [Python Tutorial](http://docs.python.org/2/tutorial/) offers a more comprehensive primer, and opens with an excellent - if biased - overview of some of the general strengths of the Python programming language:\n", + "\n", + "> Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. 
Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.\n", + "\n", + "\"Python\n", + "[Hans Petter Langtangen](http://folk.uio.no/hpl/), author of [Python Scripting for Computational Science](http://www.amazon.com/Python-Scripting-Computational-Science-Engineering/dp/3642093159), emphasizes the utility of Python for many of the common tasks in all areas of computational science:\n", + "\n", + "> Very often programming is about shuffling data in and out of different tools, converting one data format to another, extracting numerical data from a text, and administering numerical experiments involving a large number of data files and directories. Such tasks are much faster to accomplish in a language like Python than in Fortran, C, C++, C#, or Java\n", + "\n", + "[Foster Provost](http://people.stern.nyu.edu/fprovost/), co-author of [Data Science for Business](http://data-science-for-biz.com/), describes why Python is such a useful programming language for practical data science in [Python: A Practical Tool for Data Science](https://docs.google.com/document/pub?id=1p6vowsEuiezLbWnFKgse70a8LxfsrRixqPF5nBg8F3A), :\n", + "\n", + "> The practice of data science involves many interrelated but different activities, including accessing data, manipulating data, computing statistics about data, plotting/graphing/visualizing data, building predictive and explanatory models from data, evaluating those models on yet more data, integrating models into production systems, etc. One option for the data scientist is to learn several different software packages that each specialize in one or two of these things, but don’t do them all well, plus learn a programming language to tie them together. (Or do a lot of manual work.) 
\n", + "> \n", + "> An alternative is to use a general-purpose, high-level programming language that provides libraries to do all these things. Python is an excellent choice for this. It has a diverse range of open source libraries for just about everything the data scientist will do. It is available everywhere; high performance python interpreters exist for running your code on almost any operating system or architecture. Python and most of its libraries are both open source and free. Contrast this with common software packages that are available in a course via an academic license, yet are extremely expensive to license and use in industry.\n", + "\n", + "\"scikit-learn-logo-small.png\"\n", + "The goal of this primer is to provide efficient and sufficient scaffolding for software engineers with no prior knowledge of Python to be able to effectively use Python-based tools for data science research and development, such as the open-source library [scikit-learn](http://scikit-learn.org/). There is another, more comprehensive tutorial for scikit-learn, [Python Scientific Lecture Notes](http://scipy-lectures.github.io/index.html), that includes coverage of a number of other useful Python open-source libraries used by scikit-learn ([numpy](http://www.numpy.org/), [scipy](http://www.scipy.org/) and [matplotlib](http://matplotlib.org)) - all highly recommended ... and, to keep things simple, all beyond the scope of this primer.\n", + "\n", + "Using an IPython Notebook as a delivery vehicle for this primer was motivated by Brian Granger's inspiring tutorial, [The IPython Notebook: Get Close to Your Data with Python and JavaScript](http://strataconf.com/strata2014/public/schedule/detail/32033), one of the [highlights from my Strata 2014 conference experience](http://gumption.typepad.com/blog/2014/02/ipython-deep-learning-doing-good-some-highlights-from-strata-2014.html). 
You can run this notebook locally in a browser once you [install ipython notebook](http://ipython.org/install.html).\n", + "\n", + "One final note on external resources: the [Python Style Guide (PEP-0008)](http://legacy.python.org/dev/peps/pep-0008/) offers helpful tips on how best to format Python code. [Code like a Pythonista](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html) offers a number of additional tips on Python programming style and philosophy, several of which are incorporated into this primer.\n", + "\n", + "We will focus entirely on using Python within the interpreter environment (as supported within an IPython Notebook). Python scripts - files containing definitions of functions and variables, and typically including code invoking some of those functions - can also be run from a command line. Using Python scripts from the command line may be the subject of a future primer. \n", + "\n", + "To help motivate the data science-oriented Python programming examples provided in this primer, we will start off with a brief overview of basic concepts and terminology in data science." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Data Science: Basic Concepts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Science and Data Mining" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"DataScienceForBusiness_cover.jpg\"\n", + "Foster Provost and [Tom Fawcett](http://home.comcast.net/~tom.fawcett/public_html/index.html) offer succinct descriptions of data science and data mining in [Data Science for Business](http://data-science-for-biz.com/):\n", + "\n", + "> **Data science** involves principles, processes and techniques for understanding phenomena via the (automated) analysis of data.\n", + "> \n", + "> **Data mining** is the extraction of knowledge from data, via technologies that incorporate these principles." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Knowledge Discovery, Data Mining and Machine Learning" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Provost & Fawcett also offer some history and insights into the relationship between *data mining* and *machine learning*, terms which are often used somewhat interchangeably:\n", + "\n", + "> The field of Data Mining (or KDD: Knowledge Discovery and Data Mining) started as an offshoot of Machine Learning, and they remain closely linked. Both fields are concerned with the analysis of data to find useful or informative patterns. Techniques and algorithms are shared between the two; indeed, the areas are so closely related that researchers commonly participate in both communities and transition between them seamlessly. Nevertheless, it is worth pointing out some of the differences to give perspective.\n", + "> \n", + ">Speaking generally, because Machine Learning is concerned with many types of performance improvement, it includes subfields such as robotics and computer vision that are not part of KDD. It also is concerned with issues of agency and cognition — how will an intelligent agent use learned knowledge to reason and act in its environment — which are not concerns of Data Mining.\n", + "> \n", + ">Historically, KDD spun off from Machine Learning as a research field focused on concerns raised by examining real-world applications, and a decade and a half later the KDD community remains more concerned with applications than Machine Learning is. As such, research focused on commercial applications and business issues of data analysis tends to gravitate toward the KDD community rather than to Machine Learning. 
KDD also tends to be more concerned with the entire process of data analytics: data preparation, model learning, evaluation, and so on.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Cross Industry Standard Process for Data Mining (CRISP-DM)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [Cross Industry Standard Process for Data Mining](https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) introduced a process model for data mining in 2000 that has become widely adopted.\n", + "\n", + "\"CRISP-DM_Process_Diagram\"\n", + "\n", + "The model emphasizes the ***iterative*** nature of the data mining process, distinguishing several different stages that are regularly revisited in the course of developing and deploying data-driven solutions to business problems:\n", + "\n", + "* Business understanding\n", + "* Data understanding\n", + "* Data preparation\n", + "* Modeling \n", + "* Deployment\n", + "\n", + "We will be focusing primarily on using Python for **data preparation** and **modeling**." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Science Workflow" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Philip Guo](http://www.pgbovine.net/) presents a [Data Science Workflow](http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext) offering a slightly different process model emhasizing the importance of **reflection** and some of the meta-data, data management and bookkeeping challenges that typically arise in the data science process. 
His 2012 PhD thesis, [Software Tools to Facilitate Research Programming](http://pgbovine.net/projects/pubs/guo_phd_dissertation.pdf), offers an insightful and more comprehensive description of many of these challenges.\n", + "\n", + "\"pguo-data-science-overview.jpg\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Provost & Fawcett list a number of different tasks in which data science techniques are employed:\n", + "\n", + "* Classification and class probability estimation \n", + "* Regression (aka value estimation) \n", + "* Similarity matching \n", + "* Clustering \n", + "* Co-occurrence grouping (aka frequent itemset mining, association rule discovery, market-basket analysis) \n", + "* Profiling (aka behavior description, fraud / anomaly detection) \n", + "* Link prediction \n", + "* Data reduction \n", + "* Causal modeling \n", + "\n", + "We will be focusing primarily on **classification** and **class probability estimation** tasks, which are defined by Provost & Fawcett as follows:\n", + "\n", + "> *Classification* and *class probability estimation* attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to. Usually the classes are mutually exclusive. An example classification question would be: “Among all the customers of MegaTelCo, which are likely to respond to a given offer?” In this example the two classes could be called will respond and will not respond.\n", + "\n", + "To further simplify this primer, we will focus exclusively on **supervised** methods, in which the data is explicitly labeled with classes. There are also *unsupervised* methods that involve working with data in which there are no pre-specified class labels." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Supervised Classification" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [Natural Language Toolkit (NLTK) book](http://www.nltk.org/book) provides a diagram and succinct description (below, with italics and bold added for emphasis) of supervised classification:\n", + "\n", + "\"nltk_ch06_supervised-classification.png\"\n", + "\n", + "> *Supervised Classification*. (a) During *training*, a **feature extractor** is used to convert each **input value** to a **feature set**. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. Pairs of feature sets and **labels** are fed into the **machine learning algorithm** to generate a **model**. (b) During *prediction*, the same feature extractor is used to convert **unseen inputs** to feature sets. These feature sets are then fed into the model, which generates **predicted labels**." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Mining Terminology" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* **Structured** data has simple, well-defined patterns (e.g., a table or graph)\n", + "* **Unstructured** data has less well-defined patterns (e.g., text, images)\n", + "* **Model**: a pattern that captures / generalizes regularities in data (e.g., an equation, set of rules, decision tree)\n", + "* **Attribute** (aka *variable*, *feature*, *signal*, *column*): an element used in a model\n", + "* **Instance** (aka *example*, *feature vector*, *row*): a representation of a single entity being modeled\n", + "* **Target attribute** (aka *dependent variable*, *class label*): the class / type / category of an entity being modeled" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Mining Example: UCI Mushroom dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [Center for Machine Learning and Intelligent Systems](http://cml.ics.uci.edu/) at the University of California, Irvine (UCI), hosts a [Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html) containing over 200 publicly available data sets.\n", + "\n", + "\"mushroom\"/\n", + "We will use the [mushroom](https://archive.ics.uci.edu/ml/datasets/Mushroom) data set, which forms the basis of several examples in Chapter 3 of the Provost & Fawcett data science book.\n", + "\n", + "The following description of the dataset is provided at the UCI repository:\n", + "\n", + ">This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525 [The Audubon Society Field Guide to North American Mushrooms, 1981]). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. 
The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like leaflets three, let it be'' for Poisonous Oak and Ivy.\n", + "> \n", + "> **Number of Instances**: 8124\n", + "> \n", + "> **Number of Attributes**: 22 (all nominally valued)\n", + "> \n", + "> **Attribute Information**: (*classes*: edible=e, poisonous=p)\n", + "> \n", + "> 1. *cap-shape*: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", + "> 2. *cap-surface*: fibrous=f, grooves=g, scaly=y, smooth=s\n", + "> 3. *cap-color*: brown=n ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y\n", + "> 4. *bruises?*: bruises=t, no=f\n", + "> 5. *odor*: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s\n", + "> 6. *gill-attachment*: attached=a, descending=d, free=f, notched=n\n", + "> 7. *gill-spacing*: close=c, crowded=w, distant=d\n", + "> 8. *gill-size*: broad=b, narrow=n\n", + "> 9. *gill-color*: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y\n", + "> 10. *stalk-shape*: enlarging=e, tapering=t\n", + "> 11. *stalk-root*: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?\n", + "> 12. *stalk-surface-above-ring*: fibrous=f, scaly=y, silky=k, smooth=s\n", + "> 13. *stalk-surface-below-ring*: fibrous=f, scaly=y, silky=k, smooth=s\n", + "> 14. *stalk-color-above-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\n", + "> 15. *stalk-color-below-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\n", + "> 16. *veil-type*: partial=p, universal=u\n", + "> 17. *veil-color*: brown=n, orange=o, white=w, yellow=y\n", + "> 18. *ring-number*: none=n, one=o, two=t\n", + "> 19. *ring-type*: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z\n", + "> 20. 
*spore-print-color*: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y\n", + "> 21. *population*: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y\n", + "> 22. *habitat*: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d\n", + "> \n", + "> **Missing Attribute Values**: 2480 of them (denoted by \"?\"), all for attribute #11.\n", + "> \n", + "> **Class Distribution**: -- edible: 4208 (51.8%) -- poisonous: 3916 (48.2%) -- total: 8124 instances\n", + "\n", + "The [data file](https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data) associated with this dataset has one instance of a hypothetical mushroom per line, with abbreviations for the values of the class and each of the other 22 attributes separated by commas.\n", + "\n", + "Here is a sample line from the data file:\n", + "\n", + "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", + "\n", + "This instance represents a mushroom with the following attribute values (highlighted in **bold**):\n", + "\n", + "*class*: edible=e, **poisonous=p**\n", + "\n", + "1. *cap-shape*: bell=b, conical=c, convex=x, flat=f, **knobbed=k**, sunken=s\n", + "2. *cap-surface*: **fibrous=f**, grooves=g, scaly=y, smooth=s\n", + "3. *cap-color*: **brown=n** ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y\n", + "4. *bruises?*: bruises=t, **no=f**\n", + "5. *odor*: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, **none=n**, pungent=p, spicy=s\n", + "6. *gill-attachment*: attached=a, descending=d, **free=f**, notched=n\n", + "7. *gill-spacing*: **close=c**, crowded=w, distant=d\n", + "8. *gill-size*: broad=b, **narrow=n**\n", + "9. *gill-color*: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, **white=w**, yellow=y\n", + "10. *stalk-shape*: **enlarging=e**, tapering=t\n", + "11. 
*stalk-root*: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, **missing=?**\n", + "12. *stalk-surface-above-ring*: fibrous=f, scaly=y, **silky=k**, smooth=s\n", + "13. *stalk-surface-below-ring*: fibrous=f, **scaly=y**, silky=k, smooth=s\n", + "14. *stalk-color-above-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, **white=w**, yellow=y\n", + "15. *stalk-color-below-ring*: **brown=n**, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\n", + "16. *veil-type*: **partial=p**, universal=u\n", + "17. *veil-color*: brown=n, orange=o, **white=w**, yellow=y\n", + "18. *ring-number*: none=n, **one=o**, two=t\n", + "19. *ring-type*: cobwebby=c, **evanescent=e**, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z\n", + "20. *spore-print-color*: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, **white=w**, yellow=y\n", + "21. *population*: abundant=a, clustered=c, numerous=n, scattered=s, **several=v**, solitary=y\n", + "22. *habitat*: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, **woods=d**\n", + "\n", + "Building a model with this data set will serve as a motivating example throughout much of this primer." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Python: Basic Concepts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### *A note on Python 2 vs. Python 3*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are 2 major versions of Python in widespread use: [Python 2](https://docs.python.org/2/) and [Python 3](https://docs.python.org/3/). Python 3 has some features that are not backward compatible with Python 2, and some Python 2 libraries have not been updated to work with Python 3. 
I have been using Python 2, primarily because I use some of those Python 2-only libraries, but an increasing proportion of them are migrating to Python 3, and I anticipate shifting to Python 3 in the near future.\n", + "\n", + "For more on the topic, I recommend [Key differences between Python 2.7.x and Python 3.x](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/key_differences_between_python_2_and_3.ipynb), a very well documented IPython Notebook by [Sebastian Raschka](http://sebastianraschka.com/) that includes numerous helpful examples and links, the [Cheat Sheet: Writing Python 2-3 compatible code](http://python-future.org/compatible_idioms.html) by Ed Schofield ... or [googling Python 2 vs 3](https://www.google.com/q=python%202%20vs%203).\n", + "\n", + "[Nick Coghlan](https://twitter.com/ncoghlan_dev), a CPython core developer, sent me an email suggesting that relatively minor changes in this notebook would enable it to run with Python 2 *or* Python 3: importing `print_function` from the [**`__future__`**](https://docs.python.org/2/library/__future__.html) module, and changing my [`print` *statements* (Python 2)](https://docs.python.org/2/reference/simple_stmts.html#print) to [`print` *function calls* (Python 3)](https://docs.python.org/3/library/functions.html#print). Although this is a relatively minor conceptual change, it necessitated changing many individual cells to reflect the Python 3 `print` syntax. \n", + "\n", + "I also decided to import the `division` feature from `__future__`, as I find [the use of `/` for \"true division\"](https://www.python.org/dev/peps/pep-0238/) - and the use of `//` for \"floor division\" - to be more aligned with my intuition. I also needed to replace a few functions that are no longer available in Python 3 with related functions that are available in both versions; I've added notes in nearby cells where the incompatible functions were removed explaining why they are related ... 
and no longer available. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from __future__ import print_function, division" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Names (identifiers), strings & binding values to names (assignment)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The sample instance of a mushroom shown above can be represented as a string. \n", + "\n", + "A Python ***string* ([`str`](http://docs.python.org/2/tutorial/introduction.html#strings))** is a sequence of 0 or more characters enclosed within a pair of single quotes (`'`) or a pair of double quotes (`\"`). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python [*identifiers*](http://docs.python.org/2/reference/lexical_analysis.html#identifiers) (or [*names*](https://docs.python.org/2/reference/executionmodel.html#naming-and-binding)) are composed of letters, digits and/or underscores ('`_`'), starting with a letter or underscore. Python identifiers are case sensitive. Although camelCase identifiers can be used, it is generally considered more [pythonic](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html) to use underscores. Python variables and functions typically start with lowercase letters; Python classes start with uppercase letters.\n", + "\n", + "The following [assignment statement](http://docs.python.org/2/reference/simple_stmts.html#assignment-statements) binds the value of the string shown above to the name `single_instance_str`. Typing the name on the subsequent line will cause the interpreter to print the value bound to that name."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "single_instance_str = 'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'\n", + "single_instance_str" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Printing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`print`**](https://docs.python.org/3/library/functions.html#print) function writes the value of its comma-delimited arguments to [**`sys.stdout`**](http://docs.python.org/2/library/sys.html#sys.stdout) (typically the console). Each value in the output is separated by a single blank space. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('A', 'B', 'C', 1, 2, 3)\n", + "print('Instance 1:', single_instance_str)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The print function has an optional keyword argument, **`end`**. When this argument is used and its value does not include `'\\n'` (newline character), the output cursor will not advance to the next line." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('A', 'B') # no end argument\n", + "print('C')\n", + "print('A', 'B', end='...\\n') # end includes '\\n' --> output cursor advances to next line\n", + "print('C')\n", + "print('A', 'B', end=' ') # end=' ' --> use a space rather than newline at the end of the line\n", + "print('C') # so that subsequent printed output will appear on same line" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Comments" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python ***comment*** character is **`'#'`**: anything after `'#'` on the line is ignored by the Python interpreter. 
PEP8 style guidelines recommend using at least 2 blank spaces before an inline comment that appears on the same line as any code.\n", + "\n", + "***Multi-line strings*** can be used within code blocks to provide multi-line comments.\n", + "\n", + "Multi-line strings are delimited by pairs of triple quotes (**`'''`** or **`\"\"\"`**). Any newlines in the string will be represented as `'\\n'` characters in the string." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "'''\n", + "This is\n", + "a multi-line\n", + "string'''" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('Before comment') # this is an inline comment\n", + "'''\n", + "This is\n", + "a multi-line\n", + "comment\n", + "'''\n", + "print('After comment')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Multi-line strings can be printed, in which case the embedded newline (`'\\n'`) characters will be converted to newlines in the output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('''\n", + "This is\n", + "a multi-line\n", + "string''')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Lists" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A [**`list`**](http://docs.python.org/2/tutorial/introduction.html#lists) is an ordered ***sequence*** of 0 or more comma-delimited elements enclosed within square brackets ('`[`', '`]`'). The Python [**`str.split(sep)`**](http://docs.python.org/2/library/stdtypes.html#str.split) method can be used to split a `sep`-delimited string into a corresponding list of elements.\n", + "\n", + "In the following example, a comma-delimited string is split using `sep=','`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "single_instance_list = single_instance_str.split(',')\n", + "print(single_instance_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python lists are *heterogeneous*, i.e., they can contain elements of different types." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "mixed_list = ['a', 1, 2.3, True, [1, 'b']]\n", + "print(mixed_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python **`+`** operator can be used for addition, and also to concatenate strings and lists." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(1 + 2 + 3)\n", + "print('a' + 'b' + 'c')\n", + "print(['a', 1] + [2.3, True] + [[1, 'b']])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Accessing sequence elements & subsequences " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Individual elements of [*sequences*](http://docs.python.org/2/library/stdtypes.html#typesseq) (e.g., lists and strings) can be accessed by specifying their *zero-based index position* within square brackets ('`[`', '`]`').\n", + "\n", + "The following statements print out the 3rd element - at zero-based index position 2 - of `single_instance_str` and `single_instance_list`.\n", + "\n", + "Note that the 3rd elements are not the same, as commas count as elements in the string, but not in the list created by splitting a comma-delimited string." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str)\n", + "print(single_instance_str[2])\n", + "print(single_instance_list)\n", + "print(single_instance_list[2])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Negative index values* can be used to specify a position offset from the end of the sequence.\n", + "\n", + "It is often useful to use a `-1` index value to access the last element of a sequence." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str)\n", + "print(single_instance_str[-1])\n", + "print(single_instance_str[-2])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_list)\n", + "print(single_instance_list[-1])\n", + "print(single_instance_list[-2])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python ***slice notation*** can be used to access subsequences by specifying two index positions separated by a colon (':'); `seq[start:stop]` returns all the elements in `seq` between `start` and `stop - 1` (inclusive)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str[2:4])\n", + "print(single_instance_list[2:4])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Slice index values can be negative."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str[-4:-2])\n", + "print(single_instance_list[-4:-2])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `start` and/or `stop` index can be omitted. A common use of slices with a single index value is to access all but the first element or all but the last element of a sequence." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str)\n", + "print(single_instance_str[:-1]) # all but the last \n", + "print(single_instance_str[:-2]) # all but the last 2 \n", + "print(single_instance_str[1:]) # all but the first\n", + "print(single_instance_str[2:]) # all but the first 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_list)\n", + "print(single_instance_list[:-1])\n", + "print(single_instance_list[1:])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Slice notation includes an optional third element, `step`, as in `seq[start:stop:step]`, that specifies the steps or increments by which elements are retrieved from `seq` between `start` and `stop - 1`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str)\n", + "print(single_instance_str[::2]) # print elements in even-numbered positions\n", + "print(single_instance_str[1::2]) # print elements in odd-numbered positions\n", + "print(single_instance_str[::-1]) # print elements in reverse order" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [Python tutorial](http://docs.python.org/2/tutorial/introduction.html) offers a helpful ASCII art 
representation to show how positive and negative indexes are interpreted:\n", + "\n", + "<pre>\n",
+    " +---+---+---+---+---+\n",
+    " | H | e | l | p | A |\n",
+    " +---+---+---+---+---+\n",
+    " 0   1   2   3   4   5\n",
+    "-5  -4  -3  -2  -1\n",
+    "</pre>
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Splitting / separating statements" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python statements are typically separated by newlines (rather than, say, the semi-colon in Java). Statements can extend over more than one line; it is generally best to break the lines after commas, parentheses, braces or brackets. Inserting a backslash character ('\\\\') at the end of a line will also enable continuation of the statement on the next line, but it is generally best to look for other alternatives." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute_names = ['class', \n", + " 'cap-shape', 'cap-surface', 'cap-color', \n", + " 'bruises?', \n", + " 'odor', \n", + " 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', \n", + " 'stalk-shape', 'stalk-root', \n", + " 'stalk-surface-above-ring', 'stalk-surface-below-ring', \n", + " 'stalk-color-above-ring', 'stalk-color-below-ring',\n", + " 'veil-type', 'veil-color', \n", + " 'ring-number', 'ring-type', \n", + " 'spore-print-color', \n", + " 'population', \n", + " 'habitat']\n", + "print(attribute_names)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('a', 'b', 'c', # no '\\' needed when breaking after comma\n", + " 1, 2, 3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print( # no '\\' needed when breaking after parenthesis, brace or bracket\n", + " 'a', 'b', 'c',\n", + " 1, 2, 3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(1 + 2 \\\n", + " + 3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 
Processing strings & other sequences" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`str.strip([chars])`**](http://docs.python.org/2/library/stdtypes.html#str.strip) method returns a copy of `str` in which any leading or trailing `chars` are removed. If no `chars` are specified, it removes all leading and trailing whitespace. [*Whitespace* is any sequence of spaces, tabs (`'\\t'`) and/or newline (`'\\n'`) characters.] \n", + "\n", + "Note that since `print` inserts a blank space between each of its comma-delimited arguments, the second asterisk below is printed after a leading blank space on the new line." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('*', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n', '*')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('*', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'.strip(), '*')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A common programming pattern when dealing with CSV (comma-separated value) files, such as the mushroom dataset file mentioned above, is to repeatedly:\n", + "\n", + "1. read a line from a file\n", + "2. strip off any leading and trailing whitespace\n", + "3. split the values separated by commas into a list\n", + "\n", + "We will get to repetition control structures (loops) and file input and output shortly, but here is an example of how `str.strip()` and `str.split()` can be chained together in a single instruction for processing a line representing a single instance from the mushroom dataset file. 
Note that chained methods are executed in left-to-right order.\n", + "\n", + "*\\[Python provides a **[`csv`](https://docs.python.org/2/library/csv.html)** module to facilitate the processing of CSV files, but we will not use that module here.\\]*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "single_instance_str = 'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'\n", + "print(single_instance_str)\n", + "# first strip leading & trailing whitespace, then split on commas\n", + "single_instance_list = single_instance_str.strip().split(',') \n", + "print(single_instance_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`str.join(words)`**](http://docs.python.org/2/library/string.html#string.join) method is the inverse of `str.split()`, returning a single string in which each string in the sequence of `words` is separated by `str`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_list)\n", + "print(','.join(single_instance_list))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A number of Python functions and methods can be used on strings, lists and other sequences.\n", + "\n", + "The [**`len(s)`**](http://docs.python.org/2/library/functions.html#len) function can be used to find the length of (number of items in) a sequence `s`. It will also return the number of items in a *dictionary*, a data structure we will cover further below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(len(single_instance_str))\n", + "print(len(single_instance_list))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The **`in`** operator can be used to determine whether a sequence contains a value. 
\n", + "\n", + "Boolean values in Python are **`True`** and **`False`** (note the capitalization)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(',' in single_instance_str)\n", + "print(',' in single_instance_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`s.count(x)`**](http://docs.python.org/2/library/stdtypes.html#str.count) method can be used to count the number of occurrences of item `x` in sequence `s`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str.count(','))\n", + "print(single_instance_list.count('f'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`s.index(x)`**](http://docs.python.org/2/library/stdtypes.html#str.index) method can be used to find the first zero-based index of item `x` in sequence `s`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str.index(','))\n", + "print(single_instance_list.index('f'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that a [`ValueError`](https://docs.python.org/2/library/exceptions.html#exceptions.ValueError) exception will be raised if item `x` is not found in sequence `s`."
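
As a hedged aside on the note about `ValueError`: for strings (but not lists or tuples), the `str.find()` method behaves like `str.index()` except that it returns `-1` when the item is absent instead of raising an exception. The value of `sample_str` below is illustrative only and is not a line from the data file.

```python
sample_str = 'p,k,f'  # illustrative value, not taken from the data file

print(sample_str.index('k'))  # zero-based position of the first 'k'
print(sample_str.find('k'))   # same position, via find()
print(sample_str.find('x'))   # 'x' is absent: find() returns -1 rather than raising ValueError
```

Because `-1` is also a valid index (the last element), the result of `find()` should be compared against `-1` before it is used as an index.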
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_list.index(','))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Mutability" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One important distinction between strings and lists has to do with their [*mutability*](http://docs.python.org/2/reference/datamodel.html).\n", + "\n", + "Python strings are *immutable*, i.e., they cannot be modified. Most string methods (like `str.strip()`) return modified *copies* of the strings on which they are used.\n", + "\n", + "Python lists are *mutable*, i.e., they can be modified. \n", + "\n", + "The examples below illustrate a number of [`list`](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) methods that modify lists." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "list_1 = [4, 2, 3, 5, 1]\n", + "list_2 = list_1 # list_2 now references the same object as list_1\n", + "print('list_1: ', list_1)\n", + "print('list_2: ', list_2)\n", + "print()\n", + "\n", + "list_1.remove(1)\n", + "print('list_1.remove(1):', list_1)\n", + "print()\n", + "\n", + "list_1.append(6)\n", + "print('list_1.append(6):', list_1)\n", + "print()\n", + "\n", + "list_1.sort()\n", + "print('list_1.sort(): ', list_1)\n", + "print()\n", + "\n", + "list_1.reverse()\n", + "print('list_1.reverse():', list_1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When more than one name (e.g., a variable) is bound to the same mutable object, changes made to that object are reflected in all names bound to that object. For example, in the second statement above, `list_2` is bound to the same object that is bound to `list_1`. 
All changes made to the object bound to `list_1` will thus be reflected in `list_2` (since they both reference the same object)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('list_1: ', list_1)\n", + "print('list_2: ', list_2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are sorting and reversing functions, **[`sorted()`](https://docs.python.org/2.7/library/functions.html#sorted)** and **[`reversed()`](https://docs.python.org/2.7/library/functions.html#reversed)**, that do *not* modify their arguments, and can thus be used on mutable or immutable objects. \n", + "\n", + "Note that `sorted()` always returns a sorted *list* of each element in its argument, regardless of which type of sequence it is passed. Thus, invoking `sorted()` on a *string* returns a *list* of sorted characters from the string, rather than a sorted string." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('sorted(list_1):', sorted(list_1)) \n", + "print('list_1: ', list_1)\n", + "print()\n", + "print('sorted(single_instance_str):', sorted(single_instance_str)) \n", + "print('single_instance_str: ', single_instance_str)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `sorted()` function sorts its argument in ascending order by default. \n", + "\n", + "An optional ***[keyword argument](http://docs.python.org/2/tutorial/controlflow.html#keyword-arguments)***, `reverse`, can be used to sort in descending order. The default value of this optional parameter is `False`; to get non-default behavior of an optional argument, we must specify the name and value of the argument, in this case, `reverse=True`." 
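
When a sorted *string* is wanted rather than a list of characters, the list returned by `sorted()` can be passed to `str.join()`. The following is a minimal sketch of that idea; `sample_str` is an illustrative value rather than a line from the data file.

```python
sample_str = 'p,k,f'  # illustrative value, not taken from the data file

sorted_chars = sorted(sample_str)   # sorted() returns a list of characters
sorted_str = ''.join(sorted_chars)  # join the characters back into a string
reversed_str = ''.join(sorted(sample_str, reverse=True))  # descending order

print(sorted_chars)
print(sorted_str)
print(reversed_str)
print(sample_str)  # the original string is unchanged
```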
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(sorted(single_instance_str)) \n", + "print(sorted(single_instance_str, reverse=True))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Tuples (immutable list-like sequences)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A [*tuple*](http://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences) is an ordered, immutable sequence of 0 or more comma-delimited values enclosed in parentheses (`'('`, `')'`). Many of the functions and methods that operate on strings and lists also operate on tuples." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "x = (5, 4, 3, 2, 1) # a tuple\n", + "print('x =', x)\n", + "print('len(x) =', len(x))\n", + "print('x.index(3) =', x.index(3))\n", + "print('x[2:4] = ', x[2:4])\n", + "print('x[4:2:-1] = ', x[4:2:-1])\n", + "print('sorted(x):', sorted(x)) # note: sorted() always returns a list" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that the methods that modify lists (e.g., `append()`, `remove()`, `reverse()`, `sort()`) are not defined for immutable sequences such as tuples (or strings). Invoking one of these sequence modification methods on an immutable sequence will raise an [`AttributeError`](https://docs.python.org/2/library/exceptions.html#exceptions.AttributeError) exception." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "x.append(6)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "However, one can approximate these modifications by creating modified copies of an immutable sequence and then re-assigning it to a name." 
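
A minimal sketch of why this only *approximates* modification: concatenation builds a new tuple and rebinds the name, so any other name bound to the original tuple still sees the old value (in contrast to the `list_1` / `list_2` example in the Mutability section). The names `original` and `alias` are illustrative only.

```python
original = (5, 4, 3)
alias = original              # both names are bound to the same tuple object

original = original + (2,)    # creates a new tuple and rebinds 'original'

print(original)               # the new, longer tuple
print(alias)                  # still the old tuple; it was never modified
print(original is alias)      # the two names no longer reference the same object
```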
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "x = x + (6,) # need to include a comma to differentiate tuple from numeric expression\n", + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that Python has a **`+=`** operator which is a shortcut for the *`name = name + new_value`* pattern. This can be used for addition (e.g., `x += 1` is shorthand for `x = x + 1`) or concatenation (e.g., `x += (7,)` is shorthand for `x = x + (7,)`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "x += (7,)\n", + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Conditionals" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One common approach to handling errors is to *look before you leap (LBYL)*, i.e., test for potential [exceptions](http://docs.python.org/2/tutorial/errors.html) before executing instructions that might raise those exceptions. 
\n", + "\n", + "This approach can be implemented using the [**`if`**](http://docs.python.org/2/tutorial/controlflow.html#if-statements) statement (which may optionally include an **`else`** and any number of **`elif`** clauses).\n", + "\n", + "The following is a simple example of an `if` statement:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "class_value = 'e' # try changing this to 'p' or 'x'\n", + "\n", + "if class_value == 'e':\n", + " print('edible')\n", + "elif class_value == 'p':\n", + " print('poisonous')\n", + "else:\n", + " print('unknown')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that \n", + "\n", + "* a colon ('`:`') is used at the end of the lines with `if`, `else` or `elif`\n", + "* no parentheses are required to enclose the boolean condition (it is presumed to include everything between `if` or `elif` and the colon)\n", + "* the statements below each `if`, `elif` and `else` line are all indented\n", + "\n", + "Python does not have special characters to delimit statement blocks (like the '{' and '}' delimiters in Java); instead, sequences of statements with the same *indentation level* are treated as a statement block. The [Python Style Guide](http://legacy.python.org/dev/peps/pep-0008/) recommends using 4 spaces for each indentation level.\n", + "\n", + "An `if` statement can be used to follow the LBYL paradigm in preventing the `ValueError` that occurred in an earlier example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute = 'bruises' # try substituting 'bruises?' 
for 'bruises' and re-running this code\n", + "\n", + "if attribute in attribute_names:\n", + " i = attribute_names.index(attribute)\n", + " print(attribute, 'is in position', i)\n", + "else:\n", + " print(attribute, 'is not in', attribute_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Seeking forgiveness vs. asking for permission (EAFP vs. LBYL)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another perspective on handling errors championed by some pythonistas is that it is [*easier to ask forgiveness than permission (EAFP)*](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#eafp-vs-lbyl).\n", + "\n", + "As in many practical applications of philosophy, religion or dogma, it is helpful to *think before you choose (TBYC)*. There are a number of factors to consider in deciding whether to follow the EAFP or LBYL paradigm, including code readability and the anticipated likelihood and relative severity of encountering an exception. For those who are interested, Oran Looney wrote a blog post providing a nice overview of the debate over [LBYL vs. EAFP](http://oranlooney.com/lbyl-vs-eafp/).\n", + "\n", + "In keeping with practices most commonly used with other languages, we will follow the LBYL paradigm throughout most of this primer. \n", + "\n", + "However, as a brief illustration of the EAFP paradigm in Python, here is an alternate implementation of the functionality of the code above, using a [**`try/except`**](http://docs.python.org/2/tutorial/errors.html#handling-exceptions) statement." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute = 'bruises' # try substituting 'bruises?' 
for 'bruises' and re-running this code\n", + "\n", + "try:\n", + " i = attribute_names.index(attribute)\n", + " print(attribute, 'is in position', i)\n", + "except ValueError:\n", + " print(attribute, 'is not found')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python *null object* is **`None`** (note the capitalization)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute = 'bruises' # try substituting 'bruises?' for 'bruises' and re-running this code\n", + "\n", + "if attribute not in attribute_names: # equivalent to 'not attribute in attribute_names'\n", + " value = None\n", + "else:\n", + " i = attribute_names.index(attribute)\n", + " value = single_instance_list[i]\n", + " \n", + "print(attribute, '=', value)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Defining and calling functions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python [*function definitions*](http://docs.python.org/2/tutorial/controlflow.html#defining-functions) start with the **`def`** keyword followed by a function name, a list of 0 or more comma-delimited *parameters* (aka 'formal parameters') enclosed within parentheses, and then a colon ('`:`'). \n", + "\n", + "A function definition may include one or more [**`return`**](http://docs.python.org/2/reference/simple_stmts.html#the-return-statement) statements to indicate the value(s) returned to where the function is called. It is good practice to include a short [docstring](http://docs.python.org/2/tutorial/controlflow.html#tut-docstrings) to briefly describe the behavior of the function and the value(s) it returns." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "def attribute_value(instance, attribute, attribute_names):\n", + " '''Returns the value of attribute in instance, based on its position in attribute_names'''\n", + " if attribute not in attribute_names:\n", + " return None\n", + " else:\n", + " i = attribute_names.index(attribute)\n", + " return instance[i] # using the parameter name here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A *function call* starts with the function name, followed by a list of 0 or more comma-delimited *arguments* (aka 'actual parameters') enclosed within parentheses. A function call can be used as a statement or within an expression." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute = 'cap-shape' # try substituting any of the other attribute names shown above\n", + "print(attribute, '=', attribute_value(single_instance_list, attribute, attribute_names))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that Python does not distinguish between names used for *variables* and names used for *functions*. An assignment statement binds a value to a name; a function definition also binds a value to a name. At any given time, the value most recently bound to a name is the one that is used. \n", + "\n", + "This can be demonstrated using the [**`type(object)`**](http://docs.python.org/2.7/library/functions.html#type) function, which returns the `type` of `object`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "x = 0\n", + "print('x used as a variable:', x, type(x))\n", + "\n", + "def x():\n", + " print('x')\n", + " \n", + "print('x used as a function:', x, type(x))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another way to check the `type` of an object is to use [**`isinstance(object, class)`**](https://docs.python.org/2/library/functions.html#isinstance). This is generally [preferable](http://stackoverflow.com/questions/1549801/differences-between-isinstance-and-type-in-python), as it takes into account [class inheritance](https://docs.python.org/2/tutorial/classes.html#inheritance). There is a larger issue of [*duck typing*](https://en.wikipedia.org/wiki/Duck_typing), and whether code should ever explicitly check for the type of an object, but we will omit further discussion of the topic in this primer." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Call by sharing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An important feature of Python functions is that arguments are passed using [*call by sharing*](https://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_sharing). \n", + "\n", + "If a *mutable* object is passed as an argument to a function parameter, assignment statements using that parameter do not affect the passed argument; however, other modifications to the parameter (e.g., modifications to a list using methods such as `append()`, `remove()`, `reverse()` or `sort()`) do affect the passed argument.\n", + "\n", + "Not being aware of - or forgetting - this important distinction can lead to challenging debugging sessions. 
\n", + "\n", + "The example below demonstrates this difference and introduces another [list method](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists), `list.insert(i, x)`, which inserts `x` into `list` at position `i`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def modify_parameters(parameter1, parameter2):\n", + " '''Inserts \"x\" at the head of parameter1, assigns [7, 8, 9] to parameter2'''\n", + " parameter1.insert(0, 'x') # insert() WILL affect argument passed as parameter1\n", + " print('parameter1, after inserting \"x\":', parameter1)\n", + " parameter2 = [7, 8, 9] # assignment WILL NOT affect argument passed as parameter2\n", + " print('parameter2, after assigning [7, 8, 9]:', parameter2)\n", + " return\n", + "\n", + "argument1 = [1, 2, 3] \n", + "argument2 = [4, 5, 6]\n", + "print('argument1, before calling modify_parameters:', argument1)\n", + "print('argument2, before calling modify_parameters:', argument2)\n", + "print()\n", + "modify_parameters(argument1, argument2)\n", + "print()\n", + "print('argument1, after calling modify_parameters:', argument1)\n", + "print('argument2, after calling modify_parameters:', argument2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One way of preventing functions from modifying mutable objects passed as parameters is to make a copy of those objects inside the function. Here is another version of the function above that makes a shallow copy of the list parameter using the slice operator. 
\n", + "\n", + "*\\[Note: the Python [copy](http://docs.python.org/2/library/copy.html) module provides both [shallow] [`copy()`](http://docs.python.org/2/library/copy.html#copy.copy) and [`deepcopy()`](http://docs.python.org/2/library/copy.html#copy.deepcopy) methods; we will cover modules further below.\\]*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def modify_parameter_copy(parameter_1):\n", + " '''Inserts \"x\" at the head of parameter_1, without modifying the list argument'''\n", + " parameter_1_copy = parameter_1[:] # list[:] returns a copy of list\n", + " parameter_1_copy.insert(0, 'x')\n", + " print('Inserted \"x\":', parameter_1_copy)\n", + " return\n", + "\n", + "argument_1 = [1, 2, 3] # passing a named object will not affect the object bound to that name\n", + "print('Before:', argument_1)\n", + "modify_parameter_copy(argument_1)\n", + "print('After:', argument_1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another way to avoid modifying parameters is to use assignment statements which do not modify the parameter objects but return a new object that is bound to the name (locally)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def modify_parameter_assignment(parameter_1):\n", + " '''Inserts \"x\" at the head of parameter_1, without modifying the list argument'''\n", + " parameter_1 = ['x'] + parameter_1 # using assignment rather than list.insert()\n", + " print('Inserted \"x\":', parameter_1)\n", + " return\n", + "\n", + "argument_1 = [1, 2, 3] # passing a named object will not affect the object bound to that name\n", + "print('Before:', argument_1)\n", + "modify_parameter_assignment(argument_1)\n", + "print('After:', argument_1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Multiple return values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python functions can return more than one value by separating those return values with commas in the **return** statement. Multiple values are returned as a tuple. \n", + "\n", + "If the function-invoking expression is an assignment statement, multiple variables can be assigned the multiple values returned by the function in a single statement. This combining of values and subsequent separation is known as tuple ***packing*** and ***unpacking***." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def min_and_max(list_of_values):\n", + " '''Returns a tuple containing the min and max values in the list_of_values'''\n", + " return min(list_of_values), max(list_of_values)\n", + "\n", + "list_1 = [3, 1, 4, 2, 5]\n", + "print('min and max of', list_1, ':', min_and_max(list_1))\n", + "\n", + "# a single variable is assigned the two-element tuple\n", + "min_and_max_list_1 = min_and_max(list_1) \n", + "print('min and max of', list_1, ':', min_and_max_list_1)\n", + "\n", + "# the 1st variable is assigned the 1st value, the 2nd variable is assigned the 2nd value\n", + "min_list_1, max_list_1 = min_and_max(list_1) \n", + "print('min and max of', list_1, ':', min_list_1, ',', max_list_1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Iteration: for, range" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`for`**](http://docs.python.org/2/tutorial/controlflow.html#for-statements) statement iterates over the elements of a sequence or other [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for i in [0, 1, 2]:\n", + " print(i)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for c in 'abc':\n", + " print(c)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Python 2, the [**`range(stop)`**](http://docs.python.org/2/tutorial/controlflow.html#the-range-function) function returns a list of values from 0 up to `stop - 1` (inclusive). It is often used in the context of a `for` loop that iterates over the list of values." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('Values for the', len(attribute_names), 'attributes:', end='\\n\\n') # adds a blank line\n", + "for i in range(len(attribute_names)):\n", + " print(attribute_names[i], '=', \n", + " attribute_value(single_instance_list, attribute_names[i], attribute_names))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The more general form of the function, [**`range(start, stop[, step])`**](http://docs.python.org/2/library/functions.html#range), returns a list of values from `start` to `stop - 1` (inclusive) increasing by `step` (which defaults to `1`), or from `start` down to `stop + 1` (inclusive) decreasing by `step` if `step` is negative." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for i in range(3, 0, -1):\n", + " print(i)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Python 2, the [**`xrange(start, stop[, step])`**](http://docs.python.org/2/library/functions.html#xrange) function is an [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) version of the `range()` function. In the context of a `for` loop, it returns the *next* item of the sequence for each iteration of the loop rather than creating *all* the elements of the sequence before the first iteration. This can reduce memory consumption in cases where iteration over all the items is not required.\n", + "\n", + "In Python 3, the `range()` function behaves the same way as the `xrange()` function does in Python 2, and so the `xrange()` function is not defined in Python 3. \n", + "\n", + "To maximize compatibility, we will use `range()` throughout this notebook; however, note that it is generally more efficient to use `xrange()` rather than `range()` in Python 2." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Modules, namespaces and dotted notation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A Python [***module***](http://docs.python.org/2/tutorial/modules.html) is a file containing related definitions (e.g., of functions and variables). Modules are used to help organize a Python [***namespace***](http://docs.python.org/2/tutorial/classes.html#python-scopes-and-namespaces), the set of identifiers accessible in a particular context. All of the functions and variables we define in this IPython Notebook are in the `__main__` namespace, so accessing them does not require any specification of a module.\n", + "\n", + "A Python module named **`simple_ml`** (in the file `simple_ml.py`), contains a set of solutions to the exercises in this IPython Notebook. *\\[The learning opportunity provided by this primer will be maximized by not looking at that file, or waiting as long as possible to do so.\\]*\n", + "\n", + "Accessing functions in an external module requires that we first **[`import`](http://docs.python.org/2/reference/simple_stmts.html#the-import-statement)** the module, and then prefix the function names with the module name followed by a dot (this is known as ***dotted notation***).\n", + "\n", + "For example, the following function call in Exercise 1 below: \n", + "\n", + "`simple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)`\n", + "\n", + "uses dotted notation to reference the `print_attribute_names_and_values()` function in the `simple_ml` module.\n", + "\n", + "After you have defined your own function for Exercise 1, you can test your function by deleting the `simple_ml` module specification, so that the statement becomes\n", + "\n", + "`print_attribute_names_and_values(single_instance_list, attribute_names)`\n", + "\n", + "This will reference the `print_attribute_names_and_values()` function in the current namespace 
(`__main__`), i.e., the top-level interpreter environment. The `simple_ml.print_attribute_names_and_values()` function will still be accessible in the `simple_ml` namespace by using the \"`simple_ml.`\" prefix (so you can easily toggle back and forth between your own definition and that provided in the solutions file)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 1: define `print_attribute_names_and_values()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Complete the following function definition, `print_attribute_names_and_values(instance, attribute_names)`, so that it generates exactly the same output as the code above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def print_attribute_names_and_values(instance, attribute_names):\n", + " '''Prints the attribute names and values for an instance'''\n", + " # your code goes here\n", + " return\n", + "\n", + "import simple_ml # this module contains my solutions to exercises\n", + "\n", + "# delete 'simple_ml.' in the function call below to test your function\n", + "simple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### File I/O" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python [file input and output](http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files) is done through [file](http://docs.python.org/2/library/stdtypes.html#file-objects) objects. 
A file object is created with the [`open(name[, mode])`](http://docs.python.org/2/library/functions.html#open) statement, where `name` is a string representing the name of the file, and `mode` is `'r'` (read), `'w'` (write) or `'a'` (append); if no second argument is provided, the mode defaults to `'r'`.\n", + "\n", + "A common Python programming pattern for processing an input text file is to \n", + "\n", + "* [**`open`**](http://docs.python.org/2/library/functions.html#open) the file using a [**`with`**](http://docs.python.org/2/reference/compound_stmts.html#the-with-statement) statement (which will automatically [**`close`**](http://docs.python.org/2/library/stdtypes.html#file.close) the file after the statements inside the `with` block have been executed)\n", + "* iterate over each line in the file using a **`for`** statement\n", + "\n", + "The following code creates a list of instances, where each instance is a list of attribute values (like `instance_1_str` above). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "all_instances = [] # initialize instances to an empty list\n", + "data_filename = 'agaricus-lepiota.data'\n", + "\n", + "with open(data_filename, 'r') as f:\n", + " for line in f: # 'line' will be bound to the next line in f in each for loop iteration\n", + " all_instances.append(line.strip().split(','))\n", + " \n", + "print('Read', len(all_instances), 'instances from', data_filename)\n", + "# we don't want to print all the instances, so we'll just print the first one to verify\n", + "print('First instance:', all_instances[0]) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 2: define load_instances()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `load_instances(filename)`, that returns a list of instances in a text file. The function definition is started for you below. 
The function should exhibit the same behavior as the code above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def load_instances(filename):\n", + " '''Returns a list of instances stored in a file.\n", + " \n", + " filename is expected to have a series of comma-separated attribute values per line, e.g.,\n", + " p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", + " '''\n", + " instances = []\n", + " # your code goes here\n", + " return instances\n", + "\n", + "data_filename = 'agaricus-lepiota.data'\n", + "# delete 'simple_ml.' in the function call below to test your function\n", + "all_instances_2 = simple_ml.load_instances(data_filename)\n", + "print('Read', len(all_instances_2), 'instances from', data_filename)\n", + "print('First instance:', all_instances_2[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Output can be written to a text file via the [**`file.write(str)`**](http://docs.python.org/2/library/stdtypes.html#file.write) method.\n", + "\n", + "As we saw earlier, the [`str.join(words)`](http://docs.python.org/2/library/stdtypes.html#str.join) method returns a single `str`-delimited string containing each of the strings in the `words` list.\n", + "\n", + "SQL and Hive database tables sometimes use a pipe ('|') delimiter to separate column values for each row when they are stored as flat files. The following code creates a new data file using pipes rather than commas to separate the attribute values.\n", + "\n", + "To help maintain internal consistency, it is generally a good practice to define a variable such as `DELIMITER` or `SEPARATOR`, bind it to the intended delimiter string, and then use it as a named constant. The Python language does not support named constants, so the use of variables as named constants depends on conventions (e.g., using ALL-CAPS)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "DELIMITER = '|'\n", + "\n", + "print('Converting to {}-delimited strings, e.g.,'.format(DELIMITER), \n", + " DELIMITER.join(all_instances[0]))\n", + "\n", + "datafile2 = 'agaricus-lepiota-2.data'\n", + "with open(datafile2, 'w') as f: # 'w' = open file for writing (output)\n", + " for instance in all_instances:\n", + " f.write(DELIMITER.join(instance) + '\\n') # write each instance on a separate line\n", + "\n", + "all_instances_3 = []\n", + "with open(datafile2, 'r') as f:\n", + " for line in f:\n", + " all_instances_3.append(line.strip().split(DELIMITER)) # note: changed ',' to '|'\n", + " \n", + "print('Read', len(all_instances_3), 'instances from', datafile2)\n", + "print('First instance:', all_instances_3[0]) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### List comprehensions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python provides a powerful [*list comprehension*](http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions) construct to simplify the creation of a list by specifying a formula in a single expression.\n", + "\n", + "Some programmers find list comprehensions confusing, and avoid their use. We won't rely on list comprehensions here, but we will offer several examples with and without list comprehensions to highlight the power of the construct.\n", + "\n", + "One common use of list comprehensions is in the context of the [`str.join(words)`](http://docs.python.org/2/library/string.html#string.join) method we saw earlier.\n", + "\n", + "If we wanted to construct a pipe-delimited string containing elements of the list, we could use a `for` loop to iteratively add list elements and pipe delimiters to a string for all but the last element, and then manually add the last element." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# create pipe-delimited string without using list comprehension\n", + "DELIMITER = '|'\n", + "delimited_string = ''\n", + "token_list = ['a', 'b', 'c']\n", + "\n", + "for token in token_list[:-1]: # add all but the last token + DELIMITER\n", + " delimited_string += token + DELIMITER\n", + "delimited_string += token_list[-1] # add the last token (with no trailing DELIMITER)\n", + "delimited_string" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This process is much simpler using a list comprehension." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "delimited_string = DELIMITER.join([token for token in token_list])\n", + "delimited_string" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Missing values & \"clean\" instances" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As noted in the initial description of the UCI mushroom set above, 2480 of the 8124 instances have missing attribute values (denoted by `'?'`). \n", + "\n", + "There are several techniques for dealing with instances that include missing attribute values, but to simplify things in the context of this primer - and following the example in the [Data Science for Business](http://www.data-science-for-biz.com/) book - we will simply ignore any such instances and restrict our focus to only the *clean* instances (with no missing values).\n", + "\n", + "We could use several lines of code - with an `if` statement inside a `for` loop - to create a `clean_instances` list from the `all_instances` list. Or we could use a list comprehension that includes an `if` statement.\n", + "\n", + "We will show both approaches to creating `clean_instances` below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# version 1: using an if statement nested within a for statement\n", + "UNKNOWN_VALUE = '?'\n", + "\n", + "clean_instances = []\n", + "for instance in all_instances:\n", + " if UNKNOWN_VALUE not in instance:\n", + " clean_instances.append(instance)\n", + " \n", + "print(len(clean_instances), 'clean instances')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# version 2: using an equivalent list comprehension\n", + "clean_instances = [instance\n", + " for instance in all_instances\n", + " if UNKNOWN_VALUE not in instance]\n", + "\n", + "print(len(clean_instances), 'clean instances')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that line breaks can be used before a `for` or `if` keyword in a list comprehension." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dictionaries (dicts)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Although single character abbreviations of attribute values (e.g., 'x') allow for more compact data files, they are not as easy to understand by human readers as the longer attribute value descriptions (e.g., 'convex').\n", + "\n", + "A Python [dictionary (or **`dict`**)](http://docs.python.org/2/tutorial/datastructures.html#dictionaries) is an unordered, comma-delimited collection of ***key: value*** pairs, serving a similar function to a hash table or hashmap in other programming languages.\n", + "\n", + "We could create a dictionary for the `cap-type` attribute values shown above:\n", + "\n", + "> bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", + "\n", + "Since we will want to look up the value using the abbreviation (which is the representation of the value stored in the file), we will use the abbreviations 
as *keys* and the descriptions as *values*.\n", + "\n", + "A Python dictionary can be created by specifying all `key: value` pairs (with colons separating each *key* and *value*), or by adding them iteratively. We will show the first method in the cell below, and use the second method in a subsequent cell. \n", + "\n", + "Note that a *value* in a Python dictionary (`dict`) can be accessed by specifying its *key* using the general form `dict[key]` (or `dict.get(key, [default])`, which allows the specification of a `default` value to use if `key` is not in `dict`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute_values_cap_type = {'b': 'bell', \n", + " 'c': 'conical', \n", + " 'x': 'convex', \n", + " 'f': 'flat', \n", + " 'k': 'knobbed', \n", + " 's': 'sunken'}\n", + "\n", + "attribute_value_abbrev = 'x'\n", + "print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A Python dictionary is an *iterable* container, so we can iterate over the keys in a dictionary using a `for` loop.\n", + "\n", + "Note that since a dictionary is an *unordered* collection, the sequence of abbreviations and associated values is not guaranteed to appear in any particular order. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for attribute_value_abbrev in attribute_values_cap_type:\n", + " print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python supports *dictionary comprehensions*, which have a similar form as the *list comprehensions* described above, except that both a key and a value have to be specified for each iteration.\n", + "\n", + "For example, if we provisionally omit the 'convex' cap-type (whose abbreviation is the last letter rather than first letter in the attribute name), we could construct a dictionary of abbreviations and descriptions using the following expression." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute_values_cap_type_2 = {x[0]: x \n", + " for x in ['bell', 'conical', 'flat', 'knobbed', 'sunken']}\n", + "print(attribute_values_cap_type_2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While it's useful to have a dictionary of values for the `cap-type` attribute, it would be even more useful to have a dictionary of values for *every* attribute. Earlier, we created a list of `attribute_names`; we will now expand this to create a list of `attribute_values` wherein each list element is a dictionary.\n", + "\n", + "Rather than explicitly type in each dictionary entry in the Python interpreter, we'll define a function to read a file containing the list of attribute names, values and value abbreviations in the format shown above:\n", + "\n", + "* class: edible=e, poisonous=p\n", + "* cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", + "* cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s\n", + "* ..." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def load_attribute_values(filename):\n", + " '''Returns a list of attribute values in a file.\n", + " \n", + " The attribute values are represented as dictionaries, \n", + " wherein the keys are abbreviations and the values are descriptions.\n", + " filename is expected to have one attribute name and set of values per line, \n", + " with the following format:\n", + " name: value_description=value_abbreviation[,value_description=value_abbreviation]*\n", + " For example\n", + " cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", + " The attribute value description dictionary created from this line would be the following:\n", + " {'c': 'conical', \n", + " 'b': 'bell', \n", + " 'f': 'flat', \n", + " 'k': 'knobbed',\n", + " 's': 'sunken', \n", + " 'x': 'convex'}\n", + " '''\n", + " attribute_value_descriptions = []\n", + " with open(filename) as f:\n", + " for line in f:\n", + " attr_name_and_values = line.strip().split(':')\n", + " attr_name = attr_name_and_values[0]\n", + " if len(attr_name_and_values) < 2:\n", + " attribute_value_descriptions.append({}) # no values for this attribute\n", + " else:\n", + " abbrev_desc_dict = {}\n", + " desc_and_abbrev_list = attr_name_and_values[1].strip().split(',')\n", + " for desc_and_abbrev_str in desc_and_abbrev_list:\n", + " desc_and_abbrev = desc_and_abbrev_str.strip().split('=')\n", + " # simplifying assumption: no more than 1 value is missing an abbreviation\n", + " desc = desc_and_abbrev[0]\n", + " if len(desc_and_abbrev) < 2: \n", + " abbrev_desc_dict[None] = desc\n", + " else:\n", + " abbrev = desc_and_abbrev[1]\n", + " abbrev_desc_dict[abbrev] = desc\n", + " attribute_value_descriptions.append(abbrev_desc_dict)\n", + " return attribute_value_descriptions\n", + "\n", + "attribute_filename = 'agaricus-lepiota.attributes'\n", + "attribute_values = 
load_attribute_values(attribute_filename)\n", + "print('Read', len(attribute_values), 'attribute values from', attribute_filename)\n", + "print('First attribute values list:', attribute_values[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 3: define `load_attribute_names_and_values()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Earlier, we created the `attribute_names` list manually. The `load_attribute_values()` function above creates the `attribute_values` list from the contents of a file, each line of which starts with the name of an attribute. Unfortunately, the function discards the name of each attribute.\n", + "\n", + "It would be nice to retain the name as well as the value abbreviations and descriptions. One way to do this would be to create a list of dictionaries, in which each dictionary has 2 keys: `name`, whose value is the attribute name (a string), and `values`, whose value is another dictionary (with abbreviation keys and description values, as in `load_attribute_values()`).\n", + "\n", + "Complete the following function definition so that the code implements this functionality."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def load_attribute_names_and_values(filename):\n", + " '''Returns a list of attribute names and values in a file.\n", + " \n", + " This list contains dictionaries wherein the keys are names \n", + " and the values are value description dictionaries.\n", + " \n", + " Each value description sub-dictionary will use \n", + " the attribute value abbreviations as its keys \n", + " and the attribute descriptions as the values.\n", + " \n", + " filename is expected to have one attribute name and set of values per line, \n", + " with the following format:\n", + " name: value_description=value_abbreviation[,value_description=value_abbreviation]*\n", + " for example\n", + " cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", + " The attribute name and values dictionary created from this line would be the following:\n", + " {'name': 'cap-shape', \n", + " 'values': {'c': 'conical', \n", + " 'b': 'bell', \n", + " 'f': 'flat', \n", + " 'k': 'knobbed', \n", + " 's': 'sunken', \n", + " 'x': 'convex'}}\n", + " '''\n", + " attribute_names_and_values = [] # this will be a list of dicts\n", + " # your code goes here\n", + " return attribute_names_and_values\n", + "\n", + "attribute_filename = 'agaricus-lepiota.attributes'\n", + "# delete 'simple_ml.'
in the function call below to test your function\n", + "attribute_names_and_values = simple_ml.load_attribute_names_and_values(attribute_filename)\n", + "print('Read', len(attribute_names_and_values), 'attribute values from', attribute_filename)\n", + "print('First attribute name:', attribute_names_and_values[0]['name'], \n", + " '; values:', attribute_names_and_values[0]['values'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Counters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Data scientists often need to count things. For example, we might want to count the numbers of edible and poisonous mushrooms in the *clean_instances* list we created earlier." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "edible_count = 0\n", + "for instance in clean_instances:\n", + " if instance[0] == 'e':\n", + " edible_count += 1 # this is shorthand for edible_count = edible_count + 1\n", + "\n", + "print('There are', edible_count, 'edible mushrooms among the', \n", + " len(clean_instances), 'clean instances')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "More generally, we often want to count the number of occurrences (frequencies) of each possible value for an attribute. One way to do so is to create a dictionary where each dictionary key is an attribute value and each dictionary value is the count of instances with that attribute value.\n", + "\n", + "Using an ordinary dictionary, we must be careful to create a new dictionary entry the first time we see a new attribute value (that is not already contained in the dictionary)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "cap_state_value_counts = {}\n", + "for instance in clean_instances:\n", + " cap_state_value = instance[1] # cap-state is the 2nd attribute\n", + " if cap_state_value not in cap_state_value_counts:\n", + " # first occurrence, must explicitly initialize counter for this cap_state_value\n", + " cap_state_value_counts[cap_state_value] = 0\n", + " cap_state_value_counts[cap_state_value] += 1\n", + "\n", + "print('Counts for each value of cap-state:')\n", + "for value in cap_state_value_counts:\n", + " print(value, ':', cap_state_value_counts[value])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python [**`collections`**](http://docs.python.org/2/library/collections.html) module provides a number of high performance container datatypes. A frequently useful datatype is a [**`Counter`**](http://docs.python.org/2/library/collections.html#collections.Counter), a specialized dictionary in which each *key* is a unique element found in a list or some other container, and each *value* is the number of occurrences of that element in the source container. The default value for each newly created key is zero.\n", + "\n", + "A `Counter` includes a method, [**`most_common([n])`**](http://docs.python.org/2/library/collections.html#collections.Counter.most_common), that returns a list of 2-element tuples representing the values and their associated counts for the most common `n` values in descending order of the counts; if `n` is omitted, the method returns all tuples.\n", + "\n", + "Note that we can either use\n", + "\n", + "`import collections`\n", + "\n", + "and then use `collections.Counter()` in our code, or use\n", + "\n", + "`from collections import Counter`\n", + "\n", + "and then use `Counter()` (with no module specification) in our code." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from collections import Counter\n", + "\n", + "cap_state_value_counts = Counter()\n", + "for instance in clean_instances:\n", + " cap_state_value = instance[1]\n", + " # no need to explicitly initialize counters for cap_state_value; all start at zero\n", + " cap_state_value_counts[cap_state_value] += 1\n", + "\n", + "print('Counts for each value of cap-state:')\n", + "for value in cap_state_value_counts:\n", + " print(value, ':', cap_state_value_counts[value])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When a `Counter` object is instantiated with a list of items, it returns a dictionary-like container in which the *keys* are the unique items in the list, and the *values* are the counts of each unique item in that list. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "counts = Counter(['a', 'b', 'c', 'a', 'b', 'a'])\n", + "print(counts)\n", + "print(counts.most_common())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This allows us to count the number of values for `cap-state` in a very compact way." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "cap_state_value_counts = Counter([instance[1] for instance in clean_instances])\n", + "\n", + "print('Counts for each value of cap-state:')\n", + "for value in cap_state_value_counts:\n", + " print(value, ':', cap_state_value_counts[value])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 4: define `attribute_value_counts()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `attribute_value_counts(instances, attribute, attribute_names)`, that returns a `Counter` containing the counts of occurrences of each value of `attribute` in the list of `instances`. `attribute_names` is the list we created above, where each element is the name of an attribute.\n", + "\n", + "This exercise is designed to generalize the solution shown in the code directly above (which handles only the `cap-state` attribute)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# your definition goes here\n", + "\n", + "attribute = 'cap-shape'\n", + "# delete 'simple_ml.' in the function call below to test your function\n", + "attribute_value_counts = simple_ml.attribute_value_counts(clean_instances, \n", + " attribute, \n", + " attribute_names)\n", + "\n", + "print('Counts for each value of', attribute, ':')\n", + "for value in attribute_value_counts:\n", + " print(value, ':', attribute_value_counts[value])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### More on sorting" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Earlier, we saw that there is a `list.sort()` method that will sort a list in-place, i.e., by replacing the original value of `list` with a sorted version of the elements in `list`. 
\n", + "\n", + "We also saw that the [**`sorted(iterable[, cmp[, key[, reverse]]])`**](http://docs.python.org/2/library/functions.html#sorted) function can be used to return a *copy* of a list, dictionary or any other [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) container it is passed, in ascending order." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "original_list = [3, 1, 4, 2, 5]\n", + "sorted_list = sorted(original_list)\n", + "\n", + "print(original_list)\n", + "print(sorted_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`sorted()` can also be used with dictionaries (it returns a sorted list of the dictionary *keys*)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(sorted(attribute_values_cap_type))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use the sorted *keys* to access the *values* of a dictionary in ascending order of the keys." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for attribute_value_abbrev in sorted(attribute_values_cap_type):\n", + " print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute = 'cap-shape'\n", + "attribute_value_counts = simple_ml.attribute_value_counts(clean_instances, \n", + " attribute, \n", + " attribute_names)\n", + "\n", + "print('Counts for each value of', attribute, ':')\n", + "for value in sorted(attribute_value_counts):\n", + " print(value, ':', attribute_value_counts[value])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Sorting a dictionary by values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is often useful to sort a dictionary by its *values* rather than its *keys*. \n", + "\n", + "For example, when we printed out the counts of the attribute values for `cap-shape` above, the counts appeared in ascending alphabetical order of the attribute value abbreviations (the dictionary keys). It is often more helpful to show the attribute value counts in descending order of the counts (which are the values in that dictionary).\n", + "\n", + "There are a [variety of ways to sort a dictionary by values](http://writeonly.wordpress.com/2008/08/30/sorting-dictionaries-by-value-in-python-improved/), but the approach described in [PEP 265](http://legacy.python.org/dev/peps/pep-0265/) is generally considered the most efficient.\n", + "\n", + "In order to understand the components used in this approach, we will revisit and elaborate on a few concepts involving *dictionaries*, *iterators* and *modules*."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Python 2, the [**`dict.items()`**](http://docs.python.org/2/library/stdtypes.html#dict.items) method returns an unordered list of `(key, value)` tuples in `dict`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute_values_cap_type.items()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Python 2, a related method, [**`dict.iteritems()`**](http://docs.python.org/2/library/stdtypes.html#dict.iteritems), returns an [**`iterator`**](http://docs.python.org/2/library/stdtypes.html#iterator-types): an object that returns the *next* item in a sequence each time it is advanced (e.g., during each iteration of a for loop), which can be more efficient than generating *all* the items in the sequence before any are used ... and so should be used rather than `items()` wherever possible.\n", + "\n", + "This is similar to the distinction between `xrange()` and `range()` described above ... and, also similarly, `dict.items()` in Python 3 returns a lazily evaluated *view* object rather than a list, and so `dict.iteritems()` is no longer needed (nor defined) ... and further similarly, we will use only `dict.items()` in this notebook, but it is generally more efficient to use `dict.iteritems()` in Python 2."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for key, value in attribute_values_cap_type.items():\n", + " print(key, ':', value)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python [**`operator`**](http://docs.python.org/2/library/operator.html) module contains a number of functions that perform object comparisons, logical operations, mathematical operations, sequence operations, and abstract type tests.\n", + "\n", + "To facilitate sorting a dictionary by values, we will use the [**`operator.itemgetter(i)`**](http://docs.python.org/2/library/operator.html#operator.itemgetter) function, which can be used to retrieve the `i`th value in a tuple (such as a `(key, value)` pair returned by `[iter]items()`).\n", + "\n", + "We can use `operator.itemgetter(1)` to reference the *value* - the 2nd item in each `(key, value)` tuple (at zero-based index position 1) - rather than the *key* - the first item in each `(key, value)` tuple (at index position 0).\n", + "\n", + "We will use the optional keyword argument **`key`** in [`sorted(iterable[, cmp[, key[, reverse]]])`](http://docs.python.org/2/library/functions.html#sorted) to specify a *sorting* key that is not the same as the `dict` key (recall that the `dict` key is the default *sorting* key for `sorted()`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import operator\n", + "\n", + "sorted(attribute_values_cap_type.items(), \n", + " key=operator.itemgetter(1))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now sort the counts of attribute values in descending frequency of occurrence, and print them out using tuple unpacking."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute = 'cap-shape'\n", + "value_counts = simple_ml.attribute_value_counts(clean_instances, \n", + " attribute, \n", + " attribute_names)\n", + "\n", + "print('Counts for each value of', attribute, '(sorted by count):')\n", + "for value, count in sorted(value_counts.items(), \n", + " key=operator.itemgetter(1), \n", + " reverse=True):\n", + " print(value, ':', count)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that this example is rather contrived, as it is generally easiest to use a `Counter` and its associated `most_common()` method when sorting a dictionary wherein the values are all counts. The need to sort other kinds of dictionaries by their values is rather common. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### String formatting" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is often helpful to use [fancier output formatting](http://docs.python.org/2/tutorial/inputoutput.html#fancier-output-formatting) than simply printing comma-delimited lists of items. \n", + "\n", + "Examples of the **[`str.format()`](https://docs.python.org/2/library/stdtypes.html#str.format)** method used in conjunction with `print()` are shown below. \n", + "\n", + "More details can be found in the Python documentation on [format string syntax](http://docs.python.org/2/library/string.html#format-string-syntax)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('{:5.3f}'.format(0.1)) # fieldwidth = 5; precision = 3; f = float\n", + "print('{:7.3f}'.format(0.1)) # if fieldwidth is larger than needed, left pad with spaces\n", + "print('{:07.3f}'.format(0.1)) # use leading zero to left pad with leading zeros\n", + "print('{:3d}'.format(1)) # d = int\n", + "print('{:03d}'.format(1))\n", + "print('{:10s}'.format('hello')) # s = string, left-justified\n", + "print('{:>10s}'.format('hello')) # use '>' to right-justify within fieldwidth" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following example illustrates the use of `str.format()` on data associated with the mushroom dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('class: {} = {} ({:5.3f}), {} = {} ({:5.3f})'.format(\n", + " 'e', 3488, 3488 / 5644, \n", + " 'p', 2156, 2156 / 5644), end=' ')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following variation - splitting off the printing of the attribute name from the printing of the values and counts of values for that attribute - may be more useful in developing a solution to the following exercise."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('class:', end=' ') # keeps cursor on the same line for subsequent print statements\n", + "print('{} = {} ({:5.3f}),'.format('e', 3488, 3488 / 5644), end=' ')\n", + "print('{} = {} ({:5.3f})'.format('p', 2156, 2156 / 5644), end=' ')\n", + "print() # advance the cursor to the beginning of the next line" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 5: define `print_all_attribute_value_counts()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `print_all_attribute_value_counts(instances, attribute_names)`, that prints each attribute name in `attribute_names`, and then for each attribute value, prints the value abbreviation, the count of occurrences of that value and the proportion of instances that have that attribute value." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# your function definition goes here\n", + "\n", + "print('\\nCounts for all attributes and values:\\n')\n", + "# delete 'simple_ml.' in the function call below to test your function\n", + "simple_ml.print_all_attribute_value_counts(clean_instances, attribute_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. 
Using Python to Build and Use a Simple Decision Tree Classifier" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Decision Trees" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Wikipedia offers the following description of a [decision tree](https://en.wikipedia.org/wiki/Decision_tree) (with italics added to emphasize terms that will be elaborated below):\n", + "\n", + "> A decision tree is a flowchart-like structure in which each *internal node* represents a *test* of an *attribute*, each branch represents an *outcome* of that test and each *leaf node* represents *class label* (a decision taken after testing all attributes in the path from the root to the leaf). Each path from the root to a leaf can also be represented as a classification rule.\n", + "\n", + "*\\[Decision trees can also be used for regression, wherein the goal is to predict a continuous value rather than a class label, but we will focus here solely on their use for classification.\\]*\n", + "\n", + "The image below depicts a decision tree created from the UCI mushroom dataset that appears on [Andy G's blog post about Decision Tree Learning](http://gieseanw.wordpress.com/2012/03/03/decision-tree-learning/), where \n", + "\n", + "* a white box represents an *internal node* (and the label represents the *attribute* being tested)\n", + "* a blue box represents an attribute value (an *outcome* of the *test* of that attribute)\n", + "* a green box represents a *leaf node* with a *class label* of *edible*\n", + "* a red box represents a *leaf node* with a *class label* of *poisonous*\n", + "\n", + "\n", + "\n", + "It is important to note that the UCI mushroom dataset consists entirely of [categorical variables](https://en.wikipedia.org/wiki/Categorical_variable), i.e., every variable (or *attribute*) has an enumerated set of possible values. Many datasets include numeric variables that can take on `int` or `float` values. 
Tests for such variables typically use comparison operators, e.g., $age < 65$ or $36,250 < adjusted\\_gross\\_income <= 87,850$. *[Aside: Python supports boolean expressions containing multiple comparison operators, such as the expression comparing adjusted_gross_income in the preceding example.]*\n", + "\n", + "Our simple decision tree will only accommodate categorical variables. We will closely follow a version of the [decision tree learning algorithm implementation](http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3) offered by Chris Roach.\n", + "\n", + "Our goal in the following sections is to use Python to\n", + "\n", + "* ***create*** a simple decision tree using a set of *training* instances\n", + "* ***classify*** (predict class labels) for a set of *test* instances using a simple decision tree\n", + "* ***evaluate*** the performance of a simple decision tree on classifying a set of test instances\n", + "\n", + "First, we will explore some concepts and algorithms used in building and using decision trees." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Entropy" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When building a supervised classification model, the frequency distribution of attribute values is a potentially important factor in determining the relative importance of each attribute at various stages in the model building process.\n", + "\n", + "In data modeling, we can use frequency distributions to compute ***entropy***, a measure of disorder (impurity) in a set.\n", + "\n", + "We compute the entropy by multiplying the proportion of instances with each class label by the log of that proportion, and then taking the negative sum of those terms.\n", + "\n", + "More precisely, for a 2-class (binary) classification task:\n", + "\n", + "$entropy(S) = - p_1 \\log_2 (p_1) - p_2 \\log_2 (p_2)$\n", + "\n", + "where $p_i$ is the proportion (relative frequency) of class *i* within the set *S*.\n", + "\n", + "From the output above, we know that the proportion of `clean_instances` that are labeled `'e'` (class `edible`) in the UCI dataset is $3488 \\div 5644 = 0.618$, and the proportion labeled `'p'` (class `poisonous`) is $2156 \\div 5644 = 0.382$.\n", + "\n", + "After importing the Python [`math`](http://docs.python.org/2/library/math.html) module, we can use the [`math.log(x[, base])`](http://docs.python.org/2/library/math.html#math.log) function in computing the entropy of the `clean_instances` of the UCI mushroom data set as follows."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import math\n", + "\n", + "entropy = \\\n", + " - (3488 / 5644) * math.log(3488 / 5644, 2) \\\n", + " - (2156 / 5644) * math.log(2156 / 5644, 2)\n", + "print(entropy)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 6: define `entropy()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `entropy(instances)`, that computes the entropy of `instances`. You may assume the class label is in position 0; we will later see how to specify default parameter values in function definitions.\n", + "\n", + "[Note: the class label in many data files is the *last* rather than the *first* item on each line.]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# your function definition here\n", + "\n", + "# delete 'simple_ml.' 
in the function call below to test your function\n", + "print(simple_ml.entropy(clean_instances))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Information Gain" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Informally, a decision tree is constructed from a set of instances using a recursive algorithm that \n", + "\n", + "* selects the *best* attribute \n", + "* splits the set into subsets based on the values of that attribute (each subset is composed of instances from the original set that have the same value for that attribute)\n", + "* repeats the process on each of these subsets until a stopping condition is met (e.g., a subset has no instances or has instances which all have the same class label)\n", + "\n", + "Entropy is a metric that can be used in selecting the best attribute for each split: the best attribute is the one resulting in the *largest decrease in entropy* for a set of instances. [Note: other metrics can be used for determining the best attribute]\n", + "\n", + "*Information gain* measures the decrease in entropy that results from splitting a set of instances based on an attribute.\n", + "\n", + "$IG(S, a) = entropy(S) - [p(s_1) × entropy(s_1) + p(s_2) × entropy(s_2) ... + p(s_n) × entropy(s_n)]$\n", + "\n", + "Where \n", + "* $n$ is the number of distinct values of attribute $a$\n", + "* $s_i$ is the subset of $S$ where all instances have the $i$th value of $a$\n", + "* $p(s_i)$ is the proportion of instances in $S$ that have the $i$th value of $a$\n", + "\n", + "We'll use the definition of `information_gain()` in `simple_ml` to print the information gain for each of the attributes in the mushroom dataset ... before asking you to write your own definition of the function." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('Information gain for different attributes:', end='\\n\\n')\n", + "for i in range(1, len(attribute_names)):\n", + " print('{:5.3f} {:2} {}'.format(\n", + " simple_ml.information_gain(clean_instances, i), i, attribute_names[i]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can sort the attributes in decreasing order of information gain, which shows that `odor` is the best attribute for the first split in a decision tree that models the instances in this dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('Information gain for different attributes:', end='\\n\\n')\n", + "sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i)\n", + " for i in range(1, len(attribute_names))], \n", + " reverse=True)\n", + "for gain, i in sorted_information_gain_indexes:\n", + " print('{:5.3f} {:2} {}'.format(gain, i, attribute_names[i]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*\\[The following variation does not use a list comprehension\\]*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('Information gain for different attributes:', end='\\n\\n')\n", + "\n", + "information_gain_values = []\n", + "for i in range(1, len(attribute_names)):\n", + " information_gain_values.append((simple_ml.information_gain(clean_instances, i), i))\n", + " \n", + "sorted_information_gain_indexes = sorted(information_gain_values, \n", + " reverse=True)\n", + "for gain, i in sorted_information_gain_indexes:\n", + " print('{:5.3f} {:2} {}'.format(gain, i, attribute_names[i]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "###
Exercise 7: define `information_gain()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `information_gain(instances, i)`, that returns the information gain achieved by selecting the `i`th attribute to split `instances`. It should exhibit the same behavior as the `simple_ml` version of the function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# your definition of information_gain(instances, i) here\n", + "\n", + "# delete 'simple_ml.' in the function call below to test your function\n", + "sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) \n", + " for i in range(1, len(attribute_names))], \n", + " reverse=True)\n", + "\n", + "print('Information gain for different attributes:', end='\\n\\n')\n", + "for gain, i in sorted_information_gain_indexes:\n", + " print('{:5.3f} {:2} {}'.format(gain, i, attribute_names[i]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Building a Simple Decision Tree" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will implement a modified version of the [ID3](https://en.wikipedia.org/wiki/ID3_algorithm) algorithm for building a simple decision tree.\n", + "\n", + " ID3 (Examples, Target_Attribute, Candidate_Attributes)\n", + " Create a Root node for the tree\n", + " If all examples have the same value of the Target_Attribute, \n", + " Return the single-node tree Root with label = that value \n", + " If the list of Candidate_Attributes is empty,\n", + " Return the single node tree Root,\n", + " with label = most common value of Target_Attribute in the examples.\n", + " Otherwise Begin\n", + " A ← The Attribute that best classifies examples (most information gain)\n", + " Decision Tree attribute for Root = A.\n", + " For each possible value, v_i, of A,\n", + " Add a new tree branch below Root, 
corresponding to the test A = v_i.\n", + " Let Examples(v_i) be the subset of examples that have the value v_i for A\n", + " If Examples(v_i) is empty,\n", + " Below this new branch add a leaf node \n", + " with label = most common target value in the examples\n", + " Else \n", + " Below this new branch add the subtree \n", + " ID3 (Examples(v_i), Target_Attribute, Candidate_Attributes – {A})\n", + " End\n", + " Return Root" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*\\[**Note:** the algorithm above is *recursive*, i.e., there is a recursive call to `ID3` within the definition of `ID3`. Covering recursion is beyond the scope of this primer, but there are a number of other resources on [using recursion in Python](https://www.google.com/search?q=python+recursion). Familiarity with recursion will be important for understanding both the tree construction and classification functions below.\\]*\n", + "\n", + "In building a decision tree, we will need to split the instances based on the index of the *best* attribute, i.e., the attribute that offers the *highest information gain*. We will use separate utility functions to handle these subtasks. To simplify the functions, we will rely exclusively on attribute *indexes* rather than attribute *names*.\n", + "\n", + "First, we will define a function, **`split_instances(instances, attribute_index)`**, to split a set of instances based on any attribute. This function will return a dictionary where each *key* is a distinct value of the specified `attribute_index`, and the *value* of each key is a list representing the subset of `instances` that have that `attribute_index` value.\n", + "\n", + "We will use a [**`defaultdict`**](http://docs.python.org/2/library/collections.html#defaultdict-objects), a specialized dictionary class in the [**`collections`**](http://docs.python.org/2/library/collections.html) module, which automatically creates an appropriate default value for a new key.
For example, a `defaultdict(int)` automatically initializes a new dictionary entry to 0 (zero); a `defaultdict(list)` automatically initializes a new dictionary entry to the empty list (`[]`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from collections import defaultdict\n", + "\n", + "def split_instances(instances, attribute_index):\n", + " '''Returns a dictionary (defaultdict) that splits a list of instances \n", + " according to their values of a specified attribute index\n", + " \n", + " Each key of the dictionary is a distinct value of attribute_index,\n", + " and the value for each key is a list representing \n", + " the subset of instances that have that value for the attribute\n", + " '''\n", + " partitions = defaultdict(list)\n", + " for instance in instances:\n", + " partitions[instance[attribute_index]].append(instance)\n", + " return partitions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To test the function, we will partition the `clean_instances` based on the `odor` attribute (index position 5) and print out the size (number of instances) in each partition rather than the lists of instances in each partition." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "partitions = split_instances(clean_instances, 5)\n", + "print([(partition, len(partitions[partition])) for partition in partitions])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we can split instances based on a particular attribute, we would like to be able to choose the *best* attribute with which to split the instances, where *best* is defined as the attribute that provides the greatest information gain if instances were split based on that attribute. 
We will want to restrict the candidate attributes so that we don't bother trying to split on an attribute that was used higher up in the decision tree (or use the target attribute as a candidate)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 8: define `choose_best_attribute_index()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `choose_best_attribute_index(instances, candidate_attribute_indexes)`, that returns the index in the list of `candidate_attribute_indexes` that provides the highest information gain if `instances` are split based on that attribute index." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# your function here\n", + "\n", + "# delete 'simple_ml.' in the function call below to test your function\n", + "print('Best attribute index:', \n", + " simple_ml.choose_best_attribute_index(clean_instances, range(1, len(attribute_names))))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A leaf node in a decision tree represents the most frequently occurring - or majority - class value for that path through the tree. We will need a function that determines the majority value for the class index among a set of instances. One way to do this is to use the [`Counter`](https://docs.python.org/2/library/collections.html#counter-objects) class introduced above." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "class_counts = Counter([instance[0] for instance in clean_instances])\n", + "print('class_counts: {}\\n most_common(1): {}\\n most_common(1)[0][0]: {}'.format(\n", + " class_counts, # the Counter object\n", + " class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count\n", + " class_counts.most_common(1)[0][0])) # the most common value (1st element in that tuple)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*\\[The following variation does not use a list comprehension\\]*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "class_counts = Counter() # create an empty counter\n", + "for instance in clean_instances:\n", + " class_counts[instance[0]] += 1\n", + " \n", + "print('class_counts: {}\\n most_common(1): {}\\n most_common(1)[0][0]: {}'.format(\n", + " class_counts,\n", + " class_counts.most_common(1), \n", + " class_counts.most_common(1)[0][0]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is often useful to compute the number of unique values and/or the total number of values in a `Counter`.\n", + "\n", + "The number of unique values is simply the number of dictionary entries.\n", + "\n", + "The total number of values can be computed by taking the [**`sum()`**](https://docs.python.org/2/library/functions.html#sum) of all the counts (the *value* of each *key: value* pair ... or *key, value* tuple, if we use `Counter().most_common()`)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('Number of unique values: {}'.format(len(class_counts)))\n", + "print('Total number of values: {}'.format(sum([v \n", + " for k, v in class_counts.most_common()])))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before putting all this together to define a decision tree construction function, we will cover a few additional aspects of Python used in that function." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Truth values in Python" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python offers a very flexible mechanism for the [testing of truth values](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#testing-for-truth-values): in an **if** condition, any null object, zero-valued numerical expression or empty container (string, list, dictionary or tuple) is interpreted as *False* (i.e., *not True*):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n", + " print('\"{}\" is'.format(x), end=' ')\n", + " if x:\n", + " print(True)\n", + " else:\n", + " print(False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Sometimes, particularly with function parameters, it is helpful to differentiate `None` from empty lists and other data structures with a `False` truth value (one common use case is illustrated in `create_decision_tree()` below)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n", + " print('\"{} is None\" is'.format(x), end=' ')\n", + " if x is None:\n", + " print(True)\n", + " else:\n", + " print(False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Conditional expressions (ternary operators)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python also offers a [conditional expression (ternary operator)](http://docs.python.org/2/reference/expressions.html#conditional-expressions) that allows the functionality of an if/else statement that returns a value to be implemented as an expression. For example, the if/else statement in the code above could be implemented as a conditional expression as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n", + " print('\"{}\" is {}'.format(x, True if x else False)) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### More on optional parameters in Python functions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python function definitions can specify [default parameter values](http://docs.python.org/2/tutorial/controlflow.html#default-argument-values) indicating the value those parameters will have if no argument is explicitly provided when the function is called. 
Arguments can also be passed using [keyword parameters](http://docs.python.org/2/tutorial/controlflow.html#keyword-arguments) indicating which parameter will be assigned a specific argument value (which may or may not correspond to the order in which the parameters are defined).\n", + "\n", + "The [Python Tutorial page on default parameters](http://docs.python.org/2/tutorial/controlflow.html#default-argument-values) includes the following warning:\n", + "\n", + "> Important warning: The default value is evaluated only once. This makes a difference when the default is a mutable object such as a list, dictionary, or instances of most classes. \n", + "\n", + "Thus it is generally better to use the Python null object, `None`, rather than an empty `list` (`[]`), `dict` (`{}`) or other mutable data structure when specifying default parameter values for any of those data types." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def parameter_test(parameter1=None, parameter2=None):\n", + " '''Prints the values of parameter1 and parameter2'''\n", + " print('parameter1: {}; parameter2: {}'.format(parameter1, parameter2))\n", + " \n", + "parameter_test() # no args are required\n", + "parameter_test(1) # if any args are provided, 1st arg gets assigned to parameter1\n", + "parameter_test(1, 2) # 2nd arg gets assigned to parameter2\n", + "parameter_test(2) # remember: if only 1 arg, 1st arg gets assigned to parameter1\n", + "parameter_test(parameter2=2) # can use keyword to provide a value only for parameter2\n", + "parameter_test(parameter2=2, parameter1=1) # can use keywords for either arg, in either order" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 9: define `majority_value()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `majority_value(instances, class_index)`, that returns the most frequently 
occurring value of `class_index` in `instances`. The `class_index` parameter should be optional, and have a default value of `0` (zero).\n", + "\n", + "Your function definition should support the use of optional arguments as used in the function calls below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# your definition of majority_value(instances) here\n", + "\n", + "# delete 'simple_ml.' in the function calls below to test your function\n", + "\n", + "print('Majority value of index {}: {}'.format(\n", + " 0, simple_ml.majority_value(clean_instances))) \n", + "\n", + "# although there is only one class_index for the dataset, \n", + "# we'll test the function by specifying other indexes using optional / keyword arguments\n", + "print('Majority value of index {}: {}'.format(\n", + " 1, simple_ml.majority_value(clean_instances, 1))) # using argument order\n", + "print('Majority value of index {}: {}'.format(\n", + " 2, simple_ml.majority_value(clean_instances, class_index=2))) # using keyword argument" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Building a Simple Decision Tree" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The recursive `create_decision_tree()` function below uses an optional parameter, `class_index`, which defaults to `0`. This is to accommodate other datasets in which the class label is the last element on each line (which would be most easily specified by using a `-1` value). Most data files in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html) have the class labels as either the first element or the last element.\n", + "\n", + "To show how the decision tree is being built, an optional `trace` parameter, when non-zero, will generate some trace information as the tree is constructed. 
The indentation level is incremented with each recursive call via the use of the conditional expression (ternary operator), `trace + 1 if trace else 0`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def create_decision_tree(instances, \n", + " candidate_attribute_indexes=None, \n", + " class_index=0, \n", + " default_class=None, \n", + " trace=0):\n", + " '''Returns a new decision tree trained on a list of instances.\n", + " \n", + " The tree is constructed by recursively selecting and splitting instances based on \n", + " the highest information_gain of the candidate_attribute_indexes.\n", + " \n", + " The class label is found in position class_index.\n", + " \n", + " The default_class is the majority value for the current node's parent in the tree.\n", + " A positive (int) trace value will generate trace information \n", + " with increasing levels of indentation.\n", + " \n", + " Derived from the simplified ID3 algorithm presented in Building Decision Trees in Python \n", + " by Christopher Roach,\n", + " http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3\n", + " '''\n", + " \n", + " # if no candidate_attribute_indexes are provided, \n", + " # assume that we will use all but the target_attribute_index\n", + " # Note that None != [], \n", + " # as an empty candidate_attribute_indexes list is a recursion stopping condition\n", + " if candidate_attribute_indexes is None:\n", + " candidate_attribute_indexes = [i \n", + " for i in range(len(instances[0])) \n", + " if i != class_index]\n", + " # Note: do not use candidate_attribute_indexes.remove(class_index)\n", + " # as this would destructively modify the argument,\n", + " # causing problems during recursive calls\n", + " \n", + " class_labels_and_counts = Counter([instance[class_index] for instance in instances])\n", + "\n", + " # If the dataset is empty or the candidate attributes list is empty, 
\n", + " # return the default value\n", + " if not instances or not candidate_attribute_indexes:\n", + " if trace:\n", + " print('{}Using default class {}'.format('< ' * trace, default_class))\n", + " return default_class\n", + " \n", + " # If all the instances have the same class label, return that class label\n", + " elif len(class_labels_and_counts) == 1:\n", + " class_label = class_labels_and_counts.most_common(1)[0][0]\n", + " if trace:\n", + " print('{}All {} instances have label {}'.format(\n", + " '< ' * trace, len(instances), class_label))\n", + " return class_label\n", + " else:\n", + " default_class = simple_ml.majority_value(instances, class_index)\n", + "\n", + " # Choose the next best attribute index to best classify the instances\n", + " best_index = simple_ml.choose_best_attribute_index(\n", + " instances, candidate_attribute_indexes, class_index) \n", + " if trace:\n", + " print('{}Creating tree node for attribute index {}'.format(\n", + " '> ' * trace, best_index))\n", + "\n", + " # Create a new decision tree node with the best attribute index \n", + " # and an empty dictionary object (for now)\n", + " tree = {best_index:{}}\n", + "\n", + " # Create a new decision tree sub-node (branch) for each of the values \n", + " # in the best attribute field\n", + " partitions = simple_ml.split_instances(instances, best_index)\n", + "\n", + " # Remove that attribute from the set of candidates for further splits\n", + " remaining_candidate_attribute_indexes = [i \n", + " for i in candidate_attribute_indexes \n", + " if i != best_index]\n", + " for attribute_value in partitions:\n", + " if trace:\n", + " print('{}Creating subtree for value {} ({}, {}, {}, {})'.format(\n", + " '> ' * trace,\n", + " attribute_value, \n", + " len(partitions[attribute_value]), \n", + " len(remaining_candidate_attribute_indexes), \n", + " class_index, \n", + " default_class))\n", + " \n", + " # Create a subtree for each value of the best attribute\n", + " subtree = 
create_decision_tree(\n", + " partitions[attribute_value],\n", + " remaining_candidate_attribute_indexes,\n", + " class_index,\n", + " default_class,\n", + " trace + 1 if trace else 0)\n", + "\n", + " # Add the new subtree to the empty dictionary object \n", + " # in the new tree/node we just created\n", + " tree[best_index][attribute_value] = subtree\n", + "\n", + " return tree\n", + "\n", + "# split instances into separate training and testing sets\n", + "training_instances = clean_instances[:-20]\n", + "test_instances = clean_instances[-20:]\n", + "tree = create_decision_tree(training_instances, trace=1) # remove trace=1 to turn off tracing\n", + "print(tree)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The structure of the tree shown above is rather difficult to discern from the normal printed representation of a dictionary.\n", + "\n", + "The Python [**`pprint`**](http://docs.python.org/2/library/pprint.html) module has a number of useful functions for pretty-printing or formatting objects in a more human readable way.\n", + "\n", + "The [**`pprint.pprint(object, stream=None, indent=1, width=80, depth=None)`**](http://docs.python.org/2/library/pprint.html#pprint.pprint) function will print `object` to a `stream` (a default value of `None` will dictate the use of [sys.stdout](http://docs.python.org/2/library/sys.html#sys.stdout), the same destination as `print` function output), using `indent` spaces to differentiate nesting levels, using up to a maximum `width` columns and up to a maximum nesting level `depth` (`None` indicating no maximum)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from pprint import pprint\n", + "\n", + "pprint(tree)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Classifying Instances with a Simple Decision Tree" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Usually, when we construct a decision tree based on a set of *training* instances, we do so with the intent of using that tree to classify a set of one or more *test* instances.\n", + "\n", + "We will define a function, **`classify(tree, instance, default_class=None)`**, to use a decision `tree` to classify a single `instance`, where an optional `default_class` can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.\n", + "\n", + "We will use a design pattern in which we will use a series of `if` statements, each of which returns a value if the condition is true, rather than a nested series of `if`, `elif` and/or `else` clauses, as it helps constrain the levels of indentation in the function." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def classify(tree, instance, default_class=None):\n", + " '''Returns a classification label for instance, given a decision tree'''\n", + " if not tree: # if the node is empty, return the default class\n", + " return default_class\n", + " if not isinstance(tree, dict): # if the node is a leaf, return its class label\n", + " return tree\n", + " attribute_index = list(tree.keys())[0] # using list(dict.keys()) for Python 3 compatibility\n", + " attribute_values = list(tree.values())[0]\n", + " instance_attribute_value = instance[attribute_index]\n", + " if instance_attribute_value not in attribute_values: # this value was not in training data\n", + " return default_class\n", + " # recursively traverse the subtree (branch) associated with instance_attribute_value\n", + " return classify(attribute_values[instance_attribute_value], instance, default_class)\n", + "\n", + "for instance in test_instances:\n", + " predicted_label = classify(tree, instance)\n", + " actual_label = instance[0]\n", + " print('predicted: {}; actual: {}'.format(predicted_label, actual_label))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Evaluating the Accuracy of a Simple Decision Tree" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is often helpful to evaluate the performance of a model using a dataset not used in the training of that model. 
In the simple example shown above, we used all but the last 20 instances to train a simple decision tree, then classified those last 20 instances using the tree.\n", + "\n", + "The advantage of this training/test split is that visual inspection of the classifications (sometimes called *predictions*) is relatively straightforward, revealing that all 20 instances were correctly classified.\n", + "\n", + "There are a variety of metrics that can be used to evaluate the performance of a model. [Scikit Learn's Model Evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html) library provides an overview and implementation of several possible metrics. For now, we'll simply measure the *accuracy* of a model, i.e., the percentage of test instances that are correctly classified (*true positives* and *true negatives*).\n", + "\n", + "The accuracy of the model above, given the set of 20 test instances, is 100% (20/20).\n", + "\n", + "The function below calculates the classification accuracy of a `tree` over a set of `test_instances` (with an optional `class_index` parameter indicating the position of the class label in each instance)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def classification_accuracy(tree, test_instances, class_index=0, default_class=None):\n", + " '''Returns the accuracy of classifying test_instances with tree, \n", + " where the class label is in position class_index'''\n", + " num_correct = 0\n", + " for i in range(len(test_instances)):\n", + " prediction = classify(tree, test_instances[i], default_class)\n", + " actual_value = test_instances[i][class_index]\n", + " if prediction == actual_value:\n", + " num_correct += 1\n", + " return num_correct / len(test_instances)\n", + "\n", + "print(classification_accuracy(tree, test_instances))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In addition to showing the percentage of correctly classified instances, it may be helpful to return the actual counts of correctly and incorrectly classified instances, e.g., if we want to compile a total count of correctly and incorrectly classified instances over a collection of test instances.\n", + "\n", + "In order to do so, we'll use the [**`zip([iterable, ...])`**](http://docs.python.org/2.7/library/functions.html#zip) function, which combines 2 or more sequences or iterables; in Python 2 the function returns a list of tuples (in Python 3, an iterator over tuples), where the *i*th tuple contains the *i*th element from each of the argument sequences or iterables." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "list(zip([0, 1, 2], ['a', 'b', 'c'])) # list() so the tuples are displayed in Python 3 as well" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use [list comprehensions](http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions), the `Counter` class and the `zip()` function to modify `classification_accuracy()` so that it returns a packed tuple with \n", + "\n", + "* the percentage of instances correctly classified\n", + "* the number of correctly classified instances\n", + "* the number of incorrectly classified instances\n", + "\n", + "We'll also modify the function to use `instances` rather than `test_instances`, as we sometimes want to be able to evaluate the accuracy of a model when tested on the training instances used to create it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def classification_accuracy(tree, instances, class_index=0, default_class=None):\n", + " '''Returns the accuracy and the counts of correct and incorrect classifications of instances with tree, \n", + " where the class label is in position class_index'''\n", + " predicted_labels = [classify(tree, instance, default_class) \n", + " for instance in instances]\n", + " actual_labels = [x[class_index] \n", + " for x in instances]\n", + " counts = Counter([x == y \n", + " for x, y in zip(predicted_labels, actual_labels)])\n", + " return counts[True] / len(instances), counts[True], counts[False]\n", + "\n", + "print(classification_accuracy(tree, test_instances))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We sometimes want to partition the instances into subsets of equal sizes to measure performance. 
One metric this partitioning allows us to compute is a [learning curve](https://en.wikipedia.org/wiki/Learning_curve), i.e., assess how well the model performs based on the size of its training set. Another use of these partitions (aka *folds*) would be to conduct an [*n-fold cross validation*](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) evaluation.\n", + "\n", + "The following function, **`partition_instances(instances, num_partitions)`**, partitions a set of `instances` into `num_partitions` relatively equally sized subsets.\n", + "\n", + "We'll use this as yet another opportunity to demonstrate the power of using list comprehensions, this time, to condense the use of nested `for` loops." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "def partition_instances(instances, num_partitions):\n", + " '''Returns a list of relatively equally sized disjoint sublists (partitions) \n", + " of the list of instances'''\n", + " return [[instances[j] \n", + " for j in range(i, len(instances), num_partitions)]\n", + " for i in range(num_partitions)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before testing this function on the 5644 `clean_instances` from the UCI mushroom dataset, we'll create a small number of simplified instances to verify that the function has the desired behavior." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "instance_length = 3\n", + "num_instances = 5\n", + "\n", + "simplified_instances = [[j \n", + " for j in range(i, instance_length + i)] \n", + " for i in range(num_instances)]\n", + "\n", + "print('Instances:', simplified_instances)\n", + "partitions = partition_instances(simplified_instances, 2)\n", + "print('Partitions:', partitions)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*\\[The following variations do not use list comprehensions\\]*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def partition_instances(instances, num_partitions):\n", + " '''Returns a list of relatively equally sized disjoint sublists (partitions) \n", + " of the list of instances'''\n", + " partitions = []\n", + " for i in range(num_partitions):\n", + " partition = []\n", + " # iterate over instances starting at position i in increments of num_partitions\n", + " for j in range(i, len(instances), num_partitions): \n", + " partition.append(instances[j])\n", + " partitions.append(partition)\n", + " return partitions\n", + "\n", + "simplified_instances = []\n", + "for i in range(num_instances):\n", + " new_instance = []\n", + " for j in range(i, instance_length + i):\n", + " new_instance.append(j)\n", + " simplified_instances.append(new_instance)\n", + "\n", + "print('Instances:', simplified_instances)\n", + "partitions = partition_instances(simplified_instances, 2)\n", + "print('Partitions:', partitions)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`enumerate(sequence, start=0)`**](http://docs.python.org/2.7/library/functions.html#enumerate) function creates an iterator that successively returns the index and value of each element in a `sequence`, beginning at the `start` index." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for i, x in enumerate(['a', 'b', 'c']):\n", + " print(i, x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use `enumerate()` to facilitate slightly more rigorous testing of our `partition_instances` function on our `simplified_instances`.\n", + "\n", + "Note that since we are printing values rather than accumulating values, we will not use nested list comprehensions for this task." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for i in range(num_instances):\n", + " print('\\n# partitions:', i)\n", + " for j, partition in enumerate(partition_instances(simplified_instances, i)):\n", + " print('partition {}: {}'.format(j, partition))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Returning our attention to the UCI mushroom dataset, the following will partition our `clean_instances` into 10 relatively equally sized disjoint subsets. 
We will use a list comprehension to print out the length of each partition." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "partitions = partition_instances(clean_instances, 10)\n", + "print([len(partition) for partition in partitions])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*\\[The following variation does not use a list comprehension\\]*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for partition in partitions:\n", + " print(len(partition), end=' ')\n", + "print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following shows the different trees that are constructed based on partition 0 (first 10th) of `clean_instances`, partitions 0 and 1 (first 2/10ths) of `clean_instances` and all `clean_instances`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "tree0 = create_decision_tree(partitions[0])\n", + "print('Tree trained with {} instances:'.format(len(partitions[0])))\n", + "pprint(tree0)\n", + "print()\n", + "\n", + "tree1 = create_decision_tree(partitions[0] + partitions[1])\n", + "print('Tree trained with {} instances:'.format(len(partitions[0] + partitions[1])))\n", + "pprint(tree1)\n", + "print()\n", + "\n", + "tree = create_decision_tree(clean_instances)\n", + "print('Tree trained with {} instances:'.format(len(clean_instances)))\n", + "pprint(tree)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The only difference between the first two trees - *tree0* and *tree1* - is that in the first tree, instances with no `odor` (attribute index `5` is `'n'`) and a `spore-print-color` of white (attribute `20` = `'w'`) are classified as `edible` (`'e'`). 
With additional training data in the 2nd partition, an additional distinction is made such that instances with no `odor`, a white `spore-print-color` and a clustered `population` (attribute `21` = `'c'`) are classified as `poisonous` (`'p'`), while all other instances with no `odor` and a white `spore-print-color` (and any other value for the `population` attribute) are classified as `edible` (`'e'`).\n", + "\n", + "Note that there is no difference between `tree1` and `tree` (the tree trained with all instances). This early convergence on an optimal model is uncommon on most datasets (outside the UCI repository)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Learning curves" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we can partition our instances into subsets, we can use these subsets to construct different-sized training sets in the process of computing a learning curve.\n", + "\n", + "We will start off with an initial training set consisting only of the first partition, and then progressively extend that training set by adding a new partition during each iteration of computing the learning curve.\n", + "\n", + "The [**`list.extend(L)`**](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) method enables us to extend `list` by appending all the items in another list, `L`, to the end of `list`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "x = [1, 2, 3]\n", + "x.extend([4, 5])\n", + "print(x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now define the function, **`compute_learning_curve(instances, num_partitions=10)`**, which will take a list of `instances`, partition it into `num_partitions` relatively equally sized disjoint partitions, and then iteratively evaluate the accuracy of models trained with an incrementally increasing combination of instances in the first `num_partitions - 1` partitions then tested with instances in the last partition, a variant of . That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on. \n", + "\n", + "The function will return a list of `num_partitions - 1` tuples representing the size of the training set and the accuracy of a tree trained with that set (and tested on the `num_partitions - 1` set). This will provide some indication of the relative impact of the size of the training set on model performance." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def compute_learning_curve(instances, num_partitions=10):\n", + " '''Returns a list of training sizes and scores for incrementally increasing partitions.\n", + "\n", + " The list contains 2-element tuples, each representing a training size and score.\n", + " The i-th training size is the number of instances in partitions 0 through i.\n", + " The i-th score is the accuracy of a tree trained with instances \n", + " from partitions 0 through i\n", + " and tested on instances from partition num_partitions - 1 (the last partition).'''\n", + " \n", + " partitions = partition_instances(instances, num_partitions)\n", + " test_instances = partitions[-1][:]\n", + " training_instances = []\n", + " accuracy_list = []\n", + " for i in range(0, num_partitions - 1):\n", + " # for each iteration, the training set is composed of partitions 0 through i\n", + " training_instances.extend(partitions[i][:])\n", + " tree = create_decision_tree(training_instances)\n", + " partition_accuracy = classification_accuracy(tree, test_instances)\n", + " accuracy_list.append((len(training_instances), partition_accuracy))\n", + " return accuracy_list\n", + "\n", + "accuracy_list = compute_learning_curve(clean_instances)\n", + "print(accuracy_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The UCI mushroom dataset is a particularly clean and simple data set, enabling quick convergence on an optimal decision tree for classifying new instances using relatively few training instances. \n", + "\n", + "We can use a larger number of smaller partitions to see a little more variation in accuracy performance."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "accuracy_list = compute_learning_curve(clean_instances, 100)\n", + "print(accuracy_list[:10])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Object-Oriented Programming: Defining a Python Class to Encapsulate a Simple Decision Tree" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The simple decision tree defined above uses a Python dictionary for its representation. One can imagine using other data structures, and/or extending the decision tree to support confidence estimates, numeric features and other capabilities that are often included in more fully functional implementations. To support future extensibility, and hide the details of the representation from the user, it would be helpful to have a user-defined class for simple decision trees.\n", + "\n", + "Python is an [object-oriented programming](https://en.wikipedia.org/wiki/Object-oriented_programming) language, offering simple syntax and semantics for defining classes and instantiating objects of those classes. *[It is assumed that the reader is already familiar with the concepts of object-oriented programming]*\n", + "\n", + "A Python [class](http://docs.python.org/2/tutorial/classes.html) starts with the keyword **`class`** followed by a class name (identifier), a colon ('`:`'), and then any number of statements, which typically take the form of assignment statements for class or instance variables and/or function definitions for class methods. All statements are indented to reflect their inclusion in the class definition.\n", + "\n", + "The members - methods, class variables and instance variables - of a class are accessed by prepending `self.` to each reference. Class methods always include `self` as the first parameter. \n", + "\n", + "All class members in Python are *public* (accessible outside the class). 
There is no mechanism for *private* class members, but identifiers with leading double underscores (*\\_\\_member_identifier*) are 'mangled' (translated into *\\_class_name\\__member_identifier*), and thus not directly accessible outside their class, and can be used to approximate private members by Python programmers. \n", + "\n", + "There is also no mechanism for *protected* identifiers - accessible only within a defining class and its subclasses - in the Python language, and so Python programmers have adopted the convention of using a single underscore (*\\_identifier*) at the start of any identifier that is intended to be protected (i.e., not to be accessed outside the class or its subclasses). \n", + "\n", + "Some Python programmers only use the single underscore prefixes and avoid double underscore prefixes due to unintended consequences that can arise when names are mangled. The following warning about single and double underscore prefixes is issued in [Code Like a Pythonista](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#naming):\n", + "\n", + "> try to avoid the __private form. I never use it. Trust me. If you use it, you WILL regret it later\n", + "\n", + "We will follow this advice and avoid using the double underscore prefix in user-defined member variables and methods.\n", + "\n", + "Python has a number of pre-defined [special method names](http://docs.python.org/2/reference/datamodel.html#special-method-names), all of which are denoted by leading and trailing double underscores. For example, the [**`object.__init__(self[, ...])`**](http://docs.python.org/2/reference/datamodel.html#object.__init__) method is used to specify instructions that should be executed whenever a new object of a class is instantiated. \n", + "\n", + "Note that other machine learning libraries may use different terminology for some of the functions we defined above. 
For example, in the [`sklearn.tree.DecisionTreeClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class (and in most `sklearn` classifier classes), the method for constructing a classifier is named [`fit()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) - since it \"fits\" a model to the data - and the method for classifying instances is named [`predict()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict) - since it is predicting the class label for an instance.\n", + "\n", + "In keeping with this common terminology, the code below defines a class, **`SimpleDecisionTree`**, with a single pseudo-protected member variable `_tree`, four public methods - `fit()`, `predict()`, `classification_accuracy()` and `pprint()` - and two pseudo-protected auxiliary methods - `_create_tree()` and `_predict()` - to augment the `fit()` and `predict()` methods, respectively. \n", + "\n", + "The `fit()` method, together with its recursive helper `_create_tree()`, mirrors the `create_decision_tree()` function above, with the inclusion of the `self` parameter (as it is now a method of a class rather than a standalone function). The `predict()` method is a similarly modified version of the `classify()` function, with the added capability to predict the label of either a single instance or a list of instances. The `classification_accuracy()` method is similar to the function of the same name (with the addition of the `self` parameter). The `pprint()` method prints the tree in a human-readable format.\n", + "\n", + "Most comments and the use of the trace parameter have been removed to make the code more compact, but are included in the version found in **`simple_decision_tree.py`**."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "class SimpleDecisionTree:\n", + "\n", + " _tree = {} # class-level default, read via self._tree; fit() rebinds self._tree on the instance\n", + "\n", + " def __init__(self):\n", + " # this is where we would initialize any parameters to the SimpleDecisionTree\n", + " pass\n", + " \n", + " def fit(self, \n", + " instances, \n", + " candidate_attribute_indexes=None,\n", + " target_attribute_index=0,\n", + " default_class=None):\n", + " if not candidate_attribute_indexes:\n", + " candidate_attribute_indexes = [i \n", + " for i in range(len(instances[0]))\n", + " if i != target_attribute_index]\n", + " self._tree = self._create_tree(instances,\n", + " candidate_attribute_indexes,\n", + " target_attribute_index,\n", + " default_class)\n", + " \n", + " def _create_tree(self,\n", + " instances,\n", + " candidate_attribute_indexes,\n", + " target_attribute_index=0,\n", + " default_class=None):\n", + " class_labels_and_counts = Counter([instance[target_attribute_index] \n", + " for instance in instances])\n", + " if not instances or not candidate_attribute_indexes:\n", + " return default_class\n", + " elif len(class_labels_and_counts) == 1:\n", + " class_label = class_labels_and_counts.most_common(1)[0][0]\n", + " return class_label\n", + " else:\n", + " default_class = simple_ml.majority_value(instances, target_attribute_index)\n", + " best_index = simple_ml.choose_best_attribute_index(instances, \n", + " candidate_attribute_indexes, \n", + " target_attribute_index)\n", + " tree = {best_index:{}}\n", + " partitions = simple_ml.split_instances(instances, best_index)\n", + " remaining_candidate_attribute_indexes = [i \n", + " for i in candidate_attribute_indexes \n", + " if i != best_index]\n", + " for attribute_value in partitions:\n", + " subtree = self._create_tree(\n", + " partitions[attribute_value],\n", + " 
remaining_candidate_attribute_indexes,\n", + " target_attribute_index,\n", + " default_class)\n", + " tree[best_index][attribute_value] = subtree\n", + " return tree\n", + " \n", + " def predict(self, instances, default_class=None):\n", + " if not isinstance(instances, list):\n", + " return self._predict(self._tree, instances, default_class)\n", + " else:\n", + " return [self._predict(self._tree, instance, default_class) \n", + " for instance in instances]\n", + " \n", + " def _predict(self, tree, instance, default_class=None):\n", + " if not tree:\n", + " return default_class\n", + " if not isinstance(tree, dict):\n", + " return tree\n", + " attribute_index = list(tree.keys())[0] # using list(dict.keys()) for Py3 compatibility\n", + " attribute_values = list(tree.values())[0]\n", + " instance_attribute_value = instance[attribute_index]\n", + " if instance_attribute_value not in attribute_values:\n", + " return default_class\n", + " return self._predict(attribute_values[instance_attribute_value],\n", + " instance,\n", + " default_class)\n", + " \n", + " def classification_accuracy(self, instances, default_class=None):\n", + " predicted_labels = self.predict(instances, default_class)\n", + " actual_labels = [x[0] for x in instances]\n", + " counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])\n", + " return counts[True] / len(instances), counts[True], counts[False]\n", + " \n", + " def pprint(self):\n", + " pprint(self._tree)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following statements instantiate a `SimpleDecisionTree`, fit it using all but the last 20 `clean_instances`, print the tree using its `pprint()` method, and then use the `predict()` method to print the classification of the last 20 `clean_instances`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "simple_decision_tree = SimpleDecisionTree()\n", + "simple_decision_tree.fit(training_instances)\n", + "simple_decision_tree.pprint()\n", + "print()\n", + "\n", + "predicted_labels = simple_decision_tree.predict(test_instances)\n", + "actual_labels = [instance[0] for instance in test_instances]\n", + "for predicted_label, actual_label in zip(predicted_labels, actual_labels):\n", + " print('Model: {}; truth: {}'.format(predicted_label, actual_label))\n", + "print()\n", + "print('Classification accuracy:', simple_decision_tree.classification_accuracy(test_instances))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Next steps" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are a variety of Python libraries - e.g., [Scikit-Learn](http://scikit-learn.org/) - for building more full-featured decision trees and other types of models based on a variety of machine learning algorithms. Hopefully, this primer will have prepared you for learning how to use those libraries effectively.\n", + "\n", + "Many Python-based machine learning libraries use other external Python libraries such as [NumPy](http://www.numpy.org/), [SciPy](http://www.scipy.org/scipylib/), [Matplotlib](http://matplotlib.org/) and [pandas](http://pandas.pydata.org/). 
There are tutorials available for each of these libraries, including the following:\n", + "\n", + "* [Tentative NumPy Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial)\n", + "* [SciPy Tutorial](http://docs.scipy.org/doc/scipy/reference/tutorial/)\n", + "* [Matplotlib PyPlot Tutorial](http://matplotlib.org/1.3.1/users/pyplot_tutorial.html)\n", + "* [Pandas Tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) (especially [10 Minutes to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html))\n", + "\n", + "There are many machine learning or data science resources that may be useful to help you continue the journey. Here is a sampling:\n", + "\n", + "* Scikit-learn's tutorial, [An introduction to machine learning with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)\n", + "* Kevin Markham's video series (on the Kaggle blog), [An introduction to machine learning with scikit-learn](http://blog.kaggle.com/2015/04/08/new-video-series-introduction-to-machine-learning-with-scikit-learn/)\n", + "* Kaggle's [Getting Started With Python For Data Science](http://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience)\n", + "* Coursera's [Introduction to Data Science](https://www.coursera.org/course/datasci)\n", + "* Olivier Grisel's Strata 2014 tutorial, [Parallel Machine Learning with scikit-learn and IPython](https://github.com/ogrisel/parallel_ml_tutorial)\n", + "\n", + "Please feel free to contact the author ([Joe McCarthy](mailto:joe@interrelativity.com?subject=Python for Data Science)) to suggest additional resources." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.10" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/simple_decision_tree.py b/simple_decision_tree.py new file mode 100644 index 0000000..1fdd367 --- /dev/null +++ b/simple_decision_tree.py @@ -0,0 +1,167 @@ +from __future__ import print_function, division + +''' A class to implement a simple decision tree (based on ID3) +''' + +__author__ = 'Joe McCarthy' +__email__ = 'joe@interrelativity.com' + + +from collections import Counter +from pprint import pprint +from simple_ml import majority_value, choose_best_attribute_index, split_instances + + +class SimpleDecisionTree: + + + _tree = {} # this instance variable becomes accessible to class methods via self._tree + + + def __init__(self): + # this is where we would initialize any parameters to the SimpleDecisionTree + pass + + def fit(self, + instances, + candidate_attribute_indexes=None, + target_attribute_index=0, + default_class=None, + trace=0): + ''' + Build a decision tree that best fits the data in instances. + + The target_attribute_index defaults to 0 (zero). + The candidate_attribute_indexes defaults to all other index values. + + The tree is constructed by recursively selecting & splitting instances based on + the highest information_gain of the candidate_attribute_indexes. + The class label is found in target_attribute_index. + The default_class is the majority value for that branch of the tree. + A positive trace value will print trace information during tree construction. 
+ + Derived from the simplified ID3 algorithm presented in + Building Decision Trees in Python by Christopher Roach, + http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3 + ''' + if not candidate_attribute_indexes: + candidate_attribute_indexes = [i + for i in range(len(instances[0])) + if i != target_attribute_index] + self._tree = self._create_tree(instances, + candidate_attribute_indexes, + target_attribute_index, + default_class, + trace) + + + def _create_tree(self, + instances, + candidate_attribute_indexes, + target_attribute_index=0, + default_class=None, + trace=0): + class_labels_and_counts = Counter([instance[target_attribute_index] + for instance in instances]) + # If the dataset is empty or the candidate attributes list is empty, + # return the default class label + if not instances or not candidate_attribute_indexes: + if trace: + print('{}Using default class {}'.format('< ' * trace, default_class)) + return default_class + + # If all the instances have the same class label, return that class label + elif len(class_labels_and_counts) == 1: + class_label = class_labels_and_counts.most_common(1)[0][0] + if trace: + print('{}All {} instances have label {}'.format( + '< ' * trace, len(instances), class_label)) + return class_label + + # Otherwise, create a new subtree and add it to the tree + else: + default_class = majority_value(instances, target_attribute_index) + + # Choose the next best attribute index to best classify the instances + best_index = choose_best_attribute_index(instances, + candidate_attribute_indexes, + target_attribute_index) + if trace: + print('{}Creating tree node for attribute index {}'.format( + '> ' * trace, best_index)) + + # Create a new decision tree node with the best attribute index + # and an empty dictionary object (for now) + tree = {best_index:{}} + + # Create a new decision tree sub-node (branch) + # for each of the values in the best attribute field + partitions = split_instances(instances, best_index) 
+ + # Remove that attribute from the set of candidates for further splits + remaining_candidate_attribute_indexes = [i + for i in candidate_attribute_indexes + if i != best_index] + + for attribute_value in partitions: + if trace: + print('{}Creating subtree for value {} ({}, {}, {}, {})'.format( + '> ' * trace, + attribute_value, + len(partitions[attribute_value]), + len(remaining_candidate_attribute_indexes), + target_attribute_index, + default_class)) + + # Create a subtree for each value of the best attribute + subtree = self._create_tree( + partitions[attribute_value], + remaining_candidate_attribute_indexes, + target_attribute_index, + default_class, + trace + 1 if trace else 0) + + # Add the new subtree to the empty dictionary object + # in the new tree/node created above + tree[best_index][attribute_value] = subtree + + return tree + + + def predict(self, instances, default_class=None): + '''Return the predicted class label(s) of instance(s)''' + if not isinstance(instances, list): + return self._predict(self._tree, instances, default_class) + else: + return [self._predict(self._tree, instance, default_class) + for instance in instances] + + + # a method intended to be "protected" that can implement the recursive algorithm to classify an instance given a tree + def _predict(self, tree, instance, default_class=None): + if not tree: + return default_class + if not isinstance(tree, dict): + return tree + attribute_index = list(tree.keys())[0] # using list(dict.keys()) for Py3 compatibility + attribute_values = list(tree.values())[0] + instance_attribute_value = instance[attribute_index] + if instance_attribute_value not in attribute_values: + return default_class + return self._predict(attribute_values[instance_attribute_value], + instance, + default_class) + + + def classification_accuracy(self, instances, default_class=None): + '''Return a 3-element tuple, in the order + (proportion correct, number correct, number incorrect), where + number correct is the number of correctly classified instances, + number incorrect is the number of incorrectly classified instances, and + proportion correct is the proportion of instances that 
were correctly classified + ''' + predicted_labels = self.predict(instances, default_class) + actual_labels = [x[0] for x in instances] + counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)]) + return counts[True] / len(instances), counts[True], counts[False] + + + def pprint(self): + pprint(self._tree) \ No newline at end of file From e2da93dba3203c990e7d72fe56e6db8ebe0d4bb0 Mon Sep 17 00:00:00 2001 From: joem Date: Tue, 21 Jul 2015 15:22:20 -0700 Subject: [PATCH 08/16] Updated for PyData Seattle tutorial --- 1_Introduction.ipynb | 261 +- 2_Data_Science_Basic_Concepts.ipynb | 651 +- 3_Python_Basic_Concepts.ipynb | 5131 +++++++------ 4_Python_Simple_Decision_Tree.ipynb | 3319 ++++----- 5_Next_Steps.ipynb | 190 +- Python_for_Data_Science_all.html | 5051 ++++++------- Python_for_Data_Science_all.ipynb | 10126 ++++++++++++++------------ SimpleDecisionTree.py | 131 +- simple_ml.py | 16 +- 9 files changed, 12126 insertions(+), 12750 deletions(-) diff --git a/1_Introduction.ipynb b/1_Introduction.ipynb index 665e04c..59a5a0c 100644 --- a/1_Introduction.ipynb +++ b/1_Introduction.ipynb @@ -1,132 +1,135 @@ { - "metadata": { - "name": "", - "signature": "sha256:75ce16430b53be4d47a57bce8487bd8922d658c887ef151a5a82a81c168837f1" - }, - "nbformat": 3, - "nbformat_minor": 0, - "worksheets": [ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python for Data Science" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Joe McCarthy](http://interrelativity.com/joe), \n", + "*Data Scientist*, [Indeed](http://www.indeed.com/)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from IPython.display import display, Image, HTML" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Navigation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notebooks in this primer:\n", + 
"\n", + "1. **Introduction** (*you are here*) \n", + "2. [Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", + "3. [Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", + "4. [Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", + "5. [Next Steps](5_Next_Steps.ipynb)" + ] + }, { - "cells": [ - { - "cell_type": "heading", - "level": 1, - "metadata": {}, - "source": [ - "Python for Data Science" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Joe McCarthy](http://interrelativity.com/joe), \n", - "*Director, Analytics & Data Science*, [Atigeo, LLC](http://atigeo.com)" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "from IPython.display import display, Image, HTML" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 1 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Navigation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notebooks in this primer:\n", - "\n", - "1. **Introduction** (*you are here*) \n", - "2. [Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", - "3. [Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", - "4. [Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", - "5. [Next Steps](5_Next_Steps.ipynb)" - ] - }, - { - "cell_type": "heading", - "level": 2, - "metadata": {}, - "source": [ - "1. 
Introduction" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\"python-logo-master-v3-TM.png\"\n", - "This short primer on [Python](http://www.python.org/) is designed to provide a rapid \"on-ramp\" to enable computer programmers who are already familiar with concepts and constructs in other programming languages learn enough about Python to facilitate the effective use of open-source and proprietary Python-based machine learning and data science tools.\n", - "\n", - "\"nltk_book_cover.gif\"\n", - "The primer is motivated, in part, by the approach taken in the [Natural Language Toolkit (NLTK) book](http://www.nltk.org/book/), which provides a rapid on-ramp for using Python and the open-source [NLTK library](http://www.nltk.org/) to develop programs using natural language processing techniques (many of which involve [machine learning](http://www.nltk.org/book/ch06.html)).\n", - "\n", - "The [Python Tutorial](http://docs.python.org/2/tutorial/) offers a more comprehensive primer, and opens with an excellent - if biased - overview of some of the general strengths of the Python programming language:\n", - "\n", - "> Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. 
Python\u2019s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.\n", - "\n", - "\"Python\n", - "[Hans Petter Langtangen](http://folk.uio.no/hpl/), author of [Python Scripting for Computational Science](http://www.amazon.com/Python-Scripting-Computational-Science-Engineering/dp/3642093159), emphasizes the utility of Python for many of the common tasks in all areas of computational science:\n", - "\n", - "> Very often programming is about shuffling data in and out of different tools, converting one data format to another, extracting numerical data from a text, and administering numerical experiments involving a large number of data files and directories. Such tasks are much faster to accomplish in a language like Python than in Fortran, C, C++, C#, or Java\n", - "\n", - "[Foster Provost](http://people.stern.nyu.edu/fprovost/), co-author of [Data Science for Business](http://data-science-for-biz.com/), describes why Python is such a useful programming language for practical data science in [Python: A Practical Tool for Data Science](https://docs.google.com/document/pub?id=1p6vowsEuiezLbWnFKgse70a8LxfsrRixqPF5nBg8F3A), :\n", - "\n", - "> The practice of data science involves many interrelated but different activities, including accessing data, manipulating data, computing statistics about data, plotting/graphing/visualizing data, building predictive and explanatory models from data, evaluating those models on yet more data, integrating models into production systems, etc. One option for the data scientist is to learn several different software packages that each specialize in one or two of these things, but don\u2019t do them all well, plus learn a programming language to tie them together. (Or do a lot of manual work.) 
\n", - "> \n", - "> An alternative is to use a general-purpose, high-level programming language that provides libraries to do all these things. Python is an excellent choice for this. It has a diverse range of open source libraries for just about everything the data scientist will do. It is available everywhere; high performance python interpreters exist for running your code on almost any operating system or architecture. Python and most of its libraries are both open source and free. Contrast this with common software packages that are available in a course via an academic license, yet are extremely expensive to license and use in industry.\n", - "\n", - "\"scikit-learn-logo-small.png\"\n", - "The goal of this primer is to provide efficient and sufficient scaffolding for software engineers with no prior knowledge of Python to be able to effectively use Python-based tools for data science research and development, such as the open-source library [scikit-learn](http://scikit-learn.org/). There is another, more comprehensive tutorial for scikit-learn, [Python Scientific Lecture Notes](http://scipy-lectures.github.io/index.html), that includes coverage of a number of other useful Python open-source libraries used by scikit-learn ([numpy](http://www.numpy.org/), [scipy](http://www.scipy.org/) and [matplotlib](http://matplotlib.org)) - all highly recommended ... and, to keep things simple, all beyond the scope of this primer.\n", - "\n", - "\"xp-logo-forslider.png\"\n", - "The initial motivation for this primer was a 2-hour training session for a group of experienced software engineers to learn enough Python to utilize the [Atigeo xPatterns analytics framework API](http://atigeo.com/technology/) in their software development work. 
I am grateful to the company for affording me the opportunity to develop this educational tool, and to make it freely available to others who might be looking for a fast on-ramp to Python for data science.\n", - "\n", - "Using an IPython Notebook as a delivery vehicle for this primer was motivated by Brian Granger's inspiring tutorial, [The IPython Notebook: Get Close to Your Data with Python and JavaScript](http://strataconf.com/strata2014/public/schedule/detail/32033), one of the [highlights from my Strata 2014 conference experience](http://gumption.typepad.com/blog/2014/02/ipython-deep-learning-doing-good-some-highlights-from-strata-2014.html).\n", - "\n", - "One final note on external resources: the [Python Style Guide (PEP-0008)](http://legacy.python.org/dev/peps/pep-0008/) offers helpful tips on how best to format Python code. [Code like a Pythonista](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html) offers a number of additional tips on Python programming style and philosophy, several of which are incorporated into this primer.\n", - "\n", - "We will focus entirely on using Python within the interpreter environment (as supported within an IPython Notebook). Python scripts - files containing definitions of functions and variables, and typically including code invoking some of those functions - can also be run from a command line. Using Python scripts from the command line may be the subject of a future primer. \n", - "\n", - "To help motivate the data science-oriented Python programming examples provided in this primer, we will start off with a brief overview of basic concepts and terminology in data science." - ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Navigation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notebooks in this primer:\n", - "\n", - "1. **Introduction** (*you are here*) \n", - "2. 
[Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", - "3. [Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", - "4. [Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", - "5. [Next Steps](5_Next_Steps.ipynb)" - ] - } - ], - "metadata": {} + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Introduction" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"python-logo-master-v3-TM.png\"\n", + "This short primer on [Python](http://www.python.org/) is designed to provide a rapid \"on-ramp\" to enable computer programmers who are already familiar with concepts and constructs in other programming languages learn enough about Python to facilitate the effective use of open-source and proprietary Python-based machine learning and data science tools.\n", + "\n", + "\"nltk_book_cover.gif\"\n", + "The primer is motivated, in part, by the approach taken in the [Natural Language Toolkit (NLTK) book](http://www.nltk.org/book/), which provides a rapid on-ramp for using Python and the open-source [NLTK library](http://www.nltk.org/) to develop programs using natural language processing techniques (many of which involve [machine learning](http://www.nltk.org/book/ch06.html)).\n", + "\n", + "The [Python Tutorial](http://docs.python.org/2/tutorial/) offers a more comprehensive primer, and opens with an excellent - if biased - overview of some of the general strengths of the Python programming language:\n", + "\n", + "> Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. 
Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.\n", + "\n", + "\"Python\n", + "[Hans Petter Langtangen](http://folk.uio.no/hpl/), author of [Python Scripting for Computational Science](http://www.amazon.com/Python-Scripting-Computational-Science-Engineering/dp/3642093159), emphasizes the utility of Python for many of the common tasks in all areas of computational science:\n", + "\n", + "> Very often programming is about shuffling data in and out of different tools, converting one data format to another, extracting numerical data from a text, and administering numerical experiments involving a large number of data files and directories. Such tasks are much faster to accomplish in a language like Python than in Fortran, C, C++, C#, or Java\n", + "\n", + "[Foster Provost](http://people.stern.nyu.edu/fprovost/), co-author of [Data Science for Business](http://data-science-for-biz.com/), describes why Python is such a useful programming language for practical data science in [Python: A Practical Tool for Data Science](https://docs.google.com/document/pub?id=1p6vowsEuiezLbWnFKgse70a8LxfsrRixqPF5nBg8F3A), :\n", + "\n", + "> The practice of data science involves many interrelated but different activities, including accessing data, manipulating data, computing statistics about data, plotting/graphing/visualizing data, building predictive and explanatory models from data, evaluating those models on yet more data, integrating models into production systems, etc. One option for the data scientist is to learn several different software packages that each specialize in one or two of these things, but don’t do them all well, plus learn a programming language to tie them together. (Or do a lot of manual work.) 
\n", + "> \n", + "> An alternative is to use a general-purpose, high-level programming language that provides libraries to do all these things. Python is an excellent choice for this. It has a diverse range of open source libraries for just about everything the data scientist will do. It is available everywhere; high performance python interpreters exist for running your code on almost any operating system or architecture. Python and most of its libraries are both open source and free. Contrast this with common software packages that are available in a course via an academic license, yet are extremely expensive to license and use in industry.\n", + "\n", + "\"scikit-learn-logo-small.png\"\n", + "The goal of this primer is to provide efficient and sufficient scaffolding for software engineers with no prior knowledge of Python to be able to effectively use Python-based tools for data science research and development, such as the open-source library [scikit-learn](http://scikit-learn.org/). There is another, more comprehensive tutorial for scikit-learn, [Python Scientific Lecture Notes](http://scipy-lectures.github.io/index.html), that includes coverage of a number of other useful Python open-source libraries used by scikit-learn ([numpy](http://www.numpy.org/), [scipy](http://www.scipy.org/) and [matplotlib](http://matplotlib.org)) - all highly recommended ... and, to keep things simple, all beyond the scope of this primer.\n", + "\n", + "Using an IPython Notebook as a delivery vehicle for this primer was motivated by Brian Granger's inspiring tutorial, [The IPython Notebook: Get Close to Your Data with Python and JavaScript](http://strataconf.com/strata2014/public/schedule/detail/32033), one of the [highlights from my Strata 2014 conference experience](http://gumption.typepad.com/blog/2014/02/ipython-deep-learning-doing-good-some-highlights-from-strata-2014.html). 
You can run this notebook locally in a browser once you [install ipython notebook](http://ipython.org/install.html).\n", + "\n", + "One final note on external resources: the [Python Style Guide (PEP-0008)](http://legacy.python.org/dev/peps/pep-0008/) offers helpful tips on how best to format Python code. [Code like a Pythonista](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html) offers a number of additional tips on Python programming style and philosophy, several of which are incorporated into this primer.\n", + "\n", + "We will focus entirely on using Python within the interpreter environment (as supported within an IPython Notebook). Python scripts - files containing definitions of functions and variables, and typically including code invoking some of those functions - can also be run from a command line. Using Python scripts from the command line may be the subject of a future primer. \n", + "\n", + "To help motivate the data science-oriented Python programming examples provided in this primer, we will start off with a brief overview of basic concepts and terminology in data science." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Navigation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notebooks in this primer:\n", + "\n", + "1. **Introduction** (*you are here*) \n", + "2. [Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", + "3. [Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", + "4. [Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", + "5. 
[Next Steps](5_Next_Steps.ipynb)" + ] } - ] -} \ No newline at end of file + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.10" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/2_Data_Science_Basic_Concepts.ipynb b/2_Data_Science_Basic_Concepts.ipynb index bcb0ecd..bc2ef3a 100644 --- a/2_Data_Science_Basic_Concepts.ipynb +++ b/2_Data_Science_Basic_Concepts.ipynb @@ -1,329 +1,328 @@ { - "metadata": { - "name": "", - "signature": "sha256:e9ce83ce359770d776e46f0cbff068bb20238ec1716dfdaa9f755145605adda4" - }, - "nbformat": 3, - "nbformat_minor": 0, - "worksheets": [ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python for Data Science" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Joe McCarthy](http://interrelativity.com/joe), \n", + "*Data Scientist*, [Indeed](http://www.indeed.com/)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from IPython.display import display, Image, HTML" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Navigation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notebooks in this primer:\n", + "\n", + "1. [Introduction](1_Introduction.ipynb)\n", + "2. **Data Science: Basic Concepts** (*you are here*)\n", + "3. [Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", + "4. [Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", + "5. [Next Steps](5_Next_Steps.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Data Science: Basic Concepts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Science and Data Mining" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"DataScienceForBusiness_cover.jpg\"\n", + "Foster Provost and [Tom Fawcett](http://home.comcast.net/~tom.fawcett/public_html/index.html) offer succinct descriptions of data science and data mining in [Data Science for Business](http://data-science-for-biz.com/):\n", + "\n", + "> **Data science** involves principles, processes and techniques for understanding phenomena via the (automated) analysis of data.\n", + "> \n", + "> **Data mining** is the extraction of knowledge from data, via technologies that incorporate these principles." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Knowledge Discovery, Data Mining and Machine Learning" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Provost & Fawcett also offer some history and insights into the relationship between *data mining* and *machine learning*, terms which are often used somewhat interchangeably:\n", + "\n", + "> The field of Data Mining (or KDD: Knowledge Discovery and Data Mining) started as an offshoot of Machine Learning, and they remain closely linked. Both fields are concerned with the analysis of data to find useful or informative patterns. Techniques and algorithms are shared between the two; indeed, the areas are so closely related that researchers commonly participate in both communities and transition between them seamlessly. Nevertheless, it is worth pointing out some of the differences to give perspective.\n", + "> \n", + ">Speaking generally, because Machine Learning is concerned with many types of performance improvement, it includes subfields such as robotics and computer vision that are not part of KDD. 
It also is concerned with issues of agency and cognition — how will an intelligent agent use learned knowledge to reason and act in its environment — which are not concerns of Data Mining.\n", + "> \n", + ">Historically, KDD spun off from Machine Learning as a research field focused on concerns raised by examining real-world applications, and a decade and a half later the KDD community remains more concerned with applications than Machine Learning is. As such, research focused on commercial applications and business issues of data analysis tends to gravitate toward the KDD community rather than to Machine Learning. KDD also tends to be more concerned with the entire process of data analytics: data preparation, model learning, evaluation, and so on.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Cross Industry Standard Process for Data Mining (CRISP-DM)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [Cross Industry Standard Process for Data Mining](https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) introduced a process model for data mining in 2000 that has become widely adopted.\n", + "\n", + "\"CRISP-DM_Process_Diagram\"\n", + "\n", + "The model emphasizes the ***iterative*** nature of the data mining process, distinguishing several different stages that are regularly revisited in the course of developing and deploying data-driven solutions to business problems:\n", + "\n", + "* Business understanding\n", + "* Data understanding\n", + "* Data preparation\n", + "* Modeling \n", + "* Deployment\n", + "\n", + "We will be focusing primarily on using Python for **data preparation** and **modeling**." 
+ ] + }, { - "cells": [ - { - "cell_type": "heading", - "level": 1, - "metadata": {}, - "source": [ - "Python for Data Science" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Joe McCarthy](http://interrelativity.com/joe), \n", - "*Director, Analytics & Data Science*, [Atigeo, LLC](http://atigeo.com)" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "from IPython.display import display, Image, HTML" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 1 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Navigation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notebooks in this primer:\n", - "\n", - "1. [Introduction](1_Introduction.ipynb)\n", - "2. **Data Science: Basic Concepts** (*you are here*)\n", - "3. [Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", - "4. [Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", - "5. [Next Steps](5_Next_Steps.ipynb)" - ] - }, - { - "cell_type": "heading", - "level": 2, - "metadata": {}, - "source": [ - "2. Data Science: Basic Concepts" - ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Data Science and Data Mining" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\"DataScienceForBusiness_cover.jpg\"\n", - "Foster Provost and [Tom Fawcett](http://home.comcast.net/~tom.fawcett/public_html/index.html) offer succinct descriptions of data science and data mining in [Data Science for Business](http://data-science-for-biz.com/):\n", - "\n", - "> **Data science** involves principles, processes and techniques for understanding phenomena via the (automated) analysis of data.\n", - "> \n", - "> **Data mining** is the extraction of knowledge from data, via technologies that incorporate these principles." 
- ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Knowledge Discovery, Data Mining and Machine Learning" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Provost & Fawcett also offer some history and insights into the relationship between *data mining* and *machine learning*, terms which are often used somewhat interchangeably:\n", - "\n", - "> The field of Data Mining (or KDD: Knowledge Discovery and Data Mining) started as an offshoot of Machine Learning, and they remain closely linked. Both fields are concerned with the analysis of data to find useful or informative patterns. Techniques and algorithms are shared between the two; indeed, the areas are so closely related that researchers commonly participate in both communities and transition between them seamlessly. Nevertheless, it is worth pointing out some of the differences to give perspective.\n", - "> \n", - ">Speaking generally, because Machine Learning is concerned with many types of performance improvement, it includes subfields such as robotics and computer vision that are not part of KDD. It also is concerned with issues of agency and cognition \u2014 how will an intelligent agent use learned knowledge to reason and act in its environment \u2014 which are not concerns of Data Mining.\n", - "> \n", - ">Historically, KDD spun off from Machine Learning as a research field focused on concerns raised by examining real-world applications, and a decade and a half later the KDD community remains more concerned with applications than Machine Learning is. As such, research focused on commercial applications and business issues of data analysis tends to gravitate toward the KDD community rather than to Machine Learning. 
KDD also tends to be more concerned with the entire process of data analytics: data preparation, model learning, evaluation, and so on.\n" - ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Cross Industry Standard Process for Data Mining (CRISP-DM)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [Cross Industry Standard Process for Data Mining](https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) introduced a process model for data mining in 2000 that has become widely adopted.\n", - "\n", - "\"CRISP-DM_Process_Diagram\"\n", - "\n", - "The model emphasizes the ***iterative*** nature of the data mining process, distinguishing several different stages that are regularly revisited in the course of developing and deploying data-driven solutions to business problems:\n", - "\n", - "* Business understanding\n", - "* Data understanding\n", - "* Data preparation\n", - "* Modeling \n", - "* Deployment\n", - "\n", - "We will be focusing primarily on using Python for **data preparation** and **modeling**." - ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Data Science Workflow" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Philip Guo](http://www.pgbovine.net/) presents a [Data Science Workflow](http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext) offering a slightly different process model emhasizing the importance of **reflection** and some of the meta-data, data management and bookkeeping challenges that typically arise in the data science process. 
His 2012 PhD thesis, [Software Tools to Facilitate Research Programming](http://pgbovine.net/projects/pubs/guo_phd_dissertation.pdf), offers an insightful and more comprehensive description of many of these challenges.\n", - "\n", - "\"pguo-data-science-overview.jpg\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Provost & Fawcett list a number of different tasks in which data science techniques are employed:\n", - "\n", - "* Classification and class probability estimation \n", - "* Regression (aka value estimation) \n", - "* Similarity matching \n", - "* Clustering \n", - "* Co-occurrence grouping (aka frequent itemset mining, association rule discovery, market-basket analysis) \n", - "* Profiling (aka behavior description, fraud / anomaly detection) \n", - "* Link prediction \n", - "* Data reduction \n", - "* Causal modeling \n", - "\n", - "We will be focusing primarily on **classification** and **class probability estimation** tasks, which are defined by Provost & Fawcett as follows:\n", - "\n", - "> *Classification* and *class probability estimation* attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to. Usually the classes are mutually exclusive. An example classification question would be: \u201cAmong all the customers of MegaTelCo, which are likely to respond to a given offer?\u201d In this example the two classes could be called will respond and will not respond.\n", - "\n", - "To further simplify this primer, we will focus exclusively on **supervised** methods, in which the data is explicitly labeled with classes. There are also *unsupervised* methods that involve working with data in which there are no pre-specified class labels." 
- ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Supervised Classification" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [Natural Language Toolkit (NLTK) book](http://www.nltk.org/book) provides a diagram and succinct description (below, with italics and bold added for emphasis) of supervised classification:\n", - "\n", - "\"nltk_ch06_supervised-classification.png\"\n", - "\n", - "> *Supervised Classification*. (a) During *training*, a **feature extractor** is used to convert each **input value** to a **feature set**. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. Pairs of feature sets and **labels** are fed into the **machine learning algorithm** to generate a **model**. (b) During *prediction*, the same feature extractor is used to convert **unseen inputs** to feature sets. These feature sets are then fed into the model, which generates **predicted labels**." 
- ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Data Mining Terminology" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "* **Structured** data has simple, well-defined patterns (e.g., a table or graph)\n", - "* **Unstructured** data has less well-defined patterns (e.g., text, images)\n", - "* **Model**: a pattern that captures / generalizes regularities in data (e.g., an equation, set of rules, decision tree)\n", - "* **Attribute** (aka *variable*, *feature*, *signal*, *column*): an element used in a model\n", - "* **Example** (aka *instance*, *feature vector*, *row*): a representation of an entity being modeled\n", - "* **Target attribute** (aka *dependent variable*, *class label*): the class / type / category of an entity being modeled" - ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Data Mining Example: UCI Mushroom dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [Center for Machine Learning and Intelligent Systems](http://cml.ics.uci.edu/) at the University of California, Irvine (UCI), hosts a [Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html) containing over 200 publicly available data sets.\n", - "\n", - "\"mushroom\"/\n", - "We will use the [mushroom](https://archive.ics.uci.edu/ml/datasets/Mushroom) data set, which forms the basis of several examples in Chapter 3 of the Provost & Fawcett data science book.\n", - "\n", - "The following description of the dataset is provided at the UCI repository:\n", - "\n", - ">This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525 [The Audubon Society Field Guide to North American Mushrooms, 1981]). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. 
This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like leaflets three, let it be'' for Poisonous Oak and Ivy.\n", - "> \n", - "> **Number of Instances**: 8124\n", - "> \n", - "> **Number of Attributes**: 22 (all nominally valued)\n", - "> \n", - "> **Attribute Information**: (*classes*: edible=e, poisonous=p)\n", - "> \n", - "> 1. *cap-shape*: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", - "> 2. *cap-surface*: fibrous=f, grooves=g, scaly=y, smooth=s\n", - "> 3. *cap-color*: brown=n ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y\n", - "> 4. *bruises?*: bruises=t, no=f\n", - "> 5. *odor*: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s\n", - "> 6. *gill-attachment*: attached=a, descending=d, free=f, notched=n\n", - "> 7. *gill-spacing*: close=c, crowded=w, distant=d\n", - "> 8. *gill-size*: broad=b, narrow=n\n", - "> 9. *gill-color*: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y\n", - "> 10. *stalk-shape*: enlarging=e, tapering=t\n", - "> 11. *stalk-root*: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?\n", - "> 12. *stalk-surface-above-ring*: fibrous=f, scaly=y, silky=k, smooth=s\n", - "> 13. *stalk-surface-below-ring*: fibrous=f, scaly=y, silky=k, smooth=s\n", - "> 14. *stalk-color-above-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\n", - "> 15. *stalk-color-below-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\n", - "> 16. *veil-type*: partial=p, universal=u\n", - "> 17. *veil-color*: brown=n, orange=o, white=w, yellow=y\n", - "> 18. *ring-number*: none=n, one=o, two=t\n", - "> 19. *ring-type*: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z\n", - "> 20. 
*spore-print-color*: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y\n", - "> 21. *population*: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y\n", - "> 22. *habitat*: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d\n", - "> \n", - "> **Missing Attribute Values**: 2480 of them (denoted by \"?\"), all for attribute #11.\n", - "> \n", - "> **Class Distribution**: -- edible: 4208 (51.8%) -- poisonous: 3916 (48.2%) -- total: 8124 instances\n", - "\n", - "The [data file](https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data) associated with this dataset has one instance of a hypothetical mushroom per line, with abbreviations for the values of the class and each of the other 22 attributes separated by commas.\n", - "\n", - "Here is a sample line from the data file:\n", - "\n", - "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", - "\n", - "This instance represents a mushroom with the following attribute values:\n", - "\n", - "*class*: edible=e, **poisonous=p**\n", - "\n", - "1. *cap-shape*: bell=b, conical=c, convex=x, flat=f, **knobbed=k**, sunken=s\n", - "2. *cap-surface*: **fibrous=f**, grooves=g, scaly=y, smooth=s\n", - "3. *cap-color*: **brown=n** ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y\n", - "4. *bruises?*: bruises=t, **no=f**\n", - "5. *odor*: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, **none=n**, pungent=p, spicy=s\n", - "6. *gill-attachment*: attached=a, descending=d, **free=f**, notched=n\n", - "7. *gill-spacing*: **close=c**, crowded=w, distant=d\n", - "8. *gill-size*: broad=b, **narrow=n**\n", - "9. *gill-color*: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, **white=w**, yellow=y\n", - "10. *stalk-shape*: **enlarging=e**, tapering=t\n", - "11. *stalk-root*: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, **missing=?**\n", - "12. 
*stalk-surface-above-ring*: fibrous=f, scaly=y, **silky=k**, smooth=s\n", - "13. *stalk-surface-below-ring*: fibrous=f, **scaly=y**, silky=k, smooth=s\n", - "14. *stalk-color-above-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, **white=w**, yellow=y\n", - "15. *stalk-color-below-ring*: **brown=n**, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\n", - "16. *veil-type*: **partial=p**, universal=u\n", - "17. *veil-color*: brown=n, orange=o, **white=w**, yellow=y\n", - "18. *ring-number*: none=n, **one=o**, two=t\n", - "19. *ring-type*: cobwebby=c, **evanescent=e**, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z\n", - "20. *spore-print-color*: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, **white=w**, yellow=y\n", - "21. *population*: abundant=a, clustered=c, numerous=n, scattered=s, **several=v**, solitary=y\n", - "22. *habitat*: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, **woods=d**\n", - "\n", - "Building a model with this data set will serve as a motivating example throughout much of this primer." - ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Navigation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notebooks in this primer:\n", - "\n", - "1. [Introduction](1_Introduction.ipynb)\n", - "2. **Data Science: Basic Concepts** (*you are here*)\n", - "3. [Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", - "4. [Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", - "5. 
[Next Steps](5_Next_Steps.ipynb)" - ] - } - ], - "metadata": {} + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Science Workflow" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Philip Guo](http://www.pgbovine.net/) presents a [Data Science Workflow](http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext) offering a slightly different process model emphasizing the importance of **reflection** and some of the meta-data, data management and bookkeeping challenges that typically arise in the data science process. His 2012 PhD thesis, [Software Tools to Facilitate Research Programming](http://pgbovine.net/projects/pubs/guo_phd_dissertation.pdf), offers an insightful and more comprehensive description of many of these challenges.\n", + "\n", + "
\"pguo-data-science-overview.jpg\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Provost & Fawcett list a number of different tasks in which data science techniques are employed:\n", + "\n", + "* Classification and class probability estimation \n", + "* Regression (aka value estimation) \n", + "* Similarity matching \n", + "* Clustering \n", + "* Co-occurrence grouping (aka frequent itemset mining, association rule discovery, market-basket analysis) \n", + "* Profiling (aka behavior description, fraud / anomaly detection) \n", + "* Link prediction \n", + "* Data reduction \n", + "* Causal modeling \n", + "\n", + "We will be focusing primarily on **classification** and **class probability estimation** tasks, which are defined by Provost & Fawcett as follows:\n", + "\n", + "> *Classification* and *class probability estimation* attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to. Usually the classes are mutually exclusive. An example classification question would be: “Among all the customers of MegaTelCo, which are likely to respond to a given offer?” In this example the two classes could be called will respond and will not respond.\n", + "\n", + "To further simplify this primer, we will focus exclusively on **supervised** methods, in which the data is explicitly labeled with classes. There are also *unsupervised* methods that involve working with data in which there are no pre-specified class labels." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Supervised Classification" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [Natural Language Toolkit (NLTK) book](http://www.nltk.org/book) provides a diagram and succinct description (below, with italics and bold added for emphasis) of supervised classification:\n", + "\n", + "\"nltk_ch06_supervised-classification.png\"\n", + "\n", + "> *Supervised Classification*. 
(a) During *training*, a **feature extractor** is used to convert each **input value** to a **feature set**. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. Pairs of feature sets and **labels** are fed into the **machine learning algorithm** to generate a **model**. (b) During *prediction*, the same feature extractor is used to convert **unseen inputs** to feature sets. These feature sets are then fed into the model, which generates **predicted labels**." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Mining Terminology" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* **Structured** data has simple, well-defined patterns (e.g., a table or graph)\n", + "* **Unstructured** data has less well-defined patterns (e.g., text, images)\n", + "* **Model**: a pattern that captures / generalizes regularities in data (e.g., an equation, set of rules, decision tree)\n", + "* **Attribute** (aka *variable*, *feature*, *signal*, *column*): an element used in a model\n", + "* **Instance** (aka *example*, *feature vector*, *row*): a representation of a single entity being modeled\n", + "* **Target attribute** (aka *dependent variable*, *class label*): the class / type / category of an entity being modeled" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Mining Example: UCI Mushroom dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [Center for Machine Learning and Intelligent Systems](http://cml.ics.uci.edu/) at the University of California, Irvine (UCI), hosts a [Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html) containing over 200 publicly available data sets.\n", + "\n", + "\"mushroom\"/\n", + "We will use the [mushroom](https://archive.ics.uci.edu/ml/datasets/Mushroom) data set, which forms the basis of several examples in 
Chapter 3 of the Provost & Fawcett data science book.\n", + "\n", + "The following description of the dataset is provided at the UCI repository:\n", + "\n", + ">This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525 [The Audubon Society Field Guide to North American Mushrooms, 1981]). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like leaflets three, let it be'' for Poisonous Oak and Ivy.\n", + "> \n", + "> **Number of Instances**: 8124\n", + "> \n", + "> **Number of Attributes**: 22 (all nominally valued)\n", + "> \n", + "> **Attribute Information**: (*classes*: edible=e, poisonous=p)\n", + "> \n", + "> 1. *cap-shape*: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", + "> 2. *cap-surface*: fibrous=f, grooves=g, scaly=y, smooth=s\n", + "> 3. *cap-color*: brown=n ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y\n", + "> 4. *bruises?*: bruises=t, no=f\n", + "> 5. *odor*: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s\n", + "> 6. *gill-attachment*: attached=a, descending=d, free=f, notched=n\n", + "> 7. *gill-spacing*: close=c, crowded=w, distant=d\n", + "> 8. *gill-size*: broad=b, narrow=n\n", + "> 9. *gill-color*: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y\n", + "> 10. *stalk-shape*: enlarging=e, tapering=t\n", + "> 11. *stalk-root*: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?\n", + "> 12. *stalk-surface-above-ring*: fibrous=f, scaly=y, silky=k, smooth=s\n", + "> 13. *stalk-surface-below-ring*: fibrous=f, scaly=y, silky=k, smooth=s\n", + "> 14. 
*stalk-color-above-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\n", + "> 15. *stalk-color-below-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\n", + "> 16. *veil-type*: partial=p, universal=u\n", + "> 17. *veil-color*: brown=n, orange=o, white=w, yellow=y\n", + "> 18. *ring-number*: none=n, one=o, two=t\n", + "> 19. *ring-type*: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z\n", + "> 20. *spore-print-color*: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y\n", + "> 21. *population*: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y\n", + "> 22. *habitat*: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d\n", + "> \n", + "> **Missing Attribute Values**: 2480 of them (denoted by \"?\"), all for attribute #11.\n", + "> \n", + "> **Class Distribution**: -- edible: 4208 (51.8%) -- poisonous: 3916 (48.2%) -- total: 8124 instances\n", + "\n", + "The [data file](https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data) associated with this dataset has one instance of a hypothetical mushroom per line, with abbreviations for the values of the class and each of the other 22 attributes separated by commas.\n", + "\n", + "Here is a sample line from the data file:\n", + "\n", + "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", + "\n", + "This instance represents a mushroom with the following attribute values (highlighted in **bold**):\n", + "\n", + "*class*: edible=e, **poisonous=p**\n", + "\n", + "1. *cap-shape*: bell=b, conical=c, convex=x, flat=f, **knobbed=k**, sunken=s\n", + "2. *cap-surface*: **fibrous=f**, grooves=g, scaly=y, smooth=s\n", + "3. *cap-color*: **brown=n** ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y\n", + "4. *bruises?*: bruises=t, **no=f**\n", + "5. 
*odor*: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, **none=n**, pungent=p, spicy=s\n", + "6. *gill-attachment*: attached=a, descending=d, **free=f**, notched=n\n", + "7. *gill-spacing*: **close=c**, crowded=w, distant=d\n", + "8. *gill-size*: broad=b, **narrow=n**\n", + "9. *gill-color*: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, **white=w**, yellow=y\n", + "10. *stalk-shape*: **enlarging=e**, tapering=t\n", + "11. *stalk-root*: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, **missing=?**\n", + "12. *stalk-surface-above-ring*: fibrous=f, scaly=y, **silky=k**, smooth=s\n", + "13. *stalk-surface-below-ring*: fibrous=f, **scaly=y**, silky=k, smooth=s\n", + "14. *stalk-color-above-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, **white=w**, yellow=y\n", + "15. *stalk-color-below-ring*: **brown=n**, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\n", + "16. *veil-type*: **partial=p**, universal=u\n", + "17. *veil-color*: brown=n, orange=o, **white=w**, yellow=y\n", + "18. *ring-number*: none=n, **one=o**, two=t\n", + "19. *ring-type*: cobwebby=c, **evanescent=e**, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z\n", + "20. *spore-print-color*: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, **white=w**, yellow=y\n", + "21. *population*: abundant=a, clustered=c, numerous=n, scattered=s, **several=v**, solitary=y\n", + "22. *habitat*: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, **woods=d**\n", + "\n", + "Building a model with this data set will serve as a motivating example throughout much of this primer." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Navigation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notebooks in this primer:\n", + "\n", + "1. [Introduction](1_Introduction.ipynb)\n", + "2. 
**Data Science: Basic Concepts** (*you are here*)\n", + "3. [Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", + "4. [Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", + "5. [Next Steps](5_Next_Steps.ipynb)" + ] } - ] -} \ No newline at end of file + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.10" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/3_Python_Basic_Concepts.ipynb b/3_Python_Basic_Concepts.ipynb index 5722db7..9a5f781 100644 --- a/3_Python_Basic_Concepts.ipynb +++ b/3_Python_Basic_Concepts.ipynb @@ -1,2676 +1,2461 @@ { + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python for Data Science" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Joe McCarthy](http://interrelativity.com/joe), \n", + "*Data Scientist*, [Indeed](http://www.indeed.com/)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from IPython.display import display, Image, HTML" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Navigation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notebooks in this primer:\n", + "\n", + "* [1. Introduction](1_Introduction.ipynb)\n", + "* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", + "* **3. Python: Basic Concepts** (*you are here*)\n", + "* [4. Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", + "* [5. 
Next Steps](5_Next_Steps.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Python: Basic Concepts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### *A note on Python 2 vs. Python 3*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are 2 major versions of Python in widespread use: [Python 2](https://docs.python.org/2/) and [Python 3](https://docs.python.org/3/). Python 3 has some features that are not backward compatible with Python 2, and some Python 2 libraries have not been updated to work with Python 3. I have been using Python 2, primarily because I use some of those Python 2[-only] libraries, but an increasing proportion of them are migrating to Python 3, and I anticipate shifting to Python 3 in the near future.\n", + "\n", + "For more on the topic, I recommend a very well documented IPython Notebook, which includes numerous helpful examples and links, by [Sebastian Raschka](http://sebastianraschka.com/), [Key differences between Python 2.7.x and Python 3.x](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/key_differences_between_python_2_and_3.ipynb), the [Cheat Sheet: Writing Python 2-3 compatible code](http://python-future.org/compatible_idioms.html) by Ed Schofield ... or [googling Python 2 vs 3](https://www.google.com/q=python%202%20vs%203).\n", + "\n", + "[Nick Coghlan](https://twitter.com/ncoghlan_dev), a CPython core developer, sent me an email suggesting that relatively minor changes in this notebook would enable it to run with Python 2 *or* Python 3: importing the `print_function` from the [**`__future__`**](https://docs.python.org/2/library/__future__.html) module, and changing my [`print` *statements* (Python 2)](https://docs.python.org/2/reference/simple_stmts.html#print) to [`print` *function calls* (Python 3)](https://docs.python.org/3/library/functions.html#print). 
Although a relatively minor conceptual change, it necessitated the changing of many individual cells to reflect the Python 3 `print` syntax. \n", + "\n", + "I decided to import the `division` feature from the `__future__` module, as I find [the use of `/` for \"true division\"](https://www.python.org/dev/peps/pep-0238/) - and the use of `//` for \"floor division\" - to be more aligned with my intuition. I also needed to replace a few functions that are no longer available in Python 3 with related functions that are available in both versions; I've added notes in nearby cells where the incompatible functions were removed explaining why they are related ... and no longer available. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from __future__ import print_function, division" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Names (identifiers), strings & binding values to names (assignment)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The sample instance of a mushroom shown above can be represented as a string. \n", + "\n", + "A Python ***string* ([`str`](http://docs.python.org/2/tutorial/introduction.html#strings))** is a sequence of 0 or more characters enclosed within a pair of single quotes (`'`) or a pair of double quotes (`\"`). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python [*identifiers*](http://docs.python.org/2/reference/lexical_analysis.html#identifiers) (or [*names*](https://docs.python.org/2/reference/executionmodel.html#naming-and-binding)) are composed of letters, numbers and/or underscores ('`_`'), starting with a letter or underscore. Python identifiers are case sensitive. 
Although camelCase identifiers can be used, it is generally considered more [pythonic](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html) to use underscores. Python variables and functions typically start with lowercase letters; Python classes start with uppercase letters.\n", + "\n", + "The following [assignment statement](http://docs.python.org/2/reference/simple_stmts.html#assignment-statements) binds the value of the string shown above to the name `single_instance_str`. Typing the name on the subsequent line will cause the interpreter to print the value bound to that name." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "single_instance_str = 'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'\n", + "single_instance_str" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Printing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`print`**](https://docs.python.org/3/library/functions.html#print) function writes the value of its comma-delimited arguments to [**`sys.stdout`**](http://docs.python.org/2/library/sys.html#sys.stdout) (typically the console). Each value in the output is separated by a single blank space. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('A', 'B', 'C', 1, 2, 3)\n", + "print('Instance 1:', single_instance_str)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The print function has an optional keyword argument, **`end`**. When this argument is used and its value does not include `'\\n'` (newline character), the output cursor will not advance to the next line."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('A', 'B') # no end argument\n", + "print('C')\n", + "print('A', 'B', end='...\\n') # end includes '\\n' --> output cursor advances to next line\n", + "print('C')\n", + "print('A', 'B', end=' ') # end=' ' --> use a space rather than newline at the end of the line\n", + "print('C') # so that subsequent printed output will appear on same line" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Comments" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python ***comment*** character is **`'#'`**: anything after `'#'` on the line is ignored by the Python interpreter. PEP8 style guidelines recommend using at least 2 blank spaces before an inline comment that appears on the same line as any code.\n", + "\n", + "***Multi-line strings*** can be used within code blocks to provide multi-line comments.\n", + "\n", + "Multi-line strings are delimited by pairs of triple quotes (**`'''`** or **`\"\"\"`**). Any newlines in the string will be represented as `'\\n'` characters in the string." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "'''\n", + "This is\n", + "a multi-line\n", + "string'''" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('Before comment') # this is an inline comment\n", + "'''\n", + "This is\n", + "a multi-line\n", + "comment\n", + "'''\n", + "print('After comment')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Multi-line strings can be printed, in which case the embedded newline (`'\\n'`) characters will be converted to newlines in the output."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('''\n", + "This is\n", + "a multi-line\n", + "string''')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Lists" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A [**`list`**](http://docs.python.org/2/tutorial/introduction.html#lists) is an ordered ***sequence*** of 0 or more comma-delimited elements enclosed within square brackets ('`[`', '`]`'). The Python [**`str.split(sep)`**](http://docs.python.org/2/library/stdtypes.html#str.split) method can be used to split a `sep`-delimited string into a corresponding list of elements.\n", + "\n", + "In the following example, a comma-delimited string is split using `sep=','`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "single_instance_list = single_instance_str.split(',')\n", + "print(single_instance_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python lists are *heterogeneous*, i.e., they can contain elements of different types." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "mixed_list = ['a', 1, 2.3, True, [1, 'b']]\n", + "print(mixed_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python **`+`** operator can be used for addition, and also to concatenate strings and lists."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(1 + 2 + 3)\n", + "print('a' + 'b' + 'c')\n", + "print(['a', 1] + [2.3, True] + [[1, 'b']])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Accessing sequence elements & subsequences " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Individual elements of [*sequences*](http://docs.python.org/2/library/stdtypes.html#typesseq) (e.g., lists and strings) can be accessed by specifying their *zero-based index position* within square brackets ('`[`', '`]`').\n", + "\n", + "The following statements print out the 3rd element - at zero-based index position 2 - of `single_instance_str` and `single_instance_list`.\n", + "\n", + "Note that the 3rd elements are not the same, as commas count as elements in the string, but not in the list created by splitting a comma-delimited string." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str)\n", + "print(single_instance_str[2])\n", + "print(single_instance_list)\n", + "print(single_instance_list[2])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*Negative index values* can be used to specify a position offset from the end of the sequence.\n", + "\n", + "It is often useful to use a `-1` index value to access the last element of a sequence." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str)\n", + "print(single_instance_str[-1])\n", + "print(single_instance_str[-2])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_list)\n", + "print(single_instance_list[-1])\n", + "print(single_instance_list[-2])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python ***slice notation*** can be used to access subsequences by specifying two index positions separated by a colon (':'); `seq[start:stop]` returns all the elements in `seq` between `start` and `stop - 1` (inclusive)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str[2:4])\n", + "print(single_instance_list[2:4])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Slice index values can be negative." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str[-4:-2])\n", + "print(single_instance_list[-4:-2])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `start` and/or `stop` index can be omitted. A common use of slices with a single index value is to access all but the first element or all but the last element of a sequence."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str)\n", + "print(single_instance_str[:-1]) # all but the last \n", + "print(single_instance_str[:-2]) # all but the last 2 \n", + "print(single_instance_str[1:]) # all but the first\n", + "print(single_instance_str[2:]) # all but the first 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_list)\n", + "print(single_instance_list[:-1])\n", + "print(single_instance_list[1:])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Slice notation includes an optional third element, `step`, as in `seq[start:stop:step]`, that specifies the steps or increments by which elements are retrieved from `seq` between `start` and `stop - 1`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str)\n", + "print(single_instance_str[::2]) # print elements in even-numbered positions\n", + "print(single_instance_str[1::2]) # print elements in odd-numbered positions\n", + "print(single_instance_str[::-1]) # print elements in reverse order" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [Python tutorial](http://docs.python.org/2/tutorial/introduction.html) offers a helpful ASCII art representation to show how positive and negative indexes are interpreted:\n", + "\n", + "
\n",
+    " +---+---+---+---+---+\n",
+    " | H | e | l | p | A |\n",
+    " +---+---+---+---+---+\n",
+    " 0   1   2   3   4   5\n",
+    "-5  -4  -3  -2  -1\n",
+    "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Splitting / separating statements" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python statements are typically separated by newlines (rather than, say, the semi-colon in Java). Statements can extend over more than one line; it is generally best to break the lines after commas, parentheses, braces or brackets. Inserting a backslash character ('\\\\') at the end of a line will also enable continuation of the statement on the next line, but it is generally best to look for other alternatives." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute_names = ['class', \n", + " 'cap-shape', 'cap-surface', 'cap-color', \n", + " 'bruises?', \n", + " 'odor', \n", + " 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', \n", + " 'stalk-shape', 'stalk-root', \n", + " 'stalk-surface-above-ring', 'stalk-surface-below-ring', \n", + " 'stalk-color-above-ring', 'stalk-color-below-ring',\n", + " 'veil-type', 'veil-color', \n", + " 'ring-number', 'ring-type', \n", + " 'spore-print-color', \n", + " 'population', \n", + " 'habitat']\n", + "print(attribute_names)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('a', 'b', 'c', # no '\\' needed when breaking after comma\n", + " 1, 2, 3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print( # no '\\' needed when breaking after parenthesis, brace or bracket\n", + " 'a', 'b', 'c',\n", + " 1, 2, 3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(1 + 2 \\\n", + " + 3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 
Processing strings & other sequences" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`str.strip([chars])`**](http://docs.python.org/2/library/stdtypes.html#str.strip) method returns a copy of `str` in which any leading or trailing `chars` are removed. If no `chars` are specified, it removes all leading and trailing whitespace. [*Whitespace* is any sequence of spaces, tabs (`'\\t'`) and/or newline (`'\\n'`) characters.] \n", + "\n", + "Note that since a blank space is inserted in the output after every item in a comma-delimited list, the second asterisk below is printed after a leading blank space is inserted on the new line." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('*', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n', '*')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('*', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'.strip(), '*')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A common programming pattern when dealing with CSV (comma-separated value) files, such as the mushroom dataset file mentioned above, is to repeatedly:\n", + "\n", + "1. read a line from a file\n", + "2. strip off any leading and trailing whitespace\n", + "3. split the values separated by commas into a list\n", + "\n", + "We will get to repetition control structures (loops) and file input and output shortly, but here is an example of how `str.strip()` and `str.split()` can be chained together in a single instruction for processing a line representing a single instance from the mushroom dataset file. 
Note that chained methods are executed in left-to-right order.\n", + "\n", + "*\\[Python provides a **[`csv`](https://docs.python.org/2/library/csv.html)** module to facilitate the processing of CSV files, but we will not use that module here\\]*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "single_instance_str = 'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'\n", + "print(single_instance_str)\n", + "# first strip leading & trailing whitespace, then split on commas\n", + "single_instance_list = single_instance_str.strip().split(',') \n", + "print(single_instance_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`str.join(words)`**](http://docs.python.org/2/library/string.html#string.join) method is the inverse of `str.split()`, returning a single string in which each string in the sequence of `words` is separated by `str`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_list)\n", + "print(','.join(single_instance_list))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A number of Python methods can be used on strings, lists and other sequences.\n", + "\n", + "The [**`len(s)`**](http://docs.python.org/2/library/functions.html#len) function can be used to find the length of (number of items in) a sequence `s`. It will also return the number of items in a *dictionary*, a data structure we will cover further below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(len(single_instance_str))\n", + "print(len(single_instance_list))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The **`in`** operator can be used to determine whether a sequence contains a value. 
\n", + "\n", + "Boolean values in Python are **`True`** and **`False`** (note the capitalization)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(',' in single_instance_str)\n", + "print(',' in single_instance_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`s.count(x)`**](http://docs.python.org/2/library/stdtypes.html#str.count) method can be used to count the number of occurrences of item `x` in sequence `s`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str.count(','))\n", + "print(single_instance_list.count('f'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`s.index(x)`**](http://docs.python.org/2/library/stdtypes.html#str.index) method can be used to find the first zero-based index of item `x` in sequence `s`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_str.index(','))\n", + "print(single_instance_list.index('f'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that a [`ValueError`](https://docs.python.org/2/library/exceptions.html#exceptions.ValueError) exception will be raised if item `x` is not found in sequence `s`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(single_instance_list.index(','))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Mutability" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One important distinction between strings and lists has to do with their [*mutability*](http://docs.python.org/2/reference/datamodel.html).\n", + "\n", + "Python strings are *immutable*, i.e., they cannot be modified. Most string methods (like `str.strip()`) return modified *copies* of the strings on which they are used.\n", + "\n", + "Python lists are *mutable*, i.e., they can be modified. \n", + "\n", + "The examples below illustrate a number of [`list`](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) methods that modify lists." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "list_1 = [4, 2, 3, 5, 1]\n", + "list_2 = list_1 # list_2 now references the same object as list_1\n", + "print('list_1: ', list_1)\n", + "print('list_2: ', list_2)\n", + "print()\n", + "\n", + "list_1.remove(1)\n", + "print('list_1.remove(1):', list_1)\n", + "print()\n", + "\n", + "list_1.append(6)\n", + "print('list_1.append(6):', list_1)\n", + "print()\n", + "\n", + "list_1.sort()\n", + "print('list_1.sort(): ', list_1)\n", + "print()\n", + "\n", + "list_1.reverse()\n", + "print('list_1.reverse():', list_1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When more than one name (e.g., a variable) is bound to the same mutable object, changes made to that object are reflected in all names bound to that object. For example, in the second statement above, `list_2` is bound to the same object that is bound to `list_1`. 
All changes made to the object bound to `list_1` will thus be reflected in `list_2` (since they both reference the same object)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('list_1: ', list_1)\n", + "print('list_2: ', list_2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are sorting and reversing functions, **[`sorted()`](https://docs.python.org/2.7/library/functions.html#sorted)** and **[`reversed()`](https://docs.python.org/2.7/library/functions.html#reversed)**, that do *not* modify their arguments, and can thus be used on mutable or immutable objects. \n", + "\n", + "Note that `sorted()` always returns a sorted *list* of each element in its argument, regardless of which type of sequence it is passed. Thus, invoking `sorted()` on a *string* returns a *list* of sorted characters from the string, rather than a sorted string." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('sorted(list_1):', sorted(list_1)) \n", + "print('list_1: ', list_1)\n", + "print()\n", + "print('sorted(single_instance_str):', sorted(single_instance_str)) \n", + "print('single_instance_str: ', single_instance_str)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `sorted()` function sorts its argument in ascending order by default. \n", + "\n", + "An optional ***[keyword argument](http://docs.python.org/2/tutorial/controlflow.html#keyword-arguments)***, `reverse`, can be used to sort in descending order. The default value of this optional parameter is `False`; to get non-default behavior of an optional argument, we must specify the name and value of the argument, in this case, `reverse=True`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(sorted(single_instance_str)) \n", + "print(sorted(single_instance_str, reverse=True))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Tuples (immutable list-like sequences)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A [*tuple*](http://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences) is an ordered, immutable sequence of 0 or more comma-delimited values enclosed in parentheses (`'('`, `')'`). Many of the functions and methods that operate on strings and lists also operate on tuples." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "x = (5, 4, 3, 2, 1) # a tuple\n", + "print('x =', x)\n", + "print('len(x) =', len(x))\n", + "print('x.index(3) =', x.index(3))\n", + "print('x[2:4] = ', x[2:4])\n", + "print('x[4:2:-1] = ', x[4:2:-1])\n", + "print('sorted(x):', sorted(x)) # note: sorted() always returns a list" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that the methods that modify lists (e.g., `append()`, `remove()`, `reverse()`, `sort()`) are not defined for immutable sequences such as tuples (or strings). Invoking one of these sequence modification methods on an immutable sequence will raise an [`AttributeError`](https://docs.python.org/2/library/exceptions.html#exceptions.AttributeError) exception." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "x.append(6)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "However, one can approximate these modifications by creating modified copies of an immutable sequence and then re-assigning it to a name." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "x = x + (6,) # need to include a comma to differentiate tuple from numeric expression\n", + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that Python has a **`+=`** operator which is a shortcut for the *`name = name + new_value`* pattern. This can be used for addition (e.g., `x += 1` is shorthand for `x = x + 1`) or concatenation (e.g., `x += (7,)` is shorthand for `x = x + (7,)`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "x += (7,)\n", + "x" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Conditionals" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One common approach to handling errors is to *look before you leap (LBYL)*, i.e., test for potential [exceptions](http://docs.python.org/2/tutorial/errors.html) before executing instructions that might raise those exceptions. 
\n", + "\n", + "This approach can be implemented using the [**`if`**](http://docs.python.org/2/tutorial/controlflow.html#if-statements) statement (which may optionally include an **`else`** and any number of **`elif`** clauses).\n", + "\n", + "The following is a simple example of an `if` statement:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "class_value = 'e' # try changing this to 'p' or 'x'\n", + "\n", + "if class_value == 'e':\n", + "    print('edible')\n", + "elif class_value == 'p':\n", + "    print('poisonous')\n", + "else:\n", + "    print('unknown')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that \n", + "\n", + "* a colon ('`:`') is used at the end of the lines with `if`, `else` or `elif`\n", + "* no parentheses are required to enclose the boolean condition (it is presumed to include everything between `if` or `elif` and the colon)\n", + "* the statements below each `if`, `elif` and `else` line are all indented\n", + "\n", + "Python does not have special characters to delimit statement blocks (like the '{' and '}' delimiters in Java); instead, sequences of statements with the same *indentation level* are treated as a statement block. The [Python Style Guide](http://legacy.python.org/dev/peps/pep-0008/) recommends using 4 spaces for each indentation level.\n", + "\n", + "An `if` statement can be used to follow the LBYL paradigm in preventing the `ValueError` that occurred in an earlier example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute = 'bruises' # try substituting 'bruises?' 
for 'bruises' and re-running this code\n", + "\n", + "if attribute in attribute_names:\n", + " i = attribute_names.index(attribute)\n", + " print(attribute, 'is in position', i)\n", + "else:\n", + " print(attribute, 'is not in', attribute_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Seeking forgiveness vs. asking for permission (EAFP vs. LBYL)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another perspective on handling errors championed by some pythonistas is that it is [*easier to ask forgiveness than permission (EAFP)*](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#eafp-vs-lbyl).\n", + "\n", + "As in many practical applications of philosophy, religion or dogma, it is helpful to *think before you choose (TBYC)*. There are a number of factors to consider in deciding whether to follow the EAFP or LBYL paradigm, including code readability and the anticipated likelihood and relative severity of encountering an exception. For those who are interested, Oran Looney wrote a blog post providing a nice overview of the debate over [LBYL vs. EAFP](http://oranlooney.com/lbyl-vs-eafp/).\n", + "\n", + "In keeping with practices most commonly used with other languages, we will follow the LBYL paradigm throughout most of this primer. \n", + "\n", + "However, as a brief illustration of the EAFP paradigm in Python, here is an alternate implementation of the functionality of the code above, using a [**`try/except`**](http://docs.python.org/2/tutorial/errors.html#handling-exceptions) statement." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute = 'bruises' # try substituting 'bruises?' 
for 'bruises' and re-running this code\n", + "\n", + "try:\n", + " i = attribute_names.index(attribute)\n", + " print(attribute, 'is in position', i)\n", + "except ValueError:\n", + " print(attribute, 'is not found')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python *null object* is **`None`** (note the capitalization)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute = 'bruises' # try substituting 'bruises?' for 'bruises' and re-running this code\n", + "\n", + "if attribute not in attribute_names: # equivalent to 'not attribute in attribute_names'\n", + " value = None\n", + "else:\n", + " i = attribute_names.index(attribute)\n", + " value = single_instance_list[i]\n", + " \n", + "print(attribute, '=', value)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Defining and calling functions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python [*function definitions*](http://docs.python.org/2/tutorial/controlflow.html#defining-functions) start with the **`def`** keyword followed by a function name, a list of 0 or more comma-delimited *parameters* (aka 'formal parameters') enclosed within parentheses, and then a colon ('`:`'). \n", + "\n", + "A function definition may include one or more [**`return`**](http://docs.python.org/2/reference/simple_stmts.html#the-return-statement) statements to indicate the value(s) returned to where the function is called. It is good practice to include a short [docstring](http://docs.python.org/2/tutorial/controlflow.html#tut-docstrings) to briefly describe the behavior of the function and the value(s) it returns." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "def attribute_value(instance, attribute, attribute_names):\n", + "    '''Returns the value of attribute in instance, based on its position in attribute_names'''\n", + "    if attribute not in attribute_names:\n", + "        return None\n", + "    else:\n", + "        i = attribute_names.index(attribute)\n", + "        return instance[i] # using the parameter name here" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A *function call* starts with the function name, followed by a list of 0 or more comma-delimited *arguments* (aka 'actual parameters') enclosed within parentheses. A function call can be used as a statement or within an expression." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute = 'cap-shape' # try substituting any of the other attribute names shown above\n", + "print(attribute, '=', attribute_value(single_instance_list, attribute, attribute_names))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that Python does not distinguish between names used for *variables* and names used for *functions*. An assignment statement binds a value to a name; a function definition also binds a value to a name. At any given time, the value most recently bound to a name is the one that is used. \n", + "\n", + "This can be demonstrated using the [**`type(object)`**](http://docs.python.org/2.7/library/functions.html#type) function, which returns the `type` of `object`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "x = 0\n", + "print('x used as a variable:', x, type(x))\n", + "\n", + "def x():\n", + " print('x')\n", + " \n", + "print('x used as a function:', x, type(x))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another way to determine the `type` of an object is to use [**`isinstance(object, class)`**](https://docs.python.org/2/library/functions.html#isinstance). This is generally [preferable](http://stackoverflow.com/questions/1549801/differences-between-isinstance-and-type-in-python), as it takes into account [class inheritance](https://docs.python.org/2/tutorial/classes.html#inheritance). There is a larger issue of [*duck typing*](https://en.wikipedia.org/wiki/Duck_typing), and whether code should ever explicitly check for the type of an object, but we will omit further discussion of the topic in this primer." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Call by sharing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An important feature of Python functions is that arguments are passed using [*call by sharing*](https://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_sharing). \n", + "\n", + "If a *mutable* object is passed as an argument to a function parameter, assignment statements using that parameter do not affect the passed argument, however other modifications to the parameter (e.g., modifications to a list using methods such as `append()`, `remove()`, `reverse()` or `sort()`) do affect the passed argument.\n", + "\n", + "Not being aware of - or forgetting - this important distinction can lead to challenging debugging sessions. 
\n", + "\n", + "The example below demonstrates this difference and introduces another [list method](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists), `list.insert(i, x)`, which inserts `x` into `list` at position `i`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def modify_parameters(parameter1, parameter2):\n", + "    '''Inserts \"x\" at the head of parameter1, assigns [7, 8, 9] to parameter2'''\n", + "    parameter1.insert(0, 'x') # insert() WILL affect argument passed as parameter1\n", + "    print('parameter1, after inserting \"x\":', parameter1)\n", + "    parameter2 = [7, 8, 9] # assignment WILL NOT affect argument passed as parameter2\n", + "    print('parameter2, after assigning [7, 8, 9]:', parameter2)\n", + "    return\n", + "\n", + "argument1 = [1, 2, 3] \n", + "argument2 = [4, 5, 6]\n", + "print('argument1, before calling modify_parameters:', argument1)\n", + "print('argument2, before calling modify_parameters:', argument2)\n", + "print()\n", + "modify_parameters(argument1, argument2)\n", + "print()\n", + "print('argument1, after calling modify_parameters:', argument1)\n", + "print('argument2, after calling modify_parameters:', argument2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One way of preventing functions from modifying mutable objects passed as parameters is to make a copy of those objects inside the function. Here is another version of the function above that makes a shallow copy of the list parameter using the slice operator. 
\n", + "\n", + "*\\[Note: the Python [copy](http://docs.python.org/2/library/copy.html) module provides both [shallow] [`copy()`](http://docs.python.org/2/library/copy.html#copy.copy) and [`deepcopy()`](http://docs.python.org/2/library/copy.html#copy.deepcopy) methods; we will cover modules further below.\\]*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def modify_parameter_copy(parameter_1):\n", + " '''Inserts \"x\" at the head of parameter_1, without modifying the list argument'''\n", + " parameter_1_copy = parameter_1[:] # list[:] returns a copy of list\n", + " parameter_1_copy.insert(0, 'x')\n", + " print('Inserted \"x\":', parameter_1_copy)\n", + " return\n", + "\n", + "argument_1 = [1, 2, 3] # passing a named object will not affect the object bound to that name\n", + "print('Before:', argument_1)\n", + "modify_parameter_copy(argument_1)\n", + "print('After:', argument_1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another way to avoid modifying parameters is to use assignment statements which do not modify the parameter objects but return a new object that is bound to the name (locally)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def modify_parameter_assignment(parameter_1):\n", + " '''Inserts \"x\" at the head of parameter_1, without modifying the list argument'''\n", + " parameter_1 = ['x'] + parameter_1 # using assignment rather than list.insert()\n", + " print('Inserted \"x\":', parameter_1)\n", + " return\n", + "\n", + "argument_1 = [1, 2, 3] # passing a named object will not affect the object bound to that name\n", + "print('Before:', argument_1)\n", + "modify_parameter_assignment(argument_1)\n", + "print('After:', argument_1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Multiple return values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python functions can return more than one value by separating those return values with commas in the **return** statement. Multiple values are returned as a tuple. \n", + "\n", + "If the function-invoking expression is an assignment statement, multiple variables can be assigned the multiple values returned by the function in a single statement. This combining of values and subsequent separation is known as tuple ***packing*** and ***unpacking***." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def min_and_max(list_of_values):\n", + " '''Returns a tuple containing the min and max values in the list_of_values'''\n", + " return min(list_of_values), max(list_of_values)\n", + "\n", + "list_1 = [3, 1, 4, 2, 5]\n", + "print('min and max of', list_1, ':', min_and_max(list_1))\n", + "\n", + "# a single variable is assigned the two-element tuple\n", + "min_and_max_list_1 = min_and_max(list_1) \n", + "print('min and max of', list_1, ':', min_and_max_list_1)\n", + "\n", + "# the 1st variable is assigned the 1st value, the 2nd variable is assigned the 2nd value\n", + "min_list_1, max_list_1 = min_and_max(list_1) \n", + "print('min and max of', list_1, ':', min_list_1, ',', max_list_1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Iteration: for, range" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`for`**](http://docs.python.org/2/tutorial/controlflow.html#for-statements) statement iterates over the elements of a sequence or other [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for i in [0, 1, 2]:\n", + " print(i)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for c in 'abc':\n", + " print(c)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Python 2, the [**`range(stop)`**](http://docs.python.org/2/tutorial/controlflow.html#the-range-function) function returns a list of values from 0 up to `stop - 1` (inclusive). It is often used in the context of a `for` loop that iterates over the list of values." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('Values for the', len(attribute_names), 'attributes:', end='\\n\\n') # adds a blank line\n", + "for i in range(len(attribute_names)):\n", + "    print(attribute_names[i], '=', \n", + "          attribute_value(single_instance_list, attribute_names[i], attribute_names))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The more general form of the function, [**`range(start, stop[, step])`**](http://docs.python.org/2/library/functions.html#range), returns a list of values from `start` up to, but not including, `stop`, increasing by `step` (which defaults to `1`), or from `start` down to, but not including, `stop`, decreasing by `step` if `step` is negative." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for i in range(3, 0, -1):\n", + "    print(i)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Python 2, the [**`xrange([start,] stop[, step])`**](http://docs.python.org/2/library/functions.html#xrange) function is an [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) version of the `range()` function. In the context of a `for` loop, it returns the *next* item of the sequence for each iteration of the loop rather than creating *all* the elements of the sequence before the first iteration. This can reduce memory consumption in cases where iteration over all the items is not required.\n", + "\n", + "In Python 3, the `range()` function behaves the same way as the `xrange()` function does in Python 2, and so the `xrange()` function is not defined in Python 3. \n", + "\n", + "To maximize compatibility, we will use `range()` throughout this notebook; however, note that it is generally more efficient to use `xrange()` rather than `range()` in Python 2."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Modules, namespaces and dotted notation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A Python [***module***](http://docs.python.org/2/tutorial/modules.html) is a file containing related definitions (e.g., of functions and variables). Modules are used to help organize a Python [***namespace***](http://docs.python.org/2/tutorial/classes.html#python-scopes-and-namespaces), the set of identifiers accessible in a particular context. All of the functions and variables we define in this IPython Notebook are in the `__main__` namespace, so accessing them does not require any specification of a module.\n", + "\n", + "A Python module named **`simple_ml`** (in the file `simple_ml.py`), contains a set of solutions to the exercises in this IPython Notebook. *\\[The learning opportunity provided by this primer will be maximized by not looking at that file, or waiting as long as possible to do so.\\]*\n", + "\n", + "Accessing functions in an external module requires that we first **[`import`](http://docs.python.org/2/reference/simple_stmts.html#the-import-statement)** the module, and then prefix the function names with the module name followed by a dot (this is known as ***dotted notation***).\n", + "\n", + "For example, the following function call in Exercise 1 below: \n", + "\n", + "`simple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)`\n", + "\n", + "uses dotted notation to reference the `print_attribute_names_and_values()` function in the `simple_ml` module.\n", + "\n", + "After you have defined your own function for Exercise 1, you can test your function by deleting the `simple_ml` module specification, so that the statement becomes\n", + "\n", + "`print_attribute_names_and_values(single_instance_list, attribute_names)`\n", + "\n", + "This will reference the `print_attribute_names_and_values()` function in the current namespace 
(`__main__`), i.e., the top-level interpreter environment. The `simple_ml.print_attribute_names_and_values()` function will still be accessible in the `simple_ml` namespace by using the \"`simple_ml.`\" prefix (so you can easily toggle back and forth between your own definition and that provided in the solutions file)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 1: define `print_attribute_names_and_values()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Complete the following function definition, `print_attribute_names_and_values(instance, attribute_names)`, so that it generates exactly the same output as the code above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def print_attribute_names_and_values(instance, attribute_names):\n", + " '''Prints the attribute names and values for an instance'''\n", + " # your code goes here\n", + " return\n", + "\n", + "import simple_ml # this module contains my solutions to exercises\n", + "\n", + "# delete 'simple_ml.' in the function call below to test your function\n", + "simple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### File I/O" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python [file input and output](http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files) is done through [file](http://docs.python.org/2/library/stdtypes.html#file-objects) objects. 
A file object is created with the [`open(name[, mode])`](http://docs.python.org/2/library/functions.html#open) function, where `name` is a string representing the name of the file, and `mode` is `'r'` (read), `'w'` (write) or `'a'` (append); if no second argument is provided, the mode defaults to `'r'`.\n", + "\n", + "A common Python programming pattern for processing an input text file is to \n", + "\n", + "* [**`open`**](http://docs.python.org/2/library/functions.html#open) the file using a [**`with`**](http://docs.python.org/2/reference/compound_stmts.html#the-with-statement) statement (which will automatically [**`close`**](http://docs.python.org/2/library/stdtypes.html#file.close) the file after the statements inside the `with` block have been executed)\n", + "* iterate over each line in the file using a **`for`** statement\n", + "\n", + "The following code creates a list of instances, where each instance is a list of attribute values (like `instance_1_str` above). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "all_instances = [] # initialize instances to an empty list\n", + "data_filename = 'agaricus-lepiota.data'\n", + "\n", + "with open(data_filename, 'r') as f:\n", + "    for line in f: # 'line' will be bound to the next line in f in each for loop iteration\n", + "        all_instances.append(line.strip().split(','))\n", + " \n", + "print('Read', len(all_instances), 'instances from', data_filename)\n", + "# we don't want to print all the instances, so we'll just print the first one to verify\n", + "print('First instance:', all_instances[0]) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 2: define load_instances()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `load_instances(filename)`, that returns a list of instances in a text file. The function definition is started for you below. 
The function should exhibit the same behavior as the code above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def load_instances(filename):\n", + " '''Returns a list of instances stored in a file.\n", + " \n", + " filename is expected to have a series of comma-separated attribute values per line, e.g.,\n", + " p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", + " '''\n", + " instances = []\n", + " # your code goes here\n", + " return instances\n", + "\n", + "data_filename = 'agaricus-lepiota.data'\n", + "# delete 'simple_ml.' in the function call below to test your function\n", + "all_instances_2 = simple_ml.load_instances(data_filename)\n", + "print('Read', len(all_instances_2), 'instances from', data_filename)\n", + "print('First instance:', all_instances_2[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Output can be written to a text file via the [**`file.write(str)`**](http://docs.python.org/2/library/stdtypes.html#file.write) method.\n", + "\n", + "As we saw earlier, the [`str.join(words)`](http://docs.python.org/2/library/stdtypes.html#str.join) method returns a single `str`-delimited string containing each of the strings in the `words` list.\n", + "\n", + "SQL and Hive database tables sometimes use a pipe ('|') delimiter to separate column values for each row when they are stored as flat files. The following code creates a new data file using pipes rather than commas to separate the attribute values.\n", + "\n", + "To help maintain internal consistency, it is generally a good practice to define a variable such as `DELIMITER` or `SEPARATOR`, bind it to the intended delimiter string, and then use it as a named constant. The Python language does not support named constants, so the use of variables as named constants depends on conventions (e.g., using ALL-CAPS)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "DELIMITER = '|'\n", + "\n", + "print('Converting to {}-delimited strings, e.g.,'.format(DELIMITER), \n", + "      DELIMITER.join(all_instances[0]))\n", + "\n", + "datafile2 = 'agaricus-lepiota-2.data'\n", + "with open(datafile2, 'w') as f: # 'w' = open file for writing (output)\n", + "    for instance in all_instances:\n", + "        f.write(DELIMITER.join(instance) + '\\n') # write each instance on a separate line\n", + "\n", + "all_instances_3 = []\n", + "with open(datafile2, 'r') as f:\n", + "    for line in f:\n", + "        all_instances_3.append(line.strip().split(DELIMITER)) # note: changed ',' to '|'\n", + " \n", + "print('Read', len(all_instances_3), 'instances from', datafile2)\n", + "print('First instance:', all_instances_3[0]) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### List comprehensions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python provides a powerful [*list comprehension*](http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions) construct to simplify the creation of a list by specifying a formula in a single expression.\n", + "\n", + "Some programmers find list comprehensions confusing, and avoid their use. We won't rely on list comprehensions here, but we will offer several examples with and without list comprehensions to highlight the power of the construct.\n", + "\n", + "One common use of list comprehensions is in the context of the [`str.join(words)`](http://docs.python.org/2/library/string.html#string.join) method we saw earlier.\n", + "\n", + "If we wanted to construct a pipe-delimited string containing elements of the list, we could use a `for` loop to iteratively add list elements and pipe delimiters to a string for all but the last element, and then manually add the last element."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# create pipe-delimited string without using list comprehension\n", + "DELIMITER = '|'\n", + "delimited_string = ''\n", + "token_list = ['a', 'b', 'c']\n", + "\n", + "for token in token_list[:-1]: # add all but the last token + DELIMITER\n", + " delimited_string += token + DELIMITER\n", + "delimited_string += token_list[-1] # add the last token (with no trailing DELIMITER)\n", + "delimited_string" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This process is much simpler using a list comprehension." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "delimited_string = DELIMITER.join([token for token in token_list])\n", + "delimited_string" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Missing values & \"clean\" instances" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As noted in the initial description of the UCI mushroom set above, 2480 of the 8124 instances have missing attribute values (denoted by `'?'`). \n", + "\n", + "There are several techniques for dealing with instances that include missing attribute values, but to simplify things in the context of this primer - and following the example in the [Data Science for Business](http://www.data-science-for-biz.com/) book - we will simply ignore any such instances and restrict our focus to only the *clean* instances (with no missing values).\n", + "\n", + "We could use several lines of code - with an `if` statement inside a `for` loop - to create a `clean_instances` list from the `all_instances` list. Or we could use a list comprehension that includes an `if` statement.\n", + "\n", + "We will show both approaches to creating `clean_instances` below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# version 1: using an if statement nested within a for statement\n", + "UNKNOWN_VALUE = '?'\n", + "\n", + "clean_instances = []\n", + "for instance in all_instances:\n", + "    if UNKNOWN_VALUE not in instance:\n", + "        clean_instances.append(instance)\n", + " \n", + "print(len(clean_instances), 'clean instances')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# version 2: using an equivalent list comprehension\n", + "clean_instances = [instance\n", + "                   for instance in all_instances\n", + "                   if UNKNOWN_VALUE not in instance]\n", + "\n", + "print(len(clean_instances), 'clean instances')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that line breaks can be used before a `for` or `if` keyword in a list comprehension." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dictionaries (dicts)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Although single character abbreviations of attribute values (e.g., 'x') allow for more compact data files, they are not as easy to understand by human readers as the longer attribute value descriptions (e.g., 'convex').\n", + "\n", + "A Python [dictionary (or **`dict`**)](http://docs.python.org/2/tutorial/datastructures.html#dictionaries) is an unordered, comma-delimited collection of ***key: value*** pairs, serving a similar function to a hash table or hashmap in other programming languages.\n", + "\n", + "We could create a dictionary for the `cap-type` attribute values shown above:\n", + "\n", + "> bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", + "\n", + "Since we will want to look up the value using the abbreviation (which is the representation of the value stored in the file), we will use the abbreviations 
as *keys* and the descriptions as *values*.\n", + "\n", + "A Python dictionary can be created by specifying all `key: value` pairs (with colons separating each *key* and *value*), or by adding them iteratively. We will show the first method in the cell below, and use the second method in a subsequent cell. \n", + "\n", + "Note that a *value* in a Python dictionary (`dict`) can be accessed by specifying its *key* using the general form `dict[key]` (or `dict.get(key[, default])`, which allows the specification of a `default` value to use if `key` is not in `dict`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute_values_cap_type = {'b': 'bell', \n", + "                             'c': 'conical', \n", + "                             'x': 'convex', \n", + "                             'f': 'flat', \n", + "                             'k': 'knobbed', \n", + "                             's': 'sunken'}\n", + "\n", + "attribute_value_abbrev = 'x'\n", + "print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A Python dictionary is an *iterable* container, so we can iterate over the keys in a dictionary using a `for` loop.\n", + "\n", + "Note that since a dictionary is an *unordered* collection, the sequence of abbreviations and associated values is not guaranteed to appear in any particular order. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for attribute_value_abbrev in attribute_values_cap_type:\n", + " print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python supports *dictionary comprehensions*, which have a similar form as the *list comprehensions* described above, except that both a key and a value have to be specified for each iteration.\n", + "\n", + "For example, if we provisionally omit the 'convex' cap-type (whose abbreviation is the last letter rather than first letter in the attribute name), we could construct a dictionary of abbreviations and descriptions using the following expression." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute_values_cap_type_2 = {x[0]: x \n", + " for x in ['bell', 'conical', 'flat', 'knobbed', 'sunken']}\n", + "print(attribute_values_cap_type_2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While it's useful to have a dictionary of values for the `cap-type` attribute, it would be even more useful to have a dictionary of values for *every* attribute. Earlier, we created a list of `attribute_names`; we will now expand this to create a list of `attribute_values` wherein each list element is a dictionary.\n", + "\n", + "Rather than explicitly type in each dictionary entry in the Python interpreter, we'll define a function to read a file containing the list of attribute names, values and value abbreviations in the format shown above:\n", + "\n", + "* class: edible=e, poisonous=p\n", + "* cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", + "* cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s\n", + "* ..." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def load_attribute_values(filename):\n", + " '''Returns a list of attribute values in a file.\n", + " \n", + " The attribute values are represented as dictionaries, \n", + " wherein the keys are abbreviations and the values are descriptions.\n", + " filename is expected to have one attribute name and set of values per line, \n", + " with the following format:\n", + " name: value_description=value_abbreviation[,value_description=value_abbreviation]*\n", + " For example\n", + " cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", + " The attribute value description dictionary created from this line would be the following:\n", + " {'c': 'conical', \n", + " 'b': 'bell', \n", + " 'f': 'flat', \n", + " 'k': 'knobbed',\n", + " 's': 'sunken', \n", + " 'x': 'convex'}\n", + " '''\n", + " attribute_value_descriptions = []\n", + " with open(filename) as f:\n", + " for line in f:\n", + " attr_name_and_values = line.strip().split(':')\n", + " attr_name = attr_name_and_values[0]\n", + " if len(attr_name_and_values) < 2:\n", + " attribute_value_descriptions.append({}) # no values for this attribute\n", + " else:\n", + " abbrev_desc_dict = {}\n", + " desc_and_abbrev_list = attr_name_and_values[1].strip().split(',')\n", + " for desc_and_abbrev_str in desc_and_abbrev_list:\n", + " desc_and_abbrev = desc_and_abbrev_str.strip().split('=')\n", + " # simplifying assumption: no more than 1 value is missing an abbreviation\n", + " desc = desc_and_abbrev[0]\n", + " if len(desc_and_abbrev) < 2: \n", + " abbrev_desc_dict[None] = desc\n", + " else:\n", + " abbrev = desc_and_abbrev[1]\n", + " abbrev_desc_dict[abbrev] = desc\n", + " attribute_value_descriptions.append(abbrev_desc_dict)\n", + " return attribute_value_descriptions\n", + "\n", + "attribute_filename = 'agaricus-lepiota.attributes'\n", + "attribute_values = 
load_attribute_values(attribute_filename)\n", + "print('Read', len(attribute_values), 'attribute values from', attribute_filename)\n", + "print('First attribute values list:', attribute_values[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 3: define `load_attribute_values()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We earlier created the `attribute_names` list manually. The `load_attribute_values()` function above creates the `attribute_values` list from the contents of a file, each line of which starts with the name of an attribute. Unfortunately, the function discards the name of each attribute.\n", + "\n", + "It would be nice to retain the name as well as the value abbreviations and descriptions. One way to do this would be to create a list of dictionaries, in which each dictionary has 2 keys, a `name`, the value of which is the attribute name (a string), and `values`, the value of which is yet another dictionary (with abbreviation keys and description values, as in `load_attribute_values()`).\n", + "\n", + "Complete the following function definition so that the code implements this functionality." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def load_attribute_names_and_values(filename):\n", + " '''Returns a list of attribute names and values in a file.\n", + " \n", + " This list contains dictionaries wherein the keys are names \n", + " and the values are value description dictionaries.\n", + " \n", + " Each value description sub-dictionary will use \n", + " the attribute value abbreviations as its keys \n", + " and the attribute descriptions as the values.\n", + " \n", + " filename is expected to have one attribute name and set of values per line, \n", + " with the following format:\n", + " name: value_description=value_abbreviation[,value_description=value_abbreviation]*\n", + " for example\n", + " cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", + " The attribute name and values dictionary created from this line would be the following:\n", + " {'name': 'cap-shape', \n", + " 'values': {'c': 'conical', \n", + " 'b': 'bell', \n", + " 'f': 'flat', \n", + " 'k': 'knobbed', \n", + " 's': 'sunken', \n", + " 'x': 'convex'}}\n", + " '''\n", + " attribute_names_and_values = [] # this will be a list of dicts\n", + " # your code goes here\n", + " return attribute_names_and_values\n", + "\n", + "attribute_filename = 'agaricus-lepiota.attributes'\n", + "# delete 'simple_ml.' 
in the function call below to test your function\n", + "attribute_names_and_values = simple_ml.load_attribute_names_and_values(attribute_filename)\n", + "print('Read', len(attribute_names_and_values), 'attribute values from', attribute_filename)\n", + "print('First attribute name:', attribute_names_and_values[0]['name'], \n", + " '; values:', attribute_names_and_values[0]['values'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Counters" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Data scientists often need to count things. For example, we might want to count the numbers of edible and poisonous mushrooms in the *clean_instances* list we created earlier." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "edible_count = 0\n", + "for instance in clean_instances:\n", + " if instance[0] == 'e':\n", + " edible_count += 1 # this is shorthand for edible_count = edible_count + 1\n", + "\n", + "print('There are', edible_count, 'edible mushrooms among the', \n", + " len(clean_instances), 'clean instances')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "More generally, we often want to count the number of occurrences (frequencies) of each possible value for an attribute. One way to do so is to create a dictionary where each dictionary key is an attribute value and each dictionary value is the count of instances with that attribute value.\n", + "\n", + "Using an ordinary dictionary, we must be careful to create a new dictionary entry the first time we see a new attribute value (that is not already contained in the dictionary)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "cap_state_value_counts = {}\n", + "for instance in clean_instances:\n", + " cap_state_value = instance[1] # cap-state is the 2nd attribute\n", + " if cap_state_value not in cap_state_value_counts:\n", + " # first occurrence, must explicitly initialize counter for this cap_state_value\n", + " cap_state_value_counts[cap_state_value] = 0\n", + " cap_state_value_counts[cap_state_value] += 1\n", + "\n", + "print('Counts for each value of cap-state:')\n", + "for value in cap_state_value_counts:\n", + " print(value, ':', cap_state_value_counts[value])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python [**`collections`**](http://docs.python.org/2/library/collections.html) module provides a number of high performance container datatypes. A frequently useful datatype is a [**`Counter`**](http://docs.python.org/2/library/collections.html#collections.Counter), a specialized dictionary in which each *key* is a unique element found in a list or some other container, and each *value* is the number of occurrences of that element in the source container. The default value for each newly created key is zero.\n", + "\n", + "A `Counter` includes a method, [**`most_common([n])`**](http://docs.python.org/2/library/collections.html#collections.Counter.most_common), that returns a list of 2-element tuples representing the values and their associated counts for the most common `n` values in descending order of the counts; if `n` is omitted, the method returns all tuples.\n", + "\n", + "Note that we can either use\n", + "\n", + "`import collections`\n", + "\n", + "and then use `collections.Counter()` in our code, or use\n", + "\n", + "`from collections import Counter`\n", + "\n", + "and then use `Counter()` (with no module specification) in our code." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from collections import Counter\n", + "\n", + "cap_state_value_counts = Counter()\n", + "for instance in clean_instances:\n", + " cap_state_value = instance[1]\n", + " # no need to explicitly initialize counters for cap_state_value; all start at zero\n", + " cap_state_value_counts[cap_state_value] += 1\n", + "\n", + "print('Counts for each value of cap-state:')\n", + "for value in cap_state_value_counts:\n", + " print(value, ':', cap_state_value_counts[value])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When a `Counter` object is instantiated with a list of items, it returns a dictionary-like container in which the *keys* are the unique items in the list, and the *values* are the counts of each unique item in that list. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "counts = Counter(['a', 'b', 'c', 'a', 'b', 'a'])\n", + "print(counts)\n", + "print(counts.most_common())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This allows us to count the number of values for `cap-state` in a very compact way." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "cap_state_value_counts = Counter([instance[1] for instance in clean_instances])\n", + "\n", + "print('Counts for each value of cap-state:')\n", + "for value in cap_state_value_counts:\n", + " print(value, ':', cap_state_value_counts[value])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 4: define `attribute_value_counts()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `attribute_value_counts(instances, attribute, attribute_names)`, that returns a `Counter` containing the counts of occurrences of each value of `attribute` in the list of `instances`. `attribute_names` is the list we created above, where each element is the name of an attribute.\n", + "\n", + "This exercise is designed to generalize the solution shown in the code directly above (which handles only the `cap-state` attribute)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# your definition goes here\n", + "\n", + "attribute = 'cap-shape'\n", + "# delete 'simple_ml.' in the function call below to test your function\n", + "attribute_value_counts = simple_ml.attribute_value_counts(clean_instances, \n", + " attribute, \n", + " attribute_names)\n", + "\n", + "print('Counts for each value of', attribute, ':')\n", + "for value in attribute_value_counts:\n", + " print(value, ':', attribute_value_counts[value])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### More on sorting" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Earlier, we saw that there is a `list.sort()` method that will sort a list in-place, i.e., by replacing the original value of `list` with a sorted version of the elements in `list`. 
\n", + "\n", + "We also saw that the [**`sorted(iterable[, cmp[, key[, reverse]]])`**](http://docs.python.org/2/library/functions.html#sorted) function can be used to return a *copy* of a list, dictionary or any other [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) container it is passed, in ascending order." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "original_list = [3, 1, 4, 2, 5]\n", + "sorted_list = sorted(original_list)\n", + "\n", + "print(original_list)\n", + "print(sorted_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`sorted()` can also be used with dictionaries (it returns a sorted list of the dictionary *keys*)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print(sorted(attribute_values_cap_type))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use the sorted *keys* to access the *values* of a dictionary in ascending order of the keys." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for attribute_value_abbrev in sorted(attribute_values_cap_type):\n", + " print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute = 'cap-shape'\n", + "attribute_value_counts = simple_ml.attribute_value_counts(clean_instances, \n", + " attribute, \n", + " attribute_names)\n", + "\n", + "print('Counts for each value of', attribute, ':')\n", + "for value in sorted(attribute_value_counts):\n", + " print(value, ':', attribute_value_counts[value])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Sorting a dictionary by values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is often useful to sort a dictionary by its *values* rather than its *keys*. \n", + "\n", + "For example, when we printed out the counts of the attribute values for `cap-shape` above, the counts appeared in ascending alphabetical order of the attribute value abbreviations. It is often more helpful to show the attribute value counts in descending order of the counts (which are the values in that dictionary).\n", + "\n", + "There are a [variety of ways to sort a dictionary by values](http://writeonly.wordpress.com/2008/08/30/sorting-dictionaries-by-value-in-python-improved/), but the approach described in [PEP 265](http://legacy.python.org/dev/peps/pep-0265/) is generally considered the most efficient.\n", + "\n", + "In order to understand the components used in this approach, we will revisit and elaborate on a few concepts involving *dictionaries*, *iterators* and *modules*." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`dict.items()`**](http://docs.python.org/2/library/stdtypes.html#dict.items) method returns an unordered list of `(key, value)` tuples in `dict`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute_values_cap_type.items()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Python 2, a related method, [**`dict.iteritems()`**](http://docs.python.org/2/library/stdtypes.html#dict.iteritems), returns an [**`iterator`**](http://docs.python.org/2/library/stdtypes.html#iterator-types): an object that returns the *next* item in a sequence each time one is requested (e.g., during each iteration of a for loop), which can be more efficient than generating *all* the items in the sequence before any are used ... and so should be used rather than `items()` wherever possible.\n", + "\n", + "This is similar to the distinction between `xrange()` and `range()` described above ... and, also similarly, `dict.items()` in Python 3 returns a lazily evaluated view rather than a list, and so `dict.iteritems()` is no longer needed (nor defined) ... and further similarly, we will use only `dict.items()` in this notebook, but it is generally more efficient to use `dict.iteritems()` in Python 2." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for key, value in attribute_values_cap_type.items():\n", + " print(key, ':', value)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Python [**`operator`**](http://docs.python.org/2/library/operator.html) module contains a number of functions that perform object comparisons, logical operations, mathematical operations, sequence operations, and abstract type tests.\n", + "\n", + "To facilitate sorting a dictionary by values, we will use the [**`operator.itemgetter(i)`**](http://docs.python.org/2/library/operator.html#operator.itemgetter) function that can be used to retrieve the `i`th value in a tuple (such as a `(key, value)` pair returned by `[iter]items()`).\n", + "\n", + "We can use `operator.itemgetter(1)` to reference the *value* - the 2nd item in each `(key, value)` tuple (at zero-based index position 1) - rather than the *key* - the first item in each `(key, value)` tuple (at index position 0).\n", + "\n", + "We will use the optional keyword argument **`key`** in [`sorted(iterable[, cmp[, key[, reverse]]])`](http://docs.python.org/2/library/functions.html#sorted) to specify a *sorting* key that is not the same as the `dict` key (recall that the `dict` key is the default *sorting* key for `sorted()`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import operator\n", + "\n", + "sorted(attribute_values_cap_type.items(), \n", + " key=operator.itemgetter(1))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now sort the counts of attribute values in descending frequency of occurrence, and print them out using tuple unpacking." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "attribute = 'cap-shape'\n", + "value_counts = simple_ml.attribute_value_counts(clean_instances, \n", + " attribute, \n", + " attribute_names)\n", + "\n", + "print('Counts for each value of', attribute, '(sorted by count):')\n", + "for value, count in sorted(value_counts.items(), \n", + " key=operator.itemgetter(1), \n", + " reverse=True):\n", + " print(value, ':', count)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that this example is rather contrived, as it is generally easiest to use a `Counter` and its associated `most_common()` method when sorting a dictionary wherein the values are all counts. The need to sort other kinds of dictionaries by their values is rather common. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### String formatting" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is often helpful to use [fancier output formatting](http://docs.python.org/2/tutorial/inputoutput.html#fancier-output-formatting) than simply printing comma-delimited lists of items. \n", + "\n", + "Examples of the **[`str.format()`](https://docs.python.org/2/library/stdtypes.html#str.format)** method used in conjunction with `print` are shown below. \n", + "\n", + "More details can be found in the Python documentation on [format string syntax](http://docs.python.org/2/library/string.html#format-string-syntax)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('{:5.3f}'.format(0.1)) # fieldwidth = 5; precision = 3; f = float\n", + "print('{:7.3f}'.format(0.1)) # if fieldwidth is larger than needed, left pad with spaces\n", + "print('{:07.3f}'.format(0.1)) # use leading zero to left pad with leading zeros\n", + "print('{:3d}'.format(1)) # d = int\n", + "print('{:03d}'.format(1))\n", + "print('{:10s}'.format('hello')) # s = string, left-justified\n", + "print('{:>10s}'.format('hello')) # use '>' to right-justify within fieldwidth" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following example illustrates the use of `str.format()` on data associated with the mushroom dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('class: {} = {} ({:5.3f}), {} = {} ({:5.3f})'.format(\n", + " 'e', 3488, 3488 / 5644, \n", + " 'p', 2156, 2156 / 5644), end=' ')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following variation - splitting off the printing of the attribute name from the printing of the values and counts of values for that attribute - may be more useful in developing a solution to the following exercise." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('class:', end=' ') # keeps cursor on the same line for subsequent print statements\n", + "print('{} = {} ({:5.3f}),'.format('e', 3488, 3488 / 5644), end=' ')\n", + "print('{} = {} ({:5.3f})'.format('p', 2156, 2156 / 5644), end=' ')\n", + "print() # advance the cursor to the beginning of the next line" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 5: define `print_all_attribute_value_counts()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `print_all_attribute_value_counts(instances, attribute_names)`, that prints each attribute name in `attribute_names`, and then for each attribute value, prints the value abbreviation, the count of occurrences of that value and the proportion of instances that have that attribute value." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# your function definition goes here\n", + "\n", + "print('\\nCounts for all attributes and values:\\n')\n", + "# delete 'simple_ml.' in the function call below to test your function\n", + "simple_ml.print_all_attribute_value_counts(clean_instances, attribute_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Navigation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notebooks in this primer:\n", + "\n", + "* [1. Introduction](1_Introduction.ipynb)\n", + "* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", + "* **3. Python: Basic Concepts** (*you are here*)\n", + "* [4. Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", + "* [5. 
Next Steps](5_Next_Steps.ipynb)" + ] + } + ], "metadata": { - "name": "", - "signature": "sha256:d2083ca2409e110c6168801712cb39ed49de3290dc773ba2f2e6e3ed408b3e06" - }, - "nbformat": 3, - "nbformat_minor": 0, - "worksheets": [ - { - "cells": [ - { - "cell_type": "heading", - "level": 1, - "metadata": {}, - "source": [ - "Python for Data Science" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Joe McCarthy](http://interrelativity.com/joe), \n", - "*Director, Analytics & Data Science*, [Atigeo, LLC](http://atigeo.com)" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "from IPython.display import display, Image, HTML" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 1 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Navigation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notebooks in this primer:\n", - "\n", - "* [1. Introduction](1_Introduction.ipynb)\n", - "* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", - "* **3. Python: Basic Concepts** (*you are here*)\n", - "* [4. Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", - "* [5. Next Steps](5_Next_Steps.ipynb)" - ] - }, - { - "cell_type": "heading", - "level": 2, - "metadata": {}, - "source": [ - "3. Python: Basic Concepts" - ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Identifiers, strings, lists and tuples" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The sample instance shown above can be represented as a string. A Python *string* ([`str`](http://docs.python.org/2/tutorial/introduction.html#strings)) is a sequence of 0 or more characters enclosed within a pair of single quotes (`'`) or a pair double quotes (`\"`). 
" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "metadata": {}, - "output_type": "pyout", - "prompt_number": 2, - "text": [ - "'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'" - ] - } - ], - "prompt_number": 2 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Python [*identifiers*](http://docs.python.org/2/reference/lexical_analysis.html#identifiers) (or [*names*](https://docs.python.org/2/reference/executionmodel.html#naming-and-binding)) are composed of letters, numbers and/or underscores ('`_`'), starting with a letter or underscore. Python identifiers are case sensitive. Although camelCase identifiers can be used, it is generally considered more [pythonic](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html) to use underscores. Python variables and functions typically start with lowercase letters; Python classes start with uppercase letters.\n", - "\n", - "The following [assignment statement](http://docs.python.org/2/reference/simple_stmts.html#assignment-statements) binds the value of the string shown above to the name `single_instance_str`." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "single_instance_str = 'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 3 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [**`print`**](http://docs.python.org/2/tutorial/inputoutput.html) statement writes the value of its comma-delimited arguments to [`sys.stdout`](http://docs.python.org/2/library/sys.html#sys.stdout) (typically the console). Each value in the output is separated by a single blank space. If the last argument is followed by a comma, the output cursor will stay on the same line." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print 'Instance 1:', single_instance_str\n", - "print 'A', 'B', # note comma at the end\n", - "print 'C' # will appear on same line" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Instance 1: p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", - "A B C\n" - ] - } - ], - "prompt_number": 4 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The Python comment character is `'#'`: anything after `'#'` on the line is ignored by the Python interpreter. \n", - "\n", - "Pairs of triple quotes (`'''` or `\"\"\"`) can be used to delimit multi-line comments." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "'''\n", - "A multi-line\n", - "comment\n", - "'''\n", - "print 'no comment'" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "no comment\n" - ] - } - ], - "prompt_number": 5 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A [`list`](http://docs.python.org/2/tutorial/introduction.html#lists) is an ordered sequence of 0 or more comma-delimited elements enclosed within square brackets ('`[`', '`]`'). The Python [`str.split(sep)`](http://docs.python.org/2/library/stdtypes.html#str.split) method can be used to split a `sep`-delimited string into a corresponding list of elements." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "single_instance_list = single_instance_str.split(',')\n", - "print single_instance_list" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']\n" - ] - } - ], - "prompt_number": 6 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Python sequences are *heterogeneous*, i.e., they can contain elements of different types." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "mixed_list = ['a', 1, 2.3, True, [1, 'b']]\n", - "print mixed_list" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "['a', 1, 2.3, True, [1, 'b']]\n" - ] - } - ], - "prompt_number": 7 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The Python **`+`** operator can be used to concatenate lists." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "concatenated_list = ['a', 1] + [2.3, True] + [[1, 'b']]\n", - "print concatenated_list" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "['a', 1, 2.3, True, [1, 'b']]\n" - ] - } - ], - "prompt_number": 8 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Individual elements of [*sequences*](http://docs.python.org/2/library/stdtypes.html#typesseq) (lists, strings and other data structures) can be accessed by specifying their zero-based index position within square brackets ('`[`', '`]`')." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print single_instance_str[2], single_instance_list[2]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "k f\n" - ] - } - ], - "prompt_number": 9 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Negative index values can be used to specify a position offset from the end of the sequence. It is often useful to use a `-1` index value to access the last element of a sequence." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print single_instance_str[-1], single_instance_list[-1]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "d d\n" - ] - } - ], - "prompt_number": 10 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The Python *slice notation* can be used to access subsequences by specifying two index positions separated by a colon (':'); `seq[start:stop]` returns all the elements in `seq` between `start` and `stop - 1` (inclusive)." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print single_instance_str[2:4]\n", - "print single_instance_list[2:4]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "k,\n", - "['f', 'n']\n" - ] - } - ], - "prompt_number": 11 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Slices indices can be negative values." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print single_instance_str[-4:-2]\n", - "print single_instance_list[-4:-2]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - ",v\n", - "['e', 'w']\n" - ] - } - ], - "prompt_number": 12 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The `start` and/or `stop` index can be omitted. A common use of slices with a single index value is to access all but the first element or all but the last element of a sequence." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print single_instance_str[:-1] # all but the last\n", - "print single_instance_list[:-1]\n", - "print single_instance_str[1:] # all but the first\n", - "print single_instance_list[1:]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,\n", - "['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v']\n", - ",k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", - "['k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']\n" - ] - } - ], - "prompt_number": 13 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Slice notation includes an optional third element, `step`, as in `seq[start:stop:step]`, that specifies the increment by which elements are retrieved from `seq` between `start` and `stop - 1`:" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print single_instance_str\n", - "print single_instance_str[::2] # print elements in even-numbered positions (the values, in this case)\n", - "print single_instance_str[1::2] # print elements in odd-numbered positions (the commas, in this case)\n", - "print single_instance_str[::-1] # reverse the 
string" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", - "pkfnfnfcnwe?kywnpwoewvd\n", - ",,,,,,,,,,,,,,,,,,,,,,\n", - "d,v,w,e,o,w,p,n,w,y,k,?,e,w,n,c,f,n,f,n,f,k,p\n" - ] - } - ], - "prompt_number": 14 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [Python tutorial](http://docs.python.org/2/tutorial/introduction.html) offers a helpful ASCII art representation to show how positive and negative indexes are interpreted:\n", - "\n", - "
\n",
-      " +---+---+---+---+---+\n",
-      " | H | e | l | p | A |\n",
-      " +---+---+---+---+---+\n",
-      " 0   1   2   3   4   5\n",
-      "-5  -4  -3  -2  -1\n",
-      "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Python statements are typically separated by newlines (rather than, say, the semi-colon in Java). Statements can extend over more than one line; it is generally best to break the lines after commas within parentheses, braces or brackets. Inserting a backslash character ('\\\\') at the end of a line will also enable continuation of the statement on the next line, but it is generally best to look for other alternatives." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "attribute_names = ['class', \n", - " 'cap-shape', 'cap-surface', 'cap-color', \n", - " 'bruises?', \n", - " 'odor', \n", - " 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', \n", - " 'stalk-shape', 'stalk-root', \n", - " 'stalk-surface-above-ring', 'stalk-surface-below-ring', \n", - " 'stalk-color-above-ring', 'stalk-color-below-ring',\n", - " 'veil-type', 'veil-color', \n", - " 'ring-number', 'ring-type', \n", - " 'spore-print-color', \n", - " 'population', \n", - " 'habitat']\n", - "print attribute_names" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']\n" - ] - } - ], - "prompt_number": 15 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [`str.strip(\\[chars\\]`)](http://docs.python.org/2/library/stdtypes.html#str.strip) method returns a copy of `str` in which any leading or trailing `chars` are removed. If no `chars` are specified, it removes all leading and trailing whitespace. 
[*Whitespace* is any sequence of spaces, tabs (`'\\t'`) and/or newline (`'\\n'`) characters.] " - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print '*', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n', '*'" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "* \tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", - "*\n" - ] - } - ], - "prompt_number": 16 - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print '*', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'.strip(), '*'" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "* p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d *\n" - ] - } - ], - "prompt_number": 17 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A common programming pattern when dealing with CSV (comma-separated values) data files is to repeatedly\n", - "\n", - "1. read a line from a file\n", - "2. strip off any leading and trailing whitespace\n", - "3. 
split the values separated by commas into a list\n", - "\n", - "We will get to repetition control structures (loops) and file input and output shortly, but here is an example of how `str.strip()` and `str.split()` can be chained together in a single instruction:" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "single_instance_str = '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'\n", - "single_instance_list = single_instance_str.strip().split(',') # first strip leading & trailing whitespace, then split on commas\n", - "print single_instance_list" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']\n" - ] - } - ], - "prompt_number": 18 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [`str.join(words)`](http://docs.python.org/2/library/string.html#string.join) method is the inverse of `str.split()`, returning a single string in which each string in the sequence of `words` is separated by `str`." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print '*', ','.join(single_instance_list), '*'" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "* p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d *\n" - ] - } - ], - "prompt_number": 19 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A number of Python functions and methods can be used on strings, lists and other sequences.\n", - "\n", - "The [`len(s)`](http://docs.python.org/2/library/functions.html#len) function can be used to find the length of (number of items in) a sequence `s`. It will also return the number of items in a *dictionary*, a data structure we will cover further below."
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print len(single_instance_str), len(single_instance_list)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "47 23\n" - ] - } - ], - "prompt_number": 20 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The **`in`** operator can be used to determine whether a sequence contains a value. \n", - "\n", - "Boolean values in Python are **`True`** and **`False`** (note the capitalization)." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print ',' in single_instance_str, ',' in single_instance_list" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "True False\n" - ] - } - ], - "prompt_number": 21 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [`s.count(x)`](http://docs.python.org/2/library/stdtypes.html#str.count) method can be used to count the number of occurrences of item `x` in sequence `s`." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print single_instance_str.count(','), single_instance_list.count('f')" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "22 3\n" - ] - } - ], - "prompt_number": 22 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [`s.index(x)`](http://docs.python.org/2/library/stdtypes.html#str.index) method can be used to find the first 0-based index of item `x` in sequence `s`. 
" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print single_instance_str.index(','), single_instance_list.index('f')" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "2 2\n" - ] - } - ], - "prompt_number": 23 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "One important distinction between strings and lists has to do with their [*mutability*](http://docs.python.org/2/reference/datamodel.html).\n", - "\n", - "Python strings are *immutable*, i.e., they cannot be modified. Most string methods (like `str.strip()`) return modified *copies* of the strings on which they are used.\n", - "\n", - "Python lists are *mutable*, i.e., they can be modified. \n", - "\n", - "The examples below illustrate a number of [`list`](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) methods that modify lists." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "list_1 = [4, 2, 3, 5, 1]\n", - "list_2 = list_1 # list_2 now references the same object as list_1\n", - "print 'list_1: ', list_1\n", - "print 'list_2: ', list_2\n", - "list_1.remove(1)\n", - "print 'list_1.remove(1):', list_1\n", - "list_1.append(6)\n", - "print 'list_1.append(6):', list_1\n", - "list_1.sort()\n", - "print 'list_1.sort(): ', list_1\n", - "list_1.reverse()\n", - "print 'list_1.reverse():', list_1" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "list_1: [4, 2, 3, 5, 1]\n", - "list_2: [4, 2, 3, 5, 1]\n", - "list_1.remove(1): [4, 2, 3, 5]\n", - "list_1.append(6): [4, 2, 3, 5, 6]\n", - "list_1.sort(): [2, 3, 4, 5, 6]\n", - "list_1.reverse(): [6, 5, 4, 3, 2]\n" - ] - } - ], - "prompt_number": 24 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When more than one name (e.g., a variable) is bound to the same mutable object, changes made to that object 
are reflected in all names bound to that object. For example, in the second statement above, `list_2` is bound to the same object that is bound to `list_1`, namely, the list `[4, 2, 3, 5, 1]`. All changes made to the object bound to `list_1` will thus be reflected in `list_2` (since they both reference the same object)." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print 'list_1: ', list_1\n", - "print 'list_2: ', list_2" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "list_1: [6, 5, 4, 3, 2]\n", - "list_2: [6, 5, 4, 3, 2]\n" - ] - } - ], - "prompt_number": 25 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "There are sorting and reversing functions, [`sorted()`](https://docs.python.org/2.7/library/functions.html#sorted) and [`reversed()`](https://docs.python.org/2.7/library/functions.html#reversed), that do not modify their arguments, and can thus be used on mutable or immutable objects. We will elaborate on each of these functions further below, but here are a couple of examples of how `sorted()` returns a sorted list of the elements in its argument."
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print 'sorted(list_1):', sorted(list_1) # return a copy of list_1 in sorted order\n", - "print 'list_1: ', list_1\n", - "print 'sorted(single_instance_str):', sorted(single_instance_str) # returns a list of sorted elements in the string\n", - "print 'single_instance_str: ', single_instance_str" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "sorted(list_1): [2, 3, 4, 5, 6]\n", - "list_1: [6, 5, 4, 3, 2]\n", - "sorted(single_instance_str): ['\\t', '\\n', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', '?', 'c', 'd', 'e', 'e', 'f', 'f', 'f', 'k', 'k', 'n', 'n', 'n', 'n', 'o', 'p', 'p', 'v', 'w', 'w', 'w', 'w', 'y']\n", - "single_instance_str: \tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", - "\n" - ] - } - ], - "prompt_number": 26 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A [*tuple*](http://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences) is an ordered, immutable sequence of 0 or more comma-delimited values enclosed in parentheses (`'('`, `')'`). Many of the functions that operate on strings and lists also operate on tuples." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "x = (1, 2, 3, 4, 5) # a tuple\n", - "print 'x =', x, ', len(x) =', len(x), ', x.index(3) =', x.index(3), ', x[4:2:-1] = ', x[4:2:-1]\n", - "print 'sorted(x, reverse=True):', sorted(x, reverse=True) # sorted always returns a list; reverse=True specifies reverse sort order" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "x = (1, 2, 3, 4, 5) , len(x) = 5 , x.index(3) = 2 , x[4:2:-1] = (5, 4)\n", - "sorted(x, reverse=True): [5, 4, 3, 2, 1]\n" - ] - } - ], - "prompt_number": 27 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If the `s.index(x)` or `list.remove(x)` method is used on a sequence `s` or `list` that does not contain the value `x`, a [`ValueError`](http://docs.python.org/2/library/exceptions.html#exceptions.ValueError) exception is raised." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print x.index(6) # a ValueError will be raised" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "ename": "ValueError", - "evalue": "tuple.index(x): x not in tuple", - "output_type": "pyerr", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m6\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# a ValueError will be raised\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;31mValueError\u001b[0m: tuple.index(x): x not in tuple" - ] - } - ], - "prompt_number": 28 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Conditionals" - ] - }, - { - "cell_type": "markdown", - 
"metadata": {}, - "source": [ - "One common approach to handling errors is to *look before you leap (LBYL)*, i.e., test for potential [exceptions](http://docs.python.org/2/tutorial/errors.html) before executing instructions that might raise those exceptions. \n", - "\n", - "This approach can be implemented using the [**`if`**](http://docs.python.org/2/tutorial/controlflow.html#if-statements) statement (which may optionally include an **`else`** and any number of **`elif`** clauses).\n", - "\n", - "The following is a simple example of an `if` statement:" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "class_value = 'e' # try changing this to 'p' or 'x'\n", - "\n", - "if class_value == 'e':\n", - " print 'edible'\n", - "elif class_value == 'p':\n", - " print 'poisonous'\n", - "else:\n", - " print 'unknown'" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "edible\n" - ] - } - ], - "prompt_number": 29 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that \n", - "\n", - "* a colon ('`:`') is used at the end of the lines with `if`, `else` or `elif`\n", - "* no parentheses are required to enclose the boolean condition (it is presumed to include everything between `if` or `elif` and the colon)\n", - "* the statements below each `if`, `elif` and `else` line are all indented\n", - "\n", - "Python does not have special characters to delimit statement blocks (like the '{' and '}' delimiters in Java); instead, sequences of statements with the same *indentation level* are treated as a statement block. 
The [Python Style Guide](http://legacy.python.org/dev/peps/pep-0008/) recommends using 4 spaces for each indentation level.\n", - "\n", - "An `if` statement can be used to follow the LBYL paradigm in preventing the `ValueError` that occurred in an earlier example:" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "attribute = 'bruises' # try substituting 'bruises?' for 'bruises' and re-running this code\n", - "\n", - "if attribute in attribute_names:\n", - " i = attribute_names.index(attribute)\n", - " print attribute, 'is in position', i\n", - "else:\n", - " print attribute, 'is not in', attribute_names" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "bruises is not in ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']\n" - ] - } - ], - "prompt_number": 30 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Another perspective on handling errors championed by some pythonistas is that it is [*easier to ask forgiveness than permission (EAFP)*](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#eafp-vs-lbyl).\n", - "\n", - "As in many practical applications of philosophy, religion or dogma, it is helpful to *think before you choose (TBYC)*. There are a number of factors to consider in deciding whether to follow the EAFP or LBYL paradigm, including code readability and the anticipated likelihood and relative severity of encountering an exception. Oran Looney wrote a blog post providing a nice overview of the debate over [LBYL vs. 
EAFP](http://oranlooney.com/lbyl-vs-eafp/).\n", - "\n", - "We will follow the LBYL paradigm throughout most of this primer. However, as an illustration of EAFP in Python, here is an alternate implementation of the functionality of the code above, using a [`try/except`](http://docs.python.org/2/tutorial/errors.html#handling-exceptions) statement." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "attribute = 'bruises' # try substituting 'bruises?' for 'bruises' and re-running this code\n", - "\n", - "try:\n", - " i = attribute_names.index(attribute)\n", - " print attribute, 'is in position', i\n", - "except ValueError:\n", - " print attribute, 'is not found'" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "bruises is not found\n" - ] - } - ], - "prompt_number": 31 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The Python *null object* is **`None`** (note the capitalization)." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "attribute = 'bruises?'\n", - "\n", - "if attribute not in attribute_names: # equivalent to 'not attribute in attribute_names'\n", - " value = None\n", - "else:\n", - " i = attribute_names.index(attribute)\n", - " value = single_instance_list[i]\n", - " \n", - "print attribute, '=', value" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "bruises? 
= f\n" - ] - } - ], - "prompt_number": 32 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Defining and calling functions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Python [*function definitions*](http://docs.python.org/2/tutorial/controlflow.html#defining-functions) start with the **`def`** keyword followed by a function name, a list of 0 or more comma-delimited *parameters* (aka 'formal parameters') enclosed within parentheses, and then a colon ('`:`'). \n", - "\n", - "A function definition may include one or more [**`return`**](http://docs.python.org/2/reference/simple_stmts.html#the-return-statement) statements to indicate the value(s) returned to where the function is called. It is good practice to include a short [docstring](http://docs.python.org/2/tutorial/controlflow.html#tut-docstrings) to briefly describe the behavior of the function and the value(s) it returns." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def attribute_value(instance, attribute, attribute_names):\n", - " '''Returns the value of attribute in instance, based on the position of attribute in the list of attribute_names'''\n", - " if attribute not in attribute_names:\n", - " return None\n", - " else:\n", - " i = attribute_names.index(attribute)\n", - " return instance[i] # using the parameter name here" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 33 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A *function call* starts with the function name, followed by a list of 0 or more comma-delimited *arguments* (aka 'actual parameters') enclosed within parentheses. A function call can be used as a statement or within an expression."
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "attribute = 'cap-shape' # try substituting any of the other attribute names shown above\n", - "print attribute, '=', attribute_value(single_instance_list, attribute, attribute_names)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "cap-shape = k\n" - ] - } - ], - "prompt_number": 34 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that Python does not distinguish between names used for *variables* and names used for *functions*. An assignment statement binds a value to a name; a function definition also binds a value to a name. At any given time, the value most recently bound to a name is the one that is used. \n", - "\n", - "The [`type(object)`](http://docs.python.org/2.7/library/functions.html#type) function returns the `type` of `object`." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "x = 0\n", - "print 'x used as a variable:', x, type(x)\n", - "def x():\n", - " print 'x'\n", - "print 'x used as a function:', x, type(x)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "x used as a variable: 0 \n", - "x used as a function: \n" - ] - } - ], - "prompt_number": 35 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Also note that Python function arguments are passed using *call by object reference*. Thus any modifications made to a parameter that has been passed a mutable object bound to a name as an argument will persist after the function exits." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def insert_x(list_parameter):\n", - " '''Inserts \"x\" at the head of a list, modifying the list argument'''\n", - " list_parameter.insert(0, 'x')\n", - " print 'Inserted x:', list_parameter\n", - " return list_parameter\n", - "\n", - "insert_x([1, 2, 3]) # passing an unnamed object does not affect any existing names\n", - "list_argument = [1, 2, 3] # passing a named object will affect the object bound to that name\n", - "print 'Before:', list_argument\n", - "insert_x(list_argument)\n", - "print 'After:', list_argument" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Inserted x: ['x', 1, 2, 3]\n", - "Before: [1, 2, 3]\n", - "Inserted x: ['x', 1, 2, 3]\n", - "After: ['x', 1, 2, 3]\n" - ] - } - ], - "prompt_number": 36 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "One way of preventing functions from modifying mutable objects passed as parameters is to make a copy of those objects inside the function. Here is another version of the function above that makes a shallow copy of the *list_parameter* using the slice operator. 
\n", - "\n", - "*\\[Note: the Python [copy](http://docs.python.org/2/library/copy.html) module provides both [shallow] [`copy()`](http://docs.python.org/2/library/copy.html#copy.copy) and [`deepcopy()`](http://docs.python.org/2/library/copy.html#copy.deepcopy) methods; we will cover modules further below.]*" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def insert_x_copy(list_parameter):\n", - " '''Inserts \"x\" at the head of a list, without modifying the list argument'''\n", - " list_parameter_copy = list_parameter[:]\n", - " list_parameter_copy.insert(0, 'x')\n", - " print 'Inserted x:', list_parameter_copy\n", - " return list_parameter_copy\n", - "\n", - "insert_x_copy([1, 2, 3]) # passing an unnamed object does not affect any existing names\n", - "list_argument = [1, 2, 3] # passing a named object no longer affects the object bound to that name (the function modifies a copy)\n", - "print 'Before:', list_argument\n", - "insert_x_copy(list_argument)\n", - "print 'After:', list_argument" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Inserted x: ['x', 1, 2, 3]\n", - "Before: [1, 2, 3]\n", - "Inserted x: ['x', 1, 2, 3]\n", - "After: [1, 2, 3]\n" - ] - } - ], - "prompt_number": 37 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Python functions can return more than one value, by separating those return values with commas in the **return** statement. Multiple values are returned as a tuple. If the function-invoking expression is an assignment statement, multiple variables can be assigned the multiple values returned by the function in a single statement. This combining of values and subsequent separation is known as tuple *packing* and *unpacking*."
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def min_and_max(list_of_values):\n", - " '''Returns a tuple containing the min and max values in the list_of_values'''\n", - " return min(list_of_values), max(list_of_values)\n", - "\n", - "list_1 = [3, 1, 4, 2, 5]\n", - "print 'min and max of', list_1, ':', min_and_max(list_1)\n", - "\n", - "min_and_max_list_1 = min_and_max(list_1) # a single variable is assigned the two-element tuple\n", - "print 'min and max of', list_1, ':', min_and_max_list_1\n", - "\n", - "min_list_1, max_list_1 = min_and_max(list_1) # the 1st variable is assigned the 1st value, the 2nd variable is assigned the 2nd value\n", - "print 'min and max of', list_1, ':', min_list_1, ',', max_list_1" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "min and max of [3, 1, 4, 2, 5] : (1, 5)\n", - "min and max of [3, 1, 4, 2, 5] : (1, 5)\n", - "min and max of [3, 1, 4, 2, 5] : 1 , 5\n" - ] - } - ], - "prompt_number": 38 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Iteration: for, range" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [**`for`**](http://docs.python.org/2/tutorial/controlflow.html#for-statements) statement iterates over the elements of a sequence.\n", - "\n", - "The [`range(stop)`](http://docs.python.org/2/tutorial/controlflow.html#the-range-function) function returns a list of values from 0 up to `stop - 1` (inclusive). 
" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print 'Index values for attributes:', range(len(attribute_names)), '\\n'\n", - "\n", - "print 'Values for the', len(attribute_names), 'attributes:\\n'\n", - "for i in range(len(attribute_names)):\n", - " print attribute_names[i], '=', attribute_value(single_instance_list, attribute_names[i], attribute_names)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Index values for attributes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22] \n", - "\n", - "Values for the 23 attributes:\n", - "\n", - "class = p\n", - "cap-shape = k\n", - "cap-surface = f\n", - "cap-color = n\n", - "bruises? = f\n", - "odor = n\n", - "gill-attachment = f\n", - "gill-spacing = c\n", - "gill-size = n\n", - "gill-color = w\n", - "stalk-shape = e\n", - "stalk-root = ?\n", - "stalk-surface-above-ring = k\n", - "stalk-surface-below-ring = y\n", - "stalk-color-above-ring = w\n", - "stalk-color-below-ring = n\n", - "veil-type = p\n", - "veil-color = w\n", - "ring-number = o\n", - "ring-type = e\n", - "spore-print-color = w\n", - "population = v\n", - "habitat = d\n" - ] - } - ], - "prompt_number": 39 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The more general form of the function, [`range(start, stop[, step])`](http://docs.python.org/2/library/functions.html#range), returns a list of values from `start` to `stop - 1` (inclusive) increasing by `step` (which defaults to `1`), or from `start` down to `stop + 1` (inclusive) decreasing by `step` if `step` is negative." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print 'range(5, 10):', range(5, 10)\n", - "print 'range(10, 5, -1):', range(10, 5, -1)\n", - "print 'range(0, 10, 2):', range(0, 10, 2)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "range(5, 10): [5, 6, 7, 8, 9]\n", - "range(10, 5, -1): [10, 9, 8, 7, 6]\n", - "range(0, 10, 2): [0, 2, 4, 6, 8]\n" - ] - } - ], - "prompt_number": 40 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [`xrange([start,] stop[, step])`](http://docs.python.org/2/library/functions.html#xrange) function is an [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) version of the `range()` function. In the context of a `for` loop, it returns the *next* item of the sequence for each iteration of the loop rather than creating *all* the elements of the sequence before the first iteration. This can reduce memory consumption in cases where iteration over all the items is not required. \n", - "\n", - "The `range()` function returns a list, which can then be manipulated by any list or sequence methods. An *xrange object* supports only iteration (e.g., in a `for` loop), indexing and the `len()` function. A related and slightly more general class of container objects, [*iterators*](http://docs.python.org/2/library/stdtypes.html#typeiter), include a [`next()`](http://docs.python.org/2/library/stdtypes.html#iterator.next) method for explicitly returning the next item in the container."
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print xrange(len(attribute_names)), '\\n'\n", - "\n", - "print 'Values for the', len(attribute_names), 'attributes:\\n'\n", - "for i in xrange(len(attribute_names)):\n", - " print attribute_names[i], '=', attribute_value(single_instance_list, attribute_names[i], attribute_names)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "xrange(23) \n", - "\n", - "Values for the 23 attributes:\n", - "\n", - "class = p\n", - "cap-shape = k\n", - "cap-surface = f\n", - "cap-color = n\n", - "bruises? = f\n", - "odor = n\n", - "gill-attachment = f\n", - "gill-spacing = c\n", - "gill-size = n\n", - "gill-color = w\n", - "stalk-shape = e\n", - "stalk-root = ?\n", - "stalk-surface-above-ring = k\n", - "stalk-surface-below-ring = y\n", - "stalk-color-above-ring = w\n", - "stalk-color-below-ring = n\n", - "veil-type = p\n", - "veil-color = w\n", - "ring-number = o\n", - "ring-type = e\n", - "spore-print-color = w\n", - "population = v\n", - "habitat = d\n" - ] - } - ], - "prompt_number": 41 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Modules, namespaces and dotted notation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A Python [***module***](http://docs.python.org/2/tutorial/modules.html) is a file containing related definitions (e.g., of functions and variables). Modules are used to help organize the Python [***namespaces***](http://docs.python.org/2/tutorial/classes.html#python-scopes-and-namespaces), the set of identifiers accessible in a particular context. 
All of the functions and variables we define in this IPython Notebook are in the `__main__` namespace, so accessing them does not require any specification of a module.\n", - "\n", - "A Python module named `simple_ml` (in the file `simple_ml.py`), contains a set of solutions to the exercises in this IPython Notebook. Accessing functions in that module requires that we first [`import`](http://docs.python.org/2/reference/simple_stmts.html#the-import-statement) the module, and then prefix the function names with the module name followed by a dot (this is known as ***dotted notation***).\n", - "\n", - "For example, the following function call in Exercise 1 below: \n", - "\n", - "`simple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)`\n", - "\n", - "uses dotted notation to reference the `print_attribute_names_and_values()` function in the `simple_ml` module.\n", - "\n", - "After you have defined your function in Exercise 1, you can test it by deleting the `simple_ml` module specification, so that the statement becomes\n", - "\n", - "`print_attribute_names_and_values(single_instance_list, attribute_names)`\n", - "\n", - "This will reference the `print_attribute_names_and_values()` function in the current namespace (`__main__`), i.e., the top-level interpreter environment. The `simple_ml.print_attribute_names_and_values()` function will still be accessible in the `simple_ml` namespace by using the \"`simple_ml.`\" prefix." - ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Exercise 1: define print_attribute_names_and_values()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Complete the following function definition, `print_attribute_names_and_values(instance, attribute_names)`, so that it generates exactly the same output as the code above." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def print_attribute_names_and_values(instance, attribute_names):\n", - " '''Prints the attribute names and values for an instance'''\n", - " # your code goes here\n", - " return\n", - "\n", - "import simple_ml # this module contains my solutions to exercises\n", - "# to test your function, delete the 'simple_ml.' module specification in the call to print_attribute_names_and_values() below\n", - "simple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Values for the 23 attributes:\n", - "\n", - "class = p\n", - "cap-shape = k\n", - "cap-surface = f\n", - "cap-color = n\n", - "bruises? = f\n", - "odor = n\n", - "gill-attachment = f\n", - "gill-spacing = c\n", - "gill-size = n\n", - "gill-color = w\n", - "stalk-shape = e\n", - "stalk-root = ?\n", - "stalk-surface-above-ring = k\n", - "stalk-surface-below-ring = y\n", - "stalk-color-above-ring = w\n", - "stalk-color-below-ring = n\n", - "veil-type = p\n", - "veil-color = w\n", - "ring-number = o\n", - "ring-type = e\n", - "spore-print-color = w\n", - "population = v\n", - "habitat = d\n" - ] - } - ], - "prompt_number": 42 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "File I/O" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Python [file input and output](http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files) is done through [file](http://docs.python.org/2/library/stdtypes.html#file-objects) objects. 
A file object is created with the [`open(name[, mode])`](http://docs.python.org/2/library/functions.html#open) statement, where `name` is a string representing the name of the file, and `mode` is `'r'` (read), `'w'` (write) or `'a'` (append); if no second argument is provided, the mode defaults to `'r'`.\n", - "\n", - "A common Python programming pattern for processing an input text file is to \n", - "\n", - "* [**`open`**](http://docs.python.org/2/library/functions.html#open) the file using a [**`with`**](http://docs.python.org/2/reference/compound_stmts.html#the-with-statement) statement (which will automatically [**`close`**](http://docs.python.org/2/library/stdtypes.html#file.close) the file after the statements inside the `with` execute)\n", - "* iterate over each line in the file using a **`for`** statement\n", - "\n", - "The following code creates a list of instances, where each instance is a list of attribute values (like `instance_1_str` above). " - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "all_instances = [] # initialize instances to an empty list\n", - "data_filename = 'agaricus-lepiota.data'\n", - "\n", - "with open(data_filename, 'r') as f:\n", - " for line in f:\n", - " all_instances.append(line.strip().split(','))\n", - " \n", - "print 'Read', len(all_instances), 'instances from', data_filename\n", - "print 'First instance:', all_instances[0] # we don't want to print all the instances, so let's just print one to verify" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Read 8124 instances from agaricus-lepiota.data\n", - "First instance: ['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']\n" - ] - } - ], - "prompt_number": 43 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Exercise 2: define load_instances()" - ] - }, - { - "cell_type": 
"markdown", - "metadata": {}, - "source": [ - "Define a function, `load_instances(filename)`, that returns a list of instances in a text file. The function definition is started for you below. The function should exhibit the same behavior as the code above." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def load_instances(filename):\n", - " '''Returns a list of instances stored in a file.\n", - " \n", - " filename is expected to have one list of comma-delimited attribute values per line, e.g.,\n", - " p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'''\n", - " instances = []\n", - " # your code goes here\n", - " return instances\n", - "\n", - "data_filename = 'agaricus-lepiota.data'\n", - "# to test your function, delete the 'simple_ml.' module specification in the call to load_instances() below\n", - "all_instances_2 = simple_ml.load_instances(data_filename)\n", - "print 'Read', len(all_instances_2), 'instances from', data_filename\n", - "print 'First instance:', all_instances_2[0] " - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Read 8124 instances from agaricus-lepiota.data\n", - "First instance: ['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']\n" - ] - } - ], - "prompt_number": 44 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Output to text file is usually done via [`file.write(str)`](http://docs.python.org/2/library/stdtypes.html#file.write) method.\n", - "\n", - "As we saw earlier, the [`str.join(words)`](http://docs.python.org/2/library/stdtypes.html#str.join) method returns a single `str`-delimited string containing each of the strings in the list `words`.\n", - "\n", - "SQL and Hive database tables often use the pipe ('|') delimiter to separate column values for each row when they are stored as flat files. 
The following code creates a new data file using pipes rather than commas to separate the attribute values." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print 'Converting to pipe delimiter, e.g.,', '|'.join(all_instances[0])\n", - "\n", - "datafile2 = 'agaricus-lepiota-2.data'\n", - "with open(datafile2, 'w') as f:\n", - " for instance in all_instances:\n", - " f.write('|'.join(instance) + '\\n') # '+' is the concatenation operator when used with strings\n", - "\n", - "all_instances_3 = []\n", - "with open(datafile2, 'r') as f:\n", - " for line in f:\n", - " all_instances_3.append(line.strip().split('|')) # note: changed ',' to '|'\n", - "print 'Read', len(all_instances_3), 'instances from', datafile2\n", - "print 'First instance:', all_instances_3[0] # we don't want to print all the instances, so let's just print one to verify" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Converting to pipe delimiter, e.g., p|x|s|n|t|p|f|c|n|k|e|e|s|s|w|w|p|w|o|p|k|s|u\n", - "Read 8124 instances from agaricus-lepiota-2.data\n", - "First instance: ['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']\n" - ] - } - ], - "prompt_number": 45 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "List comprehensions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Python provides a powerful [*list comprehension*](http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions) construct to simplify the creation of a list by specifying a formula in a single expression.\n", - "\n", - "Some programmers find list comprehensions confusing, and avoid their use. 
We won't rely on list comprehensions here, but will show examples with and without list comprehensions below.\n", - "\n", - "One common use of list comprehensions is in the context of the [str.join(words)](http://docs.python.org/2/library/string.html#string.join) method we saw earlier.\n", - "\n", - "If we wanted to construct a pipe-delimited string containing elements of the list, we could use a `for` loop to iteratively add list elements and pipe delimiters to a string. We would thereby add one pipe delimiter too many, and would thus have to shave that off at the end." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "pipe_delimited_string = ''\n", - "for x in [1, 2, 3]:\n", - " pipe_delimited_string += str(x) + '|'\n", - "pipe_delimited_string = pipe_delimited_string[:-1]\n", - "pipe_delimited_string" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "metadata": {}, - "output_type": "pyout", - "prompt_number": 46, - "text": [ - "'1|2|3'" - ] - } - ], - "prompt_number": 46 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This process is much simpler using a list comprehension." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "'|'.join([str(x) for x in [1, 2, 3]])" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "metadata": {}, - "output_type": "pyout", - "prompt_number": 47, - "text": [ - "'1|2|3'" - ] - } - ], - "prompt_number": 47 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As noted in the initial description of the UCI mushroom set above, 2480 of the 8124 instances have missing values (denoted by `'?'`) for an attribute. 
There are several techniques for dealing with instances that include missing values, but to simplify things in the context of this primer - and following the example in the [Data Science for Business](http://www.data-science-for-biz.com/) book - we will restrict our focus to only those *clean* instances that have no missing values.\n", - "\n", - "We could use several lines of code - with an `if` statement inside a `for` loop - to create a `clean_instances` list from the `all_instances` list. Or we could use a list comprehension.\n", - "\n", - "We will show both approaches to creating `clean_instances` below." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "# version 1: using an if statement nested within a for statement\n", - "clean_instances = []\n", - "for instance in all_instances:\n", - " if '?' not in instance:\n", - " clean_instances.append(instance)\n", - " \n", - "print len(clean_instances), 'clean instances'" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "5644 clean instances\n" - ] - } - ], - "prompt_number": 48 - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "# version 2: using an equivalent list comprehension\n", - "clean_instances_2 = [instance for instance in all_instances if '?' 
not in instance]\n", - "\n", - "print len(clean_instances_2), 'clean instances'" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "5644 clean instances\n" - ] - } - ], - "prompt_number": 49 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Dictionaries (dicts)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Although single character abbreviations of attribute values (e.g., 'x') allow for more compact data files, they are not as easy to understand by human readers as the longer attribute value descriptions (e.g., 'convex').\n", - "\n", - "A Python [dictionary (or `dict`)](http://docs.python.org/2/tutorial/datastructures.html#dictionaries) is an unordered, comma-delimited collection of *key, value* pairs, serving a similar function as a hash table or hashmap in other programming languages.\n", - "\n", - "We could create a dictionary for the `cap-type` attribute values shown above:\n", - "\n", - "> bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", - "\n", - "Since we will want to look up the value using the abbreviation (which is the representation of the value stored in the file), we will use the abbreviations as *keys* and the descriptions as *values*.\n", - "\n", - "A Python dictionary can be created by specifying all `key: value` pairs (with colons separating each *key* and *value*), or by adding them iteratively. We will use the first method below, and use the second method further below. A *value* in a Python dictionary (`dict`) is accessed by specifying its *key* using the general form `dict[key]`." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "attribute_values_cap_type = {'b': 'bell', 'c': 'conical', 'x': 'convex', 'f': 'flat', 'k': 'knobbed', 's': 'sunken'}\n", - "\n", - "attribute_value_abbrev = 'x'\n", - "print attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "x = convex\n" - ] - } - ], - "prompt_number": 50 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A Python dictionary is an *iterable* container, so we can iterate over the keys in a dictionary using a `for` loop.\n", - "\n", - "Note that since a dictionary is an *unordered* collection, the sequence of abbreviations and associated values is not guaranteed to appear in any particular order. " - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "for attribute_value_abbrev in attribute_values_cap_type:\n", - " print attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "c = conical\n", - "b = bell\n", - "f = flat\n", - "k = knobbed\n", - "s = sunken\n", - "x = convex\n" - ] - } - ], - "prompt_number": 51 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Python supports *dictionary comprehensions*, which have a similar form as the *list comprehensions* described above.\n", - "\n", - "For example, if we provisionally omit the 'convex' cap-type (whose abbreviation is the last letter rather than first letter in the attribute name), we could construct a dictionary of abbreviations and descriptions using the following expression." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "attribute_values_cap_type_2 ={x[0]: x for x in ['bell', 'conical', 'flat', 'knobbed', 'sunken']}\n", - "print attribute_values_cap_type_2" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "{'s': 'sunken', 'c': 'conical', 'b': 'bell', 'k': 'knobbed', 'f': 'flat'}\n" - ] - } - ], - "prompt_number": 52 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "While it's useful to have a dictionary of values for the `cap-type` attribute, it would be even more useful to have a dictionary of values for every attribute. Earlier, we created a list of `attribute_names`; let's expand this to create a list of `attribute_values` wherein each list element is a dictionary.\n", - "\n", - "Rather than explicitly type in each dictionary entry in the Python interpreter, we'll define a function to read a file containing the list of attribute names, values and value abbreviations in the format shown above:\n", - "\n", - "* class: edible=e, poisonous=p\n", - "* cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", - "* cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s\n", - "* ..." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def load_attribute_values(filename):\n", - " '''Returns a list of attribute values in a file.\n", - " \n", - " The attribute values are represented as dictionaries, wherein the keys are abbreviations and the values are descriptions.\n", - " filename is expected to have one attribute name and set of values per line, with the following format:\n", - " name: value_description=value_abbreviation[,value_description=value_abbreviation]*\n", - " for example\n", - " cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", - " The attribute value description dictionary created from this line would be the following:\n", - " {'c': 'conical', 'b': 'bell', 'f': 'flat', 'k': 'knobbed', 's': 'sunken', 'x': 'convex'}'''\n", - " attribute_values = []\n", - " with open(filename) as f:\n", - " for line in f:\n", - " attribute_name_and_value_string_list = line.strip().split(':')\n", - " attribute_name = attribute_name_and_value_string_list[0]\n", - " if len(attribute_name_and_value_string_list) < 2:\n", - " attribute_values.append({}) # no values for this attribute\n", - " else:\n", - " value_abbreviation_description_dict = {}\n", - " description_and_abbreviation_string_list = attribute_name_and_value_string_list[1].strip().split(',')\n", - " for description_and_abbreviation_string in description_and_abbreviation_string_list:\n", - " description_and_abbreviation = description_and_abbreviation_string.strip().split('=')\n", - " description = description_and_abbreviation[0]\n", - " if len(description_and_abbreviation) < 2: # assumption: no more than 1 value is missing an abbreviation\n", - " value_abbreviation_description_dict[None] = description\n", - " else:\n", - " abbreviation = description_and_abbreviation[1]\n", - " value_abbreviation_description_dict[abbreviation] = description\n", - " attribute_values.append(value_abbreviation_description_dict)\n", - " return attribute_values\n", - "\n", - 
"attribute_filename = 'agaricus-lepiota.attributes'\n", - "attribute_values = load_attribute_values(attribute_filename)\n", - "print 'Read', len(attribute_values), 'attribute values from', attribute_filename\n", - "print 'First attribute values list:', attribute_values[0]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Read 23 attribute values from agaricus-lepiota.attributes\n", - "First attribute values list: {'p': 'poisonous', 'e': 'edible'}\n" - ] - } - ], - "prompt_number": 53 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Exercise 3: define load_attribute_values()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We earlier created the `attribute_names` list manually. The `load_attribute_values()` function above creates the `attribute_values` list automatically from the contents of a file ... each line of which starts with the name of each attribute ... which we discard.\n", - "\n", - "Complete the following function definition so that the code implements the functionality described in the docstring." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def load_attribute_names_and_values(filename):\n", - " '''Returns a list of attribute names and values in a file.\n", - " \n", - " This list contains dictionaries wherein the keys are names \n", - " and the values are value description dictionaries.\n", - " \n", - " Each value description sub-dictionary will use the attribute value abbreviations as its keys \n", - " and the attribute descriptions as the values.\n", - " \n", - " filename is expected to have one attribute name and set of values per line, with the following format:\n", - " name: value_description=value_abbreviation[,value_description=value_abbreviation]*\n", - " for example\n", - " cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", - " The attribute name and values dictionary created from this line would be the following:\n", - " {'name': 'cap-shape', 'values': {'c': 'conical', 'b': 'bell', 'f': 'flat', 'k': 'knobbed', 's': 'sunken', 'x': 'convex'}}'''\n", - " attribute_names_and_values = [] # this will be a list of dicts\n", - " # your code goes here\n", - " return attribute_names_and_values\n", - "\n", - "attribute_filename = 'agaricus-lepiota.attributes'\n", - "# to test your function, delete the 'simple_ml.' 
module specification in the call to load_attribute_names_and_values() below\n", - "attribute_names_and_values = simple_ml.load_attribute_names_and_values(attribute_filename)\n", - "print 'Read', len(attribute_names_and_values), 'attribute values from', attribute_filename\n", - "print 'First attribute name:', attribute_names_and_values[0]['name'], '; values:', attribute_names_and_values[0]['values']" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Read 23 attribute values from agaricus-lepiota.attributes\n", - "First attribute name: class ; values: {'p': 'poisonous', 'e': 'edible'}\n" - ] - } - ], - "prompt_number": 54 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Counting" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Data scientists often need to count things. For example, we might want to count the numbers of edible and poisonous mushrooms in the *clean_instances* list we created earlier." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "edible_count = 0\n", - "for instance in clean_instances:\n", - " if instance[0] == 'e':\n", - " edible_count += 1 # this is shorthand for edible_count = edible_count + 1\n", - "\n", - "print 'There are', edible_count, 'edible mushrooms among the', len(clean_instances), 'clean instances'" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "There are 3488 edible mushrooms among the 5644 clean instances\n" - ] - } - ], - "prompt_number": 55 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "More generally, we often want to count the number of occurrences (frequencies) of each possible value for an attribute. 
One way to do so is to create a dictionary where each dictionary key is an attribute value and each dictionary value is the count of instances with that attribute value.\n", - "\n", - "Using an ordinary dictionary, we must be careful to create a new dictionary entry the first time we see a new attribute value (that is not already contained in the dictionary)." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "cap_state_value_counts = {}\n", - "for instance in clean_instances:\n", - " cap_state_value = instance[1] # cap-state is the 2nd attribute\n", - " if cap_state_value not in cap_state_value_counts:\n", - " cap_state_value_counts[cap_state_value] = 0\n", - " cap_state_value_counts[cap_state_value] += 1\n", - "\n", - "print 'Counts for each value of cap-state:'\n", - "for value in cap_state_value_counts:\n", - " print value, ':', cap_state_value_counts[value]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Counts for each value of cap-state:\n", - "c : 4\n", - "b : 300\n", - "f : 2432\n", - "k : 36\n", - "s : 32\n", - "x : 2840\n" - ] - } - ], - "prompt_number": 56 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The Python [`collections`](http://docs.python.org/2/library/collections.html) module provides a number of high performance container datatypes. A frequently useful datatype is a [`defaultdict`](http://docs.python.org/2/library/collections.html#defaultdict-objects), which automatically creates an appropriate default value for a new key. 
For example, a `defaultdict(int)` automatically initializes a new dictionary entry to 0 (zero); a `defaultdict(list)` automatically initializes a new dictionary entry to the empty list (`[]`).\n", - "\n", - "After first importing `defaultdict` from `collections`, we can use `defaultdict(int)` to simplify the code above:" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "from collections import defaultdict # don't need to use collections.defaultdict() below\n", - "\n", - "cap_state_value_counts = defaultdict(int)\n", - "for instance in clean_instances:\n", - " cap_state_value = instance[1]\n", - " cap_state_value_counts[cap_state_value] += 1\n", - "\n", - "print 'Counts for each value of cap-state:'\n", - "for value in cap_state_value_counts:\n", - " print value, ':', cap_state_value_counts[value]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Counts for each value of cap-state:\n", - "c : 4\n", - "b : 300\n", - "f : 2432\n", - "k : 36\n", - "s : 32\n", - "x : 2840\n" - ] - } - ], - "prompt_number": 57 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Exercise 4: define attribute_value_counts()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Define a function, `attribute_value_counts(instances, attribute, attribute_names)`, that returns a `defaultdict` containing the counts of occurrences of each value of `attribute` in the list of `instances`. `attribute_names` is the list we created above, where each element is the name of an attribute." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "# your definition goes here\n", - "\n", - "attribute = 'cap-shape'\n", - "# remove 'simple_ml.' 
below to test your function definition\n", - "attribute_value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)\n", - "\n", - "print 'Counts for each value of', attribute, ':'\n", - "for value in attribute_value_counts:\n", - " print value, ':', attribute_value_counts[value]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Counts for each value of cap-shape :\n", - "c : 4\n", - "b : 300\n", - "f : 2432\n", - "k : 36\n", - "s : 32\n", - "x : 2840\n" - ] - } - ], - "prompt_number": 58 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Sorting" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Earlier, we saw that there is a `list.sort()` method that will sort a list in-place, i.e., by replacing the original value of `list` with a sorted version of the elements in `list`. \n", - "\n", - "The Python [`sorted(iterable[, cmp[, key[, reverse]]])`](http://docs.python.org/2/library/functions.html#sorted) function can be used to return a *copy* of a list, dictionary or any other [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) container it is passed, in ascending order." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "original_list = [3, 1, 4, 2, 5]\n", - "sorted_list = sorted(original_list)\n", - "print original_list\n", - "print sorted_list" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "[3, 1, 4, 2, 5]\n", - "[1, 2, 3, 4, 5]\n" - ] - } - ], - "prompt_number": 59 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Since it returns a copy, `sorted()` can be used with strings." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print sorted('python')" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "['h', 'n', 'o', 'p', 't', 'y']\n" - ] - } - ], - "prompt_number": 60 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`sorted()` can also be used with dictionaries (it returns a sorted list of the dictionary *keys*)." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print sorted(attribute_values_cap_type) # returns a list of sorted keys (but not values) in the dictionary" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "['b', 'c', 'f', 'k', 's', 'x']\n" - ] - } - ], - "prompt_number": 61 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "However, we can use the sorted *keys* to access the *values* of a dictionary." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "for attribute_value_abbrev in sorted(attribute_values_cap_type):\n", - " print attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "b = bell\n", - "c = conical\n", - "f = flat\n", - "k = knobbed\n", - "s = sunken\n", - "x = convex\n" - ] - } - ], - "prompt_number": 62 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "An optional [keyword argument](http://docs.python.org/2/tutorial/controlflow.html#keyword-arguments), `reverse`, can be used to reverse the order of the sorted list returned by the function. The default value of this optional parameter is `False`; to get non-default behavior, we must specify the name and value of the argument: `reverse=True`. 
" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print sorted([3, 1, 4, 2, 5], reverse=True)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "[5, 4, 3, 2, 1]\n" - ] - } - ], - "prompt_number": 63 - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print sorted(attribute_values_cap_type, reverse=True) " - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "['x', 's', 'k', 'f', 'c', 'b']\n" - ] - } - ], - "prompt_number": 64 - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "attribute = 'cap-shape'\n", - "attribute_value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)\n", - "\n", - "print 'Counts for each value of', attribute, ':'\n", - "for value in sorted(attribute_value_counts):\n", - " print value, ':', attribute_value_counts[value]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Counts for each value of cap-shape :\n", - "b : 300\n", - "c : 4\n", - "f : 2432\n", - "k : 36\n", - "s : 32\n", - "x : 2840\n" - ] - } - ], - "prompt_number": 65 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Sorting a dictionary by values" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We often want to sort a dictionary by its *values* rather than its *keys*. \n", - "\n", - "For example, when we printed out the counts of the attribute values for `cap-shape` above, the counts appeared in an ascending alphabetic order of their attribute names. 
It is often more helpful to show the attribute value counts in descending order of the counts (which are the values in that dictionary).\n", - "\n", - "There are a [variety of ways to sort a dictionary by values](http://writeonly.wordpress.com/2008/08/30/sorting-dictionaries-by-value-in-python-improved/), but the approach described in [PEP-256](http://legacy.python.org/dev/peps/pep-0265/) is generally considered the most efficient.\n", - "\n", - "In order to understand the components used in this approach, we will revisit and elaborate on a few concepts involving *dictionaries*, *iterators* and *modules*." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [`dict.items()`](http://docs.python.org/2/library/stdtypes.html#dict.items) method returns an unordered list of `(key, value)` tuples in `dict`." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "attribute_values_cap_type.items()" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "metadata": {}, - "output_type": "pyout", - "prompt_number": 66, - "text": [ - "[('c', 'conical'),\n", - " ('b', 'bell'),\n", - " ('f', 'flat'),\n", - " ('k', 'knobbed'),\n", - " ('s', 'sunken'),\n", - " ('x', 'convex')]" - ] - } - ], - "prompt_number": 66 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A related method, [`dict.iteritems()`](http://docs.python.org/2/library/stdtypes.html#dict.iteritems), returns an [`iterator`](http://docs.python.org/2/library/stdtypes.html#iterator-types) - a callable object that returns the *next* item in a sequence each time it is referenced (e.g., during each iteration of a for loop), which can be more efficient than generating *all* the items in the sequence before any are used. This is similar to the distinction between `xrange()` and `range()` described above." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "attribute_values_cap_type.iteritems()" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "metadata": {}, - "output_type": "pyout", - "prompt_number": 67, - "text": [ - "" - ] - } - ], - "prompt_number": 67 - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "for key, value in attribute_values_cap_type.iteritems():\n", - " print key, value" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "c conical\n", - "b bell\n", - "f flat\n", - "k knobbed\n", - "s sunken\n", - "x convex\n" - ] - } - ], - "prompt_number": 68 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The Python [`operator`](http://docs.python.org/2/library/operator.html) module contains a number of functions that perform object comparisons, logical operations, mathematical operations, sequence operations, and abstract type tests.\n", - "\n", - "To facilitate sorting a dictionary by values, we will use the [`operator.itemgetter(item)`](http://docs.python.org/2/library/operator.html#operator.itemgetter) function that can be used to retrieve an indexed value (`item`) in a tuple (such as a `(key, value)` pair returned by `[iter]items()`).\n", - "\n", - "We can use `operator.itemgetter(1)`) to reference the value - the 2nd item in each `(key, value)` tuple, (at zero-based index position 1) - rather than the key - the first item in each `(key, value)` tuple (at index position 0).\n", - "\n", - "We will use the optional keyword argument `key` in [`sorted(iterable[, cmp[, key[, reverse]]])`](http://docs.python.org/2/library/functions.html#sorted) to specify a *sorting* key that is not the same as the `dict` key (the `dict` key is the default *sorting* key)" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "import operator\n", - "\n", - "sorted(attribute_values_cap_type.iteritems(), 
key=operator.itemgetter(1))" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "metadata": {}, - "output_type": "pyout", - "prompt_number": 69, - "text": [ - "[('b', 'bell'),\n", - " ('c', 'conical'),\n", - " ('x', 'convex'),\n", - " ('f', 'flat'),\n", - " ('k', 'knobbed'),\n", - " ('s', 'sunken')]" - ] - } - ], - "prompt_number": 69 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can now sort the counts of attribute values in descending frequency of occurrence, and print them out using tuple unpacking." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "attribute = 'cap-shape'\n", - "value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)\n", - "\n", - "print 'Counts for each value of', attribute, ':'\n", - "for value, count in sorted(value_counts.iteritems(), key=operator.itemgetter(1), reverse=True):\n", - " print value, ':', count" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Counts for each value of cap-shape :\n", - "x : 2840\n", - "f : 2432\n", - "b : 300\n", - "k : 36\n", - "s : 32\n", - "c : 4\n" - ] - } - ], - "prompt_number": 70 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Exercise 5: define print_all_attribute_value_counts()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Define a function, `print_all_attribute_value_counts(instances, attribute_names)`, that prints each attribute name in `attribute_names`, and then for each attribute value, prints the value abbreviation, the count of occurrences of that value and the proportion of instances that have that attribute value.\n", - "\n", - "You may find it helpful to use [fancier output formatting](http://docs.python.org/2/tutorial/inputoutput.html#fancier-output-formatting). 
More details can be found in the Python documentation on [format string syntax](http://docs.python.org/2/library/string.html#format-string-syntax).\n", - "\n", - "Examples of the `str.format()` function used in conjunction with print statements is shown below, followed by sample output of the `simple_ml` version of `print_all_attribute_value_counts()` (which uses similar formatting, but without hard-coded values)." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print 'Output of a sample line using str.format():'\n", - "print 'class:', # comma at end keeps cursor on the same line for subsequent print statements\n", - "print '{} = {} ({:5.3f}),'.format('e', 3488, 3488 / 5644.0),\n", - "print '{} = {} ({:5.3f}),'.format('p', 2156, 2156 / 5644.0),\n", - "print # a print statement with no arguments will advance the cursor to the beginning of the next line\n", - "print 'End of sample line'" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Output of a sample line using str.format():\n", - "class: e = 3488 (0.618), p = 2156 (0.382),\n", - "End of sample line\n" - ] - } - ], - "prompt_number": 71 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Define your version of `print_all_attribute_value_counts(instances, attribute_names)` below, deleting the `simple_ml.` module specification when you are ready to test your function." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "# your function definition goes here\n", - "\n", - "print '\\nCounts for all attributes and values:\\n'\n", - "simple_ml.print_all_attribute_value_counts(clean_instances, attribute_names)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "\n", - "Counts for all attributes and values:\n", - "\n", - "class: e = 3488 (0.618), p = 2156 (0.382),\n", - "cap-shape: x = 2840 (0.503), f = 2432 (0.431), b = 300 (0.053), k = 36 (0.006), s = 32 (0.006), c = 4 (0.001),\n", - "cap-surface: y = 2220 (0.393), f = 2160 (0.383), s = 1260 (0.223), g = 4 (0.001),\n", - "cap-color: g = 1696 (0.300), n = 1164 (0.206), y = 1056 (0.187), w = 880 (0.156), e = 588 (0.104), b = 120 (0.021), p = 96 (0.017), c = 44 (0.008),\n", - "bruises?: t = 3184 (0.564), f = 2460 (0.436),\n", - "odor: n = 2776 (0.492), f = 1584 (0.281), a = 400 (0.071), l = 400 (0.071), p = 256 (0.045), c = 192 (0.034), m = 36 (0.006),\n", - "gill-attachment: f = 5626 (0.997), a = 18 (0.003),\n", - "gill-spacing: c = 4620 (0.819), w = 1024 (0.181),\n", - "gill-size: b = 4940 (0.875), n = 704 (0.125),\n", - "gill-color: p = 1384 (0.245), n = 984 (0.174), w = 966 (0.171), h = 720 (0.128), g = 656 (0.116), u = 480 (0.085), k = 408 (0.072), r = 24 (0.004), y = 22 (0.004),\n", - "stalk-shape: t = 2880 (0.510), e = 2764 (0.490),\n", - "stalk-root: b = 3776 (0.669), e = 1120 (0.198), c = 556 (0.099), r = 192 (0.034),\n", - "stalk-surface-above-ring: s = 3736 (0.662), k = 1332 (0.236), f = 552 (0.098), y = 24 (0.004),\n", - "stalk-surface-below-ring: s = 3544 (0.628), k = 1296 (0.230), f = 552 (0.098), y = 252 (0.045),\n", - "stalk-color-above-ring: w = 3136 (0.556), p = 1008 (0.179), g = 576 (0.102), n = 448 (0.079), b = 432 (0.077), c = 36 (0.006), y = 8 (0.001),\n", - "stalk-color-below-ring: w = 3088 (0.547), p = 1008 (0.179), g = 576 (0.102), n = 496 (0.088), b = 432 
(0.077), c = 36 (0.006), y = 8 (0.001),\n", - "veil-type: p = 5644 (1.000),\n", - "veil-color: w = 5636 (0.999), y = 8 (0.001),\n", - "ring-number: o = 5488 (0.972), t = 120 (0.021), n = 36 (0.006),\n", - "ring-type: p = 3488 (0.618), l = 1296 (0.230), e = 824 (0.146), n = 36 (0.006),\n", - "spore-print-color: n = 1920 (0.340), k = 1872 (0.332), h = 1584 (0.281), w = 148 (0.026), r = 72 (0.013), u = 48 (0.009),\n", - "population: v = 2160 (0.383), y = 1688 (0.299), s = 1104 (0.196), a = 384 (0.068), n = 256 (0.045), c = 52 (0.009),\n", - "habitat: d = 2492 (0.442), g = 1860 (0.330), p = 568 (0.101), u = 368 (0.065), m = 292 (0.052), l = 64 (0.011),\n" - ] - } - ], - "prompt_number": 72 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Navigation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notebooks in this primer:\n", - "\n", - "* [1. Introduction](1_Introduction.ipynb)\n", - "* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", - "* **3. Python: Basic Concepts** (*you are here*)\n", - "* [4. Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", - "* [5. 
Next Steps](5_Next_Steps.ipynb)" - ] - } - ], - "metadata": {} + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.10" } - ] -} \ No newline at end of file + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/4_Python_Simple_Decision_Tree.ipynb b/4_Python_Simple_Decision_Tree.ipynb index ff161d7..e54dd0a 100644 --- a/4_Python_Simple_Decision_Tree.ipynb +++ b/4_Python_Simple_Decision_Tree.ipynb @@ -1,1859 +1,1466 @@ { + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python for Data Science" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Joe McCarthy](http://interrelativity.com/joe), \n", + "*Data Scientist*, [Indeed](http://www.indeed.com/)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from IPython.display import display, Image, HTML" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Navigation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notebooks in this primer:\n", + "\n", + "* [1. Introduction](1_Introduction.ipynb)\n", + "* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", + "* [3. Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", + "* **4. Using Python to Build and Use a Simple Decision Tree Classifier** (*you are here*)\n", + "* [5. Next Steps](5_Next_Steps.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. 
Using Python to Build and Use a Simple Decision Tree Classifier" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Decision Trees" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Wikipedia offers the following description of a [decision tree](https://en.wikipedia.org/wiki/Decision_tree) (with italics added to emphasize terms that will be elaborated below):\n", + "\n", + "> A decision tree is a flowchart-like structure in which each *internal node* represents a *test* of an *attribute*, each branch represents an *outcome* of that test and each *leaf node* represents *class label* (a decision taken after testing all attributes in the path from the root to the leaf). Each path from the root to a leaf can also be represented as a classification rule.\n", + "\n", + "*\\[Decision trees can also be used for regression, wherein the goal is to predict a continuous value rather than a class label, but we will focus here solely on their use for classification.\\]*\n", + "\n", + "The image below depicts a decision tree created from the UCI mushroom dataset that appears on [Andy G's blog post about Decision Tree Learning](http://gieseanw.wordpress.com/2012/03/03/decision-tree-learning/), where \n", + "\n", + "* a white box represents an *internal node* (and the label represents the *attribute* being tested)\n", + "* a blue box represents an attribute value (an *outcome* of the *test* of that attribute)\n", + "* a green box represents a *leaf node* with a *class label* of *edible*\n", + "* a red box represents a *leaf node* with a *class label* of *poisonous*\n", + "\n", + "\n", + "\n", + "It is important to note that the UCI mushroom dataset consists entirely of [categorical variables](https://en.wikipedia.org/wiki/Categorical_variable), i.e., every variable (or *attribute*) has an enumerated set of possible values. Many datasets include numeric variables that can take on `int` or `float` values. 
Tests for such variables typically use comparison operators, e.g., $age < 65$ or $36,250 < adjusted\\_gross\\_income <= 87,850$. *[Aside: Python supports boolean expressions containing multiple comparison operators, such as the expression comparing adjusted_gross_income in the preceding example.]*\n", + "\n", + "Our simple decision tree will only accommodate categorical variables. We will closely follow a version of the [decision tree learning algorithm implementation](http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3) offered by Chris Roach.\n", + "\n", + "Our goal in the following sections is to use Python to\n", + "\n", + "* ***create*** a simple decision tree using a set of *training* instances\n", + "* ***classify*** (predict class labels) for a set of *test* instances using a simple decision tree\n", + "* ***evaluate*** the performance of a simple decision tree on classifying a set of test instances\n", + "\n", + "First, we will explore some concepts and algorithms used in building and using decision trees." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Entropy" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When building a supervised classification model, the frequency distribution of attribute values is a potentially important factor in determining the relative importance of each attribute at various stages in the model building process.\n", + "\n", + "In data modeling, we can use frequency distributions to compute ***entropy***, a measure of disorder (impurity) in a set.\n", + "\n", + "We compute the entropy by multiplying the proportion of instances with each class label by the log of that proportion, and then taking the negative sum of those terms.\n", + "\n", + "More precisely, for a 2-class (binary) classification task:\n", + "\n", + "$entropy(S) = - p_1 log_2 (p_1) - p_2 log_2 (p_2)$\n", + "\n", + "where $p_i$ is the proportion (relative frequency) of class *i* within the set *S*.\n", + "\n", + "From the output above, we know that the proportion of `clean_instances` that are labeled `'e'` (class `edible`) in the UCI dataset is $3488 \\div 5644 = 0.618$, and the proportion labeled `'p'` (class `poisonous`) is $2156 \\div 5644 = 0.382$.\n", + "\n", + "After importing the Python [`math`](http://docs.python.org/2/library/math.html) module, we can use the [`math.log(x[, base])`](http://docs.python.org/2/library/math.html#math.log) function in computing the entropy of the `clean_instances` of the UCI mushroom data set as follows."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "import math\n", + "\n", + "entropy = \\\n", + " - (3488 / 5644) * math.log(3488 / 5644, 2) \\\n", + " - (2156 / 5644) * math.log(2156 / 5644, 2)\n", + "print(entropy)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 6: define `entropy()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `entropy(instances)`, that computes the entropy of `instances`. You may assume the class label is in position 0; we will later see how to specify default parameter values in function definitions.\n", + "\n", + "[Note: the class label in many data files is the *last* rather than the *first* item on each line.]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# your function definition here\n", + "\n", + "# delete 'simple_ml.' 
in the function call below to test your function\n", + "print(simple_ml.entropy(clean_instances))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Information Gain" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Informally, a decision tree is constructed from a set of instances using a recursive algorithm that \n", + "\n", + "* selects the *best* attribute \n", + "* splits the set into subsets based on the values of that attribute (each subset is composed of instances from the original set that have the same value for that attribute)\n", + "* repeats the process on each of these subsets until a stopping condition is met (e.g., a subset has no instances or has instances which all have the same class label)\n", + "\n", + "Entropy is a metric that can be used in selecting the best attribute for each split: the best attribute is the one resulting in the *largest decrease in entropy* for a set of instances. [Note: other metrics can be used for determining the best attribute]\n", + "\n", + "*Information gain* measures the decrease in entropy that results from splitting a set of instances based on an attribute.\n", + "\n", + "$IG(S, a) = entropy(S) - [p(s_1) × entropy(s_1) + p(s_2) × entropy(s_2) ... + p(s_n) × entropy(s_n)]$\n", + "\n", + "Where \n", + "* $n$ is the number of distinct values of attribute $a$\n", + "* $s_i$ is the subset of $S$ where all instances have the $i$th value of $a$\n", + "* $p(s_i)$ is the proportion of instances in $S$ that have the $i$th value of $a$\n", + "\n", + "We'll use the definition of `information_gain()` in `simple_ml` to print the information gain for each of the attributes in the mushroom dataset ... before asking you to write your own definition of the function." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('Information gain for different attributes:', end='\\n\\n')\n", + "for i in range(1, len(attribute_names)):\n", + " print('{:5.3f} {:2} {}'.format(\n", + " simple_ml.information_gain(clean_instances, i), i, attribute_names[i]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can sort the attributes in decreasing order of information gain, which shows that `odor` is the best attribute for the first split in a decision tree that models the instances in this dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('Information gain for different attributes:', end='\\n\\n')\n", + "sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i)\n", + " for i in range(1, len(attribute_names))], \n", + " reverse=True)\n", + "for gain, i in sorted_information_gain_indexes:\n", + " print('{:5.3f} {:2} {}'.format(gain, i, attribute_names[i]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*\\[The following variation does not use a list comprehension\\]*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('Information gain for different attributes:', end='\\n\\n')\n", + "\n", + "information_gain_values = []\n", + "for i in range(1, len(attribute_names)):\n", + " information_gain_values.append((simple_ml.information_gain(clean_instances, i), i))\n", + " \n", + "sorted_information_gain_indexes = sorted(information_gain_values, \n", + " reverse=True)\n", + "for gain, i in sorted_information_gain_indexes:\n", + " print('{:5.3f} {:2} {}'.format(gain, i, attribute_names[i]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 
Exercise 7: define `information_gain()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `information_gain(instances, i)`, that returns the information gain achieved by selecting the `i`th attribute to split `instances`. It should exhibit the same behavior as the `simple_ml` version of the function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# your definition of information_gain(instances, i) here\n", + "\n", + "# delete 'simple_ml.' in the function call below to test your function\n", + "sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) \n", + " for i in range(1, len(attribute_names))], \n", + " reverse=True)\n", + "\n", + "print('Information gain for different attributes:', end='\\n\\n')\n", + "for gain, i in sorted_information_gain_indexes:\n", + " print('{:5.3f} {:2} {}'.format(gain, i, attribute_names[i]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Building a Simple Decision Tree" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will implement a modified version of the [ID3](https://en.wikipedia.org/wiki/ID3_algorithm) algorithm for building a simple decision tree.\n", + "\n", + " ID3 (Examples, Target_Attribute, Candidate_Attributes)\n", + " Create a Root node for the tree\n", + " If all examples have the same value of the Target_Attribute, \n", + " Return the single-node tree Root with label = that value \n", + " If the list of Candidate_Attributes is empty,\n", + " Return the single node tree Root,\n", + " with label = most common value of Target_Attribute in the examples.\n", + " Otherwise Begin\n", + " A ← The Attribute that best classifies examples (most information gain)\n", + " Decision Tree attribute for Root = A.\n", + " For each possible value, v_i, of A,\n", + " Add a new tree branch below Root, 
corresponding to the test A = v_i.\n", + " Let Examples(v_i) be the subset of examples that have the value v_i for A\n", + " If Examples(v_i) is empty,\n", + " Below this new branch add a leaf node \n", + " with label = most common target value in the examples\n", + " Else \n", + " Below this new branch add the subtree \n", + " ID3 (Examples(v_i), Target_Attribute, Attributes – {A})\n", + " End\n", + " Return Root" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*\\[**Note:** the algorithm above is *recursive*, i.e., there is a recursive call to `ID3` within the definition of `ID3`. Covering recursion is beyond the scope of this primer, but there are a number of other resources on [using recursion in Python](https://www.google.com/search?q=python+recursion). Familiarity with recursion will be important for understanding both the tree construction and classification functions below.\\]*\n", + "\n", + "In building a decision tree, we will need to split the instances based on the index of the *best* attribute, i.e., the attribute that offers the *highest information gain*. We will use separate utility functions to handle these subtasks. To simplify the functions, we will rely exclusively on attribute *indexes* rather than attribute *names*.\n", + "\n", + "First, we will define a function, **`split_instances(instances, attribute_index)`**, to split a set of instances based on any attribute. This function will return a dictionary where each *key* is a distinct value of the specified `attribute_index`, and the *value* of each key is a list representing the subset of `instances` that have that `attribute_index` value.\n", + "\n", + "We will use a [**`defaultdict`**](http://docs.python.org/2/library/collections.html#defaultdict-objects), a specialized dictionary class in the [**`collections`**](http://docs.python.org/2/library/collections.html) module, which automatically creates an appropriate default value for a new key. 
For example, a `defaultdict(int)` automatically initializes a new dictionary entry to 0 (zero); a `defaultdict(list)` automatically initializes a new dictionary entry to the empty list (`[]`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from collections import defaultdict\n", + "\n", + "def split_instances(instances, attribute_index):\n", + " '''Returns a dictionary that splits a list of instances \n", + " according to their values of a specified attribute index\n", + " \n", + " Each key of the dictionary is a distinct value of the attribute at attribute_index,\n", + " and the value for each key is a list representing \n", + " the subset of instances that have that value for the attribute\n", + " '''\n", + " partitions = defaultdict(list)\n", + " for instance in instances:\n", + " partitions[instance[attribute_index]].append(instance)\n", + " return partitions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To test the function, we will partition the `clean_instances` based on the `odor` attribute (index position 5) and print out the size (number of instances) in each partition rather than the lists of instances in each partition." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "partitions = split_instances(clean_instances, 5)\n", + "print([(partition, len(partitions[partition])) for partition in partitions])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we can split instances based on a particular attribute, we would like to be able to choose the *best* attribute with which to split the instances, where *best* is defined as the attribute that provides the greatest information gain if instances were split based on that attribute. 
We will want to restrict the candidate attributes so that we don't bother trying to split on an attribute that was used higher up in the decision tree (or use the target attribute as a candidate)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 8: define `choose_best_attribute_index()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `choose_best_attribute_index(instances, candidate_attribute_indexes)`, that returns the index in the list of `candidate_attribute_indexes` that provides the highest information gain if `instances` are split based on that attribute index." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# your function here\n", + "\n", + "# delete 'simple_ml.' in the function call below to test your function\n", + "print('Best attribute index:', \n", + " simple_ml.choose_best_attribute_index(clean_instances, range(1, len(attribute_names))))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A leaf node in a decision tree represents the most frequently occurring - or majority - class value for that path through the tree. We will need a function that determines the majority value for the class index among a set of instances. One way to do this is to use the [`Counter`](https://docs.python.org/2/library/collections.html#counter-objects) class introduced above." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "class_counts = Counter([instance[0] for instance in clean_instances])\n", + "print('class_counts: {}\\n most_common(1): {}\\n most_common(1)[0][0]: {}'.format(\n", + " class_counts, # the Counter object\n", + " class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count\n", + " class_counts.most_common(1)[0][0])) # the most common value (1st element in that tuple)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*\\[The following variation does not use a list comprehension\\]*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "class_counts = Counter() # create an empty counter\n", + "for instance in clean_instances:\n", + " class_counts[instance[0]] += 1\n", + " \n", + "print ('class_counts: {}\\n most_common(1): {}\\n most_common(1)[0][0]: {}'.format(\n", + " class_counts,\n", + " class_counts.most_common(1), \n", + " class_counts.most_common(1)[0][0]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is often useful to compute the number of unique values and/or the total number of values in a `Counter`.\n", + "\n", + "The number of unique values is simply the number of dictionary entries.\n", + "\n", + "The total number of values can be computed by taking the [**`sum()`**](https://docs.python.org/2/library/functions.html#sum) of all the counts (the *value* of each *key: value* pair ... or *key, value* tuple, if we use `Counter().most_common()`)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "print('Number of unique values: {}'.format(len(class_counts)))\n", + "print('Total number of values: {}'.format(sum([v \n", + "                                               for k, v in class_counts.most_common()])))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before putting all this together to define a decision tree construction function, we will cover a few additional aspects of Python used in that function." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Truth values in Python" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python offers a very flexible mechanism for the [testing of truth values](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#testing-for-truth-values): in an **if** condition, any null object, zero-valued numerical expression or empty container (string, list, dictionary or tuple) is interpreted as *False* (i.e., *not True*):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n", + "    print('\"{}\" is'.format(x), end=' ')\n", + "    if x:\n", + "        print(True)\n", + "    else:\n", + "        print(False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Sometimes, particularly with function parameters, it is helpful to differentiate `None` from empty lists and other data structures with a `False` truth value (one common use case is illustrated in `create_decision_tree()` below)."
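As a minimal sketch of that distinction (the helper name `describe_candidates` is hypothetical and not part of `simple_ml`), a function can treat a missing argument differently from an explicitly empty list by testing `is None` before testing truthiness:

```python
def describe_candidates(candidates=None):
    '''Hypothetical helper: reports how a None default differs from an empty list.'''
    if candidates is None:
        # no argument was supplied, so the caller wants the default behavior
        return 'no candidates supplied: use a default'
    if not candidates:
        # an empty list was explicitly supplied
        return 'empty list supplied: nothing left to try'
    return '{} candidate(s) supplied'.format(len(candidates))

print(describe_candidates())
print(describe_candidates([]))
print(describe_candidates([1, 2]))
```

Testing `if not candidates` first would lump the `None` and `[]` cases together, which is why the order of the two checks matters here.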
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n", + "    print('\"{} is None\" is'.format(x), end=' ')\n", + "    if x is None:\n", + "        print(True)\n", + "    else:\n", + "        print(False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Conditional expressions (ternary operators)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python also offers a [conditional expression (ternary operator)](http://docs.python.org/2/reference/expressions.html#conditional-expressions) that allows the functionality of an if/else statement that returns a value to be implemented as an expression. For example, the if/else statement in the code above could be implemented as a conditional expression as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n", + "    print('\"{}\" is {}'.format(x, True if x else False)) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### More on optional parameters in Python functions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Python function definitions can specify [default parameter values](http://docs.python.org/2/tutorial/controlflow.html#default-argument-values) indicating the value those parameters will have if no argument is explicitly provided when the function is called. 
Arguments can also be passed using [keyword parameters](http://docs.python.org/2/tutorial/controlflow.html#keyword-arguments) indicating which parameter will be assigned a specific argument value (which may or may not correspond to the order in which the parameters are defined).\n", + "\n", + "The [Python Tutorial page on default parameters](http://docs.python.org/2/tutorial/controlflow.html#default-argument-values) includes the following warning:\n", + "\n", + "> Important warning: The default value is evaluated only once. This makes a difference when the default is a mutable object such as a list, dictionary, or instances of most classes. \n", + "\n", + "Thus it is generally better to use the Python null object, `None`, rather than an empty `list` (`[]`), `dict` (`{}`) or other mutable data structure when specifying default parameter values for any of those data types." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def parameter_test(parameter1=None, parameter2=None):\n", + "    '''Prints the values of parameter1 and parameter2'''\n", + "    print('parameter1: {}; parameter2: {}'.format(parameter1, parameter2))\n", + "    \n", + "parameter_test() # no args are required\n", + "parameter_test(1) # if any args are provided, 1st arg gets assigned to parameter1\n", + "parameter_test(1, 2) # 2nd arg gets assigned to parameter2\n", + "parameter_test(2) # remember: if only 1 arg, 1st arg gets assigned to parameter1\n", + "parameter_test(parameter2=2) # can use keyword to provide a value only for parameter2\n", + "parameter_test(parameter2=2, parameter1=1) # can use keywords for either arg, in either order" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 9: define `majority_value()`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function, `majority_value(instances, class_index)`, that returns the most frequently 
occurring value of `class_index` in `instances`. The `class_index` parameter should be optional, and have a default value of `0` (zero).\n", + "\n", + "Your function definition should support the use of optional arguments as used in the function calls below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# your definition of majority_value(instances) here\n", + "\n", + "# delete 'simple_ml.' in the function calls below to test your function\n", + "\n", + "print('Majority value of index {}: {}'.format(\n", + " 0, simple_ml.majority_value(clean_instances))) \n", + "\n", + "# although there is only one class_index for the dataset, \n", + "# we'll test the function by specifying other indexes using optional / keyword arguments\n", + "print('Majority value of index {}: {}'.format(\n", + " 1, simple_ml.majority_value(clean_instances, 1))) # using argument order\n", + "print('Majority value of index {}: {}'.format(\n", + " 2, simple_ml.majority_value(clean_instances, class_index=2))) # using keyword argument" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Building a Simple Decision Tree" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The recursive `create_decision_tree()` function below uses an optional parameter, `class_index`, which defaults to `0`. This is to accommodate other datasets in which the class label is the last element on each line (which would be most easily specified by using a `-1` value). Most data files in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html) have the class labels as either the first element or the last element.\n", + "\n", + "To show how the decision tree is being built, an optional `trace` parameter, when non-zero, will generate some trace information as the tree is constructed. 
The indentation level is incremented with each recursive call via the use of the conditional expression (ternary operator), `trace + 1 if trace else 0`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def create_decision_tree(instances, \n", + "                         candidate_attribute_indexes=None, \n", + "                         class_index=0, \n", + "                         default_class=None, \n", + "                         trace=0):\n", + "    '''Returns a new decision tree trained on a list of instances.\n", + "    \n", + "    The tree is constructed by recursively selecting and splitting instances based on \n", + "    the highest information_gain of the candidate_attribute_indexes.\n", + "    \n", + "    The class label is found in position class_index.\n", + "    \n", + "    The default_class is the majority value for the current node's parent in the tree.\n", + "    A positive (int) trace value will generate trace information \n", + "        with increasing levels of indentation.\n", + "    \n", + "    Derived from the simplified ID3 algorithm presented in Building Decision Trees in Python \n", + "        by Christopher Roach,\n", + "    http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3\n", + "    '''\n", + "    \n", + "    # if no candidate_attribute_indexes are provided, \n", + "    # assume that we will use all but the class_index\n", + "    # Note that None != [], \n", + "    # as an empty candidate_attribute_indexes list is a recursion stopping condition\n", + "    if candidate_attribute_indexes is None:\n", + "        candidate_attribute_indexes = [i \n", + "                                       for i in range(len(instances[0])) \n", + "                                       if i != class_index]\n", + "        # Note: do not use candidate_attribute_indexes.remove(class_index)\n", + "        # as this would destructively modify the argument,\n", + "        # causing problems during recursive calls\n", + "    \n", + "    class_labels_and_counts = Counter([instance[class_index] for instance in instances])\n", + "\n", + "    # If the dataset is empty or the candidate attributes list is empty, 
\n", + "    # return the default value\n", + "    if not instances or not candidate_attribute_indexes:\n", + "        if trace:\n", + "            print('{}Using default class {}'.format('< ' * trace, default_class))\n", + "        return default_class\n", + "    \n", + "    # If all the instances have the same class label, return that class label\n", + "    elif len(class_labels_and_counts) == 1:\n", + "        class_label = class_labels_and_counts.most_common(1)[0][0]\n", + "        if trace:\n", + "            print('{}All {} instances have label {}'.format(\n", + "                '< ' * trace, len(instances), class_label))\n", + "        return class_label\n", + "    else:\n", + "        default_class = simple_ml.majority_value(instances, class_index)\n", + "\n", + "        # Choose the next best attribute index to best classify the instances\n", + "        best_index = simple_ml.choose_best_attribute_index(\n", + "            instances, candidate_attribute_indexes, class_index) \n", + "        if trace:\n", + "            print('{}Creating tree node for attribute index {}'.format(\n", + "                '> ' * trace, best_index))\n", + "\n", + "        # Create a new decision tree node with the best attribute index \n", + "        # and an empty dictionary object (for now)\n", + "        tree = {best_index:{}}\n", + "\n", + "        # Create a new decision tree sub-node (branch) for each of the values \n", + "        # in the best attribute field\n", + "        partitions = simple_ml.split_instances(instances, best_index)\n", + "\n", + "        # Remove that attribute from the set of candidates for further splits\n", + "        remaining_candidate_attribute_indexes = [i \n", + "                                                 for i in candidate_attribute_indexes \n", + "                                                 if i != best_index]\n", + "        for attribute_value in partitions:\n", + "            if trace:\n", + "                print('{}Creating subtree for value {} ({}, {}, {}, {})'.format(\n", + "                    '> ' * trace,\n", + "                    attribute_value, \n", + "                    len(partitions[attribute_value]), \n", + "                    len(remaining_candidate_attribute_indexes), \n", + "                    class_index, \n", + "                    default_class))\n", + "            \n", + "            # Create a subtree for each value of the best attribute\n", + "            subtree = 
create_decision_tree(\n", + "                partitions[attribute_value],\n", + "                remaining_candidate_attribute_indexes,\n", + "                class_index,\n", + "                default_class,\n", + "                trace + 1 if trace else 0)\n", + "\n", + "            # Add the new subtree to the empty dictionary object \n", + "            # in the new tree/node we just created\n", + "            tree[best_index][attribute_value] = subtree\n", + "\n", + "    return tree\n", + "\n", + "# split instances into separate training and testing sets\n", + "training_instances = clean_instances[:-20]\n", + "test_instances = clean_instances[-20:]\n", + "tree = create_decision_tree(training_instances, trace=1) # remove trace=1 to turn off tracing\n", + "print(tree)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The structure of the tree shown above is rather difficult to discern from the normal printed representation of a dictionary.\n", + "\n", + "The Python [**`pprint`**](http://docs.python.org/2/library/pprint.html) module has a number of useful methods for pretty-printing or formatting objects in a more human readable way.\n", + "\n", + "The [**`pprint.pprint(object, stream=None, indent=1, width=80, depth=None)`**](http://docs.python.org/2/library/pprint.html#pprint.pprint) method will print `object` to a `stream` (a default value of `None` will dictate the use of [sys.stdout](http://docs.python.org/2/library/sys.html#sys.stdout), the same destination as `print` function output), using `indent` spaces to differentiate nesting levels, using up to a maximum `width` columns and up to a maximum nesting level `depth` (`None` indicating no maximum)."
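A small sketch of how the `width` and `depth` parameters affect the output, using a made-up nested dictionary (`toy_tree` is an illustrative stand-in, not the tree built above):

```python
from pprint import pprint, pformat

# a small, made-up nested dictionary shaped like a decision tree
toy_tree = {5: {'a': 'e', 'n': {20: {'k': 'e', 'w': {21: {'c': 'p', 'v': 'e'}}}}}}

pprint(toy_tree, width=40)   # a narrower width forces more line wrapping
pprint(toy_tree, depth=2)    # nesting levels beyond depth 2 are abbreviated as ...

# pformat() returns the formatted text as a string instead of printing it
shallow = pformat(toy_tree, depth=2)
print(len(shallow) < len(pformat(toy_tree)))
```

Limiting `depth` can be handy for getting a quick look at the top few splits of a large tree without printing every branch.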
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from pprint import pprint\n", + "\n", + "pprint(tree)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Classifying Instances with a Simple Decision Tree" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Usually, when we construct a decision tree based on a set of *training* instances, we do so with the intent of using that tree to classify a set of one or more *test* instances.\n", + "\n", + "We will define a function, **`classify(tree, instance, default_class=None)`**, to use a decision `tree` to classify a single `instance`, where an optional `default_class` can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.\n", + "\n", + "We will use a design pattern in which we will use a series of `if` statements, each of which returns a value if the condition is true, rather than a nested series of `if`, `elif` and/or `else` clauses, as it helps constrain the levels of indentation in the function." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def classify(tree, instance, default_class=None):\n", + "    '''Returns a classification label for instance, given a decision tree'''\n", + "    if not tree: # if the node is empty, return the default class\n", + "        return default_class\n", + "    if not isinstance(tree, dict): # if the node is a leaf, return its class label\n", + "        return tree\n", + "    attribute_index = list(tree.keys())[0] # using list(dict.keys()) for Python 3 compatibility\n", + "    attribute_values = list(tree.values())[0]\n", + "    instance_attribute_value = instance[attribute_index]\n", + "    if instance_attribute_value not in attribute_values: # this value was not in training data\n", + "        return default_class\n", + "    # recursively traverse the subtree (branch) associated with instance_attribute_value\n", + "    return classify(attribute_values[instance_attribute_value], instance, default_class)\n", + "\n", + "for instance in test_instances:\n", + "    predicted_label = classify(tree, instance)\n", + "    actual_label = instance[0]\n", + "    print('predicted: {}; actual: {}'.format(predicted_label, actual_label))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Evaluating the Accuracy of a Simple Decision Tree" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is often helpful to evaluate the performance of a model using a dataset not used in the training of that model. 
In the simple example shown above, we used all but the last 20 instances to train a simple decision tree, then classified those last 20 instances using the tree.\n", + "\n", + "The advantage of this training/test split is that visual inspection of the classifications (sometimes called *predictions*) is relatively straightforward, revealing that all 20 instances were correctly classified.\n", + "\n", + "There are a variety of metrics that can be used to evaluate the performance of a model. [Scikit Learn's Model Evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html) library provides an overview and implementation of several possible metrics. For now, we'll simply measure the *accuracy* of a model, i.e., the percentage of test instances that are correctly classified (*true positives* and *true negatives*).\n", + "\n", + "The accuracy of the model above, given the set of 20 test instances, is 100% (20/20).\n", + "\n", + "The function below calculates the classification accuracy of a `tree` over a set of `test_instances` (with an optional `class_index` parameter indicating the position of the class label in each instance)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def classification_accuracy(tree, test_instances, class_index=0, default_class=None):\n", + "    '''Returns the accuracy of classifying test_instances with tree, \n", + "    where the class label is in position class_index'''\n", + "    num_correct = 0\n", + "    for i in range(len(test_instances)):\n", + "        prediction = classify(tree, test_instances[i], default_class)\n", + "        actual_value = test_instances[i][class_index]\n", + "        if prediction == actual_value:\n", + "            num_correct += 1\n", + "    return num_correct / len(test_instances)\n", + "\n", + "print(classification_accuracy(tree, test_instances))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In addition to showing the percentage of correctly classified instances, it may be helpful to return the actual counts of correctly and incorrectly classified instances, e.g., if we want to compile a total count of correctly and incorrectly classified instances over a collection of test instances.\n", + "\n", + "In order to do so, we'll use the [**`zip([iterable, ...])`**](http://docs.python.org/2.7/library/functions.html#zip) function, which combines 2 or more sequences or iterables; in Python 2 the function returns a list of tuples (in Python 3, an iterator over tuples, which can be wrapped in `list()` to see its contents), where the *i*th tuple contains the *i*th element from each of the argument sequences or iterables."
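As a small illustration of how `zip()` can pair predicted and actual labels (the label lists here are made up), the sketch below counts the positions where two sequences agree. Wrapping `zip()` in `list()` makes the pairs visible under Python 3, where `zip()` is lazy:

```python
predicted = ['e', 'p', 'e', 'e']   # made-up predicted labels
actual = ['e', 'p', 'p', 'e']      # made-up actual labels

# list() forces the pairs to be built so that they can be printed
pairs = list(zip(predicted, actual))
print(pairs)

# count the positions where the two sequences agree
num_matches = sum(1 for p, a in zip(predicted, actual) if p == a)
print(num_matches)
```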
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "list(zip([0, 1, 2], ['a', 'b', 'c'])) # list() is needed to display the pairs in Python 3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use [list comprehensions](http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions), the `Counter` class and the `zip()` function to modify `classification_accuracy()` so that it returns a packed tuple with \n", + "\n", + "* the percentage of instances correctly classified\n", + "* the number of correctly classified instances\n", + "* the number of incorrectly classified instances\n", + "\n", + "We'll also modify the function to use `instances` rather than `test_instances`, as we sometimes want to be able to evaluate the accuracy of a model when tested on the training instances used to create it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def classification_accuracy(tree, instances, class_index=0, default_class=None):\n", + "    '''Returns the accuracy, number correct and number incorrect \n", + "    when classifying instances with tree, \n", + "    where the class label is in position class_index'''\n", + "    predicted_labels = [classify(tree, instance, default_class) \n", + "                        for instance in instances]\n", + "    actual_labels = [x[class_index] \n", + "                     for x in instances]\n", + "    counts = Counter([x == y \n", + "                      for x, y in zip(predicted_labels, actual_labels)])\n", + "    return counts[True] / len(instances), counts[True], counts[False]\n", + "\n", + "print(classification_accuracy(tree, test_instances))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We sometimes want to partition the instances into subsets of equal sizes to measure performance. 
One metric this partitioning allows us to compute is a [learning curve](https://en.wikipedia.org/wiki/Learning_curve), i.e., assess how well the model performs based on the size of its training set. Another use of these partitions (aka *folds*) would be to conduct an [*n-fold cross validation*](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) evaluation.\n", + "\n", + "The following function, **`partition_instances(instances, num_partitions)`**, partitions a set of `instances` into `num_partitions` relatively equally sized subsets.\n", + "\n", + "We'll use this as yet another opportunity to demonstrate the power of using list comprehensions, this time, to condense the use of nested `for` loops." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "def partition_instances(instances, num_partitions):\n", + "    '''Returns a list of relatively equally sized disjoint sublists (partitions) \n", + "    of the list of instances'''\n", + "    return [[instances[j] \n", + "             for j in range(i, len(instances), num_partitions)]\n", + "            for i in range(num_partitions)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before testing this function on the 5644 `clean_instances` from the UCI mushroom dataset, we'll create a small number of simplified instances to verify that the function has the desired behavior."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "instance_length = 3\n", + "num_instances = 5\n", + "\n", + "simplified_instances = [[j \n", + "                         for j in range(i, instance_length + i)] \n", + "                        for i in range(num_instances)]\n", + "\n", + "print('Instances:', simplified_instances)\n", + "partitions = partition_instances(simplified_instances, 2)\n", + "print('Partitions:', partitions)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*\\[The following variations do not use list comprehensions\\]*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def partition_instances(instances, num_partitions):\n", + "    '''Returns a list of relatively equally sized disjoint sublists (partitions) \n", + "    of the list of instances'''\n", + "    partitions = []\n", + "    for i in range(num_partitions):\n", + "        partition = []\n", + "        # iterate over instances starting at position i in increments of num_partitions\n", + "        for j in range(i, len(instances), num_partitions): \n", + "            partition.append(instances[j])\n", + "        partitions.append(partition)\n", + "    return partitions\n", + "\n", + "simplified_instances = []\n", + "for i in range(num_instances):\n", + "    new_instance = []\n", + "    for j in range(i, instance_length + i):\n", + "        new_instance.append(j)\n", + "    simplified_instances.append(new_instance)\n", + "\n", + "print('Instances:', simplified_instances)\n", + "partitions = partition_instances(simplified_instances, 2)\n", + "print('Partitions:', partitions)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The [**`enumerate(sequence, start=0)`**](http://docs.python.org/2.7/library/functions.html#enumerate) function creates an iterator that successively returns a count and the value of each element in a `sequence`, with the count beginning at `start`."
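A brief sketch of the optional `start` argument, using a made-up list of labels: the elements are always visited from the first one, and `start` only changes the number attached to each element.

```python
labels = ['a', 'b', 'c']

# the default count starts at 0; start=1 numbers the elements from 1 instead
for position, label in enumerate(labels, start=1):
    print(position, label)

# enumerate() is lazy, so wrap it in list() to see all the pairs at once
numbered = list(enumerate(labels, start=1))
print(numbered)
```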
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for i, x in enumerate(['a', 'b', 'c']):\n", + "    print(i, x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use `enumerate()` to facilitate slightly more rigorous testing of our `partition_instances` function on our `simplified_instances`.\n", + "\n", + "Note that since we are printing values rather than accumulating values, we will not use nested list comprehensions for this task." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for i in range(num_instances):\n", + "    print('\\n# partitions:', i)\n", + "    for j, partition in enumerate(partition_instances(simplified_instances, i)):\n", + "        print('partition {}: {}'.format(j, partition))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Returning our attention to the UCI mushroom dataset, the following will partition our `clean_instances` into 10 relatively equally sized disjoint subsets. 
We will use a list comprehension to print out the length of each partition" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "partitions = partition_instances(clean_instances, 10)\n", + "print([len(partition) for partition in partitions])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "*\\[The following variation does not use a list comprehension\\]*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "for partition in partitions:\n", + " print(len(partition), end=' ')\n", + "print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following shows the different trees that are constructed based on partition 0 (first 10th) of `clean_instances`, partitions 0 and 1 (first 2/10ths) of `clean_instances` and all `clean_instances`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "tree0 = create_decision_tree(partitions[0])\n", + "print('Tree trained with {} instances:'.format(len(partitions[0])))\n", + "pprint(tree0)\n", + "print()\n", + "\n", + "tree1 = create_decision_tree(partitions[0] + partitions[1])\n", + "print('Tree trained with {} instances:'.format(len(partitions[0] + partitions[1])))\n", + "pprint(tree1)\n", + "print()\n", + "\n", + "tree = create_decision_tree(clean_instances)\n", + "print('Tree trained with {} instances:'.format(len(clean_instances)))\n", + "pprint(tree)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The only difference between the first two trees - *tree0* and *tree1* - is that in the first tree, instances with no `odor` (attribute index `5` is `'n'`) and a `spore-print-color` of white (attribute `20` = `'w'`) are classified as `edible` (`'e'`). 
With additional training data in the 2nd partition, an additional distinction is made such that instances with no `odor`, a white `spore-print-color` and a clustered `population` (attribute `21` = `'c'`) are classified as `poisonous` (`'p'`), while all other instances with no `odor` and a white `spore-print-color` (and any other value for the `population` attribute) are classified as `edible` (`'e'`).\n", + "\n", + "Note that there is no difference between `tree1` and `tree` (the tree trained with all instances). This early convergence on an optimal model is uncommon on most datasets (outside the UCI repository)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Learning curves" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we can partition our instances into subsets, we can use these subsets to construct different-sized training sets in the process of computing a learning curve.\n", + "\n", + "We will start off with an initial training set consisting only of the first partition, and then progressively extend that training set by adding a new partition during each iteration of computing the learning curve.\n", + "\n", + "The [**`list.extend(L)`**](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) method enables us to extend `list` by appending all the items in another list, `L`, to the end of `list`." 
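One detail worth keeping in mind is that `extend()` modifies the list in place rather than returning a new list. The sketch below (with made-up values) contrasts it with list concatenation and with `append()`:

```python
x = [1, 2, 3]
alias = x            # a second name for the same list object

x.extend([4, 5])     # modifies the list in place; extend() itself returns None
print(alias)         # the alias sees the change too

y = x + [6]          # concatenation builds a new list and leaves x unchanged
print(x, y)

z = [1, 2, 3]
z.append([4, 5])     # append() adds its argument as a single (nested) element
print(z)
```

In-place extension is convenient when a training set grows over several iterations, but it also means any other reference to the same list will see the added items.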
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "x = [1, 2, 3]\n", + "x.extend([4, 5])\n", + "print(x)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now define the function, **`compute_learning_curve(instances, num_partitions=10)`**, which will take a list of `instances`, partition it into `num_partitions` relatively equally sized disjoint partitions, and then iteratively evaluate the accuracy of models trained with an incrementally increasing combination of instances in the first `num_partitions - 1` partitions then tested with instances in the last partition. That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on. \n", + "\n", + "The function will return a list of `num_partitions - 1` tuples representing the size of the training set and the accuracy of a tree trained with that set (and tested on the last partition). This will provide some indication of the relative impact of the size of the training set on model performance."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def compute_learning_curve(instances, num_partitions=10):\n", + "    '''Returns a list of training sizes and scores for incrementally increasing partitions.\n", + "\n", + "    The list contains 2-element tuples, each representing a training size and score.\n", + "    The i-th training size is the number of instances in partitions 0 through i.\n", + "    The i-th score is the accuracy of a tree trained with instances \n", + "    from partitions 0 through i\n", + "    and tested on instances from num_partitions - 1 (the last partition).'''\n", + "    \n", + "    partitions = partition_instances(instances, num_partitions)\n", + "    test_instances = partitions[-1][:]\n", + "    training_instances = []\n", + "    accuracy_list = []\n", + "    for i in range(0, num_partitions - 1):\n", + "        # for each iteration, the training set is composed of partitions 0 through i\n", + "        training_instances.extend(partitions[i][:])\n", + "        tree = create_decision_tree(training_instances)\n", + "        partition_accuracy = classification_accuracy(tree, test_instances)\n", + "        accuracy_list.append((len(training_instances), partition_accuracy))\n", + "    return accuracy_list\n", + "\n", + "accuracy_list = compute_learning_curve(clean_instances)\n", + "print(accuracy_list)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The UCI mushroom dataset is a particularly clean and simple data set, enabling quick convergence on an optimal decision tree for classifying new instances using relatively few training instances. \n", + "\n", + "We can use a larger number of smaller partitions to see a little more variation in accuracy performance."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "accuracy_list = compute_learning_curve(clean_instances, 100)\n", + "print(accuracy_list[:10])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Object-Oriented Programming: Defining a Python Class to Encapsulate a Simple Decision Tree" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The simple decision tree defined above uses a Python dictionary for its representation. One can imagine using other data structures, and/or extending the decision tree to support confidence estimates, numeric features and other capabilities that are often included in more fully functional implementations. To support future extensibility, and hide the details of the representation from the user, it would be helpful to have a user-defined class for simple decision trees.\n", + "\n", + "Python is an [object-oriented programming](https://en.wikipedia.org/wiki/Object-oriented_programming) language, offering simple syntax and semantics for defining classes and instantiating objects of those classes. *[It is assumed that the reader is already familiar with the concepts of object-oriented programming]*\n", + "\n", + "A Python [class](http://docs.python.org/2/tutorial/classes.html) starts with the keyword **`class`** followed by a class name (identifier), a colon ('`:`'), and then any number of statements, which typically take the form of assignment statements for class or instance variables and/or function definitions for class methods. All statements are indented to reflect their inclusion in the class definition.\n", + "\n", + "The members - methods, class variables and instance variables - of a class are accessed by prepending `self.` to each reference. Class methods always include `self` as the first parameter. \n", + "\n", + "All class members in Python are *public* (accessible outside the class). 
There is no mechanism for *private* class members, but identifiers with leading double underscores (*\\_\\_member_identifier*) are 'mangled' (translated into *\\_class_name\\__member_identifier*), and thus not directly accessible outside their class, and can be used to approximate private members by Python programmers. \n", + "\n", + "There is also no mechanism for *protected* identifiers - accessible only within a defining class and its subclasses - in the Python language, and so Python programmers have adopted the convention of using a single underscore (*\\_identifier*) at the start of any identifier that is intended to be protected (i.e., not to be accessed outside the class or its subclasses). \n", + "\n", + "Some Python programmers only use the single underscore prefixes and avoid double underscore prefixes due to unintended consequences that can arise when names are mangled. The following warning about single and double underscore prefixes is issued in [Code Like a Pythonista](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#naming):\n", + "\n", + "> try to avoid the __private form. I never use it. Trust me. If you use it, you WILL regret it later\n", + "\n", + "We will follow this advice and avoid using the double underscore prefix in user-defined member variables and methods.\n", + "\n", + "Python has a number of pre-defined [special method names](http://docs.python.org/2/reference/datamodel.html#special-method-names), all of which are denoted by leading and trailing double underscores. For example, the [**`object.__init__(self[, ...])`**](http://docs.python.org/2/reference/datamodel.html#object.__init__) method is used to specify instructions that should be executed whenever a new object of a class is instantiated. \n", + "\n", + "Note that other machine learning libraries may use different terminology for some of the functions we defined above. 
For example, in the [`sklearn.tree.DecisionTreeClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class (and in most `sklearn` classifier classes), the method for constructing a classifier is named [`fit()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) - since it \"fits\" a model to the data - and the method for classifying instances is named [`predict()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict) - since it is predicting the class label for an instance.\n", + "\n", + "In keeping with this common terminology, the code below defines a class, **`SimpleDecisionTree`**, with a single pseudo-protected member variable `_tree`, four public methods - `fit()`, `predict()`, `classification_accuracy()` and `pprint()` - and two pseudo-protected auxiliary methods - `_create_tree()` and `_predict()` - to augment the `fit()` and `predict()` methods, respectively. \n", + "\n", + "The `fit()` method plays the role of the `create_decision_tree()` function above, delegating the recursive construction to `_create_tree()`, with the inclusion of the `self` parameter (as it is now a method of the class rather than a function). The `predict()` method is a similarly modified version of the `classify()` function, with the added capability to predict the label of either a single instance or a list of instances. The `classification_accuracy()` method is similar to the function of the same name (with the addition of the `self` parameter). The `pprint()` method prints the tree in a human-readable format.\n", + "\n", + "Most comments and the use of the trace parameter have been removed to make the code more compact, but are included in the version found in **`simple_decision_tree.py`**."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "class SimpleDecisionTree:\n", + "\n", + " _tree = {} # class-level default; fit() rebinds self._tree on the instance\n", + "\n", + " def __init__(self):\n", + " # this is where we would initialize any parameters to the SimpleDecisionTree\n", + " pass\n", + " \n", + " def fit(self, \n", + " instances, \n", + " candidate_attribute_indexes=None,\n", + " target_attribute_index=0,\n", + " default_class=None):\n", + " if not candidate_attribute_indexes:\n", + " candidate_attribute_indexes = [i \n", + " for i in range(len(instances[0]))\n", + " if i != target_attribute_index]\n", + " self._tree = self._create_tree(instances,\n", + " candidate_attribute_indexes,\n", + " target_attribute_index,\n", + " default_class)\n", + " \n", + " def _create_tree(self,\n", + " instances,\n", + " candidate_attribute_indexes,\n", + " target_attribute_index=0,\n", + " default_class=None):\n", + " class_labels_and_counts = Counter([instance[target_attribute_index] \n", + " for instance in instances])\n", + " if not instances or not candidate_attribute_indexes:\n", + " return default_class\n", + " elif len(class_labels_and_counts) == 1:\n", + " class_label = class_labels_and_counts.most_common(1)[0][0]\n", + " return class_label\n", + " else:\n", + " default_class = simple_ml.majority_value(instances, target_attribute_index)\n", + " best_index = simple_ml.choose_best_attribute_index(instances, \n", + " candidate_attribute_indexes, \n", + " target_attribute_index)\n", + " tree = {best_index:{}}\n", + " partitions = simple_ml.split_instances(instances, best_index)\n", + " remaining_candidate_attribute_indexes = [i \n", + " for i in candidate_attribute_indexes \n", + " if i != best_index]\n", + " for attribute_value in partitions:\n", + " subtree = self._create_tree(\n", + " partitions[attribute_value],\n", + " 
remaining_candidate_attribute_indexes,\n", + " target_attribute_index,\n", + " default_class)\n", + " tree[best_index][attribute_value] = subtree\n", + " return tree\n", + " \n", + " def predict(self, instances, default_class=None):\n", + " if not isinstance(instances, list):\n", + " return self._predict(self._tree, instances, default_class)\n", + " else:\n", + " return [self._predict(self._tree, instance, default_class) \n", + " for instance in instances]\n", + " \n", + " def _predict(self, tree, instance, default_class=None):\n", + " if not tree:\n", + " return default_class\n", + " if not isinstance(tree, dict):\n", + " return tree\n", + " attribute_index = list(tree.keys())[0] # using list(dict.keys()) for Py3 compatibility\n", + " attribute_values = list(tree.values())[0]\n", + " instance_attribute_value = instance[attribute_index]\n", + " if instance_attribute_value not in attribute_values:\n", + " return default_class\n", + " return self._predict(attribute_values[instance_attribute_value],\n", + " instance,\n", + " default_class)\n", + " \n", + " def classification_accuracy(self, instances, default_class=None):\n", + " predicted_labels = self.predict(instances, default_class)\n", + " actual_labels = [x[0] for x in instances]\n", + " counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])\n", + " return counts[True] / len(instances), counts[True], counts[False]\n", + " \n", + " def pprint(self):\n", + " pprint(self._tree)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following statements instantiate a `SimpleDecisionTree`, fit it using all but the last 20 `clean_instances`, print out the tree using its `pprint()` method, and then use the `predict()` method to print the classification of the last 20 `clean_instances`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "simple_decision_tree = SimpleDecisionTree()\n", + "simple_decision_tree.fit(training_instances)\n", + "simple_decision_tree.pprint()\n", + "print()\n", + "\n", + "predicted_labels = simple_decision_tree.predict(test_instances)\n", + "actual_labels = [instance[0] for instance in test_instances]\n", + "for predicted_label, actual_label in zip(predicted_labels, actual_labels):\n", + " print('Model: {}; truth: {}'.format(predicted_label, actual_label))\n", + "print()\n", + "print('Classification accuracy:', simple_decision_tree.classification_accuracy(test_instances))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Navigation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notebooks in this primer:\n", + "\n", + "* [1. Introduction](1_Introduction.ipynb)\n", + "* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", + "* [3. Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", + "* **4. Using Python to Build and Use a Simple Decision Tree Classifier** (*you are here*)\n", + "* [5. 
Next Steps](5_Next_Steps.ipynb)" + ] + } + ], "metadata": { - "name": "", - "signature": "sha256:94a35a1cb8a7080e711ad0ee699aa3a6a4667dff06964e81f0a47944363bfa8b" - }, - "nbformat": 3, - "nbformat_minor": 0, - "worksheets": [ - { - "cells": [ - { - "cell_type": "heading", - "level": 1, - "metadata": {}, - "source": [ - "Python for Data Science" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Joe McCarthy](http://interrelativity.com/joe), \n", - "*Director, Analytics & Data Science*, [Atigeo, LLC](http://atigeo.com)" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "from IPython.display import display, Image, HTML" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 1 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Navigation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notebooks in this primer:\n", - "\n", - "* [1. Introduction](1_Introduction.ipynb)\n", - "* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", - "* [3. Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", - "* **4. Using Python to Build and Use a Simple Decision Tree Classifier** (*you are here*)\n", - "* [5. 
Next Steps](5_Next_Steps.ipynb)" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "# reconstitute relevent elements from the IPython environment active in previous notebook session\n", - "from collections import defaultdict\n", - "import simple_ml\n", - "clean_instances = simple_ml.load_instances('agaricus-lepiota.data', filter_missing_values=True)\n", - "attribute_names = simple_ml.load_attribute_names('agaricus-lepiota.attributes')\n", - "attribute_names_and_values = simple_ml.load_attribute_names_and_values('agaricus-lepiota.attributes')" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 2 - }, - { - "cell_type": "heading", - "level": 2, - "metadata": {}, - "source": [ - "4. Using Python to Build and Use a Simple Decision Tree Classifier" - ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Decision Trees" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wikipedia offers the following description of a [decision tree](https://en.wikipedia.org/wiki/Decision_tree) (with italics added to emphasize terms that will be elaborated below):\n", - "\n", - "> A decision tree is a flowchart-like structure in which each *internal node* represents a *test* of an *attribute*, each branch represents an *outcome* of that test and each *leaf node* represents *class label* (a decision taken after testing all attributes in the path from the root to the leaf). 
Each path from the root to a leaf can also be represented as a classification rule.\n", - "\n", - "The image below depicts a decision tree created from the UCI mushroom dataset that appears on [Andy G's blog post about Decision Tree Learning](http://gieseanw.wordpress.com/2012/03/03/decision-tree-learning/), where \n", - "\n", - "* a white box represents an *internal node* (and the label represents the *attribute* being tested)\n", - "* a blue box represents an attribute value (an *outcome* of the *test* of that attribute)\n", - "* a green box represents a *leaf node* with a *class label* of *edible*\n", - "* a red box represents a *leaf node* with a *class label* of *poisonous*\n", - "\n", - "\n", - "\n", - "It is important to note that the UCI mushroom dataset consists entirely of [categorical variables](https://en.wikipedia.org/wiki/Categorical_variable), i.e., every variable (or *attribute*) has an enumerated set of possible values. Many datasets include numeric variables that can take on int or float values. Tests for such variables typically use comparison operators, e.g., $age < 65$ or $36,250 < adjusted\\_gross\\_income <= 87,850$. *[Aside: Python supports boolean expressions containing multiple comparison operators, such as the expression comparing adjusted_gross_income in the preceding example.]*\n", - "\n", - "Our simple decision tree will only accommodate categorical variables. 
We will closely follow a version of the [decision tree learning algorithm implementation](http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3) offered by Chris Roach.\n", - "\n", - "Our goal in the following sections is to use Python to\n", - "\n", - "* *create* a simple decision tree based on a set of training instances\n", - "* *classify* (predict class labels for) for an instance using a simple decision tree\n", - "* *evaluate* the performance of the simple decision tree on classifying a set of test instances\n", - "\n", - "First, we will explore some concepts and algorithms used in building and using decision trees." - ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Entropy" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When building a supervised classification model, the frequency distribution of attribute values is a potentially important factor in determining the relative importance of each attribute at various stages in the model building process.\n", - "\n", - "In data modeling, we can use frequency distributions to compute ***entropy***, a measure of disorder (impurity) in a set.\n", - "\n", - "We compute the entropy of multiplying the proportion of instances with each class label by the log of that proportion, and then taking the negative sum of those terms.\n", - "\n", - "More precisely, for a 2-class (binary) classification task:\n", - "\n", - "$entropy(S) = - p_1 log_2 (p_1) - p_2 log_2 (p_2)$\n", - "\n", - "where $p_i$ is proportion (relative frequency) of class *i* within the set *S*.\n", - "\n", - "From the output above, we know that the proportion of `clean_instances` that are labeled `'e'` (class `edible`) in the UCI dataset is $3488 \\div 5644 = 0.618$, and the proportion labeled `'p'` (class `poisonous`) is $2156 \\div 5644 = 0.382$.\n", - "\n", - "After importing the Python [`math`](http://docs.python.org/2/library/math.html) module, we can use the 
[`math.log(x[, base])`](http://docs.python.org/2/library/math.html#math.log) function in computing the entropy of the `clean_instances` of the UCI mushroom data set as follows:" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "import math\n", - "entropy = - (3488 / 5644.0) * math.log(3488 / 5644.0, 2) - (2156 / 5644.0) * math.log(2156 / 5644.0, 2)\n", - "print entropy" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "0.959441337353\n" - ] - } - ], - "prompt_number": 3 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Exercise 6: define entropy()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Define a function, `entropy(instances)`, that computes the entropy of `instances`. You may assume the class label is in position 0; we will later see how to specify default parameter values in function definitions.\n", - "\n", - "[Note: the class label in many data files is the *last* rather than the *first* item on each line.]" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "# your function definition here\n", - "\n", - "# delete 'simple_ml.' 
below to test your function\n", - "print simple_ml.entropy(clean_instances)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "0.959441337353\n" - ] - } - ], - "prompt_number": 4 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Information Gain" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Informally, a decision tree is constructed using a recursive algorithm that \n", - "\n", - "* selects the *best* attribute \n", - "* splits the set into subsets based on the values of that attribute (each subset is composed of instances from the original set that have the same value for that attribute)\n", - "* repeats the process on each of these subsets until a stopping condition is met (e.g., a subset has no instances or has instances which all have the same class label)\n", - "\n", - "Entropy is a metric that can be used in selecting the best attribute for each split: the best attribute is the one resulting in the *largest decrease in entropy* for a set of instances. [Note: other metrics can be used for determining the best attribute]\n", - "\n", - "*Information gain* measures the decrease in entropy that results from splitting a set of instances based on an attribute.\n", - "\n", - "$IG(S, a) = entropy(S) - [p(s_1) \u00d7 entropy(s_1) + p(s_2) \u00d7 entropy(s_2) ... + p(s_n) \u00d7 entropy(s_n)]$\n", - "\n", - "Where $n$ is the number of distinct values of attribute $a$, and $s_i$ is the subset of $S$ where all instances have the $i$th value of $a$." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print 'Information gain for different attributes:\\n'\n", - "for i in range(1, len(attribute_names)):\n", - " print '{:5.3f} {:2} {}'.format(simple_ml.information_gain(clean_instances, i), i, attribute_names[i])" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Information gain for different attributes:\n", - "\n", - "0.017 1 cap-shape\n", - "0.005 2 cap-surface\n", - "0.195 3 cap-color\n", - "0.140 4 bruises?\n", - "0.860 5 odor\n", - "0.004 6 gill-attachment\n", - "0.058 7 gill-spacing\n", - "0.032 8 gill-size\n", - "0.213 9 gill-color\n", - "0.275 10 stalk-shape\n", - "0.097 11 stalk-root\n", - "0.425 12 stalk-surface-above-ring\n", - "0.409 13 stalk-surface-below-ring\n", - "0.306 14 stalk-color-above-ring\n", - "0.279 15 stalk-color-below-ring\n", - "0.000 16 veil-type" - ] - }, - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "\n", - "0.002 17 veil-color\n", - "0.012 18 ring-number\n", - "0.463 19 ring-type\n", - "0.583 20 spore-print-color\n", - "0.110 21 population\n", - "0.101 22 habitat\n" - ] - } - ], - "prompt_number": 5 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can sort the attributes based in decreasing order of information gain." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print 'Information gain for different attributes:\\n'\n", - "sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) for i in range(1, len(attribute_names))], \n", - " reverse=True)\n", - "print sorted_information_gain_indexes, '\\n'\n", - "\n", - "for gain, i in sorted_information_gain_indexes:\n", - " print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Information gain for different attributes:\n", - "\n", - "[(0.8596704358849709, 5), (0.5828694793608379, 20), (0.46290566555455265, 19), (0.42456477093655975, 12), (0.40865780788318695, 13), (0.3062989793570199, 14), (0.27891994708759504, 15), (0.2750355212178639, 10), (0.2127971869976022, 9), (0.19495343617580085, 3), (0.1400386042032834, 4), (0.1097880400299237, 21), (0.10067585994181227, 22), (0.09733858997769329, 11), (0.05836192763098613, 7), (0.03242975884332899, 8), (0.01740692300090696, 1), (0.01205967443646827, 18), (0.004572013423856602, 2), (0.0044397141315495325, 6), (0.0019702590992403124, 17), (0.0, 16)]" - ] - }, - { - "output_type": "stream", - "stream": "stdout", - "text": [ - " \n", - "\n", - "0.860 5 odor\n", - "0.583 20 spore-print-color\n", - "0.463 19 ring-type\n", - "0.425 12 stalk-surface-above-ring\n", - "0.409 13 stalk-surface-below-ring\n", - "0.306 14 stalk-color-above-ring\n", - "0.279 15 stalk-color-below-ring\n", - "0.275 10 stalk-shape\n", - "0.213 9 gill-color\n", - "0.195 3 cap-color\n", - "0.140 4 bruises?\n", - "0.110 21 population\n", - "0.101 22 habitat\n", - "0.097 11 stalk-root\n", - "0.058 7 gill-spacing\n", - "0.032 8 gill-size\n", - "0.017 1 cap-shape\n", - "0.012 18 ring-number\n", - "0.005 2 cap-surface\n", - "0.004 6 gill-attachment\n", - "0.002 17 veil-color\n", - "0.000 16 veil-type\n" - ] - } - ], - "prompt_number": 6 - 
}, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following variation does not use a list comprehension:" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "print 'Information gain for different attributes:\\n'\n", - "\n", - "information_gain_values = []\n", - "for i in range(1, len(attribute_names)):\n", - " information_gain_values.append((simple_ml.information_gain(clean_instances, i), i))\n", - " \n", - "sorted_information_gain_indexes = sorted(information_gain_values, \n", - " reverse=True)\n", - "print sorted_information_gain_indexes, '\\n'\n", - "\n", - "for gain, i in sorted_information_gain_indexes:\n", - " print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Information gain for different attributes:\n", - "\n", - "[(0.8596704358849709, 5), (0.5828694793608379, 20), (0.46290566555455265, 19), (0.42456477093655975, 12), (0.40865780788318695, 13), (0.3062989793570199, 14), (0.27891994708759504, 15), (0.2750355212178639, 10), (0.2127971869976022, 9), (0.19495343617580085, 3), (0.1400386042032834, 4), (0.1097880400299237, 21), (0.10067585994181227, 22), (0.09733858997769329, 11), (0.05836192763098613, 7), (0.03242975884332899, 8), (0.01740692300090696, 1), (0.01205967443646827, 18), (0.004572013423856602, 2), (0.0044397141315495325, 6), (0.0019702590992403124, 17), (0.0, 16)]" - ] - }, - { - "output_type": "stream", - "stream": "stdout", - "text": [ - " \n", - "\n", - "0.860 5 odor\n", - "0.583 20 spore-print-color\n", - "0.463 19 ring-type\n", - "0.425 12 stalk-surface-above-ring\n", - "0.409 13 stalk-surface-below-ring\n", - "0.306 14 stalk-color-above-ring\n", - "0.279 15 stalk-color-below-ring\n", - "0.275 10 stalk-shape\n", - "0.213 9 gill-color\n", - "0.195 3 cap-color\n", - "0.140 4 bruises?\n", - "0.110 21 population\n", - "0.101 22 habitat\n", - "0.097 11 stalk-root\n", 
- "0.058 7 gill-spacing\n", - "0.032 8 gill-size\n", - "0.017 1 cap-shape\n", - "0.012 18 ring-number\n", - "0.005 2 cap-surface\n", - "0.004 6 gill-attachment\n", - "0.002 17 veil-color\n", - "0.000 16 veil-type\n" - ] - } - ], - "prompt_number": 7 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Exercise 7: define information_gain()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Define a function, `information_gain(instances, i)`, that returns the information gain achieved by selecting the `i`th attribute to split `instances`. It should exhibit the same behavior as the `simple_ml` version of the function." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "# your definition of information_gain(instances, i) here\n", - "\n", - "# delete 'simple_ml.' below to test your function\n", - "sorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) for i in range(1, len(attribute_names))], \n", - " reverse=True)\n", - "\n", - "print 'Information gain for different attributes:\\n'\n", - "for gain, i in sorted_information_gain_indexes:\n", - " print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Information gain for different attributes:\n", - "\n", - "0.860 5 odor\n", - "0.583 20 spore-print-color\n", - "0.463 19 ring-type\n", - "0.425 12 stalk-surface-above-ring\n", - "0.409 13 stalk-surface-below-ring\n", - "0.306 14 stalk-color-above-ring\n", - "0.279 15 stalk-color-below-ring\n", - "0.275 10 stalk-shape\n", - "0.213 9 gill-color\n", - "0.195 3 cap-color\n", - "0.140 4 bruises?\n", - "0.110 21 population\n", - "0.101 22 habitat\n", - "0.097 11 stalk-root\n", - "0.058 7 gill-spacing\n", - "0.032 8 gill-size\n", - "0.017 1 cap-shape\n", - "0.012 18 ring-number\n", - "0.005 2 cap-surface\n", - "0.004 6 
gill-attachment\n", - "0.002 17 veil-color\n", - "0.000 16 veil-type\n" - ] - } - ], - "prompt_number": 8 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Building a Simple Decision Tree" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will implement a modified version of the [ID3](https://en.wikipedia.org/wiki/ID3_algorithm) algorithm for building a simple decision tree.\n", - "\n", - " ID3 (Examples, Target_Attribute, Attributes)\n", - " Create a root node for the tree\n", - " If all examples are positive, Return the single-node tree Root, with label = +.\n", - " If all examples are negative, Return the single-node tree Root, with label = -.\n", - " If number of predicting attributes is empty, then Return the single node tree Root,\n", - " with label = most common value of the target attribute in the examples.\n", - " Otherwise Begin\n", - " A \u2190 The Attribute that best classifies examples.\n", - " Decision Tree attribute for Root = A.\n", - " For each possible value, v_i, of A,\n", - " Add a new tree branch below Root, corresponding to the test A = v_i.\n", - " Let Examples(v_i) be the subset of examples that have the value v_i for A\n", - " If Examples(v_i) is empty\n", - " Then below this new branch add a leaf node with label = most common target value in the examples\n", - " Else below this new branch add the subtree ID3 (Examples(v_i), Target_Attribute, Attributes \u2013 {A})\n", - " End\n", - " Return Root" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In building a decision tree, we will need to split the instances based on the index of the *best* attribute, i.e., the attribute that offers the *highest information gain*. We will use separate utility functions to handle these subtasks. 
To simplify the functions, we will rely exclusively on attribute indexes rather than attribute names.\n", - "\n", - "***Note:*** the algorithm above is *recursive*, i.e., the there is a recursive call to `ID3` within the definition of `ID3`. Covering recursion is beyond the scope of this primer, but there are a number of other resources on [using recursion in Python](https://www.google.com/search?q=python+recursion). Familiarity with recursion will be important for understanding both the tree construction and classification functions below.\n", - "\n", - "First, we will define a function to split a set of instances based on any attribute. This function will return a dictionary where the *key* of each dictionary is a distinct value of the specified `attribute_index`, and the *value* of each dictionary is a list representing the subset of `instances` that have that attribute value." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def split_instances(instances, attribute_index):\n", - " '''Returns a list of dictionaries, splitting a list of instances according to their values of a specified attribute''\n", - " \n", - " The key of each dictionary is a distinct value of attribute_index,\n", - " and the value of each dictionary is a list representing the subset of instances that have that value for the attribute'''\n", - " partitions = defaultdict(list)\n", - " for instance in instances:\n", - " partitions[instance[attribute_index]].append(instance)\n", - " return partitions\n", - "\n", - "partitions = split_instances(clean_instances, 5)\n", - "print [(partition, len(partitions[partition])) for partition in partitions]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "[('a', 400), ('c', 192), ('f', 1584), ('m', 36), ('l', 400), ('n', 2776), ('p', 256)]\n" - ] - } - ], - "prompt_number": 9 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now 
that we can split instances based on a particular attribute, we would like to be able to choose the *best* attribute with which to split the instances, where *best* is defined as the attribute that provides the greatest information gain if instances were split based on that attribute. We will want to restrict the candidate attributes so that we don't bother trying to split on an attribute that was used higher up in the decision tree (or use the target attribute as a candidate)." - ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Exercise 8: define choose_best_attribute_index()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Define a function, `choose_best_attribute_index(instances, candidate_attribute_indexes)`, that returns the index in the list of `candidate_attribute_indexes` that provides the highest information gain if `instances` are split based on that attribute index." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "# your function here\n", - "\n", - "# delete 'simple_ml.' below to test your function:\n", - "print 'Best attribute index:', simple_ml.choose_best_attribute_index(clean_instances, range(1, len(attribute_names)))" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Best attribute index: " - ] - }, - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "5\n" - ] - } - ], - "prompt_number": 10 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A leaf node in a decision tree represents the most frequently occurring - or majority - class value for that path through the tree. 
We will need a function that determines the majority value for the class index among a set of instances.\n", - "\n", - "We earlier saw how the [`defaultdict`](http://docs.python.org/2/library/collections.html#collections.defaultdict) container in the [`collections`](http://docs.python.org/2/library/collections.html) module can be used to simplify the construction of a dictionary containing the counts of all attribute values for all attributes, by automatically setting the count for any attribute value to zero when the attribute value is first added to the dictionary.\n", - "\n", - "The `collections` module has another useful container, a [`Counter`](http://docs.python.org/2/library/collections.html#collections.Counter) class, that can further simplify the construction of a specialized dictionary of counts. When a `Counter` object is instantiated with a list of items, it returns a dictionary-like container in which the *keys* are the unique items in the list, and the *values* are the counts of each unique item in that list. \n", - "\n", - "This container has an additional method, [`most_common([n])`](http://docs.python.org/2/library/collections.html#collections.Counter.most_common), which returns a list of 2-element tuples representing the values and their associated counts for the most common `n` values; if `n` is omitted, the method returns all tuples.\n", - "\n", - "The following is an example of how we can use a `Counter` to represent the frequency of different class labels, and how we can identify the most frequent value and its count." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "from collections import Counter\n", - "\n", - "class_counts = Counter([instance[0] for instance in clean_instances])\n", - "print 'class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(\n", - " class_counts, # the Counter object\n", - " class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count\n", - " class_counts.most_common(1)[0][0]) # the most common value (1st element in that tuple)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "class_counts: Counter({'e': 3488, 'p': 2156}); most_common(1): [('e', 3488)], most_common(1)[0][0]: e\n" - ] - } - ], - "prompt_number": 11 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following variation does not use a list comprehension:" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "class_values = []\n", - "for instance in clean_instances:\n", - " class_values.append(instance[0])\n", - " \n", - "class_counts = Counter(class_values)\n", - "print 'class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(\n", - " class_counts, # the Counter object\n", - " class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count\n", - " class_counts.most_common(1)[0][0]) # the most common value (1st element in that tuple)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "class_counts: Counter({'e': 3488, 'p': 2156}); most_common(1): [('e', 3488)], most_common(1)[0][0]: e\n" - ] - } - ], - "prompt_number": 12 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Before putting all this together to define a decision tree construction function, it may be helpful to cover a few additional 
aspects of Python the function will utilize." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Python offers a very flexible mechanism for the [testing of truth values](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#testing-for-truth-values): in an **if** condition, any null object, zero-valued numerical expression or empty container (string, list, dictionary or tuple) is interpreted as *False* (i.e., *not True*):" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n", - " print '\"{}\" is'.format(x),\n", - " if x:\n", - " print True\n", - " else:\n", - " print False" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "\"False\" is False\n", - "\"None\" is False\n", - "\"0\" is False\n", - "\"0.0\" is False\n", - "\"\" is False\n", - "\"[]\" is False\n", - "\"{}\" is False\n", - "\"()\" is False\n" - ] - } - ], - "prompt_number": 13 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Python also offers a [conditional expression (ternary operator)](http://docs.python.org/2/reference/expressions.html#conditional-expressions) that allows the functionality of an if/else statement that returns a value to be implemented as an expression. 
For example, the if/else statement in the code above could be implemented as a conditional expression as follows:" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "for x in [False, None, 0, 0.0, \"\", [], {}, ()]:\n", - " print '\"{}\" is {}'.format(x, True if x else False) # using conditional expression as second argument to format()" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "\"False\" is False\n", - "\"None\" is False\n", - "\"0\" is False\n", - "\"0.0\" is False\n", - "\"\" is False\n", - "\"[]\" is False\n", - "\"{}\" is False\n", - "\"()\" is False\n" - ] - } - ], - "prompt_number": 14 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Python function definitions can specify [default parameter values](http://docs.python.org/2/tutorial/controlflow.html#default-argument-values) indicating the value those parameters will have if no argument is explicitly provided when the function is called. Arguments can also be passed using [keyword parameters](http://docs.python.org/2/tutorial/controlflow.html#keyword-arguments) indicting which parameter will be assigned a specific argument value (which may or may not correspond to the order in which the parameters are defined).\n", - "\n", - "The [Python Tutorial page on default parameters](http://docs.python.org/2/tutorial/controlflow.html#default-argument-values) includes the following warning:\n", - "\n", - "> Important warning: The default value is evaluated only once. This makes a difference when the default is a mutable object such as a list, dictionary, or instances of most classes. \n", - "\n", - "Thus it is generally better to use the Python null object, `None`, rather than an empty `list` (`[]`), `dict` (`{}`) or other mutable data structure when specifying default parameter values for any of those data types." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def parameter_test(parameter1=None, parameter2=None):\n", - " '''Prints the values of parameter1 and parameter2'''\n", - " print 'parameter1: {}; parameter2: {}'.format(parameter1, parameter2)\n", - " \n", - "parameter_test() # no args are required\n", - "parameter_test(1) # if any args are provided, 1st arg gets assigned to parameter1\n", - "parameter_test(1, 2) # 2nd arg gets assigned to parameter2\n", - "parameter_test(2) # remember: if only 1 arg, 1st arg gets assigned to arg1\n", - "parameter_test(parameter2=2) # can use keyword to [only] provide an explicit value for parameter2\n", - "parameter_test(parameter2=2, parameter1=1) # can use keywords for either arg, in either order" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "parameter1: None; parameter2: None\n", - "parameter1: 1; parameter2: None\n", - "parameter1: 1; parameter2: 2\n", - "parameter1: 2; parameter2: None\n", - "parameter1: None; parameter2: 2\n", - "parameter1: 1; parameter2: 2\n" - ] - } - ], - "prompt_number": 15 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Exercise 9: define majority_value()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Define a function, `majority_value(instances, class_index)`, that returns the most frequently occurring value of `class_index` in `instances`. The `class_index` parameter should be optional, and have a default value of `0` (zero)." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "# your definition of majority_value(instances) here\n", - "\n", - "# delete 'simple_ml.' 
below to test your function:\n", - "print 'Majority value of index {}: {}'.format(0, simple_ml.majority_value(clean_instances)) # note: relying on default parameter here\n", - "# although there is only one class_index for the dataset, we'll test it by providing non-default values\n", - "print 'Majority value of index {}: {}'.format(1, simple_ml.majority_value(clean_instances, 1)) # using an optional 2nd argument\n", - "print 'Majority value of index {}: {}'.format(2, simple_ml.majority_value(clean_instances, class_index=2)) # using a keyword" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Majority value of index 0: e\n", - "Majority value of index 1: x\n", - "Majority value of index 2: y\n" - ] - } - ], - "prompt_number": 16 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The recursive `create_decision_tree()` function below uses an optional parameter, `class_index`, which defaults to `0`. This is to accommodate other datasets in which the class label is the last element on each line (which would be most easily specified by using a `-1` value). Most data files in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.html) have the class labels as either the first element or the last element.\n", - "\n", - "To show how the decision tree is being built, an optional `trace` parameter, when non-zero, will generate some trace information as the tree is constructed. The indentation level is incremented with each recursive call via the use of the conditional expression (ternary operator), `trace + 1 if trace else 0`." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def create_decision_tree(instances, candidate_attribute_indexes=None, class_index=0, default_class=None, trace=0):\n", - " '''Returns a new decision tree trained on a list of instances.\n", - " \n", - " The tree is constructed by recursively selecting and splitting instances based on \n", - " the highest information_gain of the candidate_attribute_indexes.\n", - " \n", - " The class label is found in position class_index.\n", - " \n", - " The default_class is the majority value for the current node's parent in the tree.\n", - " A positive (int) trace value will generate trace information with increasing levels of indentation.\n", - " \n", - " Derived from the simplified ID3 algorithm presented in Building Decision Trees in Python by Christopher Roach,\n", - " http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=3'''\n", - " \n", - " # if no candidate_attribute_indexes are provided, assume that we will use all but the target_attribute_index\n", - " if candidate_attribute_indexes is None:\n", - " candidate_attribute_indexes = range(len(instances[0]))\n", - " candidate_attribute_indexes.remove(class_index)\n", - " \n", - " class_labels_and_counts = Counter([instance[class_index] for instance in instances])\n", - "\n", - " # If the dataset is empty or the candidate attributes list is empty, return the default value\n", - " if not instances or not candidate_attribute_indexes:\n", - " if trace:\n", - " print '{}Using default class {}'.format('< ' * trace, default_class)\n", - " return default_class\n", - " \n", - " # If all the instances have the same class label, return that class label\n", - " elif len(class_labels_and_counts) == 1:\n", - " class_label = class_labels_and_counts.most_common(1)[0][0]\n", - " if trace:\n", - " print '{}All {} instances have label {}'.format('< ' * trace, len(instances), class_label)\n", - " return class_label\n", - " else:\n", - " default_class 
= simple_ml.majority_value(instances, class_index)\n", - "\n", - " # Choose the next best attribute index to best classify the instances\n", - " best_index = simple_ml.choose_best_attribute_index(instances, candidate_attribute_indexes, class_index) \n", - " if trace:\n", - " print '{}Creating tree node for attribute index {}'.format('> ' * trace, best_index)\n", - "\n", - " # Create a new decision tree node with the best attribute index and an empty dictionary object (for now)\n", - " tree = {best_index:{}}\n", - "\n", - " # Create a new decision tree sub-node (branch) for each of the values in the best attribute field\n", - " partitions = simple_ml.split_instances(instances, best_index)\n", - "\n", - " # Remove that attribute from the set of candidates for further splits\n", - " remaining_candidate_attribute_indexes = [i for i in candidate_attribute_indexes if i != best_index]\n", - " for attribute_value in partitions:\n", - " if trace:\n", - " print '{}Creating subtree for value {} ({}, {}, {}, {})'.format(\n", - " '> ' * trace,\n", - " attribute_value, \n", - " len(partitions[attribute_value]), \n", - " len(remaining_candidate_attribute_indexes), \n", - " class_index, \n", - " default_class)\n", - " \n", - " # Create a subtree for each value of the the best attribute\n", - " subtree = create_decision_tree(\n", - " partitions[attribute_value],\n", - " remaining_candidate_attribute_indexes,\n", - " class_index,\n", - " default_class,\n", - " trace + 1 if trace else 0)\n", - "\n", - " # Add the new subtree to the empty dictionary object in the new tree/node we just created\n", - " tree[best_index][attribute_value] = subtree\n", - "\n", - " return tree\n", - "\n", - "# split instances into separate training and testing sets\n", - "training_instances = clean_instances[:-20]\n", - "testing_instances = clean_instances[-20:]\n", - "tree = create_decision_tree(training_instances, trace=1) # remove trace=1 to turn off tracing\n", - "print tree" - ], - "language": 
"python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "> Creating tree node for attribute index 5\n", - "> Creating subtree for value a (400, 21, 0, e)\n", - "< < All 400 instances have label e\n", - "> Creating subtree for value c (192, 21, 0, e)\n", - "< < All 192 instances have label p\n", - "> Creating subtree for value f (1584, 21, 0, e)\n", - "< < All 1584 instances have label p\n", - "> Creating subtree for value m (28, 21, 0, e)\n", - "< < All 28 instances have label p\n", - "> Creating subtree for value l (400, 21, 0, e)\n", - "< < All 400 instances have label e\n", - "> Creating subtree for value n (2764, 21, 0, e)\n", - "> > Creating tree node for attribute index 20\n", - "> > Creating subtree for value k (1296, 20, 0, e)\n", - "< < < All 1296 instances have label e\n", - "> > Creating subtree for value r (72, 20, 0, e)\n", - "< < < All 72 instances have label p\n", - "> > Creating subtree for value w (100, 20, 0, e)\n", - "> > > Creating tree node for attribute index 21\n", - "> > > Creating subtree for value y (24, 19, 0, e)\n", - "< < < < All 24 instances have label e\n", - "> > > Creating subtree for value c (16, 19, 0, e)\n", - "< < < < All 16 instances have label p\n", - "> > > Creating subtree for value v (60, 19, 0, e)\n", - "< < < < All 60 instances have label e\n", - "> > Creating subtree for value n (1296, 20, 0, e)\n", - "< < < All 1296 instances have label e" - ] - }, - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "\n", - "> Creating subtree for value p (256, 21, 0, e)\n", - "< < All 256 instances have label p\n", - "{5: {'a': 'e', 'c': 'p', 'f': 'p', 'm': 'p', 'l': 'e', 'n': {20: {'k': 'e', 'r': 'p', 'w': {21: {'y': 'e', 'c': 'p', 'v': 'e'}}, 'n': 'e'}}, 'p': 'p'}}\n" - ] - } - ], - "prompt_number": 17 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The structure of the tree shown above is rather difficult to discern from the normal printed 
representation of a dictionary.\n", - "\n", - "The Python [`pprint`](http://docs.python.org/2/library/pprint.html) module has a number of useful methods for pretty-printing or formatting objects in a more human readable way.\n", - "\n", - "The [`pprint.pprint(object, stream=None, indent=1, width=80, depth=None)`](http://docs.python.org/2/library/pprint.html#pprint.pprint) method will print `object` to a `stream` (a default value of `None` will dictate the use of [sys.stdout](http://docs.python.org/2/library/sys.html#sys.stdout), the same destination as `print` statement output), using `indent` spaces to differentiate nesting levels, using up to a maximum `width` columns and up to to a maximum nesting level `depth` (`None` indicating no maximum).\n", - "\n", - "We will use the a variation on the import statement that imports one or more functions into the current namespace:\n", - "\n", - " from pprint import pprint\n", - " \n", - "This will to enable us to use `pprint()` rather than having to use dotted notation, i.e., `pprint.pprint()`. \n", - "\n", - "Note that if we wanted to define our own `pprint()` function, we would be best only using\n", - "\n", - " import pprint\n", - " \n", - "so that we can still access the `pprint()` function in the `pprint` module (since defining `pprint()` in the current namespace would otherwise override the imported definition of the function)." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "from pprint import pprint\n", - "pprint(tree)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "{5: {'a': 'e',\n", - " 'c': 'p',\n", - " 'f': 'p',\n", - " 'l': 'e',\n", - " 'm': 'p',\n", - " 'n': {20: {'k': 'e',\n", - " 'n': 'e',\n", - " 'r': 'p',\n", - " 'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},\n", - " 'p': 'p'}}\n" - ] - } - ], - "prompt_number": 18 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Classifying Instances with a Simple Decision Tree" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Usually, when we construct a decision tree based on a set of *training* instances, we do so with the intent of using that tree to classify a set of one or more *testing* instances.\n", - "\n", - "We will define a function, `classify(tree, instance, default_class=None)`, to use a decision `tree` to classify a single `instance`, where an optional `default_class` can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.\n", - "\n", - "We will use a design pattern in which we will use a series of `if` statements, each of which returns a value if the condition is true, rather than a nested series of `if`, `elif` and/or `else` clauses, as it helps constrain the levels of indentation in the function." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def classify(tree, instance, default_class=None):\n", - " '''Returns a classification label for instance, given a decision tree'''\n", - " if not tree:\n", - " return default_class\n", - " if not isinstance(tree, dict): \n", - " return tree\n", - " attribute_index = tree.keys()[0]\n", - " attribute_values = tree.values()[0]\n", - " instance_attribute_value = instance[attribute_index]\n", - " if instance_attribute_value not in attribute_values:\n", - " return default_class\n", - " return classify(attribute_values[instance_attribute_value], instance, default_class)\n", - "\n", - "for instance in testing_instances:\n", - " predicted_label = classify(tree, instance)\n", - " actual_label = instance[0]\n", - " print 'predicted: {}; actual: {}'.format(predicted_label, actual_label)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "predicted: p; actual: p\n", - "predicted: p; actual: p\n", - "predicted: p; actual: p\n", - "predicted: e; actual: e\n", - "predicted: e; actual: e\n", - "predicted: p; actual: p\n", - "predicted: e; actual: e\n", - "predicted: e; actual: e\n", - "predicted: e; actual: e\n", - "predicted: p; actual: p\n", - "predicted: e; actual: e\n", - "predicted: e; actual: e\n", - "predicted: e; actual: e\n", - "predicted: p; actual: p\n", - "predicted: e; actual: e\n", - "predicted: e; actual: e\n", - "predicted: e; actual: e\n", - "predicted: e; actual: e\n", - "predicted: p; actual: p\n", - "predicted: p; actual: p\n" - ] - } - ], - "prompt_number": 19 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Evaluating the Accuracy of a Simple Decision Tree" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It is often helpful to evaluate the performance of a model using a dataset not used in the training of that model. 
In the simple example shown above, we used all but the last 20 instances to train a simple decision tree, then classified those last 20 instances using the tree.\n", - "\n", - "The advantage of this training/testing split is that visual inspection of the classifications (sometimes called *predictions*) is relatively straightforward, revealing that all 20 instances were correctly classified.\n", - "\n", - "There are a variety of metrics that can be used to evaluate the performance of a model. [Scikit Learn's Model Evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html) library provides an overview and implementation of several possible metrics. For now, we'll simply measure the *accuracy* of a model, i.e., the percentage of testing instances that are correctly classified (*true positives* and *true negatives*).\n", - "\n", - "The accuracy of the model above, given the set of 20 testing instances, is 100% (20/20).\n", - "\n", - "The function below calculates the classification accuracy of a `tree` over a set of `testing_instances` (with an optional `class_index` parameter indicating the position of the class label in each instance)." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def classification_accuracy(tree, testing_instances, class_index=0, default_class=None):\n", - " '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index'''\n", - " num_correct = 0\n", - " for i in xrange(len(testing_instances)):\n", - " prediction = classify(tree, testing_instances[i], default_class)\n", - " actual_value = testing_instances[i][class_index]\n", - " if prediction == actual_value:\n", - " num_correct += 1\n", - " return float(num_correct) / len(testing_instances)\n", - "\n", - "print classification_accuracy(tree, testing_instances)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "1.0\n" - ] - } - ], - "prompt_number": 20 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [`zip([iterable, ...])`](http://docs.python.org/2.7/library/functions.html#zip) function combines 2 or more sequences or iterables; the function returns a list of tuples, where the *i*th tuple contains the *i*th element from each of the argument sequences or iterables." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "zip([0, 1, 2], ['a', 'b', 'c'])" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "metadata": {}, - "output_type": "pyout", - "prompt_number": 21, - "text": [ - "[(0, 'a'), (1, 'b'), (2, 'c')]" - ] - } - ], - "prompt_number": 21 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can use [list comprehensions](http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions), the `Counter` class and the `zip()` function to modify `classification_accuracy()` so that it returns a packed tuple with \n", - "\n", - "* the number of correctly classified instances\n", - "* the number of incorrectly classified instances\n", - "* the percentage of instances correctly classified" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def classification_accuracy(tree, instances, class_index=0, default_class=None):\n", - " '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index'''\n", - " predicted_labels = [classify(tree, instance, default_class) for instance in instances]\n", - " actual_labels = [x[class_index] for x in instances]\n", - " counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])\n", - " return counts[True], counts[False], float(counts[True]) / len(instances)\n", - "\n", - "print classification_accuracy(tree, testing_instances)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "(20, 0, 1.0)\n" - ] - } - ], - "prompt_number": 22 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We sometimes want to partition the instances into subsets of equal sizes to measure performance. 
One metric this partitioning allows us to compute is a [learning curve](https://en.wikipedia.org/wiki/Learning_curve), i.e., assess how well the model performs based on the size of its training set. Another use of these partitions (aka *folds*) would be to conduct an [*n-fold cross validation*](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) evaluation.\n", - "\n", - "The following function, `partition_instances(instances, num_partitions)`, partitions a set of `instances` into `num_partitions` relatively equally sized subsets.\n", - "\n", - "We'll use this as yet another opportunity to demonstrate the power of using list comprehensions, this time, to condense the use of nested `for` loops." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def partition_instances(instances, num_partitions):\n", - " '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''\n", - " return [[instances[j] for j in xrange(i, len(instances), num_partitions)] for i in xrange(num_partitions)]" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 23 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Before testing this function on the 5644 `clean_instances` from the UCI mushroom dataset, let's create a small number of simplified instances to verify that the function has the desired behavior." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "instance_length = 3\n", - "num_instances = 5\n", - "\n", - "simplified_instances = [[j for j in xrange(i, instance_length + i)] for i in xrange(num_instances)]\n", - "\n", - "print 'Instances:', simplified_instances\n", - "partitions = partition_instances(simplified_instances, 2)\n", - "print 'Partitions:', partitions" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Instances: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]\n", - "Partitions: [[[0, 1, 2], [2, 3, 4], [4, 5, 6]], [[1, 2, 3], [3, 4, 5]]]\n" - ] - } - ], - "prompt_number": 24 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following variations do not use list comprehensions." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def partition_instances(instances, num_partitions):\n", - " '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''\n", - " partitions = []\n", - " for i in xrange(num_partitions):\n", - " partition = []\n", - " # iterate over instances starting at position i in increments of num_paritions\n", - " for j in xrange(i, len(instances), num_partitions): \n", - " partition.append(instances[j])\n", - " partitions.append(partition)\n", - " return partitions\n", - "\n", - "simplified_instances = []\n", - "for i in xrange(num_instances):\n", - " new_instance = []\n", - " for j in xrange(i, instance_length + i):\n", - " new_instance.append(j)\n", - " simplified_instances.append(new_instance)\n", - "\n", - "print 'Instances:', simplified_instances\n", - "partitions = partition_instances(simplified_instances, 2)\n", - "print 'Partitions:', partitions" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Instances: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], 
[4, 5, 6]]\n", - "Partitions: [[[0, 1, 2], [2, 3, 4], [4, 5, 6]], [[1, 2, 3], [3, 4, 5]]]\n" - ] - } - ], - "prompt_number": 25 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The [`enumerate(sequence, start=0)`](http://docs.python.org/2.7/library/functions.html#enumerate) function creates an iterator that successively returns the index and value of each element in a `sequence`, beginning at the `start` index." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "for i, x in enumerate(['a', 'b', 'c']):\n", - " print i, x" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "0 a\n", - "1 b\n", - "2 c\n" - ] - } - ], - "prompt_number": 26 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can use `enumerate()` to facilitate slightly more rigorous testing of our `partition_instances` function on our `simplified_instances`." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "for i in xrange(5):\n", - " print '\\n# partitions:', i\n", - " for j, partition in enumerate(partition_instances(simplified_instances, i)):\n", - " print 'partition {}: {}'.format(j, partition)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "\n", - "# partitions: 0\n", - "\n", - "# partitions: 1\n", - "partition 0: [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]\n", - "\n", - "# partitions: 2\n", - "partition 0: [[0, 1, 2], [2, 3, 4], [4, 5, 6]]\n", - "partition 1: [[1, 2, 3], [3, 4, 5]]\n", - "\n", - "# partitions: 3\n", - "partition 0: [[0, 1, 2], [3, 4, 5]]\n", - "partition 1: [[1, 2, 3], [4, 5, 6]]\n", - "partition 2: [[2, 3, 4]]\n", - "\n", - "# partitions: 4\n", - "partition 0: [[0, 1, 2], [4, 5, 6]]\n", - "partition 1: [[1, 2, 3]]\n", - "partition 2: [[2, 3, 4]]\n", - "partition 3: [[3, 4, 5]]\n" - ] - } - ], - "prompt_number": 27 - }, - 
{ - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Returning our attention to the UCI mushroom dataset, the following will partition our `clean_instances` into 10 relatively equally sized disjoint subsets. We will use a list comprehension to print out the length of each partition" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "partitions = partition_instances(clean_instances, 10)\n", - "print [len(partition) for partition in partitions]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "[565, 565, 565, 565, 564, 564, 564, 564, 564, 564]\n" - ] - } - ], - "prompt_number": 28 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following variation does not use a list comprehension." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "for partition in partitions:\n", - " print len(partition), # note the comma at the end\n", - "print" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "565 565 565 565 564 564 564 564 564 564\n" - ] - } - ], - "prompt_number": 29 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following shows the different trees that are constructed based on partition 0 (first 10th) of `clean_instances`, partitions 0 and 1 (first 2/10ths) of `clean_instances` and all `clean_instances`." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "tree0 = create_decision_tree(partitions[0])\n", - "print 'Tree trained with {} instances:'.format(len(partitions[0]))\n", - "pprint(tree0)\n", - "\n", - "tree1 = create_decision_tree(partitions[0] + partitions[1])\n", - "print '\\nTree trained with {} instances:'.format(len(partitions[0] + partitions[1]))\n", - "pprint(tree1)\n", - "\n", - "tree = create_decision_tree(clean_instances)\n", - "print '\\nTree trained with {} instances:'.format(len(clean_instances))\n", - "pprint(tree)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "Tree trained with 565 instances:\n", - "{5: {'a': 'e',\n", - " 'c': 'p',\n", - " 'f': 'p',\n", - " 'l': 'e',\n", - " 'm': 'p',\n", - " 'n': {20: {'k': 'e', 'n': 'e', 'r': 'p', 'w': 'e'}},\n", - " 'p': 'p'}}\n", - "\n", - "Tree trained with 1130 instances:\n", - "{5: {'a': 'e',\n", - " 'c': 'p',\n", - " 'f': 'p',\n", - " 'l': 'e',\n", - " 'm': 'p',\n", - " 'n': {20: {'k': 'e',\n", - " 'n': 'e',\n", - " 'r': 'p',\n", - " 'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},\n", - " 'p': 'p'}}\n", - "\n", - "Tree trained with 5644 instances:" - ] - }, - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "\n", - "{5: {'a': 'e',\n", - " 'c': 'p',\n", - " 'f': 'p',\n", - " 'l': 'e',\n", - " 'm': 'p',\n", - " 'n': {20: {'k': 'e',\n", - " 'n': 'e',\n", - " 'r': 'p',\n", - " 'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},\n", - " 'p': 'p'}}\n" - ] - } - ], - "prompt_number": 30 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The only difference between the first two trees - *tree0* and *tree1* - is that in the first tree, instances with no `odor` (attribute index `5` is `'n'`) and a `spore-print-color` of white (attribute `20` = `'w'`) are classified as `edible` (`'e'`). 
With additional training data in the 2nd partition, an additional distinction is made such that instances with no `odor`, a white `spore-print-color` and a clustered `population` (attribute `21` = `'c'`) are classified as `poisonous` (`'p'`), while all other instances with no `odor` and a white `spore-print-color` (and any other value for the `population` attribute) are classified as `edible` (`'e'`).\n", - "\n", - "Note that there is no difference between `tree1` and `tree` (the tree trained with all instances). This early convergence on an optimal model is uncommon on most datasets (outside the UCI repository)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that we can partition our instances into subsets, we can use these subsets to construct different-sized training sets in the process of computing a learning curve.\n", - "\n", - "We will start off with an initial training set consisting only of the first partition, and then progressively extend that training set by adding a new partition during each iteration of computing the learning curve.\n", - "\n", - "The [`list.extend(L)`](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) method enables us to extend `list` by appending all the items in another list, `L`, to the end of `list`." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "x = [1, 2, 3]\n", - "x.extend([4, 5])\n", - "print x" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "[1, 2, 3, 4, 5]\n" - ] - } - ], - "prompt_number": 31 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can now define the function, `compute_learning_curve(instances, num_partitions=10)`, that will take a list of `instances`, partition it into `num_partitions` relatively equally sized disjoint partitions, and then iteratively evaluate the accuracy of models trained with an incrementally increasing combination of instances in the first `num_partitions - 1` partitions then tested with instances in the last partition. That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on. \n", - "\n", - "The function will return a list of `num_partitions - 1` tuples representing the size of the training set and the accuracy of a tree trained with that set (and tested on the `num_partitions - 1` set). This will provide some indication of the relative impact of the size of the training set on model performance." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def compute_learning_curve(instances, num_partitions=10):\n", - " '''Returns a list of training sizes and scores for incrementally increasing partitions.\n", - "\n", - " The list contains 2-element tuples, each representing a training size and score.\n", - " The i-th training size is the number of instances in partitions 0 through num_partitions - 2.\n", - " The i-th score is the accuracy of a tree trained with instances \n", - " from partitions 0 through num_partitions - 2\n", - " and tested on instances from num_partitions - 1 (the last partition).'''\n", - " \n", - " partitions = partition_instances(instances, num_partitions)\n", - " testing_instances = partitions[-1][:]\n", - " training_instances = []\n", - " accuracy_list = []\n", - " for i in xrange(0, num_partitions - 1):\n", - " # for each iteration, the training set is composed of partitions 0 through i - 1\n", - " training_instances.extend(partitions[i][:])\n", - " tree = create_decision_tree(training_instances)\n", - " partition_accuracy = classification_accuracy(tree, testing_instances)\n", - " accuracy_list.append((len(training_instances), partition_accuracy))\n", - " return accuracy_list\n", - "\n", - "accuracy_list = compute_learning_curve(clean_instances)\n", - "print accuracy_list" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "[(565, (562, 2, 0.9964539007092199)), (1130, (564, 0, 1.0)), (1695, (564, 0, 1.0)), (2260, (564, 0, 1.0)), (2824, (564, 0, 1.0)), (3388, (564, 0, 1.0)), (3952, (564, 0, 1.0)), (4516, (564, 0, 1.0)), (5080, (564, 0, 1.0))]\n" - ] - } - ], - "prompt_number": 32 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Due to the quick convergence on an optimal decision tree for classifying the UCI mushroom dataset, we can use a larger number of smaller partitions to see a little more variation in acccuracy 
performance." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "accuracy_list = compute_learning_curve(clean_instances, 100)\n", - "print accuracy_list[:10]" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "[(57, (55, 1, 0.9821428571428571)), (114, (56, 0, 1.0)), (171, (55, 1, 0.9821428571428571)), (228, (56, 0, 1.0)), (285, (56, 0, 1.0)), (342, (56, 0, 1.0)), (399, (56, 0, 1.0)), (456, (56, 0, 1.0)), (513, (56, 0, 1.0)), (570, (56, 0, 1.0))]\n" - ] - } - ], - "prompt_number": 33 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Object-Oriented Programming: Defining a Python Class to Encapsulate a Simple Decision Tree" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The simple decision tree defined above uses a Python dictionary for its representation. One can imagine using other data structures, and/or extending the decision tree to support confidence estimates, numeric features and other capabilities that are often included in more fully functional implementations. To support future extensibility, and hide the details of the representation from the user, it would be helpful to have a user-defined class for simple decision trees.\n", - "\n", - "Python is an [object-oriented programming](https://en.wikipedia.org/wiki/Object-oriented_programming) language, offering simple syntax and semantics for defining classes and instantiating objects of those classes. *[It is assumed that the reader is already familiar with the concepts of object-oriented programming]*\n", - "\n", - "A Python [class](http://docs.python.org/2/tutorial/classes.html) starts with the keyword **`class`** followed by a class name (identifier), a colon ('`:`'), and then any number of statements, which typically take the form of assignment statements for class or instance variables and/or function definitions for class methods. 
All statements are indented to reflect their inclusion in the class definition.\n", - "\n", - "The members - methods, class variables and instance variables - of a class are accessed by prepending `self.` to each reference. Class methods always include `self` as the first parameter. \n", - "\n", - "All class members in Python are *public* (accessible outside the class). There is no mechanism for *private* class members, but identifiers with leading double underscores (*\\_\\_member_identifier*) are 'mangled' (translated into *\\_class_name\\__member_identifier*), and thus not directly accessible outside their class, and can be used to approximate private members by Python programmers. \n", - "\n", - "There is also no mechanism for *protected* identifiers - accessible only within a defining class and its subclasses - in the Python language, and so Python programmers have adopted the convention of using a single underscore (*\\_identifier*) at the start of any identifier that is intended to be protected (i.e., not to be accessed outside the class or its subclasses). \n", - "\n", - "Some Python programmers only use the single underscore prefixes and avoid double underscore prefixes due to unintended consequences that can arise when names are mangled. The following warning about single and double underscore prefixes is issued in [Code Like a Pythonista](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#naming):\n", - "\n", - "> try to avoid the __private form. I never use it. Trust me. If you use it, you WILL regret it later\n", - "\n", - "We will follow this advice and avoid using the double underscore prefix in user-defined member variables and methods.\n", - "\n", - "Python has a number of pre-defined [special method names](http://docs.python.org/2/reference/datamodel.html#special-method-names), all of which are denoted by leading and trailing double underscores. 
For example, the [`object.__init__(self[, ...])`](http://docs.python.org/2/reference/datamodel.html#object.__init__) method is used to specify instructions that should be executed whenever a new object of a class is instantiated. \n", - "\n", - "The code below defines a class, `SimpleDecisionTree`, with a single pseudo-protected member variable `_tree` and a pseudo-protected tree construction method `_create()`, two public methods - `classify()` and `pprint()` - and an initialization method that takes an optional list of training `instances` and a `target_attribute_index`. \n", - "\n", - "The `_create()` method is identical to the `create_decision_tree()` function above, with the inclusion of the `self` parameter (as it is now a class method). The `classify()` method is a similarly modified version of the `classify()` and `classification_accuracy()` functions above, with references to `tree` converted to `self._tree`. The `pprint()` method prints the tree in a human-readable format.\n", - "\n", - "Note that other machine learning libraries may use different terminology for the methods we've defined here. For example, in the [`sklearn.tree.DecisionTreeClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) class (and in most `sklearn` classifier classes), the method for constructing a classifier is named [`fit()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) - since it \"fits\" the data to a model - and the method for classifying instances is named [`predict()`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict) - since it is predicting the class label for an instance.\n", - "\n", - "Most comments and the use of the trace parameter have been removed to make the code more compact, but are included in the version found in `SimpleDecisionTree.py`." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "class SimpleDecisionTree:\n", - "\n", - " _tree = {} # this instance variable becomes accessible to class methods via self._tree\n", - "\n", - " def __init__(self, instances=None, target_attribute_index=0): # note the use of self as the first parameter\n", - " if instances:\n", - " self._tree = self._create(instances, range(1, len(instances[0])), target_attribute_index)\n", - " \n", - " def _create(self, instances, candidate_attribute_indexes, target_attribute_index=0, default_class=None):\n", - " class_labels_and_counts = Counter([instance[target_attribute_index] for instance in instances])\n", - " if not instances or not candidate_attribute_indexes:\n", - " return default_class\n", - " elif len(class_labels_and_counts) == 1:\n", - " class_label = class_labels_and_counts.most_common(1)[0][0]\n", - " return class_label\n", - " else:\n", - " default_class = simple_ml.majority_value(instances, target_attribute_index)\n", - " best_index = simple_ml.choose_best_attribute_index(instances, candidate_attribute_indexes, target_attribute_index)\n", - " tree = {best_index:{}}\n", - " partitions = simple_ml.split_instances(instances, best_index)\n", - " remaining_candidate_attribute_indexes = [i for i in candidate_attribute_indexes if i != best_index]\n", - " for attribute_value in partitions:\n", - " subtree = self._create(\n", - " partitions[attribute_value],\n", - " remaining_candidate_attribute_indexes,\n", - " target_attribute_index,\n", - " default_class)\n", - " tree[best_index][attribute_value] = subtree\n", - " return tree\n", - " \n", - " # calls the internal \"protected\" method to classify the instance given the _tree\n", - " def classify(self, instance, default_class=None):\n", - " return self._classify(self._tree, instance, default_class)\n", - " \n", - " # a method intended to be \"protected\" that can implement the recursive algorithm to classify an instance given a tree\n", - " def 
_classify(self, tree, instance, default_class=None):\n", - " if not tree:\n", - " return default_class\n", - " if not isinstance(tree, dict):\n", - " return tree\n", - " attribute_index = tree.keys()[0]\n", - " attribute_values = tree.values()[0]\n", - " instance_attribute_value = instance[attribute_index]\n", - " if instance_attribute_value not in attribute_values:\n", - " return default_class\n", - " return self._classify(attribute_values[instance_attribute_value], instance, default_class)\n", - " \n", - " def classification_accuracy(self, instances, default_class=None):\n", - " predicted_labels = [self.classify(instance, default_class) for instance in instances]\n", - " actual_labels = [x[0] for x in instances]\n", - " counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])\n", - " return counts[True], counts[False], float(counts[True]) / len(instances)\n", - " \n", - " def pprint(self):\n", - " pprint(self._tree)" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 34 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following statements instantiate a `SimpleDecisionTree`, using all but the last 20 `clean_instances`, prints out the tree using its `pprint()` method, and then uses the `classify()` method to print the classification of the last 20 `clean_instances`." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "simple_decision_tree = SimpleDecisionTree(training_instances)\n", - "simple_decision_tree.pprint()\n", - "print\n", - "for instance in testing_instances:\n", - " predicted_label = simple_decision_tree.classify(instance)\n", - " actual_label = instance[0]\n", - " print 'Model: {}; truth: {}'.format(predicted_label, actual_label)\n", - "print\n", - "print 'Classification accuracy:', simple_decision_tree.classification_accuracy(testing_instances)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "{5: {'a': 'e',\n", - " 'c': 'p',\n", - " 'f': 'p',\n", - " 'l': 'e',\n", - " 'm': 'p',\n", - " 'n': {20: {'k': 'e',\n", - " 'n': 'e',\n", - " 'r': 'p',\n", - " 'w': {21: {'c': 'p', 'v': 'e', 'y': 'e'}}}},\n", - " 'p': 'p'}}\n", - "\n", - "Model: p; truth: p\n", - "Model: p; truth: p\n", - "Model: p; truth: p\n", - "Model: e; truth: e\n", - "Model: e; truth: e\n", - "Model: p; truth: p\n", - "Model: e; truth: e\n", - "Model: e; truth: e\n", - "Model: e; truth: e\n", - "Model: p; truth: p\n", - "Model: e; truth: e\n", - "Model: e; truth: e\n", - "Model: e; truth: e\n", - "Model: p; truth: p\n", - "Model: e; truth: e\n", - "Model: e; truth: e\n", - "Model: e; truth: e\n", - "Model: e; truth: e\n", - "Model: p; truth: p\n", - "Model: p; truth: p\n", - "\n", - "Classification accuracy: (20, 0, 1.0)\n" - ] - } - ], - "prompt_number": 35 - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Navigation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notebooks in this primer:\n", - "\n", - "* [1. Introduction](1_Introduction.ipynb)\n", - "* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", - "* [3. Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", - "* **4. 
Using Python to Build and Use a Simple Decision Tree Classifier** (*you are here*)\n", - "* [5. Next Steps](5_Next_Steps.ipynb)" - ] - } - ], - "metadata": {} + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.10" } - ] -} \ No newline at end of file + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/5_Next_Steps.ipynb b/5_Next_Steps.ipynb index fe38bef..c9e65d3 100644 --- a/5_Next_Steps.ipynb +++ b/5_Next_Steps.ipynb @@ -1,94 +1,102 @@ { - "metadata": { - "name": "", - "signature": "sha256:84e68912df2865badec94cd3908a4bf864847c8755dba09fd202622c4cbb21b1" - }, - "nbformat": 3, - "nbformat_minor": 0, - "worksheets": [ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python for Data Science" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Joe McCarthy](http://interrelativity.com/joe), \n", + "*Director, Analytics & Data Science*, [Atigeo, LLC](http://atigeo.com)" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from IPython.display import display, Image, HTML" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Next steps" + ] + }, { - "cells": [ - { - "cell_type": "heading", - "level": 1, - "metadata": {}, - "source": [ - "Python for Data Science" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Joe McCarthy](http://interrelativity.com/joe), \n", - "*Director, Analytics & Data Science*, [Atigeo, LLC](http://atigeo.com)" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "from IPython.display import display, Image, HTML" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 1 - }, - { - "cell_type": "heading", - "level": 2, - "metadata": {}, - "source": [ - "5. Next steps" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "There are a variety of Python libraries - e.g., [Scikit-Learn](http://scikit-learn.org/) and [xPatterns](http://atigeo.com/technology/) - for building more full-featured decision trees and other types of models based on a variety of machine learning algorithms. Hopefully, this primer will have prepared you for learning how to use those libraries effectively.\n", - "\n", - "Many Python-based machine learning libraries use other external Python libraries such as [NumPy](http://www.numpy.org/), [SciPy](http://www.scipy.org/scipylib/), [Matplotlib](http://matplotlib.org/) and [pandas](http://pandas.pydata.org/). There are tutorials available for each of these libraries, including the following:\n", - "\n", - "* [Tentative NumPy Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial)\n", - "* [SciPy Tutorial](http://docs.scipy.org/doc/scipy/reference/tutorial/)\n", - "* [Matplotlib PyPlot Tutorial](http://matplotlib.org/1.3.1/users/pyplot_tutorial.html)\n", - "* [Pandas Tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) (especially [10 Minutes to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html))\n", - "\n", - "There are many machine learning or data science resources that may be useful to help you continue the journey. 
Here is a sampling:\n", - "\n", - "* [An introduction to machine learning with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)\n", - "* Olivier Grisel's Strata 2014 tutorial, [Parallel Machine Learning with scikit-learn and IPython](https://github.com/ogrisel/parallel_ml_tutorial)\n", - "* Kaggle's [Getting Started With Python For Data Science](http://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience)\n", - "* Coursera's [Introduction to Data Science](https://www.coursera.org/course/datasci)\n", - "\n", - "Please feel free to contact the author ([Joe McCarthy](mailto:joe@interrelativity.com?subject=Python for Data Science)) to suggest additional resources." - ] - }, - { - "cell_type": "heading", - "level": 3, - "metadata": {}, - "source": [ - "Navigation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notebooks in this primer:\n", - "\n", - "* [1. Introduction](1_Introduction.ipynb)\n", - "* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", - "* [3. Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", - "* [4. Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", - "* **5. Next Steps** (*you are here*)" - ] - } - ], - "metadata": {} + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are a variety of Python libraries - e.g., [Scikit-Learn](http://scikit-learn.org/) - for building more full-featured decision trees and other types of models based on a variety of machine learning algorithms. Hopefully, this primer will have prepared you for learning how to use those libraries effectively.\n", + "\n", + "Many Python-based machine learning libraries use other external Python libraries such as [NumPy](http://www.numpy.org/), [SciPy](http://www.scipy.org/scipylib/), [Matplotlib](http://matplotlib.org/) and [pandas](http://pandas.pydata.org/). 
There are tutorials available for each of these libraries, including the following:\n", + "\n", + "* [Tentative NumPy Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial)\n", + "* [SciPy Tutorial](http://docs.scipy.org/doc/scipy/reference/tutorial/)\n", + "* [Matplotlib PyPlot Tutorial](http://matplotlib.org/1.3.1/users/pyplot_tutorial.html)\n", + "* [Pandas Tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) (especially [10 Minutes to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html))\n", + "\n", + "There are many machine learning or data science resources that may be useful to help you continue the journey. Here is a sampling:\n", + "\n", + "* Scikit-learn's tutorial, [An introduction to machine learning with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)\n", + "* Kevin Markham's video series (on the Kaggle blog), [An introduction to machine learning with scikit-learn](http://blog.kaggle.com/2015/04/08/new-video-series-introduction-to-machine-learning-with-scikit-learn/)\n", + "* Kaggle's [Getting Started With Python For Data Science](http://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience)\n", + "* Coursera's [Introduction to Data Science](https://www.coursera.org/course/datasci)\n", + "* Olivier Grisel's Strata 2014 tutorial, [Parallel Machine Learning with scikit-learn and IPython](https://github.com/ogrisel/parallel_ml_tutorial)\n", + "\n", + "Please feel free to contact the author ([Joe McCarthy](mailto:joe@interrelativity.com?subject=Python for Data Science)) to suggest additional resources." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Navigation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notebooks in this primer:\n", + "\n", + "* [1. Introduction](1_Introduction.ipynb)\n", + "* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", + "* [3. 
Python: Basic Concepts](3_Python_Basic_Concepts.ipynb)\n", + "* [4. Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", + "* **5. Next Steps** (*you are here*)" + ] } - ] -} \ No newline at end of file + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.10" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/Python_for_Data_Science_all.html b/Python_for_Data_Science_all.html index d980b94..9262386 100644 --- a/Python_for_Data_Science_all.html +++ b/Python_for_Data_Science_all.html @@ -9,1546 +9,42 @@