TensorFlow Machine Learning Cookbook
Table of Contents
TensorFlow Machine Learning Cookbook
Credits
About the Author
About the Reviewer
www.PacktPub.com
eBooks, discount offers, and more
Why Subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Sections
Getting ready
How to do it…
How it works…
There's more…
See also
Conventions
Reader feedback
Customer support
Downloading the example code
Piracy
Questions
1. Getting Started with TensorFlow
Introduction
How TensorFlow Works
Getting ready
How to do it…
How it works…
See also
Declaring Tensors
Getting ready
How to do it…
How it works…
There's more…
Using Placeholders and Variables
Getting ready
How to do it…
How it works…
There's more…
Working with Matrices
Getting ready
How to do it…
How it works…
Declaring Operations
Getting ready
How to do it…
How it works…
There's more…
Implementing Activation Functions
Getting ready
How to do it…
How it works…
There's more…
Working with Data Sources
Getting ready
How to do it…
How it works…
See also
Additional Resources
Getting ready
How to do it…
See also
2. The TensorFlow Way
Introduction
Operations in a Computational Graph
Getting ready
How to do it…
How it works…
Layering Nested Operations
Getting ready
How to do it…
How it works…
There's more…
Working with Multiple Layers
Getting ready
How to do it…
How it works…
Implementing Loss Functions
Getting ready
How to do it…
How it works…
There's more…
Implementing Back Propagation
Getting ready
How to do it…
How it works…
There's more…
See also
Working with Batch and Stochastic Training
Getting ready
How to do it…
How it works…
There's more…
Combining Everything Together
Getting ready
How to do it…
How it works…
There's more…
See also
Evaluating Models
Getting ready
How to do it…
How it works…
3. Linear Regression
Introduction
Using the Matrix Inverse Method
Getting ready
How to do it…
How it works…
Implementing a Decomposition Method
Getting ready
How to do it…
How it works…
Learning The TensorFlow Way of Linear Regression
Getting ready
How to do it…
How it works…
Understanding Loss Functions in Linear Regression
Getting ready
How to do it…
How it works…
There's more…
Implementing Deming regression
Getting ready
How to do it…
How it works…
Implementing Lasso and Ridge Regression
Getting ready
How to do it…
How it works…
There's more…
Implementing Elastic Net Regression
Getting ready
How to do it…
How it works…
Implementing Logistic Regression
Getting ready
How to do it…
How it works…
4. Support Vector Machines
Introduction
Working with a Linear SVM
Getting ready
How to do it…
How it works…
Reduction to Linear Regression
Getting ready
How to do it…
How it works…
Working with Kernels in TensorFlow
Getting ready
How to do it…
How it works…
There's more…
Implementing a Non-Linear SVM
Getting ready
How to do it…
How it works…
Implementing a Multi-Class SVM
Getting ready
How to do it…
How it works…
5. Nearest Neighbor Methods
Introduction
Working with Nearest Neighbors
Getting ready
How to do it…
How it works…
There's more…
Working with Text-Based Distances
Getting ready
How to do it…
How it works…
There's more…
Computing with Mixed Distance Functions
Getting ready
How to do it…
How it works…
There's more…
Using an Address Matching Example
Getting ready
How to do it…
How it works…
Using Nearest Neighbors for Image Recognition
Getting ready
How to do it…
How it works…
There's more…
6. Neural Networks
Introduction
Implementing Operational Gates
Getting ready
How to do it…
How it works…
Working with Gates and Activation Functions
Getting ready
How to do it…
How it works…
There's more…
Implementing a One-Layer Neural Network
Getting ready
How to do it…
How it works…
There's more…
Implementing Different Layers
Getting ready
How to do it…
How it works…
Using a Multilayer Neural Network
Getting ready
How to do it…
How it works…
Improving the Predictions of Linear Models
Getting ready
How to do it…
How it works…
Learning to Play Tic Tac Toe
Getting ready
How to do it…
How it works…
7. Natural Language Processing
Introduction
Working with bag of words
Getting ready
How to do it…
How it works…
There's more…
Implementing TF-IDF
Getting ready
How to do it…
How it works…
There's more…
Working with Skip-gram Embeddings
Getting ready
How to do it…
How it works…
There's more…
Working with CBOW Embeddings
Getting ready
How to do it…
How it works…
There's more…
Making Predictions with Word2vec
Getting ready
How to do it…
How it works…
There's more…
Using Doc2vec for Sentiment Analysis
Getting ready
How to do it…
How it works…
8. Convolutional Neural Networks
Introduction
Implementing a Simpler CNN
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing an Advanced CNN
Getting ready
How to do it…
How it works…
See also
Retraining Existing CNN Models
Getting ready
How to do it…
How it works…
See also
Applying Stylenet/Neural-Style
Getting ready
How to do it…
How it works…
See also
Implementing DeepDream
Getting ready
How to do it…
There's more…
See also
9. Recurrent Neural Networks
Introduction
Implementing RNN for Spam Prediction
Getting ready
How to do it…
How it works…
There's more…
Implementing an LSTM Model
Getting ready
How to do it…
How it works…
There's more…
Stacking multiple LSTM Layers
Getting ready
How to do it…
How it works…
Creating Sequence-to-Sequence Models
Getting ready
How to do it…
How it works…
There's more…
Training a Siamese Similarity Measure
Getting ready
How to do it…
There's more…
10. Taking TensorFlow to Production
Introduction
Implementing unit tests
Getting ready
How it works…
Using Multiple Executors
Getting ready
How to do it…
How it works…
There's more…
Parallelizing TensorFlow
Getting ready
How to do it…
How it works…
Taking TensorFlow to Production
Getting ready
How to do it…
How it works…
Productionalizing TensorFlow - An Example
Getting ready
How to do it…
How it works…
11. More with TensorFlow
Introduction
Visualizing graphs in Tensorboard
Getting ready
How to do it…
There's more…
Working with a Genetic Algorithm
Getting ready
How to do it…
How it works…
There's more…
Clustering Using K-Means
Getting ready
How to do it…
There's more…
Solving a System of ODEs
Getting ready
How to do it…
How it works…
See also
Index
TensorFlow Machine Learning Cookbook
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a
retrieval system, or transmitted in any form or by any means, without the
prior written permission of the publisher, except in the case of brief
quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the
accuracy of the information presented. However, the information
contained in this book is sold without warranty, either express or implied.
Neither the author, nor Packt Publishing, nor its dealers and distributors
will be held liable for any damages caused or alleged to be caused directly
or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about
all of the companies and products mentioned in this book by the
appropriate use of capitals. However, Packt Publishing cannot guarantee
the accuracy of this information.
First published: February 2017
Production reference: 1090217
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78646-216-9
www.packtpub.com
Credits
Author
Nick McClure
Reviewer
Chetan Khatri
Commissioning Editor
Veena Pagare
Acquisition Editor
Manish Nainani
Content Development Editor
Sumeet Sawant
Technical Editor
Akash Patel
Copy Editor
Safis Editing
Project Coordinator
Shweta H Birwatkar
Proofreader
Safis Editing
Indexer
Mariammal Chettiyar
Graphics
Disha Haria
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta
About the Author
Nick McClure is currently a senior data scientist at PayScale, Inc. in
Seattle, WA. Prior to this, he has worked at Zillow and Caesar's
Entertainment. He got his degrees in Applied Mathematics from The
University of Montana and the College of Saint Benedict and Saint John's
University.
He has a passion for learning and advocating for analytics, machine
learning, and artificial intelligence. Nick occasionally puts his thoughts and
musings on his blog, http://fromdata.org/, or through his Twitter account,
@nfmcclure.
I am very grateful to my parents, who have always encouraged me to
pursue knowledge. I also want to thank my friends and partner, who have
endured my long monologues about the subjects in this book and always
have been encouraging and listening to me. Writing this book was made
easier by the amazing efforts of the open source community and the great
documentation of many projects out there related to TensorFlow.
A special thanks goes out to the TensorFlow developers at Google. Their
great product and skill speaks volumes for itself, and is accompanied by
great documentation, tutorials, and examples.
About the Reviewer
Chetan Khatri is a data science researcher with a total of five years of
experience in research and development. He works as Lead - Technology
at Accionlabs India. Prior to that, he worked at Nazara Games, where he
led the data science practice as a Principal Big Data Engineer for the
gaming and telecom businesses. He has worked with leading data
companies and one of the Big 4 firms, where he managed the data
science practice platform and the firm's resource teams.
He completed his master's degree in computer science, with a minor in
data science, at KSKV Kachchh University, and was awarded a gold
medal by the Governor of Gujarat for his university first-rank
achievement.
He contributes to society in various ways, including giving talks to
sophomore students at universities and speaking on the various fields of
data science, machine learning, AI, and IoT in academia and at various
conferences. He has excellent correlative knowledge of both academic
research and industry best practices, and he always comes forward to
close the gap between industry and academia, where he has a good
number of achievements. He is the co-author of various courses, such as
data science, IoT, machine learning/AI, and distributed databases, in the
PG/UG curricula at the University of Kachchh. As a result, the
University of Kachchh became the first government university in Gujarat
to introduce Python as a first programming language in its curriculum,
and India's first government university to introduce data science, AI, and
IoT courses; Chetan presented this success story at the PyCon India
2016 conference.
He is one of the founding members of PyKutch, a Python community.
Currently, he is working on intelligent IoT devices with deep learning,
reinforcement learning, and distributed computing with various modern
architectures.
I would like to thank Prof. Devji Chhanga, head of the Computer Science
Department, University of Kachchh, for guiding me to the correct path
and for his valuable guidance in the field of data science research.
I would also like to thank Prof. Shweta Gorania for being the first to
introduce me to genetic algorithms and neural networks.
Last but not least, I would like to thank my beloved family for their
support.
www.PacktPub.com
eBooks, discount offers, and more
Did you know that Packt offers eBook versions of every book published,
with PDF and ePub files available? You can upgrade to the eBook version
at www.PacktPub.com and as a print book customer, you are entitled to a
discount on the eBook copy. Get in touch with us at
<customercare@packtpub.com> for more details.
At www.PacktPub.com, you can also read a collection of free technical
articles, sign up for a range of free newsletters and receive exclusive
discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full
access to all Packt books and video courses, as well as industry-leading
tools to help you plan your personal development and advance your
career.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thank you for purchasing this Packt book. We take our commitment to
improving our content and products to meet your needs seriously—that's
why your feedback is so valuable. Whatever your feelings about your
purchase, please consider leaving a review on this book's Amazon page.
Not only will this help us, more importantly it will also help others in the
community to make an informed decision about the resources that they
invest in to learn.
You can also review for us on a regular basis by joining our reviewers'
club. If you're interested in joining, or would like to learn more about the
benefits we offer, please contact us: <customerreviews@packtpub.com>.
Preface
TensorFlow was open sourced in November of 2015 by Google, and since
then it has become the most starred machine learning repository on
GitHub. TensorFlow's popularity is due to the approach of creating
computational graphs, automatic differentiation, and customizability.
Because of these features, TensorFlow is a very powerful and adaptable
tool that can be used to solve many different machine learning problems.
This book addresses many machine learning algorithms, applies them to
real situations and data, and shows how to interpret the results.
What this book covers
Chapter 1, Getting Started with TensorFlow, covers the main objects and
concepts in TensorFlow. We introduce tensors, variables, and placeholders.
We also show how to work with matrices and various mathematical
operations in TensorFlow. At the end of the chapter we show how to
access the data sources used in the rest of the book.
Chapter 2, The TensorFlow Way, establishes how to connect all the
algorithm components from Chapter 1 into a computational graph in
multiple ways to create a simple classifier. Along the way, we cover
computational graphs, loss functions, back propagation, and training with
data.
Chapter 3, Linear Regression, focuses on using TensorFlow for exploring
various linear regression techniques, such as Deming, lasso, ridge, elastic
net, and logistic regression. We show how to implement each in a
TensorFlow computational graph.
Chapter 4, Support Vector Machines, introduces support vector machines
(SVMs) and shows how to use TensorFlow to implement linear SVMs, non-
linear SVMs, and multi-class SVMs.
Chapter 5, Nearest Neighbor Methods, shows how to implement nearest
neighbor techniques using numerical metrics, text metrics, and scaled
distance functions. We use nearest neighbor techniques to perform record
matching among addresses and to classify hand-written digits from the
MNIST database.
Chapter 6, Neural Networks, covers how to implement neural networks in
TensorFlow, starting with the operational gates and activation function
concepts. We then show a shallow neural network and show how to build
up various different types of layers. We end the chapter by teaching
TensorFlow to play tic-tac-toe via a neural network method.
Chapter 7, Natural Language Processing, illustrates various text
processing techniques with TensorFlow. We show how to implement the
bag-of-words technique and TF-IDF for text. We then introduce neural
network text representations with CBOW and skip-gram and use these
techniques for Word2Vec and Doc2Vec for making real-world predictions.
Chapter 8, Convolutional Neural Networks, expands our knowledge of
neural networks by illustrating how to use neural networks on images with
convolutional neural networks (CNNs). We show how to build a simple
CNN for MNIST digit recognition and extend it to color images in the
CIFAR-10 task. We also illustrate how to extend prior trained image
recognition models for custom tasks. We end the chapter by explaining and
showing the stylenet/neural style and deep-dream algorithms in
TensorFlow.
Chapter 9, Recurrent Neural Networks, explains how to implement
recurrent neural networks (RNNs) in TensorFlow. We show how to do
text-spam prediction, and expand the RNN model to do text generation
based on Shakespeare. We also train a sequence to sequence model for
German-English translation. We finish the chapter by showing the usage of
Siamese RNN networks for record matching on addresses.
Chapter 10, Taking TensorFlow to Production, gives tips and examples on
moving TensorFlow to a production environment and how to take
advantage of multiple processing devices (for example GPUs) and setting
up TensorFlow distributed on multiple machines.
Chapter 11, More with TensorFlow, shows the versatility of TensorFlow by
illustrating how to do k-means, genetic algorithms, and solve a system of
ordinary differential equations (ODEs). We also show the various uses of
Tensorboard, and how to view computational graph metrics.
What you need for this book
The recipes in this book use TensorFlow, which is available at
https://www.tensorflow.org/, and are based on Python 3, available at
https://www.python.org/downloads/. Most of the recipes will require the
use of an Internet connection to download the necessary data.
Who this book is for
The TensorFlow Machine Learning Cookbook is for users that have some
experience with machine learning and some experience with Python
programming. Users with an extensive machine learning background may
find the TensorFlow code enlightening, and users with an extensive Python
programming background may find the explanations helpful.
Sections
In this book, you will find several headings that appear frequently (Getting
ready, How to do it…, How it works…, There's more…, and See also).
To give clear instructions on how to complete a recipe, we use these
sections as follows:
Getting ready
This section tells you what to expect in the recipe, and describes how to
set up any software or any preliminary settings required for the recipe.
How to do it…
This section contains the steps required to follow the recipe.
How it works…
This section usually consists of a detailed explanation of what happened in
the previous section.
There's more…
This section consists of additional information about the recipe in order to
make the reader more knowledgeable about the recipe.
See also
This section provides helpful links to other useful information for the
recipe.
Conventions
In this book, there are many styles of text that distinguish between the
types of information. Code words in text are shown as follows: "We then
set the batch_size variable."
A block of code is set as follows:
embedding_mat = tf.Variable(tf.random_uniform([vocab_size,
embedding_size], -1.0, 1.0))
embedding_output = tf.nn.embedding_lookup(embedding_mat,
x_data_ph)
Some code blocks will have output associated with that code, and we note
this in the code block as follows:
print('Training Accuracy: {}'.format(accuracy))
Which results in the following output:
Training Accuracy: 0.878171
Important words are shown in bold.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you
think about this book— what you liked or may have disliked. Reader
feedback is important for us to develop titles that you really get the most
out of.
To send us general feedback, simply drop an email to
<feedback@packtpub.com>, and mention the book title in the subject of
your message.
If there is a book that you need and would like to see us publish, please
send us a note in the SUGGEST A TITLE form on www.packtpub.com or
email <suggest@packtpub.com>.
If there is a topic that you have expertise in and you are interested in
either writing or contributing to a book, see our author guide on
www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of
things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have
purchased from your account at http://www.packtpub.com. If you
purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files e-mailed
directly to you.
You can download the code files by following these steps:
1. Log in or register to our website using your e-mail address and
password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book
from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract
the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/TensorFlow-Machine-Learning-
Cookbook. We also have other code bundles from our rich catalog of
books and videos available at https://github.com/PacktPublishing/ . Check
them out!
Errata
Although we have taken every care to ensure the accuracy of our content,
mistakes do happen. If you find a mistake in one of our books—maybe a
mistake in the text or the code—we would be grateful if you could report
this to us. By doing so, you can save other readers from frustration and
help us improve subsequent versions of this book. If you find any errata,
please report them by visiting http://www.packtpub.com/submit-errata,
selecting your book, clicking on the Errata Submission Form link, and
entering the details of your errata. Once your errata are verified, your
submission will be accepted and the errata will be uploaded to our website
or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of
the book in the search field. The required information will appear under
the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem
across all media. At Packt, we take the protection of our copyright and
licenses very seriously. If you come across any illegal copies of our works
in any form on the Internet, please provide us with the location address or
website name immediately so that we can pursue a remedy.
Please contact us at <copyright@packtpub.com> with a link to the
suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring
you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at
<questions@packtpub.com>, and we will do our best to address the
problem.
Chapter 1. Getting Started with
TensorFlow
In this chapter, we will cover basic recipes in order to understand how
TensorFlow works and how to access data for this book and additional
resources. By the end of the chapter, you should have knowledge of the
following:
How TensorFlow Works
Declaring Variables and Tensors
Using Placeholders and Variables
Working with Matrices
Declaring Operations
Implementing Activation Functions
Working with Data Sources
Additional Resources
Introduction
Google's TensorFlow engine has a unique way of solving problems. This
unique way allows us to solve machine learning problems very efficiently.
Machine learning is used in almost all areas of life and work, but some of
the more famous areas are computer vision, speech recognition, language
translations, and healthcare. We will cover the basic steps to understand
how TensorFlow operates and eventually build up to production code
techniques later in the book. These fundamentals are important in order to
understand the recipes in the rest of this book.
How TensorFlow Works
At first, computation in TensorFlow may seem needlessly complicated. But
there is a reason for it: because of how TensorFlow treats computation,
developing more complicated algorithms is relatively easy. This recipe will
guide us through the pseudocode of a TensorFlow algorithm.
Getting ready
Currently, TensorFlow is supported on Linux, Mac, and Windows. The
code for this book has been created and run on a Linux system, but should
run on any other system as well. The code for the book is available on
GitHub at https://github.com/nfmcclure/tensorflow_cookbook.
Throughout this book, we will only concern ourselves with the Python
library wrapper of TensorFlow, although most of the original core code for
TensorFlow is written in C++. This book will use Python 3.4+
(https://www.python.org) and TensorFlow 0.12
(https://www.tensorflow.org). TensorFlow has a 1.0.0 alpha version
available on the official GitHub site, and the code in this book has been
reviewed to be compatible with that version as well. While TensorFlow
can run on the CPU, most algorithms run faster if processed on the GPU,
and it is supported on graphics cards with Nvidia Compute Capability
v4.0+ (v5.1 recommended). Popular GPUs for TensorFlow are Nvidia
Tesla architectures and Pascal architectures with at least 4 GB of video
RAM. To run on a GPU, you will also need to download and install the
Nvidia Cuda Toolkit v5.x+ (https://developer.nvidia.com/cuda-
downloads). Some of the recipes will rely on a current installation of the
Python packages: Scipy, Numpy, and Scikit-Learn. These accompanying
packages are also all included in the Anaconda package
(https://www.continuum.io/downloads).
How to do it…
Here we will introduce the general flow of TensorFlow algorithms. Most
recipes will follow this outline:
1. Import or generate datasets: All of our machine-learning algorithms
will depend on datasets. In this book, we will either generate data or
use an outside source of datasets. Sometimes it is better to rely on
generated data because we will just want to know the expected
outcome. Most of the time, we will access public datasets for the given
recipe and the details on accessing these are given in section 8 of this
chapter.
2. Transform and normalize data: Normally, input datasets do not come
in the shape TensorFlow would expect, so we need to transform them
to the accepted shape. The data is usually not in the correct
dimension or type that our algorithms expect.
transform our data before we can use it. Most algorithms also expect
normalized data and we will do this here as well. TensorFlow has built-
in functions that can normalize the data for you as follows:
data = tf.nn.batch_norm_with_global_normalization(...)
3. Partition datasets into train, test, and validation sets: We generally
want to test our algorithms on different sets than those we trained on.
Also, many algorithms require hyperparameter tuning, so we set aside
a validation set for determining the best set of hyperparameters.
4. Set algorithm parameters (hyperparameters): Our algorithms
usually have a set of parameters that we hold constant throughout the
procedure. For example, this can be the number of iterations, the
learning rate, or other fixed parameters of our choosing. It is
considered good form to initialize these together so the reader or user
can easily find them, as follows:
learning_rate = 0.01
batch_size = 100
iterations = 1000
5. Initialize variables and placeholders: TensorFlow depends on
knowing what it can and cannot modify. TensorFlow will
modify/adjust the variables and weight/bias during optimization to
minimize a loss function. To accomplish this, we feed in data through
placeholders. We need to initialize both of these variables and
placeholders with size and type, so that TensorFlow knows what to
expect. TensorFlow also needs to know the type of data to expect: for
most of this book, we will use float32. TensorFlow also provides
float64 and float16. Note that using more bytes for precision results
in slower algorithms, while using fewer results in less precision. See
the following code:
a_var = tf.constant(42)
x_input = tf.placeholder(tf.float32, [None, input_size])
y_input = tf.placeholder(tf.float32, [None, num_classes])
6. Define the model structure: After we have the data, and have
initialized our variables and placeholders, we have to define the
model. This is done by building a computational graph: we tell
TensorFlow what operations must be applied to the variables and
placeholders to arrive at our model outcomes. We talk more in depth
about computational graphs in the Operations in a Computational
Graph recipe in Chapter 2, The TensorFlow Way. Our
model for this example will be a linear model:
y_pred = tf.add(tf.mul(x_input, weight_matrix), b_matrix)
7. Declare the loss functions: After defining the model, we must be able
to evaluate the output. This is where we declare the loss function. The
loss function is very important as it tells us how far off our predictions
are from the actual values. The different types of loss functions are
explored in greater detail in the Implementing Back Propagation
recipe in Chapter 2, The TensorFlow Way:
loss = tf.reduce_mean(tf.square(y_actual - y_pred))
8. Initialize and train the model: Now that we have everything in place,
we need to create an instance of our graph, feed in the data through
the placeholders, and let TensorFlow change the variables to better
predict our training data. Here is one way to initialize the
computational graph:
with tf.Session(graph=graph) as session:
session.run(...)
Note that we can also initiate our graph with:
session = tf.Session(graph=graph)
session.run(…)
9. Evaluate the model: Once we have built and trained the model, we
should evaluate the model by looking at how well it does with new
data through some specified criteria. We evaluate on the train and test
set and these evaluations will allow us to see if the model is underfit or
overfit. We will address these in later recipes.
10. Tune hyperparameters: Most of the time, we will want to go back
and change some of the hyperparameters based on the model
performance. We then repeat the previous steps with different
hyperparameters and evaluate the model on the validation set.
11. Deploy/predict new outcomes: It is also important to know how to
make predictions on new, unseen data. We can do this with all of our
models once we have them trained. A minimal sketch tying these
steps together is shown below.
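Here is such a sketch, tying the preceding steps together on made-up
data (y is roughly 3x + 1). The GradientDescentOptimizer used for the
training step is covered properly in the Implementing Back Propagation
recipe of Chapter 2, The TensorFlow Way; everything else follows the
numbered steps:
import numpy as np
import tensorflow as tf
# Steps 1-2: generate and prepare toy data
x_vals = np.random.rand(100).astype(np.float32)
y_vals = 3.0 * x_vals + 1.0 + np.random.normal(0., 0.1, 100).astype(np.float32)
# Step 4: set hyperparameters
learning_rate = 0.05
iterations = 200
# Step 5: declare placeholders and variables
x_input = tf.placeholder(tf.float32)
y_input = tf.placeholder(tf.float32)
weight = tf.Variable(0.)
bias = tf.Variable(0.)
# Steps 6-7: define the model and the loss
y_pred = tf.add(tf.multiply(x_input, weight), bias)
loss = tf.reduce_mean(tf.square(y_input - y_pred))
# Step 8: initialize and train
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(iterations):
        sess.run(train_step, feed_dict={x_input: x_vals, y_input: y_vals})
    # Step 9: evaluate; the fitted values should approach [3.0, 1.0]
    print(sess.run([weight, bias]))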
How it works…
In TensorFlow, we have to set up the data, variables, placeholders, and
model before we tell the program to train and change the variables to
improve the predictions. TensorFlow accomplishes this through
computational graphs. These computational graphs are directed graphs
with no recursion, which allows for computational parallelism. We create a
loss function for TensorFlow to minimize. TensorFlow accomplishes this
by modifying the variables in the computational graph. TensorFlow knows
how to modify the variables because it keeps track of the computations in
the model and automatically computes the gradients for every variable.
Because of this, we can see how easy it can be to make changes and try
different data sources.
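As a small illustration of this automatic differentiation, the following
sketch asks TensorFlow for the gradient of a simple expression; the
values here are arbitrary:
import tensorflow as tf
sess = tf.Session()
x = tf.Variable(2.0)
y = tf.square(x)            # y = x^2
grad = tf.gradients(y, x)   # dy/dx = 2x
sess.run(tf.global_variables_initializer())
print(sess.run(grad))
[4.0]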
See also
A great place to start is to go through the official documentation of the
TensorFlow Python API section at
https://www.tensorflow.org/api_docs/python/
There are also tutorials available at:
https://www.tensorflow.org/tutorials/
Declaring Tensors
Tensors are the primary data structure that TensorFlow uses to operate on
the computational graph. We can declare these tensors as variables or
feed them in as placeholders. First, we must know how to create tensors.
Getting ready
When we create a tensor and declare it to be a variable, TensorFlow
creates several graph structures in our computation graph. It is also
important to point out that just by creating a tensor, TensorFlow is not
adding anything to the computational graph. TensorFlow does this only
after we create a variable out of the tensor. See the next recipe on variables
and placeholders for more information.
How to do it…
Here we will cover the main ways to create tensors in TensorFlow:
1. Fixed tensors:
Create a zero filled tensor. Use the following:
zero_tsr = tf.zeros([row_dim, col_dim])
Create a one filled tensor. Use the following:
ones_tsr = tf.ones([row_dim, col_dim])
Create a constant filled tensor. Use the following:
filled_tsr = tf.fill([row_dim, col_dim], 42)
Create a tensor out of an existing constant. Use the following:
constant_tsr = tf.constant([1,2,3])
Note
Note that the tf.constant() function can be used to broadcast a value
into an array, mimicking the behavior of tf.fill() by writing
tf.constant(42, [row_dim, col_dim])
2. Tensors of similar shape:
We can also initialize variables based on the shape of other
tensors, as follows:
zeros_similar = tf.zeros_like(constant_tsr)
ones_similar = tf.ones_like(constant_tsr)
Note
Note that since these tensors depend on prior tensors, we must
initialize them in order. Attempting to initialize all the tensors at
once would result in an error. See the There's more… section at the
end of the next recipe on variables and placeholders.
3. Sequence tensors:
TensorFlow allows us to specify tensors that contain defined
intervals. The following functions behave very similarly to the
range() outputs and numpy's linspace() outputs. See the
following function:
linear_tsr = tf.linspace(start=0.0, stop=1.0, num=3)
The resulting tensor is the sequence [0.0, 0.5, 1.0]. Note that
this function includes the specified stop value. See the following
function:
integer_seq_tsr = tf.range(start=6, limit=15, delta=3)
The result is the sequence [6, 9, 12]. Note that this function does
not include the limit value.
4. Random tensors:
The following generated random numbers are from a uniform
distribution:
randunif_tsr = tf.random_uniform([row_dim, col_dim],
minval=0, maxval=1)
Note that this random uniform distribution draws from the interval
that includes the minval but not the maxval (minval <= x <
maxval).
To get a tensor with random draws from a normal distribution, use
the following:
randnorm_tsr = tf.random_normal([row_dim, col_dim],
mean=0.0, stddev=1.0)
There are also times when we wish to generate normal random
values that are assured within certain bounds. The
truncated_normal() function always picks normal values within
two standard deviations of the specified mean. See the following:
truncnorm_tsr = tf.truncated_normal([row_dim, col_dim],
mean=0.0, stddev=1.0)
We might also be interested in randomizing entries of arrays. To
accomplish this, there are two functions that help us:
random_shuffle() and random_crop(). See the following:
shuffled_output = tf.random_shuffle(input_tensor)
cropped_output = tf.random_crop(input_tensor, crop_size)
Later on in this book, we will be interested in randomly cropping
an image of size (height, width, 3) where there are three color
spectrums. To fix a dimension in the cropped_output, you must
give it the maximum size in that dimension:
cropped_image = tf.random_crop(my_image, [height/2,
width/2, 3])
How it works…
Once we have decided how to create the tensors, we may also create
the corresponding variables by wrapping the tensor in the Variable()
function, as follows. More on this in the next section:
my_var = tf.Variable(tf.zeros([row_dim, col_dim]))
There's more…
We are not limited to the built-in functions. We can convert any numpy
array, Python list, or constant to a tensor using the function
convert_to_tensor(). Note that this function also accepts tensors as an
input in case we wish to generalize a computation inside a function.
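For instance, here is a minimal sketch converting each of these types;
the values are made up for illustration:
import numpy as np
import tensorflow as tf
sess = tf.Session()
numpy_tsr = tf.convert_to_tensor(np.array([[1., 2.], [3., 4.]]))  # numpy array
list_tsr = tf.convert_to_tensor([5., 6., 7.])                     # Python list
const_tsr = tf.convert_to_tensor(8.)                              # constant
print(sess.run(numpy_tsr))
[[ 1.  2.]
 [ 3.  4.]]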
Using Placeholders and Variables
Placeholders and variables are key tools for using computational graphs in
TensorFlow. We must understand the difference and when to best use them
to our advantage.
Getting ready
One of the most important distinctions to make with the data is whether it
is a placeholder or a variable. Variables are the parameters of the algorithm
and TensorFlow keeps track of how to change these to optimize the
algorithm. Placeholders are objects that allow you to feed in data of a
specific type and shape and depend on the results of the computational
graph, such as the expected outcome of a computation.
How to do it…
The main way to create a variable is by using the Variable() function,
which takes a tensor as an input and outputs a variable. This is the
declaration and we still need to initialize the variable. Initializing is what
puts the variable with the corresponding methods on the computational
graph. Here is an example of creating and initializing a variable:
my_var = tf.Variable(tf.zeros([2,3]))
sess = tf.Session()
initialize_op = tf.global_variables_initializer()
sess.run(initialize_op)
To see what the computational graph looks like after creating and
initializing a variable, see the next part in this recipe.
Placeholders are just holding the position for data to be fed into the graph.
Placeholders get data from a feed_dict argument in the session. To put a
placeholder in the graph, we must perform at least one operation on the
placeholder. We initialize the graph, declare x to be a placeholder, and
define y as the identity operation on x, which just returns x. We then create
data to feed into the x placeholder and run the identity operation. It is
worth noting that TensorFlow will not return a self-referenced placeholder
in the feed dictionary. The code is shown here and the resulting graph is
shown in the next section:
sess = tf.Session()
x = tf.placeholder(tf.float32, shape=[2,2])
y = tf.identity(x)
x_vals = np.random.rand(2,2)
sess.run(y, feed_dict={x: x_vals})
# Note that sess.run(x, feed_dict={x: x_vals}) will result in a self-referencing error.
How it works…
The computational graph of initializing a variable as a tensor of zeros is
shown in the following figure:
Figure 1: Variable
In Figure 1, we can see what the computational graph looks like in detail
with just one variable, initialized to all zeros. The grey shaded region is a
very detailed view of the operations and constants involved. The main
computational graph with less detail is the smaller graph outside of the
grey region in the upper right corner. For more details on creating and
visualizing graphs, see Chapter 10, Taking TensorFlow to Production,
section 1.
Similarly, the computational graph of feeding a numpy array into a
placeholder can be seen in the following figure:
Figure 2: Here is the computational graph of a placeholder initialized.
The grey shaded region is a very detailed view of the operations and
constants involved. The main computational graph with less detail is the
smaller graph outside of the grey region in the upper right.
There's more…
During the run of the computational graph, we have to tell TensorFlow
when to initialize the variables we have created. While each variable
has an initializer method, the most common way to do this is to use the
helper function global_variables_initializer(). This function
creates an operation in the graph that initializes all the variables we have
created, as follows:
initializer_op = tf.global_variables_initializer()
But if we want to initialize a variable based on the results of initializing
another variable, we have to initialize variables in the order we want, as
follows:
sess = tf.Session()
first_var = tf.Variable(tf.zeros([2,3]))
sess.run(first_var.initializer)
second_var = tf.Variable(tf.zeros_like(first_var))
# Depends on first_var
sess.run(second_var.initializer)
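It is also worth noting that declaring a placeholder with None in a
dimension lets us feed in batches of varying size. Here is a minimal
sketch of this; the shapes are chosen just for illustration:
import numpy as np
import tensorflow as tf
sess = tf.Session()
x = tf.placeholder(tf.float32, shape=[None, 2])  # any number of rows, two columns
total_sum = tf.reduce_sum(x)
print(sess.run(total_sum, feed_dict={x: np.ones((3, 2))}))
6.0
print(sess.run(total_sum, feed_dict={x: np.ones((5, 2))}))
10.0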
Working with Matrices
Understanding how TensorFlow works with matrices is very important to
understanding the flow of data through computational graphs.
Getting ready
Many algorithms depend on matrix operations. TensorFlow gives us easy-
to-use operations to perform such matrix calculations. For all of the
following examples, we can create a graph session by running the
following code:
import tensorflow as tf
import numpy as np
sess = tf.Session()
How to do it…
1. Creating matrices: We can create two-dimensional matrices from
numpy arrays or nested lists, as we described in the earlier section on
tensors. We can also use the tensor creation functions and specify a
two-dimensional shape for functions such as zeros(), ones(),
truncated_normal(), and so on. TensorFlow also allows us to create a
diagonal matrix from a one-dimensional array or list with the function
diag(), as follows:
identity_matrix = tf.diag([1.0, 1.0, 1.0])
A = tf.truncated_normal([2, 3])
B = tf.fill([2,3], 5.0)
C = tf.random_uniform([3,2])
D = tf.convert_to_tensor(np.array([[1., 2., 3.],[-3., -7.,
-1.],[0., 5., -2.]]))
print(sess.run(identity_matrix))
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
print(sess.run(A))
[[ 0.96751703  0.11397751 -0.3438891 ]
 [-0.10132604 -0.8432678   0.29810596]]
print(sess.run(B))
[[ 5.  5.  5.]
 [ 5.  5.  5.]]
print(sess.run(C))
[[ 0.33184157  0.08907614]
 [ 0.53189191  0.67605299]
 [ 0.95889051  0.67061249]]
print(sess.run(D))
[[ 1.  2.  3.]
 [-3. -7. -1.]
 [ 0.  5. -2.]]
Note
Note that if we were to run sess.run(C) again, we would reinitialize
the random variables and end up with different random values.
2. Addition and subtraction use the following notation:
print(sess.run(A+B))
[[ 4.61596632  5.39771316  4.4325695 ]
 [ 3.26702736  5.14477345  4.98265553]]
print(sess.run(B-B))
[[ 0.  0.  0.]
 [ 0.  0.  0.]]
Multiplication:
print(sess.run(tf.matmul(B, identity_matrix)))
[[ 5.  5.  5.]
 [ 5.  5.  5.]]
3. Also, the function matmul() has arguments that specify whether or not
to transpose the arguments before multiplication or whether each
matrix is sparse.
4. Transpose the arguments as follows:
print(sess.run(tf.transpose(C)))
[[ 0.67124544  0.26766731  0.99068872]
 [ 0.25006068  0.86560275  0.58411312]]
5. Again, it is worth mentioning that reinitializing gives us different
values than before.
6. For the determinant, use the following:
print(sess.run(tf.matrix_determinant(D)))
-38.0
Inverse:
print(sess.run(tf.matrix_inverse(D)))
[[-0.5         -0.5         -0.5       ]
 [ 0.15789474   0.05263158   0.21052632]
 [ 0.39473684   0.13157895   0.02631579]]
Note
Note that the inverse method is based on the Cholesky decomposition
if the matrix is symmetric positive definite or the LU decomposition
otherwise.
7. Decompositions:
For the Cholesky decomposition, use the following:
print(sess.run(tf.cholesky(identity_matrix)))
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
8. For eigenvalues and eigenvectors, use the following code:
print(sess.run(tf.self_adjoint_eig(D)))
[[-10.65907521  -0.22750691   2.88658212]
 [  0.21749542   0.63250104  -0.74339638]
 [  0.84526515   0.2587998    0.46749277]
 [ -0.4880805    0.73004459   0.47834331]]
Note that the function self_adjoint_eig() outputs the eigenvalues in the
first row and the eigenvectors in the remaining rows. In
mathematics, this is known as the Eigen decomposition of a matrix.
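As a quick illustration of composing these matrix operations, here is a
minimal sketch, reusing the matrix D and the session from above, that
solves the linear system Dx = b via the inverse; the right-hand side b is
made up for illustration:
b = tf.constant([[1.], [0.], [1.]])
solution = tf.matmul(tf.matrix_inverse(D), b)
print(sess.run(solution))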
How it works…
TensorFlow provides all the tools for us to get started with numerical
computations and adding such computations to our graphs. This notation
might seem quite heavy for simple matrix operations. Remember that we
are adding these operations to the graph and telling TensorFlow what
tensors to run through those operations. While this might seem verbose
now, it helps to understand the notations in later chapters, when this way
of computation will make it easier to accomplish our goals.
Declaring Operations
Now we must learn about the other operations we can add to a TensorFlow
graph.
Getting ready
Besides the standard arithmetic operations, TensorFlow provides us with
more operations that we should be aware of. We need to know how to use
them before proceeding. Again, we can create a graph session by running
the following code:
import tensorflow as tf
sess = tf.Session()
How to do it…
TensorFlow has the standard operations on tensors: add(), sub(), mul(),
and div(). Note that all of these operations in this section will evaluate the
inputs element-wise unless specified otherwise:
1. TensorFlow provides some variations of div() and relevant functions.
2. It is worth mentioning that div() returns the same type as the inputs.
This means it really returns the floor of the division (akin to Python 2)
if the inputs are integers. To return the Python 3 version, which casts
integers into floats before dividing and always returning a float,
TensorFlow provides the truediv() function, shown as follows:
print(sess.run(tf.div(3,4)))
0
print(sess.run(tf.truediv(3,4)))
0.75
3. If we have floats and want an integer division, we can use the function
floordiv(). Note that this will still return a float, but rounded down to
the nearest integer. The function is shown as follows:
print(sess.run(tf.floordiv(3.0,4.0)))
0.0
4. Another important function is mod(). This function returns the
remainder after the division. It is shown as follows:
print(sess.run(tf.mod(22.0, 5.0)))
2.0
5. The cross-product between two tensors is achieved by the cross()
function. Remember that the cross-product is only defined for two
three-dimensional vectors, so it only accepts two three-dimensional
tensors. The function is shown as follows:
print(sess.run(tf.cross([1., 0., 0.], [0., 1., 0.])))
[ 0.  0.  1.]
6. Here is a compact list of the more common math functions. All of
these functions operate element-wise:
abs()       Absolute value of one input tensor
ceil()      Ceiling function of one input tensor
cos()       Cosine function of one input tensor
exp()       Base e exponential of one input tensor
floor()     Floor function of one input tensor
inv()       Multiplicative inverse (1/x) of one input tensor
log()       Natural logarithm of one input tensor
maximum()   Element-wise max of two tensors
minimum()   Element-wise min of two tensors
neg()       Negative of one input tensor
pow()       The first tensor raised to the second tensor element-wise
round()     Rounds one input tensor
rsqrt()     One over the square root of one tensor
sign()      Returns -1, 0, or 1, depending on the sign of the tensor
sin()       Sine function of one input tensor
sqrt()      Square root of one input tensor
square()    Square of one input tensor
7. Specialty mathematical functions: There are some special math
functions that get used in machine learning that are worth mentioning,
and TensorFlow has built-in functions for them. Again, these functions
operate element-wise, unless specified otherwise:
digamma()             Psi function, the derivative of the lgamma() function
erf()                 Gaussian error function, element-wise, of one tensor
erfc()                Complementary error function of one tensor
igamma()              Lower regularized incomplete gamma function
igammac()             Upper regularized incomplete gamma function
lbeta()               Natural logarithm of the absolute value of the beta function
lgamma()              Natural logarithm of the absolute value of the gamma function
squared_difference()  Computes the square of the differences between two tensors
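To see a couple of these functions in action, here is a minimal sketch
using the session created earlier; the input values are arbitrary:
print(sess.run(tf.maximum([1., 3.], [2., 2.])))
[ 2.  3.]
print(sess.run(tf.squared_difference([1., 3.], [2., 2.])))
[ 1.  1.]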
How it works…
It is important to know what functions are available to us to add to our
computational graphs. Mostly, we will be concerned with the preceding
functions. We can also generate many different custom functions as
compositions of the preceding functions, as follows:
# Tangent function (tan(pi/4)=1)
print(sess.run(tf.div(tf.sin(3.1416/4.), tf.cos(3.1416/4.))))
1.0
There's more…
If we wish to add other operations to our graphs that are not listed here,
we must create our own from the preceding functions. Here is an example
of an operation not listed previously that we can add to our graph. We
choose to add a custom polynomial function, f(x) = 3x^2 - x + 10:
def custom_polynomial(value):
    return(tf.sub(3 * tf.square(value), value) + 10)
print(sess.run(custom_polynomial(11)))
362
Implementing Activation
Functions
Getting ready
When we start to use neural networks, we will use activation functions
regularly because activation functions are a mandatory part of any neural
network. The goal of the activation function is to adjust weight and bias. In
TensorFlow, activation functions are non-linear operations that act on
tensors. They are functions that operate in a similar way to the previous
mathematical operations. Activation functions serve many purposes, but
the main concept is that they introduce a non-linearity into the graph
while normalizing the outputs. Start a TensorFlow graph with the following
commands:
import tensorflow as tf
sess = tf.Session()
How to do it…
The activation functions live in the neural network (nn) library in
TensorFlow. Besides using built-in activation functions, we can also design
our own using TensorFlow operations. We can import the predefined
activation functions (import tensorflow.nn as nn) or be explicit and
write tf.nn in our function calls. Here, we choose to be explicit with each
function call:
1. The rectified linear unit, known as ReLU, is the most common and
basic way to introduce a non-linearity into neural networks. This
function is just max(0,x). It is continuous but not smooth. It appears as
follows:
print(sess.run(tf.nn.relu([-3., 3., 10.])))
[  0.   3.  10.]
2. There will be times when we wish to cap the linearly increasing part of
the preceding ReLU activation function. We can do this by nesting the
max(0,x) function into a min() function. The implementation that
TensorFlow has is called the ReLU6 function. This is defined as
min(max(0,x),6). This is a version of the hard-sigmoid function and is
computationally faster, and does not suffer from vanishing
(infinitesimally near zero) or exploding values. This will come in
handy when we discuss deeper neural networks in Chapter 8,
Convolutional Neural Networks and Chapter 9, Recurrent Neural
Networks. It appears as follows:
print(sess.run(tf.nn.relu6([-3., 3., 10.])))
[ 0.  3.  6.]
3. The sigmoid function is the most common continuous and smooth
activation function. It is also called a logistic function and has the form
1/(1+exp(-x)). The sigmoid is not often used because of the tendency
to zero-out the back propagation terms during training. It appears as
follows:
print(sess.run(tf.nn.sigmoid([-1., 0., 1.])))
[ 0.26894143  0.5         0.7310586 ]
Note
We should be aware that some activation functions are not zero
centered, such as the sigmoid. This will require us to zero-mean the
data prior to using it in most computational graph algorithms.
4. Another smooth activation function is the hyperbolic tangent. The
hyperbolic tangent function is very similar to the sigmoid except that
instead of having a range between 0 and 1, it has a range between -1
and 1. The function has the form of the ratio of the hyperbolic sine
over the hyperbolic cosine, which can also be written as
(exp(x)-exp(-x))/(exp(x)+exp(-x)). It appears as follows:
print(sess.run(tf.nn.tanh([-1., 0., 1.])))
[-0.76159418  0.          0.76159418]
5. The softsign function also gets used as an activation function. The
form of this function is x/(abs(x) + 1). The softsign function is
supposed to be a continuous approximation to the sign function. It
appears as follows:
print(sess.run(tf.nn.softsign([-1., 0., 1.])))
[-0.5  0.   0.5]
6. Another function, the softplus, is a smooth version of the ReLU
function. The form of this function is log(exp(x) + 1). It appears as
follows:
print(sess.run(tf.nn.softplus([-1., 0., 1.])))
[ 0.31326166  0.69314718  1.31326163]
Note
The softplus goes to infinity as the input increases whereas the
softsign goes to 1. As the input gets smaller, however, the softplus
approaches zero and the softsign goes to -1.
7. The Exponential Linear Unit (ELU) is very similar to the softplus
function except that the bottom asymptote is -1 instead of 0. The form
is exp(x)-1 if x < 0 else x. It appears as follows:
print(sess.run(tf.nn.elu([-1., 0., 1.])))
[-0.63212055  0.          1.        ]
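As noted earlier in this recipe, we can also compose our own activation
functions from TensorFlow operations. Here is a minimal sketch of a
leaky ReLU, a common ReLU variant that we build ourselves here; the
0.2 slope is an arbitrary choice for illustration:
def leaky_relu(x, alpha=0.2):
    # max(alpha*x, x): identity for positive inputs, a small slope for negatives
    return tf.maximum(alpha * x, x)
print(sess.run(leaky_relu(tf.constant([-3., 3., 10.]))))
# prints approximately [-0.6  3.  10.]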
How it works…
These activation functions are the way that we introduce nonlinearities in
neural networks or other computational graphs in the future. It is important
to note where in our network we are using activation functions. If the
activation function has a range between 0 and 1 (sigmoid), then the
computational graph can only output values between 0 and 1.
If the activation functions are inside and hidden between nodes, then we
want to be aware of the effect that the range can have on our tensors as
we pass them through. If our tensors were scaled to have a mean of zero,
we will want to use an activation function that preserves as much variance
as possible around zero. This would imply we want to choose an activation
function such as the hyperbolic tangent (tanh) or softsign. If the tensors
are all scaled to be positive, then we would ideally choose an activation
function that preserves variance in the positive domain.
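To make the zero-centering point concrete, here is a minimal sketch
comparing the two families on inputs around zero, using the session
from this recipe; the input values are arbitrary:
print(sess.run(tf.nn.tanh([-0.5, 0., 0.5])))
# approximately [-0.46  0.  0.46], centered at zero
print(sess.run(tf.nn.sigmoid([-0.5, 0., 0.5])))
# approximately [ 0.38  0.5  0.62], centered at 0.5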
There's more…
Here are two graphs that illustrate the different activation functions. The
following figure shows the following functions: ReLU, ReLU6, softplus,
exponential LU, sigmoid, softsign, and the hyperbolic tangent:
Figure 3: Activation functions of softplus, ReLU, ReLU6, and exponential
LU
In Figure 3, we can see four of the activation functions, softplus, ReLU,
ReLU6, and exponential LU. These functions flatten out to the left of zero
and linearly increase to the right of zero, with the exception of ReLU6,
which has a maximum value of 6:
Figure 4: Sigmoid, hyperbolic tangent (tanh), and softsign activation
function
In Figure 4, we have the activation functions sigmoid, hyperbolic tangent
(tanh), and softsign. These activation functions are all smooth and have
an S shape. Note that there are two horizontal asymptotes for these functions.
Working with Data Sources
For most of this book, we will rely on the use of datasets to fit machine
learning algorithms. This section has instructions on how to access each of
these various datasets through TensorFlow and Python.
Getting ready
In TensorFlow, some of the datasets that we will use are built in to Python
libraries, some will require a Python script to download, and some will be
manually downloaded through the Internet. Almost all of these datasets
require an active Internet connection to retrieve data.
How to do it…
1. Iris data: This dataset is arguably the most classic dataset used in
machine learning and maybe all of statistics. It is a dataset that
measures sepal length, sepal width, petal length, and petal width of
three different types of iris flowers: Iris setosa, Iris virginica, and Iris
versicolor. There are 150 measurements overall, 50 measurements of
each species. To load the dataset in Python, we use Scikit Learn's
dataset function, as follows:
from sklearn import datasets
iris = datasets.load_iris()
print(len(iris.data))
150
print(len(iris.target))
150
print(iris.data[0]) # Sepal length, Sepal width, Petal length, Petal width
[ 5.1  3.5  1.4  0.2]
print(set(iris.target)) # I. setosa, I. virginica, I.
versicolor
{0, 1, 2}
2. Birth weight data: The University of Massachusetts at Amherst has
compiled many statistical datasets that are of interest (1). One such
dataset is a measure of child birth weight and other demographic and
medical measurements of the mother and family history. There are 189
observations of 11 variables. Here is how to access the data in Python:
import requests
birthdata_url =
'https://www.umass.edu/statdata/statdata/data/lowbwt.dat'
birth_file = requests.get(birthdata_url)
birth_data = birth_file.text.split('\r\n')[5:]
birth_header = [x for x in birth_data[0].split(' ') if len(x)>=1]
birth_data = [[float(x) for x in y.split(' ') if len(x)>=1]
for y in birth_data[1:] if len(y)>=1]
print(len(birth_data))
189
print(len(birth_data[0]))
11
3. Boston Housing data: Carnegie Mellon University maintains a library
of datasets in their Statlib Library. This data is easily accessible via
The University of California at Irvine's Machine-Learning Repository
(2). There are 506 observations of house worth along with various
demographic data and housing attributes (14 variables). Here is how to
access the data in Python:
import requests
housing_url = 'https://archive.ics.uci.edu/ml/machine-
learning-databases/housing/housing.data'
housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM',
'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
housing_file = requests.get(housing_url)
housing_data = [[float(x) for x in y.split(' ') if len(x)>=1]
for y in housing_file.text.split('\n') if len(y)>=1]
print(len(housing_data))
506
print(len(housing_data[0]))
14
4. MNIST handwriting data: MNIST (Mixed National Institute of
Standards and Technology) is a subset of the larger NIST
handwriting database. The MNIST handwriting dataset is hosted on
Yann LeCun's website (https://yann.lecun.com/exdb/mnist/). It is a
database of 70,000 images of single digit numbers (0-9) with about
60,000 annotated for a training set and 10,000 for a test set. This
dataset is used so often in image recognition that TensorFlow provides
built-in functions to access this data. In machine learning, it is also
important to provide validation data to prevent overfitting (target
leakage). Because of this, TensorFlow sets aside 5,000 examples of the
training set as a validation set. Here is how to access the data in Python:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
print(len(mnist.train.images))
55000
print(len(mnist.test.images))
10000
print(len(mnist.validation.images))
5000
print(mnist.train.labels[1,:]) # The first label is a 3
[ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
5. Spam-ham text data: UCI's machine-learning dataset library (2) also
holds a spam-ham text message dataset. We can access this .zip file
and get the spam-ham text data as follows:
import requests
import io
from zipfile import ZipFile
zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
r = requests.get(zip_url)
z = ZipFile(io.BytesIO(r.content))
file = z.read('SMSSpamCollection')
text_data = file.decode()
text_data = text_data.encode('ascii',errors='ignore')
text_data = text_data.decode().split('\n')
text_data = [x.split('\t') for x in text_data if len(x)>=1]
[text_data_target, text_data_train] = [list(x) for x in
zip(*text_data)]
print(len(text_data_train))
5574
print(set(text_data_target))
{'ham', 'spam'}
print(text_data_train[1])
Ok lar... Joking wif u oni...
6.
Movie review data: Bo Pang from Cornell has released a movie
review dataset that classifies reviews as good or bad (3). You can find
the data on the website, http://www.cs.cornell.edu/people/pabo/movie-
review-data/. To download, extract, and transform this data, we run
the following code:
import requests
import io
import tarfile
movie_data_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
r = requests.get(movie_data_url)
# Stream data into temp object
stream_data = io.BytesIO(r.content)
tmp = io.BytesIO()
while True:
    s = stream_data.read(16384)
    if not s:
        break
    tmp.write(s)
stream_data.close()
tmp.seek(0)
# Extract tar file
tar_file = tarfile.open(fileobj=tmp, mode="r:gz")
pos = tar_file.extractfile('rt-polaritydata/rt-polarity.pos')
neg = tar_file.extractfile('rt-polaritydata/rt-polarity.neg')
# Save pos/neg reviews (Also deal with encoding)
pos_data = []
for line in pos:
    pos_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode())
neg_data = []
for line in neg:
    neg_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode())
tar_file.close()
print(len(pos_data))
5331
print(len(neg_data))
5331
# Print out first negative review
print(neg_data[0])
simplistic , silly and tedious .
7.
CIFAR-10 image data: The Canadian Institute For Advanced
Research has released an image set that contains 80 million labeled
colored images (each image is scaled to 32x32 pixels). There are 10
different target classes (airplane, automobile, bird, and so on). The
CIFAR-10 is a subset that has 60,000 images. There are 50,000 images
in the training set, and 10,000 in the test set. Since we will be using
this dataset in multiple ways, and because it is one of our larger
datasets, we will not run a script each time we need it. To get this
dataset, please navigate to http://www.cs.toronto.edu/~kriz/cifar.html ,
and download the CIFAR-10 dataset. We will address how to use this
dataset in the appropriate chapters.
8.
The works of Shakespeare text data: Project Gutenberg (5) is a
project that releases electronic versions of free books. They have
compiled all of the works of Shakespeare together and here is how to
access the text file through Python:
import requests
shakespeare_url = 'http://www.gutenberg.org/cache/epub/100/pg100.txt'
# Get Shakespeare text
response = requests.get(shakespeare_url)
shakespeare_file = response.content
# Decode binary into string
shakespeare_text = shakespeare_file.decode('utf-8')
# Drop first few descriptive paragraphs.
shakespeare_text = shakespeare_text[7675:]
print(len(shakespeare_text)) # Number of characters
5582212
9.
English-German sentence translation data: The Tatoeba project
(http://tatoeba.org ) collects sentence translations in many languages.
Their data has been released under the Creative Commons License.
From this data, ManyThings.org (http://www.manythings.org ) has
compiled sentence-to-sentence translations in text files available for
download. Here we will use the English-German translation file, but
you can change the URL to whatever languages you would like to use:
import requests
import io
from zipfile import ZipFile
sentence_url = 'http://www.manythings.org/anki/deu-eng.zip'
r = requests.get(sentence_url)
z = ZipFile(io.BytesIO(r.content))
file = z.read('deu.txt')
# Format Data
eng_ger_data = file.decode()
eng_ger_data = eng_ger_data.encode('ascii',errors='ignore')
eng_ger_data = eng_ger_data.decode().split('\n')
eng_ger_data = [x.split('\t') for x in eng_ger_data if len(x)>=1]
[english_sentence, german_sentence] = [list(x) for x in
zip(*eng_ger_data)]
print(len(english_sentence))
137673
print(len(german_sentence))
137673
print(eng_ger_data[10])
['I won!', 'Ich habe gewonnen!']
How it works…
When it comes time to use one of these datasets in a recipe, we will refer
you to this section and assume that the data is loaded in such a way as
described in the preceding text. If further data transformation or pre-
processing is needed, then such code will be provided in the recipe itself.
See also
Hosmer, D.W., Lemeshow, S., and Sturdivant, R. X. (2013). Applied Logistic Regression: 3rd Edition. https://www.umass.edu/statdata/statdata/data/lowbwt.txt
Lichman, M. (2013). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan (2002). Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of EMNLP 2002. http://www.cs.cornell.edu/people/pabo/movie-review-data/
Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images. http://www.cs.toronto.edu/~kriz/cifar.html
Project Gutenberg. Accessed April 2016. http://www.gutenberg.org/
Additional Resources
Here we will provide additional links, documentation sources, and tutorials
that are of great assistance to learning and using TensorFlow.
Getting ready
When learning how to use TensorFlow, it helps to know where to turn to
for assistance or pointers. This section lists resources to get TensorFlow
running and to troubleshoot problems.
How to do it…
Here is a list of TensorFlow resources:
1.
The code for this book is available online at
https://github.com/nfmcclure/tensorflow_cookbook .
2.
The official TensorFlow Python API documentation is located at
https://www.tensorflow.org/api_docs/python . Here there is
documentation and examples of all of the functions, objects, and
methods in TensorFlow. Note the version number (r0.8) in the link, and
realize that a more current version may be available.
3.
TensorFlow's official tutorials are very thorough and detailed. They are
located at https://www.tensorflow.org/tutorials/index.html . They start
by covering image recognition models, and work through Word2Vec,
RNN models, and sequence-to-sequence models. They also have
additional tutorials on generating fractals and solving a PDE system.
Note that they are continually adding more tutorials and examples to
this collection.
4.
TensorFlow's official GitHub repository is available via
https://github.com/tensorflow/tensorflow . Here you can view the
open-sourced code and even fork or clone the most current version of
the code if you want. You can also see current filed issues if you
navigate to the issues directory.
5.
A public Docker container that is kept current by TensorFlow is
available on Dockerhub at:
https://hub.docker.com/r/tensorflow/tensorflow/
6.
A downloadable virtual machine that contains TensorFlow installed on
an Ubuntu 15.04 OS is available as well. This option is great for
running the UNIX version of TensorFlow on a Windows PC. The VM
is available through a Google Document request form at:
https://docs.google.com/forms/d/1mUztUlK6_z31BbMW5ihXaYHlhBc
8XHyoI/viewform. It is about a 2 GB download and requires VMWare
player to run. VMware Player is a product made by VMware, is free for
personal use, and is available at:
https://www.vmware.com/go/downloadplayer/ . This virtual machine is
maintained by David Winters (1).
7.
A great source for community help is Stack Overflow. There is a tag
for TensorFlow. This tag seems to be growing in interest as TensorFlow
is gaining more popularity. To view activity on this tag, visit
http://stackoverflow.com/questions/tagged/Tensorflow
8.
While TensorFlow is very agile and can be used for many things, the
most common usage of TensorFlow is deep learning. To understand the
basis for deep learning, how the underlying mathematics works, and to
develop more intuition on deep learning, Google has created an online
course available on Udacity. To sign up and take the video lecture
course visit https://www.udacity.com/course/deep-learning--ud730 .
9.
TensorFlow has also made a site where you can visually explore
training a neural network while changing the parameters and datasets.
Visit http://playground.tensorflow.org/ to explore how different
settings affect the training of neural networks.
10.
Geoffrey Hinton teaches an online course, Neural Networks for
Machine Learning, through Coursera. Visit
https://www.coursera.org/learn/neural-networks
11.
Stanford University has an online syllabus and detailed course notes
for Convolutional Neural Networks for Visual Recognition. Visit
http://cs231n.stanford.edu/
See also
Winters, D.
https://docs.google.com/forms/d/1mUztUlK6_z31BbMW5ihXaYHlhBc
8XHyoI/viewform
Chapter 2. The TensorFlow Way
In this chapter, we will introduce the key components of how TensorFlow
operates. Then we will tie it together to create a simple classifier and
evaluate the outcomes. By the end of the chapter you should have learned
about the following:
Operations in a Computational Graph
Layering Nested Operations
Working with Multiple Layers
Implementing Loss Functions
Implementing Back Propagation
Working with Batch and Stochastic Training
Combining Everything Together
Evaluating Models
Introduction
Now that we have introduced how TensorFlow creates tensors, uses
variables and placeholders, we will introduce how to act on these objects
in a computational graph. From this, we can set up a simple classifier and
see how well it performs.
Note
Also, remember that all the code from this book is available online on
GitHub at https://github.com/nfmcclure/tensorflow_cookbook .
Operations in a Computational
Graph
Now that we can put objects into our computational graph, we will
introduce operations that act on such objects.
Getting ready
To start a graph, we load TensorFlow and create a session, as follows:
import tensorflow as tf
sess = tf.Session()
How to do it…
In this example, we will combine what we have learned and feed in each
number in a list to an operation in a graph and print the output:
1. First we declare our tensors and placeholders. Here we will create a
numpy array to feed into our operation:
import numpy as np
x_vals = np.array([1., 3., 5., 7., 9.])
x_data = tf.placeholder(tf.float32)
m_const = tf.constant(3.)
2. Next we declare our operation, a multiplication by the constant, and add
it to the graph:
my_product = tf.mul(x_data, m_const)
3. Finally, we loop through our input values and feed each one into the
graph, printing the result:
for x_val in x_vals:
    print(sess.run(my_product, feed_dict={x_data: x_val}))
3.0
9.0
15.0
21.0
27.0
How it works…
Steps 1 and 2 create the data and operations on the computational graph.
Then, in step 3, we feed the data through the graph and print the output.
Here is what the computational graph looks like:
Figure 1: Here we can see in the graph that the placeholder, x_data,
along with our multiplicative constant, feeds into the multiplication
operation.
Layering Nested Operations
In this recipe, we will learn how to put multiple operations on the same
computational graph.
Getting ready
It's important to know how to chain operations together. This will set up
layered operations in the computational graph. For a demonstration we will
multiply a placeholder by two matrices and then perform addition. We will
feed in two matrices in the form of a three-dimensional numpy array:
import tensorflow as tf
sess = tf.Session()
How to do it…
It is also important to note how the data will change shape as it passes
through. We will feed in two numpy arrays of size 3x5. We will multiply
each matrix by a constant of size 5x1, which will result in a matrix of size
3x1. We will then multiply this by a 1x1 matrix, resulting in a 3x1 matrix
again. Finally, we add a 3x1 matrix at the end, as follows:
1. First we create the data to feed in and the corresponding placeholder:
my_array = np.array([[1., 3., 5., 7., 9.],
[-2., 0., 2., 4., 6.],
[-6., -3., 0., 3., 6.]])
x_vals = np.array([my_array, my_array + 1])
x_data = tf.placeholder(tf.float32, shape=(3, 5))
2. Next we create the constants that we will use for matrix multiplication
and addition:
m1 = tf.constant([[1.],[0.],[-1.],[2.],[4.]])
m2 = tf.constant([[2.]])
a1 = tf.constant([[10.]])
3. Now we declare the operations and add them to the graph:
prod1 = tf.matmul(x_data, m1)
prod2 = tf.matmul(prod1, m2)
add1 = tf.add(prod2, a1)
4. Finally, we feed the data through our graph:
for x_val in x_vals:
print(sess.run(add1, feed_dict={x_data: x_val}))
[[ 102.]
 [  66.]
 [  58.]]
[[ 114.]
 [  78.]
 [  70.]]
How it works…
The computational graph we just created can be visualized with
Tensorboard. Tensorboard is a feature of TensorFlow that allows us to
visualize the computational graphs and values in that graph. These features
are provided natively, unlike other machine learning frameworks. To see
how this is done, see the Visualizing graphs in Tensorboard recipe in
Chapter 11, More with TensorFlow. Here is what our layered graph looks
like:
Figure 2: In this computational graph you can see the data size as it
propagates upward through the graph.
There's more…
We have to declare the data shape and know the outcome shape of the
operations before we run data through the graph. This is not always the
case. There may be a dimension or two that we do not know beforehand or
that can vary. To accomplish this, we designate the dimension that can
vary or is unknown as the value None. For example, to have the prior data
placeholder accept an unknown number of columns, we would write the
following line:
x_data = tf.placeholder(tf.float32, shape=(3, None))
This relaxes shape checking when we build the graph, but we must still
obey the rules of matrix multiplication: the constant we multiply by must
have a number of rows equal to the number of columns we feed in. We can
either generate this constant dynamically or reshape x_data as we feed
data into our graph. This will come in handy in later chapters when we
feed data in multiple batches.
Working with Multiple Layers
Now that we have covered multiple operations, we will cover how to
connect various layers that have data propagating through them.
Getting ready
In this recipe, we will introduce how to best connect various layers,
including custom layers. The data we will generate and use will be
representative of small random images. It is best to understand these types
of operation on a simple example and how we can use some built-in layers
to perform calculations. We will perform a small moving window average
across a 2D image and then flow the resulting output through a custom
operation layer.
In this section, we will see that the computational graph can get large and
hard to look at. To address this, we will also introduce ways to name
operations and create scopes for layers. To start, load numpy and
tensorflow and create a graph, using the following:
import tensorflow as tf
import numpy as np
sess = tf.Session()
How to do it…
1. First we create our sample 2D image with numpy. This image will be a
4x4 pixel image. We will create it in four dimensions; the first and last
dimension will have a size of one. Note that some TensorFlow image
functions will operate on four-dimensional images. Those four
dimensions are image number, height, width, and channel, and to make
it one image with one channel, we set two of the dimensions to 1, as
follows:
x_shape = [1, 4, 4, 1]
x_val = np.random.uniform(size=x_shape)
2.
Now we have to create the placeholder in our graph where we can
feed in the sample image, as follows:
x_data = tf.placeholder(tf.float32, shape=x_shape)
3.
To create a moving window average across our 4x4 image, we will use
a built-in function that will convolute a constant across a window of
the shape 2x2. This function is quite common to use in image
processing and in TensorFlow, the function we will use is conv2d().
This function takes a piecewise product of the window and a filter we
specify. We must also specify a stride for the moving window in both
directions. Here we will compute four moving window averages, the
top left, top right, bottom left, and bottom right four pixels. We do this
by creating a 2x2 window and having strides of length 2 in each
direction. To take the average, we will convolute the 2x2 window with
a constant of 0.25, as follows:
my_filter = tf.constant(0.25, shape=[2, 2, 1, 1])
my_strides = [1, 2, 2, 1]
mov_avg_layer = tf.nn.conv2d(x_data, my_filter, my_strides,
                             padding='SAME',
                             name='Moving_Avg_Window')
Note
To figure out the output size of a convolutional layer, we can use the
following formula: Output = (W-F+2P)/S+1, where W is the input
size, F is the filter size, P is the padding of zeros, and S is the stride.
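To make the formula concrete, here is a quick check in plain Python for the 4x4 image, 2x2 filter, and stride of 2 used above (an illustrative sketch; the function name is ours):
def conv_output_size(W, F, P, S):
    # Output = (W - F + 2P)/S + 1
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=4, F=2, P=0, S=2)) # 2, so the moving average output is 2x2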
4. Note that we are also naming this layer Moving_Avg_Window by using
the name argument of the function.
5. Now we define a custom layer that will operate on the 2x2 output of
the moving window average. The custom function will first multiply
the input by another 2x2 matrix tensor, and then add one to each entry.
After this we take the sigmoid of each element and return the 2x2
matrix. Since matrix multiplication only operates on two-dimensional
matrices, we need to drop the extra dimensions of our image that are
of size 1. TensorFlow can do this with the built-in function squeeze().
Here we define the new layer:
def custom_layer(input_matrix):
    input_matrix_squeezed = tf.squeeze(input_matrix)
    A = tf.constant([[1., 2.], [-1., 3.]])
    b = tf.constant(1., shape=[2, 2])
    temp1 = tf.matmul(A, input_matrix_squeezed)
    temp = tf.add(temp1, b) # Ax + b
    return(tf.sigmoid(temp))
6.
Now we have to place the new layer on the graph. We will do this with
a named scope so that it is identifiable and collapsible/expandable on
the computational graph, as follows:
with tf.name_scope('Custom_Layer') as scope:
    custom_layer1 = custom_layer(mov_avg_layer)
7.
Now we just feed in the 4x4 image in the placeholder and tell
TensorFlow to run the graph, as follows:
print(sess.run(custom_layer1, feed_dict={x_data: x_val}))
[[ 0.91914582  0.96025133]
 [ 0.87262219  0.9469803 ]]
How it works…
The visualized graph looks better with the naming of operations and
scoping of layers. We can collapse and expand the custom layer because
we created it in a named scope. In the following figure, see the collapsed
version on the left and the expanded version on the right:
Figure 3: Computational graph with two layers. The first layer is named
as Moving_Avg_Window, and the second is a collection of operations called
Custom_Layer. It is collapsed on the left and expanded on the right.
Implementing Loss Functions
Loss functions are very important to machine learning algorithms. They
measure the distance between the model outputs and the target (truth)
values. In this recipe, we show various loss function implementations in
TensorFlow.
Getting ready
In order to optimize our machine learning algorithms, we will need to
evaluate the outcomes. Evaluating outcomes in TensorFlow depends on
specifying a loss function. A loss function tells TensorFlow how good or
bad the predictions are compared to the desired result. In most cases, we
will have a set of data and a target on which to train our algorithm. The
loss function compares the target to the prediction and gives a numerical
distance between the two.
For this recipe, we will cover the main loss functions that we can
implement in TensorFlow.
To see how the different loss functions operate, we will plot them in this
recipe. We will first start a computational graph session and load
matplotlib, a Python plotting library, as follows:
import matplotlib.pyplot as plt
import tensorflow as tf
sess = tf.Session()
How to do it…
First we will talk about loss functions for regression, that is, predicting a
continuous dependent variable. To start, we will create a sequence of our
predictions and a target as a tensor. We will output the results across 500
x-values between -1 and 1. See the next section for a plot of the outputs.
Use the following code:
x_vals = tf.linspace(-1., 1., 500)
target = tf.constant(0.)
1. The L2 norm loss is also known as the Euclidean loss function. It is
just the square of the distance to the target. Here we will compute the
loss function as if the target is zero. The L2 norm is a great loss
function because it is very curved near the target, and algorithms can
use this fact to take smaller, more careful steps the closer they get
to the target, as follows:
l2_y_vals = tf.square(target - x_vals)
l2_y_out = sess.run(l2_y_vals)
Note
TensorFlow has a built-in form of the L2 norm, called nn.l2_loss(). This
function is actually half the L2 norm above. In other words, it is the
same as the previous function but divided by 2.
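As a quick sanity check on that factor of a half, here is the same computation in plain numpy (a sketch, assuming a zero target as in the recipe):
import numpy as np
x = np.array([0.5, -0.3, 2.0]) # hypothetical predictions
target = 0.
l2 = np.square(target - x) # the element-wise L2 loss above
half_l2 = np.sum(np.square(x - target)) / 2. # what nn.l2_loss() returns
print(np.sum(l2) / 2. == half_l2) # True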
2.
The L1 norm loss is also known as the absolute loss function. Instead
of squaring the difference, we take the absolute value. The L1 norm is
better for outliers than the L2 norm because it is not as steep for larger
values. One issue to be aware of is that the L1 norm is not smooth at
the target and this can result in algorithms not converging well. It
appears as follows:
l1_y_vals = tf.abs(target - x_vals)
l1_y_out = sess.run(l1_y_vals)
3.
Pseudo-Huber loss is a continuous and smooth approximation to the
Huber loss function. This loss function attempts to take the best of
the L1 and L2 norms by being convex near the target and less steep
for extreme values. The form depends on an extra parameter, delta,
which dictates how steep it will be. We will plot two forms, delta1 =
0.25 and delta2 = 5 to show the difference, as follows:
delta1 = tf.constant(0.25)
phuber1_y_vals = tf.mul(tf.square(delta1), tf.sqrt(1. + tf.square((target - x_vals)/delta1)) - 1.)
phuber1_y_out = sess.run(phuber1_y_vals)
delta2 = tf.constant(5.)
phuber2_y_vals = tf.mul(tf.square(delta2), tf.sqrt(1. + tf.square((target - x_vals)/delta2)) - 1.)
phuber2_y_out = sess.run(phuber2_y_vals)
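The same formula is easy to write in plain numpy, which makes its behavior explicit: roughly quadratic (L2-like) within delta of the target and roughly linear (L1-like) beyond it (a sketch; the function name is ours):
import numpy as np

def pseudo_huber(x, target, delta):
    # delta**2 * (sqrt(1 + ((target - x)/delta)**2) - 1)
    return delta**2 * (np.sqrt(1. + np.square((target - x) / delta)) - 1.)

x = np.array([-1., -0.5, 0., 0.5, 1.])
print(pseudo_huber(x, target=0., delta=0.25)) # nearly linear away from the target
print(pseudo_huber(x, target=0., delta=5.))   # nearly quadratic near the target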
4.
Classification loss functions are used to evaluate loss when predicting
categorical outcomes.
5.
We will need to redefine our predictions (x_vals) and target. We will
save the outputs and plot them in the next section. Use the following:
x_vals = tf.linspace(-3., 5., 500)
target = tf.constant(1.)
targets = tf.fill([500,], 1.)
6.
Hinge loss is mostly used for support vector machines, but can be used
in neural networks as well. It is meant to compute a loss between two
target classes, 1 and -1. In the following code, we are using the target
value 1, so the closer our predictions are to 1, the lower the loss
value:
hinge_y_vals = tf.maximum(0., 1. - tf.mul(target, x_vals))
hinge_y_out = sess.run(hinge_y_vals)
7.
Cross-entropy loss for a binary case is also sometimes referred to as
the logistic loss function. It comes about when we are predicting the
two classes 0 or 1. We wish to measure a distance from the actual class
(0 or 1) to the predicted value, which is usually a real number between
0 and 1. To measure this distance, we can use the cross entropy
formula from information theory, as follows:
xentropy_y_vals = - tf.mul(target, tf.log(x_vals)) -
tf.mul((1. - target), tf.log(1. - x_vals))
xentropy_y_out = sess.run(xentropy_y_vals)
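The same formula written out in plain numpy makes the behavior easy to inspect (a sketch; note that predictions must lie strictly between 0 and 1 for the logarithms to be defined):
import numpy as np

def binary_cross_entropy(pred, target):
    # -t*log(p) - (1 - t)*log(1 - p)
    return -target * np.log(pred) - (1. - target) * np.log(1. - pred)

print(binary_cross_entropy(np.array([0.1, 0.5, 0.9]), target=1.))
# [ 2.30258509  0.69314718  0.10536052]: the loss shrinks as predictions approach the target class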
8.
Sigmoid cross entropy loss is very similar to the previous loss
function except we transform the x-values by the sigmoid function
before we put them in the cross entropy loss, as follows:
xentropy_sigmoid_y_vals =
tf.nn.sigmoid_cross_entropy_with_logits(x_vals, targets)
xentropy_sigmoid_y_out = sess.run(xentropy_sigmoid_y_vals)
9.
Weighted cross entropy loss is a weighted version of the sigmoid
cross entropy loss. We provide a weight on the positive target. For
an example, we will weight the positive target by 0.5, as follows:
weight = tf.constant(0.5)
xentropy_weighted_y_vals =
tf.nn.weighted_cross_entropy_with_logits(x_vals, targets,
weight)
xentropy_weighted_y_out = sess.run(xentropy_weighted_y_vals)
10.
Softmax cross-entropy loss operates on non-normalized outputs. This
function is used to measure a loss when there is only one target
category instead of multiple. Because of this, the function transforms
the outputs into a probability distribution via the softmax function and
then computes the loss function from a true probability distribution,
as follows:
unscaled_logits = tf.constant([[1., -3., 10.]])
target_dist = tf.constant([[0.1, 0.02, 0.88]])
softmax_xentropy =
tf.nn.softmax_cross_entropy_with_logits(unscaled_logits,
target_dist)
print(sess.run(softmax_xentropy))
[ 1.16012561]
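We can verify this number by hand with numpy: apply the softmax function to the unscaled logits, and then compute the cross entropy against the target distribution (a sketch reproducing the value above):
import numpy as np
logits = np.array([1., -3., 10.])
target_dist = np.array([0.1, 0.02, 0.88])
softmax = np.exp(logits) / np.sum(np.exp(logits)) # normalize into a probability distribution
print(-np.sum(target_dist * np.log(softmax)))     # ~1.16012..., matching the output above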
11.
Sparse softmax cross-entropy loss is the same as previously, except
instead of the target being a probability distribution, it is an index of
which category is true. Instead of a sparse all-zero target vector with
one value of one, we just pass in the index of which category is the
true value, as follows:
unscaled_logits = tf.constant([[1., -3., 10.]])
sparse_target_dist = tf.constant([2])
sparse_xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(unscaled_logits, sparse_target_dist)
print(sess.run(sparse_xentropy))
[ 0.00012564]
How it works…
Here is how to use matplotlib to plot the regression loss functions:
x_array = sess.run(x_vals)
plt.plot(x_array, l2_y_out, 'b-', label='L2 Loss')
plt.plot(x_array, l1_y_out, 'r--', label='L1 Loss')
plt.plot(x_array, phuber1_y_out, 'k-.', label='P-Huber Loss (0.25)')
plt.plot(x_array, phuber2_y_out, 'g:', label='P-Huber Loss (5.0)')
plt.ylim(-0.2, 0.4)
plt.legend(loc='lower right', prop={'size': 11})
plt.show()
Figure 4: Plotting various regression loss functions.
And here is how to use matplotlib to plot the various classification loss
functions:
x_array = sess.run(x_vals)
plt.plot(x_array, hinge_y_out, 'b-', label='Hinge Loss')
plt.plot(x_array, xentropy_y_out, 'r--', label='Cross Entropy
Loss')
plt.plot(x_array, xentropy_sigmoid_y_out, 'k-.', label='Cross Entropy Sigmoid Loss')
plt.plot(x_array, xentropy_weighted_y_out, 'g:', label='Weighted Cross Entropy Loss (x0.5)')
plt.ylim(-1.5, 3)
plt.legend(loc='lower right', prop={'size': 11})
plt.show()
Figure 5: Plots of classification loss functions.
There's more…
Here is a table summarizing the different loss functions that we have
described:

Loss function   Use             Benefits                               Disadvantages
L2              Regression      More stable                            Less robust
L1              Regression      More robust                            Less stable
Pseudo-Huber    Regression      More robust and stable                 One more parameter
Hinge           Classification  Creates a max margin for use in SVM    Unbounded loss affected by outliers
Cross-entropy   Classification  More stable                            Unbounded loss, less robust
The remaining classification loss functions all have to do with the type of
cross-entropy loss. The cross-entropy sigmoid loss function is for use on
unscaled logits and is preferred over computing the sigmoid, and then the
cross entropy, because TensorFlow has better built-in ways to handle
numerical edge cases. The same goes for softmax cross entropy and
sparse softmax cross entropy.
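To see why the fused version is preferred, here is the standard numerically stable reformulation in numpy (a sketch of the usual trick: for logits x and targets z, the sigmoid cross entropy equals max(x, 0) - x*z + log(1 + exp(-|x|)), which never exponentiates a large positive number):
import numpy as np

def stable_sigmoid_xent(logits, targets):
    # Algebraically equal to -z*log(sigmoid(x)) - (1-z)*log(1-sigmoid(x)),
    # but safe for logits of large magnitude
    return np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))

print(stable_sigmoid_xent(np.array([-100., 0., 100.]), np.array([0., 1., 1.])))
# [ 0.  0.69314718  0.] with no overflow, where the naive formula would fail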
Note
Most of the classification loss functions described here are for two class
predictions. This can be extended to multiple classes via summing the
cross entropy terms over each prediction/target.
There are also many other metrics to look at when evaluating a model.
Here is a list of some more to consider:

Model metric       Description
R-squared          For linear models, this is the proportion of variance in the dependent
(coefficient of    variable that is explained by the independent data.
determination)
RMSE (root mean    For continuous models, this measures the difference between predictions
squared error)     and actuals via the square root of the average squared error.
Confusion matrix   For categorical models, we look at a matrix of predicted categories
                   versus actual categories. A perfect model has all the counts along
                   the diagonal.
Recall             For categorical models, this is the fraction of true positives over
                   all actual positives.
Precision          For categorical models, this is the fraction of true positives over
                   all predicted positives.
F-score            For categorical models, this is the harmonic mean of precision and
                   recall.
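These categorical metrics are straightforward to compute from counts of true and false positives. Here is a minimal numpy sketch (the data is hypothetical, purely for illustration):
import numpy as np

preds  = np.array([1, 1, 0, 1, 0, 0, 1, 0]) # hypothetical predictions
truths = np.array([1, 0, 0, 1, 1, 0, 1, 0]) # hypothetical actual labels

tp = np.sum((preds == 1) & (truths == 1)) # true positives
fp = np.sum((preds == 1) & (truths == 0)) # false positives
fn = np.sum((preds == 0) & (truths == 1)) # false negatives

precision = tp / float(tp + fp) # true positives over all predicted positives
recall = tp / float(tp + fn)    # true positives over all actual positives
f_score = 2. * precision * recall / (precision + recall)
print(precision, recall, f_score) # 0.75 0.75 0.75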
Implementing Back Propagation
One of the benefits of using TensorFlow, is that it can keep track of
operations and automatically update model variables based on back
propagation. In this recipe, we will introduce how to use this aspect to our
advantage when training machine learning models.
Getting ready
Now we will introduce how to change our variables in the model in such a
way that a loss function is minimized. We have learned about how to use
objects and operations, and create loss functions that will measure the
distance between our predictions and targets. Now we just have to tell
TensorFlow how to back propagate errors through our computational graph
to update the variables and minimize the loss function. This is done via
declaring an optimization function. Once we have an optimization function
declared, TensorFlow will go through and figure out the back propagation
terms for all of our computations in the graph. When we feed data in and
minimize the loss function, TensorFlow will modify our variables in the
graph accordingly.
For this recipe, we will do a very simple regression algorithm. We will
sample random numbers from a normal, with mean 1 and standard
deviation 0.1. Then we will run the numbers through one operation, which
will be to multiply them by a variable, A. From this, the loss function will
be the L2 norm between the output and the target, which will always be
the value 10. Theoretically, the best value for A will be the number 10
since our data will have mean 1.
The second example is a very simple binary classification algorithm. Here
we will generate 100 numbers from two normal distributions, N(-1,1) and
N(3,1). All the numbers from N(-1, 1) will be in target class 0, and all the
numbers from N(3, 1) will be in target class 1. The model to differentiate
these numbers will be a sigmoid function of a translation. In other words,
the model will be sigmoid (x + A) where A is a variable we will fit.
Theoretically, A will be equal to -1. We arrive at this number because if
m1 and m2 are the means of the two normal functions, the value added to
them to translate them equidistant to zero will be -(m1+m2)/2. We will see
how TensorFlow can arrive at that number in the second example.
While specifying a good learning rate helps the convergence of algorithms,
we must also specify a type of optimization. From the preceding two
examples, we are using standard gradient descent. This is implemented
with the TensorFlow function GradientDescentOptimizer().
How to do it…
Here is how the regression example works:
1. We start by loading the numerical Python package, numpy and
tensorflow:
import numpy as np
import tensorflow as tf
2. Now we start a graph session:
sess = tf.Session()
3. Next we create the data, placeholders, and the A variable:
x_vals = np.random.normal(1, 0.1, 100)
y_vals = np.repeat(10., 100)
x_data = tf.placeholder(shape=[1], dtype=tf.float32)
y_target = tf.placeholder(shape=[1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1]))
4. We add the multiplication operation to our graph:
my_output = tf.mul(x_data, A)
5. Next we add our L2 loss function between the multiplication output
and the target data:
loss = tf.square(my_output - y_target)
6.
Before we can run anything, we have to initialize the variables:
init = tf.initialize_all_variables()
sess.run(init)
7.
Now we have to declare a way to optimize the variables in our graph.
We declare an optimizer algorithm. Most optimization algorithms need
to know how far to step in each iteration. This distance is controlled
by the learning rate. If our learning rate is too big, our algorithm might
overshoot the minimum, but if our learning rate is too small, our
algorithm might take too long to converge; this is related to the
vanishing and exploding gradient problem. The learning rate has a big
influence on convergence and we will discuss this at the end of the
section. While here we use the standard gradient descent algorithm,
there are many different optimization algorithms that operate
differently and can do better or worse depending on the problem. For
a great overview of different optimization algorithms, see the paper by
Sebastian Ruder in the See Also section at the end of this recipe:
my_opt =
tf.train.GradientDescentOptimizer(learning_rate=0.02)
train_step = my_opt.minimize(loss)
Note
There is much theory on what learning rates are best. This is one of the
harder things to know and figure out in machine learning algorithms.
Good papers to read about how learning rates are related to specific
optimization algorithms are listed in the There's more… section at the
end of this recipe.
8. The final step is to loop through our training algorithm and tell
TensorFlow to train many times. We will do this 100 times and print
out results every 25th iteration. To train, we will select a random x and
y entry and feed it through the graph. TensorFlow will automatically
compute the loss, and slightly change the variable A to minimize the loss:
for i in range(100):
    rand_index = np.random.choice(100)
    rand_x = [x_vals[rand_index]]
    rand_y = [y_vals[rand_index]]
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    if (i+1)%25==0:
        print('Step #' + str(i+1) + ' A = ' + str(sess.run(A)))
        print('Loss = ' + str(sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})))
Here is the output:
Step #25 A = [ 6.23402166]
Loss = 16.3173
Step #50 A = [ 8.50733757]
Loss = 3.56651
Step #75 A = [ 9.37753201]
Loss = 3.03149
Step #100 A = [ 9.80041122]
Loss = 0.0990248
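To demystify what GradientDescentOptimizer() is doing here, the same stochastic fit can be written out by hand in plain numpy. This is a sketch under the recipe's assumptions: the loss is (A*x - y)^2, so its derivative with respect to A is 2*x*(A*x - y), and each step moves A against that gradient:
import numpy as np

np.random.seed(0) # for a reproducible run; any seed works
x_vals = np.random.normal(1, 0.1, 100)
y_vals = np.repeat(10., 100)
A = np.random.normal() # random scalar starting point
learning_rate = 0.02

for i in range(100):
    idx = np.random.choice(100)
    x, y = x_vals[idx], y_vals[idx]
    grad = 2. * x * (A * x - y) # derivative of (A*x - y)**2 with respect to A
    A -= learning_rate * grad   # the gradient descent update
print(A) # approaches 10, just like the TensorFlow version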
9.
Now we will introduce the code for the simple classification example.
We can use the same TensorFlow script if we reset the graph first.
Remember we will attempt to find an optimal translation, A that will
translate the two distributions to the origin and the sigmoid function
will split the two into two different classes.
10.
First we reset the graph and reinitialize the graph session:
from tensorflow.python.framework import ops
ops.reset_default_graph()
sess = tf.Session()
11.
Next we will create the data from two different normal distributions,
N(-1, 1) and N(3, 1). We will also generate the target labels,
placeholders for the data, and the bias variable, A:
x_vals = np.concatenate((np.random.normal(-1, 1, 50),
np.random.normal(3, 1, 50)))
y_vals = np.concatenate((np.repeat(0., 50), np.repeat(1.,
50)))
x_data = tf.placeholder(shape=[1], dtype=tf.float32)
y_target = tf.placeholder(shape=[1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(mean=10, shape=[1]))
Note
Note that we initialized A to around the value 10, far from the
theoretical value of -1. We did this on purpose to show how the
algorithm converges from the value 10 to the optimal value, -1.
12.
Next we add the translation operation to the graph. Remember that we
do not have to wrap this in a sigmoid function because the loss
function will do that for us:
my_output = tf.add(x_data, A)
13. Because the specific loss function expects batches of data that have
an extra dimension associated with them (an added dimension which is
the batch number), we will add an extra dimension to the output with
the function expand_dims(). In the next section, we will discuss how to
use variable-sized batches in training. For now, we will again just use
one random data point at a time:
my_output_expanded = tf.expand_dims(my_output, 0)
y_target_expanded = tf.expand_dims(y_target, 0)
14.
Next we will initialize our one variable, A:
init = tf.initialize_all_variables()
sess.run(init)
15.
Now we declare our loss function. We will use a cross entropy with
unscaled logits that transforms them with a sigmoid function.
TensorFlow has this all in one function for us in the neural network
package called nn.sigmoid_cross_entropy_with_logits(). As stated
before, it expects the arguments to have specific dimensions, so we
have to use the expanded outputs and targets accordingly:
xentropy = tf.nn.sigmoid_cross_entropy_with_logits(
my_output_expanded, y_target_expanded)
16.
Just like the regression example, we need to add an optimizer function
to the graph so that TensorFlow knows how to update the bias variable
in the graph:
my_opt = tf.train.GradientDescentOptimizer(0.05)
train_step = my_opt.minimize(xentropy)
17.
Finally, we loop through a randomly selected data point several
hundred times and update the variable A accordingly. Every 200
iterations, we will print out the value of A and the loss:
for i in range(1400):
    rand_index = np.random.choice(100)
    rand_x = [x_vals[rand_index]]
    rand_y = [y_vals[rand_index]]
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    if (i+1)%200==0:
        print('Step #' + str(i+1) + ' A = ' + str(sess.run(A)))
        print('Loss = ' + str(sess.run(xentropy, feed_dict={x_data: rand_x, y_target: rand_y})))
Step #200 A = [ 3.59597969]
Loss = [[ 0.00126199]]
Step #400 A = [ 0.50947344]
Loss = [[ 0.01149425]]
Step #600 A = [-0.50994617]
Loss = [[ 0.14271219]]
Step #800 A = [-0.76606178]
Loss = [[ 0.18807337]]
Step #1000 A = [-0.90859312]
Loss = [[ 0.02346182]]
Step #1200 A = [-0.86169094]
Loss = [[ 0.05427232]]
Step #1400 A = [-1.08486211]
Loss = [[ 0.04099189]]
How it works…
As a recap, for both examples, we did the following:
1. Created the data.
2. Initialized placeholders and variables.
3. Created a loss function.
4. Defined an optimization algorithm.
5. And finally, iterated across random data samples to iteratively update
our variables.
There's more…
We've mentioned before that the optimization algorithm is sensitive to the
choice of the learning rate. It is important to summarize the effect of this
choice in a concise manner:
Learning rate size      Advantages/Disadvantages                 Uses
Smaller learning rate   Converges slower but gives more          If the solution is unstable, try
                        accurate results.                        lowering the learning rate first.
Larger learning rate    Less accurate, but converges faster.     For some problems, helps prevent
                                                                 solutions from stagnating.
Sometimes the standard gradient descent algorithm can get stuck or slow
down significantly. This can happen when the optimization is stuck in the
flat spot of a saddle. To combat this, there is another algorithm that takes
into account a momentum term, which adds on a fraction of the prior step's
gradient descent value. TensorFlow has this built in with the
MomentumOptimizer() function.
Another variant is to vary the optimizer step for each variable in our
models. Ideally, we would like to take larger steps for smaller moving
variables and shorter steps for faster changing variables. We will not go
into the mathematics of this approach, but a common implementation of
this idea is called the Adagrad algorithm. This algorithm takes into account
the whole history of the variable gradients. Again, the function in
TensorFlow for this is called AdagradOptimizer().
Sometimes, Adagrad forces the gradients to zero too soon because it takes
into account the whole history. A solution to this is to limit how many steps
we use. Doing this is called the Adadelta algorithm. We can apply this by
using the function AdadeltaOptimizer().
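As a rough sketch of how these variants modify the basic update rule (simplified scalar versions for intuition only; the actual TensorFlow implementations differ in details):
import numpy as np

def momentum_step(A, grad, velocity, lr=0.02, mu=0.9):
    # Keep a fraction (mu) of the previous step's direction
    velocity = mu * velocity + lr * grad
    return A - velocity, velocity

def adagrad_step(A, grad, hist, lr=0.02, eps=1e-8):
    # Accumulate squared gradients; variables with a large gradient
    # history automatically take smaller steps
    hist = hist + grad**2
    return A - lr * grad / (np.sqrt(hist) + eps), hist

A, v = momentum_step(A=5., grad=2., velocity=0.)
A2, h = adagrad_step(A=5., grad=2., hist=0.)
print(A, A2) # 4.96 4.98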
There are a few other implementations of different gradient descent
algorithms. For these, we would refer the reader to the TensorFlow
documentation at:
https://www.tensorflow.org/api_docs/python/train/optimizers .
See also
For some references on optimization algorithms and learning rates, see the
following papers and articles:
Kingma, D., Jimmy, L. Adam: A Method for Stochastic Optimization.
ICLR 2015. https://arxiv.org/pdf/1412.6980.pdf
Ruder, S. An Overview of Gradient Descent Optimization Algorithms.
2016. https://arxiv.org/pdf/1609.04747v1.pdf
Zeiler, M. ADADelta: An Adaptive Learning Rate Method. 2012.
http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf
Working with Batch and
Stochastic Training
While TensorFlow updates our model variables through back propagation
as described in the prior recipe, it can operate on anywhere from a
single data observation to a large group of data at once. Operating on
one training example can make for a very erratic learning process, while
using too large a batch can be computationally expensive. Choosing the right type of
training is crucial to getting our machine learning algorithms to converge to
a solution.
Getting ready
In order for TensorFlow to compute the variable gradients for back
propagation to work, we have to measure the loss on a sample or multiple
samples. Stochastic training is only putting through one randomly sampled
data-target pair at a time, just like we did in the previous recipe. Another
option is to put a larger portion of the training examples in at a time and
average the loss for the gradient calculation. Batch training size can vary
up to and including the whole dataset at once. Here we will show how to
extend the prior regression example, which used stochastic training to
batch training.
We will start by loading numpy, matplotlib, and tensorflow and start a
graph session, as follows:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
sess = tf.Session()
How to do it…
1. We will start by declaring a batch size. This will be how many data
observations we will feed through the computational graph at one
time:
batch_size = 20
2. Next we declare the data, placeholders, and the variable in the model.
The change we make here is that we change the shape of the
placeholders. They are now two-dimensional: the first dimension is
None, which lets the batch size vary, and the second is the number of
features in each data point (here, 1). We could have explicitly set the
first dimension to 20, but we can generalize and use the None value.
Again, as mentioned in Chapter 1, Getting Started with TensorFlow, we
still have to make sure that the dimensions work out in the model, and
this does not allow us to perform any illegal matrix operations:
x_vals = np.random.normal(1, 0.1, 100)
y_vals = np.repeat(10., 100)
x_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1,1]))
3. Now we add our operation to the graph, which will now be matrix
multiplication instead of regular multiplication. Remember that matrix
multiplication is not commutative, so we have to enter the matrices in
the correct order in the matmul() function:
my_output = tf.matmul(x_data, A)
4.
Our loss function will change because we have to take the mean of all
the L2 losses of each data point in the batch. We do this by wrapping
our prior loss output in TensorFlow's reduce_mean() function:
loss = tf.reduce_mean(tf.square(my_output - y_target))
5.
We declare our optimizer just like we did before:
my_opt = tf.train.GradientDescentOptimizer(0.02)
train_step = my_opt.minimize(loss)
6. Finally, we will loop through and iterate on the training step to
optimize the algorithm. This part is different than before because we
want to be able to plot the batch loss against the stochastic training
loss to compare convergence. So we initialize a list to store the loss
value every five iterations:
loss_batch = []
for i in range(100):
    rand_index = np.random.choice(100, size=batch_size)
    rand_x = np.transpose([x_vals[rand_index]])
    rand_y = np.transpose([y_vals[rand_index]])
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    if (i+1)%5==0:
        print('Step #' + str(i+1) + ' A = ' + str(sess.run(A)))
        temp_loss = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
        print('Loss = ' + str(temp_loss))
        loss_batch.append(temp_loss)
7.
Here is the final output of the 100 iterations. Notice that the value of
A has an extra dimension because it now has to be a 2D matrix:
Step #100 A = [[ 9.86720943]]
Loss = 0.
How it works…
Batch training and stochastic training differ in their optimization method
and their convergence. Finding a good batch size can be difficult. To see
how convergence differs between batch and stochastic, here is the code to
plot the batch loss from above. There is also a variable here that contains
the stochastic loss, but that computation follows from the prior section in
this chapter. Here is the code to save and record the stochastic loss in the
training loop. Just substitute this code in the prior recipe:
loss_stochastic = []
for i in range(100):
    rand_index = np.random.choice(100)
    rand_x = [x_vals[rand_index]]
    rand_y = [y_vals[rand_index]]
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    if (i+1)%5==0:
        print('Step #' + str(i+1) + ' A = ' + str(sess.run(A)))
        temp_loss = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
        print('Loss = ' + str(temp_loss))
        loss_stochastic.append(temp_loss)
Here is the code to produce the plot of both the stochastic and batch loss
for the same regression problem:
plt.plot(range(0, 100, 5), loss_stochastic, 'b-', label='Stochastic Loss')
plt.plot(range(0, 100, 5), loss_batch, 'r--', label='Batch Loss, size=20')
plt.legend(loc='upper right', prop={'size': 11})
plt.show()
Figure 6: Stochastic loss and batch loss (batch size = 20) plotted over
100 iterations. Note that the batch loss is much smoother and the
stochastic loss is much more erratic.
There's more…
Type of training   Advantages                                Disadvantages
Stochastic         Randomness may help move out of local     Generally, needs more iterations to
                   minimums.                                 converge.
Batch              Finds minimums quicker.                   Takes more resources to compute.
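The smoothing effect that batching produces is easy to see directly: averaging per-sample gradients shrinks their spread by roughly the square root of the batch size. A quick numpy illustration, reusing the toy regression gradient 2*x*(A*x - y) from this chapter (a sketch with an arbitrary fixed A):
import numpy as np

np.random.seed(0)
x_vals = np.random.normal(1, 0.1, 100)
y_vals = np.repeat(10., 100)
A = 0. # an arbitrary fixed parameter value
grads = 2. * x_vals * (A * x_vals - y_vals) # per-sample gradients of (A*x - y)**2

print(np.std(grads)) # spread of single-sample (stochastic) gradients, ~2
batch_grads = grads.reshape(5, 20).mean(axis=1) # average over batches of size 20
print(np.std(batch_grads)) # roughly 2/sqrt(20): much smoother update steps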
Combining Everything Together
In this section, we will combine everything we have illustrated so far and
create a classifier on the iris dataset.
Getting ready
The iris data set is described in more detail in the Working with Data
Sources recipe in Chapter 1, Getting Started with TensorFlow. We will
load this data, and do a simple binary classifier to predict whether a flower
is the species Iris setosa or not. To be clear, this dataset has three classes
of species, but we will only predict whether it is a single species (I. setosa)
or not, giving us a binary classifier. We will start by loading the libraries
and data, then transform the target accordingly.
How to do it…
1.
First we load the libraries needed and initialize the computational
graph. Note that we also load matplotlib here, because we would like
to plot the resulting line after:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
import tensorflow as tf
sess = tf.Session()
2.
Next we load the iris data. We will also need to transform the target
data to be just 1 or 0 if the target is setosa or not. Since the iris data set
marks setosa as a zero, we will change all targets with the value 0 to 1,
and the other values all to 0. We will also only use two features, petal
length and petal width. These two features are the third and fourth
entry in each x-value:
iris = datasets.load_iris()
binary_target = np.array([1. if x==0 else 0. for x in
iris.target])
iris_2d = np.array([[x[2], x[3]] for x in iris.data])
3. Let's declare our batch size, data placeholders, and model variables.
Remember that the data placeholders for variable batch sizes have
None as the first dimension:
batch_size = 20
x1_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
x2_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1, 1]))
b = tf.Variable(tf.random_normal(shape=[1, 1]))
Note
Note that we can increase the performance (speed) of the algorithm by
decreasing the bytes used for floats, that is, by using dtype=tf.float32
instead of tf.float64, at a possible cost of precision.
4. Here we define the linear model. The model will take the form
x1 = x2*A + b, with petal length (x1) as a linear function of petal
width (x2). To find points above or below that line, we check whether
they are above or below zero when plugged into the expression
x1 - (x2*A + b). We will do this by taking the sigmoid of that quantity
and predicting 1 or 0 from it. Remember that TensorFlow has loss
functions with the sigmoid built in, so we just need to define the
output of the model prior to the sigmoid function:
my_mult = tf.matmul(x2_data, A)
my_add = tf.add(my_mult, b)
my_output = tf.sub(x1_data, my_add)
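Written out in plain numpy, the decision rule this graph encodes is just a sign test against the fitted line (an illustrative sketch; the function name is ours, and A and b stand for the fitted scalar values):
import numpy as np

def predict_setosa(petal_length, petal_width, A, b):
    # The fitted line is petal_length = A*petal_width + b; the model's raw
    # output is petal_length - (A*petal_width + b), pushed through a sigmoid
    score = petal_length - (A * petal_width + b)
    prob = 1. / (1. + np.exp(-score))
    return (prob > 0.5).astype(np.float32)

# With roughly the final values from the training run below:
print(predict_setosa(1.4, 0.2, A=12.41, b=-6.35)) # 1.0, that is, I. setosa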
5.
Now we add our sigmoid cross-entropy loss function with
TensorFlow's built in function,
sigmoid_cross_entropy_with_logits():
xentropy = tf.nn.sigmoid_cross_entropy_with_logits(my_output,
y_target)
6.
We also have to tell TensorFlow how to optimize our computational
graph by declaring an optimizing method. We will want to minimize
the cross-entropy loss. We will also choose 0.05 as our learning rate:
my_opt = tf.train.GradientDescentOptimizer(0.05)
train_step = my_opt.minimize(xentropy)
7.
Now we create a variable initialization operation and tell TensorFlow
to execute it:
init = tf.initialize_all_variables()
sess.run(init)
8.
Now we will train our linear model with 1000 iterations. We will feed
in the three data points that we require: petal length, petal width, and
the target variable. Every 200 iterations we will print the variable
values:
for i in range(1000):
    rand_index = np.random.choice(len(iris_2d), size=batch_size)
    rand_x = iris_2d[rand_index]
    rand_x1 = np.array([[x[0]] for x in rand_x])
    rand_x2 = np.array([[x[1]] for x in rand_x])
    rand_y = np.array([[y] for y in binary_target[rand_index]])
    sess.run(train_step, feed_dict={x1_data: rand_x1, x2_data: rand_x2, y_target: rand_y})
    if (i+1)%200==0:
        print('Step #' + str(i+1) + ' A = ' + str(sess.run(A)) + ', b = ' + str(sess.run(b)))
Step #200 A = [[ 8.67285347]], b = [[-3.47147632]]
Step #400 A = [[ 10.25393486]], b = [[-4.62928772]]
Step #600 A = [[ 11.152668]], b = [[-5.4077611]]
Step #800 A = [[ 11.81016064]], b = [[-5.96689034]]
Step #1000 A = [[ 12.41202831]], b = [[-6.34769201]]
9.
The next set of commands extracts the model variables, and plots the
line on a graph. The resulting graph is in the next section:
[[slope]] = sess.run(A)
[[intercept]] = sess.run(b)
x = np.linspace(0, 3, num=50)
ablineValues = []
for i in x:
    ablineValues.append(slope*i+intercept)
setosa_x = [a[1] for i,a in enumerate(iris_2d) if binary_target[i]==1]
setosa_y = [a[0] for i,a in enumerate(iris_2d) if binary_target[i]==1]
non_setosa_x = [a[1] for i,a in enumerate(iris_2d) if binary_target[i]==0]
non_setosa_y = [a[0] for i,a in enumerate(iris_2d) if binary_target[i]==0]
plt.plot(setosa_x, setosa_y, 'rx', ms=10, mew=2, label='setosa')
plt.plot(non_setosa_x, non_setosa_y, 'ro', label='Non-setosa')
plt.plot(x, ablineValues, 'b-')
plt.xlim([0.0, 2.7])
plt.ylim([0.0, 7.1])
plt.suptitle('Linear Separator For I. setosa', fontsize=20)
plt.xlabel('Petal Width')
plt.ylabel('Petal Length')
plt.legend(loc='lower right')
plt.show()
How it works…
Our goal was to fit a line between the I.setosa points and the other two
species using only petal width and petal length. If we plot the points and
the resulting line, we see that we have achieved the following:
Figure 7: Plot of I.setosa and non-setosa for petal width vs petal length.
The solid line is the linear separator that we achieved after 1,000
iterations.
There's more…
While we achieved our objective of separating the two classes with a line,
it may not be the best model for doing so. In Chapter 4, Support Vector
Machines, we will discuss support vector machines, which give a better
way of separating two classes in a feature space.
See also
For more information on the iris dataset, see the Wikipedia entry,
https://en.wikipedia.org/wiki/Iris_flower_data_set . For information about
the Scikit Learn iris dataset implementation, see the documentation at
http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html .
Evaluating Models
We have learned how to train a regression and classification algorithm in
TensorFlow. After this is accomplished, we must be able to evaluate the
model's predictions to determine how well it did.
Getting ready
Evaluating models is very important and every subsequent model will have
some form of model evaluation. Using TensorFlow, we must build this
feature into the computational graph and call it during and/or after our
model is training.
Evaluating models during training gives us insight into the algorithm and
may give us hints to debug it, improve it, or change models entirely. While
evaluation during training isn't always necessary, we will show how to do
this with both regression and classification.
After training, we need to quantify how the model performs on the data.
Ideally, we have a separate training and test set (and even a validation set)
on which we can evaluate the model.
When we want to evaluate a model, we will want to do so on a large batch
of data points. If we have implemented batch training, we can reuse our
model to make a prediction on such a batch. If we have implemented
stochastic training, we may have to create a separate evaluator that can
process data in batches.
Note
If we included a transformation on our model output in the loss function,
for example, sigmoid_cross_entropy_with_logits(), we must take that
into account when computing predictions for accuracy calculations. Don't
forget to include this in our evaluation of the model.
How to do it…
Regression models attempt to predict a continuous number. The target is
not a category, but a desired number. To evaluate these regression
predictions against the actual targets, we need an aggregate measure of the
distance between the two. Most of the time, a meaningful loss function
will satisfy these criteria. Here we show how to change the simple
regression algorithm from above so that it prints out the loss in the
training loop and evaluates the loss at the end. For an example, we will
revisit and rewrite our regression example in the prior Implementing
Back Propagation recipe in this chapter.
Classification models predict a category based on numerical inputs. The
actual targets are a sequence of 1s and 0s and we must have a measure of
how close we are to the truth from our predictions. The loss function for
classification models usually isn't that helpful in interpreting how well our
model is doing. Usually, we want some sort of classification accuracy,
which is commonly the percentage of correctly predicted categories. For
this example, we will use the classification example from the prior
Implementing Back Propagation recipe in this chapter.
How it works…
First we will show how to evaluate the simple regression model that simply
fits a constant multiplication to the target of 10, as follows:
1. First we start by loading the libraries and creating the graph, data,
variables, and placeholders. There is an additional part to this section
that is very important. After we create the data, we will split it
randomly into training and testing datasets. This is important because
we always want to test our models on data they have not seen.
Evaluating the model on both the training data and the test data also
lets us see whether the model is overfitting or not:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
sess = tf.Session()
x_vals = np.random.normal(1, 0.1, 100)
y_vals = np.repeat(10., 100)
x_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
batch_size = 25
train_indices = np.random.choice(len(x_vals),
round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) -
set(train_indices)))
x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]
A = tf.Variable(tf.random_normal(shape=[1,1]))
2.
Now we declare our model, loss function, and optimization algorithm.
We will also initialize the model variable A. Use the following code:
my_output = tf.matmul(x_data, A)
loss = tf.reduce_mean(tf.square(my_output - y_target))
init = tf.initialize_all_variables()
sess.run(init)
my_opt = tf.train.GradientDescentOptimizer(0.02)
train_step = my_opt.minimize(loss)
3.
We run the training loop just as we would before, as follows:
for i in range(100):
    rand_index = np.random.choice(len(x_vals_train), size=batch_size)
    rand_x = np.transpose([x_vals_train[rand_index]])
    rand_y = np.transpose([y_vals_train[rand_index]])
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    if (i+1)%25==0:
        print('Step #' + str(i+1) + ' A = ' + str(sess.run(A)))
        print('Loss = ' + str(sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})))
Step #25 A = [[ 6.39879179]]
Loss = 13.7903
Step #50 A = [[ 8.64770794]]
Loss = 2.53685
Step #75 A = [[ 9.40029907]]
Loss = 0.818259
Step #100 A = [[ 9.6809473]]
Loss = 1.10908
4. Now, to evaluate the model, we will output the MSE (loss function) on the training and test sets, as follows:
mse_test = sess.run(loss, feed_dict={x_data:
np.transpose([x_vals_test]), y_target:
np.transpose([y_vals_test])})
mse_train = sess.run(loss, feed_dict={x_data:
np.transpose([x_vals_train]), y_target:
np.transpose([y_vals_train])})
print('MSE on test:' + str(np.round(mse_test, 2)))
print('MSE on train:' + str(np.round(mse_train, 2)))
MSE on test:1.35
MSE on train:0.88
5. For the classification example, we will do something very similar. This time, we will need to create our own accuracy function that we can call at the end. One reason for this is that our loss function has the sigmoid built in, and we will need to call the sigmoid separately and test it to see if our classes are correct.
6. In the same script, we can just reload the graph and create our data, variables, and placeholders. Remember that we will also need to separate the data and targets into training and testing sets. Use the following code:
from tensorflow.python.framework import ops
ops.reset_default_graph()
sess = tf.Session()
batch_size = 25
x_vals = np.concatenate((np.random.normal(-1, 1, 50),
np.random.normal(2, 1, 50)))
y_vals = np.concatenate((np.repeat(0., 50), np.repeat(1.,
50)))
x_data = tf.placeholder(shape=[1, None], dtype=tf.float32)
y_target = tf.placeholder(shape=[1, None], dtype=tf.float32)
train_indices = np.random.choice(len(x_vals),
round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) -
set(train_indices)))
x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]
A = tf.Variable(tf.random_normal(mean=10, shape=[1]))
7. We will now add the model and the loss function to the graph, initialize variables, and create the optimization procedure, as follows:
my_output = tf.add(x_data, A)
init = tf.initialize_all_variables()
sess.run(init)
xentropy = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(my_output, y_target))
my_opt = tf.train.GradientDescentOptimizer(0.05)
train_step = my_opt.minimize(xentropy)
8. Now we run our training loop, as follows:
for i in range(1800):
rand_index = np.random.choice(len(x_vals_train),
size=batch_size)
rand_x = [x_vals_train[rand_index]]
rand_y = [y_vals_train[rand_index]]
sess.run(train_step, feed_dict={x_data: rand_x, y_target:
rand_y})
if (i+1)%200==0:
print('Step #' + str(i+1) + ' A = ' +
str(sess.run(A)))
print('Loss = ' + str(sess.run(xentropy, feed_dict=
{x_data: rand_x, y_target: rand_y})))
Step #200 A = [ 6.64970636]
Loss = 3.39434
Step #400 A = [ 2.2884655]
Loss = 0.456173
Step #600 A = [ 0.29109824]
Loss = 0.312162
Step #800 A = [-0.20045301]
Loss = 0.241349
Step #1000 A = [-0.33634067]
Loss = 0.376786
Step #1200 A = [-0.36866501]
Loss = 0.271654
Step #1400 A = [-0.3727718]
Loss = 0.294866
Step #1600 A = [-0.39153299]
Loss = 0.202275
Step #1800 A = [-0.36630616]
Loss = 0.358463
9. To evaluate the model, we will create our own prediction operation. We wrap the prediction operation in a squeeze function because we want to make the predictions and targets the same shape. Then we test for equality with the equal function. After that, we are left with a tensor of true and false values that we cast to float32 and take the mean of. This will result in an accuracy value. We will evaluate this function for both the training and testing sets, as follows:
y_prediction =
tf.squeeze(tf.round(tf.nn.sigmoid(tf.add(x_data, A))))
correct_prediction = tf.equal(y_prediction, y_target)
accuracy = tf.reduce_mean(tf.cast(correct_prediction,
tf.float32))
acc_value_test = sess.run(accuracy, feed_dict={x_data:
[x_vals_test], y_target: [y_vals_test]})
acc_value_train = sess.run(accuracy, feed_dict={x_data:
[x_vals_train], y_target: [y_vals_train]})
print('Accuracy on train set: ' + str(acc_value_train))
print('Accuracy on test set: ' + str(acc_value_test))
Accuracy on train set: 0.925
Accuracy on test set: 0.95
10. Many times, seeing the model results (accuracy, MSE, and so on) will help us to evaluate the model. We can easily graph the model and data here because it is one-dimensional. Here is how to visualize the model and data with two separate histograms using matplotlib:
A_result = sess.run(A)
bins = np.linspace(-5, 5, 50)
plt.hist(x_vals[0:50], bins, alpha=0.5, label='N(-1,1)', color='white')
plt.hist(x_vals[50:100], bins[0:50], alpha=0.5, label='N(2,1)', color='red')
plt.plot((A_result, A_result), (0, 8), 'k--', linewidth=3, label='A = '+ str(np.round(A_result, 2)))
plt.legend(loc='upper right')
plt.title('Binary Classifier, Accuracy=' + str(np.round(acc_value_test, 2)))
plt.show()
Figure 8: Visualization of data and the end model, A. The two normal
values are centered at -1 and 2, making the theoretical best split at
0.5. Here the model found the best split very close to that number.
Chapter 3. Linear Regression
In this chapter, we will cover the basic recipes for implementing linear regression in TensorFlow. We will cover the following areas:
Using the Matrix Inverse Method
Implementing a Decomposition Method
Learning the TensorFlow Way of Linear Regression
Understanding Loss Functions in Linear Regression
Implementing Deming Regression
Implementing Lasso and Ridge Regression
Implementing Elastic Net Regression
Implementing Logistic Regression
Introduction
Linear regression may be one of the most important algorithms in statistics,
machine learning, and science in general. It's one of the most used
algorithms and it is very important to understand how to implement it and
its various flavors. One of the advantages that linear regression has over
many other algorithms is that it is very interpretable. We end up with a
number for each feature that directly represents how that feature
influences the target or dependent variable. In this chapter, we will
introduce how linear regression can be classically implemented, and then
move on to how to best implement it in TensorFlow. Remember that all the
code is available at GitHub online at
https://github.com/nfmcclure/tensorflow_cookbook .
Using the Matrix Inverse Method
In this recipe, we will use TensorFlow to solve two dimensional linear
regressions with the matrix inverse method.
Getting ready
Linear regression can be represented as a set of matrix equations, say Ax = b. Here we are interested in solving for the coefficients in the matrix x. We have to be careful if our observation matrix (design matrix) A is not square. The solution for x can be expressed as x = (A^T A)^(-1) A^T b. To show this is indeed the case, we will generate two-dimensional data, solve it in TensorFlow, and plot the result.
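Before doing this in TensorFlow, we can sanity-check the formula directly in NumPy; the following sketch uses tiny made-up data that lies exactly on y = 2x + 1:
import numpy as np
x = np.array([0., 1., 2., 3.])
y = 2.*x + 1.                               # points exactly on y = 2x + 1
A = np.column_stack((x, np.ones(4)))        # design matrix: x column, 1s column
b = y.reshape(-1, 1)
# the normal-equation solution x = (A^T A)^(-1) A^T b
solution = np.linalg.inv(A.T.dot(A)).dot(A.T).dot(b)
print(solution)                             # [[2.], [1.]]: slope and intercept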
How to do it…
1. First we load the necessary libraries, initialize the graph, and create the data, as follows:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
sess = tf.Session()
x_vals = np.linspace(0, 10, 100)
y_vals = x_vals + np.random.normal(0, 1, 100)
2. Next we create the matrices to use in the inverse method. We create the A matrix first, which will be a column of x-data and a column of 1s. Then we create the b matrix from the y-data. Use the following code:
x_vals_column = np.transpose(np.matrix(x_vals))
ones_column = np.transpose(np.matrix(np.repeat(1, 100)))
A = np.column_stack((x_vals_column, ones_column))
b = np.transpose(np.matrix(y_vals))
3. We then turn our A and b matrices into tensors, as follows:
A_tensor = tf.constant(A)
b_tensor = tf.constant(b)
4. Now that we have our matrices set up, we can use TensorFlow to solve this via the matrix inverse method, as follows:
tA_A = tf.matmul(tf.transpose(A_tensor), A_tensor)
tA_A_inv = tf.matrix_inverse(tA_A)
product = tf.matmul(tA_A_inv, tf.transpose(A_tensor))
solution = tf.matmul(product, b_tensor)
solution_eval = sess.run(solution)
5. We now extract the coefficients from the solution, the slope and the y-intercept, as follows:
slope = solution_eval[0][0]
y_intercept = solution_eval[1][0]
print('slope: ' + str(slope))
print('y_intercept: ' + str(y_intercept))
slope: 0.955707151739
y_intercept: 0.174366829314
best_fit = []
for i in x_vals:
best_fit.append(slope*i+y_intercept)
plt.plot(x_vals, y_vals, 'o', label='Data')
plt.plot(x_vals, best_fit, 'r-', label='Best fit line', linewidth=3)
plt.legend(loc='upper left')
plt.show()
Figure 1: Data points and a best-fit line obtained via the matrix
inverse method.
How it works…
Unlike prior recipes, or most recipes in this book, the solution here is found
exactly through matrix operations. Most TensorFlow algorithms that we
will use are implemented via a training loop and take advantage of
automatic back propagation to update model variables. Here, we illustrate
the versatility of TensorFlow by implementing a direct solution to fitting a
model to data.
Implementing a Decomposition
Method
For this recipe, we will implement a matrix decomposition method for
linear regression. Specifically we will use the Cholesky decomposition, for
which relevant functions exist in TensorFlow.
Getting ready
Implementing inverse methods in the previous recipe can be numerically
inefficient in most cases, especially when the matrices get very large.
Another approach is to decompose the A matrix and perform matrix
operations on the decompositions instead. One such approach is to use the
built-in Cholesky decomposition method in TensorFlow. One reason people
are so interested in decomposing a matrix into more matrices is because
the resulting matrices will have assured properties that allow us to use
certain methods efficiently. The Cholesky decomposition decomposes a matrix into a lower triangular matrix, L, and an upper triangular matrix, L^T, such that these matrices are transpositions of each other. For further information on the properties of this decomposition, there are many resources available that describe it and how to arrive at it. Here we will solve the system A^T Ax = A^T b by writing it as LL^T x = A^T b. We will first solve Lz = A^T b for z, and then solve L^T x = z to arrive at our coefficient matrix, x.
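As a sanity check outside TensorFlow, here is a sketch of the same two solves in NumPy (np.linalg.cholesky, like TensorFlow's cholesky(), returns the lower triangular factor; we use the general np.linalg.solve for brevity, even though both systems are triangular):
import numpy as np
x = np.linspace(0, 10, 100)
y = x + np.random.normal(0, 1, 100)
A = np.column_stack((x, np.ones(100)))
b = y.reshape(-1, 1)
AtA = A.T.dot(A)
L = np.linalg.cholesky(AtA)            # lower triangular, AtA = L L^T
z = np.linalg.solve(L, A.T.dot(b))     # first solve L z = A^T b
sol = np.linalg.solve(L.T, z)          # then solve L^T x = z
print(sol)                             # slope near 1, intercept near 0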
How to do it…
1. We will set up the system exactly in the same way as the previous
recipe. We will import libraries, initialize the graph, and create the
data. Then we will obtain our A matrix and b matrix in the same way as
before:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.python.framework import ops
ops.reset_default_graph()
sess = tf.Session()
x_vals = np.linspace(0, 10, 100)
y_vals = x_vals + np.random.normal(0, 1, 100)
x_vals_column = np.transpose(np.matrix(x_vals))
ones_column = np.transpose(np.matrix(np.repeat(1, 100)))
A = np.column_stack((x_vals_column, ones_column))
b = np.transpose(np.matrix(y_vals))
A_tensor = tf.constant(A)
b_tensor = tf.constant(b)
2. Next we will find the Cholesky decomposition of our square matrix, A^T A:
Note
Note that the TensorFlow function, cholesky(), only returns the lower triangular part of the decomposition. This is fine, as the upper triangular matrix is just the lower one, transposed.
tA_A = tf.matmul(tf.transpose(A_tensor), A_tensor)
L = tf.cholesky(tA_A)
tA_b = tf.matmul(tf.transpose(A_tensor), b)
sol1 = tf.matrix_solve(L, tA_b)
sol2 = tf.matrix_solve(tf.transpose(L), sol1)
3. Now that we have the solution, we extract the coefficients:
solution_eval = sess.run(sol2)
slope = solution_eval[0][0]
y_intercept = solution_eval[1][0]
print('slope: ' + str(slope))
print('y_intercept: ' + str(y_intercept))
slope: 0.956117676145
y_intercept: 0.136575513864
best_fit = []
for i in x_vals:
best_fit.append(slope*i+y_intercept)
plt.plot(x_vals, y_vals, 'o', label='Data')
plt.plot(x_vals, best_fit, 'r-', label='Best fit line', linewidth=3)
plt.legend(loc='upper left')
plt.show()
Figure 2: Data points and best-fit line obtained via Cholesky
decomposition.
How it works…
As you can see, we arrive at a very similar answer to the prior recipe.
Keep in mind that this way of decomposing a matrix, then performing our
operations on the pieces, is sometimes much more efficient and
numerically stable.
Learning The TensorFlow Way of
Linear Regression
Getting ready
In this recipe, we will loop through batches of data points and let
TensorFlow update the slope and y-intercept. Instead of generated data,
we will use the iris dataset that is built into Scikit-learn. Specifically,
we will find an optimal line through data points where the x-value is the
petal width and the y-value is the sepal length. We choose these two
because there appears to be a linear relationship between them, as we will
see in the graphs at the end. We will also talk more about the effects of
different loss functions in the next section, but for this recipe we will use
the L2 loss function.
How to do it…
1. We start by loading the necessary libraries, creating a graph, and
loading the data:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
from tensorflow.python.framework import ops
ops.reset_default_graph()
sess = tf.Session()
iris = datasets.load_iris()
x_vals = np.array([x[3] for x in iris.data])
y_vals = np.array([y[0] for y in iris.data])
2. We then declare our learning rate, batch size, placeholders, and model
variables:
learning_rate = 0.05
batch_size = 25
x_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
3. Next, we write the formula for the linear model, y = Ax + b:
model_output = tf.add(tf.matmul(x_data, A), b)
4. Then we declare our L2 loss function (which includes the mean over the batch), initialize the variables, and declare our optimizer. Note that we chose 0.05 as our learning rate:
loss = tf.reduce_mean(tf.square(y_target - model_output))
init = tf.global_variables_initializer()
sess.run(init)
my_opt = tf.train.GradientDescentOptimizer(learning_rate)
train_step = my_opt.minimize(loss)
5. We can now loop through and train the model on randomly selected batches. We will run it for 100 loops and print out the variable and loss values every 25 iterations. Note that here we are also saving the loss of every iteration so that we can view it afterwards:
loss_vec = []
for i in range(100):
rand_index = np.random.choice(len(x_vals),
size=batch_size)
rand_x = np.transpose([x_vals[rand_index]])
rand_y = np.transpose([y_vals[rand_index]])
sess.run(train_step, feed_dict={x_data: rand_x, y_target:
rand_y})
temp_loss = sess.run(loss, feed_dict={x_data: rand_x,
y_target: rand_y})
loss_vec.append(temp_loss)
if (i+1)%25==0:
print('Step #' + str(i+1) + ' A = ' +
str(sess.run(A)) + ' b = ' + str(sess.run(b)))
print('Loss = ' + str(temp_loss))
Step #25 A = [[ 2.17270374]] b = [[ 2.85338426]]
Loss = 1.08116
Step #50 A = [[ 1.70683455]] b = [[ 3.59916329]]
Loss = 0.796941
Step #75 A = [[ 1.32762754]] b = [[ 4.08189011]]
Loss = 0.466912
Step #100 A = [[ 1.15968263]] b = [[ 4.38497639]]
Loss = 0.281003
6. Next we will extract the coefficients we found and create a best-fit line to put in the graph:
[slope] = sess.run(A)
[y_intercept] = sess.run(b)
best_fit = []
for i in x_vals:
best_fit.append(slope*i+y_intercept)
7. Here we will create two plots. The first will be the data with the found line overlaid. The second is the L2 loss function over the 100 iterations:
plt.plot(x_vals, y_vals, 'o', label='Data Points')
plt.plot(x_vals, best_fit, 'r-', label='Best fit line', linewidth=3)
plt.legend(loc='upper left')
plt.title('Sepal Length vs Pedal Width')
plt.xlabel('Pedal Width')
plt.ylabel('Sepal Length')
plt.show()
plt.plot(loss_vec, 'k-')
plt.title('L2 Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('L2 Loss')
plt.show()
plt.show()
Figure 3: These are the data points from the iris dataset (sepal length versus petal width) overlaid with the optimal line fit found in TensorFlow with the specified algorithm.
Figure 4: Here is the L2 loss of fitting the data with our algorithm.
Note the jitter in the loss function; this can be decreased with a
larger batch size or increased with a smaller batch size.
Note
Here is a good place to note how to see whether the model is over- or underfitting the data. If our data is broken into test and train sets, and the accuracy is greater on the train set while going down on the test set, then we are overfitting the data. If the accuracy is still increasing on both the test and train sets, then the model is underfitting and we should continue training.
How it works…
The optimal line found is not guaranteed to be the best-fit line.
Convergence to the best-fit line depends on the number of iterations, batch
size, learning rate, and the loss function. It is always good practice to
observe the loss function over time as it can help us troubleshoot
problems or hyperparameter changes.
Understanding Loss Functions in
Linear Regression
It is important to know the effect of loss functions in algorithm
convergence. Here we will illustrate how the L1 and L2 loss functions
affect convergence in linear regression.
Getting ready
We will use the same iris dataset as in the prior recipe, but we will change
our loss functions and learning rates to see how convergence changes.
How to do it…
1. The start of the program is unchanged from before until we get to our loss function. We load the necessary libraries, start a session, load the data, create placeholders, and define our variables and model. One thing to note is that we are pulling out our learning rate and model iterations. We are doing this because we want to show the effect of quickly changing these parameters. Use the following code:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
sess = tf.Session()
iris = datasets.load_iris()
x_vals = np.array([x[3] for x in iris.data])
y_vals = np.array([y[0] for y in iris.data])
batch_size = 25
learning_rate = 0.1  # Will not converge with a learning rate of 0.4
iterations = 50
x_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
model_output = tf.add(tf.matmul(x_data, A), b)
2. Our loss function will change to the L1 loss, as follows:
loss_l1 = tf.reduce_mean(tf.abs(y_target - model_output))
Note
Note that we can change this back to the L2 loss by substituting in the following formula: tf.reduce_mean(tf.square(y_target - model_output)). The loss_vec_l2 values plotted in the next step come from repeating the same training loop with the L2 loss.
3. Now we resume by initializing the variables, declaring our optimizer, and looping through the training part. Note that we are also saving our loss at every generation to measure the convergence. Use the following code:
init = tf.global_variables_initializer()
sess.run(init)
my_opt_l1 = tf.train.GradientDescentOptimizer(learning_rate)
train_step_l1 = my_opt_l1.minimize(loss_l1)
loss_vec_l1 = []
for i in range(iterations):
rand_index = np.random.choice(len(x_vals),
size=batch_size)
rand_x = np.transpose([x_vals[rand_index]])
rand_y = np.transpose([y_vals[rand_index]])
sess.run(train_step_l1, feed_dict={x_data: rand_x,
y_target: rand_y})
temp_loss_l1 = sess.run(loss_l1, feed_dict={x_data:
rand_x, y_target: rand_y})
loss_vec_l1.append(temp_loss_l1)
if (i+1)%25==0:
print('Step #' + str(i+1) + ' A = ' +
str(sess.run(A)) + ' b = ' + str(sess.run(b)))
plt.plot(loss_vec_l1, 'k-', label='L1 Loss')
plt.plot(loss_vec_l2, 'r--', label='L2 Loss')
plt.title('L1 and L2 Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('L1 Loss')
plt.legend(loc='upper right')
plt.show()
How it works…
When choosing a loss function, we must also choose a corresponding
learning rate that will work with our problem. Here, we will illustrate two
situations, one in which L2 is preferred and one in which L1 is preferred.
If our learning rate is small, our convergence will take more time. But if
our learning rate is too large, we will have issues with our algorithm never
converging. Here is a plot of the L1 and L2 loss for the iris linear regression problem when the learning rate is 0.05:
Figure 5: Here is the L1 and L2 loss with a learning rate of 0.05 for the
iris linear regression problem.
With a learning rate of 0.05, it would appear that L2 loss is preferred, as it
converges to a lower loss on the data. Here is a graph of the loss
functions when we increase the learning rate to 0.4:
Figure 6: Shows the L1 and L2 loss on the iris linear regression problem
with a learning rate of 0.4. Note that the L1 loss is not visible because of
the high scale of the y-axis.
Here, we can see that the large learning rate can overshoot in the L2 norm,
whereas the L1 norm converges.
There's more…
To understand what is happening, we should look at how a large learning
rate and small learning rate act on L1 and L2 norms. To visualize this, we
look at a one-dimensional representation of learning steps on both norms,
as follows:
Figure 7: Illustrates what can happen with the L1 and L2 norm with
larger and smaller learning rates.
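To make the figure concrete, here is a toy one-dimensional sketch (not part of the recipe's code) of gradient descent on the losses |a| and a**2: the L2 gradient grows with the distance from the minimum, so a large learning rate overshoots and diverges, while the L1 gradient has constant magnitude, so the steps just oscillate in a small band around the minimum:
import numpy as np
def step(a, lr, loss):
    # one gradient-descent update on loss(a) = |a| (L1) or a**2 (L2)
    grad = np.sign(a) if loss == 'l1' else 2.*a
    return a - lr*grad
for loss in ['l1', 'l2']:
    a = 5.0
    path = [a]
    for _ in range(5):
        a = step(a, lr=1.5, loss=loss)
        path.append(round(a, 2))
    print(loss, path)
# l1: [5.0, 3.5, 2.0, 0.5, -1.0, 0.5]  -- bounded oscillation
# l2: [5.0, -10.0, 20.0, -40.0, 80.0, -160.0]  -- divergence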
Implementing Deming regression
In this recipe, we will implement Deming regression (total regression),
which means we will need a different way to measure the distance
between the model line and data points.
Getting ready
If least squares linear regression minimizes the vertical distance to the line,
Deming regression minimizes the total distance to the line. This type of
regression minimizes the error in the y values and the x values. See the
following figure for a comparison:
Figure 8: Here we illustrate the difference between regular linear
regression and Deming regression. Linear regression on the left
minimizes the vertical distance to the line, and Deming regression
minimizes the total distance to the line.
To implement Deming regression, we have to modify the loss function.
The loss function in regular linear regression minimizes the vertical
distance. Here, we want to minimize the total distance. Given a slope and
intercept of a line, the perpendicular distance to a point is a known
geometric formula. We just have to substitute this formula in and tell
TensorFlow to minimize it.
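As a quick check of that geometry with plain NumPy and toy numbers, the perpendicular distance from a point (x0, y0) to the line y = mx + b is |y0 - (m*x0 + b)| / sqrt(m^2 + 1):
import numpy as np
m, b = 1.0, 0.0                    # the line y = x
x0, y0 = 0.0, 1.0                  # a point one unit above the line
vertical = abs(y0 - (m*x0 + b))                  # 1.0, what least squares penalizes
perpendicular = vertical / np.sqrt(m**2 + 1.)    # ~0.707, what Deming penalizes
print(vertical, perpendicular)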
How to do it…
1. Everything stays the same except when we get to the loss function. We begin by loading the libraries, starting a session, loading the data, declaring the batch size, and creating the placeholders, variables, and model output, as follows:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
sess = tf.Session()
iris = datasets.load_iris()
x_vals = np.array([x[3] for x in iris.data])
y_vals = np.array([y[0] for y in iris.data])
batch_size = 50
x_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
model_output = tf.add(tf.matmul(x_data, A), b)
2. The loss function is a geometric formula that comprises a numerator and a denominator. For clarity, we will write these out separately. Given a line, y = mx + b, and a point, (x0, y0), the perpendicular distance between the two can be written as |y0 - (m*x0 + b)| / sqrt(m^2 + 1), which we implement as follows:
demming_numerator = tf.abs(tf.sub(y_target,
tf.add(tf.matmul(x_data, A), b)))
demming_denominator = tf.sqrt(tf.add(tf.square(A),1))
loss = tf.reduce_mean(tf.truediv(demming_numerator,
demming_denominator))
3. We now initialize our variables, declare our optimizer, and loop through the training set to arrive at our parameters, as follows:
init = tf.global_variables_initializer()
sess.run(init)
my_opt = tf.train.GradientDescentOptimizer(0.1)
train_step = my_opt.minimize(loss)
loss_vec = []
for i in range(250):
rand_index = np.random.choice(len(x_vals),
size=batch_size)
rand_x = np.transpose([x_vals[rand_index]])
rand_y = np.transpose([y_vals[rand_index]])
sess.run(train_step, feed_dict={x_data: rand_x, y_target:
rand_y})
temp_loss = sess.run(loss, feed_dict={x_data: rand_x,
y_target: rand_y})
loss_vec.append(temp_loss)
if (i+1)%50==0:
print('Step #' + str(i+1) + ' A = ' + str(sess.run(A)) + ' b = ' + str(sess.run(b)))
print('Loss = ' + str(temp_loss))
4. We can plot the output with the following code:
[slope] = sess.run(A)
[y_intercept] = sess.run(b)
best_fit = []
for i in x_vals:
best_fit.append(slope*i+y_intercept)
plt.plot(x_vals, y_vals, 'o', label='Data Points')
plt.plot(x_vals, best_fit, 'r-', label='Best fit line', linewidth=3)
plt.legend(loc='upper left')
plt.title('Sepal Length vs Pedal Width')
plt.xlabel('Pedal Width')
plt.ylabel('Sepal Length')
plt.show()
Figure 9: The graph depicting the solution to Deming regression on
the iris dataset.
How it works…
The recipe here for Deming regression is almost identical to regular linear
regression. The key difference here is how we measure the loss between
the predictions and the data points. Instead of a vertical loss, we have a
perpendicular loss (or total loss) with the y values and x values.
Note
Note that the type of Deming regression implemented here is called total regression. Total regression is when we assume the errors in the x and y values are similar. We can also scale the x and y axes in the distance calculation by the difference in the errors, according to our beliefs.
Implementing Lasso and Ridge
Regression
There are also ways to limit the influence of coefficients on the regression
output. These methods are called regularization methods and two of the
most common regularization methods are lasso and ridge regression. We
cover how to implement both of these in this recipe.
Getting ready
Lasso and ridge regression are very similar to regular linear regression, except that we add regularization terms to limit the slopes (or partial slopes)
in the formula. There may be multiple reasons for this, but a common one
is that we wish to restrict the features that have an impact on the
dependent variable. This can be accomplished by adding a term to the loss
function that depends on the value of our slope, A.
For lasso regression, we must add a term that greatly increases our loss
function if the slope, A, gets above a certain value. We could use
TensorFlow's logical operations, but they do not have a gradient associated with them. Instead, we will use a continuous approximation to a step function, called a continuous Heaviside step function, that is scaled and shifted to the regularization cutoff we choose. We will show how to do lasso regression shortly.
For ridge regression, we just add a term to the loss function, which is the scaled L2 norm of the slope coefficient. This modification is simple and is
shown in the There's more… section at the end of this recipe.
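As a sketch of the lasso penalty we are about to build (plain NumPy; the 0.9 cutoff, the steepness of 100, and the scale of 99 match the constants used in the code below):
import numpy as np
lasso_param = 0.9
for A in [0.5, 0.85, 0.9, 0.95, 1.5]:
    heavyside_step = 1. / (1. + np.exp(-100. * (A - lasso_param)))
    penalty = 99. * heavyside_step     # near 0 below the cutoff, near 99 above it
    print(A, round(penalty, 2))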
How to do it…
1. We will use the iris dataset again and set up our script the same way as
before. We first load the libraries, start a session, load the data, declare
the batch size, create the placeholders, variables, and model output as
follows:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
from tensorflow.python.framework import ops
ops.reset_default_graph()
sess = tf.Session()
iris = datasets.load_iris()
x_vals = np.array([x[3] for x in iris.data])
y_vals = np.array([y[0] for y in iris.data])
batch_size = 50
learning_rate = 0.001
x_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
model_output = tf.add(tf.matmul(x_data, A), b)
2. We add the loss function, which is a modified continuous Heaviside step function. We also set the cutoff for lasso regression at 0.9. This means that we want to restrict the slope coefficient to be less than 0.9. Use the following code:
lasso_param = tf.constant(0.9)
heavyside_step = tf.truediv(1., tf.add(1.,
tf.exp(tf.mul(-100., tf.sub(A, lasso_param)))))
regularization_param = tf.mul(heavyside_step, 99.)
loss = tf.add(tf.reduce_mean(tf.square(y_target -
model_output)), regularization_param)
3. We now initialize our variables and declare our optimizer, as follows:
init = tf.global_variables_initializer()
sess.run(init)
my_opt = tf.train.GradientDescentOptimizer(learning_rate)
train_step = my_opt.minimize(loss)
4. We will run the training loop a fair bit longer because it can take a while to converge. We can see that the slope coefficient is less than 0.9. Use the following code:
loss_vec = []
for i in range(1500):
rand_index = np.random.choice(len(x_vals),
size=batch_size)
rand_x = np.transpose([x_vals[rand_index]])
rand_y = np.transpose([y_vals[rand_index]])
sess.run(train_step, feed_dict={x_data: rand_x, y_target:
rand_y})
temp_loss = sess.run(loss, feed_dict={x_data: rand_x,
y_target: rand_y})
loss_vec.append(temp_loss[0])
if (i+1)%300==0:
print('Step #' + str(i+1) + ' A = ' + str(sess.run(A)) + ' b = ' + str(sess.run(b)))
print('Loss = ' + str(temp_loss))
Step #300 A = [[ 0.82512331]] b = [[ 2.30319238]]
Loss = [[ 6.84168959]]
Step #600 A = [[ 0.8200165]] b = [[ 3.45292258]]
Loss = [[ 2.02759886]]
Step #900 A = [[ 0.81428504]] b = [[ 4.08901262]]
Loss = [[ 0.49081498]]
Step #1200 A = [[ 0.80919558]] b = [[ 4.43668795]]
Loss = [[ 0.40478843]]
Step #1500 A = [[ 0.80433637]] b = [[ 4.6360755]]
Loss = [[ 0.23839757]]
How it works…
We implement lasso regression by adding a continuous Heaviside step function to the loss function of linear regression. Because of the steepness of the step function, we have to be careful with the step size. Too big a step size and it will not converge. For ridge regression, see the necessary
change in the next section.
There's more…
For ridge regression, we change the loss function to look like the
following code:
ridge_param = tf.constant(1.)
ridge_loss = tf.reduce_mean(tf.square(A))
loss = tf.expand_dims(tf.add(tf.reduce_mean(tf.square(y_target -
model_output)), tf.mul(ridge_param, ridge_loss)), 0)
Implementing Elastic Net
Regression
Elastic net regression is a type of regression that combines lasso regression with ridge regression by adding L1 and L2 regularization terms to the loss function.
Getting ready
Implementing elastic net regression should be straightforward after the
previous two recipes, so we will implement this in multiple linear
regression on the iris dataset, instead of sticking to the two-dimensional
data as before. We will use petal length, petal width, and sepal width to predict sepal length.
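Written out, the loss we will build amounts to loss = MSE + α1·mean(|A|) + α2·mean(A²). Here is a toy NumPy sketch of that penalty arithmetic; the slope values and the data term are made up for illustration:
import numpy as np
A = np.array([0.5, -1.0, 2.0])     # toy partial slopes
mse = 0.3                          # pretend data term
alpha1, alpha2 = 1., 1.            # both regularization weights set to 1
loss = mse + alpha1*np.mean(np.abs(A)) + alpha2*np.mean(np.square(A))
print(loss)                        # 0.3 + 1.1667 + 1.75 = 3.2167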
How to do it…
1. First we load the necessary libraries and initialize a graph, as follows:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
sess = tf.Session()
2. Now we will load the data. This time, each element of x data will be a
list of three values instead of one. Use the following code:
iris = datasets.load_iris()
x_vals = np.array([[x[1], x[2], x[3]] for x in iris.data])
y_vals = np.array([y[0] for y in iris.data])
3. Next we declare the batch size, placeholders, variables, and model
output. The only difference here is that we change the size
specifications of the x data placeholder to take three values instead of
one, as follows:
batch_size = 50
learning_rate = 0.001
x_data = tf.placeholder(shape=[None, 3], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[3,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
model_output = tf.add(tf.matmul(x_data, A), b)
4. For elastic net, the loss function has the L1 and L2 norms of the partial slopes. We create these terms and then add them into the loss function, as follows:
elastic_param1 = tf.constant(1.)
elastic_param2 = tf.constant(1.)
l1_a_loss = tf.reduce_mean(tf.abs(A))
l2_a_loss = tf.reduce_mean(tf.square(A))
e1_term = tf.mul(elastic_param1, l1_a_loss)
e2_term = tf.mul(elastic_param2, l2_a_loss)
loss = tf.expand_dims(tf.add(tf.add(tf.reduce_mean(tf.square(y_target - model_output)), e1_term), e2_term), 0)
5. Now we can initialize the variables, declare our optimizer, and run the training loop to fit our coefficients, as follows:
init = tf.global_variables_initializer()
sess.run(init)
my_opt = tf.train.GradientDescentOptimizer(learning_rate)
train_step = my_opt.minimize(loss)
loss_vec = []
for i in range(1000):
rand_index = np.random.choice(len(x_vals),
size=batch_size)
rand_x = x_vals[rand_index]
rand_y = np.transpose([y_vals[rand_index]])
sess.run(train_step, feed_dict={x_data: rand_x, y_target:
rand_y})
temp_loss = sess.run(loss, feed_dict={x_data: rand_x,
y_target: rand_y})
loss_vec.append(temp_loss[0])
if (i+1)%250==0:
print('Step #' + str(i+1) + ' A = ' +
str(sess.run(A)) + ' b = ' + str(sess.run(b)))
print('Loss = ' + str(temp_loss))
6. Here is the output of the code:
Step #250 A = [[ 0.42095602]
[ 0.1055888 ]
[ 1.77064979]] b = [[ 1.76164341]]
Loss = [ 2.87764359]
Step #500 A = [[ 0.62762028]
[ 0.06065864]
[ 1.36294949]] b = [[ 1.87629771]]
Loss = [ 1.8032167]
Step #750 A = [[ 0.67953539]
 [ 0.102514  ]
 [ 1.06914485]] b = [[ 1.95604002]]
Loss = [ 1.33256555]
Step #1000 A = [[ 0.6777274 ]
[ 0.16535147]
[ 0.8403284 ]] b = [[ 2.02246833]]
Loss = [ 1.21458709]
7. Now we can observe the loss over the training iterations to be sure that it converged, as follows:
plt.plot(loss_vec, 'k-')
plt.title('Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Loss')
plt.show()
Figure 10: Elastic net regression loss plotted over the 1,000 training
iterations
How it works…
Here, elastic net regression is implemented as a multiple linear regression. We can see that, with these regularization terms in the loss function, the convergence is slower than in prior sections. Regularization is as simple as adding the appropriate terms to the loss function.
Implementing Logistic Regression
For this recipe, we will implement logistic regression to predict the
probability of low birthweight.
Getting ready
Logistic regression is a way to turn linear regression into a binary classification. This is accomplished by transforming the linear output with a sigmoid function that scales the output between zero and 1. The target is a zero or 1, which indicates whether or not a data point is in one class or another. Since we are predicting a number between zero and 1, the prediction is classified into class 1 if it is above a specified cutoff value and class 0 otherwise. For the purpose of this example, we will specify that cutoff to be 0.5, which makes the classification as simple as rounding the output.
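As a minimal NumPy sketch of that transformation and cutoff (the logits here are made-up linear outputs, A·x + b):
import numpy as np
logits = np.array([-3.0, -0.2, 0.2, 3.0])    # toy linear outputs
probs = 1. / (1. + np.exp(-logits))          # sigmoid scales them into (0, 1)
classes = np.round(probs)                    # a 0.5 cutoff is just rounding
print(np.round(probs, 3))                    # [0.047 0.45  0.55  0.953]
print(classes)                               # [0. 0. 1. 1.]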
The data we will use for this example will be the low birthweight data that
is obtained through the University of Massachusetts Amherst statistical
dataset repository (https://www.umass.edu/statdata/statdata/ ). We will be
predicting low birthweight from several other factors.
How to do it…
1. We start by loading the libraries, including the requests library, because we will access the low birth weight data through a hyperlink. We will also initiate a session:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import requests
from sklearn import datasets
from sklearn.preprocessing import normalize
from tensorflow.python.framework import ops
ops.reset_default_graph()
sess = tf.Session()
2. Next we will load the data through the requests module and specify which features we want to use. We have to be specific because one feature is the actual birth weight and we don't want to use this to predict whether the birthweight is greater or less than a specific amount. We also do not want to use the ID column as a predictor:
birthdata_url =
'https://www.umass.edu/statdata/statdata/data/lowbwt.dat'
birth_file = requests.get(birthdata_url)
birth_data = birth_file.text.split('\r\n')[5:]
birth_header = [x for x in birth_data[0].split() if len(x)>=1]
birth_data = [[float(x) for x in y.split() if len(x)>=1] for y in birth_data[1:] if len(y)>=1]
y_vals = np.array([x[1] for x in birth_data])
x_vals = np.array([x[2:9] for x in birth_data])
3. Next we split the dataset into test and train sets:
train_indices = np.random.choice(len(x_vals),
round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) -
set(train_indices)))
x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]
4. Logistic regression convergence works better when the features are scaled between 0 and 1 (min-max scaling), so next we will scale each feature:
def normalize_cols(m):
col_max = m.max(axis=0)
col_min = m.min(axis=0)
return (m-col_min) / (col_max - col_min)
x_vals_train = np.nan_to_num(normalize_cols(x_vals_train))
x_vals_test = np.nan_to_num(normalize_cols(x_vals_test))
Note
Note that we split the dataset into train and test before we scaled the
dataset. This is an important distinction to make. We want to make
sure that the training set does not influence the test set at all. If we
scaled the whole set before splitting, then we cannot guarantee that
they don't influence each other.
5. Next we declare the batch size, placeholders, variables, and the logistic model. We do not wrap the output in a sigmoid because that operation is built into the loss function:
batch_size = 25
x_data = tf.placeholder(shape=[None, 7], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[7,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
model_output = tf.add(tf.matmul(x_data, A), b)
6. Now we declare our loss function, which has the sigmoid built in, initialize our variables, and declare our optimizer function:
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(model_output, y_target))
init = tf.global_variables_initializer()
sess.run(init)
my_opt = tf.train.GradientDescentOptimizer(0.01)
train_step = my_opt.minimize(loss)
7. Along with recording the loss function, we will also want to record the classification accuracy on the training and test sets, so we will create a prediction function that returns the accuracy for a batch of any size:
prediction = tf.round(tf.sigmoid(model_output))
predictions_correct = tf.cast(tf.equal(prediction, y_target),
tf.float32)
accuracy = tf.reduce_mean(predictions_correct)
8. Now we can start our training loop and record the loss and accuracies:
loss_vec = []
train_acc = []
test_acc = []
for i in range(1500):
rand_index = np.random.choice(len(x_vals_train),
size=batch_size)
rand_x = x_vals_train[rand_index]
rand_y = np.transpose([y_vals_train[rand_index]])
sess.run(train_step, feed_dict={x_data: rand_x, y_target:
rand_y})
temp_loss = sess.run(loss, feed_dict={x_data: rand_x,
y_target: rand_y})
loss_vec.append(temp_loss)
temp_acc_train = sess.run(accuracy, feed_dict={x_data:
x_vals_train, y_target: np.transpose([y_vals_train])})
train_acc.append(temp_acc_train)
temp_acc_test = sess.run(accuracy, feed_dict={x_data:
x_vals_test, y_target: np.transpose([y_vals_test])})
test_acc.append(temp_acc_test)
9. Here is the code to look at the plots of the loss and accuracies:
plt.plot(loss_vec, 'k-')
plt.title('Cross Entropy Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Cross Entropy Loss')
plt.show()
plt.plot(train_acc, 'k-', label='Train Set Accuracy')
plt.plot(test_acc, 'r--', label='Test Set Accuracy')
plt.title('Train and Test Accuracy')
plt.xlabel('Generation')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()
How it works…
Here is the loss over the iterations and train and test set accuracies.
Since the dataset is only 189 observations, the train and test accuracy
plots will change owing to the random splitting of the dataset:
Figure 11: Cross-entropy loss plotted over the course of 1,500 iterations
Figure 12: Test and train set accuracy plotted over 1,500 generations.
Chapter 4. Support Vector
Machines
This chapter will cover some important recipes regarding how to use,
implement, and evaluate support vector machines (SVM) in TensorFlow.
The following areas will be covered:
Working with a Linear SVM
Reduction to Linear Regression
Working with Kernels in TensorFlow
Implementing a Non-Linear SVM
Implementing a Multi-Class SVM
Note
Note that both the prior covered logistic regression and most of the SVMs
in this chapter are binary predictors. While logistic regression tries to find
any separating line that maximizes the distance (probabilistically), SVMs
also try to minimize the error while maximizing the margin between
classes. In general, if the problem has a large number of features compared
to training examples, try logistic regression or a linear SVM. If the number
of training examples is larger, or the data is not linearly separable, a SVM
with a Gaussian kernel may be used.
Also remember that all the code for this chapter is available online at
https://github.com/nfmcclure/tensorflow_cookbook .
Introduction
Support vector machines are a method of binary classification. The basic
idea is to find a linear separating line (or hyperplane) between the two
classes. We first assume that the binary class targets are -1 or 1, instead of
the prior 0 or 1 targets. Since there may be many lines that separate two
classes, we define the best linear separator that maximizes the distance
between both classes.
Figure 1: Given two separable classes, 'o' and 'x', we wish to find the
equation for the linear separator between the two. The left shows that
there are many lines that separate the two classes. The right shows the
unique maximum margin line. The margin width is given by 2/||A||. This line is
found by minimizing the L2 norm of A.
We can write such a hyperplane as A·x - b = 0.
Here, A is a vector of our partial slopes and x is a vector of inputs. The
width of the maximum margin can be shown to be two divided by the L2
norm of A. There are many proofs out there of this fact, but for a
geometric idea, solving the perpendicular distance from a 2D point to a
line may provide motivation for moving forward.
For linearly separable binary class data, to maximize the margin, we minimize the L2 norm of A, ||A||. We must also subject this minimum to the constraint y_i(A·x_i - b) >= 1 for every point i.
The preceding constraint assures us that all the points from the
corresponding classes are on the same side of the separating line.
Since not all datasets are linearly separable, we can introduce a loss function for points that cross the margin lines. For n data points, we introduce what is called the soft margin loss function, as follows:
loss = 1/n · Σ_i max(0, 1 - y_i(A·x_i - b)) + α·||A||^2
Note that the product y_i(A·x_i - b) is always greater than 1 if the point is on the correct side of the margin. This makes the left term of the loss function equal to zero, and the only influence on the loss function is the size of the margin.
The preceding loss function will seek a linearly separable line, but will allow for points crossing the margin line. This can be a hard or soft allowance, depending on the value of α. Larger values of α result in more emphasis on widening the margin, and smaller values of α result in the model acting more like a hard margin, while allowing data points to cross the margin, if need be.
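Here is a toy NumPy sketch of this soft margin loss; the outputs, targets, slopes, and α below are all made-up numbers:
import numpy as np
outputs = np.array([1.5, 0.4, -2.0, -0.3])  # A.x_i - b for four points
targets = np.array([1., 1., -1., -1.])      # classes in {-1, 1}
A = np.array([0.5, -1.0])                   # toy slope vector
alpha = 0.1
hinge = np.maximum(0., 1. - targets*outputs)   # zero when y_i(A.x_i - b) >= 1
loss = np.mean(hinge) + alpha*np.sum(A**2)
print(hinge)    # [0.  0.6 0.  0.7]: only the points inside the margin contribute
print(loss)     # 0.325 + 0.125 = 0.45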
In this chapter, we will set up a soft margin SVM and show how to extend
it to nonlinear cases and multiple classes.
Working with a Linear SVM
For this example, we will create a linear separator from the iris data set.
We know from prior chapters that the sepal length and petal width create a linearly separable binary dataset for predicting whether a flower is I. setosa or not.
Getting ready
To implement a soft separable SVM in TensorFlow, we will implement the specific loss function, as follows:
loss = 1/n · Σ_i max(0, 1 - y_i(A·x_i - b)) + α·||A||^2
Here, A is the vector of partial slopes, b is the intercept, x_i is a vector of inputs, y_i is the actual class (-1 or 1), and α is the soft separability regularization parameter.
How to do it…
1. We start by loading the necessary libraries. This will include the scikit-learn dataset library for access to the iris dataset. Use the following code:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
Note
To set up Scikit-learn for this exercise, we just need to type $pip install -U scikit-learn. Note that it also comes installed with Anaconda.
2. Next we start a graph session and load the data as we need it. Remember that we are loading the first and fourth variables in the iris dataset, as they are the sepal length and the petal width. We are loading the target variable, which will take on the value 1 for I. setosa and -1 otherwise. Use the following code:
sess = tf.Session()
iris = datasets.load_iris()
x_vals = np.array([[x[0], x[3]] for x in iris.data])
y_vals = np.array([1 if y==0 else -1 for y in iris.target])
3. We should now split the dataset into train and test sets. We will evaluate the accuracy on both the training and test sets. Since we know this dataset is linearly separable, we should expect to get one hundred percent accuracy on both sets. Use the following code:
train_indices = np.random.choice(len(x_vals),
round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) -
set(train_indices)))
x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]
4. Next we set our batch size, placeholders, and model variables. It is important to mention that with this SVM algorithm, we want very large batch sizes to help with convergence. We can imagine that with very small batch sizes, the maximum margin line would jump around slightly. Ideally, we would also slowly decrease the learning rate, but this will suffice for now. Also, the A variable will take on the shape 2x1 because we have two predictor variables, sepal length and petal width. Use the following code:
batch_size = 100
x_data = tf.placeholder(shape=[None, 2], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[2,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
5. We now declare our model output. For correctly classified points, this will return numbers that are greater than or equal to 1 if the target is I. setosa, and less than or equal to -1 otherwise. Use the following code:
model_output = tf.sub(tf.matmul(x_data, A), b)
6. Next we will put together and declare the necessary components for the maximum margin loss. First we will declare a function that will calculate the L2 norm of a vector. Then we add the margin parameter, α. We then declare our classification loss and add together the two terms. Use the following code:
l2_norm = tf.reduce_sum(tf.square(A))
alpha = tf.constant([0.1])
classification_term = tf.reduce_mean(tf.maximum(0.,
tf.sub(1., tf.mul(model_output, y_target))))
loss = tf.add(classification_term, tf.mul(alpha, l2_norm))
7. Now we declare our prediction and accuracy functions so that we can evaluate the accuracy on both the training and test sets, as follows:
prediction = tf.sign(model_output)
accuracy = tf.reduce_mean(tf.cast(tf.equal(prediction,
y_target), tf.float32))
8. Here we will declare our optimizer function and initialize our model variables, as follows:
my_opt = tf.train.GradientDescentOptimizer(0.01)
train_step = my_opt.minimize(loss)
init = tf.initialize_all_variables()
sess.run(init)
9. We can now start our training loop, keeping in mind that we want to record our loss and training accuracy on both the training and test sets, as follows:
loss_vec = []
train_accuracy = []
test_accuracy = []
for i in range(500):
rand_index = np.random.choice(len(x_vals_train),
size=batch_size)
rand_x = x_vals_train[rand_index]
rand_y = np.transpose([y_vals_train[rand_index]])
sess.run(train_step, feed_dict={x_data: rand_x, y_target:
rand_y})
temp_loss = sess.run(loss, feed_dict={x_data: rand_x,
y_target: rand_y})
loss_vec.append(temp_loss)
train_acc_temp = sess.run(accuracy, feed_dict={x_data:
x_vals_train, y_target: np.transpose([y_vals_train])})
train_accuracy.append(train_acc_temp)
test_acc_temp = sess.run(accuracy, feed_dict={x_data:
x_vals_test, y_target: np.transpose([y_vals_test])})
test_accuracy.append(test_acc_temp)
if (i+1)%100==0:
print('Step #' + str(i+1) + ' A = ' +
str(sess.run(A)) + ' b = ' + str(sess.run(b)))
print('Loss = ' + str(temp_loss))
10. The output of the script during training should look like the following:
Step #100 A = [[-0.10763293]
[-0.65735245]] b = [[-0.68752676]]
Loss = [ 0.48756418]
Step #200 A = [[-0.0650763 ]
[-0.89443302]] b = [[-0.73912662]]
Loss = [ 0.38910741]
Step #300 A = [[-0.02090022]
[-1.12334013]] b = [[-0.79332656]]
Loss = [ 0.28621092]
Step #400 A = [[ 0.03189624]
[-1.34912157]] b = [[-0.8507266]]
Loss = [ 0.22397576]
Step #500 A = [[ 0.05958777]
[-1.55989814]] b = [[-0.9000265]]
Loss = [ 0.20492229]
11. In order to plot the outputs, we have to extract the coefficients and separate the x values into I. setosa and non-I. setosa, as follows:
[[a1], [a2]] = sess.run(A)
[[b]] = sess.run(b)
slope = -a2/a1
y_intercept = b/a1
x1_vals = [d[1] for d in x_vals]
best_fit = []
for i in x1_vals:
best_fit.append(slope*i+y_intercept)
setosa_x = [d[1] for i,d in enumerate(x_vals) if
y_vals[i]==1]
setosa_y = [d[0] for i,d in enumerate(x_vals) if
y_vals[i]==1]
not_setosa_x = [d[1] for i,d in enumerate(x_vals) if
y_vals[i]==-1]
not_setosa_y = [d[0] for i,d in enumerate(x_vals) if
y_vals[i]==-1]
12. The following is the code to plot the data with the linear separator, accuracies, and loss:
plt.plot(setosa_x, setosa_y, 'o', label='I. setosa')
plt.plot(not_setosa_x, not_setosa_y, 'x', label='Non-setosa')
plt.plot(x1_vals, best_fit, 'r-', label='Linear Separator',
linewidth=3)
plt.ylim([0, 10])
plt.legend(loc='lower right')
plt.title('Sepal Length vs Pedal Width')
plt.xlabel('Pedal Width')
plt.ylabel('Sepal Length')
plt.show()
plt.plot(train_accuracy, 'k-', label='Training Accuracy')
plt.plot(test_accuracy, 'r--', label='Test Accuracy')
plt.title('Train and Test Set Accuracies')
plt.xlabel('Generation')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()
plt.plot(loss_vec, 'k-')
plt.title('Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Loss')
plt.show()
Note
Using TensorFlow in this manner to implement the SVM algorithm may result in slightly different outcomes each run. The reasons for this include the random train/test set splitting and the selection of different batches of points on each training step. It would also be ideal to slowly lower the learning rate after each generation.
Figure 2: Final linear SVM fit with the two classes plotted.
Figure 3: Test and train set accuracy over iterations. We do get 100% accuracy because the two classes are linearly separable.
Figure 4: Plot of the maximum margin loss over 500 iterations.
How it works…
In this recipe, we have shown that implementing a linear SVM model is possible by using the maximum margin loss function.
Reduction to Linear Regression
Support vector machines can be used to fit linear regression. In this recipe, we will explore how to do this with TensorFlow.
Getting ready
The same maximum margin concept can be applied toward fitting linear
regression. Instead of maximizing the margin that separates the classes, we
can think about maximizing the margin that contains the most (x, y) points.
To illustrate this, we will use the same iris data set, and show that we can
use this concept to fit a line between sepal length and petal width.
The corresponding loss function will be similar to the soft margin loss, and can be written as 1/n · Σ_i max(0, |y_i - (A·x_i + b)| - ε). Here, ε is half of the width of the margin, which makes the loss equal to zero if a point lies in this region.
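A quick NumPy sketch of this epsilon-insensitive loss on made-up residuals; with ε = 0.5 (the value we will use below), points within half a unit of the line cost nothing:
import numpy as np
epsilon = 0.5
residuals = np.array([0.1, -0.4, 0.8, -1.2])   # model_output - y_target
loss = np.mean(np.maximum(0., np.abs(residuals) - epsilon))
print(loss)    # (0 + 0 + 0.3 + 0.7) / 4 = 0.25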
How to do it…
1. First we load the necessary libraries, start a graph, and load the iris
dataset. After that, we will split the dataset into train and test sets to
visualize the loss on both. Use the following code:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
sess = tf.Session()
iris = datasets.load_iris()
x_vals = np.array([x[3] for x in iris.data])
y_vals = np.array([y[0] for y in iris.data])
train_indices = np.random.choice(len(x_vals),
round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) -
set(train_indices)))
x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]
Note
For this example, we have split the data into train and test. It is also
common to split the data into three datasets, which includes the
validation set. We can use this validation set to verify that we are not
overfitting models as we train them.
2. Let's declare our batch size, placeholders, and variables, and create our linear model, as follows:
batch_size = 50
x_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
A = tf.Variable(tf.random_normal(shape=[1,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
model_output = tf.add(tf.matmul(x_data, A), b)
3. Now we declare our loss function. The loss function, as described in the preceding text, is implemented as max(0, |y_i - (A·x_i + b)| - ε). Remember that the epsilon is part of our loss function, which allows for a soft margin instead of a hard margin:
epsilon = tf.constant([0.5])
loss = tf.reduce_mean(tf.maximum(0.,
tf.sub(tf.abs(tf.sub(model_output, y_target)), epsilon)))
4. We create an optimizer and initialize our variables next, as follows:
my_opt = tf.train.GradientDescentOptimizer(0.075)
train_step = my_opt.minimize(loss)
init = tf.initialize_all_variables()
sess.run(init)
5. Now we iterate through 200 training iterations and save the training and test loss for plotting later:
train_loss = []
test_loss = []
for i in range(200):
rand_index = np.random.choice(len(x_vals_train),
size=batch_size)
rand_x = np.transpose([x_vals_train[rand_index]])
rand_y = np.transpose([y_vals_train[rand_index]])
sess.run(train_step, feed_dict={x_data: rand_x, y_target:
rand_y})
temp_train_loss = sess.run(loss, feed_dict={x_data:
np.transpose([x_vals_train]), y_target:
np.transpose([y_vals_train])})
train_loss.append(temp_train_loss)
temp_test_loss = sess.run(loss, feed_dict={x_data:
np.transpose([x_vals_test]), y_target:
np.transpose([y_vals_test])})
test_loss.append(temp_test_loss)
if (i+1)%50==0:
print('-----------')
print('Generation: ' + str(i))
print('A = ' + str(sess.run(A)) + ' b = ' +
str(sess.run(b)))
print('Train Loss = ' + str(temp_train_loss))
print('Test Loss = ' + str(temp_test_loss))
6. This results in the following output:
Generation: 50
A = [[ 2.20651722]] b = [[ 2.71290684]]
Train Loss = 0.609453
Test Loss = 0.460152
-----------
Generation: 100
A = [[ 1.6440177]] b = [[ 3.75240564]]
Train Loss = 0.242519
Test Loss = 0.208901
-----------
Generation: 150
A = [[ 1.27711761]] b = [[ 4.3149066]]
Train Loss = 0.108192
Test Loss = 0.119284
-----------
Generation: 200
A = [[ 1.05271816]] b = [[ 4.53690529]]
Train Loss = 0.0799957
Test Loss = 0.107551
7. We can now extract the coefficients we found and get values for the best-fit line. For plotting purposes, we will also get values for the margins. Use the following code:
[[slope]] = sess.run(A)
[[y_intercept]] = sess.run(b)
[width] = sess.run(epsilon)
best_fit = []
best_fit_upper = []
best_fit_lower = []
for i in x_vals:
best_fit.append(slope*i+y_intercept)
best_fit_upper.append(slope*i+y_intercept+width)
best_fit_lower.append(slope*i+y_intercept-width)
8. Finally, here is the code to plot the data with the fitted line and the train-test loss:
plt.plot(x_vals, y_vals, 'o', label='Data Points')
plt.plot(x_vals, best_fit, 'r-', label='SVM Regression Line',
linewidth=3)
plt.plot(x_vals, best_fit_upper, 'r--', linewidth=2)
plt.plot(x_vals, best_fit_lower, 'r--', linewidth=2)
plt.ylim([0, 10])
plt.legend(loc='lower right')
plt.title('Sepal Length vs Pedal Width')
plt.xlabel('Pedal Width')
plt.ylabel('Sepal Length')
plt.show()
plt.plot(train_loss, 'k-', label='Train Set Loss')
plt.plot(test_loss, 'r--', label='Test Set Loss')
plt.title('L2 Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('L2 Loss')
plt.legend(loc='upper right')
plt.show()
Figure 5: SVM regression with a 0.5 margin on the iris data (sepal
length versus petal width).
Here is the train and test loss over the training iterations:
Figure 6: SVM regression loss per generation on both the train and test
sets.
How it works…
Intuitively, we can think of SVM regression as a function that is trying to fit as many points as possible inside the 2ε-wide margin around the line. The fitting of this line is somewhat sensitive to this parameter. If we choose too
small an epsilon, the algorithm will not be able to fit many points in the
margin. If we choose too large of an epsilon, there will be many lines that
are able to fit all the data points in the margin. We prefer a smaller epsilon,
since nearer points to the margin contribute less loss than further away
points.
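To make the effect of epsilon concrete, here is a minimal NumPy sketch (an illustrative aside, not part of the recipe) of the epsilon-insensitive loss used above; residuals inside the margin cost nothing, and residuals outside it are penalized linearly by their overshoot:
import numpy as np
def epsilon_insensitive_loss(predictions, targets, epsilon=0.5):
    # Residuals inside the +/- epsilon margin contribute zero loss;
    # anything outside is penalized by its distance beyond the margin.
    residuals = np.abs(predictions - targets)
    return np.mean(np.maximum(0., residuals - epsilon))
preds = np.array([1.0, 2.0, 3.0])
targets = np.array([1.2, 2.0, 4.0])
print(epsilon_insensitive_loss(preds, targets))
# Only the third point (residual 1.0) contributes: (1.0 - 0.5)/3 = 0.1667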
Working with Kernels in
TensorFlow
The prior SVMs worked with linearly separable data. If we would like to separate non-linear data, we can change how we project the linear separator onto the data. This is done by changing the kernel in the SVM loss function. In this recipe, we introduce how to change kernels and separate non-linearly separable data.
Getting ready
In this recipe, we will motivate the usage of kernels in support vector
machines. In the linear SVM section, we solved the soft margin with a
specific loss function. A different approach to this method is to solve what
is called the dual of the optimization problem. It can be shown that the dual for the linear SVM problem is given by the following formula:

\[ \max_{b} \; \sum_{i} b_i - \sum_{i,j} b_i b_j y_i y_j \, k(x_i, x_j) \]

where \( b = (b_1, \dots, b_n) \) is the vector of dual variables, one per data point.

Here, the variable in the model will be the b vector. Ideally, this vector will be quite sparse, only taking on values near 1 and -1 for the corresponding support vectors of our dataset. Our data point vectors are indicated by \(x_i\) and our targets (1 or -1) are represented by \(y_i\).

The kernel in the preceding equations is the dot product, \( k(x_i, x_j) = x_i \cdot x_j \), which gives us the linear kernel. This kernel is a square matrix filled with the dot products of the data points.

Instead of just doing the dot product between data points, we can expand them with more complicated functions into higher dimensions, in which the classes may be linearly separable. This may seem needlessly complicated, but if we select a function, k, that has the property

\[ k(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j) \]

for some feature mapping \(\varphi\), then k is called a kernel function. One of the more common kernels is the Gaussian kernel (also known as the radial basis function kernel or the RBF kernel). This kernel is described with the following equation:

\[ k(x_i, x_j) = e^{-\gamma \, \lVert x_i - x_j \rVert^2} \]

In order to make predictions with this kernel, say at a point z, we just substitute the prediction point into the appropriate place in the kernel, as follows:

\[ k(x_i, z) = e^{-\gamma \, \lVert x_i - z \rVert^2} \]
In this section, we will discuss how to implement the Gaussian kernel. We
will also make a note of where to make the substitution for implementing
the linear kernel where appropriate. The dataset we will use will be
manually created to show where the Gaussian kernel would be more
appropriate to use over the linear kernel.
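As a quick numerical illustration of the kernel property above (a hedged sketch using a made-up degree-2 polynomial kernel, not part of the recipe), \( (x \cdot y)^2 \) equals the dot product of an explicit feature expansion \(\varphi\), so we never have to form \(\varphi\) explicitly:
import numpy as np
def phi(x):
    # Explicit degree-2 feature map for 2-D inputs:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])
x, y = np.array([1., 2.]), np.array([3., 0.5])
print(np.dot(phi(x), phi(y)))  # 16.0 via the explicit expansion
print(np.dot(x, y)**2)         # 16.0 via the kernel, never forming phi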
How to do it…
1.
First we load the necessary libraries and start a graph session, as
follows:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
sess = tf.Session()
2.
Now we generate the data. The data we will generate will be two concentric rings of data; each ring will belong to a different class. We have to make sure that the classes are -1 or 1 only. Then we will split the data into x and y values for each class for plotting purposes. Use the following code:
(x_vals, y_vals) = datasets.make_circles(n_samples=500,
factor=.5, noise=.1)
y_vals = np.array([1 if y==1 else -1 for y in y_vals])
class1_x = [x[0] for i,x in enumerate(x_vals) if
y_vals[i]==1]
class1_y = [x[1] for i,x in enumerate(x_vals) if
y_vals[i]==1]
class2_x = [x[0] for i,x in enumerate(x_vals) if
y_vals[i]==-1]
class2_y = [x[1] for i,x in enumerate(x_vals) if
y_vals[i]==-1]
3.
Next we declare our batch size, placeholders, and create our model
variable, b. For SVMs we tend to want larger batch sizes because we
want a very stable model that won't fluctuate much with each training
generation. Also note that we have an extra placeholder for the
prediction points. To visualize the results, we will create a color grid to
see which areas belong to which class at the end. Use the following
code:
batch_size = 250
x_data = tf.placeholder(shape=[None, 2], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
prediction_grid = tf.placeholder(shape=[None, 2],
dtype=tf.float32)
b = tf.Variable(tf.random_normal(shape=[1,batch_size]))
4.
We will now create the Gaussian kernel. This kernel can be expressed
as matrix operations as follows:
gamma = tf.constant(-50.0)
dist = tf.reduce_sum(tf.square(x_data), 1)
dist = tf.reshape(dist, [-1,1])
sq_dists = tf.add(tf.sub(dist, tf.mul(2., tf.matmul(x_data,
tf.transpose(x_data)))), tf.transpose(dist))
my_kernel = tf.exp(tf.mul(gamma, tf.abs(sq_dists)))
Note
Note the use of broadcasting in the add and subtract operations on the sq_dists line.
Note that the linear kernel can be expressed as my_kernel =
tf.matmul(x_data, tf.transpose(x_data)).
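The sq_dists line relies on the identity \( \lVert x_i - x_j \rVert^2 = \lVert x_i \rVert^2 - 2\,x_i \cdot x_j + \lVert x_j \rVert^2 \). Here is a small NumPy sketch (an aside with random data) confirming that the broadcast version matches a direct pairwise computation:
import numpy as np
x = np.random.randn(5, 2)                       # 5 points in 2-D
sq_norms = np.sum(x**2, axis=1).reshape(-1, 1)  # column of ||x_i||^2
# Broadcasting a column minus 2*Gram plus a row yields all pairwise
# squared distances at once.
sq_dists = sq_norms - 2. * x.dot(x.T) + sq_norms.T
direct = np.array([[np.sum((a - b)**2) for b in x] for a in x])
print(np.allclose(sq_dists, direct))  # True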
5. Now we declare the dual problem as previously stated in this recipe.
At the end, instead of maximizing, we will be minimizing the negative
of the loss function with a tf.neg() function. Use the following code:
model_output = tf.matmul(b, my_kernel)
first_term = tf.reduce_sum(b)
b_vec_cross = tf.matmul(tf.transpose(b), b)
y_target_cross = tf.matmul(y_target, tf.transpose(y_target))
second_term = tf.reduce_sum(tf.mul(my_kernel,
tf.mul(b_vec_cross, y_target_cross)))
loss = tf.neg(tf.sub(first_term, second_term))
6. We now create the prediction and accuracy functions. First, we must
create a prediction kernel, similar to step 4, but instead of a kernel of
the points with itself, we have the kernel of the points with the
prediction data. The prediction is then the sign of the output of the
model. Use the following code:
rA = tf.reshape(tf.reduce_sum(tf.square(x_data), 1),[-1,1])
rB = tf.reshape(tf.reduce_sum(tf.square(prediction_grid), 1),
[-1,1])
pred_sq_dist = tf.add(tf.sub(rA, tf.mul(2., tf.matmul(x_data,
tf.transpose(prediction_grid)))), tf.transpose(rB))
pred_kernel = tf.exp(tf.mul(gamma, tf.abs(pred_sq_dist)))
prediction_output =
tf.matmul(tf.mul(tf.transpose(y_target),b), pred_kernel)
prediction = tf.sign(prediction_output-
tf.reduce_mean(prediction_output))
accuracy =
tf.reduce_mean(tf.cast(tf.equal(tf.squeeze(prediction),
tf.squeeze(y_target)), tf.float32))
Note
To implement the linear prediction kernel, we can write pred_kernel =
tf.matmul(x_data, tf.transpose(prediction_grid)).
7.
Now we can create an optimizer function and initialize all the
variables, as follows:
my_opt = tf.train.GradientDescentOptimizer(0.001)
train_step = my_opt.minimize(loss)
init = tf.initialize_all_variables()
sess.run(init)
8.
Next we start the training loop. We will record the loss vector and the
batch accuracy for each generation. When we run the accuracy, we
have to put in all three placeholders, but we feed in the x data twice to
get the prediction on the points. Use the following code:
loss_vec = []
batch_accuracy = []
for i in range(500):
rand_index = np.random.choice(len(x_vals),
size=batch_size)
rand_x = x_vals[rand_index]
rand_y = np.transpose([y_vals[rand_index]])
sess.run(train_step, feed_dict={x_data: rand_x, y_target:
rand_y})
temp_loss = sess.run(loss, feed_dict={x_data: rand_x,
y_target: rand_y})
loss_vec.append(temp_loss)
acc_temp = sess.run(accuracy, feed_dict={x_data: rand_x,
y_target:
rand_y,
prediction_grid:rand_x})
batch_accuracy.append(acc_temp)
if (i+1)%100==0:
print('Step #' + str(i+1))
print('Loss = ' + str(temp_loss))
9.
This results in the following output:
Step #100
Loss = -28.0772
Step #200
Loss = -3.3628
Step #300
Loss = -58.862
Step #400
Loss = -75.1121
Step #500
Loss = -84.8905
10.
In order to see the output class on the whole space, we will create a
mesh of prediction points in our system and run the prediction on all of
them, as follows:
x_min, x_max = x_vals[:, 0].min() - 1, x_vals[:, 0].max() + 1
y_min, y_max = x_vals[:, 1].min() - 1, x_vals[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
np.arange(y_min, y_max, 0.02))
grid_points = np.c_[xx.ravel(), yy.ravel()]
[grid_predictions] = sess.run(prediction, feed_dict={x_data:
rand_x,
y_target:
rand_y,
prediction_grid: grid_points})
grid_predictions = grid_predictions.reshape(xx.shape)
11.
The following is the code to plot the result, batch accuracy, and loss:
plt.contourf(xx, yy, grid_predictions, cmap=plt.cm.Paired,
alpha=0.8)
plt.plot(class1_x, class1_y, 'ro', label='Class 1')
plt.plot(class2_x, class2_y, 'kx', label='Class -1')
plt.legend(loc='lower right')
plt.ylim([-1.5, 1.5])
plt.xlim([-1.5, 1.5])
plt.show()
plt.plot(batch_accuracy, 'k-', label='Accuracy')
plt.title('Batch Accuracy')
plt.xlabel('Generation')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()
plt.plot(loss_vec, 'k-')
plt.title('Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Loss')
plt.show()
12.
For succinctness, we will show only the results graph, but we can also
separately run the plotting code and see all three if we so choose:
Figure 7: Linear SVM on non-linearly separable data.
Figure 8: Non-linear SVM with Gaussian kernel results on nonlinear ring data.
How it works…
There are two important pieces of the code to know about: how we
implemented the kernel and how we implemented the loss function for the
SVM dual optimization problem. We have shown how to implement the
linear and Gaussian kernel and that the Gaussian kernel can separate
nonlinear datasets.
We should also mention that there is another parameter, the gamma value
in the Gaussian kernel. This parameter controls how much influence points
have on the curvature of the separation. Small values are commonly
chosen, but it depends heavily on the dataset. Ideally this parameter is
chosen with statistical techniques such as cross-validation.
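To see why gamma controls each point's reach, consider this brief NumPy aside (the distances are illustrative): as gamma grows, the Gaussian kernel value decays faster with distance, so only very close points influence the boundary:
import numpy as np
dists = np.array([0.1, 0.5, 1.0, 2.0])     # distances from a training point
for gamma in [1., 10., 25., 100.]:
    influence = np.exp(-gamma * dists**2)  # Gaussian kernel value
    print(gamma, np.round(influence, 4))
# With gamma=1, a point 1.0 away still has weight ~0.37; with
# gamma=100 its weight is ~0, so the boundary hugs individual points.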
There's more…
There are many more kernels that we could implement if we so choose.
Here is a list of a few more common nonlinear kernels:
Polynomial homogeneous kernel: \( k(x_i, x_j) = (x_i \cdot x_j)^d \)
Polynomial inhomogeneous kernel: \( k(x_i, x_j) = (x_i \cdot x_j + 1)^d \)
Hyperbolic tangent kernel: \( k(x_i, x_j) = \tanh(a \, (x_i \cdot x_j) + c) \)
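These kernels are straightforward to express as Gram-matrix operations. The following NumPy sketch (with illustrative choices for the degree d and the tanh parameters a and c) mirrors the form we used for my_kernel earlier:
import numpy as np
def poly_homogeneous_kernel(x, degree=2):
    return np.power(x.dot(x.T), degree)
def poly_inhomogeneous_kernel(x, degree=2, c=1.):
    return np.power(x.dot(x.T) + c, degree)
def tanh_kernel(x, a=1., c=0.):
    return np.tanh(a * x.dot(x.T) + c)
x = np.random.randn(4, 2)
print(poly_homogeneous_kernel(x).shape)  # (4, 4) Gram matrix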
Implementing a Non-Linear SVM
For this recipe, we will apply a non-linear kernel to split a dataset.
Getting ready
In this section, we will implement the preceding Gaussian kernel SVM on
real data. We will load the iris data set and create a classifier for I. setosa
(versus non-setosa). We will see the effect of various gamma values on the
classification.
How to do it…
1.
We first load the necessary libraries, which include the scikit-learn
datasets so that we can load the iris data. Then we will start a graph
session. Use the following code:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
sess = tf.Session()
2.
Next we will load the iris data, extract the sepal length and petal width, and separate the x and y values for each class (for plotting purposes later), as follows:
iris = datasets.load_iris()
x_vals = np.array([[x[0], x[3]] for x in iris.data])
y_vals = np.array([1 if y==0 else -1 for y in iris.target])
class1_x = [x[0] for i,x in enumerate(x_vals) if
y_vals[i]==1]
class1_y = [x[1] for i,x in enumerate(x_vals) if
y_vals[i]==1]
class2_x = [x[0] for i,x in enumerate(x_vals) if
y_vals[i]==-1]
class2_y = [x[1] for i,x in enumerate(x_vals) if
y_vals[i]==-1]
3.
Now we declare our batch size (larger batches are preferred),
placeholders, and the model variable, b, as follows:
batch_size = 100
x_data = tf.placeholder(shape=[None, 2], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
prediction_grid = tf.placeholder(shape=[None, 2],
dtype=tf.float32)
b = tf.Variable(tf.random_normal(shape=[1,batch_size]))
4.
Next we declare our Gaussian kernel. This kernel is dependent on the
gamma value, and we will illustrate the effects of various gamma
values on the classification later in this recipe. Use the following code:
gamma = tf.constant(-10.0)
dist = tf.reduce_sum(tf.square(x_data), 1)
dist = tf.reshape(dist, [-1,1])
sq_dists = tf.add(tf.sub(dist, tf.mul(2., tf.matmul(x_data,
tf.transpose(x_data)))), tf.transpose(dist))
my_kernel = tf.exp(tf.mul(gamma, tf.abs(sq_dists)))
We now compute the loss for the dual optimization problem, as
follows:
model_output = tf.matmul(b, my_kernel)
first_term = tf.reduce_sum(b)
b_vec_cross = tf.matmul(tf.transpose(b), b)
y_target_cross = tf.matmul(y_target, tf.transpose(y_target))
second_term = tf.reduce_sum(tf.mul(my_kernel,
tf.mul(b_vec_cross, y_target_cross)))
loss = tf.neg(tf.sub(first_term, second_term))
5.
In order to perform predictions using an SVM, we must create a
prediction kernel function. After that we also declare an accuracy
calculation, which will just be a percentage of points correctly
classified. Use the following code:
rA = tf.reshape(tf.reduce_sum(tf.square(x_data), 1),[-1,1])
rB = tf.reshape(tf.reduce_sum(tf.square(prediction_grid), 1),
[-1,1])
pred_sq_dist = tf.add(tf.sub(rA, tf.mul(2., tf.matmul(x_data,
tf.transpose(prediction_grid)))), tf.transpose(rB))
pred_kernel = tf.exp(tf.mul(gamma, tf.abs(pred_sq_dist)))
prediction_output =
tf.matmul(tf.mul(tf.transpose(y_target),b), pred_kernel)
prediction = tf.sign(prediction_output-
tf.reduce_mean(prediction_output))
accuracy =
tf.reduce_mean(tf.cast(tf.equal(tf.squeeze(prediction),
tf.squeeze(y_target)), tf.float32))
6.
Next we declare our optimizer function and initialize the variables, as
follows:
my_opt = tf.train.GradientDescentOptimizer(0.01)
train_step = my_opt.minimize(loss)
init = tf.initialize_all_variables()
sess.run(init)
7.
Now we can start the training loop. We run the loop for 300 iterations
and will store the loss value and the batch accuracy. Use the following
code:
loss_vec = []
batch_accuracy = []
for i in range(300):
rand_index = np.random.choice(len(x_vals),
size=batch_size)
rand_x = x_vals[rand_index]
rand_y = np.transpose([y_vals[rand_index]])
sess.run(train_step, feed_dict={x_data: rand_x, y_target:
rand_y})
temp_loss = sess.run(loss, feed_dict={x_data: rand_x,
y_target: rand_y})
loss_vec.append(temp_loss)
acc_temp = sess.run(accuracy, feed_dict={x_data: rand_x,
y_target:
rand_y,
prediction_grid:rand_x})
batch_accuracy.append(acc_temp)
8.
In order to plot the decision boundary, we will create a mesh of x, y
points and evaluate the prediction function we created on all of these
points, as follows:
x_min, x_max = x_vals[:, 0].min() - 1, x_vals[:, 0].max() + 1
y_min, y_max = x_vals[:, 1].min() - 1, x_vals[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
np.arange(y_min, y_max, 0.02))
grid_points = np.c_[xx.ravel(), yy.ravel()]
[grid_predictions] = sess.run(prediction, feed_dict={x_data:
rand_x,
y_target:
rand_y,
prediction_grid: grid_points})
grid_predictions = grid_predictions.reshape(xx.shape)
9.
For succinctness, we will only show how to plot the points with the
decision boundaries. For the plot and effect of gamma, see the next
section in this recipe. Use the following code:
plt.contourf(xx, yy, grid_predictions, cmap=plt.cm.Paired,
alpha=0.8)
plt.plot(class1_x, class1_y, 'ro', label='I. setosa')
plt.plot(class2_x, class2_y, 'kx', label='Non setosa')
plt.title('Gaussian SVM Results on Iris Data')
plt.xlabel('Sepal Length')
plt.ylabel('Petal Width')
plt.legend(loc='lower right')
plt.ylim([-0.5, 3.0])
plt.xlim([3.5, 8.5])
plt.show()
How it works…
Here is the classification of I. setosa results for four different gamma
values (1, 10, 25, 100). Notice how the higher the gamma value, the more
of an effect each individual point has on the classification boundary.
Figure 9: Classification results of I. setosa using a Gaussian kernel SVM
with four different values of gamma.
Implementing a Multi-Class SVM
We can also use SVMs to categorize multiple classes instead of just two. In
this recipe, we will use a multi-class SVM to categorize the three types of
flowers in the iris dataset.
Getting ready
By design, SVM algorithms are binary classifiers. However, there are a few
strategies employed to get them to work on multiple classes. The two main
strategies are called one versus all, and one versus one.
One versus one is a strategy where a binary classifier is created for each
possible pair of classes. Then a prediction is made for a point for the class
that has the most votes. This can be computationally hard, as we must create \( k(k-1)/2 \) classifiers for k classes.
Another way to implement multi-class classifiers is to do a one versus all
strategy where we create a classifier for each of the classes. The predicted
class of a point will be the class that creates the largest SVM margin. This
is the strategy we will implement in this section.
Here, we will load the iris dataset and perform multiclass nonlinear SVM with a Gaussian kernel. The iris dataset is ideal because there are three classes (I. setosa, I. virginica, and I. versicolor). We will create a Gaussian kernel SVM for each class and predict each point as the class whose SVM produces the largest margin.
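The mechanics of the one versus all vote are simple, as the following NumPy sketch shows (the scores are made-up stand-ins for the three SVM outputs):
import numpy as np
# Rows: one binary SVM per class; columns: points to classify.
# Each entry is that classifier's signed margin for the point.
svm_outputs = np.array([[ 1.2, -0.4, -1.0],   # I. setosa classifier
                        [-0.3,  0.9, -0.2],   # I. versicolor classifier
                        [-0.8, -0.5,  1.5]])  # I. virginica classifier
predicted_class = np.argmax(svm_outputs, axis=0)
print(predicted_class)  # [0 1 2]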
How to do it…
1. First we load the libraries we need and start a graph, as follows:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
sess = tf.Session()
2.
Next, we will load the iris dataset and split apart the targets for each
class. We will only be using the sepal length and petal width to
illustrate because we want to be able to plot the outputs. We also
separate the x and y values for each class for plotting purposes at the
end. Use the following code:
iris = datasets.load_iris()
x_vals = np.array([[x[0], x[3]] for x in iris.data])
y_vals1 = np.array([1 if y==0 else -1 for y in iris.target])
y_vals2 = np.array([1 if y==1 else -1 for y in iris.target])
y_vals3 = np.array([1 if y==2 else -1 for y in iris.target])
y_vals = np.array([y_vals1, y_vals2, y_vals3])
class1_x = [x[0] for i,x in enumerate(x_vals) if
iris.target[i]==0]
class1_y = [x[1] for i,x in enumerate(x_vals) if
iris.target[i]==0]
class2_x = [x[0] for i,x in enumerate(x_vals) if
iris.target[i]==1]
class2_y = [x[1] for i,x in enumerate(x_vals) if
iris.target[i]==1]
class3_x = [x[0] for i,x in enumerate(x_vals) if
iris.target[i]==2]
class3_y = [x[1] for i,x in enumerate(x_vals) if
iris.target[i]==2]
3.
The biggest change we have in this example, as compared to the
Implementing a Non-Linear SVM recipe, is that a lot of the
dimensions will change (we have three classifiers now instead of one).
We will also make use of matrix broadcasting and reshaping
techniques to calculate all three SVMs at once. Since we are doing this
all at once, our y_target placeholder now has the dimensions [3,
None] and our model variable, b, will be initialized to be size [3,
batch_size]. Use the following code:
batch_size = 50
x_data = tf.placeholder(shape=[None, 2], dtype=tf.float32)
y_target = tf.placeholder(shape=[3, None], dtype=tf.float32)
prediction_grid = tf.placeholder(shape=[None, 2],
dtype=tf.float32)
b = tf.Variable(tf.random_normal(shape=[3,batch_size]))
4.
Next we calculate the Gaussian kernel. Since this is only dependent on
the x data, this code doesn't change from the prior recipe. Use the
following code:
gamma = tf.constant(-10.0)
dist = tf.reduce_sum(tf.square(x_data), 1)
dist = tf.reshape(dist, [-1,1])
sq_dists = tf.add(tf.sub(dist, tf.mul(2., tf.matmul(x_data,
tf.transpose(x_data)))), tf.transpose(dist))
my_kernel = tf.exp(tf.mul(gamma, tf.abs(sq_dists)))
5.
One big change is that we will do batch matrix multiplication. We will end up with three-dimensional matrices, and we will want to broadcast matrix multiplication across the third index. Our data and target matrices are not set up for this. In order for an operation such as \( y \cdot y^{\top} \) to work across an extra dimension, we create a function to expand such matrices, reshape the matrix into a transpose, and then call TensorFlow's batch_matmul across the extra dimension. Use the following code:
def reshape_matmul(mat):
v1 = tf.expand_dims(mat, 1)
v2 = tf.reshape(v1, [3, batch_size, 1])
return(tf.batch_matmul(v2, v1))
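Note
As a shape check (a NumPy stand-in for batch_matmul, offered as an aside), reshaping a [3, batch_size] target matrix this way produces one [batch_size, batch_size] cross-product matrix per class:
import numpy as np
y = np.random.choice([-1., 1.], size=(3, 50))  # stand-in for y_target
v1 = y[:, np.newaxis, :]                       # shape [3, 1, 50]
v2 = y[:, :, np.newaxis]                       # shape [3, 50, 1]
cross = np.matmul(v2, v1)                      # shape [3, 50, 50]
print(cross.shape)                             # one y*y^T matrix per class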
6.
With this function created, we can now compute the dual loss
function, as follows:
model_output = tf.matmul(b, my_kernel)
first_term = tf.reduce_sum(b)
b_vec_cross = tf.matmul(tf.transpose(b), b)
y_target_cross = reshape_matmul(y_target)
second_term = tf.reduce_sum(tf.mul(my_kernel,
tf.mul(b_vec_cross, y_target_cross)),[1,2])
loss = tf.reduce_sum(tf.neg(tf.sub(first_term, second_term)))
7.
Now we can create the prediction kernel. Notice that we have to be
careful with the reduce_sum function and not reduce across all three
SVM predictions, so we have to tell TensorFlow not to sum everything
up with a second index argument. Use the following code:
rA = tf.reshape(tf.reduce_sum(tf.square(x_data), 1),[-1,1])
rB = tf.reshape(tf.reduce_sum(tf.square(prediction_grid), 1),
[-1,1])
pred_sq_dist = tf.add(tf.sub(rA, tf.mul(2., tf.matmul(x_data,
tf.transpose(prediction_grid)))), tf.transpose(rB))
pred_kernel = tf.exp(tf.mul(gamma, tf.abs(pred_sq_dist)))
8.
When we are done with the prediction kernel, we can create
predictions. A big change here is that the predictions are not the
sign() of the output. Since we are implementing a one versus all
strategy, the prediction is the classifier that has the largest output. To
accomplish this, we use TensorFlow's built in argmax() function, as
follows:
prediction_output = tf.matmul(tf.mul(y_target,b),
pred_kernel)
prediction = tf.arg_max(prediction_output-
tf.expand_dims(tf.reduce_mean(prediction_output,1), 1), 0)
accuracy = tf.reduce_mean(tf.cast(tf.equal(prediction,
tf.argmax(y_target,0)), tf.float32))
9.
Now that we have the kernel, loss, and prediction capabilities set
up, we just have to declare our optimizer function and initialize our
variables, as follows:
my_opt = tf.train.GradientDescentOptimizer(0.01)
train_step = my_opt.minimize(loss)
init = tf.initialize_all_variables()
sess.run(init)
10.
This algorithm converges relatively quickly, so we won't have to run the training loop for more than 100 iterations. We do so with the following code:
loss_vec = []
batch_accuracy = []
for i in range(100):
rand_index = np.random.choice(len(x_vals),
size=batch_size)
rand_x = x_vals[rand_index]
rand_y = y_vals[:,rand_index]
sess.run(train_step, feed_dict={x_data: rand_x, y_target:
rand_y})
temp_loss = sess.run(loss, feed_dict={x_data: rand_x,
y_target: rand_y})
loss_vec.append(temp_loss)
acc_temp = sess.run(accuracy, feed_dict={x_data: rand_x,
y_target: rand_y, prediction_grid:rand_x})
batch_accuracy.append(acc_temp)
if (i+1)%25==0:
print('Step #' + str(i+1))
print('Loss = ' + str(temp_loss))
Step #25
Loss = -2.8951
Step #50
Loss = -27.9612
Step #75
Loss = -26.896
Step #100
Loss = -30.2325
11.
We can now create the prediction grid of points and run the prediction
function on all of them, as follows:
x_min, x_max = x_vals[:, 0].min() - 1, x_vals[:, 0].max() + 1
y_min, y_max = x_vals[:, 1].min() - 1, x_vals[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
np.arange(y_min, y_max, 0.02))
grid_points = np.c_[xx.ravel(), yy.ravel()]
grid_predictions = sess.run(prediction, feed_dict={x_data:
rand_x,
y_target:
rand_y,
prediction_grid: grid_points})
grid_predictions = grid_predictions.reshape(xx.shape)
12.
The following is code to plot the results, batch accuracy, and loss
function. For succinctness we will only display the end result:
plt.contourf(xx, yy, grid_predictions, cmap=plt.cm.Paired,
alpha=0.8)
plt.plot(class1_x, class1_y, 'ro', label='I. setosa')
plt.plot(class2_x, class2_y, 'kx', label='I. versicolor')
plt.plot(class3_x, class3_y, 'gv', label='I. virginica')
plt.title('Gaussian SVM Results on Iris Data')
plt.xlabel('Sepal Length')
plt.ylabel('Petal Width')
plt.legend(loc='lower right')
plt.ylim([-0.5, 3.0])
plt.xlim([3.5, 8.5])
plt.show()
plt.plot(batch_accuracy, 'k-', label='Accuracy')
plt.title('Batch Accuracy')
plt.xlabel('Generation')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()
plt.plot(loss_vec, 'k-')
plt.title('Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Loss')
plt.show()
Figure 10: Multi-class (three classes) nonlinear Gaussian SVM
results on the iris dataset with gamma = 10.
How it works…
The important point to notice in this recipe is how we changed our
algorithm to optimize over three SVM models at once. Our model
parameter, b, has an extra dimension to take into account all three models.
Here we can see that extending an algorithm to several similar models was made relatively easy owing to TensorFlow's built-in capabilities for dealing with extra dimensions.
Chapter 5. Nearest Neighbor
Methods
This chapter will focus on nearest neighbor methods and how to implement
them in TensorFlow. We will start with an introduction to the method and
show how to implement various forms, and the chapter will end with
examples of address matching and image recognition. This is what we will
cover:
Working with Nearest Neighbors
Working with Text-Based Distances
Computing Mixed Distance Functions
Using an Address Matching Example
Using Nearest Neighbors for Image Recognition
Note that all the code is available online at
https://github.com/nfmcclure/tensorflow_cookbook .
Introduction
Nearest neighbor methods are based on a simple idea. We consider our
training set as the model and make predictions on new points based on
how close they are to points in the training set. The most naïve way is to
make the prediction as the closest training data point class. But since most
datasets contain a degree of noise, a more common method would be to
take a weighted average of a set of k nearest neighbors. This method is
called k-nearest neighbors (k-NN).
Given a training dataset \((x_1, x_2, \dots, x_n)\) with corresponding targets \((y_1, y_2, \dots, y_n)\), we can make a prediction on a point, z, by looking at a set of nearest neighbors. The actual method of prediction depends on whether or not we are doing regression (continuous \(y_i\)) or classification (discrete \(y_i\)).

For discrete classification targets, the prediction may be given by a maximum voting scheme weighted by the distance to the prediction point:

\[ f(z) = \max_{j} \sum_{i=1}^{k} \varphi(d_i) \, I_{ij} \]

Here, our prediction, f(z), is the maximum weighted value over all classes, j, where the weight applied to training point i is \(\varphi(d_i)\), a function of the distance from the prediction point to that training point. And \(I_{ij}\) is just an indicator function of whether point i is in class j.

For continuous regression targets, the prediction is given by a weighted average of all k points nearest to the prediction:

\[ f(z) = \frac{1}{k} \sum_{i=1}^{k} \varphi(d_i) \, y_i \]

It is obvious that the prediction is heavily dependent on the choice of the distance metric, d.

Common specifications of the distance metric are the L1 and L2 distances:

\[ d_{L1}(x, z) = \sum_{i} \lvert x_i - z_i \rvert \qquad d_{L2}(x, z) = \sqrt{\sum_{i} (x_i - z_i)^2} \]
There are many different specifications of distance metrics that we can
choose. In this chapter, we will explore the L1 and L2 metrics as well as
edit and textual distances.
We also have to choose how to weight the distances. A straightforward
way to weight the distances is by the distance itself. Points that are further
away from our prediction should have less impact than nearer points. The
most common way to weight is by the normalized inverse of the distance.
We will implement this method in the next recipe.
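As a concrete sketch of both prediction rules (plain NumPy with made-up neighbors; the weights are the normalized inverse distances just described):
import numpy as np
dists = np.array([0.5, 1.0, 2.0, 4.0])       # distances to the k=4 neighbors
targets = np.array([10., 12., 20., 30.])     # neighbor target values
classes = np.array([0, 0, 1, 1])             # neighbor class labels
weights = (1. / dists) / np.sum(1. / dists)  # normalized inverse distance
# Regression: weighted average of the neighbor targets.
print(np.sum(weights * targets))             # 13.2 -- closer points dominate
# Classification: weighted vote, then take the heaviest class.
votes = np.array([np.sum(weights[classes == j]) for j in [0, 1]])
print(np.argmax(votes))                      # 0 -- the two nearest neighbors win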
Note
Note that k-NN is an aggregating method. For regression, we are
performing a weighted average of neighbors. Because of this, predictions
will be less extreme and less varied than the actual targets. The magnitude
of this effect will be determined by k, the number of neighbors in the
algorithm.
Working with Nearest Neighbors
We start this chapter by implementing nearest neighbors to predict housing
values. This is a great way to start with nearest neighbors because we will
be dealing with numerical features and continuous targets.
Getting ready
To illustrate how making predictions with nearest neighbors works in
TensorFlow, we will use the Boston housing dataset. Here we will be
predicting the median neighborhood housing value as a function of several
features.
Since we consider the training set the trained model, we will find the k-
NNs to the prediction points and do a weighted average of the target value.
How to do it…
1. First, we will start by loading the required libraries and starting a graph
session. We will use the requests module to load the necessary Boston
housing data from the UCI machine learning repository:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import requests
sess = tf.Session()
2. Next, we will load the data using the requests module:
housing_url = 'https://archive.ics.uci.edu/ml/machine-
learning-databases/housing/housing.data'
housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM',
'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
cols_used = ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS',
'TAX', 'PTRATIO', 'B', 'LSTAT']
num_features = len(cols_used)
# Request data
housing_file = requests.get(housing_url)
# Parse Data
housing_data = [[float(x) for x in y.split(' ') if len(x)>=1]
for y in housing_file.text.split('\n') if len(y)>=1]
3.
Next, we separate the data into our dependent and independent
features. We will be predicting the last variable, MEDV, which is the
median value for the group of houses. We will also not use the features
ZN, CHAS, and RAD because of their uninformative or binary nature:
y_vals = np.transpose([np.array([y[13] for y in
housing_data])])
x_vals = np.array([[x for i,x in enumerate(y) if
housing_header[i] in cols_used] for y in housing_data])
x_vals = (x_vals - x_vals.min(0)) / x_vals.ptp(0)
4.
Now we split the x and y values into the train and test sets. We will
create the training set by selecting about 80% of the rows at random,
and leave the remaining 20% for the test set:
train_indices = np.random.choice(len(x_vals),
round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) -
set(train_indices)))
x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]
5.
Next, we declare our k value and batch size:
k = 4
batch_size=len(x_vals_test)
6.
We will declare our placeholders next. Remember that there are no
model variables to train, as the model is determined exactly by our
training set:
x_data_train = tf.placeholder(shape=[None, num_features],
dtype=tf.float32)
x_data_test = tf.placeholder(shape=[None, num_features],
dtype=tf.float32)
y_target_train = tf.placeholder(shape=[None, 1],
dtype=tf.float32)
y_target_test = tf.placeholder(shape=[None, 1],
dtype=tf.float32)
7. Next, we create our distance function for a batch of test points. Here,
we illustrate the use of the L1 distance:
distance = tf.reduce_sum(tf.abs(tf.sub(x_data_train,
tf.expand_dims(x_data_test,1))), reduction_indices=2)
Note
Note that the L2 distance function can be used as well. We would
change the distance formula to the following:
distance =
tf.sqrt(tf.reduce_sum(tf.square(tf.sub(x_data_train,
tf.expand_dims(x_data_test,1))), reduction_indices=2))
8.
Now we create our prediction function. To do this, we will use the
top_k() function, which returns the values and indices of the largest
values in a tensor. Since we want the indices of the smallest distances,
we will instead find the k-biggest negative distances. We also declare
the predictions and the mean squared error (MSE) of the target
values:
top_k_xvals, top_k_indices = tf.nn.top_k(tf.neg(distance),
k=k)
x_sums = tf.expand_dims(tf.reduce_sum(top_k_xvals, 1),1)
x_sums_repeated = tf.matmul(x_sums,tf.ones([1, k],
tf.float32))
x_val_weights =
tf.expand_dims(tf.div(top_k_xvals,x_sums_repeated), 1)
top_k_yvals = tf.gather(y_target_train, top_k_indices)
prediction =
tf.squeeze(tf.batch_matmul(x_val_weights,top_k_yvals),
squeeze_dims=[1])
mse = tf.div(tf.reduce_sum(tf.square(tf.sub(prediction,
y_target_test))), batch_size)
9.
Now we will run the prediction on the test batch and calculate the MSE, as follows:
num_loops = int(np.ceil(len(x_vals_test)/batch_size))
for i in range(num_loops):
min_index = i*batch_size
max_index = min((i+1)*batch_size,len(x_vals_test))
x_batch = x_vals_test[min_index:max_index]
y_batch = y_vals_test[min_index:max_index]
predictions = sess.run(prediction, feed_dict=
{x_data_train: x_vals_train, x_data_test: x_batch,
y_target_train: y_vals_train, y_target_test: y_batch})
batch_mse = sess.run(mse, feed_dict={x_data_train:
x_vals_train, x_data_test: x_batch, y_target_train:
y_vals_train, y_target_test: y_batch})
print('Batch #' + str(i+1) + ' MSE: ' +
str(np.round(batch_mse,3)))
Batch #1 MSE: 23.153
10.
Additionally, we can also look at a histogram of the actual target
values compared with the predicted values. One reason to look at this
is to notice the fact that with an averaging method, we have trouble
predicting the extreme ends of the targets:
bins = np.linspace(5, 50, 45)
plt.hist(predictions, bins, alpha=0.5, label='Prediction')
plt.hist(y_batch, bins, alpha=0.5, label='Actual')
plt.title('Histogram of Predicted and Actual Values')
plt.xlabel('Med Home Value in $1,000s')
plt.ylabel('Frequency')
plt.legend(loc='upper right')
plt.show()
Figure 1: A histogram of the predicted values and actual target
values for k-NN (k=4).
11. One hard thing to determine is the best value of k. For the preceding
figure and predictions, we used k=4 for our model. We chose this
specifically because it gives us the lowest MSE. This is verified by
cross validation. If we use cross validation across multiple values of k,
we will see that k=4 gives us a minimum MSE. We show this in the
following figure. It is also worthwhile to plot the variance of the predicted values, to show that it decreases the more neighbors we average over:
Figure 2: The MSE for k-NN predictions for various values of k. We
also plot the variance of the predicted values on the test set. Note
that the variance decreases as k increases.
How it works…
With the nearest neighbors algorithm, the model is the training set.
Because of this, we do not have to train any variables in our model. The
only parameter, k, was determined via cross-validation to minimize our MSE.
There's more…
For the weighting of the k-NN, we chose to weight directly by the
distance. There are other options that we could consider as well. Another
common method is to weight by the inverse squared distance.
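For example (a short aside reusing made-up distances), squaring the inverse distance concentrates even more of the weight on the nearest neighbor:
import numpy as np
dists = np.array([0.5, 1.0, 2.0, 4.0])
w_inv = (1. / dists) / np.sum(1. / dists)         # weights used in this recipe
w_inv2 = (1. / dists**2) / np.sum(1. / dists**2)  # inverse squared distance
print(np.round(w_inv, 3))   # [ 0.533  0.267  0.133  0.067]
print(np.round(w_inv2, 3))  # [ 0.753  0.188  0.047  0.012]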
Working with Text-Based
Distances
Nearest neighbors is more versatile than just dealing with numbers. As long
as we have a way to measure distances between features, we can apply the
nearest neighbors algorithm. In this recipe, we will introduce how to
measure text distances with TensorFlow.
Getting ready
In this recipe, we will illustrate how to use TensorFlow's text distance
metric, the Levenshtein distance (the edit distance), between strings. This
will be important later in this chapter as we expand the nearest neighbor
methods to include features with text.
The Levenshtein distance is the minimal number of edits to get from one
string to another string. The allowed edits are inserting a character,
deleting a character, or substituting a character with a different one. For
this recipe, we will use TensorFlow's Levenshtein distance function,
edit_distance(). It is worthwhile to illustrate the use of this function
because we will use it again in later chapters.
Note
Note that TensorFlow's edit_distance() function only accepts sparse
tensors. We will have to create our strings as sparse tensors of individual
characters.
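For intuition about what edit_distance() computes, here is a minimal pure-Python Levenshtein implementation (an illustrative aside; the recipe itself uses the TensorFlow function):
def levenshtein(s, t):
    # Classic dynamic program: dp[i][j] is the edit distance between
    # the first i characters of s and the first j characters of t.
    dp = [[i + j if i * j == 0 else 0 for j in range(len(t) + 1)]
          for i in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1]
print(levenshtein('bear', 'beers'))  # 2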
How to do it…
1. First, we load TensorFlow and initialize a graph:
import tensorflow as tf
sess = tf.Session()
2. Then we will show how to calculate the edit distance between two
words, 'bear' and 'beers'. First, we will create a list of characters
from our strings with Python's 'list()' function. Next, we create a
sparse 3D matrix from that list. We have to tell TensorFlow the
character indices, the shape of the matrix, and which characters we
want in the tensor. After this we can decide if we would like to go with
the total edit distance (normalize=False) or the normalized edit
distance (normalize=True), where we divide the edit distance by the
length of the second word:
Note
TensorFlow's documentation treats the two strings as a proposed
(hypothesis) string and a ground truth string. We will continue this
notation here with h and t tensors.
hypothesis = list('bear')
truth = list('beers')
h1 = tf.SparseTensor([[0,0,0], [0,0,1], [0,0,2], [0,0,3]],
hypothesis, [1,1,1])
t1 = tf.SparseTensor([[0,0,0], [0,0,1], [0,0,2], [0,0,3],
[0,0,4]], truth, [1,1,1])
print(sess.run(tf.edit_distance(h1, t1, normalize=False)))
3. This results in the following output:
[[ 2.]]
Note
The function, SparseTensorValue(), is a way to create a sparse tensor
in TensorFlow. It accepts the indices, values, and shape of a sparse
tensor we wish to create.
4. Next, we will illustrate how to compare two words, bear and beer,
both with another word, beers. In order to achieve this, we must
replicate 'beers' in order to have the same number of comparable
words:
hypothesis2 = list('bearbeer')
truth2 = list('beersbeers')
h2 = tf.SparseTensor([[0,0,0], [0,0,1], [0,0,2], [0,0,3],
[0,1,0], [0,1,1], [0,1,2], [0,1,3]], hypothesis2, [1,2,4])
t2 = tf.SparseTensor([[0,0,0], [0,0,1], [0,0,2], [0,0,3],
[0,0,4], [0,1,0], [0,1,1], [0,1,2], [0,1,3], [0,1,4]],
truth2, [1,2,5])
print(sess.run(tf.edit_distance(h2, t2, normalize=True)))
5.
This results in the following output:
[[ 0.40000001  0.2       ]]
6.
A more efficient way to compare a set of words against another word
is shown in this example. We create the indices and list of characters
beforehand for both the hypothesis and ground truth string:
hypothesis_words = ['bear','bar','tensor','flow']
truth_word = ['beers']
num_h_words = len(hypothesis_words)
h_indices = [[xi, 0, yi] for xi,x in
enumerate(hypothesis_words) for yi,y in enumerate(x)]
h_chars = list(''.join(hypothesis_words))
h3 = tf.SparseTensor(h_indices, h_chars, [num_h_words,1,1])
truth_word_vec = truth_word*num_h_words
t_indices = [[xi, 0, yi] for xi,x in
enumerate(truth_word_vec) for yi,y in enumerate(x)]
t_chars = list(''.join(truth_word_vec))
t3 = tf.SparseTensor(t_indices, t_chars, [num_h_words,1,1])
print(sess.run(tf.edit_distance(h3, t3, normalize=True)))
7.
This results in the following output:
[[ 0.40000001]
[ 0.60000002]
[ 0.80000001]
[ 1.        ]]
8.
Now we will illustrate how to calculate the edit distance between two
word lists using placeholders. The concept is the same, except we will
be feeding in SparseTensorValue() instead of sparse tensors. First, we
will create a function that creates the sparse tensors from a word list:
def create_sparse_vec(word_list):
num_words = len(word_list)
indices = [[xi, 0, yi] for xi,x in enumerate(word_list)
for yi,y in enumerate(x)]
chars = list(''.join(word_list))
return(tf.SparseTensorValue(indices, chars,
[num_words,1,1]))
hyp_string_sparse = create_sparse_vec(hypothesis_words)
truth_string_sparse =
create_sparse_vec(truth_word*len(hypothesis_words))
hyp_input = tf.sparse_placeholder(dtype=tf.string)
truth_input = tf.sparse_placeholder(dtype=tf.string)
edit_distances = tf.edit_distance(hyp_input, truth_input,
normalize=True)
feed_dict = {hyp_input: hyp_string_sparse,
truth_input: truth_string_sparse}
print(sess.run(edit_distances, feed_dict=feed_dict))
9.
This results in the following output:
[[ 0.40000001]
[ 0.60000002]
[ 0.80000001]
[ 1.        ]]
How it works…
For this recipe, we have shown that we can measure text distances several
ways using TensorFlow. This will be extremely useful for performing
nearest neighbors on data that has text features. We will see more of this
later in the chapter when we perform address matching.
There's more…
Other text distance metrics exist that we should discuss. Here is a definition table describing various other text distances between two strings, s1 and s2:

Hamming distance: the number of equal character positions; only valid if the strings are of equal length. Formula: \( D(s_1, s_2) = \sum_{i} I_i \), where \(I_i\) is an indicator function of equal characters at position i.

Cosine distance: the dot product of the k-gram differences divided by the L2 norm of the k-gram differences. Formula: \( D(s_1, s_2) = \frac{k(s_1) \cdot k(s_2)}{\lVert k(s_1) \rVert_2 \, \lVert k(s_2) \rVert_2} \)

Jaccard distance: the number of characters in common divided by the total union of characters in both strings. Formula: \( D(s_1, s_2) = \frac{\lvert s_1 \cap s_2 \rvert}{\lvert s_1 \cup s_2 \rvert} \)
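A couple of these are easy to compute directly. The sketch below (plain Python, an aside) follows the usual conventions: Hamming distance counts differing positions, and Jaccard treats the strings as character sets:
def hamming_distance(s1, s2):
    # Only defined for equal-length strings.
    assert len(s1) == len(s2)
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
def jaccard_distance(s1, s2):
    # One minus (characters in common / union of characters).
    set1, set2 = set(s1), set(s2)
    return 1. - len(set1 & set2) / float(len(set1 | set2))
print(hamming_distance('bear', 'beer'))   # 1
print(jaccard_distance('bear', 'beers'))  # {b,e,a,r} vs {b,e,r,s}: 1 - 3/5 = 0.4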
Computing with Mixed Distance
Functions
When dealing with data observations that have multiple features, we should be aware that different features can be on very different scales. In this recipe, we account for that to improve our housing value predictions.
Getting ready
It is important to extend the nearest neighbor algorithm to take into
account variables that are scaled differently. In this example, we will show
how to scale the distance function for different variables. Specifically, we
will scale the distance function as a function of the feature variance.
The key to weighting the distance function is to use a weight matrix. The distance function written with matrix operations becomes the following formula:

\[ d(x, y) = \sqrt{(x - y)^{\top} A \, (x - y)} \]

Here, A is a diagonal weight matrix that we use to scale the distance metric for each feature.
For this recipe, we will try to improve our MSE on the Boston housing
value dataset. This dataset is a great example of features that are on
different scales, and the nearest neighbor algorithm would benefit from
scaling the distance function.
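To make the weighted distance formula concrete, here is a small NumPy sketch (with illustrative values) of \( d(x, y) = \sqrt{(x - y)^{\top} A \, (x - y)} \) using a diagonal A:
import numpy as np
x = np.array([1.0, 10.0])
y = np.array([2.0, 30.0])
A = np.diag([1.0, 0.01])  # down-weight the second, larger-scale feature
diff = x - y
weighted_dist = np.sqrt(diff.dot(A).dot(diff))
print(weighted_dist)  # sqrt(1*1 + 0.01*400) = sqrt(5.0) ~ 2.236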
How to do it…
1.
First, we will load the necessary libraries and start a graph session:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import requests
sess = tf.Session()
2.
Next, we load the data and store it in a numpy array. Again, note that
we will only use certain columns for prediction. We do not use id
variables nor variables that have very low variance:
housing_url = 'https://archive.ics.uci.edu/ml/machine-
learning-databases/housing/housing.data'
housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM',
'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
cols_used = ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS',
'TAX', 'PTRATIO', 'B', 'LSTAT']
num_features = len(cols_used)
housing_file = requests.get(housing_url)
housing_data = [[float(x) for x in y.split(' ') if len(x)>=1]
for y in housing_file.text.split('\n') if len(y)>=1]
y_vals = np.transpose([np.array([y[13] for y in
housing_data])])
x_vals = np.array([[x for i,x in enumerate(y) if
housing_header[i] in cols_used] for y in housing_data])
3.
Now we scale the x values to be between zero and 1 with min-max
scaling:
x_vals = (x_vals - x_vals.min(0)) / x_vals.ptp(0)
4.
We now create the diagonal weight matrix that will provide the scaling
of the distance metric by the standard deviation of the features:
weight_diagonal = x_vals.std(0)
weight_matrix = tf.cast(tf.diag(weight_diagonal),
dtype=tf.float32)
5.
Now we split the data into a training and test set. We also declare k,
the number of nearest neighbors, and make the batch size equal to the
test set size:
train_indices = np.random.choice(len(x_vals),
round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) -
set(train_indices)))
x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]
k = 4
batch_size=len(x_vals_test)
6.
We declare our placeholders that we need next. We have four
placeholders, the x-inputs and y-targets for both the training and test
set:
x_data_train = tf.placeholder(shape=[None, num_features],
dtype=tf.float32)
x_data_test = tf.placeholder(shape=[None, num_features],
dtype=tf.float32)
y_target_train = tf.placeholder(shape=[None, 1],
dtype=tf.float32)
y_target_test = tf.placeholder(shape=[None, 1],
dtype=tf.float32)
7.
Now we can declare our distance function. For readability, we break
up the distance function into its components. Note that we will have
to tile the weight matrix by the batch size and use the batch_matmul()
function to perform batch matrix multiplication across the batch size:
subtraction_term = tf.sub(x_data_train,
tf.expand_dims(x_data_test,1))
first_product = tf.batch_matmul(subtraction_term,
tf.tile(tf.expand_dims(weight_matrix,0), [batch_size,1,1]))
second_product = tf.batch_matmul(first_product,
tf.transpose(subtraction_term, perm=[0,2,1]))
distance = tf.sqrt(tf.batch_matrix_diag_part(second_product))
8.
After we calculate all the training distances for each test point, we
need to return the top k-NNs. We do this with the top_k() function.
Since this function returns the largest values, and we want the smallest
distances, we return the largest of the negative distance values. We
then want to make predictions as the weighted average of the
distances of the top k neighbors:
top_k_xvals, top_k_indices = tf.nn.top_k(tf.neg(distance),
k=k)
x_sums = tf.expand_dims(tf.reduce_sum(top_k_xvals, 1),1)
x_sums_repeated = tf.matmul(x_sums,tf.ones([1, k],
tf.float32))
x_val_weights =
tf.expand_dims(tf.div(top_k_xvals,x_sums_repeated), 1)
top_k_yvals = tf.gather(y_target_train, top_k_indices)
prediction =
tf.squeeze(tf.batch_matmul(x_val_weights,top_k_yvals),
squeeze_dims=[1])
9.
To evaluate our model, we calculate the MSE of our predictions:
mse = tf.div(tf.reduce_sum(tf.square(tf.sub(prediction,
y_target_test))), batch_size)
10.
Now we can loop through our test batches and calculate the MSE for
each:
num_loops = int(np.ceil(len(x_vals_test)/batch_size))
for i in range(num_loops):
min_index = i*batch_size
max_index = min((i+1)*batch_size,len(x_vals_test))
x_batch = x_vals_test[min_index:max_index]
y_batch = y_vals_test[min_index:max_index]
predictions = sess.run(prediction, feed_dict=
{x_data_train: x_vals_train, x_data_test: x_batch,
y_target_train: y_vals_train, y_target_test: y_batch})
batch_mse = sess.run(mse, feed_dict={x_data_train:
x_vals_train, x_data_test: x_batch, y_target_train:
y_vals_train, y_target_test: y_batch})
print('Batch #' + str(i+1) + ' MSE: ' +
str(np.round(batch_mse,3)))
11.
This results in the following output:
Batch #1 MSE: 21.322
12.
As a final comparison, we can plot the distribution of housing values
for the actual test set and the predictions on the test set with the
following code:
bins = np.linspace(5, 50, 45)
plt.hist(predictions, bins, alpha=0.5, label='Prediction')
plt.hist(y_batch, bins, alpha=0.5, label='Actual')
plt.title('Histogram of Predicted and Actual Values')
plt.xlabel('Med Home Value in $1,000s')
plt.ylabel('Frequency')
plt.legend(loc='upper right')
plt.show()
Figure 3: The two histograms of the predicted and actual housing
values on the Boston dataset. This time we have scaled the distance
function differently for each feature.
How it works…
We decreased our MSE on the test set here by introducing a method of
scaling the distance functions for each feature. Here, we scaled the
distance functions by a factor of the feature's standard deviation. This
provides a more accurate measure of which points are the closest neighbors. From this we also took the weighted average of the top k
neighbors as a function of distance to get the housing value prediction.
There's more…
This scaling factor can also be used to down-weight or up-weight features
in the nearest neighbor distance calculation. This can be useful in
situations where we trust features more or less than others.
Using an Address Matching
Example
Now that we have measured numerical and text distances, we will spend
time learning how to combine them to measure distances between
observations that have both text and numerical features.
Getting ready
Nearest neighbor is a great algorithm to use for address matching. Address
matching is a type of record matching in which we have addresses in
multiple datasets and we would like to match them up. In address
matching, we may have typos in the address, different cities, or different
zip codes, but they may all refer to the same address. Using the nearest
neighbor algorithm across the numerical and character components of an
address may help us identify addresses that are actually the same.
In this example, we will generate two datasets. Each dataset will comprise
a street address and a zip code. But one dataset has a high number of typos
in the street address. We will take the non-typo dataset as our gold
standard and return one address from it for each typo address that is the
closest as a function of the string distance (for the street) and numerical
distance (for the zip code).
The first part of the code will focus on generating the two datasets. Then
the second part of the code will run through the test set and return the
closest address from the training set.
How to do it…
1. We first start by loading the necessary libraries:
import random
import string
import numpy as np
import tensorflow as tf
2.
We will now create the reference dataset. To show succinct output, we will make each dataset comprise only 10 addresses (but it can be run with many more):
n = 10
street_names = ['abbey', 'baker', 'canal', 'donner', 'elm']
street_types = ['rd', 'st', 'ln', 'pass', 'ave']
rand_zips = [random.randint(65000,65999) for i in range(5)]
numbers = [random.randint(1, 9999) for i in range(n)]
streets = [random.choice(street_names) for i in range(n)]
street_suffs = [random.choice(street_types) for i in
range(n)]
zips = [random.choice(rand_zips) for i in range(n)]
full_streets = [str(x) + ' ' + y + ' ' + z for x,y,z in
zip(numbers, streets, street_suffs)]
reference_data = [list(x) for x in zip(full_streets,zips)]
3.
To create the test set, we need a function that will randomly create a
typo in a string and return the resulting string:
def create_typo(s, prob=0.75):
if random.uniform(0,1) < prob:
rand_ind = random.choice(range(len(s)))
s_list = list(s)
s_list[rand_ind]=random.choice(string.ascii_lowercase)
s = ''.join(s_list)
return(s)
typo_streets = [create_typo(x) for x in streets]
typo_full_streets = [str(x) + ' ' + y + ' ' + z for x,y,z in
zip(numbers, typo_streets, street_suffs)]
test_data = [list(x) for x in zip(typo_full_streets,zips)]
4.
Now we can initialize a graph session and declare the placeholders we need. We will need four placeholders: an address and a zip code placeholder for both the test set and the reference set:
sess = tf.Session()
test_address = tf.sparse_placeholder(dtype=tf.string)
test_zip = tf.placeholder(shape=[None, 1], dtype=tf.float32)
ref_address = tf.sparse_placeholder(dtype=tf.string)
ref_zip = tf.placeholder(shape=[None, n], dtype=tf.float32)
5.
Now we declare the numerical zip distance and the edit distance for
the address string:
zip_dist = tf.square(tf.sub(ref_zip, test_zip))
address_dist = tf.edit_distance(test_address, ref_address,
normalize=True)
6.
We now convert the zip distance and the address distance into
similarities. For the similarities, we want a similarity of 1 when the two
inputs are exactly the same and near 0 when they are very different.
For the zip distance, we can do this by taking the distances,
subtracting from the max, and then dividing by the range of the
distances. For the address similarity, since the distance is already
scaled between 0 and 1, we just subtract it from 1 to get the similarity:
zip_max = tf.gather(tf.squeeze(zip_dist), tf.argmax(zip_dist,
1))
zip_min = tf.gather(tf.squeeze(zip_dist), tf.argmin(zip_dist,
1))
zip_sim = tf.div(tf.sub(zip_max, zip_dist), tf.sub(zip_max,
zip_min))
address_sim = tf.sub(1., address_dist)
7.
To combine the two similarity functions, we take a weighted average
of the two. For this recipe, we put equal weight on the address and the
zip code. We can also change this depending on how much we trust
each feature. We then return the index of the highest similarity of the
reference set:
address_weight = 0.5
zip_weight = 1. - address_weight
weighted_sim = tf.add(tf.transpose(tf.mul(address_weight,
address_sim)), tf.mul(zip_weight, zip_sim))
top_match_index = tf.argmax(weighted_sim, 1)
8.
In order to use the edit distance in TensorFlow, we have to convert the address strings to sparse vectors. In the Working with Text-Based Distances recipe earlier in this chapter, we created the following function, and we will use it in this recipe as well:
def sparse_from_word_vec(word_vec):
num_words = len(word_vec)
indices = [[xi, 0, yi] for xi,x in enumerate(word_vec)
for yi,y in enumerate(x)]
chars = list(''.join(word_vec))
# Now we return our sparse vector
return(tf.SparseTensorValue(indices, chars,
[num_words,1,1]))
9.
We need to separate the addresses and zip codes in the reference
dataset, so we can feed them into the placeholders when we loop
through the test set:
reference_addresses = [x[0] for x in reference_data]
reference_zips = np.array([[x[1] for x in reference_data]])
10.
We need to create the sparse tensor set of reference addresses using
the function we created in step 8:
sparse_ref_set = sparse_from_word_vec(reference_addresses)
11.
Now we can loop through each entry of the test set and return the
index of the reference set that it is the closest to. We print off both the
test and reference for each entry. As you can see, we have great
results on this generated dataset:
for i in range(n):
test_address_entry = test_data[i][0]
test_zip_entry = [[test_data[i][1]]]
# Create sparse address vectors
test_address_repeated = [test_address_entry] * n
sparse_test_set =
sparse_from_word_vec(test_address_repeated)
feeddict={test_address: sparse_test_set,
test_zip: test_zip_entry,
ref_address: sparse_ref_set,
ref_zip: reference_zips}
best_match = sess.run(top_match_index,
feed_dict=feeddict)
best_street = reference_addresses[best_match]
[best_zip] = reference_zips[0][best_match]
[[test_zip_]] = test_zip_entry
print('Address: ' + str(test_address_entry) + ', ' +
str(test_zip_))
print('Match  : ' + str(best_street) + ', ' +
str(best_zip))
12.
This results in the following output:
Address: 8659 beker ln, 65463
Match  : 8659 baker ln, 65463
Address: 1048 eanal ln, 65681
Match  : 1048 canal ln, 65681
Address: 1756 vaker st, 65983
Match  : 1756 baker st, 65983
Address: 900 abbjy pass, 65983
Match  : 900 abbey pass, 65983
Address: 5025 canal rd, 65463
Match  : 5025 canal rd, 65463
Address: 6814 elh st, 65154
Match  : 6814 elm st, 65154
Address: 3057 cagal ave, 65463
Match  : 3057 canal ave, 65463
Address: 7776 iaker ln, 65681
Match  : 7776 baker ln, 65681
Address: 5167 caker rd, 65154
Match  : 5167 baker rd, 65154
Address: 8765 donnor st, 65154
Match  : 8765 donner st, 65154
How it works…
One of the hard things to figure out in address matching problems like this
is the value of the weights and how to scale the distances. This may take
some exploration and insight into the data itself. Also, when dealing with
addresses we may consider different components than we did here. We
may consider the street number a separate component from the street
address, or even have other components, such as city and state. When
dealing with numerical address components, note that they can be treated
as numbers (with a numerical distance) or as characters (with an edit
distance). It is up to you to choose how. Also note that we might consider
using an edit distance with the zip code if we think that typos in the zip
code come from human entry and not, say, computer mapping errors.
To get a feel for how typos affect the results, we encourage the reader to
change the typo function to make more typos or more frequent typos and
increase the dataset's size to see how well this algorithm works.
Using Nearest Neighbors for Image
Recognition
Getting ready
Nearest neighbors can also be used for image recognition. The Hello
World of image recognition datasets is the MNIST handwritten digit
dataset. Since we will be using this dataset for various neural network
image recognition algorithms in later chapters, it will be great to compare
the results to a non-neural network algorithm.
The MNIST digit dataset is composed of thousands of labeled images that
are 28x28 pixels in size. Although this is considered to be a small image, it
has a total of 784 pixels (or features) for the nearest neighbor algorithm.
We will compute the nearest neighbor prediction for this categorical
problem by considering the mode prediction of the nearest k neighbors
(k=4 in this example).
How to do it…
1. We start by loading the necessary libraries. Note that we will also
import the Python Imaging Library (PIL) to be able to plot a sample
of the predicted outputs. And TensorFlow has a built-in method to load
the MNIST dataset that we will use:
import random
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from PIL import Image
from tensorflow.examples.tutorials.mnist import input_data
2. Now we start a graph session and load the MNIST data in a one hot
encoded form:
sess = tf.Session()
mnist = input_data.read_data_sets("MNIST_data/",
one_hot=True)
Note
One hot encoding is a numerical representation of categorical values that is better suited for numerical computations. Here we have 10 categories (the digits 0-9), and we represent each as a 0-1 vector of length 10. For example, the '0' category is denoted by the vector 1,0,0,0,0,0,0,0,0,0, the '1' category by the vector 0,1,0,0,0,0,0,0,0,0, and so on.
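A one-line NumPy sketch of this encoding (an aside):
import numpy as np
labels = np.array([0, 1, 9])
one_hot = np.eye(10)[labels]  # each row is a 0-1 vector of length 10
print(one_hot[0])  # [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]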
3.
Because the MNIST dataset is large and computing the distances
between 784 features on tens of thousands of inputs would be
computationally hard, we will sample a smaller set of images to train
on. Also, we choose a test set size that is divisible by six, purely for plotting purposes, as we will plot the last batch of six images to see
a sample of the results:
train_size = 1000
test_size = 102
rand_train_indices = np.random.choice(len(mnist.train.images), train_size, replace=False)
rand_test_indices = np.random.choice(len(mnist.test.images), test_size, replace=False)
x_vals_train = mnist.train.images[rand_train_indices]
x_vals_test = mnist.test.images[rand_test_indices]
y_vals_train = mnist.train.labels[rand_train_indices]
y_vals_test = mnist.test.labels[rand_test_indices]
4. We declare our k value and batch size:
k = 4
batch_size = 6
5. Now we declare the placeholders that we will feed into the graph:
x_data_train = tf.placeholder(shape=[None, 784], dtype=tf.float32)
x_data_test = tf.placeholder(shape=[None, 784], dtype=tf.float32)
y_target_train = tf.placeholder(shape=[None, 10], dtype=tf.float32)
y_target_test = tf.placeholder(shape=[None, 10], dtype=tf.float32)
6. We declare our distance metric. Here we will use the L1 metric
(absolute value):
distance = tf.reduce_sum(tf.abs(tf.sub(x_data_train, tf.expand_dims(x_data_test, 1))), reduction_indices=2)
Note
We can also make our distance function use the L2 distance by using
the following code instead:
distance = tf.sqrt(tf.reduce_sum(tf.square(tf.sub(x_data_train, tf.expand_dims(x_data_test, 1))), reduction_indices=2))
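For reference, the broadcasting in these distance expressions can be checked with a small NumPy sketch (the sizes here are illustrative):
import numpy as np
test = np.zeros((6, 784))                        # one test batch
train = np.zeros((1000, 784))                    # the training sample
diff = np.abs(train - np.expand_dims(test, 1))   # broadcasts to (6, 1000, 784)
dist = diff.sum(axis=2)                          # (6, 1000): test-by-train distances
print(dist.shape)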
7. Now we find the top k images that are the closest and predict the
mode. Since the labels are one hot encoded, we take the mode by
summing the neighbors' label vectors and picking the index with the
highest count:
top_k_xvals, top_k_indices = tf.nn.top_k(tf.neg(distance), k=k)
prediction_indices = tf.gather(y_target_train, top_k_indices)
count_of_predictions = tf.reduce_sum(prediction_indices, reduction_indices=1)
prediction = tf.argmax(count_of_predictions, dimension=1)
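To see why summing one hot labels yields the mode, consider this small NumPy illustration (the labels are made up):
import numpy as np
# Labels of k=3 neighbors, one hot encoded over 4 classes
neighbor_labels = np.array([[0, 0, 1, 0],
                            [0, 0, 1, 0],
                            [0, 1, 0, 0]])
votes = neighbor_labels.sum(axis=0)   # [0, 1, 2, 0]: counts per class
print(np.argmax(votes))               # 2, the most common neighbor label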
8. We can now loop through our test set, compute the predictions, and
store them:
num_loops = int(np.ceil(len(x_vals_test)/batch_size))
test_output = []
actual_vals = []
for i in range(num_loops):
    min_index = i*batch_size
    max_index = min((i+1)*batch_size, len(x_vals_test))
    x_batch = x_vals_test[min_index:max_index]
    y_batch = y_vals_test[min_index:max_index]
    predictions = sess.run(prediction, feed_dict={x_data_train: x_vals_train,
        x_data_test: x_batch, y_target_train: y_vals_train, y_target_test: y_batch})
    test_output.extend(predictions)
    actual_vals.extend(np.argmax(y_batch, axis=1))
9. Now that we have saved the actual and predicted outputs, we can
calculate the accuracy. This will vary because of our random sampling
of the test/training datasets, but we should end up with accuracies of
around 80% to 90%:
accuracy = sum([1./test_size for i in range(test_size) if test_output[i]==actual_vals[i]])
print('Accuracy on test set: ' + str(accuracy))
Accuracy on test set: 0.8333333333333325
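Equivalently, assuming the test_output and actual_vals lists computed above, the accuracy can be written in one line with NumPy:
# One-line equivalent of the accuracy sum above (illustration only)
accuracy = np.mean(np.array(test_output) == np.array(actual_vals))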
10. Here is the code to plot the last batch results:
actuals = np.argmax(y_batch, axis=1)
Nrows = 2
Ncols = 3
for i in range(len(actuals)):
    plt.subplot(Nrows, Ncols, i+1)
    plt.imshow(np.reshape(x_batch[i], [28,28]), cmap='Greys_r')
    plt.title('Actual: ' + str(actuals[i]) + ' Pred: ' + str(predictions[i]), fontsize=10)
    frame = plt.gca()
    frame.axes.get_xaxis().set_visible(False)
    frame.axes.get_yaxis().set_visible(False)
Figure 4: The last batch of six images we ran our nearest neighbor
prediction on. We can see that we do not get all of the images exactly
correct
How it works…
Given enough computation time and resources, we could have made the
test and training sets bigger. This would probably have increased our
accuracy, and it is also a common way to prevent overfitting. Finding the
ideal k value also warrants further exploration; it would be chosen after a
set of cross-validation experiments on the dataset.
There's more…
We can also use the nearest neighbor algorithm to evaluate unseen digits
written by the user. See the online repository for a way to use this model
to evaluate user-input digits:
https://github.com/nfmcclure/tensorflow_cookbook .
In this chapter, we've explored how to use kNN algorithms for regression
and classification. We've talked about the different uses of distance
functions and how to mix them together. We encourage the reader to
explore different distance metrics, weights, and k values to optimize the
accuracy of these methods.
Chapter 6. Neural Networks
In this chapter, we will introduce neural networks and how to implement
them in TensorFlow. Most of the subsequent chapters will be based on
neural networks, so learning how to use them in TensorFlow is very
important. We will start by introducing basic concepts of neural
networking and work up to multilayer networks. In the last section, we will
create a neural network that learns to play Tic Tac Toe.
In this chapter, we'll cover the following recipes:
Implementing Operational Gates
Working with Gates and Activation Functions
Implementing a One-Layer Neural Network
Implementing Different Layers
Using Multilayer Networks
Improving Predictions of Linear Models
Learning to Play Tic Tac Toe
The reader can find all the code from this chapter online, at
https://github.com/nfmcclure/tensorflow_cookbook .
Introduction
Neural networks are currently breaking records in tasks such as image and
speech recognition, reading handwriting, understanding text, image
segmentation, dialog systems, autonomous car driving, and so much more.
While some of these aforementioned tasks will be covered in later
chapters, it is important to introduce neural networks as an easy-to-
implement machine learning algorithm, so that we can expand on it later.
The concept of a neural network has been around for decades. However, it
has only recently gained traction because advances in processing power,
algorithm efficiency, and data sizes have given us the computational
power to train large networks.
A neural network is basically a sequence of operations applied to a matrix
of input data. These operations are usually collections of additions and
multiplications followed by applications of non-linear functions. One
example that we have already seen is logistic regression, covered in the
last section of Chapter 3, Linear Regression. Logistic regression is the
sum of the partial slope-feature products followed by the application of
the sigmoid function, which is non-linear. Neural networks generalize this
a bit more by allowing any combination of operations and non-linear
functions, including absolute value, maximum, minimum, and so on.
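As a reminder of this structure, here is a small NumPy sketch of logistic regression written as one multiply-add layer followed by a non-linearity; the numbers are arbitrary placeholders:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 0.5])        # input features
w = np.array([0.2, -0.1, 0.4])       # partial slopes (model variables)
b = 0.3                              # intercept
output = sigmoid(np.dot(w, x) + b)   # linear combination, then non-linearity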
The important trick with neural networks is called 'back propagation'.
Back propagation is a procedure that allows us to update the model
variables based on the learning rate and the output of the loss function.
We used back propagation to update our model variables in Chapter 3,
Linear Regression, and Chapter 4, Support Vector Machines.
Another important feature to take note of in neural networks is the non-
linear activation function. Since most neural networks are just
combinations of addition and multiplication operations, they would not
be able to model non-linear datasets on their own. To address this issue,
we use non-linear activation functions in the neural networks, which
allow the network to adapt to most non-linear situations.
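A quick NumPy check of this point: without a non-linearity between them, two stacked linear layers collapse into a single linear map (the matrices here are random placeholders):
import numpy as np
np.random.seed(0)
A1 = np.random.randn(3, 5)           # first layer of weights
A2 = np.random.randn(5, 2)           # second layer of weights
x = np.random.randn(4, 3)            # a batch of inputs
# Two matrix multiplications equal one multiplication by A1.dot(A2)
print(np.allclose(x.dot(A1).dot(A2), x.dot(A1.dot(A2))))  # True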
It is important to remember that, like most of the algorithms we have seen
so far, neural networks are sensitive to the hyper-parameters that we
choose. In this chapter, we will see the impact of different learning rates,
loss functions, and optimization procedures.
Note
There are many resources that cover neural networks in more depth and
detail:
The seminal paper describing back propagation is Efficient BackProp by
Yann LeCun and others. The PDF is located here:
http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf .
CS231n, Convolutional Neural Networks for Visual Recognition, by
Stanford University, class resources available here:
http://cs231n.stanford.edu/ .
CS224d, Deep Learning for Natural Language Processing, by Stanford
University, class resources available here: http://cs224d.stanford.edu/ .
Deep Learning, a book by Goodfellow and others, published by MIT
Press, 2016. Located here: http://www.deeplearningbook.org .
There is an online book called Neural Networks and Deep Learning by
Michael Nielsen, located here:
http://neuralnetworksanddeeplearning.com/ .
For a more pragmatic approach and introduction to neural networks,
Andrej Karpathy has written a great summary and JavaScript examples
called A Hacker's Guide to Neural Networks. The write-up is located here:
http://karpathy.github.io/neuralnets/ .
Another site that summarizes some good notes on deep learning is called
Deep Learning for Beginners by Ian Goodfellow, Yoshua Bengio, and
Aaron Courville. This web page can be found here:
http://randomekek.github.io/deep/deeplearning.html .
Implementing Operational Gates
One of the most fundamental concepts of neural networks is an operation
known as an operational gate. In this section, we will start with a
multiplication operation as a gate and then we will consider nested gate
operations.
Getting ready
The first operational gate we will implement looks like f(x) = a · x. To
optimize this gate, we declare the a input as a variable and the x input as a
placeholder. This means that TensorFlow will try to change the a value and
not the x value. We will create the loss function as the difference between
the output and the target value, which is 50.
The second, nested operational gate will be f(x) = a · x + b. Again, we will
declare a and b as variables and x as a placeholder. We optimize the output
toward the target value of 50 again. The interesting thing to note is that the
solution for this second example is not unique. There are many
combinations of model variables that will allow the output to be 50. With
neural networks, we do not care as much for the values of the intermediate
model variables, but place more emphasis on the desired output.
Think of the operations as operational gates on our computational graph.
Here is a figure depicting the two examples:
Figure 1: Two operational gate examples in this section.
How to do it…
To implement the first operational gate, f(x) = a · x, in TensorFlow and train the
output toward the value of 50, follow these steps:
1. We start off by loading TensorFlow and creating a graph session:
import tensorflow as tf
sess = tf.Session()
2. Now, we declare our model variable, input data, and placeholder. We
make our input data equal to the value 5, so that the multiplication
factor needed to get 50 will be 10 (that is, 5 × 10 = 50):
a = tf.Variable(tf.constant(4.))
x_val = 5.
x_data = tf.placeholder(dtype=tf.float32)
3. Next we add the operation to our computational graph:
multiplication = tf.mul(a, x_data)
4. We will declare the loss function as the L2 distance between the
output and the desired target value of 50:
loss = tf.square(tf.sub(multiplication, 50.))
5. Now we initialize our model variable and declare our optimizing
algorithm as the standard gradient descent:
init = tf.initialize_all_variables()
sess.run(init)
my_opt = tf.train.GradientDescentOptimizer(0.01)
train_step = my_opt.minimize(loss)
6. We can now optimize our model output towards the desired value of
50. We do this by continually feeding in the input value of 5 and back
propagating the loss to update the model variable towards the value of
10:
print('Optimizing a Multiplication Gate Output to 50.')
for i in range(10):
    sess.run(train_step, feed_dict={x_data: x_val})
    a_val = sess.run(a)
    mult_output = sess.run(multiplication, feed_dict={x_data: x_val})
    print(str(a_val) + ' * ' + str(x_val) + ' = ' + str(mult_output))
7. This results in the following output:
Optimizing a Multiplication Gate Output to 50.
7.0 * 5.0 = 35.0
8.5 * 5.0 = 42.5
9.25 * 5.0 = 46.25
9.625 * 5.0 = 48.125
9.8125 * 5.0 = 49.0625
9.90625 * 5.0 = 49.5312
9.95312 * 5.0 = 49.7656
9.97656 * 5.0 = 49.8828
9.98828 * 5.0 = 49.9414
9.99414 * 5.0 = 49.9707
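As a sanity check on the first line of this output, we can reproduce the first gradient descent step by hand; with the squared loss (a · x - 50)^2, the gradient with respect to a is 2 · (a · x - 50) · x:
# Hand-computed first update (an illustration of what TensorFlow does for us)
a, x, target, lr = 4.0, 5.0, 50.0, 0.01
grad = 2 * (a * x - target) * x   # 2 * (20 - 50) * 5 = -300
a_new = a - lr * grad             # 4.0 - 0.01 * (-300) = 7.0, matching 7.0 * 5.0 = 35.0
print(a_new)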
8. Next, we will do the same with a two-gate nested operation, f(x) = a · x + b.
9. We will start in exactly the same way as in the preceding example,
except now we'll initialize two model variables, a and b:
from tensorflow.python.framework import ops
ops.reset_default_graph()
sess = tf.Session()
a = tf.Variable(tf.constant(1.))
b = tf.Variable(tf.constant(1.))
x_val = 5.
x_data = tf.placeholder(dtype=tf.float32)
two_gate = tf.add(tf.mul(a, x_data), b)
loss = tf.square(tf.sub(two_gate, 50.))
my_opt = tf.train.GradientDescentOptimizer(0.01)
train_step = my_opt.minimize(loss)
init = tf.initialize_all_variables()
sess.run(init)
10. We now optimize the model variables to train the output towards the
target value of 50:
print('\nOptimizing Two Gate Output to 50.')
for i in range(10):
    # Run the train step
    sess.run(train_step, feed_dict={x_data: x_val})
    # Get the a and b values
    a_val, b_val = (sess.run(a), sess.run(b))
    # Run the two-gate graph output
    two_gate_output = sess.run(two_gate, feed_dict={x_data: x_val})
    print(str(a_val) + ' * ' + str(x_val) + ' + ' + str(b_val) + ' = ' + str(two_gate_output))
11. This results in the following output:
Optimizing Two Gate Output to 50.
5.4 * 5.0 + 1.88 = 28.88
7.512 * 5.0 + 2.3024 = 39.8624
8.52576 * 5.0 + 2.50515 = 45.134
9.01236 * 5.0 + 2.60247 = 47.6643
9.24593 * 5.0 + 2.64919 = 48.8789
9.35805 * 5.0 + 2.67161 = 49.4619
9.41186 * 5.0 + 2.68237 = 49.7417
9.43769 * 5.0 + 2.68754 = 49.876
9.45009 * 5.0 + 2.69002 = 49.9405
9.45605 * 5.0 + 2.69121 = 49.9714
Note
It is important to note here that the solution to the second example is
not unique. This does not matter as much in neural networks, as all
parameters are adjusted towards reducing the loss. The final solution
here will depend on the initial values of a and b. If these were
randomly initialized, instead of set to the value of 1, we would see
different final values for the model variables on each run.
How it works…
We achieved the optimization of a computational gate via TensorFlow's
implicit back propagation. TensorFlow keeps track of our model's
operations and variable values and makes adjustments with respect to our
optimization algorithm specification and the output of the loss function.
We can keep expanding the operational gates while keeping track of
which inputs are variables and which inputs are data. This distinction is
important because TensorFlow will change all variables to minimize
the loss, but not the data, which is declared as placeholders.
The implicit ability to keep track of the computational graph and update
the model variables automatically with every training step is one of the
great features of TensorFlow and what makes it so powerful.
Working with Gates and
Activation Functions
Now that we can link together operational gates, we will want to run the
computational graph output through an activation function. Here we
introduce common activation functions.
Getting ready
In this section, we will compare and contrast two different activation
functions, the sigmoid and the rectified linear unit (ReLU). Recall that
the two functions are given by the following equations:
sigmoid(x) = 1 / (1 + exp(-x))
ReLU(x) = max(0, x)
In this example, we will create two one-layer neural networks with the
same structure except one will feed through the sigmoid activation and one
will feed through the ReLU activation. The loss function will be governed
by the L2 distance from the value 0.75. We will randomly pull batch data
from a normal distribution (Normal(mean=2, sd=0.1)), and optimize the
output towards 0.75.
How to do it…
1. We'll start by loading the necessary libraries and initializing a graph
session. This is also a good point to bring up how to set a random seed
with TensorFlow. Since we will be using random number generators
from both NumPy and TensorFlow, we need to set a random seed for
each. With the same random seeds set, we should be able to replicate
the results:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
sess = tf.Session()
tf.set_random_seed(5)
np.random.seed(42)
2. Now we'll declare our batch size, model variables, data, and a
placeholder for feeding the data in. Our computational graph will
consist of feeding our normally distributed data into two similar
neural networks that differ only by the activation function at the end:
batch_size = 50
a1 = tf.Variable(tf.random_normal(shape=[1,1]))
b1 = tf.Variable(tf.random_uniform(shape=[1,1]))
a2 = tf.Variable(tf.random_normal(shape=[1,1]))
b2 = tf.Variable(tf.random_uniform(shape=[1,1]))
x = np.random.normal(2, 0.1, 500)
x_data = tf.placeholder(shape=[None, 1], dtype=tf.float32)
3. Next, we'll declare our two models, the sigmoid activation model and
the ReLU activation model:
sigmoid_activation = tf.sigmoid(tf.add(tf.matmul(x_data, a1), b1))
relu_activation = tf.nn.relu(tf.add(tf.matmul(x_data, a2), b2))
4. The loss functions will be the average squared distance between the
model output and the value of 0.75:
loss1 = tf.reduce_mean(tf.square(tf.sub(sigmoid_activation, 0.75)))
loss2 = tf.reduce_mean(tf.square(tf.sub(relu_activation, 0.75)))
5. Now we declare our optimization algorithm and initialize our variables:
my_opt = tf.train.GradientDescentOptimizer(0.01)
train_step_sigmoid = my_opt.minimize(loss1)
train_step_relu = my_opt.minimize(loss2)
init = tf.initialize_all_variables()
sess.run(init)
6. Now we'll loop through our training for 750 iterations for both models.
We will also save the loss output and the activation output values for
plotting afterwards:
loss_vec_sigmoid = []
loss_vec_relu = []
activation_sigmoid = []
activation_relu = []
for i in range(750):
    rand_indices = np.random.choice(len(x), size=batch_size)
    x_vals = np.transpose([x[rand_indices]])
    sess.run(train_step_sigmoid, feed_dict={x_data: x_vals})
    sess.run(train_step_relu, feed_dict={x_data: x_vals})
    loss_vec_sigmoid.append(sess.run(loss1, feed_dict={x_data: x_vals}))
    loss_vec_relu.append(sess.run(loss2, feed_dict={x_data: x_vals}))
    activation_sigmoid.append(np.mean(sess.run(sigmoid_activation, feed_dict={x_data: x_vals})))
    activation_relu.append(np.mean(sess.run(relu_activation, feed_dict={x_data: x_vals})))
7. The following is the code to plot the loss and the activation outputs:
plt.plot(activation_sigmoid, 'k-', label='Sigmoid Activation')
plt.plot(activation_relu, 'r--', label='Relu Activation')
plt.ylim([0, 1.0])
plt.title('Activation Outputs')
plt.xlabel('Generation')
plt.ylabel('Outputs')
plt.legend(loc='upper right')
plt.show()
plt.plot(loss_vec_sigmoid, 'k-', label='Sigmoid Loss')
plt.plot(loss_vec_relu, 'r--', label='Relu Loss')
plt.ylim([0, 1.0])
plt.title('Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Loss')
plt.legend(loc='upper right')
plt.show()
Figure 2: Computational graph outputs from the network with the
sigmoid activation and a network with the ReLU activation.
The two neural networks have the same architecture and target (0.75) but
different activation functions, sigmoid and ReLU. It is important to notice
how much more quickly the ReLU activation network converges to the
desired target of 0.75 than the sigmoid network does:
Figure 3: This figure depicts the loss value of the sigmoid and the ReLU
activation networks. Notice how extreme the ReLU loss is at the
beginning of the iterations.
How it works…
Because of the form of the ReLU activation function, it returns the value
of zero much more often than the sigmoid function does. We consider this
behavior a type of sparsity. This sparsity results in faster convergence,
but a loss of controlled gradients. On the other hand, the sigmoid function
has very well-controlled gradients and does not risk the extreme values
that the ReLU activation does:
Activation function | Advantages           | Disadvantages
Sigmoid             | Less extreme outputs | Slower convergence
ReLU                | Converges quicker    | Extreme output values possible
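A short NumPy sketch of the sparsity point: ReLU outputs are exactly zero for all negative inputs, while the sigmoid is never exactly zero:
import numpy as np
z = np.linspace(-2.0, 2.0, 9)
relu = np.maximum(z, 0.0)            # exactly 0 on the negative half
sigmoid = 1.0 / (1.0 + np.exp(-z))   # strictly between 0 and 1
print(np.sum(relu == 0.0), np.sum(sigmoid == 0.0))  # 5 0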
There's more…
In this section, we compared the ReLU activation function and the sigmoid
activation function for neural networks. There are many other activation
functions that are commonly used for neural networks, but most fall into
one of two categories: the first category contains functions that are shaped
like the sigmoid function (arctan, hyperbolic tangent, Heaviside step, and
so on) and the second category contains functions that are shaped like the
ReLU function (softplus, leaky ReLU, and so on). Most of what was
discussed in this section about comparing the two functions will hold true
for activations in either category. However, it is important to note that the
choice of activation function has a big impact on the convergence and
output of a neural network.
Implementing a One-Layer Neural
Network
We have all the tools to implement a neural network that operates on real
data. We will create a neural network with one layer that operates on the
Iris dataset.
Getting ready
In this section, we will implement a neural network with one hidden layer.
It is important to understand that a fully connected neural network is
based mostly on matrix multiplication. As such, the dimensions of the
data and the matrices must line up correctly.
Since this is a regression problem, we will use the mean squared error as
the loss function.
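Before building the graph, it can help to sketch the shape flow with NumPy stand-ins; the batch size and layer sizes below match this recipe, but the arrays are placeholders:
import numpy as np
batch = 50
x = np.zeros((batch, 3))                        # three input features per row
A1, b1 = np.zeros((3, 5)), np.zeros(5)          # hidden layer: 3 -> 5
A2, b2 = np.zeros((5, 1)), np.zeros(1)          # output layer: 5 -> 1
hidden = np.maximum(x.dot(A1) + b1, 0.0)        # shape (50, 5)
output = np.maximum(hidden.dot(A2) + b2, 0.0)   # shape (50, 1)
print(hidden.shape, output.shape)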
How to do it…
1. To create the computational graph, we'll start by loading the necessary
libraries:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn import datasets
2. Now we'll load the Iris data and store the petal width as the target
value. Then we'll start a graph session:
iris = datasets.load_iris()
x_vals = np.array([x[0:3] for x in iris.data])
y_vals = np.array([x[3] for x in iris.data])
sess = tf.Session()
3. Since the dataset is small, we want to set a seed to make the
results reproducible:
seed = 2
tf.set_random_seed(seed)
np.random.seed(seed)
4. To prepare the data, we'll create an 80-20 train-test split and normalize
the x features to be between 0 and 1 via min-max scaling:
train_indices = np.random.choice(len(x_vals), round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) - set(train_indices)))
x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]
def normalize_cols(m):
    col_max = m.max(axis=0)
    col_min = m.min(axis=0)
    return (m - col_min) / (col_max - col_min)
x_vals_train = np.nan_to_num(normalize_cols(x_vals_train))
x_vals_test = np.nan_to_num(normalize_cols(x_vals_test))
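The np.nan_to_num wrapper guards against a constant column, where the maximum equals the minimum and the division yields NaN. A quick sketch of that edge case with made-up values:
import numpy as np
m = np.array([[1., 5.], [1., 7.]])   # first column is constant
col_max, col_min = m.max(axis=0), m.min(axis=0)
scaled = np.nan_to_num((m - col_min) / (col_max - col_min))
print(scaled)   # the NaNs from the 0/0 column become 0.0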
5. Now we will declare the batch size and placeholders for the data and
target:
batch_size = 50
x_data = tf.placeholder(shape=[None, 3], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
6. The important part is to declare our model variables with the
appropriate shapes. We can declare the size of our hidden layer to be
any size we wish; here we set it to have five hidden nodes:
hidden_layer_nodes = 5
A1 = tf.Variable(tf.random_normal(shape=[3, hidden_layer_nodes]))
b1 = tf.Variable(tf.random_normal(shape=[hidden_layer_nodes]))
A2 = tf.Variable(tf.random_normal(shape=[hidden_layer_nodes, 1]))
b2 = tf.Variable(tf.random_normal(shape=[1]))
7. We'll now declare our model in two steps. The first step will be
creating the hidden layer output, and the second will be creating the
final output of the model:
Note
As a note, our model goes from three input features to five hidden
nodes to a single output value.
hidden_output = tf.nn.relu(tf.add(tf.matmul(x_data, A1), b1))
final_output = tf.nn.relu(tf.add(tf.matmul(hidden_output, A2), b2))
8. Here is our mean squared error as a loss function:
loss = tf.reduce_mean(tf.square(y_target - final_output))
9. Now we'll declare our optimizing algorithm and initialize our variables:
my_opt = tf.train.GradientDescentOptimizer(0.005)
train_step = my_opt.minimize(loss)
init = tf.initialize_all_variables()
sess.run(init)
10. Next we loop through our training iterations. We'll also initialize two
lists in which to store the train and test loss. In every loop, we also
want to randomly select a batch from the training data to fit to the
model:
# First we initialize the loss vectors for storage.
loss_vec = []
test_loss = []
for i in range(500):
    # First we select a random set of indices for the batch.
    rand_index = np.random.choice(len(x_vals_train), size=batch_size)
    # We then select the training values
    rand_x = x_vals_train[rand_index]
    rand_y = np.transpose([y_vals_train[rand_index]])
    # Now we run the training step
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    # We save the training loss
    temp_loss = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
    loss_vec.append(np.sqrt(temp_loss))
    # Finally, we run the test-set loss and save it.
    test_temp_loss = sess.run(loss, feed_dict={x_data: x_vals_test, y_target: np.transpose([y_vals_test])})
    test_loss.append(np.sqrt(test_temp_loss))
    if (i+1) % 50 == 0:
        print('Generation: ' + str(i+1) + '. Loss = ' + str(temp_loss))
11. And here is how we can plot the losses with matplotlib:
plt.plot(loss_vec, 'k-', label='Train Loss')
plt.plot(test_loss, 'r--', label='Test Loss')
plt.title('Loss (MSE) per Generation')
plt.xlabel('Generation')
plt.ylabel('Loss')
plt.legend(loc='upper right')
plt.show()
Figure 4: We plot the loss (MSE) of the train and test sets. Notice that
we are slightly overfitting the model after 200 generations, as the test
MSE does not drop any further, but the training MSE does continue
to drop.
How it works…
To visualize our model as a neural network diagram, refer to the following
figure:
Figure 5: Here is a visualization of our neural network, which has five
nodes in the hidden layer. We are feeding in three values: the sepal length
(S.L.), the sepal width (S.W.), and the petal length (P.L.). The target will
be the petal width. In total, there are 26 variables in the model (3×5 + 5
hidden-layer weights and biases, plus 5×1 + 1 output-layer weights and
biases).
There's more…
Note that we can identify when the model starts overfitting on the training