From time to time, a website named Kaggle hosts competitions in the fields of Data Science and Computer Vision. One of them was the Dogs vs. Cats challenge, where the objective was "to create an algorithm to distinguish dogs from cats". Although this particular challenge has already finished, I thought it would make pretty good material for a tutorial. Let's learn how to solve this problem together using OpenCV!
Here’s a live demo:
Setting up the environment
I’ll assume that you already have OpenCV 3.0 configured in your machine (if you don’t, you can do it here). Also, I’ll use the Boost library to read files in a directory (you can perhaps skip it and replace my code by dirent.h. It should work in the same way). You can download Boost here. Those are the only two external libraries that I’m going to use in this tutorial.
Ok, ok, let’s start by downloading the training and test sets. Click here and download the test1.zip (271.15mb). You may need to register first. After downloading, extract them to a folder of your preference. The training set will be used to adjust the parameters of our neural network (we will talk in details later), while the test set will be used to check the performance of our neural network (how good it is at generalizing unseen examples). Unhappily, the provided test set by Kaggle is not labeled, so we will split the training set (in the provided link) and use a part of it as our test set.
Reading training samples
Let’s start coding! First, let’s start by reading the list of files within the training set directory:
#include <vector>
#include <algorithm>
#include <functional>
#include <map>
#include <set>
#include <fstream>
#include <iostream>
#include <cstdlib>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/ml/ml.hpp>
#include <boost/filesystem.hpp>

namespace fs = boost::filesystem;

/**
 * Get all files in directory (not recursive)
 * @param directory Directory where the files are contained
 * @return A list containing the file name of all files inside given directory
 **/
std::vector<std::string> getFilesInDirectory(const std::string& directory)
{
    std::vector<std::string> files;
    fs::path root(directory);
    fs::directory_iterator it_end;
    for (fs::directory_iterator it(root); it != it_end; ++it)
    {
        if (fs::is_regular_file(it->path()))
        {
            files.push_back(it->path().string());
        }
    }
    return files;
}

int main(int argc, char** argv)
{
    if (argc != 4)
    {
        std::cerr << "Usage: <IMAGES_DIRECTORY> <NETWORK_INPUT_LAYER_SIZE> <TRAIN_SPLIT_RATIO>" << std::endl;
        exit(-1);
    }
    std::string imagesDir = argv[1];
    int networkInputSize = atoi(argv[2]);
    float trainSplitRatio = atof(argv[3]);

    std::cout << "Reading training set..." << std::endl;
    double start = (double)cv::getTickCount();
    std::vector<std::string> files = getFilesInDirectory(imagesDir);
    std::random_shuffle(files.begin(), files.end());
}
The getFilesInDirectory function expects a directory as input and returns a list of the filenames within it. In our main, we expect to receive three parameters from the command line: the directory where our training set is stored, the size of our network's input layer, and the train split ratio (e.g., 0.75 indicates that 75% of the images within the training set will be used to train our neural network, while the remaining 25% will be used to test it). We then shuffle the list of filenames (in order to prevent bias). Pretty straightforward so far, right? :)
Now we are going to iterate over each filename inside files and read the image associated with it. Since we will do this twice (once during the training step and once during the test step), let's factor it into a separate function to modularize our code.
opencv_ann.cpp
typedef std::vector<std::string>::const_iterator vec_iter;

/**
 * Read images from a list of file names and returns, for each read image,
 * its class name and its local descriptors
 */
void readImages(vec_iter begin, vec_iter end,
                std::function<void(const std::string&, const cv::Mat&)> callback)
{
    for (auto it = begin; it != end; ++it)
    {
        std::string filename = *it;
        std::cout << "Reading image " << filename << "..." << std::endl;
        cv::Mat img = cv::imread(filename, 0);
        if (img.empty())
        {
            std::cerr << "WARNING: Could not read image." << std::endl;
            continue;
        }
        std::string classname = getClassName(filename);
        cv::Mat descriptors = getDescriptors(img);
        callback(classname, descriptors);
    }
}
There it is. The readImages function expects to receive as input two vector iterators (one for the start of the range and another for the end, delimiting the range we will iterate over). It also expects a third parameter, a lambda function called callback (lambda functions are only available from C++11 on, so enable it on your compiler by adding the -std=c++11 flag, or -std=c++0x on older compilers). Now let's look more carefully at what's happening inside this function.
We use a for loop to iterate over each filename between the limits begin and end. For each filename, we read its associated image through the OpenCV imread function. The second parameter passed to imread indicates the color mode (0 = grayscale; we don't need the color information in this example, as you'll see later). After calling imread, we check whether we could actually read the image (through the empty method). If not, we skip to the next filename. Otherwise, we get the class name and the descriptors associated with the image and pass them to the callback function. Now let's implement the getClassName and getDescriptors functions.
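Just to illustrate the callback mechanism, here's a hypothetical call (not part of the final program) that only prints how many descriptors each image produced; our real callbacks appear in main further below:

readImages(files.begin(), files.end(),
    [&](const std::string& classname, const cv::Mat& descriptors) {
        // Hypothetical callback: just report what was extracted
        std::cout << classname << ": " << descriptors.rows
                  << " descriptors" << std::endl;
    });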
If you look at the files inside the training set you extracted, you will notice that they are named "dog.XXXXX.jpg" or "cat.XXXXX.jpg". The first three letters are always the class name, while the rest is just an identifier. So let's grab those first three letters!
opencv_ann.cpp
/**
 * Extract the class name from a file name
 */
inline std::string getClassName(const std::string& filename)
{
    return filename.substr(filename.find_last_of('/') + 1, 3);
}
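As a quick sanity check (with a hypothetical path, assuming you extracted the images to a train/ directory):

// Hypothetical example: "train/dog.8011.jpg" -> "dog"
std::string classname = getClassName("train/dog.8011.jpg");
assert(classname == "dog");  // requires <cassert>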
Now, what should the getDescriptors function look like? Let's figure that out in the next section.
Extracting features
There are several possible approaches here. We could use a color histogram, or perhaps a histogram of oriented gradients, etc. However, I'm going to take a different approach: I'm going to use the KAZE algorithm to extract local features from each image. Since we can't feed local features directly to a neural network (the number of descriptors varies from image to image), I'm also going to use the Bag of Words strategy to address this problem, turning each image's set of descriptors into a single histogram of visual words, and THAT will be used as input to our neural network. Got it? Excellent! So let's implement getDescriptors to extract the KAZE features from an image; later, after all KAZE features have been extracted, we'll apply the Bag of Words technique.
opencv_ann.cpp
/**
 * Extract local features for an image
 */
cv::Mat getDescriptors(const cv::Mat& img)
{
    cv::Ptr<cv::KAZE> kaze = cv::KAZE::create();
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;
    kaze->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
    return descriptors;
}
struct ImageData
{
    std::string classname;
    cv::Mat bowFeatures;
};

int main(int argc, char** argv)
{
    if (argc != 4)
    {
        std::cerr << "Usage: <IMAGES_DIRECTORY> <NETWORK_INPUT_LAYER_SIZE> <TRAIN_SPLIT_RATIO>" << std::endl;
        exit(-1);
    }
    std::string imagesDir = argv[1];
    int networkInputSize = atoi(argv[2]);
    float trainSplitRatio = atof(argv[3]);

    std::cout << "Reading training set..." << std::endl;
    double start = (double)cv::getTickCount();
    std::vector<std::string> files = getFilesInDirectory(imagesDir);
    std::random_shuffle(files.begin(), files.end());

    cv::Mat descriptorsSet;
    std::vector<ImageData*> descriptorsMetadata;
    std::set<std::string> classes;
    readImages(files.begin(), files.begin() + (size_t)(files.size() * trainSplitRatio),
        [&](const std::string& classname, const cv::Mat& descriptors) {
            // Append to the set of classes
            classes.insert(classname);
            // Append to the list of descriptors
            descriptorsSet.push_back(descriptors);
            // Append metadata to each extracted feature
            ImageData* data = new ImageData;
            data->classname = classname;
            data->bowFeatures = cv::Mat::zeros(cv::Size(networkInputSize, 1), CV_32F);
            for (int j = 0; j < descriptors.rows; j++)
            {
                descriptorsMetadata.push_back(data);
            }
        });
    std::cout << "Time elapsed in minutes: " << ((double)cv::getTickCount() - start) / cv::getTickFrequency() / 60.0 << std::endl;
}
I created a struct named ImageData with two fields: classname and bowFeatures. Before calling the readImages function, I instantiated three variables: descriptorsSet (the set of descriptors of all read images), descriptorsMetadata (a vector of the struct we previously created; it's filled in such a way that it has the same number of elements as descriptorsSet has rows, so the i-th row of descriptorsSet can also be used to access its metadata, such as the class name), and, finally, the classes variable (a set containing all the classes found).
Training the Bag of Words
Now that we have the whole set of descriptors stored in the descriptorsSet variable, we can apply the Bag of Words strategy. The Bag of Words algorithm is really simple: first, we use a clustering algorithm (such as k-means) to obtain k centroids. Each centroid represents a visual word (the set of visual words is often called the vocabulary). Then, for each image, we create a histogram of size M, where M is the number of visual words. For each descriptor extracted from the image, we measure its distance to all visual words and obtain the index of the nearest one. We use that index to increment the corresponding position of the histogram, obtaining a histogram of visual words that can later be fed to our neural network.
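Here's a tiny self-contained sketch (with made-up numbers, independent of the tutorial code) that illustrates the counting step:

#include <iostream>
#include <vector>

int main()
{
    // Toy example: k = 4 visual words; four descriptors whose nearest
    // visual words are 0, 2, 2 and 3 respectively.
    std::vector<int> nearestWord = { 0, 2, 2, 3 };
    std::vector<int> histogram(4, 0);
    for (int w : nearestWord)
        histogram[w]++;
    // The resulting histogram {1, 0, 2, 1} is the bag of words
    // representation of the image.
    for (int h : histogram)
        std::cout << h << " ";
    std::cout << std::endl;  // prints: 1 0 2 1
    return 0;
}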
int main()
{
    ...
    std::cout << "Creating vocabulary..." << std::endl;
    start = (double)cv::getTickCount();
    cv::Mat labels;
    cv::Mat vocabulary;
    // Use k-means to find k centroids (the words of our vocabulary)
    cv::kmeans(descriptorsSet, networkInputSize, labels,
               cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::MAX_ITER, 10, 0.01),
               1, cv::KMEANS_PP_CENTERS, vocabulary);
    // No need to keep it on memory anymore
    descriptorsSet.release();
    std::cout << "Time elapsed in minutes: " << ((double)cv::getTickCount() - start) / cv::getTickFrequency() / 60.0 << std::endl;

    // Convert the set of local features of each image into a single descriptor
    // using the bag of words technique
    std::cout << "Getting histograms of visual words..." << std::endl;
    int* ptrLabels = (int*)(labels.data);
    int size = labels.rows * labels.cols;
    for (int i = 0; i < size; i++)
    {
        int label = *ptrLabels++;
        ImageData* data = descriptorsMetadata[i];
        data->bowFeatures.at<float>(label)++;
    }
}
We use OpenCV's k-means function to obtain k centroids (where k is the size of our network input layer, since the size of our histograms must be compatible with it), stored in the vocabulary variable. We also pass an additional parameter, labels, which receives the index of the nearest cluster for each descriptor, so we don't need to compute it twice. Then, iterating over each element of labels, we fill our histograms (the bowFeatures field of our ImageData struct). The strategy of filling descriptorsMetadata so that its number of elements equals the number of rows of descriptorsSet turns out to be very convenient here, as we can directly access the histogram associated with each descriptor.
Training the neural network
Now that we have the histogram of visual words for each image, we can finally feed them to our neural network. But before that, we need to tell our neural network the expected output for each image. The reason is simple: a neural network, or more precisely the variant we are interested in using, called a Multilayer Perceptron, is a supervised learning algorithm. A supervised learning algorithm is one that tries to estimate a function H(x) (called the hypothesis function) that correctly maps inputs to outputs (in our case, the inputs are the images and the outputs are the class associated with each image: cat or dog).
So we need to supply the class name associated with each image (or, more precisely, with each histogram of visual words) in order to enable it to "learn" the pattern. However, a neural network doesn't understand categorical data. It works by receiving numbers in the input layer and producing numbers in the output layer, adjusting its weights so that a function (called the activation function) applied to the weighted inputs yields the expected outputs. This process is shown in the image below.
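To make this concrete: the activation we'll use later is OpenCV's symmetric sigmoid (ANN_MLP::SIGMOID_SYM), which, according to the OpenCV documentation, behaves with default parameters like y = 1.7159 * tanh(2/3 * x). A minimal sketch of it, just for intuition:

#include <cmath>
#include <iostream>

// Sketch of the symmetric sigmoid used by OpenCV's ANN_MLP with
// default parameters (per the OpenCV docs): y = 1.7159 * tanh(2/3 * x)
double sigmoidSym(double x)
{
    return 1.7159 * std::tanh((2.0 / 3.0) * x);
}

int main()
{
    for (double x : { -2.0, 0.0, 2.0 })
        std::cout << "f(" << x << ") = " << sigmoidSym(x) << std::endl;
    return 0;
}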
Since the activation function generally outputs values between 0 and 1, it's usual to encode the classes as a sequence of zeros where only one bit is set to one, and this bit is different for each class. For example, with four classes we would have four encodings:
Class A = 1000
Class B = 0100
Class C = 0010
Class D = 0001
As we only have two classes, our encoding will be:
Cat = 10
Dog = 01
opencv_ann.cpp
int main()
{
    ...
    // Filling matrices to be used by the neural network
    std::cout << "Preparing neural network..." << std::endl;
    cv::Mat trainSamples;
    cv::Mat trainResponses;
    std::set<ImageData*> uniqueMetadata(descriptorsMetadata.begin(), descriptorsMetadata.end());
    for (auto it = uniqueMetadata.begin(); it != uniqueMetadata.end(); )
    {
        ImageData* data = *it;
        cv::Mat normalizedHist;
        cv::normalize(data->bowFeatures, normalizedHist, 0, data->bowFeatures.rows,
                      cv::NORM_MINMAX, -1, cv::Mat());
        trainSamples.push_back(normalizedHist);
        trainResponses.push_back(getClassCode(classes, data->classname));
        delete *it;  // clear memory
        it++;
    }
    descriptorsMetadata.clear();
}
Notice the use of getClassCode: it's a function that turns a class name into its binary encoding. Also, pay attention to the cv::normalize call. We normalize each histogram of visual words in order to remove the bias introduced by images having different numbers of descriptors (note that bowFeatures is a single-row matrix, so bowFeatures.rows is 1 and each histogram is mapped to the [0, 1] range).
opencv_ann.cpp
/**
 * Transform a class name into an id
 */
int getClassId(const std::set<std::string>& classes, const std::string& classname)
{
    int index = 0;
    for (auto it = classes.begin(); it != classes.end(); ++it)
    {
        if (*it == classname) break;
        ++index;
    }
    return index;
}

/**
 * Get a binary code associated to a class
 */
cv::Mat getClassCode(const std::set<std::string>& classes, const std::string& classname)
{
    cv::Mat code = cv::Mat::zeros(cv::Size((int)classes.size(), 1), CV_32F);
    int index = getClassId(classes, classname);
    code.at<float>(index) = 1;
    return code;
}
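For our two classes (remember that std::set keeps its elements sorted, so "cat" comes before "dog"), this produces exactly the encoding from the table above:

std::set<std::string> classes = { "cat", "dog" };
cv::Mat catCode = getClassCode(classes, "cat");  // [1, 0]
cv::Mat dogCode = getClassCode(classes, "dog");  // [0, 1]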
And now we have the inputs and outputs for our neural network! We are finally able to train it!
/**
 * Get a trained neural network according to some inputs and outputs
 */
cv::Ptr<cv::ml::ANN_MLP> getTrainedNeuralNetwork(const cv::Mat& trainSamples,
                                                 const cv::Mat& trainResponses)
{
    int networkInputSize = trainSamples.cols;
    int networkOutputSize = trainResponses.cols;
    cv::Ptr<cv::ml::ANN_MLP> mlp = cv::ml::ANN_MLP::create();
    std::vector<int> layerSizes = { networkInputSize, networkInputSize / 2,
                                    networkOutputSize };
    mlp->setLayerSizes(layerSizes);
    mlp->setActivationFunction(cv::ml::ANN_MLP::SIGMOID_SYM);
    mlp->train(trainSamples, cv::ml::ROW_SAMPLE, trainResponses);
    return mlp;
}

int main()
{
    ...
    // Training neural network
    std::cout << "Training neural network..." << std::endl;
    start = cv::getTickCount();
    cv::Ptr<cv::ml::ANN_MLP> mlp = getTrainedNeuralNetwork(trainSamples, trainResponses);
    std::cout << "Time elapsed in minutes: " << ((double)cv::getTickCount() - start) / cv::getTickFrequency() / 60.0 << std::endl;

    // We can clear memory now
    trainSamples.release();
    trainResponses.release();
}
The getTrainedNeuralNetwork function expects to receive as input the training samples and the training responses (outputs). Inside the function, I first set two variables: networkInputSize, which is the number of columns (features) of our training samples, and networkOutputSize, which is the number of columns of our training responses. I then set layerSizes, which defines the number of layers and the number of nodes in each layer of our network. For instance, I'm creating a network that has only one hidden layer (of size networkInputSize / 2), since I think it'll be enough for our task. If you want improved accuracy, you can increase it at the cost of performance.
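For instance, a hypothetical variant with a second hidden layer (an untested sketch, just to show where you would change it) could look like this:

// Hypothetical variant of getTrainedNeuralNetwork with a second hidden
// layer; this may improve accuracy at the cost of training time.
cv::Ptr<cv::ml::ANN_MLP> getTrainedNeuralNetworkDeeper(const cv::Mat& trainSamples,
                                                       const cv::Mat& trainResponses)
{
    int networkInputSize = trainSamples.cols;
    int networkOutputSize = trainResponses.cols;
    cv::Ptr<cv::ml::ANN_MLP> mlp = cv::ml::ANN_MLP::create();
    std::vector<int> layerSizes = { networkInputSize,
                                    networkInputSize / 2,
                                    networkInputSize / 4,
                                    networkOutputSize };
    mlp->setLayerSizes(layerSizes);
    mlp->setActivationFunction(cv::ml::ANN_MLP::SIGMOID_SYM);
    mlp->train(trainSamples, cv::ml::ROW_SAMPLE, trainResponses);
    return mlp;
}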
Evaluating our network
And now the training step is DONE! Let's use our trained neural network to evaluate our test samples and measure how good it is. First, let's train a FLANN matcher on the vocabulary, so we can compute the histogram of visual words for each test sample much faster:
opencv_ann.cpp
int main()
{
    ...
    // Train FLANN
    std::cout << "Training FLANN..." << std::endl;
    start = cv::getTickCount();
    cv::FlannBasedMatcher flann;
    flann.add(vocabulary);
    flann.train();
    std::cout << "Time elapsed in minutes: " << ((double)cv::getTickCount() - start) / cv::getTickFrequency() / 60.0 << std::endl;
}
Now let’s read the test samples:
opencv_ann.cpp
int main()
{
    ...
    // Reading test set
    std::cout << "Reading test set..." << std::endl;
    start = cv::getTickCount();
    cv::Mat testSamples;
    std::vector<int> testOutputExpected;
    readImages(files.begin() + (size_t)(files.size() * trainSplitRatio), files.end(),
        [&](const std::string& classname, const cv::Mat& descriptors) {
            // Get histogram of visual words using bag of words technique
            cv::Mat bowFeatures = getBOWFeatures(flann, descriptors, networkInputSize);
            cv::normalize(bowFeatures, bowFeatures, 0, bowFeatures.rows,
                          cv::NORM_MINMAX, -1, cv::Mat());
            testSamples.push_back(bowFeatures);
            testOutputExpected.push_back(getClassId(classes, classname));
        });
    std::cout << "Time elapsed in minutes: " << ((double)cv::getTickCount() - start) / cv::getTickFrequency() / 60.0 << std::endl;
}
We instantiated two variables: testSamples (the set of histograms of visual words, one per test sample) and testOutputExpected (the expected output for each test sample; we use a number corresponding to the id of the class, obtained through the getClassId function defined previously). We then get the Bag of Words features through the getBOWFeatures function and normalize them. What we haven't defined yet is the getBOWFeatures function itself, which turns a set of local KAZE features into a histogram of visual words. Let's do it:
opencv_ann.cpp
/**
 * Turn local features into a single histogram of visual words
 * (a.k.a. bag of words features)
 */
cv::Mat getBOWFeatures(cv::FlannBasedMatcher& flann, const cv::Mat& descriptors,
                       int vocabularySize)
{
    cv::Mat outputArray = cv::Mat::zeros(cv::Size(vocabularySize, 1), CV_32F);
    std::vector<cv::DMatch> matches;
    flann.match(descriptors, matches);
    for (size_t j = 0; j < matches.size(); j++)
    {
        int visualWord = matches[j].trainIdx;
        outputArray.at<float>(visualWord)++;
    }
    return outputArray;
}
It uses FLANN's match method to find the nearest visual word for each descriptor. It then fills a histogram with the number of occurrences of each visual word. Pretty simple, right?
Now that we have the inputs and outputs for the test samples, let’s calculate a confusion matrix.
/**
 * Receives a column matrix containing the probabilities associated to
 * each class and returns the id of the column with the highest probability
 */
int getPredictedClass(const cv::Mat& predictions)
{
    float maxPrediction = predictions.at<float>(0);
    int maxPredictionIndex = 0;
    const float* ptrPredictions = predictions.ptr<float>(0);
    for (int i = 0; i < predictions.cols; i++)
    {
        float prediction = *ptrPredictions++;
        if (prediction > maxPrediction)
        {
            maxPrediction = prediction;
            maxPredictionIndex = i;
        }
    }
    return maxPredictionIndex;
}

/**
 * Get a confusion matrix from a set of test samples and their expected outputs
 */
std::vector<std::vector<int>> getConfusionMatrix(cv::Ptr<cv::ml::ANN_MLP> mlp,
    const cv::Mat& testSamples, const std::vector<int>& testOutputExpected)
{
    cv::Mat testOutput;
    mlp->predict(testSamples, testOutput);
    std::vector<std::vector<int>> confusionMatrix(2, std::vector<int>(2));
    for (int i = 0; i < testOutput.rows; i++)
    {
        int predictedClass = getPredictedClass(testOutput.row(i));
        int expectedClass = testOutputExpected.at(i);
        confusionMatrix[expectedClass][predictedClass]++;
    }
    return confusionMatrix;
}

/**
 * Print a confusion matrix on screen
 */
void printConfusionMatrix(const std::vector<std::vector<int>>& confusionMatrix,
                          const std::set<std::string>& classes)
{
    for (auto it = classes.begin(); it != classes.end(); ++it)
    {
        std::cout << *it << " ";
    }
    std::cout << std::endl;
    for (size_t i = 0; i < confusionMatrix.size(); i++)
    {
        for (size_t j = 0; j < confusionMatrix[i].size(); j++)
        {
            std::cout << confusionMatrix[i][j] << " ";
        }
        std::cout << std::endl;
    }
}

/**
 * Get the accuracy for a model (i.e., percentage of correctly predicted
 * test samples)
 */
float getAccuracy(const std::vector<std::vector<int>>& confusionMatrix)
{
    int hits = 0;
    int total = 0;
    for (size_t i = 0; i < confusionMatrix.size(); i++)
    {
        for (size_t j = 0; j < confusionMatrix.at(i).size(); j++)
        {
            if (i == j) hits += confusionMatrix.at(i).at(j);
            total += confusionMatrix.at(i).at(j);
        }
    }
    return hits / (float)total;
}

int main()
{
    ...
    // Get confusion matrix of the test set
    std::vector<std::vector<int>> confusionMatrix = getConfusionMatrix(mlp, testSamples, testOutputExpected);

    // Get accuracy of our model
    std::cout << "Confusion matrix: " << std::endl;
    printConfusionMatrix(confusionMatrix, classes);
    std::cout << "Accuracy: " << getAccuracy(confusionMatrix) << std::endl;
}
OK, a lot happened here. Let's check it step by step. First, in getConfusionMatrix, I use the MLP's predict method to predict the class of each test sample. It returns a matrix with the same number of columns as our number of classes, where each column holds the "probability" of the sample belonging to the class corresponding to that column. We then use a function called getPredictedClass, which is applied to each row of predict's output and returns the index of the column with the highest "probability". Now that we have the predicted and expected classes, we can construct our confusion matrix by simply incrementing the cell indexed by the tuple (expected, predicted).
With the confusion matrix in hand, we can easily calculate the accuracy, which is the ratio of correctly predicted samples: we simply sum the diagonal of our confusion matrix (the number of correct predictions) and divide by the sum of all of its cells (the number of test samples). For example, a hypothetical confusion matrix of [[900, 100], [150, 850]] would yield an accuracy of (900 + 850) / 2000 = 0.875.
Saving models
Finally, let’s save our models, so we can use it later on a production environment:
opencv_ann.cpp
/**
 * Save our obtained models (neural network, bag of words vocabulary
 * and class names) to use them later
 */
void saveModels(cv::Ptr<cv::ml::ANN_MLP> mlp, const cv::Mat& vocabulary,
                const std::set<std::string>& classes)
{
    mlp->save("mlp.yaml");
    cv::FileStorage fs("vocabulary.yaml", cv::FileStorage::WRITE);
    fs << "vocabulary" << vocabulary;
    fs.release();
    std::ofstream classesOutput("classes.txt");
    for (auto it = classes.begin(); it != classes.end(); ++it)
    {
        classesOutput << getClassId(classes, *it) << "\t" << *it << std::endl;
    }
    classesOutput.close();
}

int main()
{
    ...
    // Save models
    std::cout << "Saving models..." << std::endl;
    saveModels(mlp, vocabulary, classes);

    return 0;
}
The MLP object has its own saving function, called save (it also has a load method that can later be used to load a trained neural network from a file). We save the vocabulary (since we need it in order to convert local features into a histogram of visual words) to a file named "vocabulary.yaml". And, finally, we also save the class names associated with each id (so we can map the output of the neural network back to a name). That's it!
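If you later need to load the models back (the demo server at the end of this post does exactly this), a minimal sketch, assuming the files produced by saveModels exist in the working directory, would be:

#include <opencv2/core/core.hpp>
#include <opencv2/ml/ml.hpp>

int main()
{
    // Load the trained MLP and the bag of words vocabulary saved by saveModels
    cv::Ptr<cv::ml::ANN_MLP> mlp =
        cv::ml::ANN_MLP::load<cv::ml::ANN_MLP>("mlp.yaml");
    cv::Mat vocabulary;
    cv::FileStorage fs("vocabulary.yaml", cv::FileStorage::READ);
    fs["vocabulary"] >> vocabulary;
    fs.release();
    return 0;
}

The full code can be found below.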
#include <vector>
#include <algorithm>
#include <functional>
#include <map>
#include <set>
#include <fstream>
#include <iostream>
#include <cstdlib>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/ml/ml.hpp>
#include <boost/filesystem.hpp>

namespace fs = boost::filesystem;

typedef std::vector<std::string>::const_iterator vec_iter;

struct ImageData
{
    std::string classname;
    cv::Mat bowFeatures;
};

/**
 * Get all files in directory (not recursive)
 * @param directory Directory where the files are contained
 * @return A list containing the file name of all files inside given directory
 **/
std::vector<std::string> getFilesInDirectory(const std::string& directory)
{
    std::vector<std::string> files;
    fs::path root(directory);
    fs::directory_iterator it_end;
    for (fs::directory_iterator it(root); it != it_end; ++it)
    {
        if (fs::is_regular_file(it->path()))
        {
            files.push_back(it->path().string());
        }
    }
    return files;
}

/**
 * Extract the class name from a file name
 */
inline std::string getClassName(const std::string& filename)
{
    return filename.substr(filename.find_last_of('/') + 1, 3);
}

/**
 * Extract local features for an image
 */
cv::Mat getDescriptors(const cv::Mat& img)
{
    cv::Ptr<cv::KAZE> kaze = cv::KAZE::create();
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;
    kaze->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
    return descriptors;
}

/**
 * Read images from a list of file names and returns, for each read image,
 * its class name and its local descriptors
 */
void readImages(vec_iter begin, vec_iter end,
                std::function<void(const std::string&, const cv::Mat&)> callback)
{
    for (auto it = begin; it != end; ++it)
    {
        std::string filename = *it;
        std::cout << "Reading image " << filename << "..." << std::endl;
        cv::Mat img = cv::imread(filename, 0);
        if (img.empty())
        {
            std::cerr << "WARNING: Could not read image." << std::endl;
            continue;
        }
        std::string classname = getClassName(filename);
        cv::Mat descriptors = getDescriptors(img);
        callback(classname, descriptors);
    }
}

/**
 * Transform a class name into an id
 */
int getClassId(const std::set<std::string>& classes, const std::string& classname)
{
    int index = 0;
    for (auto it = classes.begin(); it != classes.end(); ++it)
    {
        if (*it == classname) break;
        ++index;
    }
    return index;
}

/**
 * Get a binary code associated to a class
 */
cv::Mat getClassCode(const std::set<std::string>& classes, const std::string& classname)
{
    cv::Mat code = cv::Mat::zeros(cv::Size((int)classes.size(), 1), CV_32F);
    int index = getClassId(classes, classname);
    code.at<float>(index) = 1;
    return code;
}

/**
 * Turn local features into a single histogram of visual words
 * (a.k.a. bag of words features)
 */
cv::Mat getBOWFeatures(cv::FlannBasedMatcher& flann, const cv::Mat& descriptors,
                       int vocabularySize)
{
    cv::Mat outputArray = cv::Mat::zeros(cv::Size(vocabularySize, 1), CV_32F);
    std::vector<cv::DMatch> matches;
    flann.match(descriptors, matches);
    for (size_t j = 0; j < matches.size(); j++)
    {
        int visualWord = matches[j].trainIdx;
        outputArray.at<float>(visualWord)++;
    }
    return outputArray;
}

/**
 * Get a trained neural network according to some inputs and outputs
 */
cv::Ptr<cv::ml::ANN_MLP> getTrainedNeuralNetwork(const cv::Mat& trainSamples,
                                                 const cv::Mat& trainResponses)
{
    int networkInputSize = trainSamples.cols;
    int networkOutputSize = trainResponses.cols;
    cv::Ptr<cv::ml::ANN_MLP> mlp = cv::ml::ANN_MLP::create();
    std::vector<int> layerSizes = { networkInputSize, networkInputSize / 2,
                                    networkOutputSize };
    mlp->setLayerSizes(layerSizes);
    mlp->setActivationFunction(cv::ml::ANN_MLP::SIGMOID_SYM);
    mlp->train(trainSamples, cv::ml::ROW_SAMPLE, trainResponses);
    return mlp;
}

/**
 * Receives a column matrix containing the probabilities associated to
 * each class and returns the id of the column with the highest probability
 */
int getPredictedClass(const cv::Mat& predictions)
{
    float maxPrediction = predictions.at<float>(0);
    int maxPredictionIndex = 0;
    const float* ptrPredictions = predictions.ptr<float>(0);
    for (int i = 0; i < predictions.cols; i++)
    {
        float prediction = *ptrPredictions++;
        if (prediction > maxPrediction)
        {
            maxPrediction = prediction;
            maxPredictionIndex = i;
        }
    }
    return maxPredictionIndex;
}

/**
 * Get a confusion matrix from a set of test samples and their expected outputs
 */
std::vector<std::vector<int>> getConfusionMatrix(cv::Ptr<cv::ml::ANN_MLP> mlp,
    const cv::Mat& testSamples, const std::vector<int>& testOutputExpected)
{
    cv::Mat testOutput;
    mlp->predict(testSamples, testOutput);
    std::vector<std::vector<int>> confusionMatrix(2, std::vector<int>(2));
    for (int i = 0; i < testOutput.rows; i++)
    {
        int predictedClass = getPredictedClass(testOutput.row(i));
        int expectedClass = testOutputExpected.at(i);
        confusionMatrix[expectedClass][predictedClass]++;
    }
    return confusionMatrix;
}

/**
 * Print a confusion matrix on screen
 */
void printConfusionMatrix(const std::vector<std::vector<int>>& confusionMatrix,
                          const std::set<std::string>& classes)
{
    for (auto it = classes.begin(); it != classes.end(); ++it)
    {
        std::cout << *it << " ";
    }
    std::cout << std::endl;
    for (size_t i = 0; i < confusionMatrix.size(); i++)
    {
        for (size_t j = 0; j < confusionMatrix[i].size(); j++)
        {
            std::cout << confusionMatrix[i][j] << " ";
        }
        std::cout << std::endl;
    }
}

/**
 * Get the accuracy for a model (i.e., percentage of correctly predicted
 * test samples)
 */
float getAccuracy(const std::vector<std::vector<int>>& confusionMatrix)
{
    int hits = 0;
    int total = 0;
    for (size_t i = 0; i < confusionMatrix.size(); i++)
    {
        for (size_t j = 0; j < confusionMatrix.at(i).size(); j++)
        {
            if (i == j) hits += confusionMatrix.at(i).at(j);
            total += confusionMatrix.at(i).at(j);
        }
    }
    return hits / (float)total;
}

/**
 * Save our obtained models (neural network, bag of words vocabulary
 * and class names) to use them later
 */
void saveModels(cv::Ptr<cv::ml::ANN_MLP> mlp, const cv::Mat& vocabulary,
                const std::set<std::string>& classes)
{
    mlp->save("mlp.yaml");
    cv::FileStorage fs("vocabulary.yaml", cv::FileStorage::WRITE);
    fs << "vocabulary" << vocabulary;
    fs.release();
    std::ofstream classesOutput("classes.txt");
    for (auto it = classes.begin(); it != classes.end(); ++it)
    {
        classesOutput << getClassId(classes, *it) << "\t" << *it << std::endl;
    }
    classesOutput.close();
}

int main(int argc, char** argv)
{
    if (argc != 4)
    {
        std::cerr << "Usage: <IMAGES_DIRECTORY> <NETWORK_INPUT_LAYER_SIZE> <TRAIN_SPLIT_RATIO>" << std::endl;
        exit(-1);
    }
    std::string imagesDir = argv[1];
    int networkInputSize = atoi(argv[2]);
    float trainSplitRatio = atof(argv[3]);

    std::cout << "Reading training set..." << std::endl;
    double start = (double)cv::getTickCount();
    std::vector<std::string> files = getFilesInDirectory(imagesDir);
    std::random_shuffle(files.begin(), files.end());

    cv::Mat descriptorsSet;
    std::vector<ImageData*> descriptorsMetadata;
    std::set<std::string> classes;
    readImages(files.begin(), files.begin() + (size_t)(files.size() * trainSplitRatio),
        [&](const std::string& classname, const cv::Mat& descriptors) {
            // Append to the set of classes
            classes.insert(classname);
            // Append to the list of descriptors
            descriptorsSet.push_back(descriptors);
            // Append metadata to each extracted feature
            ImageData* data = new ImageData;
            data->classname = classname;
            data->bowFeatures = cv::Mat::zeros(cv::Size(networkInputSize, 1), CV_32F);
            for (int j = 0; j < descriptors.rows; j++)
            {
                descriptorsMetadata.push_back(data);
            }
        });
    std::cout << "Time elapsed in minutes: " << ((double)cv::getTickCount() - start) / cv::getTickFrequency() / 60.0 << std::endl;

    std::cout << "Creating vocabulary..." << std::endl;
    start = (double)cv::getTickCount();
    cv::Mat labels;
    cv::Mat vocabulary;
    // Use k-means to find k centroids (the words of our vocabulary)
    cv::kmeans(descriptorsSet, networkInputSize, labels,
               cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::MAX_ITER, 10, 0.01),
               1, cv::KMEANS_PP_CENTERS, vocabulary);
    // No need to keep it on memory anymore
    descriptorsSet.release();
    std::cout << "Time elapsed in minutes: " << ((double)cv::getTickCount() - start) / cv::getTickFrequency() / 60.0 << std::endl;

    // Convert the set of local features of each image into a single descriptor
    // using the bag of words technique
    std::cout << "Getting histograms of visual words..." << std::endl;
    int* ptrLabels = (int*)(labels.data);
    int size = labels.rows * labels.cols;
    for (int i = 0; i < size; i++)
    {
        int label = *ptrLabels++;
        ImageData* data = descriptorsMetadata[i];
        data->bowFeatures.at<float>(label)++;
    }

    // Filling matrices to be used by the neural network
    std::cout << "Preparing neural network..." << std::endl;
    cv::Mat trainSamples;
    cv::Mat trainResponses;
    std::set<ImageData*> uniqueMetadata(descriptorsMetadata.begin(), descriptorsMetadata.end());
    for (auto it = uniqueMetadata.begin(); it != uniqueMetadata.end(); )
    {
        ImageData* data = *it;
        cv::Mat normalizedHist;
        cv::normalize(data->bowFeatures, normalizedHist, 0, data->bowFeatures.rows,
                      cv::NORM_MINMAX, -1, cv::Mat());
        trainSamples.push_back(normalizedHist);
        trainResponses.push_back(getClassCode(classes, data->classname));
        delete *it;  // clear memory
        it++;
    }
    descriptorsMetadata.clear();

    // Training neural network
    std::cout << "Training neural network..." << std::endl;
    start = cv::getTickCount();
    cv::Ptr<cv::ml::ANN_MLP> mlp = getTrainedNeuralNetwork(trainSamples, trainResponses);
    std::cout << "Time elapsed in minutes: " << ((double)cv::getTickCount() - start) / cv::getTickFrequency() / 60.0 << std::endl;

    // We can clear memory now
    trainSamples.release();
    trainResponses.release();

    // Train FLANN
    std::cout << "Training FLANN..." << std::endl;
    start = cv::getTickCount();
    cv::FlannBasedMatcher flann;
    flann.add(vocabulary);
    flann.train();
    std::cout << "Time elapsed in minutes: " << ((double)cv::getTickCount() - start) / cv::getTickFrequency() / 60.0 << std::endl;

    // Reading test set
    std::cout << "Reading test set..." << std::endl;
    start = cv::getTickCount();
    cv::Mat testSamples;
    std::vector<int> testOutputExpected;
    readImages(files.begin() + (size_t)(files.size() * trainSplitRatio), files.end(),
        [&](const std::string& classname, const cv::Mat& descriptors) {
            // Get histogram of visual words using bag of words technique
            cv::Mat bowFeatures = getBOWFeatures(flann, descriptors, networkInputSize);
            cv::normalize(bowFeatures, bowFeatures, 0, bowFeatures.rows,
                          cv::NORM_MINMAX, -1, cv::Mat());
            testSamples.push_back(bowFeatures);
            testOutputExpected.push_back(getClassId(classes, classname));
        });
    std::cout << "Time elapsed in minutes: " << ((double)cv::getTickCount() - start) / cv::getTickFrequency() / 60.0 << std::endl;

    // Get confusion matrix of the test set
    std::vector<std::vector<int>> confusionMatrix = getConfusionMatrix(mlp, testSamples, testOutputExpected);

    // Get accuracy of our model
    std::cout << "Confusion matrix: " << std::endl;
    printConfusionMatrix(confusionMatrix, classes);
    std::cout << "Accuracy: " << getAccuracy(confusionMatrix) << std::endl;

    // Save models
    std::cout << "Saving models..." << std::endl;
    saveModels(mlp, vocabulary, classes);

    return 0;
}
Not bad! Not bad at all, considering the difficulty of some images! ;)
If you are interested in what the server used to present the live demo at the beginning of this tutorial looks like, you can take a look at the source code below.
#include <vector>
#include <map>
#include <set>
#include <sstream>
#include <fstream>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <algorithm>
#include <boost/bind.hpp>
#include <boost/asio.hpp>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/ml/ml.hpp>

using boost::asio::ip::tcp;

struct Context
{
    cv::Mat vocabulary;
    cv::FlannBasedMatcher flann;
    std::map<int, std::string> classes;
    cv::Ptr<cv::ml::ANN_MLP> mlp;
};

/**
 * Extract local features for an image
 */
cv::Mat getDescriptors(const cv::Mat& img)
{
    cv::Ptr<cv::KAZE> kaze = cv::KAZE::create();
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;
    kaze->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
    return descriptors;
}

/**
 * Get a histogram of visual words for an image
 */
cv::Mat getBOWFeatures(cv::FlannBasedMatcher& flann, const cv::Mat& img,
                       int vocabularySize)
{
    cv::Mat descriptors = getDescriptors(img);
    cv::Mat outputArray = cv::Mat::zeros(cv::Size(vocabularySize, 1), CV_32F);
    std::vector<cv::DMatch> matches;
    flann.match(descriptors, matches);
    for (size_t j = 0; j < matches.size(); j++)
    {
        int visualWord = matches[j].trainIdx;
        outputArray.at<float>(visualWord)++;
    }
    return outputArray;
}

/**
 * Receives a column matrix containing the probabilities associated to
 * each class and returns the id of the column with the highest probability
 */
int getPredictedClass(const cv::Mat& predictions)
{
    float maxPrediction = predictions.at<float>(0);
    int maxPredictionIndex = 0;
    const float* ptrPredictions = predictions.ptr<float>(0);
    for (int i = 0; i < predictions.cols; i++)
    {
        float prediction = *ptrPredictions++;
        if (prediction > maxPrediction)
        {
            maxPrediction = prediction;
            maxPredictionIndex = i;
        }
    }
    return maxPredictionIndex;
}

/**
 * Get the predicted class for a sample
 */
int getClass(const cv::Mat& bowFeatures, cv::Ptr<cv::ml::ANN_MLP> mlp)
{
    cv::Mat output;
    mlp->predict(bowFeatures, output);
    return getPredictedClass(output);
}

class session
{
public:
    session(boost::asio::io_service& io_service, Context* context)
        : socket_(io_service), context(context)
    {
    }

    tcp::socket& socket()
    {
        return socket_;
    }

    void start()
    {
        socket_.async_read_some(boost::asio::buffer(data_, max_length),
            boost::bind(&session::handle_read, this,
                        boost::asio::placeholders::error,
                        boost::asio::placeholders::bytes_transferred));
    }

private:
    Context* context;
    tcp::socket socket_;
    enum { max_length = 1024 };
    char data_[max_length];

    void handle_read(const boost::system::error_code& error, size_t bytes_transferred)
    {
        if (!error)
        {
            std::string result;
            // Reading image
            std::string filename(data_, std::find(data_, data_ + bytes_transferred, '\n') - 1);
            cv::Mat img = cv::imread(filename, 0);
            if (!img.empty())
            {
                // Processing image
                cv::Mat bowFeatures = getBOWFeatures(context->flann, img, context->vocabulary.rows);
                cv::normalize(bowFeatures, bowFeatures, 0, bowFeatures.rows,
                              cv::NORM_MINMAX, -1, cv::Mat());
                int predictedClass = getClass(bowFeatures, context->mlp);
                result = context->classes[predictedClass];
            }
            else
            {
                result = "error";
            }
            memset(&data_[0], 0, sizeof(data_));
            strcpy(data_, result.c_str());
            boost::asio::async_write(socket_,
                boost::asio::buffer(data_, bytes_transferred),
                boost::bind(&session::handle_write, this,
                            boost::asio::placeholders::error));
        }
        else
        {
            delete this;
        }
    }

    void handle_write(const boost::system::error_code& error)
    {
        if (!error)
        {
            socket_.async_read_some(boost::asio::buffer(data_, max_length),
                boost::bind(&session::handle_read, this,
                            boost::asio::placeholders::error,
                            boost::asio::placeholders::bytes_transferred));
        }
        else
        {
            delete this;
        }
    }
};

class server
{
public:
    server(boost::asio::io_service& io_service, short port, Context* context)
        : io_service_(io_service),
          acceptor_(io_service, tcp::endpoint(tcp::v4(), port)),
          context(context)
    {
        start_accept();
    }

private:
    Context* context;
    boost::asio::io_service& io_service_;
    tcp::acceptor acceptor_;

    void start_accept()
    {
        session* new_session = new session(io_service_, context);
        acceptor_.async_accept(new_session->socket(),
            boost::bind(&server::handle_accept, this, new_session,
                        boost::asio::placeholders::error));
    }

    void handle_accept(session* new_session, const boost::system::error_code& error)
    {
        if (!error)
        {
            new_session->start();
        }
        else
        {
            delete new_session;
        }
        start_accept();
    }
};

int main(int argc, char** argv)
{
    if (argc != 5)
    {
        std::cerr << "Usage: <NEURAL_NETWORK_INPUT_FILENAME> <VOCABULARY_INPUT_FILENAME> <CLASSES_INPUT_FILENAME> <PORT_NUMBER>" << std::endl;
        exit(-1);
    }
    std::string neuralNetworkInputFilename(argv[1]);
    std::string vocabularyInputFilename(argv[2]);
    std::string classesInputFilename(argv[3]);
    int portNumber = atoi(argv[4]);

    std::cout << "Loading models..." << std::endl;
    double start = cv::getTickCount();
    // Reading neural network
    cv::Ptr<cv::ml::ANN_MLP> mlp = cv::ml::ANN_MLP::load<cv::ml::ANN_MLP>(neuralNetworkInputFilename);
    // Read vocabulary
    cv::Mat vocabulary;
    cv::FileStorage fs(vocabularyInputFilename, cv::FileStorage::READ);
    fs["vocabulary"] >> vocabulary;
    fs.release();
    // Reading existing classes
    std::map<int, std::string> classes;
    std::ifstream classesInput(classesInputFilename.c_str());
    std::string line;
    while (std::getline(classesInput, line))
    {
        std::stringstream ss;
        ss << line;
        int index;
        std::string classname;
        ss >> index;
        ss >> classname;
        classes[index] = classname;
    }
    std::cout << "Time elapsed in seconds: " << ((double)cv::getTickCount() - start) / cv::getTickFrequency() << std::endl;

    // Train FLANN
    std::cout << "Training FLANN..." << std::endl;
    start = cv::getTickCount();
    cv::FlannBasedMatcher flann;
    flann.add(vocabulary);
    flann.train();
    std::cout << "Time elapsed in seconds: " << ((double)cv::getTickCount() - start) / cv::getTickFrequency() << std::endl;

    // Socket initialization
    std::cout << "Listening to socket on port " << portNumber << "..." << std::endl;
    try
    {
        boost::asio::io_service io_service;
        Context* context = new Context;
        context->vocabulary = vocabulary;
        context->flann = flann;
        context->classes = classes;
        context->mlp = mlp;
        server s(io_service, portNumber, context);
        io_service.run();
        delete context;
    }
    catch (std::exception& e)
    {
        std::cerr << "Exception: " << e.what() << std::endl;
        return -1;
    }
    return 0;
}