In this tutorial, I will show you how to control your mouse using only a simple webcam. Nothing fancy, and super simple to implement. Let’s get to it!
First things first. We are going to use OpenCV, an open-source computer vision library. You can find instructions on how to set it up here.
Reading the webcam
Let’s adopt a baby-steps approach. The very first thing we need is to read the webcam image itself. You can do that through the VideoCapture class in the OpenCV highgui module. VideoCapture takes one parameter: the webcam index or a path to a video file.
eye_detector.cpp
#include <iostream>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/objdetect/objdetect.hpp>

int main()
{
    cv::VideoCapture cap(0); // the first webcam connected to your PC
    if (!cap.isOpened())
    {
        std::cerr << "Webcam not detected." << std::endl;
        return -1;
    }
    cv::Mat frame;
    while (1)
    {
        cap >> frame;                // outputs the webcam image to a Mat
        cv::imshow("Webcam", frame); // displays the Mat
        if (cv::waitKey(30) >= 0) break; // waits ~30 ms between frames; stops if the user presses any key
    }
    return 0;
}
I took the liberty of including some OpenCV modules beyond the strictly necessary ones, because we are going to need them later.
Now you can see that it’s displaying the webcam image. That’s something!
Now let’s get into the computer vision stuff!
Face and eye detection with the Viola-Jones algorithm (Theory)
Here’s a bit of theory (you can skip to the next section if you’re not interested). Humans can detect a face very easily, but computers cannot. When an image is presented to a computer, all it “sees” is a matrix of numbers. So, given that matrix, how can it predict whether or not it represents a face? Answer: by building probability distributions from thousands of samples of faces and non-faces. Building those probability distributions is precisely the role of a classifier.

But here’s the thing: a regular image is composed of thousands of pixels. Even a small 28x28 image has 784 pixels, and each pixel can assume 256 values (in an 8-bit grayscale representation). That’s 256^784 possible images. Wow! Estimating probability distributions over that many variables is simply not feasible. This is where the Viola-Jones algorithm kicks in: it extracts much simpler representations of the image and combines them hierarchically into higher-level representations, making the problem at the highest level of representation far simpler than it would be on the raw image. Let’s go through the steps of the algorithm.
Haar-like feature extraction
We have some primitive “masks”, as shown below:
Those masks are slid over the image, and the sum of the pixel values under the “white” side is subtracted from the sum under the “black” side. The result is a feature that represents that region (a whole region summarized in a single number).
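In practice this subtraction isn’t done pixel by pixel: Viola-Jones precomputes an integral image, so any rectangle sum costs only four lookups. The tutorial doesn’t cover that part, so here’s a minimal sketch of my own (the function names are mine, not OpenCV’s) of how a two-rectangle feature can be computed in constant time:

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

// Sum of the pixels inside rect `r`, in O(1), given an integral image.
// `integral` is the (rows+1)x(cols+1) CV_32S matrix produced by cv::integral.
int regionSum(const cv::Mat &integral, const cv::Rect &r)
{
    return integral.at<int>(r.y, r.x)
         + integral.at<int>(r.y + r.height, r.x + r.width)
         - integral.at<int>(r.y, r.x + r.width)
         - integral.at<int>(r.y + r.height, r.x);
}

// A two-rectangle Haar-like feature: "white" region sum minus "black" region sum.
int haarTwoRect(const cv::Mat &integral, const cv::Rect &white, const cv::Rect &black)
{
    return regionSum(integral, white) - regionSum(integral, black);
}

// Usage: cv::Mat integralImg; cv::integral(grayscaleFrame, integralImg);
//        int feature = haarTwoRect(integralImg, whiteRect, blackRect);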
Weak classifiers
The next step is to train many simple classifiers, one for each kind of mask. Those simple classifiers work as follows: take all the features (extracted with the corresponding mask) inside face regions and all the features outside face regions, and label them as “face” or “non-face” (two classes). The classifier then learns to distinguish features belonging to a face region from features belonging to a non-face region through a simple threshold function (i.e., face features generally lie above or below a certain value; otherwise it’s a non-face). By itself, such a classifier is very bad, barely better than random guessing. But combined, they give rise to a much stronger classifier (weak classifiers, unite!)
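To make that concrete, here is a rough sketch of my own (not OpenCV code) of such a threshold function, often called a decision stump:

// Hypothetical decision stump: the simplest possible weak classifier.
// `polarity` decides which side of the threshold counts as "face".
struct WeakClassifier
{
    float threshold;
    int polarity; // +1: face if feature >= threshold; -1: face if feature <= threshold

    int predict(float featureValue) const
    {
        return (polarity * featureValue >= polarity * threshold) ? 1 : 0;
    }
};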
Cascading classifiers
Given a region, we can submit it to many weak classifiers, as shown above. Each weak classifier outputs a number: 1 if it predicts the region belongs to a face, 0 otherwise. Each output can be weighted, and the weighted sum of all the weak classifiers’ outputs is itself another feature that can, in turn, be fed to yet another classifier. That new classifier is said to be a linear combination of the other classifiers, and its role is to determine the weight values such that the error is as small as possible.
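As a sketch of that combination (again my own illustrative code, reusing the WeakClassifier struct above), an AdaBoost-style strong classifier is just a weighted vote:

#include <vector>

// Predict "face" (1) if the weighted vote of the weak classifiers
// exceeds half of the total weight, "non-face" (0) otherwise.
int strongPredict(const std::vector<WeakClassifier> &weak,
                  const std::vector<float> &weights,
                  const std::vector<float> &features)
{
    float sum = 0, total = 0;
    for (size_t i = 0; i < weak.size(); i++)
    {
        sum += weights[i] * weak[i].predict(features[i]);
        total += weights[i];
    }
    return (sum >= 0.5f * total) ? 1 : 0;
}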
What about eyes?
Well, eyes follow the same principle as face detection. But if we already have a trained face detector, the problem becomes slightly simpler, since the eyes will always be located inside the face region, which dramatically reduces our search space.
Face and eye detection with the Viola-Jones algorithm (Practice)
Thankfully, the above algorithm is already implemented in OpenCV, and a classifier has already been trained for us on thousands and thousands of faces!
Let’s start by reading the trained models. You can download them here. Put them in the same directory as the .cpp file.
eye_detector.cpp
int main()
{
    cv::CascadeClassifier faceCascade;
    cv::CascadeClassifier eyeCascade;
    if (!faceCascade.load("./haarcascade_frontalface_alt.xml"))
    {
        std::cerr << "Could not load face detector." << std::endl;
        return -1;
    }
    if (!eyeCascade.load("./haarcascade_eye_tree_eyeglasses.xml"))
    {
        std::cerr << "Could not load eye detector." << std::endl;
        return -1;
    }
    ...
}
Now let’s modify our loop to include a call to a function named detectEyes:
void detectEyes(cv::Mat &frame, cv::CascadeClassifier &faceCascade, cv::CascadeClassifier &eyeCascade)
{
    cv::Mat grayscale;
    cv::cvtColor(frame, grayscale, CV_BGR2GRAY); // convert image to grayscale
    cv::equalizeHist(grayscale, grayscale);      // enhance image contrast
    std::vector<cv::Rect> faces;
    faceCascade.detectMultiScale(grayscale, faces, 1.1, 2, 0 | CV_HAAR_SCALE_IMAGE, cv::Size(150, 150));
}
Let’s take a break to explain the detectMultiScale method. It takes the following arguments (an annotated version of the call follows the list):
inputImage: The input image
faces: A vector of rects where the faces were detected
scaleFactor: The classifier will try the image at several scales, resizing it by this factor at each step (1.1 in the case above). This helps detect faces of different sizes more accurately.
minNeighbors: How many true-positive neighbor rectangles do you want to require before predicting a region as a face? The higher this value, the lower the chance of detecting a non-face as a face, but also the lower the chance of detecting an actual face.
flags: Some flags. In the case above, CV_HAAR_SCALE_IMAGE tells the classifier to scale the image rather than its features.
minSize: The minimum size a face can have in our image. A poor-quality webcam produces 640x480 frames, so 150x150 is more than enough to cover a face in them.
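To recap, here is the exact call from above with each argument labeled:

faceCascade.detectMultiScale(
    grayscale,               // inputImage
    faces,                   // output vector of detected face rectangles
    1.1,                     // scaleFactor
    2,                       // minNeighbors
    0 | CV_HAAR_SCALE_IMAGE, // flags
    cv::Size(150, 150));     // minSize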
Let’s proceed. Now we have the faces detected in the vector faces. What to do next? Eye detection!
eye_detector.cpp
void detectEyes(...)
{
    ...
    if (faces.size() == 0) return;      // no face was detected
    cv::Mat face = grayscale(faces[0]); // crop the face
    std::vector<cv::Rect> eyes;
    eyeCascade.detectMultiScale(face, eyes, 1.1, 2, 0 | CV_HAAR_SCALE_IMAGE, cv::Size(30, 30)); // same as above, but with a smaller minimum size, since eyes are much smaller than faces
}
Now we have both face and eyes detected. Let’s just test it by drawing the regions where they were detected:
eye_detector.cpp
void detectEyes(...)
{
    ...
    rectangle(frame, faces[0].tl(), faces[0].br(), cv::Scalar(255, 0, 0), 2);
    if (eyes.size() != 2) return; // both eyes were not detected
    for (cv::Rect &eye : eyes)
    {
        rectangle(frame, faces[0].tl() + eye.tl(), faces[0].tl() + eye.br(), cv::Scalar(0, 255, 0), 2);
    }
}
Looking good so far!
Detecting the iris
Now that we have detected the eyes, the next step is to detect the iris. For that, we are going to look for the most “circular” object in the eye region. Luckily, there is already a function in OpenCV that does just that! It’s called HoughCircles, and it works as follows: it first applies an edge detector (Canny) to the image; then each edge point votes, in an accumulator, for the circle centers it could belong to, and the centers that collect enough votes are returned as circles.
First, we are going to choose one of the eyes to detect the iris in. I’m going to choose the leftmost one.
The getLeftmostEye function simply returns the rect whose top-left corner is leftmost. Nothing serious.
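Here it is, for reference (it also appears in the full listing at the end):

// Returns the eye rect whose top-left corner is leftmost in the face region.
cv::Rect getLeftmostEye(std::vector<cv::Rect> &eyes)
{
    int leftmost = 99999999;
    int leftmostIndex = -1;
    for (int i = 0; i < eyes.size(); i++)
    {
        if (eyes[i].tl().x < leftmost)
        {
            leftmost = eyes[i].tl().x;
            leftmostIndex = i;
        }
    }
    return eyes[leftmostIndex];
}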
After getting the leftmost eye, I’m going to crop it, apply histogram equalization to enhance contrast, and then use the HoughCircles function to find the circles in the image.
eye_detector.cpp
void detectEyes(...)
{
    ...
    cv::Mat eye = face(eyeRect); // crop the leftmost eye
    cv::equalizeHist(eye, eye);
    std::vector<cv::Vec3f> circles;
    cv::HoughCircles(eye, circles, CV_HOUGH_GRADIENT, 1, eye.cols / 8, 250, 15, eye.rows / 8, eye.rows / 3);
}
Let’s take a closer look at what the HoughCircles function expects (an annotated version of the call follows the list):
inputImage: The input image
circles: The circles that it found
method: Method to be applied
dp: Inverse ratio of the accumulator resolution
minDist: Minimum distance between the centers of detected circles
threshold: Threshold of the edge detector
accumulatorThreshold: The accumulator threshold for circle centers. The smaller it is, the more (possibly false) circles are detected
minRadius: What’s the min radius of a circle in the image?
maxRadius: What’s the max radius of a circle in the image?
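And here is the call from above with each argument labeled:

cv::HoughCircles(
    eye,               // inputImage (grayscale eye crop)
    circles,           // output vector of (x, y, radius) triples
    CV_HOUGH_GRADIENT, // method
    1,                 // dp
    eye.cols / 8,      // minDist
    250,               // Canny edge threshold
    15,                // accumulator threshold
    eye.rows / 8,      // minRadius
    eye.rows / 3);     // maxRadius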
Well, that’s it… As the plural name suggests, the function can detect many circles, but we just want one. So let’s select the one belonging to the eyeball. For that, I chose a very stupid heuristic: choose the circle that contains the most “dark” pixels! In other words, the circle for which the sum of the pixel values within it is minimal.
To know whether a pixel is inside a circle or not, we just test whether the Euclidean distance between the pixel’s location and the circle’s center is no greater than the circle’s radius. Piece of cake.
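Here’s the helper that implements this heuristic (it also appears in the full listing at the end):

// Returns the circle whose enclosed pixels have the smallest total intensity
// (i.e., the darkest circle), which we take to be the eyeball/iris.
cv::Vec3f getEyeball(cv::Mat &eye, std::vector<cv::Vec3f> &circles)
{
    std::vector<int> sums(circles.size(), 0);
    for (int y = 0; y < eye.rows; y++)
    {
        uchar *ptr = eye.ptr<uchar>(y);
        for (int x = 0; x < eye.cols; x++)
        {
            int value = static_cast<int>(*ptr);
            for (int i = 0; i < circles.size(); i++)
            {
                cv::Point center((int)std::round(circles[i][0]), (int)std::round(circles[i][1]));
                int radius = (int)std::round(circles[i][2]);
                // the pixel is inside circle i if its squared distance to the
                // center is less than the squared radius
                if (std::pow(x - center.x, 2) + std::pow(y - center.y, 2) < std::pow(radius, 2))
                {
                    sums[i] += value;
                }
            }
            ++ptr;
        }
    }
    int smallestSum = 9999999;
    int smallestSumIndex = -1;
    for (int i = 0; i < circles.size(); i++)
    {
        if (sums[i] < smallestSum)
        {
            smallestSum = sums[i];
            smallestSumIndex = i;
        }
    }
    return circles[smallestSumIndex];
}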
Good, now we supposedly have the iris. However, HoughCircles is quite unstable, so the detected iris location can jump around a lot! We need to stabilize it to get better results. To do that, we simply take the mean of the last five detected iris locations.
std::vector<cv::Point> centers;

cv::Point stabilize(std::vector<cv::Point> &points, int windowSize)
{
    float sumX = 0;
    float sumY = 0;
    int count = 0;
    for (int i = std::max(0, (int)points.size() - windowSize); i < points.size(); i++)
    {
        sumX += points[i].x;
        sumY += points[i].y;
        ++count;
    }
    if (count > 0)
    {
        sumX /= count;
        sumY /= count;
    }
    return cv::Point(sumX, sumY);
}

void detectEyes(...)
{
    ...
    if (circles.size() > 0)
    {
        cv::Vec3f eyeball = getEyeball(eye, circles);
        cv::Point center(eyeball[0], eyeball[1]);
        centers.push_back(center);
        center = stabilize(centers, 5); // average over the last 5 detections
    }
}
Finally, let’s draw the iris location and test it!
cv::Point lastPoint;
cv::Point mousePoint;

void detectEyes(...)
{
    ...
    if (circles.size() > 0)
    {
        ...
        if (centers.size() > 1)
        {
            cv::Point diff;
            diff.x = (center.x - lastPoint.x) * 20;
            diff.y = (center.y - lastPoint.y) * -30; // the y factor is higher because it's "harder" to move the eyeball up/down than left/right
            mousePoint += diff;
        }
        lastPoint = center;
    }
}

void changeMouse(cv::Mat &frame, cv::Point &location)
{
    if (location.x > frame.cols) location.x = frame.cols;
    if (location.x < 0) location.x = 0;
    if (location.y > frame.rows) location.y = frame.rows;
    if (location.y < 0) location.y = 0;
    system(("xdotool mousemove " + std::to_string(location.x) + " " + std::to_string(location.y)).c_str());
}

int main(...)
{
    ...
    while (1)
    {
        ...
        detectEyes(...);
        changeMouse(frame, mousePoint);
        ...
    }
    return 0;
}
As you can see, I’m taking the difference between the current iris position and the previous one. Of course, this is not the best option. Ideally, we would estimate the “gaze direction” from the difference between the current iris position and the “rested” iris position. I’ll leave that for you to implement! It’s not that hard.
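If you want a starting point, here is a hypothetical sketch of that idea (none of these names exist in the code above): calibrate a “rested” iris position once, then map the displacement from it to an absolute screen position.

// Hypothetical sketch: map gaze displacement to an absolute screen position.
cv::Point restPoint; // iris center captured while the user looks straight ahead

cv::Point gazeToScreen(const cv::Point &center, int screenW, int screenH)
{
    cv::Point diff = center - restPoint;         // displacement from the rested position
    return cv::Point(screenW / 2 + diff.x * 20,  // same horizontal gain as before
                     screenH / 2 - diff.y * 30); // inverted y, larger vertical gain
}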
Here’s the complete code:

#include <iostream>
#include <string>
#include <cmath>
#include <cstdlib>
#include <vector>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/objdetect/objdetect.hpp>

cv::Vec3f getEyeball(cv::Mat &eye, std::vector<cv::Vec3f> &circles)
{
    std::vector<int> sums(circles.size(), 0);
    for (int y = 0; y < eye.rows; y++)
    {
        uchar *ptr = eye.ptr<uchar>(y);
        for (int x = 0; x < eye.cols; x++)
        {
            int value = static_cast<int>(*ptr);
            for (int i = 0; i < circles.size(); i++)
            {
                cv::Point center((int)std::round(circles[i][0]), (int)std::round(circles[i][1]));
                int radius = (int)std::round(circles[i][2]);
                if (std::pow(x - center.x, 2) + std::pow(y - center.y, 2) < std::pow(radius, 2))
                {
                    sums[i] += value;
                }
            }
            ++ptr;
        }
    }
    int smallestSum = 9999999;
    int smallestSumIndex = -1;
    for (int i = 0; i < circles.size(); i++)
    {
        if (sums[i] < smallestSum)
        {
            smallestSum = sums[i];
            smallestSumIndex = i;
        }
    }
    return circles[smallestSumIndex];
}

cv::Rect getLeftmostEye(std::vector<cv::Rect> &eyes)
{
    int leftmost = 99999999;
    int leftmostIndex = -1;
    for (int i = 0; i < eyes.size(); i++)
    {
        if (eyes[i].tl().x < leftmost)
        {
            leftmost = eyes[i].tl().x;
            leftmostIndex = i;
        }
    }
    return eyes[leftmostIndex];
}

std::vector<cv::Point> centers;
cv::Point lastPoint;
cv::Point mousePoint;

cv::Point stabilize(std::vector<cv::Point> &points, int windowSize)
{
    float sumX = 0;
    float sumY = 0;
    int count = 0;
    for (int i = std::max(0, (int)points.size() - windowSize); i < points.size(); i++)
    {
        sumX += points[i].x;
        sumY += points[i].y;
        ++count;
    }
    if (count > 0)
    {
        sumX /= count;
        sumY /= count;
    }
    return cv::Point(sumX, sumY);
}

void detectEyes(cv::Mat &frame, cv::CascadeClassifier &faceCascade, cv::CascadeClassifier &eyeCascade)
{
    cv::Mat grayscale;
    cv::cvtColor(frame, grayscale, CV_BGR2GRAY); // convert image to grayscale
    cv::equalizeHist(grayscale, grayscale);      // enhance image contrast
    std::vector<cv::Rect> faces;
    faceCascade.detectMultiScale(grayscale, faces, 1.1, 2, 0 | CV_HAAR_SCALE_IMAGE, cv::Size(150, 150));
    if (faces.size() == 0) return;      // no face was detected
    cv::Mat face = grayscale(faces[0]); // crop the face
    std::vector<cv::Rect> eyes;
    eyeCascade.detectMultiScale(face, eyes, 1.1, 2, 0 | CV_HAAR_SCALE_IMAGE, cv::Size(30, 30)); // same as above, with a smaller minimum size
    rectangle(frame, faces[0].tl(), faces[0].br(), cv::Scalar(255, 0, 0), 2);
    if (eyes.size() != 2) return; // both eyes were not detected
    for (cv::Rect &eye : eyes)
    {
        rectangle(frame, faces[0].tl() + eye.tl(), faces[0].tl() + eye.br(), cv::Scalar(0, 255, 0), 2);
    }
    cv::Rect eyeRect = getLeftmostEye(eyes);
    cv::Mat eye = face(eyeRect); // crop the leftmost eye
    cv::equalizeHist(eye, eye);
    std::vector<cv::Vec3f> circles;
    cv::HoughCircles(eye, circles, CV_HOUGH_GRADIENT, 1, eye.cols / 8, 250, 15, eye.rows / 8, eye.rows / 3);
    if (circles.size() > 0)
    {
        cv::Vec3f eyeball = getEyeball(eye, circles);
        cv::Point center(eyeball[0], eyeball[1]);
        centers.push_back(center);
        center = stabilize(centers, 5);
        if (centers.size() > 1)
        {
            cv::Point diff;
            diff.x = (center.x - lastPoint.x) * 20;
            diff.y = (center.y - lastPoint.y) * -30;
            mousePoint += diff;
        }
        lastPoint = center;
        int radius = (int)eyeball[2];
        cv::circle(frame, faces[0].tl() + eyeRect.tl() + center, radius, cv::Scalar(0, 0, 255), 2);
        cv::circle(eye, center, radius, cv::Scalar(255, 255, 255), 2);
    }
    cv::imshow("Eye", eye);
}

void changeMouse(cv::Mat &frame, cv::Point &location)
{
    if (location.x > frame.cols) location.x = frame.cols;
    if (location.x < 0) location.x = 0;
    if (location.y > frame.rows) location.y = frame.rows;
    if (location.y < 0) location.y = 0;
    system(("xdotool mousemove " + std::to_string(location.x) + " " + std::to_string(location.y)).c_str());
}

int main(int argc, char **argv)
{
    if (argc != 2)
    {
        std::cerr << "Usage: EyeDetector <WEBCAM_INDEX>" << std::endl;
        return -1;
    }
    cv::CascadeClassifier faceCascade;
    cv::CascadeClassifier eyeCascade;
    if (!faceCascade.load("./haarcascade_frontalface_alt.xml"))
    {
        std::cerr << "Could not load face detector." << std::endl;
        return -1;
    }
    if (!eyeCascade.load("./haarcascade_eye_tree_eyeglasses.xml"))
    {
        std::cerr << "Could not load eye detector." << std::endl;
        return -1;
    }
    cv::VideoCapture cap(std::stoi(argv[1])); // open the webcam index passed on the command line
    if (!cap.isOpened())
    {
        std::cerr << "Webcam not detected." << std::endl;
        return -1;
    }
    cv::Mat frame;
    mousePoint = cv::Point(800, 800);
    while (1)
    {
        cap >> frame; // outputs the webcam image to a Mat
        if (!frame.data) break;
        detectEyes(frame, faceCascade, eyeCascade);
        changeMouse(frame, mousePoint);
        cv::imshow("Webcam", frame);     // displays the Mat
        if (cv::waitKey(30) >= 0) break; // waits ~30 ms between frames; stops if the user presses any key
    }
    return 0;
}