ABSTRACT
Music plays a very important role in people's daily lives. Listeners want music that matches their individual taste, which often depends on their mood, yet they usually have to browse their music manually and build a playlist to suit their current mood. The proposed project generates a music playlist automatically based on the user's current mood. Facial expressions are among the most direct indicators of a person's current mood, so the objective of this project is to suggest songs based on mood by capturing facial expressions. Facial expressions are captured through a webcam and fed into a learning algorithm, a CNN, which outputs the most probable emotion. Once the CNN has recognized the emotion, it is passed to the Spotify API, which generates a playlist matching that emotion, saving the user a lot of time.
Keywords: Face detection, Emotion recognition, Webcam, CNN classification, Spotify API, Music Playlist.
INTRODUCTION
Music plays an important role in our daily life. Users have to face the task of manually browsing the music.
Computer vision is a field of study concerned with how computers see and understand digital images and videos. It involves sensing a visual stimulus, making sense of what has been seen, and extracting complex information that can be used for other machine learning activities.
We implement our use case using the Haar cascade classifier, an effective object detection approach proposed by Paul Viola and Michael Jones in their 2001 paper, “Rapid Object Detection using a Boosted Cascade of Simple Features”. This project recognizes the user's facial expressions and plays songs according to the detected emotion. Facial expressions are among the best ways of expressing a person's mood. The facial expressions are captured using a webcam, and face detection is done using the Haar cascade classifier. The captured image is input to a CNN, which learns features; these features are analyzed to determine the user's current emotion, and music is then played according to that emotion.
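The “simple features” in the Viola and Jones paper are rectangle (Haar-like) features, which the paper evaluates quickly with an integral image. The following is a minimal pure-Python sketch of that idea on a toy patch, not the detector used in this project; the function names and the 4x4 patch are assumptions made for illustration.

```python
def integral_image(img):
    """Return the summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: right half minus left half."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return right_sum - left_sum


# Toy 4x4 grayscale patch (made up): dark left half, bright right half.
patch = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
ii = integral_image(patch)
edge_response = two_rect_feature(ii, 0, 0, 4, 4)
print(edge_response)
```

A large response indicates a strong dark-to-bright vertical edge inside the window; a cascade combines many such features, evaluated at many positions and scales.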
In this project, five emotions are considered for classification: happy, sad, anger, surprise, and neutral. The project consists of four modules: face detection, feature extraction, emotion detection, and song classification. Face detection is done by the Haar cascade classifier; feature extraction and emotion detection are done by the CNN. Finally, songs are played according to the recognized emotion.
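The last step, going from the classifier's output to a song choice, can be sketched as follows. This is only an illustration: the score vector stands in for the CNN output, the search terms in PLAYLIST_QUERY are hypothetical, and no real Spotify API call is made.

```python
# The five emotions considered in this project.
EMOTIONS = ["happy", "sad", "anger", "surprise", "neutral"]

# Hypothetical mapping from emotion to a playlist search term (illustrative only).
PLAYLIST_QUERY = {
    "happy": "upbeat pop",
    "sad": "soft acoustic",
    "anger": "calming instrumental",
    "surprise": "energetic dance",
    "neutral": "easy listening",
}


def most_probable_emotion(scores):
    """Return the emotion whose score is highest (one score per emotion)."""
    if len(scores) != len(EMOTIONS):
        raise ValueError("expected one score per emotion")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return EMOTIONS[best_index]


def playlist_query_for(scores):
    """Map classifier scores to (emotion, search term) for the music service."""
    emotion = most_probable_emotion(scores)
    return emotion, PLAYLIST_QUERY[emotion]


# Made-up scores in the order of EMOTIONS; "sad" has the highest score here.
emotion, query = playlist_query_for([0.10, 0.65, 0.05, 0.05, 0.15])
print(emotion, query)
```

In a full system, the returned search term would be handed to the music service to fetch or build the playlist.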
A Convolutional Neural Network (CNN) is a specific type of artificial neural network that is widely used for image classification. A CNN is a deep learning model for processing data that has a grid pattern, such as images; it is inspired by the organization of the animal visual cortex and designed to automatically and adaptively learn spatial hierarchies of features, from low- to high-level patterns. A CNN is a mathematical construct that is typically composed of three types of layers (or building blocks): convolution, pooling, and fully connected layers. The first two, convolution and pooling layers, perform feature extraction, whereas the third, a fully connected layer, maps the extracted features into the final output, such as a classification.
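To make the roles of the pooling and fully connected layers concrete, here is a small pure-Python sketch. It assumes a feature map has already been produced by a convolution layer; the feature map values, the weights, and the two-class output are made up for illustration and are not the network used in this project.

```python
import math


def max_pool_2x2(feature_map):
    """Downsample by taking the maximum of each non-overlapping 2x2 block."""
    pooled = []
    for y in range(0, len(feature_map) - 1, 2):
        row = []
        for x in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(
                feature_map[y][x], feature_map[y][x + 1],
                feature_map[y + 1][x], feature_map[y + 1][x + 1],
            ))
        pooled.append(row)
    return pooled


def fully_connected(inputs, weights, biases):
    """One output per weight row: weighted sum of the inputs plus a bias."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up 4x4 feature map standing in for the output of a convolution layer.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 2],
    [2, 2, 3, 4],
]
pooled = max_pool_2x2(feature_map)
flat = [v for row in pooled for v in row]

# Made-up weights for a two-class output layer.
weights = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.5]]
biases = [0.0, 0.0]
logits = fully_connected(flat, weights, biases)
probs = softmax(logits)
print(pooled, logits, probs)
```

Pooling shrinks the 4x4 map to 2x2 while keeping the strongest responses, and the fully connected layer turns the flattened values into one score per class.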
A convolution layer plays a key role in a CNN. It is composed of a stack of mathematical operations, such as convolution, a specialized type of linear operation. In digital images, pixel values are stored in a two-dimensional (2D) grid, i.e., an array of numbers, and a small grid of parameters called a kernel, an optimizable feature extractor, is applied at each image position. This makes CNNs highly efficient for image processing, since a feature may occur anywhere in the image. As one layer feeds its output into the next layer, extracted features can hierarchically and progressively become more complex. The process of optimizing parameters such as kernels is called training, which is performed so as to minimize the difference between outputs and ground truth labels through optimization algorithms such as backpropagation and gradient descent.
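Applying a kernel at each image position can be sketched in a few lines of pure Python. The image and the hand-picked edge kernel below are illustrative assumptions; in a real CNN the kernel values are learned during training.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over every position where it fits fully (no padding).

    As in most deep learning libraries, the kernel is not flipped, so this is
    technically cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            total = 0
            for ky in range(kh):
                for kx in range(kw):
                    total += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(total)
        output.append(row)
    return output


# Made-up 3x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hand-picked 2x2 kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d_valid(image, kernel)
print(feature_map)
```

The same kernel is reused at every position, so the edge is detected wherever it occurs in the image; the output is nonzero only in the column where the brightness changes.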
Applications of Computer Vision:
- Autonomous Vehicles.
- Facial Recognition.
- Image Search and Object Recognition.
Advantages of Computer Vision:
- Faster and simpler process
- Better products and services
- Cost-reduction
Disadvantages of Computer Vision:
- Lack of specialists
- Need for regular monitoring
Applications of CNN:
- Decoding Facial Recognition.
- Analyzing Documents.
- Historic and Environmental Collections.
- Understanding Climate.
- Advertising.
Advantages of CNN:
- Processing speed.
- Flexibility.
- Versatile in nature.
- Dynamic Behaviour.
- Robustness.
Disadvantages of CNN:
- CNNs do not encode the position and orientation of objects.
- Lack of ability to be spatially invariant to the input data.
- Lots of training data is required.
Adaptive Boosting (AdaBoost)
The AdaBoost (Adaptive Boosting) algorithm is a machine learning algorithm for selecting the best subset of features among all available features. The output of the algorithm is a classifier (prediction function, hypothesis function) called a “strong classifier”. A strong classifier is a linear combination of “weak classifiers” (the best features). At a high level, to find these weak classifiers the algorithm runs for T iterations, where T is the number of weak classifiers to find and is chosen by the user. In each iteration, the algorithm computes the error rate of every feature and then chooses the feature with the lowest error rate for that iteration.
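The selection loop described above can be sketched as follows. This is the textbook discrete AdaBoost update, which is not necessarily the exact variant used in the Viola and Jones detector; each feature is assumed to already output +1 or -1 per sample, and the toy data and function names are made up for illustration.

```python
import math


def adaboost(feature_outputs, labels, rounds):
    """Pick one feature per round; return a list of (feature_index, weight)."""
    n = len(labels)
    sample_weights = [1.0 / n] * n
    chosen = []
    for _ in range(rounds):
        # Weighted error rate of every feature under the current sample weights.
        errors = [
            sum(w for w, out, y in zip(sample_weights, outputs, labels) if out != y)
            for outputs in feature_outputs
        ]
        best = min(range(len(errors)), key=lambda j: errors[j])
        err = min(max(errors[best], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        chosen.append((best, alpha))
        # Increase the weight of misclassified samples, decrease the rest.
        sample_weights = [
            w * math.exp(-alpha * y * out)
            for w, y, out in zip(sample_weights, labels, feature_outputs[best])
        ]
        total = sum(sample_weights)
        sample_weights = [w / total for w in sample_weights]
    return chosen


def strong_classify(chosen, feature_outputs, i):
    """Sign of the weighted vote of the chosen weak classifiers on sample i."""
    score = sum(alpha * feature_outputs[j][i] for j, alpha in chosen)
    return 1 if score >= 0 else -1


# Toy data: six samples, three features that each make different mistakes.
labels = [1, 1, 1, -1, -1, -1]
feature_outputs = [
    [1, 1, -1, -1, -1, -1],   # wrong on sample 2
    [1, 1, 1, 1, -1, -1],     # wrong on sample 3
    [-1, 1, 1, -1, -1, 1],    # wrong on samples 0 and 5
]
chosen = adaboost(feature_outputs, labels, rounds=3)
predictions = [strong_classify(chosen, feature_outputs, i) for i in range(len(labels))]
print(chosen, predictions)
```

Here `rounds` plays the role of T. On this toy data no single feature classifies every sample correctly, but the weighted vote of the three chosen features does.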
The algorithm learns from the images we supply and is able to determine the false positives and true negatives in the data, allowing it to become more accurate. We would get a highly accurate model once we have looked at all possible positions and combinations of those features. Training can be very time-consuming because of all the different possibilities and combinations that have to be checked for every single frame or image.
Suppose we have an equation for our features that determines the success rate, F(x) = a1 f1(x) + a2 f2(x) + a3 f3(x), with f1, f2 and f3 as the features and a1, a2 and a3 as their respective weights. Each feature is known as a weak classifier, and the left side of the equation, F(x), is called a strong classifier. Since one weak classifier alone may not perform well, we get a strong classifier by combining two or three weak classifiers. As more are added, it gets stronger and stronger. This is called an ensemble.