ABSTRACT
Stream clustering methods have been used repeatedly for spam filtering, categorizing incoming messages or tweets into spam and non-spam clusters. These methods assume that each cluster consists of a number of neighboring small (micro) clusters, each with a symmetric distribution. This assumption does not necessarily hold, however, and large micro-clusters may have asymmetric distributions. To improve the assignment accuracy of earlier methods in their online phase, we suggest replacing the Euclidean distance with a set of classifiers that assign each incoming sample to the most relevant micro-cluster, whatever its distribution. Specifically, an incremental Naïve Bayes (INB) classifier is trained for each micro-cluster whose population exceeds a threshold. These INBs can capture both the mean and the boundary of a micro-cluster, whereas the Euclidean distance considers only the cluster mean and is therefore inaccurate for large asymmetric micro-clusters. In this paper, DenStream is extended with the proposed framework, and the result is called INB-DenStream. To show the effectiveness of INB-DenStream, state-of-the-art methods such as DenStream, StreamKM++, and CluStream were applied to the Twitter datasets, and performance was measured in terms of purity, general precision, general recall, F1 measure, parameter sensitivity, and computational complexity. The results indicated that our method outperforms its rivals on almost all of the datasets.
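The difference between assigning by Euclidean distance and assigning by a per-cluster classifier can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes one Gaussian Naive Bayes model per micro-cluster, updated incrementally with running means and variances, and it omits the population threshold. The class name, function names, and toy points are all hypothetical.

```python
import math


class IncrementalGaussianNB:
    """Hypothetical per-micro-cluster model: running mean and variance per feature."""

    def __init__(self, dim, var_floor=1e-3):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim          # running sum of squared deviations (Welford)
        self.var_floor = var_floor     # keeps the variance away from zero

    def update(self, x):
        """Absorb one sample without storing it (incremental update)."""
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def log_likelihood(self, x):
        """Log density of x under independent per-feature Gaussians."""
        ll = 0.0
        for i, xi in enumerate(x):
            var = max(self.m2[i] / self.n, self.var_floor)
            ll += -0.5 * math.log(2 * math.pi * var) - (xi - self.mean[i]) ** 2 / (2 * var)
        return ll


def assign_euclidean(x, clusters):
    """Pick the micro-cluster whose mean is nearest to x."""
    def dist(name):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, clusters[name].mean)))
    return min(clusters, key=dist)


def assign_inb(x, clusters):
    """Pick the micro-cluster with the highest log prior plus log likelihood."""
    total = sum(c.n for c in clusters.values())
    def score(name):
        c = clusters[name]
        return math.log(c.n / total) + c.log_likelihood(x)
    return max(clusters, key=score)


# Toy data (made up): A is a large cluster stretched along the first axis,
# B is a small, tight cluster centred near (6, 0).
clusters = {"A": IncrementalGaussianNB(2), "B": IncrementalGaussianNB(2)}
for x in range(-4, 5):
    clusters["A"].update((float(x), 0.2 if x % 2 else -0.2))
for point in [(5.8, -0.2), (6.0, 0.2), (6.2, -0.2), (6.0, 0.0)]:
    clusters["B"].update(point)

query = (3.5, 0.0)
print("Euclidean assignment:", assign_euclidean(query, clusters))
print("INB-style assignment:", assign_inb(query, clusters))
```

With these toy points the query lies nearer to the mean of B, so the distance rule picks B. The likelihood rule picks A, because the query falls well inside the spread of A and far outside the narrow spread of B.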
Abstract:
This postgraduate project aims to develop a Twitter spam classification system utilizing machine learning techniques. The existing system lacks efficient spam detection mechanisms, leading to increased instances of spam on Twitter. The proposed system leverages advanced machine learning algorithms to enhance the accuracy of spam detection, providing users with a cleaner and safer Twitter experience.
Existing System:
The existing system relies on basic rule-based filters for spam detection, which produces high rates of both false positives and false negatives. This limitation calls for a more robust and adaptive approach.
Proposed System:
The proposed system employs machine learning algorithms such as Naive Bayes and Support Vector Machines (SVM), together with Natural Language Processing (NLP) techniques, to analyze tweet content and user behavior for accurate spam classification. It aims to reduce both false positives and false negatives, improving the overall reliability of spam detection.
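As a rough illustration of the Naive Bayes part of this approach, the sketch below trains a multinomial Naive Bayes text classifier with Laplace smoothing. It uses only the Python standard library so that the arithmetic is visible; a real build would more likely use a library classifier such as scikit-learn's `MultinomialNB`. The class name, the tokenizer, and the six training tweets are invented for illustration and are not project data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Hypothetical multinomial Naive Bayes classifier with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of tweets
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # log prior
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # ignore words never seen in training
                score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training tweets, for illustration only.
train_texts = [
    "win a free iphone now click here",
    "free followers click this link now",
    "claim your free prize click now",
    "having coffee with friends this morning",
    "great talk at the conference today",
    "reading a new book this weekend",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = NaiveBayesSpamFilter().fit(train_texts, train_labels)
print(model.predict("click here for a free prize"))
print(model.predict("coffee with friends at the conference"))
```

On this toy set the first message is labeled spam and the second ham. User-behavior features and an SVM, both mentioned above, are outside the scope of this sketch.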
System Requirements:
- Python programming language
- Machine learning libraries (e.g., scikit-learn, TensorFlow)
- Twitter API for data retrieval
- Web development tools for UI implementation
Algorithms:
- Naive Bayes
- Support Vector Machines (SVM)
- Natural Language Processing (NLP) techniques
Hardware Requirements:
- Standard computer with sufficient processing power
- Adequate RAM for machine learning model training
Software Requirements:
- Python IDE
- Web development environment
- Twitter API access credentials
Architecture:
The system architecture follows a modular structure with components for data retrieval, preprocessing, machine learning model training, and a web-based user interface for end-user interaction.
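The preprocessing component could look something like the sketch below, which normalizes a raw tweet before it reaches the classifier. The function name, the placeholder tokens `urltoken` and `usertoken`, and the specific cleaning steps are assumptions chosen for illustration; the project description does not specify them.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def preprocess(tweet):
    """Normalize a raw tweet into a lowercase, space-separated token string."""
    text = tweet.lower()
    text = URL_RE.sub(" urltoken ", text)       # hide the concrete link, keep the signal
    text = MENTION_RE.sub(" usertoken ", text)  # hide the concrete user name
    text = HASHTAG_RE.sub(r"\1", text)          # keep the hashtag word, drop the '#'
    text = NON_ALNUM_RE.sub(" ", text)          # strip punctuation and symbols
    return " ".join(text.split())               # collapse repeated whitespace


print(preprocess("WIN a FREE phone!!! http://spam.example/x @bob #Giveaway"))
# win a free phone urltoken usertoken giveaway
```

Replacing links and mentions with fixed tokens keeps the fact that a tweet contains them, which may itself be a useful spam signal, without letting the model memorize individual URLs or accounts.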
Technologies Used:
- Python
- Flask (for web UI)
- scikit-learn, TensorFlow (for machine learning)
- Twitter API
Web User Interface:
The web-based user interface provides users with an interactive platform for accessing and analyzing spam classification results. It includes features such as real-time updates and a user-friendly design.