**ABSTRACT**:

This paper aims to explain word vectors and how they support document clustering. The need for clustering of documents is high in applications such as document summarization and information retrieval. Huge collections of documents pile up every day, and it is challenging to cluster them efficiently. Clustering needs to be performed with good preprocessing and analysis techniques that preserve the order of the sequence of words and the concepts of the words in the documents. To best understand the concepts in a document, which in turn helps place the document in the most appropriate cluster, it is essential to represent the document semantically. A semantic representation preserves the meaning of the words in a document. Many algorithms and approaches have been used to date, each with its own merits and demerits. The algorithm used for word vectors here is the “Word2Vec Skip-gram model”, a neural network word vector model that generates a 100-dimensional vector for each word. The document is then represented by a feature vector, computed from the word vectors with the min-max method, which summarizes all the vectors of the document into a single feature vector. The algorithm used for clustering the documents is “k-means”.
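As a rough illustration of the min-max step, the sketch below assumes that "min-max" means taking the element-wise minimum and maximum over all word vectors of a document and concatenating them. The abstract does not spell out the exact formula, so this is one common reading and not necessarily the paper's method. The function name `min_max_feature_vector` and the toy 3-dimensional vectors are hypothetical; in the paper the vectors would come from the Word2Vec model and have 100 dimensions.

```python
from typing import Dict, List


def min_max_feature_vector(tokens: List[str],
                           word_vectors: Dict[str, List[float]]) -> List[float]:
    """Summarize a document as [element-wise min | element-wise max].

    Assumption: words without a vector are skipped.
    """
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        raise ValueError("document contains no words with a known vector")
    # zip(*vectors) iterates over dimensions, so each column holds one
    # coordinate of every word vector in the document.
    mins = [min(column) for column in zip(*vectors)]
    maxs = [max(column) for column in zip(*vectors)]
    return mins + maxs


# Toy 3-dimensional word vectors (hypothetical values, for illustration only).
toy_vectors = {
    "text": [0.2, -0.1, 0.5],
    "cluster": [0.4, 0.3, -0.2],
    "document": [-0.3, 0.1, 0.0],
}

feature = min_max_feature_vector(
    ["text", "cluster", "document", "unknown"], toy_vectors
)
print(len(feature), feature)
```

Under this reading, a document of any length maps to a vector of fixed size (twice the word vector dimension), which is what a clustering algorithm such as k-means needs as input.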

**CHAPTER-1: INTRODUCTION**

Text clustering is the application of cluster analysis to text-based documents. It is an efficient analysis technique used in text mining to arrange a huge collection of unorganised text documents into a set of coherent clusters. Documents which are similar belong to the same cluster, whereas documents which are dissimilar belong to different clusters. Clustering is unsupervised; it creates the clusters based on patterns in the data. Clustering is important because it determines the intrinsic grouping among unlabeled data. There is no absolute criterion for a good clustering; it depends on the user and on the criteria that satisfy their need. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describing their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes) or in finding unusual data objects (outlier detection). A clustering algorithm must make some assumptions about what constitutes the similarity of points, and each assumption produces different and equally valid clusters.

**K-Mean Algorithm**: K-Mean was first developed by James MacQueen in 1967. A cluster is represented by its centroid, which is usually the mean of the points within the cluster. The objective function used for k-means is the sum of discrepancies between each point and its centroid, expressed through an appropriate distance.
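The centroid and the objective function described above can be sketched as follows. The text only says "an appropriate distance"; the sketch assumes squared Euclidean distance, which is the usual choice for k-means. The helper names (`centroid`, `squared_distance`, `kmeans_objective`) and the sample points are illustrative.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Mean of the points in one cluster, computed per dimension."""
    return [sum(column) / len(points) for column in zip(*points)]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance (assumed choice of 'appropriate distance')."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_objective(clusters: List[List[Point]]) -> float:
    """Sum over all clusters of the distances between points and their centroid."""
    total = 0.0
    for points in clusters:
        centre = centroid(points)
        total += sum(squared_distance(p, centre) for p in points)
    return total


# Two small, well-separated clusters (hypothetical data).
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(10.0, 10.0), (10.0, 12.0)],
]
print(centroid(clusters[0]), kmeans_objective(clusters))
```

A lower objective value means the points sit closer to their centroids, so the quantity can be used to compare two clusterings of the same data with the same k.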

K-Mean produces clusters with convex shapes. Procedure of K-Mean:

a) Arbitrarily choose k objects from D as the initial centres, where k is the number of clusters and D is the data set containing n objects.

b) Repeat steps c and d.

c) Reassign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster.

d) Calculate the mean value of the objects for each cluster.

e) Stop when the assignments no longer change.

**Advantages of K-Mean**:

- When the number of variables is large, K-Means is most of the time computationally faster than hierarchical clustering methods.
- K-Means produces tighter clusters than hierarchical clustering methods.

**Disadvantages of K-Means Partition Algorithm**:

- It is difficult to predict the value of K.
- It is more difficult to compare the quality of the clusters produced.
- The K-Means algorithm does not work well with global clusters.
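The K-Mean procedure described in this chapter (steps a to e) can be sketched in plain Python as shown below. This is a minimal illustration under stated assumptions, not the project's implementation: it assumes squared Euclidean distance as the similarity measure, keeps the old centre when a cluster becomes empty, and adds a `max_iter` safety limit. The names `k_means`, `seed`, and `max_iter` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points: List[Point]) -> List[float]:
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(data: List[Point], k: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    rng = random.Random(seed)
    # Step a: arbitrarily choose k objects from the data set as initial centres.
    centres = [list(p) for p in rng.sample(data, k)]
    assignment: List[int] = []
    # Step b: repeat the reassignment and update steps.
    for _ in range(max_iter):
        # Step c: assign each object to the cluster with the nearest centre.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in data
        ]
        # Step e: stop when no assignment changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step d: recompute the mean of the objects in each cluster.
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centre
                centres[j] = mean_point(members)
    return centres, assignment


# Hypothetical 2-D data with two visible groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
        (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centres, assignment = k_means(data, k=2)
print(centres, assignment)
```

Note that the caller has to supply `k`, and the initial centres depend on `seed`. On less clearly separated data, different seeds can lead to different final clusters, which is one reason the value of K and the quality of the result are hard to judge.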