At datapro, we provide final year projects with source code in Python for computer science students in Hyderabad and Visakhapatnam.

ABSTRACT
This paper presents a text summarizer project. In this digital era, the amount of data being collected is growing rapidly, and this growth demands ever more storage. Summarizing the data helps reduce these storage requirements. Summarization means reducing the original content to a shorter version while preserving the key information and the meaning of the context. Humans produce a summary by reading the whole text, so automating the task reduces human intervention and yields a summary with less effort. However, humans summarize by drawing on their knowledge and language capabilities, which makes text summarization a difficult task for a computer. Generally, text summarization follows one of two approaches: abstraction and extraction. The extractive approach builds the summary from the sentences of the input text: it assigns an individual score to each sentence and, based on that score, includes the sentence in or excludes it from the condensed version. The abstractive approach generates the summary using advanced natural language methodologies, so some of the text in the condensed version may not appear in the input text. Thus, it does not merely rearrange and format the text, as the extractive approach does. The main aim of the project is to find a subset of data that contains all the important information of the input data set.

Keywords – Abstractive, RNN, Summarization, Encoder, Decoder, LSTM, SoftMax, PUNKT, Stop words.

INTRODUCTION

    A summarization system typically performs three tasks:
  • Construct an intermediate representation of the input text that expresses its main aspects.
  • Score the sentences based on that representation.
  • Select a summary consisting of a number of sentences.
    Intermediate Representation: Every summarization system creates some intermediate representation of the text it intends to summarize and finds salient content based on this representation. There are two types of approaches based on the representation: topic representation and indicator representation.
  • Topic representation approaches transform the text into an intermediate representation and interpret the topics discussed in the text. Summarization techniques based on topic representation differ in their complexity and representation model.
  • Indicator representation approaches describe every sentence as a list of features (indicators) of importance, such as sentence length, position in the document, and the presence of certain phrases.
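As a rough illustration of the indicator representation, the sketch below turns each sentence into a small set of features. It is only a minimal example under stated assumptions: the cue-phrase list, the naive regular-expression sentence splitter, and the function names are hypothetical and are not taken from the project.

```python
import re

# Hypothetical cue phrases; a real system would tune this list for its domain.
CUE_PHRASES = ("in conclusion", "in summary", "importantly")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def indicator_features(text):
    """Describe every sentence as a dict of simple importance indicators."""
    sentences = split_sentences(text)
    last = max(len(sentences) - 1, 1)  # avoid division by zero for one sentence
    features = []
    for index, sentence in enumerate(sentences):
        lowered = sentence.lower()
        features.append({
            "sentence": sentence,
            "length": len(re.findall(r"\w+", sentence)),  # word count
            "position": index / last,  # 0.0 = first sentence, 1.0 = last
            "has_cue_phrase": any(p in lowered for p in CUE_PHRASES),
        })
    return features
```

A scoring step could then combine these indicators, for example with hand-chosen or learned weights.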
    Sentence Score: Once the intermediate representation is generated, we assign an importance score to each sentence. In topic representation approaches, the score of a sentence represents how well the sentence explains some of the most important topics of the text. In most indicator representation methods, the score is computed by aggregating the evidence from different indicators.
    Sentence scoring is one of the most widely used processes in Natural Language Processing (NLP) when working with textual data. It associates a numerical value with each sentence according to the priorities of the chosen algorithm, and it is used especially heavily in text summarization. Popular sentence scoring methods include TF-IDF and TextRank.
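To make sentence scoring concrete, here is a minimal TF-IDF sketch in plain Python. It assumes one common variant, not necessarily the one used in the project: every sentence is treated as its own "document" for the IDF statistics, and a sentence's score is the sum of the TF-IDF weights of its terms. The function names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of the TF-IDF weights of its terms."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        counts = Counter(tokens)
        score = 0.0
        for term, count in counts.items():
            tf = count / len(tokens)       # term frequency within the sentence
            idf = math.log(n / df[term])   # rarer terms get a larger weight
            score += tf * idf
        scores.append(score)
    return scores
```

Under this variant, a term that occurs in every sentence has an IDF of zero and contributes nothing, so sentences made up of rarer terms score higher.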
    Summary Sentences: The summarizer selects the top k most important sentences to produce a summary. Some approaches use greedy algorithms to select the important sentences, while others turn sentence selection into an optimization problem in which a collection of sentences is chosen under the constraint that it should maximize overall importance and coherence and minimize redundancy.
    Text summarization methods can be classified into extractive and abstractive summarization. An extractive summarization method selects important sentences, paragraphs, etc. from the original document and concatenates them into a shorter form. In other words, the extractive technique pulls key phrases from the source document and combines them to make a summary. The extraction is made according to a defined metric, without making any changes to the text.
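The greedy selection of the top k sentences mentioned above might be sketched as follows. This is an illustrative example only: the Jaccard word-overlap measure and the `max_overlap` threshold are assumptions standing in for whatever redundancy criterion a real system uses, and the scores are expected to come from a separate scoring step.

```python
import re


def words(sentence):
    """Return the set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two word sets, in [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_summary(sentences, scores, k, max_overlap=0.5):
    """Greedily pick up to k high-scoring, mutually dissimilar sentences."""
    # Visit sentence indices from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in ranked:
        if len(chosen) == k:
            break
        candidate = words(sentences[i])
        # Skip a sentence that largely repeats one already in the summary.
        if any(jaccard(candidate, words(sentences[j])) > max_overlap for j in chosen):
            continue
        chosen.append(i)
    # Restore document order so the summary reads naturally.
    return [sentences[i] for i in sorted(chosen)]
```

Because the chosen sentences are copied unchanged, the output is extractive by construction.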
    Here is an example of extractive summarization:
    Source Text:
    Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. In the city, Mary gave birth to a child named Jesus.
    Extractive summary:
    Joseph and Mary attend event Jerusalem. Mary birth Jesus.
    As the example shows, words are extracted from the source text and joined to create a summary, although the result can sometimes be grammatically strange.
    Abstractive summarization means understanding the main concepts in a document and then expressing those concepts in clear natural language. The abstraction technique entails paraphrasing and shortening parts of the source document. When abstraction is applied to text summarization with deep learning, it can overcome the grammatical inconsistencies of the extractive method. Abstractive text summarization algorithms create new phrases and sentences that relay the most useful information from the original text, just as humans do. For this reason, abstraction can perform better than extraction. However, the algorithms required for abstraction are more difficult to develop, which is why extraction is still popular.
    Here is an example of abstractive summarization:
    Source Text:
    Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. In the city, Mary gave birth to a child named Jesus.
    Abstractive summary:
    Joseph and Mary came to Jerusalem where Jesus was born.
    In this study, we choose the abstractive summarization approach and build the text summarizer with an attention-based many-to-many Seq2Seq encoder-decoder architecture and a recurrent neural network (RNN). An RNN is used because it is recurrent in nature: it processes the input one step at a time and carries its state forward from each step to the next. During training, this recurrent property is said to leave room for cross-validation on the training data and to help increase the efficiency of the summarizer when generating unique summaries of reviews. The model helps the user generate abstractive summaries of the provided input data that preserve the key information and do not deviate from the main context.
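The source does not say which attention variant the model uses, so the following is only a sketch of a single attention step under the assumption of simple dot-product scoring. In a trained model the decoder and encoder states would be learned LSTM hidden states; here they are plain lists of floats, and all names are illustrative.

```python
import math


def softmax(values):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_context(decoder_state, encoder_states):
    """One attention step: weight encoder states by relevance to the decoder state."""
    # Score every encoder state against the current decoder state.
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    # The context vector is the weighted sum of the encoder states.
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(size)
    ]
    return weights, context
```

At each output step the decoder would combine this context vector with its own state before a softmax layer predicts the next word of the summary.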
    Support Vector Regression Model:
    Support Vector Regression (SVR), as the name suggests, is a regression algorithm that supports both linear and non-linear regression. It works on the principle of the Support Vector Machine (SVM). SVR differs from SVM in that SVM is a classifier used for predicting discrete categorical labels, while SVR is a regressor used for predicting continuous ordered variables.
    In simple regression, the idea is to minimize the error rate, while in SVR the idea is to fit the error inside a certain threshold. In other words, SVR approximates the best value within a given margin called the ε-tube.
    The following terms are needed to understand SVR:
  • Hyperplane: In SVM, it is the separating boundary between two data classes, drawn in a higher dimension than the original data. In SVR, it is the line that helps in predicting the target value.
  • Kernel: In SVR, the regression is performed in a higher dimension. To do that, we need a function that maps the data points into that higher dimension. This function is called the kernel. Kernels used in SVR include the sigmoid kernel, the polynomial kernel, and the Gaussian kernel.
  • Boundary Lines: These are the two lines drawn around the hyperplane at a distance of ε (epsilon). They create a margin between the data points.
  • Support Vector: These are the extreme data points in the dataset that help define the hyperplane. They lie close to the boundary.
    The objective of SVR is to fit as many data points as possible without violating the margin. Note that in SVM classification the support vectors define the separating hyperplane, whereas in SVR they define the regression function. SVR works on the principle of SVM with a few minor differences: given the data points, it tries to find a curve, but because it is a regression algorithm, it does not use the curve as a decision boundary; instead, it uses the curve to match the data points to their positions along it. The support vectors help determine the closest match between the data points and the function used to represent them.
    To recap, the SVM regression algorithm supports both linear and non-linear regression, predicts continuous ordered variables, and aims to contain the error within a threshold, that is, to approximate the best value within a given margin. For classification or outlier detection in an n-dimensional space, SVM constructs one or more hyperplanes. For regression, the best parameters are chosen for the SVM regressor, the model is fitted on the training set, and the test set is used to forecast the values.
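Two of the SVR ideas described above can be written down in a few lines: the ε-tube, expressed as the standard ε-insensitive loss, and the Gaussian kernel. This is a minimal sketch and not a full SVR solver; the default values for `epsilon` and `gamma` are arbitrary assumptions chosen for illustration.

```python
import math


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors inside the epsilon-tube cost nothing; outside, the cost grows linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)
```

A prediction that misses the target by 0.05 with `epsilon=0.1` therefore incurs no loss, while one that misses by 0.5 is penalized only for the 0.4 that falls outside the tube.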