Abstract:
An essential operation in web corpus construction consists of web scraping: retaining the desired content of a page while discarding the rest. Another challenge is finding one's way through websites. This article introduces a text discovery and extraction tool published under an open-source license. Its installation and use are straightforward, notably from Python and on the command line. The software allows for main text, comments, and metadata extraction, while also providing building blocks for web crawling tasks. A comparative evaluation on real-world data
shows its usefulness as well as the performance of other available solutions.
The contributions of this paper are threefold: it references the software, features a benchmark,
and provides a meaningful baseline for similar tasks. The tool performs significantly better
than other open-source solutions in this evaluation and in external benchmarks.
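The tool itself is not named in this excerpt; its description matches the trafilatura Python package, and the following minimal sketch assumes that package and its documented fetch_url and extract functions to illustrate the Python and command-line usage mentioned in the abstract. It is an assumption-laden illustration, not a prescribed workflow.

    # Minimal sketch, assuming the tool is the trafilatura package (pip install trafilatura).
    import trafilatura

    # Download a page and extract its main text; include_comments also keeps user comments.
    downloaded = trafilatura.fetch_url("https://example.org/article")
    if downloaded is not None:
        text = trafilatura.extract(downloaded, include_comments=True)
        print(text)

    # Roughly the same extraction from the command line, for a single URL:
    #   trafilatura -u "https://example.org/article"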
As useful monolingual text corpora across languages are highly relevant for the NLP community, web corpora seem to be a natural way to gather language data. Corpus construction usually involves "crawling, downloading, 'cleaning' and de-duplicating the data, then linguistically annotating it and loading it into a corpus query tool" (Kilgarriff, 2007). However, although text is ubiquitous on the Web, drawing accurate information from web pages can be difficult. In addition, the increasing variety of corpora, text types, and use cases makes it more and more difficult to assess the usefulness and appropriateness of certain web texts for given research objectives.