Tag: text

There are 10 datasets tagged with text:

  • From the distribution page: This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly...
  • Openthesis
    • Not Openly Licensed
    From the website: OpenThesis is a free repository of theses, dissertations, and other academic documents, coupled with powerful search, organization, and collaboration tools. We...
  • Ngrams and code from Dr. Peter Norvig's chapter for Beautiful Data (2009), edited by Segaran and Hammerbacher. Data files are derived from the Google Web Trillion Word Corpus, as...
  • New Zealand Digital Library
    • Not Openly Licensed
    The library is a collection of machine-readable texts and metadata, especially relating to New Zealand and the Asia/Pacific Region. From the website: [The library] provides several...
  • Wikisource is a repository of English-language text. As of October 2011, it contains over 240,000 pages. From the website: Wikisource is an online library of free content publications...
  • Read the Web
    • Not Openly Licensed
    This data includes facts extracted from 500 million web pages. From the project's website: To build a never-ending machine learning system that acquires the ability to extract...
  • Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated...
  • This is a recipe to train word n-gram language models using the newswire text provided in the English Gigaword corpus (1200M words of NYT, APW, AFE, XIE). It also prepares dictionaries...
  • Microsoft Web N-Gram Service
    • Not Openly Licensed
    Microsoft has developed services on the basis of ngrams from all of Bing's en_US corpus. The raw public data available include two files with the top 100k words from this corpus. The...
  • Spinn3r Indexing the Blogosphere
    • Not Openly Licensed
    Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data, and you can focus on building your...