Tag: corpus

There are 13 datasets tagged with corpus:

  • About About TalkBank: The goal of TalkBank is to foster fundamental research in the study of human and animal communication. It will construct sample databases within each of the...
  • About From website: As of November 2007, the European Commission's Directorate-General for Translation (DGT) made publicly accessible its multilingual Translation Memory for the Acquis...
  • About VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac). We will make available all submitted...
  • About From distribution page: This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly...
  • MOCHA-TIMIT
    • None Not Openly Licensed
    About Authors: Alan Wrench, Queen Margaret University College. Funded by: Engineering and Physical Sciences Research Council. When created: November 1999....
  • The IBL Corpus
    • None Not Openly Licensed
    About The IBL Corpus was collected by the University of Plymouth and the University of Edinburgh as part of the EPSRC funded project IBL, Instruction-based Learning for Mobile Robots...
  • About From website: WordNet is a lexical database for English that has been widely adopted in artificial intelligence and computational linguistics for a variety of practical...
  • About From website: The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with...
  • OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community...
  • This is a recipe to train word n-gram language models using the newswire text provided in the English Gigaword corpus (1200M words of NYT, APW, AFE, XIE). It also prepares dictionaries...
  • Resources, including corpora and software, for processing Hungarian language. Language resources The Hunglish Corpus is a sentence-aligned Hungarian-English parallel corpus...
  • Description Overview from home page: The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages: Romanic...
  • The OCTC hosts open-content texts, encoded in TEI P5 XML, for many languages, each in a separate subcorpus. Another part of the OCTC stores interlanguage alignment info. The project is...