Item Name: Marathi 1T 2-gram Version 1
Author(s): Uma Gajendragadkar [umagadkar@gmail.com], Sarang Joshi
Release Date: November 17, 2015
Data Source(s): Web Collection
Application(s): Language Modeling
Language(s): Marathi
Language ID(s): Marathi
Citation: Uma Gajendragadkar (COEP, SPPU, Pune, India) and Sarang Joshi (PICT, SPPU, Pune, India). Marathi 1T 2-gram Version 1. Web Download.
This data set, contributed by Uma Gajendragadkar and Sarang Joshi, contains Marathi word n-grams and their observed frequency counts. The n-grams range in length from unigrams (single words) to bigrams (two-word sequences). Three-gram, four-gram, and five-gram counts can be made available on request to the authors. The data can be used for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
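As an illustration of how unigram and bigram counts feed a statistical language model, the following is a minimal sketch, not the authors' pipeline. It assumes whitespace tokenization, and the function names and the two sample sentences are hypothetical.

```python
from collections import Counter

def count_ngrams(sentences):
    """Count unigrams and bigrams in whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        # Pair each token with its successor to form bigrams.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_probability(unigrams, bigrams, first, second):
    """Unsmoothed estimate of P(second | first) as count(first, second) / count(first).

    Returns 0.0 when `first` was never observed.
    """
    if unigrams[first] == 0:
        return 0.0
    return bigrams[(first, second)] / unigrams[first]

# Hypothetical sample sentences, not taken from the data set.
sample = ["मी घरी जातो", "मी शाळेत जातो"]
unigrams, bigrams = count_ngrams(sample)
p = bigram_probability(unigrams, bigrams, "मी", "घरी")
```

A practical model would add smoothing so that unseen bigrams do not receive zero probability.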
Source Data
The n-gram counts were generated from approximately 29 crore (about 290 million) word tokens of text from publicly accessible Web pages.
Character Encoding
The input encoding of each document was automatically detected, and all text was converted to UTF-8.
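The page does not say how the encoding was detected. One simple approach, shown here only as a sketch, is to try a short list of candidate encodings in order and re-encode the first successful decode as UTF-8. The candidate list and the function name `to_utf8` are assumptions made for illustration.

```python
def to_utf8(raw, candidates=("utf-8", "utf-16")):
    """Decode bytes with the first candidate encoding that works, then return UTF-8 bytes.

    The candidate list is an assumption; real detection may use a dedicated library.
    """
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Try the next candidate.
        return text.encode("utf-8")
    raise ValueError("none of the candidate encodings could decode the input")

# The same Marathi word stored in two encodings converts to identical UTF-8 bytes.
word = "मराठी"
from_utf16 = to_utf8(word.encode("utf-16"))
from_utf8 = to_utf8(word.encode("utf-8"))
```

Trying UTF-8 first is deliberate: it is strict enough that text in another encoding rarely decodes as valid UTF-8 by accident.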
Data Sizes
  • File sizes: approx. 170 MB of text files
  • Number of tokens: 290,406,855
  • Number of sentences: 109,277,834
  • Number of raw unigrams: 765,589
  • Number of unigrams: 588,797
  • Number of bigrams: 3,470,365
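The layout of the text files is not documented on this page. The sketch below assumes one n-gram per line, with space-separated words followed by a tab and the count; that layout, the function name, and the sample lines are all assumptions, so check them against the actual files before relying on this.

```python
import io

def read_ngram_counts(stream):
    """Parse lines of the assumed form 'word1 word2<TAB>count' into a dict."""
    counts = {}
    for line in stream:
        line = line.rstrip("\n")
        ngram, sep, count = line.rpartition("\t")
        if not sep:
            continue  # Skip blank lines and lines without a tab separator.
        counts[tuple(ngram.split(" "))] = int(count)
    return counts

# Hypothetical sample content, not taken from the data set.
sample = io.StringIO("मी घरी\t12\nघरी जातो\t7\n\n")
counts = read_ngram_counts(sample)
```

To read a real file, pass an explicit encoding when opening it, for example `open(path, encoding="utf-8")`, so that the Devanagari text is not decoded with a platform default.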

Last updated on November 7, 2019


  More Details
  • Contributed by : Uma Gajendragadkar, Sarang Joshi
  • Product Type : Text Corpora
  • License Type : Research
  • System Requirement : Not Applicable

By: mahesh shelke, September 23, 2019
I am unable to view the content of the sample file, even when I open it as UTF-8. Please advise on how to view the content.