We propose a new method for estimating the probability mass function (pmf) of a discrete and finite random variable from a small sample. We focus on the observed counts (the number of times each value appears in the sample) and define the maximum likelihood set (MLS) as the set of pmfs that put more mass on the observed counts than on any other set of counts possible for the same sample size. We characterize the MLS in detail in this article. We show that the MLS is a diamond-shaped subset of the probability simplex in [0,1]^k bounded by at most k(k-1) hyperplanes, where k is the number of possible values of the random variable. The MLS always contains the empirical distribution, as well as a family of Bayesian estimators based on a Dirichlet prior, in particular the well-known Laplace estimator. We propose to select from the MLS the pmf that is closest to a fixed pmf that encodes prior knowledge. When Kullback-Leibler distance is used for this selection, the optimization problem amounts to minimizing a convex function over a domain defined by linear inequalities, for which standard numerical procedures are available. We apply this estimate to language modeling, using Zipf's law to encode prior knowledge, and show that the method achieves state-of-the-art results while being conceptually simpler than most competing methods.
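
As an illustration of the MLS definition above, the following is a minimal brute-force sketch, not the article's algorithm: it assumes the counts follow a multinomial distribution and tests whether a candidate pmf gives the observed counts at least as much mass as every other count vector of the same sample size. The function names (`compositions`, `multinomial_pmf`, `in_mls`), the tolerance, and the example counts are hypothetical choices made for this sketch; enumeration is only practical for small k and small samples.

```python
from math import factorial, prod


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def multinomial_pmf(counts, p):
    """Probability of observing `counts` under a multinomial with pmf `p`."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # stays an exact integer at every step
    return coef * prod(q ** c for q, c in zip(p, counts))


def in_mls(p, counts, tol=1e-12):
    """True if `p` puts at least as much mass on `counts` as on any other
    count vector of the same sample size (ties allowed within `tol`)."""
    target = multinomial_pmf(counts, p)
    n, k = sum(counts), len(counts)
    return all(multinomial_pmf(other, p) <= target + tol
               for other in compositions(n, k))


# Hypothetical sample: k = 3 values, 4 observations.
counts = (3, 1, 0)
n, k = sum(counts), len(counts)
empirical = tuple(c / n for c in counts)          # relative frequencies
laplace = tuple((c + 1) / (n + k) for c in counts)  # add-one estimator
uniform = (1 / 3, 1 / 3, 1 / 3)

print(in_mls(empirical, counts), in_mls(laplace, counts), in_mls(uniform, counts))
```

For this sample the empirical and Laplace estimates pass the check, in line with the containment claims in the abstract, while the uniform pmf does not, because it gives the counts (2, 1, 1) more mass than the observed ones. Comparing the observed counts with a neighbouring vector that moves one observation from value i to value j gives the necessary condition n_i p_j ≤ (n_j + 1) p_i, which is linear in p and consistent with the at most k(k-1) bounding hyperplanes mentioned above.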


Source: http://dx.doi.org/10.1162/0899766053723078

