Self organization of a massive document collection.

IEEE Trans Neural Netw

Neural Networks Research Centre, Helsinki University of Technology, Espoo, Finland.

Published: October 2012

This article describes the implementation of a system that organizes vast document collections according to textual similarity. The system is based on the self-organizing map (SOM) algorithm and uses statistical representations of the documents' vocabularies as feature vectors. The main goal of our work has been to scale up the SOM algorithm so that it can handle large amounts of high-dimensional data. In a practical experiment we mapped 6,840,568 patent abstracts onto a 1,002,240-node SOM. The feature vectors were 500-dimensional vectors of stochastic figures obtained as random projections of weighted word histograms.
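The two ideas named in the abstract, random projection of weighted word histograms and SOM training, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the Gaussian random matrix, the Gaussian neighborhood function, the learning-rate values, and the toy sizes (8 dimensions and a 3-by-4 map rather than 500 dimensions and 1,002,240 nodes) are all assumptions made for illustration, and none of the speed-ups needed at the scale reported above are included.

```python
import math
import random


def random_projection_matrix(vocab_size, dim, seed=0):
    """One random unit-length column of length `dim` per vocabulary word.

    Gaussian entries are an assumption; the abstract only says 'random projections'.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(vocab_size):
        col = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in col))
        columns.append([v / norm for v in col])
    return columns


def project(histogram, columns):
    """Project a sparse weighted word histogram {word_index: weight} to `dim` values."""
    out = [0.0] * len(columns[0])
    for word, weight in histogram.items():
        for i, v in enumerate(columns[word]):
            out[i] += weight * v
    return out


def best_matching_unit(som, x):
    """Return (row, col) of the node whose model vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(som):
        for c, model in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(model, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def som_update(som, x, alpha=0.5, sigma=1.0):
    """One online SOM step: pull nodes near the best-matching unit toward x."""
    br, bc = best_matching_unit(som, x)
    for r, row in enumerate(som):
        for c, model in enumerate(row):
            # Gaussian neighborhood on the map grid (an assumed choice).
            h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * sigma ** 2))
            for i in range(len(model)):
                model[i] += alpha * h * (x[i] - model[i])
    return br, bc


# Toy usage: a 50-word vocabulary, 8-dimensional features, a 3-by-4 map.
columns = random_projection_matrix(vocab_size=50, dim=8, seed=1)
document = {3: 2.0, 17: 0.5, 42: 1.0}  # hypothetical weighted word counts
features = project(document, columns)
rng = random.Random(2)
som = [[[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)] for _ in range(3)]
winner = som_update(som, features)
```

Because only the words present in a document contribute to the projection, the cost of building a feature vector in this sketch grows with document length rather than with vocabulary size, which is one reason a projection of this kind suits short texts such as abstracts.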

Source
http://dx.doi.org/10.1109/72.846729
