In this work, we address two typical challenges that popular text classification methods face when applied to text moderation: the representation of multibyte characters and word obfuscations. Specifically, we develop a multihot byte-level scheme that significantly reduces the dimensionality of one-hot character-level encoding, which is inflated by the large number of instance-scarce non-ASCII characters. In addition, we introduce a simple yet effective weighting approach for fusing n-gram features to strengthen classical logistic regression. Surprisingly, the resulting model substantially outperforms well-tuned representative neural networks. Continuing this effort toward text moderation, we analyze the current state-of-the-art (SOTA) algorithm, bidirectional encoder representations from transformers (BERT), which works well in context understanding but performs poorly on intentional word obfuscations. To remedy this drawback, we develop an enhanced variant that integrates byte and character decomposition. As our comprehensive experiments demonstrate, it advances the SOTA performance on the largest abusive language datasets. Our work offers a feasible and effective framework for tackling word obfuscations.
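To make the dimensionality argument concrete, here is a minimal sketch of one possible reading of a multihot byte-level encoding. It is an illustration under stated assumptions, not the scheme from the paper: the function name `multihot_bytes`, the use of UTF-8, and the position-aware layout of 4 x 256 slots are all hypothetical choices made for this example.

```python
# Illustrative sketch only: the abstract does not specify the encoding layout.
# Assumptions (not from the source): UTF-8 bytes, one 256-wide slot per byte position.

MAX_BYTES = 4    # a single Unicode code point needs at most 4 bytes in UTF-8
BYTE_VALUES = 256


def multihot_bytes(ch: str) -> list[int]:
    """Encode one character as a multi-hot vector over (byte position, byte value)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character (code point)")
    vec = [0] * (MAX_BYTES * BYTE_VALUES)
    for pos, byte in enumerate(ch.encode("utf-8")):
        # Set one bit per byte; multibyte characters therefore set several bits.
        vec[pos * BYTE_VALUES + byte] = 1
    return vec


def encode_text(text: str) -> list[list[int]]:
    """Encode a string as a sequence of fixed-width multi-hot vectors."""
    return [multihot_bytes(ch) for ch in text]
```

Under these assumptions every character maps to a vector of fixed width 1024, regardless of how many distinct non-ASCII characters occur in the corpus, whereas a one-hot character encoding needs one dimension per distinct character.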

Source
http://dx.doi.org/10.1109/TNNLS.2021.3137045

