Centre for Artificial Intelligence and Cybersecurity – AIRI

The Influence of Feature Representation of Text on the Performance of Document Classification

28.02.2019

In this paper, we perform a comparative analysis of three models for the feature representation of text documents in the context of document classification. In particular, we consider the most commonly used family of bag-of-words models, the recently proposed continuous-space models word2vec and doc2vec, and a model based on the representation of text documents as language networks. While bag-of-words models have been used extensively for document classification, the performance of the other two models on the same task is not well understood. This is especially true for network-based models, which have rarely been considered for representing text documents for classification. In this study, we measure the performance of document classifiers trained with random forests on features generated by the three models and their variants. Multi-objective rankings are proposed as the framework for multi-criteria comparative analysis of the results. The empirical comparison shows that the commonly used bag-of-words model performs comparably to the emerging continuous-space model doc2vec. In particular, the low-dimensional variants of doc2vec, which generate up to 75 features, are among the top-performing document representation models. Finally, the results indicate that doc2vec shows superior performance when classifying large documents.
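
To make two of the representations named in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It builds bag-of-words count vectors and a simple language network for toy documents. The abstract does not say how the networks are constructed or which features are derived from them, so linking adjacent words and summarising the network by node count, edge count and average degree are assumptions made for illustration only. The function names and example sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"\w+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, vectors): one word-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def cooccurrence_network(text):
    """Build an undirected network as a dict: word -> set of neighbours.

    Assumption: two distinct words are linked if they are adjacent in the text.
    """
    toks = tokenize(text)
    neighbours = {t: set() for t in toks}
    for a, b in zip(toks, toks[1:]):
        if a != b:  # ignore self-loops from repeated words
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def network_features(text):
    """Summarise the network with a few global measures (illustrative choice)."""
    net = cooccurrence_network(text)
    degrees = [len(adjacent) for adjacent in net.values()]
    n_nodes = len(net)
    n_edges = sum(degrees) // 2  # each undirected edge is counted twice
    avg_degree = sum(degrees) / n_nodes if n_nodes else 0.0
    return {"nodes": n_nodes, "edges": n_edges, "avg_degree": avg_degree}


# Hypothetical toy documents, not data from the paper.
docs = ["the cat sat on the mat", "the dog sat on the log"]
vocab, vectors = bag_of_words(docs)
features = network_features(docs[0])
```

Either kind of feature vector could then be passed to a classifier such as a random forest. Note that the bag-of-words vector grows with the vocabulary, while the network summary here has a fixed length regardless of document size.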

Authors: Sanda Martinčić-Ipšić, Tanja Miličić, Ljupčo Todorovski
Journal: Applied Sciences
Publishing date: 20.02.2019