Show simple item record

dc.contributor.author
Fornaciari, Tommaso  
dc.contributor.author
Cagnina, Leticia Cecilia  
dc.contributor.author
Rosso, Paolo  
dc.contributor.author
Poesio, Massimo  
dc.date.available
2021-10-14T18:37:36Z  
dc.date.issued
2020-12  
dc.identifier.citation
Fornaciari, Tommaso; Cagnina, Leticia Cecilia; Rosso, Paolo; Poesio, Massimo; Fake opinion detection: how similar are crowdsourced datasets to real data?; Springer; Language Resources and Evaluation; 54; 4; 12-2020; 1019-1058  
dc.identifier.issn
1574-020X  
dc.identifier.uri
http://hdl.handle.net/11336/143656  
dc.description.abstract
Identifying deceptive online reviews is a challenging task for Natural Language Processing (NLP). Collecting corpora for the task is difficult, because normally it is not possible to know whether reviews are genuine. A common workaround involves collecting (supposedly) truthful reviews online and adding them to a set of deceptive reviews obtained through crowdsourcing services. Models trained this way are generally successful at discriminating between ‘genuine’ online reviews and the crowdsourced deceptive reviews. It has been argued that the deceptive reviews obtained via crowdsourcing are very different from real fake reviews, but the claim has never been properly tested. In this paper, we compare (false) crowdsourced reviews with a set of ‘real’ fake reviews published online. We evaluate their degree of similarity and their usefulness in training models for the detection of untrustworthy reviews. We find that the deceptive reviews collected via crowdsourcing are significantly different from the fake reviews published online. In the case of the artificially produced deceptive texts, their domain similarity to the target data affects the models’ performance far more than their untruthfulness does. This suggests that the use of crowdsourced datasets for opinion spam detection may not yield models applicable to the real task of detecting deceptive reviews. As an alternative way to create large datasets for the fake review detection task, we propose methods based on the probabilistic annotation of unlabeled texts, relying on meta-information generally available on e-commerce sites. Such methods are independent of the content of the reviews and make it possible to train reliable models for the detection of fake reviews.  
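
The probabilistic-labeling proposal in the abstract can be illustrated with a minimal Python sketch. Everything below is a hypothetical illustration, not the paper's actual method: the metadata field names (verified_purchase, reviewer_review_count, same_day_reviews) and the weights are invented assumptions about signals an e-commerce site might expose.

# Hypothetical sketch: derive a soft label P(fake) for an unlabeled review
# from site metadata alone, without inspecting the review text.
def soft_label(meta: dict) -> float:
    p = 0.5  # start from an uninformative prior
    if meta.get("verified_purchase"):              # assumed weak evidence of a genuine review
        p -= 0.2
    if meta.get("reviewer_review_count", 0) <= 1:  # single-review account: assumed spam signal
        p += 0.2
    if meta.get("same_day_reviews", 0) > 5:        # review burst for one product: assumed spam signal
        p += 0.15
    return min(max(p, 0.0), 1.0)

# Review texts paired with these soft labels can then serve as probabilistic
# training targets for a content-based classifier.
corpus = [{"text": "Great product, fast shipping!",
           "meta": {"verified_purchase": False, "reviewer_review_count": 1}}]
print([(r["text"], soft_label(r["meta"])) for r in corpus])
# -> [('Great product, fast shipping!', 0.7)]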
dc.format
application/pdf  
dc.language.iso
eng  
dc.publisher
Springer  
dc.rights
info:eu-repo/semantics/restrictedAccess  
dc.rights.uri
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/  
dc.subject
CROWDSOURCING  
dc.subject
DECEPTION DETECTION  
dc.subject
GROUND TRUTH  
dc.subject
PROBABILISTIC LABELING  
dc.subject.classification
Computer Science  
dc.subject.classification
Computer and Information Sciences  
dc.subject.classification
NATURAL AND EXACT SCIENCES  
dc.title
Fake opinion detection: how similar are crowdsourced datasets to real data?  
dc.type
info:eu-repo/semantics/article  
dc.type
info:ar-repo/semantics/artículo  
dc.type
info:eu-repo/semantics/publishedVersion  
dc.date.updated
2020-08-05T16:39:51Z  
dc.identifier.eissn
1574-0218  
dc.journal.volume
54  
dc.journal.number
4  
dc.journal.pagination
1019-1058  
dc.journal.pais
Germany  
dc.journal.ciudad
Berlin  
dc.description.fil
Fil: Fornaciari, Tommaso. Università Bocconi; Italy  
dc.description.fil
Fil: Cagnina, Leticia Cecilia. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Informática. Laboratorio Investigación y Desarrollo en Inteligencia Computacional; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina  
dc.description.fil
Fil: Rosso, Paolo. Universidad Politécnica de Valencia; Spain  
dc.description.fil
Fil: Poesio, Massimo. Queen Mary University of London; United Kingdom  
dc.journal.title
Language Resources and Evaluation  
dc.relation.alternativeid
info:eu-repo/semantics/altIdentifier/url/https://link.springer.com/article/10.1007%2Fs10579-020-09486-5  
dc.relation.alternativeid
info:eu-repo/semantics/altIdentifier/doi/10.1007/s10579-020-09486-5