The reviews that make up ReLi fit into the third type and were extracted from Skoob. The textual material varied widely in style, amount of subjective content and grammaticality. It showed a strong presence of alternative spellings, emoticons and other features typical of internet writing, posing additional challenges for automatic language processors. The corpus is composed of reviews of 13 books by 7 authors, comprising about , words and 12, sentences.

There are around reviews for each book; when this number could not be reached, we added other books by the same author until we arrived at a number close to it. Books and authors were chosen based on the number of reviews available per book. The variety of book styles led to a variety of language styles in ReLi: from very informal writing with heavy use of slang expressions, abbreviations, neologisms and emoticons, to more formal reviews with a more refined vocabulary.

Annotation Scheme

Annotation consists of adding linguistic information tags to a corpus, according to the annotation goals. The definition of the tagset is a decision-making process related to the way the problem or task will be modeled.

Morphosyntactic annotation attaches traditional linguistic categories, such as verb, noun or preposition, to words or expressions. In semantic annotation, semantic information is added to single words or larger portions of text. The kinds of tags are potentially infinite; examples include semantic classes of proper nouns, semantic roles, and polarities.

The annotation scheme underlying the ReLi corpus involves: Opinion identification: identification of the segment that expresses an opinion. In (16), there is no opinion about the book; in (17), the opinion segment is underlined; and in (18), the whole sentence conveys an opinion: The first chapters were kind of tedious, but as I hate abandoning a book, I kept going.

Romance adolescente bonitinho e meloso que poderia ter acabado no primeiro livro. [A cute, sappy teenage romance that could have ended with the first book.] However, at the sentence level, only one overall opinion was considered. Different opinions and polarities are associated with each target.

However, even with the two distinct polarities, we interpreted the whole sentence as positive with respect to the book. Therefore, at the sentence level, the polarity is positive, even though at the phrase level there is a negative opinion.
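The distinction between phrase-level and sentence-level polarity can be sketched as a small data structure. The following Python sketch is illustrative only: the class and field names (`Opinion`, `AnnotatedSentence`, `target`, `polarity`) and the example sentence are hypothetical and are not taken from the ReLi annotation tool or from the corpus.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Opinion:
    """A phrase-level opinion segment (hypothetical representation)."""
    text: str                     # the opinion expression itself
    polarity: str                 # "+" or "-"
    target: Optional[str] = None  # the object the opinion is about, if marked


@dataclass
class AnnotatedSentence:
    """A sentence with one overall polarity and any number of opinions."""
    text: str
    polarity: Optional[str] = None  # None when there is no opinion about the book
    opinions: List[Opinion] = field(default_factory=list)


# Invented example: two phrase-level opinions with opposite polarities,
# while the sentence as a whole is judged positive.
sentence = AnnotatedSentence(
    text="The ending is rushed, but the characters are wonderful.",
    polarity="+",
    opinions=[
        Opinion(text="rushed", polarity="-", target="ending"),
        Opinion(text="wonderful", polarity="+", target="characters"),
    ],
)

phrase_polarities = {o.polarity for o in sentence.opinions}
print(sentence.polarity, sorted(phrase_polarities))
```

Keeping the sentence-level polarity as a separate field, rather than deriving it from the phrase-level tags, reflects the fact that the overall judgment is an interpretation decision of its own.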

Annotation process

Fine-tuning among annotators is a crucial aspect of the annotation process, especially when the task depends on a high level of semantic interpretation. ReLi was initially annotated by three annotators (A, B and C), although in the end only one remained.

They all went through a training process until they were familiar with the task, the instructions, and the annotation tool. The relevance of always making interpretation decisions within the context of the review was also emphasized. Once the manual was finished, the corpus went through a process of revision to spot inconsistencies. A general revision of the content was also made by one of the authors of this article.

The annotation tool was adapted from a tool already developed by our research group.

Inter-Annotator Agreement Study

After around annotated reviews, we conducted a study of the agreement between the annotators. The following points were considered in the evaluation of agreement: sentences selected; polarity of the sentences selected; objects selected; opinions selected; polarity of the opinions selected. Although the annotation instructions gave some information with respect to segmentation, some variation was expected in the extent of the units selected.

One of the challenges, therefore, was defining agreement in cases in which the annotators identified the same opinion or target of opinion but diverged on the boundaries of the unit. In fact, evaluating this task is more complex than judging the assignment of polarity to sentences. We relied on the evaluation process of Wiebe et al. in two major points: (i) we considered expressions such as "final part" and "final" as equivalent, and (ii) we used the agr metric (Wiebe et al.), whose objective is to evaluate whether the annotators identified the same set of objects and opinions, that is, how much of what A annotated was also annotated by B.
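A directional measure of this kind can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the evaluation script used in the study: annotations are assumed to be stored as character-offset spans, and two spans are assumed to match whenever they overlap, so that "final part" and "final" would count as the same unit. The names `Span`, `overlaps` and `agr`, and the offsets in the example, are hypothetical.

```python
from typing import List, Tuple

# (start, end) character offsets, end exclusive (assumed representation)
Span = Tuple[int, int]


def overlaps(x: Span, y: Span) -> bool:
    """True when the two spans share at least one character."""
    return x[0] < y[1] and y[0] < x[1]


def agr(a: List[Span], b: List[Span]) -> float:
    """Proportion of the spans in `a` that are matched by some span in `b`.

    The measure is directional: agr(a, b) takes `a` as the baseline,
    so agr(a, b) and agr(b, a) can differ.
    """
    if not a:
        return 0.0
    matched = sum(1 for x in a if any(overlaps(x, y) for y in b))
    return matched / len(a)


# Invented offsets: A marks three segments, B marks two.
spans_a = [(0, 10), (20, 30), (40, 50)]
spans_b = [(0, 5), (22, 35)]

print(agr(spans_a, spans_b))  # two of A's three spans are matched by B
print(agr(spans_b, spans_a))  # both of B's spans are matched by A
```

Because the baseline changes with the direction, an annotator who selects fewer segments can be fully covered by one who selects more, while the reverse comparison yields a lower value.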

Table 2 shows the results of the agreement study between annotators A, B and C.

The first row (A B) indicates the agreement between A and B, taking A as the baseline (in other words, how many segments identified by A were also identified by B). The second row reflects the opposite situation: considering B as the baseline, how many choices made by B were also made by A. We separated the agreement study into two groups: agreement related to identification (whether the annotators identified the same set of expressions, for both the opinion target and the opinion itself) and agreement related to polarity assignment (once annotators agreed on the selected part, we measured whether they agreed about polarity).
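The second group, polarity agreement restricted to the units both annotators selected, can be illustrated as follows. The sketch assumes that each annotation is a `(start, end, polarity)` triple, that units match when their spans overlap, and that polarity agreement is reported as a simple proportion over the matched pairs; the source does not state which statistic was used, so all of these are assumptions, as are the function names and the example data.

```python
from typing import List, Tuple

# (start, end, polarity) with end exclusive; polarity is "+" or "-" (assumed)
Annotation = Tuple[int, int, str]


def matched_pairs(a: List[Annotation], b: List[Annotation]):
    """Pair each annotation in `a` with the first overlapping one in `b`."""
    pairs = []
    for x in a:
        for y in b:
            if x[0] < y[1] and y[0] < x[1]:
                pairs.append((x, y))
                break  # keep only the first overlapping annotation from b
    return pairs


def polarity_agreement(a: List[Annotation], b: List[Annotation]) -> float:
    """Share of matched pairs on which both annotators chose the same polarity."""
    pairs = matched_pairs(a, b)
    if not pairs:
        return 0.0
    same = sum(1 for x, y in pairs if x[2] == y[2])
    return same / len(pairs)


# Invented data: A and B agree on two units but on only one polarity.
ann_a = [(0, 10, "+"), (20, 30, "-"), (40, 50, "+")]
ann_b = [(0, 5, "+"), (22, 35, "+")]

print(len(matched_pairs(ann_a, ann_b)))
print(polarity_agreement(ann_a, ann_b))
```

Units selected by only one annotator are excluded here, which keeps identification disagreement from being counted a second time as polarity disagreement.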

Once annotators agreed on what to annotate, they tended to agree on the polarity orientation as well. The qualitative analysis showed that the rare cases of disagreement were due to adversative sentences in which the same sentence conveys contrastive opinions, and that the disagreement was precisely in the assignment of the overall polarity at the sentence level.

Phrase-level disagreement occurred only once. Although the type of annotation is not exactly the same, the result of agreement in Wiebe et al.

Exploring the corpus

Table 3 shows polarity distribution in ReLi.

Fui a nocaute, sem direito a recontagem. [I was knocked out, with no right to a recount.]

[I cried. I got a bit down]

[It was the only book that got me nervous and apprehensive while reading it]

Dolorido, pavoroso, nojento, repugnante e nauseante. Por todos esses adjetivos que o livro nos causa, ele consegue ser bom. [Painful, dreadful, disgusting, repugnant and nauseating. For all these adjectives that the book makes us feel, it is a good one.]

Nunca sofri tanto para ler um livro. [I have never suffered so much to read a book.]


