- 3. VADER (NLTK, standalone):
Summary Hebrew Psych Lexicon
- 1. For sentiment ratings in VADER, the authors describe their rater quality control as follows:
- 1. “Screening for English language reading comprehension – each rater had to individually score an 80% or higher on a standardized college-level reading comprehension test.
- 2. Complete an online sentiment rating training and orientation session, and score 90% or higher for matching the known (pre-validated) mean sentiment rating of lexical items which included individual words, emoticons, acronyms, sentences, tweets, and text snippets (e.g., sentence segments, or phrases).
- 3. Every batch of 25 features contained five “golden items” with a known (pre-validated) sentiment rating distribution. If a worker was more than one standard deviation away from the mean of this known distribution on three or more of the five golden items, we discarded all 25 ratings in the batch from this worker.
- 4. Bonus to incentivize and reward the highest quality work. Asked workers to select the valence score that they thought “most other people” would choose for the given lexical feature (early/iterative pilot testing revealed that wording the instructions in this manner garnered a much tighter standard deviation without significantly affecting the mean sentiment rating, allowing us to achieve higher quality (generalized) results while being more economical).
- 5. Compensated AMT workers $0.25 for each batch of 25 items they rated, with an additional $0.25 incentive bonus for all workers who successfully matched the group mean (within 1.5 standard deviations) on at least 20 of 25 responses in each batch. Using these four quality control methods, we achieved remarkable value in the data obtained from our AMT workers – we paid incentive bonuses for high quality to at least 90% of raters for most batches.”
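The golden-item filter (item 3) and the bonus rule (item 5) can be sketched in Python as below. This is only an illustration of the two thresholds as quoted, not the VADER authors' code: the function names, the parameter names, and the data layout (item id mapped to a `(mean, standard deviation)` pair) are assumptions made for the example.

```python
def batch_passes_golden_check(worker_ratings, golden_stats, max_misses=2):
    """Return True if the worker's batch is kept, False if it is discarded.

    worker_ratings: dict mapping item id -> rating given by the worker
    golden_stats:   dict mapping golden item id -> (known_mean, known_std)

    A golden item counts as a miss when the rating is MORE than one standard
    deviation from the known mean. Three or more misses (out of five golden
    items) discard the whole batch, so at most `max_misses` = 2 are tolerated.
    """
    misses = sum(
        1
        for item_id, (mu, sd) in golden_stats.items()
        if abs(worker_ratings[item_id] - mu) > sd
    )
    return misses <= max_misses


def earns_bonus(worker_ratings, group_stats, tolerance=1.5, min_matches=20):
    """Return True if the worker matched the group mean often enough.

    group_stats: dict mapping item id -> (group_mean, group_std)

    A response matches when it lies within `tolerance` standard deviations
    of the group mean; the bonus needs at least `min_matches` of 25 matches.
    """
    matches = sum(
        1
        for item_id, (mu, sd) in group_stats.items()
        if abs(worker_ratings[item_id] - mu) <= tolerance * sd
    )
    return matches >= min_matches
```

Both functions take their thresholds as parameters so that the quoted values (one standard deviation and three misses; 1.5 standard deviations and 20 of 25) stay visible and can be varied.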
- 1.6 million tweets labelled
- 13 languages
- Evaluated 6 pretrained classification models
- 10-fold cross-validation
- SVM / NB
- Annotator agreements.
- About 15% of the tweets were intentionally duplicated to be annotated twice, either:
- by the same annotator, or
- by two different annotators.
- Self-agreement: computed from multiple annotations of the same tweet by the same annotator.
- Inter-agreement: computed from multiple annotations of the same tweet by different annotators.
- Self-agreement turns out to be a good measure for identifying low-quality annotators,
- while the inter-annotator agreement provides a good estimate of the objective difficulty of the task, unless it is too low.
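To show how duplicated tweets yield the two kinds of agreement, here is a small Python sketch. It uses plain percent agreement as a simple stand-in and is not the Alpha measure these notes go on to describe. The function name and the `(tweet_id, annotator_id, label)` record layout are hypothetical, chosen only for the example.

```python
from collections import defaultdict
from itertools import combinations


def agreement_rates(annotations):
    """Compute raw percent agreement from duplicated annotations.

    annotations: iterable of (tweet_id, annotator_id, label) records.
    Returns (self_rate, inter_rate):
      self_rate  - dict annotator_id -> share of that annotator's repeated
                   annotations of the same tweet that carry the same label
      inter_rate - share of pairs by different annotators on the same tweet
                   that agree, or None if no such pairs exist
    """
    by_tweet = defaultdict(list)
    for tweet_id, annotator, label in annotations:
        by_tweet[tweet_id].append((annotator, label))

    self_hits = defaultdict(int)
    self_total = defaultdict(int)
    inter_hits = 0
    inter_total = 0

    # Compare every pair of annotations attached to the same tweet.
    for labelled in by_tweet.values():
        for (a1, l1), (a2, l2) in combinations(labelled, 2):
            if a1 == a2:
                self_total[a1] += 1
                self_hits[a1] += int(l1 == l2)
            else:
                inter_total += 1
                inter_hits += int(l1 == l2)

    self_rate = {a: self_hits[a] / self_total[a] for a in self_total}
    inter_rate = inter_hits / inter_total if inter_total else None
    return self_rate, inter_rate
```

An annotator whose self-agreement is much lower than that of their peers would be a candidate for review. Raw percent agreement does not correct for chance agreement or for the ordering of classes, which is what a measure such as Alpha addresses.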
Alpha was developed to measure the agreement between human annotators, but can also be used to measure the agreement between classification models and a gold standard. It generalizes several specialized agreement measures, takes ordering of classes into account, and accounts for the agreement by chance. Alpha is defined as follows: