When it comes to sentiment and emotions, knowing how someone is feeling can be an invaluable tool, when it’s correct. To ensure our models are at the forefront of technology, we put our talkAItive AI system to the test: the hardest test on the internet to date [1]. What we found was a dataset curated by Stanford University to be the jawbreaker of AI sentiment. We chose this benchmark specifically because of the challenge it posed. In keeping with academic testing standards, we measured our results in F1-scores. In layman's terms, the F1-score combines precision and recall into a single, robust measure of how accurate a system is. A higher score means a more accurate system.
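To make the metric concrete, here is a minimal sketch of how an F1-score can be computed for three sentiment labels. It assumes macro-averaging (the per-label F1 scores are averaged with equal weight), which this article does not specify; the function names and the toy labels are hypothetical and are not taken from our test harness.

```python
# Sketch only: macro-averaged F1 over three sentiment labels.
# Assumption: macro-averaging; the article does not state the averaging scheme.
LABELS = ("positive", "negative", "neutral")


def f1_for_label(gold, predicted, label):
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == label and p == label)
    false_pos = sum(1 for g, p in pairs if g != label and p == label)
    false_neg = sum(1 for g, p in pairs if g == label and p != label)
    if true_pos == 0:
        # No correct predictions for this label, so F1 is zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted, labels=LABELS):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for_label(gold, predicted, l) for l in labels) / len(labels)


# Hypothetical toy data, not drawn from the benchmark.
gold = ["positive", "negative", "neutral", "negative"]
predicted = ["positive", "negative", "negative", "negative"]
score = macro_f1(gold, predicted)
print(f"{100 * score:.2f} / 100")
```

Multiplying by 100 gives a value on the same 0 to 100 scale used for the scores reported in this article.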
Following the paper presented by Potts, Wu, Geiger and Kiela we performed 3 tests in total: v1, v2 and finally the combined v1+v2.
To summarize the paper, v1 is a test set derived from Yelp reviews and curated to find sentences that are difficult for a model to understand. This subset was validated by a team of human data analysts.
The v2 test set is a subset of v1 that remained quite difficult for a model to understand, even after the model was trained on the v1 test set. Overall, these test sets were designed to be difficult and were validated by a set of human data analysts.
What we found was that talkAItive AI was at the top of its class. Our model scored 73.99 / 100 on the combined v1+v2 test set. The model presented in the paper itself scored 74 / 100. When we ran the same test against the Google Sentiment API, it achieved a score of 58.57 / 100.
In the table below we show all 3 test results between talkAItive and Google.
| API | V1 Test Score | V2 Test Score | V1+2 Test Score |
|---|---|---|---|
| Google NLP API | 59.87 | 53.98 | 58.56 |
Though AI may not be perfect, it has made incredible progress, and at talkAItive we’re here to keep that technology growing and reaching towards the stars.
Below is a set of examples from the test set we used.
| Text | Gold Label | Google Score | talkAItive Score |
|---|---|---|---|
| I usually get no bun and do a lettuce wrap.. their lettuce is very lettuce-y. | neutral | -0.60 | neutral |
| There have been so many terrible things in this year of 2020, but the store has not lost its touch. | positive | -0.50 | positive |
| I threw my meal away even though I was really hungry and the price was such a steal. | negative | 0.89 | negative |
| The food was decent, so I'll start with that. | positive | -0.1 | positive |
| I wasn't impressed. There are bright, colorful, easy-to-read menus in front of the register that conveniently detail all their menu items. | negative | 0.0 | negative |
| The funfetti cake was sort of unfunfetti cake. | negative | -0.20 | negative |
| I just can't get over their prices -- I mean, 5 bucks for 5 blouses is unbelievable. | positive | -0.40 | positive |
| They have a daily bread selection, and also a different specialty bread for each day of the week. The bread is really bad. | negative | 0.0 | negative |
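The Google column above is a numeric score, while the gold labels are categories, so comparing the two requires some mapping from score to label. The sketch below shows one way that could be done. The function name and the 0.25 cutoff are assumptions chosen purely for illustration; they are not the thresholds used to produce the results in this article.

```python
# Sketch only: map a numeric sentiment score to a three-way label.
# Assumption: a symmetric cutoff of 0.25, chosen for illustration and
# not taken from the evaluation described in this article.
def score_to_label(score, threshold=0.25):
    """Return 'positive', 'negative', or 'neutral' for a numeric score."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Scores copied from three rows of the table above, with their gold labels.
examples = [(-0.60, "neutral"), (0.89, "negative"), (0.0, "negative")]
for score, gold in examples:
    print(score, "->", score_to_label(score), "| gold:", gold)
```

Under this assumed cutoff, all three of the sample scores map to a label that differs from the gold label, which is the kind of mismatch the table is meant to illustrate.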
1. Christopher Potts, Zhengxuan Wu, Atticus Geiger, and Douwe Kiela. 2020. DynaSent: A dynamic benchmark for sentiment analysis. Ms., Stanford University and Facebook AI Research.