1. Intro
This blog post explores the sentiment analysis capabilities of Azure Cognitive Text Analytics, AWS Comprehend, and custom fine-tuned models like RoBERTa and Phi2 2.7B.
The analysis leads to a clear recommendation for my clients: prefer custom Transformer models or LLMs for sentiment analysis. These models, particularly when fine-tuned, demonstrate superior accuracy and consistency, and fine-tuning is feasible with as few as 20,000 training examples, allowing solutions that align closely with specific data needs. The rest of this post walks through the evaluation that supports this conclusion.
2. Framework for Evaluation
My evaluation framework highlights three key aspects:
1. Accuracy: Assessed through F1 Scores, this measures how precisely each model categorizes sentiments, reflecting their effectiveness in sentiment detection.
2. Dependability: Evaluated via Agreement Analysis, this examines the consistency of predictions across different models, indicating the reliability of their classifications.
3. Potential Bias: Analyzed through Sentiment Distribution Profiles, this identifies any biases by comparing model outputs to the ground truth, ensuring balanced sentiment representation.
3. Data
To assess these aspects, I used a novel dataset specifically designed for aspect-based sentiment analysis (ABSA) in teacher performance evaluation. The dataset was created by collecting student feedback from American International University-Bangladesh and was labeled by undergraduate-level students into three sentiment classes: positive, negative, and neutral. After cleaning and preprocessing to ensure data quality and consistency, the final dataset contains over 2,000,000 student feedback instances related to teacher performance, making it a valuable resource for developing and evaluating ABSA models. Dataset: A Novel Dataset for Aspect-based Sentiment Analysis for Teacher Performance Evaluation.
The dataset consists of student feedback on faculty performance, featuring several key columns:
• StudentComments: Contains textual feedback from students.
• Rating: A numerical score reflecting the student's overall rating of the faculty. (Not used in my analysis)
• totalwords: Indicates the word count of each comment.
• Sentiment: Labels the sentiment as positive, negative, or neutral.
• sent_pretrained: Shows sentiment predictions from a pretrained model. (Not used in my analysis)
• subjectivity: Classifies the comment as subjective or objective. (Not used in my analysis)
• subj-score: Provides a score indicating the degree of subjectivity. (Not used in my analysis)
• isSame: A boolean indicating whether the sentiment matches between the manual and pretrained labels. (Not used in my analysis)
• labels: Encodes sentiment into numerical labels for analysis.
Figure 1 shows a snapshot of my training data. The "StudentComments" column contains the free-form text, while the "Sentiment" column shows the labels. I've only included entries where student comments exceed 200 characters. In total, my dataset has 22,719 records, as shown in Figure 2.
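To make the preprocessing concrete, here is a minimal sketch of the filtering step, assuming the dataset has been exported to a CSV with the columns listed above (the file name is illustrative):

```python
import pandas as pd

# Load the raw feedback data (file name is illustrative).
df = pd.read_csv("teacher_feedback.csv")

# Keep only comments longer than 200 characters, as described above.
df = df[df["StudentComments"].str.len() > 200].reset_index(drop=True)

print(len(df))                         # 22,719 records in my run
print(df["Sentiment"].value_counts())  # class balance of the ground-truth labels
```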
This box plot (Figure-3) illustrates the distribution of ratings across different sentiment categories: positive, neutral, and negative.
• Positive Sentiment: Ratings are generally high, clustering around 5, indicating that positive comments correlate with high ratings.
• Neutral Sentiment: Ratings are more centered, with a tighter range around the mid-point, reflecting moderate feedback.
• Negative Sentiment: Ratings are lower, with a wider spread and outliers, showing that negative comments are associated with lower ratings.
The scatter plot (Figure-4) shows the relationship between the number of words in a comment and its rating, categorized by sentiment (positive, neutral, negative).
• Positive Sentiment (Blue): Tends to have higher ratings regardless of comment length.
• Neutral Sentiment (Red): Ratings are mostly centered around the middle, with varying comment lengths.
• Negative Sentiment (Green): Generally associated with lower ratings, with some longer comments.
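Figures 3 and 4 can be reproduced with standard plotting libraries; below is a rough sketch using seaborn on the filtered dataframe from the previous snippet (column names as listed above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Figure-3 style: rating distribution per sentiment class.
sns.boxplot(data=df, x="Sentiment", y="Rating", ax=ax1)

# Figure-4 style: comment length vs. rating, colored by sentiment.
sns.scatterplot(data=df, x="totalwords", y="Rating", hue="Sentiment", ax=ax2)

plt.tight_layout()
plt.show()
```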
4. Models
Next, let's talk about the models/services I've selected for my analysis:
Azure Cognitive Text Analytics
Azure Cognitive Text Analytics offers a sentiment analysis service that evaluates text and returns sentiment scores and labels for each sentence. It is particularly effective in scenarios such as social media monitoring, customer reviews, and discussion forums. Azure's model is known for its integration capabilities with other Microsoft services, making it a convenient choice for businesses already using Azure's ecosystem.
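For reference, a minimal sketch of calling the sentiment endpoint with the azure-ai-textanalytics Python SDK looks roughly like this (the endpoint and key are placeholders for your own Language resource):

```python
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Placeholders: use your own resource endpoint and key.
client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

docs = ["The instructor explained every topic clearly and answered all questions."]
result = client.analyze_sentiment(docs)

for doc in result:
    if not doc.is_error:
        # Overall label plus per-class confidence scores.
        print(doc.sentiment, doc.confidence_scores)
```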
AWS Comprehend
AWS Comprehend provides a comprehensive natural language processing service that includes sentiment analysis. It is designed to handle large volumes of text with reasonable accuracy, making it suitable for enterprises with significant data processing needs. AWS's service is also scalable, allowing businesses to adjust their usage based on demand.
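A comparable call with boto3 might look like the sketch below (the region is illustrative, and batch APIs are available for larger volumes):

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # region is illustrative

response = comprehend.detect_sentiment(
    Text="The lectures were disorganized and hard to follow.",
    LanguageCode="en",
)

# Comprehend returns POSITIVE, NEGATIVE, NEUTRAL, or MIXED plus per-class scores.
print(response["Sentiment"], response["SentimentScore"])
```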
Custom Fine-Tuned Models
In addition to these managed API services, custom fine-tuned models like RoBERTa and open-source Large Language Models (LLMs) offer a tailored approach to sentiment analysis. By fine-tuning these models on specific datasets, businesses can achieve higher accuracy and relevance in sentiment detection than standard API services provide. For this analysis, I selected RoBERTa and the Phi2 2.7B LLM.
Phi2
I used the microsoft/phi-2 model, a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source consisting of various NLP synthetic texts and websites filtered for safety and educational value. Phi-2 demonstrated nearly state-of-the-art performance among models with fewer than 13 billion parameters on benchmarks testing common sense, language understanding, and logical reasoning. This version of Phi2 has not been fine-tuned through reinforcement learning from human feedback; the goal is to provide the research community with a non-restricted small model for exploring vital safety challenges, such as reducing toxicity, understanding societal biases, and enhancing controllability.
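Because the base Phi-2 checkpoint ships without a classification head, one way to obtain labels from it is to prompt it. The sketch below shows this with transformers; the prompt format is my own illustration, not necessarily what was used in the experiments:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)

comment = "The professor was always late and rarely returned graded work."
prompt = (
    "Classify the sentiment of the following student comment as "
    f"positive, negative, or neutral.\nComment: {comment}\nSentiment:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=3, do_sample=False)

# Only the newly generated tokens hold the predicted label.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```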
RoBERTa
For this experiment, I used the cardiffnlp/twitter-roberta-base-sentiment-latest model, a RoBERTa-base model trained on approximately 124 million tweets posted between January 2018 and December 2021 and fine-tuned for sentiment analysis on the TweetEval benchmark. That prior sentiment fine-tuning explains why it performs better than Phi2 even before any additional fine-tuning. The original Twitter-based RoBERTa model is available on the Hugging Face Hub, and the original reference paper is TweetEval.
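Using this checkpoint out of the box takes only a few lines with the transformers pipeline, which makes it a convenient baseline before any further fine-tuning:

```python
from transformers import pipeline

# The cardiffnlp checkpoint already ships with a sentiment head,
# so it can be used directly without additional training.
sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

print(sentiment("The course material was well organized and easy to follow."))
# e.g. [{'label': 'positive', 'score': ...}]
```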
5. Experiment Results
In this section, I present the experiment results from three perspectives to comprehensively evaluate the sentiment analysis models. First, the Sentiment Distribution Profile assesses potential biases by comparing how each model classifies sentiment against the ground truth. Second, I analyze F1 Scores, which combine precision and recall, to give a clear picture of each model's effectiveness in sentiment detection. Lastly, the Agreement Analysis examines the consistency of predictions across different models, highlighting the level of alignment and reliability in their classifications. Together, these analyses offer a detailed understanding of model performance and guide the selection of the right sentiment analysis tool.
Sentiment Distribution Profile
The sentiment distribution graphs provide valuable insights into how different models classify sentiment compared to the ground truth:
1. Ground Truth: Represents the distribution of sentiment labels in my dataset, serving as the benchmark for comparison.
2. Azure Sentiment: Azure tends to classify more feedback as positive and mixed, with fewer neutral sentiments, showing a significant deviation from the ground truth.
3. AWS Sentiment: Similar to Azure, AWS has a high count of positive sentiments and overestimates mixed sentiments compared to the ground truth.
4. RoBERTa Sentiment: The pre-trained RoBERTa model aligns more closely with the ground truth for positive sentiments but underestimates neutral feedback. Its stronger performance relative to the not-yet-fine-tuned Phi2 comes from its prior fine-tuning on a Twitter sentiment dataset.
5. Fine-tuned RoBERTa Sentiment: Fine-tuning enhances RoBERTa's alignment with the ground truth, particularly for positive sentiments, though it still underestimates neutral feedback.
6. Phi2 Sentiment: The base Phi2 model, without fine-tuning, significantly underperforms the others, struggling to produce accurate predictions for most data points, which results in minimal representation in the chart.
7. Fine-tuned Phi2 Sentiment: Fine-tuned Phi2 closely matches the ground truth, demonstrating substantial improvement.
Both fine-tuned models (RoBERTa and Phi2) show enhanced alignment with the ground truth, with fine-tuned Phi2 performing the best. Its improvement over fine-tuned RoBERTa is incremental but noteworthy.
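For readers who want to reproduce the distribution profiles, here is a small sketch that compares each model's label distribution against the ground truth, assuming the predictions have been collected as columns of the dataframe (the column names are my own placeholders):

```python
import pandas as pd

# Columns holding each model's predicted labels (names are placeholders).
models = ["azure_sentiment", "aws_sentiment", "roberta_sentiment",
          "roberta_ft_sentiment", "phi2_sentiment", "phi2_ft_sentiment"]

# Share of positive / neutral / negative per model vs. the ground truth.
profile = pd.DataFrame({
    "ground_truth": df["Sentiment"].value_counts(normalize=True),
    **{m: df[m].value_counts(normalize=True) for m in models},
}).fillna(0)

print(profile.round(3))
```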
F1 Scores
While Azure and AWS APIs provide an accessible starting point for building sentiment analysis applications, they significantly fall short in accuracy and F1 score for specific use cases compared to fine-tuned models. Custom models like RoBERTa and Phi2 LLM, when properly fine-tuned, deliver superior performance tailored to specific needs.
Figure-7 below illustrates the F1 scores for each model. AWS serves as the benchmark, with other models compared against it to determine percentage improvements. The F1 score ranges from 0 to 1, with 1 being perfect. As a rule of thumb, models with an F1 score above 85% are considered suitable for production deployment.
Fine-tuned models, particularly Fine-tuned Phi2, significantly outperform both AWS and Azure in accuracy, precision, recall, and F1 score. While AWS and Azure offer convenient and scalable solutions, custom fine-tuning of models like RoBERTa and Phi2 provides a more tailored and accurate approach for specific use cases.
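The scores in Figure-7 can be computed with scikit-learn. The sketch below reuses the dataframe and model-column list from the previous snippet and treats AWS as the baseline for the percentage improvements; it assumes API labels such as "mixed" have already been mapped onto the three ground-truth classes:

```python
from sklearn.metrics import f1_score

y_true = df["Sentiment"]

# Weighted F1 per model (prediction columns as defined above).
f1 = {m: f1_score(y_true, df[m], average="weighted") for m in models}

# Percentage improvement over the AWS baseline, as in Figure-7.
baseline = f1["aws_sentiment"]
for name, score in f1.items():
    print(f"{name}: F1={score:.3f} ({(score - baseline) / baseline:+.1%} vs AWS)")
```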
Pairwise agreement analysis
The highest agreement is between Fine-tuned RoBERTa and Fine-tuned Phi2 at 90.84%, indicating strong consistency between these two fine-tuned models.
Fine-tuned models tend to agree more with each other, demonstrating the effectiveness of fine-tuning in aligning sentiment predictions. In contrast, Azure and Phi2 show low agreement, highlighting variability in their sentiment analysis approaches.
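Pairwise agreement itself is simply the share of records on which two models assign the same label; a short sketch, again reusing the prediction columns from above:

```python
from itertools import combinations

# Compare every pair of models on the fraction of identical labels.
for a, b in combinations(models, 2):
    agreement = (df[a] == df[b]).mean()
    print(f"{a} vs {b}: {agreement:.2%}")
```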
6. Summary
The analysis strongly supports recommending that my clients prefer custom Transformer models or LLMs for sentiment analysis. These models, particularly when fine-tuned, demonstrate superior accuracy and consistency. Fine-tuning is feasible with as few as 20,000 training examples, allowing for tailored solutions that align closely with specific data needs. This approach not only enhances performance but also provides flexibility and precision in sentiment classification, making it a valuable investment for more accurate insights.
For those aiming to maximize F1 scores, building a custom model is the optimal path forward. In the next part of this series, I'll explore how to fine-tune these models to better meet your specific requirements (link here).