The Goofus & Gallant Story Corpus for Practical Value Alignment

Md Sultan Al Nahian Institute for Biomedical Informatics
University of Kentucky
Lexington, KY, USA
sa.nahian@uky.edu Tasmia Tasrin Department of Computer Science
University of Kentucky
Lexington, KY, USA
tta245@uky.edu Spencer Frazier School of Interactive Computing
Georgia Institute of Technology
Atlanta, GA, USA
sfrazier7@gatech.edu Mark Riedl School of Interactive Computing
Georgia Institute of Technology
Atlanta, GA, USA
riedl@cc.gatech.edu Brent Harrison Department of Computer Science
University of Kentucky
Lexington,KY, USA
harrison@cs.uky.edu

Abstract

Values or principles are key elements of human society that influence people to behave and function according to an accepted standard set of social rules to maintain social order. As AI systems are becoming ubiquitous in human society, it is a major concern that they could violate these norms or values and potentially cause harm. Thus, to prevent intentional or unintentional harm, AI systems are expected to take actions that align with these principles. Training systems to exhibit this type of behavior is difficult and often requires a specialized dataset. This work presents a multi-modal dataset illustrating normative and non-normative behavior in real-life situations described through natural language and artistic images. This training set contains curated sets of images that are designed to teach young children about social principles. We argue that this is an ideal dataset to use for training socially normative agents given this fact.

Index Terms:

Machine Learning, Machine Ethics, Natural Language Processing

I Introduction

As autonomous systems grow in sophistication and capabilities, questions begin to arise about their ability to integrate into society. Rightfully, scientists have begun to question whether these systems can coexist with humans safely, given that agents and humans have different priorities on how they complete tasks. This difference in how humans and autonomous systems make decisions could potentially lead to harm, be it intentional or unintentional [1].

To mitigate the capability for autonomous systems to cause harm to humans, there has been an increased interest in value alignment. Value alignment is a property of an intelligent agent indicating that it can only pursue goals and activities that are beneficial to humans [2, 3, 4]. Russell [5] and Moor [6] have professed the importance of value alignment in creating autonomous agents that can safely coexist with humans. A value-aligned system makes decisions that align with human decisions in similar situations. Ideally, these decisions are unlikely to directly or indirectly cause harm.

While there have been many proposed approaches [7, 8] for creating value-aligned systems, there are some issues that repeatedly emerge. It has been difficult to decide what exactly constitutes a value for the purposes of training an autonomous system and where one would expect to find data concerning this type of information. This has made it difficult to train value-aligned systems in practice.

Recently, there has been increased acknowledgment of stories as a potential source of valuable information. Stories are one of the primary ways that humans communicate with one another. They also provide a means for humans to convey complex information to each other. In addition, stories are one of the primary ways that humans learn societal values and principles. One needs to look no further than children’s literature to observe this. Many stories written for children are designed to teach concepts such as how the world works, societal principles, and the basics of social interaction. This interest in stories as a source of value information has led to the curation of several story corpora for use in value extraction [9, 10, 11].

These datasets are often very large and curated by methods such as scraping large amounts of text data from the internet or using crowdsourcing to gather a set of stories. While these methods are effective at generating a large amount of data, there are limitations that must be considered. When datasets are curated via web scraping, they are susceptible to biases present in online communities. Crowdsourced data has similar limitations. It must also be noted that often this data is generated by non-experts, which could limit the overall effectiveness of the dataset.

To address these limitations, we introduce the Goofus & Gallant story corpus, a dataset comprised of story data specifically designed to teach social principles to young children. This dataset consists of Goofus & Gallant comic strips from 1995 to 2017 (see Figure 1). Each comic is accompanied by short text descriptions or quotes that describe the action that is occurring in each comic along with image that visually illustrate the scene of the action. Goofus & Gallant is a comic strip published by Highlights magazine that is designed to teach young children how to behave in various social situations by presenting two contrasting examples of behavior. Given the nature of these comics, we feel that they present a viable alternative to large story corpora in that they are meant to provide targeted instruction on societal principles in values. We believe that this makes these stories a viable training source for autonomous agents.

While we feel that this dataset has the potential to aid researchers in understanding human values and addressing the value alignment problem, it is not without its issues. First, it is a small dataset. As Goofus & Gallant is published once per month in Highlights magazine, we have twenty years of comic strips, that still contains less than 1000 comics. In addition, the comics themselves only depict behavior as being normative or non-normative. This is a rather coarse representation of societal values, which may make it difficult to use in value alignment. The first issue we feel is potentially offset by the targeted nature of the dataset. Each comic strip was written to teach a specific lesson, which strengthens the signal present in the dataset, especially when compared against larger and potentially noisier datasets. Finally, to address the second limitation, we chose to augment the dataset with more detailed value information. We elicit crowd workers to annotate Goofus & Gallant comics according to a taxonomy of social principles derived from [12]. This enables researchers to explore in more detail how these stories attempt to teach specific values.

In summary, this paper makes the following contributions: 1) We introduce the Goofus & Gallant story corpus for practical value alignment, containing twenty years of Goofus & Gallant comics. 2) We augment the dataset with additional value information based on the Kiesel et.al. value taxonomy [12]. 3) We present intelligent baselines on two example tasks to show how the datasat may be used to learn value information for use in value alignment.

Refer to caption — Figure 1: A modern example of Goofus & Gallant

II Related Works

AI’s actions are expected to be aligned with humans’ actions in similar situations and adhere to human society’s values and interests. This is called AI value alignment, a property of AI that ensures that AI can only pursue goals and activities that are beneficial and non-harmful to human society [2, 3, 4]. Unfortunately, aligning AI with human values is difficult [13] as human values are imbued implicitly in society that also varies with different scenarios, and there are infinitely many scenarios in the open world. Thus, it is difficult to directly specify what comprises values, and delineating them is outside the scope of this work.

There are various works in NLP that instilled moral judgments in specific narrow domains. For instance, hate speech and cyberbullying detection [14, 15], detecting suspicious posts on social networks [16] or fairness and biases [17, 18]. While these works focused on specialized domains, our dataset comprises more generalized moral concepts over a widespread spectrum of people’s everyday real-life scenarios.

Moral Stories [9] is one of the most recent works in moral reasoning on social scenarios. It has 12k short, structured stories that were crowdsourced from human authors. Each story has seven sentences describing seven categories, including a situation, action, and its consequence. The norms in Moral Stories are descriptive sentences and very specific to scenarios. In contrast, in our dataset, the norms are more general and binned to a finite set, which is more tractable to train Value-Aligned agents. Some other works on representing social norms and values over everyday situations in natural language are SCRUPLES [19], ETHICS [10] and SOCIAL CHEMISTRY 101 [20]. SCRUPLES collected 32k real-life anecdotes with normative judgments from a subreddit forum to construct the dataset, where [20] is a larger corpus on social norms and moral judgment consisting of 292k annotated situations.

Delphi [21] has combined the corpora from ETHICS [10], Moral Stories [9], SOCIAL CHEMISTRY [20] and SOCIAL BIAS INFERENCE CORPUS [18] and created a unified dataset on social norms and ethics called the COMMONSENSE NORM BANK. Trained on this unified dataset, Delphi is able to make moral judgments in real-life situations. However, it is unable to provide the notion of social norms or principles based on which the judgment was made. Moreover, the compiled corpus used in Delphi was originally collected from online forums such as Reddit. Thus, it may contain an inappropriate and biased set of examples, which can lead to improper and biased moral decisions by the model trained with this data. In contrast, we present the corpus collected from children’s stories, which are deliberately meant to teach social norms to children. We further curated the corpus by including only recent stories, ensuring the dataset quality and its alignment with recent societal norms. Moreover, the dataset is multimodal, with an image associated with each natural language story example.

III Dataset

To facilitate Value Alignment, we built a multimodal dataset, the Goofus & Gallant story corpus, illustrating social values through natural language texts and images. The dataset contains illustrations of social behaviors labeled as normative or non-normative and provides the inherent social principles or values of these behaviors as well. Thus, the dataset consists of two sub-datasets: the GnG Normative dataset, which describes normativity, and the GnG Principles Dataset, which describes the underlying social principles of involved in each example. We have constructed this dataset using a Children’s comic strip named Goofus & Gallant. In this section, we discuss the methodology employed to create these two dataset components in detail.

III-A GnG Normative Dataset

The dataset, built for the aid of the value alignment task, is based on a Children’s comic strip named Goofus & Gallant. The Goofus & Gallant comic strip is published by Highlights Magazine and began its run in 1946. It is meant to convey societal values to young children by providing them with examples of desirable and undesirable behaviors. These behaviors were depicted using the two titular main characters of the strip: Goofus and Gallant. Goofus and Gallant are young boys who have specific character traits. Goofus is a child that typically performs undesirable actions. The implication is that the behaviors that Goofus performs are not meant to be emulated. In contrast, Gallant is a young boy who typically performs desirable, socially acceptable actions that are meant to be emulated. This setup provided us with an automatically labeled corpus where all actions done by Gallant are labeled as normative, and all actions done by Goofus are labeled as non-normative. We named this dataset the GnG Normative dataset.

The advantage of Goofus & Gallant comic is that it has both image and text for each strip demonstrating an action in a social scenario. This allows us to utilize both visual and textual information to identify societal values. To better ensure that the machine learning models can learn relevant social values, we collected the most recent strips from 1995-2017. We extracted the text from each strip’s panel, but as the older images are of a lower visual quality than the newer images, we only included images from 2001 to 2017. This provided us with 1387 texts and 819 images.

As a result, we created two versions of this dataset: GnG text-only and GnG multi-modal normative dataset. The text-only version contains 1,387 texts collected from the comic panels. The multi-modal version includes the 819 images along with their corresponding texts from the comic panels. Additionally, we removed any explicit references to Goofus and Gallant by replacing their names with pronouns such as ”he” or ”they”.

III-B GnG Principles Dataset

The GnG normative dataset we created serves as the corpus for categorizing actions as normative or non-normative. It does not have any identifying information that expands on what principle or value is contained in each comic. But along with labeling actions as socially acceptable/unacceptable, it is crucial to know which social norms or principles are violated or upheld by these actions. To address this, we extended the GnG normative dataset by annotating Goofus and Gallant’s actions with relevant social values or principles, creating the GnG Principle dataset.

Before beginning the annotation task, it is essential to determine how annotators will provide the social principle information. We considered two approaches: 1) Writing the principles in a free-form text and 2) Selecting from a given set of social principles. The free-form texts will make the unique set of principles very diverse, which is difficult to generalize. Therefore, to make the principle classification problem more tractable, we decided to restrict the choice of principles to a finite set. For this purpose, we utilized the system introduced in [12] to define the set of “social principles”. In their work, Kiesel et al. proposed a value taxonomy with 54 values, which are both relevant and supported by social science research. We further downsized the number of values to better align it with the action description of Goofus & Gallant corpus. We ran the pre-trained value model developed by Kiesel et al. on the Goofus & Gallant dataset to get the zero-shot value prediction on the text descriptions of the actions in the corpus. This experiment gave us 27 social values aligned with the Goofus & Gallant texts. We considered these 27 social values to be the set of predefined principles that would label each action of Goofus and Gallant.

For this annotation task, we utilized both the text descriptions of the actions and their corresponding images. Therefore, we only annotated the instances from the GnG multi-modal dataset where images are available for the corresponding textual descriptions of the actions. We ran two data collection processes to curate this dataset: 1) using crowdsource workers and 2) utilizing Large Language Models (LLMs). In the following, we describe each process in detail.

III-B1 Data Curation–Using Crowdsource Workers

In our first attempt to build the GnG Principles dataset, we used crowdsource workers to annotate the principles. We recruited annotators exclusively from English-speaking countries. In the task, the annotators were given image-text pairs from the GnG multi-modal dataset and a fixed set of social principles curated through the process we discussed earlier. The images provided a visual illustration of the scene, while the associated text described the action performed in the scene. For each given data item, the annotators were required to provide the three most representative principles from the given principles list that were upheld or violated by the action demonstrated in the corresponding image-text pair. Each data item was annotated by three different annotators, with each annotator labeling eight items from the corpus. We recruited 150 crowd workers and annotated 400 examples, which is approximately half of the total number of examples.

After collecting responses from the annotators, we computed the frequency of each principle selected by the annotators for every data instance. The principle that appeared most frequently for a given instance was chosen as the final label, ensuring that the most commonly agreed-upon principle was used. To assess the quality of the collected annotation, we computed the inter-annotator agreement using the Fleiss-Kappa score [22]. The resulting kappa score was 0.54, indicating a moderate level of agreement among the annotators. To understand the reason behind this moderate agreement, we investigated further and found that very often, different annotators selected varying principles for the same example. This variability arose because social principles can have different meanings to different individuals and the same action can be aligned with multiple principles as well.

This variability made the class labels of the dataset sparse, which is usually difficult for machine learning models to learn from. To address this challenge, we sought an alternative approach to curate the dataset. We explored utilizing the capabilities of Generative AI, which is our second attempt to annotate the data, aiming to provide a more consistent and robust set of annotations.

III-B2 Data Curation–Utilizing Large Language Models

To create a more comprehensive principles dataset, in this process we utilized the capabilities of LLMs to annotate the Goofus and Gallant actions with human values. Specifically, we used the API provided by OpenAI to access the pre-trained LLMs. Since LLMs support text-only prompts, we included the textual description of the images instead of the actual images. To collect these textual descriptions, we involved crowdsource workers. Therefore, the entire principle annotation process consisted of three steps: first, collecting detailed descriptions of the scene illustrated in the images; second, using LLMs to predict principle labels for each data point, and finally, verifying the correctness of the responses of LLMs by humans. The details of each step are described below.

Collecting Scene Descriptions

In this step, we collected detailed descriptions for the images in the GnG multi-modal dataset. To ensure high-quality descriptions, we employed human annotators to write them. We provided the annotators with detailed instructions outlining the criteria for writing the descriptions.

The instructions specified that descriptions should be written in simple declarative sentences and needed to include the state of the different characters in the image and their interactions, as well as their facial expressions or gaze. Additionally, the descriptions were required to detail the objects in the environment and their properties, as well as the actions taking place. This structured approach ensured that the descriptions were comprehensive and consistent, providing a rich textual representation of the images for further annotation. Figure 2 shows an example of the collected response for an image.

For each image, we employed two annotators. The first annotator created the initial description based on the provided criteria. Afterward, the second annotator reviewed the description, adding any missing information if necessary to ensure all relevant details were captured. This two-step process improved the quality and comprehensiveness of the scene descriptions.

Annotating Principles using LLMs

After collecting the scene descriptions, we applied the zero-shot prompting technique of Large Language Models (LLMs) to predict the principles upheld or violated by the actions of Goofus and Gallant. We used the state-of-the-art GPT-4o model [23] from OpenAI. We provided three inputs in the prompt to the GPT model: the scene description, action description and compliance information. The compliance information indicates whether the action follows or violates social norms or principles, labeled as either followed or violated. If the comic strip text features Gallant, the compliance information is set as followed, and for Goofus, it is violated.

We applied the chat format for the prompt, where we provided an elaborated description of the task, our predefined set of principles along with their definitions, instructions to the GPT model, and the expected output format in the system prompt. Then, in the user prompt, we provided our inputs for the GPT model. We instructed the GPT model to identify at least two principles from the predefined list mentioned in the prompt for the given action. The first principle generated should be the most representative one, and the second generated principle should be the second most. We refer to the first principle as Principle 1 and the second principle as Principle 2. If the action does not align with any of the principles listed, the model can suggest a new principle. Furthermore, the GPT model has been asked to provide detailed explanations analyzing how the selected principles represent the given action. These explanations are generated for each principle assgined by the LLM.

For each example, we queried the GPT model five times. This approach enabled us to determine which principles were predicted most frequently across these queries. When the same principle was predicted multiple times for a given input, it indicates higher confidence from the model in that prediction. Thus, to ensure accurate and reliable annotations, we selected the most frequently predicted principle from these five iterations as the final prediction for Principle 1. If two principles had the same frequencies, then we randomly selected any one of them as the most frequent one. For the Principle 2, we also selected the most frequent one. But if the most frequent one had already been selected as Principle 1, we chose the second most frequent principle for the second category.

Human Verification

After post-processing the GPT responses and finalizing the predictions for principle 1 and principle 2, we evaluated the correctness of these predictions through human review. For each instance, we provided the scene description, action description, compliance information, and the two principles predicted by GPT to human reviewers. The reviewers were asked to select the principle they believed was violated or upheld by the action. If both principles seemed applicable, they were instructed to select both. If neither of the predicted principles was correct, they were to select ”None.”

We initially ran this process on 50 instances of the GnG multi-modal dataset to evaluate how well the responses of LLMs aligned with human judgment. We employed two annotators for each instance, with each annotator reviewing 10 instances, totaling 10 annotators for the initial review. After collecting their responses, we computed their agreement with the LLM predictions and found the agreement rate is 91% between 2 annotators for atleast one principle. This high level of agreement indicated that the GPT-4o predictions were of high quality and aligned with human judgment. Consequently, we extended the data collection process to the remaining instances of the GnG multi-modal dataset. For quality assurance, we randomly selected 100 instances from the extended dataset and reviewed them using the same protocol. We achieved a 93% agreement rate where the annotators agreed with at least one principle and a 61% agreement rate where the annotators agreed with both principles predicted by GPT-4o.

A summary of each dataset used in our experiments can be found in Table I.

TABLE I: Dataset summaries.

Dataset	Modality	Total	Train	Test
GnG Normative	Text only	$1387$	832	555
GnG Normative	Multi-modal	$819$	$655$	$164$
GnG Principles	Text only	$819$	$655$	$164$
GnG Principles	(with scene descriptions)

IV Tasks

We aimed to show that the stories in the Goofus & Gallant dataset contain rich knowledge about socio-cultural norms and values that can be learned by machine learning models. To achieve this, we explored two classification tasks on the Goofus & Gallant dataset. In the first task, we investigated whether we could determine if the action described in the story is socially acceptable or not (normative or non-normative). In the second task, we examined whether we could identify the social principle that is followed or violated by the action described in the story. Formally the two tasks are:

•

Normativity classification
•

Principles classification

IV-A Normativity Classification

In this task, the objective is to classify the actions demonstrated in the Goofus & Gallant comic strips as either normative or non-normative. We seek to show that knowledge of socially normative and non-normative behavior can be identified from naturally occurring stories. To achieve this, we conducted experiments to develop baseline machine learning models that can classify normative behavior from both textual descriptions and visual illustrations.

IV-B Principles Classification

In this task, we aimed to develop systems capable of understanding descriptions of human behavior with respect to normative principles. We defined normative principles as the guidelines that direct individuals to adhere to a society’s collective behavioral rules, such as “be respectful to traditions” or “be responsible”.

For this classification task, we used the GnG Principles dataset, created by augmenting the GnG multi-modal dataset with principles collected using the GPT-4o model and subsequently verified by humans. We trained several machine learning models to predict the inherent social principles underlying the actions described in the text. Usually, social behaviors or actions can comply with or violate multiple social values concurrently. For instance, “Gallant does his studying before watching TV”, can comply both normative social principles like “be responsible” and “be compliant”. Because of this inclusive property of social norms, we evaluated the correctness of the classifiers differently from the standard multi-class classification. In this task, a predicted principle is considered correct if it matches either of the two true principles.

TABLE II: Results of the Baseline models trained and tested on GnG Multi-modal Normative dataset

Modalities	Model	Acc	F1	Precision	Recall	MCC
Image	ViT-CL	70.32	72.94	78.48	68.13	40.94
	ViTForImageClassification	72.26	77.25	74.49	80.22	42.02
Text	BERT-CL	70.97	74.86	76.14	73.63	40.56
	BertForSequenceClassification	72.90	77.17	76.34	78.02	43.87
Image & Text	Dual Encoder	76.77	79.78	81.61	78.02	52.61

V Baselines

We built several classifiers for each of the tasks to determine the best-performing models for these tasks. We made use of transformer-based vision and language models to implement the classifiers. In this section, we discuss the details of these models.

V-A Normativity Classification Models

Using the images and texts of the GnG Normative dataset, we trained multiple binary classifiers capable of classifying events in stories as normative or non-normative. First, we built a model using only images as input to investigate the effectiveness of visual context in determining normativity. Next, we used only texts as input for the model. Finally, we incorporated both text and images as inputs in the model. In this way, we comprehended the implications of different modalities in classifying normativity.

V-A1 Image Only Model

In this model, we used only images as input to classify actions into normative or non-normative classes. We implemented two binary classifiers using the pre-trained Vision Transformer (ViT) [24]: 1) ViT with Custom Layers (ViT-CL) and 2) Pre-trained Image Classifier (ViTForImageClassification). For the first model, we used ViT as the base model and added a projection layer followed by a classification layer on top. The projection layer consists of a linear layer, an activation function, dropout, and layer normalization. The ViT base model provides the embedding of the special [CLS] token, which represents the entire input image. We passed this embedding vector through the projection and classification layers to make the final prediction.

In addition to the pre-trained base ViT, there are off-the-shelf ViT models specifically trained for image classification tasks. For our second classifier, we fine-tuned one of these models, ViTForImageClassification, without adding any additional layers since it already includes a trained classification layer.

V-A2 Text-based Models

The text-based classifiers take sentences as input and predict whether the event described is normative or non-normative. Similar to the image-only model, we implemented two different binary classifiers for the text-only input: 1) BERT [25] with Custom Layers (BERT-CL) and 2) Pre-trained Sequence Classifier (BertForSequenceClassification).

The first classifier we implemented is a transformer-based large language model with a similar projection head and classification layer added on top. The language model provides the contextualized embedding of the input sentence, represented by the embedding vector of a special classification token [CLS]. This embedding vector is then passed through the projection and classification layers to make the prediction.

For the second text-only model, we used a transformer-based pre-trained sequence classification model, BertForSequenceClassification, without adding any additional layers to leverage its capabilities directly.

V-A3 Image and Text model

To investigate how visual and textual information concurrently influence the classification of normative and non-normative actions, we trained a binary classifier that uses both image and text as inputs and implemented a transformer-based dual encoder network for it.

We combined the previously implemented ViT-CL and BERT-CL models using a projection head and a classification layer to build this model. The ViT-CL extracts the embedding vectors for images, while BERT-CL provides the embedding vectors for texts. The projection head aligns both image and text embeddings into the same latent space. Finally, the classification layer, which includes a linear layer followed by a softmax function, makes the prediction. Figure 3 shows the model’s architecture.

V-B Principles classification Models

To build the classifier for principle prediction, we utilized text-based information as the input. We fine-tuned a pre-trained sequence classification model, similar to the one used in the Text-only Normativity Classification Models mentioned previously, to implement the models.

For this task, we used the GnG Principles dataset to train the models. We provided the models with three inputs: the scene description, the action description, and the compliance information. These inputs allow the model to understand the environment of the scene, the specific action taking place, and whether the action complies with or violates social values. The model then predicts the principle that is either followed or violated by the action described in the scene. By leveraging the rich contextual embeddings generated by the transformer-based language models, our approach aims to accurately identify the underlying social principles guiding the actions within the stories.

TABLE III: Comparing the full Text-based models and sentiment analysis model on the test data of GnG text-only Normative Dataset.

Model	Acc	F1-score	Precision	Recall	MCC
BERT	94.05	94.03	94.55	93.53	88.11
DistilBERT	93.69	93.65	94.51	92.81	87.40
Sentiment Analysis	62.46	58.44	66.81	51.94	25.83

TABLE IV: Ablation Studies performed for Principles Classification task. Bold and underlined numbers show the best scores for the relaxed and strict metrics, respectively.

Input	Evaluation Type	Acc	F1-score	Precision	Recall	MCC
Scene +	Strict	52.26	16.84	18.06	18.13	35.94
Action	Relax	74.84	32.76	32.64	34.04	63.72
Action	Strict	52.90	19.15	20.21	19.57	37.00
Action	Relax	74.84	36.06	35.18	37.20	63.53

VI Experiments and Results

To show the effectiveness of our GnG Normative and GnG Principles dataset, we conducted two sets of experiments: 1) Normativity Classification experiments and 2) Principles Classification experiments. We used accuracy, precision, recall, f1-score and matthews correlation coefficient (MCC) as the evaluation metrics to assess classification quality.

VI-A Experiment 1: Normativity Classification

In this experiment, we seek to identify how well the machine learning models can classify normative and non-normative behavior from unseen Goofus and Gallant texts and images when trained on the GnG training set. It will allow us to understand how well the machine learning models can extract value information from the GnG normative dataset.

We conducted ablation studies to investigate the input modalities that contribute the most to classifying normative behavior and the best-performing machine learning models on these modalities. The results of this experiment are shown in the Table II. It shows the performance of different models on classifying normative behavior from different input modalities. All models in this experiment were trained and tested on the GnG multi-modal normative dataset. Additionally, we used the full GnG text-only normative dataset to train text-only models.

From the table II, both the image and text-only models achieve test accuracies greater than 70%, with the MCC exceeding 40. This indicates that both Goofus & Gallant images and texts contain value information that can be extracted by state-of-the-art machine learning models. Combining images and text in the input significantly improves the performance of predictions. The Dual Encoder, which combines ViT-CL and BERT-CL to process both image and text inputs, outperforms all other models that use a single modality.

The table shows that the fine-tuned ViTForImageClassification model slightly outperforms the custom ViT-CL model. A similar trend is observed in the text-only models as well, where BertForSequenceClassification performs better than the BERT-CL. One potential reason could be that the classification layers of the ViTForImageClassification and BertForSequenceClassification models have already been pre-trained with ample data, whereas the additional custom layers in ViT-CL and BERT-CL were only trained on data from the GnG dataset.

The models discussed above were trained exclusively using the images and texts from the GnG multi-modal normative dataset. Given the availability of additional text data in our GnG text-only normative dataset, we trained another set of text-only models on this complete dataset to further explore the effectiveness of our text-based data in classifying normativity from action descriptions. Table III displays the performance of these models. Both the BERT and DistillBERT models achieve accuracies and f1-score exceeding 93%, which further validates the quality of the Goofus & Gallant dataset

Finally, we ran a pre-trained sentiment analysis model on the GnG text-only test data to inspect if the sentiment analysis and classification of normative behaviors tasks share similar observations. We use the transformer-based pre-trained model distilbert-base-uncased-finetuned-sst-2-english [26] from Hugging Face for the sentiment analysis, which was trained on Stanford Sentiment Treebank [27] corpora and is one of the state-of-the-art models in this task. The experiment shows that the pre-trained sentiment analysis model struggles to detect normative/non-normative behaviors from the texts. It indicates the two tasks are not analogous to one another and require two different types of data to work with.

VI-B Experiment 2: Principles Classification

Principles classification is a multi-class classification task that aims to label each example of GnG Principles dataset with the two most representative principles that best identify social values illustrated in the action text. Since the dataset is textual, we used the BertForSequenceClassification model, which we fine-tuned and tested on the dataset. We performed 2 ablation studies based on the inputs we used. One study employed both scene and action descriptions, while the other used only action descriptions. Table IV shares the results of the experiment done for the classification task. As the compliance information was the common input for both studies, we did not add it to the table.

Recall that each example in the GnG principles dataset has two target principles: Principle 1 (the most representative) and Principle 2 (the second most representative). Since each example has two correct answers, we evaluated the model’s performance using a slightly different approach than the traditional method. We decided the correct prediction in two separate ways: 1. Strict - the predicted principle is correct if it matches Principle 1 (the most representative principle). 2. Relax - the predicted principle is correct if it matches either of the two true principles.

Table IV shows the results of this experiment. We trained two models based on the input information. In the first model, we included both scene and action descriptions along with compliance information. In the second model, we provided only the action description with compliance information. This allowed us to investigate the influence of scene descriptions on predicting the principle of the action and to determine if the model could identify the principle information from the action description alone without the scene description. From the results, we can see that the classification model using only action descriptions performed better in both the strict and relaxed metrics.

VII Discussion

In this study, we present a multi-modal dataset to facilitate value alignment. Our comprehensive experiments demonstrate the effectiveness of this dataset. From the first experiment, we observe that the normativity classification models can effectively distinguish between normative and non-normative actions of the GnG dataset. However, some instances are difficult to classify. For example, “I think I broke your camera, Dad. I’m sorry, says he”. In this scenario, the boy’s action of breaking the camera is not normative, but his apology instead of hiding the matter is commendable. In such instances, our model struggles, as these are particularly difficult examples.

In our experiments of principles classification, we observe that the classification model using only the action description as input performed better than the model using both action and scene descriptions as input. This is consistent for both strict and relaxed evaluation measures. We hypothesize that the scene descriptions are much longer than the action descriptions, providing a lot of additional information that may not always be directly relevant to identifying the principles. This extraneous information can overwhelm the model, making it harder to focus on the important information necessary for accurate classification.

VIII Conclusion

While value alignment is a noble goal, many subtleties surrounding values must be addressed before any practical alignment can be achieved. While there has been much research on utilizing large, crowdsourced corpora for training value models, these have often had less-than-desirable performance in practice. We propose that the solution to this is a high-quality, well-curated dataset designed specifically for conveying value information. To that end, we propose the Goofus & Gallant Story Corpus. This corpus contains a set of three well-curated datasets composed of images and texts to aid in practical value alignment.

We also present the performance of baseline models on a set of example tasks. These baselines show that the Goofus & Gallant Story Corpus can be used to perform value alignment tasks. It also shows that there is still room for improvement in using the dataset for these tasks, which we hope will encourage researchers to further investigate how this type of data can be used to augment current value alignment systems.

Acknowledgements

We would like to thank Highlights Magazine for allowing us to use and release the Goofus and Gallant dataset.

References

[1] D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan, “Cooperative inverse reinforcement learning,” Advances in neural information processing systems, vol. 29, 2016.
[2] N. Soares and B. Fallenstein, “Aligning superintelligence with human interests: A technical research agenda,” Machine Intelligence Research Institute technical report, vol. 8, 2014.
[3] S. Russell, D. Dewey, and M. Tegmark, “Research priorities for robust and beneficial artificial intelligence,” Ai Magazine, vol. 36, no. 4, pp. 105–114, 2015.
[4] T. Arnold, D. Kasenberg, and M. Scheutz, “Value alignment or misalignment - what will keep systems accountable?” in AAAI Workshop: AI, Ethics, and Society, 2017.
[5] S. J. Russell, Human Compatible: Artificial Intelligence and the Problem of Control. Viking (October 8, 2019), 2019.
[6] J. H. Moor, “The nature, importance, and difficulty of machine ethics,” IEEE intelligent systems, vol. 21, no. 4, pp. 18–21, 2006.
[7] M. Wulfmeier, “Efficient supervision for robot learning via imitation, simulation, and adaptation,” KI - Künstliche Intelligenz, pp. 1–5, 2019.
[8] B. C. Stadie, P. Abbeel, and I. Sutskever, “Third-person imitation learning,” in 5th International Conference on Learning Representations, ICLR 2017, 2017.
[9] D. Emelin, R. Le Bras, J. D. Hwang, M. Forbes, and Y. Choi, “Moral stories: Situated reasoning about norms, intents, actions, and their consequences,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 698–718.
[10] D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt, “Aligning AI with shared human values,” CoRR, vol. abs/2008.02275, 2020.
[11] L. Jiang, J. D. Hwang, C. Bhagavatula, R. L. Bras, J. Liang, J. Dodge, K. Sakaguchi, M. Forbes, J. Borchardt, S. Gabriel et al., “Can machines learn morality? the delphi experiment,” arXiv preprint arXiv:2110.07574, 2021.
[12] J. Kiesel, M. Alshomary, N. Handke, X. Cai, H. Wachsmuth, and B. Stein, “Identifying the human values behind arguments,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 4459–4471.
[13] N. Soares, “The value learning problem,” Machine Intelligence Research Institute, Berkley, 2015.
[14] A. Schmidt and M. Wiegand, “A survey on hate speech detection using natural language processing,” in Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media. Valencia, Spain: Association for Computational Linguistics, Apr. 2017, pp. 1–10.
[15] C. Van Hee, E. Lefever, B. Verhoeven, J. Mennes, B. Desmet, G. De Pauw, W. Daelemans, and V. Hoste, “Detection and fine-grained classification of cyberbullying events,” in Proceedings of the International Conference Recent Advances in Natural Language Processing. Hissar, Bulgaria: INCOMA Ltd. Shoumen, BULGARIA, Sep. 2015, pp. 672–680.
[16] S. Volkova, K. J. Shaffer, J. Y. Jang, and N. O. Hodas, “Separating facts from fiction: Linguistic models to classify suspicious and trusted news posts on twitter,” 7 2017. [Online]. Available: https://www.osti.gov/biblio/1373869
[17] T. Bolukbasi, K.-W. Chang, J. Zou, V. Saligrama, and A. Kalai, “Man is to computer programmer as woman is to homemaker? debiasing word embeddings,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16. Red Hook, NY, USA: Curran Associates Inc., 2016, p. 4356–4364.
[18] M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, and Y. Choi, “Social bias frames: Reasoning about social and power implications of language,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 5477–5490.
[19] N. Lourie, R. L. Bras, and Y. Choi, “Scruples: A corpus of community ethical judgments on 32, 000 real-life anecdotes,” CoRR, vol. abs/2008.09094, 2020. [Online]. Available: https://arxiv.org/abs/2008.09094
[20] M. Forbes, J. D. Hwang, V. Shwartz, M. Sap, and Y. Choi, “Social chemistry 101: Learning to reason about social and moral norms,” CoRR, vol. abs/2011.00620, 2020. [Online]. Available: https://arxiv.org/abs/2011.00620
[21] L. Jiang, J. D. Hwang, C. Bhagavatula, R. L. Bras, M. Forbes, J. Borchardt, J. Liang, O. Etzioni, M. Sap, and Y. Choi, “Delphi: Towards machine ethics and norms,” 2021.
[22] J. L. Fleiss, “Measuring nominal scale agreement among many raters.” Psychological bulletin, vol. 76, no. 5, p. 378, 1971.
[23] OpenAI, “Gpt-4: Technical report,” 2024. [Online]. Available: https://openai.com/index/hello-gpt-4o/
[24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” CoRR, vol. abs/2010.11929, 2020. [Online]. Available: https://arxiv.org/abs/2010.11929
[25] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423
[26] “Distilbert base uncased finetuned sst-2,” https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english.
[27] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, Washington, USA: Association for Computational Linguistics, Oct. 2013, pp. 1631–1642.