Are AI Algorithms Playing Fairly with Age, Gender, and Skin Color?

Facebook researchers release a dataset intended to help machine learning developers test algorithms for bias


Facebook's "Casual Conversations" data set consists of 45,186 videos of participants having non-scripted conversations
Images: Facebook

Does an algorithm treat people of different ages, genders, and skin colors equally, even under different lighting conditions? Facebook’s AI Red Team today released a data set—called Casual Conversations—for use in answering that question. The ten terabytes of data consist of videos recorded by 3011 participants; the data set comprises approximately 15 one-minute segments per person, for more than 45,000 total minutes. The videos are tagged with age and gender as self-reported by each participant, by skin color as determined by trained annotators using a standard scale, and by lighting conditions, also as determined by annotators.

The Facebook AI Red Team’s research manager, Cristian Canton, gave me a simple example of how the dataset could be used by developers.

“Consider the Portal device,” he says. (Portal is Facebook’s $150 tabletop smart screen.) “We have a camera in it that tracks people. If I were an engineer building that technology today, to make sure it is inclusive, I could take the Casual Conversations data set, run it through the tracking algorithm in the Portal, and measure where it doesn’t perform well. Say, you might find that for a person of a given age, color, or gender in low light, it doesn’t work. Then I would know that my algorithm has a deficiency for a specific subgroup.”
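The workflow Canton describes—run a model over the annotated clips, then break its accuracy down along each labeled axis—can be sketched in a few lines. The field names, label values, and the per-clip `correct` flag below are illustrative assumptions, not Facebook's actual API or annotation schema:

```python
# Hypothetical sketch of per-subgroup evaluation: tally a model's
# hit rate separately along each Casual Conversations label axis.
# Field names and label values here are invented for illustration.
from collections import defaultdict

AXES = ("age_group", "gender", "skin_tone", "lighting")

def subgroup_accuracy(results):
    """results: iterable of dicts with the AXES keys plus a boolean
    'correct' saying whether the model handled that clip."""
    totals = defaultdict(lambda: [0, 0])  # (num correct, num total)
    for r in results:
        for axis in AXES:
            key = (axis, r[axis])
            totals[key][0] += int(r["correct"])
            totals[key][1] += 1
    return {key: c / n for key, (c, n) in totals.items()}

# Toy example: a tracker that fails on a dimly lit clip
results = [
    {"age_group": "18-30", "gender": "female", "skin_tone": "type-6",
     "lighting": "dim", "correct": False},
    {"age_group": "18-30", "gender": "female", "skin_tone": "type-2",
     "lighting": "bright", "correct": True},
]
acc = subgroup_accuracy(results)
print(acc[("lighting", "dim")])     # 0.0 -- a deficiency surfaces here
print(acc[("lighting", "bright")])  # 1.0
```

A gap like the one between the two lighting buckets is exactly the kind of deficiency Canton says the data set is meant to surface.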

Facebook’s researchers experimented with the dataset by testing it on the top five winners of last year’s Deepfake Detection Challenge, a competition to develop tools designed to automatically spot fraudulent media. In a research paper and blog post released today, they reported that, while all five algorithms struggled with darker skin tones, the model that performed most consistently across the dimensions of age, gender, and lighting conditions was not that of first-place winner Selim Seferbekov, but rather that of the third-place team, NTechLab. The fourth-place team, Eighteen Years Old, turned out ironically to be best at analyzing videos of subjects in the oldest cohort, above age 45.
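The paper's exact scoring method isn't described here, but one simple way to quantify "performed most consistently across subgroups" is the spread between a model's best- and worst-served subgroup; under that illustrative metric, a slightly less accurate but more even model can rank higher, as with NTechLab:

```python
# Illustrative only: score "consistency" as the gap between the
# best and worst per-subgroup accuracy. The numbers below are made
# up; the Facebook paper's actual metric may differ.
def consistency_gap(per_subgroup_acc):
    vals = list(per_subgroup_acc.values())
    return max(vals) - min(vals)

model_a = {"18-30": 0.91, "31-45": 0.90, "46+": 0.70}  # accurate but uneven
model_b = {"18-30": 0.84, "31-45": 0.83, "46+": 0.82}  # lower but consistent

print(round(consistency_gap(model_a), 2))  # 0.21
print(round(consistency_gap(model_b), 2))  # 0.02 -- more consistent
```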

Performing evenly across different demographics was not part of the judging criteria for the Deepfake Detection Challenge, as the full Casual Conversations dataset was not yet available.

Said Canton: “If we were to redo the competition today, maybe we would consider looking for a more inclusive approach.”

The Casual Conversations dataset released this week is just the beginning of the work needed to create fairness in AI, Canton says. For one, he points out, the problem is multifaceted, and, while having this kind of data is helpful, it is not a complete solution.

These pie charts show the frequency of the different tags for age, gender, apparent skin tone, and lighting conditions in the 45,186 videos that make up the Casual Conversations data set. Image: Facebook

And as for the dataset development itself, he says, the team is just on “the first step of a long journey. We have identified age, gender, skin tone, and light conditions, but [these videos were] all recorded in the U.S. Maybe if we record in other countries, we will find we need to consider diversity axes that we haven’t yet seen.”

The audio part of the recordings, Canton indicated, also represents untapped potential. The audio files, created by asking subjects to respond to simple conversational prompts like “What is your favorite dish?,” are currently tagged only for age and gender.

“We haven’t annotated the accents yet, but that is a potential avenue for future implementations. We do think there will be some interesting outcomes from the speech part of this. We want to test inclusivity of audio models.”

Canton hopes that releasing this data into the wild will elicit feedback that can be used to make the data set richer and more inclusive. “I would love to see adoption, and then for my colleagues and academics to tell us what they think. We want to be self-critical. With feedback, we can keep improving it. We hope it becomes a standard way to measure AI fairness.”

Canton also hopes this data set’s development will set a new standard. He is proud of the way this data set was created, including the fact that it was responsibly sourced. He stressed a number of times during our conversation that the 3000-plus subjects were paid for their efforts, were made fully aware of how their voice and video images are intended to be used, and can withdraw their consent later if they change their minds about participation.

“We are trying to set the standard for what responsible AI should look like in the future,” he says, adding that the Facebook team wishes “to inspire other people recording data sets. It is important to do the right things and use the right tools.”

