As Washington intensifies its focus on the influence of social media platforms on American life and the democratic process, the heated rhetoric and myriad policy proposals lack one crucial element: data.
For all the concern over “community guidelines,” content moderation, fact-checking and advertising policies, we have few of the actual data points necessary to evaluate how well the companies are doing. Could it be that they get it right most of the time and it is just a few high-profile mistakes that are driving our concern? Conversely, are the companies getting it wrong much more than we — or even they — realize? What are the datasets that Congress should require from social companies in order to help the public better understand the role those platforms play in society today?
On paper, the platforms’ content moderation practices and fact-checking partnerships seem like reasonable solutions to the difficult task of keeping bad actors from disrupting their digital communities. Yet how closely do the companies adhere to these rules in practice? To what degree do the unconscious biases of the companies’ engineers manifest themselves in their algorithms? Would the American public be as supportive of content moderation if they understood the disproportionate ways it can impact certain voices or the unevenness in how the platforms apply their rules? In order to compare rhetoric to reality, we need data that captures the daily functioning of our modern public squares.
Here are 10 datasets that Congress could demand from social media companies that would begin to provide the critical insights needed to understand their roles in our modern democracy and highlight areas that may require further legislative action.
A Database of Violating Tweets. Given that all tweets are publicly viewable and already accessible to researchers using Twitter’s data APIs (application programming interfaces), there would be few privacy implications in requiring Twitter to provide a public database of all tweets the platform flags each day, along with a description of why Twitter believed each tweet was a violation of its rules or disputed by a fact-checker. Such a database would permit at-scale analysis of the kinds of content Twitter’s moderation efforts focus on, while the ability to compare those violating tweets against the rest of Twitter would make it possible to assess how evenhanded the platform’s removal efforts are.
A Database of Journalist & Politician Private Post Violations. Most social platforms, such as Facebook and Instagram, are a mixture of public and private content. Publicly shared content violations could be compiled and disseminated to researchers, as could public tweets, but private content, such as non-public Facebook posts that are deleted or flagged as misinformation, poses unique privacy challenges. One possibility would be to treat the verified official accounts of journalists and elected officials as different from other users, given their outsized role in the public discourse, and to automatically make available to researchers any posts by those accounts that are later deleted as violations of platform rules or disputed by fact-checkers. A separate voluntary submission database could allow ordinary users to submit their own posts that were deemed violations, along with the explanation they received regarding the violation. Having a single centralized database of such removals would make it easier to understand trends in the kinds of content platforms are most heavily policing and whether there is public agreement with the platforms’ decisions.
A Demographic Database of Content Removals. Social platforms use algorithms to estimate myriad demographic characteristics of their users, including race, gender, religion, sexual orientation and other attributes that marketers can use to precisely target their ads. While these attributes are imperfect, the fact that the companies make them available for ad targeting suggests they believe they are sufficiently accurate to build an advertising strategy upon. The companies should be required to compile regular demographic percentage breakdowns of deleted and flagged posts for each of their community guidelines and fact checks. For example, what percentage of “hate speech” posts were ascribed to persons of color or how many “misinformation” posts were by members of a given religious affiliation? Do the companies’ enforcement actions appear to disproportionately impact vulnerable voices?
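To make the idea of a "demographic percentage breakdown" concrete, here is a minimal sketch in Python of how such a report might be tabulated. The record layout (a rule name paired with an inferred demographic group) and the function name are assumptions made purely for illustration; nothing here reflects how any platform actually stores or labels its enforcement data.

```python
from collections import Counter, defaultdict


def enforcement_breakdown(actions):
    """Return {rule: {group: percent of that rule's enforcement actions}}.

    `actions` is an iterable of (rule, group) pairs. Both fields are
    hypothetical: `rule` names a community guideline or fact-check category,
    and `group` is the platform's own inferred demographic label.
    """
    counts_by_rule = defaultdict(Counter)
    for rule, group in actions:
        counts_by_rule[rule][group] += 1

    report = {}
    for rule, counts in counts_by_rule.items():
        total = sum(counts.values())
        # Share of this rule's actions attributed to each group, in percent.
        report[rule] = {
            group: round(100 * n / total, 1) for group, n in counts.items()
        }
    return report


# Illustrative input only; the labels are placeholders, not real data.
sample = [
    ("hate_speech", "group_a"),
    ("hate_speech", "group_a"),
    ("hate_speech", "group_b"),
    ("misinformation", "group_b"),
]
report = enforcement_breakdown(sample)
```

Percentages like these only become meaningful when set against each group's share of all posts on the platform, so a real report would likely need that baseline alongside the enforcement figures.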
A Database of Exempted Posts. A common criticism of content moderation is the unevenness with which it is applied. Why do some users seemingly face constant enforcement action while others posting the exact same material face no consequences? Why is one politician’s post preserved as “newsworthy” while another is removed as a violation? A critical missing component in our understanding of content moderation is the degree to which companies create silent exemptions from their rules. On paper, Facebook prohibits all forms of sexism, racism, bullying and threats of violence, but in practice, the company allows some posts as “humor” or otherwise declines to take action. How often do users report posts that the company determines are not a violation? And does it systematically exempt certain kinds of content? Compiling a central database of posts the companies rule are not violations would offer critical insights into how evenhanded they are and where their enforcement gaps are.
A Database of Deleted & Exempted Protest Posts. Protest marches are increasingly being organized over social media. As platforms extend their censorship to these posts, they are able to control speech that occurs beyond their digital borders. This makes understanding how platforms moderate protest-related speech uniquely important. For weeks Facebook touted its removal of COVID “reopening” protests that did not require social distancing, yet quietly waived those rules for the George Floyd protests. Having a centralized database of protest posts removed by platforms, as well as those exempted from their rules, would go a long way toward understanding how much the platforms are shaping the offline discourse.
Increased Access to Facebook’s Fact-Checking Database. Facebook provides an internal dashboard to fact-checking organizations that lists the posts it believes may be false or misleading. Today, access to that dashboard is extremely limited, but broadening access to policymakers and the academic community as a whole would enable much closer scrutiny of the kinds of material Facebook is focusing on. Given that the company already shares this content with its fact-checking partners, there would be fewer privacy implications to broadening that access to a wider pool of researchers.
A Database of Fact-Checked Posts. What are the kinds of posts that social platforms delete or flag as having been disputed by fact-checking organizations? Are climate change posts flagged more often than immigration posts? How are platforms managing the constantly changing guidelines for COVID-19, where just a few months ago posts recommending masks would have in theory been a violation of the platforms’ “misinformation” rules governing health information that goes against CDC guidance? How often are posts flagged based on questionable ratings or potentially conflicted sources?
In an ideal world, platforms would be required to compile a database of every post they flag as being disputed by a fact-checker. For public posts such as those on Twitter, this could be possible, but for platforms like Facebook, this would pose a privacy challenge. One possibility would be to require platforms such as Facebook to provide a daily report listing the URL of every fact check they relied upon to flag a user post that day, along with how many posts were flagged based on that fact check. For example, of all of the climate change fact checks published over the years, which are the ones that yield the most takedowns on social platforms? Do the most heavily cited fact checks rely on the same sources of “truth” as other fact checks on that topic or is a particular source, such as an academic “expert,” having an outsized influence on “truth” on social platforms? Such data would also help fact-checkers periodically review their most-cited fact checks to verify that their findings still hold, while during a pandemic, public health officials could use it to flag emerging contested narratives.
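The daily report described above would be privacy-preserving because it contains only fact-check URLs and counts, never the flagged posts themselves. A rough sketch of that aggregation in Python might look like the following. The `FlagEvent` record, its field names and the example.org URLs are all hypothetical, chosen only to illustrate the shape of the report, and do not describe any platform's real data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlagEvent:
    """One user post flagged on a given day, citing one fact check.

    Hypothetical schema: no post text or user identity is carried along.
    """
    flagged_on: date
    fact_check_url: str


def daily_report(events, day):
    """Return (fact_check_url, posts_flagged) pairs for `day`, most-cited first."""
    counts = Counter(e.fact_check_url for e in events if e.flagged_on == day)
    return counts.most_common()


# Illustrative input only; the URLs are placeholders.
events = [
    FlagEvent(date(2020, 6, 1), "https://example.org/fact-check-a"),
    FlagEvent(date(2020, 6, 1), "https://example.org/fact-check-b"),
    FlagEvent(date(2020, 6, 1), "https://example.org/fact-check-a"),
    FlagEvent(date(2020, 6, 2), "https://example.org/fact-check-b"),
]
report = daily_report(events, date(2020, 6, 1))
```

Summing such reports over months would show which individual fact checks account for the most flags, which is the question the proposal is meant to answer.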
Increased Access to Facebook Research Datasets. Through academic partnerships and programs like Social Science One, Facebook permits large-scale research on its 2 billion users, from manipulating their emotions to linking datasets to more in-depth analyses of the flow of information across its platform. Researchers from across the world have been given access to study misinformation and sharing on Facebook, and a closer look at the projects approved to date suggests the kinds of access they have been granted would also support work on understanding the biases of Facebook’s own moderation practices.
Algorithmic Trending Datasets. The power of algorithms to shape our awareness of events around us was driven home in 2014 when Twitter chronicled the unrest in Ferguson, Mo., while Facebook was filled with the smiling faces of people dumping buckets of ice water over their heads. A public dataset capturing how public posts are being prioritized or deemphasized by these algorithms across classes of users and over time would provide insights into inadvertent biases in these algorithms and provide greater visibility into what the public is and is not seeing.
The Legal System. Many of the “community guidelines” enforced by social platforms cover conduct that is, at least on paper, also a violation of U.S. law, including libel, harassment and threats of violence. How often do social media companies or the recipients of such posts refer them to law enforcement, and what were the outcomes of those cases? If few such posts are ever referred to law enforcement, why do social platforms believe harassment and threats of violence should not be reported to officials if they believe they are dangerous enough to warrant removal from their platforms? Tracking cases where posts were referred to law enforcement and the resulting legal decisions would shed light on how closely social media platforms’ interpretations of U.S. laws adhere to reality.
In the end, we lack the necessary data to determine what kinds of regulation are required for social platforms. The datasets above would give policymakers and researchers critical building blocks upon which to begin understanding Silicon Valley’s influence over democracy itself.