A solution for greater algorithmic fairness in digital ranking systems - Naver Labs Europe

A new greedy, brute-force solution improves fairness in ranking and encompasses realistic scenarios with multiple, unknown protected groups.

blank

Ranking systems are omnipresent in our digital daily lives, from search engines that return a list of relevant web pages, to e-commerce services that recommend items to buy, to social media platforms that suggest friends. The impact of such systems is often illustrated in the case of job applications where the candidates become the ‘data points’ to be ranked and where they’re subject to potential biases in the system. The biases found in ranking systems generally originate either from the data used by the (often, machine-learning-based) approach or from the ranking algorithm itself. The harm resulting from the ranking algorithm bias is generally related to the notion of ‘exposure’ inherent in ranking systems and its unfair distribution across ranked items. Exposure simply denotes the amount of attention one allocates to the different items in the ranking list. Naturally, exposure is not evenly distributed across items: items at the top of the ranking tend to receive far more attention than items at the bottom. For example, a search engine user usually settles for the first few results and rarely scrolls down to the bottom of the page.

Due to the potential unfairness stemming from algorithmic bias, researchers have recently been studying ways to balance the exposure allocated to ranking items while preserving the quality of the ranking—giving rise to the so-called ‘fair ranking’ problem. Most studies in the fair ranking literature (1, 2, 3, 4, 5) assume that the items to rank pertain to a ‘protected group’. In the case of job candidate ranking, one could for example consider two protected groups as the set of ‘male’ candidates and the set of ‘female’ candidates. The protected group membership is then utilized to enforce fairness, for example to fairly rank men and women for the job offer. In the widely adopted merit-based definition of fairness, a fair ranking system provides exposure to each protected group which is proportional to the overall merit of the group members. The merit of an item indicates its utility for the task underlying the ranking, such as the adequacy of a candidate to a job or the relevance of a web page to a user’s query.

Despite recent efforts aimed at achieving fair ranking, current approaches rely on two limiting assumptions:

  • Only one set of protected groups (i.e. one grouping) is considered at a time in the enforcement of fairness. For example, a job candidate ranking system could produce rankings that are fair with respect to gender but not with respect to groupings based on other characteristics, such as ethnicity, sexual orientation, disability status and so on. Arguably, addressing only one facet of unfairness and ignoring others still results in unfair rankings.
  • The grouping (e.g. gender-based grouping) and its protected groups (e.g. the ‘male’ group and the ‘female’ group) are assumed to be known, along with the protected group membership of each item to rank. However, individuals may be reluctant to disclose such personal details either out of fear of discrimination or simply to preserve their privacy.

Our work attempts to address these two limitations to get closer to providing a ranking system that ensures algorithmic fairness.

Towards fair ranking systems robust to multiple, unknown groupings

The problem we tackle is inspired by the set-up introduced in TREC Fair Ranking 2019 (5), which is a track in a long-standing series of information retrieval evaluation workshops. The goal of this track was to devise systems which, given a query about a scientific topic, would be able to rank a set of scientific articles so that higher-ranked documents would be more relevant to the query, and returned rankings would provide fair exposure to the different groups of authors who wrote those articles. The twist is that the challenge participants didn’t have access to the author-group assignments, and that several different, a priori unknown, grouping definitions would be used in the evaluation.

Our approach to fair ranking for multiple, unknown groupings (or ‘multi-grouping robust fair ranking’) relied on two components: a strategy to define some proxy grouping(s) to alleviate the lack of known groupings, and a fair ranking algorithm to embed those proxy groupings.

Selection strategies for proxy groupings

In the absence of known groupings, we explored different possible proxy groupings to act as substitutes. We focused on two types of strategies. The first one simply assumes that each author (or alternatively each article) made up their own group, known in the literature as ‘individual fairness’. The rationale for this choice is that individual fairness often subsumes group-level fairness (6). The second strategy we investigated relied on generating synthetic, random groupings. Although this may not appear to be very intuitive, the underlying idea is that enforcing fairness for a large enough number of random groupings leads to improving the coverage of possible groupings. It should therefore help in being more robust with respect to the set of ‘true’, unknown groupings considered for evaluation. Among the random groupings, we generated balanced groupings where the size of each group was approximately the same, and unbalanced groupings which reflected a more heterogeneous situation. To define the unbalanced random groupings, we relied on an existing stochastic process known as the Chinese restaurant process (CRP) (7). Figure 1 shows the different types of groupings considered.

figure 1 blog image grouping
Figure 1. The proxy groupings explored in this work are based on diverse strategies: individual groupings (left), balanced random groupings (centre) and unbalanced random groupings (right).

Fair ranking algorithm

Given one (or several) of the proxy groupings described above, we defined a fair ranking algorithm to issue rankings that are both useful with respect to the input queries and fair according to the chosen proxy grouping(s). Our approach is a simple, efficient heuristic which we named single-query greedy brute-force re-ranking (SGBR). It features the following characteristics:

 

  • Single query: In our scenario, a same query is repeated several times across the different searches performed by users. For reasons of tractability, we chose to enforce fairness for each unique query independently, as opposed to optimizing fairness simultaneously over all queries.
  • Greedy: The approach is timewise greedy as it processes queries one by one, and only relies on the current query and previous ones to compute the ranking to return. This is a natural assumption in the case where a system has to process users’ queries in an online fashion.
  • Brute force: SGBR operates in a query-wise brute-force fashion by enumerating a set of possible rankings for a given query, and returning the ranking that achieves the best combination of fairness and relevance to the query.

Performing true brute force would consist of generating and evaluating all permutations of the documents which need to be ranked for a query. As this is not feasible for a large set of documents, we followed a two-step methodology which considerably reduced the number of rankings to consider. We briefly summarize these two steps below, which are described in full mathematical detail in our paper (8).

The first step, which we denote as pre-ordering, computes an initial ranking of the documents associated with a query. This ranking should account simultaneously for the documents’ relevance to the query and for the potential exposure deficit of the groups associated with the documents in previous occurrences of the query based on the merit of those groups. In other words, the documents ranked at the top of the pre-ordering are those which are relevant and whose groups (from the proxy groupings) received too little exposure in the rankings issued for past occurrences of the same query. This process is depicted in Figure 2(a).

The second step, labelled top-item brute-force re-ranking, generates all the permutations of the top items in the pre-ordering while keeping the tail items in the same position. The rationale for permuting only the top items is that they are the most likely to have a large impact on the overall relevance and fairness of the ranking. Each candidate ranking formed in this way is then evaluated with the target metric. This metric balances the utility of the ranking (i.e. relevance with respect to the query) and fairness according to the proxy grouping(s). Finally, the ranking with the best score is returned for the current occurrence of the query. This step is illustrated in Figure 2(b).

Figure 2. The two steps of our single-query greedy brute-force
Figure 2. The two steps of our single-query greedy brute-force re-ranking (SGBR) approach—(a) the pre-ordering step and (b) the top-item brute-force re-ranking step—given for the example of the ‘chloroquine’ query. The documents to rank are denoted as d1, …, d6, and their relevance to the query is indicated with ‘rel’. The box colour denotes the assignment of a document to a protected group, with the simplified assumption that every document pertains to a single group. In practice, our approach comprises mechanisms to compose (i.e. aggregate) document exposure into group exposure, and vice versa.

Applying SGBR, and next steps

Our experiments based on the TREC Fair Ranking (5) benchmark simulate the scenario where fairness is evaluated with respect to a large number of different, unknown groupings. For that purpose, we randomly generated for the evaluation a new set of unbalanced groupings from a CRP. Unbalanced groupings were adopted as evaluation groupings to reflect that, in reality, protected groups are often unevenly distributed (e.g. the gender of applicants for computer engineering jobs). In our evaluation, we did not rely on the groupings proposed by the track’s organizers, as only two such groupings were provided—we judged this too limited for a reliable evaluation in a multi-grouping setting.

We compared the performance of the rankings generated by our SGBR-based approach on the different proxy groupings. We also considered as baselines two simple approaches: one that maximizes the utility of the ranking by ordering documents based on their relevance to the query, and another approach which returns a random permutation of the documents, ignoring their relevance. Our experimental results revealed that the variants based on SGBR and individual proxy groupings yielded both high utility and low unfairness in comparison to the other approaches. Indeed, the SGBR approach with individual groupings led to a reduction in the unfairness score of between 18 and 22% relative to the ‘maximum utility’ approach, while keeping utility within 0.0001% of the maximum possible utility. More details on the results are provided in our article (8). Our experiments empirically support the fact that adopting individual fairness seems to be the best choice when fairness is to be enforced with respect to a set of unknown groupings.

We’re currently working on characterizing when this individual fairness strategy is optimal from a theoretical standpoint, and on extending it to the multi-stakeholder setting, where fairness should be ensured simultaneously on the side of the producer (e.g. the document author) and on the side of the consumer (e.g. the reader). How we succeed will be the topic of a future post.

References

  1. Ranking with Fairness Constraints. L. Elisa Celis, Damian Straszak and Nisheeth K. Vishnoi. Proceedings of the 45th International Colloquium on Automata, Languages and Programming (ICALP 2018), 9–13 July 2018, pp. 28:1–28:15.
  2. Fairness of Exposure in Rankings. Ashudeep Singh and Thorsten Joachims. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18), July 2018, pp. 2219–2228.
  3. Measuring Fairness in Ranked Outputs. Ke Yang and Julia Stoyanovich. Proceedings of the 29th International Conference on Scientific and Statistical Database Management (SSDBM ’17), June 2017, pp. 22:1–22:6.
  4. FA*IR: A Fair Top-k Ranking Algorithm. Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed and Ricardo Baeza-Yates. Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM ’17), November 2017, pp. 1569–1578.
  5. Overview of the TREC 2019 Fair Ranking Track. Asia J. Biega, Fernando Diaz, Michael D. Ekstrand and Sebastian Kohlmeier. Twenty-Eighth Text REtrieval Conference Proceedings (TREC 2019), 13–15 November 2019, Gaithersburg, MD, USA.
  6. Equity of Attention: Amortizing Individual Fairness in Rankings. Asia J. Biega, Krishna P. Gummadi and Gerhard Weikum. The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’18), June 2018, pp. 405–414.
  7. Exchangeability and Related Topics. David J. Aldous. In Paul-Louis Hennequin (ed.). École d’Été de Probabilités de Saint-Flour XIII—1983. Lecture Notes in Mathematics, vol. 1117. Springer, Berlin, 1985, pp. 1–198.
  8. Multi-grouping Robust Fair Ranking. Thibaut Thonet and Jean-Michel Renders. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20), July 2020, pp. 2077–2080.