Picking random element, with weights

Whenever I'm doing some testing I need sample data. Easiest way to do it is to generate data using some random/generate_series queries.

But what if I need specific frequencies?

For example, I need to generate 10,000,000 rows, where there will be 10% of ‘a', 20% of ‘b', and the rest will be split equally between ‘c' and ‘d'?

online user" on Discord had similar problem, so figured I'll write a function that does that.

Since we need to pass, to the function, any number of args, and weigths for them, the simplest thing I can pass it as are 2 arrays:

  1. elements to randomly pick from
  2. their weights

Ready function is:

CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] ) RETURNS ANYELEMENT LANGUAGE plpgsql AS $$
DECLARE
    v_no_choices INT4 := array_length(p_choices, 1);
    v_no_weights INT4 := array_length(p_weights, 1);
    v_weight_sum FLOAT8;
    v_random     FLOAT8;
    v_i          INT4;
BEGIN
    IF v_no_choices IS NULL OR v_no_weights IS NULL OR v_no_choices <> v_no_weights THEN
        RETURN NULL;
    END IF;
 
    SELECT SUM(x) INTO v_weight_sum FROM unnest(p_weights) AS q(x);
 
    v_random := random() * v_weight_sum;
 
    FOR v_i IN array_lower(p_weights, 1) .. array_upper(p_weights, 1) loop
        IF v_random < p_weights[v_i] THEN
            RETURN p_choices[v_i];
        END IF;
        v_random := v_random - p_weights[v_i];
    END loop;
    RETURN NULL;
END;
$$;

It needs two arrays, they shouldn't be empty/null, and their number of elements have to be the same.

Example use:

=$ SELECT weighted_random( '{a,b,c,d}'::TEXT[], '{.5,.5,.1,.9}' );
 weighted_random
─────────────────
 d
(1 ROW)

In this example, it picks from four values:

  • a with weight 0.5
  • b with weight 0.5
  • c with weight 0.1
  • d with weight 0.9

Since the weights are any float8, we can pick virtually any number to represent weight. For example we could:

=$ SELECT weighted_random( '{a,b,c,d}'::TEXT[], '{1,2,3.5,3.5}' );

to generate one of the values with probabilities from the head of this post 🙂

We can easily check if the percentages are what we wanted by doing:

=$ WITH x AS (
    SELECT
        weighted_random( '{a,b,c,d}'::TEXT[], '{1,2,3.5,3.5}' ) AS v
    FROM
        generate_series(1,10000) i
)
SELECT
    v,
    COUNT(*),
    COUNT(*)::float8 * 100 / 10000 AS count_pct
FROM
    x
GROUP BY
    v
ORDER BY
    v;
 v │ COUNT │ count_pct
───┼───────┼───────────
 a │   9959.95
 b │  201420.14
 c │  347134.71
 d │  352035.2
(4 ROWS)

Of course, since these are randomly picked, the percentages fluctuate a bit, but should be close enough 🙂

Hope it could be useful for someone 🙂