Wang Shusen Recommender Systems Study Notes — Re-Ranking

Re-Ranking

Diversity in Recommender Systems

Measuring Item Similarity

Similarity Measures

  • Based on item attribute tags.

    • Category, brand, keywords...
  • Based on item vector representations.

    • Item vectors learned by the retrieval two-tower model (not ideal).
    • Content-based vector representations (preferred).

Attribute Tag-Based Similarity

  • Item attribute tags: category, brand, keywords...
  • Compute similarity based on first-level category, second-level category, and brand.
    • Item $i$: Beauty, Makeup, Chanel.
    • Item $j$: Beauty, Perfume, Chanel.
    • Similarity: $\text{sim}_1(i, j) = 1$ (same first-level category), $\text{sim}_2(i, j) = 0$ (different second-level categories), $\text{sim}_3(i, j) = 1$ (same brand).
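
To make this concrete, here is a minimal sketch that combines the three tag comparisons into one similarity score; the weights and the dict keys (`cat1`, `cat2`, `brand`) are illustrative assumptions, not values from the course.

```python
# Minimal sketch: tag-based similarity as a weighted sum of per-attribute matches.
# The weights below are illustrative, not from the course.
def tag_similarity(item_i, item_j, w1=0.3, w2=0.5, w3=0.2):
    """Each item is a dict with keys 'cat1', 'cat2', 'brand' (assumed field names)."""
    sim1 = float(item_i["cat1"] == item_j["cat1"])    # first-level category match
    sim2 = float(item_i["cat2"] == item_j["cat2"])    # second-level category match
    sim3 = float(item_i["brand"] == item_j["brand"])  # brand match
    return w1 * sim1 + w2 * sim2 + w3 * sim3

i = {"cat1": "Beauty", "cat2": "Makeup", "brand": "Chanel"}
j = {"cat1": "Beauty", "cat2": "Perfume", "brand": "Chanel"}
print(tag_similarity(i, j))  # sim1=1, sim2=0, sim3=1 -> 0.3 + 0.0 + 0.2 = 0.5
```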

Content-based item representation can improve diversity.

CNN can process images to output feature vectors; BERT can process text to output feature vectors. The two vectors are then concatenated.

How to train CNN and BERT?

  • CLIP is the currently recognized most effective pre-training method.
  • Idea: For image–text pairs, predict whether the image and text match.
  • Advantage: No manual annotation needed. Posts on Xiaohongshu naturally contain images + text, and most posts have matching image-text content.

  • A batch contains $m$ positive image–text pairs.
  • Pairing one image with the other $m - 1$ texts in the batch yields negative samples.
  • The batch therefore contains $m(m - 1)$ negative pairs in total.
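
A minimal PyTorch sketch of this batch-level contrastive objective, assuming the image and text encoders have already produced `m × d` embeddings; the function name and temperature value are illustrative, not CLIP's exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_batch_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (m, d) embeddings for m matched image-text pairs.
    The m diagonal entries of the similarity matrix are the positive pairs;
    the remaining m * (m - 1) off-diagonal entries are the negative pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature                      # (m, m) cosine similarities
    labels = torch.arange(img_emb.size(0), device=img_emb.device)   # image i matches text i
    loss_i2t = F.cross_entropy(logits, labels)                      # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, labels)                    # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```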

Methods for Improving Diversity

The post-processing after pre-ranking also needs a diversity algorithm.

The post-processing after full ranking is also called re-ranking.

Maximal Marginal Relevance (MMR)

Diversity

  • Full ranking scores the $n$ candidate items; let the fused scores be
    $\text{reward}_1, \dots, \text{reward}_n$.
  • Let the similarity between items $i$ and $j$ be $\text{sim}(i, j)$.
  • Select $k$ of the $n$ items, requiring both high ranking scores and high diversity.

MMR Diversity Algorithm

  • Compute the Marginal Relevance (MR) score for each item $i$ in the set $\mathcal{R}$:

    $$\text{MR}_i = \theta \cdot \text{reward}_i - (1 - \theta) \cdot \max_{j \in \mathcal{S}} \text{sim}(i, j)$$
  • $\text{reward}_i$ is the item's full-ranking score; $\max_{j \in \mathcal{S}} \text{sim}(i, j)$ measures how similar the unselected item $i$ is to the already selected items. A higher ranking score and a lower similarity yield a higher MR score.

  • Maximal Marginal Relevance (MMR): select the item with the highest MR score from the unselected items:

    $$\arg\max_{i \in \mathcal{R}} \text{MR}_i$$

MMR Diversity Algorithm

  1. Initialize the selected set $\mathcal{S}$ as the empty set and the unselected set $\mathcal{R}$ as the full set $\{1, \dots, n\}$.
  2. Select the item with the highest ranking score $\text{reward}_i$ and move it from $\mathcal{R}$ to $\mathcal{S}$.
  3. Repeat for $k - 1$ rounds (see the sketch below):
     a. Compute the scores $\{\text{MR}_i\}_{i \in \mathcal{R}}$ for all items in $\mathcal{R}$.
     b. Select the item with the highest MR score and move it from $\mathcal{R}$ to $\mathcal{S}$.
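
A minimal sketch of this selection loop, assuming `reward` is a length-$n$ array of fused scores and `sim` an $n \times n$ similarity matrix; the names and the default $\theta$ are illustrative.

```python
import numpy as np

def mmr_select(reward, sim, k, theta=0.5):
    """Greedy MMR: pick k items balancing ranking score and dissimilarity."""
    n = len(reward)
    selected = [int(np.argmax(reward))]            # step 2: highest-reward item first
    candidates = set(range(n)) - set(selected)
    for _ in range(k - 1):                         # step 3: k - 1 more rounds
        mr = {i: theta * reward[i]
                 - (1 - theta) * max(sim[i][j] for j in selected)
              for i in candidates}                 # MR score of every unselected item
        best = max(mr, key=mr.get)                 # item with the highest MR score
        selected.append(best)
        candidates.remove(best)
    return selected
```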

Sliding Window

  • MMR:

    $$\arg\max_{i \in \mathcal{R}} \left\{ \theta \cdot \text{reward}_i - (1 - \theta) \cdot \max_{j \in \mathcal{S}} \text{sim}(i, j) \right\}$$
  • As more items are selected (i.e., as $\mathcal{S}$ grows), it becomes harder to find an item $i \in \mathcal{R}$ that is dissimilar to every item in $\mathcal{S}$.

  • Assuming $\text{sim}$ takes values in $[0, 1]$: once $\mathcal{S}$ is large, the diversity term $\max_{j \in \mathcal{S}} \text{sim}(i, j)$ is close to 1 for every candidate, causing MMR to break down.

  • Solution: use a sliding window $\mathcal{W}$, e.g., the 10 most recently selected items, and replace $\mathcal{S}$ in the MMR formula with $\mathcal{W}$.

  • Standard MMR:

    $$\arg\max_{i \in \mathcal{R}} \left\{ \theta \cdot \text{reward}_i - (1 - \theta) \cdot \max_{j \in \mathcal{S}} \text{sim}(i, j) \right\}$$
  • With a sliding window:

    $$\arg\max_{i \in \mathcal{R}} \left\{ \theta \cdot \text{reward}_i - (1 - \theta) \cdot \max_{j \in \mathcal{W}} \text{sim}(i, j) \right\}$$
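
The windowed variant only changes which selected items enter the max; a sketch under the same assumptions as `mmr_select` above, with the window size `w` as an illustrative parameter.

```python
import numpy as np

def mmr_select_windowed(reward, sim, k, theta=0.5, w=10):
    """Same as mmr_select, but only the w most recently selected items are compared."""
    n = len(reward)
    selected = [int(np.argmax(reward))]
    candidates = set(range(n)) - set(selected)
    for _ in range(k - 1):
        window = selected[-w:]                     # sliding window W: last w picks
        mr = {i: theta * reward[i]
                 - (1 - theta) * max(sim[i][j] for j in window)
              for i in candidates}
        best = max(mr, key=mr.get)
        selected.append(best)
        candidates.remove(best)
    return selected
```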

Re-Ranking Rules

Rule: At most $k$ consecutive posts of a certain type

  • Items in the Xiaohongshu recommendation system are divided into image-text posts and video posts.
  • At most $k = 5$ consecutive image-text posts; at most $k = 5$ consecutive video posts.
  • If positions $i$ through $i + 4$ are all image-text posts, then position $i + 5$ must be a video post.

Rule: At most 1 post of a certain type in every $k$ posts

  • Promoted posts from operations have their ranking scores multiplied by a factor greater than 1 (a boost), helping them get more exposure.
  • To keep the boost from harming user experience, allow at most 1 promoted post in every $k = 9$ posts.
  • If position $i$ is a promoted post, then positions $i + 1$ through $i + 8$ cannot be promoted posts.

Rule: At most $k$ posts of a certain type in the first $t$ posts

  • The top $t$ posts receive the most exposure and matter most for user experience.
    (On Xiaohongshu, the top 4 posts form the first screen.)
  • The Xiaohongshu recommendation system has posts with e-commerce cards; showing too many may hurt user experience.
  • In the first $t = 1$ post, at most $k = 0$ posts with e-commerce cards.
  • In the first $t = 4$ posts, at most $k = 1$ post with e-commerce cards.
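
The three rules can all be phrased as a feasibility check on the item about to be placed. A minimal sketch, assuming each item carries the flags `is_video`, `is_promoted`, and `has_ecommerce_card` (illustrative field names), with the thresholds taken from the examples above:

```python
def violates_rules(selected, candidate):
    """selected: items already placed, in order; candidate: item to append next.
    Each item is a dict with 'is_video', 'is_promoted', 'has_ecommerce_card'."""
    pos = len(selected)  # 0-based position the candidate would occupy

    # At most 5 consecutive posts of the same type (image-text / video).
    last5 = selected[-5:]
    if len(last5) == 5 and all(x["is_video"] == candidate["is_video"] for x in last5):
        return True

    # At most 1 promoted post in every 9 posts.
    if candidate["is_promoted"] and any(x["is_promoted"] for x in selected[-8:]):
        return True

    # No e-commerce card in the first post; at most 1 in the first 4 posts.
    if candidate["has_ecommerce_card"]:
        if pos == 0:
            return True
        if pos < 4 and any(x["has_ecommerce_card"] for x in selected[:4]):
            return True

    return False
```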

MMR + Re-Ranking Rules

  • MMR selects one item per round:

    $$\arg\max_{i \in \mathcal{R}} \left\{ \theta \cdot \text{reward}_i - (1 - \theta) \cdot \max_{j \in \mathcal{W}} \text{sim}(i, j) \right\}$$
  • Re-ranking combines MMR with the rules: maximize the MR score subject to the rule constraints.

  • Each round, first use the rules to exclude some items from $\mathcal{R}$, yielding the subset $\mathcal{R}'$.

  • Replace $\mathcal{R}$ with $\mathcal{R}'$ in the MMR formula; the selected items then satisfy the rules.
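
Putting the pieces together, a possible shape of the MMR-plus-rules loop, reusing the `violates_rules` helper sketched above; all names are illustrative.

```python
import numpy as np

def mmr_with_rules(reward, sim, items, k, theta=0.5, w=10):
    """Each round: drop rule-violating candidates (R -> R'), then take the max-MR item."""
    n = len(reward)
    feasible0 = [i for i in range(n) if not violates_rules([], items[i])]
    selected = [max(feasible0, key=lambda i: reward[i])]   # first pick also obeys the rules
    candidates = set(range(n)) - set(selected)
    for _ in range(k - 1):
        placed = [items[j] for j in selected]
        feasible = [i for i in candidates if not violates_rules(placed, items[i])]
        if not feasible:
            break
        window = selected[-w:]
        mr = {i: theta * reward[i]
                 - (1 - theta) * max(sim[i][j] for j in window)
              for i in feasible}
        best = max(mr, key=mr.get)
        selected.append(best)
        candidates.remove(best)
    return selected
```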

DPP: Mathematical Foundations

Parallelepiped

  • In 2D space, the "parallelepiped" is simply a parallelogram.

  • Points in the parallelogram can be expressed as:

    $$\mathbf{x} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2.$$

  • The coefficients $\alpha_1$ and $\alpha_2$ range over $[0, 1]$.

  • In 3D space, the corresponding object is the ordinary parallelepiped.

  • Points in the parallelepiped can be expressed as:

    $$\mathbf{x} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \alpha_3 \mathbf{v}_3.$$

  • The coefficients $\alpha_1, \alpha_2, \alpha_3$ range over $[0, 1]$.

Parallelepiped

  • A set of vectors $\mathbf{v}_1, \cdots, \mathbf{v}_k \in \mathbb{R}^d$ defines a $k$-dimensional parallelepiped:

    $$P(\mathbf{v}_1, \cdots, \mathbf{v}_k) = \{\alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k \mid 0 \leq \alpha_1, \cdots, \alpha_k \leq 1\}.$$

  • This requires $k \leq d$; for example, a $k = 2$-dimensional parallelogram can live in $d = 3$-dimensional space.

  • If $\mathbf{v}_1, \cdots, \mathbf{v}_k$ are linearly dependent, the volume $\text{vol}(P) = 0$. (Example: $k = 3$ vectors lying in a common plane span a parallelepiped with volume 0.)

Area of a Parallelogram

Using $\mathbf{v}_2$ as the base, how do we compute the height $\mathbf{q}_1$?

  • Compute the projection of $\mathbf{v}_1$ onto $\mathbf{v}_2$:

    $$\text{Proj}_{\mathbf{v}_2}(\mathbf{v}_1) = \frac{\mathbf{v}_1^T \mathbf{v}_2}{\|\mathbf{v}_2\|_2^2} \cdot \mathbf{v}_2.$$

  • Compute

    $$\mathbf{q}_1 = \mathbf{v}_1 - \text{Proj}_{\mathbf{v}_2}(\mathbf{v}_1).$$

  • Property: the base $\mathbf{v}_2$ and the height $\mathbf{q}_1$ are orthogonal.

  • The area of the parallelogram is then $\|\mathbf{v}_2\|_2 \cdot \|\mathbf{q}_1\|_2$ (base length times height length).
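
A small NumPy check of this construction on illustrative vectors:

```python
import numpy as np

v1 = np.array([2.0, 1.0])
v2 = np.array([3.0, 0.0])

proj = (v1 @ v2) / (v2 @ v2) * v2                  # projection of v1 onto v2
q1 = v1 - proj                                     # height vector

print(q1 @ v2)                                     # ~0: base and height are orthogonal
print(np.linalg.norm(v2) * np.linalg.norm(q1))     # area = base length x height length = 3.0
```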

Volume of a Parallelepiped

  • Volume = (area of the base) $\times$ $\|\text{height}\|_2$.

  • The parallelogram $P(\mathbf{v}_1, \mathbf{v}_2)$ is the base of the parallelepiped $P(\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3)$.

  • The height $\mathbf{q}_3$ is the component of $\mathbf{v}_3$ perpendicular to the base $P(\mathbf{v}_1, \mathbf{v}_2)$.

When is Volume Maximized/Minimized?

  • Let $\mathbf{v}_1$, $\mathbf{v}_2$, $\mathbf{v}_3$ all be unit vectors.

  • When the three vectors are mutually orthogonal, the parallelepiped is a unit cube and the volume is maximized: $\text{vol} = 1$.

  • When the three vectors are linearly dependent, the volume is minimized: $\text{vol} = 0$.

Measuring Item Diversity

  • Given $k$ items, represent them as unit vectors $\mathbf{v}_1, \cdots, \mathbf{v}_k \in \mathbb{R}^d$ (with $d \geq k$).

  • Use the volume of the parallelepiped they span to measure item diversity; since the vectors are unit vectors, the volume lies between $0$ and $1$.

  • If $\mathbf{v}_1, \cdots, \mathbf{v}_k$ are mutually orthogonal (high diversity), the volume is maximized: $\text{vol} = 1$.

  • If $\mathbf{v}_1, \cdots, \mathbf{v}_k$ are linearly dependent (low diversity), the volume is minimized: $\text{vol} = 0$.

  • Given $k$ items represented as unit vectors $\mathbf{v}_1, \cdots, \mathbf{v}_k \in \mathbb{R}^d$ (with $d \geq k$),

  • arrange them as the columns of a matrix $\mathbf{V} \in \mathbb{R}^{d \times k}$.

  • With $d \geq k$, the determinant and the volume satisfy:

    $$\det(\mathbf{V}^T \mathbf{V}) = \text{vol}\big(P(\mathbf{v}_1, \cdots, \mathbf{v}_k)\big)^2.$$

  • Therefore, the determinant $\det(\mathbf{V}^T \mathbf{V})$ can be used to measure the diversity of the vectors $\mathbf{v}_1, \cdots, \mathbf{v}_k$.
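
A quick NumPy check of the $\det(\mathbf{V}^T \mathbf{V}) = \text{vol}^2$ relationship on the two extreme cases (orthogonal vs. linearly dependent unit vectors):

```python
import numpy as np

# k = 2 unit vectors in d = 3 dimensions, stored as matrix columns.
V_orth = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])                    # orthogonal columns
V_dep = np.array([[1.0, 1.0],
                  [0.0, 0.0],
                  [0.0, 0.0]])                     # identical (linearly dependent) columns

print(np.linalg.det(V_orth.T @ V_orth))            # 1.0 -> vol = 1, maximal diversity
print(np.linalg.det(V_dep.T @ V_dep))              # 0.0 -> vol = 0, no diversity
```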

DPP: Diversity Algorithm

Diversity Problem

  • Full ranking scores $n$ items: $\text{reward}_1, \cdots, \text{reward}_n$.

  • Vector representations of the $n$ items: $\mathbf{v}_1, \cdots, \mathbf{v}_n \in \mathbb{R}^d$.

  • Select $k$ items from the $n$ to form the set $\mathcal{S}$:

    • High value: the sum of scores $\sum_{j \in \mathcal{S}} \text{reward}_j$ should be large.
    • High diversity: the volume of the parallelepiped $P(\mathcal{S})$ spanned by the $k$ vectors in $\mathcal{S}$ should be large.
  • Let the $k$ item vectors in $\mathcal{S}$ form the columns of the matrix $\mathbf{V}_{\mathcal{S}} \in \mathbb{R}^{d \times k}$.

  • Use these $k$ vectors as edges to form the parallelepiped $P(\mathcal{S})$.

  • The volume $\text{vol}(P(\mathcal{S}))$ measures the diversity of the items in $\mathcal{S}$.

  • With $k \leq d$, the determinant and the volume satisfy:

    $$\det(\mathbf{V}_{\mathcal{S}}^T \mathbf{V}_{\mathcal{S}}) = \text{vol}\big(P(\mathcal{S})\big)^2$$

Determinantal Point Process (DPP)

  • DPP is a classical method in statistical machine learning:

    $$\arg\max_{\mathcal{S}: |\mathcal{S}| = k} \log \det(\mathbf{V}_{\mathcal{S}}^T \mathbf{V}_{\mathcal{S}})$$
  • Hulu's paper applies DPP to recommender systems:

    $$\arg\max_{\mathcal{S}: |\mathcal{S}| = k} \theta \cdot \Big( \sum_{j \in \mathcal{S}} \text{reward}_j \Big) + (1 - \theta) \cdot \log \det(\mathbf{V}_{\mathcal{S}}^T \mathbf{V}_{\mathcal{S}})$$
  • Let $\mathbf{A}$ be the $n \times n$ matrix whose $(i, j)$ entry is $a_{ij} = \mathbf{v}_i^T \mathbf{v}_j$.

  • Given the vectors $\mathbf{v}_1, \cdots, \mathbf{v}_n \in \mathbb{R}^d$, computing $\mathbf{A}$ takes $O(n^2 d)$ time.

  • $\mathbf{A}_{\mathcal{S}} = \mathbf{V}_{\mathcal{S}}^T \mathbf{V}_{\mathcal{S}}$ is a $k \times k$ submatrix of $\mathbf{A}$: if $i, j \in \mathcal{S}$, then $a_{ij}$ is an entry of $\mathbf{A}_{\mathcal{S}}$.

  • DPP for recommender systems can therefore be rewritten as:

    $$\arg\max_{\mathcal{S}: |\mathcal{S}| = k} \theta \cdot \Big( \sum_{j \in \mathcal{S}} \text{reward}_j \Big) + (1 - \theta) \cdot \log \det(\mathbf{A}_{\mathcal{S}})$$
  • DPP is a combinatorial optimization problem: select a size-$k$ subset $\mathcal{S}$ of $\{1, \cdots, n\}$. Exact maximization is intractable, so it is solved greedily.

  • Let $\mathcal{S}$ denote the selected items and $\mathcal{R}$ the unselected items; each greedy round solves:

    $$\arg\max_{i \in \mathcal{R}} \theta \cdot \text{reward}_i + (1 - \theta) \cdot \log \det(\mathbf{A}_{\mathcal{S} \cup \{i\}})$$
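
A brute-force sketch of this greedy loop, recomputing each determinant from scratch with `numpy.linalg.slogdet`; this is the $O(nk^4)$ baseline discussed in the next section, not Hulu's fast algorithm.

```python
import numpy as np

def dpp_greedy(reward, V, k, theta=0.5):
    """V: d x n matrix with unit-norm item vectors as columns; reward: length-n scores."""
    A = V.T @ V                                    # n x n Gram matrix, a_ij = v_i^T v_j
    n = len(reward)
    selected = [int(np.argmax(reward))]
    candidates = set(range(n)) - set(selected)
    for _ in range(k - 1):
        def objective(i):
            idx = selected + [i]
            _, logdet = np.linalg.slogdet(A[np.ix_(idx, idx)])  # log det(A_{S ∪ {i}})
            return theta * reward[i] + (1 - theta) * logdet
        best = max(candidates, key=objective)
        selected.append(best)
        candidates.remove(best)
    return selected
```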

Solving DPP

Brute Force Algorithm

  • Greedy algorithm:

    $$\arg\max_{i \in \mathcal{R}} \theta \cdot \text{reward}_i + (1 - \theta) \cdot \log \det(\mathbf{A}_{\mathcal{S} \cup \{i\}}).$$
  • Complexity analysis:

    • For a single $i$, computing the determinant of $\mathbf{A}_{\mathcal{S} \cup \{i\}}$ takes $O(|\mathcal{S}|^3)$ time.

    • For all $i \in \mathcal{R}$, computing the determinants takes $O(|\mathcal{S}|^3 \cdot |\mathcal{R}|)$ time.

    • This has to be repeated $k$ times to select $k$ items. With brute-force determinant computation, the selection loop therefore takes

      $$O(|\mathcal{S}|^3 \cdot |\mathcal{R}| \cdot k) = O(n k^4).$$
  • Total time complexity of the brute-force algorithm, including the $O(n^2 d)$ for building $\mathbf{A}$:

    $$O(n^2 d + n k^4).$$

Hulu's Fast Algorithm

  • Hulu's paper designs a numerical algorithm that selects $k$ items out of $n$ in only $O(n^2 d + n k^2)$ time.

  • Given the vectors $\mathbf{v}_1, \cdots, \mathbf{v}_n \in \mathbb{R}^d$, computing $\mathbf{A}$ takes $O(n^2 d)$ time.

  • All of the determinants are then computed in $O(n k^2)$ time using Cholesky decomposition.

  • Cholesky decomposition: $\mathbf{A}_{\mathcal{S}} = \mathbf{L} \mathbf{L}^T$, where $\mathbf{L}$ is a lower triangular matrix (all entries above the diagonal are zero).

  • The Cholesky decomposition gives the determinant of $\mathbf{A}_{\mathcal{S}}$:

    • The determinant of the lower triangular matrix $\mathbf{L}$ equals the product of its diagonal entries.

    • The determinant of $\mathbf{A}_{\mathcal{S}}$ is therefore

      $$\det(\mathbf{A}_{\mathcal{S}}) = \det(\mathbf{L})^2 = \prod_i l_{ii}^2.$$
  • Given $\mathbf{A}_{\mathcal{S}} = \mathbf{L} \mathbf{L}^T$, one can cheaply extend it to the Cholesky decomposition of every $\mathbf{A}_{\mathcal{S} \cup \{i\}}$, and thus quickly compute every determinant $\det(\mathbf{A}_{\mathcal{S} \cup \{i\}})$.

  • Greedy algorithm:

    $$\arg\max_{i \in \mathcal{R}} \theta \cdot \text{reward}_i + (1 - \theta) \cdot \log \det(\mathbf{A}_{\mathcal{S} \cup \{i\}}).$$
  • Initialization: $\mathcal{S}$ contains a single item, so $\mathbf{A}_{\mathcal{S}}$ is a $1 \times 1$ matrix.

  • Each round:

    • Starting from the previous round's factorization $\mathbf{A}_{\mathcal{S}} = \mathbf{L} \mathbf{L}^T$, quickly derive the Cholesky decomposition of $\mathbf{A}_{\mathcal{S} \cup \{i\}}$ for every $i \in \mathcal{R}$.
    • From it, compute $\log \det(\mathbf{A}_{\mathcal{S} \cup \{i\}})$ and pick the item that maximizes the greedy objective.
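
A sketch of the rank-one Cholesky extension behind that "quickly derive" step: given the factor $\mathbf{L}$ of $\mathbf{A}_{\mathcal{S}}$, the factor of $\mathbf{A}_{\mathcal{S} \cup \{i\}}$ follows from one triangular solve in $O(|\mathcal{S}|^2)$ time. The helper name and the use of SciPy are assumptions, not details from Hulu's paper.

```python
import numpy as np
from scipy.linalg import solve_triangular

def extend_cholesky(L, A, S, i):
    """L: Cholesky factor of A_S (A_S = L L^T); S: list of selected indices; i: candidate."""
    b = A[S, i]                                    # new column of A_{S ∪ {i}}, restricted to S
    c = A[i, i]
    l = solve_triangular(L, b, lower=True)         # solve L l = b  -- O(|S|^2)
    d = np.sqrt(max(c - l @ l, 0.0))               # new diagonal entry (clamped for rounding)
    L_new = np.block([[L, np.zeros((len(S), 1))],
                      [l[None, :], np.array([[d]])]])
    det_new = np.prod(np.diag(L_new)) ** 2         # det(A_{S ∪ {i}}) = det(A_S) * d^2
    return L_new, det_new
```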

DPP Extensions

Sliding Window

  • Let $\mathcal{S}$ denote the selected items and $\mathcal{R}$ the unselected items; the DPP greedy step is:

    $$\arg\max_{i \in \mathcal{R}} \theta \cdot \text{reward}_i + (1 - \theta) \cdot \log \det(\mathbf{A}_{\mathcal{S} \cup \{i\}}).$$
  • As the set $\mathcal{S}$ grows, similar items accumulate and the item vectors tend toward linear dependence.

  • The determinant $\det(\mathbf{A}_{\mathcal{S}})$ then collapses to zero, and its logarithm approaches negative infinity.

  • Greedy algorithm:

    $$\arg\max_{i \in \mathcal{R}} \theta \cdot \text{reward}_i + (1 - \theta) \cdot \log \det(\mathbf{A}_{\mathcal{S} \cup \{i\}})$$
  • With a sliding window $\mathcal{W}$ (the most recently selected items):

    $$\arg\max_{i \in \mathcal{R}} \theta \cdot \text{reward}_i + (1 - \theta) \cdot \log \det(\mathbf{A}_{\mathcal{W} \cup \{i\}})$$

Rule Constraints

  • Each round of the greedy algorithm selects one item from $\mathcal{R}$:

    $$\arg\max_{i \in \mathcal{R}} \theta \cdot \text{reward}_i + (1 - \theta) \cdot \log \det(\mathbf{A}_{\mathcal{W} \cup \{i\}})$$
  • There are many rule constraints, e.g., at most 5 consecutive video posts (if 5 video posts have already appeared consecutively, the next post must be an image-text post).

  • Use the rules to exclude some items from $\mathcal{R}$, yielding the subset $\mathcal{R}'$, then solve:

    $$\arg\max_{i \in \mathcal{R}'} \theta \cdot \text{reward}_i + (1 - \theta) \cdot \log \det(\mathbf{A}_{\mathcal{W} \cup \{i\}})$$
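
For completeness, one greedy DPP round with both the sliding window and the rule filter; this sketch reuses `violates_rules` from the re-ranking-rules section and, for brevity, brute-force `slogdet` instead of the incremental Cholesky update.

```python
import numpy as np

def dpp_round(reward, A, selected, candidates, items, theta=0.5, w=10):
    """Return the next item to select, or None if no candidate satisfies the rules."""
    window = selected[-w:]                         # W: the w most recently selected items
    placed = [items[j] for j in selected]
    feasible = [i for i in candidates if not violates_rules(placed, items[i])]  # R -> R'
    if not feasible:
        return None

    def objective(i):
        idx = window + [i]
        _, logdet = np.linalg.slogdet(A[np.ix_(idx, idx)])       # log det(A_{W ∪ {i}})
        return theta * reward[i] + (1 - theta) * logdet

    return max(feasible, key=objective)
```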
