Wang Shusen Recommender Systems Study Notes — Feature Crossing

Feature Crossing

Factorization Machine (FM)

Linear Model

  • Given $d$ features, denoted as $\mathbf{x} = [x_1, \cdots, x_d]$.

  • Linear model:

    $$p = b + \sum_{i=1}^{d} w_i x_i.$$
  • The model has $d + 1$ parameters: $\mathbf{w} = [w_1, \cdots, w_d]$ and $b$.

  • Prediction is a weighted sum of features. (Addition only, no multiplication.)

Second-Order Crossed Features

  • Given $d$ features, denoted as $\mathbf{x} = [x_1, \cdots, x_d]$.

  • Linear model + second-order crossed features:

    $$p = b + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} u_{ij} x_i x_j.$$
  • The model has $O(d^2)$ parameters.

Linear model + second-order crossed features, with the cross weights approximated by a low-rank factorization:

$$p = b + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} u_{ij} x_i x_j, \qquad u_{ij} \approx \mathbf{v}_i^T \mathbf{v}_j.$$

Matrix $U$ has $d$ rows and $d$ columns; matrix $V$ has $d$ rows and $k$ columns; matrix $V^T$ has $k$ rows and $d$ columns, so $U \approx V V^T$.

  • Factorization Machine (FM):

    $$p = b + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \left( \mathbf{v}_i^T \mathbf{v}_j \right) x_i x_j.$$
  • FM has $O(kd)$ parameters ($k \ll d$).

Factorization Machine

  • FM is a drop-in replacement for linear models; anywhere linear regression or logistic regression can be used, FM can be used instead.
  • FM uses second-order crossed features, making it more expressive than linear models.
  • Through the approximation $u_{ij} \approx \mathbf{v}_i^T \mathbf{v}_j$, FM reduces the number of second-order cross weights from $O(d^2)$ to $O(kd)$.
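A minimal FM sketch in PyTorch (class and variable names are my own, not from the lecture). It computes the pairwise term with the standard identity $\sum_{i<j} (\mathbf{v}_i^T \mathbf{v}_j) x_i x_j = \tfrac{1}{2}\sum_{f=1}^{k}\big[(\sum_i v_{if} x_i)^2 - \sum_i v_{if}^2 x_i^2\big]$, which is what makes the cost $O(kd)$ rather than $O(d^2)$:

```python
import torch
import torch.nn as nn

class FactorizationMachine(nn.Module):
    """Minimal FM sketch: p = b + w·x + sum_{i<j} (v_i·v_j) x_i x_j."""

    def __init__(self, num_features: int, k: int):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(1))
        self.w = nn.Parameter(torch.zeros(num_features))             # linear weights, O(d)
        self.v = nn.Parameter(torch.randn(num_features, k) * 0.01)   # factor matrix V, O(kd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d)
        linear = self.bias + x @ self.w                  # b + sum_i w_i x_i
        xv = x @ self.v                                  # (batch, k): sum_i x_i v_i
        x2v2 = (x ** 2) @ (self.v ** 2)                  # (batch, k): sum_i x_i^2 v_i^2
        pairwise = 0.5 * (xv ** 2 - x2v2).sum(dim=1)     # sum_{i<j} (v_i·v_j) x_i x_j
        return linear + pairwise

# Usage: feed the FM output to a sigmoid for a CTR-style binary prediction.
fm = FactorizationMachine(num_features=1000, k=16)
p = torch.sigmoid(fm(torch.rand(4, 1000)))
```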

Deep & Cross Network (DCN)

Retrieval and Ranking Models

Two-tower models and multi-objective ranking models are just structural frameworks; the internal neural networks can be any network architecture.

Cross Layer

The cross layer incorporates a ResNet-style skip connection (adding the layer input back to its output) to prevent vanishing gradients.
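A minimal sketch of one common cross-layer formulation (the DCN V2 style, $\mathbf{x}_{i+1} = \mathbf{x}_0 \circ (W \mathbf{x}_i + \mathbf{b}) + \mathbf{x}_i$; the names here are illustrative):

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One cross layer: x_{i+1} = x_0 * (W x_i + b) + x_i."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0: torch.Tensor, xi: torch.Tensor) -> torch.Tensor:
        # "+ xi" is the ResNet-style skip connection mentioned above.
        return x0 * self.linear(xi) + xi

# A cross network stacks several cross layers, always feeding in the original input x0.
layers = nn.ModuleList([CrossLayer(32) for _ in range(3)])
x0 = torch.rand(4, 32)
x = x0
for layer in layers:
    x = layer(x0, x)
```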

Cross Network

Deep & Cross Network (DCN)

DCN outperforms fully connected networks in practice and can be used in the user tower and item tower of two-tower models, the shared bottom network of multi-objective ranking models, and the expert networks in MMoE.

Learning Hidden Unit Contributions (LHUC)

The network is several fully connected layers followed by a Sigmoid whose output is multiplied by 2, so all values in the output vector lie between 0 and 2.
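A minimal sketch of this gating idea, assuming (as in the usual LHUC-style setup for ranking) that the gate is computed from user features and multiplies another feature representation element-wise; all names are illustrative:

```python
import torch
import torch.nn as nn

class LHUCGate(nn.Module):
    """Gate from user features: FC layers -> Sigmoid -> *2, values in (0, 2)."""

    def __init__(self, user_dim: int, item_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(user_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, item_dim), nn.Sigmoid(),
        )

    def forward(self, user_feat: torch.Tensor, item_repr: torch.Tensor) -> torch.Tensor:
        gate = 2.0 * self.net(user_feat)   # each element lies in (0, 2)
        return item_repr * gate            # element-wise scaling of the other representation

gate = LHUCGate(user_dim=48, item_dim=32)
out = gate(torch.rand(4, 48), torch.rand(4, 32))
```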

SENet & Bilinear Cross

  • SENet applies field-wise weighting to discrete features.

  • Field:

    • User ID embedding is a 64-dimensional vector.
    • The 64 elements (i.e., the embedding vector of one feature) form one field and receive the same weight.
    • The more important the feature, the higher the weight.
  • If there are $m$ fields, the weight vector is $m$-dimensional.
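A minimal SENet sketch, assuming for simplicity that all $m$ fields share the same embedding dimension (the reduction ratio and all names are illustrative):

```python
import torch
import torch.nn as nn

class SENetFieldWeight(nn.Module):
    """Field-wise re-weighting: squeeze each field to a scalar, excite to m weights."""

    def __init__(self, num_fields: int, reduction: int = 4):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(num_fields, num_fields // reduction), nn.ReLU(),
            nn.Linear(num_fields // reduction, num_fields), nn.Sigmoid(),
        )

    def forward(self, fields: torch.Tensor) -> torch.Tensor:
        # fields: (batch, m, k) -- one embedding vector per field
        squeezed = fields.mean(dim=2)          # (batch, m): one scalar per field
        weights = self.excite(squeezed)        # (batch, m): one weight per field
        return fields * weights.unsqueeze(2)   # same weight applied to all k elements of a field

senet = SENetFieldWeight(num_fields=16)
reweighted = senet(torch.rand(4, 16, 64))
```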

Cross-Field Feature Crossing

Inner Product

Both $\mathbf{x}_i$ and $\mathbf{x}_j$ are feature embedding vectors; $f_{ij} = \mathbf{x}_i^T \mathbf{x}_j$ is a scalar. With $m$ fields, pairwise inner products yield $m^2$ scalars.

Hadamard Product

Both $\mathbf{x}_i$ and $\mathbf{x}_j$ are feature embedding vectors; $f_{ij} = \mathbf{x}_i \circ \mathbf{x}_j$ is a vector. With $m$ fields, pairwise Hadamard products yield $m^2$ vectors, which is too many. One must manually specify which vector pairs to cross, rather than crossing all pairs.
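For reference, the two basic cross operations on a single pair of field embeddings (names illustrative):

```python
import torch

x_i = torch.rand(64)   # embedding of field i
x_j = torch.rand(64)   # embedding of field j

f_inner = torch.dot(x_i, x_j)   # inner product: one scalar per field pair
f_hadamard = x_i * x_j          # Hadamard product: one 64-dim vector per field pair
```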

Bilinear Cross (inner product)

Bilinear cross (inner product form) takes $f_{ij} = \mathbf{x}_i^T W_{ij} \mathbf{x}_j$, a scalar. With $m$ fields, there are $m^2$ scalar values $f_{ij}$ and $m^2/2$ parameter matrices $W_{ij}$.

Bilinear Cross (Hadamard)

Bilinear cross (Hadamard form) takes $f_{ij} = \mathbf{x}_i \circ (W_{ij} \mathbf{x}_j)$, a vector. With $m$ fields, there are $m^2$ vector values $f_{ij}$ and $m^2/2$ parameter matrices $W_{ij}$.
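A minimal sketch of the two bilinear cross variants for one field pair, using the formulations stated above ($W_{ij}$ is a learned matrix; all names are illustrative):

```python
import torch
import torch.nn as nn

k = 64                                           # embedding dimension of each field
W_ij = nn.Parameter(torch.randn(k, k) * 0.01)    # one parameter matrix per crossed field pair

x_i = torch.rand(k)
x_j = torch.rand(k)

f_scalar = x_i @ (W_ij @ x_j)    # bilinear cross, inner-product form: a scalar
f_vector = x_i * (W_ij @ x_j)    # bilinear cross, Hadamard form: a k-dim vector
```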

FiBiNet

Behavior Sequences

User Behavior Sequence Modeling

LastN Features

  • LastN: Item IDs from the user's most recent $n$ interactions (clicks, likes, etc.).
  • Embed LastN item IDs to get $n$ vectors.
  • Take the average of these $n$ vectors as one type of user feature.
  • Applicable to retrieval two-tower models, pre-ranking three-tower models, and full-ranking models.
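A minimal sketch of the LastN average-pooling feature (names illustrative; a real system would also pad or mask users with fewer than $n$ records):

```python
import torch
import torch.nn as nn

num_items, emb_dim, n = 100_000, 64, 50
item_emb = nn.Embedding(num_items, emb_dim)

last_n_ids = torch.randint(0, num_items, (4, n))   # (batch, n) most recent item IDs
last_n_vecs = item_emb(last_n_ids)                 # (batch, n, emb_dim)
user_feature = last_n_vecs.mean(dim=1)             # (batch, emb_dim): average as a user feature
```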

DIN Model

DIN Model

  • DIN replaces simple averaging with weighted averaging, i.e., an attention mechanism.
  • Weights: similarity between the candidate item and the user's LastN items.

DIN Model

  • For a given candidate item, compute its similarity to each of the user's LastN items.
  • Use the similarity as weights to compute a weighted sum of the user's LastN item vectors, yielding a single vector.
  • Use this vector as a user feature input to the ranking model, estimating click-through rate, like rate, etc. for (user, candidate item) pairs.
  • This is essentially an attention mechanism.
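A minimal sketch of the weighted average. For brevity it uses dot-product similarity with a softmax; DIN itself learns the weights with a small attention network, so treat this only as an illustration of the pooling step:

```python
import torch

def din_pooling(last_n_vecs: torch.Tensor, candidate_vec: torch.Tensor) -> torch.Tensor:
    """Weighted average of LastN vectors; weights = similarity to the candidate item.

    last_n_vecs:   (batch, n, dim) embeddings of the user's LastN items
    candidate_vec: (batch, dim)    embedding of the candidate item
    """
    scores = (last_n_vecs * candidate_vec.unsqueeze(1)).sum(dim=2)   # (batch, n) similarities
    weights = torch.softmax(scores, dim=1)                           # normalize to weights
    return (weights.unsqueeze(2) * last_n_vecs).sum(dim=1)           # (batch, dim) user feature

user_interest = din_pooling(torch.rand(4, 50, 64), torch.rand(4, 64))
```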

Simple Averaging vs. Attention Mechanism

  • Both simple averaging and attention mechanism are applicable to full-ranking models.
  • Simple averaging is applicable to two-tower and three-tower models.
    • Simple averaging only uses LastN, which is a user-intrinsic feature.
    • The average of LastN vectors is used as input to the user tower.
  • Attention mechanism is not applicable to two-tower or three-tower models.
    • Attention mechanism requires LastN + candidate item.
    • The user tower cannot see the candidate item, so attention mechanism cannot be applied inside the user tower.

SIM Model

DIN Model

  • Computes a weighted average of the user's LastN vectors.
  • Weights are the similarity between the candidate item and each LastN item.

DIN Model Drawbacks

  • Attention-layer computation is $\propto n$ (the length of the user behavior sequence).
  • Can only record the most recent few hundred items; otherwise computation is too expensive.
  • Drawback: Focuses on short-term interests, forgetting long-term interests.

How to Improve DIN?

  • Goal: Retain long user behavior sequences ($n$ large) without excessive computation.

  • Improved DIN:

    • DIN computes a weighted average of LastN vectors using similarity as weights.
    • If a LastN item is very different from the candidate item, its weight is near zero.
    • Quickly eliminate LastN items unrelated to the candidate item, reducing computation in the attention layer.

SIM Model

  • Retains long-term user behavior history; $n$ can be in the thousands.
  • For each candidate item, quickly search the user's LastN records to find $k$ similar items.
  • Convert LastN into TopK, then feed into the attention layer.
  • SIM reduces computation (from $n$ to $k$).

Step 1: Search

  • Method 1: Hard Search

    • Filter LastN items to keep only those with the same category as the candidate item.
    • Simple, fast, requires no training.
  • Method 2: Soft Search

    • Embed items to get vectors.
    • Use the candidate item vector as a query and perform $k$-nearest neighbor search, keeping the $k$ nearest items in LastN.
    • Better performance, more complex to implement.
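A minimal sketch of hard search (field names and the data layout are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    item_id: int
    category: str

def hard_search(last_n: list[Interaction], candidate_category: str, k: int) -> list[Interaction]:
    """Keep only LastN items in the candidate's category, then truncate to at most k."""
    same_category = [it for it in last_n if it.category == candidate_category]
    return same_category[:k]

last_n = [Interaction(1, "sports"), Interaction(2, "news"), Interaction(3, "sports")]
top_k = hard_search(last_n, candidate_category="sports", k=2)   # -> items 1 and 3
```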

Step 2: Attention Mechanism

Using Temporal Information

  • Let $\delta$ be the time elapsed since the user interacted with a LastN item.
  • Discretize $\delta$, then embed it to get vector $\mathbf{d}$.
  • Concatenate the two vectors to represent a LastN item:
    • Vector $\mathbf{x}$ is the item embedding.
    • Vector $\mathbf{d}$ is the time embedding.
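A minimal sketch of the time feature (the bucketing scheme and all names are illustrative assumptions):

```python
import torch
import torch.nn as nn

num_buckets, item_dim, time_dim = 16, 64, 8
time_emb = nn.Embedding(num_buckets, time_dim)

def bucketize_delta(delta_hours: torch.Tensor) -> torch.Tensor:
    # Discretize elapsed time, e.g. into log-spaced buckets (an illustrative choice).
    return torch.clamp(torch.log2(delta_hours + 1).long(), max=num_buckets - 1)

item_vecs = torch.rand(4, 50, item_dim)               # x: LastN item embeddings
delta = torch.randint(1, 5000, (4, 50)).float()       # hours since each interaction
d_vecs = time_emb(bucketize_delta(delta))             # d: time embeddings, (4, 50, time_dim)
last_n_repr = torch.cat([item_vecs, d_vecs], dim=2)   # concatenated LastN representation
```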

Why does SIM use temporal information?

  • DIN has a short sequence, recording the user's recent behavior.
  • SIM has a long sequence, recording the user's long-term behavior.
  • The more distant the interaction in time, the less important it is.

Conclusions

  • Long sequences (long-term interests) outperform short sequences (recent interests).
  • Attention mechanism outperforms simple averaging.
  • Soft search vs. Hard search? Depends on engineering infrastructure.
  • Using temporal information provides improvement.
