Reducing data transfer with semijoins

In distributed query processing, a semijoin strategy does which of the following to minimize communication cost?

Difficulty: Medium

Correct Answer: Sends only the join attributes to a remote site and then returns only the required matching rows.

Explanation:


Introduction / Context:
Network I/O is a major cost in distributed joins. A classic optimization is the semijoin, which uses projections on join attributes to filter remote relations before transferring full tuples, thereby cutting down data shipped across the network.


Given Data / Assumptions:

  • Two relations R and S reside at different sites.
  • We need R ⋈ S on some join attributes (e.g., R.k = S.k).
  • Goal: minimize data transferred while preserving correctness.


Concept / Approach:

In a semijoin, the initiating site sends only the distinct join attribute values (π_k(R)) to the remote site holding S. The remote site filters S to S’ = σ_{k ∈ π_k(R)}(S) and returns only the matching rows (or sometimes just their keys). This preselection step avoids shipping irrelevant tuples from S that would not contribute to the final join.


Step-by-Step Solution:

1) Project join attributes from the first relation.
2) Ship this small set of key values to the remote site.
3) Filter the remote relation by those keys to obtain only relevant rows.
4) Return the reduced set (or keys) to complete the join with the local relation.
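
The four steps can be sketched in a few lines of Python. This is a minimal single-process simulation, not a real distributed executor: the two "sites" are plain lists of dictionaries, the network transfer is just a function call, and the names semijoin_reduce and distributed_join are hypothetical helpers introduced for illustration only.

```python
from typing import Dict, Hashable, List, Set, Tuple

Row = Dict[str, Hashable]


def semijoin_reduce(s_rows: List[Row], key: str, shipped_keys: Set[Hashable]) -> List[Row]:
    """Runs at the remote site: keep only rows of S whose join key was shipped."""
    return [row for row in s_rows if row[key] in shipped_keys]


def distributed_join(r_rows: List[Row], s_rows: List[Row], key: str) -> Tuple[List[Row], int, int]:
    """Simulate R join S using a semijoin; also report how much was 'shipped'."""
    # Step 1: project the distinct join-key values from R (a set deduplicates).
    keys = {row[key] for row in r_rows}
    # Steps 2 and 3: "ship" the keys and filter S at the remote site.
    reduced_s = semijoin_reduce(s_rows, key, keys)
    # Step 4: the reduced S comes back and is joined locally with R.
    s_by_key: Dict[Hashable, List[Row]] = {}
    for s in reduced_s:
        s_by_key.setdefault(s[key], []).append(s)
    joined = [{**r, **s} for r in r_rows for s in s_by_key.get(r[key], [])]
    return joined, len(keys), len(reduced_s)


# Hypothetical sample data: only k = 2 appears in both relations.
R = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 2, "a": "z"}]
S = [{"k": 2, "b": "p"}, {"k": 3, "b": "q"}, {"k": 4, "b": "r"}, {"k": 2, "b": "s"}]

joined, keys_shipped, s_rows_returned = distributed_join(R, S, "k")
print(keys_shipped, s_rows_returned, len(joined))  # 2 2 4
```

In this toy run, two key values go out and two of the four S rows come back, instead of all four S rows being shipped.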


Verification / Alternative check:

Semijoin-based plans are a standard technique in distributed query optimization. They pay off most when the semijoin is highly selective (only a small fraction of S matches a key from R) and the projected join attribute values are much smaller than full tuples.
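
A back-of-envelope cost comparison makes this check concrete. The sketch below assumes a deliberately simple model in which cost is just bytes shipped; it ignores per-message overhead and local processing, and all sizes and row counts are hypothetical.

```python
def full_ship_cost(s_rows: int, s_row_bytes: int) -> int:
    """Bytes transferred if all of S is shipped to the site holding R."""
    return s_rows * s_row_bytes


def semijoin_cost(distinct_keys: int, key_bytes: int,
                  matching_s_rows: int, s_row_bytes: int) -> int:
    """Bytes transferred by a semijoin: keys out, matching S rows back."""
    return distinct_keys * key_bytes + matching_s_rows * s_row_bytes


# Hypothetical numbers: S has 100,000 rows of 200 bytes each.
baseline = full_ship_cost(100_000, 200)               # 20,000,000 bytes

# Selective case: 1,000 distinct 8-byte keys, 5,000 matching S rows.
selective = semijoin_cost(1_000, 8, 5_000, 200)       # 1,008,000 bytes

# Unselective case: 1,000,000 distinct keys and every S row matches.
unselective = semijoin_cost(1_000_000, 8, 100_000, 200)  # 28,000,000 bytes

print(selective < baseline, unselective < baseline)  # True False
```

Under these assumed numbers the semijoin ships about 5% of the baseline in the selective case, but costs more than shipping S outright when nearly everything matches and the key set itself is large.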


Why Other Options Are Wrong:

  • Sending all attributes (b/d) increases traffic unnecessarily.
  • Sending join attributes but returning all rows (a) defeats the purpose.
  • Broadcasting all data (e) is worst-case and rarely optimal.


Common Pitfalls:

  • Using semijoins when selectivity is poor, yielding little reduction.
  • Forgetting to deduplicate projected keys before shipping.


Final Answer:

Sends only the join attributes to a remote site and then returns only the required matching rows.
