A 2-stage hybrid retrieval system trained on 45,000 StackOverflow records — SGD-based tag prediction for query expansion feeding into TF-IDF vectorization with cosine similarity ranking.
45K+
Training records
2-stage
Hybrid retrieval pipeline
SGD + TF-IDF
Model combination
Problem statement
Raw lexical matching misses intent, especially for short technical queries where the user omits the vocabulary present in the source documents. A single retrieval method can't bridge that gap reliably.
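To make the vocabulary gap concrete, here is a minimal standard-library sketch. The document text, the queries, and the `cosine` helper are all hypothetical illustrations, not taken from the project's data or code. It scores a query against a document with plain bag-of-words cosine similarity, first as typed and then with two hand-written expansion terms added.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two whitespace-tokenized strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical document and queries, for illustration only.
doc = "pandas merge dataframe on key column"

# The short query shares no tokens with the document, so the score is zero.
print(cosine("combine two tables", doc))

# Adding the missing vocabulary (done by hand here) gives a non-zero score.
print(cosine("combine two tables pandas merge", doc))
```

The first score is exactly zero because no token overlaps, even though the intent matches. The second is positive only because the missing terms were supplied, which is the job the tag predictor takes on in the pipeline.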
Architecture breakdown
I built a 2-stage pipeline: an SGDClassifier (log-loss) predicts contextual tags from the query, those predictions are injected back as query expansion, and TF-IDF + cosine similarity then retrieves over the enriched query vector — combining classification signal with retrieval relevance.
Tech stack explanation
System diagram
[ User Query ]
|
v
[ Clean + Normalize ]
|
+--> [ TF-IDF Vector ] ---> [ Similarity Search ]
|
+--> [ SGD Tag Predictor ] ---> [ Query Expansion ]
|
v
[ Re-ranked Results ]
Key challenges
This is a machine learning search system built on ~45,000 StackOverflow records. The key insight was that a single retrieval technique misses intent, so the pipeline runs in 2 stages: first classify the query to predict the context tags it is missing, then add those predicted tags to the query to improve the similarity search.
What I learned