Improved search relevance by combining tag prediction with semantic retrieval on a StackOverflow-scale dataset.
Problem statement
Search results from raw lexical matching often miss the intent behind short technical queries, especially when users omit the vocabulary used in the source documents.
Architecture breakdown
I used an SGDClassifier with log-loss to predict contextual tags, then injected those predictions back into the retrieval stage so the search engine could enrich a query before similarity scoring.
Tech stack explanation
System diagram
[ User Query ]
|
v
[ Clean + Normalize ]
|
+--> [ TF-IDF Vector ] ---> [ Similarity Search ]
|
+--> [ SGD Tag Predictor ] ---> [ Query Expansion ]
|
v
[ Re-ranked Results ]
Key challenges
The system is a machine learning retrieval pipeline built on roughly 45,000 StackOverflow records. It combines text preprocessing, TF-IDF vectorization, SGD-based tag prediction, and cosine similarity search.
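As a rough illustration of the TF-IDF and cosine similarity stages, the retrieval step might look like the sketch below. The three sample documents, the helper names `build_index` and `search`, and the vectorizer settings are assumptions for the example; the project's real preprocessing and parameters are not shown in this write-up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the StackOverflow records.
DOCS = [
    "How do I merge two dataframes on a shared column",
    "Center a div horizontally using flexbox",
    "Join two tables and filter rows with null values",
]


def build_index(docs):
    """Fit a TF-IDF vectorizer and return it with the document-term matrix."""
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    matrix = vectorizer.fit_transform(docs)
    return vectorizer, matrix


def search(query, vectorizer, matrix, top_n=3):
    """Return (doc_index, score) pairs sorted by descending cosine similarity."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, matrix).ravel()
    order = np.argsort(scores)[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in order]


if __name__ == "__main__":
    vec, mat = build_index(DOCS)
    for idx, score in search("merge dataframes", vec, mat):
        print(f"{score:.3f}  {DOCS[idx]}")
```

Because the query is transformed with the same fitted vectorizer as the documents, any tags appended during query expansion only affect scoring if they also appear in the indexed vocabulary.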
What I learned