Hybrid ML Search & Retrieval System

ML Search Engine

A 2-stage hybrid retrieval system trained on 45,000 StackOverflow records: SGD-based tag prediction for query expansion, feeding into TF-IDF vectorization with cosine similarity ranking.

45K+

Training records

2-stage

Hybrid retrieval pipeline

SGD + TF-IDF

Model combination

Problem statement

Raw lexical matching misses intent, especially for short technical queries where the user omits the vocabulary present in the source documents. A single retrieval method can't bridge that gap reliably.
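The vocabulary gap can be shown with a small sketch. The two documents and the query below are invented for illustration and are not taken from the project dataset. The query asks about loading a large dataset, but it shares no terms with the relevant document, so plain TF-IDF cosine similarity scores every document at zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents, made up for this illustration.
docs = [
    "Use pandas read_csv with the chunksize parameter to stream a large file",
    "Center a div horizontally with flexbox justify-content",
]
# The intent matches the first document, but none of the words do.
query = "load huge dataset without running out of memory"

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)

# Query terms outside the fitted vocabulary are ignored, so the vector is empty.
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, doc_matrix)[0]
print(scores)  # both scores are 0.0: no shared vocabulary, no lexical signal
```

With no overlapping terms the ranker cannot prefer the relevant document, which is the gap that predicted tags are meant to narrow.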

Architecture breakdown

I built a 2-stage pipeline. First, an SGDClassifier (log-loss) predicts contextual tags from the query. Those predicted tags are then appended to the query as expansion terms, and TF-IDF + cosine similarity retrieves over the enriched query vector, combining classification signal with retrieval relevance.

- 45,000 StackOverflow records processed through an HTML cleaning, normalization, and tokenization pipeline
- 2-stage hybrid pipeline: SGD tag prediction for query expansion → TF-IDF + cosine similarity retrieval
- SGDClassifier with log-loss tuned for multi-label technical tag prediction
- Cosine similarity retrieval over TF-IDF transformed query vectors for final ranking

Tech stack explanation

Python, Pandas, scikit-learn, TF-IDF, NLP preprocessing, Cosine similarity

System diagram

[ User Query ]
      |
      v
[ Clean + Normalize ]
      |
      +--> [ TF-IDF Vector ] ---> [ Similarity Search ]
      |
      +--> [ SGD Tag Predictor ] ---> [ Query Expansion ]
                                      |
                                      v
                              [ Re-ranked Results ]

Key challenges

A machine learning search system built on ~45,000 StackOverflow records. The key insight was that a single retrieval technique misses intent, so the pipeline runs in 2 stages: classify the query to predict missing context tags, then add those predicted tags to the query to improve the similarity search.

- Combined classification and retrieval into a 2-stage pipeline on a 45K-record dataset, showing ML system design, not just model fitting.
- Built a more explainable retrieval system than a black-box ranker by making tag prediction an explicit, inspectable step.
- Demonstrated strong overlap between classical ML modeling and product-oriented search relevance.

What I learned

Hybrid pipelines often outperform single-technique systems when the user problem is nuanced.

Preprocessing quality can matter more than model novelty in retrieval tasks.
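As a rough illustration of the cleaning, normalization, and tokenization steps, here is a standard-library sketch. The function names, the token pattern, and the sample input are assumptions made for this example and are not the project's actual preprocessing code. The pattern keeps characters such as `+`, `#`, and `.` inside tokens so that technical terms like `c#` and `node.js` survive.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags; entities are decoded by default."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(raw):
    """Strip HTML tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


# Hypothetical token pattern: letters and digits plus a few symbols common in tech terms.
TOKEN_RE = re.compile(r"[a-z0-9][a-z0-9+#.\-_]*")


def tokenize(text):
    """Lowercase, extract tokens, and trim trailing punctuation such as a final period."""
    tokens = (tok.rstrip(".-_") for tok in TOKEN_RE.findall(text.lower()))
    return [tok for tok in tokens if tok]


def preprocess(raw):
    return tokenize(clean_html(raw))


sample = "<p>Parse JSON in <code>Node.js</code> &amp; C#.</p>"
print(preprocess(sample))
```

A tokenizer that splits on every non-letter would turn `C#` into `c` and `Node.js` into two tokens, which is one way preprocessing choices can affect retrieval more than the model does.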

Classification can be useful as context generation, not just as a final output.
