Alex Ndungu

CTO + Software Engineer + ML Engineer

Backend systems, machine learning retrieval, and clean product-minded engineering for teams that care about reliability.

GitHub | LinkedIn | LeetCode | alexmeta517@gmail.com

Hybrid ML Search & Retrieval System

ML Search Engine

A 2-stage hybrid retrieval system trained on 45,000 StackOverflow records: SGD-based tag prediction expands the query, and TF-IDF vectorization with cosine similarity ranking retrieves the results.

45K+ training records

2-stage hybrid retrieval pipeline

SGD + TF-IDF model combination

Problem statement

Raw lexical matching misses intent, especially for short technical queries where the user omits the vocabulary present in the source documents. A single retrieval method can't bridge that gap reliably.
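
The vocabulary gap described above can be reproduced in a few lines. This is a minimal sketch, assuming scikit-learn is installed; the two documents and both queries are hypothetical and are not taken from the project's dataset. A short query that shares no terms with the corpus scores zero against every document, while the same query with tag-like terms appended by hand finds the relevant one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical two-document corpus, used only to illustrate the problem.
docs = [
    "Use pandas DataFrame.merge to join two tables on a shared key column",
    "Configure nginx as a reverse proxy for a Flask application",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)


def scores(query):
    """Cosine similarity between the query and every document."""
    return cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]


# The raw query shares no vocabulary with either document.
raw = scores("combine spreadsheets")

# The same query with tag-like terms appended by hand (hypothetical expansion).
expanded = scores("combine spreadsheets pandas merge")

print("raw:", raw)
print("expanded:", expanded)
```

The expansion terms here are typed in manually; predicting them automatically is the job of the first pipeline stage.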

Architecture breakdown

I built a 2-stage pipeline. First, an SGDClassifier (log-loss) predicts contextual tags from the query. Those predictions are then injected back into the query as expansion terms, and TF-IDF + cosine similarity retrieves over the enriched query vector. The result combines classification signal with retrieval relevance.

  • 45,000 StackOverflow records processed through an HTML cleaning, normalization, and tokenization pipeline
  • 2-stage hybrid pipeline: SGD tag prediction for query expansion → TF-IDF + cosine similarity retrieval
  • SGDClassifier with log-loss tuned for multi-label technical tag prediction
  • Cosine similarity retrieval over TF-IDF transformed query vectors for final ranking

Tech stack explanation

Python, Pandas, scikit-learn, TF-IDF, NLP preprocessing, cosine similarity

System diagram

[ User Query ]
      |
      v
[ Clean + Normalize ]
      |
      v
[ SGD Tag Predictor ]
      |
      v
[ Query Expansion ]
      |
      v
[ TF-IDF Vector ] ---> [ Cosine Similarity Search ] ---> [ Ranked Results ]

Key challenges

This is a machine learning search system built on ~45,000 StackOverflow records. The key insight was that a single retrieval technique misses intent, so the pipeline runs in 2 stages: classify the query to predict missing context tags, then use those enriched tags to improve the similarity search.

  • Combined classification and retrieval into a 2-stage pipeline on a 45K-record dataset, showing ML system design rather than just model fitting.
  • Built a more explainable retrieval system than a black-box ranker by making tag prediction an explicit, inspectable step.
  • Demonstrated strong overlap between classical ML modeling and product-oriented search relevance.

What I learned

Hybrid pipelines often outperform single-technique systems when the user problem is nuanced.
Preprocessing quality can matter more than model novelty in retrieval tasks.
Classification can be useful as context generation, not just as a final output.
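
To illustrate the preprocessing point above, here is a minimal sketch of an HTML cleaning, normalization, and tokenization step using only the Python standard library. The regular expressions and the `clean` and `tokenize` helpers are hypothetical choices made for this example, not the pipeline the project actually uses. The token pattern deliberately keeps `+`, `#`, and `.` so that technical terms such as `c++`, `c#`, and `node.js` survive as single tokens.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")          # crude HTML tag matcher
TOKEN_RE = re.compile(r"[a-z0-9+#.]+")   # keeps tokens like c++, c#, node.js


def clean(text):
    """Strip HTML tags, decode entities, and lowercase the text."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return text.lower()


def tokenize(text):
    """Split cleaned text into tokens, dropping stray leading/trailing dots."""
    tokens = (t.strip(".") for t in TOKEN_RE.findall(clean(text)))
    return [t for t in tokens if t]


print(tokenize("<p>Sorting in C++ &amp; Node.js.</p>"))
# → ['sorting', 'in', 'c++', 'node.js']
```

A generic word tokenizer would typically reduce `C++` to `c`, which is the kind of detail that can affect retrieval quality more than the choice of model.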