Alex Ndungu

CTO + Software Engineer + ML Engineer

Backend systems, machine learning retrieval, and clean product-minded engineering for teams that care about reliability.

GitHub | LinkedIn | LeetCode | alexmeta517@gmail.com
Hybrid ML Search & Retrieval System

ML Search Engine

Improved search relevance by combining tag prediction with semantic retrieval over roughly 45,000 StackOverflow records.

Problem statement

Search results from raw lexical matching often miss the intent behind short technical queries, especially when users omit the vocabulary used in the source documents.

Architecture breakdown

I trained a scikit-learn SGDClassifier with log loss (a logistic regression model fit by stochastic gradient descent) to predict contextual tags, then fed those predictions back into the retrieval stage so the search engine could enrich a query before similarity scoring.

  • HTML cleaning, normalization, and tokenization pipeline
  • TF-IDF feature extraction tuned for technical language
  • SGDClassifier for tag prediction and contextual query enhancement
  • Cosine similarity retrieval over transformed query vectors
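
The first stage in the list above (HTML cleaning, normalization, tokenization) could look roughly like the sketch below. It is a minimal illustration using only the Python standard library, not the project's actual code: the function names `clean_html`, `normalize`, `tokenize`, and `preprocess`, as well as the token pattern, are assumptions made for the example.

```python
import html
import re

# Hypothetical patterns for illustration; the real pipeline's rules are not shown in the source.
TAG_RE = re.compile(r"<[^>]+>")
# Keeps technical terms such as "c++", "c#", "node.js", and "scikit-learn" as single tokens.
TOKEN_RE = re.compile(r"[a-z0-9_]+(?:[.+#-][a-z0-9_+#]+)*[+#]*")


def clean_html(raw: str) -> str:
    """Replace HTML tags with spaces, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", raw))


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def tokenize(text: str) -> list[str]:
    """Split normalized text into tokens, dropping stray punctuation."""
    return TOKEN_RE.findall(text)


def preprocess(raw: str) -> list[str]:
    """Run the three steps in order: clean, normalize, tokenize."""
    return tokenize(normalize(clean_html(raw)))


if __name__ == "__main__":
    post = "<p>How do I parse <code>JSON</code> in C++ &amp; Node.js?</p>"
    print(preprocess(post))
    # ['how', 'do', 'i', 'parse', 'json', 'in', 'c++', 'node.js']
```

Tags are stripped before entities are decoded so that an escaped `&lt;` inside a code snippet is not mistaken for markup.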

Tech stack explanation

Python, Pandas, scikit-learn, TF-IDF, NLP preprocessing, Cosine similarity

System diagram

[ User Query ]
      |
      v
[ Clean + Normalize ]
      |
      +--> [ TF-IDF Vector ] ---> [ Similarity Search ]
      |
      +--> [ SGD Tag Predictor ] ---> [ Query Expansion ]
                                      |
                                      v
                              [ Re-ranked Results ]

Key challenges

The project is a machine learning retrieval system built on roughly 45,000 StackOverflow records. It combines text preprocessing, TF-IDF vectorization, SGD-based tag prediction, and cosine similarity search.

  • Combined classification and retrieval into a single practical search workflow.
  • Created a more explainable retrieval pipeline than a black-box-only ranking model.
  • Showed strong overlap between ML modeling and product-oriented search relevance.

What I learned

Hybrid pipelines often outperform single-technique systems when the user problem is nuanced.
Preprocessing quality can matter more than model novelty in retrieval tasks.
Classification can be useful as context generation, not just as a final output.