A QA finetuned RAG chatbot that answers questions about Singapore University of Technology and Design (SUTD)
Mar 25 - Apr 25
Description
We scraped the Singapore University of Technology and Design (SUTD) website and built a QA finetuned RAG system to answer questions about it.
Technologies
LLM and LangChain
Vector Stores Embedding
Chunking
RAG
Zyte Web Crawler
As part of my coursework, my teammates and I built a QA finetuned RAG chatbot that answers questions about the Singapore University of Technology and Design (SUTD). Through this project, I learnt a lot about LLM behaviours, how they react to different embeddings, chunking, finetuning, prompts and much more.
We set out with the objective to help prospective students get a better understanding of SUTD as information about the university can be hard to find and understand. We had hoped that this chatbot will be able to answer questions such as:
Before we start building the chatbot, we had to gather data and build up our knowledge base. We found out about the Zyte Web Crawler and used it to scrape the SUTD website.
I looked at the SUTD robot.txt and sitemap and decided to scrape all the pages with relevant paths such as /admissions
, `/undergraduate’ etc.
I then proceeded to scrap the websites and managed to retrieved more than 2,170 pages of data to build our knowledge base.
After scraping the data in HTML, we converted it to Markdown so that the text is easier to read with less noise. Additionally, we noticed that there are unnecessary information such as headers, footers and navigation bars so we removed those.
We then chunked the data using header-based chunking so that it will work better. We also made sure to add any necessary metadata such as the title of the page and respective headers so that the chunks will have more context.
Finally, we used sentence-transformers/all-mpnet-base-v2 as our embedding model and FAISS as our vector store. These were chosen as they are fast and efficient and good enough for our use case.
We then built our first RAG system using Llama 3.2 3B-Instruct and compared it with the non-RAG base model.
We can see that the RAG system is able to retrieve relevant information from the knowledge base and answer the questions to a good degree.
For comparison, this is an answer from the non-RAG model:
We can see that it cannot even identify SUTD as a university and is confusing it with another university in Singapore called NTU.
Now that we have a working RAG system, we wanted to further improve the performance of the chatbot by finetuning it with context of SUTD and also tested to see if a smaller finetuned model will be able to perform well.
We used the Llama 3.2 1B and first finetuned it on the Alpaca dataset so that it can learn instructed tasks. We then finetuned it again on a question and answer dataset. For this, we finetuned about 9% of the model parameters using unsloth.
My friends then generated a synthetic dataset of questions and answers pair using Gemini 2.0 API with web search using Tavily.
The generated dataset can be found on my friend`s hugging face dataset here: Keith’s QA Dataset
Finally, we finetuned our model and the models can be found here on my hugging face: Our Finetuned Model.
To evaluate the performance of our model, we first manually evaluated based on Accuracy of Answer, Relevance of Retrieved Context and Groundedness of the Answer.
We found that the finetuned RAG model overall performed better than the fintuned non-RAG model and base RAG model which is what we wanted!
As an added bonus, we decided to also evaluate the performance of our model using LLM as a judge. I wrote up a simple prompt and asked Gemini 2.0 to evaluate the performance of our model based on the same criteria as above.
In 15/20 questions, the LLM agreed that the finetuned RAG model performed better than the finetuned non-RAG model and base RAG model and provided good reasoning for it. This further confirms our manual evaluation was correct and that we have succeeded in our task!
This was an interesting exercise and I got to better understand the behaviours of smaller LLM models and how they can be improved using methods such as finetuning and RAG. I also learnt a lot about the importance of chunking and embeddings in building a good RAG system.