FalconGraph Search
Finding the right BGSU resource means digging through dozens of disconnected pages and PDFs. FalconGraph turns that sprawl into a single graph you can query.
University resources live scattered across hundreds of pages, PDFs, and DOCX files with no unifying index. A keyword search returns links; it doesn't answer the question.
The goal was an answer engine: crawl those resources, model how they link together, and have an LLM answer with citations grounded in retrieved text, so every answer points back to a real source instead of a hallucination.
C++ crawler
A multi-threaded C++ crawler built on OpenMP reads a central pipeline.json (seed URLs, thread count, output paths) and writes raw HTML and PDFs to the configured output paths. Thread-safe shared queues keep threads from fetching the same URL twice.
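The crawler itself is a C++/OpenMP binary. The sketch below uses Python only to illustrate the deduplicating shared queue described above. The class name Frontier and its methods are hypothetical, not taken from the project.

```python
import threading
from collections import deque


class Frontier:
    """Thread-safe URL queue that hands out each URL at most once."""

    def __init__(self, seeds):
        self._lock = threading.Lock()
        self._seen = set()      # every URL ever enqueued
        self._queue = deque()   # URLs still waiting to be fetched
        for url in seeds:
            self.push(url)

    def push(self, url):
        """Enqueue url unless it was already seen; return True if enqueued."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            self._queue.append(url)
            return True

    def pop(self):
        """Return the next URL, or None when the queue is currently empty."""
        with self._lock:
            return self._queue.popleft() if self._queue else None
```

Checking and updating the seen-set under the same lock as the enqueue is what prevents two threads from both deciding a URL is new.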
Clean & graph
A Python pass extracts text from HTML/PDF/DOCX into checkpointed clean_nodes.json and clean_edges.json, then reconstructs a directed link graph of how resources reference one another.
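A minimal sketch of the graph reconstruction step, assuming each node record carries an "id" and each edge record carries "source" and "target" ids. The actual schema of clean_nodes.json and clean_edges.json is not given here, so those field names and the function name are illustrative.

```python
from collections import defaultdict


def build_link_graph(nodes, edges):
    """Return (out_links, in_links) adjacency maps keyed by node id."""
    known = {node["id"] for node in nodes}
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for edge in edges:
        src, dst = edge["source"], edge["target"]
        # Skip self-links and edges that leave the crawled set.
        if src not in known or dst not in known or src == dst:
            continue
        out_links[src].add(dst)
        in_links[dst].add(src)
    return dict(out_links), dict(in_links)
```

Keeping both directions makes it cheap to ask what a page links to and what links to it, which is the context a graph preview would need.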
Embed & index
Node content is chunked, embedded with all-MiniLM-L6-v2 or text-embedding-3-small, and stored in a FAISS index, so retrieval ranks by semantic similarity rather than keyword overlap.
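The chunking strategy is not specified, so the sketch below assumes fixed-size word windows with overlap. The brute-force cosine ranking is a dependency-free stand-in for the FAISS lookup, not the FAISS API. The function names and default sizes are hypothetical.

```python
import math


def chunk_text(text, size=200, overlap=40):
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunk vectors most similar to query_vec."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]
```

The overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk.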
RAG answer
A FastAPI /search endpoint retrieves and ranks the top snippets via FAISS, then has the LLM generate a grounded summary with inline citations back to source pages.
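One way the retrieved snippets could be turned into a grounded prompt, sketched under assumptions: each snippet is a dict with "url" and "text" keys, and citations are bracketed source numbers. The prompt wording and the function name are illustrative, not the project's actual prompt.

```python
def build_grounded_prompt(question, snippets):
    """Number each snippet and instruct the model to cite by number."""
    sources = []
    for n, snippet in enumerate(snippets, start=1):
        sources.append(f"[{n}] {snippet['url']}\n{snippet['text']}")
    return (
        "Answer the question using only the sources below. "
        "Cite each claim with its source number, e.g. [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        + "\n\n".join(sources)
        + f"\n\nQuestion: {question}"
    )
```

Numbering the sources in the prompt gives the response a stable key to map each inline citation back to a source URL.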
Next.js UI
The frontend renders markdown answers, the list of cited sources, and a graph preview of the surrounding link context, so the answer is auditable, not a black box.
C++ for the crawler, Python for everything else
The crawl is the throughput bottleneck, so it got a multi-threaded C++ binary with OpenMP. The cleaning, graph, and RAG layers prioritize iteration speed, so they stayed in Python. Each language does the job it's best at.
Checkpointed cleaning
Crawls and parsing get interrupted. Writing clean_nodes/clean_edges with checkpoints means a failed run resumes instead of starting over, which matters under a hackathon clock.
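A minimal sketch of resume-from-checkpoint, assuming cleaned nodes are stored as a JSON list of records keyed by a "url" field. The record shape, the helper names, and the save interval are assumptions rather than the project's actual layout.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return previously cleaned nodes keyed by URL, or {} on a fresh run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return {node["url"]: node for node in json.load(fh)}


def save_checkpoint(path, nodes_by_url):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(list(nodes_by_url.values()), fh)
    os.replace(tmp_path, path)


def clean_all(raw_docs, clean_fn, path, every=50):
    """Clean raw_docs (url -> raw text), skipping URLs already checkpointed."""
    done = load_checkpoint(path)
    since_save = 0
    for url, raw in raw_docs.items():
        if url in done:
            continue  # finished in an earlier run
        done[url] = {"url": url, "text": clean_fn(raw)}
        since_save += 1
        if since_save >= every:
            save_checkpoint(path, done)
            since_save = 0
    save_checkpoint(path, done)
    return done
```

Rerunning clean_all after a crash only pays for documents processed since the last save, at most `every` of them.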
Citations as a hard requirement
Every generated summary is grounded in retrieved snippets and links back to its source. The graph view exists so a reader can trust the answer by tracing it, not just reading it.
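One way such a requirement could be enforced after generation, assuming answers cite sources with bracketed numbers like [1]. That convention and the function name are assumptions. The check only confirms that every citation points at a retrieved source, not that the cited text supports the claim.

```python
import re


def check_citations(answer, num_sources):
    """Return (ok, invalid): ok is True when the answer cites at least one
    source and every cited number falls within 1..num_sources."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    invalid = sorted(n for n in cited if not 1 <= n <= num_sources)
    return bool(cited) and not invalid, invalid
```

An answer that fails the check can be regenerated or withheld rather than shown without a traceable source.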
Built with