
Building a RAG System for Searching Through Cybersecurity News using AI

In today’s cybersecurity world, staying ahead of the latest threats means reading through an overwhelming number of news articles. A Retrieval-Augmented Generation (RAG) system can help by combining semantic search with AI-generated summaries, making it easier to extract insights from large amounts of unstructured data. In this post, I’ll walk through how to create such a system with short code examples.


RAG vs. Keyword Search vs. Regular LLM

Traditional keyword search can miss the context of articles and return either too many or too few results. A RAG system can help with:

  • Understanding context: It retrieves relevant articles based on semantic similarity.
  • Generating summaries: It uses an LLM to produce coherent, context-aware answers.
  • Streamlining research: It enables fast retrieval of cybersecurity news, ensuring no critical update is missed.
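To see the first limitation concretely, here is a toy sketch (with made-up article snippets) of how a naive keyword match misses semantically relevant results:

```python
articles = [
    "Ransomware gang encrypts hospital files and demands payment",
    "New phishing campaign targets bank customers",
]

def keyword_search(query, docs):
    # Naive keyword match: a document is returned only if it
    # contains the literal query term
    return [d for d in docs if query.lower() in d.lower()]

# Asking about "malware" returns nothing, even though the first
# article clearly describes malware (ransomware) activity; a
# semantic search over embeddings would still surface it.
print(keyword_search("malware", articles))  # → []
```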

Key Components of a RAG

  1. Document Store: A regular database holding the raw news articles in large amounts.
  2. Vector Database: Stores article embeddings (e.g., using ChromaDB).
  3. Embedding Model: Converts text into numerical vectors (e.g., OpenAI’s text-embedding-ada-002).
  4. Large Language Model (LLM): Generates summaries and answers to questions (e.g., GPT-4).
  5. Data Ingestion Pipeline: Fetches and processes news articles.
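The ingestion pipeline (component 5) typically splits long articles into overlapping chunks before embedding, so that retrieval returns focused passages rather than entire articles. A minimal sketch of such a splitter (the size and overlap values here are arbitrary illustrative choices):

```python
def chunk_text(text, size=500, overlap=50):
    # Split text into windows of `size` characters, each overlapping
    # the previous one by `overlap` characters so that sentences are
    # not cut off without any shared context
    chunks = []
    start = 0
    step = size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks
```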

Setting Up A Vector Database

A vector database stores embeddings that represent news articles. For example, using ChromaDB:

import os
import chromadb
from chromadb.config import Settings

os.environ["CHROMA_PERSIST_DIRECTORY"] = "./data/chroma"

client = chromadb.PersistentClient(
    path=os.environ["CHROMA_PERSIST_DIRECTORY"],
    settings=Settings(anonymized_telemetry=False, allow_reset=True, is_persistent=True)
)
collection = client.get_or_create_collection(name="cyber_articles")

This code initializes a persistent ChromaDB client and creates a collection for news articles.


Generating Embeddings

An embedding model converts text into vectors. Here’s a short example using LangChain’s OpenAI wrapper (which provides the embed_query method used below):

from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")

text = "Latest cybersecurity breach reported..."
embedding = embedding_model.embed_query(text)
print("Embedding vector:", embedding)

This snippet shows how to obtain a numerical representation of an article’s content.
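Semantic search works by comparing such vectors, usually via cosine similarity. A minimal, dependency-free sketch of the metric:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 for identical
    # direction, 0.0 for orthogonal (unrelated) vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
```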


Adding Articles to the Vector Store

Embeddings need to be stored in the vector database. Because text can be transformed into a vector but a vector cannot be transformed back into text, the original document needs to be stored alongside its embedding. Metadata also helps keep answers in context and provides valid sources to avoid hallucinations.

doc_id = "article_123"
metadata = {"title": "Cybersecurity Breach Alert", "date": "2025-01-13", "source": "NewsFeed"}
document = "Article content here..."

# Add document to the vector store
collection.add(
    ids=[doc_id],
    documents=[document],
    metadatas=[metadata],
    embeddings=[embedding]
)

This simple code block demonstrates how to add a document to the vector database along with its metadata and embedding.
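One practical detail: the doc_id should be deterministic, so that re-ingesting the same article overwrites the existing entry instead of creating a duplicate. A hypothetical helper (not from the original setup) that derives a stable ID from the article URL:

```python
import hashlib

def make_doc_id(url):
    # Hash the article URL so the same article always maps to the
    # same ID, no matter when it is ingested
    return "article_" + hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

print(make_doc_id("https://example.com/breach-alert"))
```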


Querying the Vector Store

To search for relevant articles, convert the query into an embedding and retrieve matching documents:

query_text = "What are dangerous Russian threat actors?"
query_embedding = embedding_model.embed_query(query_text)

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=4,
    include=["documents", "metadatas"]
)

print("Search results:", results)

This example retrieves the top four documents that semantically match the query.
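Since the metadata travels with each result, it can be folded into the prompt so the model can cite its sources. A small hypothetical helper that formats ChromaDB’s query() result dictionary into a citable context block:

```python
def format_context(results):
    # Pair each retrieved document with its source and date so the
    # LLM can attribute claims instead of hallucinating sources
    lines = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        lines.append(f"[{meta['source']}, {meta['date']}] {doc}")
    return "\n".join(lines)

# Example with a hand-made result in ChromaDB's query() shape
sample = {
    "documents": [["APT28 targeted energy firms...", "New ransomware wave..."]],
    "metadatas": [[
        {"source": "NewsFeed", "date": "2025-01-13"},
        {"source": "ThreatWire", "date": "2025-01-10"},
    ]],
}
print(format_context(sample))
```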


Generating a Summarized Response

Finally, combine the retrieved articles to generate a contextual summary using an LLM (here via LangChain’s ChatOpenAI wrapper, which provides the invoke method used below):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4")

prompt = f"""
You are a cybersecurity expert. Given the following context from news articles:
{results['documents'][0]}
Answer the question: {query_text}
"""

response = llm.invoke(prompt)
print("Generated Answer:", response.content)

This snippet creates a prompt for the LLM that includes the context from retrieved articles and the original query, then prints the generated answer.
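Putting the pieces together, the whole retrieve-then-generate flow can be wrapped in one function. The dependencies are passed in explicitly so the sketch stays self-contained; the interfaces (embed_query, query, invoke) mirror the ones used in the snippets above:

```python
def answer_question(question, embedding_model, collection, llm, k=4):
    # 1. Embed the question
    query_embedding = embedding_model.embed_query(question)
    # 2. Retrieve the k most similar articles
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        include=["documents", "metadatas"],
    )
    # 3. Build the prompt from the retrieved context
    context = "\n".join(results["documents"][0])
    prompt = (
        "You are a cybersecurity expert. Given the following context "
        f"from news articles:\n{context}\nAnswer the question: {question}"
    )
    # 4. Generate the answer
    return llm.invoke(prompt).content
```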


Conclusion

Building a RAG system for cybersecurity news helps automate the process of information retrieval and summarization. With a vector database, an embedding model, and a generative LLM, it is possible to quickly sift through unstructured articles to stay informed about the latest threats.

Feel free to copy, modify, and expand these examples to suit your needs. Happy coding and stay secure!

Lars Ursprung
Author