This repository contains the full pipeline for my Master's thesis at Yerevan State University (YSU), developed as part of the Data Science for Business master's program. The goal of this project is to build an end-to-end Retrieval-Augmented Generation (RAG) system using semantic search, LLMs, and fine-tuned embeddings to answer questions on Armenian banks’ financial PDF documents.
- 🔎 PDF-to-JSON conversion using `pdfplumber`
- ✂️ Text chunking with `spaCy` or `nltk`
- 🧠 Embedding with SentenceTransformers (with optional fine-tuning)
- 📚 Semantic retrieval using FAISS with keyword boosting
- 🤖 Query response generation via LLaMA 3 (via Ollama)
- 📊 Evaluation metrics: BLEU, ROUGE, BERTScore, cosine similarity
- 📈 Fine-tuning support with MRR and Recall@5 evaluation
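To give a feel for the "semantic retrieval with keyword boosting" idea, here is a minimal stdlib-only sketch. The actual logic lives in `modules/retriever.py` and uses FAISS; the scoring weights and example vectors below are purely illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, query_terms, chunks, top_k=5, boost=0.1):
    """Rank chunks by embedding similarity plus a small bonus
    for each query keyword found in the chunk text."""
    scored = []
    for text, vec in chunks:
        score = cosine(query_vec, vec)
        score += boost * sum(term in text.lower() for term in query_terms)
        scored.append((score, text))
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

# Toy 2-D "embeddings" standing in for real SentenceTransformer vectors
chunks = [
    ("Total equity grew in 2022.", [0.9, 0.1]),
    ("The bank opened a new branch.", [0.2, 0.8]),
]
print(retrieve([1.0, 0.0], ["equity"], chunks, top_k=1))
```

In the real pipeline the vector search is delegated to a FAISS index for speed; the boosting term is the same idea of nudging scores for chunks that contain literal query keywords.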
- Python 3.8+
- Ollama installed and running locally
- A model available in Ollama (e.g., `llama3`, `mistral`, etc.)
- (Optional) GPU support for faster embedding and inference
```
YSU-DSB-thesis
├── data/
│   └── json files
│
├── data_process/          # Scripts for preprocessing PDF files
│   ├── pdf2json.py
│   └── comb_json.py
│
├── modules/               # Core pipeline modules
│   ├── chunker.py
│   ├── embedder.py
│   ├── retriever.py
│   ├── rag_engine.py
│   ├── eval_pipeline.py
│   └── fine_tuning_engine.py
│
├── testing/               # Tests for key modules
│   ├── test_chunks.py
│   ├── test_embedder.py
│   ├── test_evaluation.py
│   └── test_vector_database.py
│
├── main.py                # Main script to run the full pipeline
├── config.json
├── .gitignore
├── requirements.txt
├── setup.sh
└── README.md
```
- Clone the repository:

  ```bash
  git clone https://github.com/annpetrosiann/YSU_DSB_thesis.git
  cd YSU_DSB_thesis
  ```

- Create a virtual environment (optional):

  ```bash
  python3 -m venv {virtual_environment_name}
  ```

- Activate the virtual environment:

  ```bash
  # On Windows
  {virtual_environment_name}\Scripts\activate

  # On macOS or Linux
  source {virtual_environment_name}/bin/activate
  ```
- Install dependencies. You have two options:

  - Manual install, using the provided requirements.txt:

    ```bash
    pip install -r requirements.txt --upgrade
    ```

  - Automatic setup: run the setup.sh script, which installs everything for you:

    ```bash
    bash setup.sh
    ```

- Upgrade sentence-transformers and llama-index (only if you didn't run setup.sh):

  ```bash
  pip install --upgrade sentence-transformers llama-index
  ```

The data is sourced from the financial statements published on the official websites of Armenian banks. According to the Central Bank of Armenia, there are currently 18 licensed banks operating in the country. This project includes data from 17 of them; VTB Bank is excluded because its financial reports were not available in English.
The collected reports include:
- 📄 Annual reports
- 🧾 Audit reports
- 📘 Reports with consolidated notes
- 📊 Interim quarterly reports, which feature:
- Balance sheets
- Income statements
- Statements of changes in equity
- Cash flow statements
- Normative indicators
- Other statutory reports
While every effort was made to collect all available English-language reports, some documents were either missing or not publicly accessible for certain banks or years. The coverage period varies by bank, depending on when they began publishing reports in English. The latest available data across all banks is from 2024.
All PDF files were converted into structured JSON format using the scripts in the data_process directory:

```bash
# Step 1: Convert PDFs to JSON
python3 data_process/pdf2json.py

# Step 2: Combine JSONs into a single structured file
python3 data_process/comb_json.py
```

The final dataset used for training is arm_banks.json, which is not included in this repository due to size and privacy concerns. Feel free to check out the sample JSON files available in the data directory.
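The combination step can be pictured roughly as follows. This is a hedged sketch, not the exact logic of `comb_json.py`: the record schema and the `source_file` field are illustrative assumptions.

```python
import json
from pathlib import Path

def combine_json(input_dir, output_path):
    """Merge the per-PDF JSON files in input_dir into one list,
    tagging each record with its source file name for provenance."""
    combined = []
    for path in sorted(Path(input_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        record["source_file"] = path.name
        combined.append(record)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(combined, f, ensure_ascii=False, indent=2)
    return combined

# Example with a throwaway directory:
if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "ameriabank.json").write_text(json.dumps({"bank": "Ameriabank"}))
        merged = combine_json(tmp, Path(tmp, "arm_banks.json"))
        print(len(merged))  # number of merged reports
```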
Inside the modules directory, you'll find the core components of the pipeline:
| Script | Purpose |
|---|---|
| `chunker.py` | Splits long documents into manageable text chunks using spaCy or nltk |
| `embedder.py` | Converts text chunks into embeddings using SentenceTransformers or HuggingFace |
| `retriever.py` | Retrieves top-k relevant chunks using FAISS and keyword matching |
| `rag_engine.py` | Constructs prompts and queries LLaMA via Ollama |
| `eval_pipeline.py` | Evaluates model output with metrics like BLEU, ROUGE-L, BERTScore, and cosine similarity |
| `fine_tuning_engine.py` | Fine-tunes embedding models and evaluates best configs using MRR and Recall@5 |
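As a stdlib-only approximation of what `chunker.py` does, here is a greedy sentence-packing sketch. The real module uses spaCy or nltk for more robust sentence splitting, and `max_chars` is an illustrative parameter, not the repo's actual setting.

```python
import re

def chunk_text(text, max_chars=200):
    """Greedily pack sentences into chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

report = "Total assets rose. Equity was stable. Net profit doubled."
print(chunk_text(report, max_chars=40))
```

Keeping chunks sentence-aligned matters downstream: embeddings of chunks that end mid-sentence retrieve noticeably worse.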
You can run any of these scripts directly:

```bash
python3 modules/{script_name}.py
```

For example, running the fine-tuning engine:

```bash
python3 modules/fine_tuning_engine.py
```

This will perform a grid search and print the best hyperparameter configuration along with MRR and Recall@5 scores. The best configuration is saved in config.json.
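The two retrieval metrics used in the grid search are simple to state. A small sketch of how they are typically computed (the example rank list is hypothetical, not output of the repo):

```python
def mrr(rankings):
    """Mean Reciprocal Rank: rankings[i] is the 1-based position of the
    correct chunk for query i, or None if it was not retrieved at all."""
    return sum(1.0 / r for r in rankings if r) / len(rankings)

def recall_at_k(rankings, k=5):
    """Fraction of queries whose correct chunk appears in the top k."""
    return sum(1 for r in rankings if r and r <= k) / len(rankings)

ranks = [1, 3, None, 2]       # position of the gold chunk per query
print(mrr(ranks))             # (1 + 1/3 + 0 + 1/2) / 4
print(recall_at_k(ranks, 5))  # 3 of 4 queries hit the top 5
```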
🧪 Unit tests for most modules can be found in the testing directory. These help validate that each script is functioning as expected.
The main.py script orchestrates the entire RAG pipeline:
- Loads and chunks the financial text data
- Fine-tunes or loads a pre-trained embedding model
- Generates embeddings and stores them in a FAISS vector index
- Launches an interactive Q&A interface powered by LLaMA via Ollama
```bash
python3 main.py
```

Once running, you can ask questions about the Armenian banking system, and the RAG system will generate responses based on the financial reports it has indexed.
```text
❓ Your question (or 'exit'): What was Ameriabank’s total equity in 2022?

🦙 LLaMA (Markdown output):
In 2022, Ameriabank's total equity amounted to 'X' AMD, as reported in their annual financial statements.
----------------------------------------------------------------------------------------------------------
⭐️ Do you have a reference answer to evaluate? (y/n):
```
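Conceptually, `rag_engine.py` assembles the retrieved chunks into a prompt before asking the model. A minimal sketch of that step, with a hypothetical prompt template (not the one used in the repo):

```python
def build_prompt(question, chunks):
    """Format retrieved chunks as numbered context for the LLM."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What was total equity in 2022?",
    ["Total equity in 2022 was X AMD.", "Assets grew by 10%."],
)
print(prompt)
```

The assembled prompt is then sent to the locally running Ollama server (by default listening on `http://localhost:11434`), which returns the model's answer.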
If you provide a reference answer, the system will compute evaluation metrics:
- BLEU: Measures n-gram overlap between the generated and reference text.
- ROUGE-L: Evaluates the longest common subsequence (LCS) between texts to capture fluency and informativeness.
- BERTScore: Uses contextual embeddings from BERT to evaluate semantic similarity at a token level.
- Cosine Similarity: Measures overall vector similarity between the generated and reference embeddings.
These metrics help you assess how well the system’s output matches a known correct answer — both syntactically and semantically.
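For intuition, ROUGE-L reduces to a longest-common-subsequence computation over tokens. A stdlib sketch of the F1 variant (the evaluation pipeline itself relies on established metric libraries, not this code):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

print(rouge_l_f1("total equity was X AMD", "total equity was X AMD"))  # 1.0
```

Unlike BLEU's contiguous n-gram overlap, the LCS tolerates gaps, which is why ROUGE-L is often read as a fluency-plus-coverage signal.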
Student: Anahit Petrosyan
Supervisor: Yenok Hakobyan
Master’s Thesis – Data Science for Business, Yerevan State University, 2025.
For questions, feedback, or collaboration opportunities, feel free to connect via LinkedIn.
This project is not currently licensed. All rights are reserved by the author. If you wish to use, reproduce, or reference any part of this work, please contact the author for permission.