
yeulih/semanticSimilarity


Semantic Similarity Analysis Using BERT

This repository hosts the files used in a project comparing AI tools used for academic research. Specifically, semantic similarity analysis was performed on search results from different citation-mapping literature tools. The contents of the files are as follows:

  • README.md: Instructions for setting up and running BERT
    The instructions assume that you are familiar with installing software packages and working in a command-line environment.
  • test_bert.py: Python script for running a BERT smoke test
  • compare_pdfs.py: Python script analyzing semantic similarity for two PDFs
  • compare_jsons.py: Python script analyzing semantic similarity for two JSONs

Project overview

The project methodology involves three steps: (1) collecting search results from a set of citation-based tools, given an identical search input; (2) using well-established natural language processing (NLP) techniques to generate numeric representations of the search results; and (3) conducting a quantitative analysis to identify how similar or dissimilar the search results are across the tools.

BERT (Bidirectional Encoder Representations from Transformers) is an LLM that excels at NLP tasks such as sentiment analysis, question answering, and semantic similarity. This project uses Sentence-BERT (SBERT), a pre-trained variant of BERT designed to produce sentence-level embeddings. A tokenizer first converts natural language text into tokens, and the transformer then turns the token representations into an embedding. Two embeddings, each representing a set of search results from one citation-based tool, are compared by calculating their cosine similarity score. The score can range from -1 to 1, although for text embeddings it usually falls between 0 and 1. The higher the score, the closer the results are semantically.
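As a minimal sketch of the comparison step, the following computes cosine similarity using only the Python standard library. The three-dimensional vectors are made-up stand-ins for real embeddings, which have hundreds of dimensions, and the names tool_a, tool_b, and tool_c are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made up for illustration, not real model output).
tool_a = [0.2, 0.7, 0.1]
tool_b = [0.25, 0.65, 0.05]
tool_c = [0.9, 0.05, 0.4]

print(cosine_similarity(tool_a, tool_b))  # close to 1: vectors point the same way
print(cosine_similarity(tool_a, tool_c))  # noticeably lower
```

Because the score depends only on the direction of the vectors, not their length, two result sets of different sizes can still be compared once each is reduced to a single embedding.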

Similarity between two texts, or two sets of search results, is a useful comparative metric because it indicates how consistent the citation-based tools are with one another. For example, if tools A, B, and C generate similar search results, researchers can save time by running a search in only one of them rather than duplicating the effort in all three. Conversely, if tools A, B, and C do not generate similar results, it could be worth reviewing the search results from each tool.

System requirements

PyTorch is an open source deep learning framework. One of its capabilities is enabling LLMs, such as BERT, to be run on a local computer. PyTorch and many supporting open source libraries are publicly available online.

Before proceeding further, review the specified hardware and software requirements to see if your computer is able to run PyTorch.

Recommended hardware requirements:

  • Computer with an Nvidia GPU that supports CUDA (optional if you do not plan to do heavy computing)
  • Intel Core i5 or higher, or AMD Ryzen 7 or higher
  • 8 GB RAM
  • Sufficient disk space for storing language models

Note

The requirements were taken from (https://www.geeksforgeeks.org/python/pytorch-system-requirements). Although the PyTorch website does not list specific hardware requirements, running LLMs does necessitate a nontrivial amount of computing power.

Software requirements

Part 1: Set up and run BERT

To set up and run BERT, perform the following tasks in order:

  1. Install Python.
  2. Set up a Python virtual environment.
  3. Install PyTorch.
  4. Install Hugging Face Transformers.
  5. Run a BERT smoke test (optional).
  6. Save BERT locally for offline use (optional).

Install Python

Install Python from the official site (https://www.python.org/downloads/). Make sure to select the Add python.exe to PATH option on the install screen. pip, a package manager for Python, is automatically installed when using the official Python installer.

Important

Although PyTorch supports Python versions 3.10 to 3.14, installing 3.10 is recommended, because dependency issues in libraries can occur with later versions.

Set up a Python virtual environment

To set up a Python virtual environment:

  1. At a command line, create a project folder. For example: C:\projects\bert-local
  2. In the folder, run python -m venv <env_name> to create a virtual environment. For example: python -m venv .venv
  3. In the same folder, run .\<env_name>\scripts\activate.bat to activate the environment. For example: .\.venv\scripts\activate.bat
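To confirm that the environment is active before installing anything, a short check such as the following can be run with python. This is a sketch using only the standard library; the function names are hypothetical.

```python
import sys

def in_virtual_environment():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

def python_version_string():
    """Return the running interpreter's version as 'major.minor.micro'."""
    return ".".join(str(part) for part in sys.version_info[:3])

print("Python version:", python_version_string())
print("Virtual environment active:", in_virtual_environment())
print("Interpreter path:", sys.executable)
```

If the second line prints False, the activation script has not been run in the current command-line session.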

Install PyTorch

The package manager pip is used to install the PyTorch binaries. The command to install PyTorch in the virtual environment changes based on the following factors:

  • OS
  • computing using a GPU or CPU
  • GPU model (if applicable)
  • GPU driver installed

Use one of the following methods to find the correct command:

The following is a sample command to install PyTorch: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Tip

You can check your Nvidia driver version and CUDA version by running nvidia-smi at a command line.
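The same check can be scripted. The sketch below looks for nvidia-smi on the PATH and captures its output, returning None when the tool is missing or fails; the function name nvidia_smi_report is hypothetical.

```python
import shutil
import subprocess

def nvidia_smi_report():
    """Return the text printed by nvidia-smi, or None if it cannot be run."""
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return None  # Nvidia driver tools are not on PATH
    try:
        result = subprocess.run(
            [executable], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout

report = nvidia_smi_report()
if report is None:
    print("nvidia-smi is not available; consider a CPU-only install.")
else:
    print(report)
```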

Important

Even if your GPU supports the latest CUDA SDKs, compatibility issues might still occur when attempting to install the latest PyTorch CUDA wheels. You can use online LLMs such as ChatGPT to help you troubleshoot which CUDA wheel and PyTorch version to install, or you can choose to use only the CPU to compute and skip installing the CUDA wheel.

A computer with an Nvidia Quadro T2000 GPU was used to run the project. The working PyTorch configuration was:

  • Windows version: 10
  • Python version: 3.11.9
  • PyTorch version: 2.7.1
  • CUDA wheel: 11.8
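To compare your own setup against the configuration listed above, a sketch like the following gathers the equivalent details. It assumes nothing beyond the standard library and, if installed, PyTorch; when PyTorch is missing, the PyTorch-related fields are left empty rather than raising an error. The function name describe_environment is hypothetical.

```python
import platform

try:
    import torch  # only present once PyTorch has been installed
except ImportError:
    torch = None

def describe_environment():
    """Collect OS, Python, PyTorch, and CUDA details in one dictionary."""
    info = {
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        "pytorch": None,
        "cuda_build": None,
        "cuda_available": False,
        "gpu": None,
    }
    if torch is not None:
        info["pytorch"] = torch.__version__
        info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
        info["cuda_available"] = torch.cuda.is_available()
        if info["cuda_available"]:
            info["gpu"] = torch.cuda.get_device_name(0)
    return info

for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If cuda_build shows a version but cuda_available is False, the installed wheel and the GPU driver are likely mismatched.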

Install Hugging Face Transformers

The Hugging Face Transformers library provides pre-trained LLMs that you can load in PyTorch.

To install the library:

  • In the activated virtual environment, run pip install transformers.

Run a BERT smoke test (optional)

To run a BERT smoke test:

  1. Create a file named test_bert.py containing the following code in your project folder:
from transformers import AutoTokenizer, AutoModel
import torch

# Use the GPU when CUDA is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device)

inputs = tok("Hello from BERT!", return_tensors="pt").to(device)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

print("BERT ran successfully on:", outputs.last_hidden_state.device)
  2. In the activated virtual environment, run python test_bert.py. If BERT runs successfully on the GPU, the output is "BERT ran successfully on: cuda:0". On a CPU-only setup, the output ends with "cpu".

Save BERT locally for offline use (optional)

Hugging Face Transformers automatically caches model weights, configurations, and tokenizer data, so you can skip this task.

To save BERT locally for offline use:

  1. In your activated virtual environment, run python to start the Python interpreter as an interactive shell. This enables you to run Python code line by line and see the results immediately. Then load the model and tokenizer into variables named model and tokenizer.
  2. Run the following commands:
save_dir = "./<model_directory>"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

For example, the following commands save the model and tokenizer to a subdirectory of the current folder named models/bert-base-uncased.

save_dir = "./models/bert-base-uncased"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
  3. Later, you can load the model directly from the folder by running the following commands in the Python interactive shell:
from transformers import BertTokenizer, BertModel
model_path = "./models/bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_path, local_files_only=True)
model = BertModel.from_pretrained(model_path, local_files_only=True)
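Before relying on the saved folder offline, it can help to confirm that the expected files were written. The sketch below uses only the standard library. The file names it checks (config.json, tokenizer_config.json, and either model.safetensors or pytorch_model.bin) are assumptions based on what save_pretrained commonly writes; adjust them if your Transformers version produces different names. The function name missing_model_files is hypothetical.

```python
from pathlib import Path

# Assumed file names; verify against the contents of your own save folder.
REQUIRED_FILES = ("config.json", "tokenizer_config.json")
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")

def missing_model_files(model_dir):
    """Return the names of expected files that are absent from model_dir."""
    folder = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
    # Only one of the weight file formats needs to be present.
    if not any((folder / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing

problems = missing_model_files("./models/bert-base-uncased")
if problems:
    print("Missing from saved model folder:", ", ".join(problems))
else:
    print("Saved model folder looks complete.")
```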

Tip

Exit the Python interactive shell and return to the virtual environment at any time by running exit().

Part 2: Perform semantic similarity analysis

Once you have BERT set up and running, you can use BERT for inference tasks. Two sample Python scripts are provided:

  • compare_pdfs.py: This Python script extracts text from two PDF files, chunks it, generates embeddings, builds a document-level embedding for each document, and computes the similarity between the two files.
  • compare_jsons.py: This sample Python script extracts and processes text from two JSON files, generates embeddings for each entry, and computes an unordered set-level semantic similarity between the two files.
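The chunk, embed, and pool steps described for compare_pdfs.py can be sketched as follows. This is not the repository's script: it assumes word-based chunks with overlap and mean pooling, and it uses a hypothetical toy_embed function (letter frequencies) in place of a real embedding model so that it runs with only the standard library. PDF text extraction is omitted; the inputs are plain strings.

```python
import math

def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors into one vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def toy_embed(chunk):
    """Hypothetical stand-in for a real model: 26 letter frequencies."""
    letters = [c for c in chunk.lower() if "a" <= c <= "z"]
    total = max(len(letters), 1)
    return [letters.count(chr(ord("a") + i)) / total for i in range(26)]

def document_embedding(text, embed=toy_embed, chunk_size=200, overlap=50):
    """Embed each chunk, then mean-pool into one document-level vector."""
    chunks = chunk_words(text, chunk_size, overlap)
    if not chunks:
        raise ValueError("text contains no words to embed")
    return mean_pool([embed(chunk) for chunk in chunks])

doc_a = "citation mapping tools surface related papers " * 100
doc_b = "related papers are surfaced by citation mapping tools " * 100
print(cosine(document_embedding(doc_a), document_embedding(doc_b)))
```

Overlapping chunks reduce the chance that a sentence split across a chunk boundary loses its context, at the cost of embedding some words twice.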

These scripts do not use the bert-base-uncased model from the Transformers library, because base BERT models are not specifically trained for semantic similarity. Instead, the scripts use an SBERT model, all-mpnet-base-v2, from the Sentence Transformers library.
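One plausible way to compute an unordered set-level score is sketched below; the repository's compare_jsons.py may use a different formula. For each entry in one set, the sketch takes its best cosine match in the other set, averages those matches, and then averages the two directions. The sbert_embed helper is a hypothetical wrapper that requires the Sentence Transformers library and a downloaded model; the toy example does not call it.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def directed_score(source, target):
    """Average, over source vectors, of the best cosine match in target."""
    return sum(max(cosine(s, t) for t in target) for s in source) / len(source)

def set_similarity(embeddings_a, embeddings_b):
    """Symmetric, order-independent similarity between two sets of vectors."""
    if not embeddings_a or not embeddings_b:
        raise ValueError("both sets must contain at least one embedding")
    forward = directed_score(embeddings_a, embeddings_b)
    backward = directed_score(embeddings_b, embeddings_a)
    return 0.5 * (forward + backward)

def sbert_embed(texts, model_name="all-mpnet-base-v2"):
    """Embed a list of strings with SBERT (needs sentence-transformers)."""
    from sentence_transformers import SentenceTransformer  # imported lazily
    model = SentenceTransformer(model_name)
    return model.encode(texts).tolist()

# Toy vectors: the same two entries in a different order.
set_a = [[1.0, 0.0], [0.0, 1.0]]
set_b = [[0.0, 1.0], [1.0, 0.0]]
print(set_similarity(set_a, set_b))  # 1.0, because ordering does not matter
```

With real data, the two sets would come from calls such as sbert_embed(titles_from_tool_a), where titles_from_tool_a is a list of strings read from one JSON file.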

To set up and run SBERT, perform the following tasks in order:

  1. Install Hugging Face Sentence Transformers.
  2. Run the semantic similarity analysis script.

Install Hugging Face Sentence Transformers

The Hugging Face Sentence Transformers library provides pre-trained SBERT models that you can load in PyTorch.

To install the library:

  • In the activated virtual environment, run pip install sentence-transformers.

Run the semantic similarity analysis script

To perform semantic similarity analysis on two text sources:

  1. Prepare two PDFs or JSONs for analysis.
  2. In your activated virtual environment, run one of the provided Python scripts: python <script_name> <text1_name> <text2_name>. E.g. python compare_pdfs.py pdf1.pdf pdf2.pdf
    A semantic similarity score is produced.

AI Use Disclosure

AI was used to assist with configuring the PyTorch environment, setting up BERT models, and generating code samples and scripts. All other content, including the instructions as seen in the README file, was drafted and reviewed manually.
