This repository hosts the files used in a project comparing AI tools used for academic research. Specifically, semantic similarity analysis was performed on search results from different citation-mapping literature tools. The contents of the files are as follows:
- README.md: Instructions for setting up and running BERT. The instructions assume that you are familiar with installing software packages and working in a command-line environment.
- test_bert.py: Python script for running a BERT smoke test
- compare_pdfs.py: Python script analyzing semantic similarity for two PDFs
- compare_jsons.py: Python script analyzing semantic similarity for two JSONs
The project methodology involves (1) given an identical search input, collecting search results from a collection of citation-based tools; (2) using well-established natural language processing (NLP) techniques to generate numeric representations of the search results; and (3) conducting a quantitative analysis of the search results to identify how similar or dissimilar the search results are across the tools.
BERT (Bidirectional Encoder Representations from Transformers) is an LLM that excels at NLP tasks such as sentiment analysis, question answering, and semantic similarity. This project uses Sentence-BERT (SBERT), a pre-trained model derived from BERT. Its tokenizer converts natural language text into tokens, and the token representations are then provided as input to a transformer, which produces embeddings. Two embeddings, each representing a set of search results from one citation-based tool, are compared by calculating their cosine similarity score (a value between -1 and 1; scores for related texts are typically positive). The higher the score, the closer the results are semantically.
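As a rough illustration of the final comparison step, the sketch below computes cosine similarity for small vectors in plain Python. The vectors are made-up stand-ins for real embeddings, and the function name cosine_similarity is hypothetical; it is not taken from the project scripts.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real SBERT embeddings have hundreds of dimensions.
results_tool_a = [0.2, 0.7, 0.1]
results_tool_b = [0.25, 0.65, 0.05]   # points in nearly the same direction as tool A
results_tool_c = [0.7, -0.2, 0.0]     # perpendicular to tool A

print(cosine_similarity(results_tool_a, results_tool_b))  # close to 1
print(cosine_similarity(results_tool_a, results_tool_c))  # 0, up to rounding
```

Only the direction of the vectors matters here, not their length, which is why the score works as a measure of how closely two sets of results align semantically.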
Similarity between two texts, or sets of search results, is useful as a comparative metric because it allows conclusions to be drawn about the consistency among citation-based tools. For example, if tools A, B, and C generate similar search results, researchers can save time by running the search in only one tool instead of repeating it for results that are likely to be duplicative. If, on the other hand, tools A, B, and C don’t generate similar results, it could be worth reviewing the search results generated by each tool.
PyTorch is an open source deep learning framework. One of its capabilities is enabling LLMs, such as BERT, to be run on a local computer. PyTorch and many supporting open source libraries are publicly available online.
Before proceeding further, review the specified hardware and software requirements to see if your computer is able to run PyTorch.
Recommended hardware requirements:
- Computer with Nvidia GPU that supports CUDA (optional if you do not plan to do heavy computing)
- Intel Core i5 or higher, or AMD Ryzen 7 or higher
- 8 GB RAM
- Disk storage for language models
Note
The requirements were taken from (https://www.geeksforgeeks.org/python/pytorch-system-requirements). Although the PyTorch website does not list specific hardware requirements, running LLMs does necessitate a nontrivial amount of computing power.
Software requirements:
- For a list of supported operating systems, see the PyTorch website (https://pytorch.org/get-started/locally/) and select your OS (Linux, Mac, Windows) on the interactive chart.
To set up and run BERT, perform the following tasks in order:
- Install Python.
- Set up a Python virtual environment.
- Install PyTorch.
- Install Hugging Face Transformers.
- Run a BERT smoke test (optional).
- Save BERT locally for offline use (optional).
Install Python from the official site (https://www.python.org/downloads/). Make sure to select the Add python.exe to PATH option on the install screen. pip, a package manager for Python, is automatically installed when using the official Python installer.
Important
Although PyTorch supports Python versions 3.10-3.14, installing 3.10 is recommended, because dependency issues in libraries can occur with later versions.
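After installation, a quick check such as the following can confirm which Python version is active. This is only a sketch: the version bounds are copied from the note above and may change with future PyTorch releases, and the helper name python_version_supported is hypothetical.

```python
import sys

# Version range quoted in the note above; treat it as an assumption that may change.
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 14)

def python_version_supported(version=None):
    """Return True if the (major, minor) version falls inside the quoted range."""
    major_minor = tuple(version or sys.version_info[:2])[:2]
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED

current = sys.version_info[:2]
status = "inside" if python_version_supported() else "outside"
print(f"Python {current[0]}.{current[1]} is {status} the 3.10-3.14 range")
```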
To set up a Python virtual environment:
- In a command-line, create a project folder. E.g. C:\projects\bert-local
- In the folder, run python -m venv <env_name> to create a virtual environment. E.g. python -m venv .venv
- In the same folder, run .\<env_name>\scripts\activate.bat to activate the environment. E.g. .\.venv\scripts\activate.bat
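One way to confirm that the activation step worked is to ask the interpreter itself. The sketch below relies on the standard-library behavior that sys.prefix and sys.base_prefix differ inside an environment created with venv; the helper name in_virtual_environment is hypothetical.

```python
import sys

def in_virtual_environment():
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtual_environment())
```

If the second line prints False, the environment is probably not activated in the current command-line session.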
The package manager pip is used to install the PyTorch binaries. The command to install PyTorch in the virtual environment changes based on the following factors:
- OS
- computing using a GPU or CPU
- GPU model (if applicable)
- GPU driver installed
Use one of the following methods to find the correct command:
- If your GPU supports the latest CUDA SDKs and you want to install the latest stable PyTorch version, select your configuration using the interactive chart on (https://pytorch.org/get-started/locally/) and the command will be generated for you.
- If your GPU supports earlier CUDA SDKs or you want to install an earlier PyTorch version, find the relevant command at (https://pytorch.org/get-started/previous-versions/).
The following is a sample command to install PyTorch (this example targets the CUDA 11.8 wheel):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Tip
You can check your Nvidia driver version and CUDA version by running nvidia-smi in a command-line.
Important
Even if your GPU supports the latest CUDA SDKs, compatibility issues still might occur when attempting to install the latest PyTorch CUDA wheels. You can use online LLMs such as ChatGPT to help you troubleshoot which CUDA wheel and PyTorch version to install, or you can choose to use only the CPU to compute and skip installing the CUDA wheel.
A computer with the Nvidia Quadro T2000 GPU was used to run the project. Working PyTorch configuration settings were:
- Windows version: 10
- Python version: 3.11.9
- PyTorch version: 2.7.1
- CUDA wheel: 11.8
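Once PyTorch is installed, a short check like the following can report which build is active and whether it can see a GPU. This is a sketch rather than part of the project files: the function name describe_torch_setup is hypothetical, and torch is imported lazily so the script still runs (and reports that PyTorch is missing) on a computer without it.

```python
import importlib.util

def describe_torch_setup():
    """Summarize the local PyTorch installation as a dictionary."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch  # imported lazily so the script still runs without PyTorch
    report = {
        "installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["gpu_name"] = torch.cuda.get_device_name(0)
    return report

for key, value in describe_torch_setup().items():
    print(f"{key}: {value}")
```

If cuda_available is False even though a CUDA wheel was installed, the wheel and the installed GPU driver may not be compatible.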
The Hugging Face Transformers library provides pre-trained LLMs that you can load in PyTorch.
To install the library:
- In the activated virtual environment, run pip install transformers.
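To confirm that the packages landed in the active environment, the installed versions can be read with the standard library. The sketch below assumes the distribution names torch and transformers, which match the pip commands in this README; the helper name installed_version is hypothetical.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string for a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for name in ("torch", "transformers"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```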
To run a BERT smoke test:
- Create a file named test_bert.py containing the following code in your project folder:
from transformers import AutoTokenizer, AutoModel
import torch

# Use the GPU when CUDA is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device)
inputs = tok("Hello from BERT!", return_tensors="pt").to(device)
# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    outputs = model(**inputs)
print("BERT ran successfully on:", outputs.last_hidden_state.device)
- In the activated virtual environment, run python test_bert.py. If BERT runs successfully on a GPU, the returned output is "BERT ran successfully on: cuda:0". On a computer without a CUDA-capable GPU, the device is reported as cpu instead.
Hugging Face Transformers automatically caches model weights, configurations, and tokenizer data after the first download, so this task is optional and can be skipped.
To save BERT locally for offline use:
- In your activated virtual environment, run python to invoke the Python interpreter as an interactive shell. This enables you to run Python code line by line and see the results immediately.
- In the shell, load the model and tokenizer into variables named model and tokenizer (for example, with the from_pretrained calls used in test_bert.py), and then run the following commands:
save_dir = "./<model_directory>"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
For example, the following commands save the model and tokenizer to a subdirectory of the current folder named models/bert-base-uncased.
save_dir = "./models/bert-base-uncased"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
- Later, you can load the model directly from the folder by running the following commands in the Python interactive shell:
from transformers import BertTokenizer, BertModel
model_path = "./models/bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_path, local_files_only=True)
model = BertModel.from_pretrained(model_path, local_files_only=True)
Tip
Exit the Python interactive shell and return to the command line of the virtual environment at any time by running exit().
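Before relying on the saved copy offline, it can help to check that the folder looks complete. The sketch below is an assumption-laden illustration: the helper name looks_like_saved_model is hypothetical, and the weight file names it checks are common ones that can vary between library versions. It also sets the HF_HUB_OFFLINE environment variable, which asks the Hugging Face libraries not to contact the Hub and should be set before transformers is imported.

```python
import os
from pathlib import Path

def looks_like_saved_model(model_dir):
    """Return True if the folder holds a config file and at least one weights file."""
    folder = Path(model_dir)
    has_config = (folder / "config.json").is_file()
    # Weight file names vary by library version; these two are common ones.
    has_weights = any(
        (folder / name).is_file()
        for name in ("model.safetensors", "pytorch_model.bin")
    )
    return has_config and has_weights

# Ask the Hugging Face libraries not to contact the Hub in this process.
os.environ["HF_HUB_OFFLINE"] = "1"

model_path = "./models/bert-base-uncased"
if looks_like_saved_model(model_path):
    print("Saved model found at", model_path)
else:
    print("No complete saved model at", model_path)
```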
Once you have BERT set up and running, you can use BERT for inference tasks. Two sample Python scripts are provided:
- compare_pdfs.py: This sample Python script extracts text from two PDF files, chunks it, generates embeddings, builds a document‑level embedding for each document, and computes similarity between the two files.
- compare_jsons.py: This sample Python script extracts and processes text from two JSON files, generates embeddings for each entry, and computes an unordered set-level semantic similarity between the two files.
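The two aggregation ideas described above can be sketched on toy vectors. How the project scripts actually implement these steps is not shown in this README, so everything here is an assumption: word-count chunking, mean pooling for the document-level embedding, and a symmetric best-match average for the set-level score. All function names are hypothetical.

```python
import math

def chunk_words(text, chunk_size=200):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def mean_vector(vectors):
    """Average equal-length vectors into one document-level vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def set_similarity(set_a, set_b):
    """Order-independent score: average best match in both directions."""
    best_a = [max(cosine(a, b) for b in set_b) for a in set_a]
    best_b = [max(cosine(b, a) for a in set_a) for b in set_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))

print(chunk_words("one two three four five", chunk_size=2))

# Toy 2-dimensional vectors standing in for SBERT embeddings of chunks or entries.
doc1_chunks = [[1.0, 0.0], [0.8, 0.2]]
doc2_chunks = [[0.9, 0.1], [0.7, 0.3]]
print("document-level:", cosine(mean_vector(doc1_chunks), mean_vector(doc2_chunks)))

entries_a = [[1.0, 0.0], [0.0, 1.0]]
entries_b = [[0.0, 1.0], [1.0, 0.0]]  # same entries in a different order
print("set-level:", set_similarity(entries_a, entries_b))  # 1.0, because order is ignored
```

Mean pooling condenses each document into one vector before comparison, whereas the set-level score compares entries individually, which suits search results whose ordering differs between tools.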
These scripts do not use the bert-base-uncased model in the Transformers library, as BERT models are not specifically trained for semantic similarity. Instead, the scripts use an SBERT model, all-mpnet-base-v2, from the Sentence Transformers library.
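For orientation, the following sketch shows one way an SBERT model such as all-mpnet-base-v2 might be loaded and used to score two texts. It is not the code from the provided scripts. The helper names load_sbert_encoder and text_similarity are hypothetical, and the encoder is passed in as a parameter so the similarity logic can be tried without downloading the model.

```python
def load_sbert_encoder(model_name="all-mpnet-base-v2"):
    """Return a function that maps a list of texts to unit-length embeddings."""
    # Imported lazily so the rest of this sketch runs without the library installed.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts, normalize_embeddings=True)

def text_similarity(text_a, text_b, encode):
    """Cosine similarity of two texts; encode must return unit-length vectors."""
    vec_a, vec_b = encode([text_a, text_b])
    # For unit-length vectors, the dot product equals the cosine similarity.
    return float(sum(x * y for x, y in zip(vec_a, vec_b)))

# Usage (downloads the model on first run, so it is left commented out):
# encode = load_sbert_encoder()
# print(text_similarity("citation mapping tools", "tools that map citations", encode))
```

Normalizing the embeddings at encoding time keeps the comparison to a simple dot product, which is a design convenience rather than a requirement.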
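To set up and run SBERT, perform the following tasks in order:
- Install Hugging Face Sentence Transformers.
- Perform semantic similarity analysis on two text sources.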
The Hugging Face Sentence Transformers library provides pre-trained SBERT models that you can load in PyTorch.
To install the library:
- In the activated virtual environment, run pip install sentence-transformers.
To perform semantic similarity analysis on two text sources:
- Prepare two PDFs or JSONs for analysis.
- In your activated virtual environment, run one of the provided Python scripts: python <script_name> <text1_name> <text2_name>. E.g. python compare_pdfs.py pdf1.pdf pdf2.pdf
A semantic similarity score is produced.
AI was used to assist with configuring the PyTorch environment, setting up BERT models, and generating code samples and scripts. All other content, including the instructions as seen in the README file, was drafted and reviewed manually.