Language Models (LMs), such as BERT and the GPT series, are foundational to modern Natural Language Processing (NLP) and are integral to many everyday applications like search engines and personal assistants. Their power stems from pretraining on vast amounts of web text using self-supervised learning, allowing them to be adapted to new tasks with minimal finetuning. Pretraining methods like masked language modeling (MLM) and causal language modeling enable LMs to encode world knowledge, which is beneficial for knowledge-intensive tasks such as question answering.
A significant challenge in existing LM pretraining strategies is that they model each document independently. This overlooks the rich dependencies between documents, which are especially evident in web text and scientific literature in the form of hyperlinks and citation links. These links matter because knowledge frequently spans multiple documents, enabling multi-hop reasoning. For example, a Wikipedia article about the “Tidal Basin, Washington D.C.” might mention the “National Cherry Blossom Festival,” and the linked festival article could further explain that the festival celebrates “Japanese cherry trees.” Following the link yields knowledge such as “the Tidal Basin has Japanese cherry trees,” which is not fully present in either document alone. The authors therefore argue that a text corpus should be viewed not merely as a collection of documents but as a graph in which documents are interconnected by these links.
To address this limitation, the research introduces LinkBERT, a novel pretraining method designed to incorporate document link information for training more knowledgeable language models. The LinkBERT approach involves three main steps: constructing a document graph by identifying links between documents, creating link-aware training instances by placing linked documents together, and pretraining the LM using link-aware self-supervised tasks, specifically masked language modeling (MLM) and document relation prediction (DRP).
The construction of the document graph involves treating each document as a node and creating a directed edge between documents when a hyperlink exists from one to another. While other linking methods are possible, hyperlinks and citation links are prioritized due to their high relevance and widespread availability. The creation of link-aware LM inputs involves segmenting documents into smaller chunks, approximately 256 tokens each, and concatenating pairs of segments. Three options are presented for segment concatenation: contiguous segments from the same document (similar to existing LMs), random segments from different documents, and linked segments sampled from documents connected in the document graph. The inclusion of random and linked segments is intended to provide a training signal for LinkBERT to learn relationships between text segments.
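To make the segment pairing concrete, here is a minimal Python sketch of the idea, assuming a toy corpus represented as a dict mapping a document id to its text and outgoing links. The function names, whitespace-based chunking, and sampling details are illustrative assumptions, not the authors' implementation.

```python
import random

# Illustrative sketch of link-aware input creation (not the authors' code).
# `corpus` maps a document id to its text and the ids of documents it links to.
corpus = {
    "tidal_basin": {"text": "...", "links": ["cherry_blossom_festival"]},
    "cherry_blossom_festival": {"text": "...", "links": []},
}

def segment(text, max_tokens=256):
    """Split a document into chunks of roughly `max_tokens` whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

def make_instance(doc_id, relation):
    """Build one (segment_A, segment_B, relation) pretraining instance.

    `relation` is one of "contiguous", "random", "linked", matching the three
    concatenation options described above.
    """
    segs = segment(corpus[doc_id]["text"])
    i = random.randrange(len(segs))
    seg_a = segs[i]
    if relation == "contiguous":
        # next segment of the same document (wraps around for the last chunk)
        seg_b = segs[(i + 1) % len(segs)]
    elif relation == "random":
        # a segment from an unrelated document
        other = random.choice([d for d in corpus if d != doc_id])
        seg_b = random.choice(segment(corpus[other]["text"]))
    else:
        # "linked": a segment from a document this one hyperlinks to
        # (assumes the document has at least one outgoing link)
        linked = random.choice(corpus[doc_id]["links"])
        seg_b = random.choice(segment(corpus[linked]["text"]))
    return seg_a, seg_b, relation
```

For instance, `make_instance("tidal_basin", "linked")` would pair a Tidal Basin segment with a segment from the linked festival article, which is exactly the kind of input that lets the model see bridging concepts in one sequence.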
For pretraining, LinkBERT employs two joint self-supervised tasks. Masked Language Modeling (MLM) involves masking tokens and predicting them using surrounding tokens, which encourages the LM to learn multi-hop knowledge by processing concepts from linked documents within the same input sequence. Document Relation Prediction (DRP) tasks the model with classifying the relationship between two segments as contiguous, random, or linked, thereby fostering an understanding of document relevance and dependencies, and enabling the learning of bridging concepts. These tasks can be analogized to graph-based self-supervised learning, with MLM corresponding to node feature prediction and DRP to link prediction.
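As a rough illustration of how the two objectives could be combined on top of a BERT-style encoder with HuggingFace Transformers, the sketch below adds a three-way DRP classification head over the [CLS] representation alongside a simplified (untied) MLM head. The class name, head structure, and loss weighting are assumptions for exposition; the released LinkBERT code may differ.

```python
import torch.nn as nn
from transformers import AutoModel

# Sketch of a joint MLM + DRP objective on a BERT-style encoder (illustrative only).
class LinkAwarePretrainingModel(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_relations=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        vocab = self.encoder.config.vocab_size
        self.mlm_head = nn.Linear(hidden, vocab)           # predicts masked tokens
        self.drp_head = nn.Linear(hidden, num_relations)   # contiguous / random / linked
        self.loss_fct = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask, token_type_ids, mlm_labels, drp_labels):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        token_states = out.last_hidden_state        # (batch, seq_len, hidden)
        cls_state = token_states[:, 0]              # [CLS] summarizes the segment pair

        mlm_logits = self.mlm_head(token_states)
        drp_logits = self.drp_head(cls_state)

        # MLM loss over masked positions (non-masked labels set to -100),
        # plus a 3-way DRP classification loss over the segment pair.
        mlm_loss = self.loss_fct(mlm_logits.view(-1, mlm_logits.size(-1)),
                                 mlm_labels.view(-1))
        drp_loss = self.loss_fct(drp_logits, drp_labels)
        return mlm_loss + drp_loss                  # joint objective
```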
LinkBERT has been evaluated in both the general and biomedical domains. In the general domain, it is pretrained on Wikipedia with hyperlinks; in the biomedical domain, on PubMed with citation links. Performance was assessed on a range of downstream tasks, including general-domain question answering (MRQA) and NLP benchmarks (GLUE), as well as biomedical NLP (BLURB) and question answering (MedQA, MMLU). The results show that LinkBERT consistently improves upon baseline models such as BERT and PubMedBERT across tasks and domains. The gains are particularly notable in the biomedical domain, which the authors attribute to the strong interdependencies captured by citation links. The biomedical version, BioLinkBERT, achieved state-of-the-art performance on the BLURB, MedQA, and MMLU benchmarks.
LinkBERT demonstrates several key strengths. First, it excels at multi-hop reasoning, showing significant improvements over BERT on MRQA tasks that require such capabilities, such as HotpotQA and TriviaQA. This is exemplified by an instance where LinkBERT correctly identifies a city mentioned in a second document linked from the initial document about an organization, whereas BERT incorrectly predicts a location found in the first document alone. The intuition is that seeing related concepts and documents together during pretraining improves the model's ability to reason across multiple sources in downstream applications. Second, LinkBERT shows improved document relation understanding: when distracting, irrelevant documents are added to the input, LinkBERT maintains its QA accuracy, whereas BERT's performance drops. This robustness is attributed to the DRP task during pretraining, which teaches the model to recognize document relevance. Third, LinkBERT is effective in few-shot and data-efficient settings: when finetuned with only 1% or 10% of the training data for QA tasks, it shows substantial improvements over BERT, suggesting that incorporating document links helps it internalize more world knowledge during pretraining.
LinkBERT is designed to be a drop-in replacement for BERT and is available on HuggingFace. The pretrained LinkBERT and BioLinkBERT models can be loaded using standard libraries. For downstream applications like question answering and text classification, users can either utilize provided finetuning scripts or adapt existing BERT finetuning scripts by simply changing the model path to LinkBERT. The research also suggests future directions, including extending LinkBERT to sequence-to-sequence models for document link-aware text generation and generalizing the concept of document links to other modalities, such as source code dependency links for code language models.
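For example, loading LinkBERT typically requires only changing the model name in standard Transformers code. The checkpoint identifiers below match those published by the authors on the HuggingFace Hub, but verify the exact names on the Hub before use.

```python
from transformers import AutoTokenizer, AutoModel

# Load LinkBERT exactly as you would load BERT; only the model name changes.
tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/LinkBERT-base")
model = AutoModel.from_pretrained("michiyasunaga/LinkBERT-base")

# Biomedical variant pretrained on PubMed with citation links:
# tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/BioLinkBERT-base")
# model = AutoModel.from_pretrained("michiyasunaga/BioLinkBERT-base")

inputs = tokenizer("The Tidal Basin features Japanese cherry trees.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```

The same substitution works in existing BERT finetuning scripts for question answering or text classification: point the model path at a LinkBERT checkpoint and leave the rest of the pipeline unchanged.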
Key takeaways from the analysis include:
- LinkBERT is a novel pretraining method that leverages document links (hyperlinks, citations) to enhance language model knowledge.
- It constructs a document graph and creates link-aware training instances by concatenating segments from linked documents.
- LinkBERT employs joint pretraining tasks of Masked Language Modeling (MLM) and Document Relation Prediction (DRP).
- The method demonstrates significant improvements over baseline models like BERT across general and biomedical NLP tasks, particularly in multi-hop reasoning and data-efficient learning.
- BioLinkBERT achieved state-of-the-art performance on several biomedical benchmarks.
- LinkBERT is robust to irrelevant documents and better models relationships between multiple documents.
Readers interested in enhancing language model capabilities for knowledge-intensive and reasoning-heavy tasks should consider exploring the pretrained LinkBERT models available on HuggingFace and adapting them to their own projects. It is also worth investigating the finetuning scripts in the accompanying GitHub repository and, as the research suggests, exploring how LinkBERT might generalize to other model architectures and data modalities.