LinkBERT: Enhancing Language Models with Document Link Information
Language models (LMs) like BERT and the GPT series have become foundational to modern Natural Language Processing (NLP) systems, powering everyday applications such as search engines and personal assistants. Their success stems from a powerful pretraining phase using self-supervised learning on vast amounts of web text. This process allows LMs to acquire a broad understanding of language and world knowledge without explicit human labeling. Common pretraining tasks include masked language modeling (MLM), where the model predicts masked words in a sentence, and causal language modeling, where the model predicts the next word in a sequence. Through these methods, LMs learn associations between concepts, which are crucial for knowledge-intensive tasks like question answering.
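To make the MLM objective concrete, the short sketch below runs a fill-in-the-blank prediction with an off-the-shelf BERT checkpoint via the Hugging Face Transformers `fill-mask` pipeline. The checkpoint name and example sentence are illustrative choices only and are not part of LinkBERT itself.

```python
# Minimal illustration of masked language modeling (MLM) with an
# off-the-shelf BERT checkpoint; requires `pip install transformers torch`.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model ranks the most likely tokens for the [MASK] position.
for prediction in fill_mask("The Tidal Basin is located in [MASK], D.C."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```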
The Challenge of Single-Document Training
A significant limitation of many existing LM pretraining strategies is their tendency to process documents in isolation. This means that training instances are drawn from individual documents without considering the rich interdependencies that exist between them. In real-world text corpora, such as content from the web or scientific literature, documents are often interconnected through hyperlinks and citation links. These links are vital pathways for knowledge, enabling information to span across multiple documents. For example, an article about the “Tidal Basin, Washington D.C.” might mention the “National Cherry Blossom Festival,” and a linked article could then elaborate on the festival’s celebration of “Japanese cherry trees.” This connection allows for the acquisition of multi-hop knowledge, like “Tidal Basin has Japanese cherry trees,” a fact not fully contained within either document alone.
Models that ignore these document dependencies may fail to capture knowledge spread across multiple sources. Learning such multi-hop knowledge during pretraining is crucial for applications requiring deep understanding and reasoning, such as answering complex questions or discovering new information. Recognizing that text corpora can be viewed as graphs of interconnected documents, researchers have developed new approaches to leverage this structure.
Introducing LinkBERT: A Link-Aware Pretraining Method
LinkBERT is a novel pretraining method designed to incorporate document link information into language model training, thereby equipping LMs with enhanced world knowledge. The approach involves three key steps:
- Document Graph Construction: The first step is to build a document graph by identifying and linking related documents within a corpus. This is typically achieved by treating each document as a node and creating directed edges to represent relationships like hyperlinks (from document A to document B) or citation links (document A cites document B). Hyperlinks and citation links are particularly valuable due to their high relevance and widespread availability.
- Link-Aware Input Creation: To enable the LM to learn dependencies across documents, linked documents are brought together into the same input instances. Each document is segmented into smaller pieces (e.g., 256 tokens). Then, pairs of segments are created for training. The method offers three options for segment pairing:
- Contiguous segments: Segments from the same document, similar to standard LM training.
- Random segments: Segments sampled from two entirely different, randomly chosen documents.
- Linked segments: One segment from a random document and another from a document directly linked to it in the graph.
Mixing these three pairing options creates a training signal that guides the LM to recognize how two text segments relate; a short code sketch of the graph construction and pairing steps appears after this list.
- Link-Aware LM Pretraining: The LM is then pretrained using these link-aware input instances with two joint self-supervised tasks:
- Masked Language Modeling (MLM): This task encourages the LM to learn multi-hop knowledge by predicting masked tokens, leveraging context that now includes information from linked documents. For instance, concepts like “Tidal Basin,” “National Cherry Blossom Festival,” and “Japanese cherry trees” can be presented together in a single training instance, allowing the LM to learn the relationships between them.
- Document Relation Prediction (DRP): This task trains the model to classify the relationship between two segments (contiguous, random, or linked). It helps the LM understand document relevance and dependencies, as well as bridging concepts that connect related information across documents.
These pretraining tasks can be conceptualized as performing self-supervised learning on the document graph, analogous to node feature prediction (MLM) and link prediction (DRP).
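To make the graph construction and segment pairing steps concrete, the sketch below builds a toy hyperlink graph and samples the three kinds of segment pairs along with the label later used by the Document Relation Prediction task. This is a minimal illustration under simplifying assumptions, not the authors' implementation; the corpus contents, segment length, helper names, and label numbering are all invented for the example.

```python
import random

# Toy corpus: document id -> (text, list of hyperlinked document ids).
# Contents and link structure are invented for illustration.
CORPUS = {
    "tidal_basin": (
        "The Tidal Basin hosts the National Cherry Blossom Festival every spring ...",
        ["cherry_blossom_festival"],
    ),
    "cherry_blossom_festival": (
        "The festival celebrates the Japanese cherry trees gifted in 1912 ...",
        [],
    ),
    "unrelated_doc": (
        "An entirely unrelated article about a different topic ...",
        [],
    ),
}

SEG_LEN = 256  # target segment length in tokens (naive whitespace tokens here)

def segments(doc_id):
    """Split a document into consecutive segments of up to SEG_LEN tokens."""
    tokens = CORPUS[doc_id][0].split()
    return [" ".join(tokens[i:i + SEG_LEN]) for i in range(0, len(tokens), SEG_LEN)]

def sample_pair():
    """Return (segment_a, segment_b, drp_label) using one of the three pairing options.

    Labels (arbitrary numbering): 0 = contiguous, 1 = random, 2 = linked.
    The pair is then fed to the LM as "[CLS] seg_a [SEP] seg_b [SEP]".
    """
    doc_a = random.choice(list(CORPUS))
    segs_a = segments(doc_a)
    option = random.choice(["contiguous", "random", "linked"])

    if option == "contiguous" and len(segs_a) > 1:
        i = random.randrange(len(segs_a) - 1)
        return segs_a[i], segs_a[i + 1], 0            # consecutive segments, same document
    if option == "linked" and CORPUS[doc_a][1]:
        doc_b = random.choice(CORPUS[doc_a][1])       # follow a hyperlink edge
        return random.choice(segs_a), random.choice(segments(doc_b)), 2
    # Fall back to (or deliberately pick) the "random" option.
    doc_b = random.choice([d for d in CORPUS if d != doc_a])
    return random.choice(segs_a), random.choice(segments(doc_b)), 1

seg_a, seg_b, drp_label = sample_pair()
print(drp_label, seg_a[:50], "||", seg_b[:50])
```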
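A rough sketch of the joint pretraining objective follows, assuming a BERT-style encoder loaded through Hugging Face Transformers. The heads are deliberately simplified (BERT's actual MLM head adds a transform layer and ties weights to the input embeddings); the point is only that a cross-entropy loss over masked tokens and a three-way cross-entropy loss over the document relation label are summed into one objective.

```python
import torch.nn as nn
from transformers import AutoModel

class LinkAwarePretrainingModel(nn.Module):
    """Simplified sketch: BERT-style encoder + MLM head + 3-way DRP head."""

    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        vocab = self.encoder.config.vocab_size
        self.mlm_head = nn.Linear(hidden, vocab)   # predicts each masked token
        self.drp_head = nn.Linear(hidden, 3)       # contiguous / random / linked

    def forward(self, input_ids, attention_mask, mlm_labels, drp_labels):
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                              # (batch, seq_len, hidden)

        mlm_logits = self.mlm_head(hidden_states)        # token-level predictions
        drp_logits = self.drp_head(hidden_states[:, 0])  # [CLS] summarizes the segment pair

        ce = nn.CrossEntropyLoss(ignore_index=-100)      # -100 marks unmasked positions
        mlm_loss = ce(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
        drp_loss = ce(drp_logits, drp_labels)
        return mlm_loss + drp_loss                       # joint objective
```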
Performance and Strengths of LinkBERT
LinkBERT has been evaluated on various downstream NLP tasks in both general and biomedical domains. In the general domain, LinkBERT was pretrained on Wikipedia, incorporating hyperlinks. In the biomedical domain, it was pretrained on PubMed, utilizing citation links, leading to a variant called BioLinkBERT.
Across diverse benchmarks such as MRQA, GLUE, BLURB, MedQA, and MMLU, LinkBERT consistently demonstrated improvements over baseline language models trained without document links. The gains were particularly significant in the biomedical domain, highlighting the importance of citation links in scientific literature. BioLinkBERT achieved state-of-the-art performance on several of these biomedical benchmarks.
LinkBERT exhibits several notable strengths:
- Effective Multi-hop Reasoning: LinkBERT significantly improves performance on tasks requiring multi-hop reasoning, such as HotpotQA in the MRQA benchmark, alongside gains on other extractive QA datasets like TriviaQA. By integrating information from linked documents during pretraining, LinkBERT can better connect disparate pieces of information to arrive at correct answers, outperforming models that struggle to bridge information across documents.
- Enhanced Document Relation Understanding: The Document Relation Prediction task during pretraining equips LinkBERT with a better ability to model relationships between documents. This proves beneficial in open-domain question answering scenarios where models must sift through multiple retrieved documents, some of which may be irrelevant. LinkBERT demonstrates robustness to distracting documents, maintaining accuracy where standard BERT might falter.
- Data-Efficient and Few-Shot Learning: When fine-tuned with limited training data (1% or 10% of available data), LinkBERT shows substantial improvements over BERT. This indicates that LinkBERT has internalized more world knowledge during pretraining, making it more effective in low-resource settings and for knowledge-intensive tasks.
Availability and Future Directions
LinkBERT is designed as a drop-in replacement for BERT, making it easily integrable into existing NLP pipelines. Pretrained LinkBERT and BioLinkBERT models are available on the Hugging Face Hub. Researchers and developers can load these models with standard libraries such as Hugging Face Transformers and fine-tune them for specific applications; scripts and guidance for integration are provided.
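As an example, loading a checkpoint takes only a few lines with the Transformers library. The model IDs below (`michiyasunaga/LinkBERT-base` and `michiyasunaga/BioLinkBERT-base`) are the names under which the checkpoints are published on the Hub; double-check the exact IDs on the model pages before use.

```python
from transformers import AutoTokenizer, AutoModel

# General-domain LinkBERT; swap in "michiyasunaga/BioLinkBERT-base" for the biomedical variant.
model_id = "michiyasunaga/LinkBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer(
    "The Tidal Basin hosts the National Cherry Blossom Festival.",
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```

Because LinkBERT uses the standard BERT architecture, it can also be loaded with task-specific wrappers such as `AutoModelForQuestionAnswering` or `AutoModelForSequenceClassification` and fine-tuned the same way BERT would be.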
The development of LinkBERT opens exciting avenues for future research. This includes extending the document link-aware pretraining to other model architectures like GPT or sequence-to-sequence models for text generation, and generalizing the concept of document links to other modalities, such as incorporating source code dependency links for models trained on code.
Pros and Cons
Pros:
- Enhances language models with multi-hop reasoning capabilities.
- Improves understanding of relationships between documents.
- Achieves better performance in data-efficient and few-shot learning scenarios.
- Demonstrates strong performance gains, particularly in domains with rich inter-document links (e.g., biomedical literature).
- Can be easily integrated as a drop-in replacement for existing BERT-based models.
Cons:
- Requires the availability of document links within the pretraining corpus, which may not be present in all datasets.
- The construction of the document graph can be computationally intensive for very large corpora.
- The added complexity of the DRP task may increase pretraining time and resource requirements.
Key Takeaways
- LinkBERT leverages document links (hyperlinks, citations) to create more knowledgeable language models.
- By processing linked documents together, LinkBERT learns multi-hop knowledge and improves reasoning.
- The pretraining tasks include Masked Language Modeling and Document Relation Prediction.
- LinkBERT shows significant improvements over standard models in areas like multi-hop reasoning and few-shot learning.
- Pretrained LinkBERT models are publicly available, facilitating their use in various NLP applications.