LinkBERT: Enhancing Language Models with Document Link Information
The Power of Pretrained Language Models
Language models (LMs) like BERT and the GPT series have revolutionized natural language processing (NLP), forming the backbone of many modern applications, from search engines to personal assistants. Their success stems from a pretraining phase where they learn from massive amounts of unlabeled text data. This self-supervised learning allows LMs to encode a vast amount of world knowledge and linguistic understanding, which can then be efficiently adapted to various downstream tasks with minimal task-specific fine-tuning.
Common pretraining strategies include masked language modeling (MLM), where the model predicts masked words in a sentence, and causal language modeling, where the model predicts the next word in a sequence. Through these methods, LMs learn associations between concepts (e.g., “dog,” “fetch,” “ball”) that are crucial for knowledge-intensive applications like question answering.
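As a quick illustration of the MLM objective, the snippet below uses the HuggingFace transformers fill-mask pipeline with an off-the-shelf BERT checkpoint; the sentence and the choice of model are illustrative only and are not tied to LinkBERT.

```python
# Minimal illustration of masked language modeling (MLM): the model
# predicts the [MASK] token from its surrounding context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The dog loves to play [MASK] in the park."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```

A causal LM, by contrast, would be asked to continue the text one token at a time rather than fill in a blank.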
The Challenge of Isolated Documents
A significant limitation of many existing LM pretraining approaches is their tendency to process documents in isolation. This means that training instances are drawn from individual documents independently. However, real-world text, particularly from the web and scientific literature, is rich with inter-document dependencies, such as hyperlinks and citation links. These links are vital because knowledge often spans across multiple documents.
Consider the example of Wikipedia articles: the article on the Tidal Basin in Washington, D.C. mentions the “National Cherry Blossom Festival,” and a hyperlink leads to another article explaining that the festival celebrates “Japanese cherry trees.” Combined, the linked information reveals multi-hop knowledge (“the Tidal Basin has Japanese cherry trees”) that is not fully present in either document alone. Models trained without leveraging these links may fail to capture such cross-document knowledge, hindering their performance on tasks that require multi-hop reasoning or comprehensive understanding.
Introducing LinkBERT: A Link-Aware Approach
To address this limitation, researchers have developed LinkBERT, a novel pretraining method designed to incorporate document link information, thereby training language models with enhanced world knowledge. The approach involves three key steps:
- Document Graph Construction: The first step is to build a document graph that connects related documents in the corpus. Each document is treated as a node, and directed edges are added between documents based on hyperlinks or citation links. Because such links usually point to highly relevant material, they bring related knowledge together.
- Link-Aware Input Creation: With the document graph established, LinkBERT creates specialized input instances for the LM. To enable the model to learn dependencies across documents, linked documents are placed together within the same input sequence. This is done by segmenting documents and then concatenating pairs of segments. Three options for creating these pairs exist:
- Contiguous Segments: Two consecutive segments from the same document (similar to existing LMs).
- Random Segments: Segments sampled from two entirely random documents.
- Linked Segments: One segment from a document and another from a document directly linked to it in the graph.
These varied concatenation strategies give the LM a training signal for recognizing relationships between text segments (a code sketch of this sampling appears after this list).
- Link-Aware LM Pretraining: The final step involves pretraining the LM using link-aware self-supervised tasks, primarily masked language modeling (MLM) and document relation prediction (DRP).
- Masked Language Modeling (MLM): By masking tokens and having the model predict them using surrounding context, including information from linked documents, LinkBERT learns multi-hop knowledge. For instance, concepts from linked documents appearing in the same input sequence allow the model to learn relationships like “Tidal Basin” -> “National Cherry Blossom Festival” -> “Japanese cherry trees.”
- Document Relation Prediction (DRP): This task trains the model to classify the relationship between two segments (contiguous, random, or linked). It encourages the LM to learn relevance and dependencies between documents, as well as bridging concepts.
These two tasks are trained jointly, effectively acting as node feature prediction and link prediction on the document graph; simplified sketches of the segment sampling and the joint objective follow below.
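To make the first two steps concrete, here is a minimal sketch of how a toy hyperlinked corpus could be represented as a graph and how contiguous, random, and linked segment pairs might be sampled from it. The corpus format, helper names, and uniform sampling below are illustrative assumptions, not the authors' implementation.

```python
import random

# Hypothetical toy corpus: each document is a graph node, with its text
# split into segments and outgoing edges given by its hyperlinks.
corpus = {
    "tidal_basin": {
        "segments": ["The Tidal Basin hosts the National Cherry Blossom Festival.",
                     "It is a reservoir in Washington, D.C."],
        "links": ["cherry_blossom_festival"],
    },
    "cherry_blossom_festival": {
        "segments": ["The festival celebrates Japanese cherry trees.",
                     "It is held every spring."],
        "links": [],
    },
}

RELATIONS = ["contiguous", "random", "linked"]  # the three DRP classes

def sample_pair(doc_id, rng=random):
    """Return (segment_a, segment_b, relation) for one training instance."""
    doc = corpus[doc_id]
    relation = rng.choice(RELATIONS)

    if relation == "contiguous" and len(doc["segments"]) > 1:
        i = rng.randrange(len(doc["segments"]) - 1)
        return doc["segments"][i], doc["segments"][i + 1], "contiguous"

    if relation == "linked" and doc["links"]:
        linked_doc = corpus[rng.choice(doc["links"])]
        return rng.choice(doc["segments"]), rng.choice(linked_doc["segments"]), "linked"

    # Fall back to a random document when the preferred option is unavailable.
    other_id = rng.choice([d for d in corpus if d != doc_id])
    return rng.choice(doc["segments"]), rng.choice(corpus[other_id]["segments"]), "random"

seg_a, seg_b, relation = sample_pair("tidal_basin")
print(relation, "|", seg_a, "||", seg_b)
```

Each sampled pair is then packed into a single LM input of the form [CLS] segment_a [SEP] segment_b [SEP], with the relation kept as the DRP label.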
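The joint objective can likewise be sketched with standard components: an MLM head predicts the masked tokens while a small three-way classifier over the [CLS] representation predicts the document relation. The masking code, the extra head, and the unweighted sum of the two losses are simplifying assumptions for illustration, not the released training code.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM, AutoTokenizer

RELATIONS = ["contiguous", "random", "linked"]  # same label set as above

# Assumed base encoder; LinkBERT itself follows the BERT architecture.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Three-way Document Relation Prediction head over the [CLS] representation.
drp_head = nn.Linear(mlm_model.config.hidden_size, len(RELATIONS))

def pretraining_loss(seg_a, seg_b, relation):
    """Joint MLM + DRP loss for one (segment_a, segment_b, relation) instance."""
    # Pack the pair into one sequence: [CLS] seg_a [SEP] seg_b [SEP]
    enc = tokenizer(seg_a, seg_b, return_tensors="pt")
    labels = enc["input_ids"].clone()

    # Mask roughly 15% of ordinary tokens for the MLM objective (simplified masking).
    maskable = (labels != tokenizer.cls_token_id) & (labels != tokenizer.sep_token_id)
    mask = (torch.rand(labels.shape) < 0.15) & maskable
    enc["input_ids"][mask] = tokenizer.mask_token_id
    labels[~mask] = -100  # ignore unmasked positions in the MLM loss

    out = mlm_model(**enc, labels=labels, output_hidden_states=True)
    cls_state = out.hidden_states[-1][:, 0]  # final-layer [CLS] vector
    drp_logits = drp_head(cls_state)
    drp_loss = nn.functional.cross_entropy(
        drp_logits, torch.tensor([RELATIONS.index(relation)]))

    return out.loss + drp_loss  # unweighted sum of the MLM and DRP losses
```

Training would iterate this over pairs produced by the sampling step above, updating the encoder and both heads together.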
LinkBERT’s Performance and Strengths
LinkBERT has been evaluated on various downstream NLP tasks in both general and biomedical domains, demonstrating significant improvements over baseline models that do not leverage document links. Specifically, LinkBERT shows:
- Improved Performance Across Domains: LinkBERT matches or outperforms models like BERT and PubMedBERT on benchmarks such as GLUE, MRQA, BLURB, MedQA, and MMLU, with the largest gains on question answering. The improvements are particularly pronounced in the biomedical domain, where citation links capture dependencies that individual papers alone do not. BioLinkBERT, the biomedical variant, achieved new state-of-the-art results on several key benchmarks.
- Effective Multi-Hop Reasoning: LinkBERT excels at tasks requiring multi-hop reasoning. In question answering scenarios, it can correctly connect information across multiple documents to arrive at the right answer, whereas standard BERT might incorrectly focus on information within a single document. This is attributed to LinkBERT’s exposure to related concepts from linked documents during pretraining.
- Robustness to Document Relations: The Document Relation Prediction task equips LinkBERT with a better understanding of relationships between documents. This makes it more robust to irrelevant or noisy documents in open-domain question answering, maintaining accuracy where BERT might falter.
- Data Efficiency: LinkBERT demonstrates superior performance in few-shot and data-efficient learning scenarios. When fine-tuned with limited training data, it significantly outperforms BERT, suggesting that it has internalized more comprehensive world knowledge during pretraining.
Using LinkBERT
LinkBERT is designed as a drop-in replacement for BERT, making it easy to integrate into existing NLP pipelines. Pretrained LinkBERT models are available on platforms like HuggingFace, allowing researchers and developers to load and utilize them with straightforward code snippets.
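For example, the LinkBERT checkpoints published on the HuggingFace Hub (at the time of writing under the michiyasunaga namespace, e.g. michiyasunaga/LinkBERT-base and michiyasunaga/BioLinkBERT-base) can be loaded with the standard transformers Auto classes:

```python
from transformers import AutoTokenizer, AutoModel

# LinkBERT shares the BERT architecture, so the generic Auto* classes apply.
model_name = "michiyasunaga/LinkBERT-base"  # or "michiyasunaga/BioLinkBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("The Tidal Basin hosts the National Cherry Blossom Festival.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```

Because the architecture matches BERT, the same fine-tuning code used for a BERT checkpoint (for example with AutoModelForSequenceClassification or AutoModelForQuestionAnswering) should work after swapping in the LinkBERT model name.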
Pros and Cons of LinkBERT
Pros:
- Significantly improves performance on knowledge-intensive and reasoning-heavy NLP tasks.
- Effectively captures multi-hop knowledge and cross-document dependencies.
- Demonstrates enhanced robustness to irrelevant information and better data efficiency.
- Easy to integrate as a replacement for existing BERT-based models.
- Achieves state-of-the-art results in specialized domains like biomedical NLP.
Cons:
- Requires a corpus with identifiable document links for effective training.
- The process of building a high-quality document graph can be computationally intensive.
- While it improves reasoning, complex multi-hop reasoning chains may still pose challenges even for LinkBERT.
Key Takeaways
LinkBERT represents a significant advancement in language model pretraining by effectively leveraging document link information. By incorporating hyperlinks and citations into the training process through link-aware input creation and specialized self-supervised tasks, LinkBERT models gain a richer understanding of world knowledge and inter-document relationships. This leads to improved performance in a variety of NLP applications, particularly those requiring multi-hop reasoning and robustness to complex data structures. The availability of pretrained LinkBERT models empowers the NLP community to build more knowledgeable and capable language-based systems.