LinkBERT: Improving Language Model Training with Document Links

Language Models (LMs) like BERT and the GPT series have revolutionized Natural Language Processing (NLP), forming the backbone of many everyday technologies such as search engines and personal assistants. Their power stems from a self-supervised pretraining process on vast amounts of web text, allowing them to learn intricate language patterns and world knowledge without requiring explicit labels. This pretrained knowledge can then be efficiently adapted to a wide array of downstream tasks with minimal task-specific fine-tuning.

The core of LM pretraining often involves tasks like masked language modeling (MLM), where the model predicts missing words within a sentence, or causal language modeling (CLM), where it predicts the next word in a sequence. Through these methods, LMs internalize associations between concepts, enabling them to perform well on knowledge-intensive applications like question answering.
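
As a quick illustration of the MLM objective, the snippet below uses the HuggingFace transformers fill-mask pipeline with an off-the-shelf BERT checkpoint; the model name and example sentence are illustrative and not specific to LinkBERT.

    from transformers import pipeline

    # Load a masked-language-modeling pipeline with a standard BERT checkpoint.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    # The model predicts the token hidden behind [MASK] from its surrounding context.
    for prediction in fill_mask("The Tidal Basin is located in [MASK], D.C."):
        print(prediction["token_str"], round(prediction["score"], 3))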

The Challenge of Isolated Documents

A significant limitation in many common LM pretraining strategies is their tendency to process documents in isolation. This approach, where training instances are drawn independently from each document in a corpus, overlooks the rich interdependencies that often exist between them. Text from the web or scientific literature, for instance, frequently contains hyperlinks and citation links that connect related pieces of information across different documents. These links are crucial because knowledge is often distributed and can span multiple documents, offering insights not present within any single document alone.

Consider the example of Wikipedia articles: one article about the “Tidal Basin, Washington D.C.” might mention the “National Cherry Blossom Festival.” By following a hyperlink to the festival’s article, one might learn it celebrates “Japanese cherry trees.” Combined, this multi-hop information reveals that the Tidal Basin has Japanese cherry trees, a piece of knowledge not fully contained in either document individually. Models trained without leveraging these document links may fail to capture such interconnected knowledge, which is vital for tasks like question answering and knowledge discovery.

Introducing LinkBERT: A Graph-Based Approach

To address this limitation, researchers have developed LinkBERT, a novel pretraining method that explicitly incorporates document link information to train language models with enhanced world knowledge. LinkBERT’s approach can be broadly categorized into three steps:

  1. Document Graph Construction: First, links between documents within a text corpus are identified to build a document graph. Each document is treated as a node, and an edge is created between documents if a link (e.g., hyperlink, citation) exists from one to the other. Hyperlinks and citation links are favored due to their high relevance and ubiquitous availability.
  2. Link-Aware Input Creation: To facilitate learning of multi-hop dependencies, linked documents are brought together in training instances. Documents are segmented into chunks (e.g., 256 tokens), and pairs of segments are concatenated to form input sequences for the LM. This can be done in three ways:
    • Contiguous segments: Segments from the same document.
    • Random segments: Segments from two randomly chosen documents.
    • Linked segments: Segments from a document and another document linked to it in the graph.

    Mixing these three strategies provides a training signal that teaches LinkBERT to recognize how two text segments relate to each other; a minimal sketch of the pairing procedure appears after this list.

  3. Link-Aware LM Pretraining: The LM is then pretrained using two joint self-supervised tasks:
    • Masked Language Modeling (MLM): This task, similar to standard MLM, encourages the LM to learn multi-hop knowledge by predicting masked tokens within a context that now includes linked document segments. For example, concepts from the linked articles about the Tidal Basin, the National Cherry Blossom Festival, and Japanese cherry trees can be presented together, allowing the model to learn their interrelations.
    • Document Relation Prediction (DRP): This task involves classifying the relationship between two segments (contiguous, random, or linked). DRP helps the LM learn the relevance and dependencies between documents, as well as bridging concepts.

    These pretraining tasks can be conceptualized as graph-based self-supervised learning algorithms: node feature prediction (MLM) and link prediction (DRP). Both objectives are sketched below.
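
To make steps 1 and 2 concrete, here is a minimal sketch of building the document graph and sampling the three kinds of segment pairs. It assumes plain Python with whitespace tokenization; the toy corpus, the helper names (split_into_segments, sample_pair), and the sampling details are illustrative rather than taken from the LinkBERT codebase.

    import random

    # Illustrative toy corpus: each document has text and outgoing hyperlinks.
    corpus = {
        "tidal_basin": {"text": "...", "links": ["cherry_blossom_festival"]},
        "cherry_blossom_festival": {"text": "...", "links": ["japanese_cherry_trees"]},
        "japanese_cherry_trees": {"text": "...", "links": []},
    }

    # Step 1: document graph -- nodes are documents, edges follow hyperlinks.
    edges = {doc_id: doc["links"] for doc_id, doc in corpus.items()}

    def split_into_segments(text, tokens_per_segment=256):
        """Split a document into fixed-size chunks of (whitespace) tokens."""
        tokens = text.split()
        return [" ".join(tokens[i:i + tokens_per_segment])
                for i in range(0, len(tokens), tokens_per_segment)]

    def sample_pair(doc_id):
        """Step 2: build one training pair -- contiguous, random, or linked."""
        segments = split_into_segments(corpus[doc_id]["text"])
        i = random.randrange(len(segments))
        seg_a = segments[i]
        strategy = random.choice(["contiguous", "random", "linked"])
        if strategy == "linked" and not edges[doc_id]:
            strategy = "contiguous"                  # no outgoing links: fall back
        if strategy == "contiguous":
            seg_b = segments[min(i + 1, len(segments) - 1)]   # next chunk, same doc
        elif strategy == "random":
            other = random.choice(list(corpus))      # any document in the corpus
            seg_b = random.choice(split_into_segments(corpus[other]["text"]))
        else:                                        # linked
            neighbor = random.choice(edges[doc_id])  # a hyperlinked document
            seg_b = random.choice(split_into_segments(corpus[neighbor]["text"]))
        # Downstream, the pair is concatenated as "[CLS] seg_a [SEP] seg_b [SEP]";
        # the strategy label is the target for Document Relation Prediction.
        return seg_a, seg_b, strategy

Step 3 then trains the encoder on the concatenated pairs with the two objectives combined into one loss. The sketch below shows the general shape using a BERT-style encoder from the HuggingFace transformers library; the 3-way DRP head, the 15% masking rate, and the simple addition of the two losses are assumptions for illustration, not the exact LinkBERT training code.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    # A 3-way classifier over the [CLS] representation for Document Relation
    # Prediction (contiguous / random / linked); illustrative, not the paper's code.
    drp_head = torch.nn.Linear(model.config.hidden_size, 3)

    def pretraining_loss(seg_a, seg_b, drp_label):
        enc = tokenizer(seg_a, seg_b, return_tensors="pt", truncation=True)
        # MLM: mask ~15% of non-special tokens and keep the originals as labels.
        labels = enc["input_ids"].clone()
        special = (labels == tokenizer.cls_token_id) | (labels == tokenizer.sep_token_id)
        mask = (torch.rand(labels.shape) < 0.15) & ~special
        enc["input_ids"][mask] = tokenizer.mask_token_id
        labels[~mask] = -100                 # ignore unmasked positions in the loss
        # (a real data pipeline would guarantee at least one masked token)
        out = model(**enc, labels=labels, output_hidden_states=True)
        # DRP: classify the pair relation from the final [CLS] hidden state.
        cls_state = out.hidden_states[-1][:, 0]
        drp_logits = drp_head(cls_state)
        drp_loss = torch.nn.functional.cross_entropy(
            drp_logits, torch.tensor([drp_label]))
        return out.loss + drp_loss           # joint MLM + DRP objective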

Performance and Strengths of LinkBERT

LinkBERT has been evaluated on various downstream NLP tasks in both general and biomedical domains. By pretraining on Wikipedia with hyperlinks and on PubMed with citation links, LinkBERT models (including BioLinkBERT for the biomedical domain) have demonstrated consistent improvements over baseline models that did not leverage document links.

Key strengths of LinkBERT include:

  • Effective Multi-Hop Reasoning: LinkBERT significantly improves performance on tasks requiring multi-hop reasoning, such as certain question-answering benchmarks. By bringing related documents into the same training instances, it enables the model to better connect information across documents to arrive at correct answers, outperforming models like BERT, which may default to an answer drawn from a single document.
  • Improved Document Relation Understanding: The inclusion of the DRP task during pretraining enhances LinkBERT’s ability to understand relationships between documents. This makes it more robust to irrelevant or distracting documents in open-domain question answering scenarios, maintaining accuracy where BERT might experience a performance drop.
  • Data-Efficient Learning: LinkBERT shows considerable advantages in few-shot learning scenarios, where models are fine-tuned with limited training data. This suggests that LinkBERT internalizes more comprehensive world knowledge during pretraining, making it more effective in low-resource settings.

Using LinkBERT

LinkBERT is designed to be a drop-in replacement for existing BERT models. Pretrained LinkBERT models are available on platforms like HuggingFace, allowing easy integration into existing NLP pipelines. Researchers and developers can leverage LinkBERT for various applications, including question answering and text classification, by simply updating model paths or using provided fine-tuning scripts.
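
For example, switching from BERT to LinkBERT in a transformers-based pipeline is typically just a change of checkpoint name. The model identifier below matches the checkpoints released on the HuggingFace hub at the time of writing; verify the exact id (e.g. LinkBERT-base, LinkBERT-large, BioLinkBERT-base) before use.

    from transformers import AutoModel, AutoTokenizer

    # Drop-in replacement: point the usual loading code at a LinkBERT checkpoint.
    model_name = "michiyasunaga/LinkBERT-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # From here, fine-tune exactly as with BERT, e.g. via
    # AutoModelForQuestionAnswering or AutoModelForSequenceClassification.
    inputs = tokenizer("Where are the Japanese cherry trees planted?", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)   # (batch_size, sequence_length, hidden_size)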

Pros and Cons

Pros:

  • Enhances language models with knowledge spanning multiple documents.
  • Improves performance on tasks requiring multi-hop reasoning.
  • More robust to irrelevant documents in information retrieval tasks.
  • Shows significant gains in few-shot learning scenarios.
  • Leverages readily available document link structures (hyperlinks, citations).

Cons:

  • Requires a corpus with well-defined document links for optimal performance.
  • The complexity of graph construction and link-aware input creation might increase computational overhead during pretraining.
  • Effectiveness may depend on the quality and density of links within the training corpus.

Key Takeaways

LinkBERT represents a significant advancement in language model pretraining by effectively integrating document link information. By treating text corpora as graphs and training with tasks like MLM and DRP on linked document segments, LinkBERT models demonstrate superior performance in multi-hop reasoning, document relation understanding, and data-efficient learning. This approach unlocks new potential for knowledge-intensive NLP applications and offers a promising direction for future research, including extending these concepts to other model architectures and data modalities.

