Language models (LMs) such as BERT and the GPT series have significantly advanced natural language processing (NLP), forming the foundation of many contemporary NLP systems, from search engines to personal assistants. Their effectiveness stems from pretraining on vast amounts of text through self-supervised learning, which requires no labeled data; after pretraining, they adapt to a wide array of downstream tasks with minimal task-specific fine-tuning. Common pretraining objectives include masked language modeling (MLM), where the model predicts masked words from their surrounding context, and causal language modeling, where the model predicts the next word in a sequence. Through these objectives, LMs acquire world knowledge and associations between concepts that are crucial for tasks like question answering.
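As a concrete illustration of the two objectives, the following minimal sketch uses Hugging Face `transformers` pipelines with standard public checkpoints (`bert-base-uncased` for MLM, `gpt2` for causal LM); the example sentences are purely illustrative.

```python
from transformers import pipeline

# Masked language modeling: predict a masked token from bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Cherry trees bloom every [MASK] in Washington, D.C."))

# Causal language modeling: predict the next tokens given only the left context.
generator = pipeline("text-generation", model="gpt2")
print(generator("Cherry trees bloom every", max_new_tokens=5))
```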
A significant limitation of current LM pretraining strategies is the prevalent practice of processing documents in isolation, which restricts a model’s capacity to capture knowledge that spans multiple interconnected documents. The blog post highlights that web text and scientific literature, the corpora commonly used for LM training, often contain document links such as hyperlinks and citation links. These links are vital for knowledge discovery because they connect related information that is not present within any single document. For example, knowing that the “Tidal Basin, Washington D.C.” article links to an article about the “National Cherry Blossom Festival,” which celebrates “Japanese cherry trees,” allows the multi-hop inference that “the Tidal Basin has Japanese cherry trees,” a fact not fully contained in either document alone. Ignoring such links can hinder performance in knowledge-intensive applications. The blog post therefore posits that a text corpus should be viewed not as a mere collection of documents but as a graph in which documents are nodes and links are edges. To address this, it introduces a new pretraining method, LinkBERT, that incorporates document link information to train LMs with enhanced world knowledge.
The LinkBERT approach involves three core steps. First, it constructs a document graph by identifying links between documents, focusing on hyperlinks and citation links because of their relevance and ubiquity. Each document is a node, and a directed edge runs from document ‘i’ to document ‘j’ if document ‘i’ contains a hyperlink to document ‘j’. Second, LinkBERT creates link-aware training instances by pairing text segments for the LM input. Documents are first chunked into segments of roughly 256 tokens, and each input concatenates two segments chosen in one of three ways: contiguous segments from the same document, random segments from different documents, or linked segments from documents connected in the graph. These diverse pairing options generate training signals that teach the model the relationships between text segments. Third, LinkBERT is pretrained with two joint self-supervised tasks: masked language modeling (MLM) and document relation prediction (DRP). MLM encourages the model to learn multi-hop knowledge by predicting masked tokens using context from linked documents, while DRP classifies each segment pair as contiguous, random, or linked, which helps the LM learn document relevance and bridging concepts. These pretraining tasks can be viewed as graph-based self-supervised learning, with MLM analogous to node feature prediction and DRP analogous to link prediction.
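The sketch below mimics steps 1 and 2 plus the DRP labels, assuming a toy document graph and a hypothetical helper `make_pair`; document contents and edges are illustrative only, and the real pipeline chunks documents into ~256-token segments rather than single sentences.

```python
import random

# Step 1 (toy): a document graph. Each document is a list of text segments,
# and a directed edge i -> j means document i hyperlinks to document j.
documents = {
    "doc_tidal_basin": [
        "The Tidal Basin is a reservoir in Washington, D.C.",
        "It hosts the National Cherry Blossom Festival each spring.",
    ],
    "doc_festival": [
        "The National Cherry Blossom Festival celebrates Japanese cherry trees.",
        "The first trees were a 1912 gift from Tokyo.",
    ],
    "doc_unrelated": [
        "Alpha Centauri is the closest star system to the Sun.",
        "It lies about 4.4 light-years away.",
    ],
}
edges = {"doc_tidal_basin": ["doc_festival"]}

# Step 2 (toy): build one (segment_A, segment_B, DRP label) training instance.
DRP_LABELS = {"contiguous": 0, "random": 1, "linked": 2}

def make_pair(doc_id, option):
    """Pair a segment of `doc_id` with a contiguous, random, or linked segment."""
    segment_a = documents[doc_id][0]
    if option == "contiguous":            # next segment of the same document
        segment_b = documents[doc_id][1]
    elif option == "random":              # segment from an unlinked document
        candidates = [d for d in documents
                      if d != doc_id and d not in edges.get(doc_id, [])]
        segment_b = random.choice(documents[random.choice(candidates)])
    elif option == "linked":              # segment from a hyperlinked document
        segment_b = random.choice(documents[random.choice(edges[doc_id])])
    else:
        raise ValueError(f"unknown option: {option}")
    return segment_a, segment_b, DRP_LABELS[option]

for option in ("contiguous", "random", "linked"):
    seg_a, seg_b, label = make_pair("doc_tidal_basin", option)
    print(label, "|", seg_a, "||", seg_b)
```

In the actual pretraining, each pair is encoded as `[CLS] segment_A [SEP] segment_B [SEP]`, with the DRP head classifying the pair (a three-way extension of BERT's next sentence prediction) while MLM predicts masked tokens in both segments.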
LinkBERT has been evaluated on a range of downstream NLP tasks in both the general and biomedical domains. In the general domain it was pretrained on Wikipedia using hyperlinks, and in the biomedical domain on PubMed using citation links. Evaluations were conducted on benchmarks such as MRQA and GLUE for general NLP, and BLURB, MedQA, and MMLU for biomedical NLP. LinkBERT consistently improved upon baseline models such as BERT and PubMedBERT across these tasks and domains. The improvements were particularly pronounced in the biomedical domain, attributed to the significant role of citation links in scientific literature, which the biomedical variant, BioLinkBERT, effectively captures. BioLinkBERT achieved state-of-the-art results on the BLURB, MedQA, and MMLU benchmarks.
The analysis highlights several key strengths of LinkBERT. First, it demonstrates enhanced multi-hop reasoning, evidenced by improved performance on MRQA tasks that require synthesizing information from multiple documents. LinkBERT correctly predicts answers that require bridging information across documents, whereas BERT tends to default to information found within a single document. This is attributed to bringing related concepts and documents together in training instances, which facilitates multi-document reasoning in downstream applications. Second, LinkBERT excels at understanding document relations. When presented with distracting, irrelevant documents in open-domain QA tasks, LinkBERT maintains its accuracy while BERT's performance drops, a robustness attributed to the DRP pretraining task, which teaches the model to recognize document relevance. Third, LinkBERT proves effective in few-shot and data-efficient question answering: when fine-tuned with only 1% or 10% of the training data, it significantly outperforms BERT, suggesting it internalized more world knowledge during pretraining and supporting the hypothesis that document links contribute valuable knowledge to LMs.
LinkBERT is designed as a drop-in replacement for BERT. Pretrained models, including LinkBERT (general domain) and BioLinkBERT (biomedical domain), are available on the Hugging Face Hub and can be loaded with standard libraries such as Hugging Face's `transformers`. For downstream applications such as question answering and text classification, existing BERT fine-tuning scripts can be adapted by simply changing the model path to LinkBERT. The blog post also notes that code and data are available on GitHub. Future directions include generalizing document link-aware pretraining to sequence-to-sequence models for text generation and applying the idea of document links to other modalities, such as source code dependencies for code language models.
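As a minimal sketch of this drop-in usage, the snippet below loads a general-domain checkpoint with `transformers`; the model identifier `michiyasunaga/LinkBERT-base` is assumed from the Hugging Face Hub, and the BioLinkBERT variants load the same way.

```python
from transformers import AutoTokenizer, AutoModel

# Checkpoint name assumed from the Hugging Face Hub; swap in a BioLinkBERT
# identifier for the biomedical domain.
model_name = "michiyasunaga/LinkBERT-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("The Tidal Basin hosts the National Cherry Blossom Festival.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```

Because LinkBERT shares BERT's architecture and tokenizer interface, existing fine-tuning scripts for question answering or text classification typically need only the model path changed.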
Key takeaways from the LinkBERT approach include:
- LinkBERT enhances language model pretraining by incorporating document link information, such as hyperlinks and citations, to capture knowledge that spans across multiple documents.
- The method involves building a document graph, creating link-aware training instances by placing linked documents together, and employing joint pretraining tasks of masked language modeling (MLM) and document relation prediction (DRP).
- LinkBERT demonstrates improved performance on various NLP tasks, particularly excelling in multi-hop reasoning, document relation understanding, and few-shot learning scenarios.
- BioLinkBERT, the biomedical version, achieves state-of-the-art results on several key biomedical NLP benchmarks.
- LinkBERT serves as a drop-in replacement for BERT, with pretrained models readily available for practical use.
An educated reader interested in advancing language model capabilities, especially in knowledge-intensive or reasoning-heavy applications, should consider exploring the LinkBERT and BioLinkBERT models available on the Hugging Face Hub and experimenting with them in their own projects. Further investigation of the research paper and the GitHub repository with code and data would offer deeper insight into the implementation and potential extensions of this approach.