Language models (LMs) such as BERT and the GPT series are fundamental to modern natural language processing (NLP) systems, powering applications like search engines and personal assistants. Their effectiveness stems from a pretraining phase using self-supervised learning on vast amounts of web text, enabling adaptation to various tasks with minimal fine-tuning. For instance, BERT is trained to predict masked words, while GPT models predict the next word in a sequence. Through this pretraining, LMs acquire world knowledge, including associations between concepts, which is vital for knowledge-intensive applications like question answering (http://ai.stanford.edu/blog/linkbert/).
A significant limitation in many current LM pretraining strategies is the processing of documents in isolation. This approach overlooks the rich interdependencies that often exist between documents, such as hyperlinks and citation links, which are prevalent in sources like web text and scientific literature. These links are crucial for capturing knowledge that spans multiple documents. For example, a hyperlink between an article about the “Tidal Basin, Washington D.C.” and one about the “National Cherry Blossom Festival” can reveal multi-hop knowledge, such as “Tidal Basin has Japanese cherry trees,” which is not explicitly stated within either document alone. Models that disregard these connections may fail to learn such distributed knowledge, hindering performance on tasks requiring cross-document understanding and reasoning, like question answering and knowledge discovery (http://ai.stanford.edu/blog/linkbert/). The source material posits that a text corpus should be viewed not as a simple list of documents, but as a graph where documents are nodes connected by links.
To address this, a new pretraining method called LinkBERT has been developed. LinkBERT aims to incorporate document link information to train LMs with enhanced world knowledge. The LinkBERT approach involves three main steps: constructing a document graph by identifying links between documents, creating link-aware training instances by grouping linked documents, and pretraining the LM using link-aware self-supervised tasks: masked language modeling (MLM) and document relation prediction (DRP) (http://ai.stanford.edu/blog/linkbert/).
The construction of the document graph involves treating each document as a node and adding a directed edge whenever a hyperlink points from one document to another. Hyperlinks and citation links are prioritized because they are both highly relevant and available at scale. To create link-aware LM inputs, documents are chunked into segments of roughly 256 tokens, and pairs of segments are then concatenated into a single input sequence for the LM. Three options for choosing the second segment are presented: a contiguous segment from the same document (as in existing LMs), a random segment from a different document, and a linked segment sampled from a document connected in the graph. Mixing these options provides the training signal the model needs to learn relations between text segments; a sampling sketch follows below (http://ai.stanford.edu/blog/linkbert/).
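To make this concrete, the following Python sketch builds a document graph from hyperlinks and draws a contiguous, random, or linked segment pair. It is an illustration under simplifying assumptions (documents given as plain-text strings with explicit link-target ids, a user-supplied `tokenize` function, non-empty documents, link targets present in the corpus), not the authors' implementation.

```python
import random

# Illustrative sketch of link-aware input construction (not the authors' code).
# Each document is a dict with an "id", its raw "text", and the ids of
# documents it hyperlinks to under "links".

SEGMENT_LEN = 256  # approximate segment length in tokens


def build_document_graph(documents):
    """Treat each document as a node; add a directed edge per hyperlink."""
    return {doc["id"]: set(doc["links"]) for doc in documents}


def chunk_into_segments(tokens, segment_len=SEGMENT_LEN):
    """Split a tokenized document into ~segment_len-token segments."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def sample_segment_pair(documents, edges, tokenize):
    """Sample a (segment_a, segment_b, relation) training instance.

    relation is "contiguous", "random", or "linked", matching the three
    concatenation options described above.
    """
    doc = random.choice(documents)                    # assumes non-empty text
    segments = chunk_into_segments(tokenize(doc["text"]))
    seg_a_idx = random.randrange(len(segments))
    seg_a = segments[seg_a_idx]

    relation = random.choice(["contiguous", "random", "linked"])
    if relation == "contiguous" and seg_a_idx + 1 < len(segments):
        seg_b = segments[seg_a_idx + 1]               # next segment, same document
    elif relation == "linked" and edges[doc["id"]]:
        # Assumes every link target is itself present in `documents`.
        target_id = random.choice(sorted(edges[doc["id"]]))
        target = next(d for d in documents if d["id"] == target_id)
        seg_b = random.choice(chunk_into_segments(tokenize(target["text"])))
    else:
        relation = "random"                           # fall back to an unrelated document
        other = random.choice(documents)
        seg_b = random.choice(chunk_into_segments(tokenize(other["text"])))
    return seg_a, seg_b, relation
```

Mixing the three cases in this way yields same-document, unrelated, and linked pairs, which is exactly the label set the document relation prediction task classifies over.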
The pretraining phase utilizes two joint self-supervised tasks. Masked language modeling (MLM) involves masking tokens and predicting them using surrounding tokens, which encourages the LM to learn multi-hop knowledge by bringing related concepts from different documents into the same context. Document relation prediction (DRP) trains the model to classify the relationship between two segments (contiguous, random, or linked). This task helps the LM learn document relevance and dependencies, as well as bridging concepts. These pretraining tasks can be conceptually mapped to graph-based self-supervised learning algorithms: node feature prediction (MLM) and link prediction (DRP) (http://ai.stanford.edu/blog/linkbert/).
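The sketch below illustrates how the two objectives can be combined on top of a shared BERT encoder: an MLM head predicts masked tokens, and a three-way classifier over the [CLS] representation predicts the segment-pair relation. The class and head names are hypothetical, the MLM head is deliberately simplified, and the actual LinkBERT training code differs in its details.

```python
import torch.nn as nn
from transformers import BertModel

# Illustrative joint MLM + DRP objective on a shared encoder (not the
# authors' training code). Masked positions in mlm_labels use -100 so they
# are ignored by the loss, as is standard for masked language modeling.


class LinkAwarePretrainingSketch(nn.Module):
    RELATIONS = ("contiguous", "random", "linked")  # DRP label set

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.mlm_head = nn.Linear(hidden, self.encoder.config.vocab_size)
        self.drp_head = nn.Linear(hidden, len(self.RELATIONS))
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask, mlm_labels, drp_labels):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_states = out.last_hidden_state        # (batch, seq_len, hidden)
        cls_state = token_states[:, 0]              # [CLS] summarizes the segment pair

        # Masked language modeling: predict masked tokens from surrounding context.
        mlm_logits = self.mlm_head(token_states)
        mlm_loss = self.loss_fn(
            mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1)
        )

        # Document relation prediction: contiguous vs. random vs. linked.
        drp_logits = self.drp_head(cls_state)
        drp_loss = self.loss_fn(drp_logits, drp_labels)

        return mlm_loss + drp_loss                  # joint self-supervised objective
```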
LinkBERT has been evaluated on general-domain tasks using Wikipedia (incorporating hyperlinks) and on biomedical tasks using PubMed (incorporating citation links). The results indicate that LinkBERT consistently improves upon baseline models such as BERT and PubMedBERT across a range of downstream tasks. The gains are particularly pronounced in the biomedical domain, where citation links play a critical role in scientific literature; BioLinkBERT, the biomedical version, leverages them to achieve state-of-the-art performance on benchmarks such as BLURB, MedQA, and MMLU. LinkBERT shows notable strength in multi-hop reasoning, answering questions that require integrating information from multiple documents. In one question-answering example from the MRQA benchmark, LinkBERT correctly identified a location by combining information from two linked documents, whereas BERT defaulted to information from a single document. LinkBERT is also robust to irrelevant documents in open-domain question answering, maintaining QA accuracy where BERT's performance dropped, an effect attributed to the DRP task helping the model recognize document relevance. Finally, the model excels in few-shot and data-efficient learning scenarios, suggesting that it internalized more knowledge during pretraining and therefore needs less labeled data when fine-tuned on specific tasks (http://ai.stanford.edu/blog/linkbert/).
LinkBERT is designed as a drop-in replacement for BERT. Pretrained LinkBERT models are available on HuggingFace and can be loaded with the standard Transformers library, as sketched below. Fine-tuning scripts for downstream applications such as question answering and text classification are also provided, so users can swap the BERT model path for a LinkBERT one in their existing workflows. The research emphasizes that LinkBERT is particularly beneficial for knowledge- or reasoning-intensive applications (http://ai.stanford.edu/blog/linkbert/).
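For example, a LinkBERT checkpoint can be loaded with the HuggingFace Transformers `AutoTokenizer` and `AutoModel` classes just like any BERT model; the snippet below follows the usage shown in the blog post, using the released LinkBERT-large checkpoint (the BioLinkBERT variants are loaded the same way).

```python
from transformers import AutoTokenizer, AutoModel

# Load a pretrained LinkBERT checkpoint from the HuggingFace Hub.
tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/LinkBERT-large")
model = AutoModel.from_pretrained("michiyasunaga/LinkBERT-large")

# Use it exactly as you would use BERT.
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
```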
In summary, LinkBERT is a pretraining method for language models that integrates document link information. By treating a corpus as a graph of documents and creating link-aware training instances, LinkBERT strengthens the world knowledge the LM acquires during pretraining. The joint use of the MLM and DRP tasks allows LinkBERT to capture multi-hop knowledge and document relations more effectively than models trained on isolated documents. LinkBERT improves performance across general and domain-specific NLP tasks, with particular strengths in multi-hop reasoning, robustness to irrelevant information, and few-shot learning. The availability of pretrained models and code facilitates adoption for various applications, especially those requiring advanced knowledge and reasoning capabilities (http://ai.stanford.edu/blog/linkbert/).
Readers interested in advancing language model capabilities, particularly for knowledge-intensive and reasoning-heavy applications, should consider exploring LinkBERT. Experimenting with it as a replacement for BERT in existing NLP pipelines, or using the provided fine-tuning scripts, can offer insight into its practical benefits. Extending LinkBERT's principles to other architectures, such as sequence-to-sequence models for generation, or to other data modalities such as source code, could open new avenues for research and application development (http://ai.stanford.edu/blog/linkbert/).