Granite Foundation Models
The paper provides details about the data used in the pre-training phase.
DATA SOURCES
- 6.48 TB of data before pre-processing.
- 2.07 TB of data after pre-processing.
- Unstructured English text and code.
- All non-text artifacts (e.g., images, HTML tags) were removed.
- granite.13b was trained on 1T tokens drawn from 14 datasets.
Data sets:
- arXiv: Over 1.8 million scientific paper pre-prints posted to arXiv.
- Common Crawl: Open repository of web crawl data
- DeepMind Mathematics: Mathematical question-and-answer pairs.
- Free Law: Public-domain legal opinions from US federal and state courts.
- GitHub Clean: Code data from CodeParrot covering a variety of coding languages.
- Hacker News: News on computer science and entrepreneurship, collected between 2007 and 2018.
- OpenWeb Text: Open-source version of OpenAI’s Web Text corpus containing web pages through 2019.
- Project Gutenberg (PG-19): A repository of free e-books with focus on older works for which U.S. copyright has expired.
- Pubmed Central: Biomedical and life sciences papers.
- SEC Filings: 10-K/Q filings from the US Securities and Exchange Commission (SEC) for the years 1934-2022.
- Stack Exchange: Anonymized set of all user-contributed content on the Stack Exchange network, a popular collection of websites centered around user-contributed questions and answers. (See [2] as a possible source.)
- USPTO: US patents granted from 1975 to May 2023, excluding design patents.
- Webhose: Unstructured web content converted into machine-readable data feeds acquired by IBM.
- Wikimedia: Eight English Wikimedia projects (enwiki, enwikibooks, enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, enwiktionary), containing extracted plain text from pages and articles.
DATA GOVERNANCE
Data Clearance and Acquisition
A dataset is accepted only after completing a formal clearance process covering data license, usage restrictions, and sensitivity. The paper notes that the Books3 dataset is excluded due to copyright concerns.
Pre-Processing Pipeline
- Text Extraction: Extract text from a document into a standard format.
- Data De-Duplication: Remove duplicate data on a per-dataset basis. Exact de-duplication is hash-based; fuzzy de-duplication estimates the Jaccard similarity between documents using locality-sensitive hashing (a minimal sketch follows this list).
- Language Identification: Detect dominant language in a document.
- Sentence Splitting: Split documents into sentences. This step is needed for the HAP (Hate, Abuse, Profanity) annotation step.
- Hate, Abuse and Profanity Annotation: A HAP detector is used to assign a score to each sentence.
- Document Quality: A classifier (a KenLM linear classifier pre-trained on Wikipedia documents) and a heuristic (the Gopher quality filtering criteria) are used to generate a quality annotation.
- URL Block-Listing: The block list includes URLs of known copyrighted material as well as block-listed sites, such as those contained in the 2022 Review of Notorious Markets for Counterfeiting and Piracy.
- Filtering: Previously generated annotations are used to determine whether to keep or filter the document.
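The paper does not include its de-duplication code; the snippet below is a minimal, self-contained sketch of the fuzzy step (MinHash signatures plus locality-sensitive-hashing banding to approximate Jaccard similarity). The shingle size, permutation count, band count, and function names are illustrative assumptions, not the paper's implementation.

```python
import hashlib
import re
from collections import defaultdict

def shingles(text, n=5):
    """Represent a document as the set of its word n-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(features, num_perm=64):
    """Approximate the shingle set with num_perm minimum hash values."""
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{f}".encode()).digest()[:8], "big")
            for f in features)
        for seed in range(num_perm)
    ]

def lsh_candidate_groups(signatures, bands=16):
    """Group documents whose signatures agree on at least one band of rows."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river",
    "c": "an entirely different document about pre-training language models",
}
sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
print(lsh_candidate_groups(sigs))  # "a" and "b" should land in a shared bucket
```

Documents that land in the same bucket are only candidate near-duplicates; an exact Jaccard check on those few pairs is what keeps the approach scalable.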
Tokenization
The GPT-NeoX 20B tokenizer is used.
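For reference, the same tokenizer can be loaded from the public GPT-NeoX-20B checkpoint (an assumption for illustration; the paper does not say how its copy of the tokenizer is packaged):

```python
from transformers import AutoTokenizer

# Public Hugging Face checkpoint assumed for illustration.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
ids = tokenizer("Granite is trained on enterprise-relevant data.")["input_ids"]
print(len(ids), tokenizer.convert_ids_to_tokens(ids))
```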
Training
Algorithmic Details
Pre-Training
granite.13b base model:
- Gaussian error linear unit (GELU) activation function
- Multi-Query Attention for inference efficiency
- learned absolute positional embeddings
- FlashAttention to speed up training and reduce its memory footprint
- 300K iterations, with a batch size of 4M tokens, for a total of 1 trillion tokens
- Adam optimizer [24], with β₁ = 0.9, β₂ = 0.95, ϵ = 10⁻⁸, and a weight decay of 0.1
- cosine learning rate schedule, with a warmup of 2000 steps, decaying the learning rate from 3 × 10⁻⁴ down to 3 × 10⁻⁵ (see the sketch after this list)
- 3D-parallel layout using both tensor and pipeline parallelism including sequence parallelism to enable training with 8K context length
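As a rough illustration of the stated optimizer and schedule, here is a plain-PyTorch sketch. The real stack uses 3D parallelism and FlashAttention, which are not shown; `model` is a stand-in, and AdamW is assumed as the decoupled-weight-decay variant of the cited Adam setup.

```python
import math
import torch

# Stand-in module; the real 13B-parameter network and its 3D-parallel layout
# are not reproduced here.
model = torch.nn.Linear(512, 512)

# Decoupled weight decay (AdamW) assumed for the cited Adam-with-weight-decay setup.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
)

warmup_steps, total_steps = 2_000, 300_000   # 300K iterations x ~4M tokens ≈ 1T tokens
min_ratio = 3e-5 / 3e-4                      # decay from 3e-4 down to 3e-5

def lr_lambda(step):
    if step < warmup_steps:                  # linear warmup over the first 2000 steps
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```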
Supervised Fine-Tuning
granite.13b.instruct:
- cosine learning rate schedule with an initial learning rate of 2 × 10⁻⁵, a weight decay of 0.1
- a batch size of 128, and a sequence length of 8192 tokens.
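A hypothetical way to express these hyperparameters with Hugging Face TrainingArguments (the paper does not describe its fine-tuning framework; the effective batch size of 128 is reached here via gradient accumulation, and the output path is a placeholder):

```python
from transformers import TrainingArguments

sft_args = TrainingArguments(
    output_dir="granite-13b-instruct-sft",   # placeholder path
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    weight_decay=0.1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=128,         # effective batch size of 128 sequences
    bf16=True,
)
# Sequences would be packed or truncated to 8192 tokens during tokenization.
```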
SFT data includes:
- a subset of the Flan Collection
- 15K samples from Dolly
- Anthropic’s human preference data about helpfulness and harmlessness
- Instructv3
- internal synthetic datasets specifically designed for summarization and dialogue tasks
Contrastive Fine-Tuning
granite.13b.chat:
The paper proposes a contrastive alignment strategy called contrastive fine-tuning (CFT). To obtain negative examples, a separate LLM is used, based on an early version of granite.13b.instruct that has been fine-tuned on misaligned human datasets (a rough sketch of a contrastive objective follows the data list below).
CFT dataset is based on:
- samples from Anthropic’s human preference data about helpfulness and harmlessness that have been filtered using the OpenAssistant reward model,
- samples from Dolly [2], and
- samples from ProsocialDialog
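The exact CFT objective is not reproduced in these notes. Purely as an illustration of contrasting aligned and misaligned responses for the same prompt, the sketch below pairs standard cross-entropy on the positive response with an unlikelihood penalty on the negative response; the loss form, the alpha weight, and the Hugging Face-style model(...).logits interface are all assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_ft_loss(model, pos_ids, neg_ids, alpha=1.0):
    """Illustrative loss: reward the aligned response, penalize the misaligned one."""
    # Standard next-token cross-entropy on the aligned (positive) response.
    pos_logits = model(pos_ids[:, :-1]).logits
    pos_loss = F.cross_entropy(
        pos_logits.reshape(-1, pos_logits.size(-1)), pos_ids[:, 1:].reshape(-1)
    )
    # Unlikelihood term: push down the probability of the misaligned response.
    neg_logits = model(neg_ids[:, :-1]).logits
    neg_logp = F.log_softmax(neg_logits, dim=-1)
    neg_token_p = neg_logp.gather(-1, neg_ids[:, 1:].unsqueeze(-1)).squeeze(-1).exp()
    neg_loss = -torch.log1p(-neg_token_p.clamp(max=1 - 1e-6)).mean()
    return pos_loss + alpha * neg_loss
```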
Compute
Granite.13b used 256 A100 GPUs for 1056 hours at 120 TFLOPS.
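A back-of-the-envelope consistency check, assuming the 120 TFLOPS figure is sustained per-GPU throughput:

```python
# Rough check only: the 120 TFLOPS figure could also be a model-FLOPs or
# aggregate number, so treat the comparison loosely.
gpu_seconds = 256 * 1056 * 3600           # 256 GPUs x 1056 hours
delivered_flops = gpu_seconds * 120e12    # ≈ 1.17e23 FLOPs of delivered compute
estimate_6nd = 6 * 13e9 * 1e12            # 6*N*D: 13B parameters, 1T tokens ≈ 7.8e22
print(f"delivered ≈ {delivered_flops:.2e}, 6ND estimate ≈ {estimate_6nd:.2e}")
```

The two figures agree to within a factor of about 1.5, which is plausible if the quoted throughput reflects hardware rather than model FLOPs.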
TESTING AND EVALUATION
General Knowledge Benchmarks During Training
A subset of EleutherAI’s Language Model Evaluation Harness (lm-eval) is used as a set of light-weight tests, run after every 100 billion tokens during training, to validate that model knowledge is advancing as training progresses.
The following 12 datasets from lm-eval are used in both zero- and few-shot settings (an example invocation follows the list):
- question answering for several domains (boolq, openbookqa, piqa, sciq);
- sentence completion (lambada);
- commonsense reasoning (arc easy, arc challenge, copa, hellaswag, winogrande);
- reading comprehension (race);
- multidisciplinary multiple-choice collection (mmlu).
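For example, a zero-shot pass over these tasks could be run with the lm-evaluation-harness Python API (a v0.4-style call is shown; task names vary across harness versions, and the checkpoint path is a placeholder since granite.13b is not a public model):

```python
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/granite-checkpoint",   # placeholder path
    tasks=["boolq", "openbookqa", "piqa", "sciq", "lambada_openai",
           "arc_easy", "arc_challenge", "copa", "hellaswag",
           "winogrande", "race", "mmlu"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```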
HELM
After pre-training is complete, a more comprehensive assessment relies on Stanford’s Holistic Evaluation of Language Models (HELM) benchmark.
Enterprise Evaluation Benchmarks
After training completes, the models are further evaluated on IBM-curated enterprise benchmarks to test their performance in domains relevant to potential customers. IBM curated 11 publicly available finance benchmarks for evaluating models in the financial domain.