Abstract:

In this article we show how the problem of neural text generation can be constructively reformulated in terms of transitions between the states of a finite-state machine. This framework leads to an efficient approach to guiding text generation with regular expressions and context-free grammars by allowing the construction of an index over a language model’s vocabulary. The approach is model agnostic, allows
one to enforce domain-specific knowledge and constraints, and enables the construction of reliable interfaces by guaranteeing the structure of the generated text. It adds little overhead to the token sequence generation process and significantly outperforms existing solutions. An implementation is provided in the open source Python library Outlines [Louf and Willard].

Introduction:

We are concerned with the problem of generating sequences of tokens from a large language model (LLM) [Vaswani et al., 2017, Radford et al., 2019] that conform to regular expressions or context-free grammars (CFGs). This kind of guided LLM generation is used to make LLM output usable under rigid formatting requirements that are either hard or costly to capture through fine-tuning alone [Beurer-Kellner et al., 2023, Scholak et al., 2021, Poesia et al., 2022a, Rabinovich et al., 2017, Weng, 2021, Dong et al., 2023, Poesia et al., 2022b, Geng et al., 2023, Wang et al., 2023]. Such features have recently been generalized in prompting libraries and interfaces [Microsoft, 2023, Beurer-Kellner et al., 2023, Rickard, 2023a,b], but their applicability can be limited by their scaling costs. Most implementations of guided generation bias the score values used to determine the probabilities of the tokens in an LLM's vocabulary. A common and sufficient approach involves repeatedly evaluating the entire vocabulary to determine which tokens are valid, given the constraints and the previously sampled tokens, and setting the probabilities of invalid tokens to zero. This approach entails a fixed O(N) cost for each token generated, where N is the size of the LLM's vocabulary.
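
As a deliberately naive illustration of this vocabulary-scan strategy, the following Python sketch zeroes out invalid tokens before each draw; the function name, the array layout, and the is_valid_prefix predicate are assumptions made for the example, not the API of any particular library.

import numpy as np

def mask_probs_by_scan(probs, vocab_strings, text_so_far, is_valid_prefix):
    """Naive guided generation: check every vocabulary string against the
    constraint at each step and zero out invalid continuations (O(N) per token)."""
    masked = np.asarray(probs, dtype=float).copy()
    for token_id, token_str in enumerate(vocab_strings):
        # is_valid_prefix is an assumed, user-supplied predicate, e.g. a
        # partial regular-expression match on the candidate continuation.
        if not is_valid_prefix(text_so_far + token_str):
            masked[token_id] = 0.0
    return masked  # the sampler can renormalize over the surviving tokens
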
We propose an approach that uses the finite-state machine (FSM) formulation of regular expressions to both arbitrarily start and stop guided generation and to allow the construction of an index with which the set of nonzero-probability tokens can be obtained efficiently at each step. The result is an algorithm that costs O(1) on average. For the regular expression case, our approach is most similar to Kuchnik et al. [2023], which uses a transducer formulation to obtain FSMs defined over a language model's vocabulary; those FSMs contain much of the same information and scaling benefits as the indices described here. Our approach does not require the complete transducer abstraction and can be used to more easily extend existing, efficient regular expression libraries without modifying the underlying automata and their implementations. More importantly, our indexing approach can also be extended to CFGs and LALR(1) parsers to allow for efficient guided generation according to popular data formats and programming languages (e.g., JSON, Python, SQL). The transition to parsing is made by way of augmentations to traditional LALR(1) parser components and operations, which again makes this an approach that can be used to extend existing parser implementations.
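
To make the contrast with the per-step vocabulary scan concrete, here is a minimal sketch of how such an index might be consulted during generation; the state_to_token_ids map and the variable names are illustrative assumptions, and the construction of the index itself is spelled out in Algorithms 3 and 4 below.

import numpy as np

def mask_probs_by_index(probs, fsm_state, state_to_token_ids):
    """Indexed guided generation: the set of tokens allowed in each FSM state
    was precomputed once per regular expression, so each step reduces to a
    dictionary lookup instead of a scan over the whole vocabulary."""
    probs = np.asarray(probs, dtype=float)
    masked = np.zeros_like(probs)
    allowed = list(state_to_token_ids[fsm_state])  # token ids valid from this state
    masked[allowed] = probs[allowed]
    return masked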

Algorithm 1: Basic LLM token sampling
1: function sample_tokens(L)
2:   s ← ()
3:   for i ← 1, L do
4:     α ← LLM(s, θ)
5:     Sample s̃ ∼ Categorical(α)
6:     if s̃ = EOS then
7:       break
8:     end if
9:     s ← append(s, s̃)
10:  end for
11:  return s
12: end function
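
For readers who prefer executable code, here is a minimal Python rendering of Algorithm 1; lm_probs (a callable returning the next-token distribution for a partial sequence) and the EOS token id are assumptions of the sketch, not interfaces defined in the paper.

import numpy as np

def sample_tokens(lm_probs, max_len, eos_id, rng=None):
    """Python sketch of Algorithm 1: unconstrained token-by-token sampling."""
    rng = rng or np.random.default_rng()
    s = []                                             # s ← ()
    for _ in range(max_len):                           # for i ← 1, L
        alpha = np.asarray(lm_probs(s), dtype=float)   # α ← LLM(s, θ)
        s_next = int(rng.choice(len(alpha), p=alpha))  # Sample s̃ ∼ Categorical(α)
        if s_next == eos_id:                           # stop at end-of-sequence
            break
        s.append(s_next)                               # s ← append(s, s̃)
    return s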

Algorithm 2: LLM token sampling with masking
1: function sample_tokens(L)
2:   s ← ()
3:   for i ← 1, L do
4:     α ← LLM(s, θ)
5:     Construct the mask m(s)
6:     α̃ ← m ⊙ α
7:     Sample s̃ ∼ Categorical(α̃)
8:     if s̃ = EOS then
9:       break
10:    end if
11:    s ← append(s, s̃)
12:  end for
13:  return s
14: end function
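
The same loop with masking (Algorithm 2) differs only in that the distribution is multiplied elementwise by a mask before sampling. In the sketch below, build_mask is an assumed callable returning a 0/1 vector marking the currently valid tokens; producing that vector efficiently is the subject of the indexing approach described in this paper.

import numpy as np

def sample_tokens_masked(lm_probs, build_mask, max_len, eos_id, rng=None):
    """Python sketch of Algorithm 2: mask invalid tokens before every draw."""
    rng = rng or np.random.default_rng()
    s = []                                                      # s ← ()
    for _ in range(max_len):                                    # for i ← 1, L
        alpha = np.asarray(lm_probs(s), dtype=float)            # α ← LLM(s, θ)
        m = np.asarray(build_mask(s), dtype=float)              # construct the mask m(s)
        alpha_tilde = m * alpha                                 # α̃ ← m ⊙ α
        alpha_tilde /= alpha_tilde.sum()                        # renormalize (assumes ≥ 1 valid token)
        s_tilde = int(rng.choice(len(alpha_tilde), p=alpha_tilde))  # Sample s̃ ∼ Categorical(α̃)
        if s_tilde == eos_id:
            break
        s.append(s_tilde)                                       # s ← append(s, s̃)
    return s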

Algorithm 3: Find sub-sequences of the FSM M that accept the string v
1: function find_sub_sequences(M, v)
2:   M = (Q, Σ, δ, q₀, F)
3:   res ← ()
4:   for r ∈ δ⁻¹(·, v₀) do          ▷ Loop through states that read v₀
5:     p ← (r)
6:     for i ← 1, |v| − 1 do        ▷ Walk the FSM
7:       if δ(r, vᵢ) = ∅ then       ▷ The FSM does not read vᵢ
8:         p ← ()
9:         break                    ▷ Stop walking and try the next start state
10:      end if
11:      r ← δ(r, vᵢ)
12:      p ← append(p, r)
13:    end for
14:    res ← append(res, p)
15:  end for
16:  return res
17: end function
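
The following sketch realizes the intent of Algorithm 3 over a small dictionary-encoded FSM. The encoding (a (state, symbol) → next-state dict for δ) is an assumption of the example, and walks that fail are simply skipped rather than recorded as empty paths.

def find_sub_sequences(fsm, v):
    """Python sketch of Algorithm 3: collect the FSM state paths that read the
    whole string v, one path per start state from which v can be consumed.
    fsm is assumed to be {"states": iterable, "delta": {(state, symbol): next_state}}."""
    delta = fsm["delta"]
    res = []
    # δ⁻¹(·, v₀): states with an outgoing transition labelled by the first symbol of v
    for start in (q for q in fsm["states"] if (q, v[0]) in delta):
        path, state = [start], start
        for symbol in v:                 # walk the FSM symbol by symbol
            nxt = delta.get((state, symbol))
            if nxt is None:              # the FSM does not read this symbol
                path = []
                break                    # stop walking and try the next start state
            state = nxt
            path.append(state)
        if path:
            res.append(tuple(path))
    return res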

Algorithm 4: Construct a map from FSM states to subsets of V
1: function map_states_to_vocab(M, V)
2:   M = (Q, Σ, δ, q₀, F)
3:   Initialize the map σ with empty sets for each element in Q
4:   for v ∈ V do                   ▷ Loop through the vocabulary
5:     Z ← find_sub_sequences(M, v)
6:     for z ∈ Z do                 ▷ Loop through state sequences accepting v
7:       σ(z₀) ← σ(z₀) ∪ v
8:     end for
9:   end for
10:  return σ
11: end function
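
Finally, a sketch of Algorithm 4 built on the find_sub_sequences sketch above; the container choices are illustrative rather than the reference implementation in Outlines.

from collections import defaultdict

def map_states_to_vocab(fsm, vocabulary):
    """Python sketch of Algorithm 4: for each FSM state, collect the subset of
    vocabulary strings that the FSM can read starting from that state."""
    sigma = defaultdict(set)                     # σ : Q → P(V)
    for v in vocabulary:                         # loop through the vocabulary
        for path in find_sub_sequences(fsm, v):  # state sequences that accept v
            sigma[path[0]].add(v)                # σ(z₀) ← σ(z₀) ∪ {v}
    return sigma

# Toy usage: an FSM for the regular expression "ab" and a four-string vocabulary.
fsm = {"states": [0, 1, 2], "delta": {(0, "a"): 1, (1, "b"): 2}}
print(dict(map_states_to_vocab(fsm, ["a", "ab", "b", "ba"])))
# {0: {'a', 'ab'}, 1: {'b'}}  (set ordering may vary)

Because σ depends only on the regular expression and the vocabulary, it can be built once ahead of time; during generation, the mask used in Algorithm 2 is then obtained from a single lookup of the current FSM state in σ.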
