DynamicCache for models is not supported #42
Just re-did the patches to double-check whether I missed anything or not, and apparently I did miss one thing in the workarounds: you also have to add this in `to_pt_tensors`:

```python
def to_pt_tensors(tensors: Union[Tuple[Union[torch.Tensor, Tensor, tf.Tensor], ...], Dict[str, Union[torch.Tensor, Tensor, tf.Tensor]]], convert_format: bool = False) -> Tuple[torch.Tensor, ...]:
    """
    Take a tuple of either pytorch or buda tensors, and return pytorch tensors. Generate zero-tensors
    if no value exists.
    """
    pytorch_tensors = []
    if not isinstance(tensors, (list, tuple)):
        tensors = (tensors, )

    for t in tensors:
        if isinstance(t, torch.Tensor):
            assert not convert_format, "Can't convert format of raw pytorch tensor - don't know what the target format is"
            pytorch_tensors.append(t)
        elif isinstance(t, (tf.Tensor, tf.Variable)):
            pt = torch.Tensor(t.numpy() if t.dtype != tf.bfloat16 else tf.cast(t, tf.float32).numpy()).type(map_tf_dtype_to_pt(t.dtype))
            pt.requires_grad = t.trainable if isinstance(t, tf.Variable) else torch.is_complex(pt) or torch.is_floating_point(pt)
            pytorch_tensors.append(pt)
        elif isinstance(t, Tensor):
            if convert_format:
                t = t.to_format(t.data_format)
            if t.has_value():
                pytorch_tensors.append(t.value())
            else:
                pytorch_tensors.append(t.create_pt_zeros())
        elif t is None:
            pytorch_tensors.append(None)
        elif isinstance(t, (list, tuple)):
            pytorch_tensors.append(to_pt_tensors(t))
        elif isinstance(t, dict):
            pt_tensor_list = to_pt_tensors(list(t.values()))
            pt_dict = {k: v for (k, _), v in zip(t.items(), pt_tensor_list)}
            pytorch_tensors.append(pt_dict)
        elif isinstance(t, np.ndarray):
            pytorch_tensors.append(torch.Tensor(t))
        elif isinstance(t, mxnet.ndarray.ndarray.NDArray):
            pytorch_tensors.append(torch.Tensor(t.asnumpy()))
        elif isinstance(t, transformers.cache_utils.DynamicCache):
            ### CHANGE ###
            pytorch_tensors.append(torch.Tensor(t))
        elif isinstance(t, jaxlib.xla_extension.DeviceArray):
            pytorch_tensors.append(torch.Tensor(np.array(t)))
        else:
            raise RuntimeError(f"Unknown type of tensor: {type(t)}")

    ret = tuple(pytorch_tensors) if isinstance(tensors, (tuple, list)) else (pytorch_tensors,)
    return ret
```
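Side note: instead of collapsing the cache with torch.Tensor(t), a possibly cleaner variant of that branch would expand it into the legacy per-layer tuples and let the existing list/tuple handling do the conversion. This is just a sketch I haven't run, and it assumes your transformers version has DynamicCache.to_legacy_cache():

```python
        elif isinstance(t, transformers.cache_utils.DynamicCache):
            ### CHANGE (alternative sketch) ###
            # Expand the cache into the legacy tuple of (key, value) pairs per layer,
            # then recurse so each layer's tensors go through the tuple branch above.
            pytorch_tensors.append(to_pt_tensors(t.to_legacy_cache()))
```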
More details on the workarounds:

Python errors at this line: `elif isinstance(t, jaxlib.xla_extension.DeviceArray):`. Not sure what's wrong with my environment (I'm using the pybuda docker image), but here is where the change sits:

```python
        elif isinstance(t, mxnet.ndarray.ndarray.NDArray):
            pytorch_tensors.append(torch.Tensor(t.asnumpy()))
        elif isinstance(t, transformers.cache_utils.DynamicCache):
            ### CHANGE ###
            pytorch_tensors.append(torch.Tensor(t))
        elif isinstance(t, jaxlib.xla_extension.DeviceArray):
            pytorch_tensors.append(torch.Tensor(np.array(t)))
```

At that point, the inputs look like:

```python
[
    torch.Tensor([[...]]),   # Input IDs
    torch.Tensor([[...]]),   # Attention Mask
    DynamicCache()           # past_key_values
]
```

So then I just simply pass the DynamicCache value through.
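As a quick aside on why the empty-tensor guards further down are needed: a fresh DynamicCache has no layers, so collapsing it with torch.Tensor() should just give an empty tensor. A standalone sketch, assuming a recent transformers version:

```python
import torch
from transformers.cache_utils import DynamicCache

cache = DynamicCache()     # fresh cache: no layers recorded yet, so len(cache) == 0
t = torch.Tensor(cache)    # what the to_pt_tensors change above produces for past_key_values
print(t.shape, t.numel())  # an empty tensor, since there is nothing in the cache
```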
When checking the microbatching size:

```python
for input in first_inputs:
    ### CHANGE ###
    if isinstance(input, transformers.cache_utils.DynamicCache):
        continue
    mb_size = get_microbatch_size(input)
    ### CHANGE ###
    if mb_size == 0:
        continue  # skip
    elif (mb_size != microbatch_size) and (mb_size != 1):
        raise RuntimeError("Microbatch size doesn't match for all inputs")
```

Then, in the function that removes the microbatch dimension:
```python
elif isinstance(input, transformers.cache_utils.DynamicCache):
    ### CHANGE ###
    out.append(input)
```

I just simply pass it along as normal, nothing much to it really. And for the torch.Tensor branch:
```python
if isinstance(input, torch.Tensor):
    ### CHANGE ###
    if input.numel() == 0:
        out.append(Tensor.create_from_torch(input.clone()))
    else:
        out.append(Tensor.create_from_torch(torch.narrow(input.clone(), 0, 0, 1)))
```

After some pybuda operations, after (4) of processing and translating framework modules and parameters, the remove microbatching function is called again.
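For context on why that numel() check matters: torch.narrow cannot take a one-row slice out of a tensor whose first dimension has size 0, so the original line raises on the empty cache tensor. A plain-PyTorch sketch:

```python
import torch

full = torch.zeros(4, 8)
print(torch.narrow(full.clone(), 0, 0, 1).shape)  # torch.Size([1, 8]): keeps a single batch row

empty = torch.Tensor([])                          # roughly what the collapsed cache input looks like
try:
    torch.narrow(empty.clone(), 0, 0, 1)          # start (0) + length (1) exceeds dim 0 size (0)
except RuntimeError as err:
    print("narrow on an empty tensor fails:", err)
```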
In the CPUDevice input_dtypes check:

```python
if self.input_dtypes:
    ### CHANGE ###
    if len(self.input_dtypes) != len(torch_inputs):
        torch_inputs = torch_inputs + (torch.Tensor([]),)
    assert len(self.input_dtypes) == len(torch_inputs), f"CPUDevice input_dtypes specified, but differs in size from number of actual inputs. Types specified: {len(self.input_dtypes)}, num inputs: {len(torch_inputs)}"
    torch_inputs = tuple(t.type(typ) for t, typ in zip(torch_inputs, self.input_dtypes))
torch_inputs = detach_tensors(torch_inputs)
```

This is when the model actually starts running. For Qwen, the sequence goes like this when going through forward(). Logs:
At that point the dtypes are [torch.int32, torch.float32, torch.float32], which is intended for input_ids, the attention mask, and caching, but torch_inputs only contains:

```python
(
    torch.Tensor([[...]]),
    torch.Tensor([[...]])
)
```

So the tensor that was supposed to be for caching disappeared weirdly, so what I did was just add an empty tensor into the existing tuple to fill in the gap so the line below doesn't error:

```python
assert len(self.input_dtypes) == len(torch_inputs), f"..."
```

Hopefully that's detailed enough to help you guys diagnose the problem @milank94. Sorry that I forgot to mention the other line that I added 😅
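PS: a standalone sketch, outside of PyBuda, of why the empty-tensor padding gets past the assert and the dtype cast below it; the shapes here are made up:

```python
import torch

input_dtypes = [torch.int32, torch.float32, torch.float32]       # input_ids, attention mask, cache
torch_inputs = (torch.randint(0, 10, (1, 8)), torch.ones(1, 8))  # the cache tensor has gone missing

# The workaround: pad with an empty tensor so the lengths line up again.
if len(input_dtypes) != len(torch_inputs):
    torch_inputs = torch_inputs + (torch.Tensor([]),)

assert len(input_dtypes) == len(torch_inputs)
torch_inputs = tuple(t.type(typ) for t, typ in zip(torch_inputs, input_dtypes))
print([t.dtype for t in torch_inputs])  # the empty filler casts to float32 without complaint
```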
@LPanosTT That issue with triu is outdated now; the only workarounds that I needed to implement are the ones that I mentioned here. Perhaps it's your environment? I'm using pybuda's docker image for this: Then updated transformers in that environment to
As for the poor outputs, what are you getting? This is what I get with Qwen-1.5-Chat, a very hallucinated one lol:
So the PyBuda API does not support DynamicCache. If you wish to use a past cache, you'll have to use the legacy cache (a list of tuples). I'll be adding an assertion that explicitly states that DynamicCache is not a supported input, to avoid confusion about this in the future.
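For anyone hitting this, a short sketch of moving between the two formats with the transformers cache utilities; it assumes a transformers version that ships cache_utils, and the shapes are arbitrary:

```python
import torch
from transformers.cache_utils import DynamicCache

# Legacy format: a tuple with one (key, value) pair of tensors per layer,
# each shaped [batch, num_heads, seq_len, head_dim].
legacy = tuple((torch.zeros(1, 2, 4, 8), torch.zeros(1, 2, 4, 8)) for _ in range(2))

cache = DynamicCache.from_legacy_cache(legacy)  # wrap the tuples into a DynamicCache
roundtrip = cache.to_legacy_cache()             # back to the list-of-tuples format PyBuda can take

assert len(roundtrip) == 2 and isinstance(roundtrip[0], tuple)
```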
Got it, thanks!
Per #42 (comment) this is not needed.
DynamicCache is automatically implemented for newer models:

```python
class Qwen2PreTrainedModel(PreTrainedModel):
    config_class = Qwen2Config
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["Qwen2DecoderLayer"]
    _skip_keys_device_placement = "past_key_values"
    _supports_flash_attn_2 = True
    _supports_sdpa = True
    _supports_cache_class = True  # <---- This is False for older models like GPT-2
```

So until Pybuda supports DynamicCache, just disable it before inferencing so you don't have to create a custom wrapper. For example:

```python
model = Qwen2ForCausalLM.from_pretrained("Qwen/Qwen1.5-0.5B-Chat", config=config)
model._supports_cache_class = False
```
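A hedged end-to-end sketch of that suggestion (the model name is the one from the snippet above; whether the flag is honoured this way depends on your transformers version):

```python
from transformers import AutoTokenizer, Qwen2ForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
model = Qwen2ForCausalLM.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")

# Per the comment above: mark the model as not supporting the Cache classes so
# past_key_values stays in the legacy tuple-of-(key, value) format.
model._supports_cache_class = False

inputs = tokenizer("Hello, who are you?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```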
I'm not familiar with DynamicCache in huggingface transformers, but I can tell that it's not being passed properly during the microbatching checks.
Here's my workaround that enabled Phi-2 and Qwen-1.5 0.5B to work:
JushBJJ@f765838
Bounty PRs:
tenstorrent/tt-buda-demos#37
tenstorrent/tt-buda-demos#117
Steps to reproduce