# IMDB Movie Review Classification
# Structure:
# 1. Attention is all we need.
# 2. Why is eliminating overfitting so important?
# 3. Summarise the signs of overfitting.
# 4. Import required libraries and modules
# 5. Review content
# 6. Beat the baseline: 50% accuracy!
# 6-1. Preparing the data
# 6-2. Sanity Check
# 7. Models for IMDB movie review classification
# 7-1. Processing words as a sequence: Transformer encoder based model
# 7-2. Implementing positional embedding in the Transformer encoder based model
# 7-3. Bidirectional LSTM
# 7-4. GloVe
# 7-5. Bag-of-words: N-gram (N=2)
# 7-6. Bigram
# 7-7. TF-IDF bigram model
# 8. Created a new model for inference (inference_model)
# 9. Attention is all we need?
# 10. References
# 1. Attention is all we need.
#### In this coursework 2, we applied seven different methods to the IMDB sentiment classification task. Since the publication of 'Attention Is All You Need', the Transformer has been a sensation in the deep learning field, so it is the architecture we dive into most deeply.
# 2. Why is eliminating overfitting so important?
#### It turns out the biggest challenge for DLWP warriors is not the freezing winter morning, but overfitting. Since we have been fighting overfitting for the entire term, it is necessary to summarise some important concepts about it across coursework 1 and 2.
#### Alleviating overfitting is an important concept and step in deep learning, for the following reasons:
1. Improving model generalisation. Overfitting means the model performs well only on the training data rather than on unseen data, so by alleviating overfitting the model gains better predictive ability in real-world applications, which is usually our final purpose as well.
2. Elevating model interpretability. Unfortunately, real-world data is full of noise. An overfitted model can learn the noise in a dataset instead of the useful patterns. Hence, alleviating overfitting is not only necessary to reduce wasted resources, but also to train a useful model.
# 3. Summarise the signs of overfitting.
### By observing the relationship between training loss and validation loss, and between training accuracy and validation accuracy, in coursework 1 and coursework 2, we can summarise the signs of overfitting:
#### 1. Training loss and validation loss.
If the training loss keeps decreasing but the validation loss starts to increase or stops decreasing, it is typical overfitting: the model performs well on the training dataset but poorly on the unseen (validation) dataset.
#### 2. Training accuracy and validation accuracy.
If training accuracy keeps increasing but validation accuracy stops increasing or starts decreasing, it is also a sign of overfitting, because the model has possibly just memorised the patterns of the training dataset rather than learning to generalise to new data.
#### 3. Instability of model performance is a sign of overfitting as well. If the model's performance on the validation dataset is very volatile, it usually means the model is too sensitive to the training dataset rather than generalising effectively.
#### 4. An unusual decrease in validation performance after an initial increase is also an indication of overfitting: the convergence speed is too fast, so the model hasn't learned enough features to generalise.
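##### As a quick illustration of signs 1 and 2, here is a minimal sketch (assuming a Keras `History` object, like the ones produced by `model.fit` later in this notebook) that reports where validation loss stopped improving:
import numpy as np
def val_loss_turning_point(history):
    val_loss = history.history["val_loss"]
    best = int(np.argmin(val_loss))  # epoch with the lowest validation loss (0-indexed)
    # everything after `best` is the region where validation loss no longer improves,
    # which, combined with a still-decreasing training loss, signals overfitting
    return best + 1  # 1-indexed epoch, consistent with Keras logs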
# 4. Import required libraries and modules
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
# codes from ChatGPT 4, 23/12~31/12, 2023
import os, pathlib, shutil, random
import pandas as pd
import numpy as np
import keras
import sys
import matplotlib.pyplot as plt
import tensorflow as tf
import seaborn as sns
from scipy.stats import norm
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
if 'google.colab' in sys.modules:
from google.colab import drive
drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/Colab Notebooks') # 'My Drive' is the default name of Google Drives
os.listdir()
#### Download the data and delete the train/unsup category. These reviews have no obvious negative or positive labels, so they will not be used in our task; removing them also speeds up processing by reducing the amount of data.
##### (Roughly 7 minutes faster than processing on a T4 GPU.)
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup
# codes from Ch.11, DLWP
!ls aclImdb
# 5. Review content
#### According to ***DLWP 11.3.1***, we should "...always inspect what our data looks like before we dive into modeling it. It will ground our intuition about what our model is actually doing.":
# codes from Ch.11, DLWP
!cat aclImdb/train/pos/4077_10.txt
##### Let's take a closer look at the reviews:
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
# codes from ChatGPT 4, 23/12~31/12, 2023
def load_imdb_dataset(base_path):
labels = {'pos': 1, 'neg': 0}
rows = [] # use a list to store all rows
for label_type in ('neg', 'pos'):
dir_name = os.path.join(base_path, label_type)
files = os.listdir(dir_name)
print(f"Loading {len(files)} {label_type} reviews")
for fname in files:
if fname.endswith('.txt'):
with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
text = f.read()
rows.append([text, labels[label_type]]) # Add each row of data to the list
df = pd.DataFrame(rows, columns=['review', 'sentiment']) # create a DataFrame directly from the list
return df
base_path = 'aclImdb/train'
df = load_imdb_dataset(base_path)
print(df['sentiment'].value_counts())
print(df.sample(5)) # random sampling of five elements
# codes from ChatGPT 4, 23/12~31/12, 2023
# convert numeric labels to text labels
df['sentiment_label'] = df['sentiment'].map({1: 'positive', 0: 'negative'})
# draw count bar chart using seaborn
plt.figure(figsize=(8, 6)) # set image size
sns.countplot(x=df['sentiment_label'])
plt.title('Sentiment Distribution') # set title
plt.xlabel('Sentiment') # set x-axis labels
plt.ylabel('Count') # set y-axis labels
plt.grid(True) # show grid lines
plt.show() # display the chart
# 6. Beat the baseline: 50% accuracy!
#### Since we were reminded in coursework 1 that we have to clarify why the baseline for IMDB sentiment classification is 50%, here is the explanation. The task is binary (positive or negative), so random guessing yields 50% accuracy. Therefore, the baseline we set is above 50% accuracy, or the model is useless (it at least needs to beat a random guess, right?). Of course, if this were a regression task, the baseline could be the mean or median, evaluated with MSE, RMSE, MAE, etc.
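##### A quick sanity check of that claim (a sketch, assuming the balanced pos/neg label counts shown earlier):
import numpy as np
rng = np.random.default_rng(1337)
labels = np.array([0] * 12500 + [1] * 12500)     # balanced labels, as in aclImdb/train
guesses = rng.integers(0, 2, size=labels.shape)  # uniform random guessing
print((guesses == labels).mean())                # ~0.5, hence the 50% baseline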
## 6-1. Preparing the data
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
batch_size = 32 # DLWP 11.3.1
base_dir = pathlib.Path("aclImdb")
train_dir = base_dir / "train"
val_dir = base_dir / "val"
test_dir = base_dir / "test"
for category in ("neg", "pos"):
if not os.path.isdir(val_dir / category): # do this only once
os.makedirs(val_dir / category) # make 'neg'/'pos' dir in validation
files = os.listdir(train_dir / category) # list files in 'train'
random.Random(1337).shuffle(files) # shuffle using a seed
num_val_samples = int(0.2 * len(files)) # 20% of our samples for validation
val_files = files[-num_val_samples:]
for fname in val_files: # move our files
shutil.move(
train_dir / category / fname,
val_dir / category / fname
)
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
train_ds = tf.keras.utils.text_dataset_from_directory(
train_dir, batch_size=batch_size
)
val_ds = tf.keras.utils.text_dataset_from_directory(
val_dir, batch_size=batch_size
)
test_ds = tf.keras.utils.text_dataset_from_directory(
test_dir, batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x) # creates a new dataset (text_only_train_ds) that contains only the text portion of the training dataset, with labels removed.
## 6-2. Sanity Check
#### To ensure the dataset format meets the requirements for model training, ***DLWP Listing 11.2*** displays the shapes and dtypes of the first batch:
#### Batch size: `inputs.shape` shows 32 one-dimensional samples, matching `targets.shape`.
#### Data type: the inputs' dtype is `string` and the targets' dtype is `int32` (category label), as expected.
#### `[0]`: check the input and output for the first sample.
#### According to ***DLWP Listing 8.9*** and Chapter ***11.3.1***, we can apply the same method to create Dataset objects:
# codes from ChatGPT 4, 23/12~31/12, 2023
for inputs, targets in train_ds:
print("inputs.shape:", inputs.shape)
print("inputs.dtype:", inputs.dtype)
print("targets.shape:", targets.shape)
print("targets.dtype:", targets.dtype)
print("inputs[0]:", inputs[0])
print("targets[0]:", targets[0])
break
#### Since we need to control the input size, we can calculate the average word count and the proportion of reviews over 600 words.
# idea from "https://www.kaggle.com/code/derrelldsouza/imdb-sentiment-analysis-eda-ml-lstm-bert?kernelSessionId=68871348"
# codes from ChatGPT 4, 23/12~31/12, 2023
# count the number of words in each comment
df['word_count'] = df['review'].apply(lambda x: len(x.split()))
# calculate the average number of words
average_word_count = df['word_count'].mean()
print(f"Average word count: {average_word_count}")
# calculate the percentage of comments in an interval of 100 words
word_count_bins = list(range(0, max(df['word_count']), 100)) + [max(df['word_count'])]
word_count_hist = np.histogram(df['word_count'], bins=word_count_bins)
bin_edges = word_count_hist[1]
bin_counts = word_count_hist[0]
# print the percentage of comments for each range
for i in range(len(bin_edges)-1):
bin_percentage = (bin_counts[i] / df.shape[0]) * 100
print(f"Percentage of reviews with word count between {bin_edges[i]} and {bin_edges[i+1]}: {bin_percentage:.2f}%")
# calculate the proportion of reviews longer than 600 words
over_600_words = df[df['word_count'] > 600].shape[0]
percentage_over_600 = (over_600_words / df.shape[0]) * 100
print(f"Percentage of reviews over 600 words: {percentage_over_600:.2f}%")
# plot words distribution histogram with mean, median and mode
plt.figure(figsize=(14, 7))
n, bins, patches = plt.hist(df['word_count'], bins=20, color='blue', alpha=0.7, rwidth=0.85, density=True)
# add mean, median and mode lines
plt.axvline(average_word_count, color='g', linestyle='dashed', linewidth=2, label=f'Mean: {average_word_count:.2f}')
median_val = np.median(df['word_count'])
mode_val = df['word_count'].mode().values[0]
plt.axvline(median_val, color='r', linestyle='dashed', linewidth=2, label=f'Median: {median_val:.2f}')
plt.axvline(mode_val, color='orange', linestyle='dashed', linewidth=2, label=f'Mode: {mode_val:.2f}')
plt.legend()
# calculate and plot the fitted probability density function
density = norm.pdf(bins, average_word_count, np.std(df['word_count']))
plt.plot(bins, density, color='black')
# set title and axis labels
plt.title('Words per review distribution')
plt.xlabel('Words in review')
plt.ylabel('Density')
# plot grid and chart
plt.grid(True)
plt.show()
## Everything is ready. Let's get started training our models!
# 7. Models for IMDB movie review classification
## Processing words as a sequence: The sequence model approach (DLWP 11.3.3)
#### Vectorizing the data
##### ***Listing 11.12***: "In order to keep a manageable input size*, we’ll truncate the inputs after the first 600 words. This is a reasonable choice, since the average review length is 233 words, and only 5% of reviews are longer than 600 words."
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
max_length = 600 # words per review; if a review has fewer than 600 words, it is padded to 600 with 0s
max_tokens = 20000 # align with "vocab_size = 20000" in the embedding layer of the Transformer encoder
text_vectorization = tf.keras.layers.TextVectorization(
max_tokens=max_tokens,
output_mode="int",
output_sequence_length=max_length, # manage input size here.*
)
text_vectorization.adapt(text_only_train_ds)
# DLWP 11.2-4, ensure operation runs in the tf.data workflow to get higher efficiency if we train models on GPU or TPU.
int_train_ds = train_ds.map(
lambda x, y: (text_vectorization(x), y),
num_parallel_calls=4)
int_val_ds = val_ds.map(
lambda x, y: (text_vectorization(x), y),
num_parallel_calls=4)
int_test_ds = test_ds.map(
lambda x, y: (text_vectorization(x), y),
num_parallel_calls=4)
#### View vocabulary
#### According to ***DLWP 11.2.3***, some vocabulary may not exist in our list, either because it is extremely rare or because it is not in the training data. Hence we use an "out of vocabulary" index (abbreviated as OOV index), a catch-all for any token that wasn't in the index. The OOV token usually takes index 1, and the mask token index 0, so the first two vocabulary entries are '' and [UNK].
# idea from "https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews?kernelSessionId=40627787"
vocabulary = text_vectorization.get_vocabulary()
print(vocabulary[:100])
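##### A small check of those special tokens (a sketch using the adapted `text_vectorization` layer): index 0 is the padding/mask token and index 1 is [UNK], so an out-of-vocabulary word should map to 1.
sample = text_vectorization(["an utterly zxqwv review"])  # "zxqwv" is presumably not in the vocabulary
print(sample[0][:6].numpy())  # the made-up word should encode as 1 ([UNK]); trailing 0s are padding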
---
## 7-1. The Transformer architecture
#### The Transformer encoder implemented as a subclassed `Layer`
##### The dimension of the embedding vector is `embed_dim` and there are `num_heads` heads. For each head, each input vector is divided into segments of length `embed_dim` / `num_heads`. E.g. 256 / 4 = 64 (works), 250 / 4 = 62.5 (doesn't work).
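##### A one-line guard for that divisibility rule (a sketch; in the classic multi-head formulation each head gets `embed_dim // num_heads` dimensions):
embed_dim, num_heads = 256, 4
assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
print(f"per-head dimension: {embed_dim // num_heads}")  # 256 / 4 = 64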
##### DLWP 9.3.3, 11.4.3: layer normalization prevents the vanishing gradient problem and maintains feature stability in the deep layers of the network, thereby promoting efficient training of deeper Transformer models. In this way, layer normalization helps the model learn long-distance dependencies better.
##### Padding mask: this type of mask is used to mask (i.e. ignore) padding tokens in the input sequence. In text processing, it is common to use special padding tokens such as [PAD] to ensure all sequences have the same length. When computing attention, we do not want the model to treat these padding tokens as valid inputs, because they carry no meaningful information. The padding mask sets the attention weights at these positions to 0, ensuring the model's attention is focused only on non-padding tokens.
##### Look-ahead mask (or decoder mask): this type of mask is used in the decoder to prevent the model from "peeking" at future tokens when generating output. In sequence generation tasks (such as text generation), the model generates output step by step and should rely only on previous tokens. The look-ahead mask ensures that, when computing attention for the current token, the model has no access to information about subsequent tokens.
##### We expand the mask along the appropriate dimension (using tf.newaxis) and apply it to the multi-head attention layer.
##### Since different reviews may have different lengths, special padding symbols such as "[PAD]" are often used to ensure that all input sequences are of the same length. In this case, a padding mask helps the model ignore the positions occupied by these symbols, ensuring its attention mechanism focuses only on meaningful content.
##### In a task like IMDB sentiment classification, our model needs to look at all words in the input sequence (except padding symbols) to understand the overall sentiment. Therefore, there is no need to prevent the model from "seeing" any specific part of the sequence, which is what look-ahead masking does in sequence generation tasks.
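##### A minimal illustration of a padding mask (a sketch mirroring the `compute_mask` logic used in `PositionalEmbedding` below):
seq = tf.constant([[12, 3, 54, 3, 0, 0]])  # 0 marks padding positions
mask = tf.math.not_equal(seq, 0)           # [[ True  True  True  True False False]]
attn_mask = mask[:, tf.newaxis, :]         # broadcast over the query axis, as in TransformerEncoder.call
print(mask.numpy(), attn_mask.shape)       # (1, 6) -> (1, 1, 6)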
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
class TransformerEncoder(tf.keras.layers.Layer): # DLWP Listing 11.21
def __init__(self, embed_dim, dense_dim, num_heads, **kwargs): # **kwargs: keyword arguments
super().__init__(**kwargs)
self.embed_dim = embed_dim # Parameters!
self.dense_dim = dense_dim # ! embed_dim % num_heads must be zero! (divisibility)
self.num_heads = num_heads
self.attention = tf.keras.layers.MultiHeadAttention( # Multi-head-attention, DLWP 11.4.2; able to process query, key, and value
num_heads=num_heads, key_dim=embed_dim
)
self.dense_proj = tf.keras.Sequential( # dense layer on top: like a nonlinearity; "top" means above the order in the model's architectural hierarchy. In other words, these dense layers are performed after the multi-head attention layer and are therefore called "on top".
[tf.keras.layers.Dense(dense_dim, activation="relu"), # ReLU eases the vanishing gradient problem;
tf.keras.layers.Dense(embed_dim),] # the derivative of ReLU(x)=max(0,x) is 1 for x>0 and 0 for x<=0
)
self.layernorm_1 = tf.keras.layers.LayerNormalization() # layer norm; DLWP 9.3.3, 11.4.3
self.layernorm_2 = tf.keras.layers.LayerNormalization() # by default applied on the last dim (not the whole layer)
def call(self, inputs, mask=None): # We use padding by "max_length = 600" in "Vectorize the data"
if mask is not None: # optional mask (used in the decoder, see the
mask = mask[:, tf.newaxis, :] # translation notebook for an analysis); since the data is padded, we keep this code
attention_output = self.attention(
inputs, inputs, attention_mask=mask # we actually use only two inputs: query is one input; value and key share the other
)
proj_input = self.layernorm_1(inputs + attention_output) # inputs + attn: residual connection, DLWP 9.3.2, to solve the problem of "vanishing gradient" by adding the input of a layer add back to its output
proj_output = self.dense_proj(proj_input) # Dense layer on top: like a nonlinearity; "top" means above the order in the model's architectural hierarchy.
return self.layernorm_2(proj_input + proj_output) # In other words, these dense layers are performed after the multi-head attention layer and are therefore called "on top".
def get_config(self): # retrieve config as a dict
config = super().get_config() # (required for Keras layers)
config.update({ # If we execute x = TransformerEncoder(embed_dim=256, dense_dim=32, num_heads=2),
"embed_dim": self.embed_dim, # the returned dictionary includes (embed_dim=256, dense_dim=32, num_heads=2).
"num_heads": self.num_heads, # For example: config = layer.get_config()
"dense_dim": self.dense_dim, # new_layer = TransformerEncoder.from_config(config)
}) # rebuilds a layer with the same configuration
return config
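#### A quick smoke test (a sketch) to confirm the encoder preserves the input shape and that its config round-trips:
enc = TransformerEncoder(embed_dim=256, dense_dim=32, num_heads=2)
dummy = tf.random.uniform((2, 600, 256))                 # (batch, seq_len, embed_dim)
print(enc(dummy).shape)                                  # (2, 600, 256): shape is preserved
print(TransformerEncoder.from_config(enc.get_config()))  # rebuilds an equivalent layer from its config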
#### DLWP `Listing 11.25`: Using the Transformer encoder for text classification
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
vocab_size = 20000 # Align with "max_tokens = 20000" in Vectorizing the data: the text vectorization and the embedding must agree.
embed_dim = 256 # Must be divisible by num_heads
num_heads = 2
dense_dim = 32
inputs = tf.keras.Input(shape=(None,), dtype="int64")
x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs) # 1. Regular embeddings
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x) # 2. Transformer encoder, only 1 layer
x = tf.keras.layers.GlobalMaxPooling1D()(x) # (reduce full sequence to a single vector...)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x) # the output of sigmoid is between 0 and 1, making it suitable for a binary classification task
model = tf.keras.Model(inputs, outputs)
model.compile(
optimizer="rmsprop",
loss="binary_crossentropy",
metrics=["accuracy"]
)
model.summary()
### Training and evaluating the Transformer encoder based model
#### A faster GPU (e.g. A100) trains this about 20 minutes faster than a T4.
#### According to our definition of overfitting, the training loss steadily decreases while the validation loss steadily increases. Moreover, training accuracy steadily increases, but validation accuracy jumps back and forth and even decreases in the last few epochs. We can conclude that the model is overfitting.
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
callbacks = [
tf.keras.callbacks.ModelCheckpoint(
str(base_dir / "transformer_encoder.h5"),
save_best_only=True
)
]
history = model.fit(
int_train_ds,
validation_data=int_val_ds,
epochs=20,
callbacks=callbacks
)
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
def plot_history(history):
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,4))
loss = history.history["loss"]
val_loss = history.history["val_loss"]
acc = history.history["accuracy"]
val_acc = history.history["val_accuracy"]
epochs = range(1, len(loss) + 1)
ax1.plot(epochs, loss, label="Training")
ax1.plot(epochs, val_loss, label="Validation")
ax1.set_title("Training and validation loss")
ax1.legend()
ax2.plot(epochs, acc, label="Training")
ax2.plot(epochs, val_acc, label="Validation")
ax2.set_title("Training and validation accuracy")
ax2.legend()
plt.show()
plot_history(history) # a LOT of overfitting
#### We should run evaluate on our last model, but we want to see whether the test accuracy suffers even with so much overfitting. The test accuracy is 0.871, which is not bad. But could we do better?
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
model = tf.keras.models.load_model( # DLWP 11.4.3
str(base_dir / "transformer_encoder.h5"),
custom_objects={"TransformerEncoder": TransformerEncoder}
)
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")
## 7-2. Implementing positional embedding as a subclassed layer
#### The previous Transformer doesn't consider word order. We can manually add that sequence information through positional embeddings, making the model a hybrid, order-aware approach.
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
class PositionalEmbedding(tf.keras.layers.Layer): # Listing 11.24
def __init__(self, sequence_length, input_dim, output_dim, **kwargs): # Need to know the sequence length first
super().__init__(**kwargs)
self.token_embeddings = tf.keras.layers.Embedding( # token embeddings: semantic information
input_dim=input_dim, output_dim=output_dim # (input_dim = vocab_dim, tokens are integers)
)
self.position_embeddings = tf.keras.layers.Embedding( # position embeddings: syntactic/spatial information
input_dim=sequence_length, output_dim=output_dim # (input_dim = seq_len, instead of tokens, this
) # learns one embedding per sequence *position*)
self.sequence_length = sequence_length # store those params
self.input_dim = input_dim
self.output_dim = output_dim
def call(self, inputs):
length = tf.shape(inputs)[-1]
embedded_tokens = self.token_embeddings(inputs) # 1. create token embeddings
# 2. create pos embeddings
positions = tf.range(start=0, limit=length, delta=1) # (as many as our input length, delta: step size)
embedded_positions = self.position_embeddings(positions)
return embedded_tokens + embedded_positions # 3. Both embeddings are simply combined together.
def compute_mask(self, inputs, mask=None): # Turns int sequences into a mask (ignore all 0), example:
return tf.math.not_equal(inputs, 0) # [ 12 3 54 3 0 0 ]
# [ True True True True False False ]
def get_config(self): # retrieve config as a dict
config = super().get_config() # (required for Keras layers)
config.update({
"output_dim": self.output_dim,
"sequence_length": self.sequence_length,
"input_dim": self.input_dim,
})
return config
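#### A small check (a sketch) that `compute_mask` behaves exactly as the comment above describes:
pe = PositionalEmbedding(sequence_length=600, input_dim=20000, output_dim=256)
ids = tf.constant([[12, 3, 54, 3, 0, 0]])
print(pe(ids).shape)                 # (1, 6, 256): token + position embeddings added together
print(pe.compute_mask(ids).numpy())  # [[ True  True  True  True False False]]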
### 7-2-1. Combining the Transformer encoder with positional embedding
#### The test accuracy is 0.884, which is better than the model without positional embedding.
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32
inputs = tf.keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs) # 1. Positional embeddings
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x) # 2. Transformer encoder
x = tf.keras.layers.GlobalMaxPooling1D()(x) # (reduce full sequence to a vector)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
optimizer="rmsprop",
loss="binary_crossentropy",
metrics=["accuracy"]
)
model.summary()
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
callbacks = [
tf.keras.callbacks.ModelCheckpoint(
str(base_dir / "full_transformer_encoder.h5"),
save_best_only=True
)
]
history = model.fit(
int_train_ds,
validation_data=int_val_ds,
epochs=20,
callbacks=callbacks
)
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
def plot_history(history):
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,4))
loss = history.history["loss"]
val_loss = history.history["val_loss"]
acc = history.history["accuracy"]
val_acc = history.history["val_accuracy"]
epochs = range(1, len(loss) + 1)
ax1.plot(epochs, loss, label="Training")
ax1.plot(epochs, val_loss, label="Validation")
ax1.set_title("Training and validation loss")
ax1.legend()
ax2.plot(epochs, acc, label="Training")
ax2.plot(epochs, val_acc, label="Validation")
ax2.set_title("Training and validation accuracy")
ax2.legend()
plt.show()
plot_history(history) #
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
model = tf.keras.models.load_model(
str(base_dir / "full_transformer_encoder.h5"), # 确保文件名与代码3中保存的模型匹配
custom_objects={"PositionalEmbedding": PositionalEmbedding, "TransformerEncoder": TransformerEncoder} # 如果您在模型中使用了自定义层
)
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")
### 7-2-2. Reduce embed_dim from 256 to 128.
#### According to Occam's razor, one possible way to improve model performance is to reduce its complexity. The number of parameters is driven by the embedding dimension (vocab_size * embed_dim), so we can decrease the parameter count by reducing the embedding dimension; we change embed_dim to 128. However, the best validation accuracy, 0.8778 (epoch 6), is no better than the embed_dim=256 model, and the run then shows a lot of overfitting. We should introduce the early stopping method.
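##### The arithmetic behind that choice (a sketch; embedding parameters = vocab_size * embed_dim, plus sequence_length * embed_dim for the positional table):
for dim in (256, 128):
    print(f"embed_dim={dim}: {20000 * dim + 600 * dim:,} embedding parameters")
# embed_dim=256: 5,273,600 vs embed_dim=128: 2,636,800 -- halving embed_dim halves this count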
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
vocab_size = 20000
sequence_length = 600
embed_dim = 128 # Change here, from 256 to 128
num_heads = 2
dense_dim = 32
inputs = tf.keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs) # 1. Positional embeddings
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x) # 2. Transformer encoder
x = tf.keras.layers.GlobalMaxPooling1D()(x) # (reduce full sequence to a vector...)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
optimizer="rmsprop",
loss="binary_crossentropy",
metrics=["accuracy"]
)
callbacks = [
tf.keras.callbacks.ModelCheckpoint(
str(base_dir / "full_transformer_encoder.h5"),
save_best_only=True
)
]
history = model.fit(
int_train_ds,
validation_data=int_val_ds,
epochs=20,
callbacks=callbacks
)
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
best_epoch = np.argmax(history.history['val_accuracy'])
best_val_accuracy = history.history['val_accuracy'][best_epoch]
print(f"Best Validation Accuracy: {best_val_accuracy:.4f} at Epoch {best_epoch+1}")
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
def plot_history(history):
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,4))
loss = history.history["loss"]
val_loss = history.history["val_loss"]
acc = history.history["accuracy"]
val_acc = history.history["val_accuracy"]
epochs = range(1, len(loss) + 1)
ax1.plot(epochs, loss, label="Training")
ax1.plot(epochs, val_loss, label="Validation")
ax1.set_title("Training and validation loss")
ax1.legend()
ax2.plot(epochs, acc, label="Training")
ax2.plot(epochs, val_acc, label="Validation")
ax2.set_title("Training and validation accuracy")
ax2.legend()
plt.show()
plot_history(history) #
### 7-2-3. Change embed_dim from 256 to 512.
#### Although higher embedding dimensions can usually provide richer feature representations and capture more complex patterns (generalisation ability), they come with a greater risk of overfitting.
##### Responding to the feedback on coursework 1 ("...we should activate silent mode when using grid search..."), we tried `verbose=0` to see what happens. Training runs in silence and we can still retrieve the best epoch, but the downside is that we cannot observe the training process without the plots, even though we can infer that overfitting started at epoch 8. We really should use silent training when performing a grid search.
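##### One way (a sketch, not from DLWP) to keep silent training observable during a grid search is a tiny callback that records each epoch's metrics so they can be inspected or plotted afterwards. `EpochLogger` is a hypothetical helper, not part of the coursework code.
class EpochLogger(tf.keras.callbacks.Callback):  # hypothetical helper for silent grid search
    def __init__(self):
        super().__init__()
        self.rows = []
    def on_epoch_end(self, epoch, logs=None):
        self.rows.append({"epoch": epoch + 1, **(logs or {})})  # store loss/accuracy per epoch
# usage: logger = EpochLogger(); pass it in `callbacks` with verbose=0, then inspect logger.rows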
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
vocab_size = 20000
sequence_length = 600
embed_dim = 512 # Change here
num_heads = 2
dense_dim = 32
inputs = tf.keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs) # 1. Positional embeddings
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x) # 2. Transformer encoder
x = tf.keras.layers.GlobalMaxPooling1D()(x) # (reduce full sequence to a vector...)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
optimizer="rmsprop",
loss="binary_crossentropy",
metrics=["accuracy"]
)
callbacks = [
tf.keras.callbacks.ModelCheckpoint(
str(base_dir / "full_transformer_encoder.h5"),
save_best_only=True
)
]
history = model.fit(
int_train_ds,
validation_data=int_val_ds,
epochs=20,
callbacks=callbacks,
verbose=0 # silence mode
)
best_epoch = np.argmax(history.history['val_accuracy'])
best_val_accuracy = history.history['val_accuracy'][best_epoch]
print(f"Best Validation Accuracy: {best_val_accuracy:.4f} at Epoch {best_epoch+1}")
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
def plot_history(history):
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,4))
loss = history.history["loss"]
val_loss = history.history["val_loss"]
acc = history.history["accuracy"]
val_acc = history.history["val_accuracy"]
epochs = range(1, len(loss) + 1)
ax1.plot(epochs, loss, label="Training")
ax1.plot(epochs, val_loss, label="Validation")
ax1.set_title("Training and validation loss")
ax1.legend()
ax2.plot(epochs, acc, label="Training")
ax2.plot(epochs, val_acc, label="Validation")
ax2.set_title("Training and validation accuracy")
ax2.legend()
plt.show()
plot_history(history) #
### 7-2-4. Increase num_heads from 2 to 4.
#### Since the Transformer is famous for its multi-head attention mechanism, perhaps we can reach a better result by increasing the number of heads from 2 to 4. Unfortunately, the validation accuracy still doesn't improve, and the model still overfits almost from the beginning.
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 4 # Change here
dense_dim = 32
inputs = tf.keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs) # 1. Positional embeddings
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x) # 2. Transformer encoder
x = tf.keras.layers.GlobalMaxPooling1D()(x) # (reduce full sequence to a vector...)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
optimizer="rmsprop",
loss="binary_crossentropy",
metrics=["accuracy"]
)
model.summary()
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
callbacks = [
tf.keras.callbacks.ModelCheckpoint(
str(base_dir / "full_transformer_encoder.h5"),
save_best_only=True
)
]
history = model.fit(
int_train_ds,
validation_data=int_val_ds,
epochs=20,
callbacks=callbacks
)
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
best_epoch = np.argmax(history.history['val_accuracy'])
best_val_accuracy = history.history['val_accuracy'][best_epoch]
print(f"Best Validation Accuracy: {best_val_accuracy:.4f} at Epoch {best_epoch+1}")
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
def plot_history(history):
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,4))
loss = history.history["loss"]
val_loss = history.history["val_loss"]
acc = history.history["accuracy"]
val_acc = history.history["val_accuracy"]
epochs = range(1, len(loss) + 1)
ax1.plot(epochs, loss, label="Training")
ax1.plot(epochs, val_loss, label="Validation")
ax1.set_title("Training and validation loss")
ax1.legend()
ax2.plot(epochs, acc, label="Training")
ax2.plot(epochs, val_acc, label="Validation")
ax2.set_title("Training and validation accuracy")
ax2.legend()
plt.show()
plot_history(history) #
### 7-2-5. Increase num_heads from 4 to 8.
#### Perhaps we need more attention? Let's increase the number of heads from 4 to 8. We also increase dense_dim to 1024 to add parameters. Sadly, the validation accuracy remains almost the same, 0.876. We also introduce early stopping in this model.
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
vocab_size = 20000
sequence_length = 600
embed_dim = 512
num_heads = 8
dense_dim = 1024
inputs = tf.keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs) # 1. Positional embeddings
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x) # 2. Transformer encoder
x = tf.keras.layers.GlobalMaxPooling1D()(x) # (reduce full sequence to a vector...)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
optimizer="rmsprop",
loss="binary_crossentropy",
metrics=["accuracy"]
)
model.summary()
# codes from ChatGPT 4, 23/12~31/12, 2023
# Define Early Stopping
early_stopping_callback = tf.keras.callbacks.EarlyStopping(
monitor='val_loss', # monitor validation loss
patience=3, # stop if no improvement after 3 epochs
restore_best_weights=True # restore to the best weights
)
callbacks = [
tf.keras.callbacks.ModelCheckpoint(
str(base_dir / "full_transformer_encoder.h5"),
save_best_only=True
),
early_stopping_callback # Early Stopping
]
history = model.fit(
int_train_ds,
validation_data=int_val_ds,
epochs=20,
callbacks=callbacks
)
# codes from ChatGPT 4, 23/12~31/12, 2023
best_epoch = np.argmax(history.history['val_accuracy'])
best_val_accuracy = history.history['val_accuracy'][best_epoch]
print(f"Best Validation Accuracy: {best_val_accuracy:.4f} at Epoch {best_epoch+1}")
#### We don't change the activation from sigmoid to ReLU/tanh because, for a binary classification task, the sigmoid's expected output in [0, 1] is the reasonable choice.
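##### For reference, a sketch of how the sigmoid output maps to the two classes at inference time:
probs = model.predict(int_test_ds.take(1))  # sigmoid outputs in (0, 1)
preds = (probs > 0.5).astype("int32")       # threshold at 0.5: 1 = positive, 0 = negative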
### 7-2-6. Change optimiser from RMSProp to Adam
#### Adam combines the features of gradient-descent momentum and RMSprop: it computes both a first-moment estimate of the gradient (the mean, similar to momentum) and a second-moment estimate (the uncentred variance). It therefore takes into account not only the mean of the squared past gradients (like RMSprop) but also the mean of past gradients (i.e. momentum) to update the weights. Theoretically, its performance could be better than RMSprop on the same model. However, the validation accuracy doesn't improve. Maybe we should consider adjusting Adam's hyperparameters: the learning rate, β1 (decay rate of the first-moment estimate) and β2 (decay rate of the second-moment estimate)?
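##### A sketch of the Adam update described above (NumPy, one parameter step; beta1 and beta2 are the moment decay rates):
import numpy as np
def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m + (1 - beta1) * g     # first moment: running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g**2  # second moment: running mean of squared gradients (RMSprop-like)
    m_hat = m / (1 - beta1**t)          # bias correction for the early steps (t starts at 1)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v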
# codes from Jérémie's ARTIFICIAL INTELLIGENCE (2023-24), Goldsmiths
# codes from Ch.11, DLWP
vocab_size = 20000
sequence_length = 600
embed_dim = 512
num_heads = 8
dense_dim = 1024
inputs = tf.keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs) # 1. Positional embeddings
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x) # 2. Transformer encoder
x = tf.keras.layers.GlobalMaxPooling1D()(x) # (reduce full sequence to a vector...)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(
optimizer="adam", # change from RMSprop to ADAM
loss="binary_crossentropy",
metrics=["accuracy"]
)
model.summary()
# codes from ChatGPT 4, 23/12~31/12, 2023
# Define Early Stopping
early_stopping_callback = tf.keras.callbacks.EarlyStopping(
monitor='val_loss', # monitor validation loss
patience=3, # stop if no improvement after 3 epochs
restore_best_weights=True # restore to the best weights
)
callbacks = [
tf.keras.callbacks.ModelCheckpoint(
str(base_dir / "full_transformer_encoder.h5"),
save_best_only=True
),
early_stopping_callback # Early Stopping
]
history = model.fit(
int_train_ds,
validation_data=int_val_ds,
epochs=20,
callbacks=callbacks
)
### 7-2-7. The validation accuracy doesn't improve, so we modify and encapsulate the Transformer model to make its architecture easier to change.
# codes from ChatGPT 4, 23/12~31/12, 2023
# Define Transformer model
def build_transformer_model(vocab_size, sequence_length, embed_dim, num_heads, dense_dim, num_layers, batch_size):
# input layer
inputs = tf.keras.Input(shape=(sequence_length,), dtype="int64", batch_size=batch_size) # ensure the input shape contains the sequence length
# embedding layer (could be replaced with another embedding layer, such as GloVe)
x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)
# Transformer encoder layer
for _ in range(num_layers):
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
# Global pooling and output layer
x = tf.keras.layers.GlobalMaxPooling1D()(x)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
# build model
model = tf.keras.Model(inputs, outputs)
return model
### 7-2-8. Change batch size from 32 to 64. new_dense_dim = 512.
#### Neither validation accuracy nor overfitting improves.
# codes from ChatGPT 4, 23/12~31/12, 2023
# new parameters
new_batch_size = 64
new_vocab_size = 20000
new_sequence_length = 600
new_embed_dim = 256
new_num_heads = 4
new_dense_dim = 512
new_num_layers = 2
# compile model with new params
new_model = build_transformer_model(new_vocab_size, new_sequence_length, new_embed_dim, new_num_heads, new_dense_dim, new_num_layers, new_batch_size)
new_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# print model summary
new_model.summary()
# codes from ChatGPT 4, 23/12~31/12, 2023
# Early Stopping callback for training the new model
new_early_stopping_callback = tf.keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=3,
restore_best_weights=True
)
new_callbacks = [
tf.keras.callbacks.ModelCheckpoint("new_full_transformer_encoder.h5", save_best_only=True),
new_early_stopping_callback
]
# train new model
new_history = new_model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=new_callbacks)
### 7-2-9. Change batch size from 32 to 64. new_dense_dim = 32.
#### Neither validation accuracy nor overfitting shows any improvement.
# codes from ChatGPT 4, 23/12~31/12, 2023
# new parameters
new_batch_size = 64
new_vocab_size = 20000
new_sequence_length = 600
new_embed_dim = 256
new_num_heads = 4
new_dense_dim = 32
new_num_layers = 2
# compile model with new params
new_model = build_transformer_model(new_vocab_size, new_sequence_length, new_embed_dim, new_num_heads, new_dense_dim, new_num_layers, new_batch_size)
new_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# print model summary
new_model.summary()
# codes from ChatGPT 4, 23/12~31/12, 2023
# Early Stopping
new_early_stopping_callback = tf.keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=3,
restore_best_weights=True
)
new_callbacks = [
tf.keras.callbacks.ModelCheckpoint("new_full_transformer_encoder.h5", save_best_only=True),
new_early_stopping_callback
]
# train new model
new_history = new_model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=new_callbacks)
### 7-2-10. New model 3. Batch size back to 32, new_num_heads = 4.
#### The validation accuracy, 0.8820, is better than the previous two models, but overfitting happens early.
# codes from ChatGPT 4, 23/12~31/12, 2023
# new parameters
new_batch_size = 32
new_vocab_size = 20000
new_sequence_length = 600
new_embed_dim = 256
new_num_heads = 4
new_dense_dim = 32
new_num_layers = 2
# compile model with new params
new_model = build_transformer_model(new_vocab_size, new_sequence_length, new_embed_dim, new_num_heads, new_dense_dim, new_num_layers, new_batch_size)
new_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# print model summary
new_model.summary()
# codes from ChatGPT 4, 23/12~31/12, 2023
# Early Stopping
new_early_stopping_callback = tf.keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=3,
restore_best_weights=True
)
new_callbacks = [
tf.keras.callbacks.ModelCheckpoint("new_full_transformer_encoder.h5", save_best_only=True),
new_early_stopping_callback
]
# train new model
new_history = new_model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=new_callbacks)
### 7-2-11. New model 4. Change new_num_heads from 4 to 2.
#### Neither validation accuracy nor overfitting improves much.
# codes from ChatGPT 4, 23/12~31/12, 2023
# new parameters
new_batch_size = 32
new_vocab_size = 20000
new_sequence_length = 600
new_embed_dim = 256
new_num_heads = 2
new_dense_dim = 32
new_num_layers = 2
# compile model with new params
new_model = build_transformer_model(new_vocab_size, new_sequence_length, new_embed_dim, new_num_heads, new_dense_dim, new_num_layers, new_batch_size)
new_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# print model summary
new_model.summary()
# codes from ChatGPT 4, 23/12~31/12, 2023
# Early Stopping
new_early_stopping_callback = tf.keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=3,
restore_best_weights=True
)
new_callbacks = [
tf.keras.callbacks.ModelCheckpoint("new_full_transformer_encoder.h5", save_best_only=True),
new_early_stopping_callback
]
# train new model
new_history = new_model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=new_callbacks)
### 7-2-12. New model 5. Keep new_num_heads = 4 and compile the model with a custom optimizer (learning rate = 0.0005).
#### Neither validation accuracy nor overfitting improves much.
# codes from ChatGPT 4, 23/12~31/12, 2023
# new parameters
new_batch_size = 32
new_vocab_size = 20000
new_sequence_length = 600
new_embed_dim = 256
new_num_heads = 4
new_dense_dim = 32
new_num_layers = 2
# define custom learning rate
learning_rate = 0.0005
# Create an Adam optimizer instance with a custom learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
# Build a model with new parameters
new_model = build_transformer_model(new_vocab_size, new_sequence_length, new_embed_dim, new_num_heads, new_dense_dim, new_num_layers, new_batch_size)
# Compile the model with a custom optimizer
new_model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
# model summary
new_model.summary()
# codes from ChatGPT 4, 23/12~31/12, 2023
# Early Stopping
new_early_stopping_callback = tf.keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=3,
restore_best_weights=True
)
new_callbacks = [
tf.keras.callbacks.ModelCheckpoint("new_full_transformer_encoder.h5", save_best_only=True),
new_early_stopping_callback
]
# train new model
new_history = new_model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=new_callbacks)
### 7-2-13. New model 6. Keep new_num_heads = 4 and compile the model with a custom optimizer (learning rate = 0.0003).
#### Neither validation accuracy nor overfitting improves much.
# codes from ChatGPT 4, 23/12~31/12, 2023
# new parameters
new_batch_size = 32
new_vocab_size = 20000
new_sequence_length = 600