<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="Styles/ebook.css" type="text/css" rel="stylesheet"/>
<link href="Styles/style.css" type="text/css" rel="stylesheet"/>
</head>
<body>
<div class="document"><div class="compound"></div>
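<div class="section" id="processing-raw-text"><h1><font id="1">3 Processing Raw Text</font></h1>
<p><font id="2">The most important source of texts is undoubtedly the Web.</font> <font id="3">It is convenient to explore existing text collections, such as the corpora we saw in the previous chapters.</font> <font id="4">However, you probably have your own text sources in mind, and need to learn how to access them.</font></p>
<p><font id="5">The goal of this chapter is to answer the following questions:</font></p>
<ol class="arabic simple"><li><font id="6">How can we write programs to access text from local files and from the Web, in order to get hold of an unlimited range of language material?</font></li>
<li><font id="7">How can we split documents up into individual words and punctuation symbols, so that we can carry out the same kinds of analysis we did with text corpora in earlier chapters?</font></li>
<li><font id="8">How can we write programs to produce formatted output and save it in a file?</font></li>
</ol>
<p><font id="9">In order to address these questions, we will cover key concepts in NLP, including tokenization and stemming.</font> <font id="10">Along the way you will consolidate your Python knowledge and learn about strings, files, and regular expressions.</font> <font id="11">Since so much text on the Web is in HTML format, we will also see how to strip out HTML markup.</font></p>
<div class="note"><p class="first admonition-title"><font id="12">Note</font></p>
<p><font id="13"><strong>Important:</strong> From this chapter onwards, our program samples will assume you begin your interactive session or your program with the following import statements:</font></p>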
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">from</span> __future__ <span class="pysrc-keyword">import</span> division <span class="pysrc-comment"># Python 2 users only</span>
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">import</span> nltk, re, pprint
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">from</span> nltk <span class="pysrc-keyword">import</span> word_tokenize</pre>
</div>
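<div class="section" id="accessing-text-from-the-web-and-from-disk"><h2 class="sigil_not_in_toc"><font id="14">3.1 Accessing Text from the Web and from Disk</font></h2>
<div class="section" id="electronic-books"><h3 class="sigil_not_in_toc"><font id="15">Electronic Books</font></h3>
<p><font id="16">The NLTK corpus collection includes a small sample of texts from Project Gutenberg.</font> <font id="17">However, you may be interested in analyzing other texts from Project Gutenberg.</font> <font id="18">You can browse the catalog of 25,000 free online books at <tt class="doctest"><span class="pre">http://www.gutenberg.org/catalog/</span></tt> and obtain the URL of an ASCII text file.</font> <font id="19">Although 90% of the texts in Project Gutenberg are in English, it also includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese, and Spanish (with more than 100 texts each).</font></p>
<p><font id="20">Text number 2554 is an English translation of <em>Crime and Punishment</em>, and we can access it as follows.</font></p>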
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">from</span> urllib <span class="pysrc-keyword">import</span> request
<span class="pysrc-prompt">>>> </span>url = <span class="pysrc-string">"http://www.gutenberg.org/files/2554/2554.txt"</span>
<span class="pysrc-prompt">>>> </span>response = request.urlopen(url)
<span class="pysrc-prompt">>>> </span>raw = response.read().decode(<span class="pysrc-string">'utf8'</span>)
<span class="pysrc-prompt">>>> </span>type(raw)
<span class="pysrc-output"><class 'str'></span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>len(raw)
<span class="pysrc-output">1176893</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>raw[:75]
<span class="pysrc-output">'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'</span></pre>
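<p>The same download can be wrapped in a small function. The sketch below is only an illustration: the name <tt class="doctest"><span class="pre">fetch_text</span></tt> and its parameters are our own invention, and falling back to UTF-8 when the server declares no character set is an assumption that will not suit every site.</p>

```python
from urllib import request


def fetch_text(url, default_encoding="utf-8", timeout=30):
    """Download a URL and return its body as a str.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    with request.urlopen(url, timeout=timeout) as response:
        data = response.read()  # bytes, not str
        # Prefer the charset declared in the response headers, if any.
        charset = response.headers.get_content_charset() or default_encoding
    # errors="replace" keeps going if a few bytes are invalid in that charset.
    return data.decode(charset, errors="replace")
```

<p>Calling <tt class="doctest"><span class="pre">fetch_text(url)</span></tt> with the Gutenberg URL above should give the same kind of string as <tt class="doctest"><span class="pre">raw</span></tt>; the <tt class="doctest"><span class="pre">with</span></tt> statement makes sure the connection is closed afterwards.</p>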
<div class="note"><p class="first admonition-title"><font id="21">Note</font></p>
<p><font id="22">The <tt class="doctest"><span class="pre">read()</span></tt> call will take a few seconds as it downloads this large book.</font> <font id="23">If you are using an Internet proxy that Python does not detect correctly, you may need to specify the proxy manually, before using <tt class="doctest"><span class="pre">urlopen</span></tt>, as follows:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>proxies = {<span class="pysrc-string">'http'</span>: <span class="pysrc-string">'http://www.someproxy.com:3128'</span>}
<span class="pysrc-prompt">>>> </span>request.ProxyHandler(proxies)</pre>
</div>
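<p>Creating a <tt class="doctest"><span class="pre">ProxyHandler</span></tt> has no effect on its own; it has to become part of an opener. A minimal sketch, reusing the placeholder proxy address from the note above (substitute your own):</p>

```python
from urllib import request

# Placeholder address taken from the note above; substitute your own proxy.
proxies = {"http": "http://www.someproxy.com:3128"}

# Build an opener that routes HTTP requests through the proxy.
opener = request.build_opener(request.ProxyHandler(proxies))

# After install_opener(), plain request.urlopen() calls use this opener.
# Left commented out so that running this sketch does not change global state.
# request.install_opener(opener)
```

<p>Alternatively, <tt class="doctest"><span class="pre">opener.open(url)</span></tt> can be called directly, which leaves the behaviour of <tt class="doctest"><span class="pre">urlopen</span></tt> untouched for the rest of the program.</p>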
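<p><font id="24">The variable <tt class="doctest"><span class="pre">raw</span></tt> contains a string of 1,176,893 characters.</font> <font id="25">(We can see that it is a string using <tt class="doctest"><span class="pre">type(raw)</span></tt>.)</font> <font id="26">This is the raw content of the book, including many details we are not interested in, such as whitespace, line breaks, and blank lines.</font> <font id="27">Notice the <tt class="doctest"><span class="pre">\r</span></tt> and <tt class="doctest"><span class="pre">\n</span></tt> at the end of the line in the file, which is how Python displays the special carriage return and line feed characters (the file must have been created on a Windows machine).</font> <font id="28">For language processing, we want to break the string up into words and punctuation, as we saw in Chapter <a class="reference external" href="./ch01.html#chap-introduction">1.</a></font> <font id="29">This step is called <span class="termdef">tokenization</span>, and it produces our familiar structure, a list of words and punctuation.</font></p>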
<pre class="doctest"><span class="pysrc-prompt">>>> </span>tokens = word_tokenize(raw)
<span class="pysrc-prompt">>>> </span>type(tokens)
<span class="pysrc-output"><class 'list'></span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>len(tokens)
<span class="pysrc-output">254354</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>tokens[:10]
<span class="pysrc-output">['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']</span></pre>
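<p>To get a feel for what tokenization does, here is a rough stand-in written with a regular expression. It is not how <tt class="doctest"><span class="pre">word_tokenize</span></tt> works (for instance, it does not split contractions into pieces such as <tt class="doctest"><span class="pre">n't</span></tt>); the function name <tt class="doctest"><span class="pre">rough_tokenize</span></tt> is hypothetical.</p>

```python
import re


def rough_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Crime and Punishment, by Fyodor Dostoevsky\r\n"
print(rough_tokenize(sample))
# ['Crime', 'and', 'Punishment', ',', 'by', 'Fyodor', 'Dostoevsky']
```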
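<p><font id="30">Notice that NLTK was needed for tokenization, but not for any of the earlier tasks of opening a URL and reading it into a string.</font> <font id="31">If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we saw in Chapter <a class="reference external" href="./ch01.html#chap-introduction">1.</a>, along with regular list operations such as slicing:</font></p>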
<pre class="doctest"><span class="pysrc-prompt">>>> </span>text = nltk.Text(tokens)
<span class="pysrc-prompt">>>> </span>type(text)
<span class="pysrc-output"><class 'nltk.text.Text'></span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>text[1024:1062]
<span class="pysrc-output">['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',</span>
<span class="pysrc-output"> 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',</span>
<span class="pysrc-output"> 'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly',</span>
<span class="pysrc-output"> ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K.', 'bridge', '.']</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>text.collocations()
<span class="pysrc-output">Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya</span>
<span class="pysrc-output">Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old</span>
<span class="pysrc-output">woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;</span>
<span class="pysrc-output">great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know;</span>
<span class="pysrc-output">Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market</span></pre>
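<p><font id="32">Notice that <span class="example">Project Gutenberg</span> appears as a collocation.</font> <font id="33">This is because each text downloaded from Project Gutenberg contains a header with the name of the text, the author, the names of the people who scanned and corrected the text, a license, and so on.</font> <font id="34">Sometimes this information appears in a footer at the end of the file.</font> <font id="35">We cannot reliably detect where the content begins and ends, so before trimming <tt class="doctest"><span class="pre">raw</span></tt> down to just the content and nothing else, we have to inspect the file manually to discover unique strings that mark the beginning and the end of the content:</font></p>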
<pre class="doctest"><span class="pysrc-prompt">>>> </span>raw.find(<span class="pysrc-string">"PART I"</span>)
<span class="pysrc-output">5338</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>raw.rfind(<span class="pysrc-string">"End of Project Gutenberg's Crime"</span>)
<span class="pysrc-output">1157743</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>raw = raw[5338:1157743] <a href="./ch03.html#ref-raw-slice"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></a>
<span class="pysrc-prompt">>>> </span>raw.find(<span class="pysrc-string">"PART I"</span>)
<span class="pysrc-output">0</span></pre>
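<p>One thing to watch for: <tt class="doctest"><span class="pre">find()</span></tt> and <tt class="doctest"><span class="pre">rfind()</span></tt> return <tt class="doctest"><span class="pre">-1</span></tt> when the marker is absent, and slicing with <tt class="doctest"><span class="pre">-1</span></tt> silently gives the wrong text. The sketch below guards against that; the function name <tt class="doctest"><span class="pre">slice_between</span></tt> and the toy sample string are made up for illustration.</p>

```python
def slice_between(raw, start_marker, end_marker):
    """Return raw from the first start_marker up to the last end_marker."""
    start = raw.find(start_marker)
    end = raw.rfind(end_marker)
    # find() and rfind() return -1 when the marker does not occur.
    if start == -1 or end == -1 or end < start:
        raise ValueError("markers not found in the expected order")
    return raw[start:end]


# A toy stand-in for a Gutenberg file: header, content, footer.
sample = "HEADER\nPART I\nbody text\nEnd of Project Gutenberg's Crime\nLICENSE"
body = slice_between(sample, "PART I", "End of Project Gutenberg's Crime")
print(repr(body))
# 'PART I\nbody text\n'
```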
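<p><font id="36">The <tt class="doctest"><span class="pre">find()</span></tt> and <tt class="doctest"><span class="pre">rfind()</span></tt> ("reverse find") methods help us get the right index values to use for slicing the string <a class="reference internal" href="./ch03.html#raw-slice"><span id="ref-raw-slice"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></span></a>.</font> <font id="37">We overwrite <tt class="doctest"><span class="pre">raw</span></tt> with this slice, so now it begins with "PART I" and goes up to (but not including) the phrase that marks the end of the content.</font></p>
<p><font id="38">This was our first brush with the reality of the Web: texts found on the Web may contain unwanted material, and there may not be an automatic way to remove it.</font> <font id="39">But with a small amount of extra work we can extract the material we need.</font></p>
</div>
<div class="section" id="dealing-with-html"><h3 class="sigil_not_in_toc"><font id="40">Dealing with HTML</font></h3>
<p><font id="41">Much of the text on the Web is in the form of HTML documents.</font> <font id="42">You can use a web browser to save a page as text to a local file, then access it as described in the section on files below.</font> <font id="43">However, if you are going to do this often, it is easiest to get Python to do the work directly.</font> <font id="44">The first step is the same as before, using <tt class="doctest"><span class="pre">urlopen</span></tt>.</font> <font id="45">For fun we will pick a BBC News story called <em>Blondes to die out in 200 years</em>, an urban legend passed along by the BBC as established scientific fact:</font></p>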
<pre class="doctest"><span class="pysrc-prompt">>>> </span>url = <span class="pysrc-string">"http://news.bbc.co.uk/2/hi/health/2284783.stm"</span>
<span class="pysrc-prompt">>>> </span>html = request.urlopen(url).read().decode(<span class="pysrc-string">'utf8'</span>)
<span class="pysrc-prompt">>>> </span>html[:60]
<span class="pysrc-output">'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'</span></pre>
<p><font id="46">You can type <tt class="doctest"><span class="pre"><span class="pysrc-keyword">print</span>(html)</span></tt> to see the HTML content in full, including meta tags, image tags, map tags, JavaScript, forms, and tables.</font></p>
<p><font id="47">To get text out of HTML we will use a Python library called <em>BeautifulSoup</em>, available from <tt class="doctest"><span class="pre">http://www.crummy.com/software/BeautifulSoup/</span></tt>:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">from</span> bs4 <span class="pysrc-keyword">import</span> BeautifulSoup
<span class="pysrc-prompt">>>> </span>raw = BeautifulSoup(html, <span class="pysrc-string">'html.parser'</span>).get_text()
<span class="pysrc-prompt">>>> </span>tokens = word_tokenize(raw)
<span class="pysrc-prompt">>>> </span>tokens
<span class="pysrc-output">['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', ...]</span></pre>
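<p><font id="48">This still contains unwanted material, such as site navigation and related stories.</font> <font id="49">With some trial and error you can find the start and end indexes of the content, select the tokens of interest, and initialize a text as before.</font></p>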
<pre class="doctest"><span class="pysrc-prompt">>>> </span>tokens = tokens[110:390]
<span class="pysrc-prompt">>>> </span>text = nltk.Text(tokens)
<span class="pysrc-prompt">>>> </span>text.concordance(<span class="pysrc-string">'gene'</span>)
<span class="pysrc-output">Displaying 5 of 5 matches:</span>
<span class="pysrc-output">hey say too few people now carry the gene for blondes to last beyond the next</span>
<span class="pysrc-output">blonde hair is caused by a recessive gene . In order for a child to have blond</span>
<span class="pysrc-output">have blonde hair , it must have the gene on both sides of the family in the g</span>
<span class="pysrc-output">ere is a disadvantage of having that gene or by chance . They do n't disappear</span>
<span class="pysrc-output">des would disappear is if having the gene was a disadvantage and I do not thin</span></pre>
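<p>If BeautifulSoup is not installed, the standard library's <tt class="doctest"><span class="pre">html.parser</span></tt> module can do a cruder version of the same job. The sketch below is a swapped-in technique, not what <tt class="doctest"><span class="pre">get_text()</span></tt> does internally; the class and function names are hypothetical, and joining the pieces with spaces will split a word that is interrupted by an inline tag.</p>

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces, then collapse the runs of whitespace left by the markup.
    return " ".join(" ".join(parser.parts).split())


page = ("<html><head><title>BBC NEWS</title><script>var x = 1;</script></head>"
        "<body><p>Blondes <b>to die out</b></p></body></html>")
print(html_to_text(page))
# BBC NEWS Blondes to die out
```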
</div>
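<div class="section" id="processing-search-engine-results"><h3 class="sigil_not_in_toc"><font id="50">Processing Search Engine Results</font></h3>
<p><font id="51">The Web can be thought of as a huge corpus of unannotated text.</font> <font id="52">Web search engines provide an efficient means of searching this large quantity of text for relevant linguistic examples.</font> <font id="53">The main advantage of search engines is size: since you are searching such a large set of documents, you are more likely to find any linguistic pattern you are interested in.</font> <font id="54">Furthermore, you can use very specific patterns, which would match only one or two examples in a smaller collection, but which might match tens of thousands of examples on the Web.</font> <font id="55">A second advantage of web search engines is that they are very easy to use.</font> <font id="56">Thus, they provide a very convenient tool for quickly checking whether a theory is reasonable.</font></p>
<p class="caption"><font id="57"><span class="caption-label">Table 3.1</span>:</font></p>
<p><font id="58">Google hits for collocations: the number of hits for collocations in which <span class="example">absolutely</span> or <span class="example">definitely</span> is followed by <span class="example">adore</span>, <span class="example">love</span>, <span class="example">like</span>, or <span class="example">prefer</span>.</font> <font id="59">(Liberman, in <em>LanguageLog</em>, 2005).</font></p>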
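<p>The session below uses the third-party <tt class="doctest"><span class="pre">feedparser</span></tt> library to read the Atom feed of the Language Log blog and to tokenize the content of one post:</p>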
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">import</span> feedparser
<span class="pysrc-prompt">>>> </span>llog = feedparser.parse(<span class="pysrc-string">"http://languagelog.ldc.upenn.edu/nll/?feed=atom"</span>)
<span class="pysrc-prompt">>>> </span>llog[<span class="pysrc-string">'feed'</span>][<span class="pysrc-string">'title'</span>]
<span class="pysrc-output">'Language Log'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>len(llog.entries)
<span class="pysrc-output">15</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>post = llog.entries[2]
<span class="pysrc-prompt">>>> </span>post.title
<span class="pysrc-output">"He's My BF"</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>content = post.content[0].value
<span class="pysrc-prompt">>>> </span>content[:70]
<span class="pysrc-output">'<p>Today I was chatting with three of our visiting graduate students f'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>raw = BeautifulSoup(content, <span class="pysrc-string">'html.parser'</span>).get_text()
<span class="pysrc-prompt">>>> </span>word_tokenize(raw)
<span class="pysrc-output">['Today', 'I', 'was', 'chatting', 'with', 'three', 'of', 'our', 'visiting',</span>
<span class="pysrc-output">'graduate', 'students', 'from', 'the', 'PRC', '.', 'Thinking', 'that', 'I',</span>
<span class="pysrc-output">'was', 'being', 'au', 'courant', ',', 'I', 'mentioned', 'the', 'expression',</span>
<span class="pysrc-output">'DUI4XIANG4', '\u5c0d\u8c61', '("', 'boy', '/', 'girl', 'friend', '"', ...]</span></pre>
<p><font id="92">With some further work, we can write programs to create a small corpus of blog posts, and use this as the basis for our NLP work.</font></p>
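<p>As a rough idea of what such a corpus-building program might look like, the sketch below collects the entries of an Atom feed into a dictionary keyed by post title. It deliberately swaps in the standard library's <tt class="doctest"><span class="pre">xml.etree.ElementTree</span></tt> in place of <tt class="doctest"><span class="pre">feedparser</span></tt>, and it parses a tiny hand-written feed instead of a downloaded one, so every title and text in it is made up.</p>

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# A tiny hand-written feed standing in for a downloaded one.
feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Log</title>
  <entry><title>First post</title><content type="text">Hello blog</content></entry>
  <entry><title>Second post</title><content type="text">More text here</content></entry>
</feed>"""

root = ET.fromstring(feed_xml)
corpus = {}
for entry in root.findall("atom:entry", ATOM):
    title = entry.findtext("atom:title", default="", namespaces=ATOM)
    content = entry.findtext("atom:content", default="", namespaces=ATOM)
    corpus[title] = content.split()  # crude whitespace tokenization

print(corpus)
# {'First post': ['Hello', 'blog'], 'Second post': ['More', 'text', 'here']}
```

<p>Real feeds usually carry HTML in their content, so the markup would have to be stripped before tokenizing, as in the session above.</p>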
</div>
<div class="section" id="reading-local-files"><h3 class="sigil_not_in_toc"><font id="93">Reading Local Files</font></h3>
<p><font id="94">To read a local file, we use Python's built-in <tt class="doctest"><span class="pre">open()</span></tt> function, followed by the <tt class="doctest"><span class="pre">read()</span></tt> method.</font> <font id="95">Suppose you have a file <tt class="doctest"><span class="pre">document.txt</span></tt>; you can load its contents like this:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>f = open(<span class="pysrc-string">'document.txt'</span>)
<span class="pysrc-prompt">>>> </span>raw = f.read()</pre>
<div class="note"><p class="first admonition-title"><font id="96">Note</font></p>
<p class="last"><font id="97"><strong>Your Turn:</strong> Create a file called <tt class="doctest"><span class="pre">document.txt</span></tt> using a text editor, type in a few lines of text, and save it as plain text.</font> <font id="98">If you are using IDLE, select the <em>New Window</em> command in the <em>File</em> menu, type the required text into this window, and then save the file as <tt class="doctest"><span class="pre">document.txt</span></tt> inside the directory that IDLE offers in the pop-up dialogue box.</font> <font id="99">Next, in the Python interpreter, open the file using <tt class="doctest"><span class="pre">f = open(<span class="pysrc-string">'document.txt'</span>)</span></tt>, then inspect its contents using <tt class="doctest"><span class="pre"><span class="pysrc-keyword">print</span>(f.read())</span></tt>.</font></p>
</div>
<p><font id="100">Various things might go wrong when you try this.</font> <font id="101">If the interpreter cannot find your file, you will see an error like this:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>f = open(<span class="pysrc-string">'document.txt'</span>)
<span class="pysrc-except">Traceback (most recent call last):</span>
<span class="pysrc-except">File "<pyshell#7>", line 1, in -toplevel-</span>
<span class="pysrc-except">f = open('document.txt')</span>
<span class="pysrc-except">FileNotFoundError: [Errno 2] No such file or directory: 'document.txt'</span></pre>
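<p><font id="102">To check that the file you are trying to open is really in the right directory, use the <em>Open</em> command in IDLE's <em>File</em> menu;</font> <font id="103">an alternative is to examine the current directory from within Python:</font></p>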
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">import</span> os
<span class="pysrc-prompt">>>> </span>os.listdir(<span class="pysrc-string">'.'</span>)</pre>
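<p>The two checks can be combined: try to open the file, and only if that fails report where Python was looking. This is a sketch under our own assumptions; the function name <tt class="doctest"><span class="pre">read_text_or_none</span></tt> is hypothetical, and UTF-8 is assumed as the file encoding.</p>

```python
import os


def read_text_or_none(path):
    """Return the file's contents, or None (with a hint) if it is missing."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        print("Cannot find", repr(path), "in", os.getcwd())
        # Show a few of the files that are in the current directory.
        print("Files here:", sorted(os.listdir("."))[:10])
        return None
```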
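<p><font id="104">Another possible problem you might encounter when accessing a text file is the newline convention, which differs between operating systems.</font> <font id="105">The built-in <tt class="doctest"><span class="pre">open()</span></tt> function has a second parameter for controlling how the file is opened: <tt class="doctest"><span class="pre">open(<span class="pysrc-string">'document.txt'</span>, <span class="pysrc-string">'rU'</span>)</span></tt>, where <tt class="doctest"><span class="pre"><span class="pysrc-string">'r'</span></span></tt> means to open the file for reading (the default), and <tt class="doctest"><span class="pre"><span class="pysrc-string">'U'</span></span></tt> stands for "Universal", which lets us ignore the different conventions used for marking newlines. (In Python 3, text mode already translates newlines in this way by default; the <tt class="doctest"><span class="pre"><span class="pysrc-string">'U'</span></span></tt> flag was deprecated and then removed in Python 3.11, where a plain <tt class="doctest"><span class="pre">open(<span class="pysrc-string">'document.txt'</span>)</span></tt> is enough.)</font></p>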
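<p>To see newline translation at work in Python 3, the sketch below writes a temporary file with Windows-style line endings and reads it back twice: once in the default text mode, and once with <tt class="doctest"><span class="pre">newline=''</span></tt>, which turns translation off. The temporary file is only there to make the example self-contained.</p>

```python
import os
import tempfile

# Write Windows-style line endings in binary mode, so nothing is translated.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"Time flies like an arrow.\r\nFruit flies like a banana.\r\n")

# Default text mode (newline=None): '\r\n' is read as '\n'.
with open(path, encoding="ascii") as f:
    translated = f.read()

# newline='': line endings are passed through untouched.
with open(path, encoding="ascii", newline="") as f:
    untranslated = f.read()

os.remove(path)
print(repr(translated))
print(repr(untranslated))
```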
<p><font id="106">Assuming that you can open the file, there are several ways to read it.</font> <font id="107">The <tt class="doctest"><span class="pre">read()</span></tt> method creates a string with the contents of the entire file:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>f.read()
<span class="pysrc-output">'Time flies like an arrow.\nFruit flies like a banana.\n'</span></pre>
<p><font id="108">Recall that the <tt class="doctest"><span class="pre"><span class="pysrc-string">'\n'</span></span></tt> characters are <span class="termdef">newlines</span>; this is equivalent to pressing <em>Enter</em> on a keyboard and starting a new line.</font></p>
<p><font id="109">We can also read a file one line at a time using a <tt class="doctest"><span class="pre"><span class="pysrc-keyword">for</span></span></tt> loop:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>f = open(<span class="pysrc-string">'document.txt'</span>)
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">for</span> line <span class="pysrc-keyword">in</span> f:
<span class="pysrc-more">... </span> <span class="pysrc-keyword">print</span>(line.strip())
<span class="pysrc-output">Time flies like an arrow.</span>
<span class="pysrc-output">Fruit flies like a banana.</span></pre>
<p><font id="110">Here we use the <tt class="doctest"><span class="pre">strip()</span></tt> method to remove the newline character at the end of the input line.</font></p>
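<p>A common variation is to open the file in a <tt class="doctest"><span class="pre">with</span></tt> statement, so that it is closed automatically, and to collect the lines in a list. Note that <tt class="doctest"><span class="pre">strip()</span></tt> removes all leading and trailing whitespace, while <tt class="doctest"><span class="pre">rstrip('\n')</span></tt> removes only the newline. The sketch creates its own temporary file so that it can run anywhere.</p>

```python
import os
import tempfile

# Create a small file to read, so that the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("Time flies like an arrow.\nFruit flies like a banana.\n")

with open(path) as f:  # the file is closed when this block ends
    lines = [line.rstrip("\n") for line in f]

os.remove(path)
print(lines)
# ['Time flies like an arrow.', 'Fruit flies like a banana.']
```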
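<p><font id="111">NLTK's corpus files can also be accessed using these methods.</font> <font id="112">We simply use <tt class="doctest"><span class="pre">nltk.data.find()</span></tt> to get the filename for any corpus item.</font> <font id="113">Then we can open and read it in the way we just demonstrated:</font></p>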
<pre class="doctest"><span class="pysrc-prompt">>>> </span>path = nltk.data.find(<span class="pysrc-string">'corpora/gutenberg/melville-moby_dick.txt'</span>)
<span class="pysrc-prompt">>>> </span>raw = open(path).read()</pre>
</div>
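<div class="section" id="extracting-text-from-pdf-msword-and-other-binary-formats"><h3 class="sigil_not_in_toc"><font id="114">Extracting Text from PDF, MSWord and Other Binary Formats</font></h3>
<p><font id="115">ASCII text and HTML text are human-readable formats.</font> <font id="116">Text often comes in binary formats, such as PDF and MSWord, that can only be opened using specialized software.</font> <font id="117">Third-party libraries such as <tt class="doctest"><span class="pre">pypdf</span></tt> and <tt class="doctest"><span class="pre">pywin32</span></tt> provide access to these formats.</font> <font id="118">Extracting text from multi-column documents is particularly challenging.</font> <font id="119">For a one-off conversion of a few documents, it is simpler to open each document with a suitable application, save it as text to your local drive, and then access it as described in the section on reading local files.</font> <font id="120">If the document is already on the Web, you can enter its URL in Google's search box.</font> <font id="121">The search result often includes a link to an HTML version of the document, which you can save as text.</font></p>
</div>
<div class="section" id="capturing-user-input"><h3 class="sigil_not_in_toc"><font id="122">Capturing User Input</font></h3>
<p><font id="123">Sometimes we want to capture the text that a user types while interacting with our program.</font> <font id="124">To prompt the user to type a line of input, call the Python function <tt class="doctest"><span class="pre">input()</span></tt>.</font> <font id="125">After saving the input to a variable, we can manipulate it just as we have done for other strings.</font></p>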
<pre class="doctest"><span class="pysrc-prompt">>>> </span>s = input(<span class="pysrc-string">"Enter some text: "</span>)
<span class="pysrc-output">Enter some text: On an exceptionally hot evening early in July</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(<span class="pysrc-string">"You typed"</span>, len(word_tokenize(s)), <span class="pysrc-string">"words."</span>)
<span class="pysrc-output">You typed 8 words.</span></pre>
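<p>In a larger program it helps to keep the prompt separate from the processing, so that the processing can be tried on any string. A minimal sketch, in which <tt class="doctest"><span class="pre">describe</span></tt> is a hypothetical name and <tt class="doctest"><span class="pre">str.split()</span></tt> stands in for <tt class="doctest"><span class="pre">word_tokenize()</span></tt> (the two agree here only because the sample has no punctuation):</p>

```python
def describe(s):
    """Report how many whitespace-separated words a string contains."""
    # str.split() is a stand-in for word_tokenize() in this sketch.
    return "You typed {} words.".format(len(s.split()))


print(describe("On an exceptionally hot evening early in July"))
# You typed 8 words.

# In an interactive program the string would come from the user instead:
# print(describe(input("Enter some text: ")))
```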
</div>
<div class="section" id="the-nlp-pipeline"><h3 class="sigil_not_in_toc"><font id="126">The NLP Pipeline</font></h3>
<p><font id="127">Figure <a class="reference internal" href="./ch03.html#fig-pipeline1">3.1</a> summarizes what we have covered in this section, including the process of building a vocabulary that we saw in Chapter <a class="reference external" href="./ch01.html#chap-introduction">1.</a> (One step, normalization, will be discussed in Section <a class="reference internal" href="./ch03.html#sec-normalizing-text">3.6</a>.)</font></p>
<div class="figure" id="fig-pipeline1"><img alt="Images/pipeline1.png" src="Images/8c5ec1a0132f7c85fd96eda9d9929d15.jpg" style="width: 571.5px; height: 212.7px;"/><p class="caption"><font id="128"><span class="caption-label">Figure 3.1</span>: The processing pipeline: we open a URL and read its HTML content, remove the markup, and select a slice of characters; this is then tokenized and optionally converted into an <tt class="doctest"><span class="pre">nltk.Text</span></tt> object; we can also lowercase all the words and extract the vocabulary.</font></p>
</div>
<p><font id="129">There is a lot going on in this pipeline.</font> <font id="130">To understand it properly, it helps to be clear about the type of each variable that it mentions.</font> <font id="131">We find out the type of any Python object <tt class="doctest"><span class="pre">x</span></tt> using <tt class="doctest"><span class="pre">type(x)</span></tt>; for example,</font> <font id="132"><tt class="doctest"><span class="pre">type(1)</span></tt> is <tt class="doctest"><span class="pre"><int></span></tt> since <tt class="doctest"><span class="pre">1</span></tt> is an integer.</font></p>
<p><font id="133">When we load the contents of a URL or file, and when we strip out HTML markup, we are dealing with strings, Python's <tt class="doctest"><span class="pre"><str></span></tt> data type.</font> <font id="134">(We will learn more about strings in Section <a class="reference internal" href="./ch03.html#sec-strings">3.2</a>):</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>raw = open(<span class="pysrc-string">'document.txt'</span>).read()
<span class="pysrc-prompt">>>> </span>type(raw)
<span class="pysrc-output"><class 'str'></span></pre>
<p><font id="135">When we tokenize a string we produce a list (of words), and this is Python's <tt class="doctest"><span class="pre"><list></span></tt> type.</font> <font id="136">Normalizing and sorting lists produces other lists:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>tokens = word_tokenize(raw)
<span class="pysrc-prompt">>>> </span>type(tokens)
<span class="pysrc-output"><class 'list'></span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>words = [w.lower() <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> tokens]
<span class="pysrc-prompt">>>> </span>type(words)
<span class="pysrc-output"><class 'list'></span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>vocab = sorted(set(words))
<span class="pysrc-prompt">>>> </span>type(vocab)
<span class="pysrc-output"><class 'list'></span></pre>
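<p>The whole chain can be run end to end on a short string to watch the types change. In this sketch the regular expression is a stand-in for <tt class="doctest"><span class="pre">word_tokenize()</span></tt>, so that it runs without NLTK, and the two-line sample text is the one used for <tt class="doctest"><span class="pre">document.txt</span></tt> above.</p>

```python
import re

raw = "Time flies like an arrow.\nFruit flies like a banana.\n"  # a str
tokens = re.findall(r"\w+|[^\w\s]", raw)  # a list of str (stand-in tokenizer)
words = [w.lower() for w in tokens]       # normalizing gives another list
vocab = sorted(set(words))                # a set, sorted back into a list

for name, value in [("raw", raw), ("tokens", tokens),
                    ("words", words), ("vocab", vocab)]:
    print(name, type(value).__name__)

print(vocab)
# ['.', 'a', 'an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time']
```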
<p><font id="137">The type of an object determines what operations you can perform on it.</font> <font id="138">So, for example, we can append to a list but not to a string:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>vocab.append(<span class="pysrc-string">'blog'</span>)
<span class="pysrc-prompt">>>> </span>raw.append(<span class="pysrc-string">'blog'</span>)
<span class="pysrc-except">Traceback (most recent call last):</span>
<span class="pysrc-except"> File "<stdin>", line 1, in <module></span>
<span class="pysrc-except">AttributeError: 'str' object has no attribute 'append'</span></pre>
<p><font id="139">Similarly, we can concatenate strings with strings, and lists with lists, but we cannot concatenate strings with lists:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>query = <span class="pysrc-string">'Who knows?'</span>
<span class="pysrc-prompt">>>> </span>beatles = [<span class="pysrc-string">'john'</span>, <span class="pysrc-string">'paul'</span>, <span class="pysrc-string">'george'</span>, <span class="pysrc-string">'ringo'</span>]
<span class="pysrc-prompt">>>> </span>query + beatles
<span class="pysrc-except">Traceback (most recent call last):</span>
<span class="pysrc-except"> File "<stdin>", line 1, in <module></span>
<span class="pysrc-except">TypeError: can only concatenate str (not "list") to str</span></pre>
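<p>When we do want to combine a string with a list, we first have to decide which type the result should have. The sketch below shows both directions, reusing the variables from the session above:</p>

```python
query = "Who knows?"
beatles = ["john", "paul", "george", "ringo"]

try:
    query + beatles  # mixing the two types fails
except TypeError as err:
    print("TypeError:", err)

# Result as a list: wrap the string in a list first.
as_list = [query] + beatles

# Result as a string: join the list items into a string first.
as_string = query + " " + ", ".join(beatles)

print(as_list)
print(as_string)
# Who knows? john, paul, george, ringo
```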
</div>
</div>
<div class="section" id="strings-text-processing-at-the-lowest-level"><h2 class="sigil_not_in_toc"><font id="140">3.2 Strings: Text Processing at the Lowest Level</font></h2>
<p><font id="141">It is time to examine a fundamental data type that we have been studiously avoiding so far.</font> <font id="142">In earlier chapters we focused on a text as a list of words.</font> <font id="143">We did not look too closely at words and how they are handled in the programming language.</font> <font id="144">By using NLTK's corpus interface we were able to ignore the files that these texts had come from.</font> <font id="145">The contents of a word, and of a file, are represented by programming languages as a fundamental data type known as a <span class="termdef">string</span>.</font> <font id="146">In this section we explore strings in detail, and show the connection between strings, words, texts, and files.</font></p>
<div class="section" id="basic-operations-with-strings"><h3 class="sigil_not_in_toc"><font id="147">Basic Operations with Strings</font></h3>
<p><font id="148">Strings are specified using single quotes <a class="reference internal" href="./ch03.html#single-quotes"><span id="ref-single-quotes"><img class="callout" alt="[1]" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></span></a> or double quotes <a class="reference internal" href="./ch03.html#double-quotes"><span id="ref-double-quotes"><img class="callout" alt="[2]" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></span></a>, as shown in the following code example.</font> <font id="149">If a string contains a single quote, we must put a backslash before the quote <a class="reference internal" href="./ch03.html#backslash-escape"><span id="ref-backslash-escape"><img class="callout" alt="[3]" src="Images/7c20d0adbadb35031a28bfcd6dff9900.jpg"/></span></a> so that Python knows a literal quote character is intended, or else put the string in double quotes <a class="reference internal" href="./ch03.html#double-quotes"><img class="callout" alt="[2]" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></a>.</font> <font id="150">Otherwise, the quote inside the string <a class="reference internal" href="./ch03.html#unescaped-quote"><span id="ref-unescaped-quote"><img class="callout" alt="[4]" src="Images/0f4441cdaf35bfa4d58fc64142cf4736.jpg"/></span></a> will be interpreted as a close quote, and the Python interpreter will report a syntax error:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>monty = <span class="pysrc-string">'Monty Python'</span> <a href="./ch03.html#ref-single-quotes"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></a>
<span class="pysrc-prompt">>>> </span>monty
<span class="pysrc-output">'Monty Python'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>circus = <span class="pysrc-string">"Monty Python's Flying Circus"</span> <a href="./ch03.html#ref-double-quotes"><img alt="[2]" class="callout" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></a>
<span class="pysrc-prompt">>>> </span>circus
<span class="pysrc-output">"Monty Python's Flying Circus"</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>circus = <span class="pysrc-string">'Monty Python\'s Flying Circus'</span> <a href="./ch03.html#ref-backslash-escape"><img alt="[3]" class="callout" src="Images/7c20d0adbadb35031a28bfcd6dff9900.jpg"/></a>
<span class="pysrc-prompt">>>> </span>circus
<span class="pysrc-output">"Monty Python's Flying Circus"</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>circus = <span class="pysrc-string">'Monty Python'</span>s Flying Circus' <a href="./ch03.html#ref-unescaped-quote"><img alt="[4]" class="callout" src="Images/0f4441cdaf35bfa4d58fc64142cf4736.jpg"/></a>
<span class="pysrc-output"> File "<stdin>", line 1</span>
<span class="pysrc-output"> circus = 'Monty Python's Flying Circus'</span>
<span class="pysrc-output"> ^</span>
<span class="pysrc-output">SyntaxError: invalid syntax</span></pre>
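<p>The quoting style affects only how the string is written in the program, not the string itself. The sketch below checks that the two valid spellings give the same value, and uses the built-in <tt class="doctest"><span class="pre">compile()</span></tt> function to trigger the syntax error safely; the filename <tt class="doctest"><span class="pre">"&lt;example&gt;"</span></tt> is just a label for the error message.</p>

```python
double_quoted = "Monty Python's Flying Circus"
escaped = 'Monty Python\'s Flying Circus'
same = (double_quoted == escaped)  # both spellings give the same string
print(same)
# True

# The unescaped version is not valid Python. compile() lets us trigger the
# error from inside a running program and catch it.
bad_source = "circus = 'Monty Python's Flying Circus'"
try:
    compile(bad_source, "<example>", "exec")
    caught = False
except SyntaxError:
    caught = True
print(caught)
# True
```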
<p><font id="151">Sometimes strings go over several lines.</font> <font id="152">Python provides us with various ways of entering them.</font> <font id="153">In the next example, a sequence of two strings is joined into a single string.</font> <font id="154">We need to use a backslash <a class="reference internal" href="./ch03.html#string-backslash"><span id="ref-string-backslash"><img class="callout" alt="[1]" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></span></a> or parentheses <a class="reference internal" href="./ch03.html#string-parentheses"><span id="ref-string-parentheses"><img class="callout" alt="[2]" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></span></a> so that the interpreter knows that the statement is not complete after the first line.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>couplet = <span class="pysrc-string">"Shall I compare thee to a Summer's day?"</span>\
<span class="pysrc-more">... </span> <span class="pysrc-string">"Thou are more lovely and more temperate:"</span> <a href="./ch03.html#ref-string-backslash"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></a>
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(couplet)
<span class="pysrc-output">Shall I compare thee to a Summer's day?Thou are more lovely and more temperate:</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>couplet = (<span class="pysrc-string">"Rough winds do shake the darling buds of May,"</span>
<span class="pysrc-more">... </span> <span class="pysrc-string">"And Summer's lease hath all too short a date:"</span>) <a href="./ch03.html#ref-string-parentheses"><img alt="[2]" class="callout" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></a>
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(couplet)
<span class="pysrc-output">Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:</span></pre>
<p><font id="155">Unfortunately, these methods do not give us a newline between the two lines of the sonnet. </font><font id="156">Instead, we can use a triple-quoted string as follows:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>couplet = <span class="pysrc-string">"""Shall I compare thee to a Summer's day?</span>
<span class="pysrc-more">... </span><span class="pysrc-string">Thou are more lovely and more temperate:"""</span>
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(couplet)
<span class="pysrc-output">Shall I compare thee to a Summer's day?</span>
<span class="pysrc-output">Thou are more lovely and more temperate:</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>couplet = <span class="pysrc-string">'''Rough winds do shake the darling buds of May,</span>
<span class="pysrc-more">... </span><span class="pysrc-string">And Summer's lease hath all too short a date:'''</span>
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(couplet)
<span class="pysrc-output">Rough winds do shake the darling buds of May,</span>
<span class="pysrc-output">And Summer's lease hath all too short a date:</span></pre>
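<p>The three ways of writing a multi-line string can be compared side by side. The following is a minimal sketch; the variable names <tt class="doctest"><span class="pre">with_backslash</span></tt>, <tt class="doctest"><span class="pre">with_parens</span></tt>, <tt class="doctest"><span class="pre">triple_quoted</span></tt> and <tt class="doctest"><span class="pre">explicit</span></tt> are illustrative and do not appear in the original text.</p>

```python
# Three ways to write a string that spans two source lines.
with_backslash = "Shall I compare thee to a Summer's day?"\
                 "Thou are more lovely and more temperate:"
with_parens = ("Shall I compare thee to a Summer's day?"
               "Thou are more lovely and more temperate:")
triple_quoted = """Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:"""

# Adjacent string literals are joined with nothing in between,
# so the backslash and parenthesis forms are the same string.
print(with_backslash == with_parens)

# Only the triple-quoted form contains a newline character.
print('\n' in with_parens, '\n' in triple_quoted)

# An explicit \n escape in an ordinary string gives the same
# result as the triple-quoted form.
explicit = "Shall I compare thee to a Summer's day?\nThou are more lovely and more temperate:"
print(explicit == triple_quoted)
```

<p>The choice between these forms is mostly a matter of whether the line break should be part of the data or only part of the source layout.</p>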
<p><font id="157">Now that we can define strings, we can try some simple operations on them. </font><font id="158">First let's look at the <tt class="doctest"><span class="pre">+</span></tt> operation, known as <span class="termdef">concatenation</span> <a class="reference internal" href="./ch03.html#string-concatenation"><span id="ref-string-concatenation"><img class="callout" alt="[1]" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></span></a>. </font><font id="159">It produces a new string that is a copy of the two original strings pasted together end-to-end. </font><font id="160">Notice that concatenation doesn't do anything clever like insert a space between the words. </font><font id="161">We can even multiply strings <a class="reference internal" href="./ch03.html#string-multiplication"><span id="ref-string-multiplication"><img class="callout" alt="[2]" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></span></a>:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'very'</span> + <span class="pysrc-string">'very'</span> + <span class="pysrc-string">'very'</span> <a href="./ch03.html#ref-string-concatenation"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></a>
<span class="pysrc-output">'veryveryvery'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'very'</span> * 3 <a href="./ch03.html#ref-string-multiplication"><img alt="[2]" class="callout" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></a>
<span class="pysrc-output">'veryveryvery'</span></pre>
<div class="note"><p class="first admonition-title"><font id="162">Note</font></p>
<p><font id="163"><strong>Your Turn:</strong> Try running the following code, then use your understanding of the string <tt class="doctest"><span class="pre">+</span></tt> and <tt class="doctest"><span class="pre">*</span></tt> operations to figure out how it works. </font><font id="164">Be careful to distinguish between the string <tt class="doctest"><span class="pre"><span class="pysrc-string">' '</span></span></tt>, which is a single whitespace character, and <tt class="doctest"><span class="pre"><span class="pysrc-string">''</span></span></tt>, which is the empty string.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
<span class="pysrc-prompt">>>> </span>b = [<span class="pysrc-string">' '</span> * 2 * (7 - i) + <span class="pysrc-string">'very'</span> * i <span class="pysrc-keyword">for</span> i <span class="pysrc-keyword">in</span> a]
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">for</span> line <span class="pysrc-keyword">in</span> b:
<span class="pysrc-more">... </span> <span class="pysrc-keyword">print</span>(line)</pre>
</div>
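<p><font id="165">We've seen that the addition and multiplication operations apply to strings, not just numbers. </font><font id="166">However, note that we cannot use subtraction or division with strings:</font></p>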
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'very'</span> - <span class="pysrc-string">'y'</span>
<span class="pysrc-except">Traceback (most recent call last):</span>
<span class="pysrc-except"> File "<stdin>", line 1, in <module></span>
<span class="pysrc-except">TypeError: unsupported operand type(s) for -: 'str' and 'str'</span>
<span class="pysrc-except"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'very'</span> / 2
<span class="pysrc-except">Traceback (most recent call last):</span>
<span class="pysrc-except"> File "<stdin>", line 1, in <module></span>
<span class="pysrc-except">TypeError: unsupported operand type(s) for /: 'str' and 'int'</span></pre>
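<p><font id="167">These error messages are another example of Python telling us that we have got our data types in a muddle. </font><font id="168">In the first case, we are told that the operation of subtraction (i.e., <tt class="doctest"><span class="pre">-</span></tt>) cannot apply to objects of type <tt class="doctest"><span class="pre">str</span></tt> (strings), while in the second, we are told that division cannot take <tt class="doctest"><span class="pre">str</span></tt> and <tt class="doctest"><span class="pre">int</span></tt> as its two operands.</font></p>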
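<p>Since strings do not support subtraction or division, the usual alternatives are string methods and slicing. The sketch below catches the error and then shows two possible substitutes; treating <tt class="doctest"><span class="pre">replace()</span></tt> as a stand-in for "subtraction" and slicing as a stand-in for "division" is our own illustration, not something the text prescribes.</p>

```python
word = 'very'

# Subtraction is not defined for strings, so Python raises TypeError.
try:
    word - 'y'
except TypeError as err:
    message = str(err)
print(message)

# One way to "remove" a substring: replace it with the empty string.
without_y = word.replace('y', '')
print(without_y)

# One way to "halve" a string: slice up to the midpoint.
first_half = word[:len(word) // 2]
print(first_half)
```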
</div>
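<div class="section" id="printing-strings"><h3 class="sigil_not_in_toc"><font id="169">Printing Strings</font></h3>
<p><font id="170">So far, when we have wanted to look at the contents of a variable or see the result of a calculation, we have just typed the variable name into the interpreter. </font><font id="171">We can also see the contents of a variable using the <tt class="doctest"><span class="pre"><span class="pysrc-keyword">print</span></span></tt> function:</font></p>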
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(monty)
<span class="pysrc-output">Monty Python</span></pre>
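<p><font id="172">Notice that there are no quotation marks this time. </font><font id="173">When we inspect a variable by typing its name in the interpreter, the interpreter prints the Python representation of its value. </font><font id="174">Since it's a string, the result is quoted. </font><font id="175">However, when we tell the interpreter to <tt class="doctest"><span class="pre"><span class="pysrc-keyword">print</span></span></tt> the contents of the variable, we don't see quotation characters, since there are none inside the string.</font></p>
<p><font id="176">The <tt class="doctest"><span class="pre"><span class="pysrc-keyword">print</span></span></tt> function can display multiple items on one line in various ways, as shown here:</font></p>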
<pre class="doctest"><span class="pysrc-prompt">>>> </span>grail = <span class="pysrc-string">'Holy Grail'</span>
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(monty + grail)
<span class="pysrc-output">Monty PythonHoly Grail</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(monty, grail)
<span class="pysrc-output">Monty Python Holy Grail</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(monty, <span class="pysrc-string">"and the"</span>, grail)
<span class="pysrc-output">Monty Python and the Holy Grail</span></pre>
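<p>The space that appears between items comes from the <tt class="doctest"><span class="pre">sep</span></tt> parameter of <tt class="doctest"><span class="pre">print()</span></tt>, whose default value is a single space. The following sketch varies <tt class="doctest"><span class="pre">sep</span></tt> and <tt class="doctest"><span class="pre">end</span></tt>; the variable <tt class="doctest"><span class="pre">joined</span></tt> is introduced here only for illustration.</p>

```python
monty = 'Monty Python'
grail = 'Holy Grail'

# sep controls what is printed between the items.
print(monty, grail, sep=' and the ')
print(monty, grail, sep='')

# end controls what is printed after the last item (default: newline).
print(monty, end=' ')
print(grail)

# print(monty, grail) shows the same text as joining with a space.
joined = ' '.join([monty, grail])
print(joined)
```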
</div>
<div class="section" id="accessing-individual-characters"><h3 class="sigil_not_in_toc"><font id="177">Accessing Individual Characters</font></h3>
<p><font id="178">As we saw in <a class="reference external" href="./ch01.html#sec-a-closer-look-at-python-texts-as-lists-of-words">2</a> for lists, strings are indexed, starting from zero. </font><font id="179">When we index a string, we get one of its characters (or letters). </font><font id="180">A single character is nothing special; it's just a string of length <tt class="doctest"><span class="pre">1</span></tt>.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>monty[0]
<span class="pysrc-output">'M'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>monty[3]
<span class="pysrc-output">'t'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>monty[5]
<span class="pysrc-output">' '</span></pre>
<p><font id="181">As with lists, if we try to access an index that is outside of the string, we get an error:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>monty[20]
<span class="pysrc-except">Traceback (most recent call last):</span>
<span class="pysrc-except"> File "<stdin>", line 1, in <module></span>
<span class="pysrc-except">IndexError: string index out of range</span></pre>
<p><font id="182">Again as with lists, we can use negative indexes for strings, where <tt class="doctest"><span class="pre">-1</span></tt> is the index of the last character <a class="reference internal" href="./ch03.html#last-character"><span id="ref-last-character"><img class="callout" alt="[1]" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></span></a>. </font><font id="183">Positive and negative indexes give us two ways to refer to any position in a string. </font><font id="184">In this case, where the string has a length of 12, indexes <tt class="doctest"><span class="pre">5</span></tt> and <tt class="doctest"><span class="pre">-7</span></tt> both refer to the same character (a space). </font><font id="185">(Notice that <tt class="doctest"><span class="pre">5 = len(monty) - 7</span></tt>.)</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>monty[-1] <a href="./ch03.html#ref-last-character"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></a>
<span class="pysrc-output">'n'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>monty[5]
<span class="pysrc-output">' '</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>monty[-7]
<span class="pysrc-output">' '</span></pre>
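<p>The relationship between positive and negative indexes can be checked for every position in the string. This is a small sketch; the names <tt class="doctest"><span class="pre">n</span></tt> and <tt class="doctest"><span class="pre">pairs</span></tt> are illustrative.</p>

```python
monty = 'Monty Python'
n = len(monty)

# For every valid positive index i, the negative index i - n
# refers to the same character.
for i in range(n):
    assert monty[i] == monty[i - n]

# A few (positive index, negative index, character) triples.
pairs = [(i, i - n, monty[i]) for i in (0, 5, n - 1)]
print(pairs)
```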
<p><font id="186">We can write <tt class="doctest"><span class="pre"><span class="pysrc-keyword">for</span></span></tt> loops to iterate over the characters in strings. </font><font id="187">The <tt class="doctest"><span class="pre"><span class="pysrc-keyword">print</span></span></tt> function takes an optional <tt class="doctest"><span class="pre">end=<span class="pysrc-string">' '</span></span></tt> parameter, which here tells Python to print a space instead of a newline after each item.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>sent = <span class="pysrc-string">'colorless green ideas sleep furiously'</span>
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">for</span> char <span class="pysrc-keyword">in</span> sent:
<span class="pysrc-more">... </span> <span class="pysrc-keyword">print</span>(char, end=<span class="pysrc-string">' '</span>)
<span class="pysrc-more">...</span>
<span class="pysrc-output">c o l o r l e s s g r e e n i d e a s s l e e p f u r i o u s l y</span></pre>
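<p><font id="188">We can count individual characters as well. </font><font id="189">We should ignore the case distinction by normalizing everything to lowercase, and filter out non-alphabetic characters:</font></p>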
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">import</span> nltk
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">from</span> nltk.corpus <span class="pysrc-keyword">import</span> gutenberg
<span class="pysrc-prompt">>>> </span>raw = gutenberg.raw(<span class="pysrc-string">'melville-moby_dick.txt'</span>)
<span class="pysrc-prompt">>>> </span>fdist = nltk.FreqDist(ch.lower() <span class="pysrc-keyword">for</span> ch <span class="pysrc-keyword">in</span> raw <span class="pysrc-keyword">if</span> ch.isalpha())
<span class="pysrc-prompt">>>> </span>fdist.most_common(5)
<span class="pysrc-output">[('e', 117092), ('t', 87996), ('a', 77916), ('o', 69326), ('n', 65617)]</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>[char <span class="pysrc-keyword">for</span> (char, count) <span class="pysrc-keyword">in</span> fdist.most_common()]
<span class="pysrc-output">['e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'l', 'd', 'u', 'm', 'c', 'w',</span>
<span class="pysrc-output">'f', 'g', 'p', 'b', 'y', 'v', 'k', 'q', 'j', 'x', 'z']</span></pre>
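<p>The same counting idea can be tried without downloading a corpus. The sketch below swaps in <tt class="doctest"><span class="pre">collections.Counter</span></tt> from the standard library in place of <tt class="doctest"><span class="pre">nltk.FreqDist</span></tt>, and uses a single sentence rather than <em>Moby Dick</em>, so the counts are not comparable with those above.</p>

```python
from collections import Counter

sent = 'Colorless green ideas sleep furiously'

# Lowercase each character and keep only alphabetic ones,
# exactly as in the FreqDist example.
counts = Counter(ch.lower() for ch in sent if ch.isalpha())

# The three most frequent letters with their counts.
print(counts.most_common(3))

# Letters only, ordered from most to least frequent.
print([ch for (ch, count) in counts.most_common()])
```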
<pre class="doctest"><span class="pysrc-prompt">>>> </span>monty[6:10]
<span class="pysrc-output">'Pyth'</span></pre>
<p><font id="201">Here we see the characters are <tt class="doctest"><span class="pre"><span class="pysrc-string">'P'</span></span></tt>, <tt class="doctest"><span class="pre"><span class="pysrc-string">'y'</span></span></tt>, <tt class="doctest"><span class="pre"><span class="pysrc-string">'t'</span></span></tt> and <tt class="doctest"><span class="pre"><span class="pysrc-string">'h'</span></span></tt>, which correspond to <tt class="doctest"><span class="pre">monty[6]</span></tt> ... <tt class="doctest"><span class="pre">monty[9]</span></tt> but not <tt class="doctest"><span class="pre">monty[10]</span></tt>. </font><font id="202">This is because a slice <span class="emphasis">starts</span> at the first index but finishes <span class="emphasis">one before</span> the end index.</font></p>
<p><font id="203">We can also slice with negative indexes. The same basic rule applies: start from the first index and stop one before the end index; here, that means stopping before the space character.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>monty[-12:-7]
<span class="pysrc-output">'Monty'</span></pre>
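<p><font id="204">As with list slices, if we omit the first value, the substring begins at the start of the string. </font><font id="205">If we omit the second value, the substring continues to the end of the string:</font></p>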
<pre class="doctest"><span class="pysrc-prompt">>>> </span>monty[:5]
<span class="pysrc-output">'Monty'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>monty[6:]
<span class="pysrc-output">'Python'</span></pre>
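<p>A few properties of slices follow directly from the "start at the first index, stop one before the end index" rule. The checks below are a sketch of those properties, not an exhaustive description of slicing.</p>

```python
monty = 'Monty Python'

# The length of a slice is end - start (when both are in range).
piece = monty[6:10]
print(piece, len(piece) == 10 - 6)

# Splitting at any index and re-joining gives back the original.
print(monty[:5] + monty[5:] == monty)

# Unlike indexing, slicing past the end does not raise an error;
# the slice simply stops at the end of the string.
tail = monty[6:100]
print(tail)
```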
<p><font id="206">We test if a string contains a particular substring using the <tt class="doctest"><span class="pre"><span class="pysrc-keyword">in</span></span></tt> operator, as follows:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>phrase = <span class="pysrc-string">'And now for something completely different'</span>
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">if</span> <span class="pysrc-string">'thing'</span> <span class="pysrc-keyword">in</span> phrase:
<span class="pysrc-more">... </span> <span class="pysrc-keyword">print</span>(<span class="pysrc-string">'found "thing"'</span>)
<span class="pysrc-output">found "thing"</span></pre>
<p><font id="207">We can also find the position of a substring within a string, using <tt class="doctest"><span class="pre">find()</span></tt>:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>monty.find(<span class="pysrc-string">'Python'</span>)
<span class="pysrc-output">6</span></pre>
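<p>It is worth knowing what <tt class="doctest"><span class="pre">find()</span></tt> does when the substring is absent. The sketch below contrasts it with the related <tt class="doctest"><span class="pre">index()</span></tt> method; the search term <tt class="doctest"><span class="pre">'Grail'</span></tt> is chosen only because it does not occur in <tt class="doctest"><span class="pre">monty</span></tt>.</p>

```python
monty = 'Monty Python'

# find() returns the position of the first match...
pos = monty.find('Python')
print(pos)

# ...and -1 (not an error) when there is no match.
missing = monty.find('Grail')
print(missing)

# index() behaves like find() but raises ValueError on failure.
try:
    monty.index('Grail')
    raised = False
except ValueError:
    raised = True
print(raised)

# Because -1 is a valid index, test the result before using it.
if missing == -1:
    print('not found')
```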
<div class="note"><p class="first admonition-title"><font id="208">Note</font></p>
<p class="last"><font id="209"><strong>Your Turn:</strong> Make up a sentence and assign it to a variable, e.g. </font><font id="210"><tt class="doctest"><span class="pre">sent = <span class="pysrc-string">'my sentence...'</span></span></tt>. </font><font id="211">Now write slice expressions to pull out individual words. </font><font id="212">(This is obviously not a convenient way to process the words of a text!)</font></p>
</div>
</div>
<div class="section" id="more-operations-on-strings"><h3 class="sigil_not_in_toc"><font id="213">More Operations on Strings</font></h3>
<p><font id="214">Python has comprehensive support for processing strings. </font><font id="215">A summary, including some operations we haven't seen yet, is shown in <a class="reference internal" href="./ch03.html#tab-string-methods">3.2</a>. </font><font id="216">For more information on strings, type <tt class="doctest"><span class="pre">help(str)</span></tt> at the Python prompt.</font></p>
<p class="caption"><font id="217"><span class="caption-label">Table 3.2</span>:</font></p>
<p><font id="218">Useful String Methods: operations on strings in addition to the string tests shown in <a class="reference external" href="./ch01.html#tab-word-tests">4.2</a>; all methods produce a new string or list</font></p>
<p></p>
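<p>A few of the methods listed by <tt class="doctest"><span class="pre">help(str)</span></tt> can be tried out as follows. This is a sketch using a sample string of our own; it is a selection of standard <tt class="doctest"><span class="pre">str</span></tt> methods, not a reproduction of the table.</p>

```python
raw = '  Monty Python and the Holy Grail  '

# strip() returns a copy without leading/trailing whitespace.
text = raw.strip()

# split() returns a list of words; join() combines a list into a string.
words = text.split()
hyphenated = '-'.join(words)
print(words)
print(hyphenated)

# Case conversions return new strings; the original is unchanged.
print(text.lower())
print(text.upper())
print('monty python'.title())

# replace() substitutes every occurrence of one substring by another.
print(text.replace('Holy', 'Unholy'))

# rfind() is like find() but searches from the right.
print(text.find('y'), text.rfind('y'))
```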
<pre class="doctest"><span class="pysrc-prompt">>>> </span>query = <span class="pysrc-string">'Who knows?'</span>
<span class="pysrc-prompt">>>> </span>beatles = [<span class="pysrc-string">'John'</span>, <span class="pysrc-string">'Paul'</span>, <span class="pysrc-string">'George'</span>, <span class="pysrc-string">'Ringo'</span>]
<span class="pysrc-prompt">>>> </span>query[2]
<span class="pysrc-output">'o'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>beatles[2]
<span class="pysrc-output">'George'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>query[:2]
<span class="pysrc-output">'Wh'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>beatles[:2]
<span class="pysrc-output">['John', 'Paul']</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>query + <span class="pysrc-string">" I don't"</span>
<span class="pysrc-output">"Who knows? I don't"</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>beatles + <span class="pysrc-string">'Brian'</span>
<span class="pysrc-except">Traceback (most recent call last):</span>
<span class="pysrc-except"> File "<stdin>", line 1, in <module></span>
<span class="pysrc-except">TypeError: can only concatenate list (not "str") to list</span>
<span class="pysrc-except"></span><span class="pysrc-prompt">>>> </span>beatles + [<span class="pysrc-string">'Brian'</span>]
<span class="pysrc-output">['John', 'Paul', 'George', 'Ringo', 'Brian']</span></pre>
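<p><font id="249">When we open a file for reading into a Python program, we get a string corresponding to the contents of the whole file. </font><font id="250">If we use a <tt class="doctest"><span class="pre"><span class="pysrc-keyword">for</span></span></tt> loop to process the elements of this string, all we can pick out are the individual characters; we don't get to choose the granularity. </font><font id="251">By contrast, the elements of a list can be as big or small as we like: for example, they could be paragraphs, sentences, phrases, words, or characters. </font><font id="252">So lists have the advantage that we can be flexible about the elements they contain, and correspondingly flexible about any downstream processing. </font><font id="253">Consequently, one of the first things we are likely to do in a piece of NLP code is tokenize a string into a list of strings (<a class="reference internal" href="./ch03.html#sec-tokenization">3.7</a>). </font><font id="254">Conversely, when we want to write our results to a file or to a terminal, we will usually format them as a string (<a class="reference internal" href="./ch03.html#sec-formatting">3.9</a>).</font></p>
<p><font id="255">Lists and strings do not have exactly the same functionality. </font><font id="256">Lists have the added power that you can change their elements:</font></p>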
<pre class="doctest"><span class="pysrc-prompt">>>> </span>beatles[0] = <span class="pysrc-string">"John Lennon"</span>
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">del</span> beatles[-1]
<span class="pysrc-prompt">>>> </span>beatles
<span class="pysrc-output">['John Lennon', 'Paul', 'George']</span></pre>
<p><font id="257">On the other hand, if we try to do that with a <em>string</em>, changing the 0th character in <tt class="doctest"><span class="pre">query</span></tt> to <tt class="doctest"><span class="pre"><span class="pysrc-string">'F'</span></span></tt>, we get:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>query[0] = <span class="pysrc-string">'F'</span>
<span class="pysrc-except">Traceback (most recent call last):</span>
<span class="pysrc-except"> File "<stdin>", line 1, in <module></span>
<span class="pysrc-except">TypeError: 'str' object does not support item assignment</span></pre>
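<p><font id="258">This is because strings are <span class="termdef">immutable</span>: you can't change a string once you have created it. </font><font id="259">However, lists are <span class="termdef">mutable</span>, and their contents can be modified at any time. </font><font id="260">As a result, lists support operations that modify the original value rather than producing a new value.</font></p>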
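<p>Because strings are immutable, "changing" a string in practice means building a new one. The sketch below shows two common ways to do that, and contrasts them with in-place modification of a list; the names <tt class="doctest"><span class="pre">fixed</span></tt>, <tt class="doctest"><span class="pre">chars</span></tt> and <tt class="doctest"><span class="pre">alias</span></tt> are illustrative.</p>

```python
query = 'Who knows?'

# Build a new string from a replacement character and a slice.
fixed = 'F' + query[1:]
print(fixed)

# Or convert to a (mutable) list of characters, edit, and re-join.
chars = list(query)
chars[0] = 'F'
print(''.join(chars))

# The original string is untouched in both cases.
print(query)

# A list, by contrast, is modified in place: every name that
# refers to the same list sees the change.
beatles = ['John', 'Paul', 'George', 'Ringo']
alias = beatles
alias.append('Brian')
print(beatles)
```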
<div class="note"><p class="first admonition-title"><font id="261">Note</font></p>
<p class="last"><font id="262"><strong>Your Turn:</strong> Consolidate your knowledge of strings by trying some of the exercises on strings at the end of this chapter.</font></p>
</div>
</div>
</div>
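<div class="section" id="text-processing-with-unicode"><h2 class="sigil_not_in_toc"><font id="263">3.3 Text Processing with Unicode</font></h2>
<p><font id="264">Our programs will often need to deal with different languages and different character sets. </font><font id="265">The concept of "plain text" is a fiction. </font><font id="266">If you live in the English-speaking world you probably use ASCII, possibly without realizing it. </font><font id="267">If you live in Europe you might use one of the extended Latin character sets, containing such characters as "ø" for Danish and Norwegian, "ő" for Hungarian, "ñ" for Spanish and Breton, and "ň" for Czech and Slovak. </font><font id="268">In this section, we will give an overview of how to use Unicode for processing texts that use non-ASCII character sets.</font></p>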
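<div class="section" id="what-is-unicode"><h3 class="sigil_not_in_toc"><font id="269">What is Unicode?</font></h3>
<p><font id="270">Unicode supports over a million characters. </font><font id="271">Each character is assigned a number, called a <span class="termdef">code point</span>. </font><font id="272">In Python, code points are written in the form <tt class="doctest"><span class="pre">\u</span></tt><em>XXXX</em>, where <em>XXXX</em> is the number in 4-digit hexadecimal form.</font></p>
<p><font id="273">Within a program, we can manipulate Unicode strings just like normal strings. </font><font id="274">However, when Unicode characters are stored in files or displayed on a terminal, they must be encoded as a stream of bytes. </font><font id="275">Some encodings (such as ASCII and Latin-2) use a single byte per code point, so they can only support a small subset of Unicode, enough for a single language. </font><font id="276">Other encodings (such as UTF-8) use multiple bytes and can represent the full range of Unicode characters.</font></p>
<p><font id="277">Text in files will be in a particular encoding, so we need some mechanism for translating it into Unicode; translation into Unicode is called <span class="termdef">decoding</span>. </font><font id="278">Conversely, to write out Unicode to a file or a terminal, we first need to translate it into a suitable encoding; this translation out of Unicode is called <span class="termdef">encoding</span>, and is illustrated in <a class="reference internal" href="./ch03.html#fig-unicode">3.3</a>.</font></p>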
<div class="figure" id="fig-unicode"><img alt="Images/unicode.png" src="Images/4a87f1dccc0e18aab5ec599d8d8358d6.jpg" style="width: 466.70000000000005px; height: 234.20000000000002px;"/><p class="caption"><font id="279"><span class="caption-label">Figure 3.3</span>: Unicode Decoding and Encoding</font></p>
</div>
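<p>The encoding and decoding steps can be tried directly on a string in memory, using the <tt class="doctest"><span class="pre">encode()</span></tt> method of <tt class="doctest"><span class="pre">str</span></tt> and the <tt class="doctest"><span class="pre">decode()</span></tt> method of <tt class="doctest"><span class="pre">bytes</span></tt>. The following is a sketch with a sample Polish word; the variable names are illustrative.</p>

```python
text = 'Państwowa'

# Encoding: Unicode string -> bytes in a particular encoding.
latin2_bytes = text.encode('latin2')
utf8_bytes = text.encode('utf8')

# Latin-2 uses one byte per character; UTF-8 needs two bytes for 'ń'.
print(len(text), len(latin2_bytes), len(utf8_bytes))

# Decoding with the same encoding recovers the original string.
print(latin2_bytes.decode('latin2') == text)
print(utf8_bytes.decode('utf8') == text)

# Decoding with the wrong single-byte encoding silently
# produces different text.
wrong = latin2_bytes.decode('latin1')
print(wrong == text)

# ASCII cannot represent 'ń' at all, so encoding fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

<p>The wrong-encoding case is the one to watch for in practice, since it raises no error.</p>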
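<p><font id="280">From a Unicode perspective, characters are abstract entities that can be realized as one or more <span class="termdef">glyphs</span>. </font><font id="281">Only glyphs can appear on a screen or be printed on paper. </font><font id="282">A font is a mapping from characters to glyphs.</font></p>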
</div>
<div class="section" id="extracting-encoded-text-from-files"><h3 class="sigil_not_in_toc"><font id="283">Extracting Encoded Text from Files</font></h3>
<p><font id="284">Let's assume that we have a small text file, and that we know how it is encoded. </font><font id="285">For example, <tt class="doctest"><span class="pre">polish-lat2.txt</span></tt>, as the name suggests, is a snippet of Polish text (from the Polish Wikipedia; see <tt class="doctest"><span class="pre">http://pl.wikipedia.org/wiki/Biblioteka_Pruska</span></tt>). </font><font id="286">This file is encoded as Latin-2, also known as ISO-8859-2. </font><font id="287">The function <tt class="doctest"><span class="pre">nltk.data.find()</span></tt> locates the file for us.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>path = nltk.data.find(<span class="pysrc-string">'corpora/unicode_samples/polish-lat2.txt'</span>)</pre>
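<p><font id="288">The Python <tt class="doctest"><span class="pre">open()</span></tt> function can read encoded data into Unicode strings, and write out Unicode strings in encoded form. </font><font id="289">It takes a parameter to specify the encoding of the file being read or written. </font><font id="290">So let's open our Polish file with the encoding <tt class="doctest"><span class="pre"><span class="pysrc-string">'latin2'</span></span></tt> and inspect the contents of the file:</font></p>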
<pre class="doctest"><span class="pysrc-prompt">>>> </span>f = open(path, encoding=<span class="pysrc-string">'latin2'</span>)
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">for</span> line <span class="pysrc-keyword">in</span> f:
<span class="pysrc-more">... </span> line = line.strip()
<span class="pysrc-more">... </span> <span class="pysrc-keyword">print</span>(line)
<span class="pysrc-output">Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą</span>
<span class="pysrc-output">"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez</span>
<span class="pysrc-output">Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały</span>
<span class="pysrc-output">odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki</span>
<span class="pysrc-output">Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych</span>
<span class="pysrc-output">archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.</span></pre>
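<p>The same read pattern can be tried without the NLTK data package. The sketch below is an assumption-laden stand-in: it writes a one-line Latin-2 file of our own (named <tt class="doctest"><span class="pre">polish-sample.txt</span></tt>, a hypothetical name) into a temporary directory and reads it back, once as text and once as raw bytes.</p>

```python
import os
import tempfile

sample = 'Pruska Biblioteka Państwowa\n'

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; any path would do.
    path = os.path.join(tmp, 'polish-sample.txt')

    # Write the Unicode string out in Latin-2 encoded form.
    with open(path, 'w', encoding='latin2') as f:
        f.write(sample)

    # Read it back, decoding from Latin-2 into Unicode strings.
    with open(path, encoding='latin2') as f:
        lines = [line.strip() for line in f]

    # Read the undecoded bytes to see what is actually on disk.
    with open(path, 'rb') as f:
        raw_bytes = f.read()

print(lines)
print(raw_bytes)
```

<p>Using <tt class="doctest"><span class="pre">with</span></tt> ensures each file is closed after use, which the interactive examples in this section leave implicit.</p>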
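<p><font id="291">If this does not display correctly on your terminal, or if we want to see the underlying numerical values (or "code points") of the characters, then we can convert all non-ASCII characters into their two-digit <tt class="doctest"><span class="pre">\x</span></tt><em>XX</em> and four-digit <tt class="doctest"><span class="pre">\u</span></tt><em>XXXX</em> representations:</font></p>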
<pre class="doctest"><span class="pysrc-prompt">>>> </span>f = open(path, encoding=<span class="pysrc-string">'latin2'</span>)
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">for</span> line <span class="pysrc-keyword">in</span> f:
<span class="pysrc-more">... </span> line = line.strip()
<span class="pysrc-more">... </span> <span class="pysrc-keyword">print</span>(line.encode(<span class="pysrc-string">'unicode_escape'</span>))
<span class="pysrc-output">b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'</span>
<span class="pysrc-output">b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'</span>
<span class="pysrc-output">b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'</span>
<span class="pysrc-output">b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'</span>
<span class="pysrc-output">b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'</span>
<span class="pysrc-output">b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'</span></pre>
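<p><font id="292">The first line of the output above contains a Unicode escape sequence introduced by <tt class="doctest"><span class="pre">\u</span></tt>, namely <tt class="doctest"><span class="pre">\u0144</span></tt>. </font><font id="293">The relevant Unicode character will be displayed on the screen as the glyph ń. </font><font id="294">In the third line of the preceding example, we see <tt class="doctest"><span class="pre">\xf3</span></tt>, which corresponds to the glyph ó, and is within the 128-255 range.</font></p>
<p><font id="295">In Python 3, source code is encoded using UTF-8 by default, and you can include Unicode characters in strings if you are using IDLE or another program editor that supports Unicode. </font><font id="296">Arbitrary Unicode characters can be included using the <tt class="doctest"><span class="pre">\u</span></tt><em>XXXX</em> escape sequence. </font><font id="297">We find the integer ordinal of a character using <tt class="doctest"><span class="pre">ord()</span></tt>. </font><font id="298">For example:</font></p>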
<pre class="doctest"><span class="pysrc-prompt">>>> </span>ord(<span class="pysrc-string">'ń'</span>)
<span class="pysrc-output">324</span></pre>
<p><font id="299">The hexadecimal 4-digit notation for 324 is 0144 (type <tt class="doctest"><span class="pre">hex(324)</span></tt> to discover this), and we can define a string with the appropriate escape sequence.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>nacute = <span class="pysrc-string">'\u0144'</span>
<span class="pysrc-prompt">>>> </span>nacute
<span class="pysrc-output">'ń'</span></pre>
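<p>The conversions between a character, its integer code point, and its hexadecimal form can be chained together. This sketch uses only the built-in functions <tt class="doctest"><span class="pre">ord()</span></tt>, <tt class="doctest"><span class="pre">chr()</span></tt>, <tt class="doctest"><span class="pre">hex()</span></tt> and <tt class="doctest"><span class="pre">format()</span></tt>; the variable names are illustrative.</p>

```python
# Character -> integer code point.
code = ord('ń')
print(code)

# Integer -> hexadecimal, with and without zero padding.
print(hex(code))
padded = format(code, '04x')
print(padded)

# Integer code point -> character (the inverse of ord).
char = chr(0x0144)
print(char)

# The escape sequence, chr(), and the literal all denote one string.
print(char == '\u0144' == 'ń')
```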
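<div class="note"><p class="first admonition-title"><font id="300">Note</font></p>
<p class="last"><font id="301">There are many factors determining what glyphs are rendered on your screen. </font><font id="302">If you are sure that you have the correct encoding, but your Python code is still failing to produce the glyphs you expected, you should also check that you have the necessary fonts installed on your system. </font><font id="303">It may also be necessary to configure your locale to render UTF-8 encoded characters before ń displays correctly in your terminal. Note that in Python 3, <tt class="doctest"><span class="pre"><span class="pysrc-keyword">print</span>(nacute.encode(<span class="pysrc-string">'utf8'</span>))</span></tt> shows the bytes rather than the glyph.</font></p>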
</div>
<p><font id="304">We can also see how this character is represented as a sequence of bytes inside a text file:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>nacute.encode(<span class="pysrc-string">'utf8'</span>)
<span class="pysrc-output">b'\xc5\x84'</span></pre>
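<p><font id="305">The module <tt class="doctest"><span class="pre">unicodedata</span></tt> lets us inspect the properties of Unicode characters. </font><font id="306">In the following example, we select all characters in the third line of our Polish text outside the ASCII range and print their UTF-8 byte sequence, followed by their code point integer using the standard Unicode convention (i.e., prefixing the hex digits with <tt class="doctest"><span class="pre">U+</span></tt>), followed by their Unicode name.</font></p>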
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">import</span> unicodedata
<span class="pysrc-prompt">>>> </span>lines = open(path, encoding=<span class="pysrc-string">'latin2'</span>).readlines()
<span class="pysrc-prompt">>>> </span>line = lines[2]
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(line.encode(<span class="pysrc-string">'unicode_escape'</span>))
<span class="pysrc-output">b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">for</span> c <span class="pysrc-keyword">in</span> line: <a href="./ch03.html#ref-unicode-info"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></a>
<span class="pysrc-more">... </span> <span class="pysrc-keyword">if</span> ord(c) > 127:
<span class="pysrc-more">... </span> <span class="pysrc-keyword">print</span>(<span class="pysrc-string">'{} U+{:04x} {}'</span>.format(c.encode(<span class="pysrc-string">'utf8'</span>), ord(c), unicodedata.name(c)))
<span class="pysrc-output">b'\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE</span>
<span class="pysrc-output">b'\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE</span>
<span class="pysrc-output">b'\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE</span>
<span class="pysrc-output">b'\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK</span>
<span class="pysrc-output">b'\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE</span></pre>
<p><font id="307">If you replace <tt class="doctest"><span class="pre">c.encode(<span class="pysrc-string">'utf8'</span>)</span></tt> in <a class="reference internal" href="./ch03.html#unicode-info"><span id="ref-unicode-info"><img class="callout" alt="[1]" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></span></a> with <tt class="doctest"><span class="pre">c</span></tt>, and if your system supports UTF-8, you should see output like the following:</font></p>
<div class="line-block"><div class="line"><font id="308">ó U+00f3 LATIN SMALL LETTER O WITH ACUTE</font></div>
<div class="line"><font id="309">ś U+015b LATIN SMALL LETTER S WITH ACUTE</font></div>
<div class="line"><font id="310">Ś U+015a LATIN CAPITAL LETTER S WITH ACUTE</font></div>
<div class="line"><font id="311">ą U+0105 LATIN SMALL LETTER A WITH OGONEK</font></div>
<div class="line"><font id="312">ł U+0142 LATIN SMALL LETTER L WITH STROKE</font></div>
</div>
<p><font id="313">Alternatively, you may need to replace the encoding <tt class="doctest"><span class="pre"><span class="pysrc-string">'utf8'</span></span></tt> in the example by <tt class="doctest"><span class="pre"><span class="pysrc-string">'latin2'</span></span></tt>, again depending on the details of your system.</font></p>
<p><font id="314">The next examples illustrate how Python string methods and the <tt class="doctest"><span class="pre">re</span></tt> module can work with Unicode characters. </font><font id="315">(We will take a close look at the <tt class="doctest"><span class="pre">re</span></tt> module in the following section. </font><font id="316">The <tt class="doctest"><span class="pre">\w</span></tt> matches a "word character"; see <a class="reference internal" href="./ch03.html#tab-re-symbols">3.4</a>.)</font></p>
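<p>The same inspection can be run on a string held in memory, without opening the corpus file. The sketch below uses the word <tt class="doctest"><span class="pre">'Śląsk'</span></tt> from the Polish sample; the list name <tt class="doctest"><span class="pre">report</span></tt> is illustrative.</p>

```python
import unicodedata

word = 'Śląsk'

# For each non-ASCII character, collect the character itself,
# its code point in U+XXXX notation, and its Unicode name.
report = [(c, 'U+{:04x}'.format(ord(c)), unicodedata.name(c))
          for c in word if ord(c) > 127]

for c, code_point, name in report:
    print(c, code_point, name)
```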
<pre class="doctest"><span class="pysrc-prompt">>>> </span>line.find(<span class="pysrc-string">'zosta\u0142y'</span>)
<span class="pysrc-output">54</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>line = line.lower()
<span class="pysrc-prompt">>>> </span>line
<span class="pysrc-output">'niemców pod koniec ii wojny światowej na dolny śląsk, zostały\n'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>line.encode(<span class="pysrc-string">'unicode_escape'</span>)
<span class="pysrc-output">b'niemc\\xf3w pod koniec ii wojny \\u015bwiatowej na dolny \\u015bl\\u0105sk, zosta\\u0142y\\n'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">import</span> re
<span class="pysrc-prompt">>>> </span>m = re.search(<span class="pysrc-string">r'\u015b\w*'</span>, line)
<span class="pysrc-prompt">>>> </span>m.group()
<span class="pysrc-output">'światowej'</span></pre>
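<p>The interaction between <tt class="doctest"><span class="pre">re</span></tt> and Unicode can be tried on a literal copy of the lowercased line, without reading the corpus file. In this sketch the pattern is written as a raw string, so the <tt class="doctest"><span class="pre">\u015b</span></tt> escape is interpreted by the <tt class="doctest"><span class="pre">re</span></tt> module rather than by the Python string parser; the name <tt class="doctest"><span class="pre">words</span></tt> is illustrative.</p>

```python
import re

line = 'niemców pod koniec ii wojny światowej na dolny śląsk, zostały\n'

# \w matches Unicode word characters by default for str patterns,
# so the match runs to the end of the word.
m = re.search(r'\u015b\w*', line)
print(m.group())

# The same holds for findall: accented letters stay inside words.
words = re.findall(r'\w+', line)
print(words)

# String methods work on code points, not bytes.
print(line.find('zosta\u0142y'))
```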
<p><font id="317">NLTK tokenizers allow Unicode strings as input, and correspondingly yield Unicode strings as output.</font></p>
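<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">from</span> nltk <span class="pysrc-keyword">import</span> word_tokenize
<span class="pysrc-prompt">>>> </span>word_tokenize(line)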
<span class="pysrc-output">['niemców', 'pod', 'koniec', 'ii', 'wojny', 'światowej', 'na', 'dolny', 'śląsk', ',', 'zostały']</span></pre>
</div>
<div class="section" id="using-your-local-encoding-in-python"><h3 class="sigil_not_in_toc"><font id="318">Using your local encoding in Python</font></h3>
<p><font id="319">If you are used to working with characters in a particular local encoding, you probably want to be able to use your standard methods for inputting and editing strings in a Python file. </font><font id="320">In order to do this, you need to include the string <tt class="doctest"><span class="pre"><span class="pysrc-string">'# -*- coding: <coding> -*-'</span></span></tt> as the first or second line of your file. </font><font id="321">Note that <em><coding></em> has to be a string like <tt class="doctest"><span class="pre"><span class="pysrc-string">'latin-1'</span></span></tt>, <tt class="doctest"><span class="pre"><span class="pysrc-string">'big5'</span></span></tt> or <tt class="doctest"><span class="pre"><span class="pysrc-string">'utf-8'</span></span></tt> (see <a class="reference internal" href="./ch03.html#fig-polish-utf8">3.4</a>).</font></p>
<div class="figure" id="fig-polish-utf8"><img alt="Images/polish-utf8.png" src="Images/5eeb4cf55b6d18d4bcb098fc72ddc6d7.jpg" style="width: 605.0px; height: 326.0px;"/><p class="caption"><font id="322"><span class="caption-label">Figure 3.4</span>: Unicode and IDLE: UTF-8 encoded string literals in the IDLE editor; this requires that an appropriate font is set in IDLE's preferences; here we have chosen Courier CE.</font></p>
</div>
<p><font id="323">The above example also illustrates how regular expressions can use encoded strings.</font></p>
</div>
</div>
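<div class="section" id="regular-expressions-for-detecting-word-patterns"><h2 class="sigil_not_in_toc"><font id="324">3.4 Regular Expressions for Detecting Word Patterns</font></h2>
<p><font id="325">Many linguistic processing tasks involve pattern matching. </font><font id="326">For example, we can find words ending with <span class="example">ed</span> using <tt class="doctest"><span class="pre">endswith(<span class="pysrc-string">'ed'</span>)</span></tt>. </font><font id="327">We saw a variety of such "word tests" in <a class="reference external" href="./ch01.html#tab-word-tests">4.2</a>. </font><font id="328">Regular expressions give us a more powerful and flexible method for describing the character patterns we are interested in.</font></p>
<div class="note"><p class="first admonition-title"><font id="329">Note</font></p>
<p class="last"><font id="330">There are many other published introductions to regular expressions, organized around the syntax of regular expressions and applied to searching text files. </font><font id="331">Instead of doing this again, we focus on the use of regular expressions at different stages of linguistic processing. </font><font id="332">As usual, we'll adopt a problem-based approach and present new features only as they are needed to solve practical problems. </font><font id="333">In our discussion we will mark regular expressions using chevrons like this: «<tt class="doctest"><span class="pre">patt</span></tt>».</font></p>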
</div>
<p><font id="334">To use regular expressions in Python we need to import the <tt class="doctest"><span class="pre">re</span></tt> library using <tt class="doctest"><span class="pre"><span class="pysrc-keyword">import</span> re</span></tt>. </font><font id="335">We also need a list of words to search; we'll use the Words Corpus again (<a class="reference external" href="./ch02.html#sec-lexical-resources">4</a>). </font><font id="336">We will preprocess it to remove any proper names.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">import</span> re
<span class="pysrc-prompt">>>> </span>wordlist = [w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> nltk.corpus.words.words(<span class="pysrc-string">'en'</span>) <span class="pysrc-keyword">if</span> w.islower()]</pre>
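<div class="section" id="using-basic-meta-characters"><h3 class="sigil_not_in_toc"><font id="337">Using Basic Meta-Characters</font></h3>
<p><font id="338">Let's find words ending with <span class="example">ed</span> using the regular expression «<tt class="doctest"><span class="pre">ed$</span></tt>».</font> <font id="339">We will use the <tt class="doctest"><span class="pre">re.search(p, s)</span></tt> function to check whether the pattern <tt class="doctest"><span class="pre">p</span></tt> can be found somewhere inside the string <tt class="doctest"><span class="pre">s</span></tt>. We need to specify the characters of interest and then use the dollar sign, which has a special meaning in regular expressions: it matches the end of the string, which here is the end of the word:</font></p>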
<pre class="doctest"><span class="pysrc-prompt">>>> </span>[w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> wordlist <span class="pysrc-keyword">if</span> re.search(<span class="pysrc-string">'ed$'</span>, w)]
<span class="pysrc-output">['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', ...]</span></pre>
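<p>To see what the dollar sign contributes, the following sketch runs the same search on a tiny word list. The list is a hypothetical stand-in for the much larger NLTK word list, so the example runs without any corpus installed.</p>

```python
import re

# Hypothetical stand-in for the NLTK word list.
wordlist = ['abandoned', 'abed', 'editor', 'bedroom', 'walked', 'red', 'education']

# re.search() scans the whole string, so 'ed$' only succeeds when
# the two letters sit immediately before the end of the string.
with_regex = [w for w in wordlist if re.search(r'ed$', w)]

# The equivalent string-method test.
with_method = [w for w in wordlist if w.endswith('ed')]

# Without the dollar sign, 'ed' may match anywhere in the word.
anywhere = [w for w in wordlist if re.search(r'ed', w)]

print(with_regex)
print(anywhere)
```

<p>The anchored pattern agrees with <tt class="doctest"><span class="pre">endswith(<span class="pysrc-string">'ed'</span>)</span></tt>, while the unanchored one also accepts words such as <span class="example">editor</span> and <span class="example">bedroom</span>.</p>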
<p><font id="340">The <tt class="doctest"><span class="pre">.</span></tt></font> <font id="341"><span class="termdef">wildcard</span> symbol matches any single character.</font> <font id="342">Suppose we have room in a crossword puzzle for an 8-letter word, with <span class="example">j</span> as its third letter and <span class="example">t</span> as its sixth letter.</font> <font id="343">In place of each blank cell we use a period:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>[w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> wordlist <span class="pysrc-keyword">if</span> re.search(<span class="pysrc-string">'^..j..t..$'</span>, w)]
<span class="pysrc-output">['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', ...]</span></pre>
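<p>The crossword pattern can also be generated from a template rather than typed by hand. The helper below is a hypothetical convenience function (its name, the underscore convention for blanks, and the short word list are all assumptions made for this sketch).</p>

```python
import re

def crossword_pattern(template, blank='_'):
    """Turn a template such as '__j__t__' into an anchored regular expression."""
    # Each blank becomes the wildcard '.', every known letter is kept literally.
    body = ''.join('.' if ch == blank else re.escape(ch) for ch in template)
    return '^' + body + '$'

pattern = crossword_pattern('__j__t__')

# Hypothetical candidates; only 8-letter words with j and t in the right cells fit.
candidates = ['abjectly', 'majestic', 'adjustment', 'object', 'injector']
fits = [w for w in candidates if re.search(pattern, w)]

print(pattern)
print(fits)
```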
<div class="note"><p class="first admonition-title"><font id="344">Note</font></p>
<p class="last"><font id="345"><strong>Your Turn:</strong> The caret symbol <tt class="doctest"><span class="pre">^</span></tt> matches the start of a string, just as <tt class="doctest"><span class="pre">$</span></tt> matches the end.</font> <font id="346">What results do we get with the above example if we leave out both of these and search for «<tt class="doctest"><span class="pre">..j..t..</span></tt>»?</font></p>
</div>
<p><font id="347">Finally, the <tt class="doctest"><span class="pre">?</span></tt></font> <font id="348">symbol specifies that the previous character is optional.</font> <font id="349">Thus «<tt class="doctest"><span class="pre">^e-?mail$</span></tt>» will match both <span class="example">email</span> and <span class="example">e-mail</span>.</font> <font id="350">We could count the total number of occurrences of this word (in either spelling) in a text using <tt class="doctest"><span class="pre">sum(1 <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> text <span class="pysrc-keyword">if</span> re.search(<span class="pysrc-string">'^e-?mail$'</span>, w))</span></tt>.</font></p>
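<p>The counting expression can be tried on a short token list. The tokens below are invented for this sketch; in practice <tt class="doctest"><span class="pre">text</span></tt> would be any list of word tokens.</p>

```python
import re

# Invented token list standing in for a real text.
text = ['Send', 'me', 'an', 'email', 'or', 'e-mail', ',', 'not', 'emails', 'or', 'e--mail']

# '-?' allows zero or one hyphen; '^' and '$' force the whole token to match.
count = sum(1 for w in text if re.search(r'^e-?mail$', w))
print(count)
```

<p>Only <span class="example">email</span> and <span class="example">e-mail</span> are counted: <span class="example">emails</span> fails because of the final anchor, and the token with two hyphens fails because <tt class="doctest"><span class="pre">?</span></tt> permits at most one.</p>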
</div>
<div class="section" id="ranges-and-closures"><h3 class="sigil_not_in_toc"><font id="351">Ranges and Closures</font></h3>
<div class="figure" id="fig-t9"><img alt="Images/T9.png" src="Images/cf5ffc116dbddb4a34c65925b0d558cb.jpg" style="width: 297.6px; height: 131.0px;"/><p class="caption"><font id="352"><span class="caption-label">Figure 3.5</span>: T9: Text on 9 Keys</font></p>
</div>
<p><font id="353">The <span class="termdef">T9</span> system is used for entering text on mobile phones (see <a class="reference internal" href="./ch03.html#fig-t9">3.5</a>).</font> <font id="354">Two or more words that are entered with the same sequence of keystrokes are known as <span class="termdef">textonyms</span>.</font> <font id="355">For example, both <span class="example">hole</span> and <span class="example">golf</span> are entered by pressing the sequence 4653.</font> <font id="356">What other words could be produced with the same sequence?</font> <font id="357">Here we use the regular expression «<tt class="doctest"><span class="pre">^[ghi][mno][jlk][def]$</span></tt>»:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>[w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> wordlist <span class="pysrc-keyword">if</span> re.search(<span class="pysrc-string">'^[ghi][mno][jlk][def]$'</span>, w)]
<span class="pysrc-output">['gold', 'golf', 'hold', 'hole']</span></pre>
<p><font id="358">The first part of the expression, «<tt class="doctest"><span class="pre">^[ghi]</span></tt>», matches the start of a word followed by <span class="example">g</span>, <span class="example">h</span>, or <span class="example">i</span>.</font> <font id="359">The next part of the expression, «<tt class="doctest"><span class="pre">[mno]</span></tt>», constrains the second character to be <span class="example">m</span>, <span class="example">n</span>, or <span class="example">o</span>.</font> <font id="360">The third and fourth characters are constrained in the same way.</font> <font id="361">Only four words satisfy all these constraints.</font> <font id="362">Note that the order of characters inside the square brackets is not significant, so we could have written «<tt class="doctest"><span class="pre">^[hig][nom][ljk][fed]$</span></tt>» and matched the same words.</font></p>
<div class="note"><p class="first admonition-title"><font id="363">Note</font></p>
<p class="last"><font id="364"><strong>Your Turn:</strong> Look for some "finger-twisters" by searching for words that use only part of the number pad.</font> <font id="365">For example, «<tt class="doctest"><span class="pre">^[ghijklmno]+$</span></tt>», or more concisely «<tt class="doctest"><span class="pre">^[g-o]+$</span></tt>», will match words that only use keys 4, 5, 6 in the center row, and «<tt class="doctest"><span class="pre">^[a-fj-o]+$</span></tt>» will match words that use keys 2, 3, 5, 6 in the top-right corner.</font> <font id="366">What do <tt class="doctest"><span class="pre">-</span></tt> and <tt class="doctest"><span class="pre">+</span></tt> mean?</font></p>
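<p>Because each digit maps to a fixed set of letters, a textonym pattern can be assembled mechanically from any key sequence. The sketch below assumes the standard phone keypad layout; the function name and the small word list are hypothetical.</p>

```python
import re

# Letters printed on each key of a standard phone keypad.
T9_KEYS = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
           '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}

def t9_pattern(digits):
    """Build an anchored pattern with one bracketed letter set per keypress."""
    return '^' + ''.join('[' + T9_KEYS[d] + ']' for d in digits) + '$'

pattern = t9_pattern('4653')

# Hypothetical word list; a real search would use a full lexicon.
wordlist = ['gold', 'golf', 'hold', 'hole', 'home', 'good', 'holes', 'in']
textonyms = [w for w in wordlist if re.search(pattern, w)]

print(pattern)
print(textonyms)
```

<p>The generated pattern lists the third set as <tt class="doctest"><span class="pre">[jkl]</span></tt> rather than <tt class="doctest"><span class="pre">[jlk]</span></tt>, which makes no difference because order inside square brackets is not significant.</p>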
</div>
<p><font id="367">Let's explore the <tt class="doctest"><span class="pre">+</span></tt> symbol a bit further.</font> <font id="368">Notice that it can be applied to individual letters or to bracketed sets of letters:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>chat_words = sorted(set(w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> nltk.corpus.nps_chat.words()))
<span class="pysrc-prompt">>>> </span>[w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> chat_words <span class="pysrc-keyword">if</span> re.search(<span class="pysrc-string">'^m+i+n+e+$'</span>, w)]
<span class="pysrc-output">['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine',</span>
<span class="pysrc-output">'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>[w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> chat_words <span class="pysrc-keyword">if</span> re.search(<span class="pysrc-string">'^[ha]+$'</span>, w)]
<span class="pysrc-output">['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh',</span>
<span class="pysrc-output">'ahhahahaha', 'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'h', 'ha', 'haaa',</span>
<span class="pysrc-output">'hah', 'haha', 'hahaaa', 'hahah', 'hahaha', 'hahahaa', 'hahahah', 'hahahaha', ...]</span></pre>
<p><font id="369">It should be clear that <tt class="doctest"><span class="pre">+</span></tt> simply means "one or more instances of the preceding item", which could be an individual character like <tt class="doctest"><span class="pre">m</span></tt>, a set like <tt class="doctest"><span class="pre">[fed]</span></tt>, or a range like <tt class="doctest"><span class="pre">[d-f]</span></tt>.</font> <font id="370">Now let's replace <tt class="doctest"><span class="pre">+</span></tt> with <tt class="doctest"><span class="pre">*</span></tt>, which means "zero or more instances of the preceding item".</font> <font id="371">The regular expression «<tt class="doctest"><span class="pre">^m*i*n*e*$</span></tt>» will match everything that we found using «<tt class="doctest"><span class="pre">^m+i+n+e+$</span></tt>», but also words where some of the letters don't appear at all, e.g.</font> <font id="372"><span class="example">me</span>, <span class="example">min</span>, and <span class="example">mmmmm</span>.</font> <font id="373">Note that the <tt class="doctest"><span class="pre">+</span></tt> and <tt class="doctest"><span class="pre">*</span></tt> symbols are sometimes referred to as <span class="termdef">Kleene closures</span>, or simply <span class="termdef">closures</span>.</font></p>
<p><font id="374">The <tt class="doctest"><span class="pre">^</span></tt> operator has another function when it appears as the first character inside square brackets.</font> <font id="375">For example, «<tt class="doctest"><span class="pre">[^aeiouAEIOU]</span></tt>» matches any character other than a vowel.</font> <font id="376">We can search the NPS Chat Corpus for words that are made up entirely of non-vowel characters using «<tt class="doctest"><span class="pre">^[^aeiouAEIOU]+$</span></tt>» to find items like these: <tt class="doctest"><span class="pre">:):):)</span></tt>, <tt class="doctest"><span class="pre">grrr</span></tt>, <tt class="doctest"><span class="pre">cyb3r</span></tt> and <tt class="doctest"><span class="pre">zzzzzzzz</span></tt>.</font> <font id="377">Notice that this includes non-alphabetic characters.</font></p>
<p><font id="378">Here are some more examples of regular expressions being used to find tokens that match a particular pattern, illustrating the use of some new symbols: <tt class="doctest"><span class="pre">\</span></tt>, <tt class="doctest"><span class="pre">{}</span></tt>, <tt class="doctest"><span class="pre">()</span></tt>, and <tt class="doctest"><span class="pre">|</span></tt>.</font></p>
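<p>The difference between the two closures is easiest to see side by side. The word list in this sketch is invented for illustration and deliberately includes an empty string.</p>

```python
import re

# Invented test words, including the empty string.
words = ['mine', 'miiinnee', 'me', 'min', 'mmmmm', 'mien', 'nine', '']

# '+' demands at least one of each letter, in the order m, i, n, e.
plus = [w for w in words if re.search(r'^m+i+n+e+$', w)]

# '*' lets any of the four letters be absent, but the order is still fixed.
star = [w for w in words if re.search(r'^m*i*n*e*$', w)]

print(plus)
print(star)
```

<p>Two points are worth noting: the starred pattern still rejects <span class="example">mien</span> and <span class="example">nine</span> because the letters are out of order, and it accepts the empty string, since every item may occur zero times.</p>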
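<p>A short sketch makes the point about non-alphabetic characters concrete. The tokens are invented, and the second pattern, which spells out the consonants as ranges, is an assumption added here to show one way of restricting the match to lowercase letters only.</p>

```python
import re

# Invented tokens in the style of chat text.
tokens = [':):):)', 'grrr', 'cyb3r', 'zzzzzzzz', 'hello', 'Ugh', 'rhythm', '42']

# Negated set: any character at all, as long as it is not a vowel.
no_vowels = [t for t in tokens if re.search(r'^[^aeiouAEIOU]+$', t)]

# Positive set: lowercase consonant letters only (ranges skip a, e, i, o, u).
consonants_only = [t for t in tokens if re.search(r'^[b-df-hj-np-tv-z]+$', t)]

print(no_vowels)
print(consonants_only)
```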
<pre class="doctest"><span class="pysrc-prompt">>>> </span>wsj = sorted(set(nltk.corpus.treebank.words()))
<span class="pysrc-prompt">>>> </span>[w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> wsj <span class="pysrc-keyword">if</span> re.search(r<span class="pysrc-string">'^[0-9]+\.[0-9]+$'</span>, w)]
<span class="pysrc-output">['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5',</span>
<span class="pysrc-output">'0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99',</span>
<span class="pysrc-output">'1.01', '1.1', '1.125', '1.14', '1.1650', '1.17', '1.18', '1.19', '1.2', ...]</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>[w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> wsj <span class="pysrc-keyword">if</span> re.search(r<span class="pysrc-string">'^[A-Z]+\$$'</span>, w)]
<span class="pysrc-output">['C$', 'US$']</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>[w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> wsj <span class="pysrc-keyword">if</span> re.search(<span class="pysrc-string">'^[0-9]{4}$'</span>, w)]
<span class="pysrc-output">['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', ...]</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>[w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> wsj <span class="pysrc-keyword">if</span> re.search(<span class="pysrc-string">'^[0-9]+-[a-z]{3,5}$'</span>, w)]
<span class="pysrc-output">['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', ...]</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>[w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> wsj <span class="pysrc-keyword">if</span> re.search(<span class="pysrc-string">'^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$'</span>, w)]
<span class="pysrc-output">['black-and-white', 'bread-and-butter', 'father-in-law', 'machine-gun-toting',</span>
<span class="pysrc-output">'savings-and-loan']</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>[w <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> wsj <span class="pysrc-keyword">if</span> re.search(<span class="pysrc-string">'(ed|ing)$'</span>, w)]
<span class="pysrc-output">['62%-owned', 'Absorbed', 'According', 'Adopting', 'Advanced', 'Advancing', ...]</span></pre>
<div class="note"><p class="first admonition-title"><font id="379">Note</font></p>
<p class="last"><font id="380"><strong>Your Turn:</strong> Study the above examples and try to work out what the <tt class="doctest"><span class="pre">\</span></tt>, <tt class="doctest"><span class="pre">{}</span></tt>, <tt class="doctest"><span class="pre">()</span></tt>, and <tt class="doctest"><span class="pre">|</span></tt> notations mean before you read on.</font></p>
</div>
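<p><font id="381">You probably worked out that a backslash means that the following character is deprived of its special powers and must literally match a specific character in the word.</font> <font id="382">Thus, while <tt class="doctest"><span class="pre">.</span></tt></font> <font id="383">is special, <tt class="doctest"><span class="pre">\.</span></tt></font> <font id="384">only matches a period.</font> <font id="385">The braced expressions, like <tt class="doctest"><span class="pre">{3,5}</span></tt>, specify the number of repeats of the previous item.</font> <font id="386">The pipe character indicates a choice between the material on its left and the material on its right.</font> <font id="387">Parentheses indicate the scope of an operator: they can be used together with the pipe (or disjunction) symbol like this: «<tt class="doctest"><span class="pre">w(i|e|ai|oo)t</span></tt>», matching <span class="example">wit</span>, <span class="example">wet</span>, <span class="example">wait</span>, and <span class="example">woot</span>.</font> <font id="388">It is instructive to see what happens when you omit the parentheses from the last expression in the examples above and search for «<tt class="doctest"><span class="pre">ed|ing$</span></tt>».</font></p>
<p><font id="389">The meta-characters we have seen are summarized in <a class="reference internal" href="./ch03.html#tab-regexp-meta-characters1">3.3</a>:</font></p>
<p class="caption"><font id="390"><span class="caption-label">Table 3.3</span>:</font></p>
<p><font id="391">Basic regular expression meta-characters, including wildcards, ranges and closures</font></p>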
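<p>The effect of the backslash and of the parentheses can be checked on a handful of tokens. The token list below is invented for this sketch; the patterns are the ones discussed above, plus an unescaped variant added for contrast.</p>

```python
import re

# Invented tokens covering numbers, a year, and some -ed / -ing words.
tokens = ['0.25', '1.125', '3x5', '1987', '10-year',
          'walked', 'walking', 'editor', 'bring', 'ringed']

# Escaped period: a literal '.' is required between the digit runs.
decimals = [t for t in tokens if re.search(r'^[0-9]+\.[0-9]+$', t)]

# Unescaped period: '.' is a wildcard, so any character may separate the digits.
unescaped = [t for t in tokens if re.search(r'^[0-9]+.[0-9]+$', t)]

# Braces: exactly four digits.
four_digits = [t for t in tokens if re.search(r'^[0-9]{4}$', t)]

# Parentheses limit the disjunction, so '$' applies to both alternatives.
scoped = [t for t in tokens if re.search(r'(ed|ing)$', t)]

# Without parentheses this reads as: 'ed' anywhere, OR 'ing' at the end.
unscoped = [t for t in tokens if re.search(r'ed|ing$', t)]

print(decimals, unescaped, four_digits, scoped, unscoped, sep='\n')
```

<p>The unescaped pattern wrongly accepts <tt class="doctest"><span class="pre">3x5</span></tt> and <tt class="doctest"><span class="pre">1987</span></tt>, and the unparenthesized disjunction additionally accepts <span class="example">editor</span>.</p>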
<p></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>word = <span class="pysrc-string">'supercalifragilisticexpialidocious'</span>
<span class="pysrc-prompt">>>> </span>re.findall(r<span class="pysrc-string">'[aeiou]'</span>, word)
<span class="pysrc-output">['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>len(re.findall(r<span class="pysrc-string">'[aeiou]'</span>, word))
<span class="pysrc-output">16</span></pre>
<p><font id="439">Let's look for all sequences of two or more vowels in some text, and determine their relative frequency:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>wsj = sorted(set(nltk.corpus.treebank.words()))
<span class="pysrc-prompt">>>> </span>fd = nltk.FreqDist(vs <span class="pysrc-keyword">for</span> word <span class="pysrc-keyword">in</span> wsj
<span class="pysrc-more">... </span> <span class="pysrc-keyword">for</span> vs <span class="pysrc-keyword">in</span> re.findall(r<span class="pysrc-string">'[aeiou]{2,}'</span>, word))
<span class="pysrc-prompt">>>> </span>fd.most_common(12)
<span class="pysrc-output">[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253),</span>
<span class="pysrc-output">('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95)]</span></pre>
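<p>The same counting idea works without NLTK. The sketch below swaps in <tt class="doctest"><span class="pre">collections.Counter</span></tt> from the standard library in place of <tt class="doctest"><span class="pre">nltk.FreqDist</span></tt>, and uses a short invented word list in place of the Treebank vocabulary.</p>

```python
import re
from collections import Counter

# Invented word list standing in for a corpus vocabulary.
words = ['beautiful', 'queue', 'cooperation', 'idea', 'rhythm', 'audio']

# findall() returns every non-overlapping run of two or more vowels in a word;
# the generator flattens those runs across all words before counting.
counts = Counter(vs for w in words for vs in re.findall(r'[aeiou]{2,}', w))

print(counts.most_common(1))
```

<p>Note that <tt class="doctest"><span class="pre">{2,}</span></tt> is greedy, so <span class="example">beautiful</span> contributes the single sequence <span class="example">eau</span> rather than <span class="example">ea</span> and <span class="example">au</span>.</p>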
<div class="note"><p class="first admonition-title"><font id="440">Note</font></p>
<p><font id="441"><strong>Your Turn:</strong> In the W3C Date Time Format, dates are represented like this: 2009-12-31.</font> <font id="442">Replace the <tt class="doctest"><span class="pre">?</span></tt> in the following Python code with a regular expression, in order to convert the string <tt class="doctest"><span class="pre"><span class="pysrc-string">'2009-12-31'</span></span></tt> to a list of integers <tt class="doctest"><span class="pre">[2009, 12, 31]</span></tt>:</font></p>
<p class="last"><font id="444"><tt class="doctest"><span class="pre">[int(n) <span class="pysrc-keyword">for</span> n <span class="pysrc-keyword">in</span> re.findall(?, <span class="pysrc-string">'2009-12-31'</span>)]</span></tt></font></p>
</div>
</div>
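<div class="section" id="doing-more-with-word-pieces"><h3 class="sigil_not_in_toc"><font id="445">Doing More with Word Pieces</font></h3>
<p><font id="446">Once we can use <tt class="doctest"><span class="pre">re.findall()</span></tt> to extract material from words, there are interesting things to do with the pieces, such as gluing them back together or plotting them.</font></p>
<p><font id="447">It is sometimes noted that English text is highly redundant, and that it is still easy to read when word-internal vowels are left out.</font> <font id="448">For example, <span class="example">declaration</span> becomes <span class="example">dclrtn</span>, and <span class="example">inalienable</span> becomes <span class="example">inlnble</span>, retaining any initial or final vowel sequences.</font> <font id="449">The regular expression in our next example matches initial vowel sequences, final vowel sequences, and all consonants; everything else is ignored.</font> <font id="450">This three-way disjunction is processed left-to-right: if one of the three parts matches the word, any later parts of the regular expression are ignored.</font> <font id="451">We use <tt class="doctest"><span class="pre">re.findall()</span></tt> to extract all the matching pieces, and <tt class="doctest"><span class="pre"><span class="pysrc-string">''</span>.join()</span></tt> to join them together (see <a class="reference internal" href="./ch03.html#sec-formatting">3.9</a> for more about the join operation).</font></p>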
<pre class="doctest"><span class="pysrc-prompt">>>> </span>regexp = r<span class="pysrc-string">'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'</span>
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">def</span> <span class="pysrc-defname">compress</span>(word):
<span class="pysrc-more">... </span> pieces = re.findall(regexp, word)
<span class="pysrc-more">... </span> return <span class="pysrc-string">''</span>.join(pieces)
<span class="pysrc-more">...</span>
<span class="pysrc-prompt">>>> </span>english_udhr = nltk.corpus.udhr.words(<span class="pysrc-string">'English-Latin1'</span>)
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(nltk.tokenwrap(compress(w) <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> english_udhr[:75]))
<span class="pysrc-output">Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and</span>
<span class="pysrc-output">of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn</span>
<span class="pysrc-output">of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn</span>
<span class="pysrc-output">rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,</span>
<span class="pysrc-output">and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and</span></pre>
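<p>To see what the three-way disjunction actually extracts, it helps to look at the list of pieces before they are joined. This sketch reuses the same regular expression on the two words mentioned in the text; the helper name <tt class="doctest"><span class="pre">pieces</span></tt> is introduced here for illustration.</p>

```python
import re

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def pieces(word):
    """Return the substrings kept by the pattern, in order of appearance."""
    return re.findall(regexp, word)

def compress(word):
    """Join the kept pieces back into a single string."""
    return ''.join(pieces(word))

# The initial 'i' and final 'e' survive; word-internal vowels match no alternative.
print(pieces('inalienable'))
print(compress('declaration'), compress('inalienable'))
```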
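<p><font id="452">Next, let's combine regular expressions with conditional frequency distributions.</font> <font id="453">Here we will extract all consonant-vowel sequences from the words of Rotokas, such as <span class="example">ka</span> and <span class="example">si</span>.</font> <font id="454">Since each of these is a pair, it can be used to initialize a conditional frequency distribution.</font> <font id="455">We then tabulate the frequency of each pair:</font></p>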
<pre class="doctest"><span class="pysrc-prompt">>>> </span>rotokas_words = nltk.corpus.toolbox.words(<span class="pysrc-string">'rotokas.dic'</span>)
<span class="pysrc-prompt">>>> </span>cvs = [cv <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> rotokas_words <span class="pysrc-keyword">for</span> cv <span class="pysrc-keyword">in</span> re.findall(r<span class="pysrc-string">'[ptksvr][aeiou]'</span>, w)]
<span class="pysrc-prompt">>>> </span>cfd = nltk.ConditionalFreqDist(cvs)
<span class="pysrc-prompt">>>> </span>cfd.tabulate()
<span class="pysrc-output"> a e i o u</span>
<span class="pysrc-output">k 418 148 94 420 173</span>
<span class="pysrc-output">p 83 31 105 34 51</span>
<span class="pysrc-output">r 187 63 84 89 79</span>
<span class="pysrc-output">s 0 0 100 2 1</span>
<span class="pysrc-output">t 47 8 0 148 37</span>
<span class="pysrc-output">v 93 27 105 48 49</span></pre>
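<p><font id="456">Examining the rows for <span class="example">s</span> and <span class="example">t</span>, we see they are in partial "complementary distribution", which is evidence that they are not distinct phonemes in the language.</font> <font id="457">Thus, we could conceivably drop <span class="example">s</span> from the Rotokas alphabet and simply have a pronunciation rule that the letter <span class="example">t</span> is pronounced <span class="example">s</span> when followed by <span class="example">i</span>.</font> <font id="458">(Note that the single entry having <em>su</em>, namely <em>kasuari</em>, ‘cassowary’, is borrowed from English.)</font></p>
<p><font id="459">If we want to inspect the words behind the numbers in the above table, it would be helpful to have an index that lets us quickly find the list of words containing a given consonant-vowel pair; for example,</font> <font id="460"><tt class="doctest"><span class="pre">cv_index[<span class="pysrc-string">'su'</span>]</span></tt> should give us all words containing <span class="example">su</span>.</font> <font id="461">Here's how we can do this:</font></p>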
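<p>The way a two-character string acts as a (condition, sample) pair can be sketched with the standard library alone. Here a <tt class="doctest"><span class="pre">defaultdict</span></tt> of <tt class="doctest"><span class="pre">Counter</span></tt> objects stands in for <tt class="doctest"><span class="pre">nltk.ConditionalFreqDist</span></tt>. Apart from <span class="example">kasuari</span> and <span class="example">kapokapo</span>, the strings in the list are invented for illustration and are not claimed to be Rotokas words.</p>

```python
import re
from collections import Counter, defaultdict

# 'kasuari' and 'kapokapo' appear in the text; the rest are invented strings.
words = ['kasuari', 'kapokapo', 'siri', 'tapa', 'toko']

# Outer key: the consonant (condition). Inner Counter: vowel frequencies (samples).
cv_counts = defaultdict(Counter)
for w in words:
    for cv in re.findall(r'[ptksvr][aeiou]', w):
        consonant, vowel = cv          # a 2-character string unpacks into a pair
        cv_counts[consonant][vowel] += 1

# Print one row per consonant, one column per vowel.
for consonant in sorted(cv_counts):
    print(consonant, [cv_counts[consonant][v] for v in 'aeiou'])
```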
<pre class="doctest"><span class="pysrc-prompt">>>> </span>cv_word_pairs = [(cv, w) <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> rotokas_words
<span class="pysrc-more">... </span> <span class="pysrc-keyword">for</span> cv <span class="pysrc-keyword">in</span> re.findall(r<span class="pysrc-string">'[ptksvr][aeiou]'</span>, w)]
<span class="pysrc-prompt">>>> </span>cv_index = nltk.Index(cv_word_pairs)
<span class="pysrc-prompt">>>> </span>cv_index[<span class="pysrc-string">'su'</span>]
<span class="pysrc-output">['kasuari']</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>cv_index[<span class="pysrc-string">'po'</span>]
<span class="pysrc-output">['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo', 'kapoa',</span>
<span class="pysrc-output">'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa', 'kapokapora', ...]</span></pre>
<p><font id="462">This program processes each word <tt class="doctest"><span class="pre">w</span></tt> in turn, and for each one finds every substring that matches the regular expression «<tt class="doctest"><span class="pre">[ptksvr][aeiou]</span></tt>».</font> <font id="463">In the case of the word <span class="example">kasuari</span>, it finds <span class="example">ka</span>, <span class="example">su</span> and <span class="example">ri</span>.</font> <font id="464">Therefore, the <tt class="doctest"><span class="pre">cv_word_pairs</span></tt> list will contain <tt class="doctest"><span class="pre">(<span class="pysrc-string">'ka'</span>, <span class="pysrc-string">'kasuari'</span>)</span></tt>, <tt class="doctest"><span class="pre">(<span class="pysrc-string">'su'</span>, <span class="pysrc-string">'kasuari'</span>)</span></tt> and <tt class="doctest"><span class="pre">(<span class="pysrc-string">'ri'</span>, <span class="pysrc-string">'kasuari'</span>)</span></tt>.</font> <font id="465">One further step, using <tt class="doctest"><span class="pre">nltk.Index()</span></tt>, converts this into a useful index.</font></p>
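<p>An index of this kind is essentially a dictionary that maps each key to a list of values. The sketch below builds one with <tt class="doctest"><span class="pre">collections.defaultdict</span></tt> as a stand-in for <tt class="doctest"><span class="pre">nltk.Index</span></tt>, on three words taken from the output above. It is meant to illustrate the idea and is not a description of NLTK's implementation.</p>

```python
import re
from collections import defaultdict

# Three words that appear in the output shown above.
words = ['kasuari', 'kapo', 'kapokapo']

# Map each consonant-vowel pair to the list of words it was found in.
cv_index = defaultdict(list)
for w in words:
    for cv in re.findall(r'[ptksvr][aeiou]', w):
        cv_index[cv].append(w)

print(cv_index['su'])
print(cv_index['po'])
```

<p>A word is appended once per occurrence of the pair, which is why <span class="example">kapokapo</span>, containing <span class="example">po</span> twice, is listed twice.</p>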
</div>
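<div class="section" id="finding-word-stems"><h3 class="sigil_not_in_toc"><font id="466">Finding Word Stems</font></h3>
<p><font id="467">When we use a web search engine, we usually don't mind (or even notice) if the words in the document differ from our search terms in having different endings.</font> <font id="468">A query for <span class="example">laptops</span> finds documents containing <span class="example">laptop</span> and vice versa.</font> <font id="469">Indeed, <span class="example">laptop</span> and <span class="example">laptops</span> are just two forms of the same dictionary word (or lemma).</font> <font id="470">For some language processing tasks we want to ignore word endings and just deal with word stems.</font></p>
<p><font id="471">There are various ways we can pull out the stem of a word.</font> <font id="472">Here's a simple-minded approach which just strips off anything that looks like a suffix:</font></p>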
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">def</span> <span class="pysrc-defname">stem</span>(word):
<span class="pysrc-more">... </span> <span class="pysrc-keyword">for</span> suffix <span class="pysrc-keyword">in</span> [<span class="pysrc-string">'ing'</span>, <span class="pysrc-string">'ly'</span>, <span class="pysrc-string">'ed'</span>, <span class="pysrc-string">'ious'</span>, <span class="pysrc-string">'ies'</span>, <span class="pysrc-string">'ive'</span>, <span class="pysrc-string">'es'</span>, <span class="pysrc-string">'s'</span>, <span class="pysrc-string">'ment'</span>]:
<span class="pysrc-more">... </span> <span class="pysrc-keyword">if</span> word.endswith(suffix):
<span class="pysrc-more">... </span> return word[:-len(suffix)]
<span class="pysrc-more">... </span> return word</pre>
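<p><font id="473">Although we will ultimately use NLTK's built-in stemmers, it's interesting to see how we can use regular expressions for this task.</font> <font id="474">Our first step is to build up a disjunction of all the suffixes.</font> <font id="475">We need to enclose it in parentheses in order to limit the scope of the disjunction.</font></p>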
<pre class="doctest"><span class="pysrc-prompt">>>> </span>re.findall(r<span class="pysrc-string">'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$'</span>, <span class="pysrc-string">'processing'</span>)
<span class="pysrc-output">['ing']</span></pre>
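<p><font id="476">Here, <tt class="doctest"><span class="pre">re.findall()</span></tt> just gave us the suffix even though the regular expression matched the entire word.</font> <font id="477">This is because the parentheses have a second function: to select substrings to be extracted.</font> <font id="478">If we want to use the parentheses to specify the scope of the disjunction, but not to select the material to be output, we have to add <tt class="doctest"><span class="pre">?:</span></tt>, which is just one of many arcane subtleties of regular expressions.</font> <font id="479">Here's the revised version.</font></p>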
<pre class="doctest"><span class="pysrc-prompt">>>> </span>re.findall(r<span class="pysrc-string">'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$'</span>, <span class="pysrc-string">'processing'</span>)
<span class="pysrc-output">['processing']</span></pre>
<p><font id="480">However, we'd actually like to split the word into stem and suffix.</font> <font id="481">So we should just parenthesize both parts of the regular expression:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>re.findall(r<span class="pysrc-string">'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$'</span>, <span class="pysrc-string">'processing'</span>)
<span class="pysrc-output">[('process', 'ing')]</span></pre>
<p><font id="482">This looks promising, but still has a problem.</font> <font id="483">Let's look at a different word, <span class="example">processes</span>:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>re.findall(r<span class="pysrc-string">'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$'</span>, <span class="pysrc-string">'processes'</span>)
<span class="pysrc-output">[('processe', 's')]</span></pre>
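<p><font id="484">The regular expression incorrectly found an <span class="example">-s</span> suffix instead of an <span class="example">-es</span> suffix.</font> <font id="485">This demonstrates another subtlety: the star operator is "greedy", so the <tt class="doctest"><span class="pre">.*</span></tt> part of the expression tries to consume as much of the input as possible.</font> <font id="486">If we use the "non-greedy" version of the star operator, written <tt class="doctest"><span class="pre">*?</span></tt>, we get what we want:</font></p>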
<pre class="doctest"><span class="pysrc-prompt">>>> </span>re.findall(r<span class="pysrc-string">'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$'</span>, <span class="pysrc-string">'processes'</span>)
<span class="pysrc-output">[('process', 'es')]</span></pre>
<p><font id="487">This works even when we allow an empty suffix, by making the content of the second parentheses optional:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>re.findall(r<span class="pysrc-string">'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'</span>, <span class="pysrc-string">'language'</span>)
<span class="pysrc-output">[('language', '')]</span></pre>
<p><font id="488">This approach still has many problems (can you spot them?),</font> <font id="489">but we will move on to define a function to perform stemming, and apply it to a whole text:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">def</span> <span class="pysrc-defname">stem</span>(word):
<span class="pysrc-more">... </span> regexp = r<span class="pysrc-string">'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'</span>
<span class="pysrc-more">... </span> stem, suffix = re.findall(regexp, word)[0]
<span class="pysrc-more">... </span> return stem
<span class="pysrc-more">...</span>
<span class="pysrc-prompt">>>> </span>raw = <span class="pysrc-string">"""DENNIS: Listen, strange women lying in ponds distributing swords</span>
<span class="pysrc-more">... </span><span class="pysrc-string">is no basis for a system of government. Supreme executive power derives from</span>
<span class="pysrc-more">... </span><span class="pysrc-string">a mandate from the masses, not from some farcical aquatic ceremony."""</span>
<span class="pysrc-prompt">>>> </span>tokens = word_tokenize(raw)
<span class="pysrc-prompt">>>> </span>[stem(t) <span class="pysrc-keyword">for</span> t <span class="pysrc-keyword">in</span> tokens]
<span class="pysrc-output">['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut',</span>
<span class="pysrc-output">'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Supreme',</span>
<span class="pysrc-output">'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',',</span>
<span class="pysrc-output">'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']</span></pre>
<p><font id="490">Notice that our regular expression removed the <span class="example">s</span> from <span class="example">ponds</span>, but also from <span class="example">is</span> and <span class="example">basis</span>.</font> <font id="491">It produced some non-words like <span class="example">distribut</span> and <span class="example">deriv</span>, but these are acceptable stems in some applications.</font></p>
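<p>One possible mitigation, which is an assumption added here and not something proposed in the text, is to require that a minimum number of characters remain in the stem before a suffix is stripped. The sketch below uses a hypothetical <tt class="doctest"><span class="pre">min_stem</span></tt> parameter and a non-capturing suffix group. It protects very short words such as <span class="example">is</span>, but it does not solve the general problem: <span class="example">basis</span> is still truncated.</p>

```python
import re

SUFFIXES = r'(?:ing|ly|ed|ious|ies|ive|es|s|ment)'

def stem(word, min_stem=3):
    """Strip one suffix, but only if at least min_stem characters remain.

    min_stem is a hypothetical guard introduced for this sketch.
    """
    # '.{3,}?' is a non-greedy stem of at least three characters (for min_stem=3).
    regexp = r'^(.{%d,}?)%s?$' % (min_stem, SUFFIXES)
    match = re.match(regexp, word)
    # Words shorter than min_stem cannot match and are returned unchanged.
    return match.group(1) if match else word

for w in ['is', 'ponds', 'lying', 'basis', 'processes']:
    print(w, '->', stem(w))
```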
</div>
<div class="section" id="searching-tokenized-text"><h3 class="sigil_not_in_toc"><font id="492">Searching Tokenized Text</font></h3>
<p><font id="493">You can use a special kind of regular expression for searching across multiple words in a text (where a text is a list of tokens).</font> <font id="494">For example, <tt class="doctest"><span class="pre"><span class="pysrc-string">"<a> <man>"</span></span></tt> finds all instances of <span class="example">a man</span> in the text.</font> <font id="495">The angle brackets are used to mark token boundaries, and any whitespace between the angle brackets is ignored (behaviors that are unique to NLTK's <tt class="doctest"><span class="pre">findall()</span></tt> method for texts).</font> <font id="496">In the following example, we include <tt class="doctest"><span class="pre"><.*></span></tt> <a class="reference internal" href="./ch03.html#single-token-wildcard"><span id="ref-single-token-wildcard"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></span></a>, which will match any single token, and enclose it in parentheses so that only the matched word (e.g.</font> <font id="497"><span class="example">monied</span>) and not the matched phrase (e.g.</font> <font id="498"><span class="example">a monied man</span>) is produced.</font> <font id="499">The second example finds three-word phrases ending with the word <span class="example">bro</span> <a class="reference internal" href="./ch03.html#three-word-phrases"><span id="ref-three-word-phrases"><img alt="[2]" class="callout" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></span></a>.</font> <font id="500">The last example finds sequences of three or more words starting with the letter <span class="example">l</span> <a class="reference internal" href="./ch03.html#letter-l"><span id="ref-letter-l"><img alt="[3]" class="callout" src="Images/7c20d0adbadb35031a28bfcd6dff9900.jpg"/></span></a>.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">from</span> nltk.corpus <span class="pysrc-keyword">import</span> gutenberg, nps_chat
<span class="pysrc-prompt">>>> </span>moby = nltk.Text(gutenberg.words(<span class="pysrc-string">'melville-moby_dick.txt'</span>))
<span class="pysrc-prompt">>>> </span>moby.findall(r<span class="pysrc-string">"<a> (<.*>) <man>"</span>) <a href="./ch03.html#ref-single-token-wildcard"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></a>
<span class="pysrc-output">monied; nervous; dangerous; white; white; white; pious; queer; good;</span>
<span class="pysrc-output">mature; white; Cape; great; wise; wise; butterless; white; fiendish;</span>
<span class="pysrc-output">pale; furious; better; certain; complete; dismasted; younger; brave;</span>
<span class="pysrc-output">brave; brave; brave</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>chat = nltk.Text(nps_chat.words())
<span class="pysrc-prompt">>>> </span>chat.findall(r<span class="pysrc-string">"<.*> <.*> <bro>"</span>) <a href="./ch03.html#ref-three-word-phrases"><img alt="[2]" class="callout" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></a>
<span class="pysrc-output">you rule bro; telling you bro; u twizted bro</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>chat.findall(r<span class="pysrc-string">"<l.*>{3,}"</span>) <a href="./ch03.html#ref-letter-l"><img alt="[3]" class="callout" src="Images/7c20d0adbadb35031a28bfcd6dff9900.jpg"/></a>
<span class="pysrc-output">lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la</span>
<span class="pysrc-output">la la; lovely lol lol love; lol lol lol.; la la la; la la la</span></pre>
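<p>To get a feel for how a token-level pattern can be reduced to an ordinary regular expression, the sketch below wraps every token in angle brackets, joins them into one string, and rewrites the pattern so that each <tt class="doctest"><span class="pre"><...></span></tt> behaves as one unit. It is a simplified model written for this explanation, using only the <tt class="doctest"><span class="pre">re</span></tt> module; the function name <tt class="doctest"><span class="pre">token_findall</span></tt> is hypothetical, and the details should be read as an assumption about the general technique, not as NLTK's actual code.</p>

```python
import re

def token_findall(tokens, pattern):
    """Search a token list with an angle-bracket pattern (simplified sketch)."""
    # Encode the text as one string in which every token is wrapped in <...>.
    raw = ''.join('<' + t + '>' for t in tokens)
    # Whitespace inside the pattern is ignored.
    regexp = re.sub(r'\s', '', pattern)
    # Make each <...> a non-capturing unit so that {3,} and similar apply to tokens.
    regexp = regexp.replace('<', '(?:<(?:').replace('>', ')>)')
    # An unescaped '.' must not run across a token boundary.
    regexp = re.sub(r'(?<!\\)\.', '[^>]', regexp)
    hits = []
    for m in re.finditer(regexp, raw):
        # Report the first parenthesized group if there is one, else the whole match.
        text = m.group(1) if m.groups() else m.group(0)
        hits.append(text[1:-1].split('><'))
    return hits

tokens = ['a', 'monied', 'man', 'met', 'a', 'wise', 'man', 'and', 'a', 'man']
print(token_findall(tokens, r'<a> (<.*>) <man>'))

chat = ['lol', 'lol', 'lol', 'ok', 'la', 'la']
print(token_findall(chat, r'<l.*>{3,}'))
```

<p>The final <span class="example">a man</span> in the first token list is not reported, because the pattern requires exactly one token between <span class="example">a</span> and <span class="example">man</span>.</p>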
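<div class="note"><p class="first admonition-title"><font id="501">Note</font></p>
<p class="last"><font id="502"><strong>Your Turn:</strong> Consolidate your understanding of regular expression patterns and substitutions using <tt class="doctest"><span class="pre">nltk.re_show(</span></tt><em>p, s</em><tt class="doctest"><span class="pre">)</span></tt>, which annotates the string <em>s</em> to show every place where pattern <em>p</em> was matched, and <tt class="doctest"><span class="pre">nltk.app.nemo()</span></tt>, which provides a graphical interface for exploring regular expressions.</font> <font id="503">For more practice, try some of the exercises on regular expressions at the end of this chapter.</font></p>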
</div>
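<p><font id="504">It is easy to build search patterns when the linguistic phenomenon we're studying is tied to particular words.</font> <font id="505">In some cases, a little creativity will go a long way.</font> <font id="506">For instance, searching a large text corpus for expressions of the form <span class="example">x and other ys</span> allows us to discover hypernyms (cf. <a class="reference external" href="./ch02.html#sec-wordnet">5</a>):</font></p>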
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">from</span> nltk.corpus <span class="pysrc-keyword">import</span> brown
<span class="pysrc-prompt">>>> </span>hobbies_learned = nltk.Text(brown.words(categories=[<span class="pysrc-string">'hobbies'</span>, <span class="pysrc-string">'learned'</span>]))
<span class="pysrc-prompt">>>> </span>hobbies_learned.findall(r<span class="pysrc-string">"<\w*> <and> <other> <\w*s>"</span>)
<span class="pysrc-output">speed and other activities; water and other liquids; tomb and other</span>
<span class="pysrc-output">landmarks; Statues and other monuments; pearls and other jewels;</span>
<span class="pysrc-output">charts and other items; roads and other features; figures and other</span>
<span class="pysrc-output">objects; military and other areas; demands and other factors;</span>
<span class="pysrc-output">abstracts and other compilations; iron and other metals</span></pre>
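<p><font id="507">With enough text, this approach would give us a useful store of information about the taxonomy of objects, without the need for any manual labor.</font> <font id="508">However, our search results will usually contain false positives, i.e.</font> <font id="509">cases that we would want to exclude.</font> <font id="510">For example, the result <span class="example">demands and other factors</span> suggests that <span class="example">demand</span> is an instance of the type <span class="example">factor</span>, but this sentence is actually about wage demands.</font> <font id="511">Nevertheless, we could construct our own ontology of English concepts by manually correcting the output of such searches.</font></p>
<div class="note"><p class="first admonition-title"><font id="512">Note</font></p>
<p class="last"><font id="513">This combination of automatic and manual processing is the most common way for new corpora to be constructed.</font> <font id="514">We will return to this in <a class="reference external" href="./ch11.html#chap-data">11.</a></font></p>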
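<p>The same search idea can be tried on a plain string with the <tt class="doctest"><span class="pre">re</span></tt> module. The sentence fragments below are invented around phrases from the output above, and the capturing groups return each candidate (hyponym, hypernym) pair directly.</p>

```python
import re

# Invented text built around three phrases from the search results.
text = ("The valley holds iron and other metals, while water and other liquids "
        "are stored nearby. Wage demands and other factors slowed the talks.")

# Group 1: the word before 'and other'; group 2: a following word ending in 's'.
pairs = re.findall(r'\b(\w+) and other (\w+s)\b', text)

for hyponym, hypernym in pairs:
    print(hyponym, 'is possibly a kind of', hypernym)
```

<p>The third pair is the false positive discussed above: the pattern matches, but <span class="example">demands</span> is not a kind of <span class="example">factor</span> in the intended sense.</p>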
</div>
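<p><font id="515">Searching corpora also suffers from the problem of false negatives, i.e. </font><font id="516">omitting cases that we would want to include. </font><font id="517">It is risky to conclude that some linguistic phenomenon does not exist in a corpus just because we could not find any instances of a search pattern. </font><font id="518">Perhaps we just did not think carefully enough about suitable patterns.</font></p>
<div class="note"><p class="first admonition-title"><font id="519">Note</font></p>
<p class="last"><font id="520"><strong>Your Turn:</strong> Look for instances of the pattern <span class="example">as x as y</span> to discover information about entities and their properties.</font></p>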
</div>
</div>
</div>
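<div class="section" id="normalizing-text"><h2 class="sigil_not_in_toc"><font id="521">3.6 Normalizing Text</font></h2>
<p><font id="522">In earlier program examples we have often converted text to lowercase before doing anything with its words, e.g. </font><font id="523"><tt class="doctest"><span class="pre">set(w.lower() <span class="pysrc-keyword">for</span> w <span class="pysrc-keyword">in</span> text)</span></tt>. </font><font id="524">By using <tt class="doctest"><span class="pre">lower()</span></tt>, we have <span class="termdef">normalized</span> the text to lowercase so that the distinction between <span class="example">The</span> and <span class="example">the</span> is ignored. </font><font id="525">Often we want to go further than this and strip off any affixes, a task known as stemming. </font><font id="526">A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization. </font><font id="527">We discuss each of these in turn. </font><font id="528">First, we need to define the data we will use in this section:</font></p>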
<pre class="doctest"><span class="pysrc-prompt">>>> </span>raw = <span class="pysrc-string">"""DENNIS: Listen, strange women lying in ponds distributing swords</span>
<span class="pysrc-more">... </span><span class="pysrc-string">is no basis for a system of government. Supreme executive power derives from</span>
<span class="pysrc-more">... </span><span class="pysrc-string">a mandate from the masses, not from some farcical aquatic ceremony."""</span>
<span class="pysrc-prompt">>>> </span>tokens = word_tokenize(raw)</pre>
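<div class="section" id="stemmers"><h3 class="sigil_not_in_toc"><font id="529">Stemmers</font></h3>
<p><font id="530">NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in preference to crafting your own using regular expressions, since they handle a wide range of irregular cases. </font><font id="531">The Porter and Lancaster stemmers follow their own rules for stripping affixes. </font><font id="532">Observe that the Porter stemmer correctly handles the word <span class="example">lying</span> (mapping it to <span class="example">lie</span>), while the Lancaster stemmer does not.</font></p>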
<pre class="doctest"><span class="pysrc-prompt">>>> </span>porter = nltk.PorterStemmer()
<span class="pysrc-prompt">>>> </span>lancaster = nltk.LancasterStemmer()
<span class="pysrc-prompt">>>> </span>[porter.stem(t) <span class="pysrc-keyword">for</span> t <span class="pysrc-keyword">in</span> tokens]
<span class="pysrc-output">['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond',</span>
<span class="pysrc-output">'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',</span>
<span class="pysrc-output">'.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from',</span>
<span class="pysrc-output">'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>[lancaster.stem(t) <span class="pysrc-keyword">for</span> t <span class="pysrc-keyword">in</span> tokens]
<span class="pysrc-output">['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',</span>
<span class="pysrc-output">'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',</span>
<span class="pysrc-output">'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',</span>
<span class="pysrc-output">'from', 'som', 'farc', 'aqu', 'ceremony', '.']</span></pre>
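<p><font id="533">Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind. </font><font id="534">The Porter stemmer is a good choice if you are indexing some texts and want to support search using alternative forms of words (illustrated in <a class="reference internal" href="./ch03.html#code-stemmer-indexing">3.6</a>, which uses <em>object oriented</em> programming techniques that are outside the scope of this book, string formatting techniques to be covered in <a class="reference internal" href="./ch03.html#sec-formatting">3.9</a>, and the <tt class="doctest"><span class="pre">enumerate()</span></tt> function to be explained in <a class="reference external" href="./ch04.html#sec-sequences">4.2</a>).</font></p>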
<div class="pylisting"><p></p>
<pre class="doctest"><span class="pysrc-keyword">class</span> <span class="pysrc-defname">IndexedText</span>(object):

    <span class="pysrc-keyword">def</span> <span class="pysrc-defname">__init__</span>(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 <span class="pysrc-keyword">for</span> (i, word) <span class="pysrc-keyword">in</span> enumerate(text))

    <span class="pysrc-keyword">def</span> <span class="pysrc-defname">concordance</span>(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                <span class="pysrc-comment"># words of context</span>
        <span class="pysrc-keyword">for</span> i <span class="pysrc-keyword">in</span> self._index[key]:
            lcontext = <span class="pysrc-string">' '</span>.join(self._text[i-wc:i])
            rcontext = <span class="pysrc-string">' '</span>.join(self._text[i:i+wc])
            ldisplay = <span class="pysrc-string">'{:>{width}}'</span>.format(lcontext[-width:], width=width)
            rdisplay = <span class="pysrc-string">'{:{width}}'</span>.format(rcontext[:width], width=width)
            <span class="pysrc-keyword">print</span>(ldisplay, rdisplay)

    <span class="pysrc-keyword">def</span> <span class="pysrc-defname">_stem</span>(self, word):
        return self._stemmer.stem(word).lower()</pre>
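<p>The core idea of the listing, an index keyed by stems rather than by surface forms, can be sketched with the standard library alone. This is an illustration under stated assumptions, not NLTK's implementation: <tt class="doctest"><span class="pre">toy_stem</span></tt> is a hypothetical stand-in for a real stemmer such as Porter, and <tt class="doctest"><span class="pre">build_index</span></tt> is an invented helper that plays the role of <tt class="doctest"><span class="pre">nltk.Index</span></tt>.</p>

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stemmer: lowercase, then strip one of a few suffixes."""
    word = word.lower()
    for suffix in ('ing', 'es', 's'):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_index(tokens, stem=toy_stem):
    """Map each stem to the list of token positions where it occurs."""
    index = defaultdict(list)
    for i, token in enumerate(tokens):
        index[stem(token)].append(i)
    return index

tokens = "The knights sing and the knight sings while singing".split()
index = build_index(tokens)

# Looking up any inflected form retrieves every position sharing its stem.
sing_positions = index[toy_stem('sings')]
print(sing_positions)
```

<p>Because both the indexed tokens and the query pass through the same stemming function, a search for one form finds all related forms, which is what the <tt class="doctest"><span class="pre">concordance()</span></tt> method relies on.</p>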
<div class="section" id="lemmatization"><h3 class="sigil_not_in_toc"><font id="536">Lemmatization</font></h3>
<p><font id="537">The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary. </font><font id="538">This additional checking process makes the lemmatizer slower than the stemmers above. </font><font id="539">Notice that it does not handle <span class="example">lying</span>, but it converts <span class="example">women</span> to <span class="example">woman</span>.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>wnl = nltk.WordNetLemmatizer()
<span class="pysrc-prompt">>>> </span>[wnl.lemmatize(t) <span class="pysrc-keyword">for</span> t <span class="pysrc-keyword">in</span> tokens]
<span class="pysrc-output">['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',</span>
<span class="pysrc-output">'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',</span>
<span class="pysrc-output">'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',</span>
<span class="pysrc-output">'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',</span>
<span class="pysrc-output">'aquatic', 'ceremony', '.']</span></pre>
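<p><font id="540">The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid lemmas (or lexicon headwords).</font></p>
<div class="note"><p class="first admonition-title"><font id="541">Note</font></p>
<p class="last"><font id="542">Another normalization task involves identifying <span class="termdef">non-standard words</span>, including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. </font><font id="543">For example, every decimal number could be mapped to a single token <tt class="doctest"><span class="pre">0.0</span></tt>, and every acronym could be mapped to <tt class="doctest"><span class="pre">AAA</span></tt>. </font><font id="544">This keeps the vocabulary small and improves the accuracy of many language modeling tasks.</font></p>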
</div>
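<p>A minimal sketch of this kind of normalization follows. The two patterns and the function name <tt class="doctest"><span class="pre">normalize_token</span></tt> are assumptions made for illustration; a practical system would need many more cases (dates, ordinals, currency, and so on).</p>

```python
import re

def normalize_token(token):
    """Map decimal numbers to '0.0' and acronyms to 'AAA' (illustrative rules only)."""
    if re.fullmatch(r'\d+\.\d+', token):            # e.g. 12.40, 3.5
        return '0.0'
    if re.fullmatch(r'(?:[A-Z]\.?){2,}', token):    # e.g. NASA, U.S.A.
        return 'AAA'
    return token

# Invented token list for demonstration.
tokens = ['NASA', 'paid', '12.40', 'and', '3.5', 'to', 'the', 'U.S.A.', 'in', '2006']
normalized = [normalize_token(t) for t in tokens]
print(normalized)
print(len(set(tokens)), '->', len(set(normalized)))
```

<p>The vocabulary shrinks because distinct numbers and distinct acronyms collapse onto one placeholder each.</p>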
</div>
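<div class="section" id="regular-expressions-for-tokenizing-text"><h2 class="sigil_not_in_toc"><font id="545">3.7 Regular Expressions for Tokenizing Text</font></h2>
<p><font id="546">Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. </font><font id="547">Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and because NLTK includes some tokenizers. </font><font id="548">Now that you are familiar with regular expressions, you can learn how to use them to tokenize text, and to have much more control over the process.</font></p>
<div class="section" id="simple-approaches-to-tokenization"><h3 class="sigil_not_in_toc"><font id="549">Simple Approaches to Tokenization</font></h3>
<p><font id="550">A very simple method for tokenizing text is to split it on whitespace. </font><font id="551">Consider the following text from <em>Alice's Adventures in Wonderland</em>:</font></p>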
<pre class="doctest"><span class="pysrc-prompt">>>> </span>raw = <span class="pysrc-string">"""'When I'M a Duchess,' she said to herself, (not in a very hopeful tone</span>
<span class="pysrc-more">... </span><span class="pysrc-string">though), 'I won't have any pepper in my kitchen AT ALL. Soup does very</span>
<span class="pysrc-more">... </span><span class="pysrc-string">well without--Maybe it's always pepper that makes people hot-tempered,'..."""</span></pre>
<p><font id="552">We could split this raw text on whitespace using <tt class="doctest"><span class="pre">raw.split()</span></tt>. </font><font id="553">To do the same using a regular expression, it is not enough to match any space characters in the string <a class="reference internal" href="./ch03.html#split-space"><span id="ref-split-space"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></span></a>, since this results in tokens that contain a <tt class="doctest"><span class="pre">\n</span></tt> newline character; instead we need to match any number of spaces, tabs, or newlines <a class="reference internal" href="./ch03.html#split-whitespace"><span id="ref-split-whitespace"><img alt="[2]" class="callout" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></span></a>:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>re.split(r<span class="pysrc-string">' '</span>, raw) <a href="./ch03.html#ref-split-space"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></a>
<span class="pysrc-output">["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',</span>
<span class="pysrc-output">'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper',</span>
<span class="pysrc-output">'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe',</span>
<span class="pysrc-output">"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>re.split(r<span class="pysrc-string">'[ \t\n]+'</span>, raw) <a href="./ch03.html#ref-split-whitespace"><img alt="[2]" class="callout" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></a>
<span class="pysrc-output">["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in',</span>
<span class="pysrc-output">'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper',</span>
<span class="pysrc-output">'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe',</span>
<span class="pysrc-output">"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]</span></pre>
<p><font id="554">The regular expression «<tt class="doctest"><span class="pre">[ \t\n]+</span></tt>» matches one or more spaces, tabs (<tt class="doctest"><span class="pre">\t</span></tt>), or newlines (<tt class="doctest"><span class="pre">\n</span></tt>). </font><font id="555">Other whitespace characters, such as carriage return and form feed, should really be included too. </font><font id="556">Instead, we will use a built-in <tt class="doctest"><span class="pre">re</span></tt> abbreviation, <tt class="doctest"><span class="pre">\s</span></tt>, which means any whitespace character. </font><font id="557">The second statement in the preceding example can be rewritten as <tt class="doctest"><span class="pre">re.split(r<span class="pysrc-string">'\s+'</span>, raw)</span></tt>.</font></p>
<div class="note"><p class="first admonition-title"><font id="558">Note</font></p>
<p class="last"><font id="559"><strong>Important:</strong> Remember to prefix regular expressions with the letter <tt class="doctest"><span class="pre">r</span></tt> (meaning "raw"), which instructs the Python interpreter to treat the string literally, rather than processing any backslashed characters it contains.</font></p>
</div>
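<p>The effect of the <tt class="doctest"><span class="pre">r</span></tt> prefix is easiest to see with <tt class="doctest"><span class="pre">\b</span></tt>, which Python itself interprets as a backspace character in an ordinary string literal. The following short sketch uses an invented sample string and variable names chosen for illustration.</p>

```python
import re

plain_pattern = '\bthe\b'    # Python turns each \b into a backspace character
raw_pattern = r'\bthe\b'     # backslashes survive, so re sees word boundaries

print(len(plain_pattern), len(raw_pattern))

sample = 'the other the'
raw_hits = re.findall(raw_pattern, sample)      # whole-word matches only
plain_hits = re.findall(plain_pattern, sample)  # looks for literal backspaces
print(raw_hits, plain_hits)
```

<p>The plain string silently becomes a different pattern, which is why the raw prefix is recommended for every regular expression, including those that happen to work without it.</p>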
<p><font id="560">Splitting on whitespace gives us tokens like <tt class="doctest"><span class="pre"><span class="pysrc-string">'(not'</span></span></tt> and <tt class="doctest"><span class="pre"><span class="pysrc-string">'herself,'</span></span></tt>. </font><font id="561">An alternative is to use the character class <tt class="doctest"><span class="pre">\w</span></tt> that Python provides for word characters, equivalent to <tt class="doctest"><span class="pre">[a-zA-Z0-9_]</span></tt> for ASCII text. </font><font id="562">It also defines the complement of this class, <tt class="doctest"><span class="pre">\W</span></tt>, i.e. </font><font id="563">all characters other than letters, digits, or underscore. </font><font id="564">We can use <tt class="doctest"><span class="pre">\W</span></tt> in a simple regular expression to split the input on anything <em>other</em> than a word character:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>re.split(r<span class="pysrc-string">'\W+'</span>, raw)
<span class="pysrc-output">['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in',</span>
<span class="pysrc-output">'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper',</span>
<span class="pysrc-output">'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without',</span>
<span class="pysrc-output">'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered',</span>
<span class="pysrc-output">'']</span></pre>
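<p><font id="565">Observe that this gives us empty strings at the start and the end (to understand why, try <tt class="doctest"><span class="pre"><span class="pysrc-string">'xx'</span>.split(<span class="pysrc-string">'x'</span>)</span></tt>). </font><font id="566">We get the same tokens, but without the empty strings, with <tt class="doctest"><span class="pre">re.findall(r<span class="pysrc-string">'\w+'</span>, raw)</span></tt>, using a pattern that matches the words instead of the gaps between them. </font><font id="567">Now that we are matching the words, we are in a position to extend the regular expression to cover a wider range of cases. </font><font id="568">The regular expression «<tt class="doctest"><span class="pre">\w+|\S\w*</span></tt>» will first try to match any sequence of word characters. </font><font id="569">If no match is found, it will try to match any <em>non</em>-whitespace character (<tt class="doctest"><span class="pre">\S</span></tt> is the complement of <tt class="doctest"><span class="pre">\s</span></tt>) followed by further word characters. </font><font id="570">This means that punctuation is grouped with any following letters (e.g. </font><font id="571"><span class="example">'s</span>), but that sequences of two or more punctuation characters are separated.</font></p>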
<pre class="doctest"><span class="pysrc-prompt">>>> </span>re.findall(r<span class="pysrc-string">'\w+|\S\w*'</span>, raw)
<span class="pysrc-output">["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',</span>
<span class="pysrc-output">'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",</span>
<span class="pysrc-output">'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',</span>
<span class="pysrc-output">'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',</span>
<span class="pysrc-output">'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']</span></pre>
<p><font id="572">Let's generalize the <tt class="doctest"><span class="pre">\w+</span></tt> in the preceding expression to permit word-internal hyphens and apostrophes: «<tt class="doctest"><span class="pre">\w+([-']\w+)*</span></tt>». </font><font id="573">This expression means <tt class="doctest"><span class="pre">\w+</span></tt> followed by zero or more instances of <tt class="doctest"><span class="pre">[-']\w+</span></tt>; it would match <span class="example">hot-tempered</span> and <span class="example">it's</span>. </font><font id="574">(We need to include <tt class="doctest"><span class="pre">?:</span></tt> in this expression for reasons discussed earlier.) </font><font id="575">We will also add a pattern to match quote characters so that these are kept separate from the text they enclose.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(re.findall(r<span class="pysrc-string">"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*"</span>, raw))
<span class="pysrc-output">["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',</span>
<span class="pysrc-output">'(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I',</span>
<span class="pysrc-output">"won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup',</span>
<span class="pysrc-output">'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper',</span>
<span class="pysrc-output">'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']</span></pre>
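<p><font id="576">The expression above also includes «<tt class="doctest"><span class="pre">[-.(]+</span></tt>», which causes the double hyphen, ellipsis, and open parenthesis to be tokenized separately.</font></p>
<p><font id="577"><a class="reference internal" href="./ch03.html#tab-re-symbols">3.4</a> lists the regular expression character class symbols we have seen in this section, in addition to some other useful symbols.</font></p>
<p class="caption"><font id="578"><span class="caption-label">Table 3.4</span>:</font></p>
<p><font id="579">Regular Expression Symbols</font></p>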
<p></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>text = <span class="pysrc-string">'That U.S.A. poster-print costs $12.40...'</span>
<span class="pysrc-prompt">>>> </span>pattern = r<span class="pysrc-string">'''(?x) # set flag to allow verbose regexps</span>
<span class="pysrc-more">... </span><span class="pysrc-string">    (?:[A-Z]\.)+       # abbreviations, e.g. U.S.A.</span>
<span class="pysrc-more">... </span><span class="pysrc-string">  | \w+(?:-\w+)*       # words with optional internal hyphens</span>
<span class="pysrc-more">... </span><span class="pysrc-string">  | \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%</span>
<span class="pysrc-more">... </span><span class="pysrc-string">  | \.\.\.             # ellipsis</span>
<span class="pysrc-more">... </span><span class="pysrc-string">  | [][.,;"'?():_`-]   # these are separate tokens; includes ], [</span>
<span class="pysrc-more">... </span><span class="pysrc-string">'''</span>
<span class="pysrc-prompt">>>> </span>nltk.regexp_tokenize(text, pattern)
<span class="pysrc-output">['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']</span></pre>
<p><font id="605">When using the verbose flag, you can no longer use <tt class="doctest"><span class="pre"><span class="pysrc-string">' '</span></span></tt> to match a space character; use <tt class="doctest"><span class="pre">\s</span></tt> instead. </font><font id="606">The <tt class="doctest"><span class="pre">regexp_tokenize()</span></tt> function has an optional <tt class="doctest"><span class="pre">gaps</span></tt> parameter. </font><font id="607">When set to <tt class="doctest"><span class="pre">True</span></tt>, the regular expression specifies the gaps between tokens, as with <tt class="doctest"><span class="pre">re.split()</span></tt>.</font></p>
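<p>The behavior of whitespace under the verbose flag can be checked directly with the <tt class="doctest"><span class="pre">re</span></tt> module. The sample string and variable names below are invented for illustration.</p>

```python
import re

text = 'hot dog'

# Under (?x) the space in the pattern is ignored, so this pattern means 'hotdog'.
ignored_space = re.findall(r'(?x) hot dog', text)

# \s matches the space explicitly.
with_s = re.findall(r'(?x) hot \s dog', text)

# Whitespace inside a character class is also preserved in verbose mode.
with_class = re.findall(r'(?x) hot [ ] dog', text)

print(ignored_space, with_s, with_class)
```

<p>Using <tt class="doctest"><span class="pre">\s</span></tt> is usually the clearer choice, although it also matches tabs and newlines, which a literal space would not.</p>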
<div class="note"><p class="first admonition-title"><font id="608">Note</font></p>
<p class="last"><font id="609">We can evaluate a tokenizer by comparing the resulting tokens with a wordlist and reporting any tokens that do not appear in the wordlist, using <tt class="doctest"><span class="pre">set(tokens).difference(wordlist)</span></tt>. </font><font id="610">You will probably want to lowercase all the tokens first.</font></p>
</div>
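<p>As a small sketch of this check, the following uses a tiny made-up wordlist and a simple regular expression tokenizer. In practice the wordlist would come from a real lexicon; everything named here is an assumption for illustration.</p>

```python
import re

# Hypothetical wordlist; a real one would be far larger.
wordlist = {'the', 'cat', 'sat', 'on', 'mat'}

text = "The cat sat on the mat, didn't it?"
tokens = re.findall(r"\w+(?:[-']\w+)*|\S", text)

# Lowercase first so that 'The' is not reported as unknown.
unknown = set(t.lower() for t in tokens).difference(wordlist)
print(sorted(unknown))
```

<p>The unknown tokens include punctuation and the contraction, which shows where either the tokenizer or the wordlist needs attention.</p>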
</div>
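<div class="section" id="further-issues-with-tokenization"><h3 class="sigil_not_in_toc"><font id="611">Further Issues with Tokenization</font></h3>
<p><font id="612">Tokenization turns out to be a far more difficult task than you might have expected. </font><font id="613">No single solution works well across the board, and we must decide what counts as a token depending on the application domain.</font></p>
<p><font id="614">When developing a tokenizer it helps to have access to raw text that has been manually tokenized, so that you can compare the output of your tokenizer with high-quality (or "gold-standard") tokens. </font><font id="615">The NLTK corpus collection includes a sample of Penn Treebank data, including the raw Wall Street Journal text (<tt class="doctest"><span class="pre">nltk.corpus.treebank_raw.raw()</span></tt>) and the tokenized version (<tt class="doctest"><span class="pre">nltk.corpus.treebank.words()</span></tt>).</font></p>
<p><font id="616">A final issue for tokenization is the presence of contractions, such as <span class="example">didn't</span>. </font><font id="617">If we are analyzing the meaning of a sentence, it would probably be more useful to normalize this form to two separate forms: <span class="example">did</span> and <span class="example">n't</span> (or <span class="example">not</span>). </font><font id="618">We can do this work with the help of a lookup table.</font></p>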
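<p>A lookup table for contractions can be as simple as a dictionary. The sketch below is illustrative only: the table <tt class="doctest"><span class="pre">CONTRACTIONS</span></tt> has three hand-picked entries and the function name <tt class="doctest"><span class="pre">expand_contractions</span></tt> is invented.</p>

```python
# Hypothetical lookup table; a usable one would list many more contractions.
CONTRACTIONS = {
    "didn't": ['did', "n't"],
    "won't": ['will', "n't"],
    "it's": ['it', "'s"],
}

def expand_contractions(tokens):
    """Replace each known contraction with its component forms."""
    expanded = []
    for token in tokens:
        # Unknown tokens pass through unchanged.
        expanded.extend(CONTRACTIONS.get(token.lower(), [token]))
    return expanded

result = expand_contractions(['I', "won't", 'say', 'it', "didn't", 'work'])
print(result)
```

<p>A table handles irregular cases such as <span class="example">won't</span>, which a purely pattern-based split at <span class="example">n't</span> would turn into <span class="example">wo</span> and <span class="example">n't</span>.</p>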
</div>
</div>
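<div class="section" id="segmentation"><h2 class="sigil_not_in_toc"><font id="619">3.8 Segmentation</font></h2>
<p><font id="620">This section discusses more advanced concepts, which you may prefer to skip on a first reading of this chapter.</font></p>
<p><font id="621">Tokenization is an instance of a more general problem of <span class="termdef">segmentation</span>. </font><font id="622">In this section we will look at two other instances of this problem, which use radically different techniques from the ones we have seen so far in this chapter.</font></p>
<div class="section" id="sentence-segmentation"><h3 class="sigil_not_in_toc"><font id="623">Sentence Segmentation</font></h3>
<p><font id="624">Manipulating texts at the level of individual words often presupposes the ability to divide a text into individual sentences. </font><font id="625">As we have seen, some corpora already provide access at the sentence level. </font><font id="626">In the following example, we compute the average number of words per sentence in the Brown Corpus:</font></p>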
<pre class="doctest"><span class="pysrc-prompt">>>> </span>len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
<span class="pysrc-output">20.250994070456922</span></pre>
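<p><font id="627">In other cases, the text is only available as a stream of characters. </font><font id="628">Before tokenizing the text into words, we need to segment it into sentences. </font><font id="629">NLTK facilitates this by including the Punkt sentence segmenter <a class="reference external" href="./bibliography.html#kissstrunk2006" id="id1">(Kiss & Strunk, 2006)</a>. </font><font id="630">Here is an example of its use in segmenting the text of a novel. </font><font id="631">(Note that if the segmenter's internal data has been updated by the time you read this, you will see different output):</font></p>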
<pre class="doctest"><span class="pysrc-prompt">>>> </span>text = nltk.corpus.gutenberg.raw(<span class="pysrc-string">'chesterton-thursday.txt'</span>)
<span class="pysrc-prompt">>>> </span>sents = nltk.sent_tokenize(text)
<span class="pysrc-prompt">>>> </span>pprint.pprint(sents[79:89])
<span class="pysrc-output">['"Nonsense!"',</span>
<span class="pysrc-output"> 'said Gregory, who was very rational when anyone else\nattempted paradox.',</span>
<span class="pysrc-output"> '"Why do all the clerks and navvies in the\n'</span>
<span class="pysrc-output"> 'railway trains look so sad and tired, so very sad and tired?',</span>
<span class="pysrc-output"> 'I will\ntell you.',</span>
<span class="pysrc-output"> 'It is because they know that the train is going right.',</span>
<span class="pysrc-output"> 'It\n'</span>
<span class="pysrc-output"> 'is because they know that whatever place they have taken a ticket\n'</span>
<span class="pysrc-output"> 'for that place they will reach.',</span>
<span class="pysrc-output"> 'It is because after they have\n'</span>
<span class="pysrc-output"> 'passed Sloane Square they know that the next station must be\n'</span>
<span class="pysrc-output"> 'Victoria, and nothing but Victoria.',</span>
<span class="pysrc-output"> 'Oh, their wild rapture!',</span>
<span class="pysrc-output"> 'oh,\n'</span>
<span class="pysrc-output"> 'their eyes like stars and their souls again in Eden, if the next\n'</span>
<span class="pysrc-output"> 'station were unaccountably Baker Street!"',</span>
<span class="pysrc-output"> '"It is you who are unpoetical," replied the poet Syme.']</span></pre>
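<p><font id="632">Notice that this example is really a single sentence, reporting the speech of Mr Lucian Gregory. </font><font id="633">However, the quoted speech contains several sentences, and these have been split into individual strings. </font><font id="634">This is reasonable behavior for most applications.</font></p>
<p><font id="635">Sentence segmentation is difficult because the period is used to mark abbreviations, and some periods simultaneously mark an abbreviation and terminate a sentence, as often happens with acronyms like <span class="example">U.S.A.</span></font></p>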
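<p>To see why periods are a problem, consider a deliberately naive splitter that breaks after every period, exclamation mark, or question mark. This is a hypothetical sketch for illustration, not how the Punkt segmenter works, and the sample text is invented.</p>

```python
import re

def naive_sentences(text):
    """Split after any '.', '!' or '?' that is followed by whitespace."""
    return re.split(r'(?<=[.!?])\s+', text)

text = 'He moved to the U.S.A. in 1990. She stayed.'
pieces = naive_sentences(text)
print(pieces)
```

<p>The abbreviation triggers a spurious break in the middle of the first sentence. Telling such cases apart requires knowledge about abbreviations, which is what a trained segmenter supplies.</p>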
<p><font id="636">For another approach to sentence segmentation, see <a class="reference external" href="./ch06.html#sec-further-examples-of-supervised-classification">2</a>.</font></p>
</div>
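<div class="section" id="word-segmentation"><h3 class="sigil_not_in_toc"><font id="637">Word Segmentation</font></h3>
<p><font id="638">For some writing systems, tokenizing text is made more difficult by the fact that there is no visual representation of word boundaries. </font><font id="639">For example, in Chinese, the three-character string 爱国人 (ai4 “love” [verb], guo3 “country”, ren2 “person”) could be tokenized as 爱国 / 人, “country-loving person”, or as 爱 / 国人, “love country-person”.</font></p>
<p><font id="640">A similar problem arises in the processing of spoken language, where the hearer must segment a continuous speech stream into individual words. </font><font id="641">A particularly challenging version of this problem arises when we do not know the words in advance. </font><font id="642">This is the problem faced by a language learner, such as a child hearing utterances from a parent. </font><font id="643">Consider the following artificial example, where word boundaries have been removed:</font></p>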
<p></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>text = <span class="pysrc-string">"doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"</span>
<span class="pysrc-prompt">>>> </span>seg1 = <span class="pysrc-string">"0000000000000001000000000010000000000000000100000000000"</span>
<span class="pysrc-prompt">>>> </span>seg2 = <span class="pysrc-string">"0100100100100001001001000010100100010010000100010010000"</span></pre>
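<p><font id="657">Observe that the segmentation strings consist of zeros and ones. </font><font id="658">They are one character shorter than the source text, since a text of length <span class="math">n</span> can only be broken up in <span class="math">n-1</span> places. </font><font id="659">The <tt class="doctest"><span class="pre">segment()</span></tt> function in <a class="reference internal" href="./ch03.html#code-segment">3.7</a> demonstrates that we can get back to the original segmented text from this representation.</font></p>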
<div class="pylisting"><p></p>
<pre class="doctest"><span class="pysrc-keyword">def</span> <span class="pysrc-defname">segment</span>(text, segs):
    words = []
    last = 0
    <span class="pysrc-keyword">for</span> i <span class="pysrc-keyword">in</span> range(len(segs)):
        <span class="pysrc-keyword">if</span> segs[i] == <span class="pysrc-string">'1'</span>:
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words</pre>
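<p>A short worked example may help to show how the bit string and the word list correspond. The sketch below repeats <tt class="doctest"><span class="pre">segment()</span></tt> so that it runs on its own, and adds an inverse helper, <tt class="doctest"><span class="pre">to_segs()</span></tt>, which is hypothetical and assumes every word is non-empty.</p>

```python
def segment(text, segs):
    """Cut text after every position whose bit in segs is '1'."""
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i + 1
    words.append(text[last:])
    return words

def to_segs(words):
    """Hypothetical inverse of segment(): build the bit string for a word list."""
    # Each word contributes zeros for its internal positions and a '1' at its end;
    # the final '1' is dropped because the end of the text is not a split point.
    bits = ''.join('0' * (len(word) - 1) + '1' for word in words)
    return bits[:-1]

words = ['see', 'the', 'doggy']
text = ''.join(words)
segs = to_segs(words)
print(segs, segment(text, segs))
```

<p>The bit string has one position fewer than the text, matching the observation that a text of length <span class="math">n</span> has <span class="math">n-1</span> possible split points.</p>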
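<p><font id="661">Now the segmentation task becomes a search problem: find the bit string that causes the text string to be correctly segmented into words. </font><font id="662">We assume the learner is acquiring words and storing them in an internal lexicon. </font><font id="663">Given a suitable lexicon, it is possible to reconstruct the source text as a sequence of lexical items. </font><font id="664">Following <a class="reference external" href="./bibliography.html#brent1995" id="id2">(Brent, 1995)</a>, we can define an <span class="termdef">objective function</span>, a scoring function whose value we will try to optimize, based on the size of the lexicon and the amount of information needed to reconstruct the source text from the lexicon. </font><font id="665">We illustrate this in <a class="reference internal" href="./ch03.html#fig-brent">3.8</a>.</font></p>
<div class="figure" id="fig-brent"><img alt="Images/brent.png" src="Images/ced4e829d6a662a2be20187f9d7b71b5.jpg" style="width: 711.3px; height: 267.59999999999997px;"/><p class="caption"><font id="666"><span class="caption-label">Figure 3.8</span>: Calculation of the objective function: given a hypothetical segmentation of the source text (on the left), derive a lexicon and a derivation table that permit the source text to be reconstructed, then total up the number of characters used by each lexical item (including a boundary marker) and by the derivation table, to serve as a score of the quality of the segmentation; smaller values of the score indicate a better segmentation.</font></p>
</div>
<p><font id="667">It is a simple matter to implement this objective function, as shown in <a class="reference internal" href="./ch03.html#code-evaluate">3.9</a>.</font></p>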
<div class="pylisting"><p></p>
<pre class="doctest"><span class="pysrc-keyword">def</span> <span class="pysrc-defname">evaluate</span>(text, segs):
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 <span class="pysrc-keyword">for</span> word <span class="pysrc-keyword">in</span> set(words))
    return text_size + lexicon_size</pre>
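<p>The arithmetic behind the score can be traced by hand on the example text. The helper <tt class="doctest"><span class="pre">score_words()</span></tt> below is a hypothetical variant that applies the same formula directly to a list of words, so that no bit strings are needed; the two segmentations are written out as word lists for illustration.</p>

```python
def score_words(words):
    """Same objective as evaluate(), computed from a word list."""
    text_size = len(words)                                    # one entry per word used
    lexicon_size = sum(len(word) + 1 for word in set(words))  # +1 for a boundary marker
    return text_size + lexicon_size

# A coarse segmentation: four long "words", none of them repeated.
coarse = ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

# A fine segmentation: sixteen tokens drawn from a seven-word lexicon.
fine = 'do you see the kitty see the doggy do you like the kitty like the doggy'.split()

print(score_words(coarse))   # 4 + (17 + 12 + 18 + 13)
print(score_words(fine))     # 16 + (3 + 4 + 4 + 4 + 6 + 6 + 5)
```

<p>The fine segmentation scores lower because its words are reused: each distinct word is paid for once in the lexicon, however often it occurs in the text.</p>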
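<p><font id="669">The final step is to search for the pattern of zeros and ones that minimizes this objective function, as shown in <a class="reference internal" href="./ch03.html#code-anneal">3.10</a>. </font><font id="670">Notice that the best segmentation includes "words" like <span class="example">thekitty</span>, since there is not enough evidence in the data to split this any further.</font></p>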
<div class="pylisting"><p></p>
<pre class="doctest"><span class="pysrc-keyword">from</span> random <span class="pysrc-keyword">import</span> randint

<span class="pysrc-keyword">def</span> <span class="pysrc-defname">flip</span>(segs, pos):
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]

<span class="pysrc-keyword">def</span> <span class="pysrc-defname">flip_n</span>(segs, n):
    <span class="pysrc-keyword">for</span> i <span class="pysrc-keyword">in</span> range(n):
        segs = flip(segs, randint(0, len(segs)-1))
    return segs

<span class="pysrc-keyword">def</span> <span class="pysrc-defname">anneal</span>(text, segs, iterations, cooling_rate):
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        <span class="pysrc-keyword">for</span> i <span class="pysrc-keyword">in</span> range(iterations):
            guess = flip_n(segs, round(temperature))
            score = evaluate(text, guess)
            <span class="pysrc-keyword">if</span> score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs
        temperature = temperature / cooling_rate
        <span class="pysrc-keyword">print</span>(evaluate(text, segs), segment(text, segs))
    <span class="pysrc-keyword">print</span>()
    return segs</pre>
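<p>The search can be tried end to end with a compact, self-contained variant. This is a sketch under stated assumptions rather than a copy of the listing: it keeps the same scoring and cooling scheme, but <tt class="doctest"><span class="pre">quiet_anneal()</span></tt> is a hypothetical name, it prints nothing while searching, and it takes a <tt class="doctest"><span class="pre">seed</span></tt> so that runs are repeatable. The iteration count and cooling rate are arbitrary choices.</p>

```python
import random

def segment(text, segs):
    words, last = [], 0
    for i, bit in enumerate(segs):
        if bit == '1':
            words.append(text[last:i+1])
            last = i + 1
    words.append(text[last:])
    return words

def evaluate(text, segs):
    words = segment(text, segs)
    return len(words) + sum(len(word) + 1 for word in set(words))

def flip_n(segs, n, rng):
    """Flip n randomly chosen bits (positions may repeat)."""
    bits = list(segs)
    for _ in range(n):
        pos = rng.randrange(len(bits))
        bits[pos] = '1' if bits[pos] == '0' else '0'
    return ''.join(bits)

def quiet_anneal(text, segs, iterations, cooling_rate, seed=0):
    rng = random.Random(seed)
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for _ in range(iterations):
            guess = flip_n(segs, round(temperature), rng)
            score = evaluate(text, guess)
            if score < best:                 # keep only strict improvements
                best, best_segs = score, guess
        segs = best_segs
        temperature /= cooling_rate          # fewer bits are flipped as we cool
    return segs

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
start = '0' * (len(text) - 1)                # begin with no boundaries at all
result = quiet_anneal(text, start, iterations=200, cooling_rate=1.2)
print(evaluate(text, start), evaluate(text, result), segment(text, result))
```

<p>Because a guess is accepted only when it lowers the score, the final score can never be worse than the starting one. The segmentation found depends on the seed and on the number of iterations, so no particular output is shown here.</p>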
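<p><font id="673">With enough data, it is possible to automatically segment text into words with a reasonable degree of accuracy. </font><font id="674">Such methods can be applied to tokenization for writing systems that do not have any visual representation of word boundaries.</font></p>
<div class="section" id="formatting-from-lists-to-strings"><h2 class="sigil_not_in_toc"><font id="675">3.9 Formatting: From Lists to Strings</font></h2>
<p><font id="676">Often we write a program to report a single data item, such as a particular element in a corpus that meets some complicated criterion, or a single summary statistic such as a word count or the performance of a tagger. </font><font id="677">More often, we write a program to produce a structured result; for example, a tabulation of numbers or linguistic forms, or a reformatting of the original data. </font><font id="678">When the results to be presented are linguistic, textual output is usually the most natural choice. </font><font id="679">However, when the results are numerical, it may be preferable to produce graphical output. </font><font id="680">In this section, you will learn about a variety of ways to present program output.</font></p>
<div class="section" id="from-lists-to-strings"><h3 class="sigil_not_in_toc"><font id="681">From Lists to Strings</font></h3>
<p><font id="682">The simplest kind of structured object we use for text processing is a list of words. </font><font id="683">When we want to output these to a display or a file, we must convert the list into a string. </font><font id="684">To do this in Python we use the <tt class="doctest"><span class="pre">join()</span></tt> method, and specify the string to be used as the "glue".</font></p>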
<pre class="doctest"><span class="pysrc-prompt">>>> </span>silly = [<span class="pysrc-string">'We'</span>, <span class="pysrc-string">'called'</span>, <span class="pysrc-string">'him'</span>, <span class="pysrc-string">'Tortoise'</span>, <span class="pysrc-string">'because'</span>, <span class="pysrc-string">'he'</span>, <span class="pysrc-string">'taught'</span>, <span class="pysrc-string">'us'</span>, <span class="pysrc-string">'.'</span>]
<span class="pysrc-prompt">>>> </span><span class="pysrc-string">' '</span>.join(silly)
<span class="pysrc-output">'We called him Tortoise because he taught us .'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-string">';'</span>.join(silly)
<span class="pysrc-output">'We;called;him;Tortoise;because;he;taught;us;.'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-string">''</span>.join(silly)
<span class="pysrc-output">'WecalledhimTortoisebecausehetaughtus.'</span></pre>
<p><font id="685">So <tt class="doctest"><span class="pre"><span class="pysrc-string">' '</span>.join(silly)</span></tt> means: take all the items in <tt class="doctest"><span class="pre">silly</span></tt> and concatenate them into one big string, using <tt class="doctest"><span class="pre"><span class="pysrc-string">' '</span></span></tt> as a spacer between the items. </font><font id="686">That is, </font><font id="687"><tt class="doctest"><span class="pre">join()</span></tt> is a method of the string that you want to use as the glue. </font><font id="688">(Many people find this notation for <tt class="doctest"><span class="pre">join()</span></tt> counter-intuitive.) </font><font id="689">The <tt class="doctest"><span class="pre">join()</span></tt> method only works on a list of strings (what we have been calling a text), a complex type that enjoys some privileges in Python.</font></p>
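<p>Two consequences of this design are worth trying out: <tt class="doctest"><span class="pre">split()</span></tt> with the same glue string undoes <tt class="doctest"><span class="pre">join()</span></tt>, and items that are not strings must be converted first. The variable names in this short sketch are chosen for illustration.</p>

```python
words = ['We', 'called', 'him', 'Tortoise']
joined = ' '.join(words)
round_trip = joined.split(' ')        # splitting on the glue recovers the list

counts = [3, 4, 1]
try:
    ', '.join(counts)                 # integers are not strings
    join_error = None
except TypeError as err:
    join_error = err

# Convert each item to a string before joining.
joined_counts = ', '.join(str(n) for n in counts)
print(joined, round_trip, joined_counts)
```

<p>The round trip only works when no item contains the glue string itself, which is one reason to pick a separator that cannot occur in the data.</p>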
</div>
<div class="section" id="strings-and-formats"><h3 class="sigil_not_in_toc"><font id="690">Strings and Formats</font></h3>
<p><font id="691">We have seen that there are two ways to display the contents of an object:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>word = <span class="pysrc-string">'cat'</span>
<span class="pysrc-prompt">>>> </span>sentence = <span class="pysrc-string">"""hello</span>
<span class="pysrc-more">... </span><span class="pysrc-string">world"""</span>
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(word)
<span class="pysrc-output">cat</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">print</span>(sentence)
<span class="pysrc-output">hello</span>
<span class="pysrc-output">world</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>word
<span class="pysrc-output">'cat'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span>sentence
<span class="pysrc-output">'hello\nworld'</span></pre>
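<p><font id="692">The <tt class="doctest"><span class="pre"><span class="pysrc-keyword">print</span></span></tt> command yields Python's attempt to produce the most human-readable form of an object. </font><font id="693">The second method, naming the variable at a prompt, shows us a string that can be used to recreate this object. </font><font id="694">It is important to keep in mind that both of these are just strings, displayed for the benefit of you, the user. </font><font id="695">They do not give us any clue as to the actual internal representation of the object.</font></p>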
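<p>The two displays correspond to the built-in functions <tt class="doctest"><span class="pre">str()</span></tt> and <tt class="doctest"><span class="pre">repr()</span></tt>. The following minimal sketch, with variable names chosen for illustration, shows that the prompt-style form can be evaluated to rebuild the original string.</p>

```python
import ast

sentence = 'hello\nworld'

shown_by_print = str(sentence)     # what print() writes: a real line break
shown_at_prompt = repr(sentence)   # what the prompt echoes: quotes and \n escape

print(shown_by_print)
print(shown_at_prompt)

# The repr of a string is a valid Python literal, so it can be read back safely.
rebuilt = ast.literal_eval(shown_at_prompt)
print(rebuilt == sentence)
```

<p>For strings the round trip is exact. For arbitrary objects, <tt class="doctest"><span class="pre">repr()</span></tt> is only a convention, and many types return a description that cannot be evaluated.</p>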
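<p><font id="696">There are many other useful ways to display an object as a string of characters. </font><font id="697">This may be for the benefit of a human reader, or because we want to <span class="termdef">export</span> our data to a particular file format for use in an external program.</font></p>
<p><font id="698">Formatted output typically contains a combination of variables and pre-specified strings, e.g. </font><font id="699">given a frequency distribution <tt class="doctest"><span class="pre">fdist</span></tt> we could do:</font></p>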
<pre class="doctest"><span class="pysrc-prompt">>>> </span>fdist = nltk.FreqDist([<span class="pysrc-string">'dog'</span>, <span class="pysrc-string">'cat'</span>, <span class="pysrc-string">'dog'</span>, <span class="pysrc-string">'cat'</span>, <span class="pysrc-string">'dog'</span>, <span class="pysrc-string">'snake'</span>, <span class="pysrc-string">'dog'</span>, <span class="pysrc-string">'cat'</span>])
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">for</span> word <span class="pysrc-keyword">in</span> sorted(fdist):
<span class="pysrc-more">... </span> <span class="pysrc-keyword">print</span>(word, <span class="pysrc-string">'->'</span>, fdist[word], end=<span class="pysrc-string">'; '</span>)
<span class="pysrc-output">cat -> 3; dog -> 4; snake -> 1;</span></pre>
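<p><font id="700">Print statements that contain alternating variables and constants can be difficult to read and maintain. </font><font id="701">A better solution is to use <span class="termdef">string formatting expressions</span>.</font></p>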
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">for</span> word <span class="pysrc-keyword">in</span> sorted(fdist):
<span class="pysrc-more">... </span> <span class="pysrc-keyword">print</span>(<span class="pysrc-string">'{}->{};'</span>.format(word, fdist[word]), end=<span class="pysrc-string">' '</span>)
<span class="pysrc-output">cat->3; dog->4; snake->1;</span></pre>
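<p>Formatting and joining combine naturally when the whole line is wanted as a single string, for example to write it to a file. This sketch uses <tt class="doctest"><span class="pre">collections.Counter</span></tt> from the standard library as a stand-in for <tt class="doctest"><span class="pre">nltk.FreqDist</span></tt>; that substitution is an assumption made so the example runs without NLTK.</p>

```python
from collections import Counter

# Counter stands in for nltk.FreqDist here; both map items to counts.
counts = Counter(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])

# Format each entry, then glue the pieces together with '; '.
line = '; '.join('{}->{}'.format(word, counts[word]) for word in sorted(counts))
print(line)
```

<p>Building the string with <tt class="doctest"><span class="pre">join()</span></tt> avoids the trailing separator that the loop with <tt class="doctest"><span class="pre">end=</span></tt> leaves behind.</p>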
<p><font id="702">To understand what is going on here, let's test out the format string on its own. </font><font id="703">(By now this will be your usual method of exploring new syntax.)</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'{}->{};'</span>.format (<span class="pysrc-string">'cat'</span>, 3)
<span class="pysrc-output">'cat->3;'</span></pre>
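<p><font id="704">The curly brackets <tt class="doctest"><span class="pre"><span class="pysrc-string">'{}'</span></span></tt> mark the presence of a <span class="termdef">replacement field</span>: this acts as a placeholder for the string values of objects that are passed to the <tt class="doctest"><span class="pre">str.format()</span></tt> method. </font><font id="705">We can embed occurrences of <tt class="doctest"><span class="pre"><span class="pysrc-string">'{}'</span></span></tt> inside a string, then replace them with strings by calling <tt class="doctest"><span class="pre">format()</span></tt> with appropriate arguments. </font><font id="706">A string containing replacement fields is called a <span class="termdef">format string</span>.</font></p>
<p><font id="707">Let's unpack the code above further, in order to see this behavior up close:</font></p>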
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'{}->'</span>.format(<span class="pysrc-string">'cat'</span>)
<span class="pysrc-output">'cat->'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'{}'</span>.format(3)
<span class="pysrc-output">'3'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'I want a {} right now'</span>.format(<span class="pysrc-string">'coffee'</span>)
<span class="pysrc-output">'I want a coffee right now'</span></pre>
<p><font id="708">We can have any number of placeholders, but the <tt class="doctest"><span class="pre">str.format</span></tt> method must be called with at least that many arguments; supplying too few raises an error.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'{} wants a {} {}'</span>.format(<span class="pysrc-string">'Lee'</span>, <span class="pysrc-string">'sandwich'</span>, <span class="pysrc-string">'for lunch'</span>)
<span class="pysrc-output">'Lee wants a sandwich for lunch'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'{} wants a {} {}'</span>.format(<span class="pysrc-string">'sandwich'</span>, <span class="pysrc-string">'for lunch'</span>)
<span class="pysrc-except">Traceback (most recent call last):</span>
<span class="pysrc-except">...</span>
<span class="pysrc-except">    '{} wants a {} {}'.format('sandwich', 'for lunch')</span>
<span class="pysrc-except">IndexError: tuple index out of range</span></pre>
<p><font id="709">Arguments to <tt class="doctest"><span class="pre">format()</span></tt> are consumed from left to right, and any superfluous arguments are simply ignored.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'{} wants a {}'</span>.format(<span class="pysrc-string">'Lee'</span>, <span class="pysrc-string">'sandwich'</span>, <span class="pysrc-string">'for lunch'</span>)
<span class="pysrc-output">'Lee wants a sandwich'</span></pre>
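<p>The two cases above can be checked side by side. The following is a minimal sketch; the variable names are illustrative, and the exact wording of the error message is not relied on because it differs between Python versions.</p>

```python
# Sketch: how str.format() reacts to too many or too few positional arguments.
template = '{} wants a {}'

# Extra positional arguments are silently ignored.
extra = template.format('Lee', 'sandwich', 'for lunch')

# Too few arguments raise IndexError; we keep the exception object
# rather than its message, since the message text varies by version.
try:
    template.format('Lee')
    too_few_error = None
except IndexError as err:
    too_few_error = err

print(extra)
print(type(too_few_error).__name__)
```

<p>Catching the exception like this is mainly useful when the format string comes from outside the program and its number of fields is not known in advance.</p>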
<p><font id="712">A replacement field in a format string can start with a number, which refers to a positional argument of <tt class="doctest"><span class="pre">format()</span></tt>.</font><font id="713"> Something like <tt class="doctest"><span class="pre"><span class="pysrc-string">'from {} to {}'</span></span></tt> is equivalent to <tt class="doctest"><span class="pre"><span class="pysrc-string">'from {0} to {1}'</span></span></tt>, but we can use the numbers to get a non-default order:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'from {1} to {0}'</span>.format(<span class="pysrc-string">'A'</span>, <span class="pysrc-string">'B'</span>)
<span class="pysrc-output">'from B to A'</span></pre>
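<p>Numbered fields have two further consequences that are easy to try out: the same index may be used more than once, and fields may also be named and filled from keyword arguments. Named fields go slightly beyond what the text above covers; the sketch below uses made-up strings purely for illustration.</p>

```python
# A positional index may appear more than once in the same format string.
echo = '{0}, {0}, {1}!'.format('spam', 'eggs')

# Named fields are filled from keyword arguments, so argument order is irrelevant.
route = 'from {start} to {end}'.format(end='B', start='A')

print(echo)
print(route)
```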
<p><font id="714">We can also provide the values for the placeholders indirectly.</font><font id="715"> Here is an example using a <tt class="doctest"><span class="pre"><span class="pysrc-keyword">for</span></span></tt> loop:</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span>template = <span class="pysrc-string">'Lee wants a {} right now'</span>
<span class="pysrc-prompt">>>> </span>menu = [<span class="pysrc-string">'sandwich'</span>, <span class="pysrc-string">'spam fritter'</span>, <span class="pysrc-string">'pancake'</span>]
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">for</span> snack <span class="pysrc-keyword">in</span> menu:
<span class="pysrc-more">... </span>    <span class="pysrc-keyword">print</span>(template.format(snack))
<span class="pysrc-more">...</span>
<span class="pysrc-output">Lee wants a sandwich right now</span>
<span class="pysrc-output">Lee wants a spam fritter right now</span>
<span class="pysrc-output">Lee wants a pancake right now</span></pre>
</div>
<div class="section" id="lining-things-up"><h3 class="sigil_not_in_toc"><font id="716">Lining Things Up</font></h3>
<p><font id="717">So far our format strings have generated output of arbitrary width on the page (or screen).</font><font id="718"> We can add padding to obtain output of a given width by inserting a colon <tt class="doctest"><span class="pre"><span class="pysrc-string">':'</span></span></tt> followed by an integer inside the brackets.</font><font id="719"> So <tt class="doctest"><span class="pre">{:6}</span></tt> specifies that we want the value padded to width 6.</font><font id="720"> Numbers are right-justified by default <a class="reference internal" href="./ch03.html#right-justified"><span id="ref-right-justified"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></span></a>, but we can precede the width specifier with a <tt class="doctest"><span class="pre"><span class="pysrc-string">'<'</span></span></tt> alignment option to make them left-justified <a class="reference internal" href="./ch03.html#left-justified"><span id="ref-left-justified"><img alt="[2]" class="callout" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></span></a>.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'{:6}'</span>.format(41) <a href="./ch03.html#ref-right-justified"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></a>
<span class="pysrc-output">'    41'</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'{:<6}'</span>.format(41) <a href="./ch03.html#ref-left-justified"><img alt="[2]" class="callout" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></a>
<span class="pysrc-output">'41    '</span></pre>
<p><font id="721">Strings are left-justified by default, but can be right-justified with the <tt class="doctest"><span class="pre"><span class="pysrc-string">'>'</span></span></tt> alignment option.</font></p>
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'{:6}'</span>.format(<span class="pysrc-string">'dog'</span>) <a href="./ch03.html#ref-left-justified-str"><img alt="[1]" class="callout" src="Images/7e6ea96aad77f3e523494b3972b5a989.jpg"/></a>
<span class="pysrc-output">'dog   '</span>
<span class="pysrc-output"></span><span class="pysrc-prompt">>>> </span><span class="pysrc-string">'{:>6}'</span>.format(<span class="pysrc-string">'dog'</span>) <a href="./ch03.html#ref-right-justified-str"><img alt="[2]" class="callout" src="Images/be33958d0b44c88caac0dcf4d4ec84c6.jpg"/></a>
<span class="pysrc-output">'   dog'</span></pre>
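<p>Width and alignment are most useful in combination, for example to print word counts in two aligned columns. The sketch below is only an illustration: it uses <tt class="doctest"><span class="pre">collections.Counter</span></tt> as a stand-in for a frequency distribution so that it runs without NLTK, and <tt class="doctest"><span class="pre">format_row</span></tt> and the column widths are hypothetical choices.</p>

```python
from collections import Counter

# Stand-in for a frequency distribution (assumption: Counter instead of nltk.FreqDist).
counts = Counter(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])

def format_row(word, count):
    # Word left-justified in 8 columns, count right-justified in 4 columns.
    return '{:<8}{:>4}'.format(word, count)

rows = [format_row(word, counts[word]) for word in sorted(counts)]
for row in rows:
    print(row)
```

<p>Every row comes out 12 characters wide, so the counts line up in a column however long each word is (as long as it fits within its field).</p>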
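<p><font id="724">Other control characters can be used to specify the sign and precision of floating-point numbers; for example, <tt class="doctest"><span class="pre">{:.4f}</span></tt> indicates that four digits should be displayed after the decimal point.</font></p>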
<pre class="doctest"><span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">import</span> math
<span class="pysrc-prompt">>>> </span><span class="pysrc-string">'{:.4f}'</span>.format(math.pi)
<span class="pysrc-output">'3.1416'</span></pre>
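<p>Sign, width and precision can be combined in a single field. As a rough sketch (the sample values and the particular specification are arbitrary), <tt class="doctest"><span class="pre">{:+8.2f}</span></tt> asks for an explicit sign, a minimum width of 8, and two digits after the decimal point:</p>

```python
import math

values = [math.pi, -math.e, 2.5]

# '+' forces a sign on positive numbers, 8 is the minimum field width,
# and '.2f' rounds to two digits after the decimal point.
formatted = ['{:+8.2f}'.format(value) for value in values]

for text in formatted:
    print(text)
```

<p>Because numbers are right-justified by default, the decimal points line up when the values are printed one per line.</p>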
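<p><font id="725">String formatting is smart enough to know that if you include a <tt class="doctest"><span class="pre"><span class="pysrc-string">'%'</span></span></tt> in your format specification, you want the value represented as a percentage; there is no need to multiply by 100.</font></p>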
<pre class="doctest"><span class="pysrc-prompt">>>> </span>count, total = 3205, 9375
<span class="pysrc-prompt">>>> </span><span class="pysrc-string">"accuracy for {} words: {:.4%}"</span>.format(total, count / total)
<span class="pysrc-output">'accuracy for 9375 words: 34.1867%'</span></pre>
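<p>To see what the <tt class="doctest"><span class="pre"><span class="pysrc-string">'%'</span></span></tt> type does, it can be compared with the manual route of multiplying by 100 and appending a percent sign. This is a small sketch reusing the numbers above; the variable names are illustrative.</p>

```python
count, total = 3205, 9375
ratio = count / total

# The '%' type multiplies by 100 and appends the percent sign for us.
with_percent = '{:.1%}'.format(ratio)

# The same result written out by hand with the 'f' type.
manual = '{:.1f}%'.format(ratio * 100)

print(with_percent, manual)
```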
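<p><font id="726">An important use of format strings is for tabulating data.</font><font id="727"> Recall that in <a class="reference external" href="./ch02.html#sec-extracting-text-from-corpora">1</a> we saw data being tabulated from a conditional frequency distribution.</font><font id="728"> Let's perform the tabulation ourselves, exercising full control of headings and column widths, as shown in <a class="reference internal" href="./ch03.html#code-modal-tabulate">3.11</a>.</font><font id="729"> Note the clear separation between the language processing work and the tabulation of results.</font></p>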
<div class="pylisting"><p></p>
<pre class="doctest"><span class="pysrc-keyword">def</span> <span class="pysrc-defname">tabulate</span>(cfdist, words, categories):
    <span class="pysrc-keyword">print</span>(<span class="pysrc-string">'{:16}'</span>.format(<span class="pysrc-string">'Category'</span>), end=<span class="pysrc-string">' '</span>)  <span class="pysrc-comment"># column headings</span>
    <span class="pysrc-keyword">for</span> word <span class="pysrc-keyword">in</span> words:
        <span class="pysrc-keyword">print</span>(<span class="pysrc-string">'{:>6}'</span>.format(word), end=<span class="pysrc-string">' '</span>)
    <span class="pysrc-keyword">print</span>()
    <span class="pysrc-keyword">for</span> category <span class="pysrc-keyword">in</span> categories:
        <span class="pysrc-keyword">print</span>(<span class="pysrc-string">'{:16}'</span>.format(category), end=<span class="pysrc-string">' '</span>)  <span class="pysrc-comment"># row heading</span>
        <span class="pysrc-keyword">for</span> word <span class="pysrc-keyword">in</span> words:  <span class="pysrc-comment"># for each word</span>
            <span class="pysrc-keyword">print</span>(<span class="pysrc-string">'{:6}'</span>.format(cfdist[category][word]), end=<span class="pysrc-string">' '</span>)  <span class="pysrc-comment"># print table cell</span>
        <span class="pysrc-keyword">print</span>()  <span class="pysrc-comment"># end the row</span>
<span class="pysrc-prompt">>>> </span><span class="pysrc-keyword">from</span> nltk.corpus <span class="pysrc-keyword">import</span> brown
<span class="pysrc-prompt">>>> </span>cfd = nltk.ConditionalFreqDist(