More on Jane Austen and stylistic signatures

Ted Underwood responded to my post on Jane Austen’s style, which pointed out the prevalence of adverbs, “to be” constructions, and terms of certainty, by raising the issue of baseline comparisons: “I’d like to know whether this is something about Austen in particular, or whether it’s a characteristic feature of a period/genre. I don’t intuitively know which is more likely.”

Let’s explore! I’m again using Ted’s corpus and software, comparing a given author’s work to the whole corpus. This file is a transcript of the commands and output I’m interpreting below.
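For readers curious about what the measure behind these lists is doing, here is a minimal sketch of Mann-Whitney rho in its textbook form. For a single word, it estimates the probability that a randomly chosen document by the author uses the word at a higher relative frequency than a randomly chosen document from the rest of the corpus, counting ties as half. This is my assumption about the general statistic, not a transcription of Ted’s program; the function name and the frequencies are made up for illustration.

```python
from itertools import product


def mann_whitney_rho(author_freqs, other_freqs):
    """Share of (author doc, other doc) pairs in which the author's document
    has the higher relative frequency for one word; ties count as half."""
    if not author_freqs or not other_freqs:
        raise ValueError("both groups need at least one document")
    wins = 0.0
    for a, b in product(author_freqs, other_freqs):
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5
    return wins / (len(author_freqs) * len(other_freqs))


# Hypothetical relative frequencies of one word, one value per document.
author_docs = [0.012, 0.010, 0.015]
other_docs = [0.004, 0.011, 0.006, 0.003]

rho = mann_whitney_rho(author_docs, other_docs)
print(round(rho, 3))  # 11 of the 12 pairs favor the author: 0.917
```

On this reading, a value near 1 means the author’s documents almost always use the word more heavily than other documents do, and a value near 0.5 means there is no consistent difference.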

I thought the most conventional guess of an author to produce results similar to Austen would be Maria Edgeworth. Here’s the list for her:

 WORDS OVERREPRESENTED BY MANN-WHITNEY RHO 
1	understand     	0.937	271	
2	recollect      	0.923	309	
3	talking        	0.916	127	
4	know           	0.916	523	
5	could          	0.913	754	
6	provoking      	0.912	41.9	
7	nonsense       	0.911	62.3	
8	perfectly      	0.905	119	
9	explain        	0.903	192	
10	continually    	0.889	95.4	
11	tired          	0.888	76	
12	going          	0.888	205	
13	do             	0.884	586	
14	dear           	0.88	792	
15	sorry          	0.879	79.5	
16	satisfied      	0.879	93.8	
17	yesterday      	0.879	48.9	
18	liked          	0.875	48.1	
19	spoiled        	0.874	19.6	
20	directly       	0.869	77.2	
21	quite          	0.869	136	
22	please         	0.868	182	
23	you            	0.868	2467	
24	repeated       	0.868	233	
25	decide         	0.866	101	
26	afraid         	0.864	148	
27	repeating      	0.862	52.7	
28	thank          	0.862	115	
29	manage         	0.86	44	
30	guess          	0.86	97.8	
31	sure           	0.859	290	
32	ashamed        	0.857	35.4	
33	put            	0.856	140	
34	admiration     	0.855	90.5	
35	disappointed   	0.855	44.8	
36	surprised      	0.855	75.6	
37	tiresome       	0.853	37.2	
38	especially     	0.853	76.3	
39	not            	0.853	802	
40	reading        	0.853	80.1	
41	dressing       	0.852	9.04	
42	said           	0.852	2783	
43	formerly       	0.851	50	
44	understanding  	0.851	103	
45	possible       	0.85	157	
46	because        	0.85	261	
47	really         	0.85	125	
48	any            	0.85	632	
49	saw            	0.85	183	
50	think          	0.85	173	

My unsystematic eyeballs see no forms of “to be” and far fewer adverbs than appear in Austen’s list. Terms of cognition seem especially prominent:

 WORDS OVERREPRESENTED BY MANN-WHITNEY RHO 
1	understand     	0.937	271	
2	recollect      	0.923	309	
4	know           	0.916	523	
9	explain        	0.903	192	
25	decide         	0.866	101	
30	guess          	0.86	97.8	
44	understanding  	0.851	103	
50	think          	0.85	173	
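To make the eyeball comparison a little more systematic, here is a small sketch that tallies “-ly” words and forms of “to be” in the Edgeworth and Austen top-50 lists from these posts. The “-ly” suffix test is a crude stand-in of my own for “adverb”: it misses adverbs such as “very,” “quite,” and “always,” and in other texts it would wrongly flag words like “family.” The names `tally` and `BE_FORMS` are illustrative, not part of Ted’s software.

```python
# Top-50 words copied from the Edgeworth and Austen lists in these posts.
EDGEWORTH = """understand recollect talking know could provoking nonsense
perfectly explain continually tired going do dear sorry satisfied yesterday
liked spoiled directly quite please you repeated decide afraid repeating
thank manage guess sure ashamed put admiration disappointed surprised
tiresome especially not reading dressing said formerly understanding
possible because really any saw think""".split()

AUSTEN = """very wishing staying satisfied fortnight herself agreeable be
smallest any really acquaintance excessively nothing assure settled marrying
much attentions encouraging directly deal warmly must sorry certainly not
tolerably handsome quite been exactly invitation being obliged seeing always
pleasantly delighted talked perfectly distressing solicitude comfortable
walking continuing engaged enjoyment dislike talking""".split()

BE_FORMS = {"be", "been", "being", "am", "is", "are", "was", "were"}


def tally(words):
    """Return the '-ly' words and the forms of 'to be' found in a word list."""
    ly_words = [w for w in words if w.endswith("ly")]
    be_words = [w for w in words if w in BE_FORMS]
    return ly_words, be_words


for name, words in (("Edgeworth", EDGEWORTH), ("Austen", AUSTEN)):
    ly_words, be_words = tally(words)
    print(f"{name}: {len(ly_words)} '-ly' words, {len(be_words)} forms of 'to be'")
# Edgeworth: 6 '-ly' words, 0 forms of 'to be'
# Austen: 9 '-ly' words, 3 forms of 'to be'
```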

What about Charlotte Lennox? Her list has “extremely” and “wholly” in the first and sixth places, but only one other “-ly” adverb (“instantly” at #29). Lennox’s vocabulary emphasizes the dynamics of sociability. Highlights:

 WORDS OVERREPRESENTED BY MANN-WHITNEY RHO 
2	civility       	0.97	117	
7	amiable        	0.959	353	
8	accompany      	0.959	55.8	
11	conversation   	0.957	258	
12	behaviour      	0.954	419	
13	mortified      	0.949	34.6	
14	mortification  	0.948	113	
15	received       	0.945	119	
18	amusements     	0.939	32.3	
19	entreaties     	0.937	54.9	
20	apprehensions  	0.937	89.4	
21	attentions     	0.936	70.9	
27	conduct        	0.929	195	
28	insisted       	0.928	80.6	
29	instantly      	0.927	209	
30	countenance    	0.925	123	
31	situation      	0.924	260	
33	visit          	0.923	107	
35	arrival        	0.922	83.5	
36	acknowledged   	0.92	53	
37	reception      	0.92	46.8	
38	circumstance   	0.919	98.7	
41	relations      	0.917	84.3	
42	letter         	0.916	312	
43	politeness     	0.916	110	
44	shocked        	0.914	89.2	
45	accident       	0.913	74.1	
46	inform         	0.913	74.8	
47	acquaintance   	0.912	131	
50	ordered        	0.91	66.6	

Walter Scott’s list of 50 (using only his fiction for the sake of comparison) includes only three adverbs, none in his top 30, and the highest-ranking is an adverb of action: “hastily.” Scott’s list evokes military contexts and especially hierarchies of authority:

1	answered       	0.958	2519	
4	warrant        	0.944	501	
8	risk           	0.93	263	
13	permit         	0.914	247	
14	trusty         	0.913	169	
19	weapon         	0.905	235	
22	boot           	0.902	127	
23	followers      	0.898	505	
27	domestics      	0.897	122	
30	commanded      	0.895	222	
32	courtesy       	0.894	262	
33	quarrel        	0.893	183	
34	kinsman        	0.892	432	
35	assistance     	0.892	248	
37	saddle         	0.891	109	
43	displeasure    	0.89	123	
44	attendance     	0.889	162	
47	willingly      	0.889	170	

Hannah More’s list (again, using only her fiction) is unsurprisingly packed with religious terminology, and I see little overlap between her list and the others.

If you want motion in your novel, open your James Fenimore Cooper:

 WORDS OVERREPRESENTED BY MANN-WHITNEY RHO 
1	movements      	0.979	903	
3	movement       	0.97	576	
4	direction      	0.961	579	
6	commenced      	0.958	374	
8	companion      	0.952	645	
18	distance       	0.915	552	
20	quest          	0.913	190	
21	returned       	0.913	829	
27	companions     	0.902	268	
37	disappeared    	0.894	137	
38	preparations   	0.893	93.3	
39	placing        	0.893	74.7	
40	position       	0.892	168	

At this point, I think we have at least a preliminary answer to our question: the prevalence of adverbs, “to be” constructions, and terms of certainty in Austen’s works is indeed characteristic of Austen herself, rather than her period or genre, at least as measured against these five authors.
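One rough way to put a number on how much two authors’ lists share is to intersect them. The sketch below does this for the Austen and Edgeworth top-50 lists given in these posts and reports the shared words along with the Jaccard index (shared words divided by all distinct words). The helper names are my own, and the check says nothing about whether a difference is statistically meaningful; it only counts overlap.

```python
# Top-50 words copied from the Austen and Edgeworth lists in these posts.
AUSTEN_TOP50 = """very wishing staying satisfied fortnight herself agreeable
be smallest any really acquaintance excessively nothing assure settled
marrying much attentions encouraging directly deal warmly must sorry
certainly not tolerably handsome quite been exactly invitation being obliged
seeing always pleasantly delighted talked perfectly distressing solicitude
comfortable walking continuing engaged enjoyment dislike talking""".split()

EDGEWORTH_TOP50 = """understand recollect talking know could provoking
nonsense perfectly explain continually tired going do dear sorry satisfied
yesterday liked spoiled directly quite please you repeated decide afraid
repeating thank manage guess sure ashamed put admiration disappointed
surprised tiresome especially not reading dressing said formerly
understanding possible because really any saw think""".split()


def shared_words(list_a, list_b):
    """Words that appear in both lists, in alphabetical order."""
    return sorted(set(list_a) & set(list_b))


def jaccard(list_a, list_b):
    """Shared words divided by all distinct words across the two lists."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b)


common = shared_words(AUSTEN_TOP50, EDGEWORTH_TOP50)
print(len(common), common)
# 9 ['any', 'directly', 'not', 'perfectly', 'quite', 'really', 'satisfied', 'sorry', 'talking']
print(round(jaccard(AUSTEN_TOP50, EDGEWORTH_TOP50), 3))  # 9 / 91 = 0.099
```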

This little exploration was great fun for me, as the results returned a mix of new insights–particularly about Austen and Edgeworth–and reassuring common-sense confirmation that the tool identifies the characteristic thematic emphases of Scott and More. In a follow-up post, I’ll offer some quick thoughts about other uses of this kind of word-frequency analysis, from the perspective of a beginning user with a pedagogical emphasis.


Jane Austen and contemporary prose style

I’m on leave this semester to do work in the Digital Humanities, so I’ll be posting a lot about that. My interest in DH is not–or has not been–quantitative, but I am expanding my range by dabbling in quantitative methods, currently with the help of Ted Underwood’s wonderful introduction to the topic.

At the end of Ted’s post, he provides a dataset and a program he wrote to find groups of words that form something like stylistic signatures in authors and genres. I’ve been playing with the program, with fascinating results. I’ll share one here. This is the list of overrepresented words in Jane Austen’s works according to one of the measures Ted uses:


WORDS OVERREPRESENTED BY MANN-WHITNEY RHO
1 very 0.985 3283
2 wishing 0.984 154
3 staying 0.982 176
4 satisfied 0.977 188
5 fortnight 0.975 152
6 herself 0.973 1553
7 agreeable 0.973 350
8 be 0.971 2645
9 smallest 0.971 182
10 any 0.971 1112
11 really 0.968 555
12 acquaintance 0.967 462
13 excessively 0.967 91.8
14 nothing 0.967 639
15 assure 0.965 268
16 settled 0.964 261
17 marrying 0.964 196
18 much 0.964 841
19 attentions 0.962 212
20 encouraging 0.961 51
21 directly 0.96 290
22 deal 0.96 329
23 warmly 0.96 96.3
24 must 0.96 1141
25 sorry 0.958 198
26 certainly 0.957 323
27 not 0.957 2023
28 tolerably 0.957 95.9
29 handsome 0.957 136
30 quite 0.956 765
31 been 0.956 899
32 exactly 0.955 248
33 invitation 0.955 194
34 being 0.954 699
35 obliged 0.954 280
36 seeing 0.954 206
37 always 0.953 470
38 pleasantly 0.952 37.8
39 delighted 0.951 107
40 talked 0.95 342
41 perfectly 0.949 283
42 distressing 0.949 61.5
43 solicitude 0.949 89.7
44 comfortable 0.948 167
45 walking 0.948 129
46 continuing 0.947 39.1
47 engaged 0.945 120
48 enjoyment 0.942 122
49 dislike 0.941 86.7
50 talking 0.941 194

The list is interesting in many ways, especially in comparison to the corresponding lists for other authors, but I want to emphasize a side point. “Very” tops the list, and it may also top the list of words I discourage my students from using in their papers. (A line often attributed to Mark Twain: “Substitute ‘damn’ every time you’re inclined to write ‘very’; your editor will delete it and the writing will be just as it should be.”) And that’s not all: I push students to minimize adverbs, intensifiers, terms of certainty, and “to be” constructions. Such words infuse Austen’s list:


WORDS OVERREPRESENTED BY MANN-WHITNEY RHO
1 very 0.985 3283
8 be 0.971 2645
11 really 0.968 555
13 excessively 0.967 91.8
21 directly 0.96 290
23 warmly 0.96 96.3
26 certainly 0.957 323
28 tolerably 0.957 95.9
30 quite 0.956 765
31 been 0.956 899
32 exactly 0.955 248
34 being 0.954 699
37 always 0.953 470
38 pleasantly 0.952 37.8
41 perfectly 0.949 283

I’ve thought many times about writing a handout on style that outlines the conventional guidelines of modern, essayistic style with counterexamples from great literature. (What would Hamlet do without “to be”?) But this list encourages me to take such thinking a step further: Austen’s case alone could become the foundation of a unit on voice, style, and convention.