School of Linguistics and Applied Language Studies

The Wellington Corpus of Spoken New Zealand English

Project Director
Janet Holmes

Corpus Manager
Bernadette Vine

This information is taken from Holmes, J., Vine, B. & Johnson, G. (1998). The Wellington Corpus of Spoken New Zealand English: a Users' Guide. Wellington: School of Linguistics and Applied Language Studies, Victoria University of Wellington.

Further information can be obtained from:

Corpus Manager, Archive of New Zealand English, School of Linguistics and Applied Language Studies, Victoria University of Wellington, PO Box 600, Wellington 6140, NEW ZEALAND.
Email: bernadette.vine@vuw.ac.nz
Tel: (+64 4) 463 5639
Fax: (+64 4) 463 5604

Composition of the WSC

The collection dates for the WSC were 1 January 1988 to 31 December 1994. Ninety-nine percent of the data was collected between 1990 and 1994.
The proportions of speech styles are:

Formal Speech/Monologue 12%
Semi-formal Speech/Elicited Monologue 13%
Informal Speech/Dialogue 75%

The extracts in the corpus are divided into 15 categories and these categories cover a range of contexts in which each style of speech is found. In the table below, the categories are grouped in terms of whether they are monologues or dialogues, public or private, scripted or unscripted. The codes assigned to the categories are also provided, along with the word targets for each category.

The formal speech section of the WSC comprises all the monologue categories and Parliamentary debate (DGU). The semi-formal section comprises the interview categories, both public and private: oral history (DPH), social dialect (DPP) and broadcast interviews (DGI). The remaining dialogue categories make up the informal speech section, and private face-to-face conversations (DPC) alone account for 50% of the overall corpus.

Table: WSC CATEGORIES AND WORD TARGETS

Category                      Text Category Code    Word Target

Monologue: Public scripted, broadcast
Broadcast news                MSN                        24,000
Broadcast monologue           MST                        10,000
Broadcast weather             MSW                         2,000

Monologue: Public unscripted
Sports commentary             MUC                        20,000
Judge's summation             MUJ                         4,000
Lecture                       MUL                        28,000
Teacher monologue             MUS                        12,000

Dialogue: Private
Conversation                  DPC                       500,000
Telephone conversation        DPF                        70,000
Oral history interview        DPH                        20,000
Social dialect interview      DPP                        30,000

Dialogue: Public
Radio talkback                DGB                        80,000
Broadcast interview           DGI                        80,000
Parliamentary debate          DGU                        20,000
Transactions and Meetings     DGZ                       100,000

TOTAL                                                 1,000,000
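
The style proportions given earlier (12%, 13%, 75%) follow from the word targets in the table above. The following is a minimal sketch of that arithmetic in Python. The names WORD_TARGETS, STYLE_CODES and style_percentages are illustrative only and are not part of the corpus or its documentation. The grouping of codes into styles follows the description in the text.

```python
# Word targets per text category code, copied from the table above.
WORD_TARGETS = {
    "MSN": 24_000, "MST": 10_000, "MSW": 2_000,                   # monologue, public scripted
    "MUC": 20_000, "MUJ": 4_000, "MUL": 28_000, "MUS": 12_000,    # monologue, public unscripted
    "DPC": 500_000, "DPF": 70_000, "DPH": 20_000, "DPP": 30_000,  # dialogue, private
    "DGB": 80_000, "DGI": 80_000, "DGU": 20_000, "DGZ": 100_000,  # dialogue, public
}

# Grouping of codes into speech styles, as described in the text:
# formal = all monologue categories plus Parliamentary debate (DGU);
# semi-formal = the interview categories; informal = remaining dialogue.
STYLE_CODES = {
    "formal": ["MSN", "MST", "MSW", "MUC", "MUJ", "MUL", "MUS", "DGU"],
    "semi-formal": ["DPH", "DPP", "DGI"],
    "informal": ["DPC", "DPF", "DGB", "DGZ"],
}


def style_percentages():
    """Return each style's share of the total word target, as a whole percentage."""
    total = sum(WORD_TARGETS.values())
    return {
        style: round(100 * sum(WORD_TARGETS[code] for code in codes) / total)
        for style, codes in STYLE_CODES.items()
    }


if __name__ == "__main__":
    for style, percent in style_percentages().items():
        print(f"{style}: {percent}%")
```

These are target figures. The word counts of the final transcribed corpus may differ slightly from them.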

Whose speech is included?

Our corpus is a corpus of spoken New Zealand English. We therefore needed to establish criteria for selecting people to be included. We rejected the notion of selecting people who sounded as if they were New Zealanders, since this would self-evidently have pre-judged an issue which the corpus data was intended to illuminate - namely what constitutes New Zealand English. Similarly, non-linguistic criteria such as citizenship or residency are fraught with problems, since those who hold such qualifications may be very recent arrivals from elsewhere. Even longer-term residents cannot be expected to have acquired features which distinguish New Zealand speech from other varieties if they arrived in the country after puberty. Consequently, we adopted a criterion which has been regarded by others as very stringent, but which we felt confident would ensure the integrity of the New Zealand samples included in the corpus.

A speaker of New Zealand English is defined as someone who has lived in New Zealand since before the age of 10 years.

A certain amount of overseas experience was regarded as normal within New Zealand, but, again for reasons relating to the need to establish the distinctive features of a New Zealand variety of English, people who had spent extensive periods of time overseas were excluded. More than ten years or over half their lifetime (whichever was the greater) was considered an extensive period of time, and this rendered people ineligible for inclusion in the spoken corpus. Also excluded were people who had returned from an overseas trip within the last year.

To summarise:

INCLUDED                                      EXCLUDED

Lived in NZ since before age of 10 years      Arrived in NZ after the age of 10 years

10 years or less spent overseas, or           More than 10 years spent overseas, or
less than 1/2 lifetime (whichever greater)    more than 1/2 lifetime

Last overseas trip over 1 year ago            Last overseas trip less than 1 year ago
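
The three criteria summarised above can be read as a single eligibility check. The following Python sketch is one possible reading, not the procedure actually used by the project. The function name and parameters are hypothetical. It assumes that a speaker born in New Zealand has an arrival age of 0, that arrival at exactly age 10 is ineligible (the definition says "before the age of 10"), and that a last trip exactly one year ago is acceptable. The source does not settle these boundary cases.

```python
from typing import Optional


def is_eligible_speaker(
    age: float,
    age_on_arrival: float,
    years_overseas: float,
    years_since_last_trip: Optional[float],
) -> bool:
    """Apply the three inclusion criteria (hypothetical helper, boundary cases assumed).

    age                   -- speaker's age in years at the time of recording
    age_on_arrival        -- age on arrival in New Zealand (0 if born there)
    years_overseas        -- total years spent overseas since arrival
    years_since_last_trip -- years since returning from the last overseas trip,
                             or None if the speaker has never been overseas
    """
    # Criterion 1: lived in New Zealand since before the age of 10.
    if age_on_arrival >= 10:
        return False

    # Criterion 2: time overseas must not exceed ten years or half the
    # speaker's lifetime, whichever is the greater.
    overseas_limit = max(10.0, age / 2)
    if years_overseas > overseas_limit:
        return False

    # Criterion 3: must not have returned from an overseas trip within the last year.
    if years_since_last_trip is not None and years_since_last_trip < 1:
        return False

    return True


if __name__ == "__main__":
    # A 30-year-old born in New Zealand with 2 years overseas, back for 5 years.
    print(is_eligible_speaker(30, 0, 2, 5))
    # The same speaker, but arrived in New Zealand at age 12.
    print(is_eligible_speaker(30, 12, 2, 5))
```

Because the overseas limit is the greater of the two values, a 30-year-old may have spent up to 15 years overseas, while anyone aged 20 or under is limited to 10 years.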

Ethnic and gender representation

People of any ethnicity (e.g. Dutch, Samoan, Greek, Tongan) were considered eligible for inclusion in the spoken corpus provided they satisfied the criterion for eligibility as a New Zealander. No attempt was made to include representative samples from particular ethnic groups other than Maori. It was considered important to include an appropriate proportion of the speech of the indigenous Maori people, and while this was not possible within each sub-category, it was recognised as a reasonable aim for the corpus as a whole. Maori contribute 18% of the total words in our transcribed corpus and Pakeha 76%.

Some degree of gender balance was also considered desirable, with an ideal overall goal of 50% female speech and 50% male speech within the 1,000,000 word sample. Women contribute 52% and men 48% of the final transcribed words, reflecting the New Zealand population balance.

Other social factors

Recognising that it was unrealistic to attempt to collect a representative sample which took account of additional social variables such as social class, regional origin, level of education, occupation and age, no attempt was made to pre-determine the number of contributors in such categories. However, every speech sample collected is described as fully as possible in these respects for each speaker contributing to the corpus. No attempt was made at iwi representation and information on iwi affiliation was not collected.