NORC is requesting counts of word types and tokens for each student (for each story) as well as MLU (mean length of utterance).
NOTE: MLU was requested, but I'm not sure how to obtain it, given that there are often multiple utterances per row/field and these utterances are not reliably or consistently delimited.
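For reference, a rough sketch of how MLU could be approximated if one were willing to treat sentence-final punctuation (. ! ?) as utterance boundaries. That is an assumption the data may not support, given the delimiting problem noted above. The function name mlu_in_words is hypothetical and is not part of query.py. MLU is usually computed over morphemes; this version counts whitespace-separated words instead.

```python
import re

def mlu_in_words(text):
    """Approximate MLU as mean words per utterance.

    ASSUMPTION: utterances end at '.', '!' or '?'. The real data is not
    consistently delimited, so this would only be a rough estimate.
    """
    # Split on runs of sentence-final punctuation and drop empty pieces.
    utterances = [u for u in re.split(r"[.!?]+", text) if u.strip()]
    if not utterances:
        return None  # no utterances found, MLU undefined
    lengths = [len(u.split()) for u in utterances]
    return sum(lengths) / len(lengths)
```

For example, mlu_in_words("The dog ran. He barked!") treats the field as two utterances of 3 and 2 words and returns 2.5.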
NORC provided an Excel file (data.xlsx, originally WPB_all_data_5_16_13.xlsx) containing all utterances to be analyzed.
The provided data table contains 11 columns:
INDEX  COL  HEADER
0      A    StudentID
1      B    WPBNA_PG1-2
2      C    WPBNA_PG3-4
3      D    WPBNA_PG5-6
4      E    WPBNA_PG7-8
5      F    WPBNA_PG9-10
6      G    WPBNB_PG1-2
7      H    WPBNB_PG3-4
8      I    WPBNB_PG5-6
9      J    WPBNB_PG7-8
10     K    WPBNB_PG9-10
We converted this into a more convenient format with the following columns:
ID - student id
STORY - A or B
PAGE - 1 (1-2), 2 (3-4), etc.
TOKENS - word tokens parsed from TEXT
TEXT - original text
See data.xls for the resulting data table.
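The conversion described above can be sketched roughly as follows. This is an illustration only, not the code actually used: the function names wide_to_long and tokenize are hypothetical, the input is assumed to be a list of dicts keyed by the original column headers, and the tokenizer (lowercase, runs of letters and apostrophes) is an assumption, since the notes do not say how TOKENS was parsed from TEXT.

```python
import re

# Header pattern from the provided table, e.g. "WPBNA_PG1-2" -> story "A", pages 1-2.
HEADER_RE = re.compile(r"^WPBN([AB])_PG(\d+)-(\d+)$")

def tokenize(text):
    """ASSUMED tokenizer: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def wide_to_long(rows):
    """Turn one-row-per-student records into one record per student/story/page."""
    long_rows = []
    for row in rows:
        for header, text in row.items():
            match = HEADER_RE.match(header)
            if match is None or not text:
                continue  # skips the StudentID column and empty cells
            story = match.group(1)
            first_page = int(match.group(2))
            long_rows.append({
                "ID": row["StudentID"],
                "STORY": story,
                "PAGE": (first_page + 1) // 2,  # 1-2 -> 1, 3-4 -> 2, etc.
                "TOKENS": tokenize(text),
                "TEXT": text,
            })
    return long_rows
```

Each non-empty cell in columns B through K becomes one output record, so a student with text on every page yields ten records.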
query.py - script used to generate report.xls from data.xls
data.xls - data file described above
report.xls - resulting report containing word token and type counts for each subject/story.
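As a rough illustration of the counts in the report (not the contents of query.py itself), tokens and types per subject/story could be computed as below. The function name count_types_and_tokens is hypothetical, and the input is assumed to be a list of records with the ID, STORY and TOKENS columns described above, with TOKENS already parsed into a list of words.

```python
from collections import defaultdict

def count_types_and_tokens(long_rows):
    """Pool TOKENS over all pages of each (ID, STORY) pair and count them.

    tokens = total number of words; types = number of distinct words.
    """
    pooled = defaultdict(list)
    for row in long_rows:
        pooled[(row["ID"], row["STORY"])].extend(row["TOKENS"])
    return {
        key: {"tokens": len(words), "types": len(set(words))}
        for key, words in pooled.items()
    }
```

Note that types are counted over the whole story rather than summed per page, so a word repeated on two pages of the same story counts as one type.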
Date: May 17, 2013 10:51:50 AM CDT
Subject: word counts
Here is the data set from our field study. As we discussed on Tuesday, we are interested in getting tokens, types, and MLU for each story for each student. These are organized so that each student ID has two stories across a row.
That means for each student ID we need 3 counts for story 1 (WPBNA_PG 1-2, PG 3-4, PG 5-6, PG 7-8) and 3 counts for story 2 (WPBNB PG 1-2, PG 3-4, PG 5-6, PG 7-8).