block by joyrexus 5615982

Query script and report for NORC request.

README

NORC is requesting counts of word types and tokens for each student (for each story) as well as MLU (mean length of utterance).

NOTE: MLU was requested, but I’m not sure how to obtain it, given that there are often multiple utterances per row/field and these utterances are not reliably or consistently delimited.
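The type and token counts described above can be sketched roughly as follows. This is a minimal illustration, not the logic of query.py: the tokenizer (lowercased runs of letters and apostrophes), the function names, and the sample records are all assumptions made for the example. Tokens are pooled across a story's pages before types are counted, since type counts for separate pages cannot simply be added together.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (assumed tokenization rule)."""
    return re.findall(r"[a-z']+", text.lower())


def count_by_student_and_story(records):
    """Count word types and tokens for each (student id, story) pair.

    `records` is an iterable of (student_id, story, text) tuples,
    one per page of a story.
    """
    pooled = defaultdict(list)
    for student_id, story, text in records:
        # Pool tokens over all pages of the same story for this student.
        pooled[(student_id, story)].extend(tokenize(text))
    return {
        key: {"types": len(set(tokens)), "tokens": len(tokens)}
        for key, tokens in pooled.items()
    }


# Made-up sample rows, for illustration only.
records = [
    (101, "A", "The dog saw the cat."),
    (101, "A", "The cat ran!"),
    (101, "B", "A bird sang."),
]
counts = count_by_student_and_story(records)
```

Here `counts[(101, "A")]` holds the counts for student 101's story A, pooled over both of its sample pages.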

They provided an Excel file (data.xlsx, originally WPB_all_data_5_16_13.xlsx) containing all utterances to be analyzed.

The provided data table contains 11 columns:

INDEX  COL    HEADER
    0    A    StudentID

    1    B    WPBNA_PG1-2 
    2    C    WPBNA_PG3-4
    3    D    WPBNA_PG5-6 
    4    E    WPBNA_PG7-8
    5    F    WPBNA_PG9-10

    6    G    WPBNB_PG1-2
    7    H    WPBNB_PG3-4
    8    I    WPBNB_PG5-6
    9    J    WPBNB_PG7-8
   10    K    WPBNB_PG9-10

We converted this into a more convenient format with the following columns:

ID     - student id
STORY  - A or B
PAGE   - 1 (1-2), 2 (3-4), etc.
TOKENS - word tokens parsed from TEXT 
TEXT   - original text

See data.xls for the resulting data table.
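The conversion from the wide table to the ID/STORY/PAGE/TOKENS/TEXT layout might look something like the sketch below. It assumes the spreadsheet has already been read into a list of header strings and a list of rows. The helper names, the tokenization rule, and the choice to skip empty cells are assumptions made for illustration, not a description of the actual conversion script.

```python
import re

# Matches headers such as "WPBNA_PG1-2": story letter, then the page pair.
HEADER = re.compile(r"WPBN([AB])_PG(\d+)-\d+")


def parse_header(header):
    """Map a column header to (story, page), e.g. 'WPBNA_PG3-4' -> ('A', 2)."""
    match = HEADER.fullmatch(header.strip())
    if match is None:
        raise ValueError(f"unexpected column header: {header!r}")
    story, first_page = match.group(1), int(match.group(2))
    # Page pairs 1-2, 3-4, ... 9-10 become page numbers 1, 2, ... 5.
    return story, (first_page + 1) // 2


def to_long(headers, rows):
    """Turn one wide row per student into one record per non-empty cell."""
    columns = [parse_header(h) for h in headers[1:]]  # skip StudentID
    records = []
    for row in rows:
        student_id = row[0]
        for (story, page), text in zip(columns, row[1:]):
            if text is None or not str(text).strip():
                continue  # assumption: empty cells produce no record
            tokens = re.findall(r"[a-z']+", str(text).lower())
            records.append({
                "ID": student_id,
                "STORY": story,
                "PAGE": page,
                "TOKENS": " ".join(tokens),
                "TEXT": text,
            })
    return records


# Made-up sample input, for illustration only.
headers = ["StudentID", "WPBNA_PG1-2 ", "WPBNB_PG9-10"]
rows = [[101, "The dog ran.", ""]]
long_records = to_long(headers, rows)
```

Headers are stripped before matching because some headers in the source table carry trailing whitespace.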

Files

Email Record

Date: May 17, 2013 10:51:50 AM CDT Subject: word counts

Here is the data set from our field study. As we discussed on Tuesday we are interested in getting token, types and MLU for each story for each student. These are organized so that each student ID has two stories across a row.

That means for each student ID we need 3 counts for story 1 (WPBNA_PG 1-2, PG 3-4, PG 5- 6, PG 7-8) and 3 counts for story 2 (WPBNB PG1-2, PG 3-4, PG 5-6, PG 7-8).

query.py