How to prepare and analyse the PISA database

This note summarises the main steps of using the PISA database. It describes the PISA data files and explains the specific features of the PISA survey together with their analytical implications. This document also offers links to existing documentation and resources (including software packages and pre-defined macros) for accurately using the PISA data files.

Preparation of the PISA data files

The PISA database contains the full set of responses from individual students, school principals and parents. In what follows, a short summary explains how to prepare the PISA data files in a format ready to be used for analysis.

Generate data files available on the PISA website

The files available on the PISA website include background questionnaires, data files in ASCII format (from 2000 to 2012), codebooks, compendia and SAS and SPSS data files in order to process the data.

For generating databases from 2000 to 2012, all data files (in text format) and the corresponding SAS or SPSS control files are downloadable from the PISA website (www.oecd.org/pisa). SAS or SPSS users need to run the SAS or SPSS control files, which will generate the PISA data files in SAS or SPSS format respectively. Before starting the analysis, the general recommendation is to save and run the PISA data files and the SAS or SPSS control files in year-specific folders, e.g. the PISA 2003 data files in “c:\pisa2003\data\”.

For generating databases from 2015, PISA data files are available in SAS or SPSS format (.sas7bdat or .sav) and can be downloaded directly from the PISA website.
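For readers who do not use SAS or SPSS, the idea behind a control file for the 2000 to 2012 text-format data can be sketched in a few lines of Python: each variable occupies a fixed range of columns in every record. This is only an illustration. The variable names, column positions and sample records below are hypothetical and do not describe any real PISA file; the actual positions must be taken from the control file or codebook of the cycle in question.

```python
# Hypothetical layout: (variable name, first column, last column), 1-based
# and inclusive, in the way a control file lists column ranges. These names
# and positions are made up and do NOT describe a real PISA data file.
LAYOUT = [
    ("CNT", 1, 3),
    ("SCHOOLID", 4, 8),
    ("STIDSTD", 9, 13),
]


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width text record into a dict of stripped strings."""
    return {name: line[start - 1:end].strip() for name, start, end in layout}


def parse_lines(lines, layout=LAYOUT):
    """Parse an iterable of fixed-width records, skipping blank lines."""
    return [parse_record(line, layout) for line in lines if line.strip()]


# Two invented records that follow the hypothetical layout above.
sample = [
    "AUS0000100001",
    "AUS0000100002",
]
records = parse_lines(sample)
```

In a real workflow the list passed to parse_lines would be the lines of the downloaded text file, and identifiers are best kept as strings so that leading zeros are preserved.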

Description of the PISA data files

The main data files are the student, the school and the cognitive datasets. These data files are available for each PISA cycle (PISA 2000 – PISA 2022). In addition, parent and process data files are offered from 2006, financial literacy data files from 2012, and a teacher data file from 2015.

The student data files are the main data files. Apart from the students’ responses to the questionnaire(s), such as the main student, educational career and ICT (information and communication technologies) questionnaires, they include, for each student, plausible values for the cognitive domains, scores on questionnaire indices, weights and replicate weights.

The school data files contain the information given by the participating school principals, while the teacher data file contains the responses collected through the teacher questionnaire. Responses to the parent questionnaire are stored in the parent data files.

The cognitive data files include the coded-responses (full-credit, partial credit, non-credit) for each PISA-test item. In 2012, two cognitive data files are available for PISA data users.

The cognitive item response data file includes the coded-responses (full-credit, partial credit, non-credit), while the scored cognitive item response data file has scores instead of categories for the coded-responses (where non-credit is score 0, and full credit is typically score 1). The cognitive test became computer-based in most of the PISA participating countries and economies in 2015; thus from 2015, the cognitive data file has additional information on students’ test-taking behaviour, such as the raw responses, the time spent on the task and the number of steps students made before giving their final responses.

The financial literacy data files contain information from the financial literacy questionnaire and the financial literacy cognitive test.

In 2015, a database for the innovative domain, collaborative problem solving, is available and contains information on the cognitive test items.

In computer-based tests, machines keep track (in log files) of all the steps and actions students take in finding a solution to a given problem and, if so instructed, could analyse them. From 2012, process data (or log) files are available for data users, and contain detailed information on the computer-based cognitive items in mathematics, reading and problem solving. The study by Greiff, Wüstenberg and Avvisati (2015) and Chapters 4 and 7 in the PISA report Students, Computers and Learning: Making the Connection provide illustrative examples of how to use these process data files for analytical purposes. All other log file data are considered confidential and may be accessed only under certain conditions. Researchers who wish to access such files will need the endorsement of a PISA Governing Board (PGB) representative to do so. For more information, please contact edu.pisa@oecd.org.

PISA 2012 process (log) data files (at the bottom of the link)

Merge the PISA data files

In order to run specific analyses, such as school-level estimations, the PISA data files may need to be merged. (Please note that variable names can differ slightly across PISA cycles. The examples below are from the PISA 2015 database.)

To merge the student data file with the school and/or the teacher data file(s), use the 3-character country code (variable name: CNT in the PISA 2015 data file) and the international school ID (variable name: CNTSCHID in the PISA 2015 data file) as merge keys.

To merge the student data file with the parent data file, use the 3-character country code (variable name: CNT in the PISA 2015 data file) and the international student ID (variable name: CNTSTUID in the PISA 2015 data file) as merge keys.

To merge the student data file with the cognitive or financial literacy data file(s), use the 3-character country code (variable name: CNT in the PISA 2015 data file), the international school ID (variable name: CNTSCHID in the PISA 2015 data file) and the international student ID (variable name: CNTSTUID in the PISA 2015 data file) as merge keys.
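The merges described above amount to a left join on the key variables. The sketch below shows the logic in plain Python, using the PISA 2015 key names CNT, CNTSCHID and CNTSTUID from the text. The records, ID values and the non-key columns (ESCS, SCHSIZE and PARENT_ITEM) are invented placeholders for illustration, and the helper merge_on_keys is hypothetical, not part of any PISA tool.

```python
def merge_on_keys(left, right, keys):
    """Left-join two lists of dict records on the given key variables.

    Every row of `left` is kept; columns from the matching `right` row are
    added when a match exists. On a name clash the `left` value wins.
    """
    index = {tuple(row[k] for k in keys): row for row in right}
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        merged.append({**match, **row})
    return merged


# Invented records for illustration only (IDs and values are not real data).
students = [
    {"CNT": "FRA", "CNTSCHID": 1, "CNTSTUID": 101, "ESCS": 0.3},
    {"CNT": "FRA", "CNTSCHID": 2, "CNTSTUID": 102, "ESCS": -0.5},
    {"CNT": "DEU", "CNTSCHID": 1, "CNTSTUID": 201, "ESCS": 0.1},
]
schools = [
    {"CNT": "FRA", "CNTSCHID": 1, "SCHSIZE": 640},
    {"CNT": "DEU", "CNTSCHID": 1, "SCHSIZE": 410},
]
parents = [
    {"CNT": "FRA", "CNTSTUID": 101, "PARENT_ITEM": 2},
]

# Student + school: country code and international school ID.
student_school = merge_on_keys(students, schools, ["CNT", "CNTSCHID"])
# Student + parent: country code and international student ID.
student_parent = merge_on_keys(students, parents, ["CNT", "CNTSTUID"])
```

Keeping every student row (a left join) makes unmatched records visible, which is a useful check that the key variables were chosen correctly for the cycle at hand.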

Methodology to analyse the PISA database

Use sampling weights for unbiased estimates and standard-errors

PISA collects data from a sample rather than from the whole population of 15-year-old students. The sample has been drawn in order to avoid bias in the selection procedure and to achieve the maximum precision in view of the available resources (for more information, see Chapter 3 in the PISA Data Analysis Manual: SPSS and SAS, Second Edition).

In practice, this means that the estimation of a population parameter requires (1) using the weights associated with the sampling and (2) computing the uncertainty due to the sampling (the standard-error of the parameter).

Use final student weights for obtaining unbiased parameter estimates

All analyses using PISA data should be weighted, as unweighted analyses will provide biased population parameter estimates. In PISA 2015 files, the variable w_fstuwt corresponds to the final student weights that should be used to compute unbiased statistics at the country level.
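The effect of weighting can be seen on a toy example. The sketch below compares an unweighted and a weighted mean; the scores and weights are invented numbers, and in a real analysis the weights would be the final student weights from the student data file.

```python
def weighted_mean(values, weights):
    """Return the mean of `values`, each counted according to its weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Invented scores and weights: each sampled student "represents" a
# different number of students in the population.
scores = [400.0, 500.0, 600.0]
final_weights = [10.0, 30.0, 60.0]

unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, final_weights)
```

Here the unweighted mean treats the three students as equally representative, whereas the weighted mean gives more influence to the student who stands for 60 students in the population.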

The final student weights add up to the size of the population of interest. When conducting an analysis for several countries, this means that countries with a larger number of 15-year-old students will contribute more to the analysis. For this reason, in some cases, the analyst may prefer to use senate weights, meaning weights that have been rescaled in order to add up to the same constant value within each country. Each country will thus contribute equally to the analysis.
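Senate weights can be derived from the final student weights by rescaling within each country. The following is a minimal sketch under stated assumptions: the target constant of 1000 is an arbitrary choice made here (the text only requires the same constant for every country), and the country codes and weights are invented.

```python
from collections import defaultdict


def senate_weights(countries, weights, target=1000.0):
    """Rescale weights so that they sum to `target` within each country.

    `target` is an assumed constant; any value works as long as it is the
    same for every country.
    """
    totals = defaultdict(float)
    for country, weight in zip(countries, weights):
        totals[country] += weight
    return [
        weight * target / totals[country]
        for country, weight in zip(countries, weights)
    ]


# Invented example: country "AAA" has a much larger weighted population.
countries = ["AAA", "AAA", "BBB", "BBB", "BBB"]
final_weights = [100.0, 300.0, 10.0, 20.0, 20.0]
rescaled = senate_weights(countries, final_weights)
```

After rescaling, the relative weights of students inside a country are unchanged, but each country's weights sum to the same total, so a pooled analysis no longer favours the larger country.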

Use replicate weights for obtaining unbiased standard errors

A statistic computed from a sample provides an estimate of the true population parameter. One therefore needs to compute its standard-error, which indicates the reliability of the estimate: the standard-error tells us how close the statistic obtained from this sample is likely to be to the true value for the overall population. These estimates of the standard-errors can be used, for instance, to report differences between countries or within countries that are statistically significant.

As the sample design of PISA is complex, the standard-error estimates provided by common statistical procedures are usually biased. Moreover, the mathematical computation of the sampling variance is not always feasible for some multivariate indices. For these reasons, the estimation of sampling variances in PISA relies on replication methodologies, more precisely Balanced Repeated Replication (BRR) with Fay’s modification (for details see Chapter 4 in the PISA Data Analysis Manual: SAS or SPSS, Second Edition or the associated guide “Computation of standard-errors for multistage samples”). The general principle of these methods consists of using several replicates of the original sample, each defined by its own set of weights, in order to estimate the sampling error. The statistic of interest is first computed based on the whole sample, and then again for each replicate. The replicate estimates are then compared with the whole sample estimate to estimate the sampling variance.

In PISA, 80 replicate samples are formed, and a set of weights is computed for each of them.

In practice, this means that one should estimate the statistic of interest using the final weight as described above, then again using each of the replicate weights (denoted by w_fsturwt1 to w_fsturwt80 in PISA 2015, w_fstr1 to w_fstr80 in previous cycles). The standard-error is then proportional to the average of the squared differences between the main estimate obtained with the original sample and those obtained with the replicate samples (for details on the computation of averages over several countries, see Chapter 12 of the PISA Data Analysis Manual: SAS or SPSS, Second Edition).
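The procedure just described can be sketched as follows. This is an illustration rather than the official macro: it assumes a Fay factor of 0.5, the value documented for PISA in the Data Analysis Manual and not stated in this note, so that the sampling variance is the sum of squared differences divided by G(1 - k)^2, where G is the number of replicates. The toy data use 4 replicate weight vectors instead of 80, and all numbers are invented.

```python
import math


def weighted_mean(values, weights):
    """Return the mean of `values`, each counted according to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def replicate_standard_error(values, final_weights, replicate_weights, fay_k=0.5):
    """Estimate a weighted mean and its replication-based standard error.

    `replicate_weights` is a list of G weight vectors, one per replicate.
    `fay_k` is the Fay factor (0.5 is an assumption taken from the PISA
    manual, not from this note).
    """
    estimate = weighted_mean(values, final_weights)
    n_replicates = len(replicate_weights)
    squared_diffs = sum(
        (weighted_mean(values, rep) - estimate) ** 2
        for rep in replicate_weights
    )
    variance = squared_diffs / (n_replicates * (1.0 - fay_k) ** 2)
    return estimate, math.sqrt(variance)


# Invented toy data: 3 students, 4 replicates (PISA provides 80).
scores = [400.0, 500.0, 600.0]
final_w = [1.0, 1.0, 1.0]
replicate_w = [
    [1.5, 0.5, 1.0],
    [0.5, 1.5, 1.0],
    [1.0, 1.5, 0.5],
    [1.0, 0.5, 1.5],
]
mean_score, se_score = replicate_standard_error(scores, final_w, replicate_w)
```

The same function works for any statistic if weighted_mean is swapped for another weighted estimator, since only the pattern "full-sample estimate, then one estimate per replicate" matters.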

Procedures and macros have been developed to compute these standard errors within the specific PISA framework (see below for a detailed description).

Use plausible values for student proficiency estimates

PISA reports student performance through plausible values (PVs), obtained from Item Response Theory models (for details, see Chapter 5 of the PISA Data Analysis Manual: SAS or SPSS, Second Edition or the associated guide “Scaling of Cognitive Data and Use of Students Performance Estimates”). The general principle of these models is to infer the ability of a student from his/her performance on the test. In practice, plausible values are generated through multiple imputations based upon pupils’ answers to the sub-set of test questions they were randomly assigned and their responses to the background questionnaires.

PISA is designed to provide summary statistics about the population of interest within each country and about simple correlations between key variables (e.g. between socio-economic status and student performance). PISA is not designed to provide optimal statistics of students at the individual level.

The use of PVs has important implications for PISA data analysis:

For each student, a set of plausible values is provided, corresponding to distinct draws from the plausible distribution of that student's abilities. In the first cycles of PISA, five plausible values are allocated to each student on each performance scale; since PISA 2015, ten plausible values are provided per student. Accurate analysis requires computing each statistic separately with every plausible value and then averaging the results over the set of plausible values.
Plausible values should not be averaged at the student level, i.e. by computing in the dataset the mean of the five or ten plausible values for each student and then computing the statistic of interest once using that average PV value. In addition, even if a set of plausible values is provided for each domain, the use of pupil fixed effects models is not advised, as the level of measurement error at the individual level may be large.

In practice, an accurate and efficient way of measuring proficiency estimates in PISA requires five steps:

1. Compute estimates for each plausible value (PV).
2. Compute the final estimate by averaging all estimates obtained from (1).
3. Compute the sampling variance (an unbiased estimate is provided by using only one PV).
4. Compute the imputation variance (the measurement error variance, estimated from the variation of the PV estimates around the final estimate over the set of PVs).
5. Compute the final standard error by combining (3) and (4).
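The five steps above can be sketched in code. The sketch assumes that the estimate and its sampling variance have already been computed separately for each plausible value (for example with replicate weights), and it combines them with the usual multiple-imputation formula, total variance = sampling variance + (1 + 1/M) x imputation variance, where M is the number of PVs. Averaging the sampling variances over all PVs is a choice made here; as noted in step 3, a single PV also gives an unbiased estimate. All numbers are invented.

```python
import math


def combine_plausible_values(estimates, sampling_variances):
    """Combine per-PV estimates into a final estimate and standard error.

    `estimates[i]` and `sampling_variances[i]` are the statistic and its
    sampling variance computed with the i-th plausible value.
    """
    m = len(estimates)
    # Step 2: final estimate is the average of the per-PV estimates.
    final_estimate = sum(estimates) / m
    # Step 3: sampling variance, here averaged over the PVs.
    sampling_variance = sum(sampling_variances) / m
    # Step 4: imputation variance, the variance of the per-PV estimates.
    imputation_variance = sum(
        (e - final_estimate) ** 2 for e in estimates
    ) / (m - 1)
    # Step 5: combine both sources of uncertainty.
    total_variance = sampling_variance + (1.0 + 1.0 / m) * imputation_variance
    return final_estimate, math.sqrt(total_variance)


# Invented per-PV results for a mean score (five PVs).
pv_estimates = [498.0, 500.0, 502.0, 499.0, 501.0]
pv_sampling_variances = [4.0, 4.0, 4.0, 4.0, 4.0]
final_mean, final_se = combine_plausible_values(pv_estimates, pv_sampling_variances)
```

Note that the statistic is computed once per plausible value and only the resulting estimates are averaged; the plausible values themselves are never averaged at the student level.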

Users will find additional information, notably regarding the computation of proficiency levels or of trends between several cycles of PISA, in the PISA Data Analysis Manual: SAS or SPSS, Second Edition.

Software packages and macros to analyse the PISA database

Several tools and software packages enable the analysis of the PISA database. These packages notably allow PISA data users to compute standard errors and statistics taking into account the complex features of the PISA sample design (use of replicate weights, plausible values for performance scores).

SAS or SPSS

Pre-defined SAS and SPSS macros have been developed to run various kinds of analysis and to correctly configure the required parameters, such as the name of the weights. These macros are available on the PISA website so that users can confidently replicate the procedures used for the production of the PISA results or accurately undertake new analyses in areas of special interest. Chapter 17 (SAS) / Chapter 17 (SPSS) of the PISA Data Analysis Manual: SAS or SPSS, Second Edition offers a detailed description of each macro.

The PISA Data Analysis Manual: SAS or SPSS, Second Edition also provides a detailed description of how to calculate PISA competency scores, standard errors, standard deviations, proficiency levels, percentiles, correlation coefficients and effect sizes, as well as how to perform regression analysis using PISA data via SAS or SPSS.

Download the SAS macro with 5 plausible values (please note that the macro for calculating PISA scores from PISA 2000 to 2012 is available with 5 plausible values among the linked macros; from 2015, 10 plausible values should be used to generate PISA performance scores). Download the SAS macro with 10 plausible values.
PISA Data Analysis Manual: SAS, Second Edition
Download the SPSS macros (please note that the macro for calculating PISA scores is available with 5 plausible values and has not yet been updated to analyse PISA 2015 data with 10 plausible values)
PISA Data Analysis Manual: SPSS, Second Edition

The IEA International Database Analyzer (IDB Analyzer) is an application developed by the IEA Data Processing and Research Center (IEA-DPC) that can be used to analyse PISA data among other international large-scale assessments.

The IDB Analyzer is a Windows-based tool that creates SAS code or SPSS syntax to perform analysis with PISA data. The generated SAS code or SPSS syntax takes into account information from the sampling design in the computation of the sampling variance, and handles the plausible values as well.

The code generated by the IDB Analyzer can compute descriptive statistics, such as percentages, averages, competency levels, correlations, percentiles and linear regression models. The tool enables users to test statistical hypotheses across groups in the population without having to write any programming code.

Stata

The package “repest”, developed by the OECD, allows Stata users to analyse PISA among other OECD large-scale international surveys, such as PIAAC and TALIS. “Repest” computes statistics using replicate weights, thus accounting for complex survey designs in the estimation of sampling variances. The package also allows for analyses with multiply imputed variables (plausible values); where plausible values are used, the average estimator across plausible values is reported and the imputation error is added to the variance estimator. “Repest” is a standard Stata package and is available from SSC (type “ssc install repest” within Stata to add “repest”).

Stata repest package description
Stata repest repository (includes a “Cheat sheet” and a section on “Getting started”)

R

The R package Rrepest, also developed by the OECD, offers similar functionality to the Stata repest package, but is faster, particularly in server environments, thanks to the use of parallel computing.

Rrepest repository (includes a “Cheat sheet”, links to documentation and worked-out examples)
Rrepest package description and manual (on CRAN)

The R package intsvy allows R users to analyse PISA data among other international large-scale assessments. The use of PISA data via R requires data preparation, and intsvy offers a data transfer function to import data available in other formats directly into R. Intsvy also provides a merge function to merge the student, school, parent, teacher and cognitive databases.

The analytical commands within intsvy enable users to derive mean statistics, standard deviations, frequency tables, correlation coefficients and regression estimates. Additionally, intsvy handles the calculation of point estimates and standard errors that take into account the complex PISA sample design with replicate weights, as well as the rotated test forms with plausible values.