Choosing among data sources for LMI
General 'health warnings', applicable across a number of data sources, include:
Provenance of data
When considering any data set, it is helpful to keep in mind how the data were collected (i.e. the methodology) and why they were collected. This enables an initial assessment of the likely reliability and robustness of the data. Questions to consider include:
- if the data were collected for a specific reason, what are the implications of the rationale of data collection for the coverage and reliability of the data set?
- what are the implications of the data collection methodology for the coverage/degree of detail available in the data set?
- what period do the data relate to?
- if the information is not current, but is being used as a proxy for the prevailing situation, is there any reason to expect that there have been substantial changes in the period since data collection?
In general, if the provenance of data cannot be established clearly (in terms of sourcing and timing) it is prudent to exercise a degree of caution in interpreting that information and translating it into intelligence.
Classification issues
Various classification systems (both standard and non-standard) are available. Users need to be aware that:
- Standard classification systems (e.g. Standard Industrial Classification, Standard Occupational Classification) change over time to take account of developments in economy and society. (This has implications for analysis of trends.) Some suppliers match 'new' and 'old' classification systems in order to produce data series on a consistent basis. There is a tension between, on the one hand:
- pressures to resist changes to classification systems, in order to maintain comparability between data sources and over time (so enabling the generation of time series data); and, on the other hand,
- pressures to update classification systems to better reflect reality, address new 'policy' issues, etc.
- A category name/label may not necessarily have the same coverage between sources - but there is a move towards 'standard' classifications and 'harmonisation' of classification systems (e.g. across the EU). Therefore, it may be appropriate to check the detail of classification systems.
- (The issue of the same label encompassing different definitions of the same phenomenon tends to arise when a particular issue rises up the policy agenda and no universal standard is agreed or adopted - examples include definitions of 'cultural industries', 'the knowledge economy', etc.)
- Non-standard classification systems might well seem attractive for a particular purpose, but difficulties arise when a system/facility draws together data sets/information using different schemes, because of a lack of comparability.
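The matching of 'old' and 'new' classification systems mentioned above can be sketched as a simple lookup table. This is a minimal illustration only: the codes, the mapping, the shares and the counts are all invented for the example and do not correspond to any real SIC or SOC revision, and the function name `reclassify` is hypothetical. Where an old category splits across several new ones, the sketch assumes fixed shares; in practice such shares would themselves need to be estimated and justified.

```python
from collections import defaultdict

# Hypothetical lookup: old code -> list of (new code, share of old category).
# Shares for each old code sum to 1. All codes and shares are invented.
OLD_TO_NEW = {
    "A1": [("N10", 1.0)],                 # one-to-one match
    "A2": [("N10", 1.0)],                 # two old codes merge into one new code
    "B1": [("N20", 0.5), ("N30", 0.5)],   # one old code splits (assumed 50/50)
}


def reclassify(counts_by_old_code):
    """Re-express counts held on the old classification on the new one."""
    result = defaultdict(float)
    for old_code, count in counts_by_old_code.items():
        if old_code not in OLD_TO_NEW:
            # Fail loudly rather than silently dropping an unmatched category.
            raise KeyError(f"no mapping for old code {old_code!r}")
        for new_code, share in OLD_TO_NEW[old_code]:
            result[new_code] += count * share
    return dict(result)


# Invented employment counts on the old classification.
old_counts = {"A1": 100, "A2": 50, "B1": 200}
new_counts = reclassify(old_counts)
print(new_counts)
```

The total is preserved by construction, but the split of "B1" is only as good as the assumed shares, which is one reason why matched series should be interpreted with care.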
Boundary and 'geography' issues
Some of the issues here are similar to those addressed under the 'classification issues', immediately above.
- Boundaries of geographical areas may change over time. A recent key change was the shift from Standard Statistical Regions to Government Office Regions. Such boundary changes have implications for the generation of time series statistics.
- Some geographical areas are more 'stable' than others are.
- At sub-regional level, especially, the same 'name' can refer to different geographical units - e.g. Cambridge local authority district, Cambridge Travel-to-Work Area (TTWA), etc. Often, users may bring together data on different topics from a number of sources adopting different geographical units.
- Use of non-standard geographies poses a problem for an LLMI facility/system because comparability between sources cannot be assured. Also, what one person thinks of as 'area X' may be different from what another person thinks of as 'area X'.
- When focusing attention on a particular region or local area, there is often a tendency for information users to 'treat' that area as an 'island' - cut off from the influence of cross-boundary flows.
Residence- and workplace-basis of information
This links to the issue of cross-boundary flows. It is important to know whether the data from a particular source refer to people living in an area (i.e. a residence base) or people working in an area (a workplace base). Sometimes indicators are compiled using a numerator on one base and a denominator on the other, which can distort the resulting rate in areas with substantial commuting flows.
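The effect of mixing bases can be seen in a small worked example. All figures below are invented: a hypothetical area with net in-commuting, where the number of jobs located in the area exceeds the number of residents in work.

```python
# All figures are invented for illustration.
working_age_residents = 10_000   # residence base (denominator)
employed_residents = 6_000       # residence base: residents in work, wherever the job is
jobs_in_area = 9_000             # workplace base: jobs located in the area, whoever holds them

# Numerator and denominator on the same (residence) base.
consistent_rate = employed_residents / working_age_residents

# Workplace-based numerator over a residence-based denominator.
mixed_rate = jobs_in_area / working_age_residents

print(f"consistent (residence/residence): {consistent_rate:.0%}")
print(f"mixed (workplace/residence):      {mixed_rate:.0%}")
```

Under these assumed figures the consistent rate is 60% and the mixed rate is 90%. The mixed figure is not a measure of residents' employment at all; in an area with net out-commuting the same mismatch would push the figure the other way.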
Survey non-response bias
In any data based on a survey it is important to consider the possibility of bias caused by non-response, together with the impact of such non-response on the robustness and quality of the data.
Social surveys often find that the most socially excluded sections of the population do not respond to surveys. The people most difficult to survey are those who are difficult to contact at home (because they are out or because they are unwilling to answer the door to strangers) and those who are alienated from wider society. This is a particular problem, since these are the target groups for many government initiatives aimed at combating social exclusion.
Related to survey non-response bias are further issues of:
- proxy responses - In some surveys a member of the household may provide answers on behalf of other members of the household. Users need to bear in mind whether, and to what extent, the use of proxy responses has implications for the quality of the data.
- recall error - In some surveys respondents are asked to remember events over a period of time. This introduces the possibility of recall error.
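One simple first check for possible non-response bias is to compare response rates across groups in the sample. The sketch below assumes that the numbers sampled and responding are known for each group; the group names, the counts and the 15-percentage-point threshold are all invented for illustration.

```python
# Invented figures: number of people sampled and number responding, by group.
sampled = {"group_a": 400, "group_b": 400, "group_c": 200}
responded = {"group_a": 320, "group_b": 280, "group_c": 60}

overall_rate = sum(responded.values()) / sum(sampled.values())

# Arbitrary threshold for this sketch: flag groups whose response rate is
# more than 15 percentage points below the overall response rate.
THRESHOLD = 0.15

response_rates = {g: responded[g] / sampled[g] for g in sampled}
under_represented = [
    g for g, rate in response_rates.items() if overall_rate - rate > THRESHOLD
]

print(f"overall response rate: {overall_rate:.0%}")
for g, rate in response_rates.items():
    print(f"{g}: {rate:.0%}")
print("groups to examine for non-response bias:", under_represented)
```

A low response rate in one group does not by itself show that the results are biased; it indicates where respondents may differ from non-respondents and where further scrutiny is warranted.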
Scope and coverage of administrative data
A key advantage of administrative data sets is often that they provide complete coverage, even at successively more disaggregated geographical scales. However, the user needs to bear in mind that administrative data are collected for administrative purposes, and so reference is made to administrative definitions. As administrative definitions change, so do the scope and coverage of the administrative data collected. This can create difficulties in generating time series data. Moreover, the effect of changes in scope and coverage of administrative counts can vary at different geographic levels.
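The difficulty with time series can be illustrated by tagging each observation with the definition in force when it was collected, and declining to compute change across a definitional break. This is a sketch only: the periods, the definition labels and the counts are invented.

```python
# Invented administrative counts: (period, definition version, count).
series = [
    ("2001", "def_1", 1200),
    ("2002", "def_1", 1150),
    ("2003", "def_2", 900),   # the administrative definition changed here
    ("2004", "def_2", 880),
]

changes = []
for (p0, d0, c0), (p1, d1, c1) in zip(series, series[1:]):
    if d0 == d1:
        changes.append((p1, c1 - c0))
    else:
        # A change across a definitional break mixes real change with a
        # change in scope, so it is recorded as not comparable.
        changes.append((p1, None))

print(changes)
```

In this invented series the apparent fall between the second and third periods cannot be separated from the change in definition, so it is reported as not comparable rather than as a fall of 250.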
Alternative information sources
In order to answer a particular question or examine a specific topic of interest, there may be a number of different data sources to which a user can turn for information. While in some instances the sources will 'tell the same story', in other instances the details/trends may be contradictory. This may arise because different methodologies were used to collect information, coverage may vary, the concepts may be defined differently, different classification systems may have been used, the time period to which the information refers may be different, or the appropriateness of the analytical techniques used in manipulation of data may vary. If 'the stories are different' it does not necessarily mean that one source is 'right' and the other 'wrong', or that one source is 'better' than the other. Rather, further investigation may be necessary to find the reasons for the differences.