Herding cats: The collection, classification and analysis of web-based content for online researchers

Rachael Adlington, School of Education

Abstract

As human interactions take place in online environments, researchers are examining web-based communicative artefacts with increasing attention. One emerging area of interest is the construction of web-based texts, such as blogs, by young children. The analysis of such texts presents novel challenges and opportunities for researchers. Unlike hard-copy texts, blogs by their very nature change over time, presenting themselves as 'moving targets' for analysis. Analysing blogs is akin to herding cats: a feat notoriously difficult to accomplish. Fortunately, blog data is electronic and web-based, which allows new cataloguing and data management solutions to emerge, along with better methods to sort and store large amounts of data. This paper explores the difficulties encountered in capturing online blog data. In doing so, it also showcases a number of solutions for the collection, storage, classification and analysis of electronic, web-based artefacts in general.

Introduction

As online interactions between people become commonplace, and popular media, news and current affairs proliferate in the online environment, researchers have turned their attention to the artefacts of this new research context (Baldry & Thibault, 2006). There is a strong tradition of using content analysis to analyse traditionally offline artefacts, such as newspapers and television programs, and the principles and practices of this analysis can apply equally to web-based artefacts (van Leeuwen & Jewitt, 2001). However, the differences between online and offline content present both challenges and opportunities to researchers using content analysis for web-based materials. The changeable nature of web-based artefacts renders them problematic moving targets for analysis.
Like herding cats, analysing web-based artefacts is notoriously difficult to do. The focus of this paper is on the mechanics of web-based text data management. It explores methods and tools that may facilitate research focused on web-based content, particularly research that employs content analysis. The paper draws upon experiences gained during a study that explored blogs created by young school-aged children. In doing so, it provides an overview of the difficulties of capturing online data, and showcases solutions for collecting, storing, classifying and analysing electronic, web-based artefacts.

The issues discussed in this paper are based on the author's experiences with capturing and analysing online content as part of her research on the blogging habits of young children. Authors of online texts, such as blogs, use a range of components, or resources, such as image, text and sound, separately and together to create meaning (Unsworth, 2008). Part of this research sought to analyse the content of blogs in terms of the ways in which young bloggers use and combine meaning-making resources.

Content analysis is an established method used to analyse a range of multimedia texts, including newspaper, television and film (Bell, 2001; Herring, 2010). The use of content analysis has been extended to a variety of web-based environments and artefacts (Herring, 2010). Examples include: online discussion boards (Hara, Bonk, & Angeli, 2000; Rourke, Anderson, Garrison, & Archer, 2001); websites (Ha & James, 1998); personal home pages (Dillon & Gushrowski, 2000; Döring, 2006); social network spaces, such as MySpace (Jones, Millermaier, Goya-Martinez & Schuler, 2008); and blogs (Herring, Scheidt, Bonus, & Wright, 2004; Herring, Scheidt, Kouper, & Wright, 2006; Lagu, Kaufman, Asch & Armstrong, 2008; Miller & Shepherd, 2004).
However, the notion of web content analysis can be seen in two lights: as web-based content analysis, and as web-content analysis (Herring, 2010). Web-based content analysis tries to fit traditional content analysis approaches to web content, using randomised sampling, and narrowed, pre-determined research questions, coding categories and coding processes in order to maintain methodological rigour (Herring, 2010). However, Herring argues that, in practice, research often requires a more exploratory approach to analysis of web content, as 'phenomena of interest cannot always be identified in advance of establishing a coding scheme' (Herring, 2010, p. 4). Also, the volume of data available for analysis makes truly random sampling difficult, and the 'intermingling of channels of communication may especially require novel coding categories' (Herring, 2010, p. 4). To further complicate matters, authors of online texts, such as blogs, use a range of components, or resources, such as images, text and sound, both separately and together to create meaning (Unsworth, 2008). Consequently, applying content analysis to web-based artefacts requires new approaches for these new sorts of texts.

One of the main obstacles to overcome for collection and analysis is the changeable nature of web-based texts (Herring, 2010; McMillan, Hoy, Kim & McMahon, 2008). Blogs, for example, are created through ongoing contributions, so changes to a blog's content are highly likely. Indeed, one of the young bloggers in the research that forms the basis of this paper added blog entries 21 times in 11 days. All web-based texts can be altered by their creators at any time, while some, such as blogs and social network spaces (e.g., Facebook), by their very nature change frequently. Thus, after establishing a data set, which in itself presents challenges, web-based texts must be collected and analysed in such a way as to minimise the chances of them changing between viewings.
Ensuring multiple coders are viewing identical items is also imperative for inter-rater reliability (McMillan, 2000). Other issues requiring consideration include: capturing and analysing moving content, such as film, animation and sound (Kim & Kuljis, 2010); capturing and analysing multiple pieces of content within one item; and describing the interplay between items (Baldry & Thibault, 2006). Preliminary findings of this blog research indicate that almost one-third of the blogs examined included video, animation or music, all of which required different approaches to collection and content analysis. Web-based content may also be non-linear or hyperlinked. This was found to be the case in the blog research and made collection and analysis of this content substantially different from that of linear texts.

In light of these issues, the following sections of this paper explore possible solutions for data collection and storage, classification and analysis of web-based texts. Discussion draws upon the research undertaken into the blog authoring practices of young school-aged children. Although the focus of the study was blogs, the methods and tools described apply equally well to other web-based texts, as these share common features such as changeable content, non-linear structure and the inclusion of multimedia and moving content.

Data collection

When collecting web-based texts, an initial consideration of the research was the need to revisit selected texts, as well as to find a way to easily catalogue, search and sort lists of texts within a data set. With respect to the blog-authoring research, this need was particularly important as large numbers of texts had to be examined. To revisit texts in a data set, the URL (web address) of each text could be stored using a web browser's bookmarking tool and accessed at a later time.
Building upon the bookmarking tool, many browsers, such as Mozilla's Firefox, include a library tool used to catalogue bookmarked URLs for ease of future access. The Firefox library tool enables URLs to be stored in folders. Descriptions or notes for each entry can be recorded, as shown in Image 1. URL entries can be tagged with terms to indicate the nature of a corresponding text, and the same tags are easily applied to other entries. Tags populate a separate section within the library, and clicking a particular tag reveals a list of all library entries carrying that tag (see Image 2).

The Firefox library allows stored entries to be searched (see Image 2). This function searches for terms present in any part of an entry (the name of the entry, tags, location or description), and can search within particular folders in the library. For the researcher, the ability to organise URLs, add descriptions and tags, and search library contents makes it easy to map a collection of texts in order to check that it matches the researcher's needs, and to find URLs of texts with particular characteristics within a data set for further analysis. Also, undertaking the process of assigning tags and descriptions provides the opportunity to become familiar with the texts and think about ways in which items in the data set may be described.

However, the search function in Firefox's library is limited and can only search for one term at a time. It cannot search, for example, using Boolean operators (e.g., AND, OR, NOT) or regular expressions to combine search terms. These limitations reduce the value of the library for the researcher who wishes to use it to sort and locate URLs with particular features indicated by tags; for example, tagged as authored by a five-year-old but not from Australia.
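The Boolean queries that the browser library cannot express are straightforward to script once tagged URLs are exported from it. The sketch below is a minimal illustration only, not a feature of any of the tools discussed; the records, URLs and tag names are invented for the example.

```python
# A minimal tag catalogue with Boolean filtering (AND / NOT), the kind of
# combined query a one-term bookmark search cannot express.
# All records, URLs and tag names below are invented examples.

def filter_bookmarks(bookmarks, all_of=(), none_of=()):
    """Return bookmarks carrying every tag in all_of and no tag in none_of."""
    matches = []
    for bookmark in bookmarks:
        tags = set(bookmark["tags"])
        if set(all_of) <= tags and not set(none_of) & tags:
            matches.append(bookmark)
    return matches

catalogue = [
    {"url": "http://blog-a.example.com", "tags": ["05yo", "australia"]},
    {"url": "http://blog-b.example.com", "tags": ["05yo", "uk"]},
    {"url": "http://blog-c.example.com", "tags": ["06yo", "australia"]},
]

# Blogs tagged as authored by a five-year-old but NOT from Australia:
hits = filter_bookmarks(catalogue, all_of=["05yo"], none_of=["australia"])
```

Run against the three invented records, `hits` contains only the second entry, which is exactly the "five-year-old but not Australian" query described above.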
Image 1: Firefox's library, showing folders, entries within a folder, and details of an entry including the blog's name, URL, tags and description.

Image 2: Firefox's library, showing a list of tags (left) and the library entries tagged with the selected tag '05yo' (right). The Search Bookmarks bar is also shown (top right).

In addition to browser-based bookmarking, online systems may be used to collate URLs. Social bookmarking services, such as Delicious, store bookmarked links in an online user account, and allow users to share bookmarks publicly (socially) or privately with selected individuals (Arakji, Benbunan-Fich & Koufaris, 2009). Using a private social bookmark space gives teams of researchers access to exactly the same data set from anywhere at any time. Social bookmarking tools provide similar functionality in terms of tagging but, like the Firefox library, are limited in terms of searchability (Delicious, n.d., paras. 13-15).

Another online solution for collecting bookmarks is the use of archiving services, such as Instapaper and Read it Later. An Instapaper account enables users to save web pages to read at a later time by clicking a button on the browser (Instapaper, 2008-2011). Bookmarks may be shared with colleagues, and pages can be read in a 'text-only' view. The text-only view may suit researchers who are only interested in analysing print (Frakes, 2010a). However, Instapaper is limited to archiving single pages, and it will not allow users to save pages that require a log-in (Frakes, 2010b). Tagging functionality is very limited, and a free Instapaper account is limited to ten items (Maguire, 2011). Read it Later is a similar service, but allows the download of pages for offline access and has a greater range of tagging capabilities (Read it Later, 2011).
Unfortunately, the free version has limited functionality.

The Firefox browser add-on (an additional piece of software added to the browser) TagSifter may be installed for greater tag-sorting functionality. TagSifter works with existing library entries and tags, allowing entries to be searched using single and multiple tags with a wide range of operators (see Image 3). Tags can be colour coded to group related tags (see Image 3); for example, yellow indicates all tags pertaining to date. Once data has been collected and catalogued, the next challenge for the researcher of web-based texts is to find suitable methods for data storage and classification.

Image 3: Left: TagSifter, showing the results ('related bookmarks') of the search for '05yo' and the other tags attached to those results ('related tags'). Related tags can also be clicked to search within the results of the first search. Right: the list of operators that may be used in TagSifter searches.

Data storage and classification

Recording the URLs of texts included in a data set is an important first step in data management. The use of a searchable library in which URL items can be tagged and annotated, as described above, is valuable, but was found to be insufficient for the blog research. As the data set grew and data analysis loomed, tagged URLs provided limited, text-only information and were insufficient to identify individual texts. To counter this kind of issue, a screenshot of a text may be easier to recognise in a database than simply that text's URL. A bigger issue is that a URL is only the link to a text; the text itself is still online and vulnerable to change. Some authors update texts on a regular basis, adding new content and deleting old content. Changeable texts may not matter for some research, while other research may wish to document changes in dynamic texts over time.
Critically, the process of content analysis can be thwarted if, for example, a text disappears from the Internet before analysis is complete. While content analysis of changeable online texts is increasing in popularity (McMillan et al., 2008), some researchers are yet to acknowledge the issue of data stability (e.g., Lagu et al., 2008; Jones et al., 2008). Researchers who analyse texts online run the risk of compromising data integrity unless they take steps to minimise this risk. One solution is to capture and store texts offline (Herring, 2010; McMillan, 2000). This allows the researcher to take a snapshot in time of a text that will not be subject to change (Kim & Kuljis, 2010).

A review of methodologies described in the published literature reveals a tendency to discuss issues pertaining to sampling, coding and inter-rater reliability (e.g., Arakji et al., 2009; Dillon & Gushrowski, 2000; Döring, 2006; Ha & James, 1998; McMillan, 2000). However, few authors include sufficient detail of the tools used to perform these tasks (e.g., Kim & Kuljis, 2010). The following two sections discuss a number of the solutions available for researchers to store, classify, code and analyse online texts. As technologies change at a rapid pace, this is not intended to be an exhaustive list. Rather, it should be considered a starting point for researchers, as well as an indication of the features and limitations of the kinds of software that may be used.

A number of methods and pieces of software may be used in combination to capture and store various parts of a web-based text offline, such as images of individual pages and entire sites. For example, in the case of blog pages, all comments, images, animated gifs and video can be captured. The advantage of capturing a web-based text as a screenshot is that it makes the text easier to recognise in a database than simply storing that text's URL. These images also provide a good starting point for later analysis.
However, screenshots only capture what is visible in the browser window, even though many homepages are longer than the browser window. To overcome this limitation, another Firefox add-on, Screengrab!, can be used to capture an entire single webpage as an image (see Image 4).

Image 4: Screenshot of the visible portion of a page (left) and the entire page captured by Screengrab! (right).

Once screenshots are captured using software such as Screengrab!, it is useful to store them in a searchable database designed for images. LittleSnapper is one solution that can store large quantities of images; it is easy to search using tags and descriptions, in a similar fashion to TagSifter (see Image 5). This utility enables the researcher to sort groups of screenshots and find items matching particular parameters for analysis.

Image 5: LittleSnapper interface, showing the search tool (centre top) and image details (right), including date of inclusion, tags and description.

An image database, such as LittleSnapper, is suitable for storing images of individual webpages, and can be used to house individual images from particular texts for later analysis. Kim & Kuljis (2010) also recommend Local Website Archive, a tool that can be used to download and store individual web pages, as opposed to screenshots of pages. However, neither screenshots nor Local Website Archive can capture meaning-making resources such as animated gifs, sound and video. Using screenshots also makes analysis of these types of resources, or the interplay between them and other parts of the text, very difficult. Furthermore, an image database is not ideal for housing non-linear texts, such as blogs or websites.
If, for example, the researcher wanted to capture an entire complex blog containing links to comments and pages within the blog, taking a series of screenshots and positioning them in an image database to reflect the blog's non-linear nature would be difficult, time consuming and ineffective. Depending on the complexity of the online texts under investigation, the type of resources included, and the anticipated nature of analysis, it may be necessary to devise a means of storing non-linear texts, such as blogs and websites, offline in their entirety.

To achieve this, the Firefox add-on ScrapBook may be used to download and store any website, including blogs. ScrapBook downloads and stores the contents of complex texts, such as blogs, for viewing offline. Hyperlinks remain active and the text can be navigated in the same way as online. Entries can also be catalogued and searched (see Image 6). ScrapBook renders images, animated gifs and sounds when viewing texts offline (see Image 6), and plays video content as long as it is located in the same place as in the original text online (for example, Blogger users can upload videos to their Blogger space for use on Blogger blogs, and these are downloaded and rendered by ScrapBook). However, ScrapBook does not download embedded videos located elsewhere, such as those embedded in blogs but actually hosted on YouTube. Such videos can be viewed in the online text, or downloaded and stored using other means, but not viewed offline with ScrapBook. Following on from data storage and classification, coding and analysis is now explored.

Image 6: ScrapBook, showing the search tool (top left), entries in folders (left) and the browser interface that displays the selected blog for navigation (right).

Data coding and analysis

There are a number of ways to reduce potential issues with data analysis of web-based texts.
In a meta-analysis of nineteen studies of online texts, McMillan (2000) identified several novel approaches to analysing texts online. For example, researchers can specify a particular window of time in which coders access texts, such as a period of a week (McMillan, 2000). Specifying timeframes for analysis is a practical way to approach large numbers of texts, or fewer, more complex texts, provided the texts are relatively stable (older websites, for example). Coders can be given a fixed window of time and/or a specific task to perform (McMillan, 2000). In the case of texts that include dated content, such as a newspaper archive, coders could code text that falls within a particular period; for example, articles from May 2005. These methods are suitable for texts in which additional content is included periodically but not typically removed, such as an archive-style website. While such methods can be and have been used for blogs (Lagu et al., 2008), researchers must understand that data validity might be compromised, as archived content on a blog can be changed retrospectively by its author.

Another option for some texts is to code historical iterations. Coding historical iterations is appropriate for rapidly changing texts that include histories of all alterations, such as a wiki. In this instance, the researcher would indicate a historical time period for coding wiki alterations; for example, changes recorded in the wiki history from January 2009 to December 2009.

All of these approaches aim to reduce the likelihood that coders are coding different iterations. However, these options are by no means foolproof, as some do not eliminate the problem of coding different iterations, or are only applicable to certain types of online texts. With respect to the research on blogs created by young school-aged children, coding presented unique problems not easily solved using online coding methods.
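The dated-content approach described above, in which coders work only on entries posted within a set period, reduces to a simple filter once posting dates are recorded alongside each entry. A minimal sketch, with invented entries and titles:

```python
from datetime import date

def in_coding_window(entries, start, end):
    """Keep only entries posted within the coding window (inclusive),
    so every coder works from the same slice of a dated text."""
    return [entry for entry in entries if start <= entry["posted"] <= end]

# Invented example entries with posting dates.
entries = [
    {"title": "Zoo trip",     "posted": date(2005, 4, 28)},
    {"title": "My new puppy", "posted": date(2005, 5, 3)},
    {"title": "School play",  "posted": date(2005, 6, 1)},
]

# Content from May 2005, as in the newspaper-archive example above:
may_2005 = in_coding_window(entries, date(2005, 5, 1), date(2005, 5, 31))
```

As the surrounding discussion notes, such a filter is only as trustworthy as the dates themselves: a blog author can edit a past entry without changing its posting date, so the window guarantees a shared slice but not an unchanged one.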
Some of the blogs were large texts that required a long time for analysis. Blogs, by their journaling nature, change regularly. Blog entries are date-stamped upon posting, but the author can edit past entries without changing the posting date. Also, blogs do not include histories, so the removal or editing of entries is not recorded. However, offline data storage circumvented these issues, and provided unique data coding and analysis opportunities not necessarily afforded by online analysis techniques and software solutions, or indeed by paper-based coding and analysis.

Tools used for text URL storage (the Firefox library, TagSifter) facilitate rudimentary data coding and analysis, in that stored URLs may be coded using multiple tags and tagged items may be sorted. For example, a set of blog URLs is coded with tags to indicate the age of the author and the frequency of blog entries. A search can then be conducted for all blogs authored by six-year-olds, and the 'frequency' tags on the resultant sub-set can be analysed to determine the average frequency of entries.

LittleSnapper, used to store screenshots of individual pages, also facilitates coding, as tags and descriptions can be added to images, and images can be annotated using a range of editing tools (see Image 7). Digital annotation of images is more efficient than annotating printed materials, as annotations are easily edited. Some editing tools also provide greater functionality than paper-based equivalents; for example, the annotation tool allows information to be added that can be collapsed and expanded. Also, a particular data sub-set can be located quickly post-annotation for further comparison using tag searches.

ScrapBook includes similar annotation features to LittleSnapper, such as a comments section, highlighter and annotation tool (see Image 8). Besides capturing replicas of entire live websites, ScrapBook is functionally superior to LittleSnapper for data coding and analysis.
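The tag-based analysis described above (selecting a sub-set by one tag, then averaging a numeric code on the result) can be scripted directly once tagged records are exported from the library. The records and values below are invented for illustration; 'frequency' stands for entries per week.

```python
# Average a numeric code over a tag-selected sub-set of records,
# e.g. the mean posting frequency of blogs authored by six-year-olds.
# All records and values below are invented examples.

def mean_code(records, tag, code):
    """Mean value of the given numeric code across records carrying the tag."""
    values = [r[code] for r in records if tag in r["tags"]]
    return sum(values) / len(values)

blogs = [
    {"url": "http://blog-a.example.com", "tags": ["06yo"], "frequency": 2.0},
    {"url": "http://blog-b.example.com", "tags": ["06yo"], "frequency": 4.0},
    {"url": "http://blog-c.example.com", "tags": ["05yo"], "frequency": 1.0},
]

average = mean_code(blogs, tag="06yo", code="frequency")  # 3.0 entries per week
```

The same two-step pattern (filter by tag, then aggregate) extends to any coded attribute recorded alongside a URL, whether in a browser library export or a spreadsheet.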
Text details, such as the text title, URL and download date, are automatically recorded by ScrapBook. Significantly for researchers, coded data can be compiled and exported from ScrapBook to another document, such as a Word document or spreadsheet. Analysis can then be undertaken using statistical analysis software, such as PASW (Predictive Analytics SoftWare, formerly known as SPSS).

Image 7: LittleSnapper interface, showing image editing tools: rectangle drawing tool (red rectangle around text), blur tool (over face), arrow and annotation (in expanded format).

Image 8: ScrapBook, showing annotation tools (icons in the horizontal bar beneath the blog): highlighter, in-line comments and sticky annotation, erasers; and comments tool (bottom).

Conclusion

The commonalities between web-based and other forms of texts allow researchers to utilise content analysis in their undertakings. There is a firm tradition of content analysis of artefacts containing print, and a burgeoning methodological collection of advice regarding content analysis of images, film/video and, more recently, sound (Baldry & Thibault, 2006; Bell, 2001; van Leeuwen & Jewitt, 2001). However, web-based texts, such as the blogs in this study, present researchers with unique challenges. Online researchers may find the need to capture and analyse ever-changing web-based texts, a task akin to herding cats. The texts may be non-linear in nature, requiring a more three-dimensional approach to data collection and analysis, and texts may also contain moving components, such as animated gifs, video and music, necessitating data collection techniques capable of capturing this type of content. Each of these challenges was encountered during the blog research, and a number of tools were sourced and trialled to determine their value in countering these issues.
Upon comparison, it was found that tools provided functionality pertinent to particular types of web-based texts and/or texts of a particular scale. However, each tool has its limitations, particularly regarding its ability to capture problematic data, such as sound, animation and more complex texts. Regardless of limitations, such tools both facilitate the capture, storage, classification, coding and analysis of web-based texts, and enhance the capabilities of researchers working with web-based texts. The capabilities of tools range from the fundamental abilities of browsers that can store and catalogue basic information about webpages, to browser add-ons and stand-alone software applications that enhance all facets of data management and analysis of web-based texts. With so many options available, it was found that the real challenge for researchers of web-based texts was less a question of how to collect, manage and analyse data, and more a question of which tool to use for which operation. Defining the types of data likely to be encountered and the ways in which data will be coded and analysed are key challenges to be overcome, followed by the sourcing of appropriate software to support these endeavours.

References

Arakji, R., Benbunan-Fich, R., & Koufaris, M. (2009). Exploring contributions of public resources in social bookmarking systems. Decision Support Systems, 47(3), 245-253.

Baldry, A., & Thibault, P. J. (2006). Multimodal Transcription and Text Analysis. London: Equinox Publishing Ltd.

Bell, P. (2001). Content analysis of visual images. In T. van Leeuwen & C. Jewitt (Eds.), Handbook of Visual Analysis (pp. 10-34). London: Sage Publications.

Delicious (n.d.). Frequently Asked Questions. Retrieved from http://www.delicious.com/help/faq

Dillon, A., & Gushrowski, B. A. (2000). Genres and the web: Is the personal home page the first uniquely digital genre? Journal of the American Society for Information Science, 51(2), 202-205.

Döring, N. (2006). Personal home pages on the web: A review of research. Journal of Computer-Mediated Communication, 7(3), 19.

Frakes, D. (2010a). Techworld web graphic tools Instapaper review: Review. Retrieved from http://review.techworld.com/web-graphics-tools/3223802/instapaper-review/?view=review&intcmp=rv-ia-tb-

Frakes, D. (2010b). Techworld web graphic tools Instapaper review: Verdict. Retrieved from http://review.techworld.com/web-graphics-tools/3223802/instapaper-review/?view=review&intcmp=rv-ia-tb-

Ha, L., & James, E. L. (1998). Interactivity reexamined: A baseline analysis of early business web sites. Journal of Broadcasting & Electronic Media, 42(4), 457-474.

Hara, N., Bonk, C. J., & Angeli, C. (2000). Content analysis of online discussion in an applied educational psychology course. Instructional Science, 28, 115-152.

Herring, S. C., Scheidt, L. A., Bonus, S., & Wright, E. (2004). Bridging the gap: A genre analysis of weblogs. Paper presented at the 37th Hawaii International Conference on System Sciences (HICSS-37), Los Alamitos.

Herring, S. C., Scheidt, L. A., Kouper, I., & Wright, E. (2006). A longitudinal content analysis of weblogs: 2003-2004. In M. Tremayne (Ed.), Blogging, Citizenship and the Future of Media (pp. 3-20). London: Routledge.

Herring, S. (2010). Web content analysis: Expanding the paradigm. In J. Hunsinger, L. Klastrup & M. Allen (Eds.), International Handbook of Internet Research (pp. 233-249). Dordrecht: Springer.

Instapaper (2008-2011). Instapaper: A simple tool to save web pages for reading later. Retrieved from http://www.instapaper.com/

Jones, S., Millermaier, S., Goya-Martinez, M., & Schuler, J. (2008). Whose space is MySpace? A content analysis of MySpace profiles. First Monday, 13(9). Retrieved from http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2202/

Kim, I., & Kuljis, J. (2010). Applying content analysis to web-based content. Journal of Computing and Information Technology, 18(4), 369-375. Retrieved from http://cit.srce.hr/index.php/CIT/article/view/1924/

Lagu, T., Kaufman, E. J., Asch, D. A., & Armstrong, K. (2008). Content of weblogs written by health professionals. Journal of General Internal Medicine, 23(10), 1642-1646.

van Leeuwen, T., & Jewitt, C. (Eds.) (2001). Handbook of Visual Analysis. London: Sage Publications.

Maguire, A. (2011). Instapaper App Review – Definitely one to check out! Retrieved from http://digireado.wordpress.com/2011/03/04/instapaper-app-review-%E2%80%93-definitely-one-to-check-out/

McMillan, S. J. (2000). The microscope and the moving target: The challenges of applying content analysis to the world wide web. Journalism and Mass Communication Quarterly, 77(1), 80-98.

McMillan, S. J., Hoy, M. G., Kim, J., & McMahon, C. (2008). A multifaceted tool for a complex phenomenon: Coding web-based interactivity as technologies for interaction evolve. Journal of Computer-Mediated Communication, 13, 794-826.

Miller, C. R., & Shepherd, D. (2004). Blogging as social action: A genre analysis of the weblog. In A. Gurak, S. Antonijevic, L. Johnson, C. Ratliff & J. Reyman (Eds.), Into the Blogosphere: Rhetoric, Community, and Culture of Weblogs. Retrieved from http://blog.lib.umn.edu/blogosphere/blogging_as_social_action_a_genre_analysis_of_the_weblog

Rourke, L., Anderson, T., Garrison, D. R., & Archer, W. (2001). Methodological issues in the content analysis of computer conference transcripts. International Journal of Artificial Intelligence in Education (IJAIED), 12, 8-22.

Unsworth, L. (2008). Multiliteracies and metalanguage: Describing image/text relations as a resource for negotiating multimodal texts. In J. Coiro, M. Knobel, C. Lankshear & D. Leu (Eds.), Handbook of Research on New Literacies (pp. 379-408). New York: Lawrence Erlbaum Associates.

2010 POST GRADUATE CONFERENCE