Data Management

All I Really Need to Know I Learned From Peer-Reviewed Papers (Part 1)

I remember feeling a spark of urgent curiosity when I found a copy of All I Really Need to Know I Learned in Kindergarten on a shelf in the guest bedroom. I was 11. And though I had made it to middle school, I had never attended kindergarten. This book contained information that I lacked and needed. I hid under the guest bed and read it cover-to-cover.

This character trait — this drive to read my way into knowledge — is still going strong in my life as an early career ecologist. Recently, I turned to Dr. Marieke Frassl’s 2018 Ten simple rules for collaboratively writing a multi-authored paper as I took on a leadership role writing a paper with my postdoc cohort. Reading this guide for collaborative writing gave me a new sense of focus and energized me for the ensuing work of organizing notes, framing our paper, and planning for an upcoming writing retreat.

I’m a reader, and so it shouldn’t be surprising that I seek paper-based advice in the stacks of my #365papers To Read Pile. Reflecting on the helpful scaffolding that I found in Ten simple rules for collaboratively writing a multi-authored paper, I pulled out my favorite Advice Papers from the last year. Flipping through the pdfs, I wondered, Why do we publish advice in journals? Why did these papers, which often echo advice I’ve already received in person or on twitter, resonate so much for me? What does it mean to offer your advice via peer-reviewed papers?

One of the major perks of writing for PLoS Ecology is the opportunity to cold-email scientists (or work-email scientist-friends) and pick their brains about their papers on exploding pollen, unexpected biodiversity hotspots on historic battlefields, and epic fieldwork roadtrips. So, I started writing to the authors of my favorite Advice Papers. This exercise took on a life of its own as Advice authors shared their stories, and their advice, with me. At the same time, I started collaborating on my own Advice Paper with coauthors. The project of selecting the year’s top Advice Papers has expanded beyond my initial curiosity and grown way too long for a single blog post. Here is the first of a two-part series on the best recent Advice Papers in ecology — Part One: How to Do the Science.

The two best papers I read on doing science were Broman and Woo's 2018 Data Organization in Spreadsheets in The American Statistician and Dyson et al.'s 2019 Conducting urban ecology research on private property: advice for new urban ecologists in the Journal of Urban Ecology. I ranked Data Organization in Spreadsheets as one of my top-ten Summer 2018 papers, and I continue to stan this lovely guide to foundational data management. While my research is largely National Parks-based and urban ecology on private property seems to fall outside of my wheelhouse, I appreciate the framework for planning urban fieldwork in Dyson's paper, and my friend Carly Ziter is a coauthor. When the paper came out, Carly tweeted "A few of us ECR urban ecologists got together and wrote the paper we wish we had been able to read before starting private property research." At the time, I was hip-deep in revisions with a few alpine ECR ecologists on the paper that we wished we had been able to read before starting common garden research. I needed to read someone else's version of the paper they wished they'd been able to read, and to see that this process could be completed.

Dr. Karen Dyson explained, “During my first (urban) field season I realized very quickly that I had had no idea what I was getting myself into.” She was surprised by the time commitment needed for communicating with private property owners to set up site visits and experienced the gamut of hospitality from having security called on her to being subject to overly-friendly non-stop talkers. “Basic things like bathroom breaks required more planning than you would expect. If I recall correctly, it was this last point that I was commiserating with my co-author Tracy about when the first idea for this paper came about.” Second author Dr. Carly Ziter agreed, “Like Karen, I didn't know many people working on private land when I started my PhD fieldwork, and I really just muddled through it pretty naively.” Private property is an important part of the urban ecological landscape, but the challenges of working on private property mean that urban ecology research is often conducted through remote sensing or from a sidewalk. Dyson wrote, “You’re never going to understand ecology in cities if you don’t engage with people—and not just park administrators, but the individuals who make myriad decisions each day on every parcel about what trees to cut down, what shrubs to plant, etc. All this is critical to furthering the field, and we wanted to see more of it, done well, with sensitivity to the people whose lives we’re intruding on.”

Dyson put together a workshop on the topic for ESA 2016, and Ziter attended. She remembers thinking, “finally, other people who get what this is like!” Dyson interviewed Ziter for the paper, and as Ziter remembers, “at some point, I think I more or less invited myself onto the team (thanks Karen et al!). I started out thinking this is the paper I wish I had been able to read as a graduate student, and of course by the time the paper came out I was starting my own lab, so now I think I'm so excited that MY grad students will be able to read this before they start fieldwork.”

I asked Ziter and Dyson why they decided that this advice needed to be presented in a peer-reviewed paper. Ziter notes that “Urban ecology is growing really quickly right now. And as the field grows, there are more and more students collecting urban data whose advisors/labmates are not trained in urban ecology or urban field methods (e.g. in my case, I was the only urban-focused grad student in my lab). So there isn't that passed-down or institutionalized knowledge present within research groups to help students get started.” And, as Dyson recognizes, “Peer-review is more permanent and has gravitas, and can be cited as a reason for doing something. We also wanted open source, since it’s accessible to those without library connections. Also, this is a serious subject that needs to be treated seriously, and often isn’t… which is also why we interviewed almost 30 people from as many countries as we could and went searching outside the discipline for role models.” There’s definitely some field site pride on the line. Carly explains the exasperation of hearing, “oh you do urban ecology? Your fieldwork must be so easy.” “Really the logistics are often more challenging than working in traditional field sites. So it was personally really rewarding to be able to help Karen and the team articulate in a more formal way that hey, this isn't just in our heads, there really are unique and pervasive challenges inherent in this kind of work (just as there are challenges inherent in more remote field ecology that we don't face!)”

The origin story behind Data Organization in Spreadsheets is a bit different from Dyson's work to build a coalition dedicated to capturing and publishing best practices for field work on private property. Dr. Karl Broman's website on organizing data in spreadsheets — "largely a response to a particularly badly organized set of data from a collaborator" — already existed when Jenny Bryan and Hadley Wickham were organizing a special issue on Data Science for the journal The American Statistician. He admits that "it seemed unnecessary to write an article when I could already point people to the website," and he backed out of his promise to contribute to the special issue. But, he reports, "Jenny didn't want me to back out and asked several friends if they'd help me to write the article, and Kara Woo agreed to do that and did the bulk of the work of rearranging the content in the form of an article and adding an introduction citing relevant literature."

The peer review process for Data Organization in Spreadsheets was fairly straightforward. Broman writes, “every article solicited for the issue was assigned two reviewers from among the authors of other articles. The reviews were constructive and helpful. After the review, the article was published at PeerJ Preprints and also formally submitted to American Statistician...American Statistician is paywalled; available to most statisticians but not many others. I paid some huge fee (like $3500) to make it open access, since the target audience for the paper is much broader. I hemmed and hawed about whether to pay to make it OA; the fee seemed way too high, and the material was already available both at PeerJ Preprints and as a website. But I did pay and I'm glad I did, because I think way more people have read the paper, as a consequence of it being free. If people find the paper and it's available, they'll read it, but I think if they get a paywall, they're not likely to look further to find a free version.”

In contrast, the urban ecology peer review process was long and winding, though it also included a PeerJ Preprint. When it was finally published, Dyson shared the journey in a twitter thread. "It was desk rejected from Landscape and Urban Planning and Methods in Ecology and Evolution and rejected after review from Urban Ecosystems." She remained dedicated to the paper throughout: "Since I ran the workshop at ESA 2016 and a well-attended poster at ESA 2017, we knew there was a need for it among students…We also put it in PeerJ preprints and it was one of the top five read/visited papers of 2018. So despite getting very frustrated with the process, we didn't really lose faith in the manuscript—though we did give it a complete reorganization after the rejection from Urban Ecosystems. We saw Journal of Urban Ecology was doing a free open access as they got started and decided 'why not?' since they'd also published Pickett and McDonnell's The art and science of writing a publishable article. They've been lovely throughout the process—and have been great about re-tweeting and promoting the paper. It's now one of their most read articles." Here, Ziter chimed in to say, "I should disclose that I am sometimes the thumbs behind that twitter account. So that's why it got good twitter press ;). But I have no other role in the journal decisions or review process - so the rest of the loveliness is on them!"

Finally, I asked Broman and Dyson if they had any favorite Advice Papers. Dyson answered with an enthusiastic "Yes! In general, I love advice papers and papers that compare methodology, so I enjoyed putting this one together and hope to do more!" (I agree — we should write an urban-alpine ecology crossover!). She highlighted, "Hilty and Merenlender's 2003 paper that deals with many of these issues (though not as in depth) on rural private property… [and] we used a few papers as models when we were writing (and re-writing) our manuscript, including Harrison's Getting started with meta-analysis; Goldberg et al.'s Critical considerations for the application of environmental DNA methods to detect aquatic species; and particularly Clancy et al.'s Survey of Academic Field Experiences (SAFE): Trainees Report Harassment and Assault."

Broman writes that he didn't seek out any advice papers for guidance/structure while writing Spreadsheets. He muses, “I think the main advice papers I'm familiar with are those "ten tips for ..." [sic] at PLoS Computational Biology, which have been really useful though I think the formula has become a bit grating. I also really like Bill Noble's paper on organizing projects.”

Thanks to Broman, Dyson and Ziter for sharing their advice and adding to my reading list. Both of these papers are well-written and offer tangible, useful advice. I've found myself ruminating on them as I plan future fieldwork, and definitely wishing I could have read them much earlier as I wrap up old projects and wrestle with my old data. Stay tuned for Part Two: How to Write About the Science You (and Others) Did.

References:

Dyson, K., Ziter, C., Fuentes, T. L., & Patterson, M. S. (2019). Conducting urban ecology research on private property: advice for new urban ecologists. Journal of Urban Ecology, 5(1), juz001. https://doi.org/10.1093/jue/juz001

Broman, K. W., & Woo, K. H. (2018). Data Organization in Spreadsheets. The American Statistician, 72(1), 2–10. https://doi.org/10.1080/00031305.2017.1375989

Hiking with Reviewer 2

This is a deep dive into my own research — the backstory behind a single line in a recently published paper and the data-driven trip down memory lane that was spurred by an innocent question from Reviewer 2. 

This research took place on Wabanaki land. I want to respectfully acknowledge the Maliseet, Micmac, Penobscot, and Passamaquoddy tribes, who have stewarded this land throughout the generations. I am certainly not the first person to devote time and energy to tracking seasonal changes on Mount Desert Island. 

This week one of my dissertation chapters, Trails-as-transects: phenology monitoring across heterogeneous microclimates in Acadia National Park, Maine, was published in the journal Ecosphere. In this project, I pulled the space-for-time trick and hiked three mountains repeatedly to collect a lot of phenology observations across diverse microclimates. The mountains in Acadia are not huge — these granite ridges roll up from the Gulf of Maine and top out at 466 m — but my transect hikes were between 4.8 km and 9.7 km each, and I wore out a pair of trail runners each season. I took to heart Richard Nelson’s advice: “There may be more to learn by climbing the same mountain a hundred times than by climbing a hundred different mountains.” 

A couple months ago, in our second round of reviews, Reviewer 2 noted, “I think that it would be useful for those wanting to replicate your transect-as-trails approach (especially land managers) to know approximately how many person hours it took to complete a transect observation, here in the main text or in the appendix.” I had a magnet (which is apparently also available as a coaster) hanging next to my desk in grad school: over a silhouette of a golden retriever with three tennis balls in its mouth, it reads: “If it’s worth doing…it’s worth overdoing.” This magnet perfectly describes my response to Reviewer 2. I sent a back-of-the-envelope estimate to my coauthors, but I couldn’t shake the feeling that the precise person hours per transect was a knowable statistic. In addition to my field notes scribbled into weatherproof notebooks, I had collected my data via fulcrum, a smartphone app that automatically recorded the time of each observation. From my cache of fulcrum csv and xlsx files, I should be able to automatically pull the time of the first and last observation of each transect. The 10.7 MB of data in my fulcrum files represented four years of field work, hours and hours on the trails, slogging through rain, snow, and sun, training field assistants, combing through patches of lowbush blueberry and mountain cranberry for the first, hidden open flower.

I became obsessed with the idea of seriously calculating person hours per transect, but I was increasingly convinced that a single number would be meaningless. I also realized that I lacked the coding chops to deal with my messy raw data: 171 files, each with 77 columns, usually containing data from a single transect, but occasionally comprising half a transect (when we had to bail due to weather) or more than one transect (when I ran ambitious double-days, or my field assistants and I split up). I turned to Porzana Solutions, and Auriel Fournier expertly helped me unlock my person hours data. Over the 177 hikes in my fulcrum files, the mean time between first and last observation is 3.51 hours.
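For the curious, here is a minimal sketch of the kind of tidyverse calculation involved — far tidier than my 171 real files demanded, and with hypothetical column names (transect, created_at) standing in for the actual fulcrum export fields:

```r
library(dplyr)
library(purrr)
library(readr)
library(lubridate)

# Read every exported fulcrum csv into one long table of observations.
# (The real files needed far more cleaning than this sketch implies.)
files <- list.files("fulcrum_exports", pattern = "\\.csv$", full.names = TRUE)
observations <- map_dfr(files, read_csv)

# One row per transect per day: hours between the first and last observation.
hike_times <- observations %>%
  mutate(created_at = ymd_hms(created_at)) %>%
  group_by(transect, hike_date = as_date(created_at)) %>%
  summarise(
    person_hours = as.numeric(difftime(max(created_at), min(created_at),
                                       units = "hours")),
    .groups = "drop"
  )

mean(hike_times$person_hours)
```

Grouping by transect and date is only a rough proxy, of course — the half-transects and double-days described above are exactly why a single mean is misleading.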

Three and a half hours does not even begin to tell the story. This blog post is my second supplemental appendix. Here is the story of person hours per transect — the lead time, the pregnant field season, and the phenology of phenology monitoring. 

Before the first observation and after the last

There is a lead time in every transect hike. After rolling out of bed, pulling on the same old running shorts, race tshirt, and powder blue sunglasses, after packing the same handful of granola bars, dried papaya, and sharp cheddar, zipping my phone into its waterproof case, and slinging my backpack into the passenger seat, after driving to the trailhead and placing my research permit on my dashboard, there’s still a gap between the start of the fieldwork and the first official observation of the day. Especially as the summer crowds began arriving in June, I had to get out early to grab a spot at the limited parking by the north or the south end of Pemetic, or else add some extra miles from a spillover lot*. Even at the best parking spot, the approach to the Sargent South Ridge trailhead requires navigating 0.7 miles of carriage roads between the car and the trail on every hike. When I started the project in 2013, the Sequester kept Park Loop Road closed late into the spring season. For the first six weeks of fieldwork, I walked along the empty road to access Cadillac North Ridge, and Pemetic North and South Ridge.

The transect hikes were 4.8 km (Pemetic), 9.2 km (Cadillac), and 9.7 km (Sargent) up the North Ridge and down the South Ridge or vice versa (all of the mountains had uncreatively named north and south ridge trails). So at the end of a transect, I was 4.8, 9.2, or 9.7 km away from my car. I could run the carriage roads to connect the trailheads after Sargent or Pemetic (a 6.6 km run post-Sargent, and a 7.2 km run post-Pemetic). From Cadillac South Ridge, a run up Route 3 to Park Loop Road got me back to the north ridge trailhead in 10 km. Sometimes I arranged rides with friends to skip the run, and when I had funding for field assistants in 2015 and 2016 we often carpooled to drop a car at the finish line for each other. (There were some benefits to this running routine — in 2014 I won free ice cream after placing third in my age group in the Acadia Half Marathon.)

The person hours per transect statistic is limited because not every transect was a straight shot. Sometimes we had to bail 3 km into a hike due to bad weather and finish the transect another day. Once, one of my field assistants took a wrong turn and recorded phenology observations on the wrong trail down Pemetic, and so I went back, retraced her steps, and picked up the right trail the next day. Once, I did a wild two-a-day: in the middle of the Cadillac transect, I ran down the Canon Brook Trail, looped through the Pemetic transect, and then ran back up the Cadillac West Face Trail to finish Cadillac. Once, I had a friend in town and we caught a ride to the summit of Cadillac and then enjoyed the leisurely hike down the south ridge with my eight-month-old in the baby backpack.

While the time between first and last observation averaged just over 4 hours for Cadillac, 2.5 hours for Pemetic, and 3 hours and 40 minutes for Sargent, those times discount the bookends of the hikes. As much as I'm railing against the answer to my query here, the process of working with Porzana Solutions to calculate these times has been incredibly rewarding. I feel like I'm getting to know both my raw data and the tidyverse in a weirdly intimate way that goes way beyond a standard tutorial.

The pregnant field season

In 2015 I was 17 weeks pregnant at the start of my field season. In addition to my daughter, I was also joined in the field by two field assistants. According to the Porzana analysis, I hiked less than half as many transects in 2015 (15) compared to each of the two previous years (2013: 35** hikes, 2014: 37 hikes). I actually hiked 20 transects that year — my assistants were entering the data (and getting credit for the hike in fulcrum) while we hiked together at the beginning of the season***. On my solo transects in 2015, I felt sloooooow. I averaged thirty minutes slower than 2013 and 2014 on Cadillac, 50 minutes slower on Pemetic, and 22 minutes slower on Sargent. On top of this, I was covering less ground — in 2013 and 2014 I had monitored phenology in off-trail Northeast Temperate Network plots near my transects, part of an effort to compare trail-side phenology with forested sites that was ultimately cut from my dissertation. In 2015, I stuck to the trails.

I remember feeling pretty terrible at the beginning of most hikes that year. I had one favorite spruce tree on the south ridge of Sargent, and I can picture myself looking up through the needles on more than one occasion from my lie-down-spot while I tried to decide if a bite of granola bar would make me feel more or less nauseous. As I climbed above treeline and into the breeze the fog of morning sickness would lift, and as I hiked downhill, my daughter would do this funny little fetus-roll and kick in a way that I interpreted to be happy.

Hiking while pregnant was hard, but it felt easier than grappling with the looming challenges of becoming a parent. I liked the hard of fieldwork; it was the kind of hard that I felt capable of conquering. I also loved being pregnant in Bar Harbor. It was my fifth field season in Acadia and I had this wonderful community of supportive colleagues and mentors at the park service and in town. I had a favorite yoga class, a favorite milkshake, a favorite iced chai and blueberry muffin spot. I also had two field assistants — my pregnancy fortuitously aligned with NSF funding! — and working with Ella and Natasha that season was great. The person hours per transect figure obscures my field assistants, folding us into each other and masking the time we spent training together on the ridges. It also hides my pregnancy in the averages. I want to recognize those extra 22–50 minutes: they were some of the best worst minutes of my PhD.

The phenology of phenology monitoring

The person hours per transect average doesn't show the sprint finishes of June. I monitored thirty species (the paper highlights the 9 most common taxa) of spring-flowering plants. On the transect hikes, I recorded leaf out and flowering phenology. In April, this was a bit of a scavenger hunt, and I'd pore over thickets of shrub stems for the first sign of bud break, then in May I'd peek into each curled Canada mayflower leaf for flower buds. By early June, my plants had leafed out, and the flowering season was winding down. I knew the trails by heart, and the location of each focal taxon along the ridge was bright in my mental map; each transect became a point-to-point trail run between the last phenological holdouts. Did the rhodora finish flowering on Cadillac? Had the last sheep's laurel buds opened on Pemetic? Were the blueberries beginning to ripen below Sargent's summit?

As I followed the spring phenology, I grew faster, my calf muscles more defined, my appetite more voracious. Acadia’s steep climbs will whip you into shape. I remember in 2013 arriving in the field a month after passing my comps and feeling so sluggish after a winter of studying instead of running. In comparison, I ran hard in the winter of 2013-2014, set a personal best half marathon time in a trail race in March, and just cruised through the early season field work in 2014. Even in 2015, as I grew rounder each week, I also grew more comfortable with the trails. Hiking while pregnant became easier over the season, although I’m happy it ended when it did, because that trend was not sustainable into the third trimester. 

I think about Reviewer 2 and I want to ask: do you mean the person hours per transect in April? Or at the end of June? What kind of mileage were you averaging before the start of the field season? Do you have any old hamstring injuries? Tell me about your field assistants. Do you like to stop for lunch at the summit or are you an on-the-go snacker? Did you pack a couple bucks to buy a Harbor Bar at the Cadillac souvenir shop? Are you saving your energy for the 10k run at the end of the transect? Is the National Park Service well-funded in this year's federal budget? How do you feel about stopping for a swim in Sargent Mountain Pond?

I love these questions because each one pulls on a thread winding through my Acadia memories. I hiked upwards of 125 transects between 2013 and 2016, and now that the paper is done, I’m a little sad to be shelving the fieldnotes for good. The trail runners that I wore are long gone, my field hat fell apart, most of my baggy race tshirts carried me through my second pregnancy and suffered for it.

In the end, the idiosyncrasies of the hikes were smoothed and flattened into the sentence, “Each transect could be completed in under 6 person-hours.” This is both true and wildly circumscribed. Not unlike a well done chapter of a PhD dissertation.

*Acadia National Park actually closed the lot by the Pemetic North Ridge trailhead in 2017 and it's now exclusively a bus stop for the Island Explorer, the free bus that begins running right as my season wraps up at the end of June.

**This doesn't include hikes before I had figured out the fulcrum platform. There was "no" data on those hikes (nothing was leafing or blooming, no signs of budburst) and they only exist in my field notebooks.

***I hired three field assistants for this project and, concurrently, a common garden experiment. In 2014, Paul was my garden guy, but we also hiked two transects together and he hiked two solo. In 2015, Ella, Natasha, and I split the transect and garden work. Ella came back for most of the 2016 season and then I finished the two projects solo in June 2016.

Summer Reading (Part 2)

Last week I wrote about my favorite new papers on mountains and phenology after a summer of scientific reading. In the second half of my top ten list, I’m highlighting some plant mysteries and best practices of 2018. 

“Plant mysteries” is a label that I'm using to lump together three plant papers that I can't stop thinking about. They cover some of my favorite methodological quirks — historical field notes, herbarium digitization, citizen science — and two genera that I think are cool — Sibbaldia and Erythronium. The mysteries range from Is this still here? to Why is this here in two colors? to Can I get this specimen to tell me what else grew here? without much thematic overlap, but all three papers tell gripping stories. If nothing else, they share a strong natural history foundation and well-executed scientific writing that made for lovely hammock-reading.

“Best practices” are just that — descriptions of how we can improve our science as individuals and collectively. We can design better spreadsheets for our data and we can support gender equity in our scientific societies. I strongly recommend that all ecologists read up on both. 

Plant Mysteries

I didn’t particularly notice [trophy collecting/associated taxa/pollen color polymorphism] before, but now I can’t not see it…

1. Sperduto, D.D., Jones, M.T. and Willey, L.L., 2018. Decline of Sibbaldia procumbens (Rosaceae) on Mount Washington, White Mountains, NH, USA. Rhodora, 120 (981), pp.65-75.

I love this deep dive into the history of a snowbank-community alpine plant that occurs in exactly one ravine in New England (though it's globally widespread across Northern Hemisphere arctic-alpine habitats). Over the past four decades, surveys in Tuckerman's Ravine have documented a continuous decline in the abundance of creeping sibbaldia, and recently researchers have been unable to find it at all. This would make creeping sibbaldia the first documented extirpation of an alpine vascular plant in New England. Dr. Daniel Sperduto and coauthors revisit the photographs and notes from contemporary surveys and find that mountain alders are encroaching on the creeping sibbaldia's snowbank habitats. These notes also include anecdotes of local disturbances like turf slumping at the sites where creeping sibbaldia used to be found. In herbaria across New England, Sperduto and coauthors discovered sheets covered with dozens of specimens — this "trophy collection activity" in the 19th century led them to calculate that "there are more than three times as many plants with roots at the seven herbaria examined than the maximum number of plants counted in the field within the last 100 years." I am obviously partial to New England alpine plants, and I got to see Sperduto present this research as a part of an engaging plenary session at the Northeast Alpine Stewardship Gathering in April, so you could write this off as a niche interest. Despite this, I see creeping sibbaldia as a lens for considering the universal mysteries of population decline and extirpation, and the challenges of tying extirpation to concrete cause-and-effect stories.

2. Pearson, K.D., 2018. Rapid enhancement of biodiversity occurrence records using unconventional specimen data. Biodiversity and Conservation, pp.1-12.

Leveraging herbarium data for plant research is so hot right now. But what if you could squeeze even more information from a specimen label? For example, many collectors note "associated taxa" along with the date and location of collection. The associated taxa are plants that were seen nearby, but not collected — a kind of ghostly palimpsest of the community that grew around the chosen specimen. Herbaria across the globe have spent the past decades digitizing specimens and uploading photographs of their pressed plants. In this process, the associated taxa on specimen labels are often stored in a 'habitat' database field. In this impressive single-author paper, Dr. Kaitlin Pearson extracts the associated taxa data from Florida State University's Robert K. Godfrey Herbarium database with elegant code that can recognize abbreviated binomial names and identify misspellings. She then compares the county-level distributions of the associated taxa with their known county-level distributions from floras and herbarium specimens. Incredibly, "the cleaned associated taxon dataset contained 247 new county records for 217 Florida plant species when compared to the Atlas of Florida Plants." There are plenty of caveats: the associated taxa can't be evaluated for misidentification the way a specimen can, and lists of associated taxa are obviously subject to the same spatial biases as herbarium specimens. But this is clearly a clever study with a beautifully simple conclusion: "broadening our knowledge of species distributions and improving data- and specimen-collection practices may be as simple as examining the data we already have."
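Pearson's actual pipeline is more sophisticated, but a toy sketch in R gives the flavor of the matching problem — expanding abbreviated genus names and catching misspellings against an accepted species list. Everything here (the reference vector, the label strings, the match_taxon helper) is invented for illustration, not pulled from the paper:

```r
# Toy illustration of matching messy "associated taxa" label strings against
# an accepted species list. Not Pearson's code — just the general idea.
accepted <- c("Erythronium americanum", "Sibbaldia procumbens",
              "Kalmia angustifolia")

label_taxa <- c("E. americanum", "Sibaldia procumbens", "Kalmia angustifolia")

match_taxon <- function(x, accepted) {
  epithet <- sub("^\\S+\\s+", "", x)      # drop the (possibly abbreviated) genus
  initial <- substr(x, 1, 1)
  hits <- accepted[startsWith(accepted, initial) & endsWith(accepted, epithet)]
  if (length(hits) == 1) return(hits)     # unambiguous initial + epithet match
  accepted[which.min(adist(x, accepted))] # otherwise, closest by edit distance
}

vapply(label_taxa, match_taxon, character(1), accepted = accepted)
```

From matched names, tallying county-level records and comparing them against a published atlas is the straightforward next step — which is essentially the comparison Pearson reports.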

3. Austen, E.J., Lin, S.Y. and Forrest, J.R., 2018. On the ecological significance of pollen color: a case study in American trout lily (Erythronium americanum). Ecology, 99(4), pp.926-937.

Did you read Gelman and Hill's Data Analysis Using Regression and Multilevel/Hierarchical Models in a seminar and think, this seems like an amazing resource but I'm an ecologist and examples about school children watching Sesame Street or election outcomes and incumbency for US congressional election races just don't resonate with me? The ecological and evolutionary mystery of red/yellow pollen polymorphism is super interesting in its own right and Dr. Emily Austen and coauthors thoroughly attack this question. For me — and I've admitted here before that I am the kind of learner who benefits from repetition — Austen's statistical methods are the star. Austen demonstrates glm best practices and brings stunningly clear plant ecology examples to the Gelman and Hill framework. I would probably teach this paper in a field botany course (trout lilies are charismatic! look at this fun map of pollen color polymorphism!), but I would absolutely prefer to assign it in a statistical methods course, especially as a supplement/set of alternative exercises to Gelman and Hill.

Best Practices

Do this…

1. Potvin, D.A., Burdfield-Steel, E., Potvin, J.M. and Heap, S.M., 2018. Diversity begets diversity: A global perspective on gender equality in scientific society leadership. PLoS ONE, 13(5), p.e0197280.

Gender equality in biology dramatically decreases as you look up the ladder in academia — compare the gender breakdown among graduate students to that among tenured professors and the disparity is stark. Leadership in our field is still heavily male skewed. Dr. Dominique Potvin and her coauthors asked, is this true in scientific societies too? Scientific societies are generally more open than academic departments, and there is more transparency in the process of electing governing boards and leadership positions. Potvin and coauthors leveraged these traits to ask: what is the role of scientific societies in rectifying gender inequity? why are some societies better than others at promoting women in leadership? After considering 202 societies in the zoological sciences, they found that the culture of the society — the age of the society, the size of its board, and whether or not it had an outward commitment or statement of equality — was the best predictor of equality in the gender ratio of society boards and leadership positions. This "outward commitment or statement of equality" covered anything published on the society website — a statement, committee, or other form of affirmative action program — that "implies that the society is dedicated to increasing diversity or improving gender equality." Of the 202 societies they studied, only 39 (19.3%) had one of these visible commitments to equality. Whether societies with high proportions of female board members were more likely to draft and publish these statements, or whether societies that invested time and energy in producing such commitments attracted more women to leadership positions, is a bit of a chicken-and-egg riddle. Societies looking to reflect on their own state of gender equality can take advantage of the resource presented in Table 6: "Health checklist for scientific societies aiming for gender equality." Assessing gender equality is kind of a low-hanging fruit — and the authors encourage societies to reflect on intersectionality and race, age, ethnicity, sexuality, religion and income level as well. Basically, if a scientific society is struggling to support white women in 2018, there's an excellent chance it is failing its brown, LGBTQ, and first-generation members to a much greater extent.

2. Broman, K.W. and Woo, K.H., 2018. Data organization in spreadsheets. The American Statistician, 72(1), pp. 2-10.

If I could send a paper in a time machine, I would immediately launch Broman and Woo’s set of principles for spreadsheet data entry and storage back to 2009, when I started my master’s project. Reading through this list of best practices made me realize how many lessons I learned the hard way — how many times have I violated the commandments to “be consistent”, “choose good names for things”, or “do not use font color or highlighting as data”? Way too many! Eventually, I pulled it together and developed a data entry system of spreadsheets that mostly conforms to the rules outlined in this paper. But, if I’d read this first, I would have skipped a lot of heartache and saved a lot of time. This is an invaluable resource for students as they prepare for field seasons and dissertation projects. Thank you Broman and Woo, for putting these simple rules together in one place with intuitive and memorable examples! 
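If you want a picture of what following those rules looks like, here is a tiny invented example in R — consistent names, YYYY-MM-DD dates, one value per cell, a plain rectangle, and a flag column instead of a highlighted cell (the column names and values are mine, not Broman and Woo's):

```r
library(tibble)

# An invented phenology sheet laid out Broman-and-Woo style: one header row,
# consistent snake_case names, ISO dates, a rectangle of values, and no
# meaning hidden in cell colors.
phenology <- tibble(
  plot_id      = c("cad_01", "cad_01", "pem_03"),
  date         = as.Date(c("2015-05-02", "2015-05-09", "2015-05-02")),
  species      = c("Kalmia angustifolia", "Kalmia angustifolia",
                   "Vaccinium angustifolium"),
  open_flowers = c(0L, 12L, 3L),
  # A suspect value gets its own flag column rather than a yellow cell.
  qc_flag      = c("ok", "ok", "recheck")
)

# Save as plain text so the raw data outlives any one spreadsheet program.
write.csv(phenology, "phenology_2015.csv", row.names = FALSE)
```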

Happy Fall Reading! 

The Hidden Gems of Data Accessibility Statements

Sometimes the best part of reading a scientific paper is an unexpected moment of recognition — not in the science, but in the humanity of the scientists. It’s reassuring in a way to find small departures from the staid scientific formula: a note that falls outside of the expected syntax of Abstract-Introduction-Methods-Results-Discussion. As an early career scientist who is very much in the middle of sculpting dissertation chapters into manuscripts, it’s nice to remember that the #365papers I read are the products of authors who, like me, struggled through revisions and goofed off with coauthors and found bleak humor in the dark moments. 

Ecology blogs, twitter, and the wider media also love noting the whimsical titles, funny (and serious) acknowledgements, memorable figures, and unique determinations of co-authorship order that have appeared in the pages of scientific journals.

I enjoy stumbling on these moments of levity in my TO READ file; last spring I procrastinated formatting my dissertation by avidly reading the Acknowledgements sections of papers by anyone I'd even vaguely overlapped with in my PhD program. One place I have not thought to look for serendipitous science humor: the Data Availability Statement. As it turns out, I have been missing an interesting story.

A recent PLOS ONE paper set out to analyze the Data Availability Statements of nearly 50,000 recent PLOS ONE papers. This may sound like a dull topic, but Lisa Federer and coauthors' work is surprisingly engaging, topical, and thought provoking. In March 2014 PLOS unveiled a data policy requiring Research Articles to include a Data Availability Statement providing readers with details on how to access the relevant data for each paper. But, as Federer et al point out “‘availability’ can be interpreted in ways that have vastly different practical outcomes in terms of who can access the data and how.” 

Why do Data Availability Statements matter? In ecology, open data advocates make the case for reproducibility and re-use. So many of us work on small study areas and amass isolated spreadsheets of data, and then publish on our system, maybe throwing a subset of the data we collected into a supplementary file. But big picture questions that look across scales, ecosystems, and approaches rely on big data — and big data is often an amalgam of many small datasets from a wide array of scientists. Small (or any size) datasets that are publicly available, and easy to access in data repositories instead of old lab notebooks or defunct lab computers, are much more likely to have legs, to get re-used and re-tested, and contribute to the field at large.

While PLOS was on the vanguard of Data Availability Statements among peer-reviewed journals, Federer's review of the contents of these statements makes it clear that we are not yet in the shiny future of Open Data. PLOS' Data Availability policy "strongly recommends" that data be deposited in a public repository; Federer found that only 18.2% of PLOS papers named a specific repository or source where data were available. Most Data Availability Statements direct the reader to the paper itself or supplementary information. Even among the data repository articles, some Data Availability Statements indicated a repository but failed to include a URL, DOI, or accession number — basically sending readers on a wild goose chase to locate the data within the repository.

Other statements seem to have been entered as placeholders, potentially intended to be replaced upon publication of the article, such as “All raw data are available from the XXX [sic] database (accession number(s) XXX, XXX [sic])” or “The data and the full set of experimental instructions from this study can be found at <repository name>. [This link will be made publically [sic] accessible upon publication of this article.]” These two articles, published in 2016 and 2015, respectively, still contain this placeholder text as of this writing.

These examples of placeholders that made it into publication are embarrassing, but human, and as Federer points out, Data Availability Statements should be reviewed by editors and peer reviewers with the same scrutiny that we apply to study design, statistical analyses, and citations. I have worked on meta-analyses and projects that depend on data from existing digital archives. The frustration of chasing down supplementary information, Dryad DOIs, and GitHub addresses only to find a dead end or a broken corresponding author email address is a feeling akin to discovering squirrels chewing through temperature logger wires halfway through the field season. Federer notes that the tide is turning towards open data: after a rocky start in 2014 — Federer's team parsed many papers likely submitted before (but published after) the Data Availability policy went into effect — 2015 and 2016 saw the percent of papers that lacked a Data Availability Statement drop dramatically. Over the same time period, Federer notes slight increases in the number of statements referring to data in a repository and fewer that claim the data is in the paper or — shudder — available upon request.

At a broader level, open data is a newly politicized topic. The EPA recently proposed new standards that would bar scientific studies from informing regulation unless all of the raw data were publicly available and the results could be reproduced. This is not so much a gold standard as a gag rule.

In a PLOS editorial, John P. A. Ioannidis points out that while “making scientific data, methods, protocols, software, and scripts widely available is an exciting, worthy aspiration” in eliminating all but so-called perfect science from the regulatory process, the EPA is committing to making decisions that “depend uniquely on opinion and whim.” Most of the raw data from past studies are not publicly available — and as Federer’s research shows, even in an age of required Data Availability Statements, open data is still a work in progress. And so we beat on — scientists against anti-science Environmental Protection Agency administrators, borne back ceaselessly in support of publishing accessible, open data as a kind of green light to past research. 

References:

Federer LM, Belter CW, Joubert DJ, Livinski A, Lu Y-L, Snyders LN, et al. (2018) Data sharing in PLOS ONE: An analysis of Data Availability Statements. PLoS ONE 13(5): e0194768. https://doi.org/10.1371/journal.pone.0194768

Ioannidis JPA (2018) All science should inform policy and regulation. PLoS Med 15(5): e1002576. https://doi.org/10.1371/journal.pmed.1002576