Webscraping in R

MIT Political Methodology Lab (PML) Workshop

Eyal Hanfling & Raymond Wang

February 16, 2024

Introduction

Today we will cover:

  • What is scraping

  • What are you scraping (basics of HTML)

  • What to do with the stuff you scrape (Regex, extraction)

  • Worked Examples: (1) International Security, (2) Asian Infrastructure Investment Bank, (3) archived Pakistani press releases

What is scraping

  • Scraping is the automated collection of unstructured data on the web.

  • Scraping public websites, even against the terms of service, is probably legal. See hiQ Labs, Inc. v. LinkedIn Corp (2019).

  • But circumventing technical restrictions (passwords, captchas) is probably illegal.

  • Something different from scraping: calling an API (like the Twitter API or City of Cambridge API) to download data

What to do with Scraped Data

  • extract objects using Regular Expressions (Regex)
  • Use Optical Character Recognition (OCR) to extract text from a PDF with tools like Tesseract or daiR

Before we start:

Helps to keep the stringr and rvest references open while coding!

stringr.tidyverse.org

rvest.tidyverse.org

Intro to HTML and CSS

  • HTML stands for “HyperText Markup Language”
    • Content is enclosed in tags. Ex: <title>Page Title</title>
    • Tags can have attributes. Ex: <h1 id='first'>A heading</h1>
      • h1-h6 for headings
      • p for paragraph
      • a for links
  • CSS is a language that styles elements of a page
    • Select <p> tags, color red, font size to 12px:
    • p {color: red; font-size: 12px;}
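To make the tag, attribute, and selector vocabulary concrete, here is a minimal sketch that parses a made-up HTML snippet with rvest and selects elements by tag, id, and class. The snippet and its id, class, and URL values are invented for illustration and do not come from any real page.

```r
library(rvest)

# A made-up page: the id, class, and href values are illustrative only
page <- minimal_html('
  <h1 id="first">A heading</h1>
  <p class="intro">First paragraph</p>
  <p>Second paragraph with a <a href="https://example.com">link</a></p>
')

# Select by tag name: every <p> element
p_text <- page %>% html_elements("p") %>% html_text()

# Select by id (#) and by class (.)
h1_text    <- page %>% html_element("#first") %>% html_text()
intro_text <- page %>% html_element(".intro") %>% html_text()

# Read an attribute value rather than the text
link <- page %>% html_element("a") %>% html_attr("href")
```

The same selector syntax that CSS uses for styling (p, #first, .intro) is what rvest uses for extraction.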

Look at the source of polisci.mit.edu

<div class="text">
  <div class="namearea">
        <h2><a href="/people/grad-student">Grad Student</a></h2>
      </div><!-- divnamearea -->

                    <div class="phonearea">
          <p class="phone">
                      <span class="email"><a href="mailto:gradstudent@mit.edu">gradstudent@mit.edu</a></span>
                                        </p>

Intro to rvest

  • Start with read_html()
    • Put in a link, output is an xml document
  • Use html_element() or html_elements() to extract an element
    • Use a CSS selector to define the element
    • Ex: html_element("h1")
  • Use html_text() to extract the text of an element
  • Use html_attr() to extract the value of an attribute
  • NEW: read_html_live() to interact with a live web page

Mini Example: scraping email addresses

<div class="text">
  <div class="namearea">
        <h2><a href="/people/grad-student">Grad Student</a></h2>
      </div><!-- divnamearea -->

                    <div class="phonearea">
          <p class="phone">
                      <span class="email"><a href="mailto:gradstudent@mit.edu">gradstudent@mit.edu</a></span>
                                        </p>
library(rvest)
#Read the HTML file
read_html("https://polisci.mit.edu/people/graduate-students") %>%
  #Extract HTML element <span>
  html_elements('span') %>% 
  #Extract HTML element <a>
  html_elements('a') %>% 
  #Extract HTML attribute href
  html_attr("href")
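The pipeline above returns href values that still carry the mailto: prefix. A minimal offline sketch, assuming the page structure matches the snippet shown above: it parses that snippet from a string, narrows the selector to links inside span elements with class "email", and strips the prefix with stringr.

```r
library(rvest)
library(stringr)

# The snippet shown above, parsed from a string so no network access is needed
snippet <- minimal_html('
  <div class="text">
    <div class="namearea">
      <h2><a href="/people/grad-student">Grad Student</a></h2>
    </div>
    <div class="phonearea">
      <p class="phone">
        <span class="email"><a href="mailto:gradstudent@mit.edu">gradstudent@mit.edu</a></span>
      </p>
    </div>
  </div>
')

emails <- snippet %>%
  html_elements("span.email a") %>%   # only links inside <span class="email">
  html_attr("href") %>%               # the href attribute, e.g. "mailto:..."
  str_remove("^mailto:")              # drop the leading "mailto:" scheme
emails
```

Using the combined selector "span.email a" avoids picking up links that sit inside unrelated span elements elsewhere on the page.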

Intro to Regex

  • Regex stands for Regular Expression

  • Tool to manipulate text data, called strings

  • Main tools are the stringr package and the grep family from base R

#loading all packages needed
library(tidyverse)
library(rvest)
library(stringr)
library(lubridate)
library(glue)
library(httr)

text <- c("Boston is so cold", "It is 25 degrees in Hong Kong")
#grep returns index 
grep('cold', text)
[1] 1
#grepl returns logical vector of same length
grepl("cold", text)
[1]  TRUE FALSE
#str_subset returns all matches in full
str_subset(text, 'cold')
[1] "Boston is so cold"
#which is equivalent to:
text %>% .[grep('cold',.)]
[1] "Boston is so cold"
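For reference, a short sketch of the stringr counterparts to the base R calls above. Note the argument order: stringr functions take the string first and the pattern second, the reverse of grep() and grepl().

```r
library(stringr)

text <- c("Boston is so cold", "It is 25 degrees in Hong Kong")

idx     <- str_which(text, "cold")    # like grep(): positions of matching elements
flags   <- str_detect(text, "cold")   # like grepl(): one TRUE/FALSE per element
matches <- text[flags]                # same strings as str_subset(text, "cold")
```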

Intro to Regex

Regex is composed of 3 components

  • literal characters: each one matches exactly itself, e.g. a, b, c

  • character classes: match any one character from a set. Use brackets to denote: [azx], [A-Z]

  • modifiers (quantifiers): operate on the preceding literal character or character class. Denoted by special characters, such as +, *, ?

  • We combine these components to match patterns in strings and manipulate them

  • Regex checker!

text <- c("Boston is so cold", "It is 25 degrees in Hong Kong",
          "02139")
text %>% str_subset('\\d{2}')
[1] "It is 25 degrees in Hong Kong" "02139"                        
#spaces matter!
text %>% str_subset('\\d{2} ')
[1] "It is 25 degrees in Hong Kong"
text %>% str_subset('^\\d')
[1] "02139"
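A small sketch that exercises each component in turn on the same vector: a literal, a character class, a quantifier applied to a class, and anchors. The variable names are illustrative.

```r
library(stringr)

text <- c("Boston is so cold", "It is 25 degrees in Hong Kong", "02139")

# Literal characters: match exactly "Hong"
has_hong <- str_detect(text, "Hong")

# Character class: the first uppercase letter in each string, if any
first_upper <- str_extract(text, "[A-Z]")

# Quantifier on a class: the first run of one or more digits
digits <- str_extract(text, "\\d+")

# Anchors plus a counted quantifier: the whole string is exactly five digits
is_zip <- str_detect(text, "^\\d{5}$")
```

str_extract() returns NA for elements with no match, which makes it easy to see which strings the pattern missed.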

Other Resources

Typical Workflow

  1. Get URLs needed to get target data
  2. Examine underlying html structure using developer tools in browser, with the help of SelectorGadget
  3. Create function that extracts relevant information from the page
  4. Test function on one/a few links to check that it generalizes
  5. Iterate over all URLs
  6. Clean data
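The steps above can be sketched as a skeleton. This is an illustration under stated assumptions, not the workshop's own code: extract_page() is a hypothetical helper, and the pages are tiny made-up documents that stand in for what read_html() would return for each URL.

```r
library(rvest)

# Step 3: one function that turns a parsed page into a one-row data frame
# (extract_page is a hypothetical helper name)
extract_page <- function(page) {
  data.frame(
    title = page %>% html_element("h1") %>% html_text(),
    body  = page %>% html_element("p") %>% html_text(),
    stringsAsFactors = FALSE
  )
}

# Stand-ins for the pages behind each URL (steps 1-2 would supply real ones)
pages <- list(
  minimal_html("<h1>Page one</h1><p>First body</p>"),
  minimal_html("<h1>Page two</h1><p>Second body</p>")
)

# Step 4: test on a single page first
extract_page(pages[[1]])

# Step 5: iterate; tryCatch keeps one bad page from stopping the whole run
rows <- lapply(pages, function(p) {
  # Sys.sleep(1)  # when hitting a real site, pause between requests
  tryCatch(extract_page(p), error = function(e) NULL)
})

# Step 6: drop failed pages (NULL) and combine into one data frame
result <- do.call(rbind, Filter(Negate(is.null), rows))
```

Keeping extraction in its own function makes step 4 cheap: the function can be checked on one page before any loop is written.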

Scraping International Security

  • Goal: Scrape title, abstract, author, and date of all IS articles published since 2022
  • What are the first steps?
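Following the typical workflow, the first step is collecting the article URLs. A hedged sketch of how that could look: the table-of-contents markup, the class name article-link, and the base URL below are all invented placeholders, so the real selectors have to be found by inspecting the journal's issue pages in the browser.

```r
library(rvest)

# Invented stand-in for an issue's table of contents; the real page's
# markup and class names will differ and must be checked in the browser
toc <- minimal_html('
  <div class="toc-item"><a class="article-link" href="/isec/article/1">Article one</a></div>
  <div class="toc-item"><a class="article-link" href="/isec/article/2">Article two</a></div>
')

base_url <- "https://example.org/"   # placeholder, not the journal's address

articles <- toc %>%
  html_elements("a.article-link") %>%   # hypothetical class name
  html_attr("href") %>%                 # relative paths such as "/isec/article/1"
  xml2::url_absolute(base_url)          # turn relative paths into full URLs
```

The result is a character vector of full URLs, one per article, which is the shape the later steps expect from articles.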

Step 2: Getting elements of interest

OK, now we have the links to each article. Let’s inspect what an article page looks like.

Step 2: Getting elements of interest

How to get the abstract?

  • Need to identify what distinguishes the abstract chunk from other paragraph chunks

Step 2: Getting elements of interest

Using regex, extract the abstract from the html of the article. Hint: The workflow is

  1. Get the html of the article page using read_html
  2. Use what you learnt about extraction to pull out the section elements using html_elements
  3. Use regex to extract the relevant chunk from the vector
  4. Convert to plaintext using html_text
html_url <- articles[1]
html_article <- read_html(html_url) 
ab <- html_article %>% html_elements(.,css = 'section') %>% 
  .[grep('^<section class="abstract"><p>',.)] %>% html_text()
ab
[1] "Atonement is a state practice that comprises an official political apology and the offer of reparation payments to former victims of mass atrocities, war crimes, and human rights abuses. Despite being considered the moral and right thing to do, atonement has occurred only once at the state level: between West Germany and Israel in 1952. Whereas existing explanations view the West German pathway after the Holocaust as either an ethical choice or a domestic policy induced by U.S. pressure, atonement can also be a political decision. Politicians may give official apologies and pay reparations because such practices promise tangible political benefits. An investigation of the West German–Israeli case and a comparison with two non-atoning perpetrators of World War II, Austria and Japan, illustrate the plausibility of these claims. Atonement emerged as a bilateral strategy between West Germany and Israel because it represented a politically expedient option for both countries. This finding offers insights into when politicians may pursue atonement in other cases and points to a potential avenue toward long-term international stability and durable peace."
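An alternative worth knowing: a CSS class selector can often pick out the same node without a regex on the serialized HTML. The sketch below uses an invented stand-in page; whether "section.abstract" selects exactly the same node on the real article page is an assumption that should be checked against the grep version.

```r
library(rvest)

# Invented stand-in for an article page
html_article <- minimal_html('
  <section class="intro"><p>Not the abstract.</p></section>
  <section class="abstract"><p>This is the abstract.</p></section>
')

# Select the <section> whose class is "abstract" directly
ab <- html_article %>% html_elements("section.abstract") %>% html_text()
ab
```

The two approaches differ slightly: the grep pattern requires the element to begin with exactly <section class="abstract"><p>, while the selector also matches a section that carries additional classes or starts with a different child tag.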

Step 2: Getting elements of interest

Ditto for title, authors, and date.

#title, using xpath to demonstrate importance of using ' vs "
title <- html_article %>% html_elements(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "article-title-main", " " ))]') %>% 
  html_text()

#author
author <- html_article %>% html_elements(css = '.stats-author-info-trigger') %>% 
  html_text()
#collapsing 
author <- paste(author, collapse = ", ")

author
[1] "Kathrin Bachleitner"
#date 
date <- html_article %>% html_elements(css = '.article-date') %>% 
  html_text %>% lubridate::mdy(.)
date
[1] "2023-01-04"

WAIT! Our title is a bit ugly

title
[1] "\r\n                    The Path to Atonement: West Germany and Israel after the Holocaust\r\n                "

Clean it up using stringr package

title %>% str_replace_all(., "[\r\n]" , "") %>% str_trim
[1] "The Path to Atonement: West Germany and Israel after the Holocaust"
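As a shorter variant, stringr's str_squish() trims both ends and collapses any internal runs of whitespace (including \r and \n) to single spaces in one call. A quick sketch on the same string:

```r
library(stringr)

title <- "\r\n                    The Path to Atonement: West Germany and Israel after the Holocaust\r\n                "

# Trim leading/trailing whitespace and collapse internal whitespace runs
clean <- str_squish(title)
clean
```

Unlike the str_replace_all() version, this also turns a line break in the middle of a title into a single space rather than deleting it outright.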

Step 2: Getting elements of interest

#doing all
out_list <- lapply(seq_along(articles), function(x){
  # Sys.sleep(30) # for longer tasks, pause between requests with Sys.sleep()

  html_article <- read_html(articles[x])
   
  #abstract
  abstract <- html_article %>% html_elements(css = 'section') %>% .[grep('^<section class="abstract"><p>',.)] %>% html_text()
  
  #title
  title <- html_article %>% html_elements(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "article-title-main", " " ))]') %>%
  html_text() %>% str_replace_all(., "[\r\n]" , "") %>% str_trim()
  
  #author
  author <- html_article %>% html_elements(css = '.stats-author-info-trigger') %>% 
    html_text()
  #collapsing 
  author <- paste(author, collapse = ", ")
  
  #date 
  date <- html_article %>% html_elements(css = '.article-date') %>% html_text %>% lubridate::mdy(.)
  
  list(t = title, au = author, d = date, ab = abstract)
})

df <- data.table::rbindlist(out_list, fill = TRUE)
# write.xlsx(df, file = 'int_sec_art.xlsx')
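Some pages (editors' notes, summaries) have no abstract, so the selector returns an empty vector for them. One way to make that case explicit is a small helper that converts "nothing matched" into NA before the row is built. This is a sketch: first_or_na is a hypothetical helper, and the two records are made up.

```r
# Return the first element of x, or NA when nothing was matched
# (first_or_na is a hypothetical helper name)
first_or_na <- function(x) if (length(x) == 0) NA_character_ else x[[1]]

# Two made-up records: one with an abstract, one without
out_list <- list(
  list(t = "Article with abstract", ab = first_or_na("Some abstract text")),
  list(t = "Editors' Note",         ab = first_or_na(character(0)))
)

# Every record now has length-one fields, so the rows bind cleanly
df <- do.call(rbind, lapply(out_list, as.data.frame, stringsAsFactors = FALSE))
df
```

With every field guaranteed to have length one, the combined table does not depend on how the binding function treats empty columns.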

Step 3: Voilà!

Columns: t (title), au (authors), d (date), ab (abstract)
The Psychology of Nuclear Brinkmanship Reid B. C. Pauly, Rose McDermott 2023-01-01 Conventional wisdom sees nuclear brinkmanship and Thomas Schelling's pathbreaking “threat that leaves something to chance” as a solution to the problem of agency in coercion. If leaders cannot credibly threaten to start a nuclear war, perhaps they can at least introduce uncertainty by signaling that the decision is out of their hands. It is not so easy to remove humans from crisis decision-making, however. Often in cases of nuclear brinkmanship, a human being retains a choice about whether to escalate. When two sides engage in rational decision-making, the chance of strategic nuclear exchange should be zero. Scholars have explained how risks associated with accidents, false warnings, and pre-delegation creep into nuclear crises. An investigation of how chance can still produce leverage while leaders retain a choice over whether and when to escalate adds to this scholarship. There remains an element of choice in chance. For a complete understanding of nuclear brinkmanship, psychology and emotion must be added to the analysis to explain how leaders make decisions under pressure. Human emotions can introduce chance into bargaining in ways that contradict the expectations of the rational cost-benefit assumptions that undergird deterrence theory. Three mechanisms of nuclear brinkmanship—accidents, self-control, and control of others—illustrate how a loss of control over the use of nuclear weapons is not a necessary element of the threat that leaves something to chance. Choice does not have to be eliminated for a risk of catastrophic destruction to remain.
Social Cohesion and Community Displacement in Armed Conflict Daniel Arnon, Richard J. McAlexander, Michael A. Rubin 2023-01-01 What are the origins of conflict-related population displacement? Why do some communities in conflict zones suffer mass casualties while others evade conflict violence? Whether civilians migrate before or after belligerent operations in their vicinity influences the scale of casualties and population displacement in war. “Preemptive evacuation” is a specific manifestation of forced displacement, in which whole communities leave their homes before belligerents attempt to seize control in their local area. In conflicts involving strategic civilian-targeted violence, social cohesion, by promoting collective action, enhances communities’ capabilities to mobilize collective migration, thereby increasing the likelihood of preemptive evacuation. An investigation of the 1948 Arab-Israeli War probes the plausibility of the theory. Detailed information about Arab Palestinian villages in the previously restricted Village Files is used to construct a village-level dataset, which measures social cohesion and other social, political, and economic characteristics. These documents and data provide crucial sources of evidence to researchers investigating Palestinian society and development, the origins of Israel's statehood, and the Israeli-Palestinian conflict. Findings suggest that areas where communities lack social cohesion may suffer higher casualties from targeted violence, signaling a need for urgent diplomatic and humanitarian prevention or mitigation efforts.
Push and Pull on the Periphery: Inadvertent Expansion in World Politics Nicholas D. Anderson 2023-01-01 Why do great powers engage in territorial expansion? Much of the existing literature views expansion as a largely intentional activity directed by the leaders of powerful states. Yet nearly 25 percent of important historical instances of great power expansion are initiated by actors on the periphery of the state or empire without authorization from their superiors at the center. Periphery-driven “inadvertent expansion” is most likely to occur when leaders in the capital have limited control over their agents on the periphery. Through their actions, peripheral agents effectively constrain leaders from withdrawing from these newly captured territories because of sunk costs, domestic political pressure, and national honor. When leaders in the capital expect geopolitical consequences from regional or other great powers, such as economic sanctions, militarized crises, or war, they are far less likely to authorize the territorial claims. A mixed-methods research strategy combines new quantitative data on great power territorial expansion with three qualitative case studies of successful (and failed) inadvertent expansion by Russia, Japan, and France. Inadvertent expansion has not completely gone away, particularly among smaller states, where government authority can be weak, control over states’ apparatuses can be loose, and civil-military relations can be challenging.
Correspondence: Debating China's Use of Overseas Ports David C. Logan, Robert C. Watts, IV, Isaac B. Kardon, Wendy Leutert 2023-01-01 NA
Editors' Note 2023-01-01 NA
Summaries 2023-01-01 NA
The Cult of the Persuasive: Why U.S. Security Assistance Fails Rachel Tecott Metz 2023-01-01 Security assistance is a pillar of U.S. foreign policy and a ubiquitous feature of international relations. The record, however, is mixed at best. Security assistance is hard because recipient leaders are often motivated to implement policies that keep their militaries weak. The central challenge of security assistance, then, is influence. How does the United States aim to influence recipient leaders to improve their militaries, and what drives its approach? Influence in security assistance can be understood as an escalation ladder with four rungs: teaching, persuasion, conditionality, and direct command. Washington increasingly delegates security assistance to the Department of Defense, and the latter to the U.S. Army. U.S. Army advisers tend to rely exclusively on teaching and persuasion, even when recipient leaders routinely ignore their advice. The U.S. Army's preference for persuasion and aversion to conditionality in security assistance can be traced to its bureaucratic interests and to the ideology that it has developed—the cult of the persuasive—to advance those interests. A case study examines the bureaucratic drivers of the U.S. Army's persistent reliance on persuasion to influence Iraqi leaders to reform and strengthen the Iraqi Army. Qualitative analysis leverages over one hundred original interviews, as well as oral histories and recently declassified U.S. Central Command documents. The findings illustrate how the interests and ideologies of the military services tasked with implementing U.S. foreign policy can instead undermine it.
Summaries 2022-10-01 NA
China's Party-State Capitalism and International Backlash: From Interdependence to Insecurity Margaret M. Pearson, Meg Rithmire, Kellee S. Tsai 2022-10-01 Contrary to expectations, economic interdependence has not tempered security conflict between China and the United States. In response to perceived domestic and external threats, the Chinese Communist Party's actions to ensure regime security have generated insecurity in other states, causing them to adopt measures to constrain Chinese firms. Security dilemma dynamics best explain the subsequent reactions from many advanced industrialized countries to the evolution of China's political economy into party-state capitalism. Party-state capitalism manifests in two signature ways: (1) expansion of party-state authority in firms through changes in corporate governance and state-led financial instruments; and (2) enforcement of political fealty among various economic actors. Together, these trends have blurred the distinction between state and private capital in China and resulted in backlash, including intensified investment reviews, campaigns to exclude Chinese firms from strategic sectors, and the creation of novel domestic and international institutions to address perceived threats from Chinese actors. The uniqueness of China's model has prompted significant reorganization of the rules governing capitalism, both nationally and globally.
How Much Risk Should the United States Run in the South China Sea? M. Taylor Fravel, Charles L. Glaser 2022-10-01 How strenuously, and at what risk, should the United States resist China's efforts to dominate the South China Sea? An identification of three options along a continuum—from increased resistance to China's assertive policies on one end to a partial South China Sea retrenchment on the other, with current U.S. policy in the middle—captures the choices facing the United States. An analysis of China's claims and behavior in the South China Sea and of the threat that China poses to U.S. interests concludes that the United States' best option is to maintain its current level of resistance to China's efforts to dominate the South China Sea. China has been cautious in pursuing its goals, which makes the risks of current policy acceptable. Because U.S. security interests are quite limited, a significantly firmer policy, which would generate an increased risk of a high-intensity war with China, is unwarranted. If future China's actions indicate its determination has significantly increased, the United State should, reluctantly, end its military resistance to Chinese pursuit of peacetime control of the South China Sea and adopt a policy of partial South China Sea retrenchment.
Dangerous Changes: When Military Innovation Harms Combat Effectiveness Kendrick Kuo 2022-10-01 Prevailing wisdom suggests that innovation dramatically enhances the effectiveness of a state's armed forces. But self-defeating innovation is more likely to occur when a military service's growing security commitments outstrip shrinking resources. This wide commitment-resource gap pressures the service to make desperate gambles on new capabilities to meet overly ambitious goals while cannibalizing traditional capabilities before beliefs about the effectiveness of new ones are justified. Doing so increases the chances that when wartime comes, the service will discover that the new capability cannot alone accomplish assigned missions, and that neglecting traditional capabilities produces vulnerabilities that the enemy can exploit. To probe this argument's causal logic, a case study examines British armor innovation in the interwar period and its impact on the British Army's poor performance in the North African campaign during World War II. The findings suggest that placing big bets on new capabilities comes with significant risks because what is lost in an innovation process may be as important as what is created. The perils of innovation deserve attention, not just its promises.
Small Satellites, Big Data: Uncovering the Invisible in Maritime Security Saadia M. Pekkanen, Setsuko Aoki, John Mittleman 2022-10-01 Data from small satellites are rapidly converging with high-speed, high-volume computational analytics. “Small satellites, big data” (SSBD) changes the ability of decision-makers to persistently see and address an array of international security challenges. An analysis of these technologies shows how they can support decisions to protect or advance national and commercial interests by detecting, attributing, and classifying harmful, hostile, or unlawful maritime activities. How might the military, law enforcement, and intelligence communities respond to maritime threats if these new technologies eliminate anonymity at sea? The emerging evidence presented on maritime activities is intertwined with national security (e.g., territorial and resource claims, sanctions violations, and terrorist attacks), legal and illicit businesses (e.g., illegal fishing, trafficking, and piracy), and other concerns (e.g., shipping and transit, chokepoints, and environmental damage). The ability of SSBD technologies to observe and catch wrongdoing is important for governments as well as the commercial, academic, and nongovernmental sectors that have vested interests in maritime security, sustainable oceans, and the rule of law at sea. But findings indicate that transparency alone is unlikely to deter misconduct or change the behavior of powerful states.
Nowhere to Hide? Global Policing and the Politics of Extradition Daniel Krcmaric 2022-10-01 Global policing efforts go far beyond combatting terrorism. The United States has tracked down war criminals in the former Yugoslavia, prosecuted Mexican drug kingpins in U.S. courts, transferred a Congolese warlord to the International Criminal Court, and even invaded foreign countries to apprehend wanted suspects. Likewise, Chinese police and intelligence forces crisscross the globe engaging in surveillance, abductions, and forced repatriations. But global policing activities are hard to study because they tend to occur “in the shadows.” Extradition treaties—agreements that facilitate the formal surrender of wanted fugitives from one country to another—represent a unique part of the global policing architecture that is directly observable. An original dataset of every extradition treaty that the United States has signed since its independence shows that extradition cooperation is not an automatic response to the globalization of crime. Instead, it is an extension of geopolitical competition. Geopolitical concerns are crucial because many states try to weaponize extradition treaties to target their political opponents living abroad, not just common criminals. Future research should reconceptualize the role of individuals in international security because many governments believe that a single person—whether a dissident, a rebel, or a terrorist—can imperil their national security.
Summaries 2022-07-01 NA
Then What? Assessing the Military Implications of Chinese Control of Taiwan Brendan Rittenhouse Green, Caitlin Talmadge 2022-07-01 The military implications of Chinese control of Taiwan are understudied. Chinese control of Taiwan would likely improve the military balance in China's favor because of reunification's positive impact on Chinese submarine warfare and ocean surveillance capabilities. Basing Chinese submarine warfare assets on Taiwan would increase the vulnerability of U.S. surface forces to attack during a crisis, reduce the attrition rate of Chinese submarines during a war, and likely increase the number of submarine attack opportunities against U.S. surface combatants. Furthermore, placing hydrophone arrays off Taiwan's coasts for ocean surveillance would forge a critical missing link in China's kill chain for long-range attacks. This outcome could push the United States toward anti-satellite warfare that it might otherwise avoid, or it could force the U.S. Navy into narrower parts of the Philippine Sea. Finally, over the long term, if China were to develop a large fleet of truly quiet nuclear attack submarines and ballistic missile submarines, basing them on Taiwan would provide it with additional advantages. Specifically, such basing would enable China to both threaten Northeast Asian sea lanes of communication and strengthen its sea-based nuclear deterrent in ways that it is otherwise unlikely to be able to do. These findings have important implications for U.S. operational planning, policy, and grand strategy.
Noncombat Participation in Rebellion: A Gendered Typology Meredith Loken 2022-07-01 Research on women's participation in rebel organizations often focuses on “frontline” fighters. But there is a dearth of scholarship about noncombat roles in rebel groups. This is surprising because scholarship on gender and rebellion suggests that women's involvement in rebel governance, publicity, and mobilization can have positive effects on civilian support for and participation in rebel organizations cross-nationally. Further, women often make up the critical infrastructure that maintains rebellion. A new conceptual typology of participation in rebellion identifies four dimensions along which individuals are involved in noncombat labor: logistics, outreach, governance, and community management. These duties are gendered in ways that make women's experiences and opportunities unique and, often, uniquely advantageous for rebel organizations. Women take on complex roles within rebellion, including myriad tasks and duties that rebels perform in conjunction with or in lieu of combat labor. An in-depth analysis of women's noncombat participation in the Provisional Irish Republican Army in Northern Ireland demonstrates this typology's purpose and promise. Attention to noncombat labor enables a more comprehensive analysis of rebel groups and of civil wars. Studying these activities through this framework expands our understanding of rebellion as a system of actors and behaviors that extends beyond fighting. Future scholarship may use this typology to explain variation in types of women's participation or the outcomes that they produce.
Reviewers for Volume 46 2022-07-01 NA
Narratives and War: Explaining the Length and End of U.S. Military Operations in Afghanistan C. William Walldorf, Jr. 2022-07-01 Why did the U.S. war in Afghanistan last so long, and why did it end? In contrast to conventional arguments about partisanship, geopolitics, and elite pressures, a new theory of war duration suggests that strategic narratives best answer these questions. The severity and frequency of attacks by al-Qaeda and the Islamic State across most of the 2000s and 2010s generated and sustained a robust collective narrative across the United States focused on combatting terrorism abroad. Audience costs of inaction generated by this narrative pushed President Barack Obama (2009) and President Donald Trump (2017) to not only sustain but increase troops in Afghanistan, against their better judgement. Strategic narratives also explain the end to the war. The defeat of the ISIS caliphate and a significant reduction in the number of attacks on liberal democratic states in the late 2010s caused the severity and frequency of traumatic events to fall below the threshold necessary to sustain a robust anti-terrorism narrative. As the narrative weakened, advocates for war in Afghanistan lost political salience, while those pressing retrenchment gained leverage over policy. Audience costs for inaction declined and President Joe Biden ended the war (2021). As President Biden seeks to rebalance U.S. commitments for an era of new strategic challenges, an active offshore counterterrorism program will be necessary to maintain this balance.
Strategic Substitution: China's Search for Coercive Leverage in the Information Age Fiona S. Cunningham 2022-07-01 China's approach to gaining coercive leverage in the limited wars that it has planned to fight against nuclear-armed adversaries differs from the choices of other states. A theory of strategic substitution explains why China relied on threats to use information-age weapons strategically instead of nuclear threats or conventional victories in the post–Cold War era. Information-age weapons (counterspace weapons, large-scale cyberattacks, and precision conventional missiles) promise to provide quick and credible coercive leverage if they are configured to threaten escalation of a conventional conflict using a “brinkmanship” or “calibrated escalation” force posture. China pursued information-age weapons when it faced a leverage deficit, defined as a situation in which a state's capabilities are ill-suited for the type of war and adversary that it is most likely to fight. China's search for coercive leverage to address those defi- cits became a search for substitutes because its leaders doubted the credibility of nuclear threats and were unable to quickly redress a disadvantage in the conventional military balance of power. A review of original Chinese-language written sources and expert interviews shows that China pursued a coercive cyberattack capability to address a leverage deficit after the United States bombed China's embassy in Belgrade in 1999. China's low dependence on information networks shaped its initial choice of a brinkmanship posture for large-scale offensive cyber operations. China switched to a calibrated escalation posture in 2014, following a dramatic increase in its vulnerability to cyberattacks.
Editors' Note 2022-04-01 NA
Why Drones Have Not Revolutionized War: The Enduring Hider-Finder Competition in Air Warfare Antonio Calcara, Andrea Gilli, Mauro Gilli, Raffaele Marchetti, Ivan Zaccagnini 2022-04-01 According to the accepted wisdom in security studies, unmanned aerial vehicles, also known as drones, have revolutionizing effects on war and world politics. Drones allegedly tilt the military balance in favor of the offense, reduce existing asymmetries in military power between major and minor actors, and eliminate close combat from modern battlefields. A new theory about the hider-finder competition between air penetration and air defense shows that drones are vulnerable to air defenses and electronic warfare systems, and that they require support from other force structure assets to be effective. This competition imposes high costs on those who fail to master the set of tactics, techniques, procedures, technologies, and capabilities necessary to limit exposure to enemy fire and to detect enemy targets. Three conflicts that featured extensive employment of drones—the Western Libya military campaign of the second Libyan civil war (2019–2020), the Syrian civil war (2011–2021), and the Armenia-Azerbaijan conflict over Nagorno-Karabakh (2020)—probe the mechanisms of the theory. Drones do not by themselves produce the revolutionary effects that many have attributed to them.
Summaries 2022-04-01 NA
Decline and Disintegration: National Status Loss and Domestic Conflict in Post-Disaster Spain Steven Ward 2022-04-01 Decline has long been a central concern of international relations scholarship, but analysts have only recently begun to investigate whether a change in international status influences a state's domestic politics. A new theoretical framework for understanding the domestic political consequences of relative national decline posits that eroding national status activates two sets of social psychological dynamics that contribute to domestic conflict inside declining states. First, eroding state status prompts some groups to strengthen their commitment to the state's status and dominant national identity, at the same time as it prompts other groups to disidentify from the state. Second, eroding status produces incentives for substate actors to derogate and scapegoat one another. These dynamics are particularly likely to contribute to center-periphery conflict in multinational states after instances of acute status loss. The plausibility of the argument is demonstrated by showing how the erosion of Spain's status (especially because of military failure in the 1898 Spanish-American War and the subsequent loss of its last colonies in the Americas) intensified domestic conflict in Spain during the first decades of the twentieth century. Findings indicate that decline may actually exacerbate domestic conflict, making it more difficult for states to adopt appropriate reforms.
Pier Competitor: China's Power Position in Global Ports Isaac B. Kardon, Wendy Leutert 2022-04-01 China is a leader in the global transportation industry, with an especially significant position in ocean ports. A mapping of every ocean port outside of China reveals that Chinese firms own or operate terminal assets in ninety-six ports in fifty-three countries. An original dataset of Chinese firms' overseas port holdings documents the geographic distribution, ownership, and operational characteristics of these ports. What are the international security implications of China's global port expansion? An investigation of Chinese firms' ties to the Party-state reveals multiple mechanisms by which the Chinese leadership may direct the use of commercial port assets for strategic purposes. International port terminals that Chinese firms own and operate already provide dual-use capabilities to the People's Liberation Army during peacetime, establishing logistics and intelligence networks that materially enable China to project power into critical regions worldwide. But this form of networked state power is limited in wartime because it depends on commercial facilities in non-allied states. By providing evidence that overseas bases are not the sole index of global power projection capabilities, findings advance research on the identification and measurement of sources of national power. China's leveraging of PRC firms' transnational commercial port network constitutes an underappreciated but consequential form of state power projection.
Soldiers' Dilemma: Foreign Military Training and Liberal Norm Conflict Renanah Miles Joyce 2022-04-01 The United States regularly seeks to promote the liberal norms of respect for human rights and deference to civilian authority in the militaries that it trains. Yet norm-abiding behavior often does not follow from liberal foreign military training. Existing explanations ascribe norm violations either to insufficient socialization or to interest misalignment between providers and recipients. One reason violations occur is because liberal training imparts conflicting norms. How do militaries respond when they confront the dilemma of conflict between the liberal norms of respect for human rights and civilian control of the military? The U.S. policy expectation is that trained militaries will prioritize human rights over obedience to civilian authorities. But when liberal norms clash, soldiers fall back on a third norm of cohesion, which refers to the bonds that enable military forces to operate in a unified, group- and missionoriented way. Cohesion functions as both a military norm (particularly at the individual level) and an interest (particularly at the institutional level). If a military prioritizes cohesion, then it will choose the path that best serves its organization, which may entail violating human rights, civilian control, or both. An exploration of the effects of norm conflict on military attitudes among the Armed Forces of Liberia uses an experiment embedded in a survey to probe the theory. Results provide preliminary evidence that norm conflict weakens support for human rights and democracy. Results are strongest among soldiers with more U.S. training.
The Nuclear Balance Is What States Make of It David C. Logan 2022-04-01 Does nuclear superiority offer states political or military benefits? And do those benefits accrue beyond acquiring a secure second-strike capability? International relations theory has long held that nuclear superiority does not confer significant advantages, a conclusion supported by much of the qualitative literature on bargaining and crisis interactions between nuclear-armed states. New work by scholars using statistical methods to analyze data on nuclear crises, interstate disputes, and compellent threats has sought to answer these questions, producing conflicting results. Despite the contributions of these recent works, this line of research has assumed that warhead counts are an appropriate measure of nuclear capabilities and that states possess accurate information about the material balance. Instead, states use multiple quantitative and qualitative characteristics to evaluate the nuclear balance, and they often have inaccurate or incomplete information about the size, composition, and configuration of other states' nuclear forces. Using new data, replications of two prominent recent works show that results are sensitive to how the nuclear balance is operationalized. Drawing on archival and interview data from the United States and the Soviet Union during the Cold War, findings show how states and leaders often understand and respond to the nuclear balance in inconsistent, asymmetric, and subjective ways.
Assessing China-U.S. Inadvertent Nuclear Escalation Wu Riqiang 2022-02-25 China-U.S. inadvertent escalation has been a focus of recent international relations literature. The current debate, however, has not paid sufficient attention to two important factors: the survivability of China's nuclear forces under unintentional conventional attacks; and China's nuclear command, control, and communication (NC3) system. Based on detailed analysis of these two variables, three potential mechanisms of China-U.S. inadvertent escalation are examined: use-it-or-lose-it, unauthorized/accidental, and damage-limitation. Although the possibility of a major China-U.S. conventional war inadvertently escalating to a nuclear level cannot be excluded, the risk is extremely low. China's nuclear forces would survive U.S. inadvertent conventional attacks and, thus, are unlikely to be significantly undermined. Even though China's NC3 system might be degraded during a conventional war with the United States, Chinese leadership would likely maintain minimum emergency communications with its nuclear forces. Moreover, China's NC3 system is highly centralized, and it prioritizes “negative control,” which can help to prevent escalation. China's nuclear retaliatory capability, although limited, could impede U.S. damage-limitation strikes to some extent. To keep the risk of inadvertent escalation low, both sides must take appropriate precautions and exercise self-restraint in their planning and operations.
Summaries 2022-02-25 NA
Prediction and Judgment: Why Artificial Intelligence Increases the Importance of Humans in War Avi Goldfarb, Jon R. Lindsay 2022-02-25 Recent scholarship on artificial intelligence (AI) and international security focuses on the political and ethical consequences of replacing human warriors with machines. Yet AI is not a simple substitute for human decision-making. The advances in commercial machine learning that are reducing the costs of statistical prediction are simultaneously increasing the value of data (which enable prediction) and judgment (which determines why prediction matters). But these key complements—quality data and clear judgment—may not be present, or present to the same degree, in the uncertain and conflictual business of war. This has two important strategic implications. First, military organizations that adopt AI will tend to become more complex to accommodate the challenges of data and judgment across a variety of decision-making tasks. Second, data and judgment will tend to become attractive targets in strategic competition. As a result, conflicts involving AI complements are likely to unfold very differently than visions of AI substitution would suggest. Rather than rapid robotic wars and decisive shifts in military power, AI-enabled conflict will likely involve significant uncertainty, organizational friction, and chronic controversy. Greater military reliance on AI will therefore make the human element in war even more important, not less.
Insurgent Armies: Military Obedience and State Formation after Rebel Victory Philip A. Martin 2022-02-25 Why do some winning rebel groups build obedient and effective state militaries after civil war, while others suffer military defections? When winning rebels face intense security threats during civil wars, rebel field commanders are more likely to remain obedient during war-to-peace transitions. Intense security threats incentivize militants to create more inclusive leadership structures, reducing field commanders’ incentives to defect in the postwar period. Intense security threats also reduce commanders’ capacity for postwar resistance by forcing insurgents to remain mobile and adopt shorter time horizons in rebel-governed territory, reducing the likelihood that field commanders will develop local ties and independent support bases. The plausibility of the argument is examined using a new list of winning rebel groups since 1946. Two case studies—Zimbabwe and Côte d'Ivoire—probe the causal mechanisms of the theory. The study contributes to debates about the consequences of military victory in civil war, the postwar trajectories of armed groups, and the conditions necessary for civil-military cohesion in fragile states.
A Farewell to Arms? Election Results and Lasting Peace after Civil War Sarah Zukerman Daly 2022-02-25 Why does fighting recur after some civil conflicts, whereas peace consolidates following others? The untested conventional wisdom is that—absent safeguards—postwar elections are dangerous for peace because electoral losers will reject the election results and remilitarize. New cross-national data on postwar election results and belligerent-level data on remilitarization contest this view. Citizens tend to elect peace because they engage in “security voting”; they elect the party that they deem best able to secure the state, using the war outcome as the heuristic that guides their security vote. Findings indicate that the chance of renewed war increases if there is an inversion in the military balance of power after war, and the war-loser performs poorly in the elections. If, instead, relative military power remains stable, or citizens accurately update their understandings of the postwar power balance, a civil war actor is unlikely to remilitarize if it loses the election. Knowing when and how these belligerent electoral actors choose to either sustain or break the peace informs important theoretical and policy debates on how to harness democracy's benefits while mitigating its risks.
Defending the United States: Revisiting National Missile Defense against North Korea Jaganath Sankaran, Steve Fetter 2022-02-25 North Korea has made significant strides in its attempt to acquire a strategic nuclear deterrent. In 2017, it tested intercontinental ballistic missiles (ICBMs) and completed a series of nuclear test explosions. These may provide North Korea with the technical foundation to deploy a nuclear-armed ICBM capable of striking the United States. The Ground-based Midcourse Defense (GMD) missile defense system is intended to deter North Korean nuclear coercion and, if deterrence fails, to defeat a limited North Korean attack. Despite two decades of dedicated and costly efforts, however, the GMD system remains unproven and unreliable. It has not demonstrated an ability to defeat the relatively simple and inexpensive countermeasures that North Korea can field. The GMD system has suffered persistent delays, substantial cost increases, and repeated program failures because of the politically motivated rush to deploy in the 1990s. But GMD and other U.S. missile defense efforts have provoked serious concerns in Russia and China, who fear it may threaten their nuclear deterrents. Diplomacy and deterrence may reassure Russia and China while constraining North Korea's nuclear program. An alternate airborne boost-phase intercept system may offer meaningful defense against North Korean missiles without threatening the Russian or Chinese deterrents.

RSelenium: Dealing with Dynamic Tables

  • Sometimes websites have tables we would like to scrape, but they are rendered with JavaScript, so we cannot page through them by changing the URL

  • Example: AIIB Projects List

  • Solution: Make R act as a ‘user’ that ‘clicks’ through the table with RSelenium

  • This is just a short example; see the tutorial linked earlier for more detail

RSelenium: Dealing with Dynamic Tables

library(RSelenium)
# Starts a session in firefox
rD <- rsDriver(browser = "firefox",
               port = 4545L,
               verbose = FALSE,
               chromever = NULL)
remDr <- rD[["client"]]
# Go to page
remDr$navigate("https://www.aiib.org/en/projects/list/year/All/member/All/sector/All/financing_type/All/status/Approved")

RSelenium: Finding the Right Button

  • We want R to click the pagination button!
  • We use findElement and clickElement

RSelenium: Clicking the Button

webElem <- remDr$findElement(using = 'xpath',
                               value = '/html/body/div[2]/section/div/div[5]/div/div/div/div/div/div/div/div[2]/a[3]/i')

# Check that we selected the right element
webElem$highlightElement()

# Clicking 
webElem$clickElement()
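
RSelenium: Looping Over Pages (sketch)

Putting the pieces together, one way to collect the whole table is to alternate between parsing the current page source and clicking the pagination button. The sketch below is only an illustration: parse_table_page() and scrape_all_pages() are hypothetical helpers, next_xpath and n_pages are values you would supply yourself, and it assumes the data sits in an ordinary HTML <table> once the JavaScript has run. The toy page at the end is made up so the parsing step can be tried offline.

```r
library(rvest)

# Hypothetical helper: parse the first <table> found in a page-source string
parse_table_page <- function(page_source) {
  read_html(page_source) %>%
    html_element("table") %>%
    html_table()
}

# Hypothetical helper: scrape n_pages pages, clicking "next" between them.
# `remDr` is the RSelenium client created earlier; `next_xpath` is the XPath
# of the pagination button (copy it from your browser's inspector).
scrape_all_pages <- function(remDr, next_xpath, n_pages, wait = 2) {
  pages <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    # getPageSource() returns a list; the HTML string is its first element
    pages[[i]] <- parse_table_page(remDr$getPageSource()[[1]])
    if (i < n_pages) {
      remDr$findElement(using = "xpath", value = next_xpath)$clickElement()
      Sys.sleep(wait)  # give the JavaScript time to load the next page
    }
  }
  do.call(rbind, pages)
}

# Offline check of the parsing step on a made-up table
toy_page <- "<table><tr><th>Project</th><th>Year</th></tr>
             <tr><td>Example A</td><td>2020</td></tr></table>"
toy_result <- parse_table_page(toy_page)
toy_result

# When finished, close the browser and stop the Selenium server:
# remDr$close()
# rD$server$stop()
```

Keeping the parsing step separate from the clicking step makes it easy to test the parser without starting a browser.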

Scraping Press Releases

  • Goal: ~10k releases from Pakistan’s Press Information Department (PID) published between 2009 and 2011

  • ⚠️ Websites get taken down, links break, formatting changes

  • Two major sections: scraping data + cleaning data

Making a Plan

  1. Collect link for each date (~1000)
  2. Scrape the page for each date
  3. Split up the contents on the page into individual releases
  4. Data cleaning: make sure we’re left with only press releases
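
Step 1 (sketch): Collect link for each date

The first step of the plan can be sketched as follows. The index page below is a made-up stand-in built with minimal_html(), and the file-name pattern (press + day-month-year + .htm) is an assumption based on the cleaning step later on; the real archived page may be laid out differently.

```r
library(rvest)
library(stringr)

# Made-up stand-in for the archived index page; the real layout and
# file names may differ.
index_html <- minimal_html('
  <a href="press01-01-2010.htm">1 January 2010</a>
  <a href="press02-01-2010.htm">2 January 2010</a>
  <a href="about.htm">About PID</a>
')

# Pull every href on the page
all_links <- index_html %>%
  html_elements("a") %>%
  html_attr("href")

# Keep only the links that look like daily release pages (assumed pattern)
get_individuallinks <- all_links[str_detect(all_links, "^press\\d{2}-\\d{2}-\\d{4}\\.htm$")]
get_individuallinks
```

With the real index page, replace minimal_html(...) with read_html() on the archive.org URL and adjust the pattern to whatever the links actually look like.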

⛔️ Before You Start:

  • Not every link is going to work! (some pages not captured by archive.org)
  • So, use tryCatch() to try each page and return NA (with a message) if it fails
  • Otherwise, scraper will break randomly in the middle of the process (or the middle of the night!)
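
A minimal illustration of the tryCatch() pattern, using only base R. fetch_page() is a hypothetical stand-in for a real scraping call and simply fails for any link containing "missing"; the point is that the loop keeps going and records NA for the failed page.

```r
# Hypothetical stand-in for a scraper call: fails for links containing "missing"
fetch_page <- function(link) {
  if (grepl("missing", link)) stop("HTTP error 404")
  paste("text from", link)
}

# Wrap the call so a failure produces a message and NA instead of stopping
safe_fetch <- function(link) {
  tryCatch(
    fetch_page(link),
    error = function(cond) {
      message("Could not scrape ", link, ": ", conditionMessage(cond))
      NA_character_
    }
  )
}

links <- c("page1.htm", "missing.htm", "page3.htm")
results <- vapply(links, safe_fetch, character(1))
is.na(results)
```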

Step 2: Scrape the page for each date

  • End Goal: Data frame with 2 columns, pr_text and pr_date.

  • Text is the text scraped from the page, date is just the link we’re scraping (the date is in the link 🙃)

  • Follows the same format as the IS example

library(rvest)
#Start with a lapply over the links you collected (1005 in our case)
pr_out_list <- lapply(seq_along(get_individuallinks), function(x){
  
  #Might be a good idea to take a 30 second break between each call
  #Sys.sleep(30) 
  
  #Part 1: collecting text
  text <- tryCatch(
    {
    #R console is going to spit this out each time it tries to scrape  
    message("This is the 'try' part")
    
    #The URL we are scraping = the base archive.org/pid link + the date link we scraped already    
    read_html(paste0("https://web.archive.org/web/20130502224925/http://pid.gov.pk/", get_individuallinks[x])) %>% 
      
    #Collecting the HTML tables on the page  
    html_elements("td") %>%
    #Collecting the text from those tables
    html_text() %>% 
    #Trimming whitespace  
    trimws() %>%
    #Pasting everything back together (bunch of tables)  
    paste(., collapse = " ")
    },
    
    #R console is going to spit this out each time it fails 
    error=function(cond) {
      message(paste("URL does not seem to exist:", get_individuallinks[x]))
      message("Here's the original error message:")
      message(cond)
      return(NA)
    }
  )
  
  #Part 2: collecting the date (just keeping the link we scraped)
  date <- get_individuallinks[x] 
  
  #Putting the text and URL together
  list(pr_text = text, pr_date = date)
})

#Combining the list into a df
df_all <- data.table::rbindlist(pr_out_list, fill = TRUE)
#save(df_all, file= "df_all.rData")

HINTS

  • Remove rows/dates that are NA / we weren’t able to scrape
  • Convert all text to uppercase
  • Lots of random line breaks/whitespace can be removed with "[\r\n]"
  • Use strsplit and unnest to split up the chunks of text at a meaningful spot
  • After the splitting, use str_detect and nchar to remove chunks that don’t have anything useful. Example: “PR 146” or “(NOT TO BE PUBLISHED, BROADCAST, AND TELECAST BEFORE THE MORNING OF JANUARY 1, 2012)”
  • Get the pr_date into something that lubridate will understand (day-month-year)
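
The hints above can be tried on a toy example first. Everything below is made up for illustration (the link, the text, and the 10-character threshold), and it uses str_split() on a single string rather than strsplit() plus unnest() on a data frame, so treat it as a sketch of the idea rather than the workshop solution.

```r
library(stringr)
library(lubridate)

# Made-up example values; real links and page text will differ
toy_link <- "press05-03-2010.htm"
toy_text <- "PR 146\r\nPress Release\r\nIslamabad: first item. Press Release\r\nIslamabad: second item."

# Strip everything that is not part of the date, then parse as day-month-year
toy_date <- dmy(str_remove_all(toy_link, "press|\\.htm|issue"))

# Uppercase, replace line breaks with a space, then split on the marker
clean_text <- str_replace_all(toupper(toy_text), "[\r\n]+", " ")
chunks <- str_squish(str_split(clean_text, "PRESS RELEASE")[[1]])

# Drop short leftovers such as "PR 146" (threshold is arbitrary for this toy)
kept <- chunks[nchar(chunks) >= 10]
kept
```

On real pages the threshold needs to be much larger (the solution uses 200 characters) because the leftover fragments are longer.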

Steps 3 and 4: Split up releases and clean

Using tidyverse to put everything in one code chunk, but steps can also be split up!

library(tidyverse)
library(lubridate)

#Take the result of the previous scrape (or the downloaded input from Eyal)
releases <- df_all %>% 
  
  #Remove rows that have NA for pr_text
  drop_na(pr_text) %>%
  
  #Split pr_text strings at "PRESS RELEASE" (after converting it all to uppercase)
  mutate(pr_text = strsplit(as.character(toupper(pr_text)), "PRESS RELEASE")) %>% 
  
  #Unnest these split up strings (from 1 row to ~5 or 6 rows per date)
  unnest(pr_text) %>%
  
  #Remove line breaks from pr_text
  mutate(pr_text = str_remove_all(pr_text, "[\r\n]")) %>% 
  
  #Remove rows that have "not to be published" text, BUT only those with less than 200 characters
  filter(!(str_detect(pr_text, "NOT TO BE PUBLISHED, BROADCAST") & nchar(as.character(pr_text))<=200)) %>%
  
  #Remove other entries that have less than 200 characters
  filter(nchar(pr_text)>=200) %>%
  
  #Remove the terms press, .htm, and issue from the date column (the dot is escaped so it matches a literal ".")
  mutate(pr_date = str_remove_all(pr_date, "press|\\.htm|issue")) %>%
  
  #Use lubridate to get the pr_date column into a date format
  mutate(pr_date = dmy(pr_date))

Part 1: Visualize the data with a bar graph

#Do a count of PRs by year-month and make a bar graph
ggplot(releases, aes(format(pr_date, "%Y-%m"))) +
  geom_bar(stat = "count") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(title = "Scraped Press Releases, Pakistan Press Information Department",
       x = "Month",
       y = "Number of Releases")

Part 2: Visualize the data with a word cloud

library(quanteda)
library("quanteda.textplots")

#Convert the df to a quanteda corpus
pr_corpus <- corpus(releases, text_field = "pr_text")

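#Tokenize the corpus, dropping punctuation, symbols, and numbers, then remove common
#stopwords ("for," "and," "if," etc.) and build a Document-Feature Matrix (DFM) of the PR terms
#(calling dfm() directly on a corpus, and its `remove` argument, are deprecated in quanteda 3 and later)
pr_toks <- tokens(pr_corpus,
                  remove_punct = TRUE,
                  remove_symbols = TRUE,
                  remove_numbers = TRUE)
pr_dfm <- dfm(tokens_remove(pr_toks, stopwords("english")))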

#Make a wordcloud!
textplot_wordcloud(pr_dfm, min_count = 6, max_words = 100)

Thank you!

  • Please scrape responsibly!
    • Add delays between pages
    • Use distributed scraping only against well-resourced sites that can handle the load
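
One way to add delays between pages is to wrap the scraping call in a small loop that sleeps for a random interval after each request. This is a sketch only: scrape_politely() is a hypothetical helper, scrape_one stands for whatever function scrapes a single link, and the default wait times are arbitrary. The toy run uses toupper() as a fake scraper with very short waits.

```r
# Hypothetical helper: call `scrape_one` on each link, pausing in between
scrape_politely <- function(links, scrape_one, min_wait = 5, max_wait = 10) {
  out <- vector("list", length(links))
  for (i in seq_along(links)) {
    out[[i]] <- scrape_one(links[i])
    # Random pause so requests are not fired at a fixed rhythm
    if (i < length(links)) Sys.sleep(runif(1, min_wait, max_wait))
  }
  out
}

# Toy run with a fake scraper and very short waits
res <- scrape_politely(c("a.htm", "b.htm"), toupper, min_wait = 0.1, max_wait = 0.2)
res
```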