01A4477C-5C6E-41CA-A287-24E5B719EA6D
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-03-24
- External-Description
- This bag contains Twitter data for 108,884 tweets that were sent to or from
William Gibson (@greatdismal). However it is believed that the dataset
does not include retweets. At the time of data collection the
greatdismal account had tweeted and retweeted 46,350 tweets, but this
collection only includes 19,743 sent by greatdismal. The inference here
is that the difference is retweets or missed tweets.
The data was retrieved using a program that collected tweet identifiers
from Twitter's search user interface for the query:
from:greatdismal OR to:greatdismal
A sorted version of this list of identifiers is available in the
greatdismal-ids.txt file. The full tweet JSON data for the tweets was then
retrieved from Twitter's API using twarc (http://twitter.com/edsu/twarc).
This data can be found in the greatdismal.json file.
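The gap described above can be checked against the data itself. The following
is a minimal sketch, assuming greatdismal.json holds one tweet JSON object per
line and that each object names its author in user.screen_name, as in
Twitter's v1.1 API. The function name count_by_author is made up for
illustration and is not part of twarc.

```python
import json
from collections import Counter


def count_by_author(lines):
    """Count tweets per author screen name from line-delimited tweet JSON."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        tweet = json.loads(line)
        # Assumption: v1.1-style tweets with a nested "user" object.
        counts[tweet["user"]["screen_name"].lower()] += 1
    return counts


# Figures quoted in the description above.
profile_total = 46_350  # tweets and retweets reported for the account
collected = 19_743      # tweets by greatdismal present in this bag
missing = profile_total - collected  # presumed retweets or missed tweets
```

Passing an open file handle for greatdismal.json to count_by_author and
looking up the "greatdismal" key should give a figure comparable to the 19,743
quoted above, if the assumptions about the file layout hold.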
- Size
- 275.2MB
- License
- UMD only
04C60F25-0683-4105-99F1-E432E4E1A1A8
- Bagging-Date
- 2015-06-16
- External-Description
- This bag contains multiple artifacts for the Occupied Japan
Gender, Class and Race project website created in 2003-2004
by Marlene Mayo, who was a Freeman Foundation Fellow. The website
was found at http://mith.umd.edu/gcr and was password protected
due to concerns about copyright.
The bag was created on June 15, 2015 after an earlier migration from
zelda.umd.edu at UMD to Amazon Web Services failed to migrate
the website. At that point a snapshot of the website (PHP code,
MySQL database and static assets) was created, which you can find
as gcr.tar.gz. At the same time the website was crawled with wget
(the command is in get.sh) to both mirror the contents of the website
and also create a WARC and CDX web archive. The static mirror of the
website can be found in the mith.umd.edu directory. The static site was
then put back online at http://mith.umd.edu/gcr on a machine hosted
on Amazon EC2.
- Size
- 562.0MB
061F44EC-3CFD-4403-8199-23F3124B9FF9
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2020-03-11
- External-Description
- This bag contains tweets collected by Ed Summers in collaboration with Vernon
Mitchell, Cassie Adcock, and AJ Robinson (WUSTL). They document the resistance
to efforts by the Hindu Right in India to deprive Muslim Indians of
citizenship. Data collection from the Twitter filter stream began on December
25, 2019 and ended on January 21, 2020 after collecting 14,144,417 tweets. In
addition, on December 25, 2019 data was collected from the Twitter search API,
retrieving 5,345,453 tweets back to December 20, 2019. In total 19,489,870
tweets were collected and their ids are stored in sorted order in the ids.txt.gz file. In
addition the twarc commands that were used to run the data collection can be
found in the search.sh and stream.sh scripts in the data directory.
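The totals above can be checked against the id file. This is a minimal sketch,
assuming ids.txt.gz contains one numeric tweet id per line. The description
does not say whether "sorted order" is numeric or lexical, so numeric order is
assumed here, and the function name summarize_ids is hypothetical.

```python
import gzip


def summarize_ids(path):
    """Return (count, is_sorted) for a gzipped file of one tweet id per line."""
    count = 0
    is_sorted = True
    previous = None
    with gzip.open(path, "rt") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            tweet_id = int(line)  # assumption: ids are plain integers
            if previous is not None and tweet_id < previous:
                is_sorted = False
            previous = tweet_id
            count += 1
    return count, is_sorted


# The stream and search counts quoted above add up to the stated total.
stream_count = 14_144_417
search_count = 5_345_453
assert stream_count + search_count == 19_489_870
```

The count returned for the bag's ids.txt.gz can then be compared with the
stated total of 19,489,870.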
Below is the email from Cassie sent on December 24, 2019 which outlined the
rationale for what was collected:
From: cadcock@wustl.edu
Subject: Re: Archiving Twitter RE: India unrest
Date: Tue, 24 Dec 2019 06:53:45 -0800 (PST)
Thank you, Ed and Vernon. I'm also including AJ Robinson on this message,
since she expressed interest in this archiving project.
I'm primarily thinking of creating a dataset for future researchers. This is
a major political event, in which the Hindu Right, which has been steadily
rising over recent decades, finally confronts massive popular resistance in its
effort to impose its vision of a Hindu Nation by depriving Muslim Indians of
citizenship. The resistance is India-wide, and includes a wide base of social
groups. This is also a major turning point, because we finally see the Hindu
Nationalist government in power directing police forces to direct unwarranted
brutality and violence on Muslim Indians all across the country. On both
counts, we have seen this before, but localized in particular cities or states,
not India-wide.
I have zero tech skills of this kind so I would be grateful to see some
preliminary archive underway. The core tags seem to be:
#CAA
#CAAProtests
#CAB
#CABProtests
#NRC
#JamiaMilia
#JamiaProtest
And variations –
#NoToCAB
#NoToNRC
#CitizenshipAmendmentAct
#CAA_NRC_Protests
#IndiaAgainstCAA_NRC
I'm less concerned to archive the propaganda-mongering of the Hindu Right, but those tags are:
#isupportCAB
#IsupportCAB2019
#ISupportCAA_NRC
#ISupportCAA
- Size
- 17.3GB
06F5C818-8CFD-477F-A0C0-7EA6EEE0BF76
- Contact-Name
- Damien Pfister
- Contact-Email
- dsp@umd.edu
- Bagging-Date
- 2020-08-16
- External-Description
- This bag contains the Omeka server side code and database snapshot for
the Internet Research Agency Ads website that was created by Damien
Pfister in collaboration with Ed Summers and Purdom Lindblad of MITH.
The bag also contains a wget crawl of https://mith.umd.edu/irads/ which
is persisted as a static site and a WARC file.
File listing:
irads.sql.gz - Omeka MySQL database dump
mith-irads.tar.gz - Omeka server side PHP code
irads.warc.gz - WARC file generated by wget crawl
mith.umd.edu.tar.gz - static site created with wget crawl
The site description at the time of archiving was as follows:
This site explores--and offers users the opportunity to explore--the rhetoric
of computational propaganda that occurred on Facebook during the 2016 election.
The project was developed by Dr. Damien Smith Pfister, Nora Murphy, Meridith
Styer, and Misti Yang in collaboration with Purdom Lindblad and Ed Summers at
the Maryland Institute for Technology in the Humanities. 160 students from the
"Interpreting Strategic Discourse" classes offered by the University of
Maryland's Department of Communication coded the dataset by hand.
The IRAdS website contains over 3,000 Facebook advertisements that the Internet
Research Agency, a Russia-linked “troll farm,” purchased in the run-up to the
2016 election campaign. This is one of the most sophisticated efforts at
computational propaganda yet, but little systematic analysis has been done on
this data corpus.
Pfister, Murphy, and Yang developed the codebook based on concepts developed in
the class (e.g. metaphors, myths, ideographs, semiotics; syllabus available
here). Our hope is that this dataset will surface and organize different themes
across these advertisements. In collaboration with MITH, these advertisements
will be posted, with our analysis embedded as metadata, on a website that other
publics can use to better understand Russia’s propaganda efforts.
- Size
- 5.4GB
085323E0-E95A-4045-A1DF-27B5F65C1EE6
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-06-30
- External-Description
- On June 17, 2015 a mass shooting took place at Emanuel African Methodist
Episcopal Church in Charleston, South Carolina. Soon afterward the
#charleston and #charlestonshooting hashtags were used on Twitter to
spread news of the event. On June 18th Bergis Jules of UC Riverside
and Ed Summers of University of Maryland began collecting #charleston
and #charlestonshooting tweets as both a historical search and a stream.
Both files are included here, and comprise 3,099,173 tweets from
June 10 to June 30. The first few hundred tweets are #charleston
tweets sent prior to the event.
- Size
- 2.1GB
097C9916-8BB3-43A9-BD9F-EF26AF5B150E
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2018-07-10
- External-Description
- This bag contains a wget capture of the https://archive.blackgothamarchive.org/
Omeka website on July 10, 2018 in order to decommission the Omeka website
that was several versions behind, and hadn't been updated in 4 years but
still contains a useful collection of materials. The static version of the
website was mounted in place of the Omeka site, and the server side PHP
and database were placed into this bag. The wget capture was executed with
the bagweb utility https://github.com/edsu/bagweb/
The Omeka instance couldn't be upgraded to the latest version of Omeka
because its theme was not compatible. This meant that PHP could not be updated
past v5.6, and security patches couldn't be installed. If you want to
bring the Omeka site back online you will need to use PHP v5.6. This
apt-get install command should pull in the necessary modules:
sudo apt-get install php5.6 php5.6-mbstring php5.6-xml php5.6-mysql \
php5.6-common php5.6-gd php5.6-json php5.6-cli php5.6-curl
You will also need to enable mod_rewrite in Apache:
sudo a2enmod rewrite
- Size
- 925.5MB
0ED6D8FA-829A-43FE-A02F-C9394763641A
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-10-14
- External-Description
- 463,956 tweets collected between 2016-01-29 and 2016-10-14 that used the
hashtag #SayHerName. During this time #SayHerName was a social movement that
raised awareness for black female victims of police brutality and anti-Black
violence in the United States. The tweets were collected as part of a
collaborative research project between MITH and the Sociology Department at
the University of Maryland. The results of analyzing the data were published
in this paper: Brown, M., Ray, R., Summers, E. & Fraistat, N. (2017)
#SayHerName: a case study of intersectional social media activism. Ethnic and
Racial Studies, 40(11), 1831-1846.
http://dx.doi.org/10.1080/01419870.2017.1334934
The bag contains four files, one of which was collected from Twitter's Search
API and the other three were collected from the Streaming API. The reason for
the separate streaming API files is that there were network connectivity
problems that went unnoticed for some period of time, which resulted in two gaps.
The first gap occurs on March 19, 2016 and extends to April 22, 2016. The
second gap starts on June 26, 2016 and goes till July 6, 2016. The twarc.log
file was created by the twarc utility as it was collecting data from the
Twitter APIs.
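Gaps like the two described above can be located from the tweet timestamps.
This is a minimal sketch, assuming the created_at values have already been
pulled out of the tweet JSON and use the Twitter v1.1 timestamp layout. The
function name and the one-day threshold are illustrative choices, not
something recorded in the bag.

```python
from datetime import datetime, timedelta

# Timestamp layout used by "created_at" in Twitter v1.1 tweet JSON,
# for example "Sat Mar 19 14:02:11 +0000 2016".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def find_gaps(created_at_values, threshold=timedelta(days=1)):
    """Return (start, end) pairs where consecutive tweets are further apart than threshold."""
    times = sorted(
        datetime.strptime(value, CREATED_AT_FORMAT) for value in created_at_values
    )
    gaps = []
    for earlier, later in zip(times, times[1:]):
        if later - earlier > threshold:
            gaps.append((earlier, later))
    return gaps
```

Run over the timestamps of all three streaming files together, this should
report intervals close to the March to April and June to July gaps, provided
the threshold is shorter than the gaps and longer than any natural lull in
the hashtag.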
On January 2, 2017 the stream4.json.gz file was updated to remove a partial
JSON object in the last line of stream4.json.gz which was caused by
the forced termination of the stream. This description was also updated with
information based on use of the data by Trevor Muñoz for a presentation at MLA
2018.
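A stream that is killed mid-write leaves an incomplete JSON object on its last
line, as happened with stream4.json.gz. The sketch below shows one way to read
such a file defensively, assuming one tweet per line. It is not the procedure
that was used to repair the file, and the function name read_tweets is
hypothetical.

```python
import json


def read_tweets(lines):
    """Yield parsed tweets, skipping a truncated final line if there is one."""
    pending = None
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if pending is not None:
            # Every line except the last is expected to be complete JSON.
            yield json.loads(pending)
        pending = line
    if pending is not None:
        try:
            yield json.loads(pending)
        except json.JSONDecodeError:
            # Only the final line is allowed to be incomplete.
            pass
```

A malformed line anywhere other than the end still raises an error, so this
does not hide other kinds of damage.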
- Size
- 274.9MB
1049CF0E-6B74-433B-A0F3-D074E960D9ED
- Contact-Name
- Ed Summers
- Contact-Email
- edsu@umd.edu
- Bagging-Date
- 2018-10-30
- External-Description
- The African American History, Culture and Digital Humanities conference was
held at the University of Maryland at College Park between October 18-20,
2018. Because of MITH's involvement in the project the tweets containing
the hashtags #aadhum2018 and #aadhum18 were collected using the twarc
utility on October 22. The search retrieved 7,226 tweets back to October 11.
Since there were many tweets in response threads that did not have the
hashtag, the twarc replies.py utility was used to collect conversation
threads that hung off of the originally collected tags. This generated
an additional 738 tweets, which are included along with all the originally
collected tweets in search-replies.jsonl.gz.
- Size
- 8.3MB
- License
- UMD Only
10DF7C8C-6A81-49D0-A9A2-3528E9B0D73C
- Bagging-Date
- 2016-03-20
- External-Description
- wget capture of http://mith.umd.edu/offthetracks/ created by ed on 2016-03-20.
The bag includes:
- offthetracks.sql.gz: wordpress database dump
- mith.umd.edu-offthetracks.tar.gz: wordpress directory snapshot
- mith.umd.edu.tar.gz: static site capture from wget
- offthetracks.warc.gz: warc file for static site crawl
- Size
- 644.9MB
157CB91A-389D-47B1-81F3-07F524D86E09
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-06-12
- External-Description
- wget capture of http://mith.umd.edu/tile/ created by Ed Summers on
2017-06-12. The payload files include the Wordpress code and MySQL database
dump as well as a mirror of the website as it existed with the search
turned off, and a WARC file.
- Size
- 169.2MB
16088C55-9565-4907-962D-6B2D7AEA02F7
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-05-20
- External-Description
- This is a snapshot of the DISC website code and database taken on May 20, 2015
by Porter Olsen. The DISC website was retired in 2006 because of concerns over
security vulnerabilities in the code. The physical hardware that served the
MITH website at that time (minerva.umd.edu) was preserved, and Porter Olsen
was able to locate the server code and database in 2015.
- Size
- 2.3MB
16948DE6-1A6E-4115-9CC6-2B9859443FAE
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-08-19
- External-Description
- This is a snapshot of the /export/software directory on zelda.umd.edu on August
18, 2015 when it was finally decommissioned (turned off). Previously zelda was
the host that made many MITH web properties available. The websites were
largely moved over to Amazon Web Services in December of 2014. But zelda was
left on for 8 months while we transitioned some last remaining DNS names that
were pointed at zelda. /export/software contains all the database and website
content. It is in a tarball, which has not preserved timestamps and usernames
unfortunately. For that there is going to be a forensic snapshot of zelda which
should be made available as a separate bag. This bag is meant to be a
convenience to get data that is needed after zelda is taken offline, without
requiring remounting disk images.
- Size
- 201.2GB
185A16A4-82D9-4810-8568-B52D83BBAAD6
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-08-15
- External-Description
- This bag contains 1,304,702 tweets that contain the word "ferguson" between
July 30, 2015 and August 11, 2015. This includes the lead-up to the first
anniversary of Michael Brown's killing on August 9, 2014.
- Size
- 835.2MB
- License
- UMD Only
1A440285-EF8E-430D-81D0-B9B591A7BF90
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-06-12
- External-Description
- wget capture of http://www.blackgothamarchive.org created by Ed Summers on
2017-06-12. The payload files include the Wordpress code and MySQL database
dump as well as a mirror of the website as it existed with the search
turned off, and a WARC file.
- Size
- 24.2MB
1BBB9316-15B9-402E-A518-DDE0A2C93B5D
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-08-15
- External-Description
- The shooting of Samuel DuBose occurred during a traffic stop for a
missing front license plate on July 19, 2015, in Cincinnati, Ohio.
Ray Tensing, a white University of Cincinnati police officer, fatally
shot DuBose, a black man, when DuBose started his car and, according
to Tensing, began to drive off. This bag contains 696,894 tweets that
were sent between July 21 and August 8th with the hashtag #SamuelDubose.
Data collection started on July 29 at 17:12:17 as a search and
stream.
- Size
- 467.2MB
- License
- UMD Only
1CA8ACBE-39A6-4CB2-8EFC-0C5014064DD9
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-10-09
- External-Description
- This bag contains two datasets of tweets that were collected for "hillary" and
"trump" from October 9, 2016 9:00 EDT to October 9, 2016 10:30 EDT. This time
period covers the duration of their second debate at Washington University in
St Louis. The "trump" dataset includes 327,751 tweets and the "hillary"
dataset includes 337,320 tweets. According to the logs at least 1,208,536 and
419,544 tweets for the trump and hillary datasets were not delivered due to
throttling of the data by Twitter. The datasets were collected using two
separate twarc processes that were running on the same m4.xlarge Amazon EC2
instance using different sets of Twitter developer keys. Included in the
payload directory are the scripts that were used to start the data collection,
their respective log files and the data files themselves.
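Because the logged counts of undelivered tweets are lower bounds ("at least"),
the share of matching tweets that was actually captured can only be bounded
from above. This is a minimal sketch of that arithmetic, assuming the logged
figures count tweets that matched the filter but were withheld by throttling.
The function name is illustrative.

```python
def delivered_fraction(delivered, undelivered):
    """Fraction of matching tweets that were delivered.

    When undelivered is a lower bound, the result is an upper bound.
    """
    return delivered / (delivered + undelivered)


# Figures quoted in the description above.
trump = delivered_fraction(327_751, 1_208_536)
hillary = delivered_fraction(337_320, 419_544)
print(f"trump: at most {trump:.1%} of matching tweets delivered")
print(f"hillary: at most {hillary:.1%} of matching tweets delivered")
```

Anyone treating these datasets as a sample of the debate conversation should
keep this bound in mind.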
- Size
- 474.0MB
- License
- UMD Only
1d7d4ef3-78a9-4ca4-be4a-b67f24acb5f9
- Bagging-Date
- 2015-02-22
- External-Description
- This is a collection of 1,556,702 tweets generated with twarc for the period
2015-02-12 23:31:56 to 2015-02-28 01:51:52 that mentioned the word "iran"
or the hashtags #irantalks and #irandeal.
- Size
- 765.2MB
- License
- UMD only
232580EC-3762-474E-A78A-0C44D616007A
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-08-08
- External-Description
- wget capture of http://mith.umd.edu/pda2013/ created by ed on 2016-08-08.
The bag contains a snapshot of the Personal Digital Archiving 2013 Wordpress
website. The bag contains the Wordpress source and database in the state it
was found on August 8, 2016. The bag also includes the output of a wget
mirror crawl of the website and also a resulting WARC file for the crawl.
The static site was then used to replace the Wordpress site.
- Size
- 20.0MB
2543A0EF-3164-48DE-B21C-FE7A5695F62B
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-11-13
- External-Description
- Tweets collected for Daniella Koonce, a Sociology student studying the 2016
Presidential election. They were collected using the #electionresults
hashtag.
- Size
- 95.7MB
- License
- UMD Only
2654078E-ACC4-408E-93CC-B9320C4A3443
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2021-02-22
- External-Description
- This bag contains snapshots of the Lakeland Community Heritage Project's
Airtable bases as of 2021-02-22. The two primary Airtable bases Lakeland
Digital Archive (LDA) and Lakeland Digitization Tracking (LDT) were combined
into a new Airtable base to be named Lakeland Digital Archive (LDA2). The
snapshots also include LDA-testing, which was used as a scratch space for
moving the LDA base forwards. Snapshots were created with
https://github.com/simonw/airtable-export
LDA is comprised mostly of photos and some oral history interviews that were
collected from an Omeka instance running at lakeland.umd.edu, and from two
project members' hard drive storage (Mary Sies and Maxine Gross). LDT is
comprised of image scans of materials collected during the LCHP's community
digitization events in the Fall of 2019.
The data files that these Airtable bases described were located on Google
Drive, in GitHub repositories, on the MITH Network Attached Storage, and in
Amazon S3. Some were also embedded in websites like
https://lakeland.umd.edu/asa/ and the Lakeland Omeka site.
The LDA and LDT bases were combined into the LDA2 base using bespoke code
located at https://github.com/umd-mith/lakeland-data-munging. The digital files
were also moved to a central location on the filesystem in preparation for
moving them to cloud storage (Dropbox). For a discussion of this process see:
https://app.asana.com/0/1191235198649641/1198495673312437/f
The export of the YAML, JSON and SQLite versions of the three bases was
created with the included export.sh. It depends on having an AIRTABLE_KEY set
appropriately in the environment.
- Size
- 24.8MB
- License
- Lakeland Community Heritage Project
28573AC6-11D6-4CF9-91B8-239A19034166
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2018-04-20
- External-Description
- This dataset contains tweets that were sent during the Ethics and Archiving
the Web conference that was held at the New Museum in New York City,
March 22 - 24, 2018. The eaw18.jsonl.gz file contains 3,155 tweets that were
collected from Twitter's search API using the utility twarc on March 25th.
That file contains tweets that use the #eaw18 hashtag, and includes tweets
back to March 15th. On March 27th the twarc utility replies.py was run to
collect the threaded conversations around the tweets which resulted in an
additional 1,345 tweets. The combined original tweets and replies can be
found in the eaw18-replies.jsonl.gz file. More about the conference itself
can be found at https://eaw.rhizome.org
- Size
- 3.7MB
- License
- UMD Only
286971F9-EED7-488D-A6CE-947189A05D36
- Contact-Name
- Ed Summers
- Contact-Email
- edsu@umd.edu
- Bagging-Date
- 2018-05-14
- External-Description
- This is a snapshot of the Storify stories that were created for the
African American History, Culture and Digital Humanities project
(AADHum) between 2015-2017. They were originally found at
https://storify.com/UMD_AADHum but were downloaded using the storified
utility https://github.com/docnow/storified just before the service's
announced shutdown on May 16, 2018.
- Size
- 26.9MB
9B4439F2-9329-4E4D-9F5D-69C470E1C8B9
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-09-26
- External-Description
- This bag contains a wget capture of http://mith.umd.edu/sharedhorizons/ which
includes a mirror copy and a warc file. The wordpress code and database are
also saved within the bag.
- Size
- 401.2MB
34322F08-93A8-4309-A3AA-89D20C42B06F
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-08-17
- External-Description
- This bag contains 8,444,354 tweets mentioning the word Iran in 13 different
scripts, collected between July 6 and August 15, 2015. The full list
of words was Iran,Иран,Իրան,ﺈﻳﺭﺎﻧ,איראן,İran,ईरान,ইরান,Эрон,อิหร่าน,इरान,이란,Іран
They were collected by MITH for Matt Miller of the Roshan Institute at
the University of Maryland. The interest in this particular time period is
the signing of the Joint Comprehensive Plan of Action between Iran,
China, France, Russia, the United Kingdom and the United States about Iran's
nuclear program on July 14, 2015.
- Size
- 4.6GB
36C6BFE6-66A4-4634-9AA9-0D702D9D3B2E
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-09-29
- External-Description
- These are tweets from @TriciaLockwood that were scraped from Twitter's search on September 29, 2015. The query
@TriciaLockwood was used, which will include any tweet sent by or to
her, as well as any tweets that mention her handle in the text of the tweet.
It is important to remember that Twitter's search does not include retweets.
So you will not find @TriciaLockwood's retweets in this dataset.
There are 49,573 tweets in JSON format, and at the time she was listed as
having 12,438 tweets. The scraping process was a PhantomJS script that
exercised the infinite scroll in search results, and extracted the tweet
ids. The tweet ids were then hydrated using the twarc tool.
- Size
- 14.7MB
- License
- UMD Only
37E83ECD-5182-4EA1-8DDA-84D629DB2FBC
- Contact-Name
- Trevor Muñoz
- Contact-Email
- tmunoz@umd.edu
- Bagging-Date
- 2017-10-02
- External-Description
- This bag contains data for the Godwin Diary website at Oxford University
http://godwindiary.bodleian.ox.ac.uk It was made available by James Cummings
who gave it to Trevor Muñoz at the DH2017 conference in Montreal. The email
from Cummings asking Trevor and others at MITH if we wanted a backup copy of
the data is included as email.txt in the data directory. From the email
it appears that this content is all "openly licensed".
- Size
- 88.7GB
3A292749-2C95-4E49-861A-CD0FFD22B14D
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-06-12
- External-Description
- wget capture of http://www.mith.org/camp/ created by Ed Summers on
2017-06-12. The payload files include the Wordpress code and MySQL database
dump as well as a mirror of the website as it existed with the search
turned off, and a WARC file.
- Size
- 109.7MB
3DFEA6F3-D830-4C95-852D-9619958627D9
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2019-04-15
- External-Description
- On March 31, 2019, Nipsey Hussle was fatally shot outside his store, Marathon
Clothing, in South Los Angeles. Eric Holder, a 29-year-old man who had
confronted Hussle earlier in the day, was arrested and charged with murder on
April 2, 2019. Hussle’s memorial service was held on April 11 at the Staples
Center in Los Angeles, with tickets given away free of charge. The 25.5-mile
(41.0 km) funeral procession wound through the streets of South L.A.
including Watts where he spent some of his formative years. On Wednesday,
April 3, 2019, tweets with the following hashtags were collected from the Twitter
streaming and search APIs: NipseyHussle, Nipsey, Nipsey Hussle,
RIPNipseyHussle, RIPNipsey. The collection includes 11,642,103 tweets from
March 28 until April 15.
- Size
- 8.7GB
- License
- UMD Only
3F933CBB-0DCE-425F-86E5-D282C1E09B53
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-08-02
- External-Description
- On July 5, 2016 Alton Sterling, a 37-year-old black man,
was shot several times at close range while held down on the ground by two
white Baton Rouge Police Department officers in Baton Rouge, Louisiana. The
shooting was recorded by multiple bystanders, whose videos were spread on social
media. The shooting led to protests in Baton Rouge and a request for a civil
rights investigation by the US Dept of Justice. This bag contains 5,960,419
tweets that used the #AltonSterling hashtag that were collected starting on
July 6, 2016 until August 2, 2016. The total tweets includes 1,028,065 tweets
that were collected by searching for tweets that had already been sent. The
first tweet this search found was sent July 5 at 2:05PM CST.
- Size
- 5.0GB
- License
- UMD Only
404CC0BB-A5D0-4FBF-8920-F3F0F1BC2CEF
- Contact-Name
- Porter Olsen
- Contact-Email
- polsen@umd.edu
- Bagging-Date
- 2015-05-19
- External-Description
- This bag contains a static website for the Hughes@100 event held at the
University of Maryland on February 25, 2002. The original website at
http://mith.umd.edu/hughes/ became unavailable, possibly as early as 2003.
In May 2015 Porter Olsen recovered the website from an old webserver named
Minerva. When the website was put back online the original QuickTime videos
of the event were converted to mp4 for accessibility reasons. The original
QuickTime files were left in place, along with what appeared to be editor
backup files with a tilde extension.
- Size
- 652.0MB
07D2FABF-80D8-45C9-A5E1-3932584CD52B
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-12-11
- External-Description
- The shooting of Korryn Gaines occurred on August 1, 2016, in Randallstown,
Maryland, near Baltimore, resulting in the death of Gaines, a 23-year-old
African-American woman, and the shooting of her son. According to the
Baltimore County Police Department, officers sought to serve Gaines a warrant
in relation to an earlier traffic violation. After they entered her apartment, an
hours-long standoff ensued, ending when Gaines threatened police officers
with her shotgun. At least one of the officers shot Gaines, killing her and
wounding Gaines' five-year-old son. Portions of the standoff were filmed
by Gaines and posted to social-media networking sites; however, upon police
request, Facebook deactivated Gaines' Facebook and Instagram accounts,
leading to criticisms of the company's involvement in the incident.
This bag contains 705,974 tweets from the streaming and search API that used
the #KorrynGaines hashtag between August 1 and August 9.
- Size
- 591.9MB
- License
- UMD Only
424E0234-3975-453E-9307-FE3EBC65560A
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2020-10-14
- External-Description
- This bag contains a wget mirror of the Digital Studies in the Arts and Humanities Wordpress website at
https://dsah.umd.edu. It was crawled in October, 2020 when the Wordpress
multisite install was turned off. The administration of
the dsah.umd.edu domain was passed to Marisa Parham (AADHum Director). The
server side Wordpress files and database are part of the Wordpress multisite
set up that is saved in bag 8DBEB7E3-72E4-4F0A-80A3-3586D63EEA42.
wget needed to be instructed to collect and rewrite resources at mith.umd.edu
since the multisite setup used that host for images and css. The wget command
looked like this: wget --directory-prefix dsah --output-file wget.log
--warc-file dsah --mirror --page-requisites --span-hosts --html-extension
--convert-links --execute robots=off --no-parent --exclude-directories example
--level 3 --domains dsah.umd.edu,mith.umd.edu https://dsah.umd.edu
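The command above is wrapped across several lines of this description, which
makes it awkward to reuse. The sketch below rejoins it and splits it into an
argument list. It uses only the options quoted above and assumes the line
breaks in the description are plain wrapping.

```python
import shlex

# The wget command quoted above, rejoined from its wrapped lines.
WGET_COMMAND = (
    "wget --directory-prefix dsah --output-file wget.log "
    "--warc-file dsah --mirror --page-requisites --span-hosts --html-extension "
    "--convert-links --execute robots=off --no-parent --exclude-directories example "
    "--level 3 --domains dsah.umd.edu,mith.umd.edu https://dsah.umd.edu"
)

# Split into an argument list, which avoids shell quoting issues when the
# command is run from a script.
args = shlex.split(WGET_COMMAND)
```

The resulting list could be handed to subprocess.run(args, check=True) to
repeat the crawl, although the site may no longer respond as it did in
October 2020.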
- Size
- 183.5MB
4661A68-5404-443E-9571-A9E69F4DBDAE
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-04-21
- External-Description
- 12,994,199 tweets collected for the period of 2015-03-05 to 2015-04-16
for the Roshan Institute at the University of Maryland. The tweets all
contain the word Iran in a set of different scripts, including:
Iran,Иран,Իրան,ﺈﻳﺭﺎﻧ,איראן,İran,ईरान,ইরান,Эрон,อิหร่าน,इरान,イ ,이란,Іран
- Size
- 5.8GB
- License
- UMD only
46D94136-7227-4EAF-8C66-18F62E981AE5
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-10-20
- External-Description
- This bag contains two datasets of tweets that were collected for "hillary"
and "trump" from October 20, 2016 9:00 EDT to October 20, 2016 10:30 EDT. This
time period covers the duration of their third debate at the University of
Nevada in Las Vegas. The "trump" dataset includes 331,598 tweets and the
"hillary" dataset includes 323,816 tweets. According to the logs at least
855,022 and 358,288 tweets for the trump and hillary datasets were not
delivered due to throttling of the data by Twitter. The datasets were
collected using two separate twarc processes that were running on the same
Amazon EC2 instance using different sets of Twitter developer keys.
Included in the payload directory are the scripts that were used to start the
data collection, their respective log files and the data files themselves.
- Size
- 409.4MB
- License
- UMD Only
477403F8-C472-4BA9-B8E5-5EB737C23F0C
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-01-24
- External-Description
- These are tweets that were collected between August 27, 2015 and
January 4, 2016 that mention the word "trump". They were collected
from Twitter's streaming API. Due to network outages there are gaps
between the files. There are 40,202,199 tweets in all.
- Size
- 25.1GB
- License
- UMD Only
4CEB5DE0-DA52-4127-AE16-64DEBF34170D
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-09-04
- External-Description
- 20,729 #MyAsianAmerican tweets collected between Aug 25 02:24:10 UTC and
Sep 04 19:50:26 2015. The hashtag was first used by Jason Fong,
a 15-year-old high school student at Redondo Beach High School who
was responding to controversial statements made by Presidential
candidate Jeb Bush. This story in the LA Times talked about how the
tweet trended that night.
http://www.latimes.com/local/lanow/la-high-school-student-myasianamericanstory-anchor-baby-narrative-20150825-htmlstory.html
- Size
- 9.7MB
- License
- UMD Only
4D41FEA7-9E85-45B8-9499-362212278CAB
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-03-24
- External-Description
- Data collected from the Twitter filter stream for #blm,#blacklivesmatter
between 2016-01-29 and 2017-03-18 using twarc. It includes 17,292,130
tweets. The files are broken into segments because of network connectivity
problems, so there are varying time gaps present between the files. Also
when the hashtags were trending globally rate limits may have prevented some
tweets from being streamed over the API. Data collection stopped on
2017-03-18 because of an authentication error that was the result of the
keys having changed. On October 6, 2017 during some data processing of
the files it was discovered that due to gzip encoding errors in
stream3.json.gz and stream5.json.gz the total count had been undercounted
at 13,732,829. The encoding was fixed and the fixities in the manifests were
updated.
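An undercount like the one described above can arise when a reader stops at the
first damaged point in a gzip file and reports only the lines seen so far. The
sketch below counts lines and also reports whether the whole file could be
decompressed. It is an illustration, not the processing that was run on
October 6, 2017, and the function name count_lines is hypothetical.

```python
import gzip
import zlib


def count_lines(path):
    """Return (lines_read, complete) for a gzipped line-delimited file.

    complete is False when decompression stopped early because the gzip data
    was truncated or corrupt, in which case lines_read is an undercount.
    """
    lines_read = 0
    try:
        with gzip.open(path, "rb") as fh:
            for _ in fh:
                lines_read += 1
    except (EOFError, gzip.BadGzipFile, zlib.error):
        # Truncated stream, bad header or checksum, or corrupt deflate data.
        return lines_read, False
    return lines_read, True
```

Summing lines_read over the stream files, and checking that complete is true
for each one, gives a count that can be compared with the 17,292,130 stated
above.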
- Size
- 12.4GB
- License
- UMD Only
5336FCBD-CDE9-4477-8805-D574D0D99CE5
- Bagging-Date
- 2016-02-01
- External-Description
- wget capture of http://mith.umd.edu/apiworkshop/ created by Ed Summers
on 2016-02-01. The payload includes the following files:
- apiworkshop.warc.gz - WARC capture
- mith.umd.edu.tar.gz - Static site version
- wordpress.tar.gz - Wordpress installation archive
- wp_apiworkshop.sql.gz - Wordpress database snapshot
- Size
- 10.1MB
56B1171F-E859-47C0-B3BD-B4D5932B8D4C
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-09-27
- External-Description
- This bag contains two datasets of tweets that were collected for "hillary"
and "trump" from September 26, 2016 22:00 GMT to September 27, 2016 08:00.
This time period covers the duration of their first debate at Hofstra
University, which started at 9pm EDT and lasted 95 minutes. The "trump"
dataset includes 1,636,098 tweets and the "hillary" dataset includes
1,303,084 tweets. According to the logs at least 2,059,946 and 730,512 tweets
for the trump and hillary datasets were not delivered due to throttling of
the data by Twitter. The datasets were collected using two separate twarc
processes that were running on the same m4.xlarge Amazon EC2 instance using
different sets of Twitter developer keys. Included in the payload directory
are the scripts that were used to start the data collection, their respective
log files and the data files themselves.
- Size
- 2.0GB
- License
- UMD Only
02685C1D-5D87-4F26-8834-02180386523C
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-02-14
- External-Description
- This is a snapshot of the MITH Forensics Workshop website generated on
February 14, 2016 using wget. The included files are:
- forensics.sql.gz: a database dump of the WordPress site
- mith.umd.edu-forensics.tar.gz: a snapshot of the WordPress installation
- mith.umd.edu.tar.gz: a static site generated with wget
- forensics.warc.gz: a web archive file generated with wget
On testing the website it was discovered that the live site had a broken
link for Stephen Eniss' audio presentation. The correct link was
determined by looking on the filesystem and noticing that the link
contained a typo. It was difficult to fix the link in the WordPress
site since it wasn't completely operational. So the link was fixed
in the static representation that was mounted on the Web.
- Size
- 674.7MB
5B9BCD78-4DFD-4CFD-A0F1-4C53D0623549
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-04-13
- External-Description
- This bag contains data from the Early Americas Digital Archive at
http://mith.umd.edu/eada/ The snapshot was created on 2016-04-13.
The mirrored content can be found in mith.umd.edu.tar.gz and the
corresponding WARC data is in eada.warc.gz. The existing server side
data and code can be found in eada.tar.gz and the MySQL database export
is in eada.sql.gz.
- Size
- 1.3GB
5CDCC634-09A9-47BE-B8B5-7FB9CDF094BC
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2018-12-22
- External-Description
- This bag contains a wget capture of the website for the Society for Textual
Scholarship conference that was held at the University of Maryland in May of 2017. The
conference website lived at https://mith.umd.edu/sts2017. The website was
archived because the conference has passed, and the contents are no longer
going to change. The bag contains a WARC file that was generated during the
crawl. The static site that was created needed to be modified slightly to get
the page header to work correctly. Since the website was part of MITH's
multisite installation of Wordpress there was no distinct database or
files to archive in addition to the snapshot.
- Size
- 11.7MB
6ABF6C54-A1D0-4AE4-B412-738E482C41A8
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2018-11-19
- External-Description
- This dataset of tweets was created in response to the Pittsburgh synagogue
shooting that occurred at the Tree of Life Congregation in the Squirrel Hill
neighborhood of Pittsburgh, Pennsylvania on October 27, 2018. 3,603,049
tweets were collected from the Twitter Search API for the time period of
October 22 to October 30 for all tweets matching any of the following keywords:
Pittsburgh, pittsburghsynagogueshooting, pittsburghstrong,
treeoflifesynagogue, treeoflife, pittsburghsynagogue, strongerthanhate,
antisemitism, pittsburghshooting, squirrellhill, treeoflifeshooting.
- Size
- 2.7GB
- License
- UMD Only
6BB00C6E-B557-478E-947E-04D0E6FFDC8C
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-07-08
- External-Description
- This package contains 10,406,506 #lovewins tweets that were sent in the
aftermath of the Supreme Court's decision in Obergefell v. Hodges
that was announced on June 26th, 2015. They cover the period of
June 22, 2015 at 03:13:55 to July 02 at 09:03:39. The tweets were collected
starting on June 26, 2015 by doing a search of Twitter with twarc, and also
setting up a stream capture at the same time. You can see the results of
both operations in the two files in the payload.
- Size
- 6.2GB
- License
- UMD Only
6DC120C7-8901-417B-B387-0C6AFECEE9E8
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-04-08
- External-Description
- This bag contains 2,033,898 tweets mentioning the word "ferguson" between
2015-02-25 03:34:08 and 2015-03-21 08:27:25. They were collected to
help document the reaction to the Investigation of the Ferguson Police
Department report that was released by the Department of Justice
on 2015-03-04. Coincidentally this time period also included the reaction
to two police officers being shot at in Ferguson on 2015-03-12.
The data was collected using the twarc tool.
- Size
- 1.1GB
- License
- UMD only
6E665C92-1FC5-4C45-81AD-26CB2AADB3E4
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-06-12
- External-Description
- wget capture of http://mith.umd.edu/engl668k created by Ed Summers on
2017-06-12. The payload files include the Wordpress code and MySQL database
dump as well as a mirror of the website as it existed with the search
turned off, and a WARC file.
- Size
- 780.1MB
78B68395-B69C-4084-A66C-B497F36CCD82
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-11-20
- External-Description
- wget capture of http://mith.umd.edu/diaspora2008/ created by ed on 2016-11-20.
The bag also includes server side PHP code and a database snapshot.
- Size
- 18.3MB
7D840688-D48E-4220-9F95-CB9574B72FE0
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-08-30
- External-Description
- The 2017 solar eclipse occurred on August 21 and was total for Oregon,
Idaho, Wyoming, Nebraska, Kansas, Missouri, Illinois, Kentucky, Tennessee,
North Carolina, Georgia, and South Carolina. This bag includes 13,548,321
tweets that included any of the keywords solareclipse2017, solareclipse,
eclipse2017, eclipseday or eclipse for the period August 17 to August 23,
2017. The hashtags were selected after watching Twitter's streaming API
for the trending hashtag #solareclipse2017 and counting the most popular
co-occurring hashtags. Since data collection via the search API was unlikely
to finish within the 7 day window that search results are available, two
separate searches were run with twarc starting on August 23. The first
(search.jsonl.gz) included tweets that happened since tweet id
899673005755858944. The second (search-maxid.jsonl.gz) includes tweets that
happened prior to 899673005755858944. The search API was used instead of
the streaming API because the streaming API was emitting notifications that
many tweets were not delivered, because the volume was so high.
- Size
- 9.5GB
- License
- UMD Only
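The two-file split described in bag 7D840688-D48E-4220-9F95-CB9574B72FE0 can be sketched as a partition of tweet ids around the pivot id. This is only an illustration: the function name is hypothetical, and placing the pivot tweet itself in the second file is an assumption based on Twitter's search API treating since_id as exclusive and max_id as inclusive.

```python
# Pivot tweet id quoted in the bag description.
PIVOT_ID = 899673005755858944


def which_search_file(tweet_id: int, pivot: int = PIVOT_ID) -> str:
    """Return the payload file a tweet id would be expected to land in.

    Assumption: ids greater than the pivot belong to the "since" search,
    and the pivot itself plus all older ids belong to the "max id" search.
    """
    if tweet_id > pivot:
        return "search.jsonl.gz"
    return "search-maxid.jsonl.gz"


if __name__ == "__main__":
    for tid in (PIVOT_ID - 1, PIVOT_ID, PIVOT_ID + 1):
        print(tid, which_search_file(tid))
```

Because tweet ids increase over time, the two files should not overlap if the searches behaved as described.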
7FC1F69F-13C8-4CA3-8919-7C66735DCEC4
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-05-27
- External-Description
- This is a collection of tweets using the #SayHerName hashtag.
It includes 188,146 tweets for the period of November 6, 2010 to
May 27, 2015 that were collected with the twarc tool. The dataset is
made up of three different gzipped files of tweets: scraped.json.gz,
search.json.gz and stream.json.gz.
search.json.gz was collected from the search API starting on May 22, 2015.
At the same time data collection from the streaming API was started and
captured as stream.json.gz. The scraped.json.gz file contains tweets
that were scraped from Twitter's search UI for the period prior to where
search.json.gz left off. These tweet ids were then rehydrated using twarc.
It is important to note that the scraped tweets did not seem to include
retweets. While the JSON is similar in structure, the coverage is quite
different from search.json.gz and stream.json.gz.
The #SayHerName report was released on May 20, 2015 by the African American
Policy Forum and the Center for Intersectionality and Social Policy Studies at
Columbia University as well as Andrea Ritchie, Soros Justice Fellow. More
about the use of this hashtag can be read at
http://www.aapf.org/sayhernamereport/ The tweets were collected as
part of a collaboration between MITH at the University of Maryland
and the University Archives at the University of California at Riverside.
- Size
- 124.8MB
- License
- UMD and UCR only
5D2787AA-30B3-4C52-8A57-F0D534CF3A6A
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-03-26
- External-Description
- 4,328,507 Tweets mentioning the hashtag #tcot collected between 2016-12-01
and 2017-03-26. #tcot is a hashtag for Top Conservatives on Twitter which
was found to be a hashtag that was used in right wing responses to the
Ferguson uprising on Twitter.
- Size
- 3.2GB
- License
- UMD Only
8016E9D2-A121-4862-8461-69D558AE035F
- Contact-Name
- Ed Summers
- Contact-Email
- edsu@umd.edu
- Bagging-Date
- 2018-04-18
- External-Description
- This bag includes datasets that were created with the docnow prototype that
ran during 2016-2017 at http://app.docnow.io. These datasets were created
by Bergis Jules (co-pi) and shared via the DocNow Catalog application as
tweet identifier datasets. The underlying JSON data was transferred from UMD's
Amazon Web Services EC2 instance where the prototype application was running
to Washington University in St Louis in April, 2018 after the conclusion of
the initial phase of work on the Documenting the Now grant from the Mellon
Foundation.
Each dataset is listed below with the search query that was used,
the time it was created, the number of tweets, and the path to the file.
The data was collected from the Twitter Search API which provided access
to the last 10 days of tweets from the time of the search.
query: #blktwitterstorians
file: data/201701050445-517854.json.gz
created: 2017-01-05 04:45:36
tweets: 371
query: #BLMKidnapping
file: data/201701050534-c5bf15.json.gz
created: 2017-01-05 05:34:01
tweets: 136990
query: #SaveACA
file: data/201701132241-97612e.json.gz
created: 2017-01-13 22:41:36
tweets: 137012
query: #blackwomenatwork
file: data/201703291343-c266c9.json.gz
created: 2017-03-29 13:43:02
tweets: 140000
query: #charlottesville
file: data/201708121111-124fac.json.gz
created: 2017-08-12 11:11:51
tweets: 100000
query: #BlackTheory
file: data/201709200025-e4ce49.json.gz
created: 2017-09-20 00:25:03
tweets: 1430
query: #blackdigarchive
file: data/201712130923-3c8c3b.json.gz
created: 2017-12-13 09:23:06
tweets: 1775
query: #GifHistory
file: data/201802120320-715464.json.gz
created: 2018-02-12 03:20:30
tweets: 31929
- Size
- 362.2MB
80B7E5EC-A53D-4F2D-8599-9038A84F61DA
- Bagging-Date
- 2016-01-24
- External-Description
- Tweets mentioning the hashtag #BlackOnCampus collected between 2015-11-08 and
2015-11-25. Data collection started on 2015-11-12 when jobs
to collect from the search and streaming APIs were started.
- Size
- 107.0MB
- License
- UMD only
82DE438D-A361-4BBA-843F-0DB40EEFBB23
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-05-27
- External-Description
- This bag contains data from the Foreign Languages in America project that
was collected by Peter Mallios of the University of Maryland and given
to MITH in May of 2015. The original data consisted of two Box folders
which contained TIFF, JPEG, DJVU, OCR and Excel files for image scans that
were collected by the FLA team. This data was normalized into a single
directory structure for use in a static Jekyll website. The software for
doing the normalization is available at:
https://github.com/umd-mith/fla-processing
- Size
- 119.3GB
831F8CD-1F7B-42A5-BB10-904FAD15204A
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-04-13
- External-Description
- This bag contains 846,602 tweets mentioning the hashtag #walterscott from
2015-04-01 05:36:27 to 2015-04-13 17:47:09 (UTC). On April
7th, 2015 police officer Michael Slager was arrested and charged with the
first degree murder of Walter Scott. A twarc process was then started to
collect tweets using the hashtag, and another process was started to get
as many existing #walterscott tweets as possible from the preceding week.
- Size
- 549.0MB
- License
- UMD only
87674858-96AB-45F2-90CE-E712F443A658
- Bagging-Date
- 2016-01-24
- External-Description
- Tweets mentioning #MizzouHungerStrike and #ConcernedStudent1950 between
2015-11-01 and 2015-11-24. These were two hashtags used during the
2015 University of Missouri protests related to race, workplace benefits
and leadership that resulted in resignations of the president of the
University and the chancellor of the Columbia campus.
- Size
- 198.9MB
- License
- UMD only
8DBEB7E3-72E4-4F0A-80A3-3586D63EEA42
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2020-10-07
- External-Description
- This bag contains backups of resources related to the multisite Wordpress
instance that ran at mith.umd.edu, and was decommissioned in 2020.
Included in this bag are the Wordpress export (export.xml.gz), the Wordpress
server side files (mith.umd.edu.tar.gz), a database snapshot
(mithpressdelta.sql.gz) and the results of running a wget mirroring
operation on the site with WARC generation while it was live (static.tar.gz
and mith.umd.edu.warc.gz). The multisite Wordpress website was used to
manage the mith.umd.edu website and also the aadhum.umd.edu, dsah.umd.edu
and guide.dhcuration.org websites.
- Size
- 4.0GB
9065fa97-8ac3-4b11-9703-7cff623c560a
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-02-04
- External-Description
- This bag contains 6,342,294 tweets collected between October 12 and November
30, 2014 related to Poroshenko and Putin. They were collected by Ed Summers
and Tatyana Lockot for a series of articles being written for Global Voices.
The tweets were collected from the Twitter filter stream API using twarc,
which was configured to retrieve tweets with any of these keywords:
Putin, Poroshenko, Путин, Порошенко and Путін.
- Size
- 2.7GB
- License
- UMD only
94404079-594A-41EF-9A14-4266CC97FFC1
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-10-17
- External-Description
- This bag contains tweets that were collected between September 19, 2017 and
October 5, 2017 that mentioned #CatalanReferendum, #CatalonianReferendum,
#Catalonia, #1oct, #1o or #votarem. These were hashtags used in the lead up
to the Catalan Independence Referendum on October 1, 2017. The referendum was
declared illegal under Spanish law, and the Spanish police were ordered to
prevent it. The hashtags were selected after monitoring the
#CatalanReferendum hashtag for several hours on September 28 to determine
what the top hashtags were. The tweets themselves were collected from the
Twitter Search API using twarc and its twarc-archive utility. twarc-archive
was run every hour to collect the tweets that occurred since the last run.
The data collection was a collaboration with Vicenç Ruiz Gómez and Aniol
Maria of the Society of Catalan Archivists working in conjunction with Ed
Summers of MITH.
- Size
- 6.5GB
- License
- UMD Only
95DDB7B2-E88F-4BB3-AAC5-8CFAD1076AA5
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2018-06-03
- External-Description
- 4,058,754 tweets collected from the streaming and search APIs using the
keyword "gaza" covering the period 2018-05-08 to 2018-05-19. The
stream and search data collection was started on 2018-05-16. This time
period included the opening of the US Embassy in Jerusalem on May 14th.
On the same day Israeli forces killed over 60 Palestinians, and injured
2,700 who were part of a non-violent protest in the Gaza Strip.
https://www.democracynow.org/2018/5/24/after_latest_gaza_slaughter_open_an
- Size
- 3.4GB
9707DA4B-6EE8-4ECE-8B24-F8604E8C6A4F
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-11-15
- External-Description
- 2.1 million tweets that used the #NoDAPL or #StandWithStandingRock hashtags
over the period of Oct 18 - Nov 7.
- Size
- 1.9GB
- License
- UMD Only
98EFC987-7C1A-4615-8CD6-836174C6DAF3
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2020-01-19
- External-Description
- The 1619 Project was developed by The New York Times Magazine in 2019 with
the goal of re-examining the legacy of slavery in the United States and timed
for the 400th anniversary of the arrival in America of the first enslaved
people from West Africa. It is an interactive project by Nikole Hannah-Jones,
a reporter for The New York Times, with contributions by the paper's writers,
including essays, poems, short fiction, and a photo essay. Originally
conceived of as a special issue for August 20, 2019, it was soon turned into
a full-fledged project, including a special broadsheet section in the
newspaper, live events, and a multi-episode podcast series. (from Wikipedia)
This bag contains metadata for tweets related to the hashtag #1619project.
They were collected on January 8, 2020 using twint and the keyword
1619project.
twint -s '1619project' --csv --output twint.csv
twint scrapes Twitter's search results and writes the results as CSV. The
tweet identifiers were extracted from this CSV and included as the ids.txt
file. On January 19, 2020 the ids.txt was hydrated as tweets.jsonl using the
twarc tool. This explains the discrepancy of 443 tweets that were deleted
between January 8 and January 19.
- Size
- 248.7MB
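The id extraction step described in bag 98EFC987-7C1A-4615-8CD6-836174C6DAF3 (twint CSV to ids.txt) might look roughly like the sketch below. It is not the script that was actually used: the column name "id" is an assumption about twint's CSV header, and the sample rows are made up for illustration.

```python
import csv
import io


def extract_ids(csv_file, column="id"):
    """Return unique tweet ids from a CSV file object, in first-seen order.

    Assumption: the CSV has a header row and the ids live in `column`.
    """
    seen = set()
    ids = []
    for row in csv.DictReader(csv_file):
        tweet_id = row[column].strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


# Made-up sample standing in for twint.csv (hypothetical rows).
sample = io.StringIO(
    "id,date,tweet\n"
    "111,2020-01-08,first\n"
    "222,2020-01-08,second\n"
    "111,2020-01-08,first\n"
)
ids = extract_ids(sample)
print("\n".join(ids))
```

For a real file, the same function could be called with `open("twint.csv", newline="", encoding="utf-8")` and the result written one id per line to ids.txt.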
9BA6F68E-808E-4D64-A39F-558C4CD92072
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-06-12
- External-Description
- wget capture of http://mith.umd.edu/dccresearch/ created by Ed Summers on
2017-06-12. The payload files include the Wordpress code and MySQL database
dump as well as a mirror of the website as it existed with the search
turned off, and a WARC file.
- Size
- 45.8MB
A0DADAB6-8C1C-4B6E-A19E-04B5DE839258
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2020-10-14
- External-Description
- wget capture of https://guide.dhcuration.org/ created by ed on 2020-10-14
This bag contains a wget mirror of the guide.dhcuration.org website that was
created on October 14, 2020. The site is being moved to the
archive.mith.umd.edu website as part of the dismantling of the multisite
Wordpress server at mith.umd.edu. For access to the Wordpress server side files
and database please see bag 8DBEB7E3-72E4-4F0A-80A3-3586D63EEA42. The static
site was generated using the following wget command in order to capture and
rewrite links to other domains:
wget --output-file wget.log --warc-file dhdc --mirror --page-requisites
--span-hosts --html-extension --convert-links --execute robots=off --no-parent
--domains guide.dhcuration.org,mith.umd.edu,humanitiesdatacurationguide.wordpress.com
https://guide.dhcuration.org/
- Size
- 2.1MB
A325B271-260A-4E4C-A8A5-49A88F37BA42
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2018-12-02
- External-Description
- 27,954,936 tweets collected from Twitter's streaming API with the following
query: twarc filter 'female,rage,woman,anger,femalerage,angry,women,feminist'
The collection ran from November 17 to December 2, 2018. The data was collected for
Brittany Starr who is a student in the UMD English Department.
- Size
- 23.0GB
- License
- UMD Only
A36E23C8-45E3-4ECD-8D8E-610CEDF60441
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-08-14
- External-Description
- This bag contains the Omeka installation and database snapshot for the
lakeland.umd.edu website that was obtained from the jarvis.umd.edu Glue
host at UMD. The files were copied from
/afs/glue.umd.edu/department/oit/aett/otal/server/omeka/lakeland using
rsync. The resulting directory was then archived and compressed with tar
as lakeland.tar.gz. The Omeka database was configured to talk to a MySQL
database named omkealakeland. This database was exported with mysqldump and
compressed as omkealakeland.sql.gz.
The files were obtained as part of a collaboration with Mary Sies of American
Studies to put the Lakeland Digital Archive on a firmer footing. Due to staff
turnover system level access to jarvis had been lost by Mary's team over the
years. Thanks to input from former UMD employee Jill Reese and the help of
UMD DIT I was able to get ssh access and locate the files and database. See
this issue ticket for more context: https://umd.service-now.com/itsc?sys_id=7a03b7fe0f94070c7f232ca8b1050e3f&view=sp&id=ticket&table=incident
- Size
- 4.9GB
A43EB791-1B69-413E-BD61-58F79BD9C4CE
- Bagging-Date
- 2018-02-20
- External-Description
- This is a snapshot of MITH's Digital Dialogue Storify tweets that was
collected on February 20, 2018 using the storified tool:
https://github.com/docnow/storified Included in each story are the HTML,
JSON, and XML exports that were made available by Storify before they
shuttered the service. The original index.html was renamed to
index-original.html and a new index.html was written with relative links
to the images that were downloaded.
- Size
- 63.6MB
A5752D17-3670-47BE-AE7B-08E0D3BE7A28
- Bagging-Date
- 2016-01-24
- External-Description
- 3,719,967 tweets mentioning "bowie" between 2016-01-11 and 2016-01-15.
Bowie died on January 10, 2016. Data collection started on January 12
when collection from both the streaming API and the search API was
started. The separate search files are the result
of Internal Server errors from Twitter's search API which resulted in
a search needing to be started again.
- Size
- 2.2GB
- License
- UMD only
A9933BF8-CBB3-41EC-B2A8-C7ABDE481B6A
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2018-08-02
- External-Description
- This bag contains a hydrated version of the Twitter Event Datasets
(2012-2016) created by Arkaitz Zubiaga. The original dataset
contains 147,055,035 tweet ids from 30 different events that
are split into separate files. More about the dataset can be
read at https://doi.org/10.6084/m9.figshare.5100460.v2
The tweet ids were hydrated between June 15 and July 2, 2018
using the twarc utility. Only 86,062,113 tweets were hydrated
which is a 41.5% deletion rate. The payload directory includes
a hydrate.sh script that was used to do the hydration, as well
as the README that was distributed with the original dataset.
Finally the wayback directory contains a script to examine
links to webarchives in the hydrated JSON data.
- Size
- 59.5GB
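The deletion rate quoted for bag A9933BF8-CBB3-41EC-B2A8-C7ABDE481B6A can be rechecked from the two counts in its description. The sketch below is a plain arithmetic check, not part of the bag's hydrate.sh script, and the function name is hypothetical.

```python
def deletion_rate(original_ids: int, hydrated: int) -> float:
    """Return the percentage of tweet ids that could not be hydrated."""
    if original_ids <= 0:
        raise ValueError("original_ids must be positive")
    if not 0 <= hydrated <= original_ids:
        raise ValueError("hydrated must be between 0 and original_ids")
    return 100.0 * (original_ids - hydrated) / original_ids


# Counts taken from the bag description above.
rate = deletion_rate(147_055_035, 86_062_113)
print(f"{rate:.1f}% of the tweet ids were not hydrated")
```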
A9AE8E15-34AA-45AD-878A-50E5AB745F71
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-06-26
- External-Description
- This bag contains a backup of the Wordpress site and database for the
Humanities Intensive Learning & Teaching website. The site moved
from mith.umd.edu/training/ to www.dhtraining.org where it is actively
maintained. This backup was created as part of a cleanup of MITH's main
Wordpress host.
- Size
- 10.3MB
AE0A86DE-E17D-438E-BCDF-AA1F04851CAF
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-02-04
- External-Description
- This bag contains tweets that were hydrated from the Beyond the Hashtags
research study conducted by Deen Freelon, Charlton D. McIlwain, and Meredith D.
Clark in February of 2016. Their report, which includes details about how this
dataset was assembled, is included as a PDF, and more information about the
dataset can also be found at
http://cmsimpact.org/resource/beyond-hashtags-ferguson-blacklivesmatter-online-struggle-offline-justice/
In January 2017 Freelon released the 40,815,975 tweet ids for the dataset.
http://dfreelon.org/2017/01/03/beyond-the-hashtags-twitter-data/ They were
hydrated over the course of a few weeks afterwards using twarc. 34,264,560
(83%) of the original tweets were hydrated.
- Size
- 22.7GB
- License
- UMD Only
AF330002-664C-4321-98D2-E753BE8DD025
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2014-12-10
- External-Description
- This Twitter data was collected as part of a partnership between
CivicLab, Harvard University and MITH. It represents 15,080,078 tweets that
mention "ferguson" for the period between Nov 11 - Dec 8. The twarc utility
was used to collect the tweets from the Twitter stream API. Involved
individuals included: Greg Coleman, Kim Lamke, Molly Lloyd and
Benjamin Sugar.
- Size
- 6.6GB
- License
- UMD only
B09C1434-FFF9-4C73-B6A3-72FF63036A69
- Bagging-Date
- 2016-02-28
- External-Description
- wget capture of http://mith.umd.edu/topicmodeling/ created by ed on 2016-02-28
mith.umd.edu-topicmodeling.tar.gz - wordpress snapshot
topicmodeling.sql.gz - wordpress database
mith.umd.edu.tar.gz - wordpress wget mirror
topicmodeling.warc.gz - warc file from the crawl
- Size
- 25.9MB
B436DDA9-FFC1-4BD3-B358-55D56EB9334B
- Contact-Name
- Stephanie Sapienza
- Contact-Email
- sapienza@umd.edu
- Bagging-Date
- 2017-09-18
- External-Description
- This bag contains data dropped off by UMD faculty member Mary Sies for the
Lakeland Community Heritage Project. On August 16, 2017 she dropped off a set
of DVDs and a hard drive at MITH. Stephanie Sapienza spoke to Mary about what
was contained on the DVDs and the hard drives, and then copied them to an
external hard disk. The hard disk was then mirrored to Google Drive. In
addition they were tarred up and gzipped by Ed Summers to preserve timestamps
and compress the data. The goal was to allow the other copies to be manipulated
while keeping a copy of what was delivered to MITH for preservation purposes.
The two tarballs that are present in the data payload directory of the bag are
the result of that archiving. From Stephanie's conversation with Mary it is
believed that the media were used by students as they worked with content
collected in the community to eventually upload it to the Lakeland Omeka site.
A backup of that site is present in
s3://mith-bags/A36E23C8-45E3-4ECD-8D8E-610CEDF60441
- Size
- 120.3GB
BDD870A8-587B-419C-A8A7-44D8227ABA29
- Bagging-Date
- 2018-10-07
- External-Description
- This bag contains the Wordpress server side code and database for the Vintage
Computers Omeka site which lived at https://mith.umd.edu/vintage-computers.
The data payload also includes the output of mirroring the site with wget,
which generated a static website and WARC file. The static version of the
website was mounted in place of the live site. The site was archived because
it had a custom theme that required significant work to make it work with the
latest version of Omeka 2. The site was also no longer being actively
developed but still had value as an archive.
- Size
- 462.0MB
B9525BE0-FD7B-41C9-B3B1-F189CB2AD642
- Contact-Name
- Ed Summers
- Contact-Email
- edsu@umd.edu
- Bagging-Date
- 2018-07-13
- External-Description
- This bag contains a wget capture of https://www.digitalmishnah.org/ as
well as backups of the server side Wordpress code and database.
The Wordpress site was deemed no longer active and rather than folding
it into the MITH multisite Wordpress we decided to snapshot it and put
it in our static archive.
- Size
- 14.3MB
B9C8B188-5026-4965-9384-605E02FA55E5
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-08-15
- External-Description
- Sandra Bland was an African-American woman who was found dead in a jail cell
in Waller County, Texas, on July 13, 2015. This bag contains 3,805,452
tweets that were sent with the hashtag #SandraBland between July 15 and
August 8th. Data collection started on July 17 at 20:13:24 when search and
stream twarc jobs were started.
- Size
- 2.6GB
- License
- UMD Only
BB31CFDF-C413-4BB1-B9EC-3A4A68D274FC
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-08-14
- External-Description
- The Unite the Right rally was a gathering of far-right white nationalist
groups in Charlottesville, Virginia, United States, on August 11 and 12,
2017. Those assembled at the rally included members of white
supremacist, white nationalist, alt-right, neo-Confederate, neo-Nazi, and
militia movements. The participants were protesting against the removal of
Confederate monuments and memorials from public spaces, specifically the
Robert Edward Lee Sculpture in Emancipation Park.
Hundreds of protesters and counterprotesters were in attendance. There were
several violent clashes between protesters and counterprotesters. One
protester plowed a car into a crowd of counterprotesters, killing a woman and
injuring 19 other people, including five critically. At least 19 people
were injured in street brawls and other violence at the rally.
This bag contains 6,040,247 tweets mentioning 'charlottesville' collected with
the twarc utility. 5,382,975 were collected from Twitter's streaming API, and
657,272 from the search API. The collected tweets range in time from
2017-08-03 17:16:17 to 2017-08-13 23:24:26 GMT. Data collection began
2017-08-12 14:17:06 GMT. The log files for both processes are also included.
Since the keyword 'charlottesville' was trending for several days the
stream.log file contains information about how many tweets were undelivered.
- Size
- 5.4GB
- License
- UMD Only
BDD870A8-587B-419C-A8A7-44D8227ABA29
- Bagging-Date
- 2018-10-07
- External-Description
- This bag contains the Wordpress server side code and database for the Vintage
Computers Omeka site which lived at https://mith.umd.edu/vintage-computers.
The data payload also includes the output of mirroring the site with wget,
which generated a static website and WARC file. The static version of the
website was mounted in place of the live site. The site was archived because
it had a custom theme that required significant work to make it work with the
latest version of Omeka 2. The site was also no longer being actively
developed but still had value as an archive.
After deployment a problem was discovered in the JavaScript lightroom
library, which was not able to load the builder and effects library via the
link to scriptaculous.js. In addition the high resolution images in the
archive directory that are loaded by lightroom were missing. This problem was
fixed by rewriting the scriptaculous links in the items html, and the archive
files were copied from the backed up files into the static assets. After
this bit of surgery the zooming images worked as they did originally. This
bag was then updated with the latest content for the static site.
- Size
- 451.4MB
C2B9AC64-79E7-4EFD-8142-64CB0407E51E
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2018-06-12
- External-Description
- This bag contains assets related to the www.soweto76archive.org website
that was created by Angel Nieves and Gregory Lord while they were
working with MITH at the University of Maryland. In June 2018 the site
was archived because it was running a 9 year old version of Wordpress
that was not compatible with PHP7 which the rest of MITH was upgrading to.
The server side assets (PHP, media files) and MySQL database were archived
as wordpress.tar.gz and soweto.sql.gz. The site was then crawled with
wget to create a static site as well as a WARC file. The specific wget
command can be found in https://github.com/edsu/bagweb The resulting
static site was put in place of the Wordpress site, which was
decommissioned and sent to Nieves and Lord.
- Size
- 790.9MB
C328DC72-013E-4BB6-AE12-54F075739627
- Contact-Name
- Ed Summers
- Contact-Email
- edsu@umd.edu
- Bagging-Date
- 2018-09-05
- External-Description
- On August 16, 2018 Aretha Franklin died in Detroit, Michigan at the age of
76. Franklin, also known as the Queen of Soul, had an award winning
career as a singer, songwriter, actress and pianist while also being
described as the voice of the civil rights movement. This bag contains two
tweet datasets. The first was collected from the search API during the
response to the announcement of her death, which includes tweets from
August 8 - August 19 using the query '"Aretha Franklin" OR "Queen of Soul"'.
The second dataset was collected over August 24 to September 3, which
includes the date of her funeral on August 31. This second dataset was
collected using the query '"Aretha Franklin" OR "Queen of Soul" OR
ArethaHomegoing OR ArethaFranklinFuneral OR ArethaFranklin' which includes
hashtags that were trending at the time. The datasets contain 2,832,128
and 1,332,442 tweets respectively.
- Size
- 3.1GB
- License
- UMD Only
CA53EE17-91AD-41AC-936D-14C00AFE4EA9
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-06-12
- External-Description
- wget capture of http://mith.umd.edu/digitalstorytelling created by Ed Summers
on 2017-06-12. The payload files include the Wordpress code and MySQL database
dump as well as a mirror of the website as it existed with the search
turned off, and a WARC file.
- Size
- 483.4MB
CC08A512-2A39-4248-B9AD-B07557618837
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-11-21
- External-Description
- 987,938 tweets retrieved mentioning #PuertoRico over the period of October 4
to November 7, 2017. This was a period where there was increased concern
being expressed in social media about the response to the humanitarian crisis
caused by Hurricane Maria, which made landfall on September 20. Tweets
were collected from the streaming API and the search API. In both cases
tweets using #PuertoRico were collected.
- Size
- 740.4MB
- License
- UMD Only
CD36C7E9-2451-4EB8-BA81-46DC278DC66F
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-05-20
- External-Description
- This bag contains 2,195,394 tweets that mention #BaltimoreUprising
or #BaltimoreRiots between April 29 and May 14, 2015. They were
collected from both the Twitter search and streaming APIs.
This time period saw demonstrations and protests in Baltimore
using these two hashtags following the death of Freddie Gray on
April 19th, 2015 after his arrest on April 12th.
- Size
- 1.4GB
- License
- UMD Only
D30ABFEC-C35D-4849-8EA3-94ECE584E552
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-02-02
- External-Description
- These are files collected from the NEH Funded Projects database for the
Office of Digital Humanities
https://securegrants.neh.gov/publicquery/Faq.aspx
They were collected when it was announced that the Trump administration
wanted to defund the NEH
http://thehill.com/policy/finance/314991-trump-team-prepares-dramatic-cuts
MITH was able to obtain a spreadsheet (Muffin Files.xlsx) which contains
the identifiers for each funded project. A program (download.py) was written
that downloads the PDF and Excel file for each project and puts it in the
white-papers directory. In addition a file data.csv is included which is
the combination of all the Excel files as a CSV.
- Size
- 995.8MB
D547F886-D302-4747-B08A-188A645CBFEA
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-01-25
- External-Description
- This is a collection of tweets related to the 2016 US presidential election,
collected over the period of July 13, 2016 to November 10, 2016 by George
Washington University. GWU made the collection available as a tweet
identifier dataset, which was then hydrated at the University of Maryland
between December 1, 2016 and January 2, 2017 using the twarc
utility.
The original dataset contained 270,189,978 unique twitter identifiers of which
237,651,319 were hydrated (88%). This bag contains the ids that were used
for hydration in the ids directory, and the hydrated tweets in the tweets
directory. Each id file has a corresponding README that explains how the
dataset was created, including what keywords were used to create it.
More can be learned about the original dataset at:
http://hdl.handle.net/10.7910/DVN/PDI7IN
- Size
- 167.9GB
- License
- UMD Only
D5707486-0875-4DDF-B7B3-65D20CD4250C
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2019-02-08
- External-Description
- This dataset was created on February 7, 2019 to document the reaction to
Stacey Abrams's response to the Presidential State of the Union address
delivered on February 5, 2019 from 21:00 to 21:22 EST. There are
1,001,590 tweets from January 28, 2019 to February 7, 2019 which were
collected from Twitter's search API using twarc and the query:
'"Stacey Abrams" OR abramsaddress OR staceyabrams OR soturesponse'
- Size
- 682.0MB
- License
- UMD Only
D651C3F6-5619-4A42-A8BC-7C22B7A9A44A
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-05-20
- External-Description
- This bag contains 32,056 tweets that mention "ferguson" between
August 8 and August 10, 2014. They were collected on May 7th, 2015
using a script that collected Twitter identifiers from the search
form on Twitter's website:
https://github.com/edsu/twarc/blob/master/utils/discover_ids.py
The identifiers were then rehydrated using twarc's --hydrate option.
Important ramifications to be aware of are that the dataset does
not include tweets that were deleted before May 7th, 2015, and that
retweets are not included.
This dataset augments another tweet collection (mith-bag
fe28a093-d3f4-42d7-83ba-f5ba1b1cc765) which has a more complete snapshot
but is missing tweets just after the killing of Michael Brown on
August 9th.
- Size
- 12.2MB
- License
- UMD Only
D6C8ED6A-13A0-483E-950A-EE6089DFE463
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2020-10-14
- External-Description
- wget capture of https://aadhum.umd.edu/ created by ed on 2020-10-14
This bag contains a wget mirror of the African American History, Culture and
Digital Humanities (AADHum) Wordpress website at https://aadhum.umd.edu. It was
crawled in October, 2020 when the Wordpress website was transferred away from
MITH's AWS infrastructure and to the AADHum project themselves. The
administration of the aadhum.umd.edu domain was passed to Marisa Parham (AADHum
Director). The server side Wordpress files and database are part of the
Wordpress multisite set up that is saved in bag
8DBEB7E3-72E4-4F0A-80A3-3586D63EEA42. The mirror copy was created with wget but
the /events/ path needed to be excluded since it contained a calendar that
became a crawler trap. In addition wget needed to be instructed to collect and
rewrite resources at mith.umd.edu since the multisite setup used that host for
images and css. The wget command looked like this:
wget --directory-prefix aadhum --output-file wget.log --warc-file aadhum
--mirror --page-requisites --span-hosts --html-extension --convert-links
--execute robots=off --no-parent --exclude-directories example --level
3 --domains aadhum.umd.edu,mith.umd.edu https://aadhum.umd.edu
- Size
- 347.1MB
D7ACD500-FCAF-454E-8B3C-031CB4012145
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-05-14
- External-Description
- This bag contains tweets mentioning #CharlieHebdo, #JeSuisCharlie,
#JeSuisAhmed and #JeSuisJuif for the period of January 7th to 28th, 2015.
They were initially collected by Nick Ruest at York University, who
made the tweet ID datasets available http://hdl.handle.net/10864/10830
The data was rehydrated using twarc over the period of February 20th
to 24th, 2015. Significant portions of the original data were deleted
in the time between when they were tweeted and when they were rehydrated.
The original id lists are included along with the hydrated data, which
amount to 13,968,293 unique ids.
- Size
- 9.1GB
D8D28FB4-87DC-4DD9-AC82-6260B54AE684
- Bagging-Date
- 2017-04-22
- External-Description
- This bag contains 10,159,892 tweets and retweets sent by or to jk_rowling
between 2015-07-08 and 2017-03-18. The tweets were collected with Social Feed
Manager (m5_003). The directory in which SFM stored Twitter data was
archived and compressed as data/tweets.tar.gz.
You will notice on unarchiving that the archive includes many individual
tweet files organized into a directory tree by year, month, day, hour. Each
file usually contains 15 minutes of tweets. These files are usually gzip
compressed but you will notice that there are a few that are not. You will
also notice that a small number of files were not closed correctly, so you get
an error like "unexpected end of file" on reading the end of the file.
- Size
- 4.9GB
- License
- UMD Only
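The mixed state of the tweet files in the bag above (mostly gzip compressed, a
few not, a few truncated) can be handled when reading. The following is a
minimal sketch, not the method used to create the bag. It assumes the files
are line-oriented text, and the function name read_lines is hypothetical.

```python
import gzip
from pathlib import Path


def read_lines(path):
    """Yield text lines from a file that may be gzip, plain, or truncated gzip."""
    path = Path(path)
    # Sniff the two-byte gzip magic number rather than trusting the file name.
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    try:
        with opener(path, "rt", encoding="utf-8") as f:
            for line in f:
                yield line
    except EOFError:
        # A gzip stream that was not closed correctly ends early.
        # Keep the lines read so far and stop instead of failing.
        return
```

Lines read before the truncation point are still returned. A final partial
line in a truncated file is dropped, which is usually what is wanted since it
would not parse as JSON.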
DAD66855-2344-4261-8688-EADEB3A5EC25
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-06-12
- External-Description
- wget capture of http://mith.umd.edu/eng738T created by Ed Summers on
2017-06-12. The payload files include the Wordpress code and MySQL database
dump as well as a mirror of the website as it existed with the search
turned off, and a WARC file.
- Size
- 135.3MB
DB7C5ADA-D8AD-4D1B-A389-960EE5A11ADC
- Contact-Name
- Trevor Muñoz
- Contact-Email
- tmunoz@umd.edu
- Bagging-Date
- 2019-06-21
- External-Description
- This bag contains disk images from retired MITH server "zelda." Images were created using the BitCurator software.
- Size
- 205.1GB
- License
- UMD only
DEF510C5-A888-48D6-BAB2-D1A0040008C4
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-12-11
- External-Description
- On July 6, 2016, Philando Castile was fatally shot by Jeronimo Yanez, a St.
Anthony, Minnesota police officer, after being pulled over in Falcon Heights,
a suburb of St. Paul. Castile was driving a car with his girlfriend, Diamond
Reynolds, and her four-year-old daughter as passengers when he was pulled
over by Yanez and another officer.
This bag contains 2,950,803 tweets collected from the search and streaming
API for the hashtags #FalconHeightsShooting, #PhilandoCastile and
#DiamondReynolds between July 7 and September 9, 2016.
- Size
- 2.1GB
- License
- UMD Only
E0F66049-106D-472A-8B00-969E1C834993
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-06-12
- External-Description
- wget capture of http://mith.umd.edu/musical-theatre/ created by Ed Summers on
2017-06-12. The payload files include the Wordpress code and MySQL database
dump as well as a mirror of the website as it existed with the search
turned off, and a WARC file.
- Size
- 4.4MB
E4D77D13-3B44-4203-A795-3F950E45F40F
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2016-08-28
- External-Description
- These are tweets collected during the Documenting the Now meeting held
in St Louis on August 22-23. They all use the #docnowcommunity hashtag.
- Size
- 12.6MB
- License
- UMD Only
E630AB56-C721-4CC5-8663-E049854B7687
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-08-18
- External-Description
- The Unite the Right rally (also known as the Charlottesville rally) was a
protest in Charlottesville, Virginia, United States from August 11–12, 2017,
to oppose the removal of a statue of Robert E. Lee in Emancipation Park,
which itself was renamed from Lee Park two months earlier. Protesters
included white supremacists, white nationalists, neo-Confederates, neo-Nazis,
and militias. This bag contains 200,113 tweets collected with the
#unitetheright hashtag. Data collection was performed twice from the search
API using twarc: once at 2017-08-13 11:46:05 GMT and the other at 2017-08-15
12:03:48 GMT. The second search was run to collect only up to where the first
search left off. The time ranges for the tweets are from 2017-08-04 11:44:12
to 2017-08-15 16:03:30 GMT.
- Size
- 155.3MB
- License
- UMD Only
E6E7A45A-EE6A-4575-ACAE-BD322AE84F87
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-06-26
- External-Description
- This bag contains a backup of a Wordpress site and database for the
Project Bamboo website that ran at www.projectbamboo.org. It was no longer
active at the time of archiving since the DNS record had since been
pointed at a Google site. It was archived and removed from MITH's running
Wordpress site as part of a clean up project.
- Size
- 35.2MB
E7C141E1-6F3B-48EA-AEE7-57F8BFB06CC8
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-06-26
- External-Description
- This bag contains a Wordpress site backup and its respective MySQL database
for the www.bitcurator.org domain. The domain was no longer registered at
the time of the backup but a Wordpress instance was active at
www.bitcurator.net which is a domain managed by the University of North
Carolina. It was surmised that the site moved from .org to .net when the
project moved to UNC. The Wordpress site was archived as part of a clean
up of the main MITH Wordpress host.
- Size
- 221.3MB
E967BD40-FC32-477F-9E5E-92C61B22807A
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2015-05-20
- External-Description
- This is a dataset of tweets collected between April 15, 2015 and May 13, 2015
that mention the hashtag #FreddieGray. One portion of the tweets was collected
from Twitter's search API and the other set is from the streaming API.
Both sets were collected using the twarc tool. The total dataset includes
2,983,934 tweets. Freddie Gray was an African-American man who was arrested
by the Baltimore Police Department on April 12, 2015, and died on April 19,
2015 due to an injury to his spinal cord that was believed to be the result
of his treatment by the police.
- Size
- 1.8GB
- License
- UMD Only
F0A98C71-8BF6-42FC-B476-015E21A84CAD
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-05-17
- External-Description
- 782,509 tweets including the hashtag #macronleaks or #macrongate that were
collected between 2017-05-02 07:02:05 and 2017-05-10 16:14:51 UTC. The tweets
were collected from the Twitter Search API using twarc. The data does not
include the first use of the #macrongate hashtag, but it does include the
first use of the #macronleaks hashtag which went viral after Wikileaks
published it. More about the story of the #macronleaks hashtag can be found at:
http://www.newyorker.com/news/news-desk/the-far-right-american-nationalist-who-tweeted-macronleaks
- Size
- 580.7MB
- License
- UMD Only
F1EBD541-82F7-4DC9-A7CC-60C9DE94E8F2
- Contact-Name
- Trevor Muñoz
- Contact-Email
- tmunoz@umd.edu
- Bagging-Date
- 2019-06-12
- External-Description
- These were interviews conducted by MITH and collaborators with members of the
Lakeland community. The activities were conducted with support from a
Community Partnership Grant from the American Studies Association.
https://www.theasa.net/awards/grants/community-partnership-grants
The interviews are MP3 and SRT transcript files that were created by uploading
original recordings to the otter.ai service.
- Size
- 109.4MB
F4C3C4C6-EFE0-4712-BA93-B2948D3D66E3
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2018-02-27
- External-Description
- On February 27, 2018 the National Museum of African American History and
Culture hosted a Twitter chat with the Documenting the Now project.
Bergis Jules from the Documenting the Now team coordinated the project
responses and delivered them as the @documentnow user on Twitter. Other
people from the project and elsewhere responded. The event started at
9:30 AM EST and finished at 10:30 AM EST. Tweets with the designated
hashtag #ArchivesBlackHistory were collected using twarc at 11:30 AM on
February 27, 2018. It collected 1402 tweets, some of which were created
prior to the Twitter chat, since the hashtag had been used in other promotional
outreach by the NMAAHC.
- Size
- 9.8MB
FE0207E7-E21E-41F8-8A05-1F11BC68CFF8
- Bagging-Date
- 2015-10-19
- External-Description
- On Friday, June 5, 2015, at a pool party in McKinney, Texas, a police officer
was video-recorded restraining an unarmed African-American fifteen-year-old
girl on the ground. He later drew his handgun during the same incident.
This bag contains 180,000 tweets containing the hashtag #McKinney that were
collected between 20:15:53 and 23:46:26 on June 7, 2015. They were collected
by Bergis Jules at the University of California at Riverside in collaboration
with MITH.
- Size
- 124.1MB
- License
- UMD Only
FE3814C1-54A3-46BB-8093-3A90D81AF928
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2017-03-26
- External-Description
- This bag contains 2,711,011 tweets collected from the Twitter filter stream
between 2017-02-09 and 2017-03-18 that used any of the following hashtags:
alternativefacts, fakenews, truthiness, postfact, posttruth, factcheck.
They were collected as a research experiment for Damien Smith Pfister
in the Department of Communication.
- Size
- 2.1GB
- License
- UMD Only
fe28a093-d3f4-42d7-83ba-f5ba1b1cc765
- Contact-Name
- Ed Summers
- Contact-Email
- ehs@pobox.com
- Bagging-Date
- 2014-08-30
- External-Description
- A collection of 13,238,863 tweets mentioning 'ferguson' from
2014-08-10 22:44:43 to 2014-08-27 15:15:50. The tweets were collected
from the Twitter Search API using the twarc utility. They were subsequently
run through a deduplication process and also a URL unshortening process that
added the unshortened_url key to url entities in the original json data.
- Size
- 8.4GB
- License
- UMD only
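The deduplication step mentioned for the bag above can be illustrated with a
short sketch. This is not the script that was actually run. It assumes the
data is stored as one JSON tweet per line and that each tweet carries the
id_str field from Twitter's API; the function name dedupe_tweets is
hypothetical.

```python
import json


def dedupe_tweets(lines):
    """Yield each tweet (as a dict) once, keeping the first copy of every id_str."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        tweet_id = tweet["id_str"]  # assumed to be present on every tweet
        if tweet_id in seen:
            continue
        seen.add(tweet_id)
        yield tweet
```

The set of seen identifiers is held in memory, so for a collection of this
size (over 13 million tweets) it may be preferable to sort the data by
identifier first and drop adjacent duplicates.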