This is an informal guide outlining how I scaled the task of downloading data from social media sites. Social media companies have little incentive to make this process easy, instead opting to make public-interest research on their platforms difficult or impossible to conduct, so I’m writing up my notes in the hopes that it might be useful to one of the many researchers or journalists out there reinventing this particular wheel.
This certainly isn't the only approach, but it’s what I’ve used for the last few years to gather many tens of millions of posts from Twitter/X, TikTok, Facebook, and YouTube.
And, as with any project, it's always worthwhile to find out if there are existing tools you can use before you build something like this on your own. To this end, the Coalition For Independent Technology Research has been an invaluable resource for me for finding out about existing tools and datasets. That's how I found out about projects like Junkipedia and minet, which are worth checking out before diving into this guide—they might suit your use case!
This guide will probably be useful if you’re trying to do some kind of large-scale data collection and can figure out how to do it locally on a small scale, but run into issues doing it at the scale you need.
Specifically, this guide assumes you’re more or less comfortable with moving data around locally, and it’s the scaling and parallelization part that you feel lost about.
Throughout this guide, I use the Twitter Academic API as an example, which has since been closed. Don’t fret, though! Having collected data from Facebook, YouTube, and TikTok, I can confirm that minor variations on this approach will work for all sorts of data downloading tasks.
The approach itself involves using a cloud compute service to set up an HTTP endpoint that does a small piece of data collection each time it’s called, then orchestrating all of the different calls you need to make to this endpoint via a queue manager. If you just want a quick pointer explaining how this particular approach works at a high level, here’s a Medium post that explains it succinctly.
It should be noted that there are many, many ways to manage tasks like this as you scale, and this might not be how you would want to set things up if you were building long-term production data engineering infrastructure— this is only one approach I took as a research data engineer working alone!
Some common tools with a bit more overhead are Apache Airflow or Prefect, which can help orchestrate large and complex data pipelines and centralize things in one UI. If you’re really doing some heavy duty data engineering for the long haul, consider looking up other data workflow orchestration tools and see if those suit your use case.
The approach I’m outlining here was designed for the following situation:
This may still be useful, but you’ll need to modify this approach in cases where:
I’ll make notes throughout this guide for places I have made modifications as the above situations have cropped up.
Before I get into the details, huge disclaimer.
Don’t use this to take advantage of people’s data.
The goal of this guide is to support public-interest research that helps hold platforms accountable. The fact that data is technically publicly available does not mean that you should use it for whatever you want. Using social media data towards surveillance and profiteering is against the spirit of this guide!
Even when TOS aren’t violated, dynamics between researchers and data collection subjects are thorny. Like, what does it mean for a non-Black researcher to study Black Twitter, as an outsider? What right do people have for their data to be anonymized, if it’s “already public”? Do most people want their social media data used for research?
This kind of data collection task should only be implemented when there’s a good, documented, and well-scoped reason for it to be happening. Please be thoughtful!
Below is my order of operations when I’m trying to set up a system like this. Don’t worry about the details— it’ll all be explained as we go.
The purpose of this guide is to help scale data downloading, so we’re going to start by thinking about the big picture. First, we come up with a way to break our large data collection task into many smaller subtasks. Then, we set up all of our data collection subtasks to run independently and automatically, handling errors as they go, until we end up with a big pile of raw data. Finally, we process that raw data into a more manageable format and are left with a beautiful dataset!
Here’s the gist of the setup I’ll be outlining in this guide. Don’t worry too much about the technical details; it’ll all get explained further down.
I used the Google Cloud Platform (GCP) to do all of the cloud-hosted components, but Amazon and Microsoft (and probably other places) have comparable services*.
A) Subtask Queue: A queue that handles the execution of each subtask.
B) Subtask Executor: This is an HTTP function that gets invoked by the subtask queue. It’ll take a subtask, execute it, and write the resulting data out.
C) Data Source: This is where you’re getting data from.
D) Raw Data: This is the place where you’re writing raw data, in whatever format you get it.
E) Dataset: This is the final place that your organized data will live.
So, you have a large data collection task and you need to break it into many little data collection pieces. I’m referring to these as the “Big Task” and “subtasks”.
Ideally, you want subtasks that are independent and small: tasks shouldn’t depend on one another, and each should finish in a few minutes. This is for the following reasons:
Figuring out how to get tasks of the right size requires some trial-and-error, so expect to do lots of experimentation here to find the right subtasks for your situation.
Every case will look a little different here, so I’m going to go into some detail about different ways I’ve approached this.
Here’s an example:
If your Big Task is something along the lines of “Query every instance of ‘#dog’ via the Twitter API from 2018-2022”, the subtask breakdown could be pretty straightforward: you could define a subtask as querying Twitter for all of the ‘#dog’ tweets in a single 24-hour period. That would give you 1826 subtasks, one for each day between January 1, 2018 and December 31, 2022.
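For instance, here’s a minimal sketch of how you might generate those daily subtask parameters as plain dictionaries (the key names terms, output_bucket_name, start_date, and end_date are only illustrative; use whatever your subtask executor will expect):

from datetime import date, timedelta


def make_daily_subtasks(start: date, end: date, terms: list, output_bucket_name: str):
    """Yield one subtask parameter dict per day between start and end (inclusive)."""
    day = start
    while day <= end:
        yield {
            "terms": terms,
            "output_bucket_name": output_bucket_name,
            "start_date": day.isoformat(),
            "end_date": (day + timedelta(days=1)).isoformat(),
        }
        day += timedelta(days=1)


subtasks = list(
    make_daily_subtasks(date(2018, 1, 1), date(2022, 12, 31), ["#dog"], "dog-twitter-data")
)
print(len(subtasks))  # 1826 subtasks, one per day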
However, a breakdown this straightforward is not always possible. Some situations you might find yourself in:
Not every API will let you make queries that are specific enough that you’ll only get a handful of results at a time. That is, sometimes you’ll just get a million results all at once and have to scroll back through them. Luckily, when this happens, the results typically come back as a linked list of smaller chunks that you can iterate through one at a time, and if your job fails, you’ll still have a pointer to the last chunk you were reading from.
In this case, you can still kinda think of reading each chunk as a subtask. It’ll just be different because the subtasks are not independent, so you’ll have to use a different approach than the one I outline in this guide. In the appendix below, I give a brief example of what I’ve done in those cases.
If you can’t create queries such that you get a small amount of data at a time, and you have to restart scrolling through all of the data from the beginning every time you make a query, one option is the “repeat the same subtask” approach below.
In this case, the “same subtask” is just scrolling through the data as far as you can before you get cut off. You’d run it over and over and hope that one of your attempts is lucky and makes it through a lot of the data before crashing. This is also a situation where you’d want to avoid running things in a time-sensitive environment like a cloud function— just let it run as long as it will go and keep re-trying.
Perhaps there are ways to do a better job with that, but that’s outside of the scope of this guide.
If your subtask is something that never changes, like “get the latest posts from this account every day”, you can also use the approach I outline in this guide, with a minor change.
Tl;dr: Figure out how to get the data you want, then save a small amount of it locally somewhere.
Just get a Python script running on your computer that accesses the data you want. I think of this as the first version of my Subtask Executor that will ultimately end up running on the cloud.
In my case, this was successfully querying 10 tweets from the Twitter Academic API. At this stage, your code may simply be copy-pasted from some walkthrough or example you found on github. Your credentials key will be sitting as a local file somewhere or in your env. That’s fine! (We’ll address that later)
Save some sample data to a local file in whatever form makes sense (probably .json or .csv). You’ll want this later.
Tl;dr: Figure out where you want to dump your data, and figure out how to put it there.
I suggest dumping all of your raw data into some kind of blob storage. I use GCP Storage Buckets for this purpose.
Getting reading and writing to work here on any cloud service will likely involve some amount of fiddling with annoying permissions errors.
You can write to a storage bucket like this (for secret management, see the code here). I used these helper functions:
import json
from typing import Any, Dict

from google.cloud import storage
from google.oauth2 import service_account


def get_service_account_credentials() -> service_account.Credentials:
    """
    Retrieves service account credentials and returns a credentials object.

    Returns:
        service_account.Credentials: The service account credentials.
    """

    # We will replace this with reading credentials from a secret manager soon.

    credentials = service_account.Credentials.from_service_account_file(
        "credentials.json"
    )

    return credentials


def write_to_bucket(
    out_json: Dict[Any, Any],
    output_bucket_name: str,
    output_path: str,
    storage_client: storage.Client,
) -> None:
    """Writes output JSON data to a bucket.

    Args:
        out_json (JSON object): JSON data you wish to write to the bucket.
        output_bucket_name (str): The name of the output bucket.
        output_path (str): The path to write the data to within the output bucket.
        storage_client (storage.Client): The storage client to use for writing the data.
    """

    bucket = storage_client.bucket(output_bucket_name)

    data_blob = bucket.blob(output_path)

    with data_blob.open("w") as f:
        f.write(json.dumps(out_json))


credentials = get_service_account_credentials()

storage_client = storage.Client(credentials=credentials)
So I could write to buckets like this:
write_to_bucket(
    SOME_JSON_DATA,
    output_bucket_name=BUCKET_NAME,
    output_path=OUTPUT_PATH,
    storage_client=storage_client,
)
It’s a good idea to record as much data provenance as possible, and include redundancy wherever it’s reasonable to do so. Including basic metadata that answers questions like “what query did I make to retrieve this particular post?” and “when did I download this?” is not only helpful, but may also be necessary as a research standard.
It’s common for this data to end up somewhere in the file path, although I also like to save this data in at least one additional way. You can tack it on to the raw JSON, add some columns to your output CSV, save a whole separate file or table somewhere with metadata—the world is your oyster.
An example:
raw_tiktok_data > query_by_keyword > {tiktok_id} > [media.mp4, metadata.json]
That is:
dataset name > how you got this data > what unique piece of data this is > [the media itself, information about the run that produced this media]
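As a rough sketch (the field names here are hypothetical, not a fixed schema), you might build the output paths and the metadata record together so the two always stay in sync:

import datetime


def build_output_paths(tiktok_id: str, query_terms: list):
    """Hypothetical helper: build bucket paths plus a provenance record for one downloaded post."""
    prefix = f"raw_tiktok_data/query_by_keyword/{tiktok_id}"
    metadata = {
        "query_terms": query_terms,  # what query retrieved this post
        "downloaded_at": datetime.datetime.utcnow().isoformat(),  # when it was downloaded
        "media_path": f"{prefix}/media.mp4",  # where the media itself lives
    }
    return f"{prefix}/media.mp4", f"{prefix}/metadata.json", metadata

The metadata JSON can then be written out alongside the raw data with the same write_to_bucket helper from earlier.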
Don’t worry too much about your data being perfectly organized; it’s common for data to end up a little disorganized in the early stages of a research project. As long as you keep track of enough metadata, you’ll be able to do cleanup in the ingestion step at the end.
Tl;dr: Get the local subtask-executing code to run on the cloud. Also, this is the time to set up your secret manager.
Now it’s time to bundle up all of our code and set it to run on the cloud!
At this stage, my local code did the following:
To make this code into an HTTP function, I loosely followed this tutorial. I added a main.py file with an entry point function (one with the appropriate @functions_framework.http decorator that takes in a request argument). Using request.get_json(), you can access your incoming JSON arguments as a dictionary.
These JSON arguments are where subtask parameters are going to get passed via the task queue (which we haven’t set up yet).
So, now instead of having some local script with hard-coded parameters like this:
batch_query.query_tweets(
    query=["dog"],
    output_bucket_name="dog-twitter-data",
    start_date="2020-01-01",
    end_date="2020-01-02",
)
You’ll just be reading those parameters from the input JSON:
batch_query.query_tweets(
    query=request_json["terms"],
    output_bucket_name=request_json["output_bucket_name"],
    start_date=request_json["start_date"],
    end_date=request_json["end_date"],
)
Then, you can make the same call by sending your HTTP function this JSON via a POST request:
{"terms": ["dog"], "output_bucket_name": "dog-twitter-data", "start_date": "2020-01-01", "end_date": "2020-01-02"}
In GCP, you can test your function out under the ‘Testing’ tab by putting your JSON in ‘Configure Triggering Event’. You can also use the CLI to run a test.
Tip: To start out here, I would recommend having your entry point function print out the value of request.get_json() to make sure you can actually pass parameters to your function, like this:
import json

import functions_framework
from google.cloud import logging

logging_client = logging.Client()
logger = logging_client.logger("main_logs")


@functions_framework.http
def handle_http(request) -> str:
    """Handles incoming HTTP post requests from Cloud Scheduler.
    Parameters are expected in the JSON body of the request.

    Args:
        request: HTTP post request

    Returns:
        String indicating status of the completed request
    """

    request_json = request.get_json(silent=True)

    print("Received parameters: %s" % json.dumps(request_json, indent=2))

    return "Successfully processed request"
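If you prefer testing outside the cloud console, one option is to run the function locally with the functions-framework CLI (functions-framework --target=handle_http) and POST your JSON to it. Here’s a quick sketch, assuming you have the requests library installed and a test payload of your own:

import requests

# Hypothetical test payload; the keys should match what your entry point expects.
payload = {
    "terms": ["dog"],
    "output_bucket_name": "dog-twitter-data",
    "start_date": "2020-01-01",
    "end_date": "2020-01-02",
}

# functions-framework serves the function on localhost:8080 by default.
response = requests.post("http://localhost:8080", json=payload)
print(response.status_code, response.text)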
Once you’ve made sure you can actually invoke your HTTP function and read that JSON, try executing your actual subtask.
Once set up, my HTTP function did this:
Some other code notes to be aware of:
Secrets and permissions
If your code relies on any keys or tokens lying around in your env, set up a secret manager for them during this stage. In my case, I used the secret manager to hold my Twitter API Bearer Token and various service account credentials.
Tl;dr: Write a script that generates subtasks and adds them to a queue.
Each “subtask” is really just a JSON string that gets passed to your HTTP function via a POST request. But how do we actually make all of those POST requests to the function?
There are lots of ways! The way I did it here was using Google Cloud Tasks. It will make all of the POST requests to your HTTP function one at a time, and also do some handy bookkeeping for you: it’ll keep track of when your subtasks fail, automatically reschedule failed tasks, and manage how quickly and at what concurrency your tasks run.
Google Cloud Tasks aren’t the only way to do this, though; you can invoke your HTTP function any number of ways. There are also practical reasons not to use a task queue: for example, if you want to run the same subtask over and over, you wouldn’t want to re-enqueue the identical task a million times in a fancy queue like this—you can instead set up a scheduler that invokes the same subtask on a schedule. See this appendix note for more on that.
As for actually adding tasks to the queue, I wrote a separate python script here to generate subtasks and enqueue them, and ran this locally, since I only needed to enqueue each subtask once. This code generated subtask parameters, formatted them as JSON, and followed this starter code to add that JSON as a payload to the HTTP-invoking task.
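The heart of that enqueueing script can be pretty small. Here’s a sketch using the standard google-cloud-tasks client; PROJECT_ID, LOCATION, QUEUE_NAME, and FUNCTION_URL are placeholders, and if your function requires authenticated invocations you’ll also need to attach an OIDC token to each http_request:

import json

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path(PROJECT_ID, LOCATION, QUEUE_NAME)

for subtask in subtasks:  # e.g. the daily parameter dicts generated earlier
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": FUNCTION_URL,  # the trigger URL of your HTTP function
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(subtask).encode(),
        }
    }
    client.create_task(request={"parent": parent, "task": task})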
Tl;dr: Move all of your data from blob storage into something more structured, like a table.
Now you’ve theoretically gathered a bunch of data and it’s sitting in blob storage somewhere! Amazing!
Most of the time for me, the ingestion step is a Python script I run periodically that takes all of the raw data collected since ingestion was last run, formats it, and dumps it into a table. You could also run ingestion as a cron job, automatically ingesting data every day.
In my example, Tweets come in a very structured format, so I made a BigQuery table with the same schema, added a few fields for data provenance, and dumped tweets directly into it. More often than not, though, you’ll need to do a little more planning. What queries do you want to be able to make? How can you structure your data to make those queries easy to make?
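Here’s a rough sketch of what a simple ingestion pass can look like with BigQuery (BUCKET_NAME, the raw_twitter_data/ prefix, and TABLE_ID are placeholders, and the raw records are assumed to already match the table schema):

import json

from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()

rows = []
for blob in storage_client.list_blobs(BUCKET_NAME, prefix="raw_twitter_data/"):
    if not blob.name.endswith(".json"):
        continue
    record = json.loads(blob.download_as_text())
    record["source_blob"] = blob.name  # keep a pointer back to the raw file for provenance
    rows.append(record)

# Streaming insert; TABLE_ID looks like "project.dataset.table".
errors = bq_client.insert_rows_json(TABLE_ID, rows)
if errors:
    print("Some rows failed to insert:", errors)

This simplified version re-reads everything under the prefix; in practice you’d filter to blobs added since the last ingestion run, or move processed blobs to a separate prefix.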
I know you know this, but I’m saying it anyway: once you have things running smoothly, make sure you’ve written stuff down! Keeping notes as you go will save you an enormous amount of work and heartache, and it’s good to periodically check and make sure you’re keeping those up.
On that note, this concludes my notes on data downloading with cloud functions. I hope this has been helpful, and best of luck in your data collection adventures!
Many thanks to Megan Brown, Alex Hanna, Jason Greenfield, William Pietri, and Skyler Sinclair for their feedback in the development of this piece.
Example: I wrote the pagination token (which functions as the pointer in the linked list of results) for each chunk to a file, and each “task” was just iterating through more chunks. The gist was: “Read the last pagination token you saw from a .txt file. Scroll back as far as you can, and when you get stopped, write the last token you saw to a .txt file. If you hit the end of the results, indicate that in the .txt file, notify me, and end this job”. I have done this both via a cloud task with google cloud scheduler and as a plain old cron job running on a cloud vm.
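A minimal sketch of that checkpointing loop (fetch_page here is a stand-in for whatever API call returns a batch of results plus the next pagination token, and the file name is arbitrary):

TOKEN_FILE = "last_pagination_token.txt"  # checkpoint file; could also live in a bucket
DONE_MARKER = "DONE"


def read_checkpoint():
    """Return the last token we saw, or None if we're starting fresh."""
    try:
        with open(TOKEN_FILE) as f:
            return f.read().strip() or None
    except FileNotFoundError:
        return None


def run_scroll_task(fetch_page):
    """Scroll as far as possible, writing the last token seen after every chunk."""
    token = read_checkpoint()
    if token == DONE_MARKER:
        return  # a previous run already hit the end of the results
    while True:
        results, next_token = fetch_page(token)
        # ... write `results` out to raw storage here ...
        with open(TOKEN_FILE, "w") as f:
            f.write(next_token if next_token else DONE_MARKER)
        if next_token is None:
            break
        token = next_token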
If you’re running the same thing over and over— e.g. your subtasks are all identical, like “get the latest posts from this account today”—you can make a really easy substitution here. Instead of having a queue with a bunch of different tasks to execute, you can set up a single task in the Cloud Scheduler. It’ll just have the same JSON parameters every time, rather than a new set for every subtask. You can set it to run, e.g., every day at 1am...
...and have the scheduler make a POST request to the HTTP function with your parameters:
If you have any credentials that need managing, such as API keys, passwords, or bearer tokens, I highly recommend using a secret manager to keep track of them.
Secret managers keep important credentials in one place, and keep track of who has access to which credentials. (A kind of secret management system you may have encountered already: personal password managers, e.g. 1password, LastPass, or one built into your browser.) Cloud infrastructures often have their own secret management systems that allow you to access credentials easily.
Coding notes
I used Google Secret Manager here. It’s very simple but took me an irritatingly long time to debug so here are some screenshots:
On the cloud console side, I made a new secret for my Twitter bearer token…
…made sure my gservice account had access to it…
…and then made sure my gservice account had the “Secret Manager Secret Accessor” role.
Once that was set up, my code to access secrets looked roughly like this:
import json
from typing import Any, Dict, Union

from google.cloud import secretmanager


def read_secret(
    project_id: str, secret_name: str, is_json: bool = False
) -> Union[str, Dict[Any, Any]]:
    """
    Reads a secret from a secret manager.

    If is_json is true, will decode and return the secret as JSON.
    Otherwise, the secret will be interpreted as a string.

    Args:
        project_id (str): The project ID.
        secret_name (str): The name of the secret.
        is_json (bool, optional): Whether to interpret the secret as JSON. Defaults to False.

    Returns:
        The secret, either as a string or a JSON object.
    """

    secrets_client = secretmanager.SecretManagerServiceClient()

    request = {"name": f"projects/{project_id}/secrets/{secret_name}/versions/latest"}

    secret_response = secrets_client.access_secret_version(request)

    if is_json:
        output = json.loads(secret_response.payload.data.decode("UTF-8"))
    else:
        output = secret_response.payload.data.decode("UTF-8")

    return output
You can use it like this:
from google.oauth2 import service_account


def get_service_account_credentials() -> service_account.Credentials:
    """
    Retrieves service account credentials from a secret manager and returns a credentials object.

    Returns:
        service_account.Credentials: The service account credentials.
    """

    credentials = service_account.Credentials.from_service_account_info(
        read_secret(PROJECT_ID, secret_name=SERVICE_ACCOUNT_KEY_NAME, is_json=True),
        scopes=["https://www.googleapis.com/auth/cloud-platform"],
    )

    return credentials