By Sorsa Editorial

Published June 13, 2026. Reflects the current X Developer Policy on content redistribution and the official X API pay-per-use rates, verified June 2026.

Key Takeaway: Building a Twitter (X) dataset for machine learning means collecting tweets from an API, cleaning and deduplicating them, labeling for the task, splitting into train, validation, and test sets, and exporting to a format like JSONL. Since the free API and open scrapers stopped working, fresh collection now runs through a paid data API.

That last point is the practical hurdle. With the free tier gone since 2023 and scrapers like snscrape and twint no longer working, a fresh, on-topic corpus has to come from a data API. Sorsa API, an alternative Twitter/X API, returns public tweets and profiles as plain JSON behind one header, with search, timeline, and bulk endpoints built for collection. Billing is flat: one call counts as one request whatever it returns, the author profile is included free, and a bulk endpoint rehydrates up to 100 tweet IDs per call, which is how you revive the ID-only datasets researchers publish. Plans start at $49 for 10,000 requests at a flat 20 requests per second, with no approval queue.

How do you build a Twitter dataset for machine learning?

Building a Twitter dataset runs through five stages: collect raw tweets from an API, clean and deduplicate them, label them for the task, split them into train, validation, and test sets, and export to a training format such as JSONL. The same pipeline serves both a classification dataset and an instruction-tuning set for a language model; only the labeling step changes.

Start from the objective, not the data. A sentiment classifier wants short tweets with a clear positive, negative, or neutral label. An instruction-tuned model wants prompt and response pairs. A domain-adaptation run wants a large pool of in-topic text with light labels or none. Knowing the target decides the query, the volume, and the schema before a single tweet is pulled.

The work splits cleanly. Collection and rehydration are an API problem, covered below with code. Labeling, normalization, and splitting are a data problem you handle locally in Python. The rest of this guide takes each in turn, with the 2026 constraints that break most older tutorials called out where they bite.

Should you build your own dataset or use a public one?

Use a public dataset when one already matches your task; build your own when topic, language, time window, or labels are specific to you. Public sets on Hugging Face and Kaggle are free and fast to load, but they are fixed in scope, often years old, frequently English-only, and many ship as tweet IDs that you must rehydrate before the text exists.

The well-known options are worth knowing first. Sentiment140 holds about 1.6 million tweets labeled for sentiment, and the TweetEval benchmark from Cardiff NLP bundles seven tweet classification tasks (irony, hate, offensive, stance, emoji, emotion, sentiment). For a generic English sentiment or emotion model, loading one of these beats collecting anything.

Build your own when the gap is real:

  • The topic is narrow or recent and no public set covers it.
  • You need a language or region the public sets skip.
  • You need a fresh time window, not tweets from 2017.
  • Your label scheme does not match any existing dataset.

For the sentiment case specifically, our guide to tweet sentiment classification in Python walks the modeling side once the data exists.

How to collect tweets for a dataset in 2026

Collect tweets through three routes: keyword and operator search for topic corpora, user timelines for per-account corpora, and Lists or Communities for curated sources. The free API tier closed in 2023 and scrapers such as snscrape and twint stopped working, so fresh collection at any volume now goes through a data API that returns search and timeline results directly.

A read-only API makes this a short loop: query an endpoint, read a page of tweets, follow the cursor, write each tweet to a file. The Sorsa search endpoint accepts standard Twitter advanced search operators like lang:, since:, until:, and -filter:retweets, returns about 20 tweets per page with the author profile attached, and paginates with next_cursor. Build the query visually first with the search query builder if the syntax is unfamiliar.

python
import json, time, requests

API_KEY = "YOUR_SORSA_API_KEY"
BASE = "https://api.sorsa.io/v3"
HEADERS = {"ApiKey": API_KEY, "Content-Type": "application/json"}

def collect(query, target=5000, max_retries=5):
    """Page through search results until `target` tweets are collected."""
    rows, cursor, retries = [], None, 0
    while len(rows) < target:
        body = {"query": query, "order": "latest"}
        if cursor:
            body["next_cursor"] = cursor
        r = requests.post(f"{BASE}/search-tweets", headers=HEADERS, json=body, timeout=30)
        if r.status_code == 429:
            # Rate limited: back off and retry the same page, but not forever.
            retries += 1
            if retries > max_retries:
                r.raise_for_status()
            time.sleep(2 ** retries)
            continue
        r.raise_for_status()
        retries = 0
        data = r.json()
        for t in data.get("tweets", []):
            # Keep only the fields the dataset schema needs.
            rows.append({
                "id": str(t["id"]),
                "text": t.get("full_text", ""),
                "lang": t.get("lang", ""),
                "created_at": t.get("created_at", ""),
                "username": (t.get("user") or {}).get("username", ""),
            })
        cursor = data.get("next_cursor")
        if not cursor:
            break  # no more pages for this query
    return rows[:target]  # the last page can overshoot the target

# Write one JSON object per line (JSONL).
with open("tweets.jsonl", "w", encoding="utf-8") as f:
    for row in collect('"your topic" lang:en -filter:retweets', target=5000):
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
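
The query string is where topic, language, and time window get pinned down, so it helps to assemble it from parts rather than by hand. The following is a minimal sketch that uses only the operators named above (lang:, since:, until:, -filter:retweets). The helper name build_query, its parameters, and the example topics are illustrative assumptions, not part of any API.

```python
def build_query(phrase, lang=None, since=None, until=None, exclude_retweets=True):
    """Assemble an advanced-search query string from its parts.

    `phrase` is wrapped in quotes so it is treated as an exact phrase.
    `since` and `until` are YYYY-MM-DD strings.
    """
    phrase = phrase.replace('"', "")  # avoid breaking the quoted phrase
    parts = [f'"{phrase}"']
    if lang:
        parts.append(f"lang:{lang}")
    if since:
        parts.append(f"since:{since}")
    if until:
        parts.append(f"until:{until}")
    if exclude_retweets:
        parts.append("-filter:retweets")
    return " ".join(parts)


# One query per topic (hypothetical topics), all sharing the same filters.
topics = ["rate cut", "earnings call"]
queries = {t: build_query(t, lang="en", since="2026-05-01", until="2026-06-01") for t in topics}
for topic, q in queries.items():
    print(topic, "->", q)
```

Each resulting string can be passed as the query argument of the collection loop, which keeps the filters identical across topics and makes the corpus easier to reproduce later.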

For per-account corpora, swap the search call for the user timeline endpoint with a username; for curated sources, pull from a List or a Community feed, covered in the Lists and Communities guide. To sample for balance, the mentions endpoint adds filters such as min_likes, since_date, and until_date, which let you pull, for example, only well-engaged tweets inside a date range.

Two efficiency notes. Because the author profile rides inside every tweet, you get the handle, follower count, and account age in the same response with no extra call. And when the data fits a batch shape, batch endpoints cut the request count: a single call takes up to 100 tweet IDs or 100 usernames, which is the main lever in reducing request volume. The full search endpoint reference and cursor pagination cover the parameters.

If you would rather not use a paid API at all, the remaining alternative is the older scraper route, but expect breakage; our breakdown of why open-source scrapers break and our overview of the current scraping approaches explain the trade-offs.

How do you rehydrate a tweet-ID-only dataset?

Rehydration means taking a list of tweet IDs and fetching the current content for each one. Public research datasets are distributed as IDs rather than full tweets because the platform's terms restrict redistributing tweet text, so the text only exists after you look the IDs up against an API. Expect to recover a subset: tweets that were deleted, or posted by suspended or now-private accounts, no longer return.

A bulk endpoint makes this cheap. The Sorsa bulk tweet endpoint accepts up to 100 IDs per call and returns the full tweet objects, so a 50,000 ID file is roughly 500 requests rather than 50,000 single lookups.

python
import requests

API_KEY = "YOUR_SORSA_API_KEY"
URL = "https://api.sorsa.io/v3/tweet-info-bulk"
HEADERS = {"ApiKey": API_KEY, "Content-Type": "application/json"}

def chunks(seq, n=100):
    """Yield successive slices of at most n items."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

# One tweet ID per line; blank lines are skipped.
with open("tweet_ids.txt", encoding="utf-8") as f:
    ids = [line.strip() for line in f if line.strip()]

rehydrated = []
for batch in chunks(ids, 100):
    # Up to 100 IDs per request; tweets that no longer exist do not come back.
    r = requests.post(URL, headers=HEADERS, json={"tweet_links": batch}, timeout=30)
    r.raise_for_status()
    rehydrated.extend(r.json().get("tweets", []))

print(f"Recovered {len(rehydrated)} of {len(ids)} tweets")

The recovery rate drops the older the ID list is, since more of the original tweets disappear every year. For the archive side of this, including where ID lists come from and how much an old event dataset retains, see our guide to searching and recovering older tweets.
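
To see exactly what was lost, compare the IDs you asked for with the IDs that came back. The following is a minimal sketch, assuming each returned tweet object carries an id field as in the collection script earlier; the helper name recovery_report and the sample IDs are illustrative.

```python
def recovery_report(requested_ids, rehydrated):
    """Compare requested tweet IDs with the tweets that came back.

    Assumes each tweet object has an "id" field (an assumption about the
    response shape). IDs are compared as strings.
    """
    # Deduplicate the request list while keeping its order.
    requested = list(dict.fromkeys(str(i) for i in requested_ids))
    found = {str(t["id"]) for t in rehydrated if "id" in t}
    missing = [i for i in requested if i not in found]
    recovered = len(requested) - len(missing)
    rate = recovered / len(requested) if requested else 0.0
    return {
        "requested": len(requested),
        "recovered": recovered,
        "missing_ids": missing,
        "recovery_rate": rate,
    }


# Tiny in-memory example: four IDs requested, two tweets returned.
report = recovery_report(["1", "2", "3", "4"], [{"id": "1"}, {"id": 3}])
print(report["recovered"], "of", report["requested"], "recovered")
print("missing:", report["missing_ids"])
```

Keeping the list of missing IDs alongside the dataset documents how much of the original corpus your copy actually covers.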

How to structure and label a dataset for training

Give each record a stable schema and export to JSONL, one object per line. A classification set needs the tweet ID, the text, the label, the language, and the timestamp; an instruction-tuning set needs an instruction, an optional input, and the target output. JSONL is the format the Hugging Face datasets library and most fine-tuning pipelines read directly.

A classification record and an instruction record look like this:

json
{"id": "1782368585664626774", "text": "the new build keeps crashing on launch", "label": "negative", "lang": "en", "created_at": "2026-05-01T10:30:00Z"}
json
{"instruction": "Classify the sentiment of this tweet.", "input": "the new build keeps crashing on launch", "output": "negative"}

Clean before you label. Deduplicate on the tweet ID, drop very short or empty tweets, and filter by the lang field so the set stays in your target language. Following the convention that Cardiff NLP, the team behind TweetEval, uses in its TweetTopic dataset, replace URLs with a {{URL}} token and non-verified usernames with {{USERNAME}}, so the model learns language patterns rather than specific links or handles.
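
The cleaning pass above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: it expects rows shaped like the collection output (id, text, lang), the function names and the 10-character minimum are arbitrary choices, and because those rows carry no verification flag it replaces every @handle rather than only non-verified ones.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# The lookbehind skips the "@" inside email-like strings such as a@b.com.
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def normalize(text):
    """Replace links and @handles with placeholder tokens, collapse whitespace."""
    text = URL_RE.sub("{{URL}}", text)
    text = MENTION_RE.sub("{{USERNAME}}", text)
    return " ".join(text.split())


def clean(rows, lang="en", min_chars=10):
    """Deduplicate on id, keep one language, drop very short tweets."""
    seen, out = set(), []
    for row in rows:
        if row["id"] in seen:
            continue  # duplicate tweet ID
        if row.get("lang") != lang:
            continue  # outside the target language
        text = normalize(row.get("text", ""))
        if len(text) < min_chars:
            continue  # empty or too short to be useful
        seen.add(row["id"])
        out.append({**row, "text": text})
    return out


sample = [
    {"id": "1", "text": "the new build keeps crashing https://t.co/abc @devteam", "lang": "en"},
    {"id": "1", "text": "the new build keeps crashing https://t.co/abc @devteam", "lang": "en"},
    {"id": "2", "text": "ok", "lang": "en"},
    {"id": "3", "text": "la nueva versión falla al iniciar", "lang": "es"},
]
for row in clean(sample):
    print(row["id"], row["text"])
```

Length is checked after normalization, so a tweet that consists only of a link and a handle is measured by its placeholder text, not by the original URL.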

For labels, three approaches scale. Hand-label a small gold set for evaluation. Pre-label the bulk with a small existing model, then have a human correct it, which is far faster than labeling from scratch. Or, for instruction tuning, write prompt and response pairs around the tweets. LLM fine-tuning guides consistently stress quality over raw volume: a few thousand clean, well-formed pairs usually beat a noisy set ten times the size.
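
Once a classification set is labeled, turning it into instruction-tuning pairs is mostly a reshaping step. The following is a minimal sketch, assuming rows that already have text and label fields and the three-way sentiment scheme used in this guide; the function name and the fixed instruction string are illustrative.

```python
import json

INSTRUCTION = "Classify the sentiment of this tweet."
LABELS = {"positive", "negative", "neutral"}  # label scheme assumed for this example


def to_instruction_record(row, instruction=INSTRUCTION):
    """Turn a labeled classification row into an instruction-tuning record."""
    label = row["label"]
    if label not in LABELS:
        # Fail loudly on typos or stray labels instead of training on them.
        raise ValueError(f"unexpected label: {label!r}")
    return {"instruction": instruction, "input": row["text"], "output": label}


row = {
    "id": "1782368585664626774",
    "text": "the new build keeps crashing on launch",
    "label": "negative",
}
print(json.dumps(to_instruction_record(row), ensure_ascii=False))
```

Validating labels at this point is a cheap way to catch annotation mistakes before they reach training.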

Finally, split before training so evaluation stays honest. A common split is 80 percent train, 10 percent validation, 10 percent test, shuffled with a fixed seed for reproducibility.

python
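import json, random

with open("tweets.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

random.seed(42)        # fixed seed so the shuffle is reproducible
random.shuffle(rows)
n = len(rows)
train_end, val_end = int(0.8 * n), int(0.9 * n)
train = rows[:train_end]         # 80 percent
val = rows[train_end:val_end]    # 10 percent
test = rows[val_end:]            # 10 percent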
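
A plain shuffle can leave a rare label underrepresented in the validation or test set. One optional variation, not something the split above does, is to split each label group separately so every set keeps roughly the same class balance. The following is a sketch under that assumption; the function name and the synthetic rows are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(rows, ratios=(0.8, 0.1, 0.1), seed=42, key="label"):
    """Split rows into train/val/test, applying the ratios within each label."""
    rng = random.Random(seed)  # local generator, fixed seed for reproducibility
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[key]].append(row)

    train, val, test = [], [], []
    for label in sorted(by_label):  # sorted so the order is deterministic
        group = by_label[label]
        rng.shuffle(group)
        a = int(ratios[0] * len(group))
        b = int((ratios[0] + ratios[1]) * len(group))
        train.extend(group[:a])
        val.extend(group[a:b])
        test.extend(group[b:])

    # Shuffle again so rows are not grouped by label inside each split.
    for part in (train, val, test):
        rng.shuffle(part)
    return train, val, test


# Synthetic example: 200 rows, one quarter of them negative.
rows = [{"id": str(i), "label": "negative" if i % 4 == 0 else "positive"} for i in range(200)]
train, val, test = stratified_split(rows)
print(len(train), len(val), len(test))
```

With very small label groups the integer rounding can leave a split empty for that label, so check the per-label counts before trusting the evaluation numbers.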

What about platform terms and ethics?

Collecting public tweets for your own model is one thing; redistributing the dataset is where the rules bite. X's Developer Policy states that if you share content with third parties, including downloadable datasets, you may distribute only post IDs and user IDs, not the full text, which is exactly why public research datasets ship as IDs.

The current terms set concrete limits. Per the X Developer Policy, you may not distribute more than 1,500,000 post IDs to any one entity within a 30 day period without written permission, while individuals acting on behalf of an academic institution for non-commercial research may share an unlimited number of IDs. Researchers using EU Digital Services Act access fall under separate provisions.

The practical pattern that keeps you inside the lines: keep the full text you collect for your own training, and if you publish a dataset, publish the tweet IDs and a rehydration script so others recreate the text themselves. The terms also prohibit tracking or monitoring sensitive groups and events, so steer collection away from those uses. This is general guidance, not legal advice, and the terms change, so check the current X Developer Agreement before you release anything.
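
The publish-IDs-only pattern can be sketched as a small export step. This is an illustration, not compliance tooling: the function names are hypothetical, and the max_ids guard simply mirrors the 1,500,000 figure quoted above as a reminder, without modeling the per-recipient 30 day window or the academic exception.

```python
def publishable_ids(rows, max_ids=1_500_000):
    """Return unique tweet IDs only, with no text or other fields."""
    ids = list(dict.fromkeys(str(r["id"]) for r in rows))  # dedupe, keep order
    if len(ids) > max_ids:
        raise ValueError(
            f"{len(ids)} IDs exceeds the {max_ids} guard; check the current terms before sharing"
        )
    return ids


def write_id_file(rows, path):
    """Write one tweet ID per line, the shape a rehydration script reads."""
    ids = publishable_ids(rows)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(ids) + "\n")
    return len(ids)


rows = [
    {"id": "101", "text": "kept locally, never published"},
    {"id": "102", "text": "kept locally, never published"},
    {"id": "101", "text": "duplicate row"},
]
print(publishable_ids(rows))
```

The output file has the same one-ID-per-line layout that the rehydration script in this guide reads, so recipients can rebuild the text on their side.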

How much does collecting a dataset cost?

Cost depends on volume and on whether the source bills per resource or per request. The official X API charges per resource fetched: each post read is $0.005 and each user profile read is $0.010, with a hard ceiling of 2 million post reads per month, figures confirmed in our X API pricing breakdown. A flat per-request model counts one call as one request and includes the author profile for free.

| Goal | Official X API (pay-per-use) | Sorsa (Pro, $199/mo) |
| --- | --- | --- |
| 100,000 tweets, author included | ~$500 to $1,500 (post plus user reads) | within plan, roughly 1,000 to 5,000 requests |
| Rehydrate 50,000 IDs | ~$250 to $750 | roughly 500 requests (100 per call) |
| Monthly read ceiling | 2,000,000 post reads, then blocked | request-based plan, no per-tweet meter |
| Auth | OAuth 2.0 bearer token | single ApiKey header |
| Write access | Yes | No (read-only) |

On read-heavy collection, a flat per-request bill lands roughly 30 to 50 times cheaper than per-resource pricing, and the 100,000 tweet pull above fits comfortably inside a single Pro month. For posting or sending messages the official API is the only option, and that is its territory; for building a dataset, which is pure reading, a read-only API is cheaper and simpler. Researchers can also check the discounted academic research access, and the same collection code ports to any stack, including a pure Python workflow.
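
The arithmetic behind the comparison is simple enough to rerun for your own volume. The following is a rough sketch using only the figures quoted in this guide ($0.005 per post read, $0.010 per user read, about 20 tweets per search page, 100 IDs per bulk call). The function names are illustrative, and the author cost assumes the worst case of one user read per tweet.

```python
import math

POST_READ_USD = 0.005         # per post read, rate quoted in this guide
USER_READ_USD = 0.010         # per user profile read, rate quoted in this guide
TWEETS_PER_SEARCH_PAGE = 20   # "about 20 tweets per page"
IDS_PER_BULK_CALL = 100       # bulk endpoint limit


def per_resource_cost(tweets, with_authors=True):
    """Estimated USD cost when every post (and optionally author) is billed."""
    cost = tweets * POST_READ_USD
    if with_authors:
        cost += tweets * USER_READ_USD  # worst case: one user read per tweet
    return round(cost, 2)


def search_requests(tweets, per_page=TWEETS_PER_SEARCH_PAGE):
    """Requests needed to page through a search at per_page tweets each."""
    return math.ceil(tweets / per_page)


def bulk_requests(ids, per_call=IDS_PER_BULK_CALL):
    """Requests needed to rehydrate ids at per_call IDs each."""
    return math.ceil(ids / per_call)


print(per_resource_cost(100_000, with_authors=False))  # 500.0
print(per_resource_cost(100_000))                       # 1500.0
print(search_requests(100_000))                         # 5000
print(bulk_requests(50_000))                            # 500
```

These are the same numbers as the table rows above; swap in your own tweet count to see where a pull lands before committing to a plan.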

A real dataset build

A small AI team building a finance-sentiment model arrived at this approach after pricing the official route. They needed roughly 200,000 recent English tweets about a basket of tickers, labeled positive, negative, or neutral. On per-resource billing, the post reads alone ran toward four figures before author profiles were counted, and the 2 million monthly read cap loomed once they planned refreshes. Moving collection to a flat per-request API cut that data cost by a factor of roughly 30 to 50, the expected result for any read-heavy pull, since the author profile came free and search returned about 20 tweets per request.

The build stayed small. A keyword query per ticker with a language filter fed a JSONL file; a small classifier pre-labeled the rows and an analyst corrected a sample; URLs and handles were normalized to tokens; and the set was split 80, 10, 10 with a fixed seed. When they later wanted to extend an older public event dataset, they rehydrated its ID list in batches of 100 and accepted that a slice of the original tweets had since been deleted. The lesson they kept was the ordering: define the label scheme and the query first, because re-collecting after a schema change is the expensive mistake, not the API bill.

Frequently asked questions

Can you still collect tweets for a dataset in 2026?

Yes, but not through the old tools. The free Twitter API tier closed in 2023, and scrapers such as snscrape and twint no longer work against the current platform. Fresh collection at any real volume now runs through a paid data API that returns search and timeline results, which you page through and write to a file. Public datasets remain an option when one already fits the task.

Where can you get a ready-made Twitter dataset?

Hugging Face and Kaggle host the main public tweet datasets. Sentiment140 holds about 1.6 million sentiment-labeled tweets, and the TweetEval benchmark covers seven classification tasks such as irony, hate, and emotion. Many academic datasets ship as tweet IDs rather than full text, so you rehydrate the IDs against an API before the tweet content exists in your copy.

How do you rehydrate tweet IDs?

Rehydration looks up the current content for a list of tweet IDs. A bulk endpoint makes it cheap: a read-only API such as Sorsa accepts up to 100 IDs per call and returns the full tweet objects, so a 50,000 ID file becomes about 500 requests. Expect to recover a subset, because tweets that were deleted or posted by suspended or private accounts no longer return.

How many tweets do you need to fine-tune a model?

It depends on the task, and quality matters more than volume. For a narrow classification or instruction-following task, a few thousand clean, correctly labeled examples often move the model meaningfully, while a noisy set ten times larger can do less. Define the objective, label a small high-quality set first, and grow it only if evaluation shows the model needs more coverage.

Are you allowed to collect and share tweets for a dataset?

Collecting public tweets for your own model is generally permitted, but redistribution is restricted. X's Developer Policy allows sharing only post IDs and user IDs in published datasets, not full tweet text, with a cap of 1,500,000 IDs per entity per 30 days and an exception for academic institutions. Publish IDs and a rehydration script rather than raw text, and check the current terms before release.

Do you need the official X API to build a dataset?

No. The official X API is required only for write actions such as posting. For collecting public tweets and profiles, a read-only alternative such as the Sorsa API works through a single ApiKey header and bills per request, so a large pull is not metered per tweet and the author profile is included free. Plans start at $49 for 10,000 requests.

Getting started

The quickest way to size a dataset is to run the query before writing a pipeline. Open the API playground, run a search with your operators, and read the JSON fields you will map into JSONL. When the query looks right, drop the collection script above into your project and point it at a file.

When you are ready to build, the quickstart guide covers the header auth and your first call, and the request-based plans start at $49 for 10,000 requests with a flat 20 requests per second and no approval queue.

Reviewed by Keksich, founder of Sorsa, marketer and X API researcher.

This guide draws on our team's hands-on work running the Sorsa API, on the live API and its documentation, and on public datasets and platform terms checked at the time of writing. The content-redistribution rules were read from the X Developer Policy, the public-dataset and normalization details from the TweetEval dataset on Hugging Face, and the official X API per-resource pricing was cross-checked against current public breakdowns and our own pricing teardown. More on who we are is on the about page. Verified June 13, 2026.