
We constructed our own dataset, INSVIDEO, with 213,847 micro-videos and 6,786 users. The dataset is released here.

We detail the dataset construction process as follows. We first crawled micro-videos from Instagram. In particular, we manually chose hashtags from a hashtag dictionary website as our seed hashtags. The hashtags are organized into a four-layer hierarchical structure, with 16, 1,333, and 4,092 leaf nodes in the second, third, and fourth layers, respectively. We then searched for these hashtags on Instagram, collected at most the top nine posts for each hashtag, and regarded the users who published these posts as active users. For each active user, we crawled at most 50 of his/her published micro-videos together with their video descriptions. In this way, we harvested 334,826 micro-videos from 9,170 active users.

We further conducted data cleaning on micro-videos, hashtags, and users. For micro-videos, we removed the videos with no hashtags or with missing modalities (visual, acoustic, and textual). For hashtags, we conducted spell checking and word lemmatization, and then removed the hashtags occurring fewer than 50 times. For users, we removed the users with fewer than 10 micro-videos. After the data cleaning, we obtained a dataset of 213,847 micro-videos and 15,751 hashtags from 6,786 users, and each has 13.4 hashtags on average. Besides, the average duration of the micro-videos is 30 s. The statistics of INSVIDEO are summarized in Table 1.
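The three cleaning filters described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used to build INSVIDEO: the `MicroVideo` record, its field names, and the `clean` function are hypothetical, the filters are assumed to run once in the order they are listed, and spell checking and lemmatization are assumed to have been applied to the hashtags beforehand.

```python
from collections import Counter
from dataclasses import dataclass, field, replace
from typing import List


@dataclass
class MicroVideo:
    # Hypothetical record layout; the real dataset schema may differ.
    video_id: str
    user_id: str
    hashtags: List[str] = field(default_factory=list)
    has_visual: bool = True
    has_acoustic: bool = True
    has_text: bool = True


def clean(videos, min_hashtag_count=50, min_videos_per_user=10):
    """Apply the micro-video, hashtag, and user filters in that order."""
    # Micro-videos: drop those with no hashtags or a missing modality.
    kept = [
        v for v in videos
        if v.hashtags and v.has_visual and v.has_acoustic and v.has_text
    ]

    # Hashtags: drop those occurring fewer than min_hashtag_count times.
    counts = Counter(tag for v in kept for tag in v.hashtags)
    frequent = {tag for tag, n in counts.items() if n >= min_hashtag_count}
    kept = [
        replace(v, hashtags=[t for t in v.hashtags if t in frequent])
        for v in kept
    ]

    # Users: drop those with fewer than min_videos_per_user micro-videos.
    per_user = Counter(v.user_id for v in kept)
    return [v for v in kept if per_user[v.user_id] >= min_videos_per_user]
```

The defaults mirror the thresholds stated above (50 occurrences per hashtag, 10 micro-videos per user). Whether videos left without any hashtag after the frequency filter are dropped again is not stated in the description, so this sketch leaves them in place.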


If you want more data (i.e., downloaded micro-videos, covers, and their features), please visit the website below.
