The paper is currently under review, and the dataset is under review by the ethics committee. The remainder of the dataset (images and videos) will be released upon acceptance.
Fig. 1 illustrates our mobile platform for comprehensive environmental data collection, focused on assessing urban sidewalks and pedestrian pathways. Built on a custom electric scooter, it carries an array of sensors and recording tools for in-depth analysis of pedestrian infrastructure.
Fig. 1: Our mobile multi-sensory capturing platform setup.
Key components include a high-precision GNSS/IMU unit for accurate location and motion data, a ZED2i stereo camera, and a 360° Ouster LiDAR with 64 beams for capturing detailed visual and spatial data. An NVIDIA AGX Orin module serves as the central processing unit, while a 4K monocular camera and a stereo audio recorder provide high-quality visual and sound data. These sensors are catalogued in Table 1. Fig. 2 illustrates the platform's coordinate system, providing visual context for the sensor setup.
Fig. 2: Configuration of the sensor installation (back, side, and top views).
The platform's design supports urban data analytics and micro-mobility research: it enables the study of pedestrian traffic patterns, aids the maintenance and improvement of city infrastructure, and aims to increase the efficiency and safety of mobile robotics and micro-mobility services in complex urban settings.
| Sensor | Details |
| --- | --- |
| ZED2i stereo camera | Dual 1/3" 4MP CMOS; FOV: 110° (H) × 70° (V) × 120° (D); range: 0.3 m to 20 m; gyroscope, accelerometer, magnetometer, barometer, temperature |
| Ouster LiDAR | Vertical resolution: 64 channels; horizontal resolution: 1024; FOV: 360°; range: 200 m |
| Osmo Action 4 | 4K (4096 × 3072) at 4:3; FOV: 155°; audio sampling: AAC, 48.0 kHz |
| Tascam DR-07X | Built-in cardioid condenser stereo pair; sampling: 96 kHz / 24-bit |
| HWT901B-RS232 IMU | Acceleration, gyroscope, magnetic field; angle range: X, Z ±180°, Y ±90°; data output frequency: 100 Hz |
| ZED-F9P GNSS | Receives both L1C/A and L2C bands; PVT (basic location over UBX binary protocol); frequency: 2 Hz |

Table 1: Sensor details of the platform.
A total of 91,411 pairs of stereo images were recorded using the ZED2i stereo camera, each pair precisely timestamped for in-depth temporal studies. By employing the ZED camera’s API, we extracted depth maps and point cloud information from these image pairs, as illustrated in Fig. 3. Out of the entire dataset, we carefully selected 950 images for manual labeling. In particular, we annotated only the RGB images from this subset, which provided the groundwork for identifying and segmenting pedestrian pathways and sidewalks. This targeted labeling effort, combined with the depth and point cloud information, forms the cornerstone for a comparative analysis with a ground truth dataset derived from LiDAR data.
Fig. 3: Three types of data format present in our stereo-camera dataset. From top RGB, depth map and point cloud.
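As a minimal sketch of this extraction step, assuming the ZED SDK's Python bindings (`pyzed`) and recordings stored as SVO files (the file path is hypothetical):

```python
import pyzed.sl as sl

# Open a recorded SVO file from the ZED2i (the path is hypothetical).
init = sl.InitParameters()
init.set_from_svo_file("recordings/session_01.svo")
init.depth_mode = sl.DEPTH_MODE.ULTRA
init.coordinate_units = sl.UNIT.METER

zed = sl.Camera()
if zed.open(init) != sl.ERROR_CODE.SUCCESS:
    raise RuntimeError("failed to open SVO recording")

left, depth, cloud = sl.Mat(), sl.Mat(), sl.Mat()
runtime = sl.RuntimeParameters()

while zed.grab(runtime) == sl.ERROR_CODE.SUCCESS:
    zed.retrieve_image(left, sl.VIEW.LEFT)           # RGB frame
    zed.retrieve_measure(depth, sl.MEASURE.DEPTH)    # depth map in metres
    zed.retrieve_measure(cloud, sl.MEASURE.XYZRGBA)  # coloured point cloud
    ts = zed.get_timestamp(sl.TIME_REFERENCE.IMAGE)  # per-frame timestamp
    # ... save left / depth / cloud keyed by ts ...

zed.close()
```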
A cumulative total of 15,497 LiDAR scans was collected, comprising point cloud data complemented by reflectivity and range channels, each meticulously time-stamped to ensure chronological synchronization. Owing to the substantial volume of LiDAR data, a methodical preprocessing step was adopted: 950 scans were selectively sampled from each data type (point cloud, reflectivity, and range) to keep computational tasks efficient. Fig. 4 illustrates the range and reflectivity information derived from the LiDAR scans.
Fig. 4: Range and Reflectivity channels extracted from the Ouster LiDAR.
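One simple way to realize such sampling is to keep evenly spaced scan indices across the recording so temporal coverage is preserved; a sketch under that assumption (the file layout is hypothetical):

```python
import numpy as np

TOTAL_SCANS = 15_497  # LiDAR scans in the full recording (from the text)
N_KEEP = 950          # scans retained per data type

# Evenly spaced indices across the recording preserve temporal coverage.
keep = np.linspace(0, TOTAL_SCANS - 1, N_KEEP).round().astype(int)

# Apply the same index set to every channel so the point-cloud,
# reflectivity and range scans stay aligned scan-for-scan.
for channel in ("point_cloud", "reflectivity", "range"):
    files = [f"{channel}/scan_{i:05d}.bin" for i in keep]  # hypothetical paths
    print(channel, len(files))
```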
The dataset provided contains an extensive array of unprocessed GNSS measurements that are crucial for high-accuracy navigation and geospatial analysis. It comprises data on pseudorange, carrier phase, and Doppler shift from various satellite systems, including GPS, GLONASS, Galileo, and BeiDou. Pseudorange data provides a preliminary calculation of the distance from the satellite to the receiver, whereas carrier phase data delivers a more precise distance measurement, which is vital for differential GNSS applications that demand superior precision. The Doppler data informs us about how quickly the distance between the satellite and receiver is changing, a metric useful for determining the receiver’s speed.
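For context, the textbook pseudorange observation model that these raw measurements feed into can be written as:

```latex
% Standard GNSS pseudorange observation model (textbook form,
% not specific to this dataset):
\rho = r + c\,(\delta t_r - \delta t^s) + I + T + \varepsilon
```

Here \rho is the measured pseudorange, r the geometric satellite-to-receiver range, c the speed of light, \delta t_r and \delta t^s the receiver and satellite clock offsets, I and T the ionospheric and tropospheric delays, and \varepsilon noise and multipath. The carrier-phase observable follows the same model plus an integer ambiguity term N\lambda, which is why it supports the higher-precision differential applications mentioned above.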
Each satellite is identified by a specific svId, and the dataset contains signal quality indicators such as the carrier-to-noise density ratio (C/N0), which reveals the strength and clarity of the signal, and the lock time, which shows how long the signal has been consistently tracked. The UBX-NAV-PVT message provides an accurate navigation output, including the receiver's position (latitude and longitude), velocity vectors, and time information, all crucial for instantaneous navigation and for collecting timestamped geospatial data. Additionally, the NMEA GNVTG message reports course and speed over ground, further enhancing the dataset's value for research on pedestrian dynamics and urban planning. This raw GNSS data is instrumental for accurately reconstructing pedestrian routes, thereby supporting sophisticated studies in autonomous navigation and location-based services. Fig. 5 illustrates the geographic extent and the path of our data collection effort.
Fig. 5: Georegistered trajectory points overlaid on OpenStreetMap.
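A minimal sketch of reading such a log, assuming the raw stream was saved to a file and using the third-party pyubx2 parser (not necessarily the tooling used for this dataset; the file name is hypothetical):

```python
from pyubx2 import UBXReader, UBX_PROTOCOL, NMEA_PROTOCOL

# Parse a mixed UBX/NMEA log from the ZED-F9P (the file name is hypothetical).
with open("gnss/session_01.ubx", "rb") as stream:
    ubr = UBXReader(stream, protfilter=UBX_PROTOCOL | NMEA_PROTOCOL)
    for raw, parsed in ubr:
        if parsed is None:
            continue
        if parsed.identity == "NAV-PVT":
            # pyubx2 applies the UBX scaling factors, so lat/lon are degrees.
            print(parsed.iTOW, parsed.lat, parsed.lon,
                  parsed.velN, parsed.velE, parsed.velD)
        elif parsed.identity == "GNVTG":
            # Course over ground (degrees, true) and speed over ground (km/h).
            print(parsed.cogt, parsed.sogk)
```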
Fig. 6 depicts how a complex set of physical measurements evolves over time. The high-resolution data covers a variety of inertial and environmental metrics, including three-dimensional acceleration, gyroscope readings, angular position, magnetometer measurements, and ambient temperature, all timestamped.
For the creation of navigation tools for mobile robots and micro-mobility solutions, the fusion of Inertial Measurement Unit (IMU) data is critical. This data, comprising accelerometer, gyroscope, and magnetometer readings, is particularly useful for identifying and characterizing key urban structures, such as bicycle paths and curb ramps, by capturing how pedestrians and micro-mobility vehicles use the available space. Gyroscope readings help maintain precise map alignment in ever-changing urban settings, while magnetometer readings provide essential orientation cues in locations with diverse magnetic characteristics. Integrating IMU data with exact time markers and GNSS coordinates greatly improves positional accuracy. Fusion methods such as the Kalman filter, the Extended Kalman Filter, and fuzzy logic systems further enhance the quality and utility of the navigation output. These improvements support a detailed representation of pedestrian movement and behavior, producing accurate, useful maps for mobile robotics and micro-mobility services.
Fig. 6: Plots for each set of measurements over time.
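As a minimal sketch of such fusion, the following one-dimensional linear Kalman filter combines 100 Hz accelerometer samples with 2 Hz GNSS position fixes; the noise constants are illustrative, and this is not the platform's actual pipeline:

```python
import numpy as np

class Kalman1D:
    """1-D constant-velocity Kalman filter: the IMU predicts, GNSS corrects."""

    def __init__(self, dt: float = 0.01):  # 0.01 s = 100 Hz IMU period
        self.F = np.array([[1.0, dt], [0.0, 1.0]])  # state transition
        self.B = np.array([[0.5 * dt**2], [dt]])    # acceleration input model
        self.H = np.array([[1.0, 0.0]])             # GNSS observes position only
        self.Q = 1e-3 * np.eye(2)                   # process noise (tuning value)
        self.R = np.array([[2.0]])                  # GNSS variance (illustrative)
        self.x = np.zeros((2, 1))                   # state: [position, velocity]
        self.P = np.eye(2)                          # state covariance

    def predict(self, accel: float) -> None:
        """Propagate the state with one accelerometer sample."""
        self.x = self.F @ self.x + self.B * accel
        self.P = self.F @ self.P @ self.F.T + self.Q

    def update(self, gnss_pos: float) -> None:
        """Correct the state with one GNSS position fix."""
        y = np.array([[gnss_pos]]) - self.H @ self.x  # innovation
        S = self.H @ self.P @ self.H.T + self.R       # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P

# Hypothetical usage: call kf.predict(a) per IMU sample and kf.update(p)
# per GNSS fix; kf.x then holds the fused position and velocity.
```

In practice, a full 2-D or 3-D Extended Kalman Filter over position, velocity, and attitude would replace this scalar example, but the predict/update structure is the same.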
A total of 45,957 4K images were captured using the Osmo Action 4 camera. Because no automated triggering mechanism synchronized the 4K images with the LiDAR and stereo camera data, we adopted a two-phase approach. First, we reduced the dataset to 5,000 images to make it manageable. We then built a custom interface to systematically sift through the timestamped and 4K image datasets, applying a quantitative method to match the most suitable 4K frame to each timestamped image. This technique allowed us to synchronize 950 images, ensuring alignment with the data from the other instruments.
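One quantitative rule for this kind of matching is nearest-neighbour pairing on timestamps; a sketch of that idea, with hypothetical array names:

```python
import numpy as np

def match_nearest(ref_ts: np.ndarray, cam_ts: np.ndarray) -> np.ndarray:
    """For each reference timestamp (LiDAR/stereo), return the index of the
    closest 4K-frame timestamp. Both arrays are in seconds, sorted ascending."""
    idx = np.searchsorted(cam_ts, ref_ts)
    idx = np.clip(idx, 1, len(cam_ts) - 1)
    left, right = cam_ts[idx - 1], cam_ts[idx]
    # Step back by one where the left neighbour is closer in time.
    idx -= (ref_ts - left) < (right - ref_ts)
    return idx

# Hypothetical usage:
#   matched = match_nearest(ref_ts, cam_ts)
#   keeps the 4K frame whose timestamp is nearest to each reference frame.
```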
More than 25 minutes of unedited audio data were recorded on two devices: the Osmo Action 4 camera and the Tascam DR-07X audio recorder.
Fig. 7: Audio analysis visualizations using YAMNet.
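A minimal sketch of the kind of YAMNet analysis shown in Fig. 7, assuming the publicly released TensorFlow Hub model (the audio file name is hypothetical; the recording must be downmixed and resampled to 16 kHz mono first):

```python
import csv
import numpy as np
import soundfile as sf
import tensorflow_hub as hub

# Load the published YAMNet model from TensorFlow Hub.
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

# YAMNet expects mono float32 audio at 16 kHz (file name is hypothetical).
waveform, sr = sf.read("audio/session_01_16k.wav", dtype="float32")
assert sr == 16_000, "YAMNet expects 16 kHz input"

scores, embeddings, spectrogram = yamnet(waveform)

# Read the AudioSet class map shipped with the model.
with open(yamnet.class_map_path().numpy()) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

# Report the top-scoring class for each short analysis frame.
for frame, cls in enumerate(np.argmax(scores.numpy(), axis=1)):
    print(f"frame {frame}: {class_names[cls]}")
```

YAMNet emits one 521-class score vector per short analysis frame (roughly every half second), so per-frame labels can be plotted directly against the recording timeline.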
The mobile unit traversed an urban landscape, maintaining a speed of approximately 13 kilometers per hour, and selectively collected data to accurately capture the evolving aspects of pedestrian environments. Rather than labeling each individual frame, emphasis was placed on pivotal instances that held the greatest significance for subsequent analysis and segmentation tasks. Included in the dataset are sensor frames captured at intervals, providing a dense layer of data that is vital for functions such as tracking, odometry, and pinpointing locations within city contexts. This information is exceptionally valuable for progress in areas such as mobile robotics, micro-mobility solutions, and the design of urban spaces.
For the process of annotating the dataset consisting of 2,850 images, the Segment Anything Model (SAM) was utilized. SAM is distinguished by its proficiency in zero-shot learning, which greatly diminishes the necessity for labor-intensive manual labeling, thereby enhancing the efficiency and scalability of segmentation processes. This approach simplifies the data preparation phase and is directed at boosting the adaptability of computer vision systems when it comes to the examination of urban infrastructure.
Fig. 8: Examples of sidewalk and pedestrian route segmentation.
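A minimal sketch of prompt-free mask generation with SAM, assuming Meta's segment-anything package and the public ViT-H checkpoint (the frame path is hypothetical, and this is illustrative rather than the exact annotation pipeline):

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load a SAM checkpoint (ViT-H variant; checkpoint path is hypothetical).
sam = sam_model_registry["vit_h"](checkpoint="checkpoints/sam_vit_h_4b8939.pth")
sam.to("cuda")

# Generate masks for one RGB frame without any manual prompts.
generator = SamAutomaticMaskGenerator(sam)
image = cv2.cvtColor(cv2.imread("frames/frame_00001.png"), cv2.COLOR_BGR2RGB)
masks = generator.generate(image)

# Each mask dict carries a binary segmentation, its area, and quality scores.
masks.sort(key=lambda m: m["area"], reverse=True)
print(len(masks), "candidate masks; largest area:", masks[0]["area"])
```

The generated masks are class-agnostic, so sidewalk and pedestrian-route masks still have to be selected, and where needed corrected, by an annotator, consistent with the semi-automatic workflow described above.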