Atomic Files¶
The following types of atomic files are defined:
filename | content | example |
---|---|---|
xxx.geo | Store geographic entity attribute information. | geo_id, type, coordinates |
xxx.usr | Store traffic user information. | usr_id, gender, birth_date |
xxx.rel | Store the relationship information between entities, such as road networks. | rel_id, type, origin_id, destination_id |
xxx.dyna | Store traffic condition information. | dyna_id, type, time, entity_id, location_id |
xxx.ext | Store external information, such as weather, temperature, etc. | ext_id, time, properties |
config.json | Used to supplement the description of the above table information. |
Note: For different traffic prediction tasks, different atomic files may be used, and a dataset may not contain all six kinds of atomic files.
The format of .geo
, .usr
, .rel
, .dyna
, and .ext
is similar to the csv
file, which consists of multiple columns of data.
Geo Table¶
An element in the Geo table consists of the following four parts:
geo_id, type, coordinates, properties (multiple columns).
geo_id: The primary key uniquely determines a geo entity.
type: The type of geo. Range in [
Point
,LineString
,Polygon
]. These three values are consistent with the points, lines and planes in Geojson.coordinates: Array or nested array composed of float type. Describe the location information of the geo entity, using the coordinates format of Geojson.
properties: Describe the attribute information of the geo entity. If there are multiple attributes, you can use different column names to define multiple columns of data, such as
POI_name
,POI_type
. [For grid data, there must be two columnsrow_id
andcolumn_id
which represent the row and column numbers of the grid. ]
Note: Geojson’s coordinates format: (Longitude first, latitude second)
Point: [102.0, 0.5]
LineString: [ [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0] ]
Polygon: [[ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ]]
Usr Table¶
An element in the Usr table consists of the following two parts:
usr_id, properties (multiple columns).
usr_id: The primary key uniquely determines a usr entity.
properties: Describe the attribute information of the usr entity. If there are multiple attributes, different column names can be used to define multiple columns of data, such as
gender
,birth_date
.
Rel Table¶
An element in the Rel table consists of the following four parts:
rel_id, type, origin_id, destination_id, properties (multiple columns).
rel_id: The primary key uniquely determines the relationship between entities.
type: The type of rel. Range in [
usr
,geo
], which indicates whether the relationship is based ongeo
orusr
.origin_id: The ID of the origin of the relationship, which is either in the Geo table or in the Usr table.
destination_id: The ID of the destination of the relationship, which is one of the Geo table or the Usr table.
properties: Describe the attribute information of the relationship. If there are multiple attributes, different column names can be used to define multiple columns of data.
Dyna Table¶
An element in the Dyna table consists of the following five parts:
dyna_id, type, time, entity_id(multiple columns), properties(multiple columns).
Introduction to each column¶
dyna_id: The primary key uniquely determines a record in the Dyna table.
type: The type of dyna. There are two values:
trajectory
(for trajectory-based task) andstate
(for traffic state prediction task).time: Time information, using the date and time combination notation in ISO-8601 standard, such as:
2020-12-07T02:59:46Z
.entity_id: Describe which entity the record is based on, which is the ID of
geo
orusr
.properties: Describe the attribute information of the record. If there are multiple attributes, different column names can be used to define multiple columns of data, such as both speed data and flow data.
Detailed description¶
For traffic state prediction tasks:
The format is: dyna_id, state, time, entity_id, properties,entity_id may have different changes:
For entities that can use one-dimensional numbering for sensors, road sections, areas, etc., this column is the corresponding ID, the column name is [entity_id], and the file suffix name is
.dyna
.For grid-based traffic data, the column name is [row_id, column_id], and the file extension is
.grid
.For od-based traffic data, the column name is [origin_id, destination_id], and the file suffix name is
.od
.For grid-od-based traffic data, the column name is [origin_row_id, origin_column_id, destination_row_id, destination_column_id], and the file extension is
.gridod
.
For trajectory-based tasks:
The trajectory data includes GPS point trajectory, road segment-based trajectory (after map matching), user check-in trajectory (POI-based trajectory).
GPS point trajectory
The format is: dyna_id, type, time, entity_id, (traj_id), coordinates, properties.
The content of the entity_id column should be usr_id, traj_id represents the number of multiple trajectories of the same user (starting from 0) and if the user has only one trajectory this column can be empty, the content of the coordinates column is the latitude and longitude of the GPS point.
Road segment-based trajectory
The format is: dyna_id, type, time, entity_id, (traj_id), location, properties.
The content of the entity_id column should be usr_id, traj_id represents the number of multiple trajectories of the same user (starting from 0) and if the user has only one trajectory this column can be empty, the content of the location column is geo_id, which refs to the geo table and represents a road segment.
POI-based trajectory
The format is: dyna_id, type, time, entity_id, (traj_id), location, properties.
The content of the entity_id column should be usr_id, traj_id represents the number of multiple trajectories of the same user (starting from 0) and if the user has only one trajectory this column can be empty, the content of the location column is geo_id, which refs to the geo table and represents a POI.
For a specific task, the input of the trajectory next-location prediction task is based on the trajectory of the POI point, the input of the estimated time of arrival task is the trajectory of the GPS point or the trajectory based on the road segment, the input of the map matching task is the trajectory of the GPS point, and the output is the trajectory based on the road segment.
Data arrangement method¶
Dyna table should be arranged according to the double keywords of (entity_id) and (time), that is, records with the same (entity_id) are put together and sorted according to (time).
Specially, for trajectory data, the trajectories of the same user (entity_id) should be sorted by (traj_id) first, and those with the same (traj_id) should be sorted by (time).
E.g:
dyna_id,type,time,entity_id,traffic_speed
0,state,2012-03-01T00:00:00Z,773869,64.375
1,state,2012-03-01T00:05:00Z,773869,62.666666666666664
2,state,2012-03-01T00:10:00Z,773869,64.0
...
34270,state,2012-06-27T23:50:00Z,773869,66.75
34271,state,2012-06-27T23:55:00Z,773869,65.11111111111111
34272,state,2012-03-01T00:00:00Z,767541,67.625
34273,state,2012-03-01T00:05:00Z,767541,68.55555555555556
34274,state,2012-03-01T00:10:00Z,767541,63.75
...
68542,state,2012-06-27T23:50:00Z,767541,62.25
68543,state,2012-06-27T23:55:00Z,767541,66.88888888888889
68544,state,2012-03-01T00:00:00Z,767542,67.125
68545,state,2012-03-01T00:05:00Z,767542,65.44444444444444
...
dyna_id,type,time,entity_id,traj_id,coordinates,current_dis,current_state
0,trajectory,2014-08-03T18:29:00Z,810,0,"[104.115353,30.64392]",0.0,1.0
1,trajectory,2014-08-03T18:29:40Z,810,0,"[104.113091,30.642129]",0.294091467,1.0
2,trajectory,2014-08-03T18:30:20Z,810,0,"[104.110404,30.64393]",0.6199505718,1.0
3,trajectory,2014-08-03T18:31:00Z,810,0,"[104.108335,30.640667]",1.0332595142,1.0
...
21,trajectory,2014-08-03T18:53:23Z,810,0,"[104.076552,30.626844]",7.0811683597,0.0
22,trajectory,2014-08-03T18:13:00Z,810,1,"[104.106701,30.6916]",0.0,1.0
23,trajectory,2014-08-03T18:13:30Z,810,1,"[104.103889,30.688151]",0.4683813603,1.0
24,trajectory,2014-08-03T18:14:21Z,810,1,"[104.102828,30.68506]",0.8267467429,1.0
Ext Table¶
An element in the Ext table consists of the following three parts:
ext_id, time, properties (multiple columns).
ext_id: The primary key uniquely determines a record in the external data table.
time: Time information, using the date and time combination notation in ISO-8601 standard, such as:
2020- 12-07T02:59:46Z
.properties: Describe the attribute information of the record. If there are multiple attributes, different column names can be used to define multiple columns of data, such as both temperature data and humidity data.
Data Type Definition¶
The data type definition of each column in the dataset needs to be given in the config file, which is helpful for subsequent data processing.
Type | Description |
---|---|
geo_id | Discrete limited IDs that exist in the Geo table. |
usr_id | Discrete limited IDs that exist in the Usr table. |
rel_id | Discrete limited IDs that exist in the Rel table. |
time | Time string conforming to ISO-8601 standard. |
coordinate | String conforming to the coordinate representation in geojosn format. |
num | Real number. |
enum | Enum string. |
other | The rest are stored in string type. |
Config File¶
The config file is used to supplement the information describing the above five tables themselves. It is stored in json
format and consists of six keys: geo
, usr
, rel
, dyna
, ext
, and info
.
For
geo
,rel
,dyna
:Contains a key of
including_types
, and uses an array to describe thetype
values in the table. After that, eachtype
is used as a key, describing which keys are contained in theproperties
table and their data types under thetype
.For
usr
,ext
:Contains a
properties
key, describing which keys are contained in theproperties
table and their data types.For
info
:Contains other necessary statistical information of the dataset, for different traffic prediction tasks, contains different content.
For traffic state prediction task:
geo_file
: The file name of the.geo
file, string type, the default is the dataset name.rel_file
: The file name of the.rel
file, string type, the default is the dataset name.data_files
: The file name of the data file (such as.dyna
,.grid
,.gridod
), array or string type, the default is the dataset name.ext_file
: The file name of the.ext
file, string type, the default is the dataset name.weight_col
: The column name loaded from the.rel
file, string array with only one element or a string type. When not specified, if the.rel
file has only one column of weight columns, there is no problem, otherwise an error will be reported.data_col
: The column names loaded from the data files(such as.dyna
,.grid
,.gridod
), array or string type, load all columns if not specified.ext_col
: The column names loaded from the external file, array or string type, load all columns if not specified.output_dim
: Specify the dimensions of the model output, generally should be the same as the number of attribute columns specified indata_col
.time_intervals
: The length of the data set time slice, in seconds.init_weight_inf_or_zero
: Range in [inf
,zero
]. When loading the.rel
file to construct the adjacency matrix, the initial adjacency matrix is all INF (inf
) or all 0 (zero
), and the default isinf
.set_weight_link_or_dist
: Range in [link
,dist
], when loading the.rel
file to construct the adjacency matrix, use the original value in the weight column in the file (dist
) or revise it to a matrix of all 01 (link
), and the default isdist
.calculate_weight_adj
: Whether the weight of the adjacency matrix obtained from the.rel
file needs to be further calculated, default toFalse
. Some adjacency matrices are calculated based on the original data. The current calculation method is Gaussian kernel method with threshold: $$w_{ij} = \exp \left(- \frac{d_{ij}^{2}}{\sigma^{2}}\right)$$.weight_adj_epsilon
: The threshold of the Gaussian kernel. If the calculated weight is less than the threshold, it becomes 0, that is, $$ w_{ij}[w_{ij}<\epsilon]=0$$. This parameter depends on the parametercalculate_weight_adj=True
.
For trajectory next-location prediction task:
distance_upper
: The maximum distance between POI points.
Example:
{
"geo":{
"including_types":[
"Point"
],
"Point":{
"poi_name":"other",
}
},
"usr":{
"properties":{
"user_type":"enum",
"birth_year":"time",
"gender":"enum"
}
},
"rel":{
"including_types":[
"geo"
],
"geo":{
"link_weight":"num"
}
},
"dyna":{
"including_types":[
"state"
],
"state":{
"entity_id":"geo_id",
"traffic_speed":"num"
}
},
"dyna":{
"including_types":[
"state"
],
"state":{
"entity_id":"geo_id",
"traffic_speed":"num"
}
},
"ext":{
"properties":{
"temperature":"num"
}
},
"grid":{
"including_types":[
"state"
],
"state":{
"row_id": 15,
"column_id": 5,
"traffic_speed":"num"
}
},
"od":{
"including_types":[
"state"
],
"state":{
"origin_id":"geo_id",
"destination_id":"geo_id",
"traffic_speed":"num"
}
},
"gridod":{
"including_types":[
"state"
],
"state":{
"origin_row_id": 15,
"origin_column_id": 5,
"destination_row_id": 15,
"destination_column_id": 5,
"traffic_speed":"num"
}
},
"info": {
"time_interval": 300,
}
}