BEST-WEB-TOOLS Blog

My own blog posts about development, tech, finance and other (interesting) stuff.


2022-07-17, Dev, Testing, PHP, Development, FakerPHP, OSINT, Sock puppet, Data generation

Generating Real Looking Test Data with FakerPHP

FakerPHP does a very decent job of creating test data. But some parts could be improved if you would like even more natural looking fake data for your testing needs.

What can we improve?

I played around a bit with data and APIs that are available online, and generated real looking phone numbers and addresses for my /fake-data online utility.

The generated fake data is also suitable as stuffing for a sock puppet in your next OSINT project. You can easily create valid looking persons without making up names and addresses yourself, just with FakerPHP. Sock puppet is the term OSINT investigators use for the fake accounts they use when looking things up on social media or websites.

You can install FakerPHP with Composer, and the source code is on GitHub. FakerPHP requires a PHP version >= 7.1. The class you typically import is: Faker\Factory

  composer require fakerphp/faker

Better Username And Mail Address

The /fake-data util uses a lot of things directly from FakerPHP, like names, genders, passwords, domains and some others. The first thing I did to make the fake data more consistent was to combine the names, usernames and mail addresses: create a valid looking username from the first name and surname, and use it for the mail address.

Let's start with creating a faker object with a random locale. Faker supports a lot of different locales from Arabic to Chinese. You can find a list in the faker documentation.

    use Faker\Factory;

    $locales = ['de_AT','en_GB'];
    $random = array_rand($locales);
    $faker = Factory::create($locales[$random]);

Depending on the locale you get different results for common names in that region. You also get states, cities and other fake data, depending on what the Faker developers have already implemented for that locale. You can find more information about each locale in the source code.

Beware: If there is no implementation in the selected locale, FakerPHP will fall back to en_US. That means if a locale has not implemented the Person provider, you will get English looking names and not names from the region.

    $gender = $faker->randomElement(['male', 'female']);
    $firstname = $faker->firstName($gender);
    $lastname = $faker->lastName();

    // female, Maja, Sailer 

The next snippet creates a few variants of typical usernames: the full first name and last name, or just the first letter of the first name combined with the last name, separated by a dot, a dash or nothing. FakerPHP has a userName() method for that, but sadly it does not accept a first name or surname as arguments and would give you a username completely unrelated to your generated name. So we have to do it ourselves like this:

    // mb_* functions keep multi-byte first letters (e.g. "Ö") intact
    $initial = mb_substr($firstname, 0, 1);

    $username = mb_strtolower($faker->randomElement([
        "$firstname.$lastname",
        "$initial.$lastname",
        "$initial-$lastname",
        $initial.$lastname
    ]));

    // m.hübner

    $username = iconv("utf-8","ascii//TRANSLIT", $username);
    $username = preg_replace('/[^a-z\.\-@]/','', $username);
    
    // m.hubner

In some locales the names have special characters, like German umlauts. If you don't want usernames with special characters, you can get rid of them with iconv and preg_replace like in the example above.
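
What iconv produces with //TRANSLIT depends on the platform and the active locale, so "ü" may not become "u" everywhere. A minimal sketch of a more predictable variant with an explicit character map follows. The function name asciiUsername and the map are assumptions for illustration and not part of FakerPHP, and the mbstring extension is assumed to be available.

```php
// Hypothetical helper, not part of FakerPHP: lowercases a name-based
// username and reduces it to the characters a-z, dot and dash.
function asciiUsername(string $username): string
{
    // Small illustrative map; extend it for the locales you use.
    $map = [
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss',
        'é' => 'e', 'è' => 'e', 'á' => 'a', 'ñ' => 'n',
    ];

    // mb_strtolower() also lowercases non-ASCII letters such as "Ü".
    $username = strtr(mb_strtolower($username, 'UTF-8'), $map);

    // Drop everything that is still outside the allowed set.
    return preg_replace('/[^a-z.\-]/', '', $username);
}

echo asciiUsername('M.Hübner'), PHP_EOL; // m.hubner
```

Characters that are missing from the map are simply removed by the final preg_replace, so the result is always plain ASCII.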

Typically, the email address is pretty much a combination of username and domain name. So we just have to create a domain and combine it with the name-based variants.


    $domain = $faker->domainName();

    $initial = mb_substr($firstname, 0, 1);

    // Lowercase first: the regex below removes everything except a-z . - @
    $email = mb_strtolower($faker->randomElement([
        "$firstname.$lastname@$domain",
        "$initial.$lastname@$domain",
        "$initial-$lastname@$domain"
    ]));

    $email = iconv("utf-8","ascii//TRANSLIT", $email);
    $email = preg_replace('/[^a-z\.\-@]/','', $email);

    // m.hubner@koberl.com

In the end you now have a name, username and email address that look valid at first sight. Next up are country, city and street address.

Generate a Valid Country, City And Address

FakerPHP gives you pretty random city and street names by combining typical words that could look valid. For example, by combining "north", "port" and a first name you get "North Oscar Port". But there is no guarantee that this city really exists.

To improve your cities and addresses, you could get all valid cities by downloading data from GeoNames. There is a CSV called cities1000.zip (8MB) with all cities in the world with a population greater than 1000 people. There is also a JSON version on GitHub.

The country code in the JSON is the ISO 3166-1 alpha-2 country code. You can find all codes and even more country data in the annexare/Countries dataset on GitHub. Is it slow? Hmmh. It is not fast. There are 140k city names in this dataset and the unzipped file is 13MB in size, so it takes some milliseconds to parse the JSON and build the array. But as long as you don't do it in a loop, it should be good enough.
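
If you prefer the original GeoNames download over the JSON version, each line can be split by hand. The sketch below assumes the tab-separated column order described in the GeoNames readme (name at index 1, latitude at 4, longitude at 5, country code at 8, population at 14). The function name parseGeonamesLine, the 'name' and 'population' keys and the sample line with its made-up id and population are assumptions for illustration.

```php
// Hypothetical helper: turns one tab-separated line of a GeoNames
// cities dump into an array with country, lat and lng keys.
function parseGeonamesLine(string $line): ?array
{
    $cols = explode("\t", rtrim($line, "\r\n"));

    // The dump has 19 columns; skip anything that looks malformed.
    if (count($cols) < 19) {
        return null;
    }

    return [
        'name'       => $cols[1],
        'lat'        => $cols[4],
        'lng'        => $cols[5],
        'country'    => $cols[8],
        'population' => (int) $cols[14],
    ];
}

// Illustrative sample line (id and population are made up).
$line = implode("\t", [
    '1', 'Wuppertal', 'Wuppertal', '', '51.25627', '7.14816', 'P', 'PPL',
    'DE', '', '07', '', '', '', '100000', '', '158', 'Europe/Berlin', '2022-01-01',
]);

print_r(parseGeonamesLine($line));
```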

    $countryCities = [];
    $all = json_decode(file_get_contents('cities.json'), true);
    
    $randomCountry = $faker->randomElement(['FR','AT','DE']);
    
    foreach($all as $city) {
        if($city['country'] === $randomCountry) { 
            $countryCities[] = $city;
        }
    }
    
    $randomCity = $faker->randomElement($countryCities);
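
If you need more than one random city per request, scanning all entries every time adds up. A small sketch of grouping the list once by country code follows. It assumes every entry has a 'country' key as in the loop above; the function name groupCitiesByCountry and the tiny sample list are made up for illustration.

```php
// Hypothetical helper: groups the decoded city list by country code,
// so picking a city for a country no longer needs a full scan.
function groupCitiesByCountry(array $cities): array
{
    $byCountry = [];

    foreach ($cities as $city) {
        $byCountry[$city['country']][] = $city;
    }

    return $byCountry;
}

// Tiny stand-in for json_decode(file_get_contents('cities.json'), true).
$all = [
    ['country' => 'AT', 'name' => 'Wien'],
    ['country' => 'AT', 'name' => 'Graz'],
    ['country' => 'DE', 'name' => 'Wuppertal'],
];

$byCountry = groupCitiesByCountry($all);

// Pick a random Austrian city, or null if the country has no entries.
$cities = $byCountry['AT'] ?? [];
$randomCity = $cities ? $cities[array_rand($cities)] : null;

echo $randomCity['name'], PHP_EOL; // Wien or Graz
```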

Getting a valid street address is a bit more tricky. I use the Nominatim API for it. This is a free service by OpenStreetMap, and it lets you find an address by latitude and longitude coordinates. The city data contains the lat and lng of the city center, and it is pretty safe to assume that there is also a street in the center of a city.

With the data from Nominatim you will get a real, existing address on building level. The zoom level 18 in the URL is important to get the street and building number.

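    // Use the randomly picked city, not the loop variable $city
    $url = "https://nominatim.openstreetmap.org/reverse?lat={$randomCity['lat']}&lon={$randomCity['lng']}&zoom=18&format=jsonv2";

    // Placeholder User-Agent; use one that identifies your application
    $response = Http::withHeaders(['User-Agent' => 'my-fake-data-tool/1.0'])->get($url);
    $result = json_decode($response->body(), true);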

I am using the Laravel Http library here (use Illuminate\Support\Facades\Http;), but you can use any other HTTP client library you want (Guzzle, ...). The only important thing is that the request needs to have a User-Agent header!

That gives us a full address with all the details like postcode, street and house number, which someone can look up and will find in Google Maps.
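
Without Laravel, the same request can be sketched with PHP's built-in stream functions. The helper names nominatimReverseUrl and fetchJson as well as the User-Agent string are placeholders for illustration; the actual network call is left commented out.

```php
// Hypothetical helper: builds the reverse-geocoding URL from coordinates.
function nominatimReverseUrl(string $lat, string $lng, int $zoom = 18): string
{
    return 'https://nominatim.openstreetmap.org/reverse?' . http_build_query([
        'lat'    => $lat,
        'lon'    => $lng,
        'zoom'   => $zoom,
        'format' => 'jsonv2',
    ]);
}

// Hypothetical helper: fetches a URL with a User-Agent header and
// returns the decoded JSON, or null if the request or decoding fails.
function fetchJson(string $url, string $userAgent): ?array
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => "User-Agent: $userAgent\r\n",
            'timeout' => 10,
        ],
    ]);

    $body = @file_get_contents($url, false, $context);
    $data = $body === false ? null : json_decode($body, true);

    return is_array($data) ? $data : null;
}

$url = nominatimReverseUrl('51.25627575', '7.148131197444216');
echo $url, PHP_EOL;

// Placeholder User-Agent; use something that identifies your application.
// $result = fetchJson($url, 'my-fake-data-tool/1.0');
```

Building the query with http_build_query also takes care of URL encoding, which matters once the parameters are no longer plain numbers.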

{
    "place_id": 171174028,
    "licence": "Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright",
    "osm_type": "way",
    "osm_id": 265733241,
    "lat": "51.25627575",
    "lon": "7.148131197444216",
    "place_rank": 30,
    "category": "building",
    "type": "yes",
    "importance": 0,
    "addresstype": "building",
    "name": null,
    "display_name": "5, Mühlenschütt, Elberfeld, Gemarkung Elberfeld, Wuppertal, Nordrhein-Westfalen, 42103, Deutschland",
    "address": {
        "house_number": "5",
        "road": "Mühlenschütt",
        "suburb": "Elberfeld",
        "city": "Wuppertal",
        "state": "Nordrhein-Westfalen",
        "ISO3166-2-lvl4": "DE-NW",
        "postcode": "42103",
        "country": "Deutschland",
        "country_code": "de"
    },
    "boundingbox": [
        "51.256247",
        "51.2563135",
        "7.1480698",
        "7.1481927"
    ]
}
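
From a response like this you usually only need a handful of fields. The following sketch picks them out of the decoded array. The function name streetAddress and the shape of the returned array are assumptions, and the fallback to "town" and "village" reflects that Nominatim does not report every place as a "city".

```php
// Hypothetical helper: picks the postal fields out of a decoded
// Nominatim reverse-geocoding response.
function streetAddress(array $result): array
{
    $address = $result['address'] ?? [];

    return [
        'street'   => trim(($address['road'] ?? '') . ' ' . ($address['house_number'] ?? '')),
        'postcode' => $address['postcode'] ?? '',
        // Smaller places are reported as "town" or "village" instead of "city".
        'city'     => $address['city'] ?? $address['town'] ?? $address['village'] ?? '',
        'country'  => strtoupper($address['country_code'] ?? ''),
    ];
}

// Shortened version of the response shown above.
$result = [
    'address' => [
        'house_number' => '5',
        'road'         => 'Mühlenschütt',
        'city'         => 'Wuppertal',
        'postcode'     => '42103',
        'country_code' => 'de',
    ],
];

print_r(streetAddress($result));
```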

As an alternative, there is also an API from GeoNames for street name autocomplete (called "Find nearest Address"). It does pretty much the same reverse geocoding as the OpenStreetMap API.

Valid Looking Phone Numbers

Phone numbers are the end boss of all data: not very standardized, but standardized enough to be complicated. FakerPHP phone numbers are quite random, without any relation to a country or anything else. On Wikipedia you can find a list of country codes, and also some valid number prefixes and typical lengths for a lot of countries.

Then you just build variants from the country calling code, the vendor prefix and the right number length for that vendor. That is good enough for a first impression of the number. The only complicated thing is maintaining the list of vendors for all the countries you want to support.


// Calling codes per country (assumed helper; keep in sync with $vendorPrefixes)
$countryPrefixes = [ 'AT' => '+43' ];

$vendorPrefixes = [
    'AT' => [
        '650' => [ 'name' => 'T-Mobile Austria GmbH (telering)', 'length' => 10 ],
        // ...
        '699' => [ 'name' => 'Hutchison 3G Austria GmbH (drei)', 'length' => 11 ],
    ],
    // ...
];

$phoneNumber = null;

if(! empty($vendorPrefixes[$randomCountry])) {
    // array_rand() returns numeric keys as integers, so cast back to string
    $vendorPrefix = (string) array_rand($vendorPrefixes[$randomCountry]);
    $phoneLength = $vendorPrefixes[$randomCountry][$vendorPrefix]['length'];
    $phoneTelco = $vendorPrefixes[$randomCountry][$vendorPrefix]['name'];

    // Fill placeholder digits in the prefix ('#' or 'x')
    $vendorPrefix = $faker->numerify(str_replace('x', '#', $vendorPrefix));

    $phoneNumber = sprintf('%s %s %s',
        $countryPrefixes[$randomCountry],
        $vendorPrefix,
        $faker->numerify(str_repeat('#', $phoneLength - strlen($vendorPrefix)))
    );
}
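
To try the number format without Faker, the same idea can be sketched as a plain function. The name fakePhoneNumber is hypothetical, and $length is assumed to count the digits of vendor prefix and subscriber number together, matching how 'length' is used above.

```php
// Hypothetical helper: builds "<calling code> <vendor prefix> <subscriber number>".
// $length is the digit count of vendor prefix plus subscriber number.
function fakePhoneNumber(string $countryPrefix, string $vendorPrefix, int $length): string
{
    $digits = max(0, $length - strlen($vendorPrefix));
    $subscriber = '';

    // Append one random digit at a time until the length is reached.
    for ($i = 0; $i < $digits; $i++) {
        $subscriber .= random_int(0, 9);
    }

    return sprintf('%s %s %s', $countryPrefix, $vendorPrefix, $subscriber);
}

echo fakePhoneNumber('+43', '650', 10), PHP_EOL; // e.g. +43 650 1234567
```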

Zodiac Sign For Birthdate

One more thing: zodiac signs are missing in FakerPHP. Here is a function that gives you the matching zodiac sign for a birthdate in Y-m-d format. Just in case you need a sock puppet on a dating platform ;)


public static function zodiac(string $dob): string {
    $zodiac = '';

    list ($year, $month, $day) = explode('-', $dob);

    if ( ( $month == 3 && $day > 20 ) || ( $month == 4 && $day < 20 ) ) { $zodiac = "Aries"; }
    elseif ( ( $month == 4 && $day > 19 ) || ( $month == 5 && $day < 21 ) ) { $zodiac = "Taurus"; }
    elseif ( ( $month == 5 && $day > 20 ) || ( $month == 6 && $day < 21 ) ) { $zodiac = "Gemini"; }
    elseif ( ( $month == 6 && $day > 20 ) || ( $month == 7 && $day < 23 ) ) { $zodiac = "Cancer"; }
    elseif ( ( $month == 7 && $day > 22 ) || ( $month == 8 && $day < 23 ) ) { $zodiac = "Leo"; }
    elseif ( ( $month == 8 && $day > 22 ) || ( $month == 9 && $day < 23 ) ) { $zodiac = "Virgo"; }
    elseif ( ( $month == 9 && $day > 22 ) || ( $month == 10 && $day < 23 ) ) { $zodiac = "Libra"; }
    elseif ( ( $month == 10 && $day > 22 ) || ( $month == 11 && $day < 22 ) ) { $zodiac = "Scorpio"; }
    elseif ( ( $month == 11 && $day > 21 ) || ( $month == 12 && $day < 22 ) ) { $zodiac = "Sagittarius"; }
    elseif ( ( $month == 12 && $day > 21 ) || ( $month == 1 && $day < 20 ) ) { $zodiac = "Capricorn"; }
    elseif ( ( $month == 1 && $day > 19 ) || ( $month == 2 && $day < 19 ) ) { $zodiac = "Aquarius"; }
    elseif ( ( $month == 2 && $day > 18 ) || ( $month == 3 && $day < 21 ) ) { $zodiac = "Pisces"; }

    return $zodiac;
}
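
The method above has to live inside a class. For a quick check on the command line, here is a table-driven sketch with the same date boundaries. The function name zodiacSign is hypothetical and the input is assumed to be a valid Y-m-d date string.

```php
// Hypothetical standalone variant: each entry holds the last day
// (month, day) on which the sign still applies, in calendar order.
function zodiacSign(string $dob): string
{
    [, $month, $day] = array_map('intval', explode('-', $dob));

    $lastDays = [
        [1, 19, 'Capricorn'], [2, 18, 'Aquarius'], [3, 20, 'Pisces'],
        [4, 19, 'Aries'], [5, 20, 'Taurus'], [6, 20, 'Gemini'],
        [7, 22, 'Cancer'], [8, 22, 'Leo'], [9, 22, 'Virgo'],
        [10, 22, 'Libra'], [11, 21, 'Scorpio'], [12, 21, 'Sagittarius'],
    ];

    foreach ($lastDays as [$m, $d, $sign]) {
        if ($month < $m || ($month === $m && $day <= $d)) {
            return $sign;
        }
    }

    // 22 to 31 December wraps around to Capricorn.
    return 'Capricorn';
}

echo zodiacSign('1990-03-21'), PHP_EOL; // Aries
```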

With these simple tricks you can easily improve the look of your fake data. You can try it out on /fake-data.

